What is a Machine Learning Algorithm?
An algorithm is a set of step-by-step instructions given to a computer to carry out discrete tasks. Programmers can write algorithms to do mundane things like auto-save a document every 30 seconds while a word processing application is open. However, this guide is primarily concerned with algorithms that carry out machine learning (ML) tasks.
You can think of an ML algorithm like a recipe, where the ingredients are data, and the dish is a model that can make predictions or inferences about the future. For example, video recommendation systems, like the one that powers Netflix, take in data about your preferences, location, shows you’ve watched in the past, which videos you’ve exited before finishing, and what other people who’ve watched the same videos as you have also streamed. The program then uses this information (and perhaps much more) to predict what you're most likely to enjoy watching next.
Some of the steps necessary to prepare the "dish" include sensibly organizing the data, accounting for the scientist's prior beliefs about what the data should look like, computing the relationships among observations, deriving a mathematical equation that represents those relationships, and updating that equation as more data are integrated. This process is known as "training." The result is a model that can make predictions about previously unobserved data. Netflix then serves you, the end-user, your helping of predictions as options on your home screen, likely prioritizing those that benefit the company most (e.g., videos produced by Netflix or those that are most profitable for them to stream).
Just as any recipe can be optimized for different goals (an entree can be made healthy, flavorful, authentic to a particular cuisine, or inexpensive), so too can predictive algorithms. The default in many machine learning tasks is to optimize for predictive accuracy. To measure it, we "test" the model on prediction problems for which we already have the answer and see how often it answers correctly. Sometimes the model needs to be fine-tuned before it can perform well on these tests. Once it reaches an acceptable score (what constitutes "acceptable" is arbitrary and varies widely across projects), it is deployed and used on data "out in the wild."
Development Process of a Machine Learning System
-
In the private sector, angel investors and venture capitalists look for technologies that seem to have the opportunity for a high return on investment. They also look for teams that they believe will successfully bring the product to market, scale it, and lead an organization to a successful exit (e.g., acquisition or IPO). Within larger tech organizations (e.g., Facebook, Google), research teams are generally given broad leeway to develop new products so long as they advance the organization's goals. In the public sector or academic setting, scientists write applications for grants, fellowships, and scholarships to fund their research. In these settings, a project must align with a funding agency's goals and convincingly argue that it will advance science. While there are variants of these paths to funding, it is generally the case that scientists must, at some point, appeal to some outside body to fund their project.
-
Building a machine learning model requires data. In scientific settings, where one sets out to develop novel methods, we often use toy datasets for proof of concept. For example, scikit-learn, a popular data science library, ships with small datasets such as the Optical Recognition of Handwritten Digits dataset, which includes 1,797 images of handwritten integers (64 columns); versions before 1.2 also bundled the Boston housing prices dataset (506 rows and 13 columns). In application settings, existing data must be obtained or new data collected. The data then need to be cleaned and preprocessed to be useful for the specific task at hand. For instance, one may need to create new categories and tags, or mask particular subgroups in the data to provide appropriate levels of anonymity. Incomplete entries may be removed or imputed (i.e., filled in with estimated, plausible values). Datasets may need to be merged to see trends across contexts. If the data are exceptionally high-dimensional (i.e., have many features), one might use principal component analysis (PCA) to reduce them to a smaller set of components that capture most of the variation. It is also a common, but sometimes controversial, practice to remove outliers and extreme values. At this step, engineers must decide which variable they want to predict (the outcome) and which variables will be used to predict it. In some cases, there is no outcome variable because we do not know what the outcome is, so the goal is to see how the observations relate to one another and to infer structure from those relationships (the class of models used in these instances is called "unsupervised"). Finally, the scientist divides the data into subsets to train, validate, and test the models.
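Several of these preparation steps can be sketched with scikit-learn and its bundled digits dataset. This is a minimal illustration, not a recommended pipeline: the missing values are created artificially, and the 10% missing rate, mean imputation, 16 components, and 60/20/20 split are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Load the bundled digits data: 1,797 images, 64 pixel features each.
X, y = load_digits(return_X_y=True)

# Illustration only: hide about 10% of the values to simulate incomplete entries.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Divide the data into train (60%), validation (20%), and test (20%) subsets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_missing, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Impute missing values with per-feature means learned from the training rows only.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_val_filled = imputer.transform(X_val)
X_test_filled = imputer.transform(X_test)

# Reduce 64 features to 16 principal components, again fitted on training rows only.
pca = PCA(n_components=16)
X_train_pca = pca.fit_transform(X_train_filled)
X_val_pca = pca.transform(X_val_filled)
X_test_pca = pca.transform(X_test_filled)

print(X_train_pca.shape, X_val_pca.shape, X_test_pca.shape)
```

The imputer and the PCA are fitted on the training subset and then applied to the other two, so that nothing about the validation or test data leaks into the preparation steps.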
-
Models are mathematical representations of the patterns in the data. Our goal is to find the model that best "fits" the data so that we can make inferences about data we have not seen yet. Models can be simple, like the equation of a line (remember y = mx + b?). They can also be very complex, such as an equation that a human could not easily or concisely write down. Scientists must choose from a menu of statistical paradigms and modeling tools that suit their data and objectives. For example, data that follow a non-linear relationship call for a non-linear model. Classification tasks require different tools (e.g., logistic regression) than sequential decision-making tasks (e.g., reinforcement learning). If there is no outcome variable to predict, one will use unsupervised or semi-supervised methods instead of supervised ones. If there is substantial prior evidence about the distribution of the data, Bayesian methods may be preferable to frequentist ones. Each modeling approach comes with benefits and tradeoffs for computational cost, accuracy, over/under-fitting, scalability, interpretability, generalizability, ease of development, and ease of use.
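To make the idea of "fit" concrete, here is a small sketch that fits both a straight line and a curve to the same data and compares their errors. The data are made up for illustration (points scattered around y = x squared), and the helper name mean_squared_error is hypothetical.

```python
import numpy as np

# Made-up data for illustration: y follows a curve (x squared) plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.1, size=x.size)

def mean_squared_error(degree):
    """Fit a polynomial of the given degree and return its mean squared error."""
    coefficients = np.polyfit(x, y, degree)
    predictions = np.polyval(coefficients, x)
    return float(np.mean((y - predictions) ** 2))

line_error = mean_squared_error(1)   # straight line: y = mx + b
curve_error = mean_squared_error(2)  # curve: y = ax^2 + bx + c

print(f"line error:  {line_error:.3f}")
print(f"curve error: {curve_error:.3f}")
```

Because the underlying relationship is curved, the straight line leaves a much larger error than the quadratic does. Note that this compares error on the same data used for fitting; judging a model on unseen data is a separate step.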
-
Algorithms, either written by hand or packaged in black boxes that can be imported into your programming environment, detect the patterns in the dataset. Because computer programs are essentially a collection of equations connected by logic statements (if this, then that), most things can be written by hand. However, we often do not need to, because someone has already done that work for us. These black-box solutions consist of many lines of code that have been refined over time and packaged so that virtually anyone can import them and contribute to their development. They are relatively easy to use, somewhat flexible, and optimized for speed. Some popular machine learning libraries are scikit-learn, TensorFlow, PyMC3, PyTorch, and Keras. New, cutting-edge packages are always being built and released, and creating a library of functions that many people use is a form of social and intellectual capital.
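The contrast between writing an algorithm by hand and importing a packaged one can be sketched with the simplest case, fitting a line. The data below are made up for illustration (points scattered around y = 2x + 1); the hand-written part uses the standard closed-form least-squares formulas, and the packaged part uses scikit-learn's LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: points scattered around y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# By hand: the closed-form least-squares slope and intercept.
x_mean, y_mean = x.mean(), y.mean()
slope_by_hand = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept_by_hand = y_mean - slope_by_hand * x_mean

# Packaged: the same fit through scikit-learn, which expects a 2-D feature array.
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope_packaged = model.coef_[0]
intercept_packaged = model.intercept_

print(f"by hand:  slope={slope_by_hand:.3f}, intercept={intercept_by_hand:.3f}")
print(f"packaged: slope={slope_packaged:.3f}, intercept={intercept_packaged:.3f}")
```

Both routes should arrive at essentially the same line. The packaged version earns its keep on harder problems, where the hand-written equivalent would run to many more lines.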
-
We must test models after building them to ensure that they perform well on unseen data. For this, we use our trained model to predict the outcome variable of the reserved "validation" data. Because we already know the "ground truth" outcome for these data, we can compare our predictions against it to see how well our model predicts the true values. The models are fine-tuned until the accuracy metrics are sufficiently high and little accuracy is lost when the model moves from the training data to the validation set. Once a model seems ready for deployment, it is run once more on the subset of data reserved for testing to ensure it works as expected on never-before-seen data.
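The validate-then-test routine described here might look like the following sketch. The choice of a k-nearest-neighbors classifier, the candidate values of k, and the 60/20/20 split are assumptions made for illustration; the dataset is scikit-learn's bundled digits data.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Reserve 20% for the final test, then 25% of the remainder (20% overall) for validation.
X_work, X_test, y_work, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_work, y_work, test_size=0.25, random_state=0, stratify=y_work)

# Fine-tuning: try several settings and keep the one that scores best on validation.
best_k, best_val_accuracy = None, 0.0
for k in (1, 3, 5, 7):
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accuracy = accuracy_score(y_val, candidate.predict(X_val))
    if val_accuracy > best_val_accuracy:
        best_k, best_val_accuracy = k, val_accuracy

# Final check: score the chosen model once on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print(f"best k: {best_k}")
print(f"validation accuracy: {best_val_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The test set is touched only once, after all tuning decisions are made, so its score is a fair estimate of how the model behaves on data it has never seen.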
-
Finally, the model can be deployed in the real world. For academics, this may mean that their scientific paper is published and their code is made available in a public repository for others to use in their applications. For a company, this means that the product is made available for customers to use, whether they be other businesses, government agencies, or everyday folk on the internet. At this point you may be wondering, what does machine learning look like in the wild? The answer varies! Voice assistants (e.g., Alexa and Siri) use machine learning to translate your speech into data. Phones use machine learning to scan your face, check it against other photos of you, and then unlock your device. Social media sites use machine learning to predict which content will keep you on the app longer. Search engines use machine learning to predict which information you'll find most appealing. Credit card companies use machine learning to predict which transactions are fraudulent. Banks use machine learning to predict who is most likely to repay a loan. Militaries use machine learning to predict and prevent cyber (and other) attacks from foreign enemies and to assassinate enemies with autonomous weapons. The applications, both current and potential, are endless.