An Intro to Machine Learning for Biomedical Scientists

Machine learning and the immune system have something in common: they both learn from experience to make future predictions.



Machine learning is everywhere, and biomedical science is no exception. From the large-scale analysis of genomic data advancing personalized medicine to the solving of a 50-year-old challenge in biology by predicting protein folding from amino acid sequences, there’s no doubt machine learning is enabling breakthroughs that are shaping the future of biomedical research. But what is machine learning exactly?


Machine learning (ML), as defined by Arthur Samuel, is a “field of study that gives computers the ability to learn without being explicitly programmed”. An ML algorithm is able to learn from data. But how do we define learning for a machine? Formally, “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” [1].




To put this in a familiar context for biomedical scientists, let’s think of the immune system. Our immune system protects our bodies from outside invaders, such as bacteria, viruses, fungi, and toxins. When the innate immune response is insufficient to control an infection, the adaptive or acquired immune response, mediated by T-cells and B-cells, is activated. Let’s dive into an example.


When an antigen enters our bodies, antigen-presenting cells (APCs) detect and engulf it. They break the antigen into many different fragments and transport them to their membrane surface. Some T-cells respond to APCs by stimulating B-cells to prepare their response. B-cells then produce proteins called antibodies, which are specific for the invading pathogen and bind to its surface to mark it for destruction. After infection, a population of the pathogen-specific B-cells is maintained so the immune response is quicker and more effective in the case of re-infection.


This means our immune system has learned to recognize the antigen as pathogenic, and makes use of that information in the case of re-infection. In other words, our immune system learns from experience to get better at a task (in this case, killing pathogens). This is pretty much what machine learning does.


Given the formal definition of ML, we can define the 4 main components of any machine learning algorithm: a dataset (experience), a model (algorithm to solve the task), a cost or loss function (measuring how well the model solves the task), and an optimization algorithm (which calibrates the model to increase its performance).


Let’s look into each one in more detail:

Dataset (Experience)

ML algorithms can be understood as being allowed to experience a dataset, which is a collection of many examples. Depending on the experience they are allowed to have during the learning process, ML algorithms can be broadly categorized as unsupervised or supervised.


Unsupervised algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Typically, they aim at learning the entire probability distribution that generated a dataset, or at partitioning or clustering the dataset into homogeneous groups (i.e., groups in which samples are similar).


Supervised algorithms experience a dataset containing features, and a label or target associated with each example. Supervised algorithms aim at outputting correct labels for input examples.



Task (T)

ML enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings [1]. The process of learning itself is not the task. Learning is our means to attaining the ability to perform the task. Task and dataset are highly interconnected. One could argue that the kind of task we can set up will depend on the dataset or, conversely, that the kind of task we want to tackle will determine the dataset we need to use or collect.


For example, if our starting point is the task of predicting whether breast cancer patients will respond well to chemotherapy treatment based on the mutational status of the BRCA gene, we need a dataset of patients with wild-type and mutated BRCA gene (the feature), and their outcomes to chemotherapy treatment (the target or label). But if our starting point is a dataset of breast cancer patients, with features (such as the BRCA mutational status, age, ethnicity, stage, etc.) and a chemotherapy response label, we could use a supervised approach (like the one we introduced previously), or we could use an unsupervised approach, analyzing only the patient features to find feature-driven clusters (groups). With the latter approach the machine could discover, for example, that all BRCA wild-type patients are younger than 50 years old (hypothetically).
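
The two approaches above can be sketched on a toy dataset. This is only an illustration: the patient records are invented, the supervised "model" is a simple majority vote per BRCA status, and the unsupervised part is a tiny 1-D k-means on age with two clusters. Neither is meant as a realistic clinical model.

```python
from collections import Counter, defaultdict

# Hypothetical patients. All values are invented for illustration only.
patients = [
    {"brca_mutated": True,  "age": 61, "responded": True},
    {"brca_mutated": True,  "age": 58, "responded": True},
    {"brca_mutated": True,  "age": 66, "responded": False},
    {"brca_mutated": False, "age": 42, "responded": False},
    {"brca_mutated": False, "age": 47, "responded": False},
    {"brca_mutated": False, "age": 39, "responded": True},
]


def fit_majority_classifier(examples):
    """Supervised: learn the most common label for each BRCA status."""
    votes = defaultdict(Counter)
    for p in examples:
        votes[p["brca_mutated"]][p["responded"]] += 1
    return {status: counts.most_common(1)[0][0] for status, counts in votes.items()}


def cluster_by_age(examples, n_iter=10):
    """Unsupervised: 1-D k-means (k=2) on age only; labels are never used."""
    ages = [p["age"] for p in examples]
    centers = [min(ages), max(ages)]  # start from the two extreme ages
    for _ in range(n_iter):
        groups = [[], []]
        for a in ages:
            nearest = 0 if abs(a - centers[0]) <= abs(a - centers[1]) else 1
            groups[nearest].append(a)
        centers = [sum(g) / len(g) for g in groups]
    return groups


classifier = fit_majority_classifier(patients)   # maps BRCA status -> predicted response
younger, older = cluster_by_age(patients)        # two age-driven groups
print(classifier)
print(sorted(younger), sorted(older))
```

In this invented dataset the two age clusters happen to coincide with BRCA status, which is the kind of structure an unsupervised approach could surface without ever seeing the response label.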


The first task corresponds to classification (identifying which class k the input belongs to) and the second is clustering (identifying groups of similar examples in the dataset). Examples of other tasks include:

  • Regression. The algorithm is asked to predict a numerical value from the input. For example, predicting cell line response to drug treatment.

  • Machine translation. Transform a sequence of symbols in one language into a sequence of symbols in a different language. For example, translating from English to Spanish.

  • Anomaly detection. Detect anomalous examples in a given dataset. For example, detecting fraudulent bank transactions.


Performance measure (P)

To evaluate our model, we must design a quantitative measure of its performance (which we call the cost or loss function). For classification, performance is often measured using accuracy (the proportion of examples for which the model produces the correct output) or, equivalently, the error rate (the proportion of examples for which the model produces the wrong output). The error rate is often referred to as the expected 0–1 loss: the loss on a single example is 0 if it is classified correctly and 1 if it is not [1].
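
As a minimal sketch of these measures, the snippet below computes the 0–1 loss per example, the error rate as its average, and accuracy as one minus the error rate. The labels and predictions are made up for illustration.

```python
def zero_one_loss(true_label, predicted_label):
    """Loss for a single example: 0 if correct, 1 if incorrect."""
    return 0 if true_label == predicted_label else 1


def error_rate(labels, predictions):
    """Average 0-1 loss over all examples."""
    losses = [zero_one_loss(t, p) for t, p in zip(labels, predictions)]
    return sum(losses) / len(losses)


def accuracy(labels, predictions):
    """Proportion of correct outputs; the complement of the error rate."""
    return 1 - error_rate(labels, predictions)


# Invented example: 4 patients, 1 = responded, 0 = did not respond.
labels = [1, 0, 1, 0]
predictions = [1, 0, 0, 0]  # the third prediction is wrong

print(error_rate(labels, predictions))  # 0.25
print(accuracy(labels, predictions))    # 0.75
```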


Optimization algorithm

Machine learning algorithms are typically optimized using numerical methods to minimize the loss function, given that a minimal value of the loss function means maximal performance.
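
To make this concrete, here is a small sketch of one common numerical method, gradient descent, used to fit a straight line by minimizing the mean squared error. The choice of a linear model, the mean squared error, the learning rate, and the data points are all assumptions made for this example; the post itself does not prescribe a specific optimizer.

```python
# Invented data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse(w, b):
    """Loss function: mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(learning_rate=0.05, steps=2000):
    """Repeatedly move w and b a small step against the gradient of the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = gradient_descent()
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
print(mse(0.0, 0.0), mse(w, b))  # the loss drops from 21.0 to nearly 0
```

Here the four components are all visible: the dataset (xs, ys), the model (w * x + b), the loss function (mse), and the optimization algorithm (gradient_descent).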



How do we make sure the ML algorithm will perform well in the future?

When we build an ML algorithm, our goal is for it to be able to make predictions for new or unseen data. Going back to the immune system example, we want the immune system to recognize whether a given antigen belongs to a previously seen pathogen, so that B-cells are ready to produce antibodies against it. Or, for example, if we build an algorithm to predict the response of breast cancer patients to chemotherapy treatment, we want it to make correct predictions for new patients, not only for the ones we used for model training. This property is called generalization: the ability to perform well on previously unobserved data.


Let’s think about this for a second. When we train a model with a dataset, how do we know whether it is learning with good generalization or just memorizing the dataset? To answer this question, we should probably answer another question first: are ML algorithms capable of memorizing a dataset? The answer, in most cases, is yes: any ML algorithm with sufficient capacity (roughly, the variety of functions it is able to fit) can memorize a dataset. Let’s look at the example in the figure below. When trying to separate the 2 sets of examples by color, the task is impossible for a model with low capacity. However, a model with high capacity can easily create arbitrary boundaries to separate the sets of points.
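
As an extreme illustration of memorization, consider a hypothetical "model" that is just a lookup table: it stores every training example and returns a fixed default for anything it has not seen. The class name, the data, and the default value are all invented for this sketch.

```python
class LookupTableModel:
    """Memorizes training examples exactly; knows nothing about new inputs."""

    def __init__(self, default=0):
        self.table = {}
        self.default = default

    def fit(self, features, labels):
        for x, y in zip(features, labels):
            self.table[x] = y

    def predict(self, x):
        # Unseen inputs fall back to the default label.
        return self.table.get(x, self.default)


def accuracy(model, features, labels):
    correct = sum(model.predict(x) == y for x, y in zip(features, labels))
    return correct / len(features)


# Invented data: feature = age, label = 1 if age >= 50 else 0.
train_x, train_y = [35, 42, 55, 63], [0, 0, 1, 1]
new_x, new_y = [38, 58, 70], [0, 1, 1]

model = LookupTableModel()
model.fit(train_x, train_y)

train_accuracy = accuracy(model, train_x, train_y)  # perfect on memorized data
new_accuracy = accuracy(model, new_x, new_y)        # poor on unseen data
print(train_accuracy, new_accuracy)
```

The model scores perfectly on the data it memorized and much worse on new examples, even though the underlying rule is very simple.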


Now that we’ve answered this question, we can get back to: how do we know whether an ML model is learning with good generalization or just memorizing the dataset? We already know that any sufficiently powerful model is able to memorize the dataset, so, how do we make sure that the model will perform well with unseen data? There are 2 major practices used to tackle this.


The first one consists of designing the model training procedure so as to mimic a real-life scenario. This means splitting (i.e., partitioning) the dataset into training, validation and test sets. The algorithm has access to the training and validation sets during training, and the final model performance on the test set is used as a proxy for the performance on new, unseen data.
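
A minimal sketch of such a split is shown below. The function name, the 60/20/20 proportions, and the fixed random seed are assumptions chosen for the example, not a recommendation.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples, then cut them into three disjoint sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Ten hypothetical patient IDs.
train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

The test set is put aside and only used once, after training is finished.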


In practice, the model learns by updating its internal parameter values so that its predictions get better after each training iteration. In each iteration, the model learns from the training set, and its performance is evaluated on the validation set. The best model state is the one with the highest performance on the validation set. After training, the final model performance on the test set gives an indication of how well it will perform on new data.
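
The loop described above could look roughly like this. It is a sketch under several assumptions: a one-parameter model y = w * x, mean squared error as the performance measure, plain gradient descent as the update rule, and invented data points.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def fit(train_set, val_set, learning_rate=0.01, n_iterations=200):
    """Train on train_set; remember the parameter that did best on val_set."""
    w = 0.0
    best_w, best_val_loss = w, mse(w, val_set)
    for _ in range(n_iterations):
        # Learn from the training set only.
        grad = sum(2 * (w * x - y) * x for x, y in train_set) / len(train_set)
        w -= learning_rate * grad
        # Check performance on the validation set and keep the best state.
        val_loss = mse(w, val_set)
        if val_loss < best_val_loss:
            best_w, best_val_loss = w, val_loss
    return best_w, w  # best state on validation, and the last state


# Invented data, roughly following y = 2x with some noise in the training set.
train_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
val_set = [(1.5, 3.0), (2.5, 5.0)]
test_set = [(4.0, 8.1), (5.0, 9.8)]

best_w, final_w = fit(train_set, val_set)
test_loss = mse(best_w, test_set)  # reported once, after training
print(best_w, final_w, test_loss)
```

Note that the best state on the validation set is not necessarily the last one: here the final parameter fits the noisy training points slightly better, but does slightly worse on the validation set.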


Typically, we repeat this process many times, assigning samples to different sets, so we can obtain a more robust evaluation of model performance. The most typical method of doing this is called k-fold cross-validation, in which the dataset is divided into k groups (called folds), and the process is repeated k times, each time keeping one fold as the test set and the rest of the data as the training+validation set (see figure below). There are many variations of this setting. For example, it is possible to perform cross-validation within each of the train+val sets of a cross-validation fold, resulting in a nested cross-validation. It is also possible to leave out a single test set, and perform k-fold cross-validation on the train+val set.
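
The basic k-fold scheme can be sketched as a generator of index splits. The function name is hypothetical, and for simplicity this version does not shuffle the examples first, which one would usually want to do.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_val_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test_idx = folds[i]
        train_val_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_val_idx, test_idx


splits = list(k_fold_indices(n_examples=10, k=3))
for train_val_idx, test_idx in splits:
    print(train_val_idx, test_idx)
```

Each example ends up in the test set exactly once, and the k test scores can then be averaged.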


The second practice consists of selecting a model with just the right capacity. Going back to the task of separating sets of examples, we would select a model that has enough capacity to learn from the dataset and generalize well, but not so much capacity that it memorizes the training set and performs poorly on the test set (overfitting), nor so little capacity that it is not able to learn at all (underfitting). This is not so straightforward in practice, and is sometimes guided by trial and error.
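
One way to picture this trial and error is with a k-nearest-neighbours classifier, used here purely as an assumed example because its capacity is easy to tune: with 1 neighbour it memorizes the training set, and with as many neighbours as training examples it always predicts the majority class. The data are invented, with one deliberately noisy training label.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training examples closest to x."""
    neighbors = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    votes_for_1 = sum(label for _, label in neighbors)
    return 1 if votes_for_1 * 2 > k else 0


def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Invented (age, label) pairs; the rule is label = 1 if age >= 50,
# except for one noisy training example at age 45.
train = [(30, 0), (35, 0), (40, 0), (45, 1), (48, 0), (52, 1),
         (55, 1), (60, 1), (65, 1), (70, 1), (75, 1)]
val = [(43, 0), (44, 0), (33, 0), (58, 1), (68, 1), (51, 1)]

candidate_ks = [1, 3, 11]  # high, medium, and very low capacity
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

print(accuracy(train, train, 1))  # 1.0: one neighbour memorizes the training set
print(val_scores)
print(best_k)
```

With 1 neighbour the model reproduces the noisy label and misclassifies nearby validation examples (overfitting); with 11 neighbours it predicts the same class for everyone (underfitting); the intermediate setting does best on the validation set in this toy case.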


. . .


TLDR

Machine learning and the immune system have something in common: they both learn from experience to make future predictions. Any machine learning algorithm has 4 basic elements: a dataset, a model, a performance measure, and an optimization algorithm. The dataset is the experience the model uses to learn. Model selection will depend on the task to solve (regression, classification, etc.). During training, a performance measure indicates how well the model is solving the task. For an ML algorithm to generalize, it needs to have just the right capacity, and the experiment should be designed so that its final performance is tested on unseen data (data not seen during training).




I am thankful to Dr. Andrea Gonzalez Pigorini for her input in the medical portion of this post, and to Matias Mikkola for proof-reading this and offering suggestions.



References

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.

[2] Immune system notes from: OpenStax, Biology. OpenStax CNX. Sept 18, 2021 https://opentextbc.ca/biology/chapter/23-2-adaptive-immune-response/

[3] Machine learning and deep learning notes from: [1]