In many fields, research teams aim to model data for purposes going from better understanding of our world to prediction of the future. Although, historically, these problems were only tackled through statistical modeling, in the last decade, machine learning gained popularity. Indeed, machine learning is the branch of computer science that uses past experiences to take future decisions without a complete knowledge of all influencing elements (Bonaccorso 2017). Machine learning techniques can be divided in three main categories: supervised learning, unsupervised learning and reinforcement learning. In supervised approaches, such as classification or regression, the relationships between data and a targeted output are taught beforehand whereas for unsupervised techniques (e.g., clustering) relations and hidden patterns in data are independently found. Afterwards, both approaches can be included in reinforcement learning where the model will continue to learn from environment feedback.

The world of machine learning is wide and is still in expansion. It would be difficult to go through all underlying concepts of artificial intelligence, and thus, this article mainly focuses on one aspect of machine learning: supervised learning for classification.

1 Problem formulation

The term "classification" covers all techniques that classify data into a given number of classes. Being part of the supervised branch of machine learning, classification requires a learning phase on labeled data to identify in which class a new data sample belongs to (Joshi 2017). Therefore, for classification problems, data has to be formed by the pair (\bm{X},Y)¹ with the data description matrix \bm{X} of size N x p and the set of labels Y \in \left\{c_1, \ldots, c_k\right\}, where k is the number of targeted classes, N is the number of labeled samples and p is the number of features which characterize each sample. More precisely, each sample i of the data is described through the feature set X_i = [x_{i1}, x_{i2}\ldots, x_{ip}]. This is depicted and illustrated by an example in Figure 1.

Data for classification summary (left), supplemented by a sleep state dataset example (right).

Visually, each line of \bm{X} contains the feature set of dimension p that describe a sample. A sample is associated with an output class which is reported in Y at the corresponding line. Hence, in the example, sleep data is formed by the pair (\bm{X},Y), where \bm{X} is composed by N instants. Four qualitative features are used to describe each instant (= sample): Eyes, Breathing, Movement and Crying (p=4). Associated sleep states are contained in Y. Sleep states are labeled from 1 to 5 (k=5). In the sleep dataset example, only qualitative features (i.e, descriptive) are presented although quantitative features (i.e., numeric) can also be integrated into \bm{X}.

To construct an accurate classification model, the process, depicted by Figure 2 and described below, is generally followed (Dangeti 2017). It implies five steps: collection of the data, feature engineering, learning, testing and deployment.

Overview of the classification process from data collection to deployment.

In fact, it is necessary to keep in mind these steps to avoid two major problems of machine learning: underfitting and overfitting. Underfitting characterizes a model which fails to generalize the data, usually, due to a lack of training samples. Reversely, overfitting occurs when a classifier corresponds too closely or exactly to a particular set of data and fails to fit future data.

1.0.0.1 Collection of data

The data collection is an important step to construct an accurate classification model. The more various are the training samples, the better will be the classification when deploying the model. It is important to notice that data used to train are the only knowledge of the model and thus, if a given event is too much represented in the data, the model may overfit.

1.0.0.2 Feature engineering

One of the main challenges of classification is to provide an informative set of features regarding the targeted outputs of the model. In some cases, data may have to be processed first to extract describing features. One can think that the more features are extracted the better will be the model. However, a large big set of features can also lead to overfitting. This effect is known as the curse of dimensionality (Tang, Alelyani, and Liu 2014). To overcome this issue, it is sometimes necessary to reduce the feature set dimension p. To do so, several techniques exist and are presented in Section 2.

1.0.0.3 Learning phase

Once data have been prepared, the learning phase can be initiated. Most of the machine learning techniques depend on parameters and/or hyper-parameters² that need to be tuned to better fit the data (see Section 3). During the learning phase two datasets are usually used: the training and the validation dataset. We will see in Section 4 that several approaches exist to divide the data to obtain both sets. In machine learning, parameters are tuned by covering a large scale of possible values and integrating them into the learning phase. Then, trained model is applied on the validation set and the set of parameters that will be retained is the one giving the best performances regarding the objective of classification. As the way of looking performances can change regarding the application, further details on this question are provided in Section 4.1.

1.0.0.4 Testing the algorithm on unseen data

In order to ensure the quality of the model predictions, it is common in machine learning to apply the model on an additional test set of unseen (but labeled) data. This way, if high performances are observed at the learning phase but a poor generalization is observed on the test set, it is a direct indicator of overfitting.

1.0.0.5 Deployment of the algorithm

The last step of the process is to deploy the algorithm in order to make predictions on unlabeled real data. Normally, if the model has been correctly evaluated on a test set independently of the training phase, future classifications would be accurate. However, it can happen that the performances may be altered. Indeed, the annotated data, inherently to conditions of collection, can be unreliable to infer the whole population. Although it is a tough question to ensure a totally random and independent collection of data, it is important to keep in mind this limitation (Grimson, Guttag, and Bell 2016).

In the following sections, further details are provided about dimensionality reduction methods, machine learning algorithms and evaluation techniques used for classification. An overview of the methods mentioned hereafter is proposed by Figure [mindmapML].

2 Dimensionality reduction

Dimensionality reduction is the process of reducing the size of the feature set. Although this step is not mandatory, when the data are described by many features, performing dimensionality reduction is a good trick for, among others, the following reasons (Coelho and Richert 2015):

Reduce the risk of overfitting;
Speed up the learning phase;
Lower the model computational complexity;
Visualize data using a limited number of dimensions.

In practice, two main groups of dimensionality reduction methods are used: feature selection and feature extraction, sometimes also called feature projection. The main difference between these approaches is that feature extraction maps the original feature space into a lower-dimensional space while, in feature selection, a subset of the original feature set is selected. The choice to use either selection or extraction methods can depend on the objective of the model. If the objective is to better understand the influence of features on classes, feature selection is more suited. Indeed, in feature extraction, the new feature set obtained by projection is generally difficult to link with physical meaning (Tang, Alelyani, and Liu 2014).

2.1 Feature selection

Feature selection methods are used to select features that are the most suitable to discriminate samples that belong to different classes. Hence, the goal is to find the best subset of features among the 2^p candidate subsets (Dash and Liu 1997).

Generally, the selection approach, depicted by Figure 3, combines feature subset evaluation and search algorithm. Several iterations are performed to search the best subset. For each iteration, a subset is evaluated and its "goodness" is returned.

When the original feature set is large, looking for all possibilities is greedy. Hence, computational complexity of the search can be reduced by heuristic or randomized methods to prevent an exhaustive search. These methods are associated with a stopping criterion on the "goodness".

The "goodness" criterion depends on the applied method and objectives. Two main categories of methods stand out: filter models or wrapper models. In the filter approach, the "goodness" is related to information content of the subset while, in wrapper, it is the predictive performances, obtained on the validation set with the subset of features that is evaluated.

2.1.1 Filter methods

Filter models rely on the characteristics of the data without using any classification method (Liu and Motoda 2007). Typically, features are ranked and the highest ranked ones are selected. This can be done in two ways: univariate or multivariate. In the univariate scheme, each feature is ranked independently from others whereas all features are considered simultaneously in the multivariate approach. Several criteria have been applied to rank data. Among them, we can cite: quality (Fisher score (Duda, Hart, and Stork 2012; Gu, Li, and Han 2012)), independency (Chi-squared (Bonaccorso 2017)), redundancy (information gain (Roobaert, Karakoulas, and Chawla 2006)) or separation of class instances (ReliefF (Kira and Rendell 1992)).

Since filter models are easy to understand and implement, it is a popular technique. However, features are selected independently from classification and thus, filter models totally ignore the effects of the selected subset on the performance of the classification algorithm (Hall and Smith 1999).

2.1.2 Wrapper methods

Wrapper models were developed to overcome the limitation of filter models. The subset evaluation is performed by integrating the classifier. Hence, the subset is adapted to the inherent particularities and bias of a predefined classifier (Tang, Alelyani, and Liu 2014). To find the best subset, a wide range of search strategies exists including hill-climbing, best-first, branch-and-bound, and genetic algorithms (Guyon and Elisseeff 2003). Two other popular methods are forward selection and backward elimination where features are respectively added or removed one by one.

Wrapper models provide better predictive performances than filter models (Kohavi and John 1997). Nevertheless, they are more computationally expensive than filter models.

2.2 Feature extraction

The concept under feature extraction is to project the original feature set of dimension p into a lower-dimensional space of dimension m.

Two types of feature extraction methods can be distinguished depending on the way to combine the features: linear and non-linear. In linear feature extraction methods, new features in the lower dimensional space are given by a linear combination of the original feature set. Reversely, non-linear combinations are sought with non-linear approaches. Linear techniques being principally used, this section mainly focuses on this approach. The main challenge of linear feature extraction methods is to find a transformation \bm{W}, such as: \bm{Z} = \bm{W}^T \bm{X} where \bm{Z} is the projected data set of size m * N, with m < p and N is the number of samples. Dimensionality reduction by the mean of linear feature extraction is depicted by Figure 4 with an example of projection from a space of dimensions p = 2 to a space of dimension m=1. This example shows that an infinite number of solutions exists to find a new axis to project data points. However, in the literature, two methods stand out: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) (Tang, Alelyani, and Liu 2014), described below.

Feature extraction concept (left) and associated example (right). X_i is the original feature set describing the sample i and Z_i is the extracted feature set. In the example, black dots represent data in a two-dimensional space and the red line represents an example of axis for projection in a one-dimensional space.

2.2.1 Principal Component Analysis

Principal Component Analysis is a well-known technique of data transformation (Jolliffe 2011). In its standard formulation, it finds the most discriminant projection of the original feature set by maximizing the variance between data points. In practice, the eigenvectors of the covariance matrix, calculated from the original feature set, are computed. Then, the ones with the largest eigenvalues (principal components) are used to reconstruct a part of the variance of the original dataset.

After that, there are two ways of using principal components regarding the objective of the analysis. The first one is data visualization. In that case, only two or three principal components, carying most of the variance, will be retrieved in order to plot the data in a two to three dimensional graph. This way, further analyses can be conducted regarding the distribution of data within the objective of class separation. Another purpose of PCA is to feed machine learning algorithms for classification purpose. With PCA, the reduction of dimension can be regulated by the total variance wanted for the feature subset. In other words, the number of principal components that are kept depends on the percent of variance information wanted by the user.

The main limitation of PCA is that there is no guarantee that a small number of principal components with the highest variance will contain the information needed for the classification (Neal and Zhang 2006). Hence, relevant information can be lost and the resulting projected features \bm{Z} can lead to weak classification performances. Additionally, PCA is only suited for quantitative set of feature. Other factor methods exists. Among them we can cite Multiple correspondence analysis (Le Roux and Rouanet 2010) for qualitative feature set and factor analysis of mixed data for mixed feature set (Escofier and Pagès 2008). Moreover, a version of PCA, called kernel PCA, has been proposed to make non-linear projections (Schölkopf, Smola, and Müller 1997). Briefly, an initial step is first performed to find a particular space where the dataset becomes linearly separable (Bonaccorso 2017).

2.2.2 Linear Discriminant Analysis

Contrary to PCA, in Linear Discriminant Analysis, labels are integrated in the process of dimension reduction. LDA finds the most discriminant projection by maximizing between-class distance and minimizing the within-class distance (Balakrishnama and Ganapathiraju 1998).

In practice, the eigenvectors of the between-class and within-class covariance matrices are computed and the best projection is found using Fisher’s criterion.

Comparison of data projection between Principal Components Analysis and Linear Discriminant Analysis for a two-classes problem. Orange dots represent data of the first class and data belonging to the second class are reported in black. The red line represents the resulting axis for projection obtained for each method.

A comparative example between PCA and LDA is given by Figure 5. For a two-class data example, the resulting axis that preserves the variance on the whole data set (PCA) and the one that preserves the distance between both classes (LDA) are drawn. We can see, on each axis, the resulting data projection. LDA offers, in that case, a better projection for discriminating purpose. In fact, besides dimensionality reduction, LDA can also be used for classification (see Section 3.1.3).

3 Machine learning algorithms

As mentioned in Section 1, a supervised machine learning technique is an algorithm that learns from past experiences (training set) a model to make future predictions.

Classifiers³ aim to find the best decision boundaries to discriminate between classes. As illustrated by Figure 6, there are two types of classification cases: the ones where classes can be separated with linear boundaries and the ones where classes are not linearly separable.

Illustration of a linearly separable case (left) and a non-linearly separable case (right). Red lines represent examples of boundary decisions.

Hence, classifiers can be divided in two groups: linear algorithm and non-linear algorithms. To illustrate the different approaches used to solve classification problems, in this section, we will go through eight commonly used classifiers.

3.1 Linear algorithms

Linear classifiers aim to find a linear decision boundary between classes. It can either be a line, a plane or a hyperplane, depending on the dimension of the problem.

3.1.1 Perceptron

The most basic linear model is the perceptron (Rosenblatt 1958), depicted in Figure 7.

Illustration of a perceptron with two inputs.

In this example, the perceptron output decision y_i, which can either be -1 or +1, is computed for each sample i as: y_i = f(w_1x_{i1} + w_2x_{i2}) where f is a step function, also called threshold or activation function. This can be generalized for a larger dimension p by: \label{linearEquation} Y = f(W^T\bm{X}+b) where W is the weight vector defined as W = \left\{ w_1, w_2, \ldots, w_p \right\} and b is the bias.

To summarize, to make a prediction, the perceptron computes two quantities. First, the weighted sum of the input features is calculated. Then, this sum is thresholded by the function f in order to retrieve a prediction equal to -1 or +1.

During the training phase, W is firstly randomly initialized. Then, weights are settled by considering all the samples of the training set and looking at the output decision (Shiffman 2012). It is performed in three steps which are repeated until all training samples are correctly classified:

Make a prediction for an input sample;
Compute the error \epsilon between the prediction and real label;
Adjust the new vector of weights W' accordingly to the error, such as: W' = W + \Delta W

where \Delta W = \epsilon x \eta x X_i, with \eta, called the learning rate⁴ and X_i, the input features of the sample.

So far, we presented a two-class classification approach. In case of multiclass problems, several perceptrons can be trained in order to predict each class versus all others (one-versus-the-rest).

The power of perceptron classifiers resides in the fact that if a linear boundary exists to discriminate classes, the model will converge perfectly. However, in most classification problems, an overlap exists between classes and perceptron models will necessary present misclassification.

3.1.2 Logistic Regression

Logistic Regression (LR) is quite similar to perceptron except that instead of a class prediction, it returns a probability of belonging to the positive class. Hence, in some cases, it may be more robust to class overlap. To compare between perceptron and logistic regression lets restart from Equation ([linearEquation]). In logistic regression, b is commonly renamed w_0, the model intercept. Then, for each sample, a score S(X_i) is computed as: S(X_i) = f(W^TX_i+w_0)

In logistic regression, f is fixed and is called the logistic (or sigmoid) function. The sigmoid function, depicted by Figure 8, is used to associate each score to a probability bounded between 0 and 1, such as: P(Y_i) = f(S(X_i))

After that, a cut-off value is usually applied on the probability to obtain the class output.

In the learning phase, the vector of weights W has to be estimated. Here, the best W is the one that maximizes the conditional probabilities P(Y|\bm{X}, W) on the training set. This is commonly done by Maximum Likelihood Estimation (MLE), based on the assumption that weights are normally distributed (Czepiel 2002).

Logistic regression is usually applied for binary classification. However, as well as perceptron, it is possible to extend it to multiclass problems by training several one-versus-the-rest LR classifiers.

3.1.3 Linear Discriminant Analysis

As mentioned in Section 2.2.2, Linear Discriminant Analysis can be used for feature extraction. It can also be applied for classification purpose. Contrary to previous linear methods, LDA is more suited for multiclass analysis (Rao 1948). We saw that LDA aims to find the best projection by maximizing the mean between-class distances while minimizing the within-class variance. For multiclass problems, a initial step is added to compute the overall mean (= center) of the data. Then, it is the distance between each class and this center that is maximized while minimizing the within-class variance.

In this way, distributions (means and variance) of each class are estimated. Thus, by the use of Bayes’ theorem, the probability of a sample to belong to each class can be estimated. For future predictions, the class associated with the higher probability will be returned.

Nevertheless, LDA presents two main limitations. The first one is due to the number of samples for each class in the training set. If a class is under-represented, the estimated distribution for this class will be corrupted. Secondly, as previous methods, LDA is more suitable on linearly separable multiclass problems.

3.2 Non-linear algorithms

In case of non-linear problems, where class boundaries cannot be approximated by hyperplanes, non-linear algorithms may be more reliable than linear classifiers.

3.2.1 K-Nearest Neighbors

K-Nearest Neighbors is a basic method used for classification. Basically, it computes all the distances between a new sample and the ones of the training set. Then, the majority class of its neighbors is assigned to it. The number of neighbors that contributes to the vote is determined by k.

Example of K-nearest neighbors classification with k=3.

An example of KNN classification, with k=3, is given by Figure 9. In this example, the class of a new sample (in blue) is sought. The distance with all others points of the learning set is computed. Labels of the 3 nearest neigbours are checked. With two neighbors belonging to class "Black" and one to class "Orange", the new sample is classified as a "Black" sample.

Although Hamming, Manhattan or Minkowski distances can be used, KNN classifier is commonly based on the Euclidean distance. The Euclidean distance between a sample of the training set X_{i} and a new sample X_{n}, is computed as: d(X_{i},X_{n}) = \sqrt{(x_{i1}-x_{n1})^2+(x_{i2}-x_{n2})^2+\ldots+(x_{ip}-x_{np})^2} To compute an accurate distance, it is necessary to work on homogeneous features (i.e., with the same scale). Indeed, absolute differences in features must weight the same to avoid the computation of a meaningless distance.

KNN is a popular algorithm in classification due to its simplicity. In fact, the algorithm is intuitive and easy to implement. Additionally, it can perform well for both linear and non-linear cases.

However, KNN has some drawbacks. It is easily subject to overfitting, especially when working with an imbalanced training dataset.⁵ In addition, it can require a lot of memory for future predictions since it needs the training data samples to compute distances.

3.2.2 Decision tree

Decision tree is a predictive model approach which is constructed as a tree-shaped diagram, composed by nodes, branches and leaves. It provides the statistical probability of a class to occur. An example of tree architecture, for a four classes problem and a feature set of three components, is provided in Figure 10.

Tree architecture for a four classes examples, where x_{i1}, x_{i3} and x_{i3} are the feature set of a sample i and y_i is the predicted class. T_1, T_2, T_3 are three thresholds.

The prediction for a new sample is given using a succession of small tests. Each node corresponds to a test until a leaf, giving the overall output decision, is reached. Hence, in the tree example, if a new data sample n has:

its feature x_{n2} is superior to T_2;
its feature x_{n1} is inferior to T_1;
its feature x_{n3} is inferior to T_3.

Then, there is a probability of 0.9 that it belongs to class 3 and a probability of 0.1 to belong to class 1. Generally, the highest probability is retained, making "class 3" the final decision for this sample.

In the training phase, two elements are determined: the architecture (e.g., the order of the considered features, the number of nodes and leaves) and threshold for each feature. A decision tree is constructed following these steps:

Split the training set by thresholding a feature;
Compute a measure of the quality of the split;
Repeat Step 1 and Step 2 until all the splitting possibilities (all thresholds for all features that wasn’t used in previous nodes) are associated with a quality measurement;
If the split with the best quality measurement improves the classification of the previous node, it is a new node. Otherwise, the previous node was better and is turned into a leaf.
Repeat all the previous steps until only leaves can be reached.

To measure the quality of a split, a criterion based on impurity (e.g., Gini’s) or information gain (e.g., entropy) can be used (Dangeti 2017). In practice, it is often the Gini’s impurity that is applied: Gini = 1 - \sum_{c=1}^C P(c)^2 where C is the number of classes and P(c) is the fraction of samples of class c observed after the split. The lower is the Gini’s criterion, the better is the split.

Decision trees are easy to understand and interpret. Additionally, they require very few data preparation (such as scaling in KNN) since splits are independently made for each feature. However, they can easily lead to overfitting if the set of features is too wide. As KNN, they are also sensitive to imbalanced dataset since the number of samples in each class impacts the quality criterion.

3.2.3 Random Forests

Random Forest is a part of "ensemble methods" of machine learning. Ensemble methods are based on the assumption that diversified and independent models tend to give better classification results. Therefore, Random Forest is a combination of tree predictors (Breiman 2001). The classification result comes from the majority vote of a collection of decision trees (also called bagging). In fact, it aims to enhance the generalization by averaging multiple decision trees trained on different parts of the same training set. Figure 11 shows the workflow for a new prediction with a RF model of E trees.

Prediction workflow with a Random Forest model composed by E trees. Successive test results for each tree are reported in green.

In the learning phase, each tree is growing, as described in Section 3.2.2, from a randomly selected subset of the feature set. In this way, each tree is growing with different features. Indeed, if the whole feature set was used, significant features would always come first in the top nodes of splitting which would make all trees be more or less similar.

Random forest is a very popular learning method that can reach really high performances. Contrary to previous methods, it can be robust to imbalanced dataset since a prior knowledge of class occurrences can be integrated in the algorithm. However, results are difficult to interpret since it can be constructed with hundreds of trees. Additionally, sometimes, RF can overfit due to a too noisy dataset (i.e., with a lot of outliers/extreme cases).

3.2.4 Support Vector Machines

Support Vector Machines are suited for both linear and non-linear problems (Cortes and Vapnik 1995). The aim of SVM is to find the hyperplane boundary that leaves the maximum margin between two classes. Figure 12 illustrates the SVM terminology on a linear example.

Example of a linear boundary estimated by SVM for a two class problem.

In the learning phase, the basis of SVM is the perceptron. This time, the vector of weights W is estimated ensuring that the margin is maximized between the support vectors of both classes. Support vectors are the data points of both classes which are near the hyperplane. Generally, a parameter g defines the quantities of points that will be taken into account while estimating W. The highest is g, the less support vectors will be used. For multiclass problem, the one-versus-the-rest strategy is often used.

The main strength of SVM is what is commonly called the "kernel trick". The idea is to apply a transformation \phi to the dataset in order to work on a space where data are linearly separable. An example of kernel transformation is given by Figure 13.

Example of kernel transformation \phi and boundary estimation (in red) with SVM.

A wide variety of kernel transformations can be used such as polynomial, radial basis function or sigmoid as well as customized kernels.

Support Vector Machines is a widely used machine learning algorithm. Contrary to RF, it is robust to extreme cases since the hyperplane boundary is computed from support vectors. The main drawback of SVM is that a lot of parameters, depending on the kernel, has to be tuned and finding the best set of parameters can be computationally expensive.

3.2.5 Multi-Layer Perceptron

Multi-Layer Perceptron (MLP) is a feedforward artificial neural network. Although a sole perceptron is limited to linear situations, Artificial Neural Networks (ANN) have been constructed in order to solve non-linear problems. There are composed by one input layer, at least one hidden layer and one output layer. Hidden layers and output layers are made up of perceptrons that are all connected from a layer to another. Figure 14 depicts an example of MLP with two hidden layers.

Example of multi-layer perceptron with 2 hidden layers and two output classes.

Predictions are made by feeding the feature set of a sample to the network. Then, perceptrons are progressively activated (or not) until reaching an output node which gives the predicted class. Contrary to the linear case, the activation function of each perceptron is non-linear. Among them, we can cite hyperbolic tangent, rectified linear unit function or sigmoid.⁶

For the learning phase, MLP uses a supervised technique called backpropagation. As for the linear case, all samples are passed through the network and weights W are updated according to the output error (Parizeau 2004). Weights of each perceptron are updated one-by-one, starting from the ones of the output layer.

Neural network is a field in expansion notably because of improvement in computing capacities. Nowadays, neural networks can be composed of a large amount of hidden layers, increasing their depth. This increase of the computing capabilities leads to an evolution of Neural Networks (e.g., Convolutional Neural Networks, Recurrent Neural Networks) to what is today commonly called deep learning. The main drawback of neural networks is that it involves a lot of parameters to tune (e.g., number of layers, number of perceptrons per layer, connections between layers). In addition, it is also subject to overfit if the training set is too small or not representative of the whole population. Although today, various complex problems can be solved with deep learning, it is important to remind that it exists alternatives (e.g.,SVM, RF) that are faster, easier to train and can provide better performances regarding the classification objective.

4 Techniques for performance evaluation

As briefly mentioned in Section 1, evaluation strategies can change regarding the classification objectives, as well as considering the available data. This section firstly focuses on the metrics that are used to evaluate the performances. Secondly, we will go through different techniques used to split an annotated dataset in order to ensure good results for future predictions.

4.1 Metrics

Parameters and hyper-parameters of machine learning algorithms are tuned by maximizing the performances metrics. Additionally, performance metrics are mandatory to compare the results of different classifiers. In practice, most of performance metrics are based on the confusion matrix, reporting the number of good classifications and misclassifications by comparing the predicted with actual labels. A confusion matrix for a two class problem is reported in Figure 15. It is composed of four numbers:

True Positive TP: number of samples correctly classified as yes;
True Negative TN: number of samples correctly classified as no;
False Positive FP: number of samples classified as ’yes’ instead of no;
False Negative FN: number samples classified as ’no’ instead of yes.

Confusion matrix for a two class classification.

From there, the overall classification accuracy Acc measures the total of good classification over the whole data set: Acc = \frac{TP+TN}{TP+FP+TN+FN}

Measuring the accuracy of a classification is important to give an insight on what the algorithm is capable of. However, it is not always the best way of looking. In fact, for example, in case of imbalanced dataset, a class can take the lead on the accuracy metric and a very high value can be observed even if a less represented class is never detected. Actually, it is often the case in biomedical engineering, notably when working on rare incident diseases. For example, for disease screening, it may be better to hand up with some false positive cases than missing one patient. Reversely, in some cases, it may be more important to be sure that samples predicted as belonging to a class really belongs to this class.

For these reasons, it exists a bunch of metrics which analyze the performances in different ways. A list of the more common ones is provided hereafter.

Historically, sensitivity and specificity are used to precise classification results:

Sensitivity Se, which measures the proportion of actual positives that are correctly identified as such: Se = \frac{TP}{FN+TP}
Specificity Sp, which measures the proportion of actual negatives that are correctly identified as such: Sp = \frac{TN}{FP+TN}

Recently, with the gain of popularity of machine learning, recall, precision and F-scores are commonly used:

Recall R, corresponds to Se: R = \frac{TP}{FN+TP}
Precision P, also known as the Positive Predictive Value (PPV), which measures the fraction of actual positives among the retrieved positive cases: P = \frac{TP}{TP+FP}
F1-score F_1 is the harmonic mean of precision and recall and allows a single measure of performance: F_1 = \frac{2 \cdot P \cdot R}{P + R}
F_\beta score, which is the weighted (according to \beta) harmonic average of precision and recall: F_{\beta} = \frac{(1+ \beta)^2 \cdot (P \cdot R)}{(\beta^2 + P + R)}

All these metrics are bounded between 0 and 1, where a value of one means a perfect score. Until there, it may seem that these metrics are only applied on two class problems. However, all these performance measurements can be generalized to multiclass problems, notably by constructing a confusion matrix of one-versus-the-rest for each class.

4.2 Cross-validation

In order to validate the classifier abilities to make future predictions, several evaluation strategies can be followed. Cross-validation is the most popular method that helps to ensure the robustness of a classifier since it allows the detection of underfitting or overfitting (Dangeti 2017). In this section, three popular techniques of cross-validation are presented: Hold-out, K-fold cross-validation and Leave-one-out cross-validation.

4.2.1 Hold-out

The hold-out method is the simplest cross-validation technique. Data is divided into two parts: the training set and the testing set, as depicted by Figure 16.

This way, the classifier is trained on the training set and can be evaluated on unseen data of the testing set. Usually, a larger part of the data is used for training. There is two different ways to divide the data: by randomly selecting a number of samples over the whole dataset or by randomly selecting a number of samples of each class, in order to ensure a representation of all classes in the training and in the testing set. This method is usually applied for small datasets.

4.2.2 k-fold cross-validation

One way to improve the hold-out strategy is to perform a k-fold cross-validation to tune the parameters and train the model. The idea is to train the model on k different data splits to ensure its robustness. For each iteration, a training set and a validation set are constructed. An example of 3-fold cross-validation is provided by Figure 17.

Illustration of the 3-fold cross-validation strategy.

For better control on the classifier performances, it is also recommended to work with an additional set, that wasn’t seen during the learning phase and the tuning of the parameters. This set is called the testing set. Hence, in Figure 17, the 3-fold cross-validation step is only performed on a part of the data.

4.2.3 Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV) is the extreme k-fold validation, where k is equal to the number of samples in the training/validation dataset. An example of LOOCV is depicted by Figure 18.

Illustration of the leave-one-out cross-validation strategy for a training/validation dataset of 20 samples.

Another way to perform LOOCV exists in biomedical engineering. In fact, to evaluate the capacity of classifiers to work with a new patient, a leave-one-patient-out strategy is often performed by working on all patients except one at each time. Performances for each draw indicate the generalization of the model to make future predictions for a new patient.

References

Balakrishnama, Suresh, and Aravind Ganapathiraju. 1998. “Linear Discriminant Analysis-a Brief Tutorial.” Institute for Signal and Information Processing 18: 1–8.

Bonaccorso, Giuseppe. 2017. Machine Learning Algorithms. Packt Publishing Ltd.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Coelho, Luis Pedro, and Willi Richert. 2015. Building Machine Learning Systems with Python. Packt Publishing Ltd.

Cortes, Corinna, and Vladimir Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20 (3): 273–97.

Czepiel, Scott A. 2002. “Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation.” Available at Czep. Net/Stat/Mlelr. Pdf.

Dangeti, Pratap. 2017. Statistics for Machine Learning. Packt Publishing Ltd.

Dash, Manoranjan, and Huan Liu. 1997. “Feature Selection for Classification.” Intelligent Data Analysis 1 (1-4): 131–56.

Duda, Richard O, Peter E Hart, and David G Stork. 2012. Pattern Classification. John Wiley & Sons.

Escofier, Brigitte, and Jérôme Pagès. 2008. Analyses Factorielles Simples et Multiples. Objectifs Méthodes et Interprétation. Dunod.

Grimson, Eric, John Guttag, and Ana Bell. 2016. “Introduction to Computational Thinking and Data Science.” MIT OpenCourseWare, https://ocw.mit.edu. Massachusetts Institute of Technology.

Gu, Quanquan, Zhenhui Li, and Jiawei Han. 2012. “Generalized Fisher Score for Feature Selection.” arXiv Preprint arXiv:1202.3725.

Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82.

Hall, Mark A, and Lloyd A Smith. 1999. “Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper.” In FLAIRS Conference, 1999:235–39.

Jolliffe, Ian. 2011. Principal Component Analysis. Springer.

Joshi, Prateek. 2017. Artificial Intelligence with Python. Packt Publishing Ltd.

Kira, Kenji, and Larry A Rendell. 1992. “The Feature Selection Problem: Traditional Methods and a New Algorithm.” In Aaai, 2:129–34.

Kohavi, Ron, and George H John. 1997. “Wrappers for Feature Subset Selection.” Artificial Intelligence 97 (1-2): 273–324.

Le Roux, Brigitte, and Henry Rouanet. 2010. Multiple Correspondence Analysis. Vol. 163. Sage.

Liu, Huan, and Hiroshi Motoda. 2007. Computational Methods of Feature Selection. CRC Press.

Neal, Radford M, and Jianguo Zhang. 2006. “High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees.” In Feature Extraction, 265–96. Springer.

Parizeau, Marc. 2004. “Le Perceptron Multicouche et Son Algorithme de Rétropropagation Des Erreurs.” Département de Génie électrique et de Génie Informatique, Université de Laval.

Rao, C Radhakrishna. 1948. “Tests of Significance in Multivariate Analysis.” Biometrika 35 (1/2): 58–79.

Roobaert, Danny, Grigoris Karakoulas, and Nitesh V Chawla. 2006. “Information Gain, Correlation and Support Vector Machines.” In Feature Extraction, 463–70. Springer.

Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386.

Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1997. “Kernel Principal Component Analysis.” In International Conference on Artificial Neural Networks, 583–88. Springer.

Shiffman, Daniel. 2012. “The Nature of Code: Chapter 10.” Neural Networks.

Tang, Jiliang, Salem Alelyani, and Huan Liu. 2014. “Feature Selection for Classification: A Review.” Data Classification: Algorithms and Applications, 37.

Matrices are noted in bold capital letters and vectors with capital letters↩︎
To avoid confusion, the use of "parameter" and "hyper-parameter" has been dedicated to mention the setting of methods whereas the term "feature" is exclusively associated with data either when it is question of raw features or after computation.↩︎
In classification, machine learning algorithms are also called classifiers.↩︎
It is used to regulate the scale of the weight modifications.↩︎
when the proportion of samples for a class is high or low in comparison with others↩︎
Note that in LR, a sigmoid was also used and LR was classified as a linear model. In fact, the output of LR can be written in a linear form whereas, because of the multiple perceptrons, it is not possible to write a linear equation to summarize neural networks outputs. Thus, Neural Networks are classified as non-linear models.↩︎

Introduction to the methods for classification