Data Mining and Machine Learning — Credibility of the trained model

Gioacchino Lonardo
Published in Analytics Vidhya · Oct 6, 2020


When we train a machine learning model we have to make its evaluation credible and show that it really performs better than the alternatives. In this article we will look at the techniques for doing this.

How should we split our data in order to obtain correct estimates of the performance of an ML algorithm? We will look at this, also in terms of confidence limits, and we will see how to compare two classifiers. Since in general no technique is inherently better than another, the different approaches have to be compared: depending on the problem under examination some approaches are preferable to others, and without statistical tests we have no guarantee that one technique really works better than the others.
For example, if we have a binary problem, our choice will probably fall on a binary classifier. Another important aspect is that in many contexts not all errors weigh the same way: in the medical field, saying that a sick patient is healthy is an error with a higher cost than the opposite one. Initially we assume a constant cost for each type of error, which amounts to not considering costs at all; later we will also look at cost-sensitive performance evaluation in our scheme (model). It is also possible to evaluate numeric predictions, for example predicting the performance of a processor: while classification works only with labels or probabilities, a numerical output can be evaluated as well.

Although in theory we could do it, evaluating performance on the errors of the training set is a mistake: the training-set error is not a good indicator of performance on future data. Under that criterion the k-nearest neighbors classifier (k-NN) with k=1 would be the optimal classifier: it assigns an input instance to the class of its nearest neighbor, and since the instances being evaluated are the training samples themselves, it would never make an error and would always produce a perfect classification.

1-NN

The green circles are the training samples (each circle is assumed to belong to a different class), the red circles are the test data, and the dashed circle outlines the classification region. It follows that the classification will be perfect, since each green circle is a class of its own. This is 1-NN evaluated on its own training data.
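To see concretely why the resubstitution error is misleading, here is a minimal sketch (using scikit-learn and made-up random data, both assumptions for illustration): 1-NN scored on its own training set looks perfect even though there is nothing to learn, while on held-out data it drops to chance level.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                      # 200 random 2-D points
y = rng.randint(0, 2, size=200)           # random labels: nothing to learn here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("accuracy on the training set:", knn.score(X_train, y_train))  # 1.0
print("accuracy on held-out data:  ", knn.score(X_test, y_test))     # ~0.5, i.e. chance level
```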

The first thing we can do, if the dataset is large enough, is to split it into a training set and a test set: we train on the training set and evaluate performance on the test set. The problem is that the amount of data, i.e. the number of labelled samples, is often limited, so more sophisticated evaluation techniques are required. We will see that another way to fight the difficulty of obtaining large amounts of labelled data in some contexts is the semi-supervised approach. We might have little labelled data because, for example, only a few patients come to us (medical data); with the traffic flows of an internet network the problem is the opposite: there are plenty of samples, but only a few of them get labelled.

Evaluation

One aspect to take into account is statistical reliability. When we evaluate differences in performance we have to use statistical significance tests to understand, whether we are comparing several classifiers or different configurations of the same one, if the performances we measure are really different or if the difference merely reflects the randomness of the learning process. This can be done with confidence intervals and, later on, with statistical tests. As types of performance measurement we have:

  • Number of correct classifications;
  • Accuracy of probability estimates (when we estimate a probability, the closer we get to the ideal value the better);
  • Error in numeric predictions (mean error, mean squared error, mean absolute error, etc.).

Training and testing

In the case of classification we use the error rate, where we define:

  • Success: the instance's class is predicted correctly;
  • Error: the instance's class is predicted incorrectly;
  • Error rate: the proportion of errors over the entire set of instances.
    For example, if I have 100 instances and 40 errors, the error rate is 40% as a percentage, or 0.4 as a number.

The error made on the training data is normally called the resubstitution error; in practice it is the error rate on the training set, and it is a very optimistic estimate.

So we consider a test set of independent instances not used in the training process. What is the problem? In some cases, during the training phase, we have to not only train the classifier but also tune its parameters, and for the tuning we cannot use data belonging to the test set: only data from the training side may be used for this too!
Giving the classifier the ability to generalize also requires a correct setting of the parameters in the training phase. For example, if we show it the samples of sets A and B and then test it on samples of A and B that it has already seen in training, it will easily look good, but this does not mean the classifier generalizes well to samples it has never seen. So it is good practice to set the parameters by training on the samples of one set and then testing on the samples of another, in order to evaluate the generalization ability as well.
Practical example: there are cases where data comes from different sources, for instance a classifier that uses data about clients from different cities. Suppose the two cities have 100 clients each; we could use 50 clients of A and 50 of B as the training set and do the same for the test set. By mixing samples of different nature we obtain an optimistic estimate, because we are telling the system: "I'll give you all the possible information"; but if other data arrived from a city C, never seen before and with different characteristics, performance could be worse. Training on A and testing on B would give a more realistic estimate.

Training set, validation set, and test set

Some learning schemes operate in two phases:

  1. Construction of the basic structure;
  2. Optimization of parameter settings.

Test data should not be used for parameter optimization: it would be statistically incorrect, because we would be letting the test data shape something that belongs to training. It is recommended to use three sets:

  1. Training set: to build the basic model;
  2. Validation set: to perform parameter tuning;
  3. Test set: used only once to evaluate performance.

Once the evaluation is completed, all the data can be used to build the final classifier. We have two separate problems: on one hand we want to build the best possible classifier, on the other we want to evaluate its performance. So we use a training set, a validation set and a test set (kept disjoint and used only to evaluate performance); but afterwards it would be silly to train the final classifier only on the training data, which is only a portion of the whole, say 70%: we might as well use all of it.
Having used the validation set to set the parameters, we already know their values, so for the final training we use the whole dataset and obtain a model that in theory is better, because it is trained on more data than in the first training stage.
A case in which the result obtained on the training set alone is better than the one obtained by training on all the data can occur if the first training is done on clean data and the final phase adds dirty data (data with noisy labels): the final result could get worse. In general, however, the noise is spread over the entire dataset, so the more data we use the better.
In general, the bigger the training set the lower the error rate; but to make the estimation we would, dually, like the test set to be as large as possible, so we have two conflicting needs. One possibility is the holdout procedure, i.e. dividing the starting data into a training part and a test part (both ideally as large as possible). Another possibility, which overcomes the problem that the starting data is not large enough, is the cross-validation technique.
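As an illustration, here is a minimal sketch of the three-set workflow just described (scikit-learn with placeholder data; the dataset, the classifier and the parameter being tuned are all assumptions, not something prescribed by the text): tune on the validation set, estimate performance once on the test set, then retrain on everything with the chosen parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.rand(1000, 5)                     # placeholder features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # placeholder labels

# 60% training, 20% validation, 20% test (the proportions are just an example)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a parameter on the validation set...
best_depth, best_acc = None, 0.0
for depth in (1, 2, 3, 5, 8):
    acc = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# ...estimate performance once on the test set...
final_estimate = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train).score(X_test, y_test)

# ...and only then retrain on all the data with the chosen parameters.
deployed_model = DecisionTreeClassifier(max_depth=best_depth).fit(X, y)
print(best_depth, round(final_estimate, 3))
```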

Predicting performance

Consider a dataset of 10000 samples and one of 100: we immediately realize that the error rate we obtain, even if equal in absolute value, is not equivalent from the statistical point of view. Suppose we have an error rate of 25%; in reality, rather than the error rate we usually talk about the success rate, here 75%. Now, this is only an estimate. What can be said about the true success rate on the target population? Of course, it is expected to be close to 75%. But how close: within 5% or 10%? This is a statistical problem!

In statistics, a succession of independent events that either succeed or fail is called a Bernoulli process.
The classic example is coin tossing. Each toss is an independent event. Let’s say we always predict heads; but rather than “heads” or “tails,” each toss is considered a “success” or a “failure.” Let’s say the coin is biased, but we don’t know what the probability of heads is. Then, if we actually toss the coin 100 times and 75 of the tosses are heads, we have a situation very like the one just described for a classifier with an observed 75% success rate on a test set. What can we say about the true success probability? In other words, imagine that there is a Bernoulli process — a biased coin — with a true (but unknown) success rate of p. Suppose that out of N trials, S are successes; thus, the observed success rate is f = S/N. The question is, what does this tell you about the true success rate p?
The answer to this question is usually expressed as a confidence interval that is, p lies within a certain specified interval with a certain specified confidence. For example, if S = 750 successes are observed out of N = 1000 trials, this indicates that the true success rate must be around 75%. But how close to 75%? It turns out that with 80% confidence, the true success rate p lies between 73.2% and 76.7%. If S = 75 successes are observed out of N = 100 trials, this also indicates that the true success rate must be around 75%. But the experiment is smaller, and so the 80% confidence interval for p is wider, stretching from 69.1 to 80.1%.

Let's talk about confidence intervals in a simpler way. The same measured performance can have different interpretations depending on the set it was measured on. For example: with 1000 elements and 750 correct classifications the success rate (classification rate) is 75%, exactly the same as in the case N = 100 and S = 75!
From the point of view of the error rate the result is always the same. But with confidence intervals (the statistical approach), saying that the probability that the true percentage of correct classification lies in a given interval is 80% means exactly that: with 80% probability the real success rate falls in that interval. The difference between the two cases is that with 1000 samples the interval is narrower around the value we measured. Under the hypothesis of Gaussian behavior of the classifier, in the worst case the performance can go down to about 73%, and in the best case it reaches about 77%. If the estimate is made on a smaller number of samples, the confidence interval widens and the estimate is less robust.
The more instances we use to obtain the estimate the better, because the value we get is more robust. Normally confidence intervals are expressed at 90% or 95%. The smaller the number of samples behind the estimate, the lower its robustness (reliability).
If we have an 80% confidence interval, it means that 1 time out of 5 the true value can end up outside the interval, so 4 times out of 5 it is inside; that is why 90% or 95% confidence intervals are usually chosen, so that statistically the true value ends up outside the interval only 1 time out of 10, or 1 out of 20.

A single Bernoulli trial with success probability p has mean p and variance p(1 − p). If N trials are taken from a Bernoulli process, the expected success rate f = S/N is a random variable with the same mean p, while the variance is reduced by a factor of N to p(1 − p)/N. For large N, the distribution of this random variable approaches the normal distribution. These are standard statistical facts and, for simplicity, we will not go into the underlying theory here.
The probability that a random variable X with zero mean lies within a symmetric range of width 2z is P[−z ≤ X ≤ z] = c. The c and z values for a normal distribution are tabulated, with the values already calculated, for example in the standard normal distribution table [mathisfun]. From such a table we read off the value of z; in fact many tables give the confidence that X is out of range. With a symmetric distribution, P[−z ≤ X ≤ z] = 1 − 2 × P[X ≥ z], so the tables may give only the probability P[X ≥ z]; this is called a one-tailed probability because it refers only to the upper "tail" of the distribution. Normal distributions are symmetric, so the probability for the lower tail, P[X ≤ −z], is just the same.

Example 1

Find the percentage of the population between 0 and 0.47 standard deviations from the mean.

Start at the row for 0.4 and read along to the column for 0.07, giving z = 0.47. The value we find is 0.1808, so 18.08% of the population lies between 0 and 0.47 standard deviations from the mean.

Example 2

Consider the following table of one-tailed probabilities for the standard normal distribution (the standard values used in the examples below):

    P[X ≥ z]    z
    0.1%        3.09
    0.5%        2.58
    1%          2.33
    5%          1.65
    10%         1.28
    20%         0.84
    40%         0.25

This table assumes that the random variable X has a mean of 0 and a variance of 1. The probability that X lies more than 1.65 standard deviations above the mean is 5%; mathematically, P[X ≥ 1.65] = 0.05, or 5%. So the probability that a value is above 1.65 is 5%.

Note: 4.95% has been approximated to 5%.

Because the distribution is symmetric, the chance that X lies more than 1.65 standard deviations from the mean (above or below) is 10%, or
P[−1.65 ≤ X ≤ 1.65] = 90%.
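For readers who prefer to check the table values programmatically, here is a quick sketch with SciPy (the article itself relies on printed tables; the code is just a convenience):

```python
from scipy.stats import norm

print(norm.cdf(0.47) - norm.cdf(0.0))    # ~0.1808 -> Example 1
print(norm.sf(1.65))                     # ~0.0495 -> P[X >= 1.65], rounded to 5%
print(norm.cdf(1.65) - norm.cdf(-1.65))  # ~0.901  -> P[-1.65 <= X <= 1.65]
print(norm.isf(0.10))                    # ~1.2816 -> the z used later for c = 80%
```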

There are formulas that give, in closed form, the confidence interval once the success rate f, the number of test samples N and the desired confidence c (the probability that the true value falls in the interval) are known.
If I want the confidence interval in which my classification result falls in 80% of cases, I choose c = 80% (which means being outside the interval in 10% of cases on the right and 10% on the left); from the table I obtain z = 1.28. These values, together with the measured success rate and the number of test samples, are plugged into formula (2.2) below. This gives the interval in which, with 80% confidence, the true classification result lies, given the percentage I measured and the number of samples used for the test.

Some theory

Now we have to transform the random variable f = S/N so that it has zero mean and unit variance. We do this by subtracting the mean p and dividing by the standard deviation:

√( p(1 − p)/N )   (1.1)

This leads to:

P[ −z ≤ (f − p) / √( p(1 − p)/N ) ≤ z ] = c   (2.2)

Here is the procedure for finding confidence limits. Given a particular confidence figure c, consult confidence_limits_normal_distribution table for the corresponding z value. To use the table you will first have to subtract c from 1 and then halve the result, so that for c = 90% you use the table entry for 5%. Linear interpolation can be used for intermediate confidence levels. Then write the inequality in the preceding expression as an equality and invert it to find an expression for p.
The final step involves solving a quadratic equation. Although this is not hard to do, it leads to an unpleasantly formidable expression for the confidence limits:

p = ( f + z²/2N ± z·√( f/N − f²/N + z²/4N² ) ) / ( 1 + z²/N )   (3.3)

The ± in this expression gives two values for p that represent the upper and lower confidence boundaries. Although the formula looks complicated, it is not hard to work out in particular cases.
This result can be used to obtain the values in the numeric example given earlier.

Setting f = 75%, N = 1000, and c = 80% (so that z = 1.28) leads to the interval [0.732, 0.767] for p, and N = 100 leads to [0.691, 0.801] for the same level of confidence. Note that the normal distribution assumption is only valid for large N (say, N > 100). Thus, f = 75% and N = 10 leads to confidence limits [0.549, 0.881], but these should be taken with a grain of salt.
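Here is a minimal sketch of formula (3.3) in code, reproducing the numeric examples above (f = 75% with N = 1000, 100 and 10, using z = 1.28 for c = 80%):

```python
from math import sqrt

def confidence_interval(f, N, z):
    # formula (3.3): lower and upper limits for the true success rate p
    centre = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (centre - spread) / denom, (centre + spread) / denom

for N in (1000, 100, 10):
    low, high = confidence_interval(0.75, N, 1.28)
    print(N, round(low, 3), round(high, 3))
# 1000 -> (0.732, 0.767), 100 -> (0.691, 0.801), 10 -> (0.549, 0.881)
```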

Holdout

The holdout method reserves a certain amount of data for testing and uses the remainder for training: typically one third for testing and the rest for training. A practical problem is that the split we make may not represent the various classes well; for example, a class might be missing entirely from the test data.

When we build the training set, especially when there are imbalances between classes, we use a stratification approach: we ensure that each class is represented (in the training, validation and test sets) with approximately the same proportions as in the overall dataset. For example, if we have a dataset where one class accounts for 80% of the samples and another for 20%, and we use 60% of the dataset as the training set but take it only from the samples of a single class, the classifier will be trained on one class only, and that is clearly wrong.

With a holdout approach (a split into training and test), a more robust estimate is obtained by making repeated extractions: we make several splits and then average the results (repeated holdout). For example, with 1/3 for the test and 2/3 for the training, instead of a single extraction we make several and average the results. But the different test sets could overlap: if the 1/3–2/3 split is drawn at random, there is no guarantee that the samples of successive splits will not overlap.
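A minimal sketch of repeated stratified holdout (scikit-learn with placeholder imbalanced data; the dataset and the classifier are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = (rng.rand(500) < 0.2).astype(int)     # imbalanced classes: roughly 80% / 20%

# 10 random 2/3-1/3 splits, each preserving the class proportions
splitter = StratifiedShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 10 holdouts:", np.mean(scores))
```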

k-fold cross-validation

One possibility is to use cross-validation, which by definition avoids overlapping test sets. It consists of the following steps:

  1. Divide the data into k subsets of equal size (if the number of samples is not exactly divisible the division cannot be perfectly even, so "equal size" should be understood approximately);
  2. Perform k trainings, using the subsets in turn for training and testing in an iterative way. For example, if the dataset is divided into 5 subsets, I run 5 different trainings: in each one I take 4 subsets for training and the remaining one for testing. Doing it 5 times I get different training and test sets; seen as a whole, the test data is as large as possible and the test sets do not overlap. In general it is advisable for the initially generated subsets to be stratified.

A good value is k = 10; the error is estimated as the average over the whole process (in this case averaging the results of the 10 tests). The value 10 has proved good both in practical experiments and in theoretical studies on the matter. Cross-validation can be repeated several times, averaging the results, which gives an even more robust estimate.

Note: Extensive tests on numerous different datasets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error. Stratification reduces the variation, but it certainly does not eliminate it entirely.

When seeking an accurate error estimate, it is standard procedure to repeat the cross-validation process 10 times (that is, 10 times tenfold cross-validation) and average the results. This involves invoking the learning algorithm 100 times on datasets that are all nine-tenths the size of the original. Getting a good measure of performance is a computation-intensive undertaking.
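A minimal sketch of 10 times tenfold cross-validation (scikit-learn with placeholder data; the dataset and the classifier are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(1)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)   # 100 accuracy values in total
print("estimated accuracy:", scores.mean(), "+/-", scores.std())
```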

Leave One Out

The cross-validation approach can be taken to the extreme with leave-one-out, where the number of folds equals the number of training instances. For example, with 1000 samples I could do a 1000-fold cross-validation, in theory training my classifier 1000 times. The advantage is that the data is used in the best possible way and the test sets never overlap, since the selection is deterministic; but it is extremely expensive from a computational point of view. Nevertheless, leave-one-out seems to offer a chance of squeezing the maximum out of a small dataset and getting as accurate an estimate as possible.

Stratification is not possible: the test set contains a single sample, which belongs to one class, so the test sets cannot be stratified, and this can lead to a higher error than normal; still, the approach is convenient with quite small datasets, when we want robust results. An extreme case: consider a completely random dataset split evenly between two classes. In each leave-one-out round the training set contains, say, 50 samples of one class and 49 of the other, and the single test sample belongs to the class with 49; a learner that predicts the majority class therefore gets every test sample wrong, giving an estimated error of 100%, even though the best achievable accuracy on random data is 50%.
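The pathological case just described can be reproduced with a short sketch (scikit-learn; the random balanced dataset and the majority-class predictor are the assumptions of the example):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = np.array([0, 1] * 50)                 # 50 samples per class, labels are pure noise

# A predictor that always outputs the majority class of its training data:
# its true error rate on random data is 50%, yet leave-one-out estimates 100%.
majority = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(majority, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())   # 0.0, i.e. estimated error = 100%
```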

Bootstrap

Another technique that can be used to evaluate performance is the bootstrap. Unlike cross-validation, where once a sample has been selected it cannot be selected again, the bootstrap samples the dataset with replacement to build the training set, so the same sample can be inserted more than once. We sample a dataset of N instances N times, with replacement, to form a new dataset of N instances: the training set has the same number of instances as the original dataset, some of them repeated, and the instances of the original set that were never drawn form the test set. For example, from a dataset of 1000 samples I extract samples at random, possibly taking an already extracted sample again, and create a new set of 1000 elements, some of which appear more than once. Some training algorithms may not take into account that instances appear more than once; the important thing is to have a subset of data that was never sampled, and is therefore disjoint from the generated training set, to use as the test set.
Note: Mathematical sticklers will notice that we should not really be talking about “sets” at all if the same object can appear more than once.

Although the training set is built from N draws, it contains only about 63% of the distinct instances of the dataset (the probability that a particular instance is never drawn is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.368), which is still less than the 90% of the samples used for training in a 10-fold cross-validation.
For this reason the estimate made on the test data alone would be quite pessimistic, so the error we measure in the case of the bootstrap is a weighted combination of the error on the test set and the error on the training set.
To compensate for this, we combine the test-set error rate with the resubstitution error on the instances in the training set. The resubstitution figure, as we warned earlier, gives a very optimistic estimate of the true error and should certainly not be used as an error figure on its own. But the bootstrap procedure combines it with the test error rate to give a final estimate e as follows:

e = 0.632 × error_on_test_instances + 0.368 × error_on_training_instances
The procedure can be repeated several times, averaging the results, to obtain a more robust estimate. It also compares well with leave-one-out in computational terms: a single bootstrap requires just one learning run, and averaging over ten repetitions requires only ten, whereas leave-one-out with a thousand samples requires a thousand learning runs! So the bootstrap also performs well in terms of computational complexity, and it is probably the best way of estimating performance for very small datasets.
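A minimal sketch of the 0.632 bootstrap estimate (NumPy and scikit-learn with placeholder data; the dataset and the classifier are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 2] > 1).astype(int)
N = len(X)

estimates = []
for _ in range(10):                                   # repeat and average
    train_idx = rng.randint(0, N, size=N)             # N draws with replacement
    test_idx = np.setdiff1d(np.arange(N), train_idx)  # never-drawn instances form the test set
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    err_test = 1 - model.score(X[test_idx], y[test_idx])
    err_train = 1 - model.score(X[train_idx], y[train_idx])
    estimates.append(0.632 * err_test + 0.368 * err_train)

print("0.632 bootstrap error estimate:", np.mean(estimates))
```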

The bootstrap procedure may be the best way of estimating the error rate for very small datasets. However, like leave-one-out cross-validation, it has disadvantages that can be illustrated by a special, artificial situation. In fact, the very dataset we considered above will do: a completely random dataset with two classes of equal size. The true error rate is 50% for any prediction rule. But a scheme that memorized the training set would give a perfect resubstitution score of 100%, so the error on the training instances is 0, and the 0.632 bootstrap mixes this in with a weight of 0.368 to give an overall error rate of only 31.6% (0.632 × 50% + 0.368 × 0%), which is misleadingly optimistic.

Comparing Machine Learning models

One possibility is to compare estimates obtained by cross-validation, in particular tenfold cross-validation, which is often a reasonable solution. Because the amount of training data naturally affects performance, all datasets should be the same size. Indeed, the experiment might be repeated with different sizes to obtain a learning curve.

Machine learning research needs to show convincingly that a particular method works better than another!

If I wanted to demonstrate that scheme A is better than scheme B in a certain domain, strictly speaking I would have to do it over all the possible training sets I could have; in practice, one possibility is to use a statistical approach, the Student's t-test, to verify whether the mean cross-validation result of scheme A is better than that of scheme B.
The approach is statistical and similar to that of the confidence intervals seen before, with slightly different calculations. I compute the confidence intervals of the two results: if they overlap, the two schemes are not really different from each other; if they do not overlap, they are different. As we have seen, confidence intervals depend on the probability with which we want to credit them; we set it to 90% or 95%, meaning that 19 times out of 20 the real value of the classifier will be in that interval. So if I compute the confidence intervals of the two classifiers around their two means and the intervals are disjoint, the two classifiers behave in different ways: the two schemes are statistically different, i.e. the difference I measure is statistically significant at the chosen probability level. If the intervals overlap, there are a number of cases in which the measured difference is only apparent and the classifiers are in fact similar or behave in the same way.

T test - 10 fold cross-validation

Suppose you have trained two ML algorithms and you want to find out which one performs better. We perform tenfold cross-validation 10 times on both models, taking care to feed the same folds to the two ML models. Algorithm "A" will output x1, x2, …, xk, while algorithm "B" will output y1, y2, …, yk. For example, after dividing the dataset into 10 subsets, to get x1 and y1 we use the first 9 subsets for training and the tenth for testing, while to get xk and yk we use the last 9 subsets for training and the first subset for testing.

If there are enough samples, the mean x̄ of a set of independent samples (x1, x2, …, xk) has a normal (Gaussian) distribution, regardless of the distribution underlying the samples themselves. We call μ the true mean, the mean of the population our samples come from (we will see these concepts in more detail in another article). If we knew the variance of that normal distribution, we could standardize x̄ to zero mean and unit variance and obtain confidence limits on μ given the sample mean x̄. However, the variance is unknown, and the only way to obtain it is to estimate it from the set of samples.
Estimating the variance only from our samples, and not from the whole population, introduces additional uncertainty. Because the variance is only an estimate, the standardized statistic does not have a normal distribution (although it tends to one for large values of k). The distribution we get instead is Student's distribution with k − 1 degrees of freedom, where k is the number of cross-validation results we have collected. As a consequence we can no longer use the confidence_limits_normal_distribution table, but must use a table of confidence intervals for Student's distribution (confidence_limits_student_distribution_9_degrees_of_freedom). If we compare the two tables for a given degree of confidence, the Student range is slightly wider, and this reflects the additional uncertainty caused by the need to estimate the variance. Different tables are needed for different degrees of freedom, and if there are more than 100 degrees of freedom the confidence limits are very close to those of the normal distribution. In our case we have a table referring to 9 degrees of freedom (k − 1). Like the table for the normal distribution, the Student table refers to a "one-sided" confidence interval.
Consider the differences di = xi − yi, making sure that the observations are paired (same folds for both schemes). The mean of this difference, d̄ = x̄ − ȳ, is just the difference between the two means and, like the means themselves, it has a Student's distribution with k − 1 degrees of freedom. If the true means are the same, the difference is 0 (in statistics this is called the null hypothesis); if they are significantly different, the difference will be significantly different from zero. So, for a given confidence level, we check whether the actual difference exceeds the confidence limit. First, reduce the difference to a zero-mean, unit-variance variable called the t-statistic:

t = d̄ / √( σd² / k )

where σd² is the variance of the difference samples. Suppose we choose a significance level of 1%. From this I get z from the table confidence_limits_student_distribution_9_degrees_of_freedom if k equals 10. A two-tailed test is appropriate because we do not know in advance whether the mean of the x's is likely to be greater than that of the y's or vice versa; thus, for a 1% test we use the value corresponding to 0.5%. If the value of t according to the formula above is greater than z, or less than −z, we reject the null hypothesis that the means are the same and conclude that there really is a significant difference between the two learning methods on that domain for that dataset size.
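A minimal sketch of the paired t-test just described, applied to two hypothetical lists of k = 10 cross-validation accuracies (the numbers are invented for illustration):

```python
import numpy as np
from scipy import stats

acc_A = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79])
acc_B = np.array([0.76, 0.78, 0.80, 0.75, 0.79, 0.77, 0.78, 0.74, 0.81, 0.77])

d = acc_A - acc_B                                  # paired differences, fold by fold
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)          # the t-statistic, k-1 degrees of freedom
z = stats.t.ppf(1 - 0.005, df=k - 1)               # two-tailed test at the 1% level
print("t =", round(t, 2), " critical value =", round(z, 2), " significant:", abs(t) > z)

# SciPy provides the same paired test directly:
print(stats.ttest_rel(acc_A, acc_B))
```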

Predicting probabilities

So far we have used as an estimate the success rate, which corresponds to the so-called 0–1 loss function.
It is a binary loss function: if the prediction is right the loss is 0, if it is wrong the loss is 1. If the classifier gives us a probability estimate, this loss function is not really ideal.
Note: it is called a loss function, but we can also see it as a gain function, in the sense that a correct classification gains a point and a wrong one gains 0.
Many classifiers classify a tuple and also tell us how certain they are, with a probability. When the classifier gives a probability measure we can make a better estimate of the performance. Making a mistake when it claims to be 99% sure is one thing; making it when it is only 51% sure is another. A mistake made at 99% certainty should contribute more to the error; a mistake made at 51% certainty should contribute less to the overall error figure.

Quadratic Loss Function

Let's suppose our "probabilistic" classifier can output, for each of the k possible classes, the probabilities p1, …, pk, where pi = P[class = i]. The actual outcome is expressed by the vector a1, …, ac, …, ak over the k classes, where ac = 1 for the true class c and all the other values are 0. We can express the penalty associated with this situation as a loss function that depends on both the p vector and the a vector. One criterion that is frequently used to evaluate probabilistic prediction is the quadratic loss function:

Σj (pj − aj)²

If we denote by pj* = P[class = j] the true probability that the instance belongs to class j, we can reason about the expected value of the loss:

E[ Σj (pj − aj)² ] = Σj ( (pj − pj*)² + pj*(1 − pj*) )

The final goal is to minimize this expression, i.e. the error the classifier makes, and this happens when pj = pj*, that is, when our classifier predicts the true probabilities; during training the classifier will strive to achieve this equality. So, given a tuple whose correct class is the i-th, the classifier will try to minimize

(1 − pi)² + Σj≠i pj²

in other words, the squared differences between the estimated probabilities and the actual outcome. For example, if we have 3 classes and for a certain tuple the expected class is the first, we have a = [1, 0, 0]; our classifier might give p = [0.6, 0.1, 0.3] and, in order to minimize that formula, at the next training iteration it will move towards something like p = [0.8, 0.1, 0.1], because it "understands" that the correct classification is class 1. This is why, for classifiers that return probabilities, the quadratic loss function is a metric to minimize: we can see it as an evolution of the 0–1 loss function. In the present context, the quadratic loss function also forces the predictor to be honest about choosing its best estimate of the probabilities. Moreover, it has some useful theoretical properties that we will not go into here. For all these reasons, it is frequently used as the criterion of success in probabilistic prediction situations.
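A minimal sketch of the quadratic loss for the 3-class example above (the probability vectors are the hypothetical numbers used in the text):

```python
import numpy as np

def quadratic_loss(p, a):
    # sum over the classes of (predicted probability - actual 0/1 outcome)^2
    return np.sum((np.asarray(p) - np.asarray(a)) ** 2)

a = [1, 0, 0]                               # the true class is the first one
print(quadratic_loss([0.6, 0.1, 0.3], a))   # 0.26
print(quadratic_loss([0.8, 0.1, 0.1], a))   # 0.06: the sharper prediction has a lower loss
print(quadratic_loss([1.0, 0.0, 0.0], a))   # 0.0, the ideal prediction
```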

Informational Loss Function

Another popular criterion used to evaluate probabilistic prediction is the informational loss function

−log₂ pi ,

where the i-th prediction is the correct one. It represents the information (in bits) required to express the actual class i with respect to the probability distribution p1, p2, …, pk. In other words, if you were given the probability distribution and someone had to communicate to you which class was the one that actually occurred, this is the number of bits they would need to encode the information if they did it as effectively as possible.

The expected value of the informational loss function, if the true probabilities are p1*, p2*, …, pk*, is

−( p1* log₂ p1 + p2* log₂ p2 + … + pk* log₂ pk )

Like the quadratic loss function, this expression is minimized by choosing pj = pj*, in which case it becomes the entropy of the true distribution:

−( p1* log₂ p1* + p2* log₂ p2* + … + pk* log₂ pk* )

Thus, the informational loss function also rewards honesty in predictors that know the true probabilities, and encourages predictors that do not to put forward their best guess.

One problem with the informational loss function is that if you assign a probability of 0 to an event that actually occurs, the loss is infinite (the zero-frequency problem). Prudent predictors operating under the informational loss function therefore do not assign zero probability to any outcome.
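A minimal sketch of the informational loss on the same hypothetical 3-class example, including the zero-frequency problem:

```python
import math

def informational_loss(p, true_class):
    # bits needed to encode the actual class under the predicted distribution
    prob = p[true_class]
    return float("inf") if prob == 0 else -math.log2(prob)

print(informational_loss([0.6, 0.1, 0.3], 0))   # ~0.74 bits
print(informational_loss([0.8, 0.1, 0.1], 0))   # ~0.32 bits
print(informational_loss([0.0, 0.7, 0.3], 0))   # inf: the zero-frequency problem
```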

Classification cost

What we have seen so far brings us back to the fact that different types of classification error can have different costs. It could be much worse to grant a loan to someone who does not pay it back than to refuse a loan to someone who would have paid it back. So, in practice, different types of classification errors often incur different costs.

Confusion matrix

Different Outcomes of a Two-Class Prediction (CM1)

Different costs must be applied to classification errors. To present the errors a classifier makes we can use the confusion matrix, where the actual class is on the rows and the predicted class is on the columns:

  • True positive: sample of class yes attributed to class yes (correct);
  • True negative: sample of class no attributed to class no (correct);
  • False negative: sample of class yes wrongly attributed to class no;
  • False positive: sample of class no wrongly attributed to class yes.

From the terminological point of view, a false negative is a "missed detection", while a false positive is a "false alarm". A single prediction falls into exactly one of these four cases. To have good results, most or all of the samples must lie on the diagonal, with few, or better still none, off the diagonal.

Consider two confusion matrices for a 3-class problem, where we have actual predictor on the left and random predictor on the right:

CM2

In the case of n classes, the percentage of correct classification is given by the sum of the elements on the diagonal; if these are absolute numbers, they must be divided by the total number of samples. For example we have (88 + 40 + 12)/200 = 0.7. This is the quantity referred to earlier as the success rate (f).

Kappa statistic

Kappa statistic (D=diagonal)

The Kappa statistic measures the difference between the correct percentage of the real classifier and the one obtained by a random classifier that assigns to each class the same number of elements, respecting the same proportions. It tells us how far our real classifier moves away from the random one towards the ideal one (which attributes all the samples correctly). In the example, the classifier predicts 120 samples as class A, 60 as class B and 20 as class C. The matrix on the right represents the behavior of a classifier that, from the outside, behaves like ours because the number of samples predicted per class is the same (last line of the matrix), but assigns them randomly while respecting those proportions. The Kappa statistic is another indication of how close we get to correct classification. Let's make a numerical example with the two confusion matrices seen before (CM2):
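A minimal sketch of the Kappa computation for a confusion matrix consistent with the figures quoted above (diagonal 88/40/12, predicted totals 120/60/20, 200 samples in total); the off-diagonal entries are assumptions for illustration:

```python
import numpy as np

cm = np.array([[88, 10,  2],
               [14, 40,  6],
               [18, 10, 12]])             # rows = actual class, columns = predicted class

total = cm.sum()
observed = np.trace(cm)                   # 140 correct predictions -> 70%
row_totals = cm.sum(axis=1)               # actual class sizes
col_totals = cm.sum(axis=0)               # predicted class sizes: 120, 60, 20
expected = np.sum(row_totals * col_totals) / total   # diagonal of the random predictor: 82

kappa = (observed - expected) / (total - expected)
print(round(kappa, 3))                    # ~0.492: roughly halfway between random and ideal
```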

Classification with costs

Default Cost Matrices: (CM3) Two-Class Case and (CM4) Three-Class Case

We have shown matrices with two and three classes, but in general we can have cost matrices with N classes. Until now I have assumed that every error weighs 1: on the diagonal I have the correctly classified instances, with cost 0, and all errors weigh 1. In cost-sensitive classification the values outside the diagonal differ according to the classes involved: I have a real matrix of costs that reflects the "physical" meaning of each error, so values other than 0 and 1 can appear.
In practice, I can operate in two ways:

  • Take the costs into account during the classification;
  • Classify in the normal way and then evaluate the performance taking the costs into account.

In the first case I can modify the classification algorithm to take the costs into account, tuning them properly during training and/or changing the distribution of the samples so that the classifier sees the higher-cost samples more often. Alternatively, the cost matrix can be used only to evaluate performance (cost-sensitive performance evaluation).

Cost-sensitive learning

Let's explain cost-sensitive learning in other terms. A simple way to obtain it is to resample the instances according to cost, or to use classifiers that allow the individual instances to be weighted. In some classification schemes, such as the Bayesian one, I can add, besides the a priori probability, another factor representing the cost associated with a certain type of error.
As already said, one possibility is to insert the costs directly into the learning algorithm. Apart from the schemes that already take them into account, I could resample the instances with respect to the costs: if one error is more expensive than another, I try to force the system away from that type of error by modifying the proportion of the instances. For example, if I have two classes with 50 samples each, I oversample the class for which errors are more expensive.
Or I can weight the different instances, if the algorithm we use allows it. In the case of the Naive Bayes algorithm the cost can be entered as an additional factor.
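A minimal sketch of cost-sensitive prediction (hypothetical numbers): given the class probabilities produced by any probabilistic classifier and a cost matrix, choose the class that minimizes the expected cost rather than the most probable one.

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the actual class is i
cost = np.array([[0.0, 1.0],    # actual "no"
                 [10.0, 0.0]])  # actual "yes": missing a "yes" is 10 times worse

p = np.array([0.7, 0.3])        # predicted probabilities for ("no", "yes")

expected_cost = p @ cost        # expected cost of predicting "no" vs "yes"
print(expected_cost)            # [3.0, 0.7]
print("prediction:", ["no", "yes"][int(np.argmin(expected_cost))])   # "yes", despite p(no) > p(yes)
```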

Lift charts

Very often the actual costs are not known, so we need a tool that allows us to consider different possible scenarios, where each scenario is a different cost assignment. Example: I run a marketing company and can potentially send emails to one million people (customers).

If I send to everybody, I know (perhaps from previous research) that statistically I will get 0.1% of answers (1000 answers), so I already have 1000 respondents.

However, sending these communications has a cost of its own, so my DM tool should highlight, among all the possible recipients, those with the highest probability of responding to my commercial invitation. If the DM tool can identify the most promising 100000 people, they will respond at a higher rate of 0.4%, i.e. 400 people. Clearly 400 people are fewer than 1000, but I have also reduced my mailing to 1/10: if I pay for these mailings, it is like saving 1000000 − 100000 = 900000 equivalent "stamps". If I identify the subset of the 400000 most promising, they answer better than the whole population, at 0.2%, twice the rate I would get by sending the email to everybody, for a total of 800 answers. We have now more than cut the number of mailings in half, and the number of people responding is very similar to before. The lift chart allows a visual comparison of performance in the different scenarios and lets you make a marketing choice. In marketing terminology, the increase in response rate, a factor of 4 in this case, is known as the lift factor yielded by the learning tool. If you knew the costs, you could determine the payoff implied by a particular lift factor. The term lift comes from the fact that it measures the "lift", i.e. the multiplicative factor that I can obtain when I select a subset.

Given a learning scheme that outputs probabilities for the predicted class of each member of the set of test instances (as Naïve Bayes does), your job is to find subsets of test instances that have a high proportion of positive instances, higher than in the test set as a whole. To do this, the instances should be sorted in descending order of predicted probability of yes. Then, to find a sample of a given size with the greatest possible proportion of positive instances, just read the requisite number of instances off the list, starting at the top. So, sort instances according to predicted probability of being positive:

If I can order my samples by their probability of success, I can build the lift chart (Lift Chart 1). The choice to make always depends on the revenue per response and the cost of sending: for example, if the cost of sending is zero then I might as well send the email to everyone. The advantage of the lift chart is that the costs do not have to be fixed in advance; on the contrary, by varying the costs I can decide how to act.
What hypothesis is it based on? The fact that I can order the contacts according to the probability that they are positive. For example, if a Bayesian classifier is available, I order the instances according to that probability and, knowing the actual class of the samples, I can build this chart: as the selected population increases I count the number of correct answers. On the x axis I have the instances considered (the percentage, in our example), while on the y axis we have the true positives, the number of samples correctly classified for the class "yes", i.e. in the example the people who actually respond.
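A minimal sketch of how the lift-chart points are built (the scores and labels are placeholders, so the numbers will not match the 400/800/1000 of the example): sort by predicted probability of "yes" and count the cumulative positives as the selected sample grows.

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.rand(1000)                                 # predicted probability of responding
labels = (rng.rand(1000) < scores * 0.2).astype(int)    # placeholder ground truth

order = np.argsort(-scores)                 # descending predicted probability
cum_positives = np.cumsum(labels[order])    # y axis of the lift chart: responses obtained
# the x axis is the fraction of the mailout: np.arange(1, len(scores) + 1) / len(scores)

for frac in (0.1, 0.4, 1.0):                # compare with the points A, B, C discussed below
    idx = int(frac * len(scores)) - 1
    print(f"{frac:.0%} of the mailout -> {cum_positives[idx]} responses")
```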

Consider the point A, B, C in the graph:

  • A: if I order my customers properly, with 10% of the cost (since I pay the mailing only for 10% of the recipients, i.e. 100000 customers) I get a response rate of (400/100000)×100 = 0.4%, i.e. 400 answers → lift = 4;
  • B: with 40% of my samples (400000 customers) the response rate is (800/400000)×100 = 0.2%, so the number of people who respond is 800 → lift = 2;
  • C: if I consider 100% of my sample, 1000 people answer: (1000/1000000)×100 = 0.1% of responses, as we said in the example; here we can see it graphically.

The horizontal axis shows the sample size as a proportion of the total possible mailout. The vertical axis shows the number of responses obtained. The lower left and upper right points correspond to no mailout at all, with a response of 0, and a full mailout, with a response of 1000. If you knew the different costs involved, you could work them out for each sample size and choose the most profitable. But a graphical depiction of the various possibilities will often be far more revealing than presenting a single "optimal" decision. Repeating the operation for different-size samples allows you to plot a lift chart like that of figure (Lift Chart 1). The diagonal line gives the expected result for different-size random samples.

Making evaluations based on cost is very important: if sending information is very expensive, it is better to position myself at the 10% point of the curve; if over time I discover a less expensive method of sending, I may move to a point further to the right (e.g. 40%); and if I discover that the cost is 0, I position myself at 100% sample size.
Lift concept: in the basic case 0.1% of people answer, i.e. (1000000/100)×0.1 = 1000; selecting the 400000 most promising gives a lift equal to 2 (twice as many people answer per contact), and selecting the 100000 most promising gives a lift equal to 4 (four times as many per contact). The chart takes the lift into account and gives me an idea of how much I can improve by considering a subset.

My ideal curve is a step: if I could order all the samples in such a way that all the classifications are correct, I would have an ideal system, and this corresponds to a curve that rises immediately. In practice, since I order the samples by the probability that they give me a "yes", if the first 1000 I meet are all "yes" I have an ideal system that picks all the best recipients first, and the curve rises immediately. In reality, the closer the curve is to the top-left corner the better: my DM system has to go in that direction.
The bisector is obtained when the sample is chosen randomly (a random system), which gives a uniform distribution of positives and negatives. In this case, as the size of my sample increases, the response rate stays at 0.1%: at 40% I have 400 answers, at 60% I have 600, so the diagonal represents a completely random result, the same I get when I send the email to everyone without any ordering.
If instead the curve I generate is very close to the bisector (random choice), my system is not of much use.
What hypothesis is it based on? Being able to order the instances by their probability of being positive. Besides evaluating the costs in general, I should also evaluate the cost of my Data Mining technique: if building the chart requires a disproportionate cost, higher than my gain, in theory I could or should decide not to adopt it.

ROC Curves

ROC curves are used in the situation where you are trying to select samples of test instances that have a high percentage of positives. They are related to the lift chart used in marketing, where we try to maximize the true positives of the confusion matrix. The acronym stands for Receiver Operating Characteristic, a term used in signal detection to characterize the tradeoff between hit rate and false-alarm rate over a noisy channel. The ROC curve does not take into account the class distribution (i.e. whether there are more instances of one class than another), nor the error costs seen in the paragraph "Classification with costs". It differs from the lift chart in that:

  • the y axis shows the percentage of true positives with respect to the sample (TP Rate), instead of an absolute number (Number of respondents);
  • the x axis shows the percentage of false positives with respect to the sample (FP Rate), instead of showing the sample size.
(TP Rate = TP / (TP + FN), with TP = true positives and FN = false negatives)
(FP Rate = FP / (FP + TN), with FP = false positives and TN = true negatives)

Comparing the y axis with the lift chart, we have almost the same thing expressed as a percentage. The x axis is very different because it counts false positives (FP); but if in our problem the proportion of positives is very low, for example 0.1%, the difference between the size of a sample and the number of negatives it contains is negligible. As with the lift chart, the goal is to get close to the top-left corner.
The figure "ROC Curve 1" shows a jagged line built with the data from the table "data_for_lift_chart": go up two positives, then one step to the right for a negative, then up five positives, and so on. We proceed in this way through all the available test set until we reach the top-right point.

ROC Curve 1

From the way the ROC curve is constructed, we realize that the jagged line depends on the particular sample of test data. To mitigate this we can apply cross-validation: smoother ROC curves can be obtained (assuming my classifier gives a result in terms of probability) by splitting the set into n folds and then taking the results in descending order of probability, so as to build a reasonably smoother curve. I could even generate one ROC curve per fold and then obtain an overall result by averaging the curves. To do this I collect the probability for each test instance and order the instances according to it.
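A minimal sketch of building a ROC curve and its area from predicted probabilities (scikit-learn with placeholder data; the dataset and the classifier are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = (X[:, 0] + 0.3 * rng.randn(1000) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # P(class = "yes")

fpr, tpr, thresholds = roc_curve(y_te, proba)    # x axis: FP rate, y axis: TP rate
print("AUC:", roc_auc_score(y_te, proba))        # 1.0 = ideal, 0.5 = random
```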

ROC Curve 2

Remembering the similarity between lift chart and ROC curve, we can say that in the example of figure "ROC Curve 2" the ML scheme "A" excels if you create a small, focused sample, as you can see in the left part of the graph, while if you want to cover more samples the choice falls on the "B" scheme. To cover about 40% of the true positives you should choose method A, which gives a low false-positive rate of 5%, whereas for such small samples method B gives more than 20% false positives. As already said, method B excels for large samples: there we have a coverage of 80% of the true positives with a false-positive rate of 60%.
It is also possible to always move on the convex hull. To do this you can adopt a combined scheme: let tA, fA be the true positive rate and false positive rate of scheme A, and tB, fB those of scheme B, and let q be the fraction of times in which scheme A is used to predict a sample and 1 − q the fraction in which scheme B is used; the combined scheme then has true positive rate q·tA + (1 − q)·tB and false positive rate q·fA + (1 − q)·fB. The concept is that, given the two possible schemes, I try to take the best from both, using scheme A in one region, scheme B in another, and the combination of the two in between (in our example, the intermediate part indicated by the arrow). Using this kind of combination is like moving on the convex hull of the two curves.
Note: in principle the two schemes could also be the same type of classifier with different parameters.

Synthetic measurements

We have other measurements, some of which are widely used in information retrieval, such as precision and recall. The recall is nothing but the true positive rate:

Recall = TP / (TP + FN)

In other words, it is the ratio of true positives to all actual positives (true positives plus false negatives), i.e. the percentage of true positives that are detected.
In the case of information retrieval I must find the documents relevant to my interests, so recall is the percentage of relevant documents retrieved compared to all those belonging to the category. For example, it is the number of documents I find compared to those that existed originally: if there are 100 economics articles and I find 10, the recall is 10%.
The precision tells us, among all the documents returned by the system, how many are relevant:

Precision = TP / (TP + FP)

For example, I want to look for economics documents: there are 100 of them, and the system returns only 20 documents, of which 10 are about economics. I have a recall of 10/100 = 10%, while the precision is 10/20 = 50%, because out of the 20 found 10 were actually correct. I would like to maximize both of them. In the case of the document example we can express recall and precision with the following formulas:

Recall = relevant documents retrieved / total relevant documents
Precision = relevant documents retrieved / total documents retrieved

Often a synthetic (single) measure can be of interest, for example the F-measure, a sort of harmonic mean between recall and precision, not an arithmetic mean:

F = 2 × Recall × Precision / (Recall + Precision)

Different terms are used in different areas. Doctors, for example, talk about the sensitivity and specificity of diagnostic tests. Sensitivity refers to the percentage of people with the disease who have a positive test result, i.e. the TP rate:

Sensitivity = TP / (TP + FN)

Specificity refers to the percentage of people without the disease who have a negative test result, i.e. 1 − FP rate:

Specificity = TN / (FP + TN)

Sometimes the product of the two is used as an overall measurement:

Sensitivity × Specificity = (TP × TN) / ((TP + FN) × (FP + TN))

Finally, among all the combinations of TP, FP, TN and FN, we have the classic success rate, or accuracy:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Another measurement used is the area under the ROC curve (AUC), often expressed as a fraction of the unit area. The ideal ROC curve is a step: from 0 it rises immediately vertically and then goes horizontally to 100%, so it subtends an area equal to 1, since the plot is a square of side 1. The more the curve is squashed towards the bisector (the random case), the smaller the area under the curve. This can be useful to compare two curves and then choose one system over another, and it is convenient because, like the ROC curve itself, it does not require the costs to be fixed, but summarizes a range of possible scenarios.
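A minimal sketch of the synthetic measures above computed from the four cells of a two-class confusion matrix (the counts are hypothetical):

```python
TP, FN, FP, TN = 80, 20, 30, 870   # made-up counts for illustration

recall = sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (FP + TN)
f_measure = 2 * recall * precision / (recall + precision)
accuracy = (TP + TN) / (TP + FP + TN + FN)

print(recall, precision, specificity, f_measure, accuracy)
```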

Reference

DATA MINING — Practical Machine Learning Tools and Techniques, https://www.cs.waikato.ac.nz/ml/weka/book.html
