Data Mining and Machine Learning — Introduction

Gioacchino Lonardo
Sep 24, 2020

We will deal with the representation of data, which is natural to the techniques of data mining and machine learning. Many references are taken from the book DATA MINING — Practical Machine Learning Tools and Techniques.

Data Mining

To understand what Data Mining is, we start from the concepts of data and information. A datum is a value that belongs to a certain type and is typically stored in a database. From this data we want to extract information, that is, to associate a semantics with the data: this is where the concept of mining comes in.

Mining is the process of extracting information from data. Among the techniques we can use to extract information from data are those based on machine learning.

Machine Learning

Machine Learning (ML) is the set of techniques and algorithms that allow a machine to learn automatically. They let us extract information that is not explicitly present in the data.
When the information is explicitly present in the data, a machine learning technique uses it (for example, the class each record belongs to) to train a model, as in the following classification problem.

Example of classification problem

Example data: records describing patients (age, astigmatism, tear production rate, …) together with the type of contact lenses recommended for each (none, soft or hard).

Here we have data describing the characteristics of some people, together with the type of contact lenses recommended for each of them. The information that interests us is therefore already in the data; what we want to obtain is a rule (a system) that automatically associates with each person the class of contact lenses best suited to them. The prerequisite is to give the data a representation. This can be vectorial: the object to be attributed to one of the N classes is described by a vector of characteristics, which can be numerical or categorical (i.e. of a discrete type), and this choice also influences which machine learning technique we will use. Depending on the context, we can see these data as tuples, or as vectors of discrete characteristics (each attribute takes one of the values belonging to a certain set).
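As a minimal sketch (assuming attribute names and values from the classic contact-lenses dataset used in the book), a single record and its encoding as a feature vector might look like this:

```python
# One record of the contact-lenses data, first as categorical attributes,
# then encoded as a numeric feature vector a classifier can consume.
# Attribute names/values assume the classic contact-lenses dataset.
record = {
    "age": "young",                       # young / pre-presbyopic / presbyopic
    "spectacle_prescription": "myope",    # myope / hypermetrope
    "astigmatism": "no",                  # no / yes
    "tear_production_rate": "normal",     # reduced / normal
}
label = "soft"                            # class: none / soft / hard

# Map each categorical value to an integer index (ordering is arbitrary).
value_sets = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle_prescription": ["myope", "hypermetrope"],
    "astigmatism": ["no", "yes"],
    "tear_production_rate": ["reduced", "normal"],
}
feature_vector = [value_sets[attr].index(val) for attr, val in record.items()]
print(feature_vector)  # [0, 0, 0, 1]
```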

With reference to if_then_rules.py, inside we have some simple if/then/else rules in which the attributes considered are the patient's age, astigmatism and tear production rate.
These are rules a doctor might use. We want to extract such rules from the information contained in the data, or use less explicit models, to reproduce what the doctor would do, that is, to assign the correct class (none, soft or hard) as the type of lenses recommended.
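As an illustration (this is not the actual content of if_then_rules.py; the rules below are only a sketch based on the attributes and classes mentioned above):

```python
def recommend_lenses(age, astigmatism, tear_production_rate):
    """Illustrative if/then rules for the lens recommendation (sketch only)."""
    if tear_production_rate == "reduced":
        return "none"
    if astigmatism == "no":
        return "soft"
    if age == "young":
        return "hard"
    return "none"

print(recommend_lenses("young", "no", "normal"))         # -> soft
print(recommend_lenses("presbyopic", "yes", "reduced"))  # -> none
```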

Ensemble Learning and Deep Learning

It is possible to combine several machine learning techniques to obtain better results during classification. These are called Ensemble Learning techniques: they realize learning not with a single technique, but with a set of techniques combined with each other. Ensemble learning techniques either put together different approaches, or implement algorithms that, while using the same classification system (the same machine learning technique), iterate it in a clever way to increase the performance and accuracy of the classifier. When the emphasis is on classification, they are often called multiple classifier systems.
We will also look at measuring the performance of a classifier: when using different approaches we must be able to evaluate them, compare the results and choose the most appropriate one. Sometimes the type of data points us towards certain classifiers, but rarely towards only one, so we need performance measurement.
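A minimal sketch of both ideas with scikit-learn (the iris data is used here only as a stand-in dataset): three different classifiers are combined by majority vote, and cross-validation is used to measure and compare performance.

```python
# Ensemble learning sketch: combine three classifiers by majority vote and
# compare the ensemble against a single tree with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("nb", GaussianNB()),
    ("lr", LogisticRegression(max_iter=1000)),
])

for name, model in [("single tree", DecisionTreeClassifier(max_depth=3)),
                    ("ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```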

Deep learning (DL) techniques have recently come into widespread use in industry, although the basic idea dates back to the 1980s. They are widely used for large amounts of data and in particular for images. Conventional ML techniques take as input a description produced by someone else: the characterization of the set of attributes is done before feeding the data to the classifier. A deep learning technique instead realizes both the feature extraction and the classification. The learning is "deeper" because, for example on an image, the DL technique first tries to learn the characteristics that make it possible to discriminate one image from another, and then performs the classification.

In a separate article I will introduce the more complex concept of reinforcement learning (RL).

Data Preprocessing

Example where the data comes from a sensor

Data preprocessing is an important step, for example when setting up a data warehouse. It begins with a data cleaning phase, which then handles critical issues such as the absence of a certain attribute value (missing data). How this phase is carried out depends on the machine learning techniques that will be used: sometimes it is done automatically, sometimes it has to be done "manually". In this phase we also have techniques for attribute selection (feature selection), or for extracting new information from the attributes, for example through PCA (feature extraction). All of these are useful before mining in the strict sense.
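A minimal sketch of such a preprocessing chain with scikit-learn (the tiny dataset and the parameter choices are illustrative): missing-value imputation, feature selection, then PCA for feature extraction.

```python
# Preprocessing sketch: impute missing values, select informative attributes,
# then extract new features with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Tiny illustrative dataset with a missing value (np.nan).
X = np.array([[1.0, 200.0, 3.0],
              [2.0, np.nan, 2.5],
              [1.5, 210.0, 3.2],
              [3.0, 190.0, 2.8]])
y = np.array([0, 1, 0, 1])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # handle missing data
    ("select", SelectKBest(f_classif, k=2)),      # feature selection
    ("pca", PCA(n_components=2)),                 # feature extraction
])
X_clean = preprocess.fit_transform(X, y)
print(X_clean.shape)  # (4, 2)
```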

Features are crucial

We will use attributes and features as synonyms, although in data mining the more appropriate term is attribute.

Example 1: in vitro fertilization
Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome

Features are in general measurements made on the specific problem, in this case on the embryos. The features must be as discriminating as possible, in order to characterize the embryos that are more likely to survive than others. In the data we have the embryos with their characteristics, together with the outcome: whether or not each embryo gave rise to a pregnancy.
It is important to select discriminating features. By discriminating we mean features whose values differ markedly when an element belongs to different classes: for example, an embryo that leads to pregnancy should have a value of a certain feature quite different from that of an embryo that does not. A feature whose variance tends to zero gives us little information, because if it takes practically the same value on all samples it is of no use.

In this case I have two approaches:

  1. I am not an expert in the domain, so I think of descriptions of the embryos that help me discriminate them, based on morphological features such as the area of the embryo, an associated geometric figure, etc.
  2. I refer to a domain expert who contributes to the selection of the features, advising me on the features he would have used to make a "manual" classification.

These approaches can be complementary: I can rely on the domain expert and then consider other features that I find reasonable.
I can also use statistical techniques that allow me to reduce the set of features. This matters because classifiers run on machines, and we have to consider the computational complexity of the algorithms: for example, if the complexity of the algorithm is quadratic and decisions have to be made in a short time, this can become a problem.
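A minimal sketch of this idea (the numbers are illustrative): a feature whose variance is close to zero can be filtered out automatically, for example with scikit-learn's VarianceThreshold.

```python
# Remove features whose variance is (near) zero: such a feature has almost
# the same value for every sample, so it cannot help discriminate classes.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    # feature 0 is almost constant, feature 1 varies with the class
    [1.00, 0.2],
    [1.00, 0.9],
    [1.01, 0.1],
    [1.00, 0.8],
])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # variance of each feature
print(X_reduced.shape)       # (4, 1): the near-constant feature is dropped
```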

Explainable AI (XAI)

Explainable AI (XAI) refers to methods and techniques that make it possible to understand how an algorithm reaches a decision. The ultimate goal is that the decision can be understood by a human. This aspect influences which DM or ML technique we choose: how the prediction capability has been realized may itself be a constraint. For example, in the medical field we do not just need a system that automatically tells us the class; we also want to see how the system arrived at that solution.
Some ML techniques take the input and produce the output, but everything happens in a "closed box" and we cannot see how the conclusion was reached. Other techniques also allow us to provide an explanation of why a certain choice was made or why the system behaves in a certain way.
Some classifiers work with an implicit internal representation that is very complicated to make explicit, such as Neural Networks and Support Vector Machines (SVM). Interpretability with these classifiers is very difficult.
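A minimal sketch of this contrast (using the iris data as a stand-in): a decision tree can be read directly as explicit rules, while for an SVM we can only resort to post-hoc explanations such as permutation importance.

```python
# Interpretable model vs. "closed box": read a tree's rules directly,
# and explain an SVM only indirectly via permutation importance.
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=list(data.feature_names)))  # explicit rules

svm = SVC().fit(X, y)  # accurate, but its internal representation is implicit
result = permutation_importance(svm, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # which features the black box relies on
```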

Machine learning techniques

Algorithms for acquiring structural descriptions from examples

Machine learning techniques are used to acquire descriptions. Structural descriptions are descriptions that explicitly represent patterns; in some cases we have structural descriptions like the rules seen above, in other cases we have implicit descriptions that are not immediately interpretable.
An ML system can automatically learn the right value of the class attribute for each tuple. For example, when a patient comes in, instead of having the contact lenses assigned by the doctor, they are assigned by the automatic system, which has built a model that performs the classification.
The prediction is made possible by training on a set of data called the training set, because it is processed during the training phase. One of the problems in DM is that we may have a lot of data available for training and yet the information we can extract from it is little. An ML algorithm also needs the ability to generalize, i.e. the ability to provide correct results on data never seen before.
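A minimal sketch of this training/generalization distinction (iris again as a stand-in dataset): accuracy on the training set tells us how well the model adapted to the data it saw, while accuracy on a held-out test set tells us how well it generalizes.

```python
# Fit on a training set, then evaluate on data the model has never seen.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("test accuracy:    ", model.score(X_test, y_test))  # generalization
```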

Prediction

ML can be used to predict the outcome in a new situation

If for every prescription I make I already have in my database the corresponding tuple with the values (in our case astigmatism, tear production rate and age, together with the correct prescription), I don't need machine learning but a simple lookup: an efficient, well-indexed database would give me the result through a simple query. We don't need something that behaves like a lookup table; we need a system that answers a question about a tuple that was not present in the database. From this comes a problem that plagues ML systems, overfitting: the system adapts too much to the data and loses generalization capability.

Classification vs. association rules

Let's see the difference between classification rules and association rules.

The classification rule predicts the value of a specific attribute. In these systems we must choose a priori the attribute that represents the class to which the sample is to be assigned: for example, in the case of the weather attributes it is whether to play or not, in the case of contact lenses it was the type of lens to use.
Whatever ML technique we use, we must choose that attribute a priori, and our system must predict the most appropriate value of it for each feature vector (each tuple). The attribute values are always discrete values representing the set of classes to which each vector describing the object to be classified can be assigned.

A classification problem can be formulated as follows: given a certain object, it must be assigned to one of N possible classes, where the class labels belong to a finite set; formally, we look for a function f : X → {c₁, …, c_N}. For example, the classes can be yes and no: this is a two-class (binary) classification problem, because the class attribute allows only two values; in the case of the contact lenses there were three.

An association rule, in general, can predict the value of any attribute (not a fixed one), or of a combination of attributes. Let's consider again the weather data and the rules that can be extracted from it.

While a classification rule outputs a decision on the value the play attribute must assume, an association rule can give a prediction on a pair of attributes: for example, an association rule might say that if there is no wind and you don't play, then the weather is predicted to be sunny.
We use the term rule because in data mining association rules are among the approaches that are followed; in the case of classification it is not necessarily done through rules, since there could be a technique that gives us the expected prediction for the class attribute without providing an interpretable way of classifying. When we do have interpretable rules, the domain expert can also modify them: he can remove some that are contrary to his vision, simplify others, and take away redundant ones.
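A sketch of the difference, using the weather data informally (attribute names and the specific rules below are illustrative, in the spirit of the examples in the book):

```python
# A classification rule always predicts the class attribute `play`,
# while an association rule may predict any attribute(s).

def classification_rule(instance):
    # If outlook = sunny and humidity = high then play = no
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return {"play": "no"}
    return None

def association_rule(instance):
    # If windy = false and play = no then outlook = sunny and humidity = high
    if instance["windy"] == "false" and instance["play"] == "no":
        return {"outlook": "sunny", "humidity": "high"}
    return None

example = {"outlook": "sunny", "temperature": "hot",
           "humidity": "high", "windy": "false", "play": "no"}
print(classification_rule(example))  # {'play': 'no'}
print(association_rule(example))     # {'outlook': 'sunny', 'humidity': 'high'}
```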

Decision tree

In the image we have a typical representation of a decision tree, which in this case is binary, but in general it can have more branches. Each node carries the name of an attribute, and its branches correspond to the possible values that the attribute can assume. In this case the decision tree is used to assign a sample to a class.
To classify the tuples seen before, for example, you enter at the root and evaluate the attribute related to the tear production rate: if it is reduced, you decide for none. Through the tree (a structural representation) I also see how the classification is reached, so I do not treat it as a black box. The classification starts from the root and ends in a leaf. The tree implicitly selects the most significant characteristics, those that best help us discriminate the various elements. It may not use some features at all, which means those features were not very significant.

Let's see an example of classification with iris flowers, in which we have numerical attributes.

Note that each tuple can also be made up exclusively of numerical values; the iris data are a typical example of an ML dataset (in the table, dots mark rows omitted for brevity).
Note (discrete vs. continuous data): in some datasets, if all the attributes have discrete values, some structural representations use rules based on equality of attribute values; in the case of numerical values this is handled using relational operators.
From a dataset like this we can extract a list of classification rules. These rules are in OR between them, and classification systems evaluate them from the top. We may even have rules that contradict what is in the database, but if such a rule is never activated the system still shows optimal performance and never makes mistakes.
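As an illustration of such an ordered rule list for the iris data (the thresholds below are illustrative, not the ones from the original figure):

```python
# Ordered rule list: numerical attributes are tested with relational
# operators, rules are evaluated from the top, and the first one that
# fires gives the class.
def classify_iris(petal_length, petal_width):
    if petal_length < 2.45:
        return "Iris-setosa"
    if petal_width < 1.75 and petal_length < 4.95:
        return "Iris-versicolor"
    return "Iris-virginica"

print(classify_iris(petal_length=1.4, petal_width=0.2))  # Iris-setosa
print(classify_iris(petal_length=4.5, petal_width=1.5))  # Iris-versicolor
```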

Linear regression

We can also have predictions in which a feature vector is not associated with a class, but with a numerical value to be predicted. For example, in this case we want to estimate the performance of a CPU from a number of parameters.

Note: CT stands for Cycle time, M for Main memory, CACH for Cache, CH for Channel, PRP for Performance.

The original dataset consists of 209 computer configurations, and one way to estimate performance is to use a regression function, whose coefficients have to be estimated from the data. The problem is different, since here we do not predict a categorical value (a class) but a numerical value. A linear regression function has the general form PRP = w₀ + w₁·CT + w₂·M + w₃·CACH + w₄·CH, where the weights wᵢ are the coefficients to be estimated.

In this case DM techniques will use ML algorithms that, in principle, can be more effective than a linear regression function, because they can also discover non-linear relationships between the values of the known attributes and the unknown value, i.e. the performance of the processor.
Note that regression functions can also be non-linear, but with a statistical approach I generally have to decide in advance the type of non-linearity to look for, and then use techniques to evaluate the goodness of that choice. With ML algorithms, in principle, the type of non-linearity required is determined automatically (although not explicitly) by the algorithm.
With a linear regression I can also get an idea of the influence of the various parameters on the final value.
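A minimal sketch of the linear approach with scikit-learn (the handful of configurations below are illustrative values, not rows of the actual 209-machine dataset):

```python
# Estimate the coefficients of a linear regression mapping machine
# attributes to performance (PRP). Rows below are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: CT (cycle time), M (main memory), CACH (cache), CH (channels)
X = np.array([[125,   256, 16, 128],
              [ 29,  8000, 32,  32],
              [ 29,  8000, 32,  16],
              [ 26, 16000, 64,  32],
              [ 23, 32000, 64,  32]], dtype=float)
y = np.array([198.0, 269.0, 220.0, 300.0, 320.0])   # PRP, illustrative

reg = LinearRegression().fit(X, y)
print("coefficients:", reg.coef_)      # influence of each attribute
print("intercept:   ", reg.intercept_)
print("predicted PRP:", reg.predict([[30, 4000, 32, 16]]))
```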

In general I could build a decision tree in which each leaf corresponds to a single sample of my dataset (one rule for each configuration of attribute values). This is what goes under the name of overfitting: adapting too much to the data. We must avoid the bias introduced by overfitting.

Machine learning and statistics

Let's say there is no real dividing line between statistics and ML. Many ML techniques are related to standard statistics, while others are more related to the computer science world. Looking for a point of difference, we could say that statistics is more interested in verifying hypotheses, while machine learning is more interested in formulating the process of generalization as a search through possible hypotheses. But this is a gross simplification: statistics is much more than hypothesis testing, and many machine learning techniques do not involve any search at all.

In the past, very similar schemes have developed in parallel in machine learning and statistics. One is decision tree induction. Four statisticians (Breiman et al., 1984) published a book, Classification and regression trees, in the mid-1980s, and throughout the 1970s and early 1980s a prominent machine learning researcher, J. Ross Quinlan, was developing a system for inferring classification trees from examples. These two independent projects produced quite similar schemes for generating trees from examples, and the researchers only became aware of one another’s work much later.

Today the two points of view converge. For example, most learning algorithms use statistical tests when building rules or trees and for correcting models that are “overfitted”.

Generalization as search

If we describe our data with a set of rules, we can see the set of rules as a description language. We then have to face the problem of generalization, i.e. building a model that is general; this can be seen as a search problem: we search the space of rules for the best set of rules for our problem.
From a theoretical point of view, if the attributes are discrete I can enumerate the space of concepts, even if the complexity of this operation is exponential. I can then eliminate all the rules that contradict the examples in my training set, so that only the rules consistent with the training examples remain.

There are several practical problems with enumeration; one is the size of the search space. In the weather problem, with its attributes, there are 288 possible rules, and allowing rule sets of up to 14 rules leads to about 2.7×10³⁴ possible rule sets.
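A quick sanity check of this arithmetic in Python (the per-attribute counts assume the usual weather data: each of the four attributes can take one of its values or be absent from the rule, and the conclusion is one of two classes):

```python
# Each attribute can take one of its values or be absent from a rule
# (outlook: 3+1, temperature: 3+1, humidity: 2+1, windy: 2+1), and the
# rule's conclusion is one of 2 classes.
possible_rules = 4 * 4 * 3 * 3 * 2
print(possible_rules)                 # 288
print(f"{possible_rules ** 14:.1e}")  # ~2.7e+34 possible rule sets
```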
Other approaches include, for example, gradient descent, where instead of a space of rules we have a numeric space: given a description based on numbers, we maximize or minimize a certain function. Typically the choice is made with heuristic techniques, which often lead to good, though sub-optimal, solutions.

Bias and Overfitting

In many of these approaches there is a bias that we must take into account. We must avoid overfitting, and to do so we look at the possible biases we have to confront and try to avoid.
For a rule-based classifier, the first source of bias is the choice of the language used to describe the problem.
The order in which we perform the search also matters: if the system is based on a list of rules, the order of the rules is important.

We must make sure that the language chosen for the classification is universal, otherwise the performance of the system is influenced by the type of language (or the type of model chosen for the description). In general, the problem with language bias is that the choice of how we describe the problem may somehow limit the result we can obtain.

In the case of search bias we have to take into account the type of search we do. Not all search strategies give the best result: a greedy search, by definition, may not lead to the optimal result. For example in decision trees, the algorithms we will see first choose the root of the tree. The root implies a certain type of decision; when we then apply the procedure to generate the other nodes, nothing guarantees that the first pair of nodes generated is the best, because they were generated sequentially. If we considered all possible pairs we could in theory find the best pair, but building the nodes sequentially does not ensure this. That is why the search is greedy: every choice is conditioned on the attribute chosen first. The algorithm can therefore introduce a negative bias, not allowing the best choice in absolute terms but proceeding by local optima. A greedy search leads to a local optimum that does not necessarily coincide with the global one.
Note that some algorithms may repeat an attribute several times as a tree node, while in others an attribute cannot appear more than once.

A particular case of search bias concerns how to avoid overfitting. In decision trees this can be done with pruning strategies, which we will see in the next stories.

Reference

DATA MINING — Practical Machine Learning Tools and Techniques, https://www.cs.waikato.ac.nz/ml/weka/book.html

