Content recommendations on YouTube, Netflix, or even on that site where you regularly buy treats for your pet already seem to be an intrinsic part of the day-to-day routine. At least from a digital user’s point of view, they are transparent and omnipresent.
At some point, everyone wonders what happens backstage and how these services know what their users like (in my case, sci-fi movies). To answer this question, however, we first need to understand what machine learning is, how it works, and what some of its types are.
More strictly, machine learning is a branch of artificial intelligence based on the idea that systems can learn from data and recognize patterns that allow them, for example, to identify a cat or a dog in an image. Another good example is their ability to infer a user’s consumption behavior.
Machine learning algorithms can be divided in accordance with the problem they aim to solve, including:
Classification: the goal is to predict a class label, which is a choice from a predefined list of possibilities, e.g., finding out if there is a dog or a cat in a given picture.
Regression: the goal is to predict a continuous number (in programming terms, a floating-point number or, in mathematical terms, a real number). Predicting a person’s annual income from their education, their age, and where they live is an example of a regression task.
Clustering: the task of partitioning a dataset into groups called clusters. Points within a cluster are very similar, whereas points in different clusters are very different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster that point belongs to.
Optimization: the goal is to compare a number of possible solutions until reaching an optimal solution or at least a satisfactory one, as in a game in which the enemies look for a better path towards the hero.
Beyond this division by type of response, algorithms can also be classified according to their type of learning: supervised and unsupervised.
So what is the difference between these types of machine learning, and how do their learning processes work? Let us jump into that right now.
As we have already established, machine learning algorithms are commonly divided by their type of learning: supervised and unsupervised.
Supervision is strongly associated with human intervention in the datasets used to train the algorithms: these algorithms cannot learn by themselves whether an object belongs to a certain class. Therefore, for each object described by a group of features, someone needs to provide the label associated with it.
A label can be defined as the known answer to a case that has already been solved. This information is generated by a person who has previously analyzed the data, item by item.
In essence, what distinguishes one type of learning from another is the knowledge of the labels that are going to be used in the algorithm’s learning process.
By analogy, in a classroom, a label would be the knowledge provided by the teacher to his or her students. This is the supervised learning model. On the other hand, a student who studies alone would represent the unsupervised learning model.
In a supervised approach, an algorithm learns to execute a task based on a dataset of known results. But how does it do that?
In practice, this kind of learning depends on human intervention.
Imagine that we need to develop a solution able to classify images of various types of vehicles.
Before we start the learning process (that is, training the machine learning model), we need to map, sanitize, and organize all the images in the dataset, labeling every image with its corresponding vehicle (car, motorcycle, bus, truck, among others).
There is another type of supervised learning, regression, in which the expected result is an estimated value. In this case, the input to the learning stage could be a table of historical values describing the demand for products in an e-commerce store over time. The selling price is the known information, and the expected output is a prediction: a price for new, related products that will join the store’s portfolio.
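To make the regression case concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to estimate a value for an unseen input. The numbers are made up, and the function names `fit_line` and `predict`, as well as the choice of a single-feature linear model, are assumptions for illustration only; a real store would have many more features and far noisier data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate the target value for a new input x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: demand (hundreds of units per month) and the
# known selling price of each existing product (the label).
demand = [1, 2, 3, 4, 5]
price = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(demand, price)
estimated_price = predict(model, 6)  # 22.0 for this perfectly linear toy data
```

The known prices play the role of labels: the model only learns the relation between input and output because every historical row carries its answer.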
As we have seen, classification and regression algorithms are based on supervised learning methods. Some of the algorithms in this group are K-NN (k-nearest neighbors), SVM (support vector machine), decision trees, and neural networks, among others.
K-NN: a k-nearest neighbors classifier, in which an element is labeled according to “how similar” its data is to that of its k nearest neighbors.
SVM: it can be used to solve classification or regression problems, and it can be briefly described as a search for the line (hyperplane) that best separates the different categories of a given dataset.
Decision trees: they are made of nodes: a root node, internal decision nodes, and leaf nodes. A decision is reached after a data point follows a path from the root node to a leaf node. In other words, if you ask a question, the answer is found after it passes through a set of nodes in the tree until it reaches an end node.
Neural networks: in general, artificial neural networks are loosely modeled on biological neural networks. In each artificial neuron, a stimulus is received, a function is applied to it, and a certain value is returned.
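Of the algorithms listed above, K-NN is the easiest to sketch in a few lines. The example below is a simplified illustration, not a production implementation: it assumes Euclidean distance and a plain majority vote, and the vehicle features (weight in tonnes, number of wheels) and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest labeled points."""
    # Euclidean distance from the query to every labeled example
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled dataset: (weight in tonnes, number of wheels)
vehicles = [(0.2, 2), (0.3, 2), (1.2, 4), (1.5, 4), (12.0, 6), (14.0, 8)]
labels = ["motorcycle", "motorcycle", "car", "car", "truck", "truck"]

prediction = knn_predict(vehicles, labels, query=(1.3, 4), k=3)  # "car"
```

Note that there is no separate training step here: the labeled examples themselves are the model, which is why the quality of the labels matters so much.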
Unsupervised learning identifies similarities in data and reacts according to the presence or absence of such similarities in each new data input.
Returning to our previous example of vehicles in images, the unsupervised learning algorithm is now fed with several vehicle images, but without the corresponding labels.
In this case, the algorithm finds patterns to group similar vehicles, but it does not learn if a certain element is a car, a motorcycle, a truck, or a bus.
When we define labels, we are somewhat restricting pattern discovery in our dataset. By not requiring labels, unsupervised learning allows the discovery of patterns that were not noticed initially.
Since there is no problem in employing unsupervised learning even if we already know the patterns (correct answers) of our data, it is possible to use unsupervised learning as a preprocessing step that assists data discovery.
On the other hand, unsupervised learning produces output that is more challenging to analyze. Since we do not know what the correct answer is, the results can be somewhat complicated to interpret.
This is why unsupervised methods are usually applied in the initial stages of solving a problem, as they explore and find potential relations in the data, even when working with a labelled dataset.
As discussed, the main kind of unsupervised learning algorithm is clustering. Beyond it, we can also highlight dimensionality reduction algorithms, such as Principal Component Analysis (PCA).
Briefly, PCA is a mathematical method that turns the collection of features describing each dataset element into a new, smaller set of features. The features in this new, smaller collection are known as principal components.
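As a rough sketch of the idea, the code below reduces two features to one by finding the direction of greatest variance (the first principal component) and projecting every element onto it. It assumes power iteration on the covariance matrix, which is only one of several ways to compute the component; the function names, the starting vector, and the toy data are all hypothetical, and a numerical library would normally do this work.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (feature means, unit vector of greatest variance)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]

    # Sample covariance matrix of the centered data
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
        for a in range(dims)
    ]

    # Power iteration: multiply by the covariance matrix and renormalize.
    # Assumption: the starting vector is not orthogonal to the answer.
    vec = [1.0 / (j + 1) for j in range(dims)]
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(rows, means, component):
    """Replace each row by its single coordinate along the component."""
    return [
        sum((row[j] - means[j]) * component[j] for j in range(len(means)))
        for row in rows
    ]


# Toy data: the second feature is always twice the first
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(rows)
reduced = project(rows, means, component)  # one number per element
```

Because the two toy features are perfectly correlated, a single principal component keeps all of the information; with real data some variance is lost, and that loss is the price of the smaller representation.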
Some frequently used clustering algorithms are K-means and DBSCAN.
K-means is a clustering method that divides a dataset into k groups.
This grouping is obtained by computing the distance between every element and the center (centroid) of each group, so that every element is assigned to the most similar group, that is, the group whose centroid is closest to it. The result is a partition of the data space that resembles a Voronoi diagram, divided into areas.
But how exactly does this algorithm work?
First, it creates k random centroids, which will become the centers of the resulting data clusters.
The data is divided after calculating the distance between each centroid and every element in the dataset.
Next, each element is assigned to the group of its closest centroid. After that, each centroid is repositioned at the mean of all the elements in its cluster - hence the name K-means.
The process repeats until the assignment of elements to clusters stops changing.
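The steps above can be sketched in plain Python as follows. This is a simplified illustration under a few assumptions: the initial centroids are k elements drawn at random from the data (one common way to create random centroids), distance is Euclidean, and the names `k_means`, `max_iterations`, and `seed` are hypothetical.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) after clustering points into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k random elements as the initial centroids
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iterations):
        # Step 2: assign each element to its closest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Stop when no element changes cluster
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 3: move each centroid to the mean of its elements
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )

    return centroids, assignments


# Toy data: two visibly separate groups of points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

The cluster numbers themselves carry no meaning: which group ends up as 0 and which as 1 depends on the random start, only the grouping matters.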
There are also methods for determining a suitable number k of centroids, such as the Elbow and Silhouette methods.
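As an illustration of the Silhouette idea, the sketch below scores a given grouping: for each element it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), and averages (b - a) / max(a, b) over all elements. The function name and the toy groupings are hypothetical, and the code assumes at least two clusters and no duplicate points.

```python
import math


def silhouette_score(points, assignments):
    """Mean silhouette; values near 1 mean compact, well-separated clusters."""
    clusters = {}
    for point, label in zip(points, assignments):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, assignments):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-element clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
two_groups = [0, 0, 0, 1, 1, 1]    # matches the visible structure
three_groups = [0, 0, 1, 1, 2, 2]  # splits the data awkwardly

score_two = silhouette_score(points, two_groups)
score_three = silhouette_score(points, three_groups)
```

In practice one would run the clustering for several values of k and keep the k with the highest mean silhouette; here the two-group labeling scores clearly higher than the three-group one.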
Another well-known grouping algorithm is DBSCAN, density-based spatial clustering of applications with noise.
One of the main advantages of DBSCAN is that it does not require a pre-set number of clusters, since it can identify the boundaries between clusters in complex data and point out what does not seem to be part of any cluster (noise and outliers).
However, DBSCAN is a little slower than other clustering algorithms, like k-means. Even so, it adapts to relatively large datasets.
DBSCAN works by identifying points in high-density areas, that is, regions in which many data points lie very close to each other. The idea behind the algorithm is that clusters are densely populated regions separated by relatively empty space.
The algorithm starts by picking an arbitrary point to evaluate. It then looks for all points that lie within a certain distance of it. If too few nearby points are found, the point is considered an outlier, meaning that it doesn’t belong to any cluster (group).
If enough nearby neighbors exist, the point being evaluated and its neighbors are labeled and assigned to the same cluster. The algorithm then also visits the neighbors of each point that has just been added to the cluster.
The cluster grows until the algorithm can identify no more new neighbors. The process then repeats with another point that has not yet been visited.
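The procedure just described can be sketched as below. This is a simplified, unoptimized illustration: the parameter names `eps` (the distance threshold) and `min_points` (how many points, including the point itself, count as "enough"), the `NOISE` marker, and the toy data are assumptions, and every neighbor search scans the whole dataset.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one cluster id per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None = not visited yet

    def neighbors_of(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = neighbors_of(i)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # too few nearby points
            continue

        # Start a new cluster and grow it through dense neighbors
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a dense point
            if labels[j] is not None:
                continue  # already assigned, nothing more to do
            labels[j] = cluster_id
            j_neighbors = neighbors_of(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is dense too: keep growing
    return labels


# Toy data: two dense groups and one isolated point
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
    (5.0, 0.0),
]
labels = dbscan(points, eps=1.5, min_points=3)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike the K-means case, the number of clusters was never specified: two clusters emerge from the density of the data, and the isolated point is reported as noise.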
In this article, we have covered the differences between supervised and unsupervised machine learning methods and some of their main algorithms. When it comes to developing machine learning solutions, an engineer needs to identify what the desired response is and, from that, determine the most suitable algorithm for the task. The algorithm’s role consists of learning from the inputs to recognize which mathematical function best fits the intended answer. In upcoming posts, we will explore these algorithms in depth through some examples and will try to provide a more complete explanation.
Written in partnership with Roberto Momberger Reginato.
Müller, Andreas C., and Sarah Guido. Introduction to machine learning with Python: a guide for data scientists. O’Reilly Media, Inc., 2016.