Let’s take a deep dive into four machine learning algorithms: K-Nearest Neighbors, K-Means Clustering, Linear Regression, and Support Vector Machine!
K-Nearest Neighbors (KNN)
Basic Concept of K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the most popular algorithms in machine learning.
Unlike traditional machine learning methods that learn a target function, KNN doesn't build an explicit model of the target function.
Instead, it memorizes the training data. When a new data point needs a prediction, KNN looks at its nearest neighbors in the training set.
The prediction process is somewhat similar to how people would assess potential partners in real life. Rather than making one-sided judgments, you might want to learn about someone’s personality by observing their family or speaking with their close friends and neighbors!
How KNN Works
The KNN algorithm includes two key components:
- A distance measurement: This determines how ‘close’ two data points are to each other.
- $K$: The number of nearest neighbors considered when making a prediction.
Let’s consider an example classification problem:
Given a dataset consisting of two classes, $c_1$ and $c_2$, suppose there’s a new point $z$ (shown as a blue dot) that needs to be classified into one of these classes.
The KNN algorithm assigns a class to $z$ based on the majority class among its $K$ nearest neighbors.
In Classification Problems
- For a new data point $z$, calculate the distance between $z$ and every training data point in the dataset $D$.
- Form a set $NB(z)$ consisting of the $K$ nearest neighbors of $z$. This involves selecting the $K$ data points in $D$ that have the smallest distance to $z$, as determined by a chosen distance function.
- Assign $z$ to the class that appears most frequently among the classes of its $K$ nearest neighbors in the set $NB(z)$. This is known as “the majority class”.
In Regression Problems
- Similarly, we first need to calculate the distance between the new data point $z$ and each training example in the dataset $D$.
- Identify the $K$ nearest neighbors of $z$, forming the set $NB(z)$.
- Predict the output value for $z$ by averaging the output values of its $K$ nearest neighbors in $NB(z)$:

$$\hat{y}_z = \frac{1}{K} \sum_{x_i \in NB(z)} y_i$$
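To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The helper names `knn_classify` and `knn_regress` are just illustrative, Euclidean distance is assumed, and the inputs are assumed to be NumPy arrays:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, z, K=3):
    # Distance from z to every training point (Euclidean)
    distances = np.linalg.norm(X_train - z, axis=1)
    # Indices of the K nearest neighbors form the set NB(z)
    nearest = np.argsort(distances)[:K]
    # Majority vote among the neighbors' classes
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_regress(X_train, y_train, z, K=3):
    distances = np.linalg.norm(X_train - z, axis=1)
    nearest = np.argsort(distances)[:K]
    # Average the neighbors' output values
    return y_train[nearest].mean()
```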
Distance Functions
Alright, so we’ve briefly mentioned distance functions in the process of solving the above problems, so what exactly are they?
The distance function, often denoted as $d$, plays an important role in the KNN method. It determines how the similarity or dissimilarity between data points is measured. The choice of distance function can significantly impact the performance of the KNN algorithm. Typically, the distance function is predetermined and remains constant throughout the learning and classification/prediction processes.
Normally, when the input attributes are real numbers ($x \in \mathbb{R}^n$), we use geometric distance functions for KNN. These include the Minkowski, the Manhattan, the Euclidean, and the Chebyshev distance functions.
- The Minkowski distance ($p$-norm): This is a generalized distance metric, where different values of $p$ yield different distance metrics:

$$d(x, z) = \left( \sum_{i=1}^{n} |x_i - z_i|^p \right)^{1/p}$$

- The Manhattan distance ($p = 1$): This calculates the distance as the sum of absolute differences between the coordinates of the data points:

$$d(x, z) = \sum_{i=1}^{n} |x_i - z_i|$$

- The Euclidean distance ($p = 2$): This is the most commonly used distance function, which involves calculating the straight-line distance between two points:

$$d(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

- The Chebyshev distance ($p \to \infty$): This calculates the maximum absolute difference between the coordinates of the data points:

$$d(x, z) = \max_{i} |x_i - z_i|$$
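As a quick sketch, each of these can be written in a line or two of NumPy, assuming `x` and `z` are 1-D arrays of equal length:

```python
import numpy as np

def minkowski(x, z, p):
    # Generalized p-norm distance
    return np.sum(np.abs(x - z) ** p) ** (1 / p)

def manhattan(x, z):
    return np.sum(np.abs(x - z))          # Minkowski with p = 1

def euclidean(x, z):
    return np.sqrt(np.sum((x - z) ** 2))  # Minkowski with p = 2

def chebyshev(x, z):
    return np.max(np.abs(x - z))          # limit of Minkowski as p -> infinity
```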
Data Normalization
Normalizing the data ranges is an important step in the KNN algorithm, especially when dealing with input features that have different scales or value ranges. Without normalization, features with larger values can dominate the distance calculations, leading to skewed results.
Let’s consider a dataset with features like Age, Income (monthly), and Height (in meters).
- $x$ is a specific example in the dataset.
- $z$ is the new point to be classified.
Let’s calculate the distance between $x$ and $z$.
As can be seen from such a calculation, Income values are much larger than those for Age or Height, which means the distance between two data points will be dominated by the difference in Income. As a result, this can overshadow the contributions of Age and Height, potentially skewing the overall result.
To address this issue, it is essential to normalize the range of input attributes so that each feature contributes more equitably to the distance calculation.
The most frequently used normalization ranges are $[0, 1]$ and $[-1, 1]$. For each attribute value $x_i$, the normalized value for the $[0, 1]$ range can be calculated as

$$x_i' = \frac{x_i - \min_i}{\max_i - \min_i}$$

where $\min_i$ and $\max_i$ are the minimum and maximum values of attribute $i$ in the dataset.
Let’s revisit the above example and normalize the $x$ and $z$ data points to the $[0, 1]$ range: each Age, Income, and Height value is rescaled by that feature’s minimum and maximum, so all three features end up on the same $[0, 1]$ scale.
By normalizing the features, you ensure that each feature contributes proportionally to the distance calculation, leading to more accurate and reliable KNN results.
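As a sketch of how this looks in code, here is min-max normalization in NumPy; the Age/Income/Height values below are made-up illustrations rather than the numbers from the example above:

```python
import numpy as np

# Columns: Age (years), Income (monthly), Height (meters) -- illustrative values
X = np.array([
    [25, 3000.0, 1.70],
    [40, 9000.0, 1.60],
    [31, 5000.0, 1.82],
])

# Min-max normalization to [0, 1], computed per feature (per column)
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

# A new point z must be rescaled with the *training* minima and maxima
z = np.array([29, 4000.0, 1.75])
z_norm = (z - X_min) / (X_max - X_min)
```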
K-Nearest Neighbors Using scikit-learn
This is a basic implementation of the KNN algorithm using the scikit-learn library:
- Declaration: This involves importing the necessary KNN classifier or regressor from the scikit-learn library.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
```
- Initialization: Create an instance of the KNN classifier or regressor, specifying parameters such as the number of neighbors ($K$) and the distance metric. For example, you can specify the distance metric (Euclidean, Minkowski, etc.) and the value of $p$ for the Minkowski distance.

```python
knn_classifier = KNeighborsClassifier(n_neighbors=10, metric='minkowski', p=2, weights='distance')
```
- Training: Train the KNN model using the training data. In scikit-learn, this typically involves calling the `fit()` method on the KNN object and passing the training data as arguments.

```python
knn_classifier.fit(X_train, y_train)
```
- Prediction: Use the trained KNN model to make predictions on new, unseen data. This involves calling the `predict()` method on the KNN object and passing the new data as an argument.

```python
y_pred_knn = knn_classifier.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred_knn))
print('Precision: ', precision_score(y_test, y_pred_knn))
print('Recall: ', recall_score(y_test, y_pred_knn))
```
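Putting the pieces together, a minimal end-to-end run might look like the sketch below; scikit-learn’s built-in breast-cancer dataset is used purely as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features to [0, 1] so no single feature dominates the distances
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=10, metric='minkowski', p=2, weights='distance')
knn.fit(X_train, y_train)
print('Accuracy: ', accuracy_score(y_test, knn.predict(X_test)))
```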
K-Means Clustering
Basic Concept of K-Means Clustering
K-Means Clustering is a popular partition-based method used to divide a dataset into $K$ clusters. The data is represented as $D = \{x_1, x_2, \ldots, x_N\}$, where each $x_i$ is an observation in the form of a vector in an $n$-dimensional space.
The idea is to partition the dataset into $K$ clusters, where each cluster has a centroid (center point). The data points are assigned to the cluster with the nearest centroid.
In contrast to the K-Nearest Neighbors algorithm, which works with labeled data in supervised learning problems, the K-Means Clustering algorithm is typically used with unlabeled datasets in unsupervised learning problems. While the algorithm is efficient and robust, having to predefine $K$, the number of clusters, is one of its biggest limitations.
How K-Means Clustering Works
The K-Means Clustering algorithm includes two key components:
- A distance function $d$: This determines the distance between data points and centroids.
- $K$: The number of clusters to be created, which is predetermined.
Suppose we have a dataset $D$, the number of clusters $K$, and a distance function $d$:
- Step 1: Randomly select $K$ observations from the dataset to serve as initial centroids for the $K$ clusters.
- Step 2: Repeat steps 2.1 and 2.2 until a convergence criterion is met.
- Step 2.1: Assign each observation to the nearest cluster based on the distance to the centroid.
- Step 2.2: Recalculate the centroid of each cluster based on the observations assigned to it. The centroid is computed by averaging the coordinates of all the points in the cluster.
Let’s walk through the algorithm.
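Here is a minimal NumPy sketch of that loop: random initialization (Step 1) followed by alternating assignment (Step 2.1) and centroid updates (Step 2.2). It is an illustration rather than a robust implementation; for simplicity it assumes no cluster ever ends up empty:

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 2.1: assign each observation to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2.2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Convergence criterion: centroids stopped moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```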
Mathematical Derivation of K-Means
Given the data $D = \{x_1, x_2, \ldots, x_N\}$.
For simplicity, let’s assume $n = 2$, meaning each data point is a 2D coordinate: $x_i = (x_{i1}, x_{i2})$.
Our aim is to find:
- $K$ clusters, each represented by a centroid $\mu_k$, where $k = 1, \ldots, K$.
- Assignments of each data point to a cluster using an indicator variable $r_{ik}$:

$$r_{ik} = \begin{cases} 1 & \text{if } x_i \text{ is assigned to cluster } k \\ 0 & \text{otherwise} \end{cases}$$
Let’s define an objective function $J$, which is what we optimize during training to make our model approximate the target function as closely as possible, as

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$

where $\|x_i - \mu_k\|^2$ is the squared Euclidean distance between a data point and its assigned cluster center. The goal is to minimize $J$ by optimizing $r_{ik}$ (assignments) and $\mu_k$ (centroids).
To minimize $J$, we perform two alternating steps:
- Fix $\mu_k$, optimize $r_{ik}$: Assign each data point to the nearest cluster center.
- Fix $r_{ik}$, optimize $\mu_k$: Recalculate cluster centers based on assigned points.
Step 1: Fix $\mu_k$, Find $r_{ik}$
For a single data point $x_i$, its contribution to $J$ is:

$$J_i = \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$

To minimize $J_i$, we assign $x_i$ to the nearest cluster:

$$r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_i - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$

This process is repeated for all data points.
Step 2: Fix $r_{ik}$, Find $\mu_k$
Now, suppose data points $x_1$ and $x_2$ are assigned to cluster $k$. We want to find the optimal cluster center $\mu_k = (\mu_{k1}, \mu_{k2})$ that minimizes:

$$J_k = \|x_1 - \mu_k\|^2 + \|x_2 - \mu_k\|^2$$

Expanding the squared Euclidean distances:

$$J_k = (x_{11} - \mu_{k1})^2 + (x_{12} - \mu_{k2})^2 + (x_{21} - \mu_{k1})^2 + (x_{22} - \mu_{k2})^2$$

To minimize $J_k$, take the derivative with respect to each coordinate of $\mu_k$ and set it to zero. The optimal cluster center is:

$$\mu_{k1} = \frac{x_{11} + x_{21}}{2}, \qquad \mu_{k2} = \frac{x_{12} + x_{22}}{2}$$

or in vector form:

$$\mu_k = \frac{x_1 + x_2}{2}$$

For a general case with more points, the new cluster center is simply the mean of all assigned points:

$$\mu_k = \frac{\sum_{i=1}^{N} r_{ik} \, x_i}{N_k}$$

where $N_k = \sum_{i=1}^{N} r_{ik}$ is the number of points in the cluster.
K-Means Clustering Using scikit-learn
This is a basic implementation of the K-Means Clustering algorithm using the scikit-learn library:
- Declaration: Begin by importing the `KMeans` module from the scikit-learn library.

```python
from sklearn.cluster import KMeans
```
- Initialization: Specify the number of clusters and create a `KMeans` model with the number of clusters and a `random_state` for reproducibility.

```python
k_cluster = 3
random_state = 42  # any fixed seed keeps the results reproducible
k_mean_model = KMeans(n_clusters=k_cluster, random_state=random_state)
```
- Training: Train the model using the data.

```python
k_mean_model.fit(X)
```
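After fitting, the cluster assignments and centroids can be read directly off the model; in the snippet below, `X_new` is a stand-in for any new observations you want to assign:

```python
print(k_mean_model.labels_)           # cluster index assigned to each observation
print(k_mean_model.cluster_centers_)  # final centroid coordinates
new_labels = k_mean_model.predict(X_new)  # assign new points to the nearest centroid
```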
Linear Regression
Basic Concept of Linear Regression
Linear Regression involves learning a function $f$ from a given training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, where $y_i \in \mathbb{R}$, to estimate output values for new observations.
Each observation is represented by an $n$-dimensional vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, where each dimension represents an attribute or feature.
For example, given a simple dataset where the inputs $x = 1, 2, 3$ have outputs $y = 2, 4, 6$, the most optimal linear regression function would be $f(x) = 2x$.
Future Value Prediction
In a dataset, each observation consists of multiple input values (features), written as $x = (x_1, x_2, \ldots, x_n)$. For each input, there is a true but unknown output value, called $y$.
The machine learning system will try to predict the output value using a linear function:

$$\hat{y} = f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
Here:
- $\hat{y}$ is the predicted output.
- $w_0, w_1, \ldots, w_n$ are weights (which determine how much influence each input feature has).
- $x_1, x_2, \ldots, x_n$ are the input features.
Our goal is to make $\hat{y}$ as close as possible to the real value $y$.
Learning a linear regression function means finding the best weights $w_0, w_1, \ldots, w_n$ that define the relationship between inputs and outputs.
Once we have these weights, we can use the learned function to predict outputs for new data. If we get a new observation $z = (z_1, z_2, \ldots, z_n)$, we use:

$$f(z) = w_0 + w_1 z_1 + w_2 z_2 + \cdots + w_n z_n$$

This means we multiply each feature by its corresponding weight and sum up the results.
- If we have one input feature ($n = 1$), this function represents a straight line.
- If we have two input features ($n = 2$), it represents a plane.
- If we have many features ($n > 2$), it represents a hyperplane in higher dimensions.
Since there are infinitely many possible linear functions, we need a rule to find the best one. How can we do this?
Learning a Linear Regression Function
Learning a linear regression function, denoted as , involves finding the best linear relationship between input features and an output variable based on a given training dataset. Our aim is to ensure the model generalizes well to new, unseen data.
This involves improving the system’s performance by minimizing the error between the predicted value $f(z)$ and the actual value for future inputs $z$.
To measure how well our model is performing, we use a loss function. A popular choice is the empirical loss, also known as the Residual Sum of Squares (RSS), which helps us find the best-fitting line by minimizing the total squared errors.
Loss Function
The basic idea of the loss function is to measure how well our function fits the data by summing up the squared errors. This is called the Residual Sum of Squares (RSS):

$$RSS(f) = \sum_{i=1}^{M} \left( y_i - f(x_i) \right)^2$$

We need to find the best function $f^*$ (or equivalently, the best weights $w_0, w_1, \ldots, w_n$) that minimizes this error:

$$f^* = \arg\min_{f} \sum_{i=1}^{M} \left( y_i - f(x_i) \right)^2$$

Essentially, we are finding the values of $w_0, w_1, \ldots, w_n$ that make the squared differences as small as possible.
The best way to find these weights is by setting the derivative of the $RSS$ with respect to each weight to $0$ and solving for $w_0, w_1, \ldots, w_n$.
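As a small illustration of that closed-form route, here is a NumPy sketch that solves the resulting least-squares system directly (assuming `X` is an M-by-n feature matrix and `y` the vector of outputs):

```python
import numpy as np

def fit_linear_regression(X, y):
    # Prepend a column of ones so that w[0] acts as the intercept w0
    X_aug = np.hstack([np.ones((len(X), 1)), X])
    # Setting the derivative of RSS to zero yields the least-squares solution
    # of (X^T X) w = X^T y, which lstsq computes in a numerically stable way
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w  # [w0, w1, ..., wn]
```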
Linear Regression Using scikit-learn
- Declaration: This involves importing the necessary Linear Regression modules from the scikit-learn library.

```python
from sklearn import linear_model
```
- Initialization: Create a Linear Regression model.

```python
regr = linear_model.LinearRegression()
```
- Training: Train the model using the data.

```python
regr.fit(X_train, y_train)
print("[w1, ..., w_n] = ", regr.coef_)
print("w0 = ", regr.intercept_)
```
- Prediction: Use the trained Linear Regression model to make predictions by calling the `predict()` method on the model object with the new data as the argument.

```python
y_pred = regr.predict(X_test)
```
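To gauge prediction quality, the usual regression metrics from scikit-learn can then be applied, for example:

```python
from sklearn.metrics import mean_squared_error, r2_score

print('MSE: ', mean_squared_error(y_test, y_pred))
print('R^2: ', r2_score(y_test, y_pred))
```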
Basic Concept of Ridge Regression
A problem with standard linear regression is that it may not perform well with noisy data because it tries to fit the noise, causing the weights to change drastically.
Ridge Regression is a linear regression technique used to prevent the model's weights from changing drastically, which can occur with noisy data, by adding a penalty. This technique is called regularization and helps the model generalize better to new data.
Instead of just minimizing the usual $RSS$, we also minimize the sum of squared weights multiplied by a penalty factor $\lambda$. The optimization problem looks like this:

$$f^* = \arg\min_{f} \left[ \sum_{i=1}^{M} \left( y_i - f(x_i) \right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \right]$$

- The first term is the usual $RSS$, which measures how well the model fits the data.
- The second term is the regularization term, which discourages excessively large weight values.
- $\lambda$ is a tuning parameter that controls how much regularization we apply.
Although Ridge Regression can work with noisy data, it’s a good practice to preprocess the training data beforehand to achieve the best results.
Ridge Regression Using scikit-learn
The only difference when using Ridge Regression in scikit-learn is in the initialization step:

```python
# 'alpha' is the penalty factor
regr_ridge = linear_model.Ridge(alpha=0.1)
```
Training the Ridge Regression model:

```python
regr_ridge.fit(X_train, y_train)
```
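Prediction then works exactly as before, and comparing the learned coefficients with plain linear regression shows the shrinking effect of the penalty:

```python
y_pred_ridge = regr_ridge.predict(X_test)
# These coefficients are typically smaller in magnitude than regr.coef_
print("[w1, ..., w_n] = ", regr_ridge.coef_)
```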
Support Vector Machine (SVM)
Basic Concepts of SVM
Support Vector Machine (SVM) is a powerful machine learning algorithm used mainly for classification. It tries to find the best possible boundary to separate different classes of data. SVM and its variations include Linear SVM, Soft-margin SVM, and Non-linear SVM.
Linear SVM
Linear SVM is a method that works with linearly separable data, i.e., data that can be divided by a straight line (or a hyperplane in higher dimensions).
The core idea of Linear SVM is to find a boundary as a line (or hyperplane) that maximizes the distance (margin) between two classes.
Suppose we have the dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, where $x_i$ is an input vector in an $n$-dimensional space with the equivalent output value $y_i \in \{-1, +1\}$.
The equation of the linear separation function is:

$$f(x) = w \cdot x + b$$

where $w$ is the weight vector (determines the direction of the boundary) and $b$ is a real number (adjusts the position of the boundary).
The goal is to correctly classify each point $x_i$, so that:

$$w \cdot x_i + b > 0 \ \text{ if } y_i = +1, \qquad w \cdot x_i + b < 0 \ \text{ if } y_i = -1$$

There are many hyperplanes separating observations of the positive ($y = +1$) and negative ($y = -1$) classes with the equation $w \cdot x + b = 0$, and SVM chooses the optimal separating hyperplane with the maximum margin.
Calculating the margin is pretty straightforward.
You first need to select an observation $x^+$ in the positive class and another one $x^-$ in the negative class, such that each is the closest observation of its class to the separating hyperplane $H$.
Two parallel support hyperplanes are then defined:
- $H^+$ passes through $x^+$ and is parallel to $H$: $w \cdot x + b = +1$
- $H^-$ passes through $x^-$ and is parallel to $H$: $w \cdot x + b = -1$
so that no training observation lies between $H^+$ and $H^-$.
Now, the margin can be calculated as the distance between $H^+$ and $H^-$:

$$\text{margin} = \frac{2}{\|w\|}$$
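For example, if the learned weight vector were $w = (3, 4)$, then $\|w\| = \sqrt{3^2 + 4^2} = 5$, giving a margin of $2/5 = 0.4$.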
Optimizing Linear SVM
To maximize the margin, we need to minimize $\|w\|$. This is where the optimization problem comes in. The objective is to minimize $\frac{1}{2}\|w\|^2$ because minimizing $\|w\|$ is equivalent to maximizing the margin $\frac{2}{\|w\|}$, and the factor of $\frac{1}{2}$ is just for convenience when taking derivatives during optimization.
Finally, SVM chooses the most optimal hyperplane by finding $w$ and $b$ that solve

$$\min_{w, b} \ \frac{1}{2}\|w\|^2$$

subject to the constraint, for each data point $x_i$ with label $y_i$:

$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, M$$
Soft-margin SVM
Sometimes, in real-world data, perfect separation may not be possible (or desirable), especially when the data contains noise. Therefore, the Soft-margin SVM method allows some misclassification by using slack variables $\xi_i \geq 0$ to relax the strict boundary constraints.
The updated constraints are now:

$$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, M$$

This means some points can fall within the margin or even be misclassified, but the total error is kept as small as possible.
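In scikit-learn, this trade-off is governed by the regularization parameter `C`: a smaller `C` tolerates more margin violations, while a larger `C` penalizes them more heavily. A minimal sketch:

```python
from sklearn import svm

# Small C: wider, 'softer' margin that permits more misclassifications
soft_clf = svm.SVC(kernel='linear', C=0.1)
```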
Non-linear SVM
In many real-world problems, when data cannot be separated by a straight line, Non-linear SVM is used. It transforms the data into a higher-dimensional space, where it becomes linearly separable.
This is done using kernel functions, which map the data to a new space. Some common kernels (written here in the parametrization scikit-learn uses) are:
- Polynomial Kernel: $K(x, z) = (\gamma \, x \cdot z + r)^d$
- Gaussian RBF Kernel: $K(x, z) = \exp(-\gamma \, \|x - z\|^2)$
- Tanh (Sigmoid) Kernel: $K(x, z) = \tanh(\gamma \, x \cdot z + r)$
Finally, it applies the same formulas and steps as in linear SVM but in the new transformed space.
Support Vector Machine Using scikit-learn
For Classification Problems
```python
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC()
```
For Regression Problems
```python
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> regr = svm.SVR()
>>> regr.fit(X, y)
SVR()
>>> regr.predict([[1, 1]])
array([1.5])
```
Non-linear SVM
Different kernels are selected through the `kernel` argument:

```python
from sklearn.svm import SVR

svr_rbf = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
svr_poly = SVR(kernel="poly", C=100, gamma="auto", degree=3, epsilon=0.1, coef0=1)
```
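Each of these models is then trained and used in the same way; in the sketch below, `X`, `y`, and `X_test` stand in for your own training data and new inputs:

```python
svr_rbf.fit(X, y)
y_pred_rbf = svr_rbf.predict(X_test)
```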
👏 Thanks for reading!