
Basic Machine Learning Algorithms (Part 1) - AI4B #5

In this note, we’ll be deep-diving into four machine learning algorithms: K-Nearest Neighbors, K-Means Clustering, Linear Regression, and Support Vector Machine!


👋
Hi! Welcome to my series of notes on AI Programming for Beginners!


Let’s deep-dive into four machine learning algorithms: K-Nearest Neighbors, K-Means Clustering, Linear Regression, and Support Vector Machine!

K-Nearest Neighbors (KNN)

Basic Concept of K-Nearest Neighbors

K-Nearest Neighbors (KNN) is one of the most popular algorithms in machine learning.

Unlike traditional machine learning methods that learn a target function, KNN doesn't build an explicit model of the target function.

Instead, it memorizes the training data. When a new data point needs a prediction, KNN looks at its $k$ nearest neighbors in the training set.

The prediction process is somewhat similar to how people would assess potential partners in real life. Rather than making one-sided judgments, you might want to learn about someone’s personality by observing their family or speaking with their close friends and neighbors!

How KNN Works

The KNN algorithm includes two key components:

  • A distance measurement: This determines how ‘close’ two data points are to each other.
  • $k$: The number of nearest neighbors considered when making a prediction.

Let’s consider an example classification problem:

Given a dataset consisting of two classes, $c_1$ and $c_2$, a new point $z$ (shown as a blue dot) needs to be classified into one of these classes.

The KNN algorithm assigns a class to $z$ based on the majority class among its $k$ nearest neighbors.

Case $k = 1$:

The closest neighbor to $z$ belongs to class $c_2$. Therefore, $z$ is assigned to class $c_2$.

Case $k = 3$:

The 3 closest neighbors to $z$ are examined. Since class $c_1$ has more neighbors, $z$ is assigned to class $c_1$.

Case $k = 5$:

The 5 closest neighbors to $z$ are examined. Since class $c_1$ has more neighbors, $z$ is assigned to class $c_1$.

In Classification Problems

  • For a new data point $z$, calculate the distance between $z$ and every training data point $x$ in the dataset $D$.
  • Form a set, $NB(z)$, consisting of the $k$ nearest neighbors of $z$. This involves selecting the $k$ data points in $D$ that have the smallest distance to $z$, as determined by a chosen distance function.
  • Assign $z$ to the class that appears most frequently among the classes of its $k$ nearest neighbors in the set $NB(z)$. This is known as “the majority class”.

In Regression Problems

  • Similarly, we first need to calculate the distance between the new data point $z$ and each training example $x$ in the dataset $D$.
  • Identify the $k$ nearest neighbors of $z$, forming the set $NB(z)$.
  • Predict the output value $y$ for $z$ by averaging the output values of its $k$ nearest neighbors in $NB(z)$:
    $$y_z = \frac{1}{k}\sum_{x\in NB(z)} y_x$$
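
To make both procedures concrete, here is a minimal from-scratch sketch (not the scikit-learn implementation used later) that uses the Euclidean distance; the function name, the toy data, and the choice of $k$ are illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, z, k=3, classify=True):
    # Distance from z to every training point (Euclidean)
    dists = np.linalg.norm(X_train - z, axis=1)
    # Indices of the k nearest neighbors: the set NB(z)
    nb = np.argsort(dists)[:k]
    if classify:
        # Classification: majority class among the neighbors
        return Counter(y_train[nb]).most_common(1)[0][0]
    # Regression: average of the neighbors' output values
    return y_train[nb].mean()

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_cls = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_cls, np.array([1.1, 0.9]), k=3))  # -> 0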

Distance Functions

Alright, we’ve briefly mentioned distance functions while describing the procedures above, but what exactly are they?

The distance function, often denoted as $d$, plays an important role in the KNN method. It determines how the similarity or dissimilarity between data points is measured, and the choice of distance function can significantly impact the performance of the KNN algorithm. Typically, the distance function is predetermined and remains constant throughout the learning and classification/prediction processes.

When the input attributes are real numbers ($x_i \in \mathbb{R}$), we usually use geometric distance functions for KNN. These include the Minkowski, Manhattan, Euclidean, and Chebyshev distances.

  • The Minkowski distance ($p$-norm): This is a generalized distance metric in which different values of $p$ yield different distance metrics.
    $$d(x, z) = \left(\sum_{i = 1}^{n} \left| x_i - z_i \right|^p \right)^{\frac{1}{p}}$$
  • The Manhattan distance ($p = 1$): This calculates the distance as the sum of absolute differences between the coordinates of the data points.
    $$d(x, z) = \sum_{i=1}^{n} \left| x_i - z_i \right|$$
  • The Euclidean distance ($p = 2$): This is the most commonly used distance function, which calculates the straight-line distance between two points.
    $$d(x, z) = \left( \sum_{i=1}^{n} (x_i - z_i)^2 \right)^{\frac{1}{2}}$$
  • The Chebyshev distance ($p = \infty$): This calculates the maximum absolute difference between the coordinates of the data points.
    $$d(x, z) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} \left| x_i - z_i \right|^p \right)^{\frac{1}{p}} = \max_{i} \left| x_i - z_i \right|$$
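
All four distances above are easy to compute directly with NumPy; the following illustrative snippet evaluates each one for a pair of example vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.0, 3.0])

def minkowski(x, z, p):
    # General p-norm distance
    return np.sum(np.abs(x - z) ** p) ** (1 / p)

print(minkowski(x, z, 1))      # Manhattan: sum of absolute differences -> 5.0
print(minkowski(x, z, 2))      # Euclidean: straight-line distance -> ~3.606
print(np.max(np.abs(x - z)))   # Chebyshev: limit as p -> infinity -> 3.0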

Data Normalization

Normalizing the data ranges is an important step in the KNN algorithm, especially when dealing with input features that have different scales or value ranges. Without normalization, features with larger values can dominate the distance calculations, leading to skewed results.

Let’s consider a dataset with features like $\text{Age}$, $\text{Income}$ (monthly), and $\text{Height}$ (in meters).

  • $x = (\text{Age} = 20, \text{Income} = 12000, \text{Height} = 1.68)$ is a specific example in the dataset.
  • $z = (\text{Age} = 40, \text{Income} = 1300, \text{Height} = 1.75)$ is the value to be classified.

Let’s calculate the distance between $x$ and $z$:

$$d(x, z) = \left[(20 - 40)^2 + (12000 - 1300)^2 + (1.68 - 1.75)^2\right]^{\frac{1}{2}}$$

As can be seen from the calculation, the $\text{Income}$ values are much larger than those for $\text{Age}$ or $\text{Height}$, so the distance between the two data points is dominated by the difference in $\text{Income}$. This overshadows the contributions of $\text{Age}$ and $\text{Height}$ and can skew the overall result.

To address this issue, it is essential to normalize the range of input attributes so that each feature contributes more equitably to the distance calculation.

The most frequently used normalization ranges are $[0, 1]$ and $[-1, 1]$. For each attribute, a value $x_i$ can be scaled into $[0, 1]$ as $x_i := \frac{x_i}{\max_j x_j}$, i.e. by dividing by the largest value of that attribute in the dataset.

Let’s revisit the above example and normalize the $\text{Income}$ and $\text{Age}$ values to the $[0, 1]$ range:

  • $12000$ is normalized to $\frac{12000}{12000} = 1$.
  • $1300$ is normalized to $\frac{1300}{12000} \approx 0.108$.
  • $40$ is normalized to $\frac{40}{40} = 1$.
  • $20$ is normalized to $\frac{20}{40} = 0.5$.

By normalizing the features, you ensure that each feature contributes proportionally to the distance calculation, leading to more accurate and reliable KNN results.
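
Here is a minimal sketch of this max-based scaling with NumPy, using the feature values from the example (scikit-learn's MinMaxScaler is a common, more robust alternative):

import numpy as np

# Rows: x and z; columns: Age, Income, Height
data = np.array([[20.0, 12000.0, 1.68],
                 [40.0,  1300.0, 1.75]])

# Divide each column by its maximum so every feature lies in [0, 1]
normalized = data / data.max(axis=0)
print(normalized)
# [[0.5        1.         0.96      ]
#  [1.         0.10833333 1.        ]]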

K-Nearest Neighbors Using scikit-learn

This is a basic implementation of the KNN algorithm using the scikit-learn library:

  • Declaration: This involves importing the necessary KNN classifier or regressor from the scikit-learn library.
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
  • Initialization: Create an instance of the KNN classifier or regressor, specifying parameters such as the number of neighbors ($k$) and the distance metric. For example, you can choose the metric (Euclidean, Minkowski, etc.) and the value of $p$ for the Minkowski distance.
    knn_classifier = KNeighborsClassifier(n_neighbors = 10, metric = 'minkowski',
    	p = 2, weights = 'distance')
  • Training: Train the KNN model using the training data. In Scikit-learn, this typically involves calling the fit() method on the KNN object and passing the training data as arguments.
    knn_classifier.fit(X_train, y_train)
  • Prediction: Use the trained KNN model to make predictions on new, unseen data. This involves calling the predict() method on the KNN object and passing the new data as an argument.
    y_pred_knn = knn_classifier.predict(X_test)
    
    print('Accuracy: ', accuracy_score(y_test, y_pred_knn))
    print('Precision: ', precision_score(y_test, y_pred_knn))
    print('Recall: ', recall_score(y_test, y_pred_knn))

K-Means Clustering

Basic Concept of K-Means Clustering

K-Means Clustering is a popular partition-based method used to divide a dataset into clusters. The data is represented as $D = \{x_1, x_2, \dots, x_r\}$, where each $x_i$ is an observation in the form of a vector in an $n$-dimensional space.

The idea is to partition a dataset into $k$ clusters, where each cluster has a centroid (center point). The data points are assigned to the cluster with the nearest centroid.

In contrast to the K-Nearest Neighbors algorithm that works with labeled data in supervised learning problems, we often use the K-Means Clustering algorithm with unlabeled datasets in unsupervised learning problems. While the algorithm is efficient and robust, having to predefine $k$, the number of clusters, is one of its biggest limitations.

How K-Means Clustering Works

The K-Means Clustering algorithm includes two key components:

  • A distance function $d(x, y)$: This determines the distance between data points and centroids.
  • $k$: The number of clusters to be created, which is predetermined.

Suppose we have a dataset $D$, the number of clusters $k$, and a distance function $d(x, y)$:

  • Step 1: Randomly select $k$ observations from the dataset to serve as initial centroids for the $k$ clusters.
  • Step 2: Repeat steps 2.1 and 2.2 until a convergence criterion is met.
    • Step 2.1: Assign each observation to the nearest cluster based on the distance to the centroid.
    • Step 2.2: Recalculate the centroid of each cluster based on the observations assigned to it. The centroid is computed by averaging the coordinates of all the points in the cluster.

Let’s visualize the algorithm.

1. Divide the dataset into 2 clusters and assign 2 centroids

2. Assign each data point to the nearest cluster.

3. Recalculate the centroid of each cluster.

4. Assign each data point to the nearest cluster.

5. Recalculate the centroid of each cluster.

6. Assign each data point to the nearest cluster.

7. Recalculate the centroid of each cluster.
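
The loop illustrated above can be written compactly with NumPy. This is a minimal illustrative sketch (random initialization, a simple convergence check, and it assumes no cluster ever becomes empty), not the scikit-learn implementation shown later:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2.1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2.2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence criterion
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
print(kmeans(X, k=2))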

Mathematical Derivation of K-Means

Given the data $(x_1, x_2, \dots, x_n)$, where $x_i \in \mathbb{R}^d$.

For simplicity, let’s assume $d = 2$, meaning each data point is a 2D coordinate: $x_i = (x_{i1}, x_{i2})$.

Our aim is to find:

  • $k$ clusters, each represented by a centroid $c_j$, where $j \in \{1, 2, \dots, k\}$.
  • Assignments of each data point $x_i$ to a cluster $c_j$ using an indicator variable:
    $$r_{ij} = \begin{cases} 1, & \text{if } x_i \text{ belongs to cluster } c_j \\ 0, & \text{otherwise} \end{cases}$$

Let’s define an objective function $L$, which is what we optimize during training to make our model approximate the target function as closely as possible, as

$$L = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \|x_i - c_j\|^2$$

where $\|x_i - c_j\|^2$ is the squared Euclidean distance between a data point and its assigned cluster center. The goal is to minimize $L$ by optimizing $r$ (the assignments) and $c$ (the centroids).

To minimize $L$, we perform two alternating steps:

  • Fix $c$, optimize $r$: Assign each data point to the nearest cluster center.
  • Fix $r$, optimize $c$: Recalculate cluster centers based on assigned points.

Step 1: Fix $c$, Find $r$

For a single data point $x_1$, its contribution to $L$ is:

$$L_1 = \sum_{j=1}^{k} r_{1j} \|x_1 - c_j\|^2$$

To minimize $L_1$, we assign $x_1$ to the nearest cluster:

$$r_{1j} = 1 \quad \text{for the cluster } c_j \text{ closest to } x_1$$

This process is repeated for all data points.


Step 2: Fix $r$, Find $c$

Now, assume data points $x_1$ and $x_2$ are assigned to cluster $c_j$. We want to find the optimal cluster center $c_j$ that minimizes:

$$L_j = \|x_1 - c_j\|^2 + \|x_2 - c_j\|^2$$

Expanding the squared Euclidean distances:

$$L_j = (x_{11} - c_{j1})^2 + (x_{12} - c_{j2})^2 + (x_{21} - c_{j1})^2 + (x_{22} - c_{j2})^2$$

To minimize $L_j$, take the derivative with respect to $c_j$ and set it to zero. The optimal cluster center is:

$$c_{j1} = \frac{x_{11} + x_{21}}{2}, \quad c_{j2} = \frac{x_{12} + x_{22}}{2}$$

or in vector form:

$$c_j = \frac{x_1 + x_2}{2}$$

For the general case with more points, the new cluster center is simply the mean of all assigned points:

$$c_j = \frac{1}{m} \sum_{x_i \in \text{cluster } j} x_i$$

where $m$ is the number of points in the cluster.
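
As a quick numerical sanity check of this result (illustrative values only), the mean of the assigned points gives a smaller $L_j$ than a slightly shifted center:

import numpy as np

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 6.0])
L = lambda c: np.sum((x1 - c) ** 2) + np.sum((x2 - c) ** 2)

c_mean = (x1 + x2) / 2                     # the derived optimum: the mean
print(L(c_mean))                           # 10.0
print(L(c_mean + np.array([0.5, 0.0])))    # 10.5, shifting the center increases L_j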

K-Means Clustering Using scikit-learn

This is a basic implementation of the K-Means Clustering algorithm using the scikit-learn library:

  • Declaration: Begin by importing the KMeans module from the scikit-learn library.
    from sklearn.cluster import KMeans
  • Initialization: Specify the number of clusters and create a KMeans model with the number of clusters and a random_state for reproducibility.
    k_cluster = 3
    k_mean_model = KMeans(n_clusters=k_cluster, random_state=42)  # any fixed seed works
  • Training: Train the model using the data.
    k_mean_model.fit(X)
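
After fitting, the cluster assignment of each training point and the coordinates of the centroids are available on the model; labels_, cluster_centers_, and predict() are standard scikit-learn attributes and methods (X here is assumed to be the 2-D array of observations fitted above):

print(k_mean_model.labels_)            # cluster index assigned to each observation
print(k_mean_model.cluster_centers_)   # coordinates of the k centroids
print(k_mean_model.predict(X[:2]))     # assign points to the nearest centroid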

Linear Regression

Basic Concept of Linear Regression

Linear Regression involves learning a function $y = f(x)$ from a given training set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$, where $y_i \approx f(x_i)$ for all $i \in [1, M]$, to estimate output values for new observations.

Each observation is represented by an $n$-dimensional vector, $x_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$, where each dimension represents an attribute or feature.


For example, given the dataset shown in the figure, the most optimal linear regression function would be:

$$f(x) = -1.02 + 0.83x$$

Future Value Prediction

In a dataset, each observation consists of multiple input values (features), written as $x = (x_1, x_2, \dots, x_n)^T$. For each input, there is a true but unknown output value, called $c_x$.

The machine learning system will try to predict the output value using a linear function:

$$y_x = w_0 + w_1 x_1 + \dots + w_n x_n$$

Here:

  • $y_x$ is the predicted output.
  • $w_0, w_1, \dots, w_n$ are weights (which determine how much influence each input feature has).
  • $x_1, x_2, \dots, x_n$ are the input features.

Our goal is to make $y_x$ as close as possible to the real value $c_x$.

Learning a linear regression function means finding the best weights $w = (w_0, w_1, \dots, w_n)^T$ that define the relationship between inputs and outputs.

Once we have these weights, we can use the learned function to predict outputs for new data. If we get a new observation $z = (z_1, z_2, \dots, z_n)^T$, we use:

$$f(z) = w_0 + w_1 z_1 + \dots + w_n z_n$$

This means we multiply each feature by its corresponding weight and sum up the results.

  • If we have one input feature ($n = 1$), this function represents a straight line.
  • If we have two input features ($n = 2$), it represents a plane.
  • If we have many features ($n > 2$), it represents a hyperplane in higher dimensions.
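
As a tiny illustration of the prediction formula above (the weights and the new observation here are made-up values):

import numpy as np

w0 = -1.02                         # intercept w_0
w = np.array([0.83, 0.10])         # weights w_1, w_2 (illustrative values)
z = np.array([2.0, 5.0])           # a new observation with two features

f_z = w0 + np.dot(w, z)            # f(z) = w_0 + w_1 z_1 + ... + w_n z_n
print(f_z)                         # -1.02 + 0.83*2 + 0.10*5 = 1.14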

Since there are infinitely many possible linear functions, we need a rule to find the best one. How can we do this?

Learning a Linear Regression Function

Learning a linear regression function, denoted as $f^*$, involves finding the best linear relationship between input features and an output variable based on a given training dataset. Our aim is to ensure the model generalizes well to new, unseen data.

This involves improving the system performance by minimizing the error, represented as $|c_z - f(z)|$, between the predicted value $f(z)$ and the actual value $c_z$ for future inputs $z$.

To measure how well our model is performing, we use a loss function. A popular choice is the empirical loss, also known as the Residual Sum of Squares (RSS), which helps us find the best-fitting line by minimizing the total squared errors.

Loss Function

The basic idea of the loss function is to measure how well our function $f$ fits the data by summing up the squared errors. This is called the Residual Sum of Squares (RSS).

$$RSS(f) = \sum_{i=1}^{M} (y_i - f(x_i))^2 = \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - \dots - w_n x_{in})^2$$

We need to find the best function $f^* = \arg\min_{f \in H} RSS(f)$ (or equivalently, the best weights $w^*$ that minimize this error):

$$w^* = \arg\min_w \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - \dots - w_n x_{in})^2$$

Essentially, we are finding the values of $w_0, w_1, \dots, w_n$ that make the squared differences as small as possible.

The best way to find $w^*$ is by setting the derivative of $RSS$ to $0$ and solving for $w$.
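
For reference, here is a minimal NumPy sketch of that closed-form approach, often called the normal equations (the toy data and variable names are illustrative, and it assumes $A^{\top}A$ is invertible; scikit-learn's LinearRegression, shown next, handles this for us):

import numpy as np

# Toy data: y is roughly -1 + 2x with a little noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1.1, 1.0, 2.9, 5.1])

A = np.hstack([np.ones((len(X), 1)), X])        # prepend a column of ones for w_0
w_star = np.linalg.solve(A.T @ A, A.T @ y)      # dRSS/dw = 0  =>  (A^T A) w = A^T y
print(w_star)                                   # approximately [-1.10, 2.05]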

Linear Regression Using scikit-learn

  • Declaration: This involves importing the necessary Linear Regression modules from the scikit-learn library.
    from sklearn import linear_model
  • Initialization: Create a Linear Regression model.
    regr = linear_model.LinearRegression()
  • Training: Train the model using the data.
    regr.fit(X_train, y_train)
    print("[w1, ..., w_n] = ", regr.coef_)
    print("w0 = ", regr.intercept_)
  • Prediction: Use the trained Linear Regression model to make predictions by calling the predict() method on the model object with the new data as the argument.
    y_pred = regr.predict(X_test)

Basic Concept of Ridge Regression

A problem with standard linear regression is that it may not perform well with noisy data because it tries to fit the noise, causing the weights to change drastically.

Ridge Regression is a linear regression technique used to prevent the model's weights from changing drastically, which can occur with noisy data, by adding a penalty. This technique is called regularization and helps the model generalize better to new data.

Instead of just minimizing the usual $RSS$, we also minimize the sum of squared weights multiplied by a penalty factor $\lambda$. The optimization problem looks like this:

$$f^* = \arg\min_{f \in H} RSS(f) + \lambda \|w\|_2^2$$
$$w^* = \arg\min_w \sum_{i=1}^{M} (y_i - A_i w)^2 + \lambda \sum_{j=0}^{n} w_j^2$$

  • The first term $\sum_{i=1}^{M} (y_i - A_i w)^2$ is the usual $RSS$, written here with $A_i$ denoting the feature row of the $i$-th example; it measures how well the model fits the data.
  • The second term $\lambda \sum_{j=0}^{n} w_j^2$ is the regularization term, which discourages significant changes in weight values.
  • $\lambda$ is a tuning parameter that controls how much regularization we apply.

Although Ridge Regression can work with noisy data, it’s a good practice to preprocess the training data beforehand to achieve the best results.
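
For completeness, here is a minimal NumPy sketch of the corresponding closed-form ridge solution (illustrative only; note that this version penalizes every weight including $w_0$, whereas scikit-learn's Ridge does not penalize the intercept):

import numpy as np

lam = 0.1                                        # the penalty factor lambda
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # leading 1s for w_0
y = np.array([-1.1, 1.0, 2.9, 5.1])

# Setting the derivative of RSS + lambda * ||w||^2 to zero gives
# (A^T A + lambda * I) w = A^T y
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
print(w_ridge)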

Ridge Regression Using scikit-learn

The only difference when using Ridge Regression in scikit-learn is in the initialization step:

# 'alpha' is the penalty factor
regr_ridge = linear_model.Ridge(alpha = 0.1)

Training the Ridge Regression model:

regr_ridge.fit(X_train, y_train)

Support Vector Machine (SVM)

Basic Concepts of SVM

Support Vector Machine (SVM) is a powerful machine learning algorithm used mainly for classification. It tries to find the best possible boundary to separate different classes of data. SVM and its variations include Linear SVM, Soft-margin SVM, and Non-linear SVM.

Linear SVM

Linear SVM is a method which works with linearly separable data, i.e. data that can be divided by a straight line (or a hyperplane in higher dimensions).

The core idea of Linear SVM is to find a boundary as a line (or hyperplane) that maximizes the distance (margin) between two classes.


Consider the dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_r, y_r)\}$, where $x_i$ is an input vector in an $n$-dimensional space $X \subseteq \mathbb{R}^n$ with the corresponding output value $y_i \in \{1, -1\}$.

The linear separation function is:

$$f(x) = \langle w \cdot x \rangle + b$$

where $w$ is the weight vector (determines the direction of the boundary) and $b$ is a real number (adjusts the position of the boundary).

The goal is to correctly classify each point $x_i$, so that:

$$y_i = \begin{cases} 1, & \text{if } \langle w \cdot x_i \rangle + b \geq 0 \\ -1, & \text{if } \langle w \cdot x_i \rangle + b < 0 \end{cases}$$

There are many hyperplanes separating observations of the positive ($y_i = 1$) and negative ($y_i = -1$) classes with the equation $\langle w \cdot x \rangle + b = 0$, and SVM chooses the optimal separating hyperplane with the maximum margin.

Calculating the margin is pretty straightforward.

You first need to select an observation in the positive class, $(x^+, 1)$, and another one in the negative class, $(x^-, -1)$, such that the distance from each observation to the separating hyperplane $H_0$ is the smallest.

Two parallel support hyperplanes are then defined:

  • $H^+$ passes through $x^+$ and is parallel to $H_0$: $\langle w, x^+ \rangle + b = 1$
  • $H^-$ passes through $x^-$ and is parallel to $H_0$: $\langle w, x^- \rangle + b = -1$

so that, for every training point,

$$\begin{cases} \langle w \cdot x_i \rangle + b \geq 1, & \text{if } y_i = 1 \\ \langle w \cdot x_i \rangle + b \leq -1, & \text{if } y_i = -1 \end{cases}$$

Now, the margin can be calculated as:

$$\text{margin} = \frac{(\langle w, x^+ \rangle + b) - (\langle w, x^- \rangle + b)}{\|w\|} = \frac{1 - (-1)}{\|w\|} = \frac{2}{\|w\|}$$

Optimizing Linear SVM

To maximize the margin, we need to minimize $\|w\|$. This is where the optimization problem comes in. The objective is to minimize $\frac{1}{2}\|w\|^2$ because minimizing $\|w\|^2$ is equivalent to maximizing the margin $\frac{2}{\|w\|}$, and the factor of $\frac{1}{2}$ is just for convenience when taking derivatives during optimization.

Finally, SVM solves the following optimization problem to choose the most optimal hyperplane, finding $w$ and $b$ so that

$$\min_w \frac{1}{2}\|w\|^2$$

subject to the constraints, for each data point $x_i$ with label $y_i$:

$$\begin{cases} \langle w \cdot x_i \rangle + b \geq 1, & \text{if } y_i = 1 \\ \langle w \cdot x_i \rangle + b \leq -1, & \text{if } y_i = -1 \end{cases}$$
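
As a quick check of the margin formula on made-up, linearly separable data: in scikit-learn, coef_ and intercept_ hold the fitted $w$ and $b$, so the margin is $2 / \|w\|$ (a large C is used here to approximate the hard-margin case):

import numpy as np
from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3.5]]
y = [-1, -1, 1, 1]

clf = svm.SVC(kernel='linear', C=1e6)       # large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]      # the learned w and b
print(2 / np.linalg.norm(w))                # margin = 2 / ||w||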

Soft-margin SVM

Sometimes, in real-world data, perfect separation may not be possible (or desirable), especially when the data contains noise. Therefore, the Soft-margin SVM method allows some misclassification, using slack variables $\xi_i \geq 0$ to relax the strict boundary constraints.

The updated constraints are now:

$$\begin{cases} \langle w \cdot x_i \rangle + b \geq 1 - \xi_i, & \text{if } y_i = 1 \\ \langle w \cdot x_i \rangle + b \leq -1 + \xi_i, & \text{if } y_i = -1 \end{cases}$$

This means some points can be within the margin or even misclassified, but the errors are still kept as small as possible!
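
In practice the amount of allowed slack is controlled by a penalty on $\sum_i \xi_i$; in scikit-learn's SVC this trade-off is exposed as the C parameter (a smaller C tolerates more violations, a larger C penalizes them more heavily). A tiny illustrative comparison on made-up data:

from sklearn import svm

X = [[0, 0], [1, 1], [1.1, 0.9], [2, 2], [3, 3]]   # one noisy point near the boundary
y = [-1, -1, 1, 1, 1]

soft = svm.SVC(kernel='linear', C=0.1).fit(X, y)   # small C: allows more slack
hard = svm.SVC(kernel='linear', C=100).fit(X, y)   # large C: tries to fit every point
print(soft.n_support_, hard.n_support_)            # number of support vectors per class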

Non-linear SVM

In many real-world problems, when data cannot be separated by a straight line, Non-linear SVM is used. It transforms the data into a higher-dimensional space, where it becomes linearly separable.

This is done using kernel functions, which map the data to a new space. Some common kernels are:

  • Polynomial Kernel
    $$K(x, z) = (\langle x \cdot z \rangle + \theta)^d \quad \text{where} \quad \theta \in \mathbb{R},\ d \in \mathbb{N}.$$
  • Gaussian RBF Kernel
    $$K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma}} \quad \text{where} \quad \sigma > 0.$$
  • Tanh (Sigmoid) Kernel
    $$K(x, z) = \tanh(\beta \langle x \cdot z \rangle - \lambda) \quad \text{where} \quad \beta, \lambda \in \mathbb{R}.$$

Finally, it applies the same formulas and steps as in linear SVM but in the new transformed space.
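
The kernels above are simple to evaluate directly; this small illustrative sketch computes each one for a pair of vectors (the vectors and parameter values are arbitrary):

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([0.5, 1.5])

poly = (np.dot(x, z) + 1.0) ** 3                        # polynomial kernel, theta=1, d=3
rbf = np.exp(-np.linalg.norm(x - z) ** 2 / (2 * 1.0))   # Gaussian RBF kernel, sigma=1
tanh_k = np.tanh(0.5 * np.dot(x, z) - 1.0)              # tanh kernel, beta=0.5, lambda=1
print(poly, rbf, tanh_k)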

Support Vector Machine Using scikit-learn

For Classification Problems

>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC()
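
Once fitted, the classifier can label new points; this prediction call follows the scikit-learn documentation example:

>>> clf.predict([[2., 2.]])
array([1])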

For Regression Problems

>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> regr = svm.SVR()
>>> regr.fit(X, y)
SVR()
>>> regr.predict([[1,1]])
array([1.5])

Non-linear SVM

from sklearn.svm import SVR

svr_rbf = SVR(kernel = "rbf", C = 100, gamma = 0.1, epsilon = 0.1)
svr_lin = SVR(kernel = "linear", C = 100, gamma = "auto")
svr_poly = SVR(kernel = "poly", C = 100, gamma = "auto", degree = 3, epsilon = 0.1, coef0 = 1)
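
These three models can then be fitted and compared in the usual way; X and y here are assumed to be regression data such as the small example above:

for model in (svr_rbf, svr_lin, svr_poly):
    model.fit(X, y)
    print(model.kernel, model.predict([[1, 1]]))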



👏 Thanks for reading!