skip to content
Site header image Minh Ton

AI Programming Libraries - AI4B #1

In this note, I'll discuss some popular AI programming libraries, such as NumPy, Pandas, Matplotlib, and Seaborn.

Last Updated:

👋
Hi! Welcome to my series of notes on AI Programming for Beginners!


In this note, I'll guide you through the concepts of libraries, packages, and modules in Python, along with their practical applications. I'll also introduce some widely used AI programming libraries, including Pandas, NumPy, Matplotlib, and Seaborn, and provide insights on how to effectively utilize these libraries.

I know this note is kind of theoretical and boring, but it is pretty necessary for starters who’ve just picked up AI programming (like me!). Why not just give it a read?

Overview of AI Programming Libraries

Libraries, as you might know, are physical spaces filled with books that contain knowledge from various fields, such as science, literature, and research. Python libraries are no different. They are are collections of functions and tools designed to perform specific tasks within Python programming.

Python Libraries

Let's delve into some fundamental concepts of Python libraries: variables, functions, modules, and packages. These can be described in the following chart:

Library{Package...{Module...{VariablesFunctions...\begin{array}{c c c c c c c} \fbox{\hspace{0.1cm}Library\hspace{0.1cm}} & \bigg\{ & \begin{array}{c} \fbox{\hspace{0.1cm}Package\hspace{0.1cm}} \\ ... \end{array} & \bigg\{ & \begin{array}{c} \fbox{\hspace{0.1cm}Module\hspace{0.1cm}} \\ ... \end{array} & \Bigg\{ & \begin{array}{c} \fbox{\hspace{0.15cm}Variables\hspace{0.1cm}} \\ \fbox{\hspace{0.1cm}Functions\hspace{0.1cm}} \\ ... \end{array} \end{array}

Let's break it down. A program is composed of numerous variables and functions. If it's designed for reusability, it's referred to as a module. A package is a collection of modules that share similar functions. A library is a repository that houses all these packages, providing a collection of pre-written functions and objects that programmers can leverage to enhance coding efficiency.

- Modules: A program designed for reuse.
- Packages: A collection of modules with similar functions.
- Library: A collection of packages.

Using libraries in Python generally involves the keywords import, from, and as. The import keyword is used to bring a library into a program, and as can be used to rename a library for easier use. The from keyword can be used to import specific objects from a library.

import math # using module math
pie = math.pi
print("Value of pi: ", pie)
Value of pi: 3.141592653589793
Output
from math import pi # using variable pi
print("Value of pi: ", pi)
Value of pi: 3.141592653589793
Output
from math import * # using all
print(pi)
print(factorial(6))
3.141592653589793
720
Output
import random as rd
rd_val = rd.random()
print(rd_val)
0.04592068876016775
Output

Common AI Programming Libraries

Let’s look at the top 10 Python libraries. The first eight libraries in the list are the most popular libraries for AI programming, especially in Machine Learning and Deep Learning.

In this note, I’ll focus on the top 4 libraries - Pandas, NumPy, Matplotlib, and Seaborn.


Managing, analyzing, and performing statistical operations on data using Pandas

Pandas Overview

Image Source:  https://logowik.com
Image Source: https://logowik.com

Pandas aims to provide users a method of displaying data using tables. Each data can be indexed by rows or columns, like in the table in the above picture. This library is used for data management, analysis, and statistics, and to presents data.

You can install this library using the terminal command pip install pandas.

pandas documentation — pandas 2.2.3 documentation

Data Structures in Pandas

The Series Data Structure

These are one-dimensional arrays with labeled indices.

import pandas as pd

series_a = pd.Series([1, 7, 2])
print(series_a)
Create a Series consists of three integer datapoints.
0   1
1   7
2   2
dtype: int64
Output
import pandas as pd

series_a = pd.Series([1, 7, 2], 
           index = ["a", "b", "c"])
print(series_a)
Create a Series consists of three integer datapoints with labeled indices.
a   1
b   7
c   2
dtype: int64
Output
import pandas as pd

marks = { "day1": 20, "day2": 70, "day4": 90 }
series_a = pd.Series(marks)
print(series_a)
Create a Series from a dictionary.
day1   20
day2   70
day3   90
dtype: int64
Output

The DataFrame Data Structure

These are two-dimensional arrays with labeled rows and columns, similar to a spreadsheet or an SQL table.

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)
print(df)
Create a DataFrame with two columns, 'calories' and 'duration', and index the rows from 0.
	 calories   duration 
0       420         50
1       380         40
2       390         45
Output


import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)
# access 'calories' column
print(df["calories"])
Access the ‘calories’ column from the dataframe.
0   420
1   380
2   390
Name: calories, dtype: int64
Output
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)
print(df.loc[0])
print(df.loc[[0, 1]])
Use the ‘loc’ property of DataFrame to access a group of rows or columns, in this case we’re accessing the 1st and 2nd row.
calories    420
duration     50
Name: 0, dtype: int64
	 calories   duration 
0       420         50
1       380         40
Output

Pandas Operations

We can also perform read-write operations on Excel spreadsheets, CSV, and JSON files.

Read a CSV file using Pandas.
Read a CSV file using Pandas.

Some other operations include:

  • Viewing data: head(), to_string()
  • Data analysis and statistics: info(), describe(), corr(), etc.
  • Accessing data: [], .loc[], .iloc[]
  • Updating data: dropna(), fillna()

Numerical Operations using NumPy

NumPy Overview

This library is used for scientific computing with sequences and matrices. NumPy works with N-dimensional arrays (ndarray), which are similar to lists, but can have multiple dimensions. NumPy arrays have a specific data type for all elements.

You can install this library using the terminal command pip install numpy.

NumPy documentation — NumPy v2.2 Manual

Data Structures in NumPy

The ndarray Data Structure

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Declare a NumPy array.
[1 2 3 4 5]
<class 'numpy.ndarray'>
Output
import numpy as np

list_a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
arr_a = np.array(list_a)

print('5th element on 2nd row: ', list_a[1][4])
print('5th element on 2nd row: ', arr_a[1][4])
print('5th element on 2nd row: ', list_a[1, 4])
Access elements in the 2D NumPy array.
5th element on 2nd row: 10
5th element on 2nd row: 10
5th element on 2nd row: 10
Output
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2]) # step = 2
Get a subarray from the NumPy array, similar to a list.
[2 4]
Output

Unlike Python lists which can store elements of various types, NumPy arrays have a specific data type for all elements. These are some of the data types for NumPy arrays:

  • i - integer
  • b - boolean
  • u - unsigned integer
  • f - float
  • c - complex float
  • m - timedelta
  • M - datetime
  • O - object
  • S - string
  • U - unicode string
  • V - fixed chunk of memory for other type (void)
import numpy as np

arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype(int) # 'i'
print(newarr)
print(newarr.dtype)
Convert a NumPy array containing floating-point numbers into an array of integers.
[1 2 3]
int64
Output
import numpy as np

arr = np.array([1, 2, 3, 4], dtype='f')
print(arr)
print(arr.dtype)
Create a NumPy array with a specific data type.
[1. 2. 3. 4.]
float32
Output
import numpy as np

arr = np.array(['1', 2, 3, 4, 5, 6, 7])
print(arr)
NumPy promotes all elements to the "highest" type that can accommodate all values. Since strings are more general than integers, the integers are converted to strings in this case.
['1', '2', '3', '4', '5', '6', '7']
Output

Operations on ndarray

Common array manipulations include:

  • Shape: shapereshape().
  • Initialization: numpy.zeros()numpy.ones(), etc.
  • Concatenation: numpy.concatenate()numpy.stack(), etc.
  • Splitting arrays: numpy.array_split()numpy.hsplit().
  • Searching: numpy.where(), etc.
  • Sorting: numpy.sort().

Data Visualization Techniques with Matplotlib and Seaborn

Why Visualize Data?

Data visualization is all about representing data graphically, using charts, graphs, and images. Visualizing the data can provide valuable insights, which allows us to extract the most meaningful information from the data sets.

Before Training AI Models

A plot of 2D data about the distribution of weight and height of a group of people.
A plot of 2D data about the distribution of weight and height of a group of people.
  1. Analyze data distribution:

    Let’s consider the plots. If we only view this data in a table, it's challenging to visualize its distribution on a Cartesian plane. In this instance, the data points (in red) are distributed almost in a straight line. Therefore, our goal is to train an AI model to predict data values that are approximately equivalent to those on the straight line.

  2. Clean datasets by removing noise or outliers:

    However, in the first example plot, there's an outlier data point—a value that lies outside the majority of the data points. The presence of this outlier causes the predicted values from your model, when viewed as a straight line, to deviate (as seen in the plot). This is because the predicted line must be the closest distance to every data point, including the outlier. By visualizing the data and removing the outliers, we acknowledge the loss of some data points, but our model can make more accurate predictions based on the remaining data points.

  3. Augment data if necessary.

During AI Models Training

Monitor loss/accuracy curves (1st plot) & parameter distribution visualization (2nd plot).
Monitor loss/accuracy curves (1st plot) & parameter distribution visualization (2nd plot).

Monitor the variation of the model over time:

  • When you've built an AI model, how can you ensure that it's actually learning or that the optimization of your model is effective? Typically, we monitor the loss curve during training and the accuracy curve during validation using a specific test dataset. If your model is improving, the loss curve decreases over time, while the accuracy curve increases. Visualizing these values in real-time provides valuable insights.
  • However, there are instances when your model may become under-fitting or over-fitting. In such cases, it's crucial to monitor the parameter distribution, which are the values your model is expected to predict. By observing if these values change equally, you can identify potential issues and take necessary actions to improve the model's performance.

After Training AI Models

A classification problem.
A classification problem.
  1. Evaluate the results:
    • Take a classification problem, for instance. Neighboring data points will be grouped into similar color groups. If you're displaying the data in tables, it becomes challenging for both analytical and statistical analysis, and it's also difficult to visualize the distances between them on a Cartesian plane.
  2. Perform qualitative analysis.
    • For instance, consider a model that processes images and removes noise. To evaluate the quality of the processed image, you need to display the output image. This process allows you to identify areas for improvement, such as specific scales, and calculate the percentage of noise that has been removed.

Matplotlib and Seaborn Overview

Matplotlib is a tool for data visualization using a MATLAB-like interface. Matplotlib uses the figure as its main structure, which is like a window containing visual elements. Seaborn is built on top of Matplotlib and is designed to integrate with Pandas DataFrames and create more advanced and professional-looking plots.

You can install these two libraries by using the following terminal commands: pip install matplotlib and pip install seaborn.

Matplotlib documentation — Matplotlib 3.10.0 documentation

seaborn: statistical data visualization — seaborn 0.13.2 documentation

Matplotlib Data Structures

The Figure Data Structure

A Figure is like a window that holds all the visual elements, including axes, titles, and graphical objects on a 2D plane.

We’re going to use pyplot to perform operations with Figure.


import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 6])
ypoints = np.array([12, 24])

plt.figure(figsize=(4, 3)) # (4 inch x 3 inch)
plt.title("Draw a line from (0, 12) to (6, 24)")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.plot(xpoints, ypoints)
plt.show()
Draw a line on the plot figure.
Output
Output


import matplotlib.pyplot as plt
import numpy as np

xpoints_a = np.array([1, 4, 6, 8])
ypoints_a = np.array([3, 8, 7, 10])
xpoints_b = np.array([1, 4, 6, 8])
ypoints_b = np.array([6, 4, 3, 12])

plt.figure(figsize=(4, 3)) # (4 inch x 3 inch)
plt.title("Draw multiple points")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.plot(xpoints_a, ypoints_a, 'o', xpoints_b, ypoints_b, '^r')
plt.show()
Draw datapoints on the plot figure.
Output
Output
import matplotlib.pyplot as plt
import numpy as np

xpoints_a = np.array([1, 4, 6, 8])
ypoints_a = np.array([3, 8, 7, 10])
xpoints_b = np.array([1, 4, 6, 8])
ypoints_b = np.array([6, 4, 3, 12])

fig, ax = plt.subplots(nrows = 2, ncols = 1, figsize = (8, 8))
ax[0].plot(xpoints_a, ypoints_a, '-o')
ax[0].set_title("Draw lines of blue points")
ax[1].plot(xpoints_b, ypoints_b, '-^r')
ax[1].set_title("Draw lines of red points")
plt.show(
Draw lines from datapoints.
Output
Output


We can use Matplotlib to draw all sorts of statistical graphs or charts!

Image Source:  https://matplotlib.org
Image Source: https://matplotlib.org

We can also depict data distributions using charts and plots!

Image Source:  https://matplotlib.org
Image Source: https://matplotlib.org

Seaborn Operations

Seaborn is designed to work with Pandas DataFrames, which simplifies the process of loading data for visualization. Seaborn provides a high-level API, which allows users to create complex and professional-looking statistical graphics with less code compared to Matplotlib. You’ll need to refer to the documentation for the usage of these APIs.

Image Source:  https://seaborn.pydata.org/index.html
Image Source: https://seaborn.pydata.org/index.html
import seaborn as sns
sns.set_theme(style="darkgrid")

# Load an example dataset
df = sns.load_dataset("fmri")
sns.lineplot(x = "timepoint", y = "signal",
             hue = "region", style = "event",
             data = df)
Draw a line plot from an example dataset.
Output
Output

How to Effectively Use These Libraries?

To effectively utilize these libraries, remember the following:

  1. Understand the specific usage of each library.
  2. Identify the keywords relevant to your own objectives.
  3. Refer to the documentation of each library when needed.



👏 Thanks for reading!