1. Fundamentals#
Taught by: Dat Doan, Alex Ganose
Getting started#
Welcome to our first practical session! If you haven’t used Jupyter Book before or your Python is getting a little rusty, make sure you complete the Python Refresher notebook before continuing.
The weekly notebooks are designed to be run online directly in your browser. You can activate the server by clicking the rocket icon at the top right and selecting Live Code.
Outline#
This workshop will cover the following content:
Linear regression and error metrics
Polynomial regression and over/underfitting
Logistic and softmax regression
Advanced Topic: Solving linear regression coefficients
Task: Regressing the periodic table
Linear regression#
Linear regression is the simplest and most common form of regression. We’ll explore more powerful forms of function fitting later in the course. There is no need to code linear regression from scratch as there is an optimised class already implemented in the scikit-learn package. The availability of well-developed libraries such as this is a major motivation for using Python in data analytics.
In linear regression, the response variable is a linear function of the input parameters:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$
Let’s start by generating some random data we can use for fitting. We can store them in a Pandas DataFrame for easy manipulation.
import numpy as np
import pandas as pd
num_points =
x = np.random.rand(num_points)
y = 4 + 5 * x + np.random.randn(num_points)
df = pd.DataFrame({'x': x, 'y': y})
df
Code hint
You need to choose the number of points. 10 should be fine, but you have the power to decide.

Let’s plot the data using matplotlib.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(0, 1),
ylim=(3, 11),
)
plt.show()
Now we can train a linear regression model using scikit-learn. We will use the LinearRegression class from the linear_model module. The fit method is used to train the model.
from sklearn.linear_model import LinearRegression
X = df.x.values.reshape(-1, 1) # sklearn expects the features to be a 2D array
y = df.y.values
model = LinearRegression().fit( )
Code hint
You need to input the arguments for the `fit` function. The first argument should be the matrix of features `X` and the second argument should be the target property `y`.

We can explore the fitted model parameters using the intercept_ (\(\beta_0\)) and coef_ (\(\beta_1\)) attributes of the fitted model.
print('Slope:', model1.coef_)
print('Intercept:', model1.intercept_)
Code hint
Is the name of the model variable correct? It should be `model` not `model1`.

We can now plot the line of best fit on top of the data.
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot([0, 1], [model.intercept_, model.intercept_ + model.coef_[0]], color='red')
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(0, 1),
ylim=(3, 11),
)
plt.show()
We can use the trained model to make predictions on new data using the predict function. The fit and predict function names will be the same for all scikit-learn models.
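As a quick aside (not part of the workshop exercises), this shared interface means another regressor can be dropped in with the same two calls. The Ridge model below is just an assumed example and reuses the X and y defined above.

from sklearn.linear_model import Ridge

# aside: the same fit/predict pattern with a different scikit-learn estimator
# (assumes the X and y arrays defined in the cells above)
ridge_model = Ridge()
ridge_model.fit(X, y)
print(ridge_model.predict(X)[:5])  # predictions for the first five samples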
y_pred = model.predict( )
Code hint
You need to input the argument for the `predict` function. This function takes the matrix of features `X` as input.

We can now calculate the residuals and compute error metrics for our model. Here, we will just calculate the mean absolute error (MAE). Other error metrics will be covered in Workshop 2.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')
# note, MAE is just as easily calculated using numpy too
mae = np.mean(np.abs(y - y_pred)**2)
print(f'Mean Absolute Error: {mae:.2f}')
Something doesn’t look right. Why are the two values different?
Code hint
It looks like we accidentally calculated the mean squared error, not the mean absolute error. You should remove the `**2`.

We can plot a final graph showing the interpretation of MAE.
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x.repeat(2).reshape(-1, 2).T, np.c_[y, y_pred].T, c="grey", zorder=-1)
ax.plot([0, 1], [model.intercept_, model.intercept_ + model.coef_[0]], color='red')
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(0, 1),
ylim=(3, 11),
)
plt.show()
Polynomial regression#
Linear regression is only suitable for linear problems. These only make up a small fraction of the chemical problems you might be interested in.
One simple generalisation of linear regression is polynomial regression. Here the response variable depends on powers of the input variables:

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots = \phi(\mathbf{x})^T\boldsymbol{\beta} $$

where \(\phi\) is the link/basis function. Many popular machine learning methods (e.g. SVMs, neural nets, decision trees, etc.) can be seen as just different ways of estimating basis functions.
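To make the idea of a basis expansion concrete, here is a minimal sketch (illustrative only, with made-up numbers) of building the degree-2 features \(\phi(x) = [x, x^2]\) by hand with numpy:

import numpy as np

# illustrative only: a hand-built degree-2 polynomial basis for made-up inputs
# (the intercept column is handled separately by LinearRegression)
x_demo = np.array([0.5, 1.0, 2.0])
phi_demo = np.column_stack([x_demo, x_demo**2])
print(phi_demo)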
First we generate some noisy polynomial reference data and store it in a pandas DataFrame.
import numpy as np
import pandas as pd
num_points =
x = 6 * np.random.rand(num_points) - 3
y = 0.5 * x ** 2 + x + 2 + np.random.randn(num_points)
df = pd.DataFrame({'x': x, 'y': y})
df
Code hint
You need to choose the number of points. 50 should be fine, but you have the power to decide.

Now plot the data.
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(-3.5, 3.5),
ylim=(-1.5, 11),
)
plt.show()
Scikit-learn doesn’t have a PolynomialRegression class. Instead, we have to create powers of the features manually and use these with LinearRegression.
To do this we can use the PolynomialFeatures class from the preprocessing module.
from sklearn.preprocessing import PolynomialFeatures
X = df.x.values.reshape(-1, 1) # sklearn expects the features to be a 2D array
# here we do not include the bias in the polynomial features since this will
# be added automatically by the linear regression model
poly_features = PolynomialFeatures(degree=2, include_bias=False)
# we can now transform our features to include the polynomial features
X_poly = poly_features.fit_transform( )
X_poly
Code hint
You need to input the argument for the `fit_transform` function. This function takes the matrix of features `X` as input.

Note, the first column of X_poly is the original feature, \(x\), and the second column is \(x^2\).
We can now train our model using the polynomial features. Note that we are now performing multivariate linear regression since we have more than one feature per sample.
model = LinearRegression()
model.fit(X_poly, y)
print('Slope:', model.coef_)
print('Intercept:', model.intercept_)
How do the coefficients compare to the true values of the polynomial? You can check what these were when we defined the noisy training data.
We can plot the fitted polynomial against our training data and calculate the mean absolute error.
# use the model to predict across the range of x values
X_new = np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = model.predict(X_new_poly)
# plot the results
fig, ax = plt.subplots()
ax.scatter(X, y)
ax.plot(X_new, y_new, c="red")
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(-3.5, 3.5),
ylim=(-1.5, 11),
)
plt.show()
# calculate the mean absolute error
y_pred = model.predict(X_poly)
mae = mean_absolute_error( )
print(f'Mean Absolute Error: {mae:.2f}')
Code hint
You need to input the arguments for the `mean_absolute_error` function. This function takes the predicted and ground truth values of `y`.

In general, the higher the degree of the polynomial, the better the model will fit the data. However, this can lead to overfitting, where the model is too closely tailored to the training data and does not generalize well to new, unseen data.
This is why it is important to evaluate the model on a separate test set to ensure that it is not overfitting. This will be discussed more in Workshop 2.
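As a minimal sketch of that idea (using freshly generated data rather than the workshop variables), you could hold out part of the data with scikit-learn's train_test_split and compare training and test errors for different polynomial degrees:

from sklearn.model_selection import train_test_split

# sketch only: compare train and test error for low- and high-degree fits
# on synthetic data similar to the example above (assumes the imports used earlier)
rng = np.random.default_rng(0)
x_demo = 6 * rng.random(50) - 3
y_demo = 0.5 * x_demo**2 + x_demo + 2 + rng.normal(size=50)
X_train, X_test, y_train, y_test = train_test_split(
    x_demo.reshape(-1, 1), y_demo, test_size=0.3, random_state=0
)

for d in (2, 20):
    poly = PolynomialFeatures(degree=d, include_bias=False)
    reg = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_mae = mean_absolute_error(y_train, reg.predict(poly.transform(X_train)))
    test_mae = mean_absolute_error(y_test, reg.predict(poly.transform(X_test)))
    print(f'degree={d}: train MAE {train_mae:.2f}, test MAE {test_mae:.2f}')

The next cell compares fits of several degrees on the training data directly.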
fig, ax = plt.subplots()
ax.scatter(x, y)
for degree in (1, 2, 20):
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly_features.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
y_new = model.predict(poly_features.transform(X_new))
ax.plot(X_new, y_new, label=f'degree={degree}')
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(-3.5, 3.5),
ylim=(-1.5, 11),
)
ax.legend()
plt.show()
The degree=1 model is underfit to the data, while the degree=20 model is overfit.
What happens if you set the maximum degree to 100? What do you think is happening here?
Hint
With very high polynomial degrees, you run into an issue called an explosion of terms. The large number of variables means each is given a very small weight (see the model coefficients). This can lead to numerical stability issues. One way to avoid this problem is to scale the input features. This is a good idea in general for linear regression, but it is particularly important when using polynomial features.

You can achieve this using the StandardScaler class from the preprocessing module.
from sklearn.preprocessing import StandardScaler
poly_features = PolynomialFeatures(degree=100, include_bias=False)
scaler = StandardScaler()
X_poly = poly_features.fit_transform(X)
X_scaled = scaler.fit_transform(X_poly)
model = LinearRegression()
model.fit(X_scaled, y)
y_new = model.predict(scaler.transform(poly_features.transform(X_new)))
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(X_new, y_new, label='degree=100')
ax.set(xlabel='$x$', ylabel="$y$", xlim=(-3.5, 3.5), ylim=(-1.5, 11))
ax.legend()
plt.show()
Logistic regression#
So far we have focussed on linear and non-linear regression, which are suitable for predicting continuous response variables. For classification tasks, these approaches do not give realistic results.
Instead, one of the simplest approaches for tackling these problems is logistic regression. Here, we use the sigmoid function as the basis, giving

$$ \phi(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $$
We can plot this function using matplotlib.
def sigmoid(x, beta0=0, beta1=1):
return # EDIT ME
fig, ax = plt.subplots()
x = np.linspace(-6, 6, 100)
for beta1 in (1, 2, 6):
ax.plot(x, sigmoid(x, beta1=beta1), label=f'$\\beta_1={beta1}$')
ax.set(
xlabel='$x$',
ylabel=r"$\phi(x)$",
xlim=(-6, 6),
ylim=(-0.1, 1.1),
)
ax.legend()
ax.axhline(0, c='grey', ls='--')
ax.axhline(1, c='grey', ls='--')
plt.show()
Code hint
You need to input the code for the sigmoid function. If you get stuck, you can use the code below:

return 1 / (1 + np.exp(-beta0 - beta1 * x))
Let’s apply logistic regression to the iris dataset from scikit-learn. This dataset contains 150 samples of iris flowers, each with four features. The target variable is the species of iris, which can be one of three possible values: setosa, versicolor, or virginica (given as integers 0, 1, and 2, respectively).
First, we load the dataset and create a pandas DataFrame object.
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(
data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['target'],
)
df.target = df.target.astype(int)
df
Play around with the dataset. What are the features and what ranges do they have? What is the target variable and what are the possible values it can take?
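If you are unsure where to start, one possible first look (just a suggestion) is the built-in pandas summaries:

# one possible starting point for exploring the iris DataFrame
print(df.describe())              # ranges of the four numeric features
print(df.target.value_counts())   # number of samples in each class (0, 1, 2)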
Let’s now train a logistic regression model on the dataset. To begin with, we’ll only use the petal width feature to predict the target (univariate regression).
from sklearn.linear_model import LogisticRegression
X = df[["petal width (cm)"]].values.reshape(-1, 1)
y = (df.target == 2).astype(int) # 1 if Iris virginica, else 0
model = LogisticRegression()
model.fit( )
Code hint
You need to input the arguments for the `fit` function. The first argument should be the matrix of features `X` and the second argument should be the target property `y`.

We can look at the fitted model parameters for LogisticRegression in the same way as for LinearRegression. The naming is slightly confusing:
intercept_ (\(\beta_0\)): the log odds when the \(x\) variable is 0.
coef_ (\(\beta_1\)): how much the log odds change with an increase (or decrease) in \(x\) by 1.0.
print('β₀:', model.intercept_)
print('β₁:', model.coef_)
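As a small side check (not part of the exercise), the log odds at an arbitrary petal width, here 1.7 cm, can be converted back into a probability and compared against predict_proba:

# side check: convert the log odds at an arbitrary petal width into a probability
petal_width = 1.7
log_odds = model.intercept_[0] + model.coef_[0][0] * petal_width
prob = 1 / (1 + np.exp(-log_odds))
print('Manual probability:', prob)
print('predict_proba:     ', model.predict_proba([[petal_width]])[0, 1])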
For a visual interpretation of the results, we can plot the sigmoid function along with the data points and the decision boundary.
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_prob = model.predict_proba(X_new)
decision_boundary = X_new[y_prob[:, 1] >= 0.5][0]
fig, ax = plt.subplots()
ax.scatter(X, y, c=y, cmap='tab10', vmax=10)
ax.axvline(decision_boundary, color='grey', ls="--")
ax.plot(X_new, y_prob, label=['Not Iris virginica', 'Iris virginica'])
ax.text(decision_boundary + 0.1, 0.5, 'Decision boundary', ha='left')
ax.legend(loc="center left")
ax.set(
xlabel='Petal width (cm)',
ylabel='Probability',
xlim=[0, 3],
ylim=[-0.02, 1.02]
)
plt.show()
What is the decision boundary for our logistic regression model?
For classification problems, we can’t use MAE as an error metric. Instead, we’ll use log loss (also called cross-entropy loss). This will be introduced in Workshop 2.
We can calculate the loss for our model.
from sklearn.metrics import log_loss
y_pred = model.predict( )
loss = log_loss( )
print(f'Log loss: {loss:.2f}')
Code hint
You need to input the arguments for the `predict` and `log_loss` functions. For `predict`, the argument should be the matrix of features `X`. For `log_loss`, the arguments are the ground truth `y` and the predicted values `y_pred` (the order doesn’t matter).
We can try improving our model by including more features. Let’s extend the model to include both petal width and petal length.
X = df[["petal length (cm)", "petal width (cm)"]].values
model = LogisticRegression()
model.fit( )
y_pred = model.predict(X)
loss = log_loss(y, y_pred)
print(f'Log loss: {loss:.2f}')
Code hint
You need to input the arguments for the `fit` function. The first argument should be the matrix of features `X` and the second argument should be the target property `y`.

We can see the loss has decreased!
We can plot the decision boundary in 2D.
fig, ax = plt.subplots()
ax.scatter(*X.T, c=y, cmap='tab10', vmax=10)
ax.axline((0, -model.intercept_[0] / model.coef_[0][1]), slope=-model.coef_[0][0] / model.coef_[0][1], color='grey', ls='--')
ax.text(3.5, 1.5, 'Not Iris virginica', color='C0', ha='center')
ax.text(6.5, 2.3, 'Iris virginica', color='C1', ha='center')
ax.set(
xlabel='Petal length (cm)',
ylabel='Petal width (cm)',
xlim=[2.9, 7],
ylim=[0.8, 2.7]
)
plt.show()
Softmax regression#
We can further generalise logistic regression to support multiple classes by using softmax regression (also known as multinomial logistic regression). In this case, the model predicts the probability of each class and the class with the highest probability is selected as the predicted class.
The general concept is as follows: for any \(x\), compute a score \(s_k(x)\) for each class \(k\) and then convert these scores to probabilities (\(\hat{p}_k\)) using the softmax function:

$$ \hat{p}_k = \frac{\exp(s_k(x))}{\sum_{j=1}^{K} \exp(s_j(x))} $$
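As a quick illustration (with made-up class scores), the softmax function itself is only a couple of lines of numpy:

# illustrative only: softmax over a made-up vector of class scores
scores = np.array([2.0, 1.0, 0.1])
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs, probs.sum())  # the probabilities sum to 1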
We now use the full target value of the iris dataset, which can take values of 0, 1, and 2. The model will automatically detect that this is a multinomial problem.
X = df[["petal length (cm)", "petal width (cm)"]].values
y = df.target
model = LogisticRegression()
model.fit(X, y)
# generate a large number of points that cover the full space
x0, x1 = np.meshgrid(
np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 500).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_predict = model.predict(X_new)
zz = y_predict.reshape(x0.shape)
fig, ax = plt.subplots()
ax.scatter(*X.T, c=y, cmap='tab10', vmax=10)
ax.contourf(x0, x1, zz * 2 + 1.5, cmap="tab20", vmax=20, vmin=0, zorder=-1)
ax.set(
xlabel='Petal length (cm)',
ylabel='Petal width (cm)',
xlim=[0.5, 7.5],
ylim=[0, 2.6]
)
ax.text(1.5, 0.8, 'Setosa', color='C0', ha='center')
ax.text(4, 0.8, 'Versicolor', color='C1', ha='center')
ax.text(6.5, 1.5, 'Virginica', color='C2', ha='center')
plt.show()
Advanced Topic: Solving linear regression coefficients#
The analytical solution to the linear regression problem is given by the following formula:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
Where \(\mathbf{X}\) is the matrix of features for all samples. This is called the normal equation. Keep in mind that for many models, we can’t derive analytical solutions. In those cases we need to make use of approximations as we will see further below.
Let’s calculate the coefficients for the simple linear regression example we used earlier. In order to do this, we need to include a column of ones in the feature matrix \(\mathbf{X}\) to account for the intercept term. For example, our feature matrix \(\mathbf{X}\) will look like this:

$$ \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} $$
Similarly, the coefficient vector \(\boldsymbol{\beta}\) will look like this:

$$ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} $$
First, let’s create some test data.
num_points = 100
x = np.random.rand(num_points)
y = 4 + 5 * x + np.random.randn(num_points)
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(-0.1, 1.1),
ylim=(2.5, 11.5),
)
plt.show()
Add the constant term to the features.
# add x0 = 1 to each data point
X_b = np.stack([np.ones(num_points), x], axis=1)
X_b
Finally, we can solve for the optimal values of \(\beta\) using the normal equation:
beta = # FILL ME IN
print('β₀:', beta[0])
print('β₁:', beta[1])
Code hint
This was a hard one. We need to use the linear algebra routines from numpy. Something like this will work:

beta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
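As a side note (not required for the exercise), explicitly inverting \(\mathbf{X}^T\mathbf{X}\) can be numerically fragile; numpy's least-squares solver returns the same coefficients more robustly:

# side note: a more numerically stable alternative to the explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print('β₀:', beta_lstsq[0])
print('β₁:', beta_lstsq[1])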
Let’s check the fit.
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot([0, 1], [beta[0], beta[0] + beta[1]], color='red')
ax.set(
xlabel='$x$',
ylabel="$y$",
xlim=(-0.1, 1.1),
ylim=(2.5, 11.5),
)
plt.show()
Task: Regressing the periodic table#
For this exercise, we’re going to use a periodic table data set which has been uploaded to GitHub. You should use the skills developed in this workshop to:
Explore the data. What does each field mean? Plot some of the numeric data.
Try a linear regression on atomic number and atomic weight.
Can a softmax regression predict the phase at room temperature from the melting and boiling points?
Think of any other questions you may be able to answer through linear or logistic regression.
You should use your creativity to come up with novel approaches to understanding the dataset.
You can use pandas to load a dataframe directly from a website.
df = pd.read_csv("https://raw.githubusercontent.com/utf/DataAnalyticsChemistry/refs/heads/main/datasets/periodic-table.csv")
df
# Your code here
Additional reading#
This notebook can be complemented with Chapter 1 of An Introduction to Statistical Learning for more background on the field.