Python Refresher#
Note: This refresher was based on the ML for Materials course developed by Prof. Aron Walsh.
Welcome to Jupyter Book#
The workshops for the Data Analytics in Chemistry module are provided as Jupyter Notebooks embedded in a Jupyter Book. A notebook combines formatted text, runnable computer code, and the output of that code in a single document that is easy to share, which makes Jupyter notebooks a useful tool for analysing data.
Unlike spreadsheets or collections of separate data analysis scripts, a notebook lets you keep everything in one place: descriptions and notes for individual experiments, links to the raw data collected, the computer code that performs any necessary data analysis, and the final figures generated from these data, ready for use in a report or published paper.
There are a few components to be aware of:
Python#
A working knowledge of the Python programming language is assumed for this course. If you are rusty, Chapters 1-4 of Datacamp cover the base concepts, as do many other online resources including Imperial’s Introduction to Python course.
Markdown#
Markdown is a markup language that allows easy formatting of text. It is widely used for creating and formatting online content, and it is easier to read and write than HTML. A guide to the syntax can be found here.
# Heading
## Smaller heading
### Even smaller heading
Write an equation#
This is written in LaTeX format. It’s easy to learn and useful for complex expressions.
$-\dfrac{\hslash^2}{2m} \, \dfrac{\partial^2 \psi}{\partial x^2}$
which renders as
\(-\dfrac{\hslash^2}{2m} \, \dfrac{\partial^2 \psi}{\partial x^2}\)
Link an image#
The syntax used here is Markdown, which can be used in notebooks. It is also popular on GitHub for documentation, and even as a fast way to take notes during lectures.

which renders as
GitHub#
GitHub is a platform for writing and sharing code. There are many materials science projects hosted there, which enable researchers from around the world to contribute to their development.
Running the Notebook#
The weekly notebooks are designed to be run online directly in your browser. You can activate the server by clicking the rocket icon on the top right and selecting Live Code. There is an option to open in Binder or Google Colab, which you may prefer if you are an advanced user, but the formatting won’t be as nice. You can opt to install Python on your own computer with Anaconda and run the notebooks locally, but we do not offer support if things go wrong.
Analyse data with code#
By programming a series of instructions, researchers can consistently obtain the same results from a given dataset. This approach enables us to share datasets and code, allowing other scientists to review, repeat and reuse the analysis. The transparency and reproducibility of code-based analysis enhances research integrity and credibility, while minimising errors. It also enables efficient handling of large datasets and complex calculations, accelerating the exploration of different techniques.
Running code#
Different programming languages can be used in Jupyter notebooks. We will be using Python 3. The large scientific community for Python means that well-developed resources exist for data processing and specific prewritten tools for manipulating and plotting data.
Any code typed into a code cell can be run (executed) by pressing the run button. You can also run the selected code cell using the Shift-Enter combination on your keyboard.
2 + 3 # run this cell
print("Hello World!") # anything after '#' is a comment and ignored
12 * 2.40 * 3737 * 12 # you get the idea
2**1000 - 2 # a big number
import math as m # import a math module
m.pi
20 * m.atan(1/7) + 8 * m.atan(3/79) # Euler's formula for pi
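The cells above can be checked against each other. The sketch below is a minimal illustration, not part of the original workshop: it compares Euler's formula with `math.pi` using a tolerance, and counts the digits of the large integer. The variable names and the tolerance of `1e-12` are illustrative choices.

```python
import math

# Euler's formula: pi = 20*atan(1/7) + 8*atan(3/79)
euler_pi = 20 * math.atan(1 / 7) + 8 * math.atan(3 / 79)

# Floating-point results should be compared with a tolerance, not with ==
difference = abs(euler_pi - math.pi)
agrees = math.isclose(euler_pi, math.pi, rel_tol=1e-12)

print(f"Euler's formula: {euler_pi:.15f}")
print(f"math.pi:         {math.pi:.15f}")
print(f"Agree to within the tolerance: {agrees}")

# Python integers have arbitrary precision, so 2**1000 - 2 is stored exactly
big = 2**1000 - 2
n_digits = len(str(big))
print(f"2**1000 - 2 has {n_digits} digits")
```

Comparing floats with `math.isclose` rather than `==` avoids false mismatches caused by rounding in the last few digits.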
Multidimensional data with NumPy#
NumPy makes it easy to work with multidimensional data such as vectors and matrices. All of the packages used in this course are designed to handle NumPy arrays. Let’s import it and show you some features.
import numpy as np
x = np.arange(0, 10, 0.5) # x = 0 to 9.5 in steps of 0.5 (the stop value 10 is excluded)
x
Many NumPy functions can be run on entire vectors.
np.sin(x) # calculate sin(x) for every element in the array
(x + 10) / 12 # perform numerical operations
np.random.random(10) # generate random numbers
y = np.array([[3, 1, 0], [0, 3, 4], [0, 5, 10]]) # create a matrix (2D array) from scratch
y
np.linalg.norm(y) # linear algebra routines
np.dot(x, x) # dot products
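To make the behaviour of these calls concrete, here is a small self-contained sketch. It is an illustration rather than part of the original workshop: it re-creates `x` and `y` from the cells above and checks three properties. `np.arange` excludes the stop value, a dot product of a vector with itself is its squared norm, and `np.linalg.norm` on a 2D array defaults to the Frobenius norm.

```python
import numpy as np

# np.arange excludes the stop value, so x runs from 0.0 to 9.5
x = np.arange(0, 10, 0.5)
print(x.shape, x[0], x[-1])

# Arithmetic is applied element by element, with no explicit loop
scaled = (x + 10) / 12
print(scaled.shape)

# The dot product of a vector with itself is its squared Euclidean norm
dot_matches_norm = np.isclose(np.dot(x, x), np.linalg.norm(x) ** 2)
print(dot_matches_norm)

# For a 2D array, np.linalg.norm defaults to the Frobenius norm:
# the square root of the sum of the squared elements
y = np.array([[3, 1, 0], [0, 3, 4], [0, 5, 10]])
sum_of_squares = np.sum(y ** 2)
frobenius_matches = np.isclose(np.linalg.norm(y), np.sqrt(sum_of_squares))
print(frobenius_matches)
```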
Plotting with Matplotlib#
Let’s import the package Matplotlib, which we will be using a lot for data visualisation.
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.001) # x = 0 to 9.999 in steps of 0.001 (the stop value 10 is excluded)
y = np.sin(x*x) # define your function
fig, ax = plt.subplots(figsize=(5, 3)) # create a new figure (5x3 inches)
ax.plot(,y) # plot x against y
Code hint
You need to plot x vs y. Fix the plot command so that it takes (x, y).
Using Pandas DataFrames#
Pandas DataFrames are useful tools to store, access, and modify large sets of data. In this course, we’ll make use of Pandas to process input and output data for our machine learning models.
import pandas as pd # Data manipulation with DataFrames
df = pd.DataFrame() # This instantiates an empty Pandas DataFrame
data = {
"Element" : ['C', 'O', 'Fe', 'Mg', 'Xe'],
"Atomic Number" : [6, 8, 26, 12, 54],
"Atomic Mass" : [12, 16, 56, 24, 131]
}
# Let's try loading data into DataFrame df
df = pd.DataFrame(data)
df
# We can make the 'Element' column the index of this DataFrame using the set_index function
df = df.set_index("Element")
df
# Printing the values in the 'Atomic Number' column
print(df["Atom Number"])
Code hint
Check you are printing the correct column name. Try out some of the other options.
# Add a new column
df["Energy (eV)"] = [5.47, 5.14, 0.12, 4.34, 7.01]
print(df["Energy (eV)"])
# Print a row from the DataFrame
# Use the df.loc[label] indexer to print the entry "C" (selects a row by its index label)
print(df.loc[''])
print('-----')
# Use the df.iloc[position] indexer to print the first entry (selects by integer position; counting starts at 0...)
print(df.iloc[0])
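The difference between label-based and position-based selection can be seen on a separate toy table. The sketch below is illustrative only and is not part of the original workshop: the `elements` DataFrame, its rounded integer masses, and the `Mass per Proton` column are all made up for this example.

```python
import pandas as pd

# Illustrative table (not from the workshop); masses are rounded to integers
elements = pd.DataFrame(
    {
        "Element": ["H", "He", "Li"],
        "Atomic Number": [1, 2, 3],
        "Atomic Mass": [1, 4, 7],
    }
).set_index("Element")

# .loc selects a row by its index label, .iloc by its integer position
by_label = elements.loc["He"]
by_position = elements.iloc[1]
same_row = by_label.equals(by_position)
print(same_row)

# A new column can be computed from existing columns, element by element
elements["Mass per Proton"] = elements["Atomic Mass"] / elements["Atomic Number"]

# A boolean condition keeps only the rows where it is True
heavier = elements[elements["Atomic Mass"] > 3]
print(heavier.index.tolist())
```

Because "Element" is the index here, `elements.loc["He"]` and `elements.iloc[1]` return the same row. They would differ if the rows were reordered.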