(re)introduction to Python for data analysis
Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while they are not forbidden, I highly recommend not using GenAI tools until you really understand the code that you are trying to write. GenAI tools are unreliable, and notoriously bad as a “learning” tool. They require a lot of micromanagement to be accurate. They can be helpful as a productivity boost for skilled developers, but will stop you from growing your own skills if used too soon.
Objectives: create a random data generator using the NumPy library, and plot a histogram of this data using Matplotlib.
random_data_generator.py
Hint: look at the documentation for the np.zeros method.
Hint: look at the scipy.stats package for information on generating random numbers from a distribution.
Starter code: best practice for scripts in Python.
"""Title

Description of the script
"""

def my_method(my_argument: str) -> int:
    """Short description of the method
    """
    return 1

def main():
    pass

if __name__ == "__main__":
    main()
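As an illustration of how the starter template might be filled in, here is a minimal sketch of a generator for normally distributed samples. The name generate_samples and the argument order (sample_size, mu, variance) follow the starter code of the visualization exercise; the seed argument is an addition for reproducibility, and the choice of a normal distribution is an assumption. It uses NumPy's own random generator; scipy.stats offers an equivalent route.

```python
"""Random data generator

Draws samples from a normal distribution with NumPy.
"""
from typing import Optional

import numpy as np


def generate_samples(sample_size: int, mu: float, variance: float,
                     seed: Optional[int] = None) -> np.ndarray:
    """Return `sample_size` draws from a normal distribution N(mu, variance)."""
    # `seed` is an illustrative extra argument: the same seed gives the same samples.
    rng = np.random.default_rng(seed)
    # Pre-allocate the output array (see the np.zeros hint), then fill it in place.
    samples = np.zeros(sample_size)
    # NumPy's `scale` is the standard deviation, hence the square root of the variance.
    samples[:] = rng.normal(loc=mu, scale=np.sqrt(variance), size=sample_size)
    return samples


def main():
    data = generate_samples(100, 0.0, 1.0, seed=42)
    print(data.shape, data.mean(), data.var())


if __name__ == "__main__":
    main()
```

With only 100 samples, the printed mean and variance will be close to, but not exactly, the requested values.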
Objectives: plot data using matplotlib’s pyplot module, demonstrate code reusability.
visualize_data.py (in the same directory as the random generator)
Hint: look at the documentation of pyplot.hist.
Starter code (names need to be adapted based on your own work!):
from matplotlib import pyplot as plt
import numpy as np
from random_data_generator import generate_samples

def visualize(data: np.ndarray) -> None:
    plt.figure()
    # plot histogram
    plt.show()

def main():
    data = generate_samples(100, 0., 1.)  # assuming sample_size, mu, variance
    visualize(data)

if __name__ == "__main__":
    main()
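One possible way to fill in the histogram step is sketched below. It is a variation on the starter code, not the expected solution: the bins argument, the axis labels, and returning the figure and axes (instead of calling plt.show() inside the function) are all choices made for this example. The data is a stand-in; in the lab it would come from your own generator.

```python
from matplotlib import pyplot as plt
import numpy as np


def visualize(data: np.ndarray, bins: int = 20):
    """Draw a histogram of `data` and return the figure and axes."""
    fig, ax = plt.subplots()
    # Axes.hist counts how many values fall into each of the `bins` intervals.
    ax.hist(data, bins=bins)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    ax.set_title(f"Histogram of {data.size} samples")
    return fig, ax


# Stand-in data for the example (not the output of the lab's generator).
example_data = np.random.default_rng(0).normal(0.0, 1.0, size=100)
fig, ax = visualize(example_data)
```

Calling plt.show() after visualize(...) displays the figure. Keeping that call outside the function makes it easier to reuse the function, for instance to save the figure with fig.savefig instead of showing it.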
Objectives: read a CSV file using the standard csv library from Python, and extract a variable based on a simple filter.
The height.csv file (available on the Virtual University) contains information about the gender and the height of 40 students (source from the file: companion data to King & Eckersley, 2019).
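To give an idea of the workflow, the sketch below reads a CSV with the standard csv module and filters it with the filter_by_gender signature used in this exercise. The layout is an assumption: a header row, then gender in the first column and height in the second. The rows, the gender codes "F" and "M", and the read_csv helper are made up for the example and are not the contents of height.csv.

```python
import csv
import io

import numpy as np


def read_csv(file) -> np.ndarray:
    """Read an open CSV file into a 2D array of strings, skipping the header."""
    reader = csv.reader(file)
    next(reader)  # assumption: the first row holds the column names
    return np.array(list(reader))


def filter_by_gender(data: np.ndarray, gender: str) -> np.ndarray:
    """Return the heights (as integers) of the rows matching `gender`."""
    # Assumption: column 0 is the gender, column 1 is the height.
    mask = data[:, 0] == gender
    return data[mask, 1].astype(int)


# Made-up rows standing in for the real file.
example = io.StringIO("gender,height\nF,165\nM,180\nF,172\n")
data = read_csv(example)
print(filter_by_gender(data, "F"))  # prints [165 172]
```

For a file on disk, the same helper can be used inside a with open("height.csv", newline="") as f: block, after checking the actual column order and gender labels in the file.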
height.csv file.
def filter_by_gender(data: np.ndarray, gender: str) -> np.ndarray
making sure to convert them to integers.
Objectives: read a CSV file using the pandas library, and filter the data.
Hints: see pandas.read_csv for how to read CSVs. 10 minutes to pandas is a good resource on how to manipulate pandas' DataFrame objects.
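As a rough illustration of the pandas approach, the sketch below loads a CSV into a DataFrame and filters it with a boolean mask. The column names gender and height, the gender codes, the rows, and the heights_by_gender helper are all assumptions made for the example; check the header of the real file before reusing it.

```python
import io

import numpy as np
import pandas as pd


def heights_by_gender(df: pd.DataFrame, gender: str) -> np.ndarray:
    """Return the heights of the rows whose gender column equals `gender`."""
    # The comparison yields a boolean Series; indexing with it keeps matching rows.
    selected = df[df["gender"] == gender]
    return selected["height"].to_numpy()


# Made-up rows standing in for the real file.
example = io.StringIO("gender,height\nF,165\nM,180\nF,172\n")
df = pd.read_csv(example)
print(heights_by_gender(df, "F"))
```

pandas infers the column types on its own, so the heights are already numeric here and need no explicit conversion, unlike with the csv module.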
Jupyter Notebooks are a very common way to present reports & data analysis. They are, however, also a common source of errors for students (and beyond), as it is very easy to accidentally overwrite variables or methods in the namespace by running the cells in an unknown order.
When using Notebooks, I recommend the following process:
Exercise: write a report for this lab in a Jupyter Notebook, importing and demonstrating the methods you wrote from their respective files. You’ll need to put the Notebook in the same directory as the files (or put the files in a module that you install locally, but that is beyond the scope of these laboratories!).
Note that it won’t be compulsory to write the reports in a Notebook for the “real” laboratories.