STAT-H400 - Lab 0

(re)introduction to Python for data analysis

Objectives

Use Python and common data science libraries to perform simple operations: read/write CSV files, extract variables, plot data, print summaries.
Good practices in Python (type hints, main function…).
Using the libraries’ documentation.

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while they are not forbidden, I strongly recommend not using GenAI tools until you really understand the code that you are trying to write. GenAI tools are unreliable and notoriously bad as “learning” tools. They require a lot of micromanagement to be accurate. They can boost the productivity of skilled developers, but they will stop you from growing your own skills if used too soon.

Exercise 1

Objectives: Create a random data generator using the NumPy library, and plot a histogram of this data using Matplotlib.

Create a new Python file (suggested name: random_data_generator.py).
Create a function that takes as input a sample size (integer) and returns a numpy array with one column and one row per sample, filled with zeros. Test that it works as intended.

Hint: look at the documentation for the np.zeros function.
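As a minimal sketch of the call itself (not the full function asked for above), np.zeros accepts a shape tuple, so an array with one row per sample and one column can be requested directly. The value of n_samples is arbitrary and chosen only for this illustration.

```python
import numpy as np

n_samples = 5  # arbitrary value chosen for this illustration

# Shape (n_samples, 1): one row per sample, a single column.
zeros = np.zeros((n_samples, 1))

print(zeros.shape)  # (5, 1)
print(zeros.dtype)  # float64 (the default dtype of np.zeros)
```

Checking the shape and dtype like this is one simple way to test that the array looks as intended.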

Modify that function so that it also takes a mean value and a variance value as parameters. Instead of filling the array with zeros, generate random numbers from a Gaussian distribution with the given parameters.

Hint: look at the scipy.stats package for information on generating random numbers from a distribution.
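One detail worth checking in the documentation: scipy.stats.norm is parameterized by loc (the mean) and scale (the standard deviation), not by the variance. The sketch below shows one possible way to draw samples under that convention. The numeric values and the fixed random_state are assumptions made for reproducibility, not part of the exercise.

```python
import numpy as np
from scipy import stats

mean = 2.0        # illustrative values, not taken from the lab
variance = 4.0
n_samples = 1000

# scale expects a standard deviation, so take the square root of the variance.
samples = stats.norm.rvs(
    loc=mean,
    scale=np.sqrt(variance),
    size=(n_samples, 1),   # one row per sample, one column
    random_state=0,        # fixed seed, only so that reruns give the same numbers
)

print(samples.shape)
print(samples.mean(), samples.var())  # should be close to 2.0 and 4.0
```

Comparing the empirical mean and variance with the requested ones is a quick sanity check that the parameters were passed correctly.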

Starter code: best practice for scripts in Python.

"""Title

Description of the script
"""

def my_method(my_argument: str) -> int:
    """Short description of the method
    """
    return 1


def main():
    pass


if __name__ == "__main__":
    main()

Exercise 2

Objectives: plot data using matplotlib’s pyplot module, demonstrate code reusability.

Create a new Python file (suggested name: visualize_data.py) in the same directory as the random generator.
Create a method that plots a histogram of the values in a numpy array.

Hint: look at the documentation of pyplot.hist.
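A minimal sketch of a pyplot.hist call, assuming placeholder data drawn with NumPy rather than the generator from exercise 1. The number of bins and the axis labels are arbitrary choices for the illustration.

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)     # fixed seed so the figure is reproducible
data = rng.normal(size=(200, 1))   # placeholder data for this illustration

plt.figure()
# Flatten the (n, 1) array so that hist treats it as a single dataset.
# hist returns the bin counts, the bin edges, and the drawn patches.
counts, edges, _ = plt.hist(data.ravel(), bins=20)
plt.xlabel("value")
plt.ylabel("count")
# Call plt.show() here to display the figure when running as a script.
```

With 20 bins, edges holds 21 values, and the counts add up to the number of samples.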

Import the random generator from exercise 1 and use the new function to visualize the values in random samples of different sizes and distribution parameters.

Starter code (names need to be adapted based on your own work!):

from matplotlib import pyplot as plt
import numpy as np

from random_data_generator import generate_samples


def visualize(data: np.ndarray) -> None:
    plt.figure()
    # plot histogram
    plt.show()


def main():
    data = generate_samples(100, 0., 1.) # assuming sample_size, mu, variance
    visualize(data)


if __name__ == "__main__":
    main()

Exercise 3

Objectives: read a CSV file using the standard csv library from Python, and extract a variable based on a simple filter.

The height.csv file (available on the Virtual University) contains information about the gender and the height of 40 students (source from the file: companion data to King & Eckersley, 2019).

Create a method that reads a CSV file and returns a numpy array with its content. Test it on the height.csv file.
Create a method that filters the array to keep only the heights of a target gender, making sure to convert them to integers. The gender is provided as an argument to the method, so its signature would be something like def filter_by_gender(data: np.ndarray, gender: str) -> np.ndarray.
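The two steps above can be sketched on a small in-memory table. The column names, the labels "A" and "B", and the heights are invented for this illustration; the actual layout of height.csv may differ, so check the file before adapting this.

```python
import csv
import io

import numpy as np

# Stand-in for an open file; header and values are invented for this sketch.
fake_file = io.StringIO("gender,height\nA,170\nB,165\nA,181\n")

reader = csv.reader(fake_file)
header = next(reader)           # the first row holds the column names
rows = np.array(list(reader))   # every cell is read as a string

# Boolean mask on the first column, then convert the second column to integers.
mask = rows[:, 0] == "A"
heights = rows[mask, 1].astype(int)

print(header)   # ['gender', 'height']
print(heights)  # [170 181]
```

Because the csv module returns every value as a string, the explicit conversion with astype(int) is needed before doing any arithmetic on the heights.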

Exercise 4

Objectives: read a CSV file using the Pandas library, and filter the data.

Using the same CSV file as an example, create a method that reads the CSV file using Pandas and filters by gender, as in exercise 3.

Hints: see pandas.read_csv for how to read CSV files. The “10 minutes to pandas” guide is a good resource on how to manipulate pandas’ DataFrame objects.
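For comparison with the csv-based approach, here is a minimal sketch of the same filter with Pandas, again on an invented in-memory table. The column names and the label "A" are assumptions, not the actual content of height.csv.

```python
import io

import pandas as pd

# Stand-in for the CSV file; header and values are invented for this sketch.
fake_file = io.StringIO("gender,height\nA,170\nB,165\nA,181\n")

# read_csv uses the first row as column names and infers numeric types.
df = pd.read_csv(fake_file)

# Boolean indexing keeps only the rows where the condition holds.
selected = df[df["gender"] == "A"]
heights = selected["height"].to_numpy()

print(heights)  # [170 181]
```

Here no explicit integer conversion is needed, because read_csv already parsed the height column as integers.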

Modify your data visualization file so that it gets the data from the CSV file (import the method you’ve just made, don’t try to re-code it in the visualization file!) and plots a histogram for each gender.

Optional: Notebook

Jupyter Notebooks are a very common way to present reports & data analysis. They are, however, also a common source of errors for students (and beyond), as it is very easy to accidentally overwrite variables or methods in the namespace by running the cells in an unknown order.

When using Notebooks, I recommend the following process:

You can use the Notebook to quickly prototype the methods that you need (work in small increments, don’t try to do the whole pipeline in one go).
Once a method is tested and functional, put it in an appropriately named file outside of the notebook.
Import the method in the notebook for further usage.
At the end of the whole process, restart the notebook and run everything from scratch to make sure that the output you get is reliable.

Exercise: write a report for this lab in a Jupyter Notebook, importing and demonstrating the methods you wrote from their respective files. You’ll need to put the Notebook in the same directory as the files (or put the files in a module that you install locally, but that is beyond the scope of these laboratories!).

Note that it won’t be compulsory to write the reports in a Notebook for the “real” laboratories.