Top data analysis mistakes in master theses

Here are some of the most common mistakes that students make in their master thesis regarding data analysis, ranked by how much they annoy me (so this is mostly a valuable document if you end up with me as a reader, I guess!).

First the list, then some more detailed explanations below.

Not looking at the data.
The claims are not supported by the results.
Poor visualization choices.
Poor handling of the train/test set separation in machine learning.
Arbitrary choices in the methodology that are not presented as such.
No (or poor) comparison to the state-of-the-art.

Note that I’m focusing here only on the data analysis related mistakes. Otherwise number one on the list would be “plagiarism, even if unintentional” (or maybe “using ChatGPT”), but that’s a whole other topic that I can’t really cover here!

In more details

Not looking at the data

Look at your data. The raw data: tables, images, annotations… Check the types, how information is encoded. Check if there are any weird things in the data (labels that are mispelled, missing values, corrupted images, etc.). Check that the labels match the data (i.e. in a supervised problem, check that you can correctly retrieve the pairs \(x_i, y_i\)). Look at the outputs as well, and at intermediate results through your pipeline if there are any.

Many times, students have issues during their work, and the results aren’t what they expected… and the reason is extremely easy to spot if you just take the time to look at the data, which many students never do.

One example: not noticing that annotations in a segmentation problem were made using a bottom left origin and not a top left origin (i.e. “plot” convention rather than “image” convention) \(\rightarrow\) all annotations were flipped vertically, and the neural network was being trained mostly on random noise.

This means that at no point during the year the student tried to display an image with overlaid annotations to check that everything was in order. Please don’t do that.

The claims are not supported by the results

It generally starts from a big table of results such as:

	Accuracy	AUC
ALGO-1	0.714	0.610
ALGO-2	0.712	0.607
ALGO-3	0.716	0.612

Followed by a statement such as:

We have shown that ALGO-3 is the best choice for solving our task.

The results, however, don’t seem to support this claim. With such small differences in the chosen metrics, we cannot really differentiate between the results. But to be more confident one way or another about the claim, we need to better present the variability in the results:

If the metric is an average over several experiments or samples (e.g. through cross-validation), give standard deviation with the mean.
If the test set is the same for all algorithms, you can use paired samples tests to determine if the differences are statistically significant. Depending on the number of algorithms tested, and the metric, we have different possibilities.
- Binary metric: results per item are “correct” or “incorrect”, so we have an “average of binary values” as the end metric (e.g. Accuracy).
- Numerical metric: results per item are numerical (e.g. IoU in a segmentation problem).

	Binary metric	Numerical metric
2 algorithms	McNemar	t-test (parametric), Wilcoxon (non-parametric)
>2 algorithms	Cochrane’s Q test	ANOVA (parametric), Friedman (non-parametric)

Poor visualization choices

A graph, plot, or image should:

help the reader understand what you’ve done
help the reader understand what works / doesn’t work

Things that don’t help that are often present in master theses:

Boxplots when there are very few samples and/or a highly nonnormal distribution.
Similarly, points that represent “averages” in a scatter plot or line plot, but where the “average” is only for a few points: in that case, putting every point would be a lot more informative so that the “spread” of the distribution is immediately visible.
If there are enough samples: not putting error bars on the graph.
If there are lots of samples: putting all the points to such an extent that nothing is visible anymore (solutions: boxplots, heatmaps, histograms…).

Other “classics”:

Graph with outliers such that the area with interesting things to look at is very small with all points/lines overlapping.
Red-green color maps, or poorly contrasted color maps in general: as a rule of thumb and to avoid problem with the ~4.5% of people with some form of colorblindness, your graph should be legible without colors. Use grayscale contrast and/or markers (dash, diamonds, etc.) to make things more clear.

Poor handling of the train/test set separation in machine learning

This was partially adressed in the “look at your data” section, but this is a more general problem that often starts at the data collection stage.

The problem comes in two main flavours:

A test set that is not independent enough from the training set (“contamination”).

A very classic mistake when handling a dataset is to think: “to make sure that I don’t introduce a bias, I’ll just set aside x% of the dataset for testing by using a random sampling”. This is fine… if your dataset is composed of fully independent samples, which is not often the case.

In a medical dataset, you may have several measures per patient \(\rightarrow\) putting measures from the same patient in both training and test set results, which transforms your problem from “solve task X” to “solve task X on these patients”, as the model/network/algorithm can use knowledge about the patient to better solve the task.
In a large dataset, particularly if scraped from the internet, you may have repetitions (the same item has been included several times, sometimes from different sources with very small differences). This very obviously leads to problems if some of the repetitions end up in both sets.
In image datasets, we often train the models on small images taken from larger images, or we may have several images that show the same thing with small differences (e.g. same building from different angles, same person at different times, etc.). This again contaminates the test set.

A poor methodology that uses the test set to make some choices about the pipeline.

You are developping an algorithm \(\mathcal{A}\) to solve task \(\mathcal{T}\). You have some choices to make about \(\mathcal{A}\), such as the architecture of the network (if using a Deep Learning approach) and all hyper-parameters of the pipeline.

You take your dataset \(\mathcal{D}\) and separate it in two: \(\mathcal{D}_{train}\) and \(\mathcal{D}_{test}\).

You make some arbitrary choices to get a baseline method \(\mathcal{A}_0\). You train \(\mathcal{A}_0\) on \(\mathcal{D}_{train}\) and evaluate the results on \(\mathcal{D}_{test}\).

It doesn’t work as well as you hoped, so you change a few things to get a new method \(\mathcal{A}_1\). You train \(\mathcal{A}_1\) on \(\mathcal{D}_{train}\) and evaluate the results on \(\mathcal{D}_{test}\).

Repeat until you get to \(\mathcal{A}_k\), your best version of the algorithm.

What’s the problem? You’ve used \(\mathcal{D}_{test}\) to make choices about \(\mathcal{A}\). That means that \(\mathcal{D}_{test}\) is no longer a test set. Your “best algorithm” has been designed to be the best on that particular set. But if you now want to know the generalization capabilities of \(\mathcal{A}_k\)… you can’t. Because you no longer have a test set avaible.

What to do instead? Make all your choices based on \(\mathcal{D}_{train}\) and \(\mathcal{D}_{train}\) only, from which a \(\mathcal{D}_{validation}\) set may be extracted (e.g. in a cross-validation pattern). Use \(\mathcal{D}_{test}\) only at the very end, once all your choices have been made, as a final test for the capabilities of \(\mathcal{A}_k\). Once you have used it… you’re done. You cannot keep improving the algorithm, because you no longer have a test set.

Arbitrary choices in the methodology that are not presented as such

There are many choices that need to be made when making a study, or designing an algorithm, or a device. In a master thesis, you don’t have the time (nor, often, the resources) to test everything. Some of the choices that you make will be arbitrary, or of convenience: this is what was available, this is from a piece of code I reused from somewhere and it worked, this is from a few tests I did without recording the results…

This is fine, and expected (as long as that’s not the case for all your choices, of course…).

What is not fine is to then pretend in the text that “those parameters were set empirically to get the best possible result” (paraphrasing from many different works). If you claim that the parameters are really the right ones, and better than potential alternatives, you need to be able to show it with some systematic testing. Otherwise, you need to be honest about what your process was to make those choices. And about your sources if you are “borrowing” the parameters from somewhere else.

No (or poor) comparison to the state-of-the-art

Sometimes, you design/develop/create something, you test it, you compare it to results from other scientific papers in the literature, and you’re clearly not at the level of the state-of-the-art.

This is perfectly normal. Sometimes, the state-of-the-art is a big laboratory, twenty post-doctoral researchers, and essentially infinite material resources. You won’t beat them. That doesn’t mean that you can’t make something interesting, though. And sometimes, you just made some mistakes that you don’t have the time to correct. Again, normal: you have a strict deadline.

What you can do (that they can’t or won’t do): - Provide a more critical analysis of the datasets, metrics, results that you and they obtain. You’re not Meta or Google, so you don’t have shareholders that wouldn’t like to hear that your results are not as good as they seem. Use that freedom! - Thoroughly test one specific aspect, trying to understand in depth how it works.

What you shouldn’t do: - Pretend that the state-of-the-art doesn’t exist and offer no comparison to well-known methods. Your reader will probably now about those methods, or will find them easily. - Find some creative way to pretend that your results are actually really good and comparable to the state-of-the-art when they clearly aren’t.