Midterm 2 (2024)
Assignment Rules and Execution
The second midterm covers the probabilistic models introduced up to Lecture 14. To pass the midterm you should
- perform one (only one) of the assignments described in the following;
- report your results as a Colab notebook or a 10-slide presentation (both versions are equally fine) and upload it here by the (strict) deadline.
If you are delivering a Colab notebook, please upload a txt file with the link to your Colab notebook (already run).
You can use library functions to perform the analysis unless explicitly indicated.
You can use whatever programming language you like, but I strongly suggest using either Python or Matlab, for which you have coding examples. Python will further allow you to deliver your midterm as a Colab notebook, but this is not a requirement (you can deliver a presentation instead).
Your report (irrespective of whether it is a notebook or a presentation) needs to cover at least the following aspects (different assignments might have additional requirements):
- A title with the assignment number and your name
- The full code to run your analysis (for Colabs) or a few slides (for presentations) with code snippets highlighting the key aspects of your code
- A section reporting the results of the analysis and your brief comments on them
- A final section with your personal considerations (fun things, weak aspects, possible ways to enhance the analysis, etc.).
Do not waste time and space describing the dataset or the assignment you are solving, as we are all already familiar with them.
List of Midterm Assignments
Assignments are based on the following two datasets already used in Midterm 1: refer to the Midterm 1 page for details on their content.
DSET1 (Signal processing): https://archive.ics.uci.edu/dataset/360/air+quality
DSET2 (Image processing): www.kaggle.com/datasets/ztaihong/weizmann-horse-database/data
In addition to these datasets from Midterm 1, we will use these additional datasets:
DSET3 (Image processing: MNIST): http://yann.lecun.com/exdb/mnist/
DSET4 (Car evaluation): https://archive.ics.uci.edu/dataset/19/car+evaluation
Assignment 1
Fit a Hidden Markov Model to the data in DSET1: it is sufficient to focus on a single column of the dataset of your choice (i.e. choose one of the available sensors and focus your analysis on that single sensor). Experiment with training HMMs with two different choices of the emission distribution and compare the results. Also experiment with HMMs with a varying number of hidden states (e.g. at least 2, 3 and 4) and identify the best choice according to your own reasoning.
Once you have identified the best HMM configuration (emissions and number of states), choose a reasonably sized subsequence (e.g. the last 25% of the time series) and compute the optimal state assignment using two methods: i) Viterbi (true optimal) and ii) the best state according to the hidden state posterior (a very local decision). Then plot the time series data, highlighting (e.g. with different colours) the hidden state assigned to each timepoint by the Viterbi algorithm and by the posterior method. Discuss the results.
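To make the difference between the two decoding strategies concrete, here is a minimal from-scratch sketch of Viterbi decoding versus posterior (forward-backward) decoding. The 2-state Gaussian HMM, its parameters, and the sampled sequence are all synthetic stand-ins for your fitted model and DSET1 data; in your solution you would obtain them from a fitted library model (e.g. hmmlearn) instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-made toy 2-state Gaussian HMM (a stand-in for your fitted model).
pi = np.array([0.5, 0.5])                 # initial state distribution
A = np.array([[0.9, 0.1], [0.2, 0.8]])    # transition matrix
means, stds = np.array([0.0, 3.0]), np.array([1.0, 1.0])

# Sample a short synthetic observation sequence from the model.
T = 200
states = np.zeros(T, dtype=int)
for t in range(1, T):
    states[t] = rng.choice(2, p=A[states[t - 1]])
obs = rng.normal(means[states], stds[states])

# Emission likelihoods B[t, k] = p(obs_t | state k), Gaussian per state.
B = np.exp(-0.5 * ((obs[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

def viterbi(pi, A, B):
    """Most likely joint state path (computed in the log domain)."""
    T, K = B.shape
    logd = np.log(pi) + np.log(B[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[t])
    path = np.zeros(T, dtype=int)
    path[-1] = logd.argmax()
    for t in range(T - 1, 0, -1):            # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path

def posteriors(pi, A, B):
    """Per-timestep state posteriors via scaled forward-backward."""
    T, K = B.shape
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    c = np.zeros(T)                          # scaling factors (avoid underflow)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

vit = viterbi(pi, A, B)                          # global optimum path
post = posteriors(pi, A, B).argmax(axis=1)       # local per-timestep decision
print("timepoints where the two decodings disagree:", int((vit != post).sum()))
```

The two assignments typically agree on most timepoints but can differ around state transitions, which is exactly what your plot should make visible.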
Assignment 2
Implement a simple image understanding application for DSET2 using the LDA model and the bag of visual terms approach described in Lectures 12-13. For details on how to implement the approach see the BOW demo and paper [12] referenced on the Moodle site. Keep 10 of the pictures out of training for testing. In short:
1. Pick one interest point detector of your choice (it is ok to use those implemented in a library).
2. For each image (train and test), extract the SIFT descriptors at the interest points identified by your detector of choice.
3. Learn a K-dimensional codebook (i.e. run k-means with K clusters, with K chosen by you) from the SIFT descriptors of the training images (you can use a subsample of them if k-means takes too long).
4. Generate the bag of visual terms for each image (train and test): use the bags of terms for the training images to train an LDA model (use one of the available implementations). Choose the number of topics as you wish (but reasonably).
5. Test the trained LDA model on the test images: plot (a selection of) them with overlaid visual patches coloured differently depending on the most likely topic predicted by LDA.
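The codebook and LDA steps of the pipeline above can be sketched as follows. Note the hedge: random 128-d vectors stand in for real SIFT descriptors (which you would extract from DSET2 images, e.g. with OpenCV's `cv2.SIFT_create()`), and the sizes K=16, 3 topics, 60 descriptors per image are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Stand-in for step 2: random 128-d "SIFT" descriptors per image
# (in the real assignment these come from your detector + SIFT on DSET2).
n_train, n_test, desc_per_img = 20, 5, 60
train_desc = [rng.normal(size=(desc_per_img, 128)) for _ in range(n_train)]
test_desc = [rng.normal(size=(desc_per_img, 128)) for _ in range(n_test)]

# Step 3: learn a K-word visual codebook from the training descriptors.
K = 16
codebook = KMeans(n_clusters=K, n_init=4, random_state=0)
codebook.fit(np.vstack(train_desc))

def bag_of_words(descs):
    """Histogram of visual-word assignments for one image."""
    words = codebook.predict(descs)
    return np.bincount(words, minlength=K)

X_train = np.array([bag_of_words(d) for d in train_desc])
X_test = np.array([bag_of_words(d) for d in test_desc])

# Step 4: fit LDA on the training bags; step 5: infer topics on test images.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X_train)
test_topics = lda.transform(X_test).argmax(axis=1)  # most likely topic per image

# For the overlay colouring, each visual word can be coloured with the
# topic that most strongly generates it.
word_topic = lda.components_.argmax(axis=0)         # topic index per codeword
print("most likely topic per test image:", test_topics)
```

In the real solution you would keep the interest point coordinates alongside each descriptor, so that `word_topic` can be used to colour the patches drawn on the test images.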
Assignment 3
Implement from scratch an RBM and apply it to DSET3. The RBM should be implemented fully by you (both CD-1 training and inference steps) but you are free to use library functions for the rest (e.g. image loading and management, etc.).
1. Train an RBM with a number of hidden neurons selected by you (single layer) on the MNIST data (use the training set split provided by the website).
2. Use the trained RBM to encode a selection of test images (e.g. using one per digit type) using the corresponding activation of the hidden neurons.
3. Train a simple classifier (e.g. any simple classifier in scikit-learn) to recognize the MNIST digits, using as inputs the encodings obtained at step 2. Use the standard training/test split. Report a performance metric of your choice in the presentation/handout.
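As a rough sketch of what "fully implemented by you" means here, the following is a minimal Bernoulli-Bernoulli RBM with a CD-1 update, written only with NumPy. The hyperparameters (64 hidden units, learning rate 0.1, 5 epochs) and the random binary data standing in for binarized MNIST are placeholders; treat this as a starting skeleton, not a tuned solution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Bernoulli-Bernoulli RBM trained with CD-1."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        """One CD-1 update on a batch of binary visible vectors."""
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden layer
        pv1 = self.visible_probs(h0)                      # reconstruction
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Contrastive divergence gradients: data term minus model term.
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)
        return np.mean((v0 - pv1) ** 2)                   # reconstruction error

# Random binary data standing in for binarized 28x28 MNIST digits.
X = (rng.random((256, 784)) < 0.2).astype(float)
rbm = RBM(784, 64)
for epoch in range(5):
    err = rbm.cd1_step(X)
print("final reconstruction error:", round(err, 4))

# Step 2 of the assignment: encode images as hidden activation probabilities.
codes = rbm.hidden_probs(X)   # shape (256, 64), input for the classifier
```

On real MNIST you would train for many epochs on mini-batches and then feed `codes` to, e.g., a scikit-learn logistic regression for step 3.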
Assignment 4
Once you have modelled the BN, also plug in the necessary local conditional probability tables. You can set the values of the probabilities following your own intuition on the problem (i.e. no need to learn them from data). Then run some episodes of Ancestral Sampling on the BN and discuss the results.
The assignment needs to be fully implemented by you, without using BN libraries.
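Consistent with the no-BN-libraries rule, ancestral sampling is straightforward to implement by hand: sample each node in topological order, conditioning on the already-sampled values of its parents. The sketch below uses a hypothetical 3-node network (Rain -> WetGrass <- Sprinkler) with CPT values set purely by intuition, exactly as the assignment allows; your own BN and tables will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary BN: Rain -> WetGrass <- Sprinkler.
# CPT values are set by intuition, not learned from data.
p_rain = 0.3
p_sprinkler = 0.4
# p(WetGrass=1 | Rain, Sprinkler), indexed as [rain][sprinkler].
p_wet = np.array([[0.05, 0.80],
                  [0.90, 0.99]])

def ancestral_sample(n):
    """Sample each node in topological order, conditioning on its parents."""
    rain = (rng.random(n) < p_rain).astype(int)           # root node
    sprinkler = (rng.random(n) < p_sprinkler).astype(int) # root node
    wet = (rng.random(n) < p_wet[rain, sprinkler]).astype(int)
    return rain, sprinkler, wet

rain, sprinkler, wet = ancestral_sample(10_000)
print("empirical P(WetGrass=1):", wet.mean())
```

A useful sanity check for your discussion: the empirical marginal of each node should approach the value obtained by summing out its ancestors in the CPTs (here, P(WetGrass=1) is about 0.53).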
Assignment 5
Learn the structure of the Bayesian Network (BN) resulting from the dataset DSET4 using two BN structure learning algorithms of your choice. For instance, you can consider the algorithms implemented in PGMPY or any other comparable library (e.g. see the libraries listed in Lecture 7). Compare and discuss the results obtained with the two different algorithms. Also discuss any hyperparameter/design choices you had to make.
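For this assignment you should use library implementations (e.g. PGMPY's score-based and constraint-based estimators). Purely to illustrate what a score-based learner does under the hood, here is a toy sketch that scores every single-parent edge by its BIC gain on synthetic discrete data; the data, the helper `bic_node`, and the variables A, B, C are all hypothetical stand-ins for the DSET4 attributes, and a real learner would also search over multi-parent sets and check for cycles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic discrete data with a known dependency A -> B (C is independent),
# standing in for the DSET4 attributes.
n = 2000
A = rng.integers(0, 2, n)
B = np.where(rng.random(n) < 0.9, A, 1 - A)  # B copies A 90% of the time
C = rng.integers(0, 3, n)
df = pd.DataFrame({"A": A, "B": B, "C": C})

def bic_node(df, node, parents):
    """BIC contribution of one node given its parents (max-likelihood CPT)."""
    n = len(df)
    r = df[node].nunique()
    if parents:
        counts = df.groupby(parents + [node]).size()
        parent_counts = df.groupby(parents).size()
        probs = counts / parent_counts.reindex(
            counts.index.droplevel(-1)).to_numpy()
        ll = np.sum(counts * np.log(probs))
        q = parent_counts.shape[0]       # number of parent configurations
    else:
        counts = df[node].value_counts()
        ll = np.sum(counts * np.log(counts / n))
        q = 1
    return ll - 0.5 * np.log(n) * q * (r - 1)   # log-likelihood minus penalty

# Greedy idea: keep only edges whose addition improves the BIC score.
nodes = list(df.columns)
base = {v: bic_node(df, v, []) for v in nodes}
edges = []
for u in nodes:
    for v in nodes:
        if u != v:
            gain = bic_node(df, v, [u]) - base[v]
            if gain > 0:
                edges.append((u, v, round(gain, 1)))
print(sorted(edges))
```

Note that both A->B and B->A appear with the same gain: score-based methods cannot orient edges within a score-equivalence class, which is one of the points worth raising when you compare the two library algorithms on DSET4.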