Midterm 3 (2021)
Assignment Rules and Execution
The third midterm covers the foundational deep learning models introduced up to Lecture 21. To pass the midterm you should
- perform one (and only one) of the assignments described below;
- prepare a short presentation describing your results and upload it here by the (strict) deadline;
- give your short presentation in front of the class on the midterm date (June 7th, 2021).
You can use any deep learning framework or library to complete the assignment, unless explicitly indicated otherwise. You can use whatever programming language you like, but I strongly suggest using either Python or Matlab, for which you have coding examples.
The midterm presentation MUST take a maximum of 4 minutes and should include a maximum of 5 slides, whose content should cover:
- A title slide with the assignment number and your name
- A slide with code snippets highlighting the key aspects of your code
- Slides showing the results of the analysis
- A final slide with your personal considerations (fun things, weak aspects, possible ways to enhance the analysis, etc.).
Don’t waste slides and time describing the dataset or the assignment you are solving, as everyone will already be familiar with it.
Assignments are based on different datasets, each detailed in the corresponding description.
List of Midterm Assignments
Assignment 1
DATASET (MNIST): http://yann.lecun.com/exdb/mnist/
Train a denoising or a contractive autoencoder on the MNIST dataset: try out different architectures for the autoencoder, including a single-layer autoencoder, a deep autoencoder with only layer-wise pretraining, and a deep autoencoder with fine-tuning. It is up to you to decide how many neurons each layer should have and how many layers the deep autoencoder should have. Show an accuracy comparison between the different configurations.
Provide a visualization of the encodings of the digits in the highest layer of each configuration, using the t-SNE model to obtain 2-dimensional projections of the encodings.
Try feeding one of the autoencoders a random noise image and then applying the iterative gradient ascent process described in the lecture, to see whether the reconstruction converges to the data manifold.
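As a starting point, a single-layer denoising autoencoder can be sketched in plain NumPy; the data, layer sizes, noise level and learning rate below are toy placeholders, not MNIST-tuned values, and in practice you would use a framework and the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data standing in for flattened MNIST digits (values in [0, 1]).
X = rng.random((256, 64))

n_in, n_hid, lr, noise = 64, 32, 0.1, 0.3
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)

losses = []
for epoch in range(200):
    X_noisy = X + noise * rng.normal(size=X.shape)  # corrupt the input
    H = sigmoid(X_noisy @ W1 + b1)                  # encoder
    X_hat = sigmoid(H @ W2 + b2)                    # decoder
    losses.append(np.mean((X_hat - X) ** 2))        # reconstruct the CLEAN input
    # Manual backprop of the reconstruction error (mean over the batch).
    dOut = 2.0 * (X_hat - X) / len(X) * X_hat * (1 - X_hat)
    dH = dOut @ W2.T * H * (1 - H)
    W2 -= lr * (H.T @ dOut);     b2 -= lr * dOut.sum(0)
    W1 -= lr * (X_noisy.T @ dH); b1 -= lr * dH.sum(0)
```

The key point of the denoising variant is visible in the loss line: the network sees the corrupted input but is scored against the clean one.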
Assignment 2
DATASET (CIFAR-10): https://www.cs.toronto.edu/~kriz/cifar.html
Implement your own convolutional network, deciding how many layers, the type of layers and how they are interleaved, the type of pooling, the use of residual connections, etc. Discuss why you made each choice and provide performance results of your CNN on CIFAR-10.
Now that your network is trained, you can try an adversarial attack on it. Try the simple Fast Gradient Sign Method, generating one (or more) adversarial examples starting from one (or more) CIFAR-10 test images. It is up to you to decide whether to implement the attack on your own or to use one of the available libraries (e.g. Foolbox, CleverHans, ...). Display the original image, the adversarial noise and the final adversarial example.
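As a sanity check before attacking the CNN, the FGSM update can be sketched on a toy linear classifier where the input gradient is computed by hand; the model, weights and epsilon below are placeholders, and with a real CNN the gradient would come from the framework's autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy pre-trained linear classifier stands in for the trained CNN.
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)  # stand-in for a flattened test image
y = 1.0                  # true label in {0, 1}

def loss(x):
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy

# FGSM: take one step in the direction of the sign of the input gradient.
p = sigmoid(x @ w + b)
grad_x = (p - y) * w           # d loss / d x for this toy model
eps = 0.25
noise = eps * np.sign(grad_x)  # the adversarial perturbation
x_adv = x + noise
```

Note that the perturbation is bounded by eps in every component, which is exactly why FGSM adversarial noise is visually imperceptible for small eps.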
Assignment 3
DATASET (ENERGY PREDICTION): https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction#
Train a gated recurrent neural network of your choice (LSTM, GRU) to predict energy expenditure (the “Appliances” column) using two approaches:
- Predict the current energy expenditure given as input the temperature (T_i) and humidity (RH_i) readings from all the i sensors in the house.
- Set up a one-step-ahead predictor for energy expenditure, i.e. given the current energy consumption, predict its next value.
Show and compare the performance of both methods. Remember that, at test time, a one-step-ahead predictor does not use teacher forcing.
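A minimal sketch of the data preparation for the one-step-ahead setup, on a toy series standing in for the “Appliances” column (the window length is an arbitrary choice):

```python
import numpy as np

# Toy stand-in for the "Appliances" column (energy readings over time).
series = np.arange(10, dtype=float)

def one_step_pairs(series, window):
    """Build (input window, next value) pairs for one-step-ahead training."""
    X, y = [], []
    for t in range(len(series) - window):
        X.append(series[t:t + window])
        y.append(series[t + window])
    return np.array(X), np.array(y)

X, y = one_step_pairs(series, window=3)
```

During training, each window of ground-truth values predicts the next value (teacher forcing); at test time, the model's own predictions must be fed back into the window instead of the ground truth.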
Assignment 4
DATASET (PRESIDENTIAL SPEECHES): http://www.thegrammarlab.com/?nor-portfolio=corpus-of-presidential-speeches-cops-and-a-clintontrump-corpus#
Pick one of the available implementations of the Char-RNN (e.g. implement1, implement2, implement3, implement4, etc.) and train it on the presidential speech corpora. In particular, be sure to train two separate models, one on all the speeches by President Clinton and one on all the speeches by President Trump. Use the two models to generate new speeches and provide some samples of your choice. Should you want to perform any other analysis, you are free to do so.
Please note that the speech files contain XML tags: be sure to remove them before feeding the text to the Char-RNN (or you might consider leaving just the <APPLAUSE> and/or <BOOING> tags, to see if the network is smart enough to understand when the speech reaches a climax).
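One possible way to strip the tags while keeping <APPLAUSE> and <BOOING>, assuming these exact tag names appear in the corpus files:

```python
import re

# Remove every XML tag EXCEPT <APPLAUSE>, </APPLAUSE>, <BOOING>, </BOOING>.
# The negative lookahead skips the tags we want to keep.
TAG = re.compile(r'<(?!/?(?:APPLAUSE|BOOING)\b)[^>]*>')

def strip_tags(text):
    return TAG.sub('', text)

# Hypothetical snippet in the style of the corpus files.
sample = '<DOC><TEXT>My fellow Americans <APPLAUSE> thank you.</TEXT></DOC>'
cleaned = strip_tags(sample)
```

The surrounding tag names (`DOC`, `TEXT`) are made up for illustration; the regex removes any tag not in the keep-list regardless of its name.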
Assignment 5
DATASET (LERCIO HEADLINES) - Dataset collected by Michele Cafagna
As in Assignment 4, pick one of the Char-RNN implementations and train a model on the dataset, which contains about 6500 headlines from the Lercio satirical news page, scraped by Michele Cafagna, a past student of the ISPR course. The dataset is a CSV file, one line per headline. Be aware that the dataset can be a bit noisy (some errors due to encoding conversions), so you might need some preprocessing to prepare it for the task. Also, I am afraid the dataset is in Italian only, as this is the language of the news page.
Try experimenting with different configurations of the Char-RNN, varying the number of layers. Since the dataset is quite small, keep the number of hidden neurons contained, otherwise the net will overfit. Use the trained model (the best or the worst, your choice) to generate new headlines.
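One possible preprocessing sketch for the encoding noise: count character frequencies over the whole corpus and drop rare characters, which in this dataset are often encoding artifacts. The headlines and the `min_count` threshold below are made up for illustration and would need tuning on the real file:

```python
from collections import Counter

# Toy stand-in for the headlines CSV (one headline per line); the real file
# may contain mojibake from encoding conversions, as in the third line.
headlines = [
    "Scoperto il bosone di Higgs in un bar",
    "Allarme maltempo: piove",
    "Clamoroso al Cibali\u00ef\u00bf\u00bd",  # garbled trailing bytes
]

# Characters rarer than min_count are treated as noise and dropped.
counts = Counter(ch for line in headlines for ch in line)
min_count = 2  # hypothetical threshold; tune on the real corpus
vocab = sorted(ch for ch, c in counts.items() if c >= min_count)
char2idx = {ch: i for i, ch in enumerate(vocab)}

def clean(line):
    return ''.join(ch for ch in line if ch in char2idx)

cleaned = [clean(h) for h in headlines]
```

On a corpus this tiny the frequency filter also drops some legitimate rare characters; on the real ~6500 headlines only genuine artifacts fall below a sensible threshold.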
Assignment 6
DATASET (SARS-Cov-2 Inhibitors) - Dataset from ChemAI group
The dataset (direct download link) contains over 300K drug molecules in SMILES representation (i.e. a molecule is represented as a string of parenthesised characters). The task is to train a predictor that receives a SMILES sequence as input and predicts whether the corresponding molecule is Active/Inactive for two SARS-CoV related proteins. The Active/Inactive target is in the columns marked “PUBCHEM_ACTIVITY_OUTCOME_*”. There are four target columns, but you can use them in pairs; in particular:
- PUBCHEM_ACTIVITY_OUTCOME_ASY0 and PUBCHEM_ACTIVITY_OUTCOME_ASY1 refer to the same protein, hence you can substitute the two values with a single value which is the OR of the two (considering active = 1, inactive = 0);
- PUBCHEM_ACTIVITY_OUTCOME_ASY2 and PUBCHEM_ACTIVITY_OUTCOME_ASY3 refer to the same protein, so treat them as above.
This will give you the possibility of creating two binary classification tasks, one for each protein.
There are many targets for which activity is not available (n/a), while the number of molecules active on at least one of the proteins is low. Therefore, I strongly suggest subsampling the full dataset: take all the active molecules plus a subset of the inactive ones (using no more than 5 times the number of active molecules). Discard molecules whose targets are all n/a.
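The target merging and subsampling described above can be sketched as follows, on synthetic placeholder data (the column names are shortened and the activity/n-a rates are made up, so this is only an illustration of the logic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the ChemAI table: SMILES strings plus four activity
# columns (1 = active, 0 = inactive, None = n/a); column names shortened.
n = 1000
smiles = [f"C{i}" for i in range(n)]  # placeholder molecules
def make_col(p_active=0.05, p_na=0.3):
    return [None if rng.random() < p_na else int(rng.random() < p_active)
            for _ in range(n)]
asy = {k: make_col() for k in ("ASY0", "ASY1", "ASY2", "ASY3")}

def merge(a, b):
    """OR of two outcome columns; n/a only when both values are n/a."""
    if a is None and b is None:
        return None
    return int(bool(a) or bool(b))

task1 = [merge(a, b) for a, b in zip(asy["ASY0"], asy["ASY1"])]
task2 = [merge(a, b) for a, b in zip(asy["ASY2"], asy["ASY3"])]

# Discard molecules whose targets are all n/a.
keep = [i for i in range(n) if task1[i] is not None or task2[i] is not None]

# For task 1: all actives plus at most 5x as many randomly chosen inactives.
active = [i for i in keep if task1[i] == 1]
inactive = [i for i in keep if task1[i] == 0]
inactive = list(rng.permutation(inactive)[:5 * len(active)])
subset = active + inactive
```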
Once you have created your dataset, apply any reasonable neural network to learn the classifier (on one or both tasks). For instance, you can consider gated recurrent networks, 1D convolutional neural networks (e.g. those introduced in the lecture to deal with DNA sequences), etc. Provide results on training, validation and test sets (it is up to you to decide the splits and the validation procedure).
Since the predictive task is an unbalanced binary classification, I strongly suggest using the AUC as evaluation metric. Also, feel free to experiment with any approach for strengthening unbalanced classification, e.g. weighted cross-entropy.
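For reference, the AUC can be computed from raw scores with the rank-sum (Mann-Whitney U) formulation; this is a minimal sketch without tie handling, and in practice a library implementation (e.g. scikit-learn's `roc_auc_score`) is preferable:

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the rank-sum formulation: the probability that a random
    positive example is scored above a random negative one.
    Tied scores are not rank-averaged here, unlike in real libraries."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = (y_true == 1).sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Unlike accuracy, this metric is unaffected by the 5:1 class imbalance of the subsampled dataset, since it only depends on the relative ranking of positives and negatives.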
Please note that in order to feed the neural network with the SMILES strings in input, you will need to encode the SMILES characters (brackets included) with an appropriate numerical encoding.
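A minimal sketch of such an encoding, mapping each character (brackets included) to an integer index with 0 reserved for padding; the SMILES strings below are toy examples:

```python
# Toy SMILES strings; parentheses and digits are characters of the alphabet too.
smiles = ["CC(=O)O", "c1ccccc1", "C(Cl)(Cl)Cl"]

# Build the character vocabulary, reserving index 0 for padding so that
# variable-length molecules can be batched to a fixed length.
vocab = sorted(set(ch for s in smiles for ch in s))
char2idx = {ch: i + 1 for i, ch in enumerate(vocab)}

def encode(s, max_len):
    idx = [char2idx[ch] for ch in s]
    return idx + [0] * (max_len - len(idx))  # right-pad with zeros

max_len = max(len(s) for s in smiles)
encoded = [encode(s, max_len) for s in smiles]
```

These integer sequences can then be fed to an embedding layer (or one-hot encoded) before the recurrent or convolutional network.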