This course is tailored to MSc students of the AI and Data Science Master's program at Heinrich Heine University Düsseldorf.
We provide all the course materials, including lectures, slides, and exercise classes.
- Lecture || Slides
- Introduction to autoencoders for representation learning and early traditional approaches, based on the Bengio et al. 2012 paper
Image autoencoders. Learning to use and evaluate the intermediate learned representation.
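Below is a minimal PyTorch sketch of an image autoencoder whose bottleneck output can be reused as a learned representation; the architecture, input size, and dimensions are illustrative and not necessarily the ones used in the lecture or exercise.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Tiny convolutional autoencoder; the encoder output serves as the representation."""
    def __init__(self, latent_channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),                # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(16, latent_channels, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 16, kernel_size=2, stride=2),    # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2),                  # 16x16 -> 32x32
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)                       # intermediate learned representation
        return self.decoder(z), z

model = ConvAutoencoder()
x = torch.rand(8, 3, 32, 32)                      # a batch of toy 32x32 RGB images
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)           # reconstruction objective
```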
- Lecture || Slides
- Self-supervised learning VS Transfer Learning. Pretext VS Downstream Task
- Pretext tasks covered: colorization, jigsaw puzzles, image inpainting, Shuffle and Learn (videos), classifying corrupted images, rotation prediction
- Semi-supervised learning: Consistency loss
- A short introduction to the contrastive loss (InfoNCE)
In this exercise, we will train a ResNet18 on the task of rotation prediction. Rotation prediction provides a simple, yet effective way to learn rich representations from unlabeled image data. The basic idea behind rotation prediction is that the network is trained to predict the orientation of a given image after it has been rotated by a certain angle (e.g., 0°, 90°, 180°, or 270°).
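As a rough illustration (not the exact setup of the exercise notebook), the pretext task can be implemented by generating the four rotated copies of each unlabeled image together with the rotation index as the label, and training a standard ResNet18 with a 4-way classification head:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def rotate_batch(images):
    """Create the 4 rotated copies (0°, 90°, 180°, 270°) of each image and matching labels."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))             # rotate in the spatial dims
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# ResNet18 backbone with a 4-class head for the rotation pretext task
model = resnet18(num_classes=4)
criterion = nn.CrossEntropyLoss()

images = torch.rand(16, 3, 32, 32)                # an unlabeled batch
inputs, targets = rotate_batch(images)
loss = criterion(model(inputs), targets)          # standard classification loss on pretext labels
```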
- Lecture || Slides
- Natural Language Processing (NLP) basics
- RNN, self-attention, and Transformer recap
- Language pretext tasks
- Pretext tasks for representation learning in NLP. An in-depth look into BERT.
In this exercise, you will train a small BERT model on the IMDB dataset (https://huggingface.co/datasets/imdb). You will then use the model to classify the sentiment of movie reviews and the sentiment of sentences from the Stanford Sentiment Treebank (SST2, https://huggingface.co/datasets/sst2).
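One possible sketch of fine-tuning a small BERT for sentiment classification on IMDB with the Hugging Face datasets/transformers stack is shown below; the checkpoint name (`prajjwal1/bert-tiny`), the tokenization settings, and the training arguments are illustrative placeholders, and the SST2 evaluation step is omitted.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# "prajjwal1/bert-tiny" is an illustrative small BERT checkpoint; the exercise may use another.
model_name = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

imdb = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

imdb = imdb.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=imdb["train"],
    eval_dataset=imdb["test"],
)
trainer.train()
```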
- Lecture || Slides || Notes
- A deep look into contrastive learning: theory and a proof of the mutual information (MI) bound
- SimCLR Paper
Build and train a SimCLR ResNet18 on CIFAR10.
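The core of SimCLR is the NT-Xent (InfoNCE) loss over two augmented views of each image. A minimal sketch, assuming the ResNet18 backbone and projection head have already produced the two sets of embeddings:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for projections z1, z2 of shape (N, D) from two augmented views."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # L2-normalize all 2N embeddings
    sim = z @ z.t() / temperature                 # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))             # mask self-similarity
    # the positive of sample i is i+N (and vice versa)
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)   # projections of two views of 32 images
loss = nt_xent(z1, z2)
```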
- Lecture || Slides || MoCo implementation
- Contrastive Learning, L2 normalization, Properties of contrastive loss
- Momentum encoder (MoCo). Issues and concerns regarding batch normalization
- Multi-view contrastive learning
- Deep Image Clustering: task definition and challenges, K-means and SCAN, PMI and TEMI
Use a pretrained MoCo ResNet50 for image clustering.
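A possible skeleton for this exercise: extract frozen features with a ResNet50 backbone (in the exercise, the weights would come from a MoCo-pretrained checkpoint, which is not loaded here) and cluster them with k-means. The data loader and checkpoint loading are notebook-specific and only hinted at.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from sklearn.cluster import KMeans

# Placeholder: in the exercise, the ResNet50 weights come from a MoCo-pretrained checkpoint.
backbone = resnet50()
backbone.fc = nn.Identity()            # drop the classification head, keep 2048-d features
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats = []
    for images, _ in loader:
        feats.append(backbone(images))
    return torch.cat(feats)

# features = extract_features(val_loader)          # val_loader is notebook-specific
features = torch.randn(1000, 2048)                  # stand-in features for illustration
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(features.numpy())
```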
- Lecture || Slides
- Transformer encoder and Vision transformer
- ViTs VS CNNs: receptive field and inductive biases
- Knowledge distillation and the mysteries of model ensembles
- Knowledge distillation in ViTs and masked image modeling
Knowledge distillation on CIFAR100 with Vision Transformers.
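As a reference point, soft-label knowledge distillation in the style of Hinton et al. combines the usual cross-entropy with a temperature-scaled KL term between teacher and student logits; the temperature and weighting below are illustrative, not the exercise's hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of the hard-label CE and the soft-label KL divergence."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T                                      # rescale after temperature softening
    return alpha * ce + (1 - alpha) * kd

student_logits = torch.randn(8, 100)               # e.g. a small ViT student on CIFAR100
teacher_logits = torch.randn(8, 100)               # frozen teacher predictions
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```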
- Lecture || Slides
- A small review of self-supervised methods
- A small review of knowledge distillation
- Self-Supervised Learning & knowledge distillation
- An in-depth look into DINO
In this exercise, you will implement and train a DINO model on a medical dataset: PathMNIST from MedMNIST, which consists of low-resolution images of various colon pathologies.
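A simplified sketch of the DINO objective (cross-entropy between the centered, sharpened teacher distribution and the student distribution, plus the EMA update of the center) for a single pair of views; the official implementation additionally handles multi-crop and the momentum teacher, so treat this only as a conceptual outline.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between centered/sharpened teacher and student distributions (one view pair)."""
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=1).detach()
    student_logprobs = F.log_softmax(student_out / student_temp, dim=1)
    return -(teacher_probs * student_logprobs).sum(dim=1).mean()

def update_center(center, teacher_out, momentum=0.9):
    """EMA update of the center over the teacher outputs (momentum value is illustrative)."""
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)

K = 4096                                           # output dimension of the DINO head
student_out = torch.randn(16, K)
teacher_out = torch.randn(16, K)
center = torch.zeros(K)
loss = dino_loss(student_out, teacher_out, center)
center = update_center(center, teacher_out)
```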
- Lecture || Slides
- MAE: https://arxiv.org/abs/2111.06377
- BEiT: BERT-style pre-training in vision: https://arxiv.org/abs/2106.08254
- iBOT: Combining MIM with DINO https://arxiv.org/abs/2111.07832
- DINOv2: https://arxiv.org/abs/2304.07193
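To give a flavor of masked image modeling, here is a tiny sketch of MAE-style random patch masking (keep a small visible subset of patch tokens, reconstruct the rest); it is a simplified, hedged illustration rather than the official implementation.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Randomly keep (1 - mask_ratio) of the patch tokens, as in MAE-style pre-training.

    patches: (B, N, D) patch embeddings. Returns visible tokens and the shuffle indices.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # one random score per patch
    ids_shuffle = noise.argsort(dim=1)             # low score = keep
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle

patches = torch.randn(2, 196, 768)                 # e.g. 14x14 patches of a ViT-B
visible, ids_shuffle = random_masking(patches)     # visible has shape (2, 49, 768)
```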
- Lecture || Slides
- Defining Robustness and Types of Robustness
- Zero-shot learning
- Contrastive Language Image Pretraining (CLIP)
- Image captioning
- Few-shot learning
- Visual anomaly detection: task definition
- Anomaly detection scores
- Anomaly detection metrics: AUROC
Use a CLIP-pre-trained model for out-of-distribution detection.
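One hedged way to set this up with the Hugging Face CLIP interface: score each image by its maximum softmax probability over in-distribution class prompts and evaluate the separation between in-distribution and OOD scores with AUROC. The checkpoint, prompts, and label set below are illustrative placeholders.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.metrics import roc_auc_score

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

id_classes = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]      # in-distribution label set (e.g. CIFAR10)
prompts = [f"a photo of a {c}" for c in id_classes]

@torch.no_grad()
def msp_scores(images):
    """Maximum softmax probability over the in-distribution prompts; low score suggests OOD."""
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image               # (num_images, num_prompts)
    return logits.softmax(dim=-1).max(dim=-1).values

# scores_id, scores_ood = msp_scores(id_images), msp_scores(ood_images)   # PIL images, dataset-specific
# AUROC measures how well the score separates in-distribution (label 1) from OOD (label 0):
# auroc = roc_auc_score([1] * len(scores_id) + [0] * len(scores_ood),
#                       torch.cat([scores_id, scores_ood]).tolist())
```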
- Lecture || Slides
- Investigating CLIP models and scaling laws
- What determines the success of CLIP?
- How does CLIP scale to larger datasets and models?
- OpenCLIP: Scaling laws of CLIP models and connection to NLP scaling laws
- Robustness of CLIP models against image manipulations
- Learned representations of supervised models: CNNs VS Vision Transformers (ViTs), the texture-shape bias
- Robustness and generalization of supervised-pretrained CNNs VS ViTs
- Scaling (Supervised) Vision Transformers
- Properties of ViT pretrained models
- Lecture || Slides
- Limitations of existing vision language models
- Self-supervised VS supervised learned feature representations
- What do vision transformers (ViTs) learn “on their own”?
- MoCo v3 and DINO: https://arxiv.org/abs/2104.14294
- Self-supervised learning in medical imaging
- Investigating self-supervised pre-training objectives
No exercise takes place this week.
- Lecture || Slides
- A closer look at the attention mechanism and its use in natural language translation
- A tiny intro to proteins
- Representing protein sequences with Transformers: BERT masked language modeling VS GPT?
- ESM, [ESM-2](https://pubmed.ncbi.nlm.nih.gov/36927031/)
- Looking at and combining the attention maps of a pre-trained Transformer
- Protein Language models generalize beyond natural proteins
Use a pretrained protein language model.
Play around with an AlphaFold notebook.
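As a starting point, the sketch below extracts per-residue embeddings and attention maps from a small public ESM-2 checkpoint via Hugging Face transformers; the checkpoint name and the toy sequence are only illustrative and may differ from the exercise setup.

```python
import torch
from transformers import AutoTokenizer, EsmModel

# "facebook/esm2_t6_8M_UR50D" is a small public ESM-2 checkpoint, used here only for illustration.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"   # a toy protein sequence

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    outputs = model(**inputs, output_attentions=True)

residue_embeddings = outputs.last_hidden_state[0]   # (sequence length + special tokens, hidden dim)
attentions = outputs.attentions                     # per-layer attention maps, e.g. for contact analysis
```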
Feel free to open issues regarding errors in the exercises or missing information, and we will try to get back to you.
Important: Solutions to the exercises are not provided, but you can cross-check your results with the Expected results in the notebook.