Machine Learning in Bioinformatics (Spring 2024)

Course Information

Locations: This course is taught at both the Beijing and Hangzhou campuses of the University of Chinese Academy of Sciences (UCAS).

Beijing: Room 302, Teaching Building No. 1, Yanqihu Campus.
Hangzhou: Room 104, Teaching Building No. 10, Yunyi Campus (Hangzhou Institute for Advanced Study).

Instructor: Haicang Zhang
Office Hours: 1 hour immediately following the lecture, held in the classroom.

Course Description

Recent advances in AI have pushed research at the intersection of computer science, biology, and medicine into a new era. Notably, AlphaFold and RFdiffusion have achieved major breakthroughs in protein structure prediction and protein design, respectively. This course explores how state-of-the-art AI is used to address fundamental scientific challenges in the life sciences, including protein folding, protein design, drug discovery, function prediction, and gene regulation. The course covers a brief history of AI models and their applications in bioinformatics, with a focus on generative AI, such as large language models (LLMs) and diffusion-based generative models, as well as reinforcement learning algorithms for post-training.

Syllabus

Please note that we have three classes per week.

Week	Topic	Contents	Instructor
1	Introduction (Part 1)	Fundamental scientific questions in life sciences; The scale of biological data. A brief history of AI. Overview of recent breakthroughs (e.g., protein folding, drug discovery).	Haicang Zhang
2	Introduction (Part 2)	Core ML Concepts: Supervised versus unsupervised learning, clustering, generative models; Training, validation, and testing; Loss functions; Evaluation metrics; Optimization algorithms; Linear models (e.g., logistic regression, linear regression, SVM); Non-linear models (e.g., decision tree, random forest, neural network); Generalization: The Bias-Variance Tradeoff; Strategies for mitigating overfitting in biological datasets.	Haicang Zhang
3	Sequence Modeling (Part 1)	Background on biological sequences. The pre-LLM era: n-grams; Recurrent Neural Networks (RNNs), LSTMs, and Seq2Seq. The information bottleneck and the origin of the Attention Mechanism.	Haicang Zhang
4	Sequence Modeling (Part 2)	Deep dive into the Transformer architecture; BERT vs. GPT. Interpretability: token-level embedding, sequence-level embedding, and attention maps.	Haicang Zhang
5	Sequence Modeling (Part 3)	SOTA Models: ESM2/3, MSA-Transformer, ProGen, xTrimoPGLM, and LLMs for antibodies, RNAs, and DNAs. Advanced Tuning: Parameter-efficient fine-tuning (PEFT) using LoRA; Direct Preference Optimization (DPO).	Haicang Zhang
6	Structure Modeling (Part 1)	Background on biomolecular structure modeling. Model architectures in the pre-AlphaFold2 era: CNNs, ResNets, DenseNets, AlexNet, GoogLeNet, and Inception architectures. Structure modeling in the pre-AlphaFold2 era: RaptorX, ProFold, trRosetta, and AlphaFold1.	Haicang Zhang
7	Structure Modeling (Part 2)	Deep Dive into AlphaFold2. Adapting AF2 for multimer prediction, docking, mutation effect prediction, and protein design.	Haicang Zhang
8	Probabilistic Graphical Models (Part 1)	Introduction to Directed Graphical Models. Gaussian Mixture Models (GMMs), Bayesian GMMs, and Hidden Markov Models (HMMs). Markov Chain Monte Carlo (MCMC) vs. Variational Inference.	Haicang Zhang
9	Probabilistic Graphical Models (Part 2)	Deep Directed Graphical Models: Variational Autoencoder (VAE). VAEs for functional effect prediction, single-cell analysis, and sequence generation.	Haicang Zhang
10	Probabilistic Graphical Models (Part 3)	Introduction to Undirected Graphical Models. Ising Models, Potts Models, and Markov Random Fields (MRFs). Pseudo-likelihood approximation. Deep Undirected Graphical Models. Applications in protein contact prediction and protein sequence design.	Haicang Zhang
11	Diffusion Models (Part 1)	From VAEs to Denoising Diffusion Probabilistic Models (DDPM). Stochastic Differential Equation (SDE)-based Diffusion Models; EDM. Applications: De novo backbone generation (e.g., RFdiffusion, FrameDiff, Chroma, and CarbonNovo).	Haicang Zhang
12	Diffusion Models (Part 2)	Consistency Models. Flow Matching and Optimal Transport. More applications: FoldFlow, AlphaFold3 (EDM).	Haicang Zhang
13	Diffusion Models (Part 3)	Classifier guidance vs. classifier-free guidance. Post-training with Direct Preference Optimization (DPO); Applications in antibody design: AbDPO, AbNovo.	Haicang Zhang
14	Computational Proteomics	Principles of peptide sequencing with Mass Spectrometry (MS/MS). Deep Learning models for de novo peptide sequencing (e.g., DeepNovo).	Shiwei Sun
15	Computational Glycomics	Introduction to glycan identification with MS/MS. Deep learning models for glycan identification.	Shiwei Sun