Machine Learning in Bioinformatics (Spring 2024)
Course Information
Locations: This course is taught at both the Beijing and Hangzhou campuses of the University of Chinese Academy of Sciences (UCAS).
  • Beijing: Room 302, Teaching Building No. 1, Yanqihu Campus.
  • Hangzhou: Room 104, Teaching Building No. 10, Yunyi Campus (Hangzhou Institute for Advanced Study).
Instructor: Haicang Zhang
Office Hours: One hour immediately after each lecture, held in the classroom.
Course Description
Recent advances in AI have pushed research at the intersection of computer science, biology, and medicine into a new era. Notably, AlphaFold and RFdiffusion, both built on cutting-edge AI models, have achieved significant breakthroughs in protein structure prediction and protein design, respectively. This course explores how state-of-the-art AI is used to address fundamental scientific challenges in the life sciences, including protein folding, protein design, drug discovery, function prediction, and gene regulation. The course covers a brief history of AI models and their applications in bioinformatics, with a focus on generative AI, such as large language models (LLMs) and diffusion-based generative models, as well as reinforcement learning algorithms for post-training.
Syllabus
Please note that we have three classes per week.
Week Topic Contents Instructor
1 Introduction (Part 1)
Fundamental scientific questions in life sciences; The scale of biological data.
A brief history of AI.
Overview of recent breakthroughs (e.g., protein folding, drug discovery).
Haicang Zhang
2 Introduction (Part 2)
Supervised versus unsupervised learning, clustering, generative models; Training, validation, and testing; Loss functions; Evaluation metrics; Optimization algorithms; Linear models (e.g., logistic regression, linear regression, SVMs); Non-linear models (e.g., decision trees, random forests, neural networks).
The Bias-Variance Tradeoff; Strategies for mitigating overfitting in biological datasets.
Haicang Zhang
3 Sequence Modeling (Part 1)
Haicang Zhang
4 Sequence Modeling (Part 2)
Haicang Zhang
5 Sequence Modeling (Part 3)
ESM2/3, MSA-Transformer, ProGen, xTrimoPGLM, and LLMs for antibodies, RNAs, and DNAs.
Parameter-efficient fine-tuning (PEFT) using LoRA; Direct Preference Optimization (DPO).
Haicang Zhang
6 Structure Modeling (Part 1)
Background on biomolecular structure modeling.
CNNs, ResNets, DenseNets, AlexNet, GoogLeNet, and Inception architectures.
RaptorX, ProFold, trRosetta, and AlphaFold1.
Haicang Zhang
7 Structure Modeling (Part 2)
Deep Dive into AlphaFold2.
Adapting AF2 for multimer prediction, docking, mutation effect prediction, and protein design.
Haicang Zhang
8 Probabilistic Graphical Models (Part 1)
Introduction to Directed Graphical Models.
Gaussian Mixture Models (GMMs), Bayesian GMMs, and Hidden Markov Models (HMMs).
Markov Chain Monte Carlo (MCMC) vs. Variational Inference.
Haicang Zhang
9 Probabilistic Graphical Models (Part 2)
Variational Autoencoder (VAE).
VAEs for functional effect prediction, single-cell analysis, and sequence generation.
Haicang Zhang
10 Probabilistic Graphical Models (Part 3)
Introduction to Undirected Graphical Models.
Ising Models, Potts Models, and Markov Random Fields (MRFs).
Pseudo-likelihood approximation.
Deep Undirected Graphical Models.
Applications in protein contact prediction and protein sequence design.
Haicang Zhang
11 Diffusion Models (Part 1)
From VAEs to Denoising Diffusion Probabilistic Models (DDPM).
Stochastic Differential Equation (SDE)-based Diffusion Models; EDM.
Applications: De novo backbone generation (e.g., RFdiffusion, FrameDiff, Chroma, and CarbonNovo).
Haicang Zhang
12 Diffusion Models (Part 2)
Consistency Models.
Flow Matching and Optimal Transport.
More applications: FoldFlow, AlphaFold3 (EDM).
Haicang Zhang
13 Diffusion Models (Part 3)
Haicang Zhang
14 Computational Proteomics
Principles of peptide sequencing with tandem mass spectrometry (MS/MS).
Deep Learning models for de novo peptide sequencing (e.g., DeepNovo).
Shiwei Sun
15 Computational Glycomics
Introduction to glycan identification with MS/MS.
Deep learning models for glycan identification.
Shiwei Sun