Duke Course: Computational Sequence Biology
Course Description:
Algorithmic and computational issues in the analysis of biological sequences: DNA, RNA, and protein. Emphasizes probabilistic approaches and machine learning methods, e.g., hidden Markov models. Explores applications in analysis of high-throughput sequencing data, protein and DNA homology detection, gene finding, motif discovery, comparative genomics and phylogenetics, genome segmentation, and DNA/RNA/protein structure prediction, with a strong focus on algorithmic aspects. Prerequisites: basic knowledge of algorithm design (COMPSCI 330 or equivalent), probability and statistics (STA 611 or equivalent), molecular biology (BIO 201L or equivalent), and basic computer programming skills (preferred programming languages: Python, Java, C/C++, Perl, R, or Matlab).
Course materials, homeworks, and quizzes are available through Sakai.
Instructor:
Raluca Gordan
Office hours: Tue 9:45am-10:45am (right after class)
Zoom link: same as the class meeting for that day
Email: raluca.gordan at duke dot edu
TA:
Harshit Sahay
Office hours: TBD
Zoom link: TBD
Email: harshit.sahay at duke dot edu
Grading:
Course grade is based on homeworks (70%), pre-class quizzes (15%), and class participation (15%). Homeworks and quizzes will be distributed through Sakai.
You will have 2 weeks to complete each homework. Late homeworks will not be accepted, with one exception: you may submit one homework late during the course, by at most 1 week.
Pre-class quizzes will be due 1 hour before class. The quizzes will test either your background on a subject (to make sure you will be able to follow and participate in the lecture) or your understanding of a subject or paper presented in a previous lecture. You can take each quiz twice; only the highest grade will be considered.
Collaboration policy:
All homeworks and pre-class quizzes should be completed individually, unless otherwise stated. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against the wall for a while, you should consult others to make progress—that is better than giving up entirely. Your first course of action is to speak to the instructor or TA. If for any reason you consult your peers, it should remain understood that such an interaction must be one of consultation and not collaboration: hints rather than answers; after consultation, it is expected that you should still have some thinking to do (otherwise this course will not be very useful for you!). In addition, if you happen to consult with another student, both of you must cite this.
Readings/textbook:
We will have readings for the course (which will be available on Sakai), but there is no formal textbook. Useful resources include:
• Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
• Cristianini and Hahn, Introduction to Computational Genomics: A Case Studies Approach
• Jones and Pevzner, An Introduction to Bioinformatics Algorithms
• Majoros, Methods for Computational Gene Prediction
• Alberts, Johnson, Lewis, Raff, Roberts, Walter, Molecular Biology of the Cell
• Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms
Syllabus:
This syllabus is tentative and may change (slightly) during the semester. Please check Sakai for the latest version.
Lecture | Date | Topic |
1 | Jan-21 | Introduction; DNA sequencing |
2 | Jan-26 | Global sequence alignment; Needleman-Wunsch |
3 | Jan-28 | Local sequence alignment; Smith-Waterman |
4 | Feb-2 | Heuristic search; FASTA; BLAST |
5 | Feb-4 | String matching; suffix arrays |
6 | Feb-9 | Short read alignment; BWA; Bowtie |
7 | Feb-11 | Probabilistic models for biological sequences |
8 | Feb-16 | HMM parsing; Viterbi |
9 | Feb-18 | HMM training; Baum-Welch |
10 | Feb-23 | HMM applications |
11 | Feb-25 | Profile HMMs; PSIBLAST |
12 | Mar-2 | Phylogenetic trees: UPGMA; NJ |
13 | Mar-4 | Unsupervised learning |
- | Mar-9 | NO CLASS |
14 | Mar-11 | Clustering; non-negative matrix factorization |
15 | Mar-16 | Algorithms in single-cell data analysis |
16 | Mar-18 | Supervised learning; classification and regression |
17 | Mar-23 | SVM; string kernels |
18 | Mar-25 | Naive Bayes; logistic regression |
19 | Mar-30 | Deep neural networks |
20 | Apr-1 | Motif finding: EM and Gibbs sampling |
21 | Apr-6 | Motif finding: Bayesian networks |
- | Apr-8 to Apr-22 | Student presentations |