Stochastic Models in Bioinformatics

Instructor: Dr. István MIKLÓS

Text: Durbin, Eddy, Krogh, Mitchison: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids + handouts.

Prerequisite: None, but very elementary probability theory and some degree of mathematical maturity are needed for this course. The course starts with a short overview of the mathematics and biology needed.

Course description: Bioinformatics is a new and hot discipline that is extremely application-oriented. It also has a wonderful background theory consisting of a nice mixture of combinatorics, probability theory, statistics and algorithm theory. This course is a computer-science-flavoured introduction to the mathematical background of bioinformatics, with a special emphasis on problem solving and applications.

Topics:

Basics: Models in biology. Biological sequences. RNA secondary structures and pseudo-knotted structures. Protein folding. Evolutionary trees. Basic concepts of evolutionary and comparative biology. Introduction to statistical inference: likelihood function, maximum likelihood estimation, expectation maximization, Bayes' theorem, Bayesian statistics.

Sequence alignment: The classical and automaton approaches to aligning sequences. Hidden Markov Models (HMMs): aligning sequences to a structure. Aligning sequences with pair-HMMs.

Stochastic grammars: The Chomsky hierarchy. Regular grammars are HMMs. Stochastic Context-Free Grammars (SCFGs) and their applications in RNA structure prediction. The algorithm theory of regular grammars and SCFGs.

Evolutionary trees: Concepts for inferring trees. Stochastic models of evolutionary trees. Kingman's coalescent.

Continuous-time Markov models: Substitution models of nucleic acids and amino acids. Insertion-deletion models. Statistical sequence alignment. Comparative bioinformatics.

Optional topics (depending on how much time we have):

Markov chain Monte Carlo: The concept of MCMC. Metropolis-Hastings. The Gibbs sampler. Partial Importance Sampler. Simulated Annealing. Parallel Tempering. Applications: Bayesian statistics of evolutionary trees, multiple sequence alignment, genome rearrangement.

RNA structures (advanced): Stochastic grammars for inferring pseudo-knotted structures. Folding simulations. Co-transcriptional folding.