Alex Lee, 2023-10-17
NVIDIA estimates 40 exabytes of sequencing data by 2025, driven by decreases in cost.
Examples: the discovery of PCSK9 as protective against heart disease, and of ApoE, from familial studies.
However, the genome is big (~3 billion base pairs) and the space of possible mutations is large.
What if we could use computational sequence models to help us decipher the effect of mutations?
The mainstay tool of these analyses has been a multiple-sequence alignment.

Conservative/non-conservative terminology refers to (broad-sense) notions of similarity across species or between individuals [i.e., a conservative site, such as a T/S site, is largely the same across species].
Figure 1 from Reva et al. (2007, Genome Biology)

Each ball is an amino acid (AA); the AAs colored orange/purple are the ones denoted by different colors in the sequence alignment; hashed sites have high entropy and are likely to be functionally relevant.
Early models (like the one in this paper, from 2007) focused on creating probabilistic models of specific sites based on the entropy (\(-\sum_i p_i \ln(p_i)\)) at a given site (a column of the alignment matrix) and clustering sequences based on their entropy at different sites.
A key assumption of this approach is that if a residue (an individual AA) is commonly observed in the population, then it has high “fitness”. Any statements made are at the population level, about a given sequence alignment.
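To make this concrete, here is a minimal sketch of a per-column entropy calculation over an alignment (the function name and toy alignment are illustrative, not from the paper):

```python
import numpy as np
from collections import Counter

def column_entropies(msa):
    """Shannon entropy -sum_i p_i ln(p_i) for each column of an alignment.

    msa: list of equal-length aligned sequences (strings over the AA alphabet,
    possibly with gap characters). Returns one entropy value per column.
    """
    n_cols = len(msa[0])
    entropies = []
    for j in range(n_cols):
        counts = Counter(seq[j] for seq in msa)      # residue counts at column j
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()                                 # empirical frequencies p_i
        entropies.append(-np.sum(p * np.log(p)))     # high entropy ~ variable site
    return np.array(entropies)

# toy alignment: column 0 is fully conserved, column 2 is variable
msa = ["MKTA", "MKSA", "MKTG", "MKCA"]
print(column_entropies(msa))
```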

Figure 2 from Reva et al. (2007, Genome Biology)
The Marks lab at Harvard pioneered more complex probabilistic models based on statistical mechanics.

The previous method focuses on marginal probabilities at specific sites, \(\mathbf{h}_i(\sigma_i)\) [i.e., a fixed effect], for a given protein alignment.
Now we have an overall model of a given sequence \(\sigma\): \(P(\sigma) = \frac{1}{Z}\exp\big(E(\sigma)\big)\), where:
\[E(\sigma) = \sum_i \mathbf{h}_i (\sigma_i) + \sum_{i<j} \mathbf{J}_{ij} (\sigma_i, \sigma_j)\]
What’s significant about this change is that we can now score a given sequence on its overall likelihood.
Hopf et al. (2017, Nature Biotechnology)
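To make the scoring concrete, here is a minimal sketch of how \(E(\sigma)\) and a mutation-effect score could be computed once the fields \(\mathbf{h}\) and couplings \(\mathbf{J}\) have been fit to an alignment (array shapes and names are assumptions for illustration, not the EVmutation code):

```python
import numpy as np

def energy(seq_idx, h, J):
    """E(sigma) = sum_i h_i(sigma_i) + sum_{i<j} J_ij(sigma_i, sigma_j).

    seq_idx: integer-encoded sequence of length L (values in [0, q) for q AA states)
    h:       (L, q) array of site fields
    J:       (L, L, q, q) array of pairwise couplings (only i < j is used)
    """
    L = len(seq_idx)
    e = sum(h[i, seq_idx[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            e += J[i, j, seq_idx[i], seq_idx[j]]
    return e

def mutation_effect(seq_idx, i, new_aa, h, J):
    """Log-odds score of a single substitution at site i.

    The partition function Z cancels in the difference, so
    log P(mutant) - log P(wild type) = E(mutant) - E(wild type).
    """
    mutant = list(seq_idx)
    mutant[i] = new_aa
    return energy(mutant, h, J) - energy(seq_idx, h, J)
```

The key point is that \(Z\) cancels when comparing two sequences, so relative likelihoods are cheap to compute even though \(Z\) itself is intractable.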

New applications are also introduced by sequence-level modeling:




VAE models yield highly accurate ClinVar scores, even with only a couple hundred sequences per protein family.
A simple Gaussian mixture on top of the VAE probabilities gives ~90% correlation and ~99% AUC on specific examples like TP53.
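A minimal sketch of the mixture step, assuming per-variant scores from a VAE are already in hand (the scores and variable names below are made up for illustration; this shows the general two-component idea, not the published pipeline):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# VAE-derived evolutionary scores, one per variant (illustrative values)
scores = np.array([-7.2, -6.8, -1.1, -0.4, -6.5, -0.9]).reshape(-1, 1)

# fit a two-component mixture: one mode for benign-like, one for pathogenic-like scores
gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)

# posterior probability of each component gives a soft pathogenicity call per variant
posteriors = gmm.predict_proba(scores)
print(posteriors.round(2))
```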


No sequence alignment needed: (pre-)training is over huge amounts of data, although the models are relatively small (on the order of millions up to 15B parameters).
Figure 1a from Brandes et al. (2023, Nature Genetics)
A bit unfair given the architectural context, but there are still clear differences using a 650M-parameter model:

Not shown: a commensurate increase in prediction performance on mutational scanning effect datasets.
Figure 2b from Brandes et al. (2023, Nature Genetics)
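The variant scoring itself reduces to a log-likelihood ratio at the mutated position; a minimal sketch is below, assuming a hypothetical `masked_log_probs(sequence, pos)` helper that runs a protein language model with position `pos` masked and returns log-probabilities over the 20 amino acids (the helper is an assumption, not a specific library API):

```python
def variant_llr(sequence, pos, ref_aa, alt_aa, masked_log_probs):
    """Zero-shot effect score for a missense variant:
    log p(alt | context) - log p(ref | context) at the masked position.

    masked_log_probs(sequence, pos) -> dict mapping each amino acid to its
    log-probability at `pos` when that position is masked (hypothetical helper).
    """
    assert sequence[pos] == ref_aa, "reference allele must match the sequence"
    log_p = masked_log_probs(sequence, pos)
    return log_p[alt_aa] - log_p[ref_aa]   # strongly negative scores suggest damaging variants
```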





Note: people have been doing de novo design for ~20-30 years now, but it’s only in the last ~10 that it’s been possible with really extensive automation.

The model is autoregressively trained with “control tags” (sort of like a prompt that conditions generation).

Most importantly, proteins designed using this method can actually be expressed in cells comparably to natural proteins, which is a big milestone.
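In spirit, generation conditioned on control tags is ordinary autoregressive sampling with the tags prepended to the context; the sketch below uses a hypothetical `next_token_probs` callable and a made-up `<eos>` token purely to show the conditioning pattern:

```python
import numpy as np

def generate(control_tags, next_token_probs, max_len=200, rng=np.random.default_rng(0)):
    """Sample a protein sequence autoregressively, conditioned on control tags.

    control_tags:     list of tag tokens (e.g. family/function labels) prepended to the context
    next_token_probs: hypothetical callable mapping a token list to a dict
                      {token: probability} over amino acids plus an "<eos>" token
    """
    context = list(control_tags)          # the tags condition everything that follows
    sequence = []
    for _ in range(max_len):
        probs = next_token_probs(context)
        tokens, p = zip(*probs.items())
        tok = rng.choice(tokens, p=np.array(p) / np.sum(p))
        if tok == "<eos>":
            break
        sequence.append(str(tok))
        context.append(str(tok))
    return "".join(sequence)
```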
Targets antibody generation using ESM-1b/ESM-1v with a sort of simulated-annealing-like process.
Basically, select protein candidates based on probability from some model: for a length-\(L\) protein \(\mathbf{x}\), with \(i \in \{1, 2, \ldots, L\}\),
\[ p(x'_i \mid \mathbf{x}) > p(x_i \mid \mathbf{x}) \]
where \(x_i\) denotes the wild-type residue (or the “start” residue at a given site \(i\)) and \(x'_i\) is a particular substitution at that site.
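A minimal sketch of that selection rule, reusing the hypothetical `masked_log_probs` helper from the variant-scoring sketch above (this illustrates the acceptance criterion, not the exact published procedure):

```python
def candidate_substitutions(sequence, masked_log_probs):
    """Return (position, wild-type AA, substitution) triples where the model
    assigns the substitution a higher probability than the wild-type residue."""
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    candidates = []
    for i, wt in enumerate(sequence):
        log_p = masked_log_probs(sequence, i)          # hypothetical helper: log p(aa | context)
        for aa in amino_acids:
            if aa != wt and log_p[aa] > log_p[wt]:     # p(x'_i | x) > p(x_i | x)
                candidates.append((i, wt, aa))
    return candidates
```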

DNA language models are emerging (Benegas et al. (2023), from the Song lab at Berkeley): these could be used to decode the effect of regulatory DNA.
This builds on a history of sequence models that predict transcription (the Kelley lab at Calico, Enformer, etc.).
The first paper was from Anand and Achim (2022), followed by RFdiffusion from the Baker lab at UW (Watson et al., 2023, Nature). Here is a trajectory for a binder designed against the insulin receptor (IPD website).
The protein sequence modeling field has moved quite quickly in the last five years.
It has moved from simpler statistical models to more complex ones, ultimately learning energy functions / likelihood scorers.
Performance on variant scoring tasks is robust; it is unclear what the field will do now that we have these strong general representation learners. So far, analyses have focused on protein-protein interactions and general prediction tasks (like stability).
