The next five years are going to be amazing for learning biological processes through probabilistic models

[Figure: Comparison of AI approaches in biology: monolithic vs. ecosystem]

I am looking forward to a future in which AI will greatly reduce friction in designing and implementing models that directly reflect underlying biological knowledge and are directly interpretable. This can be contrasted with the “monolithic” paradigm of AI that works to develop a big black-box model. In that paradigm, one feeds all the data into a large model, and the model makes predictions. One can interpret the monolithic model post-hoc, meaning that one can query the model to try to figure out how it is reasoning. I do not doubt that this monolithic paradigm is going to be interesting and useful for some situations, but it’s not the one that I’m most excited about.

Rather, I love building models that directly reflect underlying biology; the parameters of these models can be interpreted directly in scientific terms. For example, in recent work led by Maggie Russell, we developed a probabilistic model of V(D)J recombination that explicitly incorporates microhomology effects on trimming and ligation. This model is formulated in terms that we know from lab experiments, and the parameters thus have a direct interpretation that quantifies a biological mechanism: in this case, how short stretches of sequence homology influence immune receptor generation. But V(D)J recombination is only one part of a larger collection of processes that shape the adaptive immune response. There are many different data types, and a constellation of models could describe them all in an interesting way (schematized as M1, M2, M3 in the diagram). For example, V(D)J recombination leads to self-tolerance mechanisms, which then lead to the pool of sequences that are ready to respond to infection.
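To make “directly interpretable parameters” concrete, here is a minimal toy sketch in the spirit of such a model. This is not the model from the paper; the names (`trimming_log_likelihood`, `beta_mh`) are invented for illustration. The point is simply that a single parameter carries a biological meaning: how strongly microhomology favors a given trimming outcome.

```python
import numpy as np

def trimming_log_likelihood(trim_choices, microhomology, beta_mh, base_logits):
    """Toy log-likelihood for observed trimming choices (illustrative only).

    Each candidate trimming length gets a score: a baseline preference
    (base_logits) plus beta_mh times the microhomology length available
    at that trim. A softmax converts scores into probabilities, so beta_mh
    directly answers "how much does microhomology favor a trim?".

    trim_choices : (n_seqs,) integer index of the observed trim per sequence
    microhomology: (n_seqs, n_trims) microhomology length at each candidate trim
    beta_mh      : scalar, the interpretable microhomology coefficient
    base_logits  : (n_trims,) baseline preference for each trimming length
    """
    scores = base_logits + beta_mh * microhomology                   # (n_seqs, n_trims)
    log_probs = scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
    return log_probs[np.arange(len(trim_choices)), trim_choices].sum()
```

A positive fitted `beta_mh` would read directly as “microhomology promotes this trimming outcome,” which is exactly the kind of statement about mechanism that a black-box model cannot hand you in its parameters.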

My long-term goal is to continue building out a comprehensive collection of models that describe adaptive immune “repertoires” of sequences.


It currently takes a lot of time and effort to build and do inference under each of these models. First, one must read a stack of biological papers, which use language unfamiliar to computational biologists, to gain an understanding of the system under investigation. The next step is to formulate a probabilistic model of the system. Then comes inference, which can be difficult because these are typically nonstandard models. Finally, one makes plots to understand the results.
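As a small illustration of why the inference step in particular is a sticking point: for a nonstandard model there is usually no off-the-shelf fitting routine, so one writes the likelihood by hand and passes it to a generic optimizer. The example below is a deliberately generic stand-in (a zero-truncated Poisson, nothing to do with immune repertoires), just to show the shape of that work.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_likelihood(log_rate, counts):
    """Zero-truncated Poisson negative log-likelihood (toy nonstandard model)."""
    rate = np.exp(log_rate[0])                        # optimize on the log scale so rate > 0
    log_pmf = (counts * np.log(rate) - rate - gammaln(counts + 1)
               - np.log1p(-np.exp(-rate)))            # renormalize: zero counts are unobservable
    return -np.sum(log_pmf)

counts = np.array([1, 2, 2, 3, 1, 4, 2, 1, 3, 2])     # made-up observed nonzero counts
fit = minimize(neg_log_likelihood, x0=[np.log(counts.mean())], args=(counts,))
print("estimated rate:", np.exp(fit.x[0]))
```

It is only a dozen lines, but multiply this across a constellation of models, each with its own parameterization, diagnostics, and plots, and the friction adds up.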

AI is already greatly speeding up all of these steps for us. NotebookLM enables us to find concepts, not just words, in collections of papers. LLMs are already marvelous tools for discussing model design: I now operate by writing a TeX file describing my model and then having helpful and fun discussions with Claude about implementation. Agent Mode in VSCode is very helpful for prototyping models and can automatically fix error messages and problems.

However, there is still friction. Having these processes spread across systems is like having a multi-way lobotomy. LLMs can be myopic in code design and still require close supervision to generate clean code. Sometimes Agent Mode can go off the rails and do horrible things when asked to refactor, and it is slow. Compute for us still happens in the traditional way, on servers that have no connection to the AI agents, requiring manual work to move things around so that the AI models can see error messages and results.

All of these problems will be solved in the next five years, without a doubt. Indeed, the unified system I would like doubtless already exists in AI labs. And the code is going to get very good: when Jeff Dean says that in one year AI will be as productive as a junior engineer, I believe it.

Returning to the central theme in this post, all of these technological developments mean that it’s going to be much easier, and much more fun, to develop models that directly learn about biological processes in a world where AI permeates the development process. I think that this will close the prediction gap between interpretable models and black-box models, if there is one. These smaller models will provide insight, and we can scale their size according to the task: sometimes a transformer is the right fit, but not always!


My research is just a case study. The bigger picture is that we want AI to greatly expand biological knowledge as a whole. I earnestly hope that this does not come in the form of a black-box oracle.

By formalizing models explicitly, we can build on and extend the vast knowledge that we already have about biology, and understand what the new results mean. I have never paid much attention to formal languages for describing biology, but perhaps they are going to have a renaissance. Personally, I hope that we end up with a nice way of verifying statements about conditional probability. The Lean language has some support for probability, but I haven't tried to see whether it is a practical solution.
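To give one example of the kind of statement I mean (standard probability, not tied to any particular model above): if a downstream process depends on the upstream one only through the intermediate stage, the joint distribution factorizes along the chain, as in the M1 → M2 → M3 picture. This is exactly the sort of assumption I would love to be able to write down and machine-check:

```latex
% Chain factorization under a conditional-independence assumption:
% if x_3 is independent of x_1 given x_2, then
\[
  p(x_1, x_2, x_3) \;=\; p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2).
\]
```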

In this post I have completely ignored the impact that this transformation will have on trainees and the world more generally. Certainly there is a lot to say about that, and I may do so in a future post.