My last day at AstraZeneca was June 17th. I’ve been hoarding papers since February. I probably had at least…a hundred…lying around at home. But at work there were exactly 37 papers on my desk at the time I was clearing it out. Here are the highlights from my stack. I think about some of these papers like at least once a week.
Keywords: biological complexity, nonlinear dynamics, emergent properties
I’m such a sucker for arXiv papers with vaguely philosophically-leaning titles, and I was so pleasantly surprised by this little perspective from Oxford. Biology, as we understand it today, does not lend itself well to reductive mathematical modelling, which has historically worked very well in the realms of theoretical chemistry and physics. As the authors describe: on a scale between highly ordered and highly random, reductionism works well at each extreme but fails in the middle, where biology tends to lie. Biology is also intimately controlled by time and space — features that frequently get abstracted away in mathematical models.
The authors argue that the first problem of biology should be determining the patterns and underlying mechanisms found in extremely rich, extremely dense experimental data (cases in which reductive approaches are not particularly appropriate and where machine learning especially shines). They go on to emphasize that only once we have built a foundational understanding of biological systems can we move on to the “traditional forward-problem” of establishing broad, theoretical principles (going from descriptive > mechanistic > predictive models rather than the other way around, allowing for “a posteriori reductionism”). I side with the authors on this.
Keywords: mass action kinetics, quantifiable evolution, building complex systems
This paper is cited in the preprint just discussed, which I think is how I stumbled upon it. If you haven’t heard of assembly theory (AT) and are itching for scientific controversy, give this paper a read and then skim the strongly heated arguments people are having about it online. That being said, I’m neither entirely sold on nor entirely against AT, though I am firmly skeptical of its utility. Broad-strokes “biological” theories are hard to confirm empirically, and I would consider AT an abstract biochemical theory at best. After all, it’s a long leap from explaining how amino acids formed in the soup of life to explaining how complex biological systems grow and function. And honestly, a lot of the controversy seems to boil down to arguments over wording rather than the data or the math (their big mistake was using the word “evolution”, in my and others’ opinion).
Protein FID: Improved Evaluation of Protein Structure Generative Models (Faltings et al., 2025)
Keywords: generative protein models, Fréchet inception distance, protein designability
I was super excited to see this type of work coming from Regina Barzilay and Tommi Jaakkola. Currently, generative protein models face a stark trade-off between designability (can you express this synthetic protein IRL) and diversity (do the generated proteins cover the same structural variety as the PDB, or do they all look alike). A very common example of this is seen in early generative models (e.g. ProteinMPNN), which tend to produce proteins with high alpha-helical content. Why is this a problem? Models that are trained on the PDB or CATH databases see a wide range of secondary structures, and so an ideal model not only generates proteins that are stable in vitro (bundles of helices are fairly good at this) but that are also structurally distinct from preexisting proteins, meaning the generative outputs are as diverse as the data the model was trained on.
This concept isn’t new, and I was first introduced to it in this paper from Possu Huang’s lab. Both sets of authors champion the use of the Fréchet inception distance (FID) to quantify the gap between the distribution of generated structures and the natural distribution of protein structures (lower FID is better). Long story short: current state-of-the-art models have pretty high FIDs; it’s honestly kind of shocking (see figure below, we want the orange and blue to overlap).
Figure highlight:
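For anyone who hasn’t run into FID outside of image generation: the metric embeds real and generated samples with some feature extractor, fits a Gaussian to each set of embeddings, and reports the Fréchet distance between the two Gaussians. The protein-specific contribution of the paper is the choice of structure encoder; the sketch below only shows the generic FID computation on placeholder embedding arrays (a minimal sketch, not the authors’ implementation).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, dim) embedding sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)      # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Placeholder embeddings standing in for features from a protein structure encoder.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 64))
generated = rng.normal(0.5, 1.0, size=(500, 64))   # shifted distribution -> nonzero FID
print(frechet_distance(real, generated))
```

Underneath it’s just the distance between the two Gaussian fits (means plus covariances), so a generated set only scores well if it matches both the center and the spread of the natural distribution, which is exactly why a model that only spits out helical bundles gets punished.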
Designing Simple Mechanisms (Li, 2024)
Keywords: theoretical economics, game theory, human error
If you consider yourself an eBay (dynamic ascending auction) or Goodwill Auctions (second-price sealed-bid auction) connoisseur, you should read this paper. The author uses auctions to illustrate why mechanisms need to offer clear incentives for their participants, meaning the mechanism must be “strategy-proof”. In an ideal game-theory-abiding world, simplicity leads to more efficient and fair outcomes for all participants. Often, when the incentives are not clear, participants behave in ways that disadvantage themselves, because they attempt to employ a strategy to “game” the system rather than act truthfully (or responsibly). Take the second-price sealed-bid auction: the highest bidder wins but pays the second-highest bid rather than their own, which makes bidding your true valuation the dominant strategy. Yet participants typically overbid, assuming the second-highest bid will be reasonable or close to what they would have paid anyway, instead of bidding what the object is truly worth to them (the incentive to bid truthfully is much clearer in an ascending auction, as in the case of eBay). Anyway, now that I am aware of how irrationally I behave on shopgoodwill.com, I will be keeping this paper in mind at all times. But this doesn’t just apply to auctions: next time you’re applying for a job, entering a lottery, or filling out a survey, consider the mental gymnastics you’re doing and how many bad actors you’re forced to account for. Are the mechanisms around you truly simple?
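To convince myself of the overbidding point, here’s a tiny simulation (entirely made-up numbers, not from the paper): a bidder who values an item at 100 either bids truthfully or overbids at 130 against random rival bids. The overbid only changes the outcome when the top rival bid lands between 100 and 130, and in exactly those cases the winner pays more than the item is worth to them.

```python
import random

def second_price_payoff(my_bid, my_value, rival_bids):
    """My payoff in a sealed-bid second-price auction: win -> pay the highest rival bid."""
    highest_rival = max(rival_bids)
    return my_value - highest_rival if my_bid > highest_rival else 0.0

random.seed(0)
my_value = 100.0
n_trials = 100_000
truthful = overbid = 0.0
for _ in range(n_trials):
    rivals = [random.uniform(0, 150) for _ in range(3)]       # made-up rival bid distribution
    truthful += second_price_payoff(100.0, my_value, rivals)  # bid exactly what it's worth to me
    overbid += second_price_payoff(130.0, my_value, rivals)   # try to "game" the auction
print(f"average payoff, truthful bid: {truthful / n_trials:.2f}")
print(f"average payoff, overbid:      {overbid / n_trials:.2f}")
```

Truthful bidding comes out ahead on average; overbidding only adds wins you regret, which is the whole point of calling the mechanism strategy-proof.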
Biophysics-based protein language models for protein engineering (Gelman et al., 2024)
Keywords: simulated training data, transfer learning, biophysical modeling
This paper did not change my life, but it did have a really thorough section about using simulated (synthetic) data versus experimental data that I greatly appreciated. This is a problem I’ve spent quite some time thinking about: there are only so many crystal structures, and those crystal structures offer a very idealized snapshot of a protein interaction or conformation. How can we extract data representing the full range of conformations or dynamic movement? Through simulation (molecular dynamics)! The question then becomes: what is the ideal balance of experimental and synthetic data for training a model? The authors explored just that and show that performance improves as both the number of simulated and experimental data points increases, but that there is a performance plateau at around 128k simulated samples (in the use case of their model). If you read across a row of the contour plot below, you can see a gradual increase for set amounts of experimental samples, with the most impactful increases at the lower sample counts (40-160k). Naturally, the contour plots will look different depending on the size of the protein, with smaller proteins exhibiting a steady increase in performance with more simulated samples, while larger proteins show a higher threshold, only after which there is a sharp increase in performance.
The takeaway here is that if you have experimental data — use as much of it as possible. If you don’t have a lot of experimental data, you can supplement it with simulated data for some boost in your model’s success rate.
Figure highlight:
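In practice this usually means a two-stage recipe: pretrain on the plentiful simulated labels, then fine-tune on the scarce experimental measurements. The sketch below is a generic pretrain-then-fine-tune pattern with made-up tensors, not the authors’ architecture or their simulated labels; it’s just the shape of the idea.

```python
import torch
import torch.nn as nn

# Stand-ins for the two data sources: plentiful simulated labels vs. a small experimental set.
# Shapes and values are invented; real inputs would be encoded protein sequences/variants.
torch.manual_seed(0)
x_sim, y_sim = torch.randn(10_000, 64), torch.randn(10_000, 1)
x_exp, y_exp = torch.randn(200, 64), torch.randn(200, 1)

backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32), nn.ReLU())
head = nn.Linear(32, 1)
model = nn.Sequential(backbone, head)
loss_fn = nn.MSELoss()

# Stage 1: pretrain the whole model on the simulated data.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(x_sim), y_sim).backward()
    opt.step()

# Stage 2: fine-tune on the scarce experimental data (here only the head, to limit overfitting).
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(x_exp), y_exp).backward()
    opt.step()
```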
Have protein-ligand co-folding methods moved beyond memorization? (Škrinjar et al., 2025)
Keywords: all-atom co-folding, protein-ligand interaction prediction, data leakage
The answer is no! And this is not surprising at all. Nevertheless, I was quite excited to read this and have since pointed to this preprint many times over the past few months when talking about the “success” of co-folding models. Essentially, most state-of-the-art co-folding models still struggle with out-of-distribution cases, meaning the model is just memorizing poses and pockets rather than learning the generalizable, biophysical patterns that govern ligand binding. The more similar a protein-ligand system is to something seen in the training set, the better the model’s performance — in fact, the correlation between success rate and similarity to the training set is almost perfectly linear (see figure below). And this sucks for us! These current models aren’t going to help us discover anything new; they are only good at telling us what we already know. A similar point was made with molecular glue ternary structure prediction (Liao et al., 2025), where unseen targets with low homology to the training data have success rates that are ~4x lower than targets with high homology (for AF3, 16.0% vs 64.5% respectively).
Figure highlight:
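This kind of leakage analysis is easy to reproduce for your own benchmark: score each test system by its similarity to the nearest training example (however you define similarity: pocket, ligand, or sequence), then bin the success rates. The numbers below are invented purely to show the bookkeeping, not taken from the preprint.

```python
import numpy as np

# Hypothetical per-target results: similarity of each test system to its nearest
# training-set neighbour (0-1) and whether the predicted pose was a "success"
# (e.g. ligand RMSD < 2 A). Both arrays are fabricated for illustration.
rng = np.random.default_rng(0)
similarity = rng.uniform(0, 1, 500)
success = rng.uniform(0, 1, 500) < (0.15 + 0.7 * similarity)  # success rises with similarity

bins = np.linspace(0, 1, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (similarity >= lo) & (similarity < hi)
    print(f"similarity {lo:.1f}-{hi:.1f}: success rate {success[mask].mean():.0%} (n={mask.sum()})")
```

If the success rate climbs steadily with training-set similarity, as it does in the preprint’s figure, memorization is doing most of the work.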
Data Organization in Spreadsheets (Broman & Woo, 2018)
Keywords: data management, Excel spreadsheets, bioinformatics
Exactly what it sounds like, this article covers data-entry best practices and the faux pas to avoid. As someone who has worked in a wet lab and collected large amounts of data in Excel, I wish I had read this prior to undergrad. This is the kind of stuff that either no one ever tells you or, on the contrary, is drilled into you from the beginning. Let’s just say I learned the limitations of Excel the really, really hard way (i.e. losing my data, corrupting my data, unknowingly manipulating my data), and I think a lot of my mistakes could have been avoided if someone had sat me down and forced me to read this. To be totally fair, some of the points in the article (such as choosing good names for things) are *still* an active struggle for me.
This paper also reminds me of this article about bad bar charts in biology — essentially bench biologists are struggling to handle the larger and larger amounts of data they are producing…with serious consequences (this study found 1-in-3 papers had data distortion in their bar charts)! Another point to consider: a lot of biologists want to use their data for ML purposes, but half the grunt work is making the data compatible, readable, and clean. Data literacy is important, but it all really starts with how we are entering, organizing, and storing that data.
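A handful of the paper’s recommendations (consistent category names, no empty cells, ISO-8601 dates, sensible column names) are also the kind of thing you can sanity-check programmatically once the spreadsheet is exported. A toy example with a deliberately messy, made-up table:

```python
import pandas as pd

# A tiny invented example of the kind of spreadsheet Broman & Woo warn about:
# inconsistent category spellings, an empty cell, and a non-ISO date.
df = pd.DataFrame({
    "Sample ID": ["m1", "M2", "m3"],
    "treatment": ["drug", "Drug", None],
    "date": ["2025-06-17", "6/17/25", "2025-06-18"],
})

problems = []
if df.isna().any().any():
    problems.append("empty cells (use an explicit code like 'NA' instead)")
for col in df.columns:
    if " " in col:
        problems.append(f"column name {col!r} contains spaces (prefer snake_case)")
bad_dates = ~df["date"].fillna("").str.fullmatch(r"\d{4}-\d{2}-\d{2}")
if bad_dates.any():
    problems.append(f"{bad_dates.sum()} date(s) not in ISO-8601 (YYYY-MM-DD) format")
treatments = df["treatment"].dropna()
if treatments.str.lower().nunique() < treatments.nunique():
    problems.append("inconsistent capitalisation in 'treatment' categories")

print("\n".join(problems))
```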
P.S. This paper came onto my radar by way of Ming “Tommy” Tang (currently the director of Bioinformatics at AZ), who gave an awesome career talk while I was there.
Biological databases in the age of generative artificial intelligence (Pop et al., 2025)
Keywords: biological databases, error propagation, transitive annotation, data maintenance
Large databases are inevitably full of errors, and although a lot of funding and effort goes into the creation and maintenance of public databases, systematic quantification of those errors is practically nonexistent (it’s usually handled on a case-by-case basis). Typically, data that gets deposited goes through a series of annotation steps. Annotations can be manually assigned, but more often they are computationally inferred. Issues begin to arise when additional labels are computationally imputed (sometimes inaccurately) from an existing label, so-called transitive annotation, letting a single uncaught error spread across a whole series of data points in the database. For ML model-training purposes, this poses a serious issue of data contamination, which in turn can cause “model collapse” (where models trained on computationally or self-generated data degrade in accuracy and quality). The article also makes an interesting point that it’s really difficult (as in, we don’t have the tools) to discern which labels are computationally generated and which ones are not — and to an untrained eye, everything will seem legit solely because it’s in the database in the first place! There is such a thing as *bad* public data, unfortunately.
They end with five recommendations for the future, two of which stand out: treating the study of error propagation as bona fide scholarship (this is a topic of active research in ML but not so much in the “hard” sciences) and explicitly funding and supporting public data stewardship. The latter also reminds me of this blogpost from New Science, which likewise argues for better, more secure funding to maintain life-sciences software (some things are slowly emerging, like the Virtual Institute for Scientific Software (VISS) supported by Schmidt Sciences).
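The transitive-annotation failure mode is easy to caricature in a few lines: seed a toy database with a handful of curated entries, one of which is wrong, and let every new entry copy its label from whatever existing entry it best matches. Everything below is invented; it’s just meant to show how one bad label quietly becomes hundreds.

```python
import random

random.seed(0)

# Ten "manually curated" seed entries; one of them carries a wrong functional label.
database = [{"id": i, "label_correct": i != 0, "source": "manual"} for i in range(10)]

# Every new deposit copies its label from an existing entry (a stand-in for best-hit
# homology transfer), whether that entry was curated or itself inferred.
for new_id in range(10, 1000):
    hit = random.choice(database)
    database.append({
        "id": new_id,
        "label_correct": hit["label_correct"],   # the label is copied, right or wrong
        "source": f"inferred from entry {hit['id']}",
    })

wrong = sum(not entry["label_correct"] for entry in database)
print(f"{wrong} of {len(database)} entries trace back to the single bad seed annotation")
```

Roughly a tenth of the toy database ends up wrong because of one bad seed, and without the explicit `source` field there would be no way to tell the inferred labels from the curated ones, which is exactly the article’s point about provenance.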
AS 07/17/25