Executive Summary

We usually credit AlphaFold's uncanny accuracy at protein folding to the cleverness of the model. Yet a large share of that accuracy came from somewhere else: its ability to comb through vast databases, pull in hundreds to thousands of "look-alike proteins," and line them up, a step known as multiple sequence alignment (MSA). This article looks at TDFold, a method from a 2026 Nature Machine Intelligence study that confronts that data dependency head-on.

Strip those homologous sequences away and AlphaFold2's average TM-score collapses from about 0.80 to 0.41, close to a halving. In other words, the bottleneck on accuracy was never model size; it was the richness of the reference data. TDFold uses none of that data. Instead it recasts a protein's geometry as a two-dimensional image and lets an image-generation model (a diffusion model) generate that geometry directly from the sequence. The result is the best accuracy among methods that use no MSA, with the largest margin on orphan proteins, which have no known homologous sequences at all.

One caveat has to be stated plainly. TDFold did not surpass an AlphaFold that had every homologous sequence at its disposal. "Removing the data made it more accurate" is a claim about the comparison with other single-sequence methods under the same conditions. Holding to that distinction, this article asks a broader question: can the gap we usually fill with more data be filled instead by how we represent the problem?

0.80 → 0.41: AlphaFold2 TM-score when MSA is removed. Without homologs, accuracy collapses by nearly half; that was the real bottleneck.

71.91: TDFold CASP16 GDT-TS. Best among MSA-free single-sequence methods (ESMFold 70.33, OmegaFold 61.55).

10–100×: Inference speed. Versus language-model methods on long sequences, with no database search either.

RTX 4090 · 1 week: Full training cost. A single GPU stands in for the weeks a TPU cluster once took.

1. Did AlphaFold really solve this on its own?

A protein is a chain of amino acids strung in a single line that folds on its own into a three-dimensional shape. Guessing that final shape from the sequence alone is the structure-prediction problem. When people say AlphaFold2 essentially solved this decades-old challenge, they tend to overlook a quiet accomplice: multiple sequence alignment (MSA).

An MSA is, put simply, a "list of similar proteins." Proteins that branched from a common evolutionary root have sequences that resemble one another, and lining that list up side by side reveals a telling signal. Two positions that sit close together in the folded structure tend to change in tandem: when one mutates, the other mutates to match, preserving the fold. Read that co-evolution pattern and you can infer, indirectly, which two points along the sequence lie near each other in space. AlphaFold2's core engine is built precisely to extract that signal.

[Figure: Co-evolution signal, or what the MSA tells AlphaFold. A toy homolog list of four sequences (sp.1 AKLEV, sp.2 GRVDI, sp.3 GHLQI, sp.4 AKIEV) in which positions i and j change together (K↔E, R↔D, H↔Q). AlphaFold2 reads this co-evolution as a spatial-proximity signal for its pairwise representation. With no MSA there is no signal, and the TM-score falls from 0.80 to 0.41.]
▲ Reading co-evolution patterns from homolog lists to infer spatial proximity between residues — why AlphaFold2 depended on MSA | Pebblous original diagram
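The co-evolution idea can be made concrete with a small sketch. It scores every pair of columns in the four-sequence toy alignment from the figure by mutual information, a classical statistic for "these two positions change together." This is an illustration under stated assumptions, not AlphaFold2's mechanism: AlphaFold2 learns its pairwise representation from the MSA rather than computing mutual information, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy alignment from the figure: four homologs, five aligned positions.
MSA = ["AKLEV", "GRVDI", "GHLQI", "AKIEV"]


def column(msa, idx):
    """Return the residues found at one aligned position."""
    return [seq[idx] for seq in msa]


def mutual_information(msa, a, b):
    """Mutual information (in bits) between aligned columns a and b (0-indexed)."""
    n = len(msa)
    col_a, col_b = column(msa, a), column(msa, b)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        p_x, p_y = count_a[x] / n, count_b[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


def rank_pairs(msa):
    """All column pairs, sorted from most to least co-varying."""
    length = len(msa[0])
    scores = {
        (a, b): mutual_information(msa, a, b)
        for a, b in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for (a, b), score in rank_pairs(MSA):
        # Report 1-indexed positions to match the figure.
        print(f"positions {a + 1} and {b + 1}: {score:.2f} bits")
```

On this toy input, positions 2 and 4 (i and j in the figure) come out on top at 1.5 bits. With only four sequences every pair looks somewhat correlated, which is exactly why real pipelines need deep alignments, and why a shallow or empty MSA leaves so little signal.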

The trouble lies in building the list in the first place. Searching giant databases such as UniRef and BFD for similar sequences usually takes tens of minutes on its own, and memory balloons as the sequence grows longer. And performance rides directly on how deep that list is.

| Input condition | Average TM-score | What it means |
| --- | --- | --- |
| Full MSA + templates | ~0.98 | Nearly matches the experimentally determined structure |
| Full MSA (no templates) | ~0.80 | Homologs alone are accurate enough |
| Single sequence (no MSA) | ~0.41 | Remove the homologs and accuracy collapses by nearly half |

The numbers are unambiguous. Take the homologs out of AlphaFold2 and accuracy sinks from 0.80 to 0.41. The model is untouched; only the data has been removed. The single biggest determinant of accuracy, then, was not the model's architecture but how many similar proteins it could find and attach as reference.
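For readers unfamiliar with the metric, here is a minimal sketch of how a TM-score is computed from per-residue distances. It uses the standard definition, with a length-dependent scale d0 = 1.24 · ∛(L − 15) − 1.8 Å, but it assumes the two structures are already superimposed. The official score additionally searches for the superposition that maximizes the sum. The function names are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) from the TM-score definition."""
    if target_length < 19:
        # Below this length the formula gives a non-positive d0.
        raise ValueError("d0 formula is only meaningful for chains of 19+ residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the reference structure.
    Unaligned residues contribute nothing, so they lower the score.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    length = 100
    print(tm_score([0.0] * length, length))            # perfect match
    print(tm_score([tm_d0(length)] * length, length))  # every residue off by d0
```

A residue that is off by exactly d0 contributes 0.5, so a model whose every residue misses by that much scores 0.5 overall. As a check on the "nearly half" reading of the table: (0.80 − 0.41) / 0.80 ≈ 0.49.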

For certain proteins that dependency becomes an outright wall. Orphan proteins, which have no known relatives, and synthetic proteins freshly designed in the lab have no list to build in the first place. This is what spawned the many attempts to predict without an MSA. Methods built on protein language models, such as ESMFold and OmegaFold, pretrain on hundreds of millions to billions of sequences, folding evolutionary information into model weights instead of a runtime database lookup. That does not remove the data dependency so much as shift it from search time to training time, and so their limits on orphan proteins remained stark.

To sum up: much of AlphaFold's power lay in its data-retrieval knack for "finding and attaching similar things," and without that data it fell apart. Which turns the question around. Must that gap be filled only with more data, or can we represent the problem differently and cover it another way?

2. TDFold rewrote the problem as an image

TDFold, presented in Nature Machine Intelligence by Xudong Wang's team, changed the angle on the problem rather than hunting for more data. It begins with a single observation. Write down the distance and orientation between every pair of residues i and j in a protein of N residues, and the result is an N×N table. Such a table can naturally be read as a two-dimensional image.

Stack one distance map and a few orientation maps and you get a multi-channel image — one that carries geometric relationships rather than color. Once the representation is recast this way, the tools built for images can be borrowed wholesale. What TDFold borrowed is the image-generation model that turns text into pictures: Stable Diffusion.
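The "table as image" idea is easy to see in code. The sketch below builds the N×N distance map from a set of residue coordinates and rescales it into a [0, 1] channel, as one would for a grayscale image. The coordinates, the 20 Å clipping value, and the function names are assumptions made for illustration. The article does not specify TDFold's exact encoding, and the orientation channels are omitted here.

```python
from math import dist


def distance_map(coords):
    """N x N matrix of Euclidean distances between residue coordinates."""
    return [[dist(a, b) for b in coords] for a in coords]


def to_image_channel(dmap, max_distance=20.0):
    """Clip distances and rescale to [0, 1], like one grayscale image channel.

    max_distance is an assumed cutoff, not a value taken from the paper.
    """
    return [[min(d, max_distance) / max_distance for d in row] for row in dmap]


if __name__ == "__main__":
    # Hypothetical C-alpha coordinates (angstroms) for a four-residue fragment.
    ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    channel = to_image_channel(distance_map(ca))
    for row in channel:
        print(" ".join(f"{v:.2f}" for v in row))
```

The map is symmetric with a zero diagonal, and its size depends only on the sequence length. Stacking it with orientation maps of the same shape gives the multi-channel image described above.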

The whole pipeline runs in two stages. The first generates a geometric map from the sequence; the second raises that map into an actual three-dimensional structure.

[Figure: TDFold two-stage pipeline, structure without MSA. Input: a single amino-acid sequence (no MSA). Stage 1: a fine-tuned Stable Diffusion model generates the geometric template, an N×N distance-and-orientation map treated as an image. Stage 2: a lightweight network refines it into 3D atomic coordinates. No database search is involved.]
▲ TDFold two-stage pipeline — Stage 1 generates the geometric image via diffusion; Stage 2 lifts it into 3D coordinates | Pebblous original diagram

2.1 "Generating" a geometric template

The first stage fine-tunes Stable Diffusion for proteins. Just as the original model takes a sentence and imagines a scene, here it takes an amino-acid sequence and draws the distance and orientation maps between residues. The information AlphaFold2 read out of its list of homologs — which position sits near which — TDFold imagines into being with a diffusion model, no search required. That imagination is not arbitrary, of course. Because training uses the distance and orientation maps of experimentally determined structures as ground truth, the model is disciplined, given a sequence, to draw plausible geometry. The generative model stands in for the step of trawling a database.
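To give a feel for what "training on ground-truth maps" means for a diffusion model, here is a sketch of the generic forward-noising step from the DDPM formulation. A clean map x0 is blended with Gaussian noise, and a network would be trained to predict that noise given the noisy map, the timestep, and the sequence. This is the textbook recipe, not TDFold's training code. Stable Diffusion in particular works in a learned latent space with its own schedule. The schedule values and function names below are assumptions.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (start/end values follow common DDPM defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, alpha_bar_t, rng):
    """Forward step: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [
        [signal * v + sigma * e for v, e in zip(row, eps_row)]
        for row, eps_row in zip(x0, eps)
    ]
    return x_t, eps  # eps is the regression target for the denoiser


def recover_x0(x_t, eps, alpha_bar_t):
    """Invert the forward step when the noise is known (what a perfect denoiser enables)."""
    signal, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [
        [(v - sigma * e) / signal for v, e in zip(row, eps_row)]
        for row, eps_row in zip(x_t, eps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_betas(1000))
    clean_map = [[0.0, 0.19], [0.19, 0.0]]  # hypothetical normalized distance map
    noisy_map, noise = add_noise(clean_map, alpha_bars[500], rng)
    print(recover_x0(noisy_map, noise, alpha_bars[500]))
```

At generation time the process runs in reverse. Starting from pure noise, the trained network removes a little predicted noise at each step, conditioned on the sequence, until a plausible geometric map remains.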

2.2 Meshing sequence and geometry into 3D

The second stage is a lightweight network that carries the generated geometric map into actual three-dimensional coordinates. It re-aligns the relationships between the sequence and its residues in a co-evolutionary fashion and factors in how side-chain atoms shape the backbone, settling the final structure. Because no heavy large language model is running, this stage is light on memory and fast to run.

The point is not "they gathered more data" but "they rewrote the problem as an image." The moment geometric relationships are cast as an image, all the knowledge already accumulated in Stable Diffusion within the image domain can be pulled straight over. The gap once filled with homolog data was instead covered by changing the representation and borrowing knowledge from another domain.

3. So how accurate is it, really?

Protein structure prediction methods are tested at CASP, an open competition held every two years. How close a predicted structure comes to the real one is measured by the GDT-TS score (0–100, higher is better). The table below gathers the single-sequence methods from the two most recent competitions, alongside the full-homolog methods included for reference.

| Model | CASP15 GDT-TS | CASP16 GDT-TS | Input |
| --- | --- | --- | --- |
| TDFold | 63.52 | 71.91 | Single sequence (no MSA) |
| ESMFold | 62.99 | 70.33 | Single sequence (no MSA) |
| OmegaFold | 57.37 | 61.55 | Single sequence (no MSA) |
| AlphaFold2 | 73.24 | 74.05 | Homologs + templates |
| AlphaFold3 | 73.26 | 79.59 | Homologs + templates |
[Figure: CASP16 GDT-TS performance comparison (0–100, higher is better). Single-sequence methods (no MSA): TDFold 71.91, ESMFold 70.33, OmegaFold 61.55. Full-homolog methods (MSA used), shown for reference under a different condition: AlphaFold2 74.05, AlphaFold3 79.59.]
▲ CASP16 GDT-TS — TDFold leads among MSA-free single-sequence methods. AlphaFold2/3 use full homologs (dashed, different condition) | Pebblous original diagram
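To read the GDT-TS numbers in the table, it helps to see how the score is assembled. It averages, over cutoffs of 1, 2, 4 and 8 Å, the percentage of residues that fall within each cutoff of their true position. The sketch below assumes a single fixed superposition. The official evaluation searches for the best superposition at each cutoff, so this simplified version can only underestimate it. The function name is hypothetical.

```python
def gdt_ts(distances, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """GDT-TS on a 0-100 scale for one fixed superposition.

    distances: per-residue distance in angstroms between predicted and
    reference positions (typically C-alpha atoms).
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff, then the mean over cutoffs.
    fractions = [sum(d <= t for d in distances) / n for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)


if __name__ == "__main__":
    # Hypothetical five-residue example: one residue inside each successive
    # cutoff, and one outlier beyond 8 angstroms.
    print(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]))
```

In the example, 20%, 40%, 60% and 80% of residues fall within the four cutoffs, which averages to 50. A badly misplaced residue costs at most its share of each cutoff, which is why GDT-TS is more forgiving of a few outliers than a plain RMSD.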

The table has to be read in two layers. First, among the single-sequence methods that share the same conditions, TDFold leads with 63.52 on CASP15 and 71.91 on CASP16: narrowly ahead of ESMFold (62.99 and 70.33) and well ahead of OmegaFold (57.37 and 61.55). On pLDDT, the model's self-confidence measure, it scores 72.06 on CASP14, well above ESMFold's 67.14 and OmegaFold's 53.25. Among methods that use no homologs, it is the most accurate in its class to date.

The second layer matters just as much. AlphaFold2 and AlphaFold3, with every homolog at their disposal, still lead in the 73–79 range. So saying "TDFold beat AlphaFold" would misstate the facts. The accurate sentence is this: under the condition where the data has been removed, TDFold was more accurate than other methods under that same condition.

TDFold's real strength shows up where there is no data at all. On a benchmark of 77 orphan proteins with not a single known relative, TDFold outpaced ESMFold, OmegaFold, and the other existing single-sequence methods by a wide margin. This is territory where even AlphaFold effectively gives up, having no list to build. A design that produces geometric maps through generation rather than search shines brightest precisely when there is no reference data to be found.

The cost is striking too. Because TDFold runs no large language model and skips the database-search stage entirely, inference is 10 to 100 times faster than language-model methods on long sequences. Training the full model finishes within a week on a single RTX 4090 GPU. Set against a previous generation that ran for weeks on TPU clusters, the barrier to entry has dropped markedly.

4. When Representation Replaces Data

Here we return to the vantage point of people who work with data. The default of the past few years was clear: if performance disappoints, gather more data. Deeper MSAs, more training sequences, bigger databases. AlphaFold's success reinforced that direction, since accuracy scaled with the richness of the reference data.

TDFold answered the same bottleneck differently. It did not find more data; it rewrote the problem to be solved in the form of an image. And with that, the knowledge already piled up in the image domain flowed over to the protein side. Rather than topping up scarce data, it borrowed a different kind of knowledge that already existed by changing the representation. Set three vantage points side by side and the difference sharpens.

| Vantage point | Central question | What this case leaves us |
| --- | --- | --- |
| Volume | How much reference data can we gather? | AlphaFold's strength and its limit. It collapsed where the data ran out. |
| Weights | Should we bake the data into the model? | The protein-language-model path. The dependency did not vanish; it moved to training time. |
| Representation | How should we rewrite the problem? | TDFold's path. Change the representation and knowledge from another domain stands in for data. |

This shift is not confined to proteins. Whether you view a body of data as a time series, an image, or a graph changes what you can draw out of the same information. What TDFold showed is a possibility: a problem that looks short on data may in fact be short on representation. Before gathering more, it is worth asking first whether there is room to write it down differently.

Representation, of course, does not always replace data. TDFold, too, never reached the accuracy of an AlphaFold running on full homologs. Even so, this case is valuable because it surfaces one other question worth asking before the answer defaults to "we need more data." The gap we set out to fill with sheer volume — could we cover it instead by representing the problem another way?

Editor's Note

The point Pebblous keeps returning to when we talk about AI-Ready Data lands right here. More than piling up data indiscriminately, it is the form in which the data is prepared and represented that changes a model's outcome. TDFold adds one more piece of evidence to that argument, from outside our own lab. It reconfirms that the axis of scaling data volume and the axis of representing data point in different directions — and that the latter remains the less explored of the two.
