Executive Summary
The knowledge bases that companies run always have holes in them. It is not just one entry that goes missing. Often several slots are empty at once, and sometimes nearly everything is blank. KREPE, presented at ICML 2026 by Professor Joyce Jiyoung Whang's team at KAIST, confronts this reality head-on. Until now, research on knowledge graph completion has stayed within link prediction: "given an almost-complete fact, rank the candidates for the one missing slot." KREPE breaks that frame and defines a new task, Fact Generation: producing a valid new fact in its entirety no matter how many slots are empty.
The most striking result came from the hardest setting of all: generating a new fact from a completely empty input. The paper reports that KREPE outpaced large language model (LLM) baselines built on GPT-5.2 and Gemini 3 Pro by a wide margin. Unlike LLMs, which mimic facts from text probabilities, a model that learns graph structure and the context inside a fact directly produced structured knowledge more accurately and more efficiently. On top of that came a counterintuitive bonus: learning to generate also delivered top performance on the existing ranking task, relation prediction in particular.
This article reads what KREPE changed through the lens of data quality. The core message is simple: a model's capability ultimately comes from the structure and completeness of its input data. All specific performance figures are as reported in the ICML 2026 paper (arXiv:2605.24064) and are cited with their source in the body below.
Four figures compress both this study's achievement and the backdrop behind it. The accuracy and efficiency of generation from empty inputs show what KREPE accomplished; the scale of Wikidata's gaps shows why that work was needed; and the relation-prediction performance shows that a model trained to generate also rose to first place on the ranking task.
| Figure | What it measures | Detail |
|---|---|---|
| 0.855 | Empty-input generation accuracy | Valid-and-novel fact ratio, Scratch · WikiPeople⁻ (LLMs reach 0.46–0.60 at most) |
| 2.85 tries | Per valid fact | Competitors up to 27.58, roughly 10× more efficient |
| 12.5× | Wikidata incompleteness | Missing inContinent facts are 12.5× the existing ones |
| 1st everywhere | Relation prediction | Beats prior methods across all datasets and settings |
Holes Are the Constant, Not the Exception
A knowledge graph is a structured way of writing down human knowledge so that computers can work with it. Search, recommendation, question answering, and reasoning systems all run on top of this structure. Its most basic unit is the triplet — a single fact written across three slots (subject, relation, object), as in "Einstein — received — Nobel Prize in Physics."
The problem is that facts in the real world do not fall neatly into three slots. The sentence "Einstein received the Nobel Prize in Physics" leaves out when and for what work. So large knowledge bases like Wikidata and YAGO attach auxiliary key-value pairs to the base triplet. This auxiliary information is called a qualifier. A knowledge graph that expresses complex facts by adding qualifiers — "year = 1921," "for = the photoelectric effect" — is called a hyper-relational knowledge graph (HKG).
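To make the structure concrete, here is a minimal sketch of how a hyper-relational fact might be held in code. The class name `HyperRelationalFact`, its field names, and the `empty_slots` helper are illustrative assumptions, not a schema taken from Wikidata, YAGO, or the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HyperRelationalFact:
    """A base triplet plus optional qualifier key-value pairs (hypothetical schema)."""
    subject: Optional[str]
    relation: Optional[str]
    obj: Optional[str]
    qualifiers: dict = field(default_factory=dict)

    def empty_slots(self) -> list:
        """Return the names of slots whose value is still None."""
        slots = {"subject": self.subject, "relation": self.relation, "object": self.obj}
        slots.update({f"qualifier:{key}": value for key, value in self.qualifiers.items()})
        return [name for name, value in slots.items() if value is None]


fact = HyperRelationalFact(
    subject="Einstein",
    relation="received",
    obj="Nobel Prize in Physics",
    qualifiers={"year": "1921", "for": None},  # the "for" qualifier is not filled in yet
)
print(fact.empty_slots())  # ['qualifier:for']
```

Every qualifier added to a fact is one more slot that can be left empty, which is the trade-off the next section quantifies.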
1.1 Even Wikidata Is More Than Half Empty
A hyper-relational structure raises expressive power, but it also multiplies the blanks: the more slots there are, the more of them go unfilled. Statistics from Wikidata, the world's largest collaborative knowledge base, make this plain. The table below compares, for specific properties, the scale of "facts currently filled in" against "facts that ought to exist but are missing."
| Property | Existing facts | Estimated missing | Ratio |
|---|---|---|---|
| Continent (inContinent) | ~71K | ~889K | 12.5× more missing |
| Spoken language (spokenLanguage) | ~2.1M | ~7.1M | +174% missing |
Even in Wikidata, refined for well over a decade by thousands of volunteer editors, some properties have more than twelve times as many missing facts as filled ones. Qualifier density is even more uneven across datasets. Among the same family of hyper-relational benchmarks, 13.6% of facts in WD50K carry qualifiers, while in WikiPeople⁻ the share is only around 2.6%. These figures come from different populations and aren't directly comparable, but one thing is clear: whether qualifiers are abundant or scarce, the holes are always there.
For anyone who works with data quality, this diagnosis is familiar. A company's master data, product catalogs, and customer graphs all operate in a perpetually incomplete state. Incompleteness is not an exception to be fixed but a constant to be lived with. That is exactly KREPE's starting point. If you cannot eliminate the holes, redesign how you fill them well.
The Limits of Link Prediction — One Blank Isn't Enough
Until now, filling in hyper-relational knowledge graphs has been handled almost entirely through the frame of link prediction. Leading models like StarE, GRAN, QUAD, and Hy-Transformer all take this approach. The principle is this: assume that "an almost-complete fact is missing just one entity or relation," score the candidates that could fill that empty slot, and put the most plausible one at the top. It is a kind of multiple-choice fill-in-the-blank.
This is a well-defined problem. Given the premise that there is exactly one blank, all you need is to rank a list of candidates well. And performance did climb steadily. Figures reported across one lineage of prior work show that early StarE reached an MRR (Mean Reciprocal Rank — a measure of how near the top the correct answer is placed) of 0.349, which Hy-Transformer later improved incrementally to 0.356.
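For readers unfamiliar with the metric, the sketch below computes MRR on a handful of made-up rankings. The candidate lists and the helper name `mean_reciprocal_rank` are illustrative assumptions; they are not data or code from any of the cited models.

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1/rank of the correct answer; a missing answer contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_answers):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)  # ranks are 1-based
    return total / len(gold_answers)


# Four toy queries, each asking for the missing object of one fact.
rankings = [
    ["Nobel Prize in Physics", "Copley Medal", "Max Planck Medal"],  # correct at rank 1
    ["Copley Medal", "Nobel Prize in Physics", "Max Planck Medal"],  # correct at rank 2
    ["Copley Medal", "Max Planck Medal", "Nobel Prize in Physics"],  # correct at rank 3
    ["Max Planck Medal", "Nobel Prize in Physics", "Copley Medal"],  # correct at rank 2
]
gold = ["Nobel Prize in Physics"] * 4

mrr = mean_reciprocal_rank(rankings, gold)
print(round(mrr, 3))  # 0.583
```

An MRR of 1.0 would mean the correct answer is always ranked first, so a move from 0.349 to 0.356 is a small step on that scale.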
But all of this progress rests on the assumption that "there is only one blank." As Section 1 showed, real knowledge bases are not like that. Picture the moment a new entity is registered. For a newly onboarded company, a freshly launched product, or a person being recorded for the first time, almost no slots are filled. It isn't one blank — nearly everything is blank. This is the so-called cold-start situation.
Link prediction cannot handle this. With no filled-in context, the very basis for ranking candidates disappears. The same problem arises as the blanks grow to two or three. Which slot to fill first, and how filling one slot affects another, are hard to express within the link-prediction frame.
Link prediction is optimized for guessing "the last piece of a fact you almost entirely know." But where data is scarcest — exactly where new facts are just arriving — that very premise collapses. It is a tool you cannot use at the precise moment you need it the most.
Fact Generation Doesn't Count the Blanks
The team's answer was to change the frame of the problem itself. Instead of ranking one blank, they defined a new task: "from a hyper-relational fact with some or all of its components masked, produce a valid new fact." This is Fact Generation. If completion is the work of finishing off a given fact, generation extends all the way to shaping a new fact from a near-blank slate.
3.1 Three Patterns of Missingness
How fact generation embraces the varied gaps of the real world becomes clear in the three settings the team used for evaluation. The table below summarizes what situation each setting corresponds to.
| Setting | Information given | Real-world counterpart |
|---|---|---|
| Scratch | Nothing (a completely empty input) | A new entity with no records at all; knowledge expansion from a blank slate |
| Targeted | One component | Routine enrichment that fills in the rest from a single clue |
| Arbitrary Masking | An arbitrary number of slots are empty | The uneven gaps of a real knowledge base, just as they are |
Laid side by side, link prediction is effectively the one-blank special case of Arbitrary Masking: every slot but one is given. Fact generation widens that single point into a whole range. At one end sits Scratch, with no clue at all; at the other, Arbitrary Masking, where the number and position of the gaps change every time. In other words, fact generation does not discard link prediction; it absorbs it while covering a far wider slice of reality.
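One way to picture the settings is as different mask patterns over the same fact. The sketch below is a simplification under stated assumptions: the slot names, the `[MASK]` placeholder, and the fixed four-slot layout are hypothetical, and the paper's actual input encoding may differ (for instance, in how the number of qualifiers is handled when nothing is given).

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for an empty slot

FACT = {
    "subject": "Einstein",
    "relation": "received",
    "object": "Nobel Prize in Physics",
    "year": "1921",  # a qualifier
}


def scratch(fact):
    """Scratch: nothing is given, so every slot is masked."""
    return {slot: MASK for slot in fact}


def targeted(fact, keep):
    """Targeted: a single component is given and the rest is masked."""
    return {slot: (value if slot == keep else MASK) for slot, value in fact.items()}


def arbitrary_masking(fact, rng):
    """Arbitrary Masking: a random number of slots, at random positions, is masked."""
    slots = list(fact)
    hidden = set(rng.sample(slots, rng.randint(1, len(slots))))
    return {slot: (MASK if slot in hidden else value) for slot, value in fact.items()}


def single_blank(fact, target):
    """The link-prediction-style query: exactly one slot is masked."""
    return {slot: (MASK if slot == target else value) for slot, value in fact.items()}


rng = random.Random(0)
for query in (scratch(FACT), targeted(FACT, "subject"),
              arbitrary_masking(FACT, rng), single_blank(FACT, "object")):
    print(query)
```

All four functions return the same kind of object, which is the point: the settings differ only in which slots arrive empty.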
The shift from "completion" to "generation" is not an incremental improvement but a widening of the problem frame. Once you stop assuming there is exactly one blank to fill, the model has to behave consistently regardless of how many components are missing. That requirement drives the whole design of KREPE.
How KREPE Works
The name KREPE is an acronym for "Contextual HKG REPresentation learning via masked discretE diffusion" — contextual representation learning on hyper-relational knowledge graphs through masked discrete diffusion. Three core mechanisms are tucked inside that name: masked discrete diffusion, which hides components and learns to restore them; contextual message passing, which draws on two levels of context — dependencies inside a single fact and correlations across the whole graph; and bi-level noising, which puts the model through every kind of missingness during training.
4.1 Masked Discrete Diffusion — Hide, Then Restore
A diffusion model is the approach now widely used in image generation. It gradually adds noise to corrupt the original data, then learns to reverse that process and recover the original. But while the pixels of an image are continuous values, the entities and relations of a knowledge graph are discrete tokens represented by IDs. Nothing lies halfway between "Einstein" and "Nobel Prize in Physics."
So KREPE uses masked discrete diffusion, which is suited to discrete tokens, rather than continuous-space diffusion. Instead of adding noise, it masks tokens and learns to recover, step by step, the probability distribution over which entity or relation belongs in each hidden slot. Because the "hide, then restore" operation is the same whether there is one blank or many, link prediction and fact generation can be handled by the same mechanism.
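The "hide, then restore" loop can be sketched generically as follows. This is not KREPE's implementation: `toy_denoiser` is a hypothetical stand-in for a trained network (it returns fixed probabilities and ignores context), and revealing the most confident slot first is an illustrative choice rather than the paper's sampling schedule.

```python
import random

MASK = "[MASK]"


def forward_mask(tokens, t, rng):
    """Forward (noising) step: hide each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_denoiser(tokens):
    """Hypothetical model: a distribution over candidates for each masked position.

    A trained model would condition on the revealed tokens; this toy does not.
    """
    candidates = {
        0: {"Einstein": 0.7, "Curie": 0.3},
        1: {"received": 0.9, "educated_at": 0.1},
        2: {"Nobel Prize in Physics": 0.8, "Copley Medal": 0.2},
    }
    return {i: candidates[i] for i, tok in enumerate(tokens) if tok == MASK}


def reverse_unmask(tokens, rng):
    """Reverse (denoising) loop: fill one masked slot per step until none remain."""
    tokens = list(tokens)
    while MASK in tokens:
        dists = toy_denoiser(tokens)
        # Assumption: reveal the slot whose top candidate is most confident.
        pos = max(dists, key=lambda i: max(dists[i].values()))
        names, probs = zip(*dists[pos].items())
        tokens[pos] = rng.choices(names, weights=probs, k=1)[0]
    return tokens


rng = random.Random(0)
fact = ["Einstein", "received", "Nobel Prize in Physics"]
print(forward_mask(fact, t=0.5, rng=rng))                  # a partially hidden fact
print(reverse_unmask(["Einstein", "received", MASK], rng)) # one blank
print(reverse_unmask([MASK, MASK, MASK], rng))             # everything blank
```

The same `reverse_unmask` call handles one blank and three blanks; only the input changes.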
4.2 Contextual Message Passing — Seeing Two Layers of Context at Once
Restoring a masked slot well requires context, and KREPE draws on two layers of it at once. The first is the context inside a single fact. Within one fact, the subject, relation, object, and qualifiers constrain one another — if the award year is 1921, the candidate prizes narrow down. The second is the context of the whole graph. Similar facts already exist somewhere in the graph, and those patterns narrow the candidates for a blank.
The mechanism that reflects both of these contexts together is contextual message passing. By feeding the dependencies inside a fact and the correlations across the whole graph into the model at once, it can sift out plausible candidates even when clues are scarce.
4.3 Bi-level Noising — Rehearsing Every Kind of Incompleteness
During training, KREPE treats some of the knowledge graph's facts as an observed set and randomly masks the rest to serve as generation targets. On top of this, it introduces a bi-level noising strategy that perturbs the observed graph structure and the mask pattern inside the query at the same time — varying both how much of the graph to reveal and how many slots to mask in the query.
As a result, during training the model rehearses every level of incompleteness, from "slightly empty" to "almost entirely empty." That lets it handle, at inference time, both completing a partially observed fact and generating a new fact from a completely empty input within a single framework. This is why the Scratch, Targeted, and Arbitrary Masking settings from Section 3 are not separately trained models but merely different input conditions of one model.
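A rough sketch of what sampling one training example with two levels of noise could look like is given below. The function name `bi_level_noise`, the uniform sampling of both noise levels, and the toy facts are assumptions made for illustration; the article does not describe the paper's actual noise schedules.

```python
import random

MASK = "[MASK]"


def bi_level_noise(other_facts, query, rng):
    """Draw one training example with graph-level and query-level noise."""
    # Level 1 (graph): decide how much of the surrounding graph stays visible.
    reveal_ratio = rng.random()
    observed = [f for f in other_facts if rng.random() < reveal_ratio]

    # Level 2 (query): decide how many slots of the target fact to hide.
    slots = list(query)
    hidden = set(rng.sample(slots, rng.randint(1, len(slots))))
    masked_query = {s: (MASK if s in hidden else v) for s, v in query.items()}
    return observed, masked_query


graph = [
    ("Curie", "received", "Nobel Prize in Physics"),
    ("Curie", "received", "Nobel Prize in Chemistry"),
    ("Bohr", "received", "Nobel Prize in Physics"),
]
target = {"subject": "Einstein", "relation": "received",
          "object": "Nobel Prize in Physics", "year": "1921"}

rng = random.Random(0)
for _ in range(3):
    observed, masked_query = bi_level_noise(graph, target, rng)
    print(len(observed), "facts visible;", masked_query)
```

Repeated draws range from a nearly complete query on a nearly complete graph to a fully masked query with little surrounding context.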
The three mechanisms converge on a single goal: a model unshaken by the shape of the missingness. Discrete diffusion gives it a unified "hide, then restore" operation; contextual message passing gives it context to lean on even when clues are thin; and bi-level noising rehearses it on every kind of gap. Unifying link prediction and fact generation under one training objective is the heart of KREPE's design.
Experimental Results — First at Ranking, Past LLMs at Generation
The team evaluated KREPE on three standard hyper-relational knowledge graph benchmarks: WD50K, WikiPeople⁻, and WikiPeople. All figures below are values reported in the ICML 2026 paper (arXiv:2605.24064); rather than assert them as settled, we cite them as the paper reports them.
5.1 Learning to Generate Made Ranking Better, Too
The first result is counterintuitive. Although KREPE was trained with generation as its objective, it matched or exceeded state-of-the-art methods on the existing link-prediction task. In relation prediction especially, the paper reports that it surpassed prior methods across every dataset and setting to take first place. Relation-prediction MRR was reported at 0.963 on WD50K (a slight edge over the runner-up's 0.950) and around 0.984 on WikiPeople⁻.
This overturns conventional wisdom. Generative training objectives and discriminative (ranking) performance had long been thought to sit on opposite sides of a trade-off. KREPE's results show that if you learn to properly estimate the probability distribution over the masked components, "ranking a single blank" follows as a subset of that ability.
5.2 Surpassing Frontier LLMs from an Empty Input
The second result is the highlight of the study. In the hardest setting — Scratch, where a new fact is built from a completely empty input — KREPE outpaced several LLM baselines built on GPT-5.2 and Gemini 3 Pro by a wide margin. The table below shows the key figures the paper reports for the WikiPeople⁻ Scratch setting.
| Metric | KREPE | LLM baselines |
|---|---|---|
| Valid-and-novel fact ratio | 0.855 | GPT-5.2 up to 0.463 / Gemini 3 Pro up to 0.604 |
| Human evaluation score | 0.83 | Competitors 0.11 – 0.38 |
| Tries per valid fact | 2.85 | Competitors up to 27.58 |
What the numbers say comes down to two things: accuracy and efficiency. KREPE produced valid, novel facts more often (0.855 vs. 0.46–0.60), and it wasted far less effort getting to each valid fact (2.85 tries vs. up to 27.58 — roughly 10× more efficient). This does not mean LLMs are less capable. It means that mimicking facts from general text probabilities was disadvantaged, on this particular task, against a method that learns graph structure and the context inside a fact directly.
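The "roughly 10×" figure can be checked directly from the reported values. The short calculation below uses only the numbers in the table above; the variable names are ours.

```python
# Reported tries per valid fact (WikiPeople⁻, Scratch setting).
krepe_tries = 2.85
worst_competitor_tries = 27.58  # "up to" value, i.e. the least efficient competitor

efficiency_ratio = worst_competitor_tries / krepe_tries
print(round(efficiency_ratio, 2))  # 9.68

# Reported valid-and-novel fact ratios.
krepe_ratio = 0.855
best_llm_ratio = 0.604  # highest reported LLM baseline (Gemini 3 Pro)

absolute_gap = krepe_ratio - best_llm_ratio
print(round(absolute_gap, 3))  # 0.251
```

Because 27.58 is the upper end of the competitors' range, 9.68× is the largest efficiency gap in the comparison, not the gap against every baseline.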
Across all three patterns of missingness (Scratch, Targeted, and Arbitrary Masking), the paper reports that KREPE was consistently superior. In the qualifier-sparse Scratch setting of WikiPeople⁻ in particular, it had the highest valid-and-novel fact ratio and the lowest number of tries needed to produce a single valid fact.
Both results condense into one sentence: a model that has truly learned to generate also ranks well, and a model that learned structure directly produces structured knowledge more accurately and efficiently than one that merely memorized text. That said, all figures are reported values from a single paper, and reproduction with open code and large-scale industrial knowledge graphs remains to be confirmed.
What It Means, and Where Pebblous Stands — Structure-Learning AI vs. Text-Memorizing AI
KREPE's contribution lies in widening hyper-relational knowledge graph completion from a candidate-ranking problem into a generative reasoning problem. That expansion leads straight to practical use cases: automatic knowledge base expansion, knowledge-graph-based question answering, information retrieval, recommendation, scientific knowledge discovery, and enterprise knowledge management are all candidates. They share one thing: the need to fill a slot that holds almost no information with a grounded new fact.
6.1 Within the Diffusion × Knowledge Graph Research Wave
KREPE is not a one-off study that sprang up out of nowhere. Around 2024, a current of applying diffusion models to knowledge graphs took shape — KGDM (AAAI 2024) and DiffKG (WSDM 2024 Oral) among them. Most of these, however, used continuous-space diffusion or focused on assisting recommendation and completion. What sets KREPE apart is that it introduces masked discrete diffusion tailored to discrete tokens and redefines the task itself, moving from completion to generation. It earns its place as an inflection point within the wave.
6.2 Demand for Knowledge Graphs Grows Alongside RAG
Why does this research matter to companies now? The knowledge graph market is projected to grow from roughly $1.9B (estimated 2026) to $8.9B (estimated 2032), at an annual rate somewhere around 20–29%. Figures vary widely by firm and by what they count, so a single number is hard to pin down, but the direction is consistent. Gartner projected early on that by 2025, 80% of data and analytics innovations would make use of graph technology. Layer onto this the rise of retrieval-augmented generation (RAG) and GraphRAG, and demand for "accurate structured knowledge" is climbing fast. That is the backdrop against which the practical value of KREPE-style approaches — filling in new facts without hallucination — rises.
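As a sanity check on the growth range, the compound annual growth rate implied by the two endpoint estimates can be computed directly. This assumes six compounding years between 2026 and 2032 and takes the $1.9B and $8.9B figures at face value.

```python
start_value = 1.9   # USD billions, estimated 2026
end_value = 8.9     # USD billions, estimated 2032
years = 2032 - 2026

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"{cagr:.1%}")  # 29.4%
```

The endpoint estimates therefore correspond to the upper end of the quoted range.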
6.3 Reading It Again Through a Data-Quality Lens
If we sum up this article from the perspective of data quality, there are two lessons. First, incompleteness is a constant. Even the best-managed knowledge base always has holes, and not one but many. So a tool that leans on "the assumption of a single blank" is powerless exactly where data is scarcest. Second, a model's capability comes from the structure of its data. On the same fact-generation task, a model that learned structure and context directly was more accurate and efficient than an LLM that memorized text probabilities — and the results bear this out.
Both lessons converge on a single proposition: shaping data into a form a model can use — structured, consistent, and as complete as possible — is itself performance. Whether to fill gaps by candidate ranking or to extend into generation is now a real choice facing any team that runs a knowledge graph. Just as RAG reduces hallucination, structured fact generation can become another axis of "grounded new-fact filling."
✏️ Editor's Note — A View from Pebblous
The proposition that "structured data beats general-purpose models" touches the very questions Pebblous has worked on through AI-Ready Data and DataClinic. KREPE starts from the diagnosis that "knowledge bases always have holes" and finds that a model which learned structure directly surpassed frontier LLMs. We read both as an outside observation that the direction we have emphasized and the currents in academia and industry are heading the same way. We take it not as a boast but as a signal of alignment.
References
Academic Papers
- 1. Lee, J., Kim, S., & Whang, J. J. (2026). "Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion." ICML 2026. arXiv:2605.24064
- 2. Galkin, M. et al. (2020). "Message Passing for Hyper-Relational Knowledge Graphs (StarE / WD50K)." EMNLP 2020. arXiv:2009.10847
- 3. "DHGE: Dual-View Hyper-Relational Knowledge Graph Embedding" (source of dataset statistics). arXiv:2207.08562
- 4. "GPT-based Wikidata completion study" (quantifying knowledge base incompleteness). arXiv:2310.14771
- 5. HKUDS. "DiffKG: Knowledge Graph Diffusion Model for Recommendation." WSDM 2024 (Oral).
- 6. "KGDM: Knowledge Graph Diffusion Model for Embedding." AAAI 2024 (diffusion × knowledge graph embedding).
- 7. "Knowledge Graphs: Opportunities and Challenges" (review). PMC10068207
Market & Statistics
- 8. Wikidata:Statistics (~122.29M items, as of Aug 2025).
- 9. Gartner. Graph technology adoption outlook (80% of data and analytics innovations by 2025), Knowledge Graph Hype Cycle "Slope of Enlightenment".
- 10. Research and Markets. Knowledge Graph Market (~$1.9B (2026) → $8.9B (2032)).
- 11. Market Research Future. Knowledge Graph Market (CAGR 18.62%, 2025–2035).
- 12. MarketsandMarkets. Retrieval-Augmented Generation (RAG) Market.
Ecosystem
- 13. migalkin/StarE: message passing implementation for hyper-relational knowledge graphs.