Speeding up the annotation process in semantic segmentation industrial applications

Fernandez-Moreno, Marta; Guerrero, Margarita; Rementeria, Rosalia; Mesejo, Pablo; Moreno, Raul

Executive Summary

"The label costs more than the AI model" is not a metaphor. A paper released by a Spanish research team in June 2026 proves it in numbers. It took three experts 170 hours to paint pixel-level masks onto 82 steel microstructure images. This article looks at how those 170 hours were cut to 37 — and at what the 22% that did not shrink is telling us.

The key is the division of labor. An unsupervised CNN that learns with no labels at all lays down a draft mask, and the expert fixes only the parts that are wrong. The draft quality itself was low. The pre-labels averaged an IoU of just 32–49%. Even so, the time fell by 78%, because people are overwhelmingly faster at "fixing" than at "drawing from scratch."

So why is the remaining 22% still human work? That question is the sharpest point of this article. The answer is not "because the machine isn't precise enough" but "because judgment is required there."

Stainless steel (A961) microstructure under optical microscopy after chemical etching — grain boundaries and phase-distinguishing brightness differences — ▲ Stainless steel (A961) microstructure under optical microscopy after chemical etching. Dark lines are grain boundaries; brightness differences between regions identify phases. Pixel-labeling these irregular boundaries is what annotation means. | Source: Wikimedia Commons (CC BY 4.0)

Key Figures

Four numbers compress this study's result and its paradox. The first two are the scale of the savings; the last two are the condition under which those savings happened and their limit.

Source: Fernandez-Moreno et al. (arXiv:2606.19934)

170h → 37h

Annotation time

3 experts · 82 images

78%

Time saved

73–83% by type

IoU 32–49%

Pre-label quality

Low, yet savings peak

22%

Time left to humans

Judgment, not precision

1

How 170 Hours Became 37

Even the same steel can have a different internal microstructure. The ratio and distribution of phases such as Alpha, TiB2, and TiN determine its strength and durability. To classify these structures automatically with AI, you first need a ground-truth mask in which a human has painted, for each pixel of each image, which phase it belongs to. That task is annotation.

The problem is cost. It took three experts 170 hours to paint 82 images pixel by pixel from scratch. At an expert rate of only $100 an hour, labeling those 82 images cost about $17,000 — an amount that can be several times the cost of training a model once, tied up in data preparation alone.

To cut this time, the researchers compared five unsupervised algorithms: Multi-Otsu, which splits the grayscale histogram; graph-based superpixels; K-means clustering; Meta's general-purpose Segment Anything Model (SAM); and a deep-learning-based unsupervised CNN that jointly optimizes feature similarity and spatial continuity. The final choice was the unsupervised CNN, because it delivered consistent performance across every steel type.

With drafts laid down by the chosen method, total annotation time fell from 170 hours to 37 — a 78% cut. Broken out by steel type, the savings are even sharper.

▲ Annotation time fell 73–83% across all three steel types. Overall, from 170 to 37 hours — a 78% cut (paper, Table 7)

The dataset released alongside the paper comprises 82 images, 5 microstructure phases (Alpha · TiB2 · TiN · FeTiB · Fe2B), and 3 industrial process routes. The researchers published it under an MIT license, calling it "the largest publicly available steel microstructure segmentation dataset to date." In other words, they opened up not only the method of the savings but the result, so both can be verified.

2

The Machine Drafts, the Expert Refines

The structure of this pipeline is simple. The unsupervised CNN generates a draft mask from the patterns in the image alone, with no labels. Loaded into the labeling tool, that mask means the expert starts not from a blank screen but on top of an already-painted picture — moving wrong boundaries, filling missed regions, and merging parts that were split too finely.

▲ The unsupervised CNN lays down a draft and the expert fixes it. Not starting from a blank screen is the key to the time savings

The paradox lives here. The draft masks were not high quality. The pre-labels averaged an IoU (the overlap between prediction and ground truth) of 32–49% — accuracy below half. Yet that is exactly where the time fell the most. It looks counterintuitive, but the reason is clear: for an expert, redrawing a shape on a blank canvas is the slowest task, while evaluating and touching up a shape that is already drawn is far faster.

The savings did not show up only in the average. The variance in the time to touch up a single image shrank as well. When drawing from scratch, per-image time was erratic, but fixing on top of a draft narrowed that spread noticeably. The point is not only that the average got faster but that you could estimate in advance how long a single image would take. Because labeling schedules and budgets can be set ahead of time, this predictability matters in practice as much as the average savings do.

Even an imperfect draft sends a reviewer's productivity soaring as long as the direction is right. This is not unique to this paper. It is the principle shared by RLHF (reinforcement learning from human feedback), active learning, and every kind of AI-assisted labeling. It is the ease of correction, not the accuracy of the draft, that determines productivity.

The segmentation model trained on the finished masks (a DeepLabV3+ with an EfficientNet-b0 backbone) scored an IoU of 48–67% for per-type dedicated models and 40–55% for a single general model. The type-specialized models were consistently better. Performance like this follows once the data is in place. In the end, the bottleneck was not on the model side but in the labels upstream of it.

3

Why the 22% Stays Human

Cutting 78% also means 22% never shrank. Looking into why the machine could not absorb that 22% reveals the real boundary line of labeling automation. The machine got stuck in three places.

• Precise boundaries of minority classes. TiN is just 0.17% of all pixels and TiB2 only 3.85%. Small in area, yet decisive for the material's properties. The machine often misses such rarely appearing classes or misclassifies them as another phase.
• Over-segmentation errors. Inside an irregularly structured majority class like Alpha, the algorithm tends to chop a region that should be one into several. The expert has to merge these fragments back into a single piece.
• Consensus on "doubtful" regions. Where even the human eye cannot say for certain which phase a boundary belongs to, three experts confer to reach a final verdict — a process that cannot be replaced without domain knowledge.

▲ The 78% the machine takes and the 22% left to humans differ not in amount but in kind

What the three have in common is clear. The 22% the machine could not cut remains not "because accuracy is low" but "because judgment is required." The eye to recognize a scientifically meaningful minority class, the debate to reach consensus on an ambiguous boundary, the verification to tell what is an error — these are not the kind of blanks a better algorithm can fill. The further automation advances, the more the work left to humans does not shrink so much as concentrate into something closer to judgment.

4

The Bottleneck Is the Label, Not the Model

Steel microstructure is a narrow case, but the structure this study quantified carries over to AI-Ready Data as a whole. Model architectures are shared as open source and compute costs are falling fast, while the cost of labeling — domain experts marking ground truth by hand — barely comes down. That is why the data-labeling market is growing by double digits every year. One industry estimate sees this market reaching around $17 billion by 2030, expanding at a rate above 20% a year. Models keep getting cheaper while the human hands that turn data into ground truth keep getting more expensive.

▲ Pebblous original diagram — market-data reinterpretation. AI compute costs (GPU cloud pricing) are falling fast, but expert annotation costs keep rising. This is why the data labeling market is on track toward $17 billion by 2030 at a 20%-plus annual rate (Reference [4])

The fix this study points to is not to remove labeling altogether but to divide it by the nature of the work. The repetitive, high-volume draft goes to the machine; the finish that needs judgment goes to the human. It is fine if the machine's draft is not perfectly accurate. What matters is whether a human can review it quickly. The better the verification loop is designed, the more data the same expert can turn into ground truth in the same amount of time.

The core of what Pebblous calls data-quality autonomy is no different. Hand the machine what the machine can do and automate it; leave the human to focus where judgment is needed. The difference between 170 hours and 37 across 82 steel images shows that this division of labor is not an abstract slogan but measurable productivity. The value of automation lies not in clearing people away but in moving their judgment to where it is worth the most.

R

References

R.1Academic Papers

1.Fernandez-Moreno, M., Guerrero, M., Rementeria, R., Mesejo, P., & Moreno, R. (2026). "Speeding up the annotation process in semantic segmentation industrial applications." arXiv:2606.19934.
2.Stuckner, J., et al. (2023). "A unified microstructure segmentation approach via human-in-the-loop machine learning." Acta Materialia.

R.2Dataset

3.Fernandez-Moreno, M., et al. (2026). "Steel Microstructure Segmentation Dataset." Zenodo. DOI: 10.5281/zenodo.18826160. MIT License.

R.3Industry & Market

4.HeroHunt AI. (2026). "The Ultimate AI Data Labeling Industry Overview (2026)."