Should We Score Spatial AI? Five Criteria from PebbloSim's Perspective

A natural-language prompt can now generate an entire city. But there's no agreed way to ask whether the result is any good. This is a hypothesis placed in the empty seat.

[Figure: UrbanWorld system architecture.] The standard Spatial AI pipeline: natural language in, 3D urban environment out. UrbanWorld's architecture (Layout → MLLM → Diffusion → Refinement) is representative of the broader category; STF Labs' UrbanGPT 2.0 follows the same shape. Source: UrbanWorld (arXiv:2407.11965)

The Year Spatial AI Became a Category

In September 2024, Fei-Fei Li founded World Labs with a $230M seed round. By November 2025, Marble — World Labs' first commercial product — was shipping a navigable 3D world from text or image input, pushing total funding past $1B. In the same window, STF Labs (Studio Tim Fu) released the UrbanGPT 2.0 Beta, taking prompts like "70% residential, 30% commercial, FAR 450%" and returning 3D urban layouts with automatic GFA optimization. HKUST AI4City presented Sat2City at ICCV 2025, reconstructing 3D cities from a single satellite image. Cesium added AI to its 3D geospatial platform. Esri rolled out an ArcGIS AI Assistant.

Different tools, different markets, but the same underlying pattern: natural language in, 3D spatial structure out. The market signal matches the cadence — Geospatial AI is projected to grow from $38B in 2024 to $64.6B by 2030 (Arizton), and AI urban planning from $2.26B in 2025 to $13.60B by 2035 (Metatech Insights). Spatial AI is no longer a research curiosity.

(Note on naming: the "UrbanGPT" discussed here is STF Labs' generative 3D tool, not the HKUDS UrbanGPT — KDD'24, a spatio-temporal LLM for traffic and population prediction. Same name, different project.)

The Empty Seat: Evaluation

Across the ten major players we surveyed — STF Labs, World Labs, Cesium AI, Esri GeoAI, Mapbox MapGPT, NVIDIA Omniverse, Bentley iTwin, Autodesk Forma, TestFit, Hypar — only Esri publishes a usable evaluation methodology grounded in ISO/IEC 23894's "Trusted AI" framework. The other nine excel at generation and remain silent on verification.

The academic side acknowledges the same gap. T3Bench (arXiv:2310.02977, 2024) opens by stating that Text-to-3D evaluation has "largely relied on subjective user experiments." Even when automated metrics reach Spearman correlation 0.78 with human judges, they cannot answer the questions that matter for real urban design: is this city legal? Is it close to the ground truth? Is the variation across seeds meaningful or cosmetic?

Visual plausibility is not the same as usability. UrbanWorld's authors flag the limitation themselves — "homogeneous styles, limited diversity." A city that looks beautiful but collapses into the same shape under different seeds is not a decision-support tool.

Five Criteria We Propose

Pebblous proposes five evaluation criteria from PebbloSim's perspective. PebbloSim is our simulation-based synthetic data generator for cities, traffic, and environments. The five are not a standard we operate today — they are a hypothesis placed in the empty seat, deliberately positioned at the intersection of academic prior and industry blind spot.

1. Geographic Coherence

Do the coordinates, scale, and road topology of the generated city align with the actual GIS reference? Measurable through IoU against GIS polygons, road network topology matching, and satellite alignment. Academic prior: nearly absent. This is a new contribution.
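As a rough illustration of the IoU measurement, the sketch below rasterizes two footprints onto a unit grid and compares them as sets of cells. The helper names (`rasterize_rect`, `iou`) and the rectangular footprints are assumptions made for this example; a real pipeline would rasterize actual GIS polygons at a chosen resolution.

```python
def rasterize_rect(x0, y0, x1, y1):
    """Return the set of unit grid cells covered by an axis-aligned rectangle."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def iou(generated, reference):
    """Intersection over Union of two sets of grid cells."""
    union = generated | reference
    if not union:
        return 0.0  # convention chosen here: two empty footprints score 0
    return len(generated & reference) / len(union)


# Hypothetical data: a reference footprint and a generated one shifted 2 cells east.
reference = rasterize_rect(0, 0, 10, 10)
generated = rasterize_rect(2, 0, 12, 10)
score = iou(generated, reference)
print(f"IoU = {score:.3f}")
```

Here the two footprints share 80 cells out of a union of 120, so the score is about 0.67. Road topology matching and satellite alignment would need their own measures; this covers only the polygon overlap part.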

2. Scale Consistency

Are building heights, road widths, and vegetation ratios statistically plausible for the target city? Measurable through Depth Error and Camera Error (from CityDreamer, CVPR 2024) extended with urban sub-metrics. Academic prior: rich. Half the work is done.

[Figure: CityDreamer urban generation results.] CityDreamer's diagnostic outputs (segmentation masks, depth estimates, multi-view renderings) are the raw materials of scale consistency measurement. Source: CityDreamer (CVPR 2024, arXiv:2309.00610)
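One way an urban sub-metric for scale consistency could look is a distribution comparison between generated and reference building heights. The sketch below uses the two-sample Kolmogorov-Smirnov statistic for this. It is an assumption of this example, not CityDreamer's Depth Error or Camera Error, and the height values are made up.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both cursors past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Hypothetical building heights in metres.
reference_heights = [9.0, 12.0, 15.0, 21.0, 30.0, 45.0]
generated_heights = [10.0, 11.0, 14.0, 22.0, 31.0, 120.0]
gap = ks_statistic(generated_heights, reference_heights)
print(f"KS statistic = {gap:.3f}")
```

A value of 0 means the two height distributions are indistinguishable at the sample level, and 1 means they do not overlap at all. The same comparison could be repeated for road widths or vegetation ratios.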

3. GFA Validation

Do the gross floor area (GFA), the resulting floor area ratio (FAR), the footprint ratio, and zoning designations comply with local regulations? Building footprint extraction is mature in GeoAI, but regulatory validation is absent. Autodesk Forma, TestFit, and Hypar excel at automated generation but do not publish ML evaluation metrics. STF Labs UrbanGPT 2.0 mentions "automatic GFA calculation" without revealing the underlying database or algorithm. This gap is where immediate B2B differentiation lies.
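To make the regulatory check concrete, here is a minimal sketch of a per-parcel validation. It assumes GFA is the plain sum of floor areas and that a zoning rule is just two ceilings, one on FAR and one on footprint coverage. The `Parcel` and `ZoningRule` types, the 0.6 coverage limit, and the parcel values are hypothetical; real codes add exclusions, setbacks, and height limits that vary by jurisdiction.

```python
from dataclasses import dataclass


@dataclass
class Parcel:
    lot_area_m2: float
    footprint_m2: float
    floor_areas_m2: list  # one entry per storey


@dataclass
class ZoningRule:
    max_far: float       # 4.5 corresponds to "FAR 450%"
    max_coverage: float  # footprint area / lot area


def check_parcel(parcel, rule):
    """Compute GFA, FAR, and coverage, and flag each against the rule."""
    gfa = sum(parcel.floor_areas_m2)
    far = gfa / parcel.lot_area_m2
    coverage = parcel.footprint_m2 / parcel.lot_area_m2
    return {
        "gfa_m2": gfa,
        "far": far,
        "coverage": coverage,
        "far_ok": far <= rule.max_far,
        "coverage_ok": coverage <= rule.max_coverage,
    }


rule = ZoningRule(max_far=4.5, max_coverage=0.6)
parcel = Parcel(lot_area_m2=1000.0, footprint_m2=500.0, floor_areas_m2=[500.0] * 10)
report = check_parcel(parcel, rule)
print(report)
```

In this example the ten-storey building reaches a FAR of 5.0 on a 1,000 m² lot, so it fails the 450% ceiling while passing the coverage limit. A generated city could be scored by the share of parcels that pass every check.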

4. Scenario Coverage

How much meaningful variation does a single prompt produce across different seeds? Measurable through the Homogeneity Index (UrbanWorld, 2024), Precision/Recall/Density/Coverage (Kynkäänniemi 2019, Naeem 2020), and coreset selection. Academic prior: most extensive. Ready for immediate adoption.
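Of the metrics listed, Coverage is the easiest to sketch: it is the fraction of real samples whose k-nearest-neighbour ball contains at least one generated sample. The version below is a simplified pure-Python reading of that definition, not a reference implementation. The 2D points stand in for whatever feature embedding a real evaluation would use, and the closed-ball boundary (`<=`) is a choice made here.

```python
import math


def knn_radius(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour in the same set."""
    dists = sorted(
        math.dist(points[index], p) for j, p in enumerate(points) if j != index
    )
    return dists[k - 1]


def coverage(real, generated, k=1):
    """Fraction of real samples whose k-NN ball holds at least one generated sample."""
    if len(real) <= k:
        raise ValueError("need more than k real samples")
    covered = 0
    for i, r in enumerate(real):
        radius = knn_radius(real, i, k)
        if any(math.dist(r, g) <= radius for g in generated):
            covered += 1
    return covered / len(real)


# Hypothetical embeddings: two clusters of real layouts.
real = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
collapsed = [(0.5, 0.0), (0.4, 0.0)]  # every seed lands in one cluster
diverse = [(0.5, 0.0), (10.5, 0.0)]   # seeds reach both clusters
print(coverage(real, collapsed), coverage(real, diverse))
```

The collapsed generator covers only half of the real samples while the diverse one covers all of them, which is the distinction between cosmetic and meaningful variation that this criterion targets.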

5. Sim-to-Real Gap

How far is the generated data from real-world LiDAR, aerial imagery, and GIS? UCF's HiFi DT framework (arXiv:2509.02904, 2025) consolidates a four-metric package: Chamfer Distance, MMD, EMD, Fréchet Distance. Their evidence is striking: a model trained on well-built synthetic data achieved 44.74% AP, beating the real-data model at 42.70%. Sat2City uses the same metric family on urban data (Chamfer 100% COV, EMD 60% COV). A standard package already exists. Adopt it.

[Figure: Sat2City reconstruction results.] Sat2City reconstructs Geometry and Appearance of a 3D city from a single satellite image. The same metrics that quantify Sim-to-Real Gap in autonomous driving (Chamfer, EMD, MMD) now apply to cities. Source: Sat2City (ICCV 2025, arXiv:2507.04403)
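For readers unfamiliar with the metric family, Chamfer Distance is the simplest of the four to write down. The sketch below uses one common variant: the mean unsquared nearest-neighbour distance, summed over both directions. Other papers square the distances or average the two directions, so absolute values are only comparable within one convention. The point clouds are made up, and the brute-force search is for illustration only.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")

    def one_way(src, dst):
        # For each source point, distance to its closest point in dst, averaged.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


# Hypothetical clouds: a "real" scan and a synthetic copy lifted 0.5 units in z.
real_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
synthetic_cloud = [(x, y, z + 0.5) for x, y, z in real_cloud]
print(chamfer_distance(real_cloud, synthetic_cloud))
```

Each point's nearest counterpart is 0.5 away in both directions, so the distance is 1.0, and it drops to 0 when the clouds coincide. At city scale the nearest-neighbour search would need a spatial index rather than this quadratic loop.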

Why the Five, and Why Now

Two criteria (Geographic Coherence, GFA Validation) are net-new academic territory. Two (Scenario Coverage, Sim-to-Real Gap) have mature academic prior ready for adoption. One (Scale Consistency) sits in the middle. This distribution matters: defining the new ones is academic contribution; operationalizing the existing ones is industry contribution.

The standards conversation is also moving. ISO/IEC 5259-4, the data quality process framework for analytics and machine learning, was published in July 2024 and adopted as European standard EN ISO/IEC 5259-4:2025 in February 2025. ISO/IEC 23894 (2023) covers AI risk management. ISO/IEC 5259-5 (2025) addresses data quality governance. None of these has yet been applied to 3D spatial outputs in any standardized way. The seat at the standards table is open.

The Bigger Picture

Spatial AI is not the first market where evaluation lagged generation. LLMs went through the same arc — generation outran benchmarks for two years before MMLU, HELM, and HumanEval normalized the conversation. The teams that authored those benchmarks earned both citation gravity and product influence.

The same window is open now for Spatial AI. Whoever first articulates "what good looks like" gets to shape what the market measures, which shapes what the tools optimize for, which shapes what cities get built. The evaluation framework is the market entry barrier.

Pebblous places this report as a first step. The five criteria are an opening proposal, not a closed answer. We're exploring the practical deployment of this framework with potential partners including STF Labs, in close conversation with the academic and standards bodies that hold the missing pieces.

Read the full technical analysis on Pebblous Blog →