Semantic similarity scores are not ground truth

Vector embedding scores look like measurement. They're a higher-resolution approximation. Treating them as ground truth is the next SEO failure mode.

Duane Forrester published a piece this morning that does something the SEO industry has been avoiding for about three years. He named the trap.

The trap is this: vector embeddings and semantic similarity tools have given us a measurable number where we used to have an editorial judgement. That number feels like progress. In one specific sense it is progress — it's a higher-resolution approximation than keyword overlap ever was. But a higher-resolution approximation is still an approximation, and the entire content optimisation industry is currently treating a cosine similarity score as if it has settled a question that keyword research only ever pretended to answer.

It hasn't. It's given us a more precise way to be wrong, and the precision is exactly what makes it dangerous.

I want to develop this further than Forrester does, because the implication for how UK businesses are being sold "AI content optimisation" right now is genuinely serious. The semantic alignment score is becoming the new domain authority — a number that vendors put on a dashboard, that practitioners optimise toward, and that has only a partial relationship with the thing it claims to represent.

What the embedding score actually measures

A cosine similarity score between two pieces of text measures the angle between their two vectors inside one specific embedding model's representation of language: 1 means the vectors point the same way, 0 means they are orthogonal. That's the entire claim it can honestly make.

It is not a measurement of whether your content answers a query. It is not a measurement of whether Google's retrieval system, or ChatGPT's RAG pipeline, or Perplexity's index will surface your content for that query. It is not a measurement of relevance in any platform-independent sense. It is a measurement of geometric closeness inside one model's learned representation of the language.
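To make that concrete, here is a minimal sketch of the calculation itself. The three-dimensional vectors are made up for illustration; a real embedding model produces far longer vectors, and the names `query_vec`, `page_vec` and `other_vec` are my own, not any tool's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: invented numbers, not model output).
query_vec = [1.0, 0.0, 1.0]
page_vec = [2.0, 0.0, 2.0]    # same direction as the query, twice the length
other_vec = [0.0, 1.0, 0.0]   # orthogonal to the query

same_direction = cosine_similarity(query_vec, page_vec)
orthogonal = cosine_similarity(query_vec, other_vec)
```

Nothing in that function knows about queries, users, or retrieval. It compares the directions of two lists of numbers, and everything else is interpretation layered on top.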

This sounds pedantic. It isn't. The Netflix research team — Steck, Ekanadham and Kallus — showed in 2024 that cosine similarity applied to learned embeddings can produce results that are, in their framing, arbitrary. The way an embedding model is trained, the regularisation applied, the data it saw — all of these shape the geometry of the vector space in ways that make a raw similarity score unreliable as an absolute measure of semantic relationship. A score of 0.92 in one embedding space might correspond to strong retrieval in one production system, weak retrieval in another, and irrelevance in a third.
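A toy illustration of the mechanism, as I understand the paper's argument: in a factorisation-style model, you can rescale the embedding dimensions without changing a single prediction, yet the cosine similarities between embeddings move. This is a sketch with invented two-dimensional vectors, not a reproduction of their experiments, and text embedding models are trained differently from the linear models they analyse.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Assumption: made-up two-dimensional factors for one user and two items.
user = [1.0, 2.0]
item_a = [1.0, 1.0]
item_b = [1.0, 3.0]

# Stretch dimension 0 and shrink dimension 1 on the item side,
# and apply the inverse scaling on the user side.
scale = [10.0, 0.1]
user_rescaled = [u / s for u, s in zip(user, scale)]
item_a_rescaled = [x * s for x, s in zip(item_a, scale)]
item_b_rescaled = [x * s for x, s in zip(item_b, scale)]

# The model's output (the user-item dot product) is the same before and after.
pred_before = dot(user, item_a)
pred_after = dot(user_rescaled, item_a_rescaled)

# The item-item cosine similarity is not: it rises from about 0.89 to nearly 1.
cos_before = cosine(item_a, item_b)
cos_after = cosine(item_a_rescaled, item_b_rescaled)
```

Both versions of the model make identical predictions, so nothing in training forces one geometry over the other. The similarity score depends on a choice the model's outputs never constrained.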

This is not a fringe finding. It's been replicated and extended. And it has barely touched the marketing tooling conversation.

The resolution problem

The reason this matters more now than it did with keyword research is counterintuitive, and it's the part of Forrester's argument that deserves more attention than he gave it.

Higher resolution removes the humility that low resolution used to enforce.

Keyword research was a low-resolution instrument. Everyone knew it. The fact that "best running shoes" and "top trainers for jogging" had different keyword volumes despite meaning roughly the same thing was a constant reminder that the instrument was imperfect. The low resolution enforced humility. You knew you were approximating, and that knowledge changed how you used the data.

Vector embeddings produce a smooth, continuous number. They look authoritative. They feel scientific. The output of a semantic similarity tool is a decimal to two or three places, presented with the visual confidence of a precise measurement. The interface signals certainty even when the underlying measurement is no more grounded than what keyword research was telling you twenty years ago. It is the same kind of approximation, just at higher resolution.

This is how the industry talks itself into mistaking precision for accuracy. And it's how vendors talk businesses into paying for tools that produce confident-looking numbers about a relationship that nobody — not the vendor, not Google, not OpenAI — has actually solved.

Why this gets worse in AI search, not better

When Google was the only serious distribution surface, optimising toward an imperfect approximation of relevance produced predictable failures. You'd rank for the wrong query, or you'd rank for the right query with the wrong intent match, and the bounce-back signal would eventually correct the system. The feedback loop was tight. The error showed up quickly.

In AI search, this loop is broken in two places.

First, the retrieval is hidden. You don't know which of your pages got pulled into ChatGPT's context window for a given query. You don't know what proportion of your content was used in an AI Overview response. Marie Haynes wrote yesterday about the new AI Overview reporting in Search Console, and the honest assessment is that it gives us a partial view — impressions only, no queries, no behavioural data. Useful, incomplete, and not nearly enough to validate whether your semantic optimisation actually produced retrieval.

Second, the generation is hidden. Even if you knew your page was retrieved, you don't know how it was summarised, cited, or paraphrased. The user might get an answer that misrepresents your content. The user might get an answer that uses your content correctly but doesn't cite you. The error signal that used to come back through bounce rate and dwell time barely exists in this environment.

So you have a measurement instrument with known precision-versus-accuracy problems, feeding into a distribution system where the feedback loop is degraded, and the industry response has been to double down on the instrument. Optimise harder. Score higher. Trust the number.

What this looks like in practice

A new arXiv paper out today, from a team running a longitudinal field study on a single high-traffic domain, makes the practical version of this point in a different way. They looked at ChatGPT referral traffic to glasp.co after a bundle of AEO interventions in January 2026. The raw growth multiple was 5.7x. That's the number a vendor would put on a case study slide.

The actual treatment effect, after controlling for platform-level growth in ChatGPT itself, was around 1.82x. And even that estimate failed a conservative placebo-in-time permutation test. The honest summary is "suggestive, not conclusive."
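A back-of-envelope reading of those two figures, under an assumption the paper may not share: if the adjustment were a simple ratio (raw growth equals platform growth times treatment effect), the platform tailwind alone would account for roughly 3.1x. The decomposition below is my own simplification, not the authors' method.

```python
raw_multiple = 5.7       # raw ChatGPT referral growth quoted above
adjusted_effect = 1.82   # treatment effect after the platform control, quoted above

# Assumption: a purely multiplicative split, raw = platform * treatment.
# The study's actual estimator may differ.
implied_platform_growth = raw_multiple / adjusted_effect
```

On that reading, most of the headline multiple would have arrived whether or not anyone had touched the site.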

This is the gap between measurement and reality at the platform level, not the embedding level. But it's the same shape of problem. The headline number looks like a fact. The underlying causal claim is much weaker than the headline implies. Strip away the platform tailwind and the intervention effect is modest and not statistically secure. Strip away the embedding model's specific geometry and the alignment score is real but partial.

The pattern repeats: a confident-looking number that, on closer inspection, is measuring something narrower than the practitioner thinks it's measuring.

Where this leaves serious content work

I want to be careful here, because the easy read of Forrester's piece — and the easy read of mine — is "semantic similarity tools are useless, ignore them." That's not the argument.

Vector-based content analysis is a genuine improvement over keyword-only approaches. It catches relationships that keyword overlap misses. It surfaces topical gaps that lexical analysis would miss entirely. Used carefully, it's a useful input.

The argument is about what kind of input it is. It's a directional signal, not a target metric. It tells you something about the geometric relationship between your content and a query inside one specific model's representation. That's worth knowing. It's not worth optimising toward as if the number itself were the goal.

The failure mode I've watched develop over the last eighteen months is consultants and agencies treating the score as the target. They run the analysis, get a number, write more content until the number goes up, and report the improvement to the client. The number going up is real. Whether anything has improved about how the content actually performs in production retrieval systems is unmeasured and largely unmeasurable from outside.

Optimising toward a measurable proxy without validating the proxy against the actual outcome is how every previous SEO failure mode has started.

This is the same pattern as keyword stuffing in 2008, exact-match anchor text link building in 2012, and thin schema markup spam in 2018. Each one started with a measurable signal that correlated with the desired outcome. Each one ended with practitioners optimising the signal until it stopped correlating with anything. The signal becomes the goal, the goal becomes the signal, and the relationship to actual user value disappears somewhere in the middle.

The honest limits

I want to flag where this argument doesn't go, because the strongest version of the position needs to acknowledge what it doesn't cover.

First: the alternative isn't going back to keyword research. Keyword research was a worse approximation, not a better one. The right response to "this instrument has precision-versus-accuracy problems" is not "use a less precise instrument." It's "use this instrument with appropriate caution and triangulate against other signals."

Second: some semantic similarity tools are better than others, and the gap matters. A tool that uses a single embedding model and presents one score is more vulnerable to the Steck-Ekanadham-Kallus critique than a tool that ensembles multiple models or grounds the analysis against observed retrieval behaviour. Not all alignment scores are equally suspect.

Third: this critique applies more strongly to absolute scores than to relative comparisons. "Page A scores higher than Page B against this query" is a more defensible claim than "Page A scores 0.87, which means it's aligned." The geometric comparison within a fixed model is more reliable than the absolute number interpreted as ground truth.
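A small sketch of why the relative reading holds up better. The scores are invented, and the two "models" are hypothetical; I have constructed the example so the rankings agree, which real models will not always do.

```python
def rank_pages(scores):
    """Return page names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def passes_threshold(scores, threshold):
    """Absolute reading: is each page 'aligned' at a fixed cut-off?"""
    return {page: score >= threshold for page, score in scores.items()}


# Assumption: made-up scores for the same three pages against one query,
# from two hypothetical embedding models with different score ranges.
model_x = {"page_a": 0.87, "page_b": 0.81, "page_c": 0.74}
model_y = {"page_a": 0.62, "page_b": 0.55, "page_c": 0.41}

# Relative reading: both models put the pages in the same order.
same_order = rank_pages(model_x) == rank_pages(model_y)

# Absolute reading: the same 0.80 cut-off gives contradictory verdicts.
verdicts_x = passes_threshold(model_x, 0.80)
verdicts_y = passes_threshold(model_y, 0.80)
```

Under model X, two pages are "aligned"; under model Y, none are. The ordering survives the change of model in this example, while the threshold verdict does not.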

Reasonable people will disagree about how much weight to put on semantic similarity scores in a content strategy. My position is that they're useful as a triangulation signal and dangerous as a target metric. That's a moderate position. The industry's current default — treat the score as ground truth and optimise — is the immoderate one.

What to do with this

If you're commissioning content work right now, or running it in-house, the practical implication is small but important.

Treat semantic similarity scores as a check, not a target. Look at them as one input among several — alongside actual retrieval signals from Search Console's new AI reports, citation monitoring in the AI systems your audience actually uses, and the qualitative editorial judgement that used to do this job before there was a number to point at. Don't let the existence of a precise-looking decimal replace the judgement.

Be suspicious of any vendor pitch that leads with "we score your content against the query" without explaining which embedding model they're using, how they validate the score against observed retrieval, and what the score doesn't capture. The honest answer to those questions is uncomfortable. The vendors who can't give you an honest answer are selling you a number.

And be especially suspicious of yourself when the number goes up. Verify the improvement against something downstream. Citation appearances, referral traffic, qualitative review of how the content reads against the query. If the only thing improving is the score, you're not improving the content — you're improving the measurement.
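One way to run that check, sketched under assumptions: take the pages you optimised, line up the change in score against the change in something downstream, and compute a rank correlation. The figures below are invented, `citation_change` is a placeholder for whatever downstream signal you can actually observe, and five pages is far too few for a real conclusion.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Assumption: made-up per-page changes after an optimisation pass.
score_change = [0.02, 0.05, 0.08, 0.11, 0.15]   # similarity score went up everywhere
citation_change = [3, -1, 0, 2, -2]             # observed AI citations did not follow

rho = spearman(score_change, citation_change)
```

For these invented numbers `rho` comes out at -0.6: every score improved, and the pages whose scores improved most tended to do worse downstream. That is the situation the dashboard alone will never show you.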

The instrument got better. The question it's trying to answer is the same one we've been approximating for sixty years. And the gap between "better approximation" and "knowing" is exactly where the industry's next round of self-inflicted damage will come from, if we let the number do the thinking for us.
