The AI search measurement problem is a definition problem

Every GEO tool measures one of four AI visibility layers and sells it as the whole picture. The measurement gap isn't tooling. It's definition.

For the last eighteen months, every conversation about AI search measurement has been framed as a tooling gap. We don't have log analysis good enough. We don't have citation tracking mature enough. We don't have attribution models built for agent traffic. The pitch from every new GEO tool is some variant of *the data exists, it's just hard to get to, and we'll get to it for you.*

I want to argue something different. The measurement problem isn't primarily a tooling problem. It's a definition problem. We don't know what we're measuring because we haven't decided what counts as a result. And until we do, every dashboard being sold into the market is measuring whatever was cheapest to instrument, not whatever is closest to the truth.

This week made the gap painfully obvious. Google expanded Preferred Sources into AI Overviews and AI Mode — readers can now tell Google which sites to surface more often in AI answers, and 345,000 sources have already been tagged that way, up from 90,000 in December. iPullRank published a study suggesting Gmail content materially shifts which brands appear in AI Mode once Personal Intelligence is on. Cloudflare's CEO is on record saying bot traffic will exceed human traffic by 2027, with AI bots already growing 187% in 2025 against 3.1% human growth. Anthropic's ClaudeBot crawls about 24,000 pages per referral.

Each of those data points is real. No two of them are the same kind of thing. And nobody is being honest about which of them you should be paying attention to, in what order, and why.

We're conflating four different layers and calling all of them "AI visibility"

There is a citation layer: whether your URL appears as a linked source under an AI answer. There is a mention layer: whether your brand is named in the answer text, linked or not. There is an influence layer: whether the model's underlying weights or its retrieval-augmented context treat you as authoritative for a topic. And there is a behaviour layer: whether any of the above results in a person doing something measurable downstream, on your site or elsewhere.

Every vendor pitch I've seen this year flattens these into a single metric called "AI visibility" or "AEO score" or some equivalent piece of dashboard furniture. They are not the same thing. Optimising for one can actively work against another.

A citation in an AI Overview that triggers no click is qualitatively different from a brand mention in a ChatGPT conversation that gets repeated to three colleagues and ends with one of them visiting your site directly two weeks later. The first is measurable but increasingly worthless. The second is nearly invisible but probably what's actually moving revenue. We are building dashboards for the first because it's tractable, and ignoring the second because it isn't.

The Preferred Sources expansion is the cleanest illustration

Take this week's Google news at face value. Preferred Sources is now an explicit user-level signal that shapes which links appear in AI Overviews. 345,000 sources have been selected. Google says users are twice as likely to click through to a Preferred Source. Marie Haynes, Glenn Gabe and the rest of the field are correctly pointing out that publishers should be prompting their own audiences to add them as a Preferred Source.

Preferred Sources is half follow-graph, half trust signal, and half ad inventory. It will be sold to you as all three depending on who's pitching.

So far, so reasonable.

Now ask the harder question: what should you actually be measuring to know whether this matters for your business?

If you measure citations, you'll see whether your link appears under AI answers more often once you've nudged readers to add you. If you measure clicks, you'll see whether those citations convert at the doubled rate Google claims. If you try to measure the influence layer, you can't, because Google doesn't expose it. If you measure behaviour (did adding the Preferred Source button to your site actually grow loyal-reader return visits, newsletter signups, direct traffic, brand searches?), you might find the button worked brilliantly and the AI citation lift was incidental.

All four are valid. All four require different instrumentation. None of the GEO platforms currently being sold to UK businesses measure more than one of them with any seriousness.

The Gmail study is the same problem in a different dress

iPullRank's Personal Intelligence test found brands linked to a user's Gmail appeared more often in AI Mode once the feature was switched on, with Gmail being the strongest signal. The methodology is small-sample, measures outputs rather than internal systems, and is openly described as such. That's fine — early signal is still signal.

[Figure: four stacked layers, with a signal reaching only the smallest]

But notice what this means for measurement. If a brand citation appears for one user with a particular Gmail history and doesn't appear for another with different inbox contents, what is the unit you're tracking? "Citation share" presupposes a single output for a given query. Personalisation breaks the assumption. The query no longer has *a* result. It has a distribution of results across a population of users you cannot observe.

This is the part the tooling market hasn't caught up with. The major GEO platforms are running scripted prompts from clean test accounts and reporting back "your brand appeared in 31% of responses for query X." That number is increasingly meaningless once Personal Intelligence is widely on. The 31% is the result for an account that looks like nobody, in a session that has no history, with an inbox containing nothing. It tells you something. It doesn't tell you what your actual customers see.
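To make the distribution point concrete, here is a minimal sketch of the arithmetic. It assumes you could split your audience into segments and estimate an appearance rate for each, which in practice you mostly can't. The segment shares and rates are hypothetical numbers I've made up for illustration, and `population_appearance_rate` is my own helper name, not anything a tool exposes.

```python
import math

def population_appearance_rate(segments):
    """Weighted appearance rate across audience segments.

    segments: list of (population_share, appearance_rate) pairs,
    where the shares must sum to 1.
    """
    total_share = sum(share for share, _ in segments)
    if not math.isclose(total_share, 1.0):
        raise ValueError("population shares must sum to 1")
    return sum(share * rate for share, rate in segments)

# Hypothetical segments, invented purely for illustration.
segments = [
    (0.1, 0.70),  # existing customers whose inbox mentions the brand
    (0.4, 0.10),  # users whose inbox is full of a competitor
    (0.5, 0.31),  # users who look like the clean test account
]

clean_account_rate = 0.31
real_rate = population_appearance_rate(segments)
print(f"clean account: {clean_account_rate:.1%}, population: {real_rate:.1%}")
```

The clean-account figure is one point in that mixture. The number your customers experience depends on weights and rates the scripted test never observes.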

What I'd actually measure, in order

I want to be concrete here, because the alternative to bad measurement isn't no measurement.

The thing that matters most, and is the cheapest to track, is branded search volume and direct traffic. If AI answers are influencing the discovery layer for your category — citation or mention, doesn't matter which — the downstream signal will eventually show up here. Someone hears your name in an AI response, asks the assistant a follow-up, then later types your brand into Google or your URL directly. Search Console branded queries plus GA4 direct traffic, tracked weekly, will catch this before any citation tracker will. It lags. That's a feature, not a bug. Lag means the noise has shaken out.
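A rough sketch of how the weekly tracking could work, assuming you've already exported weekly totals (branded query clicks from Search Console, direct sessions from GA4) into a plain list. The four-week window and the function name `trailing_change` are my assumptions, not a standard. The idea is simply to compare the latest block of weeks against the block before it, so a single odd week doesn't drive the reading.

```python
from statistics import mean

def trailing_change(weekly_counts, window=4):
    """Relative change between the latest `window` weeks and the
    `window` weeks before them. 0.10 means a 10% rise."""
    if len(weekly_counts) < 2 * window:
        raise ValueError("need at least two full windows of weekly data")
    recent = mean(weekly_counts[-window:])
    prior = mean(weekly_counts[-2 * window:-window])
    if prior == 0:
        raise ValueError("prior window has no traffic to compare against")
    return (recent - prior) / prior

# Hypothetical weekly branded-query clicks, oldest first.
branded_clicks = [100, 100, 100, 100, 110, 110, 110, 110]
print(f"branded search change: {trailing_change(branded_clicks):+.0%}")
```

Run the same function over direct sessions and read the two side by side. Averaging over a window builds the lag in deliberately.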

Second is referral traffic from AI surfaces where it's exposed. ChatGPT, Perplexity and Copilot all pass referrer data to varying degrees. Similarweb reported this week that ChatGPT is showing more links lately, driving a 150% increase in referrals. That number is real and your server logs will show it if you look. This is the only place where AI-driven behaviour produces a clean signal you can attribute. Use it.
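As a sketch of what "looking" might involve: the snippet below takes referrer strings you've already pulled from your logs and counts the ones that come from AI surfaces. The hostname list is an illustrative assumption, not a complete or authoritative mapping, and it will need maintaining as the assistants change domains. The function names are mine.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlparse

# Assumed, illustrative mapping of referrer hostnames to AI surfaces.
# Check it against what actually appears in your own logs.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "copilot.microsoft.com": "Copilot",
}

def classify_referrer(referrer: str) -> Optional[str]:
    """Return the AI surface for a referrer URL, or None if it isn't one."""
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return AI_REFERRER_HOSTS.get(host)

def count_ai_referrals(referrers: Iterable[str]) -> Counter:
    """Count visits per AI surface from a sequence of referrer strings."""
    counts: Counter = Counter()
    for referrer in referrers:
        surface = classify_referrer(referrer)
        if surface is not None:
            counts[surface] += 1
    return counts

# Hypothetical referrer values as they might appear in an access log.
sample = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=example",
    "https://www.google.com/",
    "-",
]
print(count_ai_referrals(sample))
```

Anything that arrives with no referrer at all lands in direct traffic, which is one more reason to track that number alongside this one.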

Third is citation tracking as a leading indicator only. Run a small set of brand-relevant prompts weekly against the three or four assistants that matter for your category. Watch trends, not absolute numbers. If you appear less often than competitors of comparable size and authority, that's worth investigating. If you appear in 22% versus 24% week-on-week, that's noise. Most tools sell you the noise as a service.
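To see why a two-point move is noise, here is a minimal sketch using a standard two-proportion z-test. I'm assuming 100 prompt runs per week, which is my number rather than anything a tool prescribes, and I'm treating runs as independent, which repeated prompts against the same assistant are not quite. Treat it as a sanity check, not a verdict.

```python
import math

def week_on_week_z(hits_prev, n_prev, hits_curr, n_curr):
    """Two-proportion z-score for the change in appearance rate."""
    p_prev = hits_prev / n_prev
    p_curr = hits_curr / n_curr
    pooled = (hits_prev + hits_curr) / (n_prev + n_curr)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_curr))
    if se == 0:
        return 0.0  # identical all-or-nothing weeks: no measurable change
    return (p_curr - p_prev) / se

def is_probably_noise(hits_prev, n_prev, hits_curr, n_curr, threshold=1.96):
    """True when the change sits inside the usual 95% band."""
    z = week_on_week_z(hits_prev, n_prev, hits_curr, n_curr)
    return abs(z) < threshold

# 22% last week, 24% this week, 100 runs each (assumed sample size).
print(round(week_on_week_z(22, 100, 24, 100), 2))
print(is_probably_noise(22, 100, 24, 100))
```

With those assumptions the z-score comes out at about 0.34, nowhere near the 1.96 threshold. At 100 runs a week you would need a swing of a dozen points or so before it was worth a meeting.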

Fourth, and this is the one nobody wants to hear, is qualitative monitoring. Read what the assistants actually say about your category, your brand, and your competitors. Not metrics. Words. The reputation layer in AI search is a reputation layer in a much older sense than the citation-share dashboards admit, and the only way to know what it's saying about you is to read it.

What the tool market is doing instead

The dominant pattern right now is platforms scraping AI assistant outputs at scale, normalising them into "share of voice" or "AI presence" scores, and charging four-figure monthly fees for dashboards that look like SEMrush circa 2019. The mechanism is fine. The interpretation is where it falls apart.

These tools cannot see the influence layer. They cannot see personalisation. They cannot see the conversation-to-direct-traffic pathway that's probably driving most of the actual commercial impact. They can see citation share for unauthenticated test queries, and they sell that as the whole picture because that's what's tractable.

The honest version of this pitch would be: *we measure one of four relevant layers, the one most loosely connected to revenue, because the other three are either invisible or expensive to instrument properly.* Nobody is selling that version.

The honest limits

I'm aware this argument has a weak point. Saying "the measurement problem is a definition problem" is itself a kind of dodge — if the definitions stabilise, the tooling argument becomes valid again, and the platforms currently being built will look prescient rather than premature.

That might happen. Google might expose AI citation impression data in Search Console within twelve months. Anthropic might publish referrer data more cleanly. Bing might do something coherent. If the surfaces start exposing structured measurement primitives, the gap I'm describing will close, and the tools that survive will be the ones already instrumented to consume them.

I'd also concede that for very large brands with the budget to run all four measurement layers in parallel, the citation-tracking layer is genuinely useful as one input among many. The problem is mid-market businesses being sold the dashboard as the whole answer, when it's a slice of one layer at best.

What this means for the next twelve months

If you're a business owner or in-house marketer being pitched a GEO platform right now, the question to ask isn't "what does it measure." It's "which of the four layers does it cover, and which are you flying blind on."

If you're a consultant, the work is to be specific about what each metric does and doesn't tell the client. Citation share is not visibility. Visibility is not influence. Influence is not behaviour. Behaviour is the only one that pays.

And if you're building one of these tools (which, judging by my inbox, more people than I'd realised are), the differentiated product is not a prettier dashboard for the same citation scrape everyone else is running. It's the instrumentation nobody else is bothering with: the personalisation-distribution problem, the conversation-to-direct-traffic pathway, the qualitative layer. The boring one. The expensive one. The one that's actually closest to revenue.

The reason measurement feels hard right now is that we've collectively agreed to measure the easy thing and call it the important thing. We can stop doing that whenever we like.
