
Search engines don't read pages. They extract passages.

A semantic content analysis tool that measures how well your content actually aligns with user queries — using the same embedding and chunking logic that modern search engines use to retrieve answers.

Category: SEO / GEO Intelligence
Status: Working Tool
Inspired By: Mike King's Research
Stack: Python · Gemini Embeddings
The problem

Keyword density is a dead metric

Most content teams still optimize by checking whether target keywords appear in headers, body text, and meta tags. The tools they use count occurrences. But modern search engines — and AI-powered answer engines — don't match keywords. They embed content into vector space and measure semantic similarity against the query.

Google's passage indexing means a single well-written paragraph can rank for a query even if the rest of the page is about something different. AI search tools like Perplexity and Google's AI Overviews extract and cite individual passages, not entire pages. The unit of optimization has shifted from the page to the passage — and almost nobody's tooling reflects that.

If search engines chunk your content into passages and score each one against queries independently, your optimization tool should do the same thing. Anything else is measuring the wrong signal.

The solution

Simulate the retrieval pipeline

Passage Matrix does what keyword tools can't: it simulates how a search engine actually processes your content. It chunks the page using heading-aware logic, embeds each chunk, generates query variations, and measures cosine similarity between every passage and every query variant.

The result isn't a keyword density percentage. It's a matrix showing exactly which passages in your content align with which queries — and which queries have no strong passage match at all.

Step 1

Heading-aware chunking

Content is split into passages that respect the document's heading structure. A section under an H2 stays together rather than being arbitrarily split at a character limit. This mirrors how search engines identify topical boundaries.
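A minimal sketch of this step in Python, assuming markdown-style ATX headings as section boundaries (the tool's actual splitting rules and size threshold may differ):

```python
import re

def chunk_by_headings(text, max_chars=1200):
    """Split text into passages at heading boundaries.

    Each chunk is a heading plus the body under it, so sections stay
    intact instead of being cut at an arbitrary character limit.
    max_chars is a soft cap: oversized sections fall back to splitting
    on paragraph breaks.
    """
    # Split before every ATX heading (#, ##, ...) while keeping the heading.
    sections = re.split(r"\n(?=#{1,6} )", text.strip())
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Fallback: split an oversized section on blank lines.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks

doc = "# Title\n\nIntro paragraph.\n\n## Chunking\n\nDetails about chunking."
chunks = chunk_by_headings(doc)
# Two passages: the intro under "# Title" and the "## Chunking" section.
```

The lookahead in `re.split` keeps each heading attached to the body that follows it, which is the whole point: the passage a search engine retrieves carries its own topical label.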

Step 2

Embedding generation

Each chunk is embedded using Gemini's text-embedding-004 model — the same class of embedding that powers modern retrieval systems. This converts text into high-dimensional vectors that capture semantic meaning, not just word overlap.
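A sketch of the embedding step. The embedder is injected as a function so the logic can run offline; the Gemini call shape is shown in a comment (it requires the `google-generativeai` package and an API key), and `toy_embed` is a purely illustrative stand-in, not the real model:

```python
def embed_chunks(chunks, embed_fn):
    """Map each text chunk to its embedding vector via embed_fn."""
    return [embed_fn(chunk) for chunk in chunks]

def gemini_embed(text):
    # Real call shape (requires `pip install google-generativeai` + API key):
    #   import google.generativeai as genai
    #   resp = genai.embed_content(model="models/text-embedding-004",
    #                              content=text,
    #                              task_type="retrieval_document")
    #   return resp["embedding"]
    raise NotImplementedError("requires Gemini API credentials")

def toy_embed(text):
    # Offline stand-in: a tiny bag-of-words vector over a fixed vocabulary.
    vocab = ["search", "passage", "embedding", "query"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

vectors = embed_chunks(["passage about search", "query embedding notes"],
                       toy_embed)
```

Dependency-injecting the embedder also makes it easy to swap models later without touching the chunking or scoring code.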

Step 3

Query fan-out

A single target query is expanded into multiple variations: rephrasings, related questions, long-tail variants. Users don't all type the same query, so your content needs to match the semantic neighborhood, not just the exact phrase.
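The fan-out step, sketched offline. In the real tool an LLM generates the variants; the templates below are a simplistic stand-in that only shows the shape of the output:

```python
def fan_out(head_query):
    """Expand a head query into variant phrasings.

    Template-based stand-in for an LLM-driven expansion: each template
    produces one rephrasing, related question, or long-tail variant.
    """
    templates = [
        "{q}",
        "what is {q}",
        "how does {q} work",
        "best tools for {q}",
        "{q} explained",
        "{q} vs alternatives",
    ]
    return [t.format(q=head_query) for t in templates]

variants = fan_out("passage indexing")
# One query in, six query variants out — each gets embedded and scored.
```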

Step 4

Cosine similarity scoring

Every passage is scored against every query variant. The output is a matrix: rows are passages, columns are queries, cells are similarity scores. High scores mean strong alignment. Gaps mean your content doesn't answer that question — even if you thought it did.
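The scoring step as a sketch with NumPy. The toy 3-dimensional vectors below stand in for real text-embedding-004 output (which is 768-dimensional); the matrix math is the same either way:

```python
import numpy as np

def similarity_matrix(passage_vecs, query_vecs):
    """Cosine similarity of every passage against every query variant.

    Rows are passages, columns are queries. Normalizing each vector to
    unit length turns the dot product into cosine similarity.
    """
    P = np.asarray(passage_vecs, dtype=float)
    Q = np.asarray(query_vecs, dtype=float)
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    return P @ Q.T

# Toy embeddings: passage 0 aligns with query 0, passage 1 with query 1.
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
M = similarity_matrix(passages, queries)
# M[0, 0] and M[1, 1] are high; the off-diagonal cells are the gaps.
```

A low column in this matrix is the actionable signal: no passage strongly answers that query variant, regardless of how often the keyword appears on the page.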

[Screenshot: Passage Matrix tool showing query fan-out, content analysis, coverage scoring, and the passage-by-query similarity matrix.]
Full analysis pipeline — the head keyword generates 12 synthetic subqueries (color-coded by type), and the content is chunked and scored against each. The coverage matrix reveals which passages align with which queries — and where the gaps are. This example scores 5.9: broad, but not deep enough for AI source selection.
Context

Why this matters now

This tool was inspired by Mike King's research on how search engines have moved from page-level to passage-level indexing. The concept of "entity anchoring" for AI search visibility — which I wrote about in late 2024, ahead of academic validation — is rooted in the same shift: the unit of relevance is no longer the document, it's the retrievable chunk.

For content teams, the practical implication is stark: you can have a 3,000-word article that ranks for nothing because no individual passage strongly aligns with any specific query. Or you can have a 500-word post with one perfectly written paragraph that gets extracted as an AI Overview citation. Length doesn't matter. Passage-level semantic alignment does.

Passage Matrix makes that visible.

Python · Gemini text-embedding-004 · Cosine Similarity · Heading-Aware Chunking · Query Fan-out Generation

Got a problem that needs this kind of thinking?

I find the signals hiding in your workflow and build the system that surfaces them. Let's talk about what's slowing you down.

Book a discovery call