Semantic HTML & Content Structure

17 checks · Weight: 8% of overall score

4 high · 9 medium · 4 low

6.1

Single h1 per page

high · Pass / Fail

AI agents use the single <h1> as the authoritative title of the page for content indexing and answer generation. Ensure exactly one <h1> per page.

Why This Matters

AI agents use the single <h1> as the authoritative page title for content indexing and answer generation. Multiple <h1> elements create ambiguity about the page's primary topic, causing agents to misidentify or conflate subjects when generating answers.

How to Fix

Ensure every page has exactly one <h1> element that clearly describes the page's primary topic. Use h2-h6 for all other headings. If your CMS or template generates multiple <h1> elements, change the extras to the appropriate lower heading level.

Example

<h1>Your Primary Page Title</h1>
<h2>First Section</h2>
<h2>Second Section</h2>
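
One way this check could be automated is sketched below, using only Python's standard-library `html.parser`. The names `H1Counter` and `check_single_h1` are hypothetical, and the pass/fail rule is simply the "exactly one `<h1>`" rule stated above, not the scoring tool's actual implementation.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Counts every <h1> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> is counted too.
        if tag == "h1":
            self.count += 1


def check_single_h1(html):
    """Return "pass" when the page has exactly one <h1>, else "fail"."""
    counter = H1Counter()
    counter.feed(html)
    return "pass" if counter.count == 1 else "fail"


print(check_single_h1("<h1>Your Primary Page Title</h1><h2>First Section</h2>"))  # pass
print(check_single_h1("<h1>Title</h1><h1>Second title</h1>"))  # fail
```

A page with no `<h1>` at all also fails under this rule.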
Effort: Trivial (minutes)
Tags: headings · h1 · structure · semantic

6.2

Sequential heading hierarchy

high · Pass / Warn / Fail

AI systems build content outlines from headings to understand document structure. Skipped levels (e.g., h1 to h3 without h2) break the hierarchy, causing agents to misinterpret section nesting and produce inaccurate content summaries. Fix heading levels to follow a sequential order.

Why This Matters

AI systems build content outlines from heading levels to understand document hierarchy. Skipped levels (e.g., h1 directly to h3) break this hierarchy, causing agents to misinterpret section nesting and produce inaccurate content summaries with wrong parent-child relationships.

How to Fix

Ensure headings follow a sequential order without skipping levels. After an h1, use h2 for major sections, h3 for subsections within h2, and so on. Never jump from h1 to h3 or h2 to h4 without the intermediate level.

Example

<h1>Page Title</h1>
  <h2>Major Section</h2>
    <h3>Subsection</h3>
    <h3>Another Subsection</h3>
  <h2>Another Major Section</h2>
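
Skipped levels can be found by walking the headings in document order and flagging any step that descends more than one level. The sketch below assumes that rule; `HeadingLevels` and `skipped_levels` are hypothetical names, and how a skip maps to Warn versus Fail is not specified here.

```python
from html.parser import HTMLParser


class HeadingLevels(HTMLParser):
    """Records the numeric level of each h1-h6 in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def skipped_levels(html):
    """Return (previous_level, level) pairs where a level was skipped."""
    parser = HeadingLevels()
    parser.feed(html)
    skips = []
    previous = 0  # 0 means "no heading seen yet", so a page must start at h1
    for level in parser.levels:
        if level > previous + 1:
            skips.append((previous, level))
        previous = level
    return skips


print(skipped_levels("<h1>Title</h1><h3>Orphaned subsection</h3>"))  # [(1, 3)]
```

Moving back up the outline (for example, from h3 to h2) is not a skip; only downward jumps of more than one level are reported.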
Effort: Easy (< 1 hour)
Tags: headings · hierarchy · structure · semantic

6.3

<main> element present

high · Pass / Warn / Fail

AI scrapers use <main> to identify primary content and discard nav/footer chrome, reducing hallucination risk from boilerplate text. Without <main>, agents must guess which content is primary versus navigational, often ingesting menus and footers into their context window.

Why This Matters

Without a <main> element, AI scrapers cannot distinguish primary content from navigation, sidebars, and footer boilerplate. This causes agents to ingest menus, disclaimers, and repeated chrome into their context window, increasing hallucination risk and reducing answer relevance.

How to Fix

Add a single <main> element to every page wrapping only the primary content area. Do not include site navigation, sidebars, or footers inside <main>. There should be exactly one <main> per page.

Example

<body>
  <header><!-- Navigation --></header>
  <main>
    <!-- Primary page content only -->
  </main>
  <footer><!-- Footer --></footer>
</body>
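
To illustrate why `<main>` helps, here is a minimal sketch of how a scraper might keep only the text inside `<main>` and count how many `<main>` elements a page has. `MainTextExtractor` and `primary_text` are hypothetical names; real scrapers vary and this is not any specific agent's algorithm.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects text that appears inside <main> and counts <main> elements."""

    def __init__(self):
        super().__init__()
        self.main_count = 0
        self.inside_main = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
            self.inside_main = True

    def handle_endtag(self, tag):
        if tag == "main":
            self.inside_main = False

    def handle_data(self, data):
        if self.inside_main and data.strip():
            self.chunks.append(data.strip())


def primary_text(html):
    """Return (number of <main> elements, text found inside them)."""
    parser = MainTextExtractor()
    parser.feed(html)
    return parser.main_count, " ".join(parser.chunks)


page = (
    "<body><header>Menu</header>"
    "<main><p>Real content</p></main>"
    "<footer>Legal</footer></body>"
)
print(primary_text(page))  # (1, 'Real content')
```

A count other than 1 signals a missing or duplicated `<main>`; with a count of 0 the extractor returns no text, which is the "must guess" situation described above.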
Effort: Easy (< 1 hour)
Tags: landmarks · main · structure · semantic · html

6.4

<article> used for content

medium · Pass / Warn / Fail

RAG systems chunk content by <article> boundaries for vector embedding, treating each article as an independent retrieval unit. Without <article> tags, AI chunking algorithms fall back to arbitrary text splitting, which fragments related content across multiple embeddings and reduces answer quality.

Why This Matters

RAG systems chunk content by <article> boundaries for vector embedding, treating each article as an independent retrieval unit. Without <article> tags, AI chunking algorithms fall back to arbitrary text splitting, which fragments related content across embeddings and reduces answer quality.

How to Fix

Wrap each self-contained content block (blog post, news story, product card, forum post) in an <article> element. Each <article> should make sense on its own and include a heading.

Example

<article>
  <h2>Article Title</h2>
  <p>Self-contained content block that makes sense independently...</p>
</article>
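
As a rough illustration of chunking by `<article>` boundaries, the sketch below turns each top-level `<article>` into one text chunk. It is an assumption about how such a pipeline could work, not a description of a particular RAG system; `ArticleChunker` and `article_chunks` are hypothetical names.

```python
from html.parser import HTMLParser


class ArticleChunker(HTMLParser):
    """Emits one text chunk per top-level <article>."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth of open <article> elements
        self.chunks = []
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                self._current = []

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append(" ".join(self._current))

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self._current.append(data.strip())


def article_chunks(html):
    chunker = ArticleChunker()
    chunker.feed(html)
    return chunker.chunks


html = (
    "<article><h2>First Post</h2><p>Body one.</p></article>"
    "<article><h2>Second Post</h2><p>Body two.</p></article>"
)
print(article_chunks(html))  # ['First Post Body one.', 'Second Post Body two.']
```

Nested articles (such as comments inside a post) are folded into their outer article in this sketch; text outside any `<article>` yields no chunk.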
Effort: Easy (< 1 hour)
Tags: article · structure · semantic · html

6.5

<header> and <footer> landmarks

medium · Pass / Warn / Fail

AI agents use <header> and <footer> landmarks to identify and exclude boilerplate content (navigation, copyright, links) from primary content extraction. Without these landmarks, agents may include footer disclaimers or nav menus in their content summaries.

Why This Matters

AI agents use <header> and <footer> landmarks to identify and exclude boilerplate content (navigation menus, copyright notices, legal links) from primary content extraction. Without these landmarks, agents may include footer disclaimers or nav menus in their content summaries, reducing answer accuracy.

How to Fix

Wrap your site navigation and branding area in a <header> element, and your copyright, legal links, and secondary navigation in a <footer> element. These should be present on every page for consistent content extraction.

Example

<header>
  <nav><!-- Site navigation --></nav>
</header>
<main><!-- Primary content --></main>
<footer>
  <p>&copy; 2025 Your Company. All rights reserved.</p>
</footer>
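
A presence check for page-level landmarks might look like the sketch below. It assumes that a `<header>` or `<footer>` nested inside `<article>`, `<aside>`, `<main>`, `<nav>`, or `<section>` belongs to that element rather than to the page, so only top-level ones count. `PageLandmarks` and `missing_landmarks` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements whose own <header>/<footer> should not count as page landmarks.
SECTIONING = {"article", "aside", "main", "nav", "section"}


class PageLandmarks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sectioning_depth = 0
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            self.sectioning_depth += 1
        elif tag in ("header", "footer") and self.sectioning_depth == 0:
            self.found.add(tag)

    def handle_endtag(self, tag):
        if tag in SECTIONING and self.sectioning_depth > 0:
            self.sectioning_depth -= 1


def missing_landmarks(html):
    """Return the page-level landmarks (header, footer) that are absent."""
    parser = PageLandmarks()
    parser.feed(html)
    return sorted({"header", "footer"} - parser.found)


page = "<header><nav>Links</nav></header><main>Content</main><footer>Legal</footer>"
print(missing_landmarks(page))  # []
```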
Effort: Easy (< 1 hour)
Tags: landmarks · header · footer · structure · semantic · html

6.6

<aside> for supplementary content

low · Pass / Fail

AI agents use <aside> to distinguish supplementary content (sidebars, callouts, related links) from primary content. Without it, sidebar content may be mixed into the main content extraction, diluting the primary message in AI-generated summaries.

Why This Matters

Without <aside>, AI agents cannot distinguish supplementary content (sidebars, callouts, related links) from primary content. This causes sidebar promotions, ads, and tangential content to be mixed into AI-generated summaries, diluting the accuracy of your main message.

How to Fix

Wrap sidebar content, callout boxes, pull quotes, and related-links sections in <aside> elements. This signals to AI agents that the content is supplementary and should not be treated as part of the main narrative.

Example

<aside>
  <h3>Related Resources</h3>
  <ul>
    <li><a href="/related-topic">Related Topic</a></li>
  </ul>
</aside>
Effort: Easy (< 1 hour)
Tags: aside · structure · semantic · html

6.7

<section> elements have headings or labels

medium · Pass / Warn / Fail

AI agents use section headings to build a topic map of your page for retrieval-augmented generation (RAG). Unlabeled sections are opaque to AI systems that chunk content by semantic boundaries, reducing the quality of retrieved context for answer generation.

Why This Matters

AI agents use section headings to build a topic map of your page for retrieval-augmented generation. Unlabeled <section> elements are opaque to AI chunking systems, preventing them from indexing and retrieving your content by topic, which reduces your visibility in AI-generated answers.

How to Fix

Add a heading (h2-h6) as the first child of every <section> element, or use aria-label/aria-labelledby if a visible heading is not appropriate for the design. Every section should have a clear, descriptive label.

Example

<section>
  <h2>Pricing Plans</h2>
  <p>Choose the plan that fits your needs...</p>
</section>

<!-- Or with aria-label for visually hidden labels: -->
<section aria-label="Customer testimonials">
  <!-- Content without a visible heading -->
</section>
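
The labeling rule can be sketched as a small checker that counts `<section>` elements with neither a heading nor a non-empty `aria-label`/`aria-labelledby`. This version accepts a heading anywhere inside the section (not only as the first child), which is a simplification; `SectionLabelChecker` and `unlabeled_sections` are hypothetical names.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
LABEL_ATTRS = {"aria-label", "aria-labelledby"}


class SectionLabelChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open <section>: is it labeled yet?
        self.unlabeled = 0

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            labeled = any(name in LABEL_ATTRS and value for name, value in attrs)
            self.stack.append(labeled)
        elif tag in HEADINGS and self.stack:
            # A heading labels only the innermost open section.
            self.stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "section" and self.stack:
            if not self.stack.pop():
                self.unlabeled += 1


def unlabeled_sections(html):
    checker = SectionLabelChecker()
    checker.feed(html)
    return checker.unlabeled


html = (
    "<section><h2>Pricing Plans</h2></section>"
    '<section aria-label="Customer testimonials"><p>Quote</p></section>'
    "<section><p>No label here</p></section>"
)
print(unlabeled_sections(html))  # 1
```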
Effort: Easy (< 1 hour)
Tags: sections · headings · structure · semantic · html

6.8

Semantic list usage

medium · Pass / Warn / Fail

AI agents recognize <ul>, <ol>, and <dl> as structured data lists and extract them as bullet points in generated answers. Content formatted as styled divs instead of semantic lists is invisible to list-extraction algorithms, meaning your feature lists and step-by-step content will not be surfaced as structured answers.

Why This Matters

AI agents recognize <ul>, <ol>, and <dl> as structured lists and extract them as bullet points or numbered steps in generated answers. Content formatted as styled <div> elements instead of semantic lists is invisible to list-extraction algorithms, so your feature lists and step-by-step instructions will not be surfaced as structured answers.

How to Fix

Replace styled <div> elements used as lists with proper <ul> (unordered list), <ol> (ordered list), or <dl> (description list) elements. Use <ol> for sequential steps, <ul> for unordered items, and <dl> for term-definition pairs.

Example

<ul>
  <li>Feature one: description</li>
  <li>Feature two: description</li>
</ul>

<ol>
  <li>Step one</li>
  <li>Step two</li>
</ol>
Effort: Easy (< 1 hour)
Tags: lists · structure · semantic · html

6.9

Data tables properly structured

medium · Pass / Warn / Fail

AI agents use <thead> and <th> elements to understand column headers and interpret table data correctly. Without proper structure, agents cannot map cell values to their column meanings, leading to garbled data extraction in AI-generated comparisons and summaries.

Why This Matters

AI agents rely on <thead> and <th> elements to understand column headers and map cell values to their meanings. Without proper table structure, agents cannot interpret tabular data correctly, leading to garbled comparisons and inaccurate data extraction in AI-generated summaries.

How to Fix

Add a <thead> section containing a <tr> with <th> elements for each column header. Place data rows inside a <tbody> section. Use the scope attribute on <th> elements for complex tables with row and column headers.

Example

<table>
  <thead>
    <tr><th scope="col">Feature</th><th scope="col">Value</th></tr>
  </thead>
  <tbody>
    <tr><td>Speed</td><td>100ms</td></tr>
  </tbody>
</table>
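
The sketch below shows how `<th>` cells let a parser map each value to its column name. It assumes column headers only (no row headers) and explicit closing tags on every cell; `TableReader` and `table_records` are hypothetical names.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Reads <th> cells as column names and <td> cells as row values."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._row = None
        self._cell = None  # "th" or "td" while inside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr":
            if self._row:  # header-only rows stay out of the data rows
                self.rows.append(self._row)
            self._row = None


def table_records(html):
    """Return one dict per data row, keyed by column header."""
    reader = TableReader()
    reader.feed(html)
    return [dict(zip(reader.headers, row)) for row in reader.rows]


structured = (
    "<table><thead><tr><th scope='col'>Feature</th><th scope='col'>Value</th></tr></thead>"
    "<tbody><tr><td>Speed</td><td>100ms</td></tr></tbody></table>"
)
print(table_records(structured))  # [{'Feature': 'Speed', 'Value': '100ms'}]
```

When the same table has no `<th>` cells, `table_records` returns `[{}]`: the values are still present in the markup, but nothing says what they mean.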
Effort: Easy (< 1 hour)
Tags: tables · structure · semantic · html

6.10

Code blocks have language annotations

low · Pass / Warn / Fail

AI agents use language annotations on code blocks to apply the correct syntax understanding and provide accurate code explanations. Without them, agents must guess the programming language, which can lead to incorrect interpretations in AI-generated code answers.

Why This Matters

AI agents use language annotations on code blocks to apply correct syntax highlighting and interpretation. Without language classes, agents must guess the programming language, leading to incorrect code explanations and potentially dangerous misinterpretations in AI-generated technical answers.

How to Fix

Add a class attribute with a "language-" prefix to every <code> element inside <pre>. Use the standard language identifier (e.g., language-javascript, language-python, language-html). Most syntax highlighting libraries (Prism, Highlight.js) do this automatically.

Example

<pre><code class="language-javascript">const response = await fetch("/api/data");
const data = await response.json();</code></pre>
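
A checker for this rule could read the `language-` class from each `<code>` element nested in `<pre>`, as in the sketch below. `CodeLanguageChecker` and `code_languages` are hypothetical names, and inline `<code>` outside `<pre>` is deliberately ignored.

```python
from html.parser import HTMLParser

PREFIX = "language-"


class CodeLanguageChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pre_depth = 0
        self.languages = []  # one entry per <pre><code>; None if unannotated

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "code" and self.pre_depth:
            classes = (dict(attrs).get("class") or "").split()
            language = next(
                (c[len(PREFIX):] for c in classes if c.startswith(PREFIX)), None
            )
            self.languages.append(language)

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1


def code_languages(html):
    checker = CodeLanguageChecker()
    checker.feed(html)
    return checker.languages


html = (
    '<pre><code class="language-javascript">const x = 1;</code></pre>'
    "<pre><code>y = 2</code></pre>"
)
print(code_languages(html))  # ['javascript', None]
```

Each `None` in the result marks a code block whose language would have to be guessed.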
Effort: Trivial (minutes)
Tags: code · language · semantic · html

6.11

<time datetime=""> used for dates

medium · Pass / Fail

AI agents use <time datetime> elements to reliably parse dates for freshness scoring and temporal reasoning. Without machine-readable dates, agents must regex-parse human-readable date formats, which frequently fails across locales and ambiguous formats like "01/02/2025".

Why This Matters

AI agents use <time datetime> elements to reliably parse dates for freshness scoring and temporal reasoning. Without machine-readable dates, agents must regex-parse human-readable formats, which frequently fails across locales and ambiguous formats like "01/02/2025" (Jan 2 vs. Feb 1).

How to Fix

Wrap all dates and times in <time> elements with a datetime attribute in ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDThh:mm:ss, optionally followed by a UTC offset such as -05:00). Include publication dates, event dates, and last-modified dates.

Example

<p>Published on <time datetime="2025-01-15">January 15, 2025</time></p>
<p>Event starts <time datetime="2025-03-20T09:00:00-05:00">March 20 at 9 AM EST</time></p>
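
The difference between machine-readable and human-readable dates can be shown with Python's standard library. In the sketch below, `TimeCollector` and `machine_dates` are hypothetical names; values that `datetime.fromisoformat` cannot parse are set aside rather than guessed at.

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeCollector(HTMLParser):
    """Parses the datetime attribute of every <time> element."""

    def __init__(self):
        super().__init__()
        self.parsed = []
        self.unparsed = []

    def handle_starttag(self, tag, attrs):
        if tag != "time":
            return
        value = dict(attrs).get("datetime")
        if not value:
            return
        try:
            self.parsed.append(datetime.fromisoformat(value))
        except ValueError:
            self.unparsed.append(value)


def machine_dates(html):
    collector = TimeCollector()
    collector.feed(html)
    return collector.parsed


html = (
    '<time datetime="2025-01-15">January 15, 2025</time>'
    '<time datetime="2025-03-20T09:00:00-05:00">March 20 at 9 AM EST</time>'
)
dates = machine_dates(html)
print(dates[0].date().isoformat())  # 2025-01-15
print(dates[1].utcoffset())  # -1 day, 19:00:00 (that is, UTC-05:00)

# The same human-readable string parses to two different days
# depending on which locale convention the reader assumes.
as_month_first = datetime.strptime("01/02/2025", "%m/%d/%Y")  # January 2
as_day_first = datetime.strptime("01/02/2025", "%d/%m/%Y")  # February 1
print(as_month_first.date() == as_day_first.date())  # False
```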
Effort: Trivial (minutes)
Tags: dates · time · semantic · html

6.12

<address> for contact info

low · Pass / Fail

AI agents use <address> elements to extract contact information (email, phone, physical address) for structured answers to "how to contact" queries. Without semantic <address> markup, agents must guess which text on your page is contact info.

Why This Matters

AI agents cannot reliably extract contact information (email, phone, physical address) when it is not wrapped in an <address> element. This means your business contact details may be omitted from AI-generated answers to "how do I contact" queries.

How to Fix

Wrap all contact information blocks (email addresses, phone numbers, physical addresses) in an <address> element. Place it in the <footer> or near the relevant content section.

Example

<address>
  <a href="mailto:[email protected]">[email protected]</a><br>
  <a href="tel:+1234567890">+1 (234) 567-890</a><br>
  123 Main St, City, ST 12345
</address>
Effort: Trivial (minutes)
Tags: contact · semantic · html

6.13

Definition elements

low · Pass / Fail

AI agents use <dfn> and <dl> elements to extract term-definition pairs for "what is X?" queries. Semantic definition markup makes your glossary terms and key concepts directly extractable as AI-generated answer snippets.

Why This Matters

AI agents use <dfn> and <dl> elements to extract term-definition pairs for "what is X?" queries. Without semantic definition markup, your glossary terms and key concepts cannot be directly surfaced as AI-generated answer snippets.

How to Fix

Use <dl> (description list) with <dt> (term) and <dd> (definition) pairs for glossaries, FAQs, and key-value content. Use <dfn> inline to mark the defining instance of a term within running text.

Example

<dl>
  <dt><dfn>API Rate Limit</dfn></dt>
  <dd>The maximum number of requests allowed per time period.</dd>
</dl>
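
With this markup, term-definition pairs can be read directly into a lookup table. The sketch below assumes one `<dd>` per `<dt>` and explicit closing tags; `GlossaryReader` and `glossary` are hypothetical names.

```python
from html.parser import HTMLParser


class GlossaryReader(HTMLParser):
    """Pairs each <dt> term with the <dd> definition that follows it."""

    def __init__(self):
        super().__init__()
        self.entries = {}
        self._tag = None  # "dt" or "dd" while inside one
        self._term = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # e.g. </dfn> inside a <dt>; keep collecting text
        text = " ".join("".join(self._text).split())
        if tag == "dt":
            self._term = text
        elif self._term is not None:
            self.entries[self._term] = text
        self._tag = None


def glossary(html):
    reader = GlossaryReader()
    reader.feed(html)
    return reader.entries


html = (
    "<dl><dt><dfn>API Rate Limit</dfn></dt>"
    "<dd>The maximum number of requests allowed per time period.</dd></dl>"
)
print(glossary(html)["API Rate Limit"])
```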
Effort: Easy (< 1 hour)
Tags: definitions · glossary · semantic · html

6.14

Sufficient content depth

medium · Pass / Warn / Fail

AI RAG systems need sufficient content depth to generate accurate, detailed answers. Pages with fewer than 300 words provide too little context for meaningful vector embeddings, causing your content to rank poorly in retrieval and be excluded from AI-generated responses.

Why This Matters

Pages with fewer than 300 words provide too little context for AI RAG systems to generate accurate, detailed answers. Thin content produces weak vector embeddings that rank poorly in retrieval, causing your pages to be excluded from AI-generated responses entirely.

How to Fix

Expand thin pages with substantive content: add detailed explanations, practical examples, FAQs, and relevant context. Aim for at least 300 words of meaningful content per page. Avoid filler text; focus on answering real user questions comprehensively.
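
A word count against the 300-word guideline can be approximated as below. This sketch counts whitespace-separated words in all text outside `<script>` and `<style>`; it cannot judge whether the words are meaningful, and `VisibleText`, `word_count`, and `is_thin` are hypothetical names.

```python
from html.parser import HTMLParser

MIN_WORDS = 300  # guideline stated above


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def word_count(html):
    parser = VisibleText()
    parser.feed(html)
    return len(" ".join(parser.parts).split())


def is_thin(html):
    return word_count(html) < MIN_WORDS


print(word_count("<p>one two</p><script>var x = 1;</script><p>three</p>"))  # 3
```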

Effort: Moderate (hours)
Tags: content · depth · quality

6.15

Image alt text coverage

high · Pass / Warn / Fail

Most AI agents are text-only and rely entirely on alt text to understand images. Missing alt text makes your visual content invisible to AI systems, meaning product images, diagrams, and infographics contribute nothing to AI-generated answers about your page.

Why This Matters

Most AI agents are text-only and rely entirely on alt text to understand images. Missing alt text makes your product photos, diagrams, and infographics completely invisible to AI systems, meaning they contribute nothing to AI-generated answers about your pages.

How to Fix

Add descriptive alt text to every non-decorative image. Describe what the image shows and why it matters in context. For product images, include the product name and key visual features. For decorative images, use an empty alt="" with role="presentation" instead.

Example

<img src="product.jpg" alt="Blue running shoe, side view, with breathable mesh upper and cushioned sole">
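
Coverage can be estimated as the share of `<img>` elements that carry an `alt` attribute at all. In the sketch below an empty `alt=""` counts as covered because it is the intentional decorative marker; the thresholds that separate Pass, Warn, and Fail are not given here, so none are assumed. `AltCoverage` and `alt_coverage` are hypothetical names.

```python
from html.parser import HTMLParser


class AltCoverage(HTMLParser):
    def __init__(self):
        super().__init__()
        self.total = 0
        self.missing = []  # src of each <img> with no alt attribute at all

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        self.total += 1
        if "alt" not in attributes:
            self.missing.append(attributes.get("src"))


def alt_coverage(html):
    """Return (share of images with an alt attribute, src list of offenders)."""
    parser = AltCoverage()
    parser.feed(html)
    if parser.total == 0:
        return 1.0, []
    covered = parser.total - len(parser.missing)
    return covered / parser.total, parser.missing


html = '<img src="product.jpg" alt="Blue running shoe, side view"><img src="banner.jpg">'
print(alt_coverage(html))  # (0.5, ['banner.jpg'])
```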
Effort: Moderate (hours)
Tags: images · alt-text · accessibility · semantic

6.16

Decorative images marked correctly

medium · Pass / Warn / Fail

AI agents processing the accessibility tree treat images with empty alt but no role="presentation" as potentially missing alt text rather than intentionally decorative. Adding role="presentation" explicitly tells agents to skip these images, preventing them from flagging false content gaps.

Why This Matters

AI agents processing the accessibility tree treat images with empty alt but no role="presentation" as potentially missing alt text rather than intentionally decorative. This creates false-positive content gaps and wastes agent processing on irrelevant images.

How to Fix

Add role="presentation" (or role="none") to all decorative images that already have an empty alt attribute. This explicitly tells AI agents and assistive technologies to skip these images entirely.

Example

<img src="decorative-border.png" alt="" role="presentation">
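
Images that have an empty `alt` but lack the explicit role can be listed with a short scan like the one below. `DecorativeImageChecker` and `unmarked_decorative_images` are hypothetical names; the accepted roles are the two named above.

```python
from html.parser import HTMLParser

DECORATIVE_ROLES = {"presentation", "none"}


class DecorativeImageChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.unmarked = []  # src of images with empty alt but no decorative role

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        has_empty_alt = "alt" in attributes and not attributes["alt"]
        if has_empty_alt and attributes.get("role") not in DECORATIVE_ROLES:
            self.unmarked.append(attributes.get("src"))


def unmarked_decorative_images(html):
    checker = DecorativeImageChecker()
    checker.feed(html)
    return checker.unmarked


html = (
    '<img src="decorative-border.png" alt="" role="presentation">'
    '<img src="divider.png" alt="">'
    '<img src="chart.png" alt="Quarterly sales">'
)
print(unmarked_decorative_images(html))  # ['divider.png']
```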
Effort: Trivial (minutes)
Tags: images · decorative · accessibility · semantic

6.17

<figure> + <figcaption> usage

medium · Pass / Warn / Fail

AI agents use <figcaption> to understand the purpose and context of figures beyond what alt text provides. Without captions, agents treat figures as opaque image containers with no semantic meaning, missing opportunities to cite your visual data in AI-generated answers.

Why This Matters

AI agents use <figcaption> to understand the purpose and context of visual content beyond what alt text provides. Without captions, figures are treated as opaque image containers, and your charts, diagrams, and illustrations cannot be meaningfully cited in AI-generated answers.

How to Fix

Wrap images, charts, diagrams, and code examples in <figure> elements. Add a descriptive <figcaption> that explains the significance of the visual content: not just what it shows, but why it matters in context.

Example

<figure>
  <img src="sales-chart.png" alt="Bar chart showing quarterly sales">
  <figcaption>Figure 1: Sales increased 40% year-over-year in Q4 2024, driven by the new product launch.</figcaption>
</figure>
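
Figures without captions can be counted with a small stack-based scan, sketched below. `FigureChecker` and `figure_report` are hypothetical names; a `<figcaption>` is credited to the innermost open `<figure>`.

```python
from html.parser import HTMLParser


class FigureChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open <figure>: has a <figcaption>?
        self.total = 0
        self.uncaptioned = 0

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.total += 1
            self.stack.append(False)
        elif tag == "figcaption" and self.stack:
            self.stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self.stack:
            if not self.stack.pop():
                self.uncaptioned += 1


def figure_report(html):
    """Return (number of figures, number of figures lacking a figcaption)."""
    checker = FigureChecker()
    checker.feed(html)
    return checker.total, checker.uncaptioned


html = (
    '<figure><img src="sales-chart.png" alt="Bar chart showing quarterly sales">'
    "<figcaption>Figure 1: Sales by quarter.</figcaption></figure>"
    '<figure><img src="logo.png" alt="Logo"></figure>'
)
print(figure_report(html))  # (2, 1)
```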
Effort: Easy (< 1 hour)
Tags: images · figures · captions · semantic · html