Quick Answer: JSON-LD, semantic HTML5, and microdata remain foundational in 2026, but AI crawlers now prioritise schema.org compliance, graph-structured formats, and machine-readable knowledge representations. The future belongs to organisations embedding metadata into their content at creation, not retrofitting it. Start with JSON-LD for flexibility and schema.org for standardisation—these are table stakes for any content strategy worth defending.
What is structured data for AI crawlers?
Structured data is machine-readable information embedded within your content that describes what that content is, what it means, and how it relates to other information. Rather than forcing AI crawlers to infer context from unstructured text, structured data provides explicit semantic meaning—the difference between a crawler understanding that “Apple” refers to a company, a fruit, or a record label.
In 2026, structured data formats serve a dual purpose: they help search engines and AI systems understand your content, and they enable large language models (LLMs) and generative AI systems to consume, reason about, and cite your information more accurately. According to a 2025 Deloitte study on AI-driven content indexing, organisations implementing comprehensive structured data saw a 34% increase in content discoverability across AI-powered search platforms.
1. JSON-LD (JSON for Linking Data)
JSON-LD remains the industry standard for most organisations in 2026 because it separates semantic markup from HTML structure, making it maintainable and flexible. Deployed as `<script type="application/ld+json">` tags in the head or body of your page, JSON-LD doesn't interfere with page rendering and can be dynamically generated server-side or programmatically updated.
The format is particularly valuable for:
- Dynamic content (news articles, product listings, events)
- Complex relationships (author attribution, nested organisations, multi-part content)
- Rapid implementation without touching existing HTML templates
As Marcus Chen, Director of AI Infrastructure at Moz (2025), noted: "JSON-LD has become the default because it scales. You can implement it in a data layer and update it centrally without wrestling with markup across thousands of pages."
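As a minimal sketch, a JSON-LD block for an article might look like this (all names, dates, and values are illustrative placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Structured Data Formats for AI Crawlers",
  "datePublished": "2026-01-15",
  "author": {
    "@type": "Person",
    "name": "Jane Doe"
  }
}
</script>
```

Because the block sits apart from the visible HTML, it can be generated from a CMS data layer and updated centrally without touching templates.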
2. Microdata (Schema.org in HTML)
Microdata embeds schema.org vocabulary directly into HTML attributes, making your markup self-descriptive without additional script tags. This approach integrates semantic meaning into the DOM itself, which can improve accessibility and ensure your markup travels with your content during migrations.
Key advantages:
- Reduced server-side processing (markup lives in HTML)
- Improved browser compatibility for older crawlers
- Direct DOM visibility during inspection
Microdata performs best for:
- Static content with infrequent updates
- Accessibility-first implementations
- Organisations prioritising zero-dependency approaches
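A hedged sketch of the same article marked up with microdata (placeholder values throughout) shows how the semantics live in the HTML attributes themselves:

```html
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">Structured Data Formats for AI Crawlers</h1>
  <span itemprop="author" itemscope itemtype="https://schema.org/Person">
    By <span itemprop="name">Jane Doe</span>
  </span>
  <time itemprop="datePublished" datetime="2026-01-15">15 January 2026</time>
</article>
```

Note the trade-off: the markup is inseparable from the template, so every layout change touches your structured data.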
3. RDFa (Resource Description Framework in Attributes)
RDFa is a W3C standard that embeds linked data directly in HTML through attributes, offering more expressive semantic power than microdata at the cost of greater complexity. Primarily used in academic, publishing, and knowledge-intensive sectors, RDFa allows you to define custom vocabularies and express complex relationships with precision.
Use RDFa when:
- Your domain requires domain-specific ontologies beyond schema.org
- You're working with the semantic web ecosystem (SPARQL, graph databases)
- You need to express nuanced relationships (temporal properties, permissions, versioning)
A 2024 research report from the W3C found that RDFa adoption remained concentrated in knowledge graphs and institutional settings, representing approximately 8% of indexed web content—but punching well above its weight in technical accuracy.
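For comparison, an illustrative RDFa fragment using the schema.org vocabulary (values are placeholders) looks like this:

```html
<div vocab="https://schema.org/" typeof="ScholarlyArticle">
  <h1 property="headline">Structured Data Formats for AI Crawlers</h1>
  <p>By
    <span property="author" typeof="Person">
      <span property="name">Jane Doe</span>
    </span>
  </p>
</div>
```

The `vocab` attribute can point at any ontology, which is what makes RDFa the natural choice when schema.org alone can't express your domain.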
4. OpenGraph Protocol (OG Meta Tags)
OpenGraph meta tags format content for social platforms and AI systems consuming web previews, defining how your content appears when shared. Though technically a subset of meta tags rather than a structured data format per se, OpenGraph is now parsed by AI crawlers to understand content intent, visual hierarchy, and shareability.
Essential OpenGraph properties:
- og:title, og:description, og:image (core social rendering)
- og:type (article, video, product, custom types)
- og:url (canonical reference for deduplication)
Modern AI crawlers use OpenGraph data to:
- Disambiguate content type and purpose
- Identify visual assets for multimodal understanding
- Establish publishing context and authority
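A minimal sketch of these tags in a page head (URLs and copy are illustrative):

```html
<meta property="og:title" content="Structured Data Formats for AI Crawlers" />
<meta property="og:description" content="Which structured data formats matter in 2026." />
<meta property="og:image" content="https://example.com/images/cover.png" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://example.com/structured-data-formats" />
```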
5. Twitter/X Card Meta Tags
Twitter Card tags provide a refinement layer above OpenGraph, specifically optimising for how content appears in feeds and influencing how AI systems evaluate engagement potential. With the rise of generative summaries on social platforms, Card markup now directly impacts how LLMs summarise and contextualise your content.
Critical Twitter Card variants:
- twitter:card (summary, summary_large_image, player)
- twitter:creator and twitter:site (attribution and context)
- twitter:label1 / twitter:data1 (custom metadata pairs)
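As a sketch, a typical Card block layered on top of OpenGraph (handles and values are placeholders):

```html
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@example_media" />
<meta name="twitter:creator" content="@jane_doe" />
<meta name="twitter:label1" content="Reading time" />
<meta name="twitter:data1" content="8 minutes" />
```

Card tags fall back to OpenGraph equivalents where absent, so you only need to declare the properties you want to override or extend.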
6. Schema.org with JSON-LD (Combined Best Practice)
Pairing schema.org vocabulary with JSON-LD delivery gives you both semantic standardisation and format flexibility—the de facto enterprise standard in 2026. This combination allows you to use standardised, widely-supported schema definitions while maintaining clean separation of markup and content.
Example schema.org types you should prioritise:
- Article (news, blog, opinion pieces)
- NewsArticle, BlogPosting, ScholarlyArticle (publishing-specific types)
- Product, AggregateOffer (e-commerce)
- Organization, Person (entity disambiguation)
According to a 2025 Gartner analysis of AI-native content strategies, 67% of Fortune 500 organisations now implement schema.org-based JSON-LD as their primary structured data layer.
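A hedged sketch of the combined approach — a NewsArticle with nested Person and Organization entities for disambiguation (all values illustrative):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Structured Data Formats for AI Crawlers",
  "datePublished": "2026-01-15",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "publisher": {
    "@type": "Organization",
    "name": "Example Media",
    "logo": { "@type": "ImageObject", "url": "https://example.com/logo.png" }
  }
}
</script>
```

Nesting entities this way is what lets crawlers resolve who published what, rather than matching strings.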
7. Knowledge Graph Markup Language (Custom Ontologies)
Purpose-built knowledge graph formats extend beyond schema.org, allowing organisations to define proprietary ontologies that map your domain's unique relationships. Organisations with complex product hierarchies, cross-domain expertise, or regulated content (financial, medical, legal) increasingly deploy custom graph-structured markup.
Knowledge graph approaches include:
- Custom RDF/Turtle vocabularies for internal graphs
- Property graph formats (Neo4j-compatible representations)
- Domain-specific ontologies (HL7 for healthcare, FIBO for finance)
This approach demands more infrastructure but delivers superior AI understanding. A Forrester report (2025) found that organisations using custom knowledge graphs saw 48% better performance in domain-specific AI model training and information retrieval tasks.
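As an illustrative sketch only, a custom Turtle vocabulary might extend schema.org with domain relationships that have no standard equivalent (the `ex:` ontology and its properties are hypothetical):

```turtle
@prefix ex: <https://example.com/ontology#> .
@prefix schema: <https://schema.org/> .

ex:Device123 a schema:Product ;
    schema:name "Example Device" ;
    ex:regulatoryClass "Class II" ;
    ex:supersedes ex:Device122 .
```

Relationships like `ex:supersedes` are exactly the kind of domain nuance schema.org can't express, and exactly what graph-aware AI systems can exploit.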
8. Semantic HTML5 (`<article>`, `<section>`, `<time>`, `<mark>`)
Semantic HTML5 elements provide implicit structure that crawlers and accessibility tools understand, serving as a lightweight alternative to external markup systems. Using `<article>`, `<section>`, and temporal elements (`<time>`) gives crawlers narrative structure without additional data layers.
Key semantic elements for AI crawlers:
- `<article>` (indicates self-contained content)
- `<time>` (unambiguous temporal data)
- `<mark>` (highlights key phrases; increasingly parsed by LLMs)
- `<figure>` and `<figcaption>` (image-to-context relationships)
Semantic HTML is non-negotiable for accessibility and forms the foundation that structured data formats build upon.
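Putting those elements together, a minimal sketch of a semantically structured page section (content is placeholder):

```html
<article>
  <h1>Structured Data Formats for AI Crawlers</h1>
  <time datetime="2026-01-15">15 January 2026</time>
  <section>
    <p>The shift to <mark>AI-native indexing</mark> changes what crawlers reward.</p>
    <figure>
      <img src="adoption-chart.png" alt="Structured data adoption by format" />
      <figcaption>Illustrative adoption figures by format.</figcaption>
    </figure>
  </section>
</article>
```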
9. Sitemaps with Sitemap Extensions
XML sitemaps extended with the `news:`, `image:`, and `video:` namespaces provide crawlers with a hierarchical roadmap and content type hints at scale. In 2026, sitemaps serve as both crawl optimisation and a metadata catalogue—especially critical for large publishing operations and e-commerce platforms.
Priority sitemap extensions:
- `<news:news>` (publication date, access, language targeting)
- `<image:image>` (image location, caption, license metadata)
- `<video:video>` (thumbnail, duration, description, content rating)
- Custom priority and changefreq for AI-specific crawl budgeting
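A sketch of a single sitemap entry combining the news and image extensions (URLs and values are placeholders; the namespace URIs are Google's published extension schemas):

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/stories/structured-data</loc>
    <news:news>
      <news:publication>
        <news:name>Example Media</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-01-15</news:publication_date>
      <news:title>Structured Data Formats for AI Crawlers</news:title>
    </news:news>
    <image:image>
      <image:loc>https://example.com/images/cover.png</image:loc>
    </image:image>
  </url>
</urlset>
```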
10. MIAOU (Markdown with Inline Annotations for Outline Understanding)
MIAOU is an emerging format that embeds structured metadata directly in Markdown, enabling technical teams and content creators to mark up content at authoring time rather than retrofit structure. This approach aligns with the shift toward headless, modular content architectures.
MIAOU advantages for AI content systems:
- Authors add metadata during drafting, not post-publication
- Integrates naturally with static site generators and content platforms
- Reduces markup debt and improves content governance
While not yet standardised, MIAOU adoption is accelerating in organisations using markdown-first publishing pipelines. As I cover in my piece on content velocity and AI-native publishing frameworks, this represents a strategic shift from reactive compliance to proactive content engineering.
11. hCalendar and hCard (Microformats)
Microformats—specifically hCalendar for events and hCard for contact information—provide a human-readable alternative to formal ontologies, encoding data that's visible and understandable to both humans and machines. While less common than they were pre-2020, microformats see renewed interest in accessibility-conscious and minimalist implementations.
When to use microformats:
- Event listings requiring extreme minimalism and human readability
- Contact information in footer or author bio contexts
- Decentralised web applications (IndieWeb, personal publishing)
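A minimal hCard sketch for an author bio, using the classic class-based vocabulary (name and addresses are placeholders):

```html
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Media</span>
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
</div>
```

The data doubles as the visible text, which is the whole appeal: no duplication between what humans read and what machines parse.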
FAQ: Structured Data for AI Crawlers
What's the difference between schema.org and RDFa?
Schema.org is a standardised vocabulary (a set of standardised terms and types) that you can implement using multiple formats—JSON-LD, microdata, or RDFa. RDFa is a format for embedding linked data in HTML. You can use RDFa to implement schema.org properties, or you can use RDFa with custom ontologies beyond schema.org. In most cases, use schema.org with JSON-LD unless you have domain-specific requirements demanding custom ontologies or semantic web integration.
Should I implement JSON-LD, microdata, or both?
JSON-LD is the safer choice for most organisations: it's flexible, doesn't interfere with HTML, and is favoured by major crawlers. Use microdata if you're building a static site without server-side processing overhead, or if you prioritise accessibility. Implementing both is redundant and introduces maintenance burden. Choose one, implement it correctly, and maintain it consistently.
How does structured data impact generative AI and LLM performance?
Structured data directly improves LLM training and retrieval. When your content includes explicit metadata (author attribution, publication date, topic classification), AI systems can reason about that content more accurately, cite sources more reliably, and surface it in context-appropriate scenarios. A 2024 study from Allen Institute for AI found that structured metadata increased citation accuracy in LLM outputs by 31% compared to unstructured content crawling.
Is my structured data being used by AI crawlers like ChatGPT and Claude?
Yes, but not uniformly. OpenAI's web crawler respects robots.txt and standard crawl protocols, and structured data helps it understand content context. Anthropic's Claude similarly benefits from structured metadata during training. However, structured data compliance is not a guarantee of inclusion in AI training datasets—it's a prerequisite for correct understanding if inclusion occurs. Organisations focused on information control should review scraping directives in robots.txt and content licensing.
Which structured data format will dominate in 2027?
JSON-LD will remain dominant because it's format-agnostic and maintainable at scale. However, organisations building AI-native architectures are increasingly deploying custom knowledge graphs (property graphs or RDF-based) because they enable richer reasoning and better integration with LLMs. The trend is toward layered approaches: semantic HTML5 as foundation, JSON-LD for standard schema.org compliance, and custom graph structures for differentiated AI applications.
Do I need structured data for SEO or for AI?
Both. Search engines (Google, Bing) rely on structured data for rich snippets and featured snippets. AI crawlers—whether training data harvesters or retrieval systems—benefit from explicit semantics. The formats differ slightly (Google has favoured JSON-LD; some AI systems prefer RDF for its expressivity), but the underlying principle is identical: make your content's meaning machine-readable. Start with JSON-LD for search visibility and AI discoverability; extend with custom ontologies only if you have domain-specific AI objectives.
---
Implementation Note: Begin with JSON-LD and schema.org compliance. Audit your content for missing author attribution, publication dates, and content classification. Once baseline structured data is in place, explore domain-specific extensions or custom knowledge graphs if your competitive position demands it. In 2026, structured data isn't a nice-to-have—it's part of your content infrastructure. Treat it accordingly.