llms.txt, Schema & AI Crawlers: The Technical Basics

Three technical jobs to complete for AI visibility: unblock AI crawlers in robots.txt, add Product and FAQPage schema, and understand what llms.txt actually does and doesn't do.

TL;DR

Check robots.txt for AI crawler blocks, add Product and FAQPage schema markup, then add an llms.txt—it won't help AI Overviews but it matters for agentic tools like Cursor and Claude Code.

Technical AI visibility work breaks into three jobs. Job one can take as little as ten minutes. Job two takes a few hours. Job three is one text file. Together they remove the infrastructure barriers that prevent AI systems from reading and citing your content accurately.

Why Does AI Crawler Access Matter Before Anything Else?

Before content quality, before schema markup, before llms.txt — confirm that AI crawlers can actually reach your site. An AI that cannot crawl your pages cannot learn from them, cannot cite them in browsing-mode answers, and cannot retrieve them for live queries.

The three-step check:

Step 1: Read your robots.txt. Visit yourdomain.com/robots.txt and look for Disallow rules targeting any of these user-agent tokens: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity AI), and Google-Extended (Google's control token for AI training use, honoured by Google's existing crawlers rather than sent by a separate bot). A Disallow: / rule for any of these tokens tells that crawler to stay away from your entire site. Remove the Disallow rules for the crawlers you want to allow.

Step 2: Check your CDN and hosting provider. This is the step most brands skip, and it is often where the real block lives. Cloudflare, Fastly, and many managed hosting platforms introduced AI-crawler blocking features in 2023 and 2024. Some enabled these blocks by default for all customers. Your robots.txt can be perfectly clean while a WAF rule at the CDN level silently returns 403 errors to every AI crawler request. Log into your CDN dashboard, check bot-management and security rules, and confirm that AI crawler user-agents are explicitly permitted.

Step 3: Verify with a crawl log sample. If you have server-side access logs, filter for GPTBot and ClaudeBot user-agent strings in the past 30 days. Zero requests from either — on a site with meaningful content — suggests a block at the infrastructure level. Request logs from your hosting provider if you do not have direct access.
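The log check in Step 3 can be sketched as a short script. This is a minimal sketch, assuming your access logs use the common Combined Log Format; the function name count_ai_requests and the sample log lines are made up for illustration, and real log layouts vary by server and host.

```python
import re
from collections import Counter

# User-agent substrings to look for; both names come from Step 3.
AI_AGENTS = ("GPTBot", "ClaudeBot")

# Assumed Combined Log Format: quoted request line, status, size,
# quoted referrer, quoted user-agent.
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def count_ai_requests(lines):
    """Return a Counter keyed by (agent, status code) for AI crawler hits."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        user_agent = match["ua"].lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                counts[(agent, int(match["status"]))] += 1
    return counts


# Hypothetical sample lines; replace with lines read from your own log file.
sample = [
    '203.0.113.5 - - [10/Mar/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '203.0.113.6 - - [10/Mar/2025:10:00:01 +0000] "GET /p HTTP/1.1" 403 0 "-" "Mozilla/5.0; compatible; ClaudeBot/1.0"',
    '203.0.113.7 - - [10/Mar/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
counts = count_ai_requests(sample)
for (agent, status), hits in sorted(counts.items()):
    print(agent, status, hits)
```

An empty result over 30 days of logs matches the "zero requests" signal described in Step 3. A run of 403 responses for these agents points back to the CDN or firewall rules in Step 2.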

This job takes ten minutes for a simple site and an afternoon for a complex one. No other AI visibility work matters until your pages are reachable.
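For a repeatable version of the Step 1 check, Python's standard urllib.robotparser module can evaluate a robots.txt file against each AI user-agent token. The sketch below is illustrative only: the function name blocked_agents, the example.com URL, and the sample rules are assumptions, and the script reads robots.txt alone, so it cannot detect a block at the CDN or firewall level.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in Step 1.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def blocked_agents(robots_txt, url="https://example.com/"):
    """Return the AI agents that robots_txt disallows from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt content that blocks one crawler and allows the rest.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(sample))
```

To test a live site instead of a pasted string, the same parser can load the file itself with set_url() followed by read().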

What Schema Markup Should You Add, and in What Order?

Schema markup is structured data that tells machines — search engines and AI systems — what type of content a page contains and what the key facts are. It does not replace good prose content, but it provides a machine-readable summary that AI can use when the natural language content is ambiguous or missing.

The priority order for most brands:

1. Product schema on product pages. This passes price, currency, availability status, SKU, and aggregate rating score to AI systems in a standardised format. A product page without Product schema forces AI to parse price from natural language, which introduces error. With Product schema, the price is a structured fact. If you are on Shopify, your platform likely adds baseline Product schema automatically — run the Google Rich Results Test on three to five key product pages to confirm what is actually present, because Shopify's default schema often omits fields like aggregateRating and shippingDetails.

2. FAQPage schema on pages with FAQ sections. FAQPage schema explicitly maps question-and-answer pairs in a machine-readable format. AI models can extract these pairs directly, which makes them far more likely to be cited when a buyer asks that exact question. Add FAQPage schema to every page that has a FAQ section — product pages, buying guides, and standalone FAQ pages.

3. Article schema on blog posts and guides. Article schema passes publication date, author, and article type. Publication date is particularly useful for AI systems that weight recency — a guide without a structured publication date looks undated; one with Article schema and a recent datePublished is clearly current.

4. Organization schema on your homepage and About page. Organization schema passes your official name, URL, logo, contact details, and social profiles in a standardised format. It is what AI uses to resolve entity ambiguity — to confirm that "Brand X" the AI knows about is the same entity as brandx.com. For international brands with names in multiple languages, this schema includes an alternateName field that is essential for entity consistency.
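The first two schema types in the list above can be generated as JSON-LD, the format usually embedded in a script tag of type application/ld+json. The following is a minimal sketch, not a complete implementation: the helper names, the product details, and the FAQ text are all made up, and only a small subset of the schema.org Product and FAQPage properties is shown.

```python
import json


def product_jsonld(name, sku, price, currency, in_stock, rating=None, review_count=None):
    """Build a small Product object with one Offer (subset of schema.org fields)."""
    availability = "InStock" if in_stock else "OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }
    # Only emit a rating when there are real reviews behind it.
    if rating is not None and review_count:
        data["aggregateRating"] = {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        }
    return data


def faq_jsonld(pairs):
    """Build a FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {"@type": "Question", "name": q, "acceptedAnswer": {"@type": "Answer", "text": a}}
            for q, a in pairs
        ],
    }


def script_tag(data):
    """Wrap JSON-LD in a script tag, escaping '</' so text cannot close the tag early."""
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">{body}</script>'


# Hypothetical example values.
product = product_jsonld("Example Kettle", "KET-001", 49.5, "GBP", True, rating=4.6, review_count=128)
faq = faq_jsonld([("Is the kettle dishwasher safe?", "No. Wipe it clean with a damp cloth.")])
print(script_tag(product))
print(script_tag(faq))
```

Whatever generates the markup, the values should match what the visible page says, and the output is worth validating with the Google Rich Results Test mentioned above.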

What Does llms.txt Actually Do?

The llms.txt standard — a plain text file at yourdomain.com/llms.txt — was proposed in late 2024 as an AI equivalent of robots.txt: a file that tells AI systems what your site is, what content is most important, and how to navigate it.

Understanding what it does and does not do is essential for setting accurate expectations.

What it does not do: Google's official guidance, published May 15 2026, explicitly states that AI Overviews and AI Mode do not use llms.txt. These systems rely on Googlebot's crawl of your site, process the HTML of your pages, and make citation decisions based on that content, not on a summary file. The same is true of the major AI chatbots: ChatGPT, Claude, and Perplexity fetch HTML directly when retrieving pages for live queries. They do not check llms.txt before deciding whether to cite you.

What it does: llms.txt works well in the agentic context — AI tools that need to understand a developer's codebase or documentation. Cursor, GitHub Copilot, Claude Code, and similar agentic tools regularly read llms.txt to understand what a repository or documentation site contains before generating code against it. If your product has a developer API, SDK, or technical documentation, llms.txt helps agentic tools navigate that documentation correctly. For consumer-facing product sites, the benefit is more limited but still present: it functions as an official self-introduction that any AI reading your domain can use.

The cost-benefit calculation: An llms.txt file takes one to two hours to write well. The file itself is a few kilobytes. For a developer product, the benefit is clear and direct. For a consumer product, treat it as a low-cost signal of good faith — a concise, authoritative statement of who you are and what you sell, in a format that AI systems will understand immediately if they do read it.
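To make the file concrete, here is a rough sketch that renders one. It assumes the layout described in the public llms.txt proposal as commonly understood: a markdown file with a title heading, a one-line blockquote summary, and sections of annotated links. The function name render_llms_txt, the store name, and every URL are hypothetical.

```python
def render_llms_txt(name, summary, sections):
    """Render llms.txt text: title, blockquote summary, then link sections.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical content for an imaginary store with a developer API.
text = render_llms_txt(
    "Example Store",
    "Example Store sells kitchen appliances and offers a public orders API.",
    {
        "Docs": [
            ("API reference", "https://example.com/docs/api.md", "Endpoints and authentication"),
            ("Quickstart", "https://example.com/docs/quickstart.md", "First request in five minutes"),
        ],
        "Policies": [
            ("Returns", "https://example.com/returns", "30-day return policy"),
        ],
    },
)
print(text)
```

Save the output as llms.txt at the root of the domain. Keeping it short and limited to the pages that matter most is in line with the one-to-two-hour effort described above.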

What Does the Complete Technical Checklist Look Like?

Six items cover the essential technical foundation:

  1. robots.txt — No Disallow rules for GPTBot, ClaudeBot, PerplexityBot, or Google-Extended.
  2. Firewall/CDN — Confirmed with your provider that AI crawlers are not blocked at the infrastructure level.
  3. Product schema — On all product pages; validated with the Google Rich Results Test.
  4. FAQPage schema — On all pages with FAQ sections; question-answer pairs structured correctly.
  5. Organization schema — On homepage and About page; includes official name, alternateName if applicable, URL, and social profiles.
  6. All key information as HTML text — Specifications, pricing, key claims, and FAQs must exist as crawlable text, not inside images or PDFs.

The last item is the most commonly violated. Beautiful long-scroll product images with specifications overlaid as text in the design are invisible to every crawler — AI, Google, and assistive technology alike. All key information must exist as HTML text somewhere on the page, even if the visual presentation uses images for aesthetics.
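One way to spot-check this is to strip a page down to its text and confirm that each key fact survives. The sketch below uses Python's standard html.parser module; the class and function names, the sample page, and the list of facts are all illustrative assumptions. It ignores script and style content and works on the HTML you pass in, so it does not see text that only appears after client-side rendering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside script or style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def missing_facts(html, facts):
    """Return the facts that do not appear in the page's text (case-insensitive)."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse whitespace so facts split across line breaks still match.
    text = " ".join(" ".join(extractor.parts).split()).lower()
    return [fact for fact in facts if fact.lower() not in text]


# Hypothetical page: the price is HTML text, the capacity is only in an image and a script.
page = """
<html><body>
<h1>Example Kettle</h1>
<img src="specs.png" alt="">
<p>Price: 49.50 GBP</p>
<script>var capacity = "1.7 litres";</script>
</body></html>
"""
print(missing_facts(page, ["49.50 GBP", "1.7 litres"]))
```

Any fact the function reports as missing is a candidate for adding to the page as plain HTML text alongside the image.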

These six items are not a comprehensive GEO programme. They are the foundation that makes all other work possible. Off-site authority building, content strategy, and FAQ development all produce less impact if the technical layer is blocking crawlers or hiding content from machines.

Frequently Asked Questions

Does llms.txt help with ChatGPT or Google AI Overviews?

Not directly. Google's official May 2026 guidance explicitly states AI Overviews and AI Mode do not use llms.txt. ChatGPT and most AI chatbots fetch HTML directly rather than reading llms.txt. The file is most useful for agentic developer tools like Cursor, GitHub Copilot, and Claude Code.

Which schema type should I add first?

Product schema on your product pages, if you sell products. It passes price, availability, and aggregate rating to search and AI systems in a machine-readable format. FAQPage schema is a close second—it tells AI exactly which questions your page answers.

How do I check if AI crawlers are being blocked?

Check your robots.txt file at yourdomain.com/robots.txt for Disallow rules referencing GPTBot, ClaudeBot, PerplexityBot, or Google-Extended. Also contact your CDN or hosting provider—many WAF and bot-protection rulesets block AI crawlers by default at the infrastructure level without a robots.txt entry.

Does Shopify add schema automatically?

Shopify adds basic Product schema on product pages, but it is often incomplete—missing aggregate rating, shipping details, or return policy. Run the Google Rich Results Test on your key product pages to see exactly what schema is present and what is missing.

AnswerAtlas is an independent AI visibility intelligence platform. It is not affiliated with or endorsed by OpenAI, Google, Anthropic, or any AI platform mentioned on this page. All trademarks belong to their respective owners.