# Websites

Alhena AI supports crawling and indexing content from publicly accessible websites to answer queries. This includes:

* Landing pages
* Sitemaps
* Product pages
* Help articles
* Notion docs
* Support articles
* Developer docs
* Zendesk support articles
* Links to CSV files hosted on public cloud storage

For each website link, there are two crawling modes:

**Crawl multiple pages**: In a multi-page crawl, Alhena AI discovers child pages and keeps crawling as long as each child page shares the root path of the parent URL. We crawl up to 5,000 pages per URL. If you have specific needs or require crawling more than 5,000 pages, [contact support](/docs/troubleshooting/troubleshooting.md#contact-support). For sitemaps, choose multi-page crawl, as it will also crawl the child pages listed in the sitemap.

**Crawl single page**: In a single-page crawl, we crawl only the page at the given URL.

<figure><img src="/files/XmKWHTRHeLl8GlXDMmgH" alt=""><figcaption><p>Alhena AI Website / URL Crawling options</p></figcaption></figure>
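
The scoping rule for multi-page crawls can be pictured with a short sketch. This is a simplified illustration of the rule described above (the function name is hypothetical), not Alhena's actual crawler:

```python
from urllib.parse import urlparse

def shares_root_path(parent_url: str, candidate_url: str) -> bool:
    """Illustrative in-scope check for a multi-page crawl: a child page
    qualifies only if it is on the same host and under the same root
    path as the parent URL. A simplified sketch, not Alhena's crawler."""
    parent, candidate = urlparse(parent_url), urlparse(candidate_url)
    if parent.netloc != candidate.netloc:
        return False
    # Normalize so "/docs" and "/docs/" are treated the same.
    root = parent.path.rstrip("/") + "/"
    return (candidate.path.rstrip("/") + "/").startswith(root)

print(shares_root_path("https://example.com/docs", "https://example.com/docs/setup"))  # True
print(shares_root_path("https://example.com/docs", "https://example.com/blog/post"))   # False
```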

## Automatic Site Metadata Discovery

When you add a website URL for multi-page crawling, Alhena automatically discovers structured metadata files from your domain to improve coverage. No extra configuration is needed.

### What gets discovered

| File            | Purpose                                                                           |
| --------------- | --------------------------------------------------------------------------------- |
| `robots.txt`    | Reads `Sitemap:` directives to find your sitemaps                                 |
| `sitemap.xml`   | Discovers all pages on your site (used as fallback if not listed in robots.txt)   |
| `llms.txt`      | A curated file of your site's most important content, designed for AI consumption |
| `llms-full.txt` | Your entire site content in a single AI-ready document                            |
| `llms-ctx.txt`  | An expanded version of llms.txt with linked content included inline               |

This means you no longer need to manually add your sitemap URL — Alhena finds and crawls it automatically.
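
To make the discovery order concrete, here is a rough sketch of how such a pass could work. It only illustrates the behavior in the table above under standard file locations; it is not Alhena's internal code, and any HTTP client would do in place of `requests`:

```python
from urllib.parse import urljoin

import requests  # assumed HTTP client; any client with GET works

def discover_site_metadata(base_url: str) -> dict:
    """Rough sketch of the discovery pass described above:
    robots.txt -> sitemap.xml fallback -> llms.txt variants."""
    found = {"sitemaps": [], "llms_files": []}

    # 1. robots.txt: collect any `Sitemap:` directives.
    resp = requests.get(urljoin(base_url, "/robots.txt"), timeout=10)
    if resp.ok:
        for line in resp.text.splitlines():
            if line.lower().startswith("sitemap:"):
                found["sitemaps"].append(line.split(":", 1)[1].strip())

    # 2. Fallback: the conventional sitemap location, if robots.txt listed none.
    if not found["sitemaps"]:
        sitemap_url = urljoin(base_url, "/sitemap.xml")
        resp = requests.get(sitemap_url, timeout=10)
        if resp.ok and "xml" in resp.headers.get("Content-Type", ""):
            found["sitemaps"].append(sitemap_url)

    # 3. llms.txt variants live at the domain root and must not return HTML.
    for name in ("llms.txt", "llms-full.txt", "llms-ctx.txt"):
        file_url = urljoin(base_url, "/" + name)
        resp = requests.get(file_url, timeout=10)
        if resp.ok and "html" not in resp.headers.get("Content-Type", ""):
            found["llms_files"].append(file_url)

    return found
```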

### llms.txt

[llms.txt](https://llmstxt.org/) is a growing web standard where site owners publish a curated markdown file at `/llms.txt` that describes their site's most important content for AI systems.

When Alhena finds an `llms.txt` file on your domain:

* The file content itself (such as inline FAQs or product descriptions) is ingested as training data
* All links in the file are extracted and added as pages to crawl
* If `llms-full.txt` or `llms-ctx.txt` exist, they are ingested as single training documents

If you maintain the website being trained, publishing an `llms.txt` file is one of the best ways to ensure Alhena learns from your most important content.
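
As a reference point, a minimal `llms.txt` following the convention at [llmstxt.org](https://llmstxt.org/) might look like this (all names and URLs below are placeholders):

```
# Acme Docs

> Acme builds widgets. This file points AI systems at our most important pages.

## Documentation

- [Getting started](https://acme.example.com/docs/getting-started): installation and first steps
- [API reference](https://acme.example.com/docs/api): endpoint reference

## Support

- [FAQ](https://acme.example.com/support/faq): answers to common questions
```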

### Troubleshooting discovery

* **Sitemap not found?** Verify the file is accessible at `https://yoursite.com/sitemap.xml` or referenced in your `robots.txt`. Files that return an HTML page instead of XML will be skipped (a quick way to check is sketched below).
* **llms.txt not picked up?** The file must be at the domain root (`https://yoursite.com/llms.txt`) and must not return an HTML response.
* **Duplicate URLs?** Alhena deduplicates automatically. If you already added a sitemap URL manually, it won't be crawled twice.
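
If you are unsure whether a file will be accepted, a quick header check shows what the server actually returns. This is a generic diagnostic sketch (the URL is a placeholder), not an Alhena tool:

```python
import requests

url = "https://yoursite.com/sitemap.xml"  # placeholder: substitute the file to test
resp = requests.get(url, timeout=10)
print(resp.status_code, resp.headers.get("Content-Type", ""))

# A sitemap should report an XML type such as "application/xml" or "text/xml",
# and llms.txt should report plain text. A "text/html" response usually means
# the server served an error or login page instead of the file, so it would
# be skipped during discovery.
```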


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://alhena.gitbook.io/docs/ai-configuration/data-sources/websites.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
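
For example, a minimal request in Python (the question text is illustrative; `requests` URL-encodes the `ask` parameter):

```python
import requests

page = "https://alhena.gitbook.io/docs/ai-configuration/data-sources/websites.md"
question = "How many pages are crawled per URL by default?"  # illustrative question

resp = requests.get(page, params={"ask": question}, timeout=30)
print(resp.text)  # a direct answer plus relevant excerpts and sources
```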
