# Website

> Point your agent at any website and it automatically crawls the site and learns from its content.

## Setting up the crawler

1. Go to **Integrations > Website** (or use Quick Start when creating an agent)
2. Enter your website URL (e.g., `https://www.yourcompany.com`)
3. Optionally configure path filters
4. Click **Connect** to start crawling

### Prerequisites

* A publicly accessible website (pages behind a login can't be crawled)
* A `robots.txt` file that permits crawling (the crawler respects `robots.txt` rules)
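To see whether your `robots.txt` would block a crawler, you can check a URL with Python's standard-library robots parser. This is a quick local sanity check, not the product's own logic; the rules shown are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules (illustrative, not your site's actual file)
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# /docs is allowed; /private/... is blocked by the Disallow rule
print(rp.can_fetch("*", "https://www.yourcompany.com/docs"))
print(rp.can_fetch("*", "https://www.yourcompany.com/private/page"))
```

In practice you would fetch your live `https://www.yourcompany.com/robots.txt` (via `rp.set_url(...)` and `rp.read()`) instead of parsing an inline string.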

## How it works

The crawler:

1. Starts at the URL you provide
2. Follows links to discover pages
3. Extracts text content from each page
4. Indexes the content for your agent to search
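The discovery process above is essentially a breadth-first crawl with a page cap. The sketch below illustrates the idea using only the standard library; it is a simplified model, not the product's actual crawler (the `fetch` callable is injected so the example needs no network access):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute same-site URLs linked from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(base_url).netloc
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [url for url in absolute if urlparse(url).netloc == host]

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: start at one URL, follow links, stop at max_pages.

    `fetch` is any callable that returns the HTML for a URL.
    max_pages mirrors the plan limits (100 on Trial, 500 on paid plans).
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages[url] = fetch(url)          # step 3: extract/store page content
        for link in extract_links(url, pages[url]):
            if link not in seen:         # step 2: discover new same-site pages
                seen.add(link)
                queue.append(link)
    return pages
```

Note that only links reachable from the starting URL are ever queued, which is why orphan pages (see Troubleshooting below) are never found.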

### Crawl limits

| Plan           | Max pages |
| -------------- | --------- |
| **Trial**      | 100 pages |
| **Paid plans** | 500 pages |

## Configuring path filters

Use include and exclude paths to control what gets crawled:

**Include paths** — Only crawl pages matching these paths

* Example: `/help`, `/docs`, `/support`

**Exclude paths** — Skip pages matching these paths

* Example: `/blog`, `/careers`, `/pricing`

This is useful for large websites where you only want your agent to learn from specific sections.
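The filter rules described above can be modeled as a small predicate: exclusions win, and when include paths are set, only matching pages pass. Prefix matching is an assumption here; the product may match paths differently:

```python
def should_crawl(path, include=None, exclude=None):
    """Decide whether a URL path passes include/exclude filters.

    Assumes simple prefix matching, e.g. "/help" also matches
    "/help/getting-started". Exclude rules take priority.
    """
    if exclude and any(path.startswith(p) for p in exclude):
        return False
    if include:
        return any(path.startswith(p) for p in include)
    return True  # no include list: everything not excluded is crawled
```

For example, with `include=["/help", "/docs"]` a page at `/help/getting-started` is crawled while `/blog/news` is skipped.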

## Sync frequency and updates

* The crawler periodically re-crawls your site to pick up changes
* You can manually trigger a re-crawl from the integration settings
* New pages linked from existing pages are discovered automatically

## Tips

**Use path filters on large sites.** If your site has thousands of pages, focus the crawler on help and documentation sections.

**Crawl your help center.** If your help center is on a subdomain (like `help.yourcompany.com`), enter that URL directly.

**Combine with other sources.** The website crawler is great for getting started quickly, but add help center articles, documents, and past tickets for comprehensive coverage.

**Check what got crawled.** After the crawl completes, browse the indexed pages in your Files tab to verify the right content was picked up.

## Troubleshooting

**Crawler not finding pages?**

* Make sure pages are publicly accessible (not behind a login)
* Check that `robots.txt` isn't blocking the crawler
* Verify that pages are linked from the starting URL (orphan pages won't be found)

**Too many irrelevant pages crawled?**

* Add exclude paths for sections you don't want (blog, careers, etc.)
* Use include paths to restrict crawling to specific sections
