
Train your agent by crawling your website

Point your agent at any website and it automatically crawls and learns from the content.

Setting up the crawler

  1. Go to Integrations > Website (or use Quick Start when creating an agent)

  2. Enter your website URL (e.g., https://www.yourcompany.com)

  3. Optionally configure path filters

  4. Click Connect to start crawling

Prerequisites

  • A publicly accessible website (pages behind a login can't be crawled)

  • A robots.txt that permits crawling — the crawler respects robots.txt rules
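To see what "respects robots.txt" means in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The rules shown are an invented example, not any real site's file; a conforming crawler makes this check before fetching each URL.

```python
from urllib.robotparser import RobotFileParser

# Parse example robots.txt rules (hypothetical; substitute your site's actual file)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# Check each URL before fetching it
print(rp.can_fetch("*", "https://www.yourcompany.com/help"))    # True
print(rp.can_fetch("*", "https://www.yourcompany.com/admin/"))  # False
```

If the crawl is skipping pages unexpectedly, fetching `https://www.yourcompany.com/robots.txt` and checking it this way is a quick diagnostic.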

How it works

The crawler:

  1. Starts at the URL you provide

  2. Follows links to discover pages

  3. Extracts text content from each page

  4. Indexes the content for your agent to search
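The four steps above amount to a breadth-first traversal of your site's link graph. This sketch illustrates the idea with an in-memory link map standing in for real page fetches; the function name and page-limit parameter are illustrative, not the product's actual implementation.

```python
from collections import deque

def crawl(start, links, max_pages):
    """Breadth-first discovery: start at one URL, follow links, stop at the page limit.
    `links` maps each page to the pages it links to (a stand-in for real fetching)."""
    seen = {start}
    queue = deque([start])
    indexed = []
    while queue and len(indexed) < max_pages:
        page = queue.popleft()
        indexed.append(page)  # in a real crawler: extract the page's text and index it
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return indexed

site = {
    "/": ["/help", "/docs"],
    "/help": ["/help/faq"],
    "/docs": [],
}
print(crawl("/", site, max_pages=100))  # ['/', '/help', '/docs', '/help/faq']
```

Note that a page reachable from nowhere never enters the queue — which is why orphan pages aren't found (see Troubleshooting below).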

Crawl limits

  Plan         Max pages
  Trial        100
  Paid plans   500

Configuring path filters

Use include and exclude paths to control what gets crawled:

Include paths — Only crawl pages matching these paths

  • Example: /help, /docs, /support

Exclude paths — Skip pages matching these paths

  • Example: /blog, /careers, /pricing

This is useful for large websites where you only want your agent to learn from specific sections.
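A filter like this can be sketched as a simple prefix check. The exact matching rules the crawler uses (prefix vs. glob patterns) aren't specified here, so treat this as an assumption for illustration; the function name is hypothetical.

```python
def allowed(path, include=None, exclude=None):
    """Hypothetical prefix-based path filter: a page is crawled only if it matches
    no exclude path and, when include paths are set, matches at least one of them."""
    if exclude and any(path.startswith(p) for p in exclude):
        return False
    if include:
        return any(path.startswith(p) for p in include)
    return True

print(allowed("/docs/setup", include=["/help", "/docs"]))      # True
print(allowed("/blog/post-1", exclude=["/blog", "/careers"]))  # False
```

Exclude paths win over include paths here, which matches the usual expectation: anything explicitly excluded is skipped even if it also matches an include path.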

Sync frequency and updates

  • The crawler periodically re-crawls your site to pick up changes

  • You can manually trigger a re-crawl from the integration settings

  • New pages linked from existing pages are discovered automatically

Tips

Use path filters on large sites. If your site has thousands of pages, focus the crawler on help and documentation sections.

Crawl your help center. If your help center is on a subdomain (like help.yourcompany.com), enter that URL directly.

Combine with other sources. The website crawler is great for getting started quickly, but add help center articles, documents, and past tickets for comprehensive coverage.

Check what got crawled. After the crawl completes, browse the indexed pages in your Files tab to verify the right content was picked up.

Troubleshooting

Crawler not finding pages?

  • Make sure pages are publicly accessible (not behind a login)

  • Check that robots.txt isn't blocking the crawler

  • Verify that pages are linked from the starting URL (orphan pages won't be found)

Too many irrelevant pages crawled?

  • Add exclude paths for sections you don't want (blog, careers, etc.)

  • Use include paths to restrict crawling to specific sections
