Knowledge Sources
Connecting your website
Crawl your website so the assistant answers from your real pages — and the settings that control what gets indexed.
The website source is the fastest way to give your assistant a working knowledge of your business. Paste a URL, pick a few options, and the assistant has hundreds of pages of context within minutes.
Adding a website source
- Open your assistant and go to Sources → Website.
- Click Add website.
- Paste the URL you want to start from. This is usually your homepage, but it can be a sub-section like https://yourdomain.com/help.
- Click Crawl.
Gabbex follows internal links from the starting URL and reads each page it finds, up to the page limit on your plan. Status updates appear in the source list as the crawl progresses.
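The core of that behavior — collect the links on a page and keep only the ones on the same domain — can be sketched with Python's standard library. This is an illustration of the general technique, not Gabbex's actual implementation; the `internal_links` function and its names are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(start_url, html):
    """Resolve each link against the starting URL and keep only
    those on the same domain -- off-site links are ignored."""
    domain = urlparse(start_url).netloc
    parser = LinkExtractor()
    parser.feed(html)
    seen = set()
    for href in parser.links:
        absolute = urljoin(start_url, href)
        if urlparse(absolute).netloc == domain:
            seen.add(absolute)
    return sorted(seen)
```

A real crawler repeats this breadth-first — fetch the starting URL, extract its internal links, fetch those, and so on — stopping at the plan's page limit.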
What gets included and what does not
The crawler is built to be respectful and efficient:
- It follows links on the same domain only. Links to other websites are ignored.
- It respects robots.txt. Pages that are disallowed for crawlers are skipped.
- It skips obvious non-content URLs like images, PDFs (use the Files source for those), and asset files.
- It strips navigation, headers, and footers so the index focuses on real content, not boilerplate.
If a page is missing after a crawl, the most common reasons are: the page is not linked from anywhere reachable from the starting URL, it requires login, it is blocked by robots.txt, or it loads its content entirely from JavaScript after the page renders.
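If you suspect robots.txt is the culprit, you can check a specific URL against your own rules with Python's standard `urllib.robotparser` — a quick local diagnostic, independent of any particular crawler:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text permits
    `agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/"
allowed_by_robots(rules, "https://yourdomain.com/help")            # allowed
allowed_by_robots(rules, "https://yourdomain.com/private/notes")   # blocked
```

Paste the contents of your live robots.txt into `rules` and test the exact URL that went missing from the crawl.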
Re-crawling after a change
Website content changes. When you update your pricing page, add a new product, or rewrite your returns policy, re-crawl so the assistant picks up the change.
From the website source list, click Re-crawl on the row you want to refresh. You can re-crawl as often as your plan allows. Most teams re-crawl weekly or after any meaningful content change.
Excluding pages
If your starting URL pulls in pages you do not want — old blog posts, a careers section, a legal archive — you have two options:
- Use a more specific starting URL. Instead of the homepage, start the crawl from /help or /products so only the relevant subtree is indexed.
- Add an exclusion pattern. From the source settings, add URL patterns to skip. Patterns support simple wildcards.
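The exact pattern syntax is defined in the source settings, but "simple wildcards" usually means shell-style globbing, where * matches any run of characters. As a rough mental model (this is an illustration, not Gabbex's matcher), exclusion works like:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_excluded(url, patterns):
    """True if the URL's path matches any exclusion pattern.
    Patterns use shell-style wildcards: * matches any characters."""
    path = urlparse(url).path
    return any(fnmatch(path, p) for p in patterns)

patterns = ["/blog/*", "/careers*"]
is_excluded("https://yourdomain.com/blog/2019-post", patterns)   # excluded
is_excluded("https://yourdomain.com/help/returns", patterns)     # kept
```

Note that matching is against the URL path, so a pattern like /blog/* skips the whole blog subtree regardless of query strings.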
Tips for better answers from a website source
- Make sure your most important pages are linked. If a page is not reachable by clicking from the starting URL, the crawler will not see it.
- Avoid JavaScript-only content. If a page is empty until React or another framework renders it client-side, the crawler may see an empty shell. Server-rendered content always indexes best.
- Cut the fluff. Marketing copy with no concrete details is worse than no source at all. Replace “We pride ourselves on great service” with “Our support team replies within 4 business hours, Monday to Friday.”
- Use a Q&A entry to override. If the assistant keeps getting one specific question wrong, the fastest fix is a Q&A entry with the exact answer you want. Q&A always wins over a noisy crawl.
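A quick way to check the JavaScript-only problem from the tips above is to strip the tags from a page's raw HTML and see how much visible text is left. This sketch is a rough heuristic, and the 200-character threshold is an arbitrary illustration, not a Gabbex setting:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulates visible text, ignoring <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def looks_like_empty_shell(html, min_chars=200):
    """Heuristic: if the static HTML carries almost no visible text,
    the page probably renders client-side and may index poorly."""
    parser = TextCollector()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) < min_chars
```

Fetch the page with curl (or View Source in your browser) and run the HTML through this check: an empty `<div id="root">` plus script tags is the classic empty-shell signature.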
Common issues
- “The crawler found zero pages.” Your starting URL might block crawlers, redirect off-site, or render entirely in JavaScript. Try a sub-page that you know contains static text.
- “It indexed pages I did not want.” Use a more specific starting URL or add exclusion patterns.
- “The answer is out of date.” Re-crawl. The assistant only knows what was true at the time of the last crawl.
Next steps
- Building a Q&A list — for the questions you want exactly right.
- Uploading PDFs and files — for content that lives in PDFs rather than on the web.