Create Web Crawl Feed

User Intent

"I want to crawl a website and ingest all pages for search and AI interactions"

Operation

  • SDK Method: graphlit.createFeed() with web configuration

  • GraphQL: createFeed mutation

  • Entity Type: Feed

  • Common Use Cases: Website crawling, documentation ingestion, site mirroring, web scraping

TypeScript (Canonical)

import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

const response = await graphlit.createFeed({
  name: 'Company Documentation',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/archive/'],
  },
});

const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);

// Poll for crawl completion
while (true) {
  const status = await graphlit.isFeedDone(feedId);
  if (status.isFeedDone.result) {
    break;
  }

  console.log('Still crawling website...');
  await new Promise((resolve) => setTimeout(resolve, 15_000));
}

console.log('Web crawl complete!');

const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page],
});

console.log(`Crawled ${pages.contents.results.length} pages`);

Parameters

FeedInput (Required)

  • name (string): Feed name

  • type (FeedTypes): Must be WEB

  • web (WebFeedPropertiesInput): Web crawl configuration

WebFeedPropertiesInput (Required)

  • uri (string): Starting URL to crawl

    • Crawl starts here and follows links

  • readLimit (int): Max pages to crawl

    • Safety limit to prevent runaway crawls

    • Typical: 100-1000 pages

  • includeFiles (boolean): Download linked files (PDFs, docs)

    • Default: false

    • Recommended: true for comprehensive ingestion

  • allowedDomains (string[]): Domains to crawl

    • Prevents crawling external sites

    • Example: ['docs.example.com', 'help.example.com']

  • excludedPaths (string[]): URL paths to skip

    • Example: ['/admin/', '/login/', '/api/']

  • allowedPaths (string[]): Only crawl these paths

    • Example: ['/docs/', '/guides/']

Other Optional

  • correlationId (string): For tracking

  • collections (EntityReferenceInput[]): Auto-add pages to collections

  • workflow (EntityReferenceInput): Apply workflow to pages

Response
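The createFeed mutation returns the created feed; in the canonical example above the new feed's identifier is read from response.createFeed.id and then reused for isFeedDone polling and queryContents filtering.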

Developer Hints

Always Set Domain Restrictions

Without restrictions (dangerous):
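A sketch of the risky shape, reusing the graphlit client and FeedTypes import from the canonical example above (docs.example.com is a placeholder):

// Risky: no allowedDomains, so the crawler may follow links to external sites
const risky = await graphlit.createFeed({
  name: 'Unrestricted Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
  },
});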

With restrictions (safe):
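The restricted counterpart, a sketch under the same assumptions, pinned to the starting domain:

// Safe: the crawl stays on the starting domain and skips internal routes
const restricted = await graphlit.createFeed({
  name: 'Restricted Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/archive/'],
  },
});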

Important: Always use allowedDomains to prevent crawling external sites.

Read Limit Strategy

Important: Set readLimit higher than the expected page count to ensure a complete crawl.
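For example, if a site has roughly 300 pages, a readLimit of 500 leaves headroom; the numbers below are illustrative and the client setup is assumed from the canonical example:

// Site has roughly 300 pages today; 500 leaves room for growth
const crawl = await graphlit.createFeed({
  name: 'Docs Crawl with Headroom',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
  },
});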

Path Filtering Patterns
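Two common patterns, sketched under the same assumptions as the canonical example: an allow-list that crawls only documentation sections, and a block-list that skips internal routes.

// Allow-list: only crawl documentation and guide sections
const allowList = await graphlit.createFeed({
  name: 'Docs Sections Only',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/docs/', '/guides/'],
  },
});

// Block-list: crawl everything except admin, login, and API routes
const blockList = await graphlit.createFeed({
  name: 'Everything Except Internals',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/admin/', '/login/', '/api/'],
  },
});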

File Ingestion
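A sketch with includeFiles enabled so linked PDFs and documents are downloaded alongside pages; the follow-up query filters on ContentTypes.File (client and imports assumed from the canonical example):

const fileFeed = await graphlit.createFeed({
  name: 'Docs with Linked Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true, // download linked PDFs, documents, etc.
    allowedDomains: ['docs.example.com'],
  },
});

// After the crawl completes, linked files are queryable as File content
const files = await graphlit.queryContents({
  feeds: [{ id: fileFeed.createFeed.id }],
  types: [ContentTypes.File],
});
console.log(`Ingested ${files.contents.results.length} linked files`);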

Variations

1. Basic Documentation Crawl

Simplest site crawl:
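A minimal sketch with just a starting URL, a page limit, and a domain restriction (client and imports as in the canonical example; the URL is a placeholder):

const basic = await graphlit.createFeed({
  name: 'Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 100,
    allowedDomains: ['docs.example.com'],
  },
});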

2. Crawl with File Downloads

Include PDFs and documents:
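The same basic crawl with includeFiles enabled, sketched under the same assumptions:

const withFiles = await graphlit.createFeed({
  name: 'Docs Crawl with Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true, // also ingest linked PDFs and documents
    allowedDomains: ['docs.example.com'],
  },
});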

3. Targeted Path Crawl

Only specific sections:
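A sketch that restricts the crawl to a single section via allowedPaths (placeholder URL, same client assumptions):

const targeted = await graphlit.createFeed({
  name: 'Guides Only',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com/guides/',
    readLimit: 200,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/guides/'], // skip everything outside /guides/
  },
});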

4. Multi-Domain Crawl

Crawl multiple related domains:
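A sketch that allows two related domains; it assumes the sites link to each other so the crawler can reach both from the single starting URL:

const multiDomain = await graphlit.createFeed({
  name: 'Docs and Help Center',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    includeFiles: true,
    allowedDomains: ['docs.example.com', 'help.example.com'],
  },
});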

5. Crawl with Workflow

Apply preparation workflow to pages:
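A sketch that attaches an existing preparation workflow by reference; YOUR_WORKFLOW_ID is a placeholder for a workflow created separately (for example via createWorkflow):

const workflowId = 'YOUR_WORKFLOW_ID'; // placeholder for an existing workflow

const withWorkflow = await graphlit.createFeed({
  name: 'Docs Crawl with Preparation Workflow',
  type: FeedTypes.Web,
  workflow: { id: workflowId }, // applied to every crawled page
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
  },
});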

6. Exclude Unwanted Sections

Skip specific paths:
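A sketch that keeps the crawl on-domain while skipping internal sections via excludedPaths:

const filtered = await graphlit.createFeed({
  name: 'Docs Without Internals',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/admin/', '/login/', '/api/', '/archive/'],
  },
});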

Common Issues

Issue: Crawl takes very long
Solution: This is normal for large sites. Use a higher polling interval (30-60 seconds). Crawls can take 15-60 minutes for 500+ pages.

Issue: Not all pages crawled
Solution: Increase readLimit. Check that allowedDomains and allowedPaths aren't too restrictive. Some pages may not be linked from the crawled pages.

Issue: Crawling external sites
Solution: Set allowedDomains to prevent the crawler from following links to external sites.

Issue: Crawling unwanted sections
Solution: Use excludedPaths to skip admin, login, API routes, etc.

Issue: Files (PDFs) not ingested
Solution: Ensure includeFiles: true is set, and check that the files are actually linked from crawled pages.

Issue: Duplicate pages
Solution: Graphlit deduplicates by URL, but if the site uses URL parameters, duplicates may occur. Use excludedPaths to skip paths with query parameters.

Production Example

Complete documentation ingestion:
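A sketch of an end-to-end helper: create the feed with restrictions, poll until the crawl finishes, then report the ingested page count. The ingestDocumentation name, placeholder URLs, and the 1000-page limit are illustrative choices, not part of the SDK:

import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

async function ingestDocumentation(uri: string, domain: string) {
  // Create the crawl feed with domain and path restrictions
  const created = await graphlit.createFeed({
    name: `Docs: ${domain}`,
    type: FeedTypes.Web,
    web: {
      uri,
      readLimit: 1000,
      includeFiles: true,
      allowedDomains: [domain],
      excludedPaths: ['/api/', '/login/', '/admin/'],
    },
  });

  const feedId = created.createFeed.id;

  // Poll until the crawl finishes (longer interval for large sites)
  while (true) {
    const status = await graphlit.isFeedDone(feedId);
    if (status.isFeedDone.result) {
      break;
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }

  // Report what was ingested
  const pages = await graphlit.queryContents({
    feeds: [{ id: feedId }],
    types: [ContentTypes.Page],
  });
  console.log(`Ingested ${pages.contents.results.length} pages from ${domain}`);

  return feedId;
}

await ingestDocumentation('https://docs.example.com', 'docs.example.com');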

Multi-site documentation aggregation:
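A sketch that crawls several placeholder sites and groups their pages in one shared collection; it assumes createCollection for the grouping step and reuses the polling pattern from the canonical example. Site names, domains, and limits are illustrative:

import { Graphlit } from 'graphlit-client';
import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Placeholder sites; replace with your own documentation domains
const sites = [
  { name: 'Product Docs', uri: 'https://docs.example.com', domain: 'docs.example.com' },
  { name: 'Help Center', uri: 'https://help.example.com', domain: 'help.example.com' },
];

// Shared collection so pages from every site land in one place
const collection = await graphlit.createCollection({ name: 'All Documentation' });
const collectionId = collection.createCollection.id;

const feedIds: string[] = [];

for (const site of sites) {
  const created = await graphlit.createFeed({
    name: `Docs: ${site.name}`,
    type: FeedTypes.Web,
    collections: [{ id: collectionId }], // auto-add crawled pages to the shared collection
    web: {
      uri: site.uri,
      readLimit: 500,
      includeFiles: true,
      allowedDomains: [site.domain],
    },
  });
  feedIds.push(created.createFeed.id);
}

// Poll each feed until its crawl completes
for (const feedId of feedIds) {
  while (true) {
    const status = await graphlit.isFeedDone(feedId);
    if (status.isFeedDone.result) {
      break;
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
}

console.log(`Aggregated ${sites.length} documentation sites into one collection`);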
