Create Web Crawl Feed

User Intent

"I want to crawl a website and ingest all pages for search and AI interactions"

Operation

  • SDK Method: graphlit.createFeed() with web configuration

  • GraphQL: createFeed mutation

  • Entity Type: Feed

  • Common Use Cases: Website crawling, documentation ingestion, site mirroring, web scraping

TypeScript (Canonical)

import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

const response = await graphlit.createFeed({
  name: 'Company Documentation',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/archive/'],
  },
});

const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);

// Poll for crawl completion
while (true) {
  const status = await graphlit.isFeedDone(feedId);
  if (status.isFeedDone.result) {
    break;
  }

  console.log('Still crawling website...');
  await new Promise((resolve) => setTimeout(resolve, 15_000));
}

console.log('Web crawl complete!');

const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page],
});

console.log(`Crawled ${pages.contents.results.length} pages`);

Parameters

FeedInput (Required)

  • name (string): Feed name

  • type (FeedTypes): Must be WEB

  • web (WebFeedPropertiesInput): Web crawl configuration

WebFeedPropertiesInput (Required)

  • uri (string): Starting URL to crawl

    • Crawl starts here and follows links

  • readLimit (int): Max pages to crawl

    • Safety limit to prevent runaway crawls

    • Typical: 100-1000 pages

  • includeFiles (boolean): Download linked files (PDFs, docs)

    • Default: false

    • Recommended: true for comprehensive ingestion

  • allowedDomains (string[]): Domains to crawl

    • Prevents crawling external sites

    • Example: ['docs.example.com', 'help.example.com']

  • excludedPaths (string[]): URL paths to skip

    • Example: ['/admin/', '/login/', '/api/']

  • allowedPaths (string[]): Only crawl these paths

    • Example: ['/docs/', '/guides/']
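
For reference, here is a web configuration that sets every field above in one place (the values are illustrative, not recommendations):

// Illustrative WebFeedPropertiesInput using all of the fields above
const web = {
  uri: 'https://docs.example.com',       // starting URL for the crawl
  readLimit: 1000,                       // stop after 1,000 items
  includeFiles: true,                    // also ingest linked PDFs, DOCX, etc.
  allowedDomains: ['docs.example.com'],  // never leave this domain
  allowedPaths: ['/docs/', '/guides/'],  // only crawl these sections
  excludedPaths: ['/docs/archive/'],     // ...but skip the archive within them
};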

Other Optional

  • correlationId (string): For tracking

  • collections (EntityReferenceInput[]): Auto-add pages to collections

  • workflow (EntityReferenceInput): Apply workflow to pages
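
A sketch combining these optional fields on a web crawl feed; it assumes a collection and workflow were created earlier with createCollection and createWorkflow, and the correlationId value is just a placeholder:

const response = await graphlit.createFeed({
  name: 'Company Documentation',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
  },
  correlationId: 'docs-crawl-001',                        // placeholder tracking ID
  collections: [{ id: collection.createCollection.id }],  // auto-add crawled pages
  workflow: { id: workflow.createWorkflow.id },           // apply preparation workflow
});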

Response

{
  createFeed: {
    id: string;              // Feed ID
    name: string;            // Feed name
    state: EntityState;      // ENABLED
    type: FeedTypes.Web;     // WEB
    web: {
      uri: string;           // Starting URL
      readLimit: number;     // Page limit
      includeFiles: boolean;
      allowedDomains: string[];
      excludedPaths: string[];
    }
  }
}
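
A minimal sketch of reading those fields back from the response, following the shape above:

const created = response.createFeed;
console.log(`Feed ${created.id} (${created.name}) is ${created.state}`);
console.log(`Crawling ${created.web.uri}, limited to ${created.web.readLimit} pages`);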

Developer Hints

Always Set Domain Restrictions

Without restrictions (dangerous):

// BAD - could crawl the entire internet
const feed = {
  web: {
    uri: 'https://example.com',
    readLimit: 10000
  }
};

With restrictions (safe):

// GOOD - Stays on the specified domains
const feed = {
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    allowedDomains: ['docs.example.com']
  }
};

Important: Always use allowedDomains to prevent crawling external sites.

Read Limit Strategy

// Small site (< 100 pages)
readLimit: 150

// Medium site (100-1000 pages)
readLimit: 1500

// Large site (1000+ pages)
readLimit: 5000

// Documentation site (typical)
readLimit: 500-1000

Important: Set readLimit higher than the expected page count to ensure a complete crawl.
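
One way to apply that rule is a small helper that pads your page estimate; readLimitFor is a hypothetical convenience function, not part of the SDK:

// Hypothetical helper: pad the expected page count by ~50% so the
// crawl isn't cut off when the site is larger than you think.
function readLimitFor(expectedPages: number): number {
  return Math.ceil(expectedPages * 1.5);
}

const web = {
  uri: 'https://docs.example.com',
  readLimit: readLimitFor(400),          // => 600
  allowedDomains: ['docs.example.com'],
};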

Path Filtering Patterns

// Exclude admin/auth pages
excludedPaths: ['/admin/', '/login/', '/logout/', '/auth/']

// Exclude API docs
excludedPaths: ['/api/', '/swagger/']

// Only crawl documentation
allowedPaths: ['/docs/', '/guides/', '/tutorials/']

// Exclude specific sections
excludedPaths: ['/blog/', '/news/', '/archive/']

File Ingestion

// Include linked files (PDFs, Word docs, etc.)
const feed = {
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,  // Download and ingest PDFs, DOCX, etc.
    allowedDomains: ['docs.example.com']
  }
};

// Then query files separately
const files = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.File]
});

const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page]
});

console.log(`Crawled ${pages.contents.results.length} pages and ${files.contents.results.length} files`);

Variations

1. Basic Documentation Crawl

Simplest site crawl:

const feed = await graphlit.createFeed({
  name: 'Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com']
  }
});

2. Crawl with File Downloads

Include PDFs and documents:

const feed = await graphlit.createFeed({
  name: 'Docs with Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    includeFiles: true,
    allowedDomains: ['docs.example.com']
  }
});

3. Targeted Path Crawl

Only specific sections:

const feed = await graphlit.createFeed({
  name: 'API Documentation Only',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com/api',
    readLimit: 200,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/api/']  // Only /api/* pages
  }
});

4. Multi-Domain Crawl

Crawl multiple related domains:

const feed = await graphlit.createFeed({
  name: 'All Company Docs',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 2000,
    allowedDomains: [
      'docs.example.com',
      'help.example.com',
      'support.example.com'
    ]
  }
});

5. Crawl with Workflow

Apply preparation workflow to pages:

// Create workflow for better extraction
const workflow = await graphlit.createWorkflow({
  name: 'Web Page Prep',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: {
          specification: { id: visionSpecId }
        }
      }
    }]
  }
});

// Create feed with workflow
const feed = await graphlit.createFeed({
  name: 'Docs with Prep',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,
    allowedDomains: ['docs.example.com']
  },
  workflow: { id: workflow.createWorkflow.id }
});

6. Exclude Unwanted Sections

Skip specific paths:

const feed = await graphlit.createFeed({
  name: 'Docs (No Archive)',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    allowedDomains: ['docs.example.com'],
    excludedPaths: [
      '/archive/',
      '/deprecated/',
      '/old-versions/',
      '/beta/',
      '/admin/'
    ]
  }
});

Common Issues

Issue: Crawl takes very long
Solution: This is normal for large sites. Use a higher polling interval (30-60 seconds). Crawls can take 15-60 minutes for 500+ pages.
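
For long crawls, a bounded polling loop avoids both tight polling and waiting forever; this sketch reuses isFeedDone from the examples above, with an illustrative 60-second interval and 2-hour cap:

const pollIntervalMs = 60_000;             // check once per minute
const maxWaitMs = 2 * 60 * 60 * 1000;      // stop waiting after 2 hours
const startedAt = Date.now();

while (Date.now() - startedAt < maxWaitMs) {
  const status = await graphlit.isFeedDone(feedId);
  if (status.isFeedDone.result) {
    break;
  }
  await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
}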

Issue: Not all pages crawled
Solution: Increase readLimit, and check that allowedDomains and allowedPaths aren't too restrictive. Some pages may not be linked from any crawled page.
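
One quick check, sketched with queryContents as used elsewhere in this guide (configuredReadLimit is a placeholder for whatever you passed as readLimit): if the crawled count has reached the limit, the crawl most likely stopped at readLimit rather than finishing the site.

const crawled = await graphlit.queryContents({
  feeds: [{ id: feedId }],
});

// As elsewhere in this guide, this counts the returned results.
if (crawled.contents.results.length >= configuredReadLimit) {
  console.warn('Crawl likely stopped at readLimit; increase it and re-crawl.');
}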

Issue: Crawling external sites
Solution: Set allowedDomains to prevent crawling links to external sites.

Issue: Crawling unwanted sections
Solution: Use excludedPaths to skip admin, login, API routes, etc.

Issue: Files (PDFs) not ingested
Solution: Ensure includeFiles: true is set, and check that the files are actually linked from crawled pages.

Issue: Duplicate pages
Solution: Graphlit deduplicates by URL. If the site generates distinct URLs via query parameters, duplicates may still occur; use excludedPaths to skip those routes.

Production Example

Complete documentation ingestion:

// 1. Create collection
const collection = await graphlit.createCollection({
  name: 'Product Documentation'
});

// 2. Create preparation workflow
const workflow = await graphlit.createWorkflow({
  name: 'Web Prep',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: {
          specification: { id: visionSpecId }
        }
      }
    }]
  }
});

// 3. Create web crawl feed
const feed = await graphlit.createFeed({
  name: 'Documentation Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    includeFiles: true,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/admin/', '/login/']
  },
  collections: [{ id: collection.createCollection.id }],
  workflow: { id: workflow.createWorkflow.id }
});

// 4. Wait for crawl
console.log('Starting crawl...');
const feedId = feed.createFeed.id;
let isDone = false;
let checkCount = 0;

while (!isDone) {
  const status = await graphlit.isFeedDone(feedId);
  isDone = status.isFeedDone.result || false;
  
  if (!isDone) {
    checkCount++;
    
    // Show progress
    const crawled = await graphlit.queryContents({
      feeds: [{ id: feedId }]
    });
    
    console.log(`Check ${checkCount}: ${crawled.contents.results.length} pages crawled so far...`);
    await new Promise(resolve => setTimeout(resolve, 30000)); // 30 seconds
  }
}

console.log('Crawl complete!');

// 5. Get final stats
const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page]
});

const files = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.File]
});

console.log(`\nCrawl Results:`);
console.log(`- ${pages.contents.results.length} pages`);
console.log(`- ${files.contents.results.length} files`);

// 6. Enable RAG queries
const answer = await graphlit.promptConversation({
  prompt: 'How do I configure authentication?',
  collections: [{ id: collection.createCollection.id }]
});

console.log(`\nRAG Answer: ${answer.promptConversation.message.message}`);

Multi-site documentation aggregation:

// Crawl multiple documentation sites
const docSites = [
  { name: 'Main Docs', url: 'https://docs.example.com', limit: 1000 },
  { name: 'API Docs', url: 'https://api.example.com/docs', limit: 500 },
  { name: 'Help Center', url: 'https://help.example.com', limit: 300 }
];

const collection = await graphlit.createCollection({
  name: 'All Documentation'
});

for (const site of docSites) {
  const feed = await graphlit.createFeed({
    name: site.name,
    type: FeedTypes.Web,
    web: {
      uri: site.url,
      readLimit: site.limit,
      includeFiles: true,
      allowedDomains: [new URL(site.url).hostname]
    },
    collections: [{ id: collection.createCollection.id }]
  });
  
  console.log(`Started crawl: ${site.name}`);
}

// All sites crawl in parallel
console.log('All crawls started. Pages will be available as they\'re crawled.');
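
If you need to block until every site has finished, one approach is to collect each feed ID inside the loop above (e.g. feedIds.push(feed.createFeed.id)) and then poll them in parallel; a sketch reusing isFeedDone:

// Declare before the loop above and push feed.createFeed.id inside it.
const feedIds: string[] = [];

// Wait for every crawl to report done, checking each feed every 30 seconds.
await Promise.all(
  feedIds.map(async (id) => {
    while (true) {
      const status = await graphlit.isFeedDone(id);
      if (status.isFeedDone.result) {
        return;
      }
      await new Promise((resolve) => setTimeout(resolve, 30_000));
    }
  })
);

console.log('All documentation sites crawled.');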
