Create Web Crawl Feed
User Intent
"I want to crawl a website and ingest all pages for search and AI interactions"
Operation
SDK Method: graphlit.createFeed() with web configuration
GraphQL: createFeed mutation
Entity Type: Feed
Common Use Cases: Website crawling, documentation ingestion, site mirroring, web scraping
TypeScript (Canonical)
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
const response = await graphlit.createFeed({
name: 'Company Documentation',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 500,
includeFiles: true,
allowedDomains: ['docs.example.com'],
excludedPaths: ['/api/', '/archive/'],
},
});
const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);
// Poll for crawl completion
while (true) {
const status = await graphlit.isFeedDone(feedId);
if (status.isFeedDone.result) {
break;
}
console.log('Still crawling website...');
await new Promise((resolve) => setTimeout(resolve, 15_000));
}
console.log('Web crawl complete!');
const pages = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.Page],
});
console.log(`Crawled ${pages.contents.results.length} pages`);

Parameters
FeedInput (Required)
name (string): Feed name
type (FeedTypes): Must be WEB
web (WebFeedPropertiesInput): Web crawl configuration
WebFeedPropertiesInput (Required)
uri (string): Starting URL to crawl. The crawl starts here and follows links.
readLimit (int): Max pages to crawl. Safety limit to prevent runaway crawls. Typical: 100-1000 pages.
Optional (Highly Recommended)
includeFiles (boolean): Download linked files (PDFs, docs). Default: false. Recommended: true for comprehensive ingestion.
allowedDomains (string[]): Domains to crawl. Prevents crawling external sites. Example: ['docs.example.com', 'help.example.com']
excludedPaths (string[]): URL paths to skip. Example: ['/admin/', '/login/', '/api/']
allowedPaths (string[]): Only crawl these paths. Example: ['/docs/', '/guides/']
Other Optional
correlationId (string): For tracking
collections (EntityReferenceInput[]): Auto-add pages to collections
workflow (EntityReferenceInput): Apply workflow to pages
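A minimal sketch combining these optional fields; the correlation value, collectionId, and workflowId are hypothetical and assumed to exist already:
const feed = await graphlit.createFeed({
  name: 'Company Documentation',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
  },
  correlationId: 'docs-crawl-001', // hypothetical tracking ID
  collections: [{ id: collectionId }], // assumed pre-existing collection
  workflow: { id: workflowId }, // assumed pre-existing workflow
});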
Response
{
createFeed: {
id: string; // Feed ID
name: string; // Feed name
state: EntityState; // ENABLED
type: FeedTypes.Web; // WEB
web: {
uri: string; // Starting URL
readLimit: number; // Page limit
includeFiles: boolean;
allowedDomains: string[];
excludedPaths: string[];
}
}
}

Developer Hints
Always Set Domain Restrictions
Without restrictions (dangerous):
// BAD - Could crawl entire internet
const feed = {
web: {
uri: 'https://example.com',
readLimit: 10000
}
};

With restrictions (safe):
// GOOD - Stays on specified domains
const feed = {
web: {
uri: 'https://docs.example.com',
readLimit: 1000,
allowedDomains: ['docs.example.com']
}
};

Important: Always use allowedDomains to prevent crawling external sites.
Read Limit Strategy
// Small site (< 100 pages)
readLimit: 150
// Medium site (100-1000 pages)
readLimit: 1500
// Large site (1000+ pages)
readLimit: 5000
// Documentation site (typical)
readLimit: 500-1000

Important: Set readLimit higher than the expected page count to ensure a complete crawl.
Path Filtering Patterns
// Exclude admin/auth pages
excludedPaths: ['/admin/', '/login/', '/logout/', '/auth/']
// Exclude API docs
excludedPaths: ['/api/', '/swagger/']
// Only crawl documentation
allowedPaths: ['/docs/', '/guides/', '/tutorials/']
// Exclude specific sections
excludedPaths: ['/blog/', '/news/', '/archive/']

File Ingestion
// Include linked files (PDFs, Word docs, etc.)
const feed = {
web: {
uri: 'https://docs.example.com',
readLimit: 500,
includeFiles: true, // Download and ingest PDFs, DOCX, etc.
allowedDomains: ['docs.example.com']
}
};
// Then query files separately
const files = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.File]
});
const pages = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.Page]
});
console.log(`Crawled ${pages.contents.results.length} pages and ${files.contents.results.length} files`);

Variations
1. Basic Documentation Crawl
Simplest site crawl:
const feed = await graphlit.createFeed({
name: 'Docs Crawl',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 500,
allowedDomains: ['docs.example.com']
}
});

2. Crawl with File Downloads
Include PDFs and documents:
const feed = await graphlit.createFeed({
name: 'Docs with Files',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 1000,
includeFiles: true,
allowedDomains: ['docs.example.com']
}
});

3. Targeted Path Crawl
Only specific sections:
const feed = await graphlit.createFeed({
name: 'API Documentation Only',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com/api',
readLimit: 200,
allowedDomains: ['docs.example.com'],
allowedPaths: ['/api/'] // Only /api/* pages
}
});

4. Multi-Domain Crawl
Crawl multiple related domains:
const feed = await graphlit.createFeed({
name: 'All Company Docs',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 2000,
allowedDomains: [
'docs.example.com',
'help.example.com',
'support.example.com'
]
}
});

5. Crawl with Workflow
Apply preparation workflow to pages:
// Create workflow for better extraction
const workflow = await graphlit.createWorkflow({
name: 'Web Page Prep',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: {
specification: { id: visionSpecId }
}
}
}]
}
});
// Create feed with workflow
const feed = await graphlit.createFeed({
name: 'Docs with Prep',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 500,
includeFiles: true,
allowedDomains: ['docs.example.com']
},
workflow: { id: workflow.createWorkflow.id }
});

6. Exclude Unwanted Sections
Skip specific paths:
const feed = await graphlit.createFeed({
name: 'Docs (No Archive)',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 1000,
allowedDomains: ['docs.example.com'],
excludedPaths: [
'/archive/',
'/deprecated/',
'/old-versions/',
'/beta/',
'/admin/'
]
}
});

Common Issues
Issue: Crawl takes very long
Solution: This is normal for large sites. Use a higher polling interval (30-60 seconds). Crawls can take 15-60 minutes for 500+ pages.
Issue: Not all pages crawled
Solution: Increase readLimit. Check that allowedDomains and allowedPaths aren't too restrictive. Some pages may not be linked from any crawled page. See the diagnostic sketch after this list.
Issue: Crawling external sites
Solution: Set allowedDomains to prevent crawling links to external sites.
Issue: Crawling unwanted sections
Solution: Use excludedPaths to skip admin, login, API routes, etc.
Issue: Files (PDFs) not ingested
Solution: Ensure includeFiles: true is set. Check files are actually linked from crawled pages.
Issue: Duplicate pages
Solution: Graphlit deduplicates by URL. If the site generates URLs with varying query parameters, duplicates may occur. Use excludedPaths to skip the sections that produce them.
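A minimal diagnostic sketch for the incomplete-crawl case, assuming the readLimit configured on the feed is known (hard-coded here for illustration):
const configuredReadLimit = 500; // value used when the feed was created (assumed)
const crawledPages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page],
});
const pageCount = crawledPages.contents.results.length;
if (pageCount >= configuredReadLimit) {
  // Crawl stopped at the limit rather than exhausting the site
  console.log(`Hit readLimit (${configuredReadLimit}) - increase it and recrawl.`);
} else {
  console.log(`${pageCount} pages crawled; if pages are missing, loosen allowedDomains/allowedPaths or confirm the pages are linked.`);
}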
Production Example
Complete documentation ingestion:
// 1. Create collection
const collection = await graphlit.createCollection({
name: 'Product Documentation'
});
// 2. Create preparation workflow
const workflow = await graphlit.createWorkflow({
name: 'Web Prep',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: {
specification: { id: visionSpecId }
}
}
}]
}
});
// 3. Create web crawl feed
const feed = await graphlit.createFeed({
name: 'Documentation Crawl',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 1000,
includeFiles: true,
allowedDomains: ['docs.example.com'],
excludedPaths: ['/api/', '/admin/', '/login/']
},
collections: [{ id: collection.createCollection.id }],
workflow: { id: workflow.createWorkflow.id }
});
// 4. Wait for crawl
console.log('Starting crawl...');
const feedId = feed.createFeed.id;
let isDone = false;
let checkCount = 0;
while (!isDone) {
const status = await graphlit.isFeedDone(feedId);
isDone = status.isFeedDone.result || false;
if (!isDone) {
checkCount++;
// Show progress
const crawled = await graphlit.queryContents({
feeds: [{ id: feedId }]
});
console.log(`Check ${checkCount}: ${crawled.contents.results.length} pages crawled so far...`);
await new Promise(resolve => setTimeout(resolve, 30000)); // 30 seconds
}
}
console.log('Crawl complete!');
// 5. Get final stats
const pages = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.Page]
});
const files = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.File]
});
console.log(`\nCrawl Results:`);
console.log(`- ${pages.contents.results.length} pages`);
console.log(`- ${files.contents.results.length} files`);
// 6. Enable RAG queries
const answer = await graphlit.promptConversation({
prompt: 'How do I configure authentication?',
collections: [{ id: collection.createCollection.id }]
});
console.log(`\nRAG Answer: ${answer.message.message}`);

Multi-site documentation aggregation:
// Crawl multiple documentation sites
const docSites = [
{ name: 'Main Docs', url: 'https://docs.example.com', limit: 1000 },
{ name: 'API Docs', url: 'https://api.example.com/docs', limit: 500 },
{ name: 'Help Center', url: 'https://help.example.com', limit: 300 }
];
const collection = await graphlit.createCollection({
name: 'All Documentation'
});
for (const site of docSites) {
const feed = await graphlit.createFeed({
name: site.name,
type: FeedTypes.Web,
web: {
uri: site.url,
readLimit: site.limit,
includeFiles: true,
allowedDomains: [new URL(site.url).hostname]
},
collections: [{ id: collection.createCollection.id }]
});
console.log(`Started crawl: ${site.name}`);
}
// All sites crawl in parallel
console.log('All crawls started. Pages will be available as they\'re crawled.');