Create Web Crawl Feed
User Intent
"I want to crawl a website and ingest all pages for search and AI interactions"
Operation
SDK Method:
graphlit.createFeed() with web configuration
GraphQL: createFeed mutation
Entity Type: Feed
Common Use Cases: Website crawling, documentation ingestion, site mirroring, web scraping
TypeScript (Canonical)
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
const response = await graphlit.createFeed({
name: 'Company Documentation',
type: FeedTypes.Web,
web: {
uri: 'https://docs.example.com',
readLimit: 500,
includeFiles: true,
allowedDomains: ['docs.example.com'],
excludedPaths: ['/api/', '/archive/'],
},
});
const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);
// Poll for crawl completion
while (true) {
const status = await graphlit.isFeedDone(feedId);
if (status.isFeedDone.result) {
break;
}
console.log('Still crawling website...');
await new Promise((resolve) => setTimeout(resolve, 15_000));
}
console.log('Web crawl complete!');
const pages = await graphlit.queryContents({
feeds: [{ id: feedId }],
types: [ContentTypes.Page],
});
console.log(`Crawled ${pages.contents.results.length} pages`);
Parameters
FeedInput (Required)
name (string): Feed name
type (FeedTypes): Must be WEB
web (WebFeedPropertiesInput): Web crawl configuration
WebFeedPropertiesInput (Required)
uri (string): Starting URL to crawl
Crawl starts here and follows links
readLimit (int): Max pages to crawl
Safety limit to prevent runaway crawls
Typical: 100-1000 pages
Optional (Highly Recommended)
includeFiles (boolean): Download linked files (PDFs, docs)
Default: false
Recommended: true for comprehensive ingestion
allowedDomains (string[]): Domains to crawl
Prevents crawling external sites
Example:
['docs.example.com', 'help.example.com']
excludedPaths (string[]): URL paths to skip
Example:
['/admin/', '/login/', '/api/']
allowedPaths (string[]): Only crawl these paths
Example:
['/docs/', '/guides/']
Other Optional
correlationId (string): For tracking
collections (EntityReferenceInput[]): Auto-add pages to collections
workflow (EntityReferenceInput): Apply workflow to pages
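A sketch combining the tracking and collection fields; the correlation ID and collection ID are placeholders, and the setup is the same as the canonical example (workflow usage is shown in Variation 5 below):
const response = await graphlit.createFeed({
  name: 'Docs Crawl (tracked)',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 250,
    allowedDomains: ['docs.example.com'],
  },
  correlationId: 'docs-crawl-001',             // placeholder tracking ID
  collections: [{ id: 'YOUR_COLLECTION_ID' }], // placeholder: crawled pages are auto-added here
});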
Response
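The mutation returns the created feed entity; the canonical example above relies only on its id. A minimal sketch of that access (other fields, such as name and state, may also be available depending on the generated types):
const feedId = response.createFeed.id; // feed ID (GUID) used with isFeedDone() and queryContents()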
Developer Hints
Always Set Domain Restrictions
Without restrictions (dangerous):
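A sketch of what this looks like, assuming the imports and client from the canonical example (the URI is a placeholder):
const response = await graphlit.createFeed({
  name: 'Unrestricted Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    // No allowedDomains: links to external sites may be followed and crawled
  },
});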
With restrictions (safe):
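The same crawl constrained to its own domain:
const response = await graphlit.createFeed({
  name: 'Restricted Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'], // only this domain is crawled
  },
});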
Important: Always use allowedDomains to prevent crawling external sites.
Read Limit Strategy
Important: Set readLimit higher than expected page count to ensure complete crawl.
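For example, if the site is estimated at roughly 300 pages, doubling that gives comfortable headroom (a sketch; the estimate is a placeholder and the setup is the same as the canonical example):
const expectedPages = 300;            // hypothetical estimate of the site's size
const readLimit = expectedPages * 2;  // headroom for unlinked or newly added pages

const response = await graphlit.createFeed({
  name: 'Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit,
    allowedDomains: ['docs.example.com'],
  },
});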
Path Filtering Patterns
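A typical combination of allowedPaths and excludedPaths (a sketch; the paths are placeholders and the setup is the same as the canonical example):
const response = await graphlit.createFeed({
  name: 'Filtered Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/docs/', '/guides/'],           // only crawl these sections
    excludedPaths: ['/admin/', '/login/', '/api/'], // always skip these
  },
});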
File Ingestion
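To also ingest linked files such as PDFs, enable includeFiles (a sketch; placeholder URI, setup as in the canonical example):
const response = await graphlit.createFeed({
  name: 'Docs + Linked Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true, // download linked PDFs, documents, etc. alongside pages
    allowedDomains: ['docs.example.com'],
  },
});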
Variations
1. Basic Documentation Crawl
Simplest site crawl:
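A minimal sketch; the feed name and URI are placeholders:
import { Graphlit } from 'graphlit-client';
import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

const response = await graphlit.createFeed({
  name: 'Basic Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 100,
    allowedDomains: ['docs.example.com'],
  },
});

console.log(`Feed created: ${response.createFeed.id}`);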
2. Crawl with File Downloads
Include PDFs and documents:
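A sketch with includeFiles enabled (placeholder URI; same imports and client as the canonical example):
const response = await graphlit.createFeed({
  name: 'Docs Crawl with Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true, // also download linked PDFs and documents
    allowedDomains: ['docs.example.com'],
  },
});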
3. Targeted Path Crawl
Only specific sections:
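A sketch restricting the crawl with allowedPaths (placeholder paths; same setup as the canonical example):
const response = await graphlit.createFeed({
  name: 'Guides-Only Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 300,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/docs/', '/guides/'], // only these sections are crawled
  },
});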
4. Multi-Domain Crawl
Crawl multiple related domains:
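A sketch listing multiple related domains in allowedDomains (placeholder domains; same setup as the canonical example):
const response = await graphlit.createFeed({
  name: 'Docs + Help Center Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com', // crawl starts here
    readLimit: 1000,
    allowedDomains: ['docs.example.com', 'help.example.com'], // links into either domain may be followed
  },
});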
5. Crawl with Workflow
Apply preparation workflow to pages:
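A sketch applying a preparation workflow to every crawled page; the workflow ID is a placeholder for one you have already created (same setup as the canonical example):
const response = await graphlit.createFeed({
  name: 'Docs Crawl with Workflow',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
  },
  workflow: { id: 'YOUR_WORKFLOW_ID' }, // placeholder: ID of an existing workflow
});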
6. Exclude Unwanted Sections
Skip specific paths:
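A sketch using excludedPaths to skip admin, login, API, and archive routes (placeholder paths; same setup as the canonical example):
const response = await graphlit.createFeed({
  name: 'Docs Crawl (filtered)',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/admin/', '/login/', '/api/', '/archive/'], // skip these sections
  },
});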
Common Issues
Issue: Crawl takes very long
Solution: This is normal for large sites. Use a higher polling interval (30-60 seconds). Crawls can take 15-60 minutes for 500+ pages.
Issue: Not all pages crawled
Solution: Increase readLimit. Check that allowedDomains and allowedPaths aren't too restrictive. Pages that aren't linked from any crawled page won't be discovered.
Issue: Crawling external sites
Solution: Set allowedDomains to prevent crawling links to external sites.
Issue: Crawling unwanted sections
Solution: Use excludedPaths to skip admin, login, API routes, etc.
Issue: Files (PDFs) not ingested
Solution: Ensure includeFiles: true is set. Check files are actually linked from crawled pages.
Issue: Duplicate pages
Solution: Graphlit deduplicates by URL. If the site uses URL parameters, duplicates may occur. Use excludedPaths to skip paths that only vary by query parameters.
Production Example
Complete documentation ingestion:
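A sketch of an end-to-end ingestion flow: create the feed with restrictions, poll until the crawl completes, then verify what was ingested. The workflow and collection IDs are placeholders, and ContentTypes.File is assumed for querying linked files:
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// 1. Create the crawl feed with domain and path restrictions
const response = await graphlit.createFeed({
  name: 'Company Documentation (production)',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,                      // headroom above the expected page count
    includeFiles: true,                   // also ingest linked PDFs and documents
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/archive/', '/login/'],
  },
  // workflow: { id: 'YOUR_WORKFLOW_ID' },        // optional: preparation workflow (placeholder ID)
  // collections: [{ id: 'YOUR_COLLECTION_ID' }], // optional: auto-add pages to a collection
});

const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);

// 2. Poll until the crawl completes (30-second interval for large sites)
while (!(await graphlit.isFeedDone(feedId)).isFeedDone.result) {
  console.log('Still crawling website...');
  await new Promise((resolve) => setTimeout(resolve, 30_000));
}

// 3. Verify what was ingested
const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page],
});
const files = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.File],
});

console.log(`Ingested ${pages.contents.results.length} pages and ${files.contents.results.length} files`);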
Multi-site documentation aggregation:
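A sketch that crawls several related sites into separate feeds and then queries them together; the site list and the optional collection ID are placeholders:
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Hypothetical list of related documentation sites to aggregate
const sites = [
  { name: 'Product Docs', uri: 'https://docs.example.com', domain: 'docs.example.com' },
  { name: 'Help Center', uri: 'https://help.example.com', domain: 'help.example.com' },
];

const feedIds: string[] = [];

// Create one feed per site, restricted to that site's domain
for (const site of sites) {
  const response = await graphlit.createFeed({
    name: `Docs Aggregation: ${site.name}`,
    type: FeedTypes.Web,
    web: {
      uri: site.uri,
      readLimit: 500,
      includeFiles: true,
      allowedDomains: [site.domain],
      excludedPaths: ['/api/', '/archive/'],
    },
    // collections: [{ id: 'YOUR_COLLECTION_ID' }], // optional: aggregate everything into one collection
  });
  feedIds.push(response.createFeed.id);
}

// Wait for every crawl to finish
for (const feedId of feedIds) {
  while (!(await graphlit.isFeedDone(feedId)).isFeedDone.result) {
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
}

// Query all crawled pages across the aggregated feeds
const pages = await graphlit.queryContents({
  feeds: feedIds.map((id) => ({ id })),
  types: [ContentTypes.Page],
});

console.log(`Aggregated ${pages.contents.results.length} pages from ${sites.length} sites`);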