# Create Web Crawl Feed

## User Intent

"I want to crawl a website and ingest all pages for search and AI interactions"

## Operation

* **SDK Method**: `graphlit.createFeed()` with web configuration
* **GraphQL**: `createFeed` mutation
* **Entity Type**: Feed
* **Common Use Cases**: Website crawling, documentation ingestion, site mirroring, web scraping

## TypeScript (Canonical)

```typescript
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

const response = await graphlit.createFeed({
  name: 'Company Documentation',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/archive/'],
  },
});

const feedId = response.createFeed.id;
console.log(`Web crawl feed created: ${feedId}`);

// Poll for crawl completion
while (true) {
  const status = await graphlit.isFeedDone(feedId);
  if (status.isFeedDone.result) {
    break;
  }

  console.log('Still crawling website...');
  await new Promise((resolve) => setTimeout(resolve, 15_000));
}

console.log('Web crawl complete!');

const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page],
});

console.log(`Crawled ${pages.contents.results.length} pages`);
```

## Parameters

### FeedInput (Required)

* **`name`** (string): Feed name
* **`type`** (FeedTypes): Must be `WEB`
* **`web`** (WebFeedPropertiesInput): Web crawl configuration

### WebFeedPropertiesInput (Required)

* **`uri`** (string): Starting URL to crawl
  * Crawl starts here and follows links
* **`readLimit`** (int): Max pages to crawl
  * Safety limit to prevent runaway crawls
  * Typical: 100-1000 pages

### Optional (Highly Recommended)

* **`includeFiles`** (boolean): Download linked files (PDFs, docs)
  * Default: false
  * Recommended: true for comprehensive ingestion
* **`allowedDomains`** (string\[]): Domains to crawl
  * Prevents crawling external sites
  * Example: `['docs.example.com', 'help.example.com']`
* **`excludedPaths`** (string\[]): URL paths to skip
  * Example: `['/admin/', '/login/', '/api/']`
* **`allowedPaths`** (string\[]): Only crawl these paths
  * Example: `['/docs/', '/guides/']`

### Other Optional

* **`correlationId`** (string): For tracking
* **`collections`** (EntityReferenceInput\[]): Auto-add pages to collections
* **`workflow`** (EntityReferenceInput): Apply workflow to pages

## Response

```typescript
{
  createFeed: {
    id: string;              // Feed ID
    name: string;            // Feed name
    state: EntityState;      // ENABLED
    type: FeedTypes.Web;     // WEB
    web: {
      uri: string;           // Starting URL
      readLimit: number;     // Page limit
      includeFiles: boolean;
      allowedDomains: string[];
      excludedPaths: string[];
    }
  }
}
```
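
Generated GraphQL response types often mark the mutation payload as nullable, so reading `createFeed.id` directly may not type-check under strict settings. The sketch below assumes that is the case here; `CreateFeedResult` is a hypothetical, trimmed-down stand-in for the SDK's generated type, and `requireFeedId` is an illustrative helper, not an SDK function.

```typescript
// Hypothetical minimal shape of the mutation result; the generated SDK type has more fields.
interface CreateFeedResult {
  createFeed?: { id: string; name: string } | null;
}

// Extract the feed ID, failing loudly if the mutation returned no feed.
function requireFeedId(response: CreateFeedResult): string {
  const id = response.createFeed?.id;
  if (!id) {
    throw new Error('createFeed returned no feed; check the input and API credentials');
  }
  return id;
}
```

With this in place, `const feedId = requireFeedId(response);` yields a plain `string` that can be passed on to the polling and query calls.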

## Developer Hints

### Always Set Domain Restrictions

**Without restrictions** (dangerous):

```typescript
// BAD: no domain restriction, so the crawl can follow links onto external sites
const feed = {
  web: {
    uri: 'https://example.com',
    readLimit: 10000
  }
};
```

**With restrictions** (safe):

```typescript
// GOOD: stays on the specified domains
const feed = {
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    allowedDomains: ['docs.example.com']
  }
};
```

**Important**: Always use `allowedDomains` to prevent crawling external sites.
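
One way to make the restriction hard to forget is to default `allowedDomains` to the host of the starting URL, the same `new URL(...).hostname` approach used in the multi-site example later on this page. This is a minimal sketch; `WebProps` and `withDomainRestriction` are hypothetical names, not SDK exports.

```typescript
// Hypothetical local type covering only the fields this helper touches.
interface WebProps {
  uri: string;
  readLimit: number;
  allowedDomains?: string[];
}

// If no domains were specified, restrict the crawl to the starting URL's host.
function withDomainRestriction(web: WebProps): WebProps {
  if (web.allowedDomains && web.allowedDomains.length > 0) {
    return web; // Caller already chose explicit domains; leave them untouched.
  }
  return { ...web, allowedDomains: [new URL(web.uri).hostname] };
}

const restricted = withDomainRestriction({
  uri: 'https://docs.example.com/guides',
  readLimit: 1000,
});
console.log(restricted.allowedDomains);
```

The helper returns a new object rather than mutating its argument, so a shared base configuration can be reused across several feeds.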

### Read Limit Strategy

```typescript
// Small site (< 100 pages)
readLimit: 150

// Medium site (100-1000 pages)
readLimit: 1500

// Large site (1000+ pages)
readLimit: 5000

// Documentation site (typical range: 500-1000)
readLimit: 1000
```

**Important**: Set `readLimit` above the expected page count so the crawl is not cut off before it reaches every page.
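
The small and medium figures above work out to roughly 1.5 times the expected page count. A sketch of that rule of thumb follows; `suggestReadLimit` is a hypothetical helper, and both the 1.5 headroom factor and the 5000 cap are assumptions read off the examples rather than limits documented by the service.

```typescript
// Suggest a readLimit with headroom over the expected page count.
// headroom and cap are assumptions inferred from the examples above.
function suggestReadLimit(expectedPages: number, headroom = 1.5, cap = 5000): number {
  if (!Number.isFinite(expectedPages) || expectedPages <= 0) {
    throw new RangeError('expectedPages must be a positive number');
  }
  // Round up so that fractional results never undershoot the estimate.
  return Math.min(cap, Math.ceil(expectedPages * headroom));
}

console.log(suggestReadLimit(100));  // 150
console.log(suggestReadLimit(1000)); // 1500
```

If the estimate itself is unreliable, a larger `headroom` is cheaper than re-running a crawl that stopped short.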

### Path Filtering Patterns

```typescript
// Exclude admin/auth pages
excludedPaths: ['/admin/', '/login/', '/logout/', '/auth/']

// Exclude API docs
excludedPaths: ['/api/', '/swagger/']

// Only crawl documentation
allowedPaths: ['/docs/', '/guides/', '/tutorials/']

// Exclude specific sections
excludedPaths: ['/blog/', '/news/', '/archive/']
```
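
To reason about how these filters combine, it can be useful to dry-run them against a few known URLs. The sketch below is only a local approximation: `wouldCrawl` is a hypothetical helper, and simple prefix matching on the URL path is an assumption. This page does not specify how the service interprets each entry (for example, as prefixes or as patterns), so treat the result as a sanity check rather than a guarantee.

```typescript
// Hypothetical local type for the three filter fields described above.
interface CrawlFilters {
  allowedDomains?: string[];
  allowedPaths?: string[];
  excludedPaths?: string[];
}

// Approximate the filters locally, ASSUMING plain prefix matching on the path.
function wouldCrawl(url: string, filters: CrawlFilters): boolean {
  const { hostname, pathname } = new URL(url);
  const { allowedDomains = [], allowedPaths = [], excludedPaths = [] } = filters;

  // Domain restriction: when set, the host must be listed.
  if (allowedDomains.length > 0 && !allowedDomains.includes(hostname)) {
    return false;
  }
  // Exclusions win over inclusions.
  if (excludedPaths.some((prefix) => pathname.startsWith(prefix))) {
    return false;
  }
  // Path allow-list: when set, the path must match at least one entry.
  if (allowedPaths.length > 0 && !allowedPaths.some((prefix) => pathname.startsWith(prefix))) {
    return false;
  }
  return true;
}

const filters: CrawlFilters = {
  allowedDomains: ['docs.example.com'],
  allowedPaths: ['/docs/'],
  excludedPaths: ['/admin/'],
};
console.log(wouldCrawl('https://docs.example.com/docs/intro', filters));  // true
console.log(wouldCrawl('https://docs.example.com/admin/users', filters)); // false
```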

### File Ingestion

```typescript
// Include linked files (PDFs, Word docs, etc.)
const feed = {
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,  // Download and ingest PDFs, DOCX, etc.
    allowedDomains: ['docs.example.com']
  }
};

// Then query files separately
const files = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.File]
});

const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page]
});

console.log(`Crawled ${pages.contents.results.length} pages and ${files.contents.results.length} files`);
```
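
As an alternative to issuing one query per type, a single unfiltered result list can be tallied client-side. This is a sketch under the assumption that each returned content item exposes a `type` field; `ContentSummary` and `countByType` are hypothetical names, and the `'PAGE'` and `'FILE'` strings are placeholders for whatever values the SDK's `ContentTypes` enum carries.

```typescript
// Hypothetical minimal shape of a content result; the generated SDK type has many more fields.
interface ContentSummary {
  type?: string | null;
}

// Tally content items by type so one result list can report pages and files together.
function countByType(items: ContentSummary[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const key = item.type ?? 'UNKNOWN'; // Items without a type are grouped separately.
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Placeholder data standing in for `contents.results`.
const sampleItems: ContentSummary[] = [{ type: 'PAGE' }, { type: 'FILE' }, { type: 'PAGE' }, {}];
console.log(countByType(sampleItems));
```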

## Variations

### 1. Basic Documentation Crawl

Simplest site crawl:

```typescript
const feed = await graphlit.createFeed({
  name: 'Docs Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com']
  }
});
```

### 2. Crawl with File Downloads

Include PDFs and documents:

```typescript
const feed = await graphlit.createFeed({
  name: 'Docs with Files',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    includeFiles: true,
    allowedDomains: ['docs.example.com']
  }
});
```

### 3. Targeted Path Crawl

Only specific sections:

```typescript
const feed = await graphlit.createFeed({
  name: 'API Documentation Only',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com/api',
    readLimit: 200,
    allowedDomains: ['docs.example.com'],
    allowedPaths: ['/api/']  // Only /api/* pages
  }
});
```

### 4. Multi-Domain Crawl

Crawl multiple related domains:

```typescript
const feed = await graphlit.createFeed({
  name: 'All Company Docs',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 2000,
    allowedDomains: [
      'docs.example.com',
      'help.example.com',
      'support.example.com'
    ]
  }
});
```

### 5. Crawl with Workflow

Apply preparation workflow to pages:

```typescript
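// Create workflow for better extraction.
// Assumes FilePreparationServiceTypes is imported and visionSpecId holds
// the ID of a previously created specification.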
const workflow = await graphlit.createWorkflow({
  name: 'Web Page Prep',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: {
          specification: { id: visionSpecId }
        }
      }
    }]
  }
});

// Create feed with workflow
const feed = await graphlit.createFeed({
  name: 'Docs with Prep',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    includeFiles: true,
    allowedDomains: ['docs.example.com']
  },
  workflow: { id: workflow.createWorkflow.id }
});
```

### 6. Exclude Unwanted Sections

Skip specific paths:

```typescript
const feed = await graphlit.createFeed({
  name: 'Docs (No Archive)',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    allowedDomains: ['docs.example.com'],
    excludedPaths: [
      '/archive/',
      '/deprecated/',
      '/old-versions/',
      '/beta/',
      '/admin/'
    ]
  }
});
```

## Common Issues

**Issue**: Crawl takes very long\
**Solution**: This is normal for large sites. Use a longer polling interval (30-60 seconds); crawls of 500+ pages can take 15-60 minutes.
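
A fixed interval works, but for long crawls a loop that backs off and eventually gives up is gentler on the API and cannot hang forever. The following is a minimal sketch: `nextPollDelayMs` and `waitUntilDone` are hypothetical helpers, the 15-to-60-second range is taken from the intervals mentioned on this page, and the one-hour default timeout is an assumption. The completion check is passed in as a function so the helper does not depend on any particular client.

```typescript
// Capped exponential backoff: 15s, 30s, 60s, 60s, ... with the defaults.
function nextPollDelayMs(attempt: number, baseMs = 15_000, maxMs = 60_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Poll `isDone` until it resolves to true or the timeout elapses.
// Resolves to true on completion and false on timeout.
async function waitUntilDone(
  isDone: () => Promise<boolean>,
  timeoutMs = 60 * 60 * 1000, // Assumed default: give up after one hour.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (let attempt = 0; Date.now() < deadline; attempt++) {
    if (await isDone()) {
      return true;
    }
    await sleep(nextPollDelayMs(attempt));
  }
  return false;
}
```

With the client from the canonical example, the check could be supplied as `() => graphlit.isFeedDone(feedId).then((s) => Boolean(s.isFeedDone?.result))`. The injectable `sleep` exists so tests can run without real delays.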

**Issue**: Not all pages crawled\
**Solution**: Increase `readLimit`, and check that `allowedDomains` and `allowedPaths` aren't too restrictive. Pages that no crawled page links to cannot be discovered.

**Issue**: Crawling external sites\
**Solution**: Set `allowedDomains` to prevent crawling links to external sites.

**Issue**: Crawling unwanted sections\
**Solution**: Use `excludedPaths` to skip admin, login, API routes, etc.

**Issue**: Files (PDFs) not ingested\
**Solution**: Ensure `includeFiles: true` is set, and check that the files are actually linked from crawled pages.

**Issue**: Duplicate pages\
**Solution**: Graphlit deduplicates by URL, so the same page reached through different URL parameters may be ingested more than once. Use `excludedPaths` to skip the parameterized variants.
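
To find out whether URL parameters are causing duplicates, the URIs of ingested pages can be grouped after stripping the query string and fragment. This sketch works on plain strings and makes no SDK calls; `stripQueryAndFragment` and `findDuplicateUris` are hypothetical helpers, and it assumes that URLs differing only in their query string point at the same page, which does not hold for every site.

```typescript
// Normalize a URL by dropping its query string and fragment.
function stripQueryAndFragment(uri: string): string {
  const url = new URL(uri);
  url.search = '';
  url.hash = '';
  return url.toString();
}

// Group URIs by normalized form and keep only groups with more than one member.
function findDuplicateUris(uris: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const uri of uris) {
    const key = stripQueryAndFragment(uri);
    const group = groups.get(key) ?? [];
    group.push(uri);
    groups.set(key, group);
  }

  const duplicates = new Map<string, string[]>();
  groups.forEach((group, key) => {
    if (group.length > 1) {
      duplicates.set(key, group);
    }
  });
  return duplicates;
}

const duplicateGroups = findDuplicateUris([
  'https://docs.example.com/page?utm_source=nav',
  'https://docs.example.com/page',
  'https://docs.example.com/other',
]);
console.log(duplicateGroups.size); // 1
```

Each key in the returned map is a normalized URL, and its value lists the original variants that collapsed onto it, which shows which parameters to filter out.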

## Production Example

**Complete documentation ingestion**:

```typescript
// 1. Create collection
const collection = await graphlit.createCollection({
  name: 'Product Documentation'
});

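// 2. Create preparation workflow
//    (assumes FilePreparationServiceTypes is imported and visionSpecId is
//    the ID of a previously created specification)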
const workflow = await graphlit.createWorkflow({
  name: 'Web Prep',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: {
          specification: { id: visionSpecId }
        }
      }
    }]
  }
});

// 3. Create web crawl feed
const feed = await graphlit.createFeed({
  name: 'Documentation Crawl',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 1000,
    includeFiles: true,
    allowedDomains: ['docs.example.com'],
    excludedPaths: ['/api/', '/admin/', '/login/']
  },
  collections: [{ id: collection.createCollection.id }],
  workflow: { id: workflow.createWorkflow.id }
});

// 4. Wait for crawl
console.log('Starting crawl...');
const feedId = feed.createFeed.id;
let isDone = false;
let checkCount = 0;

while (!isDone) {
  const status = await graphlit.isFeedDone(feedId);
  isDone = status.isFeedDone.result || false;
  
  if (!isDone) {
    checkCount++;
    
    // Show progress (this unfiltered query counts pages and files alike)
    const crawled = await graphlit.queryContents({
      feeds: [{ id: feedId }]
    });
    
    console.log(`Check ${checkCount}: ${crawled.contents.results.length} items ingested so far...`);
    await new Promise(resolve => setTimeout(resolve, 30000)); // 30 seconds
  }
}

console.log('Crawl complete!');

// 5. Get final stats
const pages = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.Page]
});

const files = await graphlit.queryContents({
  feeds: [{ id: feedId }],
  types: [ContentTypes.File]
});

console.log(`\nCrawl Results:`);
console.log(`- ${pages.contents.results.length} pages`);
console.log(`- ${files.contents.results.length} files`);

// 6. Enable RAG queries
const answer = await graphlit.promptConversation({
  prompt: 'How do I configure authentication?',
  collections: [{ id: collection.createCollection.id }]
});

console.log(`\nRAG Answer: ${answer.message.message}`);
```

**Multi-site documentation aggregation**:

```typescript
// Crawl multiple documentation sites
const docSites = [
  { name: 'Main Docs', url: 'https://docs.example.com', limit: 1000 },
  { name: 'API Docs', url: 'https://api.example.com/docs', limit: 500 },
  { name: 'Help Center', url: 'https://help.example.com', limit: 300 }
];

const collection = await graphlit.createCollection({
  name: 'All Documentation'
});

for (const site of docSites) {
  const feed = await graphlit.createFeed({
    name: site.name,
    type: FeedTypes.Web,
    web: {
      uri: site.url,
      readLimit: site.limit,
      includeFiles: true,
      allowedDomains: [new URL(site.url).hostname]
    },
    collections: [{ id: collection.createCollection.id }]
  });
  
  console.log(`Started crawl: ${site.name}`);
}

// All sites crawl in parallel
console.log('All crawls started. Pages will be available as they\'re crawled.');
```

