# Build Knowledge Graph from GitHub Repositories

## User Intent

"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."

## Operation

**SDK Methods**: `createWorkflow()`, `createFeed()` (Site, Issue, Commit, or PullRequest feeds), `isFeedDone()`, `queryContents()`, `queryObservables()`\
**GraphQL**: GitHub feed creation + entity extraction + contributor/project graphs\
**Entity**: GitHub Feed → Content (Files/Issues/Commits/PRs) → Observations → Observables (Developer/Project Graph)

## Prerequisites

* Graphlit project with API credentials
* GitHub personal access token (created in GitHub, or an OAuth token authorized via the Graphlit Developer Portal)
* GitHub repository access
* Understanding of feed and workflow concepts

***

## Complete Code Example (TypeScript)

```typescript
import { Graphlit } from 'graphlit-client';
import {
  ContentTypes,
  EntityExtractionServiceTypes,
  EntityState,
  FeedServiceTypes,
  FeedTypes,
  ObservableTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from GitHub ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "GitHub Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Software,
          ObservableTypes.Category,
          ObservableTypes.Label
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
  name: "Graphlit Samples Repo",
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.GitHub,
    github: {
      repositoryOwner: 'graphlit',
      repositoryName: 'graphlit-samples',
      personalAccessToken: process.env.GITHUB_TOKEN!
    },
    allowedPaths: ['README.md', 'docs/**', 'python/**', 'nextjs/**'],
    excludedPaths: ['**/node_modules/**', '**/dist/**']
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);

// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
  name: "Graphlit Issues",
  type: FeedTypes.Issue,
  issue: {
    type: FeedServiceTypes.GitHubIssues,
    github: {
      repositoryOwner: 'graphlit',
      repositoryName: 'graphlit-samples',
      personalAccessToken: process.env.GITHUB_TOKEN!
    },
    readLimit: 100
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);

// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;

while (!repoDone || !issuesDone) {
  if (!repoDone) {
    const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
    repoDone = repoStatus.isFeedDone.result;
  }
  
  if (!issuesDone) {
    const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
    issuesDone = issuesStatus.isFeedDone.result;
  }
  
  if (!repoDone || !issuesDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
  types: [ContentTypes.File],
  feeds: [{ id: repoFeed.createFeed.id }]
});

console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);

// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
  types: [ContentTypes.Issue],
  feeds: [{ id: issuesFeed.createFeed.id }]
});

console.log(`✓ Synced ${issues.contents.results.length} issues\n`);

// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');

// Get all Repo entities
const repos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);

// Get contributors (Person entities)
const people = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);

// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);

// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');

const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
  (issue.issue?.labels || []).filter(Boolean).forEach(label => {
    labelCounts.set(label!, (labelCounts.get(label!) || 0) + 1);
  });
});

console.log('Most common issue labels:');
Array.from(labelCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([label, count]) => {
    console.log(`  ${label}: ${count} issues`);
  });

// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');

const contributors = new Map<string, {
  files: number;
  issues: number;
  total: number;
}>();

// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
  file.observations
    ?.filter(Boolean)
    .filter(obs => obs?.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs!.observable.name;
      if (!contributors.has(name)) contributors.set(name, { files: 0, issues: 0, total: 0 });
      contributors.get(name)!.files++;
      contributors.get(name)!.total++;
    });
});

// Count issues by contributor (from observations)
issues.contents.results.forEach(issue => {
  issue.observations
    ?.filter(Boolean)
    .filter(obs => obs?.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs!.observable.name;
      if (!contributors.has(name)) contributors.set(name, { files: 0, issues: 0, total: 0 });
      contributors.get(name)!.issues++;
      contributors.get(name)!.total++;
    });
});

console.log('Top contributors:');
Array.from(contributors.entries())
  .sort((a, b) => b[1].total - a[1].total)
  .slice(0, 5)
  .forEach(([name, stats]) => {
    console.log(`  ${name}: ${stats.files} files, ${stats.issues} issues`);
  });

console.log('\n✓ Repository analysis complete!');
```

***

## Step-by-Step Explanation

### Step 1: Create Entity Extraction Workflow

**GitHub-Specific Entity Types**:

* **Repo**: Repository as entity (name, owner, URL, description)
* **Person**: Contributors, commit authors, issue creators, reviewers
* **Organization**: Repository owner if organization, mentioned companies
* **Software**: Dependencies (from package.json, requirements.txt, etc.)
* **Category**: Topics, tags, project themes
* **Label**: Issue/PR labels

**Why Text Extraction**:

* Code files, README, docs are text-based
* Fast and cost-effective
* Handles markdown, code, JSON, YAML

### Step 2: Configure GitHub Repository Feed

**Repository Feed Options**:

```typescript
site: {
  type: FeedServiceTypes.GitHub,
  github: {
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    personalAccessToken: githubToken
  },
  // Optional: scope what you ingest (recommended)
  allowedPaths: ['README.md', 'docs/**'],
  excludedPaths: ['**/node_modules/**', '**/dist/**']
}
```

**What Gets Synced**:

* README.md
* Repository files matching your `allowedPaths`/`excludedPaths`
* Documentation files
* Configuration files (package.json, requirements.txt, etc.)

### Step 3: Configure GitHub Issues Feed

**Issues Feed Options**:

```typescript
issue: {
  type: FeedServiceTypes.GitHubIssues,
  github: {
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    personalAccessToken: githubToken
  },
  readLimit: 100
}
```

**What Gets Synced**:

* Issue title and body
* Labels
* Issue number/identifier
* Created/updated dates

If you want to focus on specific labels/states, filter after ingestion using `queryContents()`.
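
A minimal sketch of that post-ingestion filtering, applied client-side to results already returned by `queryContents()`. The `IssueLike` shape is an assumption that covers only the `issue.labels` and `issue.status` fields used in this guide, and `filterIssues` is a hypothetical helper, not part of the Graphlit SDK.

```typescript
// Assumed minimal shape: only the fields this helper reads.
interface IssueLike {
  name?: string | null;
  issue?: {
    labels?: Array<string | null> | null;
    status?: string | null;
  } | null;
}

interface IssueCriteria {
  labels?: string[];  // keep issues carrying at least one of these labels
  status?: string;    // keep issues whose status matches (case-insensitive)
}

// Hypothetical helper: narrows already-fetched issue content by label and/or status.
function filterIssues<T extends IssueLike>(items: T[], criteria: IssueCriteria): T[] {
  const wanted = new Set((criteria.labels ?? []).map(l => l.toLowerCase()));
  const status = criteria.status?.toLowerCase();

  return items.filter(item => {
    const labels = (item.issue?.labels ?? [])
      .filter((l): l is string => typeof l === 'string')
      .map(l => l.toLowerCase());

    const labelOk = wanted.size === 0 || labels.some(l => wanted.has(l));
    const statusOk = !status || item.issue?.status?.toLowerCase() === status;
    return labelOk && statusOk;
  });
}
```

For example, `filterIssues(issues.contents.results, { labels: ['bug'], status: 'Open' })` would keep only open issues labeled `bug`, assuming the results carry those fields.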

***

{% hint style="info" %}
**Additional GitHub Feed Types**:

You can also create feeds for:

* **GitHub Commits** (`FeedTypes.Commit`) - Sync commit history, code changes, and developer activity
* **GitHub Pull Requests** (`FeedTypes.PullRequest`) - Sync pull requests, reviews, and merge history

These feed types are useful for analyzing code review patterns, tracking developer contributions, and understanding project evolution over time.

See the [GitHub Commits](/api-guides/use-cases/feeds/project-management/feed-create-github-commits.md) and [GitHub Pull Requests](/api-guides/use-cases/feeds/project-management/feed-create-github-pull-requests.md) feed guides for details.
{% endhint %}

***

### Step 4: GitHub Token Setup

**Creating GitHub Token**:

1. GitHub → Settings → Developer settings → Personal access tokens
2. Generate new token (classic)
3. Select scopes:
   * `repo` (for private repos) or `public_repo` (for public only)
   * `read:org` (if accessing org repos)
4. Copy token
5. Use in Graphlit feed creation

**OR via Graphlit Developer Portal**:

1. Go to Developer Portal → Connectors → Version Control
2. Authorize GitHub
3. Copy OAuth token
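
Whichever route produces the token, it helps to fail fast when it is missing instead of passing `undefined` into feed creation. A minimal sketch, assuming the token is stored in an environment variable named `GITHUB_TOKEN` as in the complete example; `requireToken` is a hypothetical helper.

```typescript
// Hypothetical helper: reads a token from an environment-like object and
// throws a descriptive error when it is absent or blank.
function requireToken(
  env: Record<string, string | undefined>,
  name: string = 'GITHUB_TOKEN'
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing ${name}: set it in the environment before creating GitHub feeds`);
  }
  return value;
}
```

In a Node.js script this could be called as `const githubToken = requireToken(process.env);` before any `createFeed()` call.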

### Step 5: Analyze Repository Files

**File Content Structure**:

```typescript
// fileId is the ID of an ingested content item, e.g. repoFiles.contents.results[0].id
const file = await graphlit.getContent(fileId);

console.log(`File: ${file.content.name}`);
console.log(`Type: ${file.content.fileType}`);
console.log(`Path: ${file.content.uri}`);

// Extracted entities from file
file.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
});
```

**README Analysis**:

* Rich source of Repo, Person, Organization entities
* Contributors listed
* Dependencies mentioned
* Project description

**package.json/requirements.txt Analysis**:

* Dependencies as Software entities
* Version information
* Project metadata
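
To sanity-check the extracted Software entities, the declared dependencies can be listed locally and compared against the entity names. This is a rough cross-check under the assumption that you have the `package.json` text at hand; it is not how Graphlit performs extraction, and both function names are hypothetical.

```typescript
interface DeclaredDependency {
  name: string;
  version: string;
  kind: 'dependencies' | 'devDependencies' | 'peerDependencies';
}

// Lists dependency names and version ranges declared in package.json text.
function listDeclaredDependencies(packageJsonText: string): DeclaredDependency[] {
  const pkg = JSON.parse(packageJsonText) as Record<string, unknown>;
  const kinds: DeclaredDependency['kind'][] = ['dependencies', 'devDependencies', 'peerDependencies'];
  const result: DeclaredDependency[] = [];

  for (const kind of kinds) {
    const section = pkg[kind];
    if (section === null || typeof section !== 'object') continue;
    const entries = section as Record<string, unknown>;
    for (const name of Object.keys(entries)) {
      const version = entries[name];
      if (typeof version === 'string') result.push({ name, version, kind });
    }
  }
  return result;
}

// Names declared in package.json but absent from the extracted entity names
// (compared case-insensitively).
function findMissing(declared: DeclaredDependency[], extractedNames: string[]): string[] {
  const seen = new Set(extractedNames.map(n => n.toLowerCase()));
  return declared.map(d => d.name).filter(n => !seen.has(n.toLowerCase()));
}
```

A non-empty result from `findMissing` suggests the manifest was not synced or the extraction missed some packages.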

### Step 6: Analyze GitHub Issues

**Issue Metadata**:

```typescript
issue: {
  identifier: "42",                    // Issue number
  title: "Add feature X",
  project: "graphlit-samples",
  status: "Open",
  priority: "High",
  labels: ["feature", "enhancement"],
  author: {
    name: "Kirk Marple",
    email: "kirk@graphlit.com"
  }
}
```

**Entity Extraction from Issues**:

* **Person**: Issue author, mentioned contributors (@username)
* **Software**: Tools/libraries mentioned
* **Category**: Feature areas, components
* **Label**: Issue labels as Label entities
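
As a complement to model-based extraction, `@username` mentions can be pulled out of issue text with a regular expression and compared with the extracted Person entities. The sketch below is an illustration only: the pattern simplifies GitHub's username rules, and `extractMentions` is a hypothetical helper.

```typescript
// Collects unique @username mentions from issue text.
// The pattern accepts alphanumerics and inner hyphens, so treat matches
// as candidates rather than verified GitHub accounts.
function extractMentions(text: string): string[] {
  const pattern = /(^|[^A-Za-z0-9_@])@([A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)/g;
  const found = new Map<string, string>();
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(text)) !== null) {
    const username = match[2];
    const key = username.toLowerCase();  // compare usernames case-insensitively
    if (!found.has(key)) found.set(key, username);
  }
  return Array.from(found.values());
}
```

Requiring a non-word character before `@` keeps email addresses such as `a@b.com` from being counted as mentions.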

### Step 7: Build Contributor Graph

**Contributors from Multiple Sources**:

1. **File authors**: Extracted from README, commit mentions
2. **Issue creators**: From issue author field
3. **Code comments**: Developers mentioned in code
4. **Documentation**: Authors in docs

```typescript
// Deduplicate contributors
const uniqueContributors = new Map<string, {
  observableId: string;
  name: string;
  email?: string;
  contributions: number;
}>();

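// Combine from all sources (files and issues queried in the complete example)
const allContent = [...repoFiles.contents.results, ...issues.contents.results];

allContent.forEach(content => {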
  content.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      if (!uniqueContributors.has(obs.observable.id)) {
        uniqueContributors.set(obs.observable.id, {
          observableId: obs.observable.id,
          name: obs.observable.name,
          email: obs.observable.properties?.email,
          contributions: 0
        });
      }
      uniqueContributors.get(obs.observable.id)!.contributions++;
    });
});
```

### Step 8: Dependency Analysis

**Extract Software Dependencies**:

```typescript
// Get all Software entities
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

// Find which files reference each dependency
for (const dep of dependencies.observables.results) {
  const references = await graphlit.queryContents({
    feeds: [{ id: repoFeed.createFeed.id }],
    observations: [{
      type: ObservableTypes.Software,
      observable: { id: dep.observable.id }
    }]
  });
  
  console.log(`${dep.observable.name}: ${references.contents.results.length} files`);
}
```
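
The loop above issues one `queryContents()` call per dependency. When the file results are already in memory, the same mapping can be built in a single local pass. The sketch below assumes content results include their `observations`, as the complete example reads them; the `ContentLike` and `ObservationLike` shapes and the `indexContentByEntity` helper are hypothetical, and the index only covers results you have actually fetched.

```typescript
// Assumed minimal shapes, mirroring the fields this guide reads from content results.
interface ObservationLike {
  type?: string | null;
  observable: { id: string; name?: string | null };
}

interface ContentLike {
  name?: string | null;
  observations?: Array<ObservationLike | null> | null;
}

// Builds a map from entity name to the names of content items that reference it.
function indexContentByEntity(contents: ContentLike[], entityType: string): Map<string, string[]> {
  const index = new Map<string, string[]>();

  for (const content of contents) {
    const seen = new Set<string>();  // count each entity once per content item
    for (const obs of content.observations ?? []) {
      if (!obs || obs.type !== entityType) continue;
      if (seen.has(obs.observable.id)) continue;
      seen.add(obs.observable.id);

      const key = obs.observable.name ?? obs.observable.id;
      const files = index.get(key) ?? [];
      files.push(content.name ?? '(unnamed)');
      index.set(key, files);
    }
  }
  return index;
}
```

Calling `indexContentByEntity(repoFiles.contents.results, ObservableTypes.Software)` would then give file lists per dependency without further round trips.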

***

## Configuration Options

### Scope the repository sync

Use `allowedPaths`/`excludedPaths` on the `site` feed to control repository size and cost:

```typescript
site: {
  allowedPaths: ['README.md', 'docs/**', 'src/**'],
  excludedPaths: ['**/node_modules/**', '**/dist/**', '**/*.lock']
}
```
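
Before creating a feed, it can be useful to preview locally which paths a pattern set would select. The sketch below uses common glob conventions (`**` spans directories, `*` stays within one path segment, exclusions win over inclusions). These semantics are an assumption for illustration: this guide does not specify Graphlit's matching rules, so the service may behave differently. Both helpers are hypothetical.

```typescript
// Converts a glob such as 'docs/**' or '**/*.lock' into a RegExp
// under the assumed semantics described above.
function globToRegExp(glob: string): RegExp {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (ch === '*' && glob.charAt(i + 1) === '*') {
      if (glob.charAt(i + 2) === '/') {
        source += '(?:.*/)?';  // '**/' matches zero or more leading directories
        i += 3;
      } else {
        source += '.*';        // trailing '**' matches everything below
        i += 2;
      }
    } else if (ch === '*') {
      source += '[^/]*';       // '*' does not cross a directory separator
      i += 1;
    } else {
      source += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');  // escape regex metacharacters
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Returns the paths that match at least one allowed pattern and no excluded pattern.
// An empty allowed list is treated as "allow everything".
function previewSelection(paths: string[], allowed: string[], excluded: string[]): string[] {
  const allow = allowed.map(globToRegExp);
  const deny = excluded.map(globToRegExp);
  return paths.filter(p =>
    (allow.length === 0 || allow.some(r => r.test(p))) && !deny.some(r => r.test(p))
  );
}
```

Running `previewSelection` over a file listing from a local clone gives a quick estimate of sync size before any content is ingested.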

### Limit issue backfill size

Use `readLimit` on the `issue` feed:

```typescript
issue: {
  readLimit: 500
}
```

If you want to focus on a specific subset (e.g., only certain labels), filter after ingestion using `queryContents()`.

***

## Variations

### Variation 1: Multi-Repository Analysis

Analyze multiple repositories in an organization:

```typescript
const repos = [
  { owner: 'graphlit', name: 'graphlit-client-typescript' },
  { owner: 'graphlit', name: 'graphlit-client-python' },
  { owner: 'graphlit', name: 'graphlit-client-dotnet' }
];

const githubToken = process.env.GITHUB_TOKEN!;

const feeds = await Promise.all(
  repos.map(repo =>
    graphlit.createFeed({
      name: `${repo.owner}/${repo.name}`,
      type: FeedTypes.Site,
      site: {
        type: FeedServiceTypes.GitHub,
        github: {
          repositoryOwner: repo.owner,
          repositoryName: repo.name,
          personalAccessToken: githubToken
        }
      },
      workflow: { id: workflow.createWorkflow.id }  // workflow created in Step 1
    })
  )
);

// Wait for all to sync
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      feeds.map(f => graphlit.isFeedDone(f.createFeed.id))
    );
    allDone = statuses.every(s => s.isFeedDone.result);
    
    if (!allDone) {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();

// Analyze cross-repo entities
const allRepos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});

console.log(`Total repositories: ${allRepos.observables.results.length}`);
```

### Variation 2: Dependency Graph Visualization

Map software dependencies:

```typescript
// Extract all Software entities and their relationships
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

interface DependencyNode {
  name: string;
  usedBy: string[];  // Files/repos using this dependency
  version?: string;
}

const depGraph = new Map<string, DependencyNode>();

for (const dep of dependencies.observables.results) {
  const usages = await graphlit.queryContents({
    observations: [{
      type: ObservableTypes.Software,
      observable: { id: dep.observable.id }
    }]
  });
  
  depGraph.set(dep.observable.name, {
    name: dep.observable.name,
    usedBy: usages.contents.results.map(c => c.name),
    version: dep.observable.properties?.version
  });
}

// Find most common dependencies
const topDeps = Array.from(depGraph.values())
  .sort((a, b) => b.usedBy.length - a.usedBy.length)
  .slice(0, 10);

console.log('Most used dependencies:');
topDeps.forEach(dep => {
  console.log(`  ${dep.name}: ${dep.usedBy.length} files`);
});
```

### Variation 3: Issue Classification by Entities

Categorize issues by extracted entities:

```typescript
// Group issues by entity types
const issuesByEntity = new Map<string, Array<typeof issues.contents.results[0]>>();

issues.contents.results.forEach(issue => {
  issue.observations?.forEach(obs => {
    const key = `${obs.type}: ${obs.observable.name}`;
    if (!issuesByEntity.has(key)) {
      issuesByEntity.set(key, []);
    }
    issuesByEntity.get(key)!.push(issue);
  });
});

// Find entities with most issues
const entityIssueCount = Array.from(issuesByEntity.entries())
  .map(([entity, related]) => ({ entity, count: related.length }))  // `related` avoids shadowing `issues`
  .sort((a, b) => b.count - a.count);

console.log('Entities with most related issues:');
entityIssueCount.slice(0, 10).forEach(item => {
  console.log(`  ${item.entity}: ${item.count} issues`);
});
```

### Variation 4: Contributor Activity Timeline

Track contributor activity over time:

```typescript
interface ContributorActivity {
  name: string;
  firstContribution: Date;
  lastContribution: Date;
  contributions: Array<{ date: Date; type: 'file' | 'issue' }>;
}

const activity = new Map<string, ContributorActivity>();

// Track file contributions
repoFiles.contents.results.forEach(file => {
  const date = new Date(file.creationDate);
  
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!activity.has(name)) {
        activity.set(name, {
          name,
          firstContribution: date,
          lastContribution: date,
          contributions: []
        });
      }
      
      const contrib = activity.get(name)!;
      contrib.contributions.push({ date, type: 'file' });
      if (date < contrib.firstContribution) contrib.firstContribution = date;
      if (date > contrib.lastContribution) contrib.lastContribution = date;
    });
});

// Track issue contributions
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name;
  const date = new Date(issue.creationDate);
  
  if (author) {
    if (!activity.has(author)) {
      activity.set(author, {
        name: author,
        firstContribution: date,
        lastContribution: date,
        contributions: []
      });
    }
    
    const contrib = activity.get(author)!;
    contrib.contributions.push({ date, type: 'issue' });
    if (date < contrib.firstContribution) contrib.firstContribution = date;
    if (date > contrib.lastContribution) contrib.lastContribution = date;
  }
});

// Find most active contributors (by recent activity)
const recent = Array.from(activity.values())
  .sort((a, b) => b.lastContribution.getTime() - a.lastContribution.getTime())
  .slice(0, 10);

console.log('Most recently active contributors:');
recent.forEach(contrib => {
  console.log(`  ${contrib.name}: ${contrib.contributions.length} contributions`);
  console.log(`    First: ${contrib.firstContribution.toLocaleDateString()}`);
  console.log(`    Last: ${contrib.lastContribution.toLocaleDateString()}`);
});
```

### Variation 5: Cross-Repository Entity Linking

Find entities that appear across multiple repositories:

```typescript
// After syncing multiple repos, find cross-repo entities
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

for (const person of allPeople.observables.results) {
  // Find all content (across repos) mentioning this person
  const mentions = await graphlit.queryContents({
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: person.observable.id }
    }]
  });
  
  // Group by feed (repository)
  const reposMentioned = new Set(
    mentions.contents.results.map(c => c.feed?.id).filter(Boolean)
  );
  
  if (reposMentioned.size > 1) {
    console.log(`${person.observable.name}: appears in ${reposMentioned.size} repos`);
  }
}
```

***

## Common Issues & Solutions

### Issue: Large Repository, Slow Sync

**Problem**: A repository with thousands of files can take hours to sync.

**Solutions**:

1. **Scope paths**: Only sync the folders you need
2. **Exclude build output**: Skip `node_modules`, `dist`, etc.
3. **Keep readLimit conservative**: Backfill in smaller chunks

```typescript
site: {
  allowedPaths: ['README.md', 'docs/**', 'src/**'],
  excludedPaths: ['**/node_modules/**', '**/dist/**', '**/build/**'],
  readLimit: 250
}
```

### Issue: GitHub API Rate Limiting

**Problem**: Sync fails with rate limit error.

**Explanation**: The GitHub API enforces rate limits (5,000 requests per hour for authenticated requests).

**Solutions**:

1. Use an authenticated token (higher limits)
2. Sync fewer repositories simultaneously
3. Increase polling interval
4. Wait for rate limit reset
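
Solutions 2 and 3 can be sketched as a batching helper plus a polling loop whose interval grows over time. All names and default values below (`backoffDelays`, `chunk`, `pollWithBackoff`, the 5-second base and 60-second cap) are illustrative assumptions, not Graphlit SDK features.

```typescript
// Delay schedule for polling: grows geometrically from baseMs and is capped at maxMs.
function backoffDelays(attempts: number, baseMs = 5000, factor = 2, maxMs = 60000): number[] {
  const delays: number[] = [];
  let current = baseMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(current, maxMs));
    current *= factor;
  }
  return delays;
}

// Splits work into batches so that only `size` repositories are handled at once.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

// Polls `isDone` following the delay schedule; resolves to false if the schedule runs out.
async function pollWithBackoff(
  isDone: () => Promise<boolean>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms))
): Promise<boolean> {
  for (const delay of delays) {
    if (await isDone()) return true;
    await sleep(delay);
  }
  return isDone();
}
```

With these helpers, repositories could be processed one batch at a time via `chunk(repos, 2)`, and each feed polled by wrapping the `isFeedDone()` call from the complete example in the `isDone` callback.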

### Issue: Missing Dependencies from package.json

**Problem**: Software entities not extracted from package files.

**Cause**: Dependency manifests are only analyzed if they are synced, so `allowedPaths` must match them explicitly.

**Solution**: Include config file types:

```typescript
allowedPaths: ['**/package.json', '**/requirements.txt', '**/*.toml', '**/*.lock']
```

### Issue: No Contributor Entities

**Problem**: No Person entities extracted from repository.

**Causes**:

1. README doesn't list contributors
2. Code comments don't mention developers
3. Only files were synced; issues/PRs were not ingested

**Solution**: Sync more sources (like issues) and ensure your extraction types include `Person`:

```typescript
issue: {
  type: FeedServiceTypes.GitHubIssues,
  github: {
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    personalAccessToken: process.env.GITHUB_TOKEN!
  }
}
```

***

## Developer Hints

### GitHub Token Best Practices

* Use fine-grained personal access tokens when possible
* For classic tokens, the minimum scope is `repo` for private repositories or `public_repo` for public ones
* Rotate tokens regularly
* Don't commit tokens to code (use env variables)
* Monitor token usage in GitHub settings

### File Type Recommendations

**Documentation Analysis**:

```typescript
allowedPaths: ['README.md', '**/*.md', '**/*.rst', '**/*.txt', '**/*.adoc']
```

**Full Code Analysis**:

```typescript
allowedPaths: ['**/*.py', '**/*.js', '**/*.ts', '**/*.java', '**/*.go', '**/*.rs', '**/*.cpp', '**/*.c', '**/*.h']
```

**Configuration + Dependencies**:

```typescript
allowedPaths: ['**/package.json', '**/*.yaml', '**/*.yml', '**/*.toml', '**/*.lock', '**/*.txt']
```

### Performance Optimization

* Start with README + package files only
* Add more file types incrementally
* Sync issues separately (can be slow)
* Use multiple feeds for a large organization
* Cache entity queries
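
One way to cache entity queries is a small in-memory wrapper keyed by the serialized filter. This is a minimal sketch under stated assumptions: `cacheByFilter` is a hypothetical helper, the 60-second TTL is arbitrary, and keys depend on property order because they come from `JSON.stringify`.

```typescript
// Wraps an async fetcher so identical filters share one in-flight or completed request.
// Entries expire after ttlMs; failed requests are evicted so they can be retried.
function cacheByFilter<F, R>(
  fetcher: (filter: F) => Promise<R>,
  ttlMs = 60000,
  now: () => number = Date.now
): (filter: F) => Promise<R> {
  const cache = new Map<string, { expires: number; value: Promise<R> }>();

  return (filter: F) => {
    const key = JSON.stringify(filter);
    const hit = cache.get(key);
    if (hit && hit.expires > now()) return hit.value;

    const value = fetcher(filter);
    cache.set(key, { expires: now() + ttlMs, value });
    value.catch(() => {
      // Evict only if this failed request is still the cached entry.
      const current = cache.get(key);
      if (current && current.value === value) cache.delete(key);
    });
    return value;
  };
}
```

The wrapper could be placed around a call such as `queryObservables()`, so that repeated lookups for the same entity types within the TTL reuse one response.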

### Entity Quality by Source

* **High confidence**: README (explicit mentions), package.json (dependencies)
* **Medium confidence**: Code comments, documentation
* **Low confidence**: Implicit mentions in code

***

## Production Patterns

### Pattern from Graphlit Samples

`Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb`:

* Syncs public GitHub repository
* Extracts Repo, Person, Software entities
* Analyzes dependencies from package.json
* Builds contributor network
* Exports entity graph for visualization

`Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb`:

* Syncs GitHub issues
* Extracts entities from issue descriptions
* Classifies issues by entity types
* Groups related issues
* Priority ranking by entity importance

### Open Source Intelligence Use Cases

* **Dependency tracking**: Which projects use which libraries
* **Contributor analysis**: Developer activity, collaboration
* **Project relationships**: Shared contributors, common dependencies
* **Technology adoption**: Which tools and frameworks are gaining traction
* **Security analysis**: Vulnerable dependency detection

***


