Build Knowledge Graph from GitHub Repositories

User Intent

"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."

Operation

SDK Methods: createWorkflow(), createFeed() (Site, Issue, Commit, or PullRequest feeds), isFeedDone(), queryContents(), queryObservables()

GraphQL: GitHub feed creation + entity extraction + contributor/project graphs

Entity: GitHub Feed → Content (Files/Issues/Commits/PRs) → Observations → Observables (Developer/Project Graph)

Prerequisites

  • Graphlit project with API credentials

  • GitHub personal access token, or a GitHub OAuth token obtained via the Graphlit Developer Portal

  • GitHub repository access

  • Understanding of feed and workflow concepts


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  ContentTypes,
  EntityExtractionServiceTypes,
  EntityState,
  FeedServiceTypes,
  FeedTypes,
  ObservableTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from GitHub ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "GitHub Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Software,
          ObservableTypes.Category,
          ObservableTypes.Label
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
  name: "Graphlit Samples Repo",
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.GitHub,
    github: {
      repositoryOwner: 'graphlit',
      repositoryName: 'graphlit-samples',
      personalAccessToken: process.env.GITHUB_TOKEN!
    },
    allowedPaths: ['README.md', 'docs/**', 'python/**', 'nextjs/**'],
    excludedPaths: ['**/node_modules/**', '**/dist/**']
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);

// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
  name: "Graphlit Issues",
  type: FeedTypes.Issue,
  issue: {
    type: FeedServiceTypes.GitHubIssues,
    github: {
      repositoryOwner: 'graphlit',
      repositoryName: 'graphlit-samples',
      personalAccessToken: process.env.GITHUB_TOKEN!
    },
    readLimit: 100
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);

// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;

while (!repoDone || !issuesDone) {
  if (!repoDone) {
    const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
    repoDone = repoStatus.isFeedDone.result;
  }
  
  if (!issuesDone) {
    const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
    issuesDone = issuesStatus.isFeedDone.result;
  }
  
  if (!repoDone || !issuesDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
  types: [ContentTypes.File],
  feeds: [{ id: repoFeed.createFeed.id }]
});

console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);

// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
  types: [ContentTypes.Issue],
  feeds: [{ id: issuesFeed.createFeed.id }]
});

console.log(`✓ Synced ${issues.contents.results.length} issues\n`);

// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');

// Get all Repo entities
const repos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);

// Get contributors (Person entities)
const people = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);

// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);

// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');

const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
  (issue.issue?.labels || []).filter(Boolean).forEach(label => {
    labelCounts.set(label!, (labelCounts.get(label!) || 0) + 1);
  });
});

console.log('Most common issue labels:');
Array.from(labelCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([label, count]) => {
    console.log(`  ${label}: ${count} issues`);
  });

// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');

const contributors = new Map<string, {
  files: number;
  issues: number;
  total: number;
}>();

// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
  file.observations
    ?.filter(Boolean)
    .filter(obs => obs?.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs!.observable.name;
      if (!contributors.has(name)) contributors.set(name, { files: 0, issues: 0, total: 0 });
      contributors.get(name)!.files++;
      contributors.get(name)!.total++;
    });
});

// Count issues by contributor (from observations)
issues.contents.results.forEach(issue => {
  issue.observations
    ?.filter(Boolean)
    .filter(obs => obs?.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs!.observable.name;
      if (!contributors.has(name)) contributors.set(name, { files: 0, issues: 0, total: 0 });
      contributors.get(name)!.issues++;
      contributors.get(name)!.total++;
    });
});

console.log('Top contributors:');
Array.from(contributors.entries())
  .sort((a, b) => b[1].total - a[1].total)
  .slice(0, 5)
  .forEach(([name, stats]) => {
    console.log(`  ${name}: ${stats.files} files, ${stats.issues} issues`);
  });

console.log('\n✓ Repository analysis complete!');

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

GitHub-Specific Entity Types:

  • Repo: Repository as entity (name, owner, URL, description)

  • Person: Contributors, commit authors, issue creators, reviewers

  • Organization: The repository owner (when it is an organization) and any companies mentioned

  • Software: Dependencies (from package.json, requirements.txt, etc.)

  • Category: Topics, tags, project themes

  • Label: Issue/PR labels

Why Text Extraction:

  • Code files, README, docs are text-based

  • Fast and cost-effective

  • Handles markdown, code, JSON, YAML

Step 2: Configure GitHub Repository Feed

Repository Feed Options:

What Gets Synced:

  • README.md

  • Repository files matching your allowedPaths/excludedPaths

  • Documentation files

  • Configuration files (package.json, requirements.txt, etc.)

Step 3: Configure GitHub Issues Feed

Issues Feed Options:

What Gets Synced:

  • Issue title and body

  • Labels

  • Issue number/identifier

  • Created/updated dates

If you want to focus on specific labels/states, filter after ingestion using queryContents().
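A minimal sketch of that client-side filtering, assuming each result exposes its labels at issue.labels the way the complete example reads them. The IssueContent shape and the filterIssuesByLabels helper are hypothetical names used for illustration; they are not part of the SDK.

```typescript
// Minimal shape assumed from the complete example (issue.issue?.labels).
interface IssueContent {
  name?: string | null;
  issue?: { labels?: (string | null)[] | null } | null;
}

// Keep only issues carrying at least one of the wanted labels (case-insensitive).
function filterIssuesByLabels<T extends IssueContent>(issues: T[], wanted: string[]): T[] {
  const wantedSet = new Set(wanted.map(label => label.toLowerCase()));
  return issues.filter(item =>
    (item.issue?.labels ?? []).some(label => label != null && wantedSet.has(label.toLowerCase()))
  );
}

// Stand-in data; in practice pass the non-null entries of issues.contents.results.
const sampleIssues: IssueContent[] = [
  { name: 'Crash on startup', issue: { labels: ['bug', 'P1'] } },
  { name: 'Add dark mode', issue: { labels: ['enhancement'] } },
  { name: 'Untitled', issue: null }
];
const bugIssues = filterIssuesByLabels(sampleIssues, ['Bug']);
console.log(bugIssues.map(item => item.name));
```

Filtering locally keeps the feed configuration simple, at the cost of ingesting issues you later discard.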



Additional GitHub Feed Types:

You can also create feeds for:

  • GitHub Commits (FeedTypes.Commit) - Sync commit history, code changes, and developer activity

  • GitHub Pull Requests (FeedTypes.PullRequest) - Sync pull requests, reviews, and merge history

These feed types are useful for analyzing code review patterns, tracking developer contributions, and understanding project evolution over time.

See the GitHub Commits and GitHub Pull Requests feed guides for details.


Step 4: GitHub Token Setup

Creating GitHub Token:

  1. GitHub → Settings → Developer settings → Personal access tokens

  2. Generate new token (classic)

  3. Select scopes:

    • repo (for private repos) or public_repo (for public only)

    • read:org (if accessing org repos)

  4. Copy token

  5. Use in Graphlit feed creation

OR via Graphlit Developer Portal:

  1. Go to Developer Portal → Connectors → Version Control

  2. Authorize GitHub

  3. Copy OAuth token
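Whichever route produces the token, the complete example reads it from process.env.GITHUB_TOKEN with a non-null assertion, which hides a missing variable until feed creation fails. A small guard, sketched below, surfaces the problem earlier. The requireEnv helper is a hypothetical name, not an SDK function.

```typescript
// Hypothetical helper: fail fast when a required variable is missing or blank.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// In Node.js this would typically be: requireEnv('GITHUB_TOKEN', process.env)
// A stand-in environment object is used here so the sketch runs anywhere.
const githubToken = requireEnv('GITHUB_TOKEN', { GITHUB_TOKEN: ' example-token ' });
console.log(`Token loaded (${githubToken.length} characters)`);
```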

Step 5: Analyze Repository Files

File Content Structure:

README Analysis:

  • Rich source of Repo, Person, Organization entities

  • Contributors listed

  • Dependencies mentioned

  • Project description

package.json/requirements.txt Analysis:

  • Dependencies as Software entities

  • Version information

  • Project metadata

Step 6: Analyze GitHub Issues

Issue Metadata:

Entity Extraction from Issues:

  • Person: Issue author, mentioned contributors (@username)

  • Software: Tools/libraries mentioned

  • Category: Feature areas, components

  • Label: Issue labels as Label entities
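As a local cross-check on the Person entities, explicit @username mentions can be pulled from issue text with a regular expression. This is only a sketch: the pattern loosely follows GitHub's username rules (alphanumerics and hyphens, up to 39 characters), and extractMentions is a hypothetical helper that does not reflect how Graphlit's extraction works internally.

```typescript
// Collect unique @username mentions, lower-cased, in order of first appearance.
function extractMentions(text: string): string[] {
  // The leading group rejects "@" preceded by a word character, so emails are skipped.
  const pattern = /(^|[^A-Za-z0-9_])@([A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?)/g;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    seen.add(match[2].toLowerCase());
  }
  return Array.from(seen);
}

const mentions = extractMentions('Thanks @Alice and @bob-dev, cc alice@example.com and @alice.');
console.log(mentions);
```

Comparing this list with the extracted Person names is a quick way to spot mentions the model missed.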

Step 7: Build Contributor Graph

Contributors from Multiple Sources:

  1. File authors: Extracted from README, commit mentions

  2. Issue creators: From issue author field

  3. Code comments: Developers mentioned in code

  4. Documentation: Authors in docs
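Once people are collected from these sources, one simple way to turn them into a graph is to link two contributors whenever they are observed in the same file or issue. The sketch below assumes the observation shape used in the complete example (type plus observable.name). The buildCoMentionEdges helper and the 'PERSON' stand-in value are illustrative; in real code pass ObservableTypes.Person.

```typescript
interface PersonObservation { type?: string | null; observable: { name: string } }
interface ObservedContent { observations?: (PersonObservation | null)[] | null }

// Count how often each pair of people appears together in one piece of content.
function buildCoMentionEdges(contents: ObservedContent[], personType: string): Map<string, number> {
  const edges = new Map<string, number>();
  contents.forEach(content => {
    const names = Array.from(new Set(
      (content.observations ?? [])
        .filter((o): o is PersonObservation => o != null && o.type === personType)
        .map(o => o.observable.name)
    )).sort();
    for (let i = 0; i < names.length; i++) {
      for (let j = i + 1; j < names.length; j++) {
        const key = `${names[i]} <-> ${names[j]}`; // sorted names keep the key stable
        edges.set(key, (edges.get(key) ?? 0) + 1);
      }
    }
  });
  return edges;
}

// Stand-in data with invented names.
const coMentionEdges = buildCoMentionEdges([
  { observations: [
    { type: 'PERSON', observable: { name: 'Ada' } },
    { type: 'PERSON', observable: { name: 'Grace' } },
    { type: 'SOFTWARE', observable: { name: 'lib-x' } }
  ] },
  { observations: [
    { type: 'PERSON', observable: { name: 'Grace' } },
    { type: 'PERSON', observable: { name: 'Ada' } },
    { type: 'PERSON', observable: { name: 'Linus' } },
    null
  ] }
], 'PERSON');
coMentionEdges.forEach((weight, pair) => console.log(`${pair}: ${weight}`));
```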

Step 8: Dependency Analysis

Extract Software Dependencies:
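A minimal sketch of a local cross-check: parse a package.json manifest yourself and compare the declared dependencies with the names of the extracted Software entities. Both helpers are hypothetical, and the manifest text is sample data rather than content from the repository.

```typescript
interface DeclaredDependency { name: string; version: string; dev: boolean }

// Read dependencies and devDependencies from package.json text.
function parsePackageJson(text: string): DeclaredDependency[] {
  const pkg = JSON.parse(text) as {
    dependencies?: Record<string, string>;
    devDependencies?: Record<string, string>;
  };
  const toList = (dev: boolean, deps: Record<string, string> = {}): DeclaredDependency[] =>
    Object.keys(deps).map(name => ({ name, version: deps[name], dev }));
  return toList(false, pkg.dependencies).concat(toList(true, pkg.devDependencies));
}

// Declared dependencies that have no matching Software entity (case-insensitive).
function missingFromGraph(declared: DeclaredDependency[], extractedNames: string[]): string[] {
  const known = new Set(extractedNames.map(n => n.toLowerCase()));
  return declared.map(d => d.name).filter(n => !known.has(n.toLowerCase()));
}

const declaredDeps = parsePackageJson(
  '{"dependencies":{"lib-a":"^1.0.0"},"devDependencies":{"typescript":"^5.0.0"}}'
);
// In practice extractedNames would come from the Software observables query.
const notExtracted = missingFromGraph(declaredDeps, ['TypeScript']);
console.log(notExtracted);
```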


Configuration Options

Scope the repository sync

Use allowedPaths/excludedPaths on the site feed to control repository size and cost:
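Before creating the feed, it can be useful to preview which paths a given scope would keep. The sketch below uses a rough glob approximation ("**/" matches zero or more directories, "**" matches anything, "*" stays within one path segment) and treats an empty allow list as "allow everything". Both behaviors are assumptions, since Graphlit's own matching rules may differ, and the helper names are hypothetical.

```typescript
// Translate a glob into an anchored regular expression (approximate semantics).
function globToRegExp(glob: string): RegExp {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith('**/', i)) { source += '(?:.*/)?'; i += 3; }
    else if (glob.startsWith('**', i)) { source += '.*'; i += 2; }
    else if (glob.charAt(i) === '*') { source += '[^/]*'; i += 1; }
    else { source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, '\\$&'); i += 1; }
  }
  return new RegExp(`^${source}$`);
}

// Keep paths that match an allowed pattern and no excluded pattern.
function previewScope(paths: string[], allowed: string[], excluded: string[]): string[] {
  const allow = allowed.map(globToRegExp);
  const deny = excluded.map(globToRegExp);
  return paths.filter(p =>
    (allow.length === 0 || allow.some(r => r.test(p))) && !deny.some(r => r.test(p))
  );
}

// Patterns copied from the complete example; the file paths are invented.
const keptPaths = previewScope(
  ['README.md', 'docs/guide/intro.md', 'nextjs/app/node_modules/react/index.js', 'src/index.ts', 'nextjs/dist/main.js'],
  ['README.md', 'docs/**', 'python/**', 'nextjs/**'],
  ['**/node_modules/**', '**/dist/**']
);
console.log(keptPaths);
```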

Limit issue backfill size

Use readLimit on the issue feed:

If you want to focus on a specific subset (e.g., only certain labels), filter after ingestion using queryContents().


Variations

Variation 1: Multi-Repository Analysis

Analyze multiple repositories in an organization:
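A minimal sketch of preparing one feed per repository. The buildRepoFeedSpecs helper and the RepoFeedSpec type are hypothetical; the github fields mirror Step 2 of the complete example, and 'example-repo' is a placeholder repository name.

```typescript
interface RepoFeedSpec {
  name: string;
  github: { repositoryOwner: string; repositoryName: string; personalAccessToken: string };
}

// Build one spec per unique, non-empty repository name.
function buildRepoFeedSpecs(owner: string, repoNames: string[], token: string): RepoFeedSpec[] {
  const unique = Array.from(new Set(repoNames.map(r => r.trim()).filter(r => r.length > 0)));
  return unique.map(repositoryName => ({
    name: `${owner}/${repositoryName}`,
    github: { repositoryOwner: owner, repositoryName, personalAccessToken: token }
  }));
}

// Each spec would then feed a createFeed() call shaped like Step 2 of the complete example:
//   await graphlit.createFeed({ name: spec.name, type: FeedTypes.Site,
//     site: { type: FeedServiceTypes.GitHub, github: spec.github },
//     workflow: { id: workflow.createWorkflow.id } });
const repoSpecs = buildRepoFeedSpecs(
  'graphlit',
  ['graphlit-samples', 'example-repo', 'graphlit-samples', ' '],
  'example-token'
);
console.log(repoSpecs.map(spec => spec.name));
```

Creating the feeds one at a time, rather than all at once, also eases pressure on the GitHub rate limit.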

Variation 2: Dependency Graph Visualization

Map software dependencies:
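One way to visualize the result is to collect project-to-dependency edges and emit Graphviz DOT text. This is a sketch over plain data; the helper names and the sample project and library names are invented, and in practice the edges would come from the Software observations on each repository's content.

```typescript
type DependencyGraph = Map<string, Set<string>>;

// Record that a project depends on each of the given packages.
function addDependencies(graph: DependencyGraph, project: string, deps: string[]): void {
  const existing = graph.get(project) ?? new Set<string>();
  deps.forEach(dep => existing.add(dep));
  graph.set(project, existing);
}

// Serialize the graph as DOT; JSON.stringify produces valid double-quoted DOT identifiers.
function toDot(graph: DependencyGraph): string {
  const lines: string[] = ['digraph dependencies {'];
  graph.forEach((deps, project) => {
    Array.from(deps).sort().forEach(dep => {
      lines.push(`  ${JSON.stringify(project)} -> ${JSON.stringify(dep)};`);
    });
  });
  lines.push('}');
  return lines.join('\n');
}

const depGraph: DependencyGraph = new Map<string, Set<string>>();
addDependencies(depGraph, 'project-a', ['lib-y', 'lib-x']);
addDependencies(depGraph, 'project-a', ['lib-x']); // duplicates are ignored
addDependencies(depGraph, 'project-b', ['lib-x']);
const dotText = toDot(depGraph);
console.log(dotText);
```

The DOT text can be rendered with Graphviz, for example via the dot command-line tool.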

Variation 3: Issue Classification by Entities

Categorize issues by extracted entities:
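A minimal sketch of this grouping, assuming issues carry observations shaped as in the complete example. The groupIssuesByEntity helper is hypothetical, and 'SOFTWARE' stands in for ObservableTypes.Software.

```typescript
interface EntityObservation { type?: string | null; observable: { name: string } }
interface IssueLike { name: string; observations?: (EntityObservation | null)[] | null }

// Map each entity name of the given type to the issues that mention it.
function groupIssuesByEntity(issues: IssueLike[], entityType: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  issues.forEach(issue => {
    const names = new Set<string>(); // de-duplicate repeated mentions within one issue
    (issue.observations ?? []).forEach(obs => {
      if (obs && obs.type === entityType) names.add(obs.observable.name);
    });
    names.forEach(entity => {
      const bucket = groups.get(entity) ?? [];
      bucket.push(issue.name);
      groups.set(entity, bucket);
    });
  });
  return groups;
}

// Stand-in data with invented issue titles.
const issueGroups = groupIssuesByEntity([
  { name: 'Login fails', observations: [
    { type: 'SOFTWARE', observable: { name: 'OAuth' } },
    { type: 'SOFTWARE', observable: { name: 'OAuth' } }
  ] },
  { name: 'Token refresh', observations: [
    { type: 'SOFTWARE', observable: { name: 'OAuth' } },
    { type: 'SOFTWARE', observable: { name: 'Redis' } }
  ] }
], 'SOFTWARE');
issueGroups.forEach((titles, entity) => console.log(`${entity}: ${titles.join(', ')}`));
```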

Variation 4: Contributor Activity Timeline

Track contributor activity over time:
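A sketch of a monthly timeline, bucketing content by month and counting the Person observations in each bucket. It assumes each content item carries an ISO 8601 creationDate string; that field name is an assumption to verify against your query results. The helper is hypothetical and 'PERSON' stands in for ObservableTypes.Person.

```typescript
interface DatedContent {
  creationDate?: string | null; // assumed ISO 8601 timestamp
  observations?: ({ type?: string | null; observable: { name: string } } | null)[] | null;
}

// Result: month ("YYYY-MM", UTC) -> contributor name -> number of mentions.
function activityByMonth(contents: DatedContent[], personType: string): Map<string, Map<string, number>> {
  const timeline = new Map<string, Map<string, number>>();
  contents.forEach(content => {
    const time = content.creationDate ? Date.parse(content.creationDate) : NaN;
    if (isNaN(time)) return; // skip items without a usable date
    const month = new Date(time).toISOString().slice(0, 7);
    const counts = timeline.get(month) ?? new Map<string, number>();
    (content.observations ?? []).forEach(obs => {
      if (obs && obs.type === personType) {
        counts.set(obs.observable.name, (counts.get(obs.observable.name) ?? 0) + 1);
      }
    });
    timeline.set(month, counts);
  });
  return timeline;
}

// Stand-in data with invented names and dates.
const activityTimeline = activityByMonth([
  { creationDate: '2024-09-03T12:00:00Z', observations: [{ type: 'PERSON', observable: { name: 'Ada' } }] },
  { creationDate: '2024-09-20T08:30:00Z', observations: [{ type: 'PERSON', observable: { name: 'Ada' } }] },
  { creationDate: '2024-10-01T00:00:00Z', observations: [
    { type: 'PERSON', observable: { name: 'Ada' } },
    { type: 'PERSON', observable: { name: 'Grace' } }
  ] },
  { creationDate: null, observations: [{ type: 'PERSON', observable: { name: 'Linus' } }] }
], 'PERSON');
activityTimeline.forEach((counts, month) => console.log(month, Array.from(counts.entries())));
```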

Variation 5: Cross-Repository Entity Linking

Find entities that appear across multiple repositories:
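A minimal sketch, assuming you have already collected the extracted entity names per repository (for example, one queryContents() call per feed). Names are compared case-insensitively, which is a simplification, and findSharedEntities is a hypothetical helper.

```typescript
// Return entities seen in at least minRepos repositories, with the repositories that mention them.
function findSharedEntities(entitiesByRepo: Map<string, string[]>, minRepos = 2): Map<string, string[]> {
  const reposByEntity = new Map<string, Set<string>>();
  entitiesByRepo.forEach((names, repo) => {
    names.forEach(raw => {
      const key = raw.trim().toLowerCase();
      if (!key) return; // ignore blank names
      const repos = reposByEntity.get(key) ?? new Set<string>();
      repos.add(repo);
      reposByEntity.set(key, repos);
    });
  });
  const shared = new Map<string, string[]>();
  reposByEntity.forEach((repos, entity) => {
    if (repos.size >= minRepos) shared.set(entity, Array.from(repos).sort());
  });
  return shared;
}

// Stand-in data with invented repository names.
const sharedEntities = findSharedEntities(new Map<string, string[]>([
  ['repo-a', ['React', 'Ada Lovelace']],
  ['repo-b', ['react', 'TypeScript']],
  ['repo-c', ['typescript', ' ']]
]));
sharedEntities.forEach((repos, entity) => console.log(`${entity}: ${repos.join(', ')}`));
```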


Common Issues & Solutions

Issue: Large Repository, Slow Sync

Problem: A repository with thousands of files takes hours to sync.

Solutions:

  1. Scope paths: Only sync the folders you need

  2. Exclude build output: Skip node_modules, dist, etc.

  3. Keep readLimit conservative: Backfill in smaller chunks

Issue: GitHub API Rate Limiting

Problem: Sync fails with rate limit error.

Explanation: The GitHub REST API enforces rate limits (5,000 requests per hour for authenticated requests, far fewer for unauthenticated ones).

Solutions:

  1. Use authenticated token (higher limits)

  2. Sync fewer repositories simultaneously

  3. Increase polling interval

  4. Wait for rate limit reset
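For the "increase polling interval" suggestion, one option is to replace the fixed 5-second wait in Step 4 with a capped exponential backoff. The sketch below is an assumption about how you might structure it: both helpers are hypothetical, and the isDone callback is where a call such as graphlit.isFeedDone(feedId) would go.

```typescript
// Delay schedule: baseMs, baseMs * factor, ... capped at maxMs.
function backoffDelays(baseMs: number, factor: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(maxMs, Math.round(baseMs * Math.pow(factor, i))));
  }
  return delays;
}

// Check, wait, and repeat; resolves to the final completion state.
async function pollUntilDone(isDone: () => Promise<boolean>, delays: number[]): Promise<boolean> {
  for (let i = 0; i < delays.length; i++) {
    if (await isDone()) return true;
    await new Promise<void>(resolve => setTimeout(resolve, delays[i]));
  }
  return isDone();
}

const pollSchedule = backoffDelays(5000, 2, 60000, 5);
console.log(pollSchedule);
// Usage sketch, following Step 4 of the complete example:
//   const done = await pollUntilDone(
//     async () => (await graphlit.isFeedDone(feedId)).isFeedDone.result === true,
//     pollSchedule
//   );
```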

Issue: Missing Dependencies from package.json

Problem: Software entities not extracted from package files.

Cause: Configuration files such as package.json were not included in the synced paths.

Solution: Include config file types:
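To confirm this cause, a quick check is to scan the synced file names for known manifests. The list of manifest names and the findManifests helper below are illustrative assumptions; extend the list for your ecosystem.

```typescript
// Common dependency manifests; illustrative, not exhaustive.
const MANIFEST_NAMES = ['package.json', 'requirements.txt', 'pyproject.toml', 'go.mod', 'Cargo.toml', 'pom.xml'];

// Return the paths whose final segment is a known manifest file name.
function findManifests(filePaths: string[]): string[] {
  return filePaths.filter(path => {
    const base = path.split('/').pop() ?? '';
    return MANIFEST_NAMES.indexOf(base) !== -1;
  });
}

// Stand-in paths; in practice use the names or paths of the synced file contents.
const syncedManifests = findManifests([
  'README.md',
  'package.json',
  'python/requirements.txt',
  'docs/package.json.md'
]);
console.log(syncedManifests);
```

If the result is empty, add the manifest paths to allowedPaths on the site feed and sync again.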

Issue: No Contributor Entities

Problem: No Person entities extracted from repository.

Causes:

  1. README doesn't list contributors

  2. Code comments don't mention developers

  3. Only files were synced; issues/PRs were not ingested

Solution: Sync more sources (like issues) and ensure your extraction types include Person:


Developer Hints

GitHub Token Best Practices

  • Use fine-grained tokens (new GitHub feature) when possible

  • Minimum scope: repo for private, public_repo for public

  • Rotate tokens regularly

  • Don't commit tokens to code (use env variables)

  • Monitor token usage in GitHub settings

File Type Recommendations

Documentation Analysis:

Full Code Analysis:

Configuration + Dependencies:
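One way to express these three profiles is as reusable allowedPaths presets. The patterns below are illustrative assumptions modeled on the glob style in the complete example, not values taken from Graphlit's documentation, so verify that your feed accepts them. The pathPresets object and the scopeFor helper are hypothetical.

```typescript
// Illustrative presets only; the glob patterns are assumptions.
const pathPresets = {
  documentation: ['README.md', 'docs/**', '**/*.md'],
  fullCode: ['src/**', 'python/**', 'nextjs/**'],
  configAndDependencies: ['package.json', 'requirements.txt', '**/package.json', '**/requirements.txt']
};

// Exclusions copied from the complete example.
const commonExclusions = ['**/node_modules/**', '**/dist/**'];

// Build the path-scoping part of a site feed for one preset.
function scopeFor(preset: keyof typeof pathPresets): { allowedPaths: string[]; excludedPaths: string[] } {
  return {
    allowedPaths: pathPresets[preset].slice(), // copies, so callers can modify them safely
    excludedPaths: commonExclusions.slice()
  };
}

const docsScope = scopeFor('documentation');
console.log(docsScope);
```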

Performance Optimization

  • Start with README + package files only

  • Add more file types incrementally

  • Sync issues separately (can be slow)

  • Use multiple feeds for large organizations

  • Cache entity queries
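For the last point, a small time-based cache is often enough. The sketch below is a generic in-memory cache with an injectable clock; TtlCache is a hypothetical class rather than an SDK feature, and the right expiry time depends on how often your feeds bring in new content.

```typescript
// Minimal in-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // drop stale entries lazily
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Usage sketch: key the cache by entity type and store the query result.
let fakeClock = 0;
const entityCache = new TtlCache<number>(1000, () => fakeClock);
entityCache.set('PERSON', 3);
console.log(entityCache.get('PERSON'));
```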

Entity Quality by Source

  • High confidence: README (explicit mentions), package.json (dependencies)

  • Medium confidence: Code comments, documentation

  • Low confidence: Implicit mentions in code


Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:

  • Syncs public GitHub repository

  • Extracts Repo, Person, Software entities

  • Analyzes dependencies from package.json

  • Builds contributor network

  • Exports entity graph for visualization

Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:

  • Syncs GitHub issues

  • Extracts entities from issue descriptions

  • Classifies issues by entity types

  • Groups related issues

  • Priority ranking by entity importance

Open Source Intelligence Use Cases

  • Dependency tracking: Which projects use which libraries

  • Contributor analysis: Developer activity, collaboration

  • Project relationships: Shared contributors, common dependencies

  • Technology adoption: Which tools and frameworks are gaining traction

  • Security analysis: Vulnerable dependency detection

