Build Knowledge Graph from GitHub Repositories

User Intent

"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."

Operation

SDK Methods: createWorkflow(), createFeed() (Site, Issue, Commit, or PullRequest feeds), isFeedDone(), queryContents(), queryObservables()
GraphQL: GitHub feed creation + entity extraction + contributor/project graphs
Entity: GitHub Feed → Content (Files/Issues/Commits/PRs) → Observations → Observables (Developer/Project Graph)

Prerequisites

Graphlit project with API credentials
GitHub personal access token (or an OAuth token authorized via the Graphlit Developer Portal)
GitHub repository access
Understanding of feed and workflow concepts

Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FeedTypes,
  FeedServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  ContentTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from GitHub ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "GitHub Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,            // Repository itself
          ObservableTypes.Person,          // Contributors, authors
          ObservableTypes.Organization,    // Organizations, companies
          ObservableTypes.Software,        // Dependencies, tools
          ObservableTypes.Category,        // Topics, tags
          ObservableTypes.Label            // Issue labels
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
  name: "Graphlit Samples Repo",
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.GitHub,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeRepositoryMetadata: true,
    includeFiles: true,
    fileTypes: ['md', 'py', 'ts', 'json', 'yaml']  // Specific file types
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);

// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
  name: "Graphlit Issues",
  type: FeedTypes.Issue,
  issue: {
    type: FeedServiceTypes.GitHubIssues,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeOpen: true,
    includeClosed: false,
    readLimit: 100
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);

// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;

while (!repoDone || !issuesDone) {
  if (!repoDone) {
    const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
    repoDone = repoStatus.isFeedDone.result;
  }
  
  if (!issuesDone) {
    const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
    issuesDone = issuesStatus.isFeedDone.result;
  }
  
  if (!repoDone || !issuesDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
  filter: {
    feeds: [{ id: repoFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);

// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Issue],
    feeds: [{ id: issuesFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${issues.contents.results.length} issues\n`);

// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');

// Get all Repo entities
const repos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);

// Get contributors (Person entities)
const people = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);

// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);

// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');

const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
  const labels = issue.issue?.labels || [];
  labels.forEach(label => {
    labelCounts.set(label, (labelCounts.get(label) || 0) + 1);
  });
});

console.log('Most common issue labels:');
Array.from(labelCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([label, count]) => {
    console.log(`  ${label}: ${count} issues`);
  });

// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');

const contributors = new Map<string, {
  files: number;
  issues: number;
  total: number;
}>();

// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!contributors.has(name)) {
        contributors.set(name, { files: 0, issues: 0, total: 0 });
      }
      contributors.get(name)!.files++;
      contributors.get(name)!.total++;
    });
});

// Count issues by contributor
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name || 'Unknown';
  if (!contributors.has(author)) {
    contributors.set(author, { files: 0, issues: 0, total: 0 });
  }
  contributors.get(author)!.issues++;
  contributors.get(author)!.total++;
});

console.log('Top contributors:');
Array.from(contributors.entries())
  .sort((a, b) => b[1].total - a[1].total)
  .slice(0, 5)
  .forEach(([name, stats]) => {
    console.log(`  ${name}: ${stats.files} files, ${stats.issues} issues`);
  });

console.log('\n✓ Repository analysis complete!');

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

GitHub-Specific Entity Types:

Repo: Repository as entity (name, owner, URL, description)
Person: Contributors, commit authors, issue creators, reviewers
Organization: Repository owner if organization, mentioned companies
Software: Dependencies (from package.json, requirements.txt, etc.)
Category: Topics, tags, project themes
Label: Issue/PR labels

Why Text Extraction:

Code files, README, docs are text-based
Fast and cost-effective
Handles markdown, code, JSON, YAML

Step 2: Configure GitHub Repository Feed

Repository Feed Options:

site: {
  type: FeedServiceTypes.GitHub,
  token: githubToken,                  // GitHub personal access token
  repositoryOwner: 'graphlit',         // Owner username or org
  repositoryName: 'graphlit-samples',  // Repository name
  
  includeRepositoryMetadata: true,     // Include repo description, topics
  includeFiles: true,                  // Sync files from repo
  
  fileTypes: ['md', 'py', 'ts', 'json', 'yaml'],  // Specific file extensions
  // Or sync all files:
  // fileTypes: []  // Empty = all files
  
  branch: 'main'  // Optional: specific branch (default: default branch)
}

What Gets Synced:

README.md
Source code files (filtered by fileTypes)
Documentation files
Configuration files (package.json, requirements.txt, etc.)
Repository metadata (description, topics, contributors)

Step 3: Configure GitHub Issues Feed

Issues Feed Options:

issue: {
  type: FeedServiceTypes.GitHubIssues,
  token: githubToken,
  repositoryOwner: 'graphlit',
  repositoryName: 'graphlit-samples',
  
  includeOpen: true,                   // Sync open issues
  includeClosed: false,                // Skip closed issues
  readLimit: 100,                      // Max issues to sync
  
  // Optional: filter by labels
  labels: ['bug', 'feature', 'documentation']
}

What Gets Synced:

Issue title and body
Issue author
Labels
Issue number/identifier
Created/updated dates
Comments (optional)

Additional GitHub Feed Types:

You can also create feeds for:

GitHub Commits (FeedTypes.Commit) - Sync commit history, code changes, and developer activity
GitHub Pull Requests (FeedTypes.PullRequest) - Sync pull requests, reviews, and merge history

These feed types are useful for analyzing code review patterns, tracking developer contributions, and understanding project evolution over time.

See the GitHub Commits and GitHub Pull Requests feed guides for details.

Step 4: GitHub Token Setup

Creating GitHub Token:

GitHub → Settings → Developer settings → Personal access tokens
Generate new token (classic)
Select scopes:
- repo (for private repos) or public_repo (for public only)
- read:org (if accessing org repos)
Copy token
Use in Graphlit feed creation

OR via Graphlit Developer Portal:

Go to Developer Portal → Connectors → Version Control
Authorize GitHub
Copy OAuth token

Step 5: Analyze Repository Files

File Content Structure:

const file = await graphlit.getContent(fileId);

console.log(`File: ${file.content.name}`);
console.log(`Type: ${file.content.fileType}`);
console.log(`Path: ${file.content.uri}`);

// Extracted entities from file
file.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
});

README Analysis:

Rich source of Repo, Person, Organization entities
Contributors listed
Dependencies mentioned
Project description

package.json/requirements.txt Analysis:

Dependencies as Software entities
Version information
Project metadata
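
To sanity-check the Software entities extracted from a manifest, it can help to list the declared dependencies directly and compare names. The following is a minimal sketch in plain TypeScript, not part of the Graphlit SDK: the helper name listDeclaredDependencies and the DeclaredDependency shape are hypothetical, and it assumes only the standard dependencies and devDependencies fields of package.json.

```typescript
// Hypothetical helper (not a Graphlit API): lists the dependencies
// declared in the text of a package.json file.
interface DeclaredDependency {
  name: string;
  version: string; // version range exactly as written in the manifest
  dev: boolean;    // true when declared under devDependencies
}

function listDeclaredDependencies(packageJsonText: string): DeclaredDependency[] {
  const manifest = JSON.parse(packageJsonText) as {
    dependencies?: Record<string, string>;
    devDependencies?: Record<string, string>;
  };

  const toList = (
    deps: Record<string, string> | undefined,
    dev: boolean
  ): DeclaredDependency[] =>
    Object.keys(deps || {}).map(name => ({ name, version: (deps || {})[name], dev }));

  // Runtime dependencies first, then dev dependencies, sorted by name overall.
  return toList(manifest.dependencies, false)
    .concat(toList(manifest.devDependencies, true))
    .sort((a, b) => a.name.localeCompare(b.name));
}
```

The resulting names can then be compared against the names returned by queryObservables() for ObservableTypes.Software to spot dependencies the extraction missed.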

Step 6: Analyze GitHub Issues

Issue Metadata:

issue: {
  identifier: "42",                    // Issue number
  title: "Add feature X",
  project: "graphlit-samples",
  status: "Open",
  priority: "High",
  labels: ["feature", "enhancement"],
  author: {
    name: "Kirk Marple",
    email: "[email protected]"
  }
}

Entity Extraction from Issues:

Person: Issue author, mentioned contributors (@username)
Software: Tools/libraries mentioned
Category: Feature areas, components
Label: Issue labels as Label entities

Step 7: Build Contributor Graph

Contributors from Multiple Sources:

File authors: Extracted from README, commit mentions
Issue creators: From issue author field
Code comments: Developers mentioned in code
Documentation: Authors in docs

// Deduplicate contributors
const uniqueContributors = new Map<string, {
  observableId: string;
  name: string;
  email?: string;
  contributions: number;
}>();

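// Combine content from all sources (repository files and issues queried earlier)
const allContent = [...repoFiles.contents.results, ...issues.contents.results];

allContent.forEach(content => {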
  content.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      if (!uniqueContributors.has(obs.observable.id)) {
        uniqueContributors.set(obs.observable.id, {
          observableId: obs.observable.id,
          name: obs.observable.name,
          email: obs.observable.properties?.email,
          contributions: 0
        });
      }
      uniqueContributors.get(obs.observable.id)!.contributions++;
    });
});

Step 8: Dependency Analysis

Extract Software Dependencies:

// Get all Software entities
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

// Find which files reference each dependency
for (const dep of dependencies.observables.results) {
  const references = await graphlit.queryContents({
    filter: {
      feeds: [{ id: repoFeed.createFeed.id }],
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  console.log(`${dep.observable.name}: ${references.contents.results.length} files`);
}
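
The loop above issues one queryContents() call per dependency, one after another, which can be slow when many Software entities exist. One option is to run the lookups with a small concurrency cap. The sketch below is a generic helper, not part of the Graphlit SDK; the name mapWithConcurrency and its signature are assumptions made for illustration.

```typescript
// Hypothetical helper: maps items through an async function with at most
// `limit` calls in flight, returning results in the original item order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to claim

  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++; // claim an index before awaiting
      results[index] = await fn(items[index], index);
    }
  };

  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

For example, mapWithConcurrency(dependencies.observables.results, 4, dep => graphlit.queryContents({ ... })) would keep at most four queries in flight; the right limit depends on your project's quotas.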

Configuration Options

Selective File Syncing

By File Extension:

fileTypes: ['md', 'py', 'ts', 'js', 'json', 'yaml']

All Files:

fileTypes: []  // Empty = sync all files

Common Patterns:

// Documentation only
fileTypes: ['md', 'rst', 'txt']

// Source code only
fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs']

// Configuration files
fileTypes: ['json', 'yaml', 'yml', 'toml', 'ini']

Branch Selection

Specific Branch:

site: {
  branch: 'develop'  // Sync specific branch
}

Default Branch:

site: {
  branch: undefined  // Uses repo's default branch
}

Issue Filtering

By State:

issue: {
  includeOpen: true,
  includeClosed: true  // Include both open and closed
}

By Labels:

issue: {
  labels: ['bug', 'security', 'high-priority']  // Only these labels
}

By Count:

issue: {
  readLimit: 500  // Most recent 500 issues
}

Variations

Variation 1: Multi-Repository Analysis

Analyze multiple repositories in an organization:

const repos = [
  { owner: 'graphlit', name: 'graphlit-client-typescript' },
  { owner: 'graphlit', name: 'graphlit-client-python' },
  { owner: 'graphlit', name: 'graphlit-client-dotnet' }
];

const feeds = await Promise.all(
  repos.map(repo =>
    graphlit.createFeed({
      name: `${repo.owner}/${repo.name}`,
      type: FeedTypes.Site,
      site: {
        type: FeedServiceTypes.GitHub,
        token: githubToken,
        repositoryOwner: repo.owner,
        repositoryName: repo.name
      },
      workflow: { id: workflowId }
    })
  )
);

// Wait for all to sync
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      feeds.map(f => graphlit.isFeedDone(f.createFeed.id))
    );
    allDone = statuses.every(s => s.isFeedDone.result);
    
    if (!allDone) {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();

// Analyze cross-repo entities
const allRepos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});

console.log(`Total repositories: ${allRepos.observables.results.length}`);

Variation 2: Dependency Graph Visualization

Map software dependencies:

// Extract all Software entities and their relationships
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

interface DependencyNode {
  name: string;
  usedBy: string[];  // Files/repos using this dependency
  version?: string;
}

const depGraph = new Map<string, DependencyNode>();

for (const dep of dependencies.observables.results) {
  const usages = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  depGraph.set(dep.observable.name, {
    name: dep.observable.name,
    usedBy: usages.contents.results.map(c => c.name),
    version: dep.observable.properties?.version
  });
}

// Find most common dependencies
const topDeps = Array.from(depGraph.values())
  .sort((a, b) => b.usedBy.length - a.usedBy.length)
  .slice(0, 10);

console.log('Most used dependencies:');
topDeps.forEach(dep => {
  console.log(`  ${dep.name}: ${dep.usedBy.length} files`);
});
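
To actually visualize the dependency map, one option is to export it as Graphviz DOT text and render it with any DOT-compatible tool. The sketch below assumes the DependencyNode shape from the snippet above; the function name toDot is hypothetical and nothing here is part of the Graphlit SDK.

```typescript
// Same shape as the DependencyNode interface used above.
interface DependencyNode {
  name: string;
  usedBy: string[];
  version?: string;
}

// Hypothetical helper: renders the dependency map as Graphviz DOT text,
// with one edge per "file -> dependency" usage.
function toDot(graph: Map<string, DependencyNode>): string {
  // Quote an identifier for DOT, escaping backslashes and double quotes.
  const quote = (s: string): string =>
    '"' + s.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';

  const lines: string[] = ['digraph dependencies {'];

  graph.forEach(node => {
    const label = node.version ? `${node.name}@${node.version}` : node.name;
    lines.push(`  ${quote(node.name)} [label=${quote(label)}];`);
    node.usedBy.forEach(user => {
      lines.push(`  ${quote(user)} -> ${quote(node.name)};`);
    });
  });

  lines.push('}');
  return lines.join('\n');
}
```

Calling toDot(depGraph) and saving the result to a .dot file gives a graph that can be rendered with, for example, the Graphviz dot command.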

Variation 3: Issue Classification by Entities

Categorize issues by extracted entities:

// Group issues by entity types
const issuesByEntity = new Map<string, Array<typeof issues.contents.results[0]>>();

issues.contents.results.forEach(issue => {
  issue.observations?.forEach(obs => {
    const key = `${obs.type}: ${obs.observable.name}`;
    if (!issuesByEntity.has(key)) {
      issuesByEntity.set(key, []);
    }
    issuesByEntity.get(key)!.push(issue);
  });
});

// Find entities with most issues
const entityIssueCount = Array.from(issuesByEntity.entries())
  .map(([entity, relatedIssues]) => ({ entity, count: relatedIssues.length }))
  .sort((a, b) => b.count - a.count);

console.log('Entities with most related issues:');
entityIssueCount.slice(0, 10).forEach(item => {
  console.log(`  ${item.entity}: ${item.count} issues`);
});

Variation 4: Contributor Activity Timeline

Track contributor activity over time:

interface ContributorActivity {
  name: string;
  firstContribution: Date;
  lastContribution: Date;
  contributions: Array<{ date: Date; type: 'file' | 'issue' }>;
}

const activity = new Map<string, ContributorActivity>();

// Track file contributions
repoFiles.contents.results.forEach(file => {
  const date = new Date(file.creationDate);
  
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!activity.has(name)) {
        activity.set(name, {
          name,
          firstContribution: date,
          lastContribution: date,
          contributions: []
        });
      }
      
      const contrib = activity.get(name)!;
      contrib.contributions.push({ date, type: 'file' });
      if (date < contrib.firstContribution) contrib.firstContribution = date;
      if (date > contrib.lastContribution) contrib.lastContribution = date;
    });
});

// Track issue contributions
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name;
  const date = new Date(issue.creationDate);
  
  if (author) {
    if (!activity.has(author)) {
      activity.set(author, {
        name: author,
        firstContribution: date,
        lastContribution: date,
        contributions: []
      });
    }
    
    const contrib = activity.get(author)!;
    contrib.contributions.push({ date, type: 'issue' });
    if (date < contrib.firstContribution) contrib.firstContribution = date;
    if (date > contrib.lastContribution) contrib.lastContribution = date;
  }
});

// Find most active contributors (by recent activity)
const recent = Array.from(activity.values())
  .sort((a, b) => b.lastContribution.getTime() - a.lastContribution.getTime())
  .slice(0, 10);

console.log('Most recently active contributors:');
recent.forEach(contrib => {
  console.log(`  ${contrib.name}: ${contrib.contributions.length} contributions`);
  console.log(`    First: ${contrib.firstContribution.toLocaleDateString()}`);
  console.log(`    Last: ${contrib.lastContribution.toLocaleDateString()}`);
});
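
To turn the per-contributor dates into an actual timeline, the contributions can be bucketed by calendar month. This is a minimal sketch under the assumption that month-level granularity in UTC is enough; the helper name contributionsByMonth is hypothetical.

```typescript
// Hypothetical helper: counts contributions per calendar month (UTC),
// keyed as "YYYY-MM". Invalid dates are skipped.
function contributionsByMonth(dates: Date[]): Map<string, number> {
  const counts = new Map<string, number>();
  dates.forEach(date => {
    if (isNaN(date.getTime())) return; // toISOString() would throw on an invalid date
    const key = date.toISOString().slice(0, 7);
    counts.set(key, (counts.get(key) || 0) + 1);
  });
  return counts;
}
```

With the activity map above, contributionsByMonth(contrib.contributions.map(c => c.date)) yields one count per month for a single contributor.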

Variation 5: Cross-Repository Entity Linking

Find entities that appear across multiple repositories:

// After syncing multiple repos, find cross-repo entities
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

for (const person of allPeople.observables.results) {
  // Find all content (across repos) mentioning this person
  const mentions = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Person,
        observable: { id: person.observable.id }
      }]
    }
  });
  
  // Group by feed (repository)
  const reposMentioned = new Set(
    mentions.contents.results.map(c => c.feedId)
  );
  
  if (reposMentioned.size > 1) {
    console.log(`${person.observable.name}: appears in ${reposMentioned.size} repos`);
  }
}

Common Issues & Solutions

Issue: Large Repository, Slow Sync

Problem: A repository with thousands of files can take hours to sync.

Solutions:

Filter file types: Only sync relevant files
Skip large files: Binary files slow down processing
Use branch: Sync specific branch instead of all branches

site: {
  fileTypes: ['md', 'py', 'js'],  // Skip images, binaries
  branch: 'main'                   // Single branch only
}

Issue: GitHub API Rate Limiting

Problem: Sync fails with rate limit error.

Explanation: The GitHub REST API is rate limited (5,000 requests per hour for requests authenticated with a personal access token).

Solutions:

Use authenticated token (higher limits)
Sync fewer repositories simultaneously
Increase polling interval
Wait for rate limit reset
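
One way to increase the polling interval without hand-tuning it is to back off exponentially between status checks and give up after a deadline. The sketch below is a generic helper, not a Graphlit API: the name pollUntilDone, its option names, and the default delays are all assumptions.

```typescript
// Hypothetical helper: polls `check` until it resolves true, doubling the
// delay after each attempt (capped at maxDelayMs). Resolves false on timeout.
async function pollUntilDone(
  check: () => Promise<boolean>,
  options: { initialDelayMs?: number; maxDelayMs?: number; timeoutMs?: number } = {}
): Promise<boolean> {
  const initialDelayMs = options.initialDelayMs !== undefined ? options.initialDelayMs : 5000;
  const maxDelayMs = options.maxDelayMs !== undefined ? options.maxDelayMs : 60000;
  const timeoutMs = options.timeoutMs !== undefined ? options.timeoutMs : 30 * 60 * 1000;

  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;

  while (true) {
    if (await check()) return true;
    if (Date.now() + delay > deadline) return false; // give up rather than poll forever
    await new Promise<void>(resolve => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs);
  }
}
```

A feed could then be awaited with something like pollUntilDone(async () => Boolean((await graphlit.isFeedDone(feedId)).isFeedDone.result)), where feedId is the id returned by createFeed().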

Issue: Missing Dependencies from package.json

Problem: Software entities not extracted from package files.

Cause: The fileTypes filter excludes the configuration files that declare dependencies.

Solution: Include config file types:

fileTypes: ['json', 'yaml', 'toml', 'lock']

Issue: No Contributor Entities

Problem: No Person entities extracted from repository.

Causes:

README doesn't list contributors
Code comments don't mention developers
Repository metadata is not enabled on the feed

Solution: Enable metadata and sync issues:

site: {
  includeRepositoryMetadata: true  // Includes contributor info
}

// Also sync issues for issue authors
issue: {
  type: FeedServiceTypes.GitHubIssues,
  includeOpen: true
}

Developer Hints

GitHub Token Best Practices

Use fine-grained personal access tokens when possible
Minimum scope for classic tokens: repo for private repositories, public_repo for public ones
Rotate tokens regularly
Don't commit tokens to code (use env variables)
Monitor token usage in GitHub settings
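
As a small illustration of keeping the token out of source code, the sketch below reads it from an environment-like object and fails fast when it is missing. The helper name requireEnv is hypothetical; the variable name GITHUB_TOKEN matches the complete example above.

```typescript
// Hypothetical helper: reads a required secret from an environment-like
// object and throws a clear error instead of passing undefined along.
function requireEnv(env: Record<string, string | undefined>, key: string): string {
  const raw = env[key];
  const value = raw === undefined ? '' : raw.trim();
  if (value === '') {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}
```

In a Node.js script this would be called as requireEnv(process.env, 'GITHUB_TOKEN') before creating any feeds, so a missing token surfaces immediately rather than as a failed sync.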

File Type Recommendations

Documentation Analysis:

fileTypes: ['md', 'rst', 'txt', 'adoc']

Full Code Analysis:

fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs', 'cpp', 'c', 'h']

Configuration + Dependencies:

fileTypes: ['json', 'yaml', 'yml', 'toml', 'lock', 'txt']
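
When a feed needs more than one of these groups, the lists can be merged without repeating extensions. The following sketch simply mirrors the three lists above; the names FILE_TYPE_PRESETS and combineFileTypes are hypothetical and not part of the Graphlit SDK.

```typescript
// Hypothetical presets mirroring the recommendations above.
const FILE_TYPE_PRESETS = {
  docs: ['md', 'rst', 'txt', 'adoc'],
  code: ['py', 'js', 'ts', 'java', 'go', 'rs', 'cpp', 'c', 'h'],
  config: ['json', 'yaml', 'yml', 'toml', 'lock', 'txt'],
};

type PresetName = keyof typeof FILE_TYPE_PRESETS;

// Merges presets into one extension list, preserving order and dropping duplicates.
function combineFileTypes(...presets: PresetName[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  presets.forEach(preset => {
    FILE_TYPE_PRESETS[preset].forEach(ext => {
      if (!seen.has(ext)) {
        seen.add(ext);
        result.push(ext);
      }
    });
  });
  return result;
}
```

For instance, combineFileTypes('docs', 'config') could be passed as the fileTypes value when starting with README and package files only.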

Performance Optimization

Start with README + package files only
Add more file types incrementally
Sync issues in a separate feed (issue sync can be slow)
Use multiple feeds for a large organization
Cache entity queries

Entity Quality by Source

High confidence: README (explicit mentions), package.json (dependencies)
Medium confidence: Code comments, documentation
Low confidence: Implicit mentions in code

Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:

Syncs public GitHub repository
Extracts Repo, Person, Software entities
Analyzes dependencies from package.json
Builds contributor network
Exports entity graph for visualization

Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:

Syncs GitHub issues
Extracts entities from issue descriptions
Classifies issues by entity types
Groups related issues
Priority ranking by entity importance

Open Source Intelligence Use Cases

Dependency tracking: Which projects use which libraries
Contributor analysis: Developer activity, collaboration
Project relationships: Shared contributors, common dependencies
Technology adoption: Which tools and frameworks are gaining traction
Security analysis: Vulnerable dependency detection
