Build Knowledge Graph from GitHub Repositories

User Intent

"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."

Operation

  • SDK Methods: createWorkflow(), createFeed() (Site, Issue, Commit, or PullRequest feeds), isFeedDone(), queryContents(), queryObservables()

  • GraphQL: GitHub feed creation + entity extraction + contributor/project graphs

  • Entity flow: GitHub Feed → Content (Files/Issues/Commits/PRs) → Observations → Observables (Developer/Project Graph)

Prerequisites

  • Graphlit project with API credentials

  • GitHub personal access token (or OAuth token via the Graphlit Developer Portal)

  • GitHub repository access

  • Understanding of feed and workflow concepts


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  ContentTypes,
  EntityExtractionServiceTypes,
  FeedServiceTypes,
  FeedTypes,
  ObservableTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from GitHub ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "GitHub Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,            // Repository itself
          ObservableTypes.Person,          // Contributors, authors
          ObservableTypes.Organization,    // Organizations, companies
          ObservableTypes.Software,        // Dependencies, tools
          ObservableTypes.Category,        // Topics, tags
          ObservableTypes.Label            // Issue labels
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
  name: "Graphlit Samples Repo",
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.GitHub,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeRepositoryMetadata: true,
    includeFiles: true,
    fileTypes: ['md', 'py', 'ts', 'json', 'yaml']  // Specific file types
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);

// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
  name: "Graphlit Issues",
  type: FeedTypes.Issue,
  issue: {
    type: FeedServiceTypes.GitHubIssues,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeOpen: true,
    includeClosed: false,
    readLimit: 100
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);

// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;

while (!repoDone || !issuesDone) {
  if (!repoDone) {
    const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
    repoDone = repoStatus.isFeedDone.result;
  }
  
  if (!issuesDone) {
    const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
    issuesDone = issuesStatus.isFeedDone.result;
  }
  
  if (!repoDone || !issuesDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
  filter: {
    feeds: [{ id: repoFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);

// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Issue],
    feeds: [{ id: issuesFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${issues.contents.results.length} issues\n`);

// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');

// Get all Repo entities
const repos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);

// Get contributors (Person entities)
const people = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);

// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);

// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');

const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
  const labels = issue.issue?.labels || [];
  labels.forEach(label => {
    labelCounts.set(label, (labelCounts.get(label) || 0) + 1);
  });
});

console.log('Most common issue labels:');
Array.from(labelCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([label, count]) => {
    console.log(`  ${label}: ${count} issues`);
  });

// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');

const contributors = new Map<string, {
  files: number;
  issues: number;
  total: number;
}>();

// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!contributors.has(name)) {
        contributors.set(name, { files: 0, issues: 0, total: 0 });
      }
      contributors.get(name)!.files++;
      contributors.get(name)!.total++;
    });
});

// Count issues by contributor
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name || 'Unknown';
  if (!contributors.has(author)) {
    contributors.set(author, { files: 0, issues: 0, total: 0 });
  }
  contributors.get(author)!.issues++;
  contributors.get(author)!.total++;
});

console.log('Top contributors:');
Array.from(contributors.entries())
  .sort((a, b) => b[1].total - a[1].total)
  .slice(0, 5)
  .forEach(([name, stats]) => {
    console.log(`  ${name}: ${stats.files} files, ${stats.issues} issues`);
  });

console.log('\n✓ Repository analysis complete!');

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

GitHub-Specific Entity Types:

  • Repo: Repository as entity (name, owner, URL, description)

  • Person: Contributors, commit authors, issue creators, reviewers

  • Organization: Repository owner if organization, mentioned companies

  • Software: Dependencies (from package.json, requirements.txt, etc.)

  • Category: Topics, tags, project themes

  • Label: Issue/PR labels

Why Text Extraction:

  • Code files, README, docs are text-based

  • Fast and cost-effective

  • Handles markdown, code, JSON, YAML

Step 2: Configure GitHub Repository Feed

Repository Feed Options:

site: {
  type: FeedServiceTypes.GitHub,
  token: githubToken,                  // GitHub personal access token
  repositoryOwner: 'graphlit',         // Owner username or org
  repositoryName: 'graphlit-samples',  // Repository name
  
  includeRepositoryMetadata: true,     // Include repo description, topics
  includeFiles: true,                  // Sync files from repo
  
  fileTypes: ['md', 'py', 'ts', 'json', 'yaml'],  // Specific file extensions
  // Or sync all files:
  // fileTypes: []  // Empty = all files
  
  branch: 'main'  // Optional: specific branch (default: default branch)
}

What Gets Synced:

  • README.md

  • Source code files (filtered by fileTypes)

  • Documentation files

  • Configuration files (package.json, requirements.txt, etc.)

  • Repository metadata (description, topics, contributors)

Step 3: Configure GitHub Issues Feed

Issues Feed Options:

issue: {
  type: FeedServiceTypes.GitHubIssues,
  token: githubToken,
  repositoryOwner: 'graphlit',
  repositoryName: 'graphlit-samples',
  
  includeOpen: true,                   // Sync open issues
  includeClosed: false,                // Skip closed issues
  readLimit: 100,                      // Max issues to sync
  
  // Optional: filter by labels
  labels: ['bug', 'feature', 'documentation']
}

What Gets Synced:

  • Issue title and body

  • Issue author

  • Labels

  • Issue number/identifier

  • Created/updated dates

  • Comments (optional)


Additional GitHub Feed Types:

You can also create feeds for:

  • GitHub Commits (FeedTypes.Commit) - Sync commit history, code changes, and developer activity

  • GitHub Pull Requests (FeedTypes.PullRequest) - Sync pull requests, reviews, and merge history

These feed types are useful for analyzing code review patterns, tracking developer contributions, and understanding project evolution over time.

See the GitHub Commits and GitHub Pull Requests feed guides for details.
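As a rough sketch, a commit feed request mirrors the repository and issues feeds above. The builder below only assembles the input object, so the shape is easy to verify in isolation; the `commit` field name and the service-type string values are assumptions modeled on the other feeds, so confirm them against the GitHub Commits feed guide before use.

```typescript
// Hypothetical input builder for a commit feed. Field names mirror the
// repository/issues feeds above; the enum string values are assumptions.
interface CommitFeedInput {
  name: string;
  type: string;
  commit: {
    type: string;
    token: string;
    repositoryOwner: string;
    repositoryName: string;
    readLimit: number;
  };
  workflow: { id: string };
}

function buildCommitFeedInput(
  owner: string,
  repo: string,
  token: string,
  workflowId: string
): CommitFeedInput {
  return {
    name: `${owner}/${repo} commits`,
    type: 'COMMIT',            // assumed value of FeedTypes.Commit
    commit: {
      type: 'GIT_HUB',         // assumed FeedServiceTypes value
      token,
      repositoryOwner: owner,
      repositoryName: repo,
      readLimit: 100           // cap commit history for the first sync
    },
    workflow: { id: workflowId }
  };
}
```

The object would then be passed to `graphlit.createFeed(...)` exactly like the repository and issues feeds.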


Step 4: GitHub Token Setup

Creating GitHub Token:

  1. GitHub → Settings → Developer settings → Personal access tokens

  2. Generate new token (classic)

  3. Select scopes:

    • repo (for private repos) or public_repo (for public only)

    • read:org (if accessing org repos)

  4. Copy token

  5. Use in Graphlit feed creation

OR via Graphlit Developer Portal:

  1. Go to Developer Portal → Connectors → Version Control

  2. Authorize GitHub

  3. Copy OAuth token

Step 5: Analyze Repository Files

File Content Structure:

const file = await graphlit.getContent(fileId);

console.log(`File: ${file.content.name}`);
console.log(`Type: ${file.content.fileType}`);
console.log(`Path: ${file.content.uri}`);

// Extracted entities from file
file.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
});

README Analysis:

  • Rich source of Repo, Person, Organization entities

  • Contributors listed

  • Dependencies mentioned

  • Project description

package.json/requirements.txt Analysis:

  • Dependencies as Software entities

  • Version information

  • Project metadata

Step 6: Analyze GitHub Issues

Issue Metadata:

issue: {
  identifier: "42",                    // Issue number
  title: "Add feature X",
  project: "graphlit-samples",
  status: "Open",
  priority: "High",
  labels: ["feature", "enhancement"],
  author: {
    name: "Kirk Marple",
    email: "[email protected]"
  }
}

Entity Extraction from Issues:

  • Person: Issue author, mentioned contributors (@username)

  • Software: Tools/libraries mentioned

  • Category: Feature areas, components

  • Label: Issue labels as Label entities

Step 7: Build Contributor Graph

Contributors from Multiple Sources:

  1. File authors: Extracted from README, commit mentions

  2. Issue creators: From issue author field

  3. Code comments: Developers mentioned in code

  4. Documentation: Authors in docs

// Deduplicate contributors
const uniqueContributors = new Map<string, {
  observableId: string;
  name: string;
  email?: string;
  contributions: number;
}>();

// Combine from all sources
allContent.forEach(content => {
  content.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      if (!uniqueContributors.has(obs.observable.id)) {
        uniqueContributors.set(obs.observable.id, {
          observableId: obs.observable.id,
          name: obs.observable.name,
          email: obs.observable.properties?.email,
          contributions: 0
        });
      }
      uniqueContributors.get(obs.observable.id)!.contributions++;
    });
});

Step 8: Dependency Analysis

Extract Software Dependencies:

// Get all Software entities
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

// Find which files reference each dependency
for (const dep of dependencies.observables.results) {
  const references = await graphlit.queryContents({
    filter: {
      feeds: [{ id: repoFeed.createFeed.id }],
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  console.log(`${dep.observable.name}: ${references.contents.results.length} files`);
}

Configuration Options

Selective File Syncing

By File Extension:

fileTypes: ['md', 'py', 'ts', 'js', 'json', 'yaml']

All Files:

fileTypes: []  // Empty = sync all files

Common Patterns:

// Documentation only
fileTypes: ['md', 'rst', 'txt']

// Source code only
fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs']

// Configuration files
fileTypes: ['json', 'yaml', 'yml', 'toml', 'ini']

Branch Selection

Specific Branch:

site: {
  branch: 'develop'  // Sync specific branch
}

Default Branch:

site: {
  branch: undefined  // Uses repo's default branch
}

Issue Filtering

By State:

issue: {
  includeOpen: true,
  includeClosed: true  // Include both open and closed
}

By Labels:

issue: {
  labels: ['bug', 'security', 'high-priority']  // Only these labels
}

By Count:

issue: {
  readLimit: 500  // Most recent 500 issues
}

Variations

Variation 1: Multi-Repository Analysis

Analyze multiple repositories in an organization:

const repos = [
  { owner: 'graphlit', name: 'graphlit-client-typescript' },
  { owner: 'graphlit', name: 'graphlit-client-python' },
  { owner: 'graphlit', name: 'graphlit-client-dotnet' }
];

const feeds = await Promise.all(
  repos.map(repo =>
    graphlit.createFeed({
      name: `${repo.owner}/${repo.name}`,
      type: FeedTypes.Site,
      site: {
        type: FeedServiceTypes.GitHub,
        token: githubToken,
        repositoryOwner: repo.owner,
        repositoryName: repo.name
      },
      workflow: { id: workflowId }
    })
  )
);

// Wait for all to sync
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      feeds.map(f => graphlit.isFeedDone(f.createFeed.id))
    );
    allDone = statuses.every(s => s.isFeedDone.result);
    
    if (!allDone) {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();

// Analyze cross-repo entities
const allRepos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});

console.log(`Total repositories: ${allRepos.observables.results.length}`);

Variation 2: Dependency Graph Visualization

Map software dependencies:

// Extract all Software entities and their relationships
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

interface DependencyNode {
  name: string;
  usedBy: string[];  // Files/repos using this dependency
  version?: string;
}

const depGraph = new Map<string, DependencyNode>();

for (const dep of dependencies.observables.results) {
  const usages = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  depGraph.set(dep.observable.name, {
    name: dep.observable.name,
    usedBy: usages.contents.results.map(c => c.name),
    version: dep.observable.properties?.version
  });
}

// Find most common dependencies
const topDeps = Array.from(depGraph.values())
  .sort((a, b) => b.usedBy.length - a.usedBy.length)
  .slice(0, 10);

console.log('Most used dependencies:');
topDeps.forEach(dep => {
  console.log(`  ${dep.name}: ${dep.usedBy.length} files`);
});

Variation 3: Issue Classification by Entities

Categorize issues by extracted entities:

// Group issues by entity types
const issuesByEntity = new Map<string, Array<typeof issues.contents.results[0]>>();

issues.contents.results.forEach(issue => {
  issue.observations?.forEach(obs => {
    const key = `${obs.type}: ${obs.observable.name}`;
    if (!issuesByEntity.has(key)) {
      issuesByEntity.set(key, []);
    }
    issuesByEntity.get(key)!.push(issue);
  });
});

// Find entities with most issues
const entityIssueCount = Array.from(issuesByEntity.entries())
  .map(([entity, issues]) => ({ entity, count: issues.length }))
  .sort((a, b) => b.count - a.count);

console.log('Entities with most related issues:');
entityIssueCount.slice(0, 10).forEach(item => {
  console.log(`  ${item.entity}: ${item.count} issues`);
});

Variation 4: Contributor Activity Timeline

Track contributor activity over time:

interface ContributorActivity {
  name: string;
  firstContribution: Date;
  lastContribution: Date;
  contributions: Array<{ date: Date; type: 'file' | 'issue' }>;
}

const activity = new Map<string, ContributorActivity>();

// Track file contributions
repoFiles.contents.results.forEach(file => {
  const date = new Date(file.creationDate);
  
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!activity.has(name)) {
        activity.set(name, {
          name,
          firstContribution: date,
          lastContribution: date,
          contributions: []
        });
      }
      
      const contrib = activity.get(name)!;
      contrib.contributions.push({ date, type: 'file' });
      if (date < contrib.firstContribution) contrib.firstContribution = date;
      if (date > contrib.lastContribution) contrib.lastContribution = date;
    });
});

// Track issue contributions
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name;
  const date = new Date(issue.creationDate);
  
  if (author) {
    if (!activity.has(author)) {
      activity.set(author, {
        name: author,
        firstContribution: date,
        lastContribution: date,
        contributions: []
      });
    }
    
    const contrib = activity.get(author)!;
    contrib.contributions.push({ date, type: 'issue' });
    if (date < contrib.firstContribution) contrib.firstContribution = date;
    if (date > contrib.lastContribution) contrib.lastContribution = date;
  }
});

// Find most active contributors (by recent activity)
const recent = Array.from(activity.values())
  .sort((a, b) => b.lastContribution.getTime() - a.lastContribution.getTime())
  .slice(0, 10);

console.log('Most recently active contributors:');
recent.forEach(contrib => {
  console.log(`  ${contrib.name}: ${contrib.contributions.length} contributions`);
  console.log(`    First: ${contrib.firstContribution.toLocaleDateString()}`);
  console.log(`    Last: ${contrib.lastContribution.toLocaleDateString()}`);
});

Variation 5: Cross-Repository Entity Linking

Find entities that appear across multiple repositories:

// After syncing multiple repos, find cross-repo entities
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

for (const person of allPeople.observables.results) {
  // Find all content (across repos) mentioning this person
  const mentions = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Person,
        observable: { id: person.observable.id }
      }]
    }
  });
  
  // Group by feed (repository)
  const reposMentioned = new Set(
    mentions.contents.results.map(c => c.feedId)
  );
  
  if (reposMentioned.size > 1) {
    console.log(`${person.observable.name}: appears in ${reposMentioned.size} repos`);
  }
}

Common Issues & Solutions

Issue: Large Repository, Slow Sync

Problem: A repository with thousands of files takes hours to sync.

Solutions:

  1. Filter file types: Only sync relevant files

  2. Skip large files: Binary files slow down processing

  3. Use branch: Sync specific branch instead of all branches

site: {
  fileTypes: ['md', 'py', 'js'],  // Skip images, binaries
  branch: 'main'                   // Single branch only
}

Issue: GitHub API Rate Limiting

Problem: Sync fails with rate limit error.

Explanation: The GitHub API is rate-limited (5,000 requests/hour for authenticated requests; only 60/hour unauthenticated).

Solutions:

  1. Use authenticated token (higher limits)

  2. Sync fewer repositories simultaneously

  3. Increase polling interval

  4. Wait for rate limit reset
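One way to increase the polling interval is exponential backoff instead of the fixed 5-second loop used in the main example. The delay schedule below is an illustrative choice, not an SDK feature.

```typescript
// Exponential backoff: 5s, 10s, 20s, ... capped at 60s between status checks.
function nextDelay(attempt: number, baseMs = 5000, maxMs = 60000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Poll a feed's completion status with backoff. `isDone` would typically wrap
// graphlit.isFeedDone(feedId) and return the boolean result.
async function waitForFeed(
  isDone: () => Promise<boolean>,
  maxAttempts = 20
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await isDone()) return true;
    await new Promise(resolve => setTimeout(resolve, nextDelay(attempt)));
  }
  return false; // gave up; caller can retry after the rate limit resets
}
```

With the cap at 60 seconds, a long-running sync generates roughly one status request per minute instead of twelve.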

Issue: Missing Dependencies from package.json

Problem: Software entities not extracted from package files.

Cause: Need to sync configuration files explicitly.

Solution: Include config file types:

fileTypes: ['json', 'yaml', 'toml', 'lock']

Issue: No Contributor Entities

Problem: No Person entities extracted from repository.

Causes:

  1. README doesn't list contributors

  2. Code comments don't mention developers

  3. Need to enable repository metadata

Solution: Enable metadata and sync issues:

site: {
  includeRepositoryMetadata: true  // Includes contributor info
}

// Also sync issues for issue authors
issue: {
  type: FeedServiceTypes.GitHubIssues,
  includeOpen: true
}

Developer Hints

GitHub Token Best Practices

  • Use fine-grained tokens (new GitHub feature) when possible

  • Minimum scope: repo for private, public_repo for public

  • Rotate tokens regularly

  • Don't commit tokens to code (use env variables)

  • Monitor token usage in GitHub settings

File Type Recommendations

Documentation Analysis:

fileTypes: ['md', 'rst', 'txt', 'adoc']

Full Code Analysis:

fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs', 'cpp', 'c', 'h']

Configuration + Dependencies:

fileTypes: ['json', 'yaml', 'yml', 'toml', 'lock', 'txt']

Performance Optimization

  • Start with README + package files only

  • Add more file types incrementally

  • Sync issues separately (can be slow)

  • Use multiple feeds for large org

  • Cache entity queries
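The "cache entity queries" tip can be as simple as memoizing by serialized filter. This is a minimal sketch; in production you would likely add a TTL so the cache reflects newly synced content.

```typescript
// Memoize an async query function keyed by a string (e.g. a serialized
// filter). Concurrent calls for the same key share one in-flight promise.
function cached<T>(fn: (key: string) => Promise<T>) {
  const cache = new Map<string, Promise<T>>();
  return (key: string): Promise<T> => {
    if (!cache.has(key)) {
      cache.set(key, fn(key));
    }
    return cache.get(key)!;
  };
}

// Usage sketch (assumes a `graphlit` client in scope):
// const queryObservablesCached = cached(filterJson =>
//   graphlit.queryObservables({ filter: JSON.parse(filterJson) })
// );
```

This avoids re-issuing the same `queryObservables` call inside loops like the dependency analysis in Step 8.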

Entity Quality by Source

  • High confidence: README (explicit mentions), package.json (dependencies)

  • Medium confidence: Code comments, documentation

  • Low confidence: Implicit mentions in code


Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:

  • Syncs public GitHub repository

  • Extracts Repo, Person, Software entities

  • Analyzes dependencies from package.json

  • Builds contributor network

  • Exports entity graph for visualization

Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:

  • Syncs GitHub issues

  • Extracts entities from issue descriptions

  • Classifies issues by entity types

  • Groups related issues

  • Priority ranking by entity importance

Open Source Intelligence Use Cases

  • Dependency tracking: Which projects use which libraries

  • Contributor analysis: Developer activity, collaboration

  • Project relationships: Shared contributors, common dependencies

  • Technology adoption: Which tools/frameworks are gaining traction

  • Security analysis: Vulnerable dependency detection
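For the security analysis use case, a simple approach is to cross-check extracted Software entity names and versions against an advisory list. The sketch below uses a naive "major.minor.patch" comparison and made-up advisory data purely for illustration; a real pipeline would consult an advisory database.

```typescript
// Flag dependencies whose version falls below a known-vulnerable threshold.
// Versions are compared numerically segment by segment (naive semver).
interface Advisory {
  name: string;
  vulnerableBelow: string;  // e.g. '4.17.21'
}

function flagVulnerable(
  deps: Array<{ name: string; version?: string }>,
  advisories: Advisory[]
): string[] {
  const toNums = (v: string) => v.split('.').map(Number);
  const lessThan = (a: string, b: string) => {
    const [x, y] = [toNums(a), toNums(b)];
    for (let i = 0; i < 3; i++) {
      if ((x[i] ?? 0) !== (y[i] ?? 0)) return (x[i] ?? 0) < (y[i] ?? 0);
    }
    return false;
  };
  return deps
    .filter(dep =>
      advisories.some(adv =>
        adv.name === dep.name &&
        dep.version !== undefined &&
        lessThan(dep.version, adv.vulnerableBelow)
      )
    )
    .map(dep => dep.name);
}
```

The `deps` input maps naturally from the Software entities queried in Step 8 (name plus the optional `properties.version`).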

