Build Knowledge Graph from GitHub Repositories

User Intent

"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."

Operation

SDK Methods: createWorkflow(), createFeed() (Site or Issue feeds), isFeedDone(), queryContents(), queryObservables()

GraphQL: GitHub feed creation + entity extraction + contributor/project graphs

Entity: GitHub Feed → Content (Files/Issues) → Observations → Observables (Developer/Project Graph)

Prerequisites

  • Graphlit project with API credentials

  • GitHub personal access token (created on GitHub, or an OAuth token obtained via the Graphlit Developer Portal)

  • GitHub repository access

  • Understanding of feed and workflow concepts


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FeedTypes,
  FeedServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  ContentTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from GitHub ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "GitHub Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,            // Repository itself
          ObservableTypes.Person,          // Contributors, authors
          ObservableTypes.Organization,    // Organizations, companies
          ObservableTypes.Software,        // Dependencies, tools
          ObservableTypes.Category,        // Topics, tags
          ObservableTypes.Label            // Issue labels
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
  name: "Graphlit Samples Repo",
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.GitHub,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeRepositoryMetadata: true,
    includeFiles: true,
    fileTypes: ['md', 'py', 'ts', 'json', 'yaml']  // Specific file types
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);

// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
  name: "Graphlit Issues",
  type: FeedTypes.Issue,
  issue: {
    type: FeedServiceTypes.GitHubIssues,
    token: process.env.GITHUB_TOKEN!,
    repositoryOwner: 'graphlit',
    repositoryName: 'graphlit-samples',
    includeOpen: true,
    includeClosed: false,
    readLimit: 100
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);

// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;

while (!repoDone || !issuesDone) {
  if (!repoDone) {
    const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
    repoDone = repoStatus.isFeedDone?.result ?? false;  // Treat a missing result as "not done"
  }
  
  if (!issuesDone) {
    const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
    issuesDone = issuesStatus.isFeedDone?.result ?? false;
  }
  
  if (!repoDone || !issuesDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
  filter: {
    feeds: [{ id: repoFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);

// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Issue],
    feeds: [{ id: issuesFeed.createFeed.id }]
  }
});

console.log(`✓ Synced ${issues.contents.results.length} issues\n`);

// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');

// Get all Repo entities
const repos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);

// Get contributors (Person entities)
const people = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);

// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);

// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');

const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
  const labels = issue.issue?.labels || [];
  labels.forEach(label => {
    labelCounts.set(label, (labelCounts.get(label) || 0) + 1);
  });
});

console.log('Most common issue labels:');
Array.from(labelCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([label, count]) => {
    console.log(`  ${label}: ${count} issues`);
  });

// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');

const contributors = new Map<string, {
  files: number;
  issues: number;
  total: number;
}>();

// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!contributors.has(name)) {
        contributors.set(name, { files: 0, issues: 0, total: 0 });
      }
      contributors.get(name)!.files++;
      contributors.get(name)!.total++;
    });
});

// Count issues by contributor
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name || 'Unknown';
  if (!contributors.has(author)) {
    contributors.set(author, { files: 0, issues: 0, total: 0 });
  }
  contributors.get(author)!.issues++;
  contributors.get(author)!.total++;
});

console.log('Top contributors:');
Array.from(contributors.entries())
  .sort((a, b) => b[1].total - a[1].total)
  .slice(0, 5)
  .forEach(([name, stats]) => {
    console.log(`  ${name}: ${stats.files} files, ${stats.issues} issues`);
  });

console.log('\n✓ Repository analysis complete!');
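
The polling loop in Step 4 of the example runs until both feeds report completion, so it never returns if a feed stalls. A minimal sketch of a bounded variant follows. `waitForAll`, `DoneCheck`, and `PollOptions` are hypothetical names introduced here, not part of the Graphlit SDK, and the sleep and clock functions are injectable only to make the helper easy to test.

```typescript
// Hypothetical helper: poll several "is it done?" checks until all pass or a deadline is reached.
type DoneCheck = () => Promise<boolean>;

interface PollOptions {
  intervalMs: number;                     // Delay between polling rounds
  timeoutMs: number;                      // Give up after this much elapsed time
  sleep?: (ms: number) => Promise<void>;  // Injectable for tests
  now?: () => number;                     // Injectable clock for tests
}

async function waitForAll(checks: DoneCheck[], options: PollOptions): Promise<boolean> {
  const sleep = options.sleep ?? ((ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms)));
  const now = options.now ?? (() => Date.now());
  const deadline = now() + options.timeoutMs;
  const done = checks.map(() => false);

  while (true) {
    // Only re-check the entries that have not finished yet
    for (let i = 0; i < checks.length; i++) {
      if (!done[i]) done[i] = await checks[i]();
    }
    if (done.every(Boolean)) return true;   // Everything finished
    if (now() >= deadline) return false;    // Timed out
    await sleep(options.intervalMs);
  }
}

// Possible usage with the feeds from the example (commented out because it needs live credentials):
// const finished = await waitForAll(
//   [repoFeed, issuesFeed].map(f => async () =>
//     (await graphlit.isFeedDone(f.createFeed.id)).isFeedDone?.result === true),
//   { intervalMs: 5000, timeoutMs: 30 * 60 * 1000 }
// );
```

Returning `false` on timeout, rather than throwing, leaves it to the caller to decide whether a partial sync is acceptable.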

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

GitHub-Specific Entity Types:

  • Repo: Repository as entity (name, owner, URL, description)

  • Person: Contributors, commit authors, issue creators, reviewers

  • Organization: Repository owner if organization, mentioned companies

  • Software: Dependencies (from package.json, requirements.txt, etc.)

  • Category: Topics, tags, project themes

  • Label: Issue/PR labels

Why Text Extraction:

  • Code files, README, docs are text-based

  • Fast and cost-effective

  • Handles markdown, code, JSON, YAML

Step 2: Configure GitHub Repository Feed

Repository Feed Options:

site: {
  type: FeedServiceTypes.GitHub,
  token: githubToken,                  // GitHub personal access token
  repositoryOwner: 'graphlit',         // Owner username or org
  repositoryName: 'graphlit-samples',  // Repository name
  
  includeRepositoryMetadata: true,     // Include repo description, topics
  includeFiles: true,                  // Sync files from repo
  
  fileTypes: ['md', 'py', 'ts', 'json', 'yaml'],  // Specific file extensions
  // Or sync all files:
  // fileTypes: []  // Empty = all files
  
  branch: 'main'  // Optional: specific branch (default: default branch)
}

What Gets Synced:

  • README.md

  • Source code files (filtered by fileTypes)

  • Documentation files

  • Configuration files (package.json, requirements.txt, etc.)

  • Repository metadata (description, topics, contributors)
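
To preview which paths a given fileTypes list would select, the extension rule described above (a listed extension matches, an empty list means all files) can be mimicked locally. This is only an illustration: the real filtering is done by the feed, and `matchesFileTypes` is a hypothetical helper whose treatment of extensionless files and dotfiles is an assumption.

```typescript
// Local illustration only: the actual filtering happens in the feed, not in this function.
function matchesFileTypes(path: string, fileTypes: string[]): boolean {
  if (fileTypes.length === 0) return true;        // Empty list = all files
  const fileName = path.split('/').pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return false;                     // No extension, or a dotfile such as ".gitignore"
  const ext = fileName.slice(dot + 1).toLowerCase();
  return fileTypes.some(t => t.toLowerCase() === ext);
}

const candidatePaths = ['README.md', 'src/index.ts', 'assets/logo.png', 'package.json', 'Makefile'];
const selectedPaths = candidatePaths.filter(p =>
  matchesFileTypes(p, ['md', 'py', 'ts', 'json', 'yaml'])
);
console.log(selectedPaths); // Keeps README.md, src/index.ts, and package.json
```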

Step 3: Configure GitHub Issues Feed

Issues Feed Options:

issue: {
  type: FeedServiceTypes.GitHubIssues,
  token: githubToken,
  repositoryOwner: 'graphlit',
  repositoryName: 'graphlit-samples',
  
  includeOpen: true,                   // Sync open issues
  includeClosed: false,                // Skip closed issues
  readLimit: 100,                      // Max issues to sync
  
  // Optional: filter by labels
  labels: ['bug', 'feature', 'documentation']
}

What Gets Synced:

  • Issue title and body

  • Issue author

  • Labels

  • Issue number/identifier

  • Created/updated dates

  • Comments (optional)

Step 4: GitHub Token Setup

Creating GitHub Token:

  1. GitHub → Settings → Developer settings → Personal access tokens

  2. Generate new token (classic)

  3. Select scopes:

    • repo (for private repos) or public_repo (for public only)

    • read:org (if accessing org repos)

  4. Copy token

  5. Use in Graphlit feed creation

Or via the Graphlit Developer Portal:

  1. Go to Developer Portal → Connectors → Version Control

  2. Authorize GitHub

  3. Copy OAuth token
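
However the token is obtained, it is safer to fail fast when it is missing than to pass an empty string to createFeed(). A minimal sketch, assuming the token lives in an environment variable named GITHUB_TOKEN as in the example above; `requireSecret` and `maskSecret` are hypothetical helpers, not SDK functions.

```typescript
// Hypothetical helper: fail fast when a secret is missing instead of sending an empty value.
function requireSecret(env: Record<string, string | undefined>, key: string): string {
  const value = env[key]?.trim();
  if (!value) {
    throw new Error(`Missing environment variable ${key}`);
  }
  return value;
}

// Mask a secret before logging it, keeping only the last four characters.
function maskSecret(secret: string): string {
  return secret.length <= 4
    ? '****'
    : `${'*'.repeat(secret.length - 4)}${secret.slice(-4)}`;
}

// Possible usage in a Node.js script:
// const githubToken = requireSecret(process.env, 'GITHUB_TOKEN');
// console.log(`Using token ${maskSecret(githubToken)}`);
```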

Step 5: Analyze Repository Files

File Content Structure:

const file = await graphlit.getContent(fileId);

console.log(`File: ${file.content.name}`);
console.log(`Type: ${file.content.fileType}`);
console.log(`Path: ${file.content.uri}`);

// Extracted entities from file
file.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
});

README Analysis:

  • Rich source of Repo, Person, Organization entities

  • Contributors listed

  • Dependencies mentioned

  • Project description

package.json/requirements.txt Analysis:

  • Dependencies as Software entities

  • Version information

  • Project metadata
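
Because package.json declares dependencies explicitly, it can serve as a cross-check on the extracted Software entities. The sketch below parses a manifest locally and reports declared packages with no matching entity name. It assumes you already have the manifest text and a list of extracted names (for example from queryObservables()); `listDeclaredDependencies` and `missingFromGraph` are hypothetical helpers.

```typescript
interface DeclaredDependency {
  pkg: string;
  versionRange: string;
  kind: 'dependencies' | 'devDependencies' | 'peerDependencies';
}

// Read the dependency sections of a package.json document.
function listDeclaredDependencies(packageJsonText: string): DeclaredDependency[] {
  const manifest = JSON.parse(packageJsonText) as Record<string, unknown>;
  const kinds: DeclaredDependency['kind'][] = ['dependencies', 'devDependencies', 'peerDependencies'];
  const found: DeclaredDependency[] = [];
  for (const kind of kinds) {
    const section = manifest[kind];
    if (typeof section !== 'object' || section === null) continue;  // Section absent
    const entries = section as Record<string, unknown>;
    for (const pkg of Object.keys(entries)) {
      const versionRange = entries[pkg];
      if (typeof versionRange === 'string') found.push({ pkg, versionRange, kind });
    }
  }
  return found;
}

// Declared packages that have no extracted Software entity with the same name (case-insensitive).
function missingFromGraph(declared: DeclaredDependency[], extractedNames: string[]): string[] {
  const known = new Set(extractedNames.map(n => n.toLowerCase()));
  return declared.map(d => d.pkg).filter(pkg => !known.has(pkg.toLowerCase()));
}
```

Exact name matching is a simplification; an extracted entity may use a display name that differs from the package name.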

Step 6: Analyze GitHub Issues

Issue Metadata:

issue: {
  identifier: "42",                    // Issue number
  title: "Add feature X",
  project: "graphlit-samples",
  status: "Open",
  priority: "High",
  labels: ["feature", "enhancement"],
  author: {
    name: "Kirk Marple",
    email: "[email protected]"
  }
}

Entity Extraction from Issues:

  • Person: Issue author, mentioned contributors (@username)

  • Software: Tools/libraries mentioned

  • Category: Feature areas, components

  • Label: Issue labels as Label entities

Step 7: Build Contributor Graph

Contributors from Multiple Sources:

  1. File authors: Extracted from README, commit mentions

  2. Issue creators: From issue author field

  3. Code comments: Developers mentioned in code

  4. Documentation: Authors in docs

// Deduplicate contributors
const uniqueContributors = new Map<string, {
  observableId: string;
  name: string;
  email?: string;
  contributions: number;
}>();

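// Combine content from all sources (repoFiles and issues come from the main example)
const allContent = [...repoFiles.contents.results, ...issues.contents.results];
allContent.forEach(content => {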
  content.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      if (!uniqueContributors.has(obs.observable.id)) {
        uniqueContributors.set(obs.observable.id, {
          observableId: obs.observable.id,
          name: obs.observable.name,
          email: obs.observable.properties?.email,
          contributions: 0
        });
      }
      uniqueContributors.get(obs.observable.id)!.contributions++;
    });
});

Step 8: Dependency Analysis

Extract Software Dependencies:

// Get all Software entities
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

// Find which files reference each dependency
for (const dep of dependencies.observables.results) {
  const references = await graphlit.queryContents({
    filter: {
      feeds: [{ id: repoFeed.createFeed.id }],
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  console.log(`${dep.observable.name}: ${references.contents.results.length} files`);
}

Configuration Options

Selective File Syncing

By File Extension:

fileTypes: ['md', 'py', 'ts', 'js', 'json', 'yaml']

All Files:

fileTypes: []  // Empty = sync all files

Common Patterns:

// Documentation only
fileTypes: ['md', 'rst', 'txt']

// Source code only
fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs']

// Configuration files
fileTypes: ['json', 'yaml', 'yml', 'toml', 'ini']

Branch Selection

Specific Branch:

site: {
  branch: 'develop'  // Sync specific branch
}

Default Branch:

site: {
  branch: undefined  // Uses repo's default branch
}

Issue Filtering

By State:

issue: {
  includeOpen: true,
  includeClosed: true  // Include both open and closed
}

By Labels:

issue: {
  labels: ['bug', 'security', 'high-priority']  // Only these labels
}

By Count:

issue: {
  readLimit: 500  // Most recent 500 issues
}

Variations

Variation 1: Multi-Repository Analysis

Analyze multiple repositories in an organization:

const repos = [
  { owner: 'graphlit', name: 'graphlit-client-typescript' },
  { owner: 'graphlit', name: 'graphlit-client-python' },
  { owner: 'graphlit', name: 'graphlit-client-dotnet' }
];

const feeds = await Promise.all(
  repos.map(repo =>
    graphlit.createFeed({
      name: `${repo.owner}/${repo.name}`,
      type: FeedTypes.Site,
      site: {
        type: FeedServiceTypes.GitHub,
        token: githubToken,
        repositoryOwner: repo.owner,
        repositoryName: repo.name
      },
      workflow: { id: workflowId }
    })
  )
);

// Wait for all to sync
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      feeds.map(f => graphlit.isFeedDone(f.createFeed.id))
    );
    allDone = statuses.every(s => s.isFeedDone.result);
    
    if (!allDone) {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();

// Analyze cross-repo entities
const allRepos = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Repo] }
});

console.log(`Total repositories: ${allRepos.observables.results.length}`);

Variation 2: Dependency Graph Visualization

Map software dependencies:

// Extract all Software entities and their relationships
const dependencies = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Software] }
});

interface DependencyNode {
  name: string;
  usedBy: string[];  // Files/repos using this dependency
  version?: string;
}

const depGraph = new Map<string, DependencyNode>();

for (const dep of dependencies.observables.results) {
  const usages = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Software,
        observable: { id: dep.observable.id }
      }]
    }
  });
  
  depGraph.set(dep.observable.name, {
    name: dep.observable.name,
    usedBy: usages.contents.results.map(c => c.name),
    version: dep.observable.properties?.version
  });
}

// Find most common dependencies
const topDeps = Array.from(depGraph.values())
  .sort((a, b) => b.usedBy.length - a.usedBy.length)
  .slice(0, 10);

console.log('Most used dependencies:');
topDeps.forEach(dep => {
  console.log(`  ${dep.name}: ${dep.usedBy.length} files`);
});
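
The listing above builds the dependency map but only prints counts. To actually draw it, one option is to export the map as Graphviz DOT text. This is a sketch under that assumption, not something the original example does; `toDot` and `quoteDot` are hypothetical helpers, and `DepNode` mirrors the DependencyNode interface above.

```typescript
interface DepNode {
  name: string;
  usedBy: string[];
  version?: string;
}

// Quote a string for use as a DOT identifier, escaping backslashes and double quotes.
function quoteDot(value: string): string {
  return `"${value.replace(/\\/g, '\\\\').replace(/"/g, '\\"')}"`;
}

// Emit one node per dependency and one edge from each file that uses it.
function toDot(graph: Map<string, DepNode>): string {
  const lines: string[] = ['digraph dependencies {'];
  graph.forEach(node => {
    const label = node.version ? `${node.name}@${node.version}` : node.name;
    lines.push(`  ${quoteDot(node.name)} [label=${quoteDot(label)}];`);
    node.usedBy.forEach(user => {
      lines.push(`  ${quoteDot(user)} -> ${quoteDot(node.name)};`);
    });
  });
  lines.push('}');
  return lines.join('\n');
}
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command, for example `dot -Tsvg deps.dot -o deps.svg`.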

Variation 3: Issue Classification by Entities

Categorize issues by extracted entities:

// Group issues by entity types
const issuesByEntity = new Map<string, Array<typeof issues.contents.results[0]>>();

issues.contents.results.forEach(issue => {
  issue.observations?.forEach(obs => {
    const key = `${obs.type}: ${obs.observable.name}`;
    if (!issuesByEntity.has(key)) {
      issuesByEntity.set(key, []);
    }
    issuesByEntity.get(key)!.push(issue);
  });
});

// Find entities with most issues
const entityIssueCount = Array.from(issuesByEntity.entries())
  .map(([entity, issues]) => ({ entity, count: issues.length }))
  .sort((a, b) => b.count - a.count);

console.log('Entities with most related issues:');
entityIssueCount.slice(0, 10).forEach(item => {
  console.log(`  ${item.entity}: ${item.count} issues`);
});

Variation 4: Contributor Activity Timeline

Track contributor activity over time:

interface ContributorActivity {
  name: string;
  firstContribution: Date;
  lastContribution: Date;
  contributions: Array<{ date: Date; type: 'file' | 'issue' }>;
}

const activity = new Map<string, ContributorActivity>();

// Track file contributions
repoFiles.contents.results.forEach(file => {
  const date = new Date(file.creationDate);
  
  file.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .forEach(obs => {
      const name = obs.observable.name;
      if (!activity.has(name)) {
        activity.set(name, {
          name,
          firstContribution: date,
          lastContribution: date,
          contributions: []
        });
      }
      
      const contrib = activity.get(name)!;
      contrib.contributions.push({ date, type: 'file' });
      if (date < contrib.firstContribution) contrib.firstContribution = date;
      if (date > contrib.lastContribution) contrib.lastContribution = date;
    });
});

// Track issue contributions
issues.contents.results.forEach(issue => {
  const author = issue.issue?.author?.name;
  const date = new Date(issue.creationDate);
  
  if (author) {
    if (!activity.has(author)) {
      activity.set(author, {
        name: author,
        firstContribution: date,
        lastContribution: date,
        contributions: []
      });
    }
    
    const contrib = activity.get(author)!;
    contrib.contributions.push({ date, type: 'issue' });
    if (date < contrib.firstContribution) contrib.firstContribution = date;
    if (date > contrib.lastContribution) contrib.lastContribution = date;
  }
});

// Find most active contributors (by recent activity)
const recent = Array.from(activity.values())
  .sort((a, b) => b.lastContribution.getTime() - a.lastContribution.getTime())
  .slice(0, 10);

console.log('Most recently active contributors:');
recent.forEach(contrib => {
  console.log(`  ${contrib.name}: ${contrib.contributions.length} contributions`);
  console.log(`    First: ${contrib.firstContribution.toLocaleDateString()}`);
  console.log(`    Last: ${contrib.lastContribution.toLocaleDateString()}`);
});
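
Once each contributor has a list of dated contributions, bucketing the dates by month gives a simple timeline. A minimal sketch, assuming dates are grouped by UTC calendar month; `contributionsByMonth` is a hypothetical helper that takes the `date` values from a ContributorActivity's contributions array.

```typescript
// Count contributions per UTC calendar month, keyed as "YYYY-MM".
function contributionsByMonth(dates: Date[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of dates) {
    const month = d.getUTCMonth() + 1;  // getUTCMonth() is zero-based
    const key = `${d.getUTCFullYear()}-${month < 10 ? '0' : ''}${month}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Possible usage with the activity map built above:
// const monthly = contributionsByMonth(contrib.contributions.map(c => c.date));
```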

Variation 5: Cross-Repository Entity Linking

Find entities that appear across multiple repositories:

// After syncing multiple repos, find cross-repo entities
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

for (const person of allPeople.observables.results) {
  // Find all content (across repos) mentioning this person
  const mentions = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.Person,
        observable: { id: person.observable.id }
      }]
    }
  });
  
  // Group by feed (repository)
  const reposMentioned = new Set(
    mentions.contents.results.map(c => c.feedId)
  );
  
  if (reposMentioned.size > 1) {
    console.log(`${person.observable.name}: appears in ${reposMentioned.size} repos`);
  }
}

Common Issues & Solutions

Issue: Large Repository, Slow Sync

Problem: A repository with thousands of files takes hours to sync.

Solutions:

  1. Filter file types: Only sync relevant files

  2. Skip large files: Binary files slow down processing

  3. Pin a branch: Sync one specific branch instead of all branches

site: {
  fileTypes: ['md', 'py', 'js'],  // Skip images, binaries
  branch: 'main'                   // Single branch only
}

Issue: GitHub API Rate Limiting

Problem: Sync fails with rate limit error.

Explanation: The GitHub API enforces rate limits (5,000 requests per hour for authenticated requests).

Solutions:

  1. Use authenticated token (higher limits)

  2. Sync fewer repositories simultaneously

  3. Increase polling interval

  4. Wait for rate limit reset
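
For the last option, GitHub's REST API reports the remaining quota and the reset time in the `x-ratelimit-remaining` and `x-ratelimit-reset` response headers, the latter in UTC epoch seconds. The sketch below turns those headers into a wait time. It assumes you have the headers from a GitHub API response made with the same token; how you obtain them is up to you, and `msUntilRateLimitReset` is a hypothetical helper.

```typescript
// Header values arrive as strings; x-ratelimit-reset is expressed in UTC epoch seconds.
type HeaderMap = Record<string, string | undefined>;

function msUntilRateLimitReset(headers: HeaderMap, nowMs: number): number {
  const remaining = Number(headers['x-ratelimit-remaining']);
  const resetSeconds = Number(headers['x-ratelimit-reset']);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) {
    return 0;  // Headers absent or malformed: nothing to wait for
  }
  if (remaining > 0) return 0;  // Quota left, no need to wait
  return Math.max(0, resetSeconds * 1000 - nowMs);
}

// Possible usage before creating more feeds:
// const waitMs = msUntilRateLimitReset(headersFromGitHub, Date.now());
// if (waitMs > 0) await new Promise(resolve => setTimeout(resolve, waitMs));
```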

Issue: Missing Dependencies from package.json

Problem: Software entities not extracted from package files.

Cause: Configuration files are only synced when their extensions are listed in fileTypes.

Solution: Include config file types:

fileTypes: ['json', 'yaml', 'toml', 'lock']

Issue: No Contributor Entities

Problem: No Person entities extracted from repository.

Causes:

  1. README doesn't list contributors

  2. Code comments don't mention developers

  3. Need to enable repository metadata

Solution: Enable metadata and sync issues:

site: {
  includeRepositoryMetadata: true  // Includes contributor info
}

// Also sync issues for issue authors
issue: {
  type: FeedServiceTypes.GitHubIssues,
  includeOpen: true
}

Developer Hints

GitHub Token Best Practices

  • Use fine-grained personal access tokens when possible

  • Minimum scope for classic tokens: repo for private repositories, public_repo for public repositories

  • Rotate tokens regularly

  • Don't commit tokens to code (use env variables)

  • Monitor token usage in GitHub settings

File Type Recommendations

Documentation Analysis:

fileTypes: ['md', 'rst', 'txt', 'adoc']

Full Code Analysis:

fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs', 'cpp', 'c', 'h']

Configuration + Dependencies:

fileTypes: ['json', 'yaml', 'yml', 'toml', 'lock', 'txt']

Performance Optimization

  • Start with README + package files only

  • Add more file types incrementally

  • Sync issues separately (can be slow)

  • Use multiple feeds for a large organization

  • Cache entity queries
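
The last point can be sketched as a small in-memory cache with a time-to-live. It stores the in-flight promise, so concurrent callers asking for the same key share one request. `createQueryCache` is a hypothetical helper, not an SDK feature, and the TTL value is yours to choose.

```typescript
// Hypothetical cache: keyed by string, holding promises so duplicate requests are shared.
function createQueryCache<T>(
  fetcher: (key: string) => Promise<T>,
  ttlMs: number,
  now: () => number = () => Date.now()
) {
  const entries = new Map<string, { value: Promise<T>; expiresAt: number }>();
  return {
    get(key: string): Promise<T> {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) return hit.value;  // Fresh entry: reuse it
      const value = fetcher(key);
      entries.set(key, { value, expiresAt: now() + ttlMs });
      // Do not keep failed requests in the cache
      value.catch(() => {
        if (entries.get(key)?.value === value) entries.delete(key);
      });
      return value;
    },
    clear(): void {
      entries.clear();
    }
  };
}

// Possible usage, keyed by observable type name:
// const observableCache = createQueryCache(
//   type => graphlit.queryObservables({ filter: { types: [type as ObservableTypes] } }),
//   60000
// );
// const people = await observableCache.get(ObservableTypes.Person);
```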

Entity Quality by Source

  • High confidence: README (explicit mentions), package.json (dependencies)

  • Medium confidence: Code comments, documentation

  • Low confidence: Implicit mentions in code
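
One way to apply these tiers is to score each sighting of an entity by the file it came from and keep the best score per entity. The mapping below is a rough heuristic chosen for illustration, not something Graphlit computes: a path alone cannot distinguish a code comment from an implicit mention, so every source file is treated as low confidence. All names are hypothetical.

```typescript
type Confidence = 'high' | 'medium' | 'low';

interface Sighting {
  entity: string;  // Entity name, e.g. from obs.observable.name
  path: string;    // File in which it was observed
}

// Heuristic (an assumption): explicit sources rank above documentation, which ranks above code.
function confidenceForPath(path: string): Confidence {
  const fileName = (path.split('/').pop() ?? '').toLowerCase();
  if (/^readme(\.|$)/.test(fileName) || fileName === 'package.json') return 'high';
  if (/\.(md|rst|txt|adoc)$/.test(fileName)) return 'medium';
  return 'low';
}

const CONFIDENCE_RANK: Record<Confidence, number> = { low: 0, medium: 1, high: 2 };

// Keep the highest confidence seen for each entity.
function bestConfidence(sightings: Sighting[]): Map<string, Confidence> {
  const best = new Map<string, Confidence>();
  for (const s of sightings) {
    const candidate = confidenceForPath(s.path);
    const current = best.get(s.entity);
    if (current === undefined || CONFIDENCE_RANK[candidate] > CONFIDENCE_RANK[current]) {
      best.set(s.entity, candidate);
    }
  }
  return best;
}
```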


Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:

  • Syncs public GitHub repository

  • Extracts Repo, Person, Software entities

  • Analyzes dependencies from package.json

  • Builds contributor network

  • Exports entity graph for visualization

Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:

  • Syncs GitHub issues

  • Extracts entities from issue descriptions

  • Classifies issues by entity types

  • Groups related issues

  • Priority ranking by entity importance

Open Source Intelligence Use Cases

  • Dependency tracking: Which projects use which libraries

  • Contributor analysis: Developer activity, collaboration

  • Project relationships: Shared contributors, common dependencies

  • Technology adoption: Which tools and frameworks are gaining traction

  • Security analysis: Vulnerable dependency detection
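
Project relationships such as shared contributors or common dependencies reduce to set operations on entity names once each project's entities are listed. A minimal sketch, assuming names are compared case-insensitively; `sharedEntities` and `jaccard` are hypothetical helpers that work on plain string arrays.

```typescript
// Entities present in both lists, reported with the spelling used in the first list.
function sharedEntities(a: string[], b: string[]): string[] {
  const inB = new Set(b.map(v => v.toLowerCase()));
  const seen = new Set<string>();
  const shared: string[] = [];
  for (const v of a) {
    const key = v.toLowerCase();
    if (inB.has(key) && !seen.has(key)) {
      seen.add(key);
      shared.push(v);
    }
  }
  return shared;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 when both lists are empty.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a.map(v => v.toLowerCase()));
  const setB = new Set(b.map(v => v.toLowerCase()));
  let intersection = 0;
  setA.forEach(v => {
    if (setB.has(v)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```

A similarity near 1 suggests two projects draw on largely the same dependencies or contributors; a value near 0 suggests they are unrelated.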

