Build Knowledge Graph from GitHub Repositories
User Intent
"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."
Operation
SDK Methods: createWorkflow(), createFeed() (Site, Issue, Commit, or PullRequest feeds), isFeedDone(), queryContents(), queryObservables()
GraphQL: GitHub feed creation + entity extraction + contributor/project graphs
Entity: GitHub Feed → Content (Files/Issues/Commits/PRs) → Observations → Observables (Developer/Project Graph)
Prerequisites
Graphlit project with API credentials
GitHub personal access token (via Graphlit Developer Portal)
GitHub repository access
Understanding of feed and workflow concepts
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';
import {
  ContentTypes,
  EntityExtractionServiceTypes,
  FeedServiceTypes,
  FeedTypes,
  ObservableTypes
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
console.log('=== Building Knowledge Graph from GitHub ===\n');
// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
name: "GitHub Entity Extraction",
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [
ObservableTypes.Repo, // Repository itself
ObservableTypes.Person, // Contributors, authors
ObservableTypes.Organization, // Organizations, companies
ObservableTypes.Software, // Dependencies, tools
ObservableTypes.Category, // Topics, tags
ObservableTypes.Label // Issue labels
]
}
}]
}
});
console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);
// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
name: "Graphlit Samples Repo",
type: FeedTypes.Site,
site: {
type: FeedServiceTypes.GitHub,
token: process.env.GITHUB_TOKEN!,
repositoryOwner: 'graphlit',
repositoryName: 'graphlit-samples',
includeRepositoryMetadata: true,
includeFiles: true,
fileTypes: ['md', 'py', 'ts', 'json', 'yaml'] // Specific file types
},
workflow: { id: workflow.createWorkflow.id }
});
console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);
// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
name: "Graphlit Issues",
type: FeedTypes.Issue,
issue: {
type: FeedServiceTypes.GitHubIssues,
token: process.env.GITHUB_TOKEN!,
repositoryOwner: 'graphlit',
repositoryName: 'graphlit-samples',
includeOpen: true,
includeClosed: false,
readLimit: 100
},
workflow: { id: workflow.createWorkflow.id }
});
console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);
// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;
while (!repoDone || !issuesDone) {
if (!repoDone) {
const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
repoDone = repoStatus.isFeedDone.result;
}
if (!issuesDone) {
const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
issuesDone = issuesStatus.isFeedDone.result;
}
if (!repoDone || !issuesDone) {
console.log(' Syncing... (checking again in 5s)');
await new Promise(resolve => setTimeout(resolve, 5000));
}
}
console.log('✓ Sync complete\n');
// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
filter: {
feeds: [{ id: repoFeed.createFeed.id }]
}
});
console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);
// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
filter: {
types: [ContentTypes.Issue],
feeds: [{ id: issuesFeed.createFeed.id }]
}
});
console.log(`✓ Synced ${issues.contents.results.length} issues\n`);
// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');
// Get all Repo entities
const repos = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);
// Get contributors (Person entities)
const people = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);
// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);
// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');
const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
const labels = issue.issue?.labels || [];
labels.forEach(label => {
labelCounts.set(label, (labelCounts.get(label) || 0) + 1);
});
});
console.log('Most common issue labels:');
Array.from(labelCounts.entries())
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.forEach(([label, count]) => {
console.log(` ${label}: ${count} issues`);
});
// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');
const contributors = new Map<string, {
files: number;
issues: number;
total: number;
}>();
// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
file.observations
?.filter(obs => obs.type === ObservableTypes.Person)
.forEach(obs => {
const name = obs.observable.name;
if (!contributors.has(name)) {
contributors.set(name, { files: 0, issues: 0, total: 0 });
}
contributors.get(name)!.files++;
contributors.get(name)!.total++;
});
});
// Count issues by contributor
issues.contents.results.forEach(issue => {
const author = issue.issue?.author?.name || 'Unknown';
if (!contributors.has(author)) {
contributors.set(author, { files: 0, issues: 0, total: 0 });
}
contributors.get(author)!.issues++;
contributors.get(author)!.total++;
});
console.log('Top contributors:');
Array.from(contributors.entries())
.sort((a, b) => b[1].total - a[1].total)
.slice(0, 5)
.forEach(([name, stats]) => {
console.log(` ${name}: ${stats.files} files, ${stats.issues} issues`);
});
console.log('\n✓ Repository analysis complete!');
Step-by-Step Explanation
Step 1: Create Entity Extraction Workflow
GitHub-Specific Entity Types:
Repo: Repository as entity (name, owner, URL, description)
Person: Contributors, commit authors, issue creators, reviewers
Organization: Repository owner if organization, mentioned companies
Software: Dependencies (from package.json, requirements.txt, etc.)
Category: Topics, tags, project themes
Label: Issue/PR labels
Why Text Extraction:
Code files, README, docs are text-based
Fast and cost-effective
Handles markdown, code, JSON, YAML
Step 2: Configure GitHub Repository Feed
Repository Feed Options:
What Gets Synced:
README.md
Source code files (filtered by fileTypes)
Documentation files
Configuration files (package.json, requirements.txt, etc.)
Repository metadata (description, topics, contributors)
Step 3: Configure GitHub Issues Feed
Issues Feed Options:
What Gets Synced:
Issue title and body
Issue author
Labels
Issue number/identifier
Created/updated dates
Comments (optional)
Step 4: GitHub Token Setup
Creating GitHub Token:
GitHub → Settings → Developer settings → Personal access tokens
Generate new token (classic)
Select scopes:
repo (for private repos) or public_repo (for public only)
read:org (if accessing org repos)
Copy token
Use in Graphlit feed creation
OR via Graphlit Developer Portal:
Go to Developer Portal → Connectors → Version Control
Authorize GitHub
Copy OAuth token
Step 5: Analyze Repository Files
File Content Structure:
README Analysis:
Rich source of Repo, Person, Organization entities
Contributors listed
Dependencies mentioned
Project description
package.json/requirements.txt Analysis:
Dependencies as Software entities
Version information
Project metadata
Step 6: Analyze GitHub Issues
Issue Metadata:
Entity Extraction from Issues:
Person: Issue author, mentioned contributors (@username)
Software: Tools/libraries mentioned
Category: Feature areas, components
Label: Issue labels as Label entities
Step 7: Build Contributor Graph
Contributors from Multiple Sources:
File authors: Extracted from README, commit mentions
Issue creators: From issue author field
Code comments: Developers mentioned in code
Documentation: Authors in docs
Step 8: Dependency Analysis
Extract Software Dependencies:
Configuration Options
Selective File Syncing
By File Extension:
All Files:
Common Patterns:
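A minimal sketch of these selection patterns, using the fileTypes field exactly as it appears in the repo-feed example above; the extension lists themselves are suggestions, not an SDK contract:

```typescript
// By file extension: docs plus common source files, as in the example above
const docsAndSource = ['md', 'py', 'ts', 'json', 'yaml'];

// All files: omit `fileTypes` entirely and the feed syncs everything
const syncEverything: { fileTypes?: string[] } = {};

// Common patterns: documentation-only vs. configuration-only
const docsOnly = ['md', 'mdx', 'rst', 'txt'];
const configOnly = ['json', 'yaml', 'toml'];

// Any of these arrays drops into the feed's `site` block, e.g.:
// site: { ..., includeFiles: true, fileTypes: docsOnly }
console.log(docsAndSource.length, docsOnly.length, configOnly.length);
```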
Branch Selection
Specific Branch:
Default Branch:
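A sketch of the two cases. Note the branch field name here is an assumption, not confirmed SDK schema; check your generated graphql-types for the actual site-feed input before relying on it:

```typescript
// Specific branch: pin the sync to one branch
// (`branch` is a hypothetical field name -- verify against your SDK version)
const specificBranch = {
  repositoryOwner: 'graphlit',
  repositoryName: 'graphlit-samples',
  branch: 'main'
};

// Default branch: omit the field and the feed follows the repo's default branch
const defaultBranch = {
  repositoryOwner: 'graphlit',
  repositoryName: 'graphlit-samples'
};

console.log(specificBranch.branch, 'branch' in defaultBranch);
```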
Issue Filtering
By State:
By Labels:
By Count:
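The state and count filters can be sketched with the field names already shown in the issues-feed example above (includeOpen, includeClosed, readLimit). Label-based filtering is not shown in that example, so no field name is assumed for it here:

```typescript
// By state: open issues only (matches the issues-feed example above)
const openOnly = { includeOpen: true, includeClosed: false };

// By state: everything, including closed issues
const allStates = { includeOpen: true, includeClosed: true };

// By count: cap how many issues a single sync reads
const firstHundred = { ...openOnly, readLimit: 100 };

// These objects drop into the feed's `issue` block alongside token/owner/name.
console.log(openOnly, allStates, firstHundred.readLimit);
```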
Variations
Variation 1: Multi-Repository Analysis
Analyze multiple repositories in an organization:
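One way to sketch this: build a feed config per repository and pass each to createFeed(). The config shape mirrors the repo-feed example above; the string enum values in comments stand in for FeedTypes.Site and FeedServiceTypes.GitHub, and the repository names are illustrative:

```typescript
// Fan out one site feed per repository in an organization.
function buildRepoFeedConfigs(owner: string, repos: string[]) {
  return repos.map(name => ({
    name: `${owner}/${name}`,
    type: 'SITE', // FeedTypes.Site in the generated enums
    site: {
      type: 'GIT_HUB', // FeedServiceTypes.GitHub in the generated enums
      repositoryOwner: owner,
      repositoryName: name,
      includeRepositoryMetadata: true,
      includeFiles: true,
      fileTypes: ['md', 'json']
    }
  }));
}

// Each config would be passed to graphlit.createFeed({ ...config, workflow: { id } })
const configs = buildRepoFeedConfigs('graphlit', ['graphlit-samples', 'graphlit-client-typescript']);
console.log(configs.map(c => c.name));
```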
Variation 2: Dependency Graph Visualization
Map software dependencies:
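A sketch of the edge-building step, assuming you have already synced contents with Software observations attached (mock objects stand in for queryContents results; the 'SOFTWARE' string stands in for ObservableTypes.Software):

```typescript
// Derive deduplicated repo -> dependency edges from Software observations.
type Observation = { type: string; observable: { name: string } };
type Content = { feedName: string; observations: Observation[] };

function dependencyEdges(contents: Content[]): Array<[string, string]> {
  const edges = new Set<string>();
  for (const content of contents) {
    for (const obs of content.observations) {
      if (obs.type === 'SOFTWARE') edges.add(`${content.feedName}->${obs.observable.name}`);
    }
  }
  return Array.from(edges).map(e => e.split('->') as [string, string]);
}

const edges = dependencyEdges([
  { feedName: 'graphlit-samples', observations: [
    { type: 'SOFTWARE', observable: { name: 'openai' } },
    { type: 'PERSON', observable: { name: 'Kirk Marple' } } // ignored: not software
  ]},
  { feedName: 'graphlit-samples', observations: [
    { type: 'SOFTWARE', observable: { name: 'openai' } } // duplicate edge, deduped
  ]}
]);
console.log(edges);
```

The resulting edge list is ready for a graph library or a DOT export for visualization.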
Variation 3: Issue Classification by Entities
Categorize issues by extracted entities:
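A sketch of the classification step over mock issue contents; the type strings stand in for ObservableTypes values, and the issue names are illustrative:

```typescript
// Bucket issues by the entity types observed on them, so software-related
// issues land in one group, people-related issues in another, and so on.
type IssueContent = { name: string; observations: { type: string }[] };

function classifyByEntityType(issues: IssueContent[]): Map<string, string[]> {
  const buckets = new Map<string, string[]>();
  for (const issue of issues) {
    // Deduplicate so an issue with two Software mentions is counted once.
    const types = new Set(issue.observations.map(o => o.type));
    for (const t of types) {
      if (!buckets.has(t)) buckets.set(t, []);
      buckets.get(t)!.push(issue.name);
    }
  }
  return buckets;
}

const buckets = classifyByEntityType([
  { name: 'Issue #12', observations: [{ type: 'SOFTWARE' }, { type: 'SOFTWARE' }] },
  { name: 'Issue #15', observations: [{ type: 'PERSON' }] }
]);
console.log(buckets);
```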
Variation 4: Contributor Activity Timeline
Track contributor activity over time:
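A sketch of monthly activity bucketing, assuming each synced issue carries an author and an ISO-8601 creation date (the field names and mock rows here are illustrative, not the SDK's exact response shape):

```typescript
// Count issues per contributor per month from issue creation dates.
type IssueRow = { author: string; createdAt: string };

function activityByMonth(rows: IssueRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = `${row.author}:${row.createdAt.slice(0, 7)}`; // "author:YYYY-MM"
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const timeline = activityByMonth([
  { author: 'alice', createdAt: '2024-09-03T10:00:00Z' },
  { author: 'alice', createdAt: '2024-09-21T10:00:00Z' },
  { author: 'bob', createdAt: '2024-10-01T10:00:00Z' }
]);
console.log(timeline);
```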
Variation 5: Cross-Repository Entity Linking
Find entities that appear across multiple repositories:
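One way to sketch the linking step: collect which feeds each entity name was observed in, then keep the names seen in more than one feed (mock observation rows stand in for real query results):

```typescript
// Find entity names (people, libraries, orgs) shared across repository feeds.
type Obs = { feedId: string; entityName: string };

function crossRepoEntities(observations: Obs[]): string[] {
  const feedsByEntity = new Map<string, Set<string>>();
  for (const { feedId, entityName } of observations) {
    if (!feedsByEntity.has(entityName)) feedsByEntity.set(entityName, new Set());
    feedsByEntity.get(entityName)!.add(feedId);
  }
  return Array.from(feedsByEntity.entries())
    .filter(([, feeds]) => feeds.size > 1) // appears in 2+ repositories
    .map(([name]) => name);
}

const shared = crossRepoEntities([
  { feedId: 'repo-a', entityName: 'openai' },
  { feedId: 'repo-b', entityName: 'openai' },
  { feedId: 'repo-a', entityName: 'alice' } // only one repo, filtered out
]);
console.log(shared);
```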
Common Issues & Solutions
Issue: Large Repository, Slow Sync
Problem: A repository with thousands of files takes hours to sync.
Solutions:
Filter file types: Only sync relevant files
Skip large files: Binary files slow down processing
Use branch: Sync specific branch instead of all branches
Issue: GitHub API Rate Limiting
Problem: Sync fails with rate limit error.
Explanation: The GitHub API enforces rate limits (5,000 requests/hour for authenticated requests).
Solutions:
Use authenticated token (higher limits)
Sync fewer repositories simultaneously
Increase polling interval
Wait for rate limit reset
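The "increase polling interval" suggestion can be sketched as exponential backoff: instead of the fixed 5-second sleep in the sync loop above, widen each successive wait up to a cap (the base, attempt count, and cap values here are illustrative):

```typescript
// Compute growing polling delays: base * 2^attempt, capped at maxMs.
function backoffDelays(baseMs: number, attempts: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, maxMs));
  }
  return delays;
}

// Use each delay in turn inside the isFeedDone() polling loop.
const delays = backoffDelays(5000, 5, 60000); // [5000, 10000, 20000, 40000, 60000]
console.log(delays);
```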
Issue: Missing Dependencies from package.json
Problem: Software entities not extracted from package files.
Cause: Need to sync configuration files explicitly.
Solution: Include config file types:
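A sketch of the fix, reusing the fileTypes option from the repo-feed example above so dependency manifests are part of the sync:

```typescript
// Extensions that cover common dependency manifests:
// 'json' -> package.json, 'txt' -> requirements.txt,
// 'toml' -> pyproject.toml / Cargo.toml
const fileTypes = ['md', 'py', 'ts', 'json', 'txt', 'toml'];

// Drops into the feed config: site: { ..., includeFiles: true, fileTypes }
console.log(fileTypes);
```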
Issue: No Contributor Entities
Problem: No Person entities extracted from repository.
Causes:
README doesn't list contributors
Code comments don't mention developers
Need to enable repository metadata
Solution: Enable metadata and sync issues:
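A sketch of the two settings involved, using only field names from the feed examples above: repository metadata on the site feed (surfacing listed contributors), plus an issues feed so issue authors become Person entities:

```typescript
// Site feed: turn on repository metadata alongside file syncing.
const siteOptions = { includeRepositoryMetadata: true, includeFiles: true };

// Issues feed: sync both open and closed issues so more authors are seen.
const issueOptions = { includeOpen: true, includeClosed: true, readLimit: 100 };

// These merge into the `site` and `issue` blocks of the two createFeed() calls.
console.log(siteOptions, issueOptions);
```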
Developer Hints
GitHub Token Best Practices
Use fine-grained tokens (new GitHub feature) when possible
Minimum scope: repo for private, public_repo for public
Rotate tokens regularly
Don't commit tokens to code (use env variables)
Monitor token usage in GitHub settings
File Type Recommendations
Documentation Analysis:
Full Code Analysis:
Configuration + Dependencies:
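The three goals above map to fileTypes sets along these lines; the extension lists are recommendations under the assumption your repositories use common languages, not an exhaustive contract:

```typescript
// Documentation analysis: markdown and plain-text docs only
const documentationAnalysis = ['md', 'mdx', 'rst', 'txt'];

// Full code analysis: docs plus the main source languages in the repo
const fullCodeAnalysis = ['md', 'py', 'ts', 'js', 'go', 'rs', 'java'];

// Configuration + dependencies: manifests and lock/config files
const configAndDependencies = ['json', 'yaml', 'toml', 'txt', 'lock'];

console.log(documentationAnalysis, fullCodeAnalysis, configAndDependencies);
```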
Performance Optimization
Start with README + package files only
Add more file types incrementally
Sync issues separately (can be slow)
Use multiple feeds for large org
Cache entity queries
Entity Quality by Source
High confidence: README (explicit mentions), package.json (dependencies)
Medium confidence: Code comments, documentation
Low confidence: Implicit mentions in code
Production Patterns
Pattern from Graphlit Samples
Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:
Syncs public GitHub repository
Extracts Repo, Person, Software entities
Analyzes dependencies from package.json
Builds contributor network
Exports entity graph for visualization
Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:
Syncs GitHub issues
Extracts entities from issue descriptions
Classifies issues by entity types
Groups related issues
Priority ranking by entity importance
Open Source Intelligence Use Cases
Dependency tracking: Which projects use which libraries
Contributor analysis: Developer activity, collaboration
Project relationships: Shared contributors, common dependencies
Technology adoption: Which tools/frameworks are gaining traction
Security analysis: Vulnerable dependency detection