Build Knowledge Graph from GitHub Repositories
User Intent
"How do I extract entities from GitHub repositories and issues to build a knowledge graph? Show me how to analyze code, contributors, dependencies, and project relationships."
Operation
SDK Methods: createWorkflow(), createFeed() (Site or Issue feeds), isFeedDone(), queryContents(), queryObservables()
GraphQL: GitHub feed creation + entity extraction + contributor/project graphs
Entity: GitHub Feed → Content (Files/Issues) → Observations → Observables (Developer/Project Graph)
Prerequisites
Graphlit project with API credentials
GitHub personal access token (via Graphlit Developer Portal)
GitHub repository access
Understanding of feed and workflow concepts
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';
// Single import of the generated enum types used below
import {
FeedTypes,
FeedServiceTypes,
EntityExtractionServiceTypes,
ObservableTypes,
ContentTypes
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
console.log('=== Building Knowledge Graph from GitHub ===\n');
// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
name: "GitHub Entity Extraction",
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [
ObservableTypes.Repo, // Repository itself
ObservableTypes.Person, // Contributors, authors
ObservableTypes.Organization, // Organizations, companies
ObservableTypes.Software, // Dependencies, tools
ObservableTypes.Category, // Topics, tags
ObservableTypes.Label // Issue labels
]
}
}]
}
});
console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);
// Step 2: Create GitHub repository feed
console.log('Step 2: Creating GitHub repository feed...');
const repoFeed = await graphlit.createFeed({
name: "Graphlit Samples Repo",
type: FeedTypes.Site,
site: {
type: FeedServiceTypes.GitHub,
token: process.env.GITHUB_TOKEN!,
repositoryOwner: 'graphlit',
repositoryName: 'graphlit-samples',
includeRepositoryMetadata: true,
includeFiles: true,
fileTypes: ['md', 'py', 'ts', 'json', 'yaml'] // Specific file types
},
workflow: { id: workflow.createWorkflow.id }
});
console.log(`✓ Repo Feed: ${repoFeed.createFeed.id}\n`);
// Step 3: Create GitHub issues feed
console.log('Step 3: Creating GitHub issues feed...');
const issuesFeed = await graphlit.createFeed({
name: "Graphlit Issues",
type: FeedTypes.Issue,
issue: {
type: FeedServiceTypes.GitHubIssues,
token: process.env.GITHUB_TOKEN!,
repositoryOwner: 'graphlit',
repositoryName: 'graphlit-samples',
includeOpen: true,
includeClosed: false,
readLimit: 100
},
workflow: { id: workflow.createWorkflow.id }
});
console.log(`✓ Issues Feed: ${issuesFeed.createFeed.id}\n`);
// Step 4: Wait for sync
console.log('Step 4: Syncing repository...');
let repoDone = false;
let issuesDone = false;
while (!repoDone || !issuesDone) {
if (!repoDone) {
const repoStatus = await graphlit.isFeedDone(repoFeed.createFeed.id);
repoDone = repoStatus.isFeedDone.result;
}
if (!issuesDone) {
const issuesStatus = await graphlit.isFeedDone(issuesFeed.createFeed.id);
issuesDone = issuesStatus.isFeedDone.result;
}
if (!repoDone || !issuesDone) {
console.log(' Syncing... (checking again in 5s)');
await new Promise(resolve => setTimeout(resolve, 5000));
}
}
console.log('✓ Sync complete\n');
// Step 5: Query repository content
console.log('Step 5: Querying repository files...');
const repoFiles = await graphlit.queryContents({
filter: {
feeds: [{ id: repoFeed.createFeed.id }]
}
});
console.log(`✓ Synced ${repoFiles.contents.results.length} files\n`);
// Step 6: Query issues
console.log('Step 6: Querying issues...');
const issues = await graphlit.queryContents({
filter: {
types: [ContentTypes.Issue],
feeds: [{ id: issuesFeed.createFeed.id }]
}
});
console.log(`✓ Synced ${issues.contents.results.length} issues\n`);
// Step 7: Extract repository entities
console.log('Step 7: Analyzing repository entities...\n');
// Get all Repo entities
const repos = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Repo] }
});
console.log(`Repositories: ${repos.observables.results.length}`);
// Get contributors (Person entities)
const people = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Person] }
});
console.log(`Contributors: ${people.observables.results.length}`);
// Get dependencies (Software entities)
const software = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Software] }
});
console.log(`Software/Dependencies: ${software.observables.results.length}\n`);
// Step 8: Analyze issue labels
console.log('Step 8: Analyzing issue labels...\n');
const labelCounts = new Map<string, number>();
issues.contents.results.forEach(issue => {
const labels = issue.issue?.labels || [];
labels.forEach(label => {
labelCounts.set(label, (labelCounts.get(label) || 0) + 1);
});
});
console.log('Most common issue labels:');
Array.from(labelCounts.entries())
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.forEach(([label, count]) => {
console.log(` ${label}: ${count} issues`);
});
// Step 9: Build contributor network
console.log('\nStep 9: Building contributor network...\n');
const contributors = new Map<string, {
files: number;
issues: number;
total: number;
}>();
// Count files by contributor (from observations)
repoFiles.contents.results.forEach(file => {
file.observations
?.filter(obs => obs.type === ObservableTypes.Person)
.forEach(obs => {
const name = obs.observable.name;
if (!contributors.has(name)) {
contributors.set(name, { files: 0, issues: 0, total: 0 });
}
contributors.get(name)!.files++;
contributors.get(name)!.total++;
});
});
// Count issues by contributor
issues.contents.results.forEach(issue => {
const author = issue.issue?.author?.name || 'Unknown';
if (!contributors.has(author)) {
contributors.set(author, { files: 0, issues: 0, total: 0 });
}
contributors.get(author)!.issues++;
contributors.get(author)!.total++;
});
console.log('Top contributors:');
Array.from(contributors.entries())
.sort((a, b) => b[1].total - a[1].total)
.slice(0, 5)
.forEach(([name, stats]) => {
console.log(` ${name}: ${stats.files} files, ${stats.issues} issues`);
});
console.log('\n✓ Repository analysis complete!');
Step-by-Step Explanation
Step 1: Create Entity Extraction Workflow
GitHub-Specific Entity Types:
Repo: Repository as entity (name, owner, URL, description)
Person: Contributors, commit authors, issue creators, reviewers
Organization: Repository owner if organization, mentioned companies
Software: Dependencies (from package.json, requirements.txt, etc.)
Category: Topics, tags, project themes
Label: Issue/PR labels
Why Text Extraction:
Code files, README, docs are text-based
Fast and cost-effective
Handles markdown, code, JSON, YAML
Step 2: Configure GitHub Repository Feed
Repository Feed Options:
site: {
type: FeedServiceTypes.GitHub,
token: githubToken, // GitHub personal access token
repositoryOwner: 'graphlit', // Owner username or org
repositoryName: 'graphlit-samples', // Repository name
includeRepositoryMetadata: true, // Include repo description, topics
includeFiles: true, // Sync files from repo
fileTypes: ['md', 'py', 'ts', 'json', 'yaml'], // Specific file extensions
// Or sync all files:
// fileTypes: [] // Empty = all files
branch: 'main' // Optional: specific branch (default: default branch)
}
What Gets Synced:
README.md
Source code files (filtered by fileTypes)
Documentation files
Configuration files (package.json, requirements.txt, etc.)
Repository metadata (description, topics, contributors)
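To preview which paths a given fileTypes list would select before creating the feed, the extension filter can be approximated locally. This is a minimal sketch: matchesFileTypes is a hypothetical helper, and the matching rule (case-insensitive extension comparison, empty list matches everything) is an assumption based on the option comments above, not confirmed service behavior.

```typescript
// Hypothetical helper: approximates the extension filter locally.
// Assumption: an empty list means "all files", as the option comment suggests.
function matchesFileTypes(path: string, fileTypes: string[]): boolean {
  if (fileTypes.length === 0) return true;
  const fileName = path.split('/').pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  // No extension (e.g. "Makefile") or a bare dotfile (e.g. ".gitignore")
  if (dot <= 0) return false;
  const ext = fileName.slice(dot + 1).toLowerCase();
  return fileTypes.some(t => t.toLowerCase() === ext);
}

const candidatePaths = ['README.md', 'src/index.ts', 'assets/logo.png', 'Makefile'];
const selected = candidatePaths.filter(p => matchesFileTypes(p, ['md', 'ts', 'json']));
console.log(selected); // [ 'README.md', 'src/index.ts' ]
```

Running a check like this against a local clone gives a rough idea of how many files a feed would process, which helps when deciding how narrow the filter should be.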
Step 3: Configure GitHub Issues Feed
Issues Feed Options:
issue: {
type: FeedServiceTypes.GitHubIssues,
token: githubToken,
repositoryOwner: 'graphlit',
repositoryName: 'graphlit-samples',
includeOpen: true, // Sync open issues
includeClosed: false, // Skip closed issues
readLimit: 100, // Max issues to sync
// Optional: filter by labels
labels: ['bug', 'feature', 'documentation']
}
What Gets Synced:
Issue title and body
Issue author
Labels
Issue number/identifier
Created/updated dates
Comments (optional)
Step 4: GitHub Token Setup
Creating GitHub Token:
GitHub → Settings → Developer settings → Personal access tokens
Generate new token (classic)
Select scopes:
repo (for private repos) or public_repo (for public only)
read:org (if accessing org repos)
Copy token
Use in Graphlit feed creation
OR via Graphlit Developer Portal:
Go to Developer Portal → Connectors → Version Control
Authorize GitHub
Copy OAuth token
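Whichever way the token is obtained, it is safer to load it from the environment and fail early if it is absent. The sketch below assumes the token lives in a GITHUB_TOKEN variable, as in the complete example; readGitHubToken is a hypothetical helper, not part of the Graphlit SDK.

```typescript
// Hypothetical helper: fails fast when the token is missing instead of
// passing `undefined` into feed creation.
function readGitHubToken(
  env: Record<string, string | undefined>,
  variableName = 'GITHUB_TOKEN'
): string {
  const value = env[variableName]?.trim();
  if (!value) {
    throw new Error(`Missing ${variableName}; set it in the environment, not in source code.`);
  }
  return value;
}

// In Node.js this would be called as readGitHubToken(process.env).
// The literal below is a placeholder, not a real token.
const githubToken = readGitHubToken({ GITHUB_TOKEN: 'example-token-value ' });
console.log(`Token loaded (${githubToken.length} characters)`);
```

Passing the environment in as a parameter keeps the helper easy to test and avoids the non-null assertion (process.env.GITHUB_TOKEN!) used in the full example.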
Step 5: Analyze Repository Files
File Content Structure:
const file = await graphlit.getContent(fileId);
console.log(`File: ${file.content.name}`);
console.log(`Type: ${file.content.fileType}`);
console.log(`Path: ${file.content.uri}`);
// Extracted entities from file
file.content.observations?.forEach(obs => {
console.log(`${obs.type}: ${obs.observable.name}`);
});
README Analysis:
Rich source of Repo, Person, Organization entities
Contributors listed
Dependencies mentioned
Project description
package.json/requirements.txt Analysis:
Dependencies as Software entities
Version information
Project metadata
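To sanity-check the Software entities extracted from a manifest, the declared dependencies can be read directly from the file text and compared by name. A minimal sketch, assuming a standard package.json layout; parseDeclaredDependencies is a hypothetical helper and the sample package names are illustrative only.

```typescript
interface DeclaredDependency {
  name: string;
  version: string;
  dev: boolean;
}

// Parses the text of a package.json file into a flat dependency list.
function parseDeclaredDependencies(packageJsonText: string): DeclaredDependency[] {
  const pkg = JSON.parse(packageJsonText) as {
    dependencies?: Record<string, string>;
    devDependencies?: Record<string, string>;
  };
  const toList = (deps: Record<string, string> | undefined, dev: boolean) =>
    Object.entries(deps ?? {}).map(([name, version]) => ({ name, version, dev }));
  return [...toList(pkg.dependencies, false), ...toList(pkg.devDependencies, true)];
}

// Illustrative manifest; names and versions are made up for the example.
const manifestText = JSON.stringify({
  name: 'demo',
  dependencies: { 'example-lib': '^1.0.0' },
  devDependencies: { 'example-build-tool': '^2.3.0' }
});
const declared = parseDeclaredDependencies(manifestText);
// Names that extraction would be expected to surface as Software entities
console.log(declared.map(d => `${d.name}@${d.version}`));
```

Any declared name missing from the queryObservables results for Software is a hint that the manifest was not synced or that extraction skipped it.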
Step 6: Analyze GitHub Issues
Issue Metadata:
issue: {
identifier: "42", // Issue number
title: "Add feature X",
project: "graphlit-samples",
status: "Open",
priority: "High",
labels: ["feature", "enhancement"],
author: {
name: "Kirk Marple",
email: "[email protected]"
}
}
Entity Extraction from Issues:
Person: Issue author, mentioned contributors (@username)
Software: Tools/libraries mentioned
Category: Feature areas, components
Label: Issue labels as Label entities
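The @username mentions noted above can also be pulled out of issue text locally, which is useful for cross-checking extracted Person entities. This is a rough sketch, not how Graphlit performs extraction; extractMentions is a hypothetical helper, and the pattern assumes GitHub-style handles (alphanumerics and single hyphens, up to 39 characters).

```typescript
// Extracts unique @username mentions from issue text.
// Email addresses are skipped because the "@" must not follow a word character.
function extractMentions(text: string): string[] {
  const pattern = /(^|[^A-Za-z0-9_@.])@([A-Za-z0-9](?:-?[A-Za-z0-9]){0,38})/g;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const handle = match[2];
    if (handle) found.add(handle.toLowerCase());
  }
  return Array.from(found);
}

const issueBody = 'Thanks @alice and @Bob-Smith! cc @alice, mail me at dev@example.com';
const mentions = extractMentions(issueBody);
console.log(mentions); // [ 'alice', 'bob-smith' ]
```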
Step 7: Build Contributor Graph
Contributors from Multiple Sources:
File authors: Extracted from README, commit mentions
Issue creators: From issue author field
Code comments: Developers mentioned in code
Documentation: Authors in docs
// Deduplicate contributors
const uniqueContributors = new Map<string, {
observableId: string;
name: string;
email?: string;
contributions: number;
}>();
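// Combine content from all sources (repository files and issues queried earlier)
const allContent = [...repoFiles.contents.results, ...issues.contents.results];
allContent.forEach(content => {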
content.observations
?.filter(obs => obs.type === ObservableTypes.Person)
.forEach(obs => {
if (!uniqueContributors.has(obs.observable.id)) {
uniqueContributors.set(obs.observable.id, {
observableId: obs.observable.id,
name: obs.observable.name,
email: obs.observable.properties?.email,
contributions: 0
});
}
uniqueContributors.get(obs.observable.id)!.contributions++;
});
});
Step 8: Dependency Analysis
Extract Software Dependencies:
// Get all Software entities
const dependencies = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Software] }
});
// Find which files reference each dependency
for (const dep of dependencies.observables.results) {
const references = await graphlit.queryContents({
filter: {
feeds: [{ id: repoFeed.createFeed.id }],
observations: [{
type: ObservableTypes.Software,
observable: { id: dep.observable.id }
}]
}
});
console.log(`${dep.observable.name}: ${references.contents.results.length} files`);
}
Configuration Options
Selective File Syncing
By File Extension:
fileTypes: ['md', 'py', 'ts', 'js', 'json', 'yaml']
All Files:
fileTypes: [] // Empty = sync all files
Common Patterns:
// Documentation only
fileTypes: ['md', 'rst', 'txt']
// Source code only
fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs']
// Configuration files
fileTypes: ['json', 'yaml', 'yml', 'toml', 'ini']
Branch Selection
Specific Branch:
site: {
branch: 'develop' // Sync specific branch
}
Default Branch:
site: {
branch: undefined // Uses repo's default branch
}
Issue Filtering
By State:
issue: {
includeOpen: true,
includeClosed: true // Include both open and closed
}
By Labels:
issue: {
labels: ['bug', 'security', 'high-priority'] // Only these labels
}
By Count:
issue: {
readLimit: 500 // Most recent 500 issues
}
Variations
Variation 1: Multi-Repository Analysis
Analyze multiple repositories in an organization:
const repos = [
{ owner: 'graphlit', name: 'graphlit-client-typescript' },
{ owner: 'graphlit', name: 'graphlit-client-python' },
{ owner: 'graphlit', name: 'graphlit-client-dotnet' }
];
const feeds = await Promise.all(
repos.map(repo =>
graphlit.createFeed({
name: `${repo.owner}/${repo.name}`,
type: FeedTypes.Site,
site: {
type: FeedServiceTypes.GitHub,
token: githubToken,
repositoryOwner: repo.owner,
repositoryName: repo.name
},
workflow: { id: workflowId }
})
)
);
// Wait for all to sync
const waitForAll = async () => {
let allDone = false;
while (!allDone) {
const statuses = await Promise.all(
feeds.map(f => graphlit.isFeedDone(f.createFeed.id))
);
allDone = statuses.every(s => s.isFeedDone.result);
if (!allDone) {
await new Promise(resolve => setTimeout(resolve, 5000));
}
}
};
await waitForAll();
// Analyze cross-repo entities
const allRepos = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Repo] }
});
console.log(`Total repositories: ${allRepos.observables.results.length}`);
Variation 2: Dependency Graph Visualization
Map software dependencies:
// Extract all Software entities and their relationships
const dependencies = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Software] }
});
interface DependencyNode {
name: string;
usedBy: string[]; // Files/repos using this dependency
version?: string;
}
const depGraph = new Map<string, DependencyNode>();
for (const dep of dependencies.observables.results) {
const usages = await graphlit.queryContents({
filter: {
observations: [{
type: ObservableTypes.Software,
observable: { id: dep.observable.id }
}]
}
});
depGraph.set(dep.observable.name, {
name: dep.observable.name,
usedBy: usages.contents.results.map(c => c.name),
version: dep.observable.properties?.version
});
}
// Find most common dependencies
const topDeps = Array.from(depGraph.values())
.sort((a, b) => b.usedBy.length - a.usedBy.length)
.slice(0, 10);
console.log('Most used dependencies:');
topDeps.forEach(dep => {
console.log(` ${dep.name}: ${dep.usedBy.length} files`);
});
Variation 3: Issue Classification by Entities
Categorize issues by extracted entities:
// Group issues by entity types
const issuesByEntity = new Map<string, Array<typeof issues.contents.results[0]>>();
issues.contents.results.forEach(issue => {
issue.observations?.forEach(obs => {
const key = `${obs.type}: ${obs.observable.name}`;
if (!issuesByEntity.has(key)) {
issuesByEntity.set(key, []);
}
issuesByEntity.get(key)!.push(issue);
});
});
// Find entities with most issues
const entityIssueCount = Array.from(issuesByEntity.entries())
.map(([entity, related]) => ({ entity, count: related.length })) // "related" avoids shadowing the outer "issues"
.sort((a, b) => b.count - a.count);
console.log('Entities with most related issues:');
entityIssueCount.slice(0, 10).forEach(item => {
console.log(` ${item.entity}: ${item.count} issues`);
});
Variation 4: Contributor Activity Timeline
Track contributor activity over time:
interface ContributorActivity {
name: string;
firstContribution: Date;
lastContribution: Date;
contributions: Array<{ date: Date; type: 'file' | 'issue' }>;
}
const activity = new Map<string, ContributorActivity>();
// Track file contributions
repoFiles.contents.results.forEach(file => {
const date = new Date(file.creationDate);
file.observations
?.filter(obs => obs.type === ObservableTypes.Person)
.forEach(obs => {
const name = obs.observable.name;
if (!activity.has(name)) {
activity.set(name, {
name,
firstContribution: date,
lastContribution: date,
contributions: []
});
}
const contrib = activity.get(name)!;
contrib.contributions.push({ date, type: 'file' });
if (date < contrib.firstContribution) contrib.firstContribution = date;
if (date > contrib.lastContribution) contrib.lastContribution = date;
});
});
// Track issue contributions
issues.contents.results.forEach(issue => {
const author = issue.issue?.author?.name;
const date = new Date(issue.creationDate);
if (author) {
if (!activity.has(author)) {
activity.set(author, {
name: author,
firstContribution: date,
lastContribution: date,
contributions: []
});
}
const contrib = activity.get(author)!;
contrib.contributions.push({ date, type: 'issue' });
if (date < contrib.firstContribution) contrib.firstContribution = date;
if (date > contrib.lastContribution) contrib.lastContribution = date;
}
});
// Find most active contributors (by recent activity)
const recent = Array.from(activity.values())
.sort((a, b) => b.lastContribution.getTime() - a.lastContribution.getTime())
.slice(0, 10);
console.log('Most recently active contributors:');
recent.forEach(contrib => {
console.log(` ${contrib.name}: ${contrib.contributions.length} contributions`);
console.log(` First: ${contrib.firstContribution.toLocaleDateString()}`);
console.log(` Last: ${contrib.lastContribution.toLocaleDateString()}`);
});
Variation 5: Cross-Repository Entity Linking
Find entities that appear across multiple repositories:
// After syncing multiple repos, find cross-repo entities
const allPeople = await graphlit.queryObservables({
filter: { types: [ObservableTypes.Person] }
});
for (const person of allPeople.observables.results) {
// Find all content (across repos) mentioning this person
const mentions = await graphlit.queryContents({
filter: {
observations: [{
type: ObservableTypes.Person,
observable: { id: person.observable.id }
}]
}
});
// Group by feed (repository)
const reposMentioned = new Set(
mentions.contents.results.map(c => c.feedId)
);
if (reposMentioned.size > 1) {
console.log(`${person.observable.name}: appears in ${reposMentioned.size} repos`);
}
}
Common Issues & Solutions
Issue: Large Repository, Slow Sync
Problem: A repository with thousands of files can take hours to sync.
Solutions:
Filter file types: Only sync relevant files
Skip large files: Binary files slow down processing
Use branch: Sync specific branch instead of all branches
site: {
fileTypes: ['md', 'py', 'js'], // Skip images, binaries
branch: 'main' // Single branch only
}
Issue: GitHub API Rate Limiting
Problem: Sync fails with rate limit error.
Explanation: The GitHub API enforces rate limits (5,000 requests per hour for authenticated requests).
Solutions:
Use authenticated token (higher limits)
Sync fewer repositories simultaneously
Increase polling interval
Wait for rate limit reset
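One way to increase the polling interval gradually is exponential backoff with a cap. The sketch below is an illustration under assumptions: backoffDelays and pollWithBackoff are hypothetical helpers, and isDone stands in for a wrapper around graphlit.isFeedDone for a given feed.

```typescript
// Delay schedule: baseMs * factor^attempt, capped at maxMs.
function backoffDelays(baseMs: number, factor: number, maxMs: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(maxMs, baseMs * Math.pow(factor, i)));
}

// Polls `isDone` until it reports completion or the schedule is exhausted.
// Resolves to true if the feed finished, false if the schedule ran out.
async function pollWithBackoff(
  isDone: () => Promise<boolean>,
  delays: number[]
): Promise<boolean> {
  for (const delay of delays) {
    if (await isDone()) return true;
    await new Promise<void>(resolve => setTimeout(resolve, delay));
  }
  return isDone();
}

const schedule = backoffDelays(5000, 2, 60000, 6);
console.log(schedule); // [ 5000, 10000, 20000, 40000, 60000, 60000 ]

// Stand-in check that reports completion immediately, so no delay is incurred.
pollWithBackoff(async () => true, schedule).then(done => console.log(`Done: ${done}`));
```

Compared with the fixed 5-second loop in the complete example, this checks often at first and then settles at one check per minute, and it gives up after a bounded number of attempts instead of looping forever.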
Issue: Missing Dependencies from package.json
Problem: Software entities not extracted from package files.
Cause: Configuration files are only synced when their file types are included.
Solution: Include config file types:
fileTypes: ['json', 'yaml', 'toml', 'lock']
Issue: No Contributor Entities
Problem: No Person entities extracted from repository.
Causes:
README doesn't list contributors
Code comments don't mention developers
Need to enable repository metadata
Solution: Enable metadata and sync issues:
site: {
includeRepositoryMetadata: true // Includes contributor info
}
// Also sync issues for issue authors
issue: {
type: FeedServiceTypes.GitHubIssues,
includeOpen: true
}
Developer Hints
GitHub Token Best Practices
Use fine-grained tokens (new GitHub feature) when possible
Minimum scope: repo for private repositories, public_repo for public ones
Rotate tokens regularly
Don't commit tokens to code (use env variables)
Monitor token usage in GitHub settings
File Type Recommendations
Documentation Analysis:
fileTypes: ['md', 'rst', 'txt', 'adoc']
Full Code Analysis:
fileTypes: ['py', 'js', 'ts', 'java', 'go', 'rs', 'cpp', 'c', 'h']
Configuration + Dependencies:
fileTypes: ['json', 'yaml', 'yml', 'toml', 'lock', 'txt']
Performance Optimization
Start with README + package files only
Add more file types incrementally
Sync issues separately (can be slow)
Use multiple feeds for large org
Cache entity queries
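Caching entity queries can be as simple as memoizing the lookup by entity type. A minimal sketch, assuming results do not need to be refreshed during a single analysis run; memoizeAsync is a hypothetical helper and fakeQuery stands in for a call such as graphlit.queryObservables with a types filter.

```typescript
// Wraps an async lookup so that repeated calls with the same key share one
// in-flight or completed request.
function memoizeAsync<T>(fetcher: (key: string) => Promise<T>): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key: string) => {
    let pending = cache.get(key);
    if (!pending) {
      pending = fetcher(key).catch(err => {
        cache.delete(key); // do not cache failures
        throw err;
      });
      cache.set(key, pending);
    }
    return pending;
  };
}

// Stand-in for a real entity query; counts how often it is actually invoked.
let fetchCount = 0;
const fakeQuery = async (entityType: string): Promise<string[]> => {
  fetchCount++;
  return [`${entityType}-1`, `${entityType}-2`];
};

const cachedQuery = memoizeAsync(fakeQuery);
cachedQuery('Person');
cachedQuery('Person'); // served from the cache
cachedQuery('Software');
console.log(`Underlying queries made: ${fetchCount}`); // 2
```

Storing the promise instead of the resolved value means concurrent callers for the same type also share a single request.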
Entity Quality by Source
High confidence: README (explicit mentions), package.json (dependencies)
Medium confidence: Code comments, documentation
Low confidence: Implicit mentions in code
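These tiers can be attached to observations by looking at the file each one came from. The mapping below is a simplification and an assumption: sourceConfidence is a hypothetical helper, it treats documentation files as medium and all code files as low, and it cannot tell a code comment from an implicit mention inside the same source file.

```typescript
type Confidence = 'high' | 'medium' | 'low';

// Hypothetical mapping from a file path to the confidence tiers listed above.
function sourceConfidence(filePath: string): Confidence {
  const lower = filePath.toLowerCase();
  const base = lower.split('/').pop() ?? lower;
  if (base.startsWith('readme') || base === 'package.json') return 'high';
  if (/\.(md|rst|txt|adoc)$/.test(base)) return 'medium';
  return 'low';
}

const sourceFiles = ['README.md', 'package.json', 'docs/setup.md', 'src/client.ts'];
for (const file of sourceFiles) {
  console.log(`${file}: ${sourceConfidence(file)}`);
}
```

A tier like this could be used to sort or filter entities before building the contributor or dependency graphs, keeping low-confidence mentions out of headline counts.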
Production Patterns
Pattern from Graphlit Samples
Graphlit_2024_09_29_Explore_GitHub_Repo.ipynb:
Syncs public GitHub repository
Extracts Repo, Person, Software entities
Analyzes dependencies from package.json
Builds contributor network
Exports entity graph for visualization
Graphlit_2025_03_17_Classify_GitHub_Issues.ipynb:
Syncs GitHub issues
Extracts entities from issue descriptions
Classifies issues by entity types
Groups related issues
Priority ranking by entity importance
Open Source Intelligence Use Cases
Dependency tracking: Which projects use which libraries
Contributor analysis: Developer activity, collaboration
Project relationships: Shared contributors, common dependencies
Technology adoption: Which tools and frameworks are gaining traction
Security analysis: Vulnerable dependency detection
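The project-relationship use case reduces to set intersection once contributors are grouped by repository. A minimal sketch under assumptions: sharedContributors is a hypothetical helper, and the input map would be built from Person observations grouped per feed; the repository and contributor names here are sample data.

```typescript
interface RepoOverlap {
  repos: [string, string];
  shared: string[];
}

// Input: repository name -> contributor names.
// Output: each pair of repositories that has at least one contributor in common.
function sharedContributors(byRepo: Map<string, Set<string>>): RepoOverlap[] {
  const entries = Array.from(byRepo.entries()).sort(([a], [b]) => a.localeCompare(b));
  const result: RepoOverlap[] = [];
  entries.forEach(([repoA, peopleA], i) => {
    for (const [repoB, peopleB] of entries.slice(i + 1)) {
      const shared = Array.from(peopleA).filter(p => peopleB.has(p)).sort();
      if (shared.length > 0) result.push({ repos: [repoA, repoB], shared });
    }
  });
  return result;
}

// Sample data for illustration only.
const repoContributors = new Map<string, Set<string>>([
  ['client-typescript', new Set(['alice', 'bob'])],
  ['client-python', new Set(['bob', 'carol'])],
  ['client-dotnet', new Set(['dave'])]
]);
const overlaps = sharedContributors(repoContributors);
console.log(overlaps); // one pair: client-python / client-typescript, shared: bob
```

The same function works for common dependencies if the sets hold Software entity names instead of contributor names.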