Understanding Confidence Scores and Occurrences
Use Case: Understanding Confidence Scores and Occurrences
User Intent
"What are confidence scores and occurrences in entity extraction? How do I use them to validate and filter entities?"
Operation
SDK Method: Access via content.observations[].occurrences[]
GraphQL Query: getContent with observations
Entity: Observation occurrence data
Prerequisites
Content with extracted entities (workflow with extraction stage)
Understanding of observation model
Graphlit project with API credentials
Complete Code Example (TypeScript)
```typescript
import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Get content with entity observations
const contentResponse = await graphlit.getContent('content-id-here');
const content = contentResponse.content;
if (!content) throw new Error('Content not found');

console.log(`\nAnalyzing entities in: ${content.name}\n`);

// Iterate through all observations
content.observations?.forEach(observation => {
  const occurrences = observation.occurrences ?? [];

  console.log(`\n${observation.type}: ${observation.observable.name}`);
  console.log(`Entity ID: ${observation.observable.id}`);
  console.log(`Total occurrences: ${occurrences.length}\n`);

  // Analyze each occurrence
  occurrences.forEach((occurrence, index) => {
    console.log(` Occurrence #${index + 1}:`);
    console.log(` Confidence: ${occurrence.confidence.toFixed(3)}`);

    // Location context (varies by content type).
    // `!= null` skips fields that are either null or undefined.
    if (occurrence.pageIndex != null) {
      console.log(` Page: ${occurrence.pageIndex}`);
    }
    if (occurrence.boundingBox) {
      console.log(` Location: (${occurrence.boundingBox.left}, ${occurrence.boundingBox.top})`);
      console.log(` Size: ${occurrence.boundingBox.width} x ${occurrence.boundingBox.height}`);
    }
    if (occurrence.startTime != null) {
      console.log(` Time: ${occurrence.startTime}s - ${occurrence.endTime}s`);
    }
    console.log();
  });

  // Calculate average confidence (skipped when there are no occurrences)
  if (occurrences.length > 0) {
    const avgConfidence =
      occurrences.reduce((sum, occ) => sum + occ.confidence, 0) / occurrences.length;
    console.log(` Average confidence: ${avgConfidence.toFixed(3)}`);
  }
});

// Filter high-confidence entities
const highConfidenceEntities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.8)
);
console.log(`\nHigh-confidence entities (>=0.8): ${highConfidenceEntities?.length || 0}`);
```
### Python
```python
from graphlit import Graphlit

graphlit = Graphlit()

# Key differences: snake_case methods
content_response = await graphlit.client.get_content(id="content-id-here")
content = content_response.content

print(f"\nAnalyzing entities in: {content.name}\n")

# Iterate through observations
for observation in content.observations or []:
    occurrences = observation.occurrences or []

    print(f"\n{observation.type}: {observation.observable.name}")
    print(f"Entity ID: {observation.observable.id}")
    print(f"Total occurrences: {len(occurrences)}\n")

    # Analyze occurrences
    for idx, occurrence in enumerate(occurrences):
        print(f" Occurrence #{idx + 1}:")
        print(f" Confidence: {occurrence.confidence:.3f}")
        if occurrence.page_index is not None:
            print(f" Page: {occurrence.page_index}")
        if occurrence.bounding_box:
            box = occurrence.bounding_box
            print(f" Location: ({box.left}, {box.top})")
            print(f" Size: {box.width} x {box.height}")
        if occurrence.start_time is not None:
            print(f" Time: {occurrence.start_time}s - {occurrence.end_time}s")
        print()

    # Average confidence (skipped when there are no occurrences)
    if occurrences:
        avg_conf = sum(occ.confidence for occ in occurrences) / len(occurrences)
        print(f" Average confidence: {avg_conf:.3f}")

# Filter high-confidence
high_conf = [
    obs for obs in content.observations or []
    if any(occ.confidence >= 0.8 for occ in obs.occurrences or [])
]
print(f"\nHigh-confidence entities (>=0.8): {len(high_conf)}")
```
### C#
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Graphlit;

var graphlit = new Graphlit();

// Key differences: PascalCase methods
var contentResponse = await graphlit.GetContent(id: "content-id-here");
var content = contentResponse.Content;

Console.WriteLine($"\nAnalyzing entities in: {content.Name}\n");

// Iterate through observations
foreach (var observation in content.Observations ?? new List<Observation>())
{
    Console.WriteLine($"\n{observation.Type}: {observation.Observable.Name}");
    Console.WriteLine($"Entity ID: {observation.Observable.Id}");
    Console.WriteLine($"Total occurrences: {observation.Occurrences?.Count ?? 0}\n");

    // Analyze occurrences
    var occurrences = observation.Occurrences ?? new List<Occurrence>();
    for (int i = 0; i < occurrences.Count; i++)
    {
        var occurrence = occurrences[i];
        Console.WriteLine($" Occurrence #{i + 1}:");
        Console.WriteLine($" Confidence: {occurrence.Confidence:F3}");
        if (occurrence.PageIndex.HasValue)
            Console.WriteLine($" Page: {occurrence.PageIndex}");
        if (occurrence.BoundingBox != null)
        {
            var box = occurrence.BoundingBox;
            Console.WriteLine($" Location: ({box.Left}, {box.Top})");
            Console.WriteLine($" Size: {box.Width} x {box.Height}");
        }
        if (occurrence.StartTime.HasValue)
            Console.WriteLine($" Time: {occurrence.StartTime}s - {occurrence.EndTime}s");
        Console.WriteLine();
    }

    // Average confidence (Average() throws on an empty sequence, so guard first)
    if (occurrences.Count > 0)
    {
        var avgConf = occurrences.Average(occ => occ.Confidence);
        Console.WriteLine($" Average confidence: {avgConf:F3}");
    }
}
```
Step-by-Step Explanation
Step 1: Understanding Confidence Scores
What is Confidence?
Score from 0.0 (uncertain) to 1.0 (very certain)
Provided by LLM during entity extraction
Indicates extraction quality/reliability
Per-occurrence (same entity can have different confidences)
Confidence Ranges:
0.9 - 1.0: Very high confidence (explicit mentions, clear context)
0.7 - 0.9: High confidence (standard mentions, good context)
0.5 - 0.7: Medium confidence (implicit mentions, unclear context)
0.3 - 0.5: Low confidence (ambiguous mentions, weak context)
0.0 - 0.3: Very low confidence (likely false positives)
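The ranges above can be turned into a label for a single score. This is a minimal sketch: `confidenceBand` is a hypothetical helper, not part of the SDK, and it assumes lower bounds are inclusive (a score of exactly 0.9 counts as very high), which the list above leaves open.

```typescript
// Hypothetical helper: maps one confidence score to the bands listed above.
// Assumption: each lower bound is inclusive.
type ConfidenceBand = 'very high' | 'high' | 'medium' | 'low' | 'very low';

function confidenceBand(confidence: number): ConfidenceBand {
  if (confidence >= 0.9) return 'very high';
  if (confidence >= 0.7) return 'high';
  if (confidence >= 0.5) return 'medium';
  if (confidence >= 0.3) return 'low';
  return 'very low';
}

console.log(confidenceBand(0.92)); // very high
console.log(confidenceBand(0.7));  // high
```

A label like this is mainly useful for display; for filtering, comparing against a numeric threshold keeps the cutoff explicit.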
Step 2: Occurrence Context by Content Type
For Documents (PDF, Word, etc.):
```typescript
occurrence: {
  confidence: 0.92,
  pageIndex: 5, // Zero-based page number
  boundingBox: {
    left: 100.5, // X coordinate (pixels or points)
    top: 250.3, // Y coordinate
    width: 150.2, // Width
    height: 20.5 // Height
  }
}
```
For Audio/Video (Transcripts):
```typescript
occurrence: {
  confidence: 0.88,
  startTime: 125.3, // Seconds from start
  endTime: 127.8, // Seconds from start
  transcript: "Kirk Marple" // Optional: exact text
}
```
For Images:
```typescript
occurrence: {
  confidence: 0.85,
  boundingBox: {
    left: 50, // Pixel coordinates
    top: 100,
    width: 200,
    height: 150
  }
}
```
For Text/Messages (Emails, Slack):
```typescript
occurrence: {
  confidence: 0.95,
  // No spatial context - just presence in text
}
```
Step 3: Multiple Occurrences
Same entity mentioned multiple times = multiple occurrences:
```typescript
// Example: "Kirk Marple" appears on pages 1, 5, and 12
observation: {
  type: ObservableTypes.Person,
  observable: {
    id: "obs-123",
    name: "Kirk Marple"
  },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 }, // Page 1 (zero-based)
    { confidence: 0.88, pageIndex: 4 }, // Page 5
    { confidence: 0.92, pageIndex: 11 } // Page 12
  ]
}
```
Step 4: Filtering by Confidence
High Precision (Few False Positives):
```typescript
const highPrecision = content.observations?.filter(obs =>
  // [].every() is true, so also require at least one occurrence
  !!obs.occurrences?.length &&
  obs.occurrences.every(occ => occ.confidence >= 0.8) // ALL occurrences high
);
```
High Recall (Few False Negatives):
```typescript
const highRecall = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.5) // ANY occurrence medium+
);
```
Balanced:
```typescript
const balanced = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.7) // ANY occurrence high
);
```
Configuration Options
Setting Confidence Thresholds
By Use Case:
Legal/Compliance (High Precision):
```typescript
const threshold = 0.85; // Very conservative
const entities = content.observations?.filter(obs =>
  // [].every() is true, so also require at least one occurrence
  !!obs.occurrences?.length &&
  obs.occurrences.every(occ => occ.confidence >= threshold)
);
```
Research/Discovery (High Recall):
```typescript
const threshold = 0.5; // More permissive
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);
```
Production (Balanced):
```typescript
const threshold = 0.7; // Recommended default
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);
```
Analyzing Confidence Distribution
```typescript
// Observation type from the SDK's generated GraphQL types
import type { Observation } from 'graphlit-client/dist/generated/graphql-types';

// Group by confidence range (lower bound of each range is inclusive)
function analyzeConfidence(observations: Observation[]) {
  const distribution = {
    veryHigh: 0, // 0.9 - 1.0
    high: 0, // 0.7 - 0.9
    medium: 0, // 0.5 - 0.7
    low: 0, // 0.3 - 0.5
    veryLow: 0 // 0.0 - 0.3
  };
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.confidence >= 0.9) distribution.veryHigh++;
      else if (occ.confidence >= 0.7) distribution.high++;
      else if (occ.confidence >= 0.5) distribution.medium++;
      else if (occ.confidence >= 0.3) distribution.low++;
      else distribution.veryLow++;
    });
  });
  return distribution;
}

const dist = analyzeConfidence(content.observations || []);
console.log('Confidence distribution:', dist);
```
Variations
Variation 1: Find Entities by Page Number
Locate entities on specific document pages:
```typescript
function findEntitiesOnPage(observations: Observation[], pageNum: number) {
  const pageIndex = pageNum - 1; // Convert to zero-based
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      occurrences: obs.occurrences?.filter(occ =>
        occ.pageIndex === pageIndex
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

const page5Entities = findEntitiesOnPage(content.observations || [], 5);
console.log(`Entities on page 5: ${page5Entities.length}`);
```
Variation 2: Find Entities in Time Range (Audio/Video)
Extract entities mentioned during specific timeframe:
```typescript
function findEntitiesInTimeRange(
  observations: Observation[],
  startSec: number,
  endSec: number
) {
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      // Keep occurrences whose start time falls inside the range;
      // `!= null` excludes both null and undefined start times
      occurrences: obs.occurrences?.filter(occ =>
        occ.startTime != null &&
        occ.startTime >= startSec &&
        occ.startTime <= endSec
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

// Find entities mentioned between 5:00 and 10:00
const segment = findEntitiesInTimeRange(content.observations || [], 300, 600);
console.log(`Entities in time range: ${segment.length}`);
```
Variation 3: Visual Entity Locator (PDFs with Bounding Boxes)
Find entity positions for highlighting:
```typescript
function getEntityLocations(observations: Observation[]) {
  const locations: Array<{
    entity: string;
    type: string;
    page: number;
    box: { x: number; y: number; width: number; height: number };
    confidence: number;
  }> = [];
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      // `!= null` excludes null as well as undefined, so a missing page
      // index is never converted into "page 1"
      if (occ.boundingBox && occ.pageIndex != null) {
        locations.push({
          entity: obs.observable.name,
          type: obs.type,
          page: occ.pageIndex + 1, // Convert to 1-based
          box: {
            x: occ.boundingBox.left,
            y: occ.boundingBox.top,
            width: occ.boundingBox.width,
            height: occ.boundingBox.height
          },
          confidence: occ.confidence
        });
      }
    });
  });
  return locations;
}

const locations = getEntityLocations(content.observations || []);
// Use for PDF highlighting UI
```
Variation 4: Confidence-Weighted Entity Ranking
Rank entities by total confidence across all occurrences:
```typescript
function rankEntitiesByConfidence(observations: Observation[]) {
  return observations
    .map(obs => {
      const occCount = obs.occurrences?.length || 0;
      const totalConf = obs.occurrences?.reduce(
        (sum, occ) => sum + occ.confidence, 0
      ) || 0;
      const avgConf = occCount > 0 ? totalConf / occCount : 0;
      return {
        entity: obs.observable,
        type: obs.type,
        averageConfidence: avgConf,
        occurrenceCount: occCount,
        totalConfidence: totalConf,
        // Total confidence equals average x count, so frequency is
        // already weighted in; multiplying by count again would double-count it
        score: totalConf
      };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = rankEntitiesByConfidence(content.observations || []);
console.log('Top entities by confidence:');
ranked.slice(0, 10).forEach((item, i) => {
  console.log(`${i + 1}. ${item.entity.name} (${item.type})`);
  console.log(` Avg conf: ${item.averageConfidence.toFixed(3)}, Count: ${item.occurrenceCount}`);
});
```
Variation 5: Occurrence Clustering (Find Dense Entity Regions)
Find pages/sections with high entity density:
```typescript
function findEntityClusters(observations: Observation[]) {
  const pageMap = new Map<number, number>(); // page index → number of entity occurrences
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      // `!= null` keeps null page indexes out of the map
      if (occ.pageIndex != null) {
        pageMap.set(
          occ.pageIndex,
          (pageMap.get(occ.pageIndex) || 0) + 1
        );
      }
    });
  });
  // Sort pages by entity density
  return Array.from(pageMap.entries())
    .map(([page, count]) => ({ page: page + 1, entityCount: count }))
    .sort((a, b) => b.entityCount - a.entityCount);
}

const clusters = findEntityClusters(content.observations || []);
console.log('Pages with most entities:');
clusters.slice(0, 5).forEach(cluster => {
  console.log(` Page ${cluster.page}: ${cluster.entityCount} entities`);
});
```
Common Issues & Solutions
Issue: All Confidences Are Low
Problem: Most entities have confidence <0.5.
Solutions:
Upgrade model: Use GPT-4 instead of Gemini for better quality
Improve preparation: Better OCR, cleaner text extraction
Use vision models: For scanned documents or images
Check content quality: Low-quality scans produce low confidence
```typescript
// Solution: Use better model
const spec = await graphlit.createSpecification({
  name: "High-Quality Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4, // Better quality
    temperature: 0.1
  }
});
```
Issue: Same Entity, Varying Confidence
Problem: Multiple occurrences of same entity have wildly different confidence.
Explanation: This is normal and indicates:
Some mentions are explicit ("Kirk Marple, CEO of Graphlit")
Some mentions are implicit ("Kirk said...")
Context quality varies throughout document
Solution: Use average or maximum confidence:
```typescript
const confidences = (observation.occurrences ?? []).map(occ => occ.confidence);
if (confidences.length > 0) {
  // Use max confidence across occurrences
  const maxConf = Math.max(...confidences);
  // Or use average
  const avgConf = confidences.reduce((sum, c) => sum + c, 0) / confidences.length;
}
```
Issue: Bounding Boxes Missing
Problem: Occurrences don't have bounding box coordinates.
Causes:
Content type doesn't support spatial info (emails, text)
Used text extraction instead of vision extraction
Document doesn't have layout information
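Before changing the workflow, it can help to measure how many occurrences actually carry location fields. The sketch below is illustrative only: `OccurrenceLocation` and `locationCoverage` are hypothetical names, and the interface is a local stand-in covering just the fields discussed in this guide, not the SDK's type.

```typescript
// Hypothetical local shape: only the location fields used in this guide.
interface OccurrenceLocation {
  pageIndex?: number | null;
  boundingBox?: object | null;
  startTime?: number | null;
}

// Counts how many occurrences carry each kind of location context.
function locationCoverage(occurrences: OccurrenceLocation[]) {
  return {
    total: occurrences.length,
    withPage: occurrences.filter(o => o.pageIndex != null).length,
    withBoundingBox: occurrences.filter(o => o.boundingBox != null).length,
    withTime: occurrences.filter(o => o.startTime != null).length,
  };
}

const coverage = locationCoverage([
  { pageIndex: 0, boundingBox: { left: 100.5, top: 250.3, width: 150.2, height: 20.5 } },
  { pageIndex: 4, boundingBox: null },
  { startTime: 125.3 },
]);
console.log(coverage);
// { total: 3, withPage: 2, withBoundingBox: 1, withTime: 1 }
```

If `withBoundingBox` is zero for every document, the cause is more likely the extraction setup than the individual files.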
Solution: Use vision model for PDFs:
```typescript
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument, // Vision model
      extractedTypes: [/* ... */]
    }
  }]
}
```
Issue: No Page Numbers in Occurrences
Problem: Page index is undefined for document content.
Causes:
Content isn't page-based (email, message, web page)
PDF preparation didn't preserve page structure
Used wrong preparation type
Solution: Ensure proper document preparation:
```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument // Preserves pages
    }
  }]
}
```
Developer Hints
Confidence Interpretation by Model
GPT-4: Generally conservative, confidence >0.7 is very reliable
GPT-4o: Well-calibrated, confidence >0.75 recommended
Claude 3.5: Slightly optimistic, confidence >0.8 for high precision
Gemini: More variable, confidence >0.7 minimum recommended
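One way to apply these hints is a small lookup keyed by model family. This is a sketch under stated assumptions: the keys are plain illustrative labels rather than SDK enum values, `thresholdFor` is a hypothetical helper, and the fallback of 0.7 is the recommended default mentioned earlier in this guide.

```typescript
// Thresholds copied from the hints above.
// Keys are illustrative labels, not SDK enum values.
const recommendedThreshold: Record<string, number> = {
  'GPT-4': 0.7,
  'GPT-4o': 0.75,
  'Claude 3.5': 0.8,
  'Gemini': 0.7,
};

// Hypothetical helper: falls back to 0.7 for models not listed above.
function thresholdFor(model: string, fallback = 0.7): number {
  const value = recommendedThreshold[model];
  return value !== undefined ? value : fallback;
}

console.log(thresholdFor('GPT-4o'));     // 0.75
console.log(thresholdFor('Claude 3.5')); // 0.8
```

Treat these numbers as starting points and adjust them against a manually reviewed sample of your own content.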
When to Use Occurrence Data
Page index: PDF highlighting, citation validation, page-specific queries
Bounding boxes: Visual annotation, entity location UI, layout analysis
Timestamps: Video playback navigation, meeting segment analysis
Confidence: Quality filtering, precision/recall tuning, validation
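As an illustration of using these fields together, the sketch below builds a short human-readable location string for an occurrence. `LocatedOccurrence`, `formatTimestamp`, and `describeLocation` are hypothetical names, and the code assumes `startTime` is a number of seconds, as in the examples in this guide.

```typescript
// Hypothetical local shape mirroring the fields shown in this guide.
interface LocatedOccurrence {
  pageIndex?: number | null;
  boundingBox?: { left: number; top: number; width: number; height: number } | null;
  startTime?: number | null; // Assumed to be seconds from start
}

// Formats seconds as m:ss for playback navigation.
function formatTimestamp(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const minutes = Math.floor(total / 60);
  const secs = total % 60;
  return `${minutes}:${secs < 10 ? '0' : ''}${secs}`;
}

// Describes whatever location context the occurrence carries.
function describeLocation(occ: LocatedOccurrence): string {
  const parts: string[] = [];
  if (occ.pageIndex != null) parts.push(`page ${occ.pageIndex + 1}`); // 1-based for display
  if (occ.boundingBox != null) {
    parts.push(`box at (${occ.boundingBox.left}, ${occ.boundingBox.top})`);
  }
  if (occ.startTime != null) parts.push(`at ${formatTimestamp(occ.startTime)}`);
  return parts.length > 0 ? parts.join(', ') : 'no location context';
}

console.log(describeLocation({ pageIndex: 4 }));     // page 5
console.log(describeLocation({ startTime: 125.3 })); // at 2:05
console.log(describeLocation({}));                   // no location context
```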
Performance Considerations
Occurrence data increases response size
Filter occurrences client-side for specific pages/times
Don't fetch occurrence details if not needed
Cache occurrence analysis results
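Caching analysis results can be as simple as an in-memory map. The sketch below is one possible approach, not an SDK feature: `summaryCache`, `summarize`, and `cachedSummary` are hypothetical names, and the cache key format is an arbitrary choice for the example.

```typescript
interface ConfidenceSummary {
  count: number;
  average: number | null; // null when there are no occurrences
  max: number | null;
}

// In-memory cache keyed by a caller-chosen string (for example, a content ID).
const summaryCache = new Map<string, ConfidenceSummary>();

function summarize(confidences: number[]): ConfidenceSummary {
  if (confidences.length === 0) return { count: 0, average: null, max: null };
  const total = confidences.reduce((sum, c) => sum + c, 0);
  return {
    count: confidences.length,
    average: total / confidences.length,
    max: Math.max(...confidences),
  };
}

// Computes the summary once per key; later calls return the cached object.
function cachedSummary(key: string, loadConfidences: () => number[]): ConfidenceSummary {
  const hit = summaryCache.get(key);
  if (hit) return hit;
  const summary = summarize(loadConfidences());
  summaryCache.set(key, summary);
  return summary;
}

let loads = 0;
const load = () => { loads++; return [0.95, 0.88, 0.92]; };
const first = cachedSummary('content-id-here:obs-123', load);
const second = cachedSummary('content-id-here:obs-123', load);
console.log(first === second, loads); // true 1
```

Remember to delete or refresh the cached entry when the content is re-ingested, since the occurrences may change.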
Validation Strategies
Manual review: Check high-confidence entities first
Cross-reference: Verify entities across multiple content items
Confidence distribution: Expect most >0.7 for good extraction
Frequency analysis: Common entities should appear multiple times
Context checking: Use page/time context for validation
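The confidence-distribution check can be automated with a small function that reports what share of occurrences meet a threshold. This is a sketch: `ScoredObservation` is a hypothetical local type, `shareAtOrAbove` is a hypothetical helper, and the comparison is inclusive (`>=`), matching the filters used earlier in this guide.

```typescript
// Hypothetical local shape: only the confidence field is needed here.
interface ScoredObservation {
  occurrences?: Array<{ confidence: number }> | null;
}

// Returns the fraction of occurrences at or above the threshold,
// or null when there are no occurrences to evaluate.
function shareAtOrAbove(observations: ScoredObservation[], threshold = 0.7): number | null {
  let total = 0;
  let passing = 0;
  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      total++;
      if (occ.confidence >= threshold) passing++;
    }
  }
  return total > 0 ? passing / total : null;
}

const share = shareAtOrAbove([
  { occurrences: [{ confidence: 0.95 }, { confidence: 0.88 }] },
  { occurrences: [{ confidence: 0.6 }, { confidence: 0.4 }] },
  { occurrences: null },
]);
console.log(share); // 0.5
```

A share well below one half suggests revisiting the model or preparation settings before tuning thresholds.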