Understanding Confidence Scores and Occurrences
Use Case: Understanding Confidence Scores and Occurrences
User Intent
"What are confidence scores and occurrences in entity extraction? How do I use them to validate and filter entities?"
Operation
SDK Method: Access via content.observations[].occurrences[]
GraphQL Query: getContent with observations
Entity: Observation occurrence data
Prerequisites
Content with extracted entities (workflow with extraction stage)
Understanding of observation model
Graphlit project with API credentials
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Get content with entity observations
const contentResponse = await graphlit.getContent('content-id-here');
const content = contentResponse.content;

if (!content) {
  throw new Error('Content not found');
}

console.log(`\nAnalyzing entities in: ${content.name}\n`);

// Iterate through all observations
content.observations?.forEach(observation => {
  const occurrences = observation.occurrences ?? [];

  console.log(`\n${observation.type}: ${observation.observable.name}`);
  console.log(`Entity ID: ${observation.observable.id}`);
  console.log(`Total occurrences: ${occurrences.length}\n`);

  // Analyze each occurrence
  occurrences.forEach((occurrence, index) => {
    console.log(`  Occurrence #${index + 1}:`);
    console.log(`    Confidence: ${occurrence.confidence?.toFixed(3) ?? 'n/a'}`);

    // Location context (varies by content type); `!= null` covers both null and undefined
    if (occurrence.pageIndex != null) {
      console.log(`    Page: ${occurrence.pageIndex}`);
    }
    if (occurrence.boundingBox) {
      console.log(`    Location: (${occurrence.boundingBox.left}, ${occurrence.boundingBox.top})`);
      console.log(`    Size: ${occurrence.boundingBox.width} x ${occurrence.boundingBox.height}`);
    }
    if (occurrence.startTime != null) {
      console.log(`    Time: ${occurrence.startTime}s - ${occurrence.endTime}s`);
    }
    console.log();
  });

  // Calculate average confidence (skipped when there are no occurrences, to avoid NaN)
  if (occurrences.length > 0) {
    const avgConfidence = occurrences.reduce(
      (sum, occ) => sum + (occ.confidence ?? 0), 0
    ) / occurrences.length;
    console.log(`  Average confidence: ${avgConfidence.toFixed(3)}`);
  }
});

// Filter high-confidence entities: keep those with at least one occurrence >= 0.8
const highConfidenceEntities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => (occ.confidence ?? 0) >= 0.8)
);
console.log(`\nHigh-confidence entities (>=0.8): ${highConfidenceEntities?.length || 0}`);
Key differences: snake_case methods
from graphlit import Graphlit

graphlit = Graphlit()

# Get content with entity observations (run inside an async function)
content_response = await graphlit.client.get_content(id="content-id-here")
content = content_response.content

print(f"\nAnalyzing entities in: {content.name}\n")

# Iterate through observations
for observation in content.observations or []:
    print(f"\n{observation.type}: {observation.observable.name}")
    print(f"Entity ID: {observation.observable.id}")
    print(f"Total occurrences: {len(observation.occurrences or [])}\n")

# Filter high-confidence
high_conf = [
    obs for obs in content.observations or []
    if any((occ.confidence or 0) >= 0.8 for occ in obs.occurrences or [])
]

print(f"\nHigh-confidence entities (>=0.8): {len(high_conf)}")
Step-by-Step Explanation
Step 1: Understanding Confidence Scores
What is Confidence?
Score from 0.0 (uncertain) to 1.0 (very certain)
Provided by LLM during entity extraction
Indicates extraction quality/reliability
Per-occurrence (same entity can have different confidences)
Confidence Ranges:
0.9 - 1.0: Very high confidence (explicit mentions, clear context)
0.7 - 0.9: High confidence (standard mentions, good context)
0.5 - 0.7: Medium confidence (implicit mentions, unclear context)
0.3 - 0.5: Low confidence (ambiguous mentions, weak context)
0.0 - 0.3: Very low confidence (likely false positives)
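The ranges above can be turned into a small lookup, which is handy for logging or UI labels. This is a minimal sketch: classifyConfidence and the band names are hypothetical helpers, not part of the Graphlit SDK, and a score that sits exactly on a boundary (for example 0.9) is assigned to the higher band here, which is an assumption since the listed ranges overlap at their edges.

```typescript
// Band labels follow the confidence ranges listed above.
type ConfidenceBand = 'very high' | 'high' | 'medium' | 'low' | 'very low';

// Hypothetical helper: maps a 0.0 - 1.0 confidence score to a band.
// Boundary values go to the higher band (assumption).
function classifyConfidence(confidence: number): ConfidenceBand {
  if (confidence >= 0.9) return 'very high';
  if (confidence >= 0.7) return 'high';
  if (confidence >= 0.5) return 'medium';
  if (confidence >= 0.3) return 'low';
  return 'very low';
}

for (const score of [0.95, 0.72, 0.55, 0.31, 0.1]) {
  console.log(`${score.toFixed(2)} -> ${classifyConfidence(score)}`);
}
```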
Step 2: Occurrence Context by Content Type
For Documents (PDF, Word, etc.):
For Audio/Video (Transcripts):
For Images:
For Text/Messages (Emails, Slack):
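As a rough guide to the four content types above, the sketch below builds a location description from whichever fields an occurrence carries. It assumes the field names used in the complete example (pageIndex, boundingBox, startTime, endTime) and, as a simplification, that documents mainly carry page indexes (plus bounding boxes when layout is available), audio/video carries timestamps, images carry bounding boxes, and text or messages carry only a confidence score. The Occurrence interface and describeLocation are illustrative, not SDK types.

```typescript
// Simplified occurrence shape for illustration; the real SDK type may differ.
interface BoundingBox { left: number; top: number; width: number; height: number; }
interface Occurrence {
  confidence?: number | null;
  pageIndex?: number | null;        // documents
  boundingBox?: BoundingBox | null; // images, documents with layout info
  startTime?: number | null;        // audio/video (assumed numeric seconds)
  endTime?: number | null;
}

// Hypothetical helper: describes whatever location context is present.
function describeLocation(occ: Occurrence): string {
  const parts: string[] = [];
  if (occ.pageIndex != null) {
    parts.push(`page ${occ.pageIndex}`);
  }
  if (occ.boundingBox) {
    const b = occ.boundingBox;
    parts.push(`box at (${b.left}, ${b.top}), ${b.width} x ${b.height}`);
  }
  if (occ.startTime != null) {
    parts.push(`time ${occ.startTime}s - ${occ.endTime ?? occ.startTime}s`);
  }
  // Text and messages typically have no location fields at all.
  return parts.length > 0 ? parts.join(', ') : 'no location context';
}

console.log(describeLocation({ confidence: 0.9, pageIndex: 2 }));
console.log(describeLocation({ confidence: 0.8, startTime: 12, endTime: 15 }));
console.log(describeLocation({ confidence: 0.7 }));
```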
Step 3: Multiple Occurrences
When the same entity is mentioned multiple times, each mention is recorded as a separate occurrence:
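To illustrate, here is a hand-written sample of one entity with three occurrences, each with its own confidence and page. The data, the 'PERSON' type string, and the simplified interfaces are made up for the example and only mirror the fields used elsewhere on this page.

```typescript
// Simplified shapes for illustration; the real SDK types carry more fields.
interface Occurrence { confidence: number; pageIndex?: number; }
interface Observation {
  type: string;
  observable: { id: string; name: string };
  occurrences: Occurrence[];
}

// Hypothetical sample: one entity mentioned three times in a document.
const observation: Observation = {
  type: 'PERSON',
  observable: { id: 'entity-1', name: 'Kirk Marple' },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 }, // explicit mention
    { confidence: 0.6, pageIndex: 3 },  // implicit mention
    { confidence: 0.85, pageIndex: 3 },
  ],
};

// Collect the distinct pages on which the entity was mentioned.
const pages = Array.from(
  new Set(
    observation.occurrences
      .map(occ => occ.pageIndex)
      .filter((p): p is number => p !== undefined)
  )
);

console.log(
  `${observation.observable.name}: ${observation.occurrences.length} occurrences on pages ${pages.join(', ')}`
);
```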
Step 4: Filtering by Confidence
High Precision (Few False Positives):
High Recall (Few False Negatives):
Balanced:
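One way to express these three strategies is sketched below. The thresholds (0.9, 0.5, 0.7) and the choice of "every occurrence", "any occurrence", and "mean confidence" are illustrative assumptions, not values prescribed by Graphlit; the function names and the simplified interfaces are hypothetical.

```typescript
// Simplified shapes for illustration only.
interface Occurrence { confidence: number; }
interface Observation { observable: { name: string }; occurrences: Occurrence[]; }

// High precision: every occurrence must clear a strict threshold.
function highPrecision(observations: Observation[], threshold = 0.9): Observation[] {
  return observations.filter(obs =>
    obs.occurrences.length > 0 &&
    obs.occurrences.every(occ => occ.confidence >= threshold)
  );
}

// High recall: a single occurrence above a lenient threshold is enough.
function highRecall(observations: Observation[], threshold = 0.5): Observation[] {
  return observations.filter(obs =>
    obs.occurrences.some(occ => occ.confidence >= threshold)
  );
}

// Balanced: the mean confidence must clear a middle threshold.
function balanced(observations: Observation[], threshold = 0.7): Observation[] {
  return observations.filter(obs => {
    if (obs.occurrences.length === 0) return false;
    const mean =
      obs.occurrences.reduce((sum, occ) => sum + occ.confidence, 0) /
      obs.occurrences.length;
    return mean >= threshold;
  });
}

// Hypothetical sample data.
const sample: Observation[] = [
  { observable: { name: 'Graphlit' }, occurrences: [{ confidence: 0.95 }, { confidence: 0.92 }] },
  { observable: { name: 'Kirk Marple' }, occurrences: [{ confidence: 0.9 }, { confidence: 0.6 }] },
  { observable: { name: 'Seattle' }, occurrences: [{ confidence: 0.55 }] },
];

console.log('High precision:', highPrecision(sample).map(o => o.observable.name));
console.log('Balanced:', balanced(sample).map(o => o.observable.name));
console.log('High recall:', highRecall(sample).map(o => o.observable.name));
```

Stricter filters return fewer entities; which trade-off is right depends on how costly a false positive is compared with a missed entity.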
Configuration Options
Setting Confidence Thresholds
By Use Case:
Legal/Compliance (High Precision):
Research/Discovery (High Recall):
Production (Balanced):
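The use cases above can be captured as named presets so the threshold lives in one place. The numeric values below are illustrative assumptions chosen to match the precision/recall intent of each use case; tune them against your own data. THRESHOLDS and filterOccurrences are hypothetical names.

```typescript
// Hypothetical presets; the numbers are starting points, not recommendations from the SDK.
type UseCase = 'legal' | 'research' | 'production';

const THRESHOLDS: Record<UseCase, number> = {
  legal: 0.9,      // high precision: few false positives
  research: 0.5,   // high recall: few false negatives
  production: 0.7, // balanced
};

interface Occurrence { confidence: number; }

// Keep only the occurrences that meet the preset for the given use case.
function filterOccurrences(occurrences: Occurrence[], useCase: UseCase): Occurrence[] {
  const threshold = THRESHOLDS[useCase];
  return occurrences.filter(occ => occ.confidence >= threshold);
}

const occurrences: Occurrence[] = [
  { confidence: 0.95 },
  { confidence: 0.75 },
  { confidence: 0.55 },
];

console.log('legal:', filterOccurrences(occurrences, 'legal').length);
console.log('production:', filterOccurrences(occurrences, 'production').length);
console.log('research:', filterOccurrences(occurrences, 'research').length);
```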
Analyzing Confidence Distribution
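Before settling on a threshold, it can help to see how occurrence confidences are distributed. The sketch below counts occurrences per band, reusing the ranges from Step 1; confidenceDistribution is a hypothetical helper, and boundary values are counted in the higher band.

```typescript
interface Occurrence { confidence: number; }

// Bands in descending order; an occurrence falls into the first band whose minimum it meets.
const BANDS = [
  { label: '0.9 - 1.0', min: 0.9 },
  { label: '0.7 - 0.9', min: 0.7 },
  { label: '0.5 - 0.7', min: 0.5 },
  { label: '0.3 - 0.5', min: 0.3 },
  { label: '0.0 - 0.3', min: 0.0 },
];

// Hypothetical helper: returns the number of occurrences in each band.
function confidenceDistribution(occurrences: Occurrence[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const band of BANDS) {
    counts[band.label] = 0;
  }
  for (const occ of occurrences) {
    const band = BANDS.find(b => occ.confidence >= b.min);
    if (band) {
      counts[band.label] += 1;
    }
  }
  return counts;
}

// Hypothetical sample values.
const distribution = confidenceDistribution([
  { confidence: 0.95 },
  { confidence: 0.91 },
  { confidence: 0.75 },
  { confidence: 0.2 },
]);

for (const band of BANDS) {
  console.log(`${band.label}: ${distribution[band.label]}`);
}
```

A distribution concentrated in the lower bands suggests fixing extraction quality first, rather than lowering the threshold.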
Variations
Variation 1: Find Entities by Page Number
Locate entities on specific document pages:
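A minimal sketch of a page filter, assuming occurrences expose the pageIndex field shown in the complete example. Whether pageIndex is zero-based is not stated here, so check a known document before relying on it. entitiesOnPage and the simplified interfaces are hypothetical.

```typescript
// Simplified shapes for illustration only.
interface Occurrence { confidence: number; pageIndex?: number | null; }
interface Observation {
  type: string;
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

// Hypothetical helper: entities with at least one occurrence on the given page.
function entitiesOnPage(observations: Observation[], pageIndex: number): Observation[] {
  return observations.filter(obs =>
    (obs.occurrences ?? []).some(occ => occ.pageIndex === pageIndex)
  );
}

// Hypothetical sample data.
const observations: Observation[] = [
  {
    type: 'PERSON',
    observable: { id: 'e1', name: 'Kirk Marple' },
    occurrences: [{ confidence: 0.9, pageIndex: 0 }, { confidence: 0.7, pageIndex: 4 }],
  },
  {
    type: 'ORGANIZATION',
    observable: { id: 'e2', name: 'Graphlit' },
    occurrences: [{ confidence: 0.95, pageIndex: 4 }],
  },
];

console.log(entitiesOnPage(observations, 4).map(obs => obs.observable.name));
```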
Variation 2: Find Entities in Time Range (Audio/Video)
Extract entities mentioned during specific timeframe:
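A minimal sketch of a time-window filter. It assumes startTime and endTime are numeric seconds, as the complete example prints them; if your SDK version returns duration strings instead, convert them first. entitiesInTimeRange and the simplified interfaces are hypothetical.

```typescript
// Simplified shapes for illustration; times are assumed to be numeric seconds.
interface Occurrence {
  confidence: number;
  startTime?: number | null;
  endTime?: number | null;
}
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

// Hypothetical helper: entities with an occurrence overlapping [from, to].
function entitiesInTimeRange(observations: Observation[], from: number, to: number): Observation[] {
  return observations.filter(obs =>
    (obs.occurrences ?? []).some(occ => {
      if (occ.startTime == null) return false;
      const end = occ.endTime ?? occ.startTime;
      // Two intervals overlap when each one starts before the other ends.
      return occ.startTime <= to && end >= from;
    })
  );
}

// Hypothetical sample data.
const observations: Observation[] = [
  { observable: { id: 'e1', name: 'Graphlit' }, occurrences: [{ confidence: 0.9, startTime: 65, endTime: 68 }] },
  { observable: { id: 'e2', name: 'Kirk Marple' }, occurrences: [{ confidence: 0.8, startTime: 300, endTime: 304 }] },
];

// Entities mentioned during the second minute (60s - 120s).
console.log(entitiesInTimeRange(observations, 60, 120).map(obs => obs.observable.name));
```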
Variation 3: Visual Entity Locator (PDFs with Bounding Boxes)
Find entity positions for highlighting:
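The sketch below flattens occurrences into highlight rectangles that a viewer could draw. It assumes the boundingBox fields from the complete example (left, top, width, height); the coordinate units (pixels, points, or normalized values) are not specified here and should be confirmed against a real response. highlightsFor and the Highlight shape are hypothetical.

```typescript
// Simplified shapes for illustration only.
interface BoundingBox { left: number; top: number; width: number; height: number; }
interface Occurrence {
  confidence: number;
  pageIndex?: number | null;
  boundingBox?: BoundingBox | null;
}
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}
interface Highlight {
  entity: string;
  pageIndex: number;
  left: number;
  top: number;
  right: number;
  bottom: number;
}

// Hypothetical helper: one rectangle per occurrence that has both a page and a box.
function highlightsFor(observations: Observation[]): Highlight[] {
  const highlights: Highlight[] = [];
  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (!occ.boundingBox || occ.pageIndex == null) continue;
      const { left, top, width, height } = occ.boundingBox;
      highlights.push({
        entity: obs.observable.name,
        pageIndex: occ.pageIndex,
        left,
        top,
        right: left + width,
        bottom: top + height,
      });
    }
  }
  return highlights;
}

// Hypothetical sample data: the second occurrence has no box and is skipped.
const highlights = highlightsFor([
  {
    observable: { id: 'e1', name: 'Graphlit' },
    occurrences: [
      { confidence: 0.9, pageIndex: 1, boundingBox: { left: 100, top: 200, width: 50, height: 20 } },
      { confidence: 0.8, pageIndex: 2 },
    ],
  },
]);

console.log(highlights);
```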
Variation 4: Confidence-Weighted Entity Ranking
Rank entities by total confidence across all occurrences:
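A minimal sketch of such a ranking. Summing confidence rewards entities that are mentioned both often and clearly, so a frequently mentioned entity can outrank one with a single very confident mention. rankByTotalConfidence and the simplified interfaces are hypothetical.

```typescript
// Simplified shapes for illustration only.
interface Occurrence { confidence?: number | null; }
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}
interface RankedEntity { name: string; count: number; total: number; }

// Hypothetical helper: sorts entities by the sum of their occurrence confidences.
function rankByTotalConfidence(observations: Observation[]): RankedEntity[] {
  return observations
    .map(obs => {
      const occurrences = obs.occurrences ?? [];
      return {
        name: obs.observable.name,
        count: occurrences.length,
        // Missing confidence values contribute nothing to the total.
        total: occurrences.reduce((sum, occ) => sum + (occ.confidence ?? 0), 0),
      };
    })
    .sort((a, b) => b.total - a.total);
}

// Hypothetical sample data.
const ranked = rankByTotalConfidence([
  { observable: { id: 'e1', name: 'Seattle' }, occurrences: [{ confidence: 0.95 }] },
  { observable: { id: 'e2', name: 'Graphlit' }, occurrences: [{ confidence: 0.9 }, { confidence: 0.8 }, { confidence: 0.7 }] },
]);

for (const entity of ranked) {
  console.log(`${entity.name}: total ${entity.total.toFixed(2)} over ${entity.count} occurrence(s)`);
}
```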
Variation 5: Occurrence Clustering (Find Dense Entity Regions)
Find pages/sections with high entity density:
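One simple reading of "density" is the number of entity occurrences per page, which is what the sketch below computes. It assumes page-based content with a pageIndex on each occurrence; occurrencesPerPage and the simplified interfaces are hypothetical, and a time-bucketed variant for audio or video would follow the same pattern.

```typescript
// Simplified shapes for illustration only.
interface Occurrence { confidence: number; pageIndex?: number | null; }
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}
interface PageDensity { pageIndex: number; count: number; }

// Hypothetical helper: counts occurrences per page, densest page first.
function occurrencesPerPage(observations: Observation[]): PageDensity[] {
  const counts = new Map<number, number>();
  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (occ.pageIndex == null) continue;
      counts.set(occ.pageIndex, (counts.get(occ.pageIndex) ?? 0) + 1);
    }
  }
  const result: PageDensity[] = [];
  counts.forEach((count, pageIndex) => result.push({ pageIndex, count }));
  return result.sort((a, b) => b.count - a.count);
}

// Hypothetical sample data.
const density = occurrencesPerPage([
  { observable: { id: 'e1', name: 'Graphlit' }, occurrences: [{ confidence: 0.9, pageIndex: 3 }, { confidence: 0.8, pageIndex: 3 }] },
  { observable: { id: 'e2', name: 'Kirk Marple' }, occurrences: [{ confidence: 0.9, pageIndex: 3 }, { confidence: 0.7, pageIndex: 0 }] },
]);

for (const page of density) {
  console.log(`Page ${page.pageIndex}: ${page.count} occurrence(s)`);
}
```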
Common Issues & Solutions
Issue: All Confidences Are Low
Problem: Most entities have confidence <0.5.
Solutions:
Upgrade model: Use GPT-4 instead of Gemini for better quality
Improve preparation: Better OCR, cleaner text extraction
Use vision models: For scanned documents or images
Check content quality: Low-quality scans produce low confidence
Issue: Same Entity, Varying Confidence
Problem: Multiple occurrences of the same entity have widely different confidence scores.
Explanation: This is normal and indicates:
Some mentions are explicit ("Kirk Marple, CEO of Graphlit")
Some mentions are implicit ("Kirk said...")
Context quality varies throughout document
Solution: Use average or maximum confidence:
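A minimal sketch of both aggregates. Maximum confidence answers "was this entity clearly mentioned at least once?", while the average reflects how consistent the mentions are. summarizeConfidence is a hypothetical helper; occurrences without a confidence value are ignored, and null is returned when nothing can be aggregated.

```typescript
interface Occurrence { confidence?: number | null; }

// Hypothetical helper: average and maximum confidence across an entity's occurrences.
function summarizeConfidence(
  occurrences: Occurrence[]
): { average: number; max: number } | null {
  const values = occurrences
    .map(occ => occ.confidence)
    .filter((c): c is number => c != null);
  if (values.length === 0) return null;
  const sum = values.reduce((total, c) => total + c, 0);
  return { average: sum / values.length, max: Math.max(...values) };
}

// Hypothetical sample: one explicit mention and one implicit mention.
const summary = summarizeConfidence([{ confidence: 1.0 }, { confidence: 0.5 }]);

if (summary) {
  console.log(`Average: ${summary.average.toFixed(2)}, Max: ${summary.max.toFixed(2)}`);
}
```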
Issue: Bounding Boxes Missing
Problem: Occurrences don't have bounding box coordinates.
Causes:
Content type doesn't support spatial info (emails, text)
Used text extraction instead of vision extraction
Document doesn't have layout information
Solution: Use vision model for PDFs:
Issue: No Page Numbers in Occurrences
Problem: Page index is undefined for document content.
Causes:
Content isn't page-based (email, message, web page)
PDF preparation didn't preserve page structure
Used wrong preparation type
Solution: Ensure proper document preparation:
Developer Hints
Confidence Interpretation by Model
GPT-4: Generally conservative, confidence >0.7 is very reliable
GPT-4o: Well-calibrated, confidence >0.75 recommended
Claude 3.5: Slightly optimistic, confidence >0.8 for high precision
Gemini: More variable, confidence >0.7 minimum recommended
When to Use Occurrence Data
Page index: PDF highlighting, citation validation, page-specific queries
Bounding boxes: Visual annotation, entity location UI, layout analysis
Timestamps: Video playback navigation, meeting segment analysis
Confidence: Quality filtering, precision/recall tuning, validation
Performance Considerations
Occurrence data increases response size
Filter occurrences client-side for specific pages/times
Don't fetch occurrence details if not needed
Cache occurrence analysis results
Validation Strategies
Manual review: Check high-confidence entities first
Cross-reference: Verify entities across multiple content items
Confidence distribution: Expect most >0.7 for good extraction
Frequency analysis: Common entities should appear multiple times
Context checking: Use page/time context for validation