Understanding Confidence Scores and Occurrences

Use Case: Understanding Confidence Scores and Occurrences

User Intent

"What are confidence scores and occurrences in entity extraction? How do I use them to validate and filter entities?"

Operation

  • SDK Method: Access via content.observations[].occurrences[]

  • GraphQL Query: getContent with observations

  • Entity: Observation occurrence data

Prerequisites

  • Content with extracted entities (workflow with extraction stage)

  • Understanding of observation model

  • Graphlit project with API credentials


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Get content with entity observations
const contentResponse = await graphlit.getContent('content-id-here');
const content = contentResponse.content;

console.log(`\nAnalyzing entities in: ${content.name}\n`);

// Iterate through all observations
content.observations?.forEach(observation => {
  console.log(`\n${observation.type}: ${observation.observable.name}`);
  console.log(`Entity ID: ${observation.observable.id}`);
  console.log(`Total occurrences: ${observation.occurrences?.length || 0}\n`);
  
  // Analyze each occurrence
  observation.occurrences?.forEach((occurrence, index) => {
    console.log(`  Occurrence #${index + 1}:`);
    // Confidence may be absent on an occurrence, so guard before formatting
    console.log(`    Confidence: ${occurrence.confidence?.toFixed(3) ?? 'n/a'}`);
    
    // Location context (varies by content type)
    if (occurrence.pageIndex != null) { // covers both null and undefined
      console.log(`    Page: ${occurrence.pageIndex}`);
    }
    
    if (occurrence.boundingBox) {
      console.log(`    Location: (${occurrence.boundingBox.left}, ${occurrence.boundingBox.top})`);
      console.log(`    Size: ${occurrence.boundingBox.width} x ${occurrence.boundingBox.height}`);
    }
    
    if (occurrence.startTime != null) { // covers both null and undefined
      console.log(`    Time: ${occurrence.startTime}s - ${occurrence.endTime ?? '?'}s`);
    }
    
    console.log();
  });
  
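  // Calculate average confidence (skip entities without occurrences to avoid dividing by zero)
  const occurrences = observation.occurrences ?? [];
  if (occurrences.length > 0) {
    const avgConfidence = occurrences.reduce(
      (sum, occ) => sum + (occ?.confidence ?? 0), 0
    ) / occurrences.length;

    console.log(`  Average confidence: ${avgConfidence.toFixed(3)}`);
  }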
});

// Filter high-confidence entities
const highConfidenceEntities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.8)
);

console.log(`\nHigh-confidence entities (>=0.8): ${highConfidenceEntities?.length || 0}`);

Python

# Key differences: snake_case methods
content_response = await graphlit.client.get_content(id="content-id-here")
content = content_response.content

print(f"\nAnalyzing entities in: {content.name}\n")

# Iterate through observations
for observation in content.observations or []:
    print(f"\n{observation.type}: {observation.observable.name}")
    print(f"Entity ID: {observation.observable.id}")
    print(f"Total occurrences: {len(observation.occurrences or [])}\n")

# Filter high-confidence (a missing confidence is treated as 0)
high_conf = [
    obs for obs in content.observations or []
    if any((occ.confidence or 0) >= 0.8 for occ in obs.occurrences or [])
]

print(f"\nHigh-confidence entities (>=0.8): {len(high_conf)}")


Step-by-Step Explanation

Step 1: Understanding Confidence Scores

What is Confidence?

  • Score from 0.0 (uncertain) to 1.0 (very certain)

  • Provided by LLM during entity extraction

  • Indicates extraction quality/reliability

  • Per-occurrence (same entity can have different confidences)

Confidence Ranges:

  • 0.9 - 1.0: Very high confidence (explicit mentions, clear context)

  • 0.7 - 0.9: High confidence (standard mentions, good context)

  • 0.5 - 0.7: Medium confidence (implicit mentions, unclear context)

  • 0.3 - 0.5: Low confidence (ambiguous mentions, weak context)

  • 0.0 - 0.3: Very low confidence (likely false positives)
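The ranges above can be turned into a small lookup helper. This is a minimal sketch: the `confidenceBand` name is hypothetical, and treating each lower bound as inclusive is an assumption, since the listed ranges share their boundary values.

```typescript
// Hypothetical helper: maps a confidence score to the band names listed above.
// Assumption: each lower bound is inclusive (0.9 counts as "very high").
type ConfidenceBand = 'very high' | 'high' | 'medium' | 'low' | 'very low';

function confidenceBand(score: number): ConfidenceBand {
  if (score >= 0.9) return 'very high';
  if (score >= 0.7) return 'high';
  if (score >= 0.5) return 'medium';
  if (score >= 0.3) return 'low';
  return 'very low';
}

console.log(confidenceBand(0.95)); // very high
console.log(confidenceBand(0.42)); // low
```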

Step 2: Occurrence Context by Content Type

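  • For Documents (PDF, Word, etc.): occurrences can carry a page index and, when layout information is available, a bounding box.

  • For Audio/Video (Transcripts): occurrences can carry a start time and end time.

  • For Images: occurrences can carry a bounding box when a vision model produced them.

  • For Text/Messages (Emails, Slack): no page or spatial context is available, so rely on confidence alone.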
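Because the available context differs by content type, it can help to describe an occurrence from whichever fields happen to be present. The sketch below assumes a simplified occurrence shape based on the fields used in the complete example (`pageIndex`, `boundingBox`, `startTime`, `endTime`); `describeLocation` is a hypothetical helper, not part of the SDK.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface BoundingBox { left: number; top: number; width: number; height: number; }
interface Occurrence {
  confidence?: number | null;
  pageIndex?: number | null;
  boundingBox?: BoundingBox | null;
  startTime?: number | null;
  endTime?: number | null;
}

// Hypothetical helper: builds a readable location from whichever fields exist.
function describeLocation(occ: Occurrence): string {
  const parts: string[] = [];
  if (occ.pageIndex != null) parts.push(`page ${occ.pageIndex}`);
  if (occ.boundingBox) {
    const b = occ.boundingBox;
    parts.push(`box (${b.left}, ${b.top}) ${b.width}x${b.height}`);
  }
  if (occ.startTime != null) {
    parts.push(`time ${occ.startTime}s-${occ.endTime ?? '?'}s`);
  }
  // Emails, chat messages and plain text usually end up here.
  return parts.length > 0 ? parts.join(', ') : 'no location context';
}

console.log(describeLocation({ pageIndex: 2, boundingBox: { left: 10, top: 20, width: 100, height: 12 } }));
console.log(describeLocation({ startTime: 12.5, endTime: 14 }));
console.log(describeLocation({ confidence: 0.9 }));
```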

Step 3: Multiple Occurrences

When the same entity is mentioned multiple times, each mention is recorded as a separate occurrence on the same observation.
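To illustrate, here is a sketch of one observation with three occurrences, summarized by mention count and distinct pages. The data is made up for illustration, the shapes are simplified assumptions, and `summarizeMentions` is a hypothetical helper.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; pageIndex?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

// Illustrative data: one entity mentioned three times across two pages.
const observation: Observation = {
  observable: { name: 'Kirk Marple' },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 },
    { confidence: 0.72, pageIndex: 0 },
    { confidence: 0.88, pageIndex: 3 },
  ],
};

// Hypothetical helper: counts mentions and lists the distinct pages they fall on.
function summarizeMentions(obs: Observation): { mentions: number; pages: number[] } {
  const occurrences = obs.occurrences ?? [];
  const pages = new Set<number>();
  for (const occ of occurrences) {
    if (occ.pageIndex != null) pages.add(occ.pageIndex);
  }
  return { mentions: occurrences.length, pages: Array.from(pages).sort((a, b) => a - b) };
}

const summary = summarizeMentions(observation);
console.log(`${observation.observable.name}: ${summary.mentions} mentions on pages ${summary.pages.join(', ')}`);
```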

Step 4: Filtering by Confidence

High Precision (Few False Positives):

High Recall (Few False Negatives):

Balanced:
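The three strategies differ only in where the cutoff sits. The sketch below uses illustrative thresholds (0.8, 0.7 and 0.5 are assumptions, not values prescribed by the API) and keeps an entity when at least one of its occurrences meets the cutoff, mirroring the filter in the complete example.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

// Illustrative thresholds only; tune them against your own data.
const THRESHOLDS = { highPrecision: 0.8, balanced: 0.7, highRecall: 0.5 } as const;

// Keeps observations with at least one occurrence at or above minConfidence.
function filterByConfidence(observations: Observation[], minConfidence: number): Observation[] {
  return observations.filter(obs =>
    (obs.occurrences ?? []).some(occ => (occ.confidence ?? 0) >= minConfidence)
  );
}

// Made-up sample data.
const sample: Observation[] = [
  { observable: { name: 'Graphlit' }, occurrences: [{ confidence: 0.93 }] },
  { observable: { name: 'Seattle' }, occurrences: [{ confidence: 0.74 }] },
  { observable: { name: 'Q3 report' }, occurrences: [{ confidence: 0.55 }] },
];

console.log(filterByConfidence(sample, THRESHOLDS.highPrecision).length); // 1
console.log(filterByConfidence(sample, THRESHOLDS.balanced).length);      // 2
console.log(filterByConfidence(sample, THRESHOLDS.highRecall).length);    // 3
```

A higher cutoff drops more true entities along with the false positives; a lower one keeps more of both.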


Configuration Options

Setting Confidence Thresholds

By Use Case:

Legal/Compliance (High Precision):

Research/Discovery (High Recall):

Production (Balanced):
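One way to keep per-use-case settings in a single place is a preset table. Everything in this sketch is an assumption made for illustration: the preset names, the numeric cutoffs, and the `requireAll` flag (which demands that every occurrence pass, a stricter reading of "high precision").

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

type UseCase = 'legal' | 'research' | 'production';

// Hypothetical presets: the numbers and the requireAll flag are illustrative.
const PRESETS: Record<UseCase, { minConfidence: number; requireAll: boolean }> = {
  legal: { minConfidence: 0.8, requireAll: true },       // high precision
  research: { minConfidence: 0.5, requireAll: false },   // high recall
  production: { minConfidence: 0.7, requireAll: false }, // balanced
};

function selectEntities(observations: Observation[], useCase: UseCase): Observation[] {
  const { minConfidence, requireAll } = PRESETS[useCase];
  return observations.filter(obs => {
    const scores = (obs.occurrences ?? []).map(occ => occ.confidence ?? 0);
    if (scores.length === 0) return false; // nothing to judge the entity by
    return requireAll
      ? scores.every(s => s >= minConfidence)
      : scores.some(s => s >= minConfidence);
  });
}

// Made-up sample data.
const extracted: Observation[] = [
  { observable: { name: 'Acme Corp' }, occurrences: [{ confidence: 0.9 }, { confidence: 0.6 }] },
  { observable: { name: 'Jane Doe' }, occurrences: [{ confidence: 0.85 }] },
];

console.log(selectEntities(extracted, 'legal').map(o => o.observable.name));
console.log(selectEntities(extracted, 'research').map(o => o.observable.name));
```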

Analyzing Confidence Distribution
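Before settling on a threshold, it can help to look at how the scores are actually distributed. The following is a minimal sketch, assuming the simplified shapes below; `confidenceStats` is a hypothetical helper and the default cutoff of 0.7 is only an example.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; }
interface Observation { occurrences?: Occurrence[] | null; }

interface ConfidenceStats {
  count: number; min: number; max: number; mean: number; median: number;
  shareAtOrAbove: number; // fraction of occurrences at or above the threshold
}

// Hypothetical helper: summary statistics over every occurrence's confidence.
function confidenceStats(observations: Observation[], threshold = 0.7): ConfidenceStats | null {
  const scores: number[] = [];
  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (occ.confidence != null) scores.push(occ.confidence);
    }
  }
  if (scores.length === 0) return null; // nothing to summarize

  scores.sort((a, b) => a - b);
  const mid = Math.floor(scores.length / 2);
  const median = scores.length % 2 === 1 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;
  const sum = scores.reduce((a, b) => a + b, 0);

  return {
    count: scores.length,
    min: scores[0],
    max: scores[scores.length - 1],
    mean: sum / scores.length,
    median,
    shareAtOrAbove: scores.filter(s => s >= threshold).length / scores.length,
  };
}

// Made-up sample data.
const stats = confidenceStats([
  { occurrences: [{ confidence: 0.9 }, { confidence: 0.5 }] },
  { occurrences: [{ confidence: 0.7 }, { confidence: 0.3 }] },
]);
console.log(stats);
```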


Variations

Variation 1: Find Entities by Page Number

Locate entities on specific document pages:
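A page lookup can be done client-side by filtering occurrences on `pageIndex`. This sketch assumes the simplified shapes below and compares `pageIndex` as-is; whether the index is zero- or one-based is not stated here, so check it against your own content. `entitiesOnPage` is a hypothetical helper.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; pageIndex?: number | null; }
interface Observation { type: string; observable: { name: string }; occurrences?: Occurrence[] | null; }

// Returns the observations seen on the page, each trimmed to that page's occurrences.
function entitiesOnPage(observations: Observation[], pageIndex: number): Observation[] {
  return observations
    .map(obs => ({
      ...obs,
      occurrences: (obs.occurrences ?? []).filter(occ => occ.pageIndex === pageIndex),
    }))
    .filter(obs => obs.occurrences.length > 0);
}

// Made-up sample data; the type values are illustrative.
const docObservations: Observation[] = [
  { type: 'PERSON', observable: { name: 'Kirk Marple' }, occurrences: [{ pageIndex: 0 }, { pageIndex: 4 }] },
  { type: 'ORGANIZATION', observable: { name: 'Graphlit' }, occurrences: [{ pageIndex: 4 }] },
];

for (const obs of entitiesOnPage(docObservations, 4)) {
  console.log(`${obs.type}: ${obs.observable.name}`);
}
```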

Variation 2: Find Entities in Time Range (Audio/Video)

Extract entities mentioned during specific timeframe:
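For transcripts, the same idea applies to timestamps. The sketch below assumes `startTime` and `endTime` are numeric seconds, as the complete example prints them; if your data uses another representation, convert it first. `entitiesInTimeRange` is a hypothetical helper.

```typescript
// Simplified shapes; times are assumed to be numeric seconds.
interface Occurrence { confidence?: number | null; startTime?: number | null; endTime?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

// Returns names of entities with at least one occurrence overlapping [fromSec, toSec].
function entitiesInTimeRange(observations: Observation[], fromSec: number, toSec: number): string[] {
  return observations
    .filter(obs => (obs.occurrences ?? []).some(occ => {
      if (occ.startTime == null) return false;
      const end = occ.endTime ?? occ.startTime; // a missing end is treated as a point in time
      return occ.startTime <= toSec && end >= fromSec;
    }))
    .map(obs => obs.observable.name);
}

// Made-up sample data.
const transcriptObservations: Observation[] = [
  { observable: { name: 'Roadmap' }, occurrences: [{ startTime: 10, endTime: 15 }] },
  { observable: { name: 'Budget' }, occurrences: [{ startTime: 65, endTime: 70 }] },
  { observable: { name: 'Untimed' }, occurrences: [{ confidence: 0.9 }] },
];

console.log(entitiesInTimeRange(transcriptObservations, 60, 120));
```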

Variation 3: Visual Entity Locator (PDFs with Bounding Boxes)

Find entity positions for highlighting:
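A highlighting UI typically needs a flat list of rectangles per page. This is a sketch under assumptions: the shapes below are simplified, `highlightsFor` and the `Highlight` type are hypothetical, and the bounding box values are passed through unchanged because their units are not specified here.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface BoundingBox { left: number; top: number; width: number; height: number; }
interface Occurrence { confidence?: number | null; pageIndex?: number | null; boundingBox?: BoundingBox | null; }
interface Observation { type: string; observable: { name: string }; occurrences?: Occurrence[] | null; }

// Hypothetical output type for a highlighting UI.
interface Highlight { pageIndex: number; label: string; box: BoundingBox; }

// Collects every occurrence of the named entity that has both a page and a box.
function highlightsFor(observations: Observation[], entityName: string): Highlight[] {
  const highlights: Highlight[] = [];
  for (const obs of observations) {
    if (obs.observable.name !== entityName) continue;
    for (const occ of obs.occurrences ?? []) {
      if (occ.boundingBox && occ.pageIndex != null) {
        highlights.push({
          pageIndex: occ.pageIndex,
          label: `${obs.type}: ${entityName}`,
          box: occ.boundingBox,
        });
      }
    }
  }
  return highlights;
}

// Made-up sample data; the second occurrence has no box and is skipped.
const pdfObservations: Observation[] = [
  {
    type: 'PERSON',
    observable: { name: 'Kirk Marple' },
    occurrences: [
      { pageIndex: 1, boundingBox: { left: 72, top: 140, width: 88, height: 14 } },
      { pageIndex: 2 },
    ],
  },
];

console.log(highlightsFor(pdfObservations, 'Kirk Marple'));
```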

Variation 4: Confidence-Weighted Entity Ranking

Rank entities by total confidence across all occurrences:
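Summing confidence across occurrences rewards entities that are mentioned both often and clearly. The sketch below assumes the simplified shapes shown; `rankByTotalConfidence` is a hypothetical helper, and a missing confidence is counted as 0.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

interface RankedEntity { name: string; totalConfidence: number; mentions: number; }

// Sorts entities by the sum of their occurrence confidences, highest first.
function rankByTotalConfidence(observations: Observation[]): RankedEntity[] {
  return observations
    .map(obs => {
      const occurrences = obs.occurrences ?? [];
      const totalConfidence = occurrences.reduce((sum, occ) => sum + (occ.confidence ?? 0), 0);
      return { name: obs.observable.name, totalConfidence, mentions: occurrences.length };
    })
    .sort((a, b) => b.totalConfidence - a.totalConfidence);
}

// Made-up sample data.
const ranked = rankByTotalConfidence([
  { observable: { name: 'Graphlit' }, occurrences: [{ confidence: 0.9 }, { confidence: 0.8 }] },
  { observable: { name: 'Seattle' }, occurrences: [{ confidence: 0.95 }] },
  { observable: { name: 'API' }, occurrences: [{ confidence: 0.5 }, { confidence: 0.5 }, { confidence: 0.5 }] },
]);

for (const entity of ranked) {
  console.log(`${entity.name}: ${entity.totalConfidence.toFixed(2)} over ${entity.mentions} mentions`);
}
```

Note that a sum favors frequent entities: three weak mentions outrank one strong mention here, which may or may not be what you want.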

Variation 5: Occurrence Clustering (Find Dense Entity Regions)

Find pages/sections with high entity density:
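Density can be approximated by counting occurrences per page. This is a minimal sketch that only handles page-based content; `pageDensity` is a hypothetical helper, and occurrences without a `pageIndex` are ignored.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; pageIndex?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

interface PageDensity { pageIndex: number; count: number; }

// Counts occurrences per page and returns pages ordered from densest to sparsest.
function pageDensity(observations: Observation[]): PageDensity[] {
  const counts = new Map<number, number>();
  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (occ.pageIndex == null) continue;
      counts.set(occ.pageIndex, (counts.get(occ.pageIndex) ?? 0) + 1);
    }
  }
  const result: PageDensity[] = [];
  counts.forEach((count, pageIndex) => result.push({ pageIndex, count }));
  // Densest first; ties are broken by page order.
  return result.sort((a, b) => b.count - a.count || a.pageIndex - b.pageIndex);
}

// Made-up sample data.
const density = pageDensity([
  { observable: { name: 'Graphlit' }, occurrences: [{ pageIndex: 0 }, { pageIndex: 0 }, { pageIndex: 2 }] },
  { observable: { name: 'Kirk Marple' }, occurrences: [{ pageIndex: 0 }, { pageIndex: 1 }] },
]);

console.log(density);
```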


Common Issues & Solutions

Issue: All Confidences Are Low

Problem: Most entities have confidence <0.5.

Solutions:

  1. Upgrade model: Use GPT-4 instead of Gemini for better quality

  2. Improve preparation: Better OCR, cleaner text extraction

  3. Use vision models: For scanned documents or images

  4. Check content quality: Low-quality scans produce low confidence

Issue: Same Entity, Varying Confidence

Problem: Multiple occurrences of the same entity have widely different confidence scores.

Explanation: This is normal and indicates:

  • Some mentions are explicit ("Kirk Marple, CEO of Graphlit")

  • Some mentions are implicit ("Kirk said...")

  • Context quality varies throughout document

Solution: Use average or maximum confidence:
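Either aggregate collapses the per-occurrence scores into one number per entity. The sketch below assumes the simplified shapes shown; `aggregateConfidence` is a hypothetical helper that returns `null` when there is nothing to aggregate.

```typescript
// Simplified shapes, assumed from the fields used in the complete example.
interface Occurrence { confidence?: number | null; }
interface Observation { observable: { name: string }; occurrences?: Occurrence[] | null; }

// Collapses an entity's occurrence confidences into a single score.
function aggregateConfidence(obs: Observation, mode: 'average' | 'max'): number | null {
  const scores: number[] = [];
  for (const occ of obs.occurrences ?? []) {
    if (occ.confidence != null) scores.push(occ.confidence);
  }
  if (scores.length === 0) return null; // no scored occurrences

  if (mode === 'max') return Math.max(...scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Made-up sample: one explicit mention and two weaker ones.
const person: Observation = {
  observable: { name: 'Kirk Marple' },
  occurrences: [{ confidence: 0.95 }, { confidence: 0.55 }, { confidence: 0.6 }],
};

console.log(aggregateConfidence(person, 'max'));
console.log(aggregateConfidence(person, 'average')?.toFixed(2));
```

Maximum answers "was this entity ever clearly identified?", while average reflects how consistently it was identified.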

Issue: Bounding Boxes Missing

Problem: Occurrences don't have bounding box coordinates.

Causes:

  1. Content type doesn't support spatial info (emails, text)

  2. Used text extraction instead of vision extraction

  3. Document doesn't have layout information

Solution: Use a vision model when extracting from PDFs, so that layout information is available.

Issue: No Page Numbers in Occurrences

Problem: Page index is undefined for document content.

Causes:

  1. Content isn't page-based (email, message, web page)

  2. PDF preparation didn't preserve page structure

  3. Used wrong preparation type

Solution: Ensure the document is prepared in a way that preserves its page structure.


Developer Hints

Confidence Interpretation by Model

  • GPT-4: Generally conservative, confidence >0.7 is very reliable

  • GPT-4o: Well-calibrated, confidence >0.75 recommended

  • Claude 3.5: Slightly optimistic, confidence >0.8 for high precision

  • Gemini: More variable, confidence >0.7 minimum recommended

When to Use Occurrence Data

  • Page index: PDF highlighting, citation validation, page-specific queries

  • Bounding boxes: Visual annotation, entity location UI, layout analysis

  • Timestamps: Video playback navigation, meeting segment analysis

  • Confidence: Quality filtering, precision/recall tuning, validation

Performance Considerations

  • Occurrence data increases response size

  • Filter occurrences client-side for specific pages/times

  • Don't fetch occurrence details if not needed

  • Cache occurrence analysis results

Validation Strategies

  1. Manual review: Check high-confidence entities first

  2. Cross-reference: Verify entities across multiple content items

  3. Confidence distribution: Expect most >0.7 for good extraction

  4. Frequency analysis: Common entities should appear multiple times

  5. Context checking: Use page/time context for validation

