Understanding Confidence Scores and Occurrences

User Intent

"What are confidence scores and occurrences in entity extraction? How do I use them to validate and filter entities?"

Operation

SDK Method: Access via content.observations[].occurrences[]
GraphQL Query: getContent with observations
Entity: Observation occurrence data

Prerequisites

Content with extracted entities (workflow with extraction stage)
Understanding of observation model
Graphlit project with API credentials

Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Get content with entity observations
const contentResponse = await graphlit.getContent('content-id-here');
const content = contentResponse.content;

console.log(`\nAnalyzing entities in: ${content.name}\n`);

// Iterate through all observations
content.observations?.forEach(observation => {
  console.log(`\n${observation.type}: ${observation.observable.name}`);
  console.log(`Entity ID: ${observation.observable.id}`);
  console.log(`Total occurrences: ${observation.occurrences?.length || 0}\n`);
  
  // Analyze each occurrence
  observation.occurrences?.forEach((occurrence, index) => {
    console.log(`  Occurrence #${index + 1}:`);
    console.log(`    Confidence: ${occurrence.confidence?.toFixed(3) ?? 'n/a'}`);
    
    // Location context (varies by content type).
    // GraphQL returns null for unset fields, so compare against null rather than undefined.
    if (occurrence.pageIndex != null) {
      console.log(`    Page: ${occurrence.pageIndex}`);
    }
    
    if (occurrence.boundingBox) {
      console.log(`    Location: (${occurrence.boundingBox.left}, ${occurrence.boundingBox.top})`);
      console.log(`    Size: ${occurrence.boundingBox.width} x ${occurrence.boundingBox.height}`);
    }
    
    if (occurrence.startTime != null) {
      console.log(`    Time: ${occurrence.startTime}s - ${occurrence.endTime}s`);
    }
    
    console.log();
  });
  
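  // Calculate average confidence (skip entities with no occurrences to avoid dividing by zero)
  const occurrences = observation.occurrences ?? [];
  if (occurrences.length > 0) {
    const avgConfidence = occurrences.reduce(
      (sum, occ) => sum + (occ.confidence ?? 0), 0
    ) / occurrences.length;
    
    console.log(`  Average confidence: ${avgConfidence.toFixed(3)}`);
  }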
});

// Filter high-confidence entities
const highConfidenceEntities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.8)
);

console.log(`\nHigh-confidence entities (>=0.8): ${highConfidenceEntities?.length || 0}`);

### Python
```python
from graphlit import Graphlit

graphlit = Graphlit()

# Key differences: snake_case methods
# (await must run inside an async function)
content_response = await graphlit.client.get_content(id="content-id-here")
content = content_response.content

print(f"\nAnalyzing entities in: {content.name}\n")

# Iterate through observations
for observation in content.observations or []:
    print(f"\n{observation.type}: {observation.observable.name}")
    print(f"Entity ID: {observation.observable.id}")
    occurrences = observation.occurrences or []
    print(f"Total occurrences: {len(occurrences)}\n")

    # Analyze occurrences
    for idx, occurrence in enumerate(occurrences):
        print(f"  Occurrence #{idx + 1}:")
        print(f"    Confidence: {occurrence.confidence:.3f}")

        if occurrence.page_index is not None:
            print(f"    Page: {occurrence.page_index}")

        if occurrence.bounding_box:
            box = occurrence.bounding_box
            print(f"    Location: ({box.left}, {box.top})")
            print(f"    Size: {box.width} x {box.height}")

        if occurrence.start_time is not None:
            print(f"    Time: {occurrence.start_time}s - {occurrence.end_time}s")

        print()

    # Average confidence (skip entities with no occurrences to avoid dividing by zero)
    if occurrences:
        avg_conf = sum(occ.confidence for occ in occurrences) / len(occurrences)
        print(f"  Average confidence: {avg_conf:.3f}")

# Filter high-confidence
high_conf = [
    obs for obs in content.observations or []
    if any(occ.confidence >= 0.8 for occ in obs.occurrences or [])
]

print(f"\nHigh-confidence entities (>=0.8): {len(high_conf)}")
```


### C#
```csharp
using System.Collections.Generic;
using System.Linq;
using Graphlit;

var graphlit = new Graphlit();

// Key differences: PascalCase methods
var contentResponse = await graphlit.GetContent(id: "content-id-here");
var content = contentResponse.Content;

Console.WriteLine($"\nAnalyzing entities in: {content.Name}\n");

// Iterate through observations
foreach (var observation in content.Observations ?? new List<Observation>())
{
    Console.WriteLine($"\n{observation.Type}: {observation.Observable.Name}");
    Console.WriteLine($"Entity ID: {observation.Observable.Id}");
    Console.WriteLine($"Total occurrences: {observation.Occurrences?.Count ?? 0}\n");
    
    // Analyze occurrences
    var occurrences = observation.Occurrences ?? new List<Occurrence>();
    for (int i = 0; i < occurrences.Count; i++)
    {
        var occurrence = occurrences[i];
        Console.WriteLine($"  Occurrence #{i + 1}:");
        Console.WriteLine($"    Confidence: {occurrence.Confidence:F3}");
        
        if (occurrence.PageIndex.HasValue)
            Console.WriteLine($"    Page: {occurrence.PageIndex}");
        
        if (occurrence.BoundingBox != null)
        {
            var box = occurrence.BoundingBox;
            Console.WriteLine($"    Location: ({box.Left}, {box.Top})");
            Console.WriteLine($"    Size: {box.Width} x {box.Height}");
        }
        
        if (occurrence.StartTime.HasValue)
            Console.WriteLine($"    Time: {occurrence.StartTime}s - {occurrence.EndTime}s");
        
        Console.WriteLine();
    }
    
    // Average confidence (Average() throws on an empty sequence, so guard first)
    if (occurrences.Count > 0)
    {
        var avgConf = occurrences.Average(occ => occ.Confidence);
        Console.WriteLine($"  Average confidence: {avgConf:F3}");
    }
}
```

Step-by-Step Explanation

Step 1: Understanding Confidence Scores

What is Confidence?

Score from 0.0 (uncertain) to 1.0 (very certain)
Provided by LLM during entity extraction
Indicates extraction quality/reliability
Per-occurrence (same entity can have different confidences)

Confidence Ranges:

0.9 - 1.0: Very high confidence (explicit mentions, clear context)
0.7 - 0.9: High confidence (standard mentions, good context)
0.5 - 0.7: Medium confidence (implicit mentions, unclear context)
0.3 - 0.5: Low confidence (ambiguous mentions, weak context)
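0.0 - 0.3: Very low confidence (likely false positives)

Each range includes its lower bound and excludes its upper bound, so a score of exactly 0.7 counts as high confidence and a score of exactly 0.9 as very high.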

Step 2: Occurrence Context by Content Type

For Documents (PDF, Word, etc.):

occurrence: {
  confidence: 0.92,
  pageIndex: 5,              // Zero-based page number
  boundingBox: {
    left: 100.5,             // X coordinate (pixels or points)
    top: 250.3,              // Y coordinate
    width: 150.2,            // Width
    height: 20.5             // Height
  }
}

For Audio/Video (Transcripts):

occurrence: {
  confidence: 0.88,
  startTime: 125.3,          // Seconds from start
  endTime: 127.8,            // Seconds from start
  transcript: "Kirk Marple"  // Optional: exact text
}

For Images:

occurrence: {
  confidence: 0.85,
  boundingBox: {
    left: 50,                // Pixel coordinates
    top: 100,
    width: 200,
    height: 150
  }
}

For Text/Messages (Emails, Slack):

occurrence: {
  confidence: 0.95,
  // No spatial context - just presence in text
}
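
To handle these four shapes in a single code path, one option is to infer the content kind from whichever location fields are set. The sketch below is a heuristic, not an SDK feature: the `Occurrence` interface is a simplified stand-in for the SDK type, and `locationKind` is a hypothetical helper name.

```typescript
// Simplified stand-in for the SDK's occurrence type (assumption: the real type has more fields).
interface Occurrence {
  confidence?: number | null;
  pageIndex?: number | null;
  boundingBox?: { left: number; top: number; width: number; height: number } | null;
  startTime?: number | null;
  endTime?: number | null;
}

type LocationKind = "media" | "document" | "image" | "text";

// Hypothetical helper: guess the content kind from the location fields that are set.
function locationKind(occ: Occurrence): LocationKind {
  if (occ.startTime != null) return "media";     // timestamps: audio/video transcript
  if (occ.pageIndex != null) return "document";  // page index: paged document
  if (occ.boundingBox != null) return "image";   // box without a page: image
  return "text";                                 // no location context at all
}

console.log(locationKind({ confidence: 0.88, startTime: 125.3, endTime: 127.8 })); // "media"
console.log(locationKind({ confidence: 0.95 }));                                   // "text"
```

A document occurrence that carries a bounding box but no page index would be classified as an image here, so treat the result as a hint rather than a guarantee.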

Step 3: Multiple Occurrences

When the same entity is mentioned several times, each mention is recorded as a separate occurrence on a single observation:

// Example: "Kirk Marple" appears on pages 1, 5, and 12
observation: {
  type: ObservableTypes.Person,
  observable: {
    id: "obs-123",
    name: "Kirk Marple"
  },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 },   // Page 1 (zero-based)
    { confidence: 0.88, pageIndex: 4 },   // Page 5
    { confidence: 0.92, pageIndex: 11 }   // Page 12
  ]
}
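
An observation like the one above can be condensed into a per-entity summary. This is a minimal sketch, assuming simplified local types in place of the SDK's; `summarizeOccurrences` is a hypothetical helper, not an SDK method.

```typescript
// Simplified stand-ins for the SDK types (assumption: the real types have more fields).
interface Occurrence {
  confidence?: number | null;
  pageIndex?: number | null;
}
interface Observation {
  type: string;
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

// Hypothetical helper: count mentions, list 1-based pages, and report the confidence range.
function summarizeOccurrences(obs: Observation) {
  const occs = obs.occurrences ?? [];
  const scores = occs
    .map(o => o.confidence)
    .filter((c): c is number => c != null);
  const pages = Array.from(
    new Set(
      occs
        .map(o => o.pageIndex)
        .filter((p): p is number => p != null)
        .map(p => p + 1) // convert zero-based index to page number
    )
  ).sort((a, b) => a - b);

  return {
    entity: obs.observable.name,
    count: occs.length,
    pages,
    minConfidence: scores.length > 0 ? Math.min(...scores) : null,
    maxConfidence: scores.length > 0 ? Math.max(...scores) : null
  };
}

const kirk: Observation = {
  type: "PERSON",
  observable: { id: "obs-123", name: "Kirk Marple" },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 },
    { confidence: 0.88, pageIndex: 4 },
    { confidence: 0.92, pageIndex: 11 }
  ]
};

console.log(summarizeOccurrences(kirk));
// count: 3, pages: [1, 5, 12], minConfidence: 0.88, maxConfidence: 0.95
```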

Step 4: Filtering by Confidence

High Precision (Few False Positives):

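const highPrecision = content.observations?.filter(obs => {
  const occs = obs.occurrences ?? [];
  // ALL occurrences high; every() is true for an empty array, so require at least one
  return occs.length > 0 && occs.every(occ => occ.confidence >= 0.8);
});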

High Recall (Few False Negatives):

const highRecall = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.5)  // ANY occurrence medium+
);

Balanced:

const balanced = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.7)  // ANY occurrence high
);

Configuration Options

Setting Confidence Thresholds

By Use Case:

Legal/Compliance (High Precision):

const threshold = 0.85;  // Very conservative
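const entities = content.observations?.filter(obs => {
  const occs = obs.occurrences ?? [];
  // every() is true for an empty array, so require at least one occurrence
  return occs.length > 0 && occs.every(occ => occ.confidence >= threshold);
});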

Research/Discovery (High Recall):

const threshold = 0.5;   // More permissive
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);

Production (Balanced):

const threshold = 0.7;   // Recommended default
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);
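
The three use cases above differ only in the threshold and in whether any or all occurrences must pass, so they can be captured as data. The sketch below assumes simplified local types; the preset names, `FilterPreset`, and `filterByPreset` are hypothetical and not part of the SDK. The threshold values are the ones listed above.

```typescript
// Simplified stand-ins for the SDK types (assumption: the real types have more fields).
interface Occurrence { confidence?: number | null }
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

type MatchMode = "any" | "all";
interface FilterPreset { threshold: number; mode: MatchMode }

// Hypothetical preset table mirroring the use cases above.
const presets: Record<"compliance" | "discovery" | "production", FilterPreset> = {
  compliance: { threshold: 0.85, mode: "all" }, // high precision
  discovery: { threshold: 0.5, mode: "any" },   // high recall
  production: { threshold: 0.7, mode: "any" }   // balanced
};

function filterByPreset(observations: Observation[], preset: FilterPreset): Observation[] {
  return observations.filter(obs => {
    // Treat a missing confidence as 0 so it never passes a threshold.
    const scores = (obs.occurrences ?? []).map(o => o.confidence ?? 0);
    if (scores.length === 0) return false; // no evidence, never included
    return preset.mode === "all"
      ? scores.every(c => c >= preset.threshold)
      : scores.some(c => c >= preset.threshold);
  });
}

const sample: Observation[] = [
  { observable: { id: "a", name: "A" }, occurrences: [{ confidence: 0.95 }, { confidence: 0.9 }] },
  { observable: { id: "b", name: "B" }, occurrences: [{ confidence: 0.9 }, { confidence: 0.6 }] },
  { observable: { id: "c", name: "C" }, occurrences: [{ confidence: 0.55 }] },
  { observable: { id: "d", name: "D" }, occurrences: [] }
];

console.log(filterByPreset(sample, presets.compliance).map(o => o.observable.name)); // ["A"]
console.log(filterByPreset(sample, presets.production).map(o => o.observable.name)); // ["A", "B"]
console.log(filterByPreset(sample, presets.discovery).map(o => o.observable.name));  // ["A", "B", "C"]
```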

Analyzing Confidence Distribution

// Group by confidence range
function analyzeConfidence(observations: Observation[]) {
  const distribution = {
    veryHigh: 0,   // 0.9 - 1.0
    high: 0,       // 0.7 - 0.9
    medium: 0,     // 0.5 - 0.7
    low: 0,        // 0.3 - 0.5
    veryLow: 0     // 0.0 - 0.3
  };
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.confidence >= 0.9) distribution.veryHigh++;
      else if (occ.confidence >= 0.7) distribution.high++;
      else if (occ.confidence >= 0.5) distribution.medium++;
      else if (occ.confidence >= 0.3) distribution.low++;
      else distribution.veryLow++;
    });
  });
  
  return distribution;
}

const dist = analyzeConfidence(content.observations || []);
console.log('Confidence distribution:', dist);

Variations

Variation 1: Find Entities by Page Number

Locate entities on specific document pages:

function findEntitiesOnPage(observations: Observation[], pageNum: number) {
  const pageIndex = pageNum - 1;  // Convert to zero-based
  
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      occurrences: obs.occurrences?.filter(occ => 
        occ.pageIndex === pageIndex
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

const page5Entities = findEntitiesOnPage(content.observations || [], 5);
console.log(`Entities on page 5: ${page5Entities.length}`);

Variation 2: Find Entities in Time Range (Audio/Video)

Extract entities mentioned during a specific timeframe:

function findEntitiesInTimeRange(
  observations: Observation[], 
  startSec: number, 
  endSec: number
) {
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      occurrences: obs.occurrences?.filter(occ =>
        occ.startTime != null &&  // unset fields come back as null, not undefined
        occ.startTime >= startSec &&
        occ.startTime <= endSec
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

// Find entities mentioned between 5:00 and 10:00
const segment = findEntitiesInTimeRange(content.observations || [], 300, 600);
console.log(`Entities in time range: ${segment.length}`);
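
The function above matches an occurrence only when its start time falls inside the window, so a mention that begins just before the window and continues into it is missed. If that matters, an interval-overlap test is one alternative. This is a sketch with a simplified local `Occurrence` type; `overlapsWindow` is a hypothetical helper name.

```typescript
// Simplified stand-in for the SDK's occurrence type (assumption: the real type has more fields).
interface Occurrence {
  startTime?: number | null;
  endTime?: number | null;
}

// True when [startTime, endTime] overlaps [startSec, endSec] (both ends inclusive).
function overlapsWindow(occ: Occurrence, startSec: number, endSec: number): boolean {
  if (occ.startTime == null) return false;  // no timing data
  const end = occ.endTime ?? occ.startTime; // missing endTime: treat as instantaneous
  return occ.startTime <= endSec && end >= startSec;
}

// A mention from 4:55 to 5:05 overlaps the 5:00-10:00 window even though it starts earlier.
console.log(overlapsWindow({ startTime: 295, endTime: 305 }, 300, 600)); // true
console.log(overlapsWindow({ startTime: 610, endTime: 620 }, 300, 600)); // false
```

It can be dropped into the `filter` callback of `findEntitiesInTimeRange` in place of the three start-time comparisons.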

Variation 3: Visual Entity Locator (PDFs with Bounding Boxes)

Find entity positions for highlighting:

function getEntityLocations(observations: Observation[]) {
  const locations: Array<{
    entity: string;
    type: string;
    page: number;
    box: { x: number; y: number; width: number; height: number };
    confidence: number;
  }> = [];
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.boundingBox && occ.pageIndex != null) {  // unset fields come back as null
        locations.push({
          entity: obs.observable.name,
          type: obs.type,
          page: occ.pageIndex + 1,  // Convert to 1-based
          box: {
            x: occ.boundingBox.left,
            y: occ.boundingBox.top,
            width: occ.boundingBox.width,
            height: occ.boundingBox.height
          },
          confidence: occ.confidence
        });
      }
    });
  });
  
  return locations;
}

const locations = getEntityLocations(content.observations || []);
// Use for PDF highlighting UI
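
A viewer that renders one page at a time will usually want these locations grouped by page. The sketch below assumes the item shape returned by `getEntityLocations` above; `EntityLocation` and `groupLocationsByPage` are hypothetical names introduced for illustration.

```typescript
// Shape of one item produced by getEntityLocations (page is 1-based).
interface EntityLocation {
  entity: string;
  type: string;
  page: number;
  box: { x: number; y: number; width: number; height: number };
  confidence: number;
}

// Hypothetical helper: bucket highlight boxes by page number.
function groupLocationsByPage(items: EntityLocation[]): Map<number, EntityLocation[]> {
  const byPage = new Map<number, EntityLocation[]>();
  for (const item of items) {
    const bucket = byPage.get(item.page) ?? [];
    bucket.push(item);
    byPage.set(item.page, bucket);
  }
  return byPage;
}

const sampleLocations: EntityLocation[] = [
  { entity: "Kirk Marple", type: "PERSON", page: 1, box: { x: 10, y: 20, width: 80, height: 12 }, confidence: 0.95 },
  { entity: "Graphlit", type: "ORGANIZATION", page: 1, box: { x: 120, y: 20, width: 60, height: 12 }, confidence: 0.9 },
  { entity: "Kirk Marple", type: "PERSON", page: 3, box: { x: 40, y: 200, width: 80, height: 12 }, confidence: 0.88 }
];

const grouped = groupLocationsByPage(sampleLocations);
console.log(grouped.get(1)?.length); // 2
console.log(grouped.get(3)?.length); // 1
```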

Variation 4: Confidence-Weighted Entity Ranking

Rank entities by total confidence across all occurrences:

function rankEntitiesByConfidence(observations: Observation[]) {
  return observations
    .map(obs => {
      const totalConf = obs.occurrences?.reduce(
        (sum, occ) => sum + occ.confidence, 0
      ) || 0;
      
      const avgConf = totalConf / (obs.occurrences?.length || 1);
      const occCount = obs.occurrences?.length || 0;
      
      return {
        entity: obs.observable,
        type: obs.type,
        averageConfidence: avgConf,
        occurrenceCount: occCount,
        totalConfidence: totalConf,
        score: totalConf * occCount  // Sum * count (= average * count^2); strongly favors frequent entities
      };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = rankEntitiesByConfidence(content.observations || []);
console.log('Top entities by confidence:');
ranked.slice(0, 10).forEach((item, i) => {
  console.log(`${i + 1}. ${item.entity.name} (${item.type})`);
  console.log(`   Avg conf: ${item.averageConfidence.toFixed(3)}, Count: ${item.occurrenceCount}`);
});

Variation 5: Occurrence Clustering (Find Dense Entity Regions)

Find pages/sections with high entity density:

function findEntityClusters(observations: Observation[]) {
  const pageMap = new Map<number, number>();  // page index → occurrence count
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.pageIndex != null) {  // unset fields come back as null, not undefined
        pageMap.set(
          occ.pageIndex,
          (pageMap.get(occ.pageIndex) || 0) + 1
        );
      }
    });
  });
  
  // Sort pages by entity density
  return Array.from(pageMap.entries())
    .map(([page, count]) => ({ page: page + 1, entityCount: count }))
    .sort((a, b) => b.entityCount - a.entityCount);
}

const clusters = findEntityClusters(content.observations || []);
console.log('Pages with most entities:');
clusters.slice(0, 5).forEach(cluster => {
  console.log(`  Page ${cluster.page}: ${cluster.entityCount} entities`);
});
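
Note that `findEntityClusters` counts occurrences, so an entity mentioned five times on one page contributes five to that page's total. If the goal is the number of distinct entities per page, a set per page is one way to get it. This sketch uses simplified local types; `countDistinctEntitiesPerPage` is a hypothetical helper name.

```typescript
// Simplified stand-ins for the SDK types (assumption: the real types have more fields).
interface Occurrence { pageIndex?: number | null }
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

// Count each entity at most once per page, keyed by observable id.
function countDistinctEntitiesPerPage(observations: Observation[]) {
  const pageEntities = new Map<number, Set<string>>();

  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (occ.pageIndex == null) continue; // skip occurrences without page context
      const ids = pageEntities.get(occ.pageIndex) ?? new Set<string>();
      ids.add(obs.observable.id);
      pageEntities.set(occ.pageIndex, ids);
    }
  }

  return Array.from(pageEntities.entries())
    .map(([pageIndex, ids]) => ({ page: pageIndex + 1, entityCount: ids.size }))
    .sort((a, b) => b.entityCount - a.entityCount);
}

const docObservations: Observation[] = [
  { observable: { id: "a", name: "A" }, occurrences: [{ pageIndex: 0 }, { pageIndex: 0 }, { pageIndex: 1 }] },
  { observable: { id: "b", name: "B" }, occurrences: [{ pageIndex: 0 }] }
];

console.log(countDistinctEntitiesPerPage(docObservations));
// [ { page: 1, entityCount: 2 }, { page: 2, entityCount: 1 } ]
```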

Common Issues & Solutions

Issue: All Confidences Are Low

Problem: Most entities have confidence <0.5.

Solutions:

Upgrade model: Use GPT-4 instead of Gemini for better quality
Improve preparation: Better OCR, cleaner text extraction
Use vision models: For scanned documents or images
Check content quality: Low-quality scans produce low confidence

// Solution: Use better model
const spec = await graphlit.createSpecification({
  name: "High-Quality Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4,  // Better quality
    temperature: 0.1
  }
});

Issue: Same Entity, Varying Confidence

Problem: Multiple occurrences of the same entity have wildly different confidence scores.

Explanation: This is normal and indicates:

Some mentions are explicit ("Kirk Marple, CEO of Graphlit")
Some mentions are implicit ("Kirk said...")
Context quality varies throughout document

Solution: Use average or maximum confidence:

// Guard against missing or empty occurrences (Math.max() with no arguments is -Infinity)
const scores = (observation.occurrences ?? []).map(occ => occ.confidence);

// Use max confidence across occurrences
const maxConf = scores.length > 0 ? Math.max(...scores) : 0;

// Or use average
const avgConf = scores.length > 0
  ? scores.reduce((sum, c) => sum + c, 0) / scores.length
  : 0;

Issue: Bounding Boxes Missing

Problem: Occurrences don't have bounding box coordinates.

Causes:

Content type doesn't support spatial info (emails, text)
Used text extraction instead of vision extraction
Document doesn't have layout information

Solution: Use vision model for PDFs:

extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument,  // Vision model
      extractedTypes: [/* ... */]
    }
  }]
}

Issue: No Page Numbers in Occurrences

Problem: Page index is undefined for document content.

Causes:

Content isn't page-based (email, message, web page)
PDF preparation didn't preserve page structure
Used wrong preparation type

Solution: Ensure proper document preparation:

preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument  // Preserves pages
    }
  }]
}

Developer Hints

Confidence Interpretation by Model

GPT-4: Generally conservative, confidence >0.7 is very reliable
GPT-4o: Well-calibrated, confidence >0.75 recommended
Claude 3.5: Slightly optimistic, confidence >0.8 for high precision
Gemini: More variable, confidence >0.7 minimum recommended
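
If extraction runs with more than one model, the per-model guidance above can be kept in a lookup so filtering code does not hard-code a single cutoff. This is a sketch only: the keys are plain labels chosen for illustration (not SDK enum values), `thresholdFor` is a hypothetical helper, and the 0.7 fallback is the balanced default suggested earlier.

```typescript
// Hypothetical lookup mirroring the list above. Keys are display labels, not SDK identifiers.
const modelThresholds = new Map<string, number>([
  ["GPT-4", 0.7],
  ["GPT-4o", 0.75],
  ["Claude 3.5", 0.8],
  ["Gemini", 0.7]
]);

// Return the suggested cutoff for a model label, or the fallback for unknown models.
function thresholdFor(modelLabel: string, fallback: number = 0.7): number {
  return modelThresholds.get(modelLabel) ?? fallback;
}

console.log(thresholdFor("Claude 3.5"));    // 0.8
console.log(thresholdFor("some-other-llm")); // 0.7
```

These numbers are starting points; checking them against a manually reviewed sample of your own content is the safer way to settle on a cutoff.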

When to Use Occurrence Data

Page index: PDF highlighting, citation validation, page-specific queries
Bounding boxes: Visual annotation, entity location UI, layout analysis
Timestamps: Video playback navigation, meeting segment analysis
Confidence: Quality filtering, precision/recall tuning, validation

Performance Considerations

Occurrence data increases response size
Filter occurrences client-side for specific pages/times
Don't fetch occurrence details if not needed
Cache occurrence analysis results
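
For the caching hint, a small in-memory memoizer keyed by content ID is often enough. This is a minimal sketch: `memoizeByContentId` is a hypothetical helper, the analysis function passed to it is a placeholder for whatever you compute from the occurrences, and the cache has no eviction or invalidation, so entries should be cleared if content is re-extracted.

```typescript
// Wrap an analysis function so each content ID is analyzed at most once per process.
function memoizeByContentId<T>(analyze: (contentId: string) => T): (contentId: string) => T {
  const cache = new Map<string, T>();
  return (contentId: string): T => {
    if (cache.has(contentId)) {
      return cache.get(contentId) as T; // cache hit: skip recomputation
    }
    const result = analyze(contentId);
    cache.set(contentId, result);
    return result;
  };
}

// Placeholder analysis that records how often it actually runs.
let analysisRuns = 0;
const analyzeOnce = memoizeByContentId((contentId: string) => {
  analysisRuns++;
  return { contentId, analyzedAt: Date.now() };
});

analyzeOnce("content-1");
analyzeOnce("content-1"); // served from the cache
analyzeOnce("content-2");
console.log(analysisRuns); // 2
```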

Validation Strategies

Manual review: Check high-confidence entities first
Cross-reference: Verify entities across multiple content items
Confidence distribution: Expect most >0.7 for good extraction
Frequency analysis: Common entities should appear multiple times
Context checking: Use page/time context for validation
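
Two of these checks, the confidence distribution and the frequency analysis, are easy to automate. The sketch below assumes simplified local types; `extractionHealth` is a hypothetical helper, and the 0.7 default is the cutoff mentioned in the list above.

```typescript
// Simplified stand-ins for the SDK types (assumption: the real types have more fields).
interface Occurrence { confidence?: number | null }
interface Observation {
  observable: { id: string; name: string };
  occurrences?: Occurrence[] | null;
}

// Report the share of scored occurrences at or above the threshold,
// plus the entities that were seen exactly once (candidates for manual review).
function extractionHealth(observations: Observation[], threshold: number = 0.7) {
  let scored = 0;
  let atOrAbove = 0;
  const singletons: string[] = [];

  for (const obs of observations) {
    const occs = obs.occurrences ?? [];
    if (occs.length === 1) singletons.push(obs.observable.name);
    for (const occ of occs) {
      if (occ.confidence == null) continue; // ignore occurrences without a score
      scored++;
      if (occ.confidence >= threshold) atOrAbove++;
    }
  }

  return {
    shareAtOrAbove: scored === 0 ? 0 : atOrAbove / scored,
    singletons
  };
}

const reviewSample: Observation[] = [
  { observable: { id: "a", name: "A" }, occurrences: [{ confidence: 0.9 }, { confidence: 0.8 }] },
  { observable: { id: "b", name: "B" }, occurrences: [{ confidence: 0.4 }] }
];

const health = extractionHealth(reviewSample);
console.log(health.shareAtOrAbove.toFixed(2)); // "0.67"
console.log(health.singletons);                // ["B"]
```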
