# Understanding Confidence Scores and Occurrences

## Use Case: Understanding Confidence Scores and Occurrences

### User Intent

"What are confidence scores and occurrences in entity extraction? How do I use them to validate and filter entities?"

### Operation

**SDK Method**: Access via `content.observations[].occurrences[]`\
**GraphQL Query**: `getContent` with observations\
**Entity**: Observation occurrence data

### Prerequisites

* Content with extracted entities (workflow with extraction stage)
* Understanding of observation model
* Graphlit project with API credentials

***

### Complete Code Example (TypeScript)

```typescript
import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Get content with entity observations
const contentResponse = await graphlit.getContent('content-id-here');
const content = contentResponse.content;

console.log(`\nAnalyzing entities in: ${content.name}\n`);

// Iterate through all observations
content.observations?.forEach(observation => {
  console.log(`\n${observation.type}: ${observation.observable.name}`);
  console.log(`Entity ID: ${observation.observable.id}`);
  console.log(`Total occurrences: ${observation.occurrences?.length || 0}\n`);
  
  // Analyze each occurrence
  observation.occurrences?.forEach((occurrence, index) => {
    console.log(`  Occurrence #${index + 1}:`);
    console.log(`    Confidence: ${occurrence.confidence.toFixed(3)}`);
    
    // Location context (varies by content type).
    // `!= null` rules out both undefined and null (GraphQL returns null for absent fields).
    if (occurrence.pageIndex != null) {
      console.log(`    Page: ${occurrence.pageIndex}`);
    }
    
    if (occurrence.boundingBox) {
      console.log(`    Location: (${occurrence.boundingBox.left}, ${occurrence.boundingBox.top})`);
      console.log(`    Size: ${occurrence.boundingBox.width} x ${occurrence.boundingBox.height}`);
    }
    
    if (occurrence.startTime != null) {
      console.log(`    Time: ${occurrence.startTime}s - ${occurrence.endTime}s`);
    }
    
    console.log();
  });
  
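  // Calculate average confidence (skipped when there are no occurrences,
  // which would otherwise throw or produce NaN)
  const occurrences = observation.occurrences ?? [];
  if (occurrences.length > 0) {
    const avgConfidence = occurrences.reduce(
      (sum, occ) => sum + occ.confidence, 0
    ) / occurrences.length;
    
    console.log(`  Average confidence: ${avgConfidence.toFixed(3)}`);
  }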
});

// Filter high-confidence entities
const highConfidenceEntities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.8)
);

console.log(`\nHigh-confidence entities (>=0.8): ${highConfidenceEntities?.length || 0}`);
```

***

### Python

```python
from graphlit import Graphlit

graphlit = Graphlit()

# Key differences: snake_case methods
content_response = await graphlit.client.get_content(id="content-id-here")
content = content_response.content

print(f"\nAnalyzing entities in: {content.name}\n")

# Iterate through observations
for observation in content.observations or []:
    print(f"\n{observation.type}: {observation.observable.name}")
    print(f"Entity ID: {observation.observable.id}")
    print(f"Total occurrences: {len(observation.occurrences or [])}\n")

    # Analyze occurrences
    for idx, occurrence in enumerate(observation.occurrences or []):
        print(f"  Occurrence #{idx + 1}:")
        print(f"    Confidence: {occurrence.confidence:.3f}")

        if occurrence.page_index is not None:
            print(f"    Page: {occurrence.page_index}")

        if occurrence.bounding_box:
            box = occurrence.bounding_box
            print(f"    Location: ({box.left}, {box.top})")
            print(f"    Size: {box.width} x {box.height}")

        if occurrence.start_time is not None:
            print(f"    Time: {occurrence.start_time}s - {occurrence.end_time}s")

        print()

    # Average confidence (skipped when there are no occurrences)
    occurrences = observation.occurrences or []
    if occurrences:
        avg_conf = sum(occ.confidence for occ in occurrences) / len(occurrences)
        print(f"  Average confidence: {avg_conf:.3f}")

# Filter high-confidence
high_conf = [
    obs for obs in content.observations or []
    if any(occ.confidence >= 0.8 for occ in obs.occurrences or [])
]

print(f"\nHigh-confidence entities (>=0.8): {len(high_conf)}")
```

### C#
```csharp
using Graphlit;

var graphlit = new Graphlit();

// Key differences: PascalCase methods
var contentResponse = await graphlit.GetContent(id: "content-id-here");
var content = contentResponse.Content;

Console.WriteLine($"\nAnalyzing entities in: {content.Name}\n");

// Iterate through observations
foreach (var observation in content.Observations ?? new List<Observation>())
{
    Console.WriteLine($"\n{observation.Type}: {observation.Observable.Name}");
    Console.WriteLine($"Entity ID: {observation.Observable.Id}");
    Console.WriteLine($"Total occurrences: {observation.Occurrences?.Count ?? 0}\n");
    
    // Analyze occurrences
    var occurrences = observation.Occurrences ?? new List<Occurrence>();
    for (int i = 0; i < occurrences.Count; i++)
    {
        var occurrence = occurrences[i];
        Console.WriteLine($"  Occurrence #{i + 1}:");
        Console.WriteLine($"    Confidence: {occurrence.Confidence:F3}");
        
        if (occurrence.PageIndex.HasValue)
            Console.WriteLine($"    Page: {occurrence.PageIndex}");
        
        if (occurrence.BoundingBox != null)
        {
            var box = occurrence.BoundingBox;
            Console.WriteLine($"    Location: ({box.Left}, {box.Top})");
            Console.WriteLine($"    Size: {box.Width} x {box.Height}");
        }
        
        if (occurrence.StartTime.HasValue)
            Console.WriteLine($"    Time: {occurrence.StartTime}s - {occurrence.EndTime}s");
        
        Console.WriteLine();
    }
    
    // Average confidence (guarded: Average() can throw on an empty sequence)
    if (occurrences.Count > 0)
    {
        var avgConf = occurrences.Average(occ => occ.Confidence);
        Console.WriteLine($"  Average confidence: {avgConf:F3}");
    }
}
````

***

### Step-by-Step Explanation

#### Step 1: Understanding Confidence Scores

**What is Confidence?**

* Score from 0.0 (uncertain) to 1.0 (very certain)
* Provided by LLM during entity extraction
* Indicates extraction quality/reliability
* Per-occurrence (same entity can have different confidences)

**Confidence Ranges**:

* **0.9 - 1.0**: Very high confidence (explicit mentions, clear context)
* **0.7 - 0.9**: High confidence (standard mentions, good context)
* **0.5 - 0.7**: Medium confidence (implicit mentions, unclear context)
* **0.3 - 0.5**: Low confidence (ambiguous mentions, weak context)
* **0.0 - 0.3**: Very low confidence (likely false positives)

#### Step 2: Occurrence Context by Content Type

**For Documents (PDF, Word, etc.)**:

```typescript
occurrence: {
  confidence: 0.92,
  pageIndex: 5,              // Zero-based page number
  boundingBox: {
    left: 100.5,             // X coordinate (pixels or points)
    top: 250.3,              // Y coordinate
    width: 150.2,            // Width
    height: 20.5             // Height
  }
}
```

**For Audio/Video (Transcripts)**:

```typescript
occurrence: {
  confidence: 0.88,
  startTime: 125.3,          // Seconds from start
  endTime: 127.8,            // Seconds from start
  transcript: "Kirk Marple"  // Optional: exact text
}
```

**For Images**:

```typescript
occurrence: {
  confidence: 0.85,
  boundingBox: {
    left: 50,                // Pixel coordinates
    top: 100,
    width: 200,
    height: 150
  }
}
```

**For Text/Messages (Emails, Slack)**:

```typescript
occurrence: {
  confidence: 0.95,
  // No spatial context - just presence in text
}
```
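Because the available fields differ by content type, code that handles mixed content usually needs to branch on which fields are present. The following is a minimal sketch of that check. `OccurrenceLike` and `classifyOccurrence` are hypothetical names introduced here for illustration, not SDK types, and the field names are assumed to match the shapes shown above.

```typescript
// Hypothetical local type mirroring the occurrence shapes shown above;
// it is not imported from the SDK.
interface OccurrenceLike {
  confidence?: number | null;
  pageIndex?: number | null;
  boundingBox?: { left: number; top: number; width: number; height: number } | null;
  startTime?: number | null;
  endTime?: number | null;
}

type OccurrenceContext = 'document' | 'media' | 'image' | 'text';

// Heuristic: infer the kind of positional context from the fields present.
function classifyOccurrence(occ: OccurrenceLike): OccurrenceContext {
  if (occ.pageIndex != null) return 'document';  // page-based (page 0 is valid)
  if (occ.startTime != null) return 'media';     // audio/video timestamps
  if (occ.boundingBox != null) return 'image';   // box without a page
  return 'text';                                 // no positional context
}
```

This is only a heuristic: a document occurrence that has a bounding box but no page index would be reported as `'image'`.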

#### Step 3: Multiple Occurrences

When the same entity is mentioned multiple times, each mention is recorded as a separate occurrence:

```typescript
// Example: "Kirk Marple" appears on pages 1, 5, and 12
observation: {
  type: ObservableTypes.Person,
  observable: {
    id: "obs-123",
    name: "Kirk Marple"
  },
  occurrences: [
    { confidence: 0.95, pageIndex: 0 },   // Page 1 (zero-based)
    { confidence: 0.88, pageIndex: 4 },   // Page 5
    { confidence: 0.92, pageIndex: 11 }   // Page 12
  ]
}
```
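When an entity has several occurrences, it is often convenient to collapse them into one summary. The sketch below computes the count, minimum, maximum, and average confidence, plus the one-based pages on which the entity appears. `OccurrenceLike` and `summarizeOccurrences` are hypothetical names used for illustration only.

```typescript
// Hypothetical local type; only the fields used below are declared.
interface OccurrenceLike {
  confidence?: number | null;
  pageIndex?: number | null;
}

interface OccurrenceSummary {
  count: number;     // occurrences that carry a confidence score
  min: number;
  max: number;
  average: number;
  pages: number[];   // one-based, de-duplicated, ascending
}

function summarizeOccurrences(
  occurrences: OccurrenceLike[] | null | undefined
): OccurrenceSummary | null {
  const list = occurrences ?? [];
  const scores = list
    .map(occ => occ.confidence)
    .filter((c): c is number => typeof c === 'number');

  // Nothing to summarize when no occurrence has a score.
  if (scores.length === 0) return null;

  const pages = Array.from(new Set(
    list
      .map(occ => occ.pageIndex)
      .filter((p): p is number => typeof p === 'number')
      .map(p => p + 1)  // convert zero-based index to page number
  )).sort((a, b) => a - b);

  return {
    count: scores.length,
    min: Math.min(...scores),
    max: Math.max(...scores),
    average: scores.reduce((sum, c) => sum + c, 0) / scores.length,
    pages
  };
}
```

For the "Kirk Marple" example above, this would report three occurrences on pages 1, 5, and 12, with a minimum of 0.88 and a maximum of 0.95.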

#### Step 4: Filtering by Confidence

**High Precision (Few False Positives)**:

```typescript
const highPrecision = content.observations?.filter(obs =>
  // Require at least one occurrence: every() is true for an empty array
  !!obs.occurrences?.length &&
  obs.occurrences.every(occ => occ.confidence >= 0.8)  // ALL occurrences high
);
```

**High Recall (Few False Negatives)**:

```typescript
const highRecall = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.5)  // ANY occurrence medium+
);
```

**Balanced**:

```typescript
const balanced = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.7)  // ANY occurrence high
);
```
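The filters above keep or drop whole entities. A complementary approach, sketched here, prunes the low-confidence occurrences inside each entity and drops an entity only when nothing survives. `ObservationLike`, `OccurrenceLike`, and `pruneOccurrences` are hypothetical names for illustration, and a missing confidence is treated as 0 by assumption.

```typescript
// Hypothetical local types; the SDK's own types are not imported here.
interface OccurrenceLike {
  confidence?: number | null;
}

interface ObservationLike {
  observable: { id: string; name?: string | null };
  occurrences?: OccurrenceLike[] | null;
}

// Returns new observation objects; the input array is not mutated.
function pruneOccurrences(
  observations: ObservationLike[],
  threshold: number
): ObservationLike[] {
  return observations
    .map(obs => ({
      ...obs,
      // Assumption: an occurrence without a score counts as confidence 0.
      occurrences: (obs.occurrences ?? []).filter(
        occ => (occ.confidence ?? 0) >= threshold
      )
    }))
    .filter(obs => obs.occurrences.length > 0);
}
```

This can be useful for highlighting, where only the reliable mentions of an entity should be shown rather than every mention of a reliable entity.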

***

### Configuration Options

#### Setting Confidence Thresholds

**By Use Case**:

**Legal/Compliance (High Precision)**:

```typescript
const threshold = 0.85;  // Very conservative
const entities = content.observations?.filter(obs =>
  // Require at least one occurrence: every() is true for an empty array
  !!obs.occurrences?.length &&
  obs.occurrences.every(occ => occ.confidence >= threshold)
);
```

**Research/Discovery (High Recall)**:

```typescript
const threshold = 0.5;   // More permissive
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);
```

**Production (Balanced)**:

```typescript
const threshold = 0.7;   // Recommended default
const entities = content.observations?.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= threshold)
);
```
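The three use cases differ only in the threshold and in whether all occurrences or any occurrence must meet it, so they can be expressed as a small lookup table. This is a sketch under that reading. `PRESETS`, `UseCase`, and `filterForUseCase` are hypothetical names, and entities with no occurrences are dropped in every mode by design.

```typescript
// Hypothetical local types; the SDK's own types are not imported here.
interface OccurrenceLike {
  confidence?: number | null;
}

interface ObservationLike {
  observable: { id: string; name?: string | null };
  occurrences?: OccurrenceLike[] | null;
}

type UseCase = 'compliance' | 'research' | 'production';

// Thresholds and matching rules taken from the three snippets above.
const PRESETS: Record<UseCase, { threshold: number; requireAll: boolean }> = {
  compliance: { threshold: 0.85, requireAll: true },   // high precision
  research: { threshold: 0.5, requireAll: false },     // high recall
  production: { threshold: 0.7, requireAll: false }    // balanced
};

function filterForUseCase(
  observations: ObservationLike[],
  useCase: UseCase
): ObservationLike[] {
  const { threshold, requireAll } = PRESETS[useCase];

  return observations.filter(obs => {
    // Assumption: an occurrence without a score counts as confidence 0.
    const scores = (obs.occurrences ?? []).map(occ => occ.confidence ?? 0);
    if (scores.length === 0) return false;  // nothing to judge

    return requireAll
      ? scores.every(c => c >= threshold)
      : scores.some(c => c >= threshold);
  });
}
```

Keeping the presets in one place makes it easier to adjust a threshold after reviewing the confidence distribution for a given model.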

#### Analyzing Confidence Distribution

```typescript
// Group by confidence range
function analyzeConfidence(observations: Observation[]) {
  const distribution = {
    veryHigh: 0,   // 0.9 - 1.0
    high: 0,       // 0.7 - 0.9
    medium: 0,     // 0.5 - 0.7
    low: 0,        // 0.3 - 0.5
    veryLow: 0     // 0.0 - 0.3
  };
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.confidence >= 0.9) distribution.veryHigh++;
      else if (occ.confidence >= 0.7) distribution.high++;
      else if (occ.confidence >= 0.5) distribution.medium++;
      else if (occ.confidence >= 0.3) distribution.low++;
      else distribution.veryLow++;
    });
  });
  
  return distribution;
}

const dist = analyzeConfidence(content.observations || []);
console.log('Confidence distribution:', dist);
```

***

### Variations

#### Variation 1: Find Entities by Page Number

Locate entities on specific document pages:

```typescript
function findEntitiesOnPage(observations: Observation[], pageNum: number) {
  const pageIndex = pageNum - 1;  // Convert to zero-based
  
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      occurrences: obs.occurrences?.filter(occ => 
        occ.pageIndex === pageIndex
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

const page5Entities = findEntitiesOnPage(content.observations || [], 5);
console.log(`Entities on page 5: ${page5Entities.length}`);
```

#### Variation 2: Find Entities in Time Range (Audio/Video)

Extract entities mentioned during a specific timeframe:

```typescript
function findEntitiesInTimeRange(
  observations: Observation[], 
  startSec: number, 
  endSec: number
) {
  return observations
    .map(obs => ({
      entity: obs.observable,
      type: obs.type,
      occurrences: obs.occurrences?.filter(occ =>
        occ.startTime != null &&
        occ.startTime >= startSec &&
        occ.startTime <= endSec
      ) || []
    }))
    .filter(item => item.occurrences.length > 0);
}

// Find entities mentioned between 5:00 and 10:00
const segment = findEntitiesInTimeRange(content.observations || [], 300, 600);
console.log(`Entities in time range: ${segment.length}`);
```
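The function above matches an occurrence only when its start time falls inside the window, so a mention that begins just before the window and continues into it is missed. If that matters, an interval-overlap test is one alternative. The sketch below uses a hypothetical `TimedOccurrence` type and `overlapsWindow` helper, and treats a missing `endTime` as an instantaneous mention by assumption.

```typescript
// Hypothetical local type; only the timing fields are declared.
interface TimedOccurrence {
  startTime?: number | null;
  endTime?: number | null;
}

// True when the occurrence's [startTime, endTime] interval intersects
// the [startSec, endSec] window (bounds inclusive).
function overlapsWindow(
  occ: TimedOccurrence,
  startSec: number,
  endSec: number
): boolean {
  if (occ.startTime == null) return false;  // no timing information

  // Assumption: without an end time, treat the mention as instantaneous.
  const occEnd = occ.endTime ?? occ.startTime;

  return occ.startTime <= endSec && occEnd >= startSec;
}
```

`overlapsWindow` can replace the `startTime` comparisons inside the `filter` callback of `findEntitiesInTimeRange`.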

#### Variation 3: Visual Entity Locator (PDFs with Bounding Boxes)

Find entity positions for highlighting:

```typescript
function getEntityLocations(observations: Observation[]) {
  const locations: Array<{
    entity: string;
    type: string;
    page: number;
    box: { x: number; y: number; width: number; height: number };
    confidence: number;
  }> = [];
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.boundingBox && occ.pageIndex != null) {
        locations.push({
          entity: obs.observable.name,
          type: obs.type,
          page: occ.pageIndex + 1,  // Convert to 1-based
          box: {
            x: occ.boundingBox.left,
            y: occ.boundingBox.top,
            width: occ.boundingBox.width,
            height: occ.boundingBox.height
          },
          confidence: occ.confidence
        });
      }
    });
  });
  
  return locations;
}

const locations = getEntityLocations(content.observations || []);
// Use for PDF highlighting UI
```

#### Variation 4: Confidence-Weighted Entity Ranking

Rank entities by total confidence across all occurrences:

```typescript
function rankEntitiesByConfidence(observations: Observation[]) {
  return observations
    .map(obs => {
      const totalConf = obs.occurrences?.reduce(
        (sum, occ) => sum + occ.confidence, 0
      ) || 0;
      
      const avgConf = totalConf / (obs.occurrences?.length || 1);
      const occCount = obs.occurrences?.length || 0;
      
      return {
        entity: obs.observable,
        type: obs.type,
        averageConfidence: avgConf,
        occurrenceCount: occCount,
        totalConfidence: totalConf,
        score: totalConf  // Total confidence = average × count, so frequency is already weighted in
      };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = rankEntitiesByConfidence(content.observations || []);
console.log('Top entities by confidence:');
ranked.slice(0, 10).forEach((item, i) => {
  console.log(`${i + 1}. ${item.entity.name} (${item.type})`);
  console.log(`   Avg conf: ${item.averageConfidence.toFixed(3)}, Count: ${item.occurrenceCount}`);
});
```

#### Variation 5: Occurrence Clustering (Find Dense Entity Regions)

Find pages/sections with high entity density:

```typescript
function findEntityClusters(observations: Observation[]) {
  const pageMap = new Map<number, number>();  // page → entity count
  
  observations.forEach(obs => {
    obs.occurrences?.forEach(occ => {
      if (occ.pageIndex != null) {
        pageMap.set(
          occ.pageIndex,
          (pageMap.get(occ.pageIndex) || 0) + 1
        );
      }
    });
  });
  
  // Sort pages by entity density
  return Array.from(pageMap.entries())
    .map(([page, count]) => ({ page: page + 1, entityCount: count }))
    .sort((a, b) => b.entityCount - a.entityCount);
}

const clusters = findEntityClusters(content.observations || []);
console.log('Pages with most entities:');
clusters.slice(0, 5).forEach(cluster => {
  console.log(`  Page ${cluster.page}: ${cluster.entityCount} entities`);
});
```

***

### Common Issues & Solutions

#### Issue: All Confidences Are Low

**Problem**: Most entities have confidence <0.5.

**Solutions**:

1. **Upgrade model**: Use GPT-4 instead of Gemini for better quality
2. **Improve preparation**: Better OCR, cleaner text extraction
3. **Use vision models**: For scanned documents or images
4. **Check content quality**: Low-quality scans produce low confidence

```typescript
// Solution: Use better model
const spec = await graphlit.createSpecification({
  name: "High-Quality Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4,  // Better quality
    temperature: 0.1
  }
});
```

#### Issue: Same Entity, Varying Confidence

**Problem**: Multiple occurrences of the same entity have widely varying confidence scores.

**Explanation**: This is normal and indicates:

* Some mentions are explicit ("Kirk Marple, CEO of Graphlit")
* Some mentions are implicit ("Kirk said...")
* Context quality varies throughout document

**Solution**: Use average or maximum confidence:

```typescript
// Use max confidence across occurrences
const maxConf = Math.max(...observation.occurrences.map(occ => occ.confidence));

// Or use average
const avgConf = observation.occurrences.reduce(
  (sum, occ) => sum + occ.confidence, 0
) / observation.occurrences.length;
```

#### Issue: Bounding Boxes Missing

**Problem**: Occurrences don't have bounding box coordinates.

**Causes**:

1. Content type doesn't support spatial info (emails, text)
2. Used text extraction instead of vision extraction
3. Document doesn't have layout information

**Solution**: Use vision model for PDFs:

```typescript
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument,  // Vision model
      extractedTypes: [/* ... */]
    }
  }]
}
```

#### Issue: No Page Numbers in Occurrences

**Problem**: Page index is undefined for document content.

**Causes**:

1. Content isn't page-based (email, message, web page)
2. PDF preparation didn't preserve page structure
3. Used wrong preparation type

**Solution**: Ensure proper document preparation:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument  // Preserves pages
    }
  }]
}
```

***

### Developer Hints

#### Confidence Interpretation by Model

* **GPT-4**: Generally conservative, confidence >0.7 is very reliable
* **GPT-4o**: Well-calibrated, confidence >0.75 recommended
* **Claude 3.5**: Slightly optimistic, confidence >0.8 for high precision
* **Gemini**: More variable, confidence >0.7 minimum recommended

#### When to Use Occurrence Data

* **Page index**: PDF highlighting, citation validation, page-specific queries
* **Bounding boxes**: Visual annotation, entity location UI, layout analysis
* **Timestamps**: Video playback navigation, meeting segment analysis
* **Confidence**: Quality filtering, precision/recall tuning, validation

#### Performance Considerations

* Occurrence data increases response size
* Filter occurrences client-side for specific pages/times
* Don't fetch occurrence details if not needed
* Cache occurrence analysis results
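As one way to cache analysis results, the sketch below memoizes a computation per content item and analysis name. `cachedAnalysis` and `analysisCache` are hypothetical helpers, not part of the SDK. The cache lives in memory only and is never invalidated, so it assumes the content is not re-extracted while the process runs.

```typescript
// In-memory cache keyed by "<contentId>:<analysisName>".
const analysisCache = new Map<string, unknown>();

// Runs `compute` once per (contentId, analysisName) pair and
// returns the stored result on later calls.
function cachedAnalysis<T>(
  contentId: string,
  analysisName: string,
  compute: () => T
): T {
  const key = `${contentId}:${analysisName}`;

  if (analysisCache.has(key)) {
    return analysisCache.get(key) as T;
  }

  const result = compute();
  analysisCache.set(key, result);
  return result;
}
```

A call might look like `cachedAnalysis(content.id, 'distribution', () => analyzeConfidence(content.observations || []))`, reusing the distribution function defined earlier on this page.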

#### Validation Strategies

1. **Manual review**: Check high-confidence entities first
2. **Cross-reference**: Verify entities across multiple content items
3. **Confidence distribution**: Expect most >0.7 for good extraction
4. **Frequency analysis**: Common entities should appear multiple times
5. **Context checking**: Use page/time context for validation
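The confidence-distribution check can be automated with a single ratio: the share of occurrences at or above a threshold. The following is a minimal sketch. `shareAtOrAbove` and the local types are hypothetical names, and occurrences without a score are ignored by assumption.

```typescript
// Hypothetical local types; the SDK's own types are not imported here.
interface OccurrenceLike {
  confidence?: number | null;
}

interface ObservationLike {
  occurrences?: OccurrenceLike[] | null;
}

// Fraction (0..1) of scored occurrences whose confidence is >= threshold.
// Returns 0 when there are no scored occurrences at all.
function shareAtOrAbove(
  observations: ObservationLike[],
  threshold: number
): number {
  let total = 0;
  let passing = 0;

  for (const obs of observations) {
    for (const occ of obs.occurrences ?? []) {
      if (typeof occ.confidence !== 'number') continue;  // unscored: skip
      total++;
      if (occ.confidence >= threshold) passing++;
    }
  }

  return total === 0 ? 0 : passing / total;
}
```

A result well below 0.5 at a threshold of 0.7 would suggest revisiting the model or the preparation settings discussed under Common Issues & Solutions.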

***


