Extract Medical Entities from Clinical Content

User Intent

"How do I extract medical entities (conditions, drugs, procedures, tests) from clinical documents and research papers? Show me how to build medical knowledge graphs for healthcare applications."

Operation

SDK Methods: createSpecification(), createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()

GraphQL: Medical content ingestion + extraction of 12 medical entity types

Entity: Medical Content → Observations → Medical Observables (Clinical Knowledge Graph)

Prerequisites

  • Graphlit project with API credentials

  • Medical/clinical documents (PDFs, research papers, clinical notes)

  • Understanding of medical entity types

  • Appropriate data privacy/HIPAA compliance measures


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  EntityExtractionServiceTypes,
  FilePreparationServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  OpenAIModels,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Medical Knowledge Graph ===\n');

// Step 1: Create high-quality medical extraction workflow
console.log('Step 1: Creating medical entity extraction workflow...');

// Use GPT-4 for medical accuracy
const spec = await graphlit.createSpecification({
  name: "GPT-4 Medical Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4,    // Best quality for medical
    temperature: 0.1              // Low temperature for consistency
  }
});

const workflow = await graphlit.createWorkflow({
  name: "Medical Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // PDFs, Word, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          // Medical entity types to extract
          ObservableTypes.MedicalCondition,         // Diseases, symptoms, diagnoses
          ObservableTypes.MedicalDrug,              // Medications, pharmaceuticals
          ObservableTypes.MedicalDrugClass,         // Drug categories (antibiotics, etc.)
          ObservableTypes.MedicalProcedure,         // Surgeries, treatments
          ObservableTypes.MedicalTest,              // Lab tests, diagnostics
          ObservableTypes.MedicalStudy,             // Clinical trials, research
          ObservableTypes.MedicalDevice,            // Medical equipment, implants
          ObservableTypes.MedicalTherapy,           // Therapies, treatments
          ObservableTypes.MedicalGuideline,         // Clinical guidelines, protocols
          ObservableTypes.MedicalIndication,        // Reasons for treatment
          ObservableTypes.MedicalContraindication,  // Reasons to avoid treatment
          
          // Also extract non-medical entities for context
          ObservableTypes.Person,                   // Patients, doctors, researchers
          ObservableTypes.Organization              // Hospitals, pharma companies
        ]
      }
    }]
  },
  specification: { id: spec.createSpecification.id }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Ingest clinical research paper
console.log('Step 2: Ingesting clinical research paper...');
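// The three undefined arguments skip the optional positional parameters
// that precede the workflow reference.
const paper = await graphlit.ingestUri(
  'https://example.com/papers/clinical-trial.pdf',
  "Clinical Trial: Drug X for Condition Y",
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }
);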

console.log(`✓ Ingested: ${paper.ingestUri.id}\n`);

// Step 3: Wait for extraction
console.log('Step 3: Extracting medical entities...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(paper.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    console.log('  Processing...');
    await new Promise(resolve => setTimeout(resolve, 3000));
  }
}
console.log('✓ Extraction complete\n');

// Step 4: Retrieve extracted entities
console.log('Step 4: Retrieving medical entities...');
const paperDetails = await graphlit.getContent(paper.ingestUri.id);
const content = paperDetails.content;

console.log(`✓ Document: ${content.name}`);
console.log(`  Pages: ${content.document?.pageCount}`);
console.log(`  Total entities: ${content.observations?.length || 0}\n`);

// Step 5: Analyze by medical entity type
console.log('Step 5: Analyzing medical entities...\n');

const medicalTypes = [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalDrugClass,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.MedicalTest,
  ObservableTypes.MedicalStudy,
  ObservableTypes.MedicalDevice,
  ObservableTypes.MedicalTherapy
];

medicalTypes.forEach(type => {
  const entities = content.observations?.filter(obs => obs.type === type) || [];
  const unique = new Set(entities.map(e => e.observable.name));
  
  if (unique.size > 0) {
    console.log(`${type} (${unique.size}):`);
    Array.from(unique).slice(0, 5).forEach(name => {
      console.log(`  - ${name}`);
    });
    if (unique.size > 5) {
      console.log(`  ... and ${unique.size - 5} more`);
    }
    console.log();
  }
});

// Step 6: Build drug-condition relationships
console.log('Step 6: Analyzing drug-condition relationships...\n');

const drugs = content.observations?.filter(obs => 
  obs.type === ObservableTypes.MedicalDrug
) || [];

const conditions = content.observations?.filter(obs =>
  obs.type === ObservableTypes.MedicalCondition
) || [];

// Co-occurrence analysis
const relationships: Array<{ drug: string; condition: string; confidence: number }> = [];

drugs.forEach(drug => {
  conditions.forEach(condition => {
    // Check if they appear on same pages
    const drugPages = new Set(drug.occurrences?.map(occ => occ.pageIndex));
    const condPages = new Set(condition.occurrences?.map(occ => occ.pageIndex));
    
    const sharedPages = Array.from(drugPages).filter(p => condPages.has(p));
    
    if (sharedPages.length > 0) {
      // Calculate average confidence
      const avgConf = (
        (drug.occurrences?.reduce((sum, occ) => sum + occ.confidence, 0) || 0) /
        (drug.occurrences?.length || 1) +
        (condition.occurrences?.reduce((sum, occ) => sum + occ.confidence, 0) || 0) /
        (condition.occurrences?.length || 1)
      ) / 2;
      
      relationships.push({
        drug: drug.observable.name,
        condition: condition.observable.name,
        confidence: avgConf
      });
    }
  });
});

console.log('Drug-Condition relationships:');
relationships
  .sort((a, b) => b.confidence - a.confidence)
  .slice(0, 5)
  .forEach(({ drug, condition, confidence }) => {
    console.log(`  ${drug} → ${condition} (confidence: ${confidence.toFixed(2)})`);
  });

// Step 7: Query medical knowledge graph
console.log('\nStep 7: Querying medical knowledge graph...\n');

// Get all conditions across all documents
const allConditions = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalCondition] }
});

console.log(`Total conditions in knowledge graph: ${allConditions.observables.results.length}`);

// Get all drugs
const allDrugs = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalDrug] }
});

console.log(`Total drugs in knowledge graph: ${allDrugs.observables.results.length}`);

console.log('\n✓ Medical knowledge graph complete!');

Step-by-Step Explanation

Step 1: Understanding Medical Entity Types

Graphlit supports 12 medical entity types (all fully supported, not beta):

Core Clinical Entities:

  1. MedicalCondition:

    • Diseases, symptoms, diagnoses

    • Examples: "Type 2 diabetes", "hypertension", "chest pain", "COVID-19"

    • Schema.org: @type: "MedicalCondition"

  2. MedicalDrug:

    • Specific medications, pharmaceuticals

    • Examples: "metformin", "lisinopril", "aspirin", "Pfizer-BioNTech vaccine"

    • Schema.org: @type: "Drug"

  3. MedicalDrugClass:

    • Categories of drugs

    • Examples: "antibiotics", "beta-blockers", "statins", "ACE inhibitors"

    • Schema.org: @type: "DrugClass"

  4. MedicalProcedure:

    • Surgeries, treatments, interventions

    • Examples: "coronary artery bypass", "hip replacement", "chemotherapy"

    • Schema.org: @type: "MedicalProcedure"

  5. MedicalTest:

    • Diagnostic tests, lab tests

    • Examples: "HbA1c test", "MRI scan", "blood pressure measurement"

    • Schema.org: @type: "MedicalTest"

Advanced Medical Entities:

  1. MedicalStudy:

    • Clinical trials, research studies

    • Examples: "Phase III trial", "randomized controlled trial", "cohort study"

    • Schema.org: @type: "MedicalStudy"

  2. MedicalDevice:

    • Medical equipment, implants

    • Examples: "pacemaker", "insulin pump", "surgical robot", "stent"

    • Schema.org: @type: "MedicalDevice"

  3. MedicalTherapy:

    • Therapies, treatment approaches

    • Examples: "physical therapy", "radiation therapy", "cognitive behavioral therapy"

    • Schema.org: @type: "MedicalTherapy"

  4. MedicalGuideline:

    • Clinical guidelines, protocols

    • Examples: "WHO guidelines", "treatment protocol", "diagnostic criteria"

    • Schema.org: @type: "MedicalGuideline"

  5. MedicalIndication:

    • Reasons for treatment

    • Examples: "indicated for hypertension", "approved for diabetes management"

    • Schema.org: @type: "MedicalIndication"

  6. MedicalContraindication:

    • Reasons to avoid treatment

    • Examples: "contraindicated in pregnancy", "not for use with kidney disease"

    • Schema.org: @type: "MedicalContraindication"

  7. MedicalRiskFactor (if supported):

    • Risk factors for conditions

    • Examples: "smoking", "obesity", "family history"

Step 2: Model Selection for Medical Content

GPT-4 (Recommended for Medical):

  • Highest accuracy for medical terminology

  • Best understanding of clinical context

  • Lower false positive rate

  • More expensive but worth it for healthcare

GPT-4o:

  • Good balance for less critical medical content

  • Faster processing

  • Lower cost

  • Acceptable for research papers, general medical content

Claude 3.5 Sonnet:

  • Good alternative to GPT-4

  • Strong medical knowledge

  • Handles long clinical documents well

NOT Recommended:

  • Gemini: Less accurate for medical terminology

  • GPT-3.5: Too many medical errors

Step 3: Clinical Document Types

Research Papers:

Clinical Notes (HIPAA considerations):

Drug Information Sheets:

Clinical Guidelines:

Step 4: Medical Entity Relationships

Drug-Condition Relationships:

  • Co-occurrence on same pages

  • "Drug X is indicated for Condition Y"

  • "Patients with Condition Y treated with Drug X"

Procedure-Condition Relationships:

  • "Procedure X performed for Condition Y"

  • Diagnostic procedures for conditions

Drug-Drug Interactions:

  • Contraindications between drugs

  • Combination therapies

Test-Condition Relationships:

  • Diagnostic tests for conditions

  • Monitoring tests for treated conditions
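
The same page-level co-occurrence idea used for drugs and conditions in the code example can be applied to any pair of entity types listed above. The following is a minimal sketch, not part of the Graphlit SDK: the `Observation` and `Occurrence` interfaces are simplified local stand-ins for the fields the example reads (`type`, `observable.name`, `occurrences[].pageIndex`), and `coOccurringPairs` is a hypothetical helper name.

```typescript
// Simplified local shapes (assumed); they mirror only the fields used here.
interface Occurrence { pageIndex?: number | null; }
interface Observation {
  type: string;
  observable: { name: string };
  occurrences?: Occurrence[] | null;
}

interface EntityPair { a: string; b: string; sharedPages: number[]; }

// Collect the pages an observation occurs on; occurrences without a page are skipped.
function pagesOf(obs: Observation): Set<number> {
  const pages = new Set<number>();
  for (const occ of obs.occurrences ?? []) {
    if (typeof occ.pageIndex === 'number') pages.add(occ.pageIndex);
  }
  return pages;
}

// Pair every entity of typeA with every entity of typeB that shares at least one page.
function coOccurringPairs(
  observations: Observation[],
  typeA: string,
  typeB: string
): EntityPair[] {
  const left = observations.filter(o => o.type === typeA);
  const right = observations.filter(o => o.type === typeB);
  const pairs: EntityPair[] = [];

  for (const a of left) {
    const aPages = pagesOf(a);
    for (const b of right) {
      const shared = Array.from(pagesOf(b))
        .filter(p => aPages.has(p))
        .sort((x, y) => x - y);
      if (shared.length > 0) {
        pairs.push({ a: a.observable.name, b: b.observable.name, sharedPages: shared });
      }
    }
  }
  return pairs;
}
```

For test-condition relationships, for example, the helper would be called with the observations of one document and the two type values of interest (such as `ObservableTypes.MedicalTest` and `ObservableTypes.MedicalCondition`). Co-occurrence only suggests a relationship; it does not say whether the test diagnoses or monitors the condition.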

Step 5: Confidence Scoring for Medical Entities

High Confidence (>=0.9):

  • Explicit medical terminology

  • Standard nomenclature (ICD, SNOMED CT terms)

  • Clear clinical context

Medium Confidence (0.7-0.9):

  • Common medical terms

  • Some ambiguity in context

  • Abbreviations with context

Low Confidence (<0.7):

  • Ambiguous terms

  • Incomplete information

  • Uncertain context

Recommended Threshold: >=0.75 for medical applications (higher than general content)
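
The tiers above can be expressed as a small classifier. This is an illustrative sketch: `confidenceTier` and `triage` are hypothetical helper names, the boundaries are the ones listed in this step, and a score of exactly 0.9 is treated as high.

```typescript
type ConfidenceTier = 'high' | 'medium' | 'low';

interface ScoredEntity { name: string; confidence: number; }

// Boundaries from the ranges above: >=0.9 high, 0.7 up to 0.9 medium, below 0.7 low.
function confidenceTier(confidence: number): ConfidenceTier {
  if (confidence >= 0.9) return 'high';
  if (confidence >= 0.7) return 'medium';
  return 'low';
}

// Label each entity with its tier and whether it passes the threshold
// (default 0.75, the recommended value for medical applications).
function triage(entities: ScoredEntity[], threshold = 0.75) {
  return entities.map(e => ({
    ...e,
    tier: confidenceTier(e.confidence),
    accepted: e.confidence >= threshold,
  }));
}
```

Note that a medium-tier entity can still fall below the recommended threshold: a score of 0.72 is "medium" but is not accepted at 0.75.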


Configuration Options

Precision vs Recall Tradeoff

High Precision (fewer false positives):

High Recall (fewer false negatives):

Domain-Specific Extraction

Cardiology:

Oncology:

Pharmacology:


Variations

Variation 1: Drug Information Database

Build comprehensive drug knowledge base:

Variation 2: Clinical Trial Analysis

Analyze clinical trial results:

Variation 3: Adverse Event Monitoring

Track drug side effects and adverse events:

Variation 4: Medical Literature Review

Build knowledge base from research papers:

Variation 5: Treatment Protocol Assistant

RAG-based clinical decision support:


Common Issues & Solutions

Issue: Medical Abbreviations Not Recognized

Problem: "HTN", "DM", "CHF" not extracted as conditions.

Solution: Medical abbreviations often receive low confidence scores. Options:

  1. Use a lower confidence threshold (>=0.6)

  2. Expand abbreviations in preprocessing

  3. Train a medical-specific model (future feature)
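
Option 2 can be done with a simple text pass before ingestion. The sketch below is an assumption about how such preprocessing might look, not a Graphlit feature: `expandAbbreviations` is a hypothetical helper, and the map contains only the three abbreviations named in this issue.

```typescript
// Illustrative map; extend it with the abbreviations used in your own documents.
const ABBREVIATIONS: Record<string, string> = {
  HTN: 'hypertension',
  DM: 'diabetes mellitus',
  CHF: 'congestive heart failure',
};

// Replace whole-word abbreviations with "expansion (ABBR)".
// Matching is case-sensitive so ordinary lower-case words are left alone.
function expandAbbreviations(
  text: string,
  map: Record<string, string> = ABBREVIATIONS
): string {
  const keys = Object.keys(map);
  if (keys.length === 0) return text;

  // Escape regex metacharacters in case a key contains them.
  const escaped = keys.map(k => k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  const pattern = new RegExp(`\\b(${escaped.join('|')})\\b`, 'g');

  return text.replace(pattern, match => `${map[match]} (${match})`);
}
```

Keeping the original abbreviation in parentheses preserves the source wording for later review. Abbreviations with more than one clinical meaning need a context-specific map.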

Issue: False Positives on Common Terms

Problem: "Cold" extracted as MedicalCondition when discussing weather.

Solution: Context-aware filtering:
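
One possible form of context-aware filtering is to keep an ambiguous term only when another medical entity appears on the same page. This is a sketch under stated assumptions: the interfaces are simplified local stand-ins for the observation fields used in the code example, `filterAmbiguous` is a hypothetical helper, and the ambiguous-term list is illustrative rather than exhaustive.

```typescript
// Simplified local shapes (assumed); they mirror only the fields used here.
interface Occurrence { pageIndex?: number | null; }
interface Observation {
  type: string;
  observable: { name: string };
  occurrences?: Occurrence[] | null;
}

// Illustrative list of everyday words that double as medical terms.
const AMBIGUOUS_TERMS = new Set(['cold', 'shock', 'depression']);

function pageSet(obs: Observation): Set<number> {
  const pages = new Set<number>();
  for (const occ of obs.occurrences ?? []) {
    if (typeof occ.pageIndex === 'number') pages.add(occ.pageIndex);
  }
  return pages;
}

// Keep every unambiguous entity. Keep an ambiguous one only if an entity of one of
// the given context types (e.g. drugs, tests) occurs on at least one of its pages.
function filterAmbiguous(
  observations: Observation[],
  contextTypes: string[],
  ambiguous: Set<string> = AMBIGUOUS_TERMS
): Observation[] {
  return observations.filter(obs => {
    if (!ambiguous.has(obs.observable.name.toLowerCase())) return true;
    const pages = pageSet(obs);
    return observations.some(other =>
      other !== obs &&
      contextTypes.includes(other.type) &&
      Array.from(pageSet(other)).some(p => pages.has(p))
    );
  });
}
```

This heuristic trades recall for precision: a genuine mention of a common cold on a page with no other medical entities would be dropped.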

Issue: Missing Drug-Condition Relationships

Problem: Drug and condition mentioned but not linked.

Solution: Use co-occurrence analysis (entities on the same page, as in Step 6 of the code example) or RAG queries.

Issue: HIPAA Compliance Concerns

Problem: Patient names being extracted from clinical notes.

Solution: Don't extract Person entities from patient records:
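
If the `extractedTypes` list is built in code, as in the example above, one way to apply this is to filter out person-identifying types before creating the workflow. `withoutTypes` is a hypothetical helper introduced here for illustration; it is not part of the SDK.

```typescript
// Return a copy of the type list with the excluded types removed.
function withoutTypes<T>(types: readonly T[], excluded: readonly T[]): T[] {
  const blocked = new Set<T>(excluded);
  return types.filter(t => !blocked.has(t));
}
```

For a patient-record workflow, the connector could then be given `withoutTypes(allTypes, [ObservableTypes.Person])` as its `extractedTypes`, where `allTypes` is the list from the main example. This only prevents names from becoming entities in the knowledge graph; the ingested text itself still contains them, so de-identification before ingestion remains a separate step.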

Also implement proper data handling:

  • Encrypt data at rest

  • Access controls

  • Audit logging

  • BAA with Graphlit (if processing PHI)


Developer Hints

Medical Entity Quality by Source

  • High quality: Published research papers, drug information sheets

  • Medium quality: Clinical guidelines, review articles

  • Variable quality: Clinical notes (abbreviations, typos)

Model Recommendations by Use Case

  • Clinical decision support: GPT-4 (highest accuracy required)

  • Research literature review: GPT-4o (good balance)

  • General medical knowledge: Claude 3.5 Sonnet

Confidence Thresholds

  • Regulatory/clinical use: >=0.85

  • Research/analysis: >=0.75

  • Exploratory/discovery: >=0.65
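
These thresholds can be kept in one place and applied when filtering results. The sketch below assumes, as the main example does, that each occurrence carries a `confidence` value and that an entity's score is the mean over its occurrences. The interfaces are simplified local stand-ins, and `acceptFor` is a hypothetical helper name.

```typescript
type UseCase = 'clinical' | 'research' | 'exploratory';

// Thresholds from the list above.
const MIN_CONFIDENCE: Record<UseCase, number> = {
  clinical: 0.85,
  research: 0.75,
  exploratory: 0.65,
};

// Simplified local shapes (assumed); they mirror only the fields used here.
interface Occurrence { confidence?: number | null; }
interface Observation {
  observable: { name: string };
  occurrences?: Occurrence[] | null;
}

// Mean confidence over occurrences that carry a score; 0 when none do.
function meanConfidence(obs: Observation): number {
  const scores = (obs.occurrences ?? [])
    .map(o => o.confidence)
    .filter((c): c is number => typeof c === 'number');
  if (scores.length === 0) return 0;
  return scores.reduce((sum, c) => sum + c, 0) / scores.length;
}

// Keep only the observations whose mean confidence meets the use case's threshold.
function acceptFor(useCase: UseCase, observations: Observation[]): Observation[] {
  const min = MIN_CONFIDENCE[useCase];
  return observations.filter(o => meanConfidence(o) >= min);
}
```

Treating unscored entities as 0 means they are always rejected, which is the conservative choice for clinical use.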

HIPAA and Privacy

  • Graphlit is HIPAA-compliant when properly configured

  • Sign BAA (Business Associate Agreement)

  • Use encryption, access controls

  • Don't extract identifiable patient information

  • Consider de-identification before ingestion

Performance Optimization

  • Medical extraction is slower (complex terminology)

  • Expect 20-30% longer processing than general content

  • Batch process overnight for large volumes

  • Cache commonly queried entities
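
Caching can be as simple as an in-memory map with a time-to-live. The following is a generic sketch, not a Graphlit feature: `TtlCache` is a hypothetical class, and the clock is injectable so expiry can be exercised without waiting.

```typescript
// Small in-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  // Store a value and start its time-to-live from the current clock reading.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A typical use would be to key the cache by entity type and store the result list of a `queryObservables()` call, falling back to the API only on a miss. Choose the time-to-live with ingestion frequency in mind, since cached results will not reflect newly extracted entities until they expire.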


Production Patterns

Healthcare Use Cases

  • Clinical decision support: Query guidelines by condition

  • Drug information lookup: Interactive drug database

  • Adverse event monitoring: Track side effects across reports

  • Literature review: Automated systematic reviews

  • Treatment protocol matching: Match patients to protocols

  • Medical education: Interactive medical knowledge base

Compliance Considerations

  • PHI (Protected Health Information) requires HIPAA compliance

  • De-identify data when possible

  • Implement access controls

  • Audit all queries

  • Regular security assessments

  • Data retention policies

