Extract Medical Entities from Clinical Content

User Intent

"How do I extract medical entities (conditions, drugs, procedures, tests) from clinical documents and research papers? Show me how to build medical knowledge graphs for healthcare applications."

Operation

SDK Methods: createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()

GraphQL: Medical content ingestion + extraction of 12 medical entity types

Entity: Medical Content → Observations → Medical Observables (Clinical Knowledge Graph)

Prerequisites

  • Graphlit project with API credentials

  • Medical/clinical documents (PDFs, research papers, clinical notes)

  • Understanding of medical entity types

  • Appropriate data privacy/HIPAA compliance measures
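Before running the example, it can help to confirm that the API credentials are actually present. The sketch below assumes the client reads GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID, and GRAPHLIT_JWT_SECRET from the environment; treat those names as assumptions and verify them against your SDK version. The helper name missingCredentials is hypothetical.

```typescript
// Assumed environment variable names; verify against your SDK version.
const REQUIRED_VARS = [
  'GRAPHLIT_ORGANIZATION_ID',
  'GRAPHLIT_ENVIRONMENT_ID',
  'GRAPHLIT_JWT_SECRET',
] as const;

// Returns the names of required variables that are missing or blank.
function missingCredentials(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter(name => !env[name]?.trim());
}
```

In a Node.js script you could call missingCredentials(process.env) at startup and exit with a clear message if the returned list is not empty.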


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  EntityExtractionServiceTypes,
  FilePreparationServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  OpenAIModels,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Medical Knowledge Graph ===\n');

// Step 1: Create high-quality medical extraction workflow
console.log('Step 1: Creating medical entity extraction workflow...');

// Use GPT-4 for medical accuracy
const spec = await graphlit.createSpecification({
  name: "GPT-4 Medical Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4,    // Best quality for medical
    temperature: 0.1              // Low temperature for consistency
  }
});

const workflow = await graphlit.createWorkflow({
  name: "Medical Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // PDFs, Word, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          // Medical entity types
          ObservableTypes.MedicalCondition,         // Diseases, symptoms, diagnoses
          ObservableTypes.MedicalDrug,              // Medications, pharmaceuticals
          ObservableTypes.MedicalDrugClass,         // Drug categories (antibiotics, etc.)
          ObservableTypes.MedicalProcedure,         // Surgeries, treatments
          ObservableTypes.MedicalTest,              // Lab tests, diagnostics
          ObservableTypes.MedicalStudy,             // Clinical trials, research
          ObservableTypes.MedicalDevice,            // Medical equipment, implants
          ObservableTypes.MedicalTherapy,           // Therapies, treatments
          ObservableTypes.MedicalGuideline,         // Clinical guidelines, protocols
          ObservableTypes.MedicalIndication,        // Reasons for treatment
          ObservableTypes.MedicalContraindication,  // Reasons to avoid treatment
          
          // Also extract non-medical entities for context
          ObservableTypes.Person,                   // Patients, doctors, researchers
          ObservableTypes.Organization              // Hospitals, pharma companies
        ]
      }
    }]
  },
  specification: { id: spec.createSpecification.id }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Ingest clinical research paper
console.log('Step 2: Ingesting clinical research paper...');
const paper = await graphlit.ingestUri('https://example.com/papers/clinical-trial.pdf', "Clinical Trial: Drug X for Condition Y", undefined, undefined, undefined, { id: workflow.createWorkflow.id  });

console.log(`✓ Ingested: ${paper.ingestUri.id}\n`);

// Step 3: Wait for extraction
console.log('Step 3: Extracting medical entities...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(paper.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    console.log('  Processing...');
    await new Promise(resolve => setTimeout(resolve, 3000));
  }
}
console.log('✓ Extraction complete\n');

// Step 4: Retrieve extracted entities
console.log('Step 4: Retrieving medical entities...');
const paperDetails = await graphlit.getContent(paper.ingestUri.id);
const content = paperDetails.content;

console.log(`✓ Document: ${content.name}`);
console.log(`  Pages: ${content.document?.pageCount}`);
console.log(`  Total entities: ${content.observations?.length || 0}\n`);

// Step 5: Analyze by medical entity type
console.log('Step 5: Analyzing medical entities...\n');

const medicalTypes = [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalDrugClass,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.MedicalTest,
  ObservableTypes.MedicalStudy,
  ObservableTypes.MedicalDevice,
  ObservableTypes.MedicalTherapy
];

medicalTypes.forEach(type => {
  const entities = content.observations?.filter(obs => obs.type === type) || [];
  const unique = new Set(entities.map(e => e.observable.name));
  
  if (unique.size > 0) {
    console.log(`${type} (${unique.size}):`);
    Array.from(unique).slice(0, 5).forEach(name => {
      console.log(`  - ${name}`);
    });
    if (unique.size > 5) {
      console.log(`  ... and ${unique.size - 5} more`);
    }
    console.log();
  }
});

// Step 6: Build drug-condition relationships
console.log('Step 6: Analyzing drug-condition relationships...\n');

const drugs = content.observations?.filter(obs => 
  obs.type === ObservableTypes.MedicalDrug
) || [];

const conditions = content.observations?.filter(obs =>
  obs.type === ObservableTypes.MedicalCondition
) || [];

// Co-occurrence analysis
const relationships: Array<{ drug: string; condition: string; confidence: number }> = [];

drugs.forEach(drug => {
  conditions.forEach(condition => {
    // Check if they appear on same pages
    const drugPages = new Set(drug.occurrences?.map(occ => occ.pageIndex));
    const condPages = new Set(condition.occurrences?.map(occ => occ.pageIndex));
    
    const sharedPages = Array.from(drugPages).filter(p => condPages.has(p));
    
    if (sharedPages.length > 0) {
      // Calculate average confidence
      const avgConf = (
        (drug.occurrences?.reduce((sum, occ) => sum + (occ.confidence ?? 0), 0) || 0) /
        (drug.occurrences?.length || 1) +
        (condition.occurrences?.reduce((sum, occ) => sum + (occ.confidence ?? 0), 0) || 0) /
        (condition.occurrences?.length || 1)
      ) / 2;
      
      relationships.push({
        drug: drug.observable.name,
        condition: condition.observable.name,
        confidence: avgConf
      });
    }
  });
});

console.log('Drug-Condition relationships:');
relationships
  .sort((a, b) => b.confidence - a.confidence)
  .slice(0, 5)
  .forEach(({ drug, condition, confidence }) => {
    console.log(`  ${drug} → ${condition} (confidence: ${confidence.toFixed(2)})`);
  });

// Step 7: Query medical knowledge graph
console.log('\nStep 7: Querying medical knowledge graph...\n');

// Get all conditions across all documents
const allConditions = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalCondition] }
});

console.log(`Total conditions in knowledge graph: ${allConditions.observables.results.length}`);

// Get all drugs
const allDrugs = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalDrug] }
});

console.log(`Total drugs in knowledge graph: ${allDrugs.observables.results.length}`);

console.log('\n✓ Medical knowledge graph complete!');
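The polling loop in Step 3 has no upper bound, so a stalled ingestion would keep it running indefinitely. A minimal sketch of a bounded wait follows. The helper waitUntilDone and its default interval and timeout are hypothetical and not part of the SDK; the check function is whatever status call you supply.

```typescript
// Polls `check` until it resolves to true or the time budget runs out.
// Returns true when the check passed, false on timeout.
async function waitUntilDone(
  check: () => Promise<boolean>,
  intervalMs: number = 3000,
  timeoutMs: number = 10 * 60 * 1000
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return true;
    // Stop if the next sleep would run past the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

With the client from the example above, the check could wrap the isContentDone() call for paper.ingestUri.id and return its result flag. A false return value signals a timeout that the caller should report rather than ignore.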

Step-by-Step Explanation

Step 1: Understanding Medical Entity Types

Graphlit supports 12 medical entity types (all fully supported, not beta):

Core Clinical Entities:

  1. MedicalCondition:

    • Diseases, symptoms, diagnoses

    • Examples: "Type 2 diabetes", "hypertension", "chest pain", "COVID-19"

    • Schema.org: @type: "MedicalCondition"

  2. MedicalDrug:

    • Specific medications, pharmaceuticals

    • Examples: "metformin", "lisinopril", "aspirin", "Pfizer-BioNTech vaccine"

    • Schema.org: @type: "Drug"

  3. MedicalDrugClass:

    • Categories of drugs

    • Examples: "antibiotics", "beta-blockers", "statins", "ACE inhibitors"

    • Schema.org: @type: "DrugClass"

  4. MedicalProcedure:

    • Surgeries, treatments, interventions

    • Examples: "coronary artery bypass", "hip replacement", "chemotherapy"

    • Schema.org: @type: "MedicalProcedure"

  5. MedicalTest:

    • Diagnostic tests, lab tests

    • Examples: "HbA1c test", "MRI scan", "blood pressure measurement"

    • Schema.org: @type: "MedicalTest"

Advanced Medical Entities:

  1. MedicalStudy:

    • Clinical trials, research studies

    • Examples: "Phase III trial", "randomized controlled trial", "cohort study"

    • Schema.org: @type: "MedicalStudy"

  2. MedicalDevice:

    • Medical equipment, implants

    • Examples: "pacemaker", "insulin pump", "surgical robot", "stent"

    • Schema.org: @type: "MedicalDevice"

  3. MedicalTherapy:

    • Therapies, treatment approaches

    • Examples: "physical therapy", "radiation therapy", "cognitive behavioral therapy"

    • Schema.org: @type: "MedicalTherapy"

  4. MedicalGuideline:

    • Clinical guidelines, protocols

    • Examples: "WHO guidelines", "treatment protocol", "diagnostic criteria"

    • Schema.org: @type: "MedicalGuideline"

  5. MedicalIndication:

    • Reasons for treatment

    • Examples: "indicated for hypertension", "approved for diabetes management"

    • Schema.org: @type: "MedicalIndication"

  6. MedicalContraindication:

    • Reasons to avoid treatment

    • Examples: "contraindicated in pregnancy", "not for use with kidney disease"

    • Schema.org: @type: "MedicalContraindication"

  7. MedicalRiskFactor (only if your SDK version exposes this type):

    • Risk factors for conditions

    • Examples: "smoking", "obesity", "family history"
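The Schema.org mappings listed above can be captured in a small lookup table, which is handy when exporting entities as JSON-LD. This is a sketch: the keys are the entity type names from this page written as plain strings, and the helper toJsonLd is hypothetical. MedicalRiskFactor is left out because no Schema.org type is given for it above.

```typescript
// Entity type name (as used on this page) -> Schema.org @type.
const SCHEMA_ORG_TYPE: Record<string, string> = {
  MedicalCondition: 'MedicalCondition',
  MedicalDrug: 'Drug',
  MedicalDrugClass: 'DrugClass',
  MedicalProcedure: 'MedicalProcedure',
  MedicalTest: 'MedicalTest',
  MedicalStudy: 'MedicalStudy',
  MedicalDevice: 'MedicalDevice',
  MedicalTherapy: 'MedicalTherapy',
  MedicalGuideline: 'MedicalGuideline',
  MedicalIndication: 'MedicalIndication',
  MedicalContraindication: 'MedicalContraindication',
};

interface JsonLdEntity {
  '@context': string;
  '@type': string;
  name: string;
}

// Builds a minimal JSON-LD object, or undefined for unmapped types.
function toJsonLd(entityType: string, name: string): JsonLdEntity | undefined {
  const schemaType = SCHEMA_ORG_TYPE[entityType];
  if (!schemaType) return undefined;
  return { '@context': 'https://schema.org', '@type': schemaType, name };
}
```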

Step 2: Model Selection for Medical Content

GPT-4 (Recommended for Medical):

  • Highest accuracy for medical terminology

  • Best understanding of clinical context

  • Lower false positive rate

  • More expensive but worth it for healthcare

GPT-4o:

  • Good balance for less critical medical content

  • Faster processing

  • Lower cost

  • Acceptable for research papers, general medical content

Claude 3.5 Sonnet:

  • Good alternative to GPT-4

  • Strong medical knowledge

  • Handles long clinical documents well

NOT Recommended:

  • Gemini: Less accurate for medical terminology

  • GPT-3.5: Too many medical errors

Step 3: Clinical Document Types

Research Papers:

// PubMed, arXiv medical papers
extractedTypes: [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalStudy,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.Person,  // Authors, researchers
  ObservableTypes.Organization  // Institutions
]

Clinical Notes (HIPAA considerations):

// Patient records, clinical summaries
extractedTypes: [
  ObservableTypes.MedicalCondition,  // Diagnoses
  ObservableTypes.MedicalDrug,       // Medications
  ObservableTypes.MedicalProcedure,  // Treatments
  ObservableTypes.MedicalTest        // Lab results
  // NOTE: Do NOT extract Person for patient privacy
]

Drug Information Sheets:

// Prescribing information, package inserts
extractedTypes: [
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalDrugClass,
  ObservableTypes.MedicalIndication,
  ObservableTypes.MedicalContraindication,
  ObservableTypes.MedicalCondition  // What it treats
]

Clinical Guidelines:

// Treatment protocols, best practices
extractedTypes: [
  ObservableTypes.MedicalGuideline,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.MedicalTest,
  ObservableTypes.MedicalCondition
]
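The four presets above can be kept in one place so each workflow picks its entity list by document kind. The sketch below uses plain strings that mirror the ObservableTypes member names; DocumentKind, EXTRACTION_PRESETS, and extractedTypesFor are hypothetical names, not SDK exports.

```typescript
type DocumentKind = 'researchPaper' | 'clinicalNote' | 'drugSheet' | 'guideline';

// Strings mirror the ObservableTypes members used in the snippets above.
const EXTRACTION_PRESETS: Record<DocumentKind, readonly string[]> = {
  researchPaper: ['MedicalCondition', 'MedicalDrug', 'MedicalStudy', 'MedicalProcedure', 'Person', 'Organization'],
  clinicalNote: ['MedicalCondition', 'MedicalDrug', 'MedicalProcedure', 'MedicalTest'],
  drugSheet: ['MedicalDrug', 'MedicalDrugClass', 'MedicalIndication', 'MedicalContraindication', 'MedicalCondition'],
  guideline: ['MedicalGuideline', 'MedicalProcedure', 'MedicalTest', 'MedicalCondition'],
};

// Returns a copy of the preset. When the source may contain PHI,
// Person is stripped regardless of the preset.
function extractedTypesFor(kind: DocumentKind, containsPhi: boolean = false): string[] {
  const types = [...EXTRACTION_PRESETS[kind]];
  return containsPhi ? types.filter(t => t !== 'Person') : types;
}
```

Passing containsPhi as true is a guard for cases where a research-style document still carries patient names.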

Step 4: Medical Entity Relationships

Drug-Condition Relationships:

  • Co-occurrence on same pages

  • "Drug X is indicated for Condition Y"

  • "Patients with Condition Y treated with Drug X"

Procedure-Condition Relationships:

  • "Procedure X performed for Condition Y"

  • Diagnostic procedures for conditions

Drug-Drug Interactions:

  • Contraindications between drugs

  • Combination therapies

Test-Condition Relationships:

  • Diagnostic tests for conditions

  • Monitoring tests for treated conditions
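The same-page co-occurrence idea applies to every pairing above (drug and condition, procedure and condition, test and condition). A generic sketch over plain data follows. The PageMention shape and the coOccurrences helper are hypothetical; you would build the inputs from each observation's name and the page indexes of its occurrences.

```typescript
interface PageMention {
  name: string;
  pages: number[]; // page indexes where the entity occurs
}

interface CoOccurrence {
  left: string;
  right: string;
  sharedPages: number[];
}

// Pairs every left entity with every right entity that shares at least one page.
function coOccurrences(left: PageMention[], right: PageMention[]): CoOccurrence[] {
  const pairs: CoOccurrence[] = [];
  for (const l of left) {
    const leftPages = new Set(l.pages);
    for (const r of right) {
      // De-duplicate the right side's pages before intersecting.
      const sharedPages = Array.from(new Set(r.pages)).filter(p => leftPages.has(p));
      if (sharedPages.length > 0) {
        pairs.push({ left: l.name, right: r.name, sharedPages });
      }
    }
  }
  return pairs;
}
```

Co-occurrence only suggests a relationship. It cannot distinguish "Drug X treats Condition Y" from "Drug X is contraindicated in Condition Y", so treat the pairs as candidates for review.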

Step 5: Confidence Scoring for Medical Entities

High Confidence (>=0.9):

  • Explicit medical terminology

  • Standard nomenclature (ICD, SNOMED CT terms)

  • Clear clinical context

Medium Confidence (0.7-0.9):

  • Common medical terms

  • Some ambiguity in context

  • Abbreviations with context

Low Confidence (<0.7):

  • Ambiguous terms

  • Incomplete information

  • Uncertain context

Recommended Threshold: >=0.75 for medical applications (higher than general content)
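The bands above translate directly into a small classifier. This is a sketch using the cut-offs from this section (0.9 and 0.7) and the recommended 0.75 threshold; the function names are hypothetical.

```typescript
type ConfidenceBand = 'high' | 'medium' | 'low';

// Maps a confidence score to the bands described above.
function confidenceBand(confidence: number): ConfidenceBand {
  if (confidence >= 0.9) return 'high';
  if (confidence >= 0.7) return 'medium';
  return 'low';
}

// Applies the recommended threshold for medical applications.
function meetsMedicalThreshold(confidence: number, threshold: number = 0.75): boolean {
  return confidence >= threshold;
}
```

Note that a score of 0.72 lands in the medium band but still falls below the recommended 0.75 threshold.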


Configuration Options

Precision vs Recall Tradeoff

High Precision (fewer false positives):

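// Use GPT-4 with a very low temperature
const preciseSpec = await graphlit.createSpecification({
  name: "GPT-4 High Precision",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4,
    temperature: 0.05
  }
});

// Keep only entities whose every occurrence meets a high confidence threshold
const observations = content.observations || [];
const highConfidence = observations.filter(obs =>
  obs.occurrences?.every(occ => (occ.confidence ?? 0) >= 0.85)
);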

High Recall (fewer false negatives):

// Extract all possible entities, filter later
extractedTypes: [
  // Every medical type (allMedicalTypes is an array you define, e.g. the list from the complete example)
  ...allMedicalTypes
]

// Lower confidence threshold
const allEntities = observations.filter(obs =>
  obs.occurrences?.some(occ => occ.confidence >= 0.6)
);
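The two filters differ only in whether every occurrence or any occurrence must clear the threshold. A sketch that makes the choice explicit is below. The ScoredObservation shape is a simplified stand-in for the SDK's observation type, and filterByConfidence is a hypothetical helper. Unlike a bare every() call, it rejects observations with no occurrences, since every() is vacuously true on an empty array.

```typescript
interface ScoredOccurrence {
  confidence?: number | null;
}

interface ScoredObservation {
  name: string;
  occurrences?: ScoredOccurrence[] | null;
}

// 'precision': all occurrences must meet the threshold.
// 'recall': at least one occurrence must meet the threshold.
function filterByConfidence(
  observations: ScoredObservation[],
  threshold: number,
  mode: 'precision' | 'recall'
): ScoredObservation[] {
  return observations.filter(obs => {
    // Missing confidence values are treated as 0.
    const scores = (obs.occurrences ?? []).map(occ => occ.confidence ?? 0);
    if (scores.length === 0) return false;
    return mode === 'precision'
      ? scores.every(s => s >= threshold)
      : scores.some(s => s >= threshold);
  });
}
```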

Domain-Specific Extraction

Cardiology:

extractedTypes: [
  ObservableTypes.MedicalCondition,  // Heart diseases
  ObservableTypes.MedicalProcedure,  // Cardiac procedures
  ObservableTypes.MedicalDevice,     // Pacemakers, stents
  ObservableTypes.MedicalDrug,       // Cardiac medications
  ObservableTypes.MedicalTest        // ECG, stress tests
]

Oncology:

extractedTypes: [
  ObservableTypes.MedicalCondition,  // Cancer types
  ObservableTypes.MedicalTherapy,    // Chemotherapy, radiation
  ObservableTypes.MedicalDrug,       // Cancer drugs
  ObservableTypes.MedicalStudy,      // Clinical trials
  ObservableTypes.MedicalProcedure   // Surgeries, biopsies
]

Pharmacology:

extractedTypes: [
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalDrugClass,
  ObservableTypes.MedicalIndication,
  ObservableTypes.MedicalContraindication,
  ObservableTypes.MedicalCondition
]

Variations

Variation 1: Drug Information Database

Build comprehensive drug knowledge base:

// Ingest drug information sheets
const drugDocs = [
  'https://example.com/drugs/metformin-info.pdf',
  'https://example.com/drugs/lisinopril-info.pdf',
  // ... more drugs
];

const drugWorkflow = await graphlit.createWorkflow({
  name: "Drug Information Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalDrugClass,
          ObservableTypes.MedicalIndication,
          ObservableTypes.MedicalContraindication,
          ObservableTypes.MedicalCondition
        ]
      }
    }]
  }
});

// Ingest all drug docs
await Promise.all(
  drugDocs.map(uri =>
    graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: drugWorkflow.createWorkflow.id })
  )
);

// Query drug database
const metformin = await graphlit.queryObservables({
  search: "metformin",
  filter: { types: [ObservableTypes.MedicalDrug] }
});

// Find what conditions it treats
const conditions = await graphlit.queryContents({
  filter: {
    observations: [
      { type: ObservableTypes.MedicalDrug, observable: { id: metformin.observables.results[0].observable.id } },
      { type: ObservableTypes.MedicalIndication, observable: { /* any indication */ } }
    ]
  }
});

Variation 2: Clinical Trial Analysis

Analyze clinical trial results:

const trialWorkflow = await graphlit.createWorkflow({
  name: "Clinical Trial Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.MedicalStudy,
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalCondition,
          ObservableTypes.MedicalProcedure,
          ObservableTypes.Person,          // Principal investigators
          ObservableTypes.Organization     // Sponsors
        ]
      }
    }]
  }
});

// Ingest clinical trial paper
const trial = await graphlit.ingestUri('https://clinicaltrials.gov/study/NCT12345678/document.pdf', undefined, undefined, undefined, undefined, { id: trialWorkflow.createWorkflow.id  });

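// Wait for extraction to finish, then analyze
while (!(await graphlit.isContentDone(trial.ingestUri.id)).isContentDone.result) {
  await new Promise(resolve => setTimeout(resolve, 3000));
}
const trialDetails = await graphlit.getContent(trial.ingestUri.id);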

// Extract trial metadata
const studyType = trialDetails.content.observations
  ?.find(obs => obs.type === ObservableTypes.MedicalStudy);

const drugTested = trialDetails.content.observations
  ?.find(obs => obs.type === ObservableTypes.MedicalDrug);

const conditionTreated = trialDetails.content.observations
  ?.find(obs => obs.type === ObservableTypes.MedicalCondition);

console.log(`Study: ${studyType?.observable.name}`);
console.log(`Drug: ${drugTested?.observable.name}`);
console.log(`Condition: ${conditionTreated?.observable.name}`);

Variation 3: Adverse Event Monitoring

Track drug side effects and adverse events:

// Process adverse event reports
const adverseWorkflow = await graphlit.createWorkflow({
  name: "Adverse Event Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalCondition,  // Side effects
          ObservableTypes.MedicalContraindication
        ]
      }
    }]
  }
});

// Ingest multiple adverse event reports
// ... (similar to above)

// Query for drug-side effect relationships
const drugId = 'drug-observable-id';
const adverseEvents = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.MedicalDrug,
      observable: { id: drugId }
    }]
  }
});

// Extract side effects co-occurring with drug
const sideEffects = new Map<string, number>();
adverseEvents.contents.results.forEach(report => {
  report.observations
    ?.filter(obs => obs.type === ObservableTypes.MedicalCondition)
    .forEach(obs => {
      sideEffects.set(
        obs.observable.name,
        (sideEffects.get(obs.observable.name) || 0) + 1
      );
    });
});

console.log('Common side effects:');
Array.from(sideEffects.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10)
  .forEach(([effect, count]) => {
    console.log(`  ${effect}: ${count} reports`);
  });

Variation 4: Medical Literature Review

Build knowledge base from research papers:

// Process PubMed papers on specific topic
const reviewWorkflow = await graphlit.createWorkflow({
  name: "Literature Review Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.MedicalCondition,
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalProcedure,
          ObservableTypes.MedicalStudy,
          ObservableTypes.Person,          // Authors
          ObservableTypes.Organization     // Institutions
        ]
      }
    }]
  }
});

// Ingest collection of papers
const papers = [
  'https://pubmed.ncbi.nlm.nih.gov/paper1.pdf',
  'https://pubmed.ncbi.nlm.nih.gov/paper2.pdf',
  // ... more papers
];

await Promise.all(
  papers.map(uri =>
    graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: reviewWorkflow.createWorkflow.id })
  )
);

// Analyze trends
const allConditions = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalCondition] }
});

// Find most researched conditions
const researchCounts = new Map<string, number>();

for (const condition of allConditions.observables.results) {
  // Count the papers that mention this condition
  const matches = await graphlit.queryContents({
    filter: {
      observations: [{
        type: ObservableTypes.MedicalCondition,
        observable: { id: condition.observable.id }
      }]
    }
  });
  
  researchCounts.set(condition.observable.name, matches.contents.results.length);
}

console.log('Most researched conditions:');
Array.from(researchCounts.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10)
  .forEach(([condition, count]) => {
    console.log(`  ${condition}: ${count} papers`);
  });

Variation 5: Treatment Protocol Assistant

RAG-based clinical decision support:

// After ingesting clinical guidelines and protocols
const conversation = await graphlit.createConversation({
  name: "Treatment Protocol Assistant"
});

// Query for treatment recommendations
// RAG will search across all ingested guidelines
const response = await graphlit.promptConversation(
  "What is the recommended treatment protocol for a patient with Type 2 diabetes and hypertension?",
  conversation.createConversation.id
);

console.log('Treatment Recommendation:');
console.log(response.promptConversation?.message?.message);

// Extract structured treatment plan
const structured = await graphlit.promptConversation(
  "Based on the guidelines, provide a structured treatment plan with: 1) First-line medications, 2) Monitoring tests, 3) Lifestyle modifications, 4) Follow-up schedule. Format as JSON.",
  conversation.createConversation.id
);

console.log('\nStructured Plan:');
console.log(structured.promptConversation?.message?.message);

Common Issues & Solutions

Issue: Medical Abbreviations Not Recognized

Problem: "HTN", "DM", "CHF" not extracted as conditions.

Solution: Medical abbreviations may have low confidence. Either:

  1. Use a lower confidence threshold (>=0.6)

  2. Expand abbreviations in preprocessing

  3. Train on medical-specific model (future feature)
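Option 2 can be sketched as a plain text transform applied before ingestion. The dictionary below is a tiny illustrative sample covering only the three abbreviations from the problem statement, and expandAbbreviations is a hypothetical helper. A real deployment would need a curated, specialty-specific dictionary, because many clinical abbreviations are ambiguous.

```typescript
// Illustrative sample only; extend with a curated dictionary.
const ABBREVIATIONS: Record<string, string> = {
  HTN: 'hypertension',
  DM: 'diabetes mellitus',
  CHF: 'congestive heart failure',
};

// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replaces whole-word, case-sensitive abbreviations with
// "expansion (ABBR)" so the original token stays visible.
function expandAbbreviations(
  text: string,
  dictionary: Record<string, string> = ABBREVIATIONS
): string {
  const keys = Object.keys(dictionary);
  if (keys.length === 0) return text;
  const pattern = new RegExp(`\\b(${keys.map(escapeRegExp).join('|')})\\b`, 'g');
  return text.replace(pattern, match => `${dictionary[match]} (${match})`);
}
```

Keeping the abbreviation in parentheses preserves traceability to the original note. Matching is case-sensitive on purpose, so lowercase words such as "dm" are left alone.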

Issue: False Positives on Common Terms

Problem: "Cold" extracted as MedicalCondition when discussing weather.

Solution: Context-aware filtering:

// Check surrounding context or confidence
const validConditions = conditions.filter(cond =>
  cond.occurrences?.some(occ => occ.confidence >= 0.8)
);

Issue: Missing Drug-Condition Relationships

Problem: Drug and condition mentioned but not linked.

Solution: Use co-occurrence analysis (same page) or RAG queries:

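// Find relationships via RAG, scoped to one document
const scoped = await graphlit.createConversation({
  name: "Drug-Condition Lookup",
  filter: { contents: [{ id: documentId }] }
});
const relationship = await graphlit.promptConversation(
  "What conditions is Drug X used to treat according to this document?",
  scoped.createConversation.id
);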

Issue: HIPAA Compliance Concerns

Problem: Patient names being extracted from clinical notes.

Solution: Don't extract Person entities from patient records:

extractedTypes: [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalProcedure
  // DO NOT include ObservableTypes.Person for patient records
]

Also implement proper data handling:

  • Encrypt data at rest

  • Access controls

  • Audit logging

  • BAA with Graphlit (if processing PHI)


Developer Hints

Medical Entity Quality by Source

  • High quality: Published research papers, drug information sheets

  • Medium quality: Clinical guidelines, review articles

  • Variable quality: Clinical notes (abbreviations, typos)

Model Recommendations by Use Case

  • Clinical decision support: GPT-4 (highest accuracy required)

  • Research literature review: GPT-4o (good balance)

  • General medical knowledge: Claude 3.5 Sonnet

Confidence Thresholds

  • Regulatory/clinical use: >=0.85

  • Research/analysis: >=0.75

  • Exploratory/discovery: >=0.65
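These thresholds can live in one table so that each pipeline states its use case rather than a bare number. The sketch below uses the three values listed above; the UseCase labels and the keepForUseCase helper are hypothetical.

```typescript
type UseCase = 'clinical' | 'research' | 'exploratory';

// Minimum confidence per use case, taken from the list above.
const MIN_CONFIDENCE: Record<UseCase, number> = {
  clinical: 0.85,
  research: 0.75,
  exploratory: 0.65,
};

interface ScoredEntity {
  name: string;
  confidence: number;
}

// Keeps only the entities that meet the threshold for the given use case.
function keepForUseCase(entities: ScoredEntity[], useCase: UseCase): ScoredEntity[] {
  return entities.filter(e => e.confidence >= MIN_CONFIDENCE[useCase]);
}
```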

HIPAA and Privacy

  • Graphlit is HIPAA-compliant when properly configured

  • Sign BAA (Business Associate Agreement)

  • Use encryption, access controls

  • Don't extract identifiable patient information

  • Consider de-identification before ingestion
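As a rough illustration of de-identification before ingestion, the sketch below masks three pattern-based identifiers (email addresses, US-style phone numbers, and US Social Security numbers). It is deliberately naive and is not sufficient for HIPAA de-identification, which covers many more identifier categories such as names, dates, addresses, and record numbers. The function name and the placeholder tokens are hypothetical.

```typescript
// Naive, pattern-based masking. NOT a complete de-identification method.
const REDACTION_RULES: Array<{ pattern: RegExp; token: string }> = [
  { pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, token: '[EMAIL]' },
  { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, token: '[SSN]' },
  { pattern: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g, token: '[PHONE]' },
];

// Applies each rule in order and returns the masked text.
function redactIdentifiers(text: string): string {
  return REDACTION_RULES.reduce(
    (masked, rule) => masked.replace(rule.pattern, rule.token),
    text
  );
}
```

For real PHI, use a vetted de-identification process, and review its output before ingesting anything.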

Performance Optimization

  • Medical extraction is slower (complex terminology)

  • Expect 20-30% longer processing than general content

  • Batch process overnight for large volumes

  • Cache commonly queried entities
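Caching commonly queried entities can be as simple as an in-memory map with a time-to-live. The sketch below is one possible shape. TtlCache is a hypothetical helper, the loader is whatever query you pass in, and the injectable clock exists only to make expiry testable.

```typescript
// Minimal in-memory cache with per-entry expiry.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value if still fresh, otherwise calls `load` and stores the result.
  async getOrLoad(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = await load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

A cache keyed by entity type, for example "MedicalDrug", could wrap the queryObservables() call from Step 7. Pick a TTL that is short compared with how often new documents are ingested, since cached results go stale as the graph grows.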


Production Patterns

Healthcare Use Cases

  • Clinical decision support: Query guidelines by condition

  • Drug information lookup: Interactive drug database

  • Adverse event monitoring: Track side effects across reports

  • Literature review: Automated systematic reviews

  • Treatment protocol matching: Match patients to protocols

  • Medical education: Interactive medical knowledge base

Compliance Considerations

  • PHI (Protected Health Information) requires HIPAA compliance

  • De-identify data when possible

  • Implement access controls

  • Audit all queries

  • Regular security assessments

  • Data retention policies

