Extract Medical Entities from Clinical Content
User Intent
"How do I extract medical entities (conditions, drugs, procedures, tests) from clinical documents and research papers? Show me how to build medical knowledge graphs for healthcare applications."
Operation
SDK Methods: createSpecification(), createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()
GraphQL: Medical content ingestion + extraction of 12 medical entity types
Entity: Medical Content → Observations → Medical Observables (Clinical Knowledge Graph)
Prerequisites
Graphlit project with API credentials
Medical/clinical documents (PDFs, research papers, clinical notes)
Understanding of medical entity types
Appropriate data privacy/HIPAA compliance measures
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';
// Single import of every enum used below (the example references EntityExtractionServiceTypes)
import {
  EntityExtractionServiceTypes,
  FilePreparationServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  OpenAIModels,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
console.log('=== Building Medical Knowledge Graph ===\n');
// Step 1: Create high-quality medical extraction workflow
console.log('Step 1: Creating medical entity extraction workflow...');
// Use GPT-4 for medical accuracy
const spec = await graphlit.createSpecification({
  name: "GPT-4 Medical Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAIModels.Gpt4, // Best quality for medical
    temperature: 0.1 // Low temperature for consistency
  }
});
const workflow = await graphlit.createWorkflow({
  name: "Medical Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument // PDFs, Word, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          // Medical entity types covered in this guide
          ObservableTypes.MedicalCondition, // Diseases, symptoms, diagnoses
          ObservableTypes.MedicalDrug, // Medications, pharmaceuticals
          ObservableTypes.MedicalDrugClass, // Drug categories (antibiotics, etc.)
          ObservableTypes.MedicalProcedure, // Surgeries, treatments
          ObservableTypes.MedicalTest, // Lab tests, diagnostics
          ObservableTypes.MedicalStudy, // Clinical trials, research
          ObservableTypes.MedicalDevice, // Medical equipment, implants
          ObservableTypes.MedicalTherapy, // Therapies, treatments
          ObservableTypes.MedicalGuideline, // Clinical guidelines, protocols
          ObservableTypes.MedicalIndication, // Reasons for treatment
          ObservableTypes.MedicalContraindication, // Reasons to avoid treatment
          // Also extract non-medical entities for context
          ObservableTypes.Person, // Patients, doctors, researchers
          ObservableTypes.Organization // Hospitals, pharma companies
        ]
      }
    }]
  },
  specification: { id: spec.createSpecification.id }
});
console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);
// Step 2: Ingest clinical research paper
console.log('Step 2: Ingesting clinical research paper...');
const paper = await graphlit.ingestUri(
  'https://example.com/papers/clinical-trial.pdf',
  "Clinical Trial: Drug X for Condition Y",
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }
);
console.log(`✓ Ingested: ${paper.ingestUri.id}\n`);
// Step 3: Wait for extraction
console.log('Step 3: Extracting medical entities...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(paper.ingestUri.id);
  isDone = status.isContentDone.result;
  if (!isDone) {
    console.log(' Processing...');
    await new Promise(resolve => setTimeout(resolve, 3000));
  }
}
console.log('✓ Extraction complete\n');
// Step 4: Retrieve extracted entities
console.log('Step 4: Retrieving medical entities...');
const paperDetails = await graphlit.getContent(paper.ingestUri.id);
const content = paperDetails.content;
console.log(`✓ Document: ${content.name}`);
console.log(` Pages: ${content.document?.pageCount}`);
console.log(` Total entities: ${content.observations?.length || 0}\n`);
// Step 5: Analyze by medical entity type
console.log('Step 5: Analyzing medical entities...\n');
const medicalTypes = [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalDrugClass,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.MedicalTest,
  ObservableTypes.MedicalStudy,
  ObservableTypes.MedicalDevice,
  ObservableTypes.MedicalTherapy
];
medicalTypes.forEach(type => {
  // Collect the distinct entity names observed for this type
  const entities = content.observations?.filter(obs => obs.type === type) || [];
  const unique = new Set(entities.map(e => e.observable.name));
  if (unique.size > 0) {
    console.log(`${type} (${unique.size}):`);
    Array.from(unique).slice(0, 5).forEach(name => {
      console.log(` - ${name}`);
    });
    if (unique.size > 5) {
      console.log(` ... and ${unique.size - 5} more`);
    }
    console.log();
  }
});
// Step 6: Build drug-condition relationships
console.log('Step 6: Analyzing drug-condition relationships...\n');
const drugs = content.observations?.filter(obs =>
  obs.type === ObservableTypes.MedicalDrug
) || [];
const conditions = content.observations?.filter(obs =>
  obs.type === ObservableTypes.MedicalCondition
) || [];
// Co-occurrence analysis
const relationships: Array<{ drug: string; condition: string; confidence: number }> = [];
drugs.forEach(drug => {
  conditions.forEach(condition => {
    // Check if they appear on same pages
    const drugPages = new Set(drug.occurrences?.map(occ => occ.pageIndex));
    const condPages = new Set(condition.occurrences?.map(occ => occ.pageIndex));
    const sharedPages = Array.from(drugPages).filter(p => condPages.has(p));
    if (sharedPages.length > 0) {
      // Calculate average confidence
      const avgConf = (
        (drug.occurrences?.reduce((sum, occ) => sum + occ.confidence, 0) || 0) /
          (drug.occurrences?.length || 1) +
        (condition.occurrences?.reduce((sum, occ) => sum + occ.confidence, 0) || 0) /
          (condition.occurrences?.length || 1)
      ) / 2;
      relationships.push({
        drug: drug.observable.name,
        condition: condition.observable.name,
        confidence: avgConf
      });
    }
  });
});
console.log('Drug-Condition relationships:');
relationships
  .sort((a, b) => b.confidence - a.confidence)
  .slice(0, 5)
  .forEach(({ drug, condition, confidence }) => {
    console.log(` ${drug} ↔ ${condition} (confidence: ${confidence.toFixed(2)})`);
  });
// Step 7: Query medical knowledge graph
console.log('\nStep 7: Querying medical knowledge graph...\n');
// Get all conditions across all documents
const allConditions = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalCondition] }
});
console.log(`Total conditions in knowledge graph: ${allConditions.observables.results.length}`);
// Get all drugs
const allDrugs = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.MedicalDrug] }
});
console.log(`Total drugs in knowledge graph: ${allDrugs.observables.results.length}`);
console.log('\n✓ Medical knowledge graph complete!');
Step-by-Step Explanation
Step 1: Understanding Medical Entity Types
Graphlit supports 12 medical entity types (all fully supported, not beta):
Core Clinical Entities:
MedicalCondition:
Diseases, symptoms, diagnoses
Examples: "Type 2 diabetes", "hypertension", "chest pain", "COVID-19"
Schema.org:
@type: "MedicalCondition"
MedicalDrug:
Specific medications, pharmaceuticals
Examples: "metformin", "lisinopril", "aspirin", "Pfizer-BioNTech vaccine"
Schema.org:
@type: "Drug"
MedicalDrugClass:
Categories of drugs
Examples: "antibiotics", "beta-blockers", "statins", "ACE inhibitors"
Schema.org:
@type: "DrugClass"
MedicalProcedure:
Surgeries, treatments, interventions
Examples: "coronary artery bypass", "hip replacement", "chemotherapy"
Schema.org:
@type: "MedicalProcedure"
MedicalTest:
Diagnostic tests, lab tests
Examples: "HbA1c test", "MRI scan", "blood pressure measurement"
Schema.org:
@type: "MedicalTest"
Advanced Medical Entities:
MedicalStudy:
Clinical trials, research studies
Examples: "Phase III trial", "randomized controlled trial", "cohort study"
Schema.org:
@type: "MedicalStudy"
MedicalDevice:
Medical equipment, implants
Examples: "pacemaker", "insulin pump", "surgical robot", "stent"
Schema.org:
@type: "MedicalDevice"
MedicalTherapy:
Therapies, treatment approaches
Examples: "physical therapy", "radiation therapy", "cognitive behavioral therapy"
Schema.org:
@type: "MedicalTherapy"
MedicalGuideline:
Clinical guidelines, protocols
Examples: "WHO guidelines", "treatment protocol", "diagnostic criteria"
Schema.org:
@type: "MedicalGuideline"
MedicalIndication:
Reasons for treatment
Examples: "indicated for hypertension", "approved for diabetes management"
Schema.org:
@type: "MedicalIndication"
MedicalContraindication:
Reasons to avoid treatment
Examples: "contraindicated in pregnancy", "not for use with kidney disease"
Schema.org:
@type: "MedicalContraindication"
MedicalRiskFactor (if supported; not used in the workflow above):
Risk factors for conditions
Examples: "smoking", "obesity", "family history"
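The Schema.org mappings listed above can be collected into a lookup table. The sketch below is a hypothetical helper, not part of the Graphlit SDK: `SCHEMA_ORG_TYPES` and `toJsonLd` are illustrative names, and the keys are plain strings standing in for the SDK's `ObservableTypes` values so that the snippet is self-contained.

```typescript
// Hypothetical lookup built from the Schema.org mappings listed above.
// Keys are plain strings standing in for the SDK's ObservableTypes values.
const SCHEMA_ORG_TYPES: Record<string, string> = {
  MedicalCondition: "MedicalCondition",
  MedicalDrug: "Drug",
  MedicalDrugClass: "DrugClass",
  MedicalProcedure: "MedicalProcedure",
  MedicalTest: "MedicalTest",
  MedicalStudy: "MedicalStudy",
  MedicalDevice: "MedicalDevice",
  MedicalTherapy: "MedicalTherapy",
  MedicalGuideline: "MedicalGuideline",
  MedicalIndication: "MedicalIndication",
  MedicalContraindication: "MedicalContraindication",
};

interface JsonLdEntity {
  "@context": string;
  "@type": string;
  name: string;
}

// Returns a minimal JSON-LD object, or undefined for types without a listed
// mapping (for example MedicalRiskFactor).
function toJsonLd(entityType: string, entityName: string): JsonLdEntity | undefined {
  if (!Object.prototype.hasOwnProperty.call(SCHEMA_ORG_TYPES, entityType)) {
    return undefined;
  }
  return {
    "@context": "https://schema.org",
    "@type": SCHEMA_ORG_TYPES[entityType],
    name: entityName,
  };
}
```

For instance, `toJsonLd("MedicalDrug", "metformin")` yields an object whose `@type` is `"Drug"`, matching the mapping given for MedicalDrug.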
Step 2: Model Selection for Medical Content
GPT-4 (Recommended for Medical):
Highest accuracy for medical terminology
Best understanding of clinical context
Lower false positive rate
More expensive but worth it for healthcare
GPT-4o:
Good balance for less critical medical content
Faster processing
Lower cost
Acceptable for research papers, general medical content
Claude 3.5 Sonnet:
Good alternative to GPT-4
Strong medical knowledge
Handles long clinical documents well
NOT Recommended:
Gemini: Less accurate for medical terminology
GPT-3.5: Too many medical errors
Step 3: Clinical Document Types
Research Papers:
Clinical Notes (HIPAA considerations):
Drug Information Sheets:
Clinical Guidelines:
Step 4: Medical Entity Relationships
Drug-Condition Relationships:
Co-occurrence on same pages
"Drug X is indicated for Condition Y"
"Patients with Condition Y treated with Drug X"
Procedure-Condition Relationships:
"Procedure X performed for Condition Y"
Diagnostic procedures for conditions
Drug-Drug Interactions:
Contraindications between drugs
Combination therapies
Test-Condition Relationships:
Diagnostic tests for conditions
Monitoring tests for treated conditions
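The page-level co-occurrence used for drugs and conditions in the main example generalizes to the other pairs listed above. This is a minimal sketch, assuming observations carry the fields used in the main example (`type`, `observable.name`, `occurrences[].pageIndex`). The `SimpleObservation` interface and `coOccurringPairs` function are illustrative and not part of the SDK.

```typescript
interface PageOccurrence {
  pageIndex?: number | null;
}

interface SimpleObservation {
  type: string;
  observable: { name: string };
  occurrences?: PageOccurrence[] | null;
}

interface EntityPair {
  left: string;
  right: string;
  sharedPages: number[];
}

// Collect the distinct page indexes on which an observation occurs.
function pagesOf(obs: SimpleObservation): Set<number> {
  const pages = new Set<number>();
  (obs.occurrences ?? []).forEach(occ => {
    if (typeof occ.pageIndex === "number") pages.add(occ.pageIndex);
  });
  return pages;
}

// Pair every entity of leftType with every entity of rightType that shares a page.
function coOccurringPairs(
  observations: SimpleObservation[],
  leftType: string,
  rightType: string
): EntityPair[] {
  const lefts = observations.filter(o => o.type === leftType);
  const rights = observations.filter(o => o.type === rightType);
  const pairs: EntityPair[] = [];
  lefts.forEach(l => {
    const leftPages = pagesOf(l);
    rights.forEach(r => {
      const rightPages = pagesOf(r);
      const shared = Array.from(leftPages)
        .filter(p => rightPages.has(p))
        .sort((a, b) => a - b);
      if (shared.length > 0) {
        pairs.push({ left: l.observable.name, right: r.observable.name, sharedPages: shared });
      }
    });
  });
  return pairs;
}
```

Co-occurrence only suggests a candidate link. It cannot tell an indication from a contraindication, so the pairs are best treated as input for review or for a follow-up query.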
Step 5: Confidence Scoring for Medical Entities
High Confidence (>=0.9):
Explicit medical terminology
Standard nomenclature (ICD, SNOMED CT terms)
Clear clinical context
Medium Confidence (0.7 to <0.9):
Common medical terms
Some ambiguity in context
Abbreviations with context
Low Confidence (<0.7):
Ambiguous terms
Incomplete information
Uncertain context
Recommended Threshold: >=0.75 for medical applications (higher than general content)
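These bands translate into a small classifier. The sketch below is illustrative only: `confidenceBand` and `meetsThreshold` are hypothetical helpers, and it assumes that a score of exactly 0.9 counts as high and exactly 0.7 counts as medium.

```typescript
type ConfidenceBand = "high" | "medium" | "low";

// Map a confidence score to the bands described above.
// Assumption: 0.9 itself is "high" and 0.7 itself is "medium".
function confidenceBand(score: number): ConfidenceBand {
  if (score >= 0.9) return "high";
  if (score >= 0.7) return "medium";
  return "low";
}

// Default follows the recommended threshold for medical applications.
const MEDICAL_CONFIDENCE_THRESHOLD = 0.75;

// True when a score is at or above the threshold in use.
function meetsThreshold(
  score: number,
  threshold: number = MEDICAL_CONFIDENCE_THRESHOLD
): boolean {
  return score >= threshold;
}
```

Note that a medium-band score such as 0.72 still falls below the recommended 0.75 cutoff, so banding and thresholding are separate decisions.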
Configuration Options
Precision vs Recall Tradeoff
High Precision (fewer false positives):
High Recall (fewer false negatives):
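One way to choose between these two settings is to measure both on a small hand-labeled sample. The sketch below is an illustration under stated assumptions, not a Graphlit feature: `ScoredEntity`, `precisionRecall`, and the gold-label list are hypothetical, and entity names are compared by case-insensitive exact match.

```typescript
interface ScoredEntity {
  name: string;
  confidence: number;
}

interface PrecisionRecall {
  precision: number;
  recall: number;
}

// Compare the entities kept at a given threshold against hand-labeled gold names.
function precisionRecall(
  extracted: ScoredEntity[],
  goldNames: string[],
  threshold: number
): PrecisionRecall {
  const gold = new Set(goldNames.map(n => n.toLowerCase()));
  const kept = new Set(
    extracted
      .filter(e => e.confidence >= threshold)
      .map(e => e.name.toLowerCase())
  );
  const truePositives = Array.from(kept).filter(n => gold.has(n)).length;
  return {
    // Convention: with nothing kept there are no false positives, so report 1.
    precision: kept.size === 0 ? 1 : truePositives / kept.size,
    // Convention: with no gold labels there is nothing to miss, so report 1.
    recall: gold.size === 0 ? 1 : truePositives / gold.size,
  };
}
```

Running this at several thresholds shows the tradeoff directly: raising the threshold tends to remove false positives (higher precision) at the cost of dropping some correct entities (lower recall).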
Domain-Specific Extraction
Cardiology:
Oncology:
Pharmacology:
Variations
Variation 1: Drug Information Database
Build comprehensive drug knowledge base:
Variation 2: Clinical Trial Analysis
Analyze clinical trial results:
Variation 3: Adverse Event Monitoring
Track drug side effects and adverse events:
Variation 4: Medical Literature Review
Build knowledge base from research papers:
Variation 5: Treatment Protocol Assistant
RAG-based clinical decision support:
Common Issues & Solutions
Issue: Medical Abbreviations Not Recognized
Problem: "HTN", "DM", "CHF" not extracted as conditions.
Solution: Medical abbreviations may be extracted with low confidence or missed. Options:
Use a lower confidence threshold (>=0.6)
Expand abbreviations in preprocessing
Train on a medical-specific model (future feature)
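The preprocessing option can be sketched as a dictionary-based expansion applied to text before ingestion. The abbreviation table and `expandAbbreviations` are hypothetical. A real deployment would need a clinically reviewed dictionary, since abbreviations such as "DM" can have more than one meaning.

```typescript
// Hypothetical table; each entry is one common reading of the abbreviation.
const ABBREVIATIONS: Record<string, string> = {
  HTN: "hypertension",
  DM: "diabetes mellitus",
  CHF: "congestive heart failure",
};

// Replace whole-word, upper-case abbreviations found in the table and keep
// the original token in parentheses so the source wording stays traceable.
function expandAbbreviations(text: string): string {
  return text.replace(/\b[A-Z]{2,5}\b/g, token => {
    const known = Object.prototype.hasOwnProperty.call(ABBREVIATIONS, token);
    return known ? `${ABBREVIATIONS[token]} (${token})` : token;
  });
}
```

Tokens that are not in the table, such as "MRI", pass through unchanged.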
Issue: False Positives on Common Terms
Problem: "Cold" extracted as MedicalCondition when discussing weather.
Solution: Context-aware filtering:
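One possible form of context-aware filtering keeps an ambiguous term only when another medical entity appears on the same page. This is a heuristic sketch, not a Graphlit feature: the ambiguous-term list, the `PagedEntity` shape, and `filterAmbiguous` are all assumptions made for illustration.

```typescript
interface PagedEntity {
  type: string;
  name: string;
  pages: number[];
}

// Hypothetical list of terms that are often non-medical in everyday text.
const AMBIGUOUS_TERMS = new Set(["cold", "depression", "shock"]);

// Keep an ambiguous entity only if a different medical entity shares one of
// its pages; unambiguous entities always pass through.
function filterAmbiguous(entities: PagedEntity[]): PagedEntity[] {
  return entities.filter(entity => {
    if (!AMBIGUOUS_TERMS.has(entity.name.toLowerCase())) return true;
    return entities.some(other =>
      other !== entity &&
      other.type.startsWith("Medical") &&
      other.pages.some(p => entity.pages.includes(p))
    );
  });
}
```

With this rule, "cold" next to a drug mention on the same page is kept, while "cold" on a page with no other medical entities is dropped.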
Issue: Missing Drug-Condition Relationships
Problem: Drug and condition mentioned but not linked.
Solution: Use co-occurrence analysis (same page) or RAG queries:
Issue: HIPAA Compliance Concerns
Problem: Patient names being extracted from clinical notes.
Solution: Don't extract Person entities from patient records:
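A minimal sketch of that restriction: derive the list of types to extract from patient records by removing `Person`. The string constants stand in for the SDK's `ObservableTypes` values so the snippet is self-contained, and `typesForPatientRecords` is a hypothetical helper.

```typescript
// String stand-ins for some of the ObservableTypes values used in the main example.
const ALL_EXTRACTED_TYPES: string[] = [
  "MedicalCondition",
  "MedicalDrug",
  "MedicalProcedure",
  "MedicalTest",
  "Person",
  "Organization",
];

// Types that can identify an individual and should be skipped for patient records.
const IDENTIFYING_TYPES: string[] = ["Person"];

// Return the input types with identifying types removed, preserving order.
function typesForPatientRecords(types: string[]): string[] {
  return types.filter(t => !IDENTIFYING_TYPES.includes(t));
}

const patientRecordTypes = typesForPatientRecords(ALL_EXTRACTED_TYPES);
```

The resulting list would be supplied as `extractedTypes` in the workflow's extraction connector. Skipping `Person` only prevents names from becoming entities in the graph. It does not remove them from the stored document text.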
Also implement proper data handling:
Encrypt data at rest
Access controls
Audit logging
BAA with Graphlit (if processing PHI)
Developer Hints
Medical Entity Quality by Source
High quality: Published research papers, drug information sheets
Medium quality: Clinical guidelines, review articles
Variable quality: Clinical notes (abbreviations, typos)
Model Recommendations by Use Case
Clinical decision support: GPT-4 (highest accuracy required)
Research literature review: GPT-4o (good balance)
General medical knowledge: Claude 3.5 Sonnet
Confidence Thresholds
Regulatory/clinical use: >=0.85
Research/analysis: >=0.75
Exploratory/discovery: >=0.65
HIPAA and Privacy
Graphlit is HIPAA-compliant when properly configured
Sign BAA (Business Associate Agreement)
Use encryption, access controls
Don't extract identifiable patient information
Consider de-identification before ingestion
Performance Optimization
Medical extraction is slower (complex terminology)
Expect 20-30% longer processing than general content
Batch process overnight for large volumes
Cache commonly queried entities
Production Patterns
Healthcare Use Cases
Clinical decision support: Query guidelines by condition
Drug information lookup: Interactive drug database
Adverse event monitoring: Track side effects across reports
Literature review: Automated systematic reviews
Treatment protocol matching: Match patients to protocols
Medical education: Interactive medical knowledge base
Compliance Considerations
PHI (Protected Health Information) requires HIPAA compliance
De-identify data when possible
Implement access controls
Audit all queries
Regular security assessments
Data retention policies