Mastra (TypeScript)
Build an autonomous AI research agent that performs multi-hop web research with entity extraction and intelligent filtering—like OpenAI's Deep Research
⏱️ Time: 30-40 minutes 🎯 Level: Advanced 💻 SDK: TypeScript (Mastra framework)
What You'll Learn
In this tutorial, you'll build a production-ready research agent that:
✅ Extracts entities from documents using Graphlit's knowledge graph
✅ Performs multi-hop web research (searches for discovered entities)
✅ Filters sources before ingesting using native reranking (key innovation!)
✅ Detects convergence automatically (knows when to stop)
✅ Synthesizes multi-source reports with citations (scales to 100+ sources)
What makes this production-ready: Pre-ingestion filtering, autonomous stopping, and summary-based synthesis patterns used in real applications.
What You'll Build
An autonomous agent that takes a topic and:
Ingests seed source - Reads initial document or search results
Discovers entities - Extracts people, companies, concepts from your knowledge graph
Researches each entity - Searches Exa for 10 related sources per entity
Filters intelligently - Analyzes 50 sources, ingests only top 8 high-quality ones
Detects convergence - Stops when novelty score drops below 30%
Synthesizes report - Generates comprehensive markdown with proper citations
Example: Start with Wikipedia on "RAG" → Extracts 15 entities → Searches 50 sources → Filters to 8 → Generates 2000-word report in ~45 seconds
🔗 Full code: GitHub
Prerequisites
Required:
Node.js 20+
Graphlit account + credentials
OpenAI API key (for Mastra agent)
Basic TypeScript knowledge
Recommended (helps understand concepts):
Complete Quickstart (7 minutes)
Complete Knowledge Graph tutorial (20 minutes)
Why This Matters: What Graphlit Handles
Before we dive into building, understand what Graphlit provides so you don't have to build it:
Infrastructure (Weeks → Hours)
✅ File parsing - PDFs, DOCX, audio, video (30+ formats)
✅ Vector database - Managed Qdrant, auto-scaled
✅ Multi-tenant isolation - Each user gets isolated environment
✅ GraphQL API - Auto-generated, authenticated
Intelligence (Months → API Calls)
✅ Automatic entity extraction - LLM-powered workflow extracts Person, Organization, Category during ingestion
✅ Knowledge graph - Built on Schema.org/JSON-LD standard, relationships auto-created
✅ Native reranker - Fast, accurate relevance scoring (this enables our pre-filtering!)
✅ Exa search built-in - No separate API key needed, semantic web search included
✅ Summary-based RAG - Scales to 100+ documents via optimized summaries
Time savings: an estimated 12-14 weeks of infrastructure development that you skip.
Production proof: This pattern is used in Zine, serving thousands of users with millions of documents.
The Key Innovation: Pre-Ingestion Filtering
Most research agents blindly ingest everything they find, which creates noise and wastes processing time.
The breakthrough: Analyze sources before fully ingesting them.
Here's the pattern:
Quick ingest to temporary collection (lightweight)
Use Graphlit's native reranker to score relevance
Filter out low-scoring sources (<0.5 relevance)
Only fully ingest top 5-8 sources
Delete temporary collection
Why this works: Graphlit's native reranker is fast enough (~2 seconds) to analyze 50 sources before deciding which to fully process.
Result: Process 8 sources instead of 50. Faster, higher quality, better signal-to-noise.
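As a minimal sketch of the threshold-and-cap step, assuming the reranker returns a 0-1 relevance score per candidate (the `RankedSource` shape here is illustrative, not a Graphlit SDK type):

```typescript
// Illustrative only: keep sources the reranker scored above the threshold,
// capped at the top `maxSources` by relevance.
interface RankedSource {
  uri: string;
  relevance: number; // 0..1 score from the native reranker
}

export function selectSourcesToIngest(
  ranked: RankedSource[],
  threshold = 0.5,
  maxSources = 8,
): RankedSource[] {
  return ranked
    .filter((s) => s.relevance >= threshold)
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, maxSources);
}
```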
The 5-Phase Research Algorithm
Phase 1: Seed Acquisition
Two starting modes:
URL Mode - Start from a specific source:
Best for: Research papers, documentation, whitepapers
Search Mode - Discover seed sources automatically:
Best for: Open-ended research, new topics
Phase 2: Entity-Driven Discovery
Instead of keyword-based research, let the knowledge graph drive discovery:
Automatic extraction: Entities extracted during ingestion (no separate step!)
Types: Person, Organization, Category (concepts/technical terms)
Ranking: By occurrence count and semantic importance
Selection: Top 5 become research seeds
Why entity-driven works: A RAG paper mentions "vector databases" and "BERT"—those naturally become your next research directions. Mimics human researcher behavior.
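The ranking-and-selection step above can be sketched as a pure function; the `Entity` shape and occurrence counts are assumptions for illustration, since the real entities come back from Graphlit's knowledge graph:

```typescript
// Illustrative ranking step: order extracted entities by how often they
// occur across ingested content, then take the top N as research seeds.
interface Entity {
  name: string;
  type: string; // e.g. "Person", "Organization", "Category"
  occurrences: number;
}

export function topResearchSeeds(entities: Entity[], n = 5): Entity[] {
  return [...entities].sort((a, b) => b.occurrences - a.occurrences).slice(0, n);
}
```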
Phase 3: Intelligent Expansion
For each entity:
Search Exa for 10 related sources
Pre-filter before ingesting (the key innovation!)
Only ingest top 3-5 highest-quality sources
The filtering workflow:
Benefit: Analyze 50, process 8. Significantly faster with better quality.
Phase 4: Convergence Detection
The agent automatically detects when to stop:
Rerank ALL content by relevance to original query
Calculate novelty: What % of newest sources rank in top 10?
Decision:
Novelty >30%: Continue, sources add value
Novelty <30%: Stop, diminishing returns
Why this matters: No manual intervention needed. The agent knows when research has converged.
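The novelty calculation can be sketched as a pure function. This assumes each reranked item is tagged with whether it came from the most recent research pass; the field names are ours, not the tutorial repo's:

```typescript
// Illustrative convergence check.
interface RerankedItem {
  id: string;
  isFromLatestPass: boolean;
}

// Novelty = share of the top-N reranked results contributed by new sources.
export function noveltyScore(rankedDesc: RerankedItem[], topN = 10): number {
  const top = rankedDesc.slice(0, topN);
  if (top.length === 0) return 0;
  return top.filter((i) => i.isFromLatestPass).length / top.length;
}

// Continue only while new sources still displace existing content.
export function shouldContinueResearch(novelty: number, threshold = 0.3): boolean {
  return novelty > threshold;
}
```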
Phase 5: Multi-Source Synthesis
Graphlit's summary-based approach scales beyond traditional RAG:
Auto-summarize each source (25-50 key points + entities)
Concatenate summaries (efficient context usage)
Synthesize via LLM from summaries
Citations automatically included
Traditional RAG hits limits at 10-20 docs. This approach handles 100+ sources.
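A hedged sketch of the concatenate-and-synthesize step, with an assumed summary shape (the real summaries come from Graphlit's auto-summarization):

```typescript
// Concatenate short per-source summaries into a single prompt instead of
// stuffing full documents into the context window.
interface SourceSummary {
  index: number;   // citation index, e.g. [1]
  title: string;
  summary: string; // key points produced by auto-summarization
}

export function buildSynthesisPrompt(topic: string, sources: SourceSummary[]): string {
  const body = sources
    .map((s) => `[${s.index}] ${s.title}\n${s.summary}`)
    .join("\n\n");
  return [
    `Write a research report on "${topic}" using only the sources below.`,
    "Cite claims with bracketed source indices, e.g. [1].",
    "",
    body,
  ].join("\n");
}
```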
Implementation: Step-by-Step
Step 1: Project Setup (5 min)
Create .env:
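A typical `.env` for this stack; the variable names follow the Graphlit and OpenAI SDK conventions, but double-check them against your Graphlit project settings:

```bash
# Graphlit credentials (from your project's settings page)
GRAPHLIT_ORGANIZATION_ID=your-organization-id
GRAPHLIT_ENVIRONMENT_ID=your-environment-id
GRAPHLIT_JWT_SECRET=your-jwt-secret

# OpenAI key for the Mastra agent's model
OPENAI_API_KEY=sk-...
```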
Configure tsconfig.json:
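One plausible `tsconfig.json` for a Node 20 + ESM project like this; adjust to the repo's actual config:

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "moduleResolution": "bundler",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "outDir": "dist"
  },
  "include": ["src"]
}
```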
Step 2: Singleton Graphlit Client (2 min)
Create one shared Graphlit instance used by all tools.
File: src/graphlit.ts
Why singleton: Efficient, no redundant credential passing, production pattern.
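The pattern is a module-level instance behind an accessor. A stub class stands in for `graphlit-client`'s `Graphlit` here so the sketch is self-contained; in the real file you would import it instead:

```typescript
// Stand-in for: import { Graphlit } from "graphlit-client";
// The real client resolves GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID,
// and GRAPHLIT_JWT_SECRET from the environment (verify against the SDK docs).
class Graphlit {}

let instance: Graphlit | undefined;

// All tools import this accessor instead of constructing their own client,
// so credentials are resolved once and the connection is reused.
export function getGraphlitClient(): Graphlit {
  if (!instance) instance = new Graphlit();
  return instance;
}
```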
Step 3: Build the Key Tools (20 min)
We'll build 10 Mastra tools. Here are the critical ones:
Tool 1: Workflow with Entity Extraction
This sets up automatic entity extraction during ingestion.
File: src/tools/workflow-setup.ts
Key insight: Entity extraction happens automatically during ingestion. When you query later, entities are already in your knowledge graph—no separate extraction step needed.
Graphlit's Knowledge Graph: Built on Schema.org/JSON-LD standards, ensuring interoperability and semantic richness beyond simple entity lists.
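To make the setup concrete, here is the assumed shape of a workflow that enables entity extraction during ingestion. Field and enum names are illustrative, not the Graphlit API's exact identifiers; check the API reference for the real workflow input type:

```typescript
// Assumed workflow config: LLM-powered extraction of three entity types
// runs automatically whenever content is ingested with this workflow.
export const researchWorkflow = {
  name: "research-agent",
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT", // LLM-based extraction
          extractedTypes: ["PERSON", "ORGANIZATION", "CATEGORY"],
        },
      },
    ],
  },
};
```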
Tool 2: Pre-Ingestion Filtering (The Critical Innovation)
This tool analyzes sources before fully ingesting them.
File: src/tools/rerank.ts
Why this pattern works: The native reranker is fast (~2 seconds) and accurate. Temporary collection analysis is lightweight. This makes pre-filtering practical at scale.
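The tool's end-to-end shape can be sketched with the Graphlit calls stubbed behind injected functions. The real SDK method names differ; treat these as placeholders:

```typescript
interface Candidate { uri: string }
interface Scored extends Candidate { relevance: number }

type QuickIngest = (uris: string[], collection: string) => Promise<void>;
type Rerank = (query: string, collection: string) => Promise<Scored[]>;
type DeleteCollection = (collection: string) => Promise<void>;

// Quick-ingest candidates into a temporary collection, rerank them against
// the query, keep only the best, and always delete the temp collection.
export async function preFilter(
  query: string,
  candidates: Candidate[],
  deps: { quickIngest: QuickIngest; rerank: Rerank; deleteCollection: DeleteCollection },
  threshold = 0.5,
  keep = 8,
): Promise<Scored[]> {
  const temp = `temp-${Date.now()}`;
  await deps.quickIngest(candidates.map((c) => c.uri), temp); // lightweight ingest
  try {
    const scored = await deps.rerank(query, temp);            // native reranker
    return scored
      .filter((s) => s.relevance >= threshold)
      .sort((a, b) => b.relevance - a.relevance)
      .slice(0, keep);
  } finally {
    await deps.deleteCollection(temp);                        // clean up
  }
}
```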
Tool 3: Diminishing Returns Detection
Automatically detects when research has converged.
File: src/tools/rerank.ts (continued)
The insight: If new sources don't rank highly vs existing content, they're redundant. The agent stops automatically.
Step 4: Create the Mastra Agent (3 min)
Bring all tools together with intelligent orchestration.
File: src/agent.ts
Why agent pattern: The LLM decides when to use each tool. Adaptive, resilient, production-ready for AI applications.
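To see why tool-calling orchestration is adaptive, here is a hand-rolled illustration of the loop Mastra runs for you. This is not Mastra's actual internals; in Mastra, the LLM plays the `pickNext` role:

```typescript
type Tool = (input: string) => Promise<string>;

// The controller picks the next tool based on everything seen so far, so
// the control flow adapts at runtime instead of following a fixed script.
export async function runAgentLoop(
  pickNext: (transcript: string[]) => { tool: string; input: string } | null,
  tools: Record<string, Tool>,
): Promise<string[]> {
  const transcript: string[] = [];
  for (let step = 0; step < 20; step++) { // hard cap on steps
    const next = pickNext(transcript);    // in Mastra, the LLM decides
    if (!next) break;                     // agent chose to stop
    const output = await tools[next.tool](next.input);
    transcript.push(`${next.tool}: ${output}`);
  }
  return transcript;
}
```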
Step 5: Build the CLI (5 min)
Create a polished interface for running research.
File: src/main.ts (abbreviated - see full code)
Running Your Agent
URL Mode:
Search Mode:
Expected output:
Terminal (progress):
Report (report.md):
Production Patterns
Performance Optimizations
Parallel processing:
Synchronous ingestion:
Pre-filtering:
Typical Session Metrics
Without filtering:
Sources processed: ~50
Processing time: 2-3 minutes
Quality: Significant noise
With filtering:
Sources processed: ~8
Processing time: 30-45 seconds
Quality: High signal-to-noise ratio
Alternative Frameworks
This tutorial uses Mastra (TypeScript). Graphlit works with other frameworks:
For Python developers:
Agno - Ultra-fast Python agents
LangGraph - Graph-based state machines
For TypeScript developers:
Vercel AI SDK Workflow - Deterministic orchestration
All use the same Graphlit SDK—choose based on language preference.
Next Steps
Try It Out
Clone and run:
Extend It
Domain-specific entities:
Medical: ObservableTypes.MedicalCondition, ObservableTypes.Drug
Legal: ObservableTypes.LegalCase, ObservableTypes.Contract
Business: ObservableTypes.Product, ObservableTypes.Event
Multi-pass research:
Extract entities from Layer 2 results
Research 2-3 passes deep
Configurable depth limits
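The multi-pass idea above can be sketched as a depth-limited expansion over discovered entities; the `research` function is a hypothetical stand-in for one search-filter-ingest pass:

```typescript
// One research pass returns the entities it discovered.
type Research = (topic: string) => Promise<string[]>;

// Expand the research frontier for up to maxDepth passes, never revisiting
// a topic, and return every topic covered.
export async function multiPass(
  seed: string,
  research: Research,
  maxDepth = 2,
): Promise<string[]> {
  const visited = new Set<string>([seed]);
  let frontier = [seed];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const topic of frontier) {
      for (const entity of await research(topic)) {
        if (!visited.has(entity)) {
          visited.add(entity);
          next.push(entity);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```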
Real-time monitoring:
Create Exa feeds for discovered entities
Auto-expand knowledge base daily
Learn More
Related Tutorials:
Knowledge Graph - Deep dive into entity extraction
Production Deployment - Multi-tenant patterns
Context Engineering - Advanced retrieval
Production Example:
Zine Case Study - Real-world implementation serving thousands of users
Graphlit Resources:
Mastra Resources:
Summary
You've learned how to build a production-ready autonomous research agent:
Key innovations:
Pre-ingestion filtering - Native reranker analyzes sources before processing
Diminishing returns detection - Agent knows when to stop autonomously
Summary-based synthesis - Scales to 100+ sources via optimized summaries
Entity-driven discovery - Knowledge graph powers multi-hop reasoning
Architecture:
Mastra handles orchestration and tool-calling
Graphlit provides semantic memory, knowledge graph, and intelligence
Clean separation of concerns, production-ready patterns
Time investment: 30-40 minutes Value delivered: Weeks of infrastructure work eliminated, battle-tested patterns
This approach works for competitive intelligence, market research, technical deep-dives, and any multi-source synthesis use case.
Complete implementation: GitHub Repository