Preparation
Configure content preparation.
As content is ingested into Graphlit, the first stage of workflow processing is "preparation". You can configure how text is extracted from content such as PDFs, and automatically summarize the extracted text into paragraphs, bullet points, or a headline.
LLM-based preparation (i.e. summarization) incurs Graphlit credit usage, based on the number of LLM tokens processed. API-based preparation (i.e. audio transcription or PDF OCR text extraction) incurs Graphlit credit usage, based on the number of document pages or length of audio/video files.
When content is prepared, you can optionally summarize the extracted text as summary paragraphs, bullet points, headlines, or a combination of these.
You can assign an array of summarizations, each of which specifies the type of summary, the maximum number of tokens to be output by the LLM, and the number of items (i.e. paragraphs or bullet points). If the maximum number of tokens isn't specified, it will be calculated from the token limit of the LLM.
Graphlit supports these summarization types: SUMMARY, BULLETS, HEADLINES, POSTS, QUESTIONS, and CHAPTERS.
Summary is a multi-paragraph summary of a piece of content.
Bullets are a list of topical bullet points about the content.
Headlines are a list of potential titles or headlines, which could be used for a piece of content.
Posts are X (formerly Twitter) compatible social media posts, which can be used to promote a piece of content.
Questions are potential follow-up questions for a piece of content.
Chapters are YouTube-compatible timestamped chapter headings, which are auto-generated from an audio transcript.
These summarizations will fill in the appropriate properties in the Content entity.
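Putting this together, here is a minimal sketch of creating a workflow with a preparation stage via the GraphQL API, using Python. The endpoint URL is an assumption, and the exact input field names (summarizations, type, tokens, items) follow the terms above; verify both against the API reference.

```python
import os
import requests

# Assumed Graphlit GraphQL endpoint; check your project settings.
GRAPHLIT_API_URL = "https://data-scu.graphlit.io/api/v1/graphql"

MUTATION = """
mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) { id name }
}
"""

variables = {
    "workflow": {
        "name": "Summarization Workflow",
        "preparation": {
            # Each summarization specifies the type of summary, the
            # maximum output tokens, and the number of items.
            "summarizations": [
                {"type": "SUMMARY", "tokens": 512},
                {"type": "BULLETS", "tokens": 256, "items": 5},
            ]
        },
    }
}

response = requests.post(
    GRAPHLIT_API_URL,
    json={"query": MUTATION, "variables": variables},
    headers={"Authorization": f"Bearer {os.environ['GRAPHLIT_TOKEN']}"},
)
print(response.json())
```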
You can also assign specifications along with the preparation stage, which describe the LLM specification to be used for each content summarization.
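For example, a summarization entry might reference an existing specification by ID. This is a hedged sketch: the specification field name follows the term above, and the ID shown is a placeholder.

```python
# Plugs into the "preparation" field of the workflow input shown earlier.
preparation = {
    "summarizations": [
        {
            "type": "SUMMARY",
            "tokens": 512,
            # Placeholder ID of a specification you have already created.
            "specification": {"id": "YOUR_SPECIFICATION_ID"},
        }
    ]
}
```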
Assigning a preparation job with the connector of type AZURE_DOCUMENT_INTELLIGENCE will leverage Azure AI Document Intelligence for OCR and layout-aware text extraction.
You can specify the desired Azure AI Document Intelligence pre-built model to be used for your content format, from the list below; see the sketch after the list. Graphlit also supports custom-trained models on Azure AI Document Intelligence.
Read (OCR)
Layout
Invoice
Receipt
Credit Card
ID Document
Health Insurance Card (US)
W-2 Form (US)
1098 Form (US)
1098E Form (US)
1098T Form (US)
1099 Form (US)
Marriage Certificate (US)
Mortgage 1003 Uniform Residential Loan Application (URLA) (US)
Mortgage Form 1008 (US)
Mortgage closing disclosure (US)
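As referenced above, here is a sketch of a preparation job configured for Azure AI Document Intelligence. The connector type comes from the text above; the azureDocument field name and LAYOUT model enum are assumptions to verify against the API reference.

```python
# Plugs into the "preparation" field of the workflow input shown earlier.
preparation = {
    "jobs": [
        {
            "connector": {
                "type": "AZURE_DOCUMENT_INTELLIGENCE",
                # Assumed field/enum: selects the layout-aware
                # pre-built extraction model.
                "azureDocument": {"model": "LAYOUT"},
            }
        }
    ]
}
```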
Assigning a preparation job with the connector of type DEEPGRAM allows you to configure the Deepgram model used for transcription.
If you have a Deepgram API key, you can assign the key parameter so that audio transcription via this workflow does not accrue any Graphlit credits.
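A similar sketch for a Deepgram transcription job, assuming a deepgram connector field and a NOVA_2 model enum (both to be verified against the API reference); the key parameter carries your own Deepgram API key.

```python
import os

# Plugs into the "preparation" field of the workflow input shown earlier.
preparation = {
    "jobs": [
        {
            "connector": {
                "type": "DEEPGRAM",
                "deepgram": {
                    # Assumed model enum; pick the Deepgram model you want.
                    "model": "NOVA_2",
                    # Optional: with your own key, transcription via this
                    # workflow does not accrue Graphlit credits.
                    "key": os.environ["DEEPGRAM_API_KEY"],
                },
            }
        }
    ]
}
```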
By default, Graphlit extracts text from all document formats, and for PDF, DOCX and PPTX formats it performs higher-quality OCR document extraction using Azure AI Document Intelligence.
More information about the Azure AI Document Intelligence models can be found in the Azure documentation.
When ingesting audio and video content, Graphlit transcribes text from the spoken audio with audio transcription models.
The full list of Deepgram model enums matches the list of models supported by Deepgram.