Extract data from multiple content items in parallel.
LLMs such as OpenAI GPT-3.5 and GPT-4 offer function calling as a way for the model to output a JSON object containing arguments. OpenAI GPT-4 Turbo 128K (1106) and GPT-3.5 Turbo 16K (1106) also support the model calling multiple functions in parallel.
Note that the LLM does not literally call the function itself. It formats the arguments of a function call, in JSON, so that the application can make the call on its own.
Graphlit uses this capability to offer structured data extraction from any content format, e.g. web pages, PDFs, and audio transcripts.
In the newer versions of these LLMs, function calls are now called tool calls, and we use that nomenclature in Graphlit.
Create Extraction Specification
First, you must create a specification to use with data extraction, and define the tools to be executed by the LLM.
Here we are using the OpenAI GPT-4 Turbo 128K model, which in our experience provides the best-quality data extraction, although it is somewhat more costly and slower than the other OpenAI models. You can test different models to find the best one for your use case.
You can define multiple tools and, for each, assign a tool name, an (optional) description, and a JSON schema.
Tool names must contain only the characters a-z, A-Z, 0-9, underscores, and dashes, with a maximum length of 64.
The schema describes the output from the tool, i.e. the format of the data output by this extraction operation.
Example JSON schema for tool:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "streetAddress": { "type": "string", "description": "The street address, including house number and street name." },
    "city": { "type": "string", "description": "The name of the city." },
    "state": { "type": "string", "description": "The name of the state or province." },
    "postalCode": { "type": "string", "description": "The postal or ZIP code." },
    "country": { "type": "string", "description": "The name of the country." }
  },
  "required": ["streetAddress", "city", "state", "postalCode", "country"]
}
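Because the specification variables embed the schema as an escaped JSON string, it can help to define the schema as a data structure and serialize it, rather than hand-escaping quotes. A minimal Python sketch:

```python
import json

# The get_address tool schema as a Python dict; json.dumps produces the
# escaped string form that the specification's "schema" field expects.
address_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "streetAddress": {"type": "string", "description": "The street address, including house number and street name."},
        "city": {"type": "string", "description": "The name of the city."},
        "state": {"type": "string", "description": "The name of the state or province."},
        "postalCode": {"type": "string", "description": "The postal or ZIP code."},
        "country": {"type": "string", "description": "The name of the country."},
    },
    "required": ["streetAddress", "city", "state", "postalCode", "country"],
}

# This string goes into the tool's "schema" field in the mutation variables.
schema_string = json.dumps(address_schema)
```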
Mutation:
mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}
Variables:
{
  "specification": {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "openAI": {
      "model": "GPT4_TURBO_128K_1106",
      "temperature": 0.1,
      "probability": 0.2
    },
    "tools": [
      {
        "name": "get_address",
        "description": "Extract address properties.",
        "schema": "{\"$schema\":\"http://json-schema.org/draft-07/schema#\",\"type\":\"object\",\"properties\":{\"streetAddress\":{\"type\":\"string\",\"description\":\"The street address, including house number and street name.\"},\"city\":{\"type\":\"string\",\"description\":\"The name of the city.\"},\"state\":{\"type\":\"string\",\"description\":\"The name of the state or province.\"},\"postalCode\":{\"type\":\"string\",\"description\":\"The postal or ZIP code.\"},\"country\":{\"type\":\"string\",\"description\":\"The name of the country.\"}},\"required\":[\"streetAddress\",\"city\",\"state\",\"postalCode\",\"country\"]}"
      }
    ],
    "name": "GPT-4 Extraction"
  }
}
The uri field of the tool definition in the specification is unused by extractContents. The tool callback URI is only used when tools are configured for prompt completion specifications, and is used by the promptConversation mutation.
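Any GraphQL client can send the createSpecification mutation over HTTP. A minimal Python sketch using only the standard library; the endpoint URL and bearer-token authorization header are assumptions, so check your Graphlit project settings for the actual values:

```python
import json
import urllib.request

# Placeholder endpoint; substitute your project's GraphQL API URL.
GRAPHLIT_API_URL = "https://data-scrt.graphlit.io/api/v1/graphql"

CREATE_SPECIFICATION = """
mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) { id name state type serviceType }
}
"""

def build_request(query: str, variables: dict, token: str) -> urllib.request.Request:
    """Assemble a GraphQL POST request with a bearer token (assumed auth scheme)."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHLIT_API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

# To execute: urllib.request.urlopen(build_request(CREATE_SPECIFICATION, variables, token))
```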
Extract Contents
Extracting contents is similar to querying contents, in that it takes a content filter parameter.
Graphlit will query the contents, based on your filter, and then extract each content separately, using the specification you provide.
With the slower performance of some LLMs like GPT-4 Turbo 128K, you may get API timeouts when attempting to extract contents, especially with a larger number of contents. If this happens, you can filter the contents to return fewer results, or try a different LLM.
Extraction performance is dependent on the number of pages of text, or the length of an audio/video transcript.
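One way to keep each call small is to pass a narrower filter in the extractContents variables. The ContentFilter fields shown here (types, limit) are assumptions for illustration only; check the Graphlit API reference for the filter fields your project supports:

```python
# Hypothetical extractContents variables that narrow the content filter.
# The "types" and "limit" fields are assumed, not confirmed, ContentFilter
# fields; the specification id is the one created earlier.
variables = {
    "prompt": "Find me all the street addresses.",
    "filter": {
        "types": ["PAGE"],  # assumed field: restrict to web pages
        "limit": 10,        # assumed field: cap the number of contents per call
    },
    "specification": {"id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40"},
}
```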
Say we are a realtor, and our goal is to extract all the addresses of the homes on this page.
We can easily use our specification with the get_address tool and extract all addresses from this web page.
As you can see, each extraction provides the JSON value which adheres to the tool schema provided, and references the pageNumber or startTime/endTime where the data was extracted from the source content.
We can take the resulting value fields and use them to synchronize with Google Maps or another software application.
For example:
{
  "streetAddress": "825 B NE 70th St",
  "city": "Seattle",
  "state": "WA",
  "postalCode": "98115",
  "country": "USA"
}
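Since each extraction's value field is delivered as a JSON string conforming to the tool schema, the application decodes it before use. A small Python sketch, using illustrative (not actual) response items:

```python
import json

# Hypothetical extractContents response items: each "value" is a JSON
# string conforming to the get_address tool schema, alongside the
# pageNumber (or startTime/endTime) it was extracted from.
extractions = [
    {
        "value": "{\"streetAddress\":\"825 B NE 70th St\",\"city\":\"Seattle\",\"state\":\"WA\",\"postalCode\":\"98115\",\"country\":\"USA\"}",
        "pageNumber": 1,
    },
]

def parse_addresses(items: list[dict]) -> list[dict]:
    """Decode each extraction's value into a dict, keeping the source page."""
    addresses = []
    for item in items:
        address = json.loads(item["value"])
        address["pageNumber"] = item.get("pageNumber")
        addresses.append(address)
    return addresses
```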
Mutation:
mutation ExtractContents($prompt: String!, $filter: ContentFilter, $specification: EntityReferenceInput!) {
  extractContents(prompt: $prompt, filter: $filter, specification: $specification) {
    specification { id }
    content { id }
    value
    startTime
    endTime
    pageNumber
    error
  }
}
Variables:
{
  "prompt": "Find me all the street addresses.",
  "specification": {
    "id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40"
  }
}