Extract Contents

Extract data from multiple content items in parallel.

LLMs such as OpenAI GPT-3.5 and GPT-4 offer function calling as a way for the model to output a JSON object containing arguments. The OpenAI GPT-4 Turbo 128K (1106) and GPT-3.5 Turbo 16K (1106) models also support calling multiple functions in parallel.

Note that the LLM does not literally call the function itself. It formats the arguments of a function call as JSON, so that the application can call the function itself.
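To make this division of labor concrete, here is a minimal sketch of an application receiving a model-formatted tool call and invoking its own function. The `get_address` function, the `TOOLS` registry, and the shape of `example_call` are hypothetical and chosen for illustration only; they are not part of the Graphlit or OpenAI APIs.

```python
import json


def get_address(streetAddress: str, city: str) -> str:
    """Hypothetical application function; the model never runs this itself."""
    return f"{streetAddress}, {city}"


# Registry mapping tool names to the application's own functions (assumed design).
TOOLS = {"get_address": get_address}


def dispatch(tool_call: dict) -> str:
    """Parse the JSON arguments formatted by the model and call the matching function."""
    func = TOOLS[tool_call["name"]]
    arguments = json.loads(tool_call["arguments"])
    return func(**arguments)


# Illustrative tool call: a tool name plus JSON-encoded arguments.
example_call = {
    "name": "get_address",
    "arguments": '{"streetAddress": "825 B NE 70th St", "city": "Seattle"}',
}

result = dispatch(example_call)
print(result)
```

The model only produces the `name` and `arguments` values; everything inside `dispatch` runs in the application.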

Graphlit uses this capability to offer structured data extraction from any content format, e.g. web pages, PDFs, and audio transcripts.

In the newer versions of these LLMs, function calls are called tool calls, and Graphlit uses that nomenclature.

Create Extraction Specification

First, you must create a specification to use with data extraction, and define the tools to be executed by the LLM.

Here we are using the OpenAI GPT-4 Turbo 128K model, which, in our experience, provides the best quality data extraction, although it is somewhat more costly and slower than the other OpenAI models. You can test different models to find the best one for your use case.

You can define multiple tools, and for each one assign a tool name, an optional description, and a JSON schema.

Tool names may contain only the characters a-z, A-Z, 0-9, underscores, and dashes, with a maximum length of 64 characters.
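The naming rule above can be checked before submitting a specification. This is a small client-side sketch, not Graphlit's own validation; the `is_valid_tool_name` helper is hypothetical, and the minimum length of one character is an assumption, since only the maximum is stated.

```python
import re

# Letters, digits, underscores, and dashes; 1 to 64 characters.
# The lower bound of 1 is an assumption, only the maximum of 64 is documented.
TOOL_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_tool_name(name: str) -> bool:
    """Return True if the whole name matches the allowed character set and length."""
    return TOOL_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_tool_name("get_address"))
print(is_valid_tool_name("get address"))
```

Using `fullmatch` rather than `match` ensures that a trailing invalid character or a name longer than 64 characters is rejected.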

The schema describes the output from the tool, i.e. the format of the data returned by this extraction operation.

Example JSON schema for tool:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "streetAddress": {
            "type": "string",
            "description": "The street address, including house number and street name."
        },
        "city": {
            "type": "string",
            "description": "The name of the city."
        },
        "state": {
            "type": "string",
            "description": "The name of the state or province."
        },
        "postalCode": {
            "type": "string",
            "description": "The postal or ZIP code."
        },
        "country": {
            "type": "string",
            "description": "The name of the country."
        }
    },
    "required": ["streetAddress", "city", "state", "postalCode", "country"]
}

Mutation:

mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}

Variables:

{
  "specification": {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "openAI": {
      "model": "GPT4_TURBO_128K_1106",
      "temperature": 0.1,
      "probability": 0.2
    },
    "tools": [
      {
        "name": "get_address",
        "description": "Extract address properties.",
        "schema": "{\"$schema\":\"http://json-schema.org/draft-07/schema#\",\"type\":\"object\",\"properties\":{\"streetAddress\":{\"type\":\"string\",\"description\":\"The street address, including house number and street name.\"},\"city\":{\"type\":\"string\",\"description\":\"The name of the city.\"},\"state\":{\"type\":\"string\",\"description\":\"The name of the state or province.\"},\"postalCode\":{\"type\":\"string\",\"description\":\"The postal or ZIP code.\"},\"country\":{\"type\":\"string\",\"description\":\"The name of the country.\"}},\"required\":[\"streetAddress\",\"city\",\"state\",\"postalCode\",\"country\"]}"
      }
    ],
    "name": "GPT-4 Extraction"
  }
}

Response:

{
  "type": "EXTRACTION",
  "serviceType": "OPEN_AI",
  "id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40",
  "name": "GPT-4 Extraction",
  "state": "ENABLED"
}
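In the variables above, the tool's schema field is passed as a JSON-encoded string rather than as a nested object. The following sketch shows one way to produce that string from a Python dictionary with the standard `json` module. The `build_tool` helper is hypothetical; the field names and values are copied from the example above.

```python
import json

# The address schema from the example above, as a Python dictionary.
address_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "streetAddress": {
            "type": "string",
            "description": "The street address, including house number and street name.",
        },
        "city": {"type": "string", "description": "The name of the city."},
        "state": {"type": "string", "description": "The name of the state or province."},
        "postalCode": {"type": "string", "description": "The postal or ZIP code."},
        "country": {"type": "string", "description": "The name of the country."},
    },
    "required": ["streetAddress", "city", "state", "postalCode", "country"],
}


def build_tool(name: str, description: str, schema: dict) -> dict:
    """Hypothetical helper: serialize the schema to a compact JSON string."""
    return {
        "name": name,
        "description": description,
        "schema": json.dumps(schema, separators=(",", ":")),
    }


variables = {
    "specification": {
        "type": "EXTRACTION",
        "serviceType": "OPEN_AI",
        "openAI": {"model": "GPT4_TURBO_128K_1106", "temperature": 0.1, "probability": 0.2},
        "tools": [build_tool("get_address", "Extract address properties.", address_schema)],
        "name": "GPT-4 Extraction",
    }
}

# Serializing the whole variables object escapes the inner quotes automatically.
print(json.dumps(variables, indent=2))
```

Building the schema as a dictionary and serializing it once avoids hand-escaping quotes in the schema string.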

The uri field of the tool definition in the specification is unused by extractContents. The tool callback URI is only used when tools are configured for prompt completion specifications, where it is used by the promptConversation mutation.

Extract Contents

Extracting contents is similar to querying contents, in that it takes a content filter parameter.

Graphlit will query the contents based on your filter, and then extract data from each content separately, using the specification you provide.

Because some LLMs, such as GPT-4 Turbo 128K, are slower, you may get API timeouts when extracting contents, especially with a larger number of contents. If this happens, you can filter the contents to return fewer results, or try a different LLM.

Extraction performance depends on the number of pages of text, or the length of an audio/video transcript.

In this example, we've ingested a web page of homes in Seattle.

Say we are a realtor, and our goal is to extract all the addresses of the homes on this page.

We can easily use our specification with the get_address tool and extract all addresses from this web page.

As the response below shows, each extraction provides a JSON value which adheres to the tool schema provided, and references the pageNumber or startTime/endTime where the data was extracted from the source content.

We can take the resulting value fields and use them to synchronize with Google Maps or some other software application.

For example:

{
	"streetAddress": "825 B NE 70th St",
	"city": "Seattle",
	"state": "WA",
	"postalCode": "98115",
	"country": "USA"
}

Mutation:

mutation ExtractContents($prompt: String!, $filter: ContentFilter, $specification: EntityReferenceInput!) {
  extractContents(prompt: $prompt, filter: $filter, specification: $specification) {
    specification {
      id
    }
    content {
      id
    }
    value
    startTime
    endTime
    pageNumber
    error
  }
}

Variables:

{
  "prompt": "Find me all the street addresses.",
  "specification": {
    "id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40"
  }
}

Response:

[
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"9253 Densmore Ave N\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 8
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"1653 N 95th St\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 8
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"823 B NE 70th St\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98115\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 8
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"825 B NE 70th St\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98115\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 8
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"The Baranof\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"74th St Ale House\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"The Cozy Nut Tavern\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"The Yard Cafe\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Coindexter's Bar\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Gorditos\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"FlintCreek Cattle Co.\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Gainsbourg\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Greenwood Park\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Sandel Park\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"6th Ave NW Pocket Park\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 6
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"8747 Phinney Ave N #3\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"9209 1st Ave NW\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98117\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"9207 1st Ave NW\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98117\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"715 N 101st St\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98133\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"9255 Greenwood Ave N #32\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"912 N 100th St #B\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98133\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Greenwood Avenue North and North 85th Street\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 3
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Greenwood Ave\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 5
  }
]
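Each value in the response is itself a JSON string, so it needs a second parse before use. The sketch below decodes the values and keeps only records with a postal code, since the response above also contains business and park names that lack one. The helper names are hypothetical, the sample holds two items copied from the response, and skipping items with a non-empty error field is an assumption about how failures are reported.

```python
import json


def parse_extractions(results: list) -> list:
    """Decode each JSON-encoded value and attach its page number."""
    parsed = []
    for item in results:
        if item.get("error"):
            # Assumption: a non-empty error means this extraction failed.
            continue
        record = json.loads(item["value"])
        record["pageNumber"] = item.get("pageNumber")
        parsed.append(record)
    return parsed


def with_postal_code(records: list) -> list:
    """Keep only records whose postalCode is non-empty."""
    return [record for record in records if record.get("postalCode")]


# Two items copied from the response above.
sample = [
    {
        "value": '{\r\n  "streetAddress": "9253 Densmore Ave N",\r\n  "city": "Seattle",\r\n  "state": "WA",\r\n  "postalCode": "98103",\r\n  "country": "USA"\r\n}',
        "pageNumber": 8,
    },
    {
        "value": '{\r\n  "streetAddress": "The Baranof",\r\n  "city": "Seattle",\r\n  "state": "Washington",\r\n  "postalCode": "",\r\n  "country": "USA"\r\n}',
        "pageNumber": 6,
    },
]

records = parse_extractions(sample)
addresses = with_postal_code(records)
for address in addresses:
    print(address["streetAddress"], address["postalCode"], address["pageNumber"])
```

Filtering on the postal code is only a heuristic for this page; a different schema or prompt may call for a different check.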
