Ingest PDF

Ingest a PDF into Graphlit

Let's start with a simple but common use case: chatting with a PDF.

Graphlit uses Large Language Models (LLMs) to extract useful knowledge from PDFs and lets you ask questions about their content.

As an example, we can use this interesting paper on Knowledge Graphs and LLMs, which we've placed on cloud storage.

https://graphlitplatform.blob.core.windows.net/samples/Unifying%20Large%20Language%20Models%20and%20Knowledge%20Graphs%20A%20Roadmap-2306.08302.pdf

Or, if you have a URL to your own cloud-hosted PDF, feel free to use that one instead.

Create Content

We can use the create verb with the content type to create new content. Here we add the optional --wait argument so that the CLI command waits until the content has finished ingestion. Without this argument, the CLI returns immediately and the content workflow proceeds asynchronously.

g create --type content --wait
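If you omit --wait, something has to check later whether ingestion has finished. The following is a minimal Python sketch of that polling pattern, not the CLI's actual implementation. The get_state callable is hypothetical and stands in for however you look up the content's state field; "FINISHED" is the value shown in the response on this page, while "PENDING" is only a placeholder for any not-yet-finished state.

```python
import time
from typing import Callable


def wait_until_finished(
    get_state: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Call get_state() repeatedly until it returns "FINISHED" or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == "FINISHED":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"content still in state {state!r} after {timeout}s")
        time.sleep(interval)


# Hypothetical stand-in for a real state lookup: reports a placeholder
# "PENDING" state twice, then "FINISHED".
_fake_states = iter(["PENDING", "PENDING", "FINISHED"])
result = wait_until_finished(lambda: next(_fake_states), interval=0.0)
print(result)
```

This sketch treats every state other than "FINISHED" as still in progress, so a failed ingestion would only surface as a timeout.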

By default, create verbs return the JSON of the content, with a default set of requested fields.

You'll notice that Graphlit has indexed the PDF metadata and set document.pageCount to 29.

The text has been extracted from the PDF automatically, then chunked and vector-embedded for semantic search.

The textUri field points to a JSON file which contains the raw extracted text, by page and chunk.

{
  "type": "FILE",
  "originalDate": "2023-06-21T00:44:15Z",
  "mimeType": "application/pdf",
  "fileType": "DOCUMENT",
  "fileName": "Unifying Large Language Models and Knowledge Graphs A Roadmap-2306.08302.pdf",
  "fileSize": 3312767,
  "masterUri": "https://graphlit20240221a3d8649b.blob.core.windows.net/files/4330c766-488d-4dcd-8237-f03d7a3a068f/Unifying%20Large%20Language%20Models%20and%20Knowledge%20Graphs%20A%20Roadmap-2306.08302.pdf?sv=2023-11-03&se=2024-03-04T10%3A23%3A35Z&sr=c&sp=rl&sig=LR9WwYB6FGc17HQL0xEJ6pUz7r1YRkLxgu5Zbhd6GMQ%3D",
  "textUri": "https://graphlit20240221a3d8649b.blob.core.windows.net/files/4330c766-488d-4dcd-8237-f03d7a3a068f/Mezzanine/Unifying%20Large%20Language%20Models%20and%20Knowledge%20Graphs%20A%20Roadmap-2306.08302.json?sv=2023-11-03&se=2024-03-04T10%3A23%3A35Z&sr=c&sp=rl&sig=LR9WwYB6FGc17HQL0xEJ6pUz7r1YRkLxgu5Zbhd6GMQ%3D",
  "document": {
    "pageCount": 29
  },
  "children": [],
  "uri": "https://graphlitplatform.blob.core.windows.net/samples/Unifying%20Large%20Language%20Models%20and%20Knowledge%20Graphs%20A%20Roadmap-2306.08302.pdf",
  "id": "4330c766-488d-4dcd-8237-f03d7a3a068f",
  "name": "Unifying Large Language Models and Knowledge Graphs A Roadmap-2306.08302.pdf",
  "state": "FINISHED",
  "creationDate": "2024-03-04T04:23:29Z",
  "finishedDate": "2024-03-04T04:23:40Z",
  "workflowDuration": "PT9.4591355S",
  "owner": {
    "id": "196ba668-cf01-496d-bf41-18be461650dc"
  }
}
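To work with this response in a script, a minimal Python sketch might look like the following. It uses a trimmed copy of the JSON above, with a placeholder host and path and a shortened sig value, and reads the page count, state, and workflow duration. It also reads the se query parameter on textUri, which in an Azure Storage shared access signature is the expiry time, so the link is only usable for a limited period. The helper names are illustrative, and the duration parser only handles the seconds-only form shown here.

```python
import json
import re
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

# Trimmed copy of the response above. The host and path are placeholders and
# the signature is shortened; the other values are taken from the response.
response_json = """
{
  "textUri": "https://example.invalid/files/doc.json?sv=2023-11-03&se=2024-03-04T10%3A23%3A35Z&sr=c&sp=rl&sig=REDACTED",
  "document": { "pageCount": 29 },
  "state": "FINISHED",
  "creationDate": "2024-03-04T04:23:29Z",
  "finishedDate": "2024-03-04T04:23:40Z",
  "workflowDuration": "PT9.4591355S"
}
"""
content = json.loads(response_json)


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp of the form 2024-03-04T04:23:29Z as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def duration_seconds(value: str) -> float:
    """Parse a seconds-only ISO 8601 duration such as PT9.4591355S."""
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", value)
    if match is None:
        raise ValueError(f"unsupported duration format: {value!r}")
    return float(match.group(1))


page_count = content["document"]["pageCount"]
state = content["state"]
workflow_seconds = duration_seconds(content["workflowDuration"])

# Wall-clock time between creation and completion, from the two timestamps.
elapsed = (
    parse_utc(content["finishedDate"]) - parse_utc(content["creationDate"])
).total_seconds()

# parse_qs percent-decodes the value, so %3A becomes ":" before parsing.
query = parse_qs(urlparse(content["textUri"]).query)
expires = parse_utc(query["se"][0])

print(page_count, state, workflow_seconds, elapsed, expires.isoformat())
```

Because the signed links expire, a script that needs the extracted text later should fetch the content again to get fresh URIs rather than storing textUri.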

You can request formatted text by getting the content by identifier.

Here we use the --fields argument to request only the GraphQL text field.

g get --type content --id 4330c766-488d-4dcd-8237-f03d7a3a068f --fields "{ text }"

For textual content that has section headings and/or tables, you can request the Markdown-formatted text. (For PDFs, this requires using a custom workflow with Azure Document Intelligence to extract headings and tables.)

g get --type content --id 4330c766-488d-4dcd-8237-f03d7a3a068f --fields "{ markdown }"
