Semantic Search

Search all content in your Graphlit project.

By specifying the search field in the GraphQL filter object, you can search your content by text string.

"search": "Knowledge Graph"

Graphlit uses a vector search approach: the text in the search field is converted into a vector embedding with OpenAI's Ada-002 embedding model. That embedding is used to find similar text chunks or audio transcript segments, which are returned in the query results.

Audio transcript segments are groups of transcript phrases, bucketed into one-minute intervals.
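
Each segment's startTime and endTime are returned as ISO 8601 durations, measured from the start of the recording. For example, a segment covering the sixth minute of audio is bounded like this:

{
  "startTime": "PT6M",
  "endTime": "PT7M"
}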

Query:

query QueryContents($filter: ContentFilter!) {
  contents(filter: $filter) {
    results {
      id
      name
      creationDate
      state
      owner {
        id
      }
      originalDate
      finishedDate
      workflowDuration
      uri
      text
      type
      fileType
      mimeType
      fileName
      fileSize
      masterUri
      mezzanineUri
      transcriptUri
      audio {
        bitrate
        channels
        sampleRate
        bitsPerSample
        duration
      }
      segments {
        startTime
        endTime
        phrases {
          startTime
          endTime
          confidence
          text
          speaker
        }
      }
    }
  }
}

Variables:

{
  "filter": {
    "search": "Knowledge Graph",
    "offset": 0,
    "limit": 100
  }
}

Response:

{
  "results": [
    {
      "type": "FILE",
      "segments": [
        {
          "startTime": "PT6M",
          "endTime": "PT7M",
          "phrases": [
            {
              "startTime": "PT6M",
              "endTime": "PT7M",
              "text": "which\r\ncould just come from inferring across a whole bunch of images or a bunch of data. And so we kinda call that 3rd order metadata that's really I mean, that's when you start to get into machine learning and you start to get into more complex inference.\r\nAnd and that could be something where we now know this is an image of this piece of equipment or this physical asset that is something that a customer has maybe in another data somewhere.\r\nSo creating that edge essentially and we think in knowledge graphs, so everything is kind of an edge connecting something to something,\r\ncreating those edges.\r\nto us is is kind of that 3rd order method. Is there any limits in sort of how far we could spider out from, you know, 1st order metadata, you know, to 2nd or to 3rd order. Could could this essentially just carry on and online depending on our compute capabilities? Yeah. I mean, it's and if that different from, like, what Google's doing with their knowledge graph or the web or other companies. But, I mean, theoretically,"
            }
          ]
        },
        {
          "startTime": "PT16M",
          "endTime": "PT17M",
          "phrases": [
            {
              "startTime": "PT16M",
              "endTime": "PT17M",
              "text": "And once you can start to the other interesting part is, hey. I mean, what other data can we gather from the linked entities on the show notes or the link documents,\r\nHTML documents, and start to create commonality from that as well. And that's where we started to see a lot of the value. And and there was so much data. And, essentially, I had to build, like, a web spider to do that where it just would continually start pulling in the data, reading the document, doing it at the analysis,\r\ncoming up with a list of links that it found in the document, and then spidering out again.\r\nWhen I hear you talk about this, it feels like that this is creating the the network effect for data. Exactly. I mean and that's why I really only got into knowledge graphs heavy maybe 5 years ago. And once you start to see the I mean, I've done a good bit of database work and understanding, I mean, okay. You have sort of a table and you have a key to something else that lives somewhere else. And I start to look at knowledge graphs is a great way to have sort of dynamic references.\r\nthat we can invent new edges on the fly. I mean, you have all your data and you say, oh, well, this entity is actually, I mean, related to this other entity, and here's a new edge we're gonna create. And in the SQL world, they're kind of classic database world. Updating the schema is always the biggest pain in the butt for for everybody. And so schema migration and all that. And knowledge graphs are so much more dynamic, and they give you the ability to pivot on any entity"
            }
          ]
        },
        {
          "startTime": "PT17M",
          "endTime": "PT18M",
          "phrases": [
            {
              "startTime": "PT17M",
              "endTime": "PT18M",
              "text": "and any edge in the system. And so we can invent new edges and then just kinda see what happens and say, oh, well, let's pivot on that edge and see what what are all the things that relate to that. And that's where I love. I mean, I been deep into it for about 5 or 6 years now, and it just the ability to\r\nto represent your data and and that data model is is really what's key for us, and and we're learning things every day about it. So you you talked about building these edges\r\nin terms of the knowledge graph, but there's there's another sort of edge\r\nidea. I I wanna I wanna get your opinion on. And this is this is the idea of edge computing.\r\nAre you familiar with this? Yeah. Mhmm. Could you tell me what it means to you? I mean, to me, it's there is some device Internet connected that lives typically on premise."
            }
          ]
        },
        {
          "startTime": "PT27M",
          "endTime": "PT28M",
          "phrases": [
            {
              "startTime": "PT27M",
              "endTime": "PT28M",
              "text": "our knowledge graph? and then actually, I mean, have public access to it. And so it's something we're thinking about. I mean, we're we don't have a public angle to what we're doing yet. It wouldn't be that hard to expose, but it's\r\nit could do some really interesting things that way. The way I understand, up until now, we've been talking about, like, a a database or\r\ns 3 bucket full of files, you know, full, you know, blob storage somewhere where you put your, you know, point at that, get access to it somehow, ingest it, and then then create these knowledge graphs in do all the things we've been talking about. Is there a world where I could take a service like yours and point it at an API and say, can you just poll that API\r\nand and can you do that and build, like, a knowledge graph around what was that and, like, tell me new things about that data? Well, that's where the funny thing is that's where it all started. So I was I have it we have this concept of a feed, and so it's it's essentially a it's based kind of on the RSS feed concept where You can have any API that we can read. It could be an RSS feed. It could be what other ones do I have?"
            }
          ]
        },
        {
          "startTime": "PT29M",
          "endTime": "PT30M",
          "phrases": [
            {
              "startTime": "PT29M",
              "endTime": "PT30M",
              "text": "or the Microsoft Graph\r\nAPI and things like that. That's\r\nreally, I mean, conceptually\r\nright in line and and probably wouldn't take them very long to to integrate. That's really interesting. A lot of those examples for me that they were at least in my mind, an example of a fee that's constantly updating. So you could like you said, you could sit and and listen to that fee. I was thinking more about some geospatial APIs there. You you show up with a\r\na geography and say, well, show me everything within this polygon with this in this geographic area and make a request based on that. And I'm wondering if you could do something like that. If if you knew the bounding box of the the API,\r\nand just started polling it constantly\r\nand building, like, this knowledge graph around the stuff that you were finding, that would be amazing. We have a concept of\r\nof police So we need drop in an ESRE shape file into our system. So we ingest the file. We extract\r\nbasically, convert it to geojison internally so we get a geofence.\r\nand then we, what we call, promote it to a place entity. And so it creates a a place in the graph that becomes more searchable kind of a like a top level entity in our graph, but we also do data enrichment on that. So I go and I look and I call the Google places API. and I try and map that to you. Okay. Like, is there any other metadata essentially I can get around that place?"
            }
          ]
        },
        {
          "startTime": "PT30M",
          "endTime": "PT31M",
          "phrases": [
            {
              "startTime": "PT30M",
              "endTime": "PT31M",
              "text": "Yeah. But we could also do I mean, I've talked to you near map. I've talked to a couple other satellite services,\r\nwe could go enrich and, like, go get me the latest satellite data for that GS that region. and layer that in. That, because of our the way we have our inventing model, we could essentially we now we now support webhooks. So when anything happens to the system, like an entity is created or Tag is added, we can call a webhook. And so that's an area where\r\nfor now anybody could build a data enrichment where they could call some other API\r\nand then call back to us to inject data back into the graph. But we're also looking at other ways where we can auto basically just do that in in a in a box, I mean, where we could add add that as a feature for a customer to say, hey. For any SREU shape file you put in here,"
            }
          ]
        },
        {
          "startTime": "PT31M",
          "endTime": "PT32M",
          "phrases": [
            {
              "startTime": "PT31M",
              "endTime": "PT32M",
              "text": "go get me the latest satellite data from this service and we could just have that as an option. Wait. How do you know where to stop with with this? Because I think at some stage that people might feel like they're drowning in data. How do you know in You might cross a threshold where, like,\r\nthe the return on an investment\r\nis massive. You know? It just goes up into the right, and then it dips off. How do you know where to stop the these these spiders? How do you know when the knowledge graph is like, okay. That that's enough to complete this task for for what we're doing today? Yeah. I mean, that logic is is the tricky part. I have I in the development, I have created bugs where I kinda created infinite loop of spidering. So it's there's there's definitely a risk there. think at some point, you have to kind of see I think what we did is if we start to enrich and we're not making any changes, if we're kind of seeing, like, okay. I'm I'm getting more data, but it's literally the same data that was already there, I would start to cut off the spider at that point.\r\nAnd so it's that that is really one of the big problems, though, because you can I mean, spend a lot of money in calling out to other APIs and doing enrichment that may never be needed."
            }
          ]
        },
        {
          "startTime": "PT36M",
          "endTime": "PT37M",
          "phrases": [
            {
              "startTime": "PT36M",
              "endTime": "PT37M",
              "text": "And so those are kind of provided context, but the the imagery we we tend to be more imagery heavy in\r\njust because that's a lot of where the volume of data is. But more and more, we wanna pull in other data formats that that kinda relate to that. If you get to the stage where you can\r\njoin, like, as built documentation, those PDF documentation that all engineers love of CAD files of structures in the real world with other data,\r\nlike current data about those, then you are on a gold mine, my friend.\r\nI hope so. I mean, that's that's what we're trying to get to. I mean, and from the folks we've talked to, I mean, essentially, they just have a massive data, and it's they just want kind of And and we talked to oil oil and gas customer who said, look. We don't want Google search. Like, just\r\nsearching\r\nfile names isn't enough, searching\r\nFull text isn't enough. You essentially want, like, a semantic search, and that's what we're creating is a way to search across the relationships"
            }
          ]
        }
      ],
      "mimeType": "audio/mpeg",
      "fileType": "AUDIO",
      "fileName": "Unstructured Data is Dark Data Podcast.mp3",
      "fileSize": 33008244,
      "masterUri": "https://graphlit20230701d31d9453.blob.core.windows.net/files/c0cc103d-467b-43c1-8256-8b99f346d4f3/Unstructured%20Data%20is%20Dark%20Data%20Podcast.mp3",
      "mezzanineUri": "https://graphlit20230701d31d9453.blob.core.windows.net/files/c0cc103d-467b-43c1-8256-8b99f346d4f3/Mezzanine/Unstructured%20Data%20is%20Dark%20Data%20Podcast.mp3",
      "transcriptUri": "https://graphlit20230701d31d9453.blob.core.windows.net/files/c0cc103d-467b-43c1-8256-8b99f346d4f3/Transcript/Unstructured%20Data%20is%20Dark%20Data%20Podcast.json",
      "audio": {
        "bitrate": 106000,
        "channels": 1,
        "sampleRate": 48000,
        "duration": "00:41:26.0640000"
      },
      "uri": "https://graphlitplatform.blob.core.windows.net/samples/Unstructured%20Data%20is%20Dark%20Data%20Podcast.mp3",
      "id": "c0cc103d-467b-43c1-8256-8b99f346d4f3",
      "name": "Unstructured Data is Dark Data Podcast.mp3",
      "state": "FINISHED",
      "creationDate": "2023-07-03T22:24:50Z",
      "finishedDate": "2023-07-03T22:25:46Z",
      "workflowDuration": "PT56.2314332S",
      "owner": {
        "id": "9422b73d-f8d6-4faf-b7a9-152250c862a4"
      }
    }
  ]
}

Search Types

Graphlit offers multiple types of search for different use cases and requirements.

By assigning the searchType field, you can control the behavior of the search engine.

The simplest (and default) approach is classic KEYWORD search, as found in Elasticsearch or Azure Cognitive Search. Graphlit will use the search field to find the closest matches via keyword indexing.

Keyword indexing is a process that allows fast and accurate text search. When a document is added, it's broken down into individual words or 'tokens'. These tokens are then simplified and stored in an 'index', which is like a roadmap of where each word is located in the document. This allows the system to quickly find and retrieve documents when a particular word is searched. This process also helps in ranking the search results based on the relevance of the searched word in the documents.
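
As a simplified illustration (not Graphlit's actual index format), the phrase 'knowledge graphs are dynamic' could be tokenized, stemmed ('graphs' becomes 'graph'), and stripped of stop words ('are'), then stored in an inverted index that maps each token to its document and word positions:

{
  "knowledge": [ { "documentId": "doc-1", "positions": [0] } ],
  "graph": [ { "documentId": "doc-1", "positions": [1] } ],
  "dynamic": [ { "documentId": "doc-1", "positions": [3] } ]
}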

{
  "filter": {
    "searchType": "KEYWORD"
  }
}

When VECTOR is assigned to the searchType field, Graphlit uses the OpenAI Ada-002 embedding model to create a vector embedding from the provided search text, and finds the most similar results in the indexed content.

Vector-based similarity search is a technique that allows for a more nuanced, context-aware retrieval of information. Instead of looking at individual words, it looks at the semantic meaning of the words, which is represented as a multi-dimensional vector. These vectors are calculated using machine learning models that understand the context of words and their relation to each other. When a search is performed, it's converted into a vector and the system retrieves the most similar vectors, hence the most semantically similar content. This approach goes beyond keyword matching to deliver more accurate and relevant results based on the true meaning of the search query.
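
As a purely illustrative sketch (the vector values and similarity scores below are hypothetical, and real embeddings have many more dimensions), a search for 'Knowledge Graph' is embedded and compared against stored chunk embeddings, returning the closest matches:

{
  "queryVector": [0.12, -0.48, 0.33],
  "matches": [
    { "chunk": "we think in knowledge graphs, so everything is kind of an edge", "similarity": 0.91 },
    { "chunk": "You essentially want, like, a semantic search", "similarity": 0.84 }
  ]
}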

{
  "filter": {
    "searchType": "VECTOR"
  }
}

Graphlit also supports a HYBRID search type, which combines keyword search and vector-based similarity search on the same search text.

{
  "filter": {
    "searchType": "HYBRID"
  }
}
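
For example, the earlier search text can be combined with a hybrid search:

{
  "filter": {
    "search": "Knowledge Graph",
    "searchType": "HYBRID",
    "offset": 0,
    "limit": 100
  }
}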

Caveats

When developing search functionality, it's essential to understand the strengths and potential pitfalls of the different search methods: keyword indexing, vector-based similarity search, and a hybrid of the two.

Keyword indexing provides quick and accurate results for exact text matches, making it perfect for simple searches where precise matches are required. However, it doesn't take into account semantic meaning or context, which can lead to less relevant results if the search query includes synonyms or has ambiguous meaning.

Vector-based similarity search, on the other hand, understands semantic meanings and relationships between words, leading to more nuanced results. This makes it ideal for complex searches where context and relevance matter. However, it may return unexpected results when the search query includes precise terms with specific meanings, as the vector representation can emphasize general semantic context over exact wording.

The hybrid approach attempts to blend the best of both worlds by using keyword matching for precision and vector similarity for semantic understanding. This can result in more robust and accurate results for a variety of search queries, capturing both the exact wording and the context. However, it might also lead to unexpected outcomes, as the blend of methods may favor one type of result over the other depending on the specific implementation.

Therefore, it's critical for developers to understand these differences and choose the approach that best aligns with the needs of their application, users, and specific use cases.

Query Types

Via the queryType field, you can configure the syntax that Graphlit expects in the search field. By default, Graphlit uses SIMPLE query syntax.

Simple Query Syntax is the default for many search engines and is best suited for applications where users enter informal or natural-language queries. It provides free-text search, meaning users can enter a query as a plain string of text. This syntax supports the basic operators AND, OR, and NOT, as well as a suffix operator (*) for prefix search, but it does not support more complex query constructs.

{
  "filter": {
    "search": "hotels AND luxury",
    "queryType": "SIMPLE"
  }
}
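
The suffix operator enables prefix matching. For instance, this illustrative filter matches terms beginning with 'hotel', such as 'hotels' or 'hotelier':

{
  "filter": {
    "search": "hotel*",
    "queryType": "SIMPLE"
  }
}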

Full Query Syntax, on the other hand, provides more advanced and flexible search capabilities. Ideal for applications where users might have complex search needs, this syntax supports a wider set of operators and query types. Along with the capabilities of the simple query syntax, it includes more advanced search functionalities like proximity search, fuzzy search, term boosting (assigning greater importance to some terms in the query), regular expressions, and more.

{
  "filter": {
    "search": "\"hotels luxury\"~5",
    "queryType": "FULL"
  }
}

Example: searches for 'hotels' within five words of 'luxury', using Lucene proximity syntax.
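
Fuzzy search and term boosting use the same Lucene-style operators. As an illustrative sketch, this filter matches terms within one edit of 'luxury' and gives matches on 'hotels' twice the weight in ranking:

{
  "filter": {
    "search": "luxury~1 hotels^2",
    "queryType": "FULL"
  }
}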

Depending on the complexity and flexibility required for user searches in your application, you can choose to use either the simple or full query syntax. While the simple syntax may be sufficient for many scenarios, the full syntax offers greater control and precision in search queries.
