Document JSON
Structured JSON format for extracted text
When text-based content is ingested into Graphlit, such as a PDF, Word document, Markdown file or email, the platform automatically extracts the text into a structured JSON format.
You can find the extracted JSON file in the textUri property of the Content object.
query QueryContents($filter: ContentFilter!) {
  contents(filter: $filter) {
    results {
      id
      textUri
    }
  }
}
The extracted JSON is stored in Azure Blob Storage, in a separate Azure storage account per Graphlit project.
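To work with the extracted text programmatically, you can run the query above and then download the file referenced by textUri. Below is a minimal Python sketch, not an official SDK example; the endpoint URL, bearer token and helper name are placeholders you would substitute for your project, and it assumes the textUri is directly downloadable.

import requests

# Placeholder values -- substitute your project's GraphQL endpoint and credentials.
GRAPHQL_URL = "https://<your-graphlit-endpoint>/graphql"
BEARER_TOKEN = "<your-token>"

QUERY = """
query QueryContents($filter: ContentFilter!) {
  contents(filter: $filter) {
    results {
      id
      textUri
    }
  }
}
"""

def fetch_document_json(content_filter: dict) -> list[dict]:
    """Run the contents query, then download the extracted JSON for each result."""
    response = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"filter": content_filter}},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]["contents"]["results"]

    documents = []
    for result in results:
        if result.get("textUri"):
            # textUri points to the extracted JSON stored in Azure Blob Storage.
            documents.append(requests.get(result["textUri"], timeout=30).json())
    return documents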
Text Extraction
The JSON document contains an array of indexed pages (pp), where each page contains an array of indexed text chunks (cc). Each text chunk may represent a phrase, line or paragraph of text, and each chunk also records the number of tokens (tok) that its text represents, counted with the OpenAI tokenizer.
In addition to the text (t), the JSON can optionally specify the role (r) of each text chunk.
When the Azure AI Document Intelligence layout model is used during document preparation, the model identifies the intended role of the text, such as PageHeader, SectionHeading or PageNumber.
Without such a layout model, the role is left unspecified.
The text in this example was extracted from this paper using the Azure AI Document Intelligence layout model.
{
"pp": [
{
"i": 0,
"cc": [
{
"i": 0,
"t": "JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021",
"r": "PageHeader",
"tok": 24
},
{
"i": 1,
"t": "Unifying Large Language Models and Knowledge Graphs: A Roadmap",
"r": "Title",
"tok": 13
},
{
"i": 2,
"t": "Shirui Pan, Senior Member, IEEE, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, Xindong Wu, Fellow, IEEE",
"tok": 36
},
{
"i": 3,
"t": "Abstract-Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizability. However, LLMs are black-box models, which often fall short of capturing and accessing factual knowledge. In contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolving by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages. In this article, we present a forward-looking roadmap for the unification of LLMs and KGs. Our roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs; 2) LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering; and 3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge. We review and summarize existing efforts within these three frameworks in our roadmap and pinpoint their future research directions.",
"tok": 345
},
{
"i": 4,
"t": "Index Terms-Natural Language Processing, Large Language Models, Generative Pre-Training, Knowledge Graphs, Roadmap, Bidirectional Reasoning.",
"tok": 29
},
{
"i": 5,
"t": "✦",
"tok": 2
},
{
"i": 6,
"t": "1 INTRODUCTION",
"r": "SectionHeading",
"tok": 3
},
{
"i": 7,
"t": "Large language models (LLMs)1 (e.g., BERT [1], RoBERTA [2], and T5 [3]), pre-trained on the large-scale corpus, have shown great performance in various natural language processing (NLP) tasks, such as question answering [4], machine translation [5], and text generation [6]. Recently, the dramatically increasing model size further enables the LLMs with the emergent ability [7], paving the road for applying LLMs as Artificial General Intelligence (AGI). Advanced LLMs like ChatGPT2 and PaLM23, with billions of parameters, exhibit great potential in many complex practical tasks, such as education [8], code generation [9] and recommendation [10].",
"tok": 151
},
{
"i": 8,
"t": "· Shirui Pan is with the School of Information and Communication Tech- nology and Institute for Integrated and Intelligent Systems (IIIS), Griffith University, Queensland, Australia. Email: s.pan@griffith.edu.au;",
"tok": 45
},
{
"i": 9,
"t": "· Linhao Luo and Yufei Wang are with the Department of Data Sci- ence and AI, Monash University, Melbourne, Australia. E-mail: lin- hao.luo@monash.edu, garyyufei@gmail.com.",
"tok": 53
},
{
"i": 10,
"t": "· Chen Chen is with the Nanyang Technological University, Singapore. E- mail: s190009@ntu.edu.sg.",
"tok": 28
},
{
"i": 11,
"t": "· Jiapu Wang is with the Faculty of Information Technology, Beijing Uni- versity of Technology, Beijing, China. E-mail: jpwang@emails.bjut.edu.cn.",
"tok": 39
},
{
"i": 12,
"t": ". Xindong Wu is with the Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology, Hefei, China; He is also affiliated with the Research Center for Knowledge Engineering, Zhejiang Lab, Hangzhou, China. Email: xwu@hfut.edu.cn.",
"tok": 69
},
{
"i": 13,
"t": ". Shirui Pan and Linhao Luo contributed equally to this work.",
"tok": 15
},
{
"i": 14,
"t": "· Corresponding Author: Xindong Wu.",
"tok": 10
},
{
"i": 15,
"t": "1. LLMs are also known as pre-trained language models (PLMs).",
"tok": 17
},
{
"i": 16,
"t": "2. https://openai.com/blog/chatgpt",
"tok": 11
},
{
"i": 17,
"t": "3. https://ai.google/discover/palm2",
"tok": 11
},
{
"i": 18,
"t": "arXiv:2306.08302v2 [cs.CL] 20 Jun 2023",
"r": "SectionHeading",
"tok": 21
},
{
"i": 19,
"t": "Knowledge Graphs (KGs)",
"tok": 7
},
{
"i": 20,
"t": "Cons:",
"tok": 2
},
{
"i": 21,
"t": "Pros:",
"tok": 2
},
{
"i": 22,
"t": ". Implicit Knowledge",
"tok": 3
},
{
"i": 23,
"t": "· Structural Knowledge",
"tok": 3
},
{
"i": 24,
"t": ". Hallucination",
"tok": 4
},
{
"i": 25,
"t": "· Accuracy",
"tok": 2
},
{
"i": 26,
"t": "· Indecisiveness",
"tok": 5
},
{
"i": 27,
"t": "· Decisiveness",
"tok": 4
},
{
"i": 28,
"t": ". Black-box",
"tok": 3
},
{
"i": 29,
"t": "· Interpretability",
"tok": 3
},
{
"i": 30,
"t": "· Lacking Domain- specific/New Knowledge",
"tok": 8
},
{
"i": 31,
"t": "· Domain-specific Knowledge",
"tok": 4
},
{
"i": 32,
"t": "· Evolving Knowledge",
"tok": 4
},
{
"i": 33,
"t": "Pros:",
"tok": 2
},
{
"i": 34,
"t": "Cons:",
"tok": 2
},
{
"i": 35,
"t": "· General Knowledge",
"tok": 3
},
{
"i": 36,
"t": "· Incompleteness",
"tok": 5
},
{
"i": 37,
"t": "· Language Processing",
"tok": 3
},
{
"i": 38,
"t": "· Lacking Language",
"tok": 4
},
{
"i": 39,
"t": "· Generalizability",
"tok": 4
},
{
"i": 40,
"t": "Understanding",
"tok": 1
},
{
"i": 41,
"t": "· Unseen Facts",
"tok": 4
},
{
"i": 42,
"t": "Large Language Models (LLMs)",
"tok": 7
},
{
"i": 43,
"t": "Fig. 1. Summarization of the pros and cons for LLMs and KGs. LLM pros: General Knowledge [11], Language Processing [12], Generaliz- ability [13]; LLM cons: Implicit Knowledge [14], Hallucination [15], In- decisiveness [16], Black-box [17], Lacking Domain-specific/New Knowl- edge [18]. KG pros: Structural Knowledge [19], Accuracy [20], Decisive- ness [21], Interpretability [22], Domain-specific Knowledge [23], Evolv- ing Knowledge [24]; KG cons: Incompleteness [25], Lacking Language Understanding [26], Unseen Facts [27].",
"tok": 143
},
{
"i": 44,
"t": "Despite their success in many applications, LLMs have been criticized for their lack of factual knowledge. Specif- ically, LLMs memorize facts and knowledge contained in the training corpus [14]. However, further studies reveal that LLMs are not able to recall facts and often experience hallucinations by generating statements that are factually incorrect [15], [28]. For example, LLMs might say \"Ein-",
"tok": 87
},
{
"i": 45,
"t": "1",
"r": "PageNumber",
"tok": 1
},
{
"i": 46,
"t": "0000-0000/00$00.00 @ 2021 IEEE",
"r": "PageFooter",
"tok": 16
}
],
"tok": 1258
},
/* NOTE: pages have been removed */
{
"i": 28,
"cc": [
{
"i": 0,
"t": "JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021",
"r": "PageHeader",
"tok": 24
},
{
"i": 1,
"t": "[242] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and J .- R. Wen, \"Structgpt: A general framework for large language model to reason over structured data,\" arXiv preprint ar Xiv:2305.09645, 2023.",
"tok": 70
},
{
"i": 2,
"t": "[243] H. Zhu, H. Peng, Z. Lyu, L. Hou, J. Li, and J. Xiao, \"Pre-training language model incorporating domain-specific heterogeneous knowledge into a unified representation,\" Expert Systems with Applications, vol. 215, p. 119369, 2023.",
"tok": 64
},
{
"i": 3,
"t": "[244] L. Wang, H. Hu, L. Sha, C. Xu, D. Jiang, and K .- F. Wong, \"Recin- dial: A unified framework for conversational recommendation with pretrained language models,\" in Proceedings of the 2nd Con- ference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 2022, pp. 489-500.",
"tok": 96
},
{
"i": 4,
"t": "[245] R. Ding, X. Han, and L. Wang, \"A unified knowledge graph service for developing domain language models in ai software,\" ar Xiv preprint ar Xiv:2212.05251, 2022.",
"tok": 50
},
{
"i": 5,
"t": "[246] T. Y. Zhuo, Y. Huang, C. Chen, and Z. Xing, \"Exploring AI ethics of chatgpt: A diagnostic analysis,\" CoRR, vol. abs/2301.12867, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2301. 12867",
"tok": 79
},
{
"i": 6,
"t": "[247] W. Kryściński, B. McCann, C. Xiong, and R. Socher, \"Evaluating the factual consistency of abstractive text summarization,\" ar Xiv preprint ar Xiv:1910.12840, 2019.",
"tok": 62
},
{
"i": 7,
"t": "[248] Z. Ji, Z. Liu, N. Lee, T. Yu, B. Wilie, M. Zeng, and P. Fung, \"Rho (\\p): Reducing hallucination in open-domain dialogues with knowledge grounding,\" ar Xiv preprint ar Xiv:2212.01588, 2022.",
"tok": 73
},
{
"i": 8,
"t": "[249] S. Feng, V. Balachandran, Y. Bai, and Y. Tsvetkov, \"Factkb: Gen- eralizable factuality evaluation using language models enhanced with factual knowledge,\" ar Xiv preprint ar Xiv:2305.08281, 2023.",
"tok": 65
},
{
"i": 9,
"t": "[250] Q. Dong, D. Dai, Y. Song, J. Xu, Z. Sui, and L. Li, \"Calibrating factual knowledge in pretrained language models,\" ar Xiv preprint ar Xiv:2210.03329, 2022.",
"tok": 59
},
{
"i": 10,
"t": "[251] E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning, \"Fast model editing at scale,\" ar Xiv preprint ar Xiv:2110.11309, 2021.",
"tok": 54
},
{
"i": 11,
"t": "[252] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau, \"Mass-editing memory in a transformer,\" ar Xiv preprint ar Xiv:2210.07229, 2022.",
"tok": 57
},
{
"i": 12,
"t": "[253] S. Cheng, N. Zhang, B. Tian, Z. Dai, F. Xiong, W. Guo, and H. Chen, \"Editing language model-based knowledge graph em- beddings,\" ar Xiv preprint ar Xiv:2301.10405, 2023.",
"tok": 66
},
{
"i": 13,
"t": "[254] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu, \"Recall and learn: Fine-tuning deep pretrained language models with less forgetting,\" in Proceedings of the 2020 Conference on Empirical",
"tok": 57
},
{
"i": 14,
"t": "Methods in Natural Language Processing (EMNLP), 2020, pp. 7870- 7881.",
"tok": 24
},
{
"i": 15,
"t": "[255] S. Diao, Z. Huang, R. Xu, X. Li, Y. Lin, X. Zhou, and T. Zhang, \"Black-box prompt learning for pre-trained language models,\" ar Xiv preprint ar Xiv:2201.08531, 2022.",
"tok": 63
},
{
"i": 16,
"t": "[256] T. Sun, Y. Shao, H. Qian, X. Huang, and X. Qiu, \"Black-box tuning for language-model-as-a-service,\" in International Conference on Machine Learning. PMLR, 2022, pp. 20 841-20 855.",
"tok": 64
},
{
"i": 17,
"t": "[257] X. Chen, A. Shrivastava, and A. Gupta, \"NEIL: extracting visual knowledge from web data,\" in IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013. IEEE Computer Society, 2013, pp. 1409-1416. [Online]. Available: https://doi.org/10.1109/ICCV.2013.178",
"tok": 97
},
{
"i": 18,
"t": "[258] M. Warren and P. J. Hayes, \"Bounding ambiguity: Experiences with an image annotation system,\" in Proceedings of the 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and Short Paper Proceedings of the 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management (SAD 2018 and CrowdBias 2018) co-located the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2018), Zürich, Switzerland, July 5, 2018, ser. CEUR Workshop Proceedings, L. Aroyo, A. Dumitrache, P. K. Paritosh, A. J. Quinn, C. Welty, A. Checco, G. Demartini, U. Gadiraju, and C. Sarasua, Eds., vol. 2276. CEUR-WS.org, 2018, pp. 41-54. [Online]. Available: https://ceur-ws.org/Vol-2276/paper5.pdf",
"tok": 233
},
{
"i": 19,
"t": "[259] Z. Chen, Y. Huang, J. Chen, Y. Geng, Y. Fang, J. Z. Pan, N. Zhang, and W. Zhang, \"Lako: Knowledge-driven visual estion answer- ing via late knowledge-to-text injection,\" 2022.",
"tok": 62
},
{
"i": 20,
"t": "[260] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, \"Imagebind: One embedding space to bind them all,\" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180-15 190.",
"tok": 85
},
{
"i": 21,
"t": "[261] J. Zhang, Z. Yin, P. Chen, and'S. Nichele, \"Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review,\" Information Fusion, vol. 59, pp. 103-126, 2020.",
"tok": 59
},
{
"i": 22,
"t": "[262] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, \"A comprehensive survey on graph neural networks,\" IEEE transactions on neural networks and learning systems, vol. 32, no. 1, pp. 4-24, 2020.",
"tok": 69
},
{
"i": 23,
"t": "[263] T. Wu, M. Caccia, Z. Li, Y .- F. Li, G. Qi, and G. Haffari, \"Pretrained language model in continual learning: A comparative study,\" in International Conference on Learning Representations, 2022.",
"tok": 60
},
{
"i": 24,
"t": "29",
"r": "PageNumber",
"tok": 1
}
],
"tok": 1693
}
],
"tok": 48357
}
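As a usage example, the sketch below walks the pp and cc arrays of a downloaded document JSON, skips page furniture by role, and reports per-page token counts. The role names and skip list come from the example above; the file path is a placeholder.

import json

# Roles that represent page furniture rather than body text.
SKIP_ROLES = {"PageHeader", "PageFooter", "PageNumber"}

with open("document.json", encoding="utf-8") as f:  # placeholder path
    doc = json.load(f)

for page in doc["pp"]:
    body = [chunk["t"] for chunk in page["cc"] if chunk.get("r") not in SKIP_ROLES]
    print(f"Page {page['i']}: {page['tok']} tokens, {len(body)} body chunks")
    # "\n".join(body) yields the page's body text, ready for downstream use.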
Table Extraction
Graphlit provides a JSON representation of extracted tables when the preparation workflow enables table extraction.
Tables are identified by a role (r) of Table, and the table content is formatted as tab-delimited (TSV) text in the text (t) property.
In addition, the table is specified as an indexed child array of text chunks (cc), each with an identifying role, such as TableColumnHeader or TableCell, and with a row index (ri) and column index (ci).
{
"pp": [
{
"i": 1,
"cc": [
{
"i": 0,
"t": "UBER TECHNOLOGIES, INC. TABLE OF CONTENTS",
"r": "Title",
"tok": 12
},
{
"i": 1,
"t": "\t\tPages\n\tSpecial Note Regarding Forward-Looking Statements\t2\n\tPART I - FINANCIAL INFORMATION\t4\nItem 1.\tFinancial Statements (unaudited).\t4\n\tCondensed Consolidated Balance Sheets as of December 31, 2021 and March 31, 2022\t4\n\tCondensed Consolidated Statements of Operations for the Three Months Ended March 31, 2021 and 2022\t5\n\tCondensed Consolidated Statements of Comprehensive Income (Loss) for the Three Months Ended March 31, 2021 and 2022\t6\n\tCondensed Consolidated Statements of Redeemable Non-Controlling Interests and Equity for the Three Months Ended March 31, 2021\t\n\tand 2022\t7\n\tCondensed Consolidated Statements of Cash Flows for the Three Months Ended March 31, 2021 and 2022\t9\n\tNotes to Condensed Consolidated Financial Statements\t11\nItem 2.\tManagement's Discussion and Analysis of Financial Condition and Results of Operations\t32\nItem 3.\tQuantitative and Qualitative Disclosures About Market Risk\t48\nItem 4.\tControls and Procedures\t48\n\tPART II - OTHER INFORMATION\t49\nItem 1.\tLegal Proceedings\t49\nItem 1A.\tRisk Factors\t50\nItem 2.\tUnregistered Sales of Equity Securities and Use of Proceeds\t86\nItem 6.\tExhibits\t86\n\tSignatures\t88\n",
"r": "Table",
"tok": 321,
"cc": [
{
"i": 1,
"ri": 0,
"ci": 0,
"t": "",
"r": "TableColumnHeader",
"tok": 0
},
{
"i": 2,
"ri": 0,
"ci": 1,
"t": "",
"r": "TableColumnHeader",
"tok": 0
},
{
"i": 3,
"ri": 0,
"ci": 2,
"t": "Pages",
"r": "TableColumnHeader",
"tok": 1
},
{
"i": 4,
"ri": 1,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 5,
"ri": 1,
"ci": 1,
"t": "Special Note Regarding Forward-Looking Statements",
"r": "TableCell",
"tok": 8
},
{
"i": 6,
"ri": 1,
"ci": 2,
"t": "2",
"r": "TableCell",
"tok": 1
},
{
"i": 7,
"ri": 2,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 8,
"ri": 2,
"ci": 1,
"t": "PART I - FINANCIAL INFORMATION",
"r": "TableCell",
"tok": 7
},
{
"i": 9,
"ri": 2,
"ci": 2,
"t": "4",
"r": "TableCell",
"tok": 1
},
{
"i": 10,
"ri": 3,
"ci": 0,
"t": "Item 1.",
"r": "TableCell",
"tok": 4
},
{
"i": 11,
"ri": 3,
"ci": 1,
"t": "Financial Statements (unaudited).",
"r": "TableCell",
"tok": 7
},
{
"i": 12,
"ri": 3,
"ci": 2,
"t": "4",
"r": "TableCell",
"tok": 1
},
{
"i": 13,
"ri": 4,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 14,
"ri": 4,
"ci": 1,
"t": "Condensed Consolidated Balance Sheets as of December 31, 2021 and March 31, 2022",
"r": "TableCell",
"tok": 23
},
{
"i": 15,
"ri": 4,
"ci": 2,
"t": "4",
"r": "TableCell",
"tok": 1
},
{
"i": 16,
"ri": 5,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 17,
"ri": 5,
"ci": 1,
"t": "Condensed Consolidated Statements of Operations for the Three Months Ended March 31, 2021 and 2022",
"r": "TableCell",
"tok": 23
},
{
"i": 18,
"ri": 5,
"ci": 2,
"t": "5",
"r": "TableCell",
"tok": 1
},
{
"i": 19,
"ri": 6,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 20,
"ri": 6,
"ci": 1,
"t": "Condensed Consolidated Statements of Comprehensive Income (Loss) for the Three Months Ended March 31, 2021 and 2022",
"r": "TableCell",
"tok": 27
},
{
"i": 21,
"ri": 6,
"ci": 2,
"t": "6",
"r": "TableCell",
"tok": 1
},
{
"i": 22,
"ri": 7,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 23,
"ri": 7,
"ci": 1,
"t": "Condensed Consolidated Statements of Redeemable Non-Controlling Interests and Equity for the Three Months Ended March 31, 2021",
"r": "TableCell",
"tok": 29
},
{
"i": 24,
"ri": 7,
"ci": 2,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 25,
"ri": 8,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 26,
"ri": 8,
"ci": 1,
"t": "and 2022",
"r": "TableCell",
"tok": 4
},
{
"i": 27,
"ri": 8,
"ci": 2,
"t": "7",
"r": "TableCell",
"tok": 1
},
{
"i": 28,
"ri": 9,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 29,
"ri": 9,
"ci": 1,
"t": "Condensed Consolidated Statements of Cash Flows for the Three Months Ended March 31, 2021 and 2022",
"r": "TableCell",
"tok": 25
},
{
"i": 30,
"ri": 9,
"ci": 2,
"t": "9",
"r": "TableCell",
"tok": 1
},
{
"i": 31,
"ri": 10,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 32,
"ri": 10,
"ci": 1,
"t": "Notes to Condensed Consolidated Financial Statements",
"r": "TableCell",
"tok": 8
},
{
"i": 33,
"ri": 10,
"ci": 2,
"t": "11",
"r": "TableCell",
"tok": 1
},
{
"i": 34,
"ri": 11,
"ci": 0,
"t": "Item 2.",
"r": "TableCell",
"tok": 4
},
{
"i": 35,
"ri": 11,
"ci": 1,
"t": "Management's Discussion and Analysis of Financial Condition and Results of Operations",
"r": "TableCell",
"tok": 12
},
{
"i": 36,
"ri": 11,
"ci": 2,
"t": "32",
"r": "TableCell",
"tok": 1
},
{
"i": 37,
"ri": 12,
"ci": 0,
"t": "Item 3.",
"r": "TableCell",
"tok": 4
},
{
"i": 38,
"ri": 12,
"ci": 1,
"t": "Quantitative and Qualitative Disclosures About Market Risk",
"r": "TableCell",
"tok": 10
},
{
"i": 39,
"ri": 12,
"ci": 2,
"t": "48",
"r": "TableCell",
"tok": 1
},
{
"i": 40,
"ri": 13,
"ci": 0,
"t": "Item 4.",
"r": "TableCell",
"tok": 4
},
{
"i": 41,
"ri": 13,
"ci": 1,
"t": "Controls and Procedures",
"r": "TableCell",
"tok": 3
},
{
"i": 42,
"ri": 13,
"ci": 2,
"t": "48",
"r": "TableCell",
"tok": 1
},
{
"i": 43,
"ri": 14,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 44,
"ri": 14,
"ci": 1,
"t": "PART II - OTHER INFORMATION",
"r": "TableCell",
"tok": 5
},
{
"i": 45,
"ri": 14,
"ci": 2,
"t": "49",
"r": "TableCell",
"tok": 1
},
{
"i": 46,
"ri": 15,
"ci": 0,
"t": "Item 1.",
"r": "TableCell",
"tok": 4
},
{
"i": 47,
"ri": 15,
"ci": 1,
"t": "Legal Proceedings",
"r": "TableCell",
"tok": 2
},
{
"i": 48,
"ri": 15,
"ci": 2,
"t": "49",
"r": "TableCell",
"tok": 1
},
{
"i": 49,
"ri": 16,
"ci": 0,
"t": "Item 1A.",
"r": "TableCell",
"tok": 5
},
{
"i": 50,
"ri": 16,
"ci": 1,
"t": "Risk Factors",
"r": "TableCell",
"tok": 2
},
{
"i": 51,
"ri": 16,
"ci": 2,
"t": "50",
"r": "TableCell",
"tok": 1
},
{
"i": 52,
"ri": 17,
"ci": 0,
"t": "Item 2.",
"r": "TableCell",
"tok": 4
},
{
"i": 53,
"ri": 17,
"ci": 1,
"t": "Unregistered Sales of Equity Securities and Use of Proceeds",
"r": "TableCell",
"tok": 11
},
{
"i": 54,
"ri": 17,
"ci": 2,
"t": "86",
"r": "TableCell",
"tok": 1
},
{
"i": 55,
"ri": 18,
"ci": 0,
"t": "Item 6.",
"r": "TableCell",
"tok": 4
},
{
"i": 56,
"ri": 18,
"ci": 1,
"t": "Exhibits",
"r": "TableCell",
"tok": 3
},
{
"i": 57,
"ri": 18,
"ci": 2,
"t": "86",
"r": "TableCell",
"tok": 1
},
{
"i": 58,
"ri": 19,
"ci": 0,
"t": "",
"r": "TableCell",
"tok": 0
},
{
"i": 59,
"ri": 19,
"ci": 1,
"t": "Signatures",
"r": "TableCell",
"tok": 2
},
{
"i": 60,
"ri": 19,
"ci": 2,
"t": "88",
"r": "TableCell",
"tok": 1
}
]
},
{
"i": 2,
"t": "1",
"r": "PageNumber",
"tok": 1
}
],
"tok": 334
}
/* NOTE: pages have been removed */
],
"tok": 87176
}
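To rebuild a table from its child chunks, you can use the row (ri) and column (ci) indexes; the same content is also available as TSV in the chunk's text (t) property. A minimal sketch, assuming the table JSON above has been saved locally (the file path is a placeholder):

import json

with open("table-document.json", encoding="utf-8") as f:  # placeholder path
    doc = json.load(f)

for page in doc["pp"]:
    for chunk in page["cc"]:
        if chunk.get("r") != "Table":
            continue
        cells = chunk["cc"]
        rows = max(cell["ri"] for cell in cells) + 1
        cols = max(cell["ci"] for cell in cells) + 1
        # Build a rows x cols grid and place each cell by its indexes.
        table = [["" for _ in range(cols)] for _ in range(rows)]
        for cell in cells:
            table[cell["ri"]][cell["ci"]] = cell["t"]
        print(table[0])  # column headers, e.g. ['', '', 'Pages']
        # Alternatively, split the tab-delimited text:
        # tsv_rows = [line.split("\t") for line in chunk["t"].splitlines()]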
JSON Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"description": "Schema representing the structure of a document with text and tables, including metadata for pages, roles, and token counts.",
"properties": {
"pp": {
"type": "array",
"description": "Array of objects, each representing a page within the document.",
"items": {
"type": "object",
"properties": {
"i": {
"type": "number",
"description": "Index of the page within the document."
},
"cc": {
"type": "array",
"description": "Array of content chunks within a page, including text and tables.",
"items": {
"type": "object",
"properties": {
"i": {
"type": "number",
"description": "Index of the content chunk within the page."
},
"t": {
"type": "string",
"description": "Text of the content chunk or table title."
},
"tok": {
"type": "number",
"description": "Token count for the content chunk."
},
"r": {
"type": "string",
"description": "Role of the content chunk (e.g., 'Title', 'Table', 'PageNumber')."
},
"cc": {
"type": "array",
"optional": true,
"description": "Array of table cells if the content chunk is a table.",
"items": {
"type": "object",
"properties": {
"i": {
"type": "number",
"description": "Index of the table cell within the table."
},
"ri": {
"type": "number",
"description": "Row index of the table cell."
},
"ci": {
"type": "number",
"description": "Column index of the table cell."
},
"t": {
"type": "string",
"description": "Text of the table cell."
},
"r": {
"type": "string",
"description": "Role of the table cell (e.g., 'TableColumnHeader', 'TableCell')."
},
"tok": {
"type": "number",
"description": "Token count for the table cell."
}
},
"required": ["i", "ri", "ci", "t", "r", "tok"]
}
}
},
"required": ["i", "t", "tok"],
"additionalProperties": false
}
},
"tok": {
"type": "number",
"description": "Total token count for the page."
}
},
"required": ["i", "cc", "tok"],
"additionalProperties": false
}
},
"tok": {
"type": "number",
"description": "Total token count for the entire document."
}
},
"required": ["pp", "tok"],
"additionalProperties": false
}
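You can validate a downloaded document JSON against this schema before processing it. A minimal sketch using the third-party jsonschema package (the file names are placeholders):

import json
from jsonschema import Draft7Validator

with open("document-schema.json", encoding="utf-8") as f:  # placeholder path
    schema = json.load(f)
with open("document.json", encoding="utf-8") as f:  # placeholder path
    doc = json.load(f)

validator = Draft7Validator(schema)
errors = sorted(validator.iter_errors(doc), key=lambda e: list(e.path))
if not errors:
    print("Document JSON conforms to the schema.")
else:
    for error in errors:
        print(f"{list(error.path)}: {error.message}")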