GraphRAG: A New Knowledge Graph Framework for Complex Queries

egzosn

1989 views · 2024-08-17 15:59:17

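Microsoft recently released GraphRAG (Graphs + Retrieval Augmented Generation), a new RAG framework. It is an end-to-end system that combines text extraction and semantic network analysis with the prompting and summarization capabilities of large language models (LLMs) to build a deep understanding of text datasets. When analyzing complex textual information in private data, GraphRAG uses an LLM-generated knowledge graph to substantially improve question-answering performance. A private dataset here means data the LLM was not trained on and has never seen, such as an enterprise's proprietary technical documents and business documents.

Why GraphRAG?

Retrieval-augmented generation retrieves relevant information from an external knowledge source so that an LLM can answer questions about private and/or previously unseen document collections. However, RAG performs poorly on global questions that target an entire text corpus, such as "What are the main themes in the dataset?", because this is inherently a query-focused summarization (QFS) task rather than an explicit retrieval task. Existing QFS methods, in turn, do not scale to the volume of text indexed by a typical RAG system.

Microsoft therefore proposed GraphRAG, short for graph-based retrieval-augmented generation, for question answering over private text corpora. Rather than simply looking up text snippets, GraphRAG builds a structured, hierarchical graph of the relationships in the information. It combines graph-based knowledge retrieval with an LLM and captures the entities, relationships, and key claims in large volumes of text. Unlike traditional RAG methods, which rely on vector similarity search, GraphRAG strengthens the LLM's ability to understand and synthesize complex datasets and their relationships, which leads to more accurate responses.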

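How does GraphRAG work?

The innovation of GraphRAG is that it lets a large language model answer questions based on the entire dataset. Fundamentally, GraphRAG is a new retrieval method that uses a knowledge graph together with vector search within the basic RAG architecture. As a result, it can integrate and understand diverse knowledge and provide a broader, more comprehensive view of the data.

A knowledge graph is a powerful way of organizing information that represents and stores complex relationships between entities. It captures not only the entities themselves but also the connections and attributes that define them. By organizing information as a graph, a knowledge graph gives deeper insight into the relationships and hierarchies in the data, which supports more complex reasoning and inference.

GraphRAG works by creating a knowledge graph from the indexed documents, which are unstructured data such as web pages. When GraphRAG builds the knowledge graph, it is in effect creating a structured representation of the relationships between "entities" (such as people, places, concepts, and things), which makes those relationships easier for a machine to understand.

GraphRAG uses an LLM to build a graph-based text index in two stages. First, it derives an entity knowledge graph from the source documents and, based on how closely groups of entities are related, creates "communities" that range from general themes (high level) to more specific topics (low level). Then the LLM summarizes each community, producing a hierarchical summary of the data. To answer a question, each community summary is used to generate a partial response, and all partial responses are then summarized again into the final response for the user. In this way, a chatbot can answer questions based more on knowledge (the community summaries) than on embeddings alone.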

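Extract the knowledge graph: First, a knowledge graph is created from the raw text. A knowledge graph is like a network of interconnected content entities, in which each entity (or "node") is connected to other entities in a meaningful way.
Build the community hierarchy: Next, these interconnected entities are organized into "communities", which can be thought of as clusters of related concepts.
Generate summaries: For each community, GraphRAG generates a summary that captures the main points. This helps you understand the key content without getting lost in the details.
Use the graph structure: When you need to perform a task that involves retrieving and generating information, GraphRAG uses this well-organized graph structure.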
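
The four steps above can be illustrated with a small, self-contained sketch. It is not GraphRAG's implementation: the triples are hypothetical stand-ins for LLM-extracted relationships, connected components are used as a crude substitute for real community detection, and `summarize` is a placeholder for the LLM summarization call.

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples standing in for LLM output.
TRIPLES = [
    ("SCROOGE", "employs", "BOB CRATCHIT"),
    ("BOB CRATCHIT", "father_of", "TINY TIM"),
    ("SCROOGE", "partner_of", "JACOB MARLEY"),
    ("PROJECT GUTENBERG", "publishes", "A CHRISTMAS CAROL EBOOK"),
]


def build_graph(triples):
    """Step 1: build an undirected adjacency map from relationship triples."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def detect_communities(graph):
    """Step 2: group nodes into connected components (assumed stand-in)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(sorted(members))
    return communities


def summarize(community):
    """Step 3: placeholder for the LLM summary; it only lists the members."""
    return "Community of " + ", ".join(community)


graph = build_graph(TRIPLES)
communities = detect_communities(graph)
summaries = [summarize(members) for members in communities]
```

Step 4 then amounts to answering queries from `summaries` and the graph rather than from raw text alone.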

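GraphRAG core components

As with a RAG system, the GraphRAG pipeline can be divided into two core components: indexing and querying. The indexing process uses an LLM to extract nodes (such as entities), edges (such as relationships), and covariates (such as claims). It then partitions the whole knowledge graph with community detection and uses the LLM again to produce summaries. For a specific query, it can aggregate all relevant community summaries to generate a global answer.

GraphRAG Indexing

The GraphRAG indexing package is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs. Indexing pipelines are configurable and are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The indexing pipeline is designed to:

Extract entities, relationships, and claims from raw text
Perform community detection on entities
Generate community summaries and reports at multiple levels of granularity
Embed entities into a graph vector space
Embed text chunks into a text vector space

The GraphRAG indexing pipeline is built on top of the open-source library DataShaper. DataShaper is a data processing library that lets users declaratively express data pipelines, schemas, and related assets using well-defined schemas. One of its core resource types is the Workflow. A workflow is expressed as a sequence of steps, called verbs. Each step has a verb name and a configuration object.

This allows a data pipeline to be expressed as a series of interdependent workflows. In the GraphRAG indexing pipeline, each workflow may declare dependencies on other workflows, which effectively forms a directed acyclic graph (DAG) of workflows that is then used to schedule processing, as shown in the figure below. Pipeline output can be stored in several formats, including JSON and Parquet, or handled manually through the Python API.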

[Figure: DAG of workflows in the GraphRAG indexing pipeline]
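
As a rough illustration of how a DAG of workflows can drive scheduling, the sketch below orders a few workflows with Python's standard `graphlib` module. The workflow names are taken from the indexer log shown later in this article, but the dependency edges are assumptions made for illustration; they are not read from GraphRAG's actual configuration.

```python
from graphlib import TopologicalSorter

# Keys are workflows, values are the workflows they depend on.
# The edges are hypothetical; only the names come from the indexer log.
WORKFLOW_DEPENDENCIES = {
    "create_base_text_units": set(),
    "create_base_extracted_entities": {"create_base_text_units"},
    "create_summarized_entities": {"create_base_extracted_entities"},
    "create_base_entity_graph": {"create_summarized_entities"},
    "create_final_entities": {"create_base_entity_graph"},
    "create_final_communities": {"create_base_entity_graph"},
    "create_final_community_reports": {
        "create_final_communities",
        "create_final_entities",
    },
}


def schedule(dependencies):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = schedule(WORKFLOW_DEPENDENCIES)
```

Every workflow appears in `order` after all of the workflows it depends on, which is the property a scheduler needs before it can run independent workflows in parallel.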

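GraphRAG Query Engine

The query engine is the retrieval module of the GraphRAG library. It is responsible for the following tasks:

Local search

The local search method generates answers by combining relevant data that the model extracted into the knowledge graph with text chunks from the original documents. It is particularly suited to questions that require an understanding of specific entities mentioned in the documents, such as "What are the healing properties of chamomile?"

Specifically, given a user query and an optional conversation history, local search identifies a set of entities in the knowledge graph that are semantically related to the user input. These entities serve as entry points into the knowledge graph, from which further relevant details can be extracted, such as connected entities, relationships, entity covariates, and community reports. The method also extracts relevant text chunks from the original documents associated with the identified entities. These candidate data sources are then prioritized and filtered to fit within a single context window of predefined size, which is used to generate the final response to the user query.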
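
The final prioritize-and-filter step can be pictured as packing ranked candidates into a token budget. The sketch below is a simplified assumption of how that might look: the `Candidate` type, the relevance scores, and the whitespace token count are all hypothetical and do not reflect GraphRAG's real context builder.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    kind: str     # e.g. "entity", "relationship", "community_report", "text_unit"
    text: str
    score: float  # hypothetical relevance score; higher means more relevant


def count_tokens(text):
    """Crude estimate: whitespace-separated words (a real system uses a tokenizer)."""
    return len(text.split())


def build_context(candidates, max_tokens):
    """Greedily pack the highest-scoring candidates into a fixed token budget."""
    selected, used = [], 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        cost = count_tokens(cand.text)
        if used + cost > max_tokens:
            continue  # does not fit; a smaller, lower-ranked item still might
        selected.append(cand)
        used += cost
    return selected, used


candidates = [
    Candidate("entity", "SCROOGE: a miserly old man", 0.9),
    Candidate("text_unit", "Scrooge visits the Cratchit family home", 0.7),
    Candidate("community_report", "Scrooge and the Guiding Ghosts summary", 0.8),
    Candidate("relationship", "SCROOGE employs BOB CRATCHIT", 0.5),
]
context, used_tokens = build_context(candidates, max_tokens=15)
```

With a budget of 15 tokens, the text unit is skipped because it would overflow the window, while the shorter relationship record still fits.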

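Global search

The global search method generates answers by searching over all LLM-generated community reports in a Map-Reduce fashion. It is resource-intensive, but it often gives good results for questions that require an understanding of the dataset as a whole, such as "What are the top five themes in the data?"

The structure of the LLM-generated knowledge graph reveals how the dataset as a whole is organized and how its themes are distributed. This makes it possible to organize a private dataset into meaningful semantic clusters that are summarized in advance. With global search, the LLM uses these clusters to summarize the relevant themes when responding to a user query.

Specifically, given a user query and an (optional) conversation history, global search uses the collection of community reports from a specified level of the knowledge graph's community hierarchy as context data and generates the response in a Map-Reduce fashion. In the Map step, the community reports are split into text chunks of a predefined size. Each chunk is used to produce an intermediate response containing a list of points, each accompanied by a numerical rating of its importance. In the Reduce step, the most important points are filtered, aggregated, and used as context to generate the final response.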
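
The Map-Reduce flow described above can be sketched without an LLM. In this toy version, which is an assumption-laden illustration and not GraphRAG's code, the map step scores each sentence by word overlap with the query instead of prompting a model, and the reduce step keeps the top-rated points that would then serve as context for the final answer.

```python
def map_step(chunk, query):
    """Stand-in for the LLM map call: emit (point, importance) pairs for one chunk.

    Importance is the number of query words found in the sentence.
    """
    query_words = set(query.lower().split())
    points = []
    for sentence in chunk.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        overlap = len(query_words & set(sentence.lower().split()))
        points.append((sentence, overlap))
    return points


def reduce_step(all_points, top_k):
    """Drop zero-rated points, then keep the top_k most important ones."""
    relevant = [point for point in all_points if point[1] > 0]
    ranked = sorted(relevant, key=lambda point: point[1], reverse=True)
    return ranked[:top_k]


def global_search_context(report_chunks, query, top_k=2):
    """Run map over every chunk, then reduce to the final context points."""
    all_points = []
    for chunk in report_chunks:
        all_points.extend(map_step(chunk, query))
    return reduce_step(all_points, top_k)


report_chunks = [
    "Scrooge is transformed by three spirits. The weather is cold.",
    "The Cratchit family celebrates Christmas. Scrooge helps the Cratchit family.",
]
final_context = global_search_context(report_chunks, "scrooge transformation spirits")
```

In a real pipeline both steps would be LLM calls, and the map calls over different chunks can run concurrently because they are independent of one another.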

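The quality of a global search response can be strongly affected by the level of the community hierarchy chosen as the source of community reports. Lower levels of the hierarchy, with their more detailed reports, tend to yield more thorough responses, but because of the larger number of reports they may also increase the time and LLM resources needed to generate the final response.

Question generation

The entity-based question generation method combines structured data from the knowledge graph with unstructured data from the input documents to generate candidate questions related to specific entities.

In detail, given a list of previous user questions, question generation uses the same context-building approach as local search to extract and prioritize relevant structured and unstructured data, including entities, relationships, covariates, community reports, and raw text chunks. These records are then merged into a single LLM prompt to generate candidate follow-up questions that represent the most important or urgent information or themes in the data. This is useful for generating follow-up questions in a conversation, or for producing a list of questions that helps an investigator dig deeper into the dataset.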
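
To make the "merged into a single LLM prompt" step concrete, here is a minimal sketch of prompt assembly. The function name, the record format, and the prompt wording are all hypothetical; GraphRAG ships its own prompt templates, which are not reproduced here.

```python
def build_question_prompt(previous_questions, context_records, num_questions=3):
    """Merge prior questions and context records into one prompt string.

    context_records is a list of (kind, text) pairs, for example
    ("entity", "...") or ("community_report", "...").
    """
    history = "\n".join(f"- {question}" for question in previous_questions)
    context = "\n".join(f"[{kind}] {text}" for kind, text in context_records)
    return (
        f"Previous questions:\n{history}\n\n"
        f"Context data:\n{context}\n\n"
        f"Generate {num_questions} candidate follow-up questions about the "
        "most important themes in the data."
    )


prompt = build_question_prompt(
    ["Who is Scrooge?"],
    [("entity", "SCROOGE: a miserly old man")],
    num_questions=2,
)
```

The resulting string would be sent to the LLM in a single call, so the context records must already have been trimmed to fit the context window.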

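GraphRAG in practice

The GraphRAG project

For the Python-based open-source implementation of GraphRAG, see the GitHub project: https://github.com/microsoft/graphrag.

Installing GraphRAG

Initial setup

Next, set up the data project and the initial configuration. This walkthrough uses the default configuration mode, which you can customize as needed with a configuration file or environment variables.

First, we need a sample dataset: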

mkdir -p ./ragtest/input
# Use Charles Dickens's "A Christmas Carol" as the example
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/A_Christmas_Carol.txt

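Next, set up the workspace environment variables.

Run the graphrag.index --init command to initialize the workspace (the full invocation, python -m graphrag.index --init --root ./ragtest, appears in the Ollama section below).

This creates two files in the ./ragtest directory: .env and settings.yaml.

.env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you will see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or the Azure OpenAI endpoint. Replace it with your own API key.
settings.yaml contains the settings for the pipeline. Azure OpenAI users should set the following variables in settings.yaml: search for the "llm:" configuration, where you should see two sections, one for the chat endpoint and one for the embeddings endpoint. Here is an example of how to configure the chat endpoint: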

type: azure_openai_chat # Or azure_openai_embedding for embeddings
api_base: https://<instance>.openai.azure.com
api_version: 2024-02-15-preview # You can customize this for other versions
deployment_name: <azure_model_deployment_name>

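Building the index

Now run the indexing pipeline (python -m graphrag.index --root ./ragtest).

The console prints extensive runtime logs, such as the full workflow orchestration of the indexing pipeline: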

⠼ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.

While the indexing pipeline runs, the runtime logs also print detailed output such as community reports and node information.

🚀 Reading settings from ragtest/settings.yaml
🚀 create_base_text_units
                                   id                                              chunk                          chunk_id                        document_ids  n_tokens
0    680dd6d2a970a49082fa4f34bf63a34e   The Project Gutenberg eBook of A Christmas Ca...  680dd6d2a970a49082fa4f34bf63a34e         300
1    95f1f8f5bdbf0bee3a2c6f2f4a4907f6   THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL...  95f1f8f5bdbf0bee3a2c6f2f4a4907f6         300
2    3a450ed2b7fb1e5fce66f92698c13824  1958,\n  1962, 1964, 1966, 1967, 1969, 1971, 1...  3a450ed2b7fb1e5fce66f92698c13824         300
3    95b143eba145d91eacae7be3e4ebaf0c  .\n  Mr. Fezziwig, a kind-hearted, jovial old ...  95b143eba145d91eacae7be3e4ebaf0c         300
4    c390f1b92e2888f78b58f6af5b12afa0   debtors.\n  Mrs. Cratchit, wife of Bob Cratch...  c390f1b92e2888f78b58f6af5b12afa0         300
..                                ...                                                ...                               ...                                 ...       ...
226  972bb34ddd371530f06d006480526d3e   harmless from all liability, costs and expens...  972bb34ddd371530f06d006480526d3e         300
227  2f918cd94d1825eb5cbdc2a9d3ce094e  \nGutenberg Literary Archive Foundation was cr...  2f918cd94d1825eb5cbdc2a9d3ce094e         300
228  eec5fc1a2be814473698e220b303dc1b  . Email contact links and up\nto date contact ...  eec5fc1a2be814473698e220b303dc1b         300
229  535f6bed392a62760401b1d4f2aa5e2f   compliance. To SEND\nDONATIONS or determine t...  535f6bed392a62760401b1d4f2aa5e2f         300
230  9e59af410db84b25757e3bf90e036f39   could be\nfreely shared with anyone. For fort...  9e59af410db84b25757e3bf90e036f39         155

[231 rows x 5 columns]

🚀 create_final_entities
                                   id                       name  ...                                      text_unit_ids                              description_embedding
0    b45241d70f0e43fca764df95b2b81f77          PROJECT GUTENBERG  ...  [0ddc17ea5e566006c000b4013f2181a5, 2b5ecb7fba1...  [-0.038737952709198, 0.03134557977318764, 0.03...
1    4119fd06010c494caa07f439b333f4c5            CHARLES DICKENS  ...                   [0.054160527884960175, 0.01252343412488699, 0....
2    d3835bf3dda84ead99deadbeac5d0d7d             ARTHUR RACKHAM  ...                   [0.0036118177231401205, 0.021079497411847115, ...
3    077d2820ae1845bcbb1803379a3d1eae   J. B. LIPPINCOTT COMPANY  ...                   [0.008591658435761929, -0.018140539526939392, ...
4    3671ea0dd4e84c1a9b02c5ab2c8f4bac              UNITED STATES  ...  [8435b078474636a989a8c22f5493e1b6, aa8d2310a20...  [-0.022689668461680412, 0.006747123785316944, ...
..                                ...                        ...  ...                                                ...                                                ...
140  f09f381c319f4251847d1a4bb8cdcac1             SALT LAKE CITY  ...                   [-0.0359080508351326, -0.014752472750842571, 0...
141  eec11f567e7f4943b157c3a657eb9a46                MISSISSIPPI  ...                   [-0.009660783223807812, -0.027041003108024597,...
142  efef117839b64ce9adf614a461d41ba6   INTERNAL REVENUE SERVICE  ...                   [-0.031183507293462753, -0.011382343247532845,...
143  2171091ada0942d8ae7944df11659f6e  PROFESSOR MICHAEL S. HART  ...                   [-0.026417620480060577, -0.02718503400683403, ...
144  bcfdc48e5f044e1d84c5d217c1992d4b                 FOUNDATION  ...                   [-0.021003013476729393, -0.019607042893767357,...

[291 rows x 8 columns]

🚀 create_final_nodes
     level                      title          type                                        description  ... entity_type                 top_level_node_id  x  y
0        0          PROJECT GUTENBERG  ORGANIZATION  Project Gutenberg is a digital library that of...  ...         NaN  b45241d70f0e43fca764df95b2b81f77  0  0
1        0            CHARLES DICKENS        PERSON  Charles Dickens is the author of "A Christmas ...  ...         NaN  4119fd06010c494caa07f439b333f4c5  0  0
2        0             ARTHUR RACKHAM        PERSON  Arthur Rackham is the illustrator of "A Christ...  ...         NaN  d3835bf3dda84ead99deadbeac5d0d7d  0  0
3        0   J. B. LIPPINCOTT COMPANY  ORGANIZATION  J. B. Lippincott Company is the original publi...  ...         NaN  077d2820ae1845bcbb1803379a3d1eae  0  0
4        0              UNITED STATES           GEO  The United States is a country where the Proje...  ...         NaN  3671ea0dd4e84c1a9b02c5ab2c8f4bac  0  0
..     ...                        ...           ...                                                ...  ...         ...                               ... .. ..
868      2             SALT LAKE CITY           GEO  Salt Lake City is the location of the business...  ...         NaN  f09f381c319f4251847d1a4bb8cdcac1  0  0
869      2                MISSISSIPPI           GEO  Mississippi is the state under whose laws the ...  ...         NaN  eec11f567e7f4943b157c3a657eb9a46  0  0
870      2   INTERNAL REVENUE SERVICE  ORGANIZATION  The Internal Revenue Service (IRS) is the U.S....  ...         NaN  efef117839b64ce9adf614a461d41ba6  0  0
871      2  PROFESSOR MICHAEL S. HART        PERSON  Professor Michael S. Hart was the originator o...  ...         NaN  2171091ada0942d8ae7944df11659f6e  0  0
872      2                 FOUNDATION                                                                   ...         NaN  bcfdc48e5f044e1d84c5d217c1992d4b  0  0

[873 rows x 15 columns]

🚀 create_final_relationships
                                            source                                         target  weight  ... source_degree target_degree rank
0                                PROJECT GUTENBERG                                  UNITED STATES     2.0  ...            12             4   16
1                                PROJECT GUTENBERG                              A CHRISTMAS CAROL     2.0  ...            12             7   19
2                                PROJECT GUTENBERG  PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION     4.0  ...            12             3   15
3                                PROJECT GUTENBERG                               COPYRIGHT HOLDER     1.0  ...            12             1   13
4                                PROJECT GUTENBERG                      PROJECT GUTENBERG LICENSE     1.0  ...            12             3   15
..                                             ...                                            ...     ...  ...           ...           ...  ...
369                           MR. SCROOGE'S NEPHEW                                    MR. SCROOGE     1.0  ...             2             2    4
370  PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION                      PROJECT GUTENBERG LICENSE     1.0  ...             3             3    6
371  PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION                                 SALT LAKE CITY     1.0  ...             3             1    4
372                                    MISSISSIPPI                                     FOUNDATION     1.0  ...             1             4    5
373                       INTERNAL REVENUE SERVICE                                     FOUNDATION     1.0  ...             1             4    5

[374 rows x 10 columns]

🚀 create_final_community_reports
   community                                       full_content  ...                                  full_content_json                                    id
0         38  # Cratchit Family and Household\n\nThe communi...  ...  {\n    "title": "Cratchit Family and Household...  9f581acc-b0b3-426e-aab7-8d783fefd8ce
1         39  # The Cratchit Family and Scrooge\n\nThe commu...  ...  {\n    "title": "The Cratchit Family and Scroo...  3bd6a384-04ee-4f5b-93b9-69d08e9fa4a0
2         40  # Project Gutenberg Community\n\nThe community...  ...  {\n    "title": "Project Gutenberg Community",...  4e1b4220-6705-48c8-94ed-fb8790467a9e
3         41  # Project Gutenberg and United States Legal Fr...  ...  {\n    "title": "Project Gutenberg and United ...  0a0b2657-df4b-4deb-a4bf-249c19b93d21
4         42  # Project Gutenberg Literary Archive Foundatio...  ...  {\n    "title": "Project Gutenberg Literary Ar...  06397a0a-144a-41d8-996f-714f7e0ef3ad
5         13  # Ebenezer Scrooge and His Transformative Jour...  ...  {\n    "title": "Ebenezer Scrooge and His Tran...  6b6f8af1-f405-478f-9694-5e834f7762b8
6         14  # Scrooge and Marley's Haunting Legacy\n\nThe ...  ...  {\n    "title": "Scrooge and Marley's Haunting...  af31d027-03e3-45fd-b8c9-948f52c3db12
7         15  # Ghost of Christmas Past and Scrooge\n\nThe c...  ...  {\n    "title": "Ghost of Christmas Past and S...  af4e9014-b1b0-42ef-baa7-f4fb17cdb983
8         16  # Ghost of Christmas Present and Scrooge\n\nTh...  ...  {\n    "title": "Ghost of Christmas Present an...  d3fbabfd-2770-4112-94cd-35ab1a19f78b
9         17  # Scrooge and the Guiding Ghosts\n\nThe commun...  ...  {\n    "title": "Scrooge and the Guiding Ghost...  fccbc490-801f-4ed5-97d5-c00dfd96adf8
10        18  # Fezziwig's Christmas Eve Celebration\n\nThe ...  ...  {\n    "title": "Fezziwig's Christmas Eve Cele...  ad6c0573-fd9b-4dd6-a84e-d013ab7c4ea8
11        19  # Fezziwig's Christmas Eve Celebration\n\nThe ...  ...  {\n    "title": "Fezziwig's Christmas Eve Cele...  75682e95-63be-4268-b9e3-5840314a2dd8
12        20  # Fezziwig's Christmas Eve Celebration Attende...  ...  {\n    "title": "Fezziwig's Christmas Eve Cele...  2b712568-0b51-4116-99bb-42ea9cf08359
13        21  # Fezziwig's Christmas Eve Celebration\n\nThe ...  ...  {\n    "title": "Fezziwig's Christmas Eve Cele...  e1b2844a-ea32-461f-b5a5-80fe141ac305
14        22  # Dick Wilkins and Ebenezer Community\n\nThe c...  ...  {\n    "title": "Dick Wilkins and Ebenezer Com...  b150d4a6-5361-44c4-894a-fd463d576477
15        23  # Fezziwig's Domestic Ball\n\nThe community ce...  ...  {\n    "title": "Fezziwig's Domestic Ball",\n ...  a316b4da-1e07-4e4c-a67a-b1b72a16d3b5
16        24  # Christmas Celebrations and Scrooge's Transfo...  ...  {\n    "title": "Christmas Celebrations and Sc...  0a7e7c40-6183-4134-ad3a-f0519deeee46
17        25  # Cratchit Family and the Spirit of Christmas ...  ...  {\n    "title": "Cratchit Family and the Spiri...  93985289-0739-423d-8621-4cd82c4bfa6b
18        26  # The Cratchit Family and Their Christmas Cele...  ...  {\n    "title": "The Cratchit Family and Their...  40667b3b-ad99-439a-936e-ead2d6ce527f
19        27  # Cratchit Family and Christmas Celebration\n\...  ...  {\n    "title": "Cratchit Family and Christmas...  20f0b6d4-ac48-4f9b-b795-ff1e090dc646
20        28  # Old Joe's Shop Community\n\nThe community re...  ...  {\n    "title": "Old Joe's Shop Community",\n ...  cb4c8cd9-c326-4de0-9706-9c7772870e5a
21        29  # Old Joe and the Appraisal Community\n\nThe c...  ...  {\n    "title": "Old Joe and the Appraisal Com...  2df3192b-15df-4bc8-bb6a-49ce3442f9e8
22        30  # Old Joe's Shop in Infamous Resort\n\nThe com...  ...  {\n    "title": "Old Joe's Shop in Infamous Re...  e7ffbfc6-b6b2-44be-9192-f84edbdc434f
23        31  # A Christmas Carol and Its Contributors\n\nTh...  ...  {\n    "title": "A Christmas Carol and Its Con...  743166d0-106f-4d33-b3a8-31e034bd869a
24        32  # J. B. Lippincott Company and A Christmas Car...  ...  {\n    "title": "J. B. Lippincott Company and ...  02290f9b-ad2c-400d-97ff-81ff4d77dae5
25        33  # Project Gutenberg Literary Archive Foundatio...  ...  {\n    "title": "Project Gutenberg Literary Ar...  300389ab-8ac9-4ae2-b711-15b842980a55
26        34  # Project Gutenberg and its Foundation\n\nThe ...  ...  {\n    "title": "Project Gutenberg and its Fou...  86df4b96-8c75-4748-8da7-4e54026c92c0
27        35  # Ebenezer Scrooge and His Transformative Jour...  ...  {\n    "title": "Ebenezer Scrooge and His Tran...  81e34e3e-79cc-4fc0-a05c-054fcda463dd
28        36  # Ebenezer Scrooge and Jacob Marley's Ghost\n\...  ...  {\n    "title": "Ebenezer Scrooge and Jacob Ma...  291dd86a-e033-455c-8245-68c5c49de2c0
29        37  # Spirits and Their Wandering in the World\n\n...  ...  {\n    "title": "Spirits and Their Wandering i...  e00e4a65-e1c7-4f81-9d60-13adc0eb3eda
30         0  # Ebenezer Scrooge and the Spirits of Christma...  ...  {\n    "title": "Ebenezer Scrooge and the Spir...  4061372d-4db7-44d0-b5f6-aa47f2bf0057
31         1  # Fezziwig's Christmas Eve Celebration\n\nThe ...  ...  {\n    "title": "Fezziwig's Christmas Eve Cele...  f0fc78d0-1dd1-46f6-a488-5f6fa8e348fd
32        10  # Scrooge and Jacob's Death\n\nThe community r...  ...  {\n    "title": "Scrooge and Jacob's Death",\n...  f8a3a0bb-18fb-453c-8450-4aae67c949a2
33        11  # Fred's Christmas Celebration\n\nThe communit...  ...  {\n    "title": "Fred's Christmas Celebration"...  67987438-becb-402b-93f7-d8a5e631d96e
34        12  # Ebenezer Scrooge and Jacob Marley\n\nThe com...  ...  {\n    "title": "Ebenezer Scrooge and Jacob Ma...  176fc2e7-7f19-48fe-a731-127cf807b692
35         2  # Cratchit Family and Christmas\n\nThe communi...  ...  {\n    "title": "Cratchit Family and Christmas...  3d7c629e-a2b7-4ca2-b689-b9aea8894068
36         3  # Business Men Discussing Old Scratch's Death\...  ...  {\n    "title": "Business Men Discussing Old S...  fa50c150-6466-488e-a090-60470a063534
37         4  # Mr. Scrooge and Caroline's Community\n\nThis...  ...  {\n    "title": "Mr. Scrooge and Caroline's Co...  3e879ac4-02c0-4728-bde3-b9aabcca5658
38         5  # Old Joe's Shop and Associated Characters\n\n...  ...  {\n    "title": "Old Joe's Shop and Associated...  09510112-3963-4a24-9824-46ec7499718a
39         6  # Matron and Her Family\n\nThe community revol...  ...  {\n    "title": "Matron and Her Family",\n    ...  29377eb4-3c69-4415-a300-546ecdee8792
40         7  # Scrooge and the Spirit's Journey\n\nThe comm...  ...  {\n    "title": "Scrooge and the Spirit's Jour...  cd54c9dd-9832-400e-8a62-eb2516d3bb0b
41         8  # The Church and Its Enigmatic Environment\n\n...  ...  {\n    "title": "The Church and Its Enigmatic ...  2db73356-5970-44b8-9462-1cae416afc3a
42         9  # Project Gutenberg and A Christmas Carol\n\nT...  ...  {\n    "title": "Project Gutenberg and A Chris...  d86c3871-ec30-499f-9cdb-2ca05fd9b07b

[43 rows x 10 columns]

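The indexing process takes some time to run. How long depends on the size of the input data, the model used, and the text chunk size (these parameters can be configured in the .env file). Once the indexing pipeline completes, you should see a set of .parquet files in the ./ragtest/output/<timestamp>/artifacts folder.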
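
Because every run writes to its own timestamped folder, a small helper for locating the newest artifacts can be handy. The sketch below uses only the standard library and assumes that the run folder names sort chronologically when compared as strings; `latest_artifacts` is a hypothetical helper, not part of GraphRAG.

```python
from pathlib import Path


def latest_artifacts(root):
    """Return the .parquet files of the most recent run under <root>/output.

    Assumes run folder names (timestamps) sort chronologically as strings.
    """
    output_dir = Path(root) / "output"
    if not output_dir.is_dir():
        return []
    runs = sorted(path for path in output_dir.iterdir() if path.is_dir())
    if not runs:
        return []
    return sorted((runs[-1] / "artifacts").glob("*.parquet"))
```

Each returned path can then be loaded for inspection, for example with `pandas.read_parquet(path)` if pandas and a Parquet engine are installed.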

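Keep in mind that GraphRAG can be computationally very intensive, especially during indexing. Keep the input documents concise for best performance.

Graph visualization

By modifying settings.yaml, you can save the graph as a file in the standard GraphML format, for example ragtest/output/<timestamp>/artifacts/summarized_graph.graphml.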
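
GraphML is plain XML, so a saved graph can be inspected with the standard library before loading it into a visualization tool. The sketch below parses a tiny hand-written sample; the sample content is hypothetical and far simpler than a real GraphRAG export, which also carries attribute data on nodes and edges.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def graphml_stats(graphml_text):
    """Return (node ids, (source, target) edge pairs) from GraphML text."""
    root = ET.fromstring(graphml_text)
    nodes = [node.get("id") for node in root.iterfind(".//g:node", GRAPHML_NS)]
    edges = [
        (edge.get("source"), edge.get("target"))
        for edge in root.iterfind(".//g:edge", GRAPHML_NS)
    ]
    return nodes, edges


# Hypothetical miniature graph used only to exercise the parser.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="SCROOGE"/>
    <node id="BOB CRATCHIT"/>
    <edge source="SCROOGE" target="BOB CRATCHIT"/>
  </graph>
</graphml>"""

nodes, edges = graphml_stats(SAMPLE)
```

For a file on disk, read its text first (for example with `Path(...).read_text(encoding="utf-8")`) and pass the result to `graphml_stats`.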

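Running a global query

Once indexing has finished, we can start asking questions about the dataset.
The following is an example of using global search to ask a high-level question. The command takes the same form as the local query shown in the next section, with --method global in place of --method local: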

INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=32', 'type': "azure_openai_chat", 'model': 'gpt-4o', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'https://<azure_openai_api>.openai.azure.com', 'api_version': '2024-02-15-preview', 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': '<deployment_name>', 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:

### Overview of the Story
The story centers around Ebenezer Scrooge, a miserly and solitary old man who undergoes a profound transformation. Initially, Scrooge is depicted as cold-hearted and indifferent to the joys of Christmas, repelling those around him, including his nephew and his poor clerk, Bob Cratchit [Data: Reports (13)].

### Supernatural Encounters
Scrooge's journey towards redemption is catalyzed by a series of supernatural encounters. The ghost of his former business partner, Jacob Marley, warns him of the impending visits from three spirits and urges him to change his ways [Data: Reports (13, 36)]. These spirits—the Ghosts of Christmas Past, Present, and Yet to Come—guide Scrooge through various scenes and reflections on his life, showing him visions of his past, present, and potential future [Data: Reports (13, 17)].

### Key Transformative Moments
Throughout these ghostly visits, Scrooge is confronted with the consequences of his actions and attitudes. The Ghost of Christmas Past takes him back to his earlier years, including moments of joy and regret, such as Fezziwig's ball [Data: Reports (13)]. The Ghost of Christmas Present reveals the warmth and struggles of the Cratchit family, emphasizing themes of generosity and family despite financial hardships [Data: Reports (24, 39)]. The Ghost of Christmas Yet to Come presents a grim future if Scrooge does not change his ways [Data: Reports (13)].

### Transformation and Redemption
Ultimately, these experiences lead to a significant transformation in Scrooge. He becomes generous and kind-hearted, especially towards Bob Cratchit and Tiny Tim, highlighting the redemptive power of self-reflection and the spirit of Christmas [Data: Reports (13, 24, 39)]. The bustling activity in the city streets and shops on Christmas morning contrasts with Scrooge's initial indifference, underscoring the widespread celebration and joy that the holiday brings to the community [Data: Reports (24)].

### Conclusion
In summary, the story is a powerful narrative about personal transformation and redemption, driven by supernatural guidance and the spirit of Christmas. Scrooge's journey from a miserly old man to a benevolent figure serves as a timeless reminder of the importance of generosity, family, and self-reflection [Data: Reports (13, 17, 24, 39, 36, +more)].

Running a local query

The following is an example of using local search to ask a more specific question about a particular character:

python -m graphrag.query \
--root ./ragtest \
--method local \
"What do you know about Scrooge? Give a brief answer"

SUCCESS: Local Search Response:
Ebenezer Scrooge is the central character in Charles Dickens' novella "A Christmas Carol." Initially, he is depicted as a miserly, solitary old man who despises Christmas and human sympathy. Scrooge undergoes a profound transformation after being visited by the ghosts of Christmas Past, Present, and Yet to Come, ultimately becoming generous and kind-hearted [Data: Entities (33, 21)].
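
Both responses cite their supporting records inline, in the form [Data: Reports (13, 36)] or [Data: Entities (33, 21)]. If you want to trace an answer back to those records, the citations can be pulled out with a regular expression. The sketch below is based only on the citation format visible in the sample output above; other formats may exist.

```python
import re

CITATION_RE = re.compile(r"\[Data: (?P<body>[^\]]+)\]")
SOURCE_RE = re.compile(r"(?P<kind>\w+) \((?P<ids>[^)]*)\)")


def extract_citations(response):
    """Collect cited record ids per dataset, e.g. {"Reports": {13, 36}}."""
    citations = {}
    for match in CITATION_RE.finditer(response):
        for source in SOURCE_RE.finditer(match.group("body")):
            ids = citations.setdefault(source.group("kind"), set())
            for token in source.group("ids").split(","):
                token = token.strip()
                if token.isdigit():  # skips markers such as "+more"
                    ids.add(int(token))
    return citations


sample = (
    "Marley warns Scrooge [Data: Reports (13, 36)]. "
    "A timeless reminder [Data: Reports (13, 17, 24, 39, 36, +more)]. "
    "Scrooge becomes generous [Data: Entities (33, 21)]."
)
cited = extract_citations(sample)
```

The ids can then be matched against the community reports and entities tables produced by the indexer.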

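GraphRAG + Ollama: a local setup

GraphRAG + Ollama is intended to support local models, making it a local alternative to the OpenAI / Azure OpenAI API-based setup. By using open-source language models served by Ollama, we can run cost-effective local inference without paying for expensive LLM API calls. Note, however, that the local setup takes much longer to run.

Creating the runtime environment

You can use conda to create a new environment so that all dependencies are installed in an isolated Python environment.

Installing Ollama

Downloading the language model and the embedding model

Initializing the folder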

mkdir -p ./ragtest/input
cp <text-file-path>/*.txt ./ragtest/input
python -m graphrag.index --init --root ./ragtest

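Modifying the configuration

We use Ollama to serve the language model and the embedding model locally. For the LLM, you can use models such as llama3.1, mistral, or phi3; for the embedding model, you can use models such as mxbai-embed-large or nomic-embed-text. The default API base URLs are http://localhost:11434/v1 for the LLM and http://localhost:11434/api for the embedding model.

In settings.yaml, change the relevant settings to point to the local Ollama models and API endpoints: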

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3
  model_supports_json: true # recommended if this is available for your model.
  api_base: http://localhost:11434/v1

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/api

Modifying the source code

Following the GitHub project https://github.com/karthik-codex/Autogen_GraphRAG_Ollama, replace two files in the GraphRAG package:

site-packages/graphrag/llm/openai/openai_embeddings_llm.py
site-packages/graphrag/query/llm/oai/embedding.py

# graphrag/llm/openai/openai_embeddings_llm.py
"""The EmbeddingsLLM class."""

from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes
import ollama

class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """A text-embedding generator LLM."""

    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self.client = client
        self.configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        args = {
            "model": self.configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        embedding_list = []
        for inp in input:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list

# graphrag/query/llm/oai/embedding.py
"""OpenAI Embedding model implementation."""

import asyncio
from collections.abc import Callable
from typing import Any
import ollama
import numpy as np
import tiktoken
from tenacity import (
    AsyncRetrying,
    RetryError,
    Retrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

from graphrag.query.llm.base import BaseTextEmbedding
from graphrag.query.llm.oai.base import OpenAILLMImpl
from graphrag.query.llm.oai.typing import (
    OPENAI_RETRY_ERROR_TYPES,
    OpenaiApiType,
)
from graphrag.query.llm.text_utils import chunk_text
from graphrag.query.progress import StatusReporter


class OpenAIEmbedding(BaseTextEmbedding, OpenAILLMImpl):
    """Wrapper for OpenAI Embedding models."""

    def __init__(
        self,
        api_key: str | None = None,
        azure_ad_token_provider: Callable | None = None,
        model: str = "text-embedding-3-small",
        deployment_name: str | None = None,
        api_base: str | None = None,
        api_version: str | None = None,
        api_type: OpenaiApiType = OpenaiApiType.OpenAI,
        organization: str | None = None,
        encoding_name: str = "cl100k_base",
        max_tokens: int = 8191,
        max_retries: int = 10,
        request_timeout: float = 180.0,
        retry_error_types: tuple[type[BaseException]] = OPENAI_RETRY_ERROR_TYPES,  # type: ignore
        reporter: StatusReporter | None = None,
    ):
        OpenAILLMImpl.__init__(
            self=self,
            api_key=api_key,
            azure_ad_token_provider=azure_ad_token_provider,
            deployment_name=deployment_name,
            api_base=api_base,
            api_version=api_version,
            api_type=api_type,  # type: ignore
            organization=organization,
            max_retries=max_retries,
            request_timeout=request_timeout,
            reporter=reporter,
        )

        self.model = model
        self.encoding_name = encoding_name
        self.max_tokens = max_tokens
        self.token_encoder = tiktoken.get_encoding(self.encoding_name)
        self.retry_error_types = retry_error_types
        self.embedding_dim = 768  # nomic-embed-text produces 768-dimensional vectors
        self.ollama_client = ollama.Client()

    def embed(self, text: str, **kwargs: Any) -> list[float]:
        """Embed text using Ollama's nomic-embed-text model."""
        try:
            embedding = self.ollama_client.embeddings(model="nomic-embed-text", prompt=text)
            return embedding["embedding"]
        except Exception as e:
            self._reporter.error(
                message="Error embedding text",
                details={self.__class__.__name__: str(e)},
            )
            return np.zeros(self.embedding_dim).tolist()

    async def aembed(self, text: str, **kwargs: Any) -> list[float]:
        """Embed text using Ollama's nomic-embed-text model asynchronously."""
        try:
            # ollama.Client is synchronous, so run the call in a worker thread
            embedding = await asyncio.to_thread(
                self.ollama_client.embeddings, model="nomic-embed-text", prompt=text
            )
            return embedding["embedding"]
        except Exception as e:
            self._reporter.error(
                message="Error embedding text asynchronously",
                details={self.__class__.__name__: str(e)},
            )
            return np.zeros(self.embedding_dim).tolist()

    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any  #str | tuple
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    embedding = (
                        self.sync_client.embeddings.create(  # type: ignore
                            input=text,
                            model=self.model,
                            **kwargs,  # type: ignore
                        )
                        .data[0]
                        .embedding
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            return ([], 0)

    async def _aembed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = AsyncRetrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            async for attempt in retryer:
                with attempt:
                    embedding = (
                        await self.async_client.embeddings.create(  # type: ignore
                            input=text,
                            model=self.model,
                            **kwargs,  # type: ignore
                        )
                    ).data[0].embedding or []
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            return ([], 0)

Running the index

🚀 create_final_community_reports
   community  ...                                    id
0         47  ...  66514eba-932d-4c9d-9314-d40ee2b1e9bc
1         46  ...  bf24f767-1ad3-4dec-b590-37c01c124405
2         15  ...  5d0ae0a8-5f66-45ba-97d2-679cdb849e9b
3         16  ...  50c63fe1-5467-453f-9aa2-ad8f88403083
4         17  ...  88b7ae2b-a989-4168-8446-778ad93516f5
5         19  ...  0bd0618d-1e3b-424e-bb5f-be775046b819
6         20  ...  5936c9fe-f61d-4e63-9c6f-4e11f96a0093
7         21  ...  af88afb2-fb9e-46dc-867f-ec0a21215964
8         22  ...  4bb1c63e-35c2-4e88-b754-c09225ab1a89
9         23  ...  281a4b54-a93e-469b-a140-e5c69d1283e6
10        25  ...  e52936eb-3352-427c-8e9e-1c7316994755
11        26  ...  e1d3343d-8c21-4180-9627-1ad26e43443f
12        27  ...  ef3374f3-a7da-470e-81aa-dc60e094c2b5
13        32  ...  89e079e2-a570-43ec-b6bb-a5ead3767f36
14        34  ...  eb5bda3a-b1b6-4255-a081-8e774c2e3177
15        35  ...  b388d3d8-0025-452a-a7d8-e54204a21cc1
16        36  ...  b6b24226-55bf-49a7-9904-4c268e05d6ee
17        39  ...  d5603535-4d4a-484b-800d-9188a5d657af
18        41  ...  feae7fad-76c6-492d-b8dd-021f505cd0e7
19        42  ...  7c99e217-7ca1-4889-bbe3-6dab9f0b975c
20        43  ...  14c53e5c-37b2-430f-ab18-70cd2853bd2e
21        44  ...  14f4b8d8-d3bd-4917-a1e7-840491a2612c
22        45  ...  4ee58751-b2b5-4f73-bfe1-f5f711786f8e
23        10  ...  e8ae10a9-510c-42b3-81a3-4906f5f66eff
24        12  ...  0096e95a-d6ca-405b-b027-55c7d8657bcb
25         4  ...  9cd661e5-c89f-41eb-9d02-09c21d45b1e1
26         5  ...  09effed5-0530-4abf-aa58-7e52de78ddf9
27         7  ...  42f78405-fb29-4bec-9c29-f67a5dd42beb
28         8  ...  6fb477e8-537b-4a5e-8a4f-8dd2624736ce
29         0  ...  907dd8cb-d118-4e94-8ec8-5f09d16dac35

[30 rows x 10 columns]

⠋ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━ 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.

Running a user query

INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'llama3', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: **The Story of Ebenezer Scrooge**

The story revolves around Ebenezer Scrooge, a miserly and bitter old man who has lost touch with the joys of life. On Christmas Eve, Scrooge is visited by the ghost of his former business partner, Jacob Marley, who warns him that he will be condemned to wander the earth wearing heavy chains if he doesn't change his ways.

Scrooge is then visited by three spirits: the Ghost of Christmas Past, the Ghost of Christmas Present, and the Ghost of Christmas Yet to Come. These spirits take Scrooge on a journey through time, showing him how his actions have affected those around him and revealing the consequences of his selfishness.

Through these encounters, Scrooge is forced to confront his own mortality and the emptiness of his life. He begins to see the error of his ways and starts to transform into a kinder, more compassionate person.

The story ultimately explores themes of redemption, forgiveness, and the importance of human connection during the holiday season.

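A brief look at the GraphRAG source code

The source code structure of the GraphRAG project is shown in the figure below.

[Figure: source code structure of the GraphRAG project]

Indexing workflow

The indexing stage consists of the following steps:

Initialization: generate the necessary configuration files, cache, and directories.
Indexing: create a series of pipelines from workflow templates, order them according to their dependencies, and execute them one after another.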

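This process builds the knowledge graph and prepares everything that later query operations need in a systematic and efficient way. The indexing stage is the core of the whole project, and its overall flow is fairly complex. Indexing is started from the command line by running the graphrag.index module, optionally with arguments such as --init.

Doing so invokes the main function in graphrag/index/__main__.py, which parses the input arguments with argparse and finally calls the index_cli function in graphrag/index/cli.py.

Next, we trace the call chain of the relevant functions, focusing on the key components:

cli.py::index_cli() first decides, based on user-supplied arguments such as --init, whether to initialize the current folder. It checks whether the directory already contains the configuration file, the prompt files, and the .env file, and creates any that are missing, including settings.yaml and the various prompt files.

For the actual indexing operation, it runs an inner function, cli.py::index_cli()._run_workflow_async(), which mainly involves two functions: cli.py::_create_default_config() and run.py::run_pipeline_with_config().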

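Default configuration generation:
cli.py::_create_default_config() checks the root directory and the settings.yaml file.
It then runs cli.py::_read_config_parameters() to read the system configuration (LLM, chunk size, cache, storage, and so on).
The key step is creating the pipeline configuration from the current parameters, which is implemented by create_pipeline_config.py::create_pipeline_config(). This is one of the most complex modules in the project. Its core logic is to generate a complete sequence of workflows for the different functions, using the templates in the workflows/v1 directory. Note that dependencies between workflows are not taken into account at this stage.
Pipeline execution:
run.py::run_pipeline_with_config() loads the existing pipeline configuration.
It creates the subdirectories (cache, storage, input, output, and so on).
It then uses run.py::run_pipeline() to execute each workflow in turn and return the results. This consists of two main parts:

Loading workflows: workflows/load.py::load_workflows() creates the regular workflows and handles topological sorting. Within it, workflows/load.py::create_workflow() creates a workflow from an existing template, and graphlib::topological_sort() computes a topological ordering of the directed acyclic graph (DAG) formed by the workflow dependencies.
Executing operations: functions such as inject_workflow_data_dependencies(), write_workflow_stats(), and emit_workflow_output handle dependency injection, writing statistics, and saving workflow output.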
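
The dependency-ordering step can be illustrated with Python's standard-library graphlib module. This is only a minimal sketch, not GraphRAG's own code: the helper order_workflows is hypothetical, and apart from the create_final_entities dependency that this article describes, the edges in the map are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: workflow name -> workflows it depends on.
# Only the create_final_entities -> create_base_extracted_entities edge is
# taken from the article; the other edges are illustrative assumptions.
dependencies = {
    "create_base_text_units": set(),
    "create_base_extracted_entities": {"create_base_text_units"},
    "create_summarized_entities": {"create_base_extracted_entities"},
    "create_base_entity_graph": {"create_summarized_entities"},
    "create_final_entities": {"create_base_extracted_entities"},
}


def order_workflows(deps: dict[str, set[str]]) -> list[str]:
    """Return workflow names so that every dependency precedes its dependents."""
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(deps).static_order())


execution_order = order_workflows(dependencies)
print(execution_order)
```

Because workflows are generated without regard to dependencies and sorted only afterwards, templates can be declared in any order, and a circular dependency is reported as an error before anything is executed.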

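Knowledge graph construction

The knowledge graph construction workflow, create_final_entities.py, depends on workflow:create_base_extracted_entities and defines operations such as cluster_graph and embed_graph. The cluster_graph operation uses the Leiden strategy, and its implementation lives in index/verbs/graph/clustering/cluster_graph.py: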

from typing import Any

from datashaper import TableContainer, VerbCallbacks, VerbInput, progress_iterable, verb


@verb(name="cluster_graph")
def cluster_graph(
    input: VerbInput,
    callbacks: VerbCallbacks,
    strategy: dict[str, Any],  # clustering strategy configuration, e.g. Leiden
    column: str,
    to: str,
    level_to: str | None = None,
    **_kwargs,
) -> TableContainer:
    ...  # function body omitted in this excerpt

As you can see, this is essentially a function decorated with @verb. The Leiden algorithm used here comes from graspologic, a Python package for graph statistics.
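
To show why a decorator is useful here, the sketch below implements a tiny name-based registry. It is a hypothetical illustration of the general pattern and makes no claim about how datashaper implements @verb; VERB_REGISTRY and count_rows are invented names.

```python
from typing import Any, Callable

# Hypothetical registry; datashaper's real @verb implementation may differ.
VERB_REGISTRY: dict[str, Callable[..., Any]] = {}


def verb(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register the decorated function under `name` and return it unchanged."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        VERB_REGISTRY[name] = func
        return func

    return decorator


@verb(name="count_rows")
def count_rows(rows: list[dict[str, Any]], **_kwargs: Any) -> int:
    """Toy verb: return the number of rows in a table-like list."""
    return len(rows)


# A pipeline step described in configuration can now be resolved by name.
step = {"verb": "count_rows", "args": {"rows": [{"id": 1}, {"id": 2}]}}
result = VERB_REGISTRY[step["verb"]](**step["args"])
print(result)
```

With this pattern, a workflow template only needs to name the verb and its arguments, and the runner looks up the matching function at execution time.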

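Global search

GraphRAG offers two search modes: global search and local search. Depending on the arguments, the main function calls one or the other. Both calls can be found in graphrag/query/__main__.py: cli::run_local_search() and cli::run_global_search().

cli::run_global_search() mainly calls factories.py::get_global_search_engine(), which returns a GlobalSearch object. Like LocalSearch, it is created through a factory. Its core method, asearch(), follows a map-reduce approach: it uses the large language model to generate an answer for each community summary in parallel, then aggregates those answers into the final result. Because every community summary triggers its own LLM call, global search consumes a large number of tokens.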
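
The map-reduce flow can be sketched with asyncio. This is a simplified illustration under stated assumptions rather than the GlobalSearch implementation: fake_llm is a stand-in for a real model call, map_reduce_search is a hypothetical helper, and the summaries are made-up sample data.

```python
import asyncio


async def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical); echoes a shortened prompt."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"answer<{prompt[:40]}>"


async def map_reduce_search(
    query: str, community_summaries: list[str]
) -> tuple[str, int]:
    """Answer `query` with one LLM call per summary plus one reduce call."""
    # Map: one partial answer per community summary, generated concurrently.
    map_prompts = [f"{query} | {summary}" for summary in community_summaries]
    partial_answers = await asyncio.gather(*(fake_llm(p) for p in map_prompts))
    # Reduce: combine all partial answers into one final response.
    reduce_prompt = f"{query} | " + " ; ".join(partial_answers)
    final_answer = await fake_llm(reduce_prompt)
    llm_calls = len(map_prompts) + 1
    return final_answer, llm_calls


summaries = ["Scrooge and Marley", "The three spirits", "Redemption and forgiveness"]
answer, calls = asyncio.run(map_reduce_search("What are the main themes?", summaries))
print(calls)
```

The call count grows linearly with the number of community summaries, which is the reason global search is expensive in tokens.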

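Local search

Similarly, cli::run_local_search() calls factories.py::get_local_search_engine(), which returns a LocalSearch object. Its asearch() method is simpler and produces the response directly from the assembled context. This mode is closer to the semantic retrieval strategy of traditional RAG and consumes fewer tokens.

Unlike global search, local search integrates multiple data sources, including nodes, community reports, text units, relationships, entities, and covariates.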
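
One way to picture how several data sources end up in a single prompt is a context builder that fills a fixed budget. The sketch below is hypothetical and is not taken from LocalSearch: build_local_context, the section layout, and the word-count approximation of tokens are all assumptions made for illustration.

```python
def build_local_context(sources: dict[str, list[str]], max_tokens: int) -> str:
    """Concatenate records from several sources without exceeding a budget.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    sections: list[str] = []
    used = 0
    for name, records in sources.items():
        kept: list[str] = []
        for record in records:
            cost = len(record.split())
            if used + cost > max_tokens:
                break  # this record does not fit; move on to the next source
            kept.append(record)
            used += cost
        if kept:
            sections.append(f"-----{name}-----\n" + "\n".join(kept))
    return "\n\n".join(sections)


# Made-up sample records, listed in priority order.
sources = {
    "entities": ["SCROOGE: a miserly old man", "MARLEY: former business partner"],
    "relationships": ["SCROOGE -> MARLEY: business partners"],
    "text_units": ["Marley was dead: to begin with."],
}
context = build_local_context(sources, max_tokens=14)
print(context)
```

With a budget of 14 words, the entity and relationship records fit and the text unit is left out, which shows how the order of the sources acts as a priority.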

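Conclusion

GraphRAG's core innovation lies in its approach to query-focused summarization (QFS) tasks. QFS and multi-hop Q&A are currently challenging areas for traditional RAG systems, yet they have broad applications in fields such as data analysis. Although the current implementation of GraphRAG requires substantial computing resources, it opens up new possibilities in this area.

GraphRAG's fine-grained handling of the knowledge graph is especially noteworthy, including community detection with the Leiden algorithm and a local search that integrates multiple data sources. Another interesting aspect of the project is its entity extraction approach: GraphRAG relies entirely on large language models and is not constrained by the traditional triple structure. The authors note that, because of similarity clustering, variations in what the model extracts do not significantly affect the final community generation.

Another highlight of GraphRAG is its comprehensive built-in workflow orchestration system. It defines workflows from templates, with flexible configuration and traceable steps, which may point the way for future development. Compared with systems that rely entirely on large language models for every task, GraphRAG offers more control and transparency.

In summary, the Microsoft GraphRAG framework makes an important contribution to the field. It introduces new solutions to complex problems in natural language processing and knowledge graph operations. Although there is room for improvement in some areas, the project is worth studying in depth, and it brings valuable insights and innovation to advanced information retrieval and summarization.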


References

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Apr 2024. https://doi.org/10.48550/arXiv.2404.16130. See also: https://www.microsoft.com/en-us/research/project/graphrag/ and https://microsoft.github.io/graphrag/
Markus J. Buehler. Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning. Jun 2024. https://doi.org/10.48550/arXiv.2403.11996
rvk. Unlocking Cost-Effective Local Model Inference with GraphRAG and Ollama. Jul 2024. https://medium.com/@vamshirvk/unlocking-cost-effective-local-model-inference-with-graphrag-and-ollama-d9812cc60466. GitHub: https://github.com/TheAiSingularity/graphrag-local-ollama
Ollama. https://ollama.com/
Saurabh Rajaram Yadav. GraphRAG local setup via vLLM and Ollama: A detailed integration guide. Jul 2024. https://medium.com/@ysaurabh059/graphrag-local-setup-via-vllm-and-ollama-a-detailed-integration-guide-5d85f18f7fec
Karthik Rajan. Microsoft’s GraphRAG + AutoGen + Ollama + Chainlit = Local & Free Multi-Agent RAG Superbot. Jul 2024. https://ai.gopubby.com/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f. GitHub: https://github.com/karthik-codex/Autogen_GraphRAG_Ollama
Calvin Ku. Inside GraphRAG: Analyzing Microsoft’s Innovative Framework for Knowledge Graph Processing. Jul 2024. https://medium.com/percena/inside-graphrag-analyzing-microsofts-innovative-framework-for-knowledge-graph-processing1-6f84deec5499
GraphRAG-Local-UI. GitHub: https://github.com/severian42/GraphRAG-Local-UI