Chroma Database

weixin_46515328 · Published 2023-07-21 16:48:32

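Chroma Database Overview

Chroma is fairly demanding about its environment, and Python 3.10 is the recommended version for installation. Because it relies on some newer technologies, deploying it may run into version compatibility problems.

Features: open source, deployed locally, and based on SQLite, so no database connection settings need to be configured.

Functions:

Store embeddings and metadata
Embed documents and queries
Search over the embedded data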

Usage

1. Create a client

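import chromadb
# Option 1: in-memory client
client = chromadb.Client()
# Option 2: persistent client that stores its data under the given path
client = chromadb.PersistentClient(path="/path/to/save/to")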

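There are two ways to create a client:

In memory: data is held in memory and is lost when the program exits unless it is saved manually
On disk: data is stored on disk and is saved automatically to the specified path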
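Since the article describes Chroma as based on SQLite, the difference between the two client types can be illustrated with SQLite itself. The following is only an analogy using Python's standard sqlite3 module, not Chroma's internal storage code; the table name docs and the helper write_and_reopen are hypothetical.

```python
import os
import sqlite3
import tempfile


def write_and_reopen(db_path):
    """Write one row to db_path, close the connection, reopen it, and count rows."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, body TEXT)")
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", ("id1", "doc1"))
    conn.commit()
    conn.close()

    # A new connection stands in for a fresh program run.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, body TEXT)")
    count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    conn.close()
    return count


# An in-memory database disappears when its connection is closed.
in_memory_rows = write_and_reopen(":memory:")

# A file-backed database keeps its rows across connections.
with tempfile.TemporaryDirectory() as tmp_dir:
    on_disk_rows = write_and_reopen(os.path.join(tmp_dir, "demo.sqlite3"))

print(in_memory_rows, on_disk_rows)
```

The in-memory run finds no rows after reopening, while the file-backed run still finds the row it wrote, which is the same trade-off as choosing between the two client types.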

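2. Collection operations

The embedding_function argument is optional.

# Create a collection (emb_fn is an embedding function you have defined elsewhere)
collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
# Get an existing collection
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
# Delete a collection (there is no collection object to keep, so the result is not assigned)
client.delete_collection(name="my_collection")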

3. Other collection operations

# Return the first 10 items in the collection
collection.peek()
# Return the number of items in the collection
collection.count()
# Rename the collection
collection.modify(name="new_name")

4. Add documents to a collection

When you add documents, Chroma automatically tokenizes, embeds, and indexes them.

collection.add(
    documents=["doc1", "doc2"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

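The metadatas argument is optional.

Of course, you may not want Chroma to do the embedding for you, for example because you compute embeddings with the OpenAI API. In that case Chroma also lets you pass in the embeddings directly: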

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["doc1", "doc2"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
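To make the idea of supplying your own embeddings concrete, here is a minimal sketch that builds the embeddings list yourself. The function toy_embed is a hypothetical stand-in for a real embedding model (such as a call to an external API); it hashes words into a small fixed-size vector and carries no semantic meaning. The dimension of 8 is likewise an arbitrary assumption.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a real embedding model: a hashed bag of words."""
    vector = [0.0] * dim
    for word in text.lower().split():
        # A stable hash so the same word always lands in the same slot.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    # Normalize to unit length; an all-zero vector is returned unchanged.
    return [x / norm for x in vector] if norm else vector


documents = ["doc1", "doc2"]
embeddings = [toy_embed(doc) for doc in documents]

# These keyword arguments mirror the collection.add(...) call shown above.
add_kwargs = {
    "embeddings": embeddings,
    "documents": documents,
    "ids": ["id1", "id2"],
}
```

With a real collection object, these arguments could be passed as collection.add(**add_kwargs). All embeddings in one collection need the same length, which is why the sketch fixes dim once.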

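5. Query a collection

You can query a collection with a list of texts and control how many results are returned:

When you use query_texts, Chroma embeds the query texts with the embedding_function and then runs the query with the resulting embeddings.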

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)
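Conceptually, such a query ranks the stored embeddings by their distance to the query embedding and returns the closest n_results. The sketch below does this by brute force in plain Python. It is not Chroma's implementation: the dict-based store, the function names, and the choice of squared Euclidean distance are all assumptions made for illustration, and a real vector database would typically use an index rather than scanning every item.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query(store, query_embedding, n_results=2):
    """Return the ids and distances of the n_results nearest stored embeddings."""
    ranked = sorted(
        store.items(),
        key=lambda item: squared_l2(item[1], query_embedding),
    )
    top = ranked[:n_results]
    return {
        "ids": [item_id for item_id, _ in top],
        "distances": [squared_l2(vec, query_embedding) for _, vec in top],
    }


# Hypothetical store: id -> embedding.
store = {
    "id1": [1.0, 0.0],
    "id2": [0.0, 1.0],
    "id3": [0.9, 0.1],
}

results = query(store, [1.0, 0.0], n_results=2)
print(results["ids"])
```

Here id1 matches the query exactly and id3 is the next closest, so id2 is left out when n_results is 2.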

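You can also query with query_embeddings and filter the search results:

include specifies which data to return, such as embeddings, documents, and metadatas
where filters on the metadata
where_document filters on the document content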

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    include=["documents"],
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
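The effect of the two filters in the call above can be sketched in plain Python. This is a simplified model that handles only exact-match metadata conditions and the $contains document condition used in the example; it does not attempt to cover any other filter operators, and the item layout and the matches helper are hypothetical.

```python
def matches(item, where=None, where_document=None):
    """Return True if the item passes both the metadata and the document filter."""
    if where is not None:
        for field, expected in where.items():
            # Every listed metadata field must equal the expected value.
            if item["metadata"].get(field) != expected:
                return False
    if where_document is not None:
        needle = where_document.get("$contains")
        # The document text must contain the given substring.
        if needle is not None and needle not in item["document"]:
            return False
    return True


# Hypothetical items standing in for the contents of a collection.
items = [
    {"id": "id1", "document": "the search_string appears here", "metadata": {"source": "a"}},
    {"id": "id2", "document": "nothing relevant", "metadata": {"source": "a"}},
    {"id": "id3", "document": "search_string again", "metadata": {"source": "b"}},
]

kept = [
    item["id"]
    for item in items
    if matches(item, where={"source": "a"}, where_document={"$contains": "search_string"})
]
print(kept)
```

Only id1 satisfies both conditions: id2 fails the document filter and id3 fails the metadata filter.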

6. Update a collection

Any item in a collection can be updated with the update method.

collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

Chroma also supports an upsert operation, which updates existing items and adds items that do not exist yet.

collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
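The difference between update and upsert can be sketched with a plain dict keyed by id. This is a model of the semantics described above, not Chroma's code. In particular, it assumes that update leaves unknown ids untouched and merely reports them; how the library actually signals a missing id is not covered by this article.

```python
def update(store, ids, documents):
    """Overwrite existing ids only; return the ids that were not found."""
    missing = []
    for item_id, doc in zip(ids, documents):
        if item_id in store:
            store[item_id] = doc
        else:
            # Assumption: an unknown id is skipped rather than inserted.
            missing.append(item_id)
    return missing


def upsert(store, ids, documents):
    """Overwrite existing ids and insert the ones that do not exist yet."""
    for item_id, doc in zip(ids, documents):
        store[item_id] = doc


# Hypothetical store: id -> document.
store = {"id1": "old doc1"}

skipped = update(store, ["id1", "id2"], ["new doc1", "new doc2"])
after_update = dict(store)

upsert(store, ["id1", "id2"], ["newer doc1", "new doc2"])
print(skipped, after_update, store)
```

After the update only id1 has changed and id2 is reported as skipped; after the upsert both ids are present.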

7. Delete from a collection

Delete items by id; a where filter on the metadata can be supplied as well:

collection.delete(
    ids=["id1", "id2", "id3",...],
    where={"chapter": "20"}
)
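One plausible reading of the call above is that an item is deleted only if its id is in the list and its metadata also matches the where filter. The sketch below models that reading with a plain dict. Treat the combination rule as an assumption to verify against the Chroma documentation; the store layout and the delete helper are hypothetical.

```python
def delete(store, ids=None, where=None):
    """Remove items that satisfy every supplied condition; return the removed ids."""
    doomed = []
    for item_id, metadata in store.items():
        if ids is not None and item_id not in ids:
            continue
        # Assumption: the where filter narrows the id list instead of widening it.
        if where is not None and any(metadata.get(k) != v for k, v in where.items()):
            continue
        doomed.append(item_id)
    for item_id in doomed:
        del store[item_id]
    return doomed


# Hypothetical store: id -> metadata.
store = {
    "id1": {"chapter": "20"},
    "id2": {"chapter": "3"},
    "id4": {"chapter": "20"},
}

removed = delete(store, ids=["id1", "id2", "id3"], where={"chapter": "20"})
print(removed, sorted(store))
```

Under this reading only id1 is removed: id2 is listed but is in a different chapter, id3 does not exist, and id4 matches the filter but is not listed.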

8. Useful methods

# Reset the entire database
client.reset()

# Check whether the database service is running
client.heartbeat()
