Creating a collection in Milvus, inserting data into it, and viewing it in the Attu management tool


keyboard_sun 2024-09-14 15:47:11


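© Copyright belongs to the author: this is an original work by keyboard_sun on the 51CTO blog. Please contact the author for permission to repost; otherwise legal action will be pursued.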

The log prints the following. Why can't I see the inserted data in Attu? The collection itself is visible, so why isn't the data?

/home/anaconda3/envs/bi-txt-sql/bin/python -X pycache_prefix=/home/.cache/JetBrains/PyCharm2023.2/cpython-cache /home/tools/pycharm-professional-2023.2.5/pycharm-2023.2.5/plugins/python/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 42745 --file /home/PycharmProjects/bi-txt-sql/app/api/endpoints/testmilvus.py 
Connected to pydev debugger (build 232.10300.41)
/home/.local/lib/python3.10/site-packages/torch/__init__.py:235: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  from torch._C import *  # noqa: F403
Successfully connected to the Milvus server
Collection 'weather' does not exist, creating it...
Collection 'weather' created
/home/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
/home/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Embedding vectors generated successfully
Successfully inserted 4 rows into collection 'weather'
Existing Milvus collections:  ['weather']

Process finished with exit code 0

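According to the log, the Milvus collection "weather" was created and 4 rows were inserted successfully, yet the data does not show up in Attu. The likely causes are listed below.

Possible causes and fixes: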

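1. No index has been created

An index does not by itself make data visible, but in Milvus 2.x a collection generally cannot be loaded until its vector field has an index, and Attu can only query and preview data from a loaded collection. Without an index, the data is stored but Attu may have nothing to display.

Fix: create an index
Create an index on the vector field after inserting the data. Milvus supports several index types, such as IVF_FLAT and IVF_SQ8: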

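# Create an index on the vector field
# (`collection` is the Collection object from your script)
index_params = {
    "metric_type": "L2",  # or another similarity metric such as IP, depending on your needs
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}

collection.create_index(field_name="embedding", index_params=index_params)
print("Index created")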
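
A guard that skips index creation when the field already has one keeps reruns of the script idempotent. The following is a minimal sketch, not the author's code: the helper name `ensure_vector_index` is hypothetical, and it is duck-typed, assuming only an object that exposes `indexes` (items with a `field_name`) and `create_index`, as pymilvus's `Collection` does.

```python
def ensure_vector_index(collection, field_name, index_params):
    """Create an index on `field_name` unless one already exists.

    Returns True if an index was created, False if one was already present.
    `collection` is assumed to behave like a pymilvus Collection.
    """
    # Collect the names of fields that already carry an index.
    indexed_fields = {idx.field_name for idx in collection.indexes}
    if field_name in indexed_fields:
        return False
    collection.create_index(field_name=field_name, index_params=index_params)
    return True
```

In the script from this post it would be called as `ensure_vector_index(collection, "embedding", index_params)`.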

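2. The collection has not been loaded

Milvus does not serve searches or queries from a collection until it has been loaded into memory, and inserting data does not load it. You need to call load() explicitly so the data becomes available for search and for display in Attu.

Fix: load the collection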

# Load the collection into memory
collection.load()
print("Collection loaded")
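
One way to confirm from Python that the data is queryable, independently of Attu, is to load the collection and fetch a few rows with a scalar query. This is a sketch under assumptions: the helper name `preview_rows` is hypothetical, the field names `id` and `text` come from the schema used later in this post, and the object passed in is assumed to behave like a pymilvus `Collection`.

```python
def preview_rows(collection, limit=10):
    """Load the collection, then return up to `limit` rows as dicts.

    Assumes an INT64 primary key named `id` and a VARCHAR field named `text`.
    """
    collection.load()
    # Auto-generated primary keys are positive, so this expression matches every row.
    return collection.query(
        expr="id >= 0",
        output_fields=["id", "text"],
        limit=limit,
    )
```

If `preview_rows(collection)` returns the four inserted rows, the data is in Milvus and the remaining problem is on the Attu side.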

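3. Attu is showing stale data

Attu may still be showing the collection state it fetched earlier, so newly inserted data does not appear until the view is refreshed.

Fix: refresh Attu
Refresh the Attu page manually, or restart the Attu service so that it fetches the latest state from Milvus. If Attu runs in a Docker container named attu: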

docker restart attu

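4. The inserted data has not been flushed

Even though the log reports a successful insert, newly inserted rows sit in a growing segment until they are flushed, and the entity count reported for the collection may not include them yet.

Fix: flush the collection after inserting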

# Flush after inserting
collection.flush()
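
To check that an insert actually landed, the insert result and the collection's entity count can be compared after the flush. A minimal sketch, assuming an object that behaves like a pymilvus `Collection` (`insert` returning a result with `insert_count`, plus `flush` and `num_entities`); the helper name `insert_and_flush` is hypothetical.

```python
def insert_and_flush(collection, columns):
    """Insert column-based data, flush, and return (inserted, total) row counts."""
    result = collection.insert(columns)
    # Flushing seals the pending rows so that num_entities reflects them.
    collection.flush()
    return result.insert_count, collection.num_entities
```

For the script in this post, `insert_and_flush(collection, [candidate_texts, candidate_embeddings.tolist()])` should report 4 inserted rows on a fresh collection.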

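5. Check that the right collection is selected in Attu

Make sure the collection you are viewing in Attu is the one you just inserted into, on the same Milvus instance your script connects to.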
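
Printing every collection with its row count from the script's own connection makes it easy to compare against what Attu shows. This is only a sketch: the helper name `collection_row_counts` is hypothetical, and it takes two callables so that it does not depend on a live server.

```python
def collection_row_counts(list_names, open_collection):
    """Map each collection name to its row count.

    `list_names` is a zero-argument callable returning collection names;
    `open_collection` turns a name into an object exposing `num_entities`.
    """
    return {name: open_collection(name).num_entities for name in list_names()}
```

With pymilvus and an open connection, this would be called as `collection_row_counts(utility.list_collections, Collection)`.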

Complete revised code:

After inserting the data, create the index and load the collection, then refresh Attu:

import torch
from transformers import AutoTokenizer, AutoModel
from pymilvus import CollectionSchema, FieldSchema, DataType, Collection, connections, utility


# Function that computes text embeddings
def embed_texts(texts, model, tokenizer):
    """
    Convert texts into embedding vectors with the Jina-embeddings-v2 model
    """
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # take the CLS token embedding
    return embeddings.numpy()


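if __name__ == '__main__':
    # Connect to Milvus and stop early if the connection fails
    try:
        connections.connect(alias="default", host="172.19.17.52", port="19530")
        print("Successfully connected to the Milvus server")
    except Exception as e:
        print(f"Failed to connect to Milvus: {e}")
        raise SystemExit(1)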

    # Define the collection schema; the primary key field sets is_primary=True
    id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
    text_field = FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=100)
    embedding_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
    schema = CollectionSchema(fields=[id_field, text_field, embedding_field])

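    # Collection name
    collection_name = "weather"

    # Check whether the collection exists
    if not utility.has_collection(collection_name):
        print(f"Collection '{collection_name}' does not exist, creating it...")
        # Create the collection
        try:
            collection = Collection(name=collection_name, schema=schema)
            print(f"Collection '{collection_name}' created")
        except Exception as e:
            print(f"Error while creating the collection: {e}")
            raise SystemExit(1)
    else:
        print(f"Collection '{collection_name}' already exists, using the existing collection...")
        # Get a handle to the existing collection (this does not load it into memory)
        collection = Collection(name=collection_name)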

    # Load the Jina-embeddings-v2-base-zh model and its tokenizer
    model_name = 'jinaai/jina-embeddings-v2-base-zh'
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

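    # Sample texts, kept in Chinese to match the zh model: "Is it raining today?",
    # "Today is a sunny day", "What day of the week is it today?", "How is the weather today?"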
    candidate_texts = ['今天天气下雨吗?', '今天是一个晴天', '今天是星期几?', '今天天气怎么样?']

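    # Generate embedding vectors with the Jina-embeddings-v2 model
    try:
        candidate_embeddings = embed_texts(candidate_texts, model, tokenizer)
        print("Embedding vectors generated successfully")
    except Exception as e:
        print(f"Error while generating embedding vectors: {e}")
        raise SystemExit(1)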

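    # Insert the data (the id column is omitted because auto_id=True), then flush
    try:
        collection.insert([candidate_texts, candidate_embeddings.tolist()])
        collection.flush()
        print(f"Successfully inserted {len(candidate_texts)} rows into collection '{collection_name}'")
    except Exception as e:
        print(f"Error while inserting data: {e}")
        raise SystemExit(1)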

    # Create the index on the vector field
    index_params = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 128}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    print("Index created")

    # Load the collection into memory
    collection.load()
    print("Collection loaded")

    # List all existing Milvus collections
    try:
        collections = utility.list_collections()
        print("Existing Milvus collections: ", collections)
    except Exception as e:
        print(f"Error while listing collections: {e}")