
关于
向量搜索应用中嵌入模型的选择与优化指南
name: embedding-strategies description: "向量搜索应用中嵌入模型选择和优化指南。" risk: unknown source: community date_added: "2026-02-27"
嵌入策略
向量搜索应用中嵌入模型选择和优化指南。
不要在以下情况使用此技能
- 任务与嵌入策略无关
- 你需要此范围之外的不同领域或工具
说明
- 明确目标、约束和所需输入。
- 应用相关最佳实践并验证结果。
- 提供可操作的步骤和验证。
- 如果需要详细示例,请打开
resources/implementation-playbook.md。
在以下情况使用此技能
- 为 RAG 选择嵌入模型
- 优化分块策略
- 为特定领域微调嵌入
- 比较嵌入模型性能
- 降低嵌入维度
- 处理多语言内容
核心概念
1. 嵌入模型对比
| 模型 | 维度 | 最大 Token | 最适用于 | |------|------|-----------|----------| | text-embedding-3-large | 3072 | 8191 | 高精度 | | text-embedding-3-small | 1536 | 8191 | 性价比高 | | voyage-2 | 1024 | 4000 | 代码、法律 | | bge-large-en-v1.5 | 1024 | 512 | 开源 | | all-MiniLM-L6-v2 | 384 | 256 | 快速、轻量 | | multilingual-e5-large | 1024 | 512 | 多语言 |
2. 嵌入流水线
Document → Chunking → Preprocessing → Embedding Model → Vector
↓
[Overlap, Size] [Clean, Normalize] [API/Local]
模板
模板 1:OpenAI 嵌入
from openai import OpenAI
from typing import List
import numpy as np
client = OpenAI()
def get_embeddings(
texts: List[str],
model: str = "text-embedding-3-small",
dimensions: int = None
) -> List[List[float]]:
"""Get embeddings from OpenAI."""
# Handle batching for large lists
batch_size = 100
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
kwargs = {"input": batch, "model": model}
if dimensions:
kwargs["dimensions"] = dimensions
response = client.embeddings.create(**kwargs)
embeddings = [item.embedding for item in response.data]
all_embeddings.extend(embeddings)
return all_embeddings
def get_embedding(text: str, **kwargs) -> List[float]:
"""Get single embedding."""
return get_embeddings([text], **kwargs)[0]
# Dimension reduction with OpenAI
def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
"""Get embedding with reduced dimensions (Matryoshka)."""
return get_embedding(
text,
model="text-embedding-3-small",
dimensions=dimensions
)
模板 2:使用 Sentence Transformers 的本地嵌入
from sentence_transformers import SentenceTransformer
from typing import List, Optional
import numpy as np
class LocalEmbedder:
"""Local embedding with sentence-transformers."""
def __init__(
self,
model_name: str = "BAAI/bge-large-en-v1.5",
device: str = "cuda"
):
self.model = SentenceTransformer(model_name, device=device)
def embed(
self,
texts: List[str],
normalize: bool = True,
show_progress: bool = False
) -> np.ndarray:
"""Embed texts with optional normalization."""
embeddings = self.model.encode(
texts,
normalize_embeddings=normalize,
show_progress_bar=show_progress,
convert_to_numpy=True
)
return embeddings
def embed_query(self, query: str) -> np.ndarray:
"""Embed a query with BGE-style prefix."""
# BGE models benefit from query prefix
if "bge" in self.model.get_sentence_embedding_dimension():
query = f"Represent this sentence for searching relevant passages: {query}"
return self.embed([query])[0]
def embed_documents(self, documents: List[str]) -> np.ndarray:
"""Embed documents for indexing."""
return self.embed(documents)
# E5 model with instructions
class E5Embedder:
def __init__(self, model_name: str = "intfloat/multilingual-e5-large"):
self.model = SentenceTransformer(model_name)
def embed_query(self, query: str) -> np.ndarray:
return self.model.encode(f"query: {query}")
def embed_document(self, document: str) -> np.ndarray:
return self.model.encode(f"passage: {document}")
模板 3:分块策略
from typing import List, Tuple
import re
def chunk_by_tokens(
text: str,
chunk_size: int = 512,
chunk_overlap: int = 50,
tokenizer=None
) -> List[str]:
"""Chunk text by token count."""
import tiktoken
tokenizer = tokenizer or tiktoken.get_encoding("cl100k_base")
tokens = tokenizer.encode(text)
chunks = []
start = 0
while start <
兼容工具
Claude CodeCursor
标签
AI与机器学习