Content Hash Cache

Low risk

Author: @affaan-m (verified source)

4.1272 installs, v1.0.0, updated May 25, 2026

Usage

Run the following commands in Claude Code.

Step 1: Add the marketplace

/plugin marketplace add affaan-m/ECC

Step 2: Install the plugin

/plugin install content-hash-cache-pattern@ecc

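About

Cache expensive file processing results using SHA-256 content hashes: path-independent, self-invalidating, and separated into a service layer.

name: content-hash-cache-pattern
description: Cache expensive file processing results using SHA-256 content hashes: path-independent, self-invalidating, and separated into a service layer.
origin: ECC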

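Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using a SHA-256 hash of the file contents as the cache key. Unlike path-based caching, this approach keeps working after a file is moved or renamed, and it invalidates automatically when the content changes.

When to Activate

Building a file processing pipeline (PDF, images, text extraction)
Processing is expensive and the same files are processed repeatedly
A --cache/--no-cache CLI option is needed
Caching should be added without modifying existing pure functions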
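
The --cache/--no-cache option listed above could be wired up roughly as follows. This is a minimal sketch, not the skill's own CLI: the parser layout, the positional `file` argument, and the `--cache-dir` flag are assumptions made for illustration. It relies on `argparse.BooleanOptionalAction` (Python 3.9+), which registers both the positive and the negated flag from a single declaration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a cached extraction tool."""
    parser = argparse.ArgumentParser(description="Extract text with optional caching")
    parser.add_argument("file", type=Path, help="file to process")
    # BooleanOptionalAction generates both --cache and --no-cache.
    parser.add_argument("--cache", action=argparse.BooleanOptionalAction, default=True)
    # Hypothetical flag: lets the caller relocate the cache directory.
    parser.add_argument("--cache-dir", type=Path, default=Path(".cache"))
    return parser


args = build_parser().parse_args(["report.pdf", "--no-cache"])
print(args.cache)      # caching was switched off on the command line
print(args.cache_dir)  # falls back to the default directory
```

The parsed values map directly onto keyword arguments of a caching wrapper such as the `extract_with_cache` function in section 4 (`cache_enabled=args.cache`, `cache_dir=args.cache_dir`), so the processing function itself never sees the flag.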

Core Pattern

1. Content-Hash-Based Cache Key

Use the file contents, not the path, as the cache key:

import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()

Why content hashes? A renamed or moved file is still a cache hit. Changed content invalidates the entry automatically. No index file is needed.
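
The rename and edit behaviour described above can be checked with a small experiment. The sketch below re-declares a compact variant of `compute_file_hash` so that it runs on its own; the file names and contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_file_hash(path: Path, chunk_size: int = 65536) -> str:
    """SHA-256 of file contents, read in fixed-size chunks."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def demo_rename_and_edit() -> tuple[str, str, str]:
    """Return the hash before a rename, after the rename, and after an edit."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.txt"  # hypothetical file name
        original.write_bytes(b"quarterly numbers")
        before = compute_file_hash(original)

        renamed = original.rename(Path(tmp) / "moved.txt")
        after_rename = compute_file_hash(renamed)

        renamed.write_bytes(b"quarterly numbers, revised")
        after_edit = compute_file_hash(renamed)
    return before, after_rename, after_edit


before, after_rename, after_edit = demo_rename_and_edit()
print(before == after_rename)  # same bytes, so the cache key is unchanged
print(before == after_edit)    # different bytes, so the old entry no longer matches
```

Because the key depends only on the bytes, two copies of the same file in different directories also share one cache entry.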

2. Frozen Dataclass for Cache Entries

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json, so a lookup by hash is a single path check (O(1)) and no index file is needed.

import json
from pathlib import Path

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss

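4. Service-Layer Wrapper (Single Responsibility Principle)

Keep the processing function pure. Add caching as a separate service layer.

import logging

logger = logging.getLogger(__name__)

def extract_with_cache(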
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
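
To see the wrapper's effect end to end, the following self-contained sketch counts how often the expensive step actually runs. It is a simplified stand-in rather than the code above: `extract_text` here is a toy function, the cached value is a plain string instead of a `CacheEntry`, and the file is hashed in one read because the test input is tiny.

```python
import hashlib
import json
import tempfile
from pathlib import Path

calls = {"extract": 0}  # counts how often the expensive step runs


def extract_text(path: Path) -> str:
    """Toy stand-in for an expensive pure function (hypothetical)."""
    calls["extract"] += 1
    return path.read_text(encoding="utf-8").upper()


def extract_with_cache(path: Path, cache_dir: Path) -> str:
    """Simplified wrapper: hash -> lookup -> extract on miss -> store."""
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.is_file():
        try:
            return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
        except (ValueError, KeyError, TypeError):
            pass  # corrupt entry: fall through and recompute
    result = extract_text(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": result}), encoding="utf-8")
    return result


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "note.txt"
    src.write_text("hello", encoding="utf-8")
    cache_dir = Path(tmp) / ".cache"

    first = extract_with_cache(src, cache_dir)    # miss: runs extract_text
    second = extract_with_cache(src, cache_dir)   # hit: served from disk
    moved = src.rename(Path(tmp) / "renamed.txt")
    third = extract_with_cache(moved, cache_dir)  # still a hit after the rename

print(calls["extract"])  # the expensive function ran only for the first call
```

The toy `extract_text` contains no cache logic at all, which is the point of the service-layer split: the same function can be called directly, tested in isolation, or wrapped.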

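Key Design Decisions

| Decision | Rationale |
|------|------|
| SHA-256 content hash | Path-independent; invalidates automatically when content changes |
| {hash}.json file naming | O(1) lookup, no index file needed |
| Service-layer wrapper | Single responsibility: extraction stays pure, caching is a separate concern |
| Manual JSON serialization | Full control over how frozen dataclasses are serialized |
| Corruption returns None | Graceful degradation; the file is reprocessed on the next run |
| cache_dir.mkdir(parents=True) | Directory is created lazily on first write |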

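Best Practices

Hash the content, not the path: paths change, content identity does not
Hash large files in chunks: avoid loading the whole file into memory
Keep processing functions pure: they should not know the cache exists
Log cache hits and misses with a truncated hash for debugging
Handle corruption gracefully: treat an invalid cache entry as a miss and never crash

Anti-Patterns to Avoid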

# BAD: Path-based caching (breaks on file move/rename)
cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # Now this function has two responsibilities
        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses
# (can cause issues with complex nested types)
dat

Compatible Tools

Claude Code, Cursor
