
About
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
name: hugging-face-evaluation description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format. risk: unknown source: community
Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)
When to Use
- You need to add structured evaluation results to a Hugging Face model card.
- You want to import benchmark data or run custom evaluations with vLLM, lighteval, or inspect-ai.
- You are preparing leaderboard-compatible
model-indexmetadata for a model release.
Integration with HF Ecosystem
- Model Cards: Updates model-index metadata for leaderboard integration
- Artificial Analysis: Direct API integration for benchmark imports
- Papers with Code: Compatible with their model-index specification
- Jobs: Run evaluations directly on Hugging Face Jobs with
uvintegration - vLLM: Efficient GPU inference for custom model evaluation
- lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
- inspect-ai: UK AI Safety Institute's evaluation framework
Version
1.3.0
Dependencies
Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)
Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai
vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
IMPORTANT: Using This Skill
⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones
Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
- DO NOT create a new PR - this creates duplicate work for maintainers
- Warn the user that open PRs already exist
- Show the user the existing PR URLs so they can review them
- Only proceed if the user explicitly confirms they want to create another PR
This prevents spamming model repositories with duplicate evaluation PRs.
All paths are relative to the directory containing this SKILL.md file. Before running any script, first
cdto that directory or use the full path.
Use --help for the latest workflow guidance. Works with plain Python or uv run:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
get-prs→ check for existing open PRs firstinspect-tables→ find table numbers/columnsextract-readme --table N→ prints YAML by default- add
--apply(push) or--create-prto write changes
Core Capabilities
1. Inspect and Extract Evaluation Tables from README
- Inspect Tables: Use
inspect-tablesto see all tables in a README with structure, columns, and sample rows - Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- Table Selection: Use
--table Nto extract from a specific table (required when multiple tables exist) - Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- Column Matching: Automatically identify model columns/rows; prefer
--model-column-index(index from inspect output). Use--model-name-overrideonly with exact column header text. - YAML Generation: Convert selected table to model-index YAML format
- Task Typing:
--task-typesets thetask.typefield in model-index output (e.g.,text-generation,summarization)
2. Import from Artificial Analysis
- API Integration: Fetch benchmark scores directly from Artificial Analysis
- Automatic Formatting: Convert API responses to model-index format
- Metadata Preservation: Maintain source attribution and URLs
- PR Creation: Automatically create pull requests with evaluation updates
3. Model-Index Management
- YAML Generation: Create properly formatted model-index entries
- Merge Support: Add evaluations to existing model cards without overwriting
- Validation: Ensure compliance with Papers with Code specification
- Batch Operations: Process multiple models efficiently
4. Run Evaluations on HF Jobs (Inference Providers)
- **Ins