Hugging Face Evaluation

Low Risk

by @sickn33Verified Source

4.1136 installsv1.0.0Updated May 25, 2026

How to Use

Run in Claude Code terminal

Step 1: Add Marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

Step 2: Install Plugin

/plugin install antigravity-awesome-skills@antigravity-awesome-skills

About

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

name: hugging-face-evaluation description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format. risk: unknown source: community

Overview

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:

Extracting existing evaluation tables from README content
Importing benchmark scores from Artificial Analysis
Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

When to Use

You need to add structured evaluation results to a Hugging Face model card.
You want to import benchmark data or run custom evaluations with vLLM, lighteval, or inspect-ai.
You are preparing leaderboard-compatible model-index metadata for a model release.

Integration with HF Ecosystem

Model Cards: Updates model-index metadata for leaderboard integration
Artificial Analysis: Direct API integration for benchmark imports
Papers with Code: Compatible with their model-index specification
Jobs: Run evaluations directly on Hugging Face Jobs with uv integration
vLLM: Efficient GPU inference for custom model evaluation
lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
inspect-ai: UK AI Safety Institute's evaluation framework

Version

1.3.0

Dependencies

Core Dependencies

huggingface_hub>=0.26.0
markdown-it-py>=3.0.0
python-dotenv>=1.2.1
pyyaml>=6.0.3
requests>=2.32.5
re (built-in)

Inference Provider Evaluation

inspect-ai>=0.3.0
inspect-evals
openai

vLLM Custom Model Evaluation (GPU required)

lighteval[accelerate,vllm]>=0.6.0
vllm>=0.4.0
torch>=2.0.0
transformers>=4.40.0
accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.

IMPORTANT: Using This Skill

⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

If open PRs exist:

DO NOT create a new PR - this creates duplicate work for maintainers
Warn the user that open PRs already exist
Show the user the existing PR URLs so they can review them
Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.

All paths are relative to the directory containing this SKILL.md file. Before running any script, first cd to that directory or use the full path.

Use --help for the latest workflow guidance. Works with plain Python or uv run:

uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help

Key workflow (matches CLI help):

get-prs → check for existing open PRs first
inspect-tables → find table numbers/columns
extract-readme --table N → prints YAML by default
add --apply (push) or --create-pr to write changes

Core Capabilities

1. Inspect and Extract Evaluation Tables from README

Inspect Tables: Use inspect-tables to see all tables in a README with structure, columns, and sample rows
Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
Table Selection: Use --table N to extract from a specific table (required when multiple tables exist)
Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
Column Matching: Automatically identify model columns/rows; prefer --model-column-index (index from inspect output). Use --model-name-override only with exact column header text.
YAML Generation: Convert selected table to model-index YAML format
Task Typing: --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)

2. Import from Artificial Analysis

API Integration: Fetch benchmark scores directly from Artificial Analysis
Automatic Formatting: Convert API responses to model-index format
Metadata Preservation: Maintain source attribution and URLs
PR Creation: Automatically create pull requests with evaluation updates

3. Model-Index Management

YAML Generation: Create properly formatted model-index entries
Merge Support: Add evaluations to existing model cards without overwriting
Validation: Ensure compliance with Papers with Code specification
Batch Operations: Process multiple models efficiently

4. Run Evaluations on HF Jobs (Inference Providers)

**Ins

Compatible Tools

Claude CodeCursor

Hugging Face Evaluation

About

Overview

When to Use

Integration with HF Ecosystem

Version

Dependencies

Core Dependencies

Inference Provider Evaluation

vLLM Custom Model Evaluation (GPU required)

IMPORTANT: Using This Skill

⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

Core Capabilities

1. Inspect and Extract Evaluation Tables from README

2. Import from Artificial Analysis

3. Model-Index Management

4. Run Evaluations on HF Jobs (Inference Providers)

Compatible Tools

Tags

Related Skills

RAG Engineer

"orchestrate-batch-refactor"

Docx Official

Azure AI Agents Persistent Java

Azure Search Documents Ts

Agent Framework Azure AI Py