
About
Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub
name: data-scraper-agent description: Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically. origin: community
Data Scraper Agent
Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.
Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase
When to Activate
- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions
Core Concepts
The Three Layers
Every data scraper agent has three layers:
COLLECT → ENRICH → STORE
│ │ │
Scraper AI (LLM) Database
runs on scores/ Notion /
schedule summarises Sheets /
& classifies Supabase
Free Stack
| Layer | Tool | Why |
|---|---|---|
| Scraping | requests + BeautifulSoup | No cost, covers 80% of public sites |
| JS-rendered sites | playwright (free) | When HTML scraping fails |
| AI enrichment | Gemini Flash via REST API | 500 req/day, 1M tokens/day — free |
| Storage | Notion API | Free tier, great UI for review |
| Schedule | GitHub Actions cron | Free for public repos |
| Learning | JSON feedback file in repo | Zero infra, persists in git |
AI Model Fallback Chain
Build agents to auto-fallback across Gemini models on quota exhaustion:
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
Batch API Calls for Efficiency
Never call the LLM once per item. Always batch:
# BAD: 33 API calls for 33 items
for item in items:
result = call_ai(item) # 33 calls → hits rate limit
# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
results = call_ai(batch) # 7 calls → stays within free tier
Workflow
Step 1: Understand the Goal
Ask the user:
- What to collect: "What data source? URL / API / RSS / public endpoint?"
- What to extract: "What fields matter? Title, price, URL, date, score?"
- How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?"
- How to enrich: "Do you want AI to score, summarise, classify, or match each item?"
- Frequency: "How often should it run? Every hour, daily, weekly?"
Common examples to prompt:
- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest
Step 2: Design the Agent Architecture
Generate this directory structure for the user:
my-agent/
├── config.yaml # User customises this (keywords, filters, preferences)
├── profile/
│ └── context.md # User context the AI uses (resume, interests, criteria)
├── scraper/
│ ├── __init__.py
│ ├── main.py # Orchestrator: scrape → enrich → store
│ ├── filters.py # Rule-based pre-filter (fast, before AI)
│ └── sources/
│ ├── __init__.py
│ └── source_name.py # One file per data source
├── ai/
│ ├── __init__.py
│ ├── client.py # Gemini REST client with model fallback
│ ├── pipeline.py # Batch AI analysis
│ ├── jd_fetcher.py # Fetch full content from URLs (optional)
│ └── memory.py # Learn from user feedback
├── storage/
│ ├── __init__.py
│ └── notion_sync.py # Or sheets_sync.py / supabase_sync.py
├── data/
│ └── feedback.json # User decision history (auto-updated)
├── .env.example
├── setup.py # One-time DB/schema creation
├── enrich_existing.py # Backfill AI scores on old rows
├── requirements.txt
└── .github/
└── workflows/
└── scraper.yml # GitHub Actions schedule
Step 3: Build the Scraper Source
Template for any data source:
# scraper/sources/my_source.py
"""
[Source Name] — scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}
def fetch() -> list[dict]:
"""
