Production-ready Python toolkit for scraping Canadian credit card data and uploading to Supabase.
Built to populate the Rewards Optimizer database with comprehensive, accurate credit card information including category rewards, signup bonuses, and point valuations.
- **Curated Data** - 34+ Canadian credit cards with verified category rewards
- **Multi-Source Scraping** - Ratehub, MoneySense, NerdWallet, CreditCardGenius, GreedyRates
- **Supabase Integration** - Direct database upload with duplicate prevention
- **Production Ready** - Rate limiting, retry logic, error tracking, structured logging
- **Tested** - 20+ unit tests with coverage reporting
- **Configurable** - Environment-based configuration for all settings
```bash
git clone https://github.com/tahseen137/WebDataScraper.git
cd WebDataScraper
pip install -r requirements.txt
```

Copy and configure your environment:

```bash
cp .env.example .env
```

Edit `.env` with your Supabase credentials:

```
SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_KEY=your-service-role-key
```

Note: Use the `service_role` key (not the `anon` key) from Supabase Dashboard → Settings → API.
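Internally these values are read by `config.py` via python-dotenv. A minimal sketch of that pattern (the defaults shown are illustrative, not necessarily the project's):

```python
# config.py-style settings loader (sketch)
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

SUPABASE_URL = os.getenv("SUPABASE_URL")
SUPABASE_KEY = os.getenv("SUPABASE_KEY")

# Scraper tuning knobs, with illustrative defaults
SCRAPER_DELAY = float(os.getenv("SCRAPER_DELAY", "2.0"))
SCRAPER_MAX_RETRIES = int(os.getenv("SCRAPER_MAX_RETRIES", "3"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

if not SUPABASE_URL or not SUPABASE_KEY:
    raise RuntimeError("SUPABASE_URL and SUPABASE_KEY must be set in .env")
```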
Upload the 34 curated Canadian credit cards:

```bash
python seed_cards.py
```

That's it! Your database now contains production-ready credit card data.
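Uploads go through supabase-py. As a rough sketch of how duplicate prevention can work at the database level, an upsert keyed on the unique `card_key` column (the table name and card values here are illustrative):

```python
import os
from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
client = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_KEY"))

card = {
    "card_key": "amex-cobalt",          # illustrative unique key
    "name": "American Express Cobalt",
    "issuer": "American Express",
    "annual_fee": 155.88,
    "base_reward_rate": 1,
    "base_reward_unit": "points",
}

# Upsert on card_key so re-running the seed updates rows instead of duplicating them
client.table("credit_cards").upsert(card, on_conflict="card_key").execute()
```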
- American Express (6) - Cobalt, Gold, Platinum, Aeroplan Reserve, SimplyCash
- BMO (4) - CashBack, Eclipse, AIR MILES, CashBack World Elite
- CIBC (4) - Dividend, Dividend Infinite, Aventura, Aeroplan
- Scotiabank (3) - Gold Amex, Momentum, Passport
- TD (3) - Aeroplan, Cash Back, First Class Travel
- RBC (3) - Avion, Cash Back, WestJet
- Plus - Neo, Desjardins, MBNA, National Bank, PC Financial, Rogers, Simplii, Tangerine, Triangle
Each card includes:

- ✅ Base reward rates (cashback/points/miles)
- ✅ Category bonuses (groceries, dining, gas, travel, etc.)
- ✅ Signup bonuses with requirements
- ✅ Annual fees and point valuations
- ✅ Reward program associations
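For a concrete picture, one card's seed data might be shaped roughly like this before being split across the database tables listed later in this README (field names follow the schema; values are illustrative):

```python
card = {
    "card_key": "scotiabank-gold-amex",
    "name": "Scotiabank Gold American Express",
    "issuer": "Scotiabank",
    "reward_program": "Scene+",
    "reward_currency": "points",
    "point_valuation": 0.01,   # dollars per point
    "annual_fee": 120.0,
    "base_reward_rate": 1,
    "base_reward_unit": "points",
    "category_rewards": [
        {"category": "groceries", "multiplier": 5, "reward_unit": "points"},
        {"category": "dining", "multiplier": 5, "reward_unit": "points"},
    ],
    "signup_bonus": {
        "bonus_amount": 25000,
        "bonus_currency": "points",
        "spend_requirement": 1000,
        "timeframe_days": 90,
    },
}
```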
- Language: Python 3.10+
- Web Scraping: BeautifulSoup4, Requests, Newspaper3k
- Database: Supabase (PostgreSQL)
- Data Processing: Pandas, Jellyfish (fuzzy matching)
- Testing: Pytest with coverage
- Configuration: python-dotenv
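The scraping layer itself is plain Requests plus BeautifulSoup4. A rough sketch of the pattern, with a placeholder URL and CSS selector rather than the project's real ones:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; WebDataScraper)"}

def fetch_card_names(url: str) -> list[str]:
    """Fetch a listing page and extract card names (selector is illustrative)."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each source site needs its own selector; "h2.card-name" is a placeholder
    return [h.get_text(strip=True) for h in soup.select("h2.card-name")]
```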
- `SCRIPTS.md` - Reference for all 37 Python scripts
- `.env.example` - Configuration options
- `tests/README.md` - Testing guide
- `docs/` - Additional documentation
```bash
# Run all tests
pytest

# With coverage
pytest --cov=. --cov-report=html

# Specific tests
pytest tests/test_scraper.py -v
```

The main entry points:

```bash
python seed_cards.py        # upload the curated card set
python scrape_workflow.py   # run the scrape-and-upload workflow
python check_duplicates.py  # flag potential duplicate cards
```
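The duplicate check builds on Jellyfish's fuzzy string matching. A minimal sketch of the idea (the threshold and normalization are assumptions, not the exact logic of `check_duplicates.py`):

```python
import jellyfish

def is_probable_duplicate(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Treat two card names as duplicates when Jaro-Winkler similarity is high."""
    score = jellyfish.jaro_winkler_similarity(name_a.lower(), name_b.lower())
    return score >= threshold

# "BMO CashBack" vs "BMO Cash Back" scores well above 0.9
print(is_probable_duplicate("BMO CashBack Mastercard", "BMO Cash Back Mastercard"))
```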
Set environment variables or edit `.env`:

```bash
SCRAPER_DELAY=5.0       # Slow down requests
LOG_LEVEL=DEBUG         # Detailed logging
SCRAPER_MAX_RETRIES=5   # More retry attempts
```

| Module | Purpose |
|---|---|
| `config.py` | Environment-based configuration |
| `scraper.py` | HTML parsing and data extraction |
| `credit_card_uploader.py` | Supabase database operations |
| `logger_config.py` | Structured logging setup |
| `rate_limiter.py` | Domain-based rate limiting |
| `retry_util.py` | Exponential backoff retry logic |
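For orientation, here is a compact sketch of the two reliability patterns named above; the real `rate_limiter.py` and `retry_util.py` differ in detail:

```python
import time
import random
from urllib.parse import urlparse

_last_request: dict[str, float] = {}

def rate_limit(url: str, delay: float = 2.0) -> None:
    """Sleep so requests to the same domain are at least `delay` seconds apart."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last_request[domain] = time.monotonic()

def retry(func, max_retries: int = 3, base_delay: float = 1.0):
    """Call func(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.random())
```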
```
┌─────────────────┐
│   Web Sources   │
│   (5 websites)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Scrapers     │
│ (BeautifulSoup) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Data Merger   │
│   & Validator   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Duplicate    │
│   Prevention    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Supabase     │
│    Database     │
└─────────────────┘
```
The scraper populates three Supabase tables:

- Cards: `id`, `card_key` (unique), `name`, `issuer`, `reward_program`, `reward_currency`, `point_valuation`, `annual_fee`, `base_reward_rate`, `base_reward_unit`
- Category rewards: `id`, `card_id`, `category`, `multiplier`, `reward_unit`, `description`, `spend_limit`
- Signup bonuses: `id`, `card_id`, `bonus_amount`, `bonus_currency`, `spend_requirement`, `timeframe_days`
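Bonus categories and signup bonuses link back to their card via `card_id`. Fetching one card's category rewards with supabase-py might look like this (both table names are assumptions inferred from the columns above):

```python
import os
from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
client = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_KEY"))

# Look up the card row, then pull its bonus categories via card_id
card = (
    client.table("credit_cards")
    .select("id")
    .eq("card_key", "amex-cobalt")
    .single()
    .execute()
)
rewards = (
    client.table("category_rewards")
    .select("category, multiplier, reward_unit")
    .eq("card_id", card.data["id"])
    .execute()
)
print(rewards.data)
```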
Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and add tests
4. Run the tests (`pytest`)
5. Commit (`git commit -m 'feat: Add amazing feature'`)
6. Push (`git push origin feature/amazing-feature`)
7. Open a Pull Request
```bash
# Install dev dependencies
pip install -r requirements.txt

# Run tests with coverage
pytest --cov=. --cov-report=html

# Check code quality
pylint *.py
```

If you're getting blocked or throttled, increase the delays in `.env`:

```bash
SCRAPER_DELAY=5.0
SCRAPER_MAX_DELAY=30.0
```

To verify your Supabase credentials:

```bash
python -c "from supabase import create_client; import os; from dotenv import load_dotenv; load_dotenv(); print('✅ Connected!' if create_client(os.getenv('SUPABASE_URL'), os.getenv('SUPABASE_KEY')) else '❌ Failed')"
```

Logs are written to `logs/webdatascraper_YYYYMMDD_HHMMSS.log` with detailed error traces.
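Those timestamped files come from a standard logging setup; `logger_config.py` presumably looks something like this sketch:

```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logging(level: str = "INFO") -> logging.Logger:
    """Log to the console and to a timestamped file under logs/."""
    Path("logs").mkdir(exist_ok=True)
    logfile = f"logs/webdatascraper_{datetime.now():%Y%m%d_%H%M%S}.log"
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[logging.FileHandler(logfile), logging.StreamHandler()],
    )
    return logging.getLogger("webdatascraper")
```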
MIT License - see LICENSE file for details.
Built for the Rewards Optimizer project to help Canadians maximize credit card rewards.
Data sources: Ratehub, MoneySense, NerdWallet, CreditCardGenius, GreedyRates.
Tahseen Ahmed
- GitHub: @tahseen137
- Project: WebDataScraper
⭐ Star this repo if you find it useful!