💳 WebDataScraper

Production-ready Python toolkit for scraping Canadian credit card data and uploading to Supabase.

Built to populate the Rewards Optimizer database with comprehensive, accurate credit card information including category rewards, signup bonuses, and point valuations.

Python 3.10+ | License: MIT

✨ Features

  • 🎯 Curated Data - 34+ Canadian credit cards with verified category rewards
  • 🔄 Multi-Source Scraping - Ratehub, MoneySense, NerdWallet, CreditCardGenius, GreedyRates
  • ☁️ Supabase Integration - Direct database upload with duplicate prevention
  • 🛡️ Production Ready - Rate limiting, retry logic, error tracking, structured logging
  • ✅ Tested - 20+ unit tests with coverage reporting
  • ⚙️ Configurable - Environment-based configuration for all settings

🚀 Quick Start

1. Installation

git clone https://github.com/tahseen137/WebDataScraper.git
cd WebDataScraper
pip install -r requirements.txt

2. Configuration

Copy and configure your environment:

cp .env.example .env

Edit .env with your Supabase credentials:

SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_KEY=your-service-role-key

Note: Use the service_role key (not the anon key) from Supabase Dashboard → Settings → API
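
To confirm the credentials load correctly before seeding, the same values can be read with python-dotenv and passed to supabase-py (both are part of the tech stack below). A minimal sketch, using a hypothetical file name check_connection.py:

# check_connection.py -- minimal sketch; loads .env and creates a Supabase client
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()  # reads SUPABASE_URL and SUPABASE_KEY from .env
client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
print("Supabase client created for", os.environ["SUPABASE_URL"])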

3. Seed Database

Upload 34 curated Canadian credit cards:

python seed_cards.py

That's it! Your database now contains production-ready credit card data.

📊 What's Included

34 Canadian Credit Cards

  • American Express (6) - Cobalt, Gold, Platinum, Aeroplan Reserve, SimplyCash
  • BMO (4) - CashBack, Eclipse, AIR MILES, CashBack World Elite
  • CIBC (4) - Dividend, Dividend Infinite, Aventura, Aeroplan
  • Scotiabank (3) - Gold Amex, Momentum, Passport
  • TD (3) - Aeroplan, Cash Back, First Class Travel
  • RBC (3) - Avion, Cash Back, WestJet
  • Plus - Neo, Desjardins, MBNA, National Bank, PC Financial, Rogers, Simplii, Tangerine, Triangle

Complete Data Coverage

Each card includes:

  • ✅ Base reward rates (cashback/points/miles)
  • ✅ Category bonuses (groceries, dining, gas, travel, etc.)
  • ✅ Signup bonuses with requirements
  • ✅ Annual fees and point valuations
  • ✅ Reward program associations
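
As an illustration of that coverage, a single card can be pictured as a plain Python record whose fields mirror the database schema described below. The values here are placeholders, not any real card's terms:

# Illustrative record only -- field names follow the schema below, values are made up.
example_card = {
    "card_key": "example_bank_cashback",  # unique key used for duplicate prevention
    "name": "Example Bank Cash Back Card",
    "issuer": "Example Bank",
    "reward_program": "Cash Back",
    "reward_currency": "CAD",
    "point_valuation": 1.0,               # cents per point (1.0 for straight cashback)
    "annual_fee": 0,
    "base_reward_rate": 0.5,
    "base_reward_unit": "percent",
    "category_rewards": [
        {"category": "groceries", "multiplier": 3.0, "reward_unit": "percent", "spend_limit": 500},
        {"category": "gas", "multiplier": 2.0, "reward_unit": "percent", "spend_limit": None},
    ],
    "signup_bonus": {
        "bonus_amount": 100, "bonus_currency": "CAD",
        "spend_requirement": 1000, "timeframe_days": 90,
    },
}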

πŸ› οΈ Tech Stack

  • Language: Python 3.10+
  • Web Scraping: BeautifulSoup4, Requests, Newspaper3k
  • Database: Supabase (PostgreSQL)
  • Data Processing: Pandas, Jellyfish (fuzzy matching)
  • Testing: Pytest with coverage
  • Configuration: python-dotenv
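
Jellyfish handles the fuzzy matching of card names scraped from different sources. A rough sketch of the idea (not the repository's exact matching logic), using jellyfish's Jaro-Winkler similarity:

# Rough sketch of fuzzy name matching -- not the exact logic used in this repo.
import jellyfish

def looks_like_same_card(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two scraped names as the same card when their similarity is high enough."""
    score = jellyfish.jaro_winkler_similarity(name_a.lower(), name_b.lower())
    return score >= threshold

print(looks_like_same_card("TD Aeroplan Visa Infinite", "TD Aeroplan Visa Infinite Card"))  # True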

📖 Documentation

🧪 Testing

# Run all tests
pytest

# With coverage
pytest --cov=. --cov-report=html

# Specific tests
pytest tests/test_scraper.py -v

πŸ“ Usage Examples

Seed Curated Cards (Recommended)

python seed_cards.py

Scrape Fresh Data

python scrape_workflow.py

Check for Duplicates

python check_duplicates.py

Custom Configuration

Set environment variables or edit .env:

SCRAPER_DELAY=5.0          # Slow down requests
LOG_LEVEL=DEBUG            # Detailed logging
SCRAPER_MAX_RETRIES=5      # More retry attempts
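
In config.py-style modules these variables are typically read once with sensible fallbacks. A minimal sketch of that pattern, assuming the variable names above (the default values are illustrative, not the project's actual defaults):

# Sketch of environment-based configuration -- defaults shown here are illustrative.
import os

from dotenv import load_dotenv

load_dotenv()

SCRAPER_DELAY = float(os.getenv("SCRAPER_DELAY", "2.0"))          # seconds between requests
SCRAPER_MAX_RETRIES = int(os.getenv("SCRAPER_MAX_RETRIES", "3"))  # retry attempts per request
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")                        # handed to the logging setup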

πŸ—οΈ Architecture

Core Modules

  • config.py – Environment-based configuration
  • scraper.py – HTML parsing and data extraction
  • credit_card_uploader.py – Supabase database operations
  • logger_config.py – Structured logging setup
  • rate_limiter.py – Domain-based rate limiting
  • retry_util.py – Exponential backoff retry logic
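
The retry logic follows the standard exponential-backoff pattern. A simplified, self-contained sketch of that pattern (not the actual code in retry_util.py):

# Simplified exponential backoff -- a sketch of the pattern, not retry_util.py itself.
import time

import requests

def fetch_with_retry(url: str, max_retries: int = 3, base_delay: float = 1.0) -> requests.Response:
    """Retry a GET request, doubling the wait after each failed attempt."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...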

Data Flow

┌─────────────────┐
│  Web Sources    │
│  (5 websites)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Scrapers      │
│  (BeautifulSoup)│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Data Merger    │
│  & Validator    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Duplicate      │
│  Prevention     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Supabase      │
│   Database      │
└─────────────────┘
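
Expressed as code, the flow amounts to scrape, merge and validate, de-duplicate, then upload. The skeleton below is illustrative only; the function names and stub bodies are not the repository's actual API:

# Illustrative pipeline skeleton -- hypothetical names, stub stages stand in for the real modules.
from typing import Dict, List

def scrape_source(url: str) -> List[Dict]:
    # In the real project this stage does the BeautifulSoup parsing.
    return [{"card_key": "demo_card", "name": "Demo Card", "source": url}]

def merge_and_validate(batches: List[List[Dict]]) -> List[Dict]:
    merged = [card for batch in batches for card in batch]
    return [card for card in merged if card.get("card_key") and card.get("name")]

def drop_duplicates(cards: List[Dict]) -> List[Dict]:
    seen, unique = set(), []
    for card in cards:
        if card["card_key"] not in seen:
            seen.add(card["card_key"])
            unique.append(card)
    return unique

sources = ["https://example.com/cards-a", "https://example.com/cards-b"]
ready = drop_duplicates(merge_and_validate([scrape_source(url) for url in sources]))
print(f"{len(ready)} card(s) ready for upload")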

πŸ—ƒοΈ Database Schema

The scraper populates three Supabase tables:

cards

  • id, card_key (unique), name, issuer
  • reward_program, reward_currency, point_valuation
  • annual_fee, base_reward_rate, base_reward_unit

category_rewards

  • id, card_id, category, multiplier
  • reward_unit, description, spend_limit

signup_bonuses

  • id, card_id, bonus_amount, bonus_currency
  • spend_requirement, timeframe_days
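
Rows in category_rewards and signup_bonuses point back to cards through card_id. A hedged sketch of how one card and its related rows could be written with supabase-py; the upsert on card_key is an assumption about how duplicate prevention might work here, not necessarily what credit_card_uploader.py actually does:

# Sketch only -- not the exact upload logic in credit_card_uploader.py.
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

card = {"card_key": "example_cashback", "name": "Example Cash Back", "issuer": "Example Bank",
        "annual_fee": 0, "base_reward_rate": 1.0, "base_reward_unit": "percent"}

# Upsert on the unique card_key so re-running the seed does not create duplicate cards.
result = client.table("cards").upsert(card, on_conflict="card_key").execute()
card_id = result.data[0]["id"]

client.table("category_rewards").insert(
    {"card_id": card_id, "category": "groceries", "multiplier": 3.0, "reward_unit": "percent"}
).execute()
client.table("signup_bonuses").insert(
    {"card_id": card_id, "bonus_amount": 100, "bonus_currency": "CAD",
     "spend_requirement": 1000, "timeframe_days": 90}
).execute()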

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and add tests
  4. Run tests (pytest)
  5. Commit (git commit -m 'feat: Add amazing feature')
  6. Push (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Setup

# Install dev dependencies
pip install -r requirements.txt

# Run tests with coverage
pytest --cov=. --cov-report=html

# Check code quality
pylint *.py

πŸ› Troubleshooting

Rate Limited by Websites

Increase delays in .env:

SCRAPER_DELAY=5.0
SCRAPER_MAX_DELAY=30.0

Database Connection Issues

Verify Supabase credentials:

python -c "from supabase import create_client; import os; from dotenv import load_dotenv; load_dotenv(); print('✅ Connected!' if create_client(os.getenv('SUPABASE_URL'), os.getenv('SUPABASE_KEY')) else '❌ Failed')"

Check Logs

Logs are written to logs/webdatascraper_YYYYMMDD_HHMMSS.log with detailed error traces.

📄 License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

Built for the Rewards Optimizer project to help Canadians maximize credit card rewards.

Data sources: Ratehub, MoneySense, NerdWallet, CreditCardGenius, and GreedyRates.

📧 Contact

Tahseen Ahmed


⭐ Star this repo if you find it useful!
