Data pipeline for YogaMatLabApp
This repository contains an automated data pipeline that:
- Scrapes yoga mat product data from 19+ Shopify brand websites
- Normalizes the data to a unified schema
- Aggregates data into a single dataset
- Detects changes (new/removed/updated products)
- Makes data available to YogaMatLabApp via git submodule
- YogaMatLabApp Setup - The following must exist in YogaMatLabApp:
  - Convex brands table with scraping configuration fields
  - Query: convex/brands/getScrapableBrands.ts
- Environment Variables
cp .env.example .env
# Edit .env and set your CONVEX_URL
npm install
# Run each step individually
npm run fetch # Phase 1: Fetch products.json from brands
npm run enrich # Phase 1.5: Optional product-page enrichment (e.g. "Core Features")
npm run normalize # Phase 2: Transform to unified schema (coming soon)
npm run aggregate # Phase 3: Combine all brands (coming soon)
npm run detect-changes # Phase 4: Detect changes (coming soon)
# Or run the full pipeline
npm run pipeline
See CLAUDE.md for detailed architecture and implementation notes.
Dimension parsing (including round mat diameter and canonical dimensionOptions) is documented in docs/DIMENSIONS.md.
Some brands render important fields on the product page (accordion/metafields) that do not appear in products.json. Optional enrichment is documented in docs/ENRICHMENT.md.
Convex brands → Extract   → Enrich         → Normalize        → Aggregate        → Detect Changes
                ↓           ↓                ↓                  ↓                  ↓
                data/raw/   data/enriched/   data/normalized/   data/aggregated/   data/changes/
YogaMatLabData/
├── config/ # Configuration files
├── scripts/ # Pipeline scripts
│ ├── get-brands-from-convex.ts
│ ├── enrich-data.ts
│ └── lib/ # Shared utilities
│ ├── shopify-scraper.ts
│ ├── image-downloader.ts
│ └── logger.ts
├── data/ # Extracted and processed data
│ ├── raw/{date}/ # Daily raw extractions
│ ├── enriched/{date}/ # Optional product-page enrichment
│ ├── normalized/{date}/
│ ├── aggregated/{date}/
│ └── changes/
└── logs/ # Pipeline execution logs
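As an illustration of the dated layout above, here is a minimal sketch of how a stage's output directory could be derived from a run date. The helper names (`dateStamp`, `stageDir`) are hypothetical, and the `YYYY-MM-DD` form of `{date}` is an assumption; the actual scripts may format dates differently.

```typescript
// Hypothetical helpers: names and date format are assumptions, not the repo's actual code.
type Stage = "raw" | "enriched" | "normalized" | "aggregated";

/** Format a Date as YYYY-MM-DD in UTC (assumed format for {date}). */
function dateStamp(d: Date): string {
  return d.toISOString().slice(0, 10);
}

/** Build the output directory for one stage and run date. */
function stageDir(stage: Stage, runDate: Date): string {
  return `data/${stage}/${dateStamp(runDate)}`;
}

// Example: the raw extraction directory for 15 January 2025.
console.log(stageDir("raw", new Date(Date.UTC(2025, 0, 15))));
// prints "data/raw/2025-01-15"
```

Using UTC for the stamp keeps directory names stable regardless of the machine's local time zone.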
- Project setup (package.json, tsconfig.json)
- Logger utility
- JSON fetcher (fetch-products-json.ts)
- Brand orchestrator (get-brands-from-convex.ts)
- Convex integration
- Refactored to use products.json (no browser automation!)
- Field mapping configuration
- Field mapper utility (extracts specs from descriptions)
- Normalize script (Shopify → YogaMat schema)
- Aggregate script (combines all brands + stats)
- Detect changes script (tracks new/removed/changed products)
- GitHub Actions workflow (daily at 2 AM UTC)
- Latest symlinks updater
- Automatic commits with changeset summary
- Failure notifications (creates GitHub issues)
- Image downloader (TODO)
- Git submodule setup in YogaMatLabApp (see INTEGRATION_INSTRUCTIONS.md)
- Convex bulk upsert mutation (in YogaMatLabApp)
- Import script for Convex (in YogaMatLabApp)
- Complete README
- Commit message generator
Brands are configured in YogaMatLabApp's Convex brands table:
{
  name: string
  slug: string
  website: string
  scrapingEnabled: boolean
  shopifyCollectionUrl: string | null // e.g., "/collections/yoga-mats"
  isShopify: boolean
  rateLimit: {
    delayBetweenProducts: number // default: 500ms
    delayBetweenPages: number // default: 1000ms
  }
}
The pipeline runs automatically every day at 2 AM UTC via GitHub Actions.
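For reference, the brand fields listed above can be written as a TypeScript interface and used to pick brands and build a products.json URL. This is only a sketch: `BrandConfig`, `scrapableBrands`, and `productsJsonUrl` are hypothetical names, filtering on both `scrapingEnabled` and `isShopify` is an assumption about how the pipeline selects brands, and the `limit`/`page` query parameters follow the commonly used Shopify storefront pagination rather than anything confirmed by this repository.

```typescript
// Hypothetical type mirroring the brand fields described in this README.
interface BrandConfig {
  name: string;
  slug: string;
  website: string;
  scrapingEnabled: boolean;
  shopifyCollectionUrl: string | null;
  isShopify: boolean;
  rateLimit: { delayBetweenProducts: number; delayBetweenPages: number };
}

/** Keep only brands that are enabled and on Shopify (assumed selection rule). */
function scrapableBrands(brands: BrandConfig[]): BrandConfig[] {
  return brands.filter((b) => b.scrapingEnabled && b.isShopify);
}

/** Build a products.json URL, scoped to the collection when one is configured. */
function productsJsonUrl(brand: BrandConfig, page = 1): string {
  const base = brand.website.replace(/\/+$/, ""); // drop trailing slashes
  const collection = (brand.shopifyCollectionUrl ?? "").replace(/\/+$/, "");
  return `${base}${collection}/products.json?limit=250&page=${page}`;
}

// Example brand (made-up values for illustration only).
const example: BrandConfig = {
  name: "Example Mats",
  slug: "example-mats",
  website: "https://example.com/",
  scrapingEnabled: true,
  shopifyCollectionUrl: "/collections/yoga-mats",
  isShopify: true,
  rateLimit: { delayBetweenProducts: 500, delayBetweenPages: 1000 },
};

console.log(productsJsonUrl(example));
// prints "https://example.com/collections/yoga-mats/products.json?limit=250&page=1"
```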
Set these in your GitHub repository settings (Settings → Secrets and variables → Actions):
- CONVEX_URL - Your Convex deployment URL (e.g., https://unique-dachshund-712.convex.cloud)
- PAT_TOKEN (optional) - Personal Access Token for cross-repo commits (if integrating with YogaMatLabApp)
You can manually trigger the workflow from the Actions tab:
- Go to Actions → Daily Product Extraction
- Click "Run workflow"
- Select branch and run
- Fetches products from all enabled brands
- Normalizes and aggregates data
- Detects changes from previous day
- Updates latest/ symlinks
- Commits results with changeset summary
- Creates GitHub issue if pipeline fails
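The change-detection step in the list above can be sketched as a comparison of two product snapshots. This is a minimal illustration, not the repository's detect-changes script: the `Product` and `Changeset` types and the use of `slug` as the product identifier are assumptions, and comparing with `JSON.stringify` is a simple stand-in that is sensitive to key order.

```typescript
// Hypothetical types: the real schema and identifier field may differ.
interface Product {
  slug: string;
  [field: string]: unknown;
}

interface Changeset {
  added: string[];
  removed: string[];
  updated: string[];
}

/** Index products by slug for quick lookup. */
function indexBySlug(products: Product[]): Map<string, Product> {
  const index = new Map<string, Product>();
  for (const p of products) index.set(p.slug, p);
  return index;
}

/** Compare yesterday's and today's snapshots and list what changed. */
function detectChanges(previous: Product[], current: Product[]): Changeset {
  const prev = indexBySlug(previous);
  const curr = indexBySlug(current);
  const changes: Changeset = { added: [], removed: [], updated: [] };

  curr.forEach((product, slug) => {
    const before = prev.get(slug);
    if (before === undefined) {
      changes.added.push(slug);
    } else if (JSON.stringify(before) !== JSON.stringify(product)) {
      // Naive deep comparison; assumes both snapshots serialize keys in the same order.
      changes.updated.push(slug);
    }
  });
  prev.forEach((_product, slug) => {
    if (!curr.has(slug)) changes.removed.push(slug);
  });
  return changes;
}

const yesterday: Product[] = [{ slug: "a", price: 1 }, { slug: "b", price: 2 }];
const today: Product[] = [{ slug: "a", price: 3 }, { slug: "c", price: 4 }];
console.log(detectChanges(yesterday, today));
// prints { added: [ 'c' ], removed: [ 'b' ], updated: [ 'a' ] }
```

A changeset in this shape is also enough to produce a short summary line (counts of added, removed, and updated products) for a commit message.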
Make sure you've created a .env file with your Convex deployment URL:
CONVEX_URL=https://your-deployment.convex.cloud
Ensure the api.brands.getScrapableBrands query exists in YogaMatLabApp's Convex functions.
Some brands may block automated requests. Adjust the rateLimit values in the Convex brands table or check if the brand requires special headers.
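To show how a `delayBetweenPages` value might be applied, here is a minimal sketch of a paginated fetch loop that pauses between pages. The function names (`sleep`, `fetchAllPages`), the injected `fetchPage` callback, and the `maxPages` safety cap are all hypothetical; the real scraper may structure this differently.

```typescript
/** Resolve after the given number of milliseconds. */
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Fetch pages 1, 2, 3, ... until an empty page is returned,
 * waiting delayBetweenPages ms after each non-empty page.
 * fetchPage is supplied by the caller (e.g. a wrapper around an HTTP request).
 */
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  delayBetweenPages: number,
  maxPages = 50, // assumed safety cap to avoid looping forever
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const items = await fetchPage(page);
    if (items.length === 0) break; // an empty page marks the end
    all.push(...items);
    await sleep(delayBetweenPages);
  }
  return all;
}

// Usage with a fake two-page source instead of a real network call.
const fakePages: Record<number, string[]> = { 1: ["mat-a", "mat-b"], 2: ["mat-c"] };
fetchAllPages(async (page) => fakePages[page] ?? [], 10).then((items) =>
  console.log(items.length),
);
// prints 3
```

Passing the page fetcher in as a callback keeps the delay logic testable without network access; raising the delay for a brand that blocks requests then only requires changing its rateLimit values.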
For detailed development guidance, see CLAUDE.md.
To test extraction on a single brand, temporarily modify the query in Convex or filter in the extraction script.
All extraction logs are saved to logs/{date}.log with timestamps and color-coded output.
This repository generates data that is consumed by YogaMatLabApp. See INTEGRATION_INSTRUCTIONS.md for complete setup instructions.
- Add as Submodule in YogaMatLabApp:
  git submodule add https://github.com/productStripesAdmin/YogaMatLabData.git data/external
- Create Import Script in YogaMatLabApp to load data/external/data/aggregated/latest/all-products.json into Convex
- Daily Updates:
  npm run update-data # Pulls latest data and imports to Convex
The data pipeline runs automatically daily at 2 AM UTC. YogaMatLabApp can pull and import the latest data whenever needed.
- YogaMatLabApp - Main application
- DATA_PIPELINE.md - Detailed implementation plan
- INTEGRATION_INSTRUCTIONS.md - Integration guide for YogaMatLabApp
Private repository