YogaMatLabData

Data pipeline for YogaMatLabApp

Overview

This repository contains an automated data pipeline that:

Scrapes yoga mat product data from 19+ Shopify brand websites
Normalizes the data to a unified schema
Aggregates data into a single dataset
Detects changes (new/removed/updated products)
Makes data available to YogaMatLabApp via git submodule
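
The change-detection step above can be pictured as a diff between two daily snapshots. The following is a minimal sketch, not the repository's actual script: the ProductSnapshot shape, the compared fields, and the detectChanges name are all assumptions made for illustration.

```typescript
// Hypothetical product shape; the real unified schema is defined in this repo.
interface ProductSnapshot {
  id: string;
  title: string;
  price: number;
}

interface ChangeSet {
  added: string[];
  removed: string[];
  updated: string[];
}

// Compare yesterday's and today's products by id.
function detectChanges(
  previous: ProductSnapshot[],
  current: ProductSnapshot[],
): ChangeSet {
  const prevById = new Map<string, ProductSnapshot>();
  for (const p of previous) prevById.set(p.id, p);

  const currentIds = new Set<string>();
  const changes: ChangeSet = { added: [], removed: [], updated: [] };

  for (const product of current) {
    currentIds.add(product.id);
    const before = prevById.get(product.id);
    if (before === undefined) {
      changes.added.push(product.id);
    } else if (before.title !== product.title || before.price !== product.price) {
      // Only the illustrative fields above are compared.
      changes.updated.push(product.id);
    }
  }

  // Anything present yesterday but missing today counts as removed.
  for (const p of previous) {
    if (!currentIds.has(p.id)) changes.removed.push(p.id);
  }
  return changes;
}
```

Keying on a stable product id keeps the diff linear in the number of products, since each snapshot is walked once.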

Quick Start

Prerequisites

YogaMatLabApp Setup - The following must exist in YogaMatLabApp:
- Convex brands table with scraping configuration fields
- Query: convex/brands/getScrapableBrands.ts

Environment Variables

cp .env.example .env
# Edit .env and set your CONVEX_URL

Installation

npm install

Running the Pipeline

# Run each step individually
npm run fetch          # Phase 1: Fetch products.json from brands
npm run enrich         # Phase 1.5: Optional product-page enrichment (e.g. "Core Features")
npm run normalize      # Phase 2: Transform to unified schema
npm run aggregate      # Phase 3: Combine all brands
npm run detect-changes # Phase 4: Detect changes

# Or run the full pipeline
npm run pipeline

Architecture

See CLAUDE.md for detailed architecture and implementation notes.

Dimensions & Options

Dimension parsing (including round mat diameter and canonical dimensionOptions) is documented in docs/DIMENSIONS.md.

Product Page Enrichment

Some brands render important fields on the product page (accordion/metafields) that do not appear in products.json. Optional enrichment is documented in docs/ENRICHMENT.md.

Data Flow

Convex brands → Extract   → Enrich         → Normalize        → Aggregate        → Detect Changes
                ↓           ↓                ↓                  ↓                  ↓
                data/raw/   data/enriched/   data/normalized/   data/aggregated/   data/changes/

Directory Structure

YogaMatLabData/
├── config/              # Configuration files
├── scripts/             # Pipeline scripts
│   ├── get-brands-from-convex.ts
│   ├── enrich-data.ts
│   └── lib/            # Shared utilities
│       ├── shopify-scraper.ts
│       ├── image-downloader.ts
│       └── logger.ts
├── data/               # Extracted and processed data
│   ├── raw/{date}/     # Daily raw extractions
│   ├── enriched/{date}/ # Optional product-page enrichment
│   ├── normalized/{date}/
│   ├── aggregated/{date}/
│   └── changes/
└── logs/               # Pipeline execution logs
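
The {date} directories in the tree above could be resolved with a small helper like the one below. This is a sketch under assumptions: the YYYY-MM-DD (UTC) date format and the stageDir and toDateStamp names are hypothetical and not taken from the pipeline scripts.

```typescript
// Stages that have a dated subdirectory in the tree above.
type Stage = "raw" | "enriched" | "normalized" | "aggregated";

// Format a Date as YYYY-MM-DD in UTC. The exact {date} format is an assumption.
function toDateStamp(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Build the relative output directory for one stage and one run date.
function stageDir(stage: Stage, date: Date): string {
  return `data/${stage}/${toDateStamp(date)}/`;
}
```

For example, stageDir("raw", new Date()) yields the raw extraction directory for today's run under this assumed naming.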

Implementation Status

Phase 1: Fetch Products ✅

Project setup (package.json, tsconfig.json)
Logger utility
JSON fetcher (fetch-products-json.ts)
Brand orchestrator (get-brands-from-convex.ts)
Convex integration
Refactored to use products.json (no browser automation!)

Phase 2: Data Processing ✅

Field mapping configuration
Field mapper utility (extracts specs from descriptions)
Normalize script (Shopify → YogaMat schema)
Aggregate script (combines all brands + stats)
Detect changes script (tracks new/removed/changed products)
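
As a rough illustration of the aggregate step listed above (combining all brands plus stats), the sketch below merges per-brand product lists and counts products per brand. The NormalizedProduct fields and the aggregate function name are assumptions; the real schema and statistics may differ.

```typescript
// Hypothetical minimal product shape, for illustration only.
interface NormalizedProduct {
  brandSlug: string;
  id: string;
  name: string;
}

interface AggregateResult {
  totalProducts: number;
  productsByBrand: Record<string, number>;
  products: NormalizedProduct[];
}

// Merge one array of products per brand into a single dataset with counts.
function aggregate(perBrand: NormalizedProduct[][]): AggregateResult {
  const products: NormalizedProduct[] = [];
  const productsByBrand: Record<string, number> = {};

  for (const brandProducts of perBrand) {
    for (const product of brandProducts) {
      products.push(product);
      productsByBrand[product.brandSlug] =
        (productsByBrand[product.brandSlug] ?? 0) + 1;
    }
  }
  return { totalProducts: products.length, productsByBrand, products };
}
```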

Phase 3: Automation ✅

GitHub Actions workflow (daily at 2 AM UTC)
Latest symlinks updater
Automatic commits with changeset summary
Failure notifications (creates GitHub issues)
Image downloader (TODO)

Phase 4: Integration 🔄

Git submodule setup in YogaMatLabApp (see INTEGRATION_INSTRUCTIONS.md)
Convex bulk upsert mutation (in YogaMatLabApp)
Import script for Convex (in YogaMatLabApp)

Phase 5: Documentation

Complete README
Commit message generator

Configuration

Brand Configuration (in YogaMatLabApp Convex)

Brands are configured in YogaMatLabApp's Convex brands table:

{
  name: string
  slug: string
  website: string
  scrapingEnabled: boolean
  shopifyCollectionUrl: string | null  // e.g., "/collections/yoga-mats"
  isShopify: boolean
  rateLimit: {
    delayBetweenProducts: number      // default: 500ms
    delayBetweenPages: number         // default: 1000ms
  }
}
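
To show how these fields might drive a fetch loop, here is a minimal sketch that pages through a brand's products.json and waits delayBetweenPages milliseconds between requests. It is not the repository's fetcher: the fetchAllPages name, the injected fetchPage function, the maxPages cap, and the limit and page query parameters are assumptions.

```typescript
interface RateLimit {
  delayBetweenProducts: number; // milliseconds
  delayBetweenPages: number; // milliseconds
}

// Minimal slice of a products.json response; real payloads carry more fields.
interface ProductsPage {
  products: unknown[];
}

// Injected so the loop can be exercised without network access.
type PageFetcher = (url: string) => Promise<ProductsPage>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchAllPages(
  website: string,
  shopifyCollectionUrl: string | null,
  rateLimit: RateLimit,
  fetchPage: PageFetcher,
  maxPages = 50, // hypothetical safety cap
): Promise<unknown[]> {
  // Strip trailing slashes so the collection path joins cleanly.
  const base = website.replace(/\/+$/, "") + (shopifyCollectionUrl ?? "");
  const all: unknown[] = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `${base}/products.json?limit=250&page=${page}`;
    const { products } = await fetchPage(url);
    if (products.length === 0) break; // an empty page ends the listing
    all.push(...products);
    // Pause before requesting the next page.
    await sleep(rateLimit.delayBetweenPages);
  }
  return all;
}
```

Passing the fetcher in as a parameter keeps the pacing logic separate from the HTTP client, so a brand that needs special headers only needs a different fetchPage.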

GitHub Actions Automation

The pipeline runs automatically every day at 2 AM UTC via GitHub Actions.

Required Secrets

Set these in your GitHub repository settings (Settings → Secrets and variables → Actions):

CONVEX_URL - Your Convex deployment URL (e.g., https://unique-dachshund-712.convex.cloud)
PAT_TOKEN (optional) - Personal Access Token for cross-repo commits (if integrating with YogaMatLabApp)

Manual Trigger

You can manually trigger the workflow from the Actions tab:

Go to Actions → Daily Product Extraction
Click "Run workflow"
Select branch and run

What Happens Automatically

Fetches products from all enabled brands
Normalizes and aggregates data
Detects changes from previous day
Updates latest/ symlinks
Commits results with changeset summary
Creates GitHub issue if pipeline fails

Troubleshooting

"CONVEX_URL environment variable is not set"

Make sure you've created a .env file with your Convex deployment URL:

CONVEX_URL=https://your-deployment.convex.cloud
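
A startup check along these lines would produce the error above. This is only a sketch: the requireConvexUrl name is hypothetical, and it assumes the values from .env have already been loaded into the environment object passed in.

```typescript
// Return the Convex deployment URL or fail fast with the documented message.
function requireConvexUrl(env: Record<string, string | undefined>): string {
  const url = env["CONVEX_URL"];
  if (!url) {
    throw new Error("CONVEX_URL environment variable is not set");
  }
  return url;
}
```

In a Node script this would typically be called once at startup with process.env, before any request to Convex is made.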

"Failed to fetch brands from Convex"

Ensure the api.brands.getScrapableBrands query exists in YogaMatLabApp's Convex functions.

Rate Limiting / 429 Errors / 403 Forbidden

Some brands may throttle or block automated requests. Increase the rateLimit delay values for that brand in the Convex brands table, or check whether the brand requires special request headers.
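
One common way to handle 429 responses is to retry with exponential backoff. The sketch below is an illustration, not something the pipeline is known to do: the withRetry name, the attempt count, and the base delay are assumptions.

```typescript
// Anything exposing an HTTP status code, such as a fetch Response.
interface StatusResponse {
  status: number;
}

// Retry a request while it returns 429, doubling the wait each time.
async function withRetry<T extends StatusResponse>(
  request: () => Promise<T>,
  maxAttempts = 3, // total requests, including the first
  baseDelayMs = 1000,
): Promise<T> {
  let response = await request();
  for (
    let attempt = 1;
    attempt < maxAttempts && response.status === 429;
    attempt++
  ) {
    // Waits baseDelayMs, then 2x, then 4x, and so on.
    const delayMs = baseDelayMs * 2 ** (attempt - 1);
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    response = await request();
  }
  // May still be a 429 if every attempt was rate limited.
  return response;
}
```

A 403 is not retried here, since a block usually does not clear by waiting; the caller still receives the response and can log it.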

Development

For detailed development guidance, see CLAUDE.md.

Testing Individual Brands

To test extraction on a single brand, temporarily modify the Convex query or add a filter in the extraction script.

Logs

All extraction logs are saved to logs/{date}.log with timestamps and color-coded output.

Integration with YogaMatLabApp

This repository generates data that is consumed by YogaMatLabApp. See INTEGRATION_INSTRUCTIONS.md for complete setup instructions.

Quick Overview

Add as Submodule in YogaMatLabApp:

git submodule add https://github.com/productStripesAdmin/YogaMatLabData.git data/external

Create Import Script in YogaMatLabApp to load data/external/data/aggregated/latest/all-products.json into Convex

Daily Updates:

npm run update-data  # Pulls latest data and imports to Convex

The data pipeline runs automatically daily at 2 AM UTC. YogaMatLabApp can pull and import the latest data whenever needed.

Related Repositories

YogaMatLabApp - Main application
DATA_PIPELINE.md - Detailed implementation plan
INTEGRATION_INSTRUCTIONS.md - Integration guide for YogaMatLabApp

License

Private repository
