Skip to content

This project implements a modern SQL data warehouse for retail‑style sales analytics using a **Medallion Architecture (Bronze → Silver → Gold)** inside a relational database.

License

Notifications You must be signed in to change notification settings

Arjun-M-101/Retail-Sales-SQL-Data-Warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation


🏬 Retail Sales SQL Data Warehouse (SQL Server / MySQL)

🚀 Overview

This project implements a modern SQL data warehouse for retail‑style sales analytics using a Medallion Architecture (Bronze → Silver → Gold) inside a relational database.

It consolidates CRM and ERP source data into a clean star schema with:

  • fact_sales – line‑level sales transactions
  • dim_customers – unified customer view from CRM + ERP
  • dim_products – enriched product view with categories and attributes

The pipeline is implemented entirely in SQL:

  • Bronze layer → Raw ingested tables matching CRM/ERP CSV source layout
  • Silver layer → Cleansed, standardized, conformed tables with data quality checks
  • Gold layer → Business‑ready views and tables modeled as a star schema for reporting

This repo is ideal as a portfolio‑grade Data Engineering + SQL DWH project.


🏗️ Architecture

Conceptually, the warehouse follows Medallion / Star‑Schema best practices:

  • Bronze (Raw landing)
    • Tables under bronze schema mirror source CRM/ERP CSVs (e.g., customers, products, sales, locations).
  • Silver (Cleansed & conformed)
    • Tables under silver schema apply:
      • ID normalization across systems (CRM vs ERP customer IDs).
      • Date validation and coercion.
      • Standardization of domains (gender, marital status, categories).
      • Referential checks between related entities.
  • Gold (Star schema)
    • Tables/views under gold schema expose:
      • dim_customers – descriptive customer attributes.
      • dim_products – product, category, and cost info.
      • fact_sales – sales transactions referencing dimensions via surrogate keys.

Architecture Diagram:

CRM & ERP CSVs
     │
     ▼
  Bronze (SQL)
  - bronze.crm_cust_info
  - bronze.crm_prd_info
  - bronze.crm_sls_details
  - bronze.erp_cust_az12
  - bronze.erp_loc_a101
  - bronze.erp_px_cat_g1v2
     │  (ELT with T‑SQL)
     ▼
  Silver (SQL)
  - silver.crm_cust_info
  - silver.crm_prd_info
  - silver.crm_sls_details
  - silver.erp_cust_az12
  - silver.erp_loc_a101
  - silver.erp_px_cat_g1v2
     │  (joins, DQC, SCD logic)
     ▼
  Gold (Star Schema)
  - gold.dim_customers
  - gold.dim_products
  - gold.fact_sales

📂 Project Structure

Retail_Sales_SQL_DWH/
│
├── docs/
│   └── dwh_architecture.png          # High-level Medallion + star schema diagram (optional)
│
├── scripts/
│   ├── DWH_Create_Databases_-_Schemas.sql   # Create DB + schemas (bronze/silver/gold)
│   ├── DWH_Bronze_Create_Tables.sql         # Bronze table DDL
│   ├── DWH_Bronze_Insert_Stored_Procedure.sql  # Ingest CSV → Bronze
│   │
│   ├── DWH_Silver_Create_Tables.sql         # Silver table DDL
│   ├── DWH_Silver_Insert_Stored_Procedure.sql # Bronze → Silver load
│   ├── DWH_Silver_DQC.sql                   # Global Silver data quality checks
│   ├── DWH_Silver_CRM_Table1_DQC.sql        # CRM customers DQC
│   ├── DWH_Silver_CRM_Table2_DQC.sql        # CRM products DQC
│   ├── DWH_Silver_CRM_Table3_DQC.sql        # CRM sales DQC
│   ├── DWH_Silver_ERP_Table1_DQC.sql        # ERP customers DQC
│   ├── DWH_Silver_ERP_Table2_DQC.sql        # ERP locations DQC
│   ├── DWH_Silver_ERP_Table3_DQC.sql        # ERP category DQC
│   │
│   ├── DWH_Gold_Dimension_Customer_Prework.sql  # Customer dim logic + DQ
│   ├── DWH_Gold_Dimension_Product_Prework.sql   # Product dim logic + DQ
│   ├── DWH_Gold_Fact_Sales_Prework.sql          # Fact sales logic + DQ
│   ├── DWH_Gold_Dimension_Customer.sql          # Final dim_customers view/table
│   ├── DWH_Gold_Dimension_Product.sql           # Final dim_products view/table
│   ├── DWH_Gold_Fact_Sales.sql                  # Final fact_sales view/table
│
├── dataset/
│   ├── crm/      # CRM CSVs (cust_info, prd_info, sls_details)
│   └── erp/      # ERP CSVs (CUST_AZ12, LOC_A101, PX_CAT_G1V2)
│
└── README.md

🛠️ Prerequisites & Setup

1. Database Engine

  • SQL Server (recommended for T‑SQL):
    • Install SQL Server (Developer/Express) and SQL Server Management Studio (SSMS), or use Docker.
  • Alternatively, adjust syntax for PostgreSQL/MySQL if needed.

2. Clone the Repository

git clone https://github.com/Arjun-M-101/Retail_Sales_SQL_DWH.git
cd Retail_Sales_SQL_DWH

3. Create Database & Schemas

In SSMS (or your SQL client), run:

-- Create database and schemas
:r scripts/DWH_Create_Databases_-_Schemas.sql

This script should:

  • Create a database (e.g., RetailDWH).
  • Create schemas: bronze, silver, gold.

4. Create Bronze Tables

:r scripts/DWH_Bronze_Create_Tables.sql

This creates Bronze tables such as:

  • bronze.crm_cust_info
  • bronze.crm_prd_info
  • bronze.crm_sls_details
  • bronze.erp_cust_az12
  • bronze.erp_loc_a101
  • bronze.erp_px_cat_g1v2

5. Load CSV Data into Bronze

Place your CSVs under dataset/crm and dataset/erp, then use either:

  • The stored procedure in DWH_Bronze_Insert_Stored_Procedure.sql, or
  • BULK INSERT / OPENROWSET manually.

Example (SQL Server):

:r scripts/DWH_Bronze_Insert_Stored_Procedure.sql

EXEC bronze.sp_load_bronze_data
    @crm_cust_path = 'C:\path\to\dataset\crm\crm_cust_info.csv',
    @crm_prd_path  = 'C:\path\to\dataset\crm\crm_prd_info.csv',
    @crm_sls_path  = 'C:\path\to\dataset\crm\crm_sls_details.csv',
    @erp_cust_path = 'C:\path\to\dataset\erp\CUST_AZ12.csv',
    @erp_loc_path  = 'C:\path\to\dataset\erp\LOC_A101.csv',
    @erp_cat_path  = 'C:\path\to\dataset\erp\PX_CAT_G1V2.csv';

🔄 Data Flow & Transformations

🥉 Bronze Layer – Raw Landing

  • Stores data “as‑is” from CRM & ERP CSVs.
  • No business logic; only basic typing and schema alignment.
  • Purpose: immutable source of truth and easy re‑load of Silver if rules change.

🥈 Silver Layer – Cleansed & Conformed

Silver tables fix data quality issues and align CRM/ERP into a consistent model.

1. Customer Conformance & DQC

Core script: DWH_Silver_ERP_Table1_DQC.sql.

Key steps:

  • ID normalization between ERP and CRM:

    • ERP customer IDs like NASAW0001 are transformed to AW0001 so they match CRM cst_key.
    • Logic:
      CASE WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid,4,LEN(cid))
           ELSE cid
      END AS cid
  • Check matches vs non‑matches:

    • cid IN (SELECT DISTINCT cst_key FROM silver.crm_cust_info) → matched customers.
    • cid NOT IN (...) → anomalies to investigate.
  • Date validation (bdate):

    • Flag invalid dates: null, non‑dates, < 1900-01-01 or > GETDATE().
    • Coerce invalid dates to NULL in the transformed Silver table.
  • Gender standardization:

    • Normalize M/F/Male/Female/NULL/'' into a clean domain:
      CASE WHEN UPPER(TRIM(gen)) IN ('M','MALE') THEN 'Male'
           WHEN UPPER(TRIM(gen)) IN ('F','FEMALE') THEN 'Female'
           ELSE 'n/a'
      END AS gen

Silver DQC scripts also verify:

  • No broken links between Silver CRM and Silver ERP tables.
  • Clean distinct value sets for gender, marital status, country, etc.

🥇 Gold Layer – Star Schema (Analytics‑Ready)

Gold scripts (*_Prework.sql + final *.sql) model the star schema.

1. Customer Dimension – gold.dim_customers

Core: DWH_Gold_Dimension_Customer_Prework.sql & DWH_Gold_Dimension_Customer.sql.

  • Sources:

    • silver.crm_cust_info – primary customer attributes.
    • silver.erp_cust_az12 – birth date & ERP gender.
    • silver.erp_loc_a101 – country/location.
  • Join logic:

    FROM silver.crm_cust_info AS C1
    LEFT JOIN silver.erp_cust_az12 AS C2 ON C1.cst_key = C2.cid
    LEFT JOIN silver.erp_loc_a101 AS C3 ON C1.cst_key = C3.cid
  • Gender conflict resolution:

    CASE WHEN C1.cst_gndr != 'n/a' THEN C1.cst_gndr
         ELSE COALESCE(C2.gen, 'n/a')
    END AS gender
  • Final schema (example):

    SELECT
        ROW_NUMBER() OVER (ORDER BY C1.cst_id) AS customer_key, -- surrogate key
        C1.cst_id       AS customer_id,
        C1.cst_key      AS customer_number,
        C1.cst_firstname AS first_name,
        C1.cst_lastname  AS last_name,
        C3.cntry         AS country,
        C1.cst_marital_status AS marital_status,
        CASE WHEN C1.cst_gndr != 'n/a' THEN C1.cst_gndr
             ELSE COALESCE(C2.gen, 'n/a') END AS gender,
        C2.bdate         AS birth_date,
        C1.cst_create_date AS create_date
    INTO gold.dim_customers
  • Gold DQC for customers:

    • Distinct checks on country, marital_status, gender to inspect domains.

2. Product Dimension – gold.dim_products

Core: DWH_Gold_Dimension_Product_Prework.sql & DWH_Gold_Dimension_Product.sql.

  • Sources:

    • silver.crm_prd_info – product master with history (prd_start_dt, prd_end_dt).
    • silver.erp_px_cat_g1v2 – category & maintenance attributes.
  • Handling history:

    • Keep only current products:
      WHERE P1.prd_end_dt IS NULL
  • Enrichment & renaming:

    SELECT
        ROW_NUMBER() OVER(ORDER BY P1.prd_start_dt, P1.prd_key) AS product_key,
        P1.prd_id   AS product_id,
        P1.prd_key  AS product_number,
        P1.prd_nm   AS product_name,
        P1.cat_id   AS category_id,
        P2.cat      AS category,
        P2.subcat   AS subcategory,
        P2.maintenance AS maintenance,
        P1.prd_cost AS cost,
        P1.prd_line AS product_line,
        P1.prd_start_dt AS start_date
    INTO gold.dim_products
  • Gold DQC for products:

    • Check product_id, product_number, product_name for duplicates to ensure unique keys.

3. Sales Fact – gold.fact_sales

Core: DWH_Gold_Fact_Sales_Prework.sql & DWH_Gold_Fact_Sales.sql.

  • Grain:

    • One row per order line item: order_number × product × customer.
  • Sources:

    • silver.crm_sls_details as Sl (fact base).
    • gold.dim_products as Pr.
    • gold.dim_customers as Cu.
  • Join to dimensions via business keys then keep surrogate keys:

    SELECT
        Sl.sls_ord_num AS order_number,
        Pr.product_key AS product_key,
        Cu.customer_key AS customer_key,
        Sl.sls_order_dt AS order_date,
        Sl.sls_ship_dt  AS ship_date,
        Sl.sls_due_dt   AS due_date,
        Sl.sls_sales    AS sales_amount,
        Sl.sls_quantity AS quantity,
        Sl.sls_price    AS price
    INTO gold.fact_sales
    FROM silver.crm_sls_details AS Sl
    LEFT JOIN gold.dim_products  AS Pr ON Sl.sls_prd_key = Pr.product_number
    LEFT JOIN gold.dim_customers AS Cu ON Sl.sls_cust_id = Cu.customer_id;
  • Gold DQC for fact:

    SELECT *
    FROM gold.fact_sales AS F
    LEFT JOIN gold.dim_products  AS P ON F.product_key = P.product_key
    LEFT JOIN gold.dim_customers AS C ON F.customer_key = C.customer_key
    WHERE P.product_key IS NULL
        OR C.customer_key IS NULL;
    -- Should return NO ROWS → full referential integrity

✅ Key Takeaways

This project demonstrates:

  • Implementing Medallion Architecture (Bronze/Silver/Gold) purely in SQL Server / RDBMS.
  • Designing a star schema with a central fact table (fact_sales) and surrounding dimensions (dim_customers, dim_products).
  • Performing data quality checks:
    • ID normalization across systems.
    • Date validation and coercion.
    • Domain standardization for categorical fields.
    • Referential integrity validations between fact and dimensions.
  • Using window functions (ROW_NUMBER) to generate surrogate keys for conformed dimensions.
  • Building SQL‑based ETL/ELT pipelines from CSV → Bronze → Silver → Gold.

⚖️ Trade‑offs & Design Decisions

  • SQL DWH vs Spark / cloud lakehouse
    Chose a pure SQL implementation (no Spark) to emphasize strong relational modeling and T‑SQL skills. The same Medallion and star‑schema concepts can later be ported to a lakehouse (e.g., Delta/S3).

  • CRM as primary source vs ERP
    Where CRM and ERP disagree (e.g., gender), the pipeline prefers CRM and uses ERP as a fallback, based on a realistic business rule for “system of record.”

  • Current vs historic in dimensions
    Kept current products in dim_products (via prd_end_dt IS NULL) while retaining full history in fact_sales so historic sales remain analyzable without bloating the dimension.

  • Surrogate keys in fact
    The fact table stores product_key and customer_key instead of natural IDs, enabling robust SCD support and decoupling facts from upstream operational ID changes.


About

This project implements a modern SQL data warehouse for retail‑style sales analytics using a **Medallion Architecture (Bronze → Silver → Gold)** inside a relational database.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages