This project implements a modern SQL data warehouse for retail‑style sales analytics using a Medallion Architecture (Bronze → Silver → Gold) inside a relational database.
It consolidates CRM and ERP source data into a clean star schema with:
fact_sales– line‑level sales transactionsdim_customers– unified customer view from CRM + ERPdim_products– enriched product view with categories and attributes
The pipeline is implemented entirely in SQL:
- Bronze layer → Raw ingested tables matching CRM/ERP CSV source layout
- Silver layer → Cleansed, standardized, conformed tables with data quality checks
- Gold layer → Business‑ready views and tables modeled as a star schema for reporting
This repo is ideal as a portfolio‑grade Data Engineering + SQL DWH project.
Conceptually, the warehouse follows Medallion / Star‑Schema best practices:
- Bronze (Raw landing)
- Tables under
bronzeschema mirror source CRM/ERP CSVs (e.g., customers, products, sales, locations).
- Tables under
- Silver (Cleansed & conformed)
- Tables under
silverschema apply:- ID normalization across systems (CRM vs ERP customer IDs).
- Date validation and coercion.
- Standardization of domains (gender, marital status, categories).
- Referential checks between related entities.
- Tables under
- Gold (Star schema)
- Tables/views under
goldschema expose:dim_customers– descriptive customer attributes.dim_products– product, category, and cost info.fact_sales– sales transactions referencing dimensions via surrogate keys.
- Tables/views under
Architecture Diagram:
CRM & ERP CSVs
│
▼
Bronze (SQL)
- bronze.crm_cust_info
- bronze.crm_prd_info
- bronze.crm_sls_details
- bronze.erp_cust_az12
- bronze.erp_loc_a101
- bronze.erp_px_cat_g1v2
│ (ELT with T‑SQL)
▼
Silver (SQL)
- silver.crm_cust_info
- silver.crm_prd_info
- silver.crm_sls_details
- silver.erp_cust_az12
- silver.erp_loc_a101
- silver.erp_px_cat_g1v2
│ (joins, DQC, SCD logic)
▼
Gold (Star Schema)
- gold.dim_customers
- gold.dim_products
- gold.fact_sales
Retail_Sales_SQL_DWH/
│
├── docs/
│ └── dwh_architecture.png # High-level Medallion + star schema diagram (optional)
│
├── scripts/
│ ├── DWH_Create_Databases_-_Schemas.sql # Create DB + schemas (bronze/silver/gold)
│ ├── DWH_Bronze_Create_Tables.sql # Bronze table DDL
│ ├── DWH_Bronze_Insert_Stored_Procedure.sql # Ingest CSV → Bronze
│ │
│ ├── DWH_Silver_Create_Tables.sql # Silver table DDL
│ ├── DWH_Silver_Insert_Stored_Procedure.sql # Bronze → Silver load
│ ├── DWH_Silver_DQC.sql # Global Silver data quality checks
│ ├── DWH_Silver_CRM_Table1_DQC.sql # CRM customers DQC
│ ├── DWH_Silver_CRM_Table2_DQC.sql # CRM products DQC
│ ├── DWH_Silver_CRM_Table3_DQC.sql # CRM sales DQC
│ ├── DWH_Silver_ERP_Table1_DQC.sql # ERP customers DQC
│ ├── DWH_Silver_ERP_Table2_DQC.sql # ERP locations DQC
│ ├── DWH_Silver_ERP_Table3_DQC.sql # ERP category DQC
│ │
│ ├── DWH_Gold_Dimension_Customer_Prework.sql # Customer dim logic + DQ
│ ├── DWH_Gold_Dimension_Product_Prework.sql # Product dim logic + DQ
│ ├── DWH_Gold_Fact_Sales_Prework.sql # Fact sales logic + DQ
│ ├── DWH_Gold_Dimension_Customer.sql # Final dim_customers view/table
│ ├── DWH_Gold_Dimension_Product.sql # Final dim_products view/table
│ ├── DWH_Gold_Fact_Sales.sql # Final fact_sales view/table
│
├── dataset/
│ ├── crm/ # CRM CSVs (cust_info, prd_info, sls_details)
│ └── erp/ # ERP CSVs (CUST_AZ12, LOC_A101, PX_CAT_G1V2)
│
└── README.md
- SQL Server (recommended for T‑SQL):
- Install SQL Server (Developer/Express) and SQL Server Management Studio (SSMS), or use Docker.
- Alternatively, adjust syntax for PostgreSQL/MySQL if needed.
git clone https://github.com/Arjun-M-101/Retail_Sales_SQL_DWH.git
cd Retail_Sales_SQL_DWHIn SSMS (or your SQL client), run:
-- Create database and schemas
:r scripts/DWH_Create_Databases_-_Schemas.sqlThis script should:
- Create a database (e.g.,
RetailDWH). - Create schemas:
bronze,silver,gold.
:r scripts/DWH_Bronze_Create_Tables.sqlThis creates Bronze tables such as:
bronze.crm_cust_infobronze.crm_prd_infobronze.crm_sls_detailsbronze.erp_cust_az12bronze.erp_loc_a101bronze.erp_px_cat_g1v2
Place your CSVs under dataset/crm and dataset/erp, then use either:
- The stored procedure in
DWH_Bronze_Insert_Stored_Procedure.sql, or BULK INSERT/OPENROWSETmanually.
Example (SQL Server):
:r scripts/DWH_Bronze_Insert_Stored_Procedure.sql
EXEC bronze.sp_load_bronze_data
@crm_cust_path = 'C:\path\to\dataset\crm\crm_cust_info.csv',
@crm_prd_path = 'C:\path\to\dataset\crm\crm_prd_info.csv',
@crm_sls_path = 'C:\path\to\dataset\crm\crm_sls_details.csv',
@erp_cust_path = 'C:\path\to\dataset\erp\CUST_AZ12.csv',
@erp_loc_path = 'C:\path\to\dataset\erp\LOC_A101.csv',
@erp_cat_path = 'C:\path\to\dataset\erp\PX_CAT_G1V2.csv';- Stores data “as‑is” from CRM & ERP CSVs.
- No business logic; only basic typing and schema alignment.
- Purpose: immutable source of truth and easy re‑load of Silver if rules change.
Silver tables fix data quality issues and align CRM/ERP into a consistent model.
Core script: DWH_Silver_ERP_Table1_DQC.sql.
Key steps:
-
ID normalization between ERP and CRM:
- ERP customer IDs like
NASAW0001are transformed toAW0001so they match CRMcst_key. - Logic:
CASE WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid,4,LEN(cid)) ELSE cid END AS cid
- ERP customer IDs like
-
Check matches vs non‑matches:
cid IN (SELECT DISTINCT cst_key FROM silver.crm_cust_info)→ matched customers.cid NOT IN (...)→ anomalies to investigate.
-
Date validation (
bdate):- Flag invalid dates: null, non‑dates,
< 1900-01-01or> GETDATE(). - Coerce invalid dates to
NULLin the transformed Silver table.
- Flag invalid dates: null, non‑dates,
-
Gender standardization:
- Normalize
M/F/Male/Female/NULL/''into a clean domain:CASE WHEN UPPER(TRIM(gen)) IN ('M','MALE') THEN 'Male' WHEN UPPER(TRIM(gen)) IN ('F','FEMALE') THEN 'Female' ELSE 'n/a' END AS gen
- Normalize
Silver DQC scripts also verify:
- No broken links between Silver CRM and Silver ERP tables.
- Clean distinct value sets for gender, marital status, country, etc.
Gold scripts (*_Prework.sql + final *.sql) model the star schema.
Core: DWH_Gold_Dimension_Customer_Prework.sql & DWH_Gold_Dimension_Customer.sql.
-
Sources:
silver.crm_cust_info– primary customer attributes.silver.erp_cust_az12– birth date & ERP gender.silver.erp_loc_a101– country/location.
-
Join logic:
FROM silver.crm_cust_info AS C1 LEFT JOIN silver.erp_cust_az12 AS C2 ON C1.cst_key = C2.cid LEFT JOIN silver.erp_loc_a101 AS C3 ON C1.cst_key = C3.cid
-
Gender conflict resolution:
CASE WHEN C1.cst_gndr != 'n/a' THEN C1.cst_gndr ELSE COALESCE(C2.gen, 'n/a') END AS gender
-
Final schema (example):
SELECT ROW_NUMBER() OVER (ORDER BY C1.cst_id) AS customer_key, -- surrogate key C1.cst_id AS customer_id, C1.cst_key AS customer_number, C1.cst_firstname AS first_name, C1.cst_lastname AS last_name, C3.cntry AS country, C1.cst_marital_status AS marital_status, CASE WHEN C1.cst_gndr != 'n/a' THEN C1.cst_gndr ELSE COALESCE(C2.gen, 'n/a') END AS gender, C2.bdate AS birth_date, C1.cst_create_date AS create_date INTO gold.dim_customers
-
Gold DQC for customers:
- Distinct checks on
country,marital_status,genderto inspect domains.
- Distinct checks on
Core: DWH_Gold_Dimension_Product_Prework.sql & DWH_Gold_Dimension_Product.sql.
-
Sources:
silver.crm_prd_info– product master with history (prd_start_dt,prd_end_dt).silver.erp_px_cat_g1v2– category & maintenance attributes.
-
Handling history:
- Keep only current products:
WHERE P1.prd_end_dt IS NULL
- Keep only current products:
-
Enrichment & renaming:
SELECT ROW_NUMBER() OVER(ORDER BY P1.prd_start_dt, P1.prd_key) AS product_key, P1.prd_id AS product_id, P1.prd_key AS product_number, P1.prd_nm AS product_name, P1.cat_id AS category_id, P2.cat AS category, P2.subcat AS subcategory, P2.maintenance AS maintenance, P1.prd_cost AS cost, P1.prd_line AS product_line, P1.prd_start_dt AS start_date INTO gold.dim_products
-
Gold DQC for products:
- Check
product_id,product_number,product_namefor duplicates to ensure unique keys.
- Check
Core: DWH_Gold_Fact_Sales_Prework.sql & DWH_Gold_Fact_Sales.sql.
-
Grain:
- One row per order line item:
order_number × product × customer.
- One row per order line item:
-
Sources:
silver.crm_sls_detailsasSl(fact base).gold.dim_productsasPr.gold.dim_customersasCu.
-
Join to dimensions via business keys then keep surrogate keys:
SELECT Sl.sls_ord_num AS order_number, Pr.product_key AS product_key, Cu.customer_key AS customer_key, Sl.sls_order_dt AS order_date, Sl.sls_ship_dt AS ship_date, Sl.sls_due_dt AS due_date, Sl.sls_sales AS sales_amount, Sl.sls_quantity AS quantity, Sl.sls_price AS price INTO gold.fact_sales FROM silver.crm_sls_details AS Sl LEFT JOIN gold.dim_products AS Pr ON Sl.sls_prd_key = Pr.product_number LEFT JOIN gold.dim_customers AS Cu ON Sl.sls_cust_id = Cu.customer_id;
-
Gold DQC for fact:
SELECT * FROM gold.fact_sales AS F LEFT JOIN gold.dim_products AS P ON F.product_key = P.product_key LEFT JOIN gold.dim_customers AS C ON F.customer_key = C.customer_key WHERE P.product_key IS NULL OR C.customer_key IS NULL; -- Should return NO ROWS → full referential integrity
This project demonstrates:
- Implementing Medallion Architecture (Bronze/Silver/Gold) purely in SQL Server / RDBMS.
- Designing a star schema with a central fact table (
fact_sales) and surrounding dimensions (dim_customers,dim_products). - Performing data quality checks:
- ID normalization across systems.
- Date validation and coercion.
- Domain standardization for categorical fields.
- Referential integrity validations between fact and dimensions.
- Using window functions (
ROW_NUMBER) to generate surrogate keys for conformed dimensions. - Building SQL‑based ETL/ELT pipelines from CSV → Bronze → Silver → Gold.
-
SQL DWH vs Spark / cloud lakehouse
Chose a pure SQL implementation (no Spark) to emphasize strong relational modeling and T‑SQL skills. The same Medallion and star‑schema concepts can later be ported to a lakehouse (e.g., Delta/S3). -
CRM as primary source vs ERP
Where CRM and ERP disagree (e.g., gender), the pipeline prefers CRM and uses ERP as a fallback, based on a realistic business rule for “system of record.” -
Current vs historic in dimensions
Kept current products indim_products(viaprd_end_dt IS NULL) while retaining full history infact_salesso historic sales remain analyzable without bloating the dimension. -
Surrogate keys in fact
The fact table storesproduct_keyandcustomer_keyinstead of natural IDs, enabling robust SCD support and decoupling facts from upstream operational ID changes.