Automating ETL with Stored Procedures in PostgreSQL

Project Overview

Data management is crucial for businesses aiming to leverage insights for decision-making. This project focuses on designing and implementing an automated ETL (Extract, Transform, Load) process using a stored procedure, dwh.generate_sales(), in PostgreSQL.

Objectives

  • Automate the ETL process to efficiently extract, transform, and load sales data.

  • Ensure data integrity by properly structuring and managing transactional, staging, and warehouse tables.

  • Optimize data flow across different schemas for effective reporting and analysis.

Scope

The ETL process is structured as follows:

  1. Extract: Move data from the public.sales_transaction table to the staging table stg.stg_sales_transaction.

  2. Transform: Organize and load cleaned data into dimension tables (dim_store, dim_product, dim_time, dim_sales_name).

  3. Load: Populate the fact_sales_transaction table and create aggregated reports in data marts (dm_sales_by_store, dm_sales_by_product, dm_sales_by_time, dm_sales_by_salesperson, dm_sales_summary).

ETL Process Breakdown

1. Extract Phase

  • The stored procedure clears the staging table (stg_sales_transaction) to avoid duplicates.

  • Data is extracted from public.sales_transaction and inserted into stg.stg_sales_transaction.
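The extract step can be sketched as follows. This is a minimal illustration; the real statements live in the project's SQL scripts, and copying with `SELECT *` assumes the staging table mirrors the source table's columns:

```sql
-- Clear the staging table so a rerun does not duplicate rows
TRUNCATE TABLE stg.stg_sales_transaction;

-- Copy the raw transactions into staging
INSERT INTO stg.stg_sales_transaction
SELECT *
FROM public.sales_transaction;
```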

2. Transform Phase

  • Each dimension table is populated with unique and necessary data:

    • dim_product: Extracts and inserts product-related details.

    • dim_store: Loads store information.

    • dim_time: Converts and structures timestamps into year, month, and day.

    • dim_sales_name: Organizes salesperson details.

  • Data cleansing and deduplication are handled with TRUNCATE followed by INSERT INTO ... SELECT DISTINCT, so each dimension is rebuilt from scratch on every run.
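As an illustration of the TRUNCATE plus SELECT DISTINCT pattern, populating dim_time might look like the sketch below. The column names (transaction_date in staging; date_actual, year, month, day in the dimension) are assumptions for the example, not taken from the project scripts:

```sql
-- Rebuild dim_time from staging, keeping one row per calendar date
TRUNCATE TABLE dwh.dim_time;

INSERT INTO dwh.dim_time (date_actual, year, month, day)
SELECT DISTINCT
       transaction_date::date,
       EXTRACT(YEAR  FROM transaction_date)::int,
       EXTRACT(MONTH FROM transaction_date)::int,
       EXTRACT(DAY   FROM transaction_date)::int
FROM stg.stg_sales_transaction;
```

The same pattern applies to the other dimensions, with DISTINCT guaranteeing one row per product, store, or salesperson.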

3. Load Phase

  • The fact_sales_transaction table is populated with sales transactions by integrating data from the staging table and dimension tables.

  • Aggregated data marts (dm_sales_*) provide business intelligence insights by grouping data based on store, product, time, and salesperson.
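A sketch of the load phase is shown below: the fact table is filled by joining staging rows to the dimensions to resolve surrogate keys, and each data mart is then aggregated from the fact table. All column names and the join conditions here are illustrative assumptions, since the actual definitions are in 3.dwh-dim-fact.sql and 4.dm.sql:

```sql
-- Resolve surrogate keys by joining staging rows to the dimensions
INSERT INTO dwh.fact_sales_transaction
       (product_id, store_id, time_id, sales_id, quantity, total_amount)
SELECT p.product_id,
       s.store_id,
       t.time_id,
       n.sales_id,
       stg.quantity,
       stg.total_amount
FROM stg.stg_sales_transaction stg
JOIN dwh.dim_product    p ON p.product_name = stg.product_name
JOIN dwh.dim_store      s ON s.store_name   = stg.store_name
JOIN dwh.dim_time       t ON t.date_actual  = stg.transaction_date::date
JOIN dwh.dim_sales_name n ON n.sales_name   = stg.sales_name;

-- Example data mart: sales aggregated per store
TRUNCATE TABLE dm_sales_by_store;
INSERT INTO dm_sales_by_store (store_name, total_quantity, total_amount)
SELECT s.store_name, SUM(f.quantity), SUM(f.total_amount)
FROM dwh.fact_sales_transaction f
JOIN dwh.dim_store s ON s.store_id = f.store_id
GROUP BY s.store_name;
```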

Project Structure

ETL-Data-Warehouse-Generate-Sales/
├── SQL Query/
│ ├── 1.sales_transaction.sql — Create and populate the transactional database table.
│ ├── 2.stg.sql — Create and populate staging tables.
│ ├── 3.dwh-dim-fact.sql — Create and populate dimension and fact tables.
│ ├── 4.dm.sql — Create and populate data marts.
│ ├── 5.Store Procedure.sql — Implement the stored procedure `generate_sales`.
│ ├── 6.update_data.sql — Add test data and validate ETL updates.
├── ETL Process with Stored Procedure.pdf — Presentation explaining the ETL process.
├── README.md

Execution Steps

Step 1: Create and Populate the Transactional Table

Run 1.sales_transaction.sql to initialize the public.sales_transaction table.

Step 2: Create Staging Tables

Execute 2.stg.sql to generate the stg_sales_transaction table for temporary data storage.

Step 3: Create Data Warehouse Tables

Run 3.dwh-dim-fact.sql to define and populate dimension tables (dim_store, dim_product, etc.) and the fact table (fact_sales_transaction).

Step 4: Create Data Marts

Execute 4.dm.sql to construct business-oriented data marts.

Step 5: Execute the Stored Procedure

Run 5.Store Procedure.sql to automate the ETL workflow using dwh.generate_sales().
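The shape of the procedure and the call that drives the workflow look roughly like this (a skeleton only; the full transform and load statements come from scripts 3 and 4):

```sql
CREATE OR REPLACE PROCEDURE dwh.generate_sales()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Extract: refresh the staging table
    TRUNCATE TABLE stg.stg_sales_transaction;
    INSERT INTO stg.stg_sales_transaction
    SELECT * FROM public.sales_transaction;

    -- Transform: rebuild the dimension tables
    -- (dim_product, dim_store, dim_time, dim_sales_name)

    -- Load: repopulate the fact table and the dm_sales_* data marts
END;
$$;

-- Run the whole ETL pipeline in one call
CALL dwh.generate_sales();
```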

Step 6: Test with New Data

Use 6.update_data.sql to insert new sales transactions and rerun the ETL process to verify consistency across all schemas.
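One simple consistency check after rerunning the procedure is to compare row counts end to end. This query is an illustrative sketch, not part of the project scripts, and it assumes the fact table keeps one row per source transaction:

```sql
-- Source, staging, and fact row counts should all agree
SELECT (SELECT COUNT(*) FROM public.sales_transaction)   AS source_rows,
       (SELECT COUNT(*) FROM stg.stg_sales_transaction)  AS staged_rows,
       (SELECT COUNT(*) FROM dwh.fact_sales_transaction) AS fact_rows;
```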

Key Benefits

  1. Performance Optimization: Running the ETL server-side in a stored procedure eliminates client-server round trips, and PL/pgSQL caches statement plans across calls.

  2. Data Consistency & Integrity: Centralized logic ensures uniform transformation rules.

  3. Scalability: Handles large datasets efficiently for enterprise applications.

  4. Reduced Manual Work: Automating ETL minimizes human intervention and errors.

  5. Improved Data Quality: Ensures accurate, structured, and clean data for analysis.


 

Conclusion

This project successfully demonstrates the implementation of a robust ETL process using PostgreSQL stored procedures. The generate_sales() procedure efficiently extracts, transforms, and loads data across multiple schemas, ensuring reliable and optimized data management. The final structured datasets and data marts enable insightful analytics, supporting business intelligence decisions.
