Project Overview
This project demonstrates how to build an ETL (Extract, Transform, Load) pipeline with Apache Airflow to automate data processing. The pipeline extracts data from either a CSV or a JSON source, processes it, and loads it into an SQLite database. The workflow is defined as an Airflow DAG (Directed Acyclic Graph) and uses a BranchPythonOperator to select the extraction source at runtime.
Project Workflow
Extraction Phase
- The pipeline retrieves data from CSV or JSON sources.
- The data is downloaded and stored in a local directory (data/).
- If the file type is CSV, it is staged directly.
- If the file type is JSON, it is converted into a structured CSV format before further processing.
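A minimal sketch of the staging step described above, assuming a hypothetical `stage_file` helper and JSON sources shaped as a list of flat records (neither name nor shape comes from the project code):

```python
import csv
import json
import shutil
from pathlib import Path

def stage_file(source: str, data_dir: str = "data") -> str:
    """Stage a source file: copy CSVs as-is, convert JSON to CSV first."""
    src = Path(source)
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    staged = out_dir / f"staging_{src.stem}.csv"

    if src.suffix == ".csv":
        # CSV sources are staged directly, unchanged.
        shutil.copyfile(src, staged)
    elif src.suffix == ".json":
        # JSON sources (assumed: a list of flat records) are rewritten as CSV.
        records = json.loads(src.read_text())
        with staged.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0]))
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f"Unsupported file type: {src.suffix}")
    return str(staged)
```

In the DAG, each extraction task would call a function like this and pass the resulting staging path downstream (e.g. via XCom) for the load step.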
Transformation Phase
- The extracted data is cleaned and structured.
- JSON data is normalized and converted to CSV.
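One way the normalization step might work: flatten nested JSON objects so every leaf value becomes its own CSV column. This is a sketch with a hypothetical `flatten` helper, not taken from the project code:

```python
def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts so each leaf becomes one CSV column."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, joining keys with the separator.
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items
```

For example, `{"address": {"city": "Oslo"}}` becomes `{"address_city": "Oslo"}`, which maps cleanly onto a CSV header row.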
Loading Phase
- The transformed data is loaded into an SQLite database.
- The table name is derived from the staged file.
- Existing data is replaced to maintain an updated dataset.
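The loading phase could be sketched with the standard-library sqlite3 module as follows; the `load_to_sqlite` name, the `staging_` prefix convention, and the all-TEXT column types are assumptions for illustration:

```python
import csv
import sqlite3
from pathlib import Path

def load_to_sqlite(staged_csv: str, db_path: str = "dbtugas.db") -> str:
    """Load a staged CSV into SQLite, replacing any existing table."""
    path = Path(staged_csv)
    # Derive the table name from the staged file, e.g. staging_addresses -> addresses.
    table = path.stem.removeprefix("staging_")

    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(db_path)
    try:
        # Drop and recreate so each run fully replaces the previous dataset.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
        conn.commit()
    finally:
        conn.close()
    return table
```

With pandas available, the drop/create/insert dance collapses to `df.to_sql(table, conn, if_exists="replace")`, which is the same replace semantics.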
Airflow DAG Structure
- start_task → Initiates the DAG execution.
- choose_extract_task → Determines whether to extract from CSV or JSON.
- extract_from_csv_task → Extracts CSV data and saves it as a staging file.
- extract_from_json_task → Extracts JSON data, converts it to CSV, and saves it.
- load_to_sqlite_task → Loads the staged file into an SQLite database.
- end_task → Marks the end of the pipeline execution.
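The branching decision boils down to a plain callable that returns the task_id to follow; in the DAG it would be passed as `python_callable` to the BranchPythonOperator, which then skips every downstream branch except the returned one. The `file_url` parameter name and the CSV fallback are assumptions:

```python
def choose_extract(file_url: str) -> str:
    """Return the task_id of the extraction branch for the given source.

    In the DAG: BranchPythonOperator(task_id="choose_extract_task",
    python_callable=choose_extract, ...).
    """
    if file_url.lower().endswith(".json"):
        return "extract_from_json_task"
    # Assumed default: treat anything else as CSV.
    return "extract_from_csv_task"
```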
Key Features
✅ Automated Extraction: Dynamically fetches data from online sources.
✅ Branching Logic: Uses BranchPythonOperator to determine extraction method.
✅ Data Staging: Saves intermediate data for validation before loading.
✅ Database Integration: Loads data efficiently into SQLite tables.
✅ Parameterization: Allows flexible input via Airflow Params for different file sources.
✅ Scheduled Execution: Runs daily to keep the data updated.
Results and Output
- The staging folder (data/) contains intermediate CSV files.
- The SQLite database (dbtugas.db) stores the processed data.
- The Airflow DAG graph visually represents the workflow execution.
Example Output:
📁 Staging File: staging_addresses.csv
📂 Database: dbtugas.db
Conclusion
This project showcases an efficient and scalable ETL pipeline using Apache Airflow. By leveraging dynamic DAGs, branching, and automated data loading, the workflow enables seamless data integration into SQLite. The project can be extended to handle larger datasets, cloud storage, and other databases like PostgreSQL or MySQL.

