Simplified ETL Pipeline using Apache Airflow: from NewsAPI into MySQL
This project is an ETL (Extract, Transform, Load) pipeline built using Apache Airflow. The pipeline extracts business news articles from the NewsAPI, transforms the data, and loads it into a MySQL database. It relies on Airflow's scheduling and task orchestration and runs entirely in a Docker-based environment.
Project Overview
- Extract: The pipeline pulls top business news headlines from the NewsAPI.
- Transform: The articles are cleaned, parsed, and prepared for MySQL storage.
- Load: The cleaned data is inserted into a MySQL database for future analysis.
Project Structure
- news_etlpipeline.py: The main DAG (Directed Acyclic Graph) file where the ETL logic is defined.
- Dockerfile: Custom Docker setup for running Airflow and integrating MySQL Connector and NewsAPI.
- docker-compose.yml: The Docker Compose file to spin up the required services like MySQL, Adminer (DB management), and Airflow webserver/scheduler.
How It Works
Airflow DAG
- Daily Schedule: The DAG runs every day to fetch the latest business news.
- Error Handling: If a task fails, Airflow retries the task with a 5-minute delay.
- Dynamic Task Creation: The pipeline dynamically processes news articles and loads them into the MySQL database.
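A minimal sketch of the DAG skeleton this describes is shown below, assuming a single retry and PythonOperator tasks; the task IDs, owner, and start date are illustrative, and the real definitions live in news_etlpipeline.py.

    # Sketch of the DAG skeleton: daily schedule plus a retry with a 5-minute delay.
    # Task IDs, owner, and start_date are assumptions, not the project's exact code.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_news(**context):
        ...  # pull top business headlines from NewsAPI (sketched under "Tasks" below)


    def transform_news(**context):
        ...  # clean the articles and write a JSON backup


    def load_news(**context):
        ...  # insert the cleaned rows into MySQL


    default_args = {
        "owner": "airflow",
        "retries": 1,                          # retry a failed task...
        "retry_delay": timedelta(minutes=5),   # ...after a 5-minute delay
    }

    with DAG(
        dag_id="simplified_newsapi_etl_pipeline",
        default_args=default_args,
        schedule="@daily",                     # fetch the latest business news every day
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_news", python_callable=extract_news)
        transform = PythonOperator(task_id="transform_news", python_callable=transform_news)
        load = PythonOperator(task_id="load_news", python_callable=load_news)

        extract >> transform >> load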
Tasks
- Extract News:
  - Fetches the latest business news from the NewsAPI.
  - Uses the newsapi-python client to pull top headlines.
- Transform News:
  - Processes and transforms the data, ensuring fields like author, title, and publication date are in a format suitable for database insertion.
  - Cleans and prepares article titles, handles missing values, and extracts domain names from URLs.
  - Saves the transformed data in JSON format for backup.
- Load News:
  - Inserts the transformed data into a MySQL database.
  - MySQL table structure:

    CREATE TABLE news_articles (
        id INT AUTO_INCREMENT PRIMARY KEY,
        source_id VARCHAR(255),
        source_name VARCHAR(255),
        author VARCHAR(255),
        title TEXT,
        domain VARCHAR(255),
        description TEXT,
        url TEXT,
        url_to_image TEXT,
        published_at DATETIME
    );
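The three task callables described above might look roughly like the following sketch. The field handling, XCom keys, and the /tmp/transformed_news.json backup path are assumptions based on the description, not the project's exact code.

    # Hedged sketch of the extract/transform/load callables wired into the DAG above.
    import json
    from urllib.parse import urlparse

    from airflow.providers.mysql.hooks.mysql import MySqlHook
    from newsapi import NewsApiClient


    def extract_news(**context):
        """Pull top business headlines from NewsAPI."""
        client = NewsApiClient(api_key="YOUR_NEWSAPI_KEY")  # replace with your own key
        response = client.get_top_headlines(category="business", language="en")
        context["ti"].xcom_push(key="raw_articles", value=response["articles"])


    def transform_news(**context):
        """Clean the raw articles and keep a JSON backup."""
        articles = context["ti"].xcom_pull(key="raw_articles", task_ids="extract_news")
        rows = []
        for a in articles:
            rows.append({
                "source_id": a["source"].get("id") or "unknown",
                "source_name": a["source"].get("name") or "unknown",
                "author": a.get("author") or "unknown",          # handle missing values
                "title": (a.get("title") or "").strip(),
                "domain": urlparse(a.get("url") or "").netloc,   # domain name from the URL
                "description": a.get("description") or "",
                "url": a.get("url") or "",
                "url_to_image": a.get("urlToImage") or "",
                "published_at": (a.get("publishedAt") or "").replace("T", " ").rstrip("Z"),
            })
        with open("/tmp/transformed_news.json", "w") as f:       # JSON backup (assumed path)
            json.dump(rows, f, indent=2)
        context["ti"].xcom_push(key="clean_articles", value=rows)


    def load_news(**context):
        """Insert the cleaned rows into the news_articles table."""
        rows = context["ti"].xcom_pull(key="clean_articles", task_ids="transform_news")
        hook = MySqlHook(mysql_conn_id="mysql_conn")
        fields = ["source_id", "source_name", "author", "title", "domain",
                  "description", "url", "url_to_image", "published_at"]
        hook.insert_rows(
            table="news_articles",
            rows=[[r[f] for f in fields] for r in rows],
            target_fields=fields,
        )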
Dockerized Setup
The environment is fully containerized using Docker Compose with the following services:
- MySQL 8.0: Database to store the news articles.
- Adminer: Database management tool for MySQL (accessible at port 8080).
- Airflow Webserver: Provides the Airflow UI to monitor DAGs and tasks (accessible at port 8081).
- Airflow Scheduler: Schedules and runs DAGs based on the specified intervals.
How to Run
Prerequisites
- Docker and Docker Compose installed.
- A NewsAPI API key (see Customizations below).
Clone the repository, then pull the required Docker images:
docker pull apache/airflow:2.10.2-python3.8
docker pull adminer
docker pull mysql:8.0
Start the services defined in docker-compose.yml (e.g. with docker compose up -d), then create an Airflow admin user:
docker exec -it etlairflow-airflow-webserver-1 airflow users create --username admin --firstname Admin --lastname Admin --role Admin --email admin@example.com --password admin
Navigate to http://localhost:8081 and log in using the credentials created above.
Run the DAG:
- Trigger the DAG simplified_newsapi_etl_pipeline manually or wait for the scheduled daily run.
Customizations
- API Key: Replace the hardcoded API key in news_etlpipeline.py with your own key from NewsAPI.
- Database Connection: Set up your MySQL connection in the Airflow UI under the Connections tab, using the connection ID mysql_conn.
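As an optional alternative (an assumption, not how the project currently handles it), the API key could be stored as an Airflow Variable instead of being hardcoded, while the load task looks up the MySQL connection by its mysql_conn ID:

    # Optional sketch: read the NewsAPI key from an Airflow Variable ("newsapi_key" is an
    # assumed name) and reference the MySQL connection created in the Airflow UI.
    from airflow.models import Variable
    from airflow.providers.mysql.hooks.mysql import MySqlHook

    NEWSAPI_KEY = Variable.get("newsapi_key")           # set under Admin -> Variables in the UI

    mysql_hook = MySqlHook(mysql_conn_id="mysql_conn")  # must match the Conn Id "mysql_conn"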
Future Improvements
- Add advanced data processing steps like sentiment analysis or topic modeling (LDA) on the news articles.
- Implement error handling and notifications (e.g., using Slack or email) in case of failures.
- Extend the pipeline to support more data sources or integrate with BI tools for reporting.