Abalone Age Estimation with Predictive Modelling and Feature Analysis: A Web-Integrated Approach

The main purpose of the project was to build a web-based application that predicts the age of abalone from physical measurements using a pre-trained machine learning model. The application allows users to input the relevant features, receive a prediction, and store that prediction in a PostgreSQL database.

The project involved:

  • Training and saving a machine learning model (LightGBM).
  • Building a web interface using FastAPI.
  • Integrating with a PostgreSQL database to persist predictions.
  • Using Jinja2 templates to render HTML pages.
  • Implementing logging to monitor the app’s behavior and errors.

This setup allows users to easily interact with the model and keeps a record of all predictions for later review.

Here’s a breakdown of everything I did:

Jupyter Notebook

Data Loading and Exploration:

  • I loaded the dataset, performed basic exploratory data analysis, and visualized feature correlations using heatmaps and scatter plots.
  • I encoded the ‘Sex’ feature with LabelEncoder.
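
In code terms, this step might look like the following sketch (the CSV filename abalone.csv is illustrative; the DataFrame name df1 matches the one referenced later in the drift-monitoring section):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Load the dataset and take a first look (filename is illustrative)
df1 = pd.read_csv("abalone.csv")
print(df1.head())
print(df1.describe())

# Visualize feature correlations with a heatmap
sns.heatmap(df1.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Encode the categorical 'Sex' feature (M/F/I) as integers
le = LabelEncoder()
df1["Sex"] = le.fit_transform(df1["Sex"])
```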

Outlier Detection:

  • I applied IsolationForest to identify and remove outliers and visualized the results with box plots.
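
A sketch of the outlier-removal step, continuing from the snippet above (the contamination rate is an illustrative assumption, not the notebook’s exact setting):

```python
from sklearn.ensemble import IsolationForest

# Fit IsolationForest and flag each row as inlier (1) or outlier (-1);
# the contamination rate here is an assumption for illustration
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(df1.drop(columns=["id"]))

# Keep only the inliers
df_cleaned = df1[labels == 1].reset_index(drop=True)

# Box plots to inspect the distributions after removal
df_cleaned.drop(columns=["id"]).boxplot(rot=45)
plt.show()
```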

Train-Test Split:

  • I split the dataset into training and testing sets to prepare for model development.
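
A typical split, assuming ‘Age’ is the target (as in the drift section below) and an illustrative 20% hold-out:

```python
from sklearn.model_selection import train_test_split

X = df_cleaned.drop(columns=["Age", "id"])
y = df_cleaned["Age"]

# Hold out 20% of the data for testing; the split ratio is illustrative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```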

Regression Models:

  • I trained and optimized several regression models, including RandomForestRegressor, GradientBoostingRegressor, Ridge, and LightGBM.
  • I used BayesianOptimization and Optuna for hyperparameter tuning.
  • I evaluated the models using the MSE, MAE, and R² metrics.
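
To give a flavor of the tuning loop, here is a sketch of an Optuna objective for LightGBM; the search space and trial count are assumptions, not the notebook’s exact settings:

```python
import optuna
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def objective(trial):
    # Illustrative search space, not the exact ranges from the notebook
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
    }
    model = lgb.LGBMRegressor(**params, random_state=42)
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

# Refit the best configuration and report all three metrics
best_model = lgb.LGBMRegressor(**study.best_params, random_state=42).fit(X_train, y_train)
preds = best_model.predict(X_test)
print("MSE:", mean_squared_error(y_test, preds))
print("MAE:", mean_absolute_error(y_test, preds))
print("R²:", r2_score(y_test, preds))
```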

Model Comparison:

  • I visualized and compared the models’ performance using bar plots.
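
The comparison plot can be as simple as the sketch below. It trains default-parameter models purely to have something to plot; the notebook compared the tuned versions instead:

```python
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Default-parameter models just to illustrate the comparison plot
models = {
    "RandomForest": RandomForestRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "Ridge": Ridge(),
    "LightGBM": lgb.LGBMRegressor(random_state=42),
}

mse_scores = {
    name: mean_squared_error(y_test, m.fit(X_train, y_train).predict(X_test))
    for name, m in models.items()
}

plt.bar(mse_scores.keys(), mse_scores.values())
plt.ylabel("Test MSE")
plt.title("Model comparison")
plt.show()
```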

Feature Importance:

  • I used SHAP values to analyze feature importance and created various SHAP plots to visualize the results.
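
For a tree model like LightGBM, the SHAP analysis typically looks like this (reusing best_model and X_test from the tuning sketch above):

```python
import shap

# TreeExplainer is the fast path for tree ensembles such as LightGBM
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Beeswarm plot: per-feature impact and direction for every sample
shap.summary_plot(shap_values, X_test)

# Bar plot: mean absolute SHAP value per feature (global importance)
shap.summary_plot(shap_values, X_test, plot_type="bar")
```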

Classification Models:

  • I defined objective functions for tuning GradientBoostingClassifier, DecisionTreeClassifier, and RandomForestClassifier with Optuna.
  • I trained and evaluated these classifiers and compared their performance using ROC-AUC curves, as sketched below.
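
A sketch of one such objective plus the ROC-AUC evaluation; the binary label y_class is hypothetical, since the writeup doesn’t spell out how the classes were derived from the data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, RocCurveDisplay
from sklearn.model_selection import train_test_split

# y_class is a hypothetical binary label (e.g., young vs. old abalone)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_class, test_size=0.2, random_state=42
)

def rf_objective(trial):
    # Illustrative search space
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    clf = RandomForestClassifier(**params, random_state=42).fit(Xc_train, yc_train)
    return roc_auc_score(yc_test, clf.predict_proba(Xc_test)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(rf_objective, n_trials=30)

# Plot the ROC curve for the best configuration
best_clf = RandomForestClassifier(**study.best_params, random_state=42).fit(Xc_train, yc_train)
RocCurveDisplay.from_estimator(best_clf, Xc_test, yc_test)
plt.show()
```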

FastAPI Application (app1.py):

App Initialization:

  • I set up a FastAPI application with all the necessary imports and configuration.
  • I initialized logging with a custom logging_config.py.
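
The initialization might look like this; setup_logging is a hypothetical helper name, since only the module logging_config.py is given:

```python
import logging

from fastapi import FastAPI, Request, Form
from fastapi.templating import Jinja2Templates

from logging_config import setup_logging  # hypothetical helper name

setup_logging()
logger = logging.getLogger(__name__)

app = FastAPI()
templates = Jinja2Templates(directory="templates")
```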

Model Loading:

  • I loaded the pre-trained LightGBM model from model.lgb.
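
Assuming the model was saved with LightGBM’s native Booster.save_model (consistent with the .lgb extension), loading looks like this; if it was pickled instead, joblib.load would be the counterpart:

```python
import lightgbm as lgb

# Load the pre-trained booster from its native model file
model = lgb.Booster(model_file="model.lgb")
```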

Database Setup:

  • I established a connection to a PostgreSQL database.
  • I created a table to store predictions if it didn’t already exist.
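
A sketch with psycopg2; the connection parameters and the table schema are assumptions, modeled on the input form fields listed later:

```python
import psycopg2

# Connection parameters are placeholders, not the project's real credentials
conn = psycopg2.connect(
    host="localhost", dbname="abalone", user="postgres", password="postgres"
)

# Create the predictions table if it doesn't already exist;
# the assumed schema mirrors the input form fields plus the predicted value
with conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            id SERIAL PRIMARY KEY,
            sex INTEGER,
            length REAL,
            diameter REAL,
            height REAL,
            weight REAL,
            shucked_weight REAL,
            viscera_weight REAL,
            shell_weight REAL,
            prediction REAL
        )
        """
    )
conn.commit()
```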

API Endpoints:

  • GET /: Serves the main input page (index.html).
  • POST /predict: Receives input data, makes predictions with the LightGBM model, stores the predictions in the database, and returns the result page (result.html).
  • GET /view-predictions: Displays stored predictions from the database on a page (predictions.html).
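
Wired together, the three endpoints might look like the sketch below, reusing the app, templates, model, and conn objects from the previous snippets; the form field names follow index.html:

```python
import numpy as np
from fastapi.responses import HTMLResponse

@app.get("/", response_class=HTMLResponse)
async def index(request: Request):
    # Serve the main input form
    return templates.TemplateResponse("index.html", {"request": request})

@app.post("/predict", response_class=HTMLResponse)
async def predict(
    request: Request,
    sex: int = Form(...), length: float = Form(...), diameter: float = Form(...),
    height: float = Form(...), weight: float = Form(...),
    shucked_weight: float = Form(...), viscera_weight: float = Form(...),
    shell_weight: float = Form(...),
):
    # Run the model on the submitted features
    features = np.array([[sex, length, diameter, height, weight,
                          shucked_weight, viscera_weight, shell_weight]])
    prediction = float(model.predict(features)[0])

    # Persist the inputs and the prediction
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO predictions (sex, length, diameter, height, weight, "
            "shucked_weight, viscera_weight, shell_weight, prediction) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
            (sex, length, diameter, height, weight,
             shucked_weight, viscera_weight, shell_weight, prediction),
        )
    conn.commit()

    return templates.TemplateResponse(
        "result.html", {"request": request, "prediction": prediction}
    )

@app.get("/view-predictions", response_class=HTMLResponse)
async def view_predictions(request: Request):
    # Fetch every stored prediction for the history page
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM predictions")
        rows = cur.fetchall()
    return templates.TemplateResponse(
        "predictions.html", {"request": request, "predictions": rows}
    )
```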

HTML Templates:

index.html:

  • This is the main input page where users can enter features for prediction.
  • It includes a form with fields for sex, length, diameter, height, weight, shucked_weight, viscera_weight, and shell_weight.

result.html:

  • This page displays the prediction result after form submission.
  • It shows the predicted value returned by the model.

predictions.html:

  • Lists all stored predictions, including input data and corresponding predictions.
  • Provides a historical view of all predictions made with the application.

Data Drift Monitoring with Evidently.ai:

Data Preparation:

  • I prepared the dataset by dropping the ‘id’ column and adding the prediction results to the DataFrame df3.
  • I sampled 5000 records from the cleaned dataset (df_cleaned) as the reference data and 5000 records from df3 as the current data.

Column Mapping:

  • I defined a ColumnMapping to specify the target variable (‘Age’), the numerical features, and the categorical features. The numerical features were the columns of df1, excluding ‘Age’, ‘id’, and ‘Sex’.

Report Generation:

  • I created a Report instance with the DataDriftPreset metric to detect data drift.
  • I ran the report on the reference and current data, along with the column mapping.
  • I displayed the report and saved it as an HTML file (file.html) for further review.
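
Putting the drift check together, a sketch using the evidently.report API (as found in pre-0.7 versions of Evidently):

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference = cleaned historical data; current = recent data with predictions attached
reference = df_cleaned.sample(5000, random_state=42)
current = df3.sample(5000, random_state=42)

# Numerical features: columns of df1 minus the target, the id, and the categorical 'Sex'
numerical = [c for c in df1.columns if c not in ("Age", "id", "Sex")]
column_mapping = ColumnMapping(
    target="Age",
    numerical_features=numerical,
    categorical_features=["Sex"],
)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
report.show()                 # display inline in a notebook
report.save_html("file.html")
```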