Deep Learning Project – Air Pollution Level Estimation using ANN


Program 1

Air Pollution Dataset

# -*- coding: utf-8 -*-
"""Air Pollution Level Estimation_ANN.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1iD91CUNvXYOx4WjCe9LgVvuKZv6sthM4

Air-Pollution Level Estimation (PM2.5) from Weather Conditions

Estimate PM2.5 concentration (fine particulate matter, in micrograms per cubic meter) from weather and time-related features.
PM2.5 is a critical indicator of air quality and public health.
Accurate predictions support early warnings, health advisories, and urban planning.
The project shows how machine learning and environmental data can be combined for real-world impact.

| Column    | Description                                            |
| --------- | ------------------------------------------------------ |
| No    | Row index (1 to N)                                         |
| year  | Year of measurement (2010–2014)                            |
| month | Month of measurement (1–12)                                |
| day   | Day of the month (1–31)                                    |
| hour  | Hour of the day (0–23)                                     |
| pm2.5 | PM2.5 concentration (µg/m³); **target variable**           |
| DEWP  | Dew Point temperature (°C)                                 |
| TEMP  | Ambient air temperature (°C)                               |
| PRES  | Atmospheric pressure (hPa)                                 |
| cbwd  | Combined wind direction (categorical: e.g. NE, NW, SE, cv) |
| Iws   | Cumulative wind speed (m/s)                                |
| Is    | Cumulative hours of snow                                   |
| Ir    | Cumulative hours of rain                                   |
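
The script below trains only on the numeric columns and leaves the categorical `cbwd` column out. If you want to include wind direction, one common option is one-hot encoding. The following is a minimal sketch on a made-up four-row frame; the `toy` and `encoded` names and all values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the CSV file.
toy = pd.DataFrame({
    "TEMP": [-4.0, -3.0, 1.0, 5.0],
    "cbwd": ["NW", "cv", "SE", "NE"],
})

# Replace cbwd with one 0/1 indicator column per wind-direction category.
encoded = pd.get_dummies(toy, columns=["cbwd"], prefix="cbwd", dtype=int)
print(encoded)
```

The resulting `cbwd_*` columns could then be appended to the feature list before scaling.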
"""

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# 1. Load & clean data
df = pd.read_csv("D:/scikit_data/global/beijing_pm25.csv")  # path to the file you saved
df = df[df["pm2.5"].notna()]                 # drop rows with missing target
df.isnull().sum()
df.head()

# Combine year, month, day and hour into a datetime index (handy, but not mandatory).
# This records when each pollution reading was taken and makes
# time-based handling of the data easier.
df["datetime"] = pd.to_datetime(df[["year", "month", "day", "hour"]])
df.set_index("datetime", inplace=True)
df.head()

# 2. Minimal feature engineering
# From the datetime index we extract:
#   - hour of day (e.g., 11)
#   - month (e.g., 1 for January)
# because air pollution often varies with time of day and season.
# (This overwrites the existing hour and month columns with the same values.)


df["hour"]  = df.index.hour
df["month"] = df.index.month
df.head()
FEATURES = ["DEWP", "TEMP", "PRES", "Iws", "Is", "Ir", "hour", "month"]  # independent variables (inputs)
TARGET   = "pm2.5"  # dependent variable (output)

X = df[FEATURES]
y = df[TARGET]
#X.head()
y.head()

# 3. Train / test split  +  standardisation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
joblib.dump(scaler, "scaler.joblib")   # keep for later inference

# 4. Build & train the ANN
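# optimizer="adam": Adam, an adaptive gradient-based optimizer that updates the weights during training.
# loss="mse": Mean Squared Error measures how far predictions are from the true values.
# metrics=["mae"]: Mean Absolute Error is also reported each epoch, in µg/m³.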

model = Sequential([
    Dense(64, activation="relu",  input_shape=(X_train_scaled.shape[1],)),
    Dense(32, activation="relu"),
    Dense(1)                        # linear output for regression
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=256,
    callbacks=[EarlyStopping(patience=5, restore_best_weights=True)],
    verbose=1
)

# validation_split=0.1
#   Holds out 10% of the training data for validation (Keras takes it from the
#   end of the training arrays, before shuffling). The model is scored on this
#   held-out portion after each epoch, which helps detect overfitting, i.e. the
#   model memorizing the training data instead of learning to generalize.
#
# batch_size=256
#   Instead of processing the entire dataset at once, the model updates its
#   weights after each batch of 256 samples. Training in batches reduces memory
#   usage, usually speeds up training, and adds some randomness that can help
#   reduce overfitting.
#
# callbacks=[EarlyStopping(...)]
#   Stops training early once the model stops improving.
#   patience=5: stop if validation loss has not improved for 5 epochs in a row.
#   restore_best_weights=True: after stopping, restore the weights from the
#   epoch with the lowest validation loss (not from the last epoch).
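
# Illustration only (not part of the original notebook): a simplified,
# pure-Python sketch of the patience rule described above. The real Keras
# callback has more options (for example min_delta and mode); the function
# name demo_early_stop and the loss values are hypothetical.
def demo_early_stop(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0  # epochs in a row without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop here; best weights are from best_epoch
    return len(val_losses) - 1, best_epoch  # ran through every epoch

demo_val_losses = [9.0, 7.0, 6.0, 6.5, 6.2, 6.1, 6.3, 6.4, 5.0]
demo_stop, demo_best = demo_early_stop(demo_val_losses, patience=5)
# Training stops at epoch 7, so the later loss of 5.0 is never reached;
# the weights kept would be those from epoch 2 (loss 6.0).
print(f"demo: stopped at epoch {demo_stop}, best epoch {demo_best}")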

# 5. Evaluate
# MAE  = mean absolute error (average size of the errors, in µg/m³)
# RMSE = root mean squared error (penalizes bigger mistakes more heavily)
# R2   = share of the variance in the target that the model explains

y_pred = model.predict(X_test_scaled).flatten()
print("\nTest-set metrics")
print(f" MAE   : {mean_absolute_error(y_test, y_pred):.2f} µg/m³")  # micrograms per cubic meter
print(f" RMSE  : {np.sqrt(mean_squared_error(y_test, y_pred)):.2f} µg/m³")  # micrograms per cubic meter
print(f" R2    : {r2_score(y_test, y_pred):.3f}")
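
# Illustration only (not part of the original notebook): the three metrics
# above computed by hand with NumPy on four made-up values, to show what each
# number means. All demo_* names and values are hypothetical.
import numpy as np

demo_true = np.array([10.0, 20.0, 30.0, 40.0])
demo_pred = np.array([12.0, 18.0, 33.0, 37.0])
demo_err = demo_pred - demo_true

demo_mae = np.mean(np.abs(demo_err))                       # mean absolute error
demo_rmse = np.sqrt(np.mean(demo_err ** 2))                # root mean squared error
demo_ss_res = np.sum(demo_err ** 2)                        # residual sum of squares
demo_ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)  # total sum of squares
demo_r2 = 1.0 - demo_ss_res / demo_ss_tot                  # coefficient of determination

# prints: demo MAE=2.50  RMSE=2.55  R2=0.948
print(f"demo MAE={demo_mae:.2f}  RMSE={demo_rmse:.2f}  R2={demo_r2:.3f}")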

# Training loss curve
plt.figure(figsize=(6, 4))
plt.plot(history.history["loss"], label="Train")
plt.plot(history.history["val_loss"], label="Val")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.title("Training Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Actual vs Predicted scatter
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.3, color="blue")
plt.plot([0, 600], [0, 600], color="darkorange")  # y = x reference line (perfect prediction)
plt.xlabel("Actual PM2.5 (µg/m³)")
plt.ylabel("Predicted PM2.5 (µg/m³)")
plt.title("Actual vs Predicted PM2.5")
plt.grid(True)
plt.tight_layout()
plt.show()

# Save model
model.save("pm25_ann.h5")
joblib.dump(FEATURES, "feature_order.joblib")

# --------------------------------------------------------------
# 6. Simple console inference
# --------------------------------------------------------------
print("\n=== Quick PM2.5 Estimator ===")
new_vals = {}
for feat in FEATURES:
    new_vals[feat] = float(input(f"Enter {feat}: "))

row_df     = pd.DataFrame([new_vals])[FEATURES]
row_scaled = scaler.transform(row_df)
pm25_est   = model.predict(row_scaled)[0][0]

print(f"\n Estimated PM2.5 concentration: {pm25_est:.1f} µg/m³\n")
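
# Illustration only (not part of the original notebook): a NumPy sketch of
# what the inference step above relies on. The inputs must be arranged in the
# same feature order used for training, and StandardScaler.transform (with its
# default settings) computes z = (x - mean) / scale per column. All demo_*
# names, the readings and the statistics below are hypothetical.
import numpy as np

DEMO_FEATURES = ["DEWP", "TEMP", "PRES", "Iws", "Is", "Ir", "hour", "month"]

def demo_build_row(values, feature_order):
    """Arrange a dict of readings into a 1 x n_features array in training order."""
    missing = [name for name in feature_order if name not in values]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return np.array([[float(values[name]) for name in feature_order]])

def demo_standardise(row, mean, scale):
    """Column-wise z-score using statistics learned on the training set."""
    return (row - mean) / scale

# Readings deliberately given in a different order than DEMO_FEATURES.
demo_reading = {"month": 1, "hour": 11, "Ir": 0, "Is": 0,
                "Iws": 4.0, "PRES": 1020.0, "TEMP": -4.0, "DEWP": -16.0}
# Made-up training statistics standing in for scaler.mean_ and scaler.scale_.
demo_mean = np.array([2.0, 12.0, 1016.0, 24.0, 0.0, 0.0, 11.5, 6.5])
demo_scale = np.array([14.0, 12.0, 10.0, 50.0, 1.0, 1.0, 7.0, 3.5])

demo_row = demo_build_row(demo_reading, DEMO_FEATURES)
demo_scaled = demo_standardise(demo_row, demo_mean, demo_scale)
print("demo scaled row:", np.round(demo_scaled, 3))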

 


DataFlair Team

DataFlair Team provides high-impact content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. We make complex concepts easy to grasp, helping learners of all levels succeed in their tech careers.
