Cab Fare Prediction is a Machine Learning project that predicts taxi fares based on ride details such as pickup and dropoff locations, date, time, and passenger count. This project demonstrates data cleaning, feature engineering, exploratory data analysis (EDA), and regression modeling using Python.
It is designed to provide insights into fare estimation for taxi services and can be extended for real-time cab fare prediction systems.
The goal of this project is to predict the fare amount of a taxi ride in New York City based on historical ride data. Accurate fare prediction is crucial for:
- Helping customers estimate ride costs
- Assisting drivers in route planning
- Reducing disputes between drivers and passengers
The project covers the entire machine learning pipeline:
- Data Collection – Uses the publicly available NYC Taxi Fare dataset from Kaggle.
- Data Cleaning – Handles missing values and removes invalid or outlier entries (e.g., negative fares or impossible coordinates).
- Feature Engineering – Extracts datetime features (Year, Month, Day, Hour, Weekday), calculates trip distance with the Haversine formula, and includes the passenger count.
- Exploratory Data Analysis (EDA) – Visualizes distributions, relationships, and outliers to understand the dataset.
- Model Building – Implements multiple regression algorithms:
  - Linear Regression
  - Decision Tree Regression
  - Random Forest Regression
  - Gradient Boosting Regression
- Model Evaluation – Compares models using RMSE (Root Mean Squared Error) and R² Score to select the best-performing model.
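As an illustration of the feature engineering step, here is a minimal standard-library sketch of the Haversine distance and the datetime features. The function names, the 6371 km Earth radius, and the `YYYY-MM-DD HH:MM:SS` timestamp format are assumptions made for this example; the notebook may implement the same idea differently (for instance, vectorized with Pandas and NumPy).

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def datetime_features(pickup_datetime):
    """Split a timestamp string into the calendar features used by the models.

    The input format is an assumption; adjust the format string to match the data.
    """
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday = 0, Sunday = 6
    }
```

For example, `haversine_km(40.7580, -73.9855, 40.7484, -73.9857)` gives the straight-line distance between two Manhattan points, which is a lower bound on the distance actually driven.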
The dataset contains the following columns:
| Column Name | Description |
|---|---|
| `pickup_datetime` | Date and time when the ride started |
| `pickup_latitude` | Latitude coordinate of the pickup location |
| `pickup_longitude` | Longitude coordinate of the pickup location |
| `dropoff_latitude` | Latitude coordinate of the dropoff location |
| `dropoff_longitude` | Longitude coordinate of the dropoff location |
| `passenger_count` | Number of passengers on the ride |
| `fare_amount` | Target variable: Fare of the ride (in USD) |
The dataset contains over 5 million records, making it suitable for building robust machine learning models.
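Using the columns listed above, the cleaning rules (missing values, negative fares, impossible coordinates) could be sketched as a per-row validity check. The function name `is_valid_ride`, the dict-per-row representation, and the `max_passengers=6` limit are hypothetical choices for this example, not the notebook's actual thresholds.

```python
REQUIRED_COLUMNS = (
    "pickup_datetime",
    "pickup_latitude",
    "pickup_longitude",
    "dropoff_latitude",
    "dropoff_longitude",
    "passenger_count",
    "fare_amount",
)


def is_valid_ride(row, max_passengers=6):
    """Return True if a ride record (a dict keyed by column name) looks usable.

    The checks are written positively so that NaN values also fail them.
    """
    # Missing values: every column must be present and not None.
    if any(row.get(col) is None for col in REQUIRED_COLUMNS):
        return False
    # Fares must be strictly positive.
    if not row["fare_amount"] > 0:
        return False
    # Passenger count must fall in a plausible range (assumed limit).
    if not 1 <= row["passenger_count"] <= max_passengers:
        return False
    # Coordinates must be physically possible.
    for prefix in ("pickup", "dropoff"):
        if not -90 <= row[prefix + "_latitude"] <= 90:
            return False
        if not -180 <= row[prefix + "_longitude"] <= 180:
            return False
    return True
```

A tighter filter could restrict coordinates to a bounding box around New York City, but the exact bounds would be a modeling decision.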
Follow these steps to run the project locally:
- Clone the repository:
  `git clone https://github.com/aditya2005-code/Cab-Fare-Prediction.git`
  `cd Cab-Fare-Prediction`
- Install dependencies:
  `pip install -r requirements.txt`
- Run the Jupyter Notebook:
  `jupyter notebook`
- Explore the notebook to train models, visualize data, and make predictions.
- Extract Year, Month, Day, Hour, and Weekday from datetime
- Calculate Haversine distance between pickup and dropoff points
- Handle missing values and remove outliers
- Compare multiple regression models
- Evaluate models using RMSE and R² Score
| Model | RMSE | R² Score |
|---|---|---|
| Linear Regression | 5.12 | 0.71 |
| Decision Tree Regressor | 4.75 | 0.76 |
| Random Forest Regressor | 3.92 | 0.82 |
| Gradient Boosting Regressor | 3.85 | 0.83 |
Observation: Ensemble models like Random Forest and Gradient Boosting outperform simple linear regression by capturing complex patterns in the data.
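For readers unfamiliar with the two metrics in the table, the sketch below computes them from scratch. It is an illustrative pure-Python version with made-up numbers; the function names are chosen for this example, and in practice the equivalent Scikit-learn metrics would normally be used. RMSE is expressed in the same unit as the target (USD), while R² is the fraction of variance in the fares that the model explains.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up fares for illustration only (not results from the dataset).
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"R2:   {r2_score(actual, predicted):.3f}")
```

A lower RMSE and an R² closer to 1 both indicate a better fit, which is why the two metrics rank the models in the same order in the table.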
This project uses the NYC Taxi Fare Prediction dataset from Kaggle.
- Python – Programming language
- Pandas & NumPy – Data manipulation and numerical operations
- Matplotlib & Seaborn – Data visualization
- Scikit-learn – Machine learning models and evaluation metrics
- Deploy the model as a web application using Streamlit or Flask
- Include real-time fare prediction using live GPS coordinates
- Experiment with deep learning models for improved accuracy
- Incorporate traffic, weather, and surge pricing data for more realistic predictions
Aditya, B.Tech in Computer Science (Specialization: Data Science)