Used Car Price Prediction Using Supervised Machine Learning

Shubham Jain
14 min read · May 17, 2021


Photo by Steve Harvey on Unsplash

All Python scripts related to this case study are available on my GitHub page. This project has also been deployed using Flask on an Amazon EC2 instance.

Table of Contents:

  1. Introduction
  2. Prerequisites
  3. Business Problem
  4. Machine Learning Problem
  5. Performance Metric
  6. Data Collection Source
  7. Understanding Data
  8. Data Cleaning & Data Visualization
  9. Feature Engineering & Data Preprocessing
  10. First Cut Approach
  11. ML models
  12. Comparison of Models
  13. Summary
  14. References

1. Introduction

Selling a used car can be difficult because it is hard to identify its fair price accurately. A car's depreciation depends on a variety of factors, so the owner needs to be aware of the vehicle's worth. With the rapid expansion of Machine Learning, this problem can be solved while minimizing human effort and time. Let's walk through an end-to-end solution for such a problem.

2. Prerequisites

The blog assumes some basic knowledge of Machine Learning techniques, familiarity with Python, and libraries like NumPy, pandas, sklearn, etc.

3. Business Problem

This case study deals with predicting the price of used cars in the US market. The dataset was scraped by Austin Reese from Craigslist and is available on Kaggle.

4. Machine Learning Problem

In Machine Learning terms, this is a regression problem: we predict the price of a used car given a variety of features (manufacturer, model, condition, state, city, year, and roughly 20 other attributes).

5. Performance Metric

  1. Mean absolute error (MAE): We calculate the residual for every data point, taking only the absolute value of each so that positive and negative residuals do not cancel out, and then average them. The mathematical equation is shown below:
MAE = (1/n) Σ |yᵢ − ŷᵢ|, where yᵢ is the actual price, ŷᵢ is the predicted price, and n is the number of data points.

So a small MAE means our model predicts well, while a large MAE may indicate that the model has problems in some areas. A model with an MAE of 0 would be a perfect predictor of the output (which is very rare in practice).

2. Mean absolute percentage error (MAPE): It tells us about the prediction accuracy of our regression model by measuring the size of the error in percentage terms.

MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|

Both MAE and MAPE are relatively robust to outliers compared with squared-error metrics. Our goal is to minimize both error metrics.

6. Data Collection Source

The dataset was downloaded from Kaggle:

Link: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

The dataset contains only one file, i.e. vehicles.csv; there are no separate train and test files for this problem.

The dataset contains 458213 rows and 26 columns.

7. Understanding Data

Data contains 26 columns out of which our target column is price and the rest are features that will be used to predict the target value.

Let’s have a look at some of them.

  1. Region: The region in which the car is listed for sale.
  2. Year: The year the car was first purchased.
  3. Manufacturer: The manufacturer of the car, e.g. Ford or Chevrolet.
  4. Model: The model of the car.
  5. Condition: The current condition of the car.
  6. Cylinders: The number of cylinders, which indicates the engine capacity of the car.
  7. Fuel: The fuel type of the car, e.g. gas or diesel.
  8. Odometer: The number of miles the car has been driven.
  9. Title_status: The current title state of the car, e.g. whether it is clean or not.
  10. Transmission: The transmission type of the car, i.e. automatic, manual, or other.
  11. Drive: The drive type, i.e. rwd, fwd, or 4wd.
  12. Size: The size of the car.
  13. Type: The body type of the car, e.g. hatchback or some other category.
  14. Paint_color: The colour of the car.
  15. Description: A short description of the car written by the seller.

There are some other features related to the geographical location of the car, such as State, City, Lat & Long.

8. Data Cleaning & Exploratory Data Analysis:

Before doing any exploration and data visualization, I dropped some irrelevant features from the dataset, like 'id', 'url', 'region_url', 'image_url', 'posting_date', and the Vehicle Identification Number, as they do not bring much value to the data.

Dropping Unnecessary Features
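A small sketch of this step with pandas; the exact column names (e.g. 'VIN' for the Vehicle Identification Number) are assumptions about this copy of the dataset:

```python
import pandas as pd

df = pd.read_csv('vehicles.csv')
drop_cols = ['id', 'url', 'region_url', 'image_url', 'posting_date', 'VIN']
df = df.drop(columns=[c for c in drop_cols if c in df.columns])  # keep only useful features
```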

All our EDA is done with respect to the target variable, i.e. 'price'. Let's have a look at some of it.

I've plotted a heatmap of the correlation matrix to check the correlation between the variables. Since this dataset contains a mix of numerical and categorical features, we can use the Phi_k correlation analyzer library to compute it.

Phi_k correlation analyzer library: It has always been troublesome to compute correlations for categorical features; normally we first need to encode them using a label encoder, one-hot encoder, or some other encoding technique. This library saves you from all that trouble: you can easily check the correlation between categorical, numerical, and interval features, and it also captures non-linear dependencies among the variables.
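A small sketch of computing and plotting the Phi_k matrix, assuming `df` is the cleaned DataFrame and that 'price', 'year', and 'odometer' are its interval (numeric) columns:

```python
import phik  # noqa: F401 (registers the .phik_matrix() accessor on DataFrames)
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.phik_matrix(interval_cols=['price', 'year', 'odometer'])
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues')
plt.title('Phik correlation matrix')
plt.show()
```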

Phik_Matrix

From this heatmap, we can conclude that odometer has a correlation of 0.7 with price, so it is the most important feature, while model is highly correlated with almost every other feature, so it adds little independent information.

Fuel vs Price

From this figure, we can see that diesel cars are the costliest in the US market while other fuel types have significantly lower prices, so fuel is also an important feature.

Drive Vs Price

From the above figure, we can see that 4wd (four-wheel drive) cars are significantly costlier than rwd (rear-wheel drive) cars, while fwd cars are the cheapest. This feature can also play an important role in predicting price.

Transmission Vs Price

The boxplot of transmission vs price shows that 75% of automatic and manual transmission vehicles are priced under $20k, while vehicles with other transmission types have higher prices.

Cylinders Vs Price

From this figure, we can see that 12-cylinder cars are the most expensive because 12-cylinder engines are mostly used in supercars, e.g. Ferrari, Aston Martin, Bentley, etc.

4-cylinder and 5-cylinder cars are in the cheapest segment because they are mostly used in hatchbacks or sedans.

You can check the detailed EDA on my Github Profile.

Data Cleaning:

After calculating the missing-value percentage for every feature (shown in the table), we can see that size has about 70% null values while the other features have less than 50%. We will drop features with more than 50% missing values and fill the others using domain knowledge and the observations we got while performing EDA.

I've also dropped the State, Long, and Lat features, as they are highly correlated with Region.

As a next step, some extreme values were dropped. While exploring, I found that 617 of the 458,213 cars were listed at more than $100k; these were removed, and I also dropped cars listed below $200. I then found some cars with a purchase year of 2021; since this data was scraped in 2020, 2021 is not a valid year, so those rows were dropped, along with cars from before 1950. Cars with odometer readings greater than 500k miles or equal to 0 miles were dropped as well, since all of these are noise in our data. Lastly, cars with a price below $1,000, an odometer reading below 60k miles, and a year before 2010 were treated as outliers and dropped.
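A small sketch of these filtering rules with pandas boolean masks, assuming `df` still carries the price, year, and odometer columns under these names:

```python
df = df[(df['price'] >= 200) & (df['price'] <= 100_000)]     # drop listings below $200 or above $100k
df = df[(df['year'] >= 1950) & (df['year'] <= 2020)]         # data was scraped in 2020
df = df[(df['odometer'] > 0) & (df['odometer'] <= 500_000)]  # drop zero and extreme mileages
# rows that are simultaneously cheap, low-mileage, and old were treated as outliers
df = df[~((df['price'] < 1000) & (df['odometer'] < 60_000) & (df['year'] < 2010))]
```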

What is the distribution of Price?

In the above figure, the blue curve is the price distribution: a right-skewed distribution with a long tail in which most cars are priced under $20k and few cars are priced above $40k. The black curve is the normal distribution fitted to price.

Applying a Box-Cox Transformation to Price:

We can remove the skewness of a random variable by transforming it with various techniques, like a log transformation, a Box-Cox transformation, the z-score method, and many others. In this case study, I've applied the Box-Cox transformation to make the price distribution look normal.

After applying the Box-Cox transformation, our price distribution looks like the blue curve shown in the above figure. We have to inverse-transform our target variable when calculating performance metric scores, using inv_boxcox.

From the QQ-plot, we can see that the transformed price distribution is almost normal, except for some deviation in the low and high price ranges.
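A minimal sketch of this transformation with SciPy, assuming the cleaned price column is strictly positive; `price_lambda` is the fitted lambda needed later for the inverse transform:

```python
from scipy.stats import boxcox
from scipy.special import inv_boxcox

price_bc, price_lambda = boxcox(df['price'])       # Box-Cox-transformed target and fitted lambda
price_back = inv_boxcox(price_bc, price_lambda)    # inverse transform, used when scoring metrics
```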

Now, let’s fill in the missing values of other features.

Missing Value Imputation :

To fill the missing values for the numerical columns, the iterative imputer method is used with a Gradient Boosting Regressor as the estimator.

Iterative imputation is a process where each feature with missing values is modelled as a function of the other features, as in a regression problem where the missing values are predicted. Each feature is imputed sequentially, and previously imputed values are then used as inputs when predicting subsequent features.
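A minimal sketch of this imputation step, assuming `num_cols` lists the numerical columns that contain missing values:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor

imputer = IterativeImputer(estimator=GradientBoostingRegressor(), max_iter=10, random_state=42)
df[num_cols] = imputer.fit_transform(df[num_cols])
```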

The second step was to fill in some missing values with reasonable defaults. The missing 'condition' values were filled based on category: the median odometer value was computed for each condition sub-category, and these medians were then used to fill in the missing values. Furthermore, cars with a year greater than 2019 were filled as 'new', and years between 2018 and 2019 were filled as 'like new'. Using some domain knowledge, the cylinder count for all electric cars was set to 0. Missing 'cylinders' values were first filled using information present in the description feature, and the rest were filled with the mode of cylinders grouped by manufacturer. Similarly, the drive and type features were filled with the mode grouped by the feature they were most highly correlated with.

Other remaining features were filled using the mode of their respective columns.

9. Feature Engineering & Data Preprocessing:

Four new features were created: two are derived from the existing set of features and two are derived using a mathematical function, as described below:

  1. AGE: The age of a car, calculated as (current year − purchase year).
  2. Average Odometer Per Year: It is calculated by dividing the odometer value of a car by its age.

Average odometer per year= No. of miles car has driven / total age

3. Sin_odo & Sin_age are calculated by taking the sine of the odometer and age values.

The odometer is also transformed using a Box-Cox transformation, in the same way we transformed price. I didn't transform age with Box-Cox because it did not transform into a normal distribution.
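A rough sketch of these engineered features, assuming the data was scraped in 2020; the +1 used to avoid a zero age is my own assumption about how division by zero would be handled:

```python
import numpy as np
from scipy.stats import boxcox

df['age'] = 2020 - df['year'] + 1                        # +1 is an assumption to avoid a zero age
df['avg_odometer_per_year'] = df['odometer'] / df['age'] # miles driven divided by total age
df['sin_odo'] = np.sin(df['odometer'])                   # sine-transformed odometer
df['sin_age'] = np.sin(df['age'])                        # sine-transformed age
df['odometer'], odo_lambda = boxcox(df['odometer'])      # Box-Cox transform, as done for price
```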

Text Preprocessing

The description is a text feature, so all special characters, stop words, and digits were removed. Any HTML tags present were also removed using BeautifulSoup.
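A small sketch of this cleaning step, assuming NLTK's English stop-word list is used (nltk.download('stopwords') must have been run beforehand):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_description(text):
    text = BeautifulSoup(str(text), 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text).lower()           # drop digits and special characters
    return ' '.join(w for w in text.split() if w not in stop_words)

df['description'] = df['description'].fillna('').apply(clean_description)
```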

Splitting data into train and test (Random Splitting)

In this step, the data was split randomly: 80% for training and 20% for testing.

Standardizing Data:

The dataset has features on different scales, so it's good practice to bring them to the same scale using either standardization or normalization. Here, we will use standardization.

All numerical features were column-standardized using StandardScaler(), which means that after scaling every numerical feature has mean 0 and standard deviation 1.
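A minimal sketch, assuming `num_cols` lists the numerical columns and the X_train / X_test split has already been made; the scaler is fit on the training set only to avoid leakage:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])  # mean 0, std 1 per column
X_test[num_cols] = scaler.transform(X_test[num_cols])        # reuse the training statistics
```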

Encoding Description feature using TF-IDF:

The description feature is encoded using the TF-IDF technique with max_features=200, a minimum word length of 5, and an ngram_range of 1 to 3.
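A sketch of this encoding; enforcing the 5-character minimum word length through token_pattern is an assumption about how it was implemented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=200, ngram_range=(1, 3),
                        token_pattern=r'(?u)\b[a-zA-Z]{5,}\b')  # keep words of length >= 5
desc_train = tfidf.fit_transform(X_train['description'])
desc_test = tfidf.transform(X_test['description'])
```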

One Hot Encoding:

To apply ML models, we first need to convert our categorical variables into numerical ones. This is done by encoding every categorical feature with Sklearn's CountVectorizer(binary=True) and stacking the results together along with the encoded description values.
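A rough sketch of this step, assuming `cat_cols` lists the categorical columns and `desc_train` / `desc_test` come from the TF-IDF step above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

train_parts, test_parts = [], []
for col in cat_cols:
    cv = CountVectorizer(binary=True)  # binary counts act as one-hot indicators
    train_parts.append(cv.fit_transform(X_train[col].astype(str)))
    test_parts.append(cv.transform(X_test[col].astype(str)))

# stack categorical encodings, TF-IDF description features, and scaled numerical columns
X_train_final = hstack(train_parts + [desc_train, X_train[num_cols].values]).tocsr()
X_test_final = hstack(test_parts + [desc_test, X_test[num_cols].values]).tocsr()
```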

Now, we can move forward to apply ML Models to our data.

10. First-Cut Approach:

Firstly, we will create baseline models on our data without doing all the steps mentioned above, to establish the worst-case scores for our dataset. You can check these in my GitHub repository.

After data cleaning and preprocessing, we have 348 one-hot encoded features. As the dimensionality of the data is large, models like XGBoost, Random Forest, Linear Regression, and Lasso are expected to work well on our data.

11. Applying ML Models:

In this section, we will look at the applied machine learning models in the following order.

  1. Linear Regression
  2. XG-Boost
  3. Random Forest
  4. Custom Ensemble

1. Linear Regression:

Results of Linear Regression:

We found that Linear Regression does not work well on this dataset, as the MAPE is around 84.4%, which means the model is only about 15.6% accurate by this measure.

Performance Graph of Linear regression

2. XGBoost Regressor (best model): It is an implementation of gradient-boosted decision trees designed for speed and performance. To learn more about boosting algorithms, click here.

Hyperparameter Tuning: There are various techniques, like GridSearchCV, RandomSearchCV, Bayesian optimization, and the Optuna framework, for tuning the hyperparameters of your model. Here, we will use the Optuna framework to tune the XGBoost hyperparameters.

Optuna Optimization: Let's see how we use Optuna. First, we must choose the metric for which the hyperparameters will be optimized; this metric serves as the optimization objective. I've used a 10-fold cross-validation score to calculate the metric. So, first, let's define the objective function.

Hyperparameter optimization using Optuna

In this case, I'm using XGBRegressor to predict my target variable, i.e. 'price', and I'll be optimizing 9 parameters, so we describe the hyperparameter search space accordingly. As shown above, n_estimators, max_depth, reg_alpha, reg_lambda, and min_child_weight are integers within specified ranges, the learning rate is sampled from a log-uniform distribution between 0.005 and 0.5, and colsample_bytree and subsample are sampled from a discrete uniform distribution.

The objective function takes a trial object as an argument; a trial is just one iteration of the optimization process. Our goal is to minimize the cross-validation score by figuring out which hyperparameter values give the lowest score.

Now, we will optimize the objective function using a study object.

Depending on the nature of the objective, the direction may be 'maximize' or 'minimize'. I chose 'minimize' because I want to minimize the cross-validation score. I've also set the number of trials to 25 and used TPESampler as the sampler; if you don't specify a sampler, this Bayesian (TPE) optimizer is chosen by default.
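A condensed sketch of the Optuna setup described above; the exact search ranges are illustrative, and X_train_final / y_train (the Box-Cox-transformed target) are assumptions carried over from the earlier steps:

```python
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'reg_alpha': trial.suggest_int('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_int('reg_lambda', 0, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.5, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0, step=0.1),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0, step=0.1),
    }
    model = XGBRegressor(**params, n_jobs=-1)
    score = cross_val_score(model, X_train_final, y_train, cv=10,
                            scoring='neg_mean_absolute_error').mean()
    return -score  # minimize the 10-fold cross-validated MAE

study = optuna.create_study(direction='minimize', sampler=TPESampler())
study.optimize(objective, n_trials=25)
print(study.best_params)
```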

Once all the trials are completed, we can find the optimal values of the hyperparameters as shown below:

Now that we have the best hyperparameters, let's train an XGBRegressor with them.
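A short sketch of training the final model; `study`, `X_train_final`, `price_lambda`, and the Box-Cox-transformed targets `y_train` / `y_test` are assumptions carried over from the earlier sketches:

```python
from scipy.special import inv_boxcox
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

best_model = XGBRegressor(**study.best_params)
best_model.fit(X_train_final, y_train)

# predictions come back in Box-Cox space, so invert before scoring in dollars
y_pred = inv_boxcox(best_model.predict(X_test_final), price_lambda)
print('MAE:', mean_absolute_error(inv_boxcox(y_test, price_lambda), y_pred))
```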

Result of XGBoost:

So XGBoost gave the best results, with a MAPE of 30.5 and an MAE of 2356, which is far better than the Linear Regression model. Let's look at the feature importance from this model.

Feature Importance:

feature Importance plot

From the left graph, we can see that 'phone', 'crew', 'fwd', 'diesel', 'four_cylinders', and 'odometer' are the most important features, while 'salvage' and 'good' are not given much weight.

So if the description feature contains words like phone, crew, guarantee, or copy, our model is much more likely to predict accurate values.

Looking at the comparison plot, very few predictions deviate from the actual values and the rest are almost exact.

3. Random Forest: Random Forest is a bagging algorithm in which a number of decision trees with low bias and high variance are trained on different samples of the data, so that every tree learns a different aspect of the data; aggregating their results reduces overall overfitting.

Results on Random Forest:

Feature Importance:

With Random Forest, I found that 'age', 'fwd', and 'odometer' are the most important features for determining the price of a used car.

4. Custom Stacking Regressor:

Let’s look at the steps of how to create our custom ensemble model.

  1. Firstly, we split the whole data into train and test sets (80-20).
  2. Now we split the 80% train set into D1 and D2 (50-50). From D1, we sample with replacement to create d1, d2, d3, ..., dk (k samples). We then create k models and train each of them on one of these k samples.
  3. We pass the D2 set to each of these k models and get k predictions for D2, one from each model.
  4. Using these k predictions we create a new dataset; since we already know the target values for D2, we can train a meta-model on these k predictions.
  5. For model evaluation, we use the 20% of the data kept aside as the test set. We pass the test set to each of the base models to get k predictions, build a new dataset from them, and pass it to our meta-model to get the final prediction. Using this final prediction and the test-set targets, we can calculate the model's performance score.

We can choose any combination of base models, such as Linear Regression, Lasso, Random Forest Regressor, KNN, SVM, or XGBoost. I've chosen Linear Regression, Lasso, XGBRegressor, and Decision Tree Regressor as my base models, and the final meta-model is an XGBRegressor, as sketched below.
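A compact sketch of this custom stacking procedure with the base models named above, assuming X and y are NumPy arrays of the preprocessed features and (transformed) target:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# 80-20 split, then split the train part into D1 and D2 (50-50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

base_models = [LinearRegression(), Lasso(), XGBRegressor(), DecisionTreeRegressor()]
rng = np.random.default_rng(42)
d2_preds, test_preds = [], []

for model in base_models:
    idx = rng.integers(0, len(X_d1), size=len(X_d1))  # sample D1 with replacement
    model.fit(X_d1[idx], y_d1[idx])
    d2_preds.append(model.predict(X_d2))              # k predictions on D2
    test_preds.append(model.predict(X_test))          # k predictions on the test set

# the k predictions on D2 become the meta-features for the meta-model
meta_model = XGBRegressor()
meta_model.fit(np.column_stack(d2_preds), y_d2)

final_pred = meta_model.predict(np.column_stack(test_preds))
print('MAE:', mean_absolute_error(y_test, final_pred))
```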

Results of Custom Ensemble:

Performance score on custom Ensemble

So our custom ensemble performs worse than both XGBoost and Random Forest.

12. Comparison of Models:

Comparison Of Model Performance score

From the above plot, we can see that XGBRegressor performs best, with the lowest MAPE score of 30.51, so we will save this model for future use.

13. Summary:

Our goal was to build a predictive model that can estimate the price of a used car given 25 features and 458,213 rows.

Initially, data exploration was done to get insights from the data, and then data cleaning was performed to remove noise. Missing values were then imputed using the iterative imputer method and the insights we gathered while performing EDA.

At last, after applying the ML models, we concluded that XGBoost performed best on our dataset, with an MAE of 2356 and a MAPE of 30.50.

Thank you for Reading! Any feedback is highly appreciated.

14. References:

  1. https://phik.readthedocs.io/en/latest/
  2. https://analyticsindiamag.com/hands-on-python-guide-to-optuna-a-new-hyperparameter-optimization-tool/
  3. https://www.kaggle.com/austinreese/craigslist-carstrucks-data
  4. https://www.appliedaicourse.com/

You can reach me at :

LinkedIn: www.linkedin.com/in/shubham-jain-2251491a4
