Linear Regression: Boston House Price Prediction


Problem Statement


The problem at hand is to predict the housing prices of a town or suburb based on the features of the locality provided to us. In the process, we need to identify the most important features in the dataset, employ data preprocessing techniques, and build a linear regression model that predicts the prices for us.


Data Information


Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information can be found below.

Attribute Information (in order):

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  13. LSTAT: % lower status of the population
  14. MEDV: median value of owner-occupied homes in $1000s

Let us start by importing the required libraries
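A minimal set of imports that covers the rest of this notebook might look like this (assuming pandas, seaborn, statsmodels, and scikit-learn are available):

    # Data handling
    import numpy as np
    import pandas as pd

    # Visualization
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Statistical modeling
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Model building and evaluation
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score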

Read the dataset
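For example, assuming the data is stored in a CSV file named Boston.csv (the file name is an assumption; adjust it to your copy of the data):

    # Load the data and preview the first few rows
    df = pd.read_csv("Boston.csv")
    df.head()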

Observations

Get information about the dataset using the info() method
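The info() method reports the number of rows, each column's dtype, and the count of non-null values, which makes missing values easy to spot:

    df.info()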

Observations

Let's now check the summary statistics of this dataset

Step 1: Find the summary statistics and write observations
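One way to get the summary statistics, transposed so that each feature is a row:

    # count, mean, std, min, quartiles, and max for every numeric column
    df.describe().T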

Observations:

Before performing the modeling, it is important to check the univariate distribution of the variables.

Univariate Analysis

Check the distribution of the variables
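A quick sketch that draws a histogram for every numeric column:

    df.hist(figsize=(14, 12), bins=30)
    plt.tight_layout()
    plt.show()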

Observations

As the dependent variable is slightly skewed, we will apply a log transformation on the 'MEDV' column and check the distribution of the transformed column.
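A sketch of the transformation, storing the result in a new column (the column name MEDV_log is a choice, not part of the original data):

    # Log-transform the target to reduce right skew
    df["MEDV_log"] = np.log(df["MEDV"])

    sns.histplot(df["MEDV_log"], kde=True)
    plt.show()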

Observations

Before creating the linear regression model, it is important to check the bivariate relationships between the variables. Let's check these using a heatmap and scatterplots.

Bivariate Analysis

Let's check the correlation using the heatmap

Step 2: Plot the correlation heatmap and write observations
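A possible sketch:

    plt.figure(figsize=(12, 8))
    sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()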

Observations:

Now, we will visualize the relationship between the pairs of features having significant correlations.

Visualizing the relationships between the features having significant correlations (absolute correlation > 0.7)

Step 3: Plot scatterplots for the highly correlated pairs and write observations
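A sketch of the scatterplots; the pairs listed below are illustrative guesses, so replace them with whichever pairs cross the 0.7 threshold in your heatmap:

    # Candidate highly correlated pairs (assumed from the heatmap)
    pairs = [("RAD", "TAX"), ("NOX", "INDUS"), ("NOX", "DIS"),
             ("LSTAT", "MEDV"), ("RM", "MEDV")]

    for x_col, y_col in pairs:
        sns.scatterplot(data=df, x=x_col, y=y_col)
        plt.show()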

Observations:

Let's check the correlation after removing the outliers.
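One way to do this; treating the cluster of properties with TAX >= 600 as outliers is an assumption, so adjust the threshold to whatever the scatterplot suggests:

    # Recompute the TAX-RAD correlation without the outlying cluster
    no_outliers = df[df["TAX"] < 600]
    print(no_outliers["TAX"].corr(no_outliers["RAD"]))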

So the high correlation between TAX and RAD is driven by the outliers; the tax rate for those properties might be higher for some other reason.

Observations:

We have seen that the variables LSTAT and RM have a linear relationship with the dependent variable MEDV. Also, there are significant relationships among a few independent variables, which is not desirable for a linear regression model. Let's first split the dataset.

Split the dataset

Let's split the data into dependent and independent variables, and then split those into train and test sets in a 70:30 ratio.
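A sketch of the split (random_state=1 is an arbitrary choice for reproducibility):

    # Independent variables: everything except the target and its log transform
    X = df.drop(columns=["MEDV", "MEDV_log"])
    Y = df["MEDV_log"]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)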

Next, we will check the multicollinearity in the train dataset.

Check for Multicollinearity

We will use the Variance Inflation Factor (VIF) to check whether there is multicollinearity in the data.

Features having a VIF score > 5 will be dropped or treated until all remaining features have a VIF score < 5.
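A sketch using the variance_inflation_factor function from statsmodels; the compute_vif helper below is introduced here for convenience and reused in later steps:

    def compute_vif(predictors):
        """Return a Series with one VIF score per column."""
        exog = sm.add_constant(predictors)
        # Start at index 1 so the intercept itself is not scored
        return pd.Series(
            [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
            index=predictors.columns,
        )

    print(compute_vif(x_train))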

Observations:

Step 4: Drop the column 'TAX' from the training data and check whether the multicollinearity is removed.
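Reusing the compute_vif helper defined above:

    x_train2 = x_train.drop(columns=["TAX"])
    print(compute_vif(x_train2))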

Now that the VIF is less than 5 for all the independent variables, we can assume that the multicollinearity between the variables has been removed, and we will create the linear regression model.

Step 5: Write the code to create the linear regression model and print the model summary.
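A sketch using statsmodels OLS (an intercept must be added explicitly):

    # Add the intercept term and fit ordinary least squares
    x_train2_const = sm.add_constant(x_train2)
    model1 = sm.OLS(y_train, x_train2_const).fit()
    print(model1.summary())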

Observations:

Step 6: Drop insignificant variables from the above model and create the regression model again.

Examining the significance of the model

It is not enough to fit a multiple regression model to the data; we must also check whether all the regression coefficients are significant. Significance here means whether the population regression parameters are significantly different from zero.

From the above summary, it may be noted that the regression coefficients corresponding to ZN, AGE, and INDUS are not statistically significant at level α = 0.05. In other words, the regression coefficients corresponding to these three features are not significantly different from 0 in the population. Hence, we will eliminate the three features and create a new model.
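A sketch of the refit:

    # Drop the insignificant features and refit
    x_train3 = x_train2.drop(columns=["ZN", "AGE", "INDUS"])
    x_train3_const = sm.add_constant(x_train3)
    model2 = sm.OLS(y_train, x_train3_const).fit()
    print(model2.summary())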

Observations:

Now, we will check the linear regression assumptions.

Check the below linear regression assumptions

  1. Mean of residuals should be 0
  2. No Heteroscedasticity
  3. Linearity of variables
  4. Normality of error terms

Step 7: Check the above linear regression assumptions and provide insights.

Check for mean residuals
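For example:

    residuals = model2.resid
    print("Mean of residuals:", residuals.mean())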

Observations:

The mean of the residuals is very close to zero.

Check for homoscedasticity
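The Goldfeld-Quandt test is one common choice here; a sketch using statsmodels, where the null hypothesis is that the residuals are homoscedastic:

    import statsmodels.stats.api as sms

    f_stat, p_value, _ = sms.het_goldfeldquandt(residuals, x_train3_const)
    print("F statistic:", f_stat, " p-value:", p_value)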

Observations:

The p-value is > 0.05, so we fail to reject the null hypothesis and can assume that the residuals are homoscedastic.

Linearity of variables

This assumption states that the predictor variables must have a linear relationship with the dependent variable.

To test the assumption, we'll plot the residuals against the fitted values and ensure that the residuals do not form a strong pattern. They should be randomly and uniformly scattered around the horizontal axis.
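A sketch of the residuals-vs-fitted plot:

    fitted = model2.fittedvalues

    plt.scatter(fitted, residuals)
    plt.axhline(y=0, color="red", linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()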

Observations:

Normality of error terms

The residuals should be normally distributed.
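A histogram and a Q-Q plot are the usual visual checks:

    import scipy.stats as stats

    # Histogram of residuals
    sns.histplot(residuals, kde=True)
    plt.show()

    # Q-Q plot against the normal distribution
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()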

Observations:

Check the performance of the model on the train and test datasets

Step 8: Compare the model performance on the train and test datasets
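A sketch comparing RMSE, MAE, and R-squared on both sets; note that the test predictors must go through the same column selection as the training set, and that the metrics below are on the log scale of the target:

    # Align the test set with the final training columns and add the intercept
    x_test3_const = sm.add_constant(x_test[x_train3.columns])

    def performance(model, predictors, target):
        preds = model.predict(predictors)
        return pd.Series({
            "RMSE": np.sqrt(mean_squared_error(target, preds)),
            "MAE": mean_absolute_error(target, preds),
            "R-squared": r2_score(target, preds),
        })

    print("Train:", performance(model2, x_train3_const, y_train), sep="\n")
    print("Test:", performance(model2, x_test3_const, y_test), sep="\n")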

Observations:

Apply cross validation to improve the model and evaluate it using different evaluation metrics

Step 9: Apply the cross validation technique to improve the model and evaluate it using different evaluation metrics.
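A sketch using 10-fold cross validation on the training data with scikit-learn's LinearRegression (the fold count is a choice):

    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()

    cv_r2 = cross_val_score(lin_reg, x_train3, y_train, cv=10, scoring="r2")
    cv_mse = -cross_val_score(lin_reg, x_train3, y_train, cv=10, scoring="neg_mean_squared_error")

    print("Mean CV R-squared:", cv_r2.mean())
    print("Mean CV MSE:", cv_mse.mean())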

Observations

We may want to iterate over the model-building process with new features or better feature engineering to increase the R-squared and decrease the MSE on cross validation.

Step 10: Get the model coefficients in a pandas DataFrame, with a column 'Feature' containing all the features and a column 'Coefs' containing the corresponding coefficients.
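For example, pulling the fitted parameters out of the statsmodels results object (the index includes the intercept as 'const'):

    coef_df = pd.DataFrame({
        "Feature": model2.params.index,
        "Coefs": model2.params.values,
    })
    print(coef_df)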

Step 11: Write the conclusions and business recommendations derived from the model.

Write Conclusions here

Write Recommendations here