Sunday, June 2, 2019

Modeling Housing Prices

I was recently tasked with using a year's worth (2014-2015) of housing data from the greater Seattle area to construct a workable and accurate model capable of predicting the sale price of a house if given the variables present in the data set as inputs. This was no small feat to accomplish for a budding data scientist, but I must profess that I am rather pleased with my results.

Before progressing further, here is the price equation from my model:

p = (1296579.6 × f_L) + (60401.32 × f_V) + (110.79 × f_B) + (59144.57 × f_G) + (111362.3 × log_e(f_S)) − 62564620.4

where:

p = House price (in USD)
f_L = latitude
f_V = number of times the property has been viewed
f_B = square footage of the basement
f_G = grade given to the housing unit, based on the King County grading system
f_S = square footage of living space
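For concreteness, here is the same equation written as a small Python function. The parameter names are mine, chosen to match the definitions above, and the trailing constant is read as the model's intercept:

```python
import math

def predict_price(lat, views, sqft_basement, grade, sqft_living):
    """Sketch of the price equation above (natural log on living-space square footage)."""
    return (1296579.6 * lat
            + 60401.32 * views
            + 110.79 * sqft_basement
            + 59144.57 * grade
            + 111362.3 * math.log(sqft_living)
            - 62564620.4)  # intercept
```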

To say the least, it took quite a bit of work to arrive at this equation. Rather than go through the entire process, I would like to focus on two roadblocks I encountered during the model-building process and the steps I took to overcome them. First, I will talk about dealing with geographical data (latitude and longitude coordinates). Second, I will discuss my approach to building custom features for the model in order to eliminate multicollinearity.

Geographical Data (lat, long)

Just as with any city, Seattle's population is not uniformly distributed across a square region. Intuitively, there must be numerous bodies of water, public lands, zoning-restricted areas, mountains, and other geographical features that prevent an even distribution of houses across this specific section of the Earth's surface. But how can one account for this within the given dataset, which only lists each house's latitude and longitude?

I decided to try to get a handle on this information first by graphing the histograms for these variables, in the fool's hope that they were normally distributed.
No such luck. If anything, it seemed like there were three almost independent zones which each exhibited an almost normal distribution.
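That first look amounted to something like the following sketch, assuming the data sits in a pandas DataFrame with 'lat' and 'long' columns (the file name here is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')  # assumed file name for the King County data

# Histograms of the raw coordinates, to check whether either looks normally distributed
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['lat'].hist(bins=50, ax=axes[0])
axes[0].set_title('Latitude')
df['long'].hist(bins=50, ax=axes[1])
axes[1].set_title('Longitude')
plt.show()
```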

To make the best of a bad situation, I decided to remove the outliers from this data. But where would be best to draw the line on these histograms? Faced with this impossible question, I turned to a 2-dimensional graph of these variables using hexbins (while also using 'price' to determine the color of each hex).
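Roughly speaking, the hexbin plot comes from something like this (a sketch reusing the DataFrame loaded above; the gridsize and colormap are arbitrary choices, not necessarily what I used):

```python
import matplotlib.pyplot as plt

# Hexbin of longitude vs. latitude, colored by the mean sale price within each hex
plt.figure(figsize=(8, 8))
hb = plt.hexbin(df['long'], df['lat'], C=df['price'], gridsize=60, cmap='viridis')
plt.colorbar(hb, label='Mean price (USD)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
```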
It turns out that this is essentially an 8-bit representation of a map of Seattle itself:

Using the map as a guide, it became easier to decide where to draw the boundaries for outlier elimination. Anything in the hills to the East or on an island to the West had to go. Additionally, the fairly straight line near the Southern end of the hexbin graph provided a convenient cutoff for dropping latitudinal outliers. This process resulted in a hexbin graph that, while not normally distributed due to internal topographical nuances, had a much more uniform population density across the entirety of the graph than before.
With my data subset homed in on the more uniformly urban areas (having excluded the majority of rural terrain), there was a much lower likelihood of non-observed geographical features (e.g., distance from the city center) confounding my model.
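In code, this trimming is just a set of boundary filters. The cutoff values below are placeholders for illustration, not the ones I actually used:

```python
# Placeholder boundaries -- illustrative only
LAT_MIN = 47.3     # drop the strip below the straight line to the South
LONG_MIN = -122.4  # drop the island to the West
LONG_MAX = -121.9  # drop the hills to the East

df = df[(df['lat'] >= LAT_MIN) &
        (df['long'] >= LONG_MIN) &
        (df['long'] <= LONG_MAX)]
```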

Building Custom Features

Once I had finished cleaning and scaling my data, it came time to run an Ordinary Least Squares (OLS) analysis and see which variables might be exhibiting multicollinearity. Multicollinearity is undesirable in linear regression models because it means that two (or more) input variables carry overlapping information, so the same underlying effect gets counted more than once in your output and the individual coefficient estimates become unreliable.

For example, let's say you want to see what effect increasing the square footage of above-ground living space has on a house's price. Thinking about this logically, you can't increase the above-ground square footage without also increasing the property's total square footage of living space (above- and below-ground). Now if your model relies directly on both the total square feet of living space and on the square feet of above-ground living space, you will be counting a 1 unit change more than once! Thus, in this state, your model is less accurate than desired.
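The OLS fit, along with a standard check for collinear inputs (variance inflation factors), might look roughly like this with statsmodels; the predictor list here is an assumption for illustration, not my final feature set:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumed predictor columns -- swap in whatever survives cleaning
predictors = ['sqft_living', 'sqft_above', 'sqft_basement', 'grade', 'lat', 'view']
X = sm.add_constant(df[predictors])
model = sm.OLS(df['price'], X).fit()
print(model.summary())

# Variance inflation factors: values well above ~5-10 hint at multicollinearity
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X.values, i))
```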

To eliminate multicollinearity as much as possible, I decided to create new features (variables) that combined collinear inputs into a single weighted variable. In the process, I created a function in Python that would allow me to quickly calculate the appropriate weight for any two input variables, thus permitting me to test different combinations quickly.

Here's a correlation heatmap, displaying a high degree of collinearity between several variables (the beige/orange squares).
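Generating such a heatmap is essentially a one-liner with seaborn on the correlation matrix (a sketch, not my exact styling):

```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes('number').corr(), cmap='coolwarm', annot=False)
plt.show()
```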
And here's the code I used to determine the weights:
(many thanks to Rafael Carrasco for inspiration on part of the for loop)
[Image: weight-calculation function code]
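Since the screenshot doesn't reproduce well here, below is a hypothetical reconstruction of what such a weight search could look like. It sketches the general idea (scan candidate weights and keep the one that gives the best single-feature fit), not my original code; the function name, defaults, and usage are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

def find_best_weight(df, col_a, col_b, target='price', weights=np.arange(0.05, 1.0, 0.05)):
    """For each candidate weight w, build the combined feature col_a + w * col_b,
    regress the target on it alone, and keep the w with the highest R-squared."""
    best_w, best_r2 = None, -np.inf
    for w in weights:
        feature = df[col_a] + w * df[col_b]
        X = sm.add_constant(feature)
        r2 = sm.OLS(df[target], X).fit().rsquared
        if r2 > best_r2:
            best_w, best_r2 = w, r2
    return best_w, best_r2

# Hypothetical usage:
# w, _ = find_best_weight(df, 'sqft_living', 'sqft_above')
# df['sqft_living_sqft_above_feature'] = df['sqft_living'] + w * df['sqft_above']
```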
With this function now defined, I was able to work my way through different pairs of variables to eliminate multicollinearity. I started by making a feature out of the combination of sqft_living and sqft_above, as they were the two variables that showed the highest degree of correlation. After that, I combined this new feature with the variable grade, creating yet another hybrid feature ('grade_sqft_living_sqft_above_feature'), since my custom feature and 'grade' showed a fairly high degree of collinearity. Amidst this process a few other variables were dropped outright, but the correlation heatmap of my final list of model variables is quite telling in terms of how much multicollinearity I was able to eliminate:
