Friday, June 14, 2019

Visualizing Data with Hexbins in Matplotlib in Python

In the course of learning the basics of producing data visualizations with Python, I happened to run across a fairly uncommon style of graph called the hexbin. Hexbin plots take in lists of X and Y values (and an optional list of Z values) and return something that looks similar to a scatter plot, except that the entire graphing space has been divided into hexagons (like a honeycomb), all points have been grouped into their respective hexagonal regions, and a color gradient indicates the density (or mean Z value) of each hexagonal area.
While at first mention this might seem like a peculiar method of plotting data, this technique really shines when two conditions are met:
    1. the data contains hundreds (or thousands, or more) of X-Y data points
    2. the X-Y variables are spatial in nature (e.g. latitude-longitude pairs)

Hexbins vs. Scatter Plots

Let's look at a few examples to illustrate the advantages of hexbins over scatter plots. In a normal scatter plot of a large data set (fig. 1), the resulting graph will often look like a messy dark blob in the center with a smattering of distinguishable points around its periphery. We can improve this a tiny bit by making every point translucent (using the alpha= parameter, as in fig. 2).

Fig. 1 - scatter plot
code: plt.scatter(x,y)
Fig. 2 - scatter plot (translucent)
code: plt.scatter(x,y, alpha=0.5)

But this still is not very informative. This is where hexbins can prove a bit more useful. Let's look at the same data in a hexbin plot (fig. 3). Unlike the scatter plots above, a hexbin plot automatically encodes density with a color gradient. Below you can now clearly see that our data set displays clustering around two different possible regression lines!

Fig. 3 - hexbin plot
code: plt.hexbin(x,y)
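If you want to reproduce something like figures 1-3 yourself, here is a minimal, self-contained sketch; the data is made up (two noisy linear trends) purely to stand in for the real data set:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: points scattered around two different slopes
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 5000)
slopes = rng.choice([1.0, 2.5], size=5000)   # two possible "regression lines"
y = slopes * x + rng.normal(0, 1.5, 5000)    # plus some noise

plt.hexbin(x, y)                             # hex color = number of points
plt.colorbar(label='points per hexagon')
plt.show()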

Though this is certainly a useful trick, hexbins really shine when using geospatial data as mentioned above. As you can probably guess, the built-in color gradient feature of hexbins can easily communicate to viewers the spatial location(s) where the most data points are gathered. Let's take a look at some recent real estate market data gathered from the Seattle metro area in 2014-2015. Figures 4 & 5 show two different plots of the same latitude and longitude pairs for all house sales within a single year.

Fig. 4 - scatter plot of Seattle
code: plt.scatter(x,y, alpha=0.3)
Fig. 5 - hexbin plot of Seattle
code: plt.hexbin(x, y, mincnt=1)
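For anyone following along, the full plotting code for fig. 5 might look like the sketch below. The file and column names ('kc_house_data.csv', 'lat', 'long') are assumptions based on the publicly available King County sales data; adjust them to match your copy.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name
x, y = df['long'], df['lat']

plt.hexbin(x, y, mincnt=1)   # mincnt=1 leaves hexes with no points unpainted
plt.colorbar(label='number of sales per hexagon')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()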

These graphs naturally bear a striking resemblance to an actual map of the Seattle area. From the hexbin graph, we can see that most house sales took place in the areas closest to downtown Seattle, with fewer sales taking place in the less densely populated areas to the East and to the South. This is not very surprising, but it is nice to have a way to graphically represent what we instinctually know. Furthermore, if our hexbin graph happened to output a counter-intuitive result (for instance, a bright yellow cluster somewhere in the right half of the graph), this would be very useful as a flag for further investigation.

Going further with hexbins

While what we have seen above is very useful, there is a lot more that we can do with hexbins. Density heatmaps (using a color gradient to display density differences) are great, but we can use hexbins to produce heatmaps based on variables other than density. As an example, now that we know where the majority of home sales take place in Seattle, let's see if there is a correlation between house sale location density and house sale price. In other words, do the highest home prices match up with the areas in which the most home sales occur (or vice versa)? We can accomplish this task by simply including one additional parameter (using C=)* that points to a list of house prices (fig. 6).
* As a note, this can also be done with scatter plots (using c=). The difference is that while scatter plots color every point individually, hexbin plots determine the color of each hexagon by taking the mean of all the price values within that hexagon.


Fig. 6 - hexbin plot with price determining color gradient
code: plt.hexbin(x, y, C=price)
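A fuller version of the fig. 6 call, for reference. By default, hexbin colors each hexagon by the mean of the C values that fall inside it, which is exactly the behavior we want here (file and column names are assumptions, as before):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name
x, y, price = df['long'], df['lat'], df['price']

# With C=, each hexagon shows the mean sale price of the houses inside it
plt.hexbin(x, y, C=price, mincnt=1)
plt.colorbar(label='mean sale price (USD)')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()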

Wow! So it appears that house price is neither directly nor inversely related to housing sale density. Looking at an actual map, it seems that lake-front property is significantly more expensive than either bay-front or downtown property. Furthermore, if we look closely, it appears that there must be a small, particularly wealthy community in the North-central region of our map. This information could be quite useful to a realtor, a construction firm, or even a potential home-owner who is moving to the Seattle area.

In truth, we are not restricted to using 'price' as our color gradient variable; we can use any list of values that is equal in length to our X and Y variables ('longitude' and 'latitude' in this case, though these could just as easily be other spatial variables, such as the X-Y grid coordinates of an archaeological dig site). As an example, let's see what our hexbin graph would look like if we used the number of bathrooms to color our hexes (fig. 7).

Fig. 7 - hexbin plot with bathrooms determining color gradient
code: plt.hexbin(x, y, C=bathrooms)

This is wild! Who could have guessed that the number of bathrooms in a house appears to be directly correlated with longitude?! In other words, the further East of Seattle a house is located, the more likely it is to have a higher number of bathrooms! Thinking back to our density hexbin graph, population density probably has a negative influence on the potential number of bathrooms in a house, both because denser areas generally necessitate prioritizing living space over closets/bathrooms/etc. and because urban areas are more likely to have houses of older construction as compared to suburbs (huge closets and lots of bathrooms are a relatively recent trend in home construction).

One Last Parameter

As we've seen, hexbin plots can be exceptionally useful and powerful tools in the right context. But there are many more parameters that can be used with hexbin plots to further improve the utility of your graphs. One of the simplest to grasp, and arguably one of the most useful, is the gridsize parameter. This parameter determines how many hexagons wide your x-axis will be (the y-axis is scaled automatically to match). The default value is 100, but this can be reduced or increased to your heart's content. While the utility of this parameter will depend heavily on the specific dataset, increasing or decreasing the size/number of hex groupings can potentially reveal, or at least highlight, important nuances of your dataset. Let's compare our Seattle housing data using the default of 100 bins and then 50 bins (fig. 8 & 9).

Fig. 8 - hexbin plot with price (100 bins)
code: plt.hexbin(x, y, C=price)
Fig. 9 - hexbin plot with price (50 bins)
code: plt.hexbin(x, y, C=price, gridsize=50)
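Here is one way to put the two gridsize settings side by side for a direct comparison (a sketch, reusing the assumed data loading from earlier):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name
x, y, price = df['long'], df['lat'], df['price']

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
for ax, gs in zip(axes, (100, 50)):      # 100 is the default gridsize
    hb = ax.hexbin(x, y, C=price, gridsize=gs, mincnt=1)
    ax.set_title(f'gridsize={gs}')
fig.colorbar(hb, ax=list(axes), label='mean sale price (USD)')
plt.show()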

Though it does not necessarily reveal anything new in this example, the coarser grid does a much better job of highlighting the presence of an exceptionally affluent community in the North-central region of our graph. While it might have been fairly easy to miss in the original graph, it is now clearly visible.


Sunday, June 2, 2019

Modeling Housing Prices

I was recently tasked with using a year's worth (2014-2015) of housing data from the greater Seattle area to construct a workable and accurate model capable of predicting the sale price of a house if given the variables present in the data set as inputs. This was no small feat to accomplish for a budding data scientist, but I must profess that I am rather pleased with my results.

Before progressing further, here is the price equation from my model:

p = (1296579.6 × f_L) + (60401.32 × f_V) + (110.79 × f_B) + (59144.57 × f_G) + (111362.3 × log_e(f_S)) − 62564620.4

where:

p = house price (in USD)
f_L = latitude
f_V = number of times the property has been viewed
f_B = square footage of the basement
f_G = grade given to the housing unit, based on the King County grading system
f_S = square footage of living space
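If you would like to sanity-check the equation, here it is as a small Python function (the example inputs below are invented purely for illustration):

import numpy as np

def predict_price(lat, views, sqft_basement, grade, sqft_living):
    """Predicted sale price in USD from the fitted model coefficients."""
    return (1296579.6 * lat
            + 60401.32 * views
            + 110.79 * sqft_basement
            + 59144.57 * grade
            + 111362.3 * np.log(sqft_living)
            - 62564620.4)

# Invented example: latitude 47.6, viewed twice, 500 sq ft basement,
# grade 8, 2000 sq ft of living space
print(round(predict_price(47.6, 2, 500, 8, 2000)))   # roughly 650,000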

To say the least, it took quite a bit of work to arrive at this equation. Rather than go through the entire process, I would like to focus upon two roadblocks I encountered during the model-building process and the steps I took to overcome them. First, I will talk about dealing with geographical data (latitude and longitude coordinates). Second, I will discuss my approach to building custom features for the model in order to eliminate multicollinearity.

Geographical Data (lat, long)

Just as with any city, Seattle's population is not uniformly distributed across a square region. Intuitively, there must be numerous bodies of water, public lands, zoning-restricted areas, mountains, and other geographical features that prevent an even distribution of houses across this specific section of the Earth's surface. But how can we account for this within a dataset that only lists each house's latitude and longitude?

I decided to try to get a handle on this information first by graphing the histograms for these variables, in the fool's hope that they were normally distributed.
No such luck. If anything, it seemed like there were three almost independent zones which each exhibited an almost normal distribution.
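For reference, the histograms in question take only a few lines to produce (file and column names are assumptions, as in the previous post):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df['lat'], bins=50)
ax1.set_xlabel('latitude')
ax2.hist(df['long'], bins=50)
ax2.set_xlabel('longitude')
plt.show()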

To make the best of a bad situation, I decided to remove the outliers from this data. But where would be best to draw the line on these histograms? Faced with this impossible question, I turned to a 2-dimensional graph of these variables using hexbins (while also using 'price' to determine the color of each hex).
It turns out that this is essentially an 8-bit representation of a map of Seattle itself:

Using the map as a guide, it became easier to decide where to draw the boundaries for outlier elimination. Anything in the hills to the East or on an island to the West had to go. Additionally, the fairly straight line near the Southern end of the hexbin graph provided a convenient cut-off for dropping latitudinal outliers. This process resulted in a hexbin graph that, while not normally distributed due to internal topographical nuances, had a much more uniform population density across the entirety of the graph than before.
With my data subset honed in on the more uniformly urban areas (having excluded the majority of rural terrain), there was a much lower likelihood of non-observed geographical features (e.g. distance from city) confounding my model. 
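The actual boundary values were chosen by eye from the map, so the numbers below are placeholders, but the filtering itself is only a few lines of pandas:

import pandas as pd

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name

# Placeholder boundaries -- the real cut-offs were read off the hexbin map
LAT_MIN = 47.3     # the straight line near the Southern edge
LONG_MIN = -122.6  # islands to the West
LONG_MAX = -121.7  # hills to the East

df = df[(df['lat'] >= LAT_MIN)
        & (df['long'] >= LONG_MIN)
        & (df['long'] <= LONG_MAX)]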

Building Custom Features

Once I had finished cleaning and scaling my data, it came time to run an Ordinary Least Squares (OLS) analysis on my data and see which variables might be exhibiting multicollinearity. Multicollinearity is undesirable in linear regression models because it means that two or more input variables carry overlapping information, effectively double-counting their effect on your output.

For example, let's say you want to see what effect increasing the square footage of above-ground living space has on a house's price. Thinking about this logically, you can't increase the above-ground square footage without also increasing the property's total square footage of living space (above- and below-ground). Now if your model relies directly on both the total square feet of living space and on the square feet of above-ground living space, you will be counting a 1 unit change more than once! Thus, in this state, your model is less accurate than desired.
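One common way to quantify this overlap is the variance inflation factor (VIF); here is a sketch using statsmodels (the column names are assumptions based on this data set):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name
X = df[['sqft_living', 'sqft_above', 'grade', 'sqft_basement']]

# A VIF well above roughly 5-10 flags serious multicollinearity
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)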

To eliminate multicollinearity as much as possible, I decided to create a new feature (variable) that assigned a weight to each of the inputs. In the process, I created a function in python that would allow me to quickly calculate the appropriate weight for any two input variables, thus permitting me to test different combinations quickly.

Here's a correlation heatmap, displaying a high degree of collinearity between several variables (the beige/orange squares).
And here's the code I used to determine the weights:
(many thanks to Rafael Carrasco for inspiration on part of the for loop)
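One plausible shape for such a function: scan candidate weights w between 0 and 1, combine the two variables as w*a + (1-w)*b, and keep the weight whose combination correlates most strongly with price. The sketch below follows that idea; the scan granularity and the use of absolute correlation are assumptions, not necessarily the original implementation:

import numpy as np
import pandas as pd

df = pd.read_csv('kc_house_data.csv')   # hypothetical file name

def best_weight(df, col_a, col_b, target='price', steps=100):
    # Scan w in [0, 1]; return the w for which w*col_a + (1-w)*col_b
    # correlates most strongly with the target column
    best_w, best_corr = 0.0, -1.0
    for w in np.linspace(0, 1, steps + 1):
        combined = w * df[col_a] + (1 - w) * df[col_b]
        corr = abs(combined.corr(df[target]))
        if corr > best_corr:
            best_w, best_corr = w, corr
    return best_w, best_corr

# Example: blend the two most collinear variables into a single feature
w, _ = best_weight(df, 'sqft_living', 'sqft_above')
df['sqft_living_sqft_above_feature'] = (
    w * df['sqft_living'] + (1 - w) * df['sqft_above']
)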
With this function now defined, I was able to work my way through different pairs of variables to eliminate multicollinearity. I started by making a feature out of the combination of sqft_living and sqft_above, as they were the two variables that showed the highest degree of correlation. After that, I combined this new feature with the variable grade, creating yet another hybrid feature ('grade_sqft_living_sqft_above_feature'), since my first custom feature and 'grade' showed a fairly high degree of collinearity. Amidst this process a few other variables were dropped outright, but the correlation heatmap of my final list of model variables is quite telling in terms of how much multicollinearity I was able to eliminate: