First, several libraries and packages need to be imported into Python:
# for math and data structures
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
# for modeling
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose as sd
import itertools
# for visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
Limiting Scope
As you can see, the housing bubble and crash of the mid-2000s is quite distinct. Since the bubble/crash period comprises fully half of the time series at my disposal, its inherent volatility would wreak havoc on any attempt to model overarching trends. To remove this confounding factor, I decided to develop my models using only post-2011 data, from when the housing market bottomed out and began its slow recovery. Not only does this remove a fair amount of volatility from the data, it also gives more recent trends greater weight. Given the shorter history, I also decided to limit predictions to two years out. Since this project is targeted at investment firms rather than potential homeowners, the focus on a smaller time horizon seemed justified: firms seek more immediate returns than families planning to stay put for a decade or more.
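The trim itself is a one-liner; a minimal sketch, assuming the data lives in a DataFrame with a monthly DatetimeIndex (the variable names here are my own illustration):

# Keep only the post-crash recovery period
ATL_recent = ATL['2011':]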
ARIMA Modeling
After log-transforming my data to scale down values in the millions of dollars, I began building an ARIMA (Auto-Regressive Integrated Moving Average) model for each zip code. This first required iterating through possible combinations of the AR, I, and MA parameter values (p, d, q) to find the combination which produced the model with the lowest AIC score (the best fit, penalized for model complexity) for each zip code:
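The candidate grid itself can be built with itertools.product; a minimal sketch, assuming each of p, d, and q is searched over the range 0–2 (the range here is my own illustrative choice):

# Build every (p, d, q) combination in the search range
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))

The search function then loops over that grid: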
def find_best_pdq(df, label, pdq):
    """
    Input a dataframe, a label, and a list of potential pdq values.
    Finds the combination of pdq values that returns the lowest AIC.
    Returns a tuple of label and pdq.
    """
    import warnings
    warnings.filterwarnings('ignore')
    results = []
    for params in pdq:
        try:
            model = SARIMAX(endog=df, order=params,
                            enforce_stationarity=False,
                            enforce_invertibility=False)
            AIC = model.fit().aic
            results.append([params, AIC])
        except Exception:
            continue
    results.sort(key=lambda x: x[-1])  # sort by AIC, ascending
    result_pdq = results[0][0]         # lowest-AIC parameter combination
    return (label, result_pdq)
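Presumably this search runs once per zip code; a hedged usage sketch, consistent with the pdq30342 variable that appears in the modeling code below:

# Find the best-fitting (p, d, q) for zip code 30342
pdq30342 = find_best_pdq(ATL_log[30342], 30342, pdq)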
A quick investigation of the data with the seasonal_decompose function from statsmodels showed that Atlanta housing price trends do not exhibit any significant seasonality.
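That check might look like the following minimal sketch (using the sd alias imported above; applying it to zip code 30342 is my own example):

# Decompose one series into trend, seasonal, and residual components
decomposition = sd(ATL_log[30342], model='additive', period=12)  # monthly data
decomposition.plot();

With seasonality ruled out, I first modeled a single zip code: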
# Plug the optimal parameter values into a new SARIMAX model
mod_SARIMAX = SARIMAX(endog=ATL_log[30342], order=pdq30342[1],
                      enforce_stationarity=False,
                      enforce_invertibility=False)
# Fit the model and print results
res_SARIMAX = mod_SARIMAX.fit()
display(res_SARIMAX.summary())
res_SARIMAX.plot_diagnostics(figsize=(15,10));
# Get predictions starting from 2016 and calculate confidence intervals
prediction = res_SARIMAX.get_prediction(start=pd.to_datetime('2016-04'),
                                        dynamic=True)
pred_conf = prediction.conf_int()
# Plot observed values
ATL_log[30342]['2012':].plot(label='observed')
# Plot predicted values
prediction.predicted_mean.plot(figsize=(15,4), color='g', label='predicted')
# Plot the range for confidence intervals
plt.fill_between(pred_conf.index,
                 pred_conf.iloc[:, 0],
                 pred_conf.iloc[:, 1], color='g', alpha=.3)
# Set axes labels
plt.xlabel('Date', fontsize=16)
plt.ylabel('Log-transformed Mean Value', fontsize=16)
plt.legend();
As one can see from this model's output, despite the upward trend of the prediction line, the confidence interval is so wide that it is still possible to lose money on a house in this zip code. The interval is as wide as it is largely because of the volatility of the price over time (the pronounced peaks and troughs in the data). This posed a problem: how best to select zip codes with as little volatility as possible?
Unaware of any simple metric that measures confidence interval spread in a fashion that allows comparing one zip code to another, I opted to create my own. Dubbed the Parker Confidence Score (PCS), the new metric is 1 minus the ratio of the confidence interval's width at the final forecast point to its upper bound at that point, so that tighter intervals score closer to 1. In other words: PCS = 1 - (upper - lower) / upper, evaluated at the last prediction after reversing the log transformation.
And in code:
def confidence_score(pred_conf):
    """
    This function takes in a prediction's confidence interval
    and calculates the difference between the upper and lower bounds at the
    greatest X value as a proportion of the upper bound at that point. The
    purpose is to provide a metric by which to compare model variability.
    A model with more uncertainty will have a lower confidence score.
    -----------------------
    Since I made this up, I'm calling it the Parker Confidence Score (PCS).
    """
    pred_conf_usd = np.exp(pred_conf)  # reversing the log transformation
    conf_score = 1 - ((pred_conf_usd['upper MeanValue'].iloc[-1]
                       - pred_conf_usd['lower MeanValue'].iloc[-1])
                      / pred_conf_usd['upper MeanValue'].iloc[-1])
    return conf_score
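Applied to the confidence interval computed earlier, the score reduces each model to a single comparable number; a usage sketch, assuming the underlying series is named MeanValue so the interval's column names match those in the function:

# Score the model's forecast interval
pcs = confidence_score(pred_conf)
print(round(pcs, 3))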
With this metric in hand, I filtered the zip codes down to those with a PCS greater than 0.8, removed any instance where the lower bound of the interval sank below the initial predicted value, and finally selected from this group the zip codes with the highest ROI. The result: the "best" Atlanta zip codes for investment firms to purchase houses in are 30317, 30030, 30341, 30087, and 30319, which correspond to the areas of Kirkwood, Decatur, Chamblee, Rockbridge, and Brookhaven, respectively.
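A minimal sketch of that filtering step, assuming each zip code's model results have been collected into a DataFrame with my own illustrative columns (pcs, lower_final, pred_initial, roi):

# Keep zip codes with tight intervals whose worst case never dips
# below the initial prediction, then rank by projected ROI
candidates = results_df[(results_df['pcs'] > 0.8)
                        & (results_df['lower_final'] >= results_df['pred_initial'])]
top5 = candidates.sort_values('roi', ascending=False).head(5)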



