Wednesday, July 31, 2019

Zillow Time Series Analysis

I recently completed a data science project analyzing housing price time series data provided by Zillow.com. The goal of the project was to pick a region of the country and determine what the "5 best zip codes for investment" would be for an investment firm. I set my sights on Atlanta, Georgia, but then had to determine what characteristics constituted a "best" zip code. I ultimately decided that an ideal zip code for investment would be one that is highly predictable (low price volatility) and offers a relatively high return on investment (ROI).

First, several libraries and packages need to be imported to Python:

#for math and data structure
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
#for modeling
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose as sd
import itertools
#for visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')


Limiting Scope


After pre-processing and paring down the Zillow data set to only Atlanta, I made initial plots of the time series for each zip code.
As you can see, the housing bubble and crash of the mid-2000s is quite distinct. Since the bubble/crash period comprises fully one half of the time series at my disposal, its inherent volatility would wreak havoc on any attempt to model overarching trends. To remove this confounding factor, I decided to develop my models using only post-2011 data (when the housing market bottomed out and began its slow recovery). Not only would this remove a fair amount of volatility from the data, it would also give more recent trends greater weight. Given the shorter history, I also decided to limit predictions to two years. Since this project is targeted at investment firms and not at potential homeowners, the focus on shorter time horizons seemed justified: firms seek more immediate returns than families planning to stay put for a decade or more.
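
In pandas, that cut-off is a one-line slice (ATL_df here is an assumed name for the pre-processed Atlanta dataframe with a DatetimeIndex):

# Keep only the post-crash recovery period (dataframe name is an assumption)
ATL_recent = ATL_df['2011':]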

ARIMA Modeling


After log-transforming my data to scale down values in the millions of dollars, I began building an ARIMA (Auto-Regressive Integrated Moving Average) model for each zip code. This first required iterating through possible combinations of AR, I, and MA parameter values (p,d,q) to find the combination which produced the model with the lowest AIC score (best fit) for each zip code:

def find_best_pdq(df, label, pdq):
    """
    Input a dataframe, a label, and a list of candidate (p, d, q) values.
    Finds the (p, d, q) combination that returns the lowest AIC.
    Returns a tuple of (label, best pdq).
    """
    import warnings
    warnings.filterwarnings('ignore')
    results = []
    for params in pdq:
        try:
            model = SARIMAX(endog=df, order=params,
                            enforce_stationarity=False,
                            enforce_invertibility=False)
            AIC = model.fit().aic
            results.append([params, AIC])
        except Exception:  # skip parameter combinations that fail to fit
            continue
    results.sort(key=lambda x: x[-1])
    result_pdq = results[0][0]
    return (label, result_pdq)
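
As a rough sketch of how the search might be driven (the parameter ranges below are assumptions, not necessarily the ones I used), the candidate orders can be built with itertools.product and passed straight into the function:

# Candidate (p, d, q) ranges -- assumed for illustration
p = d = q = range(0, 3)
pdq_grid = list(itertools.product(p, d, q))
# ATL_log is the log-transformed Atlanta data used throughout this section
pdq30342 = find_best_pdq(ATL_log[30342], label=30342, pdq=pdq_grid)
# pdq30342[1] now holds the lowest-AIC (p, d, q) order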

A quick investigation of the data with the seasonal_decompose function from statsmodels showed that Atlanta housing price trends do not exhibit any significant seasonality. A minimal version of that check might look like this (run on the same zip code modeled just below):
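
# Sketch of the seasonality check; assumes the series' DatetimeIndex carries
# a monthly frequency (otherwise pass the period explicitly)
decomposition = sd(ATL_log[30342], model='additive')
decomposition.plot();

With seasonality ruled out, I first modeled a single zip code: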

# plug the optimal parameter values into a new SARIMAX model.
mod_SARIMAX = SARIMAX(endog=ATL_log[30342], order=pdq30342[1],
                            enforce_stationarity=False,
                            enforce_invertibility=False)
# Fit the model and print results
res_SARIMAX = mod_SARIMAX.fit()
display(res_SARIMAX.summary())
res_SARIMAX.plot_diagnostics(figsize=(15,10));
# Get predictions starting from 2016 and calculate confidence intervals.
prediction = res_SARIMAX.get_prediction(start=pd.to_datetime('2016-04'),
                                                             dynamic=True)
pred_conf = prediction.conf_int()
# Plot observed values
ATL_log[30342]['2012':].plot(label='observed')
# Plot predicted values
prediction.predicted_mean.plot(figsize=(15,4),color='g', label='predicted')
# Plot the range for confidence intervals
plt.fill_between(pred_conf.index,
                pred_conf.iloc[:, 0],
                pred_conf.iloc[:, 1], color='g', alpha=.3)
# Set axes labels
plt.xlabel('Date', fontsize=16)
plt.ylabel('Log-transformed Mean Value', fontsize=16)
plt.legend();


As one can see from this model's output, despite the upward trend of the prediction line, the confidence interval is so wide that it is still possible to lose money on a house in this zip code. The intervals are as wide as they are largely due to the volatility of the price over time (the presence of pronounced peaks and troughs in the data). This posed a problem: how best to select for zip codes with as little volatility as possible?

Unaware of any simple metric that measures confidence interval spread in a fashion that allows one zip code to be compared to another, I opted to create my own. Dubbed the Parker Confidence Score (PCS), the new metric is 1 minus the ratio of the confidence interval's width to its upper bound, both taken (in dollars) at the final forecast step, so wider intervals yield lower scores. This might best be explained visually:


And in code:

def confidence_score(pred_conf):
    """
    This function takes in a prediction's confidence interval
    and calculates the difference between the upper and lower bounds at the
    greatest X value as a proportion of the upper bound maximum. The purpose 
    is to provide a metric by which to compare model variability.
    A model with more uncertainty will have a lower confidence score.
    -----------------------
    Since I made this up, I'm calling it the Parker Confidence Score (PCS).
    """
    pred_conf_usd = np.exp(pred_conf)  #reversing log-transformation
    conf_score = 1 - ((pred_conf_usd['upper MeanValue'].iloc[-1]
                       - pred_conf_usd['lower MeanValue'].iloc[-1])
                      / pred_conf_usd['upper MeanValue'].iloc[-1])
    return conf_score

With this metric in hand, I filtered all of the zip codes for only the ones with a PCS of greater than .8, then removed any instances where the lower interval sank below the initial prediction value, and finally selected the zip codes from this group which had the highest ROI. This resulted in the determination that the "best" Atlanta zip codes for investment firms to purchase houses are 30317, 30030, 30341, 30087, and 30319 which correspond to the areas of Kirkwood, Decatur, Chamblee, Rockbridge, and Brookhaven, respectively.
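
A condensed sketch of that filter (the summary dataframe and its column names are hypothetical; in practice the metrics were collected while looping over every zip code's model):

# Hypothetical per-zip-code summary of the metrics described above
summary = pd.DataFrame(zip_metrics,
                       columns=['ZipCode', 'PCS', 'LowerBoundFinal',
                                'InitialValue', 'ROI'])
# Keep only tightly bounded forecasts...
candidates = summary[summary['PCS'] > 0.8]
# ...whose worst-case forecast never dips below the starting value...
candidates = candidates[candidates['LowerBoundFinal'] > candidates['InitialValue']]
# ...and rank the survivors by projected return on investment
top_five = candidates.sort_values('ROI', ascending=False).head(5)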




Sunday, July 14, 2019

Northwind Exploration

For my latest final project, I was given Microsoft's Northwind database and instructed to develop four questions designed to improve the company's business. Among the questions I developed, one in particular posed considerable challenges from a coding perspective:
  • Is there a correlation between product category and customer region? (i.e. do certain categories of product sell better/worse depending on the region?)
For such a question I thought it would be best to first produce some stacked bar plots before attempting any hypothesis testing, just to see if there were any immediately obvious problems or nuances worthy of further investigation. This, however, was a bit more difficult than I had anticipated.

The first difficulty I encountered was making stacked bar plots. Neither Matplotlib nor Seaborn has a built-in stacked-bar-plot method, so I had to define my own functions to automate the process since I anticipated needing to generate numerous plots for comparison. Fortunately, the pyplot.bar() method does have a parameter called bottom that defines the y-axis location where each bar starts.
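
As a toy illustration of that parameter (made-up numbers), stacking two segments is just two calls to plt.bar(), with the second segment starting where the first ends:

# Two-segment stacked bars: the second segment sits on top of the first
x = np.arange(3)
lower = np.array([4, 2, 5])
upper = np.array([1, 3, 2])
plt.bar(x, lower, label='lower segment')
plt.bar(x, upper, bottom=lower, label='upper segment')
plt.legend();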

However, I intended to make multiple stacked bars side-by-side, so I needed to iterate through the process of aggregating my data by both region and by product category as a list of lists (list of regions, each with a sublist of categories). To facilitate this process, I used a SQL query to structure a dataframe in a usable fashion:
c.execute("""
          SELECT cat.CategoryName, c.Region,
          COUNT(Distinct c.Id) as Customers, SUM(od.Quantity) as TotalUnits, 
          SUM(od.Quantity*od.UnitPrice*(1-od.Discount)) as TotalSpending, COUNT(o.Id) as TotalOrders
          FROM OrderDetail od
          JOIN 'Order' o ON od.OrderId = o.Id
          JOIN Customer c ON o.CustomerId = c.Id
          JOIN Product p ON p.Id = od.ProductId
          JOIN Category cat ON p.CategoryId = cat.Id
          GROUP BY 1,2;
          """)
region_df = pd.DataFrame(c.fetchall())
region_df.columns = [x[0] for x in c.description]
display(region_df.head(10))
region_df.info()
This produces a dataframe which looks like this:


Stacked Bar Plots

I defined two functions to efficiently graph stacked bar plots by region (each region's bar being subdivided by product categories). The real trick was to make each product category its own bar plot and then to overlay them all on top of each other, with the height of the underlying bar becoming the bottom of the bar section above it.
def make_dataset(dataframe, group_col_name, y_axis_col_name):
    """
    Function creates dataset ready for the stacked_bar_plot() function, given an 
    input dataframe and the names of the input columns you wish to examine.
    -------------------------------------
    dataframe = name of your pandas dataframe
    group_col_name = name of column defining groups within each bar
    y_axis_col_name = name of column supplying bar plot values
    -------------------------------------
    Outputs: a dataset for use with the stacked_bar_plot() function
    """
    # first we need to make a dataset (list of lists)
    legend = list(dataframe[group_col_name].unique())
    dataset= []
    for category in legend:
        cat_group = []
        for i in dataframe.index:
            if dataframe[group_col_name][i] == category:
                cat_group.append(dataframe[y_axis_col_name][i])
        dataset.append(cat_group)
    return dataset

def stacked_bar_plot(dataframe, group_col_name, x_axis_col_name, 
                     y_axis_col_name, dataset=None, auto_dataset=False, figsize=(15,8)):
    """
    Function creates stacked bar plots given an input dataframe, the
    names of the input columns you wish to examine, and a dataset
    from the make_dataset() function.
    -------------------------------------
    dataframe = name of your pandas dataframe
    group_col_name = name of column defining groups within each bar
    x_axis_col_name = name of column defining bar plot categories
    y_axis_col_name = name of column supplying bar plot values
    dataset (optional) = output of the make_dataset() function...if missing,
                         then you must set auto_dataset=True
    auto_dataset = if True, runs make_dataset() function (default is False),
                   only set to True if no dataset has already been defined
    figsize = optional tuple to define size of graph
    -------------------------------------
    Outputs: a stacked bar graph
    """
    if auto_dataset:
        dataset = make_dataset(dataframe, group_col_name, y_axis_col_name)
    legend = list(dataframe[group_col_name].unique())
    X_AXIS = list(dataframe[x_axis_col_name].unique())
    ind = np.arange(len(dataset[0]))
    
    plt.figure(figsize=figsize)
    bottom = np.array(dataset[0])
    for i in list(range(len(dataset))):
        if i == 0:
            plt.bar(ind, dataset[i], label=legend[i])
        else:
            plt.bar(ind, dataset[i], label=legend[i], bottom=bottom)
            bottom += np.array(dataset[i])

    plt.yticks(fontsize=12)
    plt.ylabel(f'Relative {y_axis_col_name} by {group_col_name}', fontsize=12)
    plt.xticks(ind, X_AXIS, rotation=45, fontsize=12)
    plt.xlabel(f'{x_axis_col_name}', fontsize=12)
    plt.title(f'{group_col_name} {y_axis_col_name} by {x_axis_col_name}', fontsize=16)
    plt.legend(fontsize=12);
Running the functions on my dataframe produced excellent results.
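
The call itself looked something like this (the y-axis column here is an assumption; any of the aggregate columns from the query behaves the same way):

stacked_bar_plot(region_df, group_col_name='CategoryName',
                 x_axis_col_name='Region', y_axis_col_name='TotalSpending',
                 auto_dataset=True)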


Pleased though I was, it was obvious that the regions had vastly differing numbers of sales. To make comparison a bit more balanced, I decided to create two new combined regions ('Latin America' and 'Northeastern Europe') to pool the data from the smallest groups. This was fairly simple to do with a for loop (example is NE Europe):
region2_df = region_df.copy()

# Make our Northeastern Europe group
NE_Europe = ['Northern Europe','Eastern Europe','Scandinavia']
NE_cat_df = pd.DataFrame()
for i in list(region2_df.CategoryName.unique()):
    if NE_cat_df.empty:
        NE_cat_df = pd.DataFrame(region2_df.loc[(region2_df.CategoryName == i)
                                                & (region2_df.Region.isin(NE_Europe))]
                                 .groupby(['CategoryName'], as_index=False).sum())
    else:
        derp = pd.DataFrame(region2_df.loc[(region2_df.CategoryName == i)
                                           & (region2_df.Region.isin(NE_Europe))]
                            .groupby(['CategoryName'], as_index=False).sum())
        NE_cat_df = pd.concat([NE_cat_df, derp])
NE_cat_df.insert(1, 'Region', 'Northeastern Europe')
Then it was a simple matter of concatenating the results with the original dataframe and filtering out the regions that were combined using a .loc method.
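
Something along these lines does the job (LatAm_cat_df is the parallel Latin America frame, a name I'm assuming here):

# Drop the individual regions that were pooled, then append the combined rows
pooled = ['Northern Europe', 'Eastern Europe', 'Scandinavia',
          'South America', 'Central America']
region2_df = region2_df.loc[~region2_df.Region.isin(pooled)]
region2_df = pd.concat([region2_df, NE_cat_df, LatAm_cat_df], sort=False)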

Though this improved the situation, the regions were still operating on vastly different scales. To tighten things a bit further, I decided to compare mean values (dividing each region's totals by its number of customers) instead of sums. Doing so produced a much more comprehensible chart, especially with an additional bar for the whole-population means.
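
A sketch of that normalization (I'm assuming the MeanSpending column relied on later was derived this way):

# Per-customer means: divide each region's totals by its customer count
region2_df['MeanSpending'] = region2_df['TotalSpending'] / region2_df['Customers']
region2_df['MeanUnits'] = region2_df['TotalUnits'] / region2_df['Customers']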


Wonderful. With the graph above, it was plain to see that customers from different regions spend differing proportions of their budgets on various categories. For instance, just compare the thickness of the brown bands for meat and poultry between the British Isles and Western Europe.

Hypothesis Testing

Now fairly confident that a sizable difference was present, I needed to support the assertion with statistics. For this circumstance I thought it best to generate sampling distributions and then compare them using Welch's t-test p-values, Cohen's d effect sizes, and statistical power.

Rather than code each test manually, defining a few functions was the more efficient route:
def create_sample_distribution(data, dist_size=30, n=30):
    """
    Inputs:
    - data : original dataset
    - dist_size : number of samples to generate
    - n : size of each sample
    =======================
    Returns:
    - sample distribution of sample means
    """
    sample_distr = []
    np.random.seed(37)   #so tests are reproducible
    for i in range(0,dist_size):
        sample = []
        i_list = list(np.random.randint(0,len(data),n))
        for item in i_list:
            sample.append(data[item])
        sample_distr.append(np.mean(sample))
    return sample_distr


def welch_t(a, b):
    """ Calculate Welch's t statistic for two samples. """
    numerator = np.mean(a) - np.mean(b)
    denominator = np.sqrt(np.var(a, ddof=1)/len(a) + np.var(b, ddof=1)/np.size(b))
    return abs(numerator/denominator)


def welch_df(a, b):
    """ Calculate the effective degrees of freedom for two samples. """
    num = (np.var(a, ddof=1)/np.size(a) + np.var(b, ddof=1)/np.size(b))**2
    denom = (((np.var(a, ddof=1)**2) / (((np.size(a))**2) * (np.size(a)-1))) 
             + ((np.var(b, ddof=1)**2) / (((np.size(b))**2) * (np.size(b)-1))))
    return num/denom


from scipy import stats  # provides the t-distribution CDF used below


def p_value(a, b, two_sided=False):
    """
    Returns the p-value from a Welch's t-test given two datasets 
    (lists, arrays, or series).
    """
    t = welch_t(a,b)
    df = welch_df(a,b)
    p = 1 - stats.t.cdf(t, df)
    if two_sided:
        p += p
    return p


def cohen_d(group1, group2):
    """
    group1: Series or NumPy array
    group2: Series or NumPy array
    returns a floating point number 
    """
    diff = np.mean(group1) - np.mean(group2)
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1)
    var2 = np.var(group2)
    # Calculate the pooled threshold as shown earlier
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    # Calculate Cohen's d statistic
    d = abs(diff / np.sqrt(pooled_var))    
    return d
From there, I had to run a new SQL query in order to access all of the original observations (rather than just the aggregates I had been examining).
c.execute("""
          SELECT cat.CategoryName, c.Region, od.Quantity, 
          (od.Quantity*od.UnitPrice*(1-od.Discount)) as OrderPrice
          FROM OrderDetail od
          JOIN 'Order' o ON od.OrderId = o.Id
          JOIN Customer c ON o.CustomerId = c.Id
          JOIN Product p ON p.Id = od.ProductId
          JOIN Category cat ON p.CategoryId = cat.Id;
          """)
population_df = pd.DataFrame(c.fetchall())
population_df.columns = [x[0] for x in c.description]
for i in ['Northern Europe','Eastern Europe','Scandinavia']:
    population_df.replace(i, 'Northeastern Europe', inplace=True)
for i in ['South America','Central America']:
    population_df.replace(i, 'Latin America', inplace=True)
As a final step, I wrote several nested for loops to generate sampling distributions and statistics for each region-category combination (40 in total!). The results were then filtered for only the combinations that had a p-value < 0.05, a statistical power > 0.8, and a Cohen's d score > 0.1.
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()  # power solver; assumes an independent two-sample t-test setup

sig_list = []
for region in list(population_df.Region.unique()):
    for category in list(population_df.CategoryName.unique()):
        pops_df = list(population_df.loc[population_df.CategoryName == category].OrderPrice)
        reg_df = list(population_df.loc[(population_df.CategoryName == category) 
                                        & (population_df.Region == region)].OrderPrice)
        pop_samps = create_sample_distribution(pops_df, 30, 30)
        region_samps = create_sample_distribution(reg_df, 30, 30)
        p = p_value(pop_samps, region_samps, two_sided=True)
        d = cohen_d(pop_samps, region_samps)
        power = power_analysis.solve_power(effect_size=d, nobs1=len(pop_samps), alpha=.05)
        if p < 0.05 and power > 0.8 and d > 0.1:
            sig_list.append((region,category))
        print(f'{region} {category} compared to population samples')
        print(f"Welch's t-test p-value: {p}")
        print(f"Effect size (Cohen's d): {d}")
        print(f"Statistical Power (1-B): {power}\n")
Finally, using the filtered results from above, I was able to divide the statistically significant pairs into two groups: 'increase sales' and 'decrease sales' (labeled according to the business recommendation I would make). These were generated by calculating the proportion of its mean budget that each region spent on each category and comparing that to the population's mean proportions. Combinations whose proportion exceeded the population's signified exceptionally high demand, and therefore opportunities that Northwind salespersons should try to exploit. The opposite result suggests combinations that salespersons should not devote overmuch effort to pushing, given the lower expected return on time invested.
# region2_df holds the per-region mean spending; pop_df the population-level means
increase_sales_push = []
decrease_sales_push = []
for region, category in sig_list:
    a = region2_df.loc[(region2_df.CategoryName == category)
                       &(region2_df.Region == region)].MeanSpending
    a_prime = sum(region2_df.loc[region2_df.Region == region].MeanSpending)
    b = pop_df.loc[pop_df.CategoryName == category].MeanSpending
    b_prime = sum(pop_df.MeanSpending)
    if float(a)/a_prime < float(b)/b_prime:
        decrease_sales_push.append((region,category))
    elif float(a)/a_prime > float(b)/b_prime:
        increase_sales_push.append((region,category))
In the end, I was able to break the Northwind database down like so: