Sunday, July 14, 2019

Northwind Exploration

For my latest final project, I was given Microsoft's Northwind database and instructed to develop four questions designed to improve the company's business. Among the questions I developed, one in particular posed considerable challenges from a coding perspective:
  • Is there a correlation between product category and customer region? (i.e. do certain categories of product sell better/worse depending on the region?)
For such a question I thought it would be best to first produce some stacked bar plots before attempting any hypothesis testing, just to see if there were any immediately obvious problems or nuances worthy of further investigation. This, however, was a bit more difficult than I had anticipated.

The first difficulty I encountered was making the stacked bar plots themselves. Neither Matplotlib nor Seaborn has a built-in stacked bar plot method, so I had to define my own functions to automate the process, since I anticipated needing to generate numerous plots for comparison. Fortunately, the pyplot.bar() method does have a parameter called bottom that defines the y-axis location for the bottom of each bar.
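To illustrate how bottom works, here is a minimal toy sketch (toy values, not the Northwind data): the second bar() call starts each of its bars where the first call's bar ends.

import matplotlib.pyplot as plt

# toy values: two layers of three bars each
lower = [3, 5, 2]
upper = [4, 1, 6]
plt.bar(range(3), lower, label='lower')
plt.bar(range(3), upper, bottom=lower, label='upper')   # stack on top of `lower`
plt.legend()
plt.show()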

However, I intended to make multiple stacked bars side-by-side, so I needed to aggregate my data by both region and product category as a list of lists (one sublist per product category, holding that category's value for each region). To facilitate this process, I used a SQL query to structure a dataframe in a usable fashion:
c.execute("""
          SELECT cat.CategoryName, c.Region,
          COUNT(Distinct c.Id) as Customers, SUM(od.Quantity) as TotalUnits, 
          SUM(od.Quantity*od.UnitPrice*(1-od.Discount)) as TotalSpending, COUNT(o.Id) as TotalOrders
          FROM OrderDetail od
          JOIN 'Order' o ON od.OrderId = o.Id
          JOIN Customer c ON o.CustomerId = c.Id
          JOIN Product p ON p.Id = od.ProductId
          JOIN Category cat ON p.CategoryId = cat.Id
          GROUP BY 1,2;
          """)
region_df = pd.DataFrame(c.fetchall())
region_df.columns = [x[0] for x in c.description]
display(region_df.head(10))
region_df.info()
This produces a dataframe which looks like this:


Stacked Bar Plots

I defined two functions to efficiently graph stacked bar plots by region (each region's bar being subdivided by product categories). The real trick was to make each product category its own bar plot and then to overlay them all on top of each other, with the height of the underlying bar becoming the bottom of the bar section above it.
def make_dataset(dataframe, group_col_name, y_axis_col_name):
    """
    Creates a dataset ready for the stacked_bar_plot() function, given an
    input dataframe and the names of the columns you wish to examine.
    -------------------------------------
    dataframe = name of your pandas dataframe
    group_col_name = name of column defining groups within each bar
    y_axis_col_name = name of column supplying bar plot values
    -------------------------------------
    Outputs: a dataset (list of lists) for use with the stacked_bar_plot() function
    """
    # build one sublist of y-values per group, in the dataframe's row order
    # (assumes every group covers the same bars in the same order)
    legend = list(dataframe[group_col_name].unique())
    dataset = []
    for category in legend:
        mask = dataframe[group_col_name] == category
        dataset.append(list(dataframe.loc[mask, y_axis_col_name]))
    return dataset

def stacked_bar_plot(dataframe, group_col_name, x_axis_col_name, 
                     y_axis_col_name, dataset=None, auto_dataset=False, figsize=(15,8)):
    """
    Creates a stacked bar plot given an input dataframe, the names of
    the columns you wish to examine, and a dataset from the
    make_dataset() function.
    -------------------------------------
    dataframe = name of your pandas dataframe
    group_col_name = name of column defining groups within each bar
    x_axis_col_name = name of column defining bar plot categories
    y_axis_col_name = name of column supplying bar plot values
    dataset (optional) = output of the make_dataset() function...if missing,
                         then you must set auto_dataset=True
    auto_dataset = if True, runs make_dataset() function (default is False);
                   only set to True if no dataset has already been defined
    figsize = optional tuple to define size of graph
    -------------------------------------
    Outputs: a stacked bar graph
    """
    if auto_dataset:
        dataset = make_dataset(dataframe, group_col_name, y_axis_col_name)
    legend = list(dataframe[group_col_name].unique())
    X_AXIS = list(dataframe[x_axis_col_name].unique())
    ind = np.arange(len(dataset[0]))

    plt.figure(figsize=figsize)
    # each group is drawn as its own bar layer; `bottom` tracks the running
    # height of the layers already drawn, so each new layer sits on top
    bottom = np.zeros(len(dataset[0]))
    for i in range(len(dataset)):
        plt.bar(ind, dataset[i], label=legend[i], bottom=bottom)
        bottom += np.array(dataset[i])

    plt.yticks(fontsize=12)
    plt.ylabel(f'Relative {y_axis_col_name} by {group_col_name}', fontsize=12)
    plt.xticks(ind, X_AXIS, rotation=45, fontsize=12)
    plt.xlabel(f'{x_axis_col_name}', fontsize=12)
    plt.title(f'{group_col_name} {y_axis_col_name} by {x_axis_col_name}', fontsize=16)
    plt.legend(fontsize=12);
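For reference, a call on the region_df built above might look like this (using auto_dataset to build the list of lists internally):

stacked_bar_plot(region_df, 'CategoryName', 'Region', 'TotalSpending',
                 auto_dataset=True)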
Running the functions using my dataframe produced excellent results:


Pleased though I was, it was obvious that the regions had vastly differing numbers of sales. To make the comparison more balanced, I decided to create two new combined regions ('Latin America' and 'Northeastern Europe') to pool the data from the smallest groups. This was fairly simple to do with a for loop (the example shown is NE Europe):
region2_df = region_df.copy()

# Make our Northeastern Europe group
NE_Europe = ['Northern Europe','Eastern Europe','Scandinavia']
NE_cat_df = pd.DataFrame()
for i in list(region2_df.CategoryName.unique()):
    # sum the NE-European regions' rows for this category into a single row
    cat_rows = pd.DataFrame(region2_df.loc[(region2_df.CategoryName == i)
                                           & (region2_df.Region.isin(NE_Europe))]
                            .groupby(['CategoryName'], as_index=False).sum())
    if NE_cat_df.empty:
        NE_cat_df = cat_rows
    else:
        NE_cat_df = pd.concat([NE_cat_df, cat_rows])
NE_cat_df.insert(1, 'Region', 'Northeastern Europe')
Then it was a simple matter of concatenating the results with the original dataframe and filtering out the regions that had been combined with a .loc mask.
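A sketch of that step, where LatAm_cat_df is the (hypothetically named) Latin America frame built the same way as NE_cat_df above:

# drop the original small regions once their pooled rows have been added
combined = NE_Europe + ['South America', 'Central America']
region2_df = pd.concat([region2_df, NE_cat_df, LatAm_cat_df])
region2_df = region2_df.loc[~region2_df.Region.isin(combined)]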

Though this improved the situation, the regions were still operating on vastly different scales. To tighten things a bit further, I decided to compare mean values (dividing totals by the number of customers per region) instead of sums.
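The conversion itself is just a couple of column divisions (a sketch assuming the column names from the SQL query above; the MeanSpending column reappears later in the post):

region2_df['MeanSpending'] = region2_df.TotalSpending / region2_df.Customers
region2_df['MeanUnits'] = region2_df.TotalUnits / region2_df.Customers

Doing so produced a much more comprehensible chart, especially with an additional bar for the entire sample population's means: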


Wonderful. With the graph above, it was plain to see that customers from different regions spend differing proportions of their budgets on the various categories. For instance, just compare the thickness of the brown Meat/Poultry bands between the British Isles and Western Europe.

Hypothesis Testing

Now that I was fairly confident a sizable difference was present, it was important to support this assertion with statistics. For this circumstance I thought it best to generate sampling distributions of sample means and then compare them using Welch's t-test p-values, Cohen's d effect sizes, and statistical power.

Rather than code each test manually, defining a few functions was the more efficient route:
def create_sample_distribution(data, dist_size=30, n=30):
    """
    Inputs:
    - data : original dataset
    - dist_size : number of samples to generate
    - n : size of each sample
    =======================
    Returns:
    - sampling distribution of sample means
    """
    sample_distr = []
    np.random.seed(37)   # fixed seed so the tests are reproducible
    for _ in range(dist_size):
        # draw n observations with replacement and record their mean
        i_list = np.random.randint(0, len(data), n)
        sample = [data[item] for item in i_list]
        sample_distr.append(np.mean(sample))
    return sample_distr


def welch_t(a, b):
    """ Calculate Welch's t statistic for two samples. """
    numerator = np.mean(a) - np.mean(b)
    denominator = np.sqrt(np.var(a, ddof=1)/np.size(a) + np.var(b, ddof=1)/np.size(b))
    return abs(numerator/denominator)


def welch_df(a, b):
    """ Calculate the effective degrees of freedom for two samples. """
    num = (np.var(a, ddof=1)/np.size(a) + np.var(b, ddof=1)/np.size(b))**2
    denom = (((np.var(a, ddof=1)**2) / (((np.size(a))**2) * (np.size(a)-1))) 
             + ((np.var(b, ddof=1)**2) / (((np.size(b))**2) * (np.size(b)-1))))
    return num/denom


def p_value(a, b, two_sided=False):
    """
    Returns the p-value from a Welch's t-test given two datasets 
    (lists, arrays, or series).
    """
    t = welch_t(a, b)
    df = welch_df(a, b)
    p = 1 - stats.t.cdf(t, df)
    if two_sided:
        p *= 2   # t is already an absolute value, so doubling gives the two-sided p
    return p
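As a quick sanity check (my aside, not part of the original analysis), these functions should reproduce scipy's built-in Welch's test:

import numpy as np
from scipy import stats

rng = np.random.RandomState(37)
a = rng.normal(0, 1, 30)
b = rng.normal(0.5, 1, 30)
print(p_value(a, b, two_sided=True))              # hand-rolled Welch's p-value
print(stats.ttest_ind(a, b, equal_var=False)[1])  # scipy's Welch's test (two-sided)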


def cohen_d(group1, group2):
    """
    group1: Series or NumPy array
    group2: Series or NumPy array
    returns a floating point number 
    """
    diff = np.mean(group1) - np.mean(group2)
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1)
    var2 = np.var(group2)
    # Calculate the pooled variance (weighted by group size)
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    # Calculate Cohen's d statistic
    d = abs(diff / np.sqrt(pooled_var))    
    return d
From there, I had to run a new SQL query in order to access all of the original observations (rather than just the aggregates I had been examining).
c.execute("""
          SELECT cat.CategoryName, c.Region, od.Quantity, 
          (od.Quantity*od.UnitPrice*(1-od.Discount)) as OrderPrice
          FROM OrderDetail od
          JOIN 'Order' o ON od.OrderId = o.Id
          JOIN Customer c ON o.CustomerId = c.Id
          JOIN Product p ON p.Id = od.ProductId
          JOIN Category cat ON p.CategoryId = cat.Id;
          """)
population_df = pd.DataFrame(c.fetchall())
population_df.columns = [x[0] for x in c.description]
for i in ['Northern Europe','Eastern Europe','Scandinavia']:
    population_df.replace(i, 'Northeastern Europe', inplace=True)
for i in ['South America','Central America']:
    population_df.replace(i, 'Latin America', inplace=True)
As a final step, I wrote several nested for loops to generate sampling distributions and statistics for each region-category combination (40 in total!). The results were then filtered to keep only the combinations with a p-value < 0.05, a statistical power > 0.8, and a Cohen's d > 0.1.
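One note on the snippet below: power_analysis is an instance of statsmodels' TTestIndPower, whose setup isn't shown in the post:

from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()   # solve_power returns power when `power` is left unset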
sig_list = []
for region in list(population_df.Region.unique()):
    for category in list(population_df.CategoryName.unique()):
        # plain lists of order prices: one for the whole population, one for the region
        pop_prices = list(population_df.loc[population_df.CategoryName == category].OrderPrice)
        reg_prices = list(population_df.loc[(population_df.CategoryName == category)
                                            & (population_df.Region == region)].OrderPrice)
        pop_samps = create_sample_distribution(pop_prices, 30, 30)
        region_samps = create_sample_distribution(reg_prices, 30, 30)
        p = p_value(pop_samps, region_samps, two_sided=True)
        d = cohen_d(pop_samps, region_samps)
        power = power_analysis.solve_power(effect_size=d, nobs1=len(pop_samps), alpha=.05)
        if p < 0.05 and power > 0.8 and d > 0.1:
            sig_list.append((region,category))
        print(f'{region} {category} compared to population samples')
        print(f"Welch's t-test p-value: {p}")
        print(f"Effect size (Cohen's d): {d}")
        print(f"Statistical Power (1-B): {power}\n")
Finally, using the filtered results from above, I was able to divide the statistically significant pairs into two groups: 'increase sales' and 'decrease sales' (labeled according to the business recommendation I would make). These were generated by calculating the proportion of its mean total budget that each region spent on each category, then comparing that to the population's mean proportions. Combinations that came in above the population signified exceptionally high demand, and therefore opportunities that Northwind salespeople should try to exploit. The opposite result suggested combinations that salespeople should not devote much effort to pushing, given the lower expected return on time invested.
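One detail not shown earlier: pop_df holds population-level aggregates with the same MeanSpending column as region2_df. A sketch of how it might have been built (an assumption on my part, since the original construction isn't shown):

# assumed construction: aggregate every region together, per category
pop_df = region_df.groupby('CategoryName', as_index=False).sum()
pop_df['MeanSpending'] = pop_df.TotalSpending / pop_df.Customers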
increase_sales_push = []
decrease_sales_push = []
for region, category in sig_list:
    # region's mean spending on this category, vs. its mean spending overall
    a = region2_df.loc[(region2_df.CategoryName == category)
                       & (region2_df.Region == region)].MeanSpending
    a_prime = sum(region2_df.loc[region2_df.Region == region].MeanSpending)
    # population's mean spending on this category, vs. its mean spending overall
    b = pop_df.loc[pop_df.CategoryName == category].MeanSpending
    b_prime = sum(pop_df.MeanSpending)
    if float(a)/a_prime < float(b)/b_prime:
        decrease_sales_push.append((region, category))
    elif float(a)/a_prime > float(b)/b_prime:
        increase_sales_push.append((region, category))
In the end, I was able to break the Northwind database down like so:

