Pandas Real-Time Case Study Using Retail Datasets

Kajol Ajab · 27 January 2020

Introduction

Content

You are provided with historical sales data for 45 stores located in different regions - each store contains a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks.

Within the Excel workbook there are three tabs: Stores, Features, and Sales.

Stores

Anonymized information about the 45 stores, indicating the type and size of store

Features

Contains additional data related to the store, department, and regional activity for the given dates.

  1. Store - the store number
  2. Date - the week
  3. Temperature - average temperature in the region
  4. Fuel_Price - cost of fuel in the region
  5. MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
  6. CPI - the consumer price index
  7. Unemployment - the unemployment rate
  8. IsHoliday - whether the week is a special holiday week

Sales

Historical sales data, covering 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:

  1. Store - the store number
  2. Dept - the department number
  3. Date - the week
  4. Weekly_Sales - sales for the given department in the given store
  5. IsHoliday - whether the week is a special holiday week

 

Objectives

  1. Diagnose data for cleaning
  2. Merge the data in a unique DataFrame
  3. Subsetting
  4. Grouped calculations
  5. Visualization
  6. Correlation
    • Calculating the Pearson Correlation
    • Kendall Rank Correlation
    • Kendall Rank Correlation Using .corr()
    • Visualize using a Heat-map
  7. Outliers
    • Detecting outliers using the Z-score
    • Interquartile Range

    

Diagnose data for cleaning
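The diagnosis code itself isn't reproduced here, so the following is a minimal sketch of the step, using a few made-up rows in place of the workbook's Features tab (column names follow the field list above; the values are toy data):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the Features tab (values are made up).
features = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "Temperature": [42.31, 38.51, 40.19, np.nan],
    "MarkDown1": [np.nan, np.nan, 2000.0, 1500.0],  # only available after Nov 2011
})

features.info()                    # dtypes and non-null counts per column
missing = features.isnull().sum()  # how many NaNs each column carries
print(missing)
```

`.info()` and `.isnull().sum()` together flag which columns need attention (here, the MarkDown columns and any gaps in Temperature) before merging.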

Merge the data in a unique DataFrame
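A sketch of the merge, again with toy stand-ins for the three tabs; the join keys (Store, Date) follow from the field lists above:

```python
import pandas as pd

# Toy stand-ins for the Sales, Features, and Stores tabs.
sales = pd.DataFrame({"Store": [1, 1, 2], "Dept": [1, 2, 1],
                      "Date": ["2010-02-05"] * 3,
                      "Weekly_Sales": [24924.50, 50605.27, 13740.12]})
features = pd.DataFrame({"Store": [1, 2], "Date": ["2010-02-05"] * 2,
                         "Temperature": [42.31, 40.19],
                         "IsHoliday": [False, False]})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "A"],
                       "Size": [151315, 202307]})

# Merge sales with features on the shared keys, then attach store metadata.
df = sales.merge(features, on=["Store", "Date"], how="left")
df = df.merge(stores, on="Store", how="left")
print(df.shape)
```

A left join keeps every sales row even when a matching features row is missing, which matters given the gaps in the MarkDown data.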

Subsetting
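A few typical subsetting patterns on the merged frame (toy rows again):

```python
import pandas as pd

df = pd.DataFrame({"Store": [1, 1, 2], "Dept": [1, 2, 1],
                   "Weekly_Sales": [24924.50, 50605.27, 13740.12],
                   "IsHoliday": [False, True, False]})

# Column subset: keep only the columns of interest
cols = df[["Store", "Weekly_Sales"]]

# Row subset with a boolean mask: holiday weeks only
holiday_weeks = df[df["IsHoliday"]]

# Label-based selection of a single row
first_row = df.loc[0]
print(holiday_weeks)
```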

Grouped calculations
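A typical grouped calculation for this dataset is aggregating Weekly_Sales per store; a minimal sketch:

```python
import pandas as pd

# Toy sales rows; two stores, two weeks each.
df = pd.DataFrame({"Store": [1, 1, 2, 2],
                   "Weekly_Sales": [100.0, 200.0, 50.0, 150.0]})

# Total and average weekly sales per store
per_store = df.groupby("Store")["Weekly_Sales"].agg(["sum", "mean"])
print(per_store)
```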

Visualization
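One plausible visualization is Weekly_Sales over time; a sketch using pandas' built-in matplotlib plotting (the dates and sales figures below are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=4, freq="W"),
    "Weekly_Sales": [24924.50, 46039.49, 41595.55, 19403.54],
})

ax = df.plot(x="Date", y="Weekly_Sales", kind="line",
             title="Weekly sales over time")
ax.set_ylabel("Weekly_Sales")
plt.savefig("weekly_sales.png")  # illustrative output filename
```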

Correlation

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

So, why is correlation a useful metric?

  1. Correlation can help in predicting one quantity from another
  2. Correlation can (but often does not, as we will see in some examples below) indicate the presence of a causal relationship
  3. Correlation is used as a basic quantity and foundation for many other modeling techniques

More formally, correlation is a statistical measure that describes the association between random variables. There are several methods for calculating the correlation coefficient, each measuring different types of strength of association. Below we summarize three of the most widely used methods.

Calculating the Pearson Correlation

We’ll use method='pearson' in DataFrame.corr() since we want to calculate the Pearson coefficient of correlation. Then we’ll print it out!
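A sketch of that call on toy columns standing in for the merged retail data:

```python
import pandas as pd

# Made-up values: sales rise while unemployment falls.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 300.0, 400.0],
    "Temperature": [40.0, 45.0, 55.0, 60.0],
    "Unemployment": [8.1, 8.0, 7.8, 7.6],
})

pearson = df.corr(method="pearson")
print(pearson)
```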

This is nice to have, but a large number of variables in the data quickly makes the raw matrix time-consuming to interpret. This is the reason I imported the seaborn package alongside pandas:

Quick description: Seaborn is a Python library for visualizing data. It is built on top of matplotlib and closely integrated with pandas data structures. To make the matrix look beautiful and easier to interpret, add this after calculating the Pearson coefficient of correlation.
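A sketch of the heatmap call, on the same kind of toy data (the output filename is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy columns standing in for the merged retail data.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 300.0, 400.0],
    "Temperature": [40.0, 45.0, 55.0, 60.0],
    "Unemployment": [8.1, 8.0, 7.8, 7.6],
})

corr = df.corr(method="pearson")
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.tight_layout()
plt.savefig("pearson_heatmap.png")
```

Pinning vmin/vmax to -1 and 1 keeps the color scale comparable across heatmaps.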

A coefficient close to 1 means that there’s a very strong positive correlation between the two variables. In our case, the maroon cells show very strong correlations. The diagonal is the correlation of each variable with itself, so those values are obviously 1.

Kendall Rank Correlation

Kendall rank correlation (non-parametric) is an alternative to Pearson’s correlation (parametric) when the data you’re working with has failed one or more assumptions of the test. This is also the best alternative to Spearman correlation (non-parametric) when your sample size is small and has many tied ranks.

Kendall rank correlation is used to test the similarities in the ordering of data when it is ranked by quantities. While other correlation coefficients use the observations themselves as the basis of the correlation, Kendall’s coefficient uses pairs of observations and determines the strength of association based on the pattern of concordance and discordance between the pairs.

  1. Concordant: ordered in the same way (consistency). A pair of observations is concordant if (x2 - x1) and (y2 - y1) have the same sign.

  2. Discordant: ordered differently (inconsistency). A pair of observations is discordant if (x2 - x1) and (y2 - y1) have opposite signs.

Kendall Rank Correlation Using .corr()

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. If you need a quick intro on this — check out my explanation of dataframe.corr().
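A sketch of the Kendall variant; in this toy pair the two columns have identical rank orderings, so every pair is concordant and the coefficient is exactly 1:

```python
import pandas as pd

# Made-up values: the two columns rise and fall together in rank order.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 150.0, 400.0],
    "Fuel_Price": [2.50, 2.60, 2.55, 2.80],
})

kendall = df.corr(method="kendall")
print(kendall)
```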

Visualize using a Heat-map
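The Kendall matrix can be rendered exactly like the Pearson one; a short sketch on toy columns (output filename illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy columns standing in for the merged retail data.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 150.0, 400.0],
    "Fuel_Price": [2.50, 2.60, 2.55, 2.80],
    "Unemployment": [8.1, 8.0, 7.9, 7.6],
})

kendall = df.corr(method="kendall")
sns.heatmap(kendall, annot=True, vmin=-1, vmax=1)
plt.savefig("kendall_heatmap.png")
```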

Outlier

An outlier is a data point in a data set that is distant from all other observations, lying outside the overall distribution of the dataset.

What are the criteria to identify an outlier?

  1. A data point that falls outside of 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile

  2. A data point that falls outside of 3 standard deviations. Using a z-score, some practitioners instead flag anything beyond 2 standard deviations.

What is the reason for an outlier to exist in a dataset?

  1. Variability in the data
  2. An experimental measurement error

I have taken one column of the dataframe, Unemployment, to detect outliers. So let's start.

Detecting outlier using Z score

The z-score formula: z = (observation - mean) / standard deviation.

I am keeping the threshold value at 3: if a data point falls more than 3 standard deviations from the mean, I'm going to consider it an outlier.

I applied the unique() function because we don't want repeated values.
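Putting those steps together, a minimal sketch with a toy Unemployment series (one outlier, 20.0, injected by hand; the real values come from the merged dataframe):

```python
import numpy as np
import pandas as pd

# Toy Unemployment column with one injected outlier (20.0).
unemployment = pd.Series([7.8, 8.0, 8.2, 7.9, 8.1, 8.0, 7.95, 8.05,
                          7.85, 8.15, 7.9, 8.1, 8.0, 8.0, 20.0])

# z = (observation - mean) / standard deviation
z = (unemployment - unemployment.mean()) / unemployment.std()

threshold = 3
outliers = unemployment[np.abs(z) > threshold].unique()  # unique drops repeats
print(outliers)
```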

Interquartile Range

The interquartile range (IQR) is the spread between the 75th-percentile and 25th-percentile values of a dataset.

Steps

  1. Arrange the data in increasing order

  2. Calculate first(q1) and third quartile(q3)

  3. Find interquartile range(q3-q1)

  4. Find the lower bound: q1 - 1.5 * IQR

  5. Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside of lower and upper bound is an outlier
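The five steps above translate directly into pandas; a sketch using the same toy Unemployment series as in the z-score example:

```python
import pandas as pd

# Toy Unemployment column with one injected outlier (20.0).
unemployment = pd.Series([7.8, 8.0, 8.2, 7.9, 8.1, 8.0, 7.95, 8.05,
                          7.85, 8.15, 7.9, 8.1, 8.0, 8.0, 20.0])

data = unemployment.sort_values()  # step 1: increasing order
q1 = data.quantile(0.25)           # step 2: first quartile
q3 = data.quantile(0.75)           #         third quartile
iqr = q3 - q1                      # step 3: interquartile range
lower = q1 - 1.5 * iqr             # step 4: lower bound
upper = q3 + 1.5 * iqr             # step 5: upper bound

outliers = data[(data < lower) | (data > upper)]
print(outliers.tolist())
```

Note the sort in step 1 isn't strictly required by pandas, since quantile() handles unordered data, but it mirrors the steps as listed.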
