Blog Details
Blog Title: | Pandas Real time Case Study using Retail data-sets |
---|---|
Blogger: | kajolajab21@gmail.com |
Image: | View |
Content: | IntroductionContentYou are provided with historical sales data for 45 stores located in different regions - each store contains a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Within the Excel Sheet, there are 3 Tabs – Stores, Features and Sales StoresAnonymized information about the 45 stores, indicating the type and size of store FeaturesContains additional data related to the store, department, and regional activity for the given dates.
SalesHistorical sales data, which covers to 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:
Objectives
Diagnose data for cleaningMerge the data in a unique DataFrameSubsettingGrouped calculationsVisualizationCorrelationThe term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models. So, why is correlation a useful metric?
More formally, correlation is a statistical measure that describes the association between random variables. There are several methods for calculating the correlation coefficient, each measuring different types of strength of association. Below we summarize three of the most widely used methods. Calculating the Pearson CorrelationWe’ll use method = ‘pearson’ for the dataframe.corr since we want to calculate the pearson coefficient of correlation. Then we’ll print it out! This is nice to have, but having a large number of variables in the data will quickly make this more time consuming to interpret. This is the reason I imported the seaborn package in pandas: Quick Description — Seaborn is a python library for visualizing data. It is built on top of matplotlib and closely integrated with pandas data structures. To make this look beautiful and easier to interpret, add this after calculating the Pearson coefficient of correlation. A co-efficient close to 1 means that there’s a very strong positive correlation between the two variables. In our case, the maroon shows very strong correlations. The diagonal line is the correlation of the variables to themselves — so they’ll obviously be 1. Kendall Rank CorrelationKendall rank correlation (non-parametric) is an alternative to Pearson’s correlation (parametric) when the data you’re working with has failed one or more assumptions of the test. This is also the best alternative to Spearman correlation (non-parametric) when your sample size is small and has many tied ranks. Kendall rank correlation is used to test the similarities in the ordering of data when it is ranked by quantities. Other types of correlation coefficients use the observations as the basis of the correlation, Kendall’s correlation coefficient uses pairs of observations and determines the strength of association based on the patter on concordance and discordance between the pairs.
Kendall Rank Correlation Using .corr()Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. If you need a quick intro on this — check out my explanation of dataframe.corr(). Visualize using a Heat-mapOutlierAn outlier is a data point in a data set that is distant from all other observations. A data point that lies outside the overall distribution of the dataset. What are the criteria to identify an outlier?
What is the reason for an outlier to exists in a dataset?
I have taken one column of dataframe Unemployment to detect the outlier, So let's start Detecting outlier using Z scoreUsing Z score Formula for Zscore = (Observation-Mean)/Standard Deviation Threshold value i am keeping it as 3 because within 3rd standard deviation if data is falling away then i'm going to consider as a outlier Applied unique function because we don't want the repeated data. InterQuantile Range75%,25% values in a dataset Steps
Anything that lies outside of lower and upper bound is an outlier |