2 Week 2

2.1 Sandwich Store Example

2.1.1 Task Description

The goal is to predict purchase probability and purchase amount based on customer satisfaction.

Data Description: See “Predicting Purchase and Amount Demo.xlsx”

Variable	Definition
Respondent	Respondent ID
Purchase	Purchase from the sandwich store in the last month (1= YES, 0= NO)
Spending	Amount spent on the sandwich store in the last month
OverallSAT	Overall Satisfaction (1= very dissatisfied, 7= very satisfied)
Education	Education Level (1= completing or completed High School, 2= Undergraduate Degree, 3= Masters Degree/Professional Degree/PhD)
Gender	Gender (0= Male, 1= Female)
Age	Age (1= 18 years or younger, 2= 18-55 years old, 3= 56 years or older)

2.1.2 Analysis using R

First, make sure all the necessary packages are installed. If a package is missing, install it with install.packages("<name of package>"). Then load the libraries used in this exercise.

library(readxl)
library(dplyr)
library(fBasics)
library(car)
library(epiDisplay)
library(ggplot2)
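Before loading the libraries, it can help to see which packages are missing. The following is a minimal sketch, not part of the original exercise: find_missing_packages is a hypothetical helper name, and the list also includes psych and caret because they are loaded later in this section.

```r
# Packages used in this exercise (psych and caret are loaded in later steps).
required_packages <- c("readxl", "dplyr", "fBasics", "car", "epiDisplay",
                       "ggplot2", "psych", "caret")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
find_missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to install them
}
```

The install call is left commented out so that running the check does not change the R library without you deciding to.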

The next step is to import the data. We will use the read_excel function to import the Excel file into R.

# Importing Data
satisfaction_data <- read_excel("Predicting Purchase and Amount Demo.xlsx")

Some variables can be represented on an interval scale.

Examples: satisfaction, income, height, price, and temperature

Categorical variables can only be represented on a nominal scale.

Examples: gender, ethnicity, brand, or location

In R, we convert the categorical variables into factors so that models treat their codes as categories rather than as numbers.

# Convert categorical variables to factors
satisfaction_data$Purchase <- as.factor(satisfaction_data$Purchase)
satisfaction_data$Gender <- as.factor(satisfaction_data$Gender)
satisfaction_data$Age <- as.factor(satisfaction_data$Age)
satisfaction_data$Education <- as.factor(satisfaction_data$Education)
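To see what the conversion does, here is a small sketch on made-up values standing in for the Gender column (the vector gender_codes is hypothetical, not taken from the data file). It also shows factor() with labels as an optional variation; the rest of this section keeps the "0"/"1" level names produced by as.factor.

```r
# Toy values standing in for the Gender column (0 = Male, 1 = Female).
gender_codes <- c(0, 1, 1, 0, 1)

# as.factor() keeps the codes as level names: "0" and "1".
gender_factor <- as.factor(gender_codes)
levels(gender_factor)

# factor() with labels attaches readable names to the same codes.
gender_labeled <- factor(gender_codes, levels = c(0, 1),
                         labels = c("Male", "Female"))
table(gender_labeled)

# A numeric column is averaged; a factor column is counted by category.
mean(gender_codes)
```

If you use labels, later calls such as relevel(Gender, ref = "0") would need the label ("Male") instead of the code.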

2.1.2.1 Summary Statistics

2.1.2.1.1 Continuous Variables

Review the means of continuous variables.

# Make a list of variables you want summary statistics for
var_list <- c("Spending","OverallSAT")
# Make a data.frame containing summary statistics of interest
summ_stats <- fBasics::basicStats(satisfaction_data[var_list])
summ_stats <- as.data.frame(t(summ_stats))
# Rename some of the columns for convenience
summ_stats <- summ_stats %>% dplyr::select("nobs","Mean", "Stdev", "Minimum", "Maximum")
summ_stats <- summ_stats %>% rename('N'= 'nobs')

	N	Mean	Stdev	Minimum	Maximum
Spending	1900	43.887158	39.548578	0	112
OverallSAT	1900	5.403684	1.375972	1	7

Check if the means and standard deviations “make sense”
Check the minimum and maximum values and see if they are “correct”
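These checks can also be written as explicit assertions. The sketch below uses a few made-up rows (toy_data is hypothetical) and takes the valid ranges from the variable definitions: OverallSAT between 1 and 7, and Spending not negative.

```r
# Toy rows standing in for satisfaction_data; the values are made up.
toy_data <- data.frame(
  Spending   = c(0, 76.5, 112, 30.8, 0),
  OverallSAT = c(4, 6, 7, 5, 1)
)

# Valid ranges taken from the variable definitions.
sat_in_range      <- all(toy_data$OverallSAT >= 1 & toy_data$OverallSAT <= 7)
spending_nonneg   <- all(toy_data$Spending >= 0)
no_missing_values <- !anyNA(toy_data)

# stopifnot() raises an error if any check fails.
stopifnot(sat_in_range, spending_nonneg, no_missing_values)
```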

2.1.2.1.2 Categorical Variables

Review the frequency distribution of categorical variables.

# Frequency distribution of Purchase
tab1(satisfaction_data$Purchase, cum.percent = TRUE)

## satisfaction_data$Purchase : 
##         Frequency Percent Cum. percent
## 0             810    42.6         42.6
## 1            1090    57.4        100.0
##   Total      1900   100.0        100.0

Similarly, for Education:

# Frequency distribution of Education
tab1(satisfaction_data$Education,  cum.percent = TRUE)

## satisfaction_data$Education : 
##         Frequency Percent Cum. percent
## 1            1046    55.1         55.1
## 2             654    34.4         89.5
## 3             200    10.5        100.0
##   Total      1900   100.0        100.0

2.1.2.1.3 Mean of satisfaction and spending by purchase outcomes

library(psych)
summary_by_by_purchase <- describeBy(satisfaction_data[var_list],satisfaction_data$Purchase, mat=TRUE,digits=3) 
summary_by_by_purchase <- summary_by_by_purchase %>% dplyr::select("group1","n","mean", "sd", "min", "max")
summary_by_by_purchase <- summary_by_by_purchase %>% rename('Purchase'= 'group1', 'N' = 'n', "Stdev" ='sd')

	Purchase	N	mean	Stdev	min	max
Spending1	0	810	0.000	0.000	0.0	0
Spending2	1	1090	76.501	15.173	30.8	112
OverallSAT1	0	810	4.499	1.263	1.0	7
OverallSAT2	1	1090	6.076	1.027	1.0	7

Satisfaction is higher for those who purchase
Spending is zero for those who don’t purchase (this makes sense and serves as a quick data validity check)
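The same validity check can be done row by row rather than from the summary table. This is a sketch on made-up rows (toy_data is hypothetical); with the real data you would apply the same expressions to satisfaction_data.

```r
# Toy rows standing in for satisfaction_data; the values are made up.
toy_data <- data.frame(
  Purchase = factor(c(0, 1, 1, 0, 1)),
  Spending = c(0, 76.5, 112, 0, 30.8)
)

# Non-purchasers should have zero spending; purchasers should have positive spending.
nonbuyers_spend_zero  <- all(toy_data$Spending[toy_data$Purchase == "0"] == 0)
buyers_spend_positive <- all(toy_data$Spending[toy_data$Purchase == "1"] > 0)

# Count rows that violate either rule.
n_inconsistent <- sum(
  (toy_data$Purchase == "0" & toy_data$Spending != 0) |
  (toy_data$Purchase == "1" & toy_data$Spending <= 0)
)
n_inconsistent
```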

2.1.3 Logistic Regression to Predict Purchase Probability Based on Customer Satisfaction

We will fit a logistic regression model to predict purchase probability based on customer satisfaction. The glm function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial to tell R to run a logistic regression rather than some other type of generalized linear model.

# Specify the level to use as the base (reference) category
satisfaction_data <- within(satisfaction_data, Education <- relevel(Education, ref = "1"))
satisfaction_data <- within(satisfaction_data, Gender <- relevel(Gender, ref = "0"))
satisfaction_data <- within(satisfaction_data, Age <- relevel(Age, ref = "1"))

# Logistic Regression
logistic_reg <- glm(Purchase ~ OverallSAT+Education+Gender+Age, family = "binomial", data = satisfaction_data)

logistic_reg_coefficients <- summary(logistic_reg)$coefficients

Regression Coefficients

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-6.219	0.415	-15.002	0.000
OverallSAT	1.228	0.059	20.705	0.000
Education2	-0.226	0.124	-1.831	0.067
Education3	-0.331	0.195	-1.699	0.089
Gender1	0.001	0.119	0.012	0.990
Age2	-0.005	0.258	-0.018	0.986
Age3	0.012	0.296	0.039	0.969

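The estimates imply the following regression model: \(\log\left(\frac{p}{1-p}\right) = 1.228 \cdot \text{OverallSAT} - 0.226 \cdot \text{Education}_{\text{undergrad}} - 0.331 \cdot \text{Education}_{\text{master}} + 0.001 \cdot \text{Female} - 0.005 \cdot \text{Age}_{18\text{-}55} + 0.012 \cdot \text{Age}_{56} - 6.219\), where \(p\) is the probability of purchase.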
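To turn the fitted model into a probability, plug a customer profile into the right-hand side and apply the inverse logit. The sketch below hard-codes the rounded estimates from the coefficient table; predict_purchase_prob is a hypothetical helper written for illustration. With the fitted object available, predict(logistic_reg, newdata = ..., type = "response") does the same job.

```r
# Rounded coefficient estimates copied from the regression table.
coefs <- c(intercept = -6.219, OverallSAT = 1.228, Education2 = -0.226,
           Education3 = -0.331, Gender1 = 0.001, Age2 = -0.005, Age3 = 0.012)

# Hypothetical helper: purchase probability for one customer profile.
# Each dummy argument is 1 if the customer is in that category, else 0.
predict_purchase_prob <- function(sat, edu2 = 0, edu3 = 0, female = 0,
                                  age2 = 0, age3 = 0) {
  log_odds <- coefs[["intercept"]] +
    coefs[["OverallSAT"]] * sat +
    coefs[["Education2"]] * edu2 +
    coefs[["Education3"]] * edu3 +
    coefs[["Gender1"]] * female +
    coefs[["Age2"]] * age2 +
    coefs[["Age3"]] * age3
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

# Base-category customer (Education 1, male, Age 1) at two satisfaction levels.
predict_purchase_prob(sat = 7)
predict_purchase_prob(sat = 4)
```

For the base-category customer, the predicted purchase probability is roughly 0.92 at OverallSAT = 7 and roughly 0.21 at OverallSAT = 4.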

Confusion Matrix

predict_purchase <- predict(logistic_reg, type="response")

library(caret)
# confusionMatrix(data = predicted, reference = true_value)
confusion_matrix <- confusionMatrix(as.factor(as.integer(predict_purchase  > 0.5)),satisfaction_data$Purchase)

# Matrix
table <- data.frame(confusion_matrix$table)

Accuracy = (887+644)/(887+644+203+166)

## [1] 0.8057895
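The accuracy calculation can be reproduced from the four cell counts. In the sketch below the counts come from the calculation above, but their placement in the cells is an inference: it is the only arrangement of the two diagonal counts (887 and 644) and the two off-diagonal counts (203 and 166) that matches the group sizes of 1090 purchasers and 810 non-purchasers.

```r
# Rows are predictions, columns are actual outcomes (cell placement inferred
# from the group sizes: 810 non-purchasers, 1090 purchasers).
confusion_counts <- matrix(
  c(644, 166,   # actual 0: predicted 0, predicted 1
    203, 887),  # actual 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Accuracy: share of all customers classified correctly.
accuracy <- sum(diag(confusion_counts)) / sum(confusion_counts)

# Treating Purchase = 1 as the positive class:
sensitivity <- confusion_counts["1", "1"] / sum(confusion_counts[, "1"])
specificity <- confusion_counts["0", "0"] / sum(confusion_counts[, "0"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Note that caret treats the first factor level (here "0") as the positive class unless positive = "1" is passed, so its reported sensitivity and specificity may be swapped relative to this sketch.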

satisfaction_data %>%
    mutate(prob = ifelse(Purchase == "1", 1, 0)) %>%
    ggplot(aes(OverallSAT, prob)) +
    geom_point(alpha = .15) +
    geom_smooth(method = "glm", method.args = list(family = "binomial")) +
    ggtitle("Logistic regression model fit") +
    xlab("OverallSAT") +
    ylab("Probability of Purchase")

satisfaction_data %>%
    mutate(prob = ifelse(Purchase == "1", 1, 0)) %>%
    ggplot(aes(OverallSAT, prob)) +
    geom_point(alpha = .15) +
    geom_smooth(method = "lm") +
    ggtitle("Linear regression model fit") +
    xlab("OverallSAT") +
    ylab("Probability of Purchase")
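One reason to prefer the logistic fit for a 0/1 outcome is that a straight line can predict values below 0 or above 1, which cannot be probabilities. The sketch below illustrates this on a small made-up data set (toy, with ten customers at each satisfaction level); it is not the sandwich store data.

```r
# Made-up data: 10 customers at each satisfaction level 1..7, with the
# number of purchasers per level given by ones_per_level.
ones_per_level <- c(0, 0, 1, 3, 7, 10, 10)
toy <- data.frame(
  OverallSAT = rep(1:7, each = 10),
  Purchase   = unlist(lapply(ones_per_level,
                             function(k) rep(c(1, 0), times = c(k, 10 - k))))
)

# Fit both models to the same 0/1 outcome.
linear_fit   <- lm(Purchase ~ OverallSAT, data = toy)
logistic_fit <- glm(Purchase ~ OverallSAT, family = binomial, data = toy)

# Predict at each satisfaction level.
grid <- data.frame(OverallSAT = 1:7)
linear_pred   <- predict(linear_fit, newdata = grid)
logistic_pred <- predict(logistic_fit, newdata = grid, type = "response")

range(linear_pred)    # extends below 0 and above 1
range(logistic_pred)  # stays strictly between 0 and 1
```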

2.1.4 Linear Regression to Predict Purchase Amount Based on Customer Satisfaction

Review of Variables

Variable	Definition
Respondent	Respondent ID
Purchase	Purchase from the sandwich store in the last month (1= YES, 0= NO)
Spending	Amount spent on the sandwich store in the last month
OverallSAT	Overall Satisfaction (1= very dissatisfied, 7= very satisfied)
Education	Education Level (1= completing or completed High School, 2= Undergraduate Degree, 3= Masters Degree/Professional Degree/PhD)
Gender	Gender (0= Male, 1= Female)
Age	Age (1= 18 years or younger, 2= 18-55 years old, 3= 56 years or older)

Note that Spending is 0 whenever Purchase is 0. Therefore, we will run the regression only on the subset of the data where Purchase is 1.

	Purchase	N	mean	Stdev	min	max
Spending1	0	810	0.000	0.000	0.0	0
Spending2	1	1090	76.501	15.173	30.8	112
OverallSAT1	0	810	4.499	1.263	1.0	7
OverallSAT2	1	1090	6.076	1.027	1.0	7

2.1.4.1 Linear Regression

Dependent variable: Spending
Independent variable(s): OverallSAT, Education, Gender, Age
Data: Subset of the data where Purchase is 1 (customers who made a purchase)

# Specify the level to use as the base (reference) category
satisfaction_data <- within(satisfaction_data, Education <- relevel(Education, ref = "1"))
satisfaction_data <- within(satisfaction_data, Gender <- relevel(Gender, ref = "0"))
satisfaction_data <- within(satisfaction_data, Age <- relevel(Age, ref = "1"))

# Linear Regression

spending_regression  <-  lm(Spending ~ OverallSAT+Education+Gender+Age , data=subset(satisfaction_data,Purchase!=0)) 

# Summary of the regression
spending_regression_coef <- summary(spending_regression)$coefficients

Regression Coefficients

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	52.245	3.150	16.588	0.000
OverallSAT	4.230	0.430	9.835	0.000
Education2	-2.970	0.970	-3.062	0.002
Education3	-2.163	1.557	-1.389	0.165
Gender1	0.517	0.909	0.568	0.570
Age2	-0.283	1.998	-0.141	0.888
Age3	-2.062	2.349	-0.878	0.380

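OverallSAT is a statistically significant predictor of Spending (p < 0.001): among customers who purchased, each one-point increase in overall satisfaction is associated with about 4.23 more in spending, holding education, gender, and age constant.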
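As a worked example of reading the linear model, the sketch below computes predicted spending for a purchaser from the rounded estimates in the coefficient table. predict_spending is a hypothetical helper for illustration; with the fitted object available, predict(spending_regression, newdata = ...) gives the same kind of prediction.

```r
# Rounded coefficient estimates copied from the regression table.
spend_coefs <- c(intercept = 52.245, OverallSAT = 4.230, Education2 = -2.970,
                 Education3 = -2.163, Gender1 = 0.517, Age2 = -0.283,
                 Age3 = -2.062)

# Hypothetical helper: predicted spending for one purchaser profile.
# Each dummy argument is 1 if the customer is in that category, else 0.
predict_spending <- function(sat, edu2 = 0, edu3 = 0, female = 0,
                             age2 = 0, age3 = 0) {
  spend_coefs[["intercept"]] +
    spend_coefs[["OverallSAT"]] * sat +
    spend_coefs[["Education2"]] * edu2 +
    spend_coefs[["Education3"]] * edu3 +
    spend_coefs[["Gender1"]] * female +
    spend_coefs[["Age2"]] * age2 +
    spend_coefs[["Age3"]] * age3
}

# Base-category purchaser (Education 1, male, Age 1).
predict_spending(sat = 7)   # 52.245 + 4.230 * 7 = 81.855
predict_spending(sat = 6)   # 52.245 + 4.230 * 6 = 77.625
```

Unlike in the logistic model, the effect of one extra satisfaction point is the same (4.230) at every starting level.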

2.1.5 Summary of Logistic Regression

Logistic regression predicts the probability of the two possible outcomes \(y = 0, 1\) conditional on the values of the independent variables \(x_1, x_2, \ldots, x_k\).
Predicting the probability conveys more information than predicting only 0 or 1.
The logistic regression model is consistent with consumers making choices based on a comparison of the utilities from the two options.
Estimating the logistic regression model is as simple as estimating the linear regression model, but the coefficient estimates are interpreted differently: each one is the change in the log-odds of the outcome for a one-unit increase in the predictor, not the change in the probability itself.
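The difference in interpretation can be made concrete with the OverallSAT estimate from the purchase model. The sketch below uses the rounded intercept and OverallSAT coefficient for a base-category customer; the variable names are chosen for illustration.

```r
# Rounded estimates from the purchase model (base-category customer).
b0    <- -6.219
b_sat <- 1.228

# Exponentiating a coefficient gives an odds ratio: each extra satisfaction
# point multiplies the odds of purchase by this factor (about 3.41).
odds_ratio <- exp(b_sat)

# Predicted probability and odds at each satisfaction level 1..7.
prob <- plogis(b0 + b_sat * 1:7)
odds <- prob / (1 - prob)

# The odds ratio between adjacent levels is constant...
odds[-1] / odds[-length(odds)]

# ...but the change in probability depends on the starting level.
round(diff(prob), 3)
```

The probability changes are small near the ends of the scale and largest in the middle, which is why a single coefficient cannot be read as "the effect on the probability".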