2 Week 2
2.1 Sandwich Store Example
2.1.1 Task Description
The goal is to predict purchase probability and purchase amount based on customer satisfaction.
Data Description: See “Predicting Purchase and Amount Demo.xlsx”
Variable | Definition |
---|---|
Respondent | Respondent ID |
Purchase | Purchase from the sandwich store in the last month (1= YES, 0= NO) |
Spending | Amount spent on the sandwich store in the last month |
OverallSAT | Overall Satisfaction (1= very dissatisfied, 7= very satisfied) |
Education | Education Level (1= completing or completed High School, 2= Undergraduate Degree, 3= Masters Degree/Professional Degree/PhD) |
Gender | Gender (0= Male, 1= Female) |
Age | Age (1= 18 years or younger, 2= 18-55 years old, 3= 56 years or older) |
2.1.2 Analysis using R
We need to make sure all necessary packages are installed. If any of these packages is not installed, run install.packages("<name of package>"). The next step is to load all the libraries we use in this exercise.
library(readxl)
library(dplyr)
library(fBasics)
library(car)
library(epiDisplay)
library(ggplot2)
The next step is to import the data. We will use the read_excel function to import the Excel file into R.
# Importing Data
satisfaction_data <- read_excel("Predicting Purchase and Amount Demo.xlsx")
Some variables can be represented on an interval scale
- Examples: satisfaction, income, height, price, and temperature
Categorical variables can only be represented on a nominal scale
- Examples: gender, ethnicity, brand, or location
In R, we can convert all the categorical variables into factors.
# Converting categorical variables to factors
satisfaction_data$Purchase <- as.factor(satisfaction_data$Purchase)
satisfaction_data$Gender <- as.factor(satisfaction_data$Gender)
satisfaction_data$Age <- as.factor(satisfaction_data$Age)
satisfaction_data$Education <- as.factor(satisfaction_data$Education)
2.1.2.1 Summary Statistics
2.1.2.1.1 Continuous Variables
Review the means of continuous variables.
# Make a list of variables you want summary statistics for
var_list <- c("Spending","OverallSAT")
# Make a data.frame containing summary statistics of interest
summ_stats <- fBasics::basicStats(satisfaction_data[var_list])
summ_stats <- as.data.frame(t(summ_stats))
# Keep the columns of interest and rename nobs to N for convenience
# (dplyr:: prefixes avoid clashes with same-named functions in other loaded packages)
summ_stats <- summ_stats %>% dplyr::select("nobs","Mean", "Stdev", "Minimum", "Maximum")
summ_stats <- summ_stats %>% dplyr::rename('N'= 'nobs')
Variable | N | Mean | Stdev | Minimum | Maximum |
---|---|---|---|---|---|
Spending | 1900 | 43.887158 | 39.548578 | 0 | 112 |
OverallSAT | 1900 | 5.403684 | 1.375972 | 1 | 7 |
Check if the means and standard deviations “make sense”
Check the minimum and maximum values and see if they are “correct”
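These checks can also be written as explicit assertions. The sketch below is only illustrative: `toy` is a hypothetical four-row data frame standing in for `satisfaction_data`, and the bounds come from the variable definitions (OverallSAT on a 1 to 7 scale, Spending never negative).

```r
# Hypothetical toy data standing in for satisfaction_data (not the real file)
toy <- data.frame(
  Purchase   = c(0, 1, 1, 0),
  Spending   = c(0, 45.5, 80, 0),
  OverallSAT = c(3, 6, 7, 4)
)

# OverallSAT is defined on a 1-7 scale
sat_in_range <- all(toy$OverallSAT >= 1 & toy$OverallSAT <= 7)

# Spending is an amount, so it should never be negative
spending_nonnegative <- all(toy$Spending >= 0)

# Respondents who did not purchase should have zero spending
nonbuyers_spend_zero <- all(toy$Spending[toy$Purchase == 0] == 0)

print(c(sat_in_range, spending_nonnegative, nonbuyers_spend_zero))
```

Running the same three conditions on the real data frame would flag out-of-range codes or data-entry errors before any model is fitted.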
2.1.2.1.2 Categorical Variables
Review the frequency distribution of categorical variables.
# Frequency distribution of Purchase
tab1(satisfaction_data$Purchase, cum.percent = TRUE)
## satisfaction_data$Purchase :
## Frequency Percent Cum. percent
## 0 810 42.6 42.6
## 1 1090 57.4 100.0
## Total 1900 100.0 100.0
Similarly, for Education:
# Frequency distribution of Education
tab1(satisfaction_data$Education, cum.percent = TRUE)
## satisfaction_data$Education :
## Frequency Percent Cum. percent
## 1 1046 55.1 55.1
## 2 654 34.4 89.5
## 3 200 10.5 100.0
## Total 1900 100.0 100.0
2.1.2.1.3 Mean of satisfaction and spending by purchase outcomes
library(psych)
# Summary statistics of Spending and OverallSAT, split by purchase outcome
summary_by_by_purchase <- describeBy(satisfaction_data[var_list], satisfaction_data$Purchase, mat=TRUE, digits=3)
summary_by_by_purchase <- summary_by_by_purchase %>% dplyr::select("group1","n","mean", "sd", "min", "max")
summary_by_by_purchase <- summary_by_by_purchase %>% dplyr::rename('Purchase'= 'group1', 'N' = 'n', "Stdev" ='sd')
Variable | Purchase | N | mean | Stdev | min | max |
---|---|---|---|---|---|---|
Spending1 | 0 | 810 | 0.000 | 0.000 | 0.0 | 0 |
Spending2 | 1 | 1090 | 76.501 | 15.173 | 30.8 | 112 |
OverallSAT1 | 0 | 810 | 4.499 | 1.263 | 1.0 | 7 |
OverallSAT2 | 1 | 1090 | 6.076 | 1.027 | 1.0 | 7 |
Mean satisfaction is higher for those who purchase (6.076) than for those who do not (4.499).
Spending is zero for those who don’t purchase (this makes sense and serves as a quick data validity check).
2.1.3 Logistic Regression to Predict Purchase Probability Based on Customer Satisfaction
We will fit a logistic regression model to predict purchase probability based on customer satisfaction. The glm function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
# To specify the level to use as base
satisfaction_data <- within(satisfaction_data, Education <- relevel(Education, ref = "1"))
satisfaction_data <- within(satisfaction_data, Gender <- relevel(Gender, ref = "0"))
satisfaction_data <- within(satisfaction_data, Age <- relevel(Age, ref = "1"))

# Logistic Regression
logistic_reg <- glm(Purchase ~ OverallSAT+Education+Gender+Age, family = "binomial", data = satisfaction_data)

# Coefficient table (estimate, standard error, z value, p-value)
logistic_reg_coefficients <- summary(logistic_reg)$coefficients
- Regression Coefficients
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -6.219 | 0.415 | -15.002 | 0.000 |
OverallSAT | 1.228 | 0.059 | 20.705 | 0.000 |
Education2 | -0.226 | 0.124 | -1.831 | 0.067 |
Education3 | -0.331 | 0.195 | -1.699 | 0.089 |
Gender1 | 0.001 | 0.119 | 0.012 | 0.990 |
Age2 | -0.005 | 0.258 | -0.018 | 0.986 |
Age3 | 0.012 | 0.296 | 0.039 | 0.969 |
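The estimates imply the following regression model, where \(p\) is the probability of purchase:
\[\log\left(\frac{p}{1-p}\right) = 1.228\,\text{OverallSAT} - 0.226\,\text{Education}_{\text{undergrad}} - 0.331\,\text{Education}_{\text{master}} + 0.001\,\text{Female} - 0.005\,\text{Age}_{18\text{-}55} + 0.012\,\text{Age}_{56+} - 6.219\]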
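To see what the fitted model implies for an individual customer, the log-odds can be converted back to a probability with the inverse logit. The sketch below hard-codes the rounded coefficients from the table, so its results are approximate; `purchase_prob` is a hypothetical helper written for this illustration and is not part of the original analysis.

```r
# Coefficients copied from the regression table (rounded to three decimals)
b <- c(intercept = -6.219, OverallSAT = 1.228,
       Education2 = -0.226, Education3 = -0.331,
       Gender1 = 0.001, Age2 = -0.005, Age3 = 0.012)

# Hypothetical helper: purchase probability for one customer profile.
# Defaults correspond to the base levels (Education = 1, Gender = 0, Age = 1).
purchase_prob <- function(sat, education = 1, female = 0, age = 1) {
  log_odds <- b[["intercept"]] + b[["OverallSAT"]] * sat +
    b[["Education2"]] * (education == 2) + b[["Education3"]] * (education == 3) +
    b[["Gender1"]] * female +
    b[["Age2"]] * (age == 2) + b[["Age3"]] * (age == 3)
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

p_sat5 <- purchase_prob(sat = 5)  # baseline customer, satisfaction 5
p_sat7 <- purchase_prob(sat = 7)  # baseline customer, satisfaction 7
print(round(c(p_sat5, p_sat7), 3))
```

For a baseline customer, moving from a satisfaction score of 5 to 7 raises the predicted purchase probability from roughly 0.48 to roughly 0.92. With a fitted model object, predict(logistic_reg, newdata = ..., type = "response") performs the same calculation using the unrounded coefficients.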
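Confusion Matrix
# Predicted purchase probabilities for every respondent
predict_purchase <- predict(logistic_reg, type="response")

library(caret)
# confusionMatrix(predicted, true_value); classify as purchase when probability > 0.5
confusion_matrix <- confusionMatrix(as.factor(as.integer(predict_purchase > 0.5)), satisfaction_data$Purchase)

# Matrix as a data frame (named confusion_table to avoid shadowing base::table)
confusion_table <- data.frame(confusion_matrix$table)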
- Accuracy =
(887+644)/(887+644+203+166)
## [1] 0.8057895
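The accuracy calculation uses four counts from the confusion matrix. The sketch below names them explicitly and derives two related measures. The assignment of 203 to false negatives and 166 to false positives is an inference, not something stated in the notes: it is the only split consistent with the 1090 purchasers and 810 non-purchasers in the frequency table.

```r
# Counts taken from the accuracy formula; the FN/FP split is inferred from the marginals
tp <- 887  # purchasers predicted as purchasers
tn <- 644  # non-purchasers predicted as non-purchasers
fn <- 203  # purchasers predicted as non-purchasers (887 + 203 = 1090)
fp <- 166  # non-purchasers predicted as purchasers (644 + 166 = 810)

accuracy    <- (tp + tn) / (tp + tn + fn + fp)  # share of all respondents classified correctly
sensitivity <- tp / (tp + fn)                   # share of purchasers correctly identified
specificity <- tn / (tn + fp)                   # share of non-purchasers correctly identified

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3))
```

Under that reading, the model identifies purchasers and non-purchasers about equally well, so the roughly 81% accuracy is not driven by one class alone.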
# Plot the 0/1 purchase outcome against satisfaction with a fitted logistic curve
satisfaction_data %>%
  mutate(prob = ifelse(Purchase == "1", 1, 0)) %>%
  ggplot(aes(OverallSAT, prob)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  ggtitle("Logistic regression model fit") +
  xlab("OverallSAT") +
  ylab("Probability of Purchase")
# Same plot with a straight-line (linear regression) fit for comparison
satisfaction_data %>%
  mutate(prob = ifelse(Purchase == "1", 1, 0)) %>%
  ggplot(aes(OverallSAT, prob)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "lm") +
  ggtitle("Linear regression model fit") +
  xlab("OverallSAT") +
  ylab("Probability of Purchase")
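The two plots contrast a logistic and a linear fit of the same 0/1 outcome. The sketch below makes the comparison numerically on simulated data, since the course data file is not reproduced here. The seed, sample size, and variable names `sat` and `buy` are assumptions for illustration; the intercept and slope used to simulate are borrowed from the coefficient table.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 500
sat <- sample(1:7, n, replace = TRUE)      # simulated satisfaction scores
p_true <- plogis(-6.219 + 1.228 * sat)     # purchase probability implied by the borrowed coefficients
buy <- rbinom(n, size = 1, prob = p_true)  # simulated 0/1 purchase outcome

logit_fit  <- glm(buy ~ sat, family = binomial)  # logistic regression
linear_fit <- lm(buy ~ sat)                      # linear regression on the same 0/1 outcome

# Predictions from both models at each satisfaction score
grid <- data.frame(sat = 1:7)
p_logit  <- predict(logit_fit, newdata = grid, type = "response")
p_linear <- predict(linear_fit, newdata = grid)

print(round(cbind(grid, p_logit, p_linear), 3))
```

The logistic predictions always lie strictly between 0 and 1 because they pass through the inverse logit. A straight line has no such constraint, so it can produce values below 0 or above 1 at the ends of the scale, which is one reason to prefer logistic regression for a binary outcome.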
2.1.4 Linear Regression to Predict Purchase Amount Based on Customer Satisfaction
- Review of Variables
Variable | Definition |
---|---|
Respondent | Respondent ID |
Purchase | Purchase from the sandwich store in the last month (1= YES, 0= NO) |
Spending | Amount spent on the sandwich store in the last month |
OverallSAT | Overall Satisfaction (1= very dissatisfied, 7= very satisfied) |
Education | Education Level (1= completing or completed High School, 2= Undergraduate Degree, 3= Masters Degree/Professional Degree/PhD) |
Gender | Gender (0= Male, 1= Female) |
Age | Age (1= 18 years or younger, 2= 18-55 years old, 3= 56 years or older) |
- Note that Spending is 0 when Purchase is 0. Therefore, we will run the regression on the subset of the data where Purchase is 1.
Variable | Purchase | N | mean | Stdev | min | max |
---|---|---|---|---|---|---|
Spending1 | 0 | 810 | 0.000 | 0.000 | 0.0 | 0 |
Spending2 | 1 | 1090 | 76.501 | 15.173 | 30.8 | 112 |
OverallSAT1 | 0 | 810 | 4.499 | 1.263 | 1.0 | 7 |
OverallSAT2 | 1 | 1090 | 6.076 | 1.027 | 1.0 | 7 |
2.1.4.1 Linear Regression
Dependent variable: Spending
Independent variable(s): OverallSAT, Education, Gender, Age
Data: Subset of data where Purchase != 0
# To specify the level to use as base
satisfaction_data <- within(satisfaction_data, Education <- relevel(Education, ref = "1"))
satisfaction_data <- within(satisfaction_data, Gender <- relevel(Gender, ref = "0"))
satisfaction_data <- within(satisfaction_data, Age <- relevel(Age, ref = "1"))

# Linear Regression (purchasers only)
spending_regression <- lm(Spending ~ OverallSAT+Education+Gender+Age, data=subset(satisfaction_data, Purchase!=0))

# Summary of the regression
spending_regression_coef <- summary(spending_regression)$coefficients
- Regression Coefficients
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 52.245 | 3.150 | 16.588 | 0.000 |
OverallSAT | 4.230 | 0.430 | 9.835 | 0.000 |
Education2 | -2.970 | 0.970 | -3.062 | 0.002 |
Education3 | -2.163 | 1.557 | -1.389 | 0.165 |
Gender1 | 0.517 | 0.909 | 0.568 | 0.570 |
Age2 | -0.283 | 1.998 | -0.141 | 0.888 |
Age3 | -2.062 | 2.349 | -0.878 | 0.380 |
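- Overall satisfaction is a statistically significant predictor of Spending (p < 0.001): among purchasers, each one-point increase in OverallSAT is associated with about 4.23 more in spending, holding education, gender, and age fixed.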
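As a worked illustration of the linear model, the rounded coefficients from the table can be combined into a predicted spending amount for a purchaser. `predicted_spending` is a hypothetical helper introduced here, and the results are approximate because the coefficients are rounded.

```r
# Coefficients copied from the spending regression table (rounded to three decimals)
s <- c(intercept = 52.245, OverallSAT = 4.230,
       Education2 = -2.970, Education3 = -2.163,
       Gender1 = 0.517, Age2 = -0.283, Age3 = -2.062)

# Hypothetical helper: predicted spending for a purchaser with the given profile.
# Defaults correspond to the base levels (Education = 1, Gender = 0, Age = 1).
predicted_spending <- function(sat, education = 1, female = 0, age = 1) {
  s[["intercept"]] + s[["OverallSAT"]] * sat +
    s[["Education2"]] * (education == 2) + s[["Education3"]] * (education == 3) +
    s[["Gender1"]] * female +
    s[["Age2"]] * (age == 2) + s[["Age3"]] * (age == 3)
}

spend_sat6 <- predicted_spending(sat = 6)
spend_sat7 <- predicted_spending(sat = 7)
print(c(spend_sat6, spend_sat7, difference = spend_sat7 - spend_sat6))
```

The difference between the two predictions equals the OverallSAT coefficient, which is exactly what "holding the other variables fixed" means in a linear model. These predictions apply only to customers who purchase, because the model was estimated on that subset.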
2.1.5 Summary of Logistic Regression
Logistic regression predicts the probability of the two possible outcomes \(y = 0, 1\) conditional on the values of the independent variables \(x_1, x_2, \ldots, x_k\).
- Predicting the probability conveys more information than predicting 0 or 1.
The logistic regression model is consistent with consumers making choices based on a comparison of the utilities from the two options.
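Estimating the logistic regression model is as simple as estimating the linear regression model, but the coefficient estimates are interpreted differently: each coefficient is the change in the log-odds of the outcome for a one-unit change in the variable, not the change in the outcome itself.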
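One common way to read a logistic coefficient is to exponentiate it, which gives an odds ratio. The sketch below does this for the OverallSAT estimate from the coefficient table and checks the interpretation for a baseline customer. The helper `odds` is hypothetical, and the values are approximate because the coefficients are rounded.

```r
b_intercept <- -6.219  # intercept from the logistic regression table
b_sat       <-  1.228  # OverallSAT coefficient from the same table

# Odds ratio: multiplicative change in the odds of purchase per one-point increase in satisfaction
odds_ratio <- exp(b_sat)

# Hypothetical helper: odds of purchase for a baseline customer at a given satisfaction score
odds <- function(sat) {
  p <- plogis(b_intercept + b_sat * sat)
  p / (1 - p)
}

# The ratio of odds at adjacent satisfaction scores equals the odds ratio
ratio_6_vs_5 <- odds(6) / odds(5)
print(round(c(odds_ratio = odds_ratio, ratio_6_vs_5 = ratio_6_vs_5), 3))
```

An odds ratio of about 3.41 means that each additional satisfaction point multiplies the odds of purchase by roughly 3.4. The change in probability, by contrast, depends on the starting point, which is why logistic coefficients cannot be read like linear regression slopes.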