1 Week 1
1.1 Mini-Example
1.1.1 Task Description
The goal of the mini-example is to examine the association between a firm’s discretionary spending (advertising, R&D) and its top-line financial impact (sales).
- How many variables are in the dataset? Which are identifiers, and which are categorical vs. continuous?
- What is(are) the focal outcome variable(s)?
- What is(are) the focal predictor variable(s)?
- What is(are) the focal “control” variable(s)?
- What is the mean of sales, advertising and R&D?
- How do sales, advertising, and R&D differ between B2B and B2C firms?
- How do advertising and R&D affect sales?
1.1.2 Variables in the Data
Variable Name | What does it mean? |
---|---|
gvkey | 6-digit firm identifier |
datadate | Fiscal year |
pure_b2b | Indicator: 1 if the firm operates only in the B2B sector, 0 otherwise |
sales | Sales in millions of dollars |
xad | Advertising expenditure in millions of dollars |
xrd | R&D expenditure in millions of dollars |
fmsize_at | Natural log of the book value of total assets |
mksh2 | Ratio of the firm’s overall sales revenue to the sales revenue of all firms in the same 2-digit SIC code industry |
1.1.3 Analysis using R
Before beginning the tutorial, we need to make sure all necessary packages are installed. If any of these packages are not installed, run install.packages("<name of package>"). For example, to install the package dplyr, use install.packages("dplyr").
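As a convenience, a small sketch like the following can install any missing packages in one go (the pkgs vector simply lists the packages used below; adjust it as needed):
# Install any required packages that are not already installed
pkgs <- c("readxl", "dplyr", "ggplot2", "fBasics")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)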
The next step is to load all the libraries we use in this exercise. An alternative approach is to load each library just before using its functions.
library(readxl)
library(dplyr)
library(ggplot2)
library(fBasics)
The next step is to import the data. We will use the read_excel function to import the Excel file into R. You can use the setwd function to set the working directory to the folder that contains your data, or give the full path of the Excel file while reading it.
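For example (the paths below are placeholders, not real locations):
# Option 1: set the working directory to the folder containing the data file
setwd("C:/path/to/your/data")   # placeholder path
# Option 2: give the full path directly when reading the file
# mini_case_data <- read_excel("C:/path/to/your/data/Mini-case data.xlsx")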
# Importing Data
mini_case_data <- read_excel("Mini-case data.xlsx")
1.1.3.1 Summary Statistics
In R, there are multiple ways to check the summary statistics of your variables.
# Make a list of variables you want summary statistics for
var_list <- c("sales", "xad", "xrd")
# Make a data.frame containing summary statistics of interest
summ_stats <- fBasics::basicStats(mini_case_data[var_list])
summ_stats <- as.data.frame(t(summ_stats))
# Keep and rename some of the columns for convenience
summ_stats <- summ_stats %>% select("nobs", "Mean", "Stdev", "Minimum", "Maximum")
summ_stats <- summ_stats %>% rename('N' = 'nobs')
summ_stats
 | N | Mean | Stdev | Minimum | Maximum |
---|---|---|---|---|---|
sales | 1000 | 20081.5496 | 34027.1812 | 29.729 | 160123 |
xad | 1000 | 485.2452 | 806.1509 | 0.397 | 3698 |
xrd | 1000 | 788.7484 | 1716.4137 | 0.000 | 7400 |
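As noted above, there are other ways to get the same information; for instance, base R’s summary() gives quick quartile-based summaries, and dplyr’s across() computes exactly the statistics you pick. Both are shown here as optional alternatives:
# Alternative 1: base R five-number summaries plus means
summary(mini_case_data[var_list])
# Alternative 2: dplyr - compute selected statistics for each variable
mini_case_data %>%
  summarise(across(all_of(var_list), list(mean = mean, sd = sd)))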
We can also compare summary statistics by grouping the data on a categorical variable. In our case, we will group the data by pure_b2b and compare the summary statistics.
# Summary stats - sales, grouped by pure_b2b
summary_sales_by_group <- mini_case_data %>%
  group_by(pure_b2b) %>%
  summarise(n = n(),
            mean_sales = mean(sales),
            std_dev_sales = sd(sales),
            min_sale = min(sales),
            max_sales = max(sales))
summary_sales_by_group
pure_b2b | n | mean_sales | std_dev_sales | min_sale | max_sales |
---|---|---|---|---|---|
0 | 873 | 21026.72 | 36148.236 | 29.729 | 160123 |
1 | 127 | 13584.42 | 9391.104 | 2959.900 | 53999 |
Similarly, we can compare the other variables.
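For example, here is a minimal sketch of the same comparison for advertising expenditure (the object name summary_xad_by_group is my own):
# Summary stats - xad, grouped by pure_b2b
summary_xad_by_group <- mini_case_data %>%
  group_by(pure_b2b) %>%
  summarise(n = n(),
            mean_xad = mean(xad),
            std_dev_xad = sd(xad),
            min_xad = min(xad),
            max_xad = max(xad))
summary_xad_by_group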
1.1.3.2 Graph of Mean Sales (B2B vs. non-B2B Firms)
We can plot the mean sales for B2B vs. non-B2B firms. You can use the code below as a reference to plot other variables such as xad and xrd.
# Convert pure_b2b to a factor variable
summary_sales_by_group$pure_b2b <- as.factor(summary_sales_by_group$pure_b2b)
# Plot
ggplot(summary_sales_by_group, aes(x = pure_b2b, y = mean_sales)) +
  geom_bar(stat = "identity") +
  ylab("Mean Sales") +
  ggtitle("Mean of Sales by pure_b2b")
1.1.3.3 Sales Regression
For our sales regression:
- Dependent variable: sales
- Independent variables: xad, xrd
sales_regression <- lm(sales ~ xad + xrd, data = mini_case_data)
# Summary of the regression
sales_regression_coef <- summary(sales_regression)$coefficients
sales_regression_r2 <- summary(sales_regression)$r.squared
- Regression Coefficients
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 5273.430435 | 881.2489482 | 5.984042 | 0 |
xad | 25.077253 | 1.2282534 | 20.417002 | 0 |
xrd | 3.346444 | 0.5768758 | 5.800979 | 0 |
- Advertising (25.0773) and R&D (3.3464) have a positive effect on revenue
- R-squared (model fit index) is
## [1] 0.5117014
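To see what the fitted model implies, we can plug hypothetical spending levels into predict(); the xad and xrd values below are made up for illustration:
# Predicted sales (in millions of dollars) for a hypothetical firm
new_firm <- data.frame(xad = 500, xrd = 1000)
predict(sales_regression, newdata = new_firm)
# By hand: 5273.43 + 25.0773 * 500 + 3.3464 * 1000, roughly 21158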
We can add more control variables to the regression model
- Control Variables: pure_b2b, fmsize_at, mksh2
sales_regression_with_controls <- lm(sales ~ xad + xrd + pure_b2b + fmsize_at + mksh2,
                                     data = mini_case_data)
# Summary of the regression
sales_regression_with_controls_coef <- summary(sales_regression_with_controls)$coefficients
sales_regression_with_controls_r2 <- summary(sales_regression_with_controls)$r.squared
For the new model:
- Coefficients
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | -39982.337166 | 3956.2692006 | -10.106071 | 0.0000000 |
xad | 15.939687 | 1.1975483 | 13.310267 | 0.0000000 |
xrd | 2.978288 | 0.5275706 | 5.645289 | 0.0000000 |
pure_b2b | -4806.439350 | 2091.3768307 | -2.298218 | 0.0217563 |
fmsize_at | 5179.377667 | 505.7198986 | 10.241594 | 0.0000000 |
mksh2 | 104078.783632 | 9682.0954844 | 10.749613 | 0.0000000 |
- The relationship between the control variables and sales makes sense
- R-squared (model fit index) is
## [1] 0.6434498
- Model fit improves to 0.6434 from 0.5117
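Because the second model nests the first, one standard way to test whether the added controls jointly improve the fit is an F-test via anova(); this is an optional check beyond what the tutorial reports:
# F-test: do the control variables jointly improve the model?
anova(sales_regression, sales_regression_with_controls)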
1.2 Customer Model Analytics I: Regression Demo
1.2.1 Task Description
The goal is to predict customer satisfaction for a B2B company based on the following variables (see “Regression Demo Satisfaction.xlsx”).

Variable | Definition |
---|---|
ResponseID | Respondent ID |
OVERALLSATISFACTION | Overall satisfaction with SUPPLIER |
BID | Overall satisfaction with INITIAL SALES/BIDDING |
PSQUAL | Overall satisfaction with PRODUCT/SERVICE QUALITY |
PRICE | Overall satisfaction with PRICING & BILLING |
INVOLVEMENT | Involvement with the supplier |
GENDER | 1 = Female, 0 = Male |
1.2.2 Analysis using R
We need to make sure all necessary packages are installed. If any of these packages are not installed, run install.packages("<name of package>"). The next step is to load all the libraries we use in this exercise.
library(readxl)
library(dplyr)
library(fBasics)
library(car)
The next step is to import the data. We will use the read_excel function to import the Excel file into R.
# Importing Data
satisfaction_data <- read_excel("Regression Demo Satisfaction.xlsx")
Some variables can be represented on an interval scale
- Examples: satisfaction, income, height, price, and temperature
Categorical variables can only be represented on a nominal scale
- Examples: gender, ethnicity, brand, or location
In R, we can convert all the categorical variables into factors.
# Convert GENDER to a factor variable
satisfaction_data$GENDER <- as.factor(satisfaction_data$GENDER)
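A quick sanity check that the conversion worked, and of how respondents are distributed across the categories:
# Inspect the factor levels and the count of respondents in each category
levels(satisfaction_data$GENDER)
table(satisfaction_data$GENDER)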
1.2.2.1 Summary Statistics
Review the means of dependent and independent variables.
# Make a list of variables you want summary statistics for
var_list <- c("OVERALLSATISFACTION", "BID", "PSQUAL", "PRICE", "INVOLVEMENT")
# Make a data.frame containing summary statistics of interest
summ_stats <- fBasics::basicStats(satisfaction_data[var_list])
summ_stats <- as.data.frame(t(summ_stats))
# Keep and rename some of the columns for convenience
summ_stats <- summ_stats %>% select("nobs", "Mean", "Stdev", "Minimum", "Maximum")
summ_stats <- summ_stats %>% rename('N' = 'nobs')
summ_stats
 | N | Mean | Stdev | Minimum | Maximum |
---|---|---|---|---|---|
OVERALLSATISFACTION | 575 | 5.260870 | 1.328375 | 1 | 7 |
BID | 575 | 5.241739 | 1.294183 | 1 | 7 |
PSQUAL | 575 | 5.299130 | 1.336013 | 1 | 7 |
PRICE | 575 | 4.928696 | 1.386895 | 1 | 7 |
INVOLVEMENT | 575 | 5.422609 | 1.440036 | 1 | 7 |
Check if the means and standard deviations “make sense”
Check the minimum and maximum values and see if they are “correct”
1.2.2.2 Correlation among the variables
Review the correlation among the variables
# Check the correlations among variables in var_list
correlation_matrix <- cor(satisfaction_data[var_list])
# Round to three digits
correlation_matrix <- round(correlation_matrix, 3)
correlation_matrix
 | OVERALLSATISFACTION | BID | PSQUAL | PRICE | INVOLVEMENT |
---|---|---|---|---|---|
OVERALLSATISFACTION | 1.000 | 0.625 | 0.787 | 0.679 | 0.812 |
BID | 0.625 | 1.000 | 0.638 | 0.604 | 0.586 |
PSQUAL | 0.787 | 0.638 | 1.000 | 0.633 | 0.727 |
PRICE | 0.679 | 0.604 | 0.633 | 1.000 | 0.635 |
INVOLVEMENT | 0.812 | 0.586 | 0.727 | 0.635 | 1.000 |
You want to see high correlation between the dependent variable and independent variables.
You do not want to see tremendously high correlation among the independent variables (since they should ideally capture different factors driving the dependent variable).
1.2.2.3 Regression with one covariate
Let’s examine the statistical relationship between overall satisfaction and satisfaction with bidding.
Dependent variable: OVERALLSATISFACTION
Independent variable(s): BID
satisfaction_regression <- lm(OVERALLSATISFACTION ~ BID, data = satisfaction_data)
# Summary of the regression
satisfaction_regression_coef <- summary(satisfaction_regression)$coefficients
satisfaction_regression_r2 <- summary(satisfaction_regression)$r.squared
- Regression Coefficients
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 1.898 | 0.181 | 10.504 | 0 |
BID | 0.642 | 0.033 | 19.165 | 0 |
- Implies the following regression model:
Overall satisfaction = 1.898 + 0.642*Bidding Satisfaction
- R-squared (model fit index) is
## [1] 0.3906125
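As a quick worked example of reading this equation, take a hypothetical respondent with a bidding-satisfaction score of 5:
# Predicted overall satisfaction when BID = 5 (hypothetical value)
1.898 + 0.642 * 5   # = 5.108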
1.2.2.4 Regression with all the variables
Let’s examine the statistical relationship between overall satisfaction and all the variables.
Dependent variable: OVERALLSATISFACTION
Independent variable(s): BID ,PSQUAL, PRICE, INVOLVEMENT and GENDER
satisfaction_regression_all <- lm(OVERALLSATISFACTION ~ BID + PSQUAL + PRICE + INVOLVEMENT + GENDER,
                                  data = satisfaction_data)
# Summary of the regression
satisfaction_regression_all_coef <- summary(satisfaction_regression_all)$coefficients
satisfaction_regression_all_r2 <- summary(satisfaction_regression_all)$r.squared
- Regression Coefficients
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 0.337 | 0.148 | 2.285 | 0.023 |
BID | 0.067 | 0.029 | 2.273 | 0.023 |
PSQUAL | 0.331 | 0.033 | 10.036 | 0.000 |
PRICE | 0.147 | 0.028 | 5.228 | 0.000 |
INVOLVEMENT | 0.392 | 0.031 | 12.774 | 0.000 |
GENDER1 | -0.061 | 0.060 | -1.018 | 0.309 |
- Implies the following regression model:
Overall satisfaction = 0.337 + 0.067*Bidding Satisfaction + 0.331*Product Quality Satisfaction + 0.147*Price Satisfaction + 0.392*Involvement - 0.061*Gender
- R-squared (model fit index) is
## [1] 0.7598821
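As with the sales model, the fitted equation can be used for prediction; the respondent below is hypothetical:
# Predicted overall satisfaction for a hypothetical respondent
new_respondent <- data.frame(BID = 5, PSQUAL = 6, PRICE = 4, INVOLVEMENT = 6,
                             GENDER = factor(1, levels = c(0, 1)))
predict(satisfaction_regression_all, newdata = new_respondent)
# By hand: 0.337 + 0.067*5 + 0.331*6 + 0.147*4 + 0.392*6 - 0.061, roughly 5.54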
Adding variables to the model changes the results
- Let’s add all the strategic areas
- Notice the improvement in R-squared
- Notice the significance of all areas
1.2.2.5 Interpreting the output- multicollinearity
Detecting multicollinearity
- Use the VIF (variance inflation factor), defined as 1/(1 - R2j), where R2j is the R-squared from regressing the j-th predictor on all the other predictors
- an index that measures how much the variance of a coefficient is inflated because of collinearity
- Check if VIFs are less than 5 as a rule of thumb
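To make the definition concrete, here is a sketch that computes the VIF for BID by hand: regress BID on the other predictors and apply 1/(1 - R2). The result should match the car::vif() output below.
# VIF for BID by hand: R-squared from regressing BID on the other predictors
aux_r2 <- summary(lm(BID ~ PSQUAL + PRICE + INVOLVEMENT + GENDER,
                     data = satisfaction_data))$r.squared
1 / (1 - aux_r2)   # should be close to the 1.938 reported below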
The R function vif() from the car package can be used to detect multicollinearity in a regression model:
model_vif <- car::vif(satisfaction_regression_all)
model_vif
Variable | VIF |
---|---|
BID | 1.938 |
PSQUAL | 2.601 |
PRICE | 2.037 |
INVOLVEMENT | 2.626 |
GENDER | 1.206 |
The VIF scores for all the predictor variables are less than 5.
1.2.2.6 Regression is not a mechanical process
Look at the means of the dependent variable at different levels of each independent variable, so you can see whether that variable should be coded as a dummy variable or as a continuous variable
Examine the VIFs to ensure your model is reasonably robust and not an artifact of multicollinearity
Run at least 4-5 different models to get a “feel” for the underlying story. Which parameters are stable enough that they do not change when you drop or add variables?
Dropping variables, if necessary, should be done sequentially, one at a time, and with a view to examining:
- the change in R-squared
- the change in (and stability of) the parameters of the retained variables
- the managerial implications of dropping the variable for the remaining variables
Do compare the R-squared of different models to see the relative change in model fit. However, sometimes models with lower R-squared are better!
When is an R-squared “too small,” such that the model is not useful?
- There is no clear relationship between the value of R-squared and the value of the model. The regression model has value to a firm if it allows the firm to increase its profits, and that is possible even with very small R-squared values. On the other hand, different model specifications for the same outcome can be compared (ranked) according to the R-squared values.