class: center, middle

## IMSE 440
## Applied Statistical Models in Engineering
## Multiple linear regression
## Categorical predictors

[ISLR book](https://www.statlearning.com): Chapter 3.3.1

---

# Categorical predictors

--

A binary dummy variable:

$$
\small{
\text{student} =
\begin{cases}
1, & \text{if “Yes”,} \\\
0, & \text{if “No”.}
\end{cases}
}
$$

--

The category that is set to zero is commonly referred to as the .red[baseline].

--

$$\small{\text{balance}=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{student}+\epsilon}$$

---

$$
\small{
\begin{aligned}
\text{balance}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{student}+\epsilon \\\
\\\
\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{student}
\end{aligned}
}
$$

--
$$
\small{
\begin{aligned}
\color{gray}{\text{For non-students:}}\;\;\;\;\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot \color{red}{0} \\\
\\\
&=\beta_0+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For students:}}\;\;\;\;\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot \color{red}{1} \\\
\\\
&=(\beta_0+\beta_2)+\beta_1\cdot\text{income} \\\
\end{aligned}
}
$$

--

What are the interpretations of the model parameters?

---

What if the effect of income on balance is different for students vs. non-students?

--

$$
\small{
\begin{aligned}
\text{balance}=\beta_0&+\beta_1\cdot\text{income} \\\
&+\beta_2\cdot\text{student} \\\
&+ \color{blue}{\beta_3\cdot\text{income $\cdot$ student}}+\epsilon \\\
\end{aligned}
}
$$

--

$$
\small{
\begin{aligned}
\color{gray}{\text{For non-students:}}\;\;\;\;\text{E[balance]}=\beta_0&+\beta_1\cdot\text{income} \\\
&+\beta_2\cdot \color{red}{0} \\\
&+ \beta_3\cdot\text{income $\cdot$} \color{red}{0} \\\
=\beta_0&+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For students:}}\;\;\;\;\text{E[balance]}=\beta_0&+\beta_1\cdot\text{income} \\\
&+\beta_2\cdot \color{red}{1} \\\
&+ \beta_3\cdot\text{income $\cdot$} \color{red}{1} \\\
=\beta_0 &+ \beta_2+(\beta_1+\beta_3)\cdot\text{income} \\\
\end{aligned}
}
$$

---

$$
\small{
\begin{aligned}
\color{gray}{\text{For non-students:}}\;\;\;\;\text{E[balance]}=\beta_0&+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For students:}}\;\;\;\;\text{E[balance]}=\beta_0 &+ \beta_2+(\beta_1+\beta_3)\cdot\text{income} \\\
\end{aligned}
}
$$

---

# Categorical predictors with 3+ levels

Ethnicity: African American, Asian, Caucasian

--

A dummy variable that takes 3 values?
For example:

$$
\small{
\text{ethnicity} =
\begin{cases}
0, & \text{if “African American”,} \\\
1, & \text{if “Asian”,} \\\
2, & \text{if “Caucasian”.} \\\
\end{cases}
}
$$

--

$$\small{\text{balance}=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{ethnicity}+\epsilon}$$

---

$$\small{\text{E[balance]}=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{ethnicity}}$$

--

$$
\small{
\begin{aligned}
\color{gray}{\text{For African American:}}\;\;\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot \color{red}{0} \\\
\\\
&=\beta_0+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For Asian:}}\;\;\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot \color{red}{1} \\\
\\\
&=(\beta_0+\beta_2)+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For Caucasian:}}\;\;\text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot \color{red}{2} \\\
\\\
&=(\beta_0+2\beta_2)+\beta_1\cdot\text{income} \\\
\end{aligned}
}
$$

--

This coding imposes an arbitrary ordering on the levels and forces equal gaps in E[balance] between consecutive levels.

---

A better way: using .red[two] binary dummy variables

$$
\small{
\text{asian} =
\begin{cases}
1, & \text{if “Asian”,} \\\
0, & \text{otherwise,} \\\
\end{cases}
\;\;\;
\text{caucasian} =
\begin{cases}
1, & \text{if “Caucasian”,} \\\
0, & \text{otherwise.} \\\
\end{cases}
}
$$

--

$$\small{
\text{(asian, caucasian)}=
\begin{cases}
(1, 0), & \text{if “Asian”,} \\\
(0, 1), & \text{if “Caucasian”,} \\\
(0, 0), & \text{if “African American”.} \\\
\end{cases}
}$$

--

$$\small{\text{balance}=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{asian}+\beta_3\cdot\text{caucasian}+\epsilon}$$

---

$$\small{\text{E[balance]}=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot\text{asian}+\beta_3\cdot\text{caucasian}}$$

$$
\small{
\begin{aligned}
\color{gray}{\text{For AA:}}\;\; \text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot 0 +\beta_3\cdot 0 \\\
&=\beta_0+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For Asian:}}\;\; \text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot 1 +\beta_3\cdot 0 \\\
&=(\beta_0+\beta_2)+\beta_1\cdot\text{income} \\\
\\\
\color{gray}{\text{For Caucasian:}}\;\; \text{E[balance]}&=\beta_0+\beta_1\cdot\text{income}+\beta_2\cdot 0 +\beta_3\cdot 1 \\\
&=(\beta_0+\beta_3)+\beta_1\cdot\text{income} \\\
\end{aligned}
}
$$

---

What if a categorical predictor has 4 levels (A, B, C, D)?

--

In general, we need .red[one fewer] binary dummy variables than the number of levels.

---

# We can further add interaction terms

$$
\small{
\begin{aligned}
\text{balance}=\beta_0&+\beta_1\cdot\text{income}\\\
&+\beta_2\cdot\text{asian}\\\
&+\beta_3\cdot\text{caucasian} \\\
&+\beta_4\cdot\text{income}\cdot \text{asian} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \color{lightgray}{\rightarrow \text{2-way interaction}}\\\
&+\beta_5\cdot\text{income}\cdot \text{caucasian} \;\;\;\;\;\;\;\;\; \color{lightgray}{\rightarrow \text{2-way interaction}}\\\
&+\epsilon
\end{aligned}
}
$$
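
---

A minimal Python sketch of the dummy coding and the income-by-ethnicity interactions above. The helper names (`dummy_code`, `design_row`, `group_line`) and the coefficient values in `BETA` are hypothetical, chosen only for illustration; they are not estimates from any data set. The sketch shows that k levels need k - 1 dummies, and that the interaction terms give each group its own intercept and slope.

```python
def dummy_code(value, levels):
    """Encode one categorical value as len(levels) - 1 binary dummies.

    levels[0] is treated as the baseline and maps to all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == lvl else 0 for lvl in levels[1:]]


def design_row(income, ethnicity, levels):
    """Row [1, income, asian, caucasian, income*asian, income*caucasian]."""
    dummies = dummy_code(ethnicity, levels)
    return [1.0, income] + dummies + [income * d for d in dummies]


def expected_balance(row, beta):
    """E[balance] for one design row: the dot product of beta and the row."""
    return sum(b * x for b, x in zip(beta, row))


def group_line(ethnicity, levels, beta):
    """Intercept and income slope that the model implies for one group."""
    at_0 = expected_balance(design_row(0.0, ethnicity, levels), beta)
    at_1 = expected_balance(design_row(1.0, ethnicity, levels), beta)
    return at_0, at_1 - at_0


LEVELS = ["African American", "Asian", "Caucasian"]  # first level = baseline

# Hypothetical values for beta_0 .. beta_5 (illustration only, not fitted).
BETA = [200.0, 5.0, -30.0, 10.0, 1.5, -0.5]

for lvl in LEVELS:
    intercept, slope = group_line(lvl, LEVELS, BETA)
    print(f"{lvl}: intercept = {intercept}, slope = {slope}")

# A 4-level predictor (A, B, C, D) needs 3 dummies; A is the baseline here.
print(dummy_code("A", ["A", "B", "C", "D"]))
print(dummy_code("D", ["A", "B", "C", "D"]))
```

With these assumed coefficients, the baseline group gets intercept β0 and slope β1, the Asian group gets β0 + β2 and β1 + β4, and the Caucasian group gets β0 + β3 and β1 + β5. Because the asian and caucasian dummies are never both 1, their product is 0 for every observation, so a term such as income · asian · caucasian would add nothing to the model.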