class: center, middle ## IMSE 440 ## Applied Statistical Models in Engineering
## Simple linear regression ## Assessing model accuracy [ISLR book](https://www.statlearning.com): Chapter 3.1.3 --- # Assessing the accuracy of the model If $H_0$ is rejected, meaning there is a significant relationship between $x$ and $y$, -- we want to see .red[how strong this relationship is]. -- .center[How?] -- Look at the extent to which the model fits the data. -- .center[   ] --- .center[   ] How about using the "average amount of error"? $$\text{RSE}=\sqrt{\frac{e_1^2+e_2^2+\cdots+e_n^2}{n-2}}=\sqrt{\frac{\text{RSS}}{n-2}}$$ -- RSE provides an absolute measure of model accuracy. --- $$\text{RSE}=\sqrt{\frac{e_1^2+e_2^2+\cdots+e_n^2}{n-2}}=\sqrt{\frac{\text{RSS}}{n-2}}$$ In the advertising example, "sales ~ TV" ``` RSE = 3.26 ``` -- Interpretation The actual sales in each market deviate from the model prediction by an average of 3,260 units. --
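To make the calculation concrete, here is a minimal sketch in Python (with made-up numbers, not the actual advertising data), computing RSE directly from the residuals of an OLS fit:

```python
import numpy as np

# Made-up illustrative data (not the ISLR advertising data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS fit: np.polyfit returns (slope, intercept) for degree 1
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

rss = np.sum(residuals**2)
rse = np.sqrt(rss / (len(x) - 2))  # n - 2 degrees of freedom
print(round(rse, 4))               # ≈ 0.1751
```

--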
It is not always clear what constitutes a "good" RSE. --- # Residual Sum of Squares $$ \small{ \begin{aligned} &\text{RSS}=e_1^2+e_2^2+\cdots+e_n^2 \\\ \\\ &\text{where $e_i=y_i-\hat{y}_i$} \\\ \end{aligned} } $$ For a simple linear regression model, $\small{\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i}$ .center[] -- RSS measures the total amount of variability that is .red[left unexplained] after fitting the model. --- # Total Sum of Squares $$ \small{ \begin{aligned} &\text{TSS}=(y_1-\bar{y})^2+(y_2-\bar{y})^2+\cdots+(y_n-\bar{y})^2 \\\ \\\ &\text{where $\bar{y}=\frac{1}{n}\sum y_i$} \\\ \end{aligned} } $$ .center[] -- TSS measures the total amount of variability that is .red[inherent in the response $Y$] before fitting a model. --- # Total Sum of Squares $$ \small{ \begin{aligned} &\text{TSS}=(y_1-\bar{y})^2+(y_2-\bar{y})^2+\cdots+(y_n-\bar{y})^2 \\\ \\\ &\text{where $\bar{y}=\frac{1}{n}\sum y_i$} \\\ \end{aligned} } $$ Note that to calculate TSS, we only need the data on $y$. TSS has nothing to do with which model is used. --- $$R^2=\frac{\text{TSS}-\text{RSS}}{\text{TSS}}=1-\frac{\text{RSS}}{\text{TSS}}$$ -- $R^2$ measures the .red[proportion] of variability in $Y$ that is explained by the model. -- In the advertising example, "sales ~ TV" ``` R2 = 0.612 ``` -- Interpretation 61.2% of the variability in sales is explained by a linear regression on TV budget. --- .center[] -- .center[] .gray[.tiny[[Image source](https://twitter.com/TedPetrou/status/1205897381356683264)]] --- $$R^2=1-\frac{\text{RSS}}{\text{TSS}}$$ Property: $$R^2 \leq 1$$ -- If a model is accurate, the RSS will be much smaller than the TSS, and $R^2$ will be close to 1. --- $$R^2=1-\frac{\text{RSS}}{\text{TSS}}$$ $$ \begin{aligned} \text{where } \text{RSS}&=\sum (y_i - \hat{y}_i)^2 \\\ \text{and } \text{TSS}&=\sum (y_i - \bar{y})^2 \\\ \end{aligned} $$ The TSS can be considered as a special case of RSS when a "mean model" is used. 
$$\hat{y}_i=\bar{y}$$ .center[] --- $$R^2=1-\frac{\text{RSS}}{\text{TSS}}$$ $$ \begin{aligned} \text{where } \text{RSS}&=\sum (y_i - \hat{y}_i)^2 \\\ \text{and } \text{TSS}&=\sum (y_i - \bar{y})^2 \\\ \end{aligned} $$ The definition of $R^2$ does not rely on the form of the model (e.g., linear regression). --
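As a sketch (with made-up numbers, not the advertising data), $R^2$ can be computed from RSS and TSS for any set of predictions, and the "mean model" gives exactly $R^2=0$:

```python
import numpy as np

# Made-up response values and predictions from some fitted model
y     = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.1, 4.1, 6.0, 8.1, 9.9])

tss = np.sum((y - y.mean())**2)  # variability inherent in y
rss = np.sum((y - y_hat)**2)     # variability left unexplained
r2  = 1 - rss / tss

# The "mean model" predicts y-bar everywhere, so RSS == TSS and R^2 == 0
r2_mean = 1 - np.sum((y - y.mean())**2) / tss
```

--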
In an OLS linear regression, we minimize RSS, and the "mean model" $\hat{y}_i=\bar{y}$ is one candidate fit, so OLS can do no worse than it. $$\text{thus, }\text{RSS} \leq \text{TSS}$$ $$\text{thus, }0 \leq R^2 \leq 1$$ --- $$R^2=1-\frac{\text{RSS}}{\text{TSS}}$$ However, in theory, $R^2 \geq 0$ does not always hold: a model that fits worse than the mean model has $\text{RSS} > \text{TSS}$, and hence a negative $R^2$.
.center[   ] -- In practice, we can always fall back to a "mean model" which has $R^2=0$. --- Population covariance and correlation (coefficient) $$\small{ \begin{aligned} \text{cov}(X, Y)&=\text{E}\big[\big(X-\text{E}[X]\big)\big(Y-\text{E}[Y]\big)\big] \\\ \\\ \rho&=\frac{\text{cov}(X, Y)}{\sqrt{\text{var}(X) \text{var}(Y)}} \\\ \end{aligned} } $$ -- Sample covariance and correlation (coefficient) $$\small{ \begin{aligned} \text{cov}(x, y)&=\frac{\sum(x\_i-\bar{x})(y\_i-\bar{y})}{n-1}=\frac{S\_{xy}}{n-1} \\\ \\\ r&=\frac{\sum(x\_i-\bar{x})(y\_i-\bar{y})}{\sqrt{\sum(x\_i-\bar{x})^2\sum(y\_i-\bar{y})^2}}=\frac{S\_{xy}}{\sqrt{S\_{xx}S\_{yy}}} \\\ \end{aligned} } $$ --- $$\small{r=\frac{\sum(x\_i-\bar{x})(y\_i-\bar{y})}{\sqrt{\sum(x\_i-\bar{x})^2\sum(y\_i-\bar{y})^2}}=\frac{S\_{xy}}{\sqrt{S\_{xx}S\_{yy}}}}$$ $r$ measures the .red[linear relationship] between $x$ and $y$.
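A minimal sketch (made-up data) computing $r$ from $S_{xy}$, $S_{xx}$, and $S_{yy}$, and checking it against numpy's built-in `np.corrcoef`:

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean())**2)
s_yy = np.sum((y - y.mean())**2)

r = s_xy / np.sqrt(s_xx * s_yy)

# Agrees with numpy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```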
--  --- $$\small{r=\frac{\sum(x\_i-\bar{x})(y\_i-\bar{y})}{\sqrt{\sum(x\_i-\bar{x})^2\sum(y\_i-\bar{y})^2}}=\frac{S\_{xy}}{\sqrt{S\_{xx}S\_{yy}}}}$$ Can we use $r$ to assess the accuracy of a fitted OLS simple linear regression model? --
Yes. In fact, for an OLS simple linear regression, we have $$R^2=r^2$$ --- $x$: ice cream sales 🍦 $y$: number of people drowning in swimming pools 🏊 -- Are $x$ and $y$ correlated? --
# .red[Correlation does not equal causation.]
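---

# Verifying $R^2 = r^2$

As a quick numerical sketch (made-up data, not the advertising data set), we can check that $R^2$ from an OLS simple linear regression equals the squared sample correlation:

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS simple linear regression fit
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
r = np.corrcoef(x, y)[0, 1]

# Holds for OLS simple linear regression (not in general)
assert np.isclose(r2, r**2)
```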