It's Just a Linear Model! (ESUML!)
Statistical learning
Tools for modeling and understanding complex data: solving the problem of finding a predictive function from the data.
Response = Model + Error \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \epsilon_{i}\]
\[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \epsilon_{i}\] \[X_{i} \sim \text{continuous}\]
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.073 -6.835 -0.875 5.806 32.904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.33319 5.00127 20.261 < 2e-16 ***
## X -0.42624 0.05344 -7.976 2.85e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.707 on 98 degrees of freedom
## Multiple R-squared: 0.3936, Adjusted R-squared: 0.3874
## F-statistic: 63.62 on 1 and 98 DF, p-value: 2.853e-12
\[\color{greenyellow}{{SS_{modelo}} = \sum(\hat{y_{i}}-\bar{y})^2}\]
\[\color{red}{SS_{residual} = \sum({y_{i}}-\hat{y_{i}})^2}\]
\[\color{cyan}{SS_{total} = \sum({y_{i}}-\bar{y})^2}\]
\[R^2 = \frac{SS_{modelo}}{SS_{total}}\]
\[R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\]
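The sum-of-squares decomposition above can be verified by hand. A minimal sketch on simulated data (the generating values below are illustrative and are not the data behind the output shown above):

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 20)         # illustrative predictor
y <- 100 - 0.4 * x + rnorm(50, sd = 10)    # illustrative response
fit <- lm(y ~ x)

ss_model    <- sum((fitted(fit) - mean(y))^2)  # SS_model
ss_residual <- sum(residuals(fit)^2)           # SS_residual
ss_total    <- sum((y - mean(y))^2)            # SS_total

# Both definitions of R^2 agree with each other and with summary()
c(ss_model / ss_total,
  1 - ss_residual / ss_total,
  summary(fit)$r.squared)
```

For OLS with an intercept, SS_model + SS_residual = SS_total, which is why the two R² formulas coincide.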
\[Y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\]
The iris dataset was first published in 1936 by Ronald Fisher in the Annals of Eugenics; it proposed a methodology for describing "desirable traits" in support of the eugenics program.
Since 2021 I have instead used data on penguins of three species (Adelie, Chinstrap, and Gentoo) from the Palmer Archipelago (Antarctica), available through the palmerpenguins package.
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)
anova(fit)
## Analysis of Variance Table
##
## Response: bill_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 7194.3 3597.2 410.6 < 2.2e-16 ***
## Residuals 339 2969.9 8.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
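The ANOVA table contains the same sums of squares defined earlier: 7194.3 is SS_model and 2969.9 is SS_residual, so R² can be recovered directly from `anova(fit)`:

```r
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)

tab <- anova(fit)
ss_model    <- tab["species", "Sum Sq"]    # 7194.3
ss_residual <- tab["Residuals", "Sum Sq"]  # 2969.9

# SS_model / SS_total, matching Multiple R-squared: 0.7078
ss_model / (ss_model + ss_residual)
```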
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)
summary(fit)
##
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9338 -2.2049 0.0086 2.0662 12.0951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7914 0.2409 161.05 <2e-16 ***
## speciesChinstrap 10.0424 0.4323 23.23 <2e-16 ***
## speciesGentoo 8.7135 0.3595 24.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7078, Adjusted R-squared: 0.7061
## F-statistic: 410.6 on 2 and 339 DF, p-value: < 2.2e-16
\[Y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\] \[Y_{ij} = \beta_{0} + \beta_{1}X_{i1} + ... \beta_{j}X_{ij} + \epsilon_{ij}\]
\[Y_{ij} = \beta_{0} + \beta_{dummy1}X_{i,dummy1} + \beta_{dummy2}X_{i,dummy2} + \epsilon_{ij}\]
\[Y_{i, Adelie} = \beta_{0} + \epsilon_{i}\]
\[Y_{i, Chinstrap} = \beta_{0} + \beta_{dummy1} + \epsilon_{i}\] \[Y_{i, Gentoo} = \beta_{0} + \beta_{dummy2} + \epsilon_{i}\]
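The dummy variables in these equations are exactly the columns R builds internally; `model.matrix()` makes them visible, and the group means follow from the coefficients:

```r
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)

# One row per penguin: an intercept column plus two 0/1 dummies.
# Adelie is the reference level (both dummies 0).
head(model.matrix(fit))

# Group means recovered from the coefficients:
coef(fit)[1]                 # Adelie mean    (beta_0)
coef(fit)[1] + coef(fit)[2]  # Chinstrap mean (beta_0 + beta_dummy1)
coef(fit)[1] + coef(fit)[3]  # Gentoo mean    (beta_0 + beta_dummy2)
```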
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins, na.action = na.exclude)
shapiro.test(residuals(fit))
bartlett.test(residuals(fit) ~ penguins$species)
##
## Shapiro-Wilk normality test
##
## data: residuals(fit)
## W = 0.98903, p-value = 0.01131
##
## Bartlett test of homogeneity of variances
##
## data: residuals(fit) by penguins$species
## Bartlett's K-squared = 5.6179, df = 2, p-value = 0.06027
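Shapiro-Wilk rejects normality of the residuals (p = 0.011) while Bartlett does not reject homogeneity of variances (p = 0.060). As a robustness sketch (my suggestion, not a prescription from the slides), a Kruskal-Wallis test reaches the same conclusion about species differences without assuming normality:

```r
library(palmerpenguins); data("penguins")

# Rank-based alternative to one-way ANOVA; no normality assumption
res <- kruskal.test(bill_length_mm ~ species, data = penguins)
res
```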
Training data and test data
set.seed(1234)
X <- rnorm(100, mean = 10, sd = 20)
Y <- 100 + 0.25*X + 0.019*X^2 + rnorm(100, 0, 10)
fit1 <- lm(Y ~ X)
fit2 <- lm(Y ~ poly(X, 2))
fit3 <- lm(Y ~ poly(X, 10))
summary(fit1)$r.squared # LINEAR
## [1] 0.5408381
summary(fit2)$r.squared # QUADRATIC
## [1] 0.7319006
summary(fit3)$r.squared # DEGREE 10
## [1] 0.7412149
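In-sample R² keeps rising with polynomial degree even though the true model is quadratic. A held-out test set makes the overfitting visible; a minimal sketch under the same data-generating process (the 70/30 split is my choice, not from the slides):

```r
set.seed(1234)
X <- rnorm(100, mean = 10, sd = 20)
Y <- 100 + 0.25 * X + 0.019 * X^2 + rnorm(100, 0, 10)
dat <- data.frame(X, Y)

train <- dat[1:70, ]    # training data: used to fit
test  <- dat[71:100, ]  # test data: held out from fitting

fit1 <- lm(Y ~ X, data = train)
fit2 <- lm(Y ~ poly(X, 2), data = train)
fit3 <- lm(Y ~ poly(X, 10), data = train)

# Mean squared error on the held-out test set
mse <- function(fit) mean((test$Y - predict(fit, newdata = test))^2)
c(linear = mse(fit1), quadratic = mse(fit2), degree10 = mse(fit3))
```

The quadratic model, which matches the generating process, predicts the held-out data better than the linear one; the degree-10 fit gains nothing real despite its higher in-sample R².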