R Programming Tutoring and Solutions: Day 2 Lab Activities - MAT 500: Linear Regression and PCA


拓端tecdat, 2022-11-28 12:01:35


Directions: Complete the following exercises using the code discussed during computer lab. Save your work in an R script as well as a Word document containing the necessary output and comments. Be sure to use notes in the script to justify any computations. If you have any questions, do not hesitate to ask.

1 Simple Linear Regression

1. Load the data set pressure from the datasets package in R. Perform a Simple Linear Regression on the two variables. Provide the regression equation, coefficients table, and anova table. Summarize your findings. What is the relationship between the t statistic for temperature and the F statistic in the ANOVA table?
2. Refer to the previous exercise. Check the assumptions on the regression model and report your results. Be sure to include the scatterplot with regression equation, normal QQ plot, and residual plot. Explain what you see.
3. Refer to exercise 1. Experiment with different transformations of the data to improve the model. What is the best transformation?

2 Multiple Linear Regression

1. Load the swiss data set from the ‘datasets’ package in R. Find the correlation matrix and print the pairwise scatterplots. Which variables seem to be related?
2. Run a Multiple Regression on Fertility using all of the other variables as predictors. Print the model and coefficients table. Explain the meaning of the significant coefficients.
3. Check the assumptions using the diagnostic tests mentioned in this section. Discuss your findings.
4. Run a stepwise selection method to reduce the dimension of the model using the backward direction. Print the new model and new coefficients table. Check the assumptions and discuss any changes.
5. Use Mallows' Cp to determine the best model. Does your choice match the model in the previous exercise?

3 Principal Component Analysis

1. Load the longley data set from the R datasets package. This data set was used to predict a country's GNP based on several variables. Find the correlation matrix of the explanatory variables.
2. Refer to the previous exercise. Perform a principal component analysis on the explanatory variables using the correlation matrix. Use a scree plot to determine the optimal number of components and report them. Try to explain the meaning behind each component.
3. Refer to the previous exercise. What proportion of variation does each component explain? What is the total cumulative variance explained by the optimal number of components?

Day 2 Lab Activities - Solutions

Simple Linear Regression

1. > pressure.lm <- lm(pressure ~ temperature, data = pressure)

> summary(pressure.lm)

 

Call:

lm(formula = pressure ~ temperature, data = pressure)

 

Residuals:

    Min      1Q  Median      3Q     Max

-158.08 -117.06  -32.84   72.30  409.43

 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)   

(Intercept) -147.8989    66.5529  -2.222 0.040124 * 

temperature    1.5124     0.3158   4.788 0.000171 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 150.8 on 17 degrees of freedom

Multiple R-squared:  0.5742,    Adjusted R-squared:  0.5492

F-statistic: 22.93 on 1 and 17 DF,  p-value: 0.000171

 

> anova(pressure.lm)

Analysis of Variance Table

 

Response: pressure

            Df Sum Sq Mean Sq F value   Pr(>F)   

temperature  1 521530  521530   22.93 0.000171 ***

Residuals   17 386665   22745                    

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The fitted regression equation is pressure = -147.8989 + 1.5124 × temperature. The temperature coefficient is positive, so the relationship between temperature and pressure is direct. Since the p-value (0.000171) is less than 0.05, temperature is significant in the model. The t statistic and the F statistic are related by t^2 = F: 4.788^2 ≈ 22.93, and both tests report the same p-value.
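The identity t^2 = F can be checked directly from the fitted model. The following is a minimal sketch; the names t_stat and f_stat are illustrative and do not appear in the original lab.

```r
# Refit the simple linear regression from exercise 1
pressure.lm <- lm(pressure ~ temperature, data = pressure)

# t statistic for the temperature slope, taken from the coefficients table
t_stat <- summary(pressure.lm)$coefficients["temperature", "t value"]

# F statistic for temperature, taken from the ANOVA table
f_stat <- anova(pressure.lm)["temperature", "F value"]

# With a single predictor, the squared t statistic equals the F statistic
all.equal(t_stat^2, f_stat)
```

The equality holds only because the model has one predictor, so the slope t test and the overall F test examine the same hypothesis.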

2. Linearity: The scatterplot shows a clear violation of the linearity assumption. The data appears to be exponentially increasing. The standardized residual plot reinforces this observation.

Equal Variance: The lack of a linear relationship makes it difficult to determine the equality of variance in observations.

Normality: The Normal Quantile plot shows a lack of linearity at the tails of the data set. A Shapiro-Wilk test verifies that the residuals do not follow a normal distribution.


Shapiro-Wilk normality test

data:  rstandard(pressure.lm)

W = 0.8832, **p-value = 0.02438**
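The plots and the test used for these checks can be reproduced along the following lines. This is a sketch using base R graphics; the plot titles, axis labels, and the name sw are illustrative choices rather than the lab's own.

```r
pressure.lm <- lm(pressure ~ temperature, data = pressure)

# Scatterplot with the fitted regression line
plot(pressure ~ temperature, data = pressure,
     main = "Pressure vs. temperature")
abline(pressure.lm)

# Standardized residuals against fitted values
plot(fitted(pressure.lm), rstandard(pressure.lm),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the standardized residuals
qqnorm(rstandard(pressure.lm))
qqline(rstandard(pressure.lm))

# Shapiro-Wilk test on the standardized residuals
sw <- shapiro.test(rstandard(pressure.lm))
sw
```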


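3. Using a Box-Cox transformation, the estimated optimal power is λ = 0.01. The Box-Cox family is (y^λ - 1)/λ for λ ≠ 0 and log(y) for λ = 0, so a value this close to 0 is practically a log transformation of pressure.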
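A Box-Cox search of this kind can be sketched with boxcox() from the MASS package. The lambda grid and the names bc, lambda_hat, and pressure.log are assumptions made for illustration, so the estimate printed here need not match the reported value exactly.

```r
library(MASS)  # provides boxcox()

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(pressure ~ temperature, data = pressure,
             lambda = seq(-1, 1, by = 0.01), plotit = FALSE)

# Grid value that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Refit with a log-transformed response (the lambda = 0 member of the family)
pressure.log <- lm(log(pressure) ~ temperature, data = pressure)
summary(pressure.log)$r.squared
```

Comparing the R-squared of pressure.log with that of the untransformed model gives a quick sense of how much the transformation helps, although the residual plots should still be checked.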

Multiple Linear Regression

                 Fertility Agriculture Examination Education Catholic Infant.Mortality

Fertility            1.000       0.353      -0.646    -0.664    0.464            0.417

Agriculture          0.353       1.000      -0.687    -0.640    0.401           -0.061

Examination         -0.646      -0.687       1.000     0.698   -0.573           -0.114

Education           -0.664      -0.640       0.698     1.000   -0.154           -0.099

Catholic             0.464       0.401      -0.573    -0.154    1.000            0.175

Infant.Mortality     0.417      -0.061      -0.114    -0.099    0.175            1.000

 

*Related Variables:*

 

Fertility, Agriculture

Fertility, Examination

Fertility, Infant Mortality

Agriculture, Examination

Agriculture, Education

Examination, Education
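The correlation matrix and scatterplots can be produced as sketched below. The last step lists pairs above an absolute-correlation cutoff of 0.6, which is an illustrative assumption and not a threshold taken from the lab, so its output need not match the list above.

```r
# Correlation matrix of all six variables, rounded to three decimals
swiss.cor <- round(cor(swiss), 3)
swiss.cor

# Pairwise scatterplots
pairs(swiss)

# Variable pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.6  # assumption: illustrative threshold only
idx <- which(abs(swiss.cor) > cutoff & upper.tri(swiss.cor), arr.ind = TRUE)
strong.pairs <- data.frame(var1 = rownames(swiss.cor)[idx[, "row"]],
                           var2 = colnames(swiss.cor)[idx[, "col"]],
                           r    = swiss.cor[idx])
strong.pairs
```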


Call:

lm(formula = Fertility ~ Agriculture + Examination + Education +

    Catholic + Infant.Mortality, data = swiss)

 

Residuals:

     Min       1Q   Median       3Q      Max

-15.2743  -5.2617   0.5032   4.1198  15.3213

 

Coefficients:

                 Estimate Std. Error t value Pr(>|t|)   

(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***

Agriculture      -0.17211    0.07030  -2.448  0.01873 * 

Examination      -0.25801    0.25388  -1.016  0.31546   

Education        -0.87094    0.18303  -4.758 2.43e-05 ***

Catholic          0.10412    0.03526   2.953  0.00519 **

Infant.Mortality  1.07705    0.38172   2.822  0.00734 **

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 7.165 on 41 degrees of freedom

Multiple R-squared:  0.7067,    Adjusted R-squared:  0.671

F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

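All of the predictors are significant at the 0.05 level except Examination (p = 0.315). Holding the other predictors fixed, a one-unit increase in Education is associated with a 0.871 decrease in Fertility and a one-unit increase in Agriculture with a 0.172 decrease, while one-unit increases in Catholic and Infant.Mortality are associated with increases of 0.104 and 1.077, respectively.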
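The model whose Call appears in the output above can be fitted as follows. This is a sketch; the helper names coefs and significant are illustrative.

```r
# Multiple regression of Fertility on the five other swiss variables
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education +
                 Catholic + Infant.Mortality, data = swiss)

# Coefficients table (estimate, standard error, t value, p-value)
coefs <- summary(swiss.lm)$coefficients
coefs

# Predictors, excluding the intercept, with p-values below 0.05
significant <- rownames(coefs)[-1][coefs[-1, "Pr(>|t|)"] < 0.05]
significant
```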

3. Residual Plot: There is a random pattern in the residual plot which causes no concern with the model fit.

Normal Q-Q Plot: The data follows the diagonal line quite nicely, indicating that the residuals probably satisfy the normality assumption.

Scale-Location: The points are randomly scattered, which indicates that the homoscedasticity assumption is probably met.
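These three diagnostic plots correspond to the first three panels of plot() for an lm object. The sketch below draws them side by side; the Shapiro-Wilk test at the end is an optional numeric check added here as an assumption, since the lab text does not say which tests it used.

```r
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education +
                 Catholic + Infant.Mortality, data = swiss)

# Residuals vs fitted, normal Q-Q, and scale-location plots in one row
op <- par(mfrow = c(1, 3))
plot(swiss.lm, which = 1:3)
par(op)

# Optional numeric check of residual normality (not from the original lab)
swiss.sw <- shapiro.test(residuals(swiss.lm))
swiss.sw
```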


> swiss.step.b <- step(swiss.lm, direction = 'backward')

Start:  AIC=190.69

Fertility ~ Agriculture + Examination + Education + Catholic +

    Infant.Mortality

 

                   Df Sum of Sq    RSS    AIC

- Examination       1     53.03 2158.1 189.86

<none>                          2105.0 190.69

- Agriculture       1    307.72 2412.8 195.10

- Infant.Mortality  1    408.75 2513.8 197.03

- Catholic          1    447.71 2552.8 197.75

- Education         1   1162.56 3267.6 209.36

 

Step:  AIC=189.86

Fertility ~ Agriculture + Education + Catholic + Infant.Mortality

 

                   Df Sum of Sq    RSS    AIC

<none>                          2158.1 189.86

- Agriculture       1    264.18 2422.2 193.29

- Infant.Mortality  1    409.81 2567.9 196.03

- Catholic          1    956.57 3114.6 205.10

- Education         1   2249.97 4408.0 221.43

 

Call:

lm(formula = Fertility ~ Agriculture + Education + Catholic +

    Infant.Mortality, data = swiss)

 

Residuals:

     Min       1Q   Median       3Q      Max

-14.6765  -6.0522   0.7514   3.1664  16.1422

 

Coefficients:

                 Estimate Std. Error t value Pr(>|t|)   

(Intercept)      62.10131    9.60489   6.466 8.49e-08 ***

Agriculture      -0.15462    0.06819  -2.267  0.02857 * 

Education        -0.98026    0.14814  -6.617 5.14e-08 ***

Catholic          0.12467    0.02889   4.315 9.50e-05 ***

Infant.Mortality  1.07844    0.38187   2.824  0.00722 **

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 7.168 on 42 degrees of freedom

Multiple R-squared:  0.6993,    Adjusted R-squared:  0.6707

F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10


The new model does not include the examination variable. Now all of the predictors are significant.

Residual Plot: There is a random pattern in the residual plot which causes no concern with the model fit.

Normal Q-Q Plot: The data no longer seems to follow a precise normal distribution. This assumption may now be violated.

Scale-Location: The points are randomly scattered, which indicates that the homoscedasticity assumption is probably met.

5. The two models with the best Mallows' Cp values are the model with all 5 variables and the model with the 4 variables Agriculture, Education, Catholic, and Infant.Mortality. Since a simpler model is preferred, the best choice is the model with four explanatory variables. This is exactly the model that backward selection identified.
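The lab does not show how Cp was computed, so the following is only one possible sketch in base R. It uses the usual definition Cp = RSS_p / s^2 - n + 2p, where s^2 is the residual mean square of the full model and p counts the coefficients including the intercept. The names subsets and cp_table are illustrative.

```r
predictors <- setdiff(names(swiss), "Fertility")
n <- nrow(swiss)

# Residual mean square of the full model, used as the variance estimate
full <- lm(Fertility ~ ., data = swiss)
s2 <- sum(resid(full)^2) / df.residual(full)

# Every non-empty subset of the five predictors
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)

# Mallows' Cp for each candidate model
cp_table <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Fertility"), data = swiss)
  p <- length(vars) + 1  # coefficients including the intercept
  data.frame(model = paste(vars, collapse = " + "),
             p = p,
             Cp = sum(resid(fit)^2) / s2 - n + 2 * p)
}))

# Models sorted by Cp, smallest first
cp_table <- cp_table[order(cp_table$Cp), ]
head(cp_table)
```

By construction the full model always has Cp equal to its number of coefficients, so the comparison of interest is whether a smaller model reaches a Cp close to or below its own p.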


Principal Component Analysis

             Unemployed Armed.Forces Population  Year Employed

Unemployed        1.000       -0.177      0.687 0.668    0.502

Armed.Forces     -0.177        1.000      0.364 0.417    0.457

Population        0.687        0.364      1.000 0.994    0.960

Year              0.668        0.417      0.994 1.000    0.971

Employed          0.502        0.457      0.960 0.971    1.000

2.

             Comp.1  Comp.2

Unemployed   0.3633  0.5988

Armed.Forces 0.2269 -0.7911

Population   0.5261  0.0435

Year         0.5291 -0.0024

Employed     0.5097 -0.1171

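The first component loads positively on all five variables, most heavily on Population, Year, and Employed, so it can be read as a standardized measure of GNP. The second component mainly contrasts Unemployed (0.5988) with Armed.Forces (-0.7911) and is difficult to interpret.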
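Loadings of this form can be obtained with princomp() on the correlation matrix. The sketch below assumes the five explanatory variables are the ones shown in the correlation table above; the names vars and longley.pca are illustrative. The sign of each component is arbitrary, so a column may come out with all signs flipped relative to the table.

```r
# Explanatory variables as listed in the correlation table (assumed selection)
vars <- c("Unemployed", "Armed.Forces", "Population", "Year", "Employed")

# PCA on the correlation matrix
longley.pca <- princomp(longley[, vars], cor = TRUE)

# Scree plot to judge how many components to keep
screeplot(longley.pca, type = "lines", main = "Scree plot")

# Loadings of the first two components
pca.loadings <- unclass(loadings(longley.pca))
round(pca.loadings[, 1:2], 4)
```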


3. Component 1: **71.23%** variance explained

Component 2: **23.67%** variance explained

Cumulative variance: **94.89%**
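These proportions are the eigenvalues of the correlation matrix divided by their sum, which equals the number of variables. A minimal sketch, assuming the same five explanatory variables as in the correlation table (the names vars, ev, and prop are illustrative):

```r
vars <- c("Unemployed", "Armed.Forces", "Population", "Year", "Employed")

# Eigenvalues of the correlation matrix are the component variances
ev <- eigen(cor(longley[, vars]))$values

# Proportion of variance per component, in percent
prop <- ev / sum(ev)
round(100 * prop, 2)

# Cumulative proportion, in percent
round(100 * cumsum(prop), 2)
```

The same figures are reported by summary() of a princomp fit with cor = TRUE.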
