### Load packages

``````library(ggplot2)
library(corrplot)  # provides corrplot(), used in research question 4``````

### Load data

``load("brfss2013.RData")``

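## Part 1: Data

### Causality:

BRFSS is an observational study, so it can only establish correlation/association between variables; it cannot establish causation.

## Part 3: Exploratory data analysis

### Research question 1: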

``````ggplot(aes(x = physhlth, fill = sex), data = brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins = 30, position = position_dodge()) +
  ggtitle('Number of Days Physical Health not Good in the Past 30 Days')``````


``````ggplot(aes(x = menthlth, fill = sex), data = brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins = 30, position = position_dodge()) +
  ggtitle('Number of Days Mental Health not Good in the Past 30 Days')``````

``````ggplot(aes(x = poorhlth, fill = sex), data = brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins = 30, position = position_dodge()) +
  ggtitle('Number of Days with Poor Physical Or Mental Health in the Past 30 Days')``````

``summary(brfss2013$sex)``
``````##  Male  Female   NA's
##201313 290455      7``````

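### Research question 4:

• smoke100: smoked at least 100 cigarettes
• avedrnk2: average number of alcoholic drinks per day in the past 30 days
• bphigh4: ever told blood pressure was high
• toldhi2: ever told blood cholesterol was high
• weight2: reported weight in pounds
• cvdstrk3: ever diagnosed with a stroke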

``````corr.matrix <- cor(selected_brfss)
corrplot(corr.matrix,
         main = "\n\nCorrelation Plot of Smoke, Alcohol, Blood pressure, Cholesterol, and Weight",
         method = "number")``````
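
The definition of `selected_brfss` is not shown above. Since `cor()` only accepts numeric columns, one plausible preparation step is to map the Yes/No factors to 0/1 first. The following is a minimal sketch under that assumption, using a small made-up data frame (`toy`, `selected_toy`, and `corr.toy` are hypothetical names, not part of the original analysis):

```r
# Toy stand-in for brfss2013; the values are made up for illustration only.
toy <- data.frame(
  smoke100 = factor(c("Yes", "No", "Yes", "No", "Yes", "No")),
  bphigh4  = factor(c("Yes", "No", "No", "Yes", "Yes", "No")),
  avedrnk2 = c(2, 1, 4, NA, 3, 2),
  weight2  = c(154, 130, 180, 165, 200, 120)
)

# Assumption: factors are mapped to 0/1 so that cor() can be applied.
selected_toy <- data.frame(
  smoke100 = as.numeric(toy$smoke100 == "Yes"),
  bphigh4  = as.numeric(toy$bphigh4 == "Yes"),
  avedrnk2 = toy$avedrnk2,
  weight2  = toy$weight2
)

# use = "pairwise.complete.obs" skips missing values pair by pair;
# without it a single NA makes every correlation with that column NA.
corr.toy <- cor(selected_toy, use = "pairwise.complete.obs")
print(round(corr.toy, 2))
```

The resulting matrix can be passed to `corrplot()` in the same way as `corr.matrix`.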

### Logistic regression to predict stroke

``````stroke$bphigh4 <- replace(stroke$bphigh4, which(is.na(stroke$bphigh4)), "No")
stroke$toldhi2 <- replace(stroke$toldhi2, which(is.na(stroke$toldhi2)), "No")
stroke$cvdstrk3 <- replace(stroke$cvdstrk3, which(is.na(stroke$cvdstrk3)), "No")
stroke$smoke100 <- replace(stroke$smoke100, which(is.na(stroke$smoke100)), "No")``````

``mean(stroke$avedrnk2, na.rm = TRUE)``
``##[1] 2.209905``
``stroke$avedrnk2 <- replace(stroke$avedrnk2, which(is.na(stroke$avedrnk2)), 2)``
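
The mean of `avedrnk2` is 2.21 before the missing values are filled in and 2.10 in the summary printed afterwards, which suggests that a sizeable share of the values was missing. A minimal sketch of checking that share before imputing, on a made-up vector (`x`, `na_share`, `fill_value`, and `x_filled` are illustrative names only):

```r
# Made-up vector standing in for stroke$avedrnk2.
x <- c(1, NA, 4, 2, NA, 3, NA, 2)

# Share of missing values: worth checking before filling them in.
na_share <- mean(is.na(x))

# Fill with the rounded mean of the observed values, as in the text.
fill_value <- round(mean(x, na.rm = TRUE))
x_filled <- replace(x, which(is.na(x)), fill_value)

cat("share missing:", na_share, "\n")
cat("mean before:", mean(x, na.rm = TRUE), " after:", mean(x_filled), "\n")
```

The larger the missing share, the more the imputed constant pulls the variable toward a single value, which is worth keeping in mind when reading the coefficient for that variable.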

``````head(stroke)
summary(stroke)``````
``````##   bphigh4 toldhi2 cvdstrk3 weight2 smoke100 avedrnk2
##1     Yes     Yes       No     154      Yes        2
##2      No      No       No      30       No        2
##3      No      No       No      63      Yes        4
##4      No     Yes       No      31       No        2
##5     Yes      No       No     169      Yes        2
##6     Yes     Yes       No     128       No        2``````
``````##  bphigh4      toldhi2      cvdstrk3        weight2       smoke100
## No :284107   Yes:183501   Yes: 20391   Min.   :  1.00   Yes:215201
## Yes:207668   No :308274   No :471384   1st Qu.: 43.00   No :276574
##                                        Median : 73.00
##                                        Mean   : 80.22
##                                        3rd Qu.:103.00
##                                        Max.   :570.00
##    avedrnk2
## Min.   : 1.000
## 1st Qu.: 2.000
## Median : 2.000
## Mean   : 2.099
## 3rd Qu.: 2.000
## Max.   :76.000``````

### Fitting the logistic regression model

``summary(model)``
``````##Call:
##glm(formula = cvdstrk3 ~ ., family = binomial(link = "logit"),
##    data = train)

##Deviance Residuals:
##    Min       1Q   Median       3Q      Max
##-0.5057  -0.3672  -0.2109  -0.1630   3.2363

##Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)
##(Intercept) -3.2690106  0.0268240 -121.869  < 2e-16 ***
##bphigh4Yes   1.3051850  0.0193447   67.470  < 2e-16 ***
##toldhi2No   -0.5678048  0.0171500  -33.108  < 2e-16 ***
##weight2     -0.0009628  0.0001487   -6.476 9.41e-11 ***
##smoke100No  -0.3990598  0.0163896  -24.348  < 2e-16 ***
##avedrnk2    -0.0274511  0.0065099   -4.217 2.48e-05 ***
##---
##Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##(Dispersion parameter for binomial family taken to be 1)

##    Null deviance: 136364  on 389999  degrees of freedom
##Residual deviance: 126648  on 389994  degrees of freedom
##AIC: 126660

##Number of Fisher Scoring iterations: 6``````
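
The code that creates `train` and `model` is not shown; only the `glm()` call appears in the output above. A minimal sketch of how such a fit could be set up, assuming a simple 80/20 split and using simulated data rather than BRFSS (the data frame `toy`, the split, and the simulated coefficients are all assumptions for illustration):

```r
set.seed(1)

# Simulated stand-in for the stroke data frame; not the BRFSS data.
n <- 2000
toy <- data.frame(
  bphigh4  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  smoke100 = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  avedrnk2 = rpois(n, 2) + 1
)
# Outcome generated from a known logistic model, for illustration only.
eta <- -3 + 1.3 * (toy$bphigh4 == "Yes") + 0.4 * (toy$smoke100 == "Yes")
toy$cvdstrk3 <- factor(ifelse(runif(n) < plogis(eta), "Yes", "No"),
                       levels = c("No", "Yes"))

# Assumption: a simple positional split; the original split is not shown.
n_train <- floor(0.8 * n)
train <- toy[seq_len(n_train), ]
test  <- toy[(n_train + 1):n, ]

# With a factor response, glm() treats the first level ("No") as failure,
# so the coefficients describe the log odds of "Yes".
model <- glm(cvdstrk3 ~ ., family = binomial(link = "logit"), data = train)
print(coef(summary(model)))
```

Setting the response levels explicitly, as done here, makes it unambiguous which outcome the coefficients refer to.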

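• All other variables being equal, respondents who have been told they have high blood pressure are more likely to have had a stroke.
• The negative coefficient for the predictor toldhi2No indicates that, all other variables being equal, respondents who have not been told their blood cholesterol is high are less likely to have had a stroke.
• For each one-unit increase in weight, the log odds of having had a stroke (versus no stroke) decrease by 0.00096.
• Respondents who have not smoked at least 100 cigarettes are less likely to have had a stroke.
• For each one-unit increase in the average number of alcoholic drinks per day over the past 30 days, the log odds of having had a stroke decrease by 0.027.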
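
The interpretations above are on the log-odds scale. Exponentiating the coefficients gives odds ratios, which are often easier to read. A short sketch using the estimates copied from the `summary(model)` output (`log_odds` and `odds_ratio` are illustrative names):

```r
# Coefficients copied from the summary(model) output above (log-odds scale).
log_odds <- c(
  bphigh4Yes = 1.3051850,
  toldhi2No  = -0.5678048,
  weight2    = -0.0009628,
  smoke100No = -0.3990598,
  avedrnk2   = -0.0274511
)

# exp() turns a log-odds coefficient into an odds ratio:
# above 1 means higher odds of stroke, below 1 means lower odds.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

For example, the odds ratio for `bphigh4Yes` comes out at roughly 3.7, so the odds of stroke are between three and four times higher for respondents told they have high blood pressure, holding the other variables fixed.
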
``anova(model, test="Chisq")``
``````##Analysis of Deviance Table

##Response: cvdstrk3

##Terms added sequentially (first to last)

##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
##NULL                    389999     136364
##bphigh4   1   7848.6    389998     128516 < 2.2e-16 ***
##toldhi2   1   1230.1    389997     127285 < 2.2e-16 ***
##weight2   1     33.2    389996     127252 8.453e-09 ***
##smoke100  1    584.5    389995     126668 < 2.2e-16 ***
##avedrnk2  1     19.9    389994     126648 7.958e-06 ***
##---
##Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1``````

### Assessing the predictive ability of the model

``##[1] "Accuracy 0.961296978629329"``
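
The code behind this accuracy figure is not shown. The sketch below illustrates one common way to compute it, assuming a 0.5 probability cut-off, on simulated labels and probabilities (all names and values here are made up). It also prints the accuracy of always predicting the majority class. In the `summary(stroke)` output, 20391 of 491775 respondents (about 4.1%) reported a stroke, so always predicting "No" would already be about 95.9% accurate on the full data, and the share in the test set may differ slightly.

```r
set.seed(2)

# Simulated predicted probabilities and true labels; not the BRFSS results.
n <- 1000
truth <- factor(ifelse(runif(n) < 0.05, "Yes", "No"), levels = c("No", "Yes"))
prob  <- plogis(-3 + 1.5 * (truth == "Yes") + rnorm(n))

# Assumption: a 0.5 cut-off turns probabilities into class labels.
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

accuracy <- mean(pred == truth)
# Accuracy of always predicting the majority class, as a reference point.
baseline <- max(prop.table(table(truth)))

print(paste("Accuracy", accuracy))
print(paste("Baseline", baseline))
print(table(predicted = pred, actual = truth))
```

With an outcome this rare, the confusion table is usually more informative than the accuracy alone.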

### Plotting the ROC curve and computing the AUC (area under the curve)

``auc``

``##[1] 0.7226642``
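
The code that produced `auc` is not shown. As a hedged illustration of what the number measures, the AUC can be computed in base R from ranks: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The helper `auc_rank()` and the simulated data below are assumptions for illustration, not the original method:

```r
set.seed(3)

# Simulated scores and labels; not the BRFSS predictions.
n <- 1000
is_pos <- runif(n) < 0.05
score  <- plogis(-3 + 1.5 * is_pos + rnorm(n))

# Rank-based AUC: the probability that a random positive case gets a
# higher score than a random negative case (ties count as one half).
auc_rank <- function(score, is_pos) {
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_toy <- auc_rank(score, is_pos)
print(auc_toy)
```

Unlike accuracy, this measure does not depend on a cut-off or on how rare the outcome is, which is why it complements the accuracy figure reported earlier.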