[color=#008000][b]      17级[/b][u][/u]    
张yu答案[/color]

一、    mnsy(40分)
从正态分布N(0,σ)中随机抽取样本,根据样本估计总体均数的95%可信区间,重复1000 次,有多少次能估计到总体0,均数。不同的样本含量N对结果有什么影响,不同的标准差σ对结果有什么影响,不同的可信度又会对结果有什么影响?(1)进行模拟实验,填入下面表格(30分);
(2)总结规律(10分)。
1000次模拟中估计到总体均数的次数
模拟参数    标准差σ=1    标准差σ=10
    N=10    N=20    N=50    N=10    N=20    N=50
可信度=90%    367    296    182    405    289    191
可信度=95%    436    331    201    446    315    210
可信度=99%    563    392    270    561    422    292

总结结论如下:
当样本量和标准差不变时,随着可信度的增加,落入可信区间内的次数增多;当样本量和可信度不变时,随着标准差的增加落入可信区间内次数基本不变;当可信度和标准差不变时,随着样本量的增加落入可信区间内的次数减少。

答:set.seed(17200115)

ttest<-function(n1,mean1,sd1,alpha)
{
  count<-0
  for(i in 1:1000)
  {
    d1<-rnorm(n1,mean=mean1,sd=sd1)
    xb<-mean(d1)
   
    tmp<-sd1/sqrt(n1)*qnorm(1-alpha/2);df<-n1
   
    a<-xb-tmp
    b<-xb+tmp
   
   
    if(a>mean1)
    {
      count<-count+1
    }
    if(b<mean1)
    {
      count<-count+1
    }    
  }
  return(count)
}ttest(10,0,1,0.9)
ttest(20,0,1,0.9)
ttest(50,0,1,0.9)
ttest(10,0,1,0.95)
ttest(20,0,1,0.95)
ttest(50,0,1,0.95)
ttest(10,0,1,0.99)
ttest(20,0,1,0.99)
ttest(50,0,1,0.99)ttest(10,0,10,0.9)
ttest(20,0,10,0.9)
ttest(50,0,10,0.9)
ttest(10,0,10,0.95)
ttest(20,0,10,0.95)
ttest(50,0,10,0.95)
ttest(10,0,10,0.99)
ttest(20,0,10,0.99)
ttest(50,0,10,0.99)

二、数据分析(60分)
请载入R语言自带的数据集iris,该数据集描述了三种不同的鸢尾属植物对应的萼片和花瓣的长度和宽度。其中,Species变量代表植物种类,Sepal代表萼片,Petal代表花瓣,Length为长度,Width为宽度。请按照下述要求进行分析。
(1)    根据不同的Species种类,分别使用均数±标准差描述不同种类植物Sepal.Length和Sepal.Width两个属性,使用中位数(四分位数间距)描述不同种类植物Petal.Length和Petal.Width两个属性(10分)。

(2)计算植物属性的4个变量间的Pearson相关系数矩阵和相关系数假设检验的P值矩阵,无需分组(10分)。
根据数据所得Pearson相关系数矩阵如下

(3)请自编变量转换函数,要求实现如下功能:将定量数据根据固定界值划分为定性数据, [0~2]定义为1,(2~4]定义为2,(4~6]定义为3, (6~8]定义为4。将编写的函数填写在试卷上并对Petal.Length变量做变换,对变换后的变量做频数统计(10分)。

(4)绘制Sepal.Length在不同Species种类下的分布的箱式图和散点图,要求:1)在一张图中绘制,散点图直接绘制在箱式图上;
2)箱式图不同种类填充色不同,并添加图例说明不同颜色与Species种类对应关系;
3)散点为蓝色,且排列方式为随机扰动,而不是一串排列(15分)。
结果如下

(5)使用三种不同的分类算法,基于植物属性的4个变量数据将其分为3类,并展示分类结果与真实标签比较,使用准确率指标进行评价。分类算法不限(15分)。

答:##第二大题##

##1##
iris <- datasets::iris
iris1<-iris[which(iris$Species=='setosa'),]
iris2<-iris[which(iris$Species=='versicolor'),]
iris3<-iris[which(iris$Species=='virginica'),]paste(round(mean(iris1$Sepal.Length),2),"±",round(sd(iris1$Sepal.Length),2))
paste(round(mean(iris2$Sepal.Length),2),"±",round(sd(iris2$Sepal.Length),2))
paste(round(mean(iris3$Sepal.Length),2),"±",round(sd(iris3$Sepal.Length),2))
paste(round(mean(iris1$Sepal.Width),2),"±",round(sd(iris1$Sepal.Width),2))
paste(round(mean(iris2$Sepal.Width),2),"±",round(sd(iris2$Sepal.Width),2))
paste(round(mean(iris3$Sepal.Width),2),"±",round(sd(iris3$Sepal.Width),2))
a<-quantile(iris1$Sepal.Length,0.25)
b<-quantile(iris1$Sepal.Length,0.75)
paste(median(iris1$Sepal.Length),"(",a[[1]],"-",b[[1]],")")a<-quantile(iris2$Sepal.Length,0.25)
b<-quantile(iris2$Sepal.Length,0.75)
paste(median(iris2$Sepal.Length),"(",a[[1]],"-",b[[1]],")")a<-quantile(iris3$Sepal.Length,0.25)
b<-quantile(iris3$Sepal.Length,0.75)
paste(median(iris3$Sepal.Length),"(",a[[1]],"-",b[[1]],")")a<-quantile(iris1$Sepal.Width,0.25)
b<-quantile(iris1$Sepal.Width,0.75)
paste(median(iris1$Sepal.Width),"(",a[[1]],"-",b[[1]],")")a<-quantile(iris2$Sepal.Width,0.25)
b<-quantile(iris2$Sepal.Width,0.75)
paste(median(iris2$Sepal.Width),"(",a[[1]],"-",b[[1]],")")a<-quantile(iris3$Sepal.Width,0.25)
b<-quantile(iris3$Sepal.Width,0.75)
paste(median(iris3$Sepal.Width),"(",a[[1]],"-",b[[1]],")")##2##
install.packages("psych")
library(psych)data1<-iris[,c(1,2,3,4)]
data1
a<-cor(data1)
corr.test(data1)
 ##3##
zhang<-function(a){
    for(i in 1:length(a)){
        if(a[i]>=0 & a[i]<=2){
            a[i]=1
        }else if(a[i]>2 & a[i]<=4){
            a[i]=2
        }else if(a[i]>4 & a[i]<=6){
            a[i]=3
        }else{
            a[i]=4
        }
    }
    return(a)
}zhang(iris$Petal.Length)
freq<-table(zhang(iris$Petal.Length))
freq##4##
boxplot(iris$Sepal.Length~iris$Species,col=c(2,3,5),xlab="x",ylab="y")
par(new=TRUE)
plot(iris$Sepal.Length,col="blue",xlab="x",ylab="y")legend("topright",                                    #图例位置为右上角
legend=c("setosa","versicolor","virginica"),          #图例内容
col=c(2,3,5),                                 #图例颜色
lty=1,lwd=2)                                          #图例大小##5##
#加载数据
iris <- datasets::iris
iris2 <- iris[,-5]
species_labels <- iris[,5]
library(colorspace) # 颜色包
species_col <- rev(rainbow_hcl(3))[as.numeric(species_labels)]
#绘制 SPLOM:
pairs(iris2, col = species_col,
      lower.panel = NULL,
       cex.labels=2, pch=19, cex = 1.2)
 
# 添加图例
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, cex = 2,
   legend = as.character(levels(species_labels)),
    fill = unique(species_col))
par(xpd = NA)par(las = 1, mar = c(4.5, 3, 3, 2) + 0.1, cex = .8)
MASS::parcoord(iris2, col = species_col, var.label = TRUE, lwd = 2)
 
# 添加标题
title("Parallel coordinates plot of the Iris data")
# 添加图例
par(xpd = TRUE)
legend(x = 1.75, y = -.25, cex = 1,
   legend = as.character(levels(species_labels)),
    fill = unique(species_col), horiz = TRUE)data1<-iris[,c(1,2,3,4)]
data1
kmeans(data1)[color=#99CC00]孙chuanrui答案[/color]
[color=#008000]
#模拟试验[/color]
set.seed(17200130)
ttest <- function(num,n,alpha,mean,sd){
  count <- 0
  for (i in 1:num) {
    x1 <- rnorm(n,mean,sd)
    x2 <- rnorm(n,mean,sd)
    test <- t.test(x1,x2,var.equal = T)
    p <- test$p.value
    if( p< alpha ){count <- count +1}
  }
  
  return(count)
  
}
#第一行
for (alpha in c(0.05,0.10)) {
  for (n in c(10,15,20)) {
    print(paste("alpha = ",alpha,"N = ",n," ",ttest(1000,n,alpha,50,5)))
  }#i
}#j
#第二行
for (alpha in c(0.05,0.10)) {
  for (n in c(10,15,20)) {
    print(paste("alpha = ",alpha,"N = ",n," ",ttest(1000,n,alpha,50,8)))
  }#i
}#j
#第三行
for (alpha in c(0.05,0.10)) {
  for (n in c(10,15,20)) {
    print(paste("alpha = ",alpha,"N = ",n," ",ttest(1000,n,alpha,70,10)))
  }#i
}#j[color=#008000]
二、[/color]
#实际数据分析
data2 <- read.csv("C:\\Users\\阿瑞\\Desktop\\test_data.csv")
#(1)绘制年龄分布的柱形图和核密度曲线par(mfcol = c(2,1))
plot(as.factor(data2$age),main = "年龄分布柱形图",xlab = "年龄(岁)",ylab = "人数",col = "#99CCCC")plot(density(data2$age),main = "年龄分布核密度曲线图",xlab = "年龄(岁)",col = "#003366")
#(2)自定义函数计算定量资料的四分位数
fun <- function(n,data2){
  p50 <- round(quantile(data2[,c(n)],na.rm = T)[3],2)
  p25 <- round(quantile(data2[,c(n)],na.rm = T)[2],2)
  p75 <- round(quantile(data2[,c(n)],na.rm = T)[4],2)
  p <- cbind(p50,p25,p75)
  
  per <- paste0(p[1],"(",p[2],"-",p[3],")")
  name <- colnames(data2)[n]
  x <- data.frame(name,per)
  return(x)
}q2 <- rbind(fun(3,data2) ,fun(6,data2) ,fun(9,data2) )
write.csv(q2,"C:\\Users\\阿瑞\\Desktop\\q2.csv")
#q3评价不同性别之间差异是否显著
library(tableone)
data2$sex <- as.factor(data2$sex)data2$education<-as.factor(data2$education)
data2$smoke<-as.factor(data2$smoke)
data2$drink<-as.factor(data2$drink)
data2$exercise<-as.factor(data2$exercise)
vars <- c("age","education","bmi","smoke","drink","exercise")
table1 <- CreateTableOne(vars = vars,strata = c("sex") ,smd = T ,addOverall = T,data = data2)
table2 <- print(table1,nonnormal = c(),showAllLevels = T)
write.csv(table2,file = "C:\\Users\\阿瑞\\Desktop\\q3.csv")#q4使用适当的广义线性模型评估肺癌风险与基线因素的关联强度,总结哪些因素关联性较强。
#使用单因素logistic回归,评价肺癌风险与基因因素之间的关联强度
gen_logistic <- function(var){
  in_formula <- as.formula(paste("lung_ca~",var)) 
  p <- glm(in_formula,family=binomial(link=logit),data=data2)
  beta <- summary(p)$coefficients[2,1]
  LCI <- summary(p)$coefficients[2,1] -summary(p)$coefficients[2,2]*1.96 
  UCI <- summary(p)$coefficients[2,1] +summary(p)$coefficients[2,2]*1.96 
  OR <- exp(beta) 
  OR_LCI <- exp(LCI)
  OR_UCI <- exp(UCI)
  p_value <- summary(p)$coefficients[2,4]
  name <- var
  data_var <- data.frame(name,OR,OR_LCI,OR_UCI,p_value)
  return(data_var)
}
#对年龄进行分层
data2$ageu[data2$age >= 40 &data2$age < 50] <- 1
data2$ageu[data2$age >= 50 &data2$age < 60] <- 2
data2$ageu[data2$age >= 60 &data2$age < 70] <- 3
data2$ageu[data2$age >= 70 &data2$age < 80] <- 4
#对bmi进行分层
data2$bmiu[data2$bmi <18] <- 1
data2$bmiu[data2$bmi >=18 &data2$bmi <24] <- 2
data2$bmiu[data2$bmi >=24] <- 3q4 <- rbind(gen_logistic("ageu"),gen_logistic("sex"),gen_logistic("education"),
      gen_logistic("bmiu"),gen_logistic("family_ca"),gen_logistic("smoke"),
      gen_logistic("packyr"),gen_logistic("respdis"),gen_logistic("secsmoke"),
      gen_logistic("exposure"),gen_logistic("drink"),gen_logistic("exercise"))  
write.csv(q4,file = "C:\\Users\\阿瑞\\Desktop\\q4.csv")
#多因素分析
mul <- glm(lung_ca ~ageu+family_ca+sex+bmiu+smoke+education+respdis,data = data2)
m <- summary(mul)$coefficients[-1,]
beta <- m[,1]
LCI <- beta -m[,2]*1.96 
UCI <- beta +m[,2]*1.96 
OR <- exp(beta) 
OR_LCI <- exp(LCI)
OR_UCI <- exp(UCI)
p <- m[,4]
model1 <- rbind(beta,LCI,UCI,OR,OR_LCI,OR_UCI,p)
model1 <- t(model1)
write.csv(model1,file = "C:\\Users\\阿瑞\\Desktop\\q4_1.csv")[color=#008000]
王yuanyuan[/color]
simu1<-function(alpha,sigma,n){
  count<-0
  for (i in 1:1000) {
    sample<-rnorm(10000,0,sigma)
    data<-sample(sample,n)
    t<-t.test(data,conf.level =alpha)
    low<-t$conf.int[1]
    up<-t$conf.int[2]
    if(low<0 & up>0) {
     count<-count+1
    }
  }
  return(count)
}simu1(0.9,1,10)
simu1(0.9,1,20)
simu1(0.9,1,50)simu1(0.9,10,10)
simu1(0.9,10,20)
simu1(0.9,10,50)simu1(0.95,1,10)
simu1(0.95,1,20)
simu1(0.95,1,50)simu1(0.95,10,10)
simu1(0.95,10,20)
simu1(0.95,10,50)simu1(0.99,1,10)
simu1(0.99,1,20)
simu1(0.99,1,50)simu1(0.99,10,10)
simu1(0.99,10,20)
simu1(0.99,10,50)第二题见别tie
 
 
 
###(1)
 
attach(iris)
 
describe<-function(x,parametric=TRUE,print=TRUE){
 
   a<-which(colnames(iris) == x)
 
   if(parametric){
 
    df_mean<-aggregate(iris[,a],by=list(Species),FUN=mean)
 
    df_sd<-aggregate(iris[,a],by=list(Species),FUN=sd)
 
  }else{
 
    median<-aggregate(iris[,a],by=list(Species),FUN=median)
 
    df_q1<-aggregate(iris[,a],by=list(Species),FUN=quantile,probs=0.25)
 
    df_q3<-aggregate(iris[,a],by=list(Species),FUN=quantile,probs=0.75)
 
  }
 
 
 
  if(parametric&print){
 
    cat(paste0(mean,"±",sd))
 
  }else if(!parametric&print){
 
    cat(paste0(median,"(",q1,",",q3,")"))
 
  }
 
   return()
 
}
 
 
 
describe(Sepal.Length)
 
describe(Sepal.Width)
 
 
 
 
 
 
 
 
 
if(1){
 
  mean
 
  a1<-cat(paste0(mean[1,2],"±",sd[1,2]))
 
  a2<-cat(paste0(mean[2,2],"±",sd[2,2]))
 
  a3<-cat(paste0(mean[3,2],"±",sd[3,2]))
 
  res<-rbind(a1,a2,a3)
 
}else if(!parametric&print){
 
  cat(paste0(median,"(",q1,",",q3,")"))
 
}
 
 
 
 
 
 
 
###(2)
 
install.packages("psych")
 
library(psych)
 
data<-iris[,-5]
 
cor(data)
 
corr.test(data)
 
 
 
####(3)
 
trans<-function(var){
 
  iris$cat[0<=var&var<=2]<-1
 
  iris$cat[2<var&var<=4]<-2
 
  iris$cat[4<var&var<=6]<-3
 
  iris$cat[6<var&var<=8]<-4
 
  iris1<<-iris
 
  return(iris1)
 
}
 
trans(Petal.Length)
 
table(iris1$cat)
 
 
 
 
 
 
 
 
 
####(4)
 
install.packages("ggplot2")
 
library(ggplot2)
 
ggplot(data=iris,aes(x=Species,y=Sepal.Length,colour=Species))+geom_boxplot()+geom_jitter(colour="blue")
 
 
 
####(5)
 
##第一种:层次聚类
 
d<-dist(data)
 
fit.average<-hclust(d,method = "average")
 
plot(fit.average)
 
clusters<-cutree(fit.average,k=3)
 
data2<-cbind(iris,clusters)
 
data2
 
table(clusters)
 
 
 
##2.逻辑回归
 
train<-sample(1:nrow(iris),0.7*nrow(iris))
 
df.train<-iris[train,]
 
df.validate<-iris[-train,]
 
 
 
fit.logit<-glm(Species~.,data=df.train,family = binomial())
 
 
 
logit.pred<-predict(fit.logit,df.validate)
 
logit.perf<-table(df.validate$Species,logit.pred,dnn=c("Actual","Predicted"))
 
 
 
##3.随机森林
 
install.packages("randomForest")
 
library(randomForest)
 
train<-sample(1:nrow(iris),0.7*nrow(iris))
 
df.train<-iris[train,]
 
df.validate<-iris[-train,]
 
 
 
fit.forest<-randomForest(Species~.,data=df.train,importance=TRUE)
 
forest.pred<-predict(fit.forest,df.validate)
 
forest.perf<-table(df.validate$Species,forest.pred,dnn=c("Actual","Predicted"))
 
forest.perf
 
 
 
 
 
 
 
###决策树
 
install.packages("party")
 
library(party) #导入party包
 
myctree<-ctree(df.train$Species~.,df.train)
 
pre.forest<-predict(myctree,df.validate)
 
jueceshu<-table(pre.forest,df.validate$Species)
 
 
 
 
 
###准确率
 
##决策树
 
acc<-(jueceshu[1,1]+jueceshu[2,2])/(jueceshu[1,1]+jueceshu[2,2]+jueceshu[1,2]+jueceshu[2,1])