Data Analysis with R
Reference:
Data Analysis and Prediction Algorithms with R
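Contents
- Data Analysis with R
- 5. data.table
- 5.1 Manipulating data tables
- 5.1.1 Selecting
- 5.1.2 Adding or changing a column
- 5.1.3 Reference versus copy
- 5.1.4 Indexing
- 5.2 Descriptive statistics
- 5.2.1 Multiple descriptive statistics
- 5.2.2 Grouped statistics
- 5.3 Sorting data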
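5. data.table
The data.table package is used for data wrangling and analysis. Chapter 3 introduced the dplyr package for data processing; this chapter shows how to do the same things with data.table.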
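5.1 Manipulating data tables
data.table is a separate package that must be installed and loaded on its own. This chapter covers the data.table counterparts of the dplyr verbs from Chapter 3 (Data Processing with R): mutate, filter, select, group_by, and so on.
First, we use the setDT function to convert the data frame into a data.table; otherwise the operations below may not work.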
library(tidyverse)
library(data.table)
library(dslabs)
murders <- copy(murders)
murders <- setDT(murders)
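The conversion step can be sketched on a small made-up data frame. The name `df` and its values are illustrative assumptions, not part of the murders data. The sketch contrasts `setDT`, which converts the object in place, with `as.data.table`, which returns a converted copy and leaves its input alone.

```r
library(data.table)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(state = c("A", "B"), total = c(10, 20))

# as.data.table() returns a converted copy; df itself is unchanged
dt_copy <- as.data.table(df)
before <- is.data.table(df)        # FALSE: df is still a plain data frame
copy_ok <- is.data.table(dt_copy)  # TRUE

# setDT() converts df itself, by reference, without copying the columns
setDT(df)
after <- is.data.table(df)         # TRUE
```

Because `setDT` works in place, the assignment in `murders <- setDT(murders)` is optional; calling `setDT(murders)` alone has the same effect.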
5.1.1 Selecting
To select specific columns with dplyr, we wrote:
select(murders, state, region) %>% head()
A data.table: 6 × 2
state | region |
<chr> | <fct> |
Alabama | South |
Alaska | West |
Arizona | West |
Arkansas | South |
California | West |
Colorado | West |
Here is how to do the same thing in data.table:
murders[, c('state', 'region')] %>% head()
A data.table: 6 × 2
state | region |
<chr> | <fct> |
Alabama | South |
Alaska | West |
Arizona | West |
Arkansas | South |
California | West |
Colorado | West |
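We can also use .() to refer to the columns directly by name, without quotes. The rate column used here is the one created in Section 5.1.2.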
murders[,.(state, rate)] %>% head()
A data.table: 6 × 2
state | rate |
<chr> | <dbl> |
Alabama | 0.2824424 |
Alaska | 0.2675186 |
Arizona | 0.3629527 |
Arkansas | 0.3189390 |
California | 0.3374138 |
Colorado | 0.1292453 |
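A few related ways to select columns, sketched on a made-up table. The name `dt`, its values, and the helper variable `cols` are illustrative assumptions. The `..` prefix and `.SDcols` are useful when the column names are stored in a variable rather than typed literally.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(state  = c("A", "B", "C"),
                 region = c("South", "West", "West"),
                 total  = c(10, 20, 30))

# Column names held in a character vector: the .. prefix tells data.table
# to look up `cols` in the calling environment, not among the columns
cols <- c("state", "total")
by_name <- dt[, ..cols]

# The same selection expressed through .SD and .SDcols
by_sd <- dt[, .SD, .SDcols = cols]

# .() evaluates expressions, so columns can be renamed or computed on the fly
computed <- dt[, .(state, double_total = total * 2)]
```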
5.1.2 Adding or changing a column
In dplyr we used the mutate function:
murders %>% mutate(rate = total / population * 10^5) %>% head()
A data.table: 6 × 6
state | abb | region | population | total | rate |
<chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> |
Alabama | AL | South | 4779736 | 135 | 2.824424 |
Alaska | AK | West | 710231 | 19 | 2.675186 |
Arizona | AZ | West | 6392017 | 232 | 3.629527 |
Arkansas | AR | South | 2915918 | 93 | 3.189390 |
California | CA | West | 37253956 | 1257 | 3.374138 |
Colorado | CO | West | 5029196 | 65 | 1.292453 |
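In data.table, we use := to define a new column. Because := updates the table by reference instead of building a modified copy, it saves memory.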
murders[, rate := total/population * 10 ^5] %>% head()
A data.table: 6 × 6
state | abb | region | population | total | rate |
<chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> |
Alabama | AL | South | 4779736 | 135 | 2.824424 |
Alaska | AK | West | 710231 | 19 | 2.675186 |
Arizona | AZ | West | 6392017 | 232 | 3.629527 |
Arkansas | AR | South | 2915918 | 93 | 3.189390 |
California | CA | West | 37253956 | 1257 | 3.374138 |
Colorado | CO | West | 5029196 | 65 | 1.292453 |
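Likewise, := can define several columns at once. Note that this call overwrites rate with a per-10,000 figure (the multiplier is 10000, not 10^5), so the values below are ten times smaller than those above.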
murders[, ':='(rate=total / population * 10000, rank = rank(population))] %>% head()
A data.table: 6 × 7
state | abb | region | population | total | rate | rank |
<chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
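The main uses of `:=` can be sketched on a made-up table. The name `dt` and its values are illustrative assumptions. Besides adding columns, `:=` can update only the rows that match a condition and can drop a column by assigning `NULL`.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(population = c(1000, 2000, 500), total = c(1, 4, 2))

# Add two columns in one call; the functional form `:=`(...) takes
# name = value pairs
dt[, `:=`(rate = total / population * 1e5, rank = rank(population))]

# Update only the rows that satisfy the condition in the first argument
dt[population < 1000, total := 0]

# Remove a column by assigning NULL to it
dt[, rank := NULL]
```

Each call changes `dt` in place, so no assignment back to `dt` is needed. The `rate` column is not recomputed when `total` changes later; it keeps the values it had when it was created.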
5.1.3 Reference versus copy
The data.table package is designed to avoid wasting memory. So when we assign a table to a second name, like this:
x <- data.table(a=1)
y <- x
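y is actually a reference to x rather than a new object; it is just another name for x. Ordinarily, a new object is created only once y is changed.
However, := changes a table by reference, so modifying x with := does not create a separate object, and y continues to point at the same (now changed) table. When we do not want the original object to be affected, we need the copy() function.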
x [,a:=2]
y
A data.table: 1 × 1
a |
<dbl> |
2 |
z = copy(x)
x[,a:=3]
z
A data.table: 1 × 1
a |
<dbl> |
2 |
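The difference between a second name and a real copy can be checked directly. This is a minimal sketch in which the copy is taken before any change is made; `address()` is a data.table helper that reports where an object lives in memory.

```r
library(data.table)

x <- data.table(a = 1)
y <- x          # another name for the same table
z <- copy(x)    # an independent copy, taken before any change

x[, a := 2]     # modifies the table in place

y_val <- y$a    # 2, because y and x are the same object
z_val <- z$a    # 1, because z was copied before the change

# Same memory address for x and y, a different one for z
same_xy <- address(x) == address(y)   # TRUE
same_xz <- address(x) == address(z)   # FALSE
```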
5.1.4 Indexing
In dplyr, we filtered rows with the following code:
filter(murders, rate <= 0.7) %>% head()
A data.table: 6 × 7
state | abb | region | population | total | rate | rank |
<chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
In data.table, we can put the condition directly in the row index:
murders[rate<=0.7,.(state, rate)] %>% head()
A data.table: 6 × 2
state | rate |
<chr> | <dbl> |
Alabama | 0.2824424 |
Alaska | 0.2675186 |
Arizona | 0.3629527 |
Arkansas | 0.3189390 |
California | 0.3374138 |
Colorado | 0.1292453 |
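Row conditions can be combined in the same position. The sketch below uses a made-up table; the name `dt` and its values are illustrative assumptions.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(state  = c("A", "B", "C", "D"),
                 region = c("South", "West", "West", "South"),
                 rate   = c(0.2, 0.9, 0.5, 1.2))

# A single condition; the trailing comma is optional when only rows are filtered
low <- dt[rate <= 0.7]

# Two conditions joined with &, followed by a column selection
west_low <- dt[rate <= 0.7 & region == "West", .(state, rate)]

# Membership test with %in%
in_set <- dt[state %in% c("A", "D")]
```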
5.2 Descriptive statistics
As in Chapter 3, we use the heights dataset as an example.
data(heights)
# Convert the data to a data.table object
heights <- setDT(heights)
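In data.table, we can call functions inside .() and refer to the columns there as if they were ordinary variables. The dplyr summary code can therefore be shortened to: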
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
s
A data.table: 1 × 2
average | standard_deviation |
<dbl> | <dbl> |
68.32301 | 4.078617 |
Now suppose we want the average height and standard deviation for women only:
s <- heights[sex == 'Female', .(avg = mean(height), standard_deviation = sd(height))]
s
A data.table: 1 × 2
avg | standard_deviation |
<dbl> | <dbl> |
64.93942 | 3.760656 |
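5.2.1 Multiple descriptive statistics
Recall that in Chapter 3 we defined the following function:
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
We can call it inside .() in the same way to get all three statistics at once:
heights[,.(median_min_max(height))]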
A data.table: 1 × 3
median | min | max |
<dbl> | <dbl> | <dbl> |
68.5 | 50 | 82.67717 |
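A variant of the same idea, sketched on made-up data: the helper returns a named list instead of a data frame, and each list element becomes one column of the result. The table `dt` and its values are illustrative assumptions, and the list-returning helper is an alternative to the function above, not the version used in this chapter.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(height = c(60, 62, 70, 68, 65))

# Variant of the helper that returns a named list;
# [[ ]] drops the quantile names such as "50%"
median_min_max <- function(x) {
  qs <- quantile(x, c(0.5, 0, 1))
  list(median = qs[[1]], min = qs[[2]], max = qs[[3]])
}

# A list returned in j is turned into one column per element
res <- dt[, median_min_max(height)]
```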
5.2.2 Grouped statistics
In dplyr we grouped with group_by; in data.table we group with the by argument:
heights[,.(avg = mean(height), standard_deviation=sd(height)), by = sex]
A data.table: 2 × 3
sex | avg | standard_deviation |
<fct> | <dbl> | <dbl> |
Male | 69.31475 | 3.611024 |
Female | 64.93942 | 3.760656 |
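Grouping combines with the other two positions of the bracket. The sketch below uses a made-up table; the name `dt` and its values are illustrative assumptions. It shows the special symbol `.N` (the number of rows in each group), a row filter applied before grouping, and `keyby`, which groups and also sorts the result by the grouping column.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(sex    = c("F", "F", "M", "M", "M"),
                 height = c(60, 64, 70, 68, 72))

# Count and mean per group; groups appear in order of first occurrence
stats <- dt[, .(n = .N, avg = mean(height)), by = sex]

# Filter rows first, then summarize each group
tall <- dt[height > 62, .(avg = mean(height)), by = sex]

# keyby groups like by and additionally sorts the result by the key
sorted <- dt[, .(avg = mean(height)), keyby = sex]
```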
5.3 Sorting data
We can sort rows with the same approach we use for filtering. Here are the states ordered by population:
murders[order(population)] %>% head()
A data.table: 6 × 7
state | abb | region | population | total | rate | rank |
<chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
Wyoming | WY | West | 563626 | 5 | 0.08871131 | 1 |
District of Columbia | DC | South | 601723 | 99 | 1.64527532 | 2 |
Vermont | VT | Northeast | 625741 | 2 | 0.03196211 | 3 |
North Dakota | ND | North Central | 672591 | 4 | 0.05947151 | 4 |
Alaska | AK | West | 710231 | 19 | 0.26751860 | 5 |
South Dakota | SD | North Central | 814180 | 8 | 0.09825837 | 6 |
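Descending order and in-place sorting follow the same pattern. The sketch below uses a made-up table; the name `dt` and its values are illustrative assumptions. `order()` inside the brackets returns a sorted copy, while `setorder()` reorders the table itself by reference.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(state = c("A", "B", "C"), population = c(300, 100, 200))

# Ascending order; returns a new table and leaves dt untouched
asc <- dt[order(population)]

# Descending order: negate the column
desc <- dt[order(-population)]

# setorder() sorts dt itself, by reference
setorder(dt, -population)
```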