Common data types: vector, matrix, data frame, array, list
1. The is and as functions
- is.* functions: test the type of an object (a short example follows the list of methods below)
> methods(is)
[1] is.Alignment is.array is.atomic is.Border is.call
[6] is.CellBlock is.CellProtection is.CellStyle is.character is.complex
[11] is.data.frame is.DataFormat is.double is.element is.empty.model
[16] is.environment is.expression is.factor is.Fill is.finite
[21] is.Font is.function is.infinite is.integer is.jnull
[26] is.language is.leaf is.list is.loaded is.logical
[31] is.matrix is.mts is.na is.na.data.frame is.na.numeric_version
[36] is.na.POSIXlt is.na<- is.na<-.default is.na<-.factor is.na<-.numeric_version
[41] is.name is.nan is.null is.numeric is.numeric.Date
[46] is.numeric.difftime is.numeric.POSIXt is.numeric_version is.object is.ordered
[51] is.package_version is.pairlist is.primitive is.qr is.R
[56] is.raster is.raw is.recursive is.relistable is.single
[61] is.stepfun is.symbol is.table is.ts is.tskernel
[66] is.unsorted is.vector
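A minimal sketch (not from the original log) of a few of these is.* checks on built-in objects:
> is.numeric(c(1, 2, 3))   ## an ordinary numeric vector
[1] TRUE
> is.character(state.abb)  ## state.abb is a character vector of state abbreviations
[1] TRUE
> is.matrix(state.x77)     ## state.x77 is a built-in matrix
[1] TRUE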
- as.* functions: coerce data to another type
> methods(as)
[1] as.array as.array.default as.call as.character
[5] as.character.condition as.character.Date as.character.default as.character.error
[9] as.character.factor as.character.hexmode as.character.numeric_version as.character.octmode
[13] as.character.POSIXt as.character.srcref as.complex as.data.frame
[17] as.data.frame.array as.data.frame.AsIs as.data.frame.character as.data.frame.complex
[21] as.data.frame.data.frame as.data.frame.Date as.data.frame.default as.data.frame.difftime
[25] as.data.frame.factor as.data.frame.integer as.data.frame.list as.data.frame.logical
[29] as.data.frame.matrix as.data.frame.model.matrix as.data.frame.noquote as.data.frame.numeric
[33] as.data.frame.numeric_version as.data.frame.ordered as.data.frame.POSIXct as.data.frame.POSIXlt
[37] as.data.frame.raw as.data.frame.table as.data.frame.ts as.data.frame.vector
[41] as.Date as.Date.character as.Date.default as.Date.factor
[45] as.Date.numeric as.Date.POSIXct as.Date.POSIXlt as.dendrogram
[49] as.difftime as.dist as.double as.double.difftime
[53] as.double.POSIXlt as.environment as.expression as.expression.default
[57] as.factor as.formula as.function as.function.default
[61] as.graphicsAnnot as.hclust as.hexmode as.integer
[65] as.list as.list.data.frame as.list.Date as.list.default
[69] as.list.difftime as.list.environment as.list.factor as.list.function
[73] as.list.numeric_version as.list.POSIXct as.list.POSIXlt as.logical
[77] as.logical.factor as.matrix as.matrix.data.frame as.matrix.default
[81] as.matrix.noquote as.matrix.POSIXlt as.name as.null
[85] as.null.default as.numeric as.numeric_version as.octmode
[89] as.ordered as.package_version as.pairlist as.person
[93] as.personList as.POSIXct as.POSIXct.Date as.POSIXct.default
[97] as.POSIXct.numeric as.POSIXct.POSIXlt as.POSIXlt as.POSIXlt.character
[101] as.POSIXlt.Date as.POSIXlt.default as.POSIXlt.factor as.POSIXlt.numeric
[105] as.POSIXlt.POSIXct as.qr as.raster as.raw
[109] as.relistable as.roman as.single as.single.default
[113] as.stepfun as.symbol as.table as.table.default
[117] as.ts as.vector as.vector.factor
- Converting a matrix to a data frame
> x <- state.x77
> is.data.frame(x) ## x is not a data frame
[1] FALSE
> x <- as.data.frame(x) ## coerce x into a data frame
> is.data.frame(x)
[1] TRUE ## x is now a data frame
- Converting a data frame to a matrix
All elements of a matrix must share a single data type, e.g. all character or all numeric.
A data frame can hold columns of several different types.
Converting a data frame that mixes types into a matrix coerces all values to character.
> x <- as.matrix(x) ## coerce the data frame x from above back into a matrix
> is.data.frame(x)
[1] FALSE
> is.matrix(x)
[1] TRUE
- as cannot convert a data frame into a vector or a factor
> x <- as.data.frame(state.x77)
> x <- as.vector(x)
> is.vector(x)
[1] FALSE
>
> x <- as.factor(x)
Warning message:
In xtfrm.data.frame(x) : cannot xtfrm data frames
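One hedged workaround, not part of the original notes: unlist() flattens the columns of a data frame into a single named vector:
> x <- as.data.frame(state.x77)
> v <- unlist(x) ## the 8 numeric columns laid end to end
> is.vector(v)
[1] TRUE
> length(v) ## 50 rows x 8 columns
[1] 400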
- Converting a vector to a matrix
> x <- state.abb
> class(x)
[1] "character"
> is.vector(x)
[1] TRUE ## confirms the data is a vector
> dim(x) <- c(5,10) ## assigning dimensions reshapes the vector into a 5 x 10 matrix
> x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "AL" "CO" "HI" "KS" "MA" "MT" "NM" "OK" "SD" "VA"
[2,] "AK" "CT" "ID" "KY" "MI" "NE" "NY" "OR" "TN" "WA"
[3,] "AZ" "DE" "IL" "LA" "MN" "NV" "NC" "PA" "TX" "WV"
[4,] "AR" "FL" "IN" "ME" "MS" "NH" "ND" "RI" "UT" "WI"
[5,] "CA" "GA" "IA" "MD" "MO" "NJ" "OH" "SC" "VT" "WY"
- Converting a vector to a factor
> x <- state.abb
> as.factor(x)
[1] AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT
[46] VA WA WV WI WY
50 Levels: AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC ... WY
- Converting a vector to a list
> x <- state.abb
> as.list(x)
[[1]]
[1] "AL"
[[2]]
[1] "AK"
[[3]]
[1] "AZ"
[[4]]
[1] "AR"
[[5]]
[1] "CA"
...
[[46]]
[1] "VA"
[[47]]
[1] "WA"
[[48]]
[1] "WV"
[[49]]
[1] "WI"
[[50]]
[1] "WY"
2. Taking subsets of data
> who <- read.csv("E:/R-workplace/1/RData/WHO.csv",header = T) ## a data set with 202 rows and 358 columns
> View(who)
Showing 1 to 18 of 202 entries, 358 total columns
- Extracting subsets by index
##### extract a contiguous block of data
> who1 <- who[c(1:50),c(1:10)]
> View(who1)
Showing 1 to 18 of 50 entries, 10 total columns ## 50 rows and 10 columns extracted from who
#### extract non-contiguous rows and columns
> who2 <- who[c(1,4,6,3,10),c(38,234,214,78)]
> View(who2)
Showing 1 to 5 of 5 entries, 4 total columns ## the specified rows and columns extracted from who
## extract the rows of who where Continent is 7
> who3 <- who[which(who$Continent==7),]
> View(who3)
Showing 1 to 9 of 9 entries, 358 total columns
## extract the countries in who with CountryID from 1 to 10
> who4 <- who[which(who$CountryID>=1&who$CountryID<=10),]
> View(who4)
Showing 1 to 10 of 10 entries, 358 total columns
- The subset() function
subset(x, subset, select, drop)
x: the object to subset
subset: a logical expression indicating which elements or rows to keep
select: which columns to select from the data frame (see the sketch after this example)
> who5 <- subset(who,who$CountryID>=1 & who$CountryID<=10)
> View(who5)
Showing 1 to 10 of 10 entries, 358 total columns
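The select argument is not used in the example above; a small sketch (who5a is a hypothetical name, columns taken from the same WHO data) that also picks just two columns:
> who5a <- subset(who, CountryID >= 1 & CountryID <= 10, select = c(CountryID, Continent))
> View(who5a) ## 10 rows and 2 columns expected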
- The sample() function: random sampling
sample(x, size, replace = FALSE, prob = NULL)
x: a vector of one or more elements, or a positive integer
size: the number of items to draw
replace: whether to sample with replacement; the default is sampling without replacement (a sketch using set.seed and prob follows the examples below)
###### 1. Sampling from a vector
> x <- 1:100
> sample(x,20)
[1] 14 55 65 40 83 71 96 45 21 77 97 28 98 18 93 31 85 88 74 20
## sort the sampled values with sort()
> sort(sample(x,20))
[1] 3 12 18 25 33 34 36 38 39 41 45 47 58 59 60 66 87 97 98 100
> sort(sample(x,20,T))
[1] 2 9 22 24 29 36 42 43 47 53 68 73 76 79 82 83 84 90 90 91
####### 2. Sampling rows of a data frame
> sample(who$CountryID,10)
[1] 189 172 177 176 37 100 44 180 153 142
> who6 <- who[sample(who$CountryID,10),]
> View(who6)
Showing 1 to 10 of 10 entries, 358 total columns
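Two small additions, not in the original log: set.seed() makes a draw reproducible, and prob gives weighted sampling:
> set.seed(123) ## fix the random seed so the same sample is drawn each time
> sample(1:10, 5)
> sample(c("a","b"), 10, replace = TRUE, prob = c(0.9, 0.1)) ## "a" is drawn about 9 times out of 10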
3. Combining data frames
- Combining by columns
The data.frame() function
> USArrests ## 1973 arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states, plus the percent of the population living in urban areas
> state.division ## the census division each of the 50 US states belongs to
> ### combine the two data sets
> data.frame(USArrests,state.division)
Murder Assault UrbanPop Rape state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
...
Washington 4.0 145 73 26.2 Pacific
West Virginia 5.7 81 39 9.3 South Atlantic
Wisconsin 2.6 53 66 10.8 East North Central
Wyoming 6.8 161 60 15.6 Mountain
The cbind() function
> cbind(USArrests,state.division)
Murder Assault UrbanPop Rape state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
...
Washington 4.0 145 73 26.2 Pacific
West Virginia 5.7 81 39 9.3 South Atlantic
Wisconsin 2.6 53 66 10.8 East North Central
Wyoming 6.8 161 60 15.6 Mountain
- Combining by rows: the two data frames must have the same column names
> data1 <- head(USArrests,5)
> data1
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
> data2 <- tail(USArrests,5)
> data2
Murder Assault UrbanPop Rape
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
> rbind(data1,data2)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
rbind() and cbind() also work on matrices, but for rbind() the matrices must have the same number of columns, and for cbind() the same number of rows (a small sketch follows)
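A minimal matrix sketch of that rule (not from the original log):
> m1 <- matrix(1:6, nrow = 2)      ## 2 x 3 matrix
> m2 <- matrix(7:12, nrow = 2)     ## another 2 x 3 matrix
> rbind(m1, m2)                    ## works: same number of columns, result is 4 x 3
> cbind(m1, m2)                    ## works: same number of rows, result is 2 x 6
> rbind(m1, matrix(1:4, nrow = 2)) ## error: the number of columns differs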
4. Removing duplicate rows
Build a data frame that contains duplicates with rbind(), then flag them with duplicated()
> data1 <- head(USArrests,5)
> data3 <- head(USArrests,10)
> data4 <- rbind(data1,data3)
## the duplicated rows
> data4[duplicated(data4),]
Murder Assault UrbanPop Rape
Alabama1 13.2 236 58 21.2
Alaska1 10.0 263 48 44.5
Arizona1 8.1 294 80 31.0
Arkansas1 8.8 190 50 19.5
California1 9.0 276 91 40.6
## output the rows that are not duplicates
> data4[!duplicated(data4),]
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
The unique() function: returns the de-duplicated data directly
> unique(data4)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
5. Transposing data: the t() function
> t(data4)
Alabama Alaska Arizona Arkansas California Alabama1 Alaska1 Arizona1 Arkansas1 California1 Colorado Connecticut Delaware Florida
Murder 13.2 10.0 8.1 8.8 9.0 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4
Assault 236.0 263.0 294.0 190.0 276.0 236.0 263.0 294.0 190.0 276.0 204.0 110.0 238.0 335.0
UrbanPop 58.0 48.0 80.0 50.0 91.0 58.0 48.0 80.0 50.0 91.0 78.0 77.0 72.0 80.0
Rape 21.2 44.5 31.0 19.5 40.6 21.2 44.5 31.0 19.5 40.6 38.7 11.1 15.8 31.9
Georgia
Murder 17.4
Assault 211.0
UrbanPop 60.0
Rape 25.8
6. Reversing data: the rev() function
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> rev(letters)
[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
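## rev() applied to a data frame reverses the order of its columns, because a data frame is a list of columns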
> rev(women)
weight height
1 115 58
2 117 59
3 120 60
4 123 61
5 126 62
6 129 63
7 132 64
8 135 65
9 139 66
10 142 67
11 146 68
12 150 69
13 154 70
14 159 71
15 164 72
> women[rev(row.names(women)),] ## reverse the rows instead, by indexing with the reversed row names
height weight
15 72 164
14 71 159
13 70 154
12 69 150
11 68 146
10 67 142
9 66 139
8 65 135
7 64 132
6 63 129
5 62 126
4 61 123
3 60 120
2 59 117
1 58 115
7. Modifying values: the transform() function
> transform(women,height=height*2.54) ## convert height from inches to centimetres
height weight
1 147.32 115
2 149.86 117
3 152.40 120
4 154.94 123
5 157.48 126
6 160.02 129
7 162.56 132
8 165.10 135
9 167.64 139
10 170.18 142
11 172.72 146
12 175.26 150
13 177.80 154
14 180.34 159
15 182.88 164
## create a new column instead
> transform(women,height1=height*2.54)
height weight height1
1 58 115 147.32
2 59 117 149.86
3 60 120 152.40
4 61 123 154.94
5 62 126 157.48
6 63 129 160.02
7 64 132 162.56
8 65 135 165.10
9 66 139 167.64
10 67 142 170.18
11 68 146 172.72
12 69 150 175.26
13 70 154 177.80
14 71 159 180.34
15 72 164 182.88
8. Sorting
- The sort() function
sort(): sorts a vector and returns the sorted values; the default order is ascending
rev(sort()): combining rev() with sort() gives a descending sort
> river <- head(rivers,10)
> river
[1] 735 320 325 392 524 450 1459 135 465 600
> sort(river)
[1] 135 320 325 392 450 465 524 600 735 1459
> rev(sort(river))
[1] 1459 735 600 524 465 450 392 325 320 135
- The order() function
order(): also sorts a vector, but returns the positions (indices) of the elements in sorted order
> head(rivers,10)
[1] 735 320 325 392 524 450 1459 135 465 600
> river <- head(rivers,10)
> river
[1] 735 320 325 392 524 450 1459 135 465 600
> sort(river)
[1] 135 320 325 392 450 465 524 600 735 1459 # returns the sorted values
> order(river)
[1] 8 2 3 4 6 9 5 10 1 7 # returns the positions of the elements in sorted order
Sorting a data frame by one of its columns (a sketch sorting whole rows follows the example)
> mtcar <- head(mtcars$mpg,10)
> mtcar
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
> order(mtcar)
[1] 7 6 5 10 1 2 4 3 9 8
> mtcar[order(mtcar)]
[1] 14.3 18.1 18.7 19.2 21.0 21.0 21.4 22.8 22.8 24.4
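The same indexing trick sorts the rows of an entire data frame by one of its columns; a small sketch (not in the original log) on the first ten rows of mtcars:
> m <- head(mtcars, 10)
> m[order(m$mpg), ]                    ## rows rearranged from lowest to highest mpg
> m[order(m$mpg, decreasing = TRUE), ] ## highest mpg first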
9. The apply family of functions
- The apply() function:
Works on data frames and matrices
apply(X, MARGIN, FUN)
X: an array, including a matrix
MARGIN: MARGIN=1 operates on rows; MARGIN=2 operates on columns
FUN: a function, either built-in or user-defined (a sketch with a user-defined function follows the examples below)
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
> sum <- apply(WorldPhones,1,sum)
> WorldPhone <- cbind(WorldPhones,sum)
> WorldPhone
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer sum
1951 45939 21574 2876 1815 1646 89 555 74494
1956 60423 29990 4708 2568 2366 1411 733 102199
1957 64721 32510 5230 2695 2526 1546 773 110001
1958 68484 35218 6662 2845 2691 1663 836 118399
1959 71799 37598 6856 3000 2868 1769 911 124801
1960 76036 40341 8220 3145 3054 1905 1008 133709
1961 79831 43173 9053 3338 3224 2005 1076 141700
> mean <- apply(WorldPhone,2,mean)
> WorldPhone <- rbind(WorldPhone,mean)
> WorldPhone
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer sum
1951 45939.00 21574.00 2876.000 1815.000 1646 89 555.0000 74494.0
1956 60423.00 29990.00 4708.000 2568.000 2366 1411 733.0000 102199.0
1957 64721.00 32510.00 5230.000 2695.000 2526 1546 773.0000 110001.0
1958 68484.00 35218.00 6662.000 2845.000 2691 1663 836.0000 118399.0
1959 71799.00 37598.00 6856.000 3000.000 2868 1769 911.0000 124801.0
1960 76036.00 40341.00 8220.000 3145.000 3054 1905 1008.0000 133709.0
1961 79831.00 43173.00 9053.000 3338.000 3224 2005 1076.0000 141700.0
mean 66747.57 34343.43 6229.286 2772.286 2625 1484 841.7143 115043.3
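FUN can also be a user-defined function; a minimal sketch (not from the original log) computing the range of each column with an anonymous function:
> apply(WorldPhones, 2, function(col) max(col) - min(col)) ## one value per column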
- The lapply() function
Works on lists and returns a list
lapply(X, FUN)
X: a vector (atomic or list) or an expression object
FUN: a function, either built-in or user-defined
> lapply(state.center,length)
$x
[1] 50
$y
[1] 50
> class(lapply(state.center, length))
[1] "list"
- The sapply() function
Works on lists and returns a matrix or a vector
sapply(X, FUN)
X: a vector (atomic or list) or an expression object
FUN: a function, either built-in or user-defined
> sapply(state.center,length)
x y
50 50
> class(sapply(state.center, length))
[1] "integer"
- The tapply() function
Handles factor data: splits the data into groups according to a factor and applies a function to each group (a further sketch with a numeric summary follows the example below)
tapply(X, INDEX, FUN)
X: a data set (typically a vector)
INDEX: a list of factors used to group the data
FUN: a function, either built-in or user-defined
# the names of the 50 US states
> state.name
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[8] "Delaware" "Florida" "Georgia" "Hawaii" "Idaho" "Illinois" "Indiana"
[15] "Iowa" "Kansas" "Kentucky" "Louisiana" "Maine" "Maryland" "Massachusetts"
[22] "Michigan" "Minnesota" "Mississippi" "Missouri" "Montana" "Nebraska" "Nevada"
[29] "New Hampshire" "New Jersey" "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio"
[36] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
[43] "Texas" "Utah" "Vermont" "Virginia" "Washington" "West Virginia" "Wisconsin"
[50] "Wyoming"
# the census division each of the 50 US states is assigned to
> state.division
[1] East South Central Pacific Mountain West South Central Pacific Mountain New England
[8] South Atlantic South Atlantic South Atlantic Pacific Mountain East North Central East North Central
[15] West North Central West North Central East South Central West South Central New England South Atlantic New England
[22] East North Central West North Central East South Central West North Central Mountain West North Central Mountain
[29] New England Middle Atlantic Mountain Middle Atlantic South Atlantic West North Central East North Central
[36] West South Central Pacific Middle Atlantic New England South Atlantic West North Central East South Central
[43] West South Central Mountain New England South Atlantic Pacific South Atlantic East North Central
[50] Mountain
9 Levels: New England Middle Atlantic South Atlantic East South Central West South Central East North Central ... Pacific
> tapply(state.name, state.division, length)
New England Middle Atlantic South Atlantic East South Central West South Central East North Central West North Central Mountain Pacific
6 3 8 4 4 5 7 8 5
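tapply() is more often used with a numeric summary; a small sketch (not in the original log) averaging the built-in state areas within each division:
> tapply(state.area, state.division, mean) ## mean state area for each of the 9 divisions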
10. Centering and standardizing data
Centering: subtract the mean of the data set from each value
Standardizing: after centering, divide by the standard deviation, i.e. each value minus the mean of the data set, divided by the standard deviation of the data set
Centering and standardizing both remove the effect of the unit of measurement (the scale), normalizing the data so that variables measured on different scales become comparable (a manual check of these formulas follows the scale() examples below)
- The scale() function: centers and/or standardizes data
scale(x, center = TRUE, scale = TRUE)
x: a numeric matrix (or matrix-like object)
center: TRUE centers the data
scale: TRUE scales the data
> x <- head(state.x77)
> x
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
> scale(x,T,F)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama -1725.167 -1016.8333 0.58333333 -1.506666667 4.86666667 -14.116667 -53 -123063.5
Alaska -4975.167 1674.1667 -0.01666667 -1.246666667 1.06666667 11.283333 79 392660.5
Arizona -3128.167 -110.8333 0.28333333 -0.006666667 -2.43333333 2.683333 -58 -60354.5
Arkansas -3230.167 -1262.8333 0.38333333 0.103333333 -0.13333333 -15.516667 -8 -121826.5
California 15857.833 473.1667 -0.41666667 1.153333333 0.06666667 7.183333 -53 -17410.5
Colorado -2799.167 243.1667 -0.81666667 1.503333333 -3.43333333 8.483333 93 -70005.5
attr(,"scaled:center")
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
5.340167e+03 4.640833e+03 1.516667e+00 7.055667e+01 1.023333e+01 5.541667e+01 7.300000e+01 1.737715e+05
> scale(x,F,T)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 0.36958694 0.6975662 1.2040366 0.8932665 1.3035855 0.6677959 0.1891343 0.1853587
Alaska 0.03731652 1.2155437 0.8600261 0.8966300 0.9755309 1.0784985 1.4374205 2.0705426
Arizona 0.22614835 0.8719577 1.0320314 0.9126712 0.6733753 0.9394417 0.1418507 0.4145859
Arkansas 0.21572018 0.6502148 1.0893665 0.9140943 0.8719347 0.6451588 0.6146864 0.1898804
California 2.16722098 0.9843690 0.6306858 0.9276776 0.8892007 1.0122040 0.1891343 0.5715639
Colorado 0.25978434 0.9400974 0.4013455 0.9322054 0.5870451 1.0332242 1.5698145 0.3793075
attr(,"scaled:scale")
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
9.781190e+03 5.195206e+03 1.744133e+00 7.730056e+01 1.158344e+01 6.184524e+01 1.057450e+02 2.735669e+05
> scale(x,T,T)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama -0.2200732 -0.9501180 1.09913001 -1.236374542 1.66820651 -1.1946743 -0.7660111 -0.62635214
Alaska -0.6346639 1.5643230 -0.03140371 -1.023017873 0.36563430 0.9548932 1.1417902 1.99851088
Arizona -0.3990488 -0.1035615 0.53386315 -0.005470684 -0.83410325 0.2270869 -0.8382763 -0.30718426
Arkansas -0.4120606 -1.1799777 0.72228544 0.084795599 -0.04570429 -1.3131544 -0.1156243 -0.62005622
California 2.0229261 0.4421218 -0.78509287 0.946428300 0.02285214 0.6079157 -0.7660111 -0.08861363
Colorado -0.3570795 0.2272123 -1.53878202 1.233639200 -1.17688541 0.7179330 1.3441327 -0.35630463
attr(,"scaled:center")
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
5.340167e+03 4.640833e+03 1.516667e+00 7.055667e+01 1.023333e+01 5.541667e+01 7.300000e+01 1.737715e+05
attr(,"scaled:scale")
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
7.839057e+03 1.070218e+03 5.307228e-01 1.218617e+00 2.917305e+00 1.181633e+01 6.918959e+01 1.964765e+05
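A quick manual check of the definitions above (a sketch, not from the original log), using the Population column of x. Note that when center = FALSE, scale() divides by the column root mean square rather than the standard deviation, which is why the scale(x,F,T) output differs from a plain value/sd:
> p <- x[,"Population"]
> (p - mean(p)) / sd(p)       ## reproduces the Population column of scale(x,T,T)
> scale(x,T,T)[,"Population"] ## same values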
11. The reshape2 package: a general-purpose tool for reshaping data into whatever form is needed
- The merge() function
merge() is the base R function for merging data frames
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)
x, y: the two data frames to merge
by, by.x, by.y: the columns to merge on; by default the columns whose names appear in both data frames
all, all.x, all.y: whether unmatched rows of x and y should also appear in the output (a sketch using all.x follows the examples below)
sort: whether to sort the result by the by columns
suffixes: suffixes appended to columns (other than the by columns) that share the same name
incomparables: values in the by columns that should never be matched
> w1
name school class english
1 A S1 10 85
2 B S2 5 50
3 A S1 4 90
4 A S1 11 90
5 C S1 1 12
> w2
name school class math english
1 A S3 5 80 88
2 B S2 5 89 81
3 C S1 1 55 32
# merge on the given columns, keeping only rows that match in both data frames
> merge(w1,w2,by=c("name","school","class"))
name school class english.x math english.y
1 B S2 5 50 89 81
2 C S1 1 12 55 32
# merge on the same columns, keeping all rows from both data frames
> merge(w1,w2,by=c("name","school","class"),all=T)
name school class english.x math english.y
1 A S1 4 90 NA NA
2 A S1 10 85 NA NA
3 A S1 11 90 NA NA
4 A S3 5 NA 80 88
5 B S2 5 50 89 81
6 C S1 1 12 55 32
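The all.x and all.y arguments are described above but not demonstrated; a small sketch using the same w1/w2 data frames, keeping every row of the left data frame only:
> merge(w1,w2,by=c("name","school","class"),all.x=TRUE) ## the unmatched A/S1 rows of w1 get NA for math and english.y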
The melt() function: melting
melt(): melts data from wide to long format; each measured column is stacked on top of the others in the new data set
Melting keeps one or more columns as identifier (id) variables, whose structure is unchanged, and stacks the remaining columns into variable/value pairs
> m <- head(airquality,3)
> m
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
# stack every column, top to bottom, into variable/value rows
> melt(m)
No id variables; using all as measure variables
variable value
1 Ozone 41.0
2 Ozone 36.0
3 Ozone 12.0
4 Solar.R 190.0
5 Solar.R 118.0
6 Solar.R 149.0
7 Wind 7.4
8 Wind 8.0
9 Wind 12.6
10 Temp 67.0
11 Temp 72.0
12 Temp 74.0
13 Month 5.0
14 Month 5.0
15 Month 5.0
16 Day 1.0
17 Day 2.0
18 Day 3.0
# melt the data, keeping the Month and Day columns as id variables
> melt(m,id.vars = c("Month","Day"))
Month Day variable value
1 5 1 Ozone 41.0
2 5 2 Ozone 36.0
3 5 3 Ozone 12.0
4 5 1 Solar.R 190.0
5 5 2 Solar.R 118.0
6 5 3 Solar.R 149.0
7 5 1 Wind 7.4
8 5 2 Wind 8.0
9 5 3 Wind 12.6
10 5 1 Temp 67.0
11 5 2 Temp 72.0
12 5 3 Temp 74.0
The dcast() function: casting with a formula
dcast(): casts data from long back to wide format; the levels of the variable column become the new column names
> n <- melt(m,id.vars = c("Month","Day"))
> n
Month Day variable value
1 5 1 Ozone 41.0
2 5 2 Ozone 36.0
3 5 3 Ozone 12.0
4 5 1 Solar.R 190.0
5 5 2 Solar.R 118.0
6 5 3 Solar.R 149.0
7 5 1 Wind 7.4
8 5 2 Wind 8.0
9 5 3 Wind 12.6
10 5 1 Temp 67.0
11 5 2 Temp 72.0
12 5 3 Temp 74.0
# cast back to wide format, with Month and Day identifying the rows and the levels of variable becoming columns
> dcast(n,Month+Day~variable)
Month Day Ozone Solar.R Wind Temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
3 5 3 12 149 12.6 74
# with Day dropped from the formula an aggregation function is required; the default is length
> dcast(n,Month~variable)
Aggregation function missing: defaulting to length
Month Ozone Solar.R Wind Temp
1 5 3 3 3 3
# supply fun.aggregate to control how the duplicated cells are combined
> dcast(n,Month~variable,fun.aggregate = mean)
Month Ozone Solar.R Wind Temp
1 5 29.66667 152.3333 9.333333 71
12. The tidyr package: tidying messy data
tidyr: creates tidy data, where every column is a variable, every row is an observation, and every cell holds a single value.
tidyr contains tools for changing the shape (pivoting) and hierarchy (nesting and unnesting) of a data set, for turning deeply nested lists into rectangular data frames ("rectangling"), and for extracting values out of string columns.
It also includes tools for working with missing values (both implicit and explicit).
- The gather() function: converts wide data into long data
> mtcars <- data.frame(names=rownames(mtcars),mtcars)
> mtcars
> tdata <- mtcars[1:10,1:4]
> tdata
names mpg cyl disp
Mazda RX4 Mazda RX4 21.0 6 160.0
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0
Datsun 710 Datsun 710 22.8 4 108.0
Hornet 4 Drive Hornet 4 Drive 21.4 6 258.0
Hornet Sportabout Hornet Sportabout 18.7 8 360.0
Valiant Valiant 18.1 6 225.0
Duster 360 Duster 360 14.3 8 360.0
Merc 240D Merc 240D 24.4 4 146.7
Merc 230 Merc 230 22.8 4 140.8
Merc 280 Merc 280 19.2 6 167.6
> gather(tdata,key = "key",value = "values",mpg,cyl,disp)
names key values
1 Mazda RX4 mpg 21.0
2 Mazda RX4 Wag mpg 21.0
3 Datsun 710 mpg 22.8
4 Hornet 4 Drive mpg 21.4
5 Hornet Sportabout mpg 18.7
6 Valiant mpg 18.1
7 Duster 360 mpg 14.3
8 Merc 240D mpg 24.4
9 Merc 230 mpg 22.8
10 Merc 280 mpg 19.2
11 Mazda RX4 cyl 6.0
12 Mazda RX4 Wag cyl 6.0
13 Datsun 710 cyl 4.0
14 Hornet 4 Drive cyl 6.0
15 Hornet Sportabout cyl 8.0
16 Valiant cyl 6.0
17 Duster 360 cyl 8.0
18 Merc 240D cyl 4.0
19 Merc 230 cyl 4.0
20 Merc 280 cyl 6.0
21 Mazda RX4 disp 160.0
22 Mazda RX4 Wag disp 160.0
23 Datsun 710 disp 108.0
24 Hornet 4 Drive disp 258.0
25 Hornet Sportabout disp 360.0
26 Valiant disp 225.0
27 Duster 360 disp 360.0
28 Merc 240D disp 146.7
29 Merc 230 disp 140.8
30 Merc 280 disp 167.6
> gather(tdata,key = "key",value = "values",mpg:cyl,-disp)
names disp key values
1 Mazda RX4 160.0 mpg 21.0
2 Mazda RX4 Wag 160.0 mpg 21.0
3 Datsun 710 108.0 mpg 22.8
4 Hornet 4 Drive 258.0 mpg 21.4
5 Hornet Sportabout 360.0 mpg 18.7
6 Valiant 225.0 mpg 18.1
7 Duster 360 360.0 mpg 14.3
8 Merc 240D 146.7 mpg 24.4
9 Merc 230 140.8 mpg 22.8
10 Merc 280 167.6 mpg 19.2
11 Mazda RX4 160.0 cyl 6.0
12 Mazda RX4 Wag 160.0 cyl 6.0
13 Datsun 710 108.0 cyl 4.0
14 Hornet 4 Drive 258.0 cyl 6.0
15 Hornet Sportabout 360.0 cyl 8.0
16 Valiant 225.0 cyl 6.0
17 Duster 360 360.0 cyl 8.0
18 Merc 240D 146.7 cyl 4.0
19 Merc 230 140.8 cyl 4.0
20 Merc 280 167.6 cyl 6.0
- The spread() function: converts long data back into wide data
> gdata <- gather(tdata,key = "key",value = "values",mpg,cyl,disp) ## gdata is the gathered (long) data from above
> spread(gdata,key = "key",value = "values")
names cyl disp mpg
1 Datsun 710 4 108.0 22.8
2 Duster 360 8 360.0 14.3
3 Hornet 4 Drive 6 258.0 21.4
4 Hornet Sportabout 8 360.0 18.7
5 Mazda RX4 6 160.0 21.0
6 Mazda RX4 Wag 6 160.0 21.0
7 Merc 230 4 140.8 22.8
8 Merc 240D 4 146.7 24.4
9 Merc 280 6 167.6 19.2
10 Valiant 6 225.0 18.1
gather() collects data: several columns are combined into a single key column and a single value column;
spread() expands data: one column is spread back out into several columns.
gather() and spread() are inverse operations; the key and value names must correspond.
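In recent tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(); a rough equivalent of the calls above (a sketch, assuming tidyr >= 1.0):
> pivot_longer(tdata, cols = c(mpg, cyl, disp), names_to = "key", values_to = "values")
> pivot_wider(gdata, names_from = "key", values_from = "values")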
- The separate() function: splits one column into several
separate(data, col, into, sep, ...)
data: a data frame
col: the name or position of the column to split
into: a character vector of names for the new columns
sep: the separator between values; by default any sequence of non-alphanumeric characters (such as the "." in the example below) is treated as a separator
> dt <- data.frame(x=c(NA,"a.b","b.c","c.d",NA))
> dt
x
1 <NA>
2 a.b
3 b.c
4 c.d
5 <NA>
> separate(dt,x,c("A","B"))
A B
1 <NA> <NA>
2 a b
3 b c
4 c d
5 <NA> <NA>
- The unite() function
unite(data, col, ..., sep = "_")
data: a data frame
col: the name of the new combined column
...: the columns to combine
sep: the separator placed between values; the default is "_" (underscore)
> s <- separate(dt,x,c("A","B")) ## s is the separated data frame from the previous example
> unite(s,"AB",A,B)
AB
1 NA_NA
2 a_b
3 b_c
4 c_d
5 NA_NA
> unite(s,"AB",A,B,sep="-")
AB
1 NA-NA
2 a-b
3 b-c
4 c-d
5 NA-NA
separate() and unite() are a complementary pair of functions:
separate() splits a column apart;
unite() combines columns together.
13. The dplyr package: a grammar of data manipulation
It provides verbs that operate on a single table as well as verbs that operate on pairs of tables
1) Single-table operations
The following examples apply the corresponding dplyr functions to the iris data set
- dplyr::filter(): filters rows by a condition
# keep the rows of iris where Sepal.Length > 7
> dplyr::filter(iris,Sepal.Length>7)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica
- dplyr::distinct(): removes duplicate rows
> dplyr::distinct(rbind(iris[1:10,],iris[1:15,]))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
- dplyr::slice(): slicing; extracts arbitrary rows from a data set
> dplyr::slice(iris,c(1:3),5,4)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
- dplyr::sample_n(): randomly samples a fixed number of rows
# randomly sample 10 rows from the iris data set
> dplyr::sample_n(iris,10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 5.1 3.7 1.5 0.4 setosa
3 4.9 2.5 4.5 1.7 virginica
4 6.0 3.0 4.8 1.8 virginica
5 4.4 3.0 1.3 0.2 setosa
6 6.8 3.0 5.5 2.1 virginica
7 5.1 2.5 3.0 1.1 versicolor
8 6.2 3.4 5.4 2.3 virginica
9 5.9 3.2 4.8 1.8 versicolor
10 4.9 3.1 1.5 0.1 setosa
- dplyr::sample_frac(): samples a given fraction of the rows
# sample 10% of the rows of the iris data set
> dplyr::sample_frac(iris,0.1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.6 2.8 4.9 2.0 virginica
2 7.7 2.6 6.9 2.3 virginica
3 5.8 2.7 5.1 1.9 virginica
4 5.6 2.7 4.2 1.3 versicolor
5 5.7 2.8 4.5 1.3 versicolor
6 5.7 4.4 1.5 0.4 setosa
7 6.0 2.7 5.1 1.6 versicolor
8 5.4 3.7 1.5 0.2 setosa
9 6.4 3.2 4.5 1.5 versicolor
10 4.4 2.9 1.4 0.2 setosa
11 6.3 2.5 4.9 1.5 versicolor
12 4.5 2.3 1.3 0.3 setosa
13 6.9 3.1 5.4 2.1 virginica
14 5.1 3.5 1.4 0.3 setosa
15 5.9 3.2 4.8 1.8 versicolor
- dplyr::arrange(): sorts rows
The default order is ascending
Descending order: 1. put a minus sign in front of the sort variable, or 2. use the desc() function
> dplyr::arrange(head(iris,10),Sepal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.4 2.9 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.6 3.4 1.4 0.3 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
9 5.1 3.5 1.4 0.2 setosa
10 5.4 3.9 1.7 0.4 setosa
> dplyr::arrange(head(iris,10),-Sepal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.1 3.5 1.4 0.2 setosa
3 5.0 3.6 1.4 0.2 setosa
4 5.0 3.4 1.5 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 4.7 3.2 1.3 0.2 setosa
8 4.6 3.1 1.5 0.2 setosa
9 4.6 3.4 1.4 0.3 setosa
10 4.4 2.9 1.4 0.2 setosa
> dplyr::arrange(head(iris,10),desc(Sepal.Length)) ## same result as using the minus sign above
- dplyr::summarise(): computes summary statistics (a sketch computing several statistics at once follows the examples)
> dplyr::summarise(head(iris),mean(head(Sepal.Length)))
mean(head(Sepal.Length))
1 4.95
> dplyr::summarise(head(iris),sum(Sepal.Length))
sum(Sepal.Length)
1 29.7
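summarise() can compute several named statistics at once; a small sketch (not in the original log):
> dplyr::summarise(iris, n = dplyr::n(), mean_sl = mean(Sepal.Length), max_pw = max(Petal.Width))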
- The chaining (pipe) operator %>%
Passes the output of one function along as the input of the next function
In RStudio, the Ctrl+Shift+M shortcut inserts the pipe operator
# take the last 10 of the first 20 rows of mtcars
> head(mtcars,20) %>% tail(10)
names mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280C Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
- dplyr::group_by(): groups a data set
# group the iris data set by Species
> dplyr::group_by(iris,Species)
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
- dplyr::group_by() combined with the pipe operator
> iris %>% group_by(Species) %>% summarise(mean(Sepal.Length))
# A tibble: 3 x 2
Species `mean(Sepal.Length)`
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
- dplyr::mutate(): adds a new column
> mutate(head(iris),new=Sepal.Length+Petal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1 5.1 3.5 1.4 0.2 setosa 6.5
2 4.9 3.0 1.4 0.2 setosa 6.3
3 4.7 3.2 1.3 0.2 setosa 6.0
4 4.6 3.1 1.5 0.2 setosa 6.1
5 5.0 3.6 1.4 0.2 setosa 6.4
6 5.4 3.9 1.7 0.4 setosa 7.1
2) Two-table operations
> a <- data.frame(x1=c("A","B","C"),x2=c(1,2,3))
> a
x1 x2
1 A 1
2 B 2
3 C 3
> b <- data.frame(x1=c("A","B","D"),x3=c(T,F,T))
> b
x1 x3
1 A TRUE
2 B FALSE
3 D TRUE
- Left join: dplyr::left_join(): keeps every row of the left table and joins on matching rows of the right table
> dplyr::left_join(a,b,by="x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
- Right join: dplyr::right_join(): keeps every row of the right table and joins on matching rows of the left table
> dplyr::right_join(a,b,by="x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 D NA TRUE
- Inner join: dplyr::inner_join(): keeps only the rows whose keys appear in both tables (the intersection)
> dplyr::inner_join(a,b,by="x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
- Full join: dplyr::full_join(): keeps the rows of both tables (the union of the keys)
> dplyr::full_join(a,b,by="x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
4 D NA TRUE
- Semi join: dplyr::semi_join(): filters the left table by the right one, i.e. keeps the rows of the left table that have a match in the right table, returning only the left table's columns
> dplyr::semi_join(a,b,by="x1")
x1 x2
1 A 1
2 B 2
- Anti join: dplyr::anti_join(): the opposite filter, i.e. keeps the rows of the left table that have no match in the right table
> dplyr::anti_join(a,b,by="x1")
x1 x2
1 C 3
3) Set operations on data sets
- The intersect() function: rows that appear in both data sets
> first <- slice(mtcars,1:20)
> second <- slice(mtcars,10:30)
> intersect(first,second)
names mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
- dplyr::union_all(): concatenates the two data sets, keeping duplicate rows
> dplyr::union_all(first,second)
names mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710...3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive...4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout...5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Datsun 710...6 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive...7 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout...8 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- dplyr::union(): the union of the two data sets with duplicate rows removed
> dplyr::union(first,second)
names mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- dplyr::setdiff(): the rows of the first data set that do not appear in the second
> dplyr::setdiff(first,second)
names mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
> dplyr::setdiff(second,first)
names mpg cyl disp hp drat wt qsec vs am gear carb
Valiant Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Duster 360 Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 240D Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2