Common data types: vector, matrix, data frame, array, list
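A quick sketch of the five types (base R only; the object names are illustrative):

```r
# Create one object of each common type and check it with the is.* tests
v <- c(1, 2, 3)                                   # vector
m <- matrix(1:6, nrow = 2)                        # matrix
d <- data.frame(x = 1:3, y = c("a", "b", "c"))    # data frame
a <- array(1:24, dim = c(2, 3, 4))                # array
l <- list(v, m, d)                                # list
c(is.vector(v), is.matrix(m), is.data.frame(d), is.array(a), is.list(l))
```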

1. The is and as function families

  • is.* functions: test an object's data type
> methods(is)
 [1] is.Alignment            is.array                is.atomic               is.Border               is.call                
 [6] is.CellBlock            is.CellProtection       is.CellStyle            is.character            is.complex             
[11] is.data.frame           is.DataFormat           is.double               is.element              is.empty.model         
[16] is.environment          is.expression           is.factor               is.Fill                 is.finite              
[21] is.Font                 is.function             is.infinite             is.integer              is.jnull               
[26] is.language             is.leaf                 is.list                 is.loaded               is.logical             
[31] is.matrix               is.mts                  is.na                   is.na.data.frame        is.na.numeric_version  
[36] is.na.POSIXlt           is.na<-                 is.na<-.default         is.na<-.factor          is.na<-.numeric_version
[41] is.name                 is.nan                  is.null                 is.numeric              is.numeric.Date        
[46] is.numeric.difftime     is.numeric.POSIXt       is.numeric_version      is.object               is.ordered             
[51] is.package_version      is.pairlist             is.primitive            is.qr                   is.R                   
[56] is.raster               is.raw                  is.recursive            is.relistable           is.single              
[61] is.stepfun              is.symbol               is.table                is.ts                   is.tskernel            
[66] is.unsorted             is.vector
  • as.* functions: coerce data to another type
> methods(as)
  [1] as.array                      as.array.default              as.call                       as.character                 
  [5] as.character.condition        as.character.Date             as.character.default          as.character.error           
  [9] as.character.factor           as.character.hexmode          as.character.numeric_version  as.character.octmode         
 [13] as.character.POSIXt           as.character.srcref           as.complex                    as.data.frame                
 [17] as.data.frame.array           as.data.frame.AsIs            as.data.frame.character       as.data.frame.complex        
 [21] as.data.frame.data.frame      as.data.frame.Date            as.data.frame.default         as.data.frame.difftime       
 [25] as.data.frame.factor          as.data.frame.integer         as.data.frame.list            as.data.frame.logical        
 [29] as.data.frame.matrix          as.data.frame.model.matrix    as.data.frame.noquote         as.data.frame.numeric        
 [33] as.data.frame.numeric_version as.data.frame.ordered         as.data.frame.POSIXct         as.data.frame.POSIXlt        
 [37] as.data.frame.raw             as.data.frame.table           as.data.frame.ts              as.data.frame.vector         
 [41] as.Date                       as.Date.character             as.Date.default               as.Date.factor               
 [45] as.Date.numeric               as.Date.POSIXct               as.Date.POSIXlt               as.dendrogram                
 [49] as.difftime                   as.dist                       as.double                     as.double.difftime           
 [53] as.double.POSIXlt             as.environment                as.expression                 as.expression.default        
 [57] as.factor                     as.formula                    as.function                   as.function.default          
 [61] as.graphicsAnnot              as.hclust                     as.hexmode                    as.integer                   
 [65] as.list                       as.list.data.frame            as.list.Date                  as.list.default              
 [69] as.list.difftime              as.list.environment           as.list.factor                as.list.function             
 [73] as.list.numeric_version       as.list.POSIXct               as.list.POSIXlt               as.logical                   
 [77] as.logical.factor             as.matrix                     as.matrix.data.frame          as.matrix.default            
 [81] as.matrix.noquote             as.matrix.POSIXlt             as.name                       as.null                      
 [85] as.null.default               as.numeric                    as.numeric_version            as.octmode                   
 [89] as.ordered                    as.package_version            as.pairlist                   as.person                    
 [93] as.personList                 as.POSIXct                    as.POSIXct.Date               as.POSIXct.default           
 [97] as.POSIXct.numeric            as.POSIXct.POSIXlt            as.POSIXlt                    as.POSIXlt.character         
[101] as.POSIXlt.Date               as.POSIXlt.default            as.POSIXlt.factor             as.POSIXlt.numeric           
[105] as.POSIXlt.POSIXct            as.qr                         as.raster                     as.raw                       
[109] as.relistable                 as.roman                      as.single                     as.single.default            
[113] as.stepfun                    as.symbol                     as.table                      as.table.default             
[117] as.ts                         as.vector                     as.vector.factor
  • Converting a matrix to a data frame
> x <- state.x77
> is.data.frame(x)				    ## x is not a data frame
[1] FALSE

> x <- as.data.frame(x)				## coerce x to a data frame

> is.data.frame(x)
[1] TRUE						    ## x is a data frame
  • Converting a data frame to a matrix

All elements of a matrix must share one type, such as character or numeric.
A data frame may hold columns of several different types.
Converting a data frame with mixed column types to a matrix coerces every value to character.

> x <- as.matrix(x)			## coerce the x converted above back to a matrix

> is.data.frame(x)
[1] FALSE

> is.matrix(x)
[1] TRUE
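state.x77 is all-numeric, so the coercion to character is not visible in the transcript above. A small sketch (with a hypothetical data frame `df`) showing what happens when the column types are mixed:

```r
# A data frame mixing numeric and character columns: as.matrix()
# falls back to the common type, so every value becomes character
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
m  <- as.matrix(df)
is.character(m)   # the numbers are now strings like "1"
```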
  • as cannot turn a data frame into a vector or a factor
> x <- as.data.frame(state.x77)
> x <- as.vector(x)
> is.vector(x)
[1] FALSE
> 
> x <- as.factor(x)
Warning message:
In xtfrm.data.frame(x) : cannot xtfrm data frames
  • Converting a vector to a matrix
> x <- state.abb
> class(x)
[1] "character"
> is.vector(x)
[1] TRUE														## confirm the data is a vector

> dim(x) <- c(5,10)
> x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "AL" "CO" "HI" "KS" "MA" "MT" "NM" "OK" "SD" "VA" 
[2,] "AK" "CT" "ID" "KY" "MI" "NE" "NY" "OR" "TN" "WA" 
[3,] "AZ" "DE" "IL" "LA" "MN" "NV" "NC" "PA" "TX" "WV" 
[4,] "AR" "FL" "IN" "ME" "MS" "NH" "ND" "RI" "UT" "WI" 
[5,] "CA" "GA" "IA" "MD" "MO" "NJ" "OH" "SC" "VT" "WY"
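Assigning `dim()` reshapes the vector in place; the base `matrix()` constructor is an equivalent alternative:

```r
# matrix() performs the same column-wise reshaping as dim(x) <- c(5,10),
# and additionally offers byrow = TRUE for row-wise filling
m <- matrix(state.abb, nrow = 5, ncol = 10)
m[1, 1]   # "AL", the same layout as the dim() version above
```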
  • Converting a vector to a factor
> x <- state.abb
> as.factor(x)
 [1] AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT
[46] VA WA WV WI WY
50 Levels: AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC ... WY
  • Converting a vector to a list
> x <- state.abb
> as.list(x)
[[1]]
[1] "AL"

[[2]]
[1] "AK"

[[3]]
[1] "AZ"

[[4]]
[1] "AR"

[[5]]
[1] "CA"

...

[[46]]
[1] "VA"

[[47]]
[1] "WA"

[[48]]
[1] "WV"

[[49]]
[1] "WI"

[[50]]
[1] "WY"

2. Subsetting data

> who <- read.csv("E:/R-workplace/1/RData/WHO.csv",header = T)		## data with 202 rows and 358 columns
> View(who)
## Showing 1 to 18 of 202 entries, 358 total columns
  • Subsetting by index
##### extract a contiguous block
> who1 <- who[c(1:50),c(1:10)]
> View(who1)
## Showing 1 to 18 of 50 entries, 10 total columns		## 50 rows and 10 columns extracted from who

#### extract non-contiguous data
> who2 <- who[c(1,4,6,3,10),c(38,234,214,78)]
> View(who2)
## Showing 1 to 5 of 5 entries, 4 total columns			## the specified rows and columns extracted from who

## extract the rows of who where Continent is 7
> who3 <- who[which(who$Continent==7),]
> View(who3)
## Showing 1 to 9 of 9 entries, 358 total columns

## extract the countries whose CountryID is between 1 and 10

> who4 <- who[which(who$CountryID>=1&who$CountryID<=10),]
> View(who4)
## Showing 1 to 10 of 10 entries, 358 total columns
  • The subset() function

subset(x,subset,select,drop)
x: the object to subset
subset: a logical expression indicating the elements or rows to keep
select: the columns to select from the data frame

> who5 <- subset(who,who$CountryID>=1 & who$CountryID<=10)
> View(who5)
## Showing 1 to 10 of 10 entries, 358 total columns
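The transcript above only uses the subset condition; `select` can restrict the columns at the same time. Since WHO.csv is not bundled here, this sketch uses the built-in mtcars:

```r
# Keep rows with mpg > 30 and only the mpg, cyl and wt columns
sub <- subset(mtcars, mpg > 30, select = c(mpg, cyl, wt))
sub   # 4 cars in the stock mtcars data
```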
  • The sample() function: random sampling

sample(x, size, replace = FALSE, prob = NULL)
x: a vector of one or more elements, or a positive integer
size: the number of items to draw
replace: whether to sample with replacement; the default is sampling without replacement

###### 1. sampling from a vector
> x <- 1:100
> sample(x,20)
 [1] 14 55 65 40 83 71 96 45 21 77 97 28 98 18 93 31 85 88 74 20
 ## sort the sampled values with sort()
> sort(sample(x,20))
 [1]   3  12  18  25  33  34  36  38  39  41  45  47  58  59  60  66  87  97  98 100
 
> sort(sample(x,20,T))
 [1]  2  9 22 24 29 36 42 43 47 53 68 73 76 79 82 83 84 90 90 91

####### 2. sampling from a data frame
> sample(who$CountryID,10)
 [1] 189 172 177 176  37 100  44 180 153 142
> who6 <- who[sample(who$CountryID,10),]
> View(who6)
## Showing 1 to 10 of 10 entries, 358 total columns
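who6 above works because CountryID happens to equal the row numbers 1 to 202. A safer, reproducible pattern is to sample row positions directly (shown on mtcars, since WHO.csv is not bundled):

```r
set.seed(1)                          # fix the random seed for reproducibility
rows <- sample(nrow(mtcars), 10)     # 10 distinct row positions
sampled <- mtcars[rows, ]
nrow(sampled)   # 10
```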

3. Combining data frames

  • Combining as columns
    The data.frame() function
> USArrests		# arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973, plus the percent of the population living in urban areas
> state.division   # a dataset on the divisions of the 50 US states
> ### combine the two datasets
> data.frame(USArrests,state.division)
               Murder Assault UrbanPop Rape     state.division
Alabama          13.2     236       58 21.2 East South Central
Alaska           10.0     263       48 44.5            Pacific
Arizona           8.1     294       80 31.0           Mountain
Arkansas          8.8     190       50 19.5 West South Central
...
Washington        4.0     145       73 26.2            Pacific
West Virginia     5.7      81       39  9.3     South Atlantic
Wisconsin         2.6      53       66 10.8 East North Central
Wyoming           6.8     161       60 15.6           Mountain

The cbind() function

> cbind(USArrests,state.division)
               Murder Assault UrbanPop Rape     state.division
Alabama          13.2     236       58 21.2 East South Central
Alaska           10.0     263       48 44.5            Pacific
Arizona           8.1     294       80 31.0           Mountain
Arkansas          8.8     190       50 19.5 West South Central
...
Washington        4.0     145       73 26.2            Pacific
West Virginia     5.7      81       39  9.3     South Atlantic
Wisconsin         2.6      53       66 10.8 East North Central
Wyoming           6.8     161       60 15.6           Mountain
  • Combining as rows: the two data frames must share the same column names
> data1 <- head(USArrests,5)
> data1
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
> data2 <- tail(USArrests,5)
> data2
              Murder Assault UrbanPop Rape
Virginia         8.5     156       63 20.7
Washington       4.0     145       73 26.2
West Virginia    5.7      81       39  9.3
Wisconsin        2.6      53       66 10.8
Wyoming          6.8     161       60 15.6
> rbind(data1,data2)
              Murder Assault UrbanPop Rape
Alabama         13.2     236       58 21.2
Alaska          10.0     263       48 44.5
Arizona          8.1     294       80 31.0
Arkansas         8.8     190       50 19.5
California       9.0     276       91 40.6
Virginia         8.5     156       63 20.7
Washington       4.0     145       73 26.2
West Virginia    5.7      81       39  9.3
Wisconsin        2.6      53       66 10.8
Wyoming          6.8     161       60 15.6

rbind() and cbind() also work on matrices; cbind() requires the matrices to have the same number of rows, and rbind() the same number of columns
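A minimal sketch of the matrix dimension rule stated above:

```r
m1 <- matrix(1:6,  nrow = 2)   # 2 x 3
m2 <- matrix(7:12, nrow = 2)   # 2 x 3
dim(cbind(m1, m2))   # 2 6 -- cbind needs an equal number of rows
dim(rbind(m1, m2))   # 4 3 -- rbind needs an equal number of columns
```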

4. Removing duplicates

The duplicated() function

> data1 <- head(USArrests,5)
> data3 <- head(USArrests,10)
> data4 <- rbind(data1,data3)
## the duplicated rows
> data4[duplicated(data4),]										
            Murder Assault UrbanPop Rape
Alabama1      13.2     236       58 21.2
Alaska1       10.0     263       48 44.5
Arizona1       8.1     294       80 31.0
Arkansas1      8.8     190       50 19.5
California1    9.0     276       91 40.6
## output the rows that are not duplicated
> data4[!duplicated(data4),]
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Delaware       5.9     238       72 15.8
Florida       15.4     335       80 31.9
Georgia       17.4     211       60 25.8

The unique() function: returns the de-duplicated data directly

> unique(data4)
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Delaware       5.9     238       72 15.8
Florida       15.4     335       80 31.9
Georgia       17.4     211       60 25.8

5. Transposing data: the t() function

> t(data4)
         Alabama Alaska Arizona Arkansas California Alabama1 Alaska1 Arizona1 Arkansas1 California1 Colorado Connecticut Delaware Florida
Murder      13.2   10.0     8.1      8.8        9.0     13.2    10.0      8.1       8.8         9.0      7.9         3.3      5.9    15.4
Assault    236.0  263.0   294.0    190.0      276.0    236.0   263.0    294.0     190.0       276.0    204.0       110.0    238.0   335.0
UrbanPop    58.0   48.0    80.0     50.0       91.0     58.0    48.0     80.0      50.0        91.0     78.0        77.0     72.0    80.0
Rape        21.2   44.5    31.0     19.5       40.6     21.2    44.5     31.0      19.5        40.6     38.7        11.1     15.8    31.9
         Georgia
Murder      17.4
Assault    211.0
UrbanPop    60.0
Rape        25.8

6. Reversing data: the rev() function

rev() reverses the elements of its argument. A data frame is stored as a list of columns, so rev() on a data frame reverses the column order; to reverse the rows, index with the reversed row names, as in the last example below.

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> rev(letters)
 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
> women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164
> rev(women)
   weight height
1     115     58
2     117     59
3     120     60
4     123     61
5     126     62
6     129     63
7     132     64
8     135     65
9     139     66
10    142     67
11    146     68
12    150     69
13    154     70
14    159     71
15    164     72
> women[rev(row.names(women)),]
   height weight
15     72    164
14     71    159
13     70    154
12     69    150
11     68    146
10     67    142
9      66    139
8      65    135
7      64    132
6      63    129
5      62    126
4      61    123
3      60    120
2      59    117
1      58    115

7. Modifying values: the transform() function

> transform(women,height=height*2.54)
   height weight
1  147.32    115
2  149.86    117
3  152.40    120
4  154.94    123
5  157.48    126
6  160.02    129
7  162.56    132
8  165.10    135
9  167.64    139
10 170.18    142
11 172.72    146
12 175.26    150
13 177.80    154
14 180.34    159
15 182.88    164
## create a new column
> transform(women,height1=height*2.54)
   height weight height1
1      58    115  147.32
2      59    117  149.86
3      60    120  152.40
4      61    123  154.94
5      62    126  157.48
6      63    129  160.02
7      64    132  162.56
8      65    135  165.10
9      66    139  167.64
10     67    142  170.18
11     68    146  172.72
12     69    150  175.26
13     70    154  177.80
14     71    159  180.34
15     72    164  182.88

8. Sorting

  • The sort() function

sort(): sorts a vector and returns the sorted vector; the default order is ascending
rev(sort()): wrapping sort() in rev() gives descending order (equivalently, sort(x, decreasing = TRUE))

> river <- head(rivers,10)
> river
 [1]  735  320  325  392  524  450 1459  135  465  600
 
> sort(river)
 [1]  135  320  325  392  450  465  524  600  735 1459
 
> rev(sort(river))
 [1] 1459  735  600  524  465  450  392  325  320  135
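The same descending order can be requested directly via the decreasing argument:

```r
river <- head(rivers, 10)
# sort(x, decreasing = TRUE) gives the same result as rev(sort(x))
identical(sort(river, decreasing = TRUE), rev(sort(river)))
```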
  • The order() function

order(): also sorts a vector, but returns the positions of the elements in sorted order rather than the values

> head(rivers,10)
 [1]  735  320  325  392  524  450 1459  135  465  600
> river <- head(rivers,10)
> river
 [1]  735  320  325  392  524  450 1459  135  465  600
 
> sort(river)
 [1]  135  320  325  392  450  465  524  600  735 1459			# returns the sorted values
 
> order(river)
 [1]  8  2  3  4  6  9  5 10  1  7								# returns the positions in sorted order

Sorting a data frame by one of its columns

> mtcar <- head(mtcars$mpg,10)
> mtcar
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
 
> order(mtcar)
 [1]  7  6  5 10  1  2  4  3  9  8

> mtcar[order(mtcar)]
 [1] 14.3 18.1 18.7 19.2 21.0 21.0 21.4 22.8 22.8 24.4
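Because order() returns positions, it can reorder a whole data frame by one column, not just a vector:

```r
# Sort all of mtcars by mpg, descending; order() supplies the row order
sorted <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
head(sorted$mpg, 3)   # the three largest mpg values
```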

9. The apply family of functions

  • The apply() function

Works on data frames and matrices
apply(X,MARGIN,FUN)
X: an array, including a matrix
MARGIN: MARGIN=1 operates on rows; MARGIN=2 operates on columns
FUN: a function, either built-in or user-defined

> WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

> sum <- apply(WorldPhones,1,sum)
> WorldPhone <- cbind(WorldPhones,sum)
> WorldPhone
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer    sum
1951  45939  21574 2876   1815    1646     89      555  74494
1956  60423  29990 4708   2568    2366   1411      733 102199
1957  64721  32510 5230   2695    2526   1546      773 110001
1958  68484  35218 6662   2845    2691   1663      836 118399
1959  71799  37598 6856   3000    2868   1769      911 124801
1960  76036  40341 8220   3145    3054   1905     1008 133709
1961  79831  43173 9053   3338    3224   2005     1076 141700

> mean <- apply(WorldPhone,2,mean)
> WorldPhone <- rbind(WorldPhone,mean)
> WorldPhone
       N.Amer   Europe     Asia   S.Amer Oceania Africa  Mid.Amer      sum
1951 45939.00 21574.00 2876.000 1815.000    1646     89  555.0000  74494.0
1956 60423.00 29990.00 4708.000 2568.000    2366   1411  733.0000 102199.0
1957 64721.00 32510.00 5230.000 2695.000    2526   1546  773.0000 110001.0
1958 68484.00 35218.00 6662.000 2845.000    2691   1663  836.0000 118399.0
1959 71799.00 37598.00 6856.000 3000.000    2868   1769  911.0000 124801.0
1960 76036.00 40341.00 8220.000 3145.000    3054   1905 1008.0000 133709.0
1961 79831.00 43173.00 9053.000 3338.000    3224   2005 1076.0000 141700.0
mean 66747.57 34343.43 6229.286 2772.286    2625   1484  841.7143 115043.3
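FUN is not limited to built-ins; an anonymous function works too. For example, the range of each column of WorldPhones:

```r
# max - min of every column (MARGIN = 2), using an anonymous function
rng <- apply(WorldPhones, 2, function(col) max(col) - min(col))
rng
```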
  • The lapply() function

Works on lists and returns a list
lapply(X,FUN)
X: a vector (atomic or list) or an expression object
FUN: a function, either built-in or user-defined

> lapply(state.center,length)
$x
[1] 50

$y
[1] 50

> class(lapply(state.center, length))
[1] "list"
  • The sapply() function

Works on lists; returns a matrix or a vector
sapply(X,FUN)
X: a vector (atomic or list) or an expression object
FUN: a function, either built-in or user-defined

> sapply(state.center,length)
 x  y 
50 50 
> class(sapply(state.center, length))
[1] "integer"
  • The tapply() function

Handles factor data: splits the data into groups by factor and applies the function to each group
tapply(X,INDEX,FUN)
X: a dataset
INDEX: a list of factors by which the data are grouped
FUN: a function, either built-in or user-defined

# the names of the 50 US states
> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"       "California"     "Colorado"       "Connecticut"   
 [8] "Delaware"       "Florida"        "Georgia"        "Hawaii"         "Idaho"          "Illinois"       "Indiana"       
[15] "Iowa"           "Kansas"         "Kentucky"       "Louisiana"      "Maine"          "Maryland"       "Massachusetts" 
[22] "Michigan"       "Minnesota"      "Mississippi"    "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"       "North Carolina" "North Dakota"   "Ohio"          
[36] "Oklahoma"       "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"     
[43] "Texas"          "Utah"           "Vermont"        "Virginia"       "Washington"     "West Virginia"  "Wisconsin"     
[50] "Wyoming"       

# the divisions the 50 US states belong to

> state.division
 [1] East South Central Pacific            Mountain           West South Central Pacific            Mountain           New England       
 [8] South Atlantic     South Atlantic     South Atlantic     Pacific            Mountain           East North Central East North Central
[15] West North Central West North Central East South Central West South Central New England        South Atlantic     New England       
[22] East North Central West North Central East South Central West North Central Mountain           West North Central Mountain          
[29] New England        Middle Atlantic    Mountain           Middle Atlantic    South Atlantic     West North Central East North Central
[36] West South Central Pacific            Middle Atlantic    New England        South Atlantic     West North Central East South Central
[43] West South Central Mountain           New England        South Atlantic     Pacific            South Atlantic     East North Central
[50] Mountain          
9 Levels: New England Middle Atlantic South Atlantic East South Central West South Central East North Central ... Pacific

> tapply(state.name, state.division, length)
       New England    Middle Atlantic     South Atlantic East South Central West South Central East North Central West North Central 
                 6                  3                  8                  4                  4                  5                  7 
          Mountain            Pacific 
                 8                  5 
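length() only counts the group sizes; any summary function works. For instance, the mean state area per division (Area taken from state.x77):

```r
# Group the Area column of state.x77 by census division and average it
avg_area <- tapply(state.x77[, "Area"], state.division, mean)
round(avg_area)
```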

10. Centering and standardizing data

Centering: subtracting the dataset's mean from each value in the dataset
Standardizing: dividing by the dataset's standard deviation after centering, i.e. each value minus the mean, then divided by the standard deviation
Both remove the influence of units (dimension) on the data, normalizing it and reducing the differences between values in the dataset

  • The scale() function: centers and/or standardizes data

scale(x, center = TRUE, scale = TRUE)
x: a numeric matrix (or matrix-like object)
center: if TRUE, center the data
scale: if TRUE, scale the data (note: when center = FALSE, scale() divides by the column root mean square rather than the standard deviation, as the second example below shows)

> x <- head(state.x77)
> x
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

> scale(x,T,F)
           Population     Income  Illiteracy     Life Exp      Murder    HS Grad Frost      Area
Alabama     -1725.167 -1016.8333  0.58333333 -1.506666667  4.86666667 -14.116667   -53 -123063.5
Alaska      -4975.167  1674.1667 -0.01666667 -1.246666667  1.06666667  11.283333    79  392660.5
Arizona     -3128.167  -110.8333  0.28333333 -0.006666667 -2.43333333   2.683333   -58  -60354.5
Arkansas    -3230.167 -1262.8333  0.38333333  0.103333333 -0.13333333 -15.516667    -8 -121826.5
California  15857.833   473.1667 -0.41666667  1.153333333  0.06666667   7.183333   -53  -17410.5
Colorado    -2799.167   243.1667 -0.81666667  1.503333333 -3.43333333   8.483333    93  -70005.5
attr(,"scaled:center")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
5.340167e+03 4.640833e+03 1.516667e+00 7.055667e+01 1.023333e+01 5.541667e+01 7.300000e+01 1.737715e+05 

> scale(x,F,T)
           Population    Income Illiteracy  Life Exp    Murder   HS Grad     Frost      Area
Alabama    0.36958694 0.6975662  1.2040366 0.8932665 1.3035855 0.6677959 0.1891343 0.1853587
Alaska     0.03731652 1.2155437  0.8600261 0.8966300 0.9755309 1.0784985 1.4374205 2.0705426
Arizona    0.22614835 0.8719577  1.0320314 0.9126712 0.6733753 0.9394417 0.1418507 0.4145859
Arkansas   0.21572018 0.6502148  1.0893665 0.9140943 0.8719347 0.6451588 0.6146864 0.1898804
California 2.16722098 0.9843690  0.6306858 0.9276776 0.8892007 1.0122040 0.1891343 0.5715639
Colorado   0.25978434 0.9400974  0.4013455 0.9322054 0.5870451 1.0332242 1.5698145 0.3793075
attr(,"scaled:scale")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
9.781190e+03 5.195206e+03 1.744133e+00 7.730056e+01 1.158344e+01 6.184524e+01 1.057450e+02 2.735669e+05 

> scale(x,T,T)
           Population     Income  Illiteracy     Life Exp      Murder    HS Grad      Frost        Area
Alabama    -0.2200732 -0.9501180  1.09913001 -1.236374542  1.66820651 -1.1946743 -0.7660111 -0.62635214
Alaska     -0.6346639  1.5643230 -0.03140371 -1.023017873  0.36563430  0.9548932  1.1417902  1.99851088
Arizona    -0.3990488 -0.1035615  0.53386315 -0.005470684 -0.83410325  0.2270869 -0.8382763 -0.30718426
Arkansas   -0.4120606 -1.1799777  0.72228544  0.084795599 -0.04570429 -1.3131544 -0.1156243 -0.62005622
California  2.0229261  0.4421218 -0.78509287  0.946428300  0.02285214  0.6079157 -0.7660111 -0.08861363
Colorado   -0.3570795  0.2272123 -1.53878202  1.233639200 -1.17688541  0.7179330  1.3441327 -0.35630463
attr(,"scaled:center")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
5.340167e+03 4.640833e+03 1.516667e+00 7.055667e+01 1.023333e+01 5.541667e+01 7.300000e+01 1.737715e+05 
attr(,"scaled:scale")
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
7.839057e+03 1.070218e+03 5.307228e-01 1.218617e+00 2.917305e+00 1.181633e+01 6.918959e+01 1.964765e+05
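scale(x, T, T) can be checked by hand for one column; it is simply (value - column mean) / column sd:

```r
x <- head(state.x77)
z <- scale(x)                      # center = TRUE, scale = TRUE
# Recompute one standardized column manually and compare
manual <- (x[, "Income"] - mean(x[, "Income"])) / sd(x[, "Income"])
all.equal(as.numeric(z[, "Income"]), as.numeric(manual))
```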

11. The reshape2 package: a versatile tool for restructuring data into whatever shape you need

  • The merge() function

merge() is the base R function for merging data frames (it does not come from reshape2)
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, …)
x, y: the two data frames to merge
by, by.x, by.y: the columns to merge on; by default, the columns whose names appear in both data frames
all, all.x, all.y: whether all rows of x and y should appear in the output
sort: whether the result should be sorted by the by columns
suffixes: suffixes appended to columns (other than by) that share a name
incomparables: values in the by columns that should not be matched

> w1
  name school class english
1    A     S1    10      85
2    B     S2     5      50
3    A     S1     4      90
4    A     S1    11      90
5    C     S1     1      12

> w2
  name school class math english
1    A     S3     5   80      88
2    B     S2     5   89      81
3    C     S1     1   55      32

# merge on the key columns, keeping only the matching rows (inner join)
> merge(w1,w2,by=c("name","school","class"))
  name school class english.x math english.y
1    B     S2     5        50   89        81
2    C     S1     1        12   55        32

# merge on the key columns, keeping all rows (full outer join)

> merge(w1,w2,by=c("name","school","class"),all=T)
  name school class english.x math english.y
1    A     S1     4        90   NA        NA
2    A     S1    10        85   NA        NA
3    A     S1    11        90   NA        NA
4    A     S3     5        NA   80        88
5    B     S2     5        50   89        81
6    C     S1     1        12   55        32
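Since w1 and w2 appear above only as printed output, this sketch reconstructs them to illustrate all.x (a left join), which the transcript does not cover:

```r
# w1 and w2 reconstructed from the printouts above
w1 <- data.frame(name    = c("A","B","A","A","C"),
                 school  = c("S1","S2","S1","S1","S1"),
                 class   = c(10,5,4,11,1),
                 english = c(85,50,90,90,12))
w2 <- data.frame(name    = c("A","B","C"),
                 school  = c("S3","S2","S1"),
                 class   = c(5,5,1),
                 math    = c(80,89,55),
                 english = c(88,81,32))
# all.x = TRUE keeps every row of w1 (a left join); rows of w1 with
# no match in w2 get NA in the columns that come only from w2
res <- merge(w1, w2, by = c("name","school","class"), all.x = TRUE)
res
```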

The melt() function: melting

melt(): melts data from wide to long; each measured column is taken out and stacked top-to-bottom into the new dataset
Melting keeps one or more columns as identifier variables (whose structure stays unchanged) and fuses the remaining columns together

> m <- head(airquality,3)
> m
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3

# stack each column top-to-bottom into rows

> melt(m)
No id variables; using all as measure variables
   variable value
1     Ozone  41.0
2     Ozone  36.0
3     Ozone  12.0
4   Solar.R 190.0
5   Solar.R 118.0
6   Solar.R 149.0
7      Wind   7.4
8      Wind   8.0
9      Wind  12.6
10     Temp  67.0
11     Temp  72.0
12     Temp  74.0
13    Month   5.0
14    Month   5.0
15    Month   5.0
16      Day   1.0
17      Day   2.0
18      Day   3.0
# melt the data using the Month and Day columns as identifier variables
> melt(m,id.vars = c("Month","Day"))
   Month Day variable value
1      5   1    Ozone  41.0
2      5   2    Ozone  36.0
3      5   3    Ozone  12.0
4      5   1  Solar.R 190.0
5      5   2  Solar.R 118.0
6      5   3  Solar.R 149.0
7      5   1     Wind   7.4
8      5   2     Wind   8.0
9      5   3     Wind  12.6
10     5   1     Temp  67.0
11     5   2     Temp  72.0
12     5   3     Temp  74.0

The dcast() function: casting

dcast(): casts data from long to wide, using the levels of the variable column as the new column names

> n <- melt(m,id.vars = c("Month","Day"))
> n
   Month Day variable value
1      5   1    Ozone  41.0
2      5   2    Ozone  36.0
3      5   3    Ozone  12.0
4      5   1  Solar.R 190.0
5      5   2  Solar.R 118.0
6      5   3  Solar.R 149.0
7      5   1     Wind   7.4
8      5   2     Wind   8.0
9      5   3     Wind  12.6
10     5   1     Temp  67.0
11     5   2     Temp  72.0
12     5   3     Temp  74.0
# cast with Month and Day as keys, spreading the levels of the variable column into columns
> dcast(n,Month+Day~variable)
  Month Day Ozone Solar.R Wind Temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74
# Day is dropped from the formula, so an aggregation function is needed; the default is length
> dcast(n,Month~variable)
Aggregation function missing: defaulting to length
  Month Ozone Solar.R Wind Temp
1     5     3       3    3    3
# set fun.aggregate to control how the duplicated cells are aggregated
> dcast(n,Month~variable,fun.aggregate = mean)
  Month    Ozone  Solar.R     Wind Temp
1     5 29.66667 152.3333 9.333333   71

12. The tidyr package: tidying messy data

tidyr creates tidy data, in which every column is a variable, every row is an observation, and every cell holds a single value.
tidyr contains tools for changing the shape (pivoting) and hierarchy (nesting and unnesting) of a dataset, for turning deeply nested lists into rectangular data frames ('rectangling'), and for extracting values out of string columns.
It also includes tools for working with missing values, both implicit and explicit.

  • The gather() function: converts wide data to long data
> mtcars <- data.frame(names=rownames(mtcars),mtcars)
> tdata <- mtcars[1:10,1:4]
> tdata
                              names  mpg cyl  disp
Mazda RX4                 Mazda RX4 21.0   6 160.0
Mazda RX4 Wag         Mazda RX4 Wag 21.0   6 160.0
Datsun 710               Datsun 710 22.8   4 108.0
Hornet 4 Drive       Hornet 4 Drive 21.4   6 258.0
Hornet Sportabout Hornet Sportabout 18.7   8 360.0
Valiant                     Valiant 18.1   6 225.0
Duster 360               Duster 360 14.3   8 360.0
Merc 240D                 Merc 240D 24.4   4 146.7
Merc 230                   Merc 230 22.8   4 140.8
Merc 280                   Merc 280 19.2   6 167.6
> gather(tdata,key = "key",value = "values",mpg,cyl,disp)
               names  key values
1          Mazda RX4  mpg   21.0
2      Mazda RX4 Wag  mpg   21.0
3         Datsun 710  mpg   22.8
4     Hornet 4 Drive  mpg   21.4
5  Hornet Sportabout  mpg   18.7
6            Valiant  mpg   18.1
7         Duster 360  mpg   14.3
8          Merc 240D  mpg   24.4
9           Merc 230  mpg   22.8
10          Merc 280  mpg   19.2
11         Mazda RX4  cyl    6.0
12     Mazda RX4 Wag  cyl    6.0
13        Datsun 710  cyl    4.0
14    Hornet 4 Drive  cyl    6.0
15 Hornet Sportabout  cyl    8.0
16           Valiant  cyl    6.0
17        Duster 360  cyl    8.0
18         Merc 240D  cyl    4.0
19          Merc 230  cyl    4.0
20          Merc 280  cyl    6.0
21         Mazda RX4 disp  160.0
22     Mazda RX4 Wag disp  160.0
23        Datsun 710 disp  108.0
24    Hornet 4 Drive disp  258.0
25 Hornet Sportabout disp  360.0
26           Valiant disp  225.0
27        Duster 360 disp  360.0
28         Merc 240D disp  146.7
29          Merc 230 disp  140.8
30          Merc 280 disp  167.6
> gather(tdata,key = "key",value = "values",mpg:cyl,-disp)
               names  disp key values
1          Mazda RX4 160.0 mpg   21.0
2      Mazda RX4 Wag 160.0 mpg   21.0
3         Datsun 710 108.0 mpg   22.8
4     Hornet 4 Drive 258.0 mpg   21.4
5  Hornet Sportabout 360.0 mpg   18.7
6            Valiant 225.0 mpg   18.1
7         Duster 360 360.0 mpg   14.3
8          Merc 240D 146.7 mpg   24.4
9           Merc 230 140.8 mpg   22.8
10          Merc 280 167.6 mpg   19.2
11         Mazda RX4 160.0 cyl    6.0
12     Mazda RX4 Wag 160.0 cyl    6.0
13        Datsun 710 108.0 cyl    4.0
14    Hornet 4 Drive 258.0 cyl    6.0
15 Hornet Sportabout 360.0 cyl    8.0
16           Valiant 225.0 cyl    6.0
17        Duster 360 360.0 cyl    8.0
18         Merc 240D 146.7 cyl    4.0
19          Merc 230 140.8 cyl    4.0
20          Merc 280 167.6 cyl    6.0
  • The spread() function: converts long data back to wide data
> gdata <- gather(tdata,key = "key",value = "values",mpg,cyl,disp)
> spread(gdata,key = "key",value = "values")
               names cyl  disp  mpg
1         Datsun 710   4 108.0 22.8
2         Duster 360   8 360.0 14.3
3     Hornet 4 Drive   6 258.0 21.4
4  Hornet Sportabout   8 360.0 18.7
5          Mazda RX4   6 160.0 21.0
6      Mazda RX4 Wag   6 160.0 21.0
7           Merc 230   4 140.8 22.8
8          Merc 240D   4 146.7 24.4
9           Merc 280   6 167.6 19.2
10           Valiant   6 225.0 18.1

gather() collects data: several columns are stacked into one;
spread() expands data: one column is spread out into several.
gather() and spread() are inverses of each other; the key and value names must correspond for a round trip.

  • The separate() function: splits one column into several

separate(data, col, into, sep, …)
data: a data frame
col: the name or position of the column to split
into: a character vector of names for the new columns
sep: the separator between values; by default it splits at any non-alphanumeric character, such as "."

> dt <- data.frame(x=c(NA,"a.b","b.c","c.d",NA))
> dt
     x
1 <NA>
2  a.b
3  b.c
4  c.d
5 <NA>
> separate(dt,x,c("A","B"))
     A    B
1 <NA> <NA>
2    a    b
3    b    c
4    c    d
5 <NA> <NA>
  • The unite() function

unite(data, col, …, sep = "_")
data: a data frame
col: the name of the new, combined column
…: the columns to combine
sep: the separator placed between values, "_" (underscore) by default

> s <- separate(dt,x,c("A","B"))
> unite(s,"AB",A,B)
     AB
1 NA_NA
2   a_b
3   b_c
4   c_d
5 NA_NA
> unite(s,"AB",A,B,sep="-")
     AB
1 NA-NA
2   a-b
3   b-c
4   c-d
5 NA-NA

separate() and unite() are a complementary pair:
separate() splits a column apart,
unite() combines columns together.
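A minimal sketch of that round trip, using a hypothetical data frame `dt`:

```r
library(tidyr)

dt <- data.frame(x = c("a.b", "b.c", "c.d"))

# Split on the "." (matched by the default non-alphanumeric separator).
s <- separate(dt, x, into = c("A", "B"))

# unite() with sep = "." rebuilds the original column.
rebuilt <- unite(s, "x", A, B, sep = ".")
```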

13. The dplyr package: a grammar of data manipulation

dplyr can operate on a single table, and also on pairs of tables.

1) Single-table operations

The examples below use the iris dataset to demonstrate the corresponding dplyr functions.

  • dplyr::filter(): filters rows by condition
# select the rows of iris where Sepal.Length > 7
> dplyr::filter(iris,Sepal.Length>7)
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           7.1         3.0          5.9         2.1 virginica
2           7.6         3.0          6.6         2.1 virginica
3           7.3         2.9          6.3         1.8 virginica
4           7.2         3.6          6.1         2.5 virginica
5           7.7         3.8          6.7         2.2 virginica
6           7.7         2.6          6.9         2.3 virginica
7           7.7         2.8          6.7         2.0 virginica
8           7.2         3.2          6.0         1.8 virginica
9           7.2         3.0          5.8         1.6 virginica
10          7.4         2.8          6.1         1.9 virginica
11          7.9         3.8          6.4         2.0 virginica
12          7.7         3.0          6.1         2.3 virginica
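filter() also accepts several conditions at once; a sketch:

```r
library(dplyr)

# Comma-separated conditions are combined with AND;
# use | within a single expression for OR.
res <- filter(iris, Sepal.Length > 7, Sepal.Width > 3)
```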
  • dplyr::distinct(): removes duplicate rows
> dplyr::distinct(rbind(iris[1:10,],iris[1:15,]))
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
  • dplyr::slice(): takes arbitrary rows from a dataset by position
> dplyr::slice(iris,c(1:3),5,4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          4.6         3.1          1.5         0.2  setosa
  • dplyr::sample_n(): draws a random sample of n rows
# randomly sample 10 rows from iris
> dplyr::sample_n(iris,10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           4.3         3.0          1.1         0.1     setosa
2           5.1         3.7          1.5         0.4     setosa
3           4.9         2.5          4.5         1.7  virginica
4           6.0         3.0          4.8         1.8  virginica
5           4.4         3.0          1.3         0.2     setosa
6           6.8         3.0          5.5         2.1  virginica
7           5.1         2.5          3.0         1.1 versicolor
8           6.2         3.4          5.4         2.3  virginica
9           5.9         3.2          4.8         1.8 versicolor
10          4.9         3.1          1.5         0.1     setosa
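Because sample_n() draws at random, the result differs between runs; fixing the RNG seed makes the sample reproducible:

```r
library(dplyr)

set.seed(42)              # fix the random number generator
samp <- sample_n(iris, 10)
# re-running from the same seed reproduces the same 10 rows
```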
  • dplyr::sample_frac(): draws a random sample of a given fraction of rows
# sample 10% of the rows of iris

> dplyr::sample_frac(iris,0.1)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.6         2.8          4.9         2.0  virginica
2           7.7         2.6          6.9         2.3  virginica
3           5.8         2.7          5.1         1.9  virginica
4           5.6         2.7          4.2         1.3 versicolor
5           5.7         2.8          4.5         1.3 versicolor
6           5.7         4.4          1.5         0.4     setosa
7           6.0         2.7          5.1         1.6 versicolor
8           5.4         3.7          1.5         0.2     setosa
9           6.4         3.2          4.5         1.5 versicolor
10          4.4         2.9          1.4         0.2     setosa
11          6.3         2.5          4.9         1.5 versicolor
12          4.5         2.3          1.3         0.3     setosa
13          6.9         3.1          5.4         2.1  virginica
14          5.1         3.5          1.4         0.3     setosa
15          5.9         3.2          4.8         1.8 versicolor
  • dplyr::arrange(): sorts rows

The default order is ascending.
To sort in descending order, either 1. prefix the sort variable with a minus sign, or 2. wrap it in desc().

> dplyr::arrange(head(iris,10),Sepal.Length)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           4.4         2.9          1.4         0.2  setosa
2           4.6         3.1          1.5         0.2  setosa
3           4.6         3.4          1.4         0.3  setosa
4           4.7         3.2          1.3         0.2  setosa
5           4.9         3.0          1.4         0.2  setosa
6           4.9         3.1          1.5         0.1  setosa
7           5.0         3.6          1.4         0.2  setosa
8           5.0         3.4          1.5         0.2  setosa
9           5.1         3.5          1.4         0.2  setosa
10          5.4         3.9          1.7         0.4  setosa
> dplyr::arrange(head(iris,10),-Sepal.Length)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.4         3.9          1.7         0.4  setosa
2           5.1         3.5          1.4         0.2  setosa
3           5.0         3.6          1.4         0.2  setosa
4           5.0         3.4          1.5         0.2  setosa
5           4.9         3.0          1.4         0.2  setosa
6           4.9         3.1          1.5         0.1  setosa
7           4.7         3.2          1.3         0.2  setosa
8           4.6         3.1          1.5         0.2  setosa
9           4.6         3.4          1.4         0.3  setosa
10          4.4         2.9          1.4         0.2  setosa
> dplyr::arrange(head(iris,10),desc(Sepal.Length))  # same result as the minus-sign form
  • dplyr::summarise(): computes summary statistics
> dplyr::summarise(head(iris),mean(head(Sepal.Length)))
  mean(head(Sepal.Length))
1                     4.95
> dplyr::summarise(head(iris),sum(Sepal.Length))
  sum(Sepal.Length)
1              29.7
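summarise() can compute several statistics in one call; naming the arguments gives the result columns readable headers (the names `n`, `mean_sl`, `sd_sl` below are arbitrary choices):

```r
library(dplyr)

res <- summarise(iris,
                 n       = n(),                  # row count
                 mean_sl = mean(Sepal.Length),   # mean of a column
                 sd_sl   = sd(Sepal.Length))     # standard deviation
```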
  • The chain (pipe) operator %>%

Passes the output of one function on as the input of the next function.
In RStudio, the shortcut Ctrl+Shift+M inserts the operator.

# take the last 10 of the first 20 rows of mtcars
> head(mtcars,20) %>% tail(10)
                                  names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Merc 280C                     Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE                   Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL                   Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC                 Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial     Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128                       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic                 Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla           Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
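The pipe really pays off when chaining several verbs; a sketch combining functions introduced above:

```r
library(dplyr)

# Filter, then sort descending, then keep the top three rows;
# each step's output feeds directly into the next.
res <- iris %>%
  filter(Sepal.Length > 7) %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)
```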
  • dplyr::group_by(): groups a dataset
# group iris by Species
> dplyr::group_by(iris,Species)
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ... with 140 more rows
  • dplyr::group_by() combined with the pipe operator
> iris %>% group_by(Species) %>% summarise(mean(Sepal.Length))
# A tibble: 3 x 2
  Species    `mean(Sepal.Length)`
  <fct>                     <dbl>
1 setosa                     5.01
2 versicolor                 5.94
3 virginica                  6.59
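A grouped summarise() may also return several columns per group; for example (column names `n` and `mean_pl` are arbitrary):

```r
library(dplyr)

# One row per Species, with a group size and a group mean.
res <- iris %>%
  group_by(Species) %>%
  summarise(n = n(), mean_pl = mean(Petal.Length))
```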
  • dplyr::mutate(): adds a new column
> mutate(head(iris),new=Sepal.Length+Petal.Length)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1          5.1         3.5          1.4         0.2  setosa 6.5
2          4.9         3.0          1.4         0.2  setosa 6.3
3          4.7         3.2          1.3         0.2  setosa 6.0
4          4.6         3.1          1.5         0.2  setosa 6.1
5          5.0         3.6          1.4         0.2  setosa 6.4
6          5.4         3.9          1.7         0.4  setosa 7.1
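mutate() can also create several columns at once, and later columns may refer to ones defined earlier in the same call; a sketch (the names `total` and `ratio` are arbitrary):

```r
library(dplyr)

res <- mutate(head(iris, 3),
              total = Sepal.Length + Petal.Length,  # new column
              ratio = total / Sepal.Width)          # uses `total` just defined
```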
2) Two-table operations
> a <- data.frame(x1=c("A","B","C"),x2=c(1,2,3))
> a
  x1 x2
1  A  1
2  B  2
3  C  3
> b <- data.frame(x1=c("A","B","D"),x3=c(T,F,T))
> b
  x1    x3
1  A  TRUE
2  B FALSE
3  D  TRUE
  • Left join, dplyr::left_join(): join using the left table as the base
> dplyr::left_join(a,b,by="x1")
  x1 x2    x3
1  A  1  TRUE
2  B  2 FALSE
3  C  3    NA
  • Right join, dplyr::right_join(): join using the right table as the base
> dplyr::right_join(a,b,by="x1")
  x1 x2    x3
1  A  1  TRUE
2  B  2 FALSE
3  D NA  TRUE
  • Inner join, dplyr::inner_join(): keep the intersection of the two tables
> dplyr::inner_join(a,b,by="x1")
  x1 x2    x3
1  A  1  TRUE
2  B  2 FALSE
  • Full join, dplyr::full_join(): keep the union of the two tables
> dplyr::full_join(a,b,by="x1")
  x1 x2    x3
1  A  1  TRUE
2  B  2 FALSE
3  C  3    NA
4  D NA  TRUE
  • Semi join, dplyr::semi_join(): filters the left table by the right table, keeping the left-table rows that have a match in the right table
> dplyr::semi_join(a,b,by="x1")
  x1 x2
1  A  1
2  B  2
  • Anti join, dplyr::anti_join(): filters the left table by the right table, keeping the left-table rows that have no match in the right table
> dplyr::anti_join(a,b,by="x1")
  x1 x2
1  C  3
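When the key columns have different names in the two tables, `by` can map one onto the other; a sketch, where `b2` is a hypothetical variant of `b` with its key column renamed:

```r
library(dplyr)

a  <- data.frame(x1  = c("A", "B", "C"), x2 = c(1, 2, 3))
b2 <- data.frame(key = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

# by = c("x1" = "key") matches a$x1 against b2$key.
res <- left_join(a, b2, by = c("x1" = "key"))
```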
3) Combining datasets
  • intersect(): rows present in both datasets
> first <- slice(mtcars,1:20)
> second <- slice(mtcars,10:30)
> intersect(first,second)
                                  names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Merc 280                       Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C                     Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE                   Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL                   Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC                 Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial     Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128                       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic                 Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla           Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
  • dplyr::union_all(): all rows of both datasets, duplicates included
> dplyr::union_all(first,second)
                                  names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4                     Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag             Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710...3               Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive...4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout...5 Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Datsun 710...6               Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive...7       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout...8 Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant                         Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360                   Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D                     Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  • dplyr::union(): the union of the two datasets with duplicates removed
> dplyr::union(first,second)
                              names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4                 Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag         Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710               Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant                     Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360               Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D                 Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  • dplyr::setdiff(): rows of the left dataset that are absent from the right dataset
> dplyr::setdiff(first,second)
                      names mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         Mazda RX4  21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

> dplyr::setdiff(second,first)
                names  mpg cyl  disp  hp drat   wt  qsec vs am gear carb
Valiant       Valiant 18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
Duster 360 Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
Merc 240D   Merc 240D 24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2