The functions in the stringr package make text processing in R straightforward.

I. Metacharacters

In regular expressions, 12 characters have a special meaning.

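Character   Meaning
[ ]         Character class: matches any single character listed inside the brackets.
\           Two roles: (1) escapes a metacharacter so it is matched literally; (2) introduces special sequences that stand for sets of characters.
^           Matches the start of the string. As the first character inside a character class it negates the class, e.g. [^5] matches any character except "5".
$           Matches the end of the string. Inside a character class it loses its special meaning, e.g. [akm$] matches "a", "k", "m" or "$".
.           Matches any character except a newline.
|           Alternation ("or").
?           The preceding character (or group) is matched at most once, i.e. zero or one time.
*           The preceding character (or group) is matched zero or more times.
+           The preceding character (or group) is matched one or more times.
( )         Group: the pattern inside the parentheses is matched as a single unit.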
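
The metacharacters listed above can be tried out directly with stringr. The following is a minimal sketch; the test strings and variable names are invented for illustration and are not from the original post.

```r
library(stringr)

# [^5] matches any single character except "5"
not_five <- str_extract("5a5", "[^5]")                      # "a"

# Inside a character class, $ is a literal dollar sign
in_class <- str_extract_all("k$z", "[akm$]")[[1]]           # "k" "$"

# ^ and $ anchor the pattern; ? makes the preceding "u" optional
spelling <- str_detect(c("color", "colour", "colouur"), "^colou?r$")
# TRUE TRUE FALSE

# ( ) groups the alternatives separated by |
grey <- str_detect(c("gray", "grey", "groy"), "gr(a|e)y")   # TRUE TRUE FALSE

# . does not cross a newline
first_line <- str_extract("ab\ncd", ".+")                   # "ab"
```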

 

 


1.1 Repetition

[Image not recovered: table of repetition quantifiers]
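
As a rough illustration of repetition, the sketch below applies the three quantifiers from the metacharacter list (?, *, +) to a few made-up strings. The counted forms {n}, {n,} and {n,m} are standard regex syntax added here as an assumption about what a repetition table usually covers; they are not taken from the original text.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

zero_or_one  <- str_detect(x, "^ca?t$")      # TRUE  TRUE  FALSE FALSE
zero_or_more <- str_detect(x, "^ca*t$")      # TRUE  TRUE  TRUE  TRUE
one_or_more  <- str_detect(x, "^ca+t$")      # FALSE TRUE  TRUE  TRUE

# Counted repetition (assumed additions, standard regex syntax)
exactly_two  <- str_detect(x, "^ca{2}t$")    # FALSE FALSE TRUE  FALSE
two_or_more  <- str_detect(x, "^ca{2,}t$")   # FALSE FALSE TRUE  TRUE
one_to_two   <- str_detect(x, "^ca{1,2}t$")  # FALSE TRUE  TRUE  FALSE
```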


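1.2 Escaping

To search for a metacharacter itself, such as "?" or "*", we have to tell the regex engine up front to drop the character's special meaning.

This is what the escape character \ is for: write \? and \*. To match a backslash itself, write \\.

Note: in an R string literal the backslash must itself be escaped, so these patterns are typed as "\\?" and "\\*", and matching a literal backslash takes "\\\\".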
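
A short sketch of escaping in practice, using made-up strings. The last line uses stringr's fixed() wrapper, which is not mentioned in the original post; it is shown as an alternative that treats the whole pattern literally.

```r
library(stringr)

# "C:\\temp" is the R literal for the text C:\temp (one backslash)
x <- c("what?", "a*b", "C:\\temp", "plain")

has_question  <- str_detect(x, "\\?")     # TRUE  FALSE FALSE FALSE
has_star      <- str_detect(x, "\\*")     # FALSE TRUE  FALSE FALSE
has_backslash <- str_detect(x, "\\\\")    # FALSE FALSE TRUE  FALSE

# fixed() matches the pattern as plain text, so no escaping is needed
has_question_fixed <- str_detect(x, fixed("?"))
```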


1.3 Predefined character classes in R


[Image not recovered: table of R's predefined character classes]
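
Since the original table is missing, here is a small sketch of a few predefined (POSIX-style) classes as they behave in stringr. Which classes the original table listed is unknown; the ones below are a selection chosen for illustration, and the sample sentence is made up.

```r
library(stringr)

x <- "Order 66, at 9 PM!"

digits   <- str_extract_all(x, "[[:digit:]]+")[[1]]   # "66" "9"
words    <- str_extract_all(x, "[[:alpha:]]+")[[1]]   # "Order" "at" "PM"
capitals <- str_extract_all(x, "[[:upper:]]")[[1]]    # "O" "P" "M"
n_spaces <- str_count(x, "[[:space:]]")               # 4
no_punct <- str_replace_all(x, "[[:punct:]]", "")     # "Order 66 at 9 PM"
```

Note the double brackets: the class name such as [:digit:] only works inside an enclosing character class.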


1.4 Special sequences that stand for character classes



[Image not recovered: table of special sequences for character classes]
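
The backslash sequences can be sketched as follows. The original table is not available, so the selection here (\d, \w, \s, \D, \b) is an assumption about the usual candidates, and the sample string is invented.

```r
library(stringr)

x <- "Call 555-0199 now"

digit_runs <- str_extract_all(x, "\\d+")[[1]]          # "555" "0199"
word_runs  <- str_extract_all(x, "\\w+")[[1]]          # "Call" "555" "0199" "now"
pieces     <- str_split(x, "\\s+")[[1]]                # "Call" "555-0199" "now"
only_digit <- str_replace_all(x, "\\D", "")            # "5550199"

# \b is a word boundary: only whole lowercase words match
lower_words <- str_extract_all(x, "\\b[a-z]+\\b")[[1]] # "now"
```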



II. Main functions


str_extract()       extracts the first match of the pattern


str_extract_all()   extracts every match of the pattern


library(stringr)

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")
[1] "4" NA  NA  "2"
str_extract_all(shopping_list, "\\b[a-z]+\\b")
[[1]]
[1] "apples"

[[2]]
[1] "bag"   "of"    "flour"

[[3]]
[1] "bag"   "of"    "sugar"

[[4]]
[1] "milk"


str_locate()        returns the position of the first match of the pattern


str_locate_all()    returns the positions of all matches of the pattern


fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "a")
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     5   5

 str_locate_all(fruit, "a")
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     3   3

[[4]]
     start end
[1,]     5   5


str_replace()       replaces the first match of the pattern


str_replace_all()   replaces all matches of the pattern


fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "_")
[1] "_ne apple"     "tw_ pears"     "thr_e bananas"

str_replace_all(fruits, "([aeiou])", "")
[1] "n ppl"    "tw prs"   "thr bnns"


str_split()         splits a string at each match of the pattern

str_split_fixed()   splits a string at the pattern into a fixed number of pieces

fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)
str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"

str_split(fruits, " and ", simplify = TRUE)
     [,1]         [,2]      [,3]     [,4]
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""

str_split_fixed(fruits, " and ", 2)
     [,1]         [,2]
[1,] "apples"     "oranges and pears and bananas"
[2,] "pineapples" "mangos and guavas"


str_detect()   tests whether a string contains a match of the pattern

fruit <- c("apple", "banana", "pear", "pineapple")
str_detect(fruit, "a")
[1] TRUE TRUE TRUE TRUE
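
Because str_detect() returns a logical vector, it is commonly used for filtering. The sketch below shows that use; str_subset() is a stringr helper not mentioned in the original post and is included as a convenience alternative.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Keep only the elements that start with "p"
starts_with_p <- fruit[str_detect(fruit, "^p")]   # "pear" "pineapple"

# str_subset() does the detect-and-filter step in one call
ends_with_e <- str_subset(fruit, "e$")            # "apple" "pineapple"

# Summing the logical vector counts the matching elements
n_with_an <- sum(str_detect(fruit, "an"))         # 1
```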


str_count()   returns the number of matches of the pattern

fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
[1] 1 3 1 1

III. Other important functions

str_sub()   extracts the substring between the given positions

hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
[1] "Hadley"
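
A small sketch of two further str_sub() behaviours that the example above does not show: negative positions count from the end of the string, and the assignment form replaces a substring in place. The variable names are illustrative only.

```r
library(stringr)

hw <- "Hadley Wickham"

surname_by_pos <- str_sub(hw, 8, 14)   # "Wickham"
surname_by_end <- str_sub(hw, -7)      # last 7 characters: "Wickham"

# Replacement form: overwrite positions 1 to 6
short <- hw
str_sub(short, 1, 6) <- "H."           # short is now "H. Wickham"
```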


str_dup()   repeats each string the given number of times

fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)
[1] "appleapple"   "pearpear"     "bananabanana"


str_length()   returns the number of characters in a string

fruit <- c("apple", "pear", "banana")
str_length(fruit)
[1] 5 4 6


str_pad()   pads a string to the given width (on the left by default)

str_pad(c("a", "abc", "abcdef"), 10)
[1] "         a" "       abc" "    abcdef"
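
The side and pad arguments of str_pad() control where the padding goes and which character is used. A minimal sketch with made-up inputs:

```r
library(stringr)

left  <- str_pad("abc", 10)                    # "       abc"
right <- str_pad("abc", 10, side = "right")    # "abc       "
both  <- str_pad("abcd", 10, side = "both")    # "   abcd   "

# Pad with zeros instead of spaces
zero  <- str_pad("7", 3, pad = "0")            # "007"

# A string already longer than the width is returned unchanged
long  <- str_pad("abcdefghijkl", 10)           # "abcdefghijkl"
```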


str_trim()   removes padding, i.e. leading and trailing whitespace

str_trim("  String with trailing and leading white space\t")
[1] "String with trailing and leading white space"

str_trim("\n\nString with trailing and leading white space\n\n")
[1] "String with trailing and leading white space"


str_c()   joins strings together

str_c(letters, collapse = ", ")
[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y,