The stringr package makes text processing in R straightforward.
1. Metacharacters
Regular expressions reserve 12 characters for special purposes.
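Character | Meaning |
[ ] | Matches any single character listed inside the brackets |
\ | Has two uses: 1. escapes a metacharacter; 2. introduces special sequences that stand for a class of characters |
^ | Matches the start of the string. As the first character inside a character class it negates the class, e.g. [^5] matches any character except '5' |
$ | Matches the end of the string. Inside a character class it loses its special meaning, e.g. [akm$] matches 'a', 'k', 'm' or '$' |
. | Matches any character except a newline |
| | Alternation (or) |
? | The preceding character (or group) is matched zero or one time |
* | The preceding character (or group) is matched zero or more times |
+ | The preceding character (or group) is matched one or more times |
() | Defines a group; the pattern inside the parentheses is matched as a single unit |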
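As a quick illustration of the anchors, alternation and character classes in the table, here is a minimal sketch. The vector x is made up for this example and is not part of the original post.

```r
library(stringr)

# Hypothetical sample data, chosen only to exercise the table entries
x <- c("apple", "banana", "5 pears", "cost: 3$")

# ^ anchors the match at the start of the string
starts_with_a <- str_detect(x, "^a")      # TRUE FALSE FALSE FALSE

# $ anchors the match at the end of the string
ends_with_s <- str_detect(x, "s$")        # FALSE FALSE TRUE FALSE

# | means "or"
apple_or_pear <- str_detect(x, "apple|pear")  # TRUE FALSE TRUE FALSE

# [^5] matches the first character that is not "5"
not_five <- str_extract(x, "[^5]")        # "a" "b" " " "c"

# Inside a character class, $ is an ordinary character
akm_dollar <- str_extract(x, "[akm$]")    # "a" "a" "a" "$"
```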
1.1 Repetition
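The quantifiers ?, * and + can be compared side by side on the same input. The following is a small sketch; the words in x are invented for illustration.

```r
library(stringr)

# Hypothetical inputs with zero, one, and two "u" characters
x <- c("clr", "color", "colour", "colouur")

# ? : the "u" may appear zero or one time
q_mark <- str_detect(x, "^colou?r$")   # FALSE TRUE TRUE FALSE

# * : the "u" may appear zero or more times
q_star <- str_detect(x, "^colou*r$")   # FALSE TRUE TRUE TRUE

# + : the "u" must appear one or more times
q_plus <- str_detect(x, "^colou+r$")   # FALSE FALSE TRUE TRUE

# A quantifier after () applies to the whole group
q_group <- str_detect(c("ha", "hahaha", "haa"), "^(ha)+$")  # TRUE TRUE FALSE
```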
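1.2 Escaping
To search for a metacharacter itself, such as "?" or "*", we have to tell the regex engine in advance to drop the character's special meaning.
This is what the escape character \ is for: write \? and \*. To match a literal \, write \\.
Note: in an R string literal the backslash itself must be escaped, so these patterns are typed as "\\?" and "\\*", and a literal backslash is matched with "\\\\".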
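The double-backslash rule is easiest to see in practice. This is a minimal sketch with made-up strings. fixed() is stringr's helper for matching a pattern literally, shown here as an alternative to escaping.

```r
library(stringr)

# Hypothetical inputs; the third one contains a single backslash (C:\temp)
x <- c("really?", "a*b", "C:\\temp")

# The R string "\\?" holds the two-character regex \?
writeLines("\\?")                       # prints \?

has_qmark     <- str_detect(x, "\\?")   # TRUE FALSE FALSE
has_star      <- str_detect(x, "\\*")   # FALSE TRUE FALSE

# Four backslashes in R source = regex \\ = one literal backslash
has_backslash <- str_detect(x, "\\\\")  # FALSE FALSE TRUE

# fixed() treats the pattern as plain text, so no escaping is needed
has_qmark_fixed <- str_detect(x, fixed("?"))  # TRUE FALSE FALSE
```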
1.3 Predefined character classes in R
1.4 Special symbols that stand for character classes
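The two headings above carry no body text here, so the following is only a guess at what they cover: POSIX bracket classes such as [:digit:] and [:alpha:], and shorthand sequences such as \d, \w and \s. The sample string is invented for illustration.

```r
library(stringr)

# Hypothetical sample string
x <- "Order 66: 3 apples, 12 pears"

# POSIX class: written inside an extra pair of brackets
digits_posix <- str_extract_all(x, "[[:digit:]]+")[[1]]  # "66" "3" "12"
letters_only <- str_extract_all(x, "[[:alpha:]]+")[[1]]  # "Order" "apples" "pears"

# Shorthand: \d is a digit, \w a word character, \s whitespace
digits_short <- str_extract_all(x, "\\d+")[[1]]          # "66" "3" "12"
words        <- str_extract_all(x, "\\w+")[[1]]
n_spaces     <- str_count(x, "\\s")                      # 5
```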
2. Main functions
str_extract() extracts the first match of the pattern
str_extract_all() extracts all matches of the pattern
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")
[1] "4" NA NA "2"
str_extract_all(shopping_list, "\\b[a-z]+\\b")
[[1]]
[1] "apples"
[[2]]
[1] "bag" "of" "flour"
[[3]]
[1] "bag" "of" "sugar"
[[4]]
[1] "milk"
str_locate() returns the position of the first match of the pattern
str_locate_all() returns the positions of all matches of the pattern
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "a")
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 5 5
str_locate_all(fruit, "a")
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 2 2
[2,] 4 4
[3,] 6 6
[[3]]
start end
[1,] 3 3
[[4]]
start end
[1,] 5 5
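The start and end columns returned by str_locate() can be fed back into str_sub() to pull out the matched text. This is a sketch of that round trip, reusing the fruit vector from above; the pattern "an|ea" is chosen only for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Locate the first "an" or "ea" in each string; no match gives NA
loc <- str_locate(fruit, "an|ea")

# Use the located positions to cut out the matched substring
matched <- str_sub(fruit, loc[, "start"], loc[, "end"])
# NA "an" "ea" "ea"

# The result agrees with extracting the first match directly
same_as_extract <- identical(matched, str_extract(fruit, "an|ea"))
```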
str_replace() replaces the first match of the pattern
str_replace_all() replaces all matches of the pattern
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "_")
[1] "_ne apple" "tw_ pears" "thr_e bananas"
str_replace_all(fruits, "([aeiou])", "")
[1] "n ppl" "tw prs" "thr bnns"
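The parentheses in the pattern above also capture the matched text, which the replacement string can refer to as \1, \2 and so on (typed "\\1", "\\2" in R). A small sketch using the same fruits vector:

```r
library(stringr)

fruits <- c("one apple", "two pears", "three bananas")

# \\1 re-inserts the captured vowel, so the first vowel is doubled
doubled <- str_replace(fruits, "([aeiou])", "\\1\\1")
# "oone apple" "twoo pears" "threee bananas"

# Two groups can be swapped by referring to them in reverse order
swapped <- str_replace(fruits, "(\\w+) (\\w+)", "\\2 \\1")
# "apple one" "pears two" "bananas three"
```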
str_split() splits a string by a pattern
str_split_fixed() splits a string by a pattern into a fixed number of pieces
fruits <- c(
"apples and oranges and pears and bananas",
"pineapples and mangos and guavas"
)
str_split(fruits, " and ")
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
str_split(fruits, " and ", simplify = TRUE)
[,1] [,2] [,3] [,4]
[1,] "apples" "oranges" "pears" "bananas"
[2,] "pineapples" "mangos" "guavas" ""
str_split_fixed(fruits, " and ", 2)
[,1] [,2]
[1,] "apples" "oranges and pears and bananas"
[2,] "pineapples" "mangos and guavas"
str_detect() tests whether a string contains a match for the pattern
fruit <- c("apple", "banana", "pear", "pineapple")
str_detect(fruit, "a")
[1] TRUE TRUE TRUE TRUE
str_count() returns the number of times the pattern occurs
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
[1] 1 3 1 1
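Because str_detect() returns a logical vector, it can be used directly for subsetting, and str_count() accepts any regular expression, not just a single letter. A brief sketch on the same fruit vector:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Keep only the elements that start with "p"
p_fruit <- fruit[str_detect(fruit, "^p")]   # "pear" "pineapple"

# Count vowels by using a character class as the pattern
n_vowels <- str_count(fruit, "[aeiou]")     # 2 3 2 4

# Count a single letter
n_p <- str_count(fruit, "p")                # 2 0 1 3
```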
3. Other important functions
str_sub() extracts the substring at the given positions
hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
[1] "Hadley"
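str_sub() also accepts negative positions, which count from the end of the string, and it is vectorised over start and end. A short sketch with the same hw string:

```r
library(stringr)

hw <- "Hadley Wickham"

# Negative start counts from the end: the last 7 characters
last_name <- str_sub(hw, -7)                    # "Wickham"

# The same substring with explicit positive positions
last_name_pos <- str_sub(hw, 8, 14)             # "Wickham"

# Vectors of start and end positions return several substrings at once
both_names <- str_sub(hw, c(1, 8), c(6, 14))    # "Hadley" "Wickham"
```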
str_dup() repeats each string the given number of times
fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)
[1] "appleapple" "pearpear" "bananabanana"
str_length() returns the length of a string (its number of characters)
fruit <- c("apple", "pear", "banana")
str_length(fruit)
[1] 5 4 6
str_pad() pads a string to a given width (on the left by default)
str_pad(c("a", "abc", "abcdef"), 10)
[1] "         a" "       abc" "    abcdef"
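The side and pad arguments of str_pad() control where the padding goes and which character is used. The inputs below are invented for illustration.

```r
library(stringr)

# Pad on the right or on both sides instead of the default left
right_pad <- str_pad("a", 5, side = "right")   # "a    "
both_pad  <- str_pad("a", 5, side = "both")    # "  a  "

# Use a character other than a space, e.g. zero-padded IDs
ids <- str_pad(c("7", "42", "123"), 3, pad = "0")  # "007" "042" "123"

# Strings already at or beyond the target width are left unchanged
unchanged <- str_pad("abcdef", 3)              # "abcdef"
```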
str_trim() removes padding, such as whitespace at the start and end of a string
str_trim(" String with trailing and leading white space\t")
[1] "String with trailing and leading white space"
str_trim("\n\nString with trailing and leading white space\n\n")
[1] "String with trailing and leading white space"
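By default str_trim() strips both ends; the side argument restricts trimming to one end. A minimal sketch with a made-up string:

```r
library(stringr)

# Hypothetical input with two spaces on each side
x <- "  padded  "

trim_both  <- str_trim(x)                   # "padded"
trim_left  <- str_trim(x, side = "left")    # "padded  "
trim_right <- str_trim(x, side = "right")   # "  padded"
```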
str_c() joins strings together
str_c(letters, collapse = ", ")
[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y,