Regular Expression Basics
Regular expressions are made up of normal characters and metacharacters. Normal characters include upper and lower case letters and digits. The metacharacters have special meanings and are described in detail below.

Metacharacter | Description |
---|---|
. | Matches any single character. For example, the regular expression r.t would match the strings rat, rut, and r t, but not root. |
$ | Matches the end of a line. For example, the regular expression weasel$ would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels." |
^ | Matches the beginning of a line. For example, the regular expression ^When in would match the beginning of the string "When in the course of human events" but would not match "What and When in the". |
* | Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* means match any number of any characters. |
\ | This is the quoting character; use it to treat the following character as an ordinary character. For example, \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character. |
[c1-c2] [^c1-c2] | Matches any one of the characters between the brackets. For example, the regular expression r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] will match any character except 2, 6, 9, and upper case letters. |
\< \> | Matches the beginning (\<) or end (\>) of a word. For example, \<the matches "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications. |
\( \) | Treats the expression between \( and \) as a group. Also saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9. |
| (vertical bar) | ORs two conditions together. For example, (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications. |
+ | Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999. NOTE: this metacharacter is not supported by all applications. |
? | Matches 0 or 1 occurrence of the character or regular expression immediately preceding. NOTE: this metacharacter is not supported by all applications. |
\{i\} \{i,j\} | Matches a specific number of instances, or a number of instances within a range, of the preceding character. For example, the expression A[0-9]\{3\} will match "A" followed by exactly 3 digits. That is, it will match A123 but not A12; within A1234 it matches only the leading A123. The expression [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is not supported by all applications. |
he is in a rut
the food is Rotten
I like root beer
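The three sample lines above are handy for trying out a few of the metacharacters. The sketch below uses Python's re module purely as a test bench, which is an assumption of this example and not one of the tools discussed here. Python writes + without a backslash, as egrep does. The helper name matching is hypothetical.

```python
import re

# The three sample lines shown above.
lines = ["he is in a rut", "the food is Rotten", "I like root beer"]

def matching(pattern, flags=0):
    """Return the sample lines in which the pattern is found anywhere."""
    return [line for line in lines if re.search(pattern, line, flags)]

# . matches exactly one character, so r.t finds "rut" but not "root".
dot = matching(r"r.t")

# A bracket list matches one of the listed characters; case matters.
bracket = matching(r"r[aou]t")

# Ignoring case lets the same pattern find "Rot" in "Rotten".
any_case = matching(r"r[aou]t", re.IGNORECASE)

# + means one or more of the preceding character, so ro+t finds "root".
plus = matching(r"ro+t")

for name, found in [("r.t", dot), ("r[aou]t", bracket),
                    ("r[aou]t, ignoring case", any_case), ("ro+t", plus)]:
    print(name, "->", found)
```

Regular expressions are case sensitive unless the tool is told otherwise, which is why "Rotten" is only found once case is ignored.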
Simple Examples
Here are a few representative, simple examples.

vi command | What it does |
---|---|
:%s/  */ /g | Change 1 or more spaces into a single space. (The pattern is two spaces followed by *: one space, then zero or more additional spaces.) |
:%s/ *$// | Remove all spaces from the end of the line. |
:%s/^/ / | Insert a space at the beginning of every line. |
:%s/^[0-9][0-9]* // | Remove a number, and the space that follows it, from the beginning of a line. |
:%s/b[aeio]g/bug/g | Change all occurrences of bag, beg, big, and bog to bug. |
:%s/t\([aou]\)g/h\1t/g | Change all occurrences of tag, tog, and tug to hat, hot, and hut respectively. |
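The same substitutions can be tried outside an editor. The following is a minimal sketch in Python's re module (an assumption of this example, not one of the tools covered here). The function names are hypothetical. Note that Python writes groups as ( ) where vi writes \( \).

```python
import re

def squeeze_spaces(line):
    # vi: a space followed by zero or more spaces becomes a single space.
    return re.sub(r"  *", " ", line)

def strip_trailing_spaces(line):
    # vi: :%s/ *$//
    return re.sub(r" *$", "", line)

def indent_one_space(line):
    # vi: :%s/^/ /   (^ matches the empty position at the start of the line)
    return re.sub(r"^", " ", line)

def strip_leading_number(line):
    # vi: :%s/^[0-9][0-9]* //
    return re.sub(r"^[0-9][0-9]* ", "", line)

def all_to_bug(line):
    # vi: :%s/b[aeio]g/bug/g
    return re.sub(r"b[aeio]g", "bug", line)

def swap_t_g(line):
    # vi: :%s/t\([aou]\)g/h\1t/g   (the saved vowel is reused as \1)
    return re.sub(r"t([aou])g", r"h\1t", line)

print(squeeze_spaces("a   b  c"))
print(strip_leading_number("42 apples"))
print(all_to_bug("bag beg big bog"))
print(swap_t_g("tag tog tug"))
```

re.sub replaces every match in the string, which corresponds to the g flag on the vi commands.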
Medium Examples (Strange Incantations)
Example 1
Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

Before | After |
---|---|
foo(10,7,2) | foo(7,10,2) |
foo(x+13,y-2,10) | foo(y-2,x+13,10) |
foo( bar(8), x+y+z, 5) | foo( x+y+z, bar(8), 5) |

The building blocks of the search pattern are:

Expression | Meaning |
---|---|
[^,] | means any character which is not a comma |
[^,]* | means 0 or more characters which are not commas |
\([^,]*\) | tags the non-comma characters as \1 for use in the replacement part of the command |
\([^,]*\), | means that we must match 0 or more non-comma characters which are followed by a comma. The non-comma characters are tagged. |
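The full substitution command does not appear in the text above. One command consistent with the pieces just described would be :%s/foo(\([^,]*\),\([^,]*\),\([^,]*\))/foo(\2,\1,\3)/g, and that is an assumption of this example. The sketch below expresses the same idea with Python's re module, where the roles of the parentheses are reversed: ( ) groups and \( \) matches literal parentheses. The function name swap_first_two is hypothetical.

```python
import re

# Three tagged runs of non-comma characters, separated by commas,
# wrapped in a literal "foo(" ... ")".
FOO_CALL = re.compile(r"foo\(([^,]*),([^,]*),([^,]*)\)")

def swap_first_two(text):
    """Rewrite foo(a,b,c) as foo(b,a,c) wherever it occurs in text."""
    return FOO_CALL.sub(r"foo(\2,\1,\3)", text)

for before in ["foo(10,7,2)", "foo(x+13,y-2,10)", "foo( bar(8), x+y+z, 5)"]:
    print(before, "->", swap_first_two(before))
```

Because each group matches only non-comma characters, this approach would not work for an argument that itself contains a comma, such as a nested call with two parameters.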
Example 2
We have a CSV (comma separated value) file with information we need, but in the wrong format. The columns of data are currently arranged in the following order: Name, Company Name, State, Postal Code. We need to reorganize the data into the following order in order to use it with a particular piece of software: Name, State-Postal Code, Company Name. This means that we must change the order of the columns in addition to merging two columns to form a new column value. The particular piece of software that needs this data will not work if there are any whitespace characters (spaces or tabs) before or after the commas. So we must remove whitespace around the commas.

The data currently looks like this:

Sharon Lee Smith, Design Works Incorporated, CA, 95012
B. Amos , Hill Street Cafe, CA, 95013
Alexander Weatherworth, The Crafts Store, CA, 95014
...

It needs to look like this:

Sharon Lee Smith,CA 95012,Design Works Incorporated
B. Amos,CA 95013,Hill Street Cafe
Alexander Weatherworth,CA 95014,The Crafts Store
...
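The two steps described above (strip whitespace around the commas, then reorder and merge the columns) can be sketched as follows. This uses Python's re module as an assumption of this example. The function name reorder is hypothetical, and the state and postal code are joined with a space because that is what the sample output shows.

```python
import re

def reorder(line):
    """Turn 'Name, Company, State, Postal' into 'Name,State Postal,Company'."""
    # Step 1: remove spaces and tabs on either side of every comma.
    line = re.sub(r"[ \t]*,[ \t]*", ",", line)
    # Step 2: tag the four columns and write them back in the new order,
    # merging state (\3) and postal code (\4) into one column.
    return re.sub(r"^([^,]*),([^,]*),([^,]*),(.*)$", r"\1,\3 \4,\2", line)

sample = [
    "Sharon Lee Smith, Design Works Incorporated, CA, 95012",
    "B. Amos , Hill Street Cafe, CA, 95013",
    "Alexander Weatherworth, The Crafts Store, CA, 95014",
]
for line in sample:
    print(reorder(line))
```

Doing the whitespace cleanup first keeps the second pattern simple, since each column can then be described as a run of non-comma characters.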
Example 3
Suppose you have a multi-character sequence that repeats. For example, consider the following:

Billy tried really hard
Sally tried really really hard
Timmy tried really really really hard
Johnny tried really really really really hard

Now suppose you want to change "really", "really really", and any number of consecutive "really" strings to a single word: "very". The command

:%s/\(really \)\(really \)*/very /

changes the text above to:

Billy tried very hard
Sally tried very hard
Timmy tried very hard
Johnny tried very hard

The expression \(really \)* matches 0 or more sequences of "really ". The sequence \(really \)\(really \)* matches one or more instances of the sequence "really ".
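For readers who want to experiment with this, here is a small sketch in Python's re module (an assumption of this example; the function name soften is hypothetical). Python groups with plain ( ), and count=1 mirrors the vi command above, which has no g flag and so replaces only the first match on each line.

```python
import re

lines = [
    "Billy tried really hard",
    "Sally tried really really hard",
    "Timmy tried really really really hard",
    "Johnny tried really really really really hard",
]

def soften(line):
    # One "really " followed by zero or more further "really " groups.
    return re.sub(r"(really )(really )*", "very ", line, count=1)

def soften_with_plus(line):
    # Where + is supported, the same idea is written more compactly.
    return re.sub(r"(really )+", "very ", line, count=1)

for line in lines:
    print(soften(line))
```

Both functions give the same result on these lines; the first form is the one to reach for in tools that lack the + metacharacter.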
Hard Examples (Magical Hieroglyphics)
Coming soon.
Regular Expressions In Various Tools
OK, you'd like to use regular expressions, but you can't bring yourself to use vi. Here, then, are a few examples of how to use regular expressions in other tools. Also, I have attempted to summarize the differences in regular expressions you will find between different programs.
sed
Sed is a Stream EDitor which can be used to make changes to files or pipes. For complete details, see the man page sed(1).

sed script | Description |
---|---|
sed '/^$/d' price.txt | removes all empty lines |
sed '/^[ \t]*$/d' price.txt | removes all lines containing only whitespace (not every sed understands \t; where it does not, type a literal tab instead) |
sed 's/"//g' price.txt | removes all quotation marks |
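The effect of these three scripts can be previewed with a short sketch in Python's re module (an assumption of this example). The contents of price.txt are not shown in this article, so the sample lines below are made up for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical stand-in for the contents of price.txt.
lines = ['widget "A" 1.50', "", " \t", "gadget 2.25"]

def drop_empty(lines):
    # Like the address /^$/ with the d command: delete lines with nothing in them.
    return [l for l in lines if not re.search(r"^$", l)]

def drop_whitespace_only(lines):
    # Like /^[ \t]*$/ with d: delete lines holding only spaces and tabs.
    return [l for l in lines if not re.search(r"^[ \t]*$", l)]

def strip_quotes(line):
    # Like s/"//g: remove every double quote on the line.
    return re.sub(r'"', "", line)

print(drop_empty(lines))
print(drop_whitespace_only(lines))
print([strip_quotes(l) for l in lines])
```

Since [ \t]* also matches zero characters, the second pattern removes truly empty lines as well as whitespace-only ones.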
awk
Awk is a programming language which can be used to perform sophisticated analysis and manipulation of text data. For complete details, see the man page awk(1). Its peculiar name is an acronym made up of the first character of its authors' last names (Aho, Weinberger, and Kernighan).

awk script | Description |
---|---|
awk '$0 !~ /^$/' price.txt | removes all empty lines |
awk 'NF > 0' price.txt | a better way to remove all empty lines in awk (it also drops lines that contain only whitespace) |
awk '$2 ~ /^[JT]/ {print $3}' price.txt | print the third field of all lines whose second field begins with 'J' or 'T' |
awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt | for all lines whose second field does not contain 'Misc' or 'misc', print the sum of columns 3 and 4 (assumed to be numbers) |
awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt | print all lines where field 3 is not a number. The number must be of the form d.d or d. where d is any number of digits from 0 to 9. |
awk '$2 ~ /John|Fred/ {print $0}' price.txt | print the entire line if the second field contains 'John' or 'Fred' |
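To see what the field-based patterns select, the sketch below imitates four of the awk scripts using Python's re module (an assumption of this example). The layout of price.txt is not given in this article, so the rows are invented: four whitespace-separated fields, with a name in field 2 and numbers in fields 3 and 4. All names in the code are hypothetical.

```python
import re

# Hypothetical rows standing in for price.txt.
rows = [
    "1 John 3.50 2",
    "2 Tina 4.25 1",
    "3 misc x 0",
    "4 Fred 10. 5",
]

def fields(line):
    # awk splits on runs of whitespace by default; str.split() does the same.
    return line.split()

# $2 ~ /^[JT]/ {print $3}
third_of_jt = [fields(r)[2] for r in rows if re.search(r"^[JT]", fields(r)[1])]

# $2 !~ /[Mm]isc/ {print $3 + $4}
sums = [float(fields(r)[2]) + float(fields(r)[3])
        for r in rows if not re.search(r"[Mm]isc", fields(r)[1])]

# $3 !~ /^[0-9]+\.[0-9]*$/ {print $0}
not_a_number = [r for r in rows
                if not re.search(r"^[0-9]+\.[0-9]*$", fields(r)[2])]

# $2 ~ /John|Fred/ {print $0}
john_or_fred = [r for r in rows if re.search(r"John|Fred", fields(r)[1])]

print(third_of_jt, sums, not_a_number, john_or_fred, sep="\n")
```

Note that awk numbers fields from 1 while Python lists are indexed from 0, so $2 corresponds to fields(r)[1].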
grep
grep is a program used to match regular expressions in one or more specified files or in an input stream. For complete details, see the man page grep(1). Its peculiar name stems from its roots as a command in the ed and ex/vi editors, g/re/p, meaning global regular expression print. For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Wong, Fred 4-4123
Jones, Thomas 1-4122
Salazar, Richard 5-2522

grep command | Description |
---|---|
grep '\t5-...1' phone.txt | print all the lines in phone.txt where the phone number begins with 5 and ends with 1. The tab character is represented here by \t; if your grep does not interpret \t, type a literal tab instead. |
grep '^S[^ ]* R' phone.txt | print lines where the last name begins with S and the first name begins with R |
grep '^[JW]' phone.txt | print lines where the last name begins with J or W |
grep ', ....\t' phone.txt | print lines where the first name is 4 characters. The tab character is represented by \t, as above. |
grep -v '^[JW]' phone.txt | print lines that do not begin with J or W |
grep '^[M-Z]' phone.txt | print lines where the last name begins with any letter from M to Z |
grep '^[M-Z].*[12]$' phone.txt | print lines where the last name begins with a letter from M to Z and where the phone number ends with a 1 or 2 |
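The patterns in this table can be checked against the three phone.txt lines with a few lines of Python (the re module is an assumption of this example, and the helper named grep is a hypothetical stand-in for the real command). The lines are written with an explicit tab between first name and phone number, as the format description states.

```python
import re

# phone.txt as described: "Last, First<TAB>phone".
phone = [
    "Wong, Fred\t4-4123",
    "Jones, Thomas\t1-4122",
    "Salazar, Richard\t5-2522",
]

def grep(pattern, lines, invert=False):
    """Return lines that contain a match, or (like -v) lines that do not."""
    return [l for l in lines if bool(re.search(pattern, l)) != invert]

starts_s_first_r = grep(r"^S[^ ]* R", phone)
starts_j_or_w = grep(r"^[JW]", phone)
four_letter_first = grep(r", ....\t", phone)
not_j_or_w = grep(r"^[JW]", phone, invert=True)
m_to_z = grep(r"^[M-Z]", phone)
m_to_z_ends_1_or_2 = grep(r"^[M-Z].*[12]$", phone)
five_to_one = grep(r"\t5-...1", phone)

print(starts_s_first_r, four_letter_first, m_to_z_ends_1_or_2, sep="\n")
```

Python's re module always understands \t as a tab, which sidesteps the differences between grep implementations on that point.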
egrep
egrep is an extended version of grep. It supports a few more metacharacters in its regular expressions. For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Wong, Fred 4-4123
Jones, Thomas 1-4122
Salazar, Richard 5-2522

egrep command | Description |
---|---|
egrep '(John|Fred)' phone.txt | print all lines that contain the name John or Fred |
egrep 'John|22$|^W' phone.txt | print lines that contain John or that end with 22 or that begin with W |
egrep 'net(work)?s' report.txt | print lines in report.txt that contain networks or nets |
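The extended metacharacters used here (|, ( ), and ?) are written the same way in Python's re module, so the three patterns can be tried unchanged. This is a sketch under that assumption. The contents of report.txt are not shown in this article, so those lines are invented, and the helper name egrep is a hypothetical stand-in for the real command.

```python
import re

phone = [
    "Wong, Fred\t4-4123",
    "Jones, Thomas\t1-4122",
    "Salazar, Richard\t5-2522",
]

# Hypothetical stand-in for report.txt.
report = ["the networks are down", "fishing nets", "network cable"]

def egrep(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    return [l for l in lines if re.search(pattern, l)]

john_or_fred = egrep(r"(John|Fred)", phone)

# Three alternatives; each line needs to satisfy only one of them.
any_of_three = egrep(r"John|22$|^W", phone)

# (work)? makes the group optional, so both "nets" and "networks" match.
nets = egrep(r"net(work)?s", report)

print(john_or_fred, any_of_three, nets, sep="\n")
```

In the second pattern the anchors bind to their own alternative only: 22$ is anchored to the end of the line and ^W to the start, while John may appear anywhere.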
Regular Expressions Syntax Support
Command or Environment |
. | [ ] | ^ | $ | \( \) | \{ \} | ? | + | | | ( ) |
vi | X | X | X | X | X | |||||
Visual C++ | X | X | X | X | X | |||||
awk | X | X | X | X | X | X | X | X | ||
sed | X | X | X | X | X | X | ||||
Tcl | X | X | X | X | X | X | X | X | X | |
ex | X | X | X | X | X | X | ||||
grep | X | X | X | X | X | X | ||||
egrep | X | X | X | X | X | X | X | X | X | |
fgrep | X | X | X | X | X | |||||
perl | X | X | X | X | X | X | X | X | X |