Regular expressions are used in pattern matching to find particular strings, or combinations of letters and numbers, in documents or string objects. It is likely you have already worked with regular expressions, whether you realize it or not. Here we will look specifically at using egrep with regular expressions.
The simplest regular expression is one that just looks for a specific word or character, as demonstrated here. We will use a document called regex_test.txt with the contents shown below for all the examples unless otherwise noted:
Patty Farnsworth
Phone Number: 765-899-4756
4lph4num3r1c
Hello there students
egrep Patty regex_test.txt
Output: Patty Farnsworth
This command will return the line containing the name Patty from the
document. Note that this is case-sensitive and that using "patty"
instead of "Patty" will not return the line we want.
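As a small sketch of the case-sensitivity point, the commands below recreate the sample file in a scratch directory and compare a case-sensitive search with one that uses the -i option. The scratch path and the variable names are assumptions made for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Recreate the sample file in a scratch directory (the location is an assumption).
dir=$(mktemp -d)
file="$dir/regex_test.txt"
printf '%s\n' 'Patty Farnsworth' 'Phone Number: 765-899-4756' \
  '4lph4num3r1c' 'Hello there students' > "$file"

# Case-sensitive search: lowercase "patty" matches nothing, so grep exits
# with status 1; "|| true" keeps a strict shell from stopping here.
sensitive=$(grep -E 'patty' "$file" || true)

# -i tells grep to ignore case, so the same pattern now finds the line.
insensitive=$(grep -E -i 'patty' "$file")

echo "case-sensitive:   [$sensitive]"
echo "case-insensitive: [$insensitive]"
rm -rf "$dir"
```

The first result is empty and the second contains the Patty Farnsworth line.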
egrep [PH] regex_test.txt
Output: Patty Farnsworth
Phone Number: 765-899-4756
Hello there students
When multiple characters are placed inside brackets, as in this case, the
expression matches any one of them: find lines containing "P" or "H".
egrep [0-9] regex_test.txt
Output: Phone Number: 765-899-4756
4lph4num3r1c
You can search for any of the digits 0,1,2,3,4,5,6,7,8,9 by using a '-'
symbol between the two digits 0 and 9. Similarly, you can search for
lowercase letters by using a-z, or for capital letters by using A-Z.
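To make the letter ranges concrete, here is a minimal sketch that counts the lines of the sample file containing at least one capital letter and at least one lowercase letter. The scratch file location and variable names are assumptions. LC_ALL=C is set because in some locales a range such as [a-z] can also match uppercase letters.

```shell
# Recreate the sample file in a scratch directory (the location is an assumption).
dir=$(mktemp -d)
file="$dir/regex_test.txt"
printf '%s\n' 'Patty Farnsworth' 'Phone Number: 765-899-4756' \
  '4lph4num3r1c' 'Hello there students' > "$file"

# -c prints the number of matching lines instead of the lines themselves.
upper=$(LC_ALL=C grep -E -c '[A-Z]' "$file")   # lines with a capital letter
lower=$(LC_ALL=C grep -E -c '[a-z]' "$file")   # lines with a lowercase letter

echo "lines with A-Z: $upper"
echo "lines with a-z: $lower"
rm -rf "$dir"
```

Only 4lph4num3r1c lacks a capital letter, so the first count is 3 and the second is 4.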
egrep [0-9][a-z] regex_test.txt
Output: 4lph4num3r1c
It is important to see that a bracket expression matches only a single
character. So in this case we are looking for any line that contains
a digit immediately followed by a lower case letter.
egrep [0-9]+ regex_test.txt
Output: Phone Number: 765-899-4756
4lph4num3r1c
Here we see the + operator. This operator says "one or more" of the
preceding item (here, the bracket expression), so in this case it will
match any line that has one or more consecutive digits.
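To see what the + operator actually captures, the sketch below uses the -o option, which prints each matched piece on its own line rather than the whole line. The -o option is not part of the original example; it is available in GNU and BSD grep, and its use here is an assumption about your environment.

```shell
# Recreate the sample file in a scratch directory (the location is an assumption).
dir=$(mktemp -d)
file="$dir/regex_test.txt"
printf '%s\n' 'Patty Farnsworth' 'Phone Number: 765-899-4756' \
  '4lph4num3r1c' 'Hello there students' > "$file"

# Each run of consecutive digits is reported as one match, so 765 is
# printed as a unit rather than as 7, 6 and 5.
runs=$(grep -E -o '[0-9]+' "$file")

echo "$runs"
rm -rf "$dir"
```

The phone number contributes 765, 899 and 4756, while 4lph4num3r1c contributes four separate one-digit runs.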
egrep ".*" regex_test.txt
Output: Patty Farnsworth
Phone Number: 765-899-4756
4lph4num3r1c
Hello there students
This introduces a new symbol and operator: the "." symbol, which matches
any single character, and the "*" operator, which matches zero or more of
the preceding item. So this regular expression will match every line in
the document, including empty ones. Note the double quotes surrounding the
regular expression. They are necessary here because the shell processes
arguments before the command is executed. Without the quotes, the shell
would replace .* with the names of all dot files in the current directory.
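The effect of quoting can be checked directly. This sketch creates a scratch directory containing one dot file (the name .hidden_example is hypothetical) and compares what the shell passes along for .* with and without quotes.

```shell
# Scratch directory holding the sample file and one dot file.
# Both the location and the name .hidden_example are assumptions.
dir=$(mktemp -d)
printf '%s\n' 'Patty Farnsworth' 'Phone Number: 765-899-4756' \
  '4lph4num3r1c' 'Hello there students' > "$dir/regex_test.txt"
touch "$dir/.hidden_example"

# Unquoted: the shell expands .* to the matching file names first.
# Depending on the shell, the list may also include . and ..
unquoted=$(cd "$dir" && echo .*)

# Quoted: the two characters .* are passed through untouched.
quoted=$(cd "$dir" && echo '.*')

# With quotes, grep receives the pattern and matches every line.
all_lines=$(grep -E -c '.*' "$dir/regex_test.txt")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "matching lines: $all_lines"
rm -rf "$dir"
```

Quoting every pattern is a reasonable habit, since brackets and other characters can also be expanded by the shell when matching file names exist.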
egrep [0-9][a-z]+[0-9] regex_test.txt
Output: 4lph4num3r1c
In this command we notice that multiple regular expressions can be
combined to find complex patterns in a document. For example, in this
expression we are asking for a pattern that matches a digit followed by
one or more lower case letters, followed finally by another digit.
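The combined pattern can be inspected piece by piece with the -o option, which prints only the matched text. As before, -o is an addition to the original example and is assumed to be available (it is in GNU and BSD grep).

```shell
# Recreate the sample file in a scratch directory (the location is an assumption).
dir=$(mktemp -d)
file="$dir/regex_test.txt"
printf '%s\n' 'Patty Farnsworth' 'Phone Number: 765-899-4756' \
  '4lph4num3r1c' 'Hello there students' > "$file"

# digit, one or more lowercase letters, digit.
# Matches do not overlap: once 4lph4 is consumed, the search resumes at
# "num3r1c", so the second 4 cannot start another match.
pieces=$(grep -E -o '[0-9][a-z]+[0-9]' "$file")

echo "$pieces"
rm -rf "$dir"
```

Within 4lph4num3r1c the pattern matches twice, first 4lph4 and then 3r1.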
egrep [#$\+%@] regex_test.txt
In this command we are searching for lines containing specific special
characters. Note that inside brackets the + symbol is not used as a
matching operator and is treated as just the normal "+" character, so the
backslash is not strictly required here. Outside brackets, write \+ to
match a literal plus sign.
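Since regex_test.txt contains none of these characters, the sketch below uses a small hypothetical file (specials.txt, with made-up lines) to show the bracket expression at work and to contrast + as an operator with + as a literal character outside brackets.

```shell
# Hypothetical sample data; none of these lines come from regex_test.txt.
dir=$(mktemp -d)
file="$dir/specials.txt"
printf '%s\n' 'price: $5' 'a+b' 'plain text' 'user@example.com' '100%' > "$file"

# Inside brackets every listed character, including +, is literal.
special=$(grep -E -c '[#$+%@]' "$file")

# Outside brackets, + means "one or more of the preceding item", so
# a+b looks for one or more a's directly followed by b and finds nothing.
operator_plus=$(grep -E -c 'a+b' "$file" || true)

# Escaped, \+ matches a literal plus sign.
literal_plus=$(grep -E -c 'a\+b' "$file")

echo "lines with a special character: $special"
echo "lines matching a+b:  $operator_plus"
echo "lines matching a\\+b: $literal_plus"
rm -rf "$dir"
```

Four of the five lines contain a listed character, the unescaped a+b matches no line, and the escaped form matches the single a+b line.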
Backreferences in Regex
Often you will need to match specific types of characters using regular expressions. Recall from the regular expressions tutorial in Lab 2 that you can use the character class syntax, e.g. [a-z0-9], and specify 0+, 1+, or some other number of those characters by using the *, +, and {,} operators. For instance, [0-9][0-9]* matches one or more decimal digits, e.g. 12345 or 32.

However, we may sometimes want to match repeated instances of the same character. If we know what the character is, this is easy: we simply use that single character in place of the character class. If we do not know what the character is (or if it could be any character in a class), we can use a backreference. When parts of a regular expression appear in parentheses, the text they matched can be referred to later using backreferences. We use \1 to refer to the first parenthesized group, \2 to the second, and so on.

A simple example of how to use a backreference is finding two or more of the same digit in a row. We don't care which digit, just that there are two or more of a kind. ([0-9]) will match the first character. Then we can append a \1 to match the next character, and add the extended regex operator + to indicate one or more matches:
([0-9])\1+
This matches strings like '11111' or '3333'. Explicitly translated, this says "match any single digit, then one or more of what we just matched".

If we wanted to match a repeated pattern of two characters, we could do the following:
([0-9])([0-9])(\1\2)+
This matches strings such as '121212' or '232323'.

Backreferences are especially useful with stream processing programs such as sed, where we need to replace and rearrange data that we encounter.
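The two backreference patterns above, and the sed use case, can be tried with the following sketch. The file digits.txt and its contents are hypothetical. Backreferences in extended regular expressions are assumed to be supported by your grep (GNU grep supports them), and sed is assumed to accept the -E option.

```shell
# Hypothetical sample data for the backreference patterns.
dir=$(mktemp -d)
file="$dir/digits.txt"
printf '%s\n' '11111' '12345' '121212' '3333' > "$file"

# One digit, then one or more copies of that same digit.
repeated=$(grep -E '([0-9])\1+' "$file")

# Two digits, then one or more repeats of that same pair. A run such as
# 3333 also qualifies, because \1 and \2 may hold the same digit.
pairs=$(grep -E '([0-9])([0-9])(\1\2)+' "$file")

# In sed, backreferences in the replacement rearrange what was matched:
# here the two words are swapped.
swapped=$(echo 'Farnsworth Patty' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2 \1/')

echo "repeated digit:"; echo "$repeated"
echo "repeated pair:";  echo "$pairs"
echo "swapped: $swapped"
rm -rf "$dir"
```

The line 12345 matches neither pattern, since no digit or pair of digits repeats in it.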