Regular Expression Tutorial

Regular expressions are used in pattern matching to find particular strings, or combinations of letters and numbers, in documents or string objects. It is likely you have already worked with regular expressions, whether you realized it or not. Here we will look specifically at using egrep with regular expressions.

The simplest regular expression is one that just looks for a specific word or character, as demonstrated here. We will use a document called regex_test.txt with the contents shown below for all the examples unless otherwise noted:
Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students

egrep Patty regex_test.txt
Output: Patty Farnsworth
This command will return the line containing the name Patty from the document. Note that this is case-sensitive and that using "patty" instead of "Patty" will not return the line we want.
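As a quick check of the case-sensitivity point, here is a minimal sketch. It rebuilds the sample file in a temporary directory (the variable names are only for illustration) and uses grep -E, which is the modern spelling of egrep. The -i option asks grep to ignore case.

```shell
# Recreate the sample file in a temporary directory (location is illustrative).
dir=$(mktemp -d)
cat > "$dir/regex_test.txt" <<'EOF'
Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students
EOF

# Exact-case pattern: finds the line.
exact=$(grep -E 'Patty' "$dir/regex_test.txt")
echo "exact:  $exact"

# Lower-case pattern: no match, so grep prints nothing and exits with status 1.
lower=$(grep -E 'patty' "$dir/regex_test.txt" || true)
echo "lower:  $lower"

# -i makes the search case-insensitive, so the line is found again.
nocase=$(grep -E -i 'patty' "$dir/regex_test.txt")
echo "nocase: $nocase"
```

The first and third searches should print the Patty Farnsworth line, while the second prints nothing.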

egrep [PH] regex_test.txt
Output: Patty Farnsworth
Phone Number: 765-899-4756
Hello there students
When multiple characters are placed inside brackets, as in this case, the expression matches any one of them, so here you are saying find lines containing "P" or "H".

egrep [0-9] regex_test.txt
Output: Phone Number: 765-899-4756
4lph4num3r1c
You can search for any of the digits 0,1,2,3,4,5,6,7,8,9 by placing a '-' symbol between 0 and 9 inside the brackets. Similarly, you can search for alphabetic characters by using a-z for lower case letters or A-Z for capital letters.
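The ranges described above can be tried out with a short sketch like the one below. It counts matching lines with the -c option rather than printing them. Setting LC_ALL=C is an assumption made here so that a-z and A-Z mean plain ASCII ranges regardless of the system locale.

```shell
# Use the plain C locale so that ranges such as a-z are ASCII ranges.
export LC_ALL=C

# Recreate the sample file in a temporary directory (location is illustrative).
dir=$(mktemp -d)
cat > "$dir/regex_test.txt" <<'EOF'
Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students
EOF

# -c prints the number of matching lines instead of the lines themselves.
upper=$(grep -E -c '[A-Z]' "$dir/regex_test.txt")   # lines with a capital letter
lower=$(grep -E -c '[a-z]' "$dir/regex_test.txt")   # lines with a lower case letter
digit=$(grep -E -c '[0-9]' "$dir/regex_test.txt")   # lines with a digit

echo "capital: $upper  lower case: $lower  digit: $digit"
# prints: capital: 3  lower case: 4  digit: 2
```

Ranges can also be combined in one set of brackets, for example [a-zA-Z0-9] for any letter or digit.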

egrep [0-9][a-z] regex_test.txt
Output: 4lph4num3r1c
It is important to see that each bracketed set matches only a single character. So in this case we are looking for any line that contains a digit immediately followed by a lower case letter.

egrep [0-9]+ regex_test.txt
Output: Phone Number: 765-899-4756
4lph4num3r1c
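Here we see the + operator. This operator means "one or more" of the preceding item, so in this case it will match any line that has one or more consecutive digits. The lines returned are the same as for [0-9] alone, since any line containing a digit contains a run of at least one digit; the difference is in how much of the line the expression matches.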
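To see what the + operator actually matches, rather than just which lines are returned, one option is the -o flag, which prints each matched piece on its own line. Note that -o is an extension found in GNU and BSD grep and is not part of the original tutorial; this is only a sketch under that assumption.

```shell
# Recreate the sample file in a temporary directory (location is illustrative).
dir=$(mktemp -d)
cat > "$dir/regex_test.txt" <<'EOF'
Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students
EOF

# -o (GNU/BSD extension) prints only the matched text, one match per line.
# Each run of consecutive digits is one match for [0-9]+.
runs=$(grep -E -o '[0-9]+' "$dir/regex_test.txt")
echo "$runs"
# prints:
# 765
# 899
# 4756
# 4
# 4
# 3
# 1
```

The phone number yields three runs of digits, while every digit in 4lph4num3r1c is a run of length one because each is separated by letters.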

egrep ".*" regex_test.txt
Output: Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students
This introduces a new symbol and a new operator: the "." symbol, which matches any single character, and the "*" operator, which matches zero or more of the preceding item. So this regular expression will match every line in the document, including the empty ones, since each contains zero or more characters of any type. Note the use of double quotes surrounding the regular expression. This is necessary in this case because the arguments are processed by the shell before the command is executed. Without the quotes, the shell would replace .* with the names of all dot files in the current directory. Characters such as * and [ ] are also special to the shell, so quoting the pattern is a safe habit in general.
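The effect of the quotes can be observed with echo, which simply prints whatever arguments the shell hands to it. The sketch below assumes a POSIX-style shell such as bash and uses a hypothetical dot file named .hidden_example created just for the demonstration.

```shell
# Recreate the sample file in a temporary directory (location is illustrative).
dir=$(mktemp -d)
cat > "$dir/regex_test.txt" <<'EOF'
Patty Farnsworth
Phone Number: 765-899-4756

4lph4num3r1c

Hello there students
EOF

# A hypothetical dot file for the shell to find.
touch "$dir/.hidden_example"

# Unquoted: the shell expands .* into dot-file names before the command runs.
# (Depending on the shell, the list may also include "." and "..".)
unquoted=$(cd "$dir" && echo .*)
echo "unquoted: $unquoted"

# Quoted: the two characters .* are passed through unchanged.
quoted=$(cd "$dir" && echo ".*")
echo "quoted:   $quoted"

# With the quotes in place, every one of the six lines matches.
count=$(grep -E -c ".*" "$dir/regex_test.txt")
echo "matching lines: $count"
```

The unquoted form would hand file names to egrep in place of the pattern, which is why the quoted form is the one used above.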

egrep [0-9][a-z]+[0-9] regex_test.txt
Output: 4lph4num3r1c
In this command we see that several pieces of a regular expression can be combined to find complex patterns in a document. In this expression we are asking for a digit followed by one or more lower case letters, followed finally by another digit.

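egrep '[#$+%@]' regex_test.txt
In this command we are searching for lines containing specific special characters. Our sample document contains none of them, so the command prints nothing. Note that inside brackets the + symbol is treated as just the normal "+" character, so it does not need to be escaped; in fact, a backslash inside brackets would itself become one of the characters searched for. Outside of brackets, write \+ to match a literal "+" instead of using it as a matching operator. The single quotes keep the shell from interpreting characters such as # and $ before egrep sees them.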
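Since regex_test.txt contains none of these special characters, the sketch below feeds a few made-up lines to grep -E (the modern spelling of egrep) through a pipe. The sample lines are hypothetical and exist only for this illustration. It also contrasts how + behaves inside brackets, outside brackets, and when escaped.

```shell
# Count the made-up lines that contain at least one of # $ + % @.
# Inside brackets these characters, including +, are all taken literally.
hits=$(printf '%s\n' 'cost: $5' 'a+b' 'plain text' 'user@example.com' '50% off' '#tag' \
        | grep -E -c '[#$+%@]')
echo "lines with a special character: $hits"
# prints: lines with a special character: 5

# Outside brackets, an unescaped + is the "one or more" operator:
# a+b means one or more "a" followed by "b", so it matches aab but not a+b.
operator=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a+b')
echo "unescaped: $operator"
# prints: unescaped: aab

# Escaped with a backslash, + is an ordinary character again.
literal=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a\+b')
echo "escaped:   $literal"
# prints: escaped:   a+b
```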

Backreferences in Regex

Often you will need to match specific types of characters using regular expressions. Recall from the regular expressions tutorial in Lab 2 that you can use the character class syntax, i.e. [a-z0-9], and specify 0+, 1+, or some other number of those characters by using the *, +, and {,} operators. For instance, [0-9][0-9]* matches on 1 or more digits, i.e. 12345 or 32.

However, we may sometimes want to reference repeated instances of the same character. If we know what the character is, this is easy: we simply use a single character in place of the character class. However, if we do not know what the character is (or if it could be any character in a class) we can use a backreference. When parts of a regular expression appear in parentheses, they can be referred to later using backreferences. We use \1 to refer to the text matched by the first parenthesized group, \2 to the second, and so on.

A simple example of how to use a backreference is if we wanted to find two or more of the same digit in a row. We don't care which digit, just that there are two or more of a kind. ([0-9]) will match on the first character. Then we can append a \1 to match on the next character, then add the extended regex character + to indicate 1 or more matches:

([0-9])\1+

This would match on strings like '11111' or '3333'. Explicitly translated, this says "match on any single digit, then one or more of what we just matched".

If we wanted to match on a repeated pattern of two characters, we could do the following:

([0-9])([0-9])(\1\2)+

This matches on strings such as '121212' or '232323'.

Backreferences are especially useful with stream processing programs such as sed, where we need to replace and rearrange data that we encounter.
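The two backreference expressions above can be tried with a sketch like the following. It assumes GNU grep and GNU sed (or compatible versions), since backreferences in extended regular expressions and the -E option of sed are common extensions rather than guaranteed everywhere. The input strings and the name-swapping sed command are made up for illustration.

```shell
# ([0-9])\1+ : a digit followed by one or more copies of that same digit.
repeated=$(printf '%s\n' '11111' '3333' '12345' '32' | grep -E '([0-9])\1+')
echo "$repeated"
# prints:
# 11111
# 3333

# ([0-9])([0-9])(\1\2)+ : a two-digit pair followed by one or more repeats of it.
pairs=$(printf '%s\n' '121212' '232323' '123123' '12' | grep -E '([0-9])([0-9])(\1\2)+')
echo "$pairs"
# prints:
# 121212
# 232323

# In sed, backreferences can also be used in the replacement text.
# Here \1 and \2 are swapped to rearrange "last first" into "first last".
swapped=$(echo 'Farnsworth Patty' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2 \1/')
echo "$swapped"
# prints: Patty Farnsworth
```

Note that '12345' and '123123' are not matched, because no digit or pair of digits is immediately followed by a copy of itself.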