Learning Regular Expressions is fun and easy :)

A silly cartoon about regular expressions

DON’T BE AFRAID. FOUNDATIONS FIRST, DETAILS LATER

The first step towards learning Regular Expressions is to stop being afraid of them. A good idea is NOT to start by consulting your programming language’s documentation page, but by reading a book, written by a human being for a human being, which is “Mastering Regular Expressions” by Jeffrey E.F. Friedl (O’Reilly). At least the first two chapters, which give you a pretty good idea of some basic issues. The book helps you create a “regular expressions frame of mind.” Details can come later.

WARNING: I am NOT an expert on this. This is the first in a series of posts I plan to write on this subject. I hope this post might entertain you and help break the ice if you need to learn regular expressions but find them intimidating. Otherwise, it might give you ideas on how to explain regular expressions to someone else. But first and foremost, these are just my notes, written as I’m reading the book, doing the exercises and putting all of this into practice: my personal notes in a fun format. I’ve found that having to put something into my own words helps a lot in figuring out how much I have really understood. Plus, it may serve me later when I need a reminder on the subject. These notes are mostly from the book. Some examples and humorous explanations are mine, though. Explaining things in fun ways helps me understand and remember them better.

MATCHING, MATCHING AND SAVING, REPLACING

We can use regular expressions to ‘match’ a pattern against a string, to see if it ‘fits’, like trying on shoes: the idea is to see whether the string contains something that corresponds to our pattern. If we match the pattern against a file with many lines, for example with egrep, we get only the lines that contain something that fits the pattern. System administrators often use this to parse log files and spot possible problems: a certain error message, a hard disk that may be getting full, or a temperature that may be rising. In the last case, the administrator could email themselves every log line about temperatures, but it might be easier to single out just the number that represents the temperature, instead of the entire line of text, and then check whether it’s above a certain value, log it, graph it, or anything else. We may also use regular expressions to search for and then replace a certain pattern, like a fiction writer who decides to rename a character, being careful not to catch similar text and replace too much. This can be done comfortably in the vi(m) editor, for example.
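To make these three uses concrete, here is a minimal sketch in Python's re module (my choice for illustration; the post itself sticks to egrep, Perl and vim). The log line, the story sentence and all variable names are made up, and the \b in the last pattern is a word boundary, which only gets explained further down.

```python
import re

log_line = "sensor: disk temperature 47 C"  # hypothetical log line

# 1. Matching: does the line contain the pattern at all?
has_temperature = re.search(r"temperature", log_line) is not None

# 2. Matching and saving: the parentheses capture just the number.
found = re.search(r"temperature ([0-9]+)", log_line)
temperature = int(found.group(1)) if found else None

# 3. Replacing: rename a character, but only where "Mark" is a whole word.
story = "Mark met Marko at the market."
renamed = re.sub(r"\bMark\b", "Luke", story)

print(has_temperature)  # True
print(temperature)      # 47
print(renamed)          # Luke met Marko at the market.
```

Once the number is an integer we can compare it, log it or graph it, which is the whole point of saving only a piece of the line.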

FAST-FORWARD TO A USEFUL TRICK

This is a more complex example, as it concerns scripts, not egrep, but it’s pretty useful. With egrep we get whole matching lines, and with scripting we can get single pieces of info, like a list of URLs in a web page. For basic uses, when we need to do something like that, a simple .*? trick will do. How does it work? As you’ll learn later, a dot stands for “any character”, a star or asterisk stands for “any number of times, including zero”, and the question mark in this context makes the regular expression not greedy (or ‘lazy’, or ‘minimal-matching’). So, to find URLs we would use something like
a href="(.*?)"
But more about that later.

PLAY FIRST, STUDY LATER

(…)

To break the ice, start playing with some regular expressions. On Unix systems, you might want to use the egrep program to look for patterns in text files. Or use Perl, or Python, or PHP. Or anything else, but it’s easiest to start with egrep. That’s what the above-cited book uses as well, for its basic examples. In a Unix (Linux etc.) command line, we provide the command (egrep), a pattern (a regular expression) and a text file, and it returns the lines of that file that contain something which fits the pattern. If you aren’t using egrep, check whether there are some (subtle) differences in interpretation with what you are using (Perl, Python, PHP, emacs, etc.). This post explains some basic concepts which are useful either way, even if some details, often syntax, may vary. Don’t worry about that too early. Take a deep breath, smile, and try different things. Sooner or later you’ll figure out what works and what doesn’t. It’s just a matter of not giving up and not letting it get to you.

To start playing with egrep, you can download subtitle files from your favorite sitcom, and play with that. A basic example could be:

user@host:~/regexp$ egrep '^Sheldon' The.Big.Bang.Theory.S04E11.HDTV.XviD-FEVER.srt
user@host:~/regexp$ egrep '^Leonard' The.Big.Bang.Theory.S04E11.HDTV.XviD-FEVER.srt
Leonard’s buying.
Leonard as Superman.
Leonard,
Leonard Hofstadter!

This tells us that, in this particular episode, no lines begin with the word “Sheldon”, but some begin with the word “Leonard”. With regular expressions we can find patterns, for example the beginning of the line followed by a word. If we only wanted to find lines containing a certain word, then there are simpler methods that work just fine (any normal ‘find a needle in a haystack’ method). If, instead, we were interested in different spellings of a word, we could start using regular expressions, for example
colou?r
which makes the u optional, so it fits both color and colour. It’s a bit like searching for all text (-extensioned) files with *.txt, if you’ve ever done that. So, for instance, regular expressions allow us to figure out which lines start or end with something. I remembered that, in The Big Bang Theory episode 12 of season 1, Leonard was teased about his “tendency to end sentences with prepositions”. That doesn’t make for a good example for this blog post, but might get your imagination working (though not all problems are best solved with regular expressions). If we simply wanted to use egrep to find lines of text that end with someword, we would do that with
someword$
since $ means end of line in this context.
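If you prefer to play in a scripting language, the same three ideas (start of line, end of line, optional character) can be tried out roughly like this. It's a small sketch using Python's re module rather than egrep, and the sample lines are invented, not taken from the subtitle file.

```python
import re

# Invented sample lines, standing in for a subtitle file.
lines = [
    "Leonard, wait!",
    "I told Leonard",
    "The colour red",
    "The color of money",
]

# ^ anchors the pattern to the beginning of the line.
starts_with_leonard = [line for line in lines if re.search(r"^Leonard", line)]

# $ anchors the pattern to the end of the line.
ends_with_leonard = [line for line in lines if re.search(r"Leonard$", line)]

# u? makes the u optional, so both spellings fit.
either_spelling = [line for line in lines if re.search(r"colou?r", line)]

print(starts_with_leonard)  # ['Leonard, wait!']
print(ends_with_leonard)    # ['I told Leonard']
print(either_spelling)      # ['The colour red', 'The color of money']
```

Like egrep, re.search is happy if the pattern fits anywhere in the line, so the anchors are what pin it to one end.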

CHARACTERS AND META-CHARACTERS

For discovering these patterns in texts we use characters, obviously (literals, to look for exact, literal, pieces of text), and metacharacters – which are characters with special tasks. Metacharacters are usually good at role playing, for instance, the caret symbol (^) can convince the regular expression engine that it’s a beginning of a line, and the dollar sign ($) can make a pretty believable end of a line (my metaphor, I think). These are metacharacters. If we want to find a line that contains a certain pattern and nothing else, we look for ^certain pattern$
where certain pattern is the pattern we are interested in.

And if we want to use metacharacters as their literal meanings, we escape them, adding a backslash (\) in front of them, like \$ if we really need an actual dollar sign to appear in that spot.
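Here is a tiny experiment showing the difference the backslash makes, sketched in Python's re module (an assumption on my part; the escaping rule itself is the same in egrep and Perl). The price lines are made up.

```python
import re

price_lines = ["total: 5$", "5$ is the total", "total: 5"]

# Unescaped, $ plays "end of line": this finds lines that END with a 5.
unescaped = [line for line in price_lines if re.search(r"5$", line)]

# Escaped, \$ is an actual dollar sign: this finds a 5 followed by "$".
escaped = [line for line in price_lines if re.search(r"5\$", line)]

# re.escape adds the backslashes for us when the text comes from elsewhere.
safe_pattern = re.escape("$9.99")

print(unescaped)     # ['total: 5']
print(escaped)       # ['total: 5$', '5$ is the total']
print(safe_pattern)  # \$9\.99
```

Note that the dot got escaped too, since it is also a metacharacter.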

WITH US OR AGAINST US? (CHARACTER CLASSES)

Square brackets ([…]) are interesting metacharacters. They define classes of characters, when we want a certain character to be any member of a certain “crowd”. We can use
gr[ae]y
to match either grey or gray (example from the book). A character class always stands for just one character, which needs to be one of those in between the square brackets.

We can also use a negated character class, to find all characters except those included between the brackets, which is made with [^…]. Some of you may be annoyed to see that the caret (^) character is already used for two completely different purposes (one outside and one inside the character class). Don’t let that ruin the fun for you. I understand your point, yes, it is strange, but strange things happen, so take a deep breath, smile at how peculiar human beings are (even regular expressions were made by human hands), and move on. The book suggests that since the caret would otherwise remain useless inside square brackets, this is one way for it to make itself useful. Now that I think of it, I guess this means nobody wants to use a character class where one of the possible characters is the beginning of the line (probably because it’s not a proper character, so character classes look down on it and keep it out). At least it found its way to take revenge, becoming the great character class negator, the caret symbol. Oh, well. Remember that a negated class still represents and expects one character, except that we insist that this one character should be some other character, not one of those in the square brackets. Like a rebellious teenager, a negated class knows it wants something, doesn’t know what that something is, but knows what that something is totally not. (This metaphor is definitely mine.)

A good example for this, straight from that book, is used for finding any number of characters between two quotation marks, as long as none of them is a quotation mark itself. So we use: "[^"]*"
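Both kinds of class are easy to try out. The sketch below uses Python's re module (my pick, not the book's) and made-up sample text. For the quoted-text pattern I'm assuming the form with a star after the negated class, since we want any number of non-quote characters between the quotes.

```python
import re

# A plain class: exactly one character, either a or e.
gray_or_grey = re.findall(r"gr[ae]y", "grey gray groy")

# A negated class with a star: an opening quote, any number of
# characters that are NOT a quote, then the closing quote.
quoted = re.findall(r'"[^"]*"', 'say "hi" and "bye"')

# A negated class still demands one character: "ab" has nothing after b.
needs_a_character = re.search(r"ab[^c]", "ab")

print(gray_or_grey)       # ['grey', 'gray']
print(quoted)             # ['"hi"', '"bye"']
print(needs_a_character)  # None
```

The last line is the rebellious teenager in action: not finding a c is not enough, there has to be something there that isn't a c.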

Character classes take some practice as they have their own set of rules. But practice can be acquired, so don’t worry. Nobody is born knowing it all. Or much of anything.

WE WANT ALL AND WE WANT IT NOW

So, what if we want a metacharacter that stands for any character at all? Well, then we use the dot (.) metacharacter, which we learn in that book to be like a shortcut for an ‘anything goes’ character class. It usually snubs the newline, so it actually means anything but newline, and that can sometimes be overridden. But don’t worry about that yet. Inside a character class we can use a dash (-) to ask for a range, such as [1-9] or [a-z]. We can combine ranges and single characters, and it’s tidier to put the ranges first, like [0-9A-Za-z,]. If we want a literal dash, we can put it first in the brackets, as in [-0-9A-Za-z,], so it stands for itself and not for a range separator.
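A quick way to see the dot, the newline exception and the dash rules at work is the following sketch. It uses Python's re module as a stand-in for egrep, re.DOTALL is Python's own way of overriding the newline rule, and the test strings are invented.

```python
import re

# The dot stands for any single character...
dot_matches = re.findall(r"c.t", "cat cot c-t ct")

# ...except a newline, unless we ask for that explicitly.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# Ranges plus a literal comma: the dash in the text breaks the run.
no_dash = re.findall(r"[0-9A-Za-z,]+", "abc-123,x")

# A dash placed first stands for itself, so the whole text is one run.
with_dash = re.findall(r"[-0-9A-Za-z,]+", "abc-123,x")

print(dot_matches)          # ['cat', 'cot', 'c-t']
print(dot_default)          # None
print(dot_all is not None)  # True
print(no_dash)              # ['abc', '123,x']
print(with_dash)            # ['abc-123,x']
```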

I WANT SOMETHING, EITHER A LETTER OR A NUMBER, OR A SPACE, …

Regular expressions in general offer shortcuts like \d for any digit, \D for anything that is not a digit, \w for any ‘word character’ (letter, digit or underscore; EDITED, as I came to understand this subtlety only later on), \W for anything that is not a ‘word character’, \s for any space-like character (space, tab, newline, etc.), \S for anything but a space-like character, and so on. I’m not yet sure which of these are available with which tools. Perl and Perl-like tools (PHP, Python, etc.) offer a wide choice.
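Since I'm not sure which tools support which shortcuts, here is a sketch for one tool where I know they exist: Python's re module. The sample strings are made up and kept to plain ASCII.

```python
import re

# \d: any digit.
digits = re.findall(r"\d", "Room 42b")

# \D+: a run of anything that is not a digit.
non_digits = re.findall(r"\D+", "12ab34")

# \w+: a run of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "snake_case, 2 words")

# \s: any space-like character (space, tab, newline, ...).
spaces = re.findall(r"\s", "a b\tc\n")

print(digits)      # ['4', '2']
print(non_digits)  # ['ab']
print(words)       # ['snake_case', '2', 'words']
print(spaces)      # [' ', '\t', '\n']
```

Notice that the underscore stays inside snake_case while the comma splits things up, which is exactly the ‘word character’ subtlety.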

But we usually don’t want just one character, and we don’t want to put dozens of dots to say we want dozens of characters. So what do we do? We use quantifiers, to explain how much of something we DO want.

AT LEAST ONE, WHATEVER YOU WANT, OR SUPERSIZE IT

Basic quantifiers are the plus (+), the question mark (?) and the star or asterisk (*). Plus requires something to appear at least once, and lets it appear as many times as it wants. We are sure we want it to be there, but can’t exclude that it may appear multiple times in a row. The pattern has to appear at least once, though, so the match may fail. Star and question mark are “happy” with not finding the optional fragment at all, but the question mark doesn’t like it if something appears more than once, while the asterisk has an ‘anything goes’ attitude. We could say that the question mark is used to make something optional, and its logic is: this thing could appear once (like without any quantifiers), or not at all, since it’s optional. Star’s logic is: this thing may appear once, not at all, or any number of times it wants.

Summary:

+ wants 1->infinite

star wants 0->infinite

question mark wants 0->1

Draw this in a little table, if that helps. The book uses a table. Whatever works for you. Don’t think you must memorize it instantly. Memorizing takes work and repetition.

If we know exactly how many times something should be repeated, or almost, we can use {n} for exactly n times, {n,} for n or more, {,m} for at most m (where the tool supports this form), or {n,m} for at least n times but not more than m times.
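Instead of drawing the table, we can let the computer fill it in. This is a minimal sketch in Python's re module. The helper name fits and the list of candidates are mine, and re.fullmatch is used so that the whole string has to fit the pattern, not just a piece of it.

```python
import re

def fits(pattern, text):
    """Return True when the ENTIRE text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

candidates = ["", "a", "aa", "aaa", "aaaa"]

plus = [s for s in candidates if fits(r"a+", s)]           # 1 or more
star = [s for s in candidates if fits(r"a*", s)]           # 0 or more
question = [s for s in candidates if fits(r"a?", s)]       # 0 or 1
exactly_two = [s for s in candidates if fits(r"a{2}", s)]  # exactly 2
two_or_more = [s for s in candidates if fits(r"a{2,}", s)]
two_to_three = [s for s in candidates if fits(r"a{2,3}", s)]

print(plus)          # ['a', 'aa', 'aaa', 'aaaa']
print(star)          # ['', 'a', 'aa', 'aaa', 'aaaa']
print(question)      # ['', 'a']
print(exactly_two)   # ['aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa']
```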

We apply the quantifiers to characters, metacharacter-fueled combinations, or substrings enclosed in parentheses. They always apply to the smallest unit that comes before them. So d+ would match any string that contained at least one d in a certain place. And that’s also why the command:
egrep 'graduate (school)?'
would turn up both “his graduate work here.” and “started graduate school at 14.”, since the entire word “school” is completely optional. It can be there or not be there. It can’t however, appear more than once. For that, we would need to have used + or *.
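The “smallest unit” rule is easier to believe once you see it. In this sketch (Python's re module again, with two sample lines borrowed from the egrep output above plus one I made up) the same question mark behaves differently with and without the parentheses.

```python
import re

lines = [
    "his graduate work here.",
    "started graduate school at 14.",
    "an undergrad",
]

# With parentheses, the whole word "school" is optional.
found = [line for line in lines if re.search(r"graduate (school)?", line)]

# Without parentheses, ? applies only to the final l of "school".
only_l_optional = re.fullmatch(r"graduate school?", "graduate schoo") is not None

# With parentheses, it is all of "school" or nothing.
half_a_word = re.fullmatch(r"graduate (school)?", "graduate schoo") is not None
no_word = re.fullmatch(r"graduate (school)?", "graduate ") is not None

print(found)            # the first two lines
print(only_l_optional)  # True
print(half_a_word)      # False
print(no_word)          # True
```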

Parentheses are also used to save values for later use, but you don’t need to worry about that for now, while experimenting with egrep.

WORD BOUNDARIES

Some names are easy and rarely if ever end up as parts of more complex words. Some, like Mark, often end up mixed up in other words. In general, we will sometimes want to find a certain pattern only if it’s a standalone word, not if it’s mixed in the middle of another word. For that we use word boundaries. One of egrep’s ways of doing this is with the \< \> metasequences. The angle brackets < and > aren’t metacharacters on their own, but they become special when combined with a backslash: \< stands for the start of a word and \> for the end of a word. So, if we only look for ‘eat’, it also finds similar stuff such as ‘eating out’, but also some stuff that is related only by accident, such as ‘bite-sized treats’, and even less related stuff such as ‘great party’ or ‘beat generation’. If we only want eat on its own, we use
\<eat\>
If we want the appearances of word cat, and don’t want things like cats and concatenate, but do want cat’s, we use
\<cat\>
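Not every tool spells word boundaries the same way. Python's re module, for example, has no \< and \> but uses \b for both ends of a word, so take the sketch below as an illustration of the idea rather than of egrep's syntax. The phrases are the ones from the text plus a couple of my own.

```python
import re

phrases = ["eat", "eating out", "bite-sized treats", "great party", "let's eat now"]

# Without boundaries, "eat" is found inside other words too.
loose = [p for p in phrases if re.search(r"eat", p)]

# \b on both sides: only the standalone word counts.
strict = [p for p in phrases if re.search(r"\beat\b", p)]

# cat and cat's fit, cats and concatenate do not.
cats = re.findall(r"\bcat\b", "cat cats concatenate cat's")

print(len(loose))  # 5
print(strict)      # ['eat', "let's eat now"]
print(cats)        # ['cat', 'cat']
```

The apostrophe in cat's is not a word character, which is why a word boundary sits right after the t.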

ALTERNATIONS

Finally, we learn alternations. Sometimes we want to simply offer different alternatives, like Jeff and Jeffrey. Then we use (Jeff|Jeffrey), with the pipe character (|) separating the alternatives. Simple. Easy. Useful.
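Here is alternation at work, as a small sketch in Python's re module with a made-up list of names. re.fullmatch is used so that each whole name has to fit one of the alternatives.

```python
import re

names = ["Jeff", "Jeffrey", "Geoffrey", "Jess"]

# Either alternative is fine, as long as the whole name fits one of them.
jeffs = [n for n in names if re.fullmatch(r"Jeff|Jeffrey", n)]

# Parentheses limit how far the alternation reaches:
# (Geoff|Jeff) picks the beginning, and rey must follow in both cases.
reys = [n for n in names if re.fullmatch(r"(Geoff|Jeff)rey", n)]

print(jeffs)  # ['Jeff', 'Jeffrey']
print(reys)   # ['Jeffrey', 'Geoffrey']
```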

THE GREED ISSUE

Ah yes, the greed issue. It’s very complex, and later chapters of the book I’m reading explain it in detail. Basically, a greedy expression takes the most characters it can get, and a non-greedy one takes the minimum. This is not much of a big deal when retrieving whole lines, because that still works, but it matters when we want to extract a certain substring, like a URL, and want only the URL returned, not the rest of the line. If you don’t take care of that greed, you may still be fine with finding URLs where there is only one per line, like
This is a <a href="line.html">line</a> with one link.
In fact, even a greedy expression like 'a href="(.*)"' would still give you 'line.html'.
But if the line is something like
This is a <a href="line2.html">line</a> with two <a href="links.html">links</a>
then .* would give you
line2.html">line</a> with two <a href="links.html
instead of line2.html. If you use .*? everything runs smoothly.

Here’s a tiny Perl script that illustrates that:

# Two sample lines: one with a single link, one with two links.
my $str1 = 'This is a <a href="line.html">line</a> with one link.';
my $str2 = 'This is a <a href="line2.html">line</a> with two <a href="links.html">links</a>';
# For each line, try the greedy (.*) first and then the lazy (.*?) capture.
if ($str1 =~ m/a href="(.*)"/)  { print "$1\n"; }
if ($str1 =~ m/a href="(.*?)"/) { print "$1\n"; }
if ($str2 =~ m/a href="(.*)"/)  { print "$1\n"; }
if ($str2 =~ m/a href="(.*?)"/) { print "$1\n"; }

If you run it you get:

user@host:~/regexp$ perl greed.pl
line.html
line.html
line2.html">line</a> with two <a href="links.html
line2.html

If we want to fetch both URLs, we need to add a g modifier after the regular expression, and also save the results into an array.

if (my @arr = $str2 =~ m/a href="(.*?)"/g) { print "ARRAY:\n$arr[0]\n$arr[1]\n"; }

(It prints:

ARRAY:
line2.html
links.html)
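For comparison, the same experiment can be sketched in Python's re module (not what the post uses, just another Perl-like tool where I know these calls exist). The string is the two-link line from the Perl script, and re.findall plays the role of Perl's g modifier. The last pattern is an alternative I'm adding for illustration: a negated character class instead of the lazy dot-star.

```python
import re

str2 = 'This is a <a href="line2.html">line</a> with two <a href="links.html">links</a>'

# Greedy: .* runs to the LAST quote on the line.
greedy = re.search(r'a href="(.*)"', str2).group(1)

# Lazy: .*? stops at the FIRST quote it can.
lazy = re.search(r'a href="(.*?)"', str2).group(1)

# findall collects every match, like the g modifier plus an array in Perl.
all_urls = re.findall(r'a href="(.*?)"', str2)

# Alternative: "anything but a quote" cannot run past the closing quote.
all_urls_negated = re.findall(r'a href="([^"]*)"', str2)

print(greedy)            # line2.html">line</a> with two <a href="links.html
print(lazy)              # line2.html
print(all_urls)          # ['line2.html', 'links.html']
print(all_urls_negated)  # ['line2.html', 'links.html']
```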

This is all for the first regular expressions post. Thank you for your patience; I hope this was useful for you as well. It certainly was for me.

Any comments and corrections very welcome!


About apprenticecoder

My blog is about me learning to program, and trying to narrate it in interesting ways. I love to learn and to learn through creativity. For example I like computers, but even more I like to see what computers can do for people. That's why I find web programming and scripting especially exciting. I was born in Split, Croatia, went to college in Bologna, Italy and now live in Milan. I like reading, especially non-fiction (lately). I'd like to read more poetry. I find architecture inspiring. Museums as well. Some more than others. Interfaces. Lifestyle magazines with interesting points of view. Semantic web. Strolls in nature. The sea.