Tuesday 30 August 2016

RegEx MiniGrammar

I was working on some code the other day and needed to drop in a RegEx to match a particular pattern. Once again I had to Google the syntax and check some example web-wisdom to get what I needed quickly.

To save myself next time, I'm blogging some notes to self, cribbed from the easiest places, so I can refer to them here rather than search around. This might start to become a behaviour pattern in itself!

In many cases my usual regexes are just simple globbing helpers [glob patterns are sets of filenames expressed with wildcard characters, such as *.txt] to filter out particular files to process. More recently I've been increasingly using them to pattern match portions of text in URLs and other data sets.

Simple MiniGrammar reminder:

Alternatives (or)
Using the pipe character
e.g. gray|grey matches "grey" and "gray".

Grouping
Parts of the pattern match can be isolated with parentheses
e.g. gr(e|a)y matches "gray" and "grey".
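As a quick check of the two forms above, here is a minimal Python sketch using the standard re module. The word list and variable names are just my own illustration, and "groy" is included only as a word that should not match.

```python
import re

# Alternation: either whole spelling is accepted.
alternation = re.compile(r"gray|grey")
# Grouping: the choice is limited to the single letter inside the parentheses.
grouped = re.compile(r"gr(e|a)y")

words = ["gray", "grey", "groy"]
alternation_hits = [w for w in words if alternation.fullmatch(w)]
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# The parentheses also capture whichever letter matched.
captured = grouped.fullmatch("grey").group(1)

print(alternation_hits)
print(grouped_hits)
print(captured)
```

Both patterns accept the same two spellings; the grouped version additionally lets you read back which letter was used.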

Occurrences
A number of characters can be used to express how many times part of the pattern occurs in the match:

Optional (zero or one) is indicated with ?
e.g. colou?r matches "color" or "colour"

Zero or more is indicated with *
e.g. tx*t matches "tt", "txt" and "txxxxxxt".

One or more is indicated with +
e.g. go+al matches "goal", "goooal" and "goooooooal".

Exactly a certain number of times is given by {n}
e.g. gr{3} matches "grrr"

Bounded number of times is given by {min,max}
e.g. gr{1,3} matches "gr", "grr" and "grrr".
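To see the quantifiers side by side, here is a small Python sketch. The helper name matches and the candidate strings are my own; it uses re.fullmatch so the whole word has to fit the pattern. Note that + needs at least one occurrence, so "gal" does not match go+al.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional = matches(r"colou?r", ["color", "colour", "colouur"])   # ? is zero or one
zero_or_more = matches(r"tx*t", ["tt", "txt", "txxxxxxt"])       # * is zero or more
one_or_more = matches(r"go+al", ["gal", "goal", "goooal"])       # + is one or more
exactly = matches(r"gr{3}", ["grr", "grrr", "grrrr"])            # {n} is exactly n
bounded = matches(r"gr{1,3}", ["g", "gr", "grrr", "grrrr"])      # {min,max} is a range

for name, hits in [("optional", optional), ("zero_or_more", zero_or_more),
                   ("one_or_more", one_or_more), ("exactly", exactly),
                   ("bounded", bounded)]:
    print(name, hits)
```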

Metacharacters

Match a single character from a set using square brackets:
e.g. gr[ae]y matches "grey" or "gray"

Character ranges are expressed using the dash/minus sign:
e.g. [a-z] matches a single lower-case character from a to z, and [a-zA-Z] matches a single lower-case or capital letter.

Negated sets are expressed by putting a ^ as the first character inside the brackets. This matches any single character except those listed. Its most common use is matching everything apart from a space:
e.g. [^ ] matches any single character except a space (use [^ \t] to exclude tabs as well).

Start of string is matched by ^
End of string is matched by $
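A short Python sketch of the bracket sets and the two anchors, using sample strings I made up for the purpose. Note the two different jobs of ^: inside brackets it negates the set, outside it pins the match to the start of the string.

```python
import re

# A set: one character out of the listed ones.
set_hits = [w for w in ["gray", "grey", "groy"] if re.fullmatch(r"gr[ae]y", w)]

# A range: pick out the letters and skip the digits.
letters = re.findall(r"[a-zA-Z]", "a1B2c3")

# A negated set: runs of anything that is not a space.
non_spaces = re.findall(r"[^ ]+", "one two  three")

# Anchors: ^ pins to the start of the string, $ to the end.
starts = bool(re.search(r"^grey", "grey sky"))
ends = bool(re.search(r"sky$", "grey sky"))
not_at_start = bool(re.search(r"^sky", "grey sky"))

print(set_hits, letters, non_spaces, starts, ends, not_at_start)
```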



Examples
And some of the examples I find helpful with a bit of commentary:

match whitespace at the start or end of a string: ^[ \t]+|[ \t]+$
Explanation: ^ anchors the match to the start of the string, and the square brackets indicate that the character can be either a space or a tab (\t). This character can be repeated one or more times (the plus character). OR (the pipe character) it matches the same kind of whitespace at the end of the string, as indicated by $.
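Here is how that pattern might be used to trim a string in Python. This is only a sketch: the function name trim and the sample inputs are my own, and the re.MULTILINE variant is an extra I find handy, since it makes ^ and $ apply to every line rather than to the whole string.

```python
import re

# Leading spaces/tabs OR trailing spaces/tabs.
TRIM = re.compile(r"^[ \t]+|[ \t]+$")
# Same pattern, but ^ and $ apply to each line.
TRIM_LINES = re.compile(r"^[ \t]+|[ \t]+$", re.MULTILINE)

def trim(text):
    """Remove spaces and tabs from both ends of the string."""
    return TRIM.sub("", text)

trimmed = trim(" \t hello world \t ")
trimmed_lines = TRIM_LINES.sub("", "  a  \n  b  ")

print(repr(trimmed))
print(repr(trimmed_lines))
```

The space between "hello" and "world" is left alone because it touches neither the start nor the end of the string.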



