Regular expressions

Archived

This page has been archived and will receive no further updates.

useful sites

characters that must be escaped (the eleven “metacharacters” with special meanings):

[, \, ^, $, ., |, ?, *, +, (, )
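
A minimal Python sketch of why escaping matters, using the standard `re` module. The names `METACHARACTERS` and `match_literal` are made up for this illustration; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

# The eleven metacharacters from the list above.
METACHARACTERS = "[\\^$.|?*+()"

def match_literal(text: str, haystack: str) -> bool:
    """Return True if `text` occurs verbatim in `haystack`."""
    # re.escape backslash-escapes every character that could be special.
    return re.search(re.escape(text), haystack) is not None

# Unescaped, "1+1" means "one or more 1s, then another 1",
# so it does not find the literal text "1+1".
unescaped_hit = re.search("1+1", "1+1=2") is not None
escaped_hit = match_literal("1+1", "1+1=2")

print(unescaped_hit, escaped_hit)
```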

character matches

\d single character that is a digit
\n line feed (0x0A)
\r carriage return (0x0D)
\s whitespace character (includes tabs and line breaks)
\t tab character (ASCII 0x09)
\w “word character” (alphanumeric characters plus underscore)
. single character, except line break characters. It is short for [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors)
[] match a character class
- inside a character class to specify a range of characters
^ after the opening square bracket will negate the character class
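
The shorthands and character classes above can be tried out with a short Python sketch (standard `re` module; the sample string is invented for illustration). Note that Python's dot behaves like the UNIX flavor and excludes only \n.

```python
import re

sample = "id_7\tx\r\ny-9"

digits = re.findall(r"\d", sample)             # every single digit
words = re.findall(r"\w+", sample)             # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)             # tab, carriage return, line feed
hex_lower = re.findall(r"[0-9a-f]", sample)    # class built from two ranges
punctuation = re.findall(r"[^\w\s]", sample)   # negated class: not word, not space
dot_hits = re.findall(r".", "a\r\nb")          # dot skips \n but not \r in Python

print(digits, words, spaces, hex_lower, punctuation, dot_hits)
```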

*remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

*if your regular expression engine supports Unicode, use \uFFFF to insert a Unicode character. E.g. \u20AC matches the euro currency sign.
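
A small Python sketch of the two notes above, assuming Python 3, whose `re` module understands \uFFFF escapes in string patterns. The sample strings are invented for illustration.

```python
import re

# \u20AC in the pattern matches the euro sign in the text.
price = "Total: \u20ac25"
euro_amount = re.search(r"\u20AC(\d+)", price)

# \r?\n accepts both Windows (\r\n) and UNIX (\n) line terminators.
lines = re.split(r"\r?\n", "windows line\r\nunix line\nlast")

print(euro_amount.group(1), lines)
```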

anchors

(anchors do not match any characters. they match a position.)

^ matches at the start of the string
$ matches at the end of the string
\b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w (the start or end of the string counts as the latter).
\B matches at every position where \b cannot match.
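
A Python sketch of the anchors above (standard `re` module, invented sample text). The last line lists the positions where \b matches, to show that anchors consume no characters.

```python
import re

text = "cat concatenate cat"

at_start = re.search(r"^cat", text)           # only the first "cat"
at_end = re.search(r"cat$", text)             # only the last "cat"
whole_words = re.findall(r"\bcat\b", text)    # "cat" as a separate word
inside_words = re.findall(r"\Bcat\B", text)   # "cat" buried inside a word

# Every \b match is empty: it is a position, not a character.
boundaries = [m.start() for m in re.finditer(r"\b", "hi there")]

print(at_start.start(), at_end.start(), whole_words, inside_words, boundaries)
```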

other matches

| or
? makes the preceding token in the regular expression optional
E.g.: colou?r matches colour or color.
* attempt to match the preceding token zero or more times
+ attempt to match the preceding token once or more
E.g.: <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
{} specify amount of repetition
E.g.: Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
Use {3,} to match 3 or more repetitions
() create a group
\# backreference: match the same text that a previous group matched (use a backslash plus a number, starting at 1, to indicate which group, e.g. \1 for the first group)
(?i) case insensitive regex
(?=pattern) zero-width positive look-ahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab.
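
The tokens in this list can be checked with a Python sketch (standard `re` module). The patterns are the ones quoted above; the sample strings and variable names are invented for illustration.

```python
import re

# | alternation and ? optional token
pets = re.findall(r"cat|dog", "cat dog bird")
colours = re.findall(r"colou?r", "color colour colr")

# * versus +: the stricter pattern rejects the invalid tag <1>
strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
loose_tag = re.compile(r"<[A-Za-z0-9]+>")
strict_ok = strict_tag.fullmatch("<h1>") is not None
strict_bad = strict_tag.fullmatch("<1>") is not None
loose_bad = loose_tag.fullmatch("<1>") is not None

# {} repetition counts
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
three_plus = re.findall(r"\ba{3,}\b", "aa aaa aaaaa")

# () group and \1 backreference: find a doubled word
doubled = re.search(r"\b(\w+) \1\b", "it was the the best")

# (?i) case-insensitive matching
any_case = re.findall(r"(?i)regex", "Regex REGEX regex")

# (?=...) look-ahead: the tab is required but not part of the match
before_tab = re.findall(r"\w+(?=\t)", "key\tvalue")

print(pets, colours, four_digits, three_plus, doubled.group(1), any_case, before_tab)
```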

characters that must be escaped with a backslash in vim to get their special meaning (default “magic” mode):

+ ? { ( ) |

examples:

q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.
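
The same behavior in a Python sketch (standard `re` module). The third string is an added illustration: once any character other than x follows the q, even a space, the negated class has something to match.

```python
import re

pattern = re.compile(r"q[^x]")

in_question = pattern.search("question")       # matches "qu"
in_iraq = pattern.search("Iraq")               # no character after q: no match
in_sentence = pattern.search("Iraq is large")  # the space after q is matched

print(in_question.group(), in_iraq, repr(in_sentence.group()))
```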

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

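search for an email address (enable case-insensitive matching, since the pattern lists only uppercase letters; {2,4} also rejects top-level domains longer than four letters)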

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

verify that a whole string is a properly formatted email address (also needs case-insensitive matching)
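
A Python sketch of both email patterns, assuming case-insensitive matching is wanted (the patterns themselves list only A-Z). `find_emails` and `is_valid_email` are hypothetical helper names, and the addresses are made up. This is a rough format check, not full address validation.

```python
import re

# Same patterns as above; re.IGNORECASE lets [A-Z] match lowercase too.
EMAIL_SEARCH = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)
EMAIL_VERIFY = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

def find_emails(text: str) -> list[str]:
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_SEARCH.findall(text)

def is_valid_email(candidate: str) -> bool:
    """Return True if the whole string looks like an email address."""
    return EMAIL_VERIFY.search(candidate) is not None

found = find_emails("Contact alice@example.com or bob.smith@mail.example.org today.")
print(found)
print(is_valid_email("alice@example.com"), is_valid_email("alice@example"))
```

Because of {2,4}, an address such as user@example.museum is rejected by the verify pattern even though the domain is legitimate.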

:%s/^\s\+\(o\|*\|+\)\s//g

(vim) at the beginning of each line, remove one or more whitespace characters followed by one of o, *, +, and one more whitespace character (strips bullet formatting from text copied out of Google Docs)
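
For comparison, a rough Python equivalent of the vim substitution above. This is a sketch, not a drop-in replacement: `strip_bullets` is a hypothetical name, and [ \t] stands in for vim's \s, which matches only space and tab (Python's \s would also match line breaks).

```python
import re

# ^ with re.MULTILINE anchors at the start of every line, like vim's per-line search.
# * and + are escaped here because Python treats them as quantifiers.
BULLET_PREFIX = re.compile(r"^[ \t]+(?:o|\*|\+)[ \t]", re.MULTILINE)

def strip_bullets(text: str) -> str:
    """Remove an indented o, * or + bullet marker from the start of each line."""
    return BULLET_PREFIX.sub("", text)

pasted = "  o first\n\t* second\n   + third\nplain"
print(strip_bullets(pasted))
```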