Regular expressions

Archived

This page has been archived and will receive no further updates.

characters that must be escaped:

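[, \, ^, $, ., |, ?, *, +, (, )

These eleven characters have special meanings and are often called “metacharacters”. To match one of them literally, put a backslash in front of it, e.g. `\.` matches a dot.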
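The effect of escaping can be sketched with Python's `re` module. This is an illustration added to these notes, not part of the original page; the sample string `literal` is made up.

```python
import re

# Hypothetical input: a literal string that happens to contain metacharacters.
literal = "1+1=2 (really?)"

# Used unescaped as a pattern, "+" and "?" act as quantifiers and "(...)"
# as a group, so the pattern does not match its own literal text.
unescaped_match = re.search(literal, literal)

# re.escape() puts a backslash in front of every metacharacter for us.
escaped_pattern = re.escape(literal)
escaped_match = re.search(escaped_pattern, literal)

# Escaping a single character by hand works the same way.
dots = re.findall(r"\.", "a.b.c")

print(unescaped_match)        # None
print(escaped_match.group())  # the full literal string
print(dots)
```

`re.escape()` is mostly useful when the text to search for comes from user input and should never be interpreted as a pattern.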

character matches

`\d`	single character that is a digit
`\n`	line feed (0x0A)
`\r`	carriage return (0x0D)
`\s`	whitespace character (includes tabs and line breaks)
`\t`	tab character (ASCII 0x09)
`\w`	“word character” (alphanumeric characters plus underscore)
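`.`	any single character except line break characters. It is short for `[^\n]` (UNIX regex flavors) or `[^\r\n]` (Windows regex flavors)
`[]`	character class: matches any one of the characters between the brackets, e.g. `gr[ae]y` matches gray or grey
`-`	inside a character class, specifies a range of characters, e.g. `[0-9]` matches a single digit
`^`	directly after the opening square bracket, negates the character class, e.g. `[^0-9]` matches any character that is not a digit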

*Remember that Windows text files use `\r\n` to terminate lines, while UNIX text files use `\n`.

*If your regular expression engine supports Unicode, use `\uFFFF` (a backslash, `u`, and four hexadecimal digits) to insert a Unicode character. E.g. `\u20AC` matches the euro currency sign.
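A short sketch of the character matches above, using Python's `re` module. The string `sample` is invented for the example. Python is only one flavor: its `.` excludes `\n` only, and other engines may differ.

```python
import re

# Made-up sample containing a tab, the euro sign and a Windows line ending.
sample = "Room 42:\tprice \u20ac9.50\r\n"

digits = re.findall(r"\d", sample)          # every single digit
words = re.findall(r"\w+", sample)          # runs of word characters
whitespace = re.findall(r"\s", sample)      # space, tab, CR and LF
vowels = re.findall(r"[aeiou]", sample)     # character class
others = re.findall(r"[^\w\s]", sample)     # negated character class
euro = re.search(r"\u20AC", sample)         # Unicode escape inside the pattern

# "." stops at the line feed, so each line is matched separately.
dot_lines = re.findall(r".+", "one\ntwo")

print(digits, words, vowels, others, dot_lines)
```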

anchors

(Anchors do not match any characters; they match a position.)

`^`	matches at the start of the string
`$`	matches at the end of the string
`\b`	matches at a word boundary. A word boundary is a position between a character that can be matched by `\w` and a character that cannot be matched by `\w` (the start and end of the string count as positions next to a non-word character).
`\B`	matches at every position where `\b` cannot match.
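Since anchors match positions rather than characters, it can help to look at where they match. The following is a small sketch with Python's `re` module; the sample text is made up.

```python
import re

text = "cat concatenate cat"

starts_with_cat = re.search(r"^cat", text)   # only at the very start
ends_with_cat = re.search(r"cat$", text)     # only at the very end

# Start offsets of "cat" as a whole word, and of "cat" buried inside a word.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", text)]

# Replacing the zero-width \b match with "|" makes the boundaries visible.
marked = re.sub(r"\b", "|", "hi you")

print(starts_with_cat.start(), ends_with_cat.start())
print(whole_words, inside_word)
print(marked)
```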

other matches

`|`	alternation (or): matches either the expression on its left or the one on its right
`?`	makes the preceding token in the regular expression optional. E.g.: `colou?r` matches colour or color.
`*`	attempt to match the preceding token zero or more times
`+`	attempt to match the preceding token once or more. E.g.: `<[A-Za-z][A-Za-z0-9]*>` matches an HTML tag without any attributes. `<[A-Za-z0-9]+>` is easier to write but matches invalid tags such as `<1>`.
`{}`	specify the amount of repetition. E.g.: use `\b[1-9][0-9]{3}\b` to match a number between 1000 and 9999. `\b[1-9][0-9]{2,4}\b` matches a number between 100 and 99999. Use `{3,}` to match 3 or more repetitions.
`()`	create a group
`\#`	backreference: matches the same text that was matched by an earlier group (use a backslash plus a number, starting at 1, to indicate which group, e.g. `\1` for the first group)
`(?i)`	case insensitive regex
`(?=pattern)`	zero-width positive look-ahead assertion. For example, `/\w+(?=\t)/` matches a word followed by a tab, without including the tab.
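The tokens in this table can be tried out with a few one-liners. This is a sketch using Python's `re` module with invented sample strings; other flavors may spell some of these differently.

```python
import re

# ? makes the "u" optional.
optional = re.findall(r"colou?r", "color colour colouur")

# * and +: the strict pattern rejects <1>, the loose one accepts it.
tag_strict = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", "<b> <h1> <1>")
tag_loose = re.findall(r"<[A-Za-z0-9]+>", "<b> <h1> <1>")

# {3}: exactly three more digits after the first, so 1000 to 9999.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")

# | picks one of two alternatives.
either = re.findall(r"cat|dog", "cat dog bird")

# () captures a word and \1 requires the same word again.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")

# (?i) at the start switches on case-insensitive matching.
ignore_case = re.findall(r"(?i)regex", "Regex REGEX regex")

# (?=\t) checks for a following tab without including it in the match.
before_tab = re.findall(r"\w+(?=\t)", "key\tvalue other\tmore")

print(optional, tag_strict, tag_loose, four_digit)
print(either, doubled.group(1), ignore_case, before_tab)
```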

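characters that need a backslash in vim to act as metacharacters (with vim's default settings they match literally when left unescaped):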

+ ( ) |

examples:

q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.
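The same check, sketched in Python's `re` module. The third case is an addition to the notes: a negated character class also accepts a line break, so a `q` at the end of a line can still match.

```python
import re

negated = re.compile(r"q[^x]")

in_question = negated.search("question")   # "q" followed by "u"
in_iraq = negated.search("Iraq")           # nothing after the "q"
in_iraq_newline = negated.search("Iraq\n") # the line feed is "not x"

print(in_question.group())
print(in_iraq)
print(repr(in_iraq_newline.group()))
```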

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

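search for an email address inside longer text (the pattern only lists uppercase letters, so use it with case-insensitive matching)

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

verify that a whole string is a properly formatted email address (again with case-insensitive matching)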
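A sketch of how the two email patterns differ in use, written with Python's `re` module. The addresses are made up, and the names `find_email` and `valid_email` are only for this example. `re.IGNORECASE` is passed because the patterns list uppercase letters only.

```python
import re

# Shared core of both patterns from the notes above.
EMAIL = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# \b ... \b finds addresses inside longer text.
find_email = re.compile(r"\b" + EMAIL + r"\b", re.IGNORECASE)
# ^ ... $ requires the whole string to be an address.
valid_email = re.compile(r"^" + EMAIL + r"$", re.IGNORECASE)

found = find_email.findall(
    "Write to jane.doe@example.com or bob@example.org today."
)
is_valid = valid_email.search("jane.doe@example.com") is not None
has_extra = valid_email.search("mail jane.doe@example.com now") is not None
# {2,4} limits the last part to four letters, so longer endings are rejected.
long_tld = valid_email.search("jane@example.museum") is not None

print(found)
print(is_valid, has_extra, long_tld)
```

The last check shows a limitation of the pattern rather than of the address: anything ending in more than four letters fails validation.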

:%s/^\s\+\(o\|*\|+\)\s//g

(vim) at the beginning of each line, remove leading whitespace followed by one of o, *, + and one more whitespace character (for stripping list formatting from text copied out of Google Docs)
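For comparison, a rough Python equivalent of the vim substitution, as a sketch only. The text in `pasted` is invented, and `[ \t]` is used because vim's `\s` matches only space and tab, while Python's `\s` would also match line breaks.

```python
import re

# Made-up text as it might look after pasting a bulleted list.
pasted = "  o first item\n\t* second item\n   + third item\nplain line\n"

# Leading spaces/tabs, then one of o, * or +, then one more space or tab.
# re.MULTILINE makes ^ match at the start of every line, like :%s in vim.
bullet_prefix = re.compile(r"^[ \t]+(?:o|\*|\+)[ \t]", re.MULTILINE)

cleaned = bullet_prefix.sub("", pasted)
print(cleaned)
```

Note that Python needs `\*` and `\+` escaped and leaves `(`, `)` and `|` bare, which is the reverse of the vim pattern.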
