A large part of writing Cucumber Glue Code is ensuring you have the perfect regular expression for the Step Definition. So we’re taking a break from our usual direct Cucumber posts, and dedicating this post to the setup for next month’s post.
A regular expression, or regex for short, is a pattern that describes a piece of text. A simple example is the regex cucumber, which matches the literal text “cucumber”.
The most basic regular expression consists of a single literal character, e.g. a. It will match the first occurrence of that character in the string. If the string is Jack is a boy, it will match the a after the J.
There are 11 characters with special meanings:
- the opening square bracket [
- the backslash \
- the caret ^
- the dollar sign $
- the period or dot .
- the vertical bar or pipe symbol |
- the question mark ?
- the asterisk or star *
- the plus sign +
- the opening round bracket (
- the closing round bracket )
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
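To make the escaping rule concrete, here is a small sketch. It uses Python's re module purely as an illustration (an assumption on my part, since your step definitions will use the regex flavor of whatever language your glue code is written in), and the variable names are made up for the example.

```python
import re

# Unescaped, the plus is a quantifier: "one or more 1s, then 1=2".
unescaped = re.compile(r"1+1=2")
# Escaped, the plus is just a literal plus sign.
escaped = re.compile(r"1\+1=2")

text = "1+1=2"
unescaped_hit = unescaped.search(text)  # None: there is no run of 1s followed by "1=2"
escaped_hit = escaped.search(text)      # matches the whole string

# re.escape adds the backslashes for you when the literal comes from elsewhere.
auto_escaped = re.compile(re.escape(text))
auto_hit = auto_escaped.fullmatch(text)
```

Note that the unescaped pattern still matches strings such as 111=2, which is exactly the kind of surprise the backslash prevents.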
A “character class” matches only one out of several characters. To match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character: gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can also use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fA-FX] matches a hexadecimal digit or the letter X.
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. q[^x] matches qu in question. It does NOT match Iraq since there is no character after the q for the negated character class to match.
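The character class examples above can be tried out with a short sketch like the following. Python's re module is assumed here only for demonstration, and the sample strings are invented.

```python
import re

gray = re.compile(r"gr[ae]y")           # a or e, exactly once
hex_digit = re.compile(r"[0-9a-fA-F]")  # two-plus ranges in one class
q_not_x = re.compile(r"q[^x]")          # q followed by any character except x

gray_hits = gray.findall("gray grey graay")  # ['gray', 'grey']
hex_hits = hex_digit.findall("0x1F")         # ['0', '1', 'F']
question_hit = q_not_x.search("question")    # matches 'qu'
iraq_hit = q_not_x.search("Iraq")            # None: nothing follows the q
```

The last line shows why a negated class is not the same as "no character": it still has to consume one character.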
There are also several shorthand character classes.
\d matches a single character that is a digit.
\w matches a “word character” (alphanumeric characters plus underscore).
\s matches a whitespace character (includes tabs and line breaks). The actual characters matched by the shorthands depend on the software you’re using. Usually, non-English letters and numbers are included.
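Here is a rough illustration of the shorthands, again assuming Python's re module. The step text is invented, and the last two lines show how one engine handles non-English letters; other flavors may behave differently.

```python
import re

step = "I have 12 cukes_in my\tbelly"

digits = re.findall(r"\d", step)      # ['1', '2']
words = re.findall(r"\w+", step)      # ['I', 'have', '12', 'cukes_in', 'my', 'belly']
whitespace = re.findall(r"\s", step)  # four spaces and one tab

# In Python 3, \w is Unicode-aware by default; re.ASCII restricts it to [A-Za-z0-9_].
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)
```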
The dot matches (almost) any character: it matches a single character, except line break characters. It is short for [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).
Most regex engines have a “dot matches all” or “single line” mode that makes the dot match any single character, including line breaks.
gr.y matches gray, grey, gr%y, etc. Use the dot sparingly. Often, a character class or negated character class is faster and more precise.
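A quick sketch of the dot and the “dot matches all” mode, assuming Python's re module, where that mode is spelled re.DOTALL. The sample text is made up.

```python
import re

text = "gray grey gr%y gr\ny"

# By default the dot refuses to match the line break in the last word.
default_hits = re.findall(r"gr.y", text)

# With DOTALL the dot matches any single character, line breaks included.
dotall_hits = re.findall(r"gr.y", text, flags=re.DOTALL)
```

Here default_hits holds three matches and dotall_hits holds four, the extra one being the word split across a line break.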
Alternation is the regular expression equivalent of “or”.
cat|dog will match cat in About cats and dogs. If the regex is applied again, it will match dog. You can add as many alternatives as you want, e.g. cat|dog|mouse|fish.
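“Applying the regex again” can be sketched as follows, assuming Python's re module. Resuming the search from the end of the previous match is one way to do it; findall does the same thing in a single call.

```python
import re

animals = re.compile(r"cat|dog")
text = "About cats and dogs"

first = animals.search(text)                # matches 'cat'
second = animals.search(text, first.end())  # resumes after 'cat', matches 'dog'
all_hits = animals.findall(text)            # ['cat', 'dog']

# More alternatives are just more pipes.
more_hits = re.findall(r"cat|dog|mouse|fish", "a fish, a mouse and a cat")
```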
Regular Expressions also support repetition. The question mark makes the preceding token in the regular expression optional.
colou?r matches colour or color. The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
Use curly braces to specify a specific amount of repetition. Use [1-9][0-9]{3} to match a number between 1000 and 9999. [1-9][0-9]{2,4} matches a number between 100 and 99999.
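The quantifiers above can be checked with a sketch along these lines, assuming Python's re module. I use fullmatch so that the whole input has to fit the pattern; with a plain search, the number patterns would also match inside longer digit strings.

```python
import re

colour = re.compile(r"colou?r")                # the u is optional
tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")    # a letter, then zero or more letters/digits
loose_tag = re.compile(r"<[A-Za-z0-9]+>")      # one or more letters/digits
four_digits = re.compile(r"[1-9][0-9]{3}")     # 1000 to 9999
three_to_five = re.compile(r"[1-9][0-9]{2,4}") # 100 to 99999

colour_hits = colour.findall("color colour")   # ['color', 'colour']
strict_rejects = tag.fullmatch("<1>") is None            # True
loose_accepts = loose_tag.fullmatch("<1>") is not None   # True
in_range = [n for n in ["99", "100", "99999", "100000"] if three_to_five.fullmatch(n)]
```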
The repetition operators or quantifiers are greedy. They will expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. The regex <.+> will match <EM>first</EM> in This is a <EM>first</EM> test. Place a question mark after the quantifier to make it lazy. <.+?> will match <EM> in the above string. A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.
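The difference between greedy, lazy and negated-class matching is easiest to see side by side. This is a minimal sketch assuming Python's re module, run against the sentence used above.

```python
import re

text = "This is a <EM>first</EM> test"

greedy = re.search(r"<.+>", text).group()       # '<EM>first</EM>'
lazy = re.search(r"<.+?>", text).group()        # '<EM>'
negated = re.search(r"<[^<>]+>", text).group()  # '<EM>'

# The negated class finds each tag separately without any backtracking tricks.
all_tags = re.findall(r"<[^<>]+>", text)        # ['<EM>', '</EM>']
```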
Grouping is another important aspect in Regular Expressions. Place round brackets around multiple tokens to group them together. You can then apply a quantifier to the group.
Set(Value)? matches Set or SetValue. Round brackets create a capturing group. The above example has one group. After the match, group number one will contain nothing if Set was matched or Value if SetValue was matched. Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is important if you don’t plan to use the group’s contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.
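Here is a short sketch of capturing versus non-capturing groups, assuming Python's re module. In this flavor an unmatched optional group is reported as None; other engines may report an empty string instead.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

with_value = capturing.fullmatch("SetValue").group(1)  # 'Value'
without_value = capturing.fullmatch("Set").group(1)    # None: the group did not take part

# Both patterns match the same strings; only the number of groups differs.
capturing_count = capturing.groups          # 1
non_capturing_count = non_capturing.groups  # 0
```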
Use anchors to ensure that your expression matches from the start of the string to its end. The regular expression
I'm logged in
matches I’m logged in and I’m logged in as an admin. To avoid ambiguous matches, use
^I'm logged in$
The caret at the beginning anchors to the beginning of the string. The dollar at the end does the same with the end of the string. Use these with all your step definitions and you won’t have surprise matches.
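The effect of the anchors on the two step texts above can be sketched like this, assuming Python's re module. The list of steps and the variable names are illustrative only.

```python
import re

unanchored = re.compile(r"I'm logged in")
anchored = re.compile(r"^I'm logged in$")

steps = ["I'm logged in", "I'm logged in as an admin"]

# Without anchors, the pattern is found inside the longer step too.
loose_matches = [s for s in steps if unanchored.search(s)]

# With anchors, only the exact step text matches.
strict_matches = [s for s in steps if anchored.search(s)]
```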
Matching specific words is fine. But you often want flexibility to match a variety of strings. Here are some common patterns for non-exact matches.
- .* matches anything (or nothing): any character except a newline, zero or more times
- .+ matches at least one of something
- [0-9]* or \d* matches a series of digits (or nothing)
- [0-9]+ or \d+ matches one or more digits
- "[^"]*" matches anything (or nothing) in double quotes
- an? matches a or an
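Several of the patterns in this list can be combined into a single step-style expression. The sketch below assumes Python's re module, and the step texts and pattern names are hypothetical; it only shows how the captured pieces come out of the match.

```python
import re

# Anchored pattern using \d+, an optional plural s, and "[^"]*" for a quoted value.
basket_step = re.compile(r'^I have (\d+) cucumbers? in my "([^"]*)"$')

plural = basket_step.search('I have 5 cucumbers in my "basket"').groups()
singular_empty = basket_step.search('I have 1 cucumber in my ""').groups()

# an? handles both articles; .+ captures at least one character.
eat_step = re.compile(r"^I eat an? (.+)$")

apple = eat_step.search("I eat an apple").group(1)
pear = eat_step.search("I eat a pear").group(1)
```

Each pair of round brackets yields one captured value, in order, which is how a pattern like this hands the number and the quoted text on to the code that handles the step.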