Regular Expressions

See also the Python Standard Library Documentation

My Repository on GitHub

Metacharacters

Twelve Metacharacters:

  • Backslash \
  • Caret ^
  • Dollar $
  • Dot .
  • Pipe |
  • Question mark ?
  • Asterisk *
  • Plus +
  • Opening and closing parentheses ( and )
  • Opening square bracket [
  • Opening curly brace {
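
Because these characters have a special meaning, they need a backslash when they should match literally. A minimal sketch of that idea, using re.escape from the standard library; the sample string is made up for illustration:

```python
import re

# A literal string that happens to contain several metacharacters.
literal = "1+1=2 (really?)"

# Used unescaped, "+" acts as a quantifier and "(...)" as a group,
# so the pattern does not match its own text.
unescaped = re.search(literal, literal)

# re.escape() puts a backslash in front of the special characters.
escaped = re.search(re.escape(literal), literal)

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (really?)
```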

Character Classes

  • Also known as character sets
  • Grouped in [ and ]
  • Example: Hex number [0-9a-fA-F]
  • Invert a character set with ^ as its first character, e.g. [^0-9a-fA-F]
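
The hex number example can be tried out as follows. This is only a sketch; the variable names and sample strings are assumptions chosen for illustration:

```python
import re

# One or more hexadecimal digits.
hex_number = re.compile(r"[0-9a-fA-F]+")

print(hex_number.fullmatch("1aF3") is not None)  # True
print(hex_number.fullmatch("1aG3") is not None)  # False, "G" is not in the set

# A leading ^ inverts the set: any character that is NOT a hex digit.
non_hex = re.compile(r"[^0-9a-fA-F]")

print(non_hex.findall("0x1F, 0xBEEF"))  # ['x', ',', ' ', 'x']
```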

Predefined Character Classes

Available in Python

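Element   Description
.         Any character except newline \n
\d        Decimal digit, equivalent to [0-9]
\D        Any non-digit character, equivalent to [^0-9]
\s        Any whitespace character, equivalent to [ \t\n\r\f\v]
\S        Any non-whitespace character, equivalent to [^ \t\n\r\f\v]
\w        Any alphanumeric character or underscore, equivalent to [a-zA-Z0-9_]
\W        Any character that is not alphanumeric or underscore, equivalent to [^a-zA-Z0-9_]

For str patterns in Python 3, \d, \s, and \w also match Unicode digits, whitespace, and word characters; the bracket equivalents hold exactly under the re.ASCII flag.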
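
A short sketch of the predefined classes in use. The sample text is invented, and the last two lines illustrate how the re.ASCII flag restricts \d to [0-9] for str patterns:

```python
import re

text = "Order 42 shipped\ton 2016-10-07"

digits = re.findall(r"\d+", text)  # runs of decimal digits
words = re.findall(r"\w+", text)   # runs of word characters
spaces = re.findall(r"\s", text)   # single whitespace characters

print(digits)  # ['42', '2016', '10', '07']
print(words)   # ['Order', '42', 'shipped', 'on', '2016', '10', '07']
print(spaces)  # [' ', ' ', '\t', ' ']

# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
unicode_digit = re.findall(r"\d", "\u0663")
ascii_digit = re.findall(r"\d", "\u0663", re.ASCII)

print(len(unicode_digit), len(ascii_digit))  # 1 0
```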

POSIX Character Classes

  • POSIX defines named classes such as [:alpha:] and [:digit:] (see Wikibooks)
  • Not available in Python's re module

Alternation

  • Alternation (or) is marked with |
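
A small sketch of alternation with made-up sample strings. It also shows that the alternatives are tried from left to right, so the first one that matches wins:

```python
import re

# Without a group, | splits the whole pattern: "gray" OR "grey".
colour = re.compile(r"gray|grey")
print(colour.findall("gray and grey"))  # ['gray', 'grey']

# Inside a (non-capturing) group, | only applies to the grouped part.
colour_short = re.compile(r"gr(?:a|e)y")
print(colour_short.findall("gray and grey"))  # ['gray', 'grey']

# Alternatives are tried left to right: "a" wins although "ab" is longer.
first_wins = re.search(r"a|ab", "ab").group()
print(first_wins)  # a
```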

Quantifiers

Quantifiers can be applied to characters, character sets, and to groups

  • Optional (0 or 1 repetition): ?
  • Zero (0) or more times: *
  • One (1) or more times: +
  • Exact repetition and ranges: {}
    • Exactly n times: {n}
    • Between n and m times (both inclusive): {n,m}
    • At least n times: {n,}
    • At most n times: {,n}
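
The quantifiers above can be checked one by one. The helper function matches is a hypothetical name introduced only for this sketch; it uses re.fullmatch so that the whole string has to match:

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"colou?r", "color"))    # True: ? allows 0 or 1 "u"
print(matches(r"ab*", "a"))            # True: * allows 0 "b"
print(matches(r"ab+", "a"))            # False: + needs at least 1 "b"
print(matches(r"\d{4}", "2016"))       # True: exactly 4 digits
print(matches(r"\d{2,4}", "2"))        # False: fewer than 2 digits
print(matches(r"\d{2,}", "20161007"))  # True: at least 2 digits
print(matches(r"\d{,2}", ""))          # True: {,n} includes 0 repetitions
print(matches(r"(?:ab){2}", "abab"))   # True: quantifier applied to a group
```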

Greedy and Reluctant Quantifiers

  • Greedy quantifiers try to match as much as possible
    • Default behavior
    • Biggest possible result
  • Reluctant (non-greedy, lazy) quantifiers try to find the smallest possible match
    • Add an extra ? to the quantifier: ??, *?, +?, and {n,m}?
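
The difference is easiest to see on a string with several possible end points. A minimal sketch with an invented HTML snippet:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the LAST ">" in the string.
greedy = re.findall(r"<.+>", html)

# Reluctant: .+? stops at the FIRST ">" it can reach.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```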

Boundary Matchers

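Identifier   Matches
^            Beginning of the input; with re.MULTILINE, also the beginning of each line
$            End of the input (or just before a trailing newline); with re.MULTILINE, also the end of each line
\b           At a word boundary
\B           Anywhere that is not a word boundary (opposite of \b)
\A           Beginning of the input
\Z           End of the input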
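
A sketch of the boundary matchers on an invented two-line string. It also illustrates that, in Python, ^ only matches at line starts when re.MULTILINE is set, and that $ tolerates a trailing newline while \Z does not:

```python
import re

text = "cat concat cats\ncat"

whole_words = re.findall(r"\bcat\b", text)   # "cat" as a complete word
inside_word = re.findall(r"\Bcat\b", text)   # "cat" at the end of "concat"

print(whole_words)  # ['cat', 'cat']
print(inside_word)  # ['cat']

# ^ matches only at the start of the input unless re.MULTILINE is set.
start_only = re.findall(r"^cat", text)
each_line = re.findall(r"^cat", text, re.MULTILINE)
input_start = re.findall(r"\Acat", text, re.MULTILINE)

print(len(start_only), len(each_line), len(input_start))  # 1 2 1

# $ also matches just before a trailing newline, \Z does not.
dollar = re.search(r"cat$", "cat\n")
upper_z = re.search(r"cat\Z", "cat\n")

print(dollar is not None, upper_z is not None)  # True False
```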

Python Regex Functions

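  • Compiled pattern objects (re.Pattern in Python 3, RegexObject in older documentation), created with re.compile
  • Wrapper functions in the re module, such as re.match and re.search, that take the pattern as a string
  • match tries to match only at the beginning of the string
    • Passing pos and slicing the string can give different results: with pos, ^ still anchors at the real start of the string
  • search is like match in most other languages (e.g. Perl)
    • Tries to match at any position in the string
  • Compilation flags such as re.IGNORECASE, re.MULTILINE, and re.DOTALL modify matching behavior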
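
The points above can be sketched as follows. The sample text and variable names are assumptions for illustration; the calls themselves are from the standard re module:

```python
import re

text = "hello world"

# match only looks at the beginning, search scans the whole string.
at_start = re.match(r"world", text)
anywhere = re.search(r"world", text)

print(at_start)         # None
print(anywhere.span())  # (6, 11)

# pos versus slicing: with pos, ^ still refers to the real start of the string.
anchored = re.compile(r"^world")
with_pos = anchored.match(text, 6)
with_slice = anchored.match(text[6:])

print(with_pos)            # None
print(with_slice.group())  # world

# A compilation flag changes how the pattern is applied.
ignore_case = re.search(r"WORLD", text, re.IGNORECASE)
print(ignore_case.group())  # world
```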

Grouping

  • Subexpressions are grouped within ( and )

Used for different purposes:

  • Creating subexpressions for applying quantifiers
  • Limiting scope of an alternation
  • Extracting parts of the matched pattern (capturing)
  • Using captured parts again in the regex
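
A brief sketch covering the four purposes listed above, with invented sample strings:

```python
import re

# 1. Quantifier applied to a subexpression.
repeated = re.fullmatch(r"(ab)+", "ababab")
print(repeated is not None)  # True

# 2. Limiting the scope of an alternation to the version number.
version = re.fullmatch(r"Python (2|3)", "Python 3")
print(version.group(1))  # 3

# 3. Capturing: extract year, month, and day.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Created 2016-10-07")
print(date.groups())  # ('2016', '10', '07')

# 4. Backreference: \1 repeats whatever group 1 captured (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is
```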

Look Around

  • Add subpatterns that are not in the result (not consuming characters)
    • positive: subpattern needs to match
    • negative: subpattern must not match
  • Also called zero-width assertions
  • The Python re module allows look behind only with fixed-width subpatterns

              Positive      Negative
Look ahead    (?=regex)     (?!regex)
Look behind   (?<=regex)    (?<!regex)
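
A sketch of all four assertions on made-up price strings. The \b anchors in the negative variants are an assumption of this example; they keep the engine from matching only part of a number. The last lines illustrate the fixed-width restriction for look behind:

```python
import re

amount_first = "100 USD, 200 EUR, 300 USD"
currency_first = "USD 100, EUR 200, USD 300"

# Look ahead: check what FOLLOWS the number without consuming it.
ahead_pos = re.findall(r"\d+(?= USD)", amount_first)
ahead_neg = re.findall(r"\b\d+\b(?! USD)", amount_first)

# Look behind: check what PRECEDES the number without consuming it.
behind_pos = re.findall(r"(?<=USD )\d+", currency_first)
behind_neg = re.findall(r"(?<!USD )\b\d+", currency_first)

print(ahead_pos)   # ['100', '300']
print(ahead_neg)   # ['200']
print(behind_pos)  # ['100', '300']
print(behind_neg)  # ['200']

# A variable-width look behind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=USD\s+)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)  # False
```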

Benchmarking

General for Python:

import cProfile
# Pass the call (with parentheses) as a string so the function is actually executed
cProfile.run("myFunction()")
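
For profiling a single regex-heavy function, a slightly more controlled variant could look like the sketch below. The function count_words and its input are hypothetical; cProfile.Profile, runcall, and pstats.Stats are from the standard library:

```python
import cProfile
import io
import pstats
import re

def count_words(text):
    """Hypothetical function under test: count word-like tokens."""
    return len(re.findall(r"\w+", text))

sample = "lorem ipsum dolor sit amet " * 10_000

# Profile exactly one call and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(count_words, sample)

# Collect the report in a string instead of printing it directly.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)

print(result)  # 50000
print(stream.getvalue())
```

The timings in the report depend on the machine, so only the word count is shown as expected output.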


  • Category: Programming
  • Tags: Computer Science
  • Created: 7 October 2016
  • Modified: 10 April 2022