Working with RegEx, we have to deal with 3 data types:
- Strings with RegEx patterns
  - Those are normal Python strings, except that we should use the raw formatting (check out the section Formatting)
- Compiled RegEx patterns (re.RegexObject, called re.Pattern in recent Python 3)
  - If we have just a few patterns, we don't have to care about compilation. They will be created on the fly if they are needed and don't exist yet.
- Match objects (re.MatchObject, called re.Match in recent Python 3)
  - Those are the results of pattern matches. From them we can get the matched string, the groups, and their positions.
Compiled patterns, when reused, should save execution time. If we don't create them, the pattern string is compiled on the fly anyway. I've tested RegEx performance for some time, and the documentation seems to describe it most accurately: if we have just a few patterns and use them only occasionally, we don't need to compile them, because the re module keeps a cache of recently used patterns. Of course, compiling explicitly rather than relying on the cache is good practice if we match millions of times.
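A minimal sketch of the two styles described above: compiling once and reusing the pattern object, versus passing the pattern string to the module-level function each time. The sample list `lines` and the variable names are made up for illustration.

```python
import re

# Hypothetical sample data, for illustration only.
lines = ["error 42", "ok", "error 7"]

# Style 1: compile once, then reuse the compiled pattern object.
number = re.compile(r"\d+")
compiled_hits = [m.group() for m in map(number.search, lines) if m]

# Style 2: pass the pattern string every time; the re module compiles it
# on the first call and serves it from its internal cache afterwards.
cached_hits = [m.group() for m in (re.search(r"\d+", s) for s in lines) if m]

print(compiled_hits)  # ['42', '7']
print(cached_hits)    # ['42', '7']
```

Both styles give the same results; the difference is only in how much per-call lookup work is done, which starts to matter in tight loops.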
A match object always evaluates to True when we put it in an if statement. On the other hand, if there is no match, we don't get a match object at all, just None (which is False in a condition). That way we can always check whether the match was successful. Other useful features of match objects are the start() and end() methods. They return the positions where a match begins and ends. They also work for a specified group.
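A short sketch of the truthiness check and the position methods, assuming a made-up sample string. The variable names are illustrative.

```python
import re

text = "say hello world"  # hypothetical sample text

m = re.search(r"(hel)(lo)", text)
if m:
    # A successful match is truthy, so this branch runs.
    whole = (m.start(), m.end())      # positions of the whole match
    second = (m.start(2), m.end(2))   # positions of group 2 only
    print(whole)   # (4, 9)
    print(second)  # (7, 9)

miss = re.search(r"xyz", text)
if not miss:
    # No match: search() returned None, which is falsy.
    print(miss)  # None
```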
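The search() method (or module function) returns only the first (leftmost) match if a pattern matches more than once. The same is true for match(). BTW, I've never found match() useful. Adding ^ at the beginning of a pattern and calling search() gives the same result (as long as re.MULTILINE is not set).
findall() and finditer() return only non-overlapping matches. What is surprising is that findall() returns only the groups if they exist, not the whole match. E.g. with the pattern .(o), finditer() yields matches spanning two characters (any character followed by o), while findall() returns just the captured o from each of them.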
If there are no groups, findall() returns the whole match. One more thing: findall() returns only matched strings, not match objects.
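The behaviour described above can be tried with a small sketch like the following. The sample string is invented for illustration; the pattern `.(o)` is the one from the text.

```python
import re

text = "foo boo"  # hypothetical sample text

# search() finds the leftmost match anywhere; match() only at the start.
found = re.search(r"boo", text)          # match object at position 4
at_start = re.match(r"boo", text)        # None: text does not start with "boo"
anchored = re.search(r"^boo", text)      # None as well: ^ anchors to the start

# finditer() yields match objects for non-overlapping matches.
spans = [(m.group(), m.span()) for m in re.finditer(r".(o)", text)]
print(spans)  # [('fo', (0, 2)), ('bo', (4, 6))]

# findall() with a group returns only the group, as plain strings.
groups_only = re.findall(r".(o)", text)
print(groups_only)  # ['o', 'o']

# findall() without a group returns the whole match.
whole_matches = re.findall(r".o", text)
print(whole_matches)  # ['fo', 'bo']
```

Note that the `oo` at positions 1-3 is never reported: its first character was already consumed by the match `fo`, which is what non-overlapping means here.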
sub() takes just one RegEx pattern parameter. The other one, the substitution parameter, is a normal string, except that \1, \2, etc. expand to the values of the groups. Named groups referenced as \g<name> also work. Instead of a substitution string we can pass a function and construct the value dynamically.
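A brief sketch of the three substitution styles mentioned above: numbered group references, named group references, and a function. The input strings and the helper name `double` are made up for illustration.

```python
import re

# Numbered groups: \1 and \2 expand to the text captured by each group.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")
print(swapped)  # 20-10

# Named groups: \g<name> refers to a group defined with (?P<name>...).
renamed = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>", "hello world")
print(renamed)  # world hello

# Function: called with each match object, must return the replacement string.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples, 10 pears")
print(doubled)  # 6 apples, 20 pears
```

The substitution strings are written as raw strings so that the backslashes reach the re module unchanged.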
Before Python 3.7, split() will not work if the pattern has only betweeners; it has to match at least one character. Newer versions also split on empty matches, but to split lines it is still better to use \n rather than $. If we use parentheses in a split pattern, the text matched by those groups is returned along with the split text. When the separator matches at the end of the string, an empty string is also returned at the end. As with findall(), matching is done in a non-overlapping fashion from left to right.
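The split() behaviour above can be illustrated with a small sketch. The input strings are invented for the example.

```python
import re

# Splitting lines on the newline character itself.
lines = re.split(r"\n", "a\nb\nc")
print(lines)  # ['a', 'b', 'c']

# Parentheses in the pattern: the separators are returned as well.
with_separators = re.split(r"(,)", "a,b,c")
print(with_separators)  # ['a', ',', 'b', ',', 'c']

# Separator at the very end: an empty string closes the list.
trailing = re.split(r",", "a,b,")
print(trailing)  # ['a', 'b', '']
```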