Traps and surprises
The trickiest part is how $ works when the re.MULTILINE flag is off. It should match only at the end of the string, but it also matches before \n if that \n is the last character (I marks each position where $ matches):
\nMango is yellow.\nBanana is rip.I\nI
One might think this is needed when a file ends with a new line and we match against the content of that file. But when the string ends with several \n, only the last one is treated this way and the others are ignored:
\nMango is yellow.\nBanana is rip.\n\n\nI\nI
Without new lines at the end everything looks fine:
\nMango is yellow.\nBanana is rip.I
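The three cases above can be checked with a small sketch. The helper name dollar_positions is hypothetical and only serves this illustration; it lists every index at which $ matches when re.MULTILINE is off.

```python
import re

# "$" compiled without re.MULTILINE.
END = re.compile(r"$")


def dollar_positions(text):
    """Return the index of every position where "$" matches in text."""
    return [m.start() for m in END.finditer(text)]


base = "\nMango is yellow.\nBanana is rip."
n = len(base)

# No trailing new line: "$" matches only at the very end of the string.
no_newline = dollar_positions(base)

# One trailing new line: "$" matches before the final \n and at the very end.
one_newline = dollar_positions(base + "\n")

# Three trailing new lines: still only two positions, both around the last \n.
three_newlines = dollar_positions(base + "\n\n\n")

print(n, no_newline, one_newline, three_newlines)
```

If only the true end of the string should match, \Z can be used in place of $.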
A pattern like a* can match nothing at all. Most of the time that is not what the user expects.
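A minimal sketch of that surprise, using an arbitrary sample string chosen for this illustration: because a* is satisfied by zero characters, it succeeds at index 0 with an empty match instead of finding the first real "a".

```python
import re

text = "banana"

# match() succeeds even though "banana" does not start with "a":
# the match is simply empty.
matched_text = re.match(r"a*", text).group()

# search() stops at index 0 with an empty match
# rather than moving on to the "a" at index 1.
star_span = re.search(r"a*", text).span()

# Requiring at least one "a" gives the result most people expect.
plus_span = re.search(r"a+", text).span()

print(repr(matched_text), star_span, plus_span)
```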
A pattern like (a*)* will raise a Nothing to repeat error. But a slight modification will be fine:
I wrote about this in the Objects section, but let me say it one more time: findall() will not return the matches, only the groups. If there are no groups in the pattern, it returns the whole matches.
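A short sketch of the three cases, with a made-up sample string: the same text is scanned with two groups, one group, and no groups.

```python
import re

text = "10-20 30-40"

# Two groups: findall() returns a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+)-(\d+)", text)

# One group: findall() returns only that group's text, not the whole match.
one_group = re.findall(r"(\d+)-\d+", text)

# No groups: findall() returns the whole matches.
no_groups = re.findall(r"\d+-\d+", text)

# finditer() gives access to the whole match regardless of groups.
whole_matches = [m.group(0) for m in re.finditer(r"(\d+)-(\d+)", text)]

print(two_groups, one_group, no_groups, whole_matches)
```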
- Divide and conquer is a strategy that also works with RegEx. It is always easier to write two RegEx patterns than one complicated one.
A well-known misbehavior of the sub() function shows up when some group was not matched and we want to use it in the replacement string. In that case RegEx throws the unmatched group exception instead of giving us an empty string. The pattern (apple)|peach will raise an exception on the following string if the first group (referenced as \1) is used in the replacement:
peach
Surprisingly, everything is fine when nothing at all is matched by the pattern.
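The sketch below tries the three situations. The helper replace_with_group and the "[\1]" replacement string are assumptions made for this illustration. Note that the outcome depends on the Python version: the exception described above is what older interpreters do, while Python 3.5 and later substitute an empty string for an unmatched group, so the helper accepts both outcomes.

```python
import re


def replace_with_group(text):
    """Replace each match with its first group in brackets.

    Returns None if the re module raises an error (the unmatched
    group exception on older Python versions).
    """
    try:
        return re.sub(r"(apple)|peach", r"[\1]", text)
    except re.error:
        return None


# Group 1 took part in the match: the replacement works everywhere.
apple_result = replace_with_group("apple")

# Group 1 did not take part in the match: an error on older versions,
# an empty substitution on Python 3.5 and later.
peach_result = replace_with_group("peach")

# Nothing matched at all: the string comes back unchanged.
plum_result = replace_with_group("plum")

print(apple_result, peach_result, plum_result)
```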
RegexObject methods have more parameters than the module-level functions. The pos and endpos parameters allow searching for a pattern in a slice of the string, but this is not equivalent to searching a Python slice. The betweener ^ matches the beginning of the real string, not of the slice. With the pattern ^. we will match the first letter of
Longan
if pos is zero (or not specified). But for pos=1 or higher, we will have no match at all:
Longan
The betweener $ behaves differently: it always matches the end of the slice. The pattern .$ will match the letter one position before the end if endpos is 1 less than the string length:
Longan
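A small sketch of the asymmetry between ^ and $, using compiled patterns and the same sample word. The variable names are only illustrative.

```python
import re

text = "Longan"
caret = re.compile(r"^.")
dollar = re.compile(r".$")

# pos omitted (zero): "^" matches at the real start of the string.
first_letter = caret.search(text).group()

# pos=1: "^" still refers to the real start, so nothing matches.
with_pos = caret.search(text, 1)

# A genuine Python slice is a new string with its own start.
on_slice = caret.search(text[1:]).group()

# endpos one less than the length: "$" matches at the end of the slice.
before_last = dollar.search(text, 0, len(text) - 1).group()

# Without endpos, "$" matches at the end of the whole string.
last_letter = dollar.search(text).group()

print(first_letter, with_pos, on_slice, before_last, last_letter)
```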
In the Parsing principles section we were investigating the ^ and $ behavior. I would like to propose a change to RegEx syntax. First, eliminate the re.MULTILINE and re.DOTALL flags. Then make ^ and $ always match per line (as when re.MULTILINE is on). We have \A and \Z for the non-MULTILINE behavior. Do the same with the . (dot) sign: make it always match \n. And introduce some other character (e.g. , (comma)) to match everything except a new line, or just leave that to [^\n].
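The proposed behavior can be approximated today by always passing both flags. This is only a sketch of the idea with the existing re module, not a change to the syntax; the constant name PROPOSED and the sample text are assumptions made for this illustration.

```python
import re

# Hypothetical constant: both flags always on, as the proposal would have it.
PROPOSED = re.MULTILINE | re.DOTALL

text = "Mango\nBanana"

# ^ and $ work per line.
per_line = re.findall(r"^\w+$", text, PROPOSED)

# \A and \Z still anchor to the whole string.
string_start = re.findall(r"\A\w+", text, PROPOSED)
string_end = re.findall(r"\w+\Z", text, PROPOSED)

# The dot matches \n as well.
across_lines = re.search(r"o.B", text, PROPOSED).group()

# [^\n] stands in for "everything except a new line".
no_newline = re.findall(r"[^\n]+", text, PROPOSED)

print(per_line, string_start, string_end, repr(across_lines), no_newline)
```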
Submitted by Tomasz Wyderka on Tue, 03/11/2014 - 02:41