Traps and surprises

  • The behavior of $ when the re.MULTILINE flag is off is very tricky. It should match only at the end of the string, but it also matches just before a \n that is the last character. In \nMango is yellow.\nBanana is rip.\n it matches twice: before the final \n and at the very end. One might think this is needed when a file ends with a newline and we use the content of that file for matching. But when there are more trailing \n characters, the extra ones get no such treatment: in \nMango is yellow.\nBanana is rip.\n\n\n\n it still matches only before the last \n and at the very end. Without newlines at the end everything looks fine: in \nMango is yellow.\nBanana is rip. it matches only at the end of the string.
  • A pattern like a* can match nothing at all (the empty string). Most of the time that is not what the user expects.
  • A pattern like (a*)* raises the "nothing to repeat" error, but a slight modification works fine: (a+)*
  • I wrote about this in the Objects section, but let me say it one more time: findall() does not return match objects, only the contents of the groups. If the pattern has no groups, it returns the whole matches.
  • Divide and conquer is a strategy that also works with RegEx. It is usually easier to write two simple RegEx patterns than one complicated pattern.
  • A well-known misbehavior of the sub() function occurs when a group was not matched and we refer to it in the replacement string. In that case RegEx throws the unmatched group exception instead of giving us a None value. For example, the (apple)|peach pattern raises an exception on the string peach if the first group (referenced as \1) is used in the replacement. Surprisingly, everything is fine when the pattern matches nothing at all.
  • The search(), match(), findall() and finditer() methods of RegexObject take more parameters than the module-level functions. The pos and endpos parameters allow searching for the pattern in part of a string, but this is not equivalent to searching a Python slice. The betweener ^ matches the beginning of the real string, not of the slice. For the string Longan, the pattern ^. matches the first letter if pos is zero (or not specified), but for pos=1 or higher there is no match at all. The betweener $ behaves differently and always matches the end of the slice: the pattern .$ matches the last letter but one when endpos is 1 less than the string length (here endpos=5).
  • In the Parsing principles section we investigated the behavior of ^ and $. I would like to propose a change to the RegEx syntax. First, eliminate the re.MULTILINE and re.DOTALL flags. Then make ^ and $ always match per line (as when re.MULTILINE is on); we already have \A and \Z for the non-MULTILINE behavior. Do the same with the . (dot) sign: make it always match \n, and introduce some other character (e.g. , (period)) to match everything except a new line, or just leave that to [^\n].
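
The $ trap from the first item can be checked with a short script. This is only a sketch: the helper name match_positions and the variable names are made up for illustration, and the sample strings are the ones used above. It lists the positions where $ and \Z produce a (zero-width) match.

```python
import re

def match_positions(pattern, text, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# Sample strings from the list above (names are illustrative).
one_newline = "\nMango is yellow.\nBanana is rip.\n"
many_newlines = "\nMango is yellow.\nBanana is rip.\n\n\n\n"
no_newline = "\nMango is yellow.\nBanana is rip."

for sample in (one_newline, many_newlines, no_newline):
    print("length:", len(sample))
    # Without MULTILINE: end of string, plus just before a final \n.
    print("  $            ", match_positions(r"$", sample))
    # With MULTILINE: before every \n, plus end of string.
    print("  $ (MULTILINE)", match_positions(r"$", sample, re.MULTILINE))
    # \Z: only the very end of the string, whatever the flags.
    print("  \\Z           ", match_positions(r"\Z", sample))
```

If the extra match before the final \n is not wanted, \Z looks like the safer choice, since it matches only at the very end of the string.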
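
The empty match of a* and the findall() behavior are easy to see side by side. A minimal sketch, with a sample string and variable names invented for this illustration:

```python
import re

# a* is satisfied by zero characters, so it "matches" at position 0
# of a string that does not even start with the letter a.
empty = re.search(r"a*", "banana")
print(repr(empty.group()))  # an empty string, not "a"

text = "apple=1, peach=22"  # made-up sample data

# No groups in the pattern: findall() returns the whole matches.
no_groups = re.findall(r"\w+=\d+", text)

# One group: findall() returns only the content of that group.
one_group = re.findall(r"\w+=(\d+)", text)

# Several groups: findall() returns a tuple of groups per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

# To get the whole matches from a pattern with groups, use finditer().
whole_matches = [m.group(0) for m in re.finditer(r"(\w+)=(\d+)", text)]

print(no_groups)
print(one_group)
print(two_groups)
print(whole_matches)
```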
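
The sub() trap with the (apple)|peach pattern can be tried out as below. Note that the result depends on the Python version: the exception described above is raised by older versions, while the documentation for Python 3.5 and later says that unmatched groups are replaced with an empty string. The function names here are hypothetical, and the callable replacement is just one possible workaround, not the only one.

```python
import re

PATTERN = r"(apple)|peach"

def replace_with_group_one(text):
    """Replace each match with group 1, reporting an error as text."""
    try:
        return re.sub(PATTERN, r"\1", text)
    except re.error as exc:
        # Reached only on versions that reject an unmatched group.
        return "error: %s" % exc

def replace_safely(text):
    """Workaround: a callable replacement can handle the None itself."""
    return re.sub(PATTERN, lambda m: m.group(1) or "", text)

print(repr(replace_with_group_one("apple")))   # group 1 took part in the match
print(repr(replace_with_group_one("peach")))   # group 1 did not: version dependent
print(repr(replace_with_group_one("longan")))  # no match, text is left unchanged
print(repr(replace_safely("peach")))
```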
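
The difference between pos/endpos and a real Python slice, using the Longan example from above, can be sketched like this (the variable names are illustrative only):

```python
import re

word = "Longan"
first_char = re.compile(r"^.")
last_char = re.compile(r".$")

# pos=0 (the default): ^ matches at the real beginning of the string.
at_start = first_char.search(word)

# pos=1: ^ still refers to the real beginning, so nothing matches.
with_pos = first_char.search(word, pos=1)

# A real slice is a new string, so ^ matches at its first character.
with_slice = first_char.search(word[1:])

# endpos=5: $ matches at the end of the searched region.
with_endpos = last_char.search(word, endpos=5)

# For $, the real slice behaves the same way as endpos.
with_end_slice = last_char.search(word[:5])

print(at_start.group())        # L
print(with_pos)                # None
print(with_slice.group())      # o
print(with_endpos.group())     # a
print(with_end_slice.group())  # a
```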