RE parsing principles

RegEx patterns operate on two types of atoms:

characters, like ASCII characters, e.g. a, B, -, \n
betweeners, the places between other characters, e.g. \b matches between a word character and a non-word character such as a space

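It is very important to know that betweeners do not consume any characters; they only restrict where a match may occur. When the surrounding characters already guarantee the position, a betweener changes nothing: the pattern yellow\b \band and the pattern yellow and match exactly the same strings.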
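The claim above can be checked with a minimal sketch. The sample strings and the helper span_or_none are hypothetical and chosen only for illustration; the two patterns are the ones from the text.

```python
import re

# Hypothetical sample strings, chosen only for this illustration.
samples = [
    "Mango is yellow and ripe.",
    "yellow  and",  # two spaces, so neither pattern should match
    "yellowand",    # no space, so neither pattern should match
]

with_boundaries = re.compile(r"yellow\b \band")
plain = re.compile(r"yellow and")


def span_or_none(pattern, s):
    """Return the (start, end) span of the first match, or None."""
    m = pattern.search(s)
    return m.span() if m else None


# Pair up the result of both patterns for every sample string.
results = [(span_or_none(with_boundaries, s), span_or_none(plain, s))
           for s in samples]

for s, (a, b) in zip(samples, results):
    print(repr(s), a, b)

# A betweener on its own produces a zero-width match: start == end.
empty = re.search(r"\b", "yellow")
print(empty.span())
```

Both patterns report the same span (or no match) for every sample, and the lone \b match has zero width because it marks a position and consumes nothing.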

The ^ and $ are betweeners, not characters. How they work can be a little mysterious. In particular, $ does not match \n (the newline character). "Say what?" you ask :)! When the re.MULTILINE flag is off, ^ matches at the beginning of a string and $ at the end of it. That means $ very often matches right after a \n character, when that \n is the last character in the string, which is very common at the end of text files.

Let's take this string:

\nMango is yellow.\nBanana is rip.\n

The pattern ^ matches at the beginning (the I marks the match position):

I\nMango is yellow.\nBanana is rip.\n

and $ at the very end:

\nMango is yellow.\nBanana is rip.\nI

But that's a lie. The $ matches two times in the previous example. I made it look that way only for teaching purposes... Sorry. Please go to the Traps section for further explanation. For now, let's assume it matches just once, at the very end.
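For readers who want to check the positions themselves, here is a small sketch using re.finditer on the same string. The variable names are only illustrative. It also shows the two $ positions admitted to above; the rest of this section keeps the simplified "once, at the very end" picture.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"

# Collect the index of every zero-width match, with no flags set.
caret_positions = [m.start() for m in re.finditer(r"^", text)]
dollar_positions = [m.start() for m in re.finditer(r"$", text)]

print(len(text))         # 33
print(caret_positions)   # [0]
print(dollar_positions)  # [32, 33]
```

Index 33 is the very end of the string, and index 32 is the position just before the final \n.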

When re.MULTILINE is on, ^ additionally matches after every \n, and $ additionally matches before every \n. So for the last string, the ^ pattern matches 4 times (each I marks a match position):

I\nIMango is yellow.\nIBanana is rip.\nI

and the $ pattern also 4 times:

I\nMango is yellow.I\nBanana is rip.I\nI
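The four positions of each betweener can be listed with a short sketch like the one below. It is an illustration with made-up variable names.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"

# With re.MULTILINE, ^ also matches right after each \n
# and $ also matches right before each \n.
multi_caret = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
multi_dollar = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

print(multi_caret)   # [0, 1, 18, 33]
print(multi_dollar)  # [0, 17, 32, 33]
```

The \n characters sit at indexes 0, 17 and 32, so every ^ position after the first is one past a newline, and every $ position except the last is the index of a newline.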

One more important note: RegEx always looks for a match in the whole string. The re.MULTILINE flag doesn't make matching stop at the end of a line. It just changes the meaning of the ^ and $ betweeners.
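As a sketch of this point, the hypothetical pattern below is written for illustration only. It spans two lines of the sample string even though re.MULTILINE is on, because \n is consumed like any other character.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"

# ^ and $ only mark positions; the \n between them is an ordinary
# character, so the match continues onto the next line.
pattern = r"^Mango is yellow\.$\n^Banana"
m = re.search(pattern, text, re.MULTILINE)

print(m.span())         # (1, 24)
print(repr(m.group()))  # 'Mango is yellow.\nBanana'
```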

Another flag, re.DOTALL, changes the meaning of the . (dot) special character. By default the dot matches every character except \n. That can be misleading for beginners, because it suggests that RegEx works on each line separately. When re.DOTALL is on, the dot also matches the \n character, so it matches absolutely everything in a string.
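The difference is easy to see with the pattern .+ on the sample string used earlier. This is a small illustrative sketch, not part of the original examples.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"

# Default: the dot stops at every \n, so each line is a separate match.
default_matches = re.findall(r".+", text)

# re.DOTALL: the dot also matches \n, so one match covers the whole string.
dotall_matches = re.findall(r".+", text, re.DOTALL)

print(default_matches)  # ['Mango is yellow.', 'Banana is rip.']
print(dotall_matches == [text])  # True
```

The per-line result in the default mode comes only from the dot refusing to match \n; the search itself still runs over the whole string.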

As a misleading example, let's consider the previous text again. The ^.*$ pattern will not match anything if the re.DOTALL and re.MULTILINE flags are both off, even though it looks like it should match the whole string no matter what!

\nMango is yellow.\nBanana is rip.\n

The reason: ^ matches only at the very beginning, the dot cannot get past the first \n, and $ does not match at that position.
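The following sketch runs ^.*$ against the sample string with each flag combination. The variable names are illustrative only.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"
pattern = r"^.*$"

# No flags: no match at all.
no_flags = re.search(pattern, text)

# re.DOTALL only: the dot crosses every \n, so the whole string matches.
dotall_only = re.search(pattern, text, re.DOTALL)

# re.MULTILINE only: each line matches separately, including the empty
# "lines" before the first \n and after the last \n.
multiline_only = re.findall(pattern, text, re.MULTILINE)

print(no_flags)                     # None
print(dotall_only.group() == text)  # True
print(multiline_only)  # ['', 'Mango is yellow.', 'Banana is rip.', '']
```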

The \A betweener matches only at the beginning of a string. It is not equivalent to ^, because the re.MULTILINE flag doesn't modify its meaning: it always matches only at the beginning of the string, never at the beginnings of lines. The same applies to the \Z betweener compared with $: it always matches only at the very end of the string.
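To illustrate, the sketch below compares the match positions of \A and \Z with those of ^ and $ while re.MULTILINE is on. The helper positions is a hypothetical name introduced for this example.

```python
import re

text = "\nMango is yellow.\nBanana is rip.\n"


def positions(pattern, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]


# \A and \Z ignore re.MULTILINE: one position each.
anchor_start = positions(r"\A", re.MULTILINE)
anchor_end = positions(r"\Z", re.MULTILINE)

# ^ and $ gain extra positions around every \n.
caret = positions(r"^", re.MULTILINE)
dollar = positions(r"$", re.MULTILINE)

print(anchor_start, anchor_end)  # [0] [33]
print(caret, dollar)             # [0, 1, 18, 33] [0, 17, 32, 33]
```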