Pattern Search and Replace in Text Files with Regular Expressions and Python
I recently discovered a service provider for encrypted calendar, address book and email with servers in Germany. It ticked all the boxes, but I soon found out that its import and export functions are faulty. This provided an excellent opportunity to brush up on my skills in creating command line tools and using regular expressions (regex) in Python to perform pattern matching and replacement in text files. My preferred packages for this task are “argparse”, “pathlib” and “re”. The result is tutaexport, a post-processor that fixes the formatting of vCards in vCalendar files (.vcf files).
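To illustrate how these three packages fit together, here is a minimal sketch of a command line post-processor. It is not the actual tutaexport code: the function names `fix_text` and `main` and the options `--pattern` and `--replacement` are made up for this example, and the pattern to apply is left to the caller.

```python
import argparse
import re
from pathlib import Path


def fix_text(text: str, pattern: str, replacement: str) -> str:
    """Return text with every match of pattern replaced by replacement."""
    return re.sub(pattern, replacement, text)


def main(argv=None) -> int:
    # Hypothetical command line interface, for illustration only.
    parser = argparse.ArgumentParser(
        description="Replace a regex pattern in a text file (illustrative sketch)."
    )
    parser.add_argument("infile", type=Path, help="file to read")
    parser.add_argument("outfile", type=Path, help="file to write")
    parser.add_argument("--pattern", required=True, help="regular expression to search for")
    parser.add_argument("--replacement", required=True, help="replacement string")
    args = parser.parse_args(argv)

    # pathlib handles reading and writing; re does the actual work.
    text = args.infile.read_text(encoding="utf-8")
    args.outfile.write_text(
        fix_text(text, args.pattern, args.replacement), encoding="utf-8"
    )
    return 0
```

To use the sketch as a script, call `main()` under the usual `if __name__ == "__main__":` guard. Passing `argv` explicitly, as in `main(["in.txt", "out.txt", "--pattern", r"\bfoo\b", "--replacement", "baz"])`, makes the function easy to try out from an interactive session.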
A common strategy is to use regular expressions in combination with raw strings (syntax in Python: r"my string", in C++: R"(my string)"). The most important elements of the regex syntax compatible with Python’s package “re” are summarized in the following table:
regex | meaning |
---|---|
`.` | Wildcard. Matches any character except a line break |
`\n` | Line break |
`\b` | Word boundary |
`\` | Escapes the following token |
`?` | The preceding token or group is present 0 or 1 times |
`*` | The preceding token or group is present 0 or more times |
`+` | The preceding token or group is present 1 or more times |
`{m}` | Where `m` is an integer: the preceding token or group is present exactly `m` times |
`\|` | `A\|B`, where `A` and `B` can be arbitrary regexes: matches `A` or, if that is unsuccessful, matches `B` |
`[...]` | Character class with a set of tokens. Matches any one of the tokens in the character class |
`[^...]` | Character class with a negated set of tokens. Matches any token that is not in the set |
`(?:...)` | Non-capturing group (can be nested) |
`(...)` | Without `?` following the `(`: capture group (can be nested) that can be referenced later by a sequential 1-based index to reproduce its content |
`\m` | Where `m` is an integer: index for referencing the respective capture group and reproducing its content. In some other programming languages, the syntax is `$m` |
`(?P<name>...)` | Named capture group (can be nested) that can be referenced by name or by a sequential 1-based index |
`(?=...)` | Lookahead assertion. Matches only if `...` follows, without consuming it. E.g. for `s='abxyabcdxy'` the statement `print(re.sub(r'ab(?=cd)', r'gh', s))` will result in output `abxyghcdxy` |
`(?!...)` | Negative lookahead assertion. Matches only if `...` does not follow. E.g. for `s='abxyabcdxy'` the statement `print(re.sub(r'ab(?!xy)', r'gh', s))` will result in output `abxyghcdxy` |
`(?<=...)` | Lookbehind assertion. Matches only if `...` precedes the current position. E.g. for `s='abxyabcdxy'` the statement `print(re.sub(r'(?<=cd)xy', r'gh', s))` will result in output `abxyabcdgh` |
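
The following short sketch exercises several entries of the table with `re.sub`. The sample strings and variable names are chosen only for illustration. Note that assertions are written without a `P`: `(?=...)`, `(?!...)` and `(?<=...)`, and that a lookbehind has to be placed before the text it guards.

```python
import re

# Raw strings keep backslashes literal, so regex escapes stay readable.
assert len("\n") == 1   # one line break character
assert len(r"\n") == 2  # a backslash followed by the letter n

# Word boundaries: replace "cat" only where it is a whole word.
whole_word = re.sub(r"\bcat\b", "dog", "cat concat cat")

# Numbered capture groups, referenced as \1 and \2 in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# Named capture groups, referenced as \g<name> in the replacement.
iso_date = re.sub(
    r"(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    "24.12.2022",
)

s = "abxyabcdxy"
# Lookahead: replace "ab" only where "cd" follows.
lookahead = re.sub(r"ab(?=cd)", "gh", s)
# Negative lookahead: replace "ab" only where "xy" does not follow.
negative_lookahead = re.sub(r"ab(?!xy)", "gh", s)
# Lookbehind: replace "xy" only where "cd" precedes it.
lookbehind = re.sub(r"(?<=cd)xy", "gh", s)

print(whole_word)          # dog concat dog
print(swapped)             # world hello
print(iso_date)            # 2022-12-24
print(lookahead)           # abxyghcdxy
print(negative_lookahead)  # abxyghcdxy
print(lookbehind)          # abxyabcdgh
```

Because the assertions do not consume any characters, the guarding text (`cd` or `xy`) is left untouched and only the matched part is replaced.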
Copyright 2022 DerAndere