Regular Expression Engine research
Would it be possible to replace the regular expression engine with a faster one? I guess you use Python's "re" module, but there are faster engines.
Find/replace on large files is rather slow and can crash. For example:
take a file of 1,000,000 rows that contains empty lines and replace "^\n" with "" (without the quotes) to remove the empty lines. This takes a long time or crashes.
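To make this easier to reproduce outside the editor, here is a minimal sketch of the same replace with Python's "re". It assumes the editor runs the pattern in multi-line mode, so that "^" matches at the start of every line; the sample data and the function names are made up for this illustration.

```python
import re
import time

# Compiled once; MULTILINE makes ^ match at the start of every line.
EMPTY_LINE = re.compile(r"^\n", re.MULTILINE)

def make_sample(rows):
    """Build `rows` lines where every second line is empty (hypothetical data)."""
    return "".join("\n" if i % 2 else f"row {i}\n" for i in range(rows))

def strip_empty_lines(text):
    """Remove every line that consists of a newline only."""
    return EMPTY_LINE.sub("", text)

def timed_strip(rows):
    """Return the cleaned text and the seconds spent in the substitution."""
    text = make_sample(rows)
    start = time.perf_counter()
    result = strip_empty_lines(text)
    return result, time.perf_counter() - start
```

Calling timed_strip(1_000_000) gives a baseline for the bare engine. If that number is small, the slowdown may come from somewhere other than the regex engine itself.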
This is likely because "re" uses backtracking, which can be very slow. There are also regular expression engines based on finite state automata (FSA). See this page for more information: http://swtch.com/~rsc/regexp/regexp1.html
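The automaton idea can be sketched in a few lines: instead of trying one path and backtracking, the matcher keeps the set of all positions in the pattern that could be active after each character. The sketch below only supports literal characters that are either required or optional, which is enough for the a?a?...a?aa...a pattern discussed in the linked article. The (char, optional) representation and all names are invented for this example and are not taken from any of the engines.

```python
import re

def closure(states, items):
    """Add every state reachable by skipping optional items."""
    result = set(states)
    stack = list(states)
    while stack:
        i = stack.pop()
        if i < len(items) and items[i][1] and i + 1 not in result:
            result.add(i + 1)
            stack.append(i + 1)
    return result

def nfa_fullmatch(items, text):
    """items is a list of (char, optional) pairs; True if the whole text matches."""
    current = closure({0}, items)
    for ch in text:
        # Advance every active state that can consume this character.
        advanced = {i + 1 for i in current
                    if i < len(items) and items[i][0] == ch}
        current = closure(advanced, items)
        if not current:
            return False
    return len(items) in current

def pathological(n):
    """a? repeated n times, followed by a repeated n times."""
    return [("a", True)] * n + [("a", False)] * n

# Cross-check against "re" on a small case, where backtracking is still cheap.
n = 5
pattern = "a?" * n + "a" * n
for length in range(2 * n + 2):
    text = "a" * length
    assert nfa_fullmatch(pathological(n), text) == bool(re.fullmatch(pattern, text))
```

The work per input character is bounded by the number of pattern positions, so nfa_fullmatch(pathological(30), "a" * 30) returns quickly, while the equivalent "re" call has to explore a very large number of paths.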
Possible options could be (but not limited to):
- RE2 - http://code.google.com/p/re2/ (with python bindings: https://github.com/axiak/pyre2)
- PIRE - https://github.com/dprokoptsev/pire
- Irregexp (regular expression engine of V8) - http://blog.quenta.org/2009/02/irregexp.html (but no Python bindings yet, as far as I know)
- http://blog.errstr.com/2013/01/22/implementing-a-more-powerful-regex/
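For the RE2 option, the switch could be as small as an import fallback. This is only a sketch: it assumes the Python binding installs a module named "re2" that offers a sub function compatible with "re", which should be checked against the binding actually used. The names regex_engine, ENGINE_NAME and remove_empty_lines are made up for the example.

```python
# Prefer the RE2 binding when it is installed, otherwise keep the standard module.
try:
    import re2 as regex_engine  # assumption: the binding provides a module named re2
    ENGINE_NAME = "re2"
except ImportError:
    import re as regex_engine
    ENGINE_NAME = "re"

def remove_empty_lines(text):
    # (?m) enables multi-line mode inline, so no engine-specific flag constant is needed.
    return regex_engine.sub(r"(?m)^\n", "", text)
```

With this shape the editor keeps working when the faster engine is missing, and ENGINE_NAME can be logged to see which engine a given installation ended up with.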
Judging by the benchmarks on the internet, this could make a big difference.