[Python-ideas] Support parsing stream with `re`

Ram Rachum ram at rachum.com
Sat Oct 6 07:25:55 EDT 2018


"This is a regular expression problem, rather than a Python problem."

Do you have evidence for this assertion, except that other regex
implementations have this limitation? Is there a regex specification
somewhere that specifies that streams aren't supported? Is there a
fundamental reason that streams aren't supported?


"Can the lexing be done on a line-by-line basis?"

For my use case, it unfortunately can't.

On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2358 at gmail.com> wrote:

> Hi Ram
>
> You wrote:
>
> > I'd like to use the re module to parse a long text file, 1GB in size. I
> > wish that the re module could parse a stream, so I wouldn't have to load
> > the whole thing into memory. I'd like to iterate over matches from the
> > stream without keeping the old matches and input in RAM.
>
> This is a regular expression problem, rather than a Python problem. A
> search for
>     regular expression large file
> brings up some URLs that might help you, starting with
>
> https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow
>
> This might also be helpful
> https://svn.boost.org/trac10/ticket/11776
>
> What will work for your problem depends on the nature of the problem
> you have. The simplest thing that might work is to iterate over the file
> line-by-line, and use a regular expression to extract matches from
> each line.
>
> In other words, something like (not tested)
>
>     import re
>
>     pattern = r'\w+'  # placeholder; substitute your own pattern
>
>     def helper(lines):
>         for line in lines:
>             yield from re.finditer(pattern, line)
>
>     with open('my-big-file.txt') as lines:
>         for match in helper(lines):
>             ...  # Do your stuff here
>
> Parsing is not the same as lexing, see
> https://en.wikipedia.org/wiki/Lexical_analysis
>
> I suggest you use regular expressions ONLY for the lexing phase. If
> you'd like further help, perhaps first ask yourself this. Can the
> lexing be done on a line-by-line basis? And if not, why not?
>
> If line-by-line is not possible, then you'll have to modify the helper.
> At the end of each line, there'll be a residue / remainder, which
> you'll have to bring into the next line. In other words, the helper
> will have to record (and change) the state that exists at the end of
> each line. A bit like the 'carry' that is used when doing long
> addition.
>
> I hope this helps.
>
> --
> Jonathan
>
>
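
For reference, the 'carry' idea quoted above might be sketched roughly as
follows (lightly tested only). This is not something the re module provides;
the names iter_stream_matches, chunk_size and max_match_len are made up for
illustration. It assumes that no single match is longer than max_match_len
characters, that the pattern never matches the empty string, and that the
pattern does not rely on lookbehind, anchors or \b at the point where the
buffer is cut. If those assumptions don't hold, it will miss or truncate
matches.

```python
import io
import re


def iter_stream_matches(pattern, stream, chunk_size=65536, max_match_len=4096):
    """Yield matched strings from a text stream without loading it all.

    Hypothetical helper: assumes every match is at most max_match_len
    characters long and that the pattern never matches the empty string.
    """
    regex = re.compile(pattern)
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        # A match that starts before `limit` has at least max_match_len
        # characters of input after its start, so (under the assumption
        # above) reading more data cannot change it.
        limit = len(buffer) - max_match_len
        consumed = 0
        for match in regex.finditer(buffer):
            if not at_eof and match.start() >= limit:
                break  # too close to the end; retry once more data arrives
            yield match.group()
            consumed = match.end()
        if at_eof:
            return
        # The 'carry': keep only the tail that may still start a match.
        buffer = buffer[max(consumed, limit, 0):]


if __name__ == "__main__":
    # io.StringIO stands in for a large file opened in text mode.
    sample = io.StringIO("ab12cd345ef")
    print(list(iter_stream_matches(r"\d+", sample, chunk_size=4, max_match_len=3)))
```

Memory use stays bounded by roughly chunk_size + max_match_len characters,
at the cost of rescanning the carried tail on each read. Yielding strings
rather than match objects avoids exposing offsets that are only relative to
the current buffer.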