[Tutor] File reading

Danny Yoo dyoo at hashcollision.org
Thu Jun 19 04:12:31 CEST 2014


Hi Uma,


In your case, I'd look at the file as a sequence of "tokens" and
treat this as a tokenization problem.

I think we'll see some kind of _identifier_, followed by
_whitespace_, followed by a _string_.  These three tokens will repeat
until we hit the end of the file.

More formally, I'd try to describe the file's structure in a grammar:


    ## This is not Python, but just a way for me to formally express
    ## what I think your file format is:

    file := (IDENTIFIER WHITESPACE STRING)* END_OF_FILE


The star there is meant to symbolize the "repeatedly" part.  Note
that we haven't yet said what IDENTIFIER, WHITESPACE, or STRING mean:
I'm just making sure we've got a top-level understanding of the file.


If this is true, then we might imagine a function tokenize() that
takes the file and breaks it down into a sequence of these tokens;
after that, the job of pulling out the content you care about should
be easier.  We can loop over the token sequence, watch for the
identifier "x-1", skip over the whitespace tokens, and collect the
strings we care about until we hit the "x-2" identifier and stop.
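
For example, that loop might look something like this.  It's just a
sketch: I'm assuming tokenize() gives back a list of (token_type,
token_text) pairs, and a possible tokenize() is sketched further
below.

    ## Collect the STRING tokens that appear between the "x-1" and
    ## "x-2" identifiers.
    def extract_strings(tokens):
        collecting = False
        pieces = []
        for token_type, token_text in tokens:
            if token_type == 'IDENTIFIER' and token_text == 'x-1':
                collecting = True            # start paying attention
            elif token_type == 'IDENTIFIER' and token_text == 'x-2':
                break                        # stop: we've seen enough
            elif collecting and token_type == 'STRING':
                pieces.append(token_text)    # keep the strings in between
        return pieces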


tokenize() should not be too bad to write: we walk the file, and
recognize certain patterns as either IDENTIFIER, WHITESPACE, or
STRING.

    IDENTIFIER looks like a bunch of non-whitespace characters.
    WHITESPACE looks like a bunch of whitespace characters.
    STRING looks like a quote, followed by a bunch of non-quote
    characters, followed by a quote.

The descriptions above are very handwavy.  You can write them out
more formally with regular expressions.  Regular expressions are a
mini-language for describing string patterns and extracting content
from strings.

See:  https://docs.python.org/2/howto/regex.html
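
For instance, the three patterns might look something like this.
Treat it as a sketch: the exact patterns are guesses based on the
descriptions above, and you'll want to adjust them to fit your real
file format.

    import re

    ## STRING is listed before IDENTIFIER on purpose: a quoted string
    ## is also a run of non-whitespace characters, so the more
    ## specific pattern has to be tried first.
    TOKEN_PATTERN = re.compile(r'''
          (?P<STRING>     "[^"]*" )   # a quote, non-quotes, a quote
        | (?P<WHITESPACE> \s+     )   # a run of whitespace characters
        | (?P<IDENTIFIER> \S+     )   # a run of non-whitespace characters
    ''', re.VERBOSE)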


Once we can formally describe the patterns above, then we can walk
the characters in the file.  We pick out which of the three patterns
matches what we're currently seeing, and add the match to the list of
tokens.  Eventually, we hit the end of the file, and tokenize() can
return all the tokens it has accumulated.
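
Concretely, tokenize() might be as short as this, assuming the
TOKEN_PATTERN sketched earlier (again, just a sketch):

    def tokenize(text):
        """Break text into a list of (token_type, token_text) pairs."""
        tokens = []
        for match in TOKEN_PATTERN.finditer(text):
            ## match.lastgroup names which of the three patterns matched.
            tokens.append((match.lastgroup, match.group()))
        return tokens

For example, tokenize('x-1 "hello" x-2') should give back:

    [('IDENTIFIER', 'x-1'), ('WHITESPACE', ' '), ('STRING', '"hello"'),
     ('WHITESPACE', ' '), ('IDENTIFIER', 'x-2')]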

