[Tutor] Tokenizing Help
Kent Johnson
kent37 at tds.net
Thu Apr 23 04:23:34 CEST 2009
On Wed, Apr 22, 2009 at 9:41 PM, William Witteman <yam at nerd.cx> wrote:
> On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:
>
>>How do you decide that a word is a keyword (AU, AB, UN) and not a part
>>of the text? There could be a file like this:
>>
>><567>
>>AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
>>and its applications
>>AB - Texts in Library Science
>><568>
>>AU - Bibliographical Theory and Practice - Volume 2 - The
>>AB - Tag and its applications
>>AB - Texts in Library Science
>><569>
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>ZZ - Somewhat nonsensical case
>
> This is a good case, and luckily the files are validated on the other
> end to prevent this kind of collision.
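For completeness, a pre-flight check on this end could catch most collisions too. A hypothetical sketch (not the actual validation used upstream): flag any non-blank line that is neither a record marker, a tag line, nor a continuation following a tag line.

```python
import re

# Hypothetical line patterns for this format (assumptions, not a spec):
# a record marker like <567>, and a two-letter tag like "AU - ...".
RECORD = re.compile(r'^<\d+>$')
TAG = re.compile(r'^[A-Z]{2} +- ')

def check(lines):
    '''Return (line_number, line) pairs that fit no expected pattern.'''
    problems = []
    have_tag = False  # has a tag line been seen in the current record?
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue  # blank lines are fine anywhere
        if RECORD.match(line):
            have_tag = False  # new record; continuations not yet legal
        elif TAG.match(line):
            have_tag = True
        elif not have_tag:
            # free text before any tag line: nothing to continue
            problems.append((n, line))
    return problems
```

This only checks line shape, not the AU-inside-text ambiguity Eike raised; that one genuinely needs validation at the source.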
>>To me it seems that a parsing library is unnecessary. Just look at the
>>first few characters of each line and decide if it's the start of a
>>record, a tag, or normal text. You might need some additional
>>algorithm for corner cases.
I agree with this. The structure is simple and the lines are easily
recognized. Here is one way to do it:
data = '''<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
<569>
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
ZZ - Somewhat nonsensical case
'''.splitlines()
import pprint, re
from collections import defaultdict

def parse(data):
    ''' Yields dictionaries corresponding to bibliographic entries '''
    result = None
    key = None
    for line in data:
        if not line.strip():
            continue # skip blank lines
        if re.search(r'^<\d+>', line):
            # start of a new entry
            if result:
                # return the previous entry and initialize
                yield result
            result = defaultdict(list)
            key = None
        else:
            m = re.search(r'^([A-Z]{2}) +- +(.*)', line)
            if m:
                # New field
                key, value = m.group(1, 2)
                result[key].append(value)
            else:
                # Extension of previous field
                if result and key: # sanity check
                    result[key][-1] += '\n' + line
    if result:
        yield result

for entry in parse(data):
    for key, value in entry.iteritems():
        print key
        pprint.pprint(value)
        print
Note that dicts do not preserve order, so the fields are not output in
the same order as they appear in the file.
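If field order matters, one option is to collect fields into an ordered mapping instead of a defaultdict. A sketch in modern Python style (collections.OrderedDict; in recent Python 3 a plain dict preserves insertion order as well), with hypothetical sample values:

```python
from collections import OrderedDict

def add_field(entry, key, value):
    # setdefault keeps first-appearance order while still collecting
    # repeated tags into a list, as defaultdict(list) did.
    entry.setdefault(key, []).append(value)

entry = OrderedDict()
add_field(entry, 'AU', 'First author line')
add_field(entry, 'AB', 'Abstract text')
add_field(entry, 'AU', 'Second AU value')

print(list(entry.keys()))  # keys in the order they first appeared
```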
> If this was the only type of file I'd need to parse, I'd agree with you,
> but this is one of at least 4 formats I'll need to process, and so a
> robust methodology will serve me better than a regex-based one-off.
Unless there is some commonality between the formats, each parser is
going to be a one-off no matter how you implement it.
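That said, if the formats share the general "marker line / tag line / continuation" shape, the regexes might be the only per-format piece. A hypothetical sketch (the other three formats are assumptions, not William's actual data):

```python
import re

def make_parser(record_pat, tag_pat):
    '''Build a line-based parser from two per-format regexes.

    tag_pat must have two groups: the tag name and the value.'''
    record_re = re.compile(record_pat)
    tag_re = re.compile(tag_pat)
    def parse(lines):
        result, key = None, None
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            if record_re.match(line):
                if result:
                    yield result  # emit the finished entry
                result, key = {}, None
            else:
                m = tag_re.match(line)
                if m:
                    key, value = m.group(1, 2)
                    result.setdefault(key, []).append(value)
                elif result and key:
                    # continuation of the previous field
                    result[key][-1] += '\n' + line
        if result:
            yield result
    return parse

# This format: <567>-style records with "AU - ..." tags
parse = make_parser(r'<\d+>', r'([A-Z]{2}) +- +(.*)')
```

Whether the remaining formats actually fit this shape is the real question; if they don't, Kent's point stands.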
Kent