[Tutor] Tokenizing Help
Kent Johnson
kent37 at tds.net
Thu Apr 23 04:23:34 CEST 2009
On Wed, Apr 22, 2009 at 9:41 PM, William Witteman <yam at nerd.cx> wrote:
> On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:
>
>>How do you decide that a word is a keyword (AU, AB, UN) and not a part
>>of the text? There could be a file like this:
>>
>><567>
>>AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
>>and its applications
>>AB - Texts in Library Science
>><568>
>>AU - Bibliographical Theory and Practice - Volume 2 - The
>>AB - Tag and its applications
>>AB - Texts in Library Science
>><569>
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>ZZ - Somewhat nonsensical case
>
> This is a good case, and luckily the files are validated on the other
> end to prevent this kind of collision.
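For completeness, a pre-flight check on this end could catch most collisions too. A hypothetical sketch (not the actual validation used upstream): flag any non-blank line that is neither a record marker, a tag line, nor a continuation following a tag line.

```python
import re

# Hypothetical line patterns for this format (assumptions, not a spec):
# a record marker like <567>, and a two-letter tag like "AU - ...".
RECORD = re.compile(r'^<\d+>$')
TAG = re.compile(r'^[A-Z]{2} +- ')

def check(lines):
    '''Return (line_number, line) pairs that fit no expected pattern.'''
    problems = []
    have_tag = False  # has a tag line been seen in the current record?
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue  # blank lines are fine anywhere
        if RECORD.match(line):
            have_tag = False  # new record; continuations not yet legal
        elif TAG.match(line):
            have_tag = True
        elif not have_tag:
            # free text before any tag line: nothing to continue
            problems.append((n, line))
    return problems
```

This only checks line shape, not the AU-inside-text ambiguity Eike raised; that one genuinely needs validation at the source.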
>>To me it seems that a parsing library is unnecessary. Just look at the
>>first few characters of each line and decide if it's the start of a
>>record, a tag, or normal text. You might need some additional
>>algorithm for corner cases.
I agree with this. The structure is simple and the lines are easily
recognized. Here is one way to do it:
data = '''<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
<569>
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
ZZ - Somewhat nonsensical case
'''.splitlines()
import pprint, re
from collections import defaultdict

def parse(data):
    ''' Yields dictionaries corresponding to bibliographic entries '''
    result = None
    key = None
    for line in data:
        if not line.strip():
            continue # skip blank lines
        if re.search(r'^<\d+>', line):
            # start of a new entry
            if result:
                # return the previous entry and initialize
                yield result
            result = defaultdict(list)
            key = None
        else:
            m = re.search(r'^([A-Z]{2}) +- +(.*)', line)
            if m:
                # New field
                key, value = m.group(1, 2)
                result[key].append(value)
            else:
                # Extension of previous field
                if result and key: # sanity check
                    result[key][-1] += '\n' + line
    if result:
        yield result

for entry in parse(data):
    for key, value in entry.iteritems():
        print key
        pprint.pprint(value)
        print
Note that dicts do not preserve order, so the fields are not output in
the same order as they appear in the file.
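If field order matters, one option is to collect fields into an ordered mapping instead of a defaultdict. A sketch in modern Python style (collections.OrderedDict; in recent Python 3 a plain dict preserves insertion order as well), with hypothetical sample values:

```python
from collections import OrderedDict

def add_field(entry, key, value):
    # setdefault keeps first-appearance order while still collecting
    # repeated tags into a list, as defaultdict(list) did.
    entry.setdefault(key, []).append(value)

entry = OrderedDict()
add_field(entry, 'AU', 'First author line')
add_field(entry, 'AB', 'Abstract text')
add_field(entry, 'AU', 'Second AU value')

print(list(entry.keys()))  # keys in the order they first appeared
```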
> If this was the only type of file I'd need to parse, I'd agree with you,
> but this is one of at least 4 formats I'll need to process, and so a
> robust methodology will serve me better than a regex-based one-off.
Unless there is some commonality between the formats, each parser is
going to be a one-off no matter how you implement it.
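That said, if the formats share the general "marker line / tag line / continuation" shape, the regexes might be the only per-format piece. A hypothetical sketch (the other three formats are assumptions, not William's actual data):

```python
import re

def make_parser(record_pat, tag_pat):
    '''Build a line-based parser from two per-format regexes.

    tag_pat must have two groups: the tag name and the value.'''
    record_re = re.compile(record_pat)
    tag_re = re.compile(tag_pat)
    def parse(lines):
        result, key = None, None
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            if record_re.match(line):
                if result:
                    yield result  # emit the finished entry
                result, key = {}, None
            else:
                m = tag_re.match(line)
                if m:
                    key, value = m.group(1, 2)
                    result.setdefault(key, []).append(value)
                elif result and key:
                    # continuation of the previous field
                    result[key][-1] += '\n' + line
        if result:
            yield result
    return parse

# This format: <567>-style records with "AU - ..." tags
parse = make_parser(r'<\d+>', r'([A-Z]{2}) +- +(.*)')
```

Whether the remaining formats actually fit this shape is the real question; if they don't, Kent's point stands.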
Kent