Text parsing via regex

Vlastimil Brom vlastimil.brom at gmail.com
Mon Dec 8 15:59:53 EST 2008


2008/12/8 Robocop <bthayre at physics.ucsd.edu>:
> I'm having a little text parsing problem that i think would be really
> quick to troubleshoot for someone more versed in python and Regexes.
> I need to write a simple script that parses some arbitrarily long
> string every 50 characters, and does not parse text in the middle of
> words (but ultimately every parsed string should be 50 characters,
> ...

Hi, I'm not sure I understand the task completely, but maybe one of
the variants below using re will help (depending on what should be done
next with the resulting text segments).
In the first two variants the resulting lines are 50 characters
long, plus 1 for the "\n"; use 49 instead if the newline should count
toward the 50.


import re

input_txt = """I'm having a little text parsing problem that i think would be really
quick to troubleshoot for someone more versed in python and Regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters, so
adding in white spaces is necessary).  So i immediately came up with
something along the lines of:"""

# (?s) makes "." match newlines too; global inline flags belong at the start of the pattern

# Variant 1: re.sub with a function that pads each match and appends "\n"
# print(re.sub(r"(?s)(.{1,50}\b)", lambda m: m.group().ljust(50) + "\n", input_txt))

# Variant 2: pad the matches from finditer
# for m in re.finditer(r"(?s)(.{1,50}\b)", input_txt):
#     print(m.group().ljust(50))

# Variant 3: pad the matched parts from findall
print([chunk.ljust(50) for chunk in re.findall(r"(?s)(.{1,50}\b)", input_txt)])
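
One caveat with \b: it matches at any transition between word and
non-word characters, so a chunk can end just before punctuation, and
the finditer/findall variants never capture punctuation at the very
end of the text (the final ":" of input_txt). If that matters, here is
a minimal sketch of a variation that breaks only at whitespace. The
name split_padded, the width parameter, and the decision to collapse
all whitespace (including newlines) into single spaces first are my
assumptions, not something the original request spells out.

```python
import re

def split_padded(text, width=50):
    """Split text at whitespace into chunks padded to exactly `width` characters."""
    # Assumption: newlines and repeated spaces are not significant,
    # so collapse them into single spaces before chunking.
    flat = " ".join(text.split())
    # First alternative: start at a non-space character, take up to `width`
    # characters in total, and stop only where the next character is
    # whitespace or the end of the string.
    # Second alternative: a word longer than `width` is cut hard at `width`.
    pattern = r"\S.{0,%d}(?!\S)|\S{%d}" % (width - 1, width)
    return [m.group().ljust(width) for m in re.finditer(pattern, flat)]

if __name__ == "__main__":
    for line in split_padded("I need to write a simple script that parses "
                             "some arbitrarily long string every 50 characters."):
        print(repr(line))
```

Every returned chunk has length `width`, and joining the stripped
chunks with single spaces gives back the whitespace-normalized text,
provided no single word is longer than `width`.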

hth,
  vbr


