Text parsing via regex

MRAB google at mrabarnett.plus.com
Mon Dec 8 17:34:26 EST 2008


Vlastimil Brom wrote:
> 2008/12/8 Robocop <bthayre at physics.ucsd.edu>:
>> I'm having a little text parsing problem that i think would be really
>> quick to troubleshoot for someone more versed in python and Regexes.
>> I need to write a simple script that parses some arbitrarily long
>> string every 50 characters, and does not parse text in the middle of
>> words (but ultimately every parsed string should be 50 characters,
>> ...
> 
> Hi, I'm not sure I understand the task completely, but maybe some of
> the variants below using re may help (depending on what should be done
> further with the resulting text segments).
> In the first two possibilities the resulting lines are 50 characters
> long + 1 for "\n"; 49 could be used instead if needed.
> 
> 
> import re
> 
> input_txt = """I'm having a little text parsing problem that i think
> would be really
> quick to troubleshoot for someone more versed in python and Regexes.
> I need to write a simple script that parses some arbitrarily long
> string every 50 characters, and does not parse text in the middle of
> words (but ultimately every parsed string should be 50 characters, so
> adding in white spaces is necessary).  So i immediately came up with
> something along the lines of:"""
> 
> # re.sub using a function
> # print re.sub(r"((?s).{1,50}\b)", lambda m: m.group().ljust(50) + "\n", input_txt)
> 
I also thought of r"(.{1,50}\b)", but then I realised that there's a 
subtle problem: it says that the captured text should end on a word 
boundary, when, in fact, we just don't want it to split within a word. 
It would still be acceptable if it split between 2 non-word characters. 
Aargh! :-)
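
To make the difference concrete, here is a small sketch in Python 3 (the quoted code is Python 2). The helper names and the lookaround pattern are my own assumptions, not something from the original post: instead of requiring a word boundary at the end of each chunk, the second pattern only forbids ending between two word characters. The (?s) flag is moved to the front of the pattern because recent Python 3 versions reject inline global flags anywhere else.

```python
import re

def split_on_boundary(text, width):
    # Quoted approach: every chunk must END exactly on a \b word boundary.
    return re.findall(r"(?s).{1,%d}\b" % width, text)

def split_outside_words(text, width):
    # Hypothetical alternative: a chunk may end anywhere except between
    # two word characters, i.e. not in the middle of a word.
    return re.findall(r"(?s).{1,%d}(?!(?<=\w)\w)" % width, text)

sample = "Hello, world"
boundary_chunks = split_on_boundary(sample, 6)
outside_chunks = split_outside_words(sample, 6)

print(boundary_chunks)  # ['Hello', ', ', 'world']
print(outside_chunks)   # ['Hello,', ' world']
```

With \b the comma cannot stay attached to "Hello", because the position between "," and " " is not a word boundary. Neither pattern copes with a single word longer than the width; that case would need separate handling.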

> # adjusting the matches via finditer
> # for m in re.finditer(r"((?s).{1,50}\b)",  input_txt):
> #     print m.group().ljust(50)
> 
> print [chunk.ljust(50) for chunk in re.findall(r"((?s).{1,50}\b)",
> input_txt)] # adjusting the matched parts in findall
> 
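
For anyone trying the findall variant on Python 3, a rough equivalent might look like the sketch below. The function name pad_chunks and the sample string are only illustrative assumptions; the pattern is the quoted one with the (?s) flag moved to the start and the capturing group dropped, so that findall returns the whole match.

```python
import re

def pad_chunks(text, width=50):
    # Split into pieces of at most `width` characters that end on a word
    # boundary, then pad each piece with spaces to exactly `width`.
    pattern = re.compile(r"(?s).{1,%d}\b" % width)
    return [chunk.ljust(width) for chunk in pattern.findall(text)]

sample = ("I need to write a simple script that parses some arbitrarily "
          "long string every 50 characters.")
lines = pad_chunks(sample)

for line in lines:
    print(repr(line))
```

Note that the final "." of the sample never appears in the output: nothing after it forms a word boundary, so the pattern cannot end a chunk there. That is one practical consequence of insisting on \b.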


