Text parsing via regex

Mon Dec 8 13:42:00 EST 2008

Robocop wrote:
> I'm having a little text parsing problem that i think would be really
> quick to troubleshoot for someone more versed in python and Regexes.
> I need to write a simple script that parses some arbitrarily long
> string every 50 characters, and does not parse text in the middle of
> words (but ultimately every parsed string should be 50 characters, so
> adding in white spaces is necessary).  So i immediately came up with
> something along the lines of:
> 
> string = "a bunch of nonsense that could be really long, or really
> short depending on the situation"
> r = re.compile(r".{50}")
> m = r.match(string)
> 
> then i started to realize that i didn't know how to do exactly what i
> wanted.  At this point i wanted to find a way to simply use something
> like:
> 
> parsed_1, parsed_2,...parsed_n = m.groups()
> 
> However i'm having several problems.  I know that playskool regular
> expression i wrote above will only parse every 50 characters, and will
> blindly cut words in half if the parsed string doesn't end with a
> whitespace.  I'm relatively new to regexes and i don't know how to
> have it take that into account, or even what type of logic i would
> need to fill in the extra whitespaces to make the string the proper
> length when avoiding cutting words up.  So that's problem #1.  Problem
> #2 is that because the string is of arbitrary length, i never know how
> many parsed strings i'll have, and thus do not immediately know how
> many variables need to be created to accompany them.  It's easy enough
> with each pass of the function to find how many i will have by doing:
> mag = len(string)
> upper_lim = mag/50 + 1
> But i'm not sure how to declare and set them to my parsed strings.
> Now problem #1 isn't as pressing, i can technically get away with
> cutting up the words, i'd just prefer not to.  The most pressing
> problem right now is #2.  Any help, or suggestions would be great,
> anything to get me thinking differently is helpful.

Hi Robocop,

What do you mean by "parses some arbitrarily long string every 50
characters"?  What does your source data look like?  Can you give us an
example of a) the input and b) what a match would look like?

I think you will get good mileage out of using '\b' to match word
boundaries, and you may be better off using a regex to split your string
into a list and then padding each item with whitespace after the fact,
but I can't say for sure.  Please clarify.
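
In case it helps to have something concrete, here is a rough sketch of
the "split into a list, then pad" idea.  It rests on my guesses about
what you want: runs of whitespace are collapsed to single spaces, each
piece is at most 50 characters, and any word longer than 50 characters
is cut.  The name split_into_lines is made up for illustration.  Note
that I used a lookahead for whitespace rather than '\b', because '\b'
also matches between a word and a following comma, which would allow a
break just before punctuation.

```python
import re


def split_into_lines(text, width=50):
    """Split text into pieces of exactly `width` characters.

    Breaks happen only at whitespace; short pieces are padded on the
    right with spaces.  (Hypothetical helper, not from the original post.)
    """
    # Assumption: runs of whitespace (including newlines) become one space.
    text = " ".join(text.split())

    # First alternative: start at a non-space character, take up to `width`
    # characters in total, and stop just before whitespace or end of string.
    # Second alternative: a single word of `width` or more characters cannot
    # be kept whole, so cut `width` characters off it.
    pattern = r"\S.{0,%d}(?=\s|$)|\S{%d}" % (width - 1, width)
    pieces = re.findall(pattern, text)

    # Pad every piece on the right so each one is exactly `width` long.
    return [piece.ljust(width) for piece in pieces]


sample = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")

# A list holds however many pieces there turn out to be, so there is no
# need to create parsed_1, parsed_2, ... parsed_n as separate variables.
for line in split_into_lines(sample):
    print(repr(line), len(line))
```

Because the result is a list, your problem #2 goes away: len() tells you
how many pieces you got, and you can index or loop over them.  If you do
not specifically need a regex, the standard library's textwrap.wrap()
does the word-aware splitting for you, and you would only need to add
the ljust() padding.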