Text parsing via regex

Mon Dec 8 15:03:32 EST 2008

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/148061

from functools import reduce  # needed on Python 3, where reduce is not a builtin

def wrap(text, width):
    r"""
    A word-wrap function that preserves existing line breaks
    and most spaces in the text. Expects that existing line
    breaks are posix newlines (\n).
    """
    return reduce(lambda line, word, width=width: '%s%s%s' %
                  (line,
                   ' \n'[(len(line)-line.rfind('\n')-1
                         + len(word.split('\n',1)[0]
                              ) >= width)],
                   word),
                  text.split(' ')
                 )
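
The reduce/lambda one-liner above is dense, so here is a sketch of what
I believe is the same logic written as a plain loop. The name wrap_loop
is my own and is not part of the recipe; treat it as a reading aid
rather than a replacement.

```python
def wrap_loop(text, width):
    """Loop form of the recipe's wrap(); wrap_loop is a hypothetical name."""
    words = text.split(' ')
    result = words[0]
    for word in words[1:]:
        # Length of the last output line: everything after the final '\n'.
        current_len = len(result) - result.rfind('\n') - 1
        # Only the part of the word before an embedded newline stays on this line.
        head_len = len(word.split('\n', 1)[0])
        # Break the line if appending the word would reach or exceed the width.
        separator = '\n' if current_len + head_len >= width else ' '
        result += separator + word
    return result
```

Like the original, this never splits a word, so a single word longer
than the width will still overflow its line.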

# 2 very long lines separated by a blank line
msg = """Arthur:  "The Lady of the Lake, her arm clad in the purest \
shimmering samite, held aloft Excalibur from the bosom of the water, \
signifying by Divine Providence that I, Arthur, was to carry \
Excalibur. That is why I am your king!"

Dennis:  "Listen. Strange women lying in ponds distributing swords is \
no basis for a system of government. Supreme executive power derives \
from a mandate from the masses, not from some farcical aquatic \
ceremony!\""""

# example: make it fit in 40 columns
print(wrap(msg,40))

# the output begins as shown below (first line only)
"""
Arthur:  "The Lady of the Lake, her arm
"""

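If preserving the existing line breaks is not important, the standard
library's textwrap module may be a simpler route to what you describe.
The sketch below is only one way to do it: fixed_width_chunks is a name
I made up, and it assumes you want a list of pieces, each padded with
trailing spaces to exactly 50 characters.

```python
import textwrap

def fixed_width_chunks(text, width=50):
    """Split text at word boundaries and pad each piece to `width` characters."""
    # textwrap.wrap returns a list of lines no longer than `width`;
    # by default it splits any single word that is longer than `width`.
    return [line.ljust(width) for line in textwrap.wrap(text, width)]

string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")
chunks = fixed_width_chunks(string)
for chunk in chunks:
    print(repr(chunk))
```

Because the result is a list, there is no need to know in advance how
many pieces there will be; len(chunks) gives the count.
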
Robocop wrote:
> I'm having a small text-parsing problem that I think would be really
> quick to troubleshoot for someone more versed in Python and regexes.
> I need to write a simple script that splits an arbitrarily long
> string into 50-character pieces without cutting words in half
> (ultimately every piece should be exactly 50 characters, so padding
> with whitespace is necessary).  So I immediately came up with
> something along the lines of:
>
> import re
>
> string = ("a bunch of nonsense that could be really long, or really "
>           "short depending on the situation")
> r = re.compile(r".{50}")
> m = r.match(string)
>
> Then I started to realize that I didn't know how to do exactly what I
> wanted.  At this point I wanted to find a way to simply use something
> like:
>
> parsed_1, parsed_2,...parsed_n = m.groups()
>
> However, I'm having several problems.  I know that the playskool
> regular expression I wrote above will only split every 50 characters,
> and will blindly cut words in half if a piece doesn't end with
> whitespace.  I'm relatively new to regexes and I don't know how to
> have it take that into account, or even what logic I would need to
> fill in the extra whitespace to make each piece the proper length
> while avoiding cutting words up.  So that's problem #1.  Problem
> #2 is that because the string is of arbitrary length, I never know how
> many pieces I'll have, and thus do not immediately know how many
> variables need to be created to hold them.  It's easy enough
> with each pass of the function to find how many I will have by doing:
> mag = len(string)
> upper_lim = mag // 50 + 1
> But I'm not sure how to declare the variables and set them to my
> parsed strings.  Now problem #1 isn't as pressing; I can technically
> get away with cutting up the words, I'd just prefer not to.  The most
> pressing problem right now is #2.  Any help or suggestions would be
> great; anything to get me thinking differently is helpful.
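
If you would rather stay with regexes, something along these lines
might work. It is a sketch, not a tested solution for every input:
regex_chunks is a hypothetical name, the pattern assumes the text has
no embedded newlines, and a word longer than 50 characters is not
handled sensibly. re.findall returns a list, which sidesteps problem #2
because no parsed_1 ... parsed_n variables are needed.

```python
import re

def regex_chunks(text, width=50):
    """Return word-boundary pieces of `text`, each padded to `width` characters."""
    # Greedily take up to `width` characters, then backtrack until the
    # piece is followed by whitespace or by the end of the string.
    pattern = re.compile(r'(.{1,%d})(?:\s+|$)' % width)
    # findall yields only the captured group, i.e. the piece without
    # the whitespace that separated it from the next piece.
    return [piece.ljust(width) for piece in pattern.findall(text)]

string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")
parsed = regex_chunks(string)
```

Each element of parsed is then exactly 50 characters long, and you can
loop over the list instead of naming every piece.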

-- 
Shane Geiger
IT Director
National Council on Economic Education
sgeiger at ncee.net  |  402-438-8958  |  http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy