Text parsing via regex
Shane Geiger
sgeiger at ncee.net
Mon Dec 8 15:03:32 EST 2008
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/148061
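# reduce is a builtin in Python 2; Python 3 needs this import
from functools import reduce

def wrap(text, width):
    """
    A word-wrap function that preserves existing line breaks
    and most spaces in the text. Expects that existing line
    breaks are posix newlines (\n).
    """
    # Between each pair of words, emit '\n' instead of ' ' when the
    # current line plus the next word would reach the width limit.
    return reduce(lambda line, word, width=width: '%s%s%s' %
                  (line,
                   ' \n'[(len(line)-line.rfind('\n')-1
                         + len(word.split('\n',1)[0]
                              ) >= width)],
                   word),
                  text.split(' ')
                 )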
# 2 very long lines separated by a blank line
msg = """Arthur: "The Lady of the Lake, her arm clad in the purest \
shimmering samite, held aloft Excalibur from the bosom of the water, \
signifying by Divine Providence that I, Arthur, was to carry \
Excalibur. That is why I am your king!"

Dennis: "Listen. Strange women lying in ponds distributing swords is \
no basis for a system of government. Supreme executive power derives \
from a mandate from the masses, not from some farcical aquatic \
ceremony!\""""
# example: make it fit in 40 columns
print(wrap(msg,40))
# result is below
"""
Arthur: "The Lady of the Lake, her arm
"""
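For comparison, the standard library's textwrap module can do much the same job. This is only a sketch, not the recipe's method: it assumes that collapsing runs of whitespace inside a line is acceptable (the recipe above tries to keep most spaces), and wrap_paragraphs is a made-up helper name.

```python
import textwrap

def wrap_paragraphs(text, width):
    """Wrap each existing line of `text` on its own so line breaks survive."""
    wrapped = []
    for line in text.split('\n'):
        # textwrap.fill('') returns '', so blank lines stay blank
        wrapped.append(textwrap.fill(line, width))
    return '\n'.join(wrapped)

sample = ("Strange women lying in ponds distributing swords is "
          "no basis for a system of government.")
result = wrap_paragraphs(sample, 40)
print(result)
```

Wrapping line by line is what keeps the existing newlines; passing the whole text to textwrap.fill in one call would treat the newlines as ordinary whitespace.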
Robocop wrote:
> I'm having a little text parsing problem that i think would be really
> quick to troubleshoot for someone more versed in python and Regexes.
> I need to write a simple script that parses some arbitrarily long
> string every 50 characters, and does not parse text in the middle of
> words (but ultimately every parsed string should be 50 characters, so
> adding in white spaces is necessary). So i immediately came up with
> something along the lines of:
>
> string = "a bunch of nonsense that could be really long, or really
> short depending on the situation"
> r = re.compile(r".{50}")
> m = r.match(string)
>
> then i started to realize that i didn't know how to do exactly what i
> wanted. At this point i wanted to find a way to simply use something
> like:
>
> parsed_1, parsed_2,...parsed_n = m.groups()
>
> However i'm having several problems. I know that playskool regular
> expression i wrote above will only parse every 50 characters, and will
> blindly cut words in half if the parsed string doesn't end with a
> whitespace. I'm relatively new to regexes and i don't know how to
> have it take that into account, or even what type of logic i would
> need to fill in the extra whitespaces to make the string the proper
> length when avoiding cutting words up. So that's problem #1. Problem
> #2 is that because the string is of arbitrary length, i never know how
> many parsed strings i'll have, and thus do not immediately know how
> many variables need to be created to accompany them. It's easy enough
> with each pass of the function to find how many i will have by doing:
> mag = len(string)
> upper_lim = mag/50 + 1
> But i'm not sure how to declare and set them to my parsed strings.
> Now problem #1 isn't as pressing, i can technically get away with
> cutting up the words, i'd just prefer not to. The most pressing
> problem right now is #2. Any help, or suggestions would be great,
> anything to get me thinking differently is helpful.
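On problem #2: you don't need one variable per piece. Collect the pieces in a list and index or loop over it, however many there turn out to be. Below is a rough sketch of both problems together, assuming no single word is longer than 50 characters and the text contains no newlines. The function names are made up for illustration. The first version uses textwrap, the second a regex in the spirit of your original attempt.

```python
import re
import textwrap

def fixed_width_chunks(text, width=50):
    """Split text at word boundaries and pad every piece to exactly `width`."""
    # break_long_words=False keeps an over-long word whole; such a piece
    # would exceed `width` (assumption: words are shorter than `width`).
    pieces = textwrap.wrap(text, width, break_long_words=False)
    return [piece.ljust(width) for piece in pieces]

def fixed_width_chunks_re(text, width=50):
    """Same idea with a regex: up to `width` characters, then whitespace or end."""
    pattern = r'(.{1,%d})(?:\s+|$)' % width
    return [piece.ljust(width) for piece in re.findall(pattern, text)]

string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")
chunks = fixed_width_chunks(string)
for chunk in chunks:
    print(repr(chunk))
```

len(chunks) tells you how many pieces you got, and chunks[0], chunks[1], ... stand in for parsed_1, parsed_2, and so on.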
--
Shane Geiger
IT Director
National Council on Economic Education
sgeiger at ncee.net | 402-438-8958 | http://www.ncee.net
Leading the Campaign for Economic and Financial Literacy