[Tutor] Splitting by word boundaries
Jonathan Hayward http://JonathansCorner.com
jonathan.hayward at pobox.com
Thu Aug 14 22:34:16 EDT 2003
Neil Schemenauer wrote:
>Michael Janssen wrote:
>
>
>>this is important but not enough. re.split(r'\b', 'word boundary') is
>>still not functional. I've looked through the sources to find out why.
>>
>>
>
>re.findall(r'\w+', ...) should do what is intended.
>
>
[other discussion snipped]
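For reference, here is a minimal sketch of the suggestion quoted above, assuming a present-day Python 3 interpreter. The sample string is made up for illustration. re.findall(r'\w+', ...) returns the words only and throws the separators away; a capturing group in re.split is one way to keep both. Note that \W matches newlines with or without re.DOTALL, since that flag only changes what "." matches.

```python
import re

sample = "word boundary,\nnext line"  # hypothetical input, not from the thread

# Words only: spaces, punctuation and newlines are discarded.
words = re.findall(r"\w+", sample)

# Words and separators: the capturing group makes re.split keep the
# delimiters. Empty strings at either end are filtered out.
pieces = [p for p in re.split(r"(\W+)", sample) if p]

print(words)
print(pieces)
```

Joining pieces back together gives the original string, which is a quick way to check that nothing was lost. From Python 3.7 on, re.split also accepts patterns that can match an empty string, such as r'\b'.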
I couldn't figure out how to get this to accept re.DOTALL or an
equivalent, so I wrote the splitting by hand, along with a little bit of
HTML entity handling that the original regexp wouldn't have done:
import re

def split_at_word_boundaries(self, text):
    # Split text into alternating word and non-word segments, treating
    # HTML entities such as &amp; or &#233; as part of a word.
    # debug_log and self.word_character_matcher are defined elsewhere in
    # the class; the matcher is presumably a compiled pattern like r"\w".
    debug_log("split_at_word_boundaries")
    result = []
    current_segment = []
    text_characters = []
    # "." does not match "\n" without re.DOTALL, so split on newlines
    # and put them back by hand.
    is_on_first_newline_segment = 1
    newline_segments = text.split("\n")
    for segment in newline_segments:
        if is_on_first_newline_segment:
            is_on_first_newline_segment = 0
        else:
            text_characters.append("\n")
        text_characters.extend(re.findall(".", segment))
    if not text_characters:
        return result
    is_in_escape = 0
    if re.match(r"[\w&]", text_characters[0]):
        is_in_word = 1
    else:
        is_in_word = 0
    for current_character in text_characters:
        current_is_in_word = 0
        # "&" opens an entity, "#" may follow it, ";" closes it.
        if current_character == "&":
            is_in_escape = 1
            current_is_in_word = 1
        if is_in_escape and current_character == "#":
            current_is_in_word = 1
        if is_in_escape and current_character == ";":
            is_in_escape = 0
            current_is_in_word = 1
        if self.word_character_matcher.match(current_character):
            current_is_in_word = 1
        if current_is_in_word != is_in_word:
            # Boundary: close the finished segment and start a new one.
            if current_segment:
                result.append("".join(current_segment))
            current_segment = []
            is_in_word = current_is_in_word
        current_segment.append(current_character)
    # Flush the last segment.
    result.append("".join(current_segment))
    return result
Usually when I write code like this, I suspect I'm doing something Python
tries to spare programmers from needing to do. It's not duct tape, but it
does read more like a C++ programmer trying to use Python as C++.
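A more compact sketch of the same idea, assuming Python 3 and leaning on the re module instead of a character loop. The TOKEN pattern and the standalone function are illustrative, not the code from my class. One known difference: a bare "&" that does not start an entity counts as a non-word character here, whereas the loop treats every "&" as part of a word.

```python
import re

# A "word" is a run of word characters and HTML entities such as &amp;
# or &#233;. Everything else is a separator. Both alternatives use only
# non-capturing groups, so findall returns the full matches.
TOKEN = re.compile(r"(?:&#?\w+;|\w)+|(?:(?!&#?\w+;)\W)+")

def split_at_word_boundaries(text):
    """Return alternating word and non-word segments covering all of text."""
    return TOKEN.findall(text)

print(split_at_word_boundaries("Tom &amp; Jerry, caf&#233;!\nBye"))
```

Because \W matches newlines, no re.DOTALL workaround is needed, and joining the returned segments reproduces the input.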
--
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com