[Tutor] Splitting by word boundaries

Thu Aug 14 22:34:16 EDT 2003

Neil Schemenauer wrote:

>Michael Janssen wrote:
>
>>this is important but not enough. re.split(r'\b', 'word boundary') is
>>still not functional. I've looked through the sources to find out why.
>
>re.findall(r'\w+', ...) should do what is intended.

[other discussion snipped]
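
For reference, here is a minimal sketch of the re.findall suggestion quoted above. re.split only splits on empty matches such as \b from Python 3.7 onward, so findall is the usual workaround. The sample string is made up for illustration, and the second pattern, r"\w+|\W+", is an assumption added here (it is not from the thread) to keep the separators as well as the words. Since \W also matches "\n", it needs no re.DOTALL.

```python
import re

# Hypothetical sample text containing a newline.
text = "word boundary,\nnext line"

# Words only, as suggested in the quoted reply.
words = re.findall(r"\w+", text)

# Alternating runs of word and non-word characters, so that joining
# the pieces gives back the original text. \W matches "\n" too.
segments = re.findall(r"\w+|\W+", text)

print(words)
print(segments)
```

Joining the segments reproduces the input, which is roughly what splitting at \b was meant to achieve.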

I couldn't figure out how to get this to accept re.DOTALL or an
equivalent, so I wrote my own splitter. It also does a little bit of
HTML handling (character references such as &amp;) that the original
regexp wouldn't have done:

    def split_at_word_boundaries(self, text):
        """Split text into alternating word and non-word segments.

        HTML character references such as &amp; or &#233; count as
        part of a word. self.word_character_matcher is defined
        elsewhere; it is assumed to be a compiled regexp that matches
        a single word character, e.g. re.compile(r"\w").
        """
        debug_log("split_at_word_boundaries")
        result = []
        current_segment = []
        is_in_word = None       # unknown until the first character
        is_in_escape = False
        # Iterating over the string yields "\n" like any other
        # character, so no re.DOTALL or newline splitting is needed.
        for current_character in text:
            if current_character == "&":
                # Possible start of a character reference.
                is_in_escape = True
                current_is_in_word = True
            elif is_in_escape and current_character == "#":
                current_is_in_word = True
            elif is_in_escape and current_character == ";":
                # End of the character reference.
                is_in_escape = False
                current_is_in_word = True
            elif self.word_character_matcher.match(current_character):
                current_is_in_word = True
            else:
                # A non-word character also abandons any open escape.
                is_in_escape = False
                current_is_in_word = False
            if current_segment and current_is_in_word != is_in_word:
                # Boundary reached: close the segment collected so far.
                result.append("".join(current_segment))
                current_segment = []
            current_segment.append(current_character)
            is_in_word = current_is_in_word
        # Do not lose the final segment.
        if current_segment:
            result.append("".join(current_segment))
        return result

Usually when I write code like this, I suspect I'm doing something
Python tries to spare programmers from having to do. It's not duct tape,
but it does read like a C++ programmer trying to use Python as C++.
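
A more compact version of the same idea might look like the sketch below. It is only a sketch under stated assumptions, not a drop-in replacement: the names WORD_PART, SEGMENT and split_segments are hypothetical, and the pattern treats a character reference as a word part only when it is complete (&name; or &#digits;). A bare "&" therefore ends up in a non-word segment, which differs from the hand-written loop. re.DOTALL is passed to re.compile so that "." also matches "\n".

```python
import re

# Hypothetical names; not from the original post.
# A word part is a complete HTML character reference or one \w character.
WORD_PART = r"&#?\w+;|\w"

# Either a run of word parts, or a run of characters that do not
# start a word part. re.DOTALL lets "." match "\n" as well.
SEGMENT = re.compile(
    r"(?:%s)+|(?:(?!%s).)+" % (WORD_PART, WORD_PART),
    re.DOTALL,
)

def split_segments(text):
    """Return alternating word and non-word segments of text."""
    return SEGMENT.findall(text)

# Made-up sample input with an entity, a newline and a bare "&".
sample = "caf&eacute; au lait,\nAT&T"
pieces = split_segments(sample)
print(pieces)
```

Since every character falls into one of the two alternatives, joining the returned pieces gives back the input unchanged.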

-- 
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com