[Tutor] sentence case module for comments and possible cookbook submission
Brian van den Broek
bvande at po-box.mcgill.ca
Fri Sep 10 11:05:25 CEST 2004
Hi all,
Earlier today I needed to change some ALL CAPS text to sentence case. To
my surprise, searching the docs and the Cookbook, and browsing the 70+
Google hits for: Python "sentence case", turned up nothing.
I've produced something that I think works. But, as regular readers of the
list might know, I'm still learning, so I'd appreciate any comments on it. I
realize it is a bit long, but I intend to refine it and submit it to the
Cookbook. (Unless the likely "But what about using X instead?" comments
are forthcoming. :-) If for some reason you comment but don't want to be
mentioned should I post to the Cookbook, please let me know.
Also, I still can't believe I'm not reinventing the wheel. If there is
something available, I couldn't find it. So, if you know, I'd be happy to
hear.
Thanks and best to all,
Brian vdB
#! /usr/bin/env python
# sentence_caser.py
# Version 0.1
# Brian van den Broek
# bvande at po-box.mcgill.ca
# This module is released under the Python License. (See www.python.org.)
punctuation_indexes = {}
punctuation = ['!', '?']
def punctuation_stripper(data_string):
    '''punctuation_stripper(data_string) -> data_string
    Stores the indexes of each type of punctuation (other than '.') in the
    punctuation_indexes dict and replaces them with '.'s. (This makes
    splitting the string easier, and, thanks to the dict, is reversible.)'''
    for mark in punctuation:
        punctuation_indexes[mark] = []
        offset = 0
        while True:
            try:
                i = data_string.index(mark, offset)
                punctuation_indexes[mark].append(i)
                offset = i + 1
            except ValueError:
                break
        data_string = data_string.replace(mark, '.')
    return data_string
def change_to_sentence_case(sentence_list):
    '''change_to_sentence_case(sentence_list) -> cap_sentences_list
    Takes a list of sentence strings and transforms it so that the first
    (and only the first) letter of each is capitalized. It is a bit more
    complicated than just calling the capitalize string method as the
    strings in the sentence list may well start with ' ', '(', '[', etc.
    The while loop travels the string, looking for the first letter and
    calling capitalize on the substring it commences. restore_Is() is also
    called, in an attempt to undo lower-casing of the pronoun "I".'''
    cap_sentences_list = []
    for s in sentence_list:
        offset = 0
        while offset < len(s):
            if s[offset].isalpha():
                s = s[:offset] + s[offset:].capitalize()
                break
            offset += 1
        s += '.'
        s = restore_Is(s)
        cap_sentences_list.append(s)
    return cap_sentences_list
def restore_Is(sentence):
    '''restore_Is(sentence) -> sentence
    Takes a sentence string and tries to restore any "I"s incorrectly
    changed to "i"s by change_to_sentence_case()'s use of .capitalize().'''
    sentence = sentence.replace(' i ', ' I ')
    sentence = sentence.replace(' i,', ' I,')
    sentence = sentence.replace(' i.', ' I.')
    return sentence
def restore_punctuation(data_sentences):
    '''restore_punctuation(data_sentences) -> data_sentences
    Consulting the punctuation_indexes dict, restore_punctuation() reverts
    non '.' punctuation that was changed to '.' to facilitate splitting the
    string.'''
    for mark in punctuation:
        for i in punctuation_indexes[mark]:
            data_sentences = data_sentences[:i] + mark + data_sentences[i + 1:]
    return data_sentences
def sentence_caser(data_string):
    '''sentence_caser(data_string) -> data_string
    Takes a string and returns it in sentence case (it is hoped). To do so,
    it runs it through various helper functions. sentence_caser() does
    almost no work on its own; consult the functions punctuation_stripper(),
    change_to_sentence_case(), and restore_punctuation() for details of the
    processing.'''
    working_data = punctuation_stripper(data_string)
    data_sentences_list = working_data.split('.')
    data_sentences_list = change_to_sentence_case(data_sentences_list)
    data_sentences = ''.join(data_sentences_list)
    data_sentences = restore_punctuation(data_sentences)
    data_sentences = data_sentences[:len(data_string)]
    # To remove possibly spurious trailing '.' added when original string
    # ended with non-'.' character (as in data below).
    return data_sentences
if __name__ == '__main__':
    data = '''STRINGS IN ALL CAPS ARE HARD TO READ! SOME PEOPLE THINK THEY ARE
LIKE SHOUTING. DO YOU THINK SO? I ONLY WRITE THEM WHEN I HAVE A CAPS-LOCK
ACCIDENT. (OR WHEN CREATING TEST DATA.) THEY ARE NO FUN. (OK, ENOUGH NOW.)'''
    print data
    print
    print sentence_caser(data)
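
Since the post invites "But what about using X instead?" comments, here is
a minimal sketch of one possible alternative built on the standard re
module. It is not the method used in the module above, only a comparison
point. The function name sentence_case_re and the set of characters skipped
before the first letter (whitespace, brackets, quotes) are assumptions made
for illustration. Like the module above, it treats every '.', '!' and '?'
as a sentence end, so abbreviations such as "e.g." would be mis-capitalized.

```python
import re

# Hypothetical helper name; not part of sentence_caser.py.
def sentence_case_re(text):
    """Return text lower-cased, with the first letter of each sentence
    (and the standalone pronoun "I") capitalized."""
    lowered = text.lower()

    def capitalize_first(match):
        # group(1): start of text or a terminator, plus any whitespace,
        # brackets, or quotes before the first letter.
        # group(2): the first letter of the sentence.
        return match.group(1) + match.group(2).upper()

    # Assumption: sentences end with '.', '!' or '?'.
    pattern = r'((?:^|[.!?])[\s()\[\]"\']*)([a-z])'
    result = re.sub(pattern, capitalize_first, lowered)

    # Restore the pronoun "I" wherever a lone "i" stands as a word.
    result = re.sub(r'\bi\b', 'I', result)
    return result
```

This version keeps no module-level dict and needs no restore step, because
the punctuation is never replaced. The trade-off is that the whole
capitalization rule is packed into a single pattern, which is harder to
read than the step-by-step helpers above.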