Python library to break text into words

Chris Angelico rosuav at gmail.com
Thu May 31 16:45:59 EDT 2018


On Fri, Jun 1, 2018 at 6:26 AM, beliavsky--- via Python-list
<python-list at python.org> wrote:
> I bought some e-books in a Humble Bundle. The file names are shown below. I would like to hyphenate words within the file names, so that the first three titles are
>
> a_devils_chaplain.pdf
> atomic_accidents.pdf
> chaos_making_a_new_science.pdf
>
> Is there a Python library that uses intelligent guesses to break sequences of characters into words? The general strategy would be to break strings into the longest words possible. The library would need to "know" a sizable subset of words in English.
>
> adevilschaplain.pdf
> atomicaccidents.pdf
> chaos_makinganewscience.pdf

Let's start with the easy bit. On many MANY Unix-like systems, you can
find a dictionary of words in the user's language (not necessarily
English, but that's appropriate here - it means your script will work
on a French or German or Turkish or Russian system as well) at
/usr/share/dict/words. All you have to do is:

with open("/usr/share/dict/words") as f:
    # One word per line; splitlines() drops the trailing newlines.
    words = f.read().splitlines()

Tada! That'll give you somewhere between 50K and 650K words, for
English. (I have eight English dictionaries installed, ranging from
american-english-small and british-english-small at 51K all the way up
to their corresponding -insane variants at 650K.) Most likely you'll
have about 100K words, which is a good number to be working with. If
you're on Windows, see if you can just download something from
wordlist.sourceforge.net or similar; it should be in the same format.

So! Now for the next step. You need to split a pile of letters such
that each of the resulting pieces is a word. You're probably going to
find some that just don't work ("x-15diary" seems dubious), but for
the most part, you should get at least _some_ result. You suggested a
general strategy of breaking strings into the longest words possible,
which would be easy enough to code. A basic algorithm of "take as many
letters as you can while still finding a word" is likely to give you
fairly decent results. You'll need a way of backtracking in the event
that the rest of the letters don't work ("theedgeofphysics" will take
a first word of "thee", but then "dgeofphysics" isn't going to work
out well), but otherwise, I think your basic idea is sound.
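
Here is a minimal sketch of that idea, not a polished implementation. It
assumes the word list has been lowercased and put into a set for fast
lookups; the names split_words, max_len, and dead_ends are made up for
illustration. It tries the longest candidate word first and backs off to
shorter ones whenever the rest of the string can't be split.

```python
def split_words(text, words, max_len=None):
    """Split text into dictionary words, trying the longest word first.

    Returns a list of words, or None if no complete split exists.
    `words` should be a set of lowercase strings (an assumption here).
    """
    if max_len is None:
        # No point testing candidates longer than the longest known word.
        max_len = max(map(len, words), default=0)
    dead_ends = set()  # start positions already known to lead nowhere

    def helper(start):
        if start == len(text):
            return []  # consumed everything: success
        if start in dead_ends:
            return None
        # Longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = helper(end)
                if rest is not None:
                    return [piece] + rest
                # Otherwise backtrack and try a shorter piece.
        dead_ends.add(start)
        return None

    return helper(0)


# Tiny stand-in dictionary so the example runs anywhere.
demo_words = {"the", "thee", "edge", "of", "physics"}
print(split_words("theedgeofphysics", demo_words))
# ['the', 'edge', 'of', 'physics']
```

With a real dictionary you would pass something like
set(w.lower() for w in words) instead of demo_words, and join the result
with "_".join(...). Expect some odd splits, since most word lists include
single letters and obscure short words.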

Should be a fun project!

ChrisA


