Python library to break text into words

Abdur-Rahmaan Janhangeer arj.python at gmail.com
Thu May 31 23:29:27 EDT 2018


1-> search the dictionary and identify every word contained in the string, for example:

meaningsofoffers

identified words:

me
an
mean
in
meaning
meanings
so
of
of
offer
offers

2-> next, filter out duplicates (e.g. "of" above) into a new list; the
original list is kept as the chronological (order of appearance) reference
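
Points 1 and 2 above could be sketched roughly as follows. This is only an illustration: the tiny WORDS set and the function names find_words and unique_words are made up for the example, and a real run would load a full word list.

```python
# Hypothetical toy dictionary; a real program would load a full word list.
WORDS = {"me", "an", "mean", "in", "meaning", "meanings",
         "so", "of", "offer", "offers"}

def find_words(text, words=WORDS):
    """Point 1: return (start, word) for every dictionary word found in text."""
    hits = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                hits.append((start, text[start:end]))
    return hits

def unique_words(hits):
    """Point 2: drop repeated words, keeping first-occurrence order."""
    return list(dict.fromkeys(word for _, word in hits))

hits = find_words("meaningsofoffers")   # original list, kept as reference
print(hits)
print(unique_words(hits))               # new list without the second "of"
```

Keeping the start index next to each word is one way to let the original list act as the chronological reference for the later points.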

3-> next, choose the words whose lengths add up to the length of the string

4-> if there are several solutions, choose the ones that are non-overlapping
and chronologically sound (the words follow each other as in the string)
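
One possible way to get points 3 and 4 at once is to build the splits left to right, so that every result is non-overlapping, in order, and covers the whole string by construction. The sketch below assumes that approach; the name segmentations and the sample WORDS set are hypothetical.

```python
# Hypothetical toy dictionary, same words as in the example above.
WORDS = {"me", "an", "mean", "in", "meaning", "meanings",
         "so", "of", "offer", "offers"}

def segmentations(text, words):
    """Yield every way to split text into dictionary words, left to right."""
    if not text:
        yield []          # nothing left: the words chosen so far fit exactly
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            for rest in segmentations(text[end:], words):
                yield [head] + rest

for split in segmentations("meaningsofoffers", WORDS):
    print(split)
```

With this toy dictionary the only split that survives is meanings / of / offers, whose lengths (8 + 2 + 6) make up the 16 letters of the string.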

5-> unused letters are treated as words when non-natural words are
included; that can be problematic if sub-words are found inside them, in
which case point 7 might be the way to go

6-> when non-regular words are included, the program returns the best
solutions for the user to choose from

I have branded the 6 points above as the Arj.mu Algorithm of Word
Extraction in Connected Letters
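
Points 5 and 6 might look something like the sketch below. It is an assumption-heavy illustration, not the algorithm itself: leftover letters are kept together as one unknown piece, and "best" is taken to mean fewest unknown letters, then fewest pieces. The names candidate_splits and best_splits, the limit parameter and the sample words are all made up. The search is exhaustive, so it only suits short strings such as titles.

```python
# Hypothetical toy dictionary for the example.
WORDS = {"python", "list", "on", "is"}

def candidate_splits(text, words, prev_unknown=False):
    """Yield splits whose pieces are dictionary words or unknown leftovers.

    Each piece is (chunk, is_word). Two unknown pieces are never adjacent,
    so a run of unused letters stays together as one pseudo-word.
    """
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            for rest in candidate_splits(text[end:], words, False):
                yield [(head, True)] + rest
        if not prev_unknown:
            for rest in candidate_splits(text[end:], words, True):
                yield [(head, False)] + rest

def best_splits(text, words, limit=3):
    """Point 6: return the top-ranked splits for the user to choose from."""
    def cost(split):
        unknown = sum(len(chunk) for chunk, is_word in split if not is_word)
        return (unknown, len(split))   # assumed ranking, see note above
    ranked = sorted(candidate_splits(text, words), key=cost)
    return [[chunk for chunk, _ in split] for split in ranked[:limit]]

for option in best_splits("pythonxyzlist", WORDS):
    print(option)
```

The first option printed is python / xyz / list; the lower-ranked ones show the problem mentioned in point 5, where sub-words such as "on" or "is" get picked out of the wrong place.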

7-> if machine learning is used, the user's choices from point 6 serve as
training data (in an everyday usage app), or the model can train directly
on predefined examples

8-> if typos are assumed to be found in titles, then the title should be
assumed to contain the corrected words and a new search is done on this
assumed title. in that case the results are added to those of the
non-corrected version and then point 6 above is executed

8.1-> for the assumptions in 8, natural language modules might be used
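
As a rough stand-in for point 8, the standard library's difflib.get_close_matches can propose dictionary words close to a suspected typo. It is a plain string-similarity matcher, not a natural language module, and the helper name suggest_corrections, the cutoff value and the sample words are assumptions for the example.

```python
import difflib

# Hypothetical toy dictionary for the example.
WORDS = {"me", "an", "mean", "in", "meaning", "meanings",
         "so", "of", "offer", "offers"}

def suggest_corrections(chunk, words, n=3, cutoff=0.8):
    """Return up to n dictionary words that closely resemble chunk."""
    # sorted() only makes the candidate order reproducible between runs
    return difflib.get_close_matches(chunk, sorted(words), n=n, cutoff=cutoff)

# "meanigs" is a made-up typo for "meanings"
print(suggest_corrections("meanigs", WORDS))
```

Each suggestion could then be substituted into the title and the search run again, with the results added to those of the uncorrected title before point 6.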

9-> titles can contain numbers, dates, author names and other items, and as
such are not covered by the points above


Abdur-Rahmaan Janhangeer
https://github.com/Abdur-rahmaanJ


