Discussion on some Code Issues

Wed Jul 4 20:08:23 EDT 2012

On Jul 4, 6:21 pm, subhabangal... at gmail.com wrote:
> [...]
> To detect the document boundaries, I am splitting them into a bag
> of words and using a simple for loop as,
>
> for i in range(len(bag_words)):
>         if bag_words[i]=="$":
>             print (bag_words[i],i)

Ignoring that you are attacking the problem incorrectly: that is a very
poor method of splitting a string, especially since the Python gods
have given you *power* over string objects (str.split, for one). But you are going to have
an even greater problem if the string contains a "$" char that you DID
NOT insert :-O. You'd be wise to use a sep that is not likely to be in
the file data. For example: "<SEP>" or "<SPLIT-HERE>". But even that
approach is naive! Why not streamline the entire process and pass a
list of file paths to a custom parser object instead?
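
To make both suggestions concrete, here is a minimal sketch, assuming
Python 3. The names SEP, split_documents and read_documents are made up
for illustration and do not come from the original poster's code. The
first function splits a joined string on a sentinel with str.split; the
second skips the separator altogether by taking a list of file paths and
yielding one document per file.

```python
SEP = "<SEP>"  # hypothetical sentinel, assumed never to occur in the file data


def split_documents(text, sep=SEP):
    """Split one joined string back into documents using str.split."""
    # Drop empty chunks that appear when the text starts or ends with sep.
    return [chunk.strip() for chunk in text.split(sep) if chunk.strip()]


def read_documents(paths, encoding="utf-8"):
    """Yield (path, text) pairs, one per file, so no separator is needed."""
    for path in paths:
        with open(path, encoding=encoding) as handle:
            yield path, handle.read()


# A "$" inside the data no longer creates a false document boundary.
joined = "first doc costs $5" + SEP + "second doc"
print(split_documents(joined))  # ['first doc costs $5', 'second doc']
```

With the path-based version the document boundary is simply the file
boundary, so nothing in the data can be mistaken for a separator.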