how to avoid leading white spaces

Roy Smith roy at panix.com
Thu Jun 2 23:44:40 EDT 2011


In article <is9ikg083h at news1.newsguy.com>,
 Chris Torek <nospam at torek.net> wrote:

> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).

I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't 
unicode inherently 32 bit?  Or at least 20-something bit?  Things like 
UTF-16 are just one way to encode it.

In any case, while I could imagine building a 2^16 entry jump table, 
clearly it's infeasible (with today's hardware) to build a 2^32 entry 
table. But, there's nothing that really requires you to build a table at 
all.  If I understand the algorithm right, all that's really required is 
that you can map a character to a shift value.

For an 8 bit character set, an indexed jump table makes sense.  For a 
larger character set, I would imagine you would do some heuristic 
pre-processing to see if your search string consisted only of characters 
in one unicode plane and use that fact to build a table which only 
indexes that plane.  Or, maybe use a hash table instead of a regular 
indexed table.  Not as fast, but only slower by a small constant factor, 
which is not a horrendous price to pay in a fully i18n world :-)



More information about the Python-list mailing list