unicode, bytes redux

Mon Sep 25 04:38:08 EDT 2006

Paul Rubin wrote:
> Leif K-Brooks <eurleif at ecritters.biz> writes:
> > It requires a fairly large change to code and API for a relatively
> > uncommon problem. How often do you need to know how many bytes an
> > encoded Unicode string takes up without needing the encoded string
> > itself?
>
> Shrug. I don't see a real large change--the code would just check for
> an optional arg and process accordingly.  I don't know if the issue
> comes up often enough to be worth making such accommodations for.  I do
> know that we had an extensive newsgroup thread about it, from which
> this discussion came, but I haven't paid that much attention.

Actually, what Willie was concerned about was some cockamamie DBMS
that had to be fed Unicode, which it encoded as UTF-8 and then
silently truncated if the result was longer than the n in
varchar(n) ... or something like that.

So all he needs is a boolean result: u.willitfit(encoding, width)

This can of course be optimised with simple early-loop-exit tests:

    if n_bytes_so_far + n_remaining_uchars > width: return False
    elif n_bytes_so_far + n_remaining_uchars * M <= width: return True
    # where M is the maximum number of bytes per Unicode char for the
    # encoding that's being used.
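
For illustration only, here is a rough sketch of how those two tests
might sit inside a loop, written for present-day Python 3. The names
will_it_fit and max_bytes_per_char are made up for this example and are
not part of any existing API. It assumes a stateless encoding with no
byte-order mark in which every character takes at least one byte
(UTF-8 qualifies, with M = 4 when counting by code point); for other
encodings the per-character encode step and the bounds may be wrong.

```python
def will_it_fit(text, width, encoding="utf-8", max_bytes_per_char=4):
    """Return True if text, once encoded, occupies at most width bytes.

    Hypothetical helper. Assumes each character encodes independently
    to between 1 and max_bytes_per_char bytes (true for UTF-8).
    """
    n_bytes_so_far = 0
    n_remaining_uchars = len(text)
    for ch in text:
        # Lower bound: each remaining character needs at least one byte,
        # so if even that overflows, the answer is already known.
        if n_bytes_so_far + n_remaining_uchars > width:
            return False
        # Upper bound: no remaining character needs more than M bytes,
        # so if the worst case fits, stop early.
        if n_bytes_so_far + n_remaining_uchars * max_bytes_per_char <= width:
            return True
        n_bytes_so_far += len(ch.encode(encoding))
        n_remaining_uchars -= 1
    # Every character has been counted exactly.
    return n_bytes_so_far <= width
```

For short strings, len(text.encode(encoding)) <= width gives the same
answer with less fuss; the loop only pays off when the string is long
and one of the early exits fires well before the end.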

Tell you what, why don't you and Willie get together and write a PEP?

Cheers,
John