[Python-Dev] codecs question

M.-A. Lemburg mal@lemburg.com
Sat, 30 Sep 2000 12:21:43 +0200


Martin von Loewis wrote:
> 
> > the "unicodenames" patch (which replaces ucnhash) includes this
> > functionality -- but with a little distance, I think it's better to add
> > it to the unicodedata module.
> >
> > (it's included in the step 4 patch, soon to be posted to a patch
> > manager near you...)
> 
> Sounds good. Is there any chance to use this in codecs, then?

If you need speed, you'd have to write a C codec for this.
And yes: the ucnhash module does export a C API using a
PyCObject, which you can use to access the static C data
table.

Don't know if Fredrik's version will also support this.

I think a C function as access method would be more generic
than the current direct C table access.

> I'm thinking of
> 
> >>> print u"\N{COPYRIGHT SIGN}".encode("ascii-ucn")
> \N{COPYRIGHT SIGN}
> >>> print u"\N{COPYRIGHT SIGN}".encode("latin-1-ucn")
> ©
> 
> Regards,
> Martin
> 
> P.S. Some people will recognize this as the disguised question 'how
> can I convert non-convertible characters using the XML entity
> notation?'

If you just need a single encoding, e.g. Latin-1, simply clone
the codec (it's coded in unicodeobject.c) and add the XML entity
processing.
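
A rough sketch of what such a single-encoding variant would do, written
in pure Python (current Python 3 syntax) rather than as a clone of the C
code in unicodeobject.c. The function name encode_latin1_xml is
hypothetical, and decimal character references (&#N;) are assumed as the
XML notation:

```python
def encode_latin1_xml(text):
    """Encode text as Latin-1; characters outside Latin-1 are written as
    XML decimal character references (hypothetical helper, not a codec)."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code < 256:
            # Latin-1 maps code points 0-255 directly onto byte values.
            out.append(code)
        else:
            # Everything else becomes an ASCII-only reference such as &#8364;
            out += ("&#%d;" % code).encode("ascii")
    return bytes(out)
```

For example, encode_latin1_xml(u"\N{COPYRIGHT SIGN} \N{EURO SIGN}") keeps
the copyright sign as the single byte 0xA9 and writes the euro sign as
&#8364;.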

Unfortunately, reusing the existing codecs is not very
efficient, because there is no error handling scheme
which would permit you to say "encode as far as you can
and then return the encoded data plus a position marker
in the input stream/data".

Perhaps we should add a new standard error handling
scheme "break" which simply stops encoding/decoding
whenever an error occurs?

This should then allow reusing existing codecs by
processing the input in slices.
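
The slice idea can be approximated in current Python 3 without a "break"
scheme, because UnicodeEncodeError carries start/end positions. The
sketch below is only an illustration under that assumption: the name
encode_in_slices is hypothetical, decimal character references are
assumed as the XML notation, and the target encoding is assumed to be
ASCII-compatible and stateless.

```python
def encode_in_slices(text, encoding):
    """Reuse an existing codec slice by slice; characters it rejects are
    written as XML decimal character references (hypothetical helper)."""
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            # Try to encode everything that is left in one call.
            chunks.append(text[pos:].encode(encoding, "strict"))
            break
        except UnicodeEncodeError as exc:
            # exc.start and exc.end index into the slice that was passed in,
            # so shift them back to positions in the full text.
            bad_start, bad_end = pos + exc.start, pos + exc.end
            # Encode the clean prefix, then substitute the rejected run.
            chunks.append(text[pos:bad_start].encode(encoding, "strict"))
            for ch in text[bad_start:bad_end]:
                chunks.append(("&#%d;" % ord(ch)).encode(encoding))
            pos = bad_end
    return b"".join(chunks)
```

Note that the clean prefix gets encoded twice here (once in the failed
attempt, once after the error); a "break"-style scheme returning the
encoded data plus a position would avoid exactly that. Later Python
versions also provide an "xmlcharrefreplace" error handler that covers
the XML case directly.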

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/