Unicode -> UTF-8

Alex Martelli aleax at aleax.it
Mon Sep 3 05:39:07 EDT 2001


"Ignacio Vazquez-Abrams" <ignacio at openservices.net> wrote in message
news:mailman.999477366.3290.python-list at python.org...
> What's the easiest way in C to get the contents of a PyUnicodeObject (or a
> PyStringObject for that matter) as UTF-8?

What about (warning, untested code):

void
treat_as_utf8(PyObject* pSomeUnicode)
{
    char* buffer;
    int length;
    int rc;

    PyObject* pAstring =
        PyObject_CallMethod(pSomeUnicode, "encode", "s", "utf8");
    if(!pAstring) return;

    rc = PyString_AsStringAndSize(pAstring, &buffer, &length);

    if(rc==0) {
        /* snipped: use buffer (READ-ONLY!) & length as you wish */
    }

    Py_DECREF(pAstring);
}

Is this what you had in mind?
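
One thing to keep in mind when you use buffer and length: length
counts bytes, not characters, since any non-ASCII character takes
more than one byte in UTF-8. Here is a minimal sketch of counting
code points in such a buffer. It assumes the bytes are valid UTF-8
(which is what the "utf8" codec should hand you), and
count_utf8_codepoints is a hypothetical helper name of my own, not
part of the Python C API:

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Python C API.
   Counts the code points in a UTF-8 buffer by skipping continuation
   bytes (those of the form 10xxxxxx).  Assumes the buffer holds valid
   UTF-8; it only reads the buffer, so it is safe on read-only data. */
size_t
count_utf8_codepoints(const char* buffer, int length)
{
    size_t count = 0;
    int i;

    for(i = 0; i < length; i++) {
        /* every byte that is not a continuation byte starts a code point */
        if(((unsigned char)buffer[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Inside the if(rc==0) branch above you could call it as
count_utf8_codepoints(buffer, length); it takes a const pointer, so
it respects the read-only restriction on the string's buffer.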


There are also the "es" and "es#" formats, which I think you should
be able to use directly with PyArg_Parse, a la:

    buffer = 0;
    rc = PyArg_Parse(pSomeUnicode, "es#", "utf8", &buffer, &length);

    /* check rc & use buffer/length, read/write now */

    PyMem_Free(buffer);

but I'm unclear on the status/usability of PyArg_Parse -- the
current docs say it's all right to use it to analyze other
objects (not arguments), but I've seen people "in the know"
flat out advise against using it, claiming it is in fact
deprecated (without qualifying that warning with "for
argument parsing only").  Maybe this thread can lead to some
useful clarification in this regard...?



Alex





