[Expat-discuss] Problems with utf8_toUtf8

Josh Martin Josh.Martin@abq.sc.philips.com
Tue Aug 6 12:08:23 2002


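Okay, as far as I can tell, 0xc0 and 0x80 are not a two-byte sequence here.  
0xc0 is a bit mask (binary 11000000) that keeps only the top two bits of a byte, 
and 0x80 (binary 10000000) is the value those two bits are compared against.  In 
UTF-8, ASCII characters are encoded in one byte of the form 0xxxxxxx, the first 
byte of a multi-byte character has the form 11xxxxxx, and every following byte 
of that character has the form 10xxxxxx.  So the test asks whether the byte just 
before fromLim is one of those following ("trailing") bytes.

So basically, from what I see, the first for loop only runs when the "from" 
buffer holds more bytes than the "to" buffer has room for.  It starts fromLim at 
the point where the "to" buffer would be full, moves it backwards while the byte 
just before it is a trailing byte, and breaks out at the first byte that is not 
one.  The second for loop then copies the bytes up to fromLim from the "from" 
buffer to the "to" buffer unchanged.

So my guess would be this function is used to copy one UTF-8 buffer to another, 
cutting the copy short when the destination is too small.  I do have reason to 
believe that my analysis is not entirely correct, especially since it looks to 
me as though the copy could still end right after the first byte of a multi-byte 
character, but Fred will have to tell us the truth.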
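
The "& 0xc0" and "!= 0x80" test in the quoted code can be tried out in 
isolation.  What follows is only a minimal sketch in C, assuming the usual UTF-8 
layout in which every byte after the first byte of a multi-byte character has 
the form 10xxxxxx.  The names is_utf8_trail_byte and bytes_to_copy are made up 
for illustration and are not part of Expat; bytes_to_copy just mirrors the first 
loop of the quoted code, using an index where the original moves the fromLim 
pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if b has the bit pattern 10xxxxxx, i.e. it looks like a
   UTF-8 trailing byte, and 0 otherwise.  0xc0 keeps the top two bits;
   0x80 is the pattern "10" in those two bits. */
int is_utf8_trail_byte(unsigned char b)
{
  return (b & 0xc0) == 0x80;
}

/* Hypothetical helper (not an Expat function).  Given len source bytes
   in buf and room bytes of space in the destination, returns how many
   bytes the quoted code would go on to copy. */
size_t bytes_to_copy(const char *buf, size_t len, size_t room)
{
  size_t n;

  assert(buf != NULL);
  if (len <= room)
    return len;                 /* everything fits, nothing to trim */
  /* Step backwards while the byte before the cut is a trailing byte. */
  for (n = room; n > 0; n--)
    if (!is_utf8_trail_byte((unsigned char)buf[n - 1]))
      break;
  return n;
}
```

With these definitions, is_utf8_trail_byte gives 0 for 0x41 ('A') and for 0xc3 
(a first byte), and 1 for 0xa9 and 0x80.  bytes_to_copy("abcd", 4, 2) gives 2, 
since plain ASCII bytes never trigger the backwards step.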

My question is, why does this function concern you?  What problems are you 
having with it?

 - Josh Martin

> Hi,
> 
> I'm still working to adapt Expat to AS/400. The code behaves differently
> in the following function:
> 
> utf8_toUtf8
> 
>   if (fromLim - *fromP > toLim - *toP) {
>     /* Avoid copying partial characters. */
>     for (fromLim = *fromP + (toLim - *toP); fromLim > *fromP; fromLim--)
>       if (((unsigned char)fromLim[-1] & 0xc0) != 0x80)
> 
> --> WHY 0xc0 and 0x80 ??
> 
>         break;
>   }
>   for (to = *toP, from = *fromP; from != fromLim; from++, to++)
>     *to = *from;
>   *fromP = from;
>   *toP = to;
> 
> I guess it has something to do with ASCII codes. Can anyone clarify what
> this function does?
> 
> Thanks!
> Marta
>