[Expat-discuss] Parsing copyright symbol

Warren Young warren at etr-usa.com
Tue Jul 8 01:56:17 CEST 2008


М.В. wrote:
> 
> ...copyright symbol (code 0xae)...utf8? 

You are confused on a number of fronts:

First, 0xAE is not a valid UTF-8 sequence by itself.  UTF-8 encodes 
every character above U+007F as a sequence of two or more bytes, each 
0x80 or higher in value.  Read this for the details:

	http://en.wikipedia.org/wiki/UTF-8
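
To make the first point concrete, here is a small Python sketch (Python 
is my choice for illustration, not something from the original 
question).  It checks whether a lone 0xAE byte decodes as UTF-8, and 
shows what U+00AE looks like once it is properly encoded:

```python
# A lone 0xAE byte is a continuation byte with no lead byte before it,
# so a strict UTF-8 decoder should reject it.
lone = b"\xae"
try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# The character U+00AE, encoded as UTF-8, becomes two bytes,
# both of them 0x80 or above.
encoded = "\u00ae".encode("utf-8")

print(lone_is_valid)   # False
print(encoded.hex())   # c2ae
```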

Second, 0xAE is the registered trademark symbol in ISO 8859-1 (Latin-1), 
not a copyright symbol.  The copyright symbol is 0xA9 in Latin-1.
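
If you want to check which symbol a Latin-1 byte really is, a quick 
Python sketch like the following will do it.  It assumes the bytes are 
ISO 8859-1 and uses the standard unicodedata module to look up the 
official character names:

```python
import unicodedata

# Decode each byte as ISO 8859-1 and look up its Unicode character name.
names = {}
for byte in (0xAE, 0xA9):
    ch = bytes([byte]).decode("iso-8859-1")
    names[byte] = unicodedata.name(ch)
    print(hex(byte), ch, names[byte])

# 0xae ® REGISTERED SIGN
# 0xa9 © COPYRIGHT SIGN
```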

Third, XML defaults to UTF-8, so unless you declare a different 
encoding in the <?xml ...?> declaration, that's what Expat will use.  
Either convert your data to UTF-8, or tell Expat the truth about your 
document's content:

	<?xml version="1.0" encoding="iso-8859-1"?>

I'm just guessing about it being 8859-1.  It could be 8859-15, or 
probably several other encodings.
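
Here is a rough sketch of the difference the declaration makes, using 
Python's xml.parsers.expat binding to Expat.  The <doc> element and the 
parse_text helper are made up for the example, and the 8859-1 encoding 
is still only my guess about your data:

```python
import xml.parsers.expat

def parse_text(document: bytes) -> str:
    """Hypothetical helper: parse a document, return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# Made-up document body containing a raw 0xAE byte.
body = b"<doc>\xae</doc>"

# With the declaration, Expat is told the bytes are Latin-1.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
text = parse_text(declared)

# Without it, Expat assumes UTF-8 and should reject the lone 0xAE byte.
try:
    parse_text(body)
    undeclared_ok = True
except xml.parsers.expat.ExpatError as err:
    undeclared_ok = False
    print("undeclared document rejected:", err)

print(text == "\u00ae", undeclared_ok)
```

The other route is to convert the data to UTF-8 before parsing, in 
which case no encoding declaration is needed at all.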

