[XML-SIG] Strings or Unicode ?

M.-A. Lemburg mal@lemburg.com
Fri, 09 Nov 2001 10:12:06 +0100


"Martin v. Loewis" wrote:
> 
> > The question still remains, though: is this an acceptable approach
> > in practice ? (Converting Unicode back to strings has its cost and
> > it might be worthwhile having the 8-bit string approach available
> > too.)
> 
> That depends on the processing you want to do after parsing. In
> general, I'd argue that not having to deal with encodings, but being
> able to rely on everything being Unicode, simplifies the application.
> The only known drawback is that it may be difficult to restore the
> original document: You may forget what the input encoding was,
> and in what places character references had been used.

True, even though I believe that you can tune the tools to
"remember" this information. For most applications which work
on computer-generated and computer-read data, this probably
isn't worth it, but it could be worth it for human-edited XML
files.
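
A minimal sketch of what "remembering" the input encoding could look
like, assuming modern Python 3 and the standard library's expat
binding. The helper name parse_remembering_encoding is hypothetical,
and the sketch only records the declared encoding; it does not track
where character references were used.

```python
import xml.parsers.expat

def parse_remembering_encoding(data):
    """Parse XML bytes; return (declared encoding or None, text content)."""
    info = {"encoding": None}
    chunks = []

    def on_xml_decl(version, encoding, standalone):
        # Called once if the document starts with an XML declaration.
        info["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    # Character data may arrive in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return info["encoding"], "".join(chunks)

source = b'<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\xe9</doc>'
encoding, text = parse_remembering_encoding(source)

# All processing happens on Unicode text; the remembered encoding is
# only needed again when writing the data back out.
# Assumption: fall back to UTF-8 when no encoding was declared.
restored = text.encode(encoding or "utf-8")
```

Keeping the encoding next to the parsed data lets the output stage
write the document back in the form the human editor originally used.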

After having looked into handling the various problems you run into 
when expanding entities, I found that the all-Unicode approach fits
best: you simply don't have the problem of character references
not expanding due to constraints on the range of supported
code points. That certainly makes life easier.
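
To illustrate the point, here is a small sketch, assuming modern
Python 3. The helper expand_char_refs is hypothetical and handles only
numeric character references; it is not the code from my parser.

```python
import re

# Matches numeric character references such as &#8364; or &#xE9;
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_char_refs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # chr() raises ValueError for values beyond U+10FFFF.
        return chr(code_point)
    return _CHAR_REF.sub(repl, text)

expanded = expand_char_refs("price: 10 &#8364; / caf&#xE9;")
print(expanded)

# With Unicode text every reference expands. An 8-bit Latin-1
# representation has no slot for U+20AC (the euro sign):
try:
    expanded.encode("latin-1")
except UnicodeEncodeError as exc:
    print("not representable in Latin-1:", exc.reason)
```

With an 8-bit target the parser would have to leave such references
unexpanded or signal an error; with Unicode the question never comes up.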

One thing I'd be curious to know is whether it's common practice
to restrict XML tag names and attribute names to Latin-1 or even
ASCII... ? I've used the approach of using Latin-1 8-bit strings
if possible and reverting to Unicode objects for cases where this
doesn't work. I've never seen an XML file with non-ASCII tag 
names, so I suppose the Unicode case is artificial.
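
The fallback described above can be sketched roughly as follows,
assuming modern Python 3, where bytes stands in for the 8-bit string
and str for the Unicode object. The function name compact_name is
hypothetical.

```python
def compact_name(name):
    """Return Latin-1 bytes if the name fits, otherwise the Unicode str."""
    try:
        return name.encode("latin-1")
    except UnicodeEncodeError:
        # At least one character lies outside Latin-1: keep Unicode.
        return name

ascii_tag = compact_name("title")
latin1_tag = compact_name("t\xedtulo")
other_tag = compact_name("\u6807\u9898")
```

The cost of this approach is that every consumer of the names has to
cope with two different types.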

FWIW, I've changed my parser to now use Unicode exclusively
and to my surprise things got faster :-) 
 
> I personally don't consider this a drawback: In most applications,
> it is a good thing if the application "normalizes" the XML documents,
> since that reduces the hassles in later processing stages.
> 
> In the early days after Python 2, there was a desire to make Unicode
> optional in PyXML, by propagating a "wants_unicode" flag throughout
> the processing chain. Even though this is still supported in a number
> of places, I doubt it is used much.

Thanks for the suggestions,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/