SGMLParser eats &auml; etc

Eric Brunel eric.brunel at N0SP4M.com
Mon Dec 1 04:24:29 EST 2003


Anders Eriksson wrote:
> Hello!
> 
> I'm using sgmllib (ActivePython 2.3.2, build 230) and I have some trouble
> with letters that have been coded, e.g. the letter å is coded &aring;, ä is
> coded &auml; and ö is coded &ouml;, all according to the HTML standard.
> 
> I use the SGMLParser, and when I call the feed method all the coded letters
> are stripped/eaten.
> 
> Why?
> How do I fix this?

The &something; "coding" for accented characters is called an entity reference
in SGML. These entities are all defined in the underlying DTD for your document.
HTML defines the "standard" entities you describe, like &aring;, &auml;, etc.
But if the DTD for the document you're parsing does not include these entity
definitions, there's no reason why the parser should do anything with them,
although silently ignoring them seems strange to me (I'd have expected a
parsing error).
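
To illustrate why the definitions matter, here is a minimal sketch in
present-day Python 3 (not the Python 2.3 discussed in this thread). The names
expand_entities, html_definitions and minimal_definitions are made up for
illustration; html.entities.name2codepoint is the standard library's table of
HTML entity names (the same table lives in htmlentitydefs in Python 2). It is
not how SGMLParser works internally, only a picture of the principle: a
reference can be expanded only if a definition for it is available.

```python
import re
from html.entities import name2codepoint  # HTML's standard entity table

# Matches a named entity reference such as &auml;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text, definitions):
    """Replace each &name; with its definition; keep undefined ones as-is."""
    def substitute(match):
        name = match.group(1)
        if name in definitions:
            return definitions[name]
        return match.group(0)  # no definition known: leave the reference alone
    return ENTITY_REF.sub(substitute, text)


# Every entity that HTML defines, mapped to the character it stands for.
html_definitions = {name: chr(cp) for name, cp in name2codepoint.items()}

# A table that only knows the markup-related entities (illustrative).
minimal_definitions = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

sample = "sm&aring; r&auml;ksm&ouml;rg&aring;sar &amp; mer"
print(expand_entities(sample, html_definitions))     # små räksmörgåsar & mer
print(expand_entities(sample, minimal_definitions))  # sm&aring; r&auml;ksm&ouml;rg&aring;sar & mer
```

With the full HTML table the accented letters come out; with the minimal table
only &amp; is expanded and the rest cannot be resolved.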

So there are two solutions:
- either your document is HTML, and you should use an HTML parser, as was 
already suggested
- or your document is not HTML, and you should define all entities you may use 
in your DTD. This is done for example with:
<!ENTITY auml "ä">
(if you use the iso8859-1 encoding)
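
As a rough sketch of the first option in present-day Python 3 (where sgmllib
no longer exists), the standard library's html.parser.HTMLParser knows the
HTML entities and hands the decoded characters to handle_data. The class name
TextCollector and the helper extract_text are hypothetical, chosen only for
this example; this is not the code from the original thread.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content of a document, with entities decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn &auml; etc. into
        # the corresponding characters before calling handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Feed markup to a fresh parser and return the collected text."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(extract_text("<p>&aring; &auml; &ouml;</p>"))  # å ä ö
```

Nothing is "eaten" here because the parser carries the HTML entity
definitions itself, which is the situation described in the first option.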

HTH
-- 
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com




