SGMLParser eats &auml; etc
Eric Brunel
eric.brunel at N0SP4M.com
Mon Dec 1 04:24:29 EST 2003
Anders Eriksson wrote:
> Hello!
>
> I'm using sgmllib (ActivePython 2.3.2, build 230) and I have some trouble
> with letters that have been coded, e.g. the letter å is coded &aring;, ä is
> coded &auml; and ö is coded &ouml;, all according to the HTML standard.
>
> I use the SGMLParser, and when I call the feed method all the coded letters
> are stripped/eaten.
>
> Why?
> How do I fix this?
The &something; "coding" for accented characters is called an entity in SGML.
Entities are defined in the DTD underlying your document. HTML defines the
"standard" entities you describe, like &aring;, &auml;, etc. But if the DTD
for the document you're parsing does not include these entity definitions,
there's no reason why the parser should do anything with them, although
silently ignoring them seems strange to me (I'd have expected a parsing error).
So there are two solutions:
- either your document is HTML, and you should use an HTML parser, as was
already suggested
- or your document is not HTML, and you should define every entity you may use
in your DTD. This is done for example with:
<!ENTITY auml "ä">
(if you use the iso8859-1 encoding)
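
As an illustration of the first option, here is a minimal sketch. It assumes a
current Python 3, where sgmllib is no longer in the standard library, so it
uses html.parser.HTMLParser and the html.entities.name2codepoint table instead.
The names TextCollector and extract_text are made up for this example. Named
entities are looked up in the HTML entity table; unknown ones are kept verbatim
rather than silently dropped.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect text, resolving entities such as &auml;."""

    def __init__(self):
        # convert_charrefs=False makes the parser call the two handlers
        # below instead of decoding the references on its own.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Named entity, e.g. "auml" for &auml;
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Not an HTML entity: keep it as written instead of eating it.
            self.chunks.append("&%s;" % name)
        else:
            self.chunks.append(chr(codepoint))

    def handle_charref(self, name):
        # Numeric reference, e.g. "228" for &#228; or "xe4" for &#xe4;
        if name.lower().startswith("x"):
            self.chunks.append(chr(int(name[1:], 16)))
        else:
            self.chunks.append(chr(int(name)))


def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(extract_text("<p>&aring; &auml; &ouml;</p>"))
```

Leaving convert_charrefs at its default of True lets HTMLParser decode the
references itself and pass the result to handle_data; the explicit handlers are
spelled out here only to show where the entity lookup happens.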
HTH
--
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com