Unicode chr(150) en dash
"Martin v. Löwis"
martin at v.loewis.de
Thu Apr 17 15:12:30 EDT 2008
> For example, I got that EN DASH out of a web page which states <?xml
> version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?
Please do trust us that ISO-8859-1 does *NOT* support EN DASH.
There are two possible explanations for the behavior you observed:
a) even though the file was declared ISO-8859-1, the data in it
actually didn't use that encoding. The browser somehow found out,
and chose a different encoding from the declared one.
b) the web page contained the character reference &#8211; (or &#x2013;),
or the entity reference &ndash;. XML lets a document use arbitrary
Unicode characters this way, even in a file that is encoded in ASCII.
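
Both explanations can be checked from a Python prompt. The following is
only a minimal sketch, assuming Python 3; the choice of windows-1252
(cp1252) as the "different encoding" in case a) is an assumption made for
illustration, and the sample byte strings are made up rather than taken
from the page.

```python
import html

EN_DASH = "\u2013"

# ISO-8859-1 has no EN DASH, so encoding the character must fail.
try:
    EN_DASH.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Case a): byte 0x96 (decimal 150) is a C1 control character in
# ISO-8859-1, while windows-1252 (assumed here) assigns EN DASH to it.
as_latin1 = b"\x96".decode("iso-8859-1")
as_cp1252 = b"\x96".decode("cp1252")

# Case b): the file holds only ASCII bytes; the references are turned
# into U+2013 after decoding. The sample text is hypothetical.
ascii_source = b"2007&#8211;2008, 2007&#x2013;2008, 2007&ndash;2008"
resolved = html.unescape(ascii_source.decode("ascii"))

print(latin1_ok)
print(repr(as_latin1), repr(as_cp1252))
print(resolved)
```

In case b) the declared encoding never has to represent the EN DASH at
all, because the character only appears once the references are resolved.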
> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web pages text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?
Read the source of the web page to be certain.
> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html
Ok, this says &#8211; in several places, as well as &#8220; and &#8221;
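
One way to read a page source for such references is to list them with
their code points. This is a rough sketch, assuming Python 3;
describe_references, REFERENCE and sample_source are hypothetical names,
and the sample string is a made-up excerpt, not the actual page source.

```python
import html
import re
import unicodedata

# Hypothetical excerpt for illustration; NOT copied from the real page.
sample_source = "&#8220;A&#8221; &#8211; B"

# Matches decimal (&#8211;), hexadecimal (&#x2013;) and named (&ndash;)
# references.
REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def describe_references(source):
    """Return (reference, code point, Unicode name) for each reference."""
    found = []
    for ref in REFERENCE.findall(source):
        char = html.unescape(ref)
        # Unknown names come back unchanged and are skipped here.
        if len(char) == 1:
            found.append((ref, "U+%04X" % ord(char), unicodedata.name(char, "?")))
    return found


for row in describe_references(sample_source):
    print(row)
```

Any reference that resolves to a code point above U+00FF names a
character that ISO-8859-1 cannot hold as a raw byte.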
HTH,
Martin