Unicode chr(150) en dash

"Martin v. Löwis" martin at v.loewis.de
Thu Apr 17 15:12:30 EDT 2008


> For example, I got that EN DASH out of a web page which states <?xml
> version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Please do trust us that ISO-8859-1 does *NOT* support EN DASH.
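
A minimal sketch, assuming Python 3, that makes this checkable in an
interpreter. The comparison with cp1252 and cp437 is an added
illustration of how byte 150 reads under other encodings, not a claim
about which one your tools actually used:

```python
import unicodedata

EN_DASH = "\u2013"

# ISO-8859-1 only covers U+0000..U+00FF, so EN DASH cannot be encoded.
try:
    EN_DASH.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Byte 150 (0x96) means different things under different encodings.
raw = bytes([150])
as_latin1 = raw.decode("iso-8859-1")  # U+0096, a C1 control character
as_cp1252 = raw.decode("cp1252")      # EN DASH
as_cp437 = raw.decode("cp437")        # lowercase u with circumflex

print(latin1_ok)
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
print(unicodedata.name(as_cp1252))
print(unicodedata.name(as_cp437))
```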

There are two possible explanations for the behavior you observed:
a) even though the file was declared ISO-8859-1, the data in it
   actually didn't use that encoding. The browser somehow found out,
   and chose a different encoding from the declared one.
b) the web page contained the character reference &#x2013; (or &#8211;),
   or the entity reference &ndash;. XML lets you represent arbitrary
   Unicode characters this way, even in a file that is encoded in ASCII.
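
Case b) can be sketched as follows, assuming Python 3.4 or later,
where html.unescape is available. It resolves the hexadecimal and
decimal character references and the HTML named entity for EN DASH.
Note that the named form is only defined in XML when a DTD (such as
the XHTML one) declares it:

```python
import html

# Three ASCII-only spellings of the same character, U+2013 EN DASH.
spellings = ["&#x2013;", "&#8211;", "&ndash;"]
decoded = [html.unescape(s) for s in spellings]

# The markup itself is pure ASCII, so it is valid in an ISO-8859-1 file.
ascii_ok = all(s.encode("ascii") for s in spellings)

print([hex(ord(c)) for c in decoded])
print(ascii_ok)
```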

> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web page's text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.
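
One way to do that check programmatically, as a rough sketch in
Python 3: the helper name find_dash_evidence and the sample bytes are
made up for illustration. It counts ASCII references to EN DASH and
raw 0x96 bytes in the undecoded page source:

```python
import re

def find_dash_evidence(source: bytes) -> dict:
    """Report how an en dash might be spelled in raw page bytes."""
    # Hex reference, decimal reference, or HTML named entity.
    refs = re.findall(rb"&(?:#[xX]2013|#8211|ndash);", source)
    return {
        "references": [r.decode("ascii") for r in refs],
        "raw_0x96_bytes": source.count(b"\x96"),
    }

# Hypothetical sample; real input would be the page's undecoded bytes.
sample = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
          b'<p>2007&#8211;2008 and 1\x962</p>')
report = find_dash_evidence(sample)
print(report)
```

For a live page, the bytes could come from
urllib.request.urlopen(url).read() before any decoding is applied.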

> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Ok, the source spells the EN DASH (–) as a character reference in
several places, as well as the quotation marks “ and ”, so this is
case b).

HTH,
Martin


