raw_input() and utf-8 formatted chars

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Fri Nov 2 07:07:05 EDT 2007


On Thu, 01 Nov 2007 19:21:03 -0700, 7stud wrote:

> BeautifulSoup can convert an html entity representing an 'A' with
> umlaut, e.g.:
> 
> &Auml;
> 
> into an Ä without ever touching my keyboard.  How does BeautifulSoup
> do it?

It maps HTML entity names to Unicode characters.  Take a look at the
`htmlentitydefs` module.
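
A minimal sketch of that kind of lookup, written for Python 3, where
`htmlentitydefs` was renamed to `html.entities`.  The helper name
`entity_to_char` is made up for illustration, and this is not
BeautifulSoup's actual code, only the general idea of a name-to-character
table:

```python
# Python 3 sketch; in Python 2 the same table lives in `htmlentitydefs`.
from html.entities import name2codepoint


def entity_to_char(name):
    """Return the Unicode character for an entity name such as 'Auml'.

    Hypothetical helper, not part of BeautifulSoup or the standard library.
    """
    return chr(name2codepoint[name])


a_umlaut = entity_to_char("Auml")

# The character itself, as a Unicode string
print(repr(a_umlaut))

# The same character encoded as UTF-8: the two bytes 0xC3 0x84
print(a_umlaut.encode("utf-8"))
```

No keyboard input is involved: the entity name is just a key into a
dictionary of code points.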

> from BeautifulSoup import BeautifulStoneSoup as bss
> 
> 
> s1 = "<h1>&Auml;</h1>"  #&_Auml;_
> #I added the comment after the line to show the
> #format of the html entity.  In case a browser
> #might render the comment into the actual character,
> #I added underscores to the html entity:
> 
> soup = bss(s1)
> text = soup.contents[0].string  #gets the 'A' with umlaut out of the html
> 
> new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
> print repr(new_s)
> print new_s
> 
> I see the same output for both print statements, and what I see is an
> 'A' with umlaut.  I expected that the first print statement would show
> the utf-8 encoding for the character.

Well it does, and apparently your terminal, or wherever the output goes,
decodes that UTF-8 encoded 'Ä' and shows it.  If you expected the output
'\xc3\x84', remember that you are asking the soup object for its
representation, not a plain string.  The object itself decides what
`repr(obj)` returns, and soup objects represent themselves as UTF-8 encoded
strings.
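
To illustrate the point about `repr()`, here is a small sketch in
Python 3.  `FakeSoup` is a made-up stand-in, not BeautifulSoup's real
class; it only shows that a class is free to return its bare text from
`__repr__`, whereas a byte string escapes non-ASCII bytes in its own
representation:

```python
class FakeSoup(object):
    """Hypothetical stand-in for a soup object (not the real class)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        # The class chooses its own representation: here, the bare text
        # with no quotes and no escapes.
        return self.text


soup = FakeSoup("<h1>\u00c4</h1>")

# Both calls go through methods defined by FakeSoup, so both show the text.
soup_repr = repr(soup)
print(soup_repr)
print(soup)

# A byte string decides differently: its repr escapes non-ASCII bytes.
# (In Python 2 the repr of a plain str has the same escapes, without
# the b prefix.)
bytes_repr = repr("\u00c4".encode("utf-8"))
print(bytes_repr)
```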

Ciao,
	Marc 'BlackJack' Rintsch


