raw_input() and utf-8 formatted chars

7stud bbxx789_05ss at yahoo.com
Thu Nov 1 22:21:03 EDT 2007


On Oct 13, 12:42 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
> You can
> decode that into the actual UTF-8 string with decode("string_escape"):
>
> s = raw_input('Enter: ')   #A\xcc\x88
> s = s.decode("string_escape")
>

Ahh.  Thanks for that.
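
For anyone reading this on Python 3, where raw_input() is input() and the "string_escape" codec no longer exists, here is a rough sketch of the same idea. The helper name unescape_utf8 is made up for illustration, and the latin-1 round trip assumes the typed text contains only ASCII characters plus \xNN escapes:

```python
# Python 3 sketch. In Python 2 the quoted s.decode("string_escape")
# performs the escape-to-bytes step directly.
def unescape_utf8(typed):
    """Turn typed text such as 'A\\xcc\\x88' into the characters those bytes encode."""
    # Interpret the backslash escapes; each \xNN becomes one code point below 256.
    unescaped = typed.encode("latin-1").decode("unicode_escape")
    # Map those code points back to raw bytes, then decode the bytes as UTF-8.
    return unescaped.encode("latin-1").decode("utf-8")

typed = r"A\xcc\x88"            # what the user typed: 9 separate characters
result = unescape_utf8(typed)   # 'A' followed by U+0308 COMBINING DIAERESIS
print(len(typed), len(result))
print(result == "A\u0308")
```

The raw string holds the backslashes literally, just as typed input would, so the conversion has to be requested explicitly.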


>On Oct 12, 2:43 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> > And what is it that your keyboard enters to produce an 'a' with an umlaut?
>
> *I* just hit the ä key.  The one right next to the ö key.  ;-)
>

BeautifulSoup can convert an html entity representing an 'A' with
umlaut, e.g.:

&Auml;

into an Ä without ever touching my keyboard.  How does BeautifulSoup
do it?
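
Conceptually, a named entity is just a lookup key for a Unicode code point, so no keyboard is involved. The sketch below uses the standard library table rather than BeautifulSoup, and it is not a claim about how BeautifulSoup is implemented internally. It is written for Python 3; the helper name entity_to_text is hypothetical:

```python
# Python 3 sketch; in Python 2 the same table lives in htmlentitydefs.
from html.entities import name2codepoint

def entity_to_text(entity):
    """Convert a named entity such as '&Auml;' to the character it stands for."""
    name = entity.strip("&;")            # '&Auml;' -> 'Auml'
    return chr(name2codepoint[name])     # 'Auml' -> 196 -> the character

ch = entity_to_text("&Auml;")
print(ch)                    # the 'A' with umlaut
print(hex(ord(ch)))          # its code point
print(ch.encode("utf-8"))    # the bytes that represent it in UTF-8
```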


from BeautifulSoup import BeautifulStoneSoup as bss


s1 = "<h1>&Auml;</h1>"  #&_Auml;_
#I added the comment after the line to show the
#format of the html entity.  In case a browser
#might render the comment into the actual character,
#I added underscores to the html entity:

soup = bss(s1)
text = soup.contents[0].string  #gets the 'A' with umlaut out of the html

new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
print repr(new_s)
print new_s

I see the same output for both print statements, and what I see is an
'A' with umlaut.  I expected that the first print statement would show
the utf-8 encoding for the character.
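
One possible explanation, not verified against the BeautifulSoup source: new_s is a soup object rather than a plain string, and an object is free to define __repr__ so that it returns its rendered text, in which case repr() never shows escapes. The Python 3 sketch below illustrates that with a made-up class, Rendered, which is a stand-in and not part of BeautifulSoup:

```python
class Rendered(object):
    """Hypothetical stand-in for a parsed document; not a BeautifulSoup class."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    # repr() returns the rendered text itself, so no escapes are shown.
    __repr__ = __str__

doc = Rendered("\u00c4")
raw = "\u00c4".encode("utf-8")   # the UTF-8 bytes for the same character

print(repr(doc) == str(doc))     # both give the character itself
print(repr(raw))                 # repr of the raw bytes shows the escapes
```

If that is what is happening, taking repr() of the encoded byte string instead of the object would be the way to see the \xc3\x84 escapes.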



