getting rid of —

MRAB python at mrabarnett.plus.com
Fri Jul 3 12:54:10 EDT 2009


Tep wrote:
> On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
>> "Tep" <petshm... at googlemail.com> wrote in message
>>
>> news:46d36544-1ea2-4391-8922-11b8127a2fef at o6g2000yqj.googlegroups.com...
>>
>>> On 3 Jul., 06:40, Simon Forman <sajmik... at gmail.com> wrote:
>>>> On Jul 2, 4:31 am, Tep <petshm... at googlemail.com> wrote:
>> [snip]
>>>>>>>> how can I replace '—' sign from string? Or do split at that
>>>>>>>> character?
>>>>>>>> Getting unicode error if I try to do it:
>>>>>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1:
>>>>>>>> ordinal not in range(128)
>>>>>>>> Thanks, Pet
>>>>>>>> script is # -*- coding: UTF-8 -*-
>> [snip]
>>>> I just tried a bit of your code above in my interpreter here and it
>>>> worked fine:
>>>> |>>> data = 'foo — bar'
>>>> |>>> data.split('—')
>>>> |['foo ', ' bar']
>>>> |>>> data = u'foo — bar'
>>>> |>>> data.split(u'—')
>>>> |[u'foo ', u' bar']
>>>> Figure out the smallest piece of "html source code" that causes the
>>>> problem and include that with your next post.
>>> The problem was that I had converted the "html source code" to a unicode
>>> object and didn't encode it back to utf-8 before using split...
>>> Thanks for the help, and sorry for the not-so-smart question
>>> Pet
>> You'd still benefit from posting some code.  You shouldn't be converting
> 
> I've posted code below
> 
>> back to utf-8 to do a split, you should be using a Unicode string with split
>> on the Unicode version of the "html source code".  Also make sure your file
>> is actually saved in the encoding you declare.  I print the encoding of your
>> symbol in two encodings to illustrate why I suspect this.
> 
> File was indeed in windows-1252, I've changed this. For errors see
> below
> 
>> Below, assume "data" is your "html source code" as a Unicode string:
>>
>> # -*- coding: UTF-8 -*-
>> data = u'foo — bar'
>> print repr(u'—'.encode('utf-8'))
>> print repr(u'—'.encode('windows-1252'))
>> print data.split(u'—')
>> print data.split('—')
>>
>> OUTPUT:
>>
>> '\xe2\x80\x94'
>> '\x97'
>> [u'foo ', u' bar']
>> Traceback (most recent call last):
>>   File
>> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>> line 427, in ImportFile
>>     exec codeObj in __main__.__dict__
>>   File "<auto import>", line 1, in <module>
>>   File "x.py", line 6, in <module>
>>     print data.split('—')
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0:
>> ordinal not in range(128)
>>
>> Note that using the Unicode string in split() works.  Also note the decode
>> byte in the error message when using a non-Unicode string to split the
>> Unicode data.  In your original error message the decode byte that caused an
>> error was 0x97, which is 'EM DASH' in Windows-1252 encoding.  Make sure to
>> save your source code in the encoding you declare.  If I save the above
>> script in windows-1252 encoding and change the coding line to windows-1252 I
>> get the same results, but the decode byte is 0x97.
>>
>> # coding: windows-1252
>> data = u'foo — bar'
>> print repr(u'—'.encode('utf-8'))
>> print repr(u'—'.encode('windows-1252'))
>> print data.split(u'—')
>> print data.split('—')
>>
>> '\xe2\x80\x94'
>> '\x97'
>> [u'foo ', u' bar']
>> Traceback (most recent call last):
>>   File
>> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>> line 427, in ImportFile
>>     exec codeObj in __main__.__dict__
>>   File "<auto import>", line 1, in <module>
>>   File "x.py", line 6, in <module>
>>     print data.split('—')
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0:
>> ordinal not in range(128)
>>
>> -Mark
> 
> #! /usr/bin/python
> # -*- coding: UTF-8 -*-
> import urllib2
> import re
> def getTitle(input):
>     title = re.search('<title>(.*?)</title>', input)

The input is Unicode, so it's probably better for the regular expression
to also be Unicode:

     title = re.search(u'<title>(.*?)</title>', input)

(In the current implementation it doesn't actually matter, because the
pattern is pure ASCII.)
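As a minimal sketch of the Unicode-pattern version (the names TITLE_RE and
get_title and the sample page are made up for illustration, and the None
check is an addition that is not in your code):

```python
# -*- coding: UTF-8 -*-
import re

# Unicode pattern, to match the Unicode page text it is applied to.
# TITLE_RE and get_title are hypothetical names for this sketch.
TITLE_RE = re.compile(u'<title>(.*?)</title>')


def get_title(html):
    """Return the text inside the first <title> element, or None."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


# Made-up sample page; \u2014 is the EM DASH from the thread.
page = u'<html><head><title>foo \u2014 bar</title></head></html>'
title = get_title(page)
```

The result stays a Unicode string, so anything done with it afterwards
should use Unicode strings too.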

>     title = title.group(1)
>     print "FULL TITLE", title.encode('UTF-8')
>     parts = title.split(' — ')

The title is Unicode, so the string with which you're splitting should
also be Unicode:

     parts = title.split(u' — ')
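Something along these lines should work, as a rough sketch only (first_part
and the sample title are invented for illustration; the dash is written as
the escape \u2014 so the example doesn't depend on how the file is saved):

```python
# -*- coding: UTF-8 -*-

# Unicode separator: space, EM DASH (U+2014), space.
SEPARATOR = u' \u2014 '


def first_part(title):
    """Return the part of a Unicode title before the first separator."""
    return title.split(SEPARATOR)[0]


# Made-up sample title standing in for the text of the <title> element.
sample_title = u'foo \u2014 bar'
head = first_part(sample_title)
```

If the separator doesn't occur in the title, split() returns a one-element
list, so first_part() just hands back the whole title.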

>     return parts[0]
> 
> 
> def getWebPage(url):
>     user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>     headers = { 'User-Agent' : user_agent }
>     req = urllib2.Request(url, '', headers)
>     response = urllib2.urlopen(req)
>     the_page = unicode(response.read(), 'UTF-8')
>     return the_page
> 
> 
> def main():
>     url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
>     title = getTitle(getWebPage(url))
>     print title[0]
> 
> 
> if __name__ == "__main__":
>     main()
> 
> 
> Traceback (most recent call last):
>   File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
>     main()
>   File "C:\user\Projects\test\src\new_main.py", line 24, in main
>     title = getTitle(getWebPage(url))
> FULL TITLE Бахрейн — Уикипедия
>   File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
>     parts = title.split(' — ')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
> 1: ordinal not in range(128)
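
The "byte 0xe2 in position 1" in that traceback fits the separator being a
byte string: in a UTF-8 source file the dash is stored as three bytes, and
Python 2 tries to decode the whole separator as ASCII before splitting the
Unicode title. A small sketch that makes that hidden step explicit (the
variable names are invented for illustration):

```python
# -*- coding: UTF-8 -*-

# The ' <EM DASH> ' separator as the bytes a UTF-8 source file holds.
separator_bytes = u' \u2014 '.encode('utf-8')

try:
    # Python 2 does this decode implicitly when a byte string is passed
    # to unicode.split(); here it is spelled out.
    separator_bytes.decode('ascii')
    failure = None
except UnicodeDecodeError as err:
    # Record the failing position and the value of the offending byte.
    failure = (err.start, bytearray(separator_bytes)[err.start])

# Decoding with the file's real encoding gives a usable Unicode separator.
separator = separator_bytes.decode('utf-8')
parts = u'foo \u2014 bar'.split(separator)
```

Position 0 is the leading space, which is plain ASCII; position 1 is the
first byte of the dash, which is where the decode gives up.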
