Re: getting rid of —

Tep petshmidt at googlemail.com
Fri Jul 3 12:24:43 EDT 2009


On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> "Tep" <petshm... at googlemail.com> wrote in message
>
> news:46d36544-1ea2-4391-8922-11b8127a2fef at o6g2000yqj.googlegroups.com...
>
> > On 3 Jul., 06:40, Simon Forman <sajmik... at gmail.com> wrote:
> > > On Jul 2, 4:31 am, Tep <petshm... at googlemail.com> wrote:
> [snip]
> > > > > > > how can I replace '—' sign from string? Or do split at that
> > > > > > > character?
> > > > > > > Getting unicode error if I try to do it:
>
> > > > > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in
> > > > > > > position
> > > > > > > 1: ordinal not in range(128)
>
> > > > > > > Thanks, Pet
>
> > > > > > > script is # -*- coding: UTF-8 -*-
> [snip]
> > > I just tried a bit of your code above in my interpreter here and it
> > > worked fine:
>
> > > |>>> data = 'foo — bar'
> > > |>>> data.split('—')
> > > |['foo ', ' bar']
> > > |>>> data = u'foo — bar'
> > > |>>> data.split(u'—')
> > > |[u'foo ', u' bar']
>
> > > Figure out the smallest piece of "html source code" that causes the
> > > problem and include that with your next post.
>
> > The problem was that I had converted the "html source code" to a
> > unicode object and didn't encode it back to utf-8 before using split...
> > Thanks for the help, and sorry for the not-so-smart question
> > Pet
>
> You'd still benefit from posting some code.  You shouldn't be converting

I've posted code below

> back to utf-8 to do a split, you should be using a Unicode string with split
> on the Unicode version of the "html source code".  Also make sure your file
> is actually saved in the encoding you declare.  I print the encoding of your
> symbol in two encodings to illustrate why I suspect this.

The file was indeed saved as windows-1252; I've changed that. For the
errors, see below

>
> Below, assume "data" is your "html source code" as a Unicode string:
>
> # -*- coding: UTF-8 -*-
> data = u'foo — bar'
> print repr(u'—'.encode('utf-8'))
> print repr(u'—'.encode('windows-1252'))
> print data.split(u'—')
> print data.split('—')
>
> OUTPUT:
>
> '\xe2\x80\x94'
> '\x97'
> [u'foo ', u' bar']
> Traceback (most recent call last):
>   File
> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
> line 427, in ImportFile
>     exec codeObj in __main__.__dict__
>   File "<auto import>", line 1, in <module>
>   File "x.py", line 6, in <module>
>     print data.split('—')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0:
> ordinal not in range(128)
>
> Note that using the Unicode string in split() works.  Also note the decode
> byte in the error message when using a non-Unicode string to split the
> Unicode data.  In your original error message the decode byte that caused an
> error was 0x97, which is 'EM DASH' in Windows-1252 encoding.  Make sure to
> save your source code in the encoding you declare.  If I save the above
> script in windows-1252 encoding and change the coding line to windows-1252 I
> get the same results, but the decode byte is 0x97.
>
> # coding: windows-1252
> data = u'foo — bar'
> print repr(u'—'.encode('utf-8'))
> print repr(u'—'.encode('windows-1252'))
> print data.split(u'—')
> print data.split('—')
>
> '\xe2\x80\x94'
> '\x97'
> [u'foo ', u' bar']
> Traceback (most recent call last):
>   File
> "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
> line 427, in ImportFile
>     exec codeObj in __main__.__dict__
>   File "<auto import>", line 1, in <module>
>   File "x.py", line 6, in <module>
>     print data.split('—')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0:
> ordinal not in range(128)
>
> -Mark
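
To make the point about the two byte sequences concrete, here is a minimal
sketch. It assumes only the standard codecs; the helper name
ascii_decode_error is made up for illustration. It spells the em dash as the
escape \u2014 so the result does not depend on how the file is saved, and it
decodes the bytes as ASCII explicitly, which is roughly what Python 2 does
implicitly when a byte string is mixed with a Unicode string.

```python
# The em dash as a Unicode escape, so the source file encoding is irrelevant.
EM_DASH = u'\u2014'

# The same character encoded two ways, as in the quoted output above.
utf8_bytes = EM_DASH.encode('utf-8')            # three bytes: e2 80 94
cp1252_bytes = EM_DASH.encode('windows-1252')   # one byte: 97


def ascii_decode_error(raw):
    """Return the UnicodeDecodeError raised by decoding raw as ASCII, or None.

    Hypothetical helper, not part of any library.
    """
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        return exc
    return None


for label, raw in (('utf-8', utf8_bytes), ('windows-1252', cp1252_bytes)):
    # The first byte of each sequence is >= 0x80, so the ASCII decode fails.
    print("%s: %r -> %s" % (label, raw, ascii_decode_error(raw)))
```

The byte named in the error message is the first non-ASCII byte of the
separator, so it hints at which encoding the source file was really saved in.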

#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)
    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')
    return parts[0]


def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page


def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]


if __name__ == "__main__":
    main()


Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Бахрейн — Уикипедия
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' — ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
1: ordinal not in range(128)
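
The traceback above is consistent with splitting a Unicode title on a
byte-string separator: getWebPage() returns a unicode object, but ' — ' in
getTitle() is a plain str, so Python 2 tries to decode it as ASCII first. A
minimal sketch of one possible fix follows. It assumes the page has already
been decoded to Unicode; the names get_title and sample_page are made up for
illustration, and the sample markup stands in for the real page.

```python
import re

# Em dash as an escape, so the separator is Unicode regardless of file encoding.
EM_DASH = u'\u2014'


def get_title(page):
    """Return the <title> text before ' <em dash> ', or None if no title.

    `page` is assumed to be a Unicode string (already decoded).
    """
    match = re.search(u'<title>(.*?)</title>', page)
    if match is None:
        return None
    # Unicode data split on a Unicode separator: no implicit ASCII decode.
    return match.group(1).split(u' ' + EM_DASH + u' ')[0]


# Hypothetical sample input standing in for the downloaded page.
sample_page = u'<html><head><title>foo \u2014 bar</title></head></html>'
print(get_title(sample_page))
```

Note also that getTitle() already returns parts[0], so printing title[0] in
main() would show only the first character of the title.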


