Re: getting rid of —

Fri Jul 3 10:58:35 EDT 2009

"Tep" <petshmidt at googlemail.com> wrote in message 
news:46d36544-1ea2-4391-8922-11b8127a2fef at o6g2000yqj.googlegroups.com...
> On 3 Jul., 06:40, Simon Forman <sajmik... at gmail.com> wrote:
> > On Jul 2, 4:31 am, Tep <petshm... at googlemail.com> wrote:
[snip]
> > > > > > how can I replace '—' sign from string? Or do split at that 
> > > > > > character?
> > > > > > Getting unicode error if I try to do it:
> >
> > > > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in 
> > > > > > position
> > > > > > 1: ordinal not in range(128)
> >
> > > > > > Thanks, Pet
> >
> > > > > > script is # -*- coding: UTF-8 -*-
[snip]
> > I just tried a bit of your code above in my interpreter here and it
> > worked fine:
> >
> > |>>> data = 'foo — bar'
> > |>>> data.split('—')
> > |['foo ', ' bar']
> > |>>> data = u'foo — bar'
> > |>>> data.split(u'—')
> > |[u'foo ', u' bar']
> >
> > Figure out the smallest piece of "html source code" that causes the
> > problem and include that with your next post.
>
> The problem was, I've converted "html source code" to unicode object
> and didn't encode it back to utf-8 before using split...
> Thanks for help and sorry for not so smart question
> Pet

You'd still benefit from posting some code.  You shouldn't be converting 
back to utf-8 to do a split; you should call split with a Unicode string 
on the Unicode version of the "html source code".  Also make sure your file 
is actually saved in the encoding you declare.  I print the bytes of your 
symbol in two encodings to illustrate why I suspect it isn't.

Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

OUTPUT:

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File 
"C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", 
line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: 
ordinal not in range(128)

Note that using the Unicode string in split() works.  Also note the byte 
that fails to decode in the error message when a non-Unicode string is used 
to split the Unicode data: Python 2 implicitly decodes the byte string as 
ASCII, and 0xe2 is the first byte of the dash in UTF-8.  In your original 
error message the failing byte was 0x97, which is 'EM DASH' in Windows-1252 
encoding.  Make sure to save your source code in the encoding you declare.  
If I save the above script in windows-1252 encoding and change the coding 
line to windows-1252 I get the same results, but the failing byte is 0x97.

# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File 
"C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", 
line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: 
ordinal not in range(128)
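
For anyone reading this on Python 3, where byte strings and text are separate 
types, the same advice (decode first, then split on text) might look roughly 
like the sketch below.  It is only an illustration: the names raw_html and 
split_on_em_dash are made up here, and the sample bytes are assumed to be 
windows-1252, as the 0x97 in the original error suggests.

```python
# Python 3 sketch; the scripts above are Python 2.
# EM_DASH, split_on_em_dash and raw_html are hypothetical names for illustration.
EM_DASH = "\u2014"  # the dash character, written as an escape so the
                    # source file's own encoding does not matter


def split_on_em_dash(raw, encoding):
    """Decode bytes to text first, then split on the text em dash."""
    text = raw.decode(encoding)
    return text.split(EM_DASH)


# The same character is stored as different bytes in different encodings.
utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")
print(repr(utf8_bytes))
print(repr(cp1252_bytes))

# Assumed input: bytes that arrived encoded as windows-1252.
raw_html = b"foo \x97 bar"
parts = split_on_em_dash(raw_html, "windows-1252")
print(parts)

# Decoding the same bytes with the wrong codec fails, which is a hint
# that the declared encoding and the actual encoding disagree.
try:
    raw_html.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```

The split is always done on text, never on bytes re-encoded to utf-8; the 
encoding name is only needed once, at the point where the bytes are decoded.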

-Mark