Right solution to unicode error?

Thu Nov 8 06:01:14 EST 2012

On Wednesday, 7 November 2012 23:17:42 UTC+1, Anders wrote:
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch.  But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)
>
> (where task.subject was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
>
> From what I understand from reading online, the error is telling me
> that the subject line contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
>
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?
>
> BTW, if there's a clear description of the best solution for this
> particular problem – i.e., where I want to ultimately display the
> results as HTML – please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
>
> Thanks,
> Anders

----------

The problem is not on the Python side, nor is it specific
to Python. It is a matter of character encoding
(the "coding of characters").

1) Unicode is an abstract entity; it has to be encoded
for the system/device that will host it.
Using Python:
<unicode>.encode(host_coding)

2) The host_coding scheme may not contain the
character (glyph/grapheme) corresponding to the
"unicode character". In that case there are two possible
solutions: "ignore" it or "replace" it with a
substitution character.
Using Python:
<unicode>.encode(host_coding, "ignore")
<unicode>.encode(host_coding, "replace")

3) Detecting the host_coding is the most difficult
task. Either you have to hard-code it or you
may expect Python to find it via sys.stdout.encoding.

4) Due to the nature of unicode, it is the only
way to do it correctly.
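
Points 1) to 3) can be combined in a small helper. This is only a minimal sketch: the name encode_for_host and the sample subject are made up for illustration, and the fallback to "ascii" when no console encoding is reported is an assumption, not something Python prescribes.

```python
import sys

def encode_for_host(text, host_coding=None):
    """Encode a unicode string for the host, substituting what it cannot show.

    encode_for_host is a hypothetical helper name, not a library function.
    """
    if host_coding is None:
        # sys.stdout.encoding may be missing or None (e.g. redirected output),
        # so fall back to plain ascii in that case (an assumption).
        host_coding = getattr(sys.stdout, "encoding", None) or "ascii"
    # "replace" puts a substitution character where the host coding
    # has no equivalent; "ignore" would drop the character instead.
    return text.encode(host_coding, "replace")

subject = u"Review budget \u2013 Q4"   # made-up subject containing an en dash
data = encode_for_host(subject)        # a byte string suited to the console
```

With an explicit host_coding of 'ascii' the en dash becomes '?', while 'cp1252' and 'utf-8' can represent it and keep it.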

Examples that fail as expected, and examples that do not.
Mainly Py3, but the principle is the same in Py2. Note: in Py3,
encoding creates a byte string, which has to be
decoded again to produce a native (unicode) string for display, here
with cp1252.

Py2

>>> u'éléphant\u2013abc'.encode('ascii')

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    u'éléphant\u2013abc'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> print(u'éléphant\u2013abc'.encode('cp1252'))
éléphant–abc
>>> 

Py3

>>> 'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore')
b'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace')
b'?l?phant?abc'
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore').decode('cp1252')
'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace').decode('cp1252')
'?l?phant?abc'
>>> 
>>> 'éléphant\u2013abc'.encode('cp1252').decode('cp1252')
'éléphant–abc'

>>> import sys
>>> sys.stdout.encoding
'cp1252'
>>> 'éléphant\u2013abc'.encode(sys.stdout.encoding).decode('cp1252')
'éléphant–abc'
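
For the HTML file mentioned in the question, the same rule applies: keep the subjects as unicode and encode once, when writing. The following is a minimal sketch, assuming the subjects are already unicode objects and that UTF-8 is acceptable for the page. The function names build_tasks_html and write_tasks_html are hypothetical.

```python
import io
from xml.sax.saxutils import escape

def build_tasks_html(subjects):
    """Return a small HTML page (as unicode) listing the given subjects."""
    # escape() protects &, < and > so a subject cannot break the markup
    items = u"\n".join(u"<li>%s</li>" % escape(s) for s in subjects)
    return (u"<!DOCTYPE html>\n"
            u"<html><head><meta charset=\"utf-8\"><title>Tasks</title></head>\n"
            u"<body><ul>\n%s\n</ul></body></html>\n" % items)

def write_tasks_html(subjects, path):
    """Write the page to path, encoding it as UTF-8 at the last moment."""
    # io.open takes an encoding argument on Python 2.6+ and on Python 3
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(build_tasks_html(subjects))

# Alternative for an ASCII-only file: let the codec emit numeric
# character references for everything outside ASCII.
ascii_page = build_tasks_html([u"Budget \u2013 Q4"]).encode(
    "ascii", "xmlcharrefreplace")
```

UTF-8 can represent every character, so no "ignore" or "replace" is needed. With 'xmlcharrefreplace' the en dash is written as &#8211;, which a browser renders as the original character.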

etc

jmf