[Tutor] Trouble in dealing with special characters.

Fri Dec 7 05:20:04 EST 2018

On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
> Hi Alan,
> 
> I am using Python 2.7.8

That is important information.

Python's string type unfortunately predates Unicode support, and when 
Unicode was added in Python 2.0 some bad decisions were made. For 
example, we can write this in Python 2:

>>> txt = "abcπ"

but it is a lie, because what we get isn't the string we typed, but the 
interpreter's *bad guess* that we actually meant this:

>>> txt
'abc\xcf\x80'
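
To make the difference concrete, here is a small sketch. It assumes the 
bytes above came from a UTF-8 terminal (\xcf\x80 is the UTF-8 encoding 
of π), and it spells both values out with escapes so it runs the same 
on Python 2.7 and Python 3:

```python
from __future__ import print_function

raw = b"abc\xcf\x80"    # the five bytes Python 2 actually stored
text = u"abc\u03c0"     # four characters: a, b, c, GREEK SMALL LETTER PI

n_bytes = len(raw)      # len() of a byte string counts bytes
n_chars = len(text)     # len() of a text string counts characters
same = raw.decode("utf-8") == text

print(n_bytes, n_chars, same)
```

The byte string is one item longer than the text you typed, which is 
why slicing and counting go wrong on these not-really-text strings.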

Depending on your operating system, sometimes you can work with these 
not-really-text strings for a long time, but when it fails, it fails 
HARD with confusing errors. Just as you have here:

> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)

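Here, your terminal handed Python raw bytes in some platform-specific 
encoding, and decode() with no argument fell back to Python 2's default 
ASCII codec, which cannot handle any byte above 127. Byte 0xe2 is the 
start of the curly apostrophe, which suggests the bytes are UTF-8.

If you know which encoding the bytes are in, you can make it work:

tx.decode("utf-8")

Be careful with guesses like this one:

tx.decode("Latin1")

It never raises an error, but if the bytes are not really Latin-1 you 
silently get the wrong characters.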
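
As a rough illustration, the sketch below assumes the bytes in your 
string are UTF-8 (the 0xe2 in the traceback is consistent with the 
UTF-8 encoding of the curly apostrophe, U+2019). It compares the ASCII 
failure with a UTF-8 decode and a Latin-1 decode, and runs on both 
Python 2.7 and Python 3:

```python
from __future__ import print_function

# The clinic name as bytes; \xe2\x80\x99 is U+2019 encoded as UTF-8.
tx = b"MOUNTAIN VIEW WOMEN\xe2\x80\x99S HEALTH CLINIC"

try:
    tx.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

good = tx.decode("utf-8")      # one apostrophe character
wrong = tx.decode("latin-1")   # never raises, but gives three junk characters

print(ascii_error.start)       # index of the first undecodable byte
print(len(good), len(wrong))
```

Under that assumption only the UTF-8 decode gives back the text that 
was typed. The Latin-1 decode succeeds too, but the result is two 
characters longer and the apostrophe is gone.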

But a better fix is to use actual text, by putting a "u" prefix outside 
the quote marks:

txt = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"

If you need to write this to a file, you can do this:

file.write(txt.encode('utf-8'))

To read it back again:

# from a file using UTF-8
txt = file.read().decode('utf-8')

(If you get a decoding error, it means your text file wasn't actually 
UTF-8. Ask the supplier what it really is.)
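
If you would rather not call encode() and decode() by hand, one option 
is io.open, which exists in Python 2.6+ and Python 3 and does the 
conversion for you. This is only a sketch: the file name "clinic.txt" 
and the temporary directory are made up for the demonstration.

```python
import io
import os
import tempfile

txt = u"MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# Hypothetical scratch file, used only for this demonstration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "clinic.txt")

# In text mode io.open encodes on write and decodes on read.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(txt)

with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Binary mode shows the bytes that really ended up on disk.
with io.open(path, "rb") as f:
    raw = f.read()

os.remove(path)
os.rmdir(folder)
```

Here round_trip is text again, and raw holds the same bytes that 
txt.encode('utf-8') would produce.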

> How to know whether in a given string(sentence) is there any that is not
> ASCII character and how to replace?

That's usually the wrong solution. That's like saying, "My program can't 
add numbers greater than 100. How do I tell if a number is greater than 
100, and turn it into a number smaller than 100?"

You can do this:

mystring = "something"
if any(ord(c) > 127 for c in mystring):
    print "Contains non-ASCII"

But what you do then is hard to decide. Delete non-ASCII characters? 
Replace them with what?
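
Before deciding, it may help to see exactly which characters are the 
problem. A minimal sketch, assuming you already have real text (a 
unicode string in Python 2) rather than bytes:

```python
import unicodedata

text = u"MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# Collect (index, code point, official Unicode name) for each non-ASCII character.
non_ascii = [
    (i, "U+%04X" % ord(c), unicodedata.name(c, "<unnamed>"))
    for i, c in enumerate(text)
    if ord(c) > 127
]

for entry in non_ascii:
    print(entry)
```

For this string the list has a single entry, the RIGHT SINGLE QUOTATION 
MARK at index 19, which tells you what you are actually dealing with.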

If you are desperate, you can do this:

bytestring = "something"
text = bytestring.decode('ascii', errors='replace')
bytestring = text.encode('ascii', errors='replace')

but that will replace every non-ASCII byte with a question mark "?" 
(so a single multi-byte character can turn into several), which might 
not be what you want.
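
A short sketch of what that looks like, again assuming the input bytes 
are UTF-8. It compares decoding as ASCII with decoding using the real 
encoding before falling back to "?":

```python
bytestring = b"WOMEN\xe2\x80\x99S"   # UTF-8 bytes containing one non-ASCII character

# Decoding as ASCII replaces each offending byte separately.
per_byte = bytestring.decode("ascii", "replace").encode("ascii", "replace")

# Decoding with the real encoding first replaces each character once.
per_char = bytestring.decode("utf-8").encode("ascii", "replace")

print(per_byte)
print(per_char)
```

The first result has three question marks where the apostrophe was, the 
second has one. Either way the original character is lost.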

-- 
Steve