[Tutor] Trouble in dealing with special characters.
Steven D'Aprano
steve at pearwood.info
Fri Dec 7 05:20:04 EST 2018
On Fri, Dec 07, 2018 at 02:06:16PM +0530, Sunil Tech wrote:
> Hi Alan,
>
> I am using Python 2.7.8
That is important information.
Python's byte strings unfortunately predate its Unicode support, and when
Unicode was added some bad decisions were made. For example, we can write
this in Python 2:
>>> txt = "abcπ"
but it is a lie, because what we get isn't the string we typed, but the
interpreter's *bad guess* that we actually meant this:
>>> txt
'abc\xcf\x80'
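As a small sketch of what that output means (the byte values assume a
terminal using UTF-8, which is what the \xcf\x80 above suggests), the
"string" is really five bytes, and only decoding gives four characters.
The b"" and decode() spelling below runs on Python 2.7 and 3 alike:

```python
# The five bytes Python 2 stores for "abc" followed by pi on a UTF-8 terminal.
data = b"abc\xcf\x80"

# Decoding with the right codec turns the bytes into four real characters.
text = data.decode("utf-8")

print(len(data))  # 5 bytes
print(len(text))  # 4 characters
```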
Depending on your operating system, sometimes you can work with these
not-really-text strings for a long time, but when it fails, it fails
HARD with confusing errors. Just as you have here:
> >>> tx = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
> >>> tx.decode()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 19:
> ordinal not in range(128)
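Here, decode() was called with no argument, so Python fell back on its
default codec, ASCII (as the error message says), which cannot handle
the byte 0xe2. The bytes themselves came from whatever encoding your
terminal uses: UTF-8, Latin-1, CP-1252 or something even more exotic.
Leaving Python to guess is the wrong thing to do.
But if you know, or can guess, which encoding the bytes are in, you can
make it work. The 0xe2 byte suggests UTF-8:
tx.decode("utf-8")
If the bytes had come from a Windows code page instead, it would be:
tx.decode("cp1252")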
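To see why the choice of codec matters, here is a rough sketch. It
assumes the bytes in tx are UTF-8 (the 0xe2 at position 19 is the first
byte of the curly apostrophe in that encoding); the names raw, good and
bad are just for illustration:

```python
# The bytes behind the original string, assuming a UTF-8 terminal.
raw = b"MOUNTAIN VIEW WOMEN\xe2\x80\x99S HEALTH CLINIC"

# ASCII cannot decode byte 0xe2, so this reproduces the error.
try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as err:
    failed_at = err.start

good = raw.decode("utf-8")    # three bytes become one apostrophe
bad = raw.decode("latin-1")   # no error, but three junk characters

print(failed_at)  # 19
print(len(good))  # 35
print(len(bad))   # 37
```

Note that the wrong codec does not always raise an error. Latin-1 accepts
every byte, so a bad guess there just gives you quietly mangled text.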
But a better fix is to use actual text, by putting a "u" prefix outside
the quote marks:
txt = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
If you need to write this to a file, you can do this:
file.write(txt.encode('utf-8'))
To read it back again:
# from a file using UTF-8
txt = file.read().decode('utf-8')
(If you get a decoding error, it means your text file wasn't actually
UTF-8. Ask the supplier what it really is.)
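Putting the write and the read together, a minimal round-trip sketch
might look like this. The temporary file and the names path and
roundtrip are my own choices for the example, and the text is written
with a \u escape so the snippet itself stays pure ASCII:

```python
import os
import tempfile

txt = u"MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# Create a scratch file to write to.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Encode text to bytes on the way out...
    with open(path, "wb") as f:
        f.write(txt.encode("utf-8"))

    # ...and decode bytes back to text on the way in.
    with open(path, "rb") as f:
        roundtrip = f.read().decode("utf-8")
finally:
    os.remove(path)

print(roundtrip == txt)  # True
```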
> How to know whether in a given string(sentence) is there any that is not
> ASCII character and how to replace?
That's usually the wrong solution. That's like saying, "My program can't
add numbers greater than 100. How do I tell if a number is greater than
100, and turn it into a number smaller than 100?"
You can do this:
mystring = "something"
if any(ord(c) > 127 for c in mystring):
    print "Contains non-ASCII"
But what you do then is hard to decide. Delete non-ASCII characters?
Replace them with what?
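If you do want to look before deciding, one option is to list where the
non-ASCII characters are instead of just testing for them. This is only
a sketch, and it works on proper text (a u"" string), not on bytes; the
name non_ascii is made up for the example:

```python
text = u"MOUNTAIN VIEW WOMEN\u2019S HEALTH CLINIC"

# Collect (position, character) for every character outside ASCII.
non_ascii = [(i, c) for i, c in enumerate(text) if ord(c) > 127]

print(len(non_ascii))   # 1
print(non_ascii[0][0])  # 19
```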
If you are desperate, you can do this:
bytestring = "something"
text = bytestring.decode('ascii', errors='replace')
bytestring = text.encode('ascii', errors='replace')
but that will replace every non-ASCII byte with a question mark "?"
(so one character may turn into several), which might not be what you
want.
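Here is what that looks like on a short example, assuming the input
bytes are UTF-8 as in your string. The variable names are only for
illustration:

```python
# "WOMEN'S" with a curly apostrophe, as UTF-8 bytes.
bytestring = b"WOMEN\xe2\x80\x99S"

# Each undecodable byte becomes U+FFFD, the replacement character...
text = bytestring.decode("ascii", "replace")

# ...and each U+FFFD becomes "?" when encoded back to ASCII.
cleaned = text.encode("ascii", "replace")

print(cleaned == b"WOMEN???S")  # True
```

The single apostrophe turns into three question marks, one for each of
its bytes.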
--
Steve