What extended ASCII character set uses 0x9D?

John Nagle nagle at animats.com
Thu Aug 17 20:30:28 EDT 2017


On 08/17/2017 05:14 PM, John Nagle wrote:
 >      I'm cleaning up some data which has text description fields from
 > multiple sources.

A few more cases:

bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
bytearray(b'petr urban\xe4\x8d\xe3\xadk')

0x9d is the most common; it occurs in English text. The others
seem to come from some Eastern European character set.
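
One way to check which bytes are most common is to tally every
non-ASCII byte across the fields. This is only a rough sketch:
high_byte_counts is a made-up helper name, and the sample list holds
just the cases quoted above, not the full CSV.

```python
from collections import Counter

# Sample fields copied from the cases listed above.
fields = [
    b'miguel \xe3\x81ngel santos',
    b'lidija kmeti\xe4\x8d',
    b'\xe5\x81ukasz zmywaczyk',
    b'M\x81\x81\xfcnster',
    b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk',
    b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko',
    b'petr urban\xe4\x8d\xe3\xadk',
]

def high_byte_counts(fields):
    """Count how often each byte >= 0x80 occurs across all fields."""
    counts = Counter()
    for field in fields:
        # Iterating over bytes (or a bytearray) yields integers.
        counts.update(b for b in field if b >= 0x80)
    return counts

# Print the non-ASCII bytes, most frequent first.
for byte, n in high_byte_counts(fields).most_common():
    print(f"0x{byte:02x}: {n}")
```

The tally only shows which byte values turn up; it does not by itself
say which character set produced them.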

Understand, there's no metadata available to disambiguate this. What I
have is a big CSV file in which different character sets are mixed.
Each field has a uniform character set, so I need character set
detection on a per-field basis.
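
A first pass at per-field detection might be to try a short list of
candidate codecs on each field and keep the ones that decode without
error. The sketch below assumes that approach; candidate_decodings and
the CANDIDATES list are illustrative choices, not a known answer for
this data.

```python
# Candidate codecs to try; this list is an assumption, adjust as needed.
CANDIDATES = ("utf-8", "cp1252", "cp1250", "iso8859_2")

def candidate_decodings(field, encodings=CANDIDATES):
    """Return {codec name: decoded text} for each codec that decodes
    the field strictly, i.e. without raising UnicodeDecodeError."""
    results = {}
    for name in encodings:
        try:
            results[name] = field.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot represent the byte sequence
    return results

# Example: show the surviving candidates for one problem field.
for name, text in candidate_decodings(b'lidija kmeti\xe4\x8d').items():
    print(name, repr(text))
```

A clean decode only rules codecs out; it does not prove the text is
right, so the surviving candidates still need a plausibility check
(a dictionary lookup, or a look by eye).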

				John Nagle



