[melbourne-pug] Unicode for windows dummies

Mike Dewhirst miked at dewhirst.com.au
Tue Aug 16 01:51:11 EDT 2016


First of all, thank you all very much for this support. I was seriously 
contemplating a change of career to something easy like becoming an 
Olympic gymnast and giving up unicode forever ... but anyway ...

From the top, some examples which might shed some light ...

class CsvImport(object):
    """ Imports a csv file and converts it into a list of lists """

    def __init__(self, csvfile, company, start, finish):
        self.company = company
        self.rows = list()
        with open(csvfile, "r") as csv:
            i = 0
            self.rows = csv.readlines()
            for line in self.rows:
                line = line.encode("utf-8").decode("cp1252", "replace")
                #line = line.encode("cp1252").decode("cp1252", "replace")
                #line = line.encode("cp1252")
                #line = line.encode("utf-8")
                i += 1
                cells = list(line)
                # this requires a bytes-like object not 'str'
                #cells = line.split(",")
                if i >= start:
                    # as expected this includes the [] brackets around each row
                    #print(cells)
                    # this omits the [] brackets but otherwise output is identical
                    print(', '.join(repr(cell) for cell in cells))
                if i > finish:
                    break
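For comparison, here is a minimal sketch of the same import with the encoding passed to open() explicitly and the splitting left to the csv module. It assumes the file is UTF-16, which is only a guess based on the '\x00' that follows every character in the output below. The name read_rows and its default arguments are hypothetical, and it returns rows start..finish inclusive rather than printing them.

```python
import csv


def read_rows(csvfile, start, finish, encoding="utf-16"):
    """Return rows start..finish (1-based, inclusive) as lists of cells.

    The default encoding is an assumption; pass whatever the file really uses.
    """
    rows = []
    # newline="" is what the csv module documentation recommends for files
    # handed to csv.reader, so quoted line breaks survive intact.
    with open(csvfile, "r", encoding=encoding, newline="") as handle:
        for i, cells in enumerate(csv.reader(handle), start=1):
            if i > finish:
                break
            if i >= start:
                rows.append(cells)
    return rows
```

With the decoding done once by open(), each row arrives as a list of str cells and no per-line encode()/decode() calls are needed.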

Output with different code page settings ...

                 line = line.encode("utf-8").decode("cp1252", "replace")

, ',', '\x00', ',', '\x00', ',', '\x00', ',', '\x00', '0', '\x00', '.', 
'\x00', '0', '\x00',
  '0', '\x00', '0', '\x00', '0', '\x00', '0', '\x00', '%', '\x00', ',', 
'\x00', '"', '\x00',
'"', '\x00', ',', '\x00', '"', '\x00', '"', '\x00', ',', '\x00', '"', 
'\x00', 'A', '\x00', '
d', '\x00', 'd', '\x00', 'i', '\x00', 't', '\x00', 'i', '\x00', 'o', 
'\x00', 'n', '\x00', 'a
', '\x00', 'l', '\x00', ' ', '\x00', 'N', '\x00', 'o', '\x00', 'n', 
'\x00', '-', '\x00', 'G'
, '\x00', 'H', '\x00', 'S', '\x00', ' ', '\x00', 'H', '\x00', 'a', 
'\x00', 'z', '\x00', 'a',
  '\x00', 'r', '\x00', 'd', '\x00', ' ', '\x00', 'S', '\x00', 't', 
'\x00', 'a', '\x00', 't',
'\x00', 'e', '\x00', 'm', '\x00', 'e', '\x00', 'n', '\x00', 't', '\x00', 
'"', '\x00', ',', '
\x00', ',', '\x00', '0', '\x00', '.', '\x00', '0', '\x00', '0', '\x00', 
'0', '\x00', '0', '\
x00', '0', '\x00', '%', '\x00', ',', '\x00', '"', '\x00', '"', '\x00', '\n'

                 line = line.encode("cp1252").decode("cp1252", "replace")


, ',', '\x00', ',', '\x00', ',', '\x00', ',', '\x00', '0', '\x00', '.', 
'\x00', '0', '\x00',
  '0', '\x00', '0', '\x00', '0', '\x00', '0', '\x00', '%', '\x00', ',', 
'\x00', '"', '\x00',
'"', '\x00', ',', '\x00', '"', '\x00', '"', '\x00', ',', '\x00', '"', 
'\x00', 'A', '\x00', '
d', '\x00', 'd', '\x00', 'i', '\x00', 't', '\x00', 'i', '\x00', 'o', 
'\x00', 'n', '\x00', 'a
', '\x00', 'l', '\x00', ' ', '\x00', 'N', '\x00', 'o', '\x00', 'n', 
'\x00', '-', '\x00', 'G'
, '\x00', 'H', '\x00', 'S', '\x00', ' ', '\x00', 'H', '\x00', 'a', 
'\x00', 'z', '\x00', 'a',
  '\x00', 'r', '\x00', 'd', '\x00', ' ', '\x00', 'S', '\x00', 't', 
'\x00', 'a', '\x00', 't',
'\x00', 'e', '\x00', 'm', '\x00', 'e', '\x00', 'n', '\x00', 't', '\x00', 
'"', '\x00', ',', '
\x00', ',', '\x00', '0', '\x00', '.', '\x00', '0', '\x00', '0', '\x00', 
'0', '\x00', '0', '\
x00', '0', '\x00', '%', '\x00', ',', '\x00', '"', '\x00', '"', '\x00', '\n'

Both of which are identical ... so now

                 line = line.encode("cp1252")


, 44, 0, 34, 0, 72, 0, 52, 0, 49, 0, 49, 0, 34, 0, 44, 0, 44, 0, 44, 0, 
48, 0, 46, 0, 48, 0,
  48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 44, 0, 34, 0, 34, 0, 44, 0, 44, 0, 
34, 0, 34, 0, 44, 0,
34, 0, 34, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 0, 
34, 0, 44, 0, 34, 0, 7
2, 0, 97, 0, 122, 0, 97, 0, 114, 0, 100, 0, 111, 0, 117, 0, 115, 0, 32, 
0, 84, 0, 111, 0, 32
, 0, 84, 0, 104, 0, 101, 0, 32, 0, 79, 0, 122, 0, 111, 0, 110, 0, 101, 
0, 32, 0, 76, 0, 97,
0, 121, 0, 101, 0, 114, 0, 46, 0, 34, 0, 44, 0, 44, 0, 44, 0, 44, 0, 44, 
0, 48, 0, 46, 0, 48
, 0, 48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 
0, 34, 0, 44, 0, 34,
  0, 65, 0, 100, 0, 100, 0, 105, 0, 116, 0, 105, 0, 111, 0, 110, 0, 97, 
0, 108, 0, 32, 0, 78,
  0, 111, 0, 110, 0, 45, 0, 71, 0, 72, 0, 83, 0, 32, 0, 72, 0, 97, 0, 
122, 0, 97, 0, 114, 0,
100, 0, 32, 0, 83, 0, 116, 0, 97, 0, 116, 0, 101, 0, 109, 0, 101, 0, 
110, 0, 116, 0, 34, 0,
44, 0, 44, 0, 48, 0, 46, 0, 48, 0, 48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 
44, 0, 34, 0, 34, 0, 1
0

                 line = line.encode("utf-8")

, 44, 0, 34, 0, 72, 0, 52, 0, 49, 0, 49, 0, 34, 0, 44, 0, 44, 0, 44, 0, 
48, 0, 46, 0, 48, 0,
  48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 44, 0, 34, 0, 34, 0, 44, 0, 44, 0, 
34, 0, 34, 0, 44, 0,
34, 0, 34, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 0, 
34, 0, 44, 0, 34, 0, 7
2, 0, 97, 0, 122, 0, 97, 0, 114, 0, 100, 0, 111, 0, 117, 0, 115, 0, 32, 
0, 84, 0, 111, 0, 32
, 0, 84, 0, 104, 0, 101, 0, 32, 0, 79, 0, 122, 0, 111, 0, 110, 0, 101, 
0, 32, 0, 76, 0, 97,
0, 121, 0, 101, 0, 114, 0, 46, 0, 34, 0, 44, 0, 44, 0, 44, 0, 44, 0, 44, 
0, 48, 0, 46, 0, 48
, 0, 48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 44, 0, 34, 0, 34, 0, 44, 0, 34, 
0, 34, 0, 44, 0, 34,
  0, 65, 0, 100, 0, 100, 0, 105, 0, 116, 0, 105, 0, 111, 0, 110, 0, 97, 
0, 108, 0, 32, 0, 78,
  0, 111, 0, 110, 0, 45, 0, 71, 0, 72, 0, 83, 0, 32, 0, 72, 0, 97, 0, 
122, 0, 97, 0, 114, 0,
100, 0, 32, 0, 83, 0, 116, 0, 97, 0, 116, 0, 101, 0, 109, 0, 101, 0, 
110, 0, 116, 0, 34, 0,
44, 0, 44, 0, 48, 0, 46, 0, 48, 0, 48, 0, 48, 0, 48, 0, 48, 0, 37, 0, 
44, 0, 34, 0, 34, 0, 1
0

And both of these are identical.
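One possible reading of these numbers, offered as a guess rather than a diagnosis: a zero after every value is what UTF-16 little-endian text looks like when it is read one byte at a time. The sketch below copies the first few integers from the output above, decodes them under that assumption, and also checks why the cp1252 and utf-8 variants cannot differ for this data.

```python
# First values of the encode("cp1252") output above, copied as integers.
sample = bytes([44, 0, 34, 0, 72, 0, 52, 0, 49, 0, 49, 0, 34, 0, 44, 0])

# Assumption: the file is UTF-16 little-endian, two bytes per character.
text = sample.decode("utf-16-le")
print(text)

# Reading the same bytes as cp1252 keeps every NUL as a character of its own.
garbled = sample.decode("cp1252")

# NUL and the other ASCII characters encode to the same single byte in both
# cp1252 and utf-8, so re-encoding with either gives back the original bytes.
assert garbled.encode("cp1252") == garbled.encode("utf-8") == sample
```

If the assumption holds, the identical outputs say nothing about which code page is right; they only show that the data reaching encode() contained nothing outside ASCII plus NULs.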

And for completeness ...

(xxex3) C:\Users\mike\env\xxex3\ssds>python
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp1252'
>>>
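Given that console encoding, here is a small sketch of one way to make any str printable: encode to the console's encoding with the "replace" error handler, then decode back. The helper name console_safe is hypothetical. Characters the encoding cannot represent come out as "?".

```python
import sys


def console_safe(text, encoding="cp1252"):
    """Return text with characters the encoding lacks replaced by '?'."""
    return text.encode(encoding, "replace").decode(encoding)


# Using the console's own encoding means print() cannot hit an unencodable
# character; falling back to "ascii" is an assumption for the case where
# sys.stdout.encoding is not set.
print(console_safe("fûłl öf éêlś", sys.stdout.encoding or "ascii"))
```

This loses information by design, so it suits diagnostic printing only, not data that will be stored or processed further.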



On 16/08/2016 3:28 PM, Anthony Briggs wrote:
>
>
> On 16 August 2016 at 14:57, William ML Leslie
> <william.leslie.ttg at gmail.com> wrote:
>
>     On 16 August 2016 at 14:40, Anthony Briggs
>     <anthony.briggs at gmail.com> wrote:
>     > print("Mÿ hôvèrçràft îß fûłl öf éêlś")
>     >
>     > works just fine for me, since you're just printing an internal
>     > Python string.
>
>     It will work fine unless you're on Mike's machine - if
>     sys.stdout.encoding is cp850 and you've got unicode_literals imported
>     (or are using python3), it won't.
>
>
> That string is translated to a cp1252 character set, so I'd be 
> surprised if it didn't work.
>
> OTOH, try utf-8 characters in a Windows Python REPL, and you don't 
> even make it to the end of the string :)
>
> print("Mÿ hôvèrçrà ft îß fûll öf éêls")
>
>     > The problem is from trying to print a binary string (which is what
>     > you get from .encode()) as an internal Python string. If you
>     > specify an encoding, the error goes away:
>     >
>     > print("Mÿ hôvèrçràft îß fûłl öf éêlś".encode("utf-8").decode("cp1252", "replace"))
>
>     The only reason to encode to utf-8 and then decode from cp1252 is to
>     fix incorrect input.
>
>     I think you mean .encode("cp1252", "replace").decode("cp1252")
>
>
> No - the point was to get a binary string that doesn't translate 
> nicely into cp1252, otherwise you don't need the 'replace' parameter. 
> This is Mike's core problem - he's reading bytes from a utf-8 file, 
> and trying to print that to the terminal.
>
> Anthony
