[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 23 12:42:13 EST 2014


On Thu, Jan 23, 2014 at 12:13 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:58 AM,  <josef.pktd at gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
>> <oscar.j.benjamin at gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote:
>>>>
>>>> another curious example, encode utf-8 to latin-1 bytes
>>>>
>>>> >>> b
>>>> array(['Õsc', 'zxc'],
>>>>       dtype='<U3')
>>>> >>> b[0].encode('utf8')
>>>> b'\xc3\x95sc'
>>>> >>> b[0].encode('latin1')
>>>> b'\xd5sc'
>>>> >>> b.astype('S')
>>>> Traceback (most recent call last):
>>>>   File "<pyshell#40>", line 1, in <module>
>>>>     b.astype('S')
>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>>> position 0: ordinal not in range(128)
>>>> >>> c = b.view('S4').astype('S1').view('S3')
>>>> >>> c
>>>> array([b'\xd5sc', b'zxc'],
>>>>       dtype='|S3')
>>>> >>> c[0].decode('latin1')
>>>> 'Õsc'
>>>
>>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>>> ascii:
>>>
>>> >>> np.array(['Õsc']).astype('S4')
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> >>> np.array(['Õsc']).view('S4')
>>> array([b'\xd5', b's', b'c'],
>>>       dtype='|S4')
>>
>>
>> No, a view doesn't change the memory, it just changes the
>> interpretation and there shouldn't be any conversion involved.
>> astype does type conversion, but it goes through ascii encoding which fails.
>>
>> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>> >>> b.tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>> >>> b.view('S12')
>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>       dtype='|S12')
>>
>> The conversion happens somewhere in the array creation, but I have no
>> idea about the in-memory encoding (UCS-2?) or the low-level layout.
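
To make the view/astype distinction above concrete, here is a minimal sketch. It assumes a current NumPy where tobytes() replaces the older tostring(); the variable names are only for illustration. A view reuses the same buffer byte for byte, while astype has to convert each element and may fail on non-ASCII text.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# A view reinterprets the existing buffer; nothing is copied or converted.
v = b.view('S12')
same_buffer = np.shares_memory(b, v)
same_bytes = v.tobytes() == b.tobytes()
print(same_buffer, same_bytes)

# astype converts element by element; with non-ASCII characters the
# conversion to 'S' is expected to raise (this is what the thread observed).
try:
    b.astype('S')
    astype_failed = False
except UnicodeEncodeError:
    astype_failed = True
print(astype_failed)
```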

>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b[0].tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>> 'Õsc'.encode('utf-32LE')
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

Is that the encoding for 'U'?
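
A quick way to check this, as a sketch rather than a statement about NumPy internals: compare the raw buffer with the UTF-32 little-endian encoding, and view the same memory as 32-bit unsigned integers to see the code points. The '<u4' view and the variable names are my own choices for illustration.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# 3 characters at 4 bytes each.
print(b.dtype.itemsize)

# Raw bytes of the first element versus its UTF-32-LE encoding.
raw = b[:1].tobytes()
print(raw == 'Õsc'.encode('utf-32-le'))

# Reinterpret the buffer as little-endian uint32: one integer per character.
code_points = b.view('<u4').reshape(len(b), -1)
print(code_points[0].tolist())
print([ord(ch) for ch in 'Õsc'])
```

If the last two lines print the same list, each 4-byte slot holds exactly one Unicode code point.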

---
Another side effect of null truncation: truncated data cannot be decoded

>>> b.view('S4').tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S4')[0]
b'\xd5'
>>> b.view('S4')[0].tostring()
b'\xd5'
>>> b.view('S4')[:1].tostring()
b'\xd5\x00\x00\x00'

>>> b.view('S4')[0].decode('utf-32LE')
Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    b.view('S4')[0].decode('utf-32LE')
  File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
    return codecs.utf_32_le_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
0: truncated data

>>> b.view('S4')[:1].tostring().decode('utf-32LE')
'Õ'
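
The same thing as a script, in case it is easier to follow than the session. This is only a sketch and uses tobytes() in place of tostring(). Scalar indexing on an 'S' array strips the trailing NUL bytes, while a one-element slice keeps the full 4-byte field.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')
s4 = b.view('S4')            # same memory, seen as six 4-byte fields

# Scalar access drops trailing NULs, so only one byte is left.
truncated = s4[0]
try:
    truncated.decode('utf-32-le')
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False

# Slicing keeps an array; its buffer still has all 4 bytes.
full = s4[:1].tobytes()
decoded = full.decode('utf-32-le')
print(truncated, decode_ok, full, decoded)
```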

numpy arrays need a decode and encode method
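
In the meantime, the closest thing I know of is the function pair in numpy.char, which applies an explicit codec element by element. A minimal sketch, assuming np.char.encode and np.char.decode are available in the installed NumPy; the variable names are made up for the example.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Encode each element with an explicit codec instead of the implicit ascii.
latin1 = np.char.encode(b, 'latin1')
utf8 = np.char.encode(b, 'utf-8')

# Decode the bytes array back to a unicode array.
roundtrip = np.char.decode(utf8, 'utf-8')

print(latin1.tolist())
print(utf8.tolist())
print(roundtrip.tolist())
```

These are functions rather than array methods, so the codec always has to be spelled out at the call site.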

Josef

>
> utf8 encoded bytes
>
> >>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
> >>> a
> array([b'\xc3\x95sc', b'zxc'],
>       dtype='|S4')
> >>> a.tostring()
> b'\xc3\x95sczxc\x00'
> >>> a.view('S8')
> array([b'\xc3\x95sczxc'],
>       dtype='|S8')
>
> >>> a[0].decode('latin1')
> 'Ã\x95sc'
> >>> a[0].decode('utf8')
> 'Õsc'
>
> Josef
>
>>
>> Josef
>>
>>>
>>>> --------
>>>> The original numpy py3 conversion used latin-1 as default
>>>> (It's still used in statsmodels, and I haven't looked at the structure
>>>> under the common py2-3 codebase)
>>>>
>>>> if sys.version_info[0] >= 3:
>>>>     import io
>>>>     bytes = bytes
>>>>     unicode = str
>>>>     asunicode = str
>>>
>>> These two functions are an abomination:
>>>
>>>>     def asbytes(s):
>>>>         if isinstance(s, bytes):
>>>>             return s
>>>>         return s.encode('latin1')
>>>>     def asstr(s):
>>>>         if isinstance(s, str):
>>>>             return s
>>>>         return s.decode('latin1')
>>>
>>>
>>> Oscar
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion


