[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Thu Jan 23 12:13:55 EST 2014

On Thu, Jan 23, 2014 at 11:58 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote:
>>>
>>> another curious example, encode utf-8 to latin-1 bytes
>>>
>>> >>> b
>>> array(['Õsc', 'zxc'],
>>>       dtype='<U3')
>>> >>> b[0].encode('utf8')
>>> b'\xc3\x95sc'
>>> >>> b[0].encode('latin1')
>>> b'\xd5sc'
>>> >>> b.astype('S')
>>> Traceback (most recent call last):
>>>   File "<pyshell#40>", line 1, in <module>
>>>     b.astype('S')
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>> position 0: ordinal not in range(128)
>>> >>> c = b.view('S4').astype('S1').view('S3')
>>> >>> c
>>> array([b'\xd5sc', b'zxc'],
>>>       dtype='|S3')
>>> >>> c[0].decode('latin1')
>>> 'Õsc'
>>
>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>> ascii:
>>
>> >>> np.array(['Õsc']).astype('S4')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>> >>> np.array(['Õsc']).view('S4')
>> array([b'\xd5', b's', b'c'],
>>       dtype='|S4')
>
>
> No, a view doesn't change the memory, it just changes the
> interpretation and there shouldn't be any conversion involved.
> astype does type conversion, but it goes through ascii encoding which fails.
>
> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
> >>> b.tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
> >>> b.view('S12')
> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>       dtype='|S12')
>
> The conversion happens somewhere in the array creation, but I have no
> idea about the memory encoding for ucs2 and the low-level layouts.
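
The view-versus-astype distinction described above can be checked directly. The following is only a sketch: it uses tobytes() (the current spelling of tostring()) and np.char.encode as one possible way to name the codec explicitly; neither call appears in the thread. The variable names are made up for illustration.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# NumPy stores each unicode character as a 4-byte code point (UCS-4),
# so two 3-character strings take 24 bytes.
raw = b.tobytes()

# A view reinterprets those same 24 bytes as two 12-byte strings.
# Nothing is encoded, decoded, or copied.
v = b.view('S12')
same_buffer = np.shares_memory(b, v)

# An explicit conversion that names the codec, instead of relying on
# the implicit ascii encoding that astype('S') goes through.
encoded = np.char.encode(b, 'latin1')
```

Here raw starts with b'\xd5\x00\x00\x00', which is the code point of 'Õ' in little-endian order and not the result of any latin-1 encoding step. The view only looks like latin-1 because the first byte of each 4-byte code point below 256 coincides with its latin-1 byte.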

utf8 encoded bytes

>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>> a
array([b'\xc3\x95sc', b'zxc'],
      dtype='|S4')
>>> a.tostring()
b'\xc3\x95sczxc\x00'
>>> a.view('S8')
array([b'\xc3\x95sczxc'],
      dtype='|S8')

>>> a[0].decode('latin1')
'Ã\x95sc'
>>> a[0].decode('utf8')
'Õsc'
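
For a whole array of UTF-8 bytes, the same decoding can be done elementwise. This is a rough sketch that assumes np.char.decode is acceptable here (it is not used elsewhere in this thread), and it passes b'zxc' as bytes rather than mixing str and bytes in the constructor.

```python
import numpy as np

# Same data as above: UTF-8 encoded bytes in an 'S' array.
a = np.array(['Õsc'.encode('utf8'), b'zxc'], dtype='S')

# The itemsize counts bytes, not characters: 'Õ' takes two bytes in UTF-8.
nbytes_per_item = a.dtype.itemsize

# Decoding with the codec the bytes were written in recovers the text.
text = np.char.decode(a, 'utf8')

# Decoding with latin1 never raises, but maps each byte to its own
# character, so the two-byte 'Õ' turns into two characters.
mojibake = np.char.decode(a, 'latin1')
```

The array itself carries no record of which codec produced the bytes, so the caller has to supply it.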

Josef

>
> Josef
>
>>
>>> --------
>>> The original numpy py3 conversion used latin-1 as default
>>> (It's still used in statsmodels, and I haven't looked at the structure
>>> under the common py2-3 codebase)
>>>
>>> if sys.version_info[0] >= 3:
>>>     import io
>>>     bytes = bytes
>>>     unicode = str
>>>     asunicode = str
>>
>> These two functions are an abomination:
>>
>>>     def asbytes(s):
>>>         if isinstance(s, bytes):
>>>             return s
>>>         return s.encode('latin1')
>>>     def asstr(s):
>>>         if isinstance(s, str):
>>>             return s
>>>         return s.decode('latin1')
>>
>>
>> Oscar
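
One possible reading of the objection to asbytes/asstr, sketched with the two functions copied from the quote above (the test values and variable names are made up for illustration): latin-1 maps every byte value to the code point with the same number, so decoding can never fail and bytes in the wrong encoding pass through silently.

```python
def asbytes(s):
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')


def asstr(s):
    if isinstance(s, str):
        return s
    return s.decode('latin1')


# Every possible byte value survives a round trip through str.
all_bytes = bytes(range(256))
round_trip_ok = asbytes(asstr(all_bytes)) == all_bytes

# UTF-8 input is accepted without complaint and comes out garbled.
garbled = asstr('Õsc'.encode('utf8'))

# Text outside latin-1 (code points above 255) cannot be encoded at all.
try:
    asbytes('\u20ac')
    euro_encodes = False if False else True
except UnicodeEncodeError:
    euro_encodes = False
```

So the pair is lossless for arbitrary bytes, but it gives no error when the bytes were actually written in some other encoding, and it fails on text that latin-1 cannot represent.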