[Numpy-discussion] using loadtxt to load a text file in to a numpy array
josef.pktd at gmail.com
Thu Jan 23 12:42:13 EST 2014
On Thu, Jan 23, 2014 at 12:13 PM, <josef.pktd at gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:58 AM, <josef.pktd at gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
>> <oscar.j.benjamin at gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote:
>>>>
>>>> another curious example, encode utf-8 to latin-1 bytes
>>>>
>>>> >>> b
>>>> array(['Õsc', 'zxc'],
>>>> dtype='<U3')
>>>> >>> b[0].encode('utf8')
>>>> b'\xc3\x95sc'
>>>> >>> b[0].encode('latin1')
>>>> b'\xd5sc'
>>>> >>> b.astype('S')
>>>> Traceback (most recent call last):
>>>>   File "<pyshell#40>", line 1, in <module>
>>>>     b.astype('S')
>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>>> >>> c = b.view('S4').astype('S1').view('S3')
>>>> >>> c
>>>> array([b'\xd5sc', b'zxc'],
>>>> dtype='|S3')
>>>> >>> c[0].decode('latin1')
>>>> 'Õsc'
>>>
>>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>>> ascii:
>>>
>>> >>> np.array(['Õsc']).astype('S4')
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> >>> np.array(['Õsc']).view('S4')
>>> array([b'\xd5', b's', b'c'],
>>>       dtype='|S4')
>>
>>
>> No, a view doesn't change the memory, it just changes the
>> interpretation and there shouldn't be any conversion involved.
>> astype does type conversion, but it goes through ascii encoding which fails.
>>
>> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>> >>> b.tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>> >>> b.view('S12')
>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>       dtype='|S12')
>>
>> The conversion happens somewhere in the array creation, but I have no
>> idea about the memory encoding for uc2 and the low level layouts.
>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b[0].tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>> 'Õsc'.encode('utf-32LE')
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
Is that the encoding for 'U'?
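
A minimal sketch of how this could be checked, assuming a current NumPy where tobytes() is the replacement for tostring(). It compares the raw buffer of the whole array with Python's own UTF-32 little-endian encoding of the same text. Both strings here fill the full width of 3 characters; shorter strings would be padded with null code points, so the comparison would need padding as well.

```python
import numpy as np

# Same array as in the transcript above.
b = np.array(['Õsc', 'zxc'], dtype='<U3')

raw = b.tobytes()  # raw memory of the array, 2 items x 12 bytes

# Encode the concatenated text with Python's UTF-32 little-endian codec
# (the '-le' variant writes no byte-order mark).
expected = ''.join(b.tolist()).encode('utf-32-le')

same_layout = raw == expected
bytes_per_char = b.dtype.itemsize // 3  # itemsize is 12 for '<U3'
print(same_layout, bytes_per_char)
```

If the two byte strings match, the buffer of a '<U3' array is laid out as 4 bytes per character in little-endian order, which is what the utf-32LE comparison above suggests.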
---
Another side effect of null truncation: the truncated data cannot be decoded.
>>> b.view('S4').tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S4')[0]
b'\xd5'
>>> b.view('S4')[0].tostring()
b'\xd5'
>>> b.view('S4')[:1].tostring()
b'\xd5\x00\x00\x00'
>>> b.view('S4')[0].decode('utf-32LE')
Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    b.view('S4')[0].decode('utf-32LE')
  File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
    return codecs.utf_32_le_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position 0: truncated data
>>> b.view('S4')[:1].tostring().decode('utf-32LE')
'Õ'
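
The difference between item access and slicing can be sketched as follows. This is only an illustration of the behaviour shown above, not a recommended way to decode: the names cells, item, chunk and padded are made up for the example, and padding back to 4 bytes by hand assumes the cell really held one UTF-32 code unit.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')
cells = b.view('S4')             # reinterpret each 4-byte code unit as bytes

item = cells[0]                  # scalar access strips trailing NUL bytes
chunk = cells[:1].tobytes()      # a slice keeps the full 4-byte cell

# Restore the stripped padding by hand before decoding (assumption:
# the original cell was exactly 4 bytes wide).
padded = item.ljust(4, b'\x00')
print(len(item), len(chunk), padded.decode('utf-32-le'))
```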
NumPy arrays need decode and encode methods.
Josef
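
For comparison, NumPy does have module-level functions np.char.encode and np.char.decode that apply str.encode and bytes.decode elementwise. A short sketch, assuming they cover the case discussed here (they are functions, not methods on plain ndarrays):

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Elementwise encoding with an explicit codec instead of the implicit
# ascii conversion used by astype('S').
latin = np.char.encode(b, 'latin1')
utf8 = np.char.encode(b, 'utf8')

# Decoding needs the same codec that produced the bytes.
back = np.char.decode(utf8, 'utf8')
print(latin.dtype, utf8.dtype, back.tolist())
```

The codec has to be named explicitly in both directions, so the latin-1 versus utf-8 ambiguity seen in the examples does not arise silently.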
>
> utf8 encoded bytes
>
> >>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
> >>> a
> array([b'\xc3\x95sc', b'zxc'],
>       dtype='|S4')
> >>> a.tostring()
> b'\xc3\x95sczxc\x00'
> >>> a.view('S8')
> array([b'\xc3\x95sczxc'],
>       dtype='|S8')
>
> >>> a[0].decode('latin1')
> 'Ã\x95sc'
> >>> a[0].decode('utf8')
> 'Õsc'
>
> Josef
>
>>
>> Josef
>>
>>>
>>>> --------
>>>> The original numpy py3 conversion used latin-1 as default
>>>> (It's still used in statsmodels, and I haven't looked at the structure
>>>> under the common py2-3 codebase)
>>>>
>>>> if sys.version_info[0] >= 3:
>>>>     import io
>>>>     bytes = bytes
>>>>     unicode = str
>>>>     asunicode = str
>>>
>>> These two functions are an abomination:
>>>
>>>> def asbytes(s):
>>>>     if isinstance(s, bytes):
>>>>         return s
>>>>     return s.encode('latin1')
>>>> def asstr(s):
>>>>     if isinstance(s, str):
>>>>         return s
>>>>     return s.decode('latin1')
>>>
>>>
>>> Oscar