[AstroPy] "ASCII" tables that contain non-ASCII characters
Stephen Bailey
stephenbailey at lbl.gov
Tue Oct 25 00:38:06 EDT 2016
Thanks for the suggestions. The original problem also applies to Python
3.5, though; this isn't just a Python 2.7 thing. If LANG isn't set, the
ascii table readers can break even with Python 3.5, and even if the
non-ASCII character is only in a comment line. For example, the following
table can't be read with format='ascii.basic' unless $LANG is set or one of
the locale tricks from Thomas or Derek is used:
# Some comment
# Å or not?
# Another comment
x y
1 2
3 4
5 6
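
A quick way to see which situation a given session is in: the sketch below reports $LANG and the encoding Python falls back to when a text file is opened without an explicit encoding. The helper name describe_default_text_encoding is made up for illustration, and the statement about open() assumes a Python 3 interpreter that is not running in UTF-8 mode.

```python
import locale
import os


def describe_default_text_encoding():
    """Return (value of $LANG or None, locale-derived default text encoding)."""
    lang = os.environ.get("LANG")
    # getpreferredencoding(False) only queries the encoding; it does not
    # change the process locale as a side effect.
    encoding = locale.getpreferredencoding(False)
    return lang, encoding


lang, encoding = describe_default_text_encoding()
print("LANG =", lang)
print("default text encoding =", encoding)
```

If the reported encoding is an ASCII variant, any file containing a byte above 0x7f will presumably fail to decode in the same way as the example above.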
Also, for the record: apparently Mac OSX needs 'en_US.UTF-8' and not
'en_US.utf8'; flavors of Linux will accept either.
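
Since the accepted spelling differs by platform, one option is to probe for it rather than hard-code it. This is only a sketch using the standard locale module; first_accepted_locale is a hypothetical helper, not part of astropy, and the candidate list is just the two names mentioned in this thread.

```python
import locale


def first_accepted_locale(candidates=("en_US.UTF-8", "en_US.utf8")):
    """Return the first locale name this platform accepts, or None.

    The process locale is restored before returning, so this only probes.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_ALL, name)
            except locale.Error:
                continue  # this spelling is not known on this platform
            return name
        return None
    finally:
        # Runs on every exit path, including the early return above.
        locale.setlocale(locale.LC_ALL, saved)


name = first_accepted_locale()
print("usable locale name:", name)
```

The returned name, if any, could then be handed to set_locale or locale.setlocale; None means neither spelling is installed on the machine.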
Stephen
In [9]: !cat blat.csv
# Some comment
# Å or not?
# Another comment
x y
1 2
3 4
5 6
In [10]: sys.version
Out[10]: '3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:52:12) \n[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'
In [11]: print(os.getenv('LANG'))
None
In [12]: t = Table.read('blat.csv', format='ascii.basic')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-12-4c7389d51f8f> in <module>()
----> 1 t = Table.read('blat.csv', format='ascii.basic')

/Users/sbailey/anaconda/envs/desi/lib/python3.5/site-packages/astropy/table/table.py in read(cls, *args, **kwargs)
   2330 passed through to the underlying data reader (e.g. `~astropy.io.ascii.read`).
   2331 """
-> 2332 return io_registry.read(cls, *args, **kwargs)
   2333
   2334 def write(self, *args, **kwargs):

/Users/sbailey/anaconda/envs/desi/lib/python3.5/site-packages/astropy/io/registry.py in read(cls, *args, **kwargs)
    349
    350 reader = get_reader(format, cls)
--> 351 data = reader(*args, **kwargs)
    352
    353 if not isinstance(data, cls):

/Users/sbailey/anaconda/envs/desi/lib/python3.5/site-packages/astropy/io/ascii/connect.py in io_read(format, filename, **kwargs)
     35 from .ui import read
     36 format = re.sub(r'^ascii\.', '', format)
---> 37 return read(filename, format=format, **kwargs)
     38
     39

/Users/sbailey/anaconda/envs/desi/lib/python3.5/site-packages/astropy/io/ascii/ui.py in read(table, guess, **kwargs)
    287 try:
    288 with get_readable_fileobj(table) as fileobj:
--> 289 table = fileobj.read()
    290 except ValueError: # unreadable or invalid binary file
    291 raise

/Users/sbailey/anaconda/envs/desi/lib/python3.5/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25 def decode(self, input, final=False):
---> 26 return codecs.ascii_decode(input, self.errors)[0]
     27
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
In [13]: with set_locale('en_US.UTF-8'):
    ...:     t = Table.read('blat.csv', format='ascii.basic')
In [14]: t
Out[14]:
<Table length=3>
  x     y
int64 int64
----- -----
    1     2
    3     4
    5     6
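
The failure itself can be reproduced without astropy or a file on disk. The sketch below assumes blat.csv is UTF-8 encoded (consistent with the 0xc3 byte in the traceback) and shows that decoding the same bytes as ASCII fails at the same position, while decoding with an explicit encoding does not depend on the locale. The variable names are illustrative.

```python
# The example table from this message, encoded as UTF-8 bytes.
# "\u00c5" is the A-with-ring character used in the second comment line.
raw = (
    "# Some comment\n"
    "# \u00c5 or not?\n"
    "# Another comment\n"
    "x y\n"
    "1 2\n"
    "3 4\n"
    "5 6\n"
).encode("utf-8")

try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that is not valid ASCII.
    bad_position = err.start
    bad_byte = raw[err.start]
    print("ascii decode failed at position", bad_position, "byte", hex(bad_byte))

# Decoding with an explicit encoding never consults LANG or the locale.
lines = raw.decode("utf-8").splitlines()
print(len(lines), "lines decoded")
```

The position and byte reported here match the ones in the traceback above (0xc3 at position 17). astropy.io.ascii.read is documented as accepting a list of lines as well as a file name, so the decoded lines could presumably be passed to it with format='basic'; that step is left out so the sketch stays standard-library only.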
On Mon, Oct 24, 2016 at 6:53 PM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
>
>
> On Mon, Oct 24, 2016 at 6:11 PM, Derek Homeier <derek at astro.physik.uni-goettingen.de> wrote:
>
>> On 25 Oct 2016, at 12:01 am, Nathan Goldbaum <nathan12343 at gmail.com>
>> wrote:
>> >
>> > I believe this is issue 2923:
>> >
>> > https://github.com/astropy/astropy/issues/2923
>> >
>> > On Mon, Oct 24, 2016 at 4:45 PM, Benjamin Alan Weaver <baweaver at lbl.gov> wrote:
>> > Hello y'all,
>> >
>> > We are trying to read "ASCII" tables containing atomic line data
>> > provided by NIST. When you request the line wavelength data in
>> > angstroms, NIST very helpfully labels the columns with the angstrom
>> > symbol (Å), which is not strictly part of the ASCII character set.
>> >
>> > We can read these tables with Table.read() *and* the environment
>> > variable LANG=en_US.utf-8 set. However, if LANG is not set,
>> > Table.read() fails to decode these files.
>> >
>> > As far as I can tell the underlying read() function in astropy.io.ascii
>> > does not accept keywords related to the file encoding.
>> >
>> > So two questions:
>> >
>> > 1. Is the lack of an encoding keyword a bug that should be reported?
>> >
>> > 2. Is there a workaround that does not rely on LANG being set?
>>
>> A workaround that would at least let you avoid manipulating the
>> environment outside Python would be
>>
>> import locale
>> locale.setlocale(locale.LC_ALL, str('en_US.utf8'))
>>
>
> You can make this a little cleaner using the set_locale context manager in
> astropy:
>
> from astropy.table import Table
> from astropy.utils.misc import set_locale
> with set_locale('en_US.utf8'):
>     dat = Table.read(...)
>
> As to the original question of whether this should be reported as a bug,
> it has already been discussed in:
>
> https://github.com/astropy/astropy/issues/3826
>
> That discussion ended without any really clear consensus except that using
> Python 3 is a good thing if that is an option. I have never seriously
> evaluated how difficult it would be to implement support for unicode inputs
> for Python 2. A basic recipe is shown in the stdlib csv package
> documentation, but I don't know how messy a fully working implementation
> would get.
>
> Cheers,
> Tom A
>
>
>>
>> Cheers,
>> Derek
>>
>> _______________________________________________
>> AstroPy mailing list
>> AstroPy at scipy.org
>> https://mail.scipy.org/mailman/listinfo/astropy
>>