UTF-16 or something else?

Tue Feb 9 09:59:08 EST 2021

Try setting encoding to: "utf-8-sig".

'ef bb bf' is the byte order mark (BOM) for UTF-8 (most systems do not
include this in UTF-8 encoded files).

Python will recognize and strip a leading UTF-8 BOM if you use the
'utf-8-sig' encoding when reading files. If the file has no BOM, it is
read the same way as plain 'utf-8'.
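
Here is a minimal sketch of the difference, assuming a small in-memory
sample rather than the real download. The column names are taken from the
od output quoted below; the data row is made up for illustration.

```python
import codecs
import csv
import io

# Simulated file contents: a UTF-8 BOM followed by a CSV header and one row.
# The row values are invented; only the header names come from the od dump.
raw = codecs.BOM_UTF8 + b"hospital_pk,collection_week,state\n000001,2021-01-29,XX\n"

# Plain "utf-8" leaves the BOM as U+FEFF glued to the first column name.
plain = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# "utf-8-sig" drops a leading BOM if one is present.
stripped = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(plain[0]))     # '\ufeffhospital_pk'
print(repr(stripped[0]))  # 'hospital_pk'
```

The same encoding name works with open(), e.g.
open(args[0], "r", encoding="utf-8-sig"), so it can be passed on the
command line like any other encoding.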

Steve

On Tue, Feb 9, 2021 at 2:56 PM Skip Montanaro <skip.montanaro at gmail.com>
wrote:

> I downloaded US hospital ICU capacity data this morning from this page:
>
>
> https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility
>
> (The download link is about halfway down the page.)
>
> Trying to read it using my personal CSV tools without specifying an
> encoding, it failed to understand the first column, hospital_pk. That is
> apparently because the file isn't simply ASCII or UTF-8. There are a few
> bytes ahead of the "h". However, if I open the file using "utf-16" as the
> encoding, Python complains there is no BOM. od(1) suggests there is
> *something* ahead of the first column name, but it's three bytes, not two:
>
> % od -A x -t x1z -v < reported_hospital_capacity_admissions_facility_level_weekly_average_timeseries_20210207.csv | head
> 000000 *ef bb bf* 68 6f 73 70 69 74 61 6c 5f 70 6b 2c 63  >...hospital_pk,c<
> 000010 6f 6c 6c 65 63 74 69 6f 6e 5f 77 65 65 6b 2c 73  >ollection_week,s<
> 000020 74 61 74 65 2c 63 63 6e 2c 68 6f 73 70 69 74 61  >tate,ccn,hospita<
> ...
>
> I'm opening the file like so:
>
> inf = open(args[0], "r", encoding=encoding)
>
> where encoding is passed on the command line. I know I can simply edit out
> those bytes and probably be good-to-go, but I'd prefer not to. What should
> I be passing for the encoding?
>
> Skip, who thought everybody had effectively settled on utf-8 at this point,
> but apparently not...