UTF-16 or something else?

Tue Feb 9 10:17:24 EST 2021

On 2021-02-09, Skip Montanaro <skip.montanaro at gmail.com> wrote:
> I downloaded US hospital ICU capacity data this morning from this page:
>
> https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility
>
> (The download link is about halfway down the page.)
>
> Trying to read it using my personal CSV tools without specifying an
> encoding, it failed to understand the first column, hospital_pk. That is
> apparently because the file isn't simply ASCII or UTF-8. There are a few
> bytes ahead of the "h". However, if I open the file using "utf-16" as the
> encoding, Python complains there is no BOM. od(1) suggests there is
> *something* ahead of the first column name, but it's three bytes, not two:
>
> % od -A x -t x1z -v < reported_hospital_capacity_admissions_facility_level_weekly_average_timeseries_20210207.csv | head
> 000000 *ef bb bf* 68 6f 73 70 69 74 61 6c 5f 70 6b 2c 63  >...hospital_pk,c<
> 000010 6f 6c 6c 65 63 74 69 6f 6e 5f 77 65 65 6b 2c 73  >ollection_week,s<
> 000020 74 61 74 65 2c 63 63 6e 2c 68 6f 73 70 69 74 61  >tate,ccn,hospita<
> ...

It's UTF-8 with a BOM prepended (the character U+FEFF, which UTF-8
encodes as the three bytes ef bb bf). That's not uncommon when a file
has been converted to UTF-8 from UTF-16 or has been produced by shitty
Microsoft software. You can tell at a glance that it's not UTF-16
because the ASCII dump would l.o.o.k. .l.i.k.e. .t.h.i.s.
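
To make the byte-level difference concrete, here is a minimal sketch
that encodes the first column name both ways. The variable names are
only illustrative, and the UTF-16 case assumes little-endian byte order
with no BOM.

```python
import codecs

text = "hospital_pk"

# UTF-8 with the three-byte BOM in front, as in the od dump above.
utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# What the same text would look like as UTF-16 (little-endian, no BOM):
# every ASCII character is followed by a zero byte.
utf16_le = text.encode("utf-16-le")

print(utf8_with_bom[:6].hex(" "))  # ef bb bf 68 6f 73
print(utf16_le[:6].hex(" "))       # 68 00 6f 00 73 00
```

Those zero bytes are what od would show as dots between the letters.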

You can decode it as utf-8 and strip the leading BOM character (U+FEFF)
yourself, or as someone else has rightly said, Python can decode it as
utf-8-sig, which does that automatically for you.
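
A minimal sketch of both options, using an in-memory stand-in for the
first line of the file rather than the real download (the sample bytes
and the read_header helper are made up for illustration):

```python
import csv
import io

# Stand-in for the start of the CSV file: UTF-8 BOM, then the header row.
raw = b"\xef\xbb\xbfhospital_pk,collection_week,state\r\n"


def read_header(data, encoding):
    """Decode the bytes with the given encoding and return the first CSV row."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(stream))


# Plain utf-8 keeps the BOM as U+FEFF glued to the first column name.
header_plain = read_header(raw, "utf-8")

# utf-8-sig drops a leading BOM if there is one.
header_sig = read_header(raw, "utf-8-sig")

print(repr(header_plain[0]))  # '\ufeffhospital_pk'
print(repr(header_sig[0]))    # 'hospital_pk'
```

With a real file the same thing is open(path, encoding="utf-8-sig",
newline=""). utf-8-sig also reads files that have no BOM, so it is safe
to use when you don't know which kind you'll get.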