Changing strings in files

Tue Nov 10 15:25:57 EST 2020

On Wed, Nov 11, 2020 at 6:36 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Chris Angelico <rosuav at gmail.com> wrote:
> > Eli the Bearded <*@eli.users.panix.com> wrote:
> >> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
> >> That's probably the rough method file(1) and Perl's -T use. (In
> >> particular allow no nulls. Maybe allow ISO-8859-1.)
> > ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> > is checking for a lack of NUL bytes.
>
> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-159. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.

Define "does not use", though. You can decode those bytes just fine:

>>> bytes(range(256)).decode("ISO-8859-1")
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

This is especially true of \x01 to \x1F, since they are most
definitely defined, even though they aren't commonly used.
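
To make the contrast concrete, here is a small sketch comparing how
Python's codecs treat arbitrary bytes. The helper name decodes_as is
made up for illustration; the codec behaviour is what CPython ships.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

every_byte = bytes(range(256))

# ISO-8859-1 maps every byte value to a code point, so this never fails.
print(decodes_as(every_byte, "ISO-8859-1"))   # True
# UTF-8 has structure, so arbitrary bytes are rejected.
print(decodes_as(every_byte, "UTF-8"))        # False
# Python's cp1252 codec leaves a handful of bytes (0x81, for one) unmapped.
print(decodes_as(b"\x81", "cp1252"))          # False
```

So a successful ISO-8859-1 decode tells you nothing about the data,
whereas a successful UTF-8 decode is real evidence of text.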

> > I'd definitely recommend
> > mandating UTF-8, as that's a very good way of recognizing valid text,
> > but if you can't do that then the simple NUL check is all you really
> > need.
>
> Dealing with all UTF-8 is my preference, too.
>
> > And let's be honest here, there aren't THAT many binary files that
> > manage to contain a total of zero NULs, so you won't get many false
> > hits :)
>
> There's always the issue of how much to read before deciding.
>

Right; but a lot of binary file formats are going to include
structured data that will frequently include a NUL byte. For instance,
a PNG file (after the eight-byte signature) consists of chunks, where
each chunk begins with a four-byte big-endian length; and the first
chunk (IHDR) always carries thirteen bytes of data, meaning that its
length field starts with three NULs. So a valid PNG file will have a
NUL as the ninth byte of the file. Other file formats will be
similar, or even better; an ELF binary starts with a sixteen-byte
identification block of which the last few bytes are reserved padding
and set to zero, so that's an even stronger guarantee.
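
As a quick illustration of the PNG case, the sketch below rebuilds the
first sixteen bytes of a PNG file by hand (signature, then the IHDR
length and type) and finds the first NUL. The constant names are mine,
chosen for the example.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte file signature
IHDR_DATA_LENGTH = 13                   # IHDR always has 13 bytes of data

# First chunk: 4-byte big-endian length, then the 4-byte chunk type.
png_start = PNG_SIGNATURE + IHDR_DATA_LENGTH.to_bytes(4, "big") + b"IHDR"

print(png_start.index(0))    # 8, i.e. the ninth byte of the file
print(png_start[8:12])       # b'\x00\x00\x00\r'
```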

If the main job of the program, as in this situation, is to read the
entire file, I would probably have it read in the first 1KB or 16KB or
thereabouts, see if that has any NUL bytes, and if not, proceed to
read in the rest of the file. But depending on the situation, I might
actually have a hard limit on the file size (say, "any file over 1GB
isn't what I'm looking for"), so that would reduce the risks too.
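
A minimal sketch of that approach, assuming a 16KB sample and a 1GB
cap. Both numbers and the function name read_if_text are placeholders
for the example, not anything from the original program.

```python
import os

SNIFF_SIZE = 16 * 1024        # assumed sample size; 1KB would also do
MAX_FILE_SIZE = 1024 ** 3     # assumed hard limit of 1GB

def read_if_text(path):
    """Return the file's bytes, or None if it looks binary or is too big."""
    if os.path.getsize(path) > MAX_FILE_SIZE:
        return None
    with open(path, "rb") as f:
        head = f.read(SNIFF_SIZE)
        if b"\x00" in head:
            return None           # a NUL in the sample: treat as binary
        return head + f.read()    # sample looked clean, read the rest
```

A NUL that first appears after the sample will slip through, which is
the "how much to read" trade-off; a final check on the whole buffer is
cheap if that matters.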

ChrisA