Changing strings in files

Tue Nov 10 06:08:54 EST 2020

On 10Nov2020 10:07, Manfred Lotz <ml_news at posteo.de> wrote:
>On Tue, 10 Nov 2020 18:37:54 +1100
>Cameron Simpson <cs at cskk.id.au> wrote:
>> Use os.walk for trees. scandir does a single directory.
>
>Perhaps better. I like to use os.scandir this way:
>
>import os
>from typing import Iterator
>
>def scantree(path: str) -> Iterator[os.DirEntry[str]]:
>    """Recursively yield DirEntry objects (no directories)
>          for a given directory.
>    """
>    for entry in os.scandir(path):
>        if entry.is_dir(follow_symlinks=False):
>            # Recurse into real subdirectories; the directory itself is not yielded.
>            yield from scantree(entry.path)
>        else:
>            yield entry
>
>Worked fine so far. I think I coded it this way because I wanted the
>full path of the file the easy way.

Yes, that's fine and easy to read. Note that this is effectively a 
recursive call though, with the associated costs:

- a scandir (or listdir, whatever) has the directory open, and holds it 
  open while you scan the subdirectories; by contrast os.walk only opens 
  one directory at a time

- likewise, if you're maintaining data during a scan, that is held while 
  you process the subdirectories; with an os.walk you tend to do that 
  and release the memory before the next iteration of the main loop 
  (obviously, depending exactly what you're doing)

However, directory trees tend not to be particularly deep, and the depth 
governs the excess state you're keeping around.
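
For comparison, a minimal sketch of the os.walk form, untested against 
your tree. The name walkfiles is made up for illustration. It yields 
plain path strings rather than DirEntry objects, and joins the directory 
path onto each name to get the full path:

```python
import os
from typing import Iterator


def walkfiles(top: str) -> Iterator[str]:
    """Yield the full path of every name os.walk reports as a file below top."""
    # With the default followlinks=False, os.walk does not descend into
    # symlinked directories.
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Usage is the same as your generator: for path in walkfiles("some/dir"): ...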

>> >   - check if a file is a text file
>>
>> This requires reading the entire file. You want to check that it
>> consists entirely of lines of text in your expected text encoding.
>> These days UTF-8 is the common default, but getting this correct is
>> essential if you want to recognise text. So as a first cut, totally
>> untested:
>>
>> ...
>
>The reason I want to check if a file is a text file is that I don't
>want to try replacing patterns in binary files (executable binaries,
>archives, audio files and so on).

Exactly, which is why you should not trust, say, the "file" utility. It 
scans only the opening part of the file. Great for rejecting files, but 
not reliable for being _sure_ about the whole file being text when it 
doesn't reject.
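
To make "check the whole file" concrete, here is a rough sketch. The 
function name is_text_file, the UTF-8 default, the chunk size and the 
rejection of NUL characters are all my assumptions, so adjust them to 
your data. It decodes the file incrementally, so memory use stays small 
even for large files:

```python
import codecs


def is_text_file(path: str, encoding: str = "utf-8", chunk_size: int = 65536) -> bool:
    """Return True if the whole file decodes in `encoding` and contains no NUL."""
    # An incremental decoder copes with multibyte characters that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Assumption: a NUL character means "binary", even though
                # it is technically valid in UTF-8.
                if "\0" in decoder.decode(chunk):
                    return False
        # Flush the decoder; this raises if the file ends mid-character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

This only tells you that the bytes are valid in the chosen encoding. A 
binary file that happens to decode cleanly will still pass.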

>Of course, to make this work nicely some heuristic check would be the
>right thing (this is what the file command does). I am aware that a
>heuristic check is not 100% reliable but I think it is good enough.

Shrug. That is a risk you must evaluate yourself. I'm quite paranoid 
about data loss, myself. If you've got backups or are working on copies 
the risks are mitigated.

You could perhaps take a more targeted approach: do your target files 
have distinctive file extensions (for example, all the .py files in a 
source tree)?
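
If so, something like this sketch would restrict the scan to those 
files. The name files_with_ext and the default (".py",) are only 
illustrative:

```python
import os
from typing import Iterator, Tuple


def files_with_ext(top: str, exts: Tuple[str, ...] = (".py",)) -> Iterator[str]:
    """Yield full paths below top whose names end with one of exts."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes.
            if name.endswith(exts):
                yield os.path.join(dirpath, name)
```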

Cheers,
Cameron Simpson <cs at cskk.id.au>