How to know if a file is a text file

Luca lucafbb at gmail.com
Sun Nov 15 07:49:54 EST 2009


On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk <philip at semanchuk.com> wrote:
> Hi Luca,
> You have to define what you mean by "text" file. It might seem obvious, but
> it's not.
>
> Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
> can be more difficult to detect because you have to guess the file's
> encoding (unless it has a BOM; most don't).
>
> And do you need to verify that every single byte in the file is "text"? What
> if the file is 1GB, do you still want to examine every single byte?
>
> If you give us your own (specific!) definition of what "text" means, or
> perhaps a description of the problem you're trying to solve, then maybe we
> can help you better.
>

Thanks all.

I was fairly sure that this would not be a simple task. For now,
searching only inside ASCII-encoded files is not enough for me (my
native language falls outside that encoding :-)
Checking every single byte could be a good solution...
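
A minimal sketch of such a byte-level check, assuming that "text" means
valid UTF-8 with no NUL bytes (a definition chosen here for illustration;
Latin-1 files containing accented letters would be rejected). The function
name is_utf8_text and the chunk size are hypothetical. Reading in chunks
keeps memory use flat even for a very large file.

```python
import codecs


def is_utf8_text(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8 and has no NUL bytes."""
    # The incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # NUL is valid UTF-8, but it almost never appears in real text,
            # so it is treated as a sign of binary data (an assumption).
            if b"\x00" in chunk:
                return False
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return False
    try:
        # Flush the decoder: fails if the file ends mid-character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

The check stops at the first offending chunk, so a binary file is usually
rejected after a single read.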

I can start with the mimetypes module and, if the file has no
extension, check it byte by byte, much as the "file" command does.
Better: I can use the "file" command if it is available.
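
A rough sketch of that plan: try mimetypes first, then fall back to the
"file" command when it is installed. The function name guess_is_text is
hypothetical, and the sketch assumes a "file" implementation that accepts
the --brief and --mime-type options (recent versions do, but this is worth
checking on the target system).

```python
import mimetypes
import shutil
import subprocess


def guess_is_text(path):
    """Return True or False when a guess is possible, None otherwise."""
    # Step 1: guess from the file name extension only (no file access).
    mime, _encoding = mimetypes.guess_type(path)

    # Step 2: no usable extension, so ask "file" if it is on the PATH.
    if mime is None and shutil.which("file"):
        result = subprocess.run(
            ["file", "--brief", "--mime-type", path],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            mime = result.stdout.strip()

    if not mime:
        return None  # neither method produced an answer
    return mime.startswith("text/")
```

For example, guess_is_text("notes.txt") is answered from the extension
alone, while a file with no extension is passed to "file". A None result
means the caller still has to inspect the bytes itself.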

Again: thanks all!

-- 
-- luca


