[Tutor] Unknown encoded file types.

mhysnm1964 at gmail.com
Sun Feb 7 04:55:49 EST 2021


Alan,

Looping in the mailer with my response.


This is what I was suspecting. Thanks for confirming. I have even tried to
decode the binary variable as UTF-8 and it failed. I am thinking of trying
to work out how to clean the files to remove any text that doesn't fall
within the Western languages. As far as I am aware, only European / English
text should be present. More English than anything else.
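
A minimal sketch of what such a clean-up might look like, assuming the files
are mostly UTF-8 with the odd stray byte. The name decode_lossy is made up
for illustration; the only real API used is the errors argument of
bytes.decode, which can either drop ("ignore") or mark ("replace") bytes
that do not decode.

```python
def decode_lossy(raw: bytes, encoding: str = "utf-8", drop: bool = True) -> str:
    """Decode raw bytes, dropping or marking anything invalid in `encoding`."""
    # "ignore" silently discards undecodable bytes;
    # "replace" substitutes U+FFFD so the damage stays visible.
    return raw.decode(encoding, errors="ignore" if drop else "replace")


sample = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 plus one stray 0xFF byte
cleaned = decode_lossy(sample)              # stray byte removed
marked = decode_lossy(sample, drop=False)   # stray byte shown as U+FFFD
```

Either way this throws information away, so it is probably best kept as a
last resort after trying the likely encodings first.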

I will do some searching on that topic to see whether I understand it and
how difficult it is.

Another question:

When using binary mode to load a text file, do all the encoding bytes stay
present in the data after the content of the file has been loaded? That is,
when you join the content of two files together, do you end up with the
encoding information halfway through the joined text?
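
A quick experiment along these lines could look like the sketch below. It
assumes the only "encoding bytes" in play are byte order marks (BOMs), since
a plain text file normally carries no other encoding information. The two
byte strings stand in for file contents read in binary mode.

```python
import codecs

# Two in-memory "files", each starting with a UTF-8 byte order mark (BOM).
first = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
second = codecs.BOM_UTF8 + "world\n".encode("utf-8")

# Binary mode performs no decoding, so joining keeps every byte, BOMs included.
joined = first + second

# "utf-8-sig" strips only a leading BOM; the second one survives as the
# character U+FEFF in the middle of the decoded text.
text = joined.decode("utf-8-sig")

# Decoding each piece separately before joining removes both BOMs.
clean = first.decode("utf-8-sig") + second.decode("utf-8-sig")
```

So if any of the files do start with a BOM, decoding them one at a time
before concatenating looks like the safer order of operations.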

Sean
-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Alan Gauld via Tutor
Sent: Sunday, 7 February 2021 8:19 PM
To: tutor at python.org
Subject: Re: [Tutor] Unknown encoded file types.

On 07/02/2021 08:07, mhysnm1964 at gmail.com wrote:

> I have 100's of small plain text files that are under 5k each. I am 
> concatenating them into one big text file. The issue I am having is 
> getting encoding errors. I have tried to open them with the encode 
> parameter on the "with open" command. Some of the files are throwing
> encoding UTF errors.
> Looking like they are not in that format. The only reliable way I have 
> managed to open the files  is in binary mode.

Yes, that's right, the only reliable way of opening a file, if you don't
know what is in it, is using binary mode and treating it as a stream of
bytes.

You can interrogate the bytes and see if you recognise any of them, or a
sequence of them, and from that infer an encoding.
You say they are text files, but how do you know? Even if they have a .txt
extension, that's no guarantee that they are really text. And if they are,
how old are they? If more than 20 years, you are likely to be facing all
manner of weird encodings.
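
As one illustration of interrogating the bytes, here is a minimal sketch that
checks for a byte order mark at the start of the data. The function name
sniff_bom and the BOMS table are made up for this example; the BOM constants
come from the standard codecs module. Many files have no BOM at all, in which
case this tells you nothing.

```python
import codecs

# Known byte order marks, UTF-32 first because its little-endian BOM
# begins with the same two bytes as the UTF-16 one.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes):
    """Return a codec name if `raw` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the BOM themselves, so the result
can be passed straight to raw.decode().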

> Is there any way to identify the encoded format before opening to 
> change the encoded format? I have seen some info on the net and don't
> understand it.

Not with certainty. There are tools that can look at the first few bytes and
make an intelligent guess but none are reliable.
Opening a file without knowing what is in it is always fraught with issues.
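
A rough sketch of that kind of guessing, using only the standard library: try
a short list of candidate encodings in order and keep the first that decodes
without error. The function name guess_decode and the candidate list are
assumptions chosen for mostly-English/European text, not a recommendation.

```python
def guess_decode(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Try each candidate codec in turn; return (text, codec) for the first fit."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")
```

Note that latin-1 accepts any byte sequence, so it acts as a catch-all here:
a successful decode only means the bytes were legal in that encoding, not
that the guess was right.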

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos





