[Tutor] Unknown encoded file types.

Cameron Simpson cs at cskk.id.au
Sun Feb 7 06:40:10 EST 2021


On 07Feb2021 20:55, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>This is what I was suspecting. Thanks for confirming. I have even tried 
>to decode the binary variable into UTF and it failed.

That just says it isn't UTF-8, assuming UTF-8 is what you were trying.
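
As a rough sketch of that kind of test, you can try a few candidate 
encodings in turn. The function name try_decodings and the candidate 
list are my own inventions for illustration, not anything standard. 
Note that a successful decode only means the bytes were valid in that 
encoding, not that it is the right one; latin-1 in particular accepts 
every byte value, so it goes last.

```python
def try_decodings(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes data.

    The candidate list is an assumption; order matters, strictest first.
    """
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            # Not valid in this encoding, try the next candidate.
            continue
    return None, None


# Bytes written by something using Windows-1252, not UTF-8.
sample = "café".encode("cp1252")
enc, text = try_decodings(sample)
print(enc, text)
```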

>I am thinking of trying
>to work out how to clean the file to remove any text that don't fall within
>the western language.

Is that necessary?

>Far as I am aware, only European / English should be
>present. More English than anything else.

That leaves plenty of scope for nonASCII bytes. What kind of criterion 
do you think would help you?

>When using binary mode to load a text file. Does all the encoding bytes 
>stay present in the file after the content of the file has been loaded? Thus when
>you join the content from two files together. You are getting the encoding
>information half way through the join text?

There often aren't any "encoding bytes" to save/preserve. The text will 
simply have been transcribed in whatever encoding was in use. There 
aren't standard "markers" for this stuff, which is why an unknown file 
is guesswork.
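
To illustrate with a small made-up example (the two strings stand in 
for your two files): joining bytes just puts one run of bytes after the 
other. Nothing is added or removed, so if the two sources used different 
encodings the joined result is a mix, and decoding it as a single 
encoding may fail or produce garbage. Decoding each piece separately, 
then joining the resulting str values, avoids that.

```python
# Stand-ins for the contents of two files read in binary mode.
part_a = "naïve".encode("utf-8")    # first file happened to be UTF-8
part_b = "café".encode("latin-1")   # second file happened to be Latin-1

# Joining bytes adds no markers; the lengths simply add up.
joined = part_a + b"\n" + part_b
print(len(joined) == len(part_a) + 1 + len(part_b))

# Decoding the mixed bytes as one encoding is where trouble shows up.
try:
    joined.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)

# Decode each piece with its own encoding first, then join the text.
text = part_a.decode("utf-8") + "\n" + part_b.decode("latin-1")
print(text)
```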

If the text commences with a BOM (the bytes FE FF or FF FE) it is 
probably UTF-16BE or UTF-16LE respectively. But otherwise you're on your 
own, falling back to libraries which guess from the leading data and the 
byte value distributions.
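
A minimal sketch of a BOM check, using the constants from the standard 
codecs module. The function name sniff_bom is hypothetical, and I have 
added the UTF-8 BOM to the list as well since it is cheap to test for. 
A return of None just means no BOM was found, which is the common case.

```python
import codecs


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None."""
    candidates = [
        (codecs.BOM_UTF8, "utf-8-sig"),   # EF BB BF
        (codecs.BOM_UTF16_BE, "utf-16"),  # FE FF
        (codecs.BOM_UTF16_LE, "utf-16"),  # FF FE
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            # These codecs read the BOM, pick the byte order for UTF-16,
            # and drop the BOM from the decoded text.
            return name
    return None


data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
name = sniff_bom(data)
print(name, data.decode(name))
```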

Cheers,
Cameron Simpson <cs at cskk.id.au>

