Opening multiple Files in Different Encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jul 11 19:22:56 EDT 2012


On Wed, 11 Jul 2012 11:15:02 -0700, subhabangalore wrote:

> On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
>> Dear Group,
>> 
>> I kept a good number of files in a folder. Now I want to read all of
>> them. They are in different formats and different encoding. Using
>> listdir/glob.glob I am able to find the list but how to open/read or
>> process them for different encodings?
>> 
>> If any one can help me out.I am using Python3.2 on Windows.
>> 
>> Regards,
>> Subhabrata Banerjee.
> Dear Group,
> 
> No generally I know the glob.glob or the encodings as I work lot on
> non-ASCII stuff, but I recently found an interesting issue, suppose
> there are .doc,.docx,.txt,.xls,.pdf files with different encodings. 

You can have text files with different encodings, but not the others.
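For the text files, one common approach is to read the raw bytes and try a short list of candidate encodings in order. The following is only a sketch: the function name read_text and the default candidate list are my own assumptions, not anything from the original post, and a decode that succeeds does not prove the guess was right.

```python
def read_text(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly.

    The candidate list is an assumption; put the encodings you actually
    expect in your folder here, strictest first.
    """
    with open(path, "rb") as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode " + repr(path))
```

UTF-8 goes first because it rejects most byte sequences that are not really UTF-8, whereas single-byte encodings accept almost anything.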

.doc .docx .xls and .pdf are all binary files. You don't specify an 
encoding when you read them, because they aren't text -- encodings are 
for mapping bytes to text, not bytes to binary formats.

In particular, .docx is a zip archive of XML files, so once you have 
uncompressed it, the contents are XML, which is normally UTF-8 (the format 
also permits UTF-16).
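To illustrate, here is a minimal sketch that pulls the paragraph text out of a .docx using only the standard library. The function name docx_paragraphs is hypothetical, and it assumes the main body lives in word/document.xml, which is the usual layout for Word documents. The XML parser reads the encoding from the XML declaration, so you never pass one yourself.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of str."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part is at word/document.xml.
        with archive.open("word/document.xml") as part:
            root = ET.parse(part).getroot()
    paragraphs = []
    for p in root.iterfind(".//w:p", W_NS):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in p.iterfind(".//w:t", W_NS)))
    return paragraphs
```

This ignores tables, headers, footnotes and formatting; a dedicated library is the better choice for anything beyond plain paragraph text.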


> 1) First I have to determine on the fly the file type. 

Which is a different problem from your first post.

On Windows, you determine the file type using the file extension.

import os
name, ext = os.path.splitext("my_file_name.bmp")

will give you ext = ".bmp".

Then what do you expect to do? You can open the file as a binary blob, 
but what do you expect then?

f = open("my_file_name.bmp", "rb")

Now what do you want to do with it?
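One possible answer is to dispatch on the extension: open known text types in text mode with an encoding, and everything else as bytes. This is a sketch only. The name open_by_extension and the TEXT_EXTENSIONS set are assumptions for illustration, and the single default encoding is a placeholder for whatever your text files really use.

```python
import os

# Assumption: extensions to treat as plain text. Adjust for your own folder.
TEXT_EXTENSIONS = {".txt", ".csv"}

def open_by_extension(path, encoding="utf-8"):
    """Open text files in text mode and everything else in binary mode."""
    ext = os.path.splitext(path)[1].lower()
    if ext in TEXT_EXTENSIONS:
        # Text mode: bytes are decoded to str using the given encoding.
        return open(path, "r", encoding=encoding)
    # Binary mode: no decoding, read() returns bytes.
    return open(path, "rb")
```

Reading from the result gives str for the text types and bytes for the rest, so the calling code still has to hand each binary format to a parser that understands it.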


> 2) I can not assign
> encoding="..." whatever be the encoding I have to read it.

You can't set the encoding when you open files in binary mode, and you 
don't need to: binary files don't have an encoding.
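You can see this for yourself: Python refuses the combination with a ValueError. The helper below is a small sketch (the name binary_open_error is made up for this example) that returns the error message instead of letting the exception propagate.

```python
def binary_open_error(path):
    """Try to open path in binary mode with an encoding; return the error text."""
    try:
        f = open(path, "rb", encoding="utf-8")
    except ValueError as err:
        # open() rejects encoding= together with a binary mode.
        return str(err)
    f.close()
    return None
```

If a binary format does contain embedded text, read the bytes first and call .decode() on just the piece that you know to be text.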



-- 
Steven


