Converting text file to different encoding.

Dave Angel davea at davea.name
Fri Apr 17 10:48:57 EDT 2015


On 04/17/2015 09:19 AM, subhabrata.banerji at gmail.com wrote:
> I am having few files in default encoding. I wanted to change their encodings,
> preferably in "UTF-8", or may be from one encoding to any other encoding.
>

You neglected to specify what Python version this is for.  Other 
information that'd be useful is whether the file size is small enough 
that two copies of it will all fit reasonably into memory.

I'll assume it's version 2.7, because of various clues in your sample 
code.  But if it's version 3.x, it could be substantially easier.

> I was trying it as follows,
>
>     >>> import codecs
>     >>> sourceEncoding = "iso-8859-1"
>     >>> targetEncoding = "utf-8"
>     >>> source = open("source1","w")

Mode "w" will truncate the source1 file, leaving you nothing to process.
I'd suggest "r".

>     >>> target = open("target", "w")

It's not usually a good idea to use the same variable for both the file 
name and the opened file object.  What if you need later to print the 
name, as in an error message?

>     >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))

I'd not recommend trying to do so much in one line, at least until you 
understand all the pieces.  Programming is not (usually) a contest to 
write the most obscure code, but rather to make a program you can still 
read and understand six months from now.  And, oh yeah, something that 
will run and accomplish something.

>
> but it was giving me error as follows,
> Traceback (most recent call last):
>    File "<pyshell#6>", line 1, in <module>
>      target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> TypeError: coercing to Unicode: need string or buffer, file found


If you factor this, you will discover your error.  Nowhere do you read 
the source file into a byte string, and a byte string is what the 
unicode constructor needs.  Factored, you might have something like:

      encodedtext = source.read()
      text = unicode(encodedtext, sourceEncoding)
      reencodedtext = text.encode(targetEncoding)
      target.write(reencodedtext)

Next, you need to close the files.

     source.close()
     target.close()
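
Assembled into one piece, with the opens included, the whole-file 
version could look roughly like the sketch below.  The function name 
convert_whole_file and its parameters are made up for illustration, 
the files are opened in binary mode ("rb"/"wb") as an assumption so 
that no newline translation interferes, and bytes.decode() stands in 
for the unicode constructor so the same code also runs on 3.x.

```python
def convert_whole_file(source_name, target_name,
                       source_encoding, target_encoding):
    # Hypothetical helper, not part of the original post.
    # "rb" reads the raw bytes and, unlike "w", does not truncate the file.
    source = open(source_name, "rb")
    try:
        encodedtext = source.read()          # byte string in the old encoding
    finally:
        source.close()

    text = encodedtext.decode(source_encoding)     # unicode text
    reencodedtext = text.encode(target_encoding)   # bytes in the new encoding

    target = open(target_name, "wb")
    try:
        target.write(reencodedtext)
    finally:
        target.close()
```

It would be called as convert_whole_file("source1", "target", 
"iso-8859-1", "utf-8").  Both the original bytes and the re-encoded 
bytes are held in memory at once, which is why file size matters.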

There are a number of ways to improve that code, but this is a start.

Improvements:

      Use codecs.open() to open the files, so encoding is handled 
implicitly in the file objects.

      Use the with statement so that the file closes are implicit.

      Read and write the files in a loop, a line at a time, so that you 
needn't have all the data in memory (at least twice) at one time.  This 
will also help enormously if you encounter any errors, and want to 
report the location and problem to the user.  It might even turn out to 
be faster.

      You should write non-trivial code in a text file, and run it from 
there.
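
Taken together, the first three suggestions might look something like 
the sketch below.  It is only an illustration: the function name 
convert_encoding and its parameters are made up, and io.open is used 
in place of codecs.open because it takes the same encoding argument 
and exists in both 2.7 and 3.x (where it is the built-in open).  The 
newline="" argument is an added assumption so that line endings are 
copied through unchanged.

```python
import io


def convert_encoding(source_name, target_name,
                     source_encoding, target_encoding):
    """Copy source_name to target_name, re-encoding one line at a time."""
    # Hypothetical helper, not part of the original post.
    # The file objects decode on read and encode on write, so the loop
    # body only ever handles unicode text.
    with io.open(source_name, "r", encoding=source_encoding,
                 newline="") as source:
        with io.open(target_name, "w", encoding=target_encoding,
                     newline="") as target:
            for line in source:
                # Only one line is held in memory at a time.
                target.write(line)
```

A call such as convert_encoding("source1", "target", "iso-8859-1", 
"utf-8") would do the conversion from the original question.  If a 
line cannot be decoded, the exception is raised from inside the loop, 
which makes it straightforward to add a line counter and report where 
the problem occurred.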

-- 
DaveA


