unicode to ascii converting

Fri Aug 6 15:36:46 EDT 2004

Peter Wilkinson wrote:
> Hello tlistmembers,
> 
> I am using the encoding function to convert unicode to ascii. At one 
> point this code was working just fine, however, now it has broken.
> 
> I am reading a text file that is in unicode (I am unsure of which 
> flavour or bit depth). As I read in the file one line at a time 
> (readlines()) it converts to ascii. Simple enough. At the same time I am 
> compressing to bz2 with the bz2 module, but that works just fine. The 
> code and the error reported appear below. I am unsure what to do.
> 
> Since it is reporting that the ordinal is not in range, I assume this 
> has something to do with the character width that I am reading?
> 
> Peter W.
> 
> def encode_file(file_path, encode_type, compress='N'):
>     """
>     Changes encoding of file
>     """
>     new_encode = encode_type
>     old_file_path = file_path + '.old'
>     new_file_path = file_path
>     os.rename(file_path,old_file_path)
>     file_in  = file(old_file_path,'r')
> 
>     if compress == 'Y' or compress == 'y':
>         bz_file_path = file_path + '.bz2'
>         bz_file_out  = bz2.BZ2File(bz_file_path, 'w')
>         for line in file_in.readlines():
>             bz_file_out.write(line.encode(new_encode))
>         bz_file_out.close()
> 
>     else:
>         file_out = file(file_path,'w')
>         for line in file_in.readlines():
>             file_out.write(line.encode(new_encode))
>         file_out.close()
> 
>     file_in.close()
>     os.remove(old_file_path)
> 
> ERROR Reported:
> 
> Parsing 
> X:\GenomeQuebec_repository\microarray\HIS\M15K\Step_1_repository\HISH0224.txt 
> 
> Traceback (most recent call last):
>   File "C:\Program Files\ActiveState Komodo 2.5\callkomodo\kdb.py", line 
> 433, in _do_start
>     self.kdb.run(code_ob, locals, locals)
>   File "C:\Python23\lib\bdb.py", line 350, in run
>     exec cmd in globals, locals
>   File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", 
> line 158, in ?
>     main()
>   File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", 
> line 75, in main
>     encode_file(fileToProcess, options.encode,  'Y')
>   File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", 
> line 144, in encode_file
>     bz_file_out.write(line.encode(new_encode))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: 
> ordinal not in range(128)
> 

0xff in position 0? If there is a 0xfe in position 1, I would suspect 
you're dealing with the Byte Order Mark of a UTF-16 encoded file (UTF-16 
LE to be precise). What happens if you skip the first 2 bytes of the file?
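
For what it's worth, the error is a *decode* error because in Python 2, 
calling .encode() on a plain byte string first decodes it implicitly 
with the ascii codec, and 0xff is outside that range. Note also that 
skipping the two bytes is only a diagnostic: if the file really is 
UTF-16, the remaining bytes still need to be decoded as UTF-16 before 
encoding to ascii. Below is a minimal sketch of that idea, assuming the 
whole file fits in memory and carries a BOM. The names detect_bom and 
to_ascii are hypothetical helpers, not part of the original script, and 
the code is written for a current Python 3 rather than the Python 2.3 in 
the traceback.

```python
import codecs


def detect_bom(raw):
    """Return a codec name suggested by a leading BOM, or None if absent."""
    # The UTF-32 marks are tested first because BOM_UTF32_LE begins with
    # the same two bytes (0xff 0xfe) as BOM_UTF16_LE.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if raw.startswith(bom):
            return name
    return None


def to_ascii(raw, errors="replace"):
    """Decode BOM-marked bytes to text, then encode the text as ASCII."""
    # Assumption: with no BOM the input is treated as plain ASCII, so
    # non-ASCII bytes in such input would raise UnicodeDecodeError.
    encoding = detect_bom(raw) or "ascii"
    # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM
    # themselves, so it does not end up in the decoded text.
    text = raw.decode(encoding)
    # errors="replace" substitutes "?" for characters ASCII cannot hold.
    return text.encode("ascii", errors)


# Illustrative input: the bytes a UTF-16 LE file with a BOM would start with.
sample = codecs.BOM_UTF16_LE + "gene\tvalue\n".encode("utf-16-le")
converted = to_ascii(sample)
```

In the script from the question, the equivalent change would be to read 
the bytes from the old file, convert them once as above, and write the 
result to the BZ2File or the plain output file, instead of calling 
.encode() on each undecoded line.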

--
Vincent Wehren