how to transfer my utf8 code saved in a file to gbk code

Mark Tolonen metolone+gmane at gmail.com
Mon Jun 8 01:58:16 EDT 2009


"higer" <higerinbeijing at gmail.com> wrote in message 
news:0c786326-1651-42c8-ba39-4679f3558660 at r13g2000vbr.googlegroups.com...
> On Jun 7, 11:25 pm, John Machin <sjmac... at lexicon.net> wrote:
>> On Jun 7, 10:55 pm, higer <higerinbeij... at gmail.com> wrote:
>>
>> > My file contains such strings :
>> > \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
>>
>
>
>> Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
>>
>
> It was saved in a file, so it occupies 36 bytes. If I just use a
> variable to hold this string, I certainly get the correct
> result, but how do I get the right answer when reading from a file?

Did you create this file?  If it is 36 characters, it contains literal 
backslash characters (the text of the escape sequences), not the 9 bytes 
that would correctly encode the string as UTF-8.  If you created the file 
yourself, show us the code.
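
To make the 9-versus-36 distinction concrete, here is a small sketch.  It 
assumes Python 3 (the thread itself uses Python 2.5), and the variable names 
are only illustrative:

```python
# Python 3 sketch. The sample text is the one quoted in this thread:
# U+65E5, U+671F and U+FF1A (a full-width colon).
text = "\u65e5\u671f\uff1a"

# Real UTF-8 data: three characters, three bytes each.
real_utf8 = text.encode("utf-8")

# What the file in question seems to hold instead: the *text* of the
# escape sequences, i.e. a backslash, an 'x' and two hex digits per byte.
escaped_text = "".join("\\x%02x" % b for b in real_utf8).encode("ascii")

print(len(real_utf8), real_utf8)        # 9 bytes
print(len(escaped_text), escaped_text)  # 36 bytes
print(escaped_text.count(b"\\"))        # 9 literal backslashes
```

The second form is what `repr()` shows as doubled backslashes.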

>
>>
>>
>> > I want to read the content of this file and convert it to the
>> > corresponding GBK code, a Chinese character encoding.
>> > Every time I tried to convert it, the output was the same no
>> > matter which method I used.
>> > It seems that when Python reads it, Python takes '\' as an
>> > ordinary character, so the string ends up represented as
>> > "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then output
>> > 'correctly', but that's not what I want to get.
>>
>> > Can anyone help me?
>>
>> try this:
>>
>> utf8_data = your_data.decode('string-escape')
>> unicode_data = utf8_data.decode('utf8')
>> # unicode derived from your sample looks like this: 日期： Is that what
>> # you expected?
>
> You are right, the result is 日期, which is just what I expected. If you
> save the string in a variable, you can surely get the correct result.
> But it is just a sample, so I gave a short string; what if there are
> many characters in a file?
>
>> gbk_data = unicode_data.encode('gbk')
>>
>
> I have tried the method you just told me, but unfortunately it
> does not work (mess code).

How are you determining this is 'mess code'?  How are you viewing the 
result?  You'll need to use a viewer that understands GBK encoding, such as 
Notepad on a Chinese edition of Windows.
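
If no GBK-aware viewer is at hand, one way to check the bytes is to decode 
them again in Python.  This is only a sketch, assuming Python 3 and the 
sample string from this thread; the variable names are made up for 
illustration:

```python
# Python 3 sketch: check GBK output without relying on a viewer.
utf8_data = b"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"   # sample from this thread
text = utf8_data.decode("utf-8")
gbk_data = text.encode("gbk")

# Round trip: decoding the GBK bytes as GBK must give the same text back.
round_trip_ok = gbk_data.decode("gbk") == text
print(round_trip_ok)

# A viewer that assumes some other encoding shows garbage instead.
# Latin-1 is used here only as an example of a wrong guess.
mojibake = gbk_data.decode("latin-1")
print(ascii(mojibake))

# The same bytes are not even valid UTF-8.
try:
    gbk_data.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)
```

If the round trip succeeds, the data is fine and any 'mess code' comes from 
the program used to display it.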

>
>
>> If that "doesn't work", do three things:
>> (1) give us some unambiguous hard evidence about the contents of your
>> data:
>> e.g. # assuming Python 2.x
>
> My Python version is 2.5.2
>
>> your_data = open('your_file.txt', 'rb').read(36)
>> print repr(your_data)
>> print len(your_data)
>> print your_data.count('\\')
>> print your_data.count('x')
>>
>
> The result is:
>
> '\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
> 36
> 9
> 9
>
>> (2) show us the source of the script that you used
>
> def UTF8ToChnWords():
>    f = open("123.txt","rb")
>    content=f.read()
>    print repr(content)
>    print len(content)
>    print content.count("\\")
>    print content.count("x")

Try:

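# Turn the 36 characters of escape text into the 9 raw UTF-8 bytes.
utf8data = content.decode('string-escape')
# Decode the UTF-8 bytes to a unicode object, then encode that as GBK.
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print len(gbkdata),repr(gbkdata)
# Write the raw GBK bytes; binary mode avoids newline translation.
open("456.txt","wb").write(gbkdata)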

The print should give:

6 '\xc8\xd5\xc6\xda\xa3\xba'

This is correct for GBK encoding.  456.txt should contain the 6 bytes of GBK 
data.  View the file with a program that understands GBK encoding.
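
For anyone reading this on Python 3, where the 'string-escape' codec no 
longer exists, the same three steps might look roughly like the sketch 
below.  The function name is hypothetical, and the 'unicode_escape' plus 
'latin-1' pair is a substitute for 'string-escape' that assumes the file 
holds only ASCII escape text, as in the sample here:

```python
import os
import tempfile


def escaped_utf8_file_to_gbk(src_path, dst_path):
    """Read escape text such as \\xe6\\x97..., write the text as GBK bytes."""
    with open(src_path, "rb") as src:
        escaped = src.read()
    # Step 1: escape text -> raw bytes. 'unicode_escape' maps \xNN to the
    # code point U+00NN, and 'latin-1' maps that back to the single byte NN.
    utf8_data = escaped.decode("unicode_escape").encode("latin-1")
    # Step 2: raw UTF-8 bytes -> text.
    text = utf8_data.decode("utf-8")
    # Step 3: text -> GBK bytes.
    gbk_data = text.encode("gbk")
    with open(dst_path, "wb") as dst:
        dst.write(gbk_data)
    return gbk_data


# Demo in a temporary directory, reusing the file names from this thread.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "123.txt")
dst_path = os.path.join(workdir, "456.txt")
with open(src_path, "wb") as f:
    f.write(b"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a")

result = escaped_utf8_file_to_gbk(src_path, dst_path)
print(len(result), result)   # 6 b'\xc8\xd5\xc6\xda\xa3\xba'
```

The function reads the whole file at once, so it handles a longer file the 
same way as long as it fits in memory.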

-Mark




