[Tutor] Printing Chinese characters?

Thu Oct 16 00:54:41 EDT 2003

Ahh - and the final step - that would yield this utf-8 encoding (of
the original string minus the troublesome characters) rendered as a
python string:

print '\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8\x85\x94\xe3\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d\xa5\xe7\xba\x97\xe5\xa5\xb4\x0a'

which prints fine on my utf-8-enabled xterm, as described so
wonderfully at http://www.cl.cam.ac.uk/~mgk25/unicode.html by Markus
Kuhn.  Though a few characters aren't in the free font I got with
X11/Redhat 7.3.
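
As a cross-check, the UTF-8 bytes above can be rebuilt from the UCS2
code points in the dump quoted below. This is only a minimal sketch,
written for Python 3 rather than the Python 2 used elsewhere in this
thread; the names CODE_POINTS and to_utf8 are made up for illustration.

```python
# Code points copied from the "recode big5..dump" output quoted below,
# leaving out the newline that was inserted mid-string as a placeholder.
CODE_POINTS = [
    0x7AAA, 0x6D09, 0x90FD, 0x7357, 0x8154, 0x3081, 0x8867, 0x7AAA,
    0x8765, 0x7E97, 0x5974, 0x000A,
]


def to_utf8(code_points):
    """Encode a sequence of Unicode code points as UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


utf8_bytes = to_utf8(CODE_POINTS)
print(utf8_bytes)
```

The printed bytes should match the escaped string in the print
statement above, which is a quick way to confirm that the dump and the
UTF-8 form describe the same characters.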

Of course I may be way off-base here - just playing around with it.

-Neal

On Wed, Oct 15, 2003 at 10:44:28PM -0600, Neal McBurnett wrote:
> Well, I think the idea that it is at least similar to big5 is right.
> But it may have a Japanese hiragana character also.
> 
> But to make that work I had to drop the '?' characters, as well as the
> \xc8 (trial and error....)
> 
> I used linux and the free, quirky but very handy "recode" program to
> do the recoding.  I inserted a few newlines to help keep my place....
> 
> original string:
>  '\xba\xda\xcf?\xac\xb3\xa3\xbc\xfb\xb5\xc4\xc6\xe5\xd0?\xac\xba\xda\xc8\xe7\xba?\xf8\xb9\xa5\xa3\xbf'
> 
> my script:
> $ python2 -c "print '\xba\xda\xcf\xac\xb3\xa3\xbc\xfb\xb5\xc4\xc6\xe5\xd0\xac\xba\xda\n\xe7\xba\xf8\xb9\xa5\xa3'" |
>  recode big5..dump
> 
> Output, in Unicode UCS2 form:
> 
> UCS2   Mne   Description
> 
> 7AAA      
> 6D09      
> 90FD      
> 7357      
> 8154      
> 3081   me    hiragana letter me
> 8867      
> 7AAA      
> 000A   LF    line feed (lf)
> 8765      
> 7E97      
> 5974      
> 000A   LF    line feed (lf)
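
A dump like the one above can probably be approximated in Python 3
alone, without recode. The sketch below assumes the interpreter ships a
"big5" codec; the function name dump and the sample bytes are made up
for illustration, and I have not checked that Python's Big5 table
agrees with recode's for every byte pair in the string from this
thread, hence errors="replace".

```python
import unicodedata


def dump(raw, encoding="big5"):
    """Decode raw bytes and return (hex code point, name) pairs.

    Undecodable bytes become U+FFFD instead of raising, so a partly
    wrong guess at the encoding still yields some output.
    """
    text = raw.decode(encoding, errors="replace")
    return [("%04X" % ord(ch), unicodedata.name(ch, "")) for ch in text]


# b"\xa4\xa4" is an illustrative Big5 sample, not taken from this thread.
for code, name in dump(b"\xa4\xa4\n"):
    print(code, name)
```

Control characters such as the line feed have no Unicode name, so the
second column is left empty for them.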
> 
> Those characters can be looked up via the Unihan.txt file at
> unicode.org, yielding the name of each character, and in many common
> cases also pronunciation and a definition:
> 
> $ for i in 7AAA 6D09 90FD 7357 8154 3081 8867 7AAA 000A 8765 7E97 5974 000A; do
>   fgrep $i Unihan.txt | grep kDefinition; done
> 
> U+7AAA kDefinition hollow; pit; depression; swamp
> U+90FD kDefinition metropolis, capital; all, the whole; elegant, refined
> U+7357 kDefinition unruly, wild, violent, lawless
> U+8154 kDefinition chest cavity; hollow in body
> U+7AAA kDefinition hollow; pit; depression; swamp
> U+8765 kDefinition a fly which is used similarly to cantharides
> U+5974 kDefinition slave, servant
> 
> The other characters weren't in that "dictionary".
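
The same lookup could be sketched in Python 3. This assumes the Unihan
data is tab-separated, one "U+XXXX<TAB>field<TAB>value" entry per line;
the function name unihan_definitions and the sample lines are made up
for illustration (the definitions themselves are copied from the
output above).

```python
def unihan_definitions(lines, code_points):
    """Map each requested code point to its kDefinition text, if any.

    `lines` is any iterable of Unihan-style lines, assumed to be
    tab-separated: "U+XXXX<TAB>field<TAB>value".
    """
    wanted = {"U+%04X" % cp for cp in code_points}
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == "kDefinition" and parts[0] in wanted:
            found[parts[0]] = parts[2]
    return found


sample = [
    "U+7AAA\tkDefinition\thollow; pit; depression; swamp\n",
    "U+5974\tkDefinition\tslave, servant\n",
]
print(unihan_definitions(sample, [0x7AAA, 0x6D09, 0x5974]))
```

With a real data file, an open file object can be passed as `lines`.
Matching the first field exactly, rather than searching for a
substring as fgrep does, avoids hits where the hex digits happen to
appear elsewhere on a line.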
> 
> I don't know what the deal is with the characters I had to drop out,
> so it may be some other character set which is related to big5.
> 
> But I think that  for someone who knows no Chinese, using
> free tools and databases....
> 
> Cheers,
> 
> Neal McBurnett                 http://bcn.boulder.co.us/~neal/
> Signed and/or sealed mail encouraged.  GPG/PGP Keyid: 2C9EBA60
> 
> 
> On Thu, Oct 16, 2003 at 01:54:54PM +1000, Alfred Milgrom wrote:
> > Hi Danny:
> > 
> > Thanks for your reply.
> > Given that this is a Chinese string, I think it might be a BIG-5 encoding, 
> > but I am unable to find the proper encoding files.
> > 
> > In my distribution of Python, there is an encodings directory under 
> > Python22/Lib, and a file called aliases.py. As I understand it, this module 
> > is used by the encodings package search function to map encodings names to 
> > module names.
> > 
> > There is an interesting comment under CJK encodings (Chinese, Japanese, 
> > Korean) as follows:
> >     # The codecs for these encodings are not distributed with the
> >     # Python core, but are included here for reference, since the
> >     # locale module relies on having these aliases available.
> > 
> > Do you (or anyone else) know where I can get the Chinese encodings, 
> > including BIG-5?
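
For what it's worth, the CJK codecs were added to the standard library
in Python 2.4, so on later interpreters no separate download should be
needed. A minimal Python 3 sketch for checking what a given
installation provides follows; codec_available is a made-up helper
name, and the codec names tried are just examples.

```python
import codecs


def codec_available(name):
    """Return True if this Python installation can find the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for name in ("big5", "gb2312", "no-such-codec"):
    print(name, codec_available(name))
```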
> > 
> > Thanks in advance,
> > Fred Milgrom
> > 
> > 
> > At 02:52 PM 15/10/03 -0700, Danny Yoo wrote:
> > 
> > ><snip>
> > >But that character string you've posted:
> > >
> > >###
> > >s = ('\xba\xda\xcf?\xac\xb3\xa3\xbc\xfb\xb5\xc4\xc6' +
> > >     '\xe5\xd0?\xac\xba\xda\xc8\xe7\xba?\xf8\xb9\xa5\xa3\xbf')
> > >###
> > >will need to be first decoded from whatever byte encoding it is in now
> > >into Unicode before any display approach will work.
> > >
> > ><snip> Do you have more information on
> > >which byte encoding is being used for your string 's'?
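
When the encoding is unknown, one rough approach is to decode the bytes
with several candidate codecs and count how many bytes each one fails
on. The Python 3 sketch below is only that: try_encodings is a made-up
name, the candidate list is my assumption rather than anything
established in this thread, and a low failure count is a hint, not
proof, since a wrong codec can still decode cleanly to the wrong
characters.

```python
def try_encodings(raw, candidates=("big5", "gb2312", "utf-8")):
    """Decode raw bytes with each candidate encoding.

    Returns {encoding: (text, replacements)}, where replacements counts
    the U+FFFD characters substituted for bytes that failed to decode.
    """
    results = {}
    for enc in candidates:
        text = raw.decode(enc, errors="replace")
        results[enc] = (text, text.count("\ufffd"))
    return results


# b"\xa4\xa4" is an illustrative sample, not the string from this thread.
for enc, (text, bad) in try_encodings(b"\xa4\xa4").items():
    print(enc, repr(text), bad)
```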
> > >
> > >Good luck to you!
> > 
> > 
> > 
> > _______________________________________________
> > Tutor maillist  -  Tutor at python.org
> > http://mail.python.org/mailman/listinfo/tutor
> 