From zmason at gmail.com Wed Feb 1 02:39:18 2006 From: zmason at gmail.com (Zachary Mason) Date: Tue, 31 Jan 2006 17:39:18 -0800 Subject: [I18n-sig] japanese, unicode and python Message-ID: <85577c350601311739g6d780bbcl6fcc666f249518a5@mail.gmail.com> Hi. Im writing an NLP application that manipulates japanese characters and downloads japanese web-pages, in particular results pages for japanese search engines. Having a miserable time trying to make it work so far. I have python 2.4.3, but the transformations that seem to work for european languages throw errors for japan. Pointers to useful resources or better yet examples of manipulating japanese via python would be greatly appreciated. thanks Z. Mason From martin at v.loewis.de Sun Feb 12 20:38:47 2006 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 12 Feb 2006 20:38:47 +0100 Subject: [I18n-sig] japanese, unicode and python In-Reply-To: <85577c350601311739g6d780bbcl6fcc666f249518a5@mail.gmail.com> References: <85577c350601311739g6d780bbcl6fcc666f249518a5@mail.gmail.com> Message-ID: <43EF8EC7.2070909@v.loewis.de> Zachary Mason wrote: > Hi. Im writing an NLP application that manipulates japanese > characters and downloads japanese web-pages, in particular results > pages for japanese search engines. Having a miserable time trying to > make it work so far. I have python 2.4.3, but the transformations > that seem to work for european languages throw errors for japan. > Pointers to useful resources or better yet examples of manipulating > japanese via python would be greatly appreciated. This is a pretty unspecific question. What's wrong with doing stuff like py> u"Hello \u3068\u306f".encode("eucJP") 'Hello \xa4\xc8\xa4\xcf' Regards, Martin From jtauber at jtauber.com Mon Feb 13 05:02:50 2006 From: jtauber at jtauber.com (James Tauber) Date: Mon, 13 Feb 2006 12:02:50 +0800 Subject: [I18n-sig] possible bug in my UCA implementation In-Reply-To: References: Message-ID: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> I discovered I wasn't converting to NFD first but this doesn't solve the problem, it just explains why expansions were used. Even using NFD, I get the same result. James On 30/01/2006, at 4:35 PM, James Tauber wrote: > > My Python Unicode Collation Algorithm implementation is giving > unexpected results that could be because of: > > 1. a bug in my code > 2. a bug in the DUCET > 3. a difference of opinion between the way I think Ancient Greek > should be collated and the way DUCET thinks so > > I'd like to get the opinion of some of you who are more familiar > with UCA (and perhaps can try my example out on ICU) > > For the purposes of testing, say I'm trying to sort the three words: > > (1) ???? > (2) ???? > (3) ???? > > In my view they should be sorted in the reverse to what they are > now, but my pyuca code sorts them in the order listed above. > > pyuca assigns the words the following sort keys: > > (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a', > '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20', > '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0', > '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2'] > (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0', > '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8', > '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313', > '0x301', '0x3b2', '0x3b5', '0x3bb'] > (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0', > '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2', > '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313', > '0x3b2', '0x3b2', '0x3b1', '0x301'] > > The problem is that ? (the first character of (1)) expands to 4 > collation elements, ? (the first character of (2)) to 3 and ? > (the first character of (3)) to 2 and as a result and, because all > but the first element is zero, they are comparing less, just by > virtue of having more collation elements. > > I don't even understand why these letters are being treated as > expansions rather than simply taking advantage of the secondary and > tertiary levels, but sure enough that is how the DUCET describes them. > > Am I missing something fundamental in the algorithm? Or is it > possible the DUCET is wrong? > > James From martin at v.loewis.de Mon Feb 13 05:10:58 2006 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 13 Feb 2006 05:10:58 +0100 Subject: [I18n-sig] possible bug in my UCA implementation In-Reply-To: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> References: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> Message-ID: <43F006D2.9040808@v.loewis.de> James Tauber wrote: > I discovered I wasn't converting to NFD first but this doesn't solve > the problem, it just explains why expansions were used. > > Even using NFD, I get the same result. If you want to get a response to your message, you probably need to provide more explanation of background, so people can easier understand your reasoning. I, for one, have no idea what DUCET is. Regards, Martin From jtauber at jtauber.com Mon Feb 13 05:19:21 2006 From: jtauber at jtauber.com (James Tauber) Date: Mon, 13 Feb 2006 12:19:21 +0800 Subject: [I18n-sig] possible bug in my UCA implementation In-Reply-To: <43F006D2.9040808@v.loewis.de> References: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> <43F006D2.9040808@v.loewis.de> Message-ID: DUCET is the Unicode Consortium's Default Unicode Collation Element Table. The relevant Unicode TS is #10: http://www.unicode.org/reports/tr10/ which is what I've done a preliminary implementation of. i18n-sig is probably the wrong place to be asking but I thought there might be people here that have direct experience with DUCET and the UCA. James On 13/02/2006, at 12:10 PM, Martin v. L?wis wrote: > James Tauber wrote: >> I discovered I wasn't converting to NFD first but this doesn't solve >> the problem, it just explains why expansions were used. >> >> Even using NFD, I get the same result. > > If you want to get a response to your message, you probably need to > provide more explanation of background, so people can easier > understand > your reasoning. I, for one, have no idea what DUCET is. > > Regards, > Martin From jtauber at jtauber.com Mon Feb 13 05:44:49 2006 From: jtauber at jtauber.com (James Tauber) Date: Mon, 13 Feb 2006 12:44:49 +0800 Subject: [I18n-sig] possible bug in my UCA implementation In-Reply-To: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> References: <9D32CFE2-4699-4A16-8A5A-17F4973DB7D5@jtauber.com> Message-ID: <4BFCB2CE-2871-461B-8A6F-4F2D1CB935BA@jtauber.com> Okay, I've fixed the problem. It did turn out to be a bug in my implementation. The core of the algorithm was implemented as for level in range(4): if level: sort_key.append(0) # level separator for element in collation_elements: ce_l = int(element[1][level], 16) sort_key.append(ce_l) instead of for level in range(4): if level: sort_key.append(0) # level separator for element in collation_elements: ce_l = int(element[1][level], 16) if ce_l: sort_key.append(ce_l) In other words, the UCA itself makes it clear (Step S3.5) that only non-zero CE_L should be appended. I'll update the code on my website shortly. James On 13/02/2006, at 12:02 PM, James Tauber wrote: > > I discovered I wasn't converting to NFD first but this doesn't solve > the problem, it just explains why expansions were used. > > Even using NFD, I get the same result. > > James > > On 30/01/2006, at 4:35 PM, James Tauber wrote: > >> >> My Python Unicode Collation Algorithm implementation is giving >> unexpected results that could be because of: >> >> 1. a bug in my code >> 2. a bug in the DUCET >> 3. a difference of opinion between the way I think Ancient Greek >> should be collated and the way DUCET thinks so >> >> I'd like to get the opinion of some of you who are more familiar >> with UCA (and perhaps can try my example out on ICU) >> >> For the purposes of testing, say I'm trying to sort the three words: >> >> (1) ???? >> (2) ???? >> (3) ???? >> >> In my view they should be sorted in the reverse to what they are >> now, but my pyuca code sorts them in the order listed above. >> >> pyuca assigns the words the following sort keys: >> >> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a', >> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20', >> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0', >> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2'] >> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0', >> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8', >> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313', >> '0x301', '0x3b2', '0x3b5', '0x3bb'] >> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0', >> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2', >> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313', >> '0x3b2', '0x3b2', '0x3b1', '0x301'] >> >> The problem is that ? (the first character of (1)) expands to 4 >> collation elements, ? (the first character of (2)) to 3 and ? >> (the first character of (3)) to 2 and as a result and, because all >> but the first element is zero, they are comparing less, just by >> virtue of having more collation elements. >> >> I don't even understand why these letters are being treated as >> expansions rather than simply taking advantage of the secondary and >> tertiary levels, but sure enough that is how the DUCET describes >> them. >> >> Am I missing something fundamental in the algorithm? Or is it >> possible the DUCET is wrong? >> >> James > > _______________________________________________ > I18n-sig mailing list > I18n-sig at python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- James Tauber http://jtauber.com/ journeyman of some http://jtauber.com/blog/