unicode and dbf files

Mon Oct 26 12:22:22 EDT 2009

John Machin wrote:
> On Oct 24, 4:14 am, Ethan Furman <et... at stoneleaf.us> wrote:
> 
>>John Machin wrote:
>>
>>>On Oct 23, 3:03 pm, Ethan Furman <et... at stoneleaf.us> wrote:
>>
>>>>John Machin wrote:
>>
>>>>>On Oct 23, 7:28 am, Ethan Furman <et... at stoneleaf.us> wrote:
>>
>>>>>>Greetings, all!
>>
>>>>>>I would like to add unicode support to my dbf project.  The dbf header
>>>>>>has a one-byte field to hold the encoding of the file.  For example,
>>>>>>\x01 is code-page 437 MS-DOS.
>>
>>>>>>My google-fu is apparently not up to the task of locating a complete
>>>>>>resource that has a list of the 256 possible values and their
>>>>>>corresponding code pages.
>>
>>>>>What makes you imagine that all 256 possible values are mapped to code
>>>>>pages?
>>
>>>>I'm just wanting to make sure I have whatever is available, and
>>>>preferably standard.  :D
>>
>>>>>>So far I have found this, plus variations: http://support.microsoft.com/kb/129631
>>
>>>>>>Does anyone know of anything more complete?
>>
>>>>>That is for VFP3. Try the VFP9 equivalent.
>>
>>>>>dBase 5,5,6,7 use others which are not defined in publicly available
>>>>>dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
>>>>>source: ESRI support site.
>>
>>>>Well, a couple hours later and still not more than I started with.
>>>>Thanks for trying, though!
>>
>>>Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
>>>keywords and you couldn't come up with anything??
>>
>>Perhaps "nothing new" would have been a better description.  I'd already
>>seen the clicketyclick site (good info there)
> 
> 
> Do you think so? My take is that it leaves out most of the codepage
> numbers, and these two lines are wrong:
> 65h 	Nordic MS-DOS	code page 865
> 66h 	Russian MS-DOS	code page 866

That was the site I used to get my whole project going, so ignoring the 
unicode aspect, it has been very helpful to me.

>>and all I found at ESRI
>>were folks trying to figure it out, plus one link to a list that was no
>>different from the vfp3 list (or was it that the list did not give the
>>hex values?  Either way, of no use to me.)
> 
> 
> Try this:
> http://webhelp.esri.com/arcpad/8.0/referenceguide/

Wow.  Question, though:  all those codepages mapping to 437 and 850 -- 
are they really all the same?

>>I looked at dbase.com, but came up empty-handed there (not surprising,
>>since they are a commercial company).
> 
> 
> MS and ESRI have docs ... does that mean that they are non-commercial
> companies?

I don't know enough about ESRI to make an informed comment, so I'll just 
say I'm grateful they have them!  MS is a complete mystery... perhaps 
they are finally seeing the light?  Hard to believe, though, from a 
company that has consistently changed their file formats with every release.

>>I searched some more on Microsoft's site in the VFP9 section, and was
>>able to find the code page section this time.  Sadly, it only added
>>about seven codes.
>>
>>At any rate, here is what I have come up with so far.  Any corrections
>>and/or additions greatly appreciated.
>>
>>code_pages = {
>>     '\x01' : ('ascii', 'U.S. MS-DOS'),
> 
> 
> All of the sources say codepage 437, so why ascii instead of cp437?

Hard to say, really.  Adjusted.

>>     '\x02' : ('cp850', 'International MS-DOS'),
>>     '\x03' : ('cp1252', 'Windows ANSI'),
>>     '\x04' : ('mac_roman', 'Standard Macintosh'),
>>     '\x64' : ('cp852', 'Eastern European MS-DOS'),
>>     '\x65' : ('cp866', 'Russian MS-DOS'),
>>     '\x66' : ('cp865', 'Nordic MS-DOS'),
>>     '\x67' : ('cp861', 'Icelandic MS-DOS'),
>>     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy
> 
> 
> Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
> not alone. I suggest that you omit Kamenicky until someone actually
> wants it.

Yeah, I noticed that.  Tentative plan was to implement it myself (more 
for practice than anything else), and also to be able to raise a more 
specific error ("Kamenicky not currently supported" or some such).
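
Something along these lines is what I have in mind. This is a minimal 
sketch only: the UnsupportedCodePage class and the lookup_codec helper are 
hypothetical names, not anything in the dbf module yet.  It leans on 
codecs.lookup, which raises LookupError for encodings Python does not know.

```python
import codecs


class UnsupportedCodePage(LookupError):
    """Raised when a dbf code page maps to a codec Python does not provide."""


def lookup_codec(name, description):
    """Return the CodecInfo for `name`, or raise a more specific error."""
    try:
        # codecs.lookup raises LookupError for unknown encoding names
        return codecs.lookup(name)
    except LookupError:
        raise UnsupportedCodePage(
            "%s (%s) not currently supported" % (description, name)
        ) from None
```

Subclassing LookupError means existing callers that already catch 
LookupError keep working, while the message names the code page involved.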

>>     '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),      # iffy
> 
> 
> Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
> predates and is not the same as cp852. In any case, I suggest that you
> omit Mazovia until someone wants it. Interesting reading:
> 
> http://www.jastra.com.pl/klub/ogonki.htm

Very interesting reading.

>>     '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>>     '\x6b' : ('cp857', 'Turkish MS-DOS'),
>>     '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\
> 
> 
> big5 is *not* the same as cp950. The products that create DBF files
> were designed for Windows. So when your source says that LDID 0xXX
> maps to Windows codepage YYY, I would suggest that all you should do
> is translate that without thinking to python encoding cpYYY.

Ack.  Not sure how I missed 'Windows' at the end of that description.
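
In code that rule is close to a one-liner.  A rough sketch, assuming the 
source gives the Windows codepage as an integer; the function name 
windows_codepage_to_encoding is made up for illustration:

```python
import codecs


def windows_codepage_to_encoding(codepage):
    """Translate a Windows codepage number YYY to the Python name 'cpYYY'."""
    name = "cp%d" % codepage
    # Verify that Python actually ships this codec; raises LookupError if not
    codecs.lookup(name)
    return name
```

So Windows codepage 950 becomes 'cp950' rather than 'big5', which Python 
treats as a separate codec.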

>>                Windows'),       # wag
> 
> What does "wag" mean?

wag == 'wild ass guess'

>>     '\x79' : ('iso2022_kr', 'Korean Windows'),          # wag
> 
> Try cp949.

Done.

>>     '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
>>                Windows'),       # wag
> 
> 
> Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
> (1980) Chinese (GB2312) and a basic Korean kit. However to quote from
> "CJKV Information Processing" by Ken Lunde, "... from a practical
> point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
> encoding." i.e. no Chinese support at all. Try cp936.

Done.

>>     '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag
> 
> 
> Try cp936.

You mean 932?

>>     '\x7c' : ('cp874', 'Thai Windows'),                 # wag
>>     '\x7d' : ('cp1255', 'Hebrew Windows'),
>>     '\x7e' : ('cp1256', 'Arabic Windows'),
>>     '\xc8' : ('cp1250', 'Eastern European Windows'),
>>     '\xc9' : ('cp1251', 'Russian Windows'),
>>     '\xca' : ('cp1254', 'Turkish Windows'),
>>     '\xcb' : ('cp1253', 'Greek Windows'),
>>     '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>>     '\x97' : ('mac_latin2', 'Macintosh EE'),
>>     '\x98' : ('mac_greek', 'Greek Macintosh') }
> 
> 
> HTH,
> John

Very helpful indeed.  Many thanks for reviewing and correcting. 
Learning to deal with unicode is proving more difficult for me than 
learning Python was to begin with!  ;D
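
For the record, here is roughly how the table looks with the corrections 
above folded in.  This is a sketch written for Python 3, so the keys are 
integers rather than one-character strings.  The names LDID_OFFSET and 
encoding_from_header are illustrative, and the offset of 29 is my 
assumption based on the commonly documented dBase/VFP header layout. 
Kamenicky and Mazovia map to None until someone actually wants them.

```python
# Keys are the language driver ID byte; None means no Python codec is mapped.
code_pages = {
    0x01: ('cp437', 'U.S. MS-DOS'),
    0x02: ('cp850', 'International MS-DOS'),
    0x03: ('cp1252', 'Windows ANSI'),
    0x04: ('mac_roman', 'Standard Macintosh'),
    0x64: ('cp852', 'Eastern European MS-DOS'),
    0x65: ('cp866', 'Russian MS-DOS'),
    0x66: ('cp865', 'Nordic MS-DOS'),
    0x67: ('cp861', 'Icelandic MS-DOS'),
    0x68: (None, 'Kamenicky (Czech) MS-DOS'),
    0x69: (None, 'Mazovia (Polish) MS-DOS'),
    0x6a: ('cp737', 'Greek MS-DOS (437G)'),
    0x6b: ('cp857', 'Turkish MS-DOS'),
    0x78: ('cp950', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),
    0x79: ('cp949', 'Korean Windows'),
    0x7a: ('cp936', 'Chinese Simplified (PRC, Singapore) Windows'),
    0x7b: ('cp932', 'Japanese Windows'),
    0x7c: ('cp874', 'Thai Windows'),
    0x7d: ('cp1255', 'Hebrew Windows'),
    0x7e: ('cp1256', 'Arabic Windows'),
    0xc8: ('cp1250', 'Eastern European Windows'),
    0xc9: ('cp1251', 'Russian Windows'),
    0xca: ('cp1254', 'Turkish Windows'),
    0xcb: ('cp1253', 'Greek Windows'),
    0x96: ('mac_cyrillic', 'Russian Macintosh'),
    0x97: ('mac_latin2', 'Macintosh EE'),
    0x98: ('mac_greek', 'Greek Macintosh'),
}

LDID_OFFSET = 29  # assumption: header byte that holds the language driver ID


def encoding_from_header(header):
    """Return the Python encoding name for the raw dbf header bytes."""
    ldid = header[LDID_OFFSET]  # indexing bytes yields an int in Python 3
    if ldid not in code_pages:
        raise LookupError("unknown language driver ID 0x%02x" % ldid)
    name, description = code_pages[ldid]
    if name is None:
        raise LookupError("%s not currently supported" % description)
    return name
```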

~Ethan~