unicode and dbf files

Sat Oct 24 06:58:01 EDT 2009

On Oct 24, 4:14 am, Ethan Furman <et... at stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 23, 3:03 pm, Ethan Furman <et... at stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>On Oct 23, 7:28 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
> >>>>Greetings, all!
>
> >>>>I would like to add unicode support to my dbf project.  The dbf header
> >>>>has a one-byte field to hold the encoding of the file.  For example,
> >>>>\x03 is code-page 437 MS-DOS.
>
> >>>>My google-fu is apparently not up to the task of locating a complete
> >>>>resource that has a list of the 256 possible values and their
> >>>>corresponding code pages.
>
> >>>What makes you imagine that all 256 possible values are mapped to code
> >>>pages?
>
> >>I'm just wanting to make sure I have whatever is available, and
> >>preferably standard.  :D
>
> >>>>So far I have found this, plus variations:http://support.microsoft.com/kb/129631
>
> >>>>Does anyone know of anything more complete?
>
> >>>That is for VFP3. Try the VFP9 equivalent.
>
> >>>dBase 5,5,6,7 use others which are not defined in publicly available
> >>>dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
> >>>source: ESRI support site.
>
> >>Well, a couple hours later and still not more than I started with.
> >>Thanks for trying, though!
>
> > Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
> > keywords and you couldn't come up with anything??
>
> Perhaps "nothing new" would have been a better description.  I'd already
> seen the clicketyclick site (good info there)

Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h 	Nordic MS-DOS	code page 865
66h 	Russian MS-DOS	code page 866

> and all I found at ESRI
> were folks trying to figure it out, plus one link to a list that was no
> different from the vfp3 list (or was it that the list did not give the
> hex values?  Either way, of no use to me.)

Try this:
http://webhelp.esri.com/arcpad/8.0/referenceguide/

>
> I looked at dbase.com, but came up empty-handed there (not surprising,
> since they are a commercial company).

MS and ESRI have docs ... does that mean that they are non-commercial
companies?

> I searched some more on Microsoft's site in the VFP9 section, and was
> able to find the code page section this time.  Sadly, it only added
> about seven codes.
>
> At any rate, here is what I have come up with so far.  Any corrections
> and/or additions greatly appreciated.
>
> code_pages = {
>      '\x01' : ('ascii', 'U.S. MS-DOS'),

All of the sources say codepage 437, so why ascii instead of cp437?

>      '\x02' : ('cp850', 'International MS-DOS'),
>      '\x03' : ('cp1252', 'Windows ANSI'),
>      '\x04' : ('mac_roman', 'Standard Macintosh'),
>      '\x64' : ('cp852', 'Eastern European MS-DOS'),
>      '\x65' : ('cp866', 'Russian MS-DOS'),
>      '\x66' : ('cp865', 'Nordic MS-DOS'),
>      '\x67' : ('cp861', 'Icelandic MS-DOS'),
>      '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.

>      '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),      # iffy

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Masovia until someone wants it. Interesting reading:

http://www.jastra.com.pl/klub/ogonki.htm

>      '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>      '\x6b' : ('cp857', 'Turkish MS-DOS'),
>      '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\

big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I would suggest that all you should do
is translate that without thinking to python encoding cpYYY.

>                 Windows'),       # wag

What does "wag" mean?

>      '\x79' : ('iso2022_kr', 'Korean Windows'),          # wag

Try cp949.

>      '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
>                 Windows'),       # wag

Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.

>      '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag

Try cp936.

>      '\x7c' : ('cp874', 'Thai Windows'),                 # wag
>      '\x7d' : ('cp1255', 'Hebrew Windows'),
>      '\x7e' : ('cp1256', 'Arabic Windows'),
>      '\xc8' : ('cp1250', 'Eastern European Windows'),
>      '\xc9' : ('cp1251', 'Russian Windows'),
>      '\xca' : ('cp1254', 'Turkish Windows'),
>      '\xcb' : ('cp1253', 'Greek Windows'),
>      '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>      '\x97' : ('mac_latin2', 'Macintosh EE'),
>      '\x98' : ('mac_greek', 'Greek Macintosh') }

HTH,
John