unicode and dbf files

John Machin sjmachin at lexicon.net
Mon Oct 26 20:38:25 EDT 2009


On Oct 27, 7:15 am, Ethan Furman <et... at stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 27, 3:22 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>Try this:
> >>>http://webhelp.esri.com/arcpad/8.0/referenceguide/
>
> >>Wow.  Question, though:  all those codepages mapping to 437 and 850 --
> >>are they really all the same?
>
> > 437 and 850 *are* codepages. You mean "all those language driver IDs
> > mapping to codepages 437 and 850". A codepage merely gives an
> > encoding. An LDID is like a locale; it includes other things besides
> > the encoding. That's why many Western European languages map to the
> > same codepage, first 437 then later 850 then 1252 when Windows came
> > along.
>
> Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
> to a cp437, and the file came from a german oem machine... could that
> file have upper-ascii codes that will not map to anything reasonable on
> my \x01 cp437 machine?  If so, is there anything I can do about it?

ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
meaningless. As for the rest of your question, if the file's encoded
in cpXXX, it's encoded in cpXXX. If either the creator or the reader
or both are lying, then all bets are off.
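
To illustrate why a wrong label matters, here is a minimal sketch (Python 3 assumed, using only the standard codecs): the same byte decodes to different characters under cp437 and cp850, and nothing in the byte itself says which one was meant.

```python
# One byte, two code pages: the result depends entirely on the declared encoding.
raw = b"\x9b"

as_437 = raw.decode("cp437")   # what a reader trusting a cp437 label gets
as_850 = raw.decode("cp850")   # what a reader trusting a cp850 label gets

print(repr(as_437), repr(as_850))

# Bytes below 0x80 are the same in both code pages, so pure-ASCII data is safe.
same_low_half = b"abc 123".decode("cp437") == b"abc 123".decode("cp850")
print(same_low_half)
```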

> > BTW, what are you planning to do with an LDID of 0x00?
>
> Hmmm.  Well, logical choices seem to be either treating it as plain
> ascii, and barfing when high-ascii shows up; defaulting to \x01; or
> forcing the user to choose one on initial access.

It would be more useful to allow the user to specify an encoding than
an LDID.

You need to be able to read files created not only by software like
VFP or dBase but also scripts using third-party libraries. It would be
useful to allow an encoding to override an incorrect LDID, e.g. when
the LDID implies cp1251 but the data is actually encoded in koi8-r or
koi8-u.
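
A rough sketch of that override, assuming Python 3. The names `LDID_TO_CODEC` and `pick_codec` are hypothetical, and the table is a small illustrative subset rather than a complete LDID list:

```python
import codecs

# Illustrative subset only; a real table would have many more entries.
LDID_TO_CODEC = {
    0x01: "cp437",
    0x02: "cp850",
    0x03: "cp1252",
    0xC9: "cp1251",
}

def pick_codec(ldid, encoding=None):
    """Return the codec name to use; a user-supplied encoding wins over the LDID."""
    if encoding is not None:
        codecs.lookup(encoding)  # raises LookupError if the name is unknown
        return encoding
    return LDID_TO_CODEC.get(ldid)  # None when the LDID is 0x00 or unrecognised

# The header claims cp1251, but the bytes were really written as koi8-r.
text = "Привет"
raw = text.encode("koi8_r")

wrong = raw.decode(pick_codec(0xC9))            # trusts the LDID: mojibake
right = raw.decode(pick_codec(0xC9, "koi8_r"))  # override recovers the text
print(wrong, right)
```

Returning None for an unknown LDID leaves the "what now?" decision to the caller instead of guessing silently.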

Read this: http://en.wikipedia.org/wiki/Code_page_437
With no LDID in the file and no encoding supplied, I'd be inclined to
make it barf if any byte not in range(32, 128) showed up.
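
Something along these lines, as a sketch only (Python 3 assumed; `decode_field` is a hypothetical helper, not part of any existing library):

```python
def decode_field(raw, encoding=None):
    """Decode one field's bytes; with no encoding known, accept only 32..127."""
    if encoding is not None:
        return raw.decode(encoding)
    # Iterating over bytes in Python 3 yields ints.
    bad = sorted({b for b in raw if b not in range(32, 128)})
    if bad:
        raise ValueError(
            "no LDID and no encoding supplied, but found byte(s) "
            + ", ".join("0x%02x" % b for b in bad)
        )
    return raw.decode("ascii")

print(decode_field(b"Schmidt   "))           # plain data passes through
print(decode_field(b"M\x81ller", "cp437"))   # explicit encoding is trusted
try:
    decode_field(b"M\x81ller")               # no encoding: refuse to guess
except ValueError as exc:
    print(exc)
```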

Cheers,
John


