unicode and dbf files

John Machin sjmachin at lexicon.net
Tue Oct 27 23:46:28 EDT 2009


On Oct 28, 2:51 am, Ethan Furman <et... at stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 27, 7:15 am, Ethan Furman <et... at stoneleaf.us> wrote:
>
> >>Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
> >>to a cp437, and the file came from a german oem machine... could that
> >>file have upper-ascii codes that will not map to anything reasonable on
> >>my \x01 cp437 machine?  If so, is there anything I can do about it?
>
> > ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
> > meaningless. As for the rest of your question, if the file's encoded
> > in cpXXX, it's encoded in cpXXX. If either the creator or the reader
> > or both are lying, then all bets are off.
>
> My confusion is this -- is there a difference between any of the various
> cp437s?

What various cp437s???

>  Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
> 0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437,

Yes, this is called a "many-to-*one*" relationship.

> and they have names

"they" being the Language Drivers, not the codepages.

> such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
> English (Britain & US)... are these all the same?

When you read the Wikipedia page on cp437, did you see any reference
to different versions for French, German, Finnish, etc? I saw only one
mapping table; how many did you see? If there are multiple language
versions of a codepage, how do you expect to handle this given Python
has only one codec per codepage?
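To make the "one codec per codepage" point concrete, here is a minimal sketch using only the standard `codecs` module. The sample bytes are made up for illustration (they are not from any real dbf file); they are the sort of thing a German-language text field might contain.

```python
import codecs

# Several spellings of the name all resolve to the same single codec;
# there is no French, German or Finnish flavour to choose from.
for alias in ("cp437", "437", "ibm437"):
    assert codecs.lookup(alias).name == "cp437"

# Illustrative bytes only: four non-ASCII characters encoded in cp437.
raw = bytes([0x81, 0x84, 0x94, 0xE1])
text = raw.decode("cp437")

# Show the resulting Unicode code points without depending on the
# console's own encoding.
print([hex(ord(ch)) for ch in text])
```

Whatever Language Driver wrote those bytes, decoding them goes through the same mapping table.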

Trying again: *ONE* attribute of a Language Driver ID (LDID) is the
character set (codepage) that it uses. Other attributes may be things
like the collating (sorting) sequence, whether they use a dot or a
comma as the decimal point, etc. Many different languages in Western
Europe can use the same codepage. Initially the common one was cp437,
then 850, then 1252.
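A rough sketch of the many-to-one relationship, using only the LDID values quoted above from the ESRI list. The names `CP437_LDIDS`, `LDID_TO_CODEC` and `codec_for_ldid` are invented for this example, and the table is deliberately incomplete: it covers only the ten IDs mentioned in this thread.

```python
# LDIDs quoted in this thread as mapping to cp437 (not a complete table).
CP437_LDIDS = (0x01, 0x09, 0x0B, 0x0D, 0x0F, 0x11, 0x15, 0x18, 0x19, 0x1B)

# Many Language Driver IDs, one codepage.
LDID_TO_CODEC = {ldid: "cp437" for ldid in CP437_LDIDS}


def codec_for_ldid(ldid):
    """Return the Python codec name for an LDID listed in the table."""
    try:
        return LDID_TO_CODEC[ldid]
    except KeyError:
        raise ValueError("LDID 0x%02x is not in this (partial) table" % ldid)


# The "German" and "US" drivers end up at exactly the same codec.
assert codec_for_ldid(0x0F) == codec_for_ldid(0x01) == "cp437"
```

The other attributes of a Language Driver (collating sequence, decimal point and so on) would need their own tables; they play no part in decoding the bytes.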

There may possibly be different interpretations of a codepage out there
somewhere, but they are all *intended* to be the same, and I advise
you to cross the different-cp437s bridge *if* it exists and you ever
come to it.

Have you got access to files with LDID not in (0, 1) that you can try
out?
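For checking what LDID a given file carries, something like the following may help. It is a sketch that assumes the commonly documented dBASE/FoxPro layout, in which the fixed part of the header is 32 bytes long and the language driver ID is the single byte at offset 29; verify that against the files you actually have. The function names are made up for this example.

```python
LDID_OFFSET = 29  # assumed position of the language driver byte
HEADER_SIZE = 32  # assumed size of the fixed part of the header


def ldid_from_header(header):
    """Extract the LDID from the first 32 bytes of a dbf file."""
    if len(header) < HEADER_SIZE:
        raise ValueError("truncated dbf header")
    return header[LDID_OFFSET]


def read_ldid(path):
    """Open a dbf file and return its LDID as an integer."""
    with open(path, "rb") as f:
        return ldid_from_header(f.read(HEADER_SIZE))
```

Running `read_ldid` over a directory of dbf files would quickly show whether any of them have an LDID outside (0, 1).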

Cheers,
John


