REQ : encoding windows cp1252 => iso latin 1
Brian Quinlan
brian at sweetapp.com
Tue Feb 5 14:11:13 EST 2002
Gillou wrote:
> My customers make copy/paste from M$ word docs to forms translated to XML
> (expecting ISO latin 1 charset).
> My XML parser (pyexpat) does not accept cp1252 character, and I'm looking
> for a function that can translate extra cp1252 characters to the closest
> ISO latin 1 encoding.
I am such a nice guy. Here is a completely untested solution:
"""
80 20AC EURO SIGN
81 UNDEFINED
82 201A SINGLE LOW-9 QUOTATION MARK
83 0192 LATIN SMALL LETTER F WITH HOOK
84 201E DOUBLE LOW-9 QUOTATION MARK
85 2026 HORIZONTAL ELLIPSIS
86 2020 DAGGER
87 2021 DOUBLE DAGGER
88 02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
89 2030 PER MILLE SIGN
8A 0160 LATIN CAPITAL LETTER S WITH CARON
8B 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
8C 0152 LATIN CAPITAL LIGATURE OE
8D UNDEFINED
8E 017D LATIN CAPITAL LETTER Z WITH CARON
8F UNDEFINED
90 UNDEFINED
91 2018 LEFT SINGLE QUOTATION MARK
92 2019 RIGHT SINGLE QUOTATION MARK
93 201C LEFT DOUBLE QUOTATION MARK
94 201D RIGHT DOUBLE QUOTATION MARK
95 2022 BULLET
96 2013 EN DASH
97 2014 EM DASH
98 02DC SMALL TILDE
99 2122 TRADE MARK SIGN
9A 0161 LATIN SMALL LETTER S WITH CARON
9B 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
9C 0153 LATIN SMALL LIGATURE OE
9D UNDEFINED
9E 017E LATIN SMALL LETTER Z WITH CARON
9F 0178 LATIN CAPITAL LETTER Y WITH DIAERESIS
"""
replacement = {
    # You pick the "closest characters"
    0x80: "e",
    0x82: ",",
    # ...
}
def cp1252_to_8859_1(text):
    out_str = ''
    for ch in text:
        # The only different characters are from 0x80-0x9f
        if 0x80 <= ord(ch) <= 0x9f:
            # Or throw an exception and ask them not to use
            # dumb characters
            out_str += replacement[ord(ch)]
        else:
            out_str += ch
    return out_str
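
For comparison, here is a rough sketch of the same idea using Python's built-in codecs rather than a hand-written byte loop. It assumes Python 3 and that the input arrives as raw cp1252 bytes. The CLOSEST table and the function name cp1252_bytes_to_latin1 are made up for illustration, and the substitutions in the table are only one possible choice of "closest characters". Bytes that cp1252 leaves undefined, and characters with no entry in the table, come out as "?".

```python
# Hypothetical substitution table: Unicode characters that cp1252 can
# represent but ISO Latin-1 cannot, mapped to an ASCII stand-in.
# These choices are assumptions; pick your own "closest characters".
CLOSEST = {
    "\u20ac": "EUR",   # EURO SIGN
    "\u201a": ",",     # SINGLE LOW-9 QUOTATION MARK
    "\u201e": '"',     # DOUBLE LOW-9 QUOTATION MARK
    "\u2026": "...",   # HORIZONTAL ELLIPSIS
    "\u2018": "'",     # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",     # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',     # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',     # RIGHT DOUBLE QUOTATION MARK
    "\u2022": "*",     # BULLET
    "\u2013": "-",     # EN DASH
    "\u2014": "-",     # EM DASH
    "\u2122": "(TM)",  # TRADE MARK SIGN
}

# Build the translation table once; str.translate accepts multi-character
# replacement strings as values.
_TABLE = str.maketrans(CLOSEST)


def cp1252_bytes_to_latin1(data: bytes) -> bytes:
    """Convert cp1252-encoded bytes to ISO Latin-1 bytes.

    Undefined cp1252 bytes and unmapped characters become b"?".
    """
    # Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) decode to U+FFFD.
    text = data.decode("cp1252", errors="replace")
    text = text.translate(_TABLE)
    # Anything still outside Latin-1 is replaced with "?".
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    print(cp1252_bytes_to_latin1(b"\x93hi\x94 \x96 caf\xe9"))
```

Passing errors="strict" to either call instead would raise an exception on problem input, which matches the "throw an exception" alternative mentioned in the comments above.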