translating foreign data

Ben Bacarisse ben.usenet at bsb.me.uk
Fri Jun 22 06:14:59 EDT 2018


Ethan Furman <ethan at stoneleaf.us> writes:

> On 06/21/2018 01:20 PM, Ben Bacarisse wrote:
<snip>
>> You say in a followup that you don't need to worry about digit grouping
>> marks (like thousands separators) so I'm not sure what the problem is.
>> Can't you just replace ',' with '.' and proceed as if you had only one
>> representation?
>
> I could, and that would work right up until a third decimal separator
> was found.  I'd like to solve the problem just once if possible.

Ah, I see.  I took you to mean that you knew this wouldn't be an issue.
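
For what it's worth, here is a minimal sketch of the replace-and-proceed
idea, arranged so that a newly found separator is a one-line change.  The
name parse_decimal and the set of separators are my own inventions for
illustration, and it assumes there are no digit grouping marks in the data:

```python
# Hypothetical set of decimal separators; extend it when a new one turns up.
# U+066B is the Arabic decimal separator, included only as an example.
DECIMAL_SEPARATORS = {".", ",", "\u066b"}


def parse_decimal(text, separators=DECIMAL_SEPARATORS):
    """Parse a number whose decimal separator is any one of `separators`.

    Assumes the text contains no grouping (thousands) marks, so more than
    one separator character is treated as an error rather than guessed at.
    """
    text = text.strip()
    found = [ch for ch in text if ch in separators]
    if len(found) > 1:
        raise ValueError(f"more than one separator in {text!r}")
    for sep in found:
        text = text.replace(sep, ".")
    return float(text)
```

Refusing input with two separator characters is deliberate: once grouping
marks are possible, "1,234" is ambiguous and no simple replace can fix that.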

>> The code page remark is curious.  Will some "code pages" have digits
>> that are not ASCII digits?
>
> Good question.  I have no idea.

It's much more of an open question than I thought.  My only advice,
then, is to ignore problems that *might* arise.  Solve the problem you
face now and hope that you can extend it as needed.  It's good to check
if there is a well-known solution ready to use out of the box, but
since there really isn't, you might as well get something working now.
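
The digits question can at least be probed empirically.  This is a rough
sketch (the function name is made up) that asks a single-byte codec which
non-ASCII characters it can produce that Unicode classes as decimal digits
(general category Nd):

```python
import unicodedata


def non_ascii_digits(encoding):
    """List non-ASCII decimal digits that single bytes decode to in `encoding`."""
    digits = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned (or incomplete) in this encoding
        if len(ch) == 1 and ord(ch) > 127 and unicodedata.category(ch) == "Nd":
            digits.append(ch)
    return digits
```

Running it over the code pages you actually see would show whether this is
a real concern for your data; I have not surveyed them myself.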

> I get the appropriate decoder/encoder
> based on the code page contained in the file, then decode to unicode
> and go from there.

It's rather off-topic but what does it mean for the code page to be
contained in the file?  Are you guessing the character encoding from the
rest of the file contents or is there some actual description of the
encoding present?
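
If the file simply records a code page number, I imagine the decoding step
looks something like the sketch below.  I am guessing at what your file
format provides: the function name is hypothetical, and the "cp<number>"
spelling is Python's codec naming for many, but not all, code pages.

```python
import codecs


def decode_with_code_page(raw, code_page):
    """Decode `raw` bytes using the codec for a numeric code page."""
    # codecs.lookup raises LookupError if Python has no such codec.
    codec = codecs.lookup(f"cp{code_page}")
    text, _consumed = codec.decode(raw)
    return text
```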

> ... I was hoping to map the code page to
> a locale that would properly translate the numbers for me,
<snip>
> Worst case scenario is I manually create a map for each code page to
> decimal separator, but there's more than a few and I'd rather not if
> there is already a prebuilt solution out there.

That can't work in general, but you may be lucky with your particular
data set.  For example, files using one of the "Latin" encodings could
have numbers written using the UK convention (0.5) or the French
convention (0,5).  I do both depending on the audience.
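
To make that concrete: both spellings decode equally well from Latin-1, so
the encoding alone cannot tell them apart.  If anything can, it is the data
itself.  A hedged sketch (made-up function name, and again assuming no
grouping marks) that guesses the separator from the values in one file:

```python
def guess_decimal_separator(values, candidates=".,"):
    """Return the one candidate separator used in `values`, or None if unclear."""
    seen = {ch for value in values for ch in value if ch in candidates}
    return seen.pop() if len(seen) == 1 else None


# The same Latin-1 decoding accepts either convention without complaint.
uk = b"0.5".decode("latin-1")
french = b"0,5".decode("latin-1")
```

It returns None when the values mix conventions or contain no separator at
all, which is the point at which a human has to decide.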

-- 
Ben.


