translating foreign data

Richard Damon Richard at Damon-Family.org
Sat Jun 23 07:26:22 EDT 2018


From: Richard Damon <Richard at Damon-Family.org>

On 6/22/18 11:21 PM, Steven D'Aprano wrote:
> On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:
>
>> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>>
>>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>>
>>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>>> that are not ASCII digits?
>>>>> Good question.  I have no idea.
>>>> It's much more of an open question than I thought.
>>> Nah, Python already solves that for you:
>> My understanding was that the OP does not (reliably) know the encoding,
>> though that was a guess based on a turn of phrase.
> I took it the other way: that Ethan *does* know the encoding, and his
> problem is that knowing the encoding and/or locale is not enough to
> recognise whether to use a period or comma as the decimal separator.
>
> Which it isn't.
If you know the locale, then you do know what the decimal separator is, as
that is part of what a locale defines. The issue is that if you just know the
encoding, you don't necessarily know the locale. He also commented that he
didn't want to set the locale in the routine, as that sets it globally for the
full application (but perhaps the latter could be fixed by first calling
locale.getlocale(), then setlocale() for the file's locale, and then, at the
end of reading and processing, restoring the old locale).
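
A minimal sketch of that save-and-restore idea, offered as an illustration
rather than a tested recipe. The names temporary_locale and parse_float are
made up for this example. It queries the current setting with
locale.setlocale(category) instead of locale.getlocale(), since the string
that call returns can be passed straight back to setlocale(). Locale names
such as "de_DE.UTF-8" are platform dependent, so the demonstration sticks to
the "C" locale, which is always available.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_NUMERIC):
    """Switch the locale for one block, then restore the previous one."""
    # Calling setlocale() without a new value only queries the current one.
    saved = locale.setlocale(category)
    try:
        locale.setlocale(category, name)
        yield
    finally:
        # Restore the old locale even if parsing raised an exception.
        locale.setlocale(category, saved)


def parse_float(text, locale_name):
    """Parse text using the decimal separator of the given locale."""
    with temporary_locale(locale_name):
        return locale.atof(text)


value = parse_float("1.5", "C")
```

The locale is still changed for the whole process while the block runs, so
this narrows the window but does not make the routine safe to use from
several threads at once.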
>
> If he doesn't know the encoding, he has bigger problems than just
> converting strings into floats. Without knowing the encoding, he cannot
> even reliably detect non-ASCII digits at all.
>
>
>> Another guess is that the OP does not have Unicode data.  The term "code
>> page" hints at an 8-bit encoding or at least a pre-Unicode one.
> Assuming he is using Python 3, or using Python 2 sensibly, once he has
> specified the encoding and read the data from the file, he has Unicode.
>
> Unicode is a superset of (ideally) all code pages. Once you have decoded
> the data using the appropriate code page, you have a Unicode string, and
> Python doesn't care where it came from.
>
> The point is, once Ethan can get the intended characters out of the file
> into Python, it doesn't matter what code page they came from. They're now
> full-fledged Unicode characters, and Python's float() and int() functions
> can easily deal with non-ASCII digits. So long as he has digits in the
> first place, float() and int() will deal with them correctly.
>
>
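
To illustrate the quoted point about digits, here is a small sketch assuming
Python 3 and strings that have already been decoded. The sample strings are
invented for the example. int() and float() accept Unicode decimal digits,
but float() only recognises a period as the decimal separator, which is why
the locale question above remains.

```python
# U+0661..U+0663 are ARABIC-INDIC DIGIT ONE, TWO and THREE.
arabic = int("\u0661\u0662\u0663")

# U+0967 and U+096B are DEVANAGARI DIGIT ONE and FIVE; the separator is an
# ordinary ASCII period.
devanagari = float("\u0967.\u096b")

# A comma as decimal separator is not understood by float().
try:
    float("1,5")
    comma_accepted = True
except ValueError:
    comma_accepted = False

print(arabic, devanagari, comma_accepted)
```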

--
Richard Damon

-+- BBBS/Li6 v4.10 Toy-3
 + Origin: Prism bbs (1:261/38)
