[Tutor] Most efficient way to replace ", " with "." in a array and/or dataframe

Cameron Simpson cs at cskk.id.au
Sun Sep 22 04:20:52 EDT 2019


On 22Sep2019 07:39, Albert-Jan Roskam <sjeik_appie at hotmail.com> wrote:
>On 22 Sep 2019 04:27, Cameron Simpson <cs at cskk.id.au> wrote:
>On 21Sep2019 20:42, Markos <markos at c2o.pro.br> wrote:
>>I have a table.csv file with the following structure:
>>
>>, Polyarene conc ,, mg L-1 ,,,,,,,
>>Spectrum, Py, Ace, Anth,
>>1, "0,456", "0,120", "0,168"
>>2, "0,456", "0,040", "0,280"
>>3, "0,152", "0,200", "0,280"
>>
>>I open as dataframe with the command:
>>data = pd.read_csv ('table.csv', sep = ',', skiprows = 1)
>[...]
>>And the data_array variable gets the fields in string format:
>>[['0,456' '0,120' '0,168']
>[...]
>
>>Please see the documentation for the read_csv function here:
>
>>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
>
>Do you think it's a deliberate design choice that decimal and thousands
>were used here as params, and not a 'locale' param? It seems nice to
>be able to specify e.g. locale='dutch' and then all the right
>lc_numeric, lc_monetary, lc_time were used. Or even
>locale='nl_NL.1252' and you also wouldn't need 'encoding' as a separate
>param. Or might that be bad on Windows where there's no locale-gen?
>Just wondering...

Locales are tricky; I don't know enough.

A locale parameter might be convenient for some things, but such things 
are table driven. From an arbitrary Linux box nearby:

  % locale -a
  C
  C.UTF-8
  POSIX
  en_AU.utf8

No "dutch" or similar there.
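
The same lookup can be probed from Python. This is only a sketch: 
locale_available() is a made-up helper name, not part of pandas or the 
standard library, and the answers depend entirely on which locales the 
machine at hand has installed.

```python
import locale

def locale_available(name):
    """Hypothetical helper: report whether this system knows locale `name`."""
    # Remember the current LC_NUMERIC setting so the probe has no lasting effect.
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
    except locale.Error:
        # Raised when the name is missing from the system's locale tables.
        return False
    else:
        return True
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)

for name in ("C", "dutch", "nl_NL.UTF-8"):
    print(name, locale_available(name))
```

On a box like the one above I would expect only "C" to be reported as 
available, but another machine may well answer differently, which is 
rather the point.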

I doubt pandas would ship with such a thing. And the OP probably doesn't 
know the originating locale anyway. Nor do _we_ know that those values 
themselves were driven from some well known locale table.

The advantage of specific decimal= and thousands= parameters is that 
they do exactly what they say, rather than looking up a locale and 
hoping for a specific side effect. So the specific parameters offer 
better control.
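
As a rough illustration of that control, here is a minimal sketch using 
read_csv with decimal=','. The inline sample is a tidied-up stand-in for 
the OP's file (no title row, no stray spaces or trailing commas), so 
treat it as an assumption rather than the real table.csv.

```python
import io
import pandas as pd

# Assumed sample data, modelled on the rows the OP posted.
csv_text = (
    'Spectrum,Py,Ace,Anth\n'
    '1,"0,456","0,120","0,168"\n'
    '2,"0,456","0,040","0,280"\n'
    '3,"0,152","0,200","0,280"\n'
)

# sep=',' splits the fields; decimal=',' says that a comma inside a
# (quoted) field is the decimal point, so no locale is consulted.
data = pd.read_csv(io.StringIO(csv_text), sep=',', decimal=',')

print(data.dtypes)
print(data)
```

If that behaves as I expect, the Py, Ace and Anth columns come back as 
floats rather than strings, with no replace() pass needed afterwards.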

The thousands= itself is a little parochial (for example, in India a 
factor of 100 is a common division point[1]), but it may merely be used 
to strip this character from the left portion of the number.

[1] https://en.wikipedia.org/wiki/Indian_numbering_system
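
To make the "merely strip this character" idea concrete, here is a small 
pure-Python sketch. parse_number() is a hypothetical helper of my own, 
not how pandas implements it; it just shows that stripping a grouping 
character does not care whether the groups are thousands or hundreds.

```python
def parse_number(text, decimal=',', thousands=None):
    """Hypothetical helper: turn a string into a float by purely lexical rules."""
    text = text.strip()
    if thousands:
        # Drop every grouping character, wherever it occurs.
        text = text.replace(thousands, '')
    if decimal != '.':
        # Normalise the decimal mark to what float() understands.
        text = text.replace(decimal, '.')
    return float(text)

print(parse_number('0,456'))                                     # 0.456
print(parse_number('1.234,5', decimal=',', thousands='.'))       # 1234.5
print(parse_number('12,34,567.89', decimal='.', thousands=','))  # 1234567.89
```

The last call uses Indian style grouping, and it parses fine because 
nothing here depends on where the grouping characters sit.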

So while I am not a pandas person, I would expect that decimal= and 
thousands= are useful parameters for specific lexical situations (like 
the OP's CSV data) and work regardless of any locale knowledge.

Cheers,
Cameron Simpson <cs at cskk.id.au>


