Elementary string-parsing

Mon Feb 4 12:01:50 EST 2008

On Mon, 04 Feb 2008 09:43:04 +0000, Odysseus wrote:

> In article <60nunqF1ro06iU4 at mid.uni-berlin.de>,
>  Marc 'BlackJack' Rintsch <bj_666 at gmx.net> wrote:
> 
>> def extract_data(names, na, cells):
>>     found = dict()
> 
> The problem with initializing the 'super-dictionary' within this 
> function is that I want to be able to add to it in further passes, with 
> a new set of "names" & "cells" each time.

Then you can either pass in `found` as an argument instead of creating it
here, or you can collect the passes in the calling code with the `update()`
method of `dict`.  Something like this:

found = dict()
for pass_ in passes:  # `pass` itself is a keyword, so it can't be a name
    # ...
    found.update(extract_data(names, na, cells))
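
A slightly fuller sketch of both variants, just to make the idea concrete.
The `extract_data()` here is a heavily simplified stand-in that stores only
one cell per name, and both the layout of `passes` and the `into` parameter
are made up for illustration:

```python
def extract_data(names, na, cells, into=None):
    """Simplified stand-in: map each name to the cell at offset `na + i`."""
    found = dict() if into is None else into
    for i, name in enumerate(names):
        found[name] = cells[na + i]
    return found

# Hypothetical input: one (names, na, cells) triple per pass.
passes = [
    (['a', 'b'], 0, ['1', '2']),
    (['c'], 1, ['skip', '3']),
]

# Variant 1: collect the results in the calling code.
found = dict()
for names, na, cells in passes:
    found.update(extract_data(names, na, cells))

# Variant 2: hand the same dictionary to every pass.
found_shared = dict()
for names, na, cells in passes:
    extract_data(names, na, cells, into=found_shared)
```

Both variants end up with the same dictionary.  The first keeps
`extract_data()` free of side effects, the second avoids building a
temporary dictionary per pass.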

> BTW what's the difference between the above and "found = {}"?

I find it more "explicit".  ``dict`` and ``list`` are easier to
distinguish than ``{}`` and ``[]`` after a loooong coding session or when
printed/displayed in a small font.  It's just a matter of taste.

>>     for i, name in enumerate(names):
>>         data = dict()
>>         cells_index = 10 * i + na
>>         for cell_name, index, parse in (('epoch1', 0, parse_date),
>>                                         ('epoch2', 1, parse_date),
>>                                         ('time', 5, parse_number),
>>                                         ('score1', 6, parse_number),
>>                                         ('score2', 7, parse_number)):
>>             data[cell_name] = parse(cells[cells_index + index])
> 
> This looks a lot more efficient than my version, but what about the 
> strings that don't need parsing? Would it be better to define a 
> 'pass-through' function that just returns its input, so they can be 
> handled by the same loop, or to handle them separately with another loop?

I'd handle them in the same loop.  A "pass-through" function for strings
already exists:

In [255]: str('hello')
Out[255]: 'hello'
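
So `str` can simply take the place of a parse function in the loop.  A small
sketch, assuming a made-up `cells` list and a made-up 'label' column that
needs no parsing (`parse_number()` is the function from my earlier post):

```python
def parse_number(string):
    """Return a float if possible, otherwise the string unchanged."""
    try:
        return float(string.replace(',', ''))
    except ValueError:
        return string

# Hypothetical row; index 2 holds a plain string that needs no parsing.
cells = ['1,234.5', 'n/a', 'some label']

data = dict()
for cell_name, index, parse in (('time', 0, parse_number),
                                ('score1', 1, parse_number),
                                ('label', 2, str)):
    data[cell_name] = parse(cells[index])
```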

>>         assert name.startswith('Name: ')
> 
> I looked up "assert", but all I could find relates to debugging. Not 
> that I think debugging is something I can do without ;) but I don't 
> understand what this line does.

It checks whether `name` really starts with 'Name: ' and raises an
`AssertionError` if it doesn't.  This way I turned the comment into code
that actually checks what the comment claimed.
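
A small demonstration of what happens in both cases.  `strip_prefix()` is a
hypothetical helper that exists only to show the ``assert``:

```python
def strip_prefix(name):
    # Hypothetical helper, only here to demonstrate the assertion.
    assert name.startswith('Name: ')
    return name[len('Name: '):]

print(strip_prefix('Name: Odysseus'))

try:
    strip_prefix('Odysseus')
except AssertionError:
    print('assertion failed')
```

Note that assertions are skipped when Python runs with the ``-O`` option, so
they are a debugging aid and not a replacement for validating input.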

>> The `parse_number()` function could look like this:
>> 
>> def parse_number(string):
>>     try:
>>         return float(string.replace(',', ''))
>>     except ValueError:
>>         return string
>> 
>> Indeed the commas can be replaced a bit more elegant.  :-)
> 
> Nice, but I'm somewhat intimidated by the whole concept of 
> exception-handling (among others). How do you know to expect a 
> "ValueError" if the string isn't a representation of a number?

Experience.  I just tried what happens if I feed `float()` a string
that is not a number:

In [256]: float('abc')
---------------------------------------------------------------------------
<type 'exceptions.ValueError'>            Traceback (most recent call last)

/home/bj/<ipython console> in <module>()

<type 'exceptions.ValueError'>: invalid literal for float(): abc

> Is there a list of common exceptions somewhere? (Searching for
> "ValueError" turned up hundreds of passing mentions, but I couldn't find
> a definition or explanation.)

The definition is quite vague: the type of an argument is correct, but
there's something wrong with its value.
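
The contrast with `TypeError` may make this clearer.  A short sketch, where
`classify()` is a made-up helper that reports which exception `float()`
raises for a given argument:

```python
def classify(value):
    """Report which exception, if any, `float()` raises for `value`."""
    try:
        float(value)
    except TypeError:
        # Wrong kind of object altogether, e.g. None.
        return 'TypeError'
    except ValueError:
        # Right type (a string), but the content is no number.
        return 'ValueError'
    return 'ok'

results = [classify(v) for v in ('1.5', 'abc', None)]
```

Here '1.5' converts fine, 'abc' is a string with an unusable value, and
`None` is not an acceptable type in the first place.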

See http://docs.python.org/lib/module-exceptions.html for an overview of
the built-in exceptions.

>> As already said, that ``while`` loop should be a ``for`` loop.  But if
>> you put `m_abbrevs` into a `list` you can replace the loop with a
>> single call to its `index()` method: ``dlist[1] =
>> m_abbrevs.index(dlist[1]) + 1``.
> 
> I had gathered that lists shouldn't be used for storing constants. Is
> that more of a suggestion than a rule?

Some suggest this.  Others say tuples are for data where the position of
an element has a "meaning" and lists are for elements that all have the
same "meaning" for some definition of meaning.  As an example ('John',
'Doe', 'Dr.') vs. ['Peter', 'Paul', 'Mary'].  In the first example we have
name, surname, title and in the second example all elements are just
names.  Unless the second example models a relation like child, father,
mother, or something like that.  Anyway, if you can make the source simpler
and easier to understand by using the `index()` method, use a list.  :-)
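
A sketch of the `index()` variant.  I'm assuming here that `m_abbrevs` holds
the usual English three-letter month abbreviations and that `dlist` is a date
already split into day, month and year, which may not match your data:

```python
# Assumed content: English three-letter month abbreviations.
m_abbrevs = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical date already split into [day, month abbreviation, year].
dlist = ['4', 'Feb', '2008']
dlist[1] = m_abbrevs.index(dlist[1]) + 1
```

If the abbreviation is not in the list, `index()` raises a `ValueError`, so
an unexpected month name doesn't go unnoticed.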

Ciao,
	Marc 'BlackJack' Rintsch