Best way to deal with different data types in a list comprehension

Wed Sep 24 09:16:53 EDT 2014

Larry Martell wrote:

> I have some code that I inherited:
> 
> ' '.join([self.get_abbrev()] +
>            [str(f['value')
>             for f in self.filters
>             if f.has_key('value')]).strip()
> 
> 
> This broke today when it encountered some non-ascii data.

It's already broken. It raises a SyntaxError part way through:

py> ' '.join([self.get_abbrev()] +
...            [str(f['value')
  File "<stdin>", line 2
    [str(f['value')
                  ^
SyntaxError: invalid syntax

Please copy and paste the actual code, don't retype it.

This is my guess of what you actually have, reformatted to make it more
clear (at least to me):

' '.join(
        [self.get_abbrev()] + 
        [str(f['value']) for f in self.filters if f.has_key('value')]
        ).strip()

I *think* that the call to strip() is redundant. Hmmm... perhaps not, if the
self.get_abbrev() begins with whitespace, or the last f['value'] ends with
whitespace. You should consider removing that call to .strip(), but for now
I'll assume it actually is useful and leave it in.

First change: assuming the filters are dicts, do the test this way:

' '.join(
        [self.get_abbrev()] + 
        [str(f['value']) for f in self.filters if 'value' in f]
        ).strip()
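
To see the membership test in isolation, here is a minimal sketch. The
`filters` list is a made-up stand-in for self.filters (I'm assuming a list
of plain dicts), not your real data:

```python
# Hypothetical stand-in for self.filters: a list of plain dicts.
filters = [{'value': 'red'}, {'name': 'no value here'}, {'value': 42}]

# "'value' in f" tests for the key and works on Python 2 and 3 alike;
# f.has_key('value') only exists on Python 2 dicts.
values = [str(f['value']) for f in filters if 'value' in f]
print(values)
```

The dict without a 'value' key is simply skipped.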

Now, the *right* way to fix your problem is to convert the whole application
to use unicode strings everywhere instead of byte strings. I'm guessing you
are using Python 2.6 or 2.7.

You say it broke when given some non-ascii data, but that's extremely
ambiguous. {23: 42} is non-ascii data. What exactly do you have, and where
did it come from?

My *guess* is that you had a Unicode string, containing characters which
cannot be converted to ASCII.

py> str(u'Ωπ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1:
ordinal not in range(128)
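
In Python 2, calling str() on a unicode string encodes it implicitly with
the default codec, which is ASCII. A small sketch that makes that hidden
step explicit (the variable names are mine, and the characters are written
as escapes so the snippet doesn't depend on the source file's encoding):

```python
text = u'\u03a9\u03c0'  # the same two characters: Greek Omega and pi

try:
    # This is the explicit form of what Python 2's str(text) does.
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as err:
    failed = True
    # err.start and err.end give the offending slice of the string.
    bad_span = (err.start, err.end)
    print(err)
```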

> I changed the str(f['value']) line to f['value'].encode('utf-8'),

Hmmm, I'm not sure that's a good plan:

py> u'Ωπ'.encode('utf-8')
'\xce\xa9\xcf\x80'

Do you really want to find arbitrary bytes floating through your strings? A
better strategy is to convert the program to use unicode strings
internally, and only convert to byte strings when you read and write to
files.
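
As a rough sketch of that strategy, assuming the data lives in UTF-8 files
(the file name and contents below are invented for the example), io.open
does the decoding and encoding at the boundary on both Python 2 and 3:

```python
import io
import os
import tempfile

# Hypothetical file name and contents, purely for illustration.
path = os.path.join(tempfile.mkdtemp(), 'filters.txt')

# Writing: text is encoded to UTF-8 bytes only at the file boundary.
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(u'\u03a9\u03c0 filter\n')

# Reading: bytes are decoded back to text as soon as they come in.
with io.open(path, 'r', encoding='utf-8') as src:
    line = src.read().strip()

# Everything in between works on unicode text only.
result = u' '.join([u'abbrev', line])
```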

But assuming you don't have the time or budget for that sort of re-write,
here's a minimal change which might do the job:

u' '.join(
        [self.get_abbrev()] + 
        [unicode(f['value']) for f in self.filters if 'value' in f]
        ).strip()

That works correctly for arbitrary objects and ASCII byte strings:

py> unicode([1, 2, 3])
u'[1, 2, 3]'
py> unicode('bytes')
u'bytes'

Alas, it will fail for non-ASCII byte strings:

py> unicode('bytes \xFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
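
One way around that is to decode byte strings explicitly instead of relying
on the default codec. This is only a sketch: to_text is a hypothetical
helper, and I'm *assuming* the bytes are meant to be UTF-8, which may not
be true of your data. Undecodable bytes are replaced rather than raising:

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3
bytes_type = type(b'')  # str on Python 2, bytes on Python 3

def to_text(value, encoding='utf-8'):
    """Return value as a text string (hypothetical helper)."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes_type):
        # ASSUMPTION: the bytes are UTF-8. errors='replace' substitutes
        # U+FFFD for anything that does not decode.
        return value.decode(encoding, 'replace')
    # Any other object: fall back to its text representation.
    return text_type(value)

values = [u'\u03a9\u03c0', b'bytes \xff', [1, 2, 3]]
joined = u' '.join(to_text(v) for v in values)
```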

Here's a version which prefers byte strings. It handles both unicode and
byte-string values, but note that anything else (an int, say) would still
need converting with str() before the join:

' '.join(
          [self.get_abbrev()] + 
          [
           (x.encode('utf-8') if isinstance(x, unicode) else x) 
           for x in (f['value'] for f in self.filters if 'value' in f)
          ]
        ).strip()

Note the use of a generator expression inside the list comp.
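
If the conditional expression gets unwieldy, the same idea can be pulled
out into a function. A sketch only: to_bytes and the sample `filters` are
hypothetical, b'ABBR' stands in for self.get_abbrev(), and UTF-8 is an
assumed output encoding. Unlike the one-liner, it also converts non-string
values:

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3
bytes_type = type(b'')  # str on Python 2, bytes on Python 3

def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string (hypothetical helper)."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, bytes_type):
        return value  # already bytes, pass through untouched
    # Any other object: take its text form, then encode that.
    return text_type(value).encode(encoding)

# Made-up stand-in for self.filters.
filters = [{'value': u'\u03a9\u03c0'}, {'value': b'raw \xff'},
           {'other': 1}, {'value': 42}]

# The generator expression yields values lazily; the list comp converts.
values = (f['value'] for f in filters if 'value' in f)
joined = b' '.join([b'ABBR'] + [to_bytes(x) for x in values]).strip()
```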

-- 
Steven