How to turn a string into a list of integers?

Chris Angelico rosuav at gmail.com
Sun Sep 7 20:12:13 EDT 2014


On Mon, Sep 8, 2014 at 1:40 AM, Roy Smith <roy at panix.com> wrote:
> Well, technically, what you store is something which has the right
> behavior.  If I wrote:
>
> my_huffman_coded_list = [0] * 1000000
>
> I don't know of anything which requires Python to actually generate a
> million 0's and store them somewhere (even ignoring interning of
> integers).  As long as it generated an object (perhaps a subclass of
> list) which responded to all of list's methods the same way a real list
> would, it could certainly build a more compact representation.

Steven hinted at it, but I'll say one thing more explicitly here:
There's actually something that requires Python to *not* generate a
million 0 integers. What you get is a million references to the *same*
zero.

>>> another_list = [object()] * 1000000
>>> sum(id(x) for x in another_list)
140287290433648000000
>>> id(another_list[0]) * len(another_list)
140287290433648000000

The two figures are guaranteed to be the same, because every element
of the list is the same object.
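
A more direct way to check the same thing, as a minimal sketch (the names marker, repeated and rows are just illustrative): sequence repetition copies references, so every slot is identical to the original object, and with a mutable element the sharing becomes visible.

```python
# Sequence repetition copies references, not objects.
marker = object()
repeated = [marker] * 1000

# Every slot refers to the one original object.
all_same = all(item is marker for item in repeated)
distinct_ids = len({id(item) for item in repeated})

# With a mutable element the sharing is observable:
# appending through one slot shows up in all of them.
rows = [[]] * 3
rows[0].append("x")

print(all_same, distinct_ids, rows)
```

This is also why [[]] * n is a classic trap; a comprehension such as [[] for _ in range(n)] builds n separate lists instead.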

But what you're talking about here is an alternative encoding, and
it's definitely possible for different Pythons to encode strings
differently. uPy (MicroPython) uses UTF-8 internally, which gives
different performance characteristics but guarantees the same
semantics. I could imagine someone implementing a Python interpreter
in Pike and using the Pike string type to store Python strings: the
semantics would all be correct, as it's a Unicode string, and the most
notable difference is that Pike strings are guaranteed to be interned,
so all equality comparisons are identity checks. If you wanted to, I'm
sure you could build a Python that uses a dictionary of words (added
to every time you create a string, of course) and actually represents
entire words as short integers, which would mean individual characters
aren't necessarily represented directly. But somehow, you have to turn
the concept of "sequence of Unicode characters" into some well-defined
sequence of bytes, and that's an encoding.
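
To make the word-dictionary idea concrete, here is a toy sketch. It is not how any real Python stores strings; WordTable and its methods are hypothetical names, and the two-byte little-endian code is an arbitrary choice standing in for "short integers". The point is only that the bytes look nothing like the characters, yet the string round-trips exactly.

```python
import re


class WordTable:
    """Hypothetical toy encoding: each token becomes a 2-byte integer."""

    def __init__(self):
        self._index = {}   # token -> integer code
        self._tokens = []  # integer code -> token

    def encode(self, text):
        chunks = []
        # Split into runs of word characters and runs of everything else,
        # so that joining the tokens reproduces the text exactly.
        for token in re.findall(r"\w+|\W+", text):
            if token not in self._index:
                # The table grows every time a new token is seen.
                self._index[token] = len(self._tokens)
                self._tokens.append(token)
            # Two bytes per token; to_bytes raises OverflowError once the
            # table holds more than 65536 tokens (a limit of this sketch).
            chunks.append(self._index[token].to_bytes(2, "little"))
        return b"".join(chunks)

    def decode(self, data):
        codes = (
            int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)
        )
        return "".join(self._tokens[code] for code in codes)


table = WordTable()
original = "the cat saw the other cat"
encoded = table.encode(original)
decoded = table.decode(encoded)

print(len(original), len(encoded), decoded == original)
```

Repeated words share a code, so no individual character appears in the encoded bytes, but decoding still yields the same sequence of Unicode characters.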

ChrisA


