Unicode and Python - how often do you index strings?

Tim Chase python.list at tim.thechases.com
Tue Jun 3 21:11:54 EDT 2014


On 2014-06-04 10:39, Chris Angelico wrote:
> A current discussion regarding Python's Unicode support centres (or
> centers, depending on how close you are to the cent[er]{2} of the
> universe) around one critical question: Is string indexing common?
> 
> Python strings can be indexed with integers to produce characters
> (strings of length 1). They can also be iterated over from beginning
> to end. Lots of operations can be built on either one of those two
> primitives; the question is, how much can NOT be implemented
> efficiently over iteration, and MUST use indexing? Theories are
> great, but solid use-cases are better - ideally, examples from
> actual production code (actual code optional).

Many of my string-indexing uses revolve around a sliding window which
can be done with itertools[1], though I often just roll it as
something like

  n = 3
  for i in range(1 + len(s) - n):
    do_something(s[i:i+n])

So that could be supplanted by the SO iterator linked below.

The other use big case I have from production code involves a
column-offset delimited file where the headers have a row of
underscores under them delimiting the field widths, so it looks
something like

  EmpID     Name                Cost Center
  --------- ------------------- -----------------------------
  314159    Longstocking, Pippi RJ45
  265358    Davis, Miles        JA22
  979328    Bell, Alexander     RJ15

I then take row 2 and use it to make a mapping of header-name to a
slice-object for slicing the subsequent strings:

  import re
  r = re.compile('-+') # a sequence of 1+ dashes
  f = file("data.txt")
  headers = next(f)
  lines = next(f)
  header_map = dict((
      headers[i.start():i.end()].strip().upper(),
      slice(i.start(), i.end())
      )
    for i in r.finditer(lines)
    )
  for row in f:
    print("EmpID = %s" % row[header_map["EMPID"]].strip())
    print("Name = %s" % row[header_map["NAME"]].strip())
    # ...

which I presume uses string indexing under the hood.

Perhaps there's a better way of doing that, but it's what I currently
use to process these large-ish files (largest max out at 10-20MB each)

There might be other use-cases I've done, but those two leap to mind.

-tkc


[1]
http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python







More information about the Python-list mailing list