string processing - some problems whenever I have to parse a more complex string

Terry Reedy tjreedy at udel.edu
Tue Oct 21 18:03:41 EDT 2014


On 10/21/2014 10:32 AM, CWr wrote:
>
> Hello together,
>
> currently I have to parse a string in an atomic way. Normally - in this case too - I have a counter variable to keep the current position inside the string. So far, I think this is the most flexible solution for doing lookarounds inside the string if necessary. Subroutines are fed the underlying data and the current position. A subroutine returns a tuple of the new position and the result. But I would like to process subroutines with the same flexibility (slicing/lookaround) but without returning the new position every time.
>
> Is there any implementation like C++ StringPiece class?

I am going to guess that this is a string view class that encapsulates a 
piece of an underlying string.  Otherwise there is no point.

A view class depends on a primary, independently accessible class for 
its data.  There are two main categories.  A subview gives the primary 
class interface to a part of the primary data.  Numpy has array subviews, 
and I presume you are talking about string subviews here.  An altview 
class gives an alternative interface to the primary data.  Dict views 
are examples.

If the primary object is mutable, one reason to use a view instead of a 
copy is to keep the data for two objects synchronized.  This does not 
apply to strings.
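
As a small illustration of the synchronization point, here is a sketch 
using a dict keys view (an altview) next to a plain copy.  The variable 
names are only for this example.

```python
# A dict keys view reads the dict's live data; a list is a copy.
d = {'a': 1, 'b': 2}
keys = d.keys()        # view object, tied to d
snapshot = list(d)     # independent copy of the keys

d['c'] = 3             # mutate the primary object

view_now = sorted(keys)      # the view sees the new key
copy_now = sorted(snapshot)  # the copy does not
print(view_now)
print(copy_now)
```

Since a str can never change, a string view has nothing to stay 
synchronized with.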

Another reason is to save memory space.  The downside is that the 
primary data cannot be erased until *both* objects are deleted. 
Moreover, if the primary data is small or the subview data is a small 
fraction of the primary data, the memory saving is small.  So small 
subviews that persist after the primary object may end up costing more 
memory than they save.  This is one reason Python does not have string 
subviews.  The numpy array view use case is large subarrays of large 
arrays that have to persist through a calculation anyway.

Another reason Python lacks sequence subviews is that the extra data 
needed for a contiguous slice are only the start and stop indexes. 
These can easily be manipulated directly without wrapping them in a 
class.  And anyone who does want a method interface can easily create a 
class to their liking.
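
A minimal sketch of such a class follows.  The name StringSlice and the 
chop method are taken from the question; everything else is an 
assumption made for illustration, not an existing library.  It keeps 
only the primary string plus start and stop indexes, and it copies 
characters only when text is actually requested.

```python
class StringSlice:
    """Hypothetical subview: a string plus start/stop indexes."""

    def __init__(self, data, start=0, stop=None):
        self._data = data
        self._start = start
        self._stop = len(data) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __str__(self):
        # The only place the viewed text is copied out in full.
        return self._data[self._start:self._stop]

    def __repr__(self):
        return 'StringSlice(%r)' % str(self)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                return str(self)[key]  # rare case: copy, then slice
            # Translate view-relative indexes to primary-string indexes.
            return self._data[self._start + start:self._start + stop]
        index = key + len(self) if key < 0 else key
        if not 0 <= index < len(self):
            raise IndexError('StringSlice index out of range')
        return self._data[self._start + index]

    def chop(self, n):
        """Drop n items from the front (n > 0) or -n from the end (n < 0)."""
        if abs(n) > len(self):
            raise ValueError('cannot chop more than the view holds')
        if n >= 0:
            self._start += n
        else:
            self._stop += n

    def startswith(self, prefix):
        # str.startswith accepts start/end, so no copy is needed.
        return self._data.startswith(prefix, self._start, self._stop)

    def isdigit(self):
        return str(self).isdigit()


s = StringSlice('abcdef')
s.chop(1)     # view is now 'bcdef'
s.chop(-1)    # view is now 'bcde'
print(s[0], s[-1], s[1:])
```

A subroutine that receives such an object can call chop itself, so the 
caller does not need a new position returned.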

To answer your question, I tried
https://pypi.python.org/pypi?%3Aaction=search&term=string+view&submit=search

and did not find anything.  'view' matches the generic use of 'view', as 
well as 'views', 'viewed', 'viewer', 'review', and 'preview'.

The third answer here
https://stackoverflow.com/questions/10085568/slices-to-immutable-strings-by-reference-and-not-copy
has a StringView class that could be modified to work on 3.x by removing 
the unneeded use of buffer.

> Or something like the following behavior:

>>>> s = StringSlice('abcdef')

s = 'abcdef'
a, b = 0, len(s)  # s start, s end

>>>> s
> StringSlice('abcdef') at xxx
>>>> s[0]

s[a]

> 'a'
>>>> s.chop(1) # chop the first item
>>>> s[0] # 'b' is the new first item

a += 1
s[a]

> 'b'
>>>> s[:2]

s[a:a+2]

> 'bc'
>>>> s.chop(-1) # chop the last item
>>>> s[-1]

b -= 1
s[b-1]

> 'e'
>>>> s[1:]

s[a+1:b]

> 'cde'
>>>> while s[0] != 'e':
>         s.chop(1)
>>>> s[0]

while s[a] != 'e':
     a += 1
s[a]

> 'e'
>>>> s.startswith('e')

s[a:b].startswith('e')

> True
>>>> s.isdigit()

s[a:b].isdigit()

> False
>
> Subroutines could chop the number of processed items internally if no error occurs.
>
> Another possibility would be to chop the current item manually. But I don't know how efficient this is in case of large strings.
>
>>>> while string:
>         c = string[0]
>         # process it ...
>         string = string[1:]

This is extremely bad as it replaces the O(n) processing (below) with 
O(n*n) processing.  In general, the right way to linearly process any 
iterable is

for item in iterable:
   process(item)

or sometimes

for index, item in enumerate(iterable):
   process(index, item)

or even, for sequences, (but not when the first option above suffices)

for index in range(len(sequence)):
   process(index, sequence)
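
To make the O(n*n) versus O(n) difference concrete, here is a small 
sketch that counts the work done each way.  The function names are 
made up for this example, and it assumes, as is the case in CPython, 
that string[1:] builds a new string holding the remaining characters.

```python
def chars_copied_by_chopping(n):
    """Count characters copied by repeated string = string[1:]."""
    string = 'x' * n
    copied = 0
    while string:
        string = string[1:]      # new string, one character shorter
        copied += len(string)    # that many characters were copied
    return copied


def chars_visited_by_iterating(n):
    """Count characters touched by a plain for loop."""
    visited = 0
    for item in 'x' * n:
        visited += 1
    return visited


print(chars_copied_by_chopping(1000))    # 499500, i.e. n*(n-1)/2
print(chars_visited_by_iterating(1000))  # 1000
```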

-- 
Terry Jan Reedy



