Weird behaviour from shlex - line no

Sat Sep 28 10:52:34 EDT 2013

Dave Angel wrote:

> On 28/9/2013 02:26, Daniel Stojanov wrote:
> 
>> Can somebody explain this. The line number reported by shlex depends
>> on the previous token. I want to be able to tell if I have just popped
>> the last token on a line.
>>
> 
> I agree that it seems weird.  However, I don't think you have made
> clear why it's not what you (and I) expect.
> 
> import shlex
> 
> def parseit(string):
>     print
>     print "Parsing -", string
>     first = shlex.shlex(string)
>     token = "dummy"
>     while token:
>         token = first.get_token()
>         print token, " -- line", first.lineno
> 
> parseit("word1 word2\nword3")     #first
> parseit("word1 word2,\nword3")    #second
> parseit("word1 word2,word3\nword4")
> parseit("word1 word2+,?\nword3")
> 
> This will display the lineno attribute for every token.
> 
> shlex is documented at:
> 
> http://docs.python.org/2/library/shlex.html
> 
> And lineno is documented on that page as:
> 
> """shlex.lineno
> Source line number (count of newlines seen so far plus one).
> """
> 
> It's not at all clear what "seen so far" is intended to mean, but in
> practice, the line number is incremented for the last token on the
> line. Thus your first example
> 
> Parsing - word1 word2
> word3
> word1  -- line 1
> word2  -- line 2
> word3  -- line 2
>   -- line 2
> 
> word2 has the incremented line number.
> 
> But when the token is neither whitespace nor ASCII letters, then it
> doesn't increment lineno.  Thus second example:
> 
> Parsing - word1 word2,
> word3
> word1  -- line 1
> word2  -- line 1
> ,  -- line 1                      #we would expect this to be "line 2"
> word3  -- line 2
>   -- line 2
> 
> Anybody else have some explanation 

The explanation seems obvious: a word may be continued by the next character 
if that is in wordchars, so the parser has to look at that character. If it 
happens to be '\n' the lineno is immediately incremented. Non-wordchars are 
returned as single characters, so there is no need to peek ahead and the 
lineno is not altered.

In short: this looks like an implementation accident. 
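
To make the look-ahead visible, here is a minimal Python 3 sketch. The helper name lineno_after_each_token is made up for this illustration; it only uses shlex.shlex, get_token() and the lineno attribute, and it assumes the default non-POSIX mode, where get_token() returns "" at end of input.

```python
import shlex


def lineno_after_each_token(text):
    """Return (token, lineno) pairs, reading lineno right after each get_token()."""
    lexer = shlex.shlex(text)
    pairs = []
    while True:
        token = lexer.get_token()
        if not token:  # "" marks end of input in the default non-POSIX mode
            break
        pairs.append((token, lexer.lineno))
    return pairs


# A word followed directly by "\n": the newline is consumed while the lexer
# checks whether the word continues, so lineno is already 2 for "word2".
print(lineno_after_each_token("word1 word2\nword3"))

# A "," before the newline: neither "word2" nor "," needs the newline to be read,
# so both are still reported on line 1.
print(lineno_after_each_token("word1 word2,\nword3"))
```

This should reproduce the line numbers in the two quoted outputs: the increment happens when the newline is read, not when the token containing it is returned.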

OP: I don't see a use case for the current behaviour -- I suggest that you 
file a bug report.

> or advice for Daniel, other than
> preprocessing the string by stripping any non letters off the end of the
> line?

The following gives each token's starting line for your examples:

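import shlex

def shlexiter(s):
    p = shlex.shlex(s)
    # Stop treating "\n" as whitespace so it is returned as its own token.
    p.whitespace = p.whitespace.replace("\n", "")
    while True:
        # Sample lineno before reading, i.e. the line the next token starts on.
        lineno = p.lineno
        token = p.get_token()
        if not token:
            break
        if token == "\n":
            continue
        yield lineno, token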

def parseit(string):
    print("Parsing - {!r}".format(string))
    for lineno, token in shlexiter(string):
        print("{:3} {!r}".format(lineno, token))
    print("")

but I have no idea about the implications for more complex input.
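
For the original question (telling whether the token just popped was the last one on its line), the same idea can be extended by grouping tokens on their starting line. This is only a sketch under assumptions: the names tokens_with_start_line and tokens_by_line are hypothetical, it is written for Python 3, and it assumes input without comments or quoted strings that span lines, which I have not examined.

```python
import shlex
from itertools import groupby


def tokens_with_start_line(text):
    """Yield (lineno, token) pairs, where lineno is the line the token starts on."""
    lexer = shlex.shlex(text)
    # Treat "\n" as an ordinary character so it comes back as its own token.
    lexer.whitespace = lexer.whitespace.replace("\n", "")
    while True:
        lineno = lexer.lineno  # sample before reading the next token
        token = lexer.get_token()
        if not token:  # "" signals end of input in non-POSIX mode
            break
        if token != "\n":
            yield lineno, token


def tokens_by_line(text):
    """Return a list of (lineno, [tokens]) for every line that contains tokens."""
    pairs = tokens_with_start_line(text)
    return [
        (lineno, [token for _, token in group])
        for lineno, group in groupby(pairs, key=lambda pair: pair[0])
    ]


for lineno, tokens in tokens_by_line("word1 word2,\nword3"):
    # The last element of each list is the last token on that line.
    print(lineno, tokens, "last:", tokens[-1])
```

Collecting whole lines first avoids relying on when exactly lineno changes; the last token on a line is simply the final element of that line's list.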