[New-bugs-announce] [issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

Thu Jan 22 05:40:26 CET 2015

New submission from Ben Finney:

In `tokenize.detect_encoding` is the following code::

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        # …

The `read_or_stop` function is defined as::

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return b''

So, on catching ``StopIteration``, the return value will be a byte string. The `detect_encoding` code then immediately calls `startswith`, which fails::

    File "/usr/lib/python3.4/tokenize.py", line 409, in detect_encoding
      if first.startswith(BOM_UTF8):
  TypeError: startswith first arg must be str or a tuple of str, not bytes

One or both of those locations in the code are wrong. Either `read_or_stop` should never return a byte string, or `detect_encoding` should not assume it can call `startswith` on the result.
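
A minimal reproduction sketch follows. It assumes the failure is triggered by a `readline` callable that returns text rather than bytes: the traceback message suggests `first` was a `str`, but the report does not show the caller. `io.StringIO` and `io.BytesIO` are stand-ins for the real input, and the sample source line is made up for illustration:

```python
import io
import tokenize

# Hypothetical one-line source, used only for this illustration.
source = "x = 1\n"

# Assumption: the caller passed a readline that returns str.
# detect_encoding then calls str.startswith with the bytes BOM_UTF8.
text_readline = io.StringIO(source).readline
try:
    tokenize.detect_encoding(text_readline)
except TypeError as exc:
    text_error = exc
else:
    text_error = None

# A readline that returns bytes takes the same code path without error.
bytes_readline = io.BytesIO(source.encode("utf-8")).readline
encoding, lines = tokenize.detect_encoding(bytes_readline)

# An empty byte stream yields b'' and is also handled.
empty_encoding, empty_lines = tokenize.detect_encoding(io.BytesIO(b"").readline)

print(type(text_error).__name__, encoding, lines, empty_encoding, empty_lines)
```

If that assumption holds, a `readline` that yields bytes (for example, from a file opened in binary mode) avoids the error, and the `b''` returned by `read_or_stop` on `StopIteration` is consistent with the rest of the function.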

----------
components: Library (Lib)
messages: 234471
nosy: bignose
priority: normal
severity: normal
status: open
title: ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string
type: crash
versions: Python 3.4

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue23297>
_______________________________________