[Python-Dev] What does a double coding cookie mean?

Guido van Rossum guido at python.org
Thu Mar 17 15:11:04 EDT 2016


On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> On 17.03.16 16:55, Guido van Rossum wrote:
>>
>> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com>
>> wrote:
>>>>
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>>
>>> Likely. However the interface of tokenize.detect_encoding() is not very
>>> simple.
>>
>>
>> I just found that out yesterday. You have to give it a readline()
>> function, which is cumbersome if all you have is a (byte) string and
>> you don't want to split it into lines just yet. And detect_encoding()
>> raises SyntaxError when the encoding isn't right. I wish there were a
>> lower-level helper that just took a line and told you what the
>> encoding in it was, if any. Then the rest of the logic could be
>> handled by the caller (including the logic of trying up to two lines).
>
>
> The simplest way to detect the encoding of a bytes string:
>
>     lines = data.splitlines()
>     encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

This will raise SyntaxError if the encoding is unknown. In mypy's case
that needs to be caught, and then the line number needs to be recovered
from the exception. I tried this and it was too painful, so now I've
just changed the regex that mypy uses to use non-greedy matching
(https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9fe5).
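
A minimal sketch of that catch-and-fall-back logic, assuming a UTF-8
default is acceptable (the helper name here is made up):

    import io
    import tokenize

    def detect_encoding_or_default(data, default='utf-8'):
        # detect_encoding() raises SyntaxError for a bogus cookie,
        # and recovering the offending line number from the
        # exception is awkward, hence the blunt fallback here.
        try:
            return tokenize.detect_encoding(io.BytesIO(data).readline)[0]
        except SyntaxError:
            return default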

> If you don't want to split all of the data into lines, the most
> efficient way in Python 3.5 is:
>
>     encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
>
> In Python 3.5, io.BytesIO(data) has constant complexity (it no longer
> copies the initial data).

Ditto with the SyntaxError though.

> In older versions, to detect the encoding without copying the data or
> splitting it all into lines, you have to write a line iterator. For
> example:
>
>     def iterlines(data):
>         start = 0
>         while True:
>             # find() returns -1 when no newline is left, so
>             # end == 0 signals the final, unterminated fragment.
>             end = data.find(b'\n', start) + 1
>             if not end:
>                 break
>             yield data[start:end]
>             start = end
>         yield data[start:]
>
>     encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]
>
> or
>
>     it = (m.group() for m in re.finditer(b'.*\n?', data))
>     encoding = tokenize.detect_encoding(it.__next__)[0]
>
> I don't know what approach is more efficient.
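
One way to compare would be a micro-benchmark along these lines
(illustrative only; detect_encoding() reads at most two lines, so this
mostly measures per-call overhead, not scaling with the data size):

    import re
    import timeit
    import tokenize

    data = b'# -*- coding: latin-1 -*-\n' + b'x = 1\n' * 10000

    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]

    def with_iterlines():
        return tokenize.detect_encoding(iterlines(data).__next__)[0]

    def with_finditer():
        it = (m.group() for m in re.finditer(b'.*\n?', data))
        return tokenize.detect_encoding(it.__next__)[0]

    print('iterlines:', timeit.timeit(with_iterlines, number=10000))
    print('finditer: ', timeit.timeit(with_finditer, number=10000))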

Having my own regex was simpler. :-(
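
For reference, the kind of lower-level, single-line helper wished for
above could be as small as the following sketch (hypothetical code, not
mypy's actual regex; the non-greedy '.*?' means the first cookie on a
line is the one that counts, which is what matters for the double-cookie
case in the subject line):

    import re

    # PEP 263 cookie, with a non-greedy '.*?' so the *first*
    # 'coding[:=]' on the line is matched rather than the last.
    COOKIE_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def coding_cookie(line):
        # Return the encoding declared on one (bytes) line, or None.
        # e.g. coding_cookie(b'# coding: latin-1 # coding: utf-8\n')
        # returns 'latin-1'.
        m = COOKIE_RE.match(line)
        return m.group(1).decode('ascii') if m else None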

-- 
--Guido van Rossum (python.org/~guido)

