[Python-Dev] Reading Python source file

Mon Nov 16 22:05:33 EST 2015

On 2015-11-17 01:53, Serhiy Storchaka wrote:
> I'm working on rewriting the Python tokenizer (in particular the part that
> reads and decodes a Python source file). The code is complicated. At
> present it handles these cases:
>
> * Reading from the string in memory.
> * Interactive reading from the file.
> * Reading from the file:
>     - Raw reading ignoring encoding in parser generator.
>     - Raw reading UTF-8 encoded file.
>     - Reading and recoding to UTF-8.
>
> The file is read line by line. That makes it hard to check the correctness
> of the first line if the encoding is specified in the second line. It also
> causes very hard problems with null bytes and with desynchronized buffered
> C and Python files. All these problems can be easily solved by reading the
> whole Python source file into memory and then parsing it as a string. This
> would allow dropping a large, complex and buggy part of the code.
>
> Are there disadvantages to this solution? As for memory consumption, the
> source text itself will consume only a small part of the memory consumed
> by the AST and other structures. As for performance, reading and
> decoding the whole file can be faster than reading it line by line.
>
> [1] http://bugs.python.org/issue25643
>
As I understand it, *nix expects the shebang to be b'#!', which means that
the first line should be ASCII-compatible (it's possible that the UTF-8 BOM
might be present). This kind of suggests that encodings like UTF-16 would
cause a problem on such systems.
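
A quick way to see the problem is to encode a shebang line in a few encodings
and check which byte strings still start with b'#!'. This is only an
illustration; starts_with_shebang is a made-up helper, and it says nothing
about how any particular kernel treats a leading BOM.

```python
import codecs

shebang = '#!/usr/bin/env python\n'

# The same line in three forms: plain UTF-8, UTF-8 with a BOM, and UTF-16.
utf8 = shebang.encode('utf-8')
utf8_bom = codecs.BOM_UTF8 + utf8
utf16 = shebang.encode('utf-16')


def starts_with_shebang(data):
    """Hypothetical helper: True if the raw bytes begin with b'#!'."""
    return data.startswith(b'#!')


for name, data in [('utf-8', utf8), ('utf-8 + BOM', utf8_bom), ('utf-16', utf16)]:
    print(name, starts_with_shebang(data), data[:6])
```

Only the plain UTF-8 form begins with b'#!'. The UTF-16 form starts with its
own BOM and then interleaves null bytes between the ASCII characters.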

The encoding line also needs to be ASCII-compatible.

I believe that the recent thread "Support of UTF-16 and UTF-32 source
encodings" also concluded that UTF-16 and UTF-32 shouldn't be supported.

This means that you could treat the first 2 lines as though they were some
kind of extended ASCII (Latin-1?), the line ending being '\n' or '\r' or
'\r\n'.

Once you've identified the encoding, you could decode everything (including
the shebang line) using that encoding.
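
Here is a rough sketch of that idea, assuming the whole file has already been
read into memory as bytes. The function names (first_lines,
detect_source_encoding, decode_source) are made up for illustration and are
not the tokenizer's real code; the regular expression is the one given in
PEP 263. It doesn't check for a BOM that conflicts with the declared
encoding, and it doesn't validate the encoding name.

```python
import codecs
import re

# Encoding declaration pattern from PEP 263.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')
# A line that is empty, whitespace only, or a comment.
BLANK_OR_COMMENT_RE = re.compile(r'^[ \t\f]*(?:#|$)')
# Accept '\r\n', '\r' or '\n' as the line ending.
LINE_END_RE = re.compile(rb'\r\n|\r|\n')


def first_lines(data, count=2):
    """Return up to `count` leading lines of `data`, without line endings."""
    lines = []
    pos = 0
    while len(lines) < count and pos < len(data):
        match = LINE_END_RE.search(data, pos)
        if match is None:
            lines.append(data[pos:])
            break
        lines.append(data[pos:match.start()])
        pos = match.end()
    return lines


def detect_source_encoding(data, default='utf-8'):
    """Look for an encoding declaration in the first two lines."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    for raw_line in first_lines(data):
        # Latin-1 maps every byte to a code point, so this never fails.
        line = raw_line.decode('latin-1')
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
        if not BLANK_OR_COMMENT_RE.match(line):
            # A first line containing code rules out the second line.
            break
    return default


def decode_source(data):
    """Decode the whole source, shebang line included."""
    encoding = detect_source_encoding(data)
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode(encoding)
```

For example, decode_source(open(path, 'rb').read()) would give the full text
of a file as a str, whichever of the three line endings it uses.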

(What should happen if the encoding line then decoded differently, i.e.
encoding_line.decode(encoding) != encoding_line.decode('latin-1')?)
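
The check itself is easy to write down, even if what to do on a mismatch is
still an open question. A minimal sketch, with encoding_line_is_stable as a
made-up name, and treating a line that can't be decoded at all as a mismatch
(that part is my assumption):

```python
def encoding_line_is_stable(encoding_line, encoding):
    """True if the line decodes identically as Latin-1 and as `encoding`."""
    try:
        return encoding_line.decode(encoding) == encoding_line.decode('latin-1')
    except UnicodeDecodeError:
        # Assumption: an undecodable encoding line counts as a mismatch.
        return False


print(encoding_line_is_stable(b'# -*- coding: utf-8 -*-', 'utf-8'))
print(encoding_line_is_stable(b'# coding: utf-16', 'utf-16'))
print(encoding_line_is_stable(b'# coding: cp037', 'cp037'))
```

Note that this strict comparison also rejects a line that has a non-ASCII
character in a trailing comment, even when the encoding name itself is plain
ASCII. Comparing only the extracted encoding name would be a looser
alternative.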