[issue34979] Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file

Lu jaymin report at bugs.python.org
Sat Oct 13 22:01:39 EDT 2018


New submission from Lu jaymin <ljm51689 at gmail.com>:

```
# demo.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
```
The file above is for testing. Its encoding is UTF-8, and the length of `s` is 1020 bytes (3 * 340).

When I execute `python3 demo.py` in a terminal, Python throws the following error:

```
$ python3 -V
Python 3.6.4

$ python3 demo.py
  File "demo.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

After reading the Python-3.6.6 source code, I found that this error is raised at about line 630 (the bottom of the function `decoding_fgets`) of the file `cpython/Parser/tokenizer.c`.

When Python executes xxx.py, it calls the function `decoding_fgets` to read one line of raw bytes from the file and save them to a buffer. The initial length of the buffer is 1024 bytes. `decoding_fgets` uses the function `valid_utf8` to check the encoding of the raw bytes.

If the line of raw bytes is too long (for example, longer than 1023 bytes), Python calls `decoding_fgets` multiple times and increases the buffer's size by 1024 bytes each time. So the raw bytes read by a single call to `decoding_fgets` may be incomplete: for example, they may end with only part of the bytes of a character, which causes `valid_utf8` to fail.
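The split described above can be reproduced in pure Python. This is only a sketch, not CPython's C code: it assumes the line holds 340 characters as stated in the report, and that a 1024-byte buffer yields 1023 bytes of data per read (one byte reserved for the terminating NUL). The helper `is_valid_utf8` is a hypothetical stand-in for the C function `valid_utf8`.

```python
# Rebuild line 2 of demo.py as described in the report:
# 340 characters, each 3 bytes long in UTF-8.
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")

# Assumption: a 1024-byte buffer holds 1023 data bytes plus a NUL.
CHUNK = 1023

first, rest = line[:CHUNK], line[CHUNK:]


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for valid_utf8: True if data decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(line))             # total bytes in the line
print(first[-1:])            # last byte of the first chunk
print(is_valid_utf8(line))   # the whole line
print(is_valid_utf8(first))  # the first chunk on its own
```

Under these assumptions the first chunk ends with the lone lead byte `\xe8` of a three-byte character, which would be consistent with the byte named in the error message, while the whole line is valid UTF-8.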

I suggest that we always use `fp_readl` to read source code from the file.
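To illustrate why a decoding reader avoids the problem, here is a minimal sketch using Python's `codecs` incremental decoder. It is not how `fp_readl` is implemented in C; it only shows that a decoder which keeps an incomplete trailing sequence until the next chunk arrives is not affected by where the chunk boundary falls. The 1023-byte chunk size and the 340-character line are assumptions taken from the description above.

```python
import codecs

# Same line as in demo.py (assumed 340 characters, 3 bytes each).
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")

CHUNK = 1023  # assumption: data bytes delivered per read

# The incremental decoder buffers an incomplete multi-byte sequence
# at the end of a chunk and completes it with the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()

pieces = []
for start in range(0, len(line), CHUNK):
    pieces.append(decoder.decode(line[start:start + CHUNK]))
# final=True raises if any bytes are still pending at end of input.
pieces.append(decoder.decode(b"", final=True))

text = "".join(pieces)
print(text == line.decode("utf-8"))
```

Validating each chunk in isolation fails at the boundary, whereas decoding the chunks as one stream recovers the original text.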

----------
components: Interpreter Core
messages: 327686
nosy: Lu jaymin
priority: normal
severity: normal
status: open
title: Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file
type: behavior
versions: Python 3.6

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34979>
_______________________________________

