[issue34979] Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
Lu jaymin
report at bugs.python.org
Sun Oct 14 07:12:29 EDT 2018
Lu jaymin <ljm51689 at gmail.com> added the comment:
I think these two issues are the same, and the following is a patch I wrote; I hope it helps.
```
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index 1af27bf..ba6fb3a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -617,32 +617,21 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             if (!check_coding_spec(line, strlen(line), tok, fp_setreadl)) {
                 return error_ret(tok);
             }
-    }
-#ifndef PGEN
-    /* The default encoding is UTF-8, so make sure we don't have any
-       non-UTF-8 sequences in it. */
-    if (line && !tok->encoding) {
-        unsigned char *c;
-        int length;
-        printf("[DEBUG] - [decoding_fgets]: line = %s\n", line);
-        for (c = (unsigned char *)line; *c; c += length)
-            if (!(length = valid_utf8(c))) {
-                badchar = *c;
-                break;
+        if (!tok->encoding) {
+            char *cs = new_string("utf-8", 5, tok);
+            int r = fp_setreadl(tok, cs);
+            if (r) {
+                tok->encoding = cs;
+                tok->decoding_state = STATE_NORMAL;
+            } else {
+                PyErr_Format(PyExc_SyntaxError,
+                    "You did not declare the file encoding at the top of the file, "
+                    "and the file is not encoded in UTF-8; "
+                    "see http://python.org/dev/peps/pep-0263/ for details.");
+                PyMem_FREE(cs);
             }
+        }
     }
-    if (badchar) {
-        /* Need to add 1 to the line number, since this line
-           has not been counted, yet. */
-        PyErr_Format(PyExc_SyntaxError,
-                     "Non-UTF-8 code starting with '\\x%.2x' "
-                     "in file %U on line %i, "
-                     "but no encoding declared; "
-                     "see http://python.org/dev/peps/pep-0263/ for details",
-                     badchar, tok->filename, tok->lineno + 1);
-        return error_ret(tok);
-    }
-#endif
     return line;
 }
```
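For context, here is a standalone illustration of mine (not part of the patch, and not CPython internals) of why the misleading "Non-UTF-8 code" error can appear for a perfectly valid file: a fixed-size buffer read may cut a multi-byte UTF-8 sequence in the middle, and the truncated tail then looks like invalid UTF-8 to a byte-level validator.

```python
# Illustration: cutting a UTF-8 byte stream at an arbitrary byte
# boundary makes valid text look invalid.
text = '测试' * 200               # each character encodes to 3 bytes
data = text.encode('utf-8')      # 1200 bytes of valid UTF-8
chunk = data[:1024]              # simulate a 1024-byte tokenizer buffer

# The whole stream decodes fine...
assert data.decode('utf-8') == text

# ...but 1024 is not a multiple of 3, so the chunk ends mid-character
# and fails to decode.
try:
    chunk.decode('utf-8')
except UnicodeDecodeError as exc:
    print('truncated chunk fails:', exc.reason)
```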
By the way, my platform is macOS Mojave (10.14).
Karthikeyan Singaravelan <report at bugs.python.org> wrote on Sun, Oct 14, 2018 at 5:10 PM:
>
> Karthikeyan Singaravelan <tir.karthi at gmail.com> added the comment:
>
> Got it. Thanks for the details and patience. I tested with fewer
> characters and it seems to work fine, so using the encoding declaration
> at the top is not a good way to test the original issue, as you
> mentioned. Then I searched around and found issue14811, which has a test.
> It seems to be a very similar issue, and there is a patch that detects
> this scenario and throws a SyntaxError saying the line is longer than the
> internal buffer, instead of an encoding-related error. I applied the
> patch to master and it throws an error about the internal buffer length
> as expected. But the patch was never merged, and it seems Victor had
> another solution in mind as per msg167154. I tested with the patch as
> below:
>
> # master
>
> ➜ cpython git:(master) cat ../backups/bpo34979.py
>
> s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
> File "../backups/bpo34979.py", line 2
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file
> ../backups/bpo34979.py on line 2, but no encoding declared; see
> http://python.org/dev/peps/pep-0263/ for details
>
>
> # Applying the patch file from issue14811
>
> ➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
> File "../backups/bpo34979.py", line 2
> SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the
> internal buffer (1024)
>
> # Patch on master
>
> diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
> index fc75bae537..48b3ac0ee9 100644
> --- a/Parser/tokenizer.c
> +++ b/Parser/tokenizer.c
> @@ -586,6 +586,7 @@ static char *
>  decoding_fgets(char *s, int size, struct tok_state *tok)
>  {
>      char *line = NULL;
> +    size_t len;
>      int badchar = 0;
>      for (;;) {
>          if (tok->decoding_state == STATE_NORMAL) {
> @@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
>              /* We want a 'raw' read. */
>              line = Py_UniversalNewlineFgets(s, size, tok->fp, NULL);
> +            if (line != NULL) {
> +                len = strlen(line);
> +                if (1 < len && line[len-1] != '\n') {
> +                    PyErr_Format(PyExc_SyntaxError,
> +                        "Line %i of file %U is longer than the internal buffer (%i)",
> +                        tok->lineno + 1, tok->filename, size);
> +                    return error_ret(tok);
> +                }
> +            }
>              break;
>          } else {
>              /* We have not yet determined the encoding.
>
>
> If it's the same issue, then I think closing this one and moving the
> discussion there would be good, since that issue has a patch with a test
> and relevant discussion. Also, it seems BUFSIZ is platform-dependent, so
> adding your platform details would also help.
>
> TIL about the differences between Python 2 and 3 in handling
> Unicode-related files. Thanks again!
>
> ----------
>
> _______________________________________
> Python tracker <report at bugs.python.org>
> <https://bugs.python.org/issue34979>
> _______________________________________
>
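As a rough sketch of mine (not from either patch), the buffer-overrun check in the quoted issue14811 diff can be mimicked in pure Python: if a size-limited read comes back full without reaching a newline, the physical line did not fit the buffer. `checked_readline` and `BUFSIZE` here are hypothetical names for illustration only.

```python
import io

BUFSIZE = 1024  # stand-in for the tokenizer's internal buffer size

def checked_readline(f, bufsize=BUFSIZE):
    # Mimic the issue14811 check: a read that fills the buffer
    # without ending in a newline means the line is longer than
    # the buffer.
    line = f.readline(bufsize - 1)
    if len(line) == bufsize - 1 and not line.endswith('\n'):
        raise SyntaxError('line is longer than the internal buffer (%d)' % bufsize)
    return line

f = io.StringIO('short line\n' + 'x' * 2000 + '\n')
print(repr(checked_readline(f)))   # fits the buffer: returned unchanged
try:
    checked_readline(f)            # 2000 chars: rejected
except SyntaxError as exc:
    print('SyntaxError:', exc)
```

Note that the C patch's condition (`1 < len && line[len-1] != '\n'`) also fires on a short final line with no trailing newline; this sketch only flags reads that fill the buffer.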