[issue34979] Python throws "SyntaxError: Non-UTF-8 code starting with \xe8..." when parsing a source file

Karthikeyan Singaravelan report at bugs.python.org
Sun Oct 14 05:10:44 EDT 2018


Karthikeyan Singaravelan <tir.karthi at gmail.com> added the comment:

Got it. Thanks for the details and patience. I tested with fewer characters and it works fine, so adding an encoding declaration at the top is not a good way to reproduce the original issue, as you mentioned. Searching around, I found issue14811, which has a test and looks very similar: it includes a patch that detects this scenario and raises a SyntaxError saying the line is longer than the internal buffer, instead of an encoding-related error. I applied the patch to master and it raises the buffer-length error as expected. The patch was never merged, though; it seems Victor had another solution in mind, as per msg167154. I tested with the patch as below :

# master

➜  cpython git:(master) cat ../backups/bpo34979.py

s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
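My understanding of why the misleading error appears (a sketch, not verified against the C code): the raw `fgets` read is capped at the buffer size, so a long line can be cut in the middle of a multi-byte UTF-8 sequence, and the decoder then trips over the orphaned lead byte. This can be simulated in pure Python:

```python
# Simulate a 1024-byte read buffer truncating a long line like line 2 of
# the test file. The exact buffer handling in Parser/tokenizer.c may
# differ; this only illustrates the mid-sequence truncation.
line = ("s = '" + "测试" * 400 + "'").encode("utf-8")  # ~2400 bytes
truncated = line[:1023]  # fgets(s, 1024, ...) keeps one byte for the NUL

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # The cut falls inside a 3-byte sequence; here the orphaned lead
    # byte happens to be 0xe8, the byte named in the SyntaxError above.
    print(exc.reason, hex(truncated[exc.start]))
```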


# Applying the patch file from issue14811

➜  cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 2
SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)
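For context, a line of this shape fits the buffer if you count characters but not if you count UTF-8 bytes, which is presumably what the patched error is reporting (the repetition count here is approximate, not the exact count from the file above):

```python
# Roughly the shape of line 2 of the test file: each CJK character
# encodes to 3 bytes in UTF-8, so the byte length triples.
line = "s = '" + "测试" * 170 + "'"
print(len(line))                  # 346 characters -- fits in 1024
print(len(line.encode("utf-8")))  # 1026 bytes -- exceeds the 1024 buffer
```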

# Patch on master

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+            if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                            "Line %i of file %U is longer than the internal buffer (%i)",
+                                tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.


If it's the same issue, then I think it would be good to close this one and continue the discussion on issue14811, since that issue already has a patch with a test and the relevant discussion. Also, BUFSIZ seems to be platform dependent, so adding your platform details would help too.
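To collect the platform details mentioned above, something like this (a generic sketch, not an official reporting template) covers the basics:

```python
import platform
import sys

# Basic environment info, useful when a limit like BUFSIZ is platform dependent.
print(platform.platform())  # OS name and version
print(platform.machine())   # CPU architecture
print(sys.version)          # Python build details
```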

TIL about the differences between Python 2 and 3 in handling Unicode in source files. Thanks again!

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34979>
_______________________________________

