[Python-checkins] r54907 - in python/branches/release25-maint: Lib/encodings/utf_8_sig.py Lib/test/test_codecs.py Misc/NEWS

walter.doerwald python-checkins at python.org
Sat Apr 21 12:31:45 CEST 2007


Author: walter.doerwald
Date: Sat Apr 21 12:31:43 2007
New Revision: 54907

Modified:
   python/branches/release25-maint/Lib/encodings/utf_8_sig.py
   python/branches/release25-maint/Lib/test/test_codecs.py
   python/branches/release25-maint/Misc/NEWS
Log:
Backport r54786:
Fix the utf-8-sig incremental decoder, which didn't recognise a BOM when the
first chunk fed to the decoder started with a BOM but was longer than 3 bytes.
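
The behaviour described in the log can be illustrated with a small sketch. It is written in Python 3 syntax (an assumption; the patched branch is Python 2.5), and the helper name decode_in_chunks is hypothetical. It feeds the same BOM-prefixed bytes to the decoder as one chunk and as two chunks. Reading the diff below, the pre-patch code would have decoded the single-chunk case with the BOM left in the output as U+FEFF, since the old startswith check could only match inputs of 3 bytes or fewer.

```python
import codecs

def decode_in_chunks(chunks, encoding="utf-8-sig"):
    """Feed byte chunks to a fresh incremental decoder and join the results."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Flush anything still buffered inside the decoder.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

data = codecs.BOM_UTF8 + "spam".encode("utf-8")

# BOM and payload arrive together in one chunk longer than 3 bytes
# (the case this revision addresses).
whole = decode_in_chunks([data])

# BOM is split across two calls (the "try again on the next call" branch).
split = decode_in_chunks([data[:2], data[2:]])

# No BOM at all: the input is decoded as ordinary UTF-8.
plain = decode_in_chunks([b"spam"])
```

With a fixed decoder, all three results should be the same string with no leading U+FEFF.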


Modified: python/branches/release25-maint/Lib/encodings/utf_8_sig.py
==============================================================================
--- python/branches/release25-maint/Lib/encodings/utf_8_sig.py	(original)
+++ python/branches/release25-maint/Lib/encodings/utf_8_sig.py	Sat Apr 21 12:31:43 2007
@@ -44,14 +44,19 @@
         self.first = True
 
     def _buffer_decode(self, input, errors, final):
-        if self.first and codecs.BOM_UTF8.startswith(input): # might be a BOM
+        if self.first:
             if len(input) < 3:
-                # not enough data to decide if this really is a BOM
-                # => try again on the next call
-                return (u"", 0)
-            (output, consumed) = codecs.utf_8_decode(input[3:], errors, final)
-            self.first = False
-            return (output, consumed+3)
+                if codecs.BOM_UTF8.startswith(input):
+                    # not enough data to decide if this really is a BOM
+                    # => try again on the next call
+                    return (u"", 0)
+                else:
+                    self.first = None
+            else:
+                self.first = None
+                if input[:3] == codecs.BOM_UTF8:
+                    (output, consumed) = codecs.utf_8_decode(input[3:], errors, final)
+                    return (output, consumed+3)
         return codecs.utf_8_decode(input, errors, final)
 
     def reset(self):

Modified: python/branches/release25-maint/Lib/test/test_codecs.py
==============================================================================
--- python/branches/release25-maint/Lib/test/test_codecs.py	(original)
+++ python/branches/release25-maint/Lib/test/test_codecs.py	Sat Apr 21 12:31:43 2007
@@ -430,6 +430,11 @@
         # SF bug #1601501: check that the codec works with a buffer
         unicode("\xef\xbb\xbf", "utf-8-sig")
 
+    def test_bom(self):
+        d = codecs.getincrementaldecoder("utf-8-sig")()
+        s = u"spam"
+        self.assertEqual(d.decode(s.encode("utf-8-sig")), s)
+
 class EscapeDecodeTest(unittest.TestCase):
     def test_empty(self):
         self.assertEquals(codecs.escape_decode(""), ("", 0))

Modified: python/branches/release25-maint/Misc/NEWS
==============================================================================
--- python/branches/release25-maint/Misc/NEWS	(original)
+++ python/branches/release25-maint/Misc/NEWS	Sat Apr 21 12:31:43 2007
@@ -602,6 +602,8 @@
 - Fix bsddb test_basics.test06_Transactions to check the version
   number properly.
 
+- Fix utf-8-sig incremental decoder, which didn't recognise a BOM when the
+  first chunk fed to the decoder started with a BOM, but was longer than 3 bytes.
 
 Documentation
 -------------
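
For readers following the utf_8_sig.py hunk above, the new branch structure can be read as a three-way decision on the first bytes the decoder sees. The sketch below restates that decision as a standalone function in Python 3 syntax. The function name classify_first_chunk and the labels "wait", "strip" and "pass" are hypothetical and do not appear in the patch; this is an illustration of the logic, not the code that was committed.

```python
import codecs

def classify_first_chunk(data):
    """Decide what to do with the first bytes seen by the decoder.

    Returns one of three hypothetical labels:
      "wait"  - fewer than 3 bytes that could still be the start of a BOM
      "strip" - input starts with a full BOM, which should be skipped
      "pass"  - no BOM, decode the input as plain UTF-8
    """
    if len(data) < 3:
        # Too short to hold a full BOM; it may still be a prefix of one.
        if codecs.BOM_UTF8.startswith(data):
            return "wait"
        return "pass"
    # Three or more bytes: compare only the first three against the BOM.
    if data[:3] == codecs.BOM_UTF8:
        return "strip"
    return "pass"
```

The key difference from the removed line is that the BOM comparison uses only the first three bytes of the input, so a longer first chunk no longer defeats the check.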

