
classification
Title: sgmllib should allow angle brackets in quoted values
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, barnabas79, fdrake, haepal, nnorwitz, rubys
Priority: normal
Keywords: patch

Created on 2006-06-11 12:58 by rubys, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
sgmllib.patch | rubys, 2006-06-11 20:23 | better patch
sgmllib_2008-03-08.patch | barnabas79, 2008-03-09 05:01 | patch to allow angle brackets, newlines in quoted attributes
Messages (7)
msg28773 - Author: Sam Ruby (rubys) Date: 2006-06-11 12:58
Real-world example (search the page for "other<br />corrections"):

http://latticeqcd.blogspot.com/2006/05/non-relativistic-qcd.html

The patch addresses the following comment in sgmllib.py:

# XXX The following should skip matching quotes (' or ")
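
To make the report concrete, here is a minimal sketch of the failure. It assumes the tag-end pattern is the `endbracket = re.compile('[<>]')` visible in the patch context later in this thread; the sample tag is hypothetical, modelled on the "other<br />corrections" text rather than copied from the linked page.

```python
import re

# Pattern used to find the end of a start tag
# (taken from the patch context quoted later in this thread).
endbracket = re.compile('[<>]')

# Hypothetical tag with angle brackets inside a quoted attribute value.
tag = '<a title="other<br />corrections">'

# Search from just after the opening '<'.
match = endbracket.search(tag, 1)
first_bracket = match.start(0)

# The first bracket found lies inside the quoted value,
# well before the '>' that actually closes the tag.
real_end = len(tag) - 1
print(first_bracket, real_end, repr(tag[first_bracket]))
```

Because the pattern knows nothing about quotes, the tag appears to end in the middle of the attribute value.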
msg28774 - Author: Fred Drake (fdrake) (Python committer) Date: 2006-06-29 17:17

I checked in a modified version of this patch: changed to
use separate REs for start and end tags to reduce matching
cost for end tags; extended tests; updated to avoid breaking
previous changes to support IPv6 addresses in unquoted
attribute values.

Committed as revisions 47154 (trunk) and 47155
(release24-maint).
msg28775 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-09-11 04:26

I reverted the patch and added the test case for sgml so the
infinite loop doesn't recur.  This was mentioned several
times on python-dev.

Committed revision 51854. (head)
Committed revision 51850. (2.5)
Committed revision 51853. (2.4)
msg28776 - Author: Haejoong Lee (haepal) Date: 2007-01-11 18:01
Could someone check if the following patch fixes the problem?
This patch was made against revision 51854.

--- sgmllib.py.org	2006-11-06 02:31:12.000000000 -0500
+++ sgmllib.py	2007-01-11 12:39:30.000000000 -0500
@@ -16,6 +16,35 @@
 
 # Regular expressions used for parsing
 
+class MyMatch:
+    def __init__(self, i):
+        self._i = i
+    def start(self, i):
+        return self._i
+    
+class EndBracket:
+    def search(self, data, index):
+        s = data[index:]
+        bs = None
+        quote = None
+        for i,c in enumerate(s):
+            if bs:
+                bs = False
+            else:
+                if c == '<' or c == '>':
+                    if quote is None:
+                        break
+                elif c == "'" or c == '"':
+                    if c == quote:
+                        quote = None
+                    else:
+                        quote = c
+                elif c == '\\':
+                    bs = True
+        else:
+            return None
+        return MyMatch(i+index)
+        
 interesting = re.compile('[&<]')
 incomplete = re.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|'
                            '<([a-zA-Z][^<>]*|'
@@ -29,7 +58,8 @@
 shorttagopen = re.compile('<[a-zA-Z][-.a-zA-Z0-9]*/')
 shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')
 piclose = re.compile('>')
-endbracket = re.compile('[<>]')
+#endbracket = re.compile('[<>]')
+endbracket = EndBracket()
 tagfind = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')
 attrfind = re.compile(
     r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
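
For readers who want to try the idea outside sgmllib, the following is a standalone sketch of a quote-aware bracket search. It is not the patch above and not what was committed: the function name `find_tag_end` is hypothetical, backslash escapes are not treated specially, and, unlike the patch, a different quote character inside a quoted value does not change the quoting state.

```python
def find_tag_end(data, index=0):
    """Return the index of the first '<' or '>' at or after `index`
    that is not inside a quoted string, or -1 if there is none."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(index, len(data)):
        c = data[i]
        if quote is not None:
            # Inside quotes: only the matching quote character ends the value.
            if c == quote:
                quote = None
        elif c in ('"', "'"):
            quote = c
        elif c in '<>':
            return i
    return -1


tag = '<a title="other<br />corrections">text'
print(find_tag_end(tag, 1))
```

One limitation to keep in mind: a stray quote character in an unquoted value is treated as opening a quoted string, so the sketch reports no end for such input.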
msg28777 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-01-12 06:04
You should be able to check yourself.  Use the current version of Python, apply the test case from the original patch and your patch to the code.  If the test passes, I'll be happy to check in the fix.  If that does work, please create a new patch with your code and the test case from the original patch.
msg63409 - Author: Paul Molodowitch (barnabas79) Date: 2008-03-09 05:01
Patch for sgmllib.py (and test_sgmllib.py).

Correctly parses quoted attributes, allowing brackets, newlines, etc.
within attribute values. This is implemented by altering the loop in
parse_starttag that finds attributes, so that it checks for open-ended
quotes and makes sure any closing bracket it finds is not within quotes.

In test_sgmllib, I added the test case from the original patch and
re-enabled two other test cases, both of which now pass.
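
The approach described in this message can be sketched roughly as follows. This is an illustration only, not the contents of the attached patch: `TAG_NAME`, `ATTR`, `TAG_CLOSE`, and `parse_start_tag` are hypothetical names, and the patterns are simplified assumptions rather than sgmllib's own.

```python
import re

# Hypothetical patterns for illustration; not sgmllib's own.
TAG_NAME = re.compile(r'<([a-zA-Z][-_.a-zA-Z0-9]*)')
ATTR = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)'       # attribute name
    r'(?:\s*=\s*'
    r'("[^"]*"|\'[^\']*\'|[^\s>"\']+))?'     # quoted or bare value
)
TAG_CLOSE = re.compile(r'\s*/?\s*>')


def parse_start_tag(data, i=0):
    """Parse the start tag beginning at data[i].

    Returns (tag, attrs, end), where end is the index just past the
    closing '>', or None if the tag is malformed or incomplete
    (for example, a quote that is never closed).
    """
    m = TAG_NAME.match(data, i)
    if m is None:
        return None
    tag = m.group(1).lower()
    k = m.end()
    attrs = []
    while True:
        m = ATTR.match(data, k)
        if m is None:
            break
        name, value = m.group(1), m.group(2)
        if value is not None and value[:1] in ('"', "'"):
            value = value[1:-1]  # strip the surrounding quotes
        attrs.append((name.lower(), value))
        k = m.end()
    # Only a '>' found after all attributes have been consumed ends the tag.
    close = TAG_CLOSE.match(data, k)
    if close is None:
        return None
    return tag, attrs, close.end()


data = '<a title="other<br />\ncorrections" href=x>rest'
print(parse_start_tag(data))
```

Quoted values are consumed whole, so brackets and newlines inside them never reach the check for the closing bracket, and an unterminated quote makes the function report an incomplete tag instead of guessing.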
msg114668 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:40
sgmllib has been deprecated since 2.6 and has been removed from py3k.
History
Date | User | Action | Args
2022-04-11 14:56:18 | admin | set | github: 43487
2010-08-22 10:40:37 | BreamoreBoy | set | status: open -> closed; nosy: + BreamoreBoy; messages: + msg114668; resolution: out of date
2010-07-29 14:02:30 | georg.brandl | link | issue745002 superseder
2008-03-09 05:01:25 | barnabas79 | set | files: + sgmllib_2008-03-08.patch; nosy: + barnabas79; messages: + msg63409; keywords: + patch
2006-06-11 12:58:36 | rubys | create |