BeautifulSoup vs. loose & chars

Frederic Rentsch anthra.norell at vtxmail.ch
Tue Dec 26 18:30:51 EST 2006


John Nagle wrote:
> Felipe Almeida Lessa wrote:
>   
>> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
>>
>>     
>>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>>> to replace it try the following;
>>>       
>> I think he wants to replace them, but just the invalid ones. I.e.,
>>
>> This &amp; this & that
>>
>> would become
>>
>> This &amp; this &amp; that
>>
>>
>> No, i don't know how to do this efficiently. =/...
>> I think some kind of regex could do it.
>>     
>
>     Yes, and the appropriate one is:
>
> 	krefindamp = re.compile(r'&(?!(\w|#)+;)')
> 	...
> 	xmlsection = re.sub(krefindamp,'&amp;',xmlsection)
>
> This will replace an '&' with '&amp;' if the '&' isn't
> immediately followed by some combination of letters, numbers,
> and '#' ending with a ';'.  Admittedly this would let something
> like '&xx#2;', which isn't a legal entity, through unmodified.
>
> There's still a potential problem with unknown entities in the output XML, but
> at least they're recognized as entities.
>
> 				John Nagle
>
>
>   
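The quoted regex can be tightened so that only entity-shaped references are left alone. The following is a minimal sketch, assuming Python 3 and the standard `re` module; the names `LOOSE_AMP` and `escape_loose_ampersands` are made up for illustration and are not part of the code quoted above. It still does not check names against a list of known entities, so something like '&foo;' passes through untouched.

```python
import re

# An '&' counts as "loose" unless it starts something shaped like a reference:
#   a name such as &amp;, a decimal code such as &#123;, or a hex code such as &#x7B;
# Assumption: names are ASCII letters followed by letters or digits.
LOOSE_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

def escape_loose_ampersands(text):
    """Replace every loose '&' with '&amp;' and leave entity-shaped text alone."""
    return LOOSE_AMP.sub('&amp;', text)

if __name__ == '__main__':
    print(escape_loose_ampersands('This &amp; this & that'))
    print(escape_loose_ampersands('&xx#2; is escaped, &#123; is not'))
```

Unlike the pattern with (\w|#)+, this one escapes the ampersand in '&xx#2;', because a '#' is only accepted directly after the '&'.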

Here's another idea:

 >>> s = '''<html> htm tag should not translate
           > & should be &amp;
           > &xx#2; isn't a legal entity and should translate
           > &#123; is a legal entity and should not translate
           </html>'''

 >>> import SE  # http://cheeseshop.python.org/pypi/SE/2.3
 >>> HTM_Escapes = SE.SE (definitions)  # See definitions below the dotted line
                                                                 
 >>> print HTM_Escapes (s)
<html> htm tag should not translate
           &gt; &amp; should be &amp;
           &gt; &amp;xx#2; isn&quot;t a legal entity and should translate
           &gt; &#123; is a legal entity and should not translate
           </html>

Regards

Frederic


------------------------------------------------------------------------------


definitions = '''

  # Do         # Don't do
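# " =&nbsp;"   &nbsp;==      #   32  20
  (34)=&dquot; &dquot;==     #   34  22
  &=&amp;      &amp;==       #   38  26
  '=&quot;     &quot;==      #   39  27
  <=&lt;       &lt;==        #   60  3c
  >=&gt;       &gt;==        #   62  3e
  ©=&copy;     &copy;==      #  169  a9
  ·=&middot;   &middot;==    #  183  b7
  »=&raquo;    &raquo;==     #  187  bb
  À=&Agrave;   &Agrave;==    #  192  c0
  Á=&Aacute;   &Aacute;==    #  193  c1
  Â=&Acirc;    &Acirc;==     #  194  c2
  Ã=&Atilde;   &Atilde;==    #  195  c3
  Ä=&Auml;     &Auml;==      #  196  c4
  Å=&Aring;    &Aring;==     #  197  c5
  Æ=&AElig;    &AElig;==     #  198  c6
  Ç=&Ccedil;   &Ccedil;==    #  199  c7
  È=&Egrave;   &Egrave;==    #  200  c8
  É=&Eacute;   &Eacute;==    #  201  c9
  Ê=&Ecirc;    &Ecirc;==     #  202  ca
  Ë=&Euml;     &Euml;==      #  203  cb
  Ì=&Igrave;   &Igrave;==    #  204  cc
  Í=&Iacute;   &Iacute;==    #  205  cd
  Î=&Icirc;    &Icirc;==     #  206  ce
  Ï=&Iuml;     &Iuml;==      #  207  cf
  Ð=&ETH;      &ETH;==       #  208  d0
  Ñ=&Ntilde;   &Ntilde;==    #  209  d1
  Ò=&Ograve;   &Ograve;==    #  210  d2
  Ó=&Oacute;   &Oacute;==    #  211  d3
  Ô=&Ocirc;    &Ocirc;==     #  212  d4
  Õ=&Otilde;   &Otilde;==    #  213  d5
  Ö=&Ouml;     &Ouml;==      #  214  d6
  Ø=&Oslash;   &Oslash;==    #  216  d8
  Ù=&Ugrave;   &Ugrave;==    #  217  d9
  Ú=&Uacute;   &Uacute;==    #  218  da
  Û=&Ucirc;    &Ucirc;==     #  219  db
  Ü=&Uuml;     &Uuml;==      #  220  dc
  Ý=&Yacute;   &Yacute;==    #  221  dd
  Þ=&THORN;    &THORN;==     #  222  de
  ß=&szlig;    &szlig;==     #  223  df
  à=&agrave;   &agrave;==    #  224  e0
  á=&aacute;   &aacute;==    #  225  e1
  â=&acirc;    &acirc;==     #  226  e2
  ã=&atilde;   &atilde;==    #  227  e3
  ä=&auml;     &auml;==      #  228  e4
  å=&aring;    &aring;==     #  229  e5
  æ=&aelig;    &aelig;==     #  230  e6
  ç=&ccedil;   &ccedil;==    #  231  e7
  è=&egrave;   &egrave;==    #  232  e8
  é=&eacute;   &eacute;==    #  233  e9
  ê=&ecirc;    &ecirc;==     #  234  ea
  ë=&euml;     &euml;==      #  235  eb
  ì=&igrave;   &igrave;==    #  236  ec
  í=&iacute;   &iacute;==    #  237  ed
  î=&icirc;    &icirc;==     #  238  ee
  ï=&iuml;     &iuml;==      #  239  ef
  ð=&eth;      &eth;==       #  240  f0
  ñ=&ntilde;   &ntilde;==    #  241  f1
  ò=&ograve;   &ograve;==    #  242  f2
  ó=&oacute;   &oacute;==    #  243  f3
  ô=&ocirc;    &ocirc;==     #  244  f4
  õ=&otilde;   &otilde;==    #  245  f5
  ö=&ouml;     &ouml;==      #  246  f6
  ø=&oslash;   &oslash;==    #  248  f8
  ù=&ugrave;   &ugrave;==    #  249  f9
  ú=&uacute;   &uacute;==    #  250  fa
  û=&ucirc;    &ucirc;==     #  251  fb
  ü=&uuml;     &uuml;==      #  252  fc
  ý=&yacute;   &yacute;==    #  253  fd
  þ=&thorn;    &thorn;==     #  254  fe
  (xff)=&yuml;               #  255  ff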
               &#==          #  All numeric codes
            "~<(.|\n)*?>~==" #  All HTM tags '''

If the ampersand is all you need to handle, you can erase the other
entries in the first column. You need to keep the second column, though,
because its entries keep the ampersand of an existing entity from being
escaped a second time. The exception is the last entry: the tags don't
need protection once '<' and '>' are gone from the first column.

Definitions are easily edited and can be kept in text files.
The SE constructor accepts a file name instead of a definitions string:

 >>> HTM_Escapes = SE.SE ('definition_file_name')
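For readers without SE, the two-column idea (translate some things, protect others) can be roughed out with the standard `re` module. This is only a sketch under stated assumptions: it is written for Python 3, it covers just '&', '<' and '>' in the "Do" column, and it says nothing about how SE works internally. The names `TRANSLATE`, `PROTECT_OR_TRANSLATE` and `escape_unprotected` are invented for this example.

```python
import re

# "Do" column: characters to translate (a small illustrative subset).
TRANSLATE = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}

# Group 1 is the "Don't do" column: spans copied through unchanged.
# Group 2 is any remaining character from the "Do" column.
# Assumption: only decimal numeric codes and the three named entities
# above are protected; extend the alternatives as needed.
PROTECT_OR_TRANSLATE = re.compile(
    r'(<(?:.|\n)*?>'        # tags, same pattern as the last definition
    r'|&#[0-9]+;'           # numeric codes such as &#123;
    r'|&(?:amp|lt|gt);)'    # entities that are already escaped
    r'|([&<>])'             # anything left over gets translated
)

def escape_unprotected(text):
    """Escape '&', '<' and '>' except inside protected spans."""
    def replace(match):
        if match.group(1) is not None:
            return match.group(1)          # protected: keep as is
        return TRANSLATE[match.group(2)]   # loose character: translate
    return PROTECT_OR_TRANSLATE.sub(replace, text)

if __name__ == '__main__':
    sample = "<html> tag\n> & should be &amp;\n> &xx#2; bad\n> &#123; ok\n</html>"
    print(escape_unprotected(sample))
```

Because the protected alternatives come first in the pattern, they win wherever both could match at the same position. As with the tag definition above, a stray '<' followed later by a '>' is taken for a tag and left alone.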

