Question about working with html entities in python 2 to use them as filenames

dieter dieter at handshake.de
Thu Nov 24 02:38:39 EST 2016


Steven Truppe <steven.truppe at chello.at> writes:

> type= <type 'str'> title =  Wizo - Anderster Full Album - YouTube
> type= <type 'str'> title =  Wizo - Bleib Tapfer / für'n Arsch Full Album - YouTube
> Traceback (most recent call last):
>   File "./music-fetcher.py", line 39, in <module>
>     title = HTMLParser.HTMLParser().unescape(title)
>   File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
>     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
>   File "/usr/lib/python2.7/re.py", line 155, in sub
>     return _compile(pattern, flags).sub(repl, string, count)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

This looks like a bug with "HTMLParser" or a usage problem with its
"unescape" method.

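One plausible reading of the traceback, offered as an assumption rather
than a confirmed diagnosis: "title" is a byte string holding UTF-8 data
(0xc3 is the first byte of "ü" in UTF-8), while "unescape" substitutes
unicode characters, so Python 2 attempts an implicit ASCII decode and
fails. A minimal sketch of decoding first; the helper name
"unescape_title" and the UTF-8 default are my own choices, not part of
the original script:

```python
# -*- coding: utf-8 -*-
# Sketch only: "unescape_title" is a hypothetical helper, and UTF-8 is an
# assumed page encoding.
try:
    from HTMLParser import HTMLParser        # Python 2
    _unescape = HTMLParser().unescape
except ImportError:
    from html import unescape as _unescape   # Python 3.4+


def unescape_title(title, encoding="utf-8"):
    """Return the title as text with character references resolved."""
    # Decode byte strings first, so that "unescape" only ever sees text
    # and no implicit ASCII decoding is attempted.
    if isinstance(title, bytes):
        title = title.decode(encoding)
    return _unescape(title)
```

The result is text ("unicode" under Python 2); if it is later used as a
filename, it may need to be encoded again for the target filesystem.
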
I would use "lxml" in order to parse your HTML. It automatically converts
character references (like the above "&#39;") and handles special
characters (like "ü") adequately. Under Python 2, "lxml" returns text
data either as "str" (if the result is pure ASCII) or as "unicode"
(otherwise).
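
A minimal sketch of this approach, assuming "lxml" is installed and the
page is UTF-8 encoded; "page_title" is a hypothetical helper name, not
something from the original script. The encoding is passed explicitly so
that the parser does not have to guess it:

```python
# -*- coding: utf-8 -*-
# Sketch only: "page_title" is a hypothetical helper, and UTF-8 is an
# assumed page encoding.
import lxml.html


def page_title(page_bytes, encoding="utf-8"):
    """Return the text of the <title> element, or None if there is none."""
    parser = lxml.html.HTMLParser(encoding=encoding)
    root = lxml.html.fromstring(page_bytes, parser=parser)
    # Character references such as "&#39;" are already resolved in the
    # text that lxml hands back.
    return root.findtext(".//title")
```

Called with the raw bytes of the page, this returns the title with
"&#39;" already turned into an apostrophe and "ü" intact.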



