Question about working with html entities in python 2 to use them as filenames

Thu Nov 24 02:38:39 EST 2016

Steven Truppe <steven.truppe at chello.at> writes:

> type= <type 'str'> title =  Wizo - Anderster Full Album - YouTube
> type= <type 'str'> title =  Wizo - Bleib Tapfer / für'n Arsch Full Album - YouTube
> Traceback (most recent call last):
>   File "./music-fetcher.py", line 39, in <module>
>     title = HTMLParser.HTMLParser().unescape(title)
>   File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
>     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
>   File "/usr/lib/python2.7/re.py", line 155, in sub
>     return _compile(pattern, flags).sub(repl, string, count)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

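This looks like a usage problem with the "unescape" method of
"HTMLParser" (or possibly a bug in it): your "title" is a byte "str"
containing non-ASCII data (0xc3 is the first byte of the UTF-8 encoding
of "ü"), while "unescape" apparently expects "unicode" input under
Python 2.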

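If you want to stay with "HTMLParser", decoding the title before
unescaping it should avoid the error. A minimal sketch, assuming the page
is UTF-8 encoded (the byte 0xc3 in your traceback suggests so);
"unescape_title" is a name I made up for illustration, and the import
fallback is only there so that the same code also runs under Python 3:

```python
try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
    unescape = HTMLParser().unescape


def unescape_title(raw_title, encoding="utf-8"):
    """Decode bytes to text first, then resolve HTML character references.

    The "utf-8" default is an assumption about the page's encoding.
    """
    if isinstance(raw_title, bytes):     # "bytes" is "str" under Python 2
        raw_title = raw_title.decode(encoding)
    return unescape(raw_title)


# Sample input modelled on the title from the traceback (UTF-8 bytes).
raw = b"Wizo - Bleib Tapfer / f\xc3\xbcr&#39;n Arsch Full Album - YouTube"
title = unescape_title(raw)
```

The result is text ("unicode" under Python 2), so it must be encoded
again before it is written to a byte-oriented interface.
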
I would use "lxml" to parse your HTML. It automatically converts
character references (like the "&#39;" in your title) and handles
non-ASCII characters (like "ü") properly. Under Python 2, "lxml" returns
text data either as "str" (if the result is pure ASCII) or as "unicode"
(otherwise).
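
A rough sketch of the "lxml" approach. The sample page and the explicit
"utf-8" encoding are my assumptions; your real page may declare its own
encoding, in which case the parser can pick it up itself:

```python
import lxml.html

# Hypothetical sample page; the real one comes from your download.
page = (b"<html><head><title>Wizo - Bleib Tapfer / f\xc3\xbcr&#39;n Arsch "
        b"Full Album - YouTube</title></head><body></body></html>")

# Tell the parser the byte encoding explicitly instead of letting it guess.
parser = lxml.html.HTMLParser(encoding="utf-8")
doc = lxml.html.document_fromstring(page, parser=parser)

# Character references are already resolved in the parsed text.
title = doc.findtext(".//title")
```

Here "title" contains a non-ASCII character, so under Python 2 it comes
back as "unicode".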