Question about working with HTML entities in Python 2 to use them as filenames

Wed Nov 23 05:17:35 EST 2016

type= <type 'str'> title =  Wizo - Anderster Full Album - YouTube
type= <type 'str'> title =  Wizo - Bleib Tapfer / für'n Arsch Full Album - YouTube
Traceback (most recent call last):
   File "./music-fetcher.py", line 39, in <module>
     title = HTMLParser.HTMLParser().unescape(title)
   File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
   File "/usr/lib/python2.7/re.py", line 155, in sub
     return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

The pastebins below show how I parse the HTML data from the <title>. I
want to end up with a normal filename containing Ü, Ö, etc. if possible.

I've tried converting with decode('utf-8'), encode(), str.encode() and
other things, but I think I'm missing something here.

I want to create a filename out of the <title>DATA</title>; it's really
important.
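
For reference, here is a minimal sketch of the approach that seems to fit
the traceback above: decode the byte string to unicode *before* calling
unescape(), then clean the result up for use as a filename. The helper
name title_to_filename is hypothetical. The sample title with an &#39;
entity, the UTF-8 page encoding, and the choice to replace "/" with "-"
are all assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape


def title_to_filename(raw_title, encoding="utf-8"):
    """Hypothetical helper: turn a raw <title> value into a unicode filename."""
    # Decode first. In Python 2, passing a byte string that contains
    # non-ASCII bytes to unescape() can trigger an implicit ASCII decode,
    # which is where the UnicodeDecodeError comes from.
    if isinstance(raw_title, bytes):
        raw_title = raw_title.decode(encoding)
    text = unescape(raw_title).strip()
    # "/" is the path separator on POSIX systems, so it cannot stay
    # in a file name. Replacing it with "-" is an arbitrary choice.
    return text.replace(u"/", u"-")


# Assumed sample input: UTF-8 bytes with one HTML entity left in.
raw = u"Wizo - Bleib Tapfer / für&#39;n Arsch Full Album - YouTube".encode("utf-8")
name = title_to_filename(raw)
```

The resulting name is a unicode string, so it can be passed as-is to
open() or os.mkdir() rather than being encoded back to bytes first.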

Hopeful regards,

Truppe Steven

On 2016-11-23 02:32, Steve D'Aprano wrote:
> On Wed, 23 Nov 2016 09:00 am, Lew Pitcher wrote:
>
>> 2) Apparently os.mkdir() (at least) defaults to requiring an ASCII
>> pathname.
> No, you have misinterpreted what you have seen.
>
> Even in Python 2, os.mkdir will accept a Unicode argument. You just have to
> make sure it is given as unicode:
>
> os.mkdir(u'/tmp/für')
>
> Notice the u' delimiter instead of the ordinary ' delimiter? That tells
> Python to use a unicode (text) string instead of an ascii byte-string.
>
> If you don't remember the u' delimiter, and write an ordinary byte-string '
> delimiter, then the result you get will depend on some combination of your
> operating system, the source code encoding, and Python's best guess of what
> you mean.
>
> os.mkdir('/tmp/für')  # don't do this!
>
> *might* work, if all the factors align correctly, but often won't. And when
> it doesn't, the failure can be extremely mysterious, usually involving a
> spurious
>
> UnicodeDecodeError: 'ascii' codec
>
> error.
>
> Dealing with Unicode text is much simpler in Python 3. Dealing with
> *unknown* encodings is never easy, but so long as you can stick with
> Unicode and UTF-8, Python 3 makes it easy.
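
The os.mkdir() advice quoted above can be tried safely with a small
sketch like the following. It assumes a filesystem and locale that can
represent "ü", and it works in a temporary directory (the names base,
path and created are only illustrative) so nothing is left behind.

```python
# -*- coding: utf-8 -*-
import os
import shutil
import tempfile

# Work inside a throwaway directory so the sketch cleans up after itself.
base = tempfile.mkdtemp()
try:
    # Note the u prefix: the path is a unicode (text) string, not bytes.
    path = os.path.join(base, u"für")
    os.mkdir(path)
    created = os.path.isdir(path)
finally:
    shutil.rmtree(base)
```

If created ends up True, the unicode path was accepted without any
explicit encode() call.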