Question about working with html entities in python 2 to use them as filenames

Lew Pitcher lew.pitcher at digitalfreehold.ca
Tue Nov 22 17:00:47 EST 2016


On Tuesday November 22 2016 15:54, in comp.lang.python, "Steven Truppe"
<steven.truppe at chello.at> wrote:

> I've made a pastebin with a few examples: http://pastebin.com/QQQFhkRg
> 
> 
> 
> On 2016-11-22 21:33, Steven Truppe wrote:
>> Hi all,
>>
>>
>> I'm using Linux and Python 2 and want to process a file line by line,
>> executing a command with each line (with os.system).
>>
>> My problem now is that I open the file and parse the title, but
>> I'm not able to turn it into a normal filename:
>>
>>
>> import os,sys
>>
>> import urllib,re,cgi
>>
>> import HTMLParser, unicodedata
>>
>> import htmlentitydefs
>>
>> import chardet
>>
>> for URL in open('list.txt', "r").readlines():
>>
>>     these_regex="<title>(.+?)</title>"
>>
>>     pattern = re.compile(these_regex)
>>
>>     htmlfile=urllib.urlopen(URL)
>>
>>     htmltext=htmlfile.read()
>>
>>     title=re.findall(pattern, htmltext)[0]
>>
>>     title = HTMLParser.HTMLParser().unescape(title)
>>
>>     print "title = ", title
>>
>>     # here I would like to create a directory named after the content of
>>     # the title
>>
>>
>> I always get this error:
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2
>>
>>
>>
>> I've played around with .ecode('latin-1') or ('utf8') but I was not
>> yet able to solve this simple issue.

I'm no Python programmer, but I do have a couple of observations.

First, though, here's an extract from that pastebin posting of yours:
>     print "Title = ", title.decode()
>  
> ----- RESULT --------
> Title =  Wizo - Anderster Full Album - YouTube
> Title =  Wizo - Bleib Tapfer / für'n Arsch Full Album - YouTube
> Title =  WIZO - Uuaarrgh Full Album - YouTube
> Title =  WIZO - Full Album - "Punk gibt's nicht umsonst! (Teill III)" - YouTube
> Title =  WIZO - Full Album - "DER" - YouTube
> Title =  Alarmsignal -  Wir leben - YouTube
> Title =  the Pogues - Body of an american - YouTube
> Title =  The Pogues -  The band played waltzing matilda - YouTube
> Title =  Hey Rote Zora - Heiter bis Wolkig - YouTube
> Title =  Für immer Punk - die goldenen Zitronen - YouTube
> Title =  Fuckin' Faces - Krieg und Frieden - YouTube
> Title =  Sluts - Anders - YouTube
> Title =  Absturz - Es ist schön ein Punk zu sein - YouTube
> Title =  Broilers - Ruby Light & Dark - YouTube
> Title =  Less Than Jake 02 - My Very Own Flag - YouTube
> Title =  The Mighty Mighty Bosstones - The Impression That I Get - YouTube
> Title =  Streetlight Manifesto - Failing Flailing (lyrics) - YouTube
> Title =  Mustard Plug - Mr. Smiley - YouTube
>  
> But when i try:
> os.mkdir(title)
> i get the following:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23:
> ordinal not in range(128) 

Now for the observations:

1) Some of your titles contain the '/' character, which on your platform
(Linux) is taken as a path separator character. The os.mkdir() method
apparently expects its "path" argument to name a directory inside an already
existing directory. That is to say, if path is "/a/b/c", then os.mkdir()
expects that the directory /a/b will already exist. Those titles that contain
the path separator character will cause os.mkdir() to attempt to create a
directory in a subdirectory of the current directory, and that subdirectory
doesn't exist yet. You either have to sanitize your input to remove the path
separators, and use os.mkdir() to create a directory named after the
sanitized title, /or/ use os.makedirs(), which will create all the
subdirectories required by your given path.

2) Apparently os.mkdir() (at least) defaults to requiring an ASCII pathname.
Those of your titles that contain non-ASCII characters cannot be stored
verbatim without either
  a) re-encoding the title in ASCII, or
  b) flagging to os.mkdir() that Unicode is acceptable.
Apparently, this is a common problem; a Google search brought up several pages
dedicated to answering this question, including one extensive paper on the
topic (http://nedbatchelder.com/text/unipain.html). There apparently are ways
to cause os.mkdir() to accept Unicode inputs; their effectiveness and
side effects are beyond me.
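
For what it's worth, both observations can be combined into one rough sketch.
It is not taken from your script: the names title_to_dirname() and
make_title_dir() are made up for illustration, the page is assumed to be
UTF-8 (the 0xc3 byte at position 23 would fit a UTF-8 'ü' in "für'n"), and
replacing '/' with '-' is just one arbitrary way to sanitize. The idea is to
decode the raw bytes to text first, unescape the HTML entities, take out the
path separator, and only then create the directory.

```python
import os

try:
    from html import unescape              # Python 3
except ImportError:
    from HTMLParser import HTMLParser       # Python 2
    unescape = HTMLParser().unescape


def title_to_dirname(raw_title, encoding="utf-8"):
    """Turn the raw bytes of a <title> into a single path component.

    Hypothetical helper; the UTF-8 default is an assumption about the page.
    """
    # Decode first, so unescape() never has to mix bytes with text.
    text = raw_title.decode(encoding)
    # Expand entities such as &amp; or &#39; into real characters.
    text = unescape(text)
    # '/' is the path separator on Linux; '-' is an arbitrary replacement.
    text = text.replace("/", "-")
    return text.strip()


def make_title_dir(parent, raw_title):
    """Create (if needed) a directory named after the title inside parent."""
    name = title_to_dirname(raw_title)
    if not name:
        raise ValueError("title is empty after sanitizing")
    path = os.path.join(parent, name)
    # The name holds no separator any more, so os.mkdir() is enough here.
    if not os.path.isdir(path):
        os.mkdir(path)
    return path
```

With the second title from your list, title_to_dirname() should give
"Wizo - Bleib Tapfer - für'n Arsch Full Album - YouTube". Under Python 2 the
name is a unicode object, which os.mkdir() encodes using the filesystem
encoding, so a UTF-8 locale is assumed as well.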

HTH
-- 
Lew Pitcher
"In Skills, We Trust"
PGP public key available upon request



