Changing filenames from Greeklish => Greek (subprocess complain)

Νικόλαος Κούρας nikos.gr33k at gmail.com
Thu Jun 6 07:16:44 EDT 2013


On Thursday, 6 June 2013 1:24:16 PM UTC+3, Cameron Simpson wrote:
> On 05Jun2013 11:43, Νίκος Γκρ33κ <nikos.gr33k at gmail.com> wrote:
> | On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
> | > Using Python, I think you could get the filenames using os.listdir,
> | > passing the directory name as a bytestring so that it'll return the
> | > names as bytestrings.
> |
> | > Then, for each name, you could decode from its current encoding and
> | > encode to UTF-8 and rename the file, passing the old and new paths to
> | > os.rename as bytestrings.
> |
> | I am not sure I follow:
> |
> | Change this:
> |
> | # Compute a set of current fullpaths
> | fullpaths = set()
> | path = "/home/nikos/public_html/data/apps/"
> |
> | for root, dirs, files in os.walk(path):
> [...]
>
> Have a read of this:
>
>   http://docs.python.org/3/library/os.html#os.listdir
>
> The UNIX API accepts bytes for filenames and paths.
>
> Python 3 strs are sequences of Unicode code points. If you try to
> open a file or directory on a UNIX system using a Python str, that
> string must be converted to a sequence of bytes before being handed
> to the OS.
>
> This is done implicitly using your locale settings if you just use a str.
>
> However, if you pass a bytes object to open or listdir, this conversion
> does not take place. You put bytes in and, in the case of listdir,
> you get bytes out.
>
> You can work on pathnames in bytes and never concern yourself with
> encode/decode at all.
>
> In this way you can write code that does not care about the translation
> between Unicode and some arbitrary byte encoding.
>
> Of course, the issue will still arise when accepting user input;
> your shell has done exactly this kind of thing when you renamed
> your MP3 file. But it is possible to write pure utility code that
> doesn't care about filenames as Unicode or str if you work purely
> in bytes.
>
> Regarding user filenames, the common policy these days is to use
> utf-8 throughout. Of course you need to get everything into that
> regime to start with.
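
To make the quoted advice concrete, here is a minimal sketch of "bytes in, bytes out" with os.listdir, followed by the decode / encode / os.rename step MRAB describes. Several things in it are assumptions and not from the thread: the helper names rename_to_utf8 and demo are hypothetical, the old encoding is guessed to be ISO-8859-7, the "skip names that are already valid UTF-8" check is an added safeguard, and the demo runs in a throwaway temporary directory on a POSIX filesystem rather than in the real apps directory.

```python
import os
import tempfile


def rename_to_utf8(directory, source_encoding):
    """Rename entries of `directory` (a bytes path) from source_encoding to UTF-8.

    Hypothetical helper for illustration only.
    Returns a list of (old_name, new_name) pairs, both bytes.
    """
    renamed = []
    for name in os.listdir(directory):      # bytes in, so bytes out
        try:
            name.decode("utf-8")
            continue                        # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            pass
        # Decode from the (assumed) old encoding, then encode to UTF-8.
        new_name = name.decode(source_encoding).encode("utf-8")
        # Old and new paths are both passed to os.rename as bytes.
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, new_name))
        renamed.append((name, new_name))
    return renamed


def demo():
    """Create one ISO-8859-7 named file in a temp directory and convert it."""
    directory = os.fsencode(tempfile.mkdtemp())
    old_name = "Ελληνικά.mp3".encode("iso-8859-7")
    open(os.path.join(directory, old_name), "wb").close()
    rename_to_utf8(directory, "iso-8859-7")
    return os.listdir(directory)
```

Because everything stays in bytes until the single explicit decode, the locale of the running process never gets a chance to mangle the names.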

So I need to use os.listdir() to get those filenames as bytes. Okay.

So I changed this:

fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
	for fullpath in files:
		fullpaths.add( os.path.join(root, fullpath) )

to this:

# Compute a set of current fullpaths
fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in fullpaths:
	try: 
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
		data = cur.fetchone()        #URL is unique, so should only be one

but the error log now shows:

-----------------------------
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Original exception was:
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]   File "files.py", line 67, in <module>
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcc5' in position 35: surrogates not allowed
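
A note on the '\udcc5' in that last line: it is a lone surrogate, which is how Python 3 represents a filename byte (here 0xC5) that it could not decode when os.listdir was given a str path. This is the surrogateescape error handler from PEP 383. The sketch below uses only the standard library and assumes a UTF-8 filesystem encoding. The filename bytes and the three function names are hypothetical. It shows how such a byte becomes the surrogate, why a plain encode to UTF-8 then fails as in the traceback, and how the original bytes can be recovered.

```python
def listdir_style_decode(raw_name, encoding="utf-8"):
    """Decode filename bytes the way os.listdir(str_path) does on POSIX.

    Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    """
    return raw_name.decode(encoding, "surrogateescape")


def encodes_cleanly(name, encoding="utf-8"):
    """Return True if `name` survives a strict encode, as a DB driver would do."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def recover_bytes(name, encoding="utf-8"):
    """Turn the surrogates back into the original raw bytes."""
    return name.encode(encoding, "surrogateescape")


raw = b"\xc5.mp3"                   # hypothetical name; 0xC5 alone is not valid UTF-8
name = listdir_style_decode(raw)

print(ascii(name))                  # '\udcc5.mp3'
print(encodes_cleanly(name))        # False: "surrogates not allowed"
print(recover_bytes(name) == raw)   # True
```

On POSIX, os.fsencode performs the same round trip using the filesystem encoding, so a str returned by os.listdir can be converted back to its raw bytes before deciding which encoding those bytes are really in.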


