Changing filenames from Greeklish => Greek (subprocess complain)

Νικόλαος Κούρας nikos.gr33k at gmail.com
Fri Jun 7 07:53:42 EDT 2013


Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε:
> On 07Jun2013 09:56, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k at gmail.com> wrote:
> 
> | On 7/6/2013 4:01 πμ, Cameron Simpson wrote:
> 
> | >On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= <nikos.gr33k at gmail.com> wrote:
> 
> | >| Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε:
> 
> | >| > py> s = '999-Eυχή-του-Ιησού'
> 
> | >| > py> bytes_as_utf8 = s.encode('utf-8')
> 
> | >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
> 
> | >| > py> print(t)
> 
> | >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
> 
> | >|
> 
> | >| errors='replace' mean dont break in case or error?
> 
> | >
> 
> | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
> 
> | >for something that would not decode smoothly.
> 
> |
> 
> | How can it be correct? We have encoded out string in utf-8 and then
> 
> | we tried to decode it as greek-iso? How can this possibly be
> 
> | correct?

> If it is a valid iso-8859-7 sequence (which might cover everything, 
> since I expect it is an 8-bit 1:1 mapping from bytes values to a 
> set of codepoints, just like iso-8859-1) then it may decode to the 
> "wrong" characters, but the reverse process (characters encoded as
> bytes) should produce the original bytes.  With a mapping like this, 
> errors='replace' may mean nothing; there will be no errors because
> the only Unicode characters in play are all from iso-8859-7 to start
> with. Of course another string may not be safe. 

> Visually, the names will be garbage. And if you go:
>   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
> while using the iso-8859-7 locale, the wrong thing will occur
> (assuming it even works, though I think it should because all these
> characters are represented in iso-8859-7, yes?)

All the rest you i understood only the above quotes its still unclear to me.
I cant see to understand it.

Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar?

For example char 'a' has the value of '65' for all of those character sets?
Is hat what you mean?

s = 'a'  (This is unicode right?  Why when we assign a string to a variable that string's type is always unicode and does not automatically become utf-8 which includes all available world-wide characters? Unicode is something different that a character set? )

utf8_byte = s.encode('utf-8')

Now if we are to decode this back to utf8 we will receive the char 'a'.
I beleive same thing will happen with latin, greek, ascii isos. Correct?

utf8_a = utf8_byte.decode('iso-8859-7')
latin_a = utf8_byte.decode('iso-8859-1')
ascii_a = utf8_byte.decode('ascii')
utf8_a = utf8_byte.decode('iso-8859-7')

Is this correct? 
All of those decodes will work even if the encoded bytestring was of utf8 type?

The characters that will not decode correctly are those that their codepoints are greater that > 127 ?

for example if s = 'α' (greek character equivalent to english 'a')

Is this what you mean?
--------------------------------

Now back to my almost ready files.py script please:


#========================================================
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
	# Compute 'path/to/filename' in bytes
	greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
	try:
		filepath = greek_path.decode('iso-8859-7')
		
		# Rename current filename from greek bytes --> utf-8 bytes
		os.rename( greek_path, filepath.encode('utf-8') )
	except UnicodeDecodeError:
		# Since its not a greek bytestring then its a proper utf8 bytestring
		filepath = greek_path.decode('utf-8')


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
	try:
		# Check the presence of a file against the database and insert if it doesn't exist
		cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
		data = cur.fetchone()
		
		if not data:
			# First time for file; primary key is automatic, hit is defaulted 
			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
	except pymysql.ProgrammingError as e:
		print( repr(e) )


#========================================================
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
	filepaths.add( filename )

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
	if rec not in filepaths:
		cur.execute('''DELETE FROM files WHERE url = %s''', rec )

=======================

nikos at superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 71, in <module>
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]     os.rename( greek_path, filepath.encode('utf-8') )
[Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 \\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf \\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2: '/home/nikos/public_html/data/apps/filename'


?????



More information about the Python-list mailing list