Changing filenames from Greeklish => Greek (subprocess complain)

Νικόλαος Κούρας nikos.gr33k at gmail.com
Sun Jun 9 00:46:40 EDT 2013


On 9/6/2013 1:32 AM, Cameron Simpson wrote:
> On 08Jun2013 14:14, Νίκος Γκρ33κ <nikos.gr33k at gmail.com> wrote:
> | On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
> | > ASCII actually needs 7 bits to store a character. Since computers are
> | > optimized to work with bytes, not bits, normally ASCII characters are
> | > stored in a single byte, with one bit wasted.
> |
> | So ASCII and Unicode are two encoding systems currently in use.
> | How should I imagine them, visualize them?
> | Like tables: 'A' = 65, 'B' = 66 and so on?
>
> Yes, that works.
>
> | But if I do, then that would be the visualization of a 'charset', not of an encoding system.
> | What is the difference between an encoding system and a charset?
>
> An encoding system is the method of transcribing these values to bytes and back again.
So we have:

( 'A' mapped to the value 65 ) => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) => ( 65 mapped to character 'A' )
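
The two steps above can be sketched in Python 3 (an assumption; the thread does not say which Python version is in use). ord() and chr() convert between a character and its number, while encode() and decode() convert between text and bytes. The Greek letter is only an illustrative sample:

```python
# Character <-> number: this is the character set part.
ch = 'A'
code_point = ord(ch)               # number assigned to 'A'
same_char = chr(code_point)        # back from the number to the character

# Text <-> bytes: this is the encoding part.
raw = ch.encode('utf-8')           # text to bytes
back = raw.decode('utf-8')         # bytes to text

# A non-ASCII sample: one character, one number, but two bytes in UTF-8.
greek = 'α'
greek_code_point = ord(greek)
greek_raw = greek.encode('utf-8')

print(code_point, raw, back)
print(greek_code_point, greek_raw)
```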

Why does every character in a character set need to be associated with 
a numeric value?
I mean, couldn't we just have character sets that have no numeric 
associations, like:

'A'  => encoding process (e.g. utf-8) => bytes
bytes => decoding process (e.g. utf-8) => character 'A'


>
> EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
> (1:1 mappings of characters to numbers/ordinals).
>
> An encoding is a way of writing these values to bytes.
> Decoding reads bytes and emits character values.
>
> Because all of EBCDIC, ASCII and the iso-8859-x character sets fit in the range 0-255,
> they are usually transcribed (encoded) directly, one byte per ordinal.
>
> Unicode is much larger. It cannot be transcribed (encoded) as one byte per value.
> There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form,
> using one byte for values below 128 and multiple bytes for higher values.
An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold 256 different values, why doesn't UTF-8 use 1 byte 
for values up to 255?
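
A quick way to check this, sketched in Python 3 (an assumption about the version). UTF-8 reserves the top bit of every byte to mark multi-byte sequences, so only values 0 to 127 fit in a single byte, whereas latin-1 (ISO-8859-1) stores each value 0 to 255 in one byte. The sample values are picked only for illustration:

```python
# Compare how many bytes the same ordinal needs in latin-1 and in UTF-8.
samples = (65, 127, 128, 233, 255)

utf8_lengths = {}
latin1_lengths = {}
for cp in samples:
    ch = chr(cp)
    utf8_lengths[cp] = len(ch.encode('utf-8'))
    latin1_lengths[cp] = len(ch.encode('latin-1'))
    print(cp, repr(ch), 'latin-1:', latin1_lengths[cp], 'utf-8:', utf8_lengths[cp])

# The bytes UTF-8 produces for ordinal 233 ('é').
e_acute_utf8 = chr(233).encode('utf-8')
```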

UTF-8 and UTF-16 and UTF-32:
I thought the number next to UTF- declared how many bits the 
character set uses to store a character on disk, no?
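
As a rough check in Python 3 (an assumption about the version), the number is the size in bits of one code unit, and a character may need more than one unit. The sketch below counts bytes per character; byte_sizes is a hypothetical helper, and the '-le' codec variants are used so that no byte order mark is added to the count:

```python
def byte_sizes(ch):
    """Return the byte count of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le'))

# Sample characters from different ranges (chosen only for illustration).
samples = ['A', '\u03b1', '\u20ac', '\U0001F600']
for ch in samples:
    print(hex(ord(ch)), byte_sizes(ch))
```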

"Narrow" Unicode uses two bytes per character. Since two bytes is only
enough for about 65,000 characters, not 1,000,000+, the rest of the
characters are stored as pairs of two-byte "surrogates".

Can you please explain the line "the rest of the characters are stored 
as pairs of two-byte "surrogates"" in simpler terms, so I can understand it?
I'm still having trouble understanding what a surrogate is.
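
A minimal sketch of the arithmetic behind a surrogate pair, in Python 3 (an assumption about the version). A code point above 0xFFFF does not fit in two bytes, so UTF-16 splits it into two 16-bit values taken from reserved ranges. to_surrogates is a hypothetical helper name and the sample code point is arbitrary:

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) UTF-16 surrogate pair."""
    offset = code_point - 0x10000          # 20 bits remain after removing the base
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low

cp = 0x1F600                               # a sample code point outside the 16-bit range
high, low = to_surrogates(cp)
print(hex(high), hex(low))

# The UTF-16 codec produces the same two 16-bit values (big-endian here).
encoded = chr(cp).encode('utf-16-be')
print(encoded.hex())
```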

Again, thank you very much for explaining the encodings to me, they were 
giving me trouble for years in all of my scripts.


And one last thing.
When the locale of a Linux system is set to UTF-8, that would mean that 
Linux applications should encode strings to disk using the system's 
default encoding, UTF-8, and read them back from bytes also using 
UTF-8. Is that correct?
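
One way to inspect this from Python 3 (an assumption about the version) is to ask which encoding the current locale prefers and which encoding is used for filenames. The printed values depend on the machine, so none are shown here; the Greek sample string is only illustrative:

```python
import locale
import sys

# Encoding suggested by the current locale for text data.
preferred = locale.getpreferredencoding(False)

# Encoding Python uses to convert filenames to and from bytes.
fs_encoding = sys.getfilesystemencoding()
print(preferred, fs_encoding)

# Round trip: text -> bytes -> text with an explicitly named encoding.
text = 'Ελληνικά'
data = text.encode('utf-8')
restored = data.decode('utf-8')
print(len(text), len(data))
```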
-- 
Webhost <http://superhost.gr>&& Weblog <http://psariastonafro.wordpress.com>