Newbie question about text encoding

Sam Raker sam.raker at gmail.com
Thu Feb 26 11:45:56 EST 2015


I'm 100% in favor of expanding Unicode until the sun goes dark. Doing so helps solve the problems affecting speakers of "underserved" languages--access and language preservation. Speakers of Mongolian, Cherokee, Georgian, etc. all deserve to be able to interact with technology in their native languages as much as we speakers of ASCII-friendly languages do. Unicode support also makes writing papers on, dictionaries of, and new texts in such languages much easier, which helps the fight against language extinction, which is a sadly pressing issue.

Also, like, computers are big. Get an external drive for your high-resolution PDF collection of Medieval manuscripts if you feel like you're running out of space. A few extra codepoints aren't going to be the straw that breaks the camel's back.


On Thursday, February 26, 2015 at 8:24:34 AM UTC-5, Chris Angelico wrote:
> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody at gmail.com> wrote:
> > Wrote something up on why we should stop using ASCII:
> > http://blog.languager.org/2015/02/universal-unicode.html
> 
> >From that post:
> 
> """
> 5.1 Gibberish
> 
> When going from the original 2-byte unicode (around version 3?) to the
> one having supplemental planes, the unicode consortium added blocks
> such as
> 
> * Egyptian hieroglyphs
> * Cuneiform
> * Shavian
> * Deseret
> * Mahjong
> * Klingon
> 
> To me (a layman) it looks unprofessional - as though they are playing
> games - that billions of computing devices, each having billions of
> storage words should have their storage wasted on blocks such as
> these.
> """
> 
> The shift from Unicode as a 16-bit code to having multiple planes came
> in with Unicode 2.0, but the various blocks were assigned separately:
> * Egyptian hieroglyphs: Unicode 5.2
> * Cuneiform: Unicode 5.0
> * Shavian: Unicode 4.0
> * Deseret: Unicode 3.1
> * Mahjong Tiles: Unicode 5.1
> * Klingon: Not part of any current standard
> 
> However, I don't think historians will appreciate you calling all of
> these "gibberish". To adequately describe and discuss old texts
> without these Unicode blocks, we'd have to either do everything with
> images, or craft some kind of reversible transliteration system and
> have dedicated software to render the texts on screen. Instead, what
> we have is a well-known and standardized system for transliterating
> all of these into numbers (code points), and rendering them becomes a
> simple matter of installing an appropriate font.
> 
> Also, how does assigning meanings to codepoints "waste storage"? As
> soon as Unicode 2.0 hit and 16-bit code units stopped being
> sufficient, everyone needed to allocate storage - either 32 bits per
> character, or some other system - and the fact that some codepoints
> were unassigned had absolutely no impact on that. This is decidedly
> NOT unprofessional, and it's not wasteful either.
> 
> ChrisA




More information about the Python-list mailing list