Unicode codepoints

Saul Spatz saul.spatz at gmail.com
Tue Jun 21 23:37:29 EDT 2011


Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a UTF-8 encoded file and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Is there a better way?

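def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False  # True when the previous character was the first half of a surrogate pair
    for k, c in enumerate(s):
        if skip:
            # c completes the pair; this relies on ord() accepting the
            # two-character surrogate pair, as it does on a narrow build
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            # ordinary character: one character, one codepoint
            answer.append(ord(c))
        else:
            # surrogate: wait for the next character before emitting anything
            skip = True
    return answer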
            
if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])
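
For reference, the surrogate-pair arithmetic that the loop above depends on can be written out by hand. What follows is only a sketch, not part of the script above: the name code_points is made up for illustration, and it assumes the string may hold UTF-16 surrogate pairs (as on a narrow Python build). Where strings are already indexed by codepoint, it should give the same result as [ord(c) for c in s].

```python
def code_points(s):
    """Return the Unicode codepoints in s, combining surrogate pairs by hand.

    Hypothetical helper for illustration; unpaired surrogates are passed
    through unchanged rather than reported as errors.
    """
    answer = []
    high = None  # pending high (leading) surrogate, if any
    for c in s:
        n = ord(c)
        if high is not None:
            if 0xDC00 <= n <= 0xDFFF:
                # high + low surrogate: rebuild the codepoint above U+FFFF
                answer.append(0x10000 + ((high - 0xD800) << 10) + (n - 0xDC00))
                high = None
                continue
            # the pending high surrogate had no partner: emit it as-is
            answer.append(high)
            high = None
        if 0xD800 <= n <= 0xDBFF:
            high = n
        else:
            answer.append(n)
    if high is not None:
        # string ended on an unpaired high surrogate
        answer.append(high)
    return answer
```

Formatting each value with 'U+%04X' % cp would give the conventional zero-padded, upper-case form (U+00E9 rather than U+e9).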

Thanks for any help you can give me.

Saul

More information about the Python-list mailing list