Unicode codepoints

Wed Jun 22 00:00:22 EDT 2011

On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz at gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes.  After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic.  Have you a better way?

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

for c in s:
  print('U+'+hex(ord(c))[2:])
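
As a rough sketch of the same idea using the conventional zero-padded, uppercase U+XXXX notation (the helper name format_codepoints is made up for illustration, and the sample string stands in for the contents of your decoded file):

```python
def format_codepoints(s):
    """Return a 'U+XXXX' label for each character of the str s."""
    # :04X pads to at least four uppercase hex digits, the usual U+ notation;
    # codepoints above U+FFFF simply come out with five or six digits.
    return [f"U+{ord(c):04X}" for c in s]

# Sample text: 'a', e-acute, euro sign
for label in format_codepoints("a\u00e9\u20ac"):
    print(label)
```

This assumes a Python recent enough to have f-strings; on older versions '%04X' % ord(c) should give the same result.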

But if you do need the codePoints() function, I'd write it as a generator:

> def codePoints(s):
>    ''' yield the Unicode codepoints in the string s '''
>    skip = False
>    for k, c in enumerate(s):
>        if skip:
>            skip = False
>            yield ord(s[k-1:k+1])
>            continue
>        if not 0xd800 <= ord(c) <= 0xdfff:
>            yield ord(c)
>        else:
>            skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.
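
To illustrate, here is a hedged sketch of a generator that a caller can loop over exactly as it would a list. It assumes the string may contain UTF-16 surrogate pairs (as on a narrow build) and combines them arithmetically rather than calling ord() on a two-character slice. The names code_points and show are hypothetical, not from your script.

```python
def code_points(s):
    """Yield the codepoints of s, combining any UTF-16 surrogate pairs."""
    i = 0
    while i < len(s):
        hi = ord(s[i])
        # A high surrogate followed by a low surrogate encodes one codepoint
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(s):
            lo = ord(s[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                i += 2
                continue
        # Ordinary character (or an unpaired surrogate): pass it through
        yield hi
        i += 1

def show(points):
    """Format any iterable of codepoints; works for a list or a generator."""
    return [f"U+{p:04X}" for p in points]

# The calling code is the same whether it gets a list or a generator
for label in show(code_points("a\ud83d\ude00")):
    print(label)
```

On builds where a str already holds full codepoints, the surrogate branch is simply never taken for properly decoded text.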

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
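
The distinction matters because the two types iterate differently in Python 3. A small sketch, with variable names chosen only for illustration:

```python
text = "\u00e9"              # one character, U+00E9 (e-acute)
data = text.encode("utf-8")  # the same character as two UTF-8 bytes

# Iterating over bytes yields integers, one per byte of the encoding
byte_values = list(data)

# Iterating over str yields characters, so ord() gives codepoints
code_points = [ord(c) for c in text]

# Decoding the bytes gets back to the Unicode string
decoded = data.decode("utf-8")

print(byte_values, code_points, decoded == text)
```

So a function written for one type will quietly produce different numbers if handed the other.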

Chris Angelico