From sgarman at zenlinux.com Mon Oct 13 00:01:29 2014
From: sgarman at zenlinux.com (Scott Garman)
Date: Sun, 12 Oct 2014 15:01:29 -0700
Subject: [portland] Umlats from another dimension
Message-ID: <543AFA39.5000908@zenlinux.com>

Hi all,

I'm getting pretty confused by a problem I'm trying to solve in python,
which is to detect lower-case characters in a string. This would normally
be a simple regex, but I also have to accept input strings with umlauts in
them, such as 'ä'. I'm using python 2.7.6.

At first I thought this was a unicode problem, but now I'm not so sure.
About anything.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

str = 'ä'

if isinstance(str, unicode):
    print "This is unicode"

Running this tells me that string is *not* unicode. I know that there's a
thing called extended ASCII, and if I look up a table for that, I see
characters with accents and umlauts:

http://www.asciitable.com/

This table suggests that 'ä' should correspond to an ordinal value of 132.
But if I run:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

string = 'ä'

for c in string:
    print ord(c)

I get:

195
164

which tells me that I'm dealing with a two-byte character, which brings me
back to this being unicode.

Now looking at which characters in the extended ASCII table correspond to
those values, I don't see any relation to 'ä'.

Finally, my understanding of python 2.x is that it does not support
unicode in regexes. Otherwise I'd just use \p{Ll} and have a good deal more
hair left on my head.

I've also tried forcing the string to ASCII using:

str.decode("ascii", "ignore")

and this is one of those characters that just gets dropped in the
conversion.

Any insights on what I'm missing would be greatly appreciated.

Thanks,

Scott

From eric at ericholscher.com Mon Oct 13 00:24:25 2014
From: eric at ericholscher.com (Eric Holscher)
Date: Sun, 12 Oct 2014 15:24:25 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To: <543AFA39.5000908@zenlinux.com>
References: <543AFA39.5000908@zenlinux.com>
Message-ID:

I think you want:

string = u'ä'

This will define it as a unicode string.

For example:

In [7]: string = unicode('ä')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
 in ()
----> 1 string = unicode('ä')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

In [8]: string = unicode(u'ä')

Cheers,
Eric

--
Eric Holscher
Maker of the internet residing in Portland, Or
http://ericholscher.com

From sgarman at zenlinux.com Mon Oct 13 00:31:28 2014
From: sgarman at zenlinux.com (Scott Garman)
Date: Sun, 12 Oct 2014 15:31:28 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To:
References: <543AFA39.5000908@zenlinux.com>
Message-ID: <543B0140.1030807@zenlinux.com>

Thanks Eric! My simplified test case was throwing me off. Now with that
fixed, I see that islower() seems to work with the umlaut characters, which
is how I can address the main problem I'm working on.

Regards,

Scott

From georgedorn at gmail.com Mon Oct 13 00:33:51 2014
From: georgedorn at gmail.com (Sam Thompson)
Date: Sun, 12 Oct 2014 15:33:51 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To: <543AFA39.5000908@zenlinux.com>
References: <543AFA39.5000908@zenlinux.com>
Message-ID:

The first important concept to understand is that UTF-8 and Unicode are not
the same thing.

Because you specified coding: utf-8, every plain string literal you define
within the python script is a bytestring encoded using utf-8. This is not
the same as a python unicode object; it is a bytestring (because you are
using python 2.x).

The reason that 'ä' produces two ordinals is that utf-8 is a variable-width
encoding. 195 and 164 (0xC3 0xA4) are the two bytes that encode 'ä' in
utf-8; its code point is 228 (U+00E4). If you want the python unicode
object for the string, use mystring.decode('utf-8') instead of 'ascii',
because it's not ascii.

The second important concept is that strings defined within the python
script may not be the same type as strings read from input, a file, a web
request, etc. Where is your input coming from?

If you can be sure your input is utf-8 (and this is a giant leap if you're
working with web input), convert it to unicode (via .decode()), iterate
over the unicode sequence and test each character with .islower().

If you can't be sure what encoding your bytestrings are in, check out the
chardet library on pypi.

From sgarman at zenlinux.com Mon Oct 13 00:38:39 2014
From: sgarman at zenlinux.com (Scott Garman)
Date: Sun, 12 Oct 2014 15:38:39 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To:
References: <543AFA39.5000908@zenlinux.com>
Message-ID: <543B02EF.6000305@zenlinux.com>

Thanks Sam, this explanation helped to fill in my gaps on bytestrings and
unicode in python, which until now I've been quite clueless about.

Scott

From tim.morgan at owasp.org Mon Oct 13 00:35:58 2014
From: tim.morgan at owasp.org (Tim)
Date: Sun, 12 Oct 2014 15:35:58 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To: <543AFA39.5000908@zenlinux.com>
References: <543AFA39.5000908@zenlinux.com>
Message-ID: <20141012223558.GL12633@sentinelchicken.org>

Hi Scott,

> Any insights on what I'm missing would be greatly appreciated.

Traditional Python strings are much more like byte arrays than
character strings. Explicit unicode strings can be defined as well,
but they are a separate data type.

Your isinstance() test is merely checking the data type of the object,
but this has nothing to do with the content stored within.

For instance:

>>> str = u'ä'
>>> if isinstance(str, unicode):
...     print "This is unicode"
...
This is unicode


And:

>>> str = u'any string, now stored as unicode'
>>> if isinstance(str, unicode):
...     print "This is unicode"
...
This is unicode


Note the "u" letter prefix to the string definitions.


I suspect when you include a character with an umlaut statically in
the script as a traditional string, it is stored in whatever character
set the file was saved in (I guess utf-8) and kept within the string
as is (once again, just a sequence of bytes).

When you read data in from your users and want to inspect it for
character content that doesn't fall within traditional ascii, I
recommend you first decode it to unicode and then perform operations
on it that way. But for goodness' sake, don't force it to "ascii"!
If you want to handle unicode, then interpret the input as utf-8 or
whatever makes sense, then manipulate the resulting unicode object,
preserving the extended character set.

Consider this:

>>> raw = 'ä'
>>> decoded = raw.decode('utf-8')
>>> for c in decoded:
...     print ord(c)
...
228


Here, since Python knows how to interpret the value stored in the
unicode object, the logical character value is printed out, rather
than the two encoded bytes.


Now, beyond just getting the characters converted into unicode
properly, you still have to worry about what Python considers to be
an uppercase vs. lowercase character. I believe that for plain byte
strings this depends on the locale you have set in the environment,
whereas unicode objects use the Unicode character database. But
that's about as far as my knowledge goes here...

Hope that helps,
tim


PS- In Python 3, the default string object *is* unicode. The old
    behavior of strings is relegated to bytes(). In some ways this
    makes it easier to understand what is going on with unicode.
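
A minimal Python 3 sketch of the points made in this thread, offered as an illustration and not as something posted to the list. It assumes the input really is UTF-8. The helper name find_lowercase and the sample string are hypothetical. It shows that 195 and 164 are the UTF-8 bytes of 'ä', that the decoded code point is 228, that 132 is the value from the CP437 "extended ASCII" table, and that islower() works per character once the bytes are decoded. The unicodedata.category() check is only a rough stand-in for the \p{Ll} class mentioned earlier; the two tests can disagree for a few unusual characters.

```python
import unicodedata


def find_lowercase(raw, encoding="utf-8"):
    """Decode raw bytes, then return the lower-case characters found.

    Hypothetical helper: the encoding is an assumption and has to match
    how the input bytes were actually produced.
    """
    text = raw.decode(encoding)
    return [ch for ch in text if ch.islower()]


raw = "ä".encode("utf-8")             # bytes, like a Python 2 plain string
utf8_bytes = list(raw)
print(utf8_bytes)                     # [195, 164]

code_points = [ord(ch) for ch in raw.decode("utf-8")]
print(code_points)                    # [228]

cp437_value = "ä".encode("cp437")[0]  # the "extended ASCII" table entry
print(cp437_value)                    # 132

sample = "Äpfel ä".encode("utf-8")
lower = find_lowercase(sample)
print(ascii(lower))                   # ascii() keeps the output 7-bit safe

# Rough stand-in for the \p{Ll} class, using Unicode general categories.
ll_chars = [ch for ch in sample.decode("utf-8")
            if unicodedata.category(ch) == "Ll"]
print(ascii(ll_chars))
```

Decoding once at the input boundary and working on text afterwards is the same approach recommended above; the byte values only matter when checking which encoding the data arrived in.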

From sgarman at zenlinux.com Mon Oct 13 00:56:24 2014
From: sgarman at zenlinux.com (Scott Garman)
Date: Sun, 12 Oct 2014 15:56:24 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To: <20141012223558.GL12633@sentinelchicken.org>
References: <543AFA39.5000908@zenlinux.com> <20141012223558.GL12633@sentinelchicken.org>
Message-ID: <543B0718.60308@zenlinux.com>

What a great community. :) Thanks for this, Tim.

Scott

From fractalid at gmail.com Mon Oct 13 01:33:32 2014
From: fractalid at gmail.com (Nathan Miller)
Date: Sun, 12 Oct 2014 16:33:32 -0700
Subject: [portland] Umlats from another dimension
In-Reply-To: <543AFA39.5000908@zenlinux.com>
References: <543AFA39.5000908@zenlinux.com>
Message-ID:

Others have answered your question far better than I could have, but I
wanted to add a caution against clobbering built-ins. "str" is a python
built-in, the string type. In one-off code like this it's unlikely to cause
a problem, but overriding str or other built-ins this way in more complex
code can lead to very confusing bugs.

From software at azurestandard.com Mon Oct 27 18:15:00 2014
From: software at azurestandard.com (Jordan Schatz)
Date: Mon, 27 Oct 2014 10:15:00 -0700 (PDT)
Subject: [portland] Looking for Python, Django (AngularJS is nice too) developer
In-Reply-To: <684233776.568459.1414176771741.JavaMail.zimbra@azurestandard.com>
Message-ID: <465729715.569321.1414177745576.JavaMail.zimbra@azurestandard.com>

Hello,

The company I work for (Azure Standard) is looking for 1-2 full-time
Python developers.
The position(s) are telecommute-friendly (the majority of the team is
remote). The more Python, Django, Linux, and larger-system experience you
have, the better. We are also steadily using more AngularJS and JavaScript
on the front end, so it is a plus if you know that too.

I want to keep this brief for the list, so: Azure distributes organic and
bulk food, almost nationwide. We care about what we do, and have passionate
customers. The company is stable, we work flexible hours, and we embrace
(good) agile practices. There is a mix of tech, some modern and cool
(docker, websockets, ZMQ, real-time server push), some with much more of a
legacy feel to it (Django 1.4, with mods, slowly being pulled towards the
current version).

If interested please contact Susan at

Jordan Schatz | Sr. Software Engineer
Azure Standard