Py3.3 unicode literal and input()

Wed Jun 20 04:12:00 EDT 2012

On Jun 20, 1:21 am, Steven D'Aprano <steve+comp.lang.pyt... at pearwood.info> wrote:
> On Mon, 18 Jun 2012 07:00:01 -0700, jmfauth wrote:
> > On 18 June, 12:11, Steven D'Aprano <steve+comp.lang.pyt... at pearwood.info> wrote:
> >> On Mon, 18 Jun 2012 02:30:50 -0700, jmfauth wrote:
> >> > On 18 June, 10:28, Benjamin Kaplan <benjamin.kap... at case.edu> wrote:
> >> >> The u prefix is only there to
> >> >> make it easier to port a codebase from Python 2 to Python 3. It
> >> >> doesn't actually do anything.
>
> >> > It does. I showed it!
>
> >> Incorrect. You are assuming that Python 3 input eval's the input like
> >> Python 2 does. That is wrong. All you show is that the one-character
> >> string "a" is not equal to the four-character string "u'a'", which is
> >> hardly a surprise. You wouldn't expect the string "3" to equal the
> >> string "int('3')" would you?
>
> >> --
> >> Steven
>
> > A string is a string, a "piece of text", period.
>
> > I do not see why a unicode literal and a (well, I do not know what to
> > call it) "normal class <str>" should behave differently in source code
> > or as an answer to an input().
>
> They do not. As you showed earlier, in Python 3.3 the literal strings
> u'a' and 'a' have the same meaning: both create a one-character string
> containing the Unicode letter LOWERCASE-A.
>
> Note carefully that the quotation marks are not part of the string. They
> are delimiters. Python 3.3 allows you to create a string by using
> delimiters:
>
> ' '
> " "
> u' '
> u" "
>
> plus triple-quoted versions of the same. The delimiter is not part of the
> string. They are only there to mark the start and end of the string in
> source code so that Python can tell the difference between the string "a"
> and the variable named "a".
>
> Note carefully that quotation marks can exist inside strings:
>
> my_string = "This string has 'quotation marks'."
>
> The " at the start and end of the string literal are delimiters, not part
> of the string, but the internal ' characters *are* part of the string.
>
> When you read data from a file, or from the keyboard using input(),
> Python takes the data and returns a string. You don't need to enter
> delimiters, because there is no confusion between a string (all data you
> read) and other programming tokens.
>
> For example:
>
> py> s = input("Enter a string: ")
> Enter a string: 42
> py> print(s, type(s))
> 42 <class 'str'>
>
> Because what I type is automatically a string, I don't need to enclose it
> in quotation marks to distinguish it from the integer 42.
>
> py> s = input("Enter a string: ")
> Enter a string: This string has 'quotation marks'.
> py> print(s, type(s))
> This string has 'quotation marks'. <class 'str'>
>
> What you type is exactly what you get, no more, no less.
>
> If you type 42, you get the two character string "42" and not the int 42.
>
> If you type [1, 2, 3], then you get the nine character string "[1, 2, 3]"
> and not a list containing integers 1, 2 and 3.
>
> If you type 3**0.5 then you get the six character string "3**0.5" and not
> the float 1.7320508075688772.
>
> If you type u'a' then you get the four character string "u'a'" and not
> the single character 'a'.
>
> There is nothing new going on here. The behaviour of input() in Python 3,
> and raw_input() in Python 2, has not changed.
>
> > Should a user write two derived functions?
>
> > input_for_entering_text()
> > and
> > input_if_you_are_entering_a_text_as_litteral()
>
> If you, the programmer, want to force the user to write input in Python
> syntax, then yes, you have to write a function to do so. input() is very
> simple: it just reads strings exactly as typed. It is up to you to
> process those strings however you wish.
>
> --
> Steven
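
The point quoted above (input() returns exactly the characters typed, and
interpreting them as Python syntax is a separate step) can be sketched as
follows. This is only an illustration: the helper name parse_literal is
hypothetical, and using ast.literal_eval from the standard library is an
assumption about how one might do the parsing, not something proposed in
this thread.

```python
import ast


def parse_literal(text):
    """Try to read text as a Python literal (hypothetical helper).

    Returns the evaluated literal when text is valid literal syntax,
    otherwise returns the text unchanged.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a literal (plain words, an expression such as 3**0.5, ...):
        # keep the characters exactly as typed.
        return text


# Stand-in for what input() would return if the user typed u'a'.
typed = "u'a'"

print(len(typed))                       # the raw text has four characters
print(repr(parse_literal(typed)))       # parsed as a string literal
print(repr(parse_literal("42")))        # parsed as an int
print(repr(parse_literal("[1, 2, 3]"))) # parsed as a list
print(repr(parse_literal("3**0.5")))    # an expression, left as text
print(repr(parse_literal("éléphant")))  # plain text, left as text
```

Unlike eval(), ast.literal_eval only accepts literal syntax, so an
expression such as 3**0.5 is returned as text here rather than computed.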

Python 3.3.0a4 (v3.3.0a4:7c51388a3aa7+, May 31 2012, 20:15:21) [MSC v.1600 32 bit (Intel)] on win32
>>> ---
running smidzero.py...
...smidzero has been executed
>>> ---
input(':')
:éléphant
'éléphant'
>>> ---
input(':')
:u'éléphant'
'éléphant'
>>> ---
input(':')
:u'\u00e9l\xe9phant'
'éléphant'
>>> ---
input(':')
:u'\U000000e9léphant'
'éléphant'
>>> ---
input(':')
:\U000000e9léphant
'éléphant'
>>> ---
>>> ---
# this is expected
>>> ---
input(':')
:b'éléphant'
"b'éléphant'"
>>> ---
len(input(':'))
:b'éléphant'
11

---

Good news on the ru''/ur'' front:
http://bugs.python.org/issue15096

---

Finally, I'm just wondering whether this reintroduction of the unicode
literal is not a bad idea.

b'these_are_bytes'
u'this_is_a_unicode_string'

I have written all my Py2 code in a "unicode mode" since ... Py2.3 (?).

jmf