[IPython-dev] Pasting fix, unicode woes

Tue Sep 7 02:06:10 EDT 2010

Hey Evan,

I just fixed the paste-trailing-newline annoyance:

http://github.com/ipython/ipython/commit/92971904bc9fd2b988d8c16e9502edc39a70ff25

I think that approach is good, because it gives the user a chance to
edit the code before actually executing it, but otherwise needs just
a simple return to execute.

I do have one question though: why disallow unicode paste?  People are
quite likely to have non-ascii in their examples, and it seems odd to
block them from pasting it in.  Consider for example that I can't
paste this:

name = "Fernando Pérez"

I consider the fact that I can't type my own name into ipython a bug :)

I think the solution is to set the GUI encoding by default to UTF-8,
with an option for the user to change that according to their
preferences later.  I had a quick go at it, but it was getting too
complicated so I didn't commit anything anywhere.  Here's the diff in
case you find it useful as a starting point (I just reverted locally):

####
(newkernel)amirbar[qt]> git diff

diff --git a/IPython/frontend/qt/console/console_widget.py b/IPython/frontend/qt/console/console_widget.py
index d78cd63..f6ae9fd 100644
--- a/IPython/frontend/qt/console/console_widget.py
+++ b/IPython/frontend/qt/console/console_widget.py
@@ -10,7 +10,7 @@ from PyQt4 import QtCore, QtGui
 # Local imports
 from IPython.config.configurable import Configurable
 from IPython.frontend.qt.util import MetaQObjectHasTraits
-from IPython.utils.traitlets import Bool, Enum, Int
+from IPython.utils.traitlets import Bool, Enum, Int, Str
 from ansi_code_processor import QtAnsiCodeProcessor
 from completion_widget import CompletionWidget

@@ -37,6 +37,9 @@ class ConsoleWidget(Configurable, QtGui.QWidget):
     # non-positive number disables text truncation (not recommended).
     buffer_size = Int(500, config=True)

+    # The default encoding used by the GUI.
+    encoding = Str('utf-8')
+
     # Whether to use a list widget or plain text output for tab completion.
     gui_completion = Bool(False, config=True)

@@ -233,7 +236,7 @@ class ConsoleWidget(Configurable, QtGui.QWidget):
             text = QtGui.QApplication.clipboard().text()
             if not text.isEmpty():
                 try:
-                    str(text)
+                    text.encode(self.encoding)
                     return True
                 except UnicodeEncodeError:
                     pass
@@ -421,7 +424,8 @@ class ConsoleWidget(Configurable, QtGui.QWidget):
             try:
                 # Remove any trailing newline, which confuses the GUI and
                 # forces the user to backspace.
-                text = str(QtGui.QApplication.clipboard().text(mode)).rstrip()
+                raw = QtGui.QApplication.clipboard().text(mode).rstrip()
+                text = raw.encode(self.encoding)
             except UnicodeEncodeError:
                 pass
             else:
@@ -1034,7 +1038,7 @@ class ConsoleWidget(Configurable, QtGui.QWidget):
         cursor.movePosition(QtGui.QTextCursor.StartOfBlock)
         cursor.movePosition(QtGui.QTextCursor.EndOfBlock,
                             QtGui.QTextCursor.KeepAnchor)
-        return str(cursor.selection().toPlainText())
+        return unicode(cursor.selection().toPlainText()).encode(self.encoding)

     def _get_cursor(self):
         """ Convenience method that returns a cursor for the current position.
####

By the way, this isn't an odd corner case: in other countries, people
are likely to have files and directories with non-ASCII characters in
them *all the time*, so this problem will hit us immediately once the
code is out, I'm afraid.

I saw multiple calls of the form str(some.Qt.Code()) that were
throwing exceptions and decided to stop before I got myself too deep
into Qt code I don't know well.  But the right approach is probably to
encapsulate all those into a single common call that manages the
encoding.
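
To make that concrete, here is a rough sketch of what such a common
call could look like.  The name to_encoded is made up for illustration
and is not part of the widget; the sketch is written for Python 3,
where str is already unicode (on Python 2 with a PyQt4 QString you
would go through unicode(...) rather than str(...)).  Returning None on
failure is just one possible convention.

```python
def to_encoded(qt_text, encoding="utf-8"):
    """Hypothetical helper: convert a text object to bytes in `encoding`.

    Returns None when the text cannot be represented in that encoding,
    so callers have a single place to handle the failure.
    """
    # Assumption: str() yields the unicode text of the object.  That is
    # true for plain Python 3 strings; a Qt string type may need its own
    # conversion step here.
    text = str(qt_text)
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


name = "Fernando P\u00e9rez"
as_utf8 = to_encoded(name)            # bytes, the accented e takes two bytes
as_ascii = to_encoded(name, "ascii")  # None, ASCII cannot represent it
```

Every str(some.Qt.Code()) site would then go through this one function,
and the widget's encoding setting would be passed in as the second
argument.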

The tricky part, I suspect, will be to do the cursor positioning logic
with unicode in play: you need to correctly compute the lengths in
terms of characters on the unicode string (more precisely, the number
of glyphs that the code points map to), not bytes on the raw one.
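
As a small illustration of why the three counts differ, here is a
sketch in Python 3.  The function display_length is a made-up name, and
skipping combining marks is only an approximation of the glyph count:
it ignores double-width East Asian characters, among other things.

```python
import unicodedata


def display_length(text):
    """Approximate the number of visible characters in `text`.

    Normalizes to NFC, then counts code points that are not combining
    marks.  This is a rough stand-in for a real glyph count.
    """
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if unicodedata.combining(ch) == 0)


composed = "P\u00e9rez"     # the accented e as one precomposed code point
decomposed = "Pe\u0301rez"  # plain e followed by a combining acute accent

byte_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
code_point_lengths = (len(composed), len(decomposed))
display_lengths = (display_length(composed), display_length(decomposed))
```

Both strings look identical on screen, but only the last count agrees
for the two of them, and that is the one the cursor logic needs.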

Welcome to the wonderful world of unicode!

Cheers,

f

ps - and on py3k it's *only* unicode everywhere, so we might as well
get this code right from the get go.  Now that we have people starting
to help towards py3, the last thing we should do is write a ton of new
code that is unicode-unsafe for a py3 transition.  We're not writing
py3 code yet, but we should write *with an eye towards py3*.