Problems Writing £ (pound sterling) To MS SQL Server using pymssql

Mon Nov 17 10:25:15 EST 2008

On Mon, 2008-11-17 at 14:52 +0000, Darren Mansell wrote:
> Hi. 
> 
> I'm relatively new to python so please be gentle :)
> 
> I'm trying to write a £ symbol to an MS SQL server using pymsssql . This
> works but when selecting the data back (e.g. using SQL management
> studio) the £ symbol is replaced with Â£ (latin capital letter A with
> circumflex).
> 
> I can reproduce it like so:
> 
> >>> con = pymssql.connect(host='testdb',user='testuser',password='password',database='test')
> >>> sql = "insert into table_1 values ('£')"
> >>> cur.execute(sql)
> >>> con.commit()
> >>> sql = "insert into table_1 values ('1')"
> >>> cur.execute(sql)
> >>> con.commit()
> >>> sql = "select * from table_1"
> >>> cur.execute(sql)
> >>> cur.fetchall()
> [('\xc2\xa3',), ('1',)]
> 
> If I insert a £ symbol in using SQL management studio then select it back from Python I get this:
> 
> [('\xc2\xa3',), ('1',), ('\xa3',)]
> 
> If I look in SQL management studio it says this:
> 
> Â£ 
> 
> for the inserts I do through Python/pymssql.
> 
> Does anyone have any ideas whats happening and how to fix it?

What you are seeing is the character codes for £ in UTF-8.  The
information is being sent from your database as a series of bytes.
Unicode is an abstraction that consists of a series of characters (which
in UTF-8, can be made up of 1, 2, 3, or 4 bytes.  In order to get a
unicode object from a bytestring, you need to use the decode method.  If
you decode from Latin-1 ('\xc2\xa3'.decode('latin-1')), you get Â£, if
you decode from UTF-8 ('\xc2\xa3'.decode('utf-8')), you get £.  Note
that it is a pure coincidence that £ is present in the latin-1 decoding.

In binary, '\xa3' looks like this: 

10100011

or,
      10   100011

To encode it to UTF-8, you prepend 110 to the first byte, (the two 1s
indicate that the character will be two bytes long.  A three byte
character would start with 1110.)  The second byte (and third, if it
existed) start with 10, and the bits are packed as far to the right as
possible in that framework.

110xxxxx 10xxxxxx
      10   100011
11000010 10100011

1100 0010  1010 0011
   c    2     a    3

Reverse the process to get a unicode character from a UTF-8 bytestring.

See the wikipedia page on UTF-8 for a more complete explanation on how
UTF-8 encoding works.

> Thanks
>