[spambayes-bugs] [ spambayes-Bugs-901920 ] sb_dbexpimp.py barfs on 0xA3 char

SourceForge.net noreply at sourceforge.net
Tue Mar 16 16:57:56 EST 2004


Bugs item #901920, was opened at 2004-02-21 17:12
Message generated for change (Comment added) made by montanaro
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=901920&group_id=61702

Category: None
Group: Source code 1.0a9 (0.9)
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Dougie Lawson (dougielawson)
Assigned to: Skip Montanaro (montanaro)
Summary: sb_dbexpimp.py barfs on 0xA3 char

Initial Comment:
I've got a problem where I can't export my hammie.db
and re-import it with sb_dbexpimp.py.

The script barfs with "UnicodeDecodeError".

jerry:/etc/spambayes # ~/sb_dbexpimp.py -i -d new.db -f hammie.db.export
Importing database new.db using file hammie.db.export
Debug: current word: subject%3A%20%A3

Traceback (most recent call last):
  File "/root/broke.py", line 271, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/root/broke.py", line 203, in runImport
    word = uunquote(word)
  File "/root/broke.py", line 116, in uunquote
    return unicode(urllib.unquote(s), 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 9: unexpected code byte
jerry:/etc/spambayes #

I added this code to get the debugging output:

def uunquote(s):
    try:
        return unicode(urllib.unquote(s), 'utf-8')
    except UnicodeDecodeError, e:
        print "Debug: current word: %s\n" % s
        raise

jerry:/etc/spambayes # python -V
Python 2.3.3

0xA3 is a GBP currency symbol. The web interface
handles it OK.
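
The failure can be reproduced in isolation. The following is a minimal sketch for Python 3 (the report itself uses Python 2.3.3), and it assumes the token was stored as a Latin-1 byte, which is what reading 0xA3 as a pound sign implies. The variable names are illustrative only and are not taken from sb_dbexpimp.py.

```python
from urllib.parse import unquote_to_bytes

quoted = "subject%3A%20%A3"

# Undo the %-quoting without committing to a text encoding yet.
raw = unquote_to_bytes(quoted)

# A lone 0xA3 is a UTF-8 continuation byte, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the byte is Latin-1, where 0xA3 is the pound sign.
as_latin1 = raw.decode("latin-1")
```

Under that assumption the word decodes cleanly as Latin-1, which is consistent with the web interface displaying it correctly.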

I get these results for a word query on "subject: £"

Statistics for 'subject: £'
   Number of spam messages: 0.
   Number of ham messages: 2.
   Probability that a message containing this word is spam: 0.091837.

If you need a copy of the exported (100K) file just ask.

----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2004-03-16 15:57

Message:
Logged In: YES 
user_id=44345

Good to hear.  I checked in sb_dbexpimp.py 1.8 and compatcsv.py 
1.1 to deal with this issue.  The side effect is that the old 
interchange format can't be read anymore.  Hopefully this won't 
affect anyone.


----------------------------------------------------------------------

Comment By: Dougie Lawson (dougielawson)
Date: 2004-03-16 15:13

Message:
Logged In: YES 
user_id=965369

I've tried your sb_dbexpimp.py test fix. It's working on my
hammie.db w/Python 2.3.3.

Thanks for this.



----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-03-15 19:06

Message:
Logged In: YES 
user_id=44345

Whoops - forgot to remove a debug raise which forced use of 
compatcsv.py.


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-03-15 19:02

Message:
Logged In: YES 
user_id=44345

I'm attaching two files, a modified version of sb_dbexpimp.py and 
a compatcsv.py (only needed for Python 2.2) which has just 
enough csv juju to keep sb_dbexpimp.py happy.  Can you give 
them a try?  I intend to check them in soon but would appreciate a 
little extra vetting.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-03-15 17:44

Message:
Logged In: YES 
user_id=552329

Skip's looking into using the CSV module for sb_dbexpimp.py,
which would fix this, so assigning to him ;)
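
As a rough illustration of why a CSV-based interchange format would sidestep the quoting problem, here is a sketch in Python 3 using the standard csv module. It is not the code that was checked in. The rows are hypothetical (word, ham count, spam count) triples, and an in-memory buffer stands in for the export file.

```python
import csv
import io

# Hypothetical (word, ham count, spam count) rows, not from a real database.
rows = [("subject: \u00a3", 2, 0), ("subject:v\u00edagra", 0, 4)]

# Export: the csv writer handles delimiters and quoting itself,
# so no %-escaping of the word is needed.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)

# Import: read the rows back and restore the integer counts.
buf.seek(0)
restored = [(word, int(ham), int(spam)) for word, ham, spam in csv.reader(buf)]
```

With a real file, the equivalent would presumably be to open it with an explicit encoding and newline="" on both the export and the import side, so the bytes on disk are unambiguous.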

----------------------------------------------------------------------

Comment By: Dougie Lawson (dougielawson)
Date: 2004-02-21 17:33

Message:
Logged In: YES 
user_id=965369

I've checked through the whole export file and it barfs on
all these records.

subject%3A%20%A3`2`0`
subject%3A%20%A8`0`1`
subject%3A%20%B5%C4%CD%CB%D0%C5`0`1`
subject%3A%20%E9`0`1`
subject%3A%25%20%D0%CE`0`1`
subject%3A%2C%C0%B4%D7%D4%20`0`1`
subject%3A%A385.99`2`0`
subject%3A%B5%C4%CD%CB%D0%C5`0`1`
subject%3A%C0%20%40%20`0`1`
subject%3A%C0%B4%D7%D4`0`1`
subject%3A%CE%C0`0`1`
subject%3A%CF%B5%CD%B3%CD%CB%D0%C5`0`1`
subject%3A%D0%CESC0%DCNT`0`1`
subject%3A%DC`0`1`
subject%3A%E9`0`1`
subject%3A%E9nlarger`0`1`
subject%3A%ED`0`4`
subject%3AV%CE%C0GR%C0`0`1`
subject%3Ap%E9nis`0`1`
subject%3Av%EDagra`0`4`
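
A scan like the one described above could be automated along these lines. This is a sketch for Python 3, assuming the backquote-delimited record layout shown in the list. The first three records are copied from the list, the last is a hypothetical ASCII-only control, and decodes_as_utf8 is an illustrative helper that is not part of SpamBayes.

```python
from urllib.parse import unquote_to_bytes

records = [
    "subject%3A%20%A3`2`0`",
    "subject%3Ap%E9nis`0`1`",
    "subject%3Av%EDagra`0`4`",
    "subject%3Ahello`1`0`",  # hypothetical control record, plain ASCII
]

def decodes_as_utf8(record):
    """Return True if the record's %-quoted word is valid UTF-8."""
    word = record.split("`")[0]
    try:
        unquote_to_bytes(word).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Records whose word cannot be imported with a strict UTF-8 decode.
failing = [r for r in records if not decodes_as_utf8(r)]
```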


----------------------------------------------------------------------



