[ python-Bugs-1450212 ] int() and isdigit() accept non-digit numbers when using unic
SourceForge.net
noreply at sourceforge.net
Wed Mar 15 10:05:10 CET 2006
Bugs item #1450212, was opened at 2006-03-15 09:05
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1450212&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Pierre-Frédéric Caillaud (peufeu)
Assigned to: Nobody/Anonymous (nobody)
Summary: int() and isdigit() accept non-digit numbers when using unic
Initial Comment:
I ran into a very surprising bug this morning, in a Python script which
extracts numeric information from human-entered text.
The problem is the following: many Unicode characters, in
unicode strings, are considered to be digits. For instance, the
character "²" (does it appear on your screen? It's u'\xb2').
The output of the following command is pretty interesting:
print ''.join([x for x in map( unichr, xrange( 65536 )) if x.isdigit()])
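For readers on Python 3, where unichr and xrange no longer exist, the
command above can be approximated roughly as follows. This is only a
sketch: the helper name unicode_digits and its limit parameter are made
up for illustration, and the exact set of characters reported depends on
the Unicode database shipped with the interpreter.

```python
import unicodedata

def unicode_digits(limit=0x10000):
    """Return every character below `limit` for which str.isdigit() is true."""
    return [chr(cp) for cp in range(limit) if chr(cp).isdigit()]

digits = unicode_digits()
# Everything that isdigit() accepts but that is not one of the ten ASCII digits.
non_ascii = [c for c in digits if c not in "0123456789"]

print(len(digits), "digit characters,", len(non_ascii), "of them outside 0-9")
for c in non_ascii[:5]:
    print("U+%04X" % ord(c), unicodedata.name(c, "<no name>"))
```

Like the original one-liner, it only scans the first 65536 code points.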
Then int() will happily parse such a string:
int( u"٥٦٧٨٩۰۱۲" )
56789012
(I really hope this bug system supports unicode).
However, I can't write a=٥٦٧٨٩۰۱۲ in source code, for instance.
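A small Python 3 sketch of the distinction at play here, under the
assumption that the string in the report is made of ARABIC-INDIC digits
five to nine followed by EXTENDED ARABIC-INDIC digits zero to two (the
escapes below encode that reading). In a current Python 3, int() accepts
decimal digits from any script, while "²" passes isdigit() but is not a
decimal digit and is rejected by int().

```python
# Assumed code points for the string quoted in the report.
arabic = "\u0665\u0666\u0667\u0668\u0669\u06f0\u06f1\u06f2"
superscript_two = "\u00b2"

# Decimal digits from other scripts are converted to their numeric value.
print(int(arabic))

# isdigit() is broader than isdecimal(): superscripts count as digits only.
print(superscript_two.isdigit(), superscript_two.isdecimal())

try:
    int(superscript_two)
except ValueError as exc:
    print("int() rejected it:", exc)
```

So str.isdigit() alone is not a reliable guard in front of int().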
Philosophically, Python is right: these characters are probably all
digits, and it's pretty cool to be able to parse numbers written in
ARABIC-INDIC DIGITs (or something, as unicodedata.name says).
However, from a practical point of view, I guess most parsing done
with Python isn't on OCR'd cuneiform stone tablets, but rather on
modern computer documents...
Whenever a surface area (in m²) was near a phone number in my
human-entered text, the "²" would be absorbed as part of the phone
number, because u"²".isdigit() is True. Then bullshit phone numbers
would appear on the website.
Any number followed by a little footnote number will get the
footnote number embedded...
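The absorption described above can be reproduced with a deliberately
naive extractor. This is a Python 3 sketch, not the script from the
report: collect_digits, is_ascii_digit and the sample text (including
the phone number) are all invented for illustration.

```python
def collect_digits(text, is_digit):
    """Keep only the characters of `text` that `is_digit` accepts."""
    return "".join(ch for ch in text if is_digit(ch))

def is_ascii_digit(ch):
    """True only for the ten ASCII digits 0-9."""
    return ch in "0123456789"

# Made-up input: a surface unit sitting right before a made-up number.
sample = "m\u00b2 0612345678"

# With str.isdigit the superscript two is swallowed into the number.
print(collect_digits(sample, str.isdigit))
# Restricting the test to 0-9 leaves only the intended digits.
print(collect_digits(sample, is_ascii_digit))
```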
I had to replace all the .isdigit() calls with
re.compile( ur"^\d+$" ).match(). Interestingly, for re, even on unicode
strings, \d is 0-9 and nothing else (at least as long as the re.UNICODE
flag is not set).
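One caveat for anyone porting this workaround to Python 3: there, \d on
str patterns matches Unicode decimal digits by default, so the pattern
needs re.ASCII or an explicit [0-9] class to keep the behaviour described
above. A minimal sketch, where is_plain_number is a hypothetical helper
name:

```python
import re

# An explicit character class is ASCII-only regardless of flags.
ASCII_DIGITS = re.compile(r"[0-9]+")

def is_plain_number(text):
    """True if `text` consists solely of one or more ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(text) is not None

print(is_plain_number("56789012"))
print(is_plain_number("12\u00b2"))
print(is_plain_number("\u0665\u0666"))

# In Python 3, plain \d also accepts decimal digits from other scripts...
print(re.fullmatch(r"\d+", "\u0665\u0666") is not None)
# ...unless the ASCII flag is given.
print(re.fullmatch(r"\d+", "\u0665\u0666", re.ASCII) is not None)
```

fullmatch is used rather than ^...$ because $ also matches just before a
trailing newline.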
At least, it would be normal for int() to raise an exception when fed
this type of data. Please.
----------------------------------------------------------------------