From carles at pina.cat Fri May 7 22:29:52 2010
From: carles at pina.cat (Carles Pina i Estany)
Date: Fri, 7 May 2010 21:29:52 +0100
Subject: [python-uk] Bayesian filter
Message-ID: <20100507202952.GA8906@pina.cat>

Hello,

(scroll to PYTHON TEST for a test)

I've taken another look at the Bayesian filter (it was not my "task" :-)
but it's my pleasure).

OK, to start: Reverend tokenizes the training texts and works only at
the token level, not at the sub-token level. So we should not expect it
to detect c0mputer as computer (quite a common mistake yesterday, I
think).

(I was writing a high-level mathematical description, but I will postpone
it until I have checked some more things, or just leave it for the
Pythoner who will do it for the next Meetup ;-) )

PYTHON TEST

carles@pinux:~/bayes$ ls training/
bash  c++  python

Each of these directories contains between 18 and 29 files that I've
copied randomly from different places on my hard disk.

Then I have:

carles@pinux:~/bayes$ ls guessing/
demanar.py  keymap.sh  medium.py  qdacco.cpp
carles@pinux:~/bayes$

some other files that I've copied there... The Bayesian filter never
knows the name of the file.

Using just this set for training, look at the results:

----- Start test
./guessing/qdacco.cpp
[('c++', 0.6590693537529797), ('python', 0.59287521198182513), ('bash', 0.28091954259046653)]
./guessing/demanar.py
[('python', 0.58882188718297557), ('c++', 0.57869106382644175), ('bash', 0.36380374534210203)]
./guessing/keymap.sh
[('bash', 0.54270073170250122), ('c++', 0.47142124856042872), ('python', 0.36321294599284148)]
./guessing/main.py
[('python', 0.65909707358336711), ('c++', 0.52731742496139433), ('bash', 0.3261511618248264)]

I consider it quite good. bayes.py is 30 lines long (could be less) and
it works pretty well, even when it only has parts of the program (don't
tell me to check for #include , #!/bin/bash or #!/usr/bin/python: that is
not needed at all, it works with snippets of code, etc.)
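For readers who want to try the idea without Reverend, the experiment above can be approximated with a minimal token-level classifier. This is only a sketch under stated assumptions: it is not bayes.py (which was not posted) and it does not reproduce Reverend's scoring, so its numbers will not match the output above. The class name TokenBayes, the whitespace tokenizer, the add-one smoothing and the tiny training strings are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: split on whitespace only, so each token is an opaque
    # string and "c0mputer" is unrelated to "computer".
    return text.split()


class TokenBayes:
    """Hypothetical plain naive Bayes over tokens (not Reverend's API)."""

    def __init__(self):
        # category -> token -> number of times seen in training
        self.counts = defaultdict(Counter)

    def train(self, category, text):
        self.counts[category].update(tokenize(text))

    def guess(self, text):
        # Requires at least one trained category.
        vocab = {tok for counter in self.counts.values() for tok in counter}
        scores = {}
        for category, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of log-likelihoods with add-one smoothing and a
            # uniform prior over categories.
            scores[category] = sum(
                math.log((counter[tok] + 1) / (total + len(vocab)))
                for tok in tokenize(text)
            )
        # Turn the log scores into probabilities that sum to 1.
        top = max(scores.values())
        weights = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(weights.values())
        return sorted(
            ((c, w / norm) for c, w in weights.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )


guesser = TokenBayes()
guesser.train("python", "import random\ndef main():\n    print(random.randint(1, 6))")
guesser.train("bash", "#!/bin/bash\nfor f in *.txt; do echo $f; done")
guesser.train("c++", "#include <iostream>\nint main() { std::cout << 1; return 0; }")

print(guesser.guess("import random\ndef pick():\n    pass"))
```

With real training directories, each file's contents would be passed to train() under its directory name. Unlike the scores shown above, the probabilities returned here are normalised to sum to 1.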
Yes, there is one case where it guesses Python with c++ not far behind.
I probably need a bigger data set, but even so, if it guesses "quite
well" then it is "quite good" :)

(I'm thinking, for example, of some service like pastebin that would
guess the language of the code that you are pasting there and, if you
change the guess, could train itself with the new code.)

My training sets are very noisy, and I should subclass Reverend and
improve the tokenizer to use "=", "(", ")" and other characters as
separators, since at the moment a line like:

linia=random.randint(1,float(total_paraules))

is one token... The literals should probably be removed as well.

I'm taking a look at the statistics part. Here is a good start:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Cheers,

--
Carles Pina i Estany
http://pinux.info

From tom at tomdunham.org Sat May 8 00:13:58 2010
From: tom at tomdunham.org (Thomas Dunham)
Date: Fri, 7 May 2010 23:13:58 +0100
Subject: [python-uk] Bayesian filter
In-Reply-To: <20100507202952.GA8906@pina.cat>
References: <20100507202952.GA8906@pina.cat>
Message-ID:

Thanks Carles, will try to force some of this into my head this
weekend....

From carles at pina.cat Sat May 8 02:10:43 2010
From: carles at pina.cat (Carles Pina i Estany)
Date: Sat, 8 May 2010 01:10:43 +0100
Subject: [python-uk] Bayesian filter
In-Reply-To:
References: <20100507202952.GA8906@pina.cat>
Message-ID: <20100508001043.GA12615@pina.cat>

Hi,

On May/07/2010, Thomas Dunham wrote:

> Thanks Carles, will try to force some of this into my head this
> weekend....

good! For me one of the keys is in:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Just above "Using the Bayesian result".
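For reference, the standard naive Bayes result for classifying a document can be written as below. This is a sketch in notation chosen here, not copied from the article: C is a category, w_1 ... w_n are the tokens of the document, and the tokens are assumed to be conditionally independent given the category. It says nothing about how Reverend itself combines the numbers.

```latex
P(C \mid w_1, \dots, w_n)
  = \frac{P(C) \prod_{i=1}^{n} P(w_i \mid C)}
         {\sum_{C'} P(C') \prod_{i=1}^{n} P(w_i \mid C')}
```

The denominator is the same for every category, so comparing categories only requires the numerator: the prior multiplied by the per-token probabilities.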
I can read the formulas like this:
- The probability of the document being spam is the product of the
  probabilities of each individual word of the document being spam

Also interesting:
http://en.wikipedia.org/wiki/Bayesian_spam_filtering

where it talks about "Combining individual probabilities" (it discusses
the assumptions and links to the previous Wikipedia article).

The other key is in the file reverend/thomas.py, in buildCache, where it
computes the probability of each token belonging to each group. The
thing is that it does some "magic" with the metrics that, at the moment,
I'm not following very well (what it does and why it is needed).

So, at a very high level, it does:

Training:
- Tokenizes the input
- Saves how many times each word appears in the corpus

buildCache (so, part of guessing if no more training is done):
- Calculates, per token, how likely it is to be in each category (and
  something else that I'm not following, with the good and badMetric)

guesser:
- Tokenizes the new input
- Combines the probabilities of each token of the input, using the
  cache to know how likely this token is to be in each category.

I think that this is a very high-level design, with some mistakes for
sure. If someone can calculate one example by hand and get the same
result as Reverend, they would get some extra points :-D

I'm only quite confused by some things in buildCache...

--
Carles Pina i Estany
http://pinux.info

From ntoll at ntoll.org Mon May 10 11:49:23 2010
From: ntoll at ntoll.org (Nicholas Tollervey)
Date: Mon, 10 May 2010 10:49:23 +0100
Subject: [python-uk] 10th London Python Code Dojo
Message-ID:

Folks,

The next London Python code dojo will take place on Thursday 3rd June at
6:30pm. You can find out more and book your place here:

http://ldnpydojo.eventwax.com/10th-london-python-code-dojo

After five dojos of building an adventure game we all decided to take a
break and organise a "talks" night (with some of the talks discussing
adventure game related stuff).
Don't let this fool you into thinking it'll just be a sequence of
presentations: because of the participatory nature of the dojo we
encourage attendees to interrupt, ask questions, code along and
generally interact with what's going on. Think of it more as a set of
seminars rather than presentations.

So what about the talks..? (In no particular order and subject to
change):

* Tim Golden - Pyro (http://pyro.sourceforge.net/)
* Dave Kirby - Twisted (http://twistedmatrix.com/trac/)
* John Ribbens - john.py (a web framework)
* Andy Kilner - CMD (http://docs.python.org/library/cmd.html)
* Nicholas Tollervey - Fluiddb (http://fluidinfo.com)
* Tom Dunham - The bayesian problems encountered in the adventure game

Pizza and beer start at 6:30pm and the talks themselves will start at
the earlier time of 7-7:15ish. We aim to finish 9:30ish.

Photos from the 9th Dojo can be found here:
http://www.flickr.com/photos/11306102@N05/sets/72157623906491735/

Free pizza and beer will be provided. (Thanks Fry-IT)

Participants get the chance to win a cool book (thanks O'Reilly).

Look forward to seeing you there!

Nicholas.

From tony at tonyibbs.co.uk Thu May 20 20:36:35 2010
From: tony at tonyibbs.co.uk (Tony Ibbs)
Date: Thu, 20 May 2010 19:36:35 +0100
Subject: [python-uk] Next Cambridge & East Anglia Meeting: Tue 8th June
Message-ID: <12126546-212C-428D-8FD1-4DA04E9FBC1F@tonyibbs.co.uk>

As discussed on the CamPUG google group, the next meeting will (again)
be delayed, mainly because of people being on holiday.

As it says there:

Assuming this is OK with Tom and Robin, the next meeting (which should
be a Code Dojo meeting) will be:

7.30pm, Tue 8th June at RealVNC (http://tinyurl.com/realvncoffices).

The meeting after that (back on schedule) should be Tuesday 6th July.

EuroPython is then 19th - 22nd July.

I think we *should* be OK to hold the meeting after that on Tuesday 3rd
August, but I shall be away on holiday (again!).

Hope that all makes sense...
Tibs

From ntoll at ntoll.org Thu May 27 11:26:19 2010
From: ntoll at ntoll.org (Nicholas Tollervey)
Date: Thu, 27 May 2010 10:26:19 +0100
Subject: [python-uk] Reminder: Next London Python Dojo a week today
Message-ID: <5C391D85-CE1D-4B60-A879-9250F78C0AD4@ntoll.org>

Folks,

Just the obligatory week's notice that the next London Python Code Dojo
is happening on the 3rd June at 6:30pm at the offices of Fry-IT. Details
and sign-up can be found here:

http://ldnpydojo.eventwax.com/10th-london-python-code-dojo

It'll be another "talks" night with several speakers presenting on
subjects related to the problems, code and "features" we encountered
whilst writing the adventure game (and some un-related talks too). As
this is the dojo we're encouraging audience participation, questions,
comments and code-along. They'll be more like seminars than lectures in
style and delivery.

Finally, I promise to get *MORE* pizza and beer this time round. Either
the new pizza supplier produces better pizza, so more were eaten, OR
their pizzas are smaller than the old place's. In any case, we'll be
snowed under with pizza this time round.

Pizza delivery is at 6:30pm. If you want any, get there before Bruce
does (he has two dojos' worth of pizza to get through).

Looking forward to it,

Nicholas.

From fuzzyman at voidspace.org.uk Thu May 27 15:26:30 2010
From: fuzzyman at voidspace.org.uk (Michael Foord)
Date: Thu, 27 May 2010 14:26:30 +0100
Subject: [python-uk] Northampton Geek Meet - tonight
Message-ID: <4BFE7306.4060409@voidspace.org.uk>

Hello guys,

Sorry for the short notice, the next Northampton Geek Meet is *tonight*
at 7.30pm, at the Malt Shovel Pub Northampton.

If you want more timely warnings then you can follow @northantsgeeks on
twitter :-)

http://twitter.com/northantsgeeks

The Malt Shovel Pub can be found at:

http://www.maltshoveltavern.com/

All the best,

Michael Foord

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY.
By accepting and reading this email you agree, on behalf of your
employer, to release me from all obligations and waivers arising from
any and all NON-NEGOTIATED agreements, licenses, terms-of-service,
shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure,
non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have
entered into with your employer, its partners, licensors, agents and
assigns, in perpetuity, without prejudice to my ongoing rights and
privileges. You further represent that you have the authority to release
me from any BOGUS AGREEMENTS on behalf of your employer.

From michael at voidspace.org.uk Thu May 27 15:11:19 2010
From: michael at voidspace.org.uk (Michael Foord)
Date: Thu, 27 May 2010 14:11:19 +0100
Subject: [python-uk] Northampton Geek Meet - tonight
Message-ID: <4BFE6F77.2050005@voidspace.org.uk>

Hello guys,

Sorry for the short notice, the next Northampton Geek Meet is *tonight*
at 7.30pm, at the Malt Shovel Pub Northampton.

If you want more timely warnings then you can follow @northantsgeeks on
twitter :-)

http://twitter.com/northantsgeeks

The Malt Shovel Pub can be found at:

http://www.maltshoveltavern.com/

All the best,

Michael Foord

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf
of your employer, to release me from all obligations and waivers arising
from any and all NON-NEGOTIATED agreements, licenses, terms-of-service,
shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure,
non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have
entered into with your employer, its partners, licensors, agents and
assigns, in perpetuity, without prejudice to my ongoing rights and
privileges. You further represent that you have the authority to release
me from any BOGUS AGREEMENTS on behalf of your employer.