Review Request of Python Code

Joaquin Alzola Joaquin.Alzola at lebara.com
Thu Mar 10 14:12:37 EST 2016


SQL doesn't allow decimal numbers for LIMIT.
It may still work if you use them, but it is not the proper way.
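
For example, the limit values can be passed as plain integer parameters; a minimal sketch, reusing the table and connection details from the quoted code:

import MySQLdb

db = MySQLdb.connect(host="localhost", user="*****", passwd="*****",
                     db="abcd_efgh")
cur = db.cursor()
offset, count = 0, 50000            # plain integers, no decimals
cur.execute("SELECT * FROM newsinput LIMIT %s, %s", (offset, count))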

Also, clean up your code a bit and remove the commented-out lines (#).

-----Original Message-----
From: Python-list [mailto:python-list-bounces+joaquin.alzola=lebara.com at python.org] On Behalf Of subhabangalore at gmail.com
Sent: 10 March 2016 18:12
To: python-list at python.org
Subject: Re: Review Request of Python Code

On Wednesday, March 9, 2016 at 9:49:17 AM UTC+5:30, subhaba... at gmail.com wrote:
> Dear Group,
>
> I am trying to write code that pulls data from MySQL at the backend, annotates words, and writes out the results as separate sentences, one per line. The code generally runs fine, but I feel the final step of producing the sentences could be better; it is okay for small data sets, but with 50,000 news articles it performs dead slow. I am using Python 2.7.11 on Windows 7 with 8 GB RAM.
>
> I am trying to copy the code here, for your kind review.
>
> import MySQLdb
> import nltk
> def sql_connect_NewTest1():
>     db = MySQLdb.connect(host="localhost",
>                      user="*****",
>                      passwd="*****",
>                      db="abcd_efgh")
>     cur = db.cursor()
>     #cur.execute("SELECT * FROM newsinput limit 0,50000;") #REPORTING RUNTIME ERROR
>     cur.execute("SELECT * FROM newsinput limit 0,50;")
>     dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE
>     dict_read=dict_open.read()
>     dict_word=dict_read.split()
>     a4=dict_word #Assignment for code.
>     list1=[]
>     flist1=[]
>     nlist=[]
>     for row in cur.fetchall():
>         #print row[2]
>         var1=row[3]
>         #print var1 #Printing lines
>         #var2=len(var1) # Length of file
>         var3=var1.split(".") #SPLITTING INTO LINES
>         #print var3 #Printing The Lines
>         #list1.append(var1)
>         var4=len(var3) #Number of all lines
>         #print "No",var4
>         for line in var3:
>             #print line
>             #flist1.append(line)
>             linew=line.split()
>             for word in linew:
>                 if word in a4:
>                     windex=a4.index(word)
>                     windex1=windex+1
>                     word1=a4[windex1]
>                     word2=word+"/"+word1
>                     nlist.append(word2)
>                     #print list1
>                     #print nlist
>                 elif word not in a4:
>                     word3=word+"/"+"NA"
>                     nlist.append(word3)
>                     #print list1
>                     #print nlist
>                 else:
>                     print "None"
>
>     #print "###",flist1
>     #print len(flist1)
>     #db.close()
>     #print nlist
>     lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)] #TRYING TO SPLIT THE RESULTS AS SENTENCES
>     nlist1=lol(nlist,7)
>     #print nlist1
>     for i in nlist1:
>         string1=" ".join(i)
>         print i
>         #print string1
>
>
> Thanks in Advance.

****************************************************************************
Dear Group,

Thank you all, for your kind time and all suggestions in helping me.

Thank you Steve for writing the whole code. It is working fully and fine, but speed is still an issue; we need to speed it up.

Inada, I tried changing to
cur = db.cursor(MySQLdb.cursors.SSCursor), but my System Admin said that may not be the issue.
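
For reference, this is roughly how a server-side cursor would be wired up with MySQLdb; it streams rows from the server instead of holding all of them in memory at once, though it would not help the per-word lookup cost. A sketch, using the connection details from the quoted code; process() is a hypothetical per-row handler:

import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(host="localhost", user="*****", passwd="*****",
                     db="abcd_efgh",
                     cursorclass=MySQLdb.cursors.SSCursor)
cur = db.cursor()
cur.execute("SELECT * FROM newsinput LIMIT 0, 50000")
for row in cur:          # rows are fetched from the server as you iterate
    process(row)         # process() is a hypothetical per-row handler
cur.close()
db.close()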

Freidrich, my problem is that I have a big repository of .txt files stored in MySQL at the backend. I have another list of words with their possible tags. The tags are not conventional Parts of Speech (PoS) tags; they are custom tags defined by others.
The code is expected to read each file and each of its lines.
On reading each line, it scans the list for the appropriate tag; if one is found it assigns it, otherwise it assigns NA.
The assignment should be in the format word/tag, so that a string of n words looks like: w1/tag w2/tag w3/tag w4/tag ....wn/tag,

where each tag is either a tag from the list or NA, as the situation requires.
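
To illustrate the format concretely: a dict lookup produces the same word/tag output and avoids scanning the word list for every token, which is likely the main slowdown in the quoted code (each "word in a4" and a4.index(word) is a linear search). A minimal sketch; tag_map is a hypothetical dict built from NewTotalTag.txt, as sketched after the file description further below:

def tag_line(line, tag_map):
    # Look each word up in the dict; unknown words get "NA".
    return " ".join(word + "/" + tag_map.get(word, "NA") for word in line.split())

# Hypothetical usage:
# tag_map = {"w1": "tag1", "w2": "tag2"}
# print tag_line("w1 w2 w5", tag_map)     # -> w1/tag1 w2/tag2 w5/NA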

This format is chosen because the files are expected to be tagged in Brown Corpus format. There is a Python library named NLTK;
if I want to save my data for use with its models, I need to follow certain specifications. I want to use it in Tagged Corpus format.

Now the tagged data coming out in this format should be one tagged sentence per line, or a lattice.

They expect the data to be saved in .pos format, but I am not doing that in this code at present; I may do it later.
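
A rough sketch of what that could look like: write one tagged sentence per line to a .pos file and read it back with NLTK's TaggedCorpusReader, whose default word/tag separator is "/". The file name and sample sentences here are hypothetical:

from nltk.corpus.reader import TaggedCorpusReader

tagged_sentences = ["w1/tag1 w2/NA w3/tag3", "w4/tag4 w5/NA"]   # sample data
out = open("/python27/tagged_news.pos", "w")                    # hypothetical file
for sent in tagged_sentences:
    out.write(sent + "\n")                 # one tagged sentence per line
out.close()

reader = TaggedCorpusReader("/python27", r"tagged_news\.pos")
print reader.tagged_sents()[0]             # [(word, tag), ...] for the first sentence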

Please let me know if I need to give any more information.

Matt, thank you for the if...else suggestion. The data in NewTotalTag.txt is a simple list of words with unconventional tags, like:

w1 tag1
w2 tag2
w3 tag3
...
...
w3  tag3

like that.
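
Given that layout, the whole file can be loaded once into a dict keyed by word; a minimal sketch reusing the path from the quoted code (it does the same split() the current code already does, just pairing up the tokens):

def load_tag_map(path="/python27/NewTotalTag.txt"):
    # Split the whole file on whitespace and pair up word/tag tokens.
    tokens = open(path).read().split()
    return dict(zip(tokens[0::2], tokens[1::2]))

# tag_map = load_tag_map()
# tag_map.get("w1", "NA")    # -> "tag1" if the word is listed, otherwise "NA"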

Regards,
Subhabrata


--
https://mail.python.org/mailman/listinfo/python-list


