[Tutor] Word List

Emad Nawfal emadnawfal at gmail.com
Sun Mar 9 18:09:58 CET 2008


Dear Tutors,
> I'm trying to get the most frequent words in an Arabic text. I wrote the
> following code and tried it on English and it works fine, but when I try it
> on Arabic, all I get is the slashes and x's. I'm not familiar with Unicode.
> Could somebody please tell me what's wrong here, and how I can get the
> actual Arabic words?
> Thank you in anticipation
>
>
> import codecs
>
> # Read the whole UTF-8 file and split it into words on whitespace.
> infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt',
>                      'r', 'utf-8').read().split()
> num = {}
> for word in infile:
>     if word not in num:
>         num[word] = 1
>     else:
>         num[word] += 1
> # Pair each count with its word and sort from most to least frequent.
> new = zip(num.values(), num.keys())
> new.sort()
> new.reverse()
> outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt',
>                       'w', 'utf-8')
> for word in new:
>     print >> outfile, word
> outfile.close()
>
>
> --
> لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
> الغزالي
> "No victim has ever been more repressed and alienated than the truth"
>
> Emad Soliman Nawfal
> Indiana University, Bloomington
> http://emnawfal.googlepages.com
> --------------------------------------------------------
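
A possible explanation, offered as a hedged note and not as a confirmed diagnosis: in Python 2, printing a tuple prints the repr of each element, and the repr of a unicode string shows non-ASCII characters as backslash escapes. Each item in the sorted list above is a (count, word) tuple, so the output file would contain escapes instead of Arabic letters. The sketch below writes the word and the count as text instead. It assumes Python 3, and the names count_words and write_counts and the sample string are hypothetical, chosen only for illustration.

```python
import io
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    # Counter does the tallying; most_common() sorts by descending count.
    return Counter(text.split()).most_common()


def write_counts(pairs, path):
    """Write one 'word<TAB>count' line per pair, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for word, count in pairs:
            # Format the word itself, not a tuple, so no repr escapes appear.
            out.write(u'%s\t%d\n' % (word, count))


# Hypothetical sample text standing in for the contents of the input file.
sample = u'سلام عليكم سلام'
pairs = count_words(sample)
```

In the Python 2 code above, the same idea would amount to unpacking each tuple in the final loop and writing a formatted unicode string to outfile instead of printing the tuple.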



