[FAQTS] Python Knowledge Base Update -- August 25th, 2000
Fiona Czuczman
fiona at sitegnome.com
Fri Aug 25 00:05:54 EDT 2000
Hi All,
Here are the entries I've entered today into http://python.faqts.com
cheers,
Fiona
## New Entries #################################################
-------------------------------------------------------------
Converting a html table to a tree.
http://www.faqts.com/knowledge-base/view.phtml/aid/5515
-------------------------------------------------------------
Fiona Czuczman
Alex Martelli
Problem:
I'm completely new to python...just started reading learning python.
I've got 5 days to figure out how to write a script to take an html
table and convert it to a tree....basically a nested array (that's my
guess on how it would be done anyhow). Oh yeah...and I also have to
drive 3000 miles in those 5 days ;)p
I was hoping someone could give me a push in the right direction. What
functions or whatever should I look at to get this done? I saw that in
the book it mentions Python's ability to grab html and parse it, as one
of the pluses of the language ('internet utility modules' is what the
book called it/them).
And I just found I don't have to actually go out and grab the page off a
webserver. It'll be a file residing on the machine where the script
will be run. I assume that'll make it a little easier for me :)
Solution:
Standard modules htmllib and sgmllib will indeed help you.
> webserver. It'll be a file residing on the machine where the script
> run. I assume that'll make it a little easier for me :)
Not by much, since getting data off an arbitrary URL is so easy with
Python, but, yes, you can reduce your 'main program' to two lines:
parser.feed(open('myfile.html').read())
parser.close()
once you have properly instantiated the 'parser' instance you need.
Basically, you want to derive your class from htmllib.HTMLParser, and
add methods to handle the tags you're specifically interested in -- for
the problem you stated, table-related tags.
For any tag-name FOO, you need to define in your class, either one
method:
    def do_foo(self, attributes):
        # do whatever
if the tag does not require a corresponding close-tag (e.g., <br>); or,
more commonly, two methods:
    def start_foo(self, attributes):
        # opening stuff
    def end_foo(self):
        # closing stuff
if both an opening and a closing tag will be there (<table> ... </table>,
and similar cases).
The 'attributes' argument is a (possibly empty) list of (name,value)
pairs.
Further, you'll want to define a method in your class:
    def handle_data(self, data):
        # whatever
that will receive all textual data. Of course, you'll have flags
you maintain on start/end methods telling you whether the data must
simply be discarded, or how it is to be processed if relevant.
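To make the flag idea concrete, here is a minimal sketch. Note the assumptions: htmllib no longer exists in Python 3, so this uses html.parser.HTMLParser instead, where a single handle_starttag/handle_endtag pair takes the place of the per-tag start_foo/end_foo methods. The LinkCollector class and collect_links function are made-up names for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []         # finished (href, text) pairs
        self.in_link = False    # flag: are we inside an <a> element?
        self.href = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a (possibly empty) list of (name, value) pairs
        if tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append((self.href, "".join(self.text)))
            self.in_link = False

    def handle_data(self, data):
        # textual data outside <a> elements is simply discarded
        if self.in_link:
            self.text.append(data)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(collect_links('<p>See <a href="http://python.org">the site</a> now</p>'))
```

The flag set in the start handler and cleared in the end handler is what lets handle_data decide whether a piece of text matters.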
Now to your specific case. The tags you may want to handle are:
TABLE, CAPTION, COL, COLGROUP, TBODY, TD, TFOOT, TH, THEAD, TR.
(have I missed any...?). COL, I believe, is the only one that
does not require a closing-tag (although I think COLGROUP has an
_optional_ closing-tag if COL elements are not nested in it, but
I'm not sure). Which of these tags carry significant information
for your purposes...?
The general structure might be:
    TABLE
        CAPTION
        THEAD
        TBODY
        TFOOT
CAPTION is optional. So is each of THEAD, TBODY, TFOOT: if none
is explicitly specified, TBODY is implied. Each of THEAD, TBODY,
TFOOT has contents:
    THEAD|TBODY|TFOOT:
        TR
            TH
            TD
Zero or more TH and TD within each TR, zero or more TR's.
I'm skipping COL, COLGROUP, and the attributes, as I think they
are basically presentational only, and you seem interested in
content-structuring instead.
Now, we need more precise specs: what kinds of tables do you
need to parse, and how do you want to structure (and output?)
the data they contain, depending on caption/thead/tbody/tfoot
and tr/th/td issues...?
Let's take a very simple case to make things more definite.
We process TABLE elements where only TBODY is interesting --
THEAD and TFOOT, we skip silently. Similarly, we skip TH
and its contents too: we're only interested in:
    TABLE
        TBODY           (may be implied)
            TR          (zero or more)
                TD      (zero or more)
                    data contents of TD tags only
As a result, we return a list (Python's normal data structure
for sequences; 'array' is very specialized in Python) where
each element corresponds to one row (TR); each element in
the list is another, nested, list, where each element
corresponds to the data in a TD, in sequence.
Our class will expect to be 'fed' a document fragment
containing exactly one TABLE (the TABLE tag will have
to be explicit), and will ignore anything outside of
that tag as well as any redundant or nested TABLE tags
that may also be present. This is basically for simplicity;
you will have to think deep about what you want to do
in each of these cases! And add good error diagnosis...
We'll also basically assume decent nesting rather than
go out of our way to accept peculiarly structured tables; this, too,
will need in-depth review!
    import htmllib
    import formatter
    import string
    import pprint

    class TableParser(htmllib.HTMLParser):
        def __init__(self):
            self.active=0
            self.finished=0
            self.skipping=0
            self.result=[]
            self.current_row=[]
            self.current_data=[]
            htmllib.HTMLParser.__init__(
                self, formatter.NullFormatter())
        def start_table(self,attributes):
            if not self.finished:
                self.active=1
        def end_table(self):
            self.active=0
            self.finished=1
        def start_tbody(self,attributes):
            self.skipping=0
        def end_tbody(self):
            self.skipping=1
        def start_thead(self,attributes):
            self.skipping=1
        def end_thead(self):
            self.skipping=0
        def start_tfoot(self,attributes):
            self.skipping=1
        def end_tfoot(self):
            self.skipping=0
        def start_caption(self,attributes):
            self.skipping=1
        def end_caption(self):
            self.skipping=0
        def start_th(self,attributes):
            self.skipping=self.skipping+1
        def end_th(self):
            self.skipping=self.skipping-1
        def start_tr(self,attributes):
            if self.active and not self.skipping:
                self.current_row = []
        def end_tr(self):
            if self.active and not self.skipping:
                self.result.append(self.current_row)
        def start_td(self,attributes):
            if self.active and not self.skipping:
                self.current_data = []
        def end_td(self):
            if self.active and not self.skipping:
                self.current_row.append(
                    string.join(self.current_data))
        def handle_data(self, data):
            if self.active and not self.skipping:
                self.current_data.append(data)

    def process(filename):
        parser=TableParser()
        parser.feed(open(filename).read())
        parser.close()
        return parser.result

    def showparse(filename):
        pprint.pprint(process(filename))

    def _test():
        return showparse('c:/atable.htm')

    if __name__=='__main__':
        _test()
With c:/atable.htm contents being, for example:
<TABLE BORDER=1 WIDTH=80%>
<THEAD>
<TR>
<TH>Heading 1</TH>
<TH>Heading 2</TH>
</TR>
</THEAD>
<TBODY>
<TR>
<TD>Row 1, Column 1 text.</TD>
<TD>Row 1, Column 2 text.</TD>
</TR>
<TR>
<TD>Row 2, Column 1 text.</TD>
<TD>Row 2, Column 2 text.</TD>
</TR>
</TBODY>
</TABLE>
running the _test function will emit:
>>> tableparse._test()
[['Row 1, Column 1 text.', 'Row 1, Column 2 text.'],
['Row 2, Column 1 text.', 'Row 2, Column 2 text.']]
>>>
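For readers on Python 3, where htmllib is gone, roughly the same idea can be sketched with html.parser.HTMLParser. This is an approximation under stated assumptions, not a line-for-line port: it keeps a single skip counter for THEAD, TFOOT, CAPTION and TH, joins the text of a cell without a separator, and takes the HTML as a string through a made-up parse_table helper.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    # Elements whose contents are skipped silently.
    SKIPPED = {"thead", "tfoot", "caption", "th"}

    def __init__(self):
        super().__init__()
        self.active = False     # inside the first TABLE?
        self.finished = False   # first TABLE already closed?
        self.skipping = 0       # depth of skipped elements
        self.result = []
        self.current_row = []
        self.current_data = []

    def _collecting(self):
        return self.active and self.skipping == 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if not self.finished:
                self.active = True
        elif tag in self.SKIPPED:
            self.skipping += 1
        elif self._collecting():
            if tag == "tr":
                self.current_row = []
            elif tag == "td":
                self.current_data = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.active = False
            self.finished = True
        elif tag in self.SKIPPED:
            self.skipping = max(0, self.skipping - 1)
        elif self._collecting():
            if tag == "tr":
                self.result.append(self.current_row)
            elif tag == "td":
                self.current_row.append("".join(self.current_data).strip())

    def handle_data(self, data):
        if self._collecting():
            self.current_data.append(data)


def parse_table(html):
    """Return the TD contents of the first TABLE as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.result


sample = """<table border=1>
<thead><tr><th>Heading 1</th><th>Heading 2</th></tr></thead>
<tbody>
<tr><td>Row 1, Column 1 text.</td><td>Row 1, Column 2 text.</td></tr>
<tr><td>Row 2, Column 1 text.</td><td>Row 2, Column 2 text.</td></tr>
</tbody>
</table>"""
print(parse_table(sample))
```

As with the original, nested tables and badly nested markup are not handled with any care; that part still needs thought.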
I hope this gives you a somewhat usable start on your problem.
A <TR> could contain <TH>. What would you want to do with those?
<TABLE>
<THEAD>
<TR> <TH>A header</TH> <TH>Another</TH> </TR>
</THEAD>
<TBODY>
<TR> <TD>Some data</TD> <TD>Some more</TD> </TR>
</TBODY>
</TABLE>
What do you want to come out of this? I suspect ignoring the <THEAD>
is probably closest to your needs, as is ignoring any <TR> that
contains no <TD>s (only <TH>s).
-------------------------------------------------------------
Is there a way to obtain sub-second times in Python (down to the millisecond)?
http://www.faqts.com/knowledge-base/view.phtml/aid/5513
-------------------------------------------------------------
Fiona Czuczman
Dan Schmidt
Whenever I have to do precise timing in a scripting language, I look
at how the profiler does it, and it worked this time too. Look at
how Profile.__init__ in profile.py sets self.timer; it does some
special-casing based on OS-specific things.
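As a rough illustration of sub-second timing, here is a small sketch. It assumes a current Python 3, where the standard library provides time.perf_counter() as a high-resolution clock, so it is not what profile.py did at the time of this post. The time_call helper is a made-up name.

```python
import time


def time_call(func, *args, repeat=5):
    """Return the best elapsed time, in seconds, over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


elapsed = time_call(sum, range(100000))
print("%.3f ms" % (elapsed * 1000.0))
```

Taking the best of several runs reduces noise from whatever else the machine is doing.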
-------------------------------------------------------------
How can I get a message's flag (Unread, Seen, Reply...) using Imaplib?
http://www.faqts.com/knowledge-base/view.phtml/aid/5517
-------------------------------------------------------------
Fiona Czuczman
Donn Cave
Assuming srv is an instance of IMAP4, and you want both flags and
headers for messages 1 through 5 in the currently selected folder:
status, result = srv.fetch('1:5', '(FLAGS RFC822.HEADER)')
Now you have in "result" a sequence of strings, with all the
information you requested. It's not broken up in the most convenient
way for you -- you will need to use some string functions to extract
the flags and everything.
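One way the extraction might look is sketched below, under some assumptions: it targets Python 3, where imaplib returns bytes, and it relies on the module-level helper imaplib.ParseFlags to pull the flag list out of a response line. The sample data is invented to show the shape of a typical result and does not come from a real server. The flags_by_message function is a made-up name.

```python
import imaplib


def flags_by_message(result):
    """Map message number -> tuple of flags, given a FETCH result list."""
    flags = {}
    current = None
    for item in result:
        # Items that carry a literal (such as the header text) arrive as
        # (envelope, literal) tuples; plain items are bytes.
        envelope = item[0] if isinstance(item, tuple) else item
        if not envelope:
            continue
        words = envelope.split(None, 1)
        if words and words[0].isdigit():
            current = int(words[0])
        if b"FLAGS" in envelope and current is not None:
            flags[current] = imaplib.ParseFlags(envelope)
    return flags


# Invented sample in the shape imaplib typically returns (an assumption).
sample = [
    (b"1 (FLAGS (\\Seen \\Answered) RFC822.HEADER {18}", b"Subject: hello\r\n\r\n"),
    b")",
    (b"2 (FLAGS () RFC822.HEADER {16}", b"Subject: bye\r\n\r\n"),
    b")",
]
print(flags_by_message(sample))
```

A message with no flags set maps to an empty tuple, so "unread" here simply means \Seen is absent.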
-------------------------------------------------------------
How can I make a frame non-resizable in wxPython (wxWindows)?
http://www.faqts.com/knowledge-base/view.phtml/aid/5519
-------------------------------------------------------------
Fiona Czuczman
Ruediger, Vadim Zeitlin
You need to specify the frame style, like:
wxFrame.__init__(self, parent, ID, title, wxDefaultPosition,
wxSize(600, 450), wxSYSTEM_MENU | wxCAPTION)
Problem cont:
Ah yes, that's clear, but what style?
I tried:
wxDEFAULT_FRAME_STYLE & ~ (wxMINIMIZE_BOX | wxRESIZE_BOX |
wxMAXIMIZE_BOX)
but that didn't work. Minimize and maximize are gone but I can still
resize my frame.
Solution:
wxRESIZE_BORDER is the one which controls "resizing by dragging the
frame border".
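Why the first attempt leaves the frame resizable can be seen from the bit arithmetic alone. The sketch below uses stand-in flag values that are made up for illustration; they are not the real wxWindows constants, and the composition of the default style is only an assumption.

```python
# Stand-in bit values for illustration only; NOT the real wx constants.
CAPTION = 0x01
SYSTEM_MENU = 0x02
MINIMIZE_BOX = 0x04
MAXIMIZE_BOX = 0x08
RESIZE_BORDER = 0x10

# Assumed composition of the default style, for the sake of the example.
DEFAULT_FRAME_STYLE = (CAPTION | SYSTEM_MENU | MINIMIZE_BOX |
                       MAXIMIZE_BOX | RESIZE_BORDER)

# Clearing only the minimize/maximize bits leaves RESIZE_BORDER set.
attempt = DEFAULT_FRAME_STYLE & ~(MINIMIZE_BOX | MAXIMIZE_BOX)

# Clearing RESIZE_BORDER as well removes resizing by dragging the border.
fixed = DEFAULT_FRAME_STYLE & ~(RESIZE_BORDER | MAXIMIZE_BOX)

print(bool(attempt & RESIZE_BORDER), bool(fixed & RESIZE_BORDER))
```

With the real constants the expression has the same shape: mask wxRESIZE_BORDER out of the style before passing it to the frame constructor.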
-------------------------------------------------------------
How can I create compressed tarfiles, i.e. lots of files inside a compressed file, using either tarlib.py or tar.py (found in Zope) and gzip and/or zlib?
http://www.faqts.com/knowledge-base/view.phtml/aid/5521
-------------------------------------------------------------
Fiona Czuczman
Martin von Loewis
Without trying, it appears that Zope's tar.py already handles
compressed files.
If names is a list of all the files to add, I'd say the following
might work:
    archive = tar.tgzarchive("foo")
    for n in names:
        data = open(n, "rb").read()
        archive.add(n, data)
    archive.finish()
    # assumes str(archive) yields the compressed data; untested, as above
    open("foo.tgz", "wb").write(str(archive))
It appears as if this would keep the compressed file in memory,
though; there is apparently no streaming tar library.
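For readers on a later Python: the standard library has since gained a tarfile module, which did not exist when this was written and can write gzip-compressed archives directly. A minimal sketch, with made-up helper names and throwaway temporary files:

```python
import os
import tarfile
import tempfile


def make_tgz(archive_path, names):
    """Write the files in `names` into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for n in names:
            # Store each file under its base name, not its full path.
            archive.add(n, arcname=os.path.basename(n))


def list_tgz(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


with tempfile.TemporaryDirectory() as tmp:
    names = []
    for base in ("a.txt", "b.txt"):
        path = os.path.join(tmp, base)
        with open(path, "w") as f:
            f.write("contents of " + base + "\n")
        names.append(path)
    target = os.path.join(tmp, "foo.tgz")
    make_tgz(target, names)
    members = list_tgz(target)

print(members)
```

Here the archive is written to disk as members are added, rather than being assembled in memory first.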