Get html DOM tree by only basic builtin modules

Wesley nispray at gmail.com
Fri Jun 5 19:24:31 EDT 2015


> On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nispray at gmail.com> wrote:
> > Hi Laura,
> >   Sure, I have a special requirement: parse an HTML file into a DOM tree using only general basic modules, and then, based on that DOM tree structure, draw a bitmap.
> >
> >   So, could you give me a direction on how to get the DOM tree?
> > Currently, the only idea I have is to use something like a stack: read the file line by line, push onto a stack data structure (a list, for example), and then derive the parent/child relations, etc.
> >
> > I don't know whether what I described is easy to achieve; I am just trying.
> > Any better suggestions would be greatly appreciated.
> 
> If you want to recreate the same DOM structure that would be created
> by a browser, the standardized algorithm to do so is very complicated,
> but you can find it at
> http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.
> 
> If you're not necessarily seeking perfect fidelity, I would encourage
> you to try to find some way to incorporate beautifulsoup into your
> project. It likely won't produce the same structure that a real
> browser would, but it should do well enough to scrape from even badly
> malformed html.
> 
> I recommend against using an XML parser, because HTML isn't XML, and
> such a parser may choke even on perfectly valid HTML such as this:
> 
> <!DOCTYPE html>
> <html>
>   <head><title>Document</title></head>
>   <body>
>     First line
>     <br>
>     Second line
>   </body>
> </html>
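
The point about XML parsers can be checked with the standard library alone. This is a small sketch, assuming Python 3 and xml.etree.ElementTree; the helper name try_xml_parse is made up for illustration. It feeds the document quoted above to the XML parser and reports whether it was accepted:

```python
import xml.etree.ElementTree as ET

# The valid HTML document from the quoted reply, with its unclosed <br>.
HTML_DOC = """<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>"""


def try_xml_parse(text):
    """Return None if the XML parser accepts text, else the ParseError."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return exc
    return None


error = try_xml_parse(HTML_DOC)
if error is None:
    print("XML parser accepted the document")
else:
    print("XML parser rejected the document: %s" % error)
```

The unclosed <br> is legal HTML, but to an XML parser it is an element that is never closed, so the following </body> does not match.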

Hi,
  Hmm, it's really complex.
Currently, I don't need to handle all the error cases; I can assume the HTML is well formed and just generate the DOM tree from it.

HTML sample below:
<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta name="description" content="Opera is an independent Scandinavian company that's been in the business of making web browsers since 1994. Read more about Opera Software here.">
  <title>About - Opera Software</title>
  <link rel="apple-touch-icon" sizes="57x57" href="http://d2jc9zwbrclgz3.cloudfront.net/static-heap/da/dafd15591b35d4f81ca96cf7de6582d705850ff0/apple-touch-icon-57x57.png">
</head>
<body screen_capture_injected="true"><div style="position: fixed; top: 0px; left: 0px; height: 0px; width: 0px; z-index: 9999999;"><div style="position: fixed; top: 100%; height: 0px;"><div style="position: relative;"></div></div></div>
<!-- Google Tag Manager -->
<nav class="business-menu">
  <ul>
    <li><a data-action-id="header_item" href="http://operamediaworks.com/">Opera Mediaworks</a></li>
  </ul>
</nav>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1>  <figure class="visuals">
  <img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>  
<ul class="blocks col3">
<li>
<h3>Vision</h3>
<p>We strive to develop superior products and services for our users around the world, through state-of-the-art technology, innovation, leadership and partnerships.</p><p><a href="http://www.operasoftware.com/company/vision" target="_self">Find out more</a>.</p>
</li>
<li>
</ul>
</main>
<footer class="ns--hf">
<aside>
<div class="hf--extra">
  <h2 class="hf--visuallyhidden">Page language</h2>
  <div id="language" class="hf--language hf--hover-enabled hf--popup-container">
    <input id="language-toggle" class="hf--popup-toggle hf--visuallyhidden" type="checkbox" aria-haspopup="true">
    <label for="language-toggle" class="hf--popup-toggle-label" tabindex="0">
      <span class="hf--hide-overflow">
      <span class="">Select your language:</span>
      <span class="">English</span>
      </span>
    </label>
  </div>
</div>
</aside>
<div class="hf--meta hf--clearfix">
<small class="hf--company">Copyright © 2014 Opera Software ASA. All rights reserved.
<a data-action-id="footer_item" href="http://www.opera.com/privacy">Privacy.</a> <a data-action-id="footer_item" href="http://www.opera.com/terms">Terms of Use.</a>
</small>
</div>
</footer>
</body></html>
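
For a sample like the one above, the stack idea can be sketched with only the standard library. This is a rough sketch rather than a finished implementation, and it assumes Python 3 and its html.parser module. The names Node, DomBuilder, parse_html and dump are made up for illustration. It does not follow the HTML5 parsing algorithm: it only tracks open elements on a stack, skips pushing void elements such as <meta>, <link>, <img>, <input> and <br>, and drops comments and the doctype.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class Node:
    """One element of the tree; text is stored as plain str children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a tree by keeping the currently open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # A new element is always a child of the innermost open element.
        node = Node(tag, attrs, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name.
        # Stray end tags (no matching open element) are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def parse_html(text):
    """Parse an HTML string and return the synthetic #document root node."""
    builder = DomBuilder()
    builder.feed(text)
    builder.close()  # flush any text still buffered by the parser
    return builder.root


def dump(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    if isinstance(node, str):
        return ["  " * depth + repr(node)]
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = parse_html("<html><body><p>First<br>Second</p></body></html>")
    print("\n".join(dump(doc)))
```

Because handle_endtag pops back to the nearest matching open element, an element left open, like the last <li> in the sample, is closed implicitly when its parent's end tag arrives. Each Node keeps its parent and children, so the parent/child relations are available for the later drawing step.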


