 HTML Parsing


Resin includes an HTML parser. Parsing HTML is useful when you need to:

  • Parse documents created by HTML editors.
  • Parse documents on the web.
  • Provide an HTML interface for web designers.

Parsing HTML works much like parsing XML through the JAXP interface, except that you use Resin's own API (HtmlParser in com.caucho.xml) instead of the JAXP classes.
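For comparison, the JAXP side of that analogy can be sketched with JDK classes alone. This is only an illustration: the class name JaxpParseSketch and the sample markup are made up for the example, and the standard JAXP parser accepts well-formed XML only, which is why ordinary HTML needs a dedicated parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of Resin.
public class JaxpParseSketch {
    // Parse a well-formed XML string into a DOM Document using standard JAXP.
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Sample markup is well-formed on purpose; an unclosed tag would be rejected.
        Document doc = parse("<html><body><p>hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Either way the result is an org.w3c.dom.Document, so code that walks the tree does not need to know which parser produced it.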

Because printing HTML follows different rules from printing XML (for example, <img> has no end tag), you'll need to call printHtml instead of print when writing the document out.

Parsing HTML
import java.io.*;
import org.w3c.dom.*;
import com.caucho.xml.*;

...

HtmlParser parser = new HtmlParser();

// Parse the file into a DOM Document (org.w3c.dom)
Document doc = parser.parse("test.html");

// Create a new HTML printer (com.caucho.xml)
FileOutputStream os = new FileOutputStream("out.xml");
XmlPrinter printer = new XmlPrinter(os);

// Print the document using HTML rules
printer.printHtml(doc);
os.close();
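
The difference between HTML and XML printing rules can also be observed without Resin. The sketch below is a stand-in, not Resin's XmlPrinter: it uses the JDK's javax.xml.transform serializer and switches its output method between "xml" and "html". The class name HtmlVsXmlOutput, the helper methods, and the sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical demo class built on JDK APIs only; it does not use Resin.
public class HtmlVsXmlOutput {
    // Serialize a DOM Document with the given output method ("xml" or "html").
    public static String serialize(Document doc, String method) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Build a small document containing an empty img element.
    public static Document sample() throws Exception {
        String xml = "<html><body><img src=\"a.png\"/></body></html>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = sample();
        // XML rules: the empty element is written as a self-closing tag.
        System.out.println(serialize(doc, "xml"));
        // HTML rules: img is written as a start tag with no end tag.
        System.out.println(serialize(doc, "html"));
    }
}
```

Whitespace and indentation in the two outputs may vary between JDK versions; the point of interest is only how the img element is closed.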

You can also take advantage of Resin's VFS API and parse documents directly from the web:

Parsing HTML from the web
import java.io.*;
import org.w3c.dom.*;
import com.caucho.vfs.*;
import com.caucho.xml.*;

...

HtmlParser parser = new HtmlParser();

Path yahoo = Vfs.lookup("http://www.yahoo.com");

// Parse the page into a DOM Document (org.w3c.dom)
Document doc = parser.parse(yahoo);
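
Once a page has been parsed, the result is an ordinary DOM tree, so standard org.w3c.dom calls apply. As one possible follow-up, the sketch below collects the href values of the anchor elements in a Document. The class LinkCollector and its method are hypothetical, the sample document is built with the JDK's JAXP parser as a stand-in for a parsed web page, and the lookup assumes lowercase element names, which may not match the names Resin's parser produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Resin.
public class LinkCollector {
    // Return the href attribute of every <a> element that has one, in document order.
    public static List<String> collectLinks(Document doc) {
        List<String> links = new ArrayList<>();
        // Assumption: element names are lowercase ("a"), as in the sample below.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parsed page; well-formed so the JDK parser accepts it.
        String page = "<html><body>"
            + "<a href=\"http://example.com/\">home</a>"
            + "<a name=\"top\">no link</a>"
            + "<a href=\"/about\">about</a>"
            + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)));
        for (String link : collectLinks(doc)) {
            System.out.println(link);
        }
    }
}
```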

Copyright © 1998-2002 Caucho Technology, Inc. All rights reserved.
Resin® is a registered trademark, and HardCore™ and Quercus™ are trademarks of Caucho Technology, Inc.