Reposted from:
[url]http://playfish.javaeye.com/blog/150382[/url]
Jericho HTML Parser
Jericho HTML Parser is a simple but powerful Java library allowing
analysis and manipulation of parts of an HTML document, including some
common server-side tags, while reproducing verbatim any unrecognised or
invalid HTML. It also provides high-level HTML form manipulation
functions.
The javadocs provide comprehensive documentation of the entire API, as well as being
a very useful reference on aspects of HTML and XML in general.
Release notes for each version can be found in a file called
release.txt in the project root directory.
Features
The library distinguishes itself from other HTML parsers with the following major features:
- The presence of badly formatted HTML does not interfere
with the parsing of the rest of the document, which makes the library
ideal for use with "real-world" HTML that chokes other parsers.
- ASP, JSP, PSP, PHP and Mason
server tags are explicitly recognised by the parser. This means that
normal HTML tags are still parsed properly even if there are server tags
inside them, which is common, for example, when dynamically setting
element attributes.
- It is neither an event nor tree based parser,
but rather uses a combination of simple text search, efficient tag
recognition and a tag position cache. The text of the whole source
document is first loaded into memory, and then only the relevant
segments are searched for the relevant characters of each search operation.
- Compared to a tree based parser such as DOM,
the memory and resource requirements can be far better if only small
sections of the document need to be parsed or modified. Incorrect or
badly formatted HTML can easily be ignored, unlike tree based parsers
which must identify every node in the document from top to bottom.
- Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
- The begin and end
positions in the source document of all parsed segments are accessible,
allowing modification of only selected segments of the document without
having to reconstruct the entire document from a tree.
- The row and column number of each position in the source document is easily accessible.
- Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display
modes. Analysis of the form controls also allows data received from the
form to be stored and presented in an appropriate manner.
- Custom tag types can be easily defined and registered for recognition by the parser.
- Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
- Built-in functionality to render HTML markup with simple text formatting.
- Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
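The position-based approach described in the list above (plain text search, accessible begin and end positions, row and column lookup, and modification of selected segments only) can be illustrated with a minimal sketch in plain Java. It does not use the Jericho API and is not how the library is implemented; the class `SegmentSketch` and its methods `findStartTags`, `rowColumn` and `replaceSpans` are hypothetical names invented for this illustration, and the tag matching is deliberately simplistic (case-sensitive, no handling of comments or quoted `>` characters).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this is NOT Jericho's implementation or API. */
class SegmentSketch {

    /** A half-open [begin, end) range of character positions in the source text. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Finds start tags with the given name using plain text search. */
    static List<Span> findStartTags(String source, String tagName) {
        List<Span> spans = new ArrayList<>();
        String prefix = "<" + tagName;
        int from = 0;
        while (true) {
            int begin = source.indexOf(prefix, from);
            if (begin < 0) break;
            int afterName = begin + prefix.length();
            int end = source.indexOf('>', afterName);
            if (end < 0) break; // no closing '>' anywhere after this point
            char next = source.charAt(afterName);
            // Accept only a full name match, e.g. "<b>" or "<b class=x>", not "<br>".
            if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                spans.add(new Span(begin, end + 1));
                from = end + 1;
            } else {
                from = afterName;
            }
        }
        return spans;
    }

    /** Converts a character position to a 1-based {row, column} pair by counting newlines. */
    static int[] rowColumn(String source, int pos) {
        int row = 1;
        int col = 1;
        for (int i = 0; i < pos; i++) {
            if (source.charAt(i) == '\n') { row++; col = 1; } else { col++; }
        }
        return new int[] {row, col};
    }

    /** Replaces only the given spans and copies everything else verbatim. */
    static String replaceSpans(String source, List<Span> spans, String replacement) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) { // spans must be in ascending, non-overlapping order
            out.append(source, last, s.begin).append(replacement);
            last = s.end;
        }
        return out.append(source, last, source.length()).toString();
    }

    public static void main(String[] args) {
        // Badly formed input: unclosed <p>, <b> and <i> elements.
        String html = "<p>Hi <b>there\n<i>unclosed <b class=x>bold</b><br>";
        List<Span> tags = findStartTags(html, "b");
        for (Span s : tags) {
            int[] rc = rowColumn(html, s.begin);
            System.out.println(html.substring(s.begin, s.end)
                    + " at row " + rc[0] + ", column " + rc[1]);
        }
        System.out.println(replaceSpans(html, tags, "<strong>"));
    }
}
```

Because the output is assembled from the original text plus the replaced spans, the unclosed elements and every other part of the input pass through unchanged, which is the property the feature list emphasises.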
Sample Programs
The samples/console directory in the download package
contains sample programs for performing common tasks and demonstrating
the functionality of the library. The .bat files can be
run directly on an MS-Windows operating system, or the following syntax
can be used on a UNIX based operating system from the samples/console
directory:
java -classpath classes:../lib/jericho-html-x.x.jar ProgramName
where x.x is the current release number and ProgramName
is the name of the sample program to run.
The following sample programs are available:
ConvertStyleSheets.java | Demonstrates how to detect all external style sheets and place them inline into the document.
DisplayAllElements.java | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML.
ExtractText.java | Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links.
FindSpecificTags.java | Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
FormControlDisplayCharacteristics.java | Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html.
FormFieldCSVOutput.java | Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv.
FormFieldList.java | Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document.
FormFieldSetValues.java | Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html.
FormatSource.java | Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a "source beautifier".
RenderToText.java | Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails.
Encoding.java | Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document.
SplitLongLines.java | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines.
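The idea behind FormFieldCSVOutput.java, separate columns for fields that can contain multiple values, can be sketched in plain Java without the library. This is only an assumption about one reasonable layout, not a reproduction of what FormFields.getColumnValues(Map) returns: the class `CsvColumnsSketch`, its methods, the `true`/`false` cell values and the sample field names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; it does not use or reproduce the Jericho API. */
class CsvColumnsSketch {

    /**
     * Builds one row of cells. "fields" maps each field name to its predefined
     * values (an empty list marks a free-text, single-valued field); "submitted"
     * maps field names to the values received from the form.
     */
    static List<String> columnValues(Map<String, List<String>> fields,
                                     Map<String, Set<String>> submitted) {
        List<String> row = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            Set<String> values =
                    submitted.getOrDefault(field.getKey(), Collections.<String>emptySet());
            if (field.getValue().isEmpty()) {
                // Single-valued field: one column holding the value, or empty.
                row.add(values.isEmpty() ? "" : values.iterator().next());
            } else {
                // Multi-valued field: one true/false column per predefined value.
                for (String predefined : field.getValue()) {
                    row.add(String.valueOf(values.contains(predefined)));
                }
            }
        }
        return row;
    }

    /** Joins cells with commas, quoting cells that contain commas, quotes or newlines. */
    static String toCsvLine(List<String> cells) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) line.append(',');
            String cell = cells.get(i);
            if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
                cell = "\"" + cell.replace("\"", "\"\"") + "\"";
            }
            line.append(cell);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical form: a text field and a group of three checkboxes.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("name", new ArrayList<String>());
        fields.put("colour", Arrays.asList("red", "green", "blue"));

        Map<String, Set<String>> submitted = new LinkedHashMap<>();
        submitted.put("name", new HashSet<>(Arrays.asList("Smith, J")));
        submitted.put("colour", new HashSet<>(Arrays.asList("red", "blue")));

        System.out.println(toCsvLine(columnValues(fields, submitted)));
    }
}
```

Giving every predefined value its own column keeps the number of columns fixed from row to row, regardless of how many checkboxes a particular submission had ticked.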
Building
The build and sample files are implemented as DOS .bat files only.
This is because I wanted to avoid the need to install Ant for such a
simple library. Sorry to all the UNIX users for the inconvenience.
On the Drawing Board...
- Ability to generate a JDOM document, making it a JTidy alternative
- Online interactive sample programs - please let me know if you are willing to host the FormatSource.jsp page on your web server
- .NET (DotNet) version if enough interest is shown (register your interest via the forums)
Alternative HTML Parsers
This package was originally written in the latter half of 2002. At
that time I evaluated 6 other parsers, none of which were capable of
achieving my aims. Most couldn't reproduce a typical HTML document
without change, none could reproduce a source document containing badly
formatted or non-HTML components without change, and none provided a
means to track the positions of nodes in the source text. A list of
these parsers and a brief description follows, but please note that I
have not revised this analysis since before this package was
written. Please let me know if there are any errors.
- JavaCC HTML Parser by Quiotix Corporation ([url]http://www.quiotix.com/downloads/html-parser/[/url])
GNU GPL licence, expensive licence fee to use in commercial
application. Does not support document structure (parses into a flat
node stream).
- Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
- JTidy ([url]http://jtidy.sourceforge.net/[/url])
Supports document structure, but by its very nature it "tidies" up
anything it doesn't like in the source document. At first glance it
looks like the positions of nodes in the source are accessible, at
least in protected start and end fields in the Node class, but these
are pointers into a different buffer and are of no use.
- javax.swing.text.html.parser.Parser
Comes standard in the JDK. Supports document structure. Does not track
the positions of nodes in the source text, but can be easily modified
to do so (although not sure of legal implications of modifications).
Requires a DTD to function, but only comes with an HTML 3.2 DTD which is
unsuitable. Even if an HTML 4.01 DTD were found, the parser itself
might need tweaking to cater for the new element types. The DTD needs
to be in the format of a "bdtd" file, which is a binary format used
only by Sun in this parser implementation. I have found many requests
for a 4.01 bdtd file in newsgroups etc on the web, but they all remain
unanswered. Building it from scratch is not so easy.
- Kizna HTML Parser v1.1 ([url]http://htmlparser.sourceforge.net/[/url])
GNU LGPL licence. Version 1.1 was very simple without support for
document structure. I have since revisited this project at sourceforge
(early 2004), where version 1.4 is now available. There are now two
separate libraries, one with and one without document structure
support. It claims to now also be capable of reproducing source text
verbatim.
- CyberNeko HTML Parser ([url]http://www.apache.org/~andyc/neko/doc/html/index.html[/url])
Apache-style licence. Supports document structure. Based on the very
popular Xerces XML parser. At the time of evaluation this parser didn't
regenerate the source accurately enough.