public class JsoupBasedHtmlParser extends HTMLParser
LagartoBasedHtmlParser
and this one (adapter pattern)
Fields inherited from class HTMLParser: ATT_BACKGROUND, ATT_CODE, ATT_CODEBASE, ATT_DATA, ATT_HREF, ATT_IS_IMAGE, ATT_REL, ATT_SRC, ATT_STYLE, ATT_TYPE, DEFAULT_PARSER, PARSER_CLASSNAME, STYLESHEET, TAG_APPLET, TAG_BASE, TAG_BGSOUND, TAG_BODY, TAG_EMBED, TAG_FRAME, TAG_IFRAME, TAG_IMAGE, TAG_INPUT, TAG_LINK, TAG_OBJECT, TAG_SCRIPT
Constructor and Description |
---|
JsoupBasedHtmlParser() |
Modifier and Type | Method and Description |
---|---|
Iterator<URL> | getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection coll, String encoding): Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
protected boolean | isReusable(): Parsers should override this method if the parser class is reusable, in which case the class will be cached for the next getParser() call. |
Methods inherited from class HTMLParser: getEmbeddedResourceURLs, getEmbeddedResourceURLs, getParser, getParser
public Iterator<URL> getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
Description copied from class: HTMLParser
All URLs should be added to the Collection.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException. N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
Specified by: getEmbeddedResourceURLs in class HTMLParser
Parameters:
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - URLCollection
encoding - Charset
Throws:
HTMLParseException
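The contract described above (decode the bytes with the given charset, resolve each embedded reference against the base URL, add it to the collection, and hand back an iterator) can be sketched in plain Java. This is only an illustration: `EmbeddedUrlSketch` and its `url` helper are hypothetical names, a regular expression stands in for the jsoup parsing that the real class performs, a `Collection<String>` stands in for `URLCollection`, and failures are wrapped in an unchecked exception rather than `HTMLParseException`.

```java
import java.net.URI;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; this is NOT the JMeter class.
class EmbeddedUrlSketch {

    // Assumption: only double-quoted src attributes on a few auto-loaded tags.
    private static final Pattern SRC = Pattern.compile(
            "<(?:img|script|iframe|embed)\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Mirrors the documented parameter order; coll receives the URL strings.
    static Iterator<URL> getEmbeddedResourceURLs(byte[] html, URL baseUrl,
            Collection<String> coll, String encoding) {
        try {
            URI base = baseUrl.toURI();
            String text = new String(html, Charset.forName(encoding));
            List<URL> found = new ArrayList<>();
            Matcher m = SRC.matcher(text);
            while (m.find()) {
                // Resolve relative references against the page's base URL.
                URL resolved = base.resolve(m.group(1)).toURL();
                String asString = resolved.toString();
                if (!coll.contains(asString)) { // skip duplicates
                    coll.add(asString);
                    found.add(resolved);
                }
            }
            return found.iterator();
        } catch (Exception e) {
            // The real API reports overall failures via HTMLParseException.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    // Hypothetical convenience helper so callers avoid checked exceptions.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (Exception e) {
            throw new IllegalArgumentException(spec, e);
        }
    }
}
```

Calling the sketch with a base of `http://example.com/dir/page.html` resolves a root-relative reference such as `/img/a.png` against the host and a bare reference such as `b.js` against the `dir/` path.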
protected boolean isReusable()
Overrides: isReusable in class HTMLParser
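The caching behaviour that `isReusable()` controls can be pictured with a small factory. The sketch below is an assumption about the general shape of such a cache, not the JMeter implementation: `SketchParser`, `ReusableSketchParser`, and `OneShotSketchParser` are hypothetical classes, and the lookup key is simply the class name passed to `getParser`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the JMeter classes.
abstract class SketchParser {

    // Instances of reusable parser classes, keyed by class name.
    private static final Map<String, SketchParser> CACHE = new HashMap<>();

    // Subclasses return true when one instance can serve repeated calls.
    protected boolean isReusable() {
        return false;
    }

    // Returns a cached instance if one exists, otherwise creates a new one
    // and caches it only when the parser declares itself reusable.
    static synchronized SketchParser getParser(String className) {
        SketchParser cached = CACHE.get(className);
        if (cached != null) {
            return cached;
        }
        try {
            SketchParser parser = (SketchParser) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            if (parser.isReusable()) {
                CACHE.put(className, parser);
            }
            return parser;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load parser " + className, e);
        }
    }
}

// Opts in to caching by overriding isReusable().
class ReusableSketchParser extends SketchParser {
    @Override
    protected boolean isReusable() {
        return true;
    }
}

// Keeps the default, so every getParser() call builds a fresh instance.
class OneShotSketchParser extends SketchParser {
}
```

In this sketch, two `getParser` calls for `ReusableSketchParser` return the same object, while two calls for `OneShotSketchParser` return distinct objects. A parser that keeps per-call state would want the default.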
Copyright © 1998-2015 Apache Software Foundation. All Rights Reserved.