org.apache.jmeter.protocol.http.parser

Class HTMLParser


public abstract class HTMLParser
extends Object

HTMLParser subclasses parse HTML content to obtain URLs.

Field Summary

protected static String ATT_BACKGROUND
protected static String ATT_HREF
protected static String ATT_IS_IMAGE
protected static String ATT_REL
protected static String ATT_SRC
protected static String ATT_STYLE
protected static String ATT_TYPE
static String DEFAULT_PARSER
static String PARSER_CLASSNAME
protected static String STYLESHEET
protected static String TAG_APPLET
protected static String TAG_BASE
protected static String TAG_BGSOUND
protected static String TAG_EMBED
protected static String TAG_FRAME
protected static String TAG_IMAGE
protected static String TAG_INPUT
protected static String TAG_LINK
protected static String TAG_SCRIPT

Constructor Summary

HTMLParser()
Protected constructor to prevent instantiation except from within subclasses.

Method Summary

Iterator getEmbeddedResourceURLs(byte[] html, URL baseUrl)
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

Iterator getEmbeddedResourceURLs(byte[] html, URL baseUrl, Collection coll)
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

abstract Iterator getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection coll)
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

static HTMLParser getParser()

static HTMLParser getParser(String htmlParserClassName)

protected boolean isReusable()
Parsers should override this method if the parser class is reusable, in which case the parser instance is cached for the next getParser() call.

Field Details

ATT_BACKGROUND

protected static final String ATT_BACKGROUND

ATT_HREF

protected static final String ATT_HREF

ATT_IS_IMAGE

protected static final String ATT_IS_IMAGE

ATT_REL

protected static final String ATT_REL

ATT_SRC

protected static final String ATT_SRC

ATT_STYLE

protected static final String ATT_STYLE

ATT_TYPE

protected static final String ATT_TYPE

DEFAULT_PARSER

public static final String DEFAULT_PARSER

PARSER_CLASSNAME

public static final String PARSER_CLASSNAME

STYLESHEET

protected static final String STYLESHEET

TAG_APPLET

protected static final String TAG_APPLET

TAG_BASE

protected static final String TAG_BASE

TAG_BGSOUND

protected static final String TAG_BGSOUND

TAG_EMBED

protected static final String TAG_EMBED

TAG_FRAME

protected static final String TAG_FRAME

TAG_IMAGE

protected static final String TAG_IMAGE

TAG_INPUT

protected static final String TAG_INPUT

TAG_LINK

protected static final String TAG_LINK

TAG_SCRIPT

protected static final String TAG_SCRIPT

Constructor Details

HTMLParser

protected HTMLParser()
Protected constructor to prevent instantiation except from within subclasses.

Method Details

getEmbeddedResourceURLs

public Iterator getEmbeddedResourceURLs(byte[] html,
                                        URL baseUrl)
            throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

URLs should not appear twice in the returned iterator.

Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

Parameters:
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
Returns:
an Iterator for the resource URLs
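
The contract described above (no duplicates, malformed URLs returned as Strings) can be illustrated with a small self-contained sketch. It does not use JMeter: the class `EmbeddedUrlContractSketch`, its `embeddedResourceUrls` method, the `url` helper, and the naive `src="..."` regex are all assumptions made for illustration, and a regex is not a substitute for a real HTML parser.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not part of JMeter.
class EmbeddedUrlContractSketch {

    // Naive pattern for src="..." attribute values; enough for this demo.
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Helper that turns the checked MalformedURLException into an unchecked one.
    static URL url(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns URL objects for well-formed references and the raw String for
    // malformed ones. Entries are keyed by text so nothing appears twice.
    static Iterator<Object> embeddedResourceUrls(byte[] html, URL baseUrl) {
        String text = new String(html, StandardCharsets.ISO_8859_1);
        Map<String, Object> seen = new LinkedHashMap<>();
        Matcher m = SRC.matcher(text);
        while (m.find()) {
            String raw = m.group(1);
            try {
                URL resolved = new URL(baseUrl, raw); // resolve against the base
                seen.putIfAbsent(resolved.toExternalForm(), resolved);
            } catch (MalformedURLException e) {
                seen.putIfAbsent(raw, raw); // report the malformed URL as a String
            }
        }
        return seen.values().iterator();
    }

    public static void main(String[] args) {
        String page = "<img src=\"a.png\"><script src=\"a.png\"></script>"
                + "<img src=\"bogus://x/y.gif\">";
        Iterator<Object> it = embeddedResourceUrls(
                page.getBytes(StandardCharsets.ISO_8859_1),
                url("http://example.com/dir/page.html"));
        while (it.hasNext()) {
            Object o = it.next();
            String kind = (o instanceof URL) ? "URL" : "malformed";
            System.out.println(kind + ": " + o);
        }
    }
}
```

A caller of such an iterator has to check the type of each element, since a String entry signals a reference that could not be turned into a URL.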

getEmbeddedResourceURLs

public Iterator getEmbeddedResourceURLs(byte[] html,
                                        URL baseUrl,
                                        Collection coll)
            throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
Parameters:
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - Collection - will contain URLString objects, not URLs
Returns:
an Iterator for the resource URLs

getEmbeddedResourceURLs

public abstract Iterator getEmbeddedResourceURLs(byte[] html,
                                                 URL baseUrl,
                                                 URLCollection coll)
            throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is, images, stylesheets, JavaScript files, applets, and so on.

All URLs should be added to the Collection.

Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

Parameters:
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - URLCollection
Returns:
an Iterator for the resource URLs
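
The note that the Iterator returns URLs while the Collection holds wrapper objects can be sketched as follows. This is a minimal illustration that does not use JMeter's URLString or URLCollection: `UrlText` and `UrlTextCollection` are hypothetical stand-ins. One plausible reason for such a wrapper is that `java.net.URL.equals()` and `hashCode()` may resolve host names, so comparing by text is cheaper inside a set.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;

// Hypothetical stand-ins; JMeter's own URLString and URLCollection are not used.
class UrlStringCollectionSketch {

    // Wraps a URL but compares and hashes by its text form only.
    static final class UrlText {
        private final URL url;
        private final String text;

        UrlText(URL url) {
            this.url = url;
            this.text = url.toExternalForm();
        }

        URL toUrl() { return url; }

        @Override public boolean equals(Object o) {
            return o instanceof UrlText && ((UrlText) o).text.equals(text);
        }

        @Override public int hashCode() { return text.hashCode(); }

        @Override public String toString() { return text; }
    }

    // Stores UrlText wrappers in the backing collection, but iterates as URLs.
    static final class UrlTextCollection implements Iterable<URL> {
        private final Collection<UrlText> backing;

        UrlTextCollection(Collection<UrlText> backing) { this.backing = backing; }

        boolean add(URL u) { return backing.add(new UrlText(u)); }

        @Override public Iterator<URL> iterator() {
            final Iterator<UrlText> inner = backing.iterator();
            return new Iterator<URL>() {
                @Override public boolean hasNext() { return inner.hasNext(); }
                @Override public URL next() { return inner.next().toUrl(); }
            };
        }
    }

    // Helper that turns the checked MalformedURLException into an unchecked one.
    static URL url(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Collection<UrlText> backing = new LinkedHashSet<>();
        UrlTextCollection coll = new UrlTextCollection(backing);
        coll.add(url("http://example.com/a.png"));
        coll.add(url("http://example.com/a.png")); // duplicate, ignored by the set
        for (URL u : coll) {
            System.out.println("iterator yields URL: " + u);
        }
        System.out.println("backing element type: "
                + backing.iterator().next().getClass().getSimpleName());
    }
}
```

Passing a set as the backing collection is what removes duplicates in this sketch; a list would keep them.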

getParser

public static final HTMLParser getParser()

getParser

public static final HTMLParser getParser(String htmlParserClassName)

isReusable

protected boolean isReusable()
Parsers should override this method if the parser class is reusable, in which case the parser instance is cached for the next getParser() call.
Returns:
true if the Parser is reusable
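
The interaction between isReusable() and getParser(String) can be sketched with a small self-contained factory. This is an assumption about the general pattern, not JMeter's actual implementation: `ParserCacheSketch`, `SketchParser`, `StatelessParser`, and `StatefulParser` are hypothetical names, and the default return value of `false` is chosen here for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a caching factory; not JMeter's implementation.
class ParserCacheSketch {

    abstract static class SketchParser {
        // Assumed default for this sketch: not reusable unless overridden.
        protected boolean isReusable() {
            return false;
        }
    }

    // Holds no per-parse state, so a single instance can be shared.
    static class StatelessParser extends SketchParser {
        @Override protected boolean isReusable() {
            return true;
        }
    }

    // Keeps the default, so every request gets a fresh instance.
    static class StatefulParser extends SketchParser {
    }

    private static final Map<String, SketchParser> CACHE = new HashMap<>();

    // Returns a cached instance when one exists; otherwise creates the parser
    // reflectively and caches it only if it declares itself reusable.
    static synchronized SketchParser getParser(String className) {
        SketchParser cached = CACHE.get(className);
        if (cached != null) {
            return cached;
        }
        try {
            SketchParser parser = (SketchParser) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            if (parser.isReusable()) {
                CACHE.put(className, parser);
            }
            return parser;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create parser " + className, e);
        }
    }

    public static void main(String[] args) {
        String reusable = StatelessParser.class.getName();
        String stateful = StatefulParser.class.getName();
        System.out.println("reusable parser shared: "
                + (getParser(reusable) == getParser(reusable)));
        System.out.println("stateful parser shared: "
                + (getParser(stateful) == getParser(stateful)));
    }
}
```

A parser that keeps mutable state between calls would normally leave isReusable() at its non-caching value, since a shared instance could otherwise leak state across callers.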

Copyright © 1998-2010 Apache Software Foundation. All Rights Reserved.