au.id.jericho.lib.html
Class Renderer
java.lang.Object
au.id.jericho.lib.html.Renderer
All Implemented Interfaces:
CharStreamSource
public final class Renderer
extends java.lang.Object
Performs a simple rendering of HTML markup into text.
This provides a human-readable version of the segment content, modelled on the way Mozilla Thunderbird and other email clients automatically convert HTML content to text for the alternative MIME encoding of emails.
The output using default settings complies with the "text/plain; format=flowed" (DelSp=No) protocol described in RFC 3676.
Many properties are available to customise the output, possibly the most significant of which is MaxLineLength.
See the individual property descriptions for details.
Use one of the output methods, such as writeTo(Writer), to obtain the output.
The rendering of some constructs, especially tables, is very rudimentary.
No attempt is made to render nested tables properly, except to ensure that all of the text content is included in the output.
Rendering an entire Source object performs a full sequential parse automatically.
Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.
To extract pure text without any rendering of the markup, use the TextExtractor class instead.
Renderer
public Renderer(Segment segment)
Constructs a new Renderer based on the specified Segment.
Parameters:
segment - the segment containing the HTML to be rendered.
getBlockIndentSize
public int getBlockIndentSize()
Returns the size of the indent to be used for anything other than LI elements.
See the setBlockIndentSize(int) method for a full description of this property.
Returns:
the size of the indent to be used for anything other than LI elements.
getConvertNonBreakingSpaces
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
Returns:
true if non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces, otherwise false.
getDecorateFontStyles
public boolean getDecorateFontStyles()
Returns:
true if decoration characters are to be included around the content of some font style elements, otherwise false.
getEstimatedMaximumOutputLength
public long getEstimatedMaximumOutputLength()
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuffer capacity.
There is no guarantee that the length of the output is indeed less than this value,
as classes implementing this method often use assumptions based on typical usage to calculate the estimate.
Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case.
Standard practice is to interpret any negative value as meaning that no estimate is available.
Specified by:
getEstimatedMaximumOutputLength in interface CharStreamSource
Returns:
the estimated maximum number of characters in the output, or -1 if no estimate is available.
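The advice above about treating any negative value as "no estimate" can be sketched as follows. This is only an illustration: the class CapacitySketch, the fallback of 256 and the upper cap are assumptions made for this example and are not part of the library.

```java
final class CapacitySketch {
    // Fallback capacity when no estimate is available (arbitrary choice for this sketch).
    static final int DEFAULT_CAPACITY = 256;
    // Upper bound so that a very large estimate does not trigger a huge allocation (also arbitrary).
    static final int MAX_INITIAL_CAPACITY = 1 << 20;

    // Turns an estimate, such as the value returned by getEstimatedMaximumOutputLength(),
    // into an initial buffer capacity.
    static int initialCapacity(long estimate) {
        if (estimate < 0) {
            return DEFAULT_CAPACITY; // any negative value means "no estimate", not only -1
        }
        return (int) Math.min(estimate, MAX_INITIAL_CAPACITY);
    }

    public static void main(String[] args) {
        // The buffer can still grow if the real output is longer than the estimate.
        StringBuilder buffer = new StringBuilder(initialCapacity(-1));
        buffer.append("rendered text would be appended here");
        System.out.println(buffer.length());
    }
}
```

Because the estimate is not a guarantee, the buffer is only pre-sized and remains free to grow.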
getIncludeHyperlinkURLs
public boolean getIncludeHyperlinkURLs()
Returns:
true if hyperlink URLs are included in the output, otherwise false.
getListBullets
public char[] getListBullets()
Returns the bullet characters to use for list items inside UL elements.
See the setListBullets(char[]) method for a full description of this property.
Returns:
the bullet characters to use for list items inside UL elements.
getListIndentSize
public int getListIndentSize()
Returns the size of the indent to be used for LI elements.
See the setListIndentSize(int) method for a full description of this property.
Returns:
the size of the indent to be used for LI elements.
getMaxLineLength
public int getMaxLineLength()
Returns the column at which lines are to be wrapped.
See the setMaxLineLength(int) method for a full description of this property.
Returns:
the column at which lines are to be wrapped.
getNewLine
public String getNewLine()
Returns the string to be used to represent a newline in the output.
See the setNewLine(String) method for a full description of this property.
Returns:
the string to be used to represent a newline in the output.
getTableCellSeparator
public String getTableCellSeparator()
Returns:
the string that is to separate table cells.
setBlockIndentSize
public Renderer setBlockIndentSize(int blockIndentSize)
Sets the size of the indent to be used for anything other than LI elements.
At present this applies to BLOCKQUOTE and DD elements.
The default value is 4.
Parameters:
blockIndentSize - the size of the indent.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
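The chaining mentioned in the return value can be illustrated with a small stand-in class. IndentSettings below is hypothetical and not part of the library; it only mirrors the documented pattern of setters that return this, using the default values stated on this page for the two indent properties.

```java
final class IndentSettings {
    private int blockIndentSize = 4; // mirrors the documented default for the block indent
    private int listIndentSize = 6;  // mirrors the documented default for the list indent

    // Each setter returns this so that calls can be chained in one statement.
    IndentSettings setBlockIndentSize(int blockIndentSize) {
        this.blockIndentSize = blockIndentSize;
        return this;
    }

    IndentSettings setListIndentSize(int listIndentSize) {
        this.listIndentSize = listIndentSize;
        return this;
    }

    int getBlockIndentSize() { return blockIndentSize; }

    int getListIndentSize() { return listIndentSize; }

    public static void main(String[] args) {
        IndentSettings settings = new IndentSettings().setBlockIndentSize(2).setListIndentSize(4);
        System.out.println(settings.getBlockIndentSize() + " " + settings.getListIndentSize());
    }
}
```

With the real class the same pattern would read new Renderer(segment).setBlockIndentSize(2).setListIndentSize(4), using only the constructor and setters documented on this page.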
setConvertNonBreakingSpaces
public Renderer setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the Renderer is instantiated.
Parameters:
convertNonBreakingSpaces - specifies whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
setDecorateFontStyles
public Renderer setDecorateFontStyles(boolean decorateFontStyles)
Sets whether decoration characters are to be included around the content of some font style elements and phrase elements.
The default value is false.
Below is a table summarising the decorated elements.
| Elements | Character | Example Output |
|---|---|---|
| B and STRONG | * | *bold text* |
| I and EM | / | /italic text/ |
| U | _ | _underlined text_ |
| CODE | \| | \|code\| |
Parameters:
decorateFontStyles - specifies whether decoration characters are to be included around the content of some font style elements.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
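The element-to-character mapping documented for this property can be sketched as a simple lookup. DecorationSketch and its decorate method are hypothetical names used only for illustration; the library's internal implementation may differ.

```java
import java.util.Locale;
import java.util.Map;

final class DecorationSketch {
    // Decoration characters for B/STRONG, I/EM, U and CODE, as documented for this property.
    private static final Map<String, Character> DECORATIONS = Map.of(
            "B", '*', "STRONG", '*',
            "I", '/', "EM", '/',
            "U", '_',
            "CODE", '|');

    // Wraps the content in the element's decoration character when decoration is enabled
    // and the element is one of the decorated ones; otherwise returns the content unchanged.
    static String decorate(String elementName, String content, boolean decorateFontStyles) {
        if (!decorateFontStyles) {
            return content;
        }
        Character decoration = DECORATIONS.get(elementName.toUpperCase(Locale.ROOT));
        return decoration == null ? content : decoration + content + decoration;
    }

    public static void main(String[] args) {
        System.out.println(decorate("B", "bold text", true));
        System.out.println(decorate("EM", "italic text", true));
    }
}
```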
setIncludeHyperlinkURLs
public Renderer setIncludeHyperlinkURLs(boolean includeHyperlinkURLs)
Sets whether hyperlink URLs are included in the output.
The default value is true.
When this property is true, the URL of each hyperlink is included in the output, enclosed in angle brackets, after the hyperlink label.
For example, when this property is true, the HTML
<a href="http://jericho.htmlparser.net/">Jericho HTML Parser</a>
is rendered as
Jericho HTML Parser <http://jericho.htmlparser.net/>
Parameters:
includeHyperlinkURLs - specifies whether hyperlink URLs are included in the output.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
setListBullets
public Renderer setListBullets(char[] listBullets)
Sets the bullet characters to use for list items inside UL elements.
The values in the default array are *, o, + and #.
If the nesting of rendered lists goes deeper than the length of this array, the bullet characters start repeating from the first in the array.
WARNING: If any of the characters in the default array are modified, this will affect all other instances of this class using the default array.
Parameters:
listBullets - an array of characters to be used as bullets, must have at least one entry.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
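The repetition rule and the warning about the shared default array can be sketched as follows. BulletSketch, bulletFor and withFirstBullet are hypothetical names, the depth is assumed to start at 0 for a top-level list, and the array here is a private copy of the documented default values rather than the library's own array.

```java
final class BulletSketch {
    // Same values as the documented defaults; a local copy, not the library's array.
    static final char[] DEFAULT_BULLETS = {'*', 'o', '+', '#'};

    // Picks the bullet for a nesting depth (0 = top-level list), wrapping around
    // to the first character once the depth exceeds the array length.
    static char bulletFor(char[] bullets, int depth) {
        if (bullets.length == 0) {
            throw new IllegalArgumentException("at least one bullet character is required");
        }
        return bullets[depth % bullets.length];
    }

    // Customises a copy instead of the array passed in, so that other users of a
    // shared array are not affected by the change.
    static char[] withFirstBullet(char[] bullets, char first) {
        char[] copy = bullets.clone();
        copy[0] = first;
        return copy;
    }

    public static void main(String[] args) {
        for (int depth = 0; depth < 6; depth++) {
            System.out.println(depth + ": " + bulletFor(DEFAULT_BULLETS, depth));
        }
    }
}
```

Following the same idea with the real class, a caller who wants different bullets would pass a freshly created array to setListBullets(char[]) rather than editing the array returned by getListBullets().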
setListIndentSize
public Renderer setListIndentSize(int listIndentSize)
Sets the size of the indent to be used for LI elements.
The default value is 6.
This applies to LI elements inside both UL and OL elements.
The bullet or number of the list item is included as part of the indent.
Parameters:
listIndentSize - the size of the indent.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
setMaxLineLength
public Renderer setMaxLineLength(int maxLineLength)
Sets the column at which lines are to be wrapped.
Lines that would otherwise exceed this length are wrapped onto a new line at a word boundary.
A line may still exceed this length if it consists of a single word, where the length of the word plus the line indent exceeds the maximum length.
In this case the line is wrapped immediately after the end of the word.
The default value is 76, which reflects the maximum line length for sending email data specified in RFC 2049 section 3.5.
Parameters:
maxLineLength - the column at which lines are to be wrapped.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
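The wrapping rule described for this property can be approximated with a simple greedy algorithm. The sketch below is not the library's implementation: WrapSketch and wrap are hypothetical names, words are assumed to be separated by whitespace, and a line is assumed to fit when its length, including the indent, does not exceed maxLineLength.

```java
import java.util.ArrayList;
import java.util.List;

final class WrapSketch {
    // Greedily fills each line with words, starting a new line at a word boundary
    // whenever the next word would push the line past maxLineLength.
    static List<String> wrap(String text, int maxLineLength, int indent) {
        List<String> lines = new ArrayList<>();
        String pad = " ".repeat(indent);
        StringBuilder line = new StringBuilder(pad);
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // the text was empty or only whitespace
            }
            boolean lineHasWord = line.length() > indent;
            if (lineHasWord && line.length() + 1 + word.length() > maxLineLength) {
                lines.add(line.toString());
                line = new StringBuilder(pad);
                lineHasWord = false;
            }
            if (lineHasWord) {
                line.append(' ');
            }
            // A single word longer than the limit stays on its own line and exceeds it.
            line.append(word);
        }
        if (line.length() > indent) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("The quick brown fox jumps over the lazy dog", 16, 4)) {
            System.out.println(line);
        }
    }
}
```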
setNewLine
public Renderer setNewLine(String newLine)
Sets the string to be used to represent a newline in the output.
The default value is "\r\n" (CR+LF) regardless of the platform on which the library is running.
This is so that the default configuration produces valid MIME text/plain output, which mandates the use of CR+LF for line breaks.
Specifying a null argument causes the output to use the same newline string as is used in the source document, which is determined via the Source.getNewLine() method.
If the source document does not contain any newlines, a "best guess" is made by either taking the newline string of a previously parsed document, or using the value from the static Config.NewLine property.
Parameters:
newLine - the string to be used to represent a newline in the output, may be null.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
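The fallback order described for a null argument can be sketched as a small resolution function. NewLineSketch and resolveNewLine are hypothetical names; the sketch assumes that a source document without any newlines is represented by null, and that the "best guess" has already been computed by the caller and is passed in as the last argument.

```java
final class NewLineSketch {
    // Resolves the newline string to use in the output:
    // 1. an explicitly requested string, if one was given;
    // 2. otherwise the newline string found in the source document;
    // 3. otherwise a best-guess fallback supplied by the caller.
    static String resolveNewLine(String requested, String sourceNewLine, String bestGuess) {
        if (requested != null) {
            return requested;
        }
        if (sourceNewLine != null) {
            return sourceNewLine;
        }
        return bestGuess;
    }

    public static void main(String[] args) {
        // No explicit request and a source that uses LF only: the source's newline wins.
        String resolved = resolveNewLine(null, "\n", "\r\n");
        System.out.println(resolved.equals("\n"));
    }
}
```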
setTableCellSeparator
public Renderer setTableCellSeparator(String tableCellSeparator)
Sets the string that is to separate table cells.
The default value is " \t" (a space followed by a tab).
Parameters:
tableCellSeparator - the string that is to separate table cells.
Returns:
this Renderer instance, allowing multiple property setting methods to be chained in a single statement.
writeTo
public void writeTo(Writer writer)
throws IOException
Writes the output to the specified Writer.
Specified by:
writeTo in interface CharStreamSource
Parameters:
writer - the destination java.io.Writer for the output.