java.lang.Object
  org.pdfbox.util.PDFStreamEngine
    org.pdfbox.util.PDFTextStripper
This class takes a PDF document and extracts all of its text, discarding the formatting.
| Field Summary | |
| protected Vector | charactersByArticle - The charactersByArticle is used to extract text by article divisions. |
| protected Writer | output - The stream to write the output to. |
| Constructor Summary | |
| PDFTextStripper() | Instantiate a new PDFTextStripper object. |
| PDFTextStripper(Properties props) | Instantiate a new PDFTextStripper object using the given properties. |
| Method Summary | |
| protected void | endDocument(PDDocument pdf) - This method is available for subclasses of this class. |
| protected void | endPage(PDPage page) - End a page. |
| protected void | endParagraph() - End a paragraph. |
| protected void | flushText() - This will print the text to the output stream. |
| protected List | getCharactersByArticle() - Character strings are grouped by articles. |
| protected int | getCurrentPageNo() - Get the current page number that is being processed. |
| PDOutlineItem | getEndBookmark() - Get the bookmark where text extraction should end, inclusive. |
| int | getEndPage() - This will get the last page that will be extracted. |
| String | getLineSeparator() - This will get the line separator. |
| protected Writer | getOutput() - The output stream that is being written to. |
| String | getPageSeparator() - This will get the page separator. |
| PDOutlineItem | getStartBookmark() - Get the bookmark where text extraction should start, inclusive. |
| int | getStartPage() - This is the page that the text extraction will start on. |
| String | getText(COSDocument doc) - Deprecated. |
| String | getText(PDDocument doc) - This will return the text of a document. |
| String | getWordSeparator() - This will get the word separator. |
| protected void | processPage(PDPage page, COSStream content) - This will process the contents of a page. |
| protected void | processPages(List pages) - This will process all of the pages and the text that is in them. |
| void | setEndBookmark(PDOutlineItem aEndBookmark) - Set the bookmark where the text extraction should stop. |
| void | setEndPage(int endPageValue) - This will set the last page to be extracted by this class. |
| void | setLineSeparator(String separator) - Set the desired line separator for output text. |
| void | setPageSeparator(String separator) - Set the desired page separator for output text. |
| void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) - Set if the text stripper should group the text output by a list of beads. |
| void | setSortByPosition(boolean newSortByPosition) - The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. |
| void | setStartBookmark(PDOutlineItem aStartBookmark) - Set the bookmark where text extraction should start, inclusive. |
| void | setStartPage(int startPageValue) - This will set the first page to be extracted by this class. |
| void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) - By default the text stripper will attempt to remove duplicate text that overlaps. |
| void | setWordSeparator(String separator) - Set the desired word separator for output text. |
| boolean | shouldSeparateByBeads() - This will tell if the text stripper should separate by beads. |
| boolean | shouldSortByPosition() - This will tell if the text stripper should sort the text tokens before writing to the stream. |
| boolean | shouldSuppressDuplicateOverlappingText() - This will tell if the text stripper should suppress duplicate overlapping text. |
| protected void | showCharacter(TextPosition text) - This will add a character to the list of characters to be printed to the text file. |
| protected void | startDocument(PDDocument pdf) - This method is available for subclasses of this class. |
| protected void | startPage(PDPage page) - Start a new page. |
| protected void | startParagraph() - Start a new paragraph. |
| protected void | writeCharacters(TextPosition text) - Write the string to the output stream. |
| void | writeText(COSDocument doc, Writer outputStream) - Deprecated. |
| void | writeText(PDDocument doc, Writer outputStream) - This will take a PDDocument and write the text of that document to the given writer. |
| Methods inherited from class org.pdfbox.util.PDFStreamEngine |
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
protected Vector charactersByArticle
protected Writer output
| Constructor Detail |
public PDFTextStripper()
throws IOException
IOException - If there is an error loading the properties.
public PDFTextStripper(Properties props)
throws IOException
props - The properties containing the mapping of operators to PDFOperator classes.
IOException - If there is an error reading the properties.
| Method Detail |
public String getText(PDDocument doc)
throws IOException
doc - The document to get the text from.
IOException - if the doc state is invalid or it is encrypted.
public String getText(COSDocument doc)
throws IOException
doc - The document to extract the text from.
IOException - If there is an error extracting the text.
See also: getText( PDDocument )
public void writeText(COSDocument doc,
Writer outputStream)
throws IOException
doc - The document to extract the text from.
outputStream - The stream to write the text to.
IOException - If there is an error extracting the text.
See also: writeText( PDDocument, Writer )
public void writeText(PDDocument doc,
Writer outputStream)
throws IOException
doc - The document to get the data from.
outputStream - The location to put the text.
IOException - If the doc is in an invalid state.
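Because writeText accepts any java.io.Writer, the extracted text can go to a file or to an in-memory buffer alike. The sketch below does not use PDFBox: `SeparatorSketch` and its `write` method are hypothetical stand-ins that only illustrate how word, line and page separators (compare setWordSeparator, setLineSeparator and setPageSeparator) might shape the text sent to a Writer. Where exactly PDFBox emits each separator is an assumption here, not something this page states.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of PDFBox: joins already-extracted words
// with configurable separators and writes the result to any Writer.
class SeparatorSketch {
    String wordSeparator = " ";
    String lineSeparator = "\n";
    String pageSeparator = "\n";

    // Input is nested as pages -> lines -> words.
    void write(List<List<List<String>>> pages, Writer out) throws IOException {
        for (List<List<String>> page : pages) {
            for (List<String> line : page) {
                for (int i = 0; i < line.size(); i++) {
                    if (i > 0) {
                        out.write(wordSeparator); // only between words
                    }
                    out.write(line.get(i));
                }
                out.write(lineSeparator); // assumption: emitted after every line
            }
            out.write(pageSeparator); // assumption: emitted after every page
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> line1 = Arrays.asList("Hello", "world");
        List<String> line2 = Arrays.asList("second", "line");
        List<List<String>> page1 = Arrays.asList(line1, line2);
        List<List<List<String>>> pages = Arrays.asList(page1);

        SeparatorSketch sketch = new SeparatorSketch();
        sketch.lineSeparator = "\r\n";      // e.g. force Windows line endings
        Writer out = new StringWriter();    // a FileWriter would work the same way
        sketch.write(pages, out);
        System.out.print(out);
    }
}
```

Passing the Writer in, rather than returning a String, lets a caller stream large documents without holding all of the text in memory.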
protected void processPages(List pages)
throws IOException
pages - The pages object in the document.
IOException - If there is an error parsing the text.
protected void startDocument(PDDocument pdf)
throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void endDocument(PDDocument pdf)
throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void processPage(PDPage page,
COSStream content)
throws IOException
page - The page to process.
content - The contents of the page.
IOException - If there is an error processing the page.
protected void startParagraph()
throws IOException
IOException - If there is any error writing to the stream.
protected void endParagraph()
throws IOException
IOException - If there is any error writing to the stream.
protected void startPage(PDPage page)
throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.
protected void endPage(PDPage page)
throws IOException
page - The page that has just been processed.
IOException - If there is any error writing to the stream.
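The startDocument, startPage, endPage and endDocument hooks exist so that subclasses can react to document and page boundaries. The following self-contained sketch is not the PDFBox implementation: `HookOrderSketch` is a hypothetical miniature showing the template-method idea, assuming the hooks fire in the order their names suggest and that the start and end pages are 1-based and inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a text stripper, NOT the PDFBox implementation.
class HookOrderSketch {
    private int firstPage = 1;                 // assumption: 1-based, inclusive
    private int lastPage = Integer.MAX_VALUE;  // assumption: no upper limit by default
    protected final List<String> log = new ArrayList<String>();

    void setStartPage(int value) { firstPage = value; }
    void setEndPage(int value) { lastPage = value; }

    // Hooks a subclass could override to emit headers, footers, markup, etc.
    protected void startDocument() { log.add("startDocument"); }
    protected void endDocument() { log.add("endDocument"); }
    protected void startPage(int pageNo) { log.add("startPage " + pageNo); }
    protected void endPage(int pageNo) { log.add("endPage " + pageNo); }

    // Template method: the traversal is fixed, the hooks are overridable.
    List<String> run(int pageCount) {
        log.clear();
        startDocument();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            // Pages outside the requested range trigger no page hooks.
            if (pageNo >= firstPage && pageNo <= lastPage) {
                startPage(pageNo);
                // ... the page's text would be processed and flushed here ...
                endPage(pageNo);
            }
        }
        endDocument();
        return log;
    }

    public static void main(String[] args) {
        HookOrderSketch sketch = new HookOrderSketch();
        sketch.setStartPage(2);
        sketch.setEndPage(3);
        for (String entry : sketch.run(4)) {
            System.out.println(entry);
        }
    }
}
```

A subclass that overrides startPage in this sketch could, for example, record a page marker before any text for that page is written.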
protected void flushText()
throws IOException
IOException - If there is an error writing the text.
protected void writeCharacters(TextPosition text)
throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.
protected void showCharacter(TextPosition text)
Overrides: showCharacter in class PDFStreamEngine
text - The description of the character to display.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of property startPage.
public int getEndPage()
public void setEndPage(int endPageValue)
endPageValue - New value of property endPage.
public void setLineSeparator(String separator)
separator - The desired line separator string.
public String getLineSeparator()
public void setPageSeparator(String separator)
separator - The desired page separator string.
public String getWordSeparator()
public void setWordSeparator(String separator)
separator - The desired word separator string.
public String getPageSeparator()
public boolean shouldSuppressDuplicateOverlappingText()
protected int getCurrentPageNo()
protected Writer getOutput()
protected List getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - Whether to suppress duplicate overlapping text.
public boolean shouldSeparateByBeads()
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - Whether to group the text output by beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.
public boolean shouldSortByPosition()
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Whether PDFBox should sort the text positions.
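The sort-by-position option exists because text tokens may be stored in a PDF in a different order from the one in which they appear on screen. As a rough illustration only, and not PDFBox's actual algorithm, the sketch below sorts fragments top to bottom and then left to right. The `Fragment` class, its coordinates, and the convention that y grows downward are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of position-based ordering; not PDFBox code.
class SortByPositionSketch {
    // A piece of text and the (assumed) top-left corner where it is drawn.
    static final class Fragment {
        final String text;
        final float x;
        final float y; // assumption: y grows downward, as on screen

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the fragment texts joined by spaces, in visual reading order.
    static String readInVisualOrder(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<Fragment>(streamOrder);
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                // Rows first, then columns. A real implementation would need
                // a tolerance when deciding that two y values share a row.
                int byRow = Float.compare(a.y, b.y);
                return byRow != 0 ? byRow : Float.compare(a.x, b.x);
            }
        });
        StringBuilder result = new StringBuilder();
        for (Fragment fragment : sorted) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(fragment.text);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Stream order differs from visual order on purpose.
        List<Fragment> streamOrder = Arrays.asList(
            new Fragment("world", 60f, 10f),
            new Fragment("line", 10f, 30f),
            new Fragment("Hello", 10f, 10f));
        System.out.println(readInVisualOrder(streamOrder));
    }
}
```

Sorting copies the input list first, so the original stream order remains available to callers that prefer it.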