net.htmlparser.jericho
public class Segment extends Object implements Comparable<Segment>, CharSequence
Represents a segment of a Source document.
Many of the tag search methods are defined in this class.
The span of a segment is defined by the combination of its begin and end character positions.
| Constructor | Description |
|---|---|
| Segment(Source source, int begin, int end) | Constructs a new Segment within the specified source document with the specified begin and end character positions. |
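| Modifier and Type | Method | Description |
|---|---|---|
| char | charAt(int index) | Returns the character at the specified index. |
| int | compareTo(Segment segment) | Compares this Segment object to another object. |
| boolean | encloses(int pos) | Indicates whether this segment encloses the specified character position in the source document. |
| boolean | encloses(Segment segment) | Indicates whether this Segment encloses the specified Segment. |
| boolean | equals(Object object) | Compares the specified object with this Segment for equality. |
| List<CharacterReference> | getAllCharacterReferences() | Returns a list of all CharacterReference objects that are enclosed by this segment. |
| List<Element> | getAllElements() | Returns a list of all Element objects that are enclosed by this segment. |
| List<Element> | getAllElements(StartTagType startTagType) | Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment. |
| List<Element> | getAllElements(String name) | Returns a list of all Element objects with the specified name that are enclosed by this segment. |
| List<Element> | getAllElements(String attributeName, Pattern valueRegexPattern) | Returns a list of all Element objects with the specified attribute name and value pattern that are enclosed by this segment. |
| List<Element> | getAllElements(String attributeName, String value, boolean valueCaseSensitive) | Returns a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment. |
| List<Element> | getAllElementsByClass(String className) | Returns a list of all Element objects with the specified class that are enclosed by this segment. |
| List<StartTag> | getAllStartTags() | Returns a list of all StartTag objects that are enclosed by this segment. |
| List<StartTag> | getAllStartTags(StartTagType startTagType) | Returns a list of all StartTag objects of the specified type that are enclosed by this segment. |
| List<StartTag> | getAllStartTags(String name) | Returns a list of all StartTag objects with the specified name that are enclosed by this segment. |
| List<StartTag> | getAllStartTags(String attributeName, Pattern valueRegexPattern) | Returns a list of all StartTag objects with the specified attribute name and value pattern that are enclosed by this segment. |
| List<StartTag> | getAllStartTags(String attributeName, String value, boolean valueCaseSensitive) | Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment. |
| List<StartTag> | getAllStartTagsByClass(String className) | Returns a list of all StartTag objects with the specified class that are enclosed by this segment. |
| List<Tag> | getAllTags() | Returns a list of all Tag objects that are enclosed by this segment. |
| List<Tag> | getAllTags(TagType tagType) | Returns a list of all Tag objects of the specified type that are enclosed by this segment. |
| int | getBegin() | Returns the character position in the Source document at which this segment begins, inclusive. |
| List<Element> | getChildElements() | Returns a list of the immediate children of this segment in the document element hierarchy. |
| String | getDebugInfo() | Returns a string representation of this object useful for debugging purposes. |
| int | getEnd() | Returns the character position in the Source document immediately after the end of this segment. |
| Element | getFirstElement() | Returns the first Element enclosed by this segment. |
| Element | getFirstElement(String name) | Returns the first Element with the specified name enclosed by this segment. |
| Element | getFirstElement(String attributeName, Pattern valueRegexPattern) | Returns the first Element with the specified attribute name and value pattern that is enclosed by this segment. |
| Element | getFirstElement(String attributeName, String value, boolean valueCaseSensitive) | Returns the first Element with the specified attribute name/value pair enclosed by this segment. |
| Element | getFirstElementByClass(String className) | Returns the first Element with the specified class that is enclosed by this segment. |
| StartTag | getFirstStartTag() | Returns the first StartTag enclosed by this segment. |
| StartTag | getFirstStartTag(StartTagType startTagType) | Returns the first StartTag of the specified type enclosed by this segment. |
| StartTag | getFirstStartTag(String name) | Returns the first StartTag with the specified name enclosed by this segment. |
| StartTag | getFirstStartTag(String attributeName, Pattern valueRegexPattern) | Returns the first StartTag with the specified attribute name and value pattern that is enclosed by this segment. |
| StartTag | getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive) | Returns the first StartTag with the specified attribute name/value pair enclosed by this segment. |
| StartTag | getFirstStartTagByClass(String className) | Returns the first StartTag with the specified class that is enclosed by this segment. |
| List<FormControl> | getFormControls() | Returns a list of the FormControl objects that are enclosed by this segment. |
| FormFields | getFormFields() | Returns the FormFields object representing all form fields that are enclosed by this segment. |
| Iterator<Segment> | getNodeIterator() | Returns an iterator over every tag, character reference and plain text segment contained within this segment. |
| Renderer | getRenderer() | Performs a simple rendering of the HTML markup in this segment into text. |
| Source | getSource() | Returns the Source document containing this segment. |
| TextExtractor | getTextExtractor() | Extracts the textual content from the HTML markup of this segment. |
| int | hashCode() | Returns a hash code value for the segment. |
| void | ignoreWhenParsing() | Causes this segment to be ignored when parsing. |
| boolean | isWhiteSpace() | Indicates whether this segment consists entirely of white space. |
| static boolean | isWhiteSpace(char ch) | Indicates whether the specified character is white space. |
| int | length() | Returns the length of the segment. |
| Attributes | parseAttributes() | Parses any Attributes within this segment. |
| CharSequence | subSequence(int beginIndex, int endIndex) | Returns a new character sequence that is a subsequence of this sequence. |
| String | toString() | Returns the source text of this segment as a String. |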
public final Source getSource()
Returns the Source document containing this segment.
If a StreamedSource is in use, this method throws an UnsupportedOperationException.
Returns:
the Source document containing this segment.
public final int getBegin()
Returns the character position in the Source document at which this segment begins, inclusive.
Returns:
the character position in the Source document at which this segment begins, inclusive.
public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment.
The character at the position specified by this property is not included in the segment.
Returns:
the character position in the Source document immediately after the end of this segment.
public final boolean equals(Object object)
Compares the specified object with this Segment for equality.
Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.
public int hashCode()
Returns a hash code value for the segment.
The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.
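The equality and hash code rules described above can be modelled with a small stand-alone class. This is only a sketch: SpanKey is a hypothetical stand-in rather than the library's Segment, and comparing the source by identity is an assumption of the sketch.

```java
// Hypothetical stand-in for a segment: a source identity plus begin/end positions.
final class SpanKey {
    final Object source; // stands in for the Source document
    final int begin;     // inclusive
    final int end;       // exclusive

    SpanKey(Object source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    // Equal only if the other object is a SpanKey on the same source with the same positions.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SpanKey)) return false;
        SpanKey s = (SpanKey) other;
        return source == s.source && begin == s.begin && end == s.end;
    }

    // Mirrors the documented current behaviour: the sum of begin and end.
    @Override
    public int hashCode() {
        return begin + end;
    }

    public static void main(String[] args) {
        Object doc = new Object();
        // Same source and positions: equal.
        System.out.println(new SpanKey(doc, 2, 5).equals(new SpanKey(doc, 2, 5)));
        // Different spans may share a hash code, which is permitted by the hashCode contract.
        System.out.println(new SpanKey(doc, 1, 6).hashCode() == new SpanKey(doc, 2, 5).hashCode());
    }
}
```

Because the hash is a simple sum, distinct spans such as (1, 6) and (2, 5) collide, so code should not rely on the hash code to tell segments apart.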
public int length()
Returns the length of the segment.
Specified by:
length in interface CharSequence
public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment.
This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
Note that a segment encloses itself.
Parameters:
segment - the segment to be tested for being enclosed by this segment.
Returns:
true if this Segment encloses the specified Segment, otherwise false.
public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document.
This is the case if getBegin() <= pos < getEnd().
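The two enclosure conditions given for encloses(Segment) and encloses(int) can be checked with a minimal model. Span below is a hypothetical class holding only the begin and end positions; it is a sketch of the documented inequalities, not the library's code.

```java
// Hypothetical stand-in for a segment's begin (inclusive) and end (exclusive) positions.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // A span encloses another if it starts no later and ends no earlier.
    boolean encloses(Span other) {
        return begin <= other.begin && end >= other.end;
    }

    // A position is enclosed if begin <= pos < end; the end position itself is excluded.
    boolean encloses(int pos) {
        return begin <= pos && pos < end;
    }

    public static void main(String[] args) {
        Span outer = new Span(10, 20);
        System.out.println(outer.encloses(outer));            // a span encloses itself
        System.out.println(outer.encloses(new Span(12, 21))); // extends past the end
        System.out.println(outer.encloses(20));               // end position is exclusive
    }
}
```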
Parameters:
pos - the position in the Source document.
Returns:
true if this segment encloses the specified character position in the source document, otherwise false.
public String toString()
Returns the source text of this segment as a String.
The returned String is newly created with every call to this method, unless this segment is itself an instance of Source.
Specified by:
toString in interface CharSequence
Overrides:
toString in class Object
Returns:
the source text of this segment as a String.
public Renderer getRenderer()
Performs a simple rendering of the HTML markup in this segment into text.
The output can be configured by setting any number of properties on the returned Renderer instance before obtaining its output.
Returns:
a Renderer based on this segment.
See Also:
getTextExtractor()
public TextExtractor getTextExtractor()
Extracts the textual content from the HTML markup of this segment.
The output can be configured by setting properties on the returned TextExtractor instance before obtaining its output.
Returns:
a TextExtractor based on this segment.
See Also:
getRenderer()
public Iterator<Segment> getNodeIterator()
Returns an iterator over every tag, character reference and plain text segment contained within this segment.
See the Source.iterator() method for a detailed description.
The following code demonstrates the typical usage of this method to make an exact copy of this segment to writer (assuming no server tags are present):
```java
for (Iterator<Segment> nodeIterator=segment.getNodeIterator(); nodeIterator.hasNext();) {
  Segment nodeSegment=nodeIterator.next();
  if (nodeSegment instanceof Tag) {
    Tag tag=(Tag)nodeSegment;
    // HANDLE TAG
    // Uncomment the following line to ensure each tag is valid XML:
    // writer.write(tag.tidy()); continue;
  } else if (nodeSegment instanceof CharacterReference) {
    CharacterReference characterReference=(CharacterReference)nodeSegment;
    // HANDLE CHARACTER REFERENCE
    // Uncomment the following line to decode all character references instead of copying them verbatim:
    // characterReference.appendCharTo(writer); continue;
  } else {
    // HANDLE PLAIN TEXT
  }
  // unless specific handling has prevented getting to here, simply output the segment as is:
  writer.write(nodeSegment.toString());
}
```
public List<Tag> getAllTags()
Returns a list of all Tag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
public List<Tag> getAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the tagType parameter is equivalent to getAllTags().
Parameters:
tagType - the type of tags to get.
Returns:
a list of all Tag objects of the specified type that are enclosed by this segment.
See Also:
getAllStartTags(StartTagType)
public List<StartTag> getAllStartTags()
Returns a list of all StartTag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
public List<StartTag> getAllStartTags(StartTagType startTagType)
Returns a list of all StartTag objects of the specified type that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the startTagType parameter is equivalent to getAllStartTags().
public List<StartTag> getAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to getAllStartTags(), which may include non-normal start tags.
This method also returns unregistered tags if the specified name is not a valid XML tag name.
public List<StartTag> getAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
See Also:
getAllStartTags(String attributeName, Pattern valueRegexPattern)
public List<StartTag> getAllStartTags(String attributeName, Pattern valueRegexPattern)
Returns a list of all StartTag objects with the specified attribute name and value pattern that are enclosed by this segment.
Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.
See the Tag class documentation for more details about the behaviour of this method.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
a list of all StartTag objects with the specified attribute name and value pattern that are enclosed by this segment.
See Also:
getAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
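The attribute name and value pattern rule can be illustrated with a stand-alone sketch. AttributeMatch and its map-based attribute model are hypothetical and do not reflect how the library stores attributes. Using a full match via Matcher.matches(), and rejecting a valueless attribute when a pattern is supplied, are assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class AttributeMatch {
    // 'attributes' maps lower-cased attribute names to values;
    // a null value models an attribute that has no value at all.
    static boolean matches(Map<String, String> attributes, String attributeName, Pattern valueRegexPattern) {
        String key = attributeName.toLowerCase(Locale.ROOT); // names are case insensitive
        if (!attributes.containsKey(key)) return false;
        if (valueRegexPattern == null) return true;          // name-only search
        String value = attributes.get(key);
        return value != null && valueRegexPattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("href", "index.html");
        attrs.put("disabled", null);
        System.out.println(matches(attrs, "HREF", Pattern.compile(".*\\.html")));
        System.out.println(matches(attrs, "disabled", null));
    }
}
```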
public List<StartTag> getAllStartTagsByClass(String className)
Returns a list of all StartTag objects with the specified class that are enclosed by this segment.
This matches start tags with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.
See the Tag class documentation for more details about the behaviour of this method.
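The class-name matching rule above amounts to splitting the class attribute value on white space and comparing each token. The helper below is a hypothetical sketch of that rule, not the library's implementation. Comparing tokens case-sensitively is an assumption.

```java
final class ClassNameMatch {
    // Returns true if className is the whole attribute value or one of its
    // white-space-separated tokens.
    static boolean hasClass(String classAttributeValue, String className) {
        if (classAttributeValue == null) return false;
        for (String token : classAttributeValue.trim().split("\\s+")) {
            if (token.equals(className)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasClass("menu", "menu"));              // exact match
        System.out.println(hasClass("nav  menu\tactive", "menu")); // one of several names
        System.out.println(hasClass("menubar", "menu"));           // substring only, no match
    }
}
```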
public List<Element> getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy.
The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment.
An element found at the start of this segment is included in the list.
Note however that if this segment is an Element, the overriding Element.getChildElements() method is called instead, which only returns the children of the element.
Calling getChildElements() on an Element is much more efficient than calling it on a Segment.
The objects in the list are all of type Element.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Source.getChildElements() method for more details.
Returns:
a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.
See Also:
Element.getParentElement()
public List<Element> getAllElements()
Returns a list of all Element objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
The elements returned correspond exactly with the start tags returned in the getAllStartTags() method.
If this segment is itself an Element, the result includes this element in the list.
public List<Element> getAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String name) method, except that elements which are not entirely enclosed by this segment are excluded.
Specifying a null argument to the name parameter is equivalent to getAllElements(), which may include elements of non-normal tags.
This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
If this segment is itself an Element with the specified name, the result includes this element in the list.
public List<Element> getAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllTags(TagType) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified type, the result includes this element in the list.
public List<Element> getAllElements(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String attributeName, String value, boolean valueCaseSensitive) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified name/value pair, the result includes this element in the list.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment.
See Also:
getAllElements(String attributeName, Pattern valueRegexPattern)
public List<Element> getAllElements(String attributeName, Pattern valueRegexPattern)
Returns a list of all Element objects with the specified attribute name and value pattern that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String attributeName, Pattern valueRegexPattern) method, except that elements which are not entirely enclosed by this segment are excluded.
Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.
If this segment is itself an Element with the specified attribute name and value pattern, the result includes this element in the list.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
a list of all Element objects with the specified attribute name and value pattern that are enclosed by this segment.
See Also:
getAllElements(String attributeName, String value, boolean valueCaseSensitive)
public List<Element> getAllElementsByClass(String className)
Returns a list of all Element objects with the specified class that are enclosed by this segment.
This matches elements with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.
The elements returned correspond with the start tags returned in the getAllStartTagsByClass(String className) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified class, the result includes this element in the list.
public List<CharacterReference> getAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
Returns:
a list of all CharacterReference objects that are enclosed by this segment.
public final StartTag getFirstStartTag()
Returns the first StartTag enclosed by this segment.
This is functionally equivalent to getAllStartTags().iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
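The practical difference from iterator().next() is the empty case: the iterator throws NoSuchElementException, whereas the getFirst methods return null. The generic helper below is a hypothetical sketch of that contract on an already built list. It does not model the early termination of the search.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

final class FirstOrNull {
    // Returns the first item, or null when the list is empty.
    static <T> T first(List<T> items) {
        return items.isEmpty() ? null : items.get(0);
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(first(none)); // null rather than an exception
        try {
            none.iterator().next();
        } catch (NoSuchElementException e) {
            System.out.println("iterator().next() throws on an empty list");
        }
        System.out.println(first(Arrays.asList("a", "b")));
    }
}
```

Callers of the getFirst methods should therefore check the result for null before using it.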
public final StartTag getFirstStartTag(StartTagType startTagType)
Returns the first StartTag of the specified type enclosed by this segment.
This is functionally equivalent to getAllStartTags(startTagType).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
public final StartTag getFirstStartTag(String name)
Returns the first StartTag with the specified name enclosed by this segment.
This is functionally equivalent to getAllStartTags(name).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
Specifying a null argument to the name parameter is equivalent to getFirstStartTag().
public final StartTag getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive)
Returns the first StartTag with the specified attribute name/value pair enclosed by this segment.
This is functionally equivalent to getAllStartTags(attributeName,value,valueCaseSensitive).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the first StartTag with the specified attribute name/value pair enclosed by this segment, or null if none exists.
See Also:
getFirstStartTag(String attributeName, Pattern valueRegexPattern)
public final StartTag getFirstStartTag(String attributeName, Pattern valueRegexPattern)
Returns the first StartTag with the specified attribute name and value pattern that is enclosed by this segment.
This is functionally equivalent to getAllStartTags(attributeName,valueRegexPattern).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
the first StartTag with the specified attribute name and value pattern that is enclosed by this segment, or null if none exists.
See Also:
getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive)
public final StartTag getFirstStartTagByClass(String className)
Returns the first StartTag with the specified class that is enclosed by this segment.
This is functionally equivalent to getAllStartTagsByClass(className).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
public final Element getFirstElement()
Returns the first Element enclosed by this segment.
This is functionally equivalent to getAllElements().iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element, this element is returned, not the first child element.
public final Element getFirstElement(String name)
Returns the first Element with the specified name enclosed by this segment.
This is functionally equivalent to getAllElements(name).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
Specifying a null argument to the name parameter is equivalent to getFirstElement().
If this segment is itself an Element with the specified name, this element is returned.
public final Element getFirstElement(String attributeName, String value, boolean valueCaseSensitive)
Returns the first Element with the specified attribute name/value pair enclosed by this segment.
This is functionally equivalent to getAllElements(attributeName,value,valueCaseSensitive).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified attribute name/value pair, this element is returned.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
the first Element with the specified attribute name/value pair enclosed by this segment, or null if none exists.
See Also:
getFirstElement(String attributeName, Pattern valueRegexPattern)
public final Element getFirstElement(String attributeName, Pattern valueRegexPattern)
Returns the first Element with the specified attribute name and value pattern that is enclosed by this segment.
This is functionally equivalent to getAllElements(attributeName,valueRegexPattern).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified attribute name and value pattern, this element is returned.
Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
Returns:
the first Element with the specified attribute name and value pattern that is enclosed by this segment, or null if none exists.
See Also:
getFirstElement(String attributeName, String value, boolean valueCaseSensitive)
public final Element getFirstElementByClass(String className)
Returns the first Element with the specified class that is enclosed by this segment.
This is functionally equivalent to getAllElementsByClass(className).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified class, this element is returned.
public List<FormControl> getFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
Returns:
a list of the FormControl objects that are enclosed by this segment.
public FormFields getFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment.
This is equivalent to new FormFields(getFormControls()).
Returns:
the FormFields object representing all form fields that are enclosed by this segment.
See Also:
getFormControls()
public Attributes parseAttributes()
Parses any Attributes within this segment.
This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
This is equivalent to source.parseAttributes(getBegin(),getEnd()).
Returns:
the Attributes within this segment, or null if too many errors occur while parsing.
public void ignoreWhenParsing()
Causes this segment to be ignored when parsing.
Ignored segments are treated as blank spaces by the parsing mechanism, but are included as normal text in all other functions.
This method was originally the only means of preventing server tags located inside normal tags from interfering with the parsing of the tags (such as where an attribute of a normal tag uses a server tag to dynamically set its value), as well as preventing non-server tags from being recognised inside server tags.
It is not necessary to use this method to ignore server tags located inside normal tags, as the attributes parser automatically ignores any server tags.
It is not necessary to use this method to ignore non-server tags inside server tags, or the contents of SCRIPT elements, as the parser does this automatically when performing a full sequential parse.
This leaves only very few scenarios where calling this method still provides a significant benefit.
One such case is where XML-style server tags are used inside normal tags. Here is an example using an XML-style JSP tag:
<a href="<i18n:resource path="/Portal"/>?BACK=TRUE">back</a>
The first double-quote of "/Portal" will be interpreted as the end quote for the href attribute, as there is no way for the parser to recognise the i18n:resource element as a server tag.
Such use of XML-style server tags inside normal tags is generally seen as bad practice, but it is nevertheless valid JSP. The only way to ensure that this library is able to parse the normal tag surrounding it is to find these server tags first and call the ignoreWhenParsing method to ignore them before parsing the rest of the document.
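The quoting problem in the JSP example can be reproduced with a toy quote scanner. This is only an illustrative model under stated assumptions: IgnoreDemo, hrefValue and blankOut are hypothetical names, and the library's parser is far more involved. The sketch treats the ignored range as spaces while scanning but reads the value from the original text, as the description above states.

```java
final class IgnoreDemo {
    // Finds the href value by scanning 'parseView' for the closing quote, but reads
    // the characters from 'original' so that ignored text remains part of the value.
    static String hrefValue(String original, String parseView) {
        int start = parseView.indexOf("href=\"") + "href=\"".length();
        int end = parseView.indexOf('"', start);
        return original.substring(start, end);
    }

    // Replaces the characters in [begin, end) with spaces, leaving all positions unchanged.
    static String blankOut(String text, int begin, int end) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = begin; i < end; i++) {
            sb.setCharAt(i, ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"<i18n:resource path=\"/Portal\"/>?BACK=TRUE\">back</a>";
        int begin = html.indexOf("<i18n:resource");
        int end = html.indexOf("/>", begin) + 2;
        // Without ignoring the server tag, the value is cut short at the inner quote.
        System.out.println(hrefValue(html, html));
        // With the server tag blanked out for scanning, the whole value is recovered.
        System.out.println(hrefValue(html, blankOut(html, begin, end)));
    }
}
```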
It is important to understand the difference between ignoring the segment when parsing and removing the segment completely.
Any text inside a segment that is ignored when parsing is treated by most functions as content, and as such is included in the output of tools such as TextExtractor and Renderer.
To remove segments completely, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment.
Then create a new source document using new Source(outputDocument.toString()) and perform the desired operations on this new source object.
Calling this method after the Source.fullSequentialParse() method has been called is not permitted and throws an IllegalStateException.
Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods.
If this is undesirable, the Source.clearCache() method can be called to remove them from the cache.
Calling the Source.fullSequentialParse() method after this method clears the cache automatically.
For best performance, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.
public int compareTo(Segment segment)
Compares this Segment object to another object.
If the argument is not a Segment, a ClassCastException is thrown.
A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier.
Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.
Specified by:
compareTo in interface Comparable<Segment>
Parameters:
segment - the segment to be compared
Throws:
ClassCastException - if the argument is not a Segment
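One consequence of an ordering that is inconsistent with equals shows up in sorted collections, which treat elements that compare as zero as duplicates. The sketch below uses a hypothetical OrderedSpan class, with a string standing in for the source document, to model the documented ordering. It is not the library's Segment.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class OrderedSpan implements Comparable<OrderedSpan> {
    final String sourceName; // stands in for the Source document
    final int begin;
    final int end;

    OrderedSpan(String sourceName, int begin, int end) {
        this.sourceName = sourceName;
        this.begin = begin;
        this.end = end;
    }

    // Orders by begin position, then by end position; the source is not considered.
    @Override
    public int compareTo(OrderedSpan other) {
        if (begin != other.begin) return Integer.compare(begin, other.begin);
        return Integer.compare(end, other.end);
    }

    // Equality additionally requires the same source.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof OrderedSpan)) return false;
        OrderedSpan s = (OrderedSpan) o;
        return sourceName.equals(s.sourceName) && begin == s.begin && end == s.end;
    }

    @Override
    public int hashCode() {
        return begin + end;
    }

    public static void main(String[] args) {
        OrderedSpan a = new OrderedSpan("doc1", 3, 8);
        OrderedSpan b = new OrderedSpan("doc2", 3, 8);
        System.out.println(a.compareTo(b) == 0); // same positions compare as equal
        System.out.println(a.equals(b));         // but the sources differ
        // A TreeSet relies on compareTo, so it keeps only one of the two spans.
        System.out.println(new TreeSet<>(Arrays.asList(a, b)).size());
    }
}
```

When segments from different documents must coexist in a sorted collection, a comparator that also distinguishes the source is one way to avoid silently dropping entries.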
public final boolean isWhiteSpace()
Indicates whether this segment consists entirely of white space.
Returns:
true if this segment consists entirely of white space, otherwise false.
public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space.
The HTML 4.01 specification section 9.1 specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as white space and renders it as an unprintable character (empty square).
Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
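A character test along these lines could look like the sketch below. It assumes the white space set is the six characters from HTML 4.01 section 9.1 (space, tab, form feed, line feed, carriage return and zero-width space). HtmlWhiteSpace is a hypothetical class, and the result for an empty sequence is a choice of the sketch rather than documented behaviour.

```java
final class HtmlWhiteSpace {
    // Tests a single character against the assumed HTML 4.01 white space set.
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':
            case '\t':
            case '\f':
            case '\n':
            case '\r':
            case '\u200B': // zero-width space
                return true;
            default:
                return false;
        }
    }

    // Tests whether every character in the text is white space.
    static boolean isWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isWhiteSpace(text.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWhiteSpace('\u200B')); // zero-width space counts
        System.out.println(isWhiteSpace('\u00A0')); // no-break space is not in the set
        System.out.println(isWhiteSpace(" \t\r\n"));
    }
}
```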
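Parameters:
ch - the character to test.
Returns:
true if the specified character is white space, otherwise false.
public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.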
public char charAt(int index)
Returns the character at the specified index.
This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().
However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.
Specified by:
charAt in interface CharSequence
Parameters:
index - the index of the character.
public CharSequence subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence.
This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex.
However, because this implementation works directly on the underlying document source text, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.
Specified by:
subSequence in interface CharSequence
Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
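The caveat about IndexOutOfBoundsException follows from reading the backing text directly. The sketch below models this with a hypothetical SourceView class that offsets indices into a backing string without range checks. It illustrates the documented caveat only and is not the library's implementation.

```java
// Hypothetical view over the range [begin, end) of a backing source string.
final class SourceView implements CharSequence {
    private final String source;
    private final int begin;
    private final int end;

    SourceView(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    // Reads straight from the backing string, so an index past length() may
    // silently return a character that lies outside the view.
    @Override
    public char charAt(int index) {
        return source.charAt(begin + index);
    }

    // Likewise offsets both indices without checking them against length().
    @Override
    public CharSequence subSequence(int beginIndex, int endIndex) {
        return source.substring(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return source.substring(begin, end);
    }

    public static void main(String[] args) {
        SourceView view = new SourceView("<p>Hello</p>", 3, 8);
        System.out.println(view);           // the text of the view
        System.out.println(view.charAt(5)); // out of range for the view, yet no exception
    }
}
```

In this model toString().charAt(5) would throw, while charAt(5) on the view does not, which is why callers should validate indices against length() themselves.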