It is a forked version of jsoup for research and study. We are going to inspect the design patterns applied to jsoup, enhance them if possible, and extend the features. Every development process or discussion history will be recorded in this document.
Manifesto
We're trying to focus on jsoup itself. It is a powerful HTML parser, not a headless browser, web driver or anything else like those. So the extended features are mostly binded to what it does well. And the author of this software currently works at Amazon. It means and it is obvious that the code structure or the code quality are so good that we cannot easily assert our opinion about them.
Postscript
We've learnt a lot of things from this project. We found out that design patterns are not always being used, and sometimes in many different ways. Also it was really hard to inspect and extend source codes written by other programmers.
We're writing member-specific ideas and notes in docs. This markdown is aimed to converge and share those ideas in more general form.
⚠️ All the contents written inside docs directory have been merged to here.
- org.jsoup.Jsoup
- org.jsoup.internal.ConstrainableInputStream
- org.jsoup.parser.CharacterReader
- org.jsoup.parser.Parser
- org.jsoup.select.Collector.Accumulator
- org.jsoup.parser.HtmlTreeBuilder
- org.jsoup.parser.Tokeniser
- org.jsoup.nodes.Node
- org.jsoup.select.NodeVisitor
Facade
Provides a unified interface to a set of interfaces in a subsystem. It defines a higher-level interface that makes a subsystem easier to use.
Why?
Jsoup core features are available from this class. It depends on many subsystem and also all the elements don't depend on this. See the below comments on this class.
Role | Class |
---|---|
Facade | Jsoup |
/**
The core public access point to the jsoup functionality.
@author Jonathan Hedley */
public class Jsoup {
...
Decorator
Attaches additional responsibilities to an object dynamically. Decorators provide a flexible alternative to subclassing for extending functionality.
Why?
This class have the same super type as the object it decorate. And BufferedInputStream
, its parent class, is one of the most famous representative of Decorator pattern. Also we can pass around a decorated object in place of the original(wrapped) object. See the below codes.
Role | Class |
---|---|
Component | InputStream |
ConcreteDecorator | ConstrainableInputStream |
private ConstrainableInputStream(InputStream in, ...) {
super(in, bufferSize);
...
Strategy
Defines a set of encapsulated algorithms that can be swapped to carry out a specific behavior.
Why?
This class use java.io.Reader
by object composition. The Reader
is abstract class. And Reader
's concrete type is decided dynamically at run-time when CharacterReader
is initialized. So CharacterReader
is client and Reader
is encapsulated algorithm in the strategy pattern.
Role | Class |
---|---|
Context | CharacterReader |
Strategy | Reader |
ConcreteStrategy | StringReader BufferedReader |
public final class CharacterReader {
...
private final Reader reader;
...
public CharacterReader(Reader input, int sz) {
Validate.notNull(input);
Validate.isTrue(input.markSupported());
reader = input;
...
final long skipped = reader.skip(pos);
reader.mark(maxBufferLen);
final int read = reader.read(charBuf);
reader.reset();
Strategy
Why?
This class use org.jsoup.parser.TreeBuilder
by object composition. The TreeBuilder
is abstract class. And TreeBuilder
's concrete type is decided dynamically at run-time when Parser
is initialized or call by setTreeBuilder
method. So Parser
is client and TreeBuilder
is encapsulated algorithm in the strategy pattern.
Role | Class |
---|---|
Context | Parser |
Strategy | TreeBuilder |
ConcreteStrategy | HtmlTreeBuilder XmlTreeBuilder |
public class Parser {
private TreeBuilder treeBuilder;
...
public Parser(TreeBuilder treeBuilder) {
this.treeBuilder = treeBuilder;
...
public Parser setTreeBuilder(TreeBuilder treeBuilder) {
this.treeBuilder = treeBuilder;
...
public Document parseInput(String html, String baseUri) {
return treeBuilder.parse(new StringReader(html), baseUri, this);
}
Strategy
Why?
This class use org.jsoup.select.Evaluator
by object composition. The Evaluator
is abstract class. And Evaluator
's concrete type is decided dynamically at run-time when Accumulator
is initialized. So Accumulator
is client and Evaluator
is encapsulated algorithm in the strategy pattern.
Role | Class |
---|---|
Context | Accumulator |
Strategy | Evaluator |
ConcreteStrategy | Check out the screenshot below |
Accumulator(Element root, Elements elements, Evaluator eval) {
this.root = root;
this.elements = elements;
this.eval = eval;
}
...
if (eval.matches(root, el))
elements.add(el);
State
Ties object circumstances to its behavior, allowing the object to behave in different ways based upon its internal state.
Why?
This class has the member variable state
, which is HtmlTreeBuilderState
type. The HtmlTreeBuilderState
declare abstract method and its subtypes implement this method. And subtypes of HtmlTreeBuilderState
call transition
method for transiting to another state.
Role | Class |
---|---|
Context | HtmlTreeBuilder |
State | HtmlTreeBuilderState |
ConcreteState | Many nested states |
private HtmlTreeBuilderState state; // the current state
...
protected boolean process(Token token) {
currentToken = token;
return this.state.process(token, this);
}
...
void transition(HtmlTreeBuilderState state) {
this.state = state;
}
Builder
Builder pattern allow for dynamic creation of objects based upon easily interchangeable algorithms. It is used when runtime control over the creation process is required and the addition of new creation functionality without changing the core code is necessary.
There are director, builder and concrete builder in this pattern. Director knows what parts are needed for the final product. And concrete builder knows how to produce the part and add it to the final product.
In jsoup, a Parser parses the HTML with an HtmlTreeBuilder which extends an abstract class TreeBuilder. Then it returns a Document which is a product of the builder.
Role | Class |
---|---|
Director | Parser |
Builder | TreeBuilder |
Concrete Builder | HtmlTreeBuilder, XmlTreeBuilder |
Product | Document |
// Parser.java
public Document parseInput(String html, String baseUri) {
return treeBuilder.parse(new StringReader(html), baseUri, this);
}
...
public static Document parse(String html, String baseUri) {
TreeBuilder treeBuilder = new HtmlTreeBuilder();
return treeBuilder.parse(new StringReader(html), baseUri, new Parser(treeBuilder));
}
// TreeBuilder.java
Document parse(Reader input, String baseUri, Parser parser) {
initialiseParse(input, baseUri, parser);
runParser();
return doc;
}
// HtmlTreeBuilder.java
public class HtmlTreeBuilder extends TreeBuilder {
...
}
State
Why?
This class has the member variable state
, which is TokeniserState
type. The TokeniserState
declare abstract method and its subtypes implement this method. And subtypes of TokeniserState
call transition
method for transiting to another state.
Role | Class |
---|---|
Context | Tokeniser |
State | TokeniserState |
ConcreteState | Many nested states |
private TokeniserState state = TokeniserState.Data; // current tokenisation state
...
Token read() {
while (!isEmitPending)
state.read(this, reader);
...
void transition(TokeniserState state) {
this.state = state;
}
Composite
Compose objects into tree structures to represent part-whole hierarchies. Composite lets clients treat individual objects and compositions of objects uniformly.
Why?
Node
is the parent class of LeafNode
and Element
. Element
delegates to multiple Node
. And this composite member variable name is childNodes. It can be LeafNode
or Element
recursively.
Role | Class |
---|---|
Component | Node |
Composite | Element |
Leaf | LeafNode |
// Element.java
public class Element extends Node {
...
List<Node> childNodes;
// LeafNode.java
abstract class LeafNode extends Node {
Visitor
Visitor perform an operation on a group of similar kind of Objects. By using Visitor we can move the operational logic from the objects to another class.
Why?
So many class in this project used NodeVisitor.
Example) Cleaner class
Cleaner
class is Client
that accesses data structure objects of other class by using Visitor
,CleaningVisitor
. NodeVisitor
is ConcreteVisitor
interface which is type of visitor CleaningVisitor
. this pattern can be also seen in W3CBuilder
Role | Class |
---|---|
Client | Cleaner |
Visitor | CleaningVisitor |
ConcreteVisitor | NodeVisitor |
// NodeVisitor.java
public class Cleaner {
...
private final class CleaningVisitor implements NodeVisitor {
private CleaningVisitor(Element root, Element destination) { ...
}
public void head(Node source, int depth) { ...
}
public void tail(Node source, int depth) { ...
}
}
private int copySafeNodes(Element source, Element dest) {
CleaningVisitor cleaningVisitor = new CleaningVisitor(source, dest);
NodeTraversor.traverse(cleaningVisitor, source);
return cleaningVisitor.numDiscarded;
}
}
// NodeVisitor.java
public interface NodeVisitor {
void head(Node node, int depth);
void tail(Node node, int depth);
}
- Get elements by inline style properties
- Get text content in an element while keeping HTML default block level line breaks
- Element inspection
- Get Iframe elements and merge into original document
- SQLish: SQL-like utility for elements
Idea
With this feature you can directly find elements with CSS properties inside inline style attribute.
It is already possible with CSS selector implemented in jsoup, for example, if you would like to select div
tag with a style display: block
you can simply achieve this by
element.select("div[style*=\"display: block\"]")
Unfortunately it only matches to the exact string display: block
, while not working with display : block
or display:block
. So the codes can easily become fragile and you may not get the results as you expected.
This problem happens because unlike most other attributes, style attribute contains a set of CSS key/value pairs in just a single line of string. So it is reasonable to separate, parse and store them in another form of structure to improve utility and usability.
Implementation
First of all, we need a new class called Style
alongside the style
attribute. This object would have key
and value
as its properties. Each of them matches to CSS' key/value. What we're going to do is that when an Element is created with the given attributes, parse style attribute's string (only if style attribute exists to improve performance), and then create Style
instance with the key/value and store them in an Element property styles
which is an ArrayList<Style>
. Now you can find elements with inline styles using getElementsByInlineStyle()
without writing messy, tedious queries.
Style.java
public class Style implements Map.Entry<String, String>, Cloneable {
private String key;
private String val;
public Style(String key, String val) {
Validate.notNull(key);
key = key.trim().toLowerCase();
Validate.notEmpty(key);
this.key = key;
this.val = val;
}
...
}
Element.java
public class Element extends Node {
...
private ArrayList<Style> styles = null;
...
}
public Elements getElementsByInlineStyle(String key, String val) {
Validate.notEmpty(key);
Elements results = new Elements();
Elements children = this.getAllElements();
for (Integer i = 0; i < children.size(); i += 1) {
Element child = children.get(i);
if (child.hasInlineStyles()) {
for (Integer j = 0; j < child.styles.size(); j += 1) {
Style style = child.styles.get(j);
if (style.getKey().equals(key) && style.getValue().equals(val)) {
results.add(child);
}
}
}
}
return results;
}
Changelogs
Related source codes
FormattedTextVisitor.java
Node.java
: accept methodTextNode.java
: accept methodElement.java
: accept methodElement.java
: formattedText method
Idea
When we get text content inside elements, currently text()
or wholeText()
methods concatenate all of them in a single line by unrespecting HTML's block-level line breaks. For example, suppose there is a DOM tree like below.
<div>
<h1>My First Program</h1>
<p>
<span>Hello</span>
World
</p>
</div>
What the form of text we want is like this since the div
, h1
, p
tags are block-level and span
is inline(or no default display).
My First Program
Hello World
However, jsoup's two mostly used getting text methods don't work as we expected.
My First Program Hello World // Result of Element.text()
My First ProgramHello World // Result of Element.wholeText()
First implementation
On the first shot, we implemented this feature in a single method in Element. However, we found that visitor pattern can be applied to this feature by creating a visitor class and accept methods in each Node's subclasses.
Visitor pattern
Related to this pull request #12.
We created a concrete class called FormattedTextVisitor
and here is the core part of this class.
public class FormattedTextVisitor {
private String formattedText = "";
public String text() {
return this.formattedText;
}
public void visit(Element element) {
if (element.tagName().equals("br")) {
this.formattedText += "\n";
}
}
public void visit(TextNode textNode) {
this.formattedText += textNode.text();
if (textNode.parentNode() instanceof Element) {
Element parentElement = (Element) textNode.parentNode();
if (parentElement.isBlock()) {
// Block level
this.formattedText += "\n";
} else {
// No block level
}
}
}
...
}
Then added the accept method to Node.
public void accept(FormattedTextVisitor visitor) {
visitor.visit(this);
}
As a result of visitor pattern, the implementation becomes much simpler than the first approach. From now on, formattedText()
method doesn't have to know each node's type anymore. All it has to do is create a visitor instance and accept it to each node inside NodeTraversor. Then the visitor will accumulate strings and we can simply get the result using text()
method.
public String formattedText() {
final FormattedTextVisitor visitor = new FormattedTextVisitor();
NodeTraversor.traverse(new NodeVisitor() {
@Override
public void head(Node node, int depth) {
node.accept(visitor);
}
@Override
public void tail(Node node, int depth) {
}
}, this);
return visitor.text();
}
Idea
Most of the time, when you are trying to crawl something, the crawling targets are repeating elements. Suppose there an HTML looks like this.
<div>
<ul>
<div>Included div element</div>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
<div class="car">Tesla</div>
<div class="car">Jaguar</div>
<div class="car">Lexus</div>
<div class="car">Chevrolet</div>
</div>
We can assume that item 1, item 2..., and cars are similar elements which are repeating several times. There may be a high chance to crawl data from them. To do that, a programmer or an user must inspect the page using browsers or other tools and traverse through them to identify what is repeating and how to get it by making queries targetting them. This is very tedious work and this feature could be an aid for this situation.
If you run inspect()
on the element above, you will get the result of this.
## This kind of element has been repeated 4 times ##
Query Recommendation: div.car
<div class="car">
Tesla
</div>
## This kind of element has been repeated 3 times ##
Query Recommendation: li
<li>item 1</li>
It simply tells you what are repeating and how to get them by recommending a query.
Both jsoup's clone()
method in deep or shallow return a cloned node with all the text nodes and attributes included. But sometimes we just want to get only the structure of elements to see the appearance of them or to create reusable components.
public Element frameClone(final String[] preservingAttrs) {
Element clone = this.clone();
final ArrayList<TextNode> textNodes = new ArrayList<>();
NodeTraversor.traverse(new NodeVisitor(){
@Override
public void head(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
textNodes.add(textNode);
} else if (node instanceof Element) {
Element element = (Element) node;
Attributes attrs = element.attributes();
for (Attribute attr : attrs) {
if (!Arrays.asList(preservingAttrs).contains(attr.getKey())) {
attr.setValue("");
}
}
}
}
}, clone);
for (TextNode node : textNodes) {
node.parentNode.removeChild(node);
}
return clone;
}
Additionally, you can pass a list of strings(attribute names) to be preserved, for example, class or id.
Element.frameClone(new String[]{"class", "id"})
Comparison with clone()
<!-- clone() -->
<div id="wrapper">
<h1 class="title typography-big" data-title-id="123">
This is title
</h1>
</div>
<!-- frameClone() -->
<div id="">
<h1 class="" data-title-id=""></h1>
</div>
<!-- frameClone(new String[]{"id", "class"}) -->
<div id="wrapper">
<h1 class="title typography-big" data-title-id=""></h1>
</div>
There are html()
and outerHtml()
methods to be used readily when you need to get an HTML string from an element. In default, jsoup's pretty output configuration is set to true, so the result string looks great keeping all the indentations.
This feature post-processes the result of outerHtml()
to be minified.
Its implementation is very simple. We achieved the result by just removing all white spaces and line breaks. Still it conserves the HTML syntax and keeps all content intact.
public String outerHtml(Boolean minify) {
if (!minify) return this.outerHtml();
return this.outerHtml()
.replace("\n", "")
.replace("\r", "")
.replaceAll(" ","")
.replace("> <", "><");
}
We created this method by overriding the exising method outerHtml()
and all you have to do is pass a boolean argument to it.
Comparison between the original outerHtml()
<!-- Default -->
<div id="div1">
<p>Hello</p>
<p>
Another
<b>element</b>
</p>
<div id="div2">
<img src="foo.png" />
</div>
</div>
<!-- Minified -->
<div id="div1"><p>Hello</p><p>Another <b>element</b></p><div id="div2"><img src="foo.png" /></div></div>
Updates
We found TextUtil.stripNewlines()
.
Idea
There is no implement to get iframe's elements. Jsoup focused on static html. For the elements loaded dynamically in runtime. So it is time-spending work to look reference and read document only for this small function.
Implementation
To get every detail from iframe, first we need to find iframe elements in document. Simply we got every iframe and extract src
attribute from element. After we extract src, we call it's document and prepend it to original element. Because Jsoup only look for a HTML things. So we have to manually call it. So node is generated, and matches with original tree. But we append whole text including META. Because we shouldn't give any restriction to user.
With this feature you can get Document
with all Element
including Element inside iframe
public static Document nestedConnect(String url) throws IOException {
Document doc = Jsoup.connect(url).get();
Elements iframes = doc.select("iframe");
for (Element iframe : iframes) {
if (iframe.attr("src").startsWith("http")) {
try {
String source = iframe.attr("src");
Document iframeDoc = Jsoup.connect(source).get();
iframe.prependElement(iframeDoc.toString());
} catch (IOException e) {
e.printStackTrace();
}
}
}
return doc;
}
Idea
jsoup is already a very powerful tool to parse HTML and crawl data from any website (except dynamically rendered pages like Single Page Application). What most people do with this library is crawl data just I said. However, everytime users or programmers trying to achieve that purpose, they might code tedious work. SQLish is an answer for this situation. It is packed with many useful tools for crawling and you can use this in a SQL-like ways. Crawling data is highly close to data things, anyone can use this tool without knowing the all API of jsoup.
Implementation
- Members
- Documentation (deprecated)
- Design patterns found in jsoup
- New features
- Get elements by inline style CSS properties
- Get text content in an element while keeping HTML default block level line breaks
- Element inspection
- Frame cloning
- HTML minifying
- Get Iframe elements and merge into original document
- SQLish: SQL-like utility for elements
- Sort elements as ascending order of its text
- Sort elements as descending order of its text
- Get the only elements which are starts with the specified prefix
- Get the only elements which are ends with the specified suffix
- Get the only elements which text integer are greater than or equal to specified number
- Get the only elements which text integer are less than or equal to specified number
- Returns the portion of these elements
- What we tried
Command pattern
Queries are generated with command objects and every time you call SQL methods the commands are stored sequentially. After that, you can get the result by executing all the queries you just accumulated using exec() method. Also, for a duplicate query, you can reduce the code duplication by popping out the previous command inside the oboject instead of generating another cloned object. Now as you know, the commands are objects we could get many advantages from this like applying other design patterns or something like that.
Role | Class |
---|---|
Command | SQLCommand |
ConcreteCommand | Many nested classes |
Receiver | Elements |
Invoker | SQLish |
Strategy pattern
There are many ways to extract texts from Elements. You can extract them including every child text nodes in the element(by using text() method) or you can extract only the text content which is related directly to that element (by using ownText() method). Queries are different for each method and both of them are plentifully used as users needed. Furthermore, since the texts should be extracted at run time, we used strategy pattern.
Role | Class |
---|---|
Context | SQLCommand |
Strategy | TextExtractor |
ConcreteStrategy | Nested classes |
Facade pattern
Classes related to SQL are implemented with command pattern and strategy pattern. So the clients need to know what are them and how to use/apply them. For the users who don’t have knowledge base on these, we prepared facade pattern to enable them easily use our utility without knowing the whole notions.
Role | Class |
---|---|
Facade | SQLish |
Test elements 1
<!-- test elements -->
<p>
hello
<span>mango</span>
</p>
<p>
hello
<span>ironman</span>
</p>
<p>
hello
<span>nobody</span>
</p>
<p>
hello
<span>food</span>
</p>
<p>
hello
<span>programmer</span>
</p>
<p>
hello
<span>love</span>
</p>
<p>
hello
<span>ice</span>
</p>
<p>
hello
<span>apple</span>
</p>
<p>
hello
<span>human</span>
</p>
<p>
hello
<span>zoo</span>
</p>
<p>
hello
<span>solo</span>
</p>
<p>
hello
<span>banana</span>
</p>
<p>
hello
<span>melon</span>
</p>
<p>
hello
<span>apology</span>
</p>
<p>
hello
<span>for</span>
</p>
<p>
hello
<span>prolong</span>
</p>
Test elements 2
<!-- test elements -->
<p>
23
<span>human</span>
</p>
<p>
18
<span>cat</span>
</p>
<p>
4939
<span>nobody</span>
</p>
<p>
19
<span>food</span>
</p>
<p>
293
<span>dog</span>
</p>
<p>
174
<span>love</span>
</p>
<p>
3942
<span>lion</span>
</p>
<p>
92
<span>elephant</span>
</p>
<p>
12
<span>human</span>
</p>
<p>
443
<span>giraffe</span>
</p>
Method
SQLish#orderByTextAsc()
Results (Test elements 1)
hello apology
hello apple
hello banana
hello food
hello for
hello human
hello ice
hello ironman
hello love
hello mango
hello melon
hello nobody
hello programmer
hello prolong
hello solo
hello zoo
Method
SQLish#orderByTextDesc()
Results (Test elements 1)
hello zoo
hello solo
hello prolong
hello programmer
hello nobody
hello melon
hello mango
hello love
hello ironman
hello ice
hello human
hello for
hello food
hello banana
hello apple
hello apology
Method
SQLish#startsWithText("hello pro")
Results (Test elements 1)
hello programmer
hello prolong
Method
SQLish#endsWithText("man")
Results (Test elements 1)
hello ironman
hello human
Method
SQLish#gteByText(200)
Results (Test elements 2)
293 dog
443 giraffe
3942 lion
4939 nobody
Method
SQLish#lteByText(100)
Results (Test elements 2)
92 elephant
23 human
19 food
18 cat
12 human
Method
SQLish#limit(3, 5)
SQLish#limit(2)
Results (Test elements 1)
hello food
hello for
hello human
hello ice
hello ironman
hello food
hello for
- Response from website not getting appropriate encoding
- Converting html to plain text, line brokes!
- Get only text from element
Idea
when getting response, some web site doesn't return character-set. due to unmatching character-set stream broke.
Problem
Some old website's response doesn't include charset. Jsoup do send request with execute() method. But in execute() it only read for the response, not checking META.
So "requester" doesn't have any way to find out it's original char-set. So we thought it is nessesary to add parsing method to execute() method.
We started to find META parsing part and tried to add it in execute(). But we thought this action was not the one 'Maker intended'.
Because execute() is only for requesting data to server. So not parsing or additional action required. So we stopped.
In order to change encoding we used META data. so add parse() method to execute() method or just do parse() after execute().
Idea
When convert html to plain text, line brokes.
Problem
With this feature you can get Document
with all Element
including Element inside iframe
<html>
<head>
<title>
</title>
<style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}
</style>
</head>
<body><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p>
</body>
when extracting 'hello world yo googlez' traditional way of Jsoup makes it intoline.
hello world
yo googlez
if we want to get things in this format we should prepend "\n" things. At first. we tried to build function.
But figured out there already function ''Jsoup.parse().wholeText()'' exist. So we stopped
Updates
The wholeText()
also does not return the result as we expected. So we created a feature for that.
Idea
Sometimes when 'GET' we only need to extract element. But only extract can be pretty hard if we manually find strings and iterating objects.
Problem
We found ownText()
method.
Element p = doc.select("p").first();
System.out.println(p.ownText());
for (Node node :p.childNodes()){
if (node instanceof TextNode){
System.out.println(((TextNode)node).text());
}
}