This plug-in contains utilities to parse an HTML blob.
Internally, it uses the Jericho HTML Parser library. There is nothing to install on your server, since Nuxeo already uses this library.
The .zip package is available from the QA last build page of the Nuxeo Package, so you can install it on your server without building it.
- HTML: Get Links (id `HTML.GetLinks`)
  - Accepts a `Blob`, `Document` or `String`, returns a `String`
  - Parses the HTML for every tag with a "src" or an "href" attribute, and returns a JSON string of an array of objects. Each object has a `tag`, `attribute`, `text` and `link` field.
  - When the input is `Document`, you can use the `xpath` parameter to let the plug-in know where to get the blob from (default is `file:content`).
    - Notice: If the input is `Document` and `xpath` is left empty or there is no blob, the plug-in checks whether the document has the `note` schema. If it does, it uses its `note:note` field for parsing.
  - Here is an example using JavaScript automation:
// In this example, the JavaScript receives a Document as input and uses its file:content field to get the blob.
function run(input, params) {
  var blob, resultStr, resultJson;
  blob = input["file:content"];
  resultStr = HTML.GetLinks(blob, {});
  // We have the JSON string, parse it into an array of objects
  resultJson = JSON.parse(resultStr);
  // Now, we can loop and get the values of each field:
  resultJson.forEach(function(obj) {
    // obj.tag contains "a" or "img" for example. It could also contain "script", "link", ...
    // obj.attribute contains "href" or "src"
    // obj.text contains "Hello" for <a href="http://site.com">Hello</a>
    // obj.link contains "http://site.com" for <a href="http://site.com">Hello</a>
  });
  return input;
}
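
Once the JSON string is parsed, ordinary array methods are enough to narrow the result down. The sketch below shows one possible way to keep only the links of a given tag. `linksForTag` and the sample string are hypothetical: they only mirror the `tag`, `attribute`, `text` and `link` fields described above, so the values are illustrative and not actual plug-in output.

```javascript
// Hypothetical helper: returns the "link" value of every entry whose tag matches tagName.
// resultStr is assumed to be the JSON string returned by HTML.GetLinks.
function linksForTag(resultStr, tagName) {
  var wanted = tagName.toLowerCase();
  return JSON.parse(resultStr)
    .filter(function (obj) {
      return String(obj.tag).toLowerCase() === wanted;
    })
    .map(function (obj) {
      return obj.link;
    });
}

// Illustrative sample only, shaped like the array described above.
var sampleLinks = JSON.stringify([
  { tag: "a", attribute: "href", text: "Hello", link: "http://site.com" },
  { tag: "img", attribute: "src", text: "", link: "http://site.com/logo.png" }
]);

var anchors = linksForTag(sampleLinks, "a");
```

Inside a `run(input, params)` function, the same helper could be applied to the string returned by `HTML.GetLinks` instead of the sample.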
- HTML: Get Plain Text (id `HTML.GetPlainText`)
  - Accepts a `Blob`, `Document` or `String`, returns a `String`
  - Parses the HTML and returns the plain text content.
  - Parameters:
    - `includeHyperlinkURLs`: Boolean, optional. Default value is `false`.
    - `includeAlternateText`: Boolean, optional. Default value is `false`.
    - `convertNonBreakingSpaces`: Boolean, optional. Default value is `false`.
    - `lineSeparator`: String, optional. Default value is LF (char #10, `"\n"`).
    - `xpath`: The xpath to use when the input is `Document`. Default value is `file:content`.
      - Notice: If the input is `Document` and `xpath` is left empty or there is no blob, the plug-in checks whether the document has the `note` schema. If it does, it uses its `note:note` field for parsing.
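
The parameters above can be collected in a plain object before calling the operation. The helper below is a sketch and not part of the plug-in: `plainTextParams` is a hypothetical name, and the defaults it fills in are simply the ones listed above.

```javascript
// Hypothetical helper: merges caller overrides with the defaults documented above.
function plainTextParams(overrides) {
  var params = {
    includeHyperlinkURLs: false,
    includeAlternateText: false,
    convertNonBreakingSpaces: false,
    lineSeparator: "\n",
    xpath: "file:content"
  };
  Object.keys(overrides || {}).forEach(function (key) {
    // Reject names that are not in the documented parameter list.
    if (!params.hasOwnProperty(key)) {
      throw new Error("Unknown parameter: " + key);
    }
    params[key] = overrides[key];
  });
  return params;
}

var withLinks = plainTextParams({ includeHyperlinkURLs: true });
```

In JavaScript automation, the resulting object could then be passed as the second argument, for example `HTML.GetPlainText(input, withLinks)`, assuming the same calling convention as the `HTML.GetLinks` example.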
- HTML: Get Info (id `HTML.GetInfo`)
  - Accepts a `Blob`, `Document` or `String`, returns a `String`
  - Parses the HTML and returns a JSON string containing an object with the following properties:
    - `title`: The content of the `<title>...</title>` tag. Returns `""` if there is no such tag.
    - One property per meta name in the `metaList` parameter
  - When called using JavaScript Automation, one can easily use `JSON.parse` on the resulting string to quickly extract the values.
  - Parameters:
    - `metaList`: String, optional. A comma-separated list of the names of the `<meta>` tags for which you want to get the content. The plug-in trims any extra space at the beginning or end of each name.
    - `xpath`: The xpath to use when the input is `Document`. Default value is `file:content`.
      - Notice: If the input is `Document` and `xpath` is left empty or there is no blob, the plug-in checks whether the document has the `note` schema. If it does, it uses its `note:note` field for parsing.
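
As a sketch of the `JSON.parse` step mentioned above, the snippet below builds a `metaList` value from an array of names and reads the parsed result back. `buildMetaList`, `readInfo` and the sample string are hypothetical; the sample only assumes the documented shape (a `title` property plus one property per requested meta name) and is not actual plug-in output.

```javascript
// Hypothetical helper: joins meta names into the comma-separated string expected by metaList.
function buildMetaList(names) {
  return names.join(",");
}

// Hypothetical helper: parses the JSON string returned by HTML.GetInfo and picks
// the title plus the requested meta values (undefined when a property is absent).
function readInfo(resultStr, names) {
  var parsed = JSON.parse(resultStr);
  var out = { title: parsed.title };
  names.forEach(function (name) {
    out[name] = parsed[name];
  });
  return out;
}

var metaNames = ["description", "keywords"];
var metaList = buildMetaList(metaNames);

// Illustrative sample only, shaped like the object described above.
var sampleInfo = JSON.stringify({
  title: "My page",
  description: "A short summary",
  keywords: "html, parsing"
});
var info = readInfo(sampleInfo, metaNames);
```

In JavaScript automation, `metaList` would presumably be passed as `HTML.GetInfo(input, { metaList: metaList })`, by analogy with the `HTML.GetLinks` example, and `readInfo` applied to the returned string.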
To build the plug-in from source with Maven:
cd /path/to/nuxeo-html-utils
mvn clean install
Nuxeo, developer of the leading Content Services Platform, is reinventing enterprise content management (ECM) and digital asset management (DAM). Nuxeo is fundamentally changing how people work with data and content to realize new value from digital information. Its cloud-native platform has been deployed by large enterprises, mid-sized businesses and government agencies worldwide. Customers like Verizon, Electronic Arts, ABN Amro, and the Department of Defense have used Nuxeo's technology to transform the way they do business. Founded in 2008, the company is based in New York with offices across the United States, Europe, and Asia. Learn more at www.nuxeo.com.