Follow Web links

Overview

Follow Web Links can be used in the pipeline with both Web/Feed extractors and File/Database extractors.  These are two different applications with different configuration possibilities.

Web/Feed:

For Web and Feed extractors, Follow Web Links will scan the raw content of extracted documents and pull out new documents (eg from links in web pages), which are then extracted using the subsequent pipeline elements. The original documents can either be discarded or processed with the same pipeline.

File/Database:

For File or Database extractors, Follow Web Links can be called in one of two ways:

  • Follow web links ("links") - just uses the userAgent/proxyOverride/httpFields settings to control any HTTP requests that are made from the subsequent "split documents" elements
  • Split documents ("splitter") - takes the current document and uses its fullText and/or metadata to create new documents (by outputting an object in the format immediately below). The current document can then be retained or discarded. Newly spawned documents appear at the current point in the pipeline, ie they skip earlier elements.

Note that File/Database extractors need to contain a "links" element if they are to be able to access HTTPS endpoints with self-signed certificates.
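
For example, a File extractor source that only needs "links" for its HTTP settings might include a minimal element like the following sketch (the values are illustrative, and the trivial script simply returns no new links):

		{
			"display": "Present only to control HTTP settings (eg self-signed HTTPS, proxy, user agent)",
			"links": {
				"userAgent": "example-harvester/1.0", // illustrative value
				"proxyOverride": "http://proxy.example.com:8080", // illustrative value
				"scriptlang": "javascript",
				"script": "[];" // returns no new links; the element is present only for its HTTP settings
			}
		}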


Format

{
	"display": string,
	"links": { // or "splitter"
	    "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
	    "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" 
	    "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
		                // [ { "url": string, "title": string /* optional-ish */, 
		                //     "description": string /* optional */, publishedDate: string /* optional */,
		                //     "spiderOut": string /*optional */ }
	    "scriptlang": string, // (Mandatory) "javascript" is supported, use extraMeta for different script types.  Can also be configured using "automatic", "automatic_json" and "automatic_xml".  See below for definitions.
	    "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details
	    "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see "Using The Feed Harvester" overview
	    "pageChangeRegex": string, // (Optional) If non-null, this regex should be used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
				                    // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected
	    "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage
	    "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go
	    "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false) then will stop harvesting as soon as an already harvested link is encountered
	                                            // (for APIs that return docs in time order, ensures that no time is wasted harvesting and then discarding duplicate links)
	    "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page
	    "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set
 
	    "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go
	}
}

As with the Web extractor, "httpFields" can contain the special field "Content", which will POST the associated value.
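
For example, a hypothetical "links" configuration that POSTs a form-encoded query and parses the JSON response might look like the following sketch (the response fields "results", "url" and "title" are assumptions about the remote API, not part of the platform):

		{
			"links": {
				"httpFields": {
					"Content-Type": "application/x-www-form-urlencoded",
					"Content": "q=test&page=1" // the value of "Content" is sent as the POST body
				},
				"scriptlang": "javascript",
				"script": "var json = eval('('+text+')'); var links = []; for (var i in json.results) links.push({ 'url': json.results[i].url, 'title': json.results[i].title }); links;"
			}
		}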

Description

This method can be used on web/feed extractors, or on file/database extractors.

If it is used on web/feed sources, it can be called using "links."  If used on file/database sources, it can be called using "links" or "splitter."

Follow Web Links has two major use cases ("links" and "splitter", described below).

In both cases it takes as its input documents that have been generated by an extractor, and then creates new documents based on the URL links, or on the JSON/XML fullText/metadata.  The original documents can then be retained or discarded.

 

When using the splitter element, it is possible to perform standard splits on JSON and XML without any complex scripting, using a shortcut format that is described below.

scriptlang:

For file extraction using JSON/XML, supports the following: "automatic", "automatic_json", or "automatic_xml".

script:

Works with the scriptlang setting above and can be set as follows:

  • "fullText": Parses the text as a line-separated collection of either XML or JSON (eg "<object>1</object><object>2</object>" would generate two documents)
  • "fullText.<field>": As above, except grabs the object represented by <field> (which can be nested using dot notation), eg "fullText.object" would parse "<objects><object>1</object><object>2</object></objects>"
  • "metadata.<field>": Takes the specified field from the metadata object (which is an array), and converts each element of the array into a document. 
    • (Note <field> cannot be nested here, unlike for "fullText")
  • "<field>": As above, ie the "metadata." prefix is optional. (It may be necessary if the <field> is itself called "metadata" or "fullText".)
 

As with any usage of Follow Web Links, the split produces an array of new documents, whose URL, title, fullText and metadata fields are set as follows:

URL:

The URL is set to the parent URL + "#<docnum>". See below for advanced usage.

title:

The title is set to the parent title plus " (<docnum>)".

fullText:

Depends on the scriptlang setting. See below.

metadata:

Depends on the scriptlang setting. See below.

 

The fullText and metadata fields of the split documents depend on the scriptlang:

  • If "automatic" is used then no metadata is generated and the fullText is the split object, eg in the above XML example, you'd get two documents, with no metadata and the following fullText fields:
  • If "automatic_json" is used, then the fullText is the same, but the metadata object contains a single field, "json", containing the JSON-ified object, eg:
    • "metadata": { "json": [ { "meta": "1", "url": "http://blah1" } ] }
    • "metadata": { "json": [ { "meta": "2", "url": "http://blah2" } ] }
  • If "automatic_xml" is used, it is similar, except the metadata object contains one element for each field of the JSON-ified object, eg:
    • "metadata": { "meta": [ "1" ], "url": [ "http://blah1" ] }
    • "metadata": { "meta": [ "2" ], "url": [ "http://blah2" ] }
  • (ie "automatic_json" vs "automatic_xml" are consistent with the metadata formats derived from the "file" extractor element)

Setting the URL to the above default is in many cases not desirable, since unlike title/description/fullText/displayUrl, the document "url" field cannot be changed later (it is used for deduplication).

Therefore there is a more complex syntax that enables the URL to be derived from one or more fields:

  • Simpler version: set "script" field to "<splitting-field>,<url-field>"
    • <splitting-field> is as described above
    • <url-field> takes the specified field from the JSON/XML/metadata and uses it for the URL
      • eg: "script": "fullText.object,url" would parse "<objects><object><meta>1</meta><url>http://blah1</url></object><object><meta>2</meta><url>http://blah2</url></object></objects>"
  • More complex version: set "script" field to "<splitting-field>,<url-string>,<url-field1>,<url-field2>,etc", where <url-string> contains the placeholders {1}, {2}, etc, which are replaced by the values of <url-field1>, <url-field2>, etc (see the full example below)

In the example below, "fullText.<field>" is set to "fullText.object", enabling the platform to correctly map the nested objects in the original document's fullText into two separate documents, each with its own metadata.

 

Full splitter example
////The source 
 
{
//...
	"processingPipeline": [
//...
		{
			"splitter": {
				"scriptlang": "automatic_json",
				"script": "fullText.object, http://test/{1}/{2}, url, meta"
			}
		}
//...
	]
//...
}
 
////Would map the extracted document
 
{
	"url": "blahurl",
	"title": "blah"
	"fullText": "<objects><object><meta>1</meta><url>blah1</url></object><object><meta>2</meta><url>blah2</url></object></objects>"
}
 
////to the 2 derived docs:
 
{
	"title": "blah (1)",
	"url": "http://test/blah1/1",
	"fullText": "<object><meta>1</meta><url>blah1</url></object>",
	"metadata": {
		"json": [ { "meta": "1", "url": "http://blah1" } ]
	}
},
{
	"title": "blah (2)",
	"url": "http://test/blah2/2",
	"fullText": "<object><meta>1</meta><url>blah1</url></object>",
	"metadata": {
		"json": [ { "meta": "2", "url": "http://blah2" } ]
	}
}
 
////Of course, subsequent pipeline elements can then manipulate/add fields other than "url" as per usual
 

 


 

"Links" Examples

When Follow Web Links is used for API-style parsing, scriptlang can be set to "javascript".  You can then specify a javascript script in the script field, which will be passed a variable "text" containing the response from the specified URL.  The script must output an array of objects in the following format:

 

{ 
	"url": string, // Mandatory - this URL is copied into the "URL" field of the generated document,                     
					// and is used to fetch the content unless "fullText" is set.
    "title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
    "description": string, // Optional, if set then used to generate the document's description.
    "publishedDate": string, // Optional, if not set then the current date is used instead.
 
    "fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.
 
    "spiderOut": integer //Optional, if set to true then the Follow Web Links script is applied to the resulting document,
                            // for a depth of up to "maxDepth" times
                            // Note spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url)
}
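
For example, a script might return an array like the following sketch (all values are illustrative).  The first entry's URL will be fetched and harvested; the second supplies its own fullText, so its URL is not followed:

[
	{ "url": "http://www.example.com/page1.html", "title": "Example page 1" },
	{ "url": "http://www.example.com/page2.html", "title": "Example page 2", "description": "A page whose content is supplied inline", "fullText": "The content of page 2", "publishedDate": "2015-01-01" }
]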

It is important to understand the 2 control parameters in this object:

  • If "title" is set, the document is forwarded to the rest of the pipeline (ie will be harvested into the platform, as if the "follow web links" element were not present)
  • If "title" is not set, the document will be discarded after it has been scanned for links to crawl
  • If "spiderOut" is set to true, then the document will be scanned for links to crawl
    • (so, eg, no "title" and no "spiderOut" is a degenerate case - the document is simply discarded)

 

spiderOut can be used to apply the Follow Web Links script to the resulting documents.  This means that if a newly generated document also contains additional URLs, the script will run again on those URLs and return an array to make additional documents.

When spiderOut is enabled it will continue to follow additional urls (if present) until the parameter setting for maxDepth has been reached.
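
As a sketch, a links element that spiders out to every absolute link it finds might look like the following (the regex-based link extraction and titles are purely illustrative, and recall from the format section that spiderOut only takes effect when rss.extraUrls is used):

		{
			"links": {
				"maxDepth": 1,
				"scriptlang": "javascript",
				"script": "var links = []; var re = /href=\"(http[^\"]+)\"/g; var m; while (null != (m = re.exec(text))) { links.push({ 'url': m[1], 'title': m[1], 'spiderOut': true }); } links;"
			}
		}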

In the example below, the "links" element is configured with a script that determines the behavior of "Follow Web Links" for the input data: it splits the response text into lines and generates one new document per line.

 

},
        {
            "links": {
                "script": "var retVals = [];\nvar n = -1;\nvar url = _doc.url.replace(/[?].*/,\"\");\nvar start = 0;\nwhile (start < text.length) {\n    var end = text.indexOf('\\n', start);\n    if (end == -1) end = text.length;\n    var line = text.substr(start,end-1);\n    start = end + 1;    \n    \n    n++;\n    if (0 == n) continue;\n    \n    var title = 'line #' + n.toString();\n    var url2 = url + '#' + n.toString();\n    var fullText = line;\n    var retVal = { 'title':title, 'url':url2, 'fullText':line };\n    retVals.push(retVal); \n}\nretVals;\n      "
            }
        },

For XML APIs the basic principle is the same, but the XML object needs to be parsed using embedded Java calls (since the Rhino javascript engine currently in use does not support e4x).
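
As an illustration only, an embedded Java call of this kind (shown here before JSON-escaping, and assuming hypothetical "item", "link" and "title" elements in the response) might look like:

// Parse the "text" variable with the JDK's DOM parser via Rhino's Java bridge
var builder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
var dom = builder.parse(new org.xml.sax.InputSource(new java.io.StringReader(text)));
var items = dom.getElementsByTagName("item"); // "item" is a hypothetical element name
var retVals = [];
for (var i = 0; i < items.getLength(); i++) {
    var el = items.item(i);
    retVals.push({
        'url': String(el.getElementsByTagName("link").item(0).getTextContent()),
        'title': String(el.getElementsByTagName("title").item(0).getTextContent())
    });
}
retVals;

In practice the extraMeta/xpath approach described next is usually simpler.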

The links object has a field, extraMeta, that enables other script types to be used.  The main use case for this is to use the "xpath" scripting language to extract the required URLs, and then use the existing script field to tidy those objects up into the required format.

 

  },
        {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }
        },

 

In the example, you can see that the links object has a complex parameter, extraMeta, which is configured to call an xpath script that parses the input XML and converts it to JSON output.

 


When Follow Web Links is used to ingest HTML pages, some additional considerations apply.

The extraMeta field is used by "Follow Web Links" to enable other script types to be used.

This enables the use of the xpath scripting language to extract the required URLs from the HTML pages, before "script" puts them into the required array format.

After the extraMeta array has been used by script, it is discarded.

The extraMeta field can also be used for 2 debugging/error handling cases:

  • If a field called "_ONERROR_" is generated and no links are returned from the first page (ie likely due to a formatting error), then the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
  • Only when running from the "Config - Source - Test" API call (including from the Source Editor GUI): for every page, all of the _ONDEBUG_ field values (which can be strings or objects) are dumped to the harvest message.
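
As a sketch, and assuming that javascript extraMeta entries are passed the page content in the "text" variable in the same way as the main "script" field, a debugging entry might look like:

				"extraMeta": [
					{
						"fieldName": "_ONDEBUG_",
						"scriptlang": "javascript",
						"script": "text.substring(0, 500);" // dump the first 500 characters of each page when testing the source
					}
				]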

 

Follow Web Links can also be used to follow links when multiple pages must be traversed (eg the results pages of an intranet search engine), and can then generate new documents based on these URLs.

The following parameters are part of the pagination configuration:

  • numPages: the total number of pages that will be checked each search cycle.
  • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
  • pageChangeReplace: the above string that controls the page number, with $1 used to represent the page number.
  • (slightly misnamed) numResultsPerPage: If the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.

 

Examples:

For example, consider a URL in which the page is selected directly by a parameter of the form "page=<page number>".

Then the following parameters would be used: "pageChangeRegex": "page=(\d+)", "pageChangeReplace": "page=$1", "numResultsPerPage": 1

And for a URL in which the pagination is controlled by a result offset of the form "start_result=<result offset>", with 20 results per page:

The following parameters would be used: "pageChangeRegex": "start_result=(\d+)", "pageChangeReplace": "start_result=$1", "numResultsPerPage": 20

It is likely that standard web-crawling measures are needed, such as custom user agents and per-page wait times. Because these may well differ between the search engine and the pages themselves, "Follow Web Links" has its own waitTimeBetweenPages_ms and userAgent fields (if not specified, these are inherited from the parent object).
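
Putting these pieces together, a paginated "links" element might look like the following sketch (the URL parameter, the values, and the link-extraction script are all illustrative; the "result" anchor class is a hypothetical feature of the search results page):

		{
			"links": {
				"userAgent": "example-harvester/1.0",
				"waitTimeBetweenPages_ms": 10000,
				"numPages": 5,
				"numResultsPerPage": 20,
				"pageChangeRegex": "start_result=(\\d+)",
				"pageChangeReplace": "start_result=$1",
				"stopPaginatingOnDuplicate": true,
				"scriptlang": "javascript",
				"script": "var links = []; var re = /<a class=\"result\" href=\"([^\"]+)\">([^<]+)<\\/a>/g; var m; while (null != (m = re.exec(text))) { links.push({ 'url': m[1], 'title': m[2] }); } links;"
			}
		}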

 


Splitter Examples

In the following example using "splitter", Follow Web Links has been configured to act on JSON/XML endpoints.  Metadata is extracted from the endpoints and then used to generate new documents.  deleteExisting is set to true to delete the originals.

After the extra documents have been generated, additional enrichment can be performed as part of the processing pipeline.

 },
        {
            "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": ["function create_links( urls, input_array )\n{\n    for (var x in input_array) {\n        var input = input_array[x];\n        urls.push( { url: input.url, title: input.title, description: input.desc, publishedData: input.date, fullText: input.text });\n    }\n}"]
            }
        },
        {
            "display": "Only check the API every 10 minutes (can be set to whatever you'd like)",
            "harvest": {
                "duplicateExistingUrls": true,
                "searchCycle_secs": 600
            }
        },
        {
            "contentMetadata": [{
                "fieldName": "json",
                "index": false,
                "script": "var json = eval('('+text+')'); json; ",
                "scriptlang": "javascript",
                "store": true
            }],
            "display": "Convert the text into a JSON object in the document's metadata field: _doc.metadata.json[0]"
        },
        {
            "display": "Take the original documents, split them using their metadaata into new documents, and then delete the originals",
            "splitter": {
                "deleteExisting": true,
                "script": "var urls = []; create_links( urls, _metadata.json[0].data ); urls;",
                "scriptflags": "m",
                "scriptlang": "javascript"
            }
        },

 

In this example, the individual pages of an E-Book are ingested into infinit.e and then split into individual documents using "splitter."  The original document is then deleted. 

The Global javascript function enables "splitter" to format the input into the appropriate array output.

 },
         {
            "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": [
                    "function convert_to_docs(jsonarray, topDoc)\n{\n    var docs = [];\n    for (var docIt in jsonarray) \n    { \n        var predoc = jsonarray[docIt];\n        var doc = {};\n        doc.url = topDoc.url.replace(/[?].*/,\"\") + '#' + (parseInt(docIt) + 1).toString();\n        doc.fullText = predoc.replace(/\\\\\\//,\"/\");\n        doc.title = topDoc.title + \"; Page: \" + (parseInt(docIt) + 1).toString();\n        doc.publishedDate = topDoc.publishedDate;\n        doc.description = topDoc.url;\n        docs.push(doc);\n    }\n    return docs; \n}\n\n"
                ]
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "pages",
                    "index": false,
                    "script": "div",
                    "scriptlang": "stream",
                    "store": true
                }
            ],
            "display": "Uses the PDF's internal structured to break each page into an element in a pages metadata fields in the first document"
        },
        {
            "display": "Take the individual pages created in the previous step, convert them into docs, then delete the original",
            "splitter": {
                "deleteExisting": true,
                "numPages": 10,
                "numResultsPerPage": 1,
                "script": "var docs = convert_to_docs(_doc.metadata['pages'], _doc); docs;",
                "scriptflags": "d",
                "scriptlang": "javascript"
            }
        },

 
