...

  • Follow web links ("links") - just uses the userAgent/proxyOverride/httpFields to control any HTTP requests that are made from the subsequent "split document" elements
  • Split documents ("splitter") - takes the current document and uses its metadata to create new documents (by outputting an object in the format immediately below). The current document can then be retained or discarded. Newly spawned documents appear at the current point in the pipeline, i.e. they skip earlier elements.

...


Format

...

Splitter shortcut syntax

When using the "splitter" element, it is possible to perform standard splits on JSON and XML without any complex scripting, using the shortcut format described below.

  • Set the "scriptlang" to "automatic", "automatic_json", or "automatic_xml" (the distinction is described below)
  • In basic mode, the "script" is in one of the following formats:
    • "fullText": Parses the text as a line-separated collection of either XML or JSON (eg "<object>1</object><object>2</object>" would generate two documents)
    • "fullText.<field>": As above, except grabs the object represented by field (which can be nested using dot notation), eg "fullText.object" would parse "<objects><object>1</object><object>2</object></objects>"
    • "metadata.<field>": Takes the specified field from the metadata object (which is an array), and converts each element of the array into a document.
      • (Note <field> cannot be nested, unlike for "fullText")
    • "<field>": As above, ie the "metadata." prefix is optional. (The prefix might be necessary if the <field> is "metadata" or "fullText".)

In basic mode, the URL is set to the parent URL plus "#<docnum>", and the title is set to the parent title plus "(<docnum>)". The text and metadata formats are described after the info box below.
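The basic-mode defaults above can be sketched in a few lines of JavaScript. This is only an illustration, not the platform's implementation: `splitMetadataField` and `parentDoc` are hypothetical names, document numbering is assumed to start at 1, and the space before the bracketed number in the title follows the full example later on this page.

```javascript
// Sketch only: mimics the documented basic-mode defaults for "metadata.<field>".
// The function and parameter names are hypothetical.
function splitMetadataField(parentDoc, field) {
  // "metadata.<field>" and "<field>" are equivalent, so strip the optional prefix.
  const prefix = "metadata.";
  const name = field.startsWith(prefix) ? field.slice(prefix.length) : field;
  const elements = (parentDoc.metadata && parentDoc.metadata[name]) || [];

  return elements.map((element, index) => {
    const docnum = index + 1; // assumption: numbering starts at 1
    return {
      url: parentDoc.url + "#" + docnum,            // parent URL + "#<docnum>"
      title: parentDoc.title + " (" + docnum + ")", // parent title + "(<docnum>)"
      source: element                               // the array element that became this document
    };
  });
}
```

For a parent with URL "http://parent" and a two-element "items" array, the derived URLs under these assumptions would be "http://parent#1" and "http://parent#2".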

Info

Setting the URL to the above default is in many cases not desirable, since unlike title/description/fullText/displayUrl, the document "url" field cannot be changed (since it is used for deduplication).

Therefore there is a more complex syntax that enables the URL to be derived from one or more fields:

  • Simpler version: set "script" field to "<splitting-field>,<url-field>"
    • <splitting-field> is as described above
    • <url-field> takes the specified field from the JSON/XML/metadata and uses it for the URL
      • eg: "script": "fullText.object,url" would parse "<objects><object><meta>1</meta><url>http://blah1</url></object><object><meta>2</meta><url>http://blah2</url></object></objects>"
  • More complex version: set "script" field to "<splitting-field>,<url-string>,<url-field1>,<url-field2>,etc"
    • <url-string> is a string (no commas allowed) with substitutions for {0}, {1}, etc mapping to <url-field1>, <url-field2> etc (full format specification)
      • eg with the same XML fullText as the previous example then "script": "fullText.object,my_url_is_{0},url" would return "my_url_is_http://blah1" and "my_url_is_http://blah2" as the 2 URLs.
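As a rough illustration of how the two "script" variants above fit together, the following sketch parses the shortcut string and fills in the URL. It is not the platform's parser: `parseSplitterScript` and `buildUrl` are hypothetical helpers, only plain {0}, {1} placeholders are handled (not the full format specification), and trimming whitespace around commas is an assumption.

```javascript
// Hypothetical helper: splits the shortcut "script" value into its parts.
function parseSplitterScript(script) {
  const parts = script.split(",").map((part) => part.trim());
  const splittingField = parts[0];
  if (parts.length === 1) {
    // Basic mode: no URL information, so the default URL applies.
    return { splittingField, urlString: null, urlFields: [] };
  }
  if (parts.length === 2) {
    // Simpler version: the URL is the value of the single <url-field>.
    return { splittingField, urlString: "{0}", urlFields: [parts[1]] };
  }
  // More complex version: <url-string> followed by one or more <url-field>s.
  return { splittingField, urlString: parts[1], urlFields: parts.slice(2) };
}

// Replaces {0}, {1}, ... with the values of the corresponding URL fields.
function buildUrl(parsed, splitObject) {
  return parsed.urlString.replace(/\{(\d+)\}/g, (match, index) => {
    const fieldName = parsed.urlFields[Number(index)];
    // Leave unknown placeholders untouched rather than guessing a value.
    return fieldName === undefined ? match : String(splitObject[fieldName]);
  });
}

const parsed = parseSplitterScript("fullText.object,my_url_is_{0},url");
const exampleUrl = buildUrl(parsed, { meta: "1", url: "http://blah1" });
// exampleUrl is "my_url_is_http://blah1"
```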

The fullText and metadata fields of the split object depend on the scriptlang:

  • If "automatic" is used then no metadata is generated and the fullText is the split object, eg in the above XML example, you'd get two documents, with no metadata and the following fullText fields:
    • "<meta>1</meta><url>http://blah1</url>"
    • "<meta>2</meta><url>http://blah2</url>"
  • If "automatic_json" is used, then the fullText is the same, but the metadata object contains a single field, "json", containing the JSON-ified object, eg:
    • "metadata": { "json": [ { "meta": "1", "url": "http://blah1" } ] }
    • "metadata": { "json": [ { "meta": "2", "url": "http://blah2" } ] }
  • If "automatic_xml" is used, it is similar, except the metadata object contains one element for each field of the JSON-ified object, eg:
    • "metadata": { "meta": [ "1" ], "url": [ "http://blah1" ] }
    • "metadata": { "meta": [ "2" ], "url": [ "http://blah2" ] }
  • (ie "automatic_json" vs "automatic_xml" are consistent with the metadata formats derived from the "file" extractor element)
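The three metadata layouts listed above can be summarised with a small sketch. `buildMetadata` is a hypothetical function written for illustration, and it assumes the split object has already been converted to a flat JavaScript object.

```javascript
// Sketch only: returns the metadata layout described for each scriptlang.
function buildMetadata(scriptlang, splitObject) {
  switch (scriptlang) {
    case "automatic":
      // No metadata is generated; only fullText is populated.
      return undefined;
    case "automatic_json":
      // A single "json" field holding the JSON-ified object.
      return { json: [splitObject] };
    case "automatic_xml": {
      // One metadata field per field of the object, each wrapped in an array.
      const metadata = {};
      for (const [field, value] of Object.entries(splitObject)) {
        metadata[field] = [value];
      }
      return metadata;
    }
    default:
      throw new Error("Unsupported scriptlang: " + scriptlang);
  }
}
```

Calling it with the first object from the XML example, `{ meta: "1", url: "http://blah1" }`, would give the "json" wrapper for "automatic_json" and the per-field arrays for "automatic_xml".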

Code Block
language: js
title: Full splitter example
// The source

{
//...
	"processingPipeline": [
//...
		{
			"splitter": {
				"scriptlang": "automatic_json",
				"script": "fullText.object, http://test/{0}/{1}, url, meta"
			}
		}
//...
	]
//...
}

// Would map the extracted document

{
	"url": "blahurl",
	"title": "blah",
	"fullText": "<objects><object><meta>1</meta><url>blah1</url></object><object><meta>2</meta><url>blah2</url></object></objects>"
}

// to the 2 derived docs:

{
	"title": "blah (1)",
	"url": "http://test/blah1/1",
	"fullText": "<object><meta>1</meta><url>blah1</url></object>",
	"metadata": {
		"json": [ { "meta": "1", "url": "blah1" } ]
	}
},
{
	"title": "blah (2)",
	"url": "http://test/blah2/2",
	"fullText": "<object><meta>2</meta><url>blah2</url></object>",
	"metadata": {
		"json": [ { "meta": "2", "url": "blah2" } ]
	}
}

// Of course, subsequent pipeline elements can then manipulate/add fields other than "url" as per usual
 

Format

Code Block
{
	"display": string,
	"links": { // or "splitter"
	    "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
	    "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" 
	    "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
		                // [ { "url": string, "title": string /* optional-ish */, 
		                //     "description": string /* optional */, "publishedDate": string /* optional */,
		                //     "spiderOut": string /* optional */ }, ... ]
	    "scriptlang": string, // (Mandatory) Only "javascript" is supported (plus, for "splitter", the "automatic", "automatic_json" and "automatic_xml" shortcuts described above), use extraMeta for different script types
	    "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details
	    "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see "Using The Feed Harvester" overview
	    "pageChangeRegex": string, // (Optional) If non-null, this regex should be used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
				                    // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected
	    "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage
	    "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go
	    "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false) then will stop harvesting as soon as an already harvested link is encountered
	                                            // (for APIs that return docs in time order, ensures that no time is wasted harvesting and then discarding duplicate links)
	    "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page
	    "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set
 
	    "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go
	}
}
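To make the "script" contract above more concrete, here is a minimal sketch of a script whose last evaluated statement is the array of links. The input data is hard-coded and hypothetical, because the variables the platform exposes to the script are not described in this section. Plain JavaScript `eval` is used only as a stand-in for the harvester, to show that the value of the last statement is what comes back.

```javascript
// Hypothetical script text, as it might appear in the "script" field.
const script = `
  var items = [
    { link: "http://example.com/a", name: "Article A" },
    { link: "http://example.com/b", name: "Article B" }
  ];
  var links = [];
  for (var i = 0; i < items.length; i++) {
    links.push({ url: items[i].link, title: items[i].name });
  }
  links; // last statement evaluated, so this is the script's result
`;

// Stand-in for the harvester: eval yields the value of the last statement.
const result = eval(script);
// result is an array of two objects, each with "url" and "title" fields
```

Optional fields such as "description", "publishedDate" and "spiderOut" would be added to each object in the same way.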

...