Format

{
    "display": string,
    "web": {
        "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

        "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms):
                                        // set the number higher if a site is timing out (it may limit the number of accesses from a given IP);
                                        // set it lower to increase the performance of the harvester on large sites
        "updateCycle_secs": integer, // Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered again in the RSS feed or in "extraUrls"
        "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
        "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested

        "extraUrls": [ // This array allows for manually specified URLs to be harvested once
            {
                "url": string, // The URL
                "title": string, // The title that the document will be given (i.e. the equivalent of the RSS title)
                "description": string, // (Optional) The description that the document will be given (i.e. the equivalent of the RSS description)
                "publishedDate": string, // (Optional) The date that will be assigned to the document (default: now) - this can be overridden from "structuredAnalysis"
                "fullText": string // (Optional) If present and "useTextExtractor" is "none", the specified string is used instead of the URL contents (mainly for debugging)
            }
        ],
        "userAgent": string, // (Optional) If present, overrides the system default user agent string
        "proxyOverride": string, // (Optional) "direct" to bypass the proxy (the default), or a proxy specification "(http|socks)://host:port"
        "httpFields": // (Optional) Additional HTTP fields to be applied to the request headers
        {
            "field": "value" // e.g. "cookie": "sessionkey=346547657687"
        }
    }
}
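
As a rough illustration of the format above, the following sketch fills in several of the optional fields with concrete values. All values are hypothetical (the host names, the user agent string, the proxy address and both regexes are placeholders), and the regexes are written with leading ".*" on the assumption that a pattern may need to match the whole URL; adjust them if partial matches are accepted.

```json
{
    "display": "Illustrative only: throttle requests and filter harvested URLs",
    "web": {
        "waitTimeOverride_ms": 20000,
        "updateCycle_secs": 86400,
        "regexInclude": ".*/news/.*",
        "regexExclude": ".*\\.pdf$",
        "extraUrls": [
            {
                "url": "http://youraddress.com/news/index.html",
                "title": "News index"
            }
        ],
        "userAgent": "ExampleHarvester/1.0",
        "proxyOverride": "http://proxy.example.com:8080"
    }
}
```

Here waitTimeOverride_ms is raised above the 10000ms default to reduce the load on a site that is timing out, and updateCycle_secs is set to one day.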

Description

The following table describes the parameters of the web extractor configuration.

Field	Description
feedType	Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)
waitTimeOverride_ms	Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms). Set the number higher if a site is timing out, since it may limit the number of accesses from a given IP. Set the number lower to increase the performance of the harvester on large sites.
updateCycle_secs	Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered again in the RSS feed or in "extraUrls"
regexInclude	Optional - if specified, only URLs matching the regex will be harvested
regexExclude	Optional - if specified, any URLs matching the regex will not be harvested
extraUrls	Complex type: an array of objects with the following fields (see "About extraUrls" below). "url": string, the URL; must always be set. "title": string, the title that the document will be given (i.e. the equivalent of the RSS title); must be set unless the "Follow web links" element is also used, in which case the page is only crawled for links and is not harvested. "description": string, (Optional) the description that the document will be given (i.e. the equivalent of the RSS description). "publishedDate": string, (Optional) the date that will be assigned to the document (default: now); this can be overridden from "structuredAnalysis". "fullText": string, (Optional) if present and "useTextExtractor" is "none", the specified string is used instead of the URL contents; it can be used to pre-populate content and is mostly useful for debugging.
userAgent	(Optional) If present overrides the system default user agent string
proxyOverride	(Optional) "direct" to bypass the proxy (the default), or a proxy specification "(http|socks)://host:port"
httpFields	(Optional) Additional HTTP fields to be applied to the request headers. Can contain the special field "Content", which will POST the associated value.
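
The httpFields behavior described in the table can be sketched as follows. This is a minimal illustration, not a tested configuration: the URL, the title and the request body are hypothetical, and the encoding of the "Content" value is an assumption that depends on what the target site expects.

```json
{
    "display": "Illustrative only: send a session cookie and POST a request body",
    "web": {
        "extraUrls": [
            {
                "url": "http://youraddress.com/api/search",
                "title": "Search results"
            }
        ],
        "httpFields": {
            "cookie": "sessionkey=346547657687",
            "Content": "query=example&page=1"
        }
    }
}
```

The "cookie" entry is applied as a request header, while the special "Content" entry causes its value to be sent as a POST.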

About extraUrls

Whether a "title" is specified affects how the web extractor generates documents. The behavior also depends on whether a links or splitter element is specified downstream in the source pipeline. For more information about links and splitter, see Follow Web links.

Links or Splitter is Not Included:

When neither a links nor a splitter element is included downstream, specifying a "title" for an extraUrls entry causes the Web Extractor to process that url as a web page. When no "title" is specified, the url is treated as an RSS feed. This enables you to mix both RSS feeds and web pages within the same source configuration.
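
A minimal sketch of such a mixed configuration, assuming no links or splitter element downstream, might look like the following. Both URLs are placeholders; the first entry has no "title" and so would be treated as an RSS feed, while the second has a "title" and so would be processed as a web page.

```json
{
    "display": "Illustrative only: one RSS feed and one web page in the same source",
    "web": {
        "extraUrls": [
            {
                "url": "http://youraddress.com/feed.rss"
            },
            {
                "url": "http://youraddress.com/title.html",
                "title": "Page Title"
            }
        ]
    }
}
```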

Links or Splitter is Included:

If a links element is included downstream, specifying a "title" causes the Web Extractor to treat the url as a web page. The original page is preserved as a document, and links can still be followed depending on how the links element is set up.

When no "title" is specified, the Web Extractor treats the url as an API endpoint and does not keep the page itself as a document. Documents are generated only from the responses from the API.

Examples

extraUrls

The following example uses the extraUrls parameter of the Web Extractor. extraUrls is used to indicate URLs that should only be harvested once.

In addition, when using the Web Extractor (as opposed to the Feed Extractor), it is also possible to specify the title, description, publishedDate, and fullText attributes of each extraUrls entry.

In this way, you can manually specify the values that these attributes should take when the extraUrls are harvested.

In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs.

A "title" is also specified, which can affect document generation depending on whether a links or splitter element is present downstream in the source pipeline.

{
    "display": "Extract each document (re-extracting every 'updateCycle_secs') with the specified title and summary text",
    "web": {
        "extraUrls": [{
            "description": "Optional",
            "title": "Page Title",
            "url": "http://youraddress.com/title.html"
        }],
        "updateCycle_secs": 86400
    }
}

Footnotes:

Legacy documentation: Feed object
Legacy documentation: Using the Feed Harvester
