
Overview

Extracts documents from lists of URLs (and also RSS feeds, if no "title" is specified). Note that the resulting documents normally need to be passed through a text extraction stage before further processing. One exception is using the Follow Web Links processing element to extract more documents, e.g. from hyperlinks in the original web pages.

Format

{
    "display": string,
    "web": {
        "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

        "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms)
                // i.e. if a site is timing out it may be limiting the number of accesses from a given IP - set this number higher
                // for large sites you can increase the performance of the harvester by setting this number lower
        "updateCycle_secs": integer, // Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered again from the RSS feed or in the "extraUrls" list
        "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
        "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested

        "extraUrls": [ // This array allows manually specified URLs to be harvested once
            {
                "url": string, // The URL
                "title": string, // The title that the document will be given (i.e. the equivalent of the RSS title)
                "description": string, // (Optional) The description that the document will be given (i.e. the equivalent of the RSS description)
                "publishedDate": string, // (Optional) The date that will be assigned to the document (default: now) - this can be overridden from "structuredAnalysis"
                "fullText": string // (Optional) If present and "useTextExtractor" is "none", uses the specified string instead of the URL contents (mainly for debugging)
            }
        ],
        "userAgent": string, // (Optional) If present, overrides the system default user agent string
        "proxyOverride": string, // (Optional) "direct" to bypass the proxy (the default), or a proxy specification "(http|socks)://host:port"
        "httpFields": // (Optional) Additional HTTP fields to be applied to the request headers
        {
            "field": "value" // e.g. "cookie": "sessionkey=346547657687"
        }
    }
}
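
For example, a minimal sketch of a "web" object that throttles its request rate, filters on URL patterns, and sets custom HTTP headers might look as follows (all URLs, the user agent, the proxy address, and the cookie value are hypothetical placeholders):

{
    "display": "Example web extractor",
    "web": {
        "waitTimeOverride_ms": 20000, // Be gentler than the 10000ms default
        "regexInclude": ".*[.]html?", // Only harvest HTML pages
        "regexExclude": ".*login.*", // Skip login pages
        "extraUrls": [
            {
                "url": "http://www.example.com/index.html", // Hypothetical starting page
                "title": "Example landing page"
            }
        ],
        "userAgent": "Mozilla/5.0 (compatible; ExampleHarvester/1.0)", // Hypothetical override
        "proxyOverride": "http://proxy.example.com:8080", // Hypothetical proxy
        "httpFields": {
            "cookie": "sessionkey=346547657687" // Cookie example from the format above
        }
    }
}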

 

Description

In the following example, the web extractor (here using its legacy element name, "feed") uses the "extraUrls" parameter to harvest a single manually specified URL:

{
    "description": "For cyber demo",
    "isPublic": false,
    "mediaType": "Log",
    "searchCycle_secs": 3600,
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "Cyber Logs Test",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c?infinite_api_key=API_KEY"
                    }
                ]
            }
        }
    ]
}
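
In a complete source, further pipeline elements would follow the web extractor; as the Overview notes, the harvested documents normally need a text extraction stage before further processing. A minimal sketch of such a follow-on element, appended to the "processingPipeline" array, is shown below (the "textEngine" element and "boilerpipe" engine name are assumptions, not defined on this page):

        {
            "textEngine": {
                "engineName": "boilerpipe" // Assumed engine name for boilerplate-stripping text extraction
            }
        }
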
In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs: 86400 seconds, i.e. a harvested document may be replaced if it is more than one day old when its URL is encountered again.

{
    "description": "wiy",
    "isPublic": true,
    "mediaType": "News",
    "tags": [
        "tag1"
    ],
    "title": "aaa xml test",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.w3schools.com/xml/simple.xml"
                    }
                ],
                "updateCycle_secs": 86400
            }
        }
    ]
}

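Finally, as noted in the Overview, an "extraUrls" entry with no "title" is treated as an RSS feed rather than a single document. A minimal sketch combining an RSS feed with the regex filters from the format above (the feed URL and regexes are hypothetical):

{
    "description": "RSS example",
    "isPublic": false,
    "mediaType": "News",
    "tags": [
        "example"
    ],
    "title": "Example RSS source",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.example.com/news/rss.xml" // No "title", so treated as an RSS feed
                    }
                ],
                "regexInclude": ".*example[.]com/news/.*", // Only harvest links under /news/
                "regexExclude": ".*[?]print=true", // Skip printer-friendly duplicates
                "updateCycle_secs": 86400 // Re-harvest documents older than one day
            }
        }
    ]
}
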
Footnotes:

Legacy documentation:

Feed object

Using the Feed Harvester
