Code Block
{
    "display": string,
    "feed": {
        "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

        "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms):
                                        // ie if a site is timing out it may limit the number of accesses from a given IP - set the number higher
                                        // for large sites you can increase the performance of the harvester by setting this number lower
        "updateCycle_secs": integer, // Optional - if present harvested URLs may be replaced if they are older than this time and are encountered from the RSS or in the "extraUrls"
        "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
        "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested

        "extraUrls": [ // This array allows for manually specified URLs to be harvested once
            {
                "url": string // The URL
            }
        ],
        "userAgent": string, // (Optional) If present overrides the system default user agent string
        "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
        "httpFields": { // (Optional) Additional HTTP fields to be applied to the request headers
            "field": "value" // eg "cookie": "sessionkey=346547657687"
        }
    }
}
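
As an illustration of the two optional timing fields in the specification above, the fragment below is a minimal sketch of a pipeline element that waits longer between reads and allows previously harvested URLs to be replaced once they are a day old. The values 20000 and 86400 are illustrative assumptions, not recommended settings.

```json
{
    "feed": {
        "waitTimeOverride_ms": 20000,
        "updateCycle_secs": 86400
    }
}
```

Raising waitTimeOverride_ms above the 10000ms default is the direction suggested for sites that time out or limit accesses per IP; lowering it is the direction suggested for large sites where harvest speed matters.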

Legacy documentation:

...

 

Description

The Feed Harvester will connect to and extract data from an RSS feed.

It uses feedType to specify that the data source is RSS.  It connects to the specified URLs and can include or exclude URLs by regular expression, using regexInclude or regexExclude.
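
A minimal sketch of the include/exclude filtering described above. Both patterns are hypothetical. The documentation does not state the regex dialect or whether a pattern must match the whole URL or only part of it, so the patterns are written with a leading ".*" to behave the same way in either case.

```json
{
    "feed": {
        "regexInclude": ".*/security/.*",
        "regexExclude": ".*\\.pdf$"
    }
}
```

Under these assumptions, only URLs containing "/security/" would be harvested, and any URL ending in ".pdf" would be skipped.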

In the following example, the web extractor is used to run extraUrls

...

Complex parameter against the web content.

Code Block
{
    "description": "For cyber demo",
    "isPublic": false,
    "mediaType": "Log",
    "searchCycle_secs": 3600,
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "Cyber Logs Test",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c?infinite_api_key=API_KEY"
                    }
                ]
            }
        },




 

ExtraUrls is a complex type that enables urls to be manually specified, overriding settings that would be provided by the RSS feed.

Example:

In the example, the URL to be harvested once is specified.  The title, description, publisheddate, and fullText parameters can be used to manually specify strings for the specified URLs, rather than using the harvested RSS data.

In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs.

Code Block
{
    "description": "wiy",
    "isPublic": true,
    "mediaType": "News",
    "tags": [
        "tag1"
    ],
    "title": "aaa xml test",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.w3schools.com/xml/simple.xml"
                    }
                ],
                "updateCycle_secs": 86400
            }

Each extraUrls entry accepts the following fields:

Code Block
"extraUrls": [ // This array allows for manually specified URLs to be harvested once
    {
        "url": string, // The URL
        "title": string, // The title that the document will be given (ie the equivalent to the RSS title)
        "description": string, // (Optional) The description that the document will be given (ie the equivalent to the RSS description)
        "publishedData": string, // (Optional) The date that will be assigned to the document (default: now) - this can be overridden from "structuredAnalysis"
        "fullText": string // (Optional) If present and "useTextExtractor" is "none", then uses the specified string instead of the URL contents (mainly for debugging)
    }
]

 

userAgent

This parameter, if present, will override the system default user agent string.  This can be used for emulating a specific web browser.

Example:

TODO
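
A minimal sketch of a pipeline element that sets userAgent to emulate a desktop browser. The user agent string shown is only an illustrative value; any string can be supplied.

```json
{
    "feed": {
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }
}
```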

proxyOverride

TODO
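
Per the format specification, proxyOverride takes either "direct" (bypass the proxy, the default) or a proxy specification of the form "(http|socks)://host:port". The fragment below is a minimal sketch of the second form; the host name and port are hypothetical placeholders.

```json
{
    "feed": {
        "proxyOverride": "http://proxy.example.com:8080"
    }
}
```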

httpFields

TODO
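
Per the format specification, httpFields holds additional HTTP fields that are applied to the request headers. The fragment below is a minimal sketch that sends a session cookie, reusing the sample value from the specification comment; a real source would substitute its own field names and values.

```json
{
    "feed": {
        "httpFields": {
            "cookie": "sessionkey=346547657687"
        }
    }
}
```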

Examples

...

 },