...

The following table describes the parameters of the feed extractor configuration.

Field

Description
feedType

Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

waitTimeOverride_ms

Optional - if specified, controls the amount of time in milliseconds between successive reads to a site (default: 10000ms). If a site is timing out, it may be limiting the number of accesses from a given IP, so set the number higher. For large sites, you can increase the performance of the harvester by setting this number lower.

updateCycle_secs

Optional - if present, harvested URLs may be replaced if they are older than this time (in seconds) and are encountered again in the RSS feed or in "extraUrls".

regexInclude

Optional - if specified, only URLs matching the regex will be harvested

regexExclude

Optional - if specified, any URLs matching the regex will not be harvested

extraUrls

This array allows manually specified URLs to be harvested once. Each element is an object containing a "url" field (string): the URL to harvest.

userAgent

(Optional) If present, overrides the system default user agent string

proxyOverride

(Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"

httpFields

(Optional) Additional HTTP fields to be applied to the request headers  
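
As a rough illustration of how several of the optional fields above might be combined in a "feed" block, consider the following sketch. All values are hypothetical: the regular expressions, the user agent string, and the proxy host are placeholders. The shape of "httpFields" (an object mapping header names to values) is an assumption, since the table only says that the fields are applied to the request headers.

```json
{
    "feed": {
        "waitTimeOverride_ms": 20000,
        "regexInclude": ".*/blog/.*",
        "regexExclude": ".*\\.pdf$",
        "userAgent": "ExampleHarvester/1.0",
        "proxyOverride": "http://proxy.example.com:8080",
        "httpFields": {
            "Accept-Language": "en-US"
        }
    }
}
```

In this sketch the harvester would wait 20 seconds between reads (double the default), harvest only URLs containing "/blog/", skip any URL ending in ".pdf", and route requests through an HTTP proxy rather than connecting directly.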

...

ExtraUrls

In the following feed example, the extractor applies the extraUrls parameter to the feed. ExtraUrls is a complex type that enables URLs to be manually specified, overriding settings that would be provided by the RSS feed. Additionally, in this example, text and feature extraction are performed using textEngine and featureEngine.

 

Code Block
{
    "description": "Article on Medical Issues",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "key": "http.www.mayoclinic.com.rss.blog.xml",
    "mediaType": "News",
    "modified": "Oct 19, 2010 11:31:59 AM",
    "tags": [
        "topic:healthcare",
        "industry:healthcare",
        "mayo clinic",
        "health"
    ],
    "title": "MayoClinic: General Topics",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.mayoclinic.com/rss/blog.xml"
                    }
                ]
            }
        },
    

 


...

        {
            "textEngine": {
                "engineName": "AlchemyAPI"
            }
        },
        {
            "featureEngine": {
                "engineName": "OpenCalais"
            }
        }
    ]
}


 

Refreshing URLs

In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs.
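
A minimal sketch of such a "feed" block is shown here, assuming that updateCycle_secs sits alongside extraUrls as described in the parameter table. The value of 86400 seconds (one day) is purely illustrative and is not taken from the original example.

```json
{
    "feed": {
        "updateCycle_secs": 86400,
        "extraUrls": [
            {
                "url": "http://www.mayoclinic.com/rss/blog.xml"
            }
        ]
    }
}
```

With a setting like this, a previously harvested URL that is encountered again would become eligible for replacement once its stored copy is more than a day old.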

...