Overview

Extracts documents from lists of URLs (and also RSS feeds, if no "title" is specified). Note that the resulting documents normally need to be passed through a text extraction stage before further processing. One exception is using the Follow Web Links processing element to extract more documents from hyperlinks in the original web pages.

Format

{
    "display": string,
    "web": {
        "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

        "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms):
                                        // eg if a site is timing out, it may be limiting the number of accesses from a given IP - set the number higher
                                        // for large sites you can increase the performance of the harvester by setting this number lower
        "updateCycle_secs": integer, // Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered in the RSS feed or in "extraUrls"
        "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
        "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested

        "extraUrls": [ // This array allows for manually specified URLs to be harvested once
            {
                "url": string, // The URL
                "title": string, // The title that the document will be given (ie the equivalent to the RSS title)
                "description": string, // (Optional) The description that the document will be given (ie the equivalent to the RSS description)
                "publishedDate": string, // (Optional) The date that will be assigned to the document (default: now) - this can be overridden from "structuredAnalysis"
                "fullText": string // (Optional) If present and "useTextExtractor" is "none", then uses the specified string instead of the URL contents (mainly for debugging)
            }
        ],
        "userAgent": string, // (Optional) If present, overrides the system default user agent string
        "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
        "httpFields": // (Optional) Additional HTTP fields to be applied to the request headers
        {
            "field": "value" // eg "cookie": "sessionkey=346547657687"
        }
    }
}

 

Description

The following table describes the parameters of the web extractor configuration.

Field    Description

feedType

Currently not used. Will allow for RSS vs Atom in future releases (currently only RSS is supported).

waitTimeOverride_ms

Optional. If specified, controls the amount of time between successive reads to a site (default: 10000ms). If a site is timing out, it may be limiting the number of accesses from a given IP, so set the number higher. For large sites, you can increase the performance of the harvester by setting this number lower.

updateCycle_secs

Optional. If present, harvested URLs may be replaced if they are older than this time and are encountered in the RSS feed or in "extraUrls".

regexInclude

Optional. If specified, only URLs matching the regex will be harvested.

regexExclude

Optional. If specified, any URLs matching the regex will not be harvested.

extraUrls

This array allows for manually specified URLs to be harvested once. Each entry is an object with a "url" field (the URL) and the "title", "description", publication date, and "fullText" fields described in the Format section above.

userAgent

Optional. If present, overrides the system default user agent string.

proxyOverride

Optional. "direct" to bypass the proxy (the default), or a proxy specification "(http|socks)://host:port".

httpFields

Optional. Additional HTTP fields to be applied to the request headers.
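As a rough illustration of how the request-related fields above (userAgent, proxyOverride, httpFields) might be combined, the following sketch fetches a single page through an HTTP proxy and sends a session cookie with the request. The URL, page title, proxy host and port, and user agent string are all placeholders invented for this example; only the field names and the cookie value come from the Format section.

```json
{
    "display": "Hypothetical example: fetch one page via a proxy, with a custom user agent and a session cookie",
    "web": {
        "extraUrls": [{
            "url": "http://youraddress.com/members/report.html",
            "title": "Members Report"
        }],
        "userAgent": "ExampleHarvester/1.0",
        "proxyOverride": "http://proxy.youraddress.com:8080",
        "httpFields": {
            "cookie": "sessionkey=346547657687"
        }
    }
}
```

Omitting proxyOverride (or setting it to "direct") should leave the request unproxied, since "direct" is documented as the default.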

Examples

extraUrls

In the following example, the Web Extractor harvests the URLs listed in the extraUrls parameter. extraUrls is used to indicate URLs that should only be harvested once.

In addition, when using the Web Extractor (as opposed to the Feed Extractor), it is also possible to specify the title, description, publishedDate, and fullText attributes of each extraUrls entry.

In this way, you can manually specify the values that these attributes should take when the extraUrls are harvested.

In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs (86400 seconds, ie once a day).

{
    "display": "Extract each document (re-extracting every 'updateCycle_secs') with the specified title and summary text",
    "web": {
        "extraUrls": [{
            "description": "Optional",
            "title": "Page Title",
            "url": "http://youraddress.com/title.html"
        }],
        "updateCycle_secs": 86400
    }
}
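The URL filters and the wait time can be sketched in the same way. The configuration below is a hypothetical example, not taken from a real source: it lists a feed URL without a "title" (so, per the Overview, it should be treated as an RSS feed), keeps only links containing "/articles/", skips links ending in ".pdf", and doubles the default wait between reads to 20000ms. The feed URL and both patterns are placeholders. Whether the regexes must match the whole URL or only part of it is not stated on this page, so the patterns are written to behave the same either way.

```json
{
    "display": "Hypothetical example: harvest only article links from a feed, skipping PDFs, with a slower read rate",
    "web": {
        "extraUrls": [{
            "url": "http://youraddress.com/feed.rss"
        }],
        "waitTimeOverride_ms": 20000,
        "regexInclude": ".*/articles/.*",
        "regexExclude": ".*\\.pdf$",
        "updateCycle_secs": 86400
    }
}
```

Note that the backslash in the exclude pattern is doubled because it sits inside a JSON string; the regex itself is `.*\.pdf$`.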



Footnotes:

Legacy documentation: Feed object

Legacy documentation: Using the Feed Harvester

 

 

 
