...

  • extractType
    The extractType field tells the harvester which type of source to extract from; for the Feed Harvester this is Feed. Other valid values include Database, etc.
  • authentication (optional)
    The Authentication object of the Source document is a subset of the full Authentication object in that it only uses the 'username' and 'password' fields. The Feed Harvester uses the username and password from the Authentication object as feed credentials (if needed).
    • username
    • password
      Note: The password field in the Authentication object is currently stored as clear text, ie the value placed in password is not encrypted by Infinit.e. Encryption of the password field is planned for a future release.
  • url
    The URL to retrieve the RSS feed from.
  • extraUrls
    Allows additional URLs to be specified for collection.
  • searchConfig
    Described below


...

API parsing, Link following, web crawling, and similar activities

It will often be the case that a base URL will not contain useful content, but will contain links to useful content. Examples include:

  • Internet and Intranet searches, lists of PDFs of academic publications, etc.
  • Many site APIs (eg LinkedIn, TechCrunch, Flickr)
  • Directory or disambiguation pages (eg Wiki)

In addition, pages with useful content often contain links to further pages with useful content, ie the standard web crawling problem.

XML and JSON APIs also often return, from a single URL, a response containing many "documents", ie independent content snippets.

In some cases pagination from APIs is achieved by passing a field containing the URL of the next page; in other cases the URL follows a standard pattern.
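
The two pagination styles just described can be sketched as plain javascript functions. This is only an illustration: the "next_page" field name and the "page" query parameter are assumptions made for the example, and real APIs will use their own names. JSON.parse is used here for brevity, whereas the script examples in this section use eval.

```javascript
// Style 1: the API reply carries the URL of the next page in a field.
// "next_page" is a hypothetical field name; real APIs differ.
function nextPageFromField(text) {
	var json = JSON.parse(text);
	return json.next_page || null; // null when there is no further page
}

// Style 2: the next URL follows a fixed pattern.
// The "page" query parameter is a hypothetical example of such a pattern.
function nextPageFromPattern(baseUrl, currentPage) {
	return baseUrl + "?page=" + (currentPage + 1);
}
```

In the first style the script has to inspect each reply to find the next URL; in the second the full list of page URLs can be worked out in advance.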

The "searchConfig" field of the Feed Harvester provides an interface for handling all of these aspects of JSON APIs (splitting replies into many documents, fetching links, and following "next page" links). It can also be used for more standard HTML link-following and web-crawling, though it is harder to use in that context.

The remainder of this section describes the basic usage of the "searchConfig" object (ie suitable for API parsing), and then provides a brief description of the more complex activity of HTML parsing.

Basic use of searchConfig

The "script" field of the "searchConfig" object needs to contain javascript (so the "scriptlang" field should always be set to "javascript"). The javascript is passed a single variable "text", containing the response from the specified URL (or URLs, if "extraUrls" is specified), and needs to return an array of objects in the following format:

Code Block
{
	"url": string, // Mandatory - this URL is copied into the "URL" field of the generated document, 
					// and is used to fetch the content unless "fullText" is set.
	"title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
	"description": string, // Optional, if set then used to generate the document's description.
	"publishedDate": string, // Optional, if not set then the current date is used instead.
 
	"fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.
 
	"spiderOut": integer //Optional, if set to true then the searchConfig.script is applied to the resulting document,
							// for a depth of up to "searchConfig.maxDepth" times
}
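
The field rules above can be checked mechanically when developing a script outside the harvester. The following is a minimal sketch of such a check; "checkResult" is a hypothetical helper written for this illustration and is not part of Infinit.e.

```javascript
// Hypothetical helper, not part of Infinit.e: checks one object returned by a
// searchConfig script against the field rules listed above.
// Returns an array of error strings (empty if the object looks valid).
function checkResult(obj) {
	var errors = [];
	if (typeof obj.url !== "string" || obj.url.length === 0) {
		errors.push("url is mandatory");
	}
	// "title" may only be omitted when "spiderOut" is set
	if (!obj.spiderOut && (typeof obj.title !== "string" || obj.title.length === 0)) {
		errors.push("title is mandatory unless spiderOut is set");
	}
	// The remaining fields are optional, but must be strings if present
	var optional = ["description", "publishedDate", "fullText"];
	for (var i = 0; i < optional.length; i++) {
		var field = optional[i];
		if (obj[field] !== undefined && typeof obj[field] !== "string") {
			errors.push(field + " must be a string if set");
		}
	}
	return errors;
}
```

For example, checkResult({ url: "http://example.com/1" }) reports the missing title, while the same object with spiderOut: true passes.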

So the basic use of searchConfig on JSON-based APIs should be quite straightforward, ie along the following lines:

Code Block (javascript): Outline API parsing using searchConfig
var json = eval('(' + text + ')');
var retval = [];
// For each "result" in the array
// Extract URL, title, description, eg for the flickr blogs API 
// (http://www.flickr.com/services/api/response.json.html)
for (var x in json.blogs.blog) { // "var" avoids creating an implicit global
	var blog = json.blogs.blog[x];
	var retobj = { url: blog.url, title: blog.name };
	retval.push(retobj);
}
// Alternatively set retobj.fullText to specify the content from the API response
// In addition set retobj.spiderOut: true, to run this script on the corresponding URL.
retval; // annoying feature of our javascript engine, instead of returning you just evaluate the var to return

Note: When javascript is used, the same security restrictions as elsewhere apply.
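
The "fullText" and "spiderOut" options mentioned in the comments of the outline script can be sketched as follows, for an imaginary JSON API that returns several items per reply plus a link to the next page. The field names "items", "link", "headline", "body" and "next" are assumptions made for this example only. The logic is wrapped in a function ("parseReply", also hypothetical) so that it can be run outside the harvester.

```javascript
// Sketch only: "items", "link", "headline", "body" and "next" are hypothetical
// field names for an imaginary JSON API.
function parseReply(text) {
	var json = eval('(' + text + ')');
	var retval = [];
	for (var i = 0; i < json.items.length; i++) {
		var item = json.items[i];
		// "fullText" is set, so the content is taken from the reply itself
		// and "url" is not followed
		retval.push({ url: item.link, title: item.headline, fullText: item.body });
	}
	if (json.next) {
		// "spiderOut" is set, so the script is applied again to the next page;
		// "title" may be omitted in this case
		retval.push({ url: json.next, spiderOut: true });
	}
	return retval;
}
```

Inside "searchConfig.script" the function wrapper would be dropped and the script would end by evaluating retval, as in the outline above. How far "next page" links are followed is bounded by "searchConfig.maxDepth".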

Advanced use of searchConfig

TODO: IN PROGRESS