Page Comparison

...

When spiderOut is enabled, additional URLs (if present) continue to be followed until the depth set by the maxDepth parameter has been reached.
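The depth limit can be pictured with a small sketch. This is not the platform's implementation: the `pages` map, the `spiderOut` and `lookup` functions, and the convention that the starting page sits at depth 0 are all assumptions made for illustration.

```javascript
// Hypothetical in-memory "web": each URL maps to the URLs found on that page.
var pages = {
  "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
  "http://example.com/b": ["http://example.com/d"],
  "http://example.com/c": [],
  "http://example.com/d": ["http://example.com/e"],
  "http://example.com/e": []
};

// Follow links breadth-first from startUrl, one level per iteration,
// and stop once maxDepth levels have been followed.
function spiderOut(startUrl, maxDepth, getLinks) {
  var visited = {};
  visited[startUrl] = true;
  var order = [startUrl];
  var frontier = [startUrl];
  for (var depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    var next = [];
    for (var i = 0; i < frontier.length; i++) {
      var links = getLinks(frontier[i]);
      for (var j = 0; j < links.length; j++) {
        if (!visited[links[j]]) {
          visited[links[j]] = true;
          order.push(links[j]);
          next.push(links[j]);
        }
      }
    }
    frontier = next;
  }
  return order;
}

// Assumed link lookup for the sketch: unknown URLs have no links.
function lookup(url) { return pages[url] || []; }

var reached = spiderOut("http://example.com/a", 2, lookup);
```

With a depth of 2 the sketch reaches pages a, b, c and d but never requests e, because e is only linked from a page that is already at the depth limit.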

[Diagram: Max Depth]

Standard API Example

The links parameter is configured using script to spider out to the additional URLs in the input document. In the example, the script parameter determines how Follow Web Links behaves when an additional URL is found in the input data.

Code Block

 "links": {
     "script": "var retVals = [];\nvar n = -1;\nvar url = _doc.url.replace(/[?].*/,\"\");\nvar start = 0;\nwhile (start < text.length) {\n    var end = text.indexOf('\\n', start);\n    if (end == -1) end = text.length;\n    var line = text.substring(start, end);\n    start = end + 1;\n\n    n++;\n    if (0 == n) continue;\n\n    var title = 'line #' + n.toString();\n    var url2 = url + '#' + n.toString();\n    var fullText = line;\n    var retVal = { 'title':title, 'url':url2, 'fullText':line };\n    retVals.push(retVal);\n}\nretVals;\n"
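Because the script is stored as an escaped JSON string, its logic is easier to follow when written out as plain JavaScript. The sketch below does that, under a few assumptions: the `_doc` and `text` values are stand-ins for what the platform would supply, the logic is wrapped in a hypothetical `extractLineDocs` function so it can be run outside the platform (the configured script simply ends with the `retVals;` expression), and each line is sliced with `substring(start, end)`.

```javascript
// Stand-ins for values the platform would supply (assumptions for this sketch).
var _doc = { url: "http://example.com/data.csv?key=123" };
var text = "header\nfirst row\nsecond row";

// Turn every line after the first into a link object with title, url and fullText.
function extractLineDocs(text, docUrl) {
  var retVals = [];
  var n = -1;
  var url = docUrl.replace(/[?].*/, ""); // drop the query string
  var start = 0;
  while (start < text.length) {
    var end = text.indexOf("\n", start);
    if (end == -1) end = text.length;      // last line has no trailing newline
    var line = text.substring(start, end); // text between start and end
    start = end + 1;

    n++;
    if (0 == n) continue; // skip the first line, as the original script does

    retVals.push({
      title: "line #" + n.toString(),
      url: url + "#" + n.toString(),
      fullText: line
    });
  }
  return retVals;
}

var retVals = extractLineDocs(text, _doc.url);
```

Each returned object gets a unique URL by appending the line number as a fragment to the source URL, so every line can be treated as its own document.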

...

XML API Example

For XML APIs the basic principle is the same, but the XML object needs to be parsed using embedded Java calls, since the Rhino JavaScript engine currently in use does not support E4X (upgrading to a version that does is on our roadmap).

...

The links object has a field "extraMeta" that enables other script types to be used. The main use case for this is using the "xpath" scripting language to extract the URLs required, and then using the existing "script" field to tidy up those objects into the required format.

Code Block

        },
        {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }
        },

In the example, the links object has a complex parameter, extraMeta, which is configured to run an xpath script that parses the XML input and converts it to JSON. The script field then converts each resulting object into a document.
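The second step, where script tidies up the xpath output, can be tried outside the platform with the unescaped function. In this sketch the `_doc` object is a stand-in: the URL and the `name`, `price` and `content` fields of the sample food entries are assumptions, chosen only to resemble what the `//breakfast_menu/food[*]` expression might yield.

```javascript
// Stand-in for the document the platform would pass to the script (assumed values).
var _doc = {
  url: "http://example.com/menu.xml?format=full",
  metadata: {
    // Assumed shape of the objects stored under the extraMeta fieldName.
    convert_to_json: [
      { name: "Belgian Waffles", price: "$5.95", content: "raw text" },
      { name: "French Toast", price: "$4.50", content: "raw text" }
    ]
  }
};

// Same logic as the configured script, written out as plain JavaScript.
// Note that the url argument is accepted but the body reads _doc.url directly,
// exactly as in the configured version.
function convert_to_docs(jsonarray, url) {
  var docs = [];
  for (var docIt in jsonarray) {
    var predoc = jsonarray[docIt];
    delete predoc.content; // drop the raw content before storing the object
    var doc = {};
    doc.url = _doc.url.replace(/[?].*/, "") + "#" + docIt; // unique URL per entry
    doc.fullText = predoc;
    doc.title = "TBD";
    doc.description = "TBD";
    docs.push(doc);
  }
  return docs;
}

var docs = convert_to_docs(_doc.metadata["convert_to_json"], _doc.url);
```

Each entry extracted by xpath becomes one document whose URL is the source URL, minus its query string, plus the entry's index as a fragment.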

...

HTML

When Follow Web Links is used to ingest HTML pages, some additional considerations are required for HTML.

Parsing HTML

The extraMeta field is used by "Follow Web Links" to enable other script types to be used.

This enables the use of the xpath scripting language to extract the required URLs from the HTML pages, before script puts them into the required array format.

...

If a field called "_ONERROR_" is generated and no links are returned from the first page (likely due to a formatting error), the contents of _ONERROR_ (assumed to be a string) are written to the harvest message.
When running from the "Config - Source - Test" API call (including from the Source Editor GUI), and only then, all of the _ONDEBUG_ field values (which can be strings or objects) are written to the harvest message for every page.
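The two reporting rules can be modelled as a small function. This is only an illustration of the conditions described above, not platform code: the `harvestMessages` name, its parameters, and the choice to serialize object values with `JSON.stringify` are all assumptions.

```javascript
// Model of which field values end up in the harvest message for one page.
// meta:        fields generated by extraMeta for this page
// links:       links returned for this page
// isFirstPage: true for the first page fetched
// isTestCall:  true when running from "Config - Source - Test"
function harvestMessages(meta, links, isFirstPage, isTestCall) {
  var messages = [];

  // _ONERROR_ is reported only when the first page returned no links.
  if (isFirstPage && links.length === 0 && meta.hasOwnProperty("_ONERROR_")) {
    messages.push(String(meta._ONERROR_));
  }

  // _ONDEBUG_ values are reported for every page, but only in test runs.
  if (isTestCall && meta.hasOwnProperty("_ONDEBUG_")) {
    var values = [].concat(meta._ONDEBUG_); // accept a single value or an array
    for (var i = 0; i < values.length; i++) {
      var v = values[i];
      messages.push(typeof v === "string" ? v : JSON.stringify(v));
    }
  }
  return messages;
}

var onError = harvestMessages({ _ONERROR_: "no rows matched" }, [], true, false);
var onDebug = harvestMessages(
  { _ONERROR_: "unused", _ONDEBUG_: ["row count", { n: 1 }] },
  [{ url: "http://example.com/page#1" }],
  false,
  true
);
```

In the first call the error text is reported because the first page produced no links; in the second only the debug values are reported, because links were returned and the run is a test call.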

Example:

-insert example for parsing html

Working with Multiple Pages

...

Versions Compared

Old Version 11

New Version 12
