Warning

THIS IS A DRAFT PAGE, WORK IN PROGRESS

Overview

This is similar to the WITS example, except that the XML is hosted on a web server instead of on a fileshare. Because the Feed Harvester does not have the same built-in decoding capabilities as the File Harvester, the configuration is a little more complicated.

Example data

http://www.w3schools.com/xml/simple.xml

Note that when accessing web documents you must use "rss.extraUrls" (specifying at minimum the "url" and "title" fields for each entry) rather than the top-level "url"; otherwise the URL is treated as an RSS feed rather than as a standalone web page.
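In skeleton form, the relevant part of the source might look like the following (a sketch only; the "Feed" extract type and the title value here are illustrative):

{
    "extractType": "Feed",
    "rss": {
        "extraUrls": [
            {
                "url": "http://www.w3schools.com/xml/simple.xml",
                "title": "Breakfast menu"
            }
        ]
    }
}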

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
   <food>
      <name>Belgian Waffles</name>
      <price>$5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
   </food>
   <food>
      <name>Strawberry Belgian Waffles</name>
      <price>$7.95</price>
      <description>light Belgian waffles covered with strawberries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>Berry-Berry Belgian Waffles</name>
      <price>$8.95</price>
      <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>French Toast</name>
      <price>$4.50</price>
      <description>thick slices made from our homemade sourdough bread</description>
      <calories>600</calories>
   </food>
   <food>
      <name>Homestyle Breakfast</name>
      <price>$6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
   </food>
</breakfast_menu>

Source

Note the use of XPath to specify how the top-level XML document is converted into lots of little documents - the "web.searchConfig.script" is then boilerplate that turns the XML into lots of small documents, with the "fullText" of each containing the JSON representation of the selected XML. This is then converted into metadata by the "contentMetadata" block. Normally a "docMetadata"/"entities"/"associations" block would then be used to set the per-document titles, descriptions, entities, etc.
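For the breakfast menu data above, the XPath that picks out one small document per <food> element would be along these lines (an illustrative sketch, not taken verbatim from the configuration below):

//breakfast_menu/food

The "fullText" of the first resulting document would then contain the JSON representation of that element, roughly:

{"name":"Belgian Waffles","price":"$5.95","description":"two of our famous Belgian Waffles with plenty of real maple syrup","calories":"650"}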

Source:

{
    "description": "Page by Page Analysis of E-Book",
    "extractType": "File",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Record",
    "processingPipeline": [
        {
            "display": "",
            "file": {
                "XmlRootLevelValues": ["output:xml"],
                "password": "PASSWORD",
                "type": "tika",
                "url": "s3://S3BUCKETNAME/PATH/",
                "username": "USERNAME"
            }
        },
        {
            "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": ["function convert_to_docs(jsonarray, topDoc)\n{\n    var docs = [];\n    for (var docIt in jsonarray) \n    { \n        var predoc = jsonarray[docIt];\n        var doc = {};\n        doc.url = topDoc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc.replace(/\\\\\\//,\"/\");\n        doc.title = topDoc.title + \"; Page: \" + docIt;\n        doc.publishedDate = topDoc.publishedDate;\n        doc.description = topDoc.url;\n        docs.push(doc);\n    }\n    return docs; \n}\n\n"]
            }
        },
        {
            "display": "A processing block to append the text of each page of the e-book to an array in the document's metadata",
            "contentMetadata": [{
                "fieldName": "pages",
                "index": false,
                "script": "div",
                "scriptlang": "stream",
                "store": true
            }]
        },
        {
            "display": "Take the individual pages from the first doc's metadata, split them into new documents, and then delete the original",
            "splitter": {
                "deleteExisting": true,
                "numPages": 10,
                "numResultsPerPage": 1,
                "script": "var docs = convert_to_docs(_doc.metadata['pages'], _doc); docs;",
                "scriptflags": "d",
                "scriptlang": "javascript"
            }
        },
        {
            "display": "Clean up the full text of each page using xpath",
            "contentMetadata": [{
                "fieldName": "fullText",
                "flags": "H",
                "index": false,
                "script": "//div",
                "scriptlang": "xpath",
                "store": true
            }]
        },
        {
            "display": "Set master PDF url to display url",
            "docMetadata": {
                "appendTagsToDocs": false,
                "displayUrl": "$SCRIPT( var text = _doc.url; return text; )"
            }
        },
        {
            "display": "Set page text to the description",
            "docMetadata": {
                "appendTagsToDocs": false,
                "description": "$SCRIPT( var text = _doc.metadata.fullText[0]; return text; )",
                "publishedDate": "$SCRIPT( return '3/31/2014'; ) "
            }
        },
        {
            "display": "Run document text through an entity extractor",
            "featureEngine": {
                "engineName": "OpenCalais",
                "exitOnError": true
            }
        }
    ],
    "tags": [
        "ebook",
        "opencalais"
    ],
    "title": "E-Book Splitter Example"
}
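For reference, here is the convert_to_docs function that is embedded (as an escaped one-line string) in the "globals" block above, reproduced in readable form with explanatory comments added:

function convert_to_docs(jsonarray, topDoc)
{
    var docs = [];
    for (var docIt in jsonarray)
    {
        var predoc = jsonarray[docIt];
        var doc = {};
        // Derive a unique URL for each new document from the parent document's URL
        doc.url = topDoc.url.replace(/[?].*/, "") + '#' + docIt;
        // Replace an escaped forward slash ("\/") with "/" in the page text
        doc.fullText = predoc.replace(/\\\//, "/");
        doc.title = topDoc.title + "; Page: " + docIt;
        doc.publishedDate = topDoc.publishedDate;
        doc.description = topDoc.url;
        docs.push(doc);
    }
    return docs;
}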

Output:
