Info
titleWarning

THIS IS A DRAFT PAGE, WORK IN PROGRESS

Overview

This example is similar to the WITS example, except that the XML is hosted on a web server instead of in a fileshare. The Feed Harvester does not have the same built-in decoding capabilities as the File Harvester, which makes life a little more complicated.

Example data

http://www.w3schools.com/xml/simple.xml

Info

Note that when accessing web documents you must use "rss.extraUrls" and specify at minimum the "url" and "title" fields, rather than the top-level "url" (otherwise the URL is treated as an RSS feed rather than a standalone web page). A minimal sketch of this is shown after the example data below.

Code Block
languagehtml/xml
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
   <food>
      <name>Belgian Waffles</name>
      <price>$5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
   </food>
   <food>
      <name>Strawberry Belgian Waffles</name>
      <price>$7.95</price>
      <description>light Belgian waffles covered with strawberries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>Berry-Berry Belgian Waffles</name>
      <price>$8.95</price>
      <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>French Toast</name>
      <price>$4.50</price>
      <description>thick slices made from our homemade sourdough bread</description>
      <calories>600</calories>
   </food>
   <food>
      <name>Homestyle Breakfast</name>
      <price>$6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
   </food>
</breakfast_menu>
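For illustration only, the note above about "rss.extraUrls" might translate into a source fragment along these lines (the surrounding "rss" object placement is an assumption; only the "extraUrls", "url" and "title" fields come from the note):

Code Block
languagejavascript
{
    // Hypothetical fragment: reference the standalone XML page via rss.extraUrls.
    // Using the top-level "url" instead would cause the harvester to treat it as an RSS feed.
    "rss": {
        "extraUrls": [
            {
                "url": "http://www.w3schools.com/xml/simple.xml",
                "title": "W3Schools breakfast menu sample"
            }
        ]
    }
}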

Source

Note the use of XPath to easily identify how to convert the top-level XML document into lots of little documents - the "web.searchConfig.script" script is then boilerplate that converts the XML into lots of small documents, with the "fullText" of each containing the JSON representation of the selected XML. This is then converted into metadata by the "contentMetadata" block. Normally the "docMetadata"/"entities"/"associations" blocks would finally be used to set the per-document titles, descriptions, entities, etc. A hypothetical sketch of such a conversion script is shown below.
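The source JSON for this web example is not reproduced on this page. Purely as an illustration, a conversion script of the kind described above (modelled on the convert_to_docs global used in the splitter example below; the function name and field handling are assumptions, not part of this page) might look something like:

Code Block
languagejavascript
// Hypothetical sketch: turn the XML-as-JSON breakfast_menu into one small
// document per <food> element, with fullText holding the JSON for that element.
function convert_xml_to_docs(xmlAsJson, topDoc)
{
    var docs = [];
    var items = xmlAsJson.breakfast_menu.food;
    for (var it in items)
    {
        var doc = {};
        // Give each generated document a unique URL derived from the parent page
        doc.url = topDoc.url + '#' + (parseInt(it) + 1).toString();
        doc.title = items[it].name;
        doc.description = items[it].description;
        // The JSON representation of the selected XML element becomes the full text
        doc.fullText = JSON.stringify(items[it]);
        docs.push(doc);
    }
    return docs;
}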

Document splitter example

This example uses the document splitter to take a large document and break it down into a series of smaller ones. This is particularly helpful for use cases where documents become so large that identifying where a particular string, entity, or association occurs no longer assists in analysis. Similar to the WITS example, this example points to data hosted on S3, but PDFs processed in this way could be stored in any number of locations. Using different split logic, we could also break documents up by chapters or any other logical marker (see the sketch after the source below).

Example data

There is no sample data for this example, but it can be done with any basic PDF document. The source example assumes this format.

Source:

Code Block
languagejavascript
{
    "description": "Page by Page Analysis of E-Booka PDF",
    "extractType": "File",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Record",
    "processingPipeline": [
        {
            "display": "A file connector to wherever the PDFs to be processed reside, example below shows an S3 bucket",
            "file": {
                "XmlRootLevelValues": ["output:xml"],
                "password": "PASSWORD",
                "type": "tika",
                "url": "s3://S3BUCKETNAME/PATH/",
                "username": "USERNAME"
            }
        },
         {
            "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": [
                    "function convert_to_docs(jsonarray, topDoc)\n{\n    var docs = [];\n    for (var docIt in jsonarray) \n    { \n        var predoc = jsonarray[docIt];\n        var doc = {};\n        doc.url = topDoc.url.replace(/[?].*/,\"\") + '#' + (parseInt(docIt) + 1).toString();\n        doc.fullText = predoc.replace(/\\\\\\//,\"/\");\n        doc.title = topDoc.title + \"; Page: \" + (parseInt(docIt) + 1).toString();\n        doc.publishedDate = topDoc.publishedDate;\n        doc.description = topDoc.url;\n        docs.push(doc);\n    }\n    return docs; \n}\n\n"]
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "pages",
                    "index": false,
                    "script": "div",
                    "scriptlang": "stream",
                    "store": true
                }
            ],
            "display": "Uses the PDF's internal structure to break each page into an element in a 'pages' metadata field in the first document"
        },
        {
            "display": "Take the individual pages created in the previous step, convert them into new docs, and then delete the original",
            "splitter": {
                "deleteExisting": true,
                "numPages": 10,
                "numResultsPerPage": 1,
                "script": "var docs = convert_to_docs(_doc.metadata['pages'], _doc); docs;",
                "scriptflags": "d",
                "scriptlang": "javascript"
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "fullText",
                    "flags": "H",
                    "index": false,
                    "script": "//div",
                    "scriptlang": "xpath",
                    "store": true
                }
            ],
            "display": "Extract the full text from the split documents using xpath"
        },
        {
            "display": "Set master PDF url to display url",
            "docMetadata": {
                "appendTagsToDocs": false,
                "displayUrl": "$SCRIPT( var text = _doc.url; return text; )"
            }
        },
        {
            "display": "Set the page text to the description",
            "docMetadata": {
                "appendTagsToDocs": false,
                "description": "$SCRIPT( var text = _doc.metadata.fullText[0]; return text; )",
                "publishedDate": "$SCRIPT( return '3/31/2014'; )"
            }
        },
        {
            "display": "At this point your documents can be processed like normal. Example below shows a feature engine step that runs the document text through an entity extractor",
            "featureEngine": {
                "engineName": "OpenCalais",
                "exitOnError": true
            }
        }
    ],
    "tags": [
        "ebook",
        "opencalais"
    ],
    "title": "E-BookPDF Splitter Example"
}
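The source above splits the PDF page by page, using the div elements produced by the tika extractor. As noted in the overview, different split logic could break on chapters or any other logical marker instead. Purely as an illustration, a hypothetical variant of the convert_to_docs global (the function name and the chapter-heading regex are assumptions, not part of this example) might look like:

Code Block
languagejavascript
// Hypothetical alternative to convert_to_docs: split on chapter headings
// instead of producing one document per page. This would be referenced from
// the splitter's "script" in place of convert_to_docs.
function convert_to_chapter_docs(jsonarray, topDoc)
{
    var docs = [];
    // Re-join the per-page divs into one text blob, then cut it on chapter headings
    var allText = jsonarray.join("\n");
    var chapters = allText.split(/(?=Chapter\s+\d+)/i);
    for (var chapIt in chapters)
    {
        var doc = {};
        doc.url = topDoc.url.replace(/[?].*/, "") + '#chapter' + (parseInt(chapIt) + 1).toString();
        doc.fullText = chapters[chapIt];
        doc.title = topDoc.title + "; Chapter: " + (parseInt(chapIt) + 1).toString();
        doc.publishedDate = topDoc.publishedDate;
        doc.description = topDoc.url;
        docs.push(doc);
    }
    return docs;
}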

Output:
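No sample output is included on this page. Purely as a schematic illustration (field values below are placeholders inferred from the convert_to_docs and docMetadata logic above; real harvested documents contain additional system fields), one of the split documents might look roughly like:

Code Block
languagejavascript
// Hypothetical, schematic output document - not actual harvester output
{
    "url": "s3://S3BUCKETNAME/PATH/EXAMPLE.pdf#1",
    "displayUrl": "s3://S3BUCKETNAME/PATH/EXAMPLE.pdf#1",
    "title": "<original PDF title>; Page: 1",
    "description": "<text of page 1, copied from metadata.fullText[0]>",
    "publishedDate": "3/31/2014",
    "fullText": "<div class=\"page\">...text of page 1...</div>",
    "metadata": {
        "fullText": ["<text of page 1>"]
    }
}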