Overview

This example shows how to use the document splitter to break a large document into a series of smaller ones. This is particularly helpful when documents become so large that knowing a particular string, entity, or association occurs somewhere in the document no longer assists in analysis. Similar to the WITS example, this example points to data hosted on S3, but PDFs processed in this way could be stored in any number of locations. Using different split logic, we could also break documents up by chapters or any other logical marker.
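
As a rough illustration of what different split logic might look like, the sketch below breaks a block of text on chapter headings instead of pages. It is not part of the source: the function name split_by_chapter and the assumption that each heading sits on its own line as "Chapter <number>" are hypothetical, and real PDFs will likely need a different marker.

```javascript
// Hypothetical alternative to page-based splitting: break text on chapter headings.
// Assumes each heading starts a line and looks like "Chapter <number> ...".
// Any text before the first heading is dropped.
function split_by_chapter(fullText) {
    var chapters = [];
    var headingRegex = /^Chapter\s+\d+.*$/gm;
    var starts = [];
    var match;
    // Record the character offset of every heading line.
    while ((match = headingRegex.exec(fullText)) !== null) {
        starts.push(match.index);
    }
    // Each chapter runs from its heading to the next heading (or end of text).
    for (var i = 0; i < starts.length; i++) {
        var end = (i + 1 < starts.length) ? starts[i + 1] : fullText.length;
        chapters.push(fullText.substring(starts[i], end).trim());
    }
    return chapters;
}

// Made-up sample text, for illustration only.
var sample = "Preface text\nChapter 1 Introduction\nSome text.\nChapter 2 Methods\nMore text.";
var chapters = split_by_chapter(sample);
console.log(chapters.length + " chapters found");
```

The resulting array has one string per chapter, so it could in principle be handed to a conversion function in the same way the page array is handled in the source, with titles adjusted to say "Chapter" rather than "Page".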

Example data

No sample data is provided for this example; any basic PDF document will work. The source below assumes PDF input.

Source:

{
    "description": "Page by Page Analysis of a PDF",
    "extractType": "File",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Record",
    "processingPipeline": [
        {
            "display": "A file connector to wherever the PDFs to be processed reside; the example below shows an S3 bucket",
            "file": {
                "XmlRootLevelValues": ["output:xml"],
                "password": "PASSWORD",
                "type": "tika",
                "url": "s3://S3BUCKETNAME/PATH/",
                "username": "USERNAME"
            }
        },
        {
            "display": "A global space to group all the complex parsing and processing logic; functions defined here can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": [
                    "function convert_to_docs(jsonarray, topDoc)\n{\n    var docs = [];\n    for (var docIt in jsonarray) \n    { \n        var predoc = jsonarray[docIt];\n        var doc = {};\n        doc.url = topDoc.url.replace(/[?].*/,\"\") + '#' + (parseInt(docIt) + 1).toString();\n        doc.fullText = predoc.replace(/\\\\\\//,\"/\");\n        doc.title = topDoc.title + \"; Page: \" + (parseInt(docIt) + 1).toString();\n        doc.publishedDate = topDoc.publishedDate;\n        doc.description = topDoc.url;\n        docs.push(doc);\n    }\n    return docs; \n}\n\n"
                ]
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "pages",
                    "index": false,
                    "script": "div",
                    "scriptlang": "stream",
                    "store": true
                }
            ],
            "display": "Uses the PDF's internal structure to break each page into an element of a pages metadata field in the first document"
        },
        {
            "display": "Take the individual pages created in the previous step, convert them into docs, then delete the original",
            "splitter": {
                "deleteExisting": true,
                "numPages": 10,
                "numResultsPerPage": 1,
                "script": "var docs = convert_to_docs(_doc.metadata['pages'], _doc); docs;",
                "scriptflags": "d",
                "scriptlang": "javascript"
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "fullText",
                    "flags": "H",
                    "index": false,
                    "script": "//div",
                    "scriptlang": "xpath",
                    "store": true
                }
            ],
            "display": "Extract the full text from the documents using xpath"
        },
        {
            "display": "At this point your documents can be processed as normal. The example below shows a feature engine step",
            "featureEngine": {
                "engineName": "OpenCalais",
                "exitOnError": true
            }
        }
    ],
    "tags": [
        "ebook",
        "opencalais"
    ],
    "title": "PDF Splitter Example"
}
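
The convert_to_docs script in the globals step is hard to read in its JSON-escaped form, so here it is unescaped, with comments added. The function body follows the source; the topDoc object and pages array are hypothetical stand-ins for the parent document and its pages metadata field, included only so the sketch can be run on its own.

```javascript
// Unescaped form of the convert_to_docs script from the "globals" step.
function convert_to_docs(jsonarray, topDoc) {
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        var doc = {};
        // Strip any query string, then append the 1-based page number as a fragment.
        doc.url = topDoc.url.replace(/[?].*/, "") + '#' + (parseInt(docIt) + 1).toString();
        // Unescape the first "\/" sequence in the page text.
        doc.fullText = predoc.replace(/\\\//, "/");
        doc.title = topDoc.title + "; Page: " + (parseInt(docIt) + 1).toString();
        doc.publishedDate = topDoc.publishedDate;
        // Keep the original (unmodified) parent URL for reference.
        doc.description = topDoc.url;
        docs.push(doc);
    }
    return docs;
}

// Hypothetical parent document and page list, for illustration only.
var topDoc = {
    url: "s3://S3BUCKETNAME/PATH/report.pdf?version=1",
    title: "report.pdf",
    publishedDate: "2014-01-01T00:00:00Z"
};
var pages = ["<div>First page text</div>", "<div>Second page text</div>"];

var docs = convert_to_docs(pages, topDoc);
console.log(JSON.stringify(docs, null, 2));
```

Each page becomes its own document whose URL is the parent URL plus a page-number fragment, which keeps the split documents distinct while still pointing back to the original file.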

Output:

PDFs split by page
