Overview

Starting with either the raw content or the content transformed by a preceding manual or automated text pipeline element, this element applies the javascript, regex, or xpath transformation and writes the output to the document's full text (or its description, its title, or one of the textual metadata fields).

...

Parameter	Description
`fieldName`	Specifies the data source that the script executes against: "fullText", "description", or "title".
`script`	The script to run.
`flags`	Standard Java regex field. Can have different values, based on `scriptlang`; see below.
	javascript: a few flags provide additional variables in the javascript:
	"m": gets "_doc.metadata", written into the variable "_metadata". For example, this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field.
	"d": gets "_doc", written into the variable "_doc".
	"t": returns the full text of the document in the variable "text". If the "flags" field is not specified, this is returned by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
	xpath (and regex, except for "o"):
	'H': HTML-decodes the resulting fields (eg "&amp;" -> "&").
	'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1.)
	'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields).
	'D': described above.
	'c': if set, then fields with the same name are chained together (otherwise they will all append their results to the field within metadata).
`replacement`	If `scriptlang` is regex or xpath, `replacement` can be used to replace the value matched by the regex/xpath. For example, you could find an instance of C/M or C/F in a document and use the replacement to record that the Race is Caucasian. The same can be done to extract M or F as a Sex, meaning Male or Female.
`scriptlang`	Specifies the language of the script that will be provided: one of "javascript", "regex", or "xpath".
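
As a rough illustration of how `script` and `replacement` work together when `scriptlang` is regex, the sketch below mimics the C/M example in plain javascript. It is not the platform's implementation: the helper `extractWithReplacement`, the sample sentence, and the use of `$1` to refer to a capture group are all assumptions made for the example.

```javascript
// Hypothetical helper: mimics a regex "script" combined with a "replacement" string.
// For every match of the pattern, it emits the replacement with $1, $2, ... filled in.
function extractWithReplacement(text, pattern, replacement) {
    var results = [];
    var regex = new RegExp(pattern, "g");
    var match;
    while ((match = regex.exec(text)) !== null) {
        results.push(replacement.replace(/\$(\d)/g, function (whole, n) {
            var group = match[Number(n)];
            return group !== undefined ? group : "";
        }));
        // Avoid an infinite loop if the pattern can match an empty string
        if (match[0].length === 0) {
            regex.lastIndex++;
        }
    }
    return results;
}

// Assumed sample text, not taken from a real source
var sample = "Subject described as C/M, approx. 30 years old.";

// A fixed replacement records the Race whenever C/M or C/F is found
var race = extractWithReplacement(sample, "\\bC/[MF]\\b", "Caucasian");

// A capture group in the replacement extracts the Sex code (M or F)
var sexCode = extractWithReplacement(sample, "\\bC/([MF])\\b", "$1");
```

Here `race` holds one entry per match, and `sexCode` holds the captured letter, which could then be mapped to Male or Female.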

...

Supported Script Languages

You can program manual text extraction using the following supported languages:

Javascript

See detailed example below.

Regex

See detailed example below.

Xpath

See detailed example below.

Examples

Javascript

For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.

Log File From File Share

...

In the following example, manual text transformation is used to parse a log file over the web, with a script of type javascript.

...

Globals is used to define a function called "decode," which is then used to capture the metadata for the sample input data in a variable called "info."

Info can be used to capture the metadata for the sample input data as follows:

info.date
info.srcIP
info.dstIP
info.alert
info.country

Code Block

        {
            "globals": {
                "scripts": [
                    "function decode(x)\n{\n    var info = {};   \n    var rec = x.split(',');   \n    info.device = rec[0];\n    info.date = rec[1];\n    info.srcIP = rec[2];\n    info.dstIP = rec[3];\n    info.alert = rec[4];\n    info.country = rec[5];\n    return info;\n}"
                ]
            }
        },
        {
            "harvest": {
                "searchCycle_secs": 3600
            }
        },
        {
            "docMetadata": {
                "title": "$metadata.info.alert @ $metadata.info.date [$metadata.info.device]: $metadata.info.dstIP -> $metadata.info.srcIP",
                "publishedDate": "$SCRIPT( return _doc.metadata.info[0].date; )"
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "info",
                    "script": "var info = decode(text); info;",
                    "scriptlang": "javascript"
                }
            ]
        }

Metadata:

The metadata captured from the sample input data then appears in the output document:

Code Block

    ],
    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
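
The stray spaces in values such as "SCANNER_1 " and " 10.0.0.1" appear to come from splitting on "," without trimming. A minimal sketch of a trimming variant follows; `decodeTrimmed` is a hypothetical name, and the code assumes the script engine supports the ES5 `map` and `trim` methods.

```javascript
// Hypothetical variant of decode() that trims each comma-separated field
function decodeTrimmed(line) {
    var rec = line.split(',').map(function (field) {
        return field.trim();
    });
    return {
        device: rec[0],
        date: rec[1],
        srcIP: rec[2],
        dstIP: rec[3],
        alert: rec[4],
        country: rec[5]
    };
}

// The "fullText" value from the sample output above
var trimmedInfo = decodeTrimmed(
    "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States"
);
```

With this variant, `trimmedInfo.device` and `trimmedInfo.srcIP` carry no leading or trailing spaces, which keeps the generated title tidy.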

...

Javascript can also return more complex objects, arrays of objects, or arrays of primitives.
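
As a sketch of what such a script might look like, the functions below build an array of objects and an array of primitives from multi-line log text. The function names and the two-line sample are assumptions for illustration; how the returned arrays are stored in the metadata is not shown here.

```javascript
// Assumed multi-line input in the same layout as the sample log
var sampleText =
    "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States\n" +
    "SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands\n";

// Split the text into lines, dropping blank ones
function nonEmptyLines(text) {
    return text.split('\n').filter(function (line) {
        return line.trim().length > 0;
    });
}

// Array of objects: one record per log line
function decodeAll(text) {
    return nonEmptyLines(text).map(function (line) {
        var rec = line.split(',');
        return { device: rec[0], date: rec[1], alert: rec[4] };
    });
}

// Array of primitives: just the destination IP of each line
function destinationIPs(text) {
    return nonEmptyLines(text).map(function (line) {
        return line.split(',')[3];
    });
}

var alerts = decodeAll(sampleText);
var dstIPs = destinationIPs(sampleText);
```

In a pipeline script, the last expression would presumably be the value to return, in the same way that the earlier example ends with `info;`.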

...

Regex

...

Log File

Source:

The following example shows how a regex script can be used to manually parse the text of the ingested data:

...

Consider the following alarm logs which include a record of device alerts, including their network and physical locations.

Code Block

Device,Date,SrcIP,DstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States
SCANNER_2,2012-02-01T14:21:00,10.0.0.2,66.66.66.66,DUMMY_ALERT_TYPE_2,United Kingdom
SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands

Source Configuration:

In the source configuration, a regex script is used to extract data to make up the "fullText" and "description" of the resulting document.

Code Block

        },
        {
            "text": [
                {
                    "fieldName": "fullText",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                },
                {
                    "fieldName": "description",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                }
            ]
        },
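
In effect, each "text" entry in the configuration above replaces every "," in its field with " , ". The javascript below is only an analogy for that substitution (the platform applies a Java regex, not this code), and `padCommas` is a hypothetical name.

```javascript
// Hypothetical analogy for "script": "," with "replacement": " , "
function padCommas(line) {
    // The g flag replaces every comma, not just the first one
    return line.replace(/,/g, " , ");
}

// First record of the sample alarm log
var padded = padCommas(
    "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States"
);
```

The resulting string has the same spacing as the "fullText" value in the example output.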

Output:

The example output includes the "fullText" that results from the regex script.

Code Block

    ],
    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
    "modified": "Jun 4, 2013 12:54:34 AM UTC",
    "publishedDate": "January 1, 2012 13:43:00 PM UTC",
    "source": ["Cyber Logs Test"],
    "sourceKey": ["INFINITE_ENDPOINT.api.share.get.51ad28a440b4a4f0f757824c.25.26"],
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "DUMMY_ALERT_TYPE_1  @ 2012-01-01T13:43:00 [SCANNER_1 ]: 66.66.66.66 -> 10.0.0.1",
    "url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c#1"
}

...

Xpath

Neither regex nor javascript is well suited to extracting fields from HTML and XML.

...

In this example, an Xpath script is used as part of manual text extraction to convert a sample XML document into JSON.

XML

Source Input:

Consider the following XML file, which contains a price list for several food items.

Code Block

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
   <food>
      <name>Belgian Waffles</name>
      <price>$5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
   </food>
   <food>
      <name>Strawberry Belgian Waffles</name>
      <price>$7.95</price>
      <description>light Belgian waffles covered with strawberries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>Berry-Berry Belgian Waffles</name>
      <price>$8.95</price>
      <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>French Toast</name>
      <price>$4.50</price>
      <description>thick slices made from our homemade sourdough bread</description>
      <calories>600</calories>
   </food>
   <food>
      <name>Homestyle Breakfast</name>
      <price>$6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
   </food>
</breakfast_menu>

...

Code Block

  {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }
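
To make the embedded "script" easier to follow, here is a standalone sketch of the same idea: each JSON object produced by the xpath step becomes one document whose URL is the source URL plus "#" and the item's index. The `_doc` stub and its two menu items are assumptions standing in for what the platform would supply, and `convertToDocs` is a hypothetical rewrite that uses its `url` parameter and an index loop.

```javascript
// Stub standing in for the platform-provided _doc (assumed shape)
var _doc = {
    url: "http://www.w3schools.com/xml/simple.xml",
    metadata: {
        convert_to_json: [
            { name: "Belgian Waffles", price: "$5.95", calories: "650" },
            { name: "Strawberry Belgian Waffles", price: "$7.95", calories: "900" }
        ]
    }
};

// Turn each extracted JSON object into a document stub
function convertToDocs(jsonarray, url) {
    var docs = [];
    // Strip any query string so the fragment is appended to the bare URL
    var baseUrl = url.replace(/[?].*/, "");
    for (var i = 0; i < jsonarray.length; i++) {
        var predoc = jsonarray[i];
        delete predoc.content; // as in the original script; a no-op if absent
        docs.push({
            url: baseUrl + '#' + i,
            fullText: predoc,
            title: "TBD",
            description: "TBD"
        });
    }
    return docs;
}

var docs = convertToDocs(_doc.metadata['convert_to_json'], _doc.url);
```

Each element of `docs` gets a distinct URL ending in "#0", "#1", and so on, which matches the "url" values in the sample output.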

Output:

The sample output is an array of JSON-formatted responses. For example:

Code Block

{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"650\" , \"description\" : \"two of our famous Belgian Waffles with plenty of real maple syrup\" , \"price\" : \"$5.95\" , \"name\" : \"Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "650",
        "description": "two of our famous Belgian Waffles with plenty of real maple syrup",
        "name": "Belgian Waffles",
        "price": "$5.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#0"
}
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"900\" , \"description\" : \"light Belgian waffles covered with strawberries and whipped cream\" , \"price\" : \"$7.95\" , \"name\" : \"Strawberry Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "900",
        "description": "light Belgian waffles covered with strawberries and whipped cream",
        "name": "Strawberry Belgian Waffles",
        "price": "$7.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#1"
}

...
