...

You can program manual text extraction using the following supported languages:

  • javascript
  • regex
  • xpath

javascript

...

In the following example, manual text transformation is used to parse a log file over the web, with a script of type javascript.

Code Block
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "info",
                    "script": "var info = decode(text); info;",
                    "scriptlang": "javascript"
                }
            ]
        },
        {
            "text": [
                {
                    "fieldName": "fullText",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                },
                {
                    "fieldName": "description",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                }
            ]
        },

...

Once "globals" has been used to define a variable called info, info can be used to capture the metadata for the sample input data. The metadata captured in the example is as follows:

...

This captured metadata from the sample input data can then be used as output for the script.
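
The pattern behind the script "var info = decode(text); info;" can be sketched in plain JavaScript. This is only an illustration: the decode function below is a hypothetical stand-in for whatever helper is defined under "globals", the key=value log format and the sample line are invented, and it is an assumption that the value of the script's final expression (info) is what gets stored in the field.

```javascript
// Hypothetical stand-in for the decode() helper defined under "globals".
// Assumes each log line is a whitespace-separated list of key=value pairs.
function decode(text) {
  var info = {};
  text.split(/\s+/).forEach(function (pair) {
    var idx = pair.indexOf("=");
    // Skip tokens that have no key before the "=" sign.
    if (idx > 0) {
      info[pair.slice(0, idx)] = pair.slice(idx + 1);
    }
  });
  return info;
}

// Invented sample input; in the real script, "text" is supplied by the platform.
var text = "host=10.0.0.1 status=200 path=/index.html";

// Mirrors the configured script: build the object, then yield it.
var info = decode(text);
```

Here info is an object with the keys host, status and path, which is the kind of structured value a field such as "info" could hold.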

...

The following example shows how a regex script can be used to manually parse the text of the ingested data:

Code Block
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "organization",
                    "script": "believed the (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": "believed (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": ".  ([^.]*?)(?: \\([^)]*\\))? claimed responsibility\\.$",
                    "scriptlang": "regex"
                }
            ]
        },

 

In the example code snippet, the manual text transformation defines a field called "organization" and uses regex to search the input XML data for matches. In this example, the XML data is an incident report.
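
To see what the three "organization" patterns match, they can be tried outside the platform in plain JavaScript. The sketch below is not how Infinit.e applies them internally: the function name firstGroups and the sample sentences are hypothetical, and it is an assumption that the first capture group is the value of interest.

```javascript
// The three patterns from the configuration above, with the JSON escaping removed.
var patterns = [
  /believed the (.*?)(?: \([^)]*\))? (was|were) responsible/,
  /believed (.*?)(?: \([^)]*\))? (was|were) responsible/,
  // Note: the leading "." is unescaped, so it matches any character.
  /.  ([^.]*?)(?: \([^)]*\))? claimed responsibility\.$/
];

// Hypothetical helper: returns, for each pattern, the first capture group
// of its first match in the text, or null when the pattern does not match.
function firstGroups(text) {
  return patterns.map(function (re) {
    var m = re.exec(text);
    return m ? m[1] : null;
  });
}

// Invented sample sentences in the style of an incident report.
var believed = firstGroups(
  "Police believed the Example Group (EG) was responsible for the attack."
);
var claimed = firstGroups(
  "A bomb exploded near the market.  Example Front (EF) claimed responsibility."
);
```

For the first sentence, both "believed" patterns match: the first yields "Example Group" and the second yields "the Example Group". For the second sentence only the third pattern matches, yielding "Example Front". In each case the optional non-capturing group drops the parenthesised abbreviation.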

...

Neither regex nor javascript is well suited for extracting fields from HTML and XML (particularly since the current JavaScript engine, the Java version of Rhino, does not support DOM).

As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex). 

In this example, an xpath script is used as part of manual text extraction, in order to convert a sample XML document into JSON.

...