Federated Query Source

Overview

(IN PROGRESS)

The federated query source differs from the other source types: it temporarily imports documents into the platform via an external API whenever a recognized query is performed via the API or UI.

There are 3 ways of bringing in the data:

  • URL requests
  • Specifying a python script (actually a Jython script) - the "importScript" should be the python code to execute, with the last evaluated expression being the value to pass back
    • (use "scriptlang": "python")
  • Specifying an external script - the "importScript" should be the command line, eg "dir/scriptname.sh args" (do not put "sh" in front of the script; the script itself must start with a "#!" line)
    • (use "scriptlang": "external")
    • (the same access rules apply as for the "External Script" extractor, ie "dir" is relative to "/opt/infinite-home/lib/extractor-scripts")
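
As a rough illustration of the "last evaluated expression" rule for python scripts, the sketch below emulates it in plain Python 3. It is not the platform's implementation: the helper run_import_script and the sample script are hypothetical, and it assumes that "$1" is substituted textually into the script before it runs.

```python
import ast


def run_import_script(script, term):
    """Run `script` and return the value of its last expression (hypothetical helper)."""
    # Assumption: "$1" is replaced textually before the script is executed.
    source = script.replace("$1", term)
    tree = ast.parse(source)
    if not tree.body or not isinstance(tree.body[-1], ast.Expr):
        raise ValueError("script must end with an expression")
    last = tree.body.pop()
    scope = {}
    # Execute every statement except the last one...
    exec(compile(tree, "<importScript>", "exec"), scope)
    # ...then evaluate the final expression and hand its value back.
    return eval(compile(ast.Expression(last.value), "<importScript>", "eval"), scope)


# Hypothetical importScript: builds a JSON reply for the query term.
IMPORT_SCRIPT = """
import json
record = {"domain": "$1", "labels": "$1".split(".")}
json.dumps(record, sort_keys=True)
"""

reply = run_import_script(IMPORT_SCRIPT, "garyhart.com")
print(reply)
```

The point of the sketch is only that the script does not "return" or print anything: whatever its final expression evaluates to is the value passed back.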

Troubleshooting errors:

  • If you get an "IOException" it is likely a permissions error, try:
    • runuser - tomcat -c "/opt/infinite-home/external-scripts/<PATH> <args>"
      to see what error you get

 

The rules for whether a query triggers the federated query source are as follows:

  • If there is a single entity query term (apart from "*", which is ignored), and its type is one of the elements of "entityTypes" (apart from the special case described below), then that query term is copied into $1 and the federated query is issued
  • If entityTypes contains elements in the form "/regex/TYPE" (ie starting with /), and the query term is a single text term (again apart from "*") that matches the regex, then the matching string is used as the term, together with the corresponding TYPE
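
The two rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's code: the function name match_query is hypothetical, the text-term key "etext" is assumed, entity types are assumed to be compared case-insensitively, and the regex is assumed to be applied as a search rather than a full match.

```python
import re


def match_query(entity_types, query_terms):
    """Return (term, type) for the federated query, or None if it should not run.

    query_terms is a list of dicts shaped like the "qt" array, eg
    {"entity": "garyhart.com/externaldomain"} or {"etext": "garyhart.com"}.
    """
    # "*" terms are ignored by both rules.
    terms = [t for t in query_terms
             if t.get("entity") != "*" and t.get("etext") != "*"]
    if len(terms) != 1:
        return None
    term = terms[0]

    if "entity" in term:
        # Rule 1: a single entity term whose type is listed in entityTypes.
        name, _, etype = term["entity"].rpartition("/")
        plain = [t.lower() for t in entity_types if not t.startswith("/")]
        # Assumption: types are compared case-insensitively.
        return (name, etype) if etype.lower() in plain else None

    if "etext" in term:
        # Rule 2: a single text term matching a "/regex/TYPE" element.
        for element in entity_types:
            if not element.startswith("/"):
                continue
            regex, _, etype = element[1:].rpartition("/")
            # Assumption: the regex may match anywhere in the text.
            found = re.search(regex, term["etext"])
            if found:
                return (found.group(0), etype)
    return None


ENTITY_TYPES = ["externaldomain", "/.*[.][a-z]+/externaldomain"]
print(match_query(ENTITY_TYPES, [{"entity": "garyhart.com/externaldomain"}]))
print(match_query(ENTITY_TYPES, [{"etext": "garyhart.com"}, {"etext": "*"}]))
print(match_query(ENTITY_TYPES, [{"etext": "hello"}]))
```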

The query term is copied into $1, which can be placed in any of the requests.*/importScript fields.
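
To make the $1 substitution concrete, here is a small sketch that fills the query term into one element of "requests". It assumes the request is sent as an HTTP GET with "urlParams" encoded as the query string; build_request_url is a hypothetical helper name, not a platform function.

```python
from urllib.parse import urlencode


def build_request_url(request, term):
    """Fill "$1" into one element of "requests" and return the full URL."""
    params = {key: value.replace("$1", term)
              for key, value in request.get("urlParams", {}).items()}
    endpoint = request["endPointUrl"].replace("$1", term)
    # Assumption: urlParams become the query string of a GET request.
    return endpoint + ("?" + urlencode(params) if params else "")


request = {
    "endPointUrl": "https://www.virustotal.com/vtapi/v2/domain/report",
    "urlParams": {"apikey": "XXX", "domain": "$1"},
}
url = build_request_url(request, "garyhart.com")
print(url)
```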

Once data has been obtained from an external source, it can be processed in one of 2 ways:

  • If the pipeline consists of just the federated query element, the "docConversionMap" can be used to generate entities:
    • The docConversionMap keys point to nested JSON fieldnames (":" used instead of ".")
      • If a key starts with ":" then the JsonPath syntax is used.
      • The corresponding values are the entity types
      • "typeToDimensionMap" maps the types to dimensions (Who/What/Where/When)
  • A normal source object can be used with the federated query as the first element - a single document is passed into the pipeline, with the following full text:
    • The output of the script 
    • A new-line separated list of the outputs of the URL requests (which are also individually copied into a metadata array called "__FEDERATED_REPLIES__", if there is more than one)
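
The first option (docConversionMap plus typeToDimensionMap) can be sketched as below. This is an illustration only: collect_values and extract_entities are hypothetical names, arrays are assumed to be traversed element by element, keys starting with ":" (JsonPath) are skipped, and the entity fields are modelled on the "Example Output" section rather than on the platform's code.

```python
def collect_values(node, path):
    """Yield every value reached by following `path` (a list of field names)."""
    if isinstance(node, list):
        # Assumption: arrays are traversed transparently, element by element.
        for item in node:
            yield from collect_values(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from collect_values(node[path[0]], path[1:])


def extract_entities(reply, doc_conversion_map, type_to_dimension_map):
    """Turn a parsed JSON reply into a list of entity dicts."""
    entities = []
    for key, entity_type in doc_conversion_map.items():
        if key.startswith(":"):
            continue  # JsonPath keys are not handled in this sketch
        for value in collect_values(reply, key.split(":")):
            entities.append({
                "disambiguated_name": str(value),
                "type": entity_type,
                "dimension": type_to_dimension_map.get(entity_type),
                "index": "%s/%s" % (value, entity_type.lower()),
            })
    return entities


reply = [{"resolutions": [{"last_resolved": "2013-09-04 00:00:00",
                           "ip_address": "63.233.155.6"}]}]
conversion = {"resolutions:ip_address": "ExternalIp",
              "resolutions:last_resolved": "ResolvedDate"}
dimensions = {"ExternalIp": "What", "ResolvedDate": "What"}
entities = extract_entities(reply, conversion, dimensions)
for entity in entities:
    print(entity)
```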

There are 2 levels of caching:

  • The API response is cached for the period specified by "cacheTime_days"
  • The separate documents are cached indefinitely (though they will be refreshed whenever the API response is refreshed) - although this provides a second layer of caching, its primary purpose is to enable the documents to be stored in buckets and queues.
    • Note that when the docs are refreshed, "updateId" is used to retain the original "_id", see Document JSON format.
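
The first caching level can be pictured with the sketch below. It is a simplified model, not the platform's implementation: the class ApiResponseCache and its in-memory dictionary are hypothetical, and only the "cacheTime_days" expiry is modelled (document storage and "updateId" handling are left out).

```python
from datetime import datetime, timedelta, timezone


class ApiResponseCache:
    """Level-1 cache: raw API replies, expiring after cacheTime_days."""

    def __init__(self, cache_time_days):
        self.ttl = timedelta(days=cache_time_days)
        self.entries = {}  # query key -> (fetched_at, reply)

    def get(self, key, fetch, now=None):
        """Return (reply, refreshed); `fetch` is called only on a miss or expiry."""
        now = now or datetime.now(timezone.utc)
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1], False
        reply = fetch(key)
        self.entries[key] = (now, reply)
        return reply, True


calls = []


def fake_fetch(key):
    # Stand-in for the external API call.
    calls.append(key)
    return "reply for " + key


cache = ApiResponseCache(cache_time_days=5)
start = datetime(2014, 10, 10, tzinfo=timezone.utc)
print(cache.get("garyhart.com", fake_fetch, now=start))
print(cache.get("garyhart.com", fake_fetch, now=start + timedelta(days=4)))
print(cache.get("garyhart.com", fake_fetch, now=start + timedelta(days=5)))
```

In this model the "refreshed" flag marks the moment at which the separately cached documents would also be regenerated.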

Finally, note that there is a "testQueryJson" string field, which is just used from the "Test Source" UI/API function - it injects a fake query that is used to generate the API request.

Format

testQueryJson
//URL endpoint
{
        "federatedQuery": {
            "cacheTime_days": 5,
            "docConversionMap": {"resolutions:ip_address": "ExternalIp"},
            "entityTypes": ["ExternalDomain", "/[a-z0-9_.-][.]com/ExternalDomain"],
            "requests": [
                {
                    "endPointUrl": "",
                    "urlParams": {
                        "apikey": "XXX",
                        "domain": "$1"
                    }
                },
                {
                    "endPointUrl": "",
                    "urlParams": {
                        "apikey": "XXX",
                        "domain": "$1"
                    }
                }
            ],
            "testQueryJson": "{'qt':[{'entity':'garyhart.com/externaldomain'}]}",
            "titlePrefix": "Virus Total Domain Lookup",
            "typeToDimensionMap": {"ExternalIp": "Who"}
        }
}
//OR
{
        "federatedQuery": {
            "importScript": string,
            "scriptlang": string, // "python" or "external"
            // no requests array, otherwise the same as above
            //...
        }
}

Example

{
    "description": "Federated Query - Virustotal Domain",
    "extractType": "Federated",
    "federatedQueryCommunityIds": [
        "53ab42a2e4b04bcfe2de4387"
    ],
    "isPublic": true,
    "mediaType": "Record",
    "processingPipeline": [
        {
            "display": "Just contains a string in which to put the logstash configuration (minus the output, which is appended by Infinit.e)",
            "federatedQuery": {
                "bypassSimpleQueryParsing": false,
                "cacheTime_days": 5,
                "docConversionMap": {
                    "Webutation domain info:Safety score": "SafetyScore",
                    "Webutation domain info:Verdict": "SafetyRating",
                    "detected_communicating_samples:date": "Date",
                    "detected_communicating_samples:positives": "CleanURLScan",
                    "detected_communicating_samples:sha256": "Hash",
                    "detected_downloaded_samples:date": "Date",
                    "detected_downloaded_samples:positives": "MaliciousURLScan",
                    "detected_downloaded_samples:sha256": "Hash",
                    "resolutions:ip_address": "ExternalIp",
                    "resolutions:last_resolved": "ResolvedDate"
                },
                "entityTypes": [
                    "externaldomain",
                    "/.*[.][a-z]+/externaldomain"
                ],
                "requests": [
                    {
                        "endPointUrl": "https://www.virustotal.com/vtapi/v2/domain/report",
                        "urlParams": {
                            "apikey": "xxxxxxxxxxxxxxxx...",
                            "domain": "$1"
                        }
                    }
                ],
                "scriptlang": "none",
                "testQueryJson": "{'qt':[{'entity':'garyhart.com/externaldomain'}]}",
                "titlePrefix": "Virus Total Domain Lookup",
                "typeToDimensionMap": {
                    "CleanAVURLScan": "What",
                    "Date": "What",
                    "ExternalIp": "What",
                    "Hash": "What",
                    "MaliciousAVURLScan": "What",
                    "ResolvedDate": "What",
                    "SafetyRating": "What",
                    "SafetyScore": "What"
                }
            }
        }
    ],
    "tags": [
        "Federated",
        "Query",
        "Virustotal",
        "Domain"
    ],
    "title": "Federated Query - Virustotal Domain"
}

 

Example Output

{
 "_id": "54372cdae4b00de66d2dc0d2",
 "aggregateSignif": 100,
 "communityId": ["53ab42a2e4b04bcfe2de4387"],
 "created": "Oct 10, 2014 12:48:26 AM UTC",
 "description": "[\n {\n \"whois\": \" Domain Name: GARYHART.COM\\n Registrar: NETWORK SOLUTIONS, LLC.\\n Whois Server: whois.networksolutions.com\\n Referral URL: http://networksolutions.com\\n Name Server: NS61.WORLDNIC.COM\\n Name Server: NS62.WORLDNIC.COM\\n Status: clientTransferProhibited\\n Updated Date: 15-may-2014\\n Creation Date: 15-jul-1997\\n Expiration Date: 14-jul-2015\\n\\nThe Registry database contains ONLY .COM, .NET, .EDU domains and\\nRegistrars.\\nWelcome to the Network Solutions(R) Registrar WHOIS Server.\\n\\nTo see the Network Solutions WHOIS Policy, click on or copy and paste the following\\nURL into your browser:\\n\\nhttp://www.networksolutions.com/whois/index.jhtml\\n\\nIf you feel that you have received this message in error, please email us using the online\\nform at http://www.networksolutions.com/help/email.jsp with the following information:\\n\\nWhois Query: garyhart.com\\nYOUR IP address is 91.121.71.92\\nDate and Time of Query: Fri Sep 26 18:26:56 EDT 2014\\nReason Code: IE\",\n \"whois_timestamp\": 1.4117709495514E9,\n \"response_code\": 1,\n \"verbose_msg\": \"Domain found in dataset\",\n \"Websense ThreatSeeker category\": \"bot networks. illegal or questionable\",\n \"resolutions\": [\n {\n \"last_resolved\": \"2013-09-04 00:00:00\",\n \"ip_address\": \"63.233.155.6\"\n }\n ],\n \"detected_urls\": [\n {\n \"url\": \"http://garyhart.com/\",\n \"positives\": 3,\n \"total\": 59,\n \"scan_date\": \"2014-09-26 22:26:49\"\n }\n ],\n \"categories\": [\n \"bot networks. illegal or questionable\"\n ]\n }\n]",
 "entities": [
 {
 "datasetSignificance": 10,
 "dimension": "What",
 "disambiguated_name": "63.233.155.6",
 "doccount": 1,
 "frequency": 1,
 "index": "63.233.155.6/externalip",
 "queryCoverage": 100,
 "relevance": 1,
 "totalfrequency": 1,
 "type": "ExternalIp"
 },
 {
 "datasetSignificance": 10,
 "dimension": "What",
 "disambiguated_name": "2013-09-04 00:00:00",
 "doccount": 1,
 "frequency": 1,
 "index": "2013-09-04 00:00:00/resolveddate",
 "queryCoverage": 100,
 "relevance": 1,
 "totalfrequency": 1,
 "type": "ResolvedDate"
 }
 ],
 "mediaType": ["Record"],
 "metadata": {"json": [{
 "Websense ThreatSeeker category": "bot networks. illegal or questionable",
 "categories": ["bot networks. illegal or questionable"],
 "detected_urls": [{
 "positives": 3,
 "scan_date": "2014-09-26 22:26:49",
 "total": 59,
 "url": "http://garyhart.com/"
 }],
 "resolutions": [{
 "ip_address": "63.233.155.6",
 "last_resolved": "2013-09-04 00:00:00"
 }],
 "response_code": 1,
 "verbose_msg": "Domain found in dataset",
 "whois": " Domain Name: GARYHART.COM\n Registrar: NETWORK SOLUTIONS, LLC.\n Whois Server: whois.networksolutions.com\n Referral URL: http://networksolutions.com\n Name Server: NS61.WORLDNIC.COM\n Name Server: NS62.WORLDNIC.COM\n Status: clientTransferProhibited\n Updated Date: 15-may-2014\n Creation Date: 15-jul-1997\n Expiration Date: 14-jul-2015\n\nThe Registry database contains ONLY .COM, .NET, .EDU domains and\nRegistrars.\nWelcome to the Network Solutions(R) Registrar WHOIS Server.\n\nTo see the Network Solutions WHOIS Policy, click on or copy and paste the following\nURL into your browser:\n\nhttp://www.networksolutions.com/whois/index.jhtml\n\nIf you feel that you have received this message in error, please email us using the online\nform at http://www.networksolutions.com/help/email.jsp with the following information:\n\nWhois Query: garyhart.com\nYOUR IP address is 91.121.71.92\nDate and Time of Query: Fri Sep 26 18:26:56 EDT 2014\nReason Code: IE",
 "whois_timestamp": 1.4117709495514E9
 }]},
 "modified": "Oct 10, 2014 12:48:26 AM UTC",
 "publishedDate": "Oct 10, 2014 12:48:26 AM UTC",
 "queryRelevance": 100,
 "score": 100,
 "source": ["Federated Query - Virustotal Domain"],
 "sourceKey": ["www.virustotal.com.vtapi.v2.domain.report"],
 "title": "Virus Total Domain Lookup: garyhart.com: 63.233.155.6, 2013-09-04 00:00:00",
 "url": "inf://federated/www.virustotal.com.vtapi.v2.domain.report/garyhart.com/externaldomain"
}