REST API example

Overview

This page shows a very simple REST API source in which the JSON returned by a single endpoint is split into separate documents, one per returned object.

Some functionality that is commonly required is not shown in this example:

  • To iterate through pages, the Follow Web links element can be used. For example, in the "links.script" you would parse the reply with "var json = eval('(' + text + ')')", check that "json.meta.offset + json.meta.limit < json.meta.total_count", and if so push "{ url: _doc.url + "?limit=" + json.meta.limit, /* etc */ spiderOut: true }" onto the returned array.
    • Alternatively, to avoid needing any scripting, you can set a large "numPages" and then set "stopPaginatingOnDuplicate" together with "pageChangeRegex" and "pageChangeReplace".
    • In some cases the API reply does not have a limit/offset scheme, and instead each JSON reply contains the URL of the next call. Scripting is currently needed in this case, e.g. "retval = []; /* push docs to extract, then spidering is: */ var json = eval('(' + text + ')'); var next_link = json.meta.next_link; retval.push({ 'url': next_link, 'spiderOut': true })".
  • HTTP headers and content data (i.e. POST requests) can also be used. This is discussed in Web extractor.
  • To use secure credentials, upload a JSON share containing them to your personal community (using the 5. File Uploader) and then use the substitution format "#IKANOW{ID.FIELD}", where ID is the "_id" (a hex string) of the share and FIELD is the field name.
  • Multi-step authentication is harder; again, the Follow Web links element can be used.
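
The two pagination approaches described in the list above can be sketched as follows. This is only an illustration, not the platform's own code: the function names "nextPageByOffset", "nextPageByLink" and "withPaging" are hypothetical wrappers that let the logic run standalone (inside a real "links.script" the variables "text" and "_doc" are supplied by the harvester), and the assumption that the API accepts "limit" and "offset" query parameters is not confirmed by this page.

```javascript
// Hypothetical helper: rebuild the URL with fresh paging parameters.
// Assumption: the API understands "limit" and "offset" query parameters.
function withPaging(docUrl, limit, offset) {
    var parts = docUrl.split('?');
    var params = (parts[1] || '').split('&').filter(function (p) {
        // Drop empty entries and any previous paging parameters
        return p !== '' && !/^(limit|offset)=/.test(p);
    });
    params.push('limit=' + limit, 'offset=' + offset);
    return parts[0] + '?' + params.join('&');
}

// Case 1: the reply carries meta.offset / meta.limit / meta.total_count.
function nextPageByOffset(text, docUrl) {
    var json = eval('(' + text + ')');
    var retval = [];
    var next = json.meta.offset + json.meta.limit;
    if (next < json.meta.total_count) {
        // More results remain, so ask the harvester to follow the next page
        retval.push({
            url: withPaging(docUrl, json.meta.limit, next),
            spiderOut: true
        });
    }
    return retval;
}

// Case 2: the reply carries the URL of the next call in meta.next_link.
function nextPageByLink(text) {
    var json = eval('(' + text + ')');
    var retval = [];
    var next_link = json.meta.next_link;
    if (next_link) {
        retval.push({ url: next_link, spiderOut: true });
    }
    return retval;
}

// Usage with a reply shaped like the sample on this page
var reply = '{meta: {limit: 100, offset: 0, total_count: 471}, objects: []}';
console.log(nextPageByOffset(reply, 'https://www.govtrack.us/api/v2/bill?q=cyber'));
```

In both cases an empty array is returned once there is nothing further to fetch, which is what ends the spidering in this sketch.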

Input data

{meta: {limit: 100,
	offset: 0,
	total_count: 471
},
objects: [
{
	bill_resolution_type: "bill",
	bill_type: "house_bill",
	bill_type_label: "H.R.",
	congress: 111,
	current_status: "referred",
	current_status_date: "2010-01-26",
	current_status_description: "This bill was introduced on January 26, 2010, in a previous session of Congress, but was not enacted.",
	current_status_label: "Referred to Committee",
	display_number: "H.R. 4507",
	docs_house_gov_postdate: null,
	id: 433,
	introduced_date: "2010-01-26",
	is_alive: false,
	is_current: false,
	link: "https://www.govtrack.us/congress/bills/111/hr4507",
//(...)
	sponsor: {bioguideid: "R000568",
		birthday: "1946-12-09",
		cspanid: 48779,
		firstname: "Ciro",
		gender: "male",
		gender_label: "Male",
		id: 400339,
		lastname: "Rodriguez",
		link: "https://www.govtrack.us/congress/members/ciro_rodriguez/400339",
		middlename: "D.",
		name: "Rep. Ciro Rodriguez [D-TX23, 2007-2010]",
		namemod: "",
		nickname: "",
		osid: "N00009828",
		pvsid: "16389",
		sortname: "Rodriguez, Ciro (Rep.) [D-TX23, 2007-2010]",
		twitterid: null,
		youtubeid: null
	},
	title: "H.R. 4507 (111th): Cyber Security Domestic Preparedness Act",
	title_without_number: "Cyber Security Domestic Preparedness Act",
	titles: [["short",
		"introduced",
		"Cyber Security Domestic Preparedness Act"
	],
		["official",
		"introduced",
		"To amend the Homeland Security Act of 2002 to authorize the Secretary of Homeland Security to establish the Cyber Security Domestic Preparedness Consortium, and for other purposes."
	]
	]
},
{
	//other objects in the same format
}
//etc
]}
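
As a small illustration of how a reply like the one above is turned into an object, the sketch below uses the same "eval('(' + text + ')')" idiom as the source on this page. The sample text is a heavily trimmed, hand-written stand-in for the real reply. Note that the sample as displayed here has unquoted keys, which a strict JSON parser rejects, and that "eval" runs whatever it is given, so this idiom is only appropriate for endpoints you trust.

```javascript
// Trimmed stand-in for the API reply shown above (unquoted keys, as displayed)
var text = '{meta: {limit: 100, offset: 0, total_count: 471}, ' +
           'objects: [{id: 433, display_number: "H.R. 4507", ' +
           'sponsor: {lastname: "Rodriguez"}}]}';

// A strict JSON parser rejects unquoted keys
var strictOk = true;
try {
    JSON.parse(text);
} catch (e) {
    strictOk = false;
}
console.log(strictOk); // false

// The wrapping parentheses make eval treat the braces as an object literal
// rather than as a statement block
var json = eval('(' + text + ')');
console.log(json.meta.total_count);               // 471
console.log(json.objects[0].sponsor.lastname);    // Rodriguez
```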

Source

{
    "description": "JSON API splitter test",
    "isPublic": true,
    "mediaType": "Record",
    "processingPipeline": [
        {
            "display": "Specify one or more JSON (or XML or ...) endpoints from which to extract objects, each endpoint/URL generates multiple documents",
            "feed": {"extraUrls": [{
                "title": "dummy",
                "url": "https://www.govtrack.us/api/v2/bill?q=cyber"
            }]}
        },
        {
            "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere",
            "globals": {
                "scriptlang": "javascript",
                "scripts": ["function create_links( urls, input_array )\n{\n    for (var x in input_array) {\n        var input = input_array[x];\n        urls.push( { url: input.link, title: input.title, description: input.current_status_description, publishedData: input.current_status_date, fullText: input });\n    }\n}"]
            }
        },
        {
            "display": "Only check the API every 10 minutes (can be set to whatever you'd like)",
            "harvest": {
                "duplicateExistingUrls": true,
                "searchCycle_secs": 600
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "json",
                    "index": false,
                    "script": "var json = eval('('+text+')'); json; ",
                    "scriptlang": "javascript",
                    "store": true
                }
            ],

            "display": "Convert the text into a JSON object in the document's metadata field: _doc.metadata.json[0]"
        },
        {
            "display": "Take the original documents, split them using their metadata into new documents, and then delete the originals",
            "splitter": {
                "deleteExisting": true,
                "script": "var urls = []; create_links( urls, _metadata.json[0].objects ); urls;",
                "scriptflags": "m",
                "scriptlang": "javascript"
            }
        },
        {
            "contentMetadata": [{
                "fieldName": "json",
                "index": false,
                "script": "var json = eval('('+text+')'); json; ",
                "scriptlang": "javascript",
                "store": true
            }],
            "display": "Convert the text into a JSON object in the document's metadata field: _doc.metadata.json[0]"
        },
        {
            "display": "Improve ingest performance by not full-text-indexing the JSON object itself (the full text, entities etc still get indexed)",
            "searchIndex": {
                "indexOnIngest": true,
                "metadataFieldList": "+"
            }
        }
    ],
    "tags": ["test"],
    "title": "API example"
}
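
To make the splitter step easier to follow, here is the "globals" script from the source above with its string escaping removed, followed by the splitter script run against a small stand-in for "_metadata". The stand-in data is hand-written and trimmed from the input sample on this page; in the platform, "_metadata" is populated by the preceding "contentMetadata" step, and this sketch does not try to reproduce what the harvester does with the returned array.

```javascript
// The "globals" script, unescaped: one entry is pushed per object in the reply
function create_links(urls, input_array) {
    for (var x in input_array) {
        var input = input_array[x];
        urls.push({
            url: input.link,
            title: input.title,
            description: input.current_status_description,
            publishedData: input.current_status_date,
            fullText: input
        });
    }
}

// Hand-written stand-in for the metadata produced by the "contentMetadata" step
var _metadata = {
    json: [{
        meta: { limit: 100, offset: 0, total_count: 471 },
        objects: [{
            link: "https://www.govtrack.us/congress/bills/111/hr4507",
            title: "H.R. 4507 (111th): Cyber Security Domestic Preparedness Act",
            current_status_description: "This bill was introduced on January 26, 2010, in a previous session of Congress, but was not enacted.",
            current_status_date: "2010-01-26"
        }]
    }]
};

// The splitter script itself, as in the source
var urls = [];
create_links(urls, _metadata.json[0].objects);

console.log(urls.length);  // 1
console.log(urls[0].url);  // https://www.govtrack.us/congress/bills/111/hr4507
```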

Output data

TODO