REST API example
Overview
This page shows a very simple REST example in which the JSON returned by the API is split into separate documents.
Some additional functionality that is commonly required but not shown in this example:
- To iterate through pages, the Follow Web Links element can be used: in its "links.script" you would parse the reply with "var json = eval('('+text+')')", check that "json.meta.offset + json.meta.limit < json.meta.total_count", and then push "{ url: _doc.url + "&offset=" + (json.meta.offset + json.meta.limit), /* etc */ spiderOut: true }" onto the returned array (a fuller sketch is given after this list).
- (Alternatively, to avoid needing any scripting, you can just set a large "numPages" and then set "stopPaginatingOnDuplicate" together with "pageChangeRegex" and "pageChangeReplace".) In some cases the API reply doesn't have a limit/offset schema; instead each JSON reply contains the URL of the next call. In that case scripting is currently needed, eg "retval = []; /* push the docs to extract, then to spider onwards: */ var json = eval('('+text+')'); var next_link = json.meta.next_link; retval.push({ 'url': next_link, 'spiderOut': true }); retval;" (this case is also covered in the sketch after this list).
- Headers and content data (i.e. POST) can be used; this is discussed in Web extractor.
- To use secure credentials, upload a JSON share containing them to your personal community (using the File Uploader) and then use the substitution format "#IKANOW{ID.FIELD}", where ID is the "_id" (hex string) of the share and FIELD is the field name (an illustrative example is given after this list).
- Multi-step authentication is harder; again, the Follow Web Links element can be used.
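For reference, a minimal sketch of such a "links.script" follows. It assumes (as described above) that the script sees the raw API reply in "text" and the current document in "_doc", that the endpoint URL already contains a query string (here "?q=cyber") so "&offset=" can be appended, and that the API accepts offset/limit parameters; adjust the field names to your API.

    // Minimal pagination sketch for the Follow Web Links "links.script".
    // Assumes: "text" holds the API reply, "_doc" is the current document, and the
    // endpoint URL already has a query string so "&offset=..." can be appended.
    var retval = [];
    var json = eval('(' + text + ')');
    // Case 1: limit/offset/total_count pagination (as in the reply shown under "Input data"):
    if (json.meta.offset + json.meta.limit < json.meta.total_count) {
        retval.push({
            url: _doc.url + "&offset=" + (json.meta.offset + json.meta.limit),
            spiderOut: true // fetch this URL and run it through the same pipeline
        });
    }
    // Case 2: the reply instead contains the URL of the next call:
    // if (json.meta.next_link) {
    //     retval.push({ url: json.meta.next_link, spiderOut: true });
    // }
    retval; // the last expression is the script's return value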
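As an illustration of the credential substitution (the share "_id" and the "apiKey" field below are hypothetical placeholders, and the govtrack endpoint used in this example does not actually need a key), the reference can be embedded directly in the endpoint URL:

    // Hypothetical example only: "0123456789abcdef01234567" stands in for the "_id"
    // of a JSON share containing {"apiKey": "..."}; the platform substitutes the
    // value before making the request.
    "feed": {
        "extraUrls": [{
            "title": "dummy",
            "url": "https://api.example.com/v2/data?q=cyber&apikey=#IKANOW{0123456789abcdef01234567.apiKey}"
        }]
    }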
Input data
{
    meta: { limit: 100, offset: 0, total_count: 471 },
    objects: [
        {
            bill_resolution_type: "bill",
            bill_type: "house_bill",
            bill_type_label: "H.R.",
            congress: 111,
            current_status: "referred",
            current_status_date: "2010-01-26",
            current_status_description: "This bill was introduced on January 26, 2010, in a previous session of Congress, but was not enacted.",
            current_status_label: "Referred to Committee",
            display_number: "H.R. 4507",
            docs_house_gov_postdate: null,
            id: 433,
            introduced_date: "2010-01-26",
            is_alive: false,
            is_current: false,
            link: "https://www.govtrack.us/congress/bills/111/hr4507",
            // (...)
            sponsor: {
                bioguideid: "R000568",
                birthday: "1946-12-09",
                cspanid: 48779,
                firstname: "Ciro",
                gender: "male",
                gender_label: "Male",
                id: 400339,
                lastname: "Rodriguez",
                link: "https://www.govtrack.us/congress/members/ciro_rodriguez/400339",
                middlename: "D.",
                name: "Rep. Ciro Rodriguez [D-TX23, 2007-2010]",
                namemod: "",
                nickname: "",
                osid: "N00009828",
                pvsid: "16389",
                sortname: "Rodriguez, Ciro (Rep.) [D-TX23, 2007-2010]",
                twitterid: null,
                youtubeid: null
            },
            title: "H.R. 4507 (111th): Cyber Security Domestic Preparedness Act",
            title_without_number: "Cyber Security Domestic Preparedness Act",
            titles: [
                [ "short", "introduced", "Cyber Security Domestic Preparedness Act" ],
                [ "official", "introduced", "To amend the Homeland Security Act of 2002 to authorize the Secretary of Homeland Security to establish the Cyber Security Domestic Preparedness Consortium, and for other purposes." ]
            ]
        },
        {
            // other objects in the same format
        }
        // etc
    ]
}
Source
{ "description": "JSON API splitter test", "isPublic": true, "mediaType": "Record", "processingPipeline": [ { "display": "Specify one or more JSON (or XML or ...) endpoints from which to extract objects, each endpoint/URL generates multiple documents", "feed": {"extraUrls": [{ "title": "dummy", "url": "https://www.govtrack.us/api/v2/bill?q=cyber" }]} }, { "display": "A global space to group all the complex parsing and processing logic, can be called from anywhere", "globals": { "scriptlang": "javascript", "scripts": ["function create_links( urls, input_array )\n{\n for (var x in input_array) {\n var input = input_array[x];\n urls.push( { url: input.link, title: input.title, description: input.current_status_description, publishedData: input.current_status_date, fullText: input });\n }\n}"] } }, { "display": "Only check the API every 10 minutes (can be set to whatever you'd like)", "harvest": { "duplicateExistingUrls": true, "searchCycle_secs": 600 } }, { "contentMetadata": [ { "fieldName": "json", "index": false, "script": "var json = eval('('+text+')'); json; ", "scriptlang": "javascript", "store": true } ], "display": "Convert the text into a JSON object in the document's metadata field: _doc.metadata.json[0]" }, { "display": "Take the original documents, split them using their metadaata into new documents, and then delete the originals", "splitter": { "deleteExisting": true, "script": "var urls = []; create_links( urls, _metadata.json[0].objects ); urls;", "scriptflags": "m", "scriptlang": "javascript" } }, { "contentMetadata": [{ "fieldName": "json", "index": false, "script": "var json = eval('('+text+')'); json; ", "scriptlang": "javascript", "store": true }], "display": "Convert the text into a JSON object in the document's metadata field: _doc.metadata.json[0]" }, { "display": "Improve ingest performance by not full-text-indexing the JSON object itself (the full text, entities etc still get indexed)", "searchIndex": { "indexOnIngest": true, "metadataFieldList": "+" } } ], "tags": ["test"], "title": "API example" }
Output data
TODO