Infinit.e quick start - example sources
Simple example (News)
Start with the "Basic RSS Source Template" from the Infinit.e.Manager - Sources GUI:
{
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "mediaType": "Social",
    "tags": [ "tag1", "tag2" ],
    "title": "Basic RSS Source Template",
    "url": ""
}
This assumes you have obtained an OpenCalais or AlchemyAPI key and configured the Infinit.e properties file accordingly. If not, do that first.
Then modify the title and description, and set "url" to a standard news RSS feed.
Also add the field "useExtractor", and set it to:
- "AlchemyAPI" if you have an AlchemyAPI key
- "OpenCalais" if you have an OpenCalais key
- (A little bit more detail is provided on enrichment engines here)
The default wait time between requests to URLs in the same source is 10 seconds (an industry standard). For testing and/or for large websites, this can normally be reduced by setting "waitTimeOverride_ms" (a value in milliseconds) in the "rss" block, as the source below does.
This results in a source that looks something like (with ignored fields deleted):
{
    "description": "A test RSS news feed, using OpenCalais",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "News",
    "tags": [ "news", "england" ],
    "rss": { "waitTimeOverride_ms": 1000 },
    "title": "BBC News",
    "url": "",
    "useExtractor": "OpenCalais"
}
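The edits described above (new title and description, a feed URL, "useExtractor", and the optional "rss" wait override) can also be applied to the template programmatically. The following is a minimal Python sketch, not part of Infinit.e itself: the helper name `customize_source` and the feed URL `http://example.com/rss.xml` are placeholders invented for illustration, and the resulting JSON would still need to be pasted into (or otherwise submitted via) the Sources GUI.

```python
import copy
import json

# The "Basic RSS Source Template" shown earlier in this section.
TEMPLATE = {
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": False,
    "isApproved": True,
    "isPublic": True,
    "mediaType": "Social",
    "tags": ["tag1", "tag2"],
    "title": "Basic RSS Source Template",
    "url": "",
}


def customize_source(template, title, description, url, extractor, wait_ms=None):
    """Return a modified copy of the template (hypothetical helper).

    extractor is the value for "useExtractor", e.g. "AlchemyAPI" or
    "OpenCalais"; wait_ms, if given, becomes rss.waitTimeOverride_ms.
    """
    source = copy.deepcopy(template)  # leave the original template untouched
    source["title"] = title
    source["description"] = description
    source["url"] = url
    source["useExtractor"] = extractor
    if wait_ms is not None:
        # Override the default 10 second wait between URL requests.
        source.setdefault("rss", {})["waitTimeOverride_ms"] = int(wait_ms)
    return source


if __name__ == "__main__":
    news_source = customize_source(
        TEMPLATE,
        title="BBC News",
        description="A test RSS news feed, using OpenCalais",
        url="http://example.com/rss.xml",  # placeholder, not a real feed
        extractor="OpenCalais",
        wait_ms=1000,
    )
    print(json.dumps(news_source, indent=4))
```

Copying the template before editing keeps it reusable for further sources; fields such as "mediaType" and "tags" would be adjusted in the same way.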
Complex example (Twitter)
By contrast, the next example does not use any of the commercial extraction engines (although in practice "AlchemyAPI-metadata" could be used). Instead, it uses the custom scripting engine, which is based on regex and JavaScript.
{
    "description": "Example template for twitter",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "Social",
    "tags": [ "topic:politics", "industry:all", "Social" ],
    "title": "Twitter Recent #technology Posts",
    "useTextExtractor": "none",
    "useExtractor": "none",
    "unstructuredAnalysis": {
        "simpleTextCleanser": [
            {
                "field": "fullText",
                "regEx": "^.*<meta content=\"([^\"]*)\"\\s+name=[\"']description[\"']\\s*/>.*$",
                "replacement": "$1",
                "flags": "dH"
            }
        ],
        "meta": [
            {
                "fieldName": "HashTag",
                "scriptlang": "regex",
                "script": "(#[a-zA-Z0-9_]+)",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Author",
                "scriptlang": "regex",
                "script": "<meta content=[\"']([^\"']+)[\"'] name=[\"']page-user-screen_name[\"']\\s*/>",
                "groupNum": 1,
                "context": "First"
            },
            {
                "fieldName": "message__comment",
                "scriptlang": "regex",
                "script": "^.*$",
                "groupNum": 0,
                "context": "All"
            },
            {
                "fieldName": "Correspondent",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Retweeted",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+):",
                "groupNum": 1,
                "context": "All"
            }
        ]
    },
    "structuredAnalysis": {
        "scriptEngine": "JavaScript",
        "description": "$metadata.message__comment",
        "entities": [
            { "iterateOver": "HashTag", "disambiguated_name": "", "type": "HashTag", "dimension": "What" },
            { "iterateOver": "Author", "disambiguated_name": "", "type": "TwitterHandle", "dimension": "Who" },
            { "iterateOver": "Correspondent", "disambiguated_name": "", "type": "TwitterHandle", "dimension": "Who" },
            { "iterateOver": "Retweeted", "disambiguated_name": "", "type": "TwitterHandle", "dimension": "Who" }
        ],
        "associations": [
            {
                "iterateOver": "Correspondent",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "mentions in a tweet",
                "verb_category": "tweets_to",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "Retweeted",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "retweeted",
                "verb_category": "retweeted",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "HashTag",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "tweets about",
                "verb_category": "tweets_about",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            }
        ]
    },
    "url": "",
    "rss": { "waitTimeOverride_ms": "1000" }
}
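To get a feel for what the "meta" regexes and the "associations" block in this source produce, it can help to try them on a sample tweet outside the harvester. The Python sketch below is only an approximation under stated assumptions: Infinit.e applies these patterns with its own (Java-based) engine, the handling of "context" and of the "Author" page-level regex is not modeled, and the sample tweet, the author name "carol", the date string, and the helper names `extract_metadata` and `build_associations` are all invented for illustration.

```python
import re

# Patterns and group numbers copied from the "meta" entries of the source.
META_PATTERNS = {
    "HashTag": (r"(#[a-zA-Z0-9_]+)", 1),
    "Correspondent": (r"@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])", 1),
    "Retweeted": (r"@([a-zA-Z0-9_]+):", 1),
}

# (iterateOver field, verb, verb_category), copied from "associations".
ASSOCIATION_SPECS = [
    ("Correspondent", "mentions in a tweet", "tweets_to"),
    ("Retweeted", "retweeted", "retweeted"),
    ("HashTag", "tweets about", "tweets_about"),
]


def extract_metadata(text):
    """Collect every match of each pattern, keeping the configured group."""
    return {
        field: [m.group(group) for m in re.finditer(pattern, text)]
        for field, (pattern, group) in META_PATTERNS.items()
    }


def build_associations(author, metadata, published_date):
    """Emit one association per extracted value, mirroring "iterateOver"."""
    associations = []
    for field, verb, verb_category in ASSOCIATION_SPECS:
        for value in metadata.get(field, []):
            associations.append({
                "entity1": author,             # stands in for $metadata.Author
                "entity2": value,              # stands in for _value
                "verb": verb,
                "verb_category": verb_category,
                "assoc_type": "Event",
                "time_start": published_date,  # stands in for _doc.publishedDate
            })
    return associations


if __name__ == "__main__":
    tweet = "RT @alice: Great talk by @bob on #technology and #AI"
    metadata = extract_metadata(tweet)
    print(metadata)
    for assoc in build_associations("carol", metadata, "2012-01-01"):
        print(assoc["entity1"], assoc["verb"], assoc["entity2"])
```

Note how the two handle patterns divide the work: a handle followed by a colon ("@alice:") is captured only by "Retweeted", while "Correspondent" explicitly rejects a trailing colon and so picks up only plain mentions such as "@bob".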
