...
Start with the "Basic RSS Source Template" from the source builder (the Infinit.e.Manager - Sources GUI):
Code Block | ||
---|---|---|
| ||
{
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "mediaType": "Social",
    "tags": [
        "tag1",
        "tag2"
    ],
    "title": "Basic RSS Source Template",
    "url": "http://blahblahblah.com/blah.rss"
} |
This example assumes you have obtained an OpenCalais or AlchemyAPI key and configured the Infinit.e properties file. If not, do that first.
Then modify the title and description, and select a standard News RSS feed, e.g.:
...
- "AlchemyAPI" if you have an AlchemyAPI key
- "OpenCalais" if you have an OpenCalais key
- (A little bit more detail is provided on enrichment engines here)
The default wait time between requests to URLs in the same source is 10 seconds (an industry standard). For testing, or for large web sites, this can normally be reduced. The code block below shows how the "waitTimeOverride_ms" field of the "rss" block (a value in milliseconds) can be used to do this.
This results in a source that looks something like the following (with ignored fields deleted):
Code Block | ||
---|---|---|
| ||
{
    "description": "A test RSS news feed, using OpenCalais",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "News",
    "tags": [
        "news",
        "england"
    ],
    "rss": {"waitTimeOverride_ms": 1000},
    "title": "BBC News",
    "url": "http://feeds.bbci.co.uk/news/rss.xml?edition=uk",
    "useExtractor": "OpenCalais"
} |
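The edits described above can also be sketched in code. The following is only an illustration in Python, not part of Infinit.e: the helper `build_news_source` and the `IGNORED_FIELDS` tuple are hypothetical, and the set of "ignored fields" is an assumption inferred by comparing the template with the resulting source.

```python
import copy
import json

# The "Basic RSS Source Template", as shown earlier in this section.
TEMPLATE = {
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": False,
    "isApproved": True,
    "isPublic": True,
    "mediaType": "Social",
    "tags": ["tag1", "tag2"],
    "title": "Basic RSS Source Template",
    "url": "http://blahblahblah.com/blah.rss",
}

# Assumption: the fields dropped from the final source, inferred by
# comparing the template with the BBC News example.
IGNORED_FIELDS = ("communityIds", "harvestBadSource", "isApproved")


def build_news_source(template, extractor, wait_ms):
    """Hypothetical helper: derive the BBC News source from the template."""
    source = copy.deepcopy(template)  # leave the template itself untouched
    for field in IGNORED_FIELDS:
        source.pop(field, None)
    source.update({
        "title": "BBC News",
        "description": "A test RSS news feed, using " + extractor,
        "mediaType": "News",
        "tags": ["news", "england"],
        "url": "http://feeds.bbci.co.uk/news/rss.xml?edition=uk",
        "useExtractor": extractor,  # "OpenCalais" or "AlchemyAPI"
        "rss": {"waitTimeOverride_ms": wait_ms},  # milliseconds
    })
    return source


bbc_source = build_news_source(TEMPLATE, "OpenCalais", 1000)
print(json.dumps(bbc_source, indent=4, sort_keys=True))
```

Passing "AlchemyAPI" instead of "OpenCalais" gives the equivalent source for an AlchemyAPI key. The printed JSON can then be pasted into the source builder.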
Complex example (Twitter)
By contrast, the next example does not use any of the commercial extraction engines (in practice, "AlchemyAPI-metadata" could be used)
...
TODO twitter. Instead, it uses the custom scripting engine, which is based on regex and JavaScript.
Code Block | ||
---|---|---|
| ||
{
    "description": "Example template for twitter",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "Social",
    "tags": [
        "topic:politics",
        "industry:all",
        "Social"
    ],
    "title": "Twitter Recent #technology Posts",
    "useTextExtractor": "none",
    "useExtractor": "none",
    "unstructuredAnalysis": {
        "simpleTextCleanser": [
            {
                "field": "fullText",
                "regEx": "^.*<meta content=\"([^\"]*)\"\\s+name=[\"']description[\"']\\s*/>.*$",
                "replacement": "$1",
                "flags": "dH"
            }
        ],
        "meta": [
            {
                "fieldName": "HashTag",
                "scriptlang": "regex",
                "script": "(#[a-zA-Z0-9_]+)",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Author",
                "scriptlang": "regex",
                "script": "<meta content=[\"']([^\"']+)[\"'] name=[\"']page-user-screen_name[\"']\\s*/>",
                "groupNum": 1,
                "context": "First"
            },
            {
                "fieldName": "message__comment",
                "scriptlang": "regex",
                "script": "^.*$",
                "groupNum": 0,
                "context": "All"
            },
            {
                "fieldName": "Correspondent",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Retweeted",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+):",
                "groupNum": 1,
                "context": "All"
            }
        ]
    },
    "structuredAnalysis": {
        "scriptEngine": "JavaScript",
        "description": "$metadata.message__comment",
        "entities": [
            {
                "iterateOver": "HashTag",
                "disambiguated_name": "",
                "type": "HashTag",
                "dimension": "What"
            },
            {
                "iterateOver": "Author",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            },
            {
                "iterateOver": "Correspondent",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            },
            {
                "iterateOver": "Retweeted",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            }
        ],
        "associations": [
            {
                "iterateOver": "Correspondent",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "mentions in a tweet",
                "verb_category": "tweets_to",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "Retweeted",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "retweeted",
                "verb_category": "retweeted",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "HashTag",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "tweets about",
                "verb_category": "tweets_about",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            }
        ]
    },
    "url": "http://search.twitter.com/search.rss?q=search%23technology",
    "rss": {
        "waitTimeOverride_ms": "1000"
    }
}
|
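To see what the "meta" regexes and the "associations" block in the source above are intended to produce, it can help to try them outside the harvester. The following Python sketch is only an approximation under stated assumptions: the patterns and verbs are copied from the source, but the sample tweet text, the author name "dave", and the helpers `extract_meta` and `build_associations` are hypothetical. It does not reproduce the harvester's actual processing, in particular the "context" and "flags" settings and the "Author" extraction from page HTML.

```python
import re

# The three tweet-text patterns from the "meta" block, copied verbatim.
META_PATTERNS = {
    "HashTag": r"(#[a-zA-Z0-9_]+)",
    "Correspondent": r"@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])",
    "Retweeted": r"@([a-zA-Z0-9_]+):",
}

# (field to iterate over, verb, verb_category), from the "associations" block.
ASSOCIATION_SPECS = [
    ("Correspondent", "mentions in a tweet", "tweets_to"),
    ("Retweeted", "retweeted", "retweeted"),
    ("HashTag", "tweets about", "tweets_about"),
]


def extract_meta(text):
    """Collect every group-1 match per field (all three use "groupNum": 1)."""
    return {
        field: [match.group(1) for match in re.finditer(pattern, text)]
        for field, pattern in META_PATTERNS.items()
    }


def build_associations(author, meta):
    """Emit one association per iterated value, with the value as entity2."""
    return [
        {"entity1": author, "entity2": value,
         "verb": verb, "verb_category": category}
        for field, verb, category in ASSOCIATION_SPECS
        for value in meta[field]
    ]


# Hypothetical tweet text and author, invented for illustration.
sample = "RT @alice: great talk on #technology with @bob and @carol"
meta = extract_meta(sample)
associations = build_associations("dave", meta)

print(meta)
for association in associations:
    print(association)
```

For this sample, "alice" is captured only as "Retweeted", because the "Correspondent" pattern rejects a handle that is followed by a colon, while "bob" and "carol" are captured as "Correspondent". Each captured value then yields one association with the author as "entity1", mirroring the "iterateOver" and "_value" usage in the source.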