Infinit.e quick start - example sources

Simple example (News)

Start with the "Basic RSS Source Template" from the Infinit.e.Manager - Sources GUI:

{
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "mediaType": "Social",
    "tags": [
        "tag1",
        "tag2"
    ],
    "title": "Basic RSS Source Template",
    "url": "http://blahblahblah.com/blah.rss"
}

This assumes you have already obtained an OpenCalais or AlchemyAPI key and configured the Infinit.e properties file. If not, do that first.

Then modify the title and description, set "mediaType" to "News", and point "url" at a standard news RSS feed, e.g. the BBC News feed used in the finished source below.

Also add the field "useExtractor", and set it to:

  • "AlchemyAPI" if you have an AlchemyAPI key
  • "OpenCalais" if you have an OpenCalais key
  • (A little bit more detail is provided on enrichment engines here)

The default wait time between requests to URLs in the same source is 10 seconds (an industry standard). For testing, or for large web sites, this can normally be reduced by setting "waitTimeOverride_ms" (a value in milliseconds) in the "rss" block, as the source below does.

This results in a source that looks something like (with ignored fields deleted):

{
    "description": "A test RSS news feed, using OpenCalais",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "News",
    "tags": [
        "news",
        "england"
    ],
    "rss": {"waitTimeOverride_ms": 1000},
    "title": "BBC News",
    "url": "http://feeds.bbci.co.uk/news/rss.xml?edition=uk",
    "useExtractor": "OpenCalais"
}
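
The edits described above can also be applied programmatically before pasting the result into the Sources GUI. The following is a minimal sketch in Python, not part of Infinit.e itself: the helper name make_news_source and its parameters are hypothetical, and it only builds the JSON document (it does not call any Infinit.e API). Unlike the listing above, it keeps the template fields that the listing deletes for brevity.

```python
import copy
import json

# The "Basic RSS Source Template" shown at the top of this page.
TEMPLATE = {
    "communityIds": ["4c927585d591d31d7b37097a"],
    "description": "Create a description of your source here.",
    "extractType": "Feed",
    "harvestBadSource": False,
    "isApproved": True,
    "isPublic": True,
    "mediaType": "Social",
    "tags": ["tag1", "tag2"],
    "title": "Basic RSS Source Template",
    "url": "http://blahblahblah.com/blah.rss",
}


def make_news_source(template, title, description, url, extractor, tags, wait_ms=None):
    """Return a modified copy of the template (hypothetical helper).

    The template itself is left untouched so it can be reused.
    """
    source = copy.deepcopy(template)
    source.update({
        "title": title,
        "description": description,
        "url": url,
        "mediaType": "News",
        "tags": list(tags),
        "useExtractor": extractor,  # "AlchemyAPI" or "OpenCalais"
    })
    if wait_ms is not None:
        # Override the default 10 second wait between URLs of this source.
        source["rss"] = {"waitTimeOverride_ms": wait_ms}
    return source


bbc = make_news_source(
    TEMPLATE,
    title="BBC News",
    description="A test RSS news feed, using OpenCalais",
    url="http://feeds.bbci.co.uk/news/rss.xml?edition=uk",
    extractor="OpenCalais",
    tags=["news", "england"],
    wait_ms=1000,
)
print(json.dumps(bbc, indent=4))
```

Copying the template rather than editing it in place makes it easy to generate several sources (one per feed) from the same starting point.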

Complex example (Twitter)

By contrast, the next example does not use any of the commercial extraction engines (although in practice "AlchemyAPI-metadata" could be used). Instead, it uses the custom scripting engine, which is based on regex and JavaScript.

{
    "description": "Example template for twitter",
    "extractType": "Feed",
    "isPublic": true,
    "mediaType": "Social",
    "tags": [
        "topic:politics",
        "industry:all",
        "Social"
    ],
    "title": "Twitter Recent #technology Posts",
    "useTextExtractor": "none",
    "useExtractor": "none",
    "unstructuredAnalysis": {
        "simpleTextCleanser": [
            {
                "field": "fullText",
                "regEx": "^.*<meta content=\"([^\"]*)\"\\s+name=[\"']description[\"']\\s*/>.*$",
                "replacement": "$1",
                "flags": "dH"
            }
        ],
        "meta": [
            {
                "fieldName": "HashTag",
                "scriptlang": "regex",
                "script": "(#[a-zA-Z0-9_]+)",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Author",
                "scriptlang": "regex",
                "script": "<meta content=[\"']([^\"']+)[\"'] name=[\"']page-user-screen_name[\"']\\s*/>",
                "groupNum": 1,
                "context": "First"
            },
            {
                "fieldName": "message__comment",
                "scriptlang": "regex",
                "script": "^.*$",
                "groupNum": 0,
                "context": "All"
            },
            {
                "fieldName": "Correspondent",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])",
                "groupNum": 1,
                "context": "All"
            },
            {
                "fieldName": "Retweeted",
                "scriptlang": "regex",
                "script": "@([a-zA-Z0-9_]+):",
                "groupNum": 1,
                "context": "All"
            }
        ]
    },
    "structuredAnalysis": {
        "scriptEngine": "JavaScript",
        "description": "$metadata.message__comment",
        "entities": [
            {
                "iterateOver": "HashTag",
                "disambiguated_name": "",
                "type": "HashTag",
                "dimension": "What"
            },
            {
                "iterateOver": "Author",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            },
            {
                "iterateOver": "Correspondent",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            },
            {
                "iterateOver": "Retweeted",
                "disambiguated_name": "",
                "type": "TwitterHandle",
                "dimension": "Who"
            }
        ],
        "associations": [
            {
                "iterateOver": "Correspondent",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "mentions in a tweet",
                "verb_category": "tweets_to",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "Retweeted",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "retweeted",
                "verb_category": "retweeted",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            },
            {
                "iterateOver": "HashTag",
                "entity1": "$metadata.Author",
                "entity2": "$SCRIPT( return _value; )",
                "verb": "tweets about",
                "verb_category": "tweets_about",
                "assoc_type": "Event",
                "time_start": "$SCRIPT( return _doc.publishedDate; )"
            }
        ]
    },
    "url": "http://search.twitter.com/search.rss?q=search%23technology",
    "rss": {
        "waitTimeOverride_ms": 1000
    }
}
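
To see what the regular expressions in this source pick out, they can be tried outside Infinit.e. The sketch below is only an approximation written in Python, not the Infinit.e harvester: the sample page and the function names are hypothetical, and it assumes that the "d" flag corresponds to DOTALL matching and that "H" requests HTML entity decoding. The "Author" field is left out because this sketch only works on the cleansed text.

```python
import html
import re

# Patterns copied from the "unstructuredAnalysis" block above.
CLEANSER = re.compile(
    r"""^.*<meta content="([^"]*)"\s+name=["']description["']\s*/>.*$""",
    re.DOTALL,  # assumption: this is what the "d" flag means
)
META_PATTERNS = {
    # fieldName: (script, groupNum)
    "HashTag": (r"(#[a-zA-Z0-9_]+)", 1),
    "Correspondent": (r"@([a-zA-Z0-9_]+)(?:$|[^:a-zA-Z0-9_])", 1),
    "Retweeted": (r"@([a-zA-Z0-9_]+):", 1),
}
# Verbs copied from the "associations" block: field -> (verb, verb_category).
ASSOC_VERBS = {
    "Correspondent": ("mentions in a tweet", "tweets_to"),
    "Retweeted": ("retweeted", "retweeted"),
    "HashTag": ("tweets about", "tweets_about"),
}


def clean_text(page):
    """Replace the whole page with the description meta tag's content."""
    cleaned = CLEANSER.sub(r"\1", page, count=1)
    # Assumption: the "H" flag asks for HTML entities to be decoded.
    return html.unescape(cleaned)


def extract_meta(text):
    """Collect every match of each pattern (the "All" context)."""
    return {
        name: [m.group(group) for m in re.finditer(pattern, text)]
        for name, (pattern, group) in META_PATTERNS.items()
    }


def build_associations(author, meta):
    """Pair the author with each extracted value, one association per value."""
    return [
        {"entity1": author, "entity2": value, "verb": verb, "verb_category": category}
        for field, (verb, category) in ASSOC_VERBS.items()
        for value in meta.get(field, [])
    ]


# Hypothetical page content, for illustration only.
page = (
    "<html><head>\n"
    '<meta content="RT @alice: great talk by @bob on #technology &amp; #AI"'
    ' name="description" />\n'
    "</head></html>"
)
text = clean_text(page)
meta = extract_meta(text)
associations = build_associations("carol", meta)
print(text)
print(meta)
```

Note how the "Correspondent" pattern deliberately skips handles followed by a colon, so a retweeted user ("@alice:" here) is only captured by the "Retweeted" pattern.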

Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons