AWS Marketplace - Datasift Sandbox Advanced Topics
Suspending and deleting sources
As outlined in the video above, once Datasift sources have been created, as detailed here (video and documentation), then they can be controlled from the source editor.
The following operations can be performed:
- Suspend active sources
- Re-activate suspended sources
- Delete sources
- Modify sources
These are described below.
Sources should be suspended or deleted before terminating or shutting down an instance. It is not catastrophic if you forget: Datasift will time out the push subscription after a few minutes, but you will incur costs during that period.
As a fall-back scenario, you can always view and delete sources from the Datasift Developer Console.
Suspend active sources
To suspend an active source, simply navigate to the source editor tab of the manager, select the desired source from the list on the left, and select the "Disable Source" button (clicking "OK" to confirm publishing the source when prompted).
This will have 2 effects:
- The Datasift push subscription that was generating documents for this source will be suspended within a minute.
- Infinit.e will immediately stop checking for new documents for this source.
Suspended sources can be re-activated as described below.
Re-activate suspended sources
Re-activating a suspended source is the same operation: the source will have an "Enable Source" button in the same place, and clicking on it and confirming will perform the following:
- The Datasift push subscription will be re-activated within a minute
- (the contents of the source description field will be used to generate the filter, see below under "Modify Sources"; if unchanged this will be the JCSDL generated by the original filter of course)
- Infinit.e will immediately start checking for new documents from this source
Delete sources
To delete a source follow the steps from the source editor documentation. Deleting a source will automatically delete the corresponding Datasift push subscription.
Modify sources
Although this is not strictly recommended, sources can be modified in place. This is a consequence of the Infinit.e source storing the JCSDL in its description field, which it uses to reconstruct the filter when a suspended source is re-activated.
If it is necessary to make minor modifications to the JCSDL then this can be done without deleting and re-creating the source. Instead, suspend the source, wait two minutes (to be sure the existing Datasift push subscription has been terminated), modify the JCSDL in the description (this is one of the fields in the source editor), and then activate it (which will automatically publish the change).
If it is requested enough then we will add a graphical editor into the Datasift connector widget so as to be able to modify sources more formally.
Creating aliases and discarding unwanted entities
This video covers 3 topics:
- Using the entity alias builder to "merge" multiple different entities that actually represent the same person/place/etc.
- (ie dragging an entity from the table on the left to the top right table to make it a master alias, then dragging entities to merge from the left table to the bottom right one - those entities will be replaced with the master when the query is refreshed)
- Using the entity alias builder to "discard" unwanted entities.
- (ie dragging entities from the table on the left to the bottom right table, with "DISCARD" selected - those entities will disappear when the query is refreshed)
- Use the type/verb filters in the advanced options to remove entire classes of entities and associations.
These are straightforward and will not be covered again in this section.
It is worth re-iterating one of the key features of Infinit.e aliasing: no data is modified. The aliasing function sits in between the raw data and the API and modifies the objects "in flight". This makes it very flexible: different users on the same platform can have different sets of aliases. In addition it makes it safer to experiment, since none of the raw data purchased from Datasift can be corrupted.
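To make the "in flight" behaviour concrete, here is a minimal Python sketch of the idea. It is not the Infinit.e implementation: the document layout, the "name/Type" entity strings and the `apply_aliases` helper are all assumptions made for illustration. The point it shows is that aliases and discards are applied to a copy at query time, so the stored data is left untouched.

```python
import copy

def apply_aliases(documents, alias_to_master, discard=frozenset()):
    """Return aliased copies of the documents; the stored documents are never modified."""
    result = []
    for doc in documents:
        view = copy.deepcopy(doc)  # work on a copy, not on the raw record
        seen = set()
        entities = []
        for entity in view.get("entities", []):
            # Replace an alias with its master; unknown entities pass through unchanged.
            key = alias_to_master.get(entity, entity)
            if key in discard or key in seen:
                continue
            seen.add(key)
            entities.append(key)
        view["entities"] = entities
        result.append(view)
    return result

# Hypothetical raw data and alias configuration.
raw = [{"title": "post 1", "entities": ["joeOnTwitter/TwitterUser", "spam/Keyword"]}]
aliases = {"joeOnTwitter/TwitterUser": "Joe Blogs/Person"}

aliased = apply_aliases(raw, aliases, discard={"spam/Keyword"})
print(aliased[0]["entities"])  # what this user's query would see
print(raw[0]["entities"])      # the raw data is unchanged
```

Because each call takes its own alias map, two users can pass different maps over the same raw documents and get different views.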
There are a few additional useful topics that are not covered in the video:
- Selecting different alias sets
- Creating new alias masters (that aren't present in the data)
- "Text" aliases
- Manually setting alias configurations
- Positive and negative selection in the entity and association filters
Selecting different alias sets
As was noted in the video, all of the sets of aliases from the different configurations across the different communities that are being searched are combined. If you do not search over a community then any aliases saved in that community are not applied.
As a result, if you place an alias set in a community that contains no data, you can choose at query time whether or not to apply it, simply by including or excluding that community in the source manager.
In fact there is a built in community that lets you accomplish this: the personal community (referred to as the "Personal Sandbox" in the Entity Alias Builder widget), see screenshots below:
The first shows the Entity Alias Builder widget: select the "Personal Sandbox" under the Community dropdown. Ensure that the "Entities By Community" checkbox is unchecked, otherwise only data from that community is used to populate the table on the left.
The second shows the Source Manager: ensure that the "Personal Community" is checked, ie that its aliases will be applied.
When aliases are created in the Personal Sandbox community they do not affect anyone else's searches (and they can be removed simply by unchecking the "Personal Community" in the Source Manager).
One downside to this is that there is no way within the widget of transferring aliases from the sandbox to a real community (eg once you are happy with them). However the sub-section "Manually setting alias configurations" explains how this can be achieved easily using the File Uploader page instead.
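The way alias sets follow the communities included in a search can be sketched as below. This is only an illustration: the community names, the dictionary layout and the `combined_aliases` helper are hypothetical, and the rule that a later community overrides an earlier one on conflict is an assumption rather than documented behaviour.

```python
def combined_aliases(alias_sets_by_community, searched_communities):
    """Merge the alias sets of only those communities that are part of the search."""
    combined = {}
    for community in searched_communities:
        # Assumption: on conflict, the community listed later wins.
        combined.update(alias_sets_by_community.get(community, {}))
    return combined

# Hypothetical alias sets, keyed by community.
alias_sets = {
    "personal": {"joeOnTwitter/TwitterUser": "Joe Blogs/Person"},
    "system": {"blogs_the_blogger/RedditUser": "Joe Blogs/Person"},
}

with_personal = combined_aliases(alias_sets, ["system", "personal"])
without_personal = combined_aliases(alias_sets, ["system"])
print(sorted(with_personal))
print(sorted(without_personal))
```

Leaving the personal community out of the search drops its aliases without deleting them.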
Creating synthetic alias masters
It will often be the case that the desired "master" entity will not actually be present in the data.
For example, if you have a reddit and a twitter author who you believe to be the same person, you will have their handles as TwitterUser and RedditUser respectively (eg "joeOnTwitter" and "blogs_the_blogger"). If you can infer their name from the posts then you might want to make that the master entity, eg "Joe Blogs/Person" with "joeOnTwitter/TwitterUser" and "blogs_the_blogger/RedditUser" as aliases, even though "Joe Blogs" never appears in the content.
This is easily accomplished: in the widget, type the entity name, "/" then the desired entity type in the "Filter/Add Master" text box (see screencap below; ignore the fact that it will temporarily filter the other masters out), and then press the "+" button.
Select the created master entity and then drag aliases from the table on the left, and save as normal.
"Text" aliases
Creating master:alias sets actually does two things:
- Aliases are merged into the master entity during the query
- Queries involving the master entity are expanded to include the master's aliases
Sometimes this expansion is not sufficient. For example, some of the data might not have had any entities extracted at all (eg it presents the data in bullets that the NLP cannot parse). To address this sort of issue, the Entity Alias Builder widget also allows you to add arbitrary text to the expansion, which is converted to full text searches, ie will bring back documents regardless of the entity extraction quality/completeness.
There are two easy ways of doing this (see screenshots below):
- (red) type the desired text into the "Filter/Add Entities" text box on the right (ignore that it filters as you type), and press the "+" button next to it.
- (orange) tick the "create exact text terms for aliases" checkbox and then drag aliases across as normal; with the checkbox ticked, 2 aliases are created: the normal entity and also the text of the entity name.
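The two kinds of expansion described above (entity aliases and free-text terms) can be pictured with the following Python sketch. It is a simplified model, not the platform's query engine: the `expand_master` and `exact_text_term` helpers, and the convention that an entity is written "name/Type", are assumptions for illustration.

```python
def exact_text_term(entity):
    """Strip the "/Type" suffix, leaving the entity name as a text term."""
    return entity.rsplit("/", 1)[0]

def expand_master(master, aliases, text_terms=(), add_exact_text=False):
    """Return the entity terms and full-text terms a query on `master` expands to."""
    entity_terms = [master] + list(aliases)
    texts = list(text_terms)
    if add_exact_text:
        # Mirrors the checkbox: each alias also contributes its name as text.
        texts += [exact_text_term(alias) for alias in aliases]
    return {"entity_terms": entity_terms, "text_terms": texts}

expanded = expand_master(
    "Joe Blogs/Person",
    ["joeOnTwitter/TwitterUser"],
    text_terms=["Joseph Blogs"],  # hypothetical manually added text alias
    add_exact_text=True,
)
print(expanded)
```

The text terms are what allow documents with poor or missing entity extraction to be matched.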
Manually setting alias configurations
As with all Infinit.e GUI functionality, the Entity Alias Builder widget is just an interface to our open API.
Aliases are stored in Infinit.e as JSON shares of type "infinite-entity-alias". Their format is described here. They can be manually uploaded and shared between communities using the File Uploader manager page.
This can be useful for 2 purposes:
- Where there are large numbers of aliases to be generated, it would not be much fun to use the GUI for each one. Instead you can programmatically generate (eg with a script) a JSON file containing the aliases and then upload it.
- This is a bit beyond the scope of this documentation, but you can also create a plugin (eg using the Javascript scripting engine) and then create a share with type "infinite-entity-alias" that points to the custom plugin results (this is described in the File Uploader documentation).
- So as an example if you have a word document that lists lots of social media handle mappings, then you could upload that as a share, then import that share as a source (this is discussed further below under "Importing other sources"), then write a Javascript plugin (see below under "More complex analytics") that parses the document into the right format, and then finally point a share to that!
- This would have the nice feature that it would automatically update itself whenever the document was re-uploaded.
One other use of the File Uploader (/API) is to apply a single alias configuration to multiple communities (see above under "Selecting different alias sets"), by CTRL-clicking on the communities in the File Uploader.
Note that these techniques don't play particularly well with the widget interface at the moment (it assumes a single alias share per community), so it is recommended to pick one method or the other (though the widget interface can always be used as a readonly view of the alias configuration).
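As an illustration of the scripted approach mentioned above, the sketch below turns a list of handle mappings into a JSON document grouped by master. Note that the JSON layout used here (master mapped to a list of aliases) is a placeholder and is NOT the "infinite-entity-alias" format; the real format is the one described in the linked documentation, and the output of a script like this would need to be adapted to it before uploading. The CSV column names are also hypothetical.

```python
import csv
import io
import json
from collections import defaultdict

def build_alias_config(csv_text):
    """Group alias rows by master. Placeholder layout, not the real share format."""
    masters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        masters[row["master"]].append(row["alias"])
    return dict(masters)

# Hypothetical input: one master/alias pair per row.
mappings = """master,alias
Joe Blogs/Person,joeOnTwitter/TwitterUser
Joe Blogs/Person,blogs_the_blogger/RedditUser
"""

config = build_alias_config(mappings)
alias_json = json.dumps(config, indent=2)
print(alias_json)
```

The resulting text would be saved to a file and uploaded through the File Uploader page as a share of type "infinite-entity-alias".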
A few things to note when using aliases in multi-user environments:
- Anyone can create a share that defines an alias, but aliases are only applied when "endorsed". For security purposes, only a user with role "content publisher" or above can endorse shares (see the Community Manager documentation for how to change users' roles; this is also discussed below under "Adding users and communities"). You can endorse via the API, but in general the best way of re-endorsing a share for a community is to unshare it for that community, submit it, and then reshare it (as an administrator or moderator) using the File Uploader.
- The widget does not support multi-user environments that well - once one user has created an alias share for a community, then only that user or a moderator/administrator can modify it.
- The idea is that one user per community should be responsible for the aliases for that community to avoid confusion. We anticipate improving the level of support in the future, as we get more feedback from our operational deployments.
Positive and negative selection in the entity and association filters
The purpose of this sub-section is just to note that the entity type filter or association verb category filter can be specified in one of two ways (from the "Advanced Options" view selectable from the "Options" dropdown on the left of the main GUI):
- Negatively, as a comma-separated list starting with "-"
- (see under "Entity Filter" in the screenshot below: no entities with type "Theme" or "Topic" would be included in the query dataset)
- Positively, as a comma-separated list
- (see under "Association Filter" in the screenshot below: in that case only associations with verb category "retweet" or "mentions" would be included in the query dataset)
Also in the example above, only documents containing at least one association of the positively specified type would be included. The converse is not true: negative filtering does not preclude documents from being retrieved (though of course they may have reduced scores and thus not make it into the top 100).
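The positive/negative convention can be summarised with a small Python sketch. It is an interpretation of the rules above rather than the platform's own parser: it assumes that a single leading "-" negates the whole list, that matching is exact (case-sensitive), and that an empty filter lets everything through.

```python
def parse_filter(spec):
    """Split a filter string into (is_negative, set_of_values)."""
    spec = spec.strip()
    negative = spec.startswith("-")
    if negative:
        spec = spec[1:]
    values = {v.strip() for v in spec.split(",") if v.strip()}
    return negative, values

def passes(value, spec):
    """True if an entity type (or verb category) survives the filter."""
    negative, values = parse_filter(spec)
    if not values:
        return True  # assumption: an empty filter excludes nothing
    return (value in values) != negative

print(passes("Theme", "-Theme,Topic"))         # excluded by the negative filter
print(passes("Person", "-Theme,Topic"))        # kept
print(passes("retweet", "retweet,mentions"))   # kept by the positive filter
print(passes("follows", "retweet,mentions"))   # excluded
```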
Further reading:
- An IKANOW blog post discussing an operational use of aliasing
- File Uploader documentation
- Community Manager documentation
- The alias API
More complex analytics and visualization
VIDEO COMING SOON!
UNDER CONSTRUCTION (Note the functionality is already present in the AMI, the documentation in this section just needs to be completed)
Overview
In previous sections we have seen how the query function returns a subset of the matching documents, together with some basic averaging statistics, and how this is sufficient for many standard data driven investigations.
In other cases, particularly as your activities move from data investigation to data science, it becomes necessary either to apply more complex algorithms (for example graph theory or social network analysis), or to calculate standard statistics in domain-specific ways (a very simple example of this would be aggregating sentiment geographically).
In order to support these sorts of operations, Infinit.e provides the ability to plug-in analytic modules that can run over any subset of the data (including all of it). The general topic of building plug-in modules and scheduling and running them is beyond the scope of this documentation; this section will provide links to the Infinit.e documentation and describe the aspects most relevant to Datasift.
In particular, we have provided 3 sample jobs that illustrate a few different types of analytic and demonstrate how to access the document objects, and interpret the results (in practice the Infinit.e-specific bits like this are very easy, the difficulty is typically in building the algorithms themselves, as it should be).
Note that the examples below assume that you have an instance into which you have loaded some sources and collected some data, using the techniques described in previous sections.
Example 1 - Aggregate sentiment by geo
This sample analytic can be seen in the screenshot above. It simply creates a 10 degree x 10 degree grid and aggregates the sentiment associated via geo-tagged entities with those grid squares.
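The aggregation performed by this example can be sketched in a few lines of Python. This is not the Javascript plug-in that ships with the image; it is a stand-alone illustration of the same idea, and the flat record layout ("lat", "lon", "sentiment") is an assumption.

```python
import math
from collections import defaultdict

GRID_DEGREES = 10

def grid_cell(lat, lon, size=GRID_DEGREES):
    """Return the south-west corner of the grid square containing the point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)

def aggregate_sentiment(records):
    """Sum sentiment and count records per grid square."""
    totals = defaultdict(lambda: {"sentiment": 0.0, "count": 0})
    for rec in records:
        cell = grid_cell(rec["lat"], rec["lon"])
        totals[cell]["sentiment"] += rec["sentiment"]
        totals[cell]["count"] += 1
    return dict(totals)

# Hypothetical geo-tagged records.
records = [
    {"lat": 51.5, "lon": -0.1, "sentiment": 0.5},
    {"lat": 52.0, "lon": -1.0, "sentiment": -0.2},
    {"lat": 40.7, "lon": -74.0, "sentiment": 0.1},
]
grid = aggregate_sentiment(records)
for cell, stats in sorted(grid.items()):
    print(cell, stats)
```

Each output record carries two numeric fields, the summed sentiment and the record count, either of which could be used to score a grid square.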
The sequence of the screenshots below shows how to access the example (starting from the manager webapp, eg press the "MANAGER" link in the top right of the main GUI):
The key screenshot above is the middle one, which shows the "scary-at-first-glance" plugin manager. This manager is documented in more detail here; for now we'll focus on the following components:
- The dropdown menu at the top lists available tasks. Selecting one fills in the rest of the form, as shown.
- The "QuickRun" or "Save and Debug" buttons save the current settings and run the job (on a subset of the records in the latter case). You can see from the status message that (on 8.5K records), the job took ~19s to complete and generated 44 aggregated records.
- (neither of these options reloads the page until the job has completed - to run the job asynchronously you can use "Submit" instead, as discussed below).
- Once the job has run you can view the results in a separate tab by pressing the "Show results" button.
- Note that this new tab is just directly using the Custom - Get Results RESTful API call, and unless your browser is configured to render JSON will not be nicely formatted - we use the Chrome/Firefox extension JSONView.
Regardless of how nicely formatted the JSON is, in practice it is preferable to have a graphical view of the resulting data. The Infinit.e application comes with two widgets for this purpose:
- Custom Viewer - Map: Finds fields in the record that "look like" lat/long points and plots them on a map (MapQuest) colored according to a score defined by a numeric field in the same record (see below for more details)
- The rules for "looks" like lat/long are as follows:
- Is a top-level object called "geo", "geotag", "latlong", or "latlon" AND consists of 2 numeric fields (or strings representing numbers) with names "lat" and "lon" or "latitude" and "longitude"
- Has two top-level numeric fields (or strings representing numbers) with names "lat" and "lon" or "latitude" and "longitude"
- Custom Viewer - Bar Graph: Uses any field from the record as a key, and plots a bar of height defined by a numeric field in the same record (see example 2 for more details).
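The lat/long detection rules listed above for the map widget can be expressed as the following Python sketch. It is a reading of those rules, not the widget's actual code: the function names are hypothetical, and it assumes that extra fields alongside the lat/long pair are tolerated.

```python
GEO_OBJECT_NAMES = ("geo", "geotag", "latlong", "latlon")
FIELD_PAIRS = (("lat", "lon"), ("latitude", "longitude"))

def _as_number(value):
    """Accept numbers, or strings representing numbers; anything else is None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return None
    return None

def _pair_from(obj):
    """Look for a lat/lon or latitude/longitude pair directly inside obj."""
    for lat_name, lon_name in FIELD_PAIRS:
        lat = _as_number(obj.get(lat_name))
        lon = _as_number(obj.get(lon_name))
        if lat is not None and lon is not None:
            return (lat, lon)
    return None

def find_lat_long(record):
    """Return (lat, lon) if the record 'looks like' it has a location, else None."""
    # Rule 1: a top-level object with one of the recognised names.
    for name in GEO_OBJECT_NAMES:
        child = record.get(name)
        if isinstance(child, dict):
            pair = _pair_from(child)
            if pair is not None:
                return pair
    # Rule 2: the pair appears as top-level fields of the record itself.
    return _pair_from(record)

print(find_lat_long({"geo": {"lat": "51.5", "lon": -0.1}, "sentiment": 0.3}))
print(find_lat_long({"latitude": 40, "longitude": -80, "count": 3}))
print(find_lat_long({"location": {"lat": 1, "lon": 2}}))
```

A plug-in whose output records fail both rules (as in the last call) would not be plottable by the map viewer.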
In addition we have successfully used the free jsfiddle service to visualize analytics - see this blog post for more details.
For this example, the "Custom Viewer - Map" is the obvious choice. The screenshot below shows the different options from the header.
- The first dropdown menu selects the plug-in from which to take the results.
- (The selection will fail if the widget cannot detect any fields that look like lat/long. This is discussed below, under "Visualizing the output of plug-ins")
- The second menu allows the user to select which field determines the color of the plotted points (from the palette of green/blue/orange/red).
- (This job has generated two numeric fields, the aggregated sentiment, and the number of records containing sentiment)
- The third menu determines how the score field is converted into a color:
- Linear scale: the lowest score is green, the highest score is red, the buckets are distributed evenly from min to max.
- Log scale: the lowest score is green, the highest score is red, the buckets are distributed logarithmically from min to max.
- Polarity: Red is negative, Green is positive, Blue is neutral (less than 10% of the max in either direction).
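The three color mappings can be sketched as follows. This is an approximation for illustration only: it assumes the four palette colors are used as equal-width buckets in the order green, blue, orange, red, and that the log scale is only applied to strictly positive scores. The widget's exact bucket boundaries are not documented here.

```python
import math

PALETTE = ("green", "blue", "orange", "red")

def linear_color(score, lo, hi):
    """Evenly sized buckets between the lowest and highest score."""
    if hi == lo:
        return PALETTE[0]
    frac = (score - lo) / (hi - lo)
    index = min(int(frac * len(PALETTE)), len(PALETTE) - 1)
    return PALETTE[index]

def log_color(score, lo, hi):
    """Same buckets, but evenly sized in log space (assumes all values > 0)."""
    return linear_color(math.log(score), math.log(lo), math.log(hi))

def polarity_color(score, max_abs):
    """Red is negative, green is positive, blue is within 10% of the max."""
    if abs(score) < 0.1 * max_abs:
        return "blue"
    return "green" if score > 0 else "red"

print(linear_color(30, 1, 10000))   # small values crowd into the first bucket
print(log_color(30, 1, 10000))      # the log scale spreads them out
print(polarity_color(-0.5, 1.0))
```

The log scale is the more useful choice when a field such as a record count spans several orders of magnitude.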
Returning back to the plugin manager, there were two larger text fields:
- "Query" field: together with the "Communities" list, this controls what data is processed
- "User arguments" field: in this case this is actually the code that is run over the data. This is because it is a Javascript plugin, see below under "Creating new Javascript plug-ins".
- (Note that for Hadoop JARs this provides generic configuration parameters, see below under "Creating new Hadoop plug-ins")
In this case we can see that the query is:
{"docGeo":{"$exists":true}} //^^ (ie only process geo-tagged tweets, eg from cellphones)
There are a few points to note here:
- The overall syntax of the query is that of MongoDB
- There are some additional extensions starting with "$": these are documented here, and can be inserted either manually or by pressing the "Add Options" button that is next to the query.
- The document fields that the query is applied against are described here.
- You can view the JSON format of a given document from the "Document Browser" widget, as shown in the screenshot below.
As an example, say you wanted to query on only records that were tagged by datasift with gender "Male". There would be two ways of doing this:
//Option 1, simplest (see datasift documentation for their metadata format):
//TODO
//Option 2, most generic:
//TODO
The advantage of Option 2 would be that if you later imported other sources that had a "Gender" entity but weren't from datasift (eg had a different metadata format), then you would not have to alter your queries.
TODO example - changing the query
TODO map/reduce code
Example 2 - Aggregate sentiment by gender
TODO
Example 3 - Show top co-references
TODO
Creating new Javascript plug-ins
TODO
Creating new Hadoop plug-ins
TODO
Run-time options for plug-ins
TODO
Visualizing the output of plug-ins
TODO more details on how to format fields to be usable
TODO the advanced option
TODO jsfiddle
Further reading:
- Plugin manager documentation
- Information about the built-in Javascript engine
- Developer information about building Java Hadoop plugins
- An IKANOW blog post discussing using jsfiddle to visualize custom analytics
- (contains links to some other relevant blog posts about running analytics on Infinit.e datasets, including this one about doing temporal/sentiment analytics on emails)
Exporting the data (and alerting, and backups)
Using the API
An important feature of the Infinit.e platform is that it wants data to be open: our User Interfaces and applications use our open RESTful API, so any other client can get the same data.
The primary method of getting at the data is via the query API call, and that linked page shows some examples of making the call in javascript and actionscript. In addition we have a beta (ie undocumented!) Java driver here (that we use internally, so is well supported). There are more, general, examples of using the API in different languages here.
In the context of using the query API to support bulk export of the data, this section of our knowledgebase describes how to use the "curl" command line utility (in Linux, or MacOS, or cygwin on Windows platforms) to script getting all the data out.
MongoDB dumps and backups
The underlying data store for Infinit.e is the popular NoSQL database called MongoDB. If you have ssh access to the server then you can use mongodump or mongoexport to get at the data. This image describes the database format.
It is worth noting while discussing Mongo that a nightly backup of the data is generated (at 1am) and stored at "/opt/db-home/" as "db_backup_<<hostname>>_most_recent.tgz". Currently nothing is done with this file (ie it is overwritten nightly). It is recommended that you upload this to S3 regularly (it was not possible to pre-configure this because of AWS restrictions). More details on the backup process are provided here.
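One possible way of automating the recommended upload is sketched below. It is an illustration under several assumptions: that "<<hostname>>" in the file name is the value returned by the machine's hostname lookup, that the AWS command line tool is installed and has credentials for the target bucket, and that the bucket name and key layout (both hypothetical here) suit your account. The script only builds the command; `upload()` would run it, for example from a daily cron job scheduled after the 1am backup.

```python
import socket
import subprocess
from datetime import date

BACKUP_DIR = "/opt/db-home"

def backup_path(hostname=None):
    """Path of the nightly backup file (hostname form is an assumption)."""
    hostname = hostname or socket.gethostname()
    return f"{BACKUP_DIR}/db_backup_{hostname}_most_recent.tgz"

def s3_destination(bucket, hostname, day):
    """Dated key, so each night's backup is kept instead of overwritten."""
    return f"s3://{bucket}/infinite-backups/{hostname}/db_backup_{day.isoformat()}.tgz"

def upload_command(bucket, hostname=None, day=None):
    """Build the 'aws s3 cp' command line without running it."""
    hostname = hostname or socket.gethostname()
    day = day or date.today()
    return ["aws", "s3", "cp", backup_path(hostname), s3_destination(bucket, hostname, day)]

def upload(bucket):
    """Run the copy; raises CalledProcessError if the AWS CLI reports a failure."""
    subprocess.run(upload_command(bucket), check=True)

# "my-backup-bucket" and "myhost" are placeholders.
print(" ".join(upload_command("my-backup-bucket", "myhost", date(2013, 9, 30))))
```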
GUI utilities
The main GUI provides three ways of saving the data or workspace state (see screenshot below):
- "Copy workspace link to clipboard": This copies a (long!) URL to the clipboard that will return you to the current query, community set, and widget set when pasted into a browser.
- Note this URL is too long for some applications to handle (eg gmail unfortunately) - a forthcoming release will use a link shortener.
- "Create PDF for current data view": This will open a new tab containing a PDF that contains screenshots of all the open widgets together with information about the query that was used.
- (Widgets can be programmed to write more detailed information into the PDF, though currently only the Doc Browser widget takes advantage of this.)
- As an alternative the second section of this blog post describes generating per-widget screenshots. This has been very popular for creating "quickview" presentations.
- "Export JSON for current data": This saves a file to local disk containing the JSON returned from the query. The format is described here.
One final export mechanism is another optional part of the widget API, and is currently supported by the following widgets:
- Event graph: exports the graph to GraphML (all edges/nodes except those filtered out)
- Map: export to KML
The screenshot of a widget header below shows the icon that appears when this per-widget export is enabled:
It is also possible to load and save queries (JSON) to disk, from the "Advanced Query Builder".
Alerting using RSS
The final option in the "Options" screenshot above ("Create RSS feed for current query") has been very popular with our users. Selecting that option opens a new tab in the browser containing a (long!) URL that generates an RSS feed for the current query. This feed can be used in RSS readers or alerting tools supporting RSS (an access key is embedded in the URL so no authentication is required on the RSS reader side).
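For simple alerting without a dedicated RSS reader, the feed can be polled from a script. The sketch below assumes the generated feed is standard RSS 2.0 (items with title, link and optionally guid elements), which has not been verified against the platform's output; the sample XML is made up for the example, and `fetch_feed` would be called with the long URL produced by the GUI.

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml):
    """Extract title, link and a unique id from each item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        items.append({
            "title": item.findtext("title", default=""),
            "link": link,
            "guid": item.findtext("guid") or link,  # fall back to the link
        })
    return items

def new_items(rss_xml, seen_guids):
    """Return items not seen on earlier polls, and remember them."""
    fresh = [i for i in parse_rss_items(rss_xml) if i["guid"] not in seen_guids]
    seen_guids.update(i["guid"] for i in fresh)
    return fresh

def fetch_feed(feed_url):
    """Download the feed; feed_url is the URL copied from the new browser tab."""
    with urllib.request.urlopen(feed_url) as response:
        return response.read()

# Made-up sample feed standing in for a real download.
sample = """<rss version="2.0"><channel><title>query</title>
<item><title>A</title><link>http://example.com/a</link><guid>a</guid></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""

seen = set()
first_poll = new_items(sample, seen)
second_poll = new_items(sample, seen)
print([i["title"] for i in first_poll], second_poll)
```

Since the access key is embedded in the URL, the URL itself should be treated as a secret when stored in scripts.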
Importing other sources
Although the focus of this Amazon AWS Marketplace product is to allow users to ingest social media easily from datasift, the entire Infinit.e community platform is included.
Infinit.e is a general purpose tool for harvesting, enriching, and analyzing data of many different types from many different sources, including filesystems and enterprise Intranets, databases, and the Web.
This section provides a brief description of the more general harvesting functionality, and mainly a list of resources for users who want to explore these additional capabilities. Note that Infinit.e provides a rich and complex framework (though with simple shortcuts and templates where possible), and it is beyond the scope of this web page to document it fully.
Overview of harvesting in Infinit.e
Harvesting in Infinit.e is controlled by JSON documents called sources. These sources can be tested by POSTing to the Config - Source - Test REST endpoint, and activated/updated ("published") by POSTing to the Config - Source - Save REST endpoint.
In practice the Source Manager GUI can be used to perform these activities in a more visual, intuitive way. It still requires building the source JSON with limited development support - as can be seen from the documentation here, this requires some javascript skills and some effort. The source manager provides some templates to get up and running on simpler types of ingest, and there is a source gallery with some real world examples of various complexities.
(In addition, our enterprise offering provides a visual "ETL" tool)
Quickly importing sources using the Chrome extension
For pulling public RSS feeds and HTML pages we provide a Chrome extension that gives a "1-click" import capability. This is described here.
Enrichment and entity extraction
One augmentation feature that is provided by Datasift, and is therefore not applied to data imported via other sources, is the entity extraction provided by Salience. The Infinit.e platform provides the following alternative enrichment engines:
- TextRank: Extracts keywords similarly to Salience/Datasift (though less well)
- (connector) AlchemyAPI: You can register for an API key with AlchemyAPI and use their service, which is integrated into Infinit.e. AlchemyAPI have a free tier allowing 1000 transactions/day. This connector pulls named entities only by default, but does include sentiment.
- (connector) AlchemyAPI-metadata: This is another connector to AlchemyAPI, which provides keywords but no entity extraction - it is best used for short/badly formatted sources like twitter.
- (connector) OpenCalais: OpenCalais is an alternative to AlchemyAPI - it focuses on business and politics, and doesn't have sentiment but does provide "business associations" (takeover rumors, that sort of thing). It has a significant free tier, offering 50,000 transactions per day once you register for an API key.
Note that these entity extractors all have different ontologies, eg their types are slightly different and their "disambiguation formats" also ("State" vs "StateOrProvince"; "Paris, Texas" vs "Paris, Texas. USA"), which is not ideal for combining with the built-in Salience augmentation since the same entity will appear in different forms. The entity aliasing function can be used to clear up some of these issues (eg for important entities; or a custom job could generate aliases automatically based on extracted data using some simple heuristics).
(Note that Salience does have a SaaS version, called Semantria, which offers a one time free 10,000 transaction usage. We have not built a connector to Semantria (or used it in any way), though it would be easy enough for us or another developer to do).
(Note also that our enterprise offering provides the same Salience NLP engine that Datasift uses, which would enable external sources to be integrated seamlessly with Datasift's social media)
Adding users and communities
The documentation so far has assumed that the application has run in its default configuration, with:
- 1 user: Admin, infinite_default@ikanow.com
- 1 data community: the system community
- (plus the Admin user's personal community that is used to store temporary alias settings, saved widget configurations, etc)
In practice Infinit.e is a multi-user application that also allows multiple "communities", allowing for separation of data, aliases, custom jobs, access controls etc.
Creating new users can be performed from the Person Manager GUI.
Creating new communities can be performed from the Community Manager GUI.
A few points to note:
- After a community is created, only the owner is initially added. Other users can be added (or removed) by selecting the "Add New Members" (or "Edit/Remove Members") at the bottom of the right pane for the selected community.
- For users to be able to add new sources, they must either be system administrators, or be added as "Content Publisher" or better in their community role (right hand pane after selecting a user from the "Edit/Remove Members" page).
- (In secure mode, see below, users must be administrators to create new sources)
The secure mode of Infinit.e that is needed to guarantee system security in multi-user environments is not enabled by default. Check here for more details.
Updating the software
There are two separate components installed on the Amazon image:
- The Infinit.e community platform (September 2013 release)
- Additional widgets and web services that provide the connection to Datasift plus other functionality such as entity alias manipulation (this is actually a subset of our enterprise offering).
To update the core platform, SSH into the instance and then follow the instructions provided here. There are monthly releases (though you certainly do not have to upgrade that often), described here.
There is currently not an automated way to upgrade the additional components. Should patches be required, we will update the Amazon image and also provide instructions to existing customers on how to obtain the latest binaries and update their existing images.