
UNDER CONSTRUCTION

Suspending and deleting sources

TODO basic outline ("as outlined in the video below, it is simple to suspend and re-activate Infinit.e sources, and this automatically stops/restarts the corresponding Datasift push subscriptions")

TODO mention to delete/suspend sources before shutting system down

TODO something about exporting data?

TODO dev.datasift console...

Creating aliases and discarding unwanted entities

This video covers 3 topics:

  • Using the entity alias builder to "merge" multiple different entities that actually represent the same person/place/etc.
    • (ie dragging an entity from the table on the left to the top right table to make it a master alias, then dragging entities to merge from the left table to the bottom right one - those entities will be replaced with the master when the query is refreshed)
  • Using the entity alias builder to "discard" unwanted entities.
    • (ie dragging entities from the table on the left to the bottom right table, with "DISCARD" selected - those entities will disappear when the query is refreshed) 
  • Using the type/verb filters in the advanced options to remove entire classes of entities and associations.

These are straightforward and will not be covered again in this section.

It is worth re-iterating one of the key features of Infinit.e aliasing: no data is modified. The aliasing function sits in between the raw data and the API and modifies the objects "in flight". This makes it very flexible: different users on the same platform can have different sets of aliases. In addition it makes it safer to experiment, since none of the raw data purchased from Datasift can be corrupted.
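As a rough illustration of "in flight" aliasing, the sketch below applies a merge map and a discard set to copies of some documents at read time, leaving the stored objects untouched. The document layout, the function name and the sample entities are all hypothetical; the real platform stores much richer objects.

```python
from copy import deepcopy

# Hypothetical stand-in for stored documents (not the real Infinit.e schema).
RAW_DOCS = [
    {"title": "post 1", "entities": ["joeOnTwitter/TwitterUser", "London/Place"]},
    {"title": "post 2", "entities": ["blogs_the_blogger/RedditUser", "Spam/Topic"]},
]


def apply_aliases(docs, master_of, discard):
    """Return aliased copies of docs; the input list is never mutated."""
    views = []
    for doc in docs:
        view = deepcopy(doc)
        merged = []
        for entity in doc["entities"]:
            if entity in discard:
                continue  # "DISCARD": the entity disappears from the view
            entity = master_of.get(entity, entity)  # replace alias by its master
            if entity not in merged:
                merged.append(entity)
        view["entities"] = merged
        views.append(view)
    return views
```

Because only the returned copies are changed, two users can pass different `master_of` and `discard` values over the same stored documents.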

There are a few additional useful topics that are not covered in the video:

  • Selecting different alias sets
  • Creating new alias masters (that aren't present in the data)
  • "Text" aliases
  • Manually setting alias configurations
  • Positive and negative selection in the entity and association filters

Selecting different alias sets

As noted in the video, the alias sets from all of the configurations in all of the communities being searched are combined. If you do not search over a community then any aliases saved in that community are not applied.

As a result, if you place an alias set in a community that contains no data, you can choose whether or not to apply it at query time simply by including or excluding that community in the Source Manager.

In fact there is a built-in community that lets you accomplish this: the personal community (referred to as the "Personal Sandbox" in the Entity Alias Builder widget), see screenshots below:

The first shows the Entity Alias Builder widget: select the "Personal Sandbox" under the Community dropdown. Ensure that the "Entities By Community" checkbox is unchecked, otherwise only data from that community is used to populate the table on the left.

The second shows the Source Manager: ensure that the "Personal Community" is checked, ie that its aliases will be applied.

When aliases are created in the Personal Sandbox community they do not affect anyone else's searches (and they can be removed simply by unchecking the "Personal Community" in the Source Manager).

One downside to this is that there is no way within the widget of transferring aliases from the sandbox to a real community (eg once you are happy with them). However the sub-section "Manually setting alias configurations" explains how this can be achieved easily using the File Uploader page instead.
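The query-time selection described in this sub-section can be pictured with the minimal sketch below: only the alias sets of communities included in the search contribute. The data layout, the "Team Community" name and the conflict rule (first community wins) are assumptions made for illustration, not documented platform behavior.

```python
# Hypothetical alias sets, keyed by community name (alias -> master).
ALIAS_SETS = {
    "Personal Community": {"joeOnTwitter/TwitterUser": "Joe Blogs/Person"},
    "Team Community": {"blogs_the_blogger/RedditUser": "Joe Blogs/Person"},
}


def active_aliases(alias_sets, searched_communities):
    """Combine the alias sets of every community included in the search."""
    combined = {}
    for community in searched_communities:
        for alias, master in alias_sets.get(community, {}).items():
            # Assumption: on a conflict the first community listed wins.
            combined.setdefault(alias, master)
    return combined
```

Leaving "Personal Community" out of `searched_communities` corresponds to unchecking it in the Source Manager: its aliases simply drop out of the combined set.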

Creating synthetic alias masters

It will often be the case that the desired "master" entity is not actually present in the data.

For example, if you have a Reddit and a Twitter author who you believe to be the same person, you will have their handles as TwitterUser and RedditUser respectively (eg "joeOnTwitter" and "blogs_the_blogger"). If you can infer their name from the posts then you might want to make that the master entity, eg "Joe Blogs/Person" with "joeOnTwitter/TwitterUser" and "blogs_the_blogger/RedditUser" as aliases, even though "Joe Blogs" never appears in the content.

This is easily accomplished: in the widget, type the entity name, "/" then the desired entity type in the "Filter/Add Master" text box (see screencap below; ignore the fact that it will temporarily filter the other masters out), and then press the "+" button.

Select the created master entity and then drag aliases from the table on the left, and save as normal.

"Text" aliases

Creating master:alias sets actually does two things:

  • Aliases are merged into the master entity during the query
  • Queries involving the master entity are expanded to include the master's aliases

Sometimes this expansion is not sufficient. For example, some of the data might not have had any entities extracted at all (eg it presents the data in bullets that the NLP cannot parse). To address this sort of issue, the Entity Alias Builder widget also allows you to add arbitrary text to the expansion, which is converted to full text searches, ie will bring back documents regardless of the entity extraction quality/completeness.

There are two easy ways of doing this (see screenshots below):

  • (red) type the desired text into the "Filter/Add Entities" text box on the right (ignore that it filters as you type), and press the "+" button next to it.
  • (orange) tick the "create exact text terms for aliases" checkbox and then drag aliases across as normal. With the checkbox ticked, 2 aliases are created: the normal entity and also the text of the entity name.
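The two effects of text aliases can be sketched as follows. This is only an illustration under assumptions: the `(kind, value)` tuples and both function names are hypothetical, and the way the entity name is split from its type (at the last "/") is a guess at what the checkbox does.

```python
def add_alias(alias_list, entity, exact_text=False):
    """Append an entity alias, plus its name as a text term if requested."""
    alias_list.append(("entity", entity))
    if exact_text:
        # Assumption: the entity name is everything before the last "/".
        name = entity.rsplit("/", 1)[0]
        alias_list.append(("text", name))


def expand_query(master, alias_list):
    """Expand a query on the master into entity terms and full text terms."""
    entity_terms = [master] + [v for kind, v in alias_list if kind == "entity"]
    text_terms = [v for kind, v in alias_list if kind == "text"]
    return {"entity_terms": entity_terms, "text_terms": text_terms}
```

The text terms are the part that can match documents with no extracted entities at all, since they do not depend on the entity extraction.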

Manually setting alias configurations

As with all Infinit.e GUI functionality, the Entity Alias Builder widget is just an interface to our open API.

Aliases are stored in Infinit.e as JSON shares of type "infinite-entity-alias". Their format is described here. They can be manually uploaded and shared between communities using the File Uploader manager page.

This can be useful for 2 purposes:

  • Where there are large numbers of aliases to be generated, it would not be much fun to use the GUI for each one. Instead you can programmatically generate (eg with a script) a JSON file containing the aliases and then upload it.
  • This is a bit beyond the scope of this documentation, but you can also create a plugin (eg using the Javascript scripting engine) and then create a share with type "infinite-entity-alias" that points to the custom plugin results (this is described in the File Uploader documentation). 
    • So as an example if you have a word document that lists lots of social media handle mappings, then you could upload that as a share, then import that share as a source (this is discussed further below under "Importing other sources"), then write a Javascript plugin (see below under "More complex analytics") that parses the document into the right format, and then finally point a share to that! 
      • This would have the nice feature that it would automatically update itself whenever the document was re-uploaded.
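As a sketch of the first option (generating the alias file with a script), the Python below turns a two-column CSV of master,alias pairs into a JSON document. The output layout (each master mapped to an object with an "alias" list) is a placeholder assumption, as are the function name and the sample data; check the alias format documentation referenced above for the real field names before uploading anything.

```python
import csv
import io
import json

# Hypothetical input: one "master,alias" pair per line.
SAMPLE_CSV = (
    "Joe Blogs/Person,joeOnTwitter/TwitterUser\n"
    "Joe Blogs/Person,blogs_the_blogger/RedditUser\n"
)


def build_alias_config(csv_text):
    """Group aliases under their master entity (assumed output layout)."""
    masters = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        master, alias = row[0].strip(), row[1].strip()
        masters.setdefault(master, {"alias": []})["alias"].append(alias)
    return masters


def to_json(config):
    """Serialize the configuration ready to be saved and uploaded."""
    return json.dumps(config, indent=2)
```

Calling `to_json(build_alias_config(SAMPLE_CSV))` gives the text to save to a file, which could then be uploaded through the File Uploader as a share of type "infinite-entity-alias".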

One other use of the File Uploader (/API) is to apply a single alias configuration to multiple communities (see above under "Selecting different alias sets"), by CTRL-clicking on the communities in the File Uploader.

Note that these techniques don't play particularly well with the widget interface at the moment (it assumes a single alias share per community), so it is recommended to pick one method or the other (though the widget interface can always be used as a readonly view of the alias configuration).

A few things to note when using aliases in multi-user environments:

  • Anyone can create a share that defines an alias, but aliases are only applied when "endorsed". For security purposes, only a user with role "content publisher" or above can endorse shares (see the Community Manager documentation for how to change users' roles; this is also discussed below under "Adding Communities"). You can endorse via the API, but in general the best way of re-endorsing a share for a community is to unshare it for that community, submit it, and then reshare it (as an administrator or moderator) using the File Uploader.
  • The widget does not support multi-user environments that well - once one user has created an alias share for a community, then only that user or a moderator/administrator can modify it. 
    • The idea is that one user per community should be responsible for the aliases for that community to avoid confusion. We anticipate improving the level of support in the future, as we get more feedback from our operational deployments.

Positive and negative selection in the entity and association filters

The purpose of this sub-section is just to note that the entity type filter or association verb category filter can be specified in one of two ways (from the "Advanced Options" view selectable from the "Options" dropdown on the left of the main GUI):

  • Negatively, as a comma-separated list starting with "-" 
    • (see under "Entity Filter" in the screenshot below: no entities with type "Theme" or "Topic" would be included in the query dataset)
  • Positively, as a comma-separated list 
    • (see under "Association Filter" in the screenshot below: in that case only associations with verb category "retweet" or "mentions" would be included in the query dataset)

Also in the example above, only documents containing at least one association of the positively specified type would be included. The converse is not true: negative filtering does not preclude documents from being retrieved (though of course they may have reduced scores and thus not make it into the top 100).
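The two filter forms can be read as in the sketch below, which decides whether a single entity type or verb category survives a filter string. It is an interpretation of the rules above rather than the platform's implementation: the function names are hypothetical, and exact, case-sensitive matching is an assumption.

```python
def parse_filter(spec):
    """Return (mode, values); a leading "-" makes the whole list negative."""
    spec = spec.strip()
    mode = "include"
    if spec.startswith("-"):
        mode = "exclude"
        spec = spec[1:]
    values = {v.strip() for v in spec.split(",") if v.strip()}
    return mode, values


def keep(value, spec):
    """True if an entity type / verb category passes the filter string."""
    mode, values = parse_filter(spec)
    if not values:
        return True  # an empty filter keeps everything
    if mode == "include":
        return value in values
    return value not in values
```

For example, `keep("Theme", "-Theme,Topic")` is false (negative list) while `keep("retweet", "retweet,mentions")` is true (positive list).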

Further reading:

More complex analytics and visualization

COMING SOON! Note the functionality is already present in the AMI, we just need to write the video and associated documentation.

Further reading:

Exporting the data (and alerting, and backups)

Using the API

An important feature of the Infinit.e platform is that it wants data to be open: our User Interfaces and applications use our open RESTful API, so any other client can get the same data.

The primary method of getting at the data is via the query API call, and that linked page shows some examples of making the call in javascript and actionscript. In addition we have a beta (ie undocumented!) Java driver here (that we use internally, so is well supported). There are more, general, examples of using the API in different languages here.

In the context of using the query API to support bulk export of the data, this section of our knowledgebase describes how to use the "curl" command line utility (in Linux, or MacOS, or cygwin on Windows platforms) to script getting all the data out.

MongoDB dumps and backups

The underlying data store for Infinit.e is the popular NoSQL database called MongoDB. If you have ssh access to the server then you can use mongodump or mongoexport to get at the data. This image describes the database format.

It is worth noting while discussing Mongo that a nightly backup of the data is generated (at 1am) and stored at "/opt/db-home/" as "db_backup_<<hostname>>_most_recent.tgz". Currently nothing is done with this file (ie it is overwritten nightly). It is recommended that you upload this to S3 regularly (it was not possible to pre-configure this because of AWS restrictions). More details on the backup process are provided here.
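Since the backup file is overwritten every night, one simple stopgap is to keep a dated copy before the next run; that dated copy is then the file to push to S3 with whatever tool you prefer. The sketch below is an illustration only: it assumes "<<hostname>>" is replaced by the value that Python's `socket.gethostname()` returns, and the function names and archive directory are hypothetical.

```python
import shutil
import socket
from datetime import date
from pathlib import Path

BACKUP_DIR = Path("/opt/db-home")


def backup_path(hostname=None, backup_dir=BACKUP_DIR):
    """Build the nightly backup path (assumes <<hostname>> = gethostname())."""
    hostname = hostname or socket.gethostname()
    return Path(backup_dir) / f"db_backup_{hostname}_most_recent.tgz"


def keep_dated_copy(src, dest_dir, day=None):
    """Copy the backup to dest_dir with the date appended to its name."""
    day = day or date.today()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{Path(src).stem}_{day.isoformat()}.tgz"
    shutil.copy2(src, dest)  # copy2 also preserves the file timestamps
    return dest
```

Run daily (for example from cron, some time after the 1am backup), `keep_dated_copy(backup_path(), "/some/archive/dir")` would leave one archive per day instead of a single file that is replaced each night.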

GUI utilities

The main GUI provides three ways of saving the data or workspace state (see screenshot below):

  • "Copy workspace link to clipboard": This copies a (long!) URL to the clipboard that will return you to the current query, community set, and widget set when pasted into a browser. 
    • Note this URL is too long for some applications to handle (eg gmail unfortunately) - a forthcoming release will use a link shortener.
  • "Create PDF for current data view": This will open a new tab containing a PDF that contains screenshots of all the open widgets together with information about the query that was used.
    • (Widgets can be programmed to write more detailed information into the PDF, though currently only the Doc Browser widget takes advantage of this.)
    • As an alternative the second section of this blog post describes generating per-widget screenshots. This has been very popular for creating "quickview" presentations.
  • "Export JSON for current data": This saves a file to local disk containing the JSON returned from the query. The format is described here.

TODO widget exports

Alerting using RSS

The final option in the "Options" screenshot above ("Create RSS feed for current query") has been very popular with our users.

TODO

Further reading:

Importing other sources

TODO complex subject, lots of documentation (gui coming soon), this section just highlights a few of the most relevant possibilities to datasift

TODO something about entity generation (salience not available via public API, though it is via our enterprise edition - so there will be a disambiguation problem between different entity formats and types - can address some of this via alias builder)

Adding communities

TODO

Updating the software

TODO 2 methods, link to OSS also mention share
