Overview

The Source Editor GUI is not currently compatible with Internet Explorer (IE). It is compatible with Chrome, Firefox, and Safari.

Source management is intrinsically a complex process (particularly when taking advantage of Infinit.e's customization engine). 

The Infinit.e.Manager Sources page provides a simple interface for adding and testing new sources, saving templates for future sources, and managing existing ones. Future iterations of the tool will provide actual support for the difficult bits of source writing, such as writing Javascript and regexes.

Note that the grey lines can be dragged to increase or decrease the size of the editor window.

Create New Source

To create a new source, click the New Source button in the upper right-hand corner of the page. The Infinit.e.Manager application will forward you to the Create New Source page shown below:

When copying an existing source into the New Source window, that existing source should be "scrubbed" first (middle right, "Scrub" button) - otherwise the presence of the "_id"/"key" fields will mean that the old source is modified instead of a new one being created.

Edit Existing Sources

To edit an existing source, click the source's name in the list of Sources on the left-hand side of the page.

Note: There are three types of documents listed in the Sources list: published sources, shares that are editable copies of published sources, and shares that have not yet been published as sources. Shares are denoted by "(*)".

If copying the logic of an existing source, it is recommended to first "scrub" it to remove any server-added fields (particularly "_id" and "key", which can overwrite the existing source).

Note that "private" sources ("isPublic":"false") do not have all fields displayed unless you are an admin, community moderator, or the source owner. If you are none of these, testing such a source (or using it as the basis for a new source) is likely to fail. Contact the source owner to get a full copy.

There are 3 tabs that can be edited:

  • "JSON" - this is the full source including all fields
  • "JS-U" - the Unstructured Analysis Module allows content to be transformed by "scriptlets" (xpath/regex/javascript) into document metadata. This view shows only the javascript maintained in "unstructuredAnalysis.script". All of the logic can be written here as separate functions, so that the scriptlets themselves are simple calls to these functions, which maximizes the maintainability of the code in the source.
  • "JS-S" - the Structured Analysis Module allows content to be transformed by "scriptlets" (xpath/regex/javascript) into document metadata. This view shows only the javascript maintained in "structuredAnalysis.script". All of the logic can be written here as separate functions, so that the scriptlets themselves are simple calls to these functions, which maximizes the maintainability of the code in the source.
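
To illustrate the "functions in the script, one-line calls in the scriptlets" approach described above, here is a minimal sketch. The function name extractEmails, the regular expression, and the variable name "text" are all hypothetical; check the platform documentation for the variables that are actually exposed to scriptlets.

```javascript
// Hypothetical helper of the kind that could live in "unstructuredAnalysis.script".
// The name, the argument, and the regex are illustrative assumptions only.

// Return every distinct email-like string found in the supplied text, lower-cased.
function extractEmails(text) {
  var matches = text.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g) || [];
  var seen = {};
  var unique = [];
  for (var i = 0; i < matches.length; i++) {
    var email = matches[i].toLowerCase();
    if (!seen[email]) {
      seen[email] = true;
      unique.push(email);
    }
  }
  return unique;
}

// A scriptlet would then only need a one-line call such as:
//   extractEmails(text)
// where "text" stands in for whatever variable the harvester provides (assumption).
var sample = "Contact alice@example.com or Bob@Example.com, then alice@example.com again.";
console.log(extractEmails(sample));
```

Keeping the logic in one named function means a fix to the regex is made once, rather than in every scriptlet that uses it.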

Validating the Source Format

To check that the source JSON is valid at any time, select the "Check Format" button (middle right).

If run on the "JS-U" or "JS-S" tabs, the javascript in "unstructuredAnalysis.script" or "structuredAnalysis.script" respectively is checked instead.

This validation is run automatically before the source is saved, tested, enabled/disabled, or published, and also when switching between the JSON/JS tabs. Note that the automatic validation does not run on the javascript, only on the JSON.
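
The two kinds of check described above can be approximated outside the GUI. The sketch below is not the GUI's own implementation: the function names checkJsonFormat and checkScriptSyntax are hypothetical, and the checks only cover well-formedness (valid JSON, parseable javascript), not whether the fields make sense to the harvester.

```javascript
// Check that a string is well-formed JSON. Returns { valid, error }.
function checkJsonFormat(jsonText) {
  try {
    JSON.parse(jsonText);
    return { valid: true, error: null };
  } catch (e) {
    return { valid: false, error: e.message };
  }
}

// Check that a string parses as javascript without running it.
// new Function() compiles the body and throws a SyntaxError if it is malformed;
// the compiled function is never called, so the script itself is not executed.
function checkScriptSyntax(scriptText) {
  try {
    new Function(scriptText);
    return { valid: true, error: null };
  } catch (e) {
    return { valid: false, error: e.message };
  }
}

console.log(checkJsonFormat('{"title": "Example"}').valid);
console.log(checkScriptSyntax("function f( {").valid);
```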

Testing a Source

Once a first draft of a source is complete, it should be tested to see which documents it extracts and how it enriches them with additional metadata, entities, associations, etc.

Two parameters can be set for testing the sources:

  • "Full text": by default, the full text of a document is not returned (it can be quite long). For testing text extractors (eg "boilerpipe" vs "none" vs "AlchemyAPI"), or for testing "unstructured analysis" transformations, the text may be useful or even essential; in these cases, enable this check box.
  • "Number of documents": the maximum number of documents that will be enriched and returned. The smaller the number of documents, the quicker the API call returns.

Click on the Test Source button to start the testing process. Note that it can take a few minutes for the processed documents to be returned. Temporarily setting the "waitTimeOverride_ms" field of the "rss" object to be 1000 (ie 1s) can be useful during the debug stages.
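
As a small illustration of the debugging tip above, the following sketch sets "waitTimeOverride_ms" on a copy of a source. It assumes the "rss" object sits directly on the source JSON; the function name withDebugWaitTime and the sample source are hypothetical.

```javascript
// Return a copy of the source with rss.waitTimeOverride_ms set for debugging.
// Assumption: "rss" is an object directly on the source JSON.
function withDebugWaitTime(source, waitMs) {
  var copy = JSON.parse(JSON.stringify(source)); // deep copy; a source is plain JSON
  copy.rss = copy.rss || {};
  copy.rss.waitTimeOverride_ms = waitMs;
  return copy;
}

var draft = { title: "Example feed", rss: {} }; // made-up sample source
var debugDraft = withDebugWaitTime(draft, 1000); // 1000 ms, ie 1s
console.log(JSON.stringify(debugDraft.rss));
```

Working on a copy makes it easy to keep the original value for the version that is eventually published.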

Note that the first time you test a source, you are likely to get an error accompanied by a request from the browser to allow/deny the window from launching pop ups. Select "Allow always" or the equivalent, refresh the browser if necessary, and press the test button again.

As can be seen from the above screen capture, the pop up contains 2 text elements:

  • A status message including the number of documents returned, any errors or warnings encountered etc.
  • The JSON of the extracted and enriched documents, if the test was successful.
    • Future versions of the tool will allow the documents to be viewed in widgets in the main GUI, providing a much easier interface to validate the source.

Based on the test results, the source can then be refined until the desired functionality is obtained.

Saving sources as templates

The Sources page allows you to save sources as templates to streamline the process of creating new sources that share common attributes. To save a source as a template, click on the Save Source as Template button. Note: Your new template will be available in the Source Templates drop down on the Create New Source page.

Note that templates are saved into your personal community only, but you can see any templates shared across any of the communities to which you belong. To share a template you have created with one of your communities, use the file uploader.

Before turning a source into a template, that existing source should be "scrubbed" first (middle right, "Scrub" button) - otherwise the presence of the "_id"/"key" fields will mean that the old source is modified instead of a new one being created.

Publishing sources

Sources need to be "published" to the system in order for the Infinit.e Core Server to begin harvesting. Once you have created and tested a source, or edited and tested an existing source, you can publish the source by clicking on the Publish Source button.

If you submit (publish) a new source to a community you do not own, then it is initially added in a "pending" state. An email is sent to the community owners and moderators, and they are given the option of allowing the source or not.

Editing a source that has previously been approved may not require further moderation if only display fields have been modified; otherwise the source is suspended pending approval, as above.

Note that once a source has been published, its status can be monitored from "<ROOT URL>/InfiniteSourceMonitor.html" (eg http://infinite.ikanow.com/InfiniteSourceMonitor.html), provided you are logged into the main GUI or source builder.

After publishing a share, you should get an alert saying that the source has been published and the working copy "share" has been deleted. If you don't get this alert, then it is likely that an internal configuration error has occurred - contact your system administrator to get it fixed.

"Scrubbing" sources

As discussed in several places above, scrubbing removes all fields added by the server after publishing, retaining only the actual ingest logic. It should be used before copying a source or turning it into a template.

If you accidentally scrub a source and then save it, you can get back to the original published source by deleting the share and then re-selecting the source.
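
The effect of scrubbing can be sketched as follows. This only removes the two fields named on this page ("_id" and "key"); the real "Scrub" button may remove other server-added fields as well, and the function name scrubSource and the sample values are hypothetical.

```javascript
// Fields this page names as server-added. The real "Scrub" button may remove more.
var SERVER_ADDED_FIELDS = ["_id", "key"];

// Return a shallow copy of the source without the server-added top-level fields.
function scrubSource(source) {
  var scrubbed = {};
  for (var field in source) {
    if (Object.prototype.hasOwnProperty.call(source, field) &&
        SERVER_ADDED_FIELDS.indexOf(field) === -1) {
      scrubbed[field] = source[field];
    }
  }
  return scrubbed;
}

// Made-up sample source; the "_id" and "key" values are placeholders.
var published = { _id: "placeholder-id", key: "placeholder-key", title: "Example feed" };
console.log(JSON.stringify(scrubSource(published)));
```

Because the copy no longer carries "_id" or "key", saving it cannot overwrite the source it was copied from.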

Enabling/disabling sources

Sources can be disabled by setting their "searchCycle_secs" to a negative number. The enable/disable button just automates that process.

Note that this button only affects the un-published version of the source (ie the corresponding share). The source should be published to apply the change.
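
The following sketch shows one way the sign convention could be applied by hand. Only "a negative searchCycle_secs means disabled" comes from this page; negating the existing value (so that the original period can be restored later) and using -1 when no period is set are assumptions, as are the function names.

```javascript
// A source counts as disabled only if searchCycle_secs is a negative number.
function isEnabled(source) {
  return !(typeof source.searchCycle_secs === "number" && source.searchCycle_secs < 0);
}

// Return a copy of the source with the requested enabled/disabled state.
function setEnabled(source, enabled) {
  var copy = JSON.parse(JSON.stringify(source));
  var cycle = copy.searchCycle_secs;
  if (enabled) {
    if (typeof cycle === "number" && cycle < 0) {
      copy.searchCycle_secs = -cycle; // assumption: restore the earlier period
    }
  } else if (typeof cycle === "number" && cycle > 0) {
    copy.searchCycle_secs = -cycle; // assumption: keep the magnitude for later
  } else if (typeof cycle !== "number" || cycle === 0) {
    copy.searchCycle_secs = -1; // assumption: any negative value disables
  }
  return copy;
}

var disabled = setEnabled({ title: "Example feed", searchCycle_secs: 3600 }, false);
console.log(disabled.searchCycle_secs, isEnabled(disabled));
```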

Deleting a source's documents

This button will leave the source intact but will delete all of the documents harvested so far. It can only be performed on sources you own unless you are a community moderator or an admin.

This function should be used with caution. Also, for sources with many documents, this operation may take some time (eg 10 minutes for 500,000 documents).

Deleting sources or shares

To delete a source or share, click on the "X" button next to the source name in the Sources list:

  • Share: If the item you are deleting is a share, the system will ask you to confirm: "Do you really wish to delete the share: XXXXXXXXX (*)?". What happens next depends on whether the share has been published:
    • If the share has been published the share is deleted but the published source is left alone and will appear in the Sources list.
    • If the share has not been published the share will simply be deleted and will disappear from the Sources list.
       
  • Source: If the item you are deleting is a source, the system will ask you to confirm: "Do you really wish to delete the source: XXXXXXXXX?". If you confirm the deletion, the system will then delete the published source and all harvested documents associated with it.

Note that deleting a published source will also delete all documents associated with that source. In some cases those documents will not be retrievable (eg old URLs from an RSS feed), so this should be used with caution. Also, for sources with many documents, this operation may take some time (eg 10 minutes for 500,000 documents).

Monitoring sources

There is a graphical utility to monitor sources available from the home page (Source Monitor link). It opens in a new tab and is pictured below. It is not possible to change any source information from this GUI.

A subset of this information can also be accessed from the Source Manager dialog of the main GUI.

The colors have the following meanings:

  • Green: successfully harvested ("success")
  • Blue: in progress ("in_progress")
    • (or has partially harvested, "success_iteration" - means that the most recent harvest cycle completed but not all available documents were harvested because of document/cycle limitations)
  • Red: harvested with errors ("error")
  • Yellow: not yet seen by a harvester, or currently unapproved.

Suspended sources retain their color status but have "[SUSPENDED]" prepended to their title.
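
The color rules above can be summarised in code. This is a sketch of the mapping as described on this page, not the monitor's actual implementation: the function names are hypothetical, treating "unapproved" as taking precedence over the harvest status is an assumption, and so is the space after "[SUSPENDED]".

```javascript
// Harvest status strings quoted on this page, mapped to their display colors.
var STATUS_COLORS = {
  success: "green",
  in_progress: "blue",
  success_iteration: "blue", // partially harvested, shown like "in progress"
  error: "red"
};

// Yellow covers both "not yet seen by a harvester" (no known status) and
// "currently unapproved" (assumed here to override any harvest status).
function monitorColor(harvestStatus, isApproved) {
  if (!isApproved) {
    return "yellow";
  }
  if (Object.prototype.hasOwnProperty.call(STATUS_COLORS, harvestStatus)) {
    return STATUS_COLORS[harvestStatus];
  }
  return "yellow";
}

// Suspended sources keep their color but get a prefix on their title.
function monitorTitle(title, isSuspended) {
  return isSuspended ? "[SUSPENDED] " + title : title;
}

console.log(monitorColor("success_iteration", true), monitorTitle("Example feed", true));
```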
