Overview

The Source Editor GUI is not currently compatible with IE. It is compatible with Chrome, Firefox, and Safari.

Source management is intrinsically a complex process (particularly when taking advantage of Infinit.e's customization engine). 

The Sources page of the Infinit.e Source Manager provides a simple interface for adding and testing new sources, saving templates for future sources, and managing existing ones. Future iterations of the tool will provide actual support for the difficult parts of source writing, such as writing JavaScript and regexes.

Note that the grey lines can be dragged to increase or decrease the size of the editor window.

The "Filter" text box will by default search the source titles, but it can also search the following fields:

  • URL: type "url:<url fragment>" 
    • (note that URLs from the processing pipeline or feed configuration objects won't be searched unless you are currently editing them).
  • Community IDs: type "community:<community-id>"
  • ID: type "id:<source _id field>"
  • Tags: type "tags:<tag fragment>"
  • key, title, description, mediaType and extractType: use the same syntax, "fieldName:<field value fragment>"
    • (note title is the default if no prefix is specified)
  • Suspended sources:
    • "suspended:true" to see manually suspended tasks
    • "fullQuarantined:true" to see unauthorized sources (this can happen automatically because they error too much, or if they are disabled by an administrator)
    • "tempQuarantined:true" to see sources quarantined for the day (because of a possibly transient source error)
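The prefix convention above can be illustrated with a minimal sketch. The function name and the parsing logic are assumptions made for illustration only (they are not the Source Manager's actual implementation); the field names are the ones listed above.

```python
# Hypothetical helper: the field names come from the list above, but the
# parsing logic is an assumption, not the Source Manager's actual code.
SEARCHABLE_FIELDS = {
    "url", "community", "id", "tags", "key", "title", "description",
    "mediaType", "extractType", "suspended", "fullQuarantined",
    "tempQuarantined",
}


def parse_filter(text):
    """Split a filter string into (field, value); default to the title."""
    prefix, sep, rest = text.partition(":")
    if sep and prefix in SEARCHABLE_FIELDS:
        return prefix, rest
    # No recognized prefix: treat the whole string as a title fragment.
    return "title", text
```

For example, "tags:finance" would be read as a search on the tags field, while "finance" alone would be read as a title search.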

Create New Source

To create a new source click on the New Source button in the upper right hand corner of the page. The Infinit.e.Manager application will forward you to the Create New Source page shown below:

You have 2 options for creating a new source here:

  1. You can use an empty template: fill out the title/description/tags/community fields and click Create Source. You can then build a source from scratch, or paste one in, on the next page.
  2. You can select a template to get you started as shown below

To create a source from a template, choose a dropdown option from the "Create a New Source" box.

Once you have selected a template, fill out the title/description/tags/community fields and click Create Source. You will be able to modify the source on the next page before it starts running.

Once you've created your source you can follow these instructions for how to use the source builder.

When copying an existing source into the New Source window, that existing source should be "scrubbed" first (middle right, "Scrub" button) - otherwise the presence of the "_id"/"key" fields will mean that the old source is modified instead of a new one being created.

 

Edit Existing Sources

To edit an existing source click on the source's name in the list of Sources found on the left hand side of the page.

Note: There are three types of documents listed in the Sources list: published sources, shares that are editable copies of published sources, and shares that have not yet been published as sources. Shares are denoted by "(*)".

If copying the logic of an existing source, it is recommended to first "scrub" it to remove any server-added fields (particularly "_id" and "key", which can overwrite the existing source).

Note that "private" sources ("isPublic":"false") do not have all fields displayed unless you are an admin, community moderator, or the source owner. In this case, it is likely that testing them (or using them as the basis for a new source) will fail. Contact the source owner to get a full copy.

There are 3 tabs that can be edited:

  • "JSON" - this is the full source including all fields
  • New source pipeline:
    • "JS" - The global script that all other elements can use - all of the logic can be written in here as separate functions, and then the scriptlets in other pipeline elements can be simple calls to these functions, to maximize the maintainability of the code in the source.
    • "LS" - If generating Logstash sources, you can write the configuration directly in here
    • "UI" (currently only supported in the enterprise build) - brings up the source builder GUI
  • Legacy sources:
    • "JS-U" - the Unstructured Analysis Module allows content to be transformed by "scriptlets" (xpath/regex/javascript) into document metadata. This view shows only the javascript maintained in "unstructuredAnalysis.script" - all of the logic can be written in here as separate functions, and then the scriptlets can be simple calls to these functions, to maximize the maintainability of the code in the source.
    • "JS-S" - the Structured Analysis Module allows content to be transformed by "scriptlets" (xpath/regex/javascript) into document metadata. This view shows only the javascript maintained in "structuredAnalysis.script" - all of the logic can be written in here as separate functions, and then the scriptlets can be simple calls to these functions, to maximize the maintainability of the code in the source.
    • "JS-RSS" - (only visible if the "searchConfig" field of "rss" is specified; use "Save Source" to reset visibility if it changes during editing) the Feed Harvester can use javascript (and xpath) to create multiple documents out of a single received feed. This view shows only the javascript maintained in "rss.searchConfig.globals" - all of the logic can be written in here as separate functions, and then the scriptlets can be simple calls to these functions, to maximize the maintainability of the code in the source.

By default only you can see your temporary copies of sources (so for example you cannot share links to sources being edited). You can use the file uploader to share them in either read-only or read-write mode:

  • Go to the file uploader, filter on JSON type "source", and select your source
  • Share with a community to which your collaborator belongs (and in which they are at least a "content publisher" if you want them to make changes)
  • If you want to give your collaborator the ability to make changes, set the access to read-write
    • Warning - there is no automatic synchronization, so if you both make changes at the same time work can be lost

Validating the Source Format

To check that the Source JSON format is valid at any time, select the "Check Format" button (middle right).

If run on the "JS-U" or "JS-S" tabs then the javascript in "unstructuredAnalysis.script" or "structuredAnalysis.script" (respectively) is checked instead.

This validation is run automatically before the source is saved, tested, enabled/disabled, or published, and when switching between the JSON/JS tabs. Note that the automatic validation does not run on the javascript, only on the JSON.
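As a rough illustration of what a JSON format check involves, the sketch below tests only that the text parses as a JSON object. The function name is hypothetical, and the real "Check Format" button may apply further checks that are not described here.

```python
import json


def check_format(source_text):
    """Return (True, None) if the text parses as a JSON object,
    otherwise (False, message)."""
    try:
        parsed = json.loads(source_text)
    except json.JSONDecodeError as err:
        # Report where the parser gave up, to help locate the mistake.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(parsed, dict):
        return False, "top-level value is not a JSON object"
    return True, None
```

Running a check like this locally before pasting a source into the editor can catch missing commas or brackets early.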

Testing a Source

Once a first draft of a source is complete it should be tested to see which documents it extracts and how it enriches the documents with additional metadata, entities, and associations, etc.

Two parameters can be set for testing the sources:

  • "Full text": by default, the full text of a document is not returned (it can be quite long). However, for testing text extractors (eg "boilerpipe" vs "none" vs "AlchemyAPI"), or for testing "unstructured analysis" transformations, the text may be useful or even essential; in these cases, enable this check box.
  • "Number of documents": the maximum number of documents that will be enriched and returned. The smaller the number of documents, the quicker the API call returns.

Click on the Test Source button to start the testing process. Note that it can take a few minutes for the processed documents to be returned. Temporarily setting the "waitTimeOverride_ms" field of the "rss" object to be 1000 (ie 1s) can be useful during the debug stages.
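The temporary override mentioned above can be sketched as follows, treating the source JSON as a Python dictionary. The helper name is hypothetical; only the "rss" object and its "waitTimeOverride_ms" field come from the text, and sources without an "rss" object are left unchanged.

```python
import copy


def with_debug_wait(source, wait_ms=1000):
    """Return a copy of the source with rss.waitTimeOverride_ms set.

    The original dictionary is not modified, so the override is easy
    to discard once debugging is finished.
    """
    debug_source = copy.deepcopy(source)
    if "rss" in debug_source:
        debug_source["rss"]["waitTimeOverride_ms"] = wait_ms
    return debug_source
```

Working on a copy is a deliberate choice here: the override is intended for the debug stages only, so the untouched original can be used when the source is published.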

Note that the first time you test a source, you are likely to get an error accompanied by a request from the browser to allow/deny the window from launching pop ups. Select "Allow always" or the equivalent, refresh the browser if necessary, and press the test button again.

As can be seen from the above screen capture, the pop up contains 2 text elements:

  • A status message including the number of documents returned, any errors or warnings encountered etc.
  • The JSON of the extracted and enriched documents, if the test was successful.
    • Future versions of the tool will allow the documents to be viewed in widgets in the main GUI, providing a much easier interface to validate the source.

Based off the results from testing, the source can then be refined until the desired functionality is obtained.

Saving sources as templates

The Sources page allows you to save sources as templates to streamline the process of creating new sources that share common attributes. To save a source as a template click on the Save Source as Template button. Note: Your new template will be available in the Source Templates drop down on the Create New Source page.

The template is shared with the source's community - if you don't want to share with anybody else then set the dropdown to be your personal community before saving it as a template.

Publishing sources

Sources need to be "published" to the system in order for the Infinit.e Core Server to begin harvesting. Once you have created and tested a source, or edited and tested an existing source, you can publish the source by clicking on the Publish Source button.

If you submit (publish) a new source to a community you do not own, then it is initially added in a "pending" state. An email is sent to the community owners and moderators, and they are given the option of allowing the source or not.

Editing sources that have previously been approved may not require further moderation, if only display fields have been modified; otherwise it is suspended pending approval as above.

Note that once a source has been published, its status can be monitored from "<ROOT URL>/InfiniteSourceMonitor.html" (eg http://infinite.ikanow.com/InfiniteSourceMonitor.html), provided you are logged into the main GUI or source builder.

After publishing a share, you should get an alert saying that the source has been published and the working copy "share" has been deleted. If you don't get this alert, then it is likely that an internal configuration error has occurred - contact your system administrator to get it fixed.

"Reverting" sources

The "revert" button in the top right hand corner of the code editor, for published sources, overwrites the existing temporary share with the current version of the source in the database. This can be useful for 2 reasons:

  • To discard unwanted manual changes 
  • (If there are no changes) to update the "harvest" status block

"Scrubbing" sources

As discussed above in a few places, this removes all fields added by the server after publishing, just retaining the actual ingest logic. It should be used before copying/templating.

If you accidentally scrub the source and then save it then you can get back to the original published source by just deleting the share and then re-selecting the source.
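To make the effect of scrubbing concrete, here is a minimal sketch that treats the source JSON as a Python dictionary. Only "_id" and "key" are named in this page as server-added fields; the real "Scrub" button may remove others, so the field list and function name below are assumptions.

```python
import copy

# Only "_id" and "key" are named in the text; a real scrub may remove
# other server-added fields as well.
SERVER_ADDED_FIELDS = ("_id", "key")


def scrub(source):
    """Return a copy of the source without the server-added fields."""
    scrubbed = copy.deepcopy(source)
    for field in SERVER_ADDED_FIELDS:
        scrubbed.pop(field, None)  # ignore fields that are already absent
    return scrubbed
```

Without "_id" and "key", a pasted copy no longer points at the existing source, which is what allows a new source to be created instead of the old one being modified.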

Enabling/disabling sources

Sources can be disabled by setting their "searchCycle_secs" to a negative number. This button just automates that process.

Note that this button only affects the un-published version of the source (ie the corresponding share). The source should be published to apply the change - you are automatically prompted for this.
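The sign convention on "searchCycle_secs" can be sketched as below. This is an illustration under stated assumptions, not the button's actual logic: the helper names are hypothetical, flipping the sign (so that the original cycle length is preserved) is an assumption, and -1 is an arbitrary placeholder when no cycle is configured.

```python
def set_enabled(source, enabled):
    """Return a copy with searchCycle_secs positive (enabled) or
    negative (disabled)."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if enabled:
        if cycle is not None:
            updated["searchCycle_secs"] = abs(cycle)
    else:
        # With no (or a zero) cycle configured, -1 is an arbitrary
        # negative placeholder.
        updated["searchCycle_secs"] = -abs(cycle) if cycle else -1
    return updated


def is_disabled(source):
    """A source counts as disabled when searchCycle_secs is negative."""
    return source.get("searchCycle_secs", 0) < 0
```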

Deleting a source's documents

This button will leave the source intact but will delete all of the documents harvested so far. It can only be performed on sources you own unless you are a community moderator or an admin.

Obviously, this function should be used with caution. Also for sources with many documents, this operation may take some time (eg 10 minutes for 500,000 documents).

Deleting sources or shares

To delete a source or share click on the "X" button next to the source name in the Sources list:

  • Share: If the item you are deleting is a Share the system will ask you to confirm: "Do you really wish to delete the share: XXXXXXXXX (*)?". What happens next depends on whether or not the share has been published:
    • If the share has been published the share is deleted but the published source is left alone and will appear in the Sources list.
    • If the share has not been published the share will simply be deleted and will disappear from the Sources list.
       
  • Source: If the item you are deleting is a Source the system will ask you to confirm: "Do you really wish to delete the source: XXXXXXXXX?". If you confirm the deletion the system will then delete the published source and all harvested documents associated with it.

Note that deleting a published source will also delete all documents associated with that source. In some cases those documents will not be retrievable (eg old URLs from an RSS feed). This should therefore be used with caution. Also for sources with many documents, this operation may take some time (eg 10 minutes for 500,000 documents).

Monitoring sources

There is a graphical utility to monitor sources available from the home page (Source Monitor link). It opens in a new tab and is pictured below. It is not possible to change any source information from this GUI.

A subset of this information can also be accessed from the Source Manager dialog of the main GUI.

The colors have the following meanings:

  • Green: successfully harvested ("success")
  • Blue: in progress ("in_progress")
    • (or has partially harvested, "success_iteration" - means that the most recent harvest cycle completed but not all available documents were harvested because of document/cycle limitations)
  • Red: harvested with errors ("error")
  • Yellow: not yet seen by a harvester, or currently unapproved.
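The color scheme above amounts to a simple lookup, sketched here. The status strings are the ones quoted in the list; the function name, the "approved" parameter, and the fallback to yellow for any unrecognized status are assumptions made for illustration.

```python
STATUS_COLORS = {
    "success": "green",
    "in_progress": "blue",
    "success_iteration": "blue",  # partially harvested
    "error": "red",
}


def status_color(harvest_status, approved=True):
    """Map a harvest status string to the color shown in the monitor."""
    if not approved or harvest_status is None:
        # Not yet seen by a harvester, or currently unapproved.
        return "yellow"
    # Assumption: unrecognized statuses are also shown as yellow.
    return STATUS_COLORS.get(harvest_status, "yellow")
```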

If the colored "Status" column contains numbers, eg "0/20" then it is referring to the (beta) distributed source function - the left number is the number of "in progress" threads, and the right number is the total number of threads.
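If the "Status" column is read by a script, the thread counts could be extracted as in the sketch below. The function name is hypothetical, and the only format assumed is the "left/right" form described above.

```python
def parse_thread_status(status):
    """Parse "in_progress/total" (eg "0/20") into a pair of integers.

    Returns None when the text is not in that form.
    """
    left, sep, right = status.partition("/")
    left, right = left.strip(), right.strip()
    if not sep or not (left.isdigit() and right.isdigit()):
        return None
    return int(left), int(right)
```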

Suspended sources retain their color status but have "[SUSPENDED]" prepended to their title.
