Versions Compared


...

Page | Reviewed by Alex | Other's Comments | Andrew Comments | Alex Comments | Status
GENERAL   

1) Double-check the JSON formatting, e.g. automated text extractor has wrong indentation ... my preferred format for elements that are part of a pipeline is shown in the instructions to Harvest control settings

2) Some of the elements are described as single elements when they're in fact arrays - check the source pipeline pojo to see (e.g. manual text extractor?)

3) All the elements have a common "criteria" element - where would this best be documented?

 
File extractor
  •   
 

I have made the necessary changes in response to Comments (05/20/2014)

 

- Would it be possible to have an SV example in the source gallery for when XMLIgnoreValues is used to auto-derive field names? (05/20/2014)

Status: action needed

(As per the db extractor comment) call out specifically how the URL is constructed in the different cases:

  • office type: the path of the file (i.e. file.url + path-relative-to-url)
  • json/xml/csv: if xmlsourcename and xmlprimarykey are specified:
    • xmlsourcename + object.get(xmlprimarykey)
  • if not:
    • path-of-file (as above) + <hash of object> + ".csv"/".json"/".xml"

There are some general fields (renameAfterParse, path*, mode) that need to be documented (see the TODO)

Is there a reason you have different sections for CSV and SV? (CSV is just the default case of SV, i.e. the separator is the default comma)

The *sv documentation is a bit unclear I think - it's much simpler (at least in 90% of cases) ... in most cases the configuration is just either setting the columns or running in auto-config mode, so I'd focus on that ... then ignoring other header fields, and setting the separator/quote/escape (plus the URL setting, which is general for xml/json/sv and is described above). I'm sure there are lots of better CSV configuration documents out there than my original one (http://logstash.net/docs/1.4.0/filters/csv) so feel free to find one of those to start with!
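To illustrate the point above, a minimal SV configuration might look roughly like this. The field names here are hypothetical placeholders chosen from the terms in the comment, not the actual schema - check the source pipeline pojo for the real field names.

```javascript
// Hypothetical minimal *sv config shape; field names are placeholders.
var svConfig = {
  separator: ',',    // the default is comma, i.e. plain CSV
  quoteChar: '"',
  escapeChar: '\\',
  // either list the columns explicitly...
  columns: ['timestamp', 'user', 'message']
  // ...or omit them and let auto-config derive field names from the header row
};
```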

I have similar comments about the JSON/XML ... the main thing is just selecting the root object, so I'd start with an explanation of that (it's very similar between XML/JSON, so probably copy/paste) and then cover the other fields
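"Selecting the root object" could be illustrated with a sketch along these lines: walk a dotted path into the parsed response and emit one record per element under that root. The names here are placeholders, not the actual config fields.

```javascript
// Illustrative sketch of root-object selection for a parsed JSON/XML response.
function selectRootObjects(parsed, rootPath) {
  var node = parsed;
  rootPath.split('.').forEach(function (key) { node = node[key]; });
  // each element under the root becomes one document
  return Array.isArray(node) ? node : [node];
}
```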

Status: ON HOLD
Feed extractor
  •   
Caleb: would probably be useful to have an example of a feed? DONE
Status: Review
Web extractor
  •   
Caleb: would be useful to have an example w/ title/desc in the extraUrls to point out how this is different from feed extractor, or something separating the 2. I'm not sure if "web extractor" is a very good name either, maybe URL extractor or something? Web is ambiguous

I have addressed your comment by highlighting what distinguishes feed from web extractor, but we still require an example in the source gallery that includes title, description etc. (06/03/2014)
Status: Review
Database extractor
  •   
 

I have made the necessary changes in response to Comments (05/20/2014)

- Requires example URLs for connecting to the database when PrimaryKeyValue is specified and when it is not.

Status: action needed

 

There's another missing field that has changed between legacy and pipeline - the database object now has a "url" field (that was previously in the source top level) ... if no value is specified for 'primaryKeyValue' (hmm, this also seems to be missing from the documentation; it is in the code here: https://bitbucket.org/ikanow/ikanow_infinit.e_community/src/d4d92a4131ffc9706417b70077aec548178bcf58/core/infinit.e.data_model/src/com/ikanow/infinit/e/data_model/store/config/source/SourceDatabaseConfigPojo.java?at=master) then the document URL is database.url + record.get(primaryKey) (if no 'primaryKey' is specified then a random string is used); otherwise it's primaryKeyValue + record.get(primaryKey). It would be good for the file and db extractors to call out how the URL is constructed.
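The database-extractor URL logic described above could be sketched like this. The parameter names are placeholders standing in for the SourceDatabaseConfigPojo fields, and the sketch only assumes the behaviour stated in the comment.

```javascript
// Sketch of the db-extractor document URL cases; names are placeholders.
function buildDbDocUrl(databaseUrl, primaryKeyValue, primaryKey, record) {
  // if no 'primaryKey' is specified, a random string is used for the key part
  var keyPart = primaryKey
    ? String(record[primaryKey])
    : Math.random().toString(36).slice(2);
  // primaryKeyValue, when present, replaces database.url as the prefix
  return (primaryKeyValue || databaseUrl) + keyPart;
}
```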

(Re authentication: Made minor update to correct error in legacy documentation, and to reflect v0.3 functionality change)

Status: on hold
Follow Web links
  •   

Drew: Haven't reviewed the whole doc, but definitely should include an example of a splitter instead of just follow web links.

Alex: Drew ... we should use your page-by-page PDF as an example (post to source gallery in a new page) and then Andrew can reference that

I have added an example using "splitter" from the following: Basic Complex JSON API Source Template #2 (document splitting)

I also cleaned up the language a bit so it is more clear that the method can be called using either "links" or "split."

 
Status: Review
Automated text extraction
  •   

Drew: In { config_param_name", string, ... } should probably be { "config_param_name" : string, ... } to make it valid JSON.

Drew: This list should probably only include the config options for boilerpipe, tika, and AlchemyAPI? The others included are feature extractors. The distinction between the two isn't made clear enough, I think, for a new user.

Drew: Probably need to include some examples of when to use each engine (common question I get), e.g. tika is used to process word docs, pdf, office; boilerpipe for web data. Additionally, examples should show some samples of raw text processed by each engine and the output.
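The rule of thumb in the comment above could be captured with a sketch like this. It is illustrative only, not the product's actual selection logic, and the MIME-type patterns are assumptions.

```javascript
// Rule-of-thumb engine choice: tika for office docs/PDFs, boilerpipe for web.
function suggestTextEngine(contentType) {
  if (/pdf|msword|officedocument|ms-excel/.test(contentType)) return 'tika';
  if (/html/.test(contentType)) return 'boilerpipe';
  return 'none'; // already-plain text needs no extraction engine
}
```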

 

DONE

 


I have moved the indicated engines to feature extraction, and worked on the text to make the distinction between the two more clear.  I have indicated the common uses for each.


Tika configuration is explained as part of File extractor.

I will look into obtaining some raw text examples.

Status: action needed

 
Status: Review
Manual text transformation
  •   

Drew: "Log file from File Share" example is missing the global javascript declaration, which makes it impossible to follow the description below. Alternately, rewrite the description to: "After "globals" has been used to define a function called decode (see <globals>), decode is used to capture the metadata for the sample input data into an object called info. Info can then be used in the example that follows:"
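A hypothetical "globals" declaration of the kind the rewritten description assumes might look like this; the log format and field names are invented for illustration, not taken from the actual example.

```javascript
// Hypothetical globals function: decode() parses one raw log line into an
// object ("info") whose fields are then used as the document's metadata.
function decode(line) {
  // e.g. "2014-05-20 12:00:01 WARN disk almost full"
  var parts = line.split(' ');
  return {
    date: parts[0],
    time: parts[1],
    level: parts[2],
    message: parts.slice(3).join(' ')
  };
}
```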

Drew: Possibly include a sample of the original XML prior to transform via Xpath - example is a little tough to follow without that.

DONE

 

 


 

DONE

 

 
Status: Review
Document metadata
  •   
Caleb: What does "Ibid." mean?
Caleb: appendTagsToDocs should probably be reworded for clarity to something like "defaults to false; when true, appends source tags to extracted documents"
DONE 
Status: Review
Content metadata
  •   

Randy: need to add 'g' flag for grid

Randy: How many times do we need to define what each field does?

Randy: In-process comment needs to be removed?

Status: action needed - to add g flag.


The description table may be preferable for users not accustomed to reading code comments.


DONE

 

 
Status: Review
Manual entities
  •   

Randy: Need a linked explanation, or an in-context explanation on $FUNC and $SCRIPT maybe?

Randy: Notes needed on what happens when a value (like location) cannot be evaluated.

 

DONE


-need to obtain error behavior

Status: action needed

 
Status: Review
Manual association of entities
  •   
Randy: Need comment on what occurs when an entity or other information is missing in an association - does it silently error out, does it partially generate, does it break the doc?

- need to obtain error behavior
Status: action needed
 
Status: Review
Document storage settings
  •   
 

 

 
Status: review
Feature extraction
  •   

Drew: "This toolkit element passes the document text" should probably read "document full text"

Drew: The warning here should point to a section in automated text extraction page that explains which feature engines need a text extractor (and which work best for which problems)

Drew: As with my comment on automated text extraction, I'd move all of the feature engine blocks to here. As it is, this page is very sparse and doesn't really reflect what should be here

 

DONE


DONE


DONE

 

 

 
Status: review
Aliasing
  •   
 Not supported 
Status: on hold
Harvest control settings
  •   
 

Require more examples for the following:

Status: action needed

  • duplicateExistingUrls
  • maxDocs_global
  • throttleDocs_perCycle
  • maxDocs_perCycle
  • distributionFactor
 
I added a bunch of examples and explanation (since it was not obvious how some of the params were being used) ... Andrew, I think this covers both my review action and my "give you more info" action
Status: on hold
Search index settings
  •   
 More examples in the source for searchIndex parameters would be beneficial.
Status: action needed
 
Status: on hold
Lookup tables
  •   
 I tried to edit an existing example from the old source, as I could not find any new examples.  Please verify the changes I made to the example source and scripts.
Status: action needed
 
Status: on hold
Javascript globals
  •   
   
Status: review
Logstash extractor
  •   
 Would it be possible to have some logstash examples?  

...