Page | Reviewed by Alex | Other's Comments | Andrew Comments | Alex Comments | Status
File extractor
  •   
 

I have made the necessary changes in response to Comments (05/20/2014)

 

- Would it be possible to have an SV example in the source gallery for when XMLIgnoreValues is used to auto-derive field names? (05/20/2014)

Status: action needed (red)

(As per the db extractor comment) Call out specifically how the URL is constructed in each case:

  • office type: the path of the file (i.e. file.url + path-relative-to-url)
  • json/xml/csv:
    • if xmlsourcename and xmlprimarykey are specified: xmlsourcename + object.get(xmlprimarykey)
    • if not: path-of-file (as above) + <hash of object> + ".csv"/".json"/".xml"
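The cases listed above can be sketched roughly as follows. This is only an illustration of the rules as stated in this comment, not the harvester's actual code: the function name, the argument names, and the use of an MD5 digest of the JSON-serialized object as the "hash of object" are all assumptions made for the example.

```python
import hashlib
import json


def build_file_doc_url(file_url, relative_path, file_type, record=None,
                       xml_source_name=None, xml_primary_key=None):
    """Hypothetical helper: derive a document URL for the file extractor."""
    # Path of the file: file.url + path-relative-to-url
    path = file_url + relative_path

    # Office-type documents simply use the path of the file.
    if file_type == "office":
        return path

    # json/xml/csv with both settings present: source name + primary key value.
    if xml_source_name and xml_primary_key:
        return xml_source_name + str(record[xml_primary_key])

    # Otherwise: path + <hash of object> + extension.
    # ASSUMPTION: the real hash function is not specified here; MD5 over a
    # canonical JSON dump is used purely as a stand-in.
    serialized = json.dumps(record, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(serialized).hexdigest()
    return path + digest + "." + file_type
```

For instance, `build_file_doc_url("file:///data/", "people.json", "json", {"id": 7}, "http://example.com/people/", "id")` takes the second branch and joins the source name with the key value.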

There are some general fields (renameAfterParse, path*, mode) that need to be documented (see the TODO)

Is there a reason you have different sections for CSV and SV? (CSV is just the default case of SV, i.e. the separator is the default comma.)

The *sv documentation is a bit unclear, I think. It's much simpler (at least in 90% of cases): in most cases the configuration is just either setting the columns or enabling auto-config mode, so I'd focus on that, then on ignoring other header fields and setting the separator/quote/escape (plus the url setting, which is general for xml/json/sv and is described above). I'm sure there are lots of better CSV configuration documents out there than my original one (http://logstash.net/docs/1.4.0/filters/csv), so feel free to find one of those to start with!
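To make the "columns or auto-config, plus separator/quote/escape" framing concrete, here is a minimal sketch using Python's standard csv module. It is not the platform's parser: `parse_sv` and its parameter names are hypothetical, and treating the first row as the header in auto mode is an assumption.

```python
import csv
import io


def parse_sv(text, separator=",", quote='"', escape=None, columns=None):
    """Hypothetical helper: parse separated-value text into dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=separator,
                        quotechar=quote, escapechar=escape)
    rows = list(reader)
    if columns is None:
        # "Auto" mode (ASSUMPTION): take the field names from the first row.
        columns, rows = rows[0], rows[1:]
    # Pair each value with its column name.
    return [dict(zip(columns, row)) for row in rows]


# Plain CSV is just the default case: comma separator, header-derived names.
auto_rows = parse_sv("name,age\nalice,30\n")

# Same parser with an explicit column list and a different separator.
piped_rows = parse_sv("alice|30\n", separator="|", columns=["name", "age"])
```

Documenting the two modes first and the separator/quote/escape overrides second mirrors how the function's defaults are laid out.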

I have similar comments about the JSON/XML: the main thing is just selecting the root object, so I'd start with an explanation of that (it's very similar between xml/json, so probably copy/paste), and then cover the other fields.

Status: ON HOLD (red)
Feed extractor
  •   
Caleb reviewing: it would probably be useful to have an example of a feed.
Status: Review (yellow)
Web extractor
  •   
   
Status: Review (yellow)
Database extractor
  •   
 

I have made the necessary changes in response to Comments (05/20/2014)

- Requires example URLs for connecting to the database when PrimaryKeyValue is specified and when it is not.

Status: action needed (red)

 

There's another missing field that has changed between legacy and pipeline: the database object now has a "url" field (previously at the source top level). If no value is specified for 'primaryKeyValue' (this also seems to be missing from the documentation; it is in the code here: https://bitbucket.org/ikanow/ikanow_infinit.e_community/src/d4d92a4131ffc9706417b70077aec548178bcf58/core/infinit.e.data_model/src/com/ikanow/infinit/e/data_model/store/config/source/SourceDatabaseConfigPojo.java?at=master), then the document URL is database.url + record.get(primaryKey) (if no 'primaryKey' is specified, a random string is used); otherwise it's primaryKeyValue + record.get(primaryKey). It would be good for the file and db extractors to call out how the URL is actually constructed.
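A rough sketch of the database URL rule described in this comment follows, written to make the two cases easy to compare. It is not taken from the linked Java class: `build_db_doc_url` is a hypothetical name, a UUID stands in for the unspecified "random string", and applying that same fallback when 'primaryKeyValue' is set is an assumption.

```python
import uuid


def build_db_doc_url(database_url, record, primary_key=None,
                     primary_key_value=None):
    """Hypothetical helper: derive a document URL for a database record."""
    if primary_key:
        # Normal case: use the record's value for the primary key column.
        key_part = str(record[primary_key])
    else:
        # No primaryKey configured: a random string is used.
        # ASSUMPTION: a UUID hex string stands in for that random string.
        key_part = uuid.uuid4().hex

    # primaryKeyValue, when set, replaces database.url as the prefix.
    prefix = primary_key_value if primary_key_value else database_url
    return prefix + key_part
```

So a record `{"id": 42}` with `primary_key="id"` is addressed under `database_url` by default, and under `primary_key_value` when that setting is supplied.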

(Re authentication: Made minor update to correct error in legacy documentation, and to reflect v0.3 functionality change)

Status: on hold (red)
Follow Web links
  •   
   
Status: Review (yellow)
Automated text extraction
  •   
   
Status: Review (yellow)
Manual text transformation
  •   
   
Status: Review (yellow)
Document metadata
  •   
   
Status: Review (yellow)
Content metadata
  •   
 

Added missing examples for xpath and regex. (05/20/2014)

 
Status: Review (yellow)
Manual entities
  •   
   
Status: Review (yellow)
Manual association of entities
  •   
   
Status: Review (yellow)
Document storage settings
  •   
 

 

 
Status: review (yellow)
Feature extraction
  •   
   
Status: review (yellow)
Aliasing
  •   
 Not supported 
Status: on hold (red)
Harvest control settings
  •   
 

More examples are required for the following:

  • duplicateExistingUrls
  • maxDocs_global
  • throttleDocs_perCycle
  • maxDocs_perCycle
  • distributionFactor

Status: action needed (red)
 
Status: on hold (red)
Search index settings
  •   
 More examples in the source for searchIndex parameters would be beneficial.
Status: action needed (red)
 
Status: on hold (red)
Lookup tables
  •   
 I tried to edit an existing example from the old source, as I could not find any new examples.  Please verify the changes I made to the example source and scripts.
Status: action needed (red)
 
Status: on hold (red)
Javascript globals
  •   
   
Status: review (yellow)
Logstash extractor
  •   
 Would it be possible to have some logstash examples?  
