Documentation Progress Tracker
Page | Reviewed by Alex | Other's Comments | Andrew Comments | Alex Comments | Status |
---|---|---|---|---|---|
GENERAL | 1) Double-check the JSON formatting, e.g. the automated text extractor has wrong indentation ... my preferred format for elements that are part of a pipeline is shown in the instructions to Harvest control settings 2) Some of the elements are described as single elements when they're in fact arrays - check the source pipeline POJO to see (e.g. manual text extractor?)
Andrew: I have cleaned up the JSON formatting as per the preferred format. 3) All the elements have a common "criteria" element; where would this best be documented? Andrew: I have defined criteria on the new Using Javascript page. | DONE | |||
File extractor | I have made the necessary changes in response to Comments (05/20/2014)
- Would it be possible to have an SV example in the source gallery for when XMLIgnoreValues are used to auto-derive field names? (05/20/2014) | (As per the db extractor comment), call out specifically how the url is constructed in the different cases:
There are some general fields (renameAfterParse, path*, mode) that need to be documented (see the TODO). Is there a reason you have different sections for CSV and SV? (CSV is just the default case of SV, i.e. the separator is the default comma.) The *sv documentation is a bit unclear, I think - it's much simpler (at least in 90% of cases) ... in most cases the configuration is just either setting the columns or putting it in auto config mode, so I'd focus around that ... then ignoring other header fields, and setting the separator/quote/escape (plus the url setting, which is general for xml/json/sv and is described above). I'm sure there are lots of better CSV configuration documents out there than my original one (http://logstash.net/docs/1.4.0/filters/csv) so feel free to find one of those to start with! I have similar comments about the JSON/XML ... the main thing is just selecting the root object, so I'd start with an explanation of that (it's very similar between xml/json so probably copy/paste) and then the other fields | DONE | ||
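To make the "simple case" described above concrete for whoever revises the page, here is an illustrative sketch of what a minimal *sv file-extractor element might look like. Only the field names called out in these comments (XMLIgnoreValues, separator/quote/escape, url) come from this page; the overall element shape is an assumption and should be verified against the real pipeline POJO before it is documented:

```json
{
    "file": {
        "url": "smb://fileshare/exports/",
        "type": "csv",
        "XmlIgnoreValues": [ "#" ],
        "separator": ",",
        "quote": "\"",
        "escape": "\\"
    }
}
```

In the auto-config variant the comment describes, the columns would simply be omitted and derived from the header row instead.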
Feed extractor | DONE | DONE | |||
Web extractor | Caleb: Would be useful to have an example w/ title/desc in the extraUrls to point out how this is different from the feed extractor, or something separating the 2. I'm not sure if "web extractor" is a very good name either; maybe URL extractor or something? Web is ambiguous - VERIFIED | I have addressed your comment by highlighting what distinguishes the feed from the web extractor. Example includes description, title, url (07/18/14) | DONE | ||
Database extractor | I have made the necessary changes in response to Comments (05/20/2014) -Requires example urls for connecting to the database when PrimaryKeyValue is specified and when it is not. DONE
| There's another missing field that has changed between legacy and pipeline - the database object now has a "url" field (that was previously in the source top level) ... if no value is specified for 'primaryKeyValue' (hmm, this also seems to be missing from the documentation; it is in the code here: https://bitbucket.org/ikanow/ikanow_infinit.e_community). (Re authentication: Made a minor update to correct an error in the legacy documentation, and to reflect the v0.3 functionality change.) | DONE | ||
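A sketch of the relocated "url" field discussed above may help the page author: in the pipeline format, the connection url lives inside the database object rather than at the source top level. Everything below except "url" and "primaryKeyValue" (both named in the comments) is a hypothetical illustration, not the confirmed schema:

```json
{
    "database": {
        "url": "jdbc:mysql://dbhost:3306/documents",
        "query": "SELECT id, title, body FROM docs WHERE id > ?",
        "primaryKey": "id",
        "primaryKeyValue": "1000"
    }
}
```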
Follow Web links |
Alex: Drew ... we should use your page-by-page PDF as an example (post to the source gallery in a new page) and then Andrew can reference that. Andrew: I have now included several examples of "splitter" on this page. Drew: The API-Style Parsing block is a bit spartan and thus unclear. In particular, it makes it seem like the script field is automatically going to look for URLs and parse out the array. This is not the case. The example should have { } around it to show that it's a single object. Under Split, you need to include the block from the example that has the global javascript function create_links() or convert_to_docs() in order for the examples to make sense. As is, a user cannot see how the splitter.script formats the input into the appropriate array output. ANDREW: Latest changes have been made. Please verify. VERIFIED (DREW) | I have added an example using "splitter" from the following: Basic Complex JSON API Source Template #2 (document splitting). I also cleaned up the language a bit so it is more clear that the method can be called using either "links" or "split." | DONE | ||
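For reference while revising the page, Drew's point is that a splitter example is only readable when the global function it calls is shown alongside it. A hedged sketch of that pairing follows; apart from "splitter", splitter.script, and the convert_to_docs() global named above, the element and field names here are assumptions:

```json
[
    { "globals": { "scripts": [
        "function convert_to_docs(items) { var docs = []; for (var i = 0; i < items.length; i++) { docs.push({ url: items[i].link, title: items[i].title }); } return docs; }"
    ] } },
    { "splitter": {
        "scriptlang": "javascript",
        "script": "convert_to_docs(_doc.metadata.items)"
    } }
]
```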
Automated text extraction |
IGNORE ANDREW: We have re-worked this page due to changes on feature extraction. Boilerpipe has no configuration parameters of note to pass, so we won't do an example. | DONE
I have moved the indicated engines to feature extraction, and worked on the text to make the distinction between the two more clear. I have indicated the common uses for each. Tika configuration is explained as part of File extractor.
Andrew Johnston: I added additional text examples and updated some of the text. The examples aren't right though ... basically all these extractors are for grabbing data from PDFs or HTML. What you should do is create a source consisting only of a "web" element pointing to some page, and then the relevant "textEngine", then press "test" with "show full text" checked ... this will give you the text (annoyingly inside a JSON field, i.e. with newlines encoded). I'm sure there's a web service that will convert that to text for you ... e.g. in the browser JS console just type console.log("<PASTE>"); you can then paste that into a text box (perhaps shown next to a screencap of the web page), eg:
| ACTIONS | ||
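The test recipe Alex describes above (a source consisting only of a "web" element plus the relevant "textEngine", then pressing "test" with "show full text" checked) might be sketched as below. The field names inside each element are assumptions to be checked against the schema; only "web", "extraUrls", and "textEngine" appear in the comments on this page:

```json
[
    { "web": { "extraUrls": [ { "url": "http://www.example.com/some-article.html" } ] } },
    { "textEngine": { "engineName": "boilerpipe" } }
]
```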
Manual text transformation |
Drew: Alex, do we have a snippet of the WITS XML that could be put in as an input to the XML section of this? ANDREW: all code examples now have snippets of example source. Please verify. - VERIFIED (DREW) | DONE
DONE
| DONE | ||
Document metadata | Caleb: What does "Ibid." mean? - VERIFIED Caleb: appendTagsToDocs should probably be reworded for clarity to something like "defaults to false, when true appends source tags to extracted documents - VERIFIED | DONE | DONE | ||
Content metadata | Randy: Need to add the 'g' flag for grid. Randy: How many times do we need to define what each field does?
| IGNORE to add g flag. TODO: locate existing documentation for the g flag. Randy Jarrett (Unlicensed): The description table may be preferable for users not accustomed to reading code comments. DONE
| DONE | ||
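If it helps whoever picks up the 'g' flag TODO: assuming the flag carries the usual regex "global match" meaning (the comment above does not confirm this), a contentMetadata sketch might look like the following. All field names here are illustrative guesses except the flag itself and the "contentMetadata" element named elsewhere on this page:

```json
{
    "contentMetadata": [ {
        "fieldName": "phoneNumbers",
        "scriptlang": "regex",
        "script": "\\b[0-9]{3}-[0-9]{4}\\b",
        "flags": "g"
    } ]
}
```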
Manual entities | Randy: Need a linked explanation, or an in-context explanation on $FUNC and $SCRIPT maybe? Randy: Notes needed on what happens when a value (like location) cannot be evaluated.
| DONE
| NEW (double-check the json version of iterateOver is sufficient, see comments below) NEW: don't understand "need to obtain error behavior", plz clarify. This is fine, don't worry about error behavior | DONE |
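For the requested $FUNC / $SCRIPT explanation, a hedged sketch of the distinction as it is described elsewhere on this page (an inline script vs. a call to a global function). The surrounding field names are assumptions; only the $SCRIPT(...) and $FUNC(...) forms and the hypothetical extract_location() helper come from, or are flagged as invented for, this illustration:

```json
{
    "entities": [
        {
            "disambiguated_name": "$SCRIPT( return _doc.metadata.author; )",
            "type": "Person",
            "dimension": "Who"
        },
        {
            "disambiguated_name": "$FUNC( extract_location(); )",
            "type": "Location",
            "dimension": "Where"
        }
    ]
}
```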
Manual association of entities | Randy: Need comment on what occurs when an entity or other information is missing in an association - Does it silently error out, does it partially generate, does it break the doc? |
Andrew: All new comments from Alex addressed and page now ready. | NEW: The "iterateOver" explanation is a bit of a disaster ... there are 3 types of iterateOver: multiplicative, associative, and json ... you give a quick explanation of the multiplicative (but less clear than the legacy, I think), but the example is a json example; there is no explanation of the json example (which is the same between the engines). NEW: don't understand "need to obtain error behavior", plz clarify | DONE | |
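To help untangle the iterateOver modes Alex lists (multiplicative, associative, json), a purely illustrative sketch of two of them follows. The field values are guesses meant to show the shape of each mode, not confirmed syntax, so check them against the engine before putting anything like this on the page:

```json
{
    "associations": [
        { "iterateOver": "person,location", "verb": "visited" },
        { "iterateOver": "metadata.events",
          "verb": "$SCRIPT( return _iterator.action; )" }
    ]
}
```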
Document storage settings | Andrew: added additional clarity around functional description and use cases. | DONE | |||
Feature extraction | Drew: "This toolkit element passes the document text" should probably read "document full text" Drew: The warning here should point to a section in automated text extraction page that explains which feature engines need a text extractor (and which work best for which problems) Drew: As with my comment on automated text extraction, I'd move all of the feature engine blocks to here. As it is, this page is very sparse and doesn't really reflect what should be here Andrew: feature engine blocks are now on this page. Please verify DrewS (Unlicensed) - VERIFIED (DREW) | DONE DONE DONE
Andrew: new functionality has been added to this page, where a regex can be used for feature extraction engine. Documentation has been added. Please verify. AlexI (Unlicensed) | DONE | ||
Aliasing | Not supported | DONE | |||
Harvest control settings | Require more examples for the following: DONE
| I added a bunch of examples and explanation (since it was not obvious how some of the params were being used) ... Andrew I think this covers both my review action and my "give you more info" action Andrew: Looks good. I made a few more minor changes. Provided you are happy with the page now, can we go ahead and set Status to Complete? | DONE | ||
Search index settings | Andrew Johnston: More examples in the source for searchIndex parameters would be beneficial. Andrew Johnston: Added additional clarity around functional description and use cases. Let's either dig up an example or get this page set to Done. AlexI (Unlicensed)
| Added a couple of examples, good enough | DONE | ||
Lookup Tables |
Andrew Johnston: added additional clarity around functional description and use cases. Improved integration of examples with functional description. | Good enough, until someone complains | DONE | ||
Javascript globals | There are a few things worth discussing here:
1) (scripts and imports are arrays - fixed)
2) The scope of the javascript array itself is much smaller than the scope of the page as currently written ... globals just includes functions that the other elements that use javascript can access ... then there's the question of explaining how javascript is used across the various elements where it's an option ... e.g. contentMetadata, text, criteria everywhere, follow web links, split, docMetadata etc. (There are similar considerations for regex.) So I would be interested in your thoughts on how to break this up ... should we have one page for each of the script languages and then reference that everywhere, with just any element-specific considerations? If we did that, then the globals page would be really simple ... write global javascript and imports in here that can be accessed everywhere else as per <link>.
3) has some relevant thoughts relative to this decision. Andrew: I have simplified the Javascript globals page, and created a new page called Using Javascript to address the points raised in 3).
3) No fault of yours (except I'm not sure what the picture of the rhino is good for!) but this section (reading it like the "general javascript guide") is pretty bad (because the source material is really unclear and bad) ... one of the problems is just one of organization though ... there are basically 3 different ways in which JS can be used:
- a) to get metadata out of either the text or the metadata (which has no $SCRIPT(...); you don't return variables, the last evaluated expression is returned; you get _doc, _metadata or text depending on flags (unless chaining, in which case you get _iterator))
- b) to create associations/entities from metadata using $SCRIPT or $FUNC
- c) there are a few other random places; criteria is the most ubiquitous, but also there's some stuff in doc storage settings
Andrew: After reading this description everything is much clearer!
Thank you. I have created a new page Using Javascript which addresses each of these points in turn. It provides links to the relevant pipeline elements as required. 4) In the same way that in the logstash extractor I mention that you should always code in the LS window, not directly in the JSON editor, here the same thing is true but for the JS editor. OK, that's enough rambling! Is that any use? Andrew: very useful. | DONE | |||
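Mode (a) above hinges on one convention the Using Javascript page should spell out: there is no return statement, and the last evaluated expression becomes the element's value. The tiny harness below simulates that behavior so the page author can sanity-check the wording; the shape of _doc is an invented example, and the engine's real evaluation machinery is of course different:

```javascript
// Simulates the "last evaluated expression is returned" convention for
// pipeline scripts. The engine exposes variables such as _doc, _metadata,
// text or _iterator to the script; here we inject them as parameters.
function runPipelineScript(script, context) {
  const names = Object.keys(context);
  const values = names.map((name) => context[name]);
  // Direct eval inside the function body sees the injected parameters,
  // and eval itself yields the value of the last evaluated expression.
  const fn = new Function(...names, `return eval(${JSON.stringify(script)});`);
  return fn(...values);
}

// An invented document shape, purely for illustration.
const _doc = { title: "Example", metadata: { author: ["alice", "bob"] } };

// No `return` in the script: its last expression is the stored value.
console.log(runPipelineScript("_doc.metadata.author.length", { _doc })); // 2
console.log(runPipelineScript("var t = _doc.title; t.toUpperCase()", { _doc })); // EXAMPLE
```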
Logstash extractor | Would it be possible to have some logstash examples? | The tricky thing about examples is that the "config" is itself a complex object format ... the source editor has a special "LS" logstash editor (analogous to the "JS" editor for global JS). I'll add an example of the logstash configuration, and you can work out some wording to explain its context. Andrew: I provided a contextual description of the included example. | DONE |
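A candidate example along the lines promised above: the pipeline element presumably wraps a standard logstash config string. The "config" field name and the element shape are assumptions; the input/filter syntax inside the string is ordinary logstash (SYSLOGLINE is a stock grok pattern):

```json
{
    "logstash": {
        "config": "input { file { path => '/var/log/syslog' } } filter { grok { match => [ 'message', '%{SYSLOGLINE}' ] } }"
    }
}
```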