    {
        // Location
        url: string,
        sourceKey: string,
        communityIds: [ ObjectId ],

        // The gzipped content stored in the Lucene index (eg after the harvest processing pipeline)
        gzip_content: binary,
        gzip_len: integer,

        // Optional content:
        gzip_raw_content: binary, // The original text, before the processing pipeline (but after Tika for PDFs etc)
        gzip_raw_len: integer,
        gzip_md_content: binary, // The compressed document metadata object
        gzip_md_len: integer
    }
Notes
Note that the "binary" type serializes to byte[].
...
Currently the content records are not generated in the following circumstances:
The original idea behind this was that JSON and XML files, and RDBMS records, would typically not have a large block of data as the full text; instead, the full text (if present at all) would be a composite of small fields. This is not optimal for many current use cases in a few ways:
The medium-term plan is to let users manually specify the content-saving behavior on a per-source basis (with the above as the fallback if no manual override is specified). The source configuration is in the middle of a major overhaul for both functional and readability reasons, so this will be rolled into that forthcoming change.
Misc
Here is some sample Java code that shows how to access the unzipped content:
...
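As a general illustration of the technique, here is a minimal sketch that decompresses a gzipped byte[] using only the standard java.util.zip classes. It assumes the gzip_content (or gzip_raw_content) value has already been read from the content record into a byte[], and that the stored text is UTF-8 encoded. The class name ContentUnzipper and its method names are hypothetical and are not part of the platform's API; the gzip helper exists only to produce sample input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class ContentUnzipper {

    // Decompresses a gzipped byte[] (e.g. the value of gzip_content) into text.
    static String gunzipToString(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // Copy decompressed bytes until the end of the gzip stream.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            // Assumption: the stored text is UTF-8 encoded.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses text with gzip; used here only to build sample input.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed at this point, so the trailer has been written.
        return out.toByteArray();
    }
}
```

Calling ContentUnzipper.gunzipToString(bytes) on the byte[] taken from gzip_content returns the processed text; the same call applies to gzip_raw_content when that optional field is present.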