...
Code Block | ||||
---|---|---|---|---|
| ||||
{
// Location
url: string,
sourceKey: string,
communityIds: ObjectId,
// The gzipped content stored in the Lucene index (e.g. after the harvest processing pipeline)
gzip_content: binary,
gzip_len: integer,
// Optional content:
gzip_raw_content: binary, // The original text, before the processing pipeline (but after Tika for PDFs etc)
gzip_raw_len: integer,
gzip_md_content: binary, // The compressed document metadata object
gzip_md_len: integer
} |
Notes
Note that the "binary" type serializes to byte[].
Which fields are populated in a deployment is controlled by the following parameters in "infinite.service.properties". This file lives in "/opt/infinite-home/config" and is normally auto-generated from "/opt/infinite-install/config/infinite.configuration.properties" and "/opt/infinite-home/config/infinite.service.properties.TEMPLATE". Currently only "store.maxcontent" is copied across from the infinite-install directory; to override the others, add them to "/opt/infinite-home/config/infinite.service.properties.TEMPLATE":
- store.maxcontent: (long) the maximum uncompressed length in bytes; data beyond this is truncated. Note that this limit is not applied to gzip_md_content.
- store.rawcontent: (boolean) whether to store the original text
- store.metadata_as_content: (boolean) whether to store the metadata
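As an illustration of how these three parameters might be consumed, here is a minimal Java sketch. It is not the platform's own configuration code: the class name `StoreSettingsSketch`, its methods, and the fallback values used when a key is absent are all hypothetical. It reads the keys with `java.util.Properties` and applies the `store.maxcontent` byte limit described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of the platform code)
public class StoreSettingsSketch {
    public final long maxContent;
    public final boolean rawContent;
    public final boolean metadataAsContent;

    public StoreSettingsSketch(Properties props) {
        // The fallback values below are placeholders, NOT the platform's documented defaults
        maxContent = Long.parseLong(props.getProperty("store.maxcontent", "-1").trim());
        rawContent = Boolean.parseBoolean(props.getProperty("store.rawcontent", "false").trim());
        metadataAsContent = Boolean.parseBoolean(props.getProperty("store.metadata_as_content", "false").trim());
    }

    // Parses the settings from properties-file text (e.g. the contents of infinite.service.properties)
    public static StoreSettingsSketch parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new StoreSettingsSketch(props);
    }

    // Truncates uncompressed content to at most maxContent bytes.
    // Assumption: a non-positive limit means "do not truncate".
    public byte[] truncate(byte[] uncompressed) {
        if (maxContent <= 0 || uncompressed.length <= maxContent) {
            return uncompressed;
        }
        return Arrays.copyOf(uncompressed, (int) maxContent);
    }
}
```

Note that truncating at a byte boundary can split a multi-byte UTF-8 character; whether the real pipeline guards against this is not stated here.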
Info |
---|
Currently the content records are not generated in the following circumstances:
The original idea behind this was that JSON and XML files, and RDBMS records, would typically not have a large block of data as the full text; instead, the full text (if present at all) would be a composite of small fields. There are a few ways this is not optimal for many current use cases:
The medium-term plan is to allow users to manually specify the content-saving behavior on a per-source basis (with the above as the fallback if no manual override is specified). The source configuration is in the middle of a major overhaul for both functional and readability reasons, so this will be rolled into that forthcoming change. |
Misc
Here is some sample Java code that shows how to access the unzipped content:
Code Block | ||||
---|---|---|---|---|
| ||||
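// Requires: java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream, java.util.zip.GZIPInputStream
// (contentDB, contentQ and doc are assumed to be defined by the calling code)
BasicDBObject dboContent = (BasicDBObject) contentDB.findOne(contentQ);
if (null != dboContent) {
    byte[] compressedData = ((byte[])dboContent.get(CompressedFullTextPojo.gzip_content_));
    GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressedData));
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    byte[] storageArray = new byte[200000];
    int nRead;
    // Accumulate the raw bytes and decode once at the end, so that a multi-byte
    // UTF-8 character split across two reads is not corrupted
    while ((nRead = gzip.read(storageArray, 0, storageArray.length)) >= 0) {
        output.write(storageArray, 0, nRead);
    }
    gzip.close();
    doc.setFullText(output.toString("UTF-8"));
} |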
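For experimenting without a database connection, the following self-contained sketch performs the same decompression on a plain byte[] and adds a matching compression helper so the round trip can be checked. The class name `GzipContentSketch` and both method names are hypothetical. The UTF-8 encoding is taken from the sample above, and only standard `java.util.zip` classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only (not part of the platform code)
public class GzipContentSketch {

    // Gzips the UTF-8 encoding of the text, e.g. to build test data for a gzip_content field
    public static byte[] compress(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed (and its trailer written) before the bytes are read back
        return bytes.toByteArray();
    }

    // Gunzips a byte[] and decodes the complete result as UTF-8 in a single step
    public static String decompress(byte[] compressedData) {
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressedData))) {
            int nRead;
            while ((nRead = gzip.read(buffer, 0, buffer.length)) >= 0) {
                output.write(buffer, 0, nRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(output.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A typical use would be `String text = GzipContentSketch.decompress(compressedData);`, where `compressedData` is the byte[] read from one of the gzip_* fields.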