{
    // Location
    url: string,
    sourceKey: string,
    communityIds: [ ObjectId ],

    // The gzipped content stored in the Lucene index (eg after the harvest processing pipeline)
    gzip_content: binary,
    gzip_len: integer,

    // Optional content:
    gzip_raw_content: binary, // The original text, before the processing pipeline (but after Tika for PDFs etc)
    gzip_raw_len: integer,
    gzip_md_content: binary, // The compressed document metadata object
    gzip_md_len: integer
}
Note that the "binary" type serializes to byte[].
The configuration file "infinite.service.properties" lives in "/opt/infinite-home/config" and is normally auto-generated from "/opt/infinite-install/config/infinite.configuration.properties" and "/opt/infinite-home/config/infinite.service.properties.TEMPLATE". Currently only "store.maxcontent" is copied across from the infinite-install directory; to override any other parameter, add it to "/opt/infinite-home/config/infinite.service.properties.TEMPLATE". The fields used in deployments are controlled by the following parameters from "infinite.service.properties":
Currently the content records are not generated in the following circumstances:
The original idea behind this was that JSON and XML files, and RDBMS records, would typically not have a large block of data as the full text; instead, the full text (if present at all) would be a composite of small fields. This is not optimal for many current use cases, in a few ways:
The medium-term plan is to allow users to specify the content-saving behavior manually on a per-source basis (with the above as the fallback if no manual override is specified). The source configuration is in the middle of a major overhaul for both functional and readability reasons, so this will be rolled into that forthcoming change.
Here is some sample Java code that shows how to access the unzipped content:
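import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import com.mongodb.BasicDBObject;

// contentDB (the content collection), contentQ (the query) and doc are assumed to be set up by the caller
BasicDBObject dboContent = (BasicDBObject) contentDB.findOne(contentQ);
if (null != dboContent) {
    byte[] compressedData = (byte[]) dboContent.get(CompressedFullTextPojo.gzip_content_);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // try-with-resources closes the gzip stream even if a read fails
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressedData))) {
        byte[] storageArray = new byte[200000];
        int nRead;
        // read() returns -1 once the end of the compressed stream is reached
        while ((nRead = gzip.read(storageArray, 0, storageArray.length)) >= 0) {
            out.write(storageArray, 0, nRead);
        }
    }
    // Decode once at the end, so a multi-byte UTF-8 character split across two reads is not corrupted
    doc.setFullText(out.toString("UTF-8"));
}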
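The decompression step above can also be tried without a MongoDB connection. The following is a minimal sketch, not code from the platform: the class name "GzipTextCodec" and its two methods are hypothetical, and it assumes only that the stored bytes are gzipped UTF-8 text, as the sample above implies. It compresses a string, decompresses it again, and checks that the round trip preserves multi-byte characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of the platform's API.
class GzipTextCodec {

    // Gzip the UTF-8 encoding of the given text.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the gzip stream flushes the trailer into 'out'
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Gunzip the bytes and decode them as UTF-8 in a single step.
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int nRead;
            // read() returns -1 at the end of the stream
            while ((nRead = gzip.read(buffer, 0, buffer.length)) >= 0) {
                out.write(buffer, 0, nRead);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "Full text with multi-byte characters: \u00e9 \u4e2d\u6587";
        byte[] stored = compress(original);
        System.out.println("Round trip preserved text: " + original.equals(decompress(stored)));
    }
}
```

Collecting all decompressed bytes before decoding keeps the result independent of the buffer size, since no character can be cut in half at a buffer boundary.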