Document Contents JSON Format

doc_content.gzip_content format
{
	// Location
	url: string,
	sourceKey: string,
	communityIds: [ ObjectId ],
 
	// The gzipped copy of the content that is indexed in Lucene (i.e. the text after the harvest processing pipeline)
	gzip_content: binary,
	gzip_len: integer,
 
	// Optional content:
	gzip_raw_content: binary, // The original text, before the processing pipeline (but after Tika for PDFs etc)
	gzip_raw_len: integer,
 
	gzip_md_content: binary, // The compressed document metadata object
	gzip_md_len: integer
}

Notes

Note that the "binary" type serializes to byte[].

The fields used in a deployment are controlled by the following parameters from "infinite.service.properties" (in "/opt/infinite-home/config"). This file is normally auto-generated from "/opt/infinite-install/config/infinite.configuration.properties" and "/opt/infinite-home/config/infinite.service.properties.TEMPLATE"; currently only "store.maxcontent" is copied across from the infinite-install directory, so to override the other parameters, add them to "/opt/infinite-home/config/infinite.service.properties.TEMPLATE". An example fragment is shown after the list.

  • store.maxcontent: (long) the maximum uncompressed length in bytes - data beyond this limit is truncated. Note that this limit is not applied to gzip_md_content.
  • store.rawcontent: (boolean) whether to store the original text
  • store.metadata_as_content: (boolean) whether to store the metadata
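
For example, a fragment of "infinite.service.properties" setting all three parameters might look like the following (the values shown are illustrative, not defaults):

Example properties fragment
				store.maxcontent=1048576
				store.rawcontent=true
				store.metadata_as_content=true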

Currently, content records are not generated in the following circumstances (sketched in code after the list):

  • the URL starts "jdbc://"
  • the URL starts "smb://", "s3://", or "file:" and ends in either ".xml" or ".json"
  • The document has a "sourceUrl" field (many XML/JSON/token-separated records in a single file)
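
As a rough sketch, these rules amount to logic along the following lines (a hypothetical helper written for illustration, not the actual harvester code):

Content storage exclusion rules (sketch)
				private static boolean skipContentStorage(String url, String sourceUrl) {
					if (null != sourceUrl) {
						return true; // multi-record file: document has a "sourceUrl" field
					}
					if (url.startsWith("jdbc://")) {
						return true; // RDBMS record
					}
					boolean fileLike = url.startsWith("smb://") || url.startsWith("s3://")
							|| url.startsWith("file:");
					if (fileLike && (url.endsWith(".xml") || url.endsWith(".json"))) {
						return true; // XML/JSON file
					}
					return false;
				}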

The original idea behind this was that JSON and XML files, and RDBMS records, would typically not have one large block of text as the full text; instead, the full text (if present at all) would be a composite of small fields.

There are a few ways this is not optimal for many current use cases:

  • These days you can choose to ignore specified JSON/XML fields, so you will often have a large JSON field that generates the fullText via the structured analysis handler and is then discarded
  • The URL can be changed, in which case the URL-based checks above may no longer reflect the underlying document type

The medium-term plan is to allow users to specify the content-saving behavior manually on a per-source basis (with the above rules as the fallback if no manual override is specified). The source configuration is in the middle of a major overhaul for both functional and readability reasons, so this will be rolled into that forthcoming change.

Misc

Here is some sample Java code that shows how to access the unzipped content:

Unzipping the content
				// Assumes contentDB (DBCollection), contentQ (query), and doc (document pojo) are already in scope.
				// Requires: java.io.*, java.util.zip.GZIPInputStream, com.mongodb.BasicDBObject
				BasicDBObject dboContent = (BasicDBObject) contentDB.findOne(contentQ);
				if (null != dboContent) {
					// "binary" fields deserialize to byte[] (see Notes above)
					byte[] compressedData = (byte[]) dboContent.get(CompressedFullTextPojo.gzip_content_);
					// Wrap the gzip stream in a reader so multi-byte UTF-8 characters are
					// decoded correctly even when they straddle buffer boundaries
					Reader reader = new InputStreamReader(
							new GZIPInputStream(new ByteArrayInputStream(compressedData)), "UTF-8");
					try {
						char[] buffer = new char[200000];
						StringBuilder output = new StringBuilder();
						int nRead;
						// read() returns -1 at end-of-stream
						while ((nRead = reader.read(buffer)) >= 0) {
							output.append(buffer, 0, nRead);
						}
						doc.setFullText(output.toString());
					}
					finally {
						reader.close();
					}
				}
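
For completeness, the contentDB and contentQ objects used above might be set up as follows. This is a sketch: the database and collection names are inferred from the "doc_content.gzip_content" title above, and the query fields are assumptions based on the format:

Setting up the query (sketch)
				// "mongo" is an existing com.mongodb.Mongo connection; url and sourceKey
				// identify the document whose content is being fetched (assumed names)
				DBCollection contentDB = mongo.getDB("doc_content").getCollection("gzip_content");
				BasicDBObject contentQ = new BasicDBObject("url", url);
				contentQ.put("sourceKey", sourceKey);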