doc_content.gzip_content format

{
	// Location
	url: string,
	sourceKey: string,
	communityIds: [ ObjectId ],
 
	// The gzipped content stored in the Lucene index (e.g. after the harvest processing pipeline)
	gzip_content: binary,
	gzip_len: integer,
 
	// Optional content:
	gzip_raw_content: binary, // The original text, before the processing pipeline (but after Tika for PDFs etc)
	gzip_raw_len: integer,
 
	gzip_md_content: binary, // The compressed document metadata object
	gzip_md_len: integer
}

Notes

Note that the "binary" type serializes to byte[].

...

Currently, content records are not generated in the following circumstances:

The URL starts with "jdbc://"
The URL starts with "smb://", "s3://", or "file:" and ends in either ".xml" or ".json"
The document has a "sourceUrl" field (i.e. many XML/JSON/token-separated records come from a single file)
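
The three rules above can be read as a single predicate. The following is a minimal Java sketch of that reading, not the platform's actual implementation: the class name `ContentRecordRules`, the method name `skipsContentRecord`, and the `hasSourceUrl` flag are hypothetical, and the sketch assumes the prefix and suffix checks are case-sensitive.

```java
public class ContentRecordRules {

    /**
     * Hypothetical helper: returns true when, under the rules listed above,
     * no content record would be generated for a document.
     *
     * @param url          the document URL
     * @param hasSourceUrl whether the document has a "sourceUrl" field
     */
    public static boolean skipsContentRecord(String url, boolean hasSourceUrl) {
        // Rule 3: many records extracted from a single file
        if (hasSourceUrl) {
            return true;
        }
        // Assumption: a missing URL matches none of the URL-based rules
        if (url == null) {
            return false;
        }
        // Rule 1: RDBMS records
        if (url.startsWith("jdbc://")) {
            return true;
        }
        // Rule 2: XML/JSON files from a file share, S3, or the local file system
        boolean fileLike = url.startsWith("smb://")
                || url.startsWith("s3://")
                || url.startsWith("file:");
        boolean structured = url.endsWith(".xml") || url.endsWith(".json");
        return fileLike && structured;
    }
}
```

Under this reading, a PDF on a file share still gets a content record, as does an XML file served over HTTP, since neither matches all parts of a rule.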

The original idea behind this was that JSON and XML files, and RDBMS records, would typically not have a large block of data as the full text. Instead, the full text (if present at all) would be a composite of small fields.

This is not optimal for many current use cases, for a couple of reasons:

You can now choose to ignore specified JSON/XML fields, so you will often have a large JSON field that generates the fullText via the structured analysis handler and is then discarded
You can change the URL, so the URL-based checks above may not reflect where the document actually came from

The medium-term plan is to let users manually specify the content-saving behavior on a per-source basis, with the rules above as the fallback if no manual override is specified. The source configuration is in the middle of a major overhaul for both functional and readability reasons, so this change will be rolled into that overhaul.

Misc

Here is some sample Java code that shows how to access the unzipped content:

...
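
Independently of the sample referred to above, the decompression step on its own might look like the following sketch. It assumes the `gzip_content` bytes are a standard gzip stream and that the text is UTF-8 encoded; neither is confirmed on this page. The class `GzipContent` and its methods are hypothetical names, and the `gzip` helper is there only so the round trip can be checked.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipContent {

    /** Decompresses a gzip byte[] (e.g. the value of gzip_content) into a String. */
    public static String gunzip(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            // Assumption: the stored text is UTF-8
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text to gzip bytes; included only to exercise gunzip. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The same `gunzip` call would apply to `gzip_raw_content` and `gzip_md_content` when those optional fields are present, under the same assumptions.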
