Infinit.e maintenance - command line utilities and other scripts and log files

Overview

This page describes command line utilities that system administrators can use to modify the data within Infinit.e.

This is a work-in-progress. Contact me if you have a specific need for some documentation and I'll address that first.

infinit.e.mongo-indexer

Location: "/opt/infinite-home/bin/infinite_indexer.sh" (calls "/opt/infinite-home/lib/infinit.e.mongo-indexer.jar" - note prior to v0.3 was run directly as "java -jar /opt/infinite-home/lib/infinit.e.mongo-indexer.jar <command line args>")

Overview: A very important command line tool that indexes data from MongoDB into elasticsearch, can delete data from both, and can also recreate/fix elasticsearch indexes.

Usage:

The indexer has a number of different functions, described below. In all cases the configuration is taken from "/opt/infinite-home/config/infinite.service.properties" unless overridden with "--config". Note that some familiarity with the data model will be helpful in understanding its usage.

Deleting data (while keeping elasticsearch and MongoDB synchronized)
sh infinite_indexer.sh --delete --doc|--assoc|--entity [ --query '{ ... }' ] [ --skip N1 ] [ --limit N2 ]

(Note: It is generally preferable to use the API call "config/source/delete/docs" to delete sources, and "social/community/remove" automatically deletes all sources in its community).

Arguments:

  • --doc OR --assoc OR  --entity ... which collection to run against (doc_metadata.metadata/doc_content.gzip_content vs feature.association vs feature.entity)
  • --query followed by a MongoDB query specifying which records to delete, defaults to everything
  • "--skip <N>" to jump over N records before deleting (natural order)
  • "--limit <N>" to delete at most N records (natural order) starting from 0 (unless a skip is specified)

(Note that it is necessary to specify either a limit or a query, just to make it harder to delete all the data by accident).
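For example, to delete all documents belonging to a single source while keeping elasticsearch synchronized (the sourceKey value shown is purely illustrative):
sh infinite_indexer.sh --delete --doc --query '{ "sourceKey": "www.example.com.rss.xml" }'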

Rebuilding all elasticsearch indexes corresponding to empty database (deleting all data from elasticsearch)
sh infinite_indexer.sh --rebuild --doc|--assoc|--entity --limit 1

Note that this only makes sense if the data in MongoDB is deleted. To rebuild indexes containing data, use the "--rebuild" option together with a query, as described below under "Synchronize elasticsearch with MongoDB".

Re-creating missing elasticsearch indexes
sh infinite_indexer.sh --doc --verify

For example if you delete an index or alias by hand (eg from the elasticsearch head GUI found at "localhost:9200/_plugin/head/") then this can be run to recreate the indexes without otherwise affecting the system. Note that if MongoDB still has documents then you will need to separately re-index by calling the indexer again (see below under "Synchronize elasticsearch with MongoDB").

(This function does not exist for entities and associations, since they each have only one index; just use "--rebuild" if for some reason the index is not present).

Synchronize elasticsearch with MongoDB
sh infinite_indexer.sh --doc|--assoc|--entity [ --query '{ ... }' [ --rebuild ] ] [ --skip N1 ] [ --limit N2 ]

Arguments:

  • --doc OR --assoc OR  --entity ... which collection to run against (doc_metadata.metadata/doc_content.gzip_content vs feature.association vs feature.entity)
  • --query followed by a MongoDB query specifying which records to synchronize, defaults to everything
    • If --rebuild is also specified then the index containing each matching document is deleted first. This is useful eg if a community's index has become corrupt: query on its communityId (or a set of sourceKey fields) and this blows away whatever is there and re-indexes from MongoDB (see the example after this list).
    • This option should be used with extreme caution. (It can also be combined with --skip/--limit, though that only makes sense if you are transferring all documents across, ie with --query '{}'.)
  • "--skip <N>" to jump over N records before starting to index (natural order)
  • "--limit <N>" to index at most N records (natural order) starting from 0 (unless a skip is specified) (note: limits <= 0 are ignored)
Re-indexing data when it has changed - documents with entities/associations added
sh infinite_indexer.sh --feature --doc [ --query '{ ... }' ] [ --skip N1 ] [ --limit N2 ]

This is another somewhat specialized function. It is intended for cases where documents have been harvested without entity/association enrichment, and subsequent off-line processes have then added entities. This call, in addition to synchronizing elasticsearch like the call above, will create objects in the entity and association feature tables based on those entities (or update their counts if they already exist). A couple of points worth noting:

  • If this is run on documents that already had entities then the entities will be double counted (until the weekly resync generated by sync_features, see below).
  • It is somewhat sub-optimal currently because the standard metadata (plus full text) will be indexed once on harvest, and then again when this is called. It should be on the roadmap to have the option not to index the docs first time round (which will also hide them from searches, which may be desirable).
Heavily optimized entity synchronization - used by "sync_features" script, see below
sh infinite_indexer.sh --entity --verify --query doc_feature.tmpCalcFreqCount

Used to copy entity counts to MongoDB and then elasticsearch following the (normally) weekly entity recount initiated by "/opt/infinite-home/bin/sync_features.sh" (see below). Should not be used in other circumstances.

A utility script for calling mongo-indexer to synchronize MongoDB and elasticsearch
sh /opt/infinite-home/bin/reindex_from_db.sh --doc|--assoc|--entity [--rebuild] <stepsize> <saved_state>

This script calls mongo-indexer in steps of <stepsize> (eg 100,000) while using the file path pointed to by <saved_state> (eg /tmp/resync_state).

The idea is that if something goes wrong during a large multi-hour resynchronization, you can just rerun the command and it will pick up where it left off. Obviously each new resynchronization should use a different <saved_state>.
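For example, to re-index all documents in batches of 100,000, saving progress to a temporary state file:
sh /opt/infinite-home/bin/reindex_from_db.sh --doc 100000 /tmp/resync_state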

(Note this script and the functions it calls are in need of some rework - internally they rely on skip/limit instead of the _id ranges stored in "config.chunks", so it is currently not scalable for very large databases - it should be easy to make it support chunks when needed).

Regularly scheduled jobs

Overview

Location: "/etc/cron.d/infinite*" (infinite-db, infinite-index-engine, infinite-logging, infinite-px-engine)

Overview: Sets of scripts that perform regularly scheduled activities in Infinit.e. Aside from the install scripts, these are really the only OS dependencies, apart from this everything runs from inside the Java containers (plus MongoDB).

infinite-db: (backups) "/opt/db-home/master_backup-script.sh", every day at 1am except Saturday
01 01    * * 0-5   root    /opt/db-home/master_backup-script.sh &> /tmp/backup_script.log

(Note: not run on Saturday so as not to interfere with the batch aggregation, see below under "Batch aggregations").

See below, under "Backups".

infinite-db: (database compaction) "/opt/db-home/master_compact-script.sh", on Saturday at midnight (replicas) or 3am (masters)
05 00   * * 6   root    /opt/db-home/master_compact-script.sh 1 > /tmp/compact_script.log
05 03   * * 6   root    /opt/db-home/master_compact-script.sh 2 >> /tmp/compact_script.log

Compacting the MongoDB databases is an important regular maintenance activity that prevents performance from degrading noticeably over time.

In order to compact the databases without losing availability, a round-robin scheme is run:

  • All the nodes running replicas compact themselves. During this time (eg 30 minutes on our 6 million document/2 shard system) the replicas are unavailable, so all requests are routed to the master.
  • 3 hours later, the master abdicates (one of the replicas is voted the new master) and then compacts itself.

Note that an implication of the above scheme is that the master changes every week. This should not be an issue because all client access to the database is via a "mongos" process that routes requests dynamically.

Note finally that compaction is not automatically run for databases that don't use replicas (eg single node databases). It must be run manually during planned outages.
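For such systems, a minimal manual sketch (assuming the mongod is listening on the default port 27017; "doc_metadata.metadata" is just one of the standard collections, repeat for each large collection) is:
mongo localhost:27017/doc_metadata --eval "printjson(db.runCommand({ compact: 'metadata' }))"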

More details are provided about the compaction scripts in the "DB backups" section below.

infinite-db: (extra MongoDB file management), every day at 2am
02 02    * * *     root    find /data/*/moveChunk/ -mtime +5 -exec rm {} \;

These are control files that don't get deleted by MongoDB for some reason and can fill up the disk. This task just deletes old ones (newer ones are reportedly useful for error recovery if the DB goes down really badly).

infinite-db: "/opt/db-home/rotate-logs-script.sh", midnight every day
00 01    * * *   root    /opt/db-home/rotate-logs-script.sh

Rotates the MongoDB logs (the current set of logs is maintained in "/var/log/mongo/"; once archived they are copied to "/data/log").

tomcat6-interface-engine: (check the interface engine is still running, restart if not), every minute
if ! pgrep  -f org.apache.catalina.startup.Bootstrap > /dev/null && [ -f /var/lock/subsys/tomcat6-interface-engine ]; then service tomcat6-interface-engine start; fi

On platforms with marginal amounts of memory, the tomcat interface engine occasionally stops running. This check just bounces it if that occurs.

infinite-index-engine: (check the index engine is still running, restart if not), every minute
if ! pgrep  -f org.elasticsearch.bootstrap.ElasticSearch > /dev/null && [ -f /var/lock/subsys/infinite-index-engine ]; then service infinite-index-engine start; fi

On platforms with marginal amounts of memory, elasticsearch occasionally stops running. This script just bounces it if that occurs.

infinite-index-engine: "/opt/elasticsearch-infinite/master_backup_index.sh", 2am every night
00 02 * * * root        /opt/elasticsearch-infinite/master_backup_index.sh

See under "Backup scripts"

Log file related

infinite-logging: (various Infinit.e log cleanup), daily
00 00    * * *   root   find /opt/infinite-home/logs -mtime +30 -a -name "*.log.*" | xargs rm -f
01 00    * * *   root   if [ -d /opt/tomcat-infinite/index-engine/logs ]; then find /opt/tomcat-infinite/index-engine/logs -mtime +30 -a -name "*.log" | xargs rm -f; fi
02 00    * * *   root   if [ -d /opt/tomcat-infinite/interface-engine/logs ]; then find /opt/tomcat-infinite/interface-engine/logs -mtime +30 -a -name "*.log" | xargs rm -f; fi

Just clears out Infinit.e log files that are older than 30 days.

infinite-logging: (other log cleanup), daily
03 00    * * *   root   if [ -d /var/log/elasticsearch ]; then find /var/log/elasticsearch -mtime +30 -a -name "*.log.*" | xargs rm -f; fi
04 00    * * *   root   if [ -d /var/log/hadoop/ ]; then find /var/log/hadoop/ -mtime +10 -a -name "*log*" | xargs rm -f; fi
04 00    * * *   root   if [ -d /var/log/hadoop/userlogs ]; then find /var/log/hadoop/userlogs -mtime +10 -a -name "job*" | xargs rm -f; fi
04 00    * * *   root   if [ -d /var/log/hadoop/history ]; then find /var/log/hadoop/history -mtime +2 -a -name "*.xml" | xargs rm -f; fi

For elasticsearch and Hadoop logs. The Hadoop logs are cleaned out more frequently (10 days for most logs, 2 days for history logs) because they tend to get large quickly.

Alert related

infinite-logging: (AlchemyAPI check) "/opt/infinite-home/scripts/AlchemyLimitExceededAlert.python", hourly 
00 *     * * *   root   if [ -f /opt/splunk/bin/splunk ]; then /opt/splunk/bin/splunk cmd python /opt/infinite-home/scripts/AlchemyLimitExceededAlert.python; fi

This is a somewhat obsolete call that alerts the "mail.username" email address (see configuration properties) if AlchemyAPI is being used for NLP generation, and the daily transaction allowance is exceeded.

Only works if Splunk is installed on the node. Installing Splunk into Infinit.e to improve the internal logging capabilities is discussed here.

infinite-logging: (check for spikes in API time) "/opt/infinite-home/scripts/APITimeAlert.python", daily at 8pm 
00 20    * * *   root   if [ -f /opt/splunk/bin/splunk ]; then /opt/splunk/bin/splunk cmd python /opt/infinite-home/scripts/APITimeAlert.python; fi

Sends an email to the "mail.username" email address (see configuration properties) if the API search time over the day is >=2s higher than its average over the preceding 7 days (or is ever higher than 5 seconds). This indicates either abnormal usage, or perhaps that the cluster size should be increased or the data size decreased.

Only works if Splunk is installed on the node. Installing Splunk into Infinit.e to improve the internal logging capabilities is discussed here.

infinite-logging: (report weekly API times) "/opt/infinite-home/scripts/WeeklyAPITimeStatus.python", weekly at Sunday 2pm
00 14    * * 0   root   if [ -f /opt/splunk/bin/splunk ]; then /opt/splunk/bin/splunk cmd python /opt/infinite-home/scripts/WeeklyAPITimeStatus.python; fi

Every week sends an email to the "mail.username" email address (see configuration properties) providing some useful information on the search times for this node (averaged weekly for the last 4 weeks), eg:

Week_Beginning  Weekly_Average(sec)  Weekly_Lucene_Average(sec)  Weekly_Mongo_Average(sec)  Weekly_Proc_Average(sec)  Weekly_Setup_Average(sec)
05/12/13        0.717744113          0.313216641                 0.000039246                0.390291209               0.007057300
05/19/13        0.662515957          0.284186170                 0.000022340                0.367243617               0.003074468
05/26/13        1.637945346          0.346122972                 0.000029889                1.282366354               0.002587532
06/02/13        0.914491329          0.373391412                 0.000033031                0.502286540               0.030908340

Where:

  • Weekly_Average: is the total average search time, ie the sum of the other averages
  • Weekly_Lucene_Average: is the amount of search time spent in elasticsearch/Lucene calls
  • Weekly_Mongo_Average: is the amount of time spent in the MongoDB query
    • Note this is not very useful, all the Mongo time is actually spent in the cursor operations that are maintained under Weekly_Proc_Average below
  • Weekly_Proc_Average: is the amount of time spent in the internal document retrieval and scoring routines
    • Note that in practice this time is heavily dominated by the MongoDB cursor operations
  • Weekly_Setup_Average: is the amount of time creating connections to elasticsearch and MongoDB.

As a rule of thumb, if the Weekly_Lucene_Average starts to climb then more API nodes would be helpful; if the Weekly_Proc_Average starts to climb then more DB nodes would be helpful. Weekly_Lucene_Average can also be reduced at the expense of some accuracy by setting "api.aggregation.accuracy" to "low" (see configuration, section 2.6).
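For example, a sketch of the corresponding line in "/opt/infinite-home/config/infinite.service.properties" (assuming the standard Java properties syntax used by that file):
api.aggregation.accuracy=low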

Only works if Splunk is installed on the node. Installing Splunk into Infinit.e to improve the internal logging capabilities is discussed here.

infinite-logging: (report the harvester message counts) "/opt/infinite-home/scripts/WeeklyExtractorStatus.python", weekly at Sunday 2pm
00 14    * * 0   root   if [ -f /opt/splunk/bin/splunk ]; then /opt/splunk/bin/splunk cmd python /opt/infinite-home/scripts/WeeklyExtractorStatus.python; fi

A report that lists the number of documents harvested by day for the last week, and emailed to the "mail.username" email address (see configuration properties), eg:

Day       max_num_of_sources_harvested  min_num_of_sources_harvested  num_of_docs_extracted  num_of_source_errors  num_of_url_errors
06/02/13  99                            0                             11188                  12                    6935

Where:

  • min/max_num_of_sources_harvested: the min/max number of sources handled across all harvest cycles (a cycle is typically 5-10 minutes; eg in the case above, at least one cycle handled 99 sources and at least one cycle did nothing).
  • num_of_docs_extracted: the total number of documents extracted across all harvest cycles that day
  • num_of_source_errors: the number of serious source-level errors (eg authorization errors for file access, 4xx errors for RSS access) - normally not many, since a source error results in the source being suspended for the rest of the day
  • num_of_url_errors: the number of individual documents that errored (whether it be because the document could not be retrieved, or the enrichment pipeline failed for some reason, eg wrong language, too much/little content etc)

(Note that this report does not currently report "newer" features such as document updates).

Only works if Splunk is installed on the node. Installing Splunk into Infinit.e to improve the internal logging capabilities is discussed here.

infinite-logging: (Check the API health) "/opt/infinite-home/scripts/APINumResultsCheck.sh", daily between 6am and 9pm
00 6-21    * * *   root service tomcat6-interface-engine status | grep -q 'is running' && sh /opt/infinite-home/scripts/APINumResultsCheck.sh localhost:8080/api

If an API server is running on the node, this issues a request (randomly picked from the comma-separated list of query terms in "api.search.test.terms" in section 2.6 of the configuration) on behalf of "test.user" in all of that user's communities (which can be controlled by admins), and compares the returned results against "api.search.expected.results". If too few results are returned (or the API call returns an error) then an email is sent to the "mail.username" email address (see configuration properties).
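A sketch of the relevant properties in "/opt/infinite-home/config/infinite.service.properties" (the terms and count shown are purely illustrative):
api.search.test.terms=healthcare,finance,technology
api.search.expected.results=10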

Harvest related

infinite-px-engine: (Restart the processing-engine if it has crashed), every minute
* * * * * root  if [ ! -f /opt/infinite-home/bin/STOPFILE ]; then service infinite-px-engine watchdog; fi

(Note does nothing if the control file STOPFILE exists - see here under "Controlling the Processing Engine" for more details on control files).

infinite-px-engine: (Regulate the index size in the event of node creation/removal), every minute
* * * * * root  curl -s 'http://localhost:8080/api/auth/login/ping/ping' > /dev/null

(fails silently if the API is not up)

This dummy API call (which fails with a 400/500 code if the DB or Index are not working - handy for quick system diagnostics) ensures that each node has a mirror of the entity and association indexes for performance.

infinite-px-engine: (Queue DB/index synchronization), every hour
00 *  * * * root touch /opt/infinite-home/bin/SYNC_FILE

See here under "Controlling the Processing Engine" for more details.

infinite-px-engine: (Reset bad sources), daily at midnight
#00 00 * * * root touch /opt/infinite-home/bin/RESET_FILE
00 00 * * * root /opt/infinite-home/bin/reset_bad_harvest.sh

(The two variants are functionally equivalent)

Sources that appear to be generating mostly transient errors (eg RSS sources returning lots of 5xx HTTP errors) are suspended for a day (setting "harvestBadSource":"true" in the source JSON). This call unsuspends them.

infinite-px-engine: (Temporal aggregation), daily at 4am
00 04    * * *   root   /opt/infinite-home/bin/generate_temporal_aggregations.sh

See below, under "Batch aggregation".

infinite-px-engine: (Weekly batch aggregation recalculation) "/opt/infinite-home/bin/sync_features.sh", Sunday morning from 1.30am
30 01    * * 7   root   if [ ! -f /opt/infinite-home/bin/STOP_BATCH_SYNC_FILE ]; then service infinite-px-engine stop; fi
00 02    * * 7   root   if [ ! -f /opt/infinite-home/bin/STOP_BATCH_SYNC_FILE ]; then /opt/infinite-home/bin/sync_features.sh; fi
30 02    * * 7   root   if [ ! -f /opt/infinite-home/bin/STOP_BATCH_SYNC_FILE ]; then service infinite-px-engine start; fi

(Note the weekly batch aggregation is not run if the control file "STOP_BATCH_SYNC_FILE" exists, see here under "Controlling the Processing Engine" for more details on control files).

At 1.30am all harvest nodes have their harvester turned off. This gives them time to close down. 

At 2am the "sync_features.sh" script is started. This is described below, under "Batch aggregation". Note that it writes the date to "feature.sync_lock" in MongoDB - this stops harvesters from starting up until the job is complete.

At 2.30am the harvesters are started up again (but in practice won't start until "feature.sync_lock" is removed).
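A quick way to check whether the lock is still held (a sketch, assuming mongos is on the default port and that the lock lives in the feature.sync_lock collection as described above):
mongo localhost:27017/feature --eval "printjson(db.sync_lock.findOne())"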

This batch job can take several hours on larger clusters - on our 4 DB node cluster (2 replicas) with ~6 million documents, it takes 3 hours to run. It uses the Mongo map-reduce construct, which splits across shards but not replicas.

infinite-px-engine: (report source statuses) "/opt/infinite-home/bin/weekly_sources_report.sh", weekly at Sunday 2pm
00 14    * * 0   root    /opt/infinite-home/bin/weekly_sources_report.sh

Generates a weekly report (per cluster not per node) containing the following:

  • "HARVEST: NEW"
    • A list of sources that previously were not errored but as of this time are errored
  • "HARVEST: FIXED"
    • A list of sources that were previously errored but as of this time are working
  • "HARVEST: OLD, HARVEST: ERROR, APPROVED: TRUE"
    • A list of sources that are still in error (and are approved)
  • "HARVEST: OLD, HARVEST: SUCCESS, APPROVED: FALSE"
    • A list of sources that have successfully run but are currently not approved.

Note the approved true/false is not currently very useful - it used to be that commonly erroring sources had their approved flag set to false to prevent them from running - but now this is achieved by setting the "searchCycle_secs" field to be a negative number.

infinite-px-engine: (Checks for Hadoop jobs to run) "/opt/infinite-home/bin/custommr.sh", every minute
*/1 * * * * tomcat      if [ ! -f /opt/infinite-home/bin/STOP_CUSTOM ]; then /opt/infinite-home/bin/custommr.sh; fi

(Note that this can be suspended by the control file "STOP_CUSTOM", see here under "Controlling the Processing Engine" for more details on control files).

This calls the core server JAR, which will check if there are any Infinit.e-scheduled Hadoop jobs (eg from the GUI or from the API) pending or completed.

Backup scripts

Lucene Backups (elasticsearch)

In practice it is not quite clear what issues might arise from restoring the index as described in this section; potential issues include:

  • The DB backup process also occurs at 1am daily, so the index and DB will be slightly out-of-sync
  • The harvester is not stopped during the backup, so there may be issues with data moving between shards

In general - this backup/restore process is only recommended in a couple of scenarios:

  • In an emergency, to get up and running as quickly as possible
  • When it is known that the cluster has not been harvesting since the index backup was taken

Otherwise (and this is what I have always done), just restore the database and then use the "mongo-indexer" command line utility to rebuild the indexes from scratch.

/opt/elasticsearch-infinite/master_backup_index.sh

This performs 1-3 activities (scheduled from cron to run 2am every night, see above under "Regularly scheduled jobs"):

  • Every day - creates a backup of the local index, in /opt/elasticsearch-infinite/backups/index_backup_<<cluster_name>>_<<hostname>>_latest.tgz
    • (note translog flushing is disabled during the backup)
    • (note the previous backup will be overwritten by the new one every night)
  • Only if the parameter "s3.url" exists in "/opt/infinite-home/config/infinite.service.properties" (and "s3cmd" has been configured, run "s3cmd --configure" if not):
    • Every day
      • Uploads the backed-up index to the S3 bucket called "elasticsearch.<<s3.url>>", both as index_backup_<<cluster_name>>_<<hostname>>_latest.tgz (ie overwriting the previous "most recent") and index_backup_<<cluster_name>>_<<hostname>>_<<DAY_OF_WEEK>>.tgz
      • (as a result, a rolling previous ~7 days of indexes are stored for each index)
    • Only on Sundays
      • Uploads the backed-up index to the S3 bucket called "backup.elasticsearch.<<s3.url>>" with the name index_backup_<<cluster_name>>_<<hostname>>_<<WEEK_OF_YEAR>>.tgz
        • (It is recommended that this bucket be placed in a different region for redundancy)
Restoring the index from a backup

The backup's root is "/". Therefore, to restore the index (see the consolidated sketch after this list):

  • Stop elasticsearch
    • "service infinite-index-engine stop"
  • "rm -rf /opt/elasticsearch-infinite/data"
  • "cd /"
  • "tar xzvf <<backup file>>"
  • Restart elasticsearch:
    • "service infinite-index-engine start"

DB backups (mongodb)

/opt/db-home/master_backup_script.sh

This performs 1-3 activities (scheduled from cron to run 1am every night except Saturday, see above under "Regularly scheduled jobs"):

  • Every day - creates a backup of any shards for which the host is the master, in /opt/db-home/backups/db_backup_<<cluster_name>>_<<hostname>>_latest.tgz (for config DBs: "db_backup_<<hostname>>_latest_27016.tgz")
    • (note shard re-balancing is disabled during the backup - though this doesn't work properly because each shard sets/unsets the global balancer)
    • (note the previous backup will be overwritten by the new one every night)
  • Only if the parameter "s3.url" exists in "/opt/infinite-home/config/infinite.service.properties" (and "s3cmd" has been configured, run "s3cmd --configure" if not):
    • Every day
      • Uploads the backed-up database file to the S3 bucket called "mongo.<<s3.url>>" both as db_backup_<<cluster_name>>_<<hostname>>_latest.tgz (ie overwriting the previous "most recent") and db_backup_<<cluster_name>>_<<hostname>>_<<DayOfMonth>>.tgz
      • (as a result, a rolling previous ~30 days of database backups are stored)
    • Only on Sundays
      • Uploads the backed-up database file to the S3 bucket called "backup.mongo.<<s3.url>>" with the name db_backup_<<cluster_name>>_<<hostname>>_<<WeekOfYear>>.tgz
        • (It is recommended that this bucket be placed in a different region for redundancy)

Note that this is not quite the MongoDB approved backup process - which is described here. The most significant difference is that the backup of the config database is not directly synchronized with the data backup, only by the timing of the cron job.

In addition, the replica set is not locked (by shutting down a secondary on each one of the replicas). Also "point in time" backups are not supported. The net effect is the risk that the shards will be slightly out of sync with one another - but there are not really any consistency requirements across the different tables/shards, so the current strategy should be adequate.

/opt/db-home/backup_script.sh <<port>>

This performs the low level backup operations, called from "master_backup_script.sh" with the port (eg 27016 for the config DB server, (27017 + shard) for the data replicas). It does nothing on any replica other than the master.
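For example (normally this is only invoked by "master_backup_script.sh"; the port shown assumes shard 1 following the 27017 + shard convention above):
sh /opt/db-home/backup_script.sh 27018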

/opt/db-home/restore_script.sh

This is currently not a useful script and should not be used. There is no automated way of restoring backups, administrators should decide what to do and then use "mongorestore" manually. There are a couple of options for the manual restore process:

  • Smaller clusters, to an empty DB:
    • For each shard do a mongorestore on the entire backup (just untar it) directly into a mongos on one of the nodes ... the config DB backup can be ignored, it will be recreated during the restore.
      • (NOTE: if a directory called "config" exists you must delete it)
  • Larger clusters, to an empty DB:
    • Follow the steps described here 
  • Sub-sections of the entire database (eg one of the databases/collections has been corrupted)
    • If an unsharded database/collection:
      • mongorestore using "-d" and optionally "-c" directly into mongos on any one of the nodes.
    • If a sharded collection:
      • If "small enough": (what this means in practice is unclear, try this way first)
        • For each shard, mongorestore using "-d" and optionally "-c" directly into mongos on any one of the nodes.
      • If "larger": 
        • Follow the steps described here but only for the one database/collection. 
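To illustrate the unsharded single-collection case above, a minimal sketch (the backup path is illustrative; "doc_metadata.metadata" is one of the standard collections):
mongorestore --host localhost --port 27017 -d doc_metadata -c metadata /tmp/db_restore/doc_metadata/metadata.bson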
/opt/db-home/sync_from_master.sh

This is an old script and should not be used. A restore guide is provided above.

/opt/db-home/master_compact_script.sh
/opt/db-home/compact_script.js

This is not strictly speaking a backup process, but it is included here anyway.

These scripts are called from a cron job and are described in outline under the database compaction entry of "Regularly scheduled jobs" above.

"master_compact_script" is called on Saturday night with one parameter: 1 for replicas (midnight), and 2 for masters (3am). The actual compaction is perfomed by a js file (compant_script.js) invoked using the mongo shell.

"master_company_script.sh 2" will result in the master abdicating and one of the replicas being elected in its place. Because of this, the master role will rotate through the replicas every week.

Batch aggregation (IN PROGRESS)

Batch scripts (temporal, doc count, entity counts)

(also aggregation rebuild)

Others (IN PROGRESS)

Standalone harvester and standalone custom processing (runuser etc)

Config update scripts

Initialization scripts?

infdb, setupAdminShards, start_balancer, sync_from_master

(infdb_aws NOTE the config_ips_override script)

TODO various log files, including security/failure logs generated by apache and tomcat

SYSCONFIG SCRIPTS (%config) - including infinite-index-engine-custom, plus fact that tomcat6-interface-engine one is unusual (puts export in front of every line!)

TODO enterprise httpd stuff


Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons