Infinit.e EC2 Installation Guide

Overview

The following diagram (click zoom to expand) shows the recommended configuration for running one or more clusters, each consisting of multiple nodes (2+, though we recommend 4+: ie 2+ API nodes and 2+ DB nodes).

Note that sharding is not fully supported (or at least not fully tested) as of the March 2012 release. Apart from one weekly maintenance script (which is awaiting a new MongoDB feature), we believe it should work. In any case, sharding has not proven necessary at up to 3M documents indexed. Assuming sharding is enabled, the top-level design page explains how the system scales.

As an alternative to load balancers, DNS round robin load balancing (using Amazon's Route 53) has also been tested and works well.

The remaining sections describe the steps necessary to get up and running. Note that steps 3-5 can be performed in any order, and it is not necessary to finish one step before starting the next. API nodes can also be added to the load balancer before they are fully installed (they will appear as out of service until the system is working).

Step 1: Configure AWS settings

There are 3 things that need to be done in the AWS management console to prepare for the Infinit.e installation:

  • Set up a (free) CloudFormation account (simply navigate to the CloudFormation tab in the management console and follow the instructions)
  • Set up security groups and keys
  • Set up S3 storage for index and DB backups

Security groups and keys

The only port that needs to be open is port 80, though allowing ssh (at least from authorized IP addresses) is standard practice.

There is no functional need to separate the different clusters into different groups, but there are obvious safety/security reasons, eg to stop someone logged in to cluster "X" from deliberately or inadvertently accessing the technology stack on cluster "Y".

So having one group per cluster that disallows other internal traffic (eg from 10.*.*.*) is probably desirable (note that nodes within the same group have unrestricted access to each other, which is desirable).

An even stricter configuration would be to have 2 groups per cluster, one for API nodes and one for DB nodes, allowing only ports 27017 and 27016 between them.
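
For example, the stricter two-group configuration can be scripted. The following is a minimal sketch using the boto Python library; the group names, admin IP range and credentials are all illustrative assumptions:

# Sketch: per-cluster security groups via boto (names, IP range and
# credentials below are illustrative assumptions)
import boto

conn = boto.connect_ec2(aws_access_key_id="ACCESS_KEY",
                        aws_secret_access_key="SECRET_KEY")

# One group for API nodes, one for DB nodes (the stricter configuration)
api_group = conn.create_security_group("clusterX-api", "Cluster X API nodes")
db_group = conn.create_security_group("clusterX-db", "Cluster X DB nodes")

# API nodes: HTTP from anywhere, ssh from an authorized range only
api_group.authorize("tcp", 80, 80, "0.0.0.0/0")
api_group.authorize("tcp", 22, 22, "203.0.113.0/24")

# DB nodes: allow only the MongoDB ports 27016-27017, and only from API nodes
db_group.authorize("tcp", 27016, 27017, src_group=api_group)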

Each distinct EC2 keypair allows a different set of admins/maintainers/developers to access a cluster, so if you want to partition machines between different people in your organization, create keypairs accordingly.

S3 storage

Given a root S3 path (call it S3ROOT), eg "infinit.e-saas.ikanow.com", which is entered into the "infinit.e.configuration.properties" file (see below), the following buckets are required:

  • mongo.<S3ROOT>: daily database backups, put in the same region as the cluster.
  • elasticsearch.<S3ROOT>: daily index backups, put in the same region as the cluster.
  • backup.mongo.<S3ROOT>: weekly database backups, put in a different region (and ideally country) to the cluster.
  • backup.elasticsearch.<S3ROOT>: weekly index backups, put in a different region (and ideally country) to the cluster.

It is also recommended to set up a folder for holding configuration files (eg the "infinit.e.configuration.properties" file described below), eg "config.<S3ROOT>". Both the default DB and API node templates (see steps 4 and 5) require such an S3 location to be specified.
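
If you prefer to script the bucket creation, here is a minimal boto sketch; the S3ROOT value and the choice of Location.EU as the "different region" are illustrative assumptions:

# Sketch: create the backup buckets via boto (bucket root and regions are
# illustrative assumptions)
import boto
from boto.s3.connection import Location

conn = boto.connect_s3(aws_access_key_id="ACCESS_KEY",
                       aws_secret_access_key="SECRET_KEY")
s3root = "infinite.myorg.com"

# Daily backups: same region as the cluster (here the default, US East)
conn.create_bucket("mongo." + s3root, location=Location.DEFAULT)
conn.create_bucket("elasticsearch." + s3root, location=Location.DEFAULT)

# Weekly backups: a different region (eg the EU)
conn.create_bucket("backup.mongo." + s3root, location=Location.EU)
conn.create_bucket("backup.elasticsearch." + s3root, location=Location.EU)

# Recommended: a bucket for configuration files
conn.create_bucket("config." + s3root, location=Location.DEFAULT)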

Step 2: Create the "infinit.e.configuration.properties" file

A single file is used to populate the configuration files for all the custom and standard technologies used in Infinit.e: "infinit.e.configuration.properties". A template for this file can be obtained here.

A full description of the fields within "infinit.e.configuration.properties" is provided here, but the EC2-specific automated configuration makes populating it considerably easier than in the general case. The remainder of this section describes the EC2-specific configuration.

Generic parameters
################################################################################
# Amazon services properties
# If deployed on an EC2 cluster set this to 1:
use.aws=1
# This is the root s3 bucket name to be used for backups:
# The "s3.url" parameter corresponds to the "S3ROOT" described in "Step 1" above
s3.url=infinite.myorg.com

AWS access information
################################################################################
# Amazon AWS Settings
################################################################################
# AWS keys (only needed if use.aws=1)
aws.access.key=ACCESS_KEY
aws.secret.key=SECRET_KEY
# Used for s3cmd, see their web page for details
s3.gpg.passphrase=none

Obviously these should be set to your Amazon keys.

Cluster information
################################################################################
# Cluster name and URL
# Any unique name within the EC2 cluster/subnet: 
# eg infinite-cluster1
elastic.cluster=CLUSTER_NAME
################################################################################
# Discovery mode = ec2 (if running on AWS) or zen (specify a list of IPs below):
elastic.node.discovery=ec2

# (once "elastic.node.discovery" has been set to "ec2", "elastic.search.nodes" can be ignored - the discovery will happen automatically)
#elastic.search.nodes=
# Also these DB configuration params can be ignored:
################################################################################
# MongoDB Properties
#db.cluster.subnet=
#db.config.servers=
#db.replica.sets=

In EC2 mode, the "elastic.cluster" string must be the same for all nodes (API and DB) in the cluster. It controls three things:

  • It enables the API nodes to discover each other
  • It enables the DB nodes to discover each other
  • It enables the API nodes to discover their DB
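
As a sanity check once the nodes are up, you can query elasticsearch's cluster health API from any node and confirm that the node count matches your cluster size. A minimal sketch (the IP address is a placeholder for any node in the cluster):

# Sketch: verify EC2 discovery via the elasticsearch health API
# (the IP address below is a placeholder)
import json
import urllib2

health = json.load(urllib2.urlopen("http://10.0.0.1:9200/_cluster/health"))
print("cluster=%s nodes=%s status=%s" %
      (health["cluster_name"], health["number_of_nodes"], health["status"]))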

Step 3: Start a load balancer

Amazon Elastic Load Balancers have non-configurable timeouts (eg 60 seconds), which can cause problems for some Infinit.e operations, such as testing and deleting sources and documents.

You can request that Amazon increase the timeout on their EC2 forums, and they will normally do so within a day or two (see this example forum post).

An alternative is to use the load balancer only to provide automated health-checking of the API, and to use Amazon's DNS service, Route 53, for round-robin load balancing (eg delegating the "rr" subdomain of ikanow.com: useful link).
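
For reference, the round-robin records can also be created programmatically. A minimal boto sketch, in which the hosted zone ID, record name and node IPs are illustrative assumptions:

# Sketch: round-robin A record in Route 53 via boto (zone ID, record
# name and IPs are illustrative assumptions)
import boto
from boto.route53.record import ResourceRecordSets

conn = boto.connect_route53(aws_access_key_id="ACCESS_KEY",
                            aws_secret_access_key="SECRET_KEY")

changes = ResourceRecordSets(conn, "ZONE_ID")
record = changes.add_change("CREATE", "rr.myorg.com.", "A", ttl=60)
record.add_value("10.0.0.1")  # API node 1
record.add_value("10.0.0.2")  # API node 2
changes.commit()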

We provide a template for this (here), though the AWS management console interface is just as good; the only custom parameter is the health check target, which should be set to "HTTP:80/api/auth/login/ping/ping".

When using the template, the display name cannot be changed, which is irritating but not that important.

To start using the template:

  1. Navigate to the CloudFormation tab in the AWS management console.
  2. Select "Create New Stack"
  3. Either upload the template (if you've modified it) via "Upload a Template file", or specify its location in "Provide a Template URL".
  4. Select a "Stack Name" and click Next/Finish where prompted.
  5. The Load Balancer URL can be found either from the "Output" tab in CloudFormation or from the EC2 tab, then the navigation bar "NETWORK & SECURITY" > "Load Balancers".
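
The steps above can also be scripted via boto's CloudFormation support. A minimal sketch, in which the stack name and template URL are illustrative assumptions:

# Sketch: create the load balancer stack via boto (stack name and
# template URL are illustrative assumptions)
import boto

cfn = boto.connect_cloudformation(aws_access_key_id="ACCESS_KEY",
                                  aws_secret_access_key="SECRET_KEY")

template = "https://s3.amazonaws.com/config.infinite.myorg.com/lb-template.json"
cfn.create_stack("infinite-lb", template_url=template)

# Once creation completes, the load balancer URL appears in the stack outputs
stack = cfn.describe_stacks("infinite-lb")[0]
for output in stack.outputs:
    print("%s = %s" % (output.key, output.value))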

Note that while it would have been nice to have API nodes automatically connect themselves to the Load Balancer on start-up, this is not currently possible with CloudFormation except via AWS "Auto Scaling", which does not have a manual override (and also does not map well onto resource provisioning in Infinit.e).

Step 4: Start database nodes

The precise steps vary depending on how the config server node is deployed:

  • The standard deployment is to run 1 or 3 (or 5) standalone config servers (generally on very cheap micro instances).
  • For smaller or test deployments, a single config server can be co-located with one of the DB nodes. 

As noted above, it is likely that you will be running unsharded deployments, both because even pretty large clusters (with many API nodes) still perform well with only 2 DB nodes in 1 replica set, and because as of this release sharding is largely untested operationally.

Step 4 - Scenario 1: DB nodes with 1 co-located config server

As for the load balancer, navigate to the "CloudFormation" tab, select "Create New Stack", upload/link to the template (single node or replica pair), select a "Stack Name" (for display only) and then "Next" to the configuration parameters.

The following fields must be populated:

  • ClusterName: the cluster name, should match the "infinit.e.configuration.properties" file.
  • IsConfigSvr: should be set to "1" for the first node created, "0" after that (for combined config server/DB scenarios only). 
    • Note that once one config server has been started like this, adding extra config servers will stop new DB nodes from starting successfully.
  • ReplicaSetIds: For unsharded deployments (as set in "infinit.e.configuration.properties"; almost certainly what you will be running), just leave as 1 all the time. For sharded deployments, use "1" for the first 2 nodes, "2" for the second 2 nodes, etc.
    • It is also possible to make a node join multiple replica sets by setting a comma-separated list, eg "1,2,3" to belong to 3 replica sets (one DB process is created per replica set). This is not recommended for typical usage, but could be useful eg to use a single node for multiple "slaves" (the low performance won't matter because they'll never be queried in practice).
  • NodeName: The name displayed in the EC2 instances. For the replica pair template, the actual names are "<NodeName>-1" and "<NodeName>-2".
  • ConfigFileS3Path: the location of the "infinit.e.configuration.properties" file in your S3 storage.
  • AwsAccessId: The AWS ID/Access Key for your Amazon account.
  • AwsAccessKey: The AWS Key/Secret Key for your Amazon account.
  • AvailabilityZone: Must be consistent with the availability zone from which the stack was launched (top left of CloudFormation tab)
  • SecurityGroups: Set the security group from Step 1.
  • KeyName: Set the key from Step 1.

The following fields are populated sensibly by default, but can be changed:

  • InstanceType: Defaults to "m1.xlarge", which is what you want for any decent sized deployment; use "m1.large" for test/demo clusters. Note that if "m1.xlarge" then RAID is automatically installed on startup (which takes about 10 minutes).
  • IsStorageNode: (leave as 1).
  • QuickInstall: Defaults to "--fast", which saves 15 minutes on node start-up but will not update the system packages from whatever AMI is in use. Set to "--slow" instead for a more up-to-date OS.

Note that in practice you will probably want to override the default templates, so that standard fields like ClusterName (unless you have multiple clusters in the same AWS account), ConfigFileS3Path, AwsAccessId, AwsAccessKey, AvailabilityZone, SecurityGroups and KeyName (ie basically everything!) are given default values and can normally be ignored.
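
Scripting this follows the same pattern as for the load balancer. A minimal boto sketch, in which every parameter value is an illustrative assumption:

# Sketch: start a DB node stack via boto (all values are illustrative
# assumptions)
import boto

cfn = boto.connect_cloudformation(aws_access_key_id="ACCESS_KEY",
                                  aws_secret_access_key="SECRET_KEY")

cfg = "s3://config.infinite.myorg.com/infinit.e.configuration.properties"
params = [("ClusterName", "infinite-cluster1"),
          ("IsConfigSvr", "1"),    # "1" for the first node only
          ("ReplicaSetIds", "1"),  # leave as "1" when unsharded
          ("NodeName", "infinite-db"),
          ("ConfigFileS3Path", cfg),
          ("AwsAccessId", "ACCESS_KEY"),
          ("AwsAccessKey", "SECRET_KEY"),
          ("AvailabilityZone", "us-east-1a"),
          ("SecurityGroups", "clusterX-db"),
          ("KeyName", "clusterX-admin")]

template = "https://s3.amazonaws.com/config.infinite.myorg.com/db-template.json"
cfn.create_stack("infinite-db-1", template_url=template, parameters=params)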

Step 4 - Scenario 2: Standalone config servers

First start the 1/3/5 config servers. There are specific templates for single and three-node configurations (the 5-node case is an easy tweak to the existing templates, if needed). The config server parameters are the same as for DB nodes, but without the unnecessary ReplicaSetIds, IsConfigSvr and IsStorageNode.

The config server Cloudformation template also creates a DNS entry in Route53 for a user-specified Hosted Zone. This is necessary because of a bug in MongoDB where changing the hostname of a config server (eg because the EC2 instance becomes unstable so a new node must be created) requires a complete cluster restart (in order: shutdown API nodes, DB nodes, config nodes; startup config nodes, DB nodes, API nodes). The DNS entry is written into the EC2 metadata in the "DnsName" field.

The only other difference is that InstanceType is one of "t1.micro" or "m1.large". The micro instance should be fine in most cases (and is >10x cheaper).

Then start the main DB nodes, just as in Scenario 1, except:

  • IsConfigSvr should always be "0", otherwise system-wide problems will occur.
  • DnsName should be present, unique, and point via CNAME to the actual hostname, otherwise system-wide issues may occur

Step 5: Start API nodes

The API nodes can then be started. It is difficult to determine the required number of nodes in advance, because it depends heavily on usage patterns and the sort of documents being indexed. It is therefore recommended to start with 2 and add new ones if response times become too long.

To create a new API node, follow the usual steps: navigate to the "CloudFormation" tab, select "Create New Stack", upload/link to the API template, select a "Stack Name" and then "Next" to the configuration parameters.

The following fields must be populated:

  • ClusterName: the cluster name, should match the "infinit.e.configuration.properties" file.
  • NodeName: The name displayed in the EC2 instances
  • ConfigFileS3Path: the location of the "infinit.e.configuration.properties" file in your S3 storage.
  • AwsAccessId: The AWS ID/Access Key for your Amazon account.
  • AwsAccessKey: The AWS Key/Secret Key for your Amazon account.
  • AvailabilityZone: Must be consistent with the availability zone from which the stack was launched (top left of CloudFormation tab)
  • SecurityGroups: Set the security group from Step 1.
  • KeyName: Set the key from Step 1.

The following fields are populated sensibly by default, but can be changed:

  • InstanceType: Defaults to "m1.xlarge", which is what you want for any decent sized deployment; use "m1.large" for test/demo clusters. Note that if "m1.xlarge" then RAID is automatically installed on startup (which takes about 10 minutes).
  • QuickInstall: Defaults to "--fast", which saves 15 minutes on node start-up but will not update the system packages from whatever AMI is in use. Set to "--slow" instead for a more up-to-date OS.

As with the DB nodes, in practice you will probably want to override the default templates, so that standard fields like ClusterName (unless you have multiple clusters in the same AWS account), ConfigFileS3Path, AwsAccessId, AwsAccessKey, AvailabilityZone, SecurityGroups and KeyName (ie basically everything!) are given default values and can normally be ignored.
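
Following the recommendation to start with 2 nodes, the scripted equivalent might look like the following boto sketch (all values are illustrative assumptions; the parameters have the same meanings as in Step 4):

# Sketch: start two API node stacks via boto (all values are
# illustrative assumptions)
import boto

cfn = boto.connect_cloudformation(aws_access_key_id="ACCESS_KEY",
                                  aws_secret_access_key="SECRET_KEY")

cfg = "s3://config.infinite.myorg.com/infinit.e.configuration.properties"
template = "https://s3.amazonaws.com/config.infinite.myorg.com/api-template.json"

for i in (1, 2):
    cfn.create_stack("infinite-api-%d" % i,
                     template_url=template,
                     parameters=[("ClusterName", "infinite-cluster1"),
                                 ("NodeName", "infinite-api-%d" % i),
                                 ("ConfigFileS3Path", cfg),
                                 ("AwsAccessId", "ACCESS_KEY"),
                                 ("AwsAccessKey", "SECRET_KEY"),
                                 ("AvailabilityZone", "us-east-1a"),
                                 ("SecurityGroups", "clusterX-api"),
                                 ("KeyName", "clusterX-admin")])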

Step 6: Connect the API nodes to the load balancer

This is performed in standard fashion:

  • Navigate to the EC2 tab in the AWS management console
  • From the navigation sidebar, "NETWORK & SECURITY" > "Load Balancers"
  • Select the desired load-balancer
  • Press the green "+" in the top right of the "Instances" tab
  • Select the node based on its "NodeName" (shown in brackets next to the instance ID). 
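
Equivalently, the registration can be scripted. A minimal boto sketch (the load balancer name and instance IDs are placeholders):

# Sketch: register API nodes with the load balancer via boto
# (load balancer name and instance IDs are placeholders)
import boto

elb = boto.connect_elb(aws_access_key_id="ACCESS_KEY",
                       aws_secret_access_key="SECRET_KEY")

elb.register_instances("infinite-lb", ["i-12345678", "i-87654321"])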

Miscellaneous notes

You now have a fully operational Infinit.e cluster. Start adding sources and you can begin analysis. This link provides a quick example of getting a source imported in order to test/demonstrate the GUI.

Note that while CloudFormation stacks are primarily intended to start entire clusters, this is not practical for Infinit.e because the only way of adding or subtracting nodes is with Amazon Auto Scaling (ie not manually except by treating each node as a separate stack, as we do), and the available node addition/removal criteria do not map well onto how Infinit.e resource management works. Therefore each CloudFormation stack is normally a single node (apart from the 3-node config server and 2-node replica set "convenience" templates). 

For "quick installs", it takes about 15 minutes for "m1.xlarge" API nodes to start-up (10 minutes of this is the RAID setup - so "m1.large" take about 5 minues). DB nodes take 5-10 minutes longer (MongoDB initialization time).

If startup time is important then the base AMI provided can be extended manually (eg installing the RPMs by hand) and then saved as a new AMI. If RAID is required together with quick node start-up then EBS volumes will need to be used in place of the ephemeral storage, and similarly for pre-initialization of the DB.

Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons