The following diagram shows the recommended configuration for running multiple (ie 1+) clusters of multiple nodes (2+, but we recommend 4+: ie 2+ API nodes and 2+ DB nodes).
Note that sharding is not fully supported (or at least not fully tested) as of the March 2012 release. Apart from one weekly maintenance script (which is awaiting a new MongoDB feature), we believe it should work. Up to around 3M indexed documents, sharding is not necessary in any case.
The remaining sections describe the different steps necessary to get up and running. Note that steps 3-5 can be performed in any order, and it is not necessary to finish one step before starting the next. Also, API nodes can be added to the load balancer before they are complete (they will appear as out of service until the system is working).
There are 3 things that need to be done in the AWS management console to prepare for the Infinit.e install:
The only port that is needed is port 80, though allowing ssh, at least from authorized IP addresses, is standard practice.
There is no functional need to separate the different clusters into different groups, but there are obvious safety/security reasons to do so, eg to stop someone logged in to cluster "X" from deliberately or inadvertently accessing the technology stack on cluster "Y".
So having one group per cluster that disallows internal traffic from outside the group (eg from 10.*.*.*) is probably desirable (note that nodes within a group have unrestricted access to each other, which is what you want).
An even stricter configuration would be to have 2 groups per cluster, one for API nodes and one for DB nodes, only allowing access on ports 27017 and 27016 between them.
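For reference, the same rules can also be created from the command line; the following is a minimal sketch assuming the AWS CLI is installed and configured, and the group names and admin subnet are purely illustrative:

# Allow HTTP access to the API from anywhere
aws ec2 authorize-security-group-ingress --group-name infinite-cluster1 \
    --protocol tcp --port 80 --cidr 0.0.0.0/0

# Allow ssh, but only from an authorized admin subnet (illustrative CIDR)
aws ec2 authorize-security-group-ingress --group-name infinite-cluster1 \
    --protocol tcp --port 22 --cidr 203.0.113.0/24

# Stricter variant: separate API and DB groups, with the DB group only
# accepting MongoDB traffic (ports 27016-27017) from the API group
aws ec2 authorize-security-group-ingress --group-name infinite-dbnodes \
    --protocol tcp --port 27016-27017 --source-group infinite-apinodes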
Each distinct EC2 keypair allows a different set of admins/maintainers/developers to access a cluster, so if you want to partition different machines between different people in your organization, create security keys accordingly.
Given a root S3 path (call it S3ROOT), eg we might use "infinit.e-saas.ikanow.com" (which is entered into the "infinit.e.configuration.properties" file, see below), the following buckets are required:
It is also recommended to set up a folder for holding configuration files (eg the "infinit.e.configuration.properties" file described below), eg "config.<S3ROOT>". Both the default DB and API node templates (see steps 4, 5) require such an S3 location to be specified.
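The buckets and configuration folder can be created from the S3 tab of the management console or with s3cmd; a minimal sketch, assuming the example S3ROOT above (the bucket names are illustrative and should match your own S3ROOT and the bucket list above):

# One-off: point s3cmd at your AWS keys
s3cmd --configure

# Create the configuration bucket ("config." + S3ROOT)
s3cmd mb s3://config.infinit.e-saas.ikanow.com

# Later, upload the populated configuration file (described in the next step)
s3cmd put infinit.e.configuration.properties s3://config.infinit.e-saas.ikanow.com/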
A single file is used to populate the configuration files for all the custom and standard technologies used in Infinit.e: "infinit.e.configuration.properties". A template for this file can be obtained here.
A full description of the fields within "infinit.e.configuration.properties" is provided here, but the EC2-specific automated configuration makes populating it considerably easier than in the general case. The remainder of this section describes the EC2-specific configuration.
################################################################################
# Amazon services properties
# If deployed on an EC2 cluster set this to 1:
use.aws=1
# This is the root s3 bucket name to be used for backups:
# The "s3.url" parameter corresponds to the "S3ROOT" described in "Step 1" above
s3.url=infinite.myorg.com
################################################################################
# Amazon AWS Settings
################################################################################
# AWS keys (only needed if use.aws=1)
aws.access.key=ACCESS_KEY
aws.secret.key=SECRET_KEY
# Used for s3cmd, see their web page for details
s3.gpg.passphrase=none
Obviously these should be set to your Amazon keys.
################################################################################
# Cluster name and URL
# Any unique name within the EC2 cluster/subnet:
# eg infinite-cluster1
elastic.cluster=CLUSTER_NAME
################################################################################
# Discovery mode = ec2 (if running on AWS) or zen (specify a list of IPs below):
elastic.node.discovery=ec2
# (once "elastic.node.discovery" has been set to "ec2", "elastic.search.nodes" can be ignored - the discovery will happen automatically)
#elastic.search.nodes=
# Also these DB configuration params can be ignored:
################################################################################
# MongoDB Properties
#db.cluster.subnet=
#db.config.servers=
#db.replica.sets=
In EC2 mode, the "elastic.cluster" string must be the same for all nodes (API and DB) in the cluster. It controls three things:
We provide a template for this (here), though the AWS management console interface is just as good; the only custom parameter is the health check target, which should be set to "HTTP:80/api/auth/login/ping/ping".
Using the template, the display name cannot be changed, which is irritating but not that important.
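If the load balancer is created by hand rather than from the template, the health check can also be set from the command line; a minimal sketch assuming the AWS CLI and a load balancer named "infinite-lb1" (the name and thresholds are illustrative):

aws elb configure-health-check --load-balancer-name infinite-lb1 \
    --health-check Target=HTTP:80/api/auth/login/ping/ping,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2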
To start using the template:
Note that while it would have been nice to have API nodes automatically connect themselves to the Load Balancer on start, this is not currently possible with CloudFormation except via AWS "Auto Scaling", which does not have a manual override (and also does not map well onto resource provisioning in Infinit.e).
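As a result, each new API node has to be attached to the load balancer manually, either from the console (see the final step below) or from the command line; a sketch, with an illustrative load balancer name and instance id:

# Attach a newly created API node to the load balancer
aws elb register-instances-with-load-balancer --load-balancer-name infinite-lb1 \
    --instances i-0123456789abcdef0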
The precise steps vary depending on whether the config servers are standalone (recommended for operational deployments if sharding is enabled) or run on the same node (for unsharded/small/dev/test deployments). As noted above, it is likely that you will be running unsharded deployments, both because even pretty large clusters (with many API nodes) still perform well with only 2 DB nodes in 1 replica set, and because at release, sharding is largely untested operationally!
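For reference, the DB node templates handle the MongoDB setup automatically; the sketch below only shows what an unsharded 2-node replica set amounts to underneath (the replica set name and host names are illustrative), which can be useful when checking that the nodes have come up correctly:

# From the mongo shell on one of the two DB nodes (the templates normally do this for you)
mongo --eval 'rs.initiate({_id: "rs0", members: [
    {_id: 0, host: "db-node-1:27017"},
    {_id: 1, host: "db-node-2:27017"}]})'

# Verify that both members are healthy
mongo --eval 'printjson(rs.status())'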
As for the load balancer, navigate to the "CloudFormation" tab, select "Create New Stack", upload/link to the DB template, select a "Stack Name" and then "Next" to the configuration parameters.
The following fields must be populated:
The following fields are populated sensibly by default, but can be changed:
Note that in practice you will probably want to override the default templates, so that standard fields like ClusterName (unless you have multiple clusters in the same AWS account), ConfigFileS3Path, AwsAccessId, AwsAccessKey, AvailabilityZone, SecurityGroups and KeyName (ie basically everything!) are given sensible default values and can normally be ignored.
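The same parameters can also be supplied when creating the stack from the command line instead of the console; a sketch using the AWS CLI, in which the stack name, template URL and parameter values are all illustrative:

aws cloudformation create-stack --stack-name infinite-db-nodes \
    --template-url https://s3.amazonaws.com/config.infinit.e-saas.ikanow.com/db-node-template.json \
    --parameters ParameterKey=ClusterName,ParameterValue=infinite-cluster1 \
                 ParameterKey=ConfigFileS3Path,ParameterValue=config.infinit.e-saas.ikanow.com \
                 ParameterKey=KeyName,ParameterValue=my-ec2-keypair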
Note also that while CloudFormation stacks were designed to create entire stacks (eg load balancer, API nodes, replica sets), we only use them for individual elements (eg one for the load balancer, one for API nodes, one for DB nodes). This is because the CloudFormation templates do not allow addition (or, less importantly, removal) of nodes except via the unsuitable AWS auto scaling function.
First start the 1/3/5 config servers. This will require the same steps as above except:
(Alternatively, use the "DB Config Server" template provided.)
Then start the main DB nodes, again just as in Scenario 1, except:
The API nodes can then be started. It is difficult to estimate the required number of nodes in advance, because it depends heavily on usage patterns and the sort of documents being indexed. It is therefore recommended to start with 2 and add new ones if response times are too long.
To create a new API node, follow the usual steps: navigate to the "CloudFormation" tab, select "Create New Stack", upload/link to the API template, select a "Stack Name" and then "Next" to the configuration parameters.
The following fields must be populated:
The following fields are populated sensibly by default, but can be changed:
As with the DB nodes, in practice you will probably want to override the default templates, so that standard fields like ClusterName (unless you have multiple clusters in the same AWS account), ConfigFileS3Path, AwsAccessId, AwsAccessKey, AvailabilityZone, SecurityGroups and KeyName (ie basically everything!) are given sensible default values and can normally be ignored.
The same comments as for the DB nodes about using CloudFormation somewhat sub-optimally also hold here. It is particularly noticeable for API nodes because it results in one final manual step, discussed in the next section.
This is performed in standard fashion:
You now have a fully operational Infinit.e cluster. Start adding sources and you can begin analysis. This link provides a quick example of getting a source imported in order to test/demonstrate the GUI.
It takes about 30 minutes for a node to come online following start-up. Most of this time (20-25 minutes) is spent updating the packages (like Java and JPackage) from the defaults on the CentOS 5.5 AMI. The time-to-start could therefore be significantly improved by building a new custom AMI: starting from the base AMI, installing the infinit.e-prerequisites-online RPM, and then creating the new AMI (we used that link to generate this script file on github).
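A rough sketch of that process from the command line, assuming a running instance built from the CentOS base AMI (the instance id and AMI name are illustrative; the RPM name follows the description above):

# On the running CentOS instance: bring packages up to date and install the
# prerequisites up front, so they do not have to be installed at boot time
yum -y update
yum -y install infinit.e-prerequisites-online

# Then, from a machine with the AWS CLI configured, snapshot it as a new AMI
aws ec2 create-image --instance-id i-0123456789abcdef0 \
    --name "infinite-centos-prereqs" --description "CentOS 5.5 plus Infinit.e prerequisites"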