Introduction
...
Info |
---|
"Document" is used here as a catch-all for a database record, web page, PDF/office document, XML document, etc. The figures below assume an "average" document across all those types (about 5KB in size). If most ingested documents are smaller (e.g. DB records), capacity and performance will be higher; conversely, if most are larger (e.g. complex PDF reports), capacity and performance will be lower. In practice, track disk usage against document size as you ingest data to get a more accurate picture for your own data (or provision far more disk space than could possibly be needed, then monitor performance to decide when to scale). A separate set of volumetrics applies to log records. Document and record sizing combine linearly. The "per hardware" scaling factors are described below. |
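
The advice to track disk usage against ingested data can be sketched as a small calculation. This is a minimal illustration only: the function name and the sample numbers are hypothetical, and how you obtain the document count and the data-partition usage depends on your own deployment.

```python
def disk_gb_per_million_docs(doc_count: int, disk_used_bytes: int) -> float:
    """Scale observed data-partition usage to a 'GB per 1 million documents' figure.

    Both inputs are measurements you collect yourself (hypothetical here);
    1 GB is taken as 10**9 bytes.
    """
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")
    bytes_per_doc = disk_used_bytes / doc_count
    return bytes_per_doc * 1_000_000 / 1e9


# Hypothetical sample: 250,000 documents occupying 15 GB on the data partition.
observed = disk_gb_per_million_docs(250_000, 15_000_000_000)
print(f"Observed footprint: {observed:.1f} GB per 1M documents")
```

Comparing the observed figure with the per-million guideline in the storage rows of the tables below indicates whether your documents are heavier or lighter than the "average" document assumed here.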
Demo configuration
Intended for running in a VM on a laptop to demonstrate the tool. It may become slow beyond 100-1000 documents, or a few hundred thousand records.
Infinit.e API + DB Node | |
---|---|
Processor | 1x 1.8+ GHz CPU |
Memory | 1 or 2 GB RAM (swap required to get up to ~8GB total) |
Network | WAN connection/none |
Storage | 20GB |
Compact configuration
A small deployment servicing a few thousand documents, or about 10 million records:
The following table lists the minimum recommended hardware configuration for one Infinit.e API and Database node.
Infinit.e API + DB Node | |
---|---|
Processor | 1 X Dual/Quad Core 1.8+ GHz CPUs |
Memory | 4-8 GB RAM (swap required to get up to ~8GB total) |
Network | 1x GigE LAN connection |
Storage | 10 GB Root/OS partition + |
...
The following configuration performs acceptably with 500K-1M documents, or about 50 million records. The higher the spec, the faster the performance for a given number and size of documents. However, this topology does not provide redundancy.
Component | Infinit.e API Node | Infinit.e Database Node |
---|---|---|
Processor | 1-2 X Dual Core 1.8+ GHz CPUs | 1-2 X Dual Core 1.8+ GHz CPUs |
Memory | 8-16 GB RAM (or more) | 8-16 GB RAM (or more) |
Network | 2x GigE LAN connection | 2x GigE LAN connection |
Storage | 15 GB Root/OS partition + (~10GB per 1 million "average" documents) | 15 GB Root/OS partition + (~60GB per 1 million "average" documents) |
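
The storage row above reduces to simple arithmetic: a fixed root/OS partition plus a per-million-documents allowance. The sketch below uses the figures from this table (15 GB root, ~10GB per 1 million documents on the API node, ~60GB per 1 million on the Database node). The function and constant names are illustrative, and the estimate covers "average" documents only, not log records.

```python
ROOT_GB = 15                # root/OS partition from the table above
API_GB_PER_MILLION = 10     # ~10GB per 1 million "average" documents (API node)
DB_GB_PER_MILLION = 60      # ~60GB per 1 million "average" documents (DB node)


def storage_gb(documents: int, gb_per_million: float, root_gb: float = ROOT_GB) -> float:
    """Rough total disk estimate for one node: root partition plus document data."""
    return root_gb + gb_per_million * documents / 1_000_000


# Example: the upper end of this configuration's range, 1M documents.
print("API node:", storage_gb(1_000_000, API_GB_PER_MILLION), "GB")
print("DB node: ", storage_gb(1_000_000, DB_GB_PER_MILLION), "GB")
```

If your measured footprint per document differs from the "average" assumed here, substitute your own per-million figure.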
...
A deployment of 2x API nodes and 2x DB nodes using the following hardware performs well with 2M+ documents (e.g. 2M-5M is a typical range), or about 100 million records. In general, system capacity scales fairly linearly with the number of nodes (see below).
...
Component | Infinit.e API Node | Infinit.e Database Node |
---|---|---|
Processor | 2 X Dual Core 1.8+ GHz CPUs | 2 X Dual Core 1.8+ GHz CPUs |
Memory | 16 GB RAM or more (32GB is ideal) | 16 GB RAM or more (32GB is ideal) |
Network | 2x GigE LAN connection | 2x GigE LAN connection |
Storage | 20 GB Root/OS partition + | 20 GB Root/OS partition + 600+ GB data partition, RAID-0 (~60GB per 1 million "average" documents) |
Info |
---|
Note that the API and DB tiers scale per 2-node block, since the primary benefit of the second node is redundancy rather than performance. The second node does balance reads somewhat (but not writes), so there is some performance gain (less than 2x) within a replica set. |
Record server
...
API nodes scale for records without requiring additional DB nodes. For example, each pair of 16GB API nodes provides capacity for approximately 100M records (~3M records/day with 30-day retention). Similarly, each 4-node combination of 2x 16GB API and 2x 16GB DB nodes provides capacity for approximately 2M documents. |
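
The scaling factors above can be turned into a rough node-count estimate. The sketch below is one possible reading, not an official formula. It assumes that document load and record load add linearly on the API tier (as stated in the introduction), that only documents drive the DB tier, and that nodes are always added in pairs. All names are illustrative.

```python
import math

DOCS_PER_BLOCK = 2_000_000            # ~2M documents per 2x API + 2x DB block
RECORDS_PER_API_PAIR = 100_000_000    # ~100M records per pair of 16GB API nodes


def retained_records(records_per_day: int, retention_days: int) -> int:
    """Total records held at steady state for a given retention window."""
    return records_per_day * retention_days


def node_pairs(documents: int, records: int) -> tuple[int, int]:
    """Return (API node pairs, DB node pairs) under the assumptions above."""
    doc_load = documents / DOCS_PER_BLOCK
    record_load = records / RECORDS_PER_API_PAIR
    api_pairs = max(1, math.ceil(doc_load + record_load))  # documents + records
    db_pairs = max(1, math.ceil(doc_load))                 # documents only
    return api_pairs, db_pairs


records = retained_records(3_000_000, 30)   # 90M, close to the ~100M guideline
api_pairs, db_pairs = node_pairs(2_000_000, records)
print(f"{records:,} records -> {api_pairs} API pair(s), {db_pairs} DB pair(s)")
```

Treat the result as a starting point and adjust it against measured disk usage and performance on your own data.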
Required Open Source Software
The following open source software packages are an integral part of the Infinit.e platform:
- Java JRE/JDK 6u30+ (current version = 6u31)
- Apache Tomcat 6.X (current version = 6.0.35)
- MongoDB 2.1+
- elasticsearch 0.19+
- (Hadoop CDH 5.3+ is not required but provides additional functionality when installed)
- (Logstash 1.4+ is not required but provides additional functionality when installed)
...