Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix provides the following features:
A distributed system comprises one or more nodes. Depending on its purpose, each node performs a specific task. For example, in a search system the task can be an index, in a pub-sub system it can be a topic or queue, and in a storage system it can be a database. Helix refers to such a task as a resource. To scale the system, each node is responsible for a part of the task, referred to as a partition. For scalability and fault tolerance, the task associated with each partition can run on multiple nodes. Helix refers to each such copy as a replica.
Helix refers to each node in the cluster as a PARTICIPANT. As in many distributed systems, there is a central component called the CONTROLLER that coordinates the PARTICIPANTs during start-up, failures and cluster expansion. Most distributed systems also need to provide a service discovery mechanism so that external entities such as clients, request routers and load balancers can interact with the system. These external entities are referred to as SPECTATORs.
Helix is built on top of Zookeeper, which it uses to store the cluster state and as the communication channel between the CONTROLLER, PARTICIPANTs and SPECTATORs. There is no single point of failure in Helix.
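To make the resource, partition and replica terms concrete, here is a minimal sketch that spreads the replicas of each partition over a list of nodes. The class `ReplicaLayout`, the method `assign` and the round-robin rule are assumptions made for illustration; they are not part of Helix and do not describe its actual placement strategy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: not part of the Helix API. */
final class ReplicaLayout {
    /**
     * Spreads the replicas of each partition over the nodes round-robin.
     * Returns partition name -> nodes hosting a replica of that partition.
     */
    static Map<String, List<String>> assign(String resource, int partitions,
                                            int replicas, List<String> nodes) {
        if (replicas > nodes.size()) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        Map<String, List<String>> layout = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            List<String> hosts = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // Consecutive offsets keep the replicas of one partition on distinct nodes.
                hosts.add(nodes.get((p + r) % nodes.size()));
            }
            layout.put(resource + "_" + p, hosts);
        }
        return layout;
    }
}
```

For instance, calling `ReplicaLayout.assign("MyIndex", 4, 2, nodes)` with three nodes yields four partitions, each hosted on two different nodes.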
Helix managed distributed system architecture.

Even though most distributed systems follow a similar mechanism of coordinating nodes through a controller or Zookeeper, the implementation is usually specific to the use case. Helix separates cluster management from the core functionality of the distributed system.
Helix allows one to express the system behavior via a state machine, constraints and objectives.
A state machine expresses the different roles a replica can take up and the transitions from one role to another.
Helix allows one to specify constraints on states and transitions.
Objectives are used to control the replica placement strategy across the nodes. For example
Consider a simple use case where all partitions actively process search query requests. We can express it using an OnlineOffline state model, where a task can be either ONLINE (task is active) or OFFLINE (not active).
Similarly, take a slightly more complicated system that needs three states: OFFLINE, SLAVE and MASTER.
The following state machine table gives, for each start state (row) and end state (column), the next state to transition to. For example, if the current state is OFFLINE and the target state is MASTER, the table says that the first transition must be OFFLINE-SLAVE, followed by SLAVE-MASTER.
          OFFLINE  | SLAVE  | MASTER
         _____________________________
        |          |        |         |
OFFLINE |   N/A    | SLAVE  | SLAVE   |
        |__________|________|_________|
        |          |        |         |
SLAVE   | OFFLINE  |  N/A   | MASTER  |
        |__________|________|_________|
        |          |        |         |
MASTER  | SLAVE    | SLAVE  |  N/A    |
        |__________|________|_________|
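The table lookup can be sketched in a few lines. The class `MasterSlaveTable` and its `path` method are hypothetical names used only for illustration; the entries are copied from the table above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the MasterSlave next-state table; not Helix code. */
final class MasterSlaveTable {
    // NEXT.get(current).get(target) is the state to move to first.
    private static final Map<String, Map<String, String>> NEXT = new HashMap<>();

    static {
        row("OFFLINE", null, "SLAVE", "SLAVE");
        row("SLAVE", "OFFLINE", null, "MASTER");
        row("MASTER", "SLAVE", "SLAVE", null);
    }

    // Adds one table row; null marks the N/A cell on the diagonal.
    private static void row(String from, String toOffline, String toSlave, String toMaster) {
        Map<String, String> cells = new HashMap<>();
        if (toOffline != null) cells.put("OFFLINE", toOffline);
        if (toSlave != null) cells.put("SLAVE", toSlave);
        if (toMaster != null) cells.put("MASTER", toMaster);
        NEXT.put(from, cells);
    }

    /** Returns the ordered transitions needed to get from current to target. */
    static List<String> path(String current, String target) {
        List<String> steps = new ArrayList<>();
        String state = current;
        while (!state.equals(target)) {
            Map<String, String> cells = NEXT.get(state);
            if (cells == null || !cells.containsKey(target)) {
                throw new IllegalArgumentException("no route from " + state + " to " + target);
            }
            String next = cells.get(target);
            steps.add(state + "-" + next);
            state = next;
        }
        return steps;
    }
}
```

Calling `MasterSlaveTable.path("OFFLINE", "MASTER")` walks the table twice and returns the two transitions described in the text, OFFLINE-SLAVE and then SLAVE-MASTER.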
Another unique feature of Helix is that it allows one to add constraints on states and transitions.
For example, in an OnlineOffline state model one can enforce a constraint that there should be 3 replicas in the ONLINE state per partition.
ONLINE:3
In a MasterSlave state model with a replication factor of 3, one can enforce a single master by specifying constraints on the number of masters and slaves.
MASTER:1 SLAVE:2
Given these constraints, Helix will ensure that there is 1 Master and 2 Slaves by initiating appropriate state transitions in the cluster.
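As a rough illustration of what satisfying such state constraints means, the sketch below picks a target state for every live node of one partition. `StateConstraintSketch`, `targetStates` and the first-come assignment order are assumptions for this example only; the way Helix actually chooses which replica becomes master is not described here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: not the algorithm Helix uses. */
final class StateConstraintSketch {
    /**
     * Assigns a target state to each live node for one partition.
     * limits maps state -> maximum replica count and is read in priority
     * order (e.g. MASTER before SLAVE); nodes left over stay OFFLINE.
     */
    static Map<String, String> targetStates(List<String> liveNodes,
                                            LinkedHashMap<String, Integer> limits) {
        Map<String, String> target = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, Integer> limit : limits.entrySet()) {
            for (int i = 0; i < limit.getValue() && next < liveNodes.size(); i++) {
                target.put(liveNodes.get(next++), limit.getKey());
            }
        }
        while (next < liveNodes.size()) {
            target.put(liveNodes.get(next++), "OFFLINE");
        }
        return target;
    }
}
```

With limits MASTER:1 and SLAVE:2 and four live nodes, the first node is assigned MASTER, the next two SLAVE, and the fourth stays OFFLINE. Comparing these targets with the current states tells a controller which transitions to initiate.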
Apart from constraints on states, Helix supports constraints on transitions as well. For example, consider an OFFLINE-BOOTSTRAP transition in which a service downloads the index over the network. Without any throttling during cluster start-up, all nodes might start downloading at once, which could impact system stability. With Helix, without changing any application code, one can simply place a constraint of at most 5 concurrent OFFLINE-BOOTSTRAP transitions across the entire cluster.
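The effect of such a transition constraint can be pictured with a counting semaphore. This is only a single-process stand-in written for illustration: `TransitionThrottle`, `tryStart` and `finish` are hypothetical names, and in Helix the limit is enforced by the framework across the cluster rather than by application code like this.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of a cap on concurrently running transitions. */
final class TransitionThrottle {
    private final Semaphore permits;

    TransitionThrottle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Returns true if a transition may start now; false means "retry later". */
    boolean tryStart() {
        return permits.tryAcquire();
    }

    /** Must be called once for every successful tryStart(). */
    void finish() {
        permits.release();
    }
}
```

With `new TransitionThrottle(5)`, the sixth OFFLINE-BOOTSTRAP request is refused until one of the five running transitions calls `finish()`.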
The constraints can be at any scope node, resource, transition type and
Helix comes with 3 commonly used state models; you can also plug in your own custom state model.
The Helix framework can be used to build distributed, scalable, elastic and fault-tolerant systems by configuring the distributed state machine and its constraints based on application requirements. The application has to provide the implementation for handling state transitions appropriately. Example:
// Illustrative handler class: one method per transition in the MasterSlave state model.
class MasterSlaveStateModel extends HelixStateModel {
  void onOfflineToSlave(Message m, NotificationContext context) {
    System.out.println("Transitioning from Offline to Slave for resource:" + m.getResourceName() + " and partition:" + m.getPartitionName());
  }
  void onSlaveToMaster(Message m, NotificationContext context) {
    System.out.println("Transitioning from Slave to Master for resource:" + m.getResourceName() + " and partition:" + m.getPartitionName());
  }
  void onMasterToSlave(Message m, NotificationContext context) {
    System.out.println("Transitioning from Master to Slave for resource:" + m.getResourceName() + " and partition:" + m.getPartitionName());
  }
  void onSlaveToOffline(Message m, NotificationContext context) {
    System.out.println("Transitioning from Slave to Offline for resource:" + m.getResourceName() + " and partition:" + m.getPartitionName());
  }
}
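One way to picture how a requested transition reaches the matching handler is a name-based dispatch. The sketch below is an assumption made for illustration, not Helix's dispatch mechanism: `TransitionDispatcher` and `DemoModel` are hypothetical, and the handlers take no arguments to keep the example self-contained.

```java
import java.lang.reflect.Method;
import java.util.Locale;

/** Hypothetical dispatcher; Helix's own message handling is not shown here. */
final class TransitionDispatcher {
    /** Builds a handler name such as onOfflineToSlave from two state names. */
    static String handlerName(String from, String to) {
        return "on" + capitalize(from) + "To" + capitalize(to);
    }

    private static String capitalize(String state) {
        String lower = state.toLowerCase(Locale.ROOT);
        return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }

    /** Invokes the no-argument handler for the transition and returns its result. */
    static Object dispatch(Object model, String from, String to) {
        try {
            Method handler = model.getClass().getDeclaredMethod(handlerName(from, to));
            handler.setAccessible(true);
            return handler.invoke(model);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No handler for " + from + "-" + to, e);
        }
    }
}

/** Hypothetical model with a single handler, used only to exercise the dispatcher. */
final class DemoModel {
    String onOfflineToSlave() {
        return "OFFLINE-SLAVE handled";
    }
}
```

Here `TransitionDispatcher.dispatch(new DemoModel(), "OFFLINE", "SLAVE")` looks up `onOfflineToSlave` and invokes it; a transition without a handler raises an `IllegalStateException`.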
Each transition results in a partition moving from its CURRENT state to a NEW state. These transitions are triggered on changes in the cluster state like
Helix uses terms that are commonly used to describe distributed data system concepts.
Helix's approach of using a distributed state machine with constraints on states and transitions has the following benefits
Requirements: JDK 1.6+, Maven 2.0.8+
git clone https://git-wip-us.apache.org/repos/asf/incubator-helix.git
cd incubator-helix
mvn install package -DskipTests
Maven dependency
<dependency>
  <groupId>org.apache.helix</groupId>
  <artifactId>helix-core</artifactId>
  <version>0.6.0-incubating</version>
</dependency>
Download Helix artifacts from here.