git clone https://git-wip-us.apache.org/repos/asf/incubator-helix.git cd recipes/rsync-replicated-file-system/ mvn clean install package -DskipTests cd target/rsync-replicated-file-system-pkg/bin chmod +x * ./quickdemo
There are many applications that require storage for storing large number of relatively small data files. Examples include media stores to store small videos, images, mail attachments etc. Each of these objects is typically kilobytes, often no larger than a few megabytes. An additional distinguishing feature of these usecases is also that files are typically only added or deleted, rarely updated. When there are updates, they are rare and do not have any concurrency requirements.
These are much simpler requirements than what general purpose distributed file system have to satisfy including concurrent access to files, random access for reads and updates, posix compliance etc. To satisfy those requirements, general DFSs are also pretty complex that are expensive to build and maintain.
A different implementation of a distributed file system includes HDFS which is inspired by Googles GFS. This is one of the most widely used distributed file system that forms the main data storage platform for Hadoop. HDFS is primary aimed at processing very large data sets and distributes files across a cluster of commodity servers by splitting up files in fixed size chunks. HDFS is not particularly well suited for storing a very large number of relatively tiny files.
Its possible to build a vastly simpler system for the class of applications that have simpler requirements as we have pointed out.
We call this system a Partitioned File Store (PFS) to distinguish it from other distributed file systems. This system needs to provide the following features:
Apache Helix is a generic cluster management framework that makes it very easy to provide the scalability, fault-tolerance and elasticity features. Rsync can be easily used as a replication channel between servers so that each file gets replicated on multiple servers.
Every write on the master will result in creation/deletion of one or more files. In order to maintain timeline consistency slaves need to apply the changes in the same order. To facilitate this, the master logs each transaction in a file and each transaction is associated with an 64 bit id in which the 32 LSB represents a sequence number and MSB represents the generation number. Sequence gets incremented on every transaction and and generation is increment when a new master is elected.
Replication is required to slave to keep up with the changes on the master. Every time the slave applies a change it checkpoints the last applied transaction id. During restarts, this allows the slave to pull changes from the last checkpointed id. Similar to master, the slave logs each transaction to the transaction logs but instead of generating new transaction id, it uses the same id generated by the master.
When a master fails, a new slave will be promoted to master. If the prev master node is reachable, then the new master will flush all the changes from previous master before taking up mastership. The new master will record the end transaction id of the current generation and then starts new generation with sequence starting from 1. After this the master will begin accepting writes.
This application demonstrate a file store that uses rsync as the replication mechanism. One can envision a similar system where instead of using rsync, can implement a custom solution to notify the slave of the changes and also provide an api to pull the change files.
The coordination between nodes is done by Helix. Helix does the partition management and assigns the partition to multiple nodes based on the replication factor. It elects one the nodes as master and designates others as slaves. It provides notifications to each node in the form of state transitions ( Offline to Slave, Slave to Master). It also provides notification when there is change is cluster state. This allows the slave to stop replicating from current master and start replicating from new master.
In this application, we have only one partition but its very easy to extend it to support multiple partitions. By partitioning the file store, one can add new nodes and Helix will automatically re-distribute partitions among the nodes. To summarize, Helix provides partition management, fault tolerance and facilitates automated cluster expansion.