HDFS fault tolerance

HDFS is fault tolerant. Each file is broken up into blocks, and each block must be written to more than one server. The number of servers is configurable, but three is the common configuration. Just as with RAID, this provides fault tolerance and increase retrieval performance.

When a block is read, its checksum indicates whether the block is valid or corrupted. If corrupted, and depending on the scope of the corruption, the block may be rewriten or the server may be taken out of the cluster and the blocks spread to other existing servers. If the cluster is running within an elastic cloud then either the server is healed or a new server is added.

Unlike high end SAN hardware which is architected to avoid failure, HDFS assumes that its low end equipment will fail so it has self-healing built into its operating model.

Comments are closed.