Apache Spark 2:Data Processing and Real-Time Analytics
上QQ阅读APP看书,第一时间看更新

State versioning guarantees consistent results after reruns

What about the state? Imagine that a machine learning algorithm maintains a count variable on all the workers. If you replay the exact same data twice, you will end up counting the data multiple times. Therefore, the query planner also maintains a versioned key-value map within the workers, which are persisting their state in turn to HDFS--which is by design fault tolerant.

So, in case of a failure, if data has to be replaced, the planner makes sure that the correct version of the key-value map is used by the workers.