data:image/s3,"s3://crabby-images/9488a/9488a8a3634c8256db5fb170c4a76f2bd987cf9a" alt="Apache Spark 2:Data Processing and Real-Time Analytics"
上QQ阅读APP看书,第一时间看更新
State versioning guarantees consistent results after reruns
What about the state? Imagine that a machine learning algorithm maintains a count variable on all the workers. If you replay the exact same data twice, you will end up counting the data multiple times. Therefore, the query planner also maintains a versioned key-value map within the workers, which are persisting their state in turn to HDFS--which is by design fault tolerant.
So, in case of a failure, if data has to be replaced, the planner makes sure that the correct version of the key-value map is used by the workers.