Preprocessing big data with Spark on EMR