
Record reader
The input reader divides the input into appropriately sized splits (in practice, typically 64 MB to 128 MB), and the framework assigns one split to each map function. The input reader reads data from stable storage (typically a distributed filesystem) and generates key/value pairs.
A common example reads a directory full of text files and returns each line as a record.
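As a rough illustration of how an input might be divided into byte-range splits, consider the minimal sketch below. The function name `compute_splits` and its parameters are hypothetical and chosen only for this example; the sketch shows the general idea of fixed-size splitting and is not the framework's actual algorithm.

```python
from typing import List, Tuple


def compute_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (start, length) byte ranges that cover a file of file_size bytes.

    Hypothetical helper for illustration: every split has split_size bytes
    except possibly the last one, which holds the remainder.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    start = 0
    while start < file_size:
        # The final split may be shorter than split_size.
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 300 MB input with 128 MB splits yields three splits.
    for start, length in compute_splits(300 * mb, 128 * mb):
        print(f"split at byte {start}, length {length}")
```

Note that splits computed this way are pure byte ranges, so a boundary may fall in the middle of a record; dealing with that is the record reader's job.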
The record reader translates an input split generated by the input format into records. Its purpose is to parse the data into records, not to parse the records themselves. It passes the data to the mapper as key/value pairs. Usually, the key is positional information and the value is the chunk of data that composes a record. Customized record readers are outside the scope of this book; we generally assume you have an appropriate record reader for your data.
LineRecordReader is the default RecordReader that TextInputFormat provides. It treats each line of the input file as a new value; the associated key is the line's byte offset in the file. Because a split boundary can fall in the middle of a line, LineRecordReader always skips the first line (or partial line) of a split unless it is the first split, and it reads one line past the end boundary of the split if data is available, that is, if it is not the last split. Together, these two rules ensure that each line is processed exactly once.
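The split-boundary behavior described for LineRecordReader can be imitated with a small, self-contained sketch. This is plain Python over an in-memory byte string, not Hadoop code; the function `read_records` and the sample data are assumptions made for illustration. It assumes newline-terminated lines and treats a line that starts exactly at the end boundary as belonging to the current split, since the next split will skip its own first line.

```python
from typing import List, Tuple


def read_records(data: bytes, start: int, length: int) -> List[Tuple[int, str]]:
    """Return (byte_offset, line) records for the split [start, start + length).

    Hypothetical imitation of the rules described above:
    - any split other than the first discards its first (possibly partial) line;
    - a split keeps reading until it has consumed a line that begins at or
      before its end boundary, so it may read one line past the boundary.
    """
    end = start + length
    pos = start
    if start != 0:
        # Skip to just past the first newline; the previous split owns that line.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1

    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # Key: byte offset of the line. Value: the line without its newline.
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    # Three 10-byte splits; the boundaries cut through "bravo" and touch "delta".
    for start, length in [(0, 10), (10, 10), (20, 6)]:
        print((start, length), read_records(data, start, length))
```

In this sample, the second split starts inside "bravo", discards the rest of that line, and goes on to read "delta" because that line begins exactly at its end boundary. The third split then skips "delta" as its first line and produces no records, so every line is emitted by exactly one split.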