
Collection
Now that we have identified the source of the data, its characteristics, and its frequency of arrival, we next need to consider the various collection tools available for bringing the live data into the application:
- Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. (Source: https://flume.apache.org/). The salient features of Flume are (a minimal sketch of pushing events into a Flume agent follows this list):
- It can easily read streaming data and has built-in failure recovery. It has memory and disk channels to handle surges or spikes in incoming data without impacting the downstream processing system.
- Guaranteed delivery: It has a built-in channel mechanism that works on acknowledgments, thus ensuring that the messages are delivered.
- Scalability: Like all other Hadoop components, Flume is easily horizontally scalable.
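
As a concrete illustration, the following minimal Python sketch pushes a small batch of events into a Flume agent over its HTTP source, whose default JSONHandler accepts a JSON array of events, each with optional headers and a string body. The endpoint, port, and event fields shown here are assumptions for illustration, not part of any particular deployment.

```python
import json
import requests  # third-party HTTP client (pip install requests)

# Assumed setup: a Flume agent with an HTTP source (default JSONHandler)
# listening on localhost:44444.
FLUME_HTTP_SOURCE = "http://localhost:44444"

# The JSONHandler expects a JSON array of events, each with optional
# "headers" and a string "body".
events = [
    {"headers": {"host": "web-01", "severity": "INFO"},
     "body": "user=alice action=login status=success"},
    {"headers": {"host": "web-02", "severity": "WARN"},
     "body": "user=bob action=login status=failed"},
]

response = requests.post(
    FLUME_HTTP_SOURCE,
    data=json.dumps(events),
    headers={"Content-Type": "application/json"},
)
# A 2xx response means the batch was accepted; the configured channel
# (memory or file) then buffers the events for the downstream sink.
response.raise_for_status()
```
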
- FluentD: FluentD is an open source data collector, which lets you unify data collection and consumption for better use and understanding of data. (Source: http://www.fluentd.org/architecture). The salient features of FluentD are (a minimal sketch of emitting events to FluentD follows this list):
- Reliability: This component offers both memory-based and file-based buffering, which can be configured according to the reliability needs of the use case in consideration
- Low infrastructure footprint: The component is written in Ruby and C and has a very low memory and CPU footprint
- Pluggable architecture: Its plugin-based design has led to an ever-growing set of community-contributed plugins
- Uses JSON: It structures data as JSON as much as possible, thus making unification, transformation, and filtering easier
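
To make this concrete, the short Python sketch below emits a structured event to a FluentD daemon using the fluent-logger client library. It assumes a FluentD instance with a forward input listening on the default port 24224; the tag and record fields are illustrative only.

```python
from fluent import sender  # pip install fluent-logger

# Assumed setup: a FluentD daemon with a forward input on localhost:24224.
logger = sender.FluentSender("app", host="localhost", port=24224)

# Each record is a JSON-like dict emitted under a tag (here "app.follow");
# FluentD routes it to the configured outputs via its <match> rules.
if not logger.emit("follow", {"from": "userA", "to": "userB"}):
    print(logger.last_error)
    logger.clear_last_error()

logger.close()
```
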
- Logstash: Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite stash (ours is Elasticsearch, naturally). (Source: https://www.elastic.co/products/logstash). The salient features of Logstash are (a minimal sketch of shipping events to Logstash follows this list):
- Variety: It supports a wide variety of streaming input sources, ranging from metrics and application logs to real-time sensor data, social media feeds, and so on.
- Filtering the incoming data: Logstash provides the ability to parse, filter, and transform data on the fly with very low latency. There are situations where we want data arriving from a variety of sources to be filtered and parsed into a predefined, common format before it lands in the broker or stash; converging on a common format decouples the downstream components and makes the overall development approach easier to work with. Logstash can parse and format highly complex data, and its processing time is largely independent of the source, format, complexity, or schema.
- It can route the transformed output to a variety of storage, processing, or downstream application systems, such as Spark, Storm, HDFS, Elasticsearch (ES), and so on.
- It's robust, scalable, and extensible: Developers have the choice of using a wide variety of available plugins or writing their own custom plugins, which can be scaffolded with the Logstash plugin generator tool.
- Monitoring API: It enables the developers to tap into the Logstash clusters and monitor the overall health of the data pipeline.
- Security: It provides the ability to encrypt data in motion to ensure that the data is secure.
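
The following minimal Python sketch shows one common way an application can hand events to Logstash: writing newline-delimited JSON to a TCP input. The pipeline configuration shown in the comments (a tcp input with a json_lines codec feeding Elasticsearch), as well as the port and event fields, are assumptions for illustration.

```python
import json
import socket

# Assumed Logstash pipeline:
#   input  { tcp { port => 5000 codec => json_lines } }
#   output { elasticsearch { hosts => ["localhost:9200"] } }
event = {
    "service": "payments",
    "level": "ERROR",
    "message": "timeout calling card gateway",
    "latency_ms": 5023,
}

with socket.create_connection(("localhost", 5000)) as conn:
    # The json_lines codec expects one JSON document per newline-terminated line.
    conn.sendall((json.dumps(event) + "\n").encode("utf-8"))
```
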

- Cloud API for data collection: This is yet another method of data collection, where most cloud platforms offer a variety of data collection APIs, such as the following (a minimal sketch of writing to one such API appears after this list):
- Amazon Kinesis Firehose (AWS)
- Google Stackdriver Monitoring API
- Data Collector API
- IBM Bluemix Data Connect API
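
As an example of the cloud-API style of collection, the sketch below writes a single record to an Amazon Kinesis Firehose delivery stream using boto3. The stream name, region, and record fields are hypothetical; Firehose buffers the records and delivers them to whichever destination the stream is configured with (for example, S3 or Elasticsearch).

```python
import json
import boto3  # AWS SDK for Python (pip install boto3)

firehose = boto3.client("firehose", region_name="us-east-1")

# Illustrative record; Firehose treats the payload as opaque bytes.
record = {"sensor_id": "s-17", "temperature_c": 41.2, "ts": "2017-06-01T10:15:00Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-ingest",  # assumed delivery stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```
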