It seems like the world moves at a faster pace every day. People and places become more connected, and people and organizations try to react at an ever-increasing speed. As the limits of human responsiveness are reached, tools are built to process the vast amounts of data available to decision makers, analyze it, present it, and, in some cases, respond to events as they happen.
The collection and processing of this data has a number of application areas, some of which are discussed in the next section. These applications require an infrastructure and a method of analysis specific to streaming data. Fortunately, like batch processing before it, the state of the art in streaming infrastructure focuses on commodity hardware and software rather than the specialized systems required for real-time analysis prior to the Internet era. This, combined with flexible cloud-based environments, puts the implementation of a real-time system within the reach of nearly any organization. These commodity systems allow organizations to analyze their data in real time and to scale that infrastructure to meet future needs as the organization grows and changes over time.
The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. Real-time projects, once they reach a certain point, should be agile and adaptable systems that can be easily modified, which requires that users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project had either moved on to other things or simply forgotten why they wanted the data in the first place. Keeping projects agile and incremental avoids this fate as much as possible.
This chapter is divided into sections that cover three topics. The first section, “Sources of Streaming Data,” covers some of the common sources and applications of streaming data. They are arranged more or less chronologically and provide some background on the origin of streaming data infrastructures. Beyond being historically interesting, many of the tools and frameworks presented in this book were developed to solve problems in these spaces, and their design reflects some of the challenges unique to the space in which they were born. Kafka, a data motion tool covered in Chapter 4, “Flow Management for Streaming Analysis,” for example, was developed to handle data from web applications, whereas Storm, a processing framework covered in Chapter 5, “Processing Streaming Data,” was developed primarily at Twitter for handling social media data.
The second section, “Why Streaming Data is Different,” covers three of the important aspects of streaming data: continuous data delivery, loosely structured data, and high-cardinality datasets. The first is, of course, what defines a system as a real-time streaming data environment in the first place. The other two, though not entirely unique to streaming data, present particular challenges to the designer of a streaming data application. All three combine to form the essential streaming data environment.
The third section, “Infrastructures and Algorithms,” briefly touches on the significance of how infrastructures and algorithms are used with streaming data.
There are a variety of sources of streaming data. This section introduces some of the major categories of data. Although new data sources, as well as many proprietary ones, are continually being made available, the categories discussed in this section are some of the application areas that have made streaming data interesting. The ordering of the application areas is primarily chronological, and much of the software discussed in this book derives from solving problems in each of these specific application areas.
The data motion systems presented in this book got their start handling data for website analytics and online advertising at places like LinkedIn, Yahoo!, and Facebook. The processing systems were designed to meet the challenges of processing social media data from Twitter and social networks like LinkedIn.
Google, whose business is largely related to online advertising, makes heavy use of advanced algorithmic approaches similar to those presented in Chapter 11. Google seems to be especially interested in a technique called deep learning, which uses very large-scale neural networks to learn complicated patterns.
These systems are even enabling entirely new areas of data collection and analysis by making the Internet of Things and other highly distributed data collection efforts economically feasible. It is hoped that outlining some of the previous application areas provides some inspiration for as-yet-unforeseen applications of these technologies.
Operational monitoring of physical systems was the original application of streaming data. Originally, this would have been implemented using specialized hardware and software (or even analog and mechanical systems in the pre-computer era). The most common use case of operational monitoring today is tracking the performance of the physical systems that power the Internet: the datacenters.
These datacenters house thousands—possibly even tens of thousands—of discrete computer systems. All of these systems continuously record data about their physical state, from the temperature of the processor to the speed of the fans and the voltage draw of their power supplies. They also record information about the state of their disk drives and fundamental metrics of their operation, such as processor load, network activity, and storage access times.
To make the monitoring of all of these systems possible and to identify problems, this data is collected and aggregated in real time through a variety of mechanisms. The first mechanisms tended to be specialized and ad hoc, but as these techniques spread to other areas, monitoring increasingly came to share the same collection systems as other data-gathering efforts.
The introduction of the commercial web, through e-commerce and online advertising, led to the need to track activity on a website. Like the circulation numbers of a newspaper, the number of unique visitors who see a website in a day is important information. For e-commerce sites, the data is less about the number of visitors than about the various products they browse and the correlations between them.
To analyze this data, a number of specialized log-processing tools were introduced and marketed. With the rise of Big Data and tools like Hadoop, much of the web analytics infrastructure shifted to these large batch-based systems, which were used to implement recommendation systems and other analyses. It also became clear that it was possible to conduct experiments on the structure of websites to see how they affected various metrics of interest. This is called A/B testing because, in the same way an optometrist tries to determine the best prescription, two choices are pitted against each other to determine which is best. These tests were mostly conducted sequentially, but this approach has a number of problems, not the least of which is the amount of time needed to conduct the study.
As more and more organizations mined their website data, the need to reduce the time in the feedback loop and to collect data on a more continual basis became more important. Using the tools of the system-monitoring community, it became possible to also collect this data in real time and perform things like A/B tests in parallel rather than in sequence. As the number of dimensions being measured and the need for appropriate auditing (due to the metrics being used for billing) increased, the analytics community developed much of the streaming infrastructure found in this book to safely move data from their web servers spread around the world to processing and billing systems.
This sort of data is still a vast source of information, although it is usually contained within an organization rather than made publicly available. Applications range from simple aggregation for billing to the real-time optimization of product recommendations based on current browsing history (or viewing history, in the case of a company like Netflix).
A major user and generator of real-time data is online advertising. The original forms of online advertising were similar to their print counterparts with “buys” set up months in advance. With the rise of the advertising exchange and real-time bidding infrastructure, the advertising market has become much more fluid for a large and growing portion of traffic.
For these applications, the money being spent in different environments and on different sites is being managed on a per-minute basis in much the same way as the stock market. In addition, these buys are often being managed against some metric, such as the number of purchases (called a conversion) or even the simpler metric of the number of clicks on an ad. When a visitor arrives at a website via a modern advertising exchange, a call is made to a number of bidding agencies (perhaps 30 or 40 at a time), which place bids on the page view in real time. An auction is run, and the advertisement from the winning party is displayed. This usually happens while the rest of the page is loading; the elapsed time is less than about 100 milliseconds. If the page contains several advertisements, as many do, an auction is often run for each of them, sometimes on several different exchanges.
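To make the mechanics concrete, the following is a minimal Python sketch of a second-price auction run under a latency budget. It is an illustration under assumed conventions rather than any particular exchange's implementation; the bidder names, latencies, and bid values are all simulated.

```python
import concurrent.futures
import random
import time

# Hypothetical bidder callout; in a real exchange this would be a network
# request to each bidding agent's endpoint.
def request_bid(bidder_id):
    time.sleep(random.uniform(0.01, 0.12))                   # simulated network latency
    return bidder_id, round(random.uniform(0.10, 5.00), 2)   # simulated bid price

def run_auction(bidder_ids, timeout=0.1):
    """Collect whatever bids arrive within the latency budget, then run a
    second-price auction: the highest bidder wins but pays the second price."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(bidder_ids)) as pool:
        futures = [pool.submit(request_bid, b) for b in bidder_ids]
        done, _ = concurrent.futures.wait(futures, timeout=timeout)
        bids = sorted((f.result() for f in done), key=lambda b: b[1], reverse=True)
    if not bids:
        return None                                           # no bid arrived in time
    winner, top_bid = bids[0]
    clearing_price = bids[1][1] if len(bids) > 1 else top_bid
    return winner, clearing_price

print(run_auction([f"bidder-{i}" for i in range(30)]))
```

The key design point is the timeout: bids that do not arrive within the budget are simply excluded, so a slow bidder can never delay the page.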
All the parties in this process—the exchange, the bidding agent, the advertiser, and the publisher—are collecting this data in real time for various purposes. For the exchange, this data is a key part of the billing process as well as important for real-time feedback mechanisms that are used for a variety of purposes. Examples include monitoring the exchange for fraudulent traffic and other risk management activities, such as throttling the access to impressions to various parties.
Advertisers, publishers, and bidding agents on both sides of the exchange are also collecting the data in real time. Their goal is the management and optimization of the campaigns they are currently running. From selecting the bid (in the case of the advertiser) or the “reserve” price (in the case of the publisher), to deciding which exchange offers the best prices for a particular type of traffic, the data is all being managed on a moment-to-moment basis.
A good-sized advertising campaign or a large website could easily see page views in the tens or hundreds of millions. Including other events such as clicks and conversions could easily double that. A bidding agent is usually acting on the part of many different advertisers or publishers at once and will often be collecting on the order of hundreds of millions to billions of events per day. Even a medium-sized exchange, sitting as a central party, can have billions of events per day. All of this data is being collected, moved, analyzed, and stored as it happens.
Another, newer source of massive data collection is social media, especially public sources such as Twitter. As of the middle of 2013, Twitter reported around 500 million tweets per day, with spikes of up to around 150,000 tweets per second. That number has surely grown since then.
This data is collected and disseminated in real time, making it an important source of information for news outlets and other consumers around the world. In 2011, Twitter users in New York City received information about an earthquake outside of Washington, D.C., about 30 seconds before the tremors struck New York itself.
Combined with other sources like Facebook, Foursquare, and upcoming communications platforms, this data is extremely large and varied. The data from applications like web analytics and online advertising, although highly dimensional, is usually fairly well structured. The dimensions, such as money spent or physical location, are fairly well understood quantitative values.
In social media, however, the data is usually highly unstructured, at least as data analysts understand the term. It is usually some form of “natural language” data that must be parsed, processed, and somehow understood by automated systems. This makes social media data incredibly rich, but incredibly challenging for real-time systems to process.
One of the most exciting new sources of data was introduced to the world in 2007 in the form of Apple's iPhone. Cellular data-enabled computers had been available for at least a decade, and devices like the BlackBerry had put data in the hands of business users, but these devices were still essentially specialist tools and were managed as such.
The iPhone, Android phones, and other smartphones that followed made cellular data a consumer technology, with the accompanying economies of scale that go hand in hand with a massive increase in user base. It also put a general-purpose computer in the pockets of a large population. Smartphones have the ability not only to report back to an online service, but also to communicate with other nearby objects using technologies like Bluetooth LE.
Technologies like so-called “wearables,” which make it possible to measure the physical world the same way the virtual world has been measured for the last two decades, have taken advantage of this new infrastructure. The applications range from the silly to the useful to the creepy. For example, a wristband that measures sleep activity could trigger an automated coffee maker when the user gets a poor night's sleep and needs to be alert the next day. The smell of coffee brewing could even be used as an alarm. The communication between these systems no longer needs to be direct or specialized, as envisioned in the various “smart home” demonstration projects during the past 50 years. These tools are possible today using tools like If This Then That (IFTTT) and other publicly available systems built on infrastructure similar to those in this book.
On a more serious note, important biometric data can be measured in real time by remote facilities in a way that has previously been available only when using highly specialized and expensive equipment, which has limited its application to high-stress environments like space exploration. Now this data can be collected for an individual over a long period of time (this is known in statistics as longitudinal data) and pooled with other users' data to provide a more complete picture of human biometric norms. Instead of taking a blood pressure test once a year in a cold room wearing a paper dress, a person's blood pressure might be tracked over time with the goal of “heading off problems at the pass.”
Outside of health, there has long been the idea of “smart dust”—large collections of inexpensive sensors that can be distributed into an area of the world and remotely queried to collect interesting data. The limitation of these devices has largely been the expense required to manufacture relatively specialized pieces of equipment. This has been solved by the commodification of data collection hardware and software (such as the smartphone) and is now known as the Internet of Things. Not only will people continually monitor themselves, objects will continually monitor themselves as well. This has a variety of potential applications, ranging from traffic management within cities to making agriculture more efficient through better monitoring of soil conditions.
The important piece is that this information can be streamed through commodity systems rather than hardware and software specialized for collection. These commodity systems already exist, and the software required to analyze the data is already available. All that remains to be developed are the novel applications for collecting the data.
There are a number of aspects to streaming data that set it apart from other kinds of data. The three most important, covered in this section, are the “always-on” nature of the data, the loose and changing data structure, and the challenges presented by high-cardinality dimensions. All three play a major role in decisions made in the design and implementation of the various streaming frameworks presented in this book. These features of streaming data particularly influence the data processing frameworks presented in Chapter 5. They are also reflected in the design decisions of the data motion tools, which consciously choose not to impose a data format on information passing through their system to allow maximum flexibility. The remainder of this section covers each of these in more depth to provide some context before diving into Chapter 2, which covers the components and requirements of a streaming architecture.
The first is somewhat obvious: streaming data streams. The data is always available, and new data is always being generated. This has a few effects on the design of any collection and analysis system. First, the collection itself needs to be very robust. Downtime for the primary collection system means that data is permanently lost. This is an important thing to remember when designing an edge collector, and it is discussed in more detail in Chapter 2.
Second, the fact that the data is always flowing means that the system needs to be able to keep up with it. If 2 minutes are required to process 1 minute of data, the system will not be real time for very long. Eventually, the problem will become so bad that some data will have to be dropped to allow the system to catch up. In practice it is not enough to have a system that can merely “keep up” with data in real time; it needs to be able to process data far more quickly than real time. Whether for intentional reasons, such as planned downtime, or because of catastrophic failures, such as network outages, the system, in whole or in part, will eventually go down.
Failing to plan for this inevitability and having a system that can only process at the same speed as events happen means that the system is now delayed by the amount of data stored at the collectors while the system was down. A system that can process 1 hour of data in 1 minute, on the other hand, can catch up fairly quickly with little need for intervention. A mature environment that has good horizontal scalability—a concept also discussed in Chapter 2—can even implement auto-scaling. In this setting, as the delay increases, more processing power is temporarily added to bring the delay back into acceptable limits.
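As a rough sketch of the arithmetic behind this (ignoring the cost of switching in extra capacity), the time needed to drain a backlog depends on how much faster than real time the system can run:

```python
def catch_up_minutes(backlog_minutes, speedup):
    """Minutes needed to drain a backlog when the system processes `speedup`
    minutes of data per minute of wall-clock time while new data keeps arriving."""
    if speedup <= 1:
        return float("inf")            # the system never catches up
    return backlog_minutes / (speedup - 1)

# Recovering from a 60-minute outage:
print(catch_up_minutes(60, speedup=1.1))   # ~600 minutes: barely faster than real time
print(catch_up_minutes(60, speedup=60))    # ~1 minute: "an hour of data in a minute"
```

A system that is only 10 percent faster than real time needs roughly ten hours to recover from a one-hour outage, whereas a system that processes an hour of data in a minute recovers almost immediately.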
On the algorithmic side, this always-flowing feature of streaming data is a bit of a double-edged sword. On the positive side, there is rarely a situation where there is not enough data. If more data is required for an analysis, simply wait for enough data to become available. It may require a long wait, but other analyses can be conducted in the meantime that can provide early indicators of how the later analysis might proceed.
On the downside, much of the statistical tooling developed over the last 80 or so years is focused on the discrete experiment. Many of the standard approaches to analysis are not necessarily well suited to data when it is streaming. For example, the concept of “statistical significance” becomes an odd sort of concept when used in a streaming context. Many see it as some sort of “stopping rule” for collecting data, but it does not actually work like that. The p-value statistic used to make the significance call is itself a random quantity and may dip below the critical value (usually 0.05) at one point even though observing the next value would push it back above 0.05.
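The following is an illustrative simulation sketch (not taken from any particular study): it repeatedly checks a p-value as fair-coin data streams in, and even though the null hypothesis is true throughout, a substantial fraction of runs cross the 0.05 threshold at some point.

```python
import math
import random

random.seed(7)

def two_sided_p(successes, n, p0=0.5):
    """Normal-approximation p-value for a one-sample proportion test against p0."""
    z = (successes / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def dips_below(alpha=0.05, steps=2000):
    """Peek at the p-value after every observation; the null is true throughout."""
    successes = 0
    for n in range(1, steps + 1):
        successes += random.random() < 0.5          # fair coin: no real effect
        if n >= 30 and two_sided_p(successes, n) < alpha:
            return True     # an analyst who stopped here would declare "significance"
    return False

runs = 500
false_alarms = sum(dips_below() for _ in range(runs))
print(f"{false_alarms / runs:.0%} of runs crossed p < 0.05 at least once")
```

The exact fraction depends on how often and how long the stream is checked, but it is well above the nominal 5 percent, which is exactly why a streaming p-value makes a poor stopping rule.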
This does not mean that statistical techniques cannot and should not be used—quite the opposite. They still represent the best tools available for the analysis of noisy data. It is simply that care should be taken when performing the analysis as the prevailing dogma is mostly focused on discrete experiments.
Streaming data is often loosely structured compared to many other datasets. There are several reasons this happens, and although this loose structure is not unique to streaming data, it seems to be more common in the streaming settings than in other situations.
Part of the reason seems to be the type of data that is interesting in the streaming setting. Streaming data comes from a variety of sources. Although some of these sources are rigidly structured, many of them are carrying an arbitrary data payload. Social media streams, in particular, will be carrying data about everything from world events to the best slice of pizza to be found in Brooklyn on a Sunday night. To make things more difficult, the data is encoded as human language.
Another reason is that there is a “kitchen sink” mentality to streaming data projects. Most of the projects are fairly young and exploring unknown territory, so it makes sense to toss as many different dimensions into the data as possible. The set of dimensions is likely to change over time, so the decision is also made to use a format that can be easily modified, such as JavaScript Object Notation (JSON). The general paradigm is to collect as much data as possible in the event that it turns out to be interesting.
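As a small illustration (the field names here are invented), a JSON event stream lets later events carry dimensions that earlier events lack, and consumers simply tolerate the missing fields:

```python
import json

# Two hypothetical events from the same stream; the schema is deliberately loose,
# and later events may carry dimensions that earlier ones do not.
events = [
    '{"ts": "2013-06-01T12:00:00Z", "user": "abc123", "page": "/home"}',
    '{"ts": "2013-06-01T12:00:01Z", "user": "def456", "page": "/cart",'
    ' "experiment": {"name": "checkout-button", "variant": "B"}, "referrer": "search"}',
]

for raw in events:
    event = json.loads(raw)
    # Missing dimensions are tolerated rather than treated as errors.
    variant = event.get("experiment", {}).get("variant", "none")
    print(event["page"], variant)
```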
Finally, the real-time nature of the data collection also means that various dimensions may or may not be available at any given time. For example, a service that converts IP addresses to a geographical location may be temporarily unavailable. For a batch system this does not present a problem; the analysis can always be redone later when the service is once more available. The streaming system, on the other hand, must be able to deal with changes in the available dimensions and do the best it can.
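A common pattern, sketched below with a hypothetical geo-lookup service, is to degrade gracefully: if an enrichment dimension is unavailable, mark it as unknown and keep the event moving rather than blocking the stream.

```python
# Hypothetical enrichment step: the geo-lookup service may be unavailable, so the
# streaming pipeline degrades gracefully instead of blocking or dropping the event.
def lookup_geo(ip_address):
    raise TimeoutError("geo service unavailable")   # simulate an outage

def enrich(event):
    try:
        event["geo"] = lookup_geo(event["ip"])
    except Exception:
        event["geo"] = None          # mark the dimension as unknown and move on
    return event

print(enrich({"ip": "192.0.2.1", "page": "/home"}))
```

A batch job could simply be rerun once the service recovers; the streaming system has to make this choice in the moment.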
Cardinality refers to the number of unique values a piece of data can take on. Formally, cardinality refers to the size of a set, and it can be applied to the various dimensions of a dataset as well as to the dataset as a whole. Streaming data is often high cardinality, and this often manifests itself as a “long tail” feature of the data. For a given dimension (or combination of dimensions), there is a small set of different states that are quite common, usually accounting for the majority of the observed data, and then a “long tail” of other data states that comprise a fairly small fraction of the data.
This feature is common to both streaming and batch systems, but it is much harder to deal with high cardinality in the streaming setting. In the batch setting, it is usually possible to perform multiple passes over the dataset. A first pass over the data is often used to identify dimensions with high cardinality and to compute the states that make up most of the data. These common states can be treated individually, and the remaining states can be combined into a single “other” state that can usually be ignored.
In the streaming setting, the data can usually be processed only a single time. If the common cases are known ahead of time, they can be included in the processing step. The long tail can then be combined into the “other” state, and the analysis can proceed as it does in the batch case. If a batch study has already been performed on an earlier dataset, it can be used to inform the streaming analysis. However, it is often not known whether the common states for the current data will be the common states for future data. In fact, changes in the mix of states might actually be the metric of interest. More commonly, there is no previous data to perform an analysis on. In this case, the streaming system must attempt to deal with the data at its natural cardinality.
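When an earlier batch study has identified the common states, the streaming version can be as simple as the following minimal sketch (the browser names and events here are invented). Without that prior knowledge, approaches like those discussed later in this book are needed to deal with the full cardinality.

```python
from collections import Counter

# Common states identified from a hypothetical earlier batch study; everything
# else is folded into a single "other" bucket as events stream in.
COMMON_BROWSERS = {"Chrome", "Firefox", "Safari", "IE"}
counts = Counter()

def process(event):
    browser = event.get("browser", "unknown")
    key = browser if browser in COMMON_BROWSERS else "other"
    counts[key] += 1

for e in [{"browser": "Chrome"}, {"browser": "Opera"}, {"browser": "NetPositive"}]:
    process(e)
print(counts)   # e.g. Counter({'other': 2, 'Chrome': 1})
```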
This is difficult both in terms of processing and in terms of storage. Anything that involves a large number of different states necessarily takes time to process. It also requires an amount of space linear in the number of states to store information about each one, and storage is much more restricted than in the batch setting because it must generally use very fast main memory instead of the much slower secondary storage of hard drives. This has been relaxed somewhat with the introduction of high-performance Solid State Drives (SSDs), but they are still orders of magnitude slower than memory access.
As a result, a major topic of research in streaming data is how to deal with high-cardinality data. This book discusses some of the approaches to dealing with the problem. As an active area of research, more solutions are being developed and improved every day.
The intent of this book is to provide the reader with the ability to implement a streaming data project from start to finish. An algorithm without an infrastructure is, perhaps, an interesting research paper, but not a finished system. An infrastructure without an application is mostly just a waste of resources.
The approach of “build it and they will come” really isn't going to work if you focus solely on an algorithm or an infrastructure. Instead, a tangible system must be built implementing both the algorithm and the infrastructure required to support it. With an example in place, other people will be able to see how the pieces fit together and add their own areas of interest to the infrastructure. One important thing to remember when building the infrastructure (and it bears repeating) is that the goal is to make the infrastructure and algorithms accessible to a variety of users in an organization (or the world). A successful project is one that people use, enhance, and extend.
Ultimately, the rise of web-scale data collection has been about connecting “sensor platforms” to a real-time processing framework. Initially, these sensor platforms were entirely virtual things, such as system-monitoring agents or the connections between websites and a processing framework for the purposes of advertising. With the rise of ubiquitous Internet connectivity, this has transferred to the physical world, allowing data collection across a wide range of industries at massive scale.
Once data becomes available in real time, it is inevitable that processing should be undertaken in real time as well. Otherwise, what would be the point of real-time collection when bulk collection is probably still less expensive on a per-observation basis? This brings with it a number of unique challenges. Using the mix of infrastructural decisions and computational approaches covered in this book, these challenges can be largely overcome to allow for the real-time processing of today's data as well as the ever-expanding data collection of future systems.