If Big Data is the Answer, What is the Question?

It seems that if you’re not on the Big Data detection bandwagon these days, you risk not being part of the conversation. It’s a marginalization akin to that of string theorists in the 1970s and of non-string theorists today. There is probably a healthier skepticism among the core of the security research community, but that is partly because healthy skepticism about almost everything is simply part of that community’s DNA.

Depending on who you ask, Big Data is either the problem needing to be solved or it is the solution, but both perspectives really get at the same question: How do I centrally collect, store, analyze, and coordinate actions upon all of the information available in order to detect and respond to threats?

The assumptions underlying this question deserve closer scrutiny, particularly as they relate to detection.

Dumb vs. Smart Sensors

An important distinction that is not often made in the vague conversations that occur around Big Data is the distinction between smart and dumb sensors. Smart sensors are characterized by alerting based on some analysis. It’s often these alerts that are referred to when the word “correlation” is invoked. If sensor X fires an alert, but all’s well according to sensor Y, then I may want to ignore sensor X’s alert. This makes a great deal of sense, since each technology, be it at the firewall or elsewhere on the wire, at the email server, or on the endpoint, has its own strengths and weaknesses, and comparing information from different sensors can significantly improve the overall signal-to-noise ratio.
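
As a rough illustration of that kind of cross-sensor correlation, here is a minimal Python sketch. The sensor names, the `Alert` type, and the "at least two distinct sensors must agree" rule are all assumptions made for the example, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    sensor: str  # hypothetical sensor name, e.g. "firewall" or "endpoint"
    host: str    # host the alert refers to


def correlate(alerts, min_sensors=2):
    """Return hosts flagged by at least `min_sensors` distinct sensors."""
    sensors_by_host = {}
    for alert in alerts:
        sensors_by_host.setdefault(alert.host, set()).add(alert.sensor)
    return sorted(
        host
        for host, sensors in sensors_by_host.items()
        if len(sensors) >= min_sensors
    )


# Made-up alerts: two sensors agree on one host, a single sensor flags another.
alerts = [
    Alert("firewall", "10.0.0.5"),
    Alert("endpoint", "10.0.0.5"),
    Alert("email", "10.0.0.9"),
]
confirmed = correlate(alerts)
print(confirmed)
```

Here the alert on 10.0.0.9 is set aside because no second sensor corroborates it. A real deployment would presumably weight sensors by their known strengths rather than simply counting them.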

However, Big Data is mostly predicated on collating large amounts of information from dumb sensors. Dumb sensors simply feed raw data upstream, be it logs, DNS queries, URLs, etc. Without invoking dumb sensors, we’re not really talking about Big Data; we’re talking about a more traditional SIEM approach. Centralized analysis of dumb sensor data is a still-untested approach that requires analysis and data mining techniques that basically haven’t been invented yet. And frankly, if it hasn’t been envisioned by the academic or research communities, it’s probably not going to happen any time soon, if ever, since industry usually adopts ideas from academia with a lag time measured in years. So the industry has been collectively moving toward a particular destination in the hope that it will find the correct answer in place on arrival, without a whole lot of evidence to support the move. That takes a good deal of faith.

Bottlenecked by Design

Another problem with the Big Data approach is that of information granularity. If you truly wish to send all of the dumb sensor data to a central location for analysis, two significant problems arise. The first is delay. Analyzing and correlating dumb sensor data with sophisticated and yet-to-be-invented mining techniques will almost certainly introduce significant delays in reporting actionable intelligence. Many of the approaches I hear talked about involve re-analyzing historical data as new information and patterns appear. Timeliness is an important consideration in security operations, and any delay risks allowing the trail to grow cold long before infiltrating actors are detected.

The other problem is that of volume. Yes, we have lots of storage capacity, and relatively recent data storage systems allow for better performance and scalability of analysis. But there will always be a limit on how high the volume knob can go before networks are overwhelmed. Distributing analysis may help (more on that in a moment). In short, centralizing data analysis will always mean that the volume knob has to be turned down, and this risks leaving useful information on the table.

What precisely do you do with the data?

Answers are emerging to this question, but there isn’t a silver bullet among them at present. Often when a SIEM is introduced into the corporate security environment, its primary applied use is to create a single pane of glass. So far, few SOCs are using SIEMs to their full potential and getting ahead of the problem. And those that are ahead of the problem are often doing so without the aggregation of a SIEM.

Distributed Analysis

We mentioned above the possibility of distributing analysis. This probably means partitioning the problem either heterogeneously or homogeneously. In a heterogeneous partitioning scheme, we correlate sensors A and B in one location and provide correlation analysis there, and we correlate information from sensors C and D elsewhere, with both analyses feeding a higher-level analysis engine based on the processed and filtered data. A homogeneous approach would effectively place lots of Big Data analysis points throughout the infrastructure, each collecting data from the same types of sensors, but only looking at portions of the network.
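
To make the two schemes concrete, here is a small Python sketch. The event tuples, the `SENSOR_GROUPS` mapping, and the subnet labels are invented for illustration. The only point is that the same events can be routed to analysis points either by sensor type or by network segment.

```python
from collections import defaultdict

# Hypothetical raw events: (sensor_type, subnet, payload).
events = [
    ("dns", "10.1", "lookup evil.example"),
    ("proxy", "10.1", "GET http://evil.example/"),
    ("dns", "10.2", "lookup intranet.example"),
    ("endpoint", "10.2", "new process: updater.exe"),
]


def partition(events, key):
    """Group events by the analysis point that `key` assigns them to."""
    groups = defaultdict(list)
    for event in events:
        groups[key(event)].append(event)
    return dict(groups)


# Heterogeneous: each analysis point owns a fixed set of sensor types.
SENSOR_GROUPS = {"dns": "network", "proxy": "network", "endpoint": "host"}
heterogeneous = partition(events, key=lambda e: SENSOR_GROUPS[e[0]])

# Homogeneous: each analysis point sees every sensor type,
# but only for its own portion of the network.
homogeneous = partition(events, key=lambda e: e[1])


def summarize(groups):
    """Count how many events each analysis point receives."""
    return {name: len(batch) for name, batch in groups.items()}


print(summarize(heterogeneous))
print(summarize(homogeneous))
```

In either case, each analysis point would pass only its processed and filtered findings up to the higher-level engine, not the raw events.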

The point here is not to dwell on the relative merits of any of these architectures, but to point out that there are alternatives to the centralized analysis approach that is generally the hallmark of Big Data architectures. Analysis and intelligence could be pushed down, where appropriate, to the sensors themselves, possibly in addition to a middle tier, reducing the need to aggregate and analyze massive amounts of dumb sensor data.
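
One way to picture pushing analysis down to the sensor is a local filter that decides what is worth forwarding. The sketch below is purely illustrative: the `KNOWN_GOOD` allowlist, the domain names, and the `edge_filter` helper are assumptions, and a real sensor would apply something far richer than a set lookup.

```python
def edge_filter(raw_events, is_interesting):
    """Run on the sensor itself; forward only events that pass a local check."""
    return [event for event in raw_events if is_interesting(event)]


# Hypothetical allowlist held by the sensor.
KNOWN_GOOD = {"intranet.example", "mail.example"}

# Made-up DNS queries observed by the sensor.
dns_queries = [
    "intranet.example",
    "mail.example",
    "x9f3k.example",
    "intranet.example",
]

forwarded = edge_filter(dns_queries, lambda domain: domain not in KNOWN_GOOD)
print(f"forwarded {len(forwarded)} of {len(dns_queries)} events upstream")
```

The trade-off is the one discussed above: whatever the sensor drops is unavailable for later correlation, so the local check has to be chosen with care.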

In a knee-jerk reaction, the industry’s data pendulum has swung from disjointed and non-cooperating sensors “at the edge” to an architecture that places huge burdens on infrastructure and great expectations on yet-to-be-invented analysis techniques. Perhaps it’s reasonable to consider an alternative that is somewhere in between these two extremes.

