I had a difficult decision this month. There are two terms being thrown around conference rooms with the same mind-boggling frequency as “hit the ground running” and “make sure we are on the same page.” The funny thing about these new buzzwords is that they are really technology terms, but they have been embraced by business folks who believe they are “thinking outside the box.” Should I write about HTML5, as in “our new website needs to be HTML5,” or should I discuss Big Data, as in “Big Data is going to fundamentally change the way we do business”? Big Data won this month, but with HTML5 displacing Web 2.0 as a “value-added paradigm” it was a close call.
What is Big Data in real life? Generally accepted definitions are something like this: data sets that are so large and complex they cannot easily be processed or analyzed by traditional database tools. That doesn’t really tell us a lot. There are new tools that allow us to manage large amounts of data through distributed non-traditional methods. And that is very interesting to a technologist.
But the technology is not really what most people are referring to when they talk about big data. What has caught everyone’s attention is the unprecedented amount of data that is now being collected and stored. One estimate put the monthly Internet data flow at over 20 exabytes (an exabyte is 1,000,000 terabytes). We are collecting data at such a rate that we often have no practical methods in place to analyze and make use of that data.
The Big Data value proposition is just the old “knowledge is power” proposition. In the business world, the more you know about your customers, the better you can serve them, which translates to more revenue generated from them.
Big Data is not new. There has always been more data available than can be reasonably recorded and processed. Insurance companies were very early adopters of computers because the machines could calculate actuarial tables efficiently. If you are unable to calculate the risk, selling insurance is a pretty dicey proposition.
Computer punch cards were developed to tabulate census data. CRM systems were developed to handle all the existing data we had on our customers. There has always been and will always be Big Data, if we define Big Data as whatever new, super large, super complex data sets we encounter.
Big Data is not so much the data itself as what we do with it. And what we do with it is largely determined by the data itself. We can lump Big Data into two large and general categories: structured and unstructured. Strictly speaking there is no useful unstructured data. Data must have some structure; we need to be able to differentiate between records and the fields within those records. Totally unstructured data—like that analyzed by the SETI folks—does not allow us to do much more than search for patterns. This probably explains SETI’s phenomenal lack of success. By structured data I mean data where we understand the source and have defined ways of analyzing that data even though it is not organized like traditional sources.
I have spent a lot of time working with web servers. A given server generates a lot of data: internal event logs, error logs, web logs, performance logs, load balancer logs, and access logs. We are also able to track a lot of data about the consumers of the services delivered by these servers. Typically this is accomplished with JavaScript, which allows us to collect client-side data.
Having access to all this data is not new. What is new is what we are able to do with it now that we have access to Big Data tools. It is not unusual for a single server to generate a gigabyte or more of data every day. Multiply that by the hundreds or thousands of servers in a large enterprise and we have a lot of data to consume. Additionally, for most of this data the real value lies in real-time analysis. Using Big Data tools we can import all that data into something like Hadoop. Hadoop is an open-source distributed framework based on work done at Google, and it is commonly used as a platform to house large amounts of data. Once that data is in Hadoop you can use a Big Data reporting tool like Splunk to analyze it. (I know there are other tools out there; these two just come to mind.)
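To put the per-server figure in perspective, here is a back-of-the-envelope calculation. It is only a sketch: the function name and the decimal conversion (1,000 gigabytes per terabyte) are my own assumptions, and the one gigabyte per server per day is simply the rough figure quoted above, not a measurement.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1,000 GB


def daily_log_volume_tb(servers, gb_per_server_per_day=1.0):
    """Estimate total log volume per day, in terabytes.

    The default of 1 GB per server per day is the rough figure from the
    text, used here purely for illustration.
    """
    return servers * gb_per_server_per_day / GB_PER_TB


if __name__ == "__main__":
    # Hypothetical fleet sizes for a mid-size and a large enterprise.
    for fleet in (500, 1000, 5000):
        print(fleet, "servers:", daily_log_volume_tb(fleet), "TB per day")
```

At a thousand servers that works out to roughly a terabyte of logs every day, which is why a single traditional database stops looking like a sensible place to put it all.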
Now that I have all that semi-structured data available, what do I do with it? The data comes from disparate sources, and while each data set has a structure defined by its source, there is not always a clearly defined relationship between the data sets. That is where the analysis tool comes into play.
We can point the analysis tool at the logs for a particular set of servers and from there find commonalities between the various data sources. We may connect a GET request in a web log to a request in a client log by IP address, link the two by log time, and relate that to the event log on the server. Now suppose we are monitoring server performance in real time and we begin to see processor time rise beyond what is expected. Using Splunk we can quickly drill down into the logs and find a correlation with an increase in requests per second, which in turn we correlate to a particular IP address or range of IP addresses. Suddenly we realize that we are the target of a DoS attack. We have the networking team filter the offending IP addresses and all returns to normal.
I understand that was a trivial example and there are probably other tools in place that would have already flagged a DoS attack. The point is that real-time reporting on aggregated logging from a variety of sources gives IT engineers the data they need to analyze and respond to events related to the health of their equipment.
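For the curious, the last step of that drill-down (tying a spike in requests to a particular IP address) can be sketched in a few lines of Python. This is not how Splunk does it. The log format, the sample data, the function names, and the threshold are all made up for illustration, and a real access log would need a proper parser.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, simplified access-log lines: "<ISO timestamp> <client IP> <method> <path>".
# The addresses come from the ranges reserved for documentation.
SAMPLE_LOG = (
    [f"2013-01-15T10:00:{s % 10:02d} 203.0.113.7 GET /index.html" for s in range(50)]
    + [
        "2013-01-15T10:00:01 198.51.100.4 GET /index.html",
        "2013-01-15T10:00:03 198.51.100.4 GET /about.html",
        "2013-01-15T10:00:05 192.0.2.10 GET /index.html",
        "2013-01-15T10:01:30 192.0.2.10 GET /late.html",
    ]
)


def requests_per_ip(lines, window_start, window_end):
    """Count requests per client IP with window_start <= timestamp < window_end."""
    counts = Counter()
    for line in lines:
        ts_text, ip, _method, _path = line.split()
        ts = datetime.fromisoformat(ts_text)
        if window_start <= ts < window_end:
            counts[ip] += 1
    return counts


def suspicious_ips(counts, threshold):
    """Return, in sorted order, the IPs whose request count exceeds the threshold."""
    return sorted(ip for ip, n in counts.items() if n > threshold)


if __name__ == "__main__":
    start = datetime(2013, 1, 15, 10, 0, 0)
    end = datetime(2013, 1, 15, 10, 0, 10)
    counts = requests_per_ip(SAMPLE_LOG, start, end)
    # The threshold of 20 requests per 10-second window is an arbitrary choice.
    print(suspicious_ips(counts, threshold=20))
```

The counting is the easy part. What the Big Data tools add is doing this continuously, across every log source at once, at a volume no script on one machine could keep up with.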
Big Data is only as valuable as the analysis of the data. Kaggle is an organization that specializes in predictive modeling but is best known for its competitions. A particular problem in data analysis is posed and a cash prize is awarded to the winners. This crowdsourcing approach was inspired by Netflix. The Netflix Prize was a million-dollar contest to improve the accuracy of the company’s movie recommendation algorithm by 10 percent. I think Netflix would have been better off asking for a prediction on the next preferred delivery method for DVDs. Seems like Redbox figured that one out. Nevertheless, crowdsourcing allows you to get a wide range of theories tested quickly and inexpensively.
Allstate partnered with Kaggle in 2011 for a contest to create actuarial algorithms to predict claims based on actual data from 2005 to 2007. The winning entry was 340 percent more accurate than Allstate’s existing methods. The total prize money was $10,000. While the results aren’t conclusive proof that the winning algorithm should be adopted, they led the sponsor to examine the algorithms presented and use them to fine-tune its own.
As mentioned earlier, there is no truly useful unstructured data, but there is a wealth of unclassified data that we may call unstructured. This is the kind of data that could be used to build truly intelligent systems. Artificial intelligence is a term thrown around much too loosely. IBM has gotten a lot of mileage out of the Garry Kasparov vs. Deep Blue chess matches and the Jeopardy!-winning Watson machine. Both Deep Blue and Watson are specially designed systems with specially configured data, built to do one thing very well. As such they appear intelligent, but they are not examples of artificial intelligence.
Two distinct intelligent activities have not been demonstrated by machines. First is the creative process that only the human mind seems capable of. Whether that creativity is the “aha” phenomenon typified by the story of Archimedes’ bathtub or Newton’s apple, or the result of long analysis of existing phenomena that yields a revolutionary theory (as in Einstein’s theory of relativity), machines are not even close to demonstrating creative thinking. In fact I suspect this is a fundamental weakness of digital machines. We don’t live in a digital world.
Big Data and Traffic
The second intelligent activity is more interesting for this discussion. I am speaking of the ability to reason and predict from exposure to and analysis of unrelated data. I am a longtime runner. I can look at a vehicle approaching from a quarter mile away and “know” with a very high degree of probability whether I had better move well out of its way. I can’t even tell you how I know, but it is based on vehicle type, age and condition, driver age, gender and appearance, and a thousand unquantifiable signals that the combination of vehicle, driver, road, and weather is sending. Similarly, when commuting on an eight-lane interstate I am able to pick out, among the hundreds of vehicles in my view, the one that will certainly and suddenly change lanes and that is controlled by a driver who is unable to multi-task, or at least unable to change lanes and accelerate to the lane speed at the same time. Remarkably, those same drivers have little problem simultaneously changing lanes and talking on a cell phone.
It is in this area that we expect great things from Big Data. Traffic flow on a beltway or interstate should be predictable. The data can be captured. Historic information is readily available. Traffic flow should lend itself to modeling with fluid dynamics. Yet we are unable to master traffic control. Metered access is sometimes able to control traffic at certain choke points, and reversible lanes help some.
The simple fact is that humans drive automobiles and trucks, and humans act selfishly. Selfishness is not in the best interest of traffic flow. Until we can predict how many individuals will run the lights at the metered entrance, or how many will purposefully motor along 10 miles an hour below other traffic, computer models for traffic will fail.
Big Data and the Consumer
We have yet to touch on the factor that has made Big Data the buzzword du jour. That factor is nothing more than the insight Big Data can give us into our customers.
The information we are able to access about an individual goes far beyond the traditional sources like credit reports, purchase histories, investment details, demographics, education, family history, etc. Some people provide additional detailed information for free because they like to post interesting things about their lives on Facebook, Twitter, Pinterest, Yammer, LinkedIn, and other social media tools.
There are also more stealthy sources of data like cameras, cellphone tracking, and GPS tracking. It is theoretically possible for a consumer’s every movement through a shopping mall to be traced. This is the Big Data that business is after. When a customer walks up to the service counter at Best Buy it would be very useful for the service representative to know that they went to a Radiohead concert last week and that they were planning a trip to Cancun next month.
Wal-Mart has weekly revenue of about $8 billion, which it realizes from some 100 million customers. Its total customer base is obviously much larger than that weekly 100 million. The Big Data theory says that if a retailer like Wal-Mart can collect every available bit of information about each customer, that data can be analyzed in such a way as to provide better service, better marketing, and better information to those customers, so that the retailer can increase the revenue it receives from them. Maybe, maybe not. I don’t know that McDonald’s is going to be able to super-size me because they know that I tweeted about my noisy neighbors last night, but who knows, maybe they know more about me than I do.