Big Data is the new buzzword sweeping the worlds of IT and analytics. The number of new data sources and the amount of data generated by existing and new sources is growing at an incredible rate. Some recent statistics illustrate this explosion:
- Facebook generates 130 terabytes of data each day in user logs alone. An additional 200-400 terabytes are generated by users posting pictures.
- Google processes 25 petabytes (a petabyte is about 1,000 terabytes) each day.
- The Large Hadron Collider, the world's largest and highest-energy particle accelerator, built by the European Organization for Nuclear Research, generates one petabyte of data each second.
- The total amount of data created worldwide in 2011 was about one zettabyte (1,000,000 petabytes). This is projected to grow by 50%-60% each year.
With all of this data, the demand for skilled statistical problem solvers has never been greater.
What is Big Data?
Despite all of the above, we still need a good definition of Big Data. Two such definitions come to mind.
The Three Vs: Volume, Velocity, and Variety. Here, Big Data comes in massive quantities, may arrive quickly, and may take various forms (e.g., structured and unstructured). To this we add a fourth V: Variability. Big Data is highly variable, covering the full range of experiences in the human condition and the physical world.
The second definition may have more appeal to some: Big Data are data that are expensive to manage and hard to extract value from. This definition recognizes that the many forms and varieties of Big Data may be difficult to collect, manage, process, aggregate, or summarize.
The Berkeley AMPLab suggests three pillars for dealing with Big Data: algorithms, machines, and people.
New algorithms are needed to deal with Big Data. They recognize that many existing statistical and data mining algorithms will not scale, nor will some existing software (R, for example, is well known to run into memory limits). Furthermore, as data become bigger, using sampling for prediction or projection may miss important facts and phenomena.
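A toy sketch of the sampling point (hypothetical synthetic data, not from any real source): an event that occurs about once in 100,000 records shows up in a full scan of a million records, but a 1% sample will usually contain none of it.

```python
import random

random.seed(7)
N = 1_000_000
# Synthetic log with a rare event occurring at a ~0.001% rate.
records = ["rare" if random.random() < 1e-5 else "ok" for _ in range(N)]

sample = random.sample(records, N // 100)  # a 1% sample

# The full scan finds the rare events; the small sample usually finds none.
print(records.count("rare"), sample.count("rare"))
```

On average the full data set holds about ten of these rare records, while the expected number in the 1% sample is 0.1, so the phenomenon simply vanishes from the sampled view.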
Specialized machines are needed for Big Data. On the hardware side, developments in parallel and distributed processing are necessary for working with Big Data in a reasonable amount of time. New software, such as Hadoop and its MapReduce programming model, allows data to be stored across multiple machines while still permitting fast access to the information.
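The MapReduce pattern itself is simple to sketch. The following toy Python version of the classic word-count job (an illustration only; a real Hadoop job distributes the chunks and phases across machines) shows the three steps: map each chunk to (key, value) pairs, shuffle the pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one data chunk.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Shuffle: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# Each chunk would live on, and be mapped by, a different machine.
chunks = ["Big Data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because the map calls touch only their own chunk and the reduce calls touch only their own key group, both phases parallelize across a cluster with no shared state.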
Finally, new skills are required to deal with Big Data. McKinsey projects that by 2018, the U.S. will need 160,000 more people with expertise in statistical methods and data analysis and 1.5 million more data-literate managers.
Rise of the Data Scientist
These three pillars have given rise to a new position in an organization: the Data Scientist.
This view of the Data Scientist is attributed to Drew Conway on his blog in September 2010. The Traditional Researcher may have substantive expertise and learning, as well as statistical skills and the ability to use algorithms, but he lacks the training and experience to manipulate raw data cleverly and skillfully. The Machine Learning specialist knows hacking and math/stat but is substance-free: most machine learning analytics are devoid of theory and are model-free. Finally, the combination of substantive experience with hacking is exemplified by the clever guy in the IT department who has uncovered an interesting fact yet cannot generalize it beyond the one occurrence, since without statistics he cannot distinguish a random event from a systematic pattern.
Thus the Data Scientist is a rare bird indeed and the three complementary skills combined in one person are highly valuable.
What is in4mation insights Doing with Big Data?
in4mation insights is meeting the challenge of Big Data by organizing our efforts around algorithms, machines, and people.
Algorithms: We have taken state-of-the-art statistical methods, namely hierarchical Bayesian models, and parallelized their implementation. This has resulted in blazing speed and the ability to do sophisticated analysis on terabytes of data. We continue to innovate by adding new statistical models that can be used in a variety of applications, including advanced econometrics, variable selection, hidden Markov models, Bayesian data fusion, tree-structured models, Bayesian CART, and Random Forests.
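As a hedged sketch of the general idea behind parallelizing Bayesian computation (a toy random-walk Metropolis sampler on a standard normal posterior, not our production implementation): MCMC chains are independent of one another, so separate chains can run on separate cores with no communication, and their post-burn-in draws can be pooled.

```python
import math
import random

def metropolis_chain(n_steps, seed, log_post):
    # One random-walk Metropolis chain. Chains are independent,
    # so each can occupy its own core in a parallel implementation.
    rng = random.Random(seed)
    x, draws = 0.0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, 1.0)
        log_ratio = log_post(proposal) - log_post(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal  # accept the proposed move
        draws.append(x)
    return draws

# Toy posterior: standard normal, log density up to a constant.
log_post = lambda x: -0.5 * x * x

# Four independent chains; pool their draws after discarding burn-in.
chains = [metropolis_chain(5000, seed, log_post) for seed in range(4)]
pooled = [d for chain in chains for d in chain[1000:]]
posterior_mean = sum(pooled) / len(pooled)
```

Running k chains on k cores cuts wall-clock time roughly k-fold for the sampling phase, which is the essence of the speedup, although real hierarchical models also parallelize work within each chain.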
Machines: We have purchased our own in-house High Performance Computation Cluster (cloud), which has been tuned for high-speed, complex mathematical and matrix calculations. We are able to get over 80% of theoretical performance from this HPCC, compared to the 50% or so obtained without such tuning. We currently have over 100 computation cores and half a terabyte of RAM. A planned expansion will add another 128 cores and 4 terabytes of RAM.
People: We have invested considerable time in finding and training super smart people. The company's atmosphere and ethos foster collaboration, communication, and innovation. Over 60% of our staff hold an MS/MA or a Ph.D. degree.
Benefits to Our Clients
Our experience with Big Data can be summed up in this way:
- We can handle problems of any size. There is no need to create a sample of your data.
- When we do not have an off-the-shelf solution, together with our Board of Science Advisors, we will create a solution.
- The speed of our hardware and software permits us to do extensive testing of various model specifications in a short period of time.
- We innovate, not for innovation's sake, but rather to give our clients the best possible insights into their markets and customers.
We Are the Data Science Company.