Big Data Tools and Technology is presented by Manning Publications, a leading publisher of computer books for programmers, system administrators, designers, architects, managers and executives.

Wednesday, February 22, 2012

A New Paradigm for Big Data


In the past decade the amount of data being created has skyrocketed. More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating.

The data we deal with is diverse. Users create content like blog posts, tweets, social network interactions, and photos. Servers continuously log messages about what they're doing. Scientists create detailed measurements of the world around us. The internet, the ultimate source of data, is almost incomprehensibly large.

This astonishing growth in data has profoundly affected businesses. Traditional database systems, such as relational databases, have been pushed to the limit. In an increasing number of cases, these systems are breaking under the pressures of "Big Data." Traditional systems, and the data management techniques associated with them, have failed to scale to Big Data.

To tackle the challenges of Big Data, a new breed of technologies has emerged. Many of these new technologies have been grouped under the term "NoSQL." In some ways these new technologies are more complex than traditional databases, and in other ways they are simpler. These systems can scale to vastly larger sets of data, but using these technologies effectively requires a fundamentally new set of techniques. They are not one-size-fits-all solutions.

Many of these Big Data systems were pioneered by Google, including distributed filesystems, the MapReduce computation framework, and distributed locking services. Another notable pioneer in the space was Amazon, which created an innovative distributed key-value store called Dynamo. The open source community responded in the years following with projects like Hadoop, HBase, MongoDB, Cassandra, RabbitMQ, and countless others.
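To make the MapReduce model concrete, here's a minimal sketch of its core idea in plain Python. This is not the API of Hadoop or any real framework; the function names (map_fn, shuffle, reduce_fn) and the word-count task are just illustrative. The essence is that a map function emits key/value pairs and a reduce function aggregates all the values that share a key:

```python
from collections import defaultdict

def map_fn(document):
    # Map step: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle step: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    # Reduce step: collapse a key's values into a single result.
    return key, sum(values)

documents = ["big data tools", "big data systems"]
pairs = (pair for doc in documents for pair in map_fn(doc))
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(counts)  # -> {'big': 2, 'data': 2, 'tools': 1, 'systems': 1}
```

In a real framework the map and reduce steps run in parallel on many machines, and the shuffle step moves each key's values over the network to the machine responsible for reducing them.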

We will learn how to use this new breed of technologies to build robust and scalable Big Data systems, and we will explore a new set of techniques for handling Big Data. Managing the complexity of these systems is as important as scaling them. As our tools become more complex and we must worry about concepts like fault-tolerance, consistency, and availability in our application code, it is imperative that we find ways to eliminate complexity throughout the rest of our systems. Some of the most basic ways people handle data in traditional systems are too complex for building robust Big Data systems. The simpler, alternative approach is the new paradigm for Big Data that we will be exploring.
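To give a flavor of what that simpler approach can look like, here's a small hypothetical sketch in Python; the record_location and current_location functions are invented for illustration. Rather than updating a record in place, as a relational system typically would, we append immutable facts (tagged here with a sequence number standing in for a timestamp) and derive the current value from them on demand:

```python
import itertools

# Mutable, update-in-place approach: the old value is destroyed.
locations_mutable = {"alice": "Chicago"}
locations_mutable["alice"] = "New York"  # Chicago is gone forever

# Immutable approach: append facts, never modify them.
# A monotonically increasing sequence number stands in for a timestamp.
_clock = itertools.count()
location_facts = []

def record_location(user, city):
    # Every change becomes a new fact; existing facts are untouched.
    location_facts.append({"user": user, "city": city, "ts": next(_clock)})

def current_location(user):
    # The current value is derived from the raw facts on demand.
    facts = [f for f in location_facts if f["user"] == user]
    return max(facts, key=lambda f: f["ts"])["city"] if facts else None

record_location("alice", "Chicago")
record_location("alice", "New York")
print(current_location("alice"))  # -> New York (the Chicago fact still exists)
```

Because raw data is only ever appended, a buggy or failed write cannot corrupt history; the worst it can do is add a bad fact that can later be deleted. This is the kind of simplification the new paradigm is built on.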

We will explore the "Big Data problem" and why we need a new paradigm for Big Data. We'll look at the perils of some of the traditional techniques for scaling and discover some deep flaws in the traditional way of building data systems. By starting from first principles of data systems, we'll formulate a different way to build data systems that avoids the complexity of traditional techniques. We'll examine how recent trends in technology encourage the use of new kinds of systems, and finally we'll walk through an example Big Data system that illustrates the key concepts.