Every day I see a dozen new buzzwords in magazines, online articles, and social media, but I am always disappointed when none of the terms are ever able to live up to the hype. I really wasn’t that impressed or excited when I first learned about the term “Big Data” either, but that began to change when I saw more and more examples of how it touches every single part of my life. Big Data is not another short-lived buzzword, but is going to be a leading factor in shaping the future.
A case study that first caught my attention was the example of CellTell, the Congo-based phone provider, that was able to predict massacres in Congo based on prepaid phone purchases. When families in Congo sensed chaos coming, they wanted to protect their personal assets, so they would buy phone cards because they were one of the only things valued in US currency, which was safe from inflation, unlike the local currency. Or there is the now cliché example of how Target was able to predict a teen girl’s pregnancy before her father knew about it. What is so interesting about the analytics behind Big Data is that seemingly unrelated things can be linked together to make predictions. For example, according to the predictive modeling company Kaggle, someone is most likely to make their flight if he or she has preordered a vegetarian meal. And if you buy a used car, you should buy an orange one because owners of orange cars take the best care of them. Although these might seem like odd coincidences, there are a ton of things going on in the backend to make these connections. One is combining the psychology of vegetarians with the implications of preordering a personal meal on a flight. Another is linking the odd color of an orange car with the smaller number of orange cars produced, and realizing that someone who drives an orange car will take better care of it because the odd color implies that the car is likely used as a form of self expression.
Although Big Data is mainly used commercially to help businesses target their marketing initiatives, humanitarian initiatives have sprung up as well. I loved the idea behind Google Flu Trends, even though it was eventually deemed a failure. The idea here was that Google would be able to track the spread of the flu based on search queries. Even though there are over 3.5 billion search queries on Google every day, search habits and queries aren’t enough to predict the flu and don’t even begin to scratch the surface of everything that makes up Big Data.
So if Big Data is really so amazing, what exactly is it, how does it work, and why should individuals who don’t work at Google or the NSA care?
When I first started researching Big Data, data science, and analytical tools, I learned that knowledge about them was very hard for outsiders to get at. There was a huge gap between average citizens and data scientists and analysts. Everything I read was hidden beneath industry jargon, overly complex models, and of course, more buzzwords and concepts that I didn’t fully understand. It’s almost as if the world of Big Data and data science was a secret society that was careful not to let any outsiders in. My goal with this post is to push aside the hype and buzz, and explain Big Data as simply and concisely as possible.
What is it?
Essentially, Big Data is a combination of structured and unstructured data. So it’s just a lot of information, so much information that traditional software and hardware can’t process it. It doesn’t only contain standard data, like your name and phone number, but it also includes things like how many friends you have on Facebook, transactional data from when you swipe a credit card, footage of you captured by video surveillance at a gas station, how often you use your phone, your eye patterns when looking at a website or billboard, and more. It is dynamic and real-time information that comes from an unlimited variety of sources. So you can imagine that a lot of data is being created constantly. In fact, more data was created in the last two years than in all of history combined. We’re talking 1.8 zettabytes (one zettabyte is 10 to the 12th gigabytes) every two days. The term Big Data has also begun to refer to the process of analyzing this data, as well as the companies that do so. Although it can be tricky to define, the name itself, Big Data, actually does a pretty good job describing it: it’s really just a ton of data in all different forms.
How does it work?
The reason that Big Data is a thing now is that it’s getting cheaper and cheaper to store information. But before companies can store data, they have to mine it. Some ways are pretty straightforward (we know that Google stores our search queries), but some ways are a little bit sketchier. For example, any website that has the option to “Share” or “Like” on Facebook is sharing that information with Facebook, even if you don’t hit either button. Many mobile applications also use services like Tea Leaf, which actually records video of your screen while you use an application so that analysts can play back the video to improve the user experience. It is not uncommon for companies to leave Floodlights embedded in the code of their website to capture and report the actions of users after they visit the site. Although many methods for extracting data do raise concerns about cyber security, they are mostly used for harmless attempts to advertise products.
After data is mined, it is stored in an enterprise data warehouse, or some other form of database. Think of it like any other warehouse, only the product that is being stored is data. The big boys, like Google, use ridiculously expensive supercomputers to store and analyze data, while other big companies run NoSQL and Hadoop. In my opinion, this is by far the most boring part of Big Data, but it’s still extremely important because the whole system wouldn’t be able to work without it. Basically, NoSQL uses an agile approach with a dynamic schema, which means that everything is always a work in progress. So it’s a very fluid and constantly changing process. Hadoop, instead of storing everything on one supercomputer, spreads data across many computers. An example that I really like for describing the dynamic, agile approach to Big Data is trying to find a good free throw shooter. Traditional data analysis would tell us that to find a good free throw shooter we should have everyone shoot 1000 free throws and then select everyone who makes at least 900 of them. The problem here is that this is not only impractical, but it also takes a lot of time. The dynamic approach starts by assuming that everyone is a great free throw shooter, and as more data is collected, that assumption is either confirmed or proven false. Each individual free throw that is shot lets us make a prediction based on that moment in time. As more data is added, the predictions become more and more accurate.
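To make the free throw idea a little more concrete, here is a minimal sketch in Python. It is only one way to model "assume everyone is great, then let the data correct you", and it is my own illustration rather than how any particular Big Data tool works. The starting assumption of 9 makes and 1 miss (a 90% shooter) and the names `ShooterEstimate` and `running_estimates` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ShooterEstimate:
    # Starting assumption: treat every player as a 90% shooter by
    # pretending we have already seen 9 makes and 1 miss.
    # These starting numbers are illustrative, not from any real system.
    makes: float = 9.0
    misses: float = 1.0

    def update(self, made: bool) -> None:
        """Fold one observed free throw into the running tally."""
        if made:
            self.makes += 1
        else:
            self.misses += 1

    @property
    def estimate(self) -> float:
        """Current best guess at the chance this player makes a shot."""
        return self.makes / (self.makes + self.misses)


def running_estimates(shots: list[bool]) -> list[float]:
    """Return the estimate after every shot, so a prediction exists at each step."""
    shooter = ShooterEstimate()
    history = []
    for made in shots:
        shooter.update(made)
        history.append(shooter.estimate)
    return history
```

Calling `running_estimates([True, False, False])` gives an estimate after each shot instead of waiting for all 1000. A player who keeps missing drifts away from the optimistic starting point quickly, while a player who keeps making shots stays near it.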
Why should I care?
If you own a business, hopefully you see the advantages that data can have for increasing your profitability. Most people don’t have access to supercomputers or other high end data tools, but even free tools, like Google Analytics or Excel, can allow you to target your marketing. You can test digital ads to see which are most effective, or store data in Excel and do simple computations to find customers who have not engaged with your company for a long period of time, then send them marketing materials to reengage them. You can use heat maps to see where users engage with your site and where they ignore content.
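The "find customers who have gone quiet" computation is simple enough to sketch outside of Excel too. Here is a rough Python version, assuming you have each customer’s last purchase date. The function name `find_lapsed_customers` and the 180-day cutoff are assumptions I picked for illustration, so adjust them to whatever "a long period of time" means for your business.

```python
from datetime import date, timedelta


def find_lapsed_customers(last_purchase, today, max_idle_days=180):
    """Return customers whose last purchase is older than max_idle_days.

    last_purchase maps a customer name to the date of their latest purchase.
    The result is ordered with the longest-idle customer first.
    """
    cutoff = today - timedelta(days=max_idle_days)
    lapsed = [(when, name) for name, when in last_purchase.items() if when < cutoff]
    # Sorting (date, name) pairs puts the oldest purchase date first.
    return [name for when, name in sorted(lapsed)]
```

You would pass in something like a dictionary built from your spreadsheet export, and the names that come back are the people to send reengagement emails to.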
If you don’t own a business, you should still care about Big Data, as it will create 1 million jobs directly every year and 5 million jobs indirectly. If you can understand a little bit about Big Data, you instantly become a more valuable employee.
Great Tools to Use
This is a really cool tool that allows you to see how people are engaging with your website and where they are clicking on the site. The best part is that the normally $100 monthly fee for a premium account is waived and it is free for students, teachers, and charities.
Data Camp allows you to learn everything about programming, data visualization, and data science. The courses are really interactive as well, and the first one can be started for free. After you complete a course, you are given a certificate, which can help sweeten up your resume. After the first free course, it is $25 a month, but you should be able to get at least one certificate in the first month.
If you are a student at Indiana University, you should be using this free service. It is crazy that IU gives us this for free and allows you to receive incredible training for almost anything that has to do with data and IT.
“And if you don’t know, now you know…”