At Tagged, we’re extremely focused on data science, and this requires large volumes of data to be statistically relevant. We’re always looking to improve the speed of our data availability to keep up with the pace of the rest of Tagged, which is fast-moving and agile.
Hadoop is an open source software framework that supports distributed data-intensive storage and computation. It’s particularly strong in dealing with the large amounts of data generated by larger websites (like Tagged.com).
Hadoop consists of two main layers:
1) HDFS (Hadoop Distributed File System) – this is built to handle large quantities of data
2) Map/Reduce – this shines in doing computation and analysis on large data sets
Hadoop is important to Tagged not only because of its ability to store large amounts of data, but also because it works in concert with and is a foundation for real-time data analysis. Real-time data analysis is one of our big goals on the Analytics Team. Hadoop allows us to give decision makers immediate access to metrics so they can react at the speed of our business. The ability to store large amounts of data allows us to store the amount of history required for our data scientists and to draw meaningful insights about customer behavior. Hadoop is flexible; it’s not just about crunching big data slowly. Hadoop can be used for big data storage and retrieval and can also be leveraged for real time analytics.
I’m excited about this new trend in no relational (NoSQL) database technology and how it can be used to complement more traditional databases. In a personal sense, it gives me a chance to work with cutting edge technology that was specifically developed to deal with big data challenges that we encounter at companies like Tagged.
Chris Bunnell has over 10 years experience of working with big data in a variety of industries (banking, telecom and now social media). You can connect with him on LinkedIn.