Brickhouse comes to Tagged

Brickhouse is a library of user-defined functions for Hive and Hadoop that enables the agile development of big data pipelines. It provides tools and techniques that allow those pipelines to process larger datasets more robustly. The Brickhouse project was developed at Klout, whose team hosts the GitHub repo and will continue to use and build upon Brickhouse. Since I’ve recently joined the Tagged team and am the primary maintainer of Brickhouse, we’re going to start using it at Tagged as well.

Tagged embraces open source as part of its engineering culture, and will continue to support Brickhouse along with its other open-source projects. The Tagged website produces massive amounts of data, and we’ll be using Brickhouse for more advanced analyses of this data.

The premise here is that one uses Hive as the underlying programming language for pipeline development. There are various options for describing dataflows in Hadoop (e.g., Hive, Pig, Cascading and Scalding). I believe Hive is the simplest approach for generating these workflows. Generally, Hive is perceived as something that a business analyst would use to perform ad-hoc queries, rather than as a real programming language. Many stick to the dogma that every detail of a dataflow has to be explicitly handcrafted by a data engineer.

But think about SQL itself, for relational databases. Sure, there are plenty of business analysts who use it for writing custom reports, but SQL is also the lingua franca for the database developers, software engineers, and DBAs who have needed to interact with RDBMSs for decades now. The theoretical model behind SQL has been dissected in countless papers, and many careers require fluency in it; SQL has profound importance beyond simple reporting.

What about SQL has made it so pervasive? One reason is that it describes a dataset declaratively rather than procedurally. Users write a SQL query describing the data that they want to retrieve rather than spelling out the procedural steps required to gather that data. They define the “what,” not the “how” of the dataset. This allows developers to focus on the business requirements of their applications and not on the details of implementation or the optimization of those queries. That optimization work has instead been carried out over the years by the implementors of RDBMS engines, along with the tireless efforts of dedicated DBAs and data architects.

Hive carries this promise of SQL to the new worlds of Hadoop and MapReduce. While the ad-hoc queries of the business analyst are very important, I’m considering the role it has for the newly-minted data engineer. Today’s data engineer is concerned with creating dataflows of massive size for new industries whose products consist entirely of data. In the turbulent startup world, time to market is of utmost importance; startups need to get new products up and running as quickly as possible. Requirements also change quite quickly, so it doesn’t make sense to waste engineering resources on optimizing and hand-crafting products that may not turn out to be very popular.

Brickhouse (on top of Hive and Hadoop) is intended to be a tool for the agile data developer. It provides the missing pieces of Hive, so the developer doesn’t have to drop down and write his own custom mappers and reducers. With Brickhouse you can express just about any data pipeline as a set of Hive views and queries. Plus, for cases where the traditional Hive approach has shortcomings, Brickhouse attempts to rework the approach to be more scalable and efficient, so that Hive queries can be run safely in production.

Brickhouse provides functionality in multiple areas:

  • Standard map and array transformations, like collect
  • Vector and timeseries calculations
  • Improved JSON UDFs for robustly parsing and producing JSON
  • HBase UDFs for inserting and reading from HBase
  • Probabilistic data structures, like KMV sketch sets, HyperLogLogs, and Bloom filters, for representing large datasets in a small amount of space
  • Many others as well

Let’s look at one specific example where Brickhouse helps out. This particular workflow had to go through a dataset and classify each record according to various sets of criteria. A Hive developer would typically implement this as a script with one sub-select per record type, each filtering on its own criteria, all combined with a union.
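The shape of that approach can be modeled outside Hive. What follows is a minimal Python sketch, not the original Hive script: the sample records, the record-type names, and the helper classify_with_union are all hypothetical, invented for illustration. It mimics a query built from one sub-select per record type and counts how many times the input gets scanned.

```python
# Hypothetical records and criteria, invented for illustration only.
RECORDS = [
    {"id": 1, "score": 85, "city": "New York"},
    {"id": 2, "score": 40, "city": "Boston"},
    {"id": 3, "score": 90, "city": "Boston"},
]

# One predicate per record type; each stands in for one sub-select.
CRITERIA = {
    "HIGH_SCORE": lambda r: r["score"] > 80,
    "NEW_YORKER": lambda r: r["city"] == "New York",
}


def classify_with_union(records, criteria):
    """Model a union of sub-selects: every record type rescans the input."""
    output = []
    scans = 0
    for rec_type, predicate in criteria.items():
        scans += 1  # one full pass over the data per sub-select
        for rec in records:
            if predicate(rec):
                output.append((rec["id"], rec_type))
    return output, scans


rows, scans = classify_with_union(RECORDS, CRITERIA)
print(rows)
print("scans over the data:", scans)
```

In this model the number of scans grows with the number of record types, which is the cost the rest of this example is concerned with.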

This is just a fictional example, but there are arbitrarily many record types that would have needed to be evaluated. In Hive, a query like this would span multiple MapReduce jobs, one for every sub-select that was being unioned. For early development, this was not such a problem, because we could live with multiple steps, especially if we were working with a reduced dataset. We didn’t really want to run this way in production, however, because we were making multiple passes over the data when we didn’t have to. For large datasets, simply scanning the data can take a significant amount of time. Also, every new record type would have added another job to our pipeline, so we couldn’t have easily added types when necessary. So, while the Hive approach technically would have worked, and was correct, it would have burned unnecessary cluster cycles and increased runtimes when we were trying to meet an aggressive SLA for completing all of our jobs.

How can we solve this problem? We could write a custom mapper or reducer in raw Hadoop, and place the business logic for all our different record types in Java for the reducer. This would work, but would be pretty ugly, and would force us to do a standard MR job in the middle of our pipeline.

Another option would be to write a Python script, and use Hive TRANSFORM to pipe records through it. This also would probably work, but would split the business logic between HQL and Python, so it wouldn’t be very maintainable. If possible, we’d want to keep our business logic in one place, and Hive would seem like the right place.

This is how we came up with the conditional_emit UDTF. This UDTF takes an array of boolean expressions and emits a record for each one that is true. This way we can emit a record for all the record types that we need, with only one pass of the data. Adding record types would be simple; we’d simply add a new value to the array. The refactored query replaces the unioned sub-selects with a single call to the new UDTF.
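The behavior described here can be sketched in Python as well. This is an illustrative model, not Brickhouse’s implementation and not HiveQL: the parallel list of labels, the sample records, and the helper classify_single_pass are assumptions made for the example. Only the core idea, emitting one output per true condition, comes from the description above.

```python
# Hypothetical records, invented for illustration only.
RECORDS = [
    {"id": 1, "score": 85, "city": "New York"},
    {"id": 2, "score": 40, "city": "Boston"},
    {"id": 3, "score": 90, "city": "Boston"},
]


def conditional_emit(conditions, labels):
    """Yield the label paired with each condition that is true.

    Pairing conditions with a list of labels is an assumption of this sketch.
    """
    for is_true, label in zip(conditions, labels):
        if is_true:
            yield label


def classify_single_pass(records):
    """Evaluate every criterion for a record while it is in hand."""
    output = []
    for rec in records:  # the input is scanned exactly once
        conditions = [rec["score"] > 80, rec["city"] == "New York"]
        labels = ["HIGH_SCORE", "NEW_YORKER"]
        for label in conditional_emit(conditions, labels):
            output.append((rec["id"], label))
    return output


single_pass_rows = classify_single_pass(RECORDS)
print(single_pass_rows)
```

Adding a record type in this model means appending one condition and one label; the number of passes over the input stays at one.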

With conditional_emit, Brickhouse allows us to do the same thing with only one pass of the data. It reduces the number of MapReduce jobs that need to be run to just one. This saves us considerable cluster time and reduces the use of temporary space for intermediate results. It also makes it more likely that the pipeline will meet its SLA.

This is only one example of how Brickhouse makes pipelines more efficient and reliable. If you are running a pipeline on Hadoop with Hive (or are planning on writing one), check it out – it will make your life much easier!

Jerome Banks is a Senior Software Engineer II on the analytics team at Tagged.


Best Practices for Reliable Releases

One of our values at Tagged is “users are number one,” and in order to live up to that value, our Customer Experience team needs effective and reliable tools to best serve our users. My team, CE tools, is responsible for creating these mechanisms.

Reliability, in particular, is something the CE tools team is extremely focused on. Since Customer Experience agents are administrators of the site, bugs can cause the wrong actions to be taken, with dire consequences.

Releasing new features can be risky when dealing with a large legacy codebase, and we want to make sure we don’t push bad code to production. We also want to keep our verification processes quick and simple so we have more time to work on new features. With these priorities in mind, we stick to the following best practices, which enable us to quickly produce quality code:

  • Unit tests – We try to write as many unit tests for our changes as we can. This helps to prevent regressions in the future, in case one of our changes accidentally alters something else on the site. Tagged has a lot of legacy code, so these tests help us prevent unexpected bugs.
  • Continuous integration server – It’s also important to run unit tests against any new incoming changes and have the results of the test be readily available for developers. It’s particularly important to run unit tests against new branches before they get merged into the master branch.
  • Live verification – Part of Tagged’s infrastructure lets us look at specific branches in our staging environment. This lets our product team verify that changes made by the engineers match the expectations designated by product. Live verification is also a great way to catch bugs that may have been missed during unit testing.
  • Master branch – We only merge new changes into the master branch if all of the above “passes” (i.e., unit testing is successful and our product team gives us the OK). This way, the master branch is always in a state that can be pushed to production at any time.

Reliable release processes are still being figured out and differ from one company to another, but overall it’s always smart to follow the practices listed here. Keeping the main integration branch clean at all times will ensure that when changes are pushed to production everything will function smoothly.

Rob Goetz is a Software Engineer I on the CE tools team at Tagged.


Zana Bootcamp Teaches Art of Growth at Tagged

Last week, Zana joined us at Tagged for a three-hour bootcamp with top-notch speakers sharing their insights on growth and distribution strategies.

The workshop included a fireside chat with Jim Scheinman and Akash Garg. Scheinman founded Maven Ventures, home to Maven Growth Labs, which has helped build some of the largest internet and mobile consumer brands. Garg is the Director of the growth and international teams at Twitter and has extensive prior experience at social networking companies.

Eli Beit-Zuri, a UI and UX guru, and Elliot Shmukler, vice president of product and growth at Wealthfront, also presented. Shmukler shared his experiences as a leader behind LinkedIn’s growth from 20 million to more than 200 million members.

Thanks to all who participated and shared their knowledge on the art of growth in Silicon Valley.

Upcoming Events @ Tagged 2/21 – 2/27

Tagged is proud to sponsor the LAUNCH Hackathon, taking place from February 21-23, and will be hosting a dinner for the hackathon winners at our HQ on February 25.

Our team will be attending Mobile World Congress, the world’s premier mobile industry event, in Barcelona from February 24-27.

Tagged will be hosting The State of Blacks in Technology Panel for the AT&T Making History Happen: 28 Days Project on February 26 at our HQ.

Join us at Tagged for a charity poker tournament benefitting the Signal Media Project on February 27.


Marc LeBrun Celebrates Mac’s 30th Anniversary

This weekend, members of the original Macintosh team gathered at De Anza College to celebrate the 30th anniversary of Apple’s Mac computer.

We’re proud to share that Tagged team member Marc LeBrun was one of four people who built the first Mac. Marc attended the celebration this weekend and spoke on the first panel about “the conception of the Mac.”

Marc (front row, fifth from left) and other attendees at the anniversary celebration.

Congratulations to Marc for being a part of this history and for playing an integral role in the making of such an important technology. We’re honored to have him on our team!

Marc LeBrun is the engineering manager for search and advanced data services at Tagged. He is currently interested in innovative applications of number theory to the challenges of big data.