Brickhouse comes to Tagged

Brickhouse is a library of user-defined functions for Hive and Hadoop that enables the agile development of big data pipelines. It provides tools and techniques that allow those pipelines to process larger datasets more robustly. The Brickhouse project was developed at Klout, whose team will continue to use and build upon Brickhouse and hosts the github repo. Since I’ve recently joined the Tagged team and am the the primary maintainer of Brickhouse, we’re going to start using it at Tagged as well.

Tagged embraces open source as part of its engineering culture, and will continue to support Brickhouse along with its other open-source projects. The Tagged website produces massive amounts of data, and we’ll be using Brickhouse for more advanced analyses of this data.

The premise here is that one uses Hive as the underlying programming language for pipeline development. There are various options for describing dataflows in Hadoop (e.g., Hive, Pig, Cascading and Scalding). I believe Hive is the simplest approach for generating these workflows. Generally, Hive is perceived as something that a business analyst would use to perform ad-hoc queries, rather than as a real programming language. Many stick to the dogma that every detail of a dataflow has to be explicitly handcrafted by a data engineer.

But think about SQL itself, for relational databases. Sure, there are plenty of business analysts who use it for writing custom reports, but SQL is also the lingua franca for the database developer, the software engineer, and the DBA, who has needed to interact with RDBMS for decades now. The theoretical model behind SQL has been dissected in countless papers, and many careers require being fluent in it; SQL has profound importance beyond simple reporting.

What about SQL has made it so pervasive? One reason is that it describes a dataset declaratively rather than procedurally. Users define a SQL query describing the data that they want to retrieve rather than taking the procedural steps required to gather that data. They define the “what,” not the “how” of the dataset. This allows developers to focus on the business requirements of their applications and not on the details of implementation or the optimization of those queries. The on-going enhancements have been developed over the years by the implementors of RDBMS engines, as well as the tireless efforts of dedicated DBAs and data architects.

Hive carries this promise of SQL to the new worlds of Hadoop and MapReduce. While the ad-hoc queries of the business analyst are very important, I’m considering the role it has for the newly-minted data engineer. Today’s data engineer is concerned with creating dataflows of massive size for new industries whose products consist entirely of data. In the turbulent startup world, time to market is of utmost importance; startups need to get new products up and running as quickly as possible. Requirements also change quite quickly, so it doesn’t make sense to waste engineering resources on optimizing and hand-crafting products that may not turn out to be very popular.

Brickhouse (on top of Hive and Hadoop) is intended to be a tool for the agile data developer. It provides the missing pieces of Hive, so the developer doesn’t have to drop down and write his own custom mappers and reducers. With Brickhouse you can express just about any data pipeline as a set of Hive views and queries. Plus, for cases where the traditional Hive approach has shortcomings, Brickhouse attempts to rework the approach to be more scalable and efficient, so that Hive queries can be run safely in production.

Brickhouse provides functionality in multiple areas:

  • Standard map and array transformations, like collect
  • Vector and timeseries calculations
  • Improved JSON UDFs for robustly parsing and producing JSON
  • HBase UDFs for inserting and reading from HBase
  • Probabilistic data structures, like KMV sketch sets, HyperLogLogs, and Bloom filters, for representing large datasets in a small amount of space
  • Many others as well

Let’s look at one specific example where Brickhouse helps out. This particular workflow had to go through a dataset and classify each record according to various sets of criteria. This would be implemented by a Hive developer with a script like this:

This is just an fictional example, but there are arbitrarily many record types that would have needed to be evaluated. In Hive, a query like this would span multiple MapReduce jobs; one for every sub-select that was being unioned. For early development, this was not such a problem, because we could live with multiple steps, especially if we were working with a reduced dataset. We didn’t really want to run this way in production, however, because we were making multiple passes of the data when we didn’t have to. For large datasets, simply scanning the data can take a significant amount of time. Also, for every new record type, we would have significantly added to our pipeline, so we couldn’t have easily added types if necessary. So, while the Hive approach technically would have worked, and was correct, it would have burned unnecessary cluster cycles, and increased runtimes when we were trying to meet an aggressive SLA for completing all of our jobs.

How can we solve this problem? We could write a custom mapper or reducer in raw Hadoop, and place the business logic for all our different record types in Java for the reducer. This would work, but would be pretty ugly, and would force us to do a standard MR job in the middle of our pipeline.

Another option would be to write a python script, and use Hive TRANSFORM to pipe records through it. This also would probably work, but would separate business logic between HQL and Python, so it wouldn’t be very maintainable. If possible, we’d want to keep our business logic in one general area, and Hive would seem like the right place.

This is how we came up with the conditional_emit UDTF. This UDTF takes an array of boolean expressions and emits a record for each one that is true. This way we can emit a record for all the record types that we need, with only one pass of the data. Adding record types would be simple; we’d simply add a new value to the array. The refactored query with the new UDTF would look something like the following:

As you can see, Brickhouse allows us to do the same thing with only one pass of the data. It reduces the number of MapReduce jobs which need to be run to just one. This saves us considerable cluster time, and reduces the use of temporary space for intermediate results. It makes it more likely that the pipeline will make its SLA.

This is only one example of how Brickhouse makes pipelines more efficient and reliable. If you are running a pipeline on Hadoop with Hive (or are planning on writing one), check it out – it will make your life much easier!

Jerome Banks is a Senior Software Engineer II on the analytics team at Tagged.