
Brickhouse comes to Tagged

Brickhouse is a library of user-defined functions for Hive and Hadoop that enables the agile development of big data pipelines. It provides tools and techniques that allow those pipelines to process larger datasets more robustly. Brickhouse was developed at Klout, whose team will continue to use and build upon it and hosts the GitHub repo. Since I’ve recently joined the Tagged team and am the primary maintainer of Brickhouse, we’re going to start using it at Tagged as well.

Tagged embraces open source as part of its engineering culture, and will continue to support Brickhouse along with its other open-source projects. The Tagged website produces massive amounts of data, and we’ll be using Brickhouse for more advanced analyses of this data.

The premise here is that one uses Hive as the underlying programming language for pipeline development. There are various options for describing dataflows in Hadoop (e.g., Hive, Pig, Cascading and Scalding). I believe Hive is the simplest approach for generating these workflows. Generally, Hive is perceived as something that a business analyst would use to perform ad-hoc queries, rather than as a real programming language. Many stick to the dogma that every detail of a dataflow has to be explicitly handcrafted by a data engineer.

But think about SQL itself, for relational databases. Sure, there are plenty of business analysts who use it for writing custom reports, but SQL is also the lingua franca for the database developer, the software engineer, and the DBA, all of whom have needed to interact with RDBMSs for decades now. The theoretical model behind SQL has been dissected in countless papers, and many careers require fluency in it; SQL has profound importance beyond simple reporting.

What about SQL has made it so pervasive? One reason is that it describes a dataset declaratively rather than procedurally. Users write a SQL query describing the data they want to retrieve rather than the procedural steps required to gather that data. They define the “what,” not the “how,” of the dataset. This allows developers to focus on the business requirements of their applications rather than on implementation details or query optimization. Those ongoing optimizations have been developed over the years by the implementers of RDBMS engines, as well as through the tireless efforts of dedicated DBAs and data architects.

Hive carries this promise of SQL to the new worlds of Hadoop and MapReduce. While the ad-hoc queries of the business analyst are very important, I’m considering the role it has for the newly-minted data engineer. Today’s data engineer is concerned with creating dataflows of massive size for new industries whose products consist entirely of data. In the turbulent startup world, time to market is of utmost importance; startups need to get new products up and running as quickly as possible. Requirements also change quite quickly, so it doesn’t make sense to waste engineering resources on optimizing and hand-crafting products that may not turn out to be very popular.

Brickhouse (on top of Hive and Hadoop) is intended to be a tool for the agile data developer. It provides the missing pieces of Hive, so developers don’t have to drop down and write their own custom mappers and reducers. With Brickhouse you can express just about any data pipeline as a set of Hive views and queries. Plus, for cases where the traditional Hive approach has shortcomings, Brickhouse attempts to rework the approach to be more scalable and efficient, so that Hive queries can be run safely in production.

Brickhouse provides functionality in multiple areas:

  • Standard map and array transformations, like collect
  • Vector and timeseries calculations
  • Improved JSON UDFs for robustly parsing and producing JSON
  • HBase UDFs for inserting and reading from HBase
  • Probabilistic data structures, like KMV sketch sets, HyperLogLogs, and Bloom filters, for representing large datasets in a small amount of space
  • Many others as well
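As a small taste of the API, here is a minimal sketch combining the collect and to_json UDFs — the table and column names are invented for illustration, and the UDF class paths may vary by Brickhouse version:

```sql
-- Hypothetical sketch: aggregate each user's event counters into a map,
-- then serialize the map as a JSON string.
-- (Table/column names are invented; class paths may differ by version.)
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF';

SELECT user_id,
       to_json(collect(event_type, event_count)) AS counts_json
  FROM user_events
 GROUP BY user_id;
```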

Let’s look at one specific example where Brickhouse helps out. This particular workflow had to go through a dataset and classify each record according to various sets of criteria. A Hive developer would typically implement this with a script along these lines:
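A hypothetical sketch of that multi-pass pattern — one UNION ALL sub-select per record type, with entirely invented table, column and type names — might look like:

```sql
-- Hypothetical sketch of the multi-pass approach: each sub-select
-- scans the table again, so each one costs another MapReduce job.
SELECT 'type_a' AS record_type, user_id
  FROM events
 WHERE event_count > 10
UNION ALL
SELECT 'type_b' AS record_type, user_id
  FROM events
 WHERE country = 'US'
UNION ALL
SELECT 'type_c' AS record_type, user_id
  FROM events
 WHERE is_new_user = true;
```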

This is just a fictional example, but there were arbitrarily many record types that would have needed to be evaluated. In Hive, a query like this spans multiple MapReduce jobs: one for every sub-select being unioned. For early development, this was not such a problem, because we could live with multiple steps, especially when working with a reduced dataset. We didn’t want to run this way in production, however, because we were making multiple passes over the data when we didn’t have to. For large datasets, simply scanning the data can take a significant amount of time. Also, every new record type would have significantly grown our pipeline, so we couldn’t have easily added types when necessary. So, while the Hive approach technically worked, and was correct, it would have burned unnecessary cluster cycles and increased runtimes while we were trying to meet an aggressive SLA for completing all of our jobs.

How can we solve this problem? We could write a custom mapper or reducer in raw Hadoop, and place the business logic for all our different record types in Java for the reducer. This would work, but would be pretty ugly, and would force us to do a standard MR job in the middle of our pipeline.

Another option would be to write a Python script and use Hive’s TRANSFORM to pipe records through it. This would probably work too, but it would split the business logic between HQL and Python, so it wouldn’t be very maintainable. If possible, we’d want to keep our business logic in one place, and Hive seemed like the right one.

This is how we came up with the conditional_emit UDTF. This UDTF takes an array of boolean expressions and emits a record for each one that is true. This way we can emit a record for all the record types that we need, with only one pass of the data. Adding record types would be simple; we’d simply add a new value to the array. The refactored query with the new UDTF would look something like the following:
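A minimal sketch of such a refactored query, assuming the same invented table, columns and record-type names as above (the UDF class path may vary by Brickhouse version):

```sql
-- Hypothetical sketch of the single-pass approach: conditional_emit
-- pairs each boolean condition with a label and emits a row for every
-- condition that evaluates to true. One scan, one MapReduce job.
CREATE TEMPORARY FUNCTION conditional_emit
  AS 'brickhouse.udf.collect.ConditionalEmit';

SELECT user_id, record_type
  FROM events
LATERAL VIEW conditional_emit(
    array(event_count > 10, country = 'US', is_new_user = true),
    array('type_a', 'type_b', 'type_c')
  ) ce AS record_type;
```

Adding a new record type is now just one more entry in each array, rather than another full sub-select and another pass over the data.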

As you can see, Brickhouse allows us to do the same thing with only one pass of the data. It reduces the number of MapReduce jobs which need to be run to just one. This saves us considerable cluster time, and reduces the use of temporary space for intermediate results. It makes it more likely that the pipeline will make its SLA.

This is only one example of how Brickhouse makes pipelines more efficient and reliable. If you are running a pipeline on Hadoop with Hive (or are planning on writing one), check it out – it will make your life much easier!

Jerome Banks is a Senior Software Engineer II on the analytics team at Tagged.

Treasure Quest

Fun, Flash and Festivity: Tagged Winter Game Jam

Most games take months or years to develop. During a recent Game Jam, our Games Team made four fully functioning games in 24 hours.

At Tagged, Game Jams feel reminiscent of music jam sessions, where friends gather for hours on end to rock out and tap into their creative side. Sub out the guitars for computers and rockers for coders and you’ve got a feel for our vibe.

Over the course of 24 hours, our game designers, artists, programmers and producers develop at an extremely fast pace in order to create several playable versions of concepts generated by the team. Game Jams provide a great opportunity to step out of our usual roles during production and skip right to the heart of creating something for other people to enjoy. It’s also a great chance for people to create with co-workers outside of their regular teams.

This Game Jam, specifically, helped us focus on how we can best prototype for Flash. Our team recently decided to use Flash for our next few games, as we found it to perform better when stressing animation, audio and customizable art assets. Due to the way technologies like HTML5 load bitmaps and animation data, we’re better off using vectors with Flash instead of developing a new loading pipeline for bitmaps. This is important because Tagged serves users across many browsers and connection speeds, and a poor loading pipeline can severely impact both the first-time experience and the game as a whole.

We established a few goals at the beginning of the Jam to make it both extremely fun and productive:

  1. What techniques and tools can we use to be most efficient in Flash development?
  2. What game concept should be developed for our next Tagged game?
  3. Team building!

Laying out the games

At 5 p.m. we gathered in our main meeting room where our assignments were revealed. The next 24 hours were intense. Early success came when one of our groups had a playable game in only a couple of hours. Two of the teams were working with the Flixel library in order to skip most of the basic game loop coding. Most of the groups pushed on through the night with small celebrations by the teams as each part of their games started to function.

Working hard

By 10 p.m. everyone had hit their stride and was coding their respective games.

Our productivity held strong through the early morning hours, but we grew more tired and silly as the morning went on. We recharged with pizza at 4 a.m., and pushed through the morning. Final tweaks and polish were added during the last few hours, and teams took their hands off their keyboards when the Jam ended at 5 p.m. the next day.

We still had to celebrate though! So in classic Tagged fashion, we brought in some beers, played some games, and reflected on the past 24 hours of insane productivity in a debrief.

The results were incredible! By the time we ended, we had four fully functional games: a paper prototype, a two-player game, a networked multiplayer game and a game that ran both in Flash and on the iPhone! We learned that most of us made design decisions based on what could be coded quickly, not necessarily on what would be the most impressive feature. Having this clear vision of “do it quick or not at all” really allowed the teams to make playable games with many features in a short amount of time. All of these were solid prototypes that will be played and studied over the next few weeks to find the next best game to release to our users on Tagged.

Game discussion

At the end, not only do we have some awesome new games to play, but we also learned more about how each member of our team works and how insanely fast we can make games. We also all grew closer through sharing this intense, exhausting, and fun experience together. We’ll definitely be doing another Game Jam next quarter and will be sure to report back!

The screen capture above is of “Treasure Quest.”

Auston Montville is a Junior Game Designer at Tagged and loves chatting about games.

Tagged on Github!

We’re excited to announce that Tagged is now on GitHub, the web-based hosting service for the Git distributed version control system.

You can follow some of the projects our engineers are contributing to, such as Node-Kafka, a Node.js client for Kafka (LinkedIn’s disk-based message queue), and JHM, an intelligent build system we’re developing here at Tagged, with many more projects to come in the future. We’ll be writing about how we’re using open-source projects here on our blog.

We’re excited to work with and give back to the open-source community. Follow all the projects we’re contributing to here – https://github.com/tagged


A Day in the Life at Tagged (Intern-style!)

My Tagged Story

As an intern on the Mobile Web team, I helped develop and launch Tagged’s first BlackBerry app. My typical work day involves creating features and products for Tagged’s mobile division – and I’m also sure to set aside some time during and after work to battle my co-workers in StarCraft, chess or whatever new board game someone brought in.

At Tagged, every intern is assigned a mentor who provides projects and ongoing career guidance. The large pool of talented mentors is a critical part of the internship program’s success. The quality of my internship has been substantially increased by my mentor, Mark Kater, who gives me the right balance of direction and freedom to help me excel at my work. Being new to the company and to mobile development – and shy by nature – I’ve found having a mentor by my side immensely helpful for my growth and development. Beyond my immediate mentor, everyone at Tagged is hugely helpful and friendly – it’s as if those qualities are required to work at the company!

Working on Mobile

My work on the Tagged BlackBerry app was definitely one of the most rewarding projects I’ve taken on at Tagged. From research to development to deployment, my favorite part of the process was researching and experimenting with different app solutions. One big surprise for me was how many steps it took to actually set up and deploy the app on BlackBerry. Deploying the app required an array of resources, including legal documentation and icons.

As part of the Mobile team, many of my tasks have involved porting features from Desktop Web to Mobile Web (e.g., user registration, status updates, etc.) or fixing bugs that involve varying amounts of PHP, HTML, JavaScript and CSS. Tagged’s Mobile Web is being developed with a progressive enhancement strategy to ensure support for lower-end devices, so we haven’t gotten around to introducing JavaScript yet. This means that porting over a feature from Desktop Web usually involves thinking about how it can be adapted to function in an environment without AJAX, lightbox overlays or any of the other similar luxuries we may be accustomed to. Despite the fact that our Mobile site has a limited feature set compared to our Desktop Web version, mobile daily page views have soared to nearly 15 million.

Interns are also tasked with giving an end-of-term presentation to the entire company about a topic of their choosing. Presentation topics cover a wide spectrum, including math, technology and culture. The presentations provide a great opportunity for interns to showcase their interests and talents to the broader team.

Every day I’m excited to come into Tagged and work with my teammates to launch new features on Mobile Web. You know you’re at a great place when it doesn’t even feel like work!

Allen Dam is an intern on the Mobile Team at Tagged and you can follow him on Twitter.


Site Monitoring At Tagged With Graphite

Last Thursday I had the opportunity to give a talk on one of my favorite visualization tools, Graphite, at the Bay Area Large Scale Production Engineering Meetup. Recently, we’ve been trying out the Graphite Realtime Graphing system at Tagged. It started as an experiment during our latest Hackathon, and the more we’ve tried it, the more things there are to like.

For those interested, I’ve attached the presentation below, and the video of my talk is available here:

Dave Mangot is a Senior Systems Administrator at Tagged and you can follow him on his blog.