This year, my colleague Shobeir and I attended the GraphLab Conference to learn more about GraphLab and data science in general. The event began with a great introduction by Carlos Guestrin, who laid out GraphLab’s strategy, vision and practices. After that we attended a few more high-level talks about GraphLab before delving into sessions on data science, data engineering and deployment.
The GraphLab team introduced many exciting features added to GraphLab Create since its last release, including gradient-boosted trees for regression and classification, deployment capabilities, hyper-parameter tuning, higher-capacity SFrames, access to the latent factors of matrix factorization methods and much more.
In the second session we heard from Alice Zheng, the director of data science at GraphLab, who spoke about the machine learning toolkits in GraphLab Create. These include tools for classification and regression, clustering, recommender systems, topic modeling and more. The session continued with additional talks, including one on scaling distributed machine learning with the parameter server by Alex Smola from Carnegie Mellon. We also enjoyed hearing from Tao Ye of Pandora on large-scale music recommendation.
The third session focused on data engineering with GraphLab. First, we heard from Yucheng Low of GraphLab, who was a TA when I took an artificial intelligence course at Carnegie Mellon! His talk focused on two core data structures: SFrame and SGraph. It was great to hear some background on these data structures before getting our hands dirty with them the next day.
Another presentation that really stood out was Joe Hellerstein’s talk on domain-specific languages (DSLs). A DSL can make a task much easier than it would be in a general-purpose programming language. Hellerstein has worked on an environment that allows very natural interaction with data and could make data munging much less painful.
At Tagged, we’ve noticed how useful DSLs can be, and have worked on a system called FeatureBuilder that allows data scientists to define features in a DSL. Our aim is to let data scientists define a feature exactly once, without having to learn Scala, and then use that definition as the “ground truth” through all phases of model building, from exploration through production.
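To give a flavor of the define-once idea, here is a minimal Python sketch. It is not FeatureBuilder or its syntax; the `Feature` class, the `field` and `ratio` helpers, and the field names are all hypothetical, chosen only for illustration. The point it tries to show is that a batch (exploration) path and a single-record (production) path can share one feature definition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Expr = Callable[[Record], float]


@dataclass(frozen=True)
class Feature:
    """A named feature with a single definition shared by every phase."""
    name: str
    compute: Expr


def field(key: str, default: float = 0.0) -> Expr:
    """Read a numeric field from a record, falling back to a default."""
    return lambda record: float(record.get(key, default))


def ratio(numerator: Expr, denominator: Expr) -> Expr:
    """Divide two sub-expressions, returning 0.0 when the denominator is 0."""
    def compute(record: Record) -> float:
        d = denominator(record)
        return numerator(record) / d if d else 0.0
    return compute


# The feature is defined exactly once (hypothetical name and fields).
MESSAGES_PER_FRIEND = Feature(
    name="messages_per_friend",
    compute=ratio(field("messages_sent"), field("friend_count")),
)


def featurize_one(features: List[Feature], record: Record) -> Dict[str, float]:
    """Production-style path: evaluate the definitions on one record."""
    return {f.name: f.compute(record) for f in features}


def featurize_batch(
    features: List[Feature], records: List[Record]
) -> List[Dict[str, float]]:
    """Exploration-style path: reuse the same definitions over many records."""
    return [featurize_one(features, r) for r in records]
```

Because both paths call the same `compute` function, the value a model sees in production cannot drift from the value it was trained on, which is the property the "ground truth" definition is meant to provide.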
The last session covered deployment, or how to use all of this in the real world. Rajat Arya’s talk took away any anxiety we might have had about the deployment process; the intention is for GraphLab Create to carry a project from exploration straight through to production. Milind Bhandarkar, chief scientist at Pivotal, gave another interesting talk about running GraphLab analyses on top of Hadoop data with Hamster. This seems like something we could really take advantage of at Tagged, and we’d be very excited to try it. Vahab Mirrokni, head of the algorithms research group at Google Research NY, also told us about a fault-tolerant asynchronous message-passing algorithm that is being studied at Google.
In addition to the speakers, there were many interesting exhibitors showing the latest and greatest trends in graph databases, machine learning and big data analytics. Shobeir presented a poster on using GraphLab to detect social spam at Tagged, which was well received by the attendees. While this took place, I went on a random walk of the other booths to hear what everyone had to say.
On the second day of the conference we worked through hands-on exercises with support from the GraphLab team. I loaded up GraphLab Create on my MacBook and got cracking. It was great to be working on exercises while surrounded by other people doing the same thing. Plus, there were great resources readily available to help. It felt like being back in school!
Shobeir Fakhraei contributed to this post.