This is part two in our three-part series on our statistics pipeline. The first part described the pipeline itself; in this part we’ll go into detail on particular design decisions. The third part will cover the history of analytics at Twitch. Head over to our team page to learn more about what we do.
Also, join the discussion on Hacker News!
Ease of Integration

Our first goal was to make our co-workers want to give us their data.
As detailed in the first post, our clients send base64-encoded JSON data using HTTP GETs. At the core of this decision was ease of integration: by choosing to stay Mixpanel-compatible we were able to reuse our existing Mixpanel client libraries. This let our engineering teams keep using the clients they trust and removed the need for us to build and maintain a new set of clients in various languages. The use of GETs is driven by the same-origin policy that browsers enforce. We have a far wider range of clients than just browsers, however, so supporting POSTs is on our short-term roadmap. As CORS becomes more widely supported we’ll embrace POSTs in the browser as well.
Storage & Querying
Simplicity and scale form the basis of our choice of RedShift. We needed a system that could store many terabytes of data while permitting fast and simple fetching of that data. Any time you talk about huge volumes of data, Hadoop and Cassandra are natural contenders. RedShift was the standout option for us because it is PostgreSQL-based and, as such, has great client support along with the ability to perform arbitrary joins. When coupled with the COPY command for bulk inserts (we get 100k inserts a second, and haven’t begun optimizing yet), RedShift is a tectonic-level play by Amazon; our buddies over at Pinterest and Airbnb agree, and both have great write-ups on how they use it.
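The COPY path can be sketched with a statement like the following. The table name, bucket path, and role ARN are all hypothetical; the point is that RedShift ingests whole batches of files from S3 in one command rather than taking row-by-row INSERTs:

```sql
-- Bulk-load a batch of gzipped JSON event files from S3.
-- Names below are illustrative, not our production values.
COPY events
FROM 's3://example-bucket/events/2014-03-01/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 'auto'
GZIP;
```

Batching writes this way is what lets a small cluster sustain the insert rates quoted above.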
Core Pipeline Implementation
The decision for how to transform data from our clients into rows in RedShift was influenced by all three elements of our rubric. We knew we would need something fast, with good concurrency primitives and support for the libraries we’d want to use. We ended up with Clojure and Golang as our main candidates. Java and C ruled themselves out for lacking simple concurrency primitives. Clojure, on the other hand, has great concurrency primitives and interops with Java very elegantly.
In the end Golang won out over Clojure due to institutional knowledge; we have a number of engineers at Twitch who are deeply into Golang. It is not unusual to stumble into a lunchtime conversation covering the pros and cons of the new precise garbage collector and the various patches to Golang we need due to page size allocation changes introduced as part of Go 1.2. We have reached this point through being an early adopter of Golang and running it under some incredibly high loads, oftentimes discovering aspects of the runtime that were interesting to the Golang core development team. Golang’s approach to concurrency is beautiful, and its decision to build statically linked binaries makes deployment very easy. As a result, almost all greenfield projects at Twitch are in Golang. I can honestly say, if you want to work with Go there is nowhere better than Twitch.

Given we’re hosted on Amazon, we could also have opted to use Data Pipeline or Kinesis. Kinesis didn’t exist when we first started work on this project, though with it being so highly Java-oriented we may have still opted not to use it. One of the things we’re super excited about is open sourcing all of our work, as we firmly believe that making good product decisions shouldn’t be bound by available technology. We’re currently sweeping through code replacing AWS credentials with IAM roles from the UserData API (we recently patched goamz to fix role timeout issues); when we’re done we’ll release everything! Expect more on this in a couple of weeks.
Personally, one of the most significant technology decisions I think we’ve made is to host our pipeline on AWS. We have historically bought our own hardware and co-located it at various datacenters. This works wonders for managing the cost of a worldwide video CDN; our ops and video engineers are constantly finding ways to squeeze more out of our current set of machines while ensuring our next generation of machines can achieve the best QoS for our users. When it came to our data pipeline, we really had no idea what we would be bound by; we initially opted for two beefy boxes in our SFO datacenter. The first weekend taking significant traffic, we exhausted our available RAM and our pipeline ground to a halt. Faced with the choice of finding significant memory improvements or provisioning more capacity, we chose to move to the cloud, as iterating on our stack there would remove the burden we’d otherwise put on our ops team. When it came to cloud partners we had a couple of options: Digital Ocean, Rackspace, or AWS. Since we were already using S3 and RedShift, AWS made the most sense.
Our decisions to make things simple, scalable, and easy to integrate have permitted us to build a very efficient pipeline. If you share our passion for making smart decisions and being a member of a high-leverage team, then let us know, as we’re currently looking for our 5th team member.