Chat Scalability Improvements

My post yesterday was purely about the chat product as a concept. In it I made a loose comment about "resisting improving chat". Izlsnizzt completely hit the nail on the head when he asked "are you ready for that to be taken out of context" :)

While I was busy thinking about our stats system and musing about how great chat is (as a product), two of our engineers were continuing their work on scaling our chat system. For some context: this is a constant, ongoing process. It is one of our core arcs of work; it is omnipresent, and we invest a good deal in monitoring, performance profiling, and development to make it a better system overall. When a phenomenon like TPP comes along and increases load on the system manyfold, it gives us a great opportunity to discover and fix new issues, including ones that only rear their heads under super high load.

Yesterday we found our Redis servers were pegged at 100% CPU. We've ramped up more of them, and we're making sure our monitoring picks up this class of issue in the future. We also found some configurations in our chat stack itself that were not optimal. Fixing these two things has had a massive impact:

<3 those lovely smooth lines :) ... Also you guys seem to think it is helping too: