Good morning everyone,
My name is Mark and I’m the co-founder of KaizenOps, a startup from the CA Accelerator program focused on helping site reliability engineers keep production systems running well. I’ve spent the past 15 years of my professional career in the reliability business, most recently as the head of product management for CA APM. In my experience, the problems surrounding production incidents haven’t gotten any easier to solve. In my new role, I want to tackle this problem head-on. Cabot aims to tie site issues directly to the source of production problems, keeping customers engaged, engineers out of crisis mode, and the business running smoothly.
We're in the process of digging into the features for our first release, and we could use your help to ensure that what we build provides a good experience and truly helps in the middle of production incidents.
To give you an example of one of the features that we’re considering: we believe that alert messages should contain all of the information needed to quickly focus an engineer’s attention on the problem and its business impact, and to provide guidance toward mitigating it. Here is an example of the kind of message that Cabot will produce:
"Over the past 10 minutes, login txn latency has trended upwards to 7 seconds. It is normally 2 seconds. In addition, app volume has dropped 40% to 300 users from a normal level of 500. The problem appears to be related to high IO rates from 2 machines in the AWS region tagged US-East-1A. The 'scanner' process on EC2 instances named AS54 and AS39 appears to be generating most of the IO requests."
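For anyone who prefers to think in code, here is a rough sketch of how a message like the one above could be assembled from a few structured fields. This is purely illustrative and not Cabot's actual implementation; every name in it (`MetricShift`, `SuspectedCause`, `render_alert`) is hypothetical, and the values are simply the ones from the example message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricShift:
    """One metric that has moved away from its normal level (hypothetical type)."""
    name: str
    current: float
    normal: float
    unit: str


@dataclass
class SuspectedCause:
    """Where the problem appears to originate (hypothetical type)."""
    symptom: str
    region: str
    process: str
    hosts: List[str]


def percent_change(shift: MetricShift) -> float:
    """Signed change of the current value relative to normal, in percent."""
    return (shift.current - shift.normal) / shift.normal * 100.0


def render_alert(window_minutes: int, latency: MetricShift,
                 volume: MetricShift, cause: SuspectedCause) -> str:
    """Combine the symptom, the business impact, and the likely source into one message."""
    drop = abs(percent_change(volume))
    return (
        f"Over the past {window_minutes} minutes, {latency.name} has trended "
        f"upwards to {latency.current:g} {latency.unit}. "
        f"It is normally {latency.normal:g} {latency.unit}. "
        f"In addition, {volume.name} has dropped {drop:.0f}% to "
        f"{volume.current:g} {volume.unit} from a normal level of {volume.normal:g}. "
        f"The problem appears to be related to {cause.symptom} from "
        f"{len(cause.hosts)} machines in the AWS region tagged {cause.region}. "
        f"The '{cause.process}' process on EC2 instances named "
        f"{' and '.join(cause.hosts)} appears to be generating most of the IO requests."
    )


# Values taken from the example message in this post.
message = render_alert(
    window_minutes=10,
    latency=MetricShift("login txn latency", current=7, normal=2, unit="seconds"),
    volume=MetricShift("app volume", current=300, normal=500, unit="users"),
    cause=SuspectedCause("high IO rates", "US-East-1A", "scanner", ["AS54", "AS39"]),
)
print(message)
```

The point of keeping the fields separate is that the same data could feed a chat message, a page, or a dashboard without re-parsing text. Which fields matter most to you in the middle of an incident is exactly the kind of feedback we are looking for.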
We would love to hear your input on other features that would give you exactly the right information at the right time. Please schedule a time to talk to us.
If you have questions, please feel free to reach out via DM or comment below.