First, thank you, Mr. Sydor, for your insights into the fail-over strategies and their purposes.
What changes, improvements or possible new release features have been added to this discussion over the last year?
This topic has come up again at my company. The value argument has some traction, but the undertones are still there. From my understanding, there are two points of failure that agent-collector load balancing does not cover.
1. MOM Failure
This is the more critical of the two, since the MOM is the nerve center of our solution. Without it, we cannot access the metrics and the solution cannot send out alerts. The primary concern is the alerts not going out. While the band-aid options of a file share and a lock file are available, both come with very undesirable side effects that are worse than the problem they are trying to solve.
It seems the main issue is that the MOM is both the communication hub and the primary consumer of system requests. Decoupling the communication bus and moving to a distributed publish/subscribe request model might not only help solve this issue but also remove the artificial limit of having only one MOM within a cluster. But that is based only on my guess at how the black-box communication between the MOM and the collectors functions.
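To make the publish/subscribe idea concrete, here is a rough Python sketch. It is purely illustrative and says nothing about how the MOM and collectors actually communicate: `Bus`, `AlertConsumer`, the `metrics` topic and the threshold value are all hypothetical names I made up, not APM APIs.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Bus:
    """Toy in-process publish/subscribe bus (hypothetical, not an APM API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber; return how many handled the message."""
        delivered = 0
        for handler in self._subscribers[topic]:
            try:
                handler(message)
                delivered += 1
            except Exception:
                # A failed consumer must not block delivery to the others.
                continue
        return delivered


class AlertConsumer:
    """Hypothetical alert evaluator; several can listen on the same topic."""

    def __init__(self, name: str, threshold: float) -> None:
        self.name = name
        self.threshold = threshold
        self.alive = True
        self.alerts: List[str] = []

    def handle(self, message: dict) -> None:
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        if message["value"] > self.threshold:
            self.alerts.append(f"{self.name}: {message['metric']}={message['value']}")


bus = Bus()
primary = AlertConsumer("consumer-a", threshold=90.0)
standby = AlertConsumer("consumer-b", threshold=90.0)
bus.subscribe("metrics", primary.handle)
bus.subscribe("metrics", standby.handle)

# Simulate the first consumer failing; the second still sees the metric.
primary.alive = False
delivered = bus.publish("metrics", {"metric": "cpu", "value": 97.0})
```

The point is only that the publisher never depends on any single consumer being up. A real design would also have to deal with duplicate alerts when both consumers are healthy, which this sketch ignores.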
2. APM DB Failure
This is very much secondary, since storage of the app map and CEM transactions matters less than the metrics and capabilities through which APM provides value during a critical failure event. I would like to know more about how the APM collectors behave when the APM DB has failed, but I think that is a different discussion.
As for CEM, given the 9.6 requirement for a Red Hat OS and my failure to help people understand the worth and use of the CEM metrics, we have removed our TIMs.
Again, thank you for your insight,
Billy