We had the same observation early on in our environment as we deployed several production bots: we kept finding situations where things got stuck without warning. As a result we built two bots. The first, called the Collector, downloads the audit and history log details via API calls every 30 minutes. The second, called the Monitor, runs independently and checks the collected audit data for situations that might indicate problems. We are also looking at feeding the collected data into a tool like Splunk to handle the alerting.
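Roughly, the Collector works like the sketch below. This is only an illustration of the polling loop, not the vendor's actual API: the endpoint path, the X-Authorization header, CONTROL_ROOM_URL, and the response field names are placeholders you would swap for whatever your control room exposes.

```python
# Minimal Collector sketch -- endpoint path, auth header, and field names are
# placeholders for illustration, not the vendor's actual API.
import json
import time
from datetime import datetime, timedelta, timezone

import requests

CONTROL_ROOM_URL = "https://controlroom.example.com"   # hypothetical URL
API_KEY = "replace-with-current-key"                    # expires every 45 days in our setup
POLL_INTERVAL_SECS = 30 * 60                            # run every 30 minutes

def fetch_audit_entries(since: datetime) -> list[dict]:
    """Pull audit/history entries created after `since` (hypothetical endpoint)."""
    resp = requests.get(
        f"{CONTROL_ROOM_URL}/v1/audit/messages",        # placeholder path
        headers={"X-Authorization": API_KEY},
        params={"createdAfter": since.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("list", [])

def main() -> None:
    last_run = datetime.now(timezone.utc) - timedelta(minutes=30)
    while True:
        entries = fetch_audit_entries(last_run)
        last_run = datetime.now(timezone.utc)
        # Append raw entries to a local file; Splunk (or the Monitor bot) reads from here.
        with open("audit_log.jsonl", "a", encoding="utf-8") as fh:
            for entry in entries:
                fh.write(json.dumps(entry) + "\n")
        time.sleep(POLL_INTERVAL_SECS)

if __name__ == "__main__":
    main()
```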
For example:
Check for bots with long runtimes (45+ minutes).
Check for a large number of bots in the run/waiting queue - we have some that run every 15 minutes, so if we get more than 4 deep it's a potential problem (see the sketch after this list).
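Here is a rough sketch of how the Monitor applies those two checks to the Collector's output. The field names ("botName", "status", "startTime") and status values are assumptions about the log format, not the real audit schema, so treat this as a starting point only.

```python
# Minimal Monitor sketch over the Collector's output -- field names and
# status values are assumed for illustration, not the real audit schema.
import json
from collections import Counter
from datetime import datetime, timezone

RUNTIME_THRESHOLD_MINS = 45   # flag bots running longer than this
QUEUE_DEPTH_THRESHOLD = 4     # flag bots with more than this many queued runs

def check(entries: list[dict]) -> list[str]:
    alerts = []
    now = datetime.now(timezone.utc)
    queued = Counter()
    for e in entries:
        name, status = e.get("botName", "?"), e.get("status", "")
        if status == "RUNNING":
            started = datetime.fromisoformat(e["startTime"])  # assumed ISO timestamp
            if started.tzinfo is None:
                started = started.replace(tzinfo=timezone.utc)
            runtime_mins = (now - started).total_seconds() / 60
            if runtime_mins > RUNTIME_THRESHOLD_MINS:
                alerts.append(f"{name} has been running for {runtime_mins:.0f} minutes")
        elif status in ("QUEUED", "PENDING_EXECUTION"):      # assumed queue statuses
            queued[name] += 1
    for name, depth in queued.items():
        if depth > QUEUE_DEPTH_THRESHOLD:
            alerts.append(f"{name} has {depth} runs stacked in the queue")
    return alerts

if __name__ == "__main__":
    with open("audit_log.jsonl", encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh]
    for alert in check(entries):
        print("ALERT:", alert)
```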
One challenge with this operation: it depends on bot runners, so it executes from the dev or QA environments while collecting stats for all environments. It also depends on the API key, which expires every 45 days.