[Watching Logs] How to Avoid Drowning in a Log Avalanche #13374
davift started this conversation in Show and tell
I guess most of us are familiar with running:
tail -f /var/log/cloudstack/management/management-server.log
and being immediately blasted with an unbearable amount of log messages.
Enabling debug logging is often essential for troubleshooting and identifying clues that lead to a solution. However, it also increases the volume of logs significantly, making it even harder to spot the information that actually matters.
Even worse, you may discover that a particular issue has been occurring for days, weeks, or even months, continuously flooding the logs without anyone noticing.
Wouldn't it be useful to visualize the occurrence of known (classified) events over time, correlate them with infrastructure events, and receive alerts when unknown patterns or abnormal spikes appear?
To help with this, I built a tool that uses AI to classify log entries of any kind. I called it LogWatcher.
What does it have to do with CloudStack?
I trained LogWatcher on millions of CloudStack log lines, then spent time reviewing and correcting the classifications to improve accuracy, because AI is just a statistical guessing machine. The resulting knowledge bases for ACS Management and ACS KVM Agent are available here.
What does this mean?
Anyone can load the pre-trained knowledge bases and immediately start classifying CloudStack logs. The tool can run in offline mode using the existing knowledge base, or continue learning as it encounters new patterns.
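To make the offline idea concrete, here is a minimal sketch of classifying lines against a fixed knowledge base. It is not LogWatcher's actual code or file format, which this post does not describe. The regex-to-label dictionary, the labels, and the sample lines are all hypothetical placeholders, and the sample lines are made up rather than real CloudStack output.

```python
import re
from collections import Counter

# Hypothetical knowledge base: regex pattern -> class label.
# LogWatcher's real knowledge-base format is not shown in this post.
KNOWLEDGE_BASE = {
    r"connection refused": "connection_refused",
    r"timed out after \d+ ?ms": "timeout",
}

# Compile once so each line is only matched, not re-parsed.
COMPILED = [(re.compile(pattern), label) for pattern, label in KNOWLEDGE_BASE.items()]


def classify(line):
    """Return the label of the first matching pattern, or 'unknown'."""
    for regex, label in COMPILED:
        if regex.search(line):
            return label
    return "unknown"


def count_classes(lines):
    """Count how many lines fall into each class."""
    return Counter(classify(line) for line in lines)


# Made-up sample lines, for illustration only.
sample = [
    "ERROR connection refused by 10.0.0.5",
    "WARN request timed out after 3000 ms",
    "WARN request timed out after 150ms",
    "INFO something never seen before",
]
counts = count_classes(sample)
```

In this sketch, anything that matches no known pattern lands in the "unknown" class. In offline mode those lines would only be counted; a learning mode would presumably hand them to the AI for classification instead.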
The generated metrics can be scraped by Prometheus and visualized in Grafana, making it easy to create dashboards and alerts. This provides visibility into trends, helps correlate issues with infrastructure events, and can reveal silent problems long before users report them.
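As an illustration of what Prometheus scrapes, the sketch below renders per-class counts in the Prometheus text exposition format. The metric name log_events_total and the label name event_class are assumptions made for this example, not the names LogWatcher actually exports, and the counts are invented.

```python
def render_metrics(counts, metric="log_events_total"):
    """Render per-class counts in the Prometheus text exposition format."""
    lines = [
        f"# HELP {metric} Number of log lines seen per class.",
        f"# TYPE {metric} counter",
    ]
    for label in sorted(counts):
        # Escape backslash, double quote and newline inside label values.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'{metric}{{event_class="{escaped}"}} {counts[label]}')
    return "\n".join(lines) + "\n"


# Invented counts, for illustration only.
example_counts = {"timeout": 3, "unknown": 1}
output = render_metrics(example_counts)
```

Serving this text over HTTP on a metrics endpoint is enough for Prometheus to scrape it. A Grafana alert on the "unknown" series would then be one way to catch patterns the knowledge base has not seen yet.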
Request for Help
I would love to collaborate with CloudStack operators to expand the knowledge base and cover a wider range of issues that I haven't been able to reproduce and train LogWatcher on.
For those curious about performance: running as a single-threaded application with a pre-trained knowledge base (no AI invoked for classification), LogWatcher typically evaluates between 10,000 and 20,000 log lines per second and can process 10 million log lines in roughly 10 minutes.
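As a quick sanity check, the two figures above agree with each other. The arithmetic uses only the numbers quoted in this post:

```python
lines_processed = 10_000_000
elapsed_seconds = 10 * 60  # roughly 10 minutes

# Average throughput implied by the 10-minute figure.
throughput = lines_processed / elapsed_seconds  # about 16,667 lines per second

# Time needed for 10 million lines at each end of the quoted range.
seconds_at_low = lines_processed / 10_000   # 1000 s, about 16.7 minutes
seconds_at_high = lines_processed / 20_000  # 500 s, about 8.3 minutes
```

So 10 million lines in about 10 minutes sits inside the quoted range of 10,000 to 20,000 lines per second.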
I also run it in a centralized setup, where logs from multiple hosts are collected and analyzed through a single pane of glass.
If you are interested in contributing log samples, testing the knowledge base, or sharing feedback, I would be happy to collaborate.