Logster and our error logging strategy at Discourse

over 6 years ago

I have always been somewhat fascinated with logs. I tend to see the warning and error logs in production as a valuable heartbeat of an application. Proper handling of error logs is a very strong complement to a robust test suite. It shows us what really happens when real world data meets our application.

9 years ago, at Stack Overflow we had a daily ritual where we would open up our fork of ELMAH every morning and fish through our logs for problems. This had a dramatic positive effect on Stack Overflow.

Almost 7 years into our journey building Discourse, every single week we find and fix issues in our application thanks to our error logs and Logster. Error logs are the pulse of our application; they let us know immediately if there are any urgent issues and where. Since we host more than 1500 sites running many different code branches, we needed to evolve a sane and robust set of practices and tools.

Top level structure of logging and monitoring at Discourse

We have lots of logs at Discourse and many systems for dealing with them.

We keep raw Docker, Postgres, Redis, NGINX, Rails, HAProxy and other logs in Elasticsearch and use Kibana for business intelligence.
We have a monitoring system built on Alertmanager and Prometheus, with business intelligence in Grafana and alert escalation in our internal Discourse instance and Opsgenie.
We have Logster, which we use for web application (that is, “Rails / Sidekiq”) warnings and errors.

I would like to focus on Logster and our Rails / Sidekiq portion for this blog post, but I think it is worth mentioning the other mechanisms because I don’t want people to think we are poor data hoarders with only very limited visibility into our systems.

About Logster

At Discourse we developed a log viewer called logster.

Logster is a free and open source tool you can embed into any Ruby on Rails or Rack application in production and development. It runs as Rack middleware and uses Redis as its backend for log storage and analysis.

It operates in two different modes:

In production mode it listens for warnings, errors and fatal messages and aggregates similar errors by fingerprinting their backtraces. The intention is to display a list of open application problems that can somehow be resolved.
In development mode it provides a full fire-hose of all logs produced by Rails (debug and up). This has significant advantages over the console, as you have proper access to backtraces for every log line.
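To make the idea of aggregating by backtrace fingerprint concrete, here is a minimal sketch in plain Ruby. It is not Logster's actual implementation: the names backtrace_fingerprint and ErrorGroups, the choice of SHA1, the depth of 5 frames and the decision to ignore line numbers are all assumptions made for illustration.

```ruby
require "digest"

# Hypothetical helper (not Logster's real code): reduce a backtrace to a
# stable fingerprint so repeats of the same error collapse into one group.
def backtrace_fingerprint(error_class, backtrace, depth: 5)
  frames = backtrace.first(depth).map do |frame|
    # Drop the line number so small edits to a file do not split a group.
    frame.sub(/:\d+:in /, ":in ")
  end
  Digest::SHA1.hexdigest(([error_class] + frames).join("\n"))
end

# Groups reported errors by fingerprint, keeping a count and the latest message.
class ErrorGroups
  Group = Struct.new(:error_class, :message, :count)

  def initialize
    @groups = {}
  end

  def report(error)
    key = backtrace_fingerprint(error.class.name, error.backtrace || [])
    group = (@groups[key] ||= Group.new(error.class.name, nil, 0))
    group.message = error.message
    group.count += 1
    group
  end

  # Most frequent problems first.
  def open_problems
    @groups.values.sort_by { |group| -group.count }
  end
end
```

With something like this, a thousand occurrences of the same exception from the same code path show up as one row with a count of 1000, rather than a thousand rows.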

Here are a few screenshots from logs on this very blog (accessible to admins at https://discuss.samsaffron.com/logs):

Each error log has a full backtrace

Web requests have extensive environment info, including path, ip address and user agent.

Logster has accumulated a large amount of very useful features over the years, including:

The ability to suppress errors from the logs until the application is upgraded. (The solve button)
The ability to protect certain log messages so they are not purged when clear all is clicked.
Advanced filtering, including regex and reverse regex search
Custom environment (ability to tag current thread with arbitrary metadata)
JavaScript error and backtrace support
Rich API allowing you to suppress patterns, ship errors from other instances, integrate automatically into Rails and so on.

The Logster project is still very much alive. Recently our part-time developer Osama added a mobile view and upgraded the Ember frontend to the latest Ember. We have many exciting new features planned for 2019!

Giving up on tail -f logs/development.log

I do not remember the last time I tailed logs in development. There are a few reasons this does not happen anymore.

Most of the time when building stuff I use TDD, using our rake autospec tool. I will focus on one broken test. Every time I save a file it automatically triggers the test to re-run. If I need extra diagnostics I sprinkle puts statements.
If I am dealing with a specific error on a page I often find working with better_errors far more effective than reading logs.
If I need access to logs I will always prefer using logster in development. It allows me to filter using a text pattern or log level which is a huge time saver. It also provides information that is completely absent from the Rails logs on a per-line basis (environment and backtrace).

I sprinkled Rails.logger.warn("someone called featured users, I wonder who?") and filtered on “featured”
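For readers who want to see what per-line backtraces and filtering buy you without installing anything, the following is a rough sketch built only on Ruby's standard Logger. CapturingLogger and its search method are hypothetical names for this post; Logster stores its data in Redis and has its own UI, none of which is reproduced here.

```ruby
require "logger"

# Hypothetical in-memory logger: keeps every line with its severity and the
# call stack at the time of logging, so lines can be filtered and traced later.
class CapturingLogger < Logger
  Entry = Struct.new(:severity, :message, :backtrace)

  attr_reader :entries

  def initialize
    super(nil) # no output device; entries are only kept in memory
    @entries = []
  end

  # Logger#debug, #warn, #error and friends all funnel through #add.
  def add(severity, message = nil, progname = nil)
    severity ||= Logger::UNKNOWN
    return true if severity < level
    message ||= block_given? ? yield : progname
    @entries << Entry.new(severity, message.to_s, caller)
    true
  end

  # Filter by text pattern and minimum log level.
  def search(pattern, min_severity: Logger::DEBUG)
    @entries.select do |entry|
      entry.severity >= min_severity && entry.message.match?(pattern)
    end
  end
end
```

Calling search(/featured/, min_severity: Logger::WARN) on such a logger returns only the matching warnings and errors, and each entry's backtrace points at the code that logged the line.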

Death by 10,000 log messages in production

Logster attempts to provide some shielding against log floods by grouping based off stack traces. That said, we must be very diligent to keep our logs “under control”.

In our use of Logster for application logs, we like to keep the screens focused on “actionable” errors and warnings. Many errors and warnings that get logged by default leave us no action to take. We can deal with these elsewhere (for example, offending IPs can be blocked after N requests).

Here is a non-exhaustive list of “errors” that we really have no way of dealing with, so they do not belong in Logster:

A rogue IP making a web request with corrupt parameter encoding
A 404 to index.php which we really do not care about
Rate limiting … for example a user posting too fast or liking too fast
Rogue users making requests with unknown HTTP verbs
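One way to keep this kind of noise out of an error log is to drop messages that match a list of known patterns before they are stored. Logster offers pattern suppression through its own API; the sketch below does not use it and instead shows the general idea with a small wrapper around any Ruby logger. The class name ActionableLogger and every pattern in IGNORED_PATTERNS are assumptions for illustration, so the exact message texts will differ in a real application.

```ruby
require "logger"

# Hypothetical patterns for noise nobody can act on; adjust to your own app.
IGNORED_PATTERNS = [
  /invalid byte sequence/i,         # corrupt parameter encoding
  /No route matches .*index\.php/,  # scanners probing for PHP
  /rate limit/i,                    # users posting or liking too fast
  /unknown HTTP (verb|method)/i     # made-up HTTP verbs
].freeze

# Wraps any logger and drops warnings, errors and fatals that match an
# ignore pattern, counting what was dropped.
class ActionableLogger
  attr_reader :dropped

  def initialize(target, ignore: IGNORED_PATTERNS)
    @target = target
    @ignore = ignore
    @dropped = 0
  end

  %i[warn error fatal].each do |level|
    define_method(level) do |message|
      if @ignore.any? { |pattern| pattern.match?(message.to_s) }
        @dropped += 1
        false
      else
        @target.public_send(level, message)
      end
    end
  end
end
```

Keeping a count of dropped messages is a deliberate choice in this sketch: the noise stays off the screen, but a sudden spike in it can still be noticed and handled elsewhere.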

Another interesting point about our use of Logster is that not all errors that float into our logs mean that we have a broken line of code in our application that needs fixing. In some cases a backup redis or db server can be broken so we will log that fact. In some cases there is data corruption that the application can pick up and log. Sometimes transactions can deadlock.

Keeping our Logster logs useful is extremely important. If we ignore unactionable errors for long enough we can end up with a useless error log where all we have is noise.

Proactively logging issues

Given we have a high-visibility place to look at errors, we will sometimes use our error logs to proactively report problems before a disaster hits.

https://github.com/discourse/discourse/blob/a19170a4c2c37bb6f6ae9531fe4f925777f3e8d5/lib/scheduler/defer.rb#L91-L93

    warning_job = @reactor.queue(@timeout) do
      Rails.logger.error "'#{desc}' is still running after #{@timeout} seconds on db #{db}, this process may need to be restarted!"
    end if !non_block

In this case we are watching our “defer” queue, which is a special thread we have for light-weight jobs that run between requests on our web workers in a background thread. We need this queue to be serviced quickly; if it is taking longer than 30 seconds per job we have a problem, but not necessarily a disaster. By reporting this early we can correct issues in the job queue early, rather than dealing with the much more complex task of debugging “queue starvation” way down the line (which we also monitor for).
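The same early-warning pattern can be tried outside Discourse. The snippet above relies on Discourse's own reactor to schedule the warning; the sketch below swaps that for a plain watchdog thread, which is an assumption made to keep the example self-contained. run_with_warning is a hypothetical helper name, and the job itself is never interrupted, only reported.

```ruby
require "logger"

# Hypothetical watchdog: run a job and log an error if it is still running
# after `timeout` seconds. The job is not stopped, only reported.
def run_with_warning(desc, timeout:, logger:)
  watchdog = Thread.new do
    sleep timeout
    logger.error("'#{desc}' is still running after #{timeout} seconds, this process may need to be restarted!")
  end
  yield
ensure
  # The job finished (or raised): cancel the warning if it has not fired yet.
  watchdog&.kill
  watchdog&.join
end
```

A fast job leaves the log untouched, while a call such as run_with_warning("rebuild index", timeout: 30, logger: Logger.new($stderr)) { slow_work } would log once the 30 seconds pass, where slow_work stands in for whatever the job does.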

The logs hot potato game

Half a year ago or so we introduced a fantastic game within our development team. The idea is very simple. Every developer attempts to correct an issue raised in our error logs and then assigns to the next person on the list.

We attempted many other patterns in the past, including:

Having our internal Discourse instance raise a big warning when too many errors are in the logs (which we still use)
Having “log parties” where a single team member triages the logs and assigns issues from the logs to other team members.
Having arbitrary triage and assign.

The “logs game” has proven the most effective at resolving a significant number of issues while keeping the entire team engaged.

We structure the game by having a dedicated Discourse topic in our internal instance with a list of names.

When we resolve issues based on log messages we share the resolution with the team. That way as the game progresses more people learn how to play it and more people learn about our application.

Once resolved, the team member hands the torch to the next person on the list. And so it goes.

This helps all of us get a holistic picture of our system. If logs are complaining that our backup redis instance cannot be contacted, this may be a provisioning bug that needs fixing. For the purpose of the “logs game”, fixing system issues is also completely legitimate, even though no line of code was committed to Discourse to fix it.

Should my Ruby web app be using Logster?

There are many other products for dealing with errors in production. When we started at Discourse we used errbit; these days you have many other options such as sentry, airbrake or raygun.

One big advantage Logster has is that it can be embedded so you get to use the same tool in development and production with a very simple setup. Once you add it to your Gemfile you are seconds away from accessing logs at /logs.

On the other hand, the for-pay dedicated tools out there have full-time development teams building them with hundreds of amazing features.

Logster is designed so it can work side-by-side with other tools. If you find you need other features you could always add an additional error reporter (or submit a PR to Logster).

Regardless of what you end up choosing, I recommend you choose something. There is enormous value in regular audits of errors and better visibility of real world problems your customers are facing.

Posted by: Sam

Sam Saffron