With microservices architecture now often being the standard for web applications, it’s no longer possible to rely on legacy methods to monitor these applications. In simple terms, to ensure the stability of a system, it must be observable, or, in other words, you should be able to monitor its internal state by observing its output.
Now, you’d be mistaken to think that observability is all that’s needed. Both observability and monitoring play a vital role in achieving system dependability. In this post, we’ll briefly look at these two terms and the key application metrics that are important in supporting and maintaining your applications.
Before looking at these key metrics, it’s important to understand the key terms we mentioned above and the interplay between them.
Monitoring primarily relates to observing an application’s performance over time. It, therefore, provides key insights and information about the application’s performance and usage patterns. This could, for example, include detailed information about memory usage and issues, availability, request rates, bottlenecks, and much more.
In contrast to monitoring, observability refers to the ability to infer an application’s internal state by observing its external outputs. Observability is based on three pillars: metrics, traces, and logs. Together, they create an environment where it’s possible to form hypotheses about why something is not working, understand the internals of an application, detect issues faster, and identify anomalies in the application proactively.
And that’s the difference between observability and monitoring. Simply put, while monitoring is something you do, observability is something you have. Yet, they complement each other. You need monitoring to achieve observability, and, in turn, observability gives you the insight you need to know what to monitor. They also share one limitation: whatever is claimed about them, monitoring and observability show you only the symptoms of a problem. You still have to interpret the data, which means you need to infer the root cause yourself. Depending on the quality of the collected data, you may be more or less successful in solving issues.
Logs, Metrics, and Traces
Logs are timestamped, immutable records of events generated mostly from the app code that help identify behavior in an application and provide valuable information about what happened in the application or when an error occurred.
Metrics are counts or measurements aggregated over a period of time. They tell you what the overall behavior is over time and whether there are any deviations during the selected period, e.g. how much of the total CPU is used, how many requests a service handles per minute, or what the success or error rate is.
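To make the idea of aggregation concrete, here is a minimal sketch that turns raw request timestamps into a requests-per-minute metric. The timestamps and the `requests_per_minute` helper are hypothetical and only serve as an illustration; a real setup would normally rely on a metrics library or agent.

```python
from collections import Counter

# Hypothetical request timestamps, in seconds since the start of observation.
request_times_s = [1, 15, 42, 61, 75, 118, 119, 125, 170]


def requests_per_minute(timestamps_s):
    """Aggregate raw timestamps into a count of requests per one-minute bucket."""
    return Counter(int(t // 60) for t in timestamps_s)


per_minute = requests_per_minute(request_times_s)
for minute in sorted(per_minute):
    print(f"minute {minute}: {per_minute[minute]} requests")
```

The individual requests are gone after aggregation; what remains is the overall behavior per minute, which is exactly what makes deviations easy to spot.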
Traces, in turn, show the path an individual transaction or request takes as it moves from one node to the next in a distributed system. They allow you to delve into requests in more detail to find out which components of the application cause errors and where the performance bottlenecks are. At the same time, they allow you to monitor the flow through the modules of an application.
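As a rough illustration of how a trace helps locate a bottleneck, the sketch below models a trace as a list of spans and looks for the component with the most time spent in its own work. The `Span` fields, service names, and durations are all assumptions made up for this example, and the calculation assumes that child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    duration_ms: float


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially, so their durations can be subtracted.
    """
    child_time = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - child_time


def slowest_component(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Hypothetical trace of one request passing through four components.
trace = [
    Span("a", None, "api-gateway", 480),
    Span("b", "a", "auth", 40),
    Span("c", "a", "orders", 410),
    Span("d", "c", "database", 370),
]

bottleneck = slowest_component(trace)
print(bottleneck.service)  # database
```

Although the gateway span is the longest overall, almost all of its time is spent waiting on the spans beneath it, which is the kind of detail a single response-time number cannot show.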
With that in mind, let’s look at the important metrics you should look at to ensure proper monitoring and, by implication, observability.
Response Time
As the name implies, response time is the time it takes for your application to respond to a user request. Understandably, it’s a valuable metric for understanding how responsive your application is to specific request inputs and how it’s performing. So, ultimately, by knowing your response time, you’ll be able to make improvements where necessary to enhance the user experience.
But what numbers should you be looking at when it comes to response time? Although you might be inclined to look at average response time, you must keep in mind that outliers can easily skew the results. This, in turn, can fail to give you a realistic picture of how your application performs.
So, for example, significantly longer requests may push the average higher while very short requests can lower the average. This can make your response time appear higher or lower than what most users actually experience. It’s therefore far better to look at percentile-based statistics.
Here, the 50th percentile, or median, is extremely valuable as it shows you that half of your requests are completed in that time or less. Likewise, the 95th and 99th percentiles are valuable because 95% or 99% of requests complete in that time or less, which gives you an idea of how almost all of your user base experiences your application.
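The effect of an outlier on the average, compared with percentiles, can be sketched in a few lines. The sample response times are invented for illustration, and the `percentile` helper uses the simple nearest-rank definition, which is one of several common ways to compute percentiles.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [80, 85, 90, 95, 100, 105, 110, 120, 130, 2000]

mean_ms = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)

print(f"mean={mean_ms} ms, p50={p50} ms, p95={p95} ms")
# mean=291.5 ms, p50=100 ms, p95=2000 ms
```

The single 2000 ms request pulls the mean to almost three times the median, while the median still reflects the typical request and the 95th percentile exposes the slow tail.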
Error Rate
Error rate is simply an indication of the number of errors your application generates over a specific period. It’s important to keep in mind here that, although you may be used to finding and fixing bugs in an application, in production, errors will often happen because of network and infrastructure variables.
Errors happen due to things that exist inside and outside of your application code. This not only makes them more challenging to identify but also harder to fix. Unfortunately, there is no standard way to deal with these errors and the process to do this will vary based on the type of error, the number of affected users, and the scale of your application.
As a result, when looking at error rate, it’s always better to look not only at the details of an individual error but also at the overall frequency of error types as a measure of application stability. Ultimately, this information will give you great insights into the user experience and how the stability of your application changes over time.
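A minimal sketch of both views, the overall error rate and the frequency of each error type, might look like the following. The request records and error names are hypothetical; in practice this data would come from your logs or monitoring tool.

```python
from collections import Counter

# Hypothetical request records for one time window: (status, error type or None).
requests = [
    ("ok", None), ("ok", None), ("error", "TimeoutError"), ("ok", None),
    ("error", "ConnectionError"), ("ok", None), ("error", "TimeoutError"),
    ("ok", None), ("ok", None), ("ok", None),
]


def error_rate(records):
    """Fraction of requests in the window that ended in an error."""
    if not records:
        return 0.0
    errors = sum(1 for status, _ in records if status == "error")
    return errors / len(records)


def error_type_counts(records):
    """How often each error type occurred in the window."""
    return Counter(err for status, err in records if status == "error")


print(f"error rate: {error_rate(requests):.0%}")   # error rate: 30%
print(error_type_counts(requests).most_common())
# [('TimeoutError', 2), ('ConnectionError', 1)]
```

Tracking the per-type counts over successive windows shows whether a particular kind of error is becoming more frequent, even when the overall rate looks stable.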
Slow Transactions
Slow transactions happen, as the name suggests, any time a transaction or request takes longer than expected to complete. Remember, you’ll sometimes experience requests that take longer, but not long enough to noticeably influence the response time or throw an error. In addition, transactions that take too long can take up excessive computing resources, which then affects other requests.
This, in turn, affects the performance of your application and could also result in additional cloud spending, e.g. when using serverless functions, where you also pay for compute time. Here, the slow transactions metric is extremely valuable to show if there are any bottlenecks or issues in your application’s responsiveness.
The problem is that monitoring slow transactions isn’t as easy as monitoring many other metrics. There are two main ways for you to do this. The easiest is to compile a list of the transactions and sort it by duration. You’ll then be able to see the length and frequency of the longest transaction times. Here, you’ll typically want to focus on transaction times longer than 500 ms.
The other way is to look at the transactions on your list with transaction times above the 95th percentile. No matter what method you use, you’ll end up with a list of long-running transactions, which you can break down further by calculating the total time taken by each type of transaction.
In this way, you’ll be able to see what types of transactions take the longest. This, in turn, allows you to get an idea of which transaction types to prioritize. And when you do this, you’ll be able to improve the user experience.
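Both approaches, and the final grouping by transaction type, can be sketched as follows. The transaction names and durations are made up for this example, the 500 ms threshold comes from the text above, and the percentile cutoff uses a simple nearest-rank calculation that keeps transactions at or above the cutoff.

```python
import math
from collections import defaultdict

# Hypothetical transactions: (transaction type, duration in milliseconds).
transactions = [
    ("checkout", 620), ("search", 120), ("checkout", 540), ("login", 90),
    ("search", 710), ("login", 110), ("report", 1500), ("search", 130),
]

SLOW_THRESHOLD_MS = 500  # the fixed threshold suggested in the text


def slow_by_threshold(txns, threshold_ms=SLOW_THRESHOLD_MS):
    """Transactions longer than a fixed threshold, slowest first."""
    slow = [t for t in txns if t[1] > threshold_ms]
    return sorted(slow, key=lambda t: t[1], reverse=True)


def slow_by_percentile(txns, pct=95):
    """Transactions at or above the nearest-rank percentile, slowest first."""
    durations = sorted(d for _, d in txns)
    rank = max(1, math.ceil(pct * len(durations) / 100))
    cutoff = durations[rank - 1]
    slow = [t for t in txns if t[1] >= cutoff]
    return sorted(slow, key=lambda t: t[1], reverse=True)


def total_time_by_type(txns):
    """Total time per transaction type, largest total first."""
    totals = defaultdict(int)
    for name, duration_ms in txns:
        totals[name] += duration_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


slow = slow_by_threshold(transactions)
print(slow)
# [('report', 1500), ('search', 710), ('checkout', 620), ('checkout', 540)]
print(total_time_by_type(slow))
# [('report', 1500), ('checkout', 1160), ('search', 710)]
```

In this sample, a single search request is slower than either checkout, yet checkout accounts for more total slow time, which is why grouping by type can change what you prioritize.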
The Bottom Line
To ensure the dependability and optimal performance of an application, you’ll need to look at the right application metrics. By doing so, you’ll be able to act proactively and ensure that your application runs as it should.
Hopefully, this post helped illustrate the key metrics in more detail and will allow you to start using these metrics to improve your application’s performance. To find out more about application monitoring, visit our website.