Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Integrate Istio and Apache Skywalking for Kubernetes Observability
LLMs Demand Observability-Driven Development
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany. This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens. Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the discussions and chats that happen in the breaks between talks, where you can connect with core maintainers of various aspects of the Prometheus project. Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there. Let's dive right in and see what the event had to offer this year in Berlin. This overview will be my impressions of each day of the event, but not all the sessions will be covered.

Let's start with a short overview of the insights taken away after sessions, chats, and the social event:

- OpenTelemetry interoperability (in all flavors) is the hot topic of the year.
- Native Histograms were a big topic over the last two years; this year they showed up as having a lot of promise here and there, but they were not a big topic in this year's talks.
- The Perses dashboard and visualization project presented its Alpha release as a truly open-source project based on the Apache 2.0 license.
- By my count, there were ~150 attendees, and all talks/lightning talks were live-streamed and will also be made available on their YouTube channel post-event.

Day 1

The day started with a lovely walk through the center of Berlin to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline):

What's New in Prometheus and Its Ecosystem

- Native Histograms - Efficiency and more details. Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated."
- stringlabels - Storing labels differently for significant memory reduction
- keep_firing_for field added to alerting rules - How long an alert will continue firing after its condition is no longer met
- scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding having to have big config files
- OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics
- SNMP Exporter (v0.24) - Breaking changes: new configuration format that splits connection settings from metrics details and is simpler to change. Also added the ability to query multiple modules in a single scrape.
- MySQLd Exporter (v0.15) - Multi-target support; use a single exporter to monitor multiple MySQL-alike servers
- Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms
- Alertmanager - New receivers: MS Teams, Discord, Webex
- Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now

Every Tuesday, Prometheus meets for Bug Scrub at 11:00 UTC. Calendar: https://prometheus.io/community.

What's Coming

- New AlertManager UI
- Metadata improvements
- Exemplar improvements
- Remote Write v2

Perses: The CNCF Candidate for Observability Visualization

Summary
An announcement was given of the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility - purpose-built for observability data; a truly open-source alternative with the Apache 2.0 license.
Perses was born because the CNCF landscape was missing a visualization tooling project:

- Perses - An exploration of a standard dashboard format
- Chronosphere, Red Hat, and Amadeus are presented as founding members
- GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment
- Chronosphere supported its development, and Red Hat is integrating the Perses package into the OpenShift Console. There is an exploration of its usage with Prometheus/PromLens.
- Currently only displays metrics, but Red Hat is working on integrating tracing with OpenTelemetry; logs are on the future wishlist.
- Feature details were presented for the development of dashboards
- Includes Grafana migration tooling

I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for CNCF Sandbox status.

Towards Making Prometheus OpenTelemetry Native

Summary
OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental.

Details on the Effort
- OTLP ingestion is there experimentally. The experience with target_info is a big pain point at the moment.
- Takes about half the bandwidth of remote write, with 30-40% more CPU due to gzip
- A new Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; it may inspire Prometheus remote write 2.0
- There is a GitHub milestone to track progress
- Thinking about using collector remote config to solve the "split configuration" between the Prometheus server and OpenTelemetry clients

Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos

Summary
Shopify states they are running "highly scalable, globally distributed, and highly dynamic" cloud infrastructure, so they are on "Planet Scale" with Prometheus.

Details on the Effort
- Huge Ruby shop, latency-sensitive, with large scaling events around the retail cycle and flash sales
- HPA struggles with scaling up quickly enough
- Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters
- Backend is Thanos-based, but they have added a lot on top of it (custom work)
- Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
- Have a router layer on top of Thanos to decouple ingestion and storage; it sounds like they're evolving into a Mimir-like setup
- Split the query layer into two deployments: one for short-term queries and one for longer-term queries
- Team- and service-centric UI for alerting, integrated with SLO tracking
- Native histograms solved cardinality challenges and, combined with Thanos' distributed querier, made very high cardinality queries work; as they stated, "This changed the game for us."
- When migrating from the previous observability vendor, they decided not to convert dashboards; instead, they worked with developers to build new, cleaner ones.
- Developers are not scoping queries well, so most queries fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue.

Lightning Talks

Summary
It's always fun to end the day with a quick series of talks collected ad hoc from the attendees. Below is a list of the ones I thought were interesting, with a short summary, should you want to find them in the recordings:

- AlertManager UI: Alertmanager will get a new UI in React.
  ELM didn't get traction as a common language, and they are considering alternatives to Bootstrap.
- Implementing integrals with Prometheus and Grafana: Integrals in PromQL - the inverse of rates; a pure-PromQL version of the delta counter they do, using sum_over_time and Grafana variables to simplify getting all the right factors.
- Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore prod metrics; an interesting idea to integrate this deeply.

Day 2

After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline):

Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus

Summary
Overview of the metrics story at Shopify, with over 1k teams running it:

- Originally forwarding metrics "from observability vendor agent"
- Issues because that was multiplying the cardinality across exporter instances; same with the sidecar model
- Built a StatsD protocol-aware load balancer
- Running as a sidecar also had ownership issues, stating, "We would be on call for every application"
- DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level
- Didn't want per-instance metrics because of cardinality, and metrics are more domain-level
- Roughly one exporter per 50-100 nodes
- Load balancer sanitizes label values and drops labels
- Pre-aggregation on short time scales to deal with "hot loop instrumentation"; resulted in roughly a 20x reduction in bandwidth use
- Compensating for the lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
- "We have close to a thousand teams right now"

Prometheus Java Client 1.0.0

Summary
V1.0.0 was released last week. This talk was an overview of some of their updates, featuring native histograms and OpenTelemetry support. They rewrote the underlying model, so there are breaking changes, with a migration module for Prometheus simpleclient metrics. JavaDoc can be found here.
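To make the new API concrete, here is a minimal sketch of counter instrumentation with the rewritten 1.0.0 client, based on the project's documented builder style; the metric name, label, and port are made up for illustration, so treat the details as a sketch rather than a definitive example:

Java
import io.prometheus.metrics.core.metrics.Counter;
import io.prometheus.metrics.exporter.httpserver.HTTPServer;

public class InstrumentationSketch {
    public static void main(String[] args) throws Exception {
        // Register a counter with the new builder-style API.
        Counter requests = Counter.builder()
                .name("requests_total")
                .help("Total number of handled requests")
                .labelNames("path")
                .register();
        requests.labelValues("/api/orders").inc();

        // Expose the metrics endpoint for scraping (port chosen arbitrarily here).
        HTTPServer server = HTTPServer.builder().port(9400).buildAndStart();
        System.out.println("Metrics at http://localhost:" + server.getPort() + "/metrics");
    }
}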
Other highlights from the talk:

- Migrating is almost as simple as updating imports in your Java app; I'm going to update my workshop Java example for instrumentation to the new API
- Includes good examples in the project
- Exposes native + classic histograms by default; it's the scraper's choice
- A lot more configuration is available as Java properties
- Callback metrics (this is great for writing exporters)
- OTel push support (on a configurable interval)
- Allows standard OTel names (with dots); automatically replaces dots with underscores for the Prometheus format
- Integrates with the OTel tracing client to make exemplars work - picks exemplars from the tracing context and extends the tracing context to mark that trace so it does not get sampled away
- Despite supporting OTel, this is still a performance-minded client library
- All metric types support concurrent updates
- Dropped Pushgateway support for now, but will port it forward
- Once the JMX exporter is updated to the new client, you get these improvements there as a side effect
- Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused

Lightning Talks

Summary
Again, here is a list of lightning talks I thought were interesting from the final day, with a short summary, should you want to find them in the recordings:

- Tracking object storage costs: Trying to measure object storage costs, as they are the number two cost in their cloud bills; built a Prometheus price exporter. Object storage cost is ~half of Grafana's cloud bill; it varies by customer (and can be as low as 2%). A trick for extending sparse metrics with zeroes: or on() vector(0). They have a prices exporter in the works and promised to open source it.
- Prom operator - what's next?: A tour of some more features coming in the Prometheus Operator: shards autoscaling, scrape classes, support for Kubernetes events, and Prometheus-agent deployment as a DaemonSet.
- Prometheus adoption stats: 868k users in 2023 (up from 774k last year), based on Grafana instances that have at least one Prometheus data source enabled.

Final Impressions

For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have "getting started" sessions. Most of the content assumes you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into the status of features in the monitoring world.
This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. To support real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs and, of course, be scalable to support the huge and ever-enlarging data size. The rest of this article is about what their log processing architecture looks like and how they realize stable data ingestion, low-cost storage, and quick queries with it.

System Architecture

This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing.

- ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is stored in HDFS for data verification or replay.
- DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables are also put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are tolerant of duplication, the fact tables are arranged in the Duplicate Key model of Apache Doris.
- DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis.
- ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model.

Architecture 2.0 evolves from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.

Real-Case Practice

Stable Ingestion of 15 Billion Logs Per Day

In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads.

A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:

- Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce the writing frequency and the number of transactions processed by Doris per unit of time. This relieves data writing pressure and avoids generating too many data versions.
- Data Pre-Aggregation: For data that shares the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.
- Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overheads), and dialing up max_tablet_version_num to avoid version accumulation.

These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris ensures quicker data updates.

Storage Strategies to Reduce Costs by 50%

The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs.

- ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, they specify the compression method as "ZSTD" upon table creation, which realizes a compression ratio of 10:1.
- Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) is stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder," it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage.
- Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition. Data that is 3~6 months old has two replicas, and data from 6 months ago has one single copy.

With these three strategies, the user has reduced their storage costs by 50%.

Differentiated Query Strategies Based on Data Size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes:

- Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour. This avoids data skew. To further ensure the balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data from the recent 20 days is kept. In this way, they find the balance point between data backlog and analytic needs.
- 100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Thus, queries on these tables are much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.
- More than 100T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated there.
In this way, queries over 2 billion log records can be done in 1~2s. These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds.

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numerics and datetime values. They have also provided valuable feedback about the auto-bucketing logic in Doris: currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime, but little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic, where the reference for Doris to decide the number of buckets is the data size and distribution of the previous day. They've come to the Apache Doris community, and we are now working on this optimization.
Picture a success story: a trace identified a pesky latency issue in your application's authentication service. A fix was swiftly implemented, and we all celebrated a quick win in the next team meeting. But the celebrations were short-lived. Just days later, user complaints surged about a related payment gateway timeout. It turned out that the fix did improve performance at one point but created a situation in which key information was never cached. Other parts of the software reacted badly to the fix, and we had to revert the whole thing. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built this way. Relying solely on a single trace gave us a partial view of a broader problem.

This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems.

The Limiting Factor

The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is with the synchronized block in another related method, causing thread contention and slowing down the entire system.

Temporal blindness is the second problem. Think of a Java Garbage Collection (GC) log. A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system.

The last problem is related to that, and it is context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace.
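That kind of context is cheap to record at instrumentation time. As a hypothetical sketch using the OpenTelemetry Java API (the attribute names and the Order placeholder type are made up for illustration, not taken from any real codebase), the batch size can be attached to the active span so a 5-second processOrders() trace can be judged against how much work it actually did:

Java
import io.opentelemetry.api.trace.Span;
import java.util.List;

class OrderProcessor {
    record Order(long id) { }  // placeholder type, for illustration only

    void processOrders(List<Order> orders) {
        // Attach the "how much work" context to the current span, so the
        // trace shows whether 50 or 5,000 orders were being processed.
        Span span = Span.current();
        span.setAttribute("orders.count", orders.size());
        span.setAttribute("orders.source", "fetchOrdersFromDatabase");

        // ... actual processing ...
    }
}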
Strength in Numbers

Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue. Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound during high transaction loads.

This helps us distinguish between correlation and causation. Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation.

This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the tong motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs.

Example

Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting concepts adopted by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that work should probably still be done manually since it can't understand the why, only the what. It seems it can detect the venerable Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors? How impactful is the performance on the rest of the application? These become questions with easy answers when we see all the different aspects laid out together.

Magical APIs

The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA). The great Vlad Mihalcea has an excellent explanation. The TL;DR is rather simple: we write a simple database query using an ORM, but we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem in "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what is going on. Observability is one of the best ways to understand why these things fail. In the past, I used to reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle.

Final Word

Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight.
While these traces offer valuable insights, their true potential is only realized when viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience. In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
If your environment is like many others, it can often seem like your systems produce logs filled with a bunch of excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs — and they don’t typically have any specific purpose or focus holding them together — you may dread sifting through them. If you don’t have the right tools, it can feel like you’re stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good material from the heaps of galactic material. Though it can feel like more trouble than it’s worth, sorting through logs is crucial. Logs hold many valuable insights into what’s happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we’re going to take a look at how logging can help you make sense of your log data without much effort. We'll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let’s blast off and turn that cosmic trash into treasure! The Truth Is Out There: Getting Value Just From the Things You’re Already Logging One massive benefit offered by a log analytics platform to any system engineer is the ability to utilize a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator can make it much easier to see simultaneous events across your infrastructure. You’ll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data. Of course, you could make your own aggregation interface from scratch, but often, log aggregation tools provide a number of extra features that are worth the additional investment. Those extra features include capabilities such as powerful search and fast analytics. Searching Through the Void: Using Search Query Language To Find Things You’ve probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting logs all in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are being executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let’s take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find through your log data. 
For example, if you wanted to search across your logs, and you know your log lines follow the format "From: Jane To: John," you can parse out the "from" and "to" with the following query:

* | parse "From: * To: *" as (from, to)

This would store "Jane" in the "from" field and "John" in the "to" field. Another valuable language feature you could tap into would be keyword expressions. You could use this query to search across your logs for any instances where a command with root privileges failed:

(su OR sudo) AND (fail* OR error)

Here is a listing of General Search Examples that are drawn from parsing a single Apache log message.

Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics

One other aspect of searching is that it's typically looking into the past. Sometimes, you need to be looking at things as they happen. Let's take a look at Live Tail and LogReduce — two tools to improve simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic's offering, so we'll dive into them.

Live Tail

At its simplest, Live Tail lets you see a live feed of your log messages. It's like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you're looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in.

LogReduce

LogReduce gives you more insight into, and a better understanding of, your search query's aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of "Signatures" that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the relevance of the pattern when compared to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all pretty advanced and can be hard to understand without a demo, so you can dive deeper by watching this video.

Integrated Log Aggregation

Often, you'll need information from systems you aren't running directly mixed in with your other logs. That's why it's important to make sure you can integrate your log aggregator with other systems. Many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren't only available on the ELK stack. Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company's 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don't need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of the logs in your aggregator together.

Conclusion

This has been a brief tour through some of the features available with log aggregation.
There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need to have a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can let you unlock infinite value from your already existing log data. That way, you can become a better cosmic junk collector!
The World Has Changed, and We Need To Adapt The world has gone through a tremendous transformation in the last fifteen years. Cloud and microservices changed the world. Previously, our application was using one database; developers knew how it worked, and the deployment rarely happened. A single database administrator was capable of maintaining the database, optimizing the queries, and making sure things worked as expected. The database administrator could just step in and fix the performance issues we observed. Software engineers didn’t need to understand the database, and even if they owned it, it was just a single component of the system. Guaranteeing software quality was much easier because the deployment happened rarely, and things could be captured on time via automated tests. Fifteen years later, everything is different. Companies have hundreds of applications, each one with a dedicated database. Deployments happen every other hour, deployment pipelines work continuously, and keeping track of flowing changes is beyond one’s capabilities. The complexity of the software increased significantly. Applications don’t talk to databases directly but use complex libraries that generate and translate queries on the fly. Application monitoring is much harder because applications do not work in isolation, and each change may cause multiple other applications to fail. Reasoning about applications is now much harder. It’s not enough to just grab the logs to understand what happened. Things are scattered across various components, applications, queues, service buses, and databases. Databases changed as well. We have various SQL distributions, often incompatible despite having standards in place. We have NoSQL databases that provide different consistency guarantees and optimize their performance for various use cases. We developed multiple new techniques and patterns for structuring our data, processing it, and optimizing schemas and indexes. It’s not enough now to just learn one database; developers need to understand various systems and be proficient with their implementation details. We can’t rely on ACID anymore as it often harms the performance. However, other consistency levels require a deep understanding of the business. This increases the conceptual load significantly. Database administrators have a much harder time keeping up with the changes, and they don’t have enough time to improve every database. Developers are unable to analyze and get the full picture of all the moving parts, but they need to deploy changes faster than ever. And the monitoring tools still swamp us with metrics instead of answers. Given all the complexity, we need developers to own their databases and be responsible for their data storage. This “shift left” in responsibility is a must in today’s world for both small startups and big Fortune 500 enterprises. However, it’s not trivial. How do we prevent the bad code from reaching production? How to troubleshoot issues automatically? How do we move from monitoring to observability? Finally, how do we give developers the proper tools and processes so they will be able to own the databases? Read on to find answers. Measuring Application Performance Is Complex It’s crucial to measure to improve the performance. Performance indicators (PIs) help us evaluate the performance of the system on various dimensions. They can focus on infrastructure aspects such as the reliability of the hardware or networking. 
They can use application metrics to assess the performance and stability of the system. They can also include business metrics to measure the success from the company and user perspective, including user retention or revenue. Performance indicators are important tracking mechanisms to understand the state of the system and the business as a whole. However, in our day-to-day job, we need to track many more metrics. We need to understand contributors to the performance indicators to troubleshoot the issues earlier and understand whether the system is healthy or not. Let’s see how to build these elements in the modern world. We typically need to start with telemetry — the ability to collect the signals. There are multiple types of signals that we need to track: logs (especially application logs), metrics, and traces. Capturing these signals can be a matter of proper configuration (like enabling them in the hosting provider panel), or they need to be implemented by the developers. Recently, OpenTelemetry gained significant popularity. It’s a set of SDKs for popular programming languages that can be used to instrument applications to generate signals. This way, we have a standardized way of building telemetry within our applications. Odds are that most of the frameworks and libraries we use are already integrated with OpenTelemetry and can generate signals properly. Next, we need to build a solution for capturing the telemetry signals in one centralized place. This way, we can see “what happens” inside the system. We can browse the signals from the infrastructure (like hosts, CPUs, GPUs, and network), applications (number of requests, errors, exceptions, data distribution), databases (data cardinality, number of transactions, data distribution), and many other parts of the application (queues, notification services, service buses, etc.). This lets us troubleshoot more easily as we can see what happens in various parts of the ecosystem. Finally, we can build the Application Performance Management (APM). It’s the way of tracking metric indicators with telemetry and dashboards. APM focuses on providing end-to-end monitoring that goes across all the components of the system, including the web layer, mobile and desktop applications, databases, and the infrastructure connecting all the elements. It can be used to automate alarms and alerts to constantly assess whether the system is healthy. APM may seem like a silver bullet. It aggregates metrics, shows the performance, and can quickly alert when something goes wrong, and the fire begins. However, it’s not that simple. Let’s see why. Why Application Performance Monitoring Is Not Enough APM captures signals and presents them in a centralized application. While this may seem enough, it lacks multiple features that we would expect from a modern maintenance system. First, APM typically presents raw signals. While it has access to various metrics, it doesn’t connect the dots easily. Imagine that the CPU spikes. Should you migrate to a bigger machine? Should you optimize the operating system? Should you change the driver? Or maybe the CPU spike is caused by different traffic coming to the application? You can’t tell that easily just by looking at metrics. Second, APM doesn’t easily show where the problem is. We may observe metrics spiking in one part of the system, but it doesn’t necessarily mean that the part is broken. There may be other reasons and issues. 
Maybe it’s wrong input coming to the system, maybe some external dependency is slow, and maybe some scheduled task runs too often. APM doesn’t show that, as it cannot connect the dots and show the flow of changes throughout the system. You just see the state then, but you don’t see how you got to that point easily. Third, the resolution is unknown. Let’s say that the CPU spiked during the scheduled maintenance task. Should we upscale the machine? Should we disable the task? Should we run it some other time? Is there a bug in the task? Many things are not clear. We can easily imagine a situation when the scheduled task runs in the middle of the day just because it is more convenient for the system administrators; however, the task is now slow and competes with regular transactions for the resources. In that case, we probably should move the task to some time outside of peak hours. Another scenario is that the task was using an index that doesn’t work anymore. Therefore, it’s not about the task per se, but it’s about the configuration that has been changed with the last deployment. Therefore, we should fix the index. APM won’t show us all those details. Fourth, APM is not very readable. Dashboards with metrics look great, but they are too often just checked whether they’re green. It’s not enough to see that alarms are not ringing. We need to manually review the metrics, look for anomalies, understand how they change, and if we have all the alarms in place. This is tedious and time-consuming, and many developers don’t like doing that. Metrics, charts, graphs, and other visualizations swamp us with raw data that doesn’t show the big picture. Finally, one person can’t reason about the system. Even if we have a dedicated team for maintenance, the team won’t have an understanding of all the changes going through the system. In the fast-paced world with tens of deployments every day, we can’t look for issues manually. Every deployment may result in an outage due to invalid schema migration, bad code change, cache purge, lack of hardware, bad configuration, or many more issues. Even when we know something is wrong and we can even point to the place, the team may lack the understanding or knowledge needed to identify the root cause. Involving more teams is time-consuming and doesn’t scale. While APM looks great, it’s not the ultimate solution. We need something better. We need something that connects the dots and provides answers instead of data. We need true observability. What Makes the Observability Shine Observability turns alerts into root causes and raw data into understanding. Instead of charts, diagrams, and graphs, we want to have a full story of the changes going through pipelines and how they affect the system. This should understand the characteristics of the application, including the deployment scheme, data patterns, partitioning, sharding, regionalization, and other things specific to the application. Observability lets us reason about the internals of the system from the outside. For instance, we can reason that we deployed the wrong changes to the production environment because there is a metric spike in the database. We don’t focus on the database per se, but we analyze the difference between the current and the previous code. However, if there was no deployment recently, but we observe much higher traffic on the load balancer, then we can reason that it’s probably due to different traffic coming to the application. Observability makes the interconnections clear and visible. 
To build observability, we need to capture static signals and dynamic history. We need to include our deployments, configuration, extensions, connectivity, and characteristics of our application code. It’s not enough just to see that “something is red now.” We need to understand how we got there and what could be the possible reason. To achieve that, a good observability solution needs to go through multiple steps. First, we need to be able to pinpoint the problem. In the modern world of microservices and bounded contexts, it’s not trivial. If the CPU spikes, we need to be able to answer which service or application caused that, which tenant is responsible, or whether this is for all the traffic or some specific requests in the case of a web application. We can do that by carefully observing metrics with multiple dimensions, possibly with dashboards and alarms. Second, we need to include multiple signals. CPU spikes can be caused by a lack of hardware, wrong configuration, broken code, unexpected traffic, or simply things that shouldn’t be running at that time. What’s more, maybe something unexpected happened around the time of the issue. This could be related to a deployment, an ongoing sports game, a specific time of week or time of year, some promotional campaign we just started, or some outage in the cloud infrastructure. All these inputs must be provided to the observability system to understand the bigger picture. Third, we need to look for anomalies. It may seem counterintuitive, but digital applications rot over time. Things change, traffic changes, updates are installed, security fixes are deployed, and every single change can break our application. However, the outage may not be quick and easy. The application may get slower and slower over time, and we won’t notice that easily because alarms do not go off or they become red only for a short period. Therefore, we need to have anomaly detection built-in. We need to be able to look for traffic patterns, weekly trends, and known peaks during the year. A proper observability solution needs to be aware of these and automatically find the situations in which the metrics don’t align. Fourth, we need to be able to automatically root cause the issue and suggest a solution. We can’t push the developers to own the databases and the systems without proper tooling. The observability systems need to be able to automatically suggest improvements. We need to unblock the developers so they can finally be responsible for the performance and own the systems end to end. Databases and Observability We Need Today Let’s now see what we need in the domain of databases. Many things can break, and it’s worth exploring the challenges we may face when working with SQL or NoSQL databases. We are going to see the three big areas where things may go wrong. These are code changes, schema changes, and execution changes. Code Changes Many database issues come from the code changes. Developers modify the application code, and that results in different SQL statements being sent to the database. These queries may be inherently slow, but these won’t be captured by the testing processes we have in place now. Imagine that we have the following application code that extracts the user aggregate root. 
The user may have multiple additional pieces of information associated with them, like details, pages, or texts:

JavaScript
const user = repository.get("user")
    .where("user.id = 123")
    .leftJoin("user.details", "user_details_table")
    .leftJoin("user.pages", "pages_table")
    .leftJoin("user.texts", "texts_table")
    .leftJoin("user.questions", "questions_table")
    .leftJoin("user.reports", "reports_table")
    .leftJoin("user.location", "location_table")
    .leftJoin("user.peers", "peers_table")
    .getOne();

return user;

The code generates the following SQL statement:

SQL
SELECT *
FROM users AS user
LEFT JOIN user_details_table AS detail ON detail.user_id = user.id
LEFT JOIN pages_table AS page ON page.user_id = user.id
LEFT JOIN texts_table AS text ON text.user_id = user.id
LEFT JOIN questions_table AS question ON question.user_id = user.id
LEFT JOIN reports_table AS report ON report.user_id = user.id
LEFT JOIN locations_table AS location ON location.user_id = user.id
LEFT JOIN peers_table AS peer ON peer.user_id = user.id
WHERE user.id = '123'

Because of the multiple joins, the query returns nearly 300 thousand rows to the application, which are later processed by the mapper library. This takes 25 seconds in total, just to get one user entity. The problem with such a query is that we don't see the performance implications when we write the code. If we have a small developer database with only a hundred rows, then we won't get any performance issues when running the code above locally. Unit tests won't catch it either because the code is "correct" — it returns the expected result. We won't see the issue until we deploy to production and see that the query is just too slow.

Another problem is the well-known N+1 query problem with Object-Relational Mapper (ORM) libraries. Imagine that we have a table flights that is in a 1-to-many relation with a table tickets. If we write code to get all the flights and count all the tickets, we may end up with the following:

C#
var totalTickets = 0;
var flights = dao.getFlights();
foreach (var flight in flights) {
    totalTickets += flight.getTickets().count;
}

This may result in N+1 queries being sent in total: one query to get all the flights, and then n queries to get the tickets for every flight:

SQL
SELECT * FROM flights;
SELECT * FROM tickets WHERE ticket.flight_id = 1;
SELECT * FROM tickets WHERE ticket.flight_id = 2;
SELECT * FROM tickets WHERE ticket.flight_id = 3;
...
SELECT * FROM tickets WHERE ticket.flight_id = n;

Just as before, we don't see the problem when running things locally, and our tests won't catch it. We'll find the problem only when we deploy to an environment with a sufficiently big data set.
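One common remedy, shown here as a hedged sketch in Java with JPA (the Flight and Ticket entities and the EntityManager wiring are hypothetical, not taken from the code above), is to fetch the association in a single query instead of touching it lazily once per row:

Java
import jakarta.persistence.EntityManager;
import java.util.List;

class TicketCounter {
    private final EntityManager entityManager;

    TicketCounter(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    long countTickets() {
        // JOIN FETCH loads each flight together with its tickets in one query,
        // instead of one query for flights plus one query per flight.
        List<Flight> flights = entityManager.createQuery(
                "SELECT DISTINCT f FROM Flight f LEFT JOIN FETCH f.tickets", Flight.class)
            .getResultList();

        return flights.stream()
                .mapToLong(f -> f.getTickets().size())
                .sum();
    }
}

If only the count is needed, a plain aggregate query such as SELECT COUNT(t) FROM Ticket t avoids loading the entities at all; either way, the point is that the number of statements should not grow with the number of flights.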
Yet another thing is about rewriting queries to make them more readable. Let's say that we have a table boarding_passes, and we want to write the following query (just for exemplary purposes):

SQL
SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2
    ON C2.ticket_no = C1.ticket_no
    AND C2.flight_id = C1.flight_id
    AND C2.boarding_no = C1.boarding_no
JOIN boarding_passes AS C3
    ON C3.ticket_no = C1.ticket_no
    AND C3.flight_id = C1.flight_id
    AND C3.boarding_no = C1.boarding_no
WHERE MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'

This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on the condition. It runs for 8 seconds on my machine with the demo database. A programmer may now want to avoid this repetition and rewrite the query as follows:

SQL
WITH cte AS (
    SELECT *, MD5(MD5(ticket_no)) AS double_hash
    FROM boarding_passes
)
SELECT COUNT(*)
FROM cte AS C1
JOIN cte AS C2
    ON C2.ticket_no = C1.ticket_no
    AND C2.flight_id = C1.flight_id
    AND C2.boarding_no = C1.boarding_no
JOIN cte AS C3
    ON C3.ticket_no = C1.ticket_no
    AND C3.flight_id = C1.flight_id
    AND C3.boarding_no = C1.boarding_no
WHERE C1.double_hash = '525ac610982920ef37b34aa56a45cd06'
    AND C2.double_hash = '525ac610982920ef37b34aa56a45cd06'
    AND C3.double_hash = '525ac610982920ef37b34aa56a45cd06'

The query is now more readable as it avoids repetition. However, the performance dropped, and the query now executes in 13 seconds. When we deploy changes like these to production, we may conclude that we need to upscale the database. Seemingly, nothing has changed, but the database is now much slower. With good observability tools, we would see that the query executed behind the scenes is now different, which leads to the performance drop.

Schema Changes

Another problem around databases is schema management. There are generally three different ways of modifying the schema: we can add something (a table, a column, an index, etc.), remove something, or modify something. Each schema modification is dangerous because the database engine may need to rewrite the table — copy the data to the side, modify the table schema, and then copy the data back. This may lead to a very long deployment (minutes, hours, even months) that we can't optimize or stop in the middle. Additionally, we typically won't see the problems when running things locally because we run our tests against the latest schema. A good observability solution needs to capture these changes before they run in production.

Indexes pose another interesting challenge. Adding an index seems safe. However, as is the case with every index, it needs to be maintained over time. Indexes generally improve read performance because they help us find rows much faster. At the same time, they decrease modification performance, as every data modification must be performed in the table and in all the indexes. What's more, indexes may stop being useful after some time. It's often the case that we configure an index; a couple of months later, we change the application code, and the index isn't used anymore. Without good observability systems, we won't notice that the index isn't useful anymore and only decreases performance.

Execution Changes

Yet another area of issues is related to the way we execute queries. Databases prepare a so-called execution plan for each query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics of the tables' content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and the running configuration. The execution plan dictates which join strategy to use when joining tables (nested loop join, merge join, hash join, or maybe something else), which indexes to scan (or tables instead), and when to sort and materialize the results. We can affect the execution plan by providing query hints. Inside the SQL statements, we can specify which join strategy to use or which locks to acquire. The database may use these hints to improve performance but may also disregard them and execute things differently.
However, we don’t know whether the database used them or not. Things get worse over time. Indexes may change after the deployment, data distribution may depend on the day of the week, and the database load may be much different between countries when we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won’t catch that. Unit tests are used to verify the correctness of our queries, and the queries will still return the same results. We have simply no way of identifying these changes automatically. Database Guardrails Is the New Standard Based on what we said above, we need a new approach. No matter if we run a small product or a big Fortune 500 company, we need a novel way of dealing with databases. Developers need to own their databases and have all the means to do it well. We need good observability and database guardrails — a novel approach that: Prevents the bad code from reaching production, Monitors all moving pieces to build a meaningful context for the developer, It significantly reduces the time to identify the root cause and troubleshoot the issues, so the developer gets direct and actionable insights We can’t let ourselves go blind anymore. We need to have tools and systems that will help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let’s see how we can build such a system. There are four things that we need to capture to build successful database guardrails. Let’s walk through them. Database Internals Each database provides enough details about the way it executes the query. These details are typically captured in the execution plan that explains what join strategies were used, which tables and indexes were scanned, or what data was sorted. To get the execution plan, we can typically use the EXPLAIN keyword. For instance, if we take the following PostgreSQL query: SQL SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' We can add EXPLAIN to get the following query: SQL EXPLAIN SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' The query returns the following output: SQL Nested Loop (cost=1.44..4075.42 rows=480 width=89) -> Nested Loop (cost=1.00..30.22 rows=480 width=10) -> Index Only Scan using name_basics_pkey on name_basics nb (cost=0.43..4.45 rows=1 width=10) Index Cond: (nconst = 'nm00001'::text) -> Index Only Scan using title_principals_nconst_idx on title_principals tp (cost=0.56..20.96 rows=480 width=20) Index Cond: (nconst = 'nm00001'::text) -> Index Scan using title_basics_pkey on title_basics tb (cost=0.43..8.43 rows=1 width=89) Index Cond: (tconst = tp.tconst) This gives a textual representation of the query and how it will be executed. We can see important information about the join strategy (Nested Loop in this case), tables and indexes used (Index Only Scan for name_basics_pkey, or Index Scan for title_basics_pkey), and the cost of each operation. Cost is an arbitrary number indicating how hard it is to execute the operation. We shouldn’t draw any conclusions from the numbers per se, but we can compare various plans based on the cost and choose the cheapest one. Having plans at hand, we can easily tell what’s going on. 
We can see if we have an N+1 query issue, whether we use indexes efficiently, and whether the operation runs fast. We can get some insights into how to improve the queries. We can immediately tell if a query is going to scale well in production just by looking at how it reads the data. Once we have these plans, we can move on to another part of successful database guardrails. Integration With Applications We need to extract plans somehow and correlate them with what our application does. To do that, we can use OpenTelemetry (OTel). OpenTelemetry is an open standard for instrumenting applications. It provides multiple SDKs for various programming languages and is now commonly used in frameworks and libraries for HTTP, SQL, ORM, and other application layers. OpenTelemetry captures signals: logs, traces, and metrics. These are later organized into spans and traces that represent the communication between services and the timing of operations. Each span represents one operation performed by some server. This could be a file access, a database query, or request handling. We can now extend OpenTelemetry signals with details from databases. We can extract execution plans, correlate them with signals from other layers, and build a full understanding of what happened behind the scenes. For instance, we would clearly see the N+1 problem just by looking at the number of spans. We could immediately identify schema migrations that are too slow or operations that will take the database down. Now, we need the last piece to capture the full picture. Semantic Monitoring of All Databases Observing just the local database may not be enough. The same query may execute differently depending on the configuration or the freshness of statistics. Therefore, we need to integrate monitoring with all the databases we have, especially with the production ones. By extracting statistics, row counts, the running configuration, or installed extensions, we can get an understanding of how the database performs. Next, we can integrate that with the queries we run locally. We take the query that we captured in the local environment and then reason about how it would execute in production. We can compare the execution plan and see which tables are accessed or how many rows are being read. This way, we can immediately tell the developer that the query is not going to scale well in production. Even if the developer has a different database locally or has a low number of rows, we can still take the query or the execution plan, enrich it with the production statistics, and reason about the performance after the deployment. We don’t need to wait for the deployment or the load tests; we can provide feedback nearly immediately. The most important part is that we move from raw signals to reasoning. We don’t swamp the user with plots or metrics that are hard to understand or that the user can’t use easily without setting the right thresholds. Instead, we can provide meaningful suggestions. Instead of saying, “CPU spiked to 80%,” we can say, “The query scanned the whole table, and you should add an index on this and that column.” We can give developers answers, not only the data points to reason about. Automated Troubleshooting That’s just the beginning. Once we understand what is actually happening in the database, the sky's the limit. We can run anomaly detection on the queries to see how they change over time, if they use the same indexes as before, or if they changed the join strategy.
We can catch ORM configuration changes that lead to multiple SQL queries being sent for a particular REST API. We can submit automated pull requests to tune the configuration. We can correlate the application code with the SQL query so we can rewrite the code on the fly with machine-learning solutions. Summary In recent years, we have observed a big evolution in the software industry. We run many applications, deploy many times a day, scale out to hundreds of servers, and use more and more components. Application Performance Monitoring is not enough to keep track of all the moving parts in our applications. Here at Metis, we believe that we need something better. We need true observability that can finally show us the full story. And we can use observability to build database guardrails that provide the actual answers and actionable insights. Not a set of metrics that the developer needs to track and understand, but automated reasoning connecting all the dots. That’s the new approach we need and the new age we deserve as developers owning our databases.
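To make the integration described above a little more concrete, here is a hedged Java sketch that wraps a query in an OpenTelemetry span and attaches the SQL text and its execution plan as span attributes. It uses only the opentelemetry-api dependency; the span name, the db.plan attribute, and the fetchPlan helper are illustrative choices for this sketch rather than anything defined by the article or by an official convention (db.statement, by contrast, is a commonly used OpenTelemetry attribute for the SQL text), and an SDK with an exporter still has to be configured for the spans to end up anywhere.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedQuery {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("db-guardrails-demo");

    static String queryWithPlan(String sql) {
        Span span = tracer.spanBuilder("boarding_passes.select").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("db.statement", sql); // the SQL text
            String plan = fetchPlan(sql);           // e.g., the JDBC EXPLAIN sketch shown earlier
            span.setAttribute("db.plan", plan);     // custom, illustrative attribute for the plan
            return plan;
        } finally {
            span.end();
        }
    }

    private static String fetchPlan(String sql) {
        // Placeholder: run "EXPLAIN " + sql over JDBC and concatenate the returned rows.
        return "Nested Loop ...";
    }
}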
This is an article from DZone's 2023 Automated Testing Trend Report. For more: Read the Report. One of the core capabilities that has seen increased interest in the DevOps community is observability. Observability improves monitoring in several vital ways, making it easier and faster to understand business flows and allowing for enhanced issue resolution. Furthermore, observability goes beyond an operations capability and can be used for testing and quality assurance. Testing has traditionally faced the challenge of identifying the appropriate testing scope. "How much testing is enough?" and "What should we test?" are questions each testing executive asks, and the answers have been elusive. There are fewer arguments about testing new functionality; while not trivial, you know the functionality you built in new features and hence can derive the proper testing scope from your understanding of the functional scope. But what else should you test? What is a comprehensive general regression testing suite, and what previous functionality will be impacted by the new functionality you have developed and will release? Observability can help us with this, as well as with the unavoidable defect investigation. But before we get to this, let's take a closer look at observability. What Is Observability? Observability is not monitoring with a different name. Monitoring is usually limited to observing a specific aspect of a resource, like disk space or memory of a compute instance. Monitoring one specific characteristic can be helpful in an operations context, but it usually only detects a subset of what is concerning. All monitoring can show is that the system looks okay, yet users can still be experiencing significant outages. Observability aims to let us see the state of the system by making data flows "observable." This means that we can identify when something starts to behave out of order and requires our attention. Observability combines logs, metrics, and traces from infrastructure and applications to gain insights. Ideally, it organizes these around workflows instead of system resources and, as such, creates a functional view of the system in use. Done correctly, it lets you see what functionality is being executed and how frequently, and it enables you to identify performance characteristics of the system and workflow. Figure 1: Observability combines metrics, logs, and traces for insights One benefit of observability is that it shows you the actual system. It is not biased by what the designers, architects, and engineers think should happen in production. It shows the unbiased flow of data. The users, over time (and sometimes from the very first day), find ways to use the system quite differently from what was designed. Observability makes such changes in behavior visible. Observability is incredibly powerful in debugging system issues as it allows us to navigate the system to see where problems occur. Observability requires a dedicated setup and some contextual knowledge, similar to traceability. Traceability is the ability to follow a system transaction over time through all the different components of our application and infrastructure architecture, which means you have to have common information, like an ID, that enables this. OpenTelemetry is an open standard that can be used for this and provides useful guidance on how to set it up. Observability makes identifying production issues a lot easier. And we can use observability for our benefit in testing, too.
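As a small, hedged illustration of the "common information like an ID" point above, the sketch below uses the OpenTelemetry API to start a span and writes the current trace ID into an ordinary log line, so the log entry and the trace can later be joined in whatever backend you use. The class name, span name, and order ID are invented for the example, and a configured OpenTelemetry SDK plus an SLF4J binding are assumed; without an SDK, the API silently falls back to no-op spans.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("webshop");

    void handleCheckout(String orderId) {
        Span span = tracer.spanBuilder("checkout").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // The trace ID is the shared piece of information that lets you jump
            // from this log line to the matching trace (and back).
            log.info("checkout started orderId={} traceId={}",
                    orderId, Span.current().getSpanContext().getTraceId());
            // ... business logic ...
        } finally {
            span.end();
        }
    }
}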
Observability of Testing: How to Look Left Two aspects of observability make it useful in the testing context: Its ability to make the actual system usage observable and its usefulness in finding problem areas during debugging. Understanding the actual system behavior is most directly useful during performance testing. Performance testing is the pinnacle of testing since it tries to achieve as close to the realistic peak behavior of a system as possible. Unfortunately, performance testing scenarios are often based on human knowledge of the system instead of objective information. For example, performance testing might be based on the prediction of 10,000 customer interactions per hour during a sales campaign based on the information of the sales manager. Observability information can help define the testing scenarios by using the information to look for the times the system was under the most stress in production and then simulate similar situations in the performance test environment. We can use a system signature to compare behaviors. A system signature in the context of observability is the set of values for logs, metrics, and traces during a specific period. Take, for example, a marketing promotion for new customers. The signature of the system should change during that period to show more new account creations with its associated functionality and the related infrastructure showing up as being more "busy." If the signature does not change during the promotion, we would predict that we also don't see the business metrics move (e.g., user sign-ups). In this example, the business metrics and the signature can be easily matched. Figure 2: A system behaving differently in test, which shows up in the system signature In many other cases, this is not true. Imagine an example where we change the recommendation engine to use our warehouse data going forward. We expect the system signature to show increased data flows between the recommendation engine and our warehouse system. You can see how system signatures and the changes of the system signature can be useful for testing; any differences in signature between production and the testing systems should be explainable by the intended changes of the upcoming release. Otherwise, investigation is required. In the same way, information from the production observability system can be used to define a regression suite that reflects the functionality most frequently used in production. Observability can give you information about the workflows still actively in use and which workflows have stopped being relevant. This information can optimize your regression suite both from a maintenance perspective and, more importantly, from a risk perspective, making sure that core functionality, as experienced by the user, remains in a working state. Implementing observability in your test environments means you can use the power of observability for both production issues and your testing defects. It removes the need for debugging modes to some degree and relies upon the same system capability as production. This way, observability becomes how you work across both dev and ops, which helps break down silos. Observability for Test Insights: Looking Right In the previous section, we looked at using observability by looking left or backward, ensuring we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing. 
During the test cycles, we see how this new feature changes the workflows, which shows up in our observability solution. We can see the new features being used and other features changing in usage as a result. The signature of our application has changed when we consider the logs, traces, and metrics of our system in test. Once we go live, we predict that the signature of the production system will change in a very similar way. If that happens, we will be happy. But what if the signature of the production system does not change as predicted? Let's take an example: We created a new feature that leverages information from previous bookings to better serve our customers by allocating similar seats and menu options. During testing, we exercised the new feature with our test data set, and we saw an increase in access to the bookings database while the customer booking was being collated. Once we go live, we realize that the workflows are not utilizing the customer booking database, and we leverage the information from our observability tooling to investigate. We have found a case where the users are not using our new feature, or are not using it in the expected way. In either case, this information allows us to investigate further to see whether more change management is required for the users or whether our feature is just not solving the problem in the way we wanted it to. Another way to use observability is to evaluate the performance impact of your changes in test on the system signature; comparing this afterwards with the production system signature can give valuable insights and prevent overall performance degradation. Our testing efforts (and the associated predictions) have now become a valuable tool for the business to evaluate the success of a feature, which elevates testing to a business tool and a real value investment. Figure 3: Using observability in test by looking left and looking right Conclusion While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. It will create objectivity for defining testing efforts and evaluating results against the actual system behavior in production. It also provides value to developer, tester, and business communities, which makes it a valuable tool for breaking down barriers. Using the same practices and tools across communities drives a common culture; after all, culture is nothing but repeated behaviors.
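To make the idea of a system signature slightly more tangible, here is a toy Java sketch that is purely illustrative and not taken from the article: it represents a signature as a map from metric name to an observed value over some period, compares a production baseline against a test run, and flags any change that is both large and not on the list of changes expected from the release. The metric names and the 25% threshold are invented.
Java
import java.util.Map;
import java.util.Set;

public class SignatureDiff {
    // Flags metrics whose relative change exceeds the threshold and is not
    // explained by the intended changes of the upcoming release.
    static void compare(Map<String, Double> baseline, Map<String, Double> candidate,
                        Set<String> expectedChanges, double threshold) {
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            double before = entry.getValue();
            double after = candidate.getOrDefault(entry.getKey(), 0.0);
            double change = before == 0 ? (after == 0 ? 0 : 1) : Math.abs(after - before) / before;
            if (change > threshold && !expectedChanges.contains(entry.getKey())) {
                System.out.printf("Investigate %s: %.0f%% change not explained by the release%n",
                        entry.getKey(), change * 100);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> production = Map.of("bookings_db_reads", 1200.0, "seat_allocations", 300.0);
        Map<String, Double> test = Map.of("bookings_db_reads", 40.0, "seat_allocations", 310.0);
        compare(production, test, Set.of("recommendation_warehouse_calls"), 0.25);
    }
}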
I recently began a new role as a software engineer, and in my current position, I spend a lot of time in the terminal. Even though I have been a long-time Linux user, I embarked on my Linux journey after becoming frustrated with setting up a Node.js environment on Windows during my college days. It was during that time that I discovered Ubuntu, and it was then that I fell in love with the simplicity and power of the Linux terminal. Despite starting my Linux journey with Ubuntu, my curiosity led me to try other distributions, such as Manjaro Linux, and ultimately Arch Linux. Without a doubt, I have a deep affection for Arch Linux. However, at my day job, I used macOS, and gradually, I also developed a love for macOS. Now, I have transitioned to macOS as my daily driver. Nevertheless, my love for Linux, especially Arch Linux and the extensive customization it offers, remains unchanged. Anyway, in this post, I will be discussing grep and how I utilize it to analyze logs and uncover insights. Without a doubt, grep has proven to be an exceptionally powerful tool. However, before we delve into grep, let’s first grasp what grep is and how it works. What Is grep and How Does It Work? grep is a powerful command-line utility in Unix-like operating systems used for searching text or regular expressions (patterns) within files. The name “grep” stands for “Global Regular Expression Print.” It’s an essential tool for system administrators, programmers, and anyone working with text files and logs. How It Works When you use grep, you provide it with a search pattern and a list of files to search through. The basic syntax is: grep [options] pattern [file...] Here’s a simple understanding of how it works: Search pattern: You provide a search pattern, which can be a simple string or a complex regular expression. This pattern defines what you’re searching for within the files. Files to search: You can specify one or more files (or even directories) in which grep should search for the pattern. If you don’t specify any files, grep reads from the standard input (which allows you to pipe in data from other commands). Matching lines:grep scans through each line of the specified files (or standard input) and checks if the search pattern matches the content of the line. Output: When a line containing a match is found, grep prints that line to the standard output. If you’re searching within multiple files, grep also prefixes the matching lines with the file name. Options:grep offers various options that allow you to control its behavior. For example, you can make the search case-insensitive, display line numbers alongside matches, invert the match to show lines that don’t match and more. Backstory of Development grep was created by Ken Thompson, one of the early developers of Unix, and its development dates back to the late 1960s. The context of its creation lies in the evolution of the Unix operating system at Bell Labs. Ken Thompson, along with Dennis Ritchie and others, was involved in developing Unix in the late 1960s. As part of this effort, they were building tools and utilities to make the system more practical and user-friendly. One of the tasks was to develop a way to search for patterns within text files efficiently. The concept of regular expressions was already established in the field of formal language theory, and Thompson drew inspiration from this. He created a program that utilized a simple form of regular expressions for searching and printing lines that matched the provided pattern. 
This program eventually became grep. The initial version of grep used a simple and efficient algorithm to perform the search, which is based on the use of finite automata. This approach allowed for fast pattern matching, making grep a highly useful tool, especially in the early days of Unix when computing resources were limited. Over the years, grep has become an integral part of Unix-like systems, and its functionality and capabilities have been extended. The basic concept of searching for patterns in text using regular expressions, however, remains at the core of grep’s functionality. grep and Log Analysis So you might be wondering how grep can be used for log analysis. Well, grep is a powerful tool that can be used to analyze logs and uncover insights. In this section, I will be discussing how I use grep to analyze logs and find insights. Isolating Errors Debugging often starts with identifying errors in logs. To isolate errors using grep, I use the following techniques: Search for error keywords: Start by searching for common error keywords such as "error", "exception", "fail" or "invalid" . Use case-insensitive searches with the -i flag to ensure you capture variations in case. Multiple pattern search: Use the -e flag to search for multiple patterns simultaneously. For instance, you could search for both "error" and "warning" messages to cover a wider range of potential issues. Contextual search: Use the -C flag to display a certain number of lines of context around each match. This helps you understand the context in which an error occurred. Tracking Down Issues Once you’ve isolated errors, it’s time to dig deeper and trace the source of the issue: Timestamp-based search: If your logs include timestamps, use them to track down the sequence of events leading to an issue. You can use grep along with regular expressions to match specific time ranges. Unique identifiers: If your application generates unique identifiers for events, use these to track the flow of events across log entries. Search for these identifiers using grep. Combining with other tools: Combine grep with other command-line tools like sort, uniq, and awk to aggregate and analyze log entries based on various criteria. Identifying Patterns Log analysis is not just about finding errors; it’s also about identifying patterns that might provide insights into performance or user behavior: Frequency analysis: Use grep to count the occurrence of specific patterns. This can help you identify frequently occurring events or errors. Custom pattern matching: Leverage regular expressions to define custom patterns based on your application’s unique log formats. Anomaly detection: Regular expressions can also help you detect anomalies by defining what “normal” log entries look like and searching for deviations from that pattern. Conclusion In the world of debugging and log analysis, grep is a tool that can make a significant difference. Its powerful pattern-matching capabilities, combined with its versatility in handling regular expressions, allow you to efficiently isolate errors, track down issues, and identify meaningful patterns in your log files. With these techniques in your toolkit, you’ll be better equipped to unravel the mysteries hidden within your logs and ensure the smooth operation of your systems and applications. Happy log hunting! Remember, practice is key. 
The more you experiment with grep and apply these techniques to your real-world scenarios, the more proficient you’ll become at navigating through log files and gaining insights from them. Examples Isolating Errors Search for lines containing the word “error” in a log file: grep -i "error" application.log Search for lines containing either “error” or “warning” in a log file: grep -i -e "error" -e "warning" application.log Display lines containing the word “error” along with 2 lines of context before and after: grep -C 2 "error" application.log Tracking Down Issues Search for log entries within a specific time range (using regular expressions for timestamp matching): grep "^\[2023-08-31 10:..:..]" application.log Search for entries associated with a specific transaction ID: grep "TransactionID: 12345" application.log Count the occurrences of a specific error: grep -c "Connection refused" application.log Identifying Patterns Count the occurrences of the word “error” (in each of its case variants) in a log file: grep -i -o "error" application.log | sort | uniq -c Search for log entries containing IP addresses: grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" application.log Detect unusual patterns using negative lookaheads (these require Perl-compatible regular expressions, so use GNU grep’s -P flag rather than -E): grep -P "^(?!.*normal).*error" application.log Lastly, I hope you enjoyed reading this and got a chance to learn something new from this post. If you have any grep tips or stories about how you started your Linux journey, feel free to comment below, as I would love to hear them.
One of my current talks focuses on Observability in general and Distributed Tracing in particular, with an OpenTelemetry implementation. In the demo, I show how you can see the traces of a simple distributed system consisting of the Apache APISIX API Gateway, a Kotlin app with Spring Boot, a Python app with Flask, and a Rust app with Axum. Earlier this year, I spoke in and attended the Observability room at FOSDEM. One of the talks demoed the Grafana stack: Mimir for metrics, Tempo for traces, and Loki for logs. I was pleasantly surprised by how one could move from one to the other. Thus, I wanted to achieve the same in my demo but via OpenTelemetry to avoid coupling to the Grafana stack. In this blog post, I want to focus on logs and Loki. Loki Basics and Our First Program At its core, Loki is a log storage engine: Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream. (Loki documentation) Loki provides a RESTful API to store and read logs. Let's push a log from a Java app. Loki expects the following payload structure: I'll use Java, but you can achieve the same result with a different stack. The most straightforward code is the following: Java public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException { var template = "'{' \"streams\": ['{' \"stream\": '{' \"app\": \"{0}\" '}', \"values\": [[ \"{1}\", \"{2}\" ]]'}']'}'"; //1 var now = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant(); var nowInEpochNanos = NANOSECONDS.convert(now.getEpochSecond(), SECONDS) + now.getNano(); var payload = MessageFormat.format(template, "demo", String.valueOf(nowInEpochNanos), "Hello from Java App"); //1 var request = HttpRequest.newBuilder() //2 .uri(new URI("http://localhost:3100/loki/api/v1/push")) .header("Content-Type", "application/json") .POST(HttpRequest.BodyPublishers.ofString(payload)) .build(); HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); //3 } This is how we did String interpolation in the old days Create the request Send it The prototype works, as seen in Grafana: However, the code has many limitations: The label is hard-coded; the code can only ever send that single label Everything is hard-coded; nothing is configurable, e.g., the URL The code sends one request for every log; it's hugely inefficient as there's no buffering The HTTP client is synchronous, thus blocking the thread while waiting for Loki No error handling whatsoever Loki offers both gzip compression and Protobuf; neither is supported by my code Finally, it's completely unrelated to how we use logs, e.g.: Java var logger = // Obtain logger logger.info("My message with parameters {}, {}", foo, bar); Regular Logging on Steroids To use the above statement, we need to choose a logging implementation. Because I'm more familiar with it, I'll use SLF4J and Logback. Don't worry; the same approach works for Log4J2.
We need to add relevant dependencies: XML <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-api</artifactId> <!--1--> <version>2.0.7</version> </dependency> <dependency> <groupId>ch.qos.logback</groupId> <artifactId>logback-classic</artifactId> <!--2--> <version>1.4.8</version> <scope>runtime</scope> </dependency> <dependency> <groupId>com.github.loki4j</groupId> <artifactId>loki-logback-appender</artifactId> <!--3--> <version>1.4.0</version> <scope>runtime</scope> </dependency> SLF4J is the interface Logback is the implementation Logback appender dedicated to SLF4J Now, we add a specific Loki appender and reference it from the root logger: XML <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender"> <!--1--> <http> <url>http://localhost:3100/loki/api/v1/push</url> <!--2--> </http> <format> <label> <pattern>app=demo,host=${HOSTNAME},level=%level</pattern> <!--3--> </label> <message> <pattern>l=%level h=${HOSTNAME} c=%logger{20} t=%thread | %msg %ex</pattern> <!--4--> </message> <sortByTime>true</sortByTime> </format> </appender> <root level="DEBUG"> <appender-ref ref="LOKI" /> </root> The loki appender Loki URL As many labels as wanted Regular Logback pattern Our program has become much more straightforward: Java var who = //... var logger = LoggerFactory.getLogger(Main.class.toString()); logger.info("Hello from {}!", who); Grafana displays the following: Docker Logging I'm running most of my demos on Docker Compose, so I'll mention the Docker logging trick. When a container writes to standard out, Docker saves it to a local file. The docker logs command can access the file content. However, other options than saving to a local file are available, e.g., syslog, Google Cloud, Splunk, etc. To choose a different option, one sets a logging driver. One can configure the driver at the overall Docker level or per container. Loki offers its own plugin. To install it: Shell docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions At this point, we can use it on our container app: YAML services: app: build: . logging: driver: loki #1 options: loki-url: http://localhost:3100/loki/api/v1/push #2 loki-external-labels: container_name={{.Name}},app=demo #3 Loki logging driver URL to push to Additional labels The result is the following. Note the default labels. Conclusion From a bird's eye view, Loki is nothing extraordinary: it's a plain storage engine with a RESTful API on top. Several approaches are available to use the API. Beyond the naive one, we have seen a Java logging framework appender and Docker. Other approaches include scraping the log files, e.g., with Promtail or via a Kubernetes sidecar. You could also add an OpenTelemetry Collector between your app and Loki to perform transformations. Options are virtually unlimited. Be careful to choose the one that fits your context the best. To go further: Push log entries to Loki via API Loki Clients
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting tool kit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is, what it is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. Previously, I shared an introduction to Prometheus, installing Prometheus, an introduction to the query language, exploring basic queries, using advanced queries, relabeling metrics in Prometheus, and discovering service targets as free online labs. In this article, you'll learn all about instrumenting your applications using Prometheus client libraries. Your learning path takes you into the wonderful world of instrumenting applications in Prometheus, where you learn all about client libraries for the languages you code in. Note this article is only a short summary, so please see the complete lab found online to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is as follows: This lab introduces client libraries and shows you how to use them to add Prometheus metrics to applications and services. You'll get hands-on and instrument a sample application to start collecting metrics. You start in this lab reviewing how Prometheus metrics collection works, exploring client library architectures, and reviewing the four basic metrics types (counters, gauges, histograms, and summaries). If you've never collected any type of metrics data before, you're given two systems to help you get started. One is known as the USE method and is known for systems or infrastructure metrics. The other is the RED method, which targets more applications and services. The introduction finishes with a few best practices around naming your metrics and warnings on how to avoid cardinality bombs. Instrumentation in Java For the rest of this lab, you'll be working on exercises that walk you through instrumenting a simple Java application using the Prometheus Java client library. No previous Java experience is required, but there are assumptions made that you have minimum versions of Java and Maven installed. You are provided with a Java project that you can easily download and work from using your favorite IDE. If you don't work in an IDE, use any editor you like as the coding you'll be doing is possible with just cutting and pasting from the lab slides. To install the project locally: Download and unzip the Prometheus Java Metrics Demo from GitLab. Unzip the prometheus-java-metrics-demo-main.zip file in your workshop directory. Open the project in your favorite IDE (examples shown in the lab use VSCode). You'll be building and running the Java application, which is a basic empty service where comments are used to show where your application code would go. Before that block, you see that the instrumentation has been provided for all four of the basic metric types. 
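As a rough idea of what such instrumentation can look like, here is a hedged sketch using the classic Prometheus Java simpleclient (the io.prometheus.client packages). The metric names and help strings mirror the sample output shown next, and the /metrics endpoint is exposed on port 7777 as in the lab, but the workshop's actual project code may well be structured differently.
Java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.Summary;
import io.prometheus.client.exporter.HTTPServer;

import java.util.Random;

public class JavaMetricsDemo {
    static final Counter counter = Counter.build()
            .name("java_app_c").help("is a counter metric").register();
    static final Gauge gauge = Gauge.build()
            .name("java_app_g").help("is a gauge metric").register();
    static final Histogram histogram = Histogram.build()
            .name("java_app_h").help("is a histogram metric").register();
    static final Summary summary = Summary.build()
            .name("java_app_s").help("is a summary metric (request size in bytes)")
            .quantile(0.5, 0.01).quantile(0.9, 0.01).quantile(0.99, 0.01)
            .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(7777); // exposes /metrics on localhost:7777
        Random random = new Random();
        while (true) { // simulate work with random observations
            counter.inc();
            gauge.set(random.nextDouble() * 10);
            double observed = random.nextDouble() * 5;
            histogram.observe(observed);
            summary.observe(observed);
            Thread.sleep(1000);
        }
    }
}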
Once you have built and started the Java JAR file, the output will show you that the setup has been successful: $ cd prometheus-java-metrics-demo-main/ $ mvn clean install (watch for BUILD SUCCESS) $ java -jar target/java_metrics-1.0-SNAPSHOT-jar-with-dependencies.jar Java example metrics setup successful... Java example service started... Now it's just waiting for you to validate the endpoint at localhost:7777/metrics, which displays the metrics: # HELP java_app_s is a summary metric (request size in bytes) # TYPE java_app_s summary java_app_s{quantile="0.5",} 2.679717814859738 java_app_s{quantile="0.9",} 4.566657867333372 java_app_s{quantile="0.99",} 4.927313848318692 java_app_s_count 512.0 java_app_s_sum 1343.9017287309503 # HELP java_app_h is a histogram metric # TYPE java_app_h histogram java_app_h_bucket{le="0.005",} 1.0 java_app_h_bucket{le="0.01",} 1.0 ... java_app_h_bucket{le="10.0",} 512.0 java_app_h_bucket{le="+Inf",} 512.0 java_app_h_count 512.0 java_app_h_sum 1291.5300871683055 # HELP java_app_c is a counter metric # TYPE java_app_c counter java_app_c 512.0 # HELP java_app_g is a gauge metric # TYPE java_app_g gauge java_app_g 5.5811320747117765 While the metrics are exposed in this example on localhost:7777, they will not be scraped by Prometheus until you have updated its configuration to add this new endpoint. Let's update our workshop-prometheus.yml file to add the Java application job as shown along with comments for clarity (this is the minimum needed, with a few custom labels for fun): # workshop config global: scrape_interval: 5s scrape_configs: # Scraping Prometheus. - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Scraping java metrics. - job_name: "java_app" static_configs: - targets: ["localhost:7777"] labels: job: "java_app" env: "workshop-lab8" Start the Prometheus instance (for container Prometheus, see the workshop for details) and then watch the console where you started the Java application as it will report each time a new scrape is done by logging Handled :/metrics: $./prometheus --config.file=support/workshop-prometheus.yml ===========Java application log=============== Java example metrics setup successful... Java example service started... Handled :/metrics Handled :/metrics Handled :/metrics Handled :/metrics ... You can validate that the Java metrics you just instrumented in your application are available in the Prometheus console localhost:9090 as shown. Feel free to query and explore: Next up, you'll be creating your own Java metrics application starting with the minimal setup needed to get your Java application running and exposing the path /metrics. Instead of coding it all by hand, you're given a starting point class file found in the project. Instrumenting Basic Metrics Java was chosen as the language due to many developers using this in enterprises, and exposing you to the Prometheus client library usage for a common developer language is a good baseline. The rest of the lab walks through multiple exercises where you start from a blank application template that's provided and code step-by-step the four basic metrics types. You're also walked through a custom build and run of the application each step of the way, with the following process used for each metric type as you work from implementation, to build, to validating that it works: Add the necessary Java client library import statements for the metric type you are adding. Add the code to construct the metric type you are defining. 
Initialize the new metric in a thread with basic numerical values (often random numbers). Rebuild the basic Java application to create an updated JAR file you can run. Start the application and validate that the new metric is available on localhost:9999/metrics. Once all four of the basic metric types have been implemented and tested, you learn to update your Prometheus configuration to pick up your application: # workshop config global: scrape_interval: 5s scrape_configs: # Scraping Prometheus. - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Scraping java metrics. - job_name: "java_app" static_configs: - targets: ["localhost:9999"] labels: job: "java_app" env: "workshop-lab8" Finally, you verify that you are collecting your Java-instrumented application data by checking through the Prometheus query console: Miss Previous Labs? This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Perses to pick up where you left off. Coming up Next I'll be taking you through the final lab in this workshop, where you'll learn all about metrics monitoring at scale and understanding some of the pain points with Prometheus that arise as you start to scale out your observability architecture and start caring more about reliability. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
Intro to Istio Observability Using Prometheus Istio service mesh abstracts the network from the application layers using sidecar proxies. You can apply security and advanced networking policies to all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput. In this article, we will discuss how SREs can benefit from integrating three open-source tools: Istio, Prometheus, and Grafana. While Istio is the most famous service mesh, Prometheus is the most widely used monitoring software, and Grafana is the most famous visualization tool. Note: The steps are tested for Istio 1.17.X Watch the Video of Istio, Prometheus, and Grafana Configuration Watch the video if you want to follow the steps from the video: Step 1: Go to Istio Add-Ons and Apply Prometheus and Grafana YAML File First, go to the add-on folder in the Istio directory using the command. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus and Grafana by using the following commands: Shell kubectl apply -f prometheus.yaml Shell kubectl apply -f grafana.yaml Note these add-on YAMLs are applied to the istio-system namespace by default. Step 2: Deploy New Service and Port-Forward Istio Ingress Gateway To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an object of the Istio ingress gateway to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port, 7777. You should see the below screen at localhost:7777 Step 3: Open Prometheus and Grafana Dashboard You can open the Prometheus and Grafana dashboards by using the following commands: Shell istioctl dashboard prometheus Shell istioctl dashboard grafana Both Grafana and Prometheus will open on localhost. Step 4: Make HTTP Requests From Postman We will see how the httpbin service is consuming CPU or memory when there is a traffic load. We will create a few GET and POST requests to localhost:7777 from the Postman app. Once you send GET or POST requests to the httpbin service multiple times, there will be utilization of resources, and we can see them in Grafana. But first, we need to configure the metrics for the httpbin service in Prometheus and Grafana. Step 5: Configuring Metrics in Prometheus One can select a range of metrics related to any Kubernetes resources such as the API server, applications, workloads, Envoy, etc. We will select the container_memory_working_set_bytes metric for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term: container_memory_working_set_bytes{namespace="istio-telemetry"} (istio-telemetry is the name of our Istio-enabled namespace, where the httpbin service is deployed) Note that simply running this gives us the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage grouped by pod.
The following query will help us get the desired result: sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article was to showcase the ability of Istio to emit and send metrics to Prometheus for collection. Step 6: Configuring Istio Metrics Graphs in Grafana Now, you can simply take the query sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) from Prometheus and plot a graph over time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles for sharing with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana. You can make all the customizations based on your enterprise needs. I have done a few experiments in the video; feel free to check it out. Conclusion Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have just offered a small use case of metrics scraping and visualization using Istio, Prometheus, and Grafana. You can also perform logging and tracing of real-time traffic using Istio; we will cover those topics in our subsequent blogs.
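If you would rather pull the same number programmatically than read it off a dashboard, Prometheus also exposes an HTTP API. Below is a small, hedged Java sketch that sends the article's PromQL query to the standard /api/v1/query endpoint and prints the raw JSON response; it assumes Prometheus is reachable on localhost:9090 (for example, while istioctl dashboard prometheus is running) and does no error handling or JSON parsing.
Java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class MemoryByPodQuery {
    public static void main(String[] args) throws Exception {
        String promql = "sum(container_memory_working_set_bytes{namespace=\"istio-telemetry\"}) by (pod)";
        String url = "http://localhost:9090/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder().uri(new URI(url)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body contains one result entry per pod; here we just print it.
        System.out.println(response.body());
    }
}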
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep