Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Integrate Istio and Apache Skywalking for Kubernetes Observability
LLMs Demand Observability-Driven Development
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany. This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens. Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the discussions and chats that happen in the breaks between talks, where you can connect with core maintainers of various aspects of the Prometheus project. Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there. Let's dive right in and see what the event had to offer this year in Berlin. This overview will be my impressions of each day of the event, but not all the sessions will be covered.

Let's start with a short overview of the insights taken away after sessions, chats, and the social event:

- OpenTelemetry interoperability (in all flavors) is the hot topic of the year.
- Native Histograms were a big topic over the last two years; this year they showed up as having a lot of promise here and there, but they were not a big topic in this year's talks.
- The Perses dashboard and visualization project presented its Alpha release as a truly open-source project based on the Apache 2.0 license.
- By my count, there were ~150 attendees, and all talks/lightning talks were live-streamed and will also be made available on their YouTube channel post-event.

Day 1

The day started with a lovely walk through the center of Berlin to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline):

What's New in Prometheus and Its Ecosystem

- Native Histograms - Efficiency and more details. Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated."
- stringlabels - Storing labels differently for significant memory reduction
- keep_firing_for field added to alerting rules - How long an alert will continue firing after its condition is no longer met
- scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding having to have big config files
- OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics
- SNMP Exporter (v0.24) - Breaking changes: new configuration format that splits connection settings from metrics details and is simpler to change. Also added the ability to query multiple modules in a single scrape.
- MySQLd Exporter (v0.15) - Multi-target support; use a single exporter to monitor multiple MySQL-alike servers
- Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms
- Alertmanager - New receivers: MS Teams, Discord, Webex
- Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now

Every Tuesday, Prometheus meets for Bug Scrub at 11:00 UTC. Calendar: https://prometheus.io/community.

What's Coming

- New AlertManager UI
- Metadata improvements
- Exemplar improvements
- Remote Write v2

Perses: The CNCF Candidate for Observability Visualization

Summary
An announcement was given of the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility - purpose-built for observability data; a truly open-source alternative with the Apache 2.0 license.
Perses was born because the CNCF landscape was missing a visualization tooling project:

- Perses - An exploration of a standard dashboard format
- Chronosphere, Red Hat, and Amadeus are presented as founding members
- GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment
- Chronosphere supported its development, and Red Hat is integrating the Perses package into the OpenShift Console. There is an exploration of its usage with Prometheus/PromLens.
- Currently only displays metrics, but Red Hat is working on integrating tracing with OpenTelemetry; logs are on the future wishlist.
- Feature details were presented for the development of dashboards
- Includes Grafana migration tooling

I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for CNCF Sandbox status.

Towards Making Prometheus OpenTelemetry Native

Summary
OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental.

Details on the Effort
- OTLP ingestion is there experimentally. The experience with target_info is a big pain point at the moment.
- Takes about half the bandwidth of remote write, with 30-40% more CPU due to gzip
- A new Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; it may inspire Prometheus remote write 2.0
- There is a GitHub milestone to track progress
- Thinking about using collector remote config to solve the "split configuration" between the Prometheus server and OpenTelemetry clients

Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos

Summary
Shopify states they are running "highly scalable, globally distributed, and highly dynamic" cloud infrastructure, so they are on "Planet Scale" with Prometheus.

Details on the Effort
- Huge Ruby shop, latency-sensitive, with large scaling events around the retail cycle and flash sales
- HPA struggles with scaling up quickly enough
- Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters
- Backend is Thanos-based, but they have added a lot on top of it (custom work)
- Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
- Have a router layer on top of Thanos to decouple ingestion and storage; it sounds like they're evolving into a Mimir-like setup
- Split the query layer into two deployments: one for short-term queries and one for longer-term queries
- Team- and service-centric UI for alerting, integrated with SLO tracking
- Native histograms solved cardinality challenges and, combined with Thanos' distributed querier, made very high cardinality queries work; as they stated, "This changed the game for us."
- When migrating from the previous observability vendor, they decided not to convert dashboards; instead, they worked with developers to build new, cleaner ones.
- Developers are not scoping queries well, so most queries fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue.

Lightning Talks

Summary
It's always fun to end the day with a quick series of talks collected ad hoc from the attendees. Below is a list of the ones I thought were interesting, with a short summary, should you want to find them in the recordings:

- AlertManager UI: Alertmanager will get a new UI in React.
  ELM didn't get traction as a common language, and they are considering alternatives to Bootstrap.
- Implementing integrals with Prometheus and Grafana: Integrals in PromQL - the inverse of rates; a pure-PromQL version of the delta counter they do, using sum_over_time and Grafana variables to simplify getting all the right factors.
- Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore prod metrics; an interesting idea to integrate this deeply.

Day 2

After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline):

Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus

Summary
Overview of the metrics story at Shopify, with over 1k teams running it:

- Originally forwarding metrics "from observability vendor agent"
- Issues because that was multiplying the cardinality across exporter instances; same with the sidecar model
- Built a StatsD protocol-aware load balancer
- Running as a sidecar also had ownership issues, stating, "We would be on call for every application"
- DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level
- Didn't want per-instance metrics because of cardinality, and metrics are more domain-level
- Roughly one exporter per 50-100 nodes
- Load balancer sanitizes label values and drops labels
- Pre-aggregation on short time scales to deal with "hot loop instrumentation"; resulted in roughly a 20x reduction in bandwidth use
- Compensating for the lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
- "We have close to a thousand teams right now"

Prometheus Java Client 1.0.0

Summary
V1.0.0 was released last week. This talk was an overview of some of their updates, featuring native histograms and OpenTelemetry support. They rewrote the underlying model, so there are breaking changes, with a migration module for Prometheus simpleclient metrics. JavaDoc can be found here.
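To make the new API concrete, here is a minimal sketch of counter instrumentation with the rewritten 1.0.0 client, based on the project's documented builder style; the metric name, label, and port are made up for illustration, so treat the details as a sketch rather than a definitive example:

Java
import io.prometheus.metrics.core.metrics.Counter;
import io.prometheus.metrics.exporter.httpserver.HTTPServer;

public class InstrumentationSketch {
    public static void main(String[] args) throws Exception {
        // Register a counter with the new builder-style API.
        Counter requests = Counter.builder()
                .name("requests_total")
                .help("Total number of handled requests")
                .labelNames("path")
                .register();
        requests.labelValues("/api/orders").inc();

        // Expose the metrics endpoint for scraping (port chosen arbitrarily here).
        HTTPServer server = HTTPServer.builder().port(9400).buildAndStart();
        System.out.println("Metrics at http://localhost:" + server.getPort() + "/metrics");
    }
}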
Other highlights from the talk:

- Migrating is almost as simple as updating imports in your Java app; I'm going to update my workshop Java example for instrumentation to the new API
- Includes good examples in the project
- Exposes native + classic histograms by default; it's the scraper's choice
- A lot more configuration is available as Java properties
- Callback metrics (this is great for writing exporters)
- OTel push support (on a configurable interval)
- Allows standard OTel names (with dots); automatically replaces dots with underscores for the Prometheus format
- Integrates with the OTel tracing client to make exemplars work - picks exemplars from the tracing context and extends the tracing context to mark that trace so it does not get sampled away
- Despite supporting OTel, this is still a performance-minded client library
- All metric types support concurrent updates
- Dropped Pushgateway support for now, but will port it forward
- Once the JMX exporter is updated to the new client, you get these improvements there as a side effect
- Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused

Lightning Talks

Summary
Again, here is a list of lightning talks I thought were interesting from the final day, with a short summary, should you want to find them in the recordings:

- Tracking object storage costs: Trying to measure object storage costs, as they are the number two cost in their cloud bills; built a Prometheus price exporter. Object storage cost is ~half of Grafana's cloud bill; it varies by customer (and can be as low as 2%). A trick for extending sparse metrics with zeroes: or on() vector(0). They have a prices exporter in the works and promised to open source it.
- Prom operator - what's next?: A tour of some more features coming in the Prometheus Operator: shards autoscaling, scrape classes, support for Kubernetes events, and Prometheus-agent deployment as a DaemonSet.
- Prometheus adoption stats: 868k users in 2023 (up from 774k last year), based on Grafana instances that have at least one Prometheus data source enabled.

Final Impressions

For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have "getting started" sessions. Most of the content assumes you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into the status of features in the monitoring world.
This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. To support real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs and, of course, be scalable to support the huge and ever-enlarging data size. The rest of this article is about what their log processing architecture looks like and how they realize stable data ingestion, low-cost storage, and quick queries with it.

System Architecture

This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing.

- ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is stored in HDFS for data verification or replay.
- DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables are also put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are tolerant of duplication, the fact tables are arranged in the Duplicate Key model of Apache Doris.
- DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis.
- ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model.

Architecture 2.0 evolves from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.

Real-Case Practice

Stable Ingestion of 15 Billion Logs Per Day

In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads.

A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:

- Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce the writing frequency and the number of transactions processed by Doris per unit of time. This relieves data writing pressure and avoids generating too many data versions.
- Data Pre-Aggregation: For data that shares the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.
- Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overheads), and dialing up max_tablet_version_num to avoid version accumulation.

These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris ensures quicker data updates.

Storage Strategies to Reduce Costs by 50%

The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs.

- ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, they specify the compression method as "ZSTD" upon table creation, which realizes a compression ratio of 10:1.
- Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) is stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder," it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage.
- Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition. Data that is 3~6 months old has two replicas, and data from 6 months ago has one single copy.

With these three strategies, the user has reduced their storage costs by 50%.

Differentiated Query Strategies Based on Data Size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes:

- Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour. This avoids data skew. To further ensure the balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data from the recent 20 days is kept. In this way, they find the balance point between data backlog and analytic needs.
- 100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Thus, queries on these tables are much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.
- More than 100T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated there.
In this way, queries over 2 billion log records can be done in 1~2s. These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds.

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numerics and datetime values. They have also provided valuable feedback about the auto-bucketing logic in Doris: currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime, but little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic, where the reference for Doris to decide the number of buckets is the data size and distribution of the previous day. They've come to the Apache Doris community, and we are now working on this optimization.
Picture a success story: a trace identified a pesky latency issue in your application's authentication service. A fix was swiftly implemented, and we all celebrated a quick win in the next team meeting. But the celebrations were short-lived. Just days later, user complaints surged about a related payment gateway timeout. It turned out that the fix did improve performance at one point but created a situation in which key information was never cached. Other parts of the software reacted badly to the fix, and we had to revert the whole thing. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built this way. Relying solely on a single trace gave us a partial view of a broader problem.

This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems.

The Limiting Factor

The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is with the synchronized block in another related method, causing thread contention and slowing down the entire system.

Temporal blindness is the second problem. Think of a Java Garbage Collection (GC) log. A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system.

The last problem is related to that, and it is context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace.
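That kind of context is cheap to record at instrumentation time. As a hypothetical sketch using the OpenTelemetry Java API (the attribute names and the Order placeholder type are made up for illustration, not taken from any real codebase), the batch size can be attached to the active span so a 5-second processOrders() trace can be judged against how much work it actually did:

Java
import io.opentelemetry.api.trace.Span;
import java.util.List;

class OrderProcessor {
    record Order(long id) { }  // placeholder type, for illustration only

    void processOrders(List<Order> orders) {
        // Attach the "how much work" context to the current span, so the
        // trace shows whether 50 or 5,000 orders were being processed.
        Span span = Span.current();
        span.setAttribute("orders.count", orders.size());
        span.setAttribute("orders.source", "fetchOrdersFromDatabase");

        // ... actual processing ...
    }
}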
Strength in Numbers

Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue. Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound during high transaction loads.

This helps us distinguish between correlation and causation. Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation.

This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the tong motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs.

Example

Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting concepts adopted by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that work should probably still be done manually since it can't understand the why, only the what. It seems it can detect the venerable Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors? How impactful is the performance on the rest of the application? These become questions with easy answers when we see all the different aspects laid out together.

Magical APIs

The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA). The great Vlad Mihalcea has an excellent explanation. The TL;DR is rather simple: we write a simple database query using an ORM, but we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem in "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what is going on. Observability is one of the best ways to understand why these things fail. In the past, I used to reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle.

Final Word

Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight.
While these traces offer valuable insights, their true potential is only realized when viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience. In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
If your environment is like many others, it can often seem like your systems produce logs filled with a bunch of excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs — and they don’t typically have any specific purpose or focus holding them together — you may dread sifting through them. If you don’t have the right tools, it can feel like you’re stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good material from the heaps of galactic material. Though it can feel like more trouble than it’s worth, sorting through logs is crucial. Logs hold many valuable insights into what’s happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we’re going to take a look at how logging can help you make sense of your log data without much effort. We'll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let’s blast off and turn that cosmic trash into treasure! The Truth Is Out There: Getting Value Just From the Things You’re Already Logging One massive benefit offered by a log analytics platform to any system engineer is the ability to utilize a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator can make it much easier to see simultaneous events across your infrastructure. You’ll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data. Of course, you could make your own aggregation interface from scratch, but often, log aggregation tools provide a number of extra features that are worth the additional investment. Those extra features include capabilities such as powerful search and fast analytics. Searching Through the Void: Using Search Query Language To Find Things You’ve probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting logs all in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are being executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let’s take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find through your log data. 
For example, if you wanted to search across your logs, and you know your log lines follow the format "From: Jane To: John," you can parse out the "from" and "to" with the following query:

* | parse "From: * To: *" as (from, to)

This would store "Jane" in the "from" field and "John" in the "to" field. Another valuable language feature you could tap into would be keyword expressions. You could use this query to search across your logs for any instances where a command with root privileges failed:

(su OR sudo) AND (fail* OR error)

Here is a listing of General Search Examples that are drawn from parsing a single Apache log message.

Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics

One other aspect of searching is that it's typically looking into the past. Sometimes, you need to be looking at things as they happen. Let's take a look at Live Tail and LogReduce — two tools to improve simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic's offering, so we'll dive into them.

Live Tail

At its simplest, Live Tail lets you see a live feed of your log messages. It's like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you're looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in.

LogReduce

LogReduce gives you more insight into, and a better understanding of, your search query's aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of "Signatures" that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the relevance of the pattern when compared to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all pretty advanced and can be hard to understand without a demo, so you can dive deeper by watching this video.

Integrated Log Aggregation

Often, you'll need information from systems you aren't running directly mixed in with your other logs. That's why it's important to make sure you can integrate your log aggregator with other systems. Many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren't only available on the ELK stack. Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company's 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don't need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of the logs in your aggregator together.

Conclusion

This has been a brief tour through some of the features available with log aggregation.
There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need to have a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can let you unlock infinite value from your already existing log data. That way, you can become a better cosmic junk collector!
The World Has Changed, and We Need To Adapt The world has gone through a tremendous transformation in the last fifteen years. Cloud and microservices changed the world. Previously, our application was using one database; developers knew how it worked, and the deployment rarely happened. A single database administrator was capable of maintaining the database, optimizing the queries, and making sure things worked as expected. The database administrator could just step in and fix the performance issues we observed. Software engineers didn’t need to understand the database, and even if they owned it, it was just a single component of the system. Guaranteeing software quality was much easier because the deployment happened rarely, and things could be captured on time via automated tests. Fifteen years later, everything is different. Companies have hundreds of applications, each one with a dedicated database. Deployments happen every other hour, deployment pipelines work continuously, and keeping track of flowing changes is beyond one’s capabilities. The complexity of the software increased significantly. Applications don’t talk to databases directly but use complex libraries that generate and translate queries on the fly. Application monitoring is much harder because applications do not work in isolation, and each change may cause multiple other applications to fail. Reasoning about applications is now much harder. It’s not enough to just grab the logs to understand what happened. Things are scattered across various components, applications, queues, service buses, and databases. Databases changed as well. We have various SQL distributions, often incompatible despite having standards in place. We have NoSQL databases that provide different consistency guarantees and optimize their performance for various use cases. We developed multiple new techniques and patterns for structuring our data, processing it, and optimizing schemas and indexes. It’s not enough now to just learn one database; developers need to understand various systems and be proficient with their implementation details. We can’t rely on ACID anymore as it often harms the performance. However, other consistency levels require a deep understanding of the business. This increases the conceptual load significantly. Database administrators have a much harder time keeping up with the changes, and they don’t have enough time to improve every database. Developers are unable to analyze and get the full picture of all the moving parts, but they need to deploy changes faster than ever. And the monitoring tools still swamp us with metrics instead of answers. Given all the complexity, we need developers to own their databases and be responsible for their data storage. This “shift left” in responsibility is a must in today’s world for both small startups and big Fortune 500 enterprises. However, it’s not trivial. How do we prevent the bad code from reaching production? How to troubleshoot issues automatically? How do we move from monitoring to observability? Finally, how do we give developers the proper tools and processes so they will be able to own the databases? Read on to find answers. Measuring Application Performance Is Complex It’s crucial to measure to improve the performance. Performance indicators (PIs) help us evaluate the performance of the system on various dimensions. They can focus on infrastructure aspects such as the reliability of the hardware or networking. 
They can use application metrics to assess the performance and stability of the system. They can also include business metrics to measure the success from the company and user perspective, including user retention or revenue. Performance indicators are important tracking mechanisms to understand the state of the system and the business as a whole. However, in our day-to-day job, we need to track many more metrics. We need to understand contributors to the performance indicators to troubleshoot the issues earlier and understand whether the system is healthy or not. Let’s see how to build these elements in the modern world. We typically need to start with telemetry — the ability to collect the signals. There are multiple types of signals that we need to track: logs (especially application logs), metrics, and traces. Capturing these signals can be a matter of proper configuration (like enabling them in the hosting provider panel), or they need to be implemented by the developers. Recently, OpenTelemetry gained significant popularity. It’s a set of SDKs for popular programming languages that can be used to instrument applications to generate signals. This way, we have a standardized way of building telemetry within our applications. Odds are that most of the frameworks and libraries we use are already integrated with OpenTelemetry and can generate signals properly. Next, we need to build a solution for capturing the telemetry signals in one centralized place. This way, we can see “what happens” inside the system. We can browse the signals from the infrastructure (like hosts, CPUs, GPUs, and network), applications (number of requests, errors, exceptions, data distribution), databases (data cardinality, number of transactions, data distribution), and many other parts of the application (queues, notification services, service buses, etc.). This lets us troubleshoot more easily as we can see what happens in various parts of the ecosystem. Finally, we can build the Application Performance Management (APM). It’s the way of tracking metric indicators with telemetry and dashboards. APM focuses on providing end-to-end monitoring that goes across all the components of the system, including the web layer, mobile and desktop applications, databases, and the infrastructure connecting all the elements. It can be used to automate alarms and alerts to constantly assess whether the system is healthy. APM may seem like a silver bullet. It aggregates metrics, shows the performance, and can quickly alert when something goes wrong, and the fire begins. However, it’s not that simple. Let’s see why. Why Application Performance Monitoring Is Not Enough APM captures signals and presents them in a centralized application. While this may seem enough, it lacks multiple features that we would expect from a modern maintenance system. First, APM typically presents raw signals. While it has access to various metrics, it doesn’t connect the dots easily. Imagine that the CPU spikes. Should you migrate to a bigger machine? Should you optimize the operating system? Should you change the driver? Or maybe the CPU spike is caused by different traffic coming to the application? You can’t tell that easily just by looking at metrics. Second, APM doesn’t easily show where the problem is. We may observe metrics spiking in one part of the system, but it doesn’t necessarily mean that the part is broken. There may be other reasons and issues. 
Maybe it’s wrong input coming to the system, maybe some external dependency is slow, and maybe some scheduled task runs too often. APM doesn’t show that, as it cannot connect the dots and show the flow of changes throughout the system. You just see the state then, but you don’t see how you got to that point easily. Third, the resolution is unknown. Let’s say that the CPU spiked during the scheduled maintenance task. Should we upscale the machine? Should we disable the task? Should we run it some other time? Is there a bug in the task? Many things are not clear. We can easily imagine a situation when the scheduled task runs in the middle of the day just because it is more convenient for the system administrators; however, the task is now slow and competes with regular transactions for the resources. In that case, we probably should move the task to some time outside of peak hours. Another scenario is that the task was using an index that doesn’t work anymore. Therefore, it’s not about the task per se, but it’s about the configuration that has been changed with the last deployment. Therefore, we should fix the index. APM won’t show us all those details. Fourth, APM is not very readable. Dashboards with metrics look great, but they are too often just checked whether they’re green. It’s not enough to see that alarms are not ringing. We need to manually review the metrics, look for anomalies, understand how they change, and if we have all the alarms in place. This is tedious and time-consuming, and many developers don’t like doing that. Metrics, charts, graphs, and other visualizations swamp us with raw data that doesn’t show the big picture. Finally, one person can’t reason about the system. Even if we have a dedicated team for maintenance, the team won’t have an understanding of all the changes going through the system. In the fast-paced world with tens of deployments every day, we can’t look for issues manually. Every deployment may result in an outage due to invalid schema migration, bad code change, cache purge, lack of hardware, bad configuration, or many more issues. Even when we know something is wrong and we can even point to the place, the team may lack the understanding or knowledge needed to identify the root cause. Involving more teams is time-consuming and doesn’t scale. While APM looks great, it’s not the ultimate solution. We need something better. We need something that connects the dots and provides answers instead of data. We need true observability. What Makes the Observability Shine Observability turns alerts into root causes and raw data into understanding. Instead of charts, diagrams, and graphs, we want to have a full story of the changes going through pipelines and how they affect the system. This should understand the characteristics of the application, including the deployment scheme, data patterns, partitioning, sharding, regionalization, and other things specific to the application. Observability lets us reason about the internals of the system from the outside. For instance, we can reason that we deployed the wrong changes to the production environment because there is a metric spike in the database. We don’t focus on the database per se, but we analyze the difference between the current and the previous code. However, if there was no deployment recently, but we observe much higher traffic on the load balancer, then we can reason that it’s probably due to different traffic coming to the application. Observability makes the interconnections clear and visible. 
To build observability, we need to capture static signals and dynamic history. We need to include our deployments, configuration, extensions, connectivity, and characteristics of our application code. It’s not enough just to see that “something is red now.” We need to understand how we got there and what could be the possible reason. To achieve that, a good observability solution needs to go through multiple steps. First, we need to be able to pinpoint the problem. In the modern world of microservices and bounded contexts, it’s not trivial. If the CPU spikes, we need to be able to answer which service or application caused that, which tenant is responsible, or whether this is for all the traffic or some specific requests in the case of a web application. We can do that by carefully observing metrics with multiple dimensions, possibly with dashboards and alarms. Second, we need to include multiple signals. CPU spikes can be caused by a lack of hardware, wrong configuration, broken code, unexpected traffic, or simply things that shouldn’t be running at that time. What’s more, maybe something unexpected happened around the time of the issue. This could be related to a deployment, an ongoing sports game, a specific time of week or time of year, some promotional campaign we just started, or some outage in the cloud infrastructure. All these inputs must be provided to the observability system to understand the bigger picture. Third, we need to look for anomalies. It may seem counterintuitive, but digital applications rot over time. Things change, traffic changes, updates are installed, security fixes are deployed, and every single change can break our application. However, the outage may not be quick and easy. The application may get slower and slower over time, and we won’t notice that easily because alarms do not go off or they become red only for a short period. Therefore, we need to have anomaly detection built-in. We need to be able to look for traffic patterns, weekly trends, and known peaks during the year. A proper observability solution needs to be aware of these and automatically find the situations in which the metrics don’t align. Fourth, we need to be able to automatically root cause the issue and suggest a solution. We can’t push the developers to own the databases and the systems without proper tooling. The observability systems need to be able to automatically suggest improvements. We need to unblock the developers so they can finally be responsible for the performance and own the systems end to end. Databases and Observability We Need Today Let’s now see what we need in the domain of databases. Many things can break, and it’s worth exploring the challenges we may face when working with SQL or NoSQL databases. We are going to see the three big areas where things may go wrong. These are code changes, schema changes, and execution changes. Code Changes Many database issues come from the code changes. Developers modify the application code, and that results in different SQL statements being sent to the database. These queries may be inherently slow, but these won’t be captured by the testing processes we have in place now. Imagine that we have the following application code that extracts the user aggregate root. 
The user may have multiple additional pieces of information associated with them, like details, pages, or texts:

JavaScript
const user = repository.get("user")
    .where("user.id = 123")
    .leftJoin("user.details", "user_details_table")
    .leftJoin("user.pages", "pages_table")
    .leftJoin("user.texts", "texts_table")
    .leftJoin("user.questions", "questions_table")
    .leftJoin("user.reports", "reports_table")
    .leftJoin("user.location", "location_table")
    .leftJoin("user.peers", "peers_table")
    .getOne();

return user;

The code generates the following SQL statement:

SQL
SELECT *
FROM users AS user
LEFT JOIN user_details_table AS detail ON detail.user_id = user.id
LEFT JOIN pages_table AS page ON page.user_id = user.id
LEFT JOIN texts_table AS text ON text.user_id = user.id
LEFT JOIN questions_table AS question ON question.user_id = user.id
LEFT JOIN reports_table AS report ON report.user_id = user.id
LEFT JOIN locations_table AS location ON location.user_id = user.id
LEFT JOIN peers_table AS peer ON peer.user_id = user.id
WHERE user.id = '123'

Because of the multiple joins, the query returns nearly 300 thousand rows to the application, which are later processed by the mapper library. This takes 25 seconds in total, just to get one user entity. The problem with such a query is that we don't see the performance implications when we write the code. If we have a small developer database with only a hundred rows, then we won't get any performance issues when running the code above locally. Unit tests won't catch it either because the code is "correct" — it returns the expected result. We won't see the issue until we deploy to production and see that the query is just too slow.

Another problem is the well-known N+1 query problem with Object-Relational Mapper (ORM) libraries. Imagine that we have a table flights that is in a 1-to-many relation with a table tickets. If we write code to get all the flights and count all the tickets, we may end up with the following:

C#
var totalTickets = 0;
var flights = dao.getFlights();
foreach (var flight in flights) {
    totalTickets += flight.getTickets().count;
}

This may result in N+1 queries being sent in total: one query to get all the flights, and then n queries to get the tickets for every flight:

SQL
SELECT * FROM flights;
SELECT * FROM tickets WHERE ticket.flight_id = 1;
SELECT * FROM tickets WHERE ticket.flight_id = 2;
SELECT * FROM tickets WHERE ticket.flight_id = 3;
...
SELECT * FROM tickets WHERE ticket.flight_id = n;

Just as before, we don't see the problem when running things locally, and our tests won't catch it. We'll find the problem only when we deploy to an environment with a sufficiently big data set.
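One common remedy, shown here as a hedged sketch in Java with JPA (the Flight and Ticket entities and the EntityManager wiring are hypothetical, not taken from the code above), is to fetch the association in a single query instead of touching it lazily once per row:

Java
import jakarta.persistence.EntityManager;
import java.util.List;

class TicketCounter {
    private final EntityManager entityManager;

    TicketCounter(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    long countTickets() {
        // JOIN FETCH loads each flight together with its tickets in one query,
        // instead of one query for flights plus one query per flight.
        List<Flight> flights = entityManager.createQuery(
                "SELECT DISTINCT f FROM Flight f LEFT JOIN FETCH f.tickets", Flight.class)
            .getResultList();

        return flights.stream()
                .mapToLong(f -> f.getTickets().size())
                .sum();
    }
}

If only the count is needed, a plain aggregate query such as SELECT COUNT(t) FROM Ticket t avoids loading the entities at all; either way, the point is that the number of statements should not grow with the number of flights.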
Yet another thing is about rewriting queries to make them more readable. Let's say that we have a table boarding_passes, and we want to write the following query (just for exemplary purposes):

SQL
SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2
    ON C2.ticket_no = C1.ticket_no
    AND C2.flight_id = C1.flight_id
    AND C2.boarding_no = C1.boarding_no
JOIN boarding_passes AS C3
    ON C3.ticket_no = C1.ticket_no
    AND C3.flight_id = C1.flight_id
    AND C3.boarding_no = C1.boarding_no
WHERE MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'

This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on the condition. It runs for 8 seconds on my machine with the demo database. A programmer may now want to avoid this repetition and rewrite the query as follows:

SQL
WITH cte AS (
    SELECT *, MD5(MD5(ticket_no)) AS double_hash
    FROM boarding_passes
)
SELECT COUNT(*)
FROM cte AS C1
JOIN cte AS C2
    ON C2.ticket_no = C1.ticket_no
    AND C2.flight_id = C1.flight_id
    AND C2.boarding_no = C1.boarding_no
JOIN cte AS C3
    ON C3.ticket_no = C1.ticket_no
    AND C3.flight_id = C1.flight_id
    AND C3.boarding_no = C1.boarding_no
WHERE C1.double_hash = '525ac610982920ef37b34aa56a45cd06'
    AND C2.double_hash = '525ac610982920ef37b34aa56a45cd06'
    AND C3.double_hash = '525ac610982920ef37b34aa56a45cd06'

The query is now more readable as it avoids repetition. However, the performance dropped, and the query now executes in 13 seconds. When we deploy changes like these to production, we may conclude that we need to upscale the database. Seemingly, nothing has changed, but the database is now much slower. With good observability tools, we would see that the query executed behind the scenes is now different, which leads to the performance drop.

Schema Changes

Another problem around databases is schema management. There are generally three different ways of modifying the schema: we can add something (a table, a column, an index, etc.), remove something, or modify something. Each schema modification is dangerous because the database engine may need to rewrite the table — copy the data to the side, modify the table schema, and then copy the data back. This may lead to a very long deployment (minutes, hours, even months) that we can't optimize or stop in the middle. Additionally, we typically won't see the problems when running things locally because we run our tests against the latest schema. A good observability solution needs to capture these changes before they run in production.

Indexes pose another interesting challenge. Adding an index seems safe. However, as is the case with every index, it needs to be maintained over time. Indexes generally improve read performance because they help us find rows much faster. At the same time, they decrease modification performance, as every data modification must be performed in the table and in all the indexes. What's more, indexes may stop being useful after some time. It's often the case that we configure an index; a couple of months later, we change the application code, and the index isn't used anymore. Without good observability systems, we won't notice that the index isn't useful anymore and only decreases performance.

Execution Changes

Yet another area of issues is related to the way we execute queries. Databases prepare a so-called execution plan for each query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics of the tables' content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and the running configuration. The execution plan dictates which join strategy to use when joining tables (nested loop join, merge join, hash join, or maybe something else), which indexes to scan (or tables instead), and when to sort and materialize the results. We can affect the execution plan by providing query hints. Inside the SQL statements, we can specify which join strategy to use or which locks to acquire. The database may use these hints to improve performance but may also disregard them and execute things differently.
However, we don’t know whether the database used them or not. Things get worse over time. Indexes may change after the deployment, data distribution may depend on the day of the week, and the database load may be much different between countries when we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won’t catch that. Unit tests are used to verify the correctness of our queries, and the queries will still return the same results. We have simply no way of identifying these changes automatically. Database Guardrails Is the New Standard Based on what we said above, we need a new approach. No matter if we run a small product or a big Fortune 500 company, we need a novel way of dealing with databases. Developers need to own their databases and have all the means to do it well. We need good observability and database guardrails — a novel approach that: Prevents the bad code from reaching production, Monitors all moving pieces to build a meaningful context for the developer, It significantly reduces the time to identify the root cause and troubleshoot the issues, so the developer gets direct and actionable insights We can’t let ourselves go blind anymore. We need to have tools and systems that will help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let’s see how we can build such a system. There are four things that we need to capture to build successful database guardrails. Let’s walk through them. Database Internals Each database provides enough details about the way it executes the query. These details are typically captured in the execution plan that explains what join strategies were used, which tables and indexes were scanned, or what data was sorted. To get the execution plan, we can typically use the EXPLAIN keyword. For instance, if we take the following PostgreSQL query: SQL SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' We can add EXPLAIN to get the following query: SQL EXPLAIN SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' The query returns the following output: SQL Nested Loop (cost=1.44..4075.42 rows=480 width=89) -> Nested Loop (cost=1.00..30.22 rows=480 width=10) -> Index Only Scan using name_basics_pkey on name_basics nb (cost=0.43..4.45 rows=1 width=10) Index Cond: (nconst = 'nm00001'::text) -> Index Only Scan using title_principals_nconst_idx on title_principals tp (cost=0.56..20.96 rows=480 width=20) Index Cond: (nconst = 'nm00001'::text) -> Index Scan using title_basics_pkey on title_basics tb (cost=0.43..8.43 rows=1 width=89) Index Cond: (tconst = tp.tconst) This gives a textual representation of the query and how it will be executed. We can see important information about the join strategy (Nested Loop in this case), tables and indexes used (Index Only Scan for name_basics_pkey, or Index Scan for title_basics_pkey), and the cost of each operation. Cost is an arbitrary number indicating how hard it is to execute the operation. We shouldn’t draw any conclusions from the numbers per se, but we can compare various plans based on the cost and choose the cheapest one. Having plans at hand, we can easily tell what’s going on. 
We can see if we have an N+1 query issue, whether we use indexes efficiently, and whether the operation runs fast. We can get some insights into how to improve the queries. We can immediately tell if a query is going to scale well in production just by looking at how it reads the data. Once we have these plans, we can move on to another part of successful database guardrails. Integration With Applications We need to extract plans somehow and correlate them with what our application does. To do that, we can use OpenTelemetry (OTel). OpenTelemetry is an open standard for instrumenting applications. It provides multiple SDKs for various programming languages and is now commonly used in frameworks and libraries for HTTP, SQL, ORM, and other application layers. OpenTelemetry captures signals: logs, traces, and metrics. These are later organized into spans and traces that represent the communication between services and the timing of operations. Each span represents one operation performed by some server. This could be a file access, a database query, or request handling. We can now extend OpenTelemetry signals with details from databases. We can extract execution plans, correlate them with signals from other layers, and build a full understanding of what happened behind the scenes. For instance, we would clearly see the N+1 problem just by looking at the number of spans. We could immediately identify schema migrations that are too slow or operations that will take the database down. Now, we need the last piece to capture the full picture. Semantic Monitoring of All Databases Observing just the local database may not be enough. The same query may execute differently depending on the configuration or the freshness of statistics. Therefore, we need to integrate monitoring with all the databases we have, especially with the production ones. By extracting statistics, row counts, the running configuration, or installed extensions, we can get an understanding of how the database performs. Next, we can integrate that with the queries we run locally. We take the query that we captured in the local environment and then reason about how it would execute in production. We can compare the execution plan and see which tables are accessed or how many rows are being read. This way, we can immediately tell the developer that the query is not going to scale well in production. Even if the developer has a different database locally or has a low number of rows, we can still take the query or the execution plan, enrich it with the production statistics, and reason about the performance after the deployment. We don’t need to wait for the deployment or the load tests; we can provide feedback nearly immediately. The most important part is that we move from raw signals to reasoning. We don’t swamp the user with plots or metrics that are hard to understand or that the user can’t use easily without setting the right thresholds. Instead, we can provide meaningful suggestions. Instead of saying, “CPU spiked to 80%,” we can say, “The query scanned the whole table, and you should add an index on this and that column.” We can give developers answers, not only the data points to reason about. Automated Troubleshooting That’s just the beginning. Once we understand what is actually happening in the database, the sky's the limit. We can run anomaly detection on the queries to see how they change over time, if they use the same indexes as before, or if they changed the join strategy.
We can catch ORM configuration changes that lead to multiple SQL queries being sent for a particular REST API. We can submit automated pull requests to tune the configuration. We can correlate the application code with the SQL query so we can rewrite the code on the fly with machine-learning solutions. Summary In recent years, we have observed a big evolution in the software industry. We run many applications, deploy many times a day, scale out to hundreds of servers, and use more and more components. Application Performance Monitoring is not enough to keep track of all the moving parts in our applications. Here at Metis, we believe that we need something better. We need true observability that can finally show us the full story. And we can use observability to build database guardrails that provide the actual answers and actionable insights. Not a set of metrics that the developer needs to track and understand, but automated reasoning connecting all the dots. That’s the new approach we need and the new age we deserve as developers owning our databases.
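To make the integration described above a little more concrete, here is a hedged Java sketch that wraps a query in an OpenTelemetry span and attaches the SQL text and its execution plan as span attributes. It uses only the opentelemetry-api dependency; the span name, the db.plan attribute, and the fetchPlan helper are illustrative choices for this sketch rather than anything defined by the article or by an official convention (db.statement, by contrast, is a commonly used OpenTelemetry attribute for the SQL text), and an SDK with an exporter still has to be configured for the spans to end up anywhere.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedQuery {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("db-guardrails-demo");

    static String queryWithPlan(String sql) {
        Span span = tracer.spanBuilder("boarding_passes.select").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("db.statement", sql); // the SQL text
            String plan = fetchPlan(sql);           // e.g., the JDBC EXPLAIN sketch shown earlier
            span.setAttribute("db.plan", plan);     // custom, illustrative attribute for the plan
            return plan;
        } finally {
            span.end();
        }
    }

    private static String fetchPlan(String sql) {
        // Placeholder: run "EXPLAIN " + sql over JDBC and concatenate the returned rows.
        return "Nested Loop ...";
    }
}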
This is an article from DZone's 2023 Automated Testing Trend Report. For more: Read the Report. One of the core capabilities that has seen increased interest in the DevOps community is observability. Observability improves monitoring in several vital ways, making it easier and faster to understand business flows and allowing for enhanced issue resolution. Furthermore, observability goes beyond an operations capability and can be used for testing and quality assurance. Testing has traditionally faced the challenge of identifying the appropriate testing scope. "How much testing is enough?" and "What should we test?" are questions each testing executive asks, and the answers have been elusive. There are fewer arguments about testing new functionality; while not trivial, you know the functionality you built in new features and hence can derive the proper testing scope from your understanding of the functional scope. But what else should you test? What is a comprehensive general regression testing suite, and what previous functionality will be impacted by the new functionality you have developed and will release? Observability can help us with this, as well as with the unavoidable defect investigation. But before we get to this, let's take a closer look at observability. What Is Observability? Observability is not monitoring with a different name. Monitoring is usually limited to observing a specific aspect of a resource, like disk space or memory of a compute instance. Monitoring one specific characteristic can be helpful in an operations context, but it usually only detects a subset of what is concerning. All monitoring can show is that the system looks okay, yet users can still be experiencing significant outages. Observability aims to let us see the state of the system by making data flows "observable." This means that we can identify when something starts to behave out of order and requires our attention. Observability combines logs, metrics, and traces from infrastructure and applications to gain insights. Ideally, it organizes these around workflows instead of system resources and, as such, creates a functional view of the system in use. Done correctly, it lets you see what functionality is being executed and how frequently, and it enables you to identify performance characteristics of the system and workflow. Figure 1: Observability combines metrics, logs, and traces for insights One benefit of observability is that it shows you the actual system. It is not biased by what the designers, architects, and engineers think should happen in production. It shows the unbiased flow of data. The users, over time (and sometimes from the very first day), find ways to use the system quite differently from what was designed. Observability makes such changes in behavior visible. Observability is incredibly powerful in debugging system issues as it allows us to navigate the system to see where problems occur. Observability requires a dedicated setup and some contextual knowledge, similar to traceability. Traceability is the ability to follow a system transaction over time through all the different components of our application and infrastructure architecture, which means you have to have common information, like an ID, that enables this. OpenTelemetry is an open standard that can be used for this and provides useful guidance on how to set it up. Observability makes identifying production issues a lot easier. And we can use observability for our benefit in testing, too.
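As a small, hedged illustration of the "common information like an ID" point above, the sketch below uses the OpenTelemetry API to start a span and writes the current trace ID into an ordinary log line, so the log entry and the trace can later be joined in whatever backend you use. The class name, span name, and order ID are invented for the example, and a configured OpenTelemetry SDK plus an SLF4J binding are assumed; without an SDK, the API silently falls back to no-op spans.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("webshop");

    void handleCheckout(String orderId) {
        Span span = tracer.spanBuilder("checkout").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // The trace ID is the shared piece of information that lets you jump
            // from this log line to the matching trace (and back).
            log.info("checkout started orderId={} traceId={}",
                    orderId, Span.current().getSpanContext().getTraceId());
            // ... business logic ...
        } finally {
            span.end();
        }
    }
}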
Observability of Testing: How to Look Left Two aspects of observability make it useful in the testing context: Its ability to make the actual system usage observable and its usefulness in finding problem areas during debugging. Understanding the actual system behavior is most directly useful during performance testing. Performance testing is the pinnacle of testing since it tries to achieve as close to the realistic peak behavior of a system as possible. Unfortunately, performance testing scenarios are often based on human knowledge of the system instead of objective information. For example, performance testing might be based on the prediction of 10,000 customer interactions per hour during a sales campaign based on the information of the sales manager. Observability information can help define the testing scenarios by using the information to look for the times the system was under the most stress in production and then simulate similar situations in the performance test environment. We can use a system signature to compare behaviors. A system signature in the context of observability is the set of values for logs, metrics, and traces during a specific period. Take, for example, a marketing promotion for new customers. The signature of the system should change during that period to show more new account creations with its associated functionality and the related infrastructure showing up as being more "busy." If the signature does not change during the promotion, we would predict that we also don't see the business metrics move (e.g., user sign-ups). In this example, the business metrics and the signature can be easily matched. Figure 2: A system behaving differently in test, which shows up in the system signature In many other cases, this is not true. Imagine an example where we change the recommendation engine to use our warehouse data going forward. We expect the system signature to show increased data flows between the recommendation engine and our warehouse system. You can see how system signatures and the changes of the system signature can be useful for testing; any differences in signature between production and the testing systems should be explainable by the intended changes of the upcoming release. Otherwise, investigation is required. In the same way, information from the production observability system can be used to define a regression suite that reflects the functionality most frequently used in production. Observability can give you information about the workflows still actively in use and which workflows have stopped being relevant. This information can optimize your regression suite both from a maintenance perspective and, more importantly, from a risk perspective, making sure that core functionality, as experienced by the user, remains in a working state. Implementing observability in your test environments means you can use the power of observability for both production issues and your testing defects. It removes the need for debugging modes to some degree and relies upon the same system capability as production. This way, observability becomes how you work across both dev and ops, which helps break down silos. Observability for Test Insights: Looking Right In the previous section, we looked at using observability by looking left or backward, ensuring we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing. 
During the test cycles, we see how this new feature changes the workflows, which shows up in our observability solution. We can see the new features being used and other features changing in usage as a result. The signature of our application has changed when we consider the logs, traces, and metrics of our system in test. Once we go live, we predict that the signature of the production system will change in a very similar way. If that happens, we will be happy. But what if the signature of the production system does not change as predicted? Let's take an example: We created a new feature that leverages information from previous bookings to better serve our customers by allocating similar seats and menu options. During testing, we exercised the new feature with our test data set, and we saw an increase in access to the bookings database while the customer booking was being collated. Once we go live, we realize that the workflows are not utilizing the customer booking database, and we leverage the information from our observability tooling to investigate. We have found a case where the users are not using our new feature, or are not using it in the expected way. In either case, this information allows us to investigate further to see whether more change management is required for the users or whether our feature is just not solving the problem in the way we wanted it to. Another way to use observability is to evaluate the performance impact of your changes in test on the system signature; comparing this afterwards with the production system signature can give valuable insights and prevent overall performance degradation. Our testing efforts (and the associated predictions) have now become a valuable tool for the business to evaluate the success of a feature, which elevates testing to a business tool and a real value investment. Figure 3: Using observability in test by looking left and looking right Conclusion While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. It will create objectivity for defining testing efforts and evaluating results against the actual system behavior in production. It also provides value to developer, tester, and business communities, which makes it a valuable tool for breaking down barriers. Using the same practices and tools across communities drives a common culture; after all, culture is nothing but repeated behaviors.
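To make the idea of a system signature slightly more tangible, here is a toy Java sketch that is purely illustrative and not taken from the article: it represents a signature as a map from metric name to an observed value over some period, compares a production baseline against a test run, and flags any change that is both large and not on the list of changes expected from the release. The metric names and the 25% threshold are invented.
Java
import java.util.Map;
import java.util.Set;

public class SignatureDiff {
    // Flags metrics whose relative change exceeds the threshold and is not
    // explained by the intended changes of the upcoming release.
    static void compare(Map<String, Double> baseline, Map<String, Double> candidate,
                        Set<String> expectedChanges, double threshold) {
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            double before = entry.getValue();
            double after = candidate.getOrDefault(entry.getKey(), 0.0);
            double change = before == 0 ? (after == 0 ? 0 : 1) : Math.abs(after - before) / before;
            if (change > threshold && !expectedChanges.contains(entry.getKey())) {
                System.out.printf("Investigate %s: %.0f%% change not explained by the release%n",
                        entry.getKey(), change * 100);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> production = Map.of("bookings_db_reads", 1200.0, "seat_allocations", 300.0);
        Map<String, Double> test = Map.of("bookings_db_reads", 40.0, "seat_allocations", 310.0);
        compare(production, test, Set.of("recommendation_warehouse_calls"), 0.25);
    }
}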
I recently began a new role as a software engineer, and in my current position, I spend a lot of time in the terminal. Even though I have been a long-time Linux user, I embarked on my Linux journey after becoming frustrated with setting up a Node.js environment on Windows during my college days. It was during that time that I discovered Ubuntu, and it was then that I fell in love with the simplicity and power of the Linux terminal. Despite starting my Linux journey with Ubuntu, my curiosity led me to try other distributions, such as Manjaro Linux, and ultimately Arch Linux. Without a doubt, I have a deep affection for Arch Linux. However, at my day job, I used macOS, and gradually, I also developed a love for macOS. Now, I have transitioned to macOS as my daily driver. Nevertheless, my love for Linux, especially Arch Linux and the extensive customization it offers, remains unchanged. Anyway, in this post, I will be discussing grep and how I utilize it to analyze logs and uncover insights. Without a doubt, grep has proven to be an exceptionally powerful tool. However, before we delve into grep, let’s first grasp what grep is and how it works. What Is grep and How Does It Work? grep is a powerful command-line utility in Unix-like operating systems used for searching text or regular expressions (patterns) within files. The name “grep” stands for “Global Regular Expression Print.” It’s an essential tool for system administrators, programmers, and anyone working with text files and logs. How It Works When you use grep, you provide it with a search pattern and a list of files to search through. The basic syntax is: grep [options] pattern [file...] Here’s a simple understanding of how it works: Search pattern: You provide a search pattern, which can be a simple string or a complex regular expression. This pattern defines what you’re searching for within the files. Files to search: You can specify one or more files (or even directories) in which grep should search for the pattern. If you don’t specify any files, grep reads from the standard input (which allows you to pipe in data from other commands). Matching lines:grep scans through each line of the specified files (or standard input) and checks if the search pattern matches the content of the line. Output: When a line containing a match is found, grep prints that line to the standard output. If you’re searching within multiple files, grep also prefixes the matching lines with the file name. Options:grep offers various options that allow you to control its behavior. For example, you can make the search case-insensitive, display line numbers alongside matches, invert the match to show lines that don’t match and more. Backstory of Development grep was created by Ken Thompson, one of the early developers of Unix, and its development dates back to the late 1960s. The context of its creation lies in the evolution of the Unix operating system at Bell Labs. Ken Thompson, along with Dennis Ritchie and others, was involved in developing Unix in the late 1960s. As part of this effort, they were building tools and utilities to make the system more practical and user-friendly. One of the tasks was to develop a way to search for patterns within text files efficiently. The concept of regular expressions was already established in the field of formal language theory, and Thompson drew inspiration from this. He created a program that utilized a simple form of regular expressions for searching and printing lines that matched the provided pattern. 
This program eventually became grep. The initial version of grep used a simple and efficient algorithm to perform the search, which is based on the use of finite automata. This approach allowed for fast pattern matching, making grep a highly useful tool, especially in the early days of Unix when computing resources were limited. Over the years, grep has become an integral part of Unix-like systems, and its functionality and capabilities have been extended. The basic concept of searching for patterns in text using regular expressions, however, remains at the core of grep’s functionality. grep and Log Analysis So you might be wondering how grep can be used for log analysis. Well, grep is a powerful tool that can be used to analyze logs and uncover insights. In this section, I will be discussing how I use grep to analyze logs and find insights. Isolating Errors Debugging often starts with identifying errors in logs. To isolate errors using grep, I use the following techniques: Search for error keywords: Start by searching for common error keywords such as "error", "exception", "fail" or "invalid" . Use case-insensitive searches with the -i flag to ensure you capture variations in case. Multiple pattern search: Use the -e flag to search for multiple patterns simultaneously. For instance, you could search for both "error" and "warning" messages to cover a wider range of potential issues. Contextual search: Use the -C flag to display a certain number of lines of context around each match. This helps you understand the context in which an error occurred. Tracking Down Issues Once you’ve isolated errors, it’s time to dig deeper and trace the source of the issue: Timestamp-based search: If your logs include timestamps, use them to track down the sequence of events leading to an issue. You can use grep along with regular expressions to match specific time ranges. Unique identifiers: If your application generates unique identifiers for events, use these to track the flow of events across log entries. Search for these identifiers using grep. Combining with other tools: Combine grep with other command-line tools like sort, uniq, and awk to aggregate and analyze log entries based on various criteria. Identifying Patterns Log analysis is not just about finding errors; it’s also about identifying patterns that might provide insights into performance or user behavior: Frequency analysis: Use grep to count the occurrence of specific patterns. This can help you identify frequently occurring events or errors. Custom pattern matching: Leverage regular expressions to define custom patterns based on your application’s unique log formats. Anomaly detection: Regular expressions can also help you detect anomalies by defining what “normal” log entries look like and searching for deviations from that pattern. Conclusion In the world of debugging and log analysis, grep is a tool that can make a significant difference. Its powerful pattern-matching capabilities, combined with its versatility in handling regular expressions, allow you to efficiently isolate errors, track down issues, and identify meaningful patterns in your log files. With these techniques in your toolkit, you’ll be better equipped to unravel the mysteries hidden within your logs and ensure the smooth operation of your systems and applications. Happy log hunting! Remember, practice is key. 
The more you experiment with grep and apply these techniques to your real-world scenarios, the more proficient you’ll become at navigating through log files and gaining insights from them. Examples Isolating Errors Search for lines containing the word “error” in a log file: grep -i "error" application.log Search for lines containing either “error” or “warning” in a log file: grep -i -e "error" -e "warning" application.log Display lines containing the word “error” along with 2 lines of context before and after: grep -C 2 "error" application.log Tracking Down Issues Search for log entries within a specific time range (using regular expressions for timestamp matching): grep "^\[2023-08-31 10:..:..]" application.log Search for entries associated with a specific transaction ID: grep "TransactionID: 12345" application.log Count the occurrences of a specific error: grep -c "Connection refused" application.log Identifying Patterns Count the occurrences of the word “error” (in each of its case variants) in a log file: grep -i -o "error" application.log | sort | uniq -c Search for log entries containing IP addresses: grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" application.log Detect unusual patterns using negative lookaheads (these require Perl-compatible regular expressions, so use GNU grep’s -P flag rather than -E): grep -P "^(?!.*normal).*error" application.log Lastly, I hope you enjoyed reading this and got a chance to learn something new from this post. If you have any grep tips or stories about how you started your Linux journey, feel free to comment below, as I would love to hear them.
One of my current talks focuses on Observability in general and Distributed Tracing in particular, with an OpenTelemetry implementation. In the demo, I show how you can see the traces of a simple distributed system consisting of the Apache APISIX API Gateway, a Kotlin app with Spring Boot, a Python app with Flask, and a Rust app with Axum. Earlier this year, I spoke in and attended the Observability room at FOSDEM. One of the talks demoed the Grafana stack: Mimir for metrics, Tempo for traces, and Loki for logs. I was pleasantly surprised by how one could move from one to the other. Thus, I wanted to achieve the same in my demo but via OpenTelemetry to avoid coupling to the Grafana stack. In this blog post, I want to focus on logs and Loki. Loki Basics and Our First Program At its core, Loki is a log storage engine: Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream. (Loki documentation) Loki provides a RESTful API to store and read logs. Let's push a log from a Java app. Loki expects the following payload structure: I'll use Java, but you can achieve the same result with a different stack. The most straightforward code is the following: Java public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException { var template = "'{' \"streams\": ['{' \"stream\": '{' \"app\": \"{0}\" '}', \"values\": [[ \"{1}\", \"{2}\" ]]'}']'}'"; //1 var now = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant(); var nowInEpochNanos = NANOSECONDS.convert(now.getEpochSecond(), SECONDS) + now.getNano(); var payload = MessageFormat.format(template, "demo", String.valueOf(nowInEpochNanos), "Hello from Java App"); //1 var request = HttpRequest.newBuilder() //2 .uri(new URI("http://localhost:3100/loki/api/v1/push")) .header("Content-Type", "application/json") .POST(HttpRequest.BodyPublishers.ofString(payload)) .build(); HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); //3 } This is how we did String interpolation in the old days Create the request Send it The prototype works, as seen in Grafana: However, the code has many limitations: The label is hard-coded; the code can only ever send that single label Everything is hard-coded; nothing is configurable, e.g., the URL The code sends one request for every log; it's hugely inefficient as there's no buffering The HTTP client is synchronous, thus blocking the thread while waiting for Loki No error handling whatsoever Loki offers both gzip compression and Protobuf; neither is supported by my code Finally, it's completely unrelated to how we use logs, e.g.: Java var logger = // Obtain logger logger.info("My message with parameters {}, {}", foo, bar); Regular Logging on Steroids To use the above statement, we need to choose a logging implementation. Because I'm more familiar with it, I'll use SLF4J and Logback. Don't worry; the same approach works for Log4J2.
We need to add relevant dependencies: XML <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-api</artifactId> <!--1--> <version>2.0.7</version> </dependency> <dependency> <groupId>ch.qos.logback</groupId> <artifactId>logback-classic</artifactId> <!--2--> <version>1.4.8</version> <scope>runtime</scope> </dependency> <dependency> <groupId>com.github.loki4j</groupId> <artifactId>loki-logback-appender</artifactId> <!--3--> <version>1.4.0</version> <scope>runtime</scope> </dependency> SLF4J is the interface Logback is the implementation Logback appender dedicated to SLF4J Now, we add a specific Loki appender and reference it from the root logger: XML <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender"> <!--1--> <http> <url>http://localhost:3100/loki/api/v1/push</url> <!--2--> </http> <format> <label> <pattern>app=demo,host=${HOSTNAME},level=%level</pattern> <!--3--> </label> <message> <pattern>l=%level h=${HOSTNAME} c=%logger{20} t=%thread | %msg %ex</pattern> <!--4--> </message> <sortByTime>true</sortByTime> </format> </appender> <root level="DEBUG"> <appender-ref ref="LOKI" /> </root> The loki appender Loki URL As many labels as wanted Regular Logback pattern Our program has become much more straightforward: Java var who = //... var logger = LoggerFactory.getLogger(Main.class.toString()); logger.info("Hello from {}!", who); Grafana displays the following: Docker Logging I'm running most of my demos on Docker Compose, so I'll mention the Docker logging trick. When a container writes to standard out, Docker saves it to a local file. The docker logs command can access the file content. However, other options than saving to a local file are available, e.g., syslog, Google Cloud, Splunk, etc. To choose a different option, one sets a logging driver. One can configure the driver at the overall Docker level or per container. Loki offers its own plugin. To install it: Shell docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions At this point, we can use it on our container app: YAML services: app: build: . logging: driver: loki #1 options: loki-url: http://localhost:3100/loki/api/v1/push #2 loki-external-labels: container_name={{.Name}},app=demo #3 Loki logging driver URL to push to Additional labels The result is the following. Note the default labels. Conclusion From a bird's eye view, Loki is nothing extraordinary: it's a plain storage engine with a RESTful API on top. Several approaches are available to use the API. Beyond the naive one, we have seen a Java logging framework appender and Docker. Other approaches include scraping the log files, e.g., with Promtail or via a Kubernetes sidecar. You could also add an OpenTelemetry Collector between your app and Loki to perform transformations. Options are virtually unlimited. Be careful to choose the one that fits your context the best. To go further: Push log entries to Loki via API Loki Clients
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting tool kit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is, what it is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. Previously, I shared an introduction to Prometheus, installing Prometheus, an introduction to the query language, exploring basic queries, using advanced queries, relabeling metrics in Prometheus, and discovering service targets as free online labs. In this article, you'll learn all about instrumenting your applications using Prometheus client libraries. Your learning path takes you into the wonderful world of instrumenting applications in Prometheus, where you learn all about client libraries for the languages you code in. Note this article is only a short summary, so please see the complete lab found online to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is as follows: This lab introduces client libraries and shows you how to use them to add Prometheus metrics to applications and services. You'll get hands-on and instrument a sample application to start collecting metrics. You start in this lab reviewing how Prometheus metrics collection works, exploring client library architectures, and reviewing the four basic metrics types (counters, gauges, histograms, and summaries). If you've never collected any type of metrics data before, you're given two systems to help you get started. One is known as the USE method and is known for systems or infrastructure metrics. The other is the RED method, which targets more applications and services. The introduction finishes with a few best practices around naming your metrics and warnings on how to avoid cardinality bombs. Instrumentation in Java For the rest of this lab, you'll be working on exercises that walk you through instrumenting a simple Java application using the Prometheus Java client library. No previous Java experience is required, but there are assumptions made that you have minimum versions of Java and Maven installed. You are provided with a Java project that you can easily download and work from using your favorite IDE. If you don't work in an IDE, use any editor you like as the coding you'll be doing is possible with just cutting and pasting from the lab slides. To install the project locally: Download and unzip the Prometheus Java Metrics Demo from GitLab. Unzip the prometheus-java-metrics-demo-main.zip file in your workshop directory. Open the project in your favorite IDE (examples shown in the lab use VSCode). You'll be building and running the Java application, which is a basic empty service where comments are used to show where your application code would go. Before that block, you see that the instrumentation has been provided for all four of the basic metric types. 
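As a rough idea of what such instrumentation can look like, here is a hedged sketch using the classic Prometheus Java simpleclient (the io.prometheus.client packages). The metric names and help strings mirror the sample output shown next, and the /metrics endpoint is exposed on port 7777 as in the lab, but the workshop's actual project code may well be structured differently.
Java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.Summary;
import io.prometheus.client.exporter.HTTPServer;

import java.util.Random;

public class JavaMetricsDemo {
    static final Counter counter = Counter.build()
            .name("java_app_c").help("is a counter metric").register();
    static final Gauge gauge = Gauge.build()
            .name("java_app_g").help("is a gauge metric").register();
    static final Histogram histogram = Histogram.build()
            .name("java_app_h").help("is a histogram metric").register();
    static final Summary summary = Summary.build()
            .name("java_app_s").help("is a summary metric (request size in bytes)")
            .quantile(0.5, 0.01).quantile(0.9, 0.01).quantile(0.99, 0.01)
            .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(7777); // exposes /metrics on localhost:7777
        Random random = new Random();
        while (true) { // simulate work with random observations
            counter.inc();
            gauge.set(random.nextDouble() * 10);
            double observed = random.nextDouble() * 5;
            histogram.observe(observed);
            summary.observe(observed);
            Thread.sleep(1000);
        }
    }
}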
Once you have built and started the Java JAR file, the output will show you that the setup has been successful: $ cd prometheus-java-metrics-demo-main/ $ mvn clean install (watch for BUILD SUCCESS) $ java -jar target/java_metrics-1.0-SNAPSHOT-jar-with-dependencies.jar Java example metrics setup successful... Java example service started... Now it's just waiting for you to validate the endpoint at localhost:7777/metrics, which displays the metrics: # HELP java_app_s is a summary metric (request size in bytes) # TYPE java_app_s summary java_app_s{quantile="0.5",} 2.679717814859738 java_app_s{quantile="0.9",} 4.566657867333372 java_app_s{quantile="0.99",} 4.927313848318692 java_app_s_count 512.0 java_app_s_sum 1343.9017287309503 # HELP java_app_h is a histogram metric # TYPE java_app_h histogram java_app_h_bucket{le="0.005",} 1.0 java_app_h_bucket{le="0.01",} 1.0 ... java_app_h_bucket{le="10.0",} 512.0 java_app_h_bucket{le="+Inf",} 512.0 java_app_h_count 512.0 java_app_h_sum 1291.5300871683055 # HELP java_app_c is a counter metric # TYPE java_app_c counter java_app_c 512.0 # HELP java_app_g is a gauge metric # TYPE java_app_g gauge java_app_g 5.5811320747117765 While the metrics are exposed in this example on localhost:7777, they will not be scraped by Prometheus until you have updated its configuration to add this new endpoint. Let's update our workshop-prometheus.yml file to add the Java application job as shown along with comments for clarity (this is the minimum needed, with a few custom labels for fun): # workshop config global: scrape_interval: 5s scrape_configs: # Scraping Prometheus. - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Scraping java metrics. - job_name: "java_app" static_configs: - targets: ["localhost:7777"] labels: job: "java_app" env: "workshop-lab8" Start the Prometheus instance (for container Prometheus, see the workshop for details) and then watch the console where you started the Java application as it will report each time a new scrape is done by logging Handled :/metrics: $./prometheus --config.file=support/workshop-prometheus.yml ===========Java application log=============== Java example metrics setup successful... Java example service started... Handled :/metrics Handled :/metrics Handled :/metrics Handled :/metrics ... You can validate that the Java metrics you just instrumented in your application are available in the Prometheus console localhost:9090 as shown. Feel free to query and explore: Next up, you'll be creating your own Java metrics application starting with the minimal setup needed to get your Java application running and exposing the path /metrics. Instead of coding it all by hand, you're given a starting point class file found in the project. Instrumenting Basic Metrics Java was chosen as the language due to many developers using this in enterprises, and exposing you to the Prometheus client library usage for a common developer language is a good baseline. The rest of the lab walks through multiple exercises where you start from a blank application template that's provided and code step-by-step the four basic metrics types. You're also walked through a custom build and run of the application each step of the way, with the following process used for each metric type as you work from implementation, to build, to validating that it works: Add the necessary Java client library import statements for the metric type you are adding. Add the code to construct the metric type you are defining. 
Initialize the new metric in a thread with basic numerical values (often random numbers). Rebuild the basic Java application to create an updated JAR file you can run. Start the application and validate that the new metric is available on localhost:9999/metrics. Once all four of the basic metric types have been implemented and tested, you learn to update your Prometheus configuration to pick up your application: # workshop config global: scrape_interval: 5s scrape_configs: # Scraping Prometheus. - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # Scraping java metrics. - job_name: "java_app" static_configs: - targets: ["localhost:9999"] labels: job: "java_app" env: "workshop-lab8" Finally, you verify that you are collecting your Java-instrumented application data by checking through the Prometheus query console: Miss Previous Labs? This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Perses to pick up where you left off. Coming up Next I'll be taking you through the final lab in this workshop, where you'll learn all about metrics monitoring at scale and understanding some of the pain points with Prometheus that arise as you start to scale out your observability architecture and start caring more about reliability. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
Intro to Istio Observability Using Prometheus Istio service mesh abstracts the network from the application layers using sidecar proxies. You can apply security and advanced networking policies to all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput. In this article, we will discuss how SREs can benefit from integrating three open-source tools: Istio, Prometheus, and Grafana. While Istio is the most famous service mesh, Prometheus is the most widely used monitoring software, and Grafana is the most famous visualization tool. Note: The steps are tested for Istio 1.17.X Watch the Video of Istio, Prometheus, and Grafana Configuration Watch the video if you want to follow the steps from the video: Step 1: Go to Istio Add-Ons and Apply Prometheus and Grafana YAML File First, go to the add-on folder in the Istio directory using the command. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus and Grafana by using the following commands: Shell kubectl apply -f prometheus.yaml Shell kubectl apply -f grafana.yaml Note these add-on YAMLs are applied to the istio-system namespace by default. Step 2: Deploy New Service and Port-Forward Istio Ingress Gateway To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an object of the Istio ingress gateway to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port, 7777. You should see the below screen at localhost:7777 Step 3: Open Prometheus and Grafana Dashboard You can open the Prometheus and Grafana dashboards by using the following commands: Shell istioctl dashboard prometheus Shell istioctl dashboard grafana Both Grafana and Prometheus will open on localhost. Step 4: Make HTTP Requests From Postman We will see how the httpbin service is consuming CPU or memory when there is a traffic load. We will create a few GET and POST requests to localhost:7777 from the Postman app. Once you send GET or POST requests to the httpbin service multiple times, there will be utilization of resources, and we can see them in Grafana. But first, we need to configure the metrics for the httpbin service in Prometheus and Grafana. Step 5: Configuring Metrics in Prometheus One can select a range of metrics related to any Kubernetes resources such as the API server, applications, workloads, Envoy, etc. We will select the container_memory_working_set_bytes metric for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term: container_memory_working_set_bytes{namespace="istio-telemetry"} (istio-telemetry is the name of our Istio-enabled namespace, where the httpbin service is deployed) Note that simply running this gives us the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage grouped by pod.
The following query will help us get the desired result: sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article was to showcase the ability of Istio to emit and send metrics to Prometheus for collection. Step 6: Configuring Istio Metrics Graphs in Grafana Now, you can simply take the query sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) from Prometheus and plot a graph over time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles for sharing with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana. You can make all the customizations based on your enterprise needs. I have done a few experiments in the video; feel free to check it out. Conclusion Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have just offered a small use case of metrics scraping and visualization using Istio, Prometheus, and Grafana. You can also perform logging and tracing of real-time traffic using Istio; we will cover those topics in our subsequent blogs.
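If you would rather pull the same number programmatically than read it off a dashboard, Prometheus also exposes an HTTP API. Below is a small, hedged Java sketch that sends the article's PromQL query to the standard /api/v1/query endpoint and prints the raw JSON response; it assumes Prometheus is reachable on localhost:9090 (for example, while istioctl dashboard prometheus is running) and does no error handling or JSON parsing.
Java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class MemoryByPodQuery {
    public static void main(String[] args) throws Exception {
        String promql = "sum(container_memory_working_set_bytes{namespace=\"istio-telemetry\"}) by (pod)";
        String url = "http://localhost:9090/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder().uri(new URI(url)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body contains one result entry per pod; here we just print it.
        System.out.println(response.body());
    }
}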
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep