Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
Unlocking Performance: Exploring Java 21 Virtual Threads [Video]
Performance Optimization Strategies in Highly Scalable Systems
Apache JMeter is an open-source, Java-based tool used for load and performance testing of various services, particularly web applications. It supports multiple protocols like HTTP, HTTPS, FTP, and more. JMeter can simulate heavy loads on a server to analyze performance under different conditions. It offers both GUI and non-GUI modes for test configuration and can display test results in various formats. JMeter also supports distributed testing, enabling it to handle multiple test threads simultaneously. Its functionality can be extended through plugins, making it a versatile and widely used tool in performance testing.

JMeter stands out from other testing tools due to its concurrency model, which governs how it executes requests in parallel. The concurrency model of JMeter relies on thread pools, widely recognized as the standard method for parallel processing in Java and several other programming languages. However, this advantage comes with a significant trade-off: the resource-intensive nature of JMeter's concurrency model. In JMeter, each thread corresponds to a Java thread, which in turn uses an operating system (OS) thread for its execution. OS threads, although effective for concurrent work, are relatively heavyweight in terms of memory consumption and the CPU cost of context switching. This poses a noteworthy challenge to JMeter's performance. Moreover, certain operating systems enforce strict limits on the total number of threads that can be created, imposing implicit restrictions on JMeter's capabilities.

Unleashing the True Power: Java 21 to the Rescue

Project Loom, which has gained significant attention within the Java community over the past several years, has finally been incorporated into Java 21 with JEP 444, after several early preview releases. Java's virtual threads, also known as lightweight or user-mode threads, were developed under Project Loom and are now officially part of Java 21. While the details of this feature are interesting, they're not the main focus of our discussion today, so we won't delve deeper into them here.

JMeter

The code review reveals a straightforward process for creating a new thread group by copying the ThreadGroup class. In this instance, we have simply duplicated the logic from the JMeter ThreadGroup class that we wish to modify. A key method to note is startNewThread, which is responsible for creating the threads. We have altered one line in this method. The original line of code:

Thread newThread = new Thread(jmThread, jmThread.getThreadName());

has been replaced with the following:

Thread newThread = Thread.ofVirtual()
    .name(jmThread.getThreadName())
    .unstarted(jmThread);

In this modification, instead of creating a traditional thread, we're creating a virtual thread, as introduced in Java's Project Loom. This change allows for more lightweight, efficient thread handling. There are also other modifications, such as removing the synchronized block from the addNewThread method and updating similar thread-creation logic in a few other places.

Setup

1. I have quickly set up nginx so that it always returns a 200 OK response:

# nginx.conf
location /test {
    return 200 'OK';
}

2. Add the Virtual Thread Group element.
3. Configure the threads: you will see the Virtual Thread Properties header for the new thread group.
4. Get set, go....!!
And the final result: my primary focus was not on the server's responsiveness or its ability to scale up to 50k users (which, with some tuning, could easily be achieved). Instead, I was more interested in observing how JMeter generates and handles the load, irrespective of whether the server responses were successful or failed.

Summary

JMeter has traditionally been resource-intensive, largely because its I/O-bound workload of network requests is carried on heavyweight OS threads. However, with the introduction of virtual threads, its performance has improved significantly. The utilization of virtual threads has enabled JMeter to operate smoothly and efficiently without any glitches, even when handling heavy loads.

Source Code

Anyone interested in trying this on their own can see the following GitHub project for more details.
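As a rough, standalone illustration (not JMeter code) of why virtual threads suit this kind of I/O-bound load generation, the sketch below starts one virtual thread per simulated user and fires concurrent HTTP requests against the /test endpoint from the nginx setup above; the URL and user count are placeholders.

Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadLoad {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost/test")).build();

        // One virtual thread per simulated user; tens of thousands are cheap compared to OS threads.
        try (ExecutorService users = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                users.submit(() -> {
                    try {
                        int status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
                        // A real load test would record the status code and response time here.
                    } catch (Exception e) {
                        // A real load test would record the failure.
                    }
                });
            }
        } // close() waits for submitted tasks (ExecutorService is AutoCloseable since Java 19).
    }
}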
Your team celebrates a success story: a trace identified a pesky latency issue in your application's authentication service. A fix was swiftly implemented, and everyone celebrated a quick win at the next team meeting. But the celebration is short-lived. Just days later, user complaints surge about a related payment gateway timeout. It turns out that the fix did improve performance at one point, but it created a situation in which key information was never cached. Other parts of the software react badly to the change, and the whole thing has to be reverted. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built this way. Relying solely on a single trace gave us a partial view of a broader problem. This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems.

The Limiting Factor

The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is a synchronized block in another related method, causing thread contention and slowing down the entire system.

Temporal blindness is the second problem. Think of a Java Garbage Collection (GC) log. A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system.

The last problem, related to the other two, is missing context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace.

Strength in Numbers

Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue.
Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound during high transaction loads. This helps us distinguish between correlation and causation. Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation. This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the Tong motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs.

Example

Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting concepts taken by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that value should probably still be added manually, since it can't understand the why, only the what. It seems it can detect the venerable Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors? How much does a performance issue affect the rest of the application? These become questions with easy answers when we see all the different aspects laid out together.

Magical APIs

The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA). The great Vlad Mihalcea has an excellent explanation. The TL;DR is rather simple: we write a simple database query using an ORM, but we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem in "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what goes on. Observability is one of the best ways to understand why these things fail. In the past, I used to reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle.

Final Word

Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight.
While these traces offer valuable insights, their true potential is only realized when viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience. In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
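To make the N+1 pattern mentioned above concrete, here is a hypothetical JPA-style sketch; the entities and query are illustrative, not taken from the Pet Clinic demo. Loading a list of orders and then touching a lazily loaded collection issues one additional query per order, exactly the kind of repeated behavior that only becomes obvious when traces are viewed as a group.

Java
import jakarta.persistence.*;
import java.util.List;

@Entity
class PurchaseOrder {
    @Id Long id;
    @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
    List<OrderItem> items;
}

@Entity
class OrderItem {
    @Id Long id;
    @ManyToOne PurchaseOrder order;
    double price;
}

public class NPlusOneExample {
    static double totalValue(EntityManager em) {
        // 1 query: load all orders
        List<PurchaseOrder> orders =
                em.createQuery("select o from PurchaseOrder o", PurchaseOrder.class).getResultList();
        double total = 0;
        for (PurchaseOrder order : orders) {
            // N extra queries: each access to the lazy collection triggers another select
            for (OrderItem item : order.items) {
                total += item.price;
            }
        }
        return total;
        // A fetch join ("select o from PurchaseOrder o join fetch o.items") loads it all in one query.
    }
}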
Garbage Collection is a facet often disregarded and underestimated, yet beneath its surface lies the potential for profound impacts on your organization that reach far beyond the realm of application performance. In this post, we embark on a journey to unravel the pivotal role of Garbage Collection analysis and explore seven critical points that underscore its significance.

Improve Application Response Time Without Code Changes

Automatic garbage collection (GC) is a critical memory management process, but it introduces pauses in applications. These pauses occur when GC scans and reclaims memory occupied by objects that are no longer in use. Depending on various factors, these pauses can range from milliseconds to several seconds or even minutes. During these pauses, no application transactions are processed, leaving customer requests stranded. However, there's a solution. By fine-tuning the GC behavior, you can significantly reduce GC pause times. This reduction ultimately leads to a decrease in the application's overall response time, delivering a smoother user experience. A real-world case study from one of the world's largest automobile manufacturers demonstrates the impact of GC tuning: they were able to reduce their response time by 50% just by tuning their GC settings, without a single line of code change. Read the full case study here.

Efficient Cloud Cost Reduction

In the world of cloud computing, enterprises often unknowingly spend millions of dollars on inefficient garbage collection practices. A high GC throughput percentage, such as 98%, may initially seem impressive, like achieving an 'A grade' score. However, the remaining 2% carries substantial financial consequences. Imagine a mid-sized company operating 1,000 AWS t2.2xlarge 32G RHEL on-demand EC2 instances in the US West (North California) region. The cost of each EC2 instance is $0.5716 per hour. Let's assume that their application's GC throughput is 98%. Now, let's break down the financial impact of this assumption: With a 98% GC throughput, each instance loses approximately 28.8 minutes daily to garbage collection. In a day, there are 1,440 minutes (24 hours x 60 minutes), and 2% of 1,440 minutes equals 28.8 minutes. Over the course of a year, this adds up to 175.2 hours per instance (28.8 minutes x 365 days = 10,512 minutes, or 175.2 hours). For a fleet of 1,000 AWS EC2 instances, this translates to approximately $100.14K in wasted resources annually (1,000 EC2 instances x 175.2 hours x $0.5716 per hour) due to garbage collection delays. This calculation vividly illustrates how seemingly insignificant pauses in GC activity can amass substantial costs for enterprises. It emphasizes the critical importance of optimizing garbage collection processes to achieve significant cost savings.

Trimming Software Licensing Cost

In today's landscape, many of our applications run on commercial vendor software solutions like Dell Boomi, ServiceNow, Workday, and others. While these vendor software solutions are indispensable, their licensing costs can be exorbitant. What's often overlooked is that the efficiency of our code and configurations within these vendor software platforms directly impacts software licensing costs. This is where proper Garbage Collection (GC) analysis comes into play. It provides insights into whether there is an overallocation or underutilization of resources within these vendor software environments.
Surprisingly, overallocation often remains hidden until we scrutinize GC behavior. By leveraging GC analysis, enterprises gain the visibility needed to identify overallocation and reconfigure resources accordingly. This optimization not only enhances application performance but also results in significant cost savings by reducing the licensing footprint of these vendor software solutions. The impact on the bottom line can be substantial.

Forecast Memory Problems in Production

Garbage collection logs hold the key to vital predictive micrometrics that can transform how you manage your application's availability and performance. Among these micrometrics, one stands out: GC throughput. But what is GC throughput? Imagine your application's GC throughput is at 98%: it means that your application spends 98% of its time efficiently processing customer activity, with the remaining 2% allocated to GC activity. The significance becomes apparent when your application faces a memory problem. Several minutes before a memory issue becomes noticeable, GC throughput will begin to degrade. This degradation serves as an early warning, enabling you to take preventive action before memory problems impact your production environment. Troubleshooting tools like yCrash closely monitor GC throughput to predict and forecast memory problems, ensuring your application remains robust and reliable.

Unearthing Memory Issues

One of the primary reasons for production outages is encountering an OutOfMemoryError. In fact, there are nine different types of OutOfMemoryErrors, including:

java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: PermGen space
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
java.lang.OutOfMemoryError: Unable to create new native thread
java.lang.OutOfMemoryError: Metaspace
java.lang.OutOfMemoryError: Direct buffer memory
java.lang.OutOfMemoryError: Compressed class space

GC analysis provides valuable insights into the root cause of these errors and helps in effectively triaging the problem. By understanding the specific OutOfMemoryError type and its associated details, developers can take targeted actions to debug and resolve memory-related issues, minimizing the risk of production outages.

Spotting Performance Bottlenecks During Development

In the modern software development landscape, the "Shift Left" approach has become a key initiative for many organizations. Its goal is to identify and address production-related issues during the development phase itself. Garbage Collection (GC) analysis enables this proactive approach by helping to isolate performance bottlenecks early in the development cycle. One of the vital metrics obtained through GC analysis is the object creation rate. This metric signifies the average rate at which objects are created by your application. Here's why it matters: if your application, which previously created objects at a rate of 100MB/sec, suddenly starts creating them at 150MB/sec without a corresponding increase in traffic volume, it's a red flag indicating potential problems within the application. This increased object creation rate can lead to heightened GC activity, higher CPU consumption, and degraded response times. Moreover, this metric can be integrated into your Continuous Integration/Continuous Deployment (CI/CD) pipeline to gauge the quality of code commits.
For instance, if your previous code commit resulted in an object creation rate of 50MB/sec and a subsequent commit increases it to 75MB/sec for the same traffic volume, it signifies an inefficient code change. To streamline this process, you can leverage the GCeasy REST API. This integration allows you to capture critical data and insights directly within your CI/CD pipeline, ensuring that performance issues are identified and addressed early in the development lifecycle. Efficient Capacity Planning Effective capacity planning is vital for ensuring that your application can meet its performance and resource requirements. It involves understanding your application’s demands for memory, CPU, network resources, and storage. In this context, analyzing garbage collection behavior emerges as a powerful tool for capacity planning, particularly when it comes to assessing memory requirements. When you delve into garbage collection behavior analysis, you gain insights into crucial micro-metrics such as the average object creation rate and average object reclamation rate. These micro-metrics provide a detailed view of how your application utilizes memory resources. By leveraging this data, you can perform precise and effective capacity planning for your application. This approach allows you to allocate resources optimally, prevent resource shortages or overprovisioning, and ensure that your application runs smoothly and efficiently. Garbage Collection analysis, with its focus on memory usage patterns, becomes an integral part of the capacity planning process, enabling you to align your infrastructure resources with your application’s actual needs. How To Do Garbage Collection Analysis While there are monitoring tools and JMX MBeans that offer real-time Garbage Collection metrics, they often lack the depth needed for thorough analysis. To gain a complete understanding of Garbage Collection behavior, turn to GC logs. Once you have GC logs, select a free GC log analysis tool that suits your needs. With your chosen GC log analysis tool, examine Garbage Collection behavior in the logs, looking for patterns and performance issues. Pay attention to key metrics, and based on your analysis, optimize your application to reduce GC pauses and enhance performance. Adjust GC settings, allocate memory efficiently, and monitor the impact of your changes over time. Conclusion In the fast-paced world of software development and application performance optimization, Garbage Collection (GC) analysis is often the unsung hero. While it may be considered an underdog, it’s high time for this perception to change. GC analysis wields the power to enhance performance, reduce costs, and empower proactive decision-making. From improving application response times to early issue detection and precise capacity planning, GC analysis stands as a pivotal ally in optimizing applications and resources.
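As a back-of-the-envelope check on the GC throughput and cloud-cost arithmetic discussed earlier, here is a small sketch that reproduces the example numbers (98% throughput, 1,000 instances, $0.5716 per instance-hour); the inputs are the illustrative values from this article, not universal constants. The GC logs needed for this kind of analysis can typically be enabled with -Xlog:gc*:file=gc.log on Java 9+ (or -XX:+PrintGCDetails -Xloggc:gc.log on Java 8).

Java
public class GcWasteEstimate {
    public static void main(String[] args) {
        double gcThroughput = 0.98;   // 98% of each instance's time goes to application work
        int instances = 1_000;        // fleet size from the example
        double hourlyRate = 0.5716;   // USD per instance-hour (t2.2xlarge, US West example)

        double lostMinutesPerDay = (1 - gcThroughput) * 24 * 60;  // 28.8 minutes per instance
        double lostHoursPerYear  = lostMinutesPerDay * 365 / 60;  // 175.2 hours per instance
        double wastedDollars     = lostHoursPerYear * hourlyRate * instances;

        System.out.printf("Lost per instance: %.1f min/day, %.1f h/year%n",
                lostMinutesPerDay, lostHoursPerYear);
        System.out.printf("Fleet-wide waste: $%,.2f per year%n", wastedDollars);  // ~ $100,144
    }
}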
If your environment is like many others, it can often seem like your systems produce logs filled with a bunch of excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs, and they don't typically have any specific purpose or focus holding them together, you may dread sifting through them. If you don't have the right tools, it can feel like you're stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good find from the heaps of galactic junk. Though it can feel like more trouble than it's worth, sorting through logs is crucial. Logs hold many valuable insights into what's happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we're going to take a look at how log analytics can help you make sense of your log data without much effort. We'll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let's blast off and turn that cosmic trash into treasure!

The Truth Is Out There: Getting Value Just From the Things You're Already Logging

One massive benefit offered by a log analytics platform to any system engineer is the ability to utilize a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator can make it much easier to see simultaneous events across your infrastructure. You'll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data. Of course, you could make your own aggregation interface from scratch, but often, log aggregation tools provide a number of extra features that are worth the additional investment. Those extra features include capabilities such as powerful search and fast analytics.

Searching Through the Void: Using Search Query Language To Find Things

You've probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting logs all in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are being executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let's take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find in your log data.
For example, if you wanted to search across your logs, and you know your log lines follow the format "From: Jane To: John," you can parse out the "from" and "to" with the following query:

* | parse "From: * To: *" as (from, to)

This would store "Jane" in the "from" field and "John" in the "to" field. Another valuable language feature you could tap into would be keyword expressions. You could use this query to search across your logs for any instances where a command with root privileges failed:

(su OR sudo) AND (fail* OR error)

Here is a listing of General Search Examples that are drawn from parsing a single Apache log message.

Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics

One other aspect of searching is that it's typically looking into the past. Sometimes, you need to be looking at things as they happen. Let's take a look at Live Tail and LogReduce, two tools that improve on simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic's offering, so we'll dive into them.

Live Tail

At its simplest, Live Tail lets you see a live feed of your log messages. It's like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you're looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in.

LogReduce

LogReduce gives you more insight into, or a better understanding of, your search query's aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of "Signatures" that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the relevance of the pattern when compared to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all pretty advanced and can be hard to understand without a demo, so you can dive deeper by watching this video.

Integrated Log Aggregation

Often, you'll need information from systems you aren't running directly mixed in with your other logs. That's why it's important to make sure you can integrate your log aggregator with other systems. Many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren't only available on the ELK stack. Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company's 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don't need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of the logs in your aggregator together.

Conclusion

This has been a brief tour through some of the features available with log aggregation.
There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need to have a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can let you unlock infinite value from your already existing log data. That way, you can become a better cosmic junk collector!
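If you want to prototype the same kind of field extraction locally before reaching for a query language, a few lines of regular-expression code will do the job against the example log format shown earlier. This is just a rough local sketch; it is not equivalent to running the parse operator across an aggregated log store.

Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtraction {
    public static void main(String[] args) {
        // Rough local equivalent of:  * | parse "From: * To: *" as (from, to)
        Pattern pattern = Pattern.compile("From: (\\S+) To: (\\S+)");
        String logLine = "From: Jane To: John";

        Matcher m = pattern.matcher(logLine);
        if (m.find()) {
            System.out.println("from=" + m.group(1) + ", to=" + m.group(2));  // from=Jane, to=John
        }
    }
}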
VAR-As-A-Service is an MLOps approach for the unification and reuse of statistical models and machine learning model deployment pipelines. This is the second in a series of articles built on top of that project, presenting experiments with various statistical and machine learning models, data pipelines implemented using existing DAG tools, and storage services, both cloud-based and alternative on-premises solutions. This article focuses on model file storage, using an approach that is also applicable to and used for machine learning models. The implemented storage is based on MinIO as an AWS S3-compatible object storage service. Furthermore, the article gives an overview of alternative storage solutions and outlines the benefits of object-based storage.

The first article of the series (Time Series Analysis: VARMAX-As-A-Service) compares statistical and machine learning models as being both mathematical models and provides an end-to-end implementation of a VARMAX-based statistical model for macroeconomic forecasting using a Python library called statsmodels. The model is deployed as a REST service using Python Flask and an Apache web server, packaged in a Docker container. The high-level architecture of the application is depicted in the following picture:

The model is serialized as a pickle file and deployed on the web server as part of the REST service package. However, in real projects, models are versioned, accompanied by metadata information, and secured, and the training experiments need to be logged and kept reproducible. Furthermore, from an architectural perspective, storing the model in the file system next to the application contradicts the single responsibility principle. A good example is a microservice-based architecture. Scaling the model service horizontally means that each and every microservice instance will have its own version of the physical pickle file replicated over all the service instances. That also means that supporting multiple versions of the models will require a new release and redeployment of the REST service and its infrastructure. The goal of this article is to decouple models from the web service infrastructure and enable the reuse of the web service logic with different versions of models.

Before diving into the implementation, let's say a few words about statistical models and the VAR model used in this project. Statistical models are mathematical models, and so are machine learning models. More about the difference between the two can be found in the first article of the series. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. Vector autoregression (VAR) is a statistical model used to capture the relationship between multiple quantities as they change over time. VAR models generalize the single-variable autoregressive model (AR) by allowing for multivariate time series. In the presented project, the model is trained to forecast two variables. VAR models are often used in economics and the natural sciences. In general, the model is represented by a system of equations, which in the project are hidden behind the Python library statsmodels. The architecture of the VAR model service application is depicted in the following picture:

The VAR runtime component represents the actual model execution based on parameters sent by the user. It connects to a MinIO service via a REST interface, loads the model, and runs the prediction.
Compared to the solution in the first article, where the VARMAX model is loaded and deserialized at application startup, the VAR model is read from the MinIO server each time a prediction is triggered. This comes at the cost of additional loading and deserialization time, but also with the benefit of always having the latest version of the deployed model at every single run. Furthermore, it enables dynamic versioning of models, making them automatically accessible to external systems and end-users, as will be shown later in the article. Note that due to that loading overhead, the performance of the selected storage service is of great importance.

But why MinIO and object-based storage in general? MinIO is a high-performance object storage solution with native support for Kubernetes deployments that provides an Amazon Web Services S3-compatible API and supports all core S3 features. In the presented project, MinIO runs in Standalone Mode, consisting of a single MinIO server and a single drive or storage volume on Linux, using Docker Compose. For extended development or production environments, there is the option of a distributed mode, described in the article Deploy MinIO in Distributed Mode.

Let's have a quick look at some storage alternatives; a comprehensive description can be found here and here:

Local/Distributed file storage: Local file storage is the solution implemented in the first article, as it is the simplest option. Computation and storage are on the same system. It is acceptable during the PoC phase or for very simple models supporting a single version of the model. Local file systems have limited storage capacity and are unsuitable for larger datasets in case we want to store additional metadata like the training data set used. Since there is no replication or autoscaling, a local file system cannot operate in an available, reliable, and scalable fashion. Each service deployed for horizontal scaling is deployed with its own copy of the model. Furthermore, local storage is only as secure as the host system. Alternatives to local file storage are NAS (network-attached storage), SAN (storage-area network), and distributed file systems (Hadoop Distributed File System (HDFS), Google File System (GFS), Amazon Elastic File System (EFS), and Azure Files). Compared to the local file system, those solutions are characterized by availability, scalability, and resilience but come at the cost of increased complexity.

Relational databases: Due to the binary serialization of models, relational databases provide the option of blob or binary storage of models in table columns. Software developers and many data scientists are familiar with relational databases, which makes that solution straightforward. Model versions can be stored as separate table rows with additional metadata, which is easy to read from the database, too. A disadvantage is that the database will require more storage space, and this will impact backups. Having large amounts of binary data in a database can also have an impact on performance. In addition, relational databases impose some constraints on the data structures, which might complicate the storing of heterogeneous data like CSV files, images, and JSON files as model metadata.

Object storage: Object storage has been around for quite some time but was revolutionized when Amazon made it the first AWS service in 2006 with Simple Storage Service (S3). Modern object storage is native to the cloud, and other clouds soon brought their offerings to market, too.
Microsoft offers Azure Blob Storage, and Google has its Google Cloud Storage service. The S3 API is the de-facto standard for developers to interact with storage in the cloud, and there are multiple companies that offer S3-compatible storage for the public cloud, private cloud, and private on-premises solutions. Regardless of where an object store is located, it is accessed via a RESTful interface. While object storage eliminates the need for directories, folders, and other complex hierarchical organization, it's not a good solution for dynamic data that is constantly changing, as you'll need to rewrite the entire object to modify it, but it is a good choice for storing serialized models and the model's metadata.

A summary of the main benefits of object storage:

Massive scalability: Object storage size is essentially limitless, so data can scale to exabytes by simply adding new devices. Object storage solutions also perform best when running as a distributed cluster.

Reduced complexity: Data is stored in a flat structure. The lack of complex trees or partitions (no folders or directories) makes retrieving files easier, as one doesn't need to know the exact location.

Searchability: Metadata is part of the objects, making it easy to search through and navigate without the need for a separate application. One can tag objects with attributes and information, such as consumption, cost, and policies for automated deletion, retention, and tiering. Due to the flat address space of the underlying storage (every object in only one bucket and no buckets within buckets), object stores can find an object among potentially billions of objects quickly.

Resiliency: Object storage can automatically replicate data and store it across multiple devices and geographical locations. This can help protect against outages, safeguard against data loss, and help support disaster recovery strategies.

Simplicity: Using a REST API to store and retrieve models implies almost no learning curve and makes integration into microservice-based architectures a natural choice.

It is time to look at the implementation of the VAR model as a service and the integration with MinIO. The deployment of the presented solution is simplified by using Docker and Docker Compose. The organization of the whole project looks as follows:

As in the first article, the preparation of the model is comprised of a few steps that are written in a Python script called var_model.py, located in a dedicated GitHub repository:

Load data
Divide data into train and test data sets
Prepare endogenous variables
Find the optimal model parameter p (the first p lags of each variable are used as regression predictors)
Instantiate the model with the optimal parameters identified
Serialize the instantiated model to a pickle file
Store the pickle file as a versioned object in a MinIO bucket

Those steps can also be implemented as tasks in a workflow engine (e.g., Apache Airflow) triggered by the need to train a new model version with more recent data. DAGs and their applications in MLOps will be the focus of another article.

The last step implemented in var_model.py is storing the model, serialized as a pickle file, in an S3 bucket. Due to the flat structure of the object storage, the format selected is:

<bucket name>/<file_name>

However, for file names, it is allowed to use a forward slash to mimic a hierarchical structure, keeping the advantage of a fast linear search.
The convention for storing VAR models is as follows:

models/var/0_0_1/model.pkl

Here the bucket name is models and the file name is var/0_0_1/model.pkl, and in the MinIO UI, it looks as follows:

This is a very convenient way of structuring various types of models and model versions while still having the performance and simplicity of flat file storage. Note that the model versioning is implemented as part of the model name. MinIO provides versioning of files, too, but the approach selected here has some benefits:

Support for snapshot versions and overriding
Usage of semantic versioning (dots replaced by '_' due to restrictions)
Greater control of the versioning strategy
Decoupling from the underlying storage mechanism in terms of specific versioning features

Once the model is deployed, it is time to expose it as a REST service using Flask and deploy it using docker-compose, running MinIO and an Apache web server. The Docker image, as well as the model code, can be found in a dedicated GitHub repository. And finally, the steps needed to run the application are:

Deploy the application: docker-compose up -d
Execute the model preparation algorithm: python var_model.py (requires a running MinIO service)
Check if the model has been deployed: http://127.0.0.1:9101/browser
Test the model: http://127.0.0.1:80/apidocs

After deploying the project, the Swagger API is accessible via <host>:<port>/apidocs (e.g., 127.0.0.1:80/apidocs). There is one endpoint for the VAR model, depicted next to the other two exposing a VARMAX model:

Internally, the service uses the deserialized model pickle file loaded from a MinIO service:

Requests are sent to the initialized model as follows:

The presented project is a simplified VAR model workflow that can be extended step by step with additional functionalities like:

Explore standard serialization formats and replace the pickle with an alternative solution
Integrate time series data visualization tools like Kibana or Apache Superset
Store time series data in a time series database like Prometheus, TimescaleDB, InfluxDB, or an object storage such as S3
Extend the pipeline with data loading and data preprocessing steps
Incorporate metric reports as part of the pipelines
Implement pipelines using specific tools like Apache Airflow or AWS Step Functions, or more standard tools like GitLab or GitHub
Compare statistical models' performance and accuracy with machine learning models
Implement end-to-end cloud-integrated solutions, including Infrastructure-as-Code
Expose other statistical and ML models as services
Implement a Model Storage API that abstracts away the actual storage mechanism and the model's versioning, and stores model metadata and training data

These future improvements will be the focus of upcoming articles and projects. The goal of this article is to integrate an S3-compatible storage API and enable the storage of versioned models. That functionality will be extracted into a separate library soon. The presented end-to-end infrastructural solution can be deployed in production and improved as part of a CI/CD process over time, also using the distributed deployment options of MinIO or replacing it with AWS S3.
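The project itself stores the pickle from Python, but because the storage is S3-compatible, any client can follow the same versioned-key convention. As a rough illustration, here is a sketch using the MinIO Java SDK; the endpoint, credentials, and local file name are placeholders rather than values from the project.

Java
import io.minio.GetObjectArgs;
import io.minio.MinioClient;
import io.minio.UploadObjectArgs;
import java.io.InputStream;

public class ModelStore {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials for a local MinIO instance.
        MinioClient minio = MinioClient.builder()
                .endpoint("http://127.0.0.1:9000")
                .credentials("minioadmin", "minioadmin")
                .build();

        // Store a serialized model under the versioned key convention: models/var/0_0_1/model.pkl
        minio.uploadObject(UploadObjectArgs.builder()
                .bucket("models")
                .object("var/0_0_1/model.pkl")
                .filename("model.pkl")
                .build());

        // Load the same version back, e.g., at prediction time.
        try (InputStream model = minio.getObject(GetObjectArgs.builder()
                .bucket("models")
                .object("var/0_0_1/model.pkl")
                .build())) {
            // Deserialize the model and run the prediction here.
        }
    }
}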
Embedded systems existed long before Linux, but the marriage of these two technologies has led to an unprecedented era of device innovation. Today, it is not uncommon to find Linux at the heart of televisions, cars, routers, smart devices, and countless other electronics. But why is Linux, free and open-source software, becoming so pervasive in the embedded world? Let's delve into Embedded Linux, its advantages, key features, and its significance in the modern tech landscape.

What Is Embedded Linux?

Embedded Linux refers to the use of the Linux kernel, usually tailored for specific applications, in embedded devices. Unlike desktop or server distributions, which might come with a comprehensive suite of software and a graphical user interface, Embedded Linux systems are stripped down, containing only the components necessary to run a particular device. This results in a leaner, faster, and more efficient operating system.

Why Linux for Embedded Systems?

There are several compelling reasons for using Linux in embedded systems:

Open-source nature: As a free and open-source platform, Linux gives developers the flexibility to customize and adapt the software to meet their specific requirements. This stands in stark contrast to proprietary systems, where such freedom is often restricted.
Cost-effective: There are no licensing fees associated with Linux. This makes it a cost-effective choice for manufacturers and developers.
Vibrant community: Linux benefits from a massive global community of developers and enthusiasts who continually contribute to its development, offering patches, updates, and new features.
Scalability: Whether you're working on a tiny IoT sensor or a powerful industrial machine, Linux can be scaled to fit the need, making it incredibly versatile for a wide range of applications.
Rich features and protocol support: Embedded Linux systems can leverage a vast array of existing protocols and drivers, easing the process of integrating with other systems and networks.

Challenges of Using Embedded Linux

While Linux provides numerous advantages, it also comes with its own set of challenges:

Footprint: Even though Linux can be stripped down, it may still be larger than some real-time operating systems (RTOS) designed explicitly for embedded use. This can be a concern for devices with limited storage.
Real-time constraints: Standard Linux is not a real-time OS, meaning it cannot guarantee task execution within strict timing constraints. However, there are real-time patches and distributions like RTLinux or PREEMPT_RT that address this limitation.
Learning curve: For developers new to Linux or those transitioning from traditional embedded environments, there can be a steep learning curve.

Key Features of Embedded Linux

Modularity: Embedded Linux can be tailored to include only the modules and drivers that a specific device needs, ensuring optimized performance and reduced resource consumption.
Multitasking: Linux is inherently a multitasking OS, enabling embedded devices to run multiple applications or processes simultaneously.
Security: Linux, with its user permission system and community-driven security patches, offers a robust security framework. This is crucial, especially for connected devices that are exposed to network threats.
Cross-platform: Linux can run on a myriad of architectures, including ARM, MIPS, PowerPC, and x86. This makes it a preferred choice for various hardware platforms.
Embedded Linux in the Modern World The Internet of Things (IoT) has provided a massive boost to the adoption of Embedded Linux. With billions of connected devices coming online, there's a pressing need for a stable, secure, and flexible OS, and Linux fits the bill perfectly. Automobiles, once mechanical masterpieces, are now evolving into "computers on wheels." Linux is playing an essential role in this transition, with the Automotive Grade Linux (AGL) platform being a testament to its growing influence. In the realm of smart TVs and entertainment systems, platforms like WebOS and Tizen, both based on Linux, are becoming household names. Conclusion Embedded Linux, with its adaptability, rich feature set, and vibrant community, has firmly established itself as a cornerstone of modern electronic devices. As the world becomes increasingly connected and our reliance on smart devices grows, the significance of Embedded Linux is poised to grow exponentially. It's not just an OS; it's the backbone of the modern embedded era.
In enterprises, SREs, DevOps engineers, and cloud architects often discuss which platform to choose for observability, for faster troubleshooting of issues and for understanding the performance of their production systems. There are certain questions they need to answer to get maximum value for their team, such as:

Will an observability tool support all kinds of workloads and heterogeneous systems?
Will the tool support all kinds of data aggregation, such as logs, metrics, traces, topology, etc.?
Will the investment in the (ongoing or new) observability tool be justified?

In this article, we will provide the best way to get started with unified observability of your entire infrastructure using open-source Skywalking and the Istio service mesh.

Istio Service Mesh of a Multi-Cloud Application

Let us take a multi-cloud example where multiple services are hosted on on-prem or managed Kubernetes clusters. The first step toward unified observability is to form a service mesh using Istio. The idea is that all the services or workloads in Kubernetes clusters (or VMs) should be accompanied by an Envoy proxy to abstract the security and networking out of the business logic. As you can see in the below image, a service mesh is formed, and the network communication between edge and workloads, among workloads, and between clusters is controlled by the Istio control plane. In this case, the Istio service mesh emits logs, metrics, and traces for each Envoy proxy, which helps achieve unified observability. We need a visualization tool like Skywalking to collect the data and populate it for granular observability.

Why Skywalking for Observability

SREs from large companies such as Alibaba, Lenovo, ABInBev, and Baidu use Apache Skywalking, and the common reasons are:

Skywalking aggregates logs, metrics, traces, and topology.
It natively supports popular service mesh software like Istio. While other tools may not support getting data from Envoy sidecars, Skywalking supports sidecar integration.
It supports OpenTelemetry (OTel) standards for observability. These days, OTel standards and instrumentation are popular for MTL (metrics, traces, logs).
Skywalking supports observability-data collection from almost all the elements of the full stack: database, OS, network, storage, and other infrastructure.
It is open-source and free (with an affordable enterprise version).

Now, let us see how to integrate Istio and Apache Skywalking in your enterprise.

Steps To Integrate Istio and Apache Skywalking

We have created a demo to establish the connection between the Istio data plane and Skywalking, where Skywalking collects data from Envoy sidecars and populates its observability dashboards.

Note: By default, Skywalking comes with predefined dashboards for Apache APISIX and AWS Gateways. Since we are using the Istio Gateway, it will not get a dedicated dashboard out-of-the-box, but we'll get metrics for it in other locations.

If you want to watch the video, check out my latest Istio-Skywalking configuration video. You can refer to the GitHub link here.

Step 1: Add Kube-State-Metrics to Collect Metrics From the Kubernetes API Server

We have installed the kube-state-metrics service to listen to the Kubernetes API server and send those metrics to Apache Skywalking. First, add the Prometheus community repo:

Shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Then run helm repo update to fetch the latest charts.
And now you can install kube-state-metrics:

Shell
helm install kube-state-metrics prometheus-community/kube-state-metrics

Step 2: Install Skywalking Using HELM Charts

We will install Skywalking version 9.2.0 for this observability demo. You can run the following command to install Skywalking into a namespace (my namespace is skywalking). You can refer to the values.yaml.

Shell
helm install skywalking oci://registry-1.docker.io/apache/skywalking-helm -f values.yaml -n skywalking

(Optional reading) In the Helm chart values.yaml, you will notice that:

We have set the oap (Observability Analysis Platform, i.e., the back-end) and ui configuration flags to true.
Similarly, for databases, we have enabled postgresql as true.
For tracking metrics from Envoy access logs, we have configured the following environment variables:
a. SW_ENVOY_METRIC: default
b. SW_ENVOY_METRIC_SERVICE: true
c. SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
d. SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
This is to select the logs and metrics from Envoy in the Istio configuration ('c' and 'd' are the rules for analyzing Envoy access logs).
We will enable the OpenTelemetry receiver and configure it to receive data in otlp format. We will also enable OTel rules according to the data we will send to Skywalking. In a few moments (in Step 3), we will configure the OTel collector to scrape istiod, k8s, kube-state-metrics, and the Skywalking OAP itself. So, we have enabled the appropriate rules:
a. SW_OTEL_RECEIVER: default
b. SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp"
c. SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: "istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap"
d. SW_TELEMETRY: prometheus
e. SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
f. SW_TELEMETRY_PROMETHEUS_PORT: 1234
g. SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: false
h. SW_TELEMETRY_PROMETHEUS_SSL_KEY_PATH: ""
i. SW_TELEMETRY_PROMETHEUS_SSL_CERT_CHAIN_PATH: ""
We have instructed Skywalking to collect data from the Istio control plane, the Kubernetes cluster, nodes, services, and also oap (the Observability Analytics Platform by Skywalking). (The configurations from 'd' to 'i' enable Skywalking OAP's self-observability, meaning it will expose Prometheus-compatible metrics at port 1234 with SSL disabled. Again, in Step 3, we will configure the OTel collector to scrape this endpoint.)
In the Helm chart, we have also enabled the creation of a service account for Skywalking OAP.

Step 3: Setting Up Istio + Skywalking Configuration

After that, we can install Istio using this IstioOperator configuration. In the IstioOperator configuration, we have set up the meshConfig so that every sidecar has the Envoy access log service enabled and the addresses of the access log service and metrics service pointed at Skywalking. Additionally, with the proxyStatsMatcher, we are configuring all metrics to be sent via the metrics service.

YAML
meshConfig:
  defaultConfig:
    envoyAccessLogService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    envoyMetricsService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    proxyStatsMatcher:
      inclusionRegexps:
        - .*
  enableEnvoyAccessLogService: true

Step 4: OpenTelemetry Collector

Once the Istio and Skywalking configuration is done, we need to feed metrics from applications, gateways, nodes, etc., to Skywalking. We have used the opentelemetry-collector.yaml to scrape the Prometheus-compatible endpoints. In the collector, we have specified that OpenTelemetry will scrape metrics from istiod, the Kubernetes cluster, kube-state-metrics, and Skywalking.
We have created a service account for OpenTelemetry. Using opentelemetry-serviceaccount.yaml, we have set up a service account and declared a ClusterRole and ClusterRoleBinding to define what actions the opentelemetry service account can take on various resources in our Kubernetes cluster. Once you deploy opentelemetry-collector.yaml and opentelemetry-serviceaccount.yaml, data will flow into Skywalking from Envoy, the Kubernetes cluster, kube-state-metrics, and Skywalking (oap) itself.

Step 5: Observability of Kubernetes Resources and Istio Resources in Skywalking

To check the Skywalking UI, port-forward the Skywalking UI service to a local port (say, 8080). Run the following command:

Shell
kubectl port-forward svc/skywalking-skywalking-helm-ui -n skywalking 8080:80

You can open the Skywalking UI at localhost:8080. (Note: For setting up load to the services and seeing the behavior and performance of the apps, the cluster, and the Envoy proxies, check out the full video.)

Once you are on the Skywalking UI (refer to the image below), you can select Service Mesh in the left-side menu and select Control Plane or Data Plane. Skywalking provides all the resource consumption and observability data of the Istio control plane and data plane, respectively. The Skywalking Istio data-plane view provides information about all the Envoy proxies attached to services. Skywalking provides metrics, logs, and traces for all the Envoy proxies. Refer to the below image, where all the observability details are displayed for just one service proxy. Skywalking also shows the resource consumption of Envoy proxies in various namespaces.

Similarly, Skywalking provides all the observability data of the Istio control plane. Note that in case you have multiple control planes in different namespaces (in multiple clusters), you just provide access to the Skywalking oap service. Skywalking provides Istio control-plane metrics such as the number of pilot pushes, ADS monitoring, etc.

Apart from the Istio service mesh, we also configured Skywalking to fetch information about the Kubernetes cluster. You can see in the below image that Skywalking provides all the information of a Kubernetes dashboard, such as the number of nodes, pods, K8s deployments, services, containers, etc. You also get the respective resource utilization metrics of each K8s resource in the same dashboard. Skywalking provides holistic information about a Kubernetes cluster. Similarly, you can drill further down into a service in the Kubernetes cluster and get granular information about its behavior and performance (refer to the below images).

For setting up load to the services and seeing the behavior and performance of the apps, the cluster, and the Envoy proxies, check out the full video.

Benefits of Istio-Skywalking Integration

There are several benefits of integrating Istio and Apache Skywalking for unified observability:

Ensure 100% visibility of the technology stack, including apps, sidecars, network, database, OS, etc.
Reduce the time to find the root cause (MTTR) of issues or anomalies in production by 90%, with faster troubleshooting.
Save approximately $2M of lifetime spend on closed-source solutions, complex pricing, and custom integrations.
Elasticsearch is an open-source search engine and analytics store used by a variety of applications, from search in e-commerce stores to internal log management tools using the ELK stack (short for “Elasticsearch, Logstash, Kibana”). As a distributed database, your data is partitioned into “shards” which are then allocated to one or more servers. Because of this sharding, a read or write request to an Elasticsearch cluster requires coordinating between multiple nodes, as there is no “global view” of your data on a single server. While this makes Elasticsearch highly scalable, it also makes it much more complex to set up and tune than other popular databases like MongoDB or PostgreSQL, which can run on a single server. When reliability issues come up, firefighting can be stressful if your Elasticsearch setup is buggy or unstable. Your incident could be impacting customers, which could negatively impact revenue and your business reputation. Fast remediation steps are important, yet spending a large amount of time researching solutions online during an incident or outage is not a luxury most engineers have. This guide is intended to be a cheat sheet for common issues that engineers running Elasticsearch may encounter and what to look for. As a general-purpose tool, Elasticsearch has thousands of different configurations, which enables it to fit a variety of different workloads. Even if published online, a data model or configuration that worked for one company may not be appropriate for yours. There is no magic bullet for getting Elasticsearch to scale; it requires diligent performance testing and trial and error. Unresponsive Elasticsearch Cluster Issues Cluster stability issues are some of the hardest to debug, especially if nothing changes with your data volume or code base. Check Size of Cluster State What Does It Do? The Elasticsearch cluster state tracks the global state of our cluster and is at the heart of controlling traffic and the cluster. Cluster state includes metadata on nodes in your cluster, the status of shards and how they are mapped to nodes, index mappings (i.e., the schema), and more. Cluster state usually doesn’t change often. However, certain operations such as adding a new field to an index mapping can trigger an update. Because cluster state updates are broadcast to all nodes in the cluster, it should be small (<100MB). A large cluster state can quickly make the cluster unstable. A common way this happens is through a mapping explosion (too many keys in an index) or too many indices. What to Look For Download the cluster state using the below command and look at the size of the JSON returned: curl -XGET 'http://localhost:9200/_cluster/state' If the cluster state is large and increasing, look in particular at which indices have the most fields in the cluster state; one of them could be the offending index causing stability issues. You can also get an idea by looking at an individual index or matching against an index pattern like so: curl -XGET 'http://localhost:9200/_cluster/state/_all/my_index-*' You can also see the offending index’s mapping using the following command: curl -XGET 'http://localhost:9200/my_index/_mapping' How to Fix Look at how data is being indexed. A common way a mapping explosion occurs is when high-cardinality identifiers are being used as JSON keys. Each time a new key like “4” or “5” is seen, the cluster state is updated. For example, the below JSON will quickly cause stability issues with Elasticsearch as each key is added to the global state.
{ "1": { "status": "ACTIVE" }, "2": { "status": "ACTIVE" }, "3": { "status": "DISABLED" } } To fix, flatten your data into something that is Elasticsearch-friendly:{ [ { "id": "1", "status": "ACTIVE" }, { "id": "2", "status": "ACTIVE" }, { "id": "3", "status": "DISABLED" } ] } Check Elasticsearch Tasks Queue What Does It Do? When a request is made against elasticsearch (index operation, query operation, etc), it’s first inserted into the task queue, until a worker thread can pick it up. Once a worker pool has a thread free, it will pick up a task from the task queue and process it. These operations are usually made by you via HTTP requests on the :9200 and :9300 ports, but they can also be internal to handle maintenance tasks on an index At a given time there may be hundreds or thousands of in-flight operations, but should complete very quickly (like microseconds or milliseconds). What to Look For Run the below command and look for tasks that are stuck running for a long time like minutes or hours. This means something is starving the cluster and preventing it from making forward progress. It’s ok for certain long running tasks like moving an index to take a long time. However, normal query and index operations should be quick.curl -XGET 'http://localhost:9200/_cat/tasks?detailed' With the ?detailed param, you can get more info on the target index and query. Look for patterns in which tasks are consistently at the top of the list. Is it the same index? Is it the same node? If so, maybe something is wrong with that index’s data or the node is overloaded. How to Fix If the volume of requests is higher than normal, then look at ways to optimize the requests (such as using bulk APIs or more efficient queries/writes) If not change in volume and looks random, this implies something else is slowing down the cluster. The backup of tasks is just a symptom of a larger issue. If you don’t know where the requests come from, add the X-Opaque-Id header to your Elasticsearch clients to identify which clients are triggering the queries. Checks Elasticsearch Pending Tasks What Does It Do? Pending tasks are pending updates to the cluster state such as creating a new index or updating its mapping. Unlike the previous tasks queue, pending updates require a multi step handshake to broadcast the update to all nodes in the cluster, which can take some time. There should be almost zero in-flight tasks in a given time. Keep in mind, expensive operations like a snapshot restore can cause this to spike temporarily. What to Look For Run the command and ensure none or few tasks in-flight.curl curl curl -XGET 'http://localhost:9200/_cat/pending_tasks' If it looks to be a constant stream of cluster updates that finish quickly, look at what might be triggering them. Is it a mapping explosion or creating too many indices? If it’s just a few, but they seem stuck, look at the logs and metrics of the master node to see if there are any issues. For example, is the master node running into memory or network issues such that it can’t process cluster updates? Hot Threads What Does It Do? The hot threads API is a valuable built-in profiler to tell you where Elasticsearch is spending the most time. This can provide insights such as whether Elasticsearch is spending too much time on index refresh or performing expensive queries. What to Look For Make a call to the hot threads API. 
To improve accuracy, it’s recommended to capture many snapshots using the ?snapshots param: curl -XGET 'http://localhost:9200/_nodes/hot_threads?snapshots=1000' This will return stack traces seen when the snapshots were taken. Look for the same stack in many different snapshots. For example, you might see the text 5/10 snapshots sharing following 20 elements. This means a thread spent time in that area of the code during 5 of the snapshots. You should also look at the CPU %. If an area of code has both high snapshot sharing and high CPU %, this is a hot code path. By looking at the code module, you can work out what Elasticsearch is doing. If you see a wait or park state, this is usually okay. How to Fix If a large amount of CPU time is spent on index refresh, then try increasing the refresh interval beyond the default 1 second. If you see a large amount in cache, maybe your default caching settings are suboptimal and causing a heavy miss rate. Memory Issues Check Elasticsearch Heap/Garbage Collection What Does It Do? As a JVM process, the heap is the area of memory where a lot of Elasticsearch data structures are stored, and it requires garbage collection cycles to prune old objects. For typical production setups, Elasticsearch locks all memory using mlockall on boot and disables swapping. If you’re not doing this, do it now. If heap usage is consistently above 85% or 90% for a node, this means we are coming close to running out of memory. What to Look For Search for the phrase collecting in the last in the Elasticsearch logs. If these are present, this means Elasticsearch is spending higher overhead on garbage collection (which takes time away from other productive tasks). A few of these every now and then are OK as long as Elasticsearch is not spending the majority of its CPU cycles on garbage collection (calculate the percentage of time spent on collecting relative to the overall time provided). A node that is spending 100% of its time on garbage collection is stalled and cannot make forward progress. Nodes that appear to have network issues like timeouts may actually be suffering from memory issues. This is because a node can’t respond to incoming requests during a garbage collection cycle. How to Fix The easiest fix is to add more nodes to increase the heap available for the cluster. However, it takes time for Elasticsearch to rebalance shards to the empty nodes. If only a small set of nodes have high heap usage, you may need to better balance your cluster. For example, if your shards vary in size drastically or have different query/index bandwidths, you may have allocated too many hot shards to the same set of nodes. To move a shard, use the reroute API. Just adjust the shard awareness sensitivity to ensure it doesn’t get moved back: curl -XPOST -H "Content-Type: application/json" localhost:9200/_cluster/reroute -d ' { "commands": [ { "move": { "index": "test", "shard": 0, "from_node": "node1", "to_node": "node2" } } ] }' If you are sending large bulk requests to Elasticsearch, try reducing the batch size so that each batch is under 100MB. While larger batches help reduce network overhead, they require allocating more memory to buffer the request, which cannot be freed until after both the request is complete and the next GC cycle. Check Elasticsearch Old Memory Pressure What Does It Do? The old memory pool contains objects that have survived multiple garbage collection cycles and are long-living objects. If the old memory is over 75%, you might want to pay attention to it.
As this fills up beyond 85%, more GC cycles will happen, but the objects can’t be cleaned up. What to Look For Look at the old pool used/old pool max. If this is over 85%, that is concerning. How to Fix Are you eagerly loading a lot of field data? This resides in memory for a long time. Are you performing many long-running analytics tasks? Certain tasks should be offloaded to a distributed computing framework designed for map/reduce operations, like Apache Spark. Check Elasticsearch FieldData Size What Does It Do? FieldData is used for computing aggregations on a field, such as terms aggregations. Usually, field data for a field is not loaded into memory until the first time an aggregation is performed on it. However, it can also be precomputed on index refresh if eager_load_ordinals is set. What to Look For Look at the field data size for an index or all indices, like so: curl -XGET 'http://localhost:9200/index_1/_stats/fielddata' An index could have very large field data structures if we are using it on the wrong type of data. Are you performing aggregations on very high-cardinality fields like a UUID or trace id? Field data is not suited for very high-cardinality fields, as they will create massive field data structures. Do you have a lot of fields with eager_load_ordinals set, or do you allocate a large amount of memory to the field data cache? This causes the field data to be generated at refresh time vs. query time. While it can speed up aggregations, it’s not optimal if you’re computing the field data for many fields at index refresh and never consuming it in your queries. How to Fix Make adjustments to your queries or mapping to not aggregate on very high-cardinality keys. Audit your mapping to reduce the number of fields that have eager_load_ordinals set to true. Elasticsearch Networking Issues Node Left or Node Disconnected What Does It Do? A node will eventually be removed from the cluster if it does not respond to requests. This allows shards to be replicated to other nodes to meet the replication factor and ensure high availability even if a node was removed. What to Look For Look at the master node logs. Even though there are multiple masters, you should look at the master node that is currently elected. You can use the nodes API or a tool like Cerebro to do this. Look for a consistent node that times out or has issues. For example, you can see which nodes are still pending for a cluster update by looking for the phrase pending nodes in the master node’s logs. If you see the same node keep getting added but then removed, it may imply the node is overloaded or unresponsive. If you can’t reach the node from your master node, it could imply a networking issue. You could also be running into NIC or CPU bandwidth limitations. How to Fix Test with the setting transport.compression set to true. This will compress traffic between nodes (such as from ingest nodes to data nodes), reducing network bandwidth usage at the expense of CPU. Note: Earlier versions called this setting transport.tcp.compression. If you also have memory issues, try increasing memory. A node may become unresponsive due to a large amount of time spent on garbage collection. Not Enough Master Node Issues What Does It Do? The master and other nodes need to discover each other to formulate a cluster. On the first boot, you must provide a static set of master nodes so you don’t have a split-brain problem. Other nodes will then discover the cluster automatically as long as the master nodes are present.
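For newer Elasticsearch versions (7.x and later), that static bootstrap list is typically provided in elasticsearch.yml roughly as follows; the node names and hostnames here are only placeholders:
YAML
# elasticsearch.yml on each master-eligible node (hostnames are illustrative)
discovery.seed_hosts:
  - master-1.example.internal
  - master-2.example.internal
  - master-3.example.internal
# Only needed when bootstrapping a brand-new cluster
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3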
What to Look For Enable Trace logging to review discovery-related activities: curl -XPUT -H "Content-Type: application/json" localhost:9200/_cluster/settings -d ' { "transient": {"logger.discovery.zen":"TRACE"} }' Review configurations such as minimum_master_nodes (on versions older than 7.x). Look at whether all master nodes in your initial master nodes list can ping each other. Review whether you have a quorum, which should be (number of master nodes / 2) + 1. If you have less than a quorum, no updates to the cluster state will occur to protect data integrity. How to Fix Sometimes network or DNS issues can cause the original master nodes to not be reachable. Review that you have at least (number of master nodes / 2) + 1 master nodes currently running. Shard Allocation Errors Elasticsearch in Yellow or Red State (Unassigned Shards) What Does It Do? When a node reboots or a cluster restore is started, the shards are not immediately available. Recovery is throttled to ensure the cluster does not get overwhelmed. Yellow state means primary shards are allocated, but secondary (replica) shards have not been allocated yet. While yellow indices are both readable and writable, availability is decreased. The yellow state is usually self-healing as the cluster replicates shards. Red indices mean primary shards are not allocated. This could be transient, such as during a snapshot restore operation, but can also imply major problems such as missing data. What to Look For See the reason why allocation has stopped: curl -XGET 'http://localhost:9200/_cluster/allocation/explain' curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' Get a list of red indices to understand which indices are contributing to the red state. The cluster will stay in the red state as long as at least one index is red: curl -XGET 'http://localhost:9200/_cat/indices' | grep red For more detail on a single index, you can see the recovery status for the offending index: curl -XGET 'http://localhost:9200/index_1/_recovery' How to Fix If you see a timeout from max_retries (maybe the cluster was busy during allocation), you can temporarily increase the allocation retry limit (the default is 5). Once the limit is above the number of failed attempts, Elasticsearch will start to initialize the unassigned shards: curl -XPUT -H "Content-Type: application/json" localhost:9200/index1,index2/_settings -d ' { "index.allocation.max_retries": 7 }' Elasticsearch Disk Issues Index Is Read-Only What Does It Do? Elasticsearch has three disk-based watermarks that influence shard allocation. The cluster.routing.allocation.disk.watermark.low watermark prevents new shards from being allocated to a node whose disk is filling up. By default, this is 85% of the disk used. The cluster.routing.allocation.disk.watermark.high watermark will force the cluster to start moving shards off of the node to other nodes. By default, this is 90%. This will move data around until the node is below the high watermark. The cluster.routing.allocation.disk.watermark.flood_stage watermark is reached when the disk is getting so full that relocating shards might not be fast enough before the disk runs out of space. When reached, indices are placed in a read-only state to avoid data corruption. What to Look For Look at your disk space for each node.
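A quick way to check this is the _cat allocation API, which reports per-node disk usage alongside the number of shards on each node:
Shell
curl -XGET 'http://localhost:9200/_cat/allocation?v'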
Review the node logs for a message like the one below: high disk watermark [90%] exceeded on XXXXXXXX free: 5.9gb[9.5%], shards will be relocated away from this node Once the flood stage is reached, you’ll see logs like so: flood stage disk watermark [95%] exceeded on XXXXXXXX free: 1.6gb[2.6%], all indices on this node will be marked read-only Once this happens, the indices on that node are read-only. To confirm, see which indices have read_only_allow_delete set to true: curl -XGET 'http://localhost:9200/_all/_settings?pretty' | grep read_only How to Fix First, clean up disk space, such as by deleting local logs or tmp files. To remove the read-only block, run the following command: curl -XPUT -H "Content-Type: application/json" localhost:9200/_all/_settings -d ' { "index.blocks.read_only_allow_delete": null }' Conclusion Troubleshooting stability and performance issues can be challenging. The best way to find the root cause is by using the scientific method: form a hypothesis and prove it correct or incorrect. Using these tools and the Elasticsearch management APIs, you can gain a lot of insight into how Elasticsearch is performing and where issues may be.
The World Has Changed, and We Need To Adapt The world has gone through a tremendous transformation in the last fifteen years. Cloud and microservices changed the world. Previously, our application was using one database; developers knew how it worked, and the deployment rarely happened. A single database administrator was capable of maintaining the database, optimizing the queries, and making sure things worked as expected. The database administrator could just step in and fix the performance issues we observed. Software engineers didn’t need to understand the database, and even if they owned it, it was just a single component of the system. Guaranteeing software quality was much easier because the deployment happened rarely, and things could be captured on time via automated tests. Fifteen years later, everything is different. Companies have hundreds of applications, each one with a dedicated database. Deployments happen every other hour, deployment pipelines work continuously, and keeping track of flowing changes is beyond one’s capabilities. The complexity of the software increased significantly. Applications don’t talk to databases directly but use complex libraries that generate and translate queries on the fly. Application monitoring is much harder because applications do not work in isolation, and each change may cause multiple other applications to fail. Reasoning about applications is now much harder. It’s not enough to just grab the logs to understand what happened. Things are scattered across various components, applications, queues, service buses, and databases. Databases changed as well. We have various SQL distributions, often incompatible despite having standards in place. We have NoSQL databases that provide different consistency guarantees and optimize their performance for various use cases. We developed multiple new techniques and patterns for structuring our data, processing it, and optimizing schemas and indexes. It’s not enough now to just learn one database; developers need to understand various systems and be proficient with their implementation details. We can’t rely on ACID anymore as it often harms the performance. However, other consistency levels require a deep understanding of the business. This increases the conceptual load significantly. Database administrators have a much harder time keeping up with the changes, and they don’t have enough time to improve every database. Developers are unable to analyze and get the full picture of all the moving parts, but they need to deploy changes faster than ever. And the monitoring tools still swamp us with metrics instead of answers. Given all the complexity, we need developers to own their databases and be responsible for their data storage. This “shift left” in responsibility is a must in today’s world for both small startups and big Fortune 500 enterprises. However, it’s not trivial. How do we prevent the bad code from reaching production? How to troubleshoot issues automatically? How do we move from monitoring to observability? Finally, how do we give developers the proper tools and processes so they will be able to own the databases? Read on to find answers. Measuring Application Performance Is Complex It’s crucial to measure to improve the performance. Performance indicators (PIs) help us evaluate the performance of the system on various dimensions. They can focus on infrastructure aspects such as the reliability of the hardware or networking. 
They can use application metrics to assess the performance and stability of the system. They can also include business metrics to measure the success from the company and user perspective, including user retention or revenue. Performance indicators are important tracking mechanisms to understand the state of the system and the business as a whole. However, in our day-to-day job, we need to track many more metrics. We need to understand contributors to the performance indicators to troubleshoot the issues earlier and understand whether the system is healthy or not. Let’s see how to build these elements in the modern world. We typically need to start with telemetry — the ability to collect the signals. There are multiple types of signals that we need to track: logs (especially application logs), metrics, and traces. Capturing these signals can be a matter of proper configuration (like enabling them in the hosting provider panel), or they need to be implemented by the developers. Recently, OpenTelemetry gained significant popularity. It’s a set of SDKs for popular programming languages that can be used to instrument applications to generate signals. This way, we have a standardized way of building telemetry within our applications. Odds are that most of the frameworks and libraries we use are already integrated with OpenTelemetry and can generate signals properly. Next, we need to build a solution for capturing the telemetry signals in one centralized place. This way, we can see “what happens” inside the system. We can browse the signals from the infrastructure (like hosts, CPUs, GPUs, and network), applications (number of requests, errors, exceptions, data distribution), databases (data cardinality, number of transactions, data distribution), and many other parts of the application (queues, notification services, service buses, etc.). This lets us troubleshoot more easily as we can see what happens in various parts of the ecosystem. Finally, we can build the Application Performance Management (APM). It’s the way of tracking metric indicators with telemetry and dashboards. APM focuses on providing end-to-end monitoring that goes across all the components of the system, including the web layer, mobile and desktop applications, databases, and the infrastructure connecting all the elements. It can be used to automate alarms and alerts to constantly assess whether the system is healthy. APM may seem like a silver bullet. It aggregates metrics, shows the performance, and can quickly alert when something goes wrong, and the fire begins. However, it’s not that simple. Let’s see why. Why Application Performance Monitoring Is Not Enough APM captures signals and presents them in a centralized application. While this may seem enough, it lacks multiple features that we would expect from a modern maintenance system. First, APM typically presents raw signals. While it has access to various metrics, it doesn’t connect the dots easily. Imagine that the CPU spikes. Should you migrate to a bigger machine? Should you optimize the operating system? Should you change the driver? Or maybe the CPU spike is caused by different traffic coming to the application? You can’t tell that easily just by looking at metrics. Second, APM doesn’t easily show where the problem is. We may observe metrics spiking in one part of the system, but it doesn’t necessarily mean that the part is broken. There may be other reasons and issues. 
Maybe it’s wrong input coming to the system, maybe some external dependency is slow, and maybe some scheduled task runs too often. APM doesn’t show that, as it cannot connect the dots and show the flow of changes throughout the system. You just see the state then, but you don’t see how you got to that point easily. Third, the resolution is unknown. Let’s say that the CPU spiked during the scheduled maintenance task. Should we upscale the machine? Should we disable the task? Should we run it some other time? Is there a bug in the task? Many things are not clear. We can easily imagine a situation when the scheduled task runs in the middle of the day just because it is more convenient for the system administrators; however, the task is now slow and competes with regular transactions for the resources. In that case, we probably should move the task to some time outside of peak hours. Another scenario is that the task was using an index that doesn’t work anymore. Therefore, it’s not about the task per se, but it’s about the configuration that has been changed with the last deployment. Therefore, we should fix the index. APM won’t show us all those details. Fourth, APM is not very readable. Dashboards with metrics look great, but they are too often just checked whether they’re green. It’s not enough to see that alarms are not ringing. We need to manually review the metrics, look for anomalies, understand how they change, and if we have all the alarms in place. This is tedious and time-consuming, and many developers don’t like doing that. Metrics, charts, graphs, and other visualizations swamp us with raw data that doesn’t show the big picture. Finally, one person can’t reason about the system. Even if we have a dedicated team for maintenance, the team won’t have an understanding of all the changes going through the system. In the fast-paced world with tens of deployments every day, we can’t look for issues manually. Every deployment may result in an outage due to invalid schema migration, bad code change, cache purge, lack of hardware, bad configuration, or many more issues. Even when we know something is wrong and we can even point to the place, the team may lack the understanding or knowledge needed to identify the root cause. Involving more teams is time-consuming and doesn’t scale. While APM looks great, it’s not the ultimate solution. We need something better. We need something that connects the dots and provides answers instead of data. We need true observability. What Makes the Observability Shine Observability turns alerts into root causes and raw data into understanding. Instead of charts, diagrams, and graphs, we want to have a full story of the changes going through pipelines and how they affect the system. This should understand the characteristics of the application, including the deployment scheme, data patterns, partitioning, sharding, regionalization, and other things specific to the application. Observability lets us reason about the internals of the system from the outside. For instance, we can reason that we deployed the wrong changes to the production environment because there is a metric spike in the database. We don’t focus on the database per se, but we analyze the difference between the current and the previous code. However, if there was no deployment recently, but we observe much higher traffic on the load balancer, then we can reason that it’s probably due to different traffic coming to the application. Observability makes the interconnections clear and visible. 
To build observability, we need to capture static signals and dynamic history. We need to include our deployments, configuration, extensions, connectivity, and characteristics of our application code. It’s not enough just to see that “something is red now.” We need to understand how we got there and what could be the possible reason. To achieve that, a good observability solution needs to go through multiple steps. First, we need to be able to pinpoint the problem. In the modern world of microservices and bounded contexts, it’s not trivial. If the CPU spikes, we need to be able to answer which service or application caused that, which tenant is responsible, or whether this is for all the traffic or some specific requests in the case of a web application. We can do that by carefully observing metrics with multiple dimensions, possibly with dashboards and alarms. Second, we need to include multiple signals. CPU spikes can be caused by a lack of hardware, wrong configuration, broken code, unexpected traffic, or simply things that shouldn’t be running at that time. What’s more, maybe something unexpected happened around the time of the issue. This could be related to a deployment, an ongoing sports game, a specific time of week or time of year, some promotional campaign we just started, or some outage in the cloud infrastructure. All these inputs must be provided to the observability system to understand the bigger picture. Third, we need to look for anomalies. It may seem counterintuitive, but digital applications rot over time. Things change, traffic changes, updates are installed, security fixes are deployed, and every single change can break our application. However, the outage may not be quick and easy. The application may get slower and slower over time, and we won’t notice that easily because alarms do not go off or they become red only for a short period. Therefore, we need to have anomaly detection built-in. We need to be able to look for traffic patterns, weekly trends, and known peaks during the year. A proper observability solution needs to be aware of these and automatically find the situations in which the metrics don’t align. Fourth, we need to be able to automatically root cause the issue and suggest a solution. We can’t push the developers to own the databases and the systems without proper tooling. The observability systems need to be able to automatically suggest improvements. We need to unblock the developers so they can finally be responsible for the performance and own the systems end to end. Databases and Observability We Need Today Let’s now see what we need in the domain of databases. Many things can break, and it’s worth exploring the challenges we may face when working with SQL or NoSQL databases. We are going to see the three big areas where things may go wrong. These are code changes, schema changes, and execution changes. Code Changes Many database issues come from the code changes. Developers modify the application code, and that results in different SQL statements being sent to the database. These queries may be inherently slow, but these won’t be captured by the testing processes we have in place now. Imagine that we have the following application code that extracts the user aggregate root. 
The user may have multiple additional pieces of information associated with them, like details, pages, or texts: JavaScript const user = repository.get("user") .where("user.id = 123") .leftJoin("user.details", "user_details_table") .leftJoin("user.pages", "pages_table") .leftJoin("user.texts", "texts_table") .leftJoin("user.questions", "questions_table") .leftJoin("user.reports", "reports_table") .leftJoin("user.location", "location_table") .leftJoin("user.peers", "peers_table") .getOne() return user; The code generates the following SQL statement: SQL SELECT * FROM users AS user LEFT JOIN user_details_table AS detail ON detail.user_id = user.id LEFT JOIN pages_table AS page ON page.user_id = user.id LEFT JOIN texts_table AS text ON text.user_id = user.id LEFT JOIN questions_table AS question ON question.user_id = user.id LEFT JOIN reports_table AS report ON report.user_id = user.id LEFT JOIN locations_table AS location ON location.user_id = user.id LEFT JOIN peers_table AS peer ON peer.user_id = user.id WHERE user.id = '123' Because of multiple joins, the query returns nearly 300 thousand rows to the application that are later processed by the mapper library. This takes 25 seconds in total. Just to get one user entity. The problem with such a query is that we don’t see the performance implications when we write the code. If we have a small developer database with only a hundred rows, then we won’t get any performance issues when running the code above locally. Unit tests won’t catch that either because the code is “correct”: it returns the expected result. We won’t see the issue until we deploy to production and see that the query is just too slow. Another problem is the well-known N+1 query problem with Object Relational Mapper (ORM) libraries. Imagine that we have a flights table that is in a 1-to-many relation with a tickets table. If we write code to get all the flights and count all the tickets, we may end up with the following: C# var totalTickets = 0; var flights = dao.getFlights(); foreach(var flight in flights){ totalTickets += flight.getTickets().count; } This may result in N+1 queries being sent in total: one query to get all the flights, and then N queries to get the tickets for every flight: SQL SELECT * FROM flights; SELECT * FROM tickets WHERE ticket.flight_id = 1; SELECT * FROM tickets WHERE ticket.flight_id = 2; SELECT * FROM tickets WHERE ticket.flight_id = 3; ... SELECT * FROM tickets WHERE ticket.flight_id = n; Just as before, we don’t see the problem when running things locally, and our tests won’t catch that. We’ll find the problem only when we deploy to an environment with a sufficiently big data set. Yet another thing is about rewriting queries to make them more readable. Let’s say that we have a table boarding_passes. We want to write the following query (just for exemplary purposes): SQL SELECT COUNT(*) FROM boarding_passes AS C1 JOIN boarding_passes AS C2 ON C2.ticket_no = C1.ticket_no AND C2.flight_id = C1.flight_id AND C2.boarding_no = C1.boarding_no JOIN boarding_passes AS C3 ON C3.ticket_no = C1.ticket_no AND C3.flight_id = C1.flight_id AND C3.boarding_no = C1.boarding_no WHERE MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06' AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06' AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06' This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on the condition.
This code runs for 8 seconds on my machine with the demo database. A programmer may now want to avoid this repetition and rewrite the query to the following: SQL WITH cte AS ( SELECT *, MD5(MD5(ticket_no)) AS double_hash FROM boarding_passes ) SELECT COUNT(*) FROM cte AS C1 JOIN cte AS C2 ON C2.ticket_no = C1.ticket_no AND C2.flight_id = C1.flight_id AND C2.boarding_no = C1.boarding_no JOIN cte AS C3 ON C3.ticket_no = C1.ticket_no AND C3.flight_id = C1.flight_id AND C3.boarding_no = C1.boarding_no WHERE C1.double_hash = '525ac610982920ef37b34aa56a45cd06' AND C2.double_hash = '525ac610982920ef37b34aa56a45cd06' AND C3.double_hash = '525ac610982920ef37b34aa56a45cd06' The query is now more readable as it avoids repetition. However, the performance dropped, and the query now executes in 13 seconds. Now, when we deploy changes like these to production, we may reason that we need to upscale the database. Seemingly, nothing has changed, but the database is now much slower. With good observability tools, we would see that the query executed behind the scenes is now different, which leads to a performance drop. Schema Changes Another problem around databases is schema management. There are generally three different ways of modifying the schema: we can add something (a table, column, index, etc.), remove something, or modify something. Each schema modification is dangerous because the database engine may need to rewrite the table: copy the data on the side, modify the table schema, and then copy the data back. This may lead to a very long deployment (minutes, hours, even months) that we can’t optimize or stop in the middle. Additionally, we typically won’t see the problems when running things locally because we typically run our tests against the latest schema. A good observability solution needs to capture these changes before they run in production. Indexes pose another interesting challenge. Adding an index seems to be safe. However, as is the case with every index, it needs to be maintained over time. Indexes generally improve the read performance because they help us find rows much faster. At the same time, they decrease the modification performance, as every data modification must be performed in the table and in all the indexes. However, indexes may not be useful after some time. It’s often the case that we configure an index; a couple of months later, we change the application code, and the index isn’t used anymore. Without good observability systems, we won’t be able to notice that the index isn’t useful anymore and only decreases the performance. Execution Changes Yet another area of issues is related to the way we execute queries. Databases prepare a so-called execution plan of the query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics of the tables’ content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and the running configuration. The execution plan dictates what join strategy to use when joining tables (nested loop join, merge join, hash join, or maybe something else), which indexes to scan (or tables instead), and when to sort and materialize the results. We can affect the execution plan by providing query hints. Inside the SQL statements, we can specify what join strategy to use or what locks to acquire. The database may use these hints to improve the performance but may also disregard them and execute things differently.
However, we don’t know whether the database used them or not. Things get worse over time. Indexes may change after the deployment, data distribution may depend on the day of the week, and the database load may be much different between countries when we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won’t catch that. Unit tests are used to verify the correctness of our queries, and the queries will still return the same results. We have simply no way of identifying these changes automatically. Database Guardrails Is the New Standard Based on what we said above, we need a new approach. No matter if we run a small product or a big Fortune 500 company, we need a novel way of dealing with databases. Developers need to own their databases and have all the means to do it well. We need good observability and database guardrails — a novel approach that: Prevents the bad code from reaching production, Monitors all moving pieces to build a meaningful context for the developer, It significantly reduces the time to identify the root cause and troubleshoot the issues, so the developer gets direct and actionable insights We can’t let ourselves go blind anymore. We need to have tools and systems that will help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let’s see how we can build such a system. There are four things that we need to capture to build successful database guardrails. Let’s walk through them. Database Internals Each database provides enough details about the way it executes the query. These details are typically captured in the execution plan that explains what join strategies were used, which tables and indexes were scanned, or what data was sorted. To get the execution plan, we can typically use the EXPLAIN keyword. For instance, if we take the following PostgreSQL query: SQL SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' We can add EXPLAIN to get the following query: SQL EXPLAIN SELECT TB.* FROM name_basics AS NB JOIN title_principals AS TP ON TP.nconst = NB.nconst JOIN title_basics AS TB ON TB.tconst = TP.tconst WHERE NB.nconst = 'nm00001' The query returns the following output: SQL Nested Loop (cost=1.44..4075.42 rows=480 width=89) -> Nested Loop (cost=1.00..30.22 rows=480 width=10) -> Index Only Scan using name_basics_pkey on name_basics nb (cost=0.43..4.45 rows=1 width=10) Index Cond: (nconst = 'nm00001'::text) -> Index Only Scan using title_principals_nconst_idx on title_principals tp (cost=0.56..20.96 rows=480 width=20) Index Cond: (nconst = 'nm00001'::text) -> Index Scan using title_basics_pkey on title_basics tb (cost=0.43..8.43 rows=1 width=89) Index Cond: (tconst = tp.tconst) This gives a textual representation of the query and how it will be executed. We can see important information about the join strategy (Nested Loop in this case), tables and indexes used (Index Only Scan for name_basics_pkey, or Index Scan for title_basics_pkey), and the cost of each operation. Cost is an arbitrary number indicating how hard it is to execute the operation. We shouldn’t draw any conclusions from the numbers per se, but we can compare various plans based on the cost and choose the cheapest one. Having plans at hand, we can easily tell what’s going on. 
We can see if we have an N+1 query issue if we use indexes efficiently and if the operation runs fast. We can get some insights into how to improve the queries. We can immediately tell if a query is going to scale well in production just by looking at how it reads the data. Once we have these plans, we can move on to another part of successful database guardrails. Integration With Applications We need to extract plans somehow and correlate them with what our application does. To do that, we can use OpenTelemetry (OTel). OpenTelemetry is an open standard for instrumenting applications. It provides multiple SDKs for various programming languages and is now commonly used in frameworks and libraries for HTTP, SQL, ORM, and other application layers. OpenTelemetry captures signals: logs, traces, and metrics. They are later captured into spans and traces that represent the communication between services and timings of operations. Each span represents one operation performed by some server. This could be file access, database query, or request handling. We can now extend OpenTelemetry signals with details from databases. We can extract execution plans, correlate them with signals from other layers, and build a full understanding of what happened behind the scenes. For instance, we would clearly see the N+1 problem just by looking at the number of spans. We could immediately identify schema migrations that are too slow or operations that will take the database down. Now, we need the last piece to capture the full picture. Semantic Monitoring of All Databases Observing just the local database may not be enough. The same query may execute differently depending on the configuration or the freshness of statistics. Therefore, we need to integrate monitoring with all the databases we have, especially with the production ones. By extracting statistics, number of rows, running configuration, or installed extensions, we can get an understanding of how the database performs. Next, we can integrate that with the queries we run locally. We take the query that we captured in the local environment and then reason about how it would execute in production. We can compare the execution plan and see which tables are accessed or how many rows are being read. This way, we can immediately tell the developer that the query is not going to scale well in production. Even if the developer has a different database locally or has a low number of rows, we can still take the query or the execution plan, enrich it with the production statistics, and reason about the performance after the deployment. We don’t need to wait for the deployment of the load tests, but we can provide feedback nearly immediately. The most important part is that we move from raw signals to reasoning. We don’t swamp the user with plots or metrics that are hard to understand or that the user can’t use easily without setting the right thresholds. Instead, we can provide meaningful suggestions. Instead of saying, “CPU spiked to 80%,” we can say, “The query scanned the whole table, and you should add an index on this and that column.” We can give developers answers, not only the data points to reason about. Automated Troubleshooting That’s just the beginning. Once we understand what is actually happening in the database, the sky's the limit. We can run anomaly detection on the queries to see how they change over time, if they use the same indexes as before, or if they changed the join strategy. 
We can catch ORM configuration changes that lead to multiple SQL queries being sent for a particular REST API. We can submit automated pull requests to tune the configuration. We can correlate the application code with the SQL query so we can rewrite the code on the fly with machine-learning solutions. Summary In recent years, we observed a big evolution in the software industry. We run many applications, deploy many times a day, scale out to hundreds of servers, and use more and more components. Application Performance Monitoring is not enough to keep track of all the moving parts in our applications. Here at Metis, we believe that we need something better. We need a true observability that can finally show us the full story. And we can use observability to build database guardrails that provide the actual answers and actionable insights. Not a set of metrics that the developer needs to track and understand, but automated reasoning connecting all the dots. That’s the new approach we need and the new age we deserve as developers owning our databases.
This error usually happens when you are doing a backup of the transaction log. The error is like this one: Msg 26019, Level 16, State 1, Line 1 BACKUP detected corruption in the database log. Check the errorlog for more information. BACKUP LOG is terminating abnormally. In this article, we will explain why this error happens and how we can solve this problem. What Does the Database Log Error Corruption Mean? A level 16 error is not so critical. It is in the category of miscellaneous user errors. The database will work. If you do a full backup, it will work. If you run DBCC CHECKDB, it will not detect an error. However, the transaction log file is damaged. Line 1 is the line of code that is failing. Why Does This Error Happen? To find out the reason for this error, check your SQL Error Log. You can find your Error Log in SQL Server Management Studio (SSMS). In the Object Explorer, go to Management>SQL Server Logs. You have the current log and older logs. Double-click the logs, and you can see the events and errors. You can also check the Event Viewer, go to Windows Logs>Application, and look for MSSQLServer errors. The most common problems that can generate log corruption are hardware problems. Software can also damage the database. For example, a power failure can shut down the server while it is doing a transaction, and then the log can be corrupted. Another common problem is that the disk fails. This happens if the disk is old, there is a power outage, or there is an electricity problem. If the server temperature is high, a hardware problem can occur. If we talk about software, the software can corrupt the log. For example, viruses and malware can damage the log files. How To Resolve Log Corruption Detected During Database Backup in SQL Server If we do a full backup of a corrupted database, the backup will run, but we will back up the database with a corrupt log file. If we try to back up the log file only, we will get the error mentioned before. A solution for that problem is to back up with the Continue on error option. To do that, open SSMS. In the Object Explorer, right-click the database and select Tasks>Back Up. Select the Transaction Log option. On the Media Options page, select the Continue on error option. This option will continue doing the backup even when the transaction log is corrupt. Another way to solve this problem is to set the database to the simple recovery model. In SSMS, go to the Object Explorer, right-click the database, and select Properties. Go to the Options page and select the Simple recovery model. Run a checkpoint using T-SQL: CHECKPOINT Do a full backup of your database. Now, you will be able to do a backup of the log file without errors. How To Resolve Log Corruption Detected During Database Backup in SQL Server Using Stellar Repair for MS SQL Another way to solve this problem is to use Stellar Repair for MS SQL. This software can repair the database using the SQL Server data file, or it can use a damaged SQL Server backup to recover all the information. Once you have the database back, you can back up your log file without errors. To do that, you need to download the software from this link. We will need to take the database offline first. To do that, run the following command: ALTER DATABASE stellardb SET OFFLINE; You will need to find the data file. The data file is a file with the .mdf extension. This file contains the database information.
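If you are not sure where the data file is located, you can list the physical file paths with a standard catalog query (a small sketch; the database name stellardb follows the example above):
SQL
-- List the data (.mdf) and log (.ldf) files for the example database
SELECT name, physical_name, type_desc
FROM sys.master_files
WHERE database_id = DB_ID('stellardb');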
Optionally, you can Browse and select the mdf file if you know where it is and press the Repair button. Once repaired, you can Save your data in a New Database. The Live database is to replace the current one. When you select other formats, you can export your table and view data in Excel, CSV, or HTML files. If you select the New Database or Live Database, you will be able to back up the log file without errors because the repaired database will not be corrupted. Conclusion In this article, we learned what error occurs when the log is corrupted. Also, we learned why this error occurs. In addition, we learned to back up using the Continue on Error option. Finally, we learned how to repair the database using Stellar Repair for MS SQL.
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Greg Leffler
Observability Practitioner, Director,
Splunk
Ted Young
Director of Open Source Development,
LightStep
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere