Tutorials archives - Lightrun
https://lightrun.com/category/tutorials/

Troubleshooting Cloud Native Applications at Runtime
https://lightrun.com/troubleshooting-cloud-native-applications-at-runtime/
Wed, 18 Oct 2023 18:07:57 +0000

Co-Authored with Gilles Ramone (Chronosphere)

Chronosphere and Lightrun demonstrate how their combined solutions empower developers with optimized end-to-end observability

Introduction

Organizations are moving to microservices and container-based architectures because these modern environments enable speed, efficiency, availability, and the power to innovate and scale more quickly. However, when it comes to troubleshooting distributed cloud native applications, teams face a unique set of challenges due to the dynamic and decentralized nature of these systems. To name a few:

  1. Lack of visibility: With components spread across various cloud services and environments, gaining comprehensive visibility into the entire system can be difficult. Access to production environments is generally strictly limited to ensure the safety of customer-facing systems. This makes it challenging to understand run-time anomalies and identify the root cause of issues. 
  2. Complexity: Distributed systems are inherently complex, with numerous microservices, APIs, and dependencies. Understanding how these components interact and affect one another can be daunting when troubleshooting.
  3. Challenges with container orchestration: When using serverless systems and container orchestration platforms like Kubernetes, processes can be ephemeral, making it very challenging to identify the resources related to specific users or user segments, and to capture and analyze the state of the system relevant to specific traffic.
  4. Cost of monitoring and logging: Setting up effective monitoring and logging across all components is crucial, but aggregating logs and metrics from various sources is costly, and correlating them is complex.

Addressing these challenges requires a combination of a robust observability platform and tooling that simplifies complexity and helps developers understand the behavior of their deployed applications. 

These tools must address organizational concerns for security and data privacy. The best observability strategy will enable the ongoing “Shift left” – giving developers access to and responsibility for the quality, durability, and resilience of their code, in every environment in which it runs. Doing so will enable a more proactive approach to software maintenance and excellence throughout the Software Development Life Cycle.

Efficient troubleshooting requires not just gathering data, but making sense of that data: identifying the highest priority signals from the vast quantity and variety produced by large deployments. Chronosphere turbo-charges issue triage by collecting and then prioritizing observability data, providing a centralized observability analysis and optimization solution.

Rather than aggregating and storing all data for months, at ever increasing cost to store and access, Chronosphere pre-processes the data and optimizes it, substantially reducing cost and improving performance. 

When leveraging Chronosphere together with Lightrun, engineers are rapidly guided to the most incident-relevant observability data that helps them identify the impacted service. From there, they can connect directly from their local IDE via the Lightrun plugin to debug the live application deployment. With Chronosphere’s focus and Lightrun’s live view of the running application, developers can quickly understand system behavior, complete their investigation at minimal cost, and close the cycle of the troubleshooting process.

Chronosphere + Lightrun: A technical walk through 

Ready to see Chronosphere’s ability to separate signal from noise in a metrics-heavy application, and Lightrun’s on-demand, developer-initiated observability capabilities in action? 

To demonstrate, we’re going to use a small web application that’s deployed to the cloud and under load. Lightrun’s application provides simple functionality – users are presented with a list of pictures and they can bookmark those they like most. 

In this example, we’ve been alerted by Chronosphere about something amiss in our application’s behavior: it seems that some users are experiencing particularly high latency on some operations. Chronosphere pinpoints this to the “un-like” operation.

But why only some users?

The app designers are doing some A/B testing to see how users react to various configurations that may improve the site’s usability and performance. They use feature flags to randomly select subsets of users to get slightly different experiences. The percent of the audience exposed to each feature flag is determined by a config file.
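As a sketch of what such config-driven bucketing might look like (the class name, config keys, and percentages here are hypothetical illustrations, not the demo app's actual code):

```java
import java.util.Map;

class FeatureFlags {
    // Hypothetical rollout percentages as they might be read from a config file,
    // e.g. "new_unlike_flow=25" exposes 25% of users to the experimental path.
    private final Map<String, Integer> rolloutPercent;

    FeatureFlags(Map<String, Integer> rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    // Deterministically bucket each user into 0..99 so that a given user
    // always gets the same experience for a given flag.
    boolean isEnabled(String flag, String userId) {
        int percent = rolloutPercent.getOrDefault(flag, 0);
        int bucket = Math.floorMod((flag + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```

A 0% flag excludes everyone, 100% includes everyone, and values in between split the audience randomly but repeatably per user.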

Unfortunately, in our rush to roll out the feature flag controlled experiments, we neglected to include logging, so we have no information about which users are included in each of the experiment groups. 

The feature flags – possibly individually, possibly in combination – may be causing the latency that Chronosphere has identified. In order to know for sure, we’ll need to add some logging, which means creating a new build and rolling out an update. Kubernetes would let us do this without bringing down the entire application, but that might just further confuse which users are getting which feature flags, so it seems that some down time may be our best option.

Well, that would be the situation without Lightrun.

Since we’ve deployed Lightrun’s agent, we can introduce observability on demand, with no code changes, no new build, and no restarts required. That means we can add new logging to the running system without access to the containers.

We can safely gather the application state, just as we’d see it if we had connected a debugger, without opening any ports and without pausing the running application!

Lightrun provides remote, distributed observability directly in the interface where developers feel most at home: their existing IDE (integrated development environment). With Lightrun’s IDE plugins, adding observability on the fly is simply a matter of right-clicking in your code, choosing the pods of interest, and hitting submit.

Back to the issue at hand, we’ll use a dynamic log to get a quick feel for who is using the system. Lightrun quickly shows us that we’ve got a bunch of users actively bookmarking pictures. By using Lightrun tags, we’re able to gather information from across a distributed deployment without needing details of the running instances. 

That’s nice, but it’s still hard to tell what’s going on with a specific user who’s now complaining about the latency. We use conditional logging to reduce the noise and zoom in on that specific user’s activity. From there we can see that their requests are being received, but we still need to answer the question: what’s going on?

What we really want is a full picture, including: 

  • This user’s feature flags
  • The list of items they’ve bookmarked
  • And anything else from the environment that could be relevant. 

Enter Lightrun snapshots – virtual breakpoints that show us the state without causing any interruption in service. 

Creating a snapshot is just as easy as adding a log – we choose the tags that represent our deployment, add any conditions so that we’ll just get the user we’re interested in – regardless of which pod is serving that user at the moment. And there we have it, all of the session state affecting that user’s interaction with the application.

With this information we can see that one of our feature flags is to blame – it looks like it’s only partially implemented. It’s a good thing that only a small percentage of our audience is getting this one! Oops.

Before we roll out a fix, let’s get an idea of how many users are being affected by each of our feature flags. We can use Lightrun’s on-demand metrics to add counters to measure how often each block within our code is being reached. And we can add tic-tocs to measure the latency impact of this code, just in case our experimentation is also slowing down the site’s responsiveness. 
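Conceptually, a counter and a tictoc add up to the following instrumentation, sketched here by hand in plain Java (Lightrun injects the equivalent at runtime with no code change; the class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

class BlockMetrics {
    private final AtomicLong hits = new AtomicLong();       // counter: how often the block runs
    private final AtomicLong totalNanos = new AtomicLong(); // tictoc: cumulative time spent in it

    <T> T measure(java.util.function.Supplier<T> block) {
        hits.incrementAndGet();
        long start = System.nanoTime();
        try {
            return block.get();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    long hits() { return hits.get(); }

    long averageNanos() {
        long h = hits.get();
        return h == 0 ? 0 : totalNanos.get() / h;
    }
}
```

A high hit count with a low average tells a very different story than a low hit count with a high average, which is exactly the distinction we need here.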

Watch the full troubleshooting workflow below, carried out with both the Chronosphere and Lightrun observability platforms.

The Chronosphere and Lightrun Combined Solution

It’s imperative to have all observability data flowing into a cloud native observability platform like Chronosphere, which helps alert us to the needle in the haystack of telemetry our distributed applications produce. And with Lightrun, developers can query the state of the live system right in their IDE, where they can dynamically generate additional telemetry to send to Chronosphere for end-to-end analysis.

By using these solutions together, we leverage the unique capabilities provided by each. The result is full cloud native observability: understanding what’s going on in our code, right now, wherever it is deployed, at cloud native scale. Zooming in on the details that matter despite the complexity of the code and the deployment. Combining new, on-demand logs and metrics with those which are always produced by our code – for control, cost management, and automatic outlier alerting.

With developer-native, full-cycle observability, these powerful tools are supporting rapid issue triage, analysis, and resolution. This is essential to organizations realizing maximum observability benefits while maintaining control over their cloud native costs.

Feel free to contact us with any inquiries or to arrange an assessment.

The post Troubleshooting Cloud Native Applications at Runtime appeared first on Lightrun.

Debugging Race Conditions in Production
https://lightrun.com/debug-race-condition-production/
Mon, 14 Mar 2022 13:36:06 +0000

Race conditions can occur when a multithreaded application accesses a shared resource from more than one thread. Unless we have guards in place, the result might depend on which thread “got there first”. This is especially problematic when the state is changed externally.

A race can cause more than just incorrect behavior. It can enable a security vulnerability when the resource in question can be corrupted in just the right way. A good example of a race condition vulnerability is memory mangling. Let’s say we have an admin user name which is restricted and privileged. You can’t change your user name to admin because of validation. But you can change it to anything else…

Malicious code can repeatedly set the value to “admix”, “xdmin” etc. If the system writes the characters in sequence, you might end up with “admin”. We can just run this code repeatedly until we get the right result. This is a privilege escalation attack that lets us access information and capabilities we shouldn’t have access to, such as file systems, etc. The security implications are severe.
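The attack works because each character is written separately. The following sketch replays one unlucky interleaving deterministically; a real exploit would simply hammer the rename endpoint from two threads until the scheduler happens to produce it:

```java
class UsernameRace {
    // Shared buffer that two unsynchronized writers update character by character.
    static char[] username = new char[5];

    static void writeChar(String value, int index) {
        username[index] = value.charAt(index); // no lock: writes can interleave
    }

    // One interleaving the scheduler could produce: "xdmin" fully written,
    // then "admix" preempted after four characters, leaving "admin" behind.
    static String unluckyInterleaving() {
        for (int i = 0; i < 5; i++) writeChar("xdmin", i);
        for (int i = 0; i < 4; i++) writeChar("admix", i); // preempted before the last char
        return new String(username);
    }
}
```

Neither writer ever submitted the forbidden value, yet the buffer ends up holding it, which is why validating inputs alone doesn’t close this hole.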

For me personally, the biggest problem is the undefined behavior which can trigger elusive bugs.

Note that I used Java as the language of this tutorial but it should work similarly for other programming languages.

Serialize Everything?

Unfortunately, race conditions are remarkably hard to solve. Exclusive access or critical sections slow down application performance considerably. They block usage of computer resources and demolish our CPU cache utilization.

We want to minimize synchronization operations as much as possible, but we don’t want undefined behavior.

There are many optimizations and strategies for building performant multi-threaded applications. A great example of that is maximizing read-only state, immutability, etc.

But the real problem is knowing that you have a race condition. How do you detect it?

Detecting a Race Condition

There are some great race detection tools in the market. Some work with static analysis where they review the application code and show the risky areas. Others work during runtime and inspect the activity of the threads. They aren’t as common since the debug environment isn’t as representative of the “real world”.

Our production environments are remarkably complex and getting more complex every day. Detecting a race in that environment and verifying it is challenging…

Solving the Gap With Static Analyzers

The problem with static analyzers is that they often issue false-positive results. They point at a risk, not at something that necessarily occurs in reality. As a result, developers often discount their output. What if you could verify a potential race?

This is pretty easy to do with Lightrun. We can log the thread that is in the suspect block. If multiple threads enter the area at the same time, then there’s a race condition.

The easiest thing we can do is add a log entry such as this:

Create Log

The log prints “Thread {Thread.currentThread().getName()} entered”, we can add the corresponding “exited” version at the end of the piece of code. But this brings us to a fresh problem.

For simple cases, this will work well, but if we have many requests going through that code, we might run into throttling. Lightrun throttles actions when they take up too much CPU and, as a result, we might have uneven enter/exit printouts.

The solution has multiple parts. First, we can use logs as we did above to get the names of the threads that access this block.

Next, we need to verify that there’s a large volume of requests. For that, we can add a counter:

Create Counter

We can also narrow this further by limiting the counting to a specific thread e.g. for “Thread 1” we can set the condition to:

Thread.currentThread().getName().equals("Thread 1")

Finally, we can use a snapshot with multiple captures:

Create Snapshot

Notice the “Max Hit Count” below. It will trigger 20 separate hits. We can then review them and see the corresponding stack traces. If the path doesn’t include synchronization or includes a bad monitor, there could be a problem here.
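Outside of Lightrun, the same concurrent-entry check can be sketched with a plain atomic counter (illustrative only: deploying code like this is exactly the redeploy that Lightrun actions let us avoid):

```java
import java.util.concurrent.atomic.AtomicInteger;

class RaceProbe {
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxObserved = new AtomicInteger();

    // Wrap the suspect block; if maxObserved() ever exceeds 1, at least two
    // threads were inside the block at the same time and a race is possible.
    void enterSuspectBlock(Runnable block) {
        maxObserved.accumulateAndGet(inside.incrementAndGet(), Math::max);
        try {
            block.run();
        } finally {
            inside.decrementAndGet();
        }
    }

    int maxObserved() {
        return maxObserved.get();
    }
}
```

If two threads ever overlap inside the wrapped block, maxObserved() reaches 2, which is the signal the log-based approach above is hunting for.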

The Race We Don’t Know…

Concurrency is hard enough when we know what we’re looking for. Tracking a race when we don’t already have a clue about the direction is challenging.

Thankfully, we have encapsulation to the rescue. The worst races are those related to mutations of a memory location. All we need to do is place a log on the code that performs object mutation so it will print out the name of the current thread. With that, we can track the vast majority of potential races.
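By hand, that mutation log would look like the following sketch (with Lightrun, the equivalent log is added to the deployed setter without touching the source; the class is a hypothetical example):

```java
class TrackedUser {
    private volatile String name;

    // The single mutation point: logging here catches every writer thread.
    void setName(String name) {
        System.out.println("setName by thread " + Thread.currentThread().getName());
        this.name = name;
    }

    String getName() {
        return name;
    }
}
```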

But we can do even better than that!

Here we add a multi-hit snapshot that triggers when the setter method is invoked from a different thread.

The condition is:

Snapshot with different condition

!Thread.currentThread().getName().equals("Thread 1")

Notice the “NOT” operator in the beginning. The assumption is that only “Thread 1” makes mutations, so you will literally see the stack trace to the code that accesses this operation from a different thread. You can leave an action like that in a long running system for days (don’t forget to update expiry in the advanced mode) and the process will trigger this if there’s access to that block.

TL;DR

We deploy modern programs on a scale like never before. Even a simple function on Lambda can change the dynamics of a complex deployment and trigger a synchronization problem that we can’t see when debugging locally because of differences in connection times, environment, etc.

Race conditions can also be an attack vector which can lead to security issues. E.g. a memory location can be corrupted, privileges can be elevated, etc.

Debugging these problems in a local process is hard enough; tracking them in production is a Herculean task. Simultaneous requests in the environment can interfere with our tracking and, worse, we can trigger production problems if we aren’t careful. Thankfully, Lightrun eliminates these problems!

Since Lightrun is asynchronous and long running by default, it’s ideal for monitoring programs in production and reviewing access to various resources such as memory. You can start using Lightrun today, or request a demo to learn more.

The post Debugging Race Conditions in Production appeared first on Lightrun.

Understand Source Code – Deep into the Codebase, Locally and in Production
https://lightrun.com/understand-source-code-deep-into-the-codebase-locally-and-in-production/
Wed, 15 Jun 2022 18:29:23 +0000

Learn a new codebase by diving into it with debuggers to understand the full extent of internal semantics & interactions within the project.

Say you have a new code base to study or picked up an open source project. You might be a seasoned developer for whom this is another project in a packed resume. Alternatively, you might be a junior engineer for whom this is the first “real” project.

It doesn’t matter!

With completely new source code repositories, we still know nothing…

The seasoned senior might have a leg up in finding some things and recognizing patterns. But none of us can read and truly follow a project with 1M+ lines of code. We look over the docs, which in my experience usually bear only a passing resemblance to the actual codebase. We make assumptions about the various modules, but picking them up is hard.

We use IDE tools to search for connections, but this is really hard. It’s like following a yarn thread after an army of cats had its way with it…

As a consultant for over a decade, I picked up new client projects on a weekly basis. In this post, I’ll describe the approach I used to do that and how I adapted this approach further at Lightrun.

Finding Usage

The programming language can help a lot. As Java developers, we’re lucky: codebase exploration tools are remarkably reliable. We can dig into the code and find usages. IDEs highlight unused code, and they’re pretty great for this. But this approach has several problems:

  • We need to know where to look and how deeply
  • Code might be used by tests or by APIs that aren’t actually used by users
  • Flow is hard to understand via usage. Especially asynchronous flow
  • There’s no context such as data to help explain the flow of the code

There has to be a better way than randomly combing through source files.

UML Generation

Another option is UML chart generation from source files. My personal experience with these tools hasn’t been great. They’re supposed to help with the “big picture” but I often felt even more confused by these tools.

They elevate minor implementation details and unused code to equal footing in a mind-bogglingly confusing chart. With a typical codebase, we need more than a high-level view. The devil is in the details, and our perception should be of the actual codebase in version control, not some theoretical model.

Debugging as a Learning Tool

Debuggers instantly solve all these problems. We can verify assumptions, see “real world” usage, and step over a code block to understand the flow. We can place a breakpoint to see if we reached a piece of code. If it’s reached too frequently and we can’t figure out what’s going on, we can make the breakpoint conditional.

I make it a habit to read the values of variables in the watch when using a debugger to study the codebase.

Does this value make sense at this point in time?

If it doesn’t then I have something to look at and figure out. With this tool I can quickly understand the semantics of the codebase. In the following sections I’ll cover techniques for code learning both in the debugger and in production.

Notice I use Java, but this should work for any other programming language as the concepts are (mostly) universal.

Field Watchpoint

I think most developers know about field watchpoints and just forget about them!

“Who changed this value, and why?” is probably the most common question asked by developers. When we look through the code, there might be dozens of code flows that trigger a change. But placing a watchpoint on a field will tell you everything in seconds.

Understanding state mutation and propagation is probably the most important thing you can do when studying a codebase.

The Return Value

One of the most important things to understand when stepping through methods is the return value. Unfortunately, with debuggers, this information is often “lost” when returning from a method, so we might miss a critical part of the flow.

Luckily, most IDEs let us inspect the return value dynamically and see what the method returned from its execution. In JetBrains IDEs such as IntelliJ/IDEA, we can enable “Show Method Return Value” as I discuss here.

Show method return values

Flow Control as a Learning Tool

Why is this line needed?

What would happen if it wasn’t there?

That’s a pretty common set of questions. With a debugger, we can change control flow to jump to a specific line of code or force an early return from a method with a specific value. This can help us check situations around a specific line, e.g.: what if this method was invoked with value X instead of Y?

Simple. Just drag the execution back a bit and invoke the method again with a different value. This is much easier than reading into a deep hierarchy.

Force return

Keeping Track With Object Marking

Object Marking is one of those unknown debugger capabilities that’s invaluable and remarkably powerful. It has a big role in understanding “what the hell is going on”.

You know how, when debugging, you write down an object’s pointer so you can keep track of what’s going on in a code block?

This becomes very hard to keep track of, so we limit ourselves to very few pointers. Object marking lets us skip this by keeping a reference to a pointer under a fixed name. Even when the object goes out of scope, the reference marker remains valid. We can start tracking objects to understand the flow and see how things work. E.g. if we’re looking at a User object in the debugger and want to keep track of it, we can just keep a reference to it, then use conditional breakpoints with the user object in place to detect the areas of the system that access the user.

This is also remarkably useful in keeping track of threads, which can help in understanding code where the threading logic is complex.

Mark object

Check The Objects In Memory

A common situation I run into is a case where I see a component in the debugger. But I’ve been looking for a different instance of this object. E.g. if you have an object called UserMetaData. Does every user object have a corresponding UserMetaData object?

As a solution, we can use the memory inspection tool and see which objects of the given type are held in the memory!

Seeing actual object instance values and reviewing them helps put numbers/facts behind the objects. It’s a powerful tool that helps us visualize the data.

Object instance values in memory

Use Tracepoint Statements to Follow Complex Logic

During development, we often just add logs to see “was this line reached”. Obviously a breakpoint has advantages, but we don’t always want to stop. Stopping might change threading behavior, and it can also be pretty tedious.

But adding a log can be worse. Re-compile, re-run, and accidentally commit it to the repository. It’s a problematic tool for debugging and for studying code.

Thankfully, we have tracepoints which are effectively logs that let us print out expressions, etc.

Tracepoints

What’s Going on in Production – AKA “Reality Coverage”

This works great for “simple” systems. But there are platforms and settings in our industry that are remarkably hard to reproduce in a debugger. Knowledge about the way our code works locally is one thing. The way it works in production is something completely different.

Production is the one thing that really matters and in multi-developer projects, it’s really hard to evaluate the gap between production and assumption.

We call this reality coverage. E.g. you can get 80% coverage in your tests, but if your coverage is low on the classes that are heavily used in practice, the QA might be less effective. We can study the repo over and over. We can use every code analysis tool and setting. But they won’t show us the two things that really matter:

Is this actually used in production?

How is this used in production?

Without this information, we might waste our time. When dealing with millions of lines in the repo, you don’t want to spend your study time reading a method that isn’t heavily used.

In order to get an insight into production and debug that environment, we’ll need a developer observability tool, like Lightrun. You can sign up for a Lightrun account here.

Measuring with Counters

A counter lets us see how frequently a line of code was reached. This is one of the most valuable tools at our disposal.

Is this method even reached?

Is this block in the code reached? How often?

If you want to understand where to focus your energies first, the counter is probably the most handy tool at your disposal. You can read about counters here.

Creating a counter with Lightrun

Assumption Verification with Conditions

We often look at a statement and make various assumptions. When the code is new to us, these assumptions might be pivotal to our understanding of it. A good example is an assumption like “most of the users who use this feature have been with the system for a while and should be familiar with it.”

You can test that with conditional statements that you can attach to any action (logs, counters, snapshots, etc.). As a result, we can use a condition like `user.signupDate.getMillis() < …`.

You can add this to a counter and literally count the users that don’t match your expectations.
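In plain Java, the same assumption check amounts to counting with a predicate; the cutoff date, class, and field names below are hypothetical stand-ins for the condition shown above:

```java
import java.time.Instant;
import java.util.List;

class AssumptionCheck {
    record User(String id, Instant signupDate) {}

    // Count users that contradict the assumption "users of this feature signed
    // up before the cutoff": the same predicate a Lightrun counter condition would use.
    static long usersNewerThan(List<User> users, Instant cutoff) {
        return users.stream()
                .filter(u -> u.signupDate().isAfter(cutoff))
                .count();
    }
}
```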

Logs and Piping to Learn without the Noise

I think it’s obvious how injecting a log in runtime can make a vast difference to understanding our system. But in production, this comes at a price. I’m studying the system while looking at the logs and all my “methodX reached with value Y” logs add noise to our poor DevOps/SRE teams.

We can’t have that. Studying something should be solitary, but by definition, production is the exact opposite of that…

With piping we can log everything locally to the IDE and spare everyone else the noise. Since the logic is sandboxed, there will be no overhead if you log too much. So go wild!

Piping logs to the IDE with Lightrun

Snapshots for Easier Learning

One of the enormous challenges in learning is understanding the things we don’t know “yet”. Snapshots help us get a bigger picture of the code. Snapshots are like placing any breakpoint and reviewing the values of the variables in the stack to see if we understand the concepts. They just “don’t break” so you get all the information you can use to study, but the system keeps performing as usual, including threading behavior.

Again, using conditional snapshots is very helpful in pinpointing a specific question. E.g. what’s going on when a user with permission X uses this method?

Simple, place a conditional snapshot for the permission. Then inspect the resulting snapshot to see how variable values are affected along the way.

Viewing a Lightrun snapshot

Final Word

Developers often have a strained relationship with debugging tools.

On the one hand, they’re often the tool that saves our bacon and helps us find the bug. On the other hand, they’re the tool we look at when we realize we were complete morons for the past few hours.

I hope this guide will encourage you to pick up debuggers when you don’t have a bug. The insights provided by running a debugger, even when you aren’t actively debugging, can be “game changing”.

The post Understand Source Code – Deep into the Codebase, Locally and in Production appeared first on Lightrun.

Top 8 VS Code Python Extensions
https://lightrun.com/vscode-python-extensions/
Thu, 23 Jun 2022 04:21:55 +0000

The post Top 8 VS Code Python Extensions appeared first on Lightrun.

Visual Studio Code (a.k.a. VS Code or VSCode) is an open-source, cross-platform source code editor. It was ranked the most popular development tool in the Stack Overflow 2021 Developer Survey, with 70% of the respondents using it as their primary editor. Out of the box, VS Code supports a few programming languages like JavaScript and TypeScript. Still, you need VS Code extensions if you want to use any other programming language or take advantage of extra tools to improve your code.

Python is one of the top programming languages used by developers worldwide for creating a variety of programs, from simple to scientific applications. But VS Code does not support Python directly. Therefore, if you want to use Python in VS Code, it is important to add good Python extensions. Luckily, there are many options available. However, the biggest challenge is finding the most complete and suitable extensions for your requirements.

Top 8 VS Code Python Extensions

To simplify your search for the most suitable Python extensions for your needs, we put together a list of the top 8 VS Code Python extensions available on the market:

1. Python (Microsoft)

Python VSCode extensions: the official Python extension

The Python VS Code extension developed by Microsoft is a feature-rich Python extension that is completely free. VS Code will automatically suggest this extension when you start to create a .py file. Its IntelliSense feature enables useful functionality like code auto-completion, navigation, and syntax checking.

When you install it, the Python VS Code extension will automatically add the Pylance extension, which gives you rich language support, and the Jupyter extension for using Jupyter notebooks. To run tests, you can also use unittest or pytest through its Test Explorer feature. Other valuable capabilities include code debugging, formatting, refactoring, and automatic switching between various Python environments.

2. Lightrun

Python VSCode extensions: Lightrun

Lightrun is a real-time debugging platform that supports applications written in several languages, including Python, and is available as a VS Code extension. It consists of an intuitive interface for you to add logs, traces, and metrics in real-time for debugging code in production. You can add Lightrun snapshots to explore the stack trace and variables without stopping your live application.

Also, you can add real-time performance metrics to measure your code’s performance and synchronization, which will allow you to find performance bottlenecks in your applications. Lightrun supports multi-instance applications such as microservices and big data workers with a tagging mechanism. Lightrun is a commercial product, but it comes with a 14-day free trial.

3. Python Preview

Python VSCode extensions: Python Preview

This VS Code extension helps understand and debug Python code faster by visualizing code execution with animations and graphics. It previews object allocation and stack frames side-by-side with your code editor, even before you start debugging. This Python extension is also 100% free to use.

4. Better Comments

Python VSCode extensions: Better Comments

Comments are critical for any code as they help the developers understand the code better. The Better Comments Python extension is slightly different than the others. It focuses solely on making more human-friendly and readable comments for your Python code. With this extension, you can organize your annotations and improve code clarity. You can use several categories and colors to categorize your annotations—for example, Alerts, Queries, TODOs, and Highlights.

You can also mark commented-out code in a different style and set various comment settings as you see fit. This free VS Code extension also supports many other programming languages.

5. Python Test Explorer

Python VSCode extensions: Python Test Explorer

When developing an application, testing is a must to maintain code quality, and you will have to use different types of test frameworks. The Python Test Explorer extension for VS Code lets you run Unittest, Pytest, or Testplan tests.

The Test Explorer extension will show you a complete view of tests and test suites along with their state in VS Code’s sidebar. You can easily see which tests are failing and focus on fixing them.

In addition, this VS Code extension supports convenient error reporting. It will indicate tests having errors, and you can see the complete error message by clicking on them. If you are working with multiple project folders in VS Code, it enables you to run tests on such multi-root workspaces.

6. Python Indent

Python VSCode extensions: Python Indent

Having the correct indentation is vital when developing in Python, and adding closing brackets can sometimes get cumbersome. The Python Indent extension helps you maintain proper Python indentation in VS Code. This extension adds closing brackets automatically when you press the Tab key, which speeds up coding and enables you to save a lot of your valuable time.

It can also indent keywords, extend comments and trim whitespace lines. This free VS Code Python extension works by registering the Enter key as a keyboard shortcut, though sometimes it can unexpectedly override the Enter behavior.

7. Python Snippets 3

Python VSCode extensions: Python Snippets 3

Python Snippets 3 is a helpful VS Code extension that makes Python code snippets available while you are typing. It provides snippets like built-in strings, lists, sets, tuples, and dictionaries. Other code snippets include if/else, for, while, while/else, try/catch, etc.

There are also Python snippets for Object-Oriented Programming concepts such as inheritance, encapsulation, polymorphism, etc. Since this VS Code extension provides many Python code examples, it is helpful for beginners. However, note that this extension can sometimes add incorrect tab spaces.

8. Bracket Pair Colorizer 2 (CoenraadS)

Python VSCode extensions: Bracket Pair Colorizer

Bracket Pair Colorizer 2 is another VS Code Python extension that lets developers quickly identify which brackets pair with each other and makes it easier to read code. Matching brackets are highlighted with colors, and you can set tokens and colors that you want to use. This free VS Code extension can be even more helpful if your Python code contains nested conditions and loops.

Although marked as deprecated, this extension is still popular, and many users prefer it to the native Python bracket matching functionality that has since been added to VS Code.

Write Better Python in VS Code

The VS Code Python extensions we discussed here provide helpful features like automatic code completion, test running, indentation, useful snippets to learn Python, and adding different kinds of comments. These extensions help make code more accurate, improve readability, and detect bugs in the system.

One of these Python extensions, Lightrun, enables a robust Python debugging experience in production right from VS Code, and if this is what you need, get started with Lightrun today.

The post Top 8 VS Code Python Extensions appeared first on Lightrun.

Top 8 IntelliJ Debug Shortcuts (https://lightrun.com/intellij-debug-shortcuts/, Mon, 06 Jun 2022)
Let’s get real – as developers, we spend a significant amount of time staring at a screen and trying to figure out why our code isn’t working. According to Coralogix, there are an average of 70 bugs per 1000 lines of code. That’s a solid 7% worth of blips, bumps, and bugs. In addition to this, fixing a bug can take 30 times longer than writing an actual line of code. But it doesn’t have to be this way. If you’re using IntelliJ (or are thinking about making the switch to it), the built-in debugger and its shortcuts can help speed up the process. But first, what is IntelliJ?

What is IntelliJ?

If you’re looking for a great Java IDE, you should check out IntelliJ IDEA. It’s a robust, feature-rich IDE perfect for developing Java applications. While VSCode is excellent in many situations, IntelliJ is designed for Java applications. Here’s a quick overview of IntelliJ IDEA and why it’s so great.

IntelliJ IDEA is a Java IDE developed by JetBrains. It’s a commercial product, but a free community edition is available. Some of the features include:

  • Intelligent code completion
  • Refactoring
  • Code analysis
  • Support for various frameworks and libraries
  • Great debugger

Debugging code in IntelliJ

If you’re a developer, you will have to debug code sooner or later. But what exactly is debugging? And why do we do it?

Debugging is the process of identifying and removing errors from a computer program. Errors can be caused by incorrect code, hardware faults, or software bugs. When you find a bug, the first thing you need to do is try to reproduce it so you can narrow down the problem and identify the root cause. Once you’ve reproduced the bug, you can start to debug the code.

Debugging is typically done by running a program in a debugger, which is a tool that allows the programmer to step through the code, line by line. The debugger will show the values of variables and allow programmers to change them so they can find errors and fix them.

The general process of debugging follows this flow:

  • identify the bug
  • reproduce the bug
  • narrow down where in the code the bug is occurring
  • understand why the bug exists
  • fix the bug

Debugging process

More often than not, we spend our time on the second and third steps. Statistically, we spend approximately 75% of our time just debugging code. In the US, $113B is spent on developers trying to figure out the what, where, why, and how of existing bugs. Leveraging the IDE’s built-in features will allow you to condense the debugging process.

Sure, using a debugger will slow down the execution of the code, but most of the time, you don’t need it to run at the same snail’s pace through the entire process. The shortcut controls allow you to observe the inner workings of your code at the rate you need.

Without further ado – here are the top 8 IntelliJ debug shortcuts, what they do and how they can help speed up the debugging process.

Top 8 IntelliJ Debug Shortcuts

1. Step Over (F8)

Stepping is the process of executing a program one line at a time. Stepping helps with the debugging process by allowing the programmer to see the effects of each line of code as it is executed. Stepping can be done manually by setting breakpoints and running the program one line at a time or automatically by using a debugger tool that will execute the program one line at a time.

Step Over (F8) takes you to the following line without entering any method called on the current line. This can be helpful when you need to quickly pass through the code, hunt down specific variables, and figure out at what point the program exhibits undesired behavior.

Step Over (F8)

2. Step into (F7)

Step into (F7) will take the debugger inside the method to demonstrate what gets executed and how variables change throughout the process.

This functionality is helpful if you want to narrow down where in a method’s execution values are transformed incorrectly.

Step into (F7)

3. Smart step into (Shift + F7)

Sometimes multiple methods are called on the line. Smart step into (Shift + F7) lets you decide which one to invoke, which is helpful as it enables you to target potential problematic methods or go through a clear process of elimination.

Smart step into (Shift + F7)
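To make the difference concrete, here is a toy example (the square and add methods are invented for illustration). With the caret on the line in main, plain Step Into enters square(2) first, while Smart Step Into lists all three calls and lets you jump straight into add:

```java
public class SmartStepDemo {

    static int square(int x) {
        // Plain Step Into (F7) lands here first, because square(2)
        // is the first call evaluated on the line in main()
        return x * x;
    }

    static int add(int a, int b) {
        // Smart Step Into (Shift + F7) lets you jump straight here,
        // skipping both square() calls
        return a + b;
    }

    public static void main(String[] args) {
        // Three method calls share a single line; Smart Step Into
        // shows all of them and lets you choose which one to enter
        int result = add(square(2), square(3));
        System.out.println(result); // prints 13
    }
}
```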

4. Step out (Shift + F8)

At some point, you will want to exit the method. The step out (Shift + F8) functionality will take you to the call method and back up the hierarchy branch of your code.

Step out (Shift + F8)

5. Run to cursor (Alt + F9)

As an alternative to setting manual breakpoints, you can use your cursor as the marker for the debugger.

Run to cursor (Alt + F9) will let the debugger run until it reaches where your cursor is pointing. This step can be helpful when you are scrolling through code and want to quickly pinpoint issues without the need to set a manual breakpoint.

6. Evaluate expression (Alt + F8)

It’s one thing to run your code at the speed you need; it’s another to see what’s happening at each step. Under normal circumstances, hovering your cursor over the expression will give you a tooltip.

But sometimes, you just need more details. Using the evaluate expression shortcut (Alt + F8) will reveal the child elements of the object, which can help obtain state transparency.

Evaluate expression (Alt + F8)

7. Resume program (F9)

Debugging is a constant stop-and-start process. F9 toggles this process, kickstarting the debugger back into gear and running it to the next breakpoint.

For Mac, a key chord (Cmd + Alt + R) is required to resume the program.

8. Toggle (Ctrl + F8) & view breakpoints (Ctrl + Shift + F8)

Breakpoints can get nested inside methods – which can be a hassle to look at if you want to step out and see the bigger picture. This is where the ability to toggle breakpoints comes in.

You can toggle line breakpoints with Ctrl+F8. Alternatively, if you want to view and set exception breakpoints, you can use Ctrl+Shift+F8.

For macOS, the key chords are:

  • Toggle – Cmd + F8
  • View breakpoints – Cmd + Shift + F8

Toggle (Ctrl + F8) & view breakpoints (Ctrl + Shift + F8)

Improving the debugging process

If you’re a software engineer, you know that debugging is essential for the development process. It can be time-consuming and frustrating, but it’s necessary to ensure that your code is working correctly.

Fortunately, there are ways to improve the debugging process, and one of them is by using Lightrun. Lightrun is a cloud-based debugging platform you can use to debug code in real time. It is designed to make the debugging process easier and more efficient, and it can be used with several programming languages.

One of the great things about Lightrun is that you can use it to debug code in production, which means that you can find and fix bugs in your code before your users do. Lightrun can also provide a visual representation of the code being debugged. This can help understand what is going on and identify the root cause of the problem. Start using Lightrun today!

Spring Transaction Debugging in Production with Lightrun (https://lightrun.com/spring-transaction-debugging-in-production-with-lightrun/, Mon, 18 Apr 2022)
Spring makes building a reliable application much easier thanks to its declarative transaction management. It also supports programmatic transaction management, but that’s not as common. In this article, I want to focus on the declarative transaction management angle, since it seems much harder to debug compared to the programmatic approach.

This is partially true. We can’t put a breakpoint on a transactional annotation. But I’m getting ahead of myself.

What is Spring’s Method Declarative Transaction Management?

When writing a Spring method or class, we can use annotations to declare that a method or a bean (class) is transactional. The annotation lets us tune transactional semantics using attributes, defining behavior such as:

  • Transaction isolation levels – lets us address issues such as dirty reads, non-repeatable reads, phantom reads, etc.
  • Transaction Manager
  • Propagation behavior – we can define whether the transaction is mandatory, required, etc. This shows whether the method expects to receive a transaction and how it behaves
  • readOnly attribute – the DB does not always support a read-only transaction. But when it is supported, it’s an excellent performance/reliability tuning feature

And much more.

Isn’t the Transaction Related to the Database Driver?

The concept of transactional methods is very confusing to new Spring developers. Transactions are a feature of the database driver/JDBC connection, not of a method. So why declare it on the method?

There’s more to it. Other features, such as message queues, are also transactional. We might work with multiple databases. In those cases, if one transaction is rolled back, we need to roll back all the underlying transactions. As a result, we do the transaction management in user code, and Spring seamlessly propagates it to the various underlying transactional resources.

How can we Write Programmatic Transaction Management if we don’t use the Database API?

Spring includes a transaction manager that exposes the APIs we typically expect to see: begin, commit, and rollback. This manager includes all the logic to orchestrate the various resources.

You can inject that manager into a typical Spring class, but it’s much easier to just write declarative transaction management like this Java code:

@Transactional
public void myMethod() {
    // ...
}

I used the annotation on the method level, but I could have placed it on the class level. The class-level annotation defines the default, which individual methods can override.

This allows for extreme flexibility and is great for separating business code from low level JDBC transaction details.

Dynamic Proxy, Aspect Oriented Programming and Annotations

The key to debugging transactions is the way Spring implements this logic. Spring uses a proxy mechanism to implement its declarative, aspect-oriented programming capabilities. Effectively, this means that when you invoke myMethod on MyObject or MyClass, Spring creates a proxy class and a proxy object instance that sit between the caller and your code.

Spring routes your invocation through the proxy types which implement all the declarative annotations. As such, a transactional proxy takes care of validating the transaction status and enforcing it.
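The mechanism can be sketched with the JDK’s own dynamic proxies, one of the techniques Spring uses under the hood. In this simplified, self-contained sketch (the UserService interface and the events list are invented for illustration), the proxy wraps every call with begin/commit/rollback logic, just as a transactional proxy would:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class TxProxyDemo {
    // Hypothetical service interface; in Spring this would be your @Transactional bean
    interface UserService {
        String findById(long id);
    }

    // Records the transaction lifecycle so we can observe the proxy at work
    static final List<String> events = new ArrayList<>();

    static UserService transactional(UserService target) {
        InvocationHandler handler = (proxy, method, args) -> {
            events.add("begin");                  // what a transaction manager would do
            try {
                Object result = method.invoke(target, args);
                events.add("commit");             // commit on success
                return result;
            } catch (Exception e) {
                events.add("rollback");           // roll back on failure
                throw e;
            }
        };
        return (UserService) Proxy.newProxyInstance(
                UserService.class.getClassLoader(),
                new Class<?>[] {UserService.class},
                handler);
    }

    public static void main(String[] args) {
        UserService service = transactional(id -> "user-" + id);
        System.out.println(service.findById(42L)); // routed through the proxy
        System.out.println(events);                // [begin, commit]
    }
}
```

This is also why the transactional logic never shows up in your own source: it lives in generated proxy code that simply delegates to your object.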

Debugging a Spring Transaction Management using Lightrun

IMPORTANT: I assume you’re familiar with Lightrun basics. If not, please read this.

Programmatic transaction management is trivial. We can just place a snapshot where it begins or is rolled back to get the status.

But if an annotation fails, the method won’t be invoked and we won’t get a callback.

Annotations aren’t magic, though. Spring uses a proxy object, as we discussed above. That proxy mechanism invokes generic code, to which we can bind a snapshot. Once we bind a snapshot there, we can detect the proxy types in the stack. Unfortunately, debugging proxies is problematic since there’s no physical code to debug; everything is generated dynamically at runtime. Fortunately, this isn’t a big deal, as we have enough hooks for debugging without it.

Finding the Actual Transaction Class

The first thing we need to do is look for the class that implements transaction functionality. Opening the IntelliJ/IDEA class view (Command-O or CTRL-O) lets us locate a class by name. Typing in “Transaction” resulted in the following view:

Looking for the TransactionAspectSupport class

This might seem like a lot, but we need a concrete public class. So annotations and interfaces can be ignored. Since we only care about Spring classes, we can ignore other packages. Still, the class we are looking for was relatively low in the list, so it took me some time to find it.

In this case, the interesting class is TransactionAspectSupport. Once we open the class, we need to select the option to download the class source code.

Once this is done, we can look for an applicable public method. getTransactionManager seemed perfect, but it’s a bit too bare. Placing a snapshot there provided me a hint:

Placing a snapshot in getTransactionManager

I don’t have much information here but the invokeWithinTransaction method up the stack is perfect!

Moving on to that method, I would like to track information specific to a transaction on the findById method:

Creating a conditional snapshot

To limit the scope only to findById we add the condition:

method.getName().equals("findById")

Once the method is hit, we can see the details of the transaction in the stack.

If you scroll further in the method, you can see ideal locations to set snapshots in case of an exception in thread, etc. This is a great central point to debug transaction failures.

One of the nice things with snapshots is that they can easily debug concurrent transactions. Their non-blocking nature makes them the ideal tool for that.

Summary

Declarative configuration in Spring makes transactional operations much easier. This significantly simplifies the development of applications and separates the object logic from low level transactional behavior details.

Spring uses class-based proxies to implement annotations. Because they are generated, we can’t really debug them directly, but we can debug the classes they use internally. Specifically, TransactionAspectSupport is a great example.

An immense advantage of Lightrun is that it doesn’t suspend the current thread. This means issues related to concurrency can be reproduced in Lightrun.

You can start using Lightrun today, or request a demo to learn more.

Spring Boot Performance Workshop with Vlad Mihalcea (https://lightrun.com/spring-boot-performance-workshop-with-vlad-mihalcea/, Wed, 08 Jun 2022)
A couple of weeks ago, we had a great time hosting the workshop you can see below with Vlad Mihalcea. It was loads of fun and I hope to do this again soon!

In this workshop we focused on Spring Boot performance but most importantly on Hibernate performance, which is a common issue in production environments. It’s especially hard to track since issues related to data are often hard to perceive when debugging locally. When we have “real world” data at scale, they suddenly balloon and become major issues.

I’ll start this post by recapping many of the highlights in the talk and conclude by answering some questions we missed. We plan to do a second part of this talk because there were so many things we never got around to covering!

The Problem with show-sql

After the brief introduction, we dove right into the problem with show-sql. It’s pretty common for developers to enable the spring.jpa.show-sql setting in the configuration file. By setting this to true, we will see all SQL statements performed by Hibernate printed on the console. This is very helpful for debugging performance issues, as we can see exactly what’s going on in the database.

But it doesn’t log the SQL query. It prints it on the console!

Why do we Use Loggers?

This triggered the question to the audience: why does it matter if we use a logger and not System.out?

Common answers in the chat included:

  • System.out is slow – it has a performance overhead. But so does logging
  • System.out is blocking – so are most logging implementations, though you could use an asynchronous logger
  • No persistence – you can redirect the output of a process to a file

The reason is the fine-grained control and metadata that loggers provide. Loggers let us filter logs based on log level, packages, etc.

They let us attach metadata to a request using tools like MDC, which are absolutely amazing. You can also pipe logs to multiple destinations and output them in ingestible formats such as JSON, so they include proper metadata when you view all the logs from all the servers (e.g. on Elastic).

Show-sql is Just System Output

It includes no context. It’s possible it won’t get into your Elastic output, and even if it does, you will have no context. It will be impossible to tell whether a query was triggered because of request X or Y.

Another problem here is the question marks in the SQL. There’s very limited context to work with. We want to see the variable values, not question marks.
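For reference, Hibernate’s SQL can be routed through the logging framework instead of show-sql, so statements and bind values get proper log metadata. A common Spring Boot configuration for Hibernate 5.x (logger category names vary between Hibernate versions):

```properties
# Log SQL statements through the logger instead of printing to the console
logging.level.org.hibernate.SQL=debug
# Log the bind parameter values that replace the question marks
logging.level.org.hibernate.type.descriptor.sql.BasicBinder=trace
```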

Adding a Log with Lightrun

Lightrun lets you add a new log to a production application without changing the source code. We can just open the Hibernate file “Loader.java” and add a new log to executeQueryStatement.

Adding a log with Lightrun

We can fill out the log statements in the dialog that prompts us. Notice we can use curly braces to write Java expressions, e.g. variable names, method calls, etc.

These expressions execute in a sandbox which guarantees that they will not affect the application state. The sandbox guarantees read only state!

Once we click OK, we can see the log appear in the IDE. Notice that no code changed, but this will act as if you wrote a logger statement in that line. So logs will be integrated with other logs.

Lightrun log annotation in the editor

Notice that we print both the statement and the arguments so the log output will include everything we need. You might be concerned that this weighs too heavily on the CPU and you would be right. Lightrun detects overuse of the CPU and suspends expensive operations temporarily to keep execution time in check. This prevents you from accidentally performing an overly expensive operation.

Logpoints and quotas

You can see the log was printed with the full content on top but then suspended to prevent CPU overhead. This means you won’t have a performance problem when investigating performance issues…

You still get to see the query, and values sent to the database server.

Log Piping

One of the biggest benefits of Lightrun’s logging capability is its ability to integrate with other log statements written in the code. When you look at the log file, the Lightrun added statements will appear “in-order” with the log statements written in code.

As if you wrote the statement, recompiled and uploaded a new version. But this isn’t what you want in all cases.

If there are many people working on the source code and you want to investigate an issue, logging might be an issue. You might not want to pollute the main log file with your “debug prints”. This is the case for which we have Log Piping.

Log piping lets us determine where we want the log to go. We can choose to pipe logs to the plugin and in such a case, the log won’t appear with the other application logs. This way, a developer can track an issue without polluting the sanctity of the log.

Spring Boot Connection Acquisition

Spring Boot connection acquisition

Ideally, we should establish the relational database connection at the very last moment and release it as soon as possible to increase database throughput. In JDBC, connections are in auto-commit mode by default, and this doesn’t work well with the JPA transactions in Spring Boot.

Unfortunately, we have a chicken-and-egg problem. Spring Boot needs to disable auto-commit, and in order to do that, it needs a database connection. So it connects to the database just to turn off a flag that should have been off to begin with.

This can seriously affect performance and throughput, as some requests might be blocked waiting for a database connection from the pool.

Logging eager connections

If this log is printed, we have a problem in our auto-commit configuration. Once we know that, the rest is pretty easy. We need to add two properties: one disables auto-commit, and the other tells Hibernate that we disabled it. Once those are set, performance should improve.

DB connection configuration
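The two settings in question, assuming HikariCP (Spring Boot’s default connection pool) and Hibernate 5.x property names:

```properties
# Hand out pooled connections with auto-commit already disabled
spring.datasource.hikari.auto-commit=false
# Tell Hibernate auto-commit is off so it can delay connection acquisition
spring.jpa.properties.hibernate.connection.provider_disables_autocommit=true
```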

Query Plan Cache

About compiling JPQL

Compiling JPQL to native SQL code takes time. Hibernate caches the results to save CPU time.

A cache miss in this case has an enormous impact on performance, as evidenced by the chart below:

How a cache miss impacts performance

This can seriously affect the query execution time and the response time of the whole service.
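If plan cache misses are frequent, the cache can be enlarged. A hedged example for Spring Boot with Hibernate 5.x (the sizes shown are illustrative, not recommendations):

```properties
# Expose Hibernate's Statistics class so cache hits and misses can be inspected
spring.jpa.properties.hibernate.generate_statistics=true
# Grow the compiled query plan cache (default is 2048 entries)
spring.jpa.properties.hibernate.query.plan_cache_max_size=4096
# Grow the parameter metadata cache (default is 128 entries)
spring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size=256
```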

Hibernate has a statistics class which collects all of this information. We can use it to detect problematic areas and, in this case, add a snapshot into the class.

Snapshots

A Snapshot (AKA Non-breaking breakpoint or Capture) is a breakpoint that doesn’t stop the program execution. It includes the stack trace, variable values in every stack frame, etc. It then presents these details to us in a UI very similar to the IDE breakpoint UI.

We can traverse the source code by clicking the stack frames and see the variable values. We can add watch entries and most importantly: we can create conditional snapshots (this also applies to logs and metrics).

Conditional snapshots let us trigger the snapshot only if a particular condition is met. A common problem is when a bug in a system is experienced by a specific user only. We can use a conditional snapshot to get stack information only for that specific user.

Eager Fetch

When we look at logs for SQL queries, we can often see that the database fetches a lot more than what we initially asked for. That’s because the default fetch type of JPA to-one relations is EAGER. This is a problem in the specification itself. We can achieve a significant performance improvement by explicitly setting the fetch type to LAZY.

About eager fetching

We can detect these problems by placing a snapshot in the loadFromDatasource() method of DefaultLoadEventListener.

In this case, we use a conditional snapshot with the condition: event.isAssociationFetch().

Creating a conditional snapshot in Lightrun

As a result, the snapshot will only trigger when we have an eager association, which is usually a bug. It means we forgot to include the LAZY argument to the annotation.

As you can see, this got triggered with a full stack trace and the information about the entity that has such a relation.

Stack frames shown in Lightrun Snapshots view

You can use this approach to detect incorrect lazy fetches as well. Multiple lazy fetches can be worse than a single eager fetch, so we need to be vigilant.

Open Session in View Anti-Pattern

Open session in view

On the surface, it doesn’t seem like we’re doing anything wrong. We’re just fetching data from the database and returning it to the client. But the transaction context has already finished when the controller returns, and as a result, we fetch from the database all over again. We need to do an additional query, the data might be stale, the isolation level might be broken, and many bugs other than performance problems might arise.

This creates an N+1 problem of unnecessary queries!

We can detect this problem by placing a snapshot on the onInitializeCollection call and seeing the open session:

Lightrun snapshot reveals an open session

Now that we can see the problem happening, we can solve it by defining spring.jpa.open-in-view=false

It will block you from using this approach.

Q&A

There were many brilliant questions as part of the session. Here are the answers.

Could you please describe a little bit about Lightrun?

Lightrun is a developer observability platform. As such, it lets you debug production safely and securely while keeping a tight lid on CPU usage. It includes the following pieces:

  • Client – IDE Plugin/Command Line
  • Management Server
  • Agent – running on your server to enable the capabilities

I wrote about it in depth here.

Could Lightrun Work Offline?

Since you’re debugging production, we assume your server isn’t offline.

However, Lightrun can be deployed on-premise, which removes the need for an open to the Internet environment.

Wondering about this sample, will this be available for our reference?

The code is all here.

As the Instrumentation/manipulation happens via a Server, given that I do not host the instrumentation server myself, what kind and what amount of data is being transmitted? Is the data secured or encrypted in any way?

The instrumentation happens on your server using the agent.

The Lightrun server has no access to your source code or bytecode!

Source code or bytecode never goes on the wire at any stage and Lightrun is never exposed to it.

All transmissions are secured and encrypted. Certificates are pinned to avoid a man-in-the-middle attack. The Lightrun architecture received multiple rounds of deep security reviews and is running in multiple Fortune 500 companies.

Finally, all operations in Lightrun are logged in an administrator log, which means you can track every operation that was performed and have a full post mortem trail.

You can read more about Lightrun security here.

As mentioned, these logs are aged out in 1 hr. Is it possible to save those and re-use them for later use rather than creating log entries manually every time?

Lightrun actions default to expire after 1 hour to remove any potential unintentional overhead. You can set this number much higher, which is useful for hard to reproduce bugs.

Notice that when an action is expired, you can just click it and re-create it. It will appear in red within the IDE and can still be used for reference.

Is IntelliJ IDEA the only way to add breakpoints/logging? Or how is debugging with Lightrun done in production?

You can use IntelliJ (also PyCharm and WebStorm) as well as VSCode, VSCode.dev and the command line.

These connect to production through the Lightrun server. The goal is to make you feel as if you’re debugging a local app while extracting production data. Without the implied risks.

Is there any case where eager loading should be configured always for One-to-Many or Many-to-Many or Many-to-One relations? I always configure lazy loading for the above relations. Is it okay?

Yes. If you see that you keep fetching the other entity, then eager loading for this case makes sense. Having eagerness as the default makes little sense for most cases.

Do we need to restart an application with the javaagent?

The agent would run in the background constantly. It’s secure and doesn’t have overhead when it isn’t used.

If we are using other instrumentation tools like say AppDynamics or dynatrace …… does this work alongside?

This varies based on the tool. Most APMs work fine alongside Lightrun because they hook into different capabilities of the JVM.

Does this work with GraalVM?

Not at this time since GraalVM doesn’t support the javaagent argument. We’re looking for alternative approaches, but hopefully the GraalVM team will have some solutions.

Is it free to use?

Using Lightrun comes at a cost, but a free trial is available to everyone.

Does it impact app performance?

Yes, but it’s minimal. Under 0.5% when no actions are used, and under 8% with multiple actions. Notice you can tune the amount of overhead in the agent configuration.

Does it work for Scala and Kotlin?

Yes.

How to use it in production without IDE?

The IDE will work even for production, since you don’t connect directly to the production servers and don’t have access to them. The IDE connects to the Lightrun management server only. This lets your production servers remain segregated.

Having said that, you can still use the command-line interface to get all the features discussed here and much more.

Apart from injecting loggers, what other stuff can we do?

The snapshot lets you get full stack traces with the values of all the variables in the stack and object instance state. You can also include custom watch expressions as part of the snapshot.

Metrics let you add counters (how many times did we reach this line), tictocs (how much time did it take to perform this block), method duration (similar to tictocs but for the whole method) and custom metrics.

You can also add conditions to each one of those to narrowly segment the data.

How do we hide sensitive properties from beans? Say Credit card number of user?

Lightrun supports PII Reduction, which lets you define a mask (e.g. credit card numbers) so that matching data is removed before going into the logs. This blocks inadvertent injection of sensitive data into the logs.

It also supports blocklists, which let you block a file/class/group from actions. This means a developer won’t be able to place a log or snapshot there.
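The masking idea itself is simple. As a rough sketch of the concept (an illustration only, not Lightrun's actual implementation; the pattern and placeholder are made up):

```python
import re

# Hypothetical mask: anything that looks like a 13-16 digit card number,
# optionally separated by spaces or dashes, is redacted before logging.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def redact(message):
    """Replace card-number-like sequences with a placeholder."""
    return CARD_PATTERN.sub("****-REDACTED-****", message)

print(redact("charged card 4111 1111 1111 1111 for $20"))
# charged card ****-REDACTED-**** for $20
```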

How can we use it for performance testing?

I made a tutorial on this here.

When working air gapped on prem is required, how do you provide the Server, as a jar or docker…?

This is something our team helps you set up.

Will it consume much more memory if we run with the Lightrun agent?

This is minimal. Running the petclinic demo on my Mac, the system monitor showed a difference of roughly 17 MB between running with and without the agent. At these scales, that's practically within the margin of error, so it's unclear whether the agent has any memory overhead at all.

Finally

This has been so much fun and we can’t wait to do it again. Please follow Vlad, Tom, and myself for updates on all of this.

There are so many things we didn’t have time to cover that go well beyond slow queries and spring data nuances. We had a really cool demo of piping metrics to Grafana that we’d love to show you next time around.

The post Spring Boot Performance Workshop with Vlad Mihalcea appeared first on Lightrun.

]]>
OpenTracing vs. OpenTelemetry https://lightrun.com/opentracing-vs-opentelemetry/ Sat, 18 Jun 2022 12:39:21 +0000 https://lightrun.com/?p=517 Monitoring and observability have increased with software applications moving from monolithic to distributed microservice architectures. While observability and application monitoring share similar definitions, they also have some differences

The post OpenTracing vs. OpenTelemetry appeared first on Lightrun.

]]>
Monitoring and observability have grown in importance as software applications move from monolithic to distributed microservice architectures. While observability and application monitoring share a similar definition, they also have some differences. 

The purpose of both monitoring and observability is to find issues in an application. However, monitoring aims to capture already known issues and display them on a dashboard to understand their root cause and the time they occurred. 

On the other hand, observability takes a much lower-level approach, where developers debug the code to understand the internal state of an application. Thus, observability is the latest evolution of application monitoring, helping detect previously unknown issues.

Observability vs Monitoring

Three pillars facilitate observability. They are logs, metrics, and traces. 

  • Metrics indicate that there is an issue.
  • Traces tell you where the issue is. 
  • Logs help you to find the root cause. 

Observability offers several benefits, and adoption is growing quickly.

According to Gartner, by 2024, 30% of enterprises will use observability to improve the performance of their digital businesses, up from less than 10% in 2020.

What is OpenTracing?

Logs help understand what is happening in an application. Most applications create logs on the server on which they’re running. However, logs won’t be sufficient for distributed systems as it is challenging to find the location of an issue with logs. Distributed tracing comes in handy here, as it tracks a request from its inception to the end. 

Although tracing provides visibility into distributed applications, instrumenting traces is a tedious task. Each available tracing tool works in its own way, and they are constantly evolving. Besides, different tools may be required for different situations, so developers shouldn't be stuck with one tool throughout the whole software development process. This is where OpenTracing comes into play. 

OpenTracing is an open-source vendor-agnostic API that allows developers to add tracing into their code base. It’s a standard framework for instrumentation and not a specific, installable program. By providing standard specifications to all tracing tools available, developers can choose the tools that suit their needs at different stages of development. The API works in nine languages, including Java, JavaScript, and Python. 


OpenTracing Features 

OpenTracing consists of four main components that are easy to understand. These are:

Tracer 

A Tracer is the entry point of the tracing API. Tracers are used to create spans. They also let us extract and inject trace information from and to external sources. 

Span

Spans are the primary building block, or unit of work, in a trace. When a web request creates a new trace, its first span is called the “root span.” If that request initiates another request in its workflow, the second request's span will be a child span. Spans can support more complex workflows, even involving asynchronous messaging.
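The root/child relationship can be sketched without any tracing library. The classes below are a toy illustration of the idea, not the OpenTracing API itself:

```python
import itertools

_ids = itertools.count(1)

class Span:
    """Toy span: real OpenTracing spans also carry timing, tags, and logs."""
    def __init__(self, operation, parent=None):
        self.operation = operation
        self.span_id = next(_ids)
        self.parent = parent
        # Every span in a trace shares the root span's trace_id.
        self.trace_id = parent.trace_id if parent else self.span_id

# A web request starts a new trace: this is the root span.
root = Span("GET /checkout")
# A call made while handling that request becomes a child span.
child = Span("SELECT orders", parent=root)

assert child.trace_id == root.trace_id  # same trace
assert child.parent is root             # the ChildOf relationship
```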

SpanContext 

SpanContext is a serializable form of a Span that transfers Span information across process boundaries. It contains trace id, span id, and baggage items.
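As a rough sketch of the idea, here is how injecting and extracting a SpanContext across a process boundary might look. The header names here are hypothetical, not the real OpenTracing wire format:

```python
def inject(span_context, headers):
    """Serialize the context into carrier headers (hypothetical names)."""
    headers["x-trace-id"] = str(span_context["trace_id"])
    headers["x-span-id"] = str(span_context["span_id"])
    for key, value in span_context["baggage"].items():
        headers[f"x-baggage-{key}"] = value

def extract(headers):
    """Rebuild a SpanContext from carrier headers on the receiving side."""
    baggage = {k[len("x-baggage-"):]: v
               for k, v in headers.items() if k.startswith("x-baggage-")}
    return {"trace_id": int(headers["x-trace-id"]),
            "span_id": int(headers["x-span-id"]),
            "baggage": baggage}

ctx = {"trace_id": 42, "span_id": 7, "baggage": {"tenant": "acme"}}
carrier = {}
inject(ctx, carrier)            # before the outgoing request
assert extract(carrier) == ctx  # the receiving service sees the same context
```

The receiving service recovers the same trace id, so both processes contribute spans to one trace.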

References 

References build connections between spans. There are two types of references called ChildOf and FollowsFrom. 

What is OpenTelemetry?

Telemetry data is a common term across different scientific fields. It is a collection of datasets gathered from a remote location to measure a system’s health. In DevOps, the system is the software application, while the data we collect are logs, traces, and metrics.  

OpenTelemetry is an open-source framework with tools, APIs, and SDKs for collecting telemetry data. This data is then sent to the backend platform for analysis to understand the status of an application. OpenTelemetry is a Cloud Native Computing Foundation (CNCF) incubating project created by merging OpenTracing and OpenCensus in May 2019. 

OpenTelemetry aims to create a standard format for collecting observability data. Before solutions like OpenTelemetry, collecting telemetry data across different applications was inconsistent and a considerable burden for developers. OpenTelemetry provides a standard for observability instrumentation with its vendor-agnostic APIs and libraries, saving companies a lot of valuable time otherwise spent building mechanisms to collect telemetry data. 

You can install and use OpenTelemetry for free. This guide will tell you more about this framework. 

OpenTelemetry Architecture

OpenTelemetry Features

You have to know the critical components of OpenTelemetry to understand how it works. They are as follows: 

API 

APIs help to instrument your application to generate traces, metrics, and logs. These APIs are language-specific and written in various languages such as Java, .NET, and Python. 

SDK

The SDK is another language-specific component that works as a mediator between the API and the Exporter. It defines concepts like configuration, data processing, and exporting. The SDK also handles transaction sampling and request filtering.

Collector 

The collector gathers, processes, and exports telemetry data. It acts as a vendor-agnostic proxy. Though it isn’t an essential component, it is helpful because it can receive and send application telemetry data to the backend with great flexibility. For example, if necessary, you can handle multiple data formats from OTLP, Jaeger, and Prometheus and send that data to various backends. 

In-process exporter 

You can use the Exporter to configure the backend to which you want to send telemetry data. The Exporter separates the backend configuration from the instrumentation. Therefore, you can easily switch the backend without changing the instrumentation. 
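This decoupling can be shown in miniature. In the sketch below (hypothetical classes, not the OpenTelemetry API), the instrumentation only talks to an exporter interface, so swapping the backend is a one-line configuration change:

```python
class ConsoleExporter:
    """One possible backend: print spans to stdout."""
    def export(self, spans):
        for span in spans:
            print("console:", span)

class InMemoryExporter:
    """Stand-in for a real backend exporter (Jaeger, OTLP, ...)."""
    def __init__(self):
        self.received = []
    def export(self, spans):
        self.received.extend(spans)

class Tracer:
    """Instrumentation code only depends on the exporter interface."""
    def __init__(self, exporter):
        self.exporter = exporter
    def record(self, name):
        self.exporter.export([name])

# Switching backends is a configuration change;
# the instrumentation (tracer.record calls) is untouched.
backend = InMemoryExporter()
tracer = Tracer(backend)
tracer.record("GET /users")
assert backend.received == ["GET /users"]
```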

Differences between OpenTracing and OpenTelemetry

OpenTracing and OpenTelemetry are both open-source projects aimed at providing vendor-agnostic solutions. However, OpenTelemetry is the latest solution created by merging OpenTracing and OpenCensus. Thus, it is more robust than OpenTracing.

While OpenTracing collects only traces in distributed applications, OpenTelemetry gathers all types of telemetry data such as logs, metrics, and traces. Moreover, OpenTelemetry is a collection of APIs, SDK, and libraries that you can directly use. One of the critical advantages of OpenTelemetry is its ability to quickly change the backend used to process telemetry data. 

Overall, there are many benefits to using OpenTelemetry over OpenTracing, so many developers are migrating from OpenTracing to OpenTelemetry.

Summary

Logs, traces, and metrics are essential to detect anomalies in your application. They help to avoid any adverse effects on the user experience. While logs can be less effective in distributed systems, traces can indicate the location of an issue. Solutions like OpenTracing and OpenTelemetry provide standards for collecting this telemetry data. 

You can simplify observability by using Lightrun. This tool allows you to insert logs and metrics in real time, even while the server is running. You can debug all types of applications, including monolithic applications, microservices, Kubernetes clusters, and Docker Swarm. Amongst many other benefits, Lightrun enables you to quickly resolve bugs, increase productivity, and enhance site reliability. Get started with Lightrun today!

The post OpenTracing vs. OpenTelemetry appeared first on Lightrun.

]]>
Maximizing CI/CD Pipeline Efficiency: How to Optimize your Production Pipeline Debugging? https://lightrun.com/maximizing-ci-cd-pipeline-efficiency-how-to-optimize-your-production-pipeline-debugging/ Mon, 15 May 2023 11:43:07 +0000 https://lightrun.com/?p=11611 Introduction At one particular time, a developer would spend a few months building a new feature. Then they’d go through the tedious soul-crushing effort of “integration.” That is, merging their changes into an upstream code repository, which had inevitably changed since they started their work. This task of Integration would often introduce bugs and, in […]

The post Maximizing CI/CD Pipeline Efficiency: How to Optimize your Production Pipeline Debugging? appeared first on Lightrun.

]]>
Introduction

There was a time when a developer would spend a few months building a new feature. Then they'd go through the tedious, soul-crushing effort of “integration”: merging their changes into an upstream code repository, which had inevitably changed since they started their work. Integration would often introduce bugs and, in some cases, might even be impossible or irrelevant, leading to months of lost work.

Hence continuous integration and continuous deployment (CI/CD) were born, a paradigm shift that enables teams to build and deploy software at a much faster pace, with extremely rapid release cycles.

Continuous Integration aims to keep team members in sync through automated testing, validation, and immediate feedback. Done correctly, it will instill confidence in knowing that the code shipped adheres to the standards required to be production ready. 

However, although CI/CD pipelines bring many positive factors, they have since evolved into a complex puzzle of moving parts and steps in which problems occur frequently.

Usually, errors that occur in the pipeline surface after the fact. There are N pieces of the puzzle that could fail, and even if you can resolve some of these issues by piping your logs to a centralized logging service and tracing from there, you are not able to replay the issue.

You may argue for the case of static debugging. In this process, one usually traces an error via a stack trace or exception, then makes calculated guesses about where the issue may have happened.

This is usually followed by some code changes and local testing to simulate the issue, then by deploying the code and going through a vicious cat-and-mouse cycle of identifying issues.

Issues with CI/CD Pipelines and Debugging 

Let’s break down some fundamental issues plaguing most CI/CD pipelines. CI/CD builds and production deployments rely on testing and performance criteria. Functional testing and validation testing can be automated, but remain challenging due to the scope of the different scenarios in place.

Identifying the root cause of the issue

It can be challenging to determine the exact cause of a failure within a CI/CD pipeline. Complex pipelines consisting of many stages and interdependent processes are difficult to debug, making it hard to comprehend what went wrong and how to fix it.

At its core, a lack of observability and limited access to logs, or a lack of relevant information, can make it challenging to diagnose issues. At times the inverse, excessive logging and saturation, causes tunnel vision.

Low code coverage is another contributing factor, since edge-case scenarios that could potentially be breaking your pipeline are hard to discover. For those working in a monorepo environment, issues are exacerbated where shared dependencies and configurations originate from multiple teams, and developers who push code without verification can cause a dependency elsewhere to break the build-and-deploy pipeline.

How to Optimize your CI/CD Pipeline?

There will be times when you believe you’ve done everything correctly, but something still breaks. 

  • Your pipeline should have a structured review process. 
  • You need to ensure the pipeline supports automated tests. 
  • Parallelization should be part of your design, with caching of artifacts where applicable. 
  • The pipeline should be built so it fails fast, with a quick feedback loop. 
  • Monitoring should be by design. 
  • Keep builds and tests lean. 

All these tips won’t help much if you don’t have a way to observe your pipeline.
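As a small illustration of the parallelization and fail-fast tips above, here is a sketch using only the Python standard library; the shard names and outcomes are made up:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_shard(name):
    """Stand-in for one parallel test shard; returns (name, passed)."""
    return name, not name.endswith("flaky")

shards = ["unit", "integration", "e2e-flaky", "lint"]

def run_pipeline(shards):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(run_shard, s): s for s in shards}
        for future in as_completed(futures):
            name, passed = future.result()
            if not passed:
                # Fail fast: cancel any not-yet-started shards
                # and surface feedback immediately.
                for f in futures:
                    f.cancel()
                return f"failed at shard: {name}"
    return "pipeline green"

print(run_pipeline(shards))  # failed at shard: e2e-flaky
```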

Why Should Your CI/CD Pipeline be Observable?

A consistent, reliable pipeline helps teams to integrate and deploy changes more frequently. If any part of this pipeline fails, the ability to release new features and bug fixes grinds to a halt.

An observed pipeline helps you stay on top of any problems or risks to your CI/CD pipeline. 

An observed pipeline provides developers with visibility into their tests, and they will finally know whether the build process they triggered was successful or not. If it fails, the “where did it fail” question is answered immediately. 

Not knowing what’s going on in the overall CI/CD process, and lacking the visibility to see how it’s going and performing overall, is no longer a topic of discussion. 

Tracing issues across different interconnected services and understanding the processing they undergo end to end can be difficult, especially when reproducing the same problem in the same environment is complex or restricted.

Usually, DevOps engineers and developers try to reproduce issues in their local environment to understand the root cause, which brings the additional complexities of local replication.

CI/CD Pipeline Architecture

To put things into context, let’s work through an example of a typical production CI/CD pipeline.



CI/CD pipeline with GitHub Actions, AWS CDK, and AWS Elastic Beanstalk



The CODE

The CI/CD pipeline starts with the source code on GitHub, with GitHub Actions used to trigger the pipeline. GitHub provides ways to version code but does not track the impact of the commits developers make to the repository. For example: 

  • What if a certain change introduces a bug? 
  • What if a branch merge resulted in a successful build but failed deployment? 
  • What if the deployment was successful, then a user received an exception, and it’s already live in production?

BUILD Process

The build process, with test cases for code coverage, is a critical point of failure for most deployments. If a build fails, the team needs to be notified immediately in order to identify and resolve the problem quickly. You may say there are options, like alerts to your Slack channel or email notifications that can be configured. 

Those additional triggers can alert you, though they don’t provide the ability to trace and debug the issues in a timely manner, as one still needs to dig into the code. Failure may be due to more elusive problems such as missing dependencies. 

Unit & Integration TESTS

It’s not enough to know that your build was successful. It also has to pass tests to ensure changes don’t introduce bugs or regressions. Frameworks such as JUnit, NUnit, and pytest generate test result reports, though these reports show which cases failed, not why they failed. 

Deploy Application

Most pipelines have pivoted to infrastructure as code, where code dictates how infrastructure provisioning is done. In our example, AWS CDK lets you manage the infrastructure as code using Python. While this empowers developers, it adds additional code, which becomes hard to debug. 

Post-Deploy Health Checks

Most deployments have an extra step to verify health, as illustrated in our pipeline. Such checks may include Redis health and database health. Since these checks are driven by code, we have yet another opportunity for failure that may hinder our success metric. 
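A post-deploy gate of this kind can be sketched in a few lines. The checks below are stand-ins (a real gate would actually ping Redis and query the database), but the shape is the point: run named checks and report the failures so the pipeline can roll back:

```python
def check_redis():
    return True   # stand-in: would ping Redis here

def check_database():
    return True   # stand-in: would run SELECT 1 here

HEALTH_CHECKS = {"redis": check_redis, "database": check_database}

def post_deploy_gate(checks):
    """Return (ok, failures) so the pipeline can decide to roll back."""
    failures = [name for name, check in checks.items() if not check()]
    return (not failures, failures)

ok, failures = post_deploy_gate(HEALTH_CHECKS)
assert ok and failures == []
```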

Visually Illustrating Points of Failure in the CI/CD Pipeline

The diagram below illustrates the places that can potentially go wrong, i.e. points of failure. This gets exponentially more complex depending on how your CI/CD pipeline has been developed.




Points of failure in our example CI/CD pipeline

Dynamic Debugging and Logging to Remedy your Pipeline Ailments 

Let’s take a look at how we can quickly figure out what is going on in our complex pipeline. A new approach is to shift observability left, which is the practice of incorporating observability into the early stages of the software development lifecycle by applying a Lightrun CI/CD pipeline observability pattern. 

Lightrun takes a developer-native, observability-first approach. With the platform, we can begin strategically adding agent libraries to each component in our CI/CD pipeline, as illustrated below.


Lightrun CI/CD pipeline pattern

Each agent will be able to observe and introspect your code as part of the runtime, allowing you to hook directly into your pipeline from your IDE via a Lightrun plugin or the CLI. 

This will then allow you to add virtual breakpoints, with logging expressions of your choosing, to your code in real time directly from your IDE. In essence, this is remote debugging and remote logging, just as you would do on your local environment, by linking directly into production.

Since virtual breakpoints are non-intrusive and capture the application’s context (variables, stack trace, etc.) when they’re hit, there are no interruptions to the code executing in the pipeline, and no further redeployments are required to optimize your pipeline.

Lightrun agents can be baked into Docker images as part of the build cycles. This pattern can be further extended by making a base Docker image with your unified Lightrun configuration, inherited by all microservices as part of the build, forming a chain of agents for tracking. 

Log placement in parts of the test and deploy build pipeline, paired with real-time alerting when log points are reached, can minimize troubleshooting challenges without redeployments.

For parts of the code that do not have enough coverage, all we need to do is add a Lightrun counter metric to bridge the gap, forming a coverage tree of dependencies that assists in tracing and scoping what’s been executed and its frequency.

Additional metrics are available via the Tic & Toc metric, which measures the elapsed time between two selected lines of code within a function or method, for measuring performance. 

Customized metrics can further be added using custom parameters, with simple or complex expressions that return a long int result.

Log output will immediately be available for analysis via either your IDE or Lightrun Management Portal. By eliminating arduous and time-consuming CI/CD cycles, developers can quickly drill down into their application’s state anywhere in the code to determine the root cause of errors.

How to Inject Agents into your CI/CD pipeline?

Below we will illustrate using Python. You’re free to replicate the same with other supported languages.  

  1. Install the Lightrun plugin.
  2. Authenticate your IDE (PyCharm) with your Lightrun account.
  3. Install the Python agent by running python -m pip install lightrun.
  4. Add the following code to the beginning of your entrypoint function:
import os

LIGHTRUN_KEY = os.environ.get('YOUR_LIGHTRUN_KEY')
LIGHTRUN_SERVER = os.environ.get('YOUR_LIGHTRUN_SERVER_URL')

def import_lightrun():
    try:
        import lightrun
        lightrun.enable(com_lightrun_server=LIGHTRUN_SERVER, company_key=LIGHTRUN_KEY, lightrun_wait_for_init=True, lightrun_init_wait_time_ms=10000, metadata_registration_tags='[{"name": "<app-name>"}]')
    except ImportError as e:
        print("Error importing Lightrun: ", e)

As part of the enable function call, you can specify lightrun_wait_for_init=True and lightrun_init_wait_time_ms=10000 in the Python agent configuration. 

These two configuration parameters ensure that the Lightrun agent starts up fast enough to work within short-running service functions, applying a wait time of about 10,000 milliseconds before fetching Lightrun actions from the management portal. Note that these are optional parameters that can be skipped for long-lived code execution cycles, e.g. running a Django project or a FastAPI microservice. If you’re using another language, like Java, the same principles apply.

Once your agent is configured, a call to the import_lightrun() function in __init__.py, as part of your pipeline code, ensures agents are invoked when the pipeline starts.

Deploy your code, and open your IDE with access to all your code, including your deployment code. 

Select the lines of code you wish to trace and open up the Lightrun terminal and console output window shipped with the agent plugin. 

Adding logging to live executing code with Lightrun directly from your IDE

 

Achieving Unified output to your favorite centralized logging service

If we wish to pipe out logs instead of using the IDE, you can tap into third-party integrations to consolidate the CI/CD pipeline, as illustrated below.  

If you notice an unusual event, you can drill down to the relevant log messages to determine the root cause of the problem and begin planning for a permanent fix in the next triggered deploy cycle.

Validation of CI/CD Pipeline Code State

One of the benefits of an observed pipeline is that we can fix pipeline versioning issues. Without correct tagging, how do you know your builds contain the expected commits? It gets hard to tell the difference without QA effort. 

By adding dynamic log entries at strategic points in the code, we can validate the new features and committed code introduced into the platform by examining dynamic log output before it reaches production. 

This becomes very practical if you work in an environment with a lot of guardrails and security lockdowns on production servers, as you don’t have to contend with incomplete local replications.

Final thoughts 

A shift-left observability approach to CI/CD pipeline optimization can reduce your MTTR, the average time it takes to recover from production pipeline failures, which matters greatly when critical bugs are deployed to production. 

You can start using Lightrun today, or request a demo to learn more.

The post Maximizing CI/CD Pipeline Efficiency: How to Optimize your Production Pipeline Debugging? appeared first on Lightrun.

]]>
Debugging the Java Message Service (JMS) API using Lightrun https://lightrun.com/debugging-the-java-message-service-jms-api-using-lightrun/ Mon, 25 Apr 2022 10:24:14 +0000 https://lightrun.com/?p=7216 The Java Message Service API (JMS) was developed by Sun Microsystems in the days of Java EE. The JMS API provides us with simple messaging abstractions including Message Producer, Message Consumer, etc. Messaging APIs let us place a message on a “queue” and consume messages placed into said queue. This is immensely useful for high […]

The post Debugging the Java Message Service (JMS) API using Lightrun appeared first on Lightrun.

]]>
The Java Message Service API (JMS) was developed by Sun Microsystems in the days of Java EE. The JMS API provides us with simple messaging abstractions including Message Producer, Message Consumer, etc. Messaging APIs let us place a message on a “queue” and consume messages placed into said queue. This is immensely useful for high throughput systems – instead of wasting user time by performing a slow operation in real-time, an enterprise application can send a message. This non-blocking approach enables extremely high throughput, while maintaining reliability at scale.

The message carries a transactional context which provides some guarantees on deliverability and reliability. As a result, we can post a message in a method and then just return, which provides similar guarantees to the ones we have when writing to an ACID database.

We can think of messaging somewhat like a community mailing list. You send a message to an email address which represents a specific list. Everyone who subscribes to that list receives that message. In this case, the message topic represents the community mailing list address. You can post a message to it, and the Java Message Service handler can use a message listener to receive said event.

It’s important to note that there are two messaging models in JMS: the publish-and-subscribe model (which we discussed here) and also point-to-point messaging, which lets you send a message to a specific destination.
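The two models are easy to contrast in a minimal sketch. This is plain Python for illustration, not the JMS API: a topic delivers every message to every subscriber, while a queue delivers each message to exactly one consumer:

```python
from collections import deque

class Topic:
    """Publish-and-subscribe: every subscriber gets every message."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, handler):
        self.subscribers.append(handler)
    def publish(self, message):
        for handler in self.subscribers:
            handler(message)

class Queue:
    """Point-to-point: each message goes to exactly one consumer."""
    def __init__(self):
        self.messages = deque()
    def send(self, message):
        self.messages.append(message)
    def receive(self):
        return self.messages.popleft() if self.messages else None

received_a, received_b = [], []
topic = Topic()
topic.subscribe(received_a.append)
topic.subscribe(received_b.append)
topic.publish("price update")
assert received_a == received_b == ["price update"]  # both got a copy

queue = Queue()
queue.send("log entry")
assert queue.receive() == "log entry"
assert queue.receive() is None  # consumed exactly once
```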

Let’s go over a quick demo.

A Simple Demo

In order to debug the Java Message Service calls, I’ve created a simple demo application, whose source code can be found here.

This JMS demo is a simple database log API – a microservice which you can use to post a log entry, which is then written to the database asynchronously. RESTful applications can then use this database log API to add a database log entry without the overhead of database access.

This code implements the main web service:

@RestController
@RequiredArgsConstructor
public class EventRequest {
   private final JmsTemplate jmsTemplate;
   private final EventService eventService;
   private final Moshi moshi = new Moshi.Builder().build();

   @PostMapping("/add")
   public void event(@RequestBody EventDTO event) {
       String json = moshi.adapter(EventDTO.class).toJson(event);
       jmsTemplate.send("event", session ->
               session.createTextMessage(json));
   }

   @GetMapping("/list")
   public List<EventDTO> listEvents() {
       return eventService.listEvents();
   }
}

Notice the event() method that posts a message to the event topic. I didn’t discuss message bodies before to keep things simple, but note that in this case I just pass a JSON string as the body. While JMS supports object serialization, using that capability has its own complexities and I want to keep the code simple.

To complement the main web service, we’d need to build a listener that handles the incoming message:

@Component
@RequiredArgsConstructor
public class EventListener {
   private final EventService eventService;

   private final Moshi moshi = new Moshi.Builder().build();

   @JmsListener(destination = "event")
   public void handleMessage(String eventDTOJSON) throws IOException {
       eventService.storeEvent(moshi.adapter(EventDTO.class).fromJson(eventDTOJSON));
   }
}

The listener is invoked with the JSON string that was sent, which we parse and pass on to the service.

Debugging the Hidden Code

The great thing about abstractions like Spring and JMS is that you don’t need to write a lot of boilerplate code. Unfortunately, message-oriented middleware of this type hides a lot of fragile implementation details that can fail along the way.

This is especially painful in a production scenario where it’s hard to know whether the problem occurred because a message wasn’t sent properly. This is where Lightrun comes in.

You can place Lightrun actions (snapshots, logs etc.) directly into the platform APIs and implementations of messaging services. This lets us determine if message selectors are working as expected and whether the message listener is indeed triggered.

With Spring with JMS support as shown above, we can open the JmsTemplate and add a snapshot to the execute method:

Adding a snapshot to the execute method

As you can see, the action is invoked when sending to a topic. We can review the stack frame to see the topic that receives the message and use conditions to narrow down the right handler for messages.

We can place a matching snapshot at the source of the message so we can track the flow. E.g., a snapshot in EventRequest can provide us with some insight. We can dig in the other direction too.

In the stack above, you can see that the execute method is invoked by the method send at line 584. The execute method wraps the caller so the operation will be asynchronous. We can go further down the stack by going to the closure and placing a snapshot there:

Placing another snapshot

Notice that here we can place a condition on the specific topic and narrow things down.

Summary

We pick messaging systems to make our applications reliable. However, enterprise messaging systems are very hard to debug in production, which works against that reliability. We can see logs at the message’s target, but what happens if the message never reaches it?

With Lightrun, we can place actions in all the different layers of messaging-based applications. This helps us narrow down the problem regardless of the messaging standard or platform.

The post Debugging the Java Message Service (JMS) API using Lightrun appeared first on Lightrun.

]]>