Lightrun https://lightrun.com/ Developer Observability Platform Sat, 03 Feb 2024 20:08:42 +0000 en-US hourly 1 https://wordpress.org/?v=6.4.2 https://lightrun.com/wp-content/uploads/2022/11/cropped-fav-1-32x32.png Lightrun https://lightrun.com/ 32 32 Lightrun LogOptimizer Gets A Developer Productivity and Logging Cost Reduction Boost https://lightrun.com/lightrun-logoptimizer-gets-a-developer-productivity-and-logging-cost-reduction-boost/ Wed, 03 Jan 2024 13:31:00 +0000 https://lightrun.com/?p=15220 What is Lightrun’s LogOptimizer Lightrun’s LogOptimizer stands as a groundbreaking automated solution for log optimization and cost reduction in logging. An integral part of the Lightrun IDE plugins, this tool empowers developers to swiftly scan their source code—be it a single file or entire projects—to identify and replace log lines with Lightrun’s dynamic logs, all […]

The post Lightrun LogOptimizer Gets A Developer Productivity and Logging Cost Reduction Boost appeared first on Lightrun.

]]>
What is Lightrun’s LogOptimizer

Lightrun’s LogOptimizer is a groundbreaking automated solution for log optimization and cost reduction in logging. An integral part of the Lightrun IDE plugins, this tool empowers developers to swiftly scan their source code—be it a single file or an entire project—to identify and replace log lines with Lightrun’s dynamic logs, all within seconds. Compatible with runtimes such as Java, JavaScript, and Python, the LogOptimizer is a game-changer for logging efficiency.

With the LogOptimizer, developers can benefit from the following:

  • Gain visibility within minutes into low-quality logs that aren’t needed in your source code
  • Reduce the noise in your logs and the overall cost of logging, and shift FinOps practices left toward engineers
  • Move away from static logs toward dynamic logging
  • Establish a continuous, optimized process of log optimization as part of your DevOps practice
  • Reduce the noise generated by over-logging (e.g. static logs, marker logs, duplicate logs), and surface the most valuable information in cleaner code
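To illustrate the general idea of such a scan (this is a toy sketch, not LogOptimizer’s actual implementation), a simple regex-based pass over source text can flag candidate static log lines; the pattern and sample code below are hypothetical:

```python
import re

# Hypothetical patterns for common static log calls in Java, JavaScript,
# and Python sources (illustrative only; not LogOptimizer's real rules).
LOG_PATTERN = re.compile(
    r'\b(log(?:ger)?\.(?:info|debug|warn|warning|error)|console\.log|print)\s*\('
)

def find_log_lines(source: str):
    """Return (line_number, line_text) pairs for lines containing log calls."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if LOG_PATTERN.search(line)
    ]

sample = """\
def checkout(cart):
    print("entering checkout")       # marker log
    total = sum(item.price for item in cart)
    logger.debug("total=%s", total)  # low-value debug log
    return total
"""

for lineno, text in find_log_lines(sample):
    print(lineno, text)  # flags lines 2 and 4 as candidates for dynamic logs
```

A real scanner would of course parse the language properly; the point is that each finding identifies a static log line that could be removed and replaced on demand with a dynamic log.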

What’s New

With the release of Lightrun 1.23.1, significant improvements have been made to enhance the user experience and streamline log optimization and cost reduction. The product has been restructured into user-friendly sections, making it easier for developers to navigate and manage log optimization workflows. In the updated version, developers have a clear, accessible list of findings from the LogOptimizer scan on the left side of the product. They can explore each finding or choose a specific source code file for a more focused analysis. On the right side of the LogOptimizer window, integrated within the IDE plugin, developers see a convenient summary of the scan: a high-level overview with results categorized by file and by the total number of findings.

To get started with the Lightrun LogOptimizer, make sure the Docker Desktop client is installed and running on your machine. Once it is, you can run a single source file scan or an entire project scan from the Lightrun IDE plugin. See the quick demo below to learn more.

Bottom Line

As organizations shift FinOps left toward developers and work to reduce overall spend, tools like the LogOptimizer empower developers to be more knowledgeable about, and accountable for, the ongoing costs of their software. A recurring insight from the annual FinOps Foundation report is that the top challenge in cloud cost reduction (cited in over 32% of responses) is empowering engineers to take action on their spend based on data.

To get started with Lightrun and the LogOptimizer, please refer to our documentation, visit https://lightrun.com, and/or book a demo!


]]>
Live Debugging for Critical Systems https://lightrun.com/live-debugging-for-critical-systems/ Mon, 23 Oct 2023 08:38:01 +0000 https://lightrun.com/?p=12258 Live debugging refers to debugging software while running in production without causing any downtime. It has gained popularity in modern software development practices, which drives many critical systems across businesses and industries. In the context of always-on, cloud-native applications, unearthing severe bugs and fixing them in real time is only possible through live debugging. Therefore, […]

The post Live Debugging for Critical Systems appeared first on Lightrun.

]]>
Live debugging refers to debugging software while running in production without causing any downtime. It has gained popularity in modern software development practices, which drives many critical systems across businesses and industries. In the context of always-on, cloud-native applications, unearthing severe bugs and fixing them in real time is only possible through live debugging. Therefore, live debugging becomes an integral part of any developer’s skill set.

This post will explore the various types of critical software systems where live debugging becomes imperative. It will also emphasize the broader strategies for live debugging of such applications.

Types of Critical Systems Where Live Debugging is Important

A critical system is one that must be highly reliable and retain that reliability as it evolves, without incurring performance degradation or prohibitive costs.

Broadly, critical systems can be classified as follows.

Safety Critical Systems

Safety critical systems are systems where failure or malfunction can lead to loss of life or serious physical injury. In many cases, the malfunction also has a second-order impact in the form of environmental damage or ecological imbalance.

Software that manages such systems must be designed to control the operational aspects of the systems such that any malfunction has a limited impact on human life, as well as the local flora and fauna of the impacted region. The most obvious example of such a system is the avionics software installed on an aircraft that controls flight surfaces, engine systems, landing gear, and other auxiliary subsystems.

Mission Critical Systems

Mission critical systems are designed around a set of important goals. They are intended to facilitate the completion of those goals, with clearly stated trade-offs, no matter what hurdles are encountered along the way.

A commonly used mission critical system is map-based navigation software. Most users of Google Maps and other app-based navigation systems know how this software works: it guides drivers to their destination in minimum time. In this case, the mission is reaching the destination, and the trade-off is time. These systems are therefore designed to recommend the best route to the destination in the minimum possible time.

Similar systems are also installed aboard aircraft, ships, and spacecraft with more complex trade-offs around fuel consumption and arrival times.

Business Critical Systems

Business critical systems are systems where failure can prevent an organization from completing important business functions or meeting key objectives. The higher order impact of such failures can result in revenue and reputation loss, eventually leading to degraded performance in the stock market or during subsequent fiscal quarters.

Common examples of software driven business critical systems are payment processing systems or customer support systems. Failure in such a system often disrupts the process workflow. If not addressed in time, such situations can grow out of control, resulting in revenue loss or a decline in the net promoter score for the organization.

Parameters Governing the Health of Critical Systems

The rules for live debugging of critical systems take a radically different approach. Firstly, these systems are built with a fail-operational or fail-safe design methodology, so they can continue functioning, or safely shut down a subsystem, in case of failure.

Live debugging of such systems in a production setup does not always require the developer to dig into the innards of the source code to find the root cause. However, it is important to keep tabs on some key metrics that indicate the system’s overall health. Let’s take a look at how these metrics can be calculated at a high level.

Mean Time Between Failures (MTBF)

MTBF is a reliability metric. It is a measure of the average time between failures of a critical system or its subsystem components. A higher value for MTBF corresponds to less frequent failures and is, therefore, considered desirable.

MTBF helps in further statistical analysis across all components of a critical system. Comparing MTBF across components can contribute to system design. For example, a subsystem with high MTBF requires less redundancy for fail-operational working. Similarly, a subsystem with lower MTBF must be improved via redesign or rigorous testing.

Mean Time to Resolve (MTTR)

MTTR stands for Mean Time To Resolve (the R sometimes also stands for Recovery or Repair). It is a maintainability metric that measures the average time required to resolve a show-stopper bug in a failed system or component.

MTTR is important to assess a system’s availability and serviceability from the end user’s perspective. A lower value of MTTR is always desirable. A higher MTTR most likely corresponds to inefficient diagnosis procedures or lack of skilled resources.

Mean Time to Acknowledge (MTTA)

MTTA stands for Mean Time To Acknowledge. It is the average time from when a failure is triggered to when work begins on the issue. It indicates how soon the RCA (Root Cause Analysis) is conducted to arrive at the source of failure. A higher MTTA is undesirable and can be indicative of overly complex system design.

The MTTA metric is always lower than MTTR since it takes less time to acknowledge a failure than to resolve it completely. If this is not the case, the critical system is most likely in an unstable state and requires further analysis in a staged environment.
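As a rough sketch of how these three metrics relate, assume hypothetical incident records of the form (failure detected, work started, resolved); the timestamps below are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (failure detected, work started, resolved).
incidents = [
    (datetime(2023, 1, 1, 0, 0),  datetime(2023, 1, 1, 0, 10),  datetime(2023, 1, 1, 1, 0)),
    (datetime(2023, 1, 11, 0, 0), datetime(2023, 1, 11, 0, 5),  datetime(2023, 1, 11, 2, 0)),
    (datetime(2023, 1, 31, 0, 0), datetime(2023, 1, 31, 0, 15), datetime(2023, 1, 31, 1, 30)),
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

# MTBF: average gap between successive failures (higher is better).
mtbf = mean([b[0] - a[0] for a, b in zip(incidents, incidents[1:])])
# MTTA: average time from failure detection to start of work (lower is better).
mtta = mean([started - failed for failed, started, _ in incidents])
# MTTR: average time from failure detection to resolution (lower is better).
mttr = mean([resolved - failed for failed, _, resolved in incidents])

print(mtbf, mtta, mttr)  # 15 days, 10 minutes, 90 minutes
```

Note that in this sample the MTTA (10 minutes) is indeed lower than the MTTR (90 minutes), as expected for a stable system.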

Lightrun: A Reliable Observability Platform for Live Debugging of Critical Systems

Lightrun is a developer-centric observability platform. It empowers developers to ask intricate questions on production deployment and get answers in the form of logs, snapshots, and metrics. This approach enables live debugging of critical systems without causing downtime or performance degradation.

Lightrun is well suited for tracking MTBF in critical systems by injecting timestamped log messages within the running software. This feature creates a stream of dynamic logs that can capture the health-related metrics of the system for proactive remediation. It is also designed for dynamic instrumentation, allowing developers to investigate the software runtime in real time, resulting in reduced MTTA and MTTR.

Lightrun has been proven to reduce the MTTR by up to 60%, resulting in faster bug resolution. All these achievements have a direct impact on improving customer experience and increasing developer productivity.

To experience what it is like to perform live debugging on running production software, sign up for a free Lightrun trial and get started within minutes with your Java, Python, Node.js, or .NET applications. If you’d rather know more before you start, feel free to request a Lightrun demo.


]]>
Why Real-Time Debugging Becomes Essential in Platform Engineering https://lightrun.com/why-real-time-debugging-becomes-essential-in-platform-engineering/ Thu, 19 Oct 2023 11:03:55 +0000 https://lightrun.com/?p=12335 Introduction Platform engineering has been one of the hottest keywords in the software community in recent years. As a natural extension of DevOps and the shift-left mentality it fosters, platform engineering is a subfield within software engineering that focuses on building and maintaining tools, workflows, and frameworks that allow developers to build and test their […]

The post Why Real-Time Debugging Becomes Essential in Platform Engineering appeared first on Lightrun.

]]>
Introduction

Platform engineering has been one of the hottest keywords in the software community in recent years. As a natural extension of DevOps and the shift-left mentality it fosters, platform engineering is a subfield within software engineering that focuses on building and maintaining tools, workflows, and frameworks that allow developers to build and test their applications efficiently. While platform engineering can take many forms, most commonly, the byproduct of platform engineering is an Internal Developer Platform (IDP) that enables self-services capabilities for developers.

One of the notable challenges with building a successful platform engineering organization is that there still exists a big gap between dev and ops teams in terms of the tools and the domains they operate in. While the promise of DevOps is to bridge that gap, oftentimes traditional tools designed by and for operations teams are blindly applied to internal developer platforms, drastically reducing their effectiveness. In order for IDPs to be truly self-service and beneficial for all parties involved, observability must play a key role. Without observability, developers will not be able to gather insights into their applications and debug as true owners of their code. 

It is important to note that platform engineering serves the wider organization at maximum scale: across cloud providers (AWS, GCP, Azure), environments (QA, CI, pre-production, production), and runtime languages (Java, C#, .NET, Python, etc.). Being able to debug and troubleshoot all of these configurations and code bases in a standardized way is a huge challenge, as well as a critical pillar of success.

In fact, in a recent article covering the core skills required of a platform engineer, two of the top eight skills involved developer observability and debugging.


Core Skills Required from a Platform Engineer (Source: SpiceWorks)

In this article, we will explore some of the key components of platform engineering and how they manifest in internal developer platforms. We will then shift our focus to the growing importance and adoption of developer focused real-time observability in IDPs and how traditional observability tooling often falls short. Finally, we’ll look at how Lightrun’s dynamic observability tooling can unlock the true value of IDPs. 

Key Components of Platform Engineering

Platform engineering came largely as a response to the gap between the idealistic promises of DevOps and its stark realities in practice. While the “you write it, then you run it” ethos of DevOps sounds good, the reality is not so simple. With the rise of cloud native architectures and microservices, running an application now involves many more complex moving components. It is unrealistic to ask developers to not only write their code but also be well-versed in everything that traditionally falls under the Ops bucket (e.g., IaC and CI/CD).

So platform engineering is a more practical response that carries on the spirit of DevOps while acknowledging real-world constraints. Some of the key components of platform engineering include:

  • Promoting DevOps Practices: This includes IaC, CI/CD, fast iterations, modular deployments, etc. 
  • Enabling Self-Service: Platform engineering teams should enable developers to build and test their applications easily. This touches not only on the build pipeline, but also the infrastructure and other related third-party APIs and services that developers can spin up and connect to on demand. 
  • Providing Tools and Automation: As a follow up to the first two points, platform engineering teams should provide a collection of tools, scripts, and frameworks to automate various tasks to speed up developer lifecycles and reduce human error. 
  • Balancing Abstraction and Flexibility: There should be a good balance between abstracting away the underlying infrastructure to support a scalable and performant platform, and exposing important metrics, logs, and other observability data points for engineers to troubleshoot issues. This also allows developers to own their services (a DevOps practice) without the overhead of understanding every part of the infrastructure; in effect, shifting left to developers without the cost of infrastructure complexity.

In short, the platform engineering team acts as a liaison between developers and other infrastructure-related teams to provide tools and platforms for developers to write, build, and deploy code without diving too deep into the complexities of modern infrastructure stacks. 

Internal Developer Platforms

These principles are best seen in internal developer platforms. IDPs cover the entire application lifecycle, beyond the traditional responsibilities of a CI/CD pipeline. They provide developers with a flexible platform in which they can quickly iterate on testing their applications as if they were working locally. More specifically, this includes:

  • Provisioning a new and isolated environment to deploy and test their applications.
  • Ability to add, modify, and remove configuration, secrets, services, and dependencies on demand.
  • Fast iteration between building and deploying new versions, as well as the ability to roll back.
  • Scaling up or down based on load.
  • Production-like environment with guardrails built in to not accidentally cause outages or degradation in service for other teams.
  • Enabling developers to understand their application costs at all times and to participate in, and own, overall cost optimization efforts.

In other words, IDPs provide developers a self-service platform that glues together all the tools behind the scenes in a cohesive manner. 

Importance of Real-Time Debugging within an IDP

One of the critical components of a self-service platform is observability through real-time debugging. Without exposing adequate levels of observability to the developers, IDPs will remain a black box that will trigger more support tasks once things go wrong, which defeats the purpose of setting up a self-service platform in the first place. Ideally, developers have access to logs, metrics, traces, and other important pieces of information to troubleshoot the issue and iterate based on the feedback. 

As such, real-time observability plays a critical role in creating a successful platform engineering organization and a robust IDP. Platform engineers and VPs of platform engineering who are building IDPs today are investing in and prioritizing the ability to efficiently collect logs, metrics, and traces, and to surface the most relevant signals for developers to detect, troubleshoot, and respond to issues.

Real-Time Debugging within IDP using Lightrun 

Lightrun offers a unique solution that aligns with the principles of platform engineering and adds observability in a way that fits existing developer workflows. Lightrun provides a standard developer observability platform for real-time debugging that lets developers, across multiple clouds, environments, runtime languages, and IDEs, debug complex issues fast without iterative SDLC cycles and redeployments.

Specifically, Lightrun provides developers, in real time, with:

  • Dynamic logging: developers can add new logs without stopping or restarting their applications. Logs can be added conditionally so they only appear in certain scenarios, reducing noise. 
  • Snapshots: snapshots emulate what breakpoints provide in a local context. They capture the current execution state, including environment variables, configuration, and stack traces, at run time. 
  • Metrics: developers often don’t think to add metrics preemptively. With Lightrun, metrics can be collected on demand. 

These dynamic observability tools are, as mentioned, integrated into the IDEs developers already use to write their code. Compared to traditional observability tools like APMs or logging aggregators, Lightrun allows developers to add or remove logs, snapshots, or metrics on demand without going through the expensive iteration cycle of adding logs, raising a PR for review, and waiting for changes to take effect. Especially in the context of IDPs, this dynamic approach gives developers a truly self-service way to troubleshoot and debug their applications. 

Summary

The rise of platform engineering in recent years has significantly improved developer productivity and experience. Internal developer platforms address a growing problem of increased complexities in developing and deploying modern applications. As more organizations embrace platform engineering and build out internal developer platforms, observability is becoming an imperative tool in standardizing real-time debugging within the IDP tool stack for a truly self-service platform. With Lightrun’s suite of dynamic observability tooling, platform engineering teams can unlock the true potential of IDPs for increased developer productivity. 


]]>
Troubleshooting Cloud Native Applications at Runtime https://lightrun.com/troubleshooting-cloud-native-applications-at-runtime/ Wed, 18 Oct 2023 18:07:57 +0000 https://lightrun.com/?p=12271 Co-Authored with Gilles Ramone (Chronosphere) Chronosphere and Lightrun demonstrate how their combined solutions empower developers with optimized end-to-end observability   =========================================================================================================== Introduction Organizations are moving to micro-services and container-based architectures because these modern environments enable speed, efficiency, availability, and the power to innovate and scale more quickly. However, when it comes to troubleshooting distributed cloud […]

The post Troubleshooting Cloud Native Applications at Runtime appeared first on Lightrun.

]]>
Co-Authored with Gilles Ramone (Chronosphere)

Chronosphere and Lightrun demonstrate how their combined solutions empower developers with optimized end-to-end observability  

===========================================================================================================

Introduction

Organizations are moving to microservices and container-based architectures because these modern environments enable speed, efficiency, availability, and the power to innovate and scale more quickly. However, when it comes to troubleshooting distributed cloud native applications, teams face a unique set of challenges due to the dynamic and decentralized nature of these systems. To name a few:

  1. Lack of visibility: With components spread across various cloud services and environments, gaining comprehensive visibility into the entire system can be difficult. Access to production environments is generally strictly limited to ensure the safety of customer-facing systems. This makes it challenging to understand run-time anomalies and identify the root cause of issues. 
  2. Complexity: Distributed systems are inherently complex, with numerous microservices, APIs, and dependencies. Understanding how these components interact and affect one another can be daunting when troubleshooting.
  3. Challenges with container orchestration: When using serverless systems and container orchestration platforms like Kubernetes, processes can be ephemeral, making it very challenging to identify the resources related to specific users or user segments, and to capture and analyze the state of the system relevant to specific traffic.
  4. Cost of monitoring and logging: Setting up effective monitoring and logging across all components is crucial, but it is costly to aggregate and complex to correlate logs and metrics from various sources. 

Addressing these challenges requires a combination of a robust observability platform and tooling that simplifies complexity and helps developers understand the behavior of their deployed applications. 

These tools must address organizational concerns for security and data privacy. The best observability strategy will enable the ongoing “Shift left” – giving developers access to and responsibility for the quality, durability, and resilience of their code, in every environment in which it runs. Doing so will enable a more proactive approach to software maintenance and excellence throughout the Software Development Life Cycle.

Efficient troubleshooting requires not just gathering data, but making sense of that data: identifying the highest priority signals from the vast quantity and variety produced by large deployments. Chronosphere turbo-charges issue triage by collecting and then prioritizing observability data, providing a centralized observability analysis and optimization solution.

Rather than aggregating and storing all data for months, at ever increasing cost to store and access, Chronosphere pre-processes the data and optimizes it, substantially reducing cost and improving performance. 

When leveraging Chronosphere together with Lightrun, engineers are rapidly guided to the most incident-relevant observability data that helps them identify the impacted service. From there, they can connect directly from their local IDE via the Lightrun plugin to debug the live application deployment. With Chronosphere’s focus and Lightrun’s live view of the running application, developers can quickly understand system behavior, complete their investigation at minimal cost, and close the cycle of the troubleshooting process.

Chronosphere + Lightrun: A technical walkthrough 

Ready to see Chronosphere’s ability to separate signal from noise in a metrics-heavy application, and Lightrun’s on-demand, developer-initiated observability capabilities in action? 

To demonstrate, we’re going to use a small web application that’s deployed to the cloud and under load. Lightrun’s application provides simple functionality – users are presented with a list of pictures and they can bookmark those they like most. 

In this example, we’ve been alerted by Chronosphere about something amiss in our application’s behavior: it seems that some users are experiencing particularly high latency on some operations. Chronosphere pinpoints this to the “un-like” operation.

But why only some users?

The app designers are doing some A/B testing to see how users react to various configurations that may improve the site’s usability and performance. They use feature flags to randomly select subsets of users to get slightly different experiences. The percent of the audience exposed to each feature flag is determined by a config file.
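A minimal sketch of how such config-driven bucketing might work (an assumption for illustration, not the demo app’s actual code): hash the user ID per flag and compare against the configured fraction, so assignment is effectively random across users but stable for any given user:

```python
import hashlib

# Hypothetical config file contents: fraction of the audience per feature flag.
FLAG_ROLLOUT = {
    "new_layout": 0.50,   # 50% of users
    "fast_unlike": 0.10,  # 10% of users
}

def bucket(user_id: str, flag: str) -> float:
    """Deterministically map (flag, user) to a value in [0, 1]."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def flags_for(user_id: str) -> dict:
    """A user is in a flag's experiment group if their bucket falls below the rollout fraction."""
    return {flag: bucket(user_id, flag) < pct for flag, pct in FLAG_ROLLOUT.items()}

print(flags_for("user-42"))  # same user always gets the same assignment
```

Because the assignment is deterministic per user yet evenly spread across users, roughly the configured fraction of the audience sees each variant.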

Unfortunately, in our rush to roll out the feature flag controlled experiments, we neglected to include logging, so we have no information about which users are included in each of the experiment groups. 

The feature flags – possibly individually, possibly in combination – may be causing the latency that Chronosphere has identified. In order to know for sure, we’ll need to add some logging, which means creating a new build and rolling out an update. Kubernetes would let us do this without bringing down the entire application, but that might just further confuse which users are getting which feature flags, so it seems that some down time may be our best option.

Well, that would be the situation without Lightrun.

Since we’ve deployed Lightrun’s agent, we can introduce observability on demand, with no change to our code, no new build, and no restarts required. That means we can add new logging to the running system without access to the containers and without redeploying. 

We can safely gather the application state, just as we’d see it if we had connected a debugger, without opening any ports and without pausing the running application!

Lightrun provides remote, distributed observability directly in the interface where developers feel most at home: their existing IDE (integrated development environment). With Lightrun’s IDE plugins, adding observability on the fly is simply a matter of right-clicking in your code, choosing the pods of interest, and hitting submit.

Back to the issue at hand, we’ll use a dynamic log to get a quick feel for who is using the system. Lightrun quickly shows us that we’ve got a bunch of users actively bookmarking pictures. By using Lightrun tags, we’re able to gather information from across a distributed deployment without needing details of the running instances. 

That’s nice, but it’s still hard to tell what’s going on with a specific user who’s now complaining about the latency. We use conditional logging to reduce the noise and zoom in on that specific user’s activity. From there we can see that their requests are being received, but we still need to answer the question: what’s going on?

What we really want is a full picture, including: 

  • This user’s feature flags
  • The list of items they’ve bookmarked
  • And anything else from the environment that could be relevant. 

Enter Lightrun snapshots – virtual breakpoints that show us the state without causing any interruption in service. 

Creating a snapshot is just as easy as adding a log – we choose the tags that represent our deployment, add any conditions so that we’ll just get the user we’re interested in – regardless of which pod is serving that user at the moment. And there we have it, all of the session state affecting that user’s interaction with the application.

With this information we can see that one of our feature flags is to blame – it looks like it’s only partially implemented. It’s a good thing that only a small percentage of our audience is getting this one! Oops. 

Before we roll out a fix, let’s get an idea of how many users are being affected by each of our feature flags. We can use Lightrun’s on-demand metrics to add counters to measure how often each block within our code is being reached. And we can add tic-tocs to measure the latency impact of this code, just in case our experimentation is also slowing down the site’s responsiveness. 
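Conceptually, a counter records how often a block is reached and a tic-toc records how long it takes. The sketch below shows that idea in plain Python; with Lightrun these are added at runtime from the IDE rather than written into the code, and the labels here are invented for illustration:

```python
import time
from collections import Counter

hits = Counter()    # how often each labelled block is reached
latency_ms = {}     # total elapsed milliseconds per labelled block

class tic_toc:
    """Count entries into a labelled block and time how long it takes."""
    def __init__(self, label):
        self.label = label
    def __enter__(self):
        hits[self.label] += 1
        self.start = time.perf_counter()
    def __exit__(self, *exc):
        elapsed = (time.perf_counter() - self.start) * 1000
        latency_ms[self.label] = latency_ms.get(self.label, 0.0) + elapsed

with tic_toc("unlike_with_flag"):
    time.sleep(0.01)  # stand-in for the feature-flagged code path

print(hits["unlike_with_flag"], latency_ms["unlike_with_flag"])
```

Dividing the accumulated latency by the hit count gives the average cost of the instrumented block, which is exactly the signal needed to judge whether an experiment is slowing the site down.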

Watch the full troubleshooting workflow below, performed through both the Chronosphere and Lightrun observability platforms.

The Chronosphere and Lightrun Combined Solution

It’s imperative to have all observability data going into a cloud native observability platform like Chronosphere, which helps alert us to the needle in the haystack of all the telemetry our distributed applications are producing. And with Lightrun developers are able to query the state of the live system right in their IDE, where they can dynamically generate additional telemetry to send to Chronosphere, for end to end analysis. 

By using these solutions together, we leverage the unique capabilities provided by each. The result is full cloud native observability: understanding what’s going on in our code, right now, wherever it is deployed, at cloud native scale. Zooming in on the details that matter despite the complexity of the code and the deployment. Combining new, on-demand logs and metrics with those which are always produced by our code – for control, cost management, and automatic outlier alerting.

With developer-native, full-cycle observability, these powerful tools are supporting rapid issue triage, analysis, and resolution. This is essential to organizations realizing maximum observability benefits while maintaining control over their cloud native costs.

Feel free to contact us with any inquiries or to arrange an assessment.


]]>
Debugging Modern Applications: Advanced Techniques https://lightrun.com/debugging-modern-applications-advanced-techniques/ Tue, 10 Oct 2023 15:25:22 +0000 https://lightrun.com/?p=12256 Today’s applications are designed to be always available and serve users 24/7. Performing live debugging on such applications is akin to doctors operating on a patient. Since the advent of the “as a service” model, software is like a living, breathing entity, akin to an anatomical system. Operating on such entities requires more dexterity on […]

The post Debugging Modern Applications: Advanced Techniques appeared first on Lightrun.

]]>
Today’s applications are designed to be always available and serve users 24/7. Performing live debugging on such applications is akin to doctors operating on a patient.

Since the advent of the “as a service” model, software is like a living, breathing entity, akin to an anatomical system. Operating on such entities requires more dexterity on the developer’s part, to ensure that the software application lives on while being debugged and improved continuously.

Let’s look at time travel debugging, continuous observability, and more advanced debugging and live debugging techniques that are available to developers working on modern applications.

1. Time Travel Debugging for Live Issue Analysis

Time travel debugging allows developers to reconstruct and replay the historical runtime state of a running application. The runtime state consists of logs, snapshots, and other metrics, and data is captured with timestamps. Therefore, it can be time-traversed by going back and forth in time to understand the series of events that led to a bug.

Replaying the runtime execution sequence makes it possible to understand the system behavior better. Visualization also plays an important role in this process. There are several ways of visualizing the runtime state for assisting in time travel debugging, such as:

  • Timeline view: a chart of logged events plotted along a timeline.
  • Object graphs: a graph that depicts objects, their properties, and references between objects as nodes.
  • Memory heat maps: visualizations of memory allocation and access patterns.

Apart from these visualization approaches, it is also possible to reconstruct a visual illustration of the runtime behavior based on standard UML (Unified Modeling Language) diagrams, such as state diagrams and sequence charts.
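
The recording side of this idea can be sketched in a few lines of plain Python. The snippet below (an illustrative toy with made-up event names, not any particular tool's API) captures timestamped state events and replays those inside a chosen time window:

```python
import time
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    label: str
    state: dict

class Recorder:
    """Captures timestamped state events so they can be traversed later."""

    def __init__(self):
        self.events = []

    def capture(self, label, **state):
        self.events.append(Event(time.time(), label, state))

    def replay(self, start=None, end=None):
        """Return the recorded events inside a time window, in order."""
        return [e for e in self.events
                if (start is None or e.timestamp >= start)
                and (end is None or e.timestamp <= end)]

recorder = Recorder()
recorder.capture("order_created", order_id=42, total=99.5)
recorder.capture("payment_failed", order_id=42, reason="card_declined")

# "Travel" back through the recorded history to see what led to the failure.
for event in recorder.replay():
    print(event.label, event.state)
```

Because every event carries a timestamp, the same recording can feed a timeline view or be narrowed to the moments immediately preceding a bug.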

2. Chaos Testing for Live Simulation of Disasters

Chaos testing is a technique that intentionally introduces various failures into a software system. The main goal of this test is to measure the resiliency of the software and its ability to recover from unpredictable conditions.

This is not a debugging technique to fix a specific problem. Instead, it is a strategic debugging approach for assessing software reliability in the face of extreme disasters.

Some of the primary approaches to performing chaos testing include:

  • Injecting failures. Failures such as network delays, server crashes, and expired certificates are randomly simulated to trigger anomalous behaviors.
  • Exceeding thresholds. The load on the system is deliberately increased to breach technical thresholds such as network bandwidth, data storage, or computing power, causing resource exhaustion.
  • Global disruption. Essential services the system depends on, such as databases, message queues, caches, and APIs, are disrupted by killing processes or shutting down critical infrastructure such as servers, availability zones, or regions.
  • Forced security intrusion. Security breaches are forced through simulated attacks, access loopholes, and failed authentication procedures to validate system sanity and map attack vectors.
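
The failure-injection approach can be made concrete with a small sketch. The decorator below (an illustrative toy, not a chaos-engineering framework; all names and rates are invented) randomly injects latency and connection errors so the resilience of a simple retry policy can be measured:

```python
import random
import time

def chaos(failure_rate=0.2, max_delay=0.01, seed=None):
    """Decorator that injects random latency and failures into a call,
    simulating flaky networks and crashing upstream services."""
    rng = random.Random(seed)

    def wrap(fn):
        def inner(*args, **kwargs):
            time.sleep(rng.uniform(0, max_delay))  # injected network delay
            if rng.random() < failure_rate:        # injected failure
                raise ConnectionError("chaos: simulated upstream failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaos(failure_rate=0.5, seed=7)
def fetch_user(user_id):
    return {"id": user_id, "name": "demo"}

# Measure resilience: how many requests survive with a 3-attempt retry policy?
successes = 0
for _ in range(20):
    for _attempt in range(3):
        try:
            fetch_user(1)
            successes += 1
            break
        except ConnectionError:
            continue
print(f"{successes}/20 requests survived the injected chaos")
```

Raising the failure rate or removing the retry loop immediately shows how brittle the caller is, which is exactly the kind of insight chaos testing is after.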

3. Shift Right Testing for Live Performance Predictions

Shift right testing is a DevOps practice. It mandates testing the software under real-world conditions, in production or production-like environments, later in the delivery pipeline. This approach is the opposite of the shift left methodology, which requires developers to perform quality and security checks during development, before the code reaches production.

Both approaches complement each other. However, achieving shift right testing is operationally intensive. That is because it involves reproducing the production environment and simulating heavy user traffic, which should be of the same order of magnitude as production traffic.

Like chaos testing, shift right testing is a broader debugging strategy. This approach de-risks the production deployment from unforeseen issues that may cause disruptions later due to undiscovered severe bugs.

4. Continuous Observability for Live Debugging

Continuous observability allows developers to observe and record the internal state of software during the entire DevOps cycle. More importantly, this is performed without any alteration at the source code level. This approach is best suited for live debugging of specific issues without halting runtime execution or forcing changes to the source code to capture telemetry data.

Continuous observability is best achieved by injecting an agent within the running software. The agent occupies a minimum footprint and captures logs, snapshots, and other metrics required for analysis during live debugging. This technique also complements time travel debugging since the data captured during live debugging can be sorted in time order to analyze the bug.
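
As a rough sketch of the agent idea, the toy Python "agent" below periodically samples the stacks of all running threads from a background thread, recording time-ordered observations without any change to the application code. It relies on CPython's private sys._current_frames() and is only a teaching aid; a production agent such as Lightrun's is far more sophisticated and lower-overhead:

```python
import sys
import threading
import time

class SamplingAgent:
    """Toy in-process agent: periodically samples every thread's stack,
    recording (timestamp, location) pairs without modifying or halting
    the application. Uses CPython's private sys._current_frames()."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            locations = {tid: f"{frame.f_code.co_name}:{frame.f_lineno}"
                         for tid, frame in sys._current_frames().items()}
            self.samples.append((time.time(), locations))
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

def busy_work():
    return sum(range(500_000))

agent = SamplingAgent()
agent.start()
busy_work()
time.sleep(0.05)  # give the agent time to take a few samples
agent.stop()
print(f"captured {len(agent.samples)} stack samples, in time order")
```

Because the samples are timestamped, they can later be sorted and traversed, which is how this technique complements time travel debugging.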

Supercharge Live Debugging with Lightrun

At Lightrun, we are passionate about helping developers improve their debugging productivity. Lightrun is designed to integrate with IDEs just like their native debuggers but with advanced live debugging support.

Unlike traditional debuggers, which halt the runtime execution of software during debugging, Lightrun allows developers to perform these steps dynamically while the runtime execution carries on. Behind the scenes, this capability is backed by dynamic logs, dynamic telemetry, and dynamic instrumentation.

Dynamic logs from Lightrun can be exported to a visualization platform for time travel debugging. Dynamic telemetry allows chaos and shift right tests to capture valuable data about system performance under various simulated load conditions. Above all, dynamic instrumentation allows developers to set virtual breakpoints anywhere in the source code for continuous observability of the software under production.

If you want to experience what it is like to perform live debugging on running production software, sign up for a free Lightrun trial and get started within minutes with your Java, Python, Node.js, or .NET applications. If you’d rather know more before you start, feel free to request a Lightrun demo.

The post Debugging Modern Applications: Advanced Techniques appeared first on Lightrun.

]]>
Effective Remote Debugging in PyCharm https://lightrun.com/effective-remote-debugging-in-pycharm/ Tue, 03 Oct 2023 08:25:41 +0000 https://lightrun.com/?p=12253 In a previous post, we looked at the remote debugging features of Visual Studio Code and how Lightrun takes the remote debugging experience to the next level. This post will examine how Lightrun enables Python remote debugging in PyCharm, the Python IDE from JetBrains. Remote Debugging in PyCharm PyCharm has many developer-friendly features, including an […]

The post Effective Remote Debugging in PyCharm appeared first on Lightrun.

]]>
In a previous post, we looked at the remote debugging features of Visual Studio Code and how Lightrun takes the remote debugging experience to the next level. This post will examine how Lightrun enables Python remote debugging in PyCharm, the Python IDE from JetBrains.

Remote Debugging in PyCharm

PyCharm has many developer-friendly features, including an integrated debugger. It also boasts several advanced debugging features not found in other IDEs.

Some of PyCharm’s key debugging features are:

  • Better support for Python. Being a Python IDE, PyCharm is well-suited for Python applications, including multithreaded processes. Other IDEs require additional code or configuration to enable smooth debugging in advanced scenarios.
  • Built-in profiler. PyCharm boasts a built-in profiler to help remove performance bottlenecks from your Python code.
  • Remote debugging: PyCharm also supports remote debugging where you can attach to one process or several processes running in parallel.

Drawbacks of Remote Debugging in PyCharm

Although PyCharm provides rich features for Python development, its native support for remote debugging is fairly limited.

  • Manual configuration. PyCharm’s remote debugging workflow requires setting up an SSH connection to a remote host and an additional run configuration to deploy the Python interpreter remotely.
  • Source code pollution. If you want to use PyCharm for remote debugging, you need to install and import a separate Python package, pydevd_pycharm. This means you introduce untracked temporary changes to your code, which isn’t a good practice.
  • Not suitable for production debugging. PyCharm’s debugger, whether local or remote, is suited for debugging applications in development and pre-production environments. Debugging a production application is not possible without making code and configuration alterations.
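
To illustrate the source code pollution point: PyCharm's remote-debug workflow asks you to embed a snippet along these lines in the application itself. The host and port below are placeholders for wherever the PyCharm debug server is listening, and the broad exception guard is our own addition so the app still starts when no IDE is attached:

```python
# Debug-only instrumentation that must live in the application source.
debug_attached = False
try:
    import pydevd_pycharm  # PyCharm's remote-debug helper package

    pydevd_pycharm.settrace(
        "192.168.1.10",       # placeholder: machine running PyCharm
        port=5678,            # placeholder: debug server port
        stdoutToServer=True,  # mirror stdout back to the IDE
        stderrToServer=True,  # mirror stderr back to the IDE
    )
    debug_attached = True
except Exception:
    # Package missing or no IDE listening: carry on without debugging.
    pass
```

Because this import and call must be added to the code under debug, it tends to leak into version control or require untracked local edits, which is exactly the pollution problem described above.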

The native support for remote debugging in PyCharm is comparable to that of other debuggers, such as VS Code's. All of these debuggers still rely on the traditional "halt, inspect, and resume" approach.

There are two primary shortcomings of these debuggers:

  1. Debugging by controlling the runtime execution. These debuggers are designed to halt and resume the runtime execution. This technique cannot be utilized for production applications, which must always be running to serve the users.
  2. Attachment to long-running processes. Traditional debuggers were designed for monolithic applications, which execute as long-running processes. This approach does not work in a cloud-native environment with hundreds of ephemeral processes.

Therefore, the traditional debugging approaches supported by PyCharm and other IDEs are only suitable for long-running processes in a non-production environment. With the advent of cloud-native applications designed for the modern deployment model, these shortcomings become a clear bottleneck for developers to debug effectively.

Leveling up PyCharm Remote Debugging with Lightrun

Lightrun is a developer-centric continuous observability platform that integrates with most popular IDEs for Java, Node.js, and Python. In the case of PyCharm, it is available as a plugin.

Remote debugging in PyCharm using Lightrun

Lightrun extends PyCharm’s remote debugging capabilities in a few ways:

  • Native IDE support for remote debugging. The Lightrun plugin integrates with PyCharm to offer all the visual controls for remote debugging. Developers have the option to connect to a remotely executing Python application on-the-fly and perform debugging right in the IDE. Lightrun handles the connections and setup for accessing the remote Python application in real time.
  • Ready for cloud-native debugging. By embedding the Lightrun agent as a Python module within the Python applications, developers can control their production applications remotely and seamlessly perform debugging actions such as setting virtual breakpoints, extracting snapshots of the stack, and capturing metrics on multiple instances of a cloud-native Python application.
  • Highly secure remote debugging. Lightrun’s security architecture ensures that remote debugging is performed in a sandboxed environment. It is a robust, patented mechanism that ensures that every debugging action performed on a production application is secured to ensure the privacy of the source code and no ill effects on the application’s performance.

PyCharm + Lightrun = Production Grade Remote Debugging

With Lightrun’s integration into PyCharm, developers get higher debugging productivity. Instead of spending hours logging the debugging data across the entire Python source code and analyzing it later, they can capture all the data in PyCharm.

For companies, Lightrun’s dynamic observability capabilities help efficiently detect and address security vulnerabilities in corporate software development. Overall, it leads to a faster time to market.

If you are keen to know more, you can try Lightrun yourself using the playground, or book a demo for a guided introduction.

The post Effective Remote Debugging in PyCharm appeared first on Lightrun.

]]>
Lightrun’s Product Updates – Q3 2023 https://lightrun.com/lightruns-product-updates-q3-2023/ Fri, 29 Sep 2023 13:11:23 +0000 https://lightrun.com/?p=12247 Throughout the third quarter of this year, Lightrun continued its efforts to develop a multitude of solutions and improvements focused on enhancing developer productivity. Their primary objectives were to improve troubleshooting for distributed workload applications, reduce mean time to resolution (MTTR) for complex issues, and optimize costs in the realm of cloud computing. Read more […]

The post Lightrun’s Product Updates – Q3 2023 appeared first on Lightrun.

]]>
Throughout the third quarter of this year, Lightrun continued its efforts to develop a multitude of solutions and improvements focused on enhancing developer productivity. Their primary objectives were to improve troubleshooting for distributed workload applications, reduce mean time to resolution (MTTR) for complex issues, and optimize costs in the realm of cloud computing.

Read on below for the main new features and key product enhancements released in Q3 2023!

📢 NEW! Lightrun Support for Action Creation Across Multiple Sources!

Lightrun is excited to announce that developers can now select multiple agents and tags as a single source when creating an action directly from their IDEs. This option lets them simultaneously apply an action to a custom group of agents and tags, which improves their plugin experience and makes it easier to debug with multiple agents and tags. To learn more, see selecting multiple sources in VSCode and selecting multiple sources in JetBrains.

📢 NEW! Enhanced Capability for Capturing High-Value Snapshot Actions

We’ve taken snapshot capturing to the next level by enabling you to now capture large values for Python and Node.js agents. As part of this enhancement, we’ve raised the default settings to accommodate larger string values. You can also define maximum limits in the agent.config file through the introduction of the max_snapshot_buffer_size, max_variable_size, and max_watchlist_variable_size fields. For more information, refer to the relevant Agent documentation: Python Agent Configuration and Node.js Agent Configuration.

📢 NEW! Duplication of Actions from Within the IDE Plugins 🎉

Lightrun now offers an easy and more efficient way to insert Lightrun actions using ‘Copy and Paste’ within your JetBrains IDE, which allows developers to easily reuse existing actions in multiple locations within your code. This new functionality applies to all Lightrun action types, including Lightrun snapshots, metrics, and logs. It simplifies the task of reviving expired actions or duplicating actions which have non-trivial conditions and/or watch expressions.

Similarly, we’ve added a new Duplicate action within the VS Code IDE, which allows developers to easily reuse existing actions in multiple locations within their code. This new functionality applies to all Lightrun action types, including Lightrun snapshots, metrics, and logs, simplifying the task of creating non-trivial conditions and/or watch expressions.


📢 NEW! PII Redaction per Agent Pool 

With the introduction of PII Redaction Templates, Lightrun now supports additional granularity for applying PII Redaction effectively. You can either establish a single default PII Redaction template to be applied to all your agents, or create and assign distinct PII Redaction templates for different agent pools. For example, you might apply PII Redaction only to a Production environment and not to Development or Staging.

To help you get started with configuring your PII redaction on Agent Pools, we provide a single Default template on the PII Redaction page which serves as a starting point for creating your templates. Note that it does not contain any predefined patterns and is not assigned to any agent pools. For more information, see Assigning PII Redaction templates to Agent Pools.

Feel free to visit Lightrun’s website to learn more or if you’re a newcomer, try it for free!

The post Lightrun’s Product Updates – Q3 2023 appeared first on Lightrun.

]]>
Putting Developers First: The Core Pillars of Dynamic Observability https://lightrun.com/putting-developers-first-the-core-pillars-of-dynamic-observability/ Sun, 24 Sep 2023 13:55:04 +0000 https://lightrun.com/?p=12241 Introduction Organizations today must embrace a modern observability approach to develop user-centric and reliable software. This isn’t just about tools; it’s about processes, mentality, and having developers actively involved throughout the software development lifecycle up to production release. In recent years, the concept of observability has gained prominence in the world of software development and […]

The post Putting Developers First: The Core Pillars of Dynamic Observability appeared first on Lightrun.

]]>
Introduction

Organizations today must embrace a modern observability approach to develop user-centric and reliable software. This isn’t just about tools; it’s about processes, mentality, and having developers actively involved throughout the software development lifecycle up to production release.

In recent years, the concept of observability has gained prominence in the world of software development and operations. Rooted in three foundational pillars—logging, metrics, and tracing—observability provides a comprehensive understanding of application behavior. These pillars allow teams to diagnose and address issues with greater precision and efficiency.

However, a notable challenge in observability is that many tools available today are designed by and for operations teams. Their primary focus often lies in monitoring, alerting, and system health from an infrastructural standpoint. This design bias can leave developers, who require a different granularity and data context, somewhat in the lurch. Instead of offering insights into code behavior, performance bottlenecks, or specific code-level issues, traditional observability tools may present data in a way that’s more aligned with operational needs. This mismatch underscores the importance of creating or adopting observability tools that cater explicitly to developers, ensuring that they can gain actionable insights from the system and application data in a manner that resonates with their specific workflow and challenges.

With the surge in adopting a platform engineering approach, there’s a profound shift in how organizations perceive and manage the Software Development Life Cycle. At the heart of this approach is providing developers with a robust platform that abstracts away infrastructural complexities and offers tools and services that accelerate development. As platform engineering becomes a catalyst for advanced SDLC management, there is a pressing need to elevate observability proficiency across organizations. Platform engineering, by design, involves a profound intersection of development and operations, which necessitates that the engineers possess a unique blend of skills. Among the emerging skill sets, debugging and observability stand out as paramount. 

Why Developer Ownership is Non-negotiable

Over recent years, the software engineering industry has recognized the importance of granting developers ownership of their products to ensure software reliability, agility, and ease of maintenance. Developers should have control over their code, from creation to deployment. They must be able to deploy, rollback, observe, and debug code in production in order to speed up the feedback loop at the core, enabling faster improvements.

With the right tools and clear ownership, both the software and the overall user experience improve. Real-time debugging in a production environment is invaluable, as developers have the context and knowledge to quickly fix an issue, since they understand the recent changes best.

The Lightrun Three Pillars of Dynamic Observability

Lightrun offers a suite of features designed to enhance developers’ capabilities. One standout aspect is Lightrun’s ability to debug applications right in the live environment, providing real-time, on-demand insights irrespective of where the application is running.

Pillar 1. Dynamic Logging

Text logging remains a fundamental debugging tool. However, using it in remote environments presents challenges. Centralized logging platforms have grown, offering centralized log ingestion with efficient search capabilities. Yet, they often fall short for real-time remote debugging, mainly because of inherent delays, focus on post-event analysis, and disconnection from the local development environment.

In debugging remote environments, traditional logging can slow down the feedback loop for the developer, as adding a log line usually requires at least an entire CI/CD pipeline run, and most often, deploying a new version to production is impossible or hard to do frequently.

Many developers opt for overlogging to compensate, leading to increased storage, computation, and possible licensing costs, not counting the difficulty of navigating a massive amount of logs to find the required piece of information.

Finally, log tools are often poorly integrated into developers’ IDEs, resulting in an unnecessary learning curve and shifting developers’ attention away from their primary environment. In some extreme cases, developers lack direct access to production logs because the organization cannot offer a method for secure access.

On the other hand, Lightrun Dynamic Logging enables developers to add new logs without halting the application. This ensures uninterrupted access to crucial data directly from the developer IDE. There’s also the possibility to log only when a specific code-level condition is true, significantly reducing the amount of information that needs to be evaluated to pinpoint an issue.
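
The conditional-logging idea can be approximated with the Python standard library (a sketch of the concept only, not Lightrun's API; the function names and the threshold are invented):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("orders")

def conditional_log(condition, message, *args):
    """Emit a log line only when a code-level condition holds, so the
    log stream stays focused on the case under investigation."""
    if condition:
        log.info(message, *args)

def process_order(order_id, total):
    # Only suspiciously large orders produce output, instead of every order.
    conditional_log(total > 1000, "large order %s: total=%.2f", order_id, total)
    return total * 0.9

process_order(1, 50.0)    # silent
process_order(2, 2500.0)  # logged
```

The key difference with Lightrun is that such a condition can be attached to a live application on demand from the IDE, rather than being written, committed, and redeployed in advance.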

Pillar 2. Snapshots

Traditional debugging methods often involve a fragmented approach: logs for raw data, metrics for system health overviews, traces for request flows across services, and the occasional breakpoint to dive deep into a specific problem. While each tool offers its distinct advantage, developers often find themselves bouncing between them, trying to piece together a comprehensive understanding of what’s happening within their code. This approach can slow debugging and leave significant gaps in understanding, especially when attempting to correlate high-level data with specific code behaviors. Also, the powerful debugging model in which the developer places breakpoints in the application cannot be directly translated to live running applications, as you cannot easily block them.

On the other hand, Lightrun Snapshots introduce a paradigm shift in the debugging process by acting as virtual breakpoints that don’t disrupt the flow of application execution. Unlike traditional breakpoints, which halt execution for inspection, Lightrun Snapshots seamlessly blend into the running application, allowing developers to add conditions, evaluate expressions, and delve deep into any code-level object without ever having to stop, restart, or redeploy the application. Integrated completely within the developer’s IDE, these snapshots not only offer a debugger-like experience but also enable a deeper connection to live applications by alerting developers when specific code segments are executed. This dynamic and continuous approach to debugging, compatible with a range of platforms like AWS, Azure, and Kubernetes, ensures that developers can gain deep insights into their applications right beside the source code, making debugging more intuitive and efficient.
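
A toy approximation of a non-breaking snapshot, using Python's inspect module, might look like the sketch below. This is CPython-specific frame introspection for illustration only, not Lightrun's implementation, and all names are invented:

```python
import inspect
import time

snapshots = []

def snapshot(condition=True):
    """Toy 'virtual breakpoint': when the condition holds, record the call
    site, its local variables, and the stack, then let execution continue."""
    if not condition:
        return
    caller = inspect.currentframe().f_back  # CPython-specific introspection
    snapshots.append({
        "time": time.time(),
        "where": f"{caller.f_code.co_name}:{caller.f_lineno}",
        "locals": dict(caller.f_locals),    # copy; don't hold the frame
        "stack": [f.function for f in inspect.stack()[1:]],
    })

def apply_discount(price, rate):
    discounted = price * (1 - rate)
    snapshot(condition=discounted < 0)  # only fires on the buggy case
    return discounted

apply_discount(100, 0.2)  # healthy call: no snapshot taken
apply_discount(100, 1.5)  # buggy call: snapshot captured, execution continues
print(snapshots[0]["where"], snapshots[0]["locals"])
```

Unlike a real breakpoint, the second call is never paused: the stack and variables are captured, and execution flows straight on to the return statement.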

Pillar 3. Metrics

Traditionally, just like with logs, developers have often felt the need to preemptively add many metrics, trying to cover all bases. This scattershot approach not only clutters the telemetry data but also risks overlooking that one critical metric needed during a production issue. Lightrun, however, challenges this paradigm by offering dynamic, code-level metrics. Instead of, or in addition to, instrumenting the application with metrics upfront, Lightrun allows for the real-time insertion of precise metrics directly into live applications, ensuring relevance and accuracy without compromising the execution or state of the application.

With its comprehensive suite of tools, developers can gain insights ranging from the frequency of a specific line being executed with the Counter, to the time efficiency of methods with Method Duration and even block-wise timing with TicToc. Custom Metrics further broaden the scope, granting the freedom to export any numeric expression into a trackable metric. 
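
To make these three metric types concrete, they can be mimicked in plain Python (an illustrative sketch, not Lightrun's API; the labels and handler function are invented):

```python
import time
from collections import Counter
from contextlib import contextmanager

hits = Counter()  # Counter-style metric: executions per code location
durations = {}    # cumulative wall-clock time per named block

def count(label):
    """Counter-style metric: tally how often a specific line executes."""
    hits[label] += 1

@contextmanager
def tictoc(label):
    """TicToc-style metric: time an arbitrary block of code."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[label] = durations.get(label, 0.0) + (time.perf_counter() - start)

def method_duration(fn):
    """Method-duration metric, expressed as a decorator."""
    def wrapper(*args, **kwargs):
        with tictoc(fn.__name__):
            return fn(*args, **kwargs)
    return wrapper

@method_duration
def handle_request(n):
    for i in range(n):
        if i % 2 == 0:
            count("even_branch")  # counter attached to a specific line
    return n

handle_request(10)
print(hits["even_branch"], f"{durations['handle_request']:.6f}s")
```

The difference Lightrun brings is that equivalents of count and tictoc can be attached to a running production process on demand, rather than being written into the source ahead of time.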

In Summary

With its suite of features, including dynamic logging, snapshots, and real-time metrics, Lightrun integrates seamlessly with developers’ existing IDEs, positioning itself as an essential ally in the modern development toolkit. If you’re looking to stay ahead in the competitive development space, Lightrun might just be your answer. Dive into its functionalities on the playground, or schedule a demo to experience its capabilities firsthand!


The post Putting Developers First: The Core Pillars of Dynamic Observability appeared first on Lightrun.

]]>
Effective Remote Debugging with VS Code https://lightrun.com/remote-debugging-vs-code/ Mon, 14 Aug 2023 18:19:40 +0000 https://lightrun.com/?p=12228 This post will discuss remote debugging in VS Code and how to improve the remote debugging experience to maximize debugging productivity for developers. Visual Studio Code, or VS Code, is one of the most popular IDEs. Within ten years of its initial release, VS Code has garnered the top spot among popularity indices, and its […]

The post Effective Remote Debugging with VS Code appeared first on Lightrun.

]]>
This post will discuss remote debugging in VS Code and how to improve the remote debugging experience to maximize debugging productivity for developers.

Visual Studio Code, or VS Code, is one of the most popular IDEs. Within ten years of its initial release, VS Code has garnered the top spot among popularity indices, and its community is growing steadily. Developers love VS Code not only for its simplicity but also due to its rich ecosystem of extensions, including the support for debugging.

VS Code Remote Debugging Features

Being an integrated environment, VS Code has built-in support for debugging in many languages. Support for Node.js applications is available by default. This includes JavaScript, Typescript, and any other language that gets transpiled to JavaScript. Language extensions are also available for Python, C/C++, and most popular programming languages.

VS Code’s remote debugging features allow developers to debug a process running on a remote machine or device. This scenario is the opposite of local debugging, where the debugging is performed on a process spawned within VS Code’s integrated environment.

VS Code’s mechanism for debugging relies on attaching the debugger to a process, which is the executable program to be debugged. VS Code offers a custom launch configuration that allows many ways of attaching the debugger to a process. When debugging locally, the process executes inside VS Code’s environment, and the debugger is attached automatically. When you use VS Code for remote debugging, the launch configuration is updated with parameters for the debugger to point to a process running on a remote host via the IP address.

Some of the features of VS Code remote debugging are:

  1. Consistent debugging UI. In VS Code, the user interface for debugging remains unchanged irrespective of local or remote debugging.
  2. Custom launch configuration. VS Code launch configurations offer many options to set parameters for remote debugging. This mainly includes:
    1. Port forwarding to set up communication between the VS Code debugger and the process running on the remote computer.
    2. Source paths to point to the correct source code version associated with the running process.
    3. Environment variables to set additional variables to control the debugging session.
  3. Multi-target debugging. VS Code supports multi-target debugging, wherein developers can launch more than one debugging session pointing to different processes.
  4. Debugging controls. Remote debugging in VS Code provides the same debugging controls developers use in a local debugging environment, including setting breakpoints and logpoints, and manually stepping through the code.

Additionally, it supports multiple debug protocol adapters for different languages like C++, Python, Go, etc., with the extensibility to build custom debugging adapters for other platforms.
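
For concreteness, a typical remote-attach entry in .vscode/launch.json for the Python (debugpy) adapter looks roughly like the sketch below; the host, port, and path mapping are placeholders for your own environment:

```json
{
  "name": "Python: Remote Attach",
  "type": "python",
  "request": "attach",
  "connect": { "host": "203.0.113.5", "port": 5678 },
  "pathMappings": [
    { "localRoot": "${workspaceFolder}", "remoteRoot": "/app" }
  ]
}
```

The pathMappings entry corresponds to the source-paths item above, mapping local source files to their location on the remote host so breakpoints bind to the correct code.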

The Paradigm Shift for Debugging

Despite all the rich debugging capabilities, the VS Code debugging interface has shortcomings. To understand these shortcomings, it is vital to know how classical debugging methodology evolved in software engineering.

The classical debugging workflow relies on three approaches:

  1. Halting the process execution. This is done using breakpoints to halt the runtime execution of the process at a certain point where the bug is most likely to reproduce.
  2. Examining the stack trace. This is done while the process execution is halted, to examine variable values.
  3. Manual control of business logic. This is done to step through the execution of the process, one source code line at a time, and optionally substituting variable values to understand the system behavior precisely.

Given the advancements in software design and deployment models, this traditional approach to debugging, as supported by integrated environments like VS Code, falls short in many ways. This deficiency is due to a combination of paradigm shifts across multiple facets of software development: from desktop to cloud-hosted applications, from monoliths to microservices, and from legacy VMs to cloud-native deployments.

Disadvantages of Remote Debugging in VS Code

Given these sweeping paradigm shifts the industry has witnessed in the last few decades, VS Code’s local and remote debugging experience has the following disadvantages:

  1. Traditional debugging isn’t helpful in production environments. The classical debugging approach relies on halting and manual control of process execution, which is not an option for the production environment. With the advent of agile methodologies, developers spend more time fixing bugs in the project’s staging and production phases than in the development phase. Therefore, runtime observability and monitoring are gaining precedence over debugging.
  2. Debuggers were never designed for cloud-native applications. Cloud-native applications are distributed across multiple containers. While VS Code remote debugging supports containerized applications, they can only be used for long-running processes. In contrast, cloud-native deployment uses multiple ephemeral containers, which cannot be managed through the VS Code debugger interface. Also, the traditional debugging approach does not help unearth hard-to-find bugs that occur due to data races or deadlocks common in complex cloud-native applications running across hundreds of containers.
  3. Manipulating the control flow is less relevant in this age of AI. Artificial Intelligence based applications rely on complex data models to make decisions instead of hand-coded control flow logic. The VS Code debugger interface cannot debug such processes, since they require a different level of observability and analysis beyond just manipulating the business logic.
  4. Security issues in remote debugging. Facilitating remote debugging also exposes specific ports on the remote computer where the process runs. Even though VS Code supports SSH-based connections for secured access, there are no additional measures to impose IAM (Identity and Access Management)-like permissions. This can result in debug-enabled applications running in production with credentials shared between development teams, leading to a potential security breach in the future.

Enhanced Remote Debugging in VS Code with Lightrun

Lightrun breaks the stereotype of classical debugging and enables debugging any application on any deployment.

The core approach for Lightrun revolves around developer observability, which allows developers to observe the internal behavior of an application at runtime. It surpasses the drawbacks of traditional debugging in the following ways:

  1. Designed for remote debugging in the cloud. All modern, cloud-hosted applications are offered through the “as a service” model, which requires them to be constantly running and available to serve the end user. Lightrun facilitates remote debugging of production applications running on cloud environments without custom configurations or manipulation in process runtime. This includes the popular deployment orchestration platforms such as Kubernetes.
  2. Designed for instant observability. Lightrun can capture live logs and instant snapshots of the running application, offering instant observability. The snapshots act like virtual breakpoints, which provide information about stack traces and variables without pausing the program execution.
  3. Designed for debugging entire applications instead of individual processes. Rather than attaching to every process instance, the Lightrun agent gets embedded within all the runtime workloads of the application. All the logs and snapshots collected from the multiple runtime process instances can be collated in one place for easier investigation of bugs.

Transcend from Remote Debugging to Live Debugging in VS Code

The best part about Lightrun is that it is available as a VS Code extension:

Lightrun extension for remote debugging in VS Code

Developers take advantage of a familiar interface to perform live debugging actions, right in VS Code:

Lightrun actions in VS Code

While debugging, Lightrun panel views inside VS Code display logs and detailed snapshot information related to the running application:

Lightrun debugging panels in VS Code

Behind the scenes, the Lightrun VS Code plugin connects to Lightrun agents embedded within the application to make all the live debugging magic happen.

If you are keen to explore Lightrun integration with VS Code further, check out the Lightrun documentation.

You can also sign up for a Lightrun account and get started with live debugging of Node.js, Java, Python, or .NET applications.

The post Effective Remote Debugging with VS Code appeared first on Lightrun.

]]>
Three Code Instrumentation Patterns To Improve Your Node.js Debugging Productivity https://lightrun.com/code-instrumentation-patterns-nodejs/ Thu, 10 Aug 2023 08:41:59 +0000 https://lightrun.com/?p=12216 In this age of complex software systems, code instrumentation patterns define specific approaches to debugging various anomalies in business logic. These approaches offer more options beyond the built-in debuggers to improve developer productivity, ultimately creating a positive impact on the software’s commercial performance. In this post, let’s examine the various code instrumentation patterns for Node.js. […]

The post Three Code Instrumentation Patterns To Improve Your Node.js Debugging Productivity appeared first on Lightrun.

]]>
In this age of complex software systems, code instrumentation patterns define specific approaches to debugging various anomalies in business logic. These approaches offer more options beyond the built-in debuggers to improve developer productivity, ultimately creating a positive impact on the software’s commercial performance.

In this post, let’s examine the various code instrumentation patterns for Node.js. We will briefly touch upon the multiple practices for code instrumentation available within the Node.js ecosystem and cover three code instrumentation patterns for distinct debugging scenarios.

Essential Code Instrumentation Patterns for Node.js

Node.js has a rich ecosystem of tooling, including debuggers and libraries. As a Node.js developer, you have the following basic options to understand what your code is doing:

  1. Built-in inspector tools and clients to debug the Node.js source code.
  2. A telemetry SDK, such as the OpenTelemetry SDK, for embedding traces and logs within the Node.js source code.

The first option is the traditional technique of manually controlling the execution of the Node.js program based on certain preconditions. It resembles the old-fashioned way of debugging code in legacy programming languages like C.
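In practice, the built-in route typically means dropping a `debugger` statement into the code and running the script under an inspector client (for example, `node inspect app.js`). When no inspector is attached, the statement is a no-op:

```javascript
// Built-in debugging: execution halts at the `debugger` statement
// only when the script runs under an inspector; otherwise it is a no-op.
function computeTotal(prices) {
  const total = prices.reduce((sum, p) => sum + p, 0);
  debugger; // inspect `prices` and `total` here when attached
  return total;
}

console.log(computeTotal([2, 3, 5])); // prints 10
```

This halting style is exactly what makes the approach unsuitable for always-on production workloads.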

The second option offers more flexibility and relies on three commonly used code instrumentation patterns:

  1. Pattern #1 – Tracing: used for capturing the program’s internal call stack at a given point of execution to gain a deeper understanding of its flow.
  2. Pattern #2 – Debug Logs: used for capturing the program’s internal state during execution.
  3. Pattern #3 – Information Logs: used for capturing information messages to mark the normal progression of the program execution.
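Stripped of any SDK, the three patterns can be sketched with nothing more than the console. The function below is a purely illustrative example (not the OpenTelemetry API, which we use next):

```javascript
// Simplified, SDK-free sketch of the three instrumentation patterns.
function processOrder(orderId) {
  // Pattern #3 - Information log: marks normal progression
  console.info(`[INFO] processing order ${orderId}`);

  const items = [12.5, 7.25];

  // Pattern #2 - Debug log: exposes internal state during execution
  console.debug(`[DEBUG] items=${JSON.stringify(items)}`);

  // Pattern #1 - Tracing: capture the current call stack at this point
  // (Error().stack is a crude stand-in for a real span/trace)
  console.debug(`[TRACE] ${new Error().stack.split('\n')[1].trim()}`);

  return items.reduce((sum, price) => sum + price, 0);
}

console.log(processOrder('A-100')); // prints 19.75 after the log lines
```

A real telemetry SDK adds structure (spans, attributes, exporters) on top of these same three ideas.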

Let’s take a quick example to demonstrate the tracing code instrumentation pattern in a Node.js program. Tracing helps capture the deeper context of the program’s flow and data progression. Take a look at this sample Node.js script:

const os = require('os-utils');
const opentelemetry = require('@opentelemetry/api');
const {SemanticResourceAttributes} = require('@opentelemetry/semantic-conventions');
const {NodeTracerProvider} = require('@opentelemetry/sdk-trace-node');
const {ConsoleSpanExporter, BatchSpanProcessor} = require('@opentelemetry/sdk-trace-base');

const CALCULATION_DURATION = 300000; // Calculation duration in milliseconds (300 seconds)
const CPU_IDLE_THRESHOLD = 20; // CPU idle threshold (20%)
const CPU_OVERLOAD_THRESHOLD = 80; // CPU overload threshold (80%)
const MEMORY_IDLE_THRESHOLD = 20; // Memory idle threshold (20%)
const MEMORY_OVERLOAD_THRESHOLD = 80; // Memory overload threshold (80%)

const provider = new NodeTracerProvider({});
const exporter = new ConsoleSpanExporter();
const processor = new BatchSpanProcessor(exporter);
provider.addSpanProcessor(processor);

provider.register();
const tracer = opentelemetry.trace.getTracer('system-stats');

function calculateAverageUsage(samples) {
    const sum = samples.reduce((total, usage) => total + usage, 0);
    return sum / samples.length;
}

async function getStats() {
    const stats = {
        memoryUsage: Math.floor(100 - (100 * os.freememPercentage())),
        cpuUsage: Math.floor(100 * (await new Promise(resolve => os.cpuUsage(resolve)))),
    };
    return stats;
}

console.log('Starting System Stats Sampling...');

const cpuSamples = [];
const memorySamples = [];
const startTime = Date.now();

const intervalId = setInterval(async () => {

    tracer.startActiveSpan('main', async (span) => {

        let sysStats = await getStats();

        cpuSamples.push(sysStats.cpuUsage);
        memorySamples.push(sysStats.memoryUsage);

        span.addEvent('STAT_CAPTURE', {
            'cpuUsage': sysStats.cpuUsage,
            'memoryAvailable': sysStats.memoryUsage
        });

        span.end();

    });

    const elapsedTime = Date.now() - startTime;

    if (elapsedTime >= CALCULATION_DURATION) {
        const averageCPU = calculateAverageUsage(cpuSamples);
        const averageMemory = calculateAverageUsage(memorySamples);

        console.log('Calculation duration reached!');
        console.log('Average CPU usage:', averageCPU);
        console.log('Average memory usage:', averageMemory);

        if (averageCPU < CPU_IDLE_THRESHOLD) {
            console.log('Warning: Average CPU usage fell below the idle threshold!');
        } else if (averageCPU > CPU_OVERLOAD_THRESHOLD) {
            console.log('Warning: Average CPU usage exceeded the overload threshold!');
        }

        if (averageMemory < MEMORY_IDLE_THRESHOLD) {
            console.log('Warning: Average memory usage fell below the idle threshold!');
        } else if (averageMemory > MEMORY_OVERLOAD_THRESHOLD) {
            console.log('Warning: Average memory usage exceeded the overload threshold!');
        }

        clearInterval(intervalId);
    }
}, 1000);

This Node.js script runs for a predefined time and samples the computer’s system statistics in the form of CPU utilization and memory usage percentages. The samples are collected over a period of time and averaged to detect possible anomalies, namely unusually low or high percentages.

The above code also initializes the OpenTelemetry SDK and creates a tracer to record custom events capturing the system statistics. The tracer is obtained from a NodeTracerProvider, and each event is recorded inside a span.

…
…

const provider = new NodeTracerProvider({});
const exporter = new ConsoleSpanExporter();
const processor = new BatchSpanProcessor(exporter);

provider.addSpanProcessor(processor);
provider.register();

const tracer = opentelemetry.trace.getTracer('system-stats');

…
…

tracer.startActiveSpan('main', async (span) => {
    let sysStats = await getStats();
    cpuSamples.push(sysStats.cpuUsage);
    memorySamples.push(sysStats.memoryUsage);
    span.addEvent('STAT_CAPTURE', {
        'cpuUsage': sysStats.cpuUsage,
        'memoryAvailable': sysStats.memoryUsage
    });

    span.end();

});

To run this Node.js program and generate traces, you can perform the following steps (tested on Node version 16 and above):

  1. Create a new Node.js project and install os-utils along with the required OpenTelemetry packages:

    npm init -y
    
    npm install os-utils @opentelemetry/api @opentelemetry/semantic-conventions @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base
  2. Save the above code as app.js.
  3. Run app.js:

    node app.js

Upon running, you should see the trace snapshots with the event that captures the CPU and memory percentages every few seconds:

Trace snapshots in a Node.js application with static instrumentation

So far, so good. But now, if you want to capture more such trace snapshots in some other part of the code, you must create additional spans and define the events within. This involves code changes, rebuilds, and restarts of the program. This method of achieving code instrumentation is known as static instrumentation.
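A common way to contain this boilerplate is a wrapper that opens and closes a span around any function. The sketch below uses a hypothetical `withSpan` helper with plain console output standing in for the OpenTelemetry SDK; notice that every instrumented call site still has to be edited, rebuilt, and restarted, which is exactly the limitation of static instrumentation:

```javascript
// Hypothetical helper: wraps an async function so every call is "traced".
// Console output stands in for a real tracer, for brevity.
function withSpan(name, fn) {
  return async (...args) => {
    const start = Date.now();
    console.log(`[span:${name}] started`);
    try {
      return await fn(...args);
    } finally {
      console.log(`[span:${name}] ended after ${Date.now() - start} ms`);
    }
  };
}

// Adding a new span still means a source edit, rebuild, and restart.
const tracedGetStats = withSpan('getStats', async () => ({ cpuUsage: 42 }));

tracedGetStats().then(stats => console.log(stats.cpuUsage)); // span logs, then 42
```

Dynamic instrumentation removes even this remaining source-edit step.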

While static instrumentation may be suitable while software is under development, there are better ways to instrument code running in production. Production applications require live debugging without interruptions or downtime, which can only be achieved through dynamic code instrumentation.

Enabling Dynamic Code Instrumentation in Node.js

Dynamic code instrumentation enables developers to observe live running software without modifying source code or interrupting the runtime. Node.js offers no native way to provision dynamic code instrumentation. However, a platform such as Lightrun can achieve it with minimal overhead.

Lightrun is ideally suited for dynamic code instrumentation because it allows developers to provision instrumentation code from the IDE. Thus, developers can work within the familiar environment of their chosen IDE for developing, debugging, and continuously observing the code.

Set Up Dynamic Instrumentation in the IDE

Let’s take a deep dive into the world of dynamic instrumentation. By combining Lightrun with the popular Visual Studio Code IDE, you can apply all three patterns to a Node.js program. First, sign up for a Lightrun account and follow the post-signup steps to install the Lightrun plugin for your IDE. In Visual Studio Code, the Lightrun plugin appears on the sidebar:

Lightrun plugin in VS Code

Copy and paste the following Node.js program into the Visual Studio Code editor:

const os = require('os-utils');
require('lightrun').start({
    lightrunSecret: '<LIGHTRUN_SECRET>',
});

const CALCULATION_DURATION = 300000; // Calculation duration in milliseconds (300 seconds)
const CPU_IDLE_THRESHOLD = 20; // CPU idle threshold (20%)
const CPU_OVERLOAD_THRESHOLD = 80; // CPU overload threshold (80%)
const MEMORY_IDLE_THRESHOLD = 20; // Memory idle threshold (20%)
const MEMORY_OVERLOAD_THRESHOLD = 80; // Memory overload threshold (80%)

function calculateAverageUsage(samples) {
    const sum = samples.reduce((total, usage) => total + usage, 0);
    return sum / samples.length;
}

async function getStats() {
    const stats = {
        memoryUsage: Math.floor(100 - (100 * os.freememPercentage())),
        cpuUsage: Math.floor(100 * (await new Promise(resolve => os.cpuUsage(resolve)))),
    };
    return stats;
}

console.log('Starting System Stats Sampling...');

const cpuSamples = [];
const memorySamples = [];
const startTime = Date.now();

const intervalId = setInterval(async () => {

    let sysStats = await getStats();

    console.log("Current System Stats - ", sysStats);

    cpuSamples.push(sysStats.cpuUsage);
    memorySamples.push(sysStats.memoryUsage);

    const elapsedTime = Date.now() - startTime;

    if (elapsedTime >= CALCULATION_DURATION) {
        const averageCPU = calculateAverageUsage(cpuSamples);
        const averageMemory = calculateAverageUsage(memorySamples);

        console.log('Calculation duration reached!');
        console.log('Average CPU usage:', averageCPU);
        console.log('Average memory usage:', averageMemory);

        if (averageCPU < CPU_IDLE_THRESHOLD) {
            console.log('Warning: Average CPU usage fell below the idle threshold!');
        } else if (averageCPU > CPU_OVERLOAD_THRESHOLD) {
            console.log('Warning: Average CPU usage exceeded the overload threshold!');
        }

        if (averageMemory < MEMORY_IDLE_THRESHOLD) {
            console.log('Warning: Average memory usage fell below the idle threshold!');
        } else if (averageMemory > MEMORY_OVERLOAD_THRESHOLD) {
            console.log('Warning: Average memory usage exceeded the overload threshold!');
        }

        clearInterval(intervalId);
    }
}, 1000);

This is the same script used for capturing system stats earlier, with the OpenTelemetry SDK code removed. The only notable addition is the import of the Lightrun agent, started with a secret specific to your account. Replace the placeholder <LIGHTRUN_SECRET> with your Lightrun account secret, which you can find in the management console under the Set Up An Agent section:

Getting a secret from Lightrun management console

Open a terminal and create a new project directory. Perform the following steps to create a new Node.js project and install the Lightrun agent via npm:

npm init -y

npm install lightrun

Also, ensure the above Node.js code is saved as app.js within the project directory.

Now, run the program:

node app.js

This should start printing out the captured system stats on the console.

Node.js application output when instrumented with Lightrun

However, the real fun awaits us when we actually start to perform dynamic instrumentation in Visual Studio Code.

Harnessing the Power of Dynamic Instrumentation for Node.js Debugging

Let’s switch to Visual Studio Code to instrument the code using the three patterns.

Once the program is running, you should see an updated Lightrun panel in Visual Studio Code with a Lightrun agent that is now running within the Node.js program:

Lightrun agent running in VS Code

Adding Snapshots to Trace the Program Context

With Lightrun, you can create dynamic snapshots at different places to trace the internal state of the running program. Think of snapshots as breakpoints that don’t stop your program’s execution. Right-click on the code line where you want to generate the snapshot, select Lightrun from the context menu, and choose to add a snapshot. Subsequently, when that line of code is executed, you can see the snapshot in the Lightrun panel to examine the current call stack.

Taking a Lightrun snapshot
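To build intuition for what a snapshot records, here is a purely illustrative sketch in plain JavaScript — it captures the current stack trace and a copy of watched variables without ever pausing execution. This is not how the Lightrun agent is implemented; it only conveys the idea of a breakpoint that doesn't break:

```javascript
// Purely illustrative: mimic a "virtual breakpoint" by recording the
// call stack and a copy of local variables without halting the process.
const snapshots = [];

function captureSnapshot(label, variables) {
  snapshots.push({
    label,
    stack: new Error().stack,    // current call stack at this line
    variables: { ...variables }, // shallow copy of the watched state
    takenAt: new Date().toISOString(),
  });
}

function sampleStats() {
  const cpuUsage = 37;
  const memoryUsage = 55;
  captureSnapshot('stat-capture', { cpuUsage, memoryUsage });
  return cpuUsage; // execution continues normally
}

sampleStats();
console.log(snapshots[0].variables); // { cpuUsage: 37, memoryUsage: 55 }
```

With Lightrun, this capture logic lives in the agent and is attached or detached at runtime from the IDE, with no such helper code in your source.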

Adding Debug Logs to Print Variables

Through the Lightrun context menu, you can add a debug log. For example, let’s add a log statement to capture the historical CPU samples stored in the array:

Taking debug logs with Lightrun in VS Code

Adding Information Logs to Track the Execution Progress

Finally, you can follow the same steps to add an information log. Here is how you can add a simple information log to indicate the capture of individual system stats, along with a timestamp:

Taking info logs with Lightrun in VS Code
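For comparison, achieving these same two logs statically would mean editing the source, redeploying, and later removing the lines again. A rough sketch of the static equivalents, with sample values standing in for the script's live data:

```javascript
// Static equivalents of the two dynamic logs added above.
// Adding or removing these requires a code change and a restart.
const cpuSamples = [12, 48, 33];                      // sample data
const sysStats = { cpuUsage: 48, memoryUsage: 61 };   // sample data

// Debug log: print the historical CPU samples
console.debug('cpuSamples so far:', JSON.stringify(cpuSamples));

// Information log: mark each capture with a timestamp
console.info(`[${new Date().toISOString()}] captured stats:`, sysStats);
```

Dynamic logs give you the same output on demand, without touching the running deployment.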

With Lightrun, the possibilities for instrumenting a Node.js program are endless. Snapshots and logs are the main actions: you can add them dynamically anywhere in the code and remove them at will. It is also possible to have multiple logs on the same code line.

If you want to enable dynamic code instrumentation in your Node.js application, go ahead and create your Lightrun account to get a free 14-day trial.

Also, do check out the docs section to get a head start on all the Lightrun features across different programming languages and IDEs.

The post Three Code Instrumentation Patterns To Improve Your Node.js Debugging Productivity appeared first on Lightrun.

]]>