continuous debugging Archives - Lightrun
https://lightrun.com/tag/continuous-debugging/
Developer Observability Platform
Sun, 25 Jun 2023 09:28:25 +0000

Continuous Debugging and Observability: the Next Agility Best Practices
https://lightrun.com/continuous-debugging-and-observability-the-next-agility-best-practices/
Mon, 15 Jun 2020 09:53:23 +0000

The post Continuous Debugging and Observability: the Next Agility Best Practices appeared first on Lightrun.

A Message from Lightrun’s Founders

Ever since the publication of the “Manifesto for Agile Software Development” approximately two decades ago, software development has become more and more agile. Companies all over the globe have adopted new work processes, built new R&D team structures and incorporated new developer tools and methods. Agility has become the “right” way to do R&D, and organizations are willing to spend a lot of resources to become agile and improve their product delivery and customer experience.

Software development agility is the flexible management and execution of the development process. However, because the term is vague, companies usually equate “agility” with CI/CD. While the two are connected – CI/CD is part of agility – CI/CD does not encompass the full breadth of agility and what it offers R&D departments.

CI/CD means rapid code change, fast delivery and quality. It is a set of principles for pushing smaller code changes more frequently, which enables quick advancement of the product and fast error resolution. CI/CD works by automating building, testing and deployment with dedicated tools like Jenkins, CircleCI, and Atlassian Bamboo. Teams also change their structure as they break monolithic development into microservices: larger teams organized by development specialty (e.g. backend) turn into smaller, mission-focused and feature-focused teams.

But there’s more to agility than CI/CD. What happens after deployment, when the service is running?

Agility for Live Running Services 

In recent years, the gap between development and production environments has widened. Microservices, Kubernetes, Docker Swarm, ECS, Big Data workers and serverless require complex maintenance in production. These architectures have made production issues hard to anticipate in development and even harder to reproduce there.

Simulating production data in a dev environment, as well as the scale and the user data, is on the verge of impossible. As if that’s not enough, bugs are difficult to reproduce in staging and development. To top it all off – modern architectures (microservices / serverless) have a tendency to fail more often – as is common for distributed applications.

Yet, in today’s world – the only way to understand your application’s behaviour is based on log lines and metrics that were defined during the development stage, way before production. We know how frustrating and disheartening it is to discover you’re missing a log line, just where you need it. Developers find themselves going through numerous iterative processes just for adding another log line or metric.

Engineering teams have advanced tools to analyze production, like logging tools and APMs. But each time they need new visibility into their code, they have to go through the development and full release cycle process again (and again) and recreate the preconditions for the issue. This has created the observability gap.

The observability gap limits R&D’s ability to react fast and troubleshoot production issues, which in turn obstructs agility in live applications and when delivering software. In other words, the observability gap restricts the ability of developers and DevOps teams to deliver. MTTR rises, site reliability declines, and so much time is wasted.

We needed a solution for bridging this observability gap. That’s why we came up with Lightrun.

Bridging the Observability Gap with Continuous Debugging and Continuous Observability

We started Lightrun to enable development agility for solving problems in live environments. Developers have a lot of power and agility in development environments. They have debuggers, profilers, and entire pipelines. But in the most vital environment, the meat-and-potatoes of R&D departments, the environment that serves the customers, the process is still stuck and iterative, and requires a lot of time, effort and resources.

Lightrun is the first complete Continuous Debugging and Continuous Observability platform. This means the ability to add and instrument the three pillars of observability in staging or production environments, i.e. while the service is live. Lightrun enables developers to securely add log lines, performance metrics and traces to production, on-demand and while the app is running. True agility and 100% code-level observability are now achieved!

Based on the same principles as CI/CD, Continuous Debugging and Observability (CDB/CO) empowers development teams to shorten iterations and boost product quality and reliability. Just like CI/CD shortens the release cycle, CDB/CO shortens the RCA (Root Cause Analysis) cycle.

We also wanted to make sure that agility was at the heart of our processes, so it was important to us to incorporate ourselves into existing developer workflows, end-to-end. We started in the development environment, by integrating into developers’ existing IDEs, and ended in production, with their APM and log aggregation tools.

Announcing Lightrun

Lightrun was established a year ago with the vision of leading the CDB/CO revolution. Lightrun was founded by developers for developers. We both have a heavy technical background from the elite 8200 IDF unit, public enterprise companies and successful startups that were acquired. Our goal is to deeply transform developers’ lives all over the world, by disrupting the way they collect data from live apps and the way they debug and troubleshoot live applications. In the past year we were able to raise funds from some of the most prominent investors who are reshaping the development ecosystem (including senior development executives in Fortune 500 companies). We built an A-player team of accomplished members in the enterprise-level developer tool sphere, and we built a product that already serves satisfied customers who are expanding their use on a daily basis. It is now time to launch our product and company to the public. Lightrun is emerging from stealth to become the market leader in the Continuous Debugging and Observability revolution. Contact us to learn more.

Debugging Microservices: The Ultimate Guide
https://lightrun.com/debugging-microservices-the-ultimate-guide/
Mon, 20 Jul 2020 10:16:34 +0000

The post Debugging Microservices: The Ultimate Guide appeared first on Lightrun.

Microservices have come a long way from being a shiny, new cool toy for hypesters to a legitimate architecture that transforms the way modern applications are built. Microservices are loosely coupled, independently deployable and scalable, allow a highly diverse technology stack – and these are just some of their biggest advantages. Also, these are some of their biggest disadvantages, especially when it comes to debugging microservices.

That’s because all the world’s a trade-off. And all those great advantages come with a price tag attached. For a long time the tag has been too high for many teams. In this blog post I am going to discuss the issue that was (and still is, in some cases) a very significant part of the aforementioned price – difficulties in debugging microservices. Then, I will recommend tools (one of which is our own production debugger Lightrun) and platforms that can help overcome these problems, because microservices aren’t going anywhere.

What is Microservices Architecture

Before we start, though, let’s clearly define a few things. First of all, “microservices” and “serverless” are two different things. True, microservices are often built using serverless architecture, and serverless architecture is often used with microservices in mind. And yet, the main goal of serverless is to reduce the total cost of ownership of an application – i.e. reduce the cost of managing servers and the usage bill – and it has nothing to do with microservices. It is still possible to build a monolithic web application running entirely on AWS Elastic Beanstalk or Azure App Service, or to deploy a microservice on top of an nginx server running on EC2.

After this subtle but legally important distinction, note that I will still address serverless debugging issues alongside microservices debugging issues, since the two overlap very often.

Another important thing to mention is that the microservices architecture is just a subclass of a broader and more comprehensive cloud-native paradigm, which introduces even more challenges for development teams (out of the scope of this post). But whether you deploy microservices in the public cloud or totally on-prem, you will face the same difficulties. (I don’t assume your application’s gender, age, religion or cloud. And, of course, the language it is written in.)

Microservices Architecture Creates Microservices Debugging Challenges

Imagine that your huge, cumbersome monolith full of shitty legacy code and written in some old-fashioned, boring dinosaur language starts falling apart into beautiful, tiny microservices. What can go wrong?

A lot of things. And when they do go wrong you will find out that, all of a sudden, you can’t put a breakpoint in that new tiny beautiful microservice! You can’t make your favorite IDE debugger just stop there, you can’t see the stack, the values of variables, the process memory, you can’t pause threads and step through the code line by line (well, I do assume language(s) here, sorry for that).

You can’t do all this because the suspicious code is no longer just some class instance running, at worst, as another thread in the process your IDE is attached to. It now runs in a dedicated Docker container or Kubernetes pod, possibly written in another language: stateless, asynchronous, lonely.

Or even worse, it is now a Lambda function, which is born and dies hundreds of times in a second somewhere in a distant cloud, throwing NPEs every time it starts. How in the world am I supposed to debug a microservice like that? What have I done?

[Image: debugging microservices is scary, learn how from this guide]

This post comes to the rescue. There are a lot of techniques, tools, and even startup companies that have emerged to address this problem. It is a vibrant and constantly evolving (which is another way to say poor and incomplete) ecosystem that I will review in two steps: debugging microservices locally (this blog post) and debugging microservices in production.

How to Debug Microservices Locally

Let’s see how it is possible to debug microservices when you either develop them locally or try to reproduce and fix a bug. Before I get into solutions, let’s outline the challenges you will face doing that.

Debugging Microservices Locally: The Challenges

Fragmentation

In a good old-fashioned monolith, the functionality (e.g. adding an item to a shopping cart) you were trying to debug was implemented by a couple of classes, which made it easy to gain a holistic view.

Now, the same functionality is implemented by a couple of separate microservices, and each can be either a Docker container, a Kubernetes pod, or a serverless function. You are supposed to run all of them simultaneously in order to reproduce a bug and then, after a fix, perform an impact analysis.

To top it all off, to recreate the exact picture, each of these services must run the same version as in production – either all of them at the same version or, even worse, each at the version it was running in the production environment where the bug was reported. Creating this environment is a huge challenge, and if you don’t do it right you won’t be able to properly debug your microservices.
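As a rough illustration of the version-matching problem, the sketch below compares the versions running locally against the versions recorded for the production environment where the bug was reported. The service names, version strings and the `find_version_drift` helper are all hypothetical; in practice this data would come from your own deployment records.

```python
from typing import Dict, List, Optional, Tuple


def find_version_drift(
    production: Dict[str, str], local: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (service, production_version, local_version) for every mismatch.

    A service missing from the local environment is reported with None.
    """
    drift = []
    for service, prod_version in sorted(production.items()):
        local_version = local.get(service)
        if local_version != prod_version:
            drift.append((service, prod_version, local_version))
    return drift


# Hypothetical snapshot of the environment where the bug was reported.
production_versions = {"cart": "2.4.1", "catalog": "1.9.0", "payments": "3.0.2"}
# Hypothetical versions currently running on the developer's machine.
local_versions = {"cart": "2.4.1", "catalog": "1.8.5"}

for service, expected, actual in find_version_drift(production_versions, local_versions):
    print(f"{service}: production runs {expected}, local runs {actual or 'nothing'}")
```

Running a check like this before a debugging session at least tells you which services to rebuild or pull before trusting a local reproduction.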

Asynchronous Calls

Direct synchronous method invocations (or, at worst, message queues between threads) are replaced in microservices with either synchronous REST or gRPC API calls. Even worse, sometimes they are replaced with an asynchronous event-driven architecture based on plenty of available message queues (async gRPC is also an option).

Too bad that issues occurring with in-process message queues are nothing like what you face with distributed message queues: the configuration is complicated and has a lot of nuances, latency and performance are not always predictable, operational costs are very high (yes, Kafka, I am looking at you) and you may run out of budget very quickly if you are using managed solutions.

Distributed State

Forget about stack traces, forget about logs. Actually no, don’t forget about logs; forget about understanding anything by digging into those of a single microservice. Those magic ERROR lines you are looking for may be printed into the logs of some other microservice at an undefined time offset, mixed in with totally unrelated ERROR lines which were printed while handling a different HTTP request. In other words, recreating the application state which led to a bug is often mission impossible.
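To make the problem concrete, here is a minimal sketch that interleaves the logs of two services into a single timeline. The log format (an ISO timestamp followed by the message), the service names and the sample lines are assumptions made for illustration.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, List, Tuple

LogEntry = Tuple[datetime, str, str]  # (timestamp, service, message)


def parse(service: str, lines: Iterable[str]) -> Iterator[LogEntry]:
    """Parse 'ISO-timestamp message' lines; this format is an assumption."""
    for line in lines:
        stamp, message = line.split(" ", 1)
        yield datetime.fromisoformat(stamp), service, message


def merge_timelines(*streams: Iterable[LogEntry]) -> List[LogEntry]:
    """Merge per-service streams (each already time-ordered) into one timeline."""
    return list(heapq.merge(*streams))


# Made-up log lines from two hypothetical services.
cart_log = [
    "2020-07-20T10:16:34.120 INFO add item sku=42",
    "2020-07-20T10:16:34.480 ERROR checkout failed",
]
payments_log = ["2020-07-20T10:16:34.300 ERROR card declined"]

timeline = merge_timelines(parse("cart", cart_log), parse("payments", payments_log))
for stamp, service, message in timeline:
    print(stamp.isoformat(timespec="milliseconds"), service, message)
```

Even this only works when the clocks of the hosts agree, which they rarely do exactly, so a merged timeline is an approximation at best.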

Different Languages

Back in the day it was one language to write them all, now it is a Noah’s ark of languages and you might have no idea WTF is going on with this “undefined has no properties” error that some weakly dynamically typed language loves to throw (who let this become a backend language, for crying out loud?).

Technical Difficulties in Running Microservices Locally

Well, that’s what Docker was invented for in the first place, right? Docker-compose up and we are done. OK, but what about a Kubernetes cluster? A Kafka cluster? A bunch of Lambda functions? And then your laptop ~melted~ needs more RAM and CPU.

Now it is easy to see why until recently many teams just gave up. For some it cost days, for some it was weeks of frustration, anger and suppressed aggression – and I didn’t even get to production debugging of microservices. The industry reacted quickly to this mess and came up with plenty of solutions addressing these issues. Granted, these are still not even close to providing the speed and convenience of debugging a monolith with an IDE debugger, but the gap is slowly closing. Let’s take a close look at what you can do.

How You Can Debug Microservices

So what is in our microservices debugging kit as of July 2020? Let’s look at the main tools and platforms out there, and how they can help you.

Cloud Infrastructure-as-Code Tools

There are plenty of configuration orchestration tools, which include, among some others, Terraform and AWS CloudFormation, as well as configuration management tools like Ansible or Puppet, which automate deployment and configuration of complex applications. Debugging microservices with these tools allows creating a quick and seamless debugging environment – subject to your budget constraints, of course. To optimize costs, you can offload only some of the services to a remote cloud and run the rest locally on your machine.

Centralized Logging

All microservices should send logs to a centralized, preferably external, service. This way you can investigate, trace and find the root cause of a bug much more easily than by switching between multiple log files in your local text editor. You can choose from plenty of managed services like Logz.io and Datadog, deploy your own ELK stack, or just send the logs to ~/dev/null~ cold S3 storage. In case you do not know when you will need the logs, this is a much cheaper option and you can always fetch them later. The most important thing is to implement a Correlation Identifier, and then there are more best practices you should definitely read about.
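A Correlation Identifier can be sketched with nothing but the Python standard library. In the example below, the `CorrelationIdFilter` class, the `handle_request` function and the log format are illustrative names and choices, not part of any particular logging product; a real service would read the incoming ID from a request header and forward it on every outgoing call.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: str = "") -> str:
    """Reuse the caller's ID if one was passed along, otherwise mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        logger.info("adding item to cart")
        return request_id  # forward this ID on every outgoing call
    finally:
        correlation_id.reset(token)


output = io.StringIO()
cart_logger = build_logger("cart", output)
first_id = handle_request(cart_logger)
handle_request(cart_logger, incoming_id="req-123")
print(output.getvalue(), end="")
```

Once every service stamps its log lines this way, searching the centralized log store for a single ID returns the whole story of one request across all services.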

Serverless Frameworks IaC

Some of your microservices might be implemented using serverless solutions like FaaS and/or other managed services like API Gateway. There are two main players that provide Infrastructure-as-Code frameworks for serverless: the cloud-agnostic Serverless and AWS SAM, which is just an abstraction layer over CloudFormation. Back in the day, it was a real mess to develop and debug FaaS, but these days both allow local debugging, while SAM even allows using a local debugger in popular IDEs (Visual Studio, IntelliJ IDEA) with its handy AWS Toolkit. A real time saver!

Local Containers

Running Docker Compose locally is trivial unless you’re using a sophisticated architecture, such as a Kafka cluster alongside your Docker containers. Then things start getting complicated while still feasible – take a look. 

When it comes to Kubernetes, though, it is much more difficult. There are some tools that try to simplify local Kubernetes deployment, such as MicroK8s and Minikube, but both require a lot of effort – well, you should not expect your life to be easy when dealing with Kubernetes anyway.

Dedicated “Debuggers for Microservices”

Not very convincing until now, right? I mean, after a lot of effort you can (barely) create the microservices debugging environment and see logs in a manner which makes sense – things you hardly bother about when debugging a monolith. But what about the debugging capabilities that really matter – setting breakpoints throughout the application, following variable values on the fly, stepping through the code, and changing values during run time? 

If your microservices leverage the Kubernetes platform, you can get all of these, at least to an extent. There are two powerful open source tools, Squash and Telepresence, which allow you to use your local IDE debugger features when debugging the Kubernetes environment, preventing your laptop from melting down when running Minikube.

Squash builds a bridge between some of the popular IDEs and debuggers (here’s the full list) and uses a sidecar approach to deploy its client on every Kubernetes node (the authors claim very low performance and resource consumption overhead). This allows you to use all the powerful features of the local debugger such as live debugging, setting breakpoints, stepping through code, viewing the values of variables, modifying them for troubleshooting, and more. You can find a thorough guide here.

Telepresence operates quite differently: it runs a service you want to debug locally, while connecting it to a remote Kubernetes cluster, so you can develop/test it locally and use any of your favorite local IDE debuggers seamlessly. A bunch of tutorials, FAQs and docs can be found here.

Unless I missed something (let me know in the comments), that’s what you have in your hands in mid-2020 when it comes to debugging microservices locally. Far from ideal, it is much better than just a couple of years ago, and it is constantly getting better.

In the next blog post I will discuss the tools and best practices for debugging microservices in production!

Spoiler: a great tool to debug microservices in production is Lightrun. You can add on-demand logs, performance metrics and snapshots (breakpoints that don’t stop your application) in real time without having to issue hotfixes or reproduce the bug locally – all of which makes life much easier when debugging microservices. You can start using Lightrun today, or request a demo to learn more.

Extending CI/CD with Continuous Observability & Debugging
https://lightrun.com/complete-agility-extend-your-ci-cd-pipelines-with-continuous-debugging-and-continuous-observability/
Tue, 17 Nov 2020 16:43:50 +0000

The post Extending CI/CD with Continuous Observability & Debugging appeared first on Lightrun.

This post is a recap of a joint webinar we held with JFrog – you can watch the on-demand recording here.

In The Phoenix Project – a whimsical journey through the eyes of an imaginary corporation’s IT leader’s day-to-day life – a large production issue is encountered in the deployment of the company’s flagship Phoenix project. This project, already behind schedule and over budget, is one of the focal points for the company’s CEO – who is very adamant that the catastrophe can be mitigated and the deployment finished promptly.

The narrator, Bill, is obviously under a tremendous amount of pressure to get the show on the road. But – being a seasoned veteran with a penchant for proper processes – he spends the majority of the rest of the book attempting to implement a saner operational schedule for the department.

This book is a great story-like representation of what it’s like to build software nowadays. The tension at multiple levels of the company’s hierarchy, an assortment of project leaders, methodologies and deployment environments, external customer pressure – all of these and more are part of a day’s work for any company with a large-enough software development organization (the number of which is increasing rapidly as more and more processes start revolving around software).

[Image: Software engineering organization]

It’s interesting to explore what we – as engineers and engineering leaders – have been doing in order to ensure that catastrophes like the one depicted earlier do not repeat themselves with every new version launch. As a discipline, software engineering has made leaps and bounds – infrastructure-wise – since the days of extremely costly dedicated servers in on-prem data centers. It’s cheaper and faster than ever to get your application into production by signing up to a cloud provider, defining your required resources and topology, and clicking a button or running a command to deploy.

This increased agility is not limited to the hardware that we run our software on and to the basic software components we use to scaffold our applications – the process side of things has drastically improved as well. We now have full-fledged, mostly automated processes for getting software from development to production – “The CI/CD Pipeline” – which make sure that every piece of software is cared for on every front before reaching the customer’s eyes. In addition, these processes ensure that the entire workflow is faster, more reliable and less fragmented.

But the value our software brings is only measured when it reaches the hands of our users – i.e. when it hits production. Software can only truly be evaluated by the effect it leaves on the daily lives of our customers, and as such we need to make sure that the same agility is extended to the “right”-hand side of the SDLC – into the frontier of production environments.

[Image: The last stage of CI/CD]

Before we dive into how to make sure the processes in our production environments are as streamlined as the ones earlier in the process, let’s first take a closer look at the earlier steps our software takes before it’s released into the wild.

Early Stage Agility

Getting a piece of software from development to production entails, roughly, the following stages:

 

[Image: Dev, build, test]

 

    1. Development – Source code is written into a source code repository – usually using a VCS (Version Control System) like Git. The main codebase is often hosted on a centralised repository on either a managed service or an on-prem solution, which acts as a single source of truth (SSOT) for the application versions and as a remote backup. This allows for multiple participants to simultaneously develop, with the final codebase always synchronized from one place.
    2. Build – In order to make sure the software functions properly, a build process that sets up the environment, fetches all the necessary dependencies and builds the final application is performed.
    3. Testing – Before code can actually be added into the centralised location (known as “integration” of the new code), a test suite is executed against it to ensure compatibility with the existing codebase.
This continuous addition of new code to the codebase following the build and test phases is often referred to as CI – Continuous Integration – and the machines that host the endeavour are often colloquially referred to as “CI Servers” (or, in some places, simply as “Build Servers”).

[Image: Continuous integration (CI)]

Once the code is “confirmed working” following the test suites, it needs to be prepared for deployment and then deployed to production:

[Image: Code working]

  1. Creating Artifacts – In a world with hundreds of different deployment targets including VMs, containers, Kubernetes pods, serverless functions, bare metal servers and others, creating the artifacts and the related configuration and metadata can be quite an ordeal. This endeavor gets even more tiresome if you are not working with a monolith application but in a distributed microservices architecture, where every release is composed of dozens (and sometimes hundreds) of different types of artifacts – one for each service.
  2. Deploying Artifacts – Once the artifacts are created, we also need to get them to the production machines. This usually entails communicating with the target platforms to announce the arrival of the new version, and waiting for confirmation of a successful deployment.

This latter part of the process, when conducted automatically, is often referred to as Continuous Deployment – or CD.

[Image: CI/CD]

There are, however, quite a lot of other concerns that we’ve glossed over so far. These issues are integral to the process and the software shouldn’t be deployed without them in place, including (but not limited to):

  1. Dependencies – relevant dependencies are not hosted on the source code repository next to our application, and as such need to be fetched during the build process from external sources (Maven Central is an example for JVM development). These external sources must host the correct version and deliver it reliably – both of which are notorious pitfalls that often stop a build from finishing successfully.
  2. Security – the vulnerabilities of the underlying dependencies – and of the application itself – must be verified before deploying to production. This is usually done by scanning the dependencies of the application against an external vulnerability database (both on the binary and metadata levels) and the various components of the application itself.
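The security step above can be sketched as a lookup of each pinned dependency in an advisory list. Everything in this example is made up for illustration: the package names, the advisory ID and the data layout. Real scanners also inspect binaries and handle far richer version ranges than the plain numeric comparison used here.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '2.13.4' into (2, 13, 4); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def scan(
    dependencies: Dict[str, str], advisories: Dict[str, List[Tuple[str, str]]]
) -> List[str]:
    """Return findings for dependencies older than the version that fixes an advisory.

    `advisories` maps a package name to (advisory_id, fixed_in_version) pairs.
    """
    findings = []
    for name, version in sorted(dependencies.items()):
        for advisory_id, fixed_in in advisories.get(name, []):
            if parse_version(version) < parse_version(fixed_in):
                findings.append(
                    f"{name} {version} is affected by {advisory_id} (fixed in {fixed_in})"
                )
    return findings


# Made-up packages and advisory IDs, for illustration only.
dependencies = {"example-json": "2.9.8", "example-http": "4.5.13"}
advisories = {"example-json": [("EXAMPLE-2020-0001", "2.9.10")]}

findings = scan(dependencies, advisories)
for finding in findings:
    print(finding)
if findings:
    print("Blocking deployment until the findings are resolved.")
```

Comparing versions as integer tuples rather than strings matters here: as text, "2.9.8" sorts after "2.9.10", which would silently hide the finding.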

[Image: Continuous Deployment (CD)]

These steps used to be carried out by a member of the engineering team, or – in companies with more resources – by a dedicated systems or production engineer. These processes are repetitive, time-intensive, and prone to error when done manually, making them all prime candidates for automation.

And, indeed, the automation of these processes constitutes a large portion of the tooling of most modern software organizations. The sequential execution of them on the dedicated infrastructure is often referred to as the “CI/CD Pipeline” mentioned earlier, due to the incremental nature of the “flow” of the application from source code to a deployed artifact.
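The sequential nature of such a pipeline can be illustrated with a few lines of code. This is a toy model with placeholder steps, not how any specific CI/CD product is implemented: each stage runs only if the previous one succeeded, and everything after a failure is skipped.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; once one fails, the remaining stages are skipped."""
    results = []
    failed = False
    for name, step in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        ok = step()
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


# Placeholder steps; real ones would invoke the build tool, test runner, etc.
stages = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("security scan", lambda: False),  # simulate a blocked release
    ("create artifacts", lambda: True),
    ("deploy", lambda: True),
]

for name, status in run_pipeline(stages):
    print(f"{name}: {status}")
```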

[Image: Verification]

JFrog Artifactory is one example of a solution that can handle the process of artifact management for you. Instead of relying on external artifact repositories and manual transfer of the software from one host to the other, JFrog Artifactory allows for automated promotion of artifacts between so-called “Artifactory Repositories”.

As your artifacts “mature” throughout the pipeline – passing more and more of the processes mentioned above, and moving between various environments – Artifactory can automatically promote them to the next level. This allows for the pipeline to really “flow” based on triggers from previous stages, instead of relying on human input for things that can be checked by a machine.
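The idea of promotion can be modelled in a few lines. The sketch below is a generic illustration and does not use or mirror the Artifactory API; the repository levels, the check names and the `promote` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical repository levels, ordered from least to most mature.
LEVELS = ["dev", "staging", "release"]

# Hypothetical checks an artifact must have passed to enter each level.
REQUIRED_CHECKS = {
    "staging": {"build", "unit-tests"},
    "release": {"build", "unit-tests", "security-scan", "integration-tests"},
}


@dataclass
class Artifact:
    name: str
    level: str = "dev"
    passed_checks: Set[str] = field(default_factory=set)


def promote(artifact: Artifact) -> bool:
    """Move the artifact up one level if it has passed every required check."""
    index = LEVELS.index(artifact.level)
    if index == len(LEVELS) - 1:
        return False  # already at the top level
    next_level = LEVELS[index + 1]
    if REQUIRED_CHECKS[next_level] <= artifact.passed_checks:
        artifact.level = next_level
        return True
    return False


artifact = Artifact("cart-service-2.4.1.jar", passed_checks={"build", "unit-tests"})
print(promote(artifact), artifact.level)  # promoted once the staging checks are met
print(promote(artifact), artifact.level)  # stays put until the release checks pass
```

The point of the model is that promotion is a decision a machine can make from recorded check results, with no human gatekeeping in between.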

If you’re looking for an overarching solution, something that ties all the pieces together under a single roof, you can use JFrog Pipelines, which allows you to orchestrate and automate every single step your application needs to take before being deployed into production.

JFrog Pipelines coordinates all the existing tools that automate the processes mentioned above – including security scanning using JFrog Xray and artifact caching using JFrog Artifactory – to provide a more holistic experience for software organizations that need to keep on shipping.

The diagram below gives a great overview of how the different JFrog services fit together.

[Image: JFrog Pipelines]

The Need For Production Agility

By now, we’ve established that there’s a need for agility on the way from development to production. We’ve also looked at a couple of solutions that offer automation and orchestration for significant pieces of the puzzle.

These advancements, along with a growing appreciation for the complexity of the process of developing software and the care it takes to do it right, have created a world in which getting our software to production is a fast and streamlined process, enabling quicker updates and a better user experience.

And, when something inevitably breaks, it’s usually very easy to track the specific point of failure, make a change immediately, and re-trigger the exact step that failed without going through the full pipeline because of that single problem. Every step along the way is triggered automatically based on the previous step, is extremely visible and thus easy to audit from all angles and can be modified using the command line, user interfaces and APIs.

But what goes on after the pipeline ends? What happens when our software is released into production?

[Image: CI/CD Pipeline]

Going back to The Phoenix Project for a second, the narrator says the following line to the other stakeholders in his department during a particularly difficult portion of the project deployment:

[Image: quote from The Phoenix Project]

The issue is, unfortunately, that it is almost always impossible to match the environments exactly. There are just too many factors to take into consideration – for example:

[Image: Live application]

  • Users with different characteristics – The sheer amount of technology out there today leads to a practically uncountable number of different user configurations that your application might face. In fact, an entire industry – online advertising – is predicated on this very fact for its existence (by profiling users based on the information their setup provides about them).
  • External Failures – Most, if not all, modern applications rely heavily on external vendors for various utilities. The core one is of course cloud vendors – if a GCP or Azure service you rely on goes down, so will your application.
  • Unexpected usage – Logical and functional testing can only get you so far – there will always be users who misuse (or outright abuse) your software, and you must be able to deal with them appropriately.
  • Infrastructure bottlenecks – Your own infrastructure might fail you as well. If a resource inside your topology is shared by more than one entity – a database is a good example – latencies caused by communication overhead as well as unexpectedly long-running queries might cause timeouts down the road.

These, and more, can all be categorized as things you can either test for in a very limited fashion, or can’t test for at all. Recall, as I mentioned at the beginning of the article, that an obvious (yet easy to forget) fact is that your live application is the most important part of your software development process. There are just too many unknowns to consider when trying to better understand issues with your app.

Live application 2

When a production issue stems from one of these concerns, it’s often hard to understand its origin and to identify which thing exactly broke along the way.

It might not seem apparent at first glance, but the problem is exacerbated by the current set of processes we use to understand, debug and resolve production incidents today.

The Current Production Observability Toolbox

Generally speaking, when a production issue occurs we have a few tools we can use to get better visibility into the issue:

Production observability toolbox

  • Passive Observation – A better understanding of the issue is extracted from the existing infrastructure, such as application performance monitoring (APM) tools or other tools that either aggregate or enhance your existing application logging and metrics collection.
  • Replication & Reproduction – The service is replicated locally or on a similar piece of infrastructure; the bug is reproduced on the replicated service.
  • Hotfixing for more information – A “patch” with additional logging is created and deployed to the running service, which now emits more granular information.
  • Remote Debugging – A special type of agent is attached to the running service ad-hoc, imitating a local debugger, and allows for breakpoint-by-breakpoint analysis of the service (including stopping it at each breakpoint).
  • Alerting – A set of pre-configured cases – usually based on a specific event occurring or a certain metric reaching a certain threshold – triggers alerts to ensure all stakeholders are aware that something is wrong.
  • Self-Healing – When a piece of software fails, a certain actor is in charge of triggering a process (usually a reboot or an instance swap) to account for the failure automatically.
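The last two items in the list, alerting and self-healing, are simple enough to sketch together. The example below is only an illustration under stated assumptions: the `Service` and `Monitor` classes, the error-rate metric and the 5% threshold are all hypothetical, and a real setup would rely on a monitoring system and an orchestrator rather than in-process objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """Hypothetical stand-in for a running service instance."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Stands in for a reboot or an instance swap.
        self.healthy = True
        self.restarts += 1


@dataclass
class Monitor:
    """Hypothetical actor that raises alerts and triggers recovery."""
    error_rate_threshold: float
    alerts: List[str] = field(default_factory=list)

    def check(self, service: Service, error_rate: float) -> None:
        # Alerting: a pre-configured threshold on a metric triggers a notification.
        if error_rate > self.error_rate_threshold:
            self.alerts.append(
                f"{service.name}: error rate {error_rate:.0%} "
                f"exceeds {self.error_rate_threshold:.0%}"
            )
        # Self-healing: a failed service is recovered without human intervention.
        if not service.healthy:
            service.restart()


monitor = Monitor(error_rate_threshold=0.05)
checkout = Service("checkout")

monitor.check(checkout, error_rate=0.01)   # below the threshold: no alert
checkout.healthy = False                   # simulate a crash
monitor.check(checkout, error_rate=0.12)   # alert is recorded and the service is restarted
```

Note that both mechanisms only react to conditions someone anticipated in advance: the threshold and the recovery action are fixed before the incident happens.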

While each of these approaches has its own advantages and disadvantages, they all have one thing in common – none is an extension of the existing pipeline.

Looking at the production environment through the same lens that you used for the stages leading up to it, and especially considering its importance in comparison to those other stages, it’s clear that we should have tools that allow us to streamline as many of the processes related to incident resolution as possible. Self-healing can steer us a bit in that direction, but it’s incredibly difficult to account for all failure scenarios when creating our software, and thus it is just as difficult to create self-healing mechanisms that deal with all of those scenarios.

We’ve been extremely good at cleaning up and automating processes before the release, but, until recent years, there hasn’t been as much attention focused on the process of troubleshooting the application once it’s live in production. Debugging and incident resolution have always been defined by a diverse set of processes and tooling (a large portion of which is mentioned above), but there has never been a concrete umbrella term for the people in charge of facilitating and automating these flows, nor the discipline that they follow.

CI/CD Pipeline

When Google’s Site Reliability Engineering book hit the web, and the SRE profession was introduced, a better definition for both became widespread. There is now a profession inside software organisations whose sole goal is to ensure that engineering practices are applied to the right-hand, production side of the cycle as well.

SRE

But we would be remiss to so easily define it as an entirely new endeavor. Instead, it’s better to look at it as a continuation of the hard work that came before it – and refer to it as an extension of the existing pipeline:

Site reliability engineering

And with this fresh perspective on the process, we can now talk about how to infuse the same agility we experienced in the earlier stages of the pipeline into the art of production troubleshooting as well.

Continuous Debugging & Continuous Observability

The concept of applying the pipeline’s principles to the production world can be referred to as Continuous Observability.

If observability can be defined as the ability to understand how your systems are working on the inside just by asking questions from the outside, then Continuous Observability can be defined as a streamlined process for asking new questions and getting immediate responses.

But being conscious of our production systems shouldn’t stop there – we also must be able to answer these new questions without causing any damage to the business. That means that outages must be minimized, no customer data should be corrupted and any disruption to the user experience of our products must be mitigated.

To complement the practice of continuous observability, agile teams can also implement Continuous Debugging processes – ways to actively break down tough bugs by getting more and more visibility into the running service, without stopping it or degrading the customer’s experience.
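To make "asking new questions without stopping the service" more concrete, here is a toy sketch of log points that are attached to a function while the program keeps running. It is a simplified illustration only: the registry, the `instrumented` decorator and `add_log_point` are hypothetical names, and this is not how any specific product implements the idea.

```python
import functools
from typing import Callable, Dict, List

# Hypothetical registry: function name -> message templates added at runtime.
_log_points: Dict[str, List[str]] = {}
captured: List[str] = []  # stands in for a real log sink


def instrumented(func: Callable) -> Callable:
    """Wrap a function once, at deploy time, so log points can be attached later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # The registry is consulted on every call, so new log points take
        # effect immediately, with no redeploy and no pause in execution.
        for template in _log_points.get(func.__name__, []):
            captured.append(template.format(args=args, kwargs=kwargs, result=result))
        return result
    return wrapper


def add_log_point(function_name: str, template: str) -> None:
    """Ask a new question of the running code."""
    _log_points.setdefault(function_name, []).append(template)


def remove_log_points(function_name: str) -> None:
    """Stop collecting once the question is answered."""
    _log_points.pop(function_name, None)


@instrumented
def apply_discount(price: float, percent: float) -> float:
    return price * (1 - percent / 100)


apply_discount(100.0, 10.0)   # no log points yet, nothing is recorded
add_log_point("apply_discount", "apply_discount{args} -> {result}")
apply_discount(200.0, 50.0)   # recorded: the function kept running throughout
```

The function itself is never edited, rebuilt or halted; only the set of questions being asked of it changes.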

Lightrun was built from the ground up to empower these exact processes.

Continuous debugging

Lightrun works inside your IDE, allowing you to add logs, metrics, traces and more to your running application without ever breaking the process. Instead of having to edit the source code to add more visibility, compile, test, create artifacts, deploy and then inspect the information on the other end, Lightrun skips the process and allows you to add more visibility to production services in real-time, and get the answers you need immediately.

To contrast the Lightrun approach with the current production observability toolbox, let’s look at a couple of examples:

With hotfixing, you have to go through the entire pipeline just to get an additional log line into production. This is a long process that can take many precious minutes for something that should be as simple for production services as it is locally.

Hotfixing

With remote debugging, in order to ask any new questions you have to stop the process, causing an outage. This is an expensive price to pay for getting a peek at what’s going on inside your service. Since this addition of information happens repeatedly during debugging, this could mean a hefty dent in the overall uptime of your service.

Remote debugger

With Lightrun, you can add as much information as you want ad-hoc, without stopping the process, and get all the information immediately in your IDE.

Lightrun

By enabling a real-time, on-demand debugging process and enriching the information the application reveals about itself without stopping the process, Lightrun offers a streamlined experience for what is currently an objectively difficult and manual process. By doing so, it facilitates a speedy incident resolution process, resulting in lower mean-time-to-resolution (MTTR) and a better overall developer experience when handling incidents.

The post Extending CI/CD with Continuous Observability & Debugging appeared first on Lightrun.

]]>