Resources Archive - Lightrun

How Taboola slashed MTTR & saved 260+ debugging hours a month with Lightrun on AWS

gbenhorin — Tue, 27 Sep 2022 14:56:30 +0000

The Challenge

Understanding the real state of production services

Taboola’s production environment is particularly dynamic. With many different features in development simultaneously to accommodate the needs of the business, their developers push a large number of changes on a daily basis. To keep up with demand, Taboola developers push changes directly into production and use advanced feature-flagging techniques to verify their code works as expected.

Each of Taboola’s servers performs hundreds of thousands of queries per second (QPS) to support the roughly 0.5 million requests that hit Taboola’s servers every second. To support that enormous job, Taboola built a data center that hosts heavy-duty servers with up to 2 TB of RAM each (in addition to utilizing AWS services). Since most developer laptops simply can’t handle the loads that these machines deal with daily, developers often have trouble reproducing production issues in their local environments.

In addition, fast, repetitive code changes imply fast, repetitive deployments. With a deployment cycle that takes anywhere from 30 minutes to an hour, a high pace of changes, and the aforementioned bug reproduction challenges, re-building and re-deploying while tackling complicated issues can sometimes take literally hours. Combined with the fact that – due to their vast user base – each minute of downtime equals thousands of dollars in lost revenue, Taboola needed fine-grained controls to guarantee quality, reliability, and security continuously.

Taboola was looking for a solution that would enable them to make sure each released version works without a hitch, including on their additional services hosted on AWS. Ultimately, they looked for a tool that would allow them to troubleshoot issues and validate feature behavior in production services in a quick, developer-friendly way. They needed the tool to show whether a new feature holds up in production, and also to aid them in performing fast root cause analysis with as few unnecessary context switches as possible.

“Lightrun has been a game-changer for us. With Lightrun we shortened our development process significantly by skipping iterative deployment cycles when adding logs and metrics. A day’s work turned into just one hour. Lightrun provided us with new observability into our production environment that was not accessible to us beforehand. Lightrun is a key component in our developer toolset here at Taboola and one of our development best practices.” Rami Stern, R&D Infrastructure Team Leader

The Solution

Real-time production debugging with Lightrun

The difference between lost revenue and happy customers for Taboola is a speedy incident resolution process. However, production issues come in many different shapes and forms, and not all of them are easy to resolve quickly.

For example, when working on logically complicated flows, it’s often difficult to understand the code paths that various requests take at every run. Using Lightrun, Taboola developers can now issue conditional snapshots that allow them to identify the state of a particular request; in order to isolate the specific request, a developer can insert any valid Java expression (as complex as it may be), and a snapshot will only be taken when that expression evaluates to true. This allows developers to get information relevant to the request – and only that information – without sifting through endless screens in their logging systems.
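To make the idea of a condition concrete, here is a minimal sketch in plain Java. It is not Lightrun's API: the `ConditionalSnapshotSketch` class, the `Request` type, and its fields are all hypothetical, and the only point is that state is recorded when, and only when, a boolean expression over the request evaluates to true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of a conditional capture; not Lightrun's API.
final class ConditionalSnapshotSketch {

    // Hypothetical request type with two fields a condition might inspect.
    static final class Request {
        final String publisherId;
        final int itemCount;

        Request(String publisherId, int itemCount) {
            this.publisherId = publisherId;
            this.itemCount = itemCount;
        }
    }

    private final Predicate<Request> condition;
    private final List<Request> captured = new ArrayList<>();

    ConditionalSnapshotSketch(Predicate<Request> condition) {
        this.condition = condition;
    }

    // Called at the instrumented line; records the request only if the condition holds.
    void onHit(Request request) {
        if (condition.test(request)) {
            captured.add(request);
        }
    }

    // Returns a copy so callers cannot modify the internal list.
    List<Request> captured() {
        return new ArrayList<>(captured);
    }

    public static void main(String[] args) {
        // The condition plays the role of the expression a developer would type in.
        ConditionalSnapshotSketch snapshot = new ConditionalSnapshotSketch(
                r -> r.publisherId.equals("pub-42") && r.itemCount > 10);

        snapshot.onHit(new Request("pub-1", 50));   // wrong publisher: ignored
        snapshot.onHit(new Request("pub-42", 3));   // too few items: ignored
        snapshot.onHit(new Request("pub-42", 25));  // matches: captured

        System.out.println(snapshot.captured().size()); // prints 1
    }
}
```

The narrower the condition, the less unrelated traffic ends up in the captured data, which is the property the paragraph above describes.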

Performance bottlenecks are another popular form of production issues. In order to understand which part of the system is causing the latency, collecting and visualizing metrics is a rather common practice among troubleshooting developers.

Using Lightrun, developers can insert TicTocs (real-time, on-demand metrics) to measure the amount of time a certain piece of code takes to execute, even if the code runs inside managed Kubernetes like EKS or in serverless environments like AWS Lambda. Lightrun offers several of these code-level metrics (method durations, counters, etc.) that are extremely valuable in identifying bottlenecks during real-time sessions.

Previously, developers often found themselves adding metrics into the actual code. These metrics, when left unattended, can make the codebase bloated and add additional overhead to each transaction due to the cost of instrumenting them. This means that the developers had to remove them in the next deployed version. Lightrun allows real-time, on-demand addition and removal of metrics, where each metric only has a negligible performance footprint.
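As a rough illustration of the hand-written timing code this paragraph describes, the kind that has to be added, deployed, and later removed, a minimal Java sketch might look like the following. The `TicTocSketch` class and its method names are hypothetical; they are not part of Lightrun or of Taboola's codebase.

```java
import java.util.function.Supplier;

// Hypothetical hand-rolled timer of the sort developers embed in application code.
// Not thread-safe: a real metric would need atomic counters or a metrics library.
final class TicTocSketch {
    private long totalNanos;
    private long invocations;

    // Runs the block, records how long it took, and returns the block's result.
    <T> T measure(Supplier<T> block) {
        long start = System.nanoTime();
        try {
            return block.get();
        } finally {
            totalNanos += System.nanoTime() - start;
            invocations++;
        }
    }

    long invocations() {
        return invocations;
    }

    long totalNanos() {
        return totalNanos;
    }

    // Average duration per invocation in milliseconds, or 0 if nothing was measured.
    double averageMillis() {
        return invocations == 0 ? 0.0 : (totalNanos / 1_000_000.0) / invocations;
    }

    public static void main(String[] args) {
        TicTocSketch timer = new TicTocSketch();
        int sum = timer.measure(() -> {
            int total = 0;
            for (int i = 0; i < 1_000_000; i++) {
                total += i % 7;
            }
            return total;
        });
        System.out.println("result=" + sum + " avgMillis=" + timer.averageMillis());
    }
}
```

Every call site wrapped this way stays in the codebase until someone deletes it, which is the maintenance cost that on-demand metrics are meant to avoid.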

In addition and as mentioned above, Taboola uses feature-flagging and progressive delivery to safely test the behavior of new features in production, without affecting the entire customer base. When Taboola pushes a new version to production, they first push it to a smaller customer subset. They then use Lightrun logs and snapshots to verify the behavior of the new feature, the state of the application, and the path the code takes are as expected. In addition, they verify that the performance of the newly introduced code meets the required benchmarks using Lightrun metrics. If all the expectations are met, they then gradually roll it out to the rest of the customer base.
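The source does not say how Taboola selects the smaller customer subset. One common approach, shown here purely as an assumption, is to hash each customer ID into a stable bucket and enable the feature for buckets below a rollout percentage. The `RolloutGateSketch` class is hypothetical.

```java
// Hypothetical percentage-based rollout gate; not Taboola's actual mechanism.
final class RolloutGateSketch {
    private final int rolloutPercent;

    RolloutGateSketch(int rolloutPercent) {
        if (rolloutPercent < 0 || rolloutPercent > 100) {
            throw new IllegalArgumentException("rolloutPercent must be in [0, 100]");
        }
        this.rolloutPercent = rolloutPercent;
    }

    // Maps a customer ID to a stable bucket in [0, 100).
    // String.hashCode() is specified by the JDK, so the bucket does not change between runs.
    static int bucketOf(String customerId) {
        return Math.floorMod(customerId.hashCode(), 100);
    }

    // A customer sees the feature when their bucket falls below the rollout percentage.
    boolean isEnabledFor(String customerId) {
        return bucketOf(customerId) < rolloutPercent;
    }

    public static void main(String[] args) {
        RolloutGateSketch gate = new RolloutGateSketch(10);
        int enabled = 0;
        for (int i = 0; i < 1000; i++) {
            if (gate.isEnabledFor("customer-" + i)) {
                enabled++;
            }
        }
        System.out.println(enabled + " of 1000 customers are in the first rollout stage");
    }
}
```

Because the bucket is fixed per customer, raising the percentage only adds customers and never removes anyone who already has the feature, which suits a gradual rollout.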

With Lightrun, teams were able to get to the root cause of issues in record time. Hidden, implicit backend issues that took up to two weeks to mitigate in the past were now resolved in under an hour using Lightrun logs, snapshots, and metrics – without ever interrupting a running production service.

The results

Improving developer experience by saving over 260 debugging hours every month

With instant, real-time production logs, snapshots, and metrics, Taboola developers now save precious incident resolution time previously spent waiting for their hotfixes to deploy to their datacenters and AWS-hosted services. When there’s a need to better understand the current state of a service in order to investigate an issue, developers now have a tool to ask new questions and get immediate, real-time answers.

Using Lightrun on a constant basis decreases MTTR, increases the rate at which Taboola deploys new features into production, and improves each individual developer’s productivity. By streamlining a once cumbersome, lengthy, and rather manual debugging process, Lightrun removes unnecessary frustration from the developer’s day and allows Taboola’s team to focus on what they do best – ship new amazing features to Taboola customers.

The post How Taboola slashed MTTR & saved 260+ debugging hours a month with Lightrun on AWS appeared first on Lightrun.

How Gong Enabled Secure Production Debugging Across their Entire Engineering Organization with Lightrun

Eran Kinsbruner — Tue, 05 Dec 2023 16:02:49 +0000

Gong Engineering Slashes Issue Resolution Time Significantly and Boosts Its Dev Productivity

The Challenge

Gong’s mission is to transform revenue organizations by driving business efficiency, revenue growth, and improved decision-making. The Gong Revenue Intelligence platform uses the largest customer interaction dataset, coupled with patented, in-house artificial intelligence (AI) models to power revenue teams’ most critical workflows, including deal execution, sales engagement, forecasting, coaching, and strategic initiatives.

Gong’s engineering team, which consists of hundreds of developers, grappled with debugging tasks that spanned distributed workloads on AWS EC2 and EKS. The application is Java-based, built on the Spring framework, and runs on the Apache Tomcat web server. The Gong product team deploys the application to production hundreds of times each day, serving hundreds of thousands of users and executing millions of online and batch transactions per minute. The hurdles were primarily centered around peripheral microservices and data-storage systems tied to the systems being debugged. These include databases, messaging systems, and other services whose behavior could not be faithfully replicated in non-production environments. Moreover, Gong engineers strive to minimize logs due to the prohibitive cost of logging at the scale at which the system operates.

Before Lightrun’s adoption, the troubleshooting procedures at Gong involved incorporating static logs and attempting to emulate production issues locally – an inefficient and resource-draining process. Each added log line required a redeployment cycle, resulting in significant time waste and financial costs while the bug still impacted customers. The Gong team was thus on a quest to streamline their troubleshooting procedures, accelerate the process, and diminish the overall expenses associated with remote debugging.

Proposed Solution

Lightrun’s Production Debugging across the SDLC as Part of Gong’s Platform Engineering Strategy

Lightrun helped the entire Gong engineering organization achieve these objectives through its dynamic observability platform. The Lightrun platform gave developers secure access to their production environments and allowed efficient debugging of complex issues directly from the developers’ workstations. Specifically, Gong developers started using the Lightrun IntelliJ IDE plugin with the Java runtime agent and were able to quickly ramp up and analyze production issues by placing regular as well as conditional snapshots and logs. The team was excited by the ease of use of the Lightrun platform, which felt like a native IDE debugger but much more powerful. The result was a modified, cost-effective debugging workflow that includes dynamic logs and breakpoints at runtime within live applications.

The team at Gong embraced Lightrun with enterprise-grade security across multiple use cases.

Solve production issues and reduce MTTR
Validate complex production behaviors and pinpoint issues
Assess and analyze code execution performance using Lightrun metrics
Enhance the usage of logs while reducing static log expenses

By using Lightrun, Gong engineers were able to save hours each day when troubleshooting customer issues and to connect to their production environments from their IDEs without hot-fixing, redeploying, or changing the state of the app at runtime.

“I have found Lightrun to be both user-friendly and remarkably efficient for debugging complex production issues in remote environments. Utilizing Lightrun’s snapshots directly from my IDE has proven to be a straightforward yet powerful method as I worked towards identifying the issues.” – Gil Sagi, Staff Engineer, Gong R&D.

“Leveraging Lightrun, our teams unraveled complex incidents that were challenging to solve and replicate locally using conventional debugging solutions. Lightrun enables us to quickly introduce dynamic logs and snapshots surrounding the incident area, then facilitates reproduction and resolution of issues, much to the delight of our customers.” – Jacob Eckel, VP, Gong R&D

The Results

Before adopting Lightrun, Gong developers were compelled to add numerous static and expensive log lines each time a production issue arose. This necessitated debugging sessions that were long, costly, and inefficient, frequently requiring hotfixes and redeployments that could last for hours.

With the integration of Lightrun, the entire engineering team at Gong, consisting of a few hundred developers, transformed their troubleshooting procedures to incorporate on-demand dynamic logs and snapshots within their debugging workflows, reducing their MTTR significantly. This adjustment frequently reduces the overall debugging time from several hours to mere minutes. The adoption of Lightrun paved the way for a more mature platform engineering infrastructure throughout the organization.


How Drata Improves MTTR by 30% with Lightrun

Or Maimon — Tue, 07 Mar 2023 19:36:13 +0000

Lightrun reduces costs for logging and observability while further enhancing customer experience

About Drata

Drata is a continuous security and compliance automation platform that streamlines customers’ risk and compliance journeys across 14+ frameworks, regulations, and standards such as SOC2, ISO 27001, PCI DSS, GDPR, HIPAA, and more. Through an automation-led approach, Drata enables continuous control monitoring so companies of all sizes can achieve and maintain compliance over time and stay audit ready.

Thousands of leading companies use Drata to automate their risk and compliance programs, continuously monitor their controls, and scale securely.

The Challenge

Enable developers to troubleshoot quickly, from their own dev environments

As a business-critical platform for their customers, Drata developers are constantly focused on providing a functional and dynamic platform that helps its customers stay audit ready.

When it came to production improvements, the team realized that having to edit code and produce a new build each time they needed to add new logging to the system created repetitive processes backlogged by support tickets. Drata was looking for a better way to streamline the overall process while reducing the overall cost of logging.

With Lightrun, Drata is able to add logging dynamically, on the fly, without a complete CI/CD build cycle and redeployments. Observability is put fully in the hands of developers.

Where appropriate, the new observability is piped to their APM/Observability system; but for short-lived investigations where that’s not necessary, Lightrun lets developers add logging, get the info they need, then remove the logging – all without ever changing the code or producing a new build.

The Solution

Adding on-demand, real-time observability with Lightrun

Understanding the state of the running processes on a production machine traditionally relies on looking at existing application logs or attaching a remote debugger.

Prior to using Lightrun, Drata engineers used to have to follow these steps:

Determine which piece of the code needed extra visibility.
Add the required logs and measurements at the relevant places.
Deploy a new release to the production server.
Inspect the given information, and repeat the process all over again with new analysis.

With the Lightrun platform’s in-IDE experience, Drata was able to significantly optimize the above process, directly from the developer environment, while saving time and costs.

Drata’s new troubleshooting process now consists of a simple 2-step process:

Add snapshots and logs to the relevant code.
Observe all the information immediately inside the developer’s IDE or APM.

“We’ve experienced immediate value from the moment our developer team began using Lightrun; we were able to efficiently fix dozens of tickets on a monthly basis and our observability consumption moved more into the engineers IDE.” Alec Barba, Full Stack Engineer at Drata

The Results

Using Lightrun platform, Drata developers were able to improve MTTR by 30%

Using Lightrun, Drata developers eliminate bottlenecks by reducing the overall MTTR. Drata was also able to reduce the volume and costs of logs and storage that the engineering team was previously consuming. Overall, Lightrun empowered Drata to further enhance the production environment for a more seamless and reliable customer experience.

“Drata’s main focus is to ensure we’re delivering value for our customers and addressing their most urgent needs; we embrace that approach across the entire organization. Leveraging Lightrun to significantly streamline logging has allowed us to remediate faster without shifting our focus away from critical business initiatives.” Dave Knell, VP Software Engineering at Drata


How InsideTracker Improved MTTR by 50% and Saved Dozens of Developer Hours a Month Using Lightrun Logs and Snapshots

Eran Kinsbruner — Wed, 29 Mar 2023 12:30:01 +0000

The Challenge

InsideTracker is a personalized health analysis and data-driven wellness guide, designed to help you live healthier longer. The development team uses Java runtime technology in a cloud-based environment. They have two Kubernetes clusters, one for production and one for development and testing purposes.

The app consists of dozens of services across multiple pods that developers do not have access to. When an incident is reported, the developers have to go through an iterative and time-consuming process to troubleshoot the problem and add more static logs and telemetry, which slows down the resolution of customer-facing issues.

Specifically, there are cases where InsideTracker developers receive a partner incident involving API calls that return responses that could not be parsed, and therefore could not be troubleshot or understood by the engineers. In addition, issues attributed to third-party data providers, such as smartwatches and other health tracking solutions, are an additional challenge to troubleshoot and parse within the development environment.

As a personalized health and wellness technology company, it is critical for InsideTracker to troubleshoot and resolve production issues quickly. With that in mind, the developers needed an efficient and highly secure troubleshooting solution for their services.

Proposed Solution & Architecture

Lightrun helped the InsideTracker developers by allowing them to troubleshoot the apps whether they were running in the pre-production environment or in production during runtime, without stopping the app, directly from the development IDEs. This saved the engineering team the long and iterative cycles of debugging that used to take hours to complete.

By using Lightrun, over a dozen InsideTracker developers were able to add logs and snapshots without hotfixing, redeploying, or changing the state of the app at runtime, all on a platform built for customer privacy and security.

“By using Lightrun, our development team was able to figure out an extremely complex incident that was hard to parse and reproduce locally and with the standard debugging solutions. The team was able to use Lightrun to quickly add logs and snapshots around the area of the incident, reproduce and fix it” Yan Dyshkalps, Director of Technology Research, Architecture and Infrastructure, InsideTracker

The Results

Using Lightrun platform, InsideTracker developers improved MTTR by 50% and saved dozens of hours a month

Prior to using Lightrun, InsideTracker developers were unable to troubleshoot or access the remote Kubernetes environments from their local machines, causing debugging sessions to be long and inefficient. Such sessions used to involve hotfixes and redeployments that lasted hours.

Upon the adoption of Lightrun, the engineering group reduced troubleshooting time from hours to minutes by adding logs and snapshots dynamically and on-demand directly from the IntelliJ IDE.


How Start.io slashed MTTR by 50%-60% with Lightrun on AWS

gbenhorin — Thu, 07 Jan 2021 09:39:57 +0000

The Challenge

Juggling loads of traffic and its unknowns

Start.io handles more than 30 billion requests every day. Handling this enormous amount of traffic is an uphill battle – one that is fraught with complicated, nuanced production issues. Two key types of production issues that come with this level of scale are concurrency and parallelism problems. These types of issues appear only under a specific set of circumstances and are often very hard to reliably replicate locally. When these issues happen and are left unmitigated, they tend to lead to severe service disruptions – often causing data corruption and non-standard program behavior.

Software-defined caches are another pain point for Start.io’s developers. Some types of caches must remain immutable – that is, once a value is inserted into the cache it must not change. The developers discovered, however, that there are specific configurations under which the values in the cache can indeed change. This results in an unfortunate situation – when the cache is being “hit”, the information returned from it is not the same information the developers expect to see. Investigating these types of issues is also very difficult since the cache gets “dirty” non-deterministically. In other words – while it’s relatively easy to identify that the information returned from the cache is incorrect, it’s hard to know which activities make the cache “dirty”.

“Using Lightrun, we were able to dive deep into production issues instantly, with a single action. By reducing the number of steps in the debugging process, Lightrun helped us reduce our MTTR by 50%-60%. Lightrun is definitely an ideal addition to every company’s toolbox, and is especially helpful in investigating hard-to-replicate production issues” Boris Shmerlin, Director of Advertising R&D at Start.io

The Solution

Leveraging Lightrun for real-time debugging, monitoring, and alerting without ever leaving the IDE

When Start.io learned of Lightrun’s approach to production debugging, they were immediately intrigued. They have, after all, been hard at work trying to break apart the exact issues Lightrun claims to solve. After a short exploration period, Start.io deployed Lightrun to a significant portion of their production services. Previously, when developers wanted to add more visibility when a specific event occurs in a running application, they had to:

Add a new piece of code that exposes some piece of information
Pour the information produced into an external system, for example, Kibana
Review the information in the external system

Lightrun eliminates this entire process, by opting instead to take a more proactive approach. Using Lightrun, Start.io’s developers now define conditions that determine when the event at hand should occur. Then, when it does, the developers get proactive alerts inside their IDE with all the required information. When the specific case is being “caught” (i.e. when a specific condition is being met), Lightrun automatically pipes the information right to them.

This new process is especially handy when debugging the previously mentioned cache issues. By placing Lightrun Snapshots on the relevant parts of the cache and inspecting the stack trace, the developers can now identify the exact flow that caused the cache to misbehave. Without Lightrun, capturing the same stack trace would take significantly longer – resulting in slower MTTR and a decreased quality of service for their customers.

Getting a better grip on issues that appear only under specific circumstances is also a breeze using Lightrun. By placing a Lightrun agent in each of their data centers, Start.io’s developers were able to identify – in real-time – issues that are isolated to a single data center and resolve them significantly faster.
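The cache investigation described above comes down to capturing the call stack at the moment a supposedly immutable entry changes. The sketch below shows that idea in plain Java, without Lightrun: a hypothetical `WriteOnceCacheSketch` that rejects an overwrite and throws an exception whose stack trace points at the offending writer. It is a simplified assumption about the failure mode, since the source does not say how Start.io's caches are implemented or how the values were being changed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical write-once cache; not Start.io's implementation.
// Null keys and values are not supported (ConcurrentHashMap rejects them).
final class WriteOnceCacheSketch<K, V> {
    private final ConcurrentMap<K, V> values = new ConcurrentHashMap<>();

    // Stores the value if the key is new. Re-inserting an equal value is allowed.
    // Inserting a different value throws; the exception's stack trace identifies
    // the code path that tried to make the cache "dirty".
    void put(K key, V value) {
        V previous = values.putIfAbsent(key, value);
        if (previous != null && !previous.equals(value)) {
            throw new IllegalStateException(
                    "Attempt to overwrite cached value for key " + key
                            + ": " + previous + " -> " + value);
        }
    }

    V get(K key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        WriteOnceCacheSketch<String, String> cache = new WriteOnceCacheSketch<>();
        cache.put("campaign-7", "v1");
        try {
            cache.put("campaign-7", "v2");
        } catch (IllegalStateException e) {
            // The first stack frames show where the conflicting write came from.
            e.printStackTrace();
        }
    }
}
```

A guard like this has to be written and deployed ahead of time, whereas a snapshot can be placed on the write path after the problem has been noticed.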

The results

50%-60% faster incident resolution using Lightrun

Start.io saves a lot of time by debugging with Lightrun, relieving their teams of unnecessary repetitive processes and freeing them up to focus on writing new features. Real-time debugging without needing to add new code (and without having to remove that code later on), proactive alerting, and visibility into the code path that led up to the issue at hand all result in a significant increase in productivity. Lightrun also supports Start.io in reducing much of the friction associated with incident resolution. Because it is completely integrated into the IDE, Lightrun enables developers to keep their fingers on the pulse of production systems without constant context switching. This streamlined approach has Start.io’s developers reporting less stress and a significantly improved developer experience during incident resolution.


How Lightrun saved WhiteSource cycles of redeployments on AWS

gbenhorin — Mon, 07 Sep 2020 09:43:46 +0000

The Challenge

Quickly Identifying an Error in Production

The error occurred in a method that collected results from several threads, but WhiteSource could not immediately identify which of the tasks had the problem. The stack trace showed that the last line of code running was for collecting the thread results.

The last WS line:

com.wss.util.ThreadUtils.collectResults

The last line in the stack trace:

java.util.concurrent.FutureTask.report

WhiteSource was able to identify that the overall exception was an SQL syntax error. However, the log made it very difficult to identify which query was throwing the exception and making it fail.

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘) group by projectinv1_.projectId’ at line 1

The logs were not informative and did not enable WhiteSource to identify the solution in the current version running in production. They had to find a way to identify the root cause.

In the past, WhiteSource had resorted to adding new logs to suspected lines of code in the next deployment. Occasionally, they had to go through several iterations of adding logs to new versions until the issue was detected. They would remove the logs in the following deployment to decrease overhead and logging costs.

This process would sometimes take weeks of iterations and many hours of developer time: waiting for the changes to be deployed to production, inspecting code behavior, and re-exploring the issue. The developer would also have to deal with a lot of context switches – every iteration would require the developer to re-read the relevant code, recall the assumptions and continue from there. In the meantime, the version would be running with an error.

“Using Lightrun to debug an actual issue in production enabled us to react instantly. We were able to add the right logs and identify the root-cause in a real-time session, instead of waiting for redeployments” Tom Shapira, Director of Software Engineering at WhiteSource

The Solution

Adding Logs with Lightrun On-demand

WhiteSource used Lightrun to dynamically add logs to each thread in production. They needed to identify the problematic query, among all the MySQL queries in their system.

With Lightrun, integrated into their IDE, WhiteSource was able to add these logs in real-time. The process was simple and only took them a few moments. They were then able to quickly identify where the problematic flow occurred and which lines were executed.
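The difficulty in this incident was that an exception surfacing through `java.util.concurrent.FutureTask.report` says nothing about which submitted task failed. As a hedged sketch of the per-task context that logs can supply, the hypothetical `collectLabeledResults` method below attaches a name to each task and includes it in the failure message. It is not WhiteSource's `ThreadUtils.collectResults`, whose implementation is not shown in the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper; task names and queries are illustrative only.
final class LabeledCollectSketch {

    // Submits every named task, then collects results in submission order.
    // If a task fails, the thrown exception names the task and keeps the original cause.
    static <T> List<T> collectLabeledResults(ExecutorService pool,
                                             Map<String, Callable<T>> namedTasks) {
        Map<String, Future<T>> futures = new LinkedHashMap<>();
        for (Map.Entry<String, Callable<T>> entry : namedTasks.entrySet()) {
            futures.put(entry.getKey(), pool.submit(entry.getValue()));
        }
        List<T> results = new ArrayList<>();
        for (Map.Entry<String, Future<T>> entry : futures.entrySet()) {
            try {
                results.add(entry.getValue().get());
            } catch (ExecutionException e) {
                throw new IllegalStateException(
                        "Task '" + entry.getKey() + "' failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException(
                        "Interrupted while waiting for task '" + entry.getKey() + "'", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Map<String, Callable<Integer>> tasks = new LinkedHashMap<>();
            tasks.put("countProjects", () -> 3);
            tasks.put("countLibraries", () -> {
                throw new IllegalArgumentException("bad SQL");
            });
            collectLabeledResults(pool, tasks);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
        } finally {
            pool.shutdown();
        }
    }
}
```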

The Results

Exception Identification in Minutes

In just a few minutes, WhiteSource was able to identify the line of code that was generating the exception, a line that the logs they had originally added did not reach. As a result, WhiteSource was able to quickly catch and handle the bug. They discovered that they had sent an empty collection of IDs to the query. In the future, they will check whether the collection is empty before sending it to the query.
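The planned check can be sketched as follows. The source only says that an empty collection of IDs reached the query; the assumption here is that the IDs feed an `IN (...)` list, which MySQL rejects as a syntax error when the list is empty, consistent with the `near ') group by ...'` message quoted earlier. The `InClauseSketch` class, the table name, and the query text are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical query builder; table and column names are illustrative only.
final class InClauseSketch {

    // Returns a parameterized query with one "?" per ID, or Optional.empty()
    // when there are no IDs, so the caller can skip the database call entirely.
    static Optional<String> buildQuery(Collection<Long> projectIds) {
        if (projectIds == null || projectIds.isEmpty()) {
            return Optional.empty();
        }
        String placeholders = projectIds.stream()
                .map(id -> "?")
                .collect(Collectors.joining(", "));
        return Optional.of(
                "SELECT projectId, COUNT(*) FROM project_inventory"
                        + " WHERE projectId IN (" + placeholders + ")"
                        + " GROUP BY projectId");
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(Arrays.asList(1L, 2L, 3L)).orElse("<skipped>"));
        System.out.println(buildQuery(Collections.<Long>emptyList()).orElse("<skipped>"));
    }
}
```

Returning an empty `Optional` makes the "no IDs" case explicit at the call site instead of letting malformed SQL reach the database.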

By using Lightrun, WhiteSource was able to quickly identify an error that would have taken them cycles of redeployments to resolve. This process would have included waiting for future deployments, which occur every two weeks, to add logs and recreate the issue. They might have had to go through multiple iterations before identifying the root cause of the exception, followed by one last iteration to remove the logs. This is a very time-consuming and resource-heavy process. In some cases, it prevents full deployment of code versions.

In addition, the quick identification enabled WhiteSource to establish a new best practice to add to their tool set, which should ensure this error does not occur again.


How OurCrowd Improved its MTTR of Business Critical Issues by 70% with Lightrun Dynamic Observability Platform

Eran Kinsbruner — Wed, 12 Jul 2023 15:02:38 +0000

The Challenge

OurCrowd is a global venture investing platform that empowers institutions and individuals to invest and engage in emerging companies.

The OurCrowd development team was taking days to drill down into the root cause of production incidents by only using their traditional debugging tools. The OurCrowd engineering team was aiming to reduce the time it takes from getting alerts from their APM tools until the issues are resolved, and also reduce the overall cost associated with developer observability.

Before incorporating Lightrun, the developers received alerts from tools such as Rollbar, indicating problems in their production environments. They attempted to resolve these issues by adding more logs for debugging purposes. Unfortunately, this process took a minimum of two days, and sometimes even longer.

Solving the issues was too slow and therefore costly from a developer productivity standpoint.

Proposed Solution

Lightrun helped the OurCrowd developers by providing them with the ability to add dynamic logs to their production environments and home in on the problem area, without the need to redeploy the apps to obtain more visibility or to stop and change the state of the app.

“By using Lightrun, our development team was able to figure out the solution to an extremely complex incident that was hard to resolve locally, saving both developers time and business implications. The team was able to use Lightrun to quickly add logs and snapshots around the area of the incident, reproduce and fix it.” Or Angrest, R&D Manager, OurCrowd

The development team at OurCrowd started using the Lightrun IDE plugin within VSCode to troubleshoot their Node.js code and add logs without hot-fixing, redeploying or even restarting multiple components in their system.

“Lightrun has transformed my software development process! With its live debugging capabilities, seamless IDE integration, and real-time insights, it’s a game-changing companion that significantly enhances my debugging and development. It’s like having a superpower that amplifies productivity and ensures top-notch applications.” Tamir Dagan, Infrastructure Code Team Lead, OurCrowd

The Results

Using Lightrun platform, OurCrowd developers improved MTTR by 70% and saved dozens of hours a month

By utilizing the Lightrun IDE plugin to add Logs and Snapshots to running applications, the development team were able to quickly investigate and resolve bugs in a few hours rather than days. As a best practice, OurCrowd developers also started to use Lightrun as part of their platform engineering tool stack throughout the development process as well and not only during production troubleshooting. This allows developers to pinpoint issues earlier in the SDLC and save costs on logging.

“When it comes to debugging, there’s no place like the server. And there really is no place like production. Lightrun’s ability to peer into the server environment (especially production!) gives a glimpse into the jaws of the problem. Once we understand the problem, most of the battle has been fought.” Larry Reisler, Core Group Lead, OurCrowd


How Easyway improved MTTR by 60% and Saved Dozens of Developer Hours a Month using Lightrun Logs and Snapshots

Eran Kinsbruner — Tue, 28 Mar 2023 13:37:28 +0000

About Easyway

Easyway is a cloud-based platform that provides a suite of artificial intelligence-powered tools to automate customer service and support tasks. The platform uses natural language processing (NLP) and machine learning (ML) to interpret customer inquiries and generate accurate and timely responses. Easyway’s tools include a chatbot builder, live chat support, and an AI-powered knowledge base that can be integrated with popular customer support channels like Facebook Messenger, WhatsApp, and more. With Easyway, businesses can save time and resources by automating their customer service workflows while still providing high-quality support to their customers.

The Challenge

Easyway’s developers build their guest relationship management platform on top of multiple Node.js microservices running on Amazon’s EKS Kubernetes clusters. These services were previously single-instance pods, but the company recently moved to using replicas as part of a major scaling process happening in the organization. The Easyway engineering team aimed to properly support the increasing traffic and address issues as they arise.

Specifically, the developers were dealing with problematic request-response debugging cycles in the service that manages their integrations and in the MSK Kafka cluster that the integrations used as a data streaming layer. The developers added logs to the outgoing requests, the incoming responses, and the Kafka consumers to gain further insight into what was happening.

However, debugging was still difficult, as they had to release multiple deployments with additional logging, which was too time-consuming and resource-intensive. The developers needed a more efficient way to debug the services.
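To make the cost of that approach concrete, here is a minimal TypeScript sketch of the kind of hand-written logging described above. It is an illustration only and is not taken from Easyway's code: `withLogging`, `fetchReservation`, and `handleMessage` are hypothetical names, and the two handlers are simple stand-ins for a real integration call and a real Kafka consumer callback. Every new log line of this kind has to be built and deployed before it produces any output.

```typescript
// Hypothetical names throughout; nothing here comes from Easyway's codebase.
type Logger = (message: string) => void;

// Wraps an async handler so its input and its result (or error) are logged.
function withLogging<I, O>(
  label: string,
  handler: (input: I) => Promise<O>,
  log: Logger = console.log,
): (input: I) => Promise<O> {
  return async (input: I): Promise<O> => {
    log(`${label} -> ${JSON.stringify(input)}`);
    try {
      const output = await handler(input);
      log(`${label} <- ${JSON.stringify(output)}`);
      return output;
    } catch (err) {
      log(`${label} !! ${err instanceof Error ? err.message : String(err)}`);
      throw err;
    }
  };
}

// Stand-in for an outgoing integration request (assumed shape).
async function fetchReservation(
  request: { id: number },
): Promise<{ id: number; status: string }> {
  return { id: request.id, status: "confirmed" };
}

// Stand-in for a Kafka consumer's per-message handler (assumed shape).
async function handleMessage(
  message: { topic: string; value: string },
): Promise<number> {
  return message.value.length;
}

// Logging is baked in at build time, so changing it means another deployment.
const loggedFetch = withLogging("integration request", fetchReservation);
const loggedConsumer = withLogging("kafka consumer", handleMessage);
```

Calling `loggedFetch({ id: 7 })` logs the request, then the response, and returns the handler's result unchanged. Changing what is logged means editing this code and redeploying.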

The Solution

Lightrun helped the Easyway developers by providing a comprehensive view of their microservices running on Amazon EKS and of their data streaming layer on top of MSK, enabling them to quickly identify and address traffic-related issues as they occur.

By using Lightrun, developers were able to add logs to multiple components in the system without hotfixing, redeploying, or even restarting, including:

Their own code processing the incoming API responses
The Kafka consumers in charge of handling the data received from the API requests

By using the Lightrun Visual Studio Code plugin to add Logs and Snapshots to running applications, the Easyway development team was able to quickly investigate and troubleshoot their pods during a massive scale-up, without relying on iterative, manual deployment steps.

The Results

Using the Lightrun platform, Easyway developers were able to improve MTTR by 60%

Before adopting Lightrun, Easyway developers were used to troubleshooting sessions that lasted many hours with their existing tool stack. After adding Lightrun to the development ecosystem, the engineering team was able to better troubleshoot Amazon EKS deployments from their native IDEs, shorten their MTTR, and optimize their workflows.

Today, using Lightrun’s read-only, real-time Logs and Snapshots, developers have reduced troubleshooting time from hours to mere minutes, and they credit Lightrun as a major success driver in their current scale-up process.

The post How Easyway improved MTTR by 60% and Saved Dozens of Developer Hours a Month using Lightrun Logs and Snapshots appeared first on Lightrun.

Debugging Serverless Functions with Lightrun

Roni Kriger — Sun, 02 Apr 2023 06:35:15 +0000

Major cloud providers like AWS, Azure, and GCP offer Functions-as-a-Service (FaaS) platforms, which are popular among developers. Debugging serverless functions can be challenging, but new solutions like Lightrun support multiple languages and IDEs, allowing developers to troubleshoot serverless functions via dynamic logs and snapshots.

The post Debugging Serverless Functions with Lightrun appeared first on Lightrun.

The Hidden Costs of Production Downtime in the Financial Industry (and How Developer Observability can help)

Reut Bashan — Sun, 30 Apr 2023 13:51:46 +0000

Date & Time

May 17th, 11 am (EST) / 8 am (PST) / 4 pm (GMT) / 6 pm (GMT+2)

About the webinar

As financial apps continuously evolve toward more distributed architectures, a highly competitive landscape, and more digital users across many different platforms, the cost of failure and the ability to troubleshoot end-user issues quickly and efficiently are becoming key to these organizations’ success. In addition, many of these financial organizations are still required to support a mix of legacy and cloud-native applications.

Join leaders including Thomas Haver from M&T Bank and Joe Larizza from Bank of Montreal (BMO) in this panel webinar, where we will explore topics around ensuring high-quality financial applications as well as high developer productivity in this highly demanding market segment.

In this session, you will learn hands-on tips & tricks around:

✓ How financial enterprises deal with production downtime and end-user issues across both cloud-native and legacy architectures

✓ What is the impact on the bottom line and developer productivity?

✓ Tips and tricks on how to meet SLAs while reducing MTTR of production issues

✓ And much more.

The post The Hidden Costs of Production Downtime in the Financial Industry (and How Developer Observability can help) appeared first on Lightrun.
