Testing in Production: Recommended Tools

Testing in production has a bad reputation – the same kind "git push --force origin master" has. Burning houses and Chuck Norris represent testing in production in memes, and that says it all. When done poorly, testing in production very much deserves the sarcasm and negativity. But that's true for any methodology or technique.

This blog post aims to shed some light on the testing in production paradigm. I will explain why giants like Google, Facebook and Netflix see it as a legitimate and very beneficial instrument in their CI/CD pipelines – so much so, in fact, that you might consider adopting it as well. I will also recommend tools for testing in production, based on my team's experience.

Testing In Production – Why?

Before we proceed, let's make it clear: testing in production is not applicable to every kind of software. Embedded software, on-prem high-touch installation solutions and critical systems of any type should not be tested this way. The risks (and as we'll see, it's all about risk management) are too high. But do you have a SaaS solution with a backend that leverages a microservices architecture, or even just a monolith that can easily be scaled out? Or any other solution whose deployment and configuration your company's engineers fully control? Ding ding ding – those are the ideal candidates.

So let's say you are building your SaaS product and have already invested a lot of time and resources in implementing both unit and integration tests. You have also built your staging environment and run a bunch of pre-release tests on it. Why on earth would you bother your R&D team with tests in production? There are multiple reasons; let's take a deep dive into each of them.

Staging environments are bad copies of production environments

Yes, they are. Your staging environment is never as big as your production environment – in terms of server instances, load balancers, DB shards, message queues and so on. It never handles the load and network traffic that production does, so it will never have the same number of open TCP/IP connections, HTTP sessions, open file descriptors and parallel DB write queries. There are stress testing tools that can emulate that load, but once you scale, they stop being sufficient very quickly.

Beyond size, the staging environment never matches production in configuration and state. It is often configured to start a fresh copy of the app upon every release, security configurations are relaxed, ACLs and service discovery never handle real-life production scenarios, and the databases are emulated by recreating them from scratch with automation scripts (copying production data is often impossible, even legally, due to privacy regulations such as GDPR). Well, after all, we all try our best.

At best, we can create a bad copy of our production environment. This means our testing will be unreliable and our service will remain susceptible to errors in the real-life production environment.

Chasing maximum reliability before the release costs. A lot.

Let's just cite Google's engineers:

“It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.

Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.”

Let's emphasize the point: "Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear". No unit/integration/staging-env tests will ever make your release 100% error-free. In fact, they shouldn't (well, unless you are a Boeing engineer). After a certain point, investing more and more in tests and attempting to build a better staging environment will just cost you more compute/storage/traffic resources and will significantly slow you down.

Doing more of the same is not the solution. You shouldn’t spend your engineers’ valuable work hours chasing the dragon trying to diminish the risks. So what should you be doing instead?

Embracing the Risk

Again, citing the great Google SRE Book:

“…we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos…. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

So it is not just about when and how you run your tests. It's about how you manage the risks and costs of your application's failures. No company can afford product downtime because of a failed test (something that is totally OK in staging). Therefore, it is crucial to ensure that your application handles failures right. "Right", quoting the great post by Cindy Sridharan, means:

“Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure.”

The design of fault-tolerant and resilient apps is out of the scope of this post (Netflix Hystrix is still worth a look, though). So let's assume that's how your architecture is built. In such a case, you can fearlessly roll out a new version that has been tested just enough internally.

The way to bridge the remaining gap – to get as close as possible to 100% error-free – is by testing in production: testing how our product really behaves and fixing the problems that arise. To do that, you can use a long list of dedicated tools and also expose the product to real-life production use cases.

So the next question is – how to do it right?

Testing In Production – How?

Cindy Sridharan wrote a great series of blog posts that discusses the subject in great depth. Her Testing in Production, the safe way blog post includes a table of test types you can run in pre-production and in production.

You should definitely read through that post carefully. Here, we'll just take a brief look at some of the techniques she describes and recommend tools from each category. I hope you find our recommendations useful.

Load Testing in Production

As simple as it sounds. Depending on the application, it makes sense to stress its ability to handle huge amounts of network traffic, I/O operations (often distributed), database queries, message queue storms and so on. Some severe bugs only show up clearly under load (hi, memory overwrite). And even when they don't – your system can always handle only a limited amount of load, so failure tolerance and graceful handling of dropped connections become really crucial here.

Obviously, performing a load test in the production environment stresses your app as it is configured for real-life use, so it provides far more useful insights than load testing in staging.
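To make the idea concrete, here is a deliberately naive sketch of the core loop every load tool implements: concurrent workers hammering an endpoint and counting failures. The target URL and the numbers are hypothetical; the real tools below add ramp-up profiles, protocol support and reporting on top of this.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Naive load generator: CONCURRENCY parallel workers, each sending
// REQUESTS_PER_WORKER GET requests and counting failures.
public class NaiveLoadTest {
    static final String TARGET = "https://staging.example.com/health"; // hypothetical endpoint
    static final int CONCURRENCY = 50;
    static final int REQUESTS_PER_WORKER = 200;

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(CONCURRENCY);
        for (int i = 0; i < CONCURRENCY; i++) {
            pool.submit(() -> {
                HttpRequest req = HttpRequest.newBuilder(URI.create(TARGET)).GET().build();
                for (int j = 0; j < REQUESTS_PER_WORKER; j++) {
                    try {
                        HttpResponse<Void> res =
                            client.send(req, HttpResponse.BodyHandlers.discarding());
                        if (res.statusCode() >= 500) failures.incrementAndGet();
                    } catch (Exception e) {
                        failures.incrementAndGet(); // dropped connection, timeout, etc.
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("Failures: " + failures.get());
    }
}
```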

There are a bunch of software tools for load testing that we recommend, many of them open source. To name a few:

mzbench

mzbench supports MySQL, PostgreSQL, MongoDB and Cassandra out of the box, and more protocols can easily be added. It was a very popular tool in the past, but its developer abandoned it about two years ago.

HammerDB

HammerDB supports Oracle Database, SQL Server, IBM Db2, MySQL, MariaDB, PostgreSQL and Redis. Unlike mzbench, it is under active development as of May 2020.

Apache JMeter

Apache JMeter focuses more on web services (DB protocols are supported via JDBC). This is the old-fashioned (though somewhat cumbersome) Java tool I was using ten years ago for fun and profit.

BlazeMeter

BlazeMeter is a proprietary tool. It runs JMeter, Gatling, Locust, Selenium (and more) open source scripts in the cloud, enabling you to simulate more users from more locations.

Spirent Avalanche Hardware

If you are into heavy guns – meaning you are developing solutions like WAFs, SDNs or routers – then this testing tool is for you. Spirent Avalanche is capable of generating up to 100 Gbps of traffic, performing vulnerability assessments, QoS and QoE tests and much more. I have to admit: it was my first load testing tool as a fresh graduate working at Check Point, and I still remember how amazed I was by its power.

Shadowing/Mirroring in Production

Send a portion of your production traffic to your newly deployed service and see how it's handled in terms of performance and possible regressions. Did something go wrong? Just stop the shadowing and take your new service down – with zero impact on production. This technique is also known as a "dark launch" and is described in detail in Google's CRE life lessons: What is a dark launch, and what does it do for me? blog post.

A proper configuration of load balancers/proxies/message queues will do the trick. If you are developing a cloud-native application (Kubernetes/microservices), you can use solutions like:

HAProxy

HAProxy is an open source, easy-to-configure proxy server.

Envoy proxy 

Envoy proxy is open source and a bit more advanced than HAProxy. Built for the microservices world, it offers service discovery, traffic shadowing, circuit breaking and dynamic configuration via API.

Istio

Istio is a full open-source service mesh solution. Under the hood, it uses the Envoy proxy as a sidecar container in every pod; this sidecar is responsible for all incoming and outgoing communication. Istio controls service access, security, routing and more.
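To show how little configuration shadowing requires in a mesh, here is a minimal sketch of traffic mirroring in an Istio VirtualService. The service name and the v1/v2 subsets are hypothetical and assume a matching DestinationRule; mirrored requests are fire-and-forget, so responses from v2 never reach real users.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service        # hypothetical service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1     # all live traffic still served by v1
          weight: 100
      mirror:
        host: my-service
        subset: v2         # shadow copy sent to the new version
      mirrorPercentage:
        value: 10.0        # mirror 10% of requests
```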

Canarying in Production

Google SRE Book defines “canarying” as the following:

To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion. Should anything go awry, the modified servers can be quickly reverted to a known good state.

This technique, as well as the similar (but not identical!) Blue-Green deployment and A/B testing techniques, is discussed in this Christian Posta blog post, while the caveats and cons of canarying are reviewed here. As for recommended tools:

Spinnaker

Spinnaker is the CD platform Netflix open-sourced. It leverages the aforementioned and many other deployment best practices (and, as with everything Netflix builds, it was designed with microservices in mind).

ElasticBeanstalk

AWS supports Blue/Green deployment with its ElasticBeanstalk PaaS solution.

Azure App Services

Azure App Services has its own staging slots capability that allows you to apply the prior techniques with zero downtime.

LaunchDarkly

LaunchDarkly is a feature flagging solution for canary releases, enabling gradual capacity testing of new features and a safe rollback if issues are found.
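Under the hood, percentage-based canarying usually boils down to deterministic bucketing: hash a stable user attribute and compare it against the rollout percentage, so each user consistently sees the same variant. Here is a minimal, tool-agnostic sketch (all names hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Percentage-based rollout: deterministically bucket users by ID so the
// same user always lands on the same side of the flag.
public class CanaryFlag {
    private final int rolloutPercent; // 0..100

    public CanaryFlag(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    public boolean isEnabled(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (crc.getValue() % 100) < rolloutPercent; // stable bucket in 0..99
    }
}

// Usage (hypothetical flow names):
// CanaryFlag newCheckout = new CanaryFlag(5); // 5% of users
// if (newCheckout.isEnabled(user.getId())) { newCheckoutFlow(); }
// else { oldCheckoutFlow(); }
```

Feature flag platforms add the targeting rules, kill switches and audit trails around this core idea.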

Chaos Engineering in Production

First introduced with Netflix's Chaos Monkey, chaos engineering has emerged as a separate and very popular discipline. It is not about "simple" load testing; it is about bringing down service nodes, reducing DB shards, misconfiguring load balancers, causing timeouts – in other words, messing up your production environment as badly as possible.

The winning tools in this area are ones I like to call "chaos as a service":

ChaosMonkey

ChaosMonkey is an open source tool by Netflix. It randomly terminates services in your production system, making sure your application is resilient to these kinds of failures.

Gremlin

Gremlin is another great tool for chaos engineering. It allows DevOps (or a chaos engineer) to define simulations and see how the application reacts to different scenarios: unavailable resources (CPU/memory), state changes (changed system time, killed processes) and network failures (packet drops, DNS failures).

There are plenty of others as well.
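To get a feel for what fault injection means at the code level, here is a minimal sketch of a latency-injecting servlet filter, assuming the javax.servlet API; the fault rate and delay are arbitrary. Real chaos tools do this at the infrastructure level, randomly, safely and with a blast-radius limit.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Injects random latency into a small fraction of requests to surface
// timeout handling and retry bugs before a real outage does.
public class ChaosLatencyFilter implements Filter {
    private static final double FAULT_RATE = 0.01; // hypothetical: 1% of requests
    private static final long MAX_DELAY_MS = 2_000;

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (ThreadLocalRandom.current().nextDouble() < FAULT_RATE) {
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(MAX_DELAY_MS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, res); // pass the (possibly delayed) request along
    }
}
```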

Debugging and Monitoring in Production

Last but not least, let's briefly review monitoring and debugging tools. Debugging and monitoring are the natural next steps after testing: testing in production provides us with real product data, which we can then use for debugging. Therefore, we need the right tools to monitor and debug test results in production.

There are some acknowledged leaders, each addressing the need for the three pillars of observability – logs, metrics and traces – in its own way:

DataDog

DataDog is a comprehensive monitoring tool with amazing tracing capabilities, which help a lot in debugging with very low overhead.

Logz.io

Logz.io is all about centralized log management – combining it with DataDog can create a powerful toolset.

New Relic

New Relic is a very strong APM tool that offers log management, AIOps, monitoring and more.

Prometheus

Prometheus is an open source monitoring solution that includes metrics scraping, querying, visualization and alerting.
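As a taste of how little code instrumenting for Prometheus takes, here is a minimal sketch using its Java simpleclient libraries (the simpleclient and simpleclient_httpserver artifacts); the metric name and port are made up. Prometheus then scrapes the /metrics endpoint on its own schedule.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class AppMetrics {
    // Counter exposed at /metrics for Prometheus to scrape.
    static final Counter ordersProcessed = Counter.build()
            .name("orders_processed_total") // hypothetical metric name
            .help("Total number of processed orders.")
            .register();

    public static void main(String[] args) throws Exception {
        HTTPServer scrapeEndpoint = new HTTPServer(9091); // serves /metrics
        while (true) {
            processOrder();
            ordersProcessed.inc(); // one increment per unit of real work
        }
    }

    static void processOrder() throws InterruptedException {
        Thread.sleep(100); // stand-in for real work
    }
}
```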

Lightrun

Lightrun is a powerful production debugger. It enables adding logs, performance metrics and traces to production and staging in real time, on demand. Lightrun enables developers to securely add instrumentation without having to redeploy or restart. Request a demo to see how it works.

To sum up, testing in production is a technique you should pursue and experiment with if you are ready for a paradigm shift: from diminishing risks in pre-production to managing risks in production.

Testing in production complements the testing you are used to doing and adds important benefits, such as speeding up release cycles and saving resources. I covered several types of production testing techniques and recommended tools for each. If you want to read more, check out the resources I cited throughout the blog post. Let us know how it goes!

Learn more about Lightrun and let’s chat.

The Cost of Production Blindness
When I speak at conferences, I often fall back to the fact that just a couple of decades ago we’d observe production by kicking the server. This is obviously no longer practical. We can’t see our production. It’s an amorphous cloud that we can’t touch or feel. A power that we read about but don’t fully grasp.


Part of this major shift in our industry is a change in our fundamental roles as engineers. DevOps and SRE are roles that didn't exist back then, yet today they're often essential for major businesses. They brought tremendous advancements to the reliability of production, but they also brought a cost: distance.

Production is in the amorphous cloud, accessible from everywhere. Yet it has never been further away from the people who write the software powering it. We no longer have the fundamental insight we took for granted a bit over a decade ago.

Is That So Bad?

Yes, and no. We gave up some insight and control and got a lot in return:

  • Stability
  • Simplicity
  • Security

These are pretty incredible benefits that we don't want to give up. But we also lost some insight: debugging became harder, and complexity rose. We've discussed these problems before, but today I want to talk about one impact only…

Cost

This is a form of blindness.

I wrote a lot about the impact of this situation on the reliability of our cloud deployments. But today I want to talk about the financial and environmental costs. Initially, the cloud was billed as a cost-saving measure, and there was some truth to that. The agility of deployment let us cut down on hardware costs, consolidate and simplify.

But as we got used to the cloud, our appetite for scale and reliability grew. We simplified deployment to such an extent that launching a container can be accomplished seamlessly, with no interaction on our part. This is enormous progress, but also troubling: we slowly lose our grip on costs and end up paying more for less.

So what’s the solution?

APMs are a category of tools that rose to prominence specifically around this problem. Today, they are more important than ever. They help us apply the Pareto principle (the 80/20 rule) so we can focus optimizations on the specific areas that cost the most.

This is a powerful and important tool that DevOps use every day, but it’s also a very limited one.

Before we proceed, I'd like to take a moment to discuss the concept of cost. The most obvious impact is our monthly cloud provider bill – an amount that might otherwise fund an entire department. But there's a more important cost, in my humble opinion: the environmental one. We tend to ignore the electricity spend because it's very amorphous, but it is severe; e.g., the cost of a single cloud instance over one year can be the equivalent of a transatlantic flight.

We don’t see the underlying hardware, but it’s there, and it carries a carbon footprint. By optimizing, we can affect both costs significantly.

Observing Production Effectively

APMs are great for measuring performance at a high level, but they provide very little detail about the dynamic inner workings of the application and the cost-cutting measures we can take inside it. I often liken them to the bat signal or a check engine light: they notify us of a problem but leave us without the tools to inspect the details.

That's where developer observability tools fill the gap. These tools provide low-level, actionable insights into the application. They verify assumptions and give developers the means to truly understand production.

Instead of discussing the theory, let’s give some examples of actions you can take today with developer observability tools to reduce the costs of your production.

Reduce Logs

Log ingestion is probably the most expensive feature in your application. Removing a single log line can end up saving thousands of dollars in ingestion and storage costs. Yet we tend to overlog, since the alternative is production issues that we can't trace to their root cause.

We need a middle ground: the ability to follow an issue through without overlogging. Developer observability lets you add logs dynamically, as needed, in production. This frees you from the need to overlog and lets you log a reasonable amount. You can also raise the log level to keep log volume down. I wrote about this in depth here.
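Dynamic logging is tooling-specific, but the static middle ground looks roughly like the sketch below: guard the log call by level and sample it, so a hot path pays ingestion costs for one event in a thousand rather than all of them. This assumes SLF4J; the class name and sampling rate are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Guarded, sampled debug logging: emit 1 in SAMPLE_EVERY events on a hot
// path instead of every event, cutting ingestion and storage costs.
public class SampledLogger {
    private static final Logger log = LoggerFactory.getLogger(SampledLogger.class);
    private static final AtomicLong counter = new AtomicLong();
    private static final long SAMPLE_EVERY = 1_000;

    public static void debugSampled(String message, Object... args) {
        // isDebugEnabled() skips argument formatting when the level is raised.
        if (log.isDebugEnabled() && counter.incrementAndGet() % SAMPLE_EVERY == 0) {
            log.debug(message, args);
        }
    }
}
```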

Caching

My top three tips for performance have always been:

  1. Caching
  2. Caching
  3. Caching

There's really nothing else; it all boils down to that. Unfortunately, cache misses are notoriously hard to tune and detect. This is an even bigger issue in production, where we need to account for a changing landscape. E.g., we cache up to 10 friends of a user on a social network, but in production the growth team encourages friendships and users have more friends…

You’d have cache misses more often and you wouldn’t even know.

Placing conditional breakpoints or temporary conditional logs on cache misses, and inspecting them, can go a long way toward detecting subtle issues like that. Done right, this can make an order-of-magnitude difference to performance.
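For illustration, here is a minimal sketch of a cache wrapper that counts misses, assuming nothing beyond the JDK; a conditional log or snapshot placed on the miss branch gives you the same signal in production without redeploying.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

// Wraps a cache lookup and counts misses so the hit rate can be logged
// (or inspected with a conditional breakpoint) in production.
public class InstrumentedCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    public V get(K key, Function<K, V> loader) {
        V value = cache.get(key);
        if (value != null) {
            hits.increment();
            return value;
        }
        misses.increment(); // <- a conditional log/breakpoint here exposes miss patterns
        return cache.computeIfAbsent(key, loader);
    }

    public double missRate() {
        long total = hits.sum() + misses.sum();
        return total == 0 ? 0.0 : (double) misses.sum() / total;
    }
}
```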

However, there's a bigger payout here. Many developers ignore L2 caches entirely. This is understandable: they are hard to maintain and debug, especially in production. A single cache corruption, or a value that's out of sync, and you end up with a major bug. The problem is that debugging these things in the production environment is essential; cache behaves radically differently in production because of its distributed nature.

We built developer observability solutions to debug exactly these types of problems. By placing snapshots and logs over cache population and invalidation, you can narrow down the point of corruption and fix cache-coherence issues. Deployed against your production servers, these solutions can reduce that overhead significantly!

Micro Benchmarks

APMs provide us with high-level performance numbers and a general direction. They don't point to the lines of code we need to address; that's left to guesswork. If the system behaved identically when running locally, this would be fine. Unfortunately, that is rarely the case. E.g., a database query can have a significantly different impact when running in production. Based on local profiling results, you might waste your energy on the wrong optimization.

Developer observability tools can narrow down the performance overhead of a specific code snippet. This lets us follow through the web service stack and pinpoint the actual lines that take the most CPU time. We can accomplish this by adding a tictoc metric, which measures the time between the tic line and the toc line.

We can mark a block of code and get statistics about its execution time. In the common case of a specific query taking longer in production, we can quickly prove that it is the cause of the performance problem using this tool. The impact of many "small" issues like this can be significant in a large system, and can easily mean the difference between scaling and a bottleneck.
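This isn't Lightrun's implementation, but a hand-rolled equivalent makes the idea clear: record a tic before the block, a toc after it, and keep running statistics. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// A tictoc-style block timer: measures wall-clock time between two points
// and keeps running statistics, mimicking what a tictoc metric reports.
public class TicToc {
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong samples = new AtomicLong();

    public <T> T measure(Supplier<T> block) {
        long tic = System.nanoTime();
        try {
            return block.get();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - tic); // toc
            samples.incrementAndGet();
        }
    }

    public double averageMillis() {
        long n = samples.get();
        return n == 0 ? 0.0 : totalNanos.get() / (n * 1_000_000.0);
    }
}

// Usage (hypothetical repository call):
// List<Row> rows = queryTimer.measure(() -> repository.findSlowQuery());
```

The advantage of a runtime-injected tictoc over this static version is that you can place it on any line in production, without a redeploy, exactly where the APM's high-level numbers point.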

Verification and Dead Code

A common problem is underutilized resources. APMs expose some of those problems, but not all of them. When we have dead code, its impact on our bottom line can be significant.

How many times did you refactor code or stop yourself from refactoring because of a legacy mess you didn’t want to touch?

Yes, that legacy mess is referenced by your code, so you don't want to "risk it". If you do end up changing it, you need to walk on eggshells, and the entire operation can take an order of magnitude longer. This maps directly to cost, since our time is valuable and could be spent optimizing. Most of the time, it also blocks major optimizations.

But what if that block of code isn’t used by anyone in production?

What if it’s used by very few people?

That's exactly what the counter metric does. It counts the number of times a line was reached, telling us which methods matter and how frequently they're invoked. You wouldn't be as concerned about a refactor if only three people ever reach that line of code…
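A hand-rolled equivalent of the counter metric, for illustration only (a developer observability tool adds these at runtime, without code changes, and the label below is hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Line-reach counters: how often does each code path actually run in production?
public final class ReachCounter {
    private static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public static void hit(String label) {
        counters.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    public static long count(String label) {
        LongAdder c = counters.get(label);
        return c == null ? 0 : c.sum();
    }
}

// In the suspected dead code: ReachCounter.hit("LegacyBillingPath");
// If the count stays at zero after a month of production traffic,
// you can refactor (or delete) with far more confidence.
```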

Finally

I could carry on discussing these techniques, but the gist is simple: we need to "see" what's going on. As developers, we're given a task: build a product. But the tools that let us peer into production aren't as capable as our local tools, and the results we get from production can be very misleading.

As we scale production deployments, we need to use a new class of tools that exposes our code in this way. I can classify modern production with one word:

DREAD.

It's that deep, binding fear we all feel when we push a major change into production. People lose their jobs over pushing bad stuff to production. That's scary!

What do we do when facing such dread?

We keep going, but carefully. We step lightly and don’t take big risks. Is our code wasteful?

Maybe, but the risk of bringing down production is far scarier than the benefit of shaving some expenses off the company's bill.

Developer observability is the light within this darkness. When you shine a light in the dark, you take away some of the fear and make production more approachable. We can measure, test, and move fast. We also get a better sense of the risks of upcoming changes. The tooling gives us a sense of the upside, too: how much can we save? Imagine saving the cost of your entire department in cloud expenses. That's job security right there… and the best way to fight that fear of risky changes.
