
Limiting the Impact: Exploring Blast Radius Management in Software Systems


What constitutes a blast radius?

Most, if not all, of us are familiar with the Chernobyl nuclear plant disaster of April 26, 1986, widely regarded as the worst in human history. Its aftermath, in both cost and casualties, is still evident and palpable today. The tragedy unfolded during a safety test that simulated a power outage, intended to verify that the reactor's coolant pumps could keep running until backup power came online. With proper planning and analysis, the impact could have been mitigated. The term blast radius in software delivery, which measures how far the impact of a change spreads, draws its imagery from disasters like this one.

While we are not dealing with a scenario that could lead to such extensive casualties, exercising due diligence is imperative to minimize adverse impacts on any production system whenever changes are implemented. 

In the tech world, metaphorically speaking, the severity of impact is gauged as the blast radius. Reliability stands out as a crucial key performance indicator (KPI) for both systems and tech teams. Therefore, it is essential to be well-acquainted with the tools and processes to limit the blast radius in software delivery.

Salesforce, like many other platforms and microservice architectures, offers features and patterns that, when well understood, can serve multiple purposes at once. Even simple features can become an effective strategy for disaster control, limiting the blast radius under adverse conditions.

It’s important to note that the term “blast radius” is utilized to measure the magnitude of disasters in various aspects of the software lifecycle, including microservices, security, reliability, cloud infrastructure, deployment, and access management. However, this write-up primarily focuses on the security and reliability of applications.

The effect of a change depends on the software layer it touches. From a blast radius perspective in Salesforce, the impact layers are typically conceptualized as follows:

[Figure: Blast Radius Management impact layers]

In this blog, I will elaborate on the patterns that have proven to be both widespread and effective in constraining the blast radius within the realm of software. These patterns include:

  1. Bulkhead Pattern
  2. Circuit Breaker Pattern
  3. Service Registry Pattern

Strategies for Damage Mitigation

Bulkhead Pattern

When discussing security concerns, our immediate focus often shifts to potential hacks and unauthorized access. Indeed, implementing access control measures and robust security practices forms an effective strategy for disaster management. This damage-limiting approach is known as the bulkhead pattern, borrowing its name from naval terminology. 

Similar to a compartment in a ship designed to contain damage and allow other sections to continue functioning (albeit at reduced capacity), the bulkhead pattern in software swiftly isolates unforeseen issues. This ensures minimal impact on customers while streamlining the process of identifying and resolving problems. Ultimately, this approach creates a win-win situation for all stakeholders involved.

Some considerations for scenarios on the Salesforce platform:

a) Ensure that any newly developed features are encapsulated within permission sets or custom permissions, akin to creating bulkheads. Distribute features and access across these permission sets or custom permissions to maximize the effectiveness of this pattern (see the sketch after this list).

b) Implement a strategy to compartmentalize features into distinct, individual components that act as bulkheads. This could involve an Apex helper pattern, reusable Lightning Web Components (LWC) dedicated to specific features, or an integration framework composed of multiple components serving different purposes in sequence. Alternatively, consider a service-based integration framework to further enhance compartmentalization.
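As a minimal sketch of consideration (a), the Apex snippet below gates a new feature behind a custom permission so that it can be switched off per permission set or profile without a deployment. The class name and the permission API name New_Feature_Bulkhead are illustrative placeholders; FeatureManagement.checkPermission is the standard way to read a custom permission in Apex.

    public with sharing class FeatureGate {
        // Hypothetical custom permission API name acting as the bulkhead switch
        private static final String NEW_FEATURE_PERMISSION = 'New_Feature_Bulkhead';

        // True only for users who hold the custom permission, so the new feature
        // can be turned off for everyone else without touching the code.
        public static Boolean isNewFeatureEnabled() {
            return FeatureManagement.checkPermission(NEW_FEATURE_PERMISSION);
        }

        public static void runNewFeature() {
            if (!isNewFeatureEnabled()) {
                return; // bulkhead closed: skip the new logic, keep existing behaviour
            }
            // ... new feature logic lives here, isolated behind the permission check ...
        }
    }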

If a change deployed to production turns out to have an adverse impact on system performance or security, promptly cutting off the affected bulkhead can contain the consequences. This cut-off can be automated with try-catch logic to enable a real-time response. By doing so, the resulting damage is confined to the logic layer or, at most, to the process encapsulated within the bulkhead.
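To illustrate the automated cut-off described above, here is a hedged sketch of a try-catch bulkhead in Apex: optional work runs inside its own compartment, and any failure is logged and swallowed so the rest of the transaction keeps running. The BulkheadRunner and BulkheadWork names are illustrative, and a production version would persist the error to a logging object or platform event rather than relying on System.debug.

    public with sharing class BulkheadRunner {
        // Contract for a unit of optional work that runs inside its own compartment.
        public interface BulkheadWork {
            void execute();
        }

        // Runs the work inside a try-catch bulkhead: if it fails, the failure is
        // logged and swallowed so that the rest of the transaction keeps running.
        public static void runIsolated(BulkheadWork work) {
            try {
                work.execute();
            } catch (Exception e) {
                // Bulkhead cut-off: contain the damage to this compartment.
                System.debug(LoggingLevel.ERROR,
                    'Bulkhead tripped: ' + e.getMessage() + '\n' + e.getStackTraceString());
            }
        }
    }

The caller wraps only the new or risky logic in runIsolated, keeping the core record processing outside the compartment.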

Circuit Breaker Pattern

Consider the governor limits inherent in cloud-based, multi-tenant systems: they act as a safeguard, an automated ally that keeps applications running by regulating resource usage so that each process receives its allocated share. However, this control can also produce errors; if a process exceeds its limits, the operation is halted. When the halted process is critical and another system depends on its completion, the consequences can be financially severe.

In a real-world analogy, imagine needing to catch a flight, and your taxi breaks down on the way to the airport. The solution? Have a backup plan—get another cab and reach your destination. While not a perfect analogy, it emphasizes the importance of having backup processes or bypass mechanisms for critical operations.

This safety net is embodied in the Circuit Breaker Pattern and Retry Pattern. When employing the Circuit Breaker pattern, development teams can focus on handling dependencies’ unavailability rather than merely detecting and managing failures. For instance, if a team is developing a website page reliant on ContentMicroservice for a widget’s content, they can make the page available without the widget’s content when ContentMicroservice is unavailable.
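A generic circuit breaker keeps a small amount of state: after a configurable number of consecutive failures it opens, and callers go straight to their fallback instead of waiting on the broken dependency. The Apex sketch below is a deliberately minimal, in-memory version for illustration; it omits the cool-down and half-open states, and in a real org the breaker state would need to live somewhere durable, such as Platform Cache or a custom object, to span transactions.

    public class SimpleCircuitBreaker {
        private final Integer failureThreshold;
        private Integer consecutiveFailures = 0;

        public SimpleCircuitBreaker(Integer failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        // Open means: stop calling the dependency and go straight to the fallback.
        public Boolean isOpen() {
            return consecutiveFailures >= failureThreshold;
        }

        public void recordFailure() { consecutiveFailures++; }
        public void recordSuccess() { consecutiveFailures = 0; }
    }

In the page example above, the controller would check isOpen() before calling ContentMicroservice and simply render the page without the widget while the breaker is open.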

For the Salesforce platform, specific considerations include:

a) Implement checks against resource limits for these features, adding guard clauses that restrict resource usage. Salesforce's built-in timeouts, for example on HTTP callouts, help control resources, but other areas require explicit attention.

b) Have a catch bypass ready for these features that flags the issue and routes the process to an alternative channel or logic (a Plan B). For instance, when retrieving data from an integrated system, if a network issue or system downtime occurs, an alternate path should dictate the next steps. It may not yield exact results, but it keeps the system running and minimizes the impact. At the same time, implement a robust flagging mechanism for impacted records, giving the team a clear handle for corrective measures.

c) Build retries into running processes with a threshold, ensuring a hard cap on the number of retry attempts. A combined sketch covering these three considerations follows below.
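Putting the three considerations together, a hedged sketch for an external data fetch might look like the following. It checks remaining callout headroom through the standard Limits class, caps the number of retries, sets an explicit callout timeout, and falls back to cached data while flagging the degradation when the dependency stays unavailable. ResilientPriceFetcher, the Pricing_API named credential, and getCachedPrices are illustrative names, not real APIs.

    public with sharing class ResilientPriceFetcher {
        private static final Integer MAX_RETRIES = 3; // (c) hard cap on retry attempts

        public static Map<String, Decimal> fetchPrices(Set<String> productCodes) {
            for (Integer attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                // (a) respect governor limits: stop retrying if no callout budget remains
                if (Limits.getCallouts() >= Limits.getLimitCallouts()) {
                    break;
                }
                try {
                    HttpRequest req = new HttpRequest();
                    req.setEndpoint('callout:Pricing_API/prices'); // assumed named credential
                    req.setMethod('GET');
                    req.setTimeout(10000); // explicit timeout keeps the transaction bounded
                    HttpResponse res = new Http().send(req);
                    if (res.getStatusCode() == 200) {
                        return (Map<String, Decimal>) JSON.deserialize(
                            res.getBody(), Map<String, Decimal>.class);
                    }
                } catch (System.CalloutException e) {
                    // Network issue or downtime: fall through and try again
                }
            }
            // (b) catch bypass / Plan B: degrade gracefully and flag the impact
            System.debug(LoggingLevel.WARN,
                'Pricing integration degraded for ' + productCodes.size() + ' products');
            return getCachedPrices(productCodes);
        }

        private static Map<String, Decimal> getCachedPrices(Set<String> productCodes) {
            // Placeholder for last-known values, e.g. from Platform Cache or a custom object
            return new Map<String, Decimal>();
        }
    }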

Service Registry Pattern 

While this pattern may not directly reduce the blast radius, it significantly enhances operational reliability. The service registry acts as a repository storing information about services, including details about their instances and locations. In a microservices application, this pattern allows the application to dynamically search the repository for an available service instance, avoiding reliance on static connections. Before providing the service’s location, the registry may perform a Health Check API invocation to ensure the service’s availability.
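As a small illustration of the lookup-plus-health-check idea, the Apex sketch below keeps candidate endpoints in an in-memory map and returns the first one that answers its health check. In a real org the registry data would more likely live in a custom metadata type or custom settings, and the /health path is an assumption about the remote service.

    public with sharing class ServiceRegistry {
        // serviceName -> candidate base URLs; illustrative in-memory registry
        private final Map<String, List<String>> endpointsByService;

        public ServiceRegistry(Map<String, List<String>> endpointsByService) {
            this.endpointsByService = endpointsByService;
        }

        // Returns the first registered instance that passes its health check,
        // or null when no instance of the service is currently available.
        public String locate(String serviceName) {
            List<String> candidates = endpointsByService.get(serviceName);
            if (candidates == null) {
                return null;
            }
            for (String baseUrl : candidates) {
                if (isHealthy(baseUrl + '/health')) { // assumed health-check path
                    return baseUrl;
                }
            }
            return null;
        }

        private Boolean isHealthy(String healthUrl) {
            try {
                HttpRequest req = new HttpRequest();
                req.setEndpoint(healthUrl);
                req.setMethod('GET');
                req.setTimeout(5000);
                return new Http().send(req).getStatusCode() == 200;
            } catch (System.CalloutException e) {
                return false;
            }
        }
    }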

Conclusion

In conclusion, system reliability is paramount for business success. Whether intentional or inadvertent, any changes to the business system should not result in a system failure or disrupt business operations. The patterns discussed here represent a subset of innovative ideas aimed at maintaining system resilience. While additional ideas are encouraged, these guidelines provide a solid foundation for establishing a robust application architecture.

Mohammad Parwez Akhtar
