
Limiting the Impact: Exploring Blast Radius Management in Software Systems

What constitutes a blast radius?

Most of us are familiar with the Chernobyl nuclear plant disaster of April 26, 1986, widely regarded as the worst nuclear accident in history. Its aftermath, in both cost and casualties, is still evident and palpable today. The tragedy unfolded during a safety test meant to verify that the plant's coasting turbines could power the coolant pumps during a shutdown. With proper planning and analysis, the impact could have been mitigated. The term "blast radius," denoting the area affected by an explosion, became associated with measuring the scope of such disasters.

While we are not dealing with a scenario that could lead to such extensive casualties, exercising due diligence is imperative to minimize adverse impacts on any production system whenever changes are implemented. 

In the tech world, the blast radius is a metaphor for the severity and reach of a failure's impact. Reliability stands out as a crucial key performance indicator (KPI) for both systems and tech teams, so it is essential to be well acquainted with the tools and processes that limit the blast radius.

Salesforce, like many other platforms and microservice ecosystems, offers features and patterns that, when well understood, can serve multiple purposes at once. Even simple features can be an excellent strategy for disaster control, helping to contain the blast radius under adverse conditions.

It’s important to note that the term “blast radius” is utilized to measure the magnitude of disasters in various aspects of the software lifecycle, including microservices, security, reliability, cloud infrastructure, deployment, and access management. However, this write-up primarily focuses on the security and reliability of applications.

The impact of a failure depends on the software layer in which it occurs. From a blast radius perspective in Salesforce, the impact layers are typically conceptualized as follows:

[Image: Blast Radius Management impact layers]

In this blog, I will elaborate on the patterns that have proven to be both widespread and effective in constraining the blast radius within the realm of software. These patterns include:

  1. Bulkhead Pattern
  2. Circuit Breaker Pattern
  3. Service Registry Pattern

Strategies for Damage Mitigation

Bulkhead Pattern

When discussing security concerns, our immediate focus often shifts to potential hacks and unauthorized access. Indeed, implementing access control measures and robust security practices forms an effective strategy for disaster management. This damage-limiting approach is known as the bulkhead pattern, borrowing its name from naval terminology. 

Similar to a compartment in a ship designed to contain damage and enable other sections to continue functioning (albeit at reduced capacity), the bulkhead pattern in software swiftly isolates unforeseen issues. This ensures minimal impact on customers while streamlining the process of identifying and resolving problems. Ultimately, this approach creates a win-win situation for all stakeholders involved.

Some considerations for scenarios on the Salesforce platform:

a) Ensure that any newly developed features are encapsulated within permission sets or custom permissions, akin to creating bulkheads. Distribute features and accesses across these permission sets or custom permissions to maximize the effectiveness of this pattern.

b) Implement a strategy to compartmentalize features into distinct individual components, acting as bulkheads. This could involve employing an Apex helper pattern, creating reusable Lightning Web Components (LWC) dedicated to specific features, or establishing an integration framework comprised of multiple components serving different purposes sequentially. Alternatively, consider a service-based integration framework to further enhance compartmentalization.

In the event that a change deployed to production poses an adverse impact on system performance or security, promptly cutting off the affected bulkhead can contain the consequences. This cut-off can be automated with try-catch blocks to enable real-time responsiveness. The resulting damage is then confined to the logic layer or, at most, to the process encapsulated within the bulkhead.
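To make the idea concrete, here is a minimal sketch of the bulkhead cut-off in Java (Apex syntax is very similar); `BulkheadRunner` and the feature names are illustrative, not part of any Salesforce API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Runs independent feature "bulkheads"; a failure in one is contained
 *  and flagged instead of sinking the whole request. */
class BulkheadRunner {
    private final Map<String, Supplier<String>> features = new LinkedHashMap<>();
    private final List<String> failedFeatures = new ArrayList<>();

    void register(String name, Supplier<String> feature) {
        features.put(name, feature);
    }

    /** Executes every registered feature; exceptions are caught per bulkhead. */
    Map<String, String> runAll() {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<String>> e : features.entrySet()) {
            try {
                results.put(e.getKey(), e.getValue().get());
            } catch (RuntimeException ex) {
                // Bulkhead cut-off: flag the failure and keep serving the rest.
                failedFeatures.add(e.getKey());
            }
        }
        return results;
    }

    List<String> getFailedFeatures() { return failedFeatures; }
}
```

Because each feature runs inside its own try-catch, a defect in one component degrades only that component while the others keep serving users.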

Circuit Breaker Pattern

Consider governor limits a safeguard, an automated ally that keeps the application functioning. Inherent in all cloud-based, multi-tenant systems, these limits regulate resource usage so that each process receives its allocated share for seamless operation. However, this control raises errors when a process exceeds its limits, halting the operation. If the halted process is critical and another system depends on its completion, the consequences can be financially severe.

In a real-world analogy, imagine needing to catch a flight, and your taxi breaks down on the way to the airport. The solution? Have a backup plan—get another cab and reach your destination. While not a perfect analogy, it emphasizes the importance of having backup processes or bypass mechanisms for critical operations.

This safety net is embodied in the Circuit Breaker pattern and the Retry pattern. With a circuit breaker in place, development teams can focus on handling a dependency's unavailability rather than merely detecting and managing failures. For instance, if a team is building a web page that relies on ContentMicroservice for a widget's content, the page can still be served without that widget whenever ContentMicroservice is unavailable.
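The ContentMicroservice scenario can be sketched as a minimal circuit breaker in Java (Apex is syntactically similar); `CircuitBreaker`, the threshold, and the cooldown are illustrative choices, not a platform API:

```java
import java.util.function.Supplier;

/** Minimal circuit breaker: after `threshold` consecutive failures the
 *  circuit opens and calls are short-circuited to a fallback until
 *  `cooldownMillis` has elapsed. */
class CircuitBreaker<T> {
    private final int threshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int threshold, long cooldownMillis) {
        this.threshold = threshold;
        this.cooldownMillis = cooldownMillis;
    }

    T call(Supplier<T> service, T fallback) {
        boolean open = consecutiveFailures >= threshold
                && System.currentTimeMillis() - openedAt < cooldownMillis;
        if (open) {
            return fallback;            // short-circuit: don't hit the dependency
        }
        try {
            T result = service.get();
            consecutiveFailures = 0;    // success closes the circuit
            return result;
        } catch (RuntimeException ex) {
            consecutiveFailures++;
            if (consecutiveFailures >= threshold) {
                openedAt = System.currentTimeMillis();
            }
            return fallback;            // degrade gracefully instead of failing
        }
    }
}
```

While the circuit is open, the page is served with the fallback (say, an empty widget) and the struggling dependency gets breathing room to recover.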

For the Salesforce platform, specific considerations include:

a) Implement checks for resource limits on these features, incorporating clauses that restrict resource usage. Salesforce's built-in timeout limits in many libraries help control resources, and the Apex Limits class exposes current governor limit consumption, but additional areas require attention.

b) Have a catch bypass ready for these features, flagging the issue and routing the process to an alternative channel or logic (Plan B). For instance, when attempting to retrieve data from an integrated system, if a network issue or system downtime occurs, an alternate path should dictate the next steps. While it may not yield exact results, it keeps the system running by minimizing the impact. Simultaneously, implement a robust flagging mechanism for impacted records, providing a clear handle for corrective measures.

c) Consider retries in running processes with a threshold limit, ensuring a cap on the number of retry attempts.
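The capped retry in point (c) might look like this minimal Java sketch (Apex is syntactically similar); `RetryHelper` and its signature are illustrative:

```java
import java.util.function.Supplier;

/** Retries an operation up to maxAttempts times; rethrows the last
 *  failure once the cap is reached so callers can flag the record. */
class RetryHelper {
    static <T> T withRetry(Supplier<T> op, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException ex) {
                last = ex;  // remember the failure; the loop caps the retries
            }
        }
        throw last;         // cap reached: surface the error for flagging
    }
}
```

The hard cap prevents an unbounded retry loop from burning through governor limits while still riding out transient failures.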

Service Registry Pattern 

While this pattern may not directly reduce the blast radius, it significantly enhances operational reliability. The service registry acts as a repository storing information about services, including details about their instances and locations. In a microservices application, this pattern allows the application to dynamically search the repository for an available service instance, avoiding reliance on static connections. Before providing the service’s location, the registry may perform a Health Check API invocation to ensure the service’s availability.
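A minimal Java sketch of the idea, with the health check injected as a predicate (in production this would be an HTTP ping against a Health Check API); `ServiceRegistry` and the endpoints are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Service registry: instances register their location; lookups run a
 *  health check before handing out an endpoint. */
class ServiceRegistry {
    private final Map<String, List<String>> instances = new HashMap<>();
    private final Predicate<String> healthCheck;

    ServiceRegistry(Predicate<String> healthCheck) {
        this.healthCheck = healthCheck;   // e.g. an HTTP ping in production
    }

    void register(String service, String endpoint) {
        instances.computeIfAbsent(service, k -> new ArrayList<>()).add(endpoint);
    }

    /** Returns the first registered instance that passes the health check. */
    Optional<String> lookup(String service) {
        return instances.getOrDefault(service, List.of()).stream()
                .filter(healthCheck)
                .findFirst();
    }
}
```

Because callers ask the registry at request time instead of hard-coding an endpoint, an unhealthy instance is skipped automatically and never widens the blast radius.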

Conclusion

System reliability is paramount for business success. Whether intentional or inadvertent, no change to a business system should result in a failure that disrupts business operations. The patterns discussed here are a subset of the ideas available for maintaining system resilience; while additional approaches are encouraged, these guidelines provide a solid foundation for a robust application architecture.


Mohammad Parwez Akhtar
