Anyone who has spent time in a large enterprise knows that there are lots of intractable problems. Academics (and consultants and business schools) often claim they can get to the “root” of a problem and give you a “surgical” solution that will solve the problem and keep everything else the same. Ironically, the magical thinking that a root cause can be found and expertly fixed might itself be the root cause of management and other fads.
There are two types of magical thinking going on here. The first is that a complex systems problem has a “root” cause. The second (and more subtle one) is that there exists a “superhero” who can find and fix it and make everything perfect again. At this point, you might be asking: if there’s no “root cause”, what exactly are we supposed to “solve” or fix?
The insight I want to share is that in any complex system that has evolved over time, there are always several layers of complexity. Each layer of complexity is an attempt to cover or hedge against some risks in another layer, and each introduces its own uncertainties and risks. We see this layering pretty much everywhere. Individual cells band together to form larger organisms, and the individual cells shed some capabilities and take on specialized responsibilities. Groups of people form a tribe or village. Each individual is no longer independent and often takes on a specialized profession. This is the “fractal” view of the domain.
In all these (and countless other) examples, there are layers of complexity where each layer does something specialized for the system as a whole while sacrificing some other trait. The trick, then, is to see how the “problem”, not just the symptom, manifests in multiple layers. And the solution is to add additional layers that transform a problem you find difficult to solve into something you can potentially reason about and deal with.
Let us see this in action in the case of slow performance in your database system. “Root” cause thinking will tell you to find the slowest query and tune it by adding an index, rewriting it so that it uses the right index, or using whatever other magical incantations your DBAs give you. It might even solve the most recent “problem”, but it has only solved the symptom.
Let us look at the same problem fractally. This “problem” is always there, and the tuning exercise has only temporarily solved it. We need to look at all the layers and identify the additional layers we need to add. Doing so might reveal that your application has poorly written querying patterns. You can’t rewrite so much code, but you could wrap the worst parts in a caching layer to reduce the frequency of bad queries hitting the database. You might also add some profiling in various layers of your system to collect response times, throughput, and other such measurements to help you identify and proactively deal with such issues in the future. You might do some cross-training between your dev and ops teams so they can identify performance issues and learn how to make each other's lives easier the next time one strikes. You might also want to talk to your users to understand how often this will happen and potentially think of alternative solutions in such scenarios. This solution doesn’t seem very clean or elegant because it probably mirrors how real-life systems evolve.
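To make the caching and profiling layers a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `slow_query` is a hypothetical stand-in for one of those poorly written queries, `get_customer` is a made-up application function, and the 30-second time-to-live is an arbitrary choice. It is one possible shape for such layers, not a recommendation for any particular database or framework.

```python
import time
from collections import defaultdict
from functools import wraps

# Counts how often the (pretend) database is actually hit.
CALLS = {"db": 0}

# Response times in seconds, collected per named operation.
TIMINGS = defaultdict(list)


def slow_query(customer_id):
    """Hypothetical stand-in for an expensive query we cannot rewrite."""
    CALLS["db"] += 1
    time.sleep(0.01)  # simulate database latency
    return {"customer_id": customer_id, "orders": customer_id * 2}


def profiled(name):
    """Profiling layer: record how long each call takes, cached or not."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


def cached(ttl_seconds, clock=time.monotonic):
    """Caching layer: reuse a result until it is older than ttl_seconds."""
    def decorator(fn):
        store = {}  # maps call arguments to (timestamp, value)

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator


@profiled("get_customer")
@cached(ttl_seconds=30)
def get_customer(customer_id):
    # The application code is untouched; the layers wrap around it.
    return slow_query(customer_id)
```

Calling `get_customer(7)` twice in a row reaches `slow_query` only once, while `TIMINGS["get_customer"]` gains an entry for both calls. Note that the cache introduces its own risk (stale data for up to 30 seconds), which is exactly the kind of trade-off each new layer brings.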
The key takeaway I would like you to have is that any given problem you hear about is most probably a symptom. Finding the root cause will likely give you a way to alleviate the symptom, but solving the root cause alone is not sufficient. See the layers around the symptoms and how they need to be layered further to transform the problem into something more manageable. Also, have the humility to recognize that you have only found a temporary solution to the problem.