Troubleshoot#1: Correlation vs Causation
Network problems can be as simple as follows:
- Ping neighbor.
- 100% packet loss.
- Check network interface, it's down.
- Change SFP and you are done, problem solved :)
But they can also be complex, really complex. In the past few years, I have had the unplanned privilege of seeing some of the shi**iest networking problems on earth. Not because I am an unlucky bastard, but simply because I am in a position where, if a problem reaches me, it's fairly complex.
Dealing with such scenarios has taught me some good lessons that are not necessarily limited to networking or IT. I just thought it might be a good idea to share some of them with you guys.
Illusion of causality:
Recently, I have been inspecting a problem (I will try to stay very vague here) in which an undesired event happens after performing some normal actions on some routers ... sometimes. The network is fairly big, and the problem shows up sporadically after normal day-to-day operations on some of the most loaded routers.
When dealing with such issues, you usually get no definite answers from anyone about possible triggers, and you are left with only tons of logs and your lab to find out what could be triggering the unwanted behavior.
The impacted routers are highly loaded, resulting in tons of events and errors on each of them. After a few hours of inspecting logs, I found some entries that appear very close in time to the problem and are closely related in nature to the problem we are talking about.
These logs seemed like a fair pointer to what could be underlying the problem, especially since they appear almost every time before the problem is seen.
However, when I started digging deeper, I found that these events are not the cause of the problem as I first thought. They are just highly correlated events generated by some other faulty conditions. They could still be a contributing factor, but they are not the main cause.
With that vague story, I wanted to bring this point home: mixing up correlation with causation is one of the human cognitive biases that we all fall for in different areas of life. It might not always be obvious at the beginning, but taking no precautions against this bias leads to serious mistakes or errors in judgement.
As humans, we tend to think that when two events happen one after the other, one of them is probably causing the other. That is not always the case; it could be just a correlation.
In troubleshooting complex problems in networks or in life, it's important to be aware of this cognitive bias to protect yourself against making big errors in judgement.
After all, we will not eliminate such errors completely, but we can probably put some protective measures in place that lead to better decisions whenever possible. Here is what I think could help when dealing with complex or persistent/chronic problems in your network.
- Adopt a scientific mentality and don't jump to conclusions. Two events following one another does not necessarily mean they have a cause and effect relationship; it might just be a correlation. Keep this in mind.
- Whenever possible, perform a thorough analysis of the logs, actions and events encountered. Look at different sets of data, preferably using specialized tools and scripts to avoid taking forever in front of a black screen.
- Inspect wider time frames of data. As humans, we also tend to look for events that are close to the problem and see them as causes. That usually works, but trust me, it's not always the case: some triggers start way further back in time than the actual problem itself.
- Finally, validate your assumptions. Try to verify and test them in the lab as much as possible. If you think that X is causing Y, test that assumption either by removing X and seeing if Y still occurs, or by checking whether you can see X without seeing Y in return.
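The last check in the list above (can you see X without Y, or Y without X?) is easy to script once you have pulled timestamps out of the logs. Here is a minimal Python sketch of the idea. The function name, the 5-minute window and all the timestamps are made up for illustration; they are not from the real case, and you would need your own parsing to get the timestamps out of your logs.

```python
from datetime import datetime, timedelta


def cooccurrence(x_times, y_times, window):
    """Count how often suspect event X lines up with problem Y.

    X "precedes" Y when it happens at most `window` before that Y.
    """
    def linked(x, y):
        return timedelta(0) <= y - x <= window

    y_with_x = sum(any(linked(x, y) for x in x_times) for y in y_times)
    x_with_y = sum(any(linked(x, y) for y in y_times) for x in x_times)
    return {
        "y_total": len(y_times),
        "y_preceded_by_x": y_with_x,
        "y_without_x": len(y_times) - y_with_x,
        "x_total": len(x_times),
        "x_followed_by_y": x_with_y,
        "x_without_y": len(x_times) - x_with_y,
    }


# Invented sample data: the suspect log line (X) shows up every hour,
# while the actual problem (Y) was seen only twice.
base = datetime(2024, 1, 1, 0, 0)
x_times = [base + timedelta(minutes=m) for m in (0, 60, 120, 180, 240)]
y_times = [base + timedelta(minutes=m) for m in (2, 122)]

stats = cooccurrence(x_times, y_times, window=timedelta(minutes=5))
for name, value in stats.items():
    print(f"{name}: {value}")
```

With data shaped like this, X shows up before every Y, which is exactly what makes it look like the cause. The x_without_y count tells the other half of the story: if X happens many times with no problem following, X alone is probably not the trigger. Widening the window is a cheap way to apply the "inspect wider time frames" advice as well.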
This was about network troubleshooting, but it is not limited to it. If you've got a minute, reflect on that.
For more information, visit the Wikipedia page: Correlation does not imply causation.