Real life stories
These are real-life stories: samples of the unique power Spider brings to troubleshooting distributed systems!
Many other cases happened over the years that I forgot about!
Seeing the unseen to understand the unexpected.
That's a huge strength!!
Parallel calls
Once, in 2018, in Flowbird production, one of our microservices behaved unexpectedly. We could neither explain nor reproduce the answers it was giving.
With the Spider sequence diagram, we could see which other calls were made in parallel to the same service replica, and we found that every bad answer coincided with a call to another API of the same service.
A few minutes later, we found that a parameter of the second API was declared as a global variable and was affecting the first API!
It took us less than one hour to troubleshoot a case that would have taken weeks otherwise!
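To illustrate this class of bug, here is a minimal sketch. It is not the actual service code: the two handlers, the `_current_currency` variable and the currency values are all hypothetical, chosen only to show how a parameter stored in shared state by one API can change the answers of another API on the same replica.

```python
# Hypothetical service replica with two API handlers sharing module-level state.
_current_currency = "EUR"  # global: shared by every request handled by this replica


def get_price(amount_cents: int) -> str:
    """First API: formats a price and silently depends on the global."""
    return f"{amount_cents / 100:.2f} {_current_currency}"


def convert_buggy(amount_cents: int, currency: str) -> str:
    """Second API: stores its own parameter in the global by mistake."""
    global _current_currency
    _current_currency = currency  # bug: leaks into every later get_price call
    return f"{amount_cents / 100:.2f} {currency}"


def convert_fixed(amount_cents: int, currency: str) -> str:
    """Second API, fixed: the parameter stays local to the call."""
    return f"{amount_cents / 100:.2f} {currency}"
```

With the buggy handler, `get_price` answers correctly until any caller hits `convert_buggy` with another currency, which is why the bad answers only show up when both APIs are called close together on the same replica.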
Time synchronisation delay
In 2021, in Flowbird production, we noticed that a random number of calls were answered with 403 Not authorized responses, while the calls just before or after, made with the same credentials, succeeded.
The analysis had been in progress for weeks without success, causing much annoyance to customers. Spider was switched off at that time.
Out of options, we reinstalled Spider and found the culprit less than 1 hour after installation!
- The IAM solution was generating a JWT with the `nbf` field: not before.
- This field contains a date (resolution in seconds) before which the token is invalid.
- The issue was that the clock of the IAM server was drifting faster than those of the other servers.
- Even with the `ntp` time synchronization updates, the IAM server ended up, every hour, 1 or 2 ms ahead of the application server.
- And tokens that were generated close to the start of a second could be received by the application server while this one was still in the previous second! Thus making the token invalid!
Spider helped us see quickly that those rejected calls had a token not valid at the time of the capture!
We would never have found it without it!
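The mechanism above can be reproduced with a small sketch. The function names, the clock values and the one-second leeway are assumptions made for illustration; they are not the IAM product's code. The sketch assumes the issuer truncates its clock to whole seconds to build `nbf`, and that the verifier compares `nbf` with its own clock. RFC 7519 allows implementers to apply a small leeway to account for clock skew, which is what the third parameter models.

```python
import math


def issue_nbf(issuer_clock: float) -> int:
    """Build the nbf claim: assumed here to be the issuer clock truncated to seconds."""
    return math.floor(issuer_clock)


def is_token_active(nbf: int, verifier_clock: float, leeway: float = 0.0) -> bool:
    """A token is usable once the verifier clock (plus leeway) has reached nbf."""
    return verifier_clock + leeway >= nbf


# Illustrative timeline, in seconds. The issuer clock runs 2 ms ahead.
true_time_at_issue = 999.9990
issuer_clock = true_time_at_issue + 0.002      # reads 1000.001 on the IAM server
nbf = issue_nbf(issuer_clock)                  # 1000

# The application server, with a correct clock, receives the token 0.5 ms later.
verifier_clock = true_time_at_issue + 0.0005   # still in second 999

rejected_without_leeway = not is_token_active(nbf, verifier_clock)
accepted_with_leeway = is_token_active(nbf, verifier_clock, leeway=1.0)
```

In this timeline the token is rejected with a strict check and accepted with one second of leeway. Whether to fix the clock drift, add leeway on the verifier, or both, depends on the systems involved.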
NGINX forward auth cache issue
In 2022, on the Flowbird test platform, we found that some requests were made by a process using a token that this process could not have known about!
Using Spider we found out that:
- The token had been served by NGINX cache based on the result of a previous request
- We found the previous request and noticed that it was calling NGINX with two different authentications: a certificate and a token, each for a different user!
- So there was a bug in the code (to fix)
- But also, NGINX was serving from its cache the result of a request made with 2 authentications to a new request made with only 1
- So there was a bug in NGINX configuration (or NGINX itself)
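One way such a mix-up can happen is a cache key that ignores part of the credentials. The sketch below is a simulation only: it does not use NGINX, and the `auth_backend` behavior, the class and the key functions are hypothetical. We do not claim this is the exact cause in NGINX; it only shows why a forward-auth cache should be keyed on every credential that influences the result.

```python
from typing import Callable, Dict, Hashable, Optional


def auth_backend(cert: Optional[str], token: Optional[str]) -> str:
    """Hypothetical auth service: resolves the identity, token first."""
    if token is not None:
        return f"user-of-{token}"
    return f"user-of-{cert}"


class ForwardAuthCache:
    """Caches auth results under a key computed by key_fn."""

    def __init__(self, key_fn: Callable[[Optional[str], Optional[str]], Hashable]) -> None:
        self._key_fn = key_fn
        self._store: Dict[Hashable, str] = {}

    def resolve(self, cert: Optional[str], token: Optional[str]) -> str:
        key = self._key_fn(cert, token)
        if key not in self._store:
            self._store[key] = auth_backend(cert, token)
        return self._store[key]


def key_cert_only(cert: Optional[str], token: Optional[str]) -> Hashable:
    """Leaky key: ignores the token, so results are shared across token users."""
    return cert


def key_all_credentials(cert: Optional[str], token: Optional[str]) -> Hashable:
    """Safer key: every credential that affects the result is part of the key."""
    return (cert, token)
```

With `key_cert_only`, a request carrying both `certA` and `tokenB` populates the cache, and a later request carrying only `certA` receives the identity resolved for `tokenB`. With `key_all_credentials`, the second request is resolved on its own.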
Production load duplication for a successful infrastructure migration
In 2023, we migrated our old Swarm cluster to Kubernetes.
To ensure a smooth migration, we decided to test the new cluster before switching the traffic to it.
Instead of running the regression test suite, we developed a script to replay the requests of the current production environment on the new cluster.
And it did so with only a few seconds of delay! Spider did the job extremely well!
More than a plain duplication, the script changed the data on the fly, adjusting the URLs used in the content (JSON-LD) to match the new cluster.
The cherry on the cake?
We used Spider on the new cluster to check that the requests were successfully replayed there!
Later on, when progressively shifting the load to the new cluster, we used Spider to monitor that everything was working as expected, ready to switch back to the old cluster at the first sign of a problem.
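The on-the-fly URL adjustment in the replayed content could look like the sketch below. It is not the script we used: the function names and the `old.example.com` / `new.example.com` hosts are placeholders, and it assumes the request bodies are JSON documents in which URLs appear as plain string values.

```python
import json
from typing import Any


def rewrite_urls(node: Any, old_base: str, new_base: str) -> Any:
    """Recursively replace old_base with new_base in every string value."""
    if isinstance(node, dict):
        return {key: rewrite_urls(value, old_base, new_base) for key, value in node.items()}
    if isinstance(node, list):
        return [rewrite_urls(value, old_base, new_base) for value in node]
    if isinstance(node, str) and node.startswith(old_base):
        return new_base + node[len(old_base):]
    return node  # numbers, booleans, null and unrelated strings are left untouched


def adapt_body(body: str, old_base: str, new_base: str) -> str:
    """Parse a captured JSON body, rewrite its URLs, and serialize it for replay."""
    return json.dumps(rewrite_urls(json.loads(body), old_base, new_base))


captured = json.dumps({
    "@id": "https://old.example.com/items/1",
    "links": ["https://old.example.com/a", "https://other.example.org/b"],
})
replayed = adapt_body(captured, "https://old.example.com", "https://new.example.com")
```

Only values that start with the old base URL are rewritten, so links to third-party hosts pass through unchanged.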
Reverse engineering of a deprecated 3rd party system
In 2024, one of our providers decided to deprecate their SaaS IoT communication system and shut down its operations.
However, we had thousands of devices connected to their system.
Changing these devices to use our new communication system would be a huge effort and would require a lot of time and resources.
We decided to use Spider to reverse-engineer the communication protocol of the deprecated system and to create a new communication system that would mimic its APIs and behavior with the devices.
Using the Plugin system of Spider, we developed a plugin to decode the specific MQTT variant used by the agents on the devices, and we added a man-in-the-middle setup to capture the communications between the devices and the server.
Spider helped in capturing the communications, understanding the protocol, and building test cases to validate the clone.
When we deployed the clone, we used Spider to monitor the communications and to track any change in the behavior of the devices.
This was a huge success!
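As an idea of what the first step of such a decoder involves, here is a sketch that parses the fixed header of a standard MQTT packet: the packet type and flags in the first byte, followed by the variable-length "remaining length" field. It covers standard MQTT only. The provider's variant is not described here, and this is neither the Spider plugin API nor our plugin's code; the function name is hypothetical.

```python
from typing import Tuple

# Subset of the standard MQTT control packet types, for readable output.
PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK",
    8: "SUBSCRIBE", 9: "SUBACK", 12: "PINGREQ", 13: "PINGRESP", 14: "DISCONNECT",
}


def decode_fixed_header(data: bytes) -> Tuple[str, int, int, int]:
    """Return (packet type name, flags, remaining length, header size in bytes)."""
    if len(data) < 2:
        raise ValueError("an MQTT fixed header needs at least 2 bytes")
    packet_type = data[0] >> 4   # high nibble: control packet type
    flags = data[0] & 0x0F       # low nibble: type-specific flags
    remaining = 0
    multiplier = 1
    # Remaining length: 1 to 4 bytes, 7 bits of value each, top bit = "more follows".
    for index in range(1, 5):
        if index >= len(data):
            raise ValueError("truncated remaining length")
        byte = data[index]
        remaining += (byte & 0x7F) * multiplier
        if byte & 0x80 == 0:
            name = PACKET_TYPES.get(packet_type, f"UNKNOWN({packet_type})")
            return name, flags, remaining, index + 1
        multiplier *= 128
    raise ValueError("malformed remaining length (more than 4 bytes)")
```

For example, `decode_fixed_header(bytes([0x32, 0xC1, 0x02]))` describes a PUBLISH packet with flags `2` and a remaining length of 321 bytes, encoded over a 3-byte header.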
Understanding the unique behavior of a customer's system
On one of our systems, a customer was complaining of slow responses, when none of the others were.
We factually proved with Spider that the system was answering all their requests in less than 500ms, when they were measuring 15s!
We even traced the exact same requests that were slow on their side!
And then we discovered that for each of their slow requests, they were calling us not once, but tens or hundreds of times in a row! They had a loop in their code, and an infrastructure misconfiguration on their side was causing many retries. Their monitoring was measuring the time between the first call and the final answer!
We were able to demonstrate that our system was answering all their requests in due time, and that the issue was not in our code.
They fixed the issue and everything was back to normal.
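The gap between the two measurements is easy to reproduce from captured call timestamps. The sketch below uses made-up numbers (30 retries, 0.4 s each, one every 0.5 s) and hypothetical names; it only shows how per-call latency and the duration seen by a client that retries in a loop can differ by an order of magnitude.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    start: float  # seconds: when the request was captured
    end: float    # seconds: when the response was captured


def server_latencies(calls: List[Call]) -> List[float]:
    """Latency of each individual call, as seen on the server side."""
    return [call.end - call.start for call in calls]


def client_perceived_duration(calls: List[Call]) -> float:
    """Time from the first request to the last response of the retry loop."""
    return max(call.end for call in calls) - min(call.start for call in calls)


# Illustrative data: 30 calls in a row, each answered in 0.4 s.
calls = [Call(start=i * 0.5, end=i * 0.5 + 0.4) for i in range(30)]
slowest_call = max(server_latencies(calls))
perceived = client_perceived_duration(calls)
```

Here every call is answered in about 0.4 s, yet the client-side measurement is close to 15 s, because it spans the whole series of retries.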
Managing security attacks on the fly
Several times in 2025, a customer noticed unexpected usage patterns in production.
Using Spider, we were able to trace the calls at the root of the issue and to isolate the attackers!
We blocked them, determined exactly what they were doing, and confirmed that they had not accessed critical data.