Enhanced monitoring for parsing status

· 2 min read

When playing with chaos testing, I noticed that I had no metric telling me whether the parsing speed was right, or close to the limit. I knew when parsing was failing, but not when it was about to fail.

I then designed and added new metrics for parsing speed:

  • Delay before parsing
  • Duration of parsing
  • Speed of parsing

The first KPI indicates whether the parsing 'power' is sufficient, as the delay must stay between 10s (the configured delay before parsing) and 45s (the TTL of packets in Redis).

The other KPIs indicate the speed of the parsers under the current load, and will allow comparing performance improvements.
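
To give an idea of what these KPIs measure, here is a rough sketch of how they could be computed for one polled page of Tcp sessions. The field and metric names are illustrative, not Spider's actual code:

```ts
// Illustrative sketch only: the interface, field names and emitMetric helper
// are hypothetical, not Spider's real implementation.

interface ParsedPage {
  oldestPacketCapturedAt: number; // epoch ms of the oldest packet in the page
  parsingStartedAt: number;       // epoch ms when a parser picked the page up
  parsingEndedAt: number;         // epoch ms when parsing of the page finished
  parsedBytes: number;            // total payload size parsed, in bytes
}

function emitMetric(name: string, value: number): void {
  console.log(`${name}=${value.toFixed(2)}`);
}

function reportParsingKpis(page: ParsedPage): void {
  // Delay before parsing: should stay between 10s (configured delay) and 45s (Redis TTL).
  const delayMs = page.parsingStartedAt - page.oldestPacketCapturedAt;

  // Duration of parsing for this page.
  const durationMs = page.parsingEndedAt - page.parsingStartedAt;

  // Speed of parsing, in bytes per second.
  const speedBps = durationMs > 0 ? (page.parsedBytes / durationMs) * 1000 : 0;

  emitMetric('parsing.delay_ms', delayMs);
  emitMetric('parsing.duration_ms', durationMs);
  emitMetric('parsing.speed_bytes_per_s', speedBps);
}
```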

In the main dashboard

As a new parsing page

I regrouped the previous parsing KPIs together:

  • Tcp to parse in queue - to check it is not increasing
  • Tcp parsing status - to check the quality of parsing
  • Maximum parsing delay - to check it stays way below 45s
  • Parsing duration of a polled page of Tcp sessions (max 20) - to check speed
  • Number of communications created from the parsing - to check we indeed created something :)

All in all...  1 day of work :)

Avoiding duplicates

· One min read

When capturing both sides of the same communication - for instance, when capturing from both the gateway and the service itself - Spider captures the same communication twice, with slightly different dates.

It is now possible to ask Spider to avoid duplicates.

Avoiding duplicated communications

With this option, Spider will generate the same id for the object on both sides of the communication, and only one will then be saved (and parsed).
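
The idea, as a minimal sketch (the endpoint fields and hashing below are assumptions, not Spider's actual code): build the id from a direction-independent key of the Tcp session, so that both capture points compute the same value.

```ts
import { createHash } from 'crypto';

// Hypothetical sketch: both capture sides derive the same id by sorting the
// two endpoints before hashing, so the capture direction does not matter.
// The real key in Spider probably includes more context (e.g. a capture
// window) to separate reused ip/port pairs; this shows only the principle.

interface TcpEndpoint {
  ip: string;
  port: number;
}

function sharedSessionId(a: TcpEndpoint, b: TcpEndpoint): string {
  const key = [`${a.ip}:${a.port}`, `${b.ip}:${b.port}`].sort().join('|');
  return createHash('sha1').update(key).digest('hex');
}

// The gateway and the service see the same session from opposite directions,
// yet compute the same id:
const fromGateway = sharedSessionId({ ip: '10.0.0.5', port: 44210 }, { ip: '10.0.0.9', port: 443 });
const fromService = sharedSessionId({ ip: '10.0.0.9', port: 443 }, { ip: '10.0.0.5', port: 44210 });
console.log(fromGateway === fromService); // true
```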

For this, select 'Avoid duplicated communications' on the Capture Config tab.

Then only one Tcp session will be created, and thus only one copy of the Http communications.

Avoiding duplicated packets

You may also choose to avoid duplicated packets, in the advanced options of the 'Packets saving' part of the Parsing Config tab. The option is visible only when saving packets.

Note that this asks more resources of the system, and should only be considered when doing statistics at the packet level (which is not often).

Changelog since May 2021

· One min read

It's been a while since I last wrote here.

Spider is progressing, but I spent much of my Spider time doing administrative and legal work. Its official public release is approaching :)

I nevertheless did some stuff:

  • Upgraded all services and UIs to Node 16 in August and September, with an upgrade of all libraries
  • Improved the UI so that it checks for a new version every time it receives focus, with an integrated changelog of UI versions displayed in the details panel by rendering the service's CHANGELOG.md file. You might have seen it already.
  • Improved teams configuration to allow copying a team's settings to the user's, in order to troubleshoot and improve them (the opposite already existed)
  • Import/export of the Whisperer configuration (decoding and parsing) to/from a file. This would have proven useful before, so it will again!

And I've spent some time solving my 'last' parsing issues, to support long communications and to optimise the parsing once more.

That's for the next post! :)

My 1st customer satisfaction survey

· One min read

In October last year, I performed my first customer satisfaction survey... and the results are great!

I read some advice on websites and took some templates as examples. The first version was too long, with too many choices and questions. I reduced it to get more feedback :)

Thanks to all of you that participated!

The summary, in pictures:

Contact me if you'd like to know more!

Parsing engine rework!

· 4 min read

Existing issues

The existing parsing engine of Spider had two major issues:

  • Tcp session resources included the list of packets they were built from. This limited the number of packets a Tcp session could hold, because the resource kept growing; long persistent Tcp sessions were causing issues and thus had to be capped in packet count.
  • Http parsing logs also included the list of packets and of the HTTP communications found.

I studied how to remove these limitations, and how to improve the parsing speed and its footprint at the same time. While keeping the same quality, of course!

And I managed :) !!

I had to change some of the level 1 architecture decisions I took at the beginning, and it had impacts on the Whisperers code and on 7 other microservices, but it seemed sound and the right decision!

Work

4 weeks later, it is all done, fully regression tested and deployed on Streetsmart! And the result is AWESOME :-)

Spider now parses Tcp sessions in streaming, with a minimal footprint and reduced CPU usage on the servers for the same load! :)
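
To illustrate what 'streaming' means here (a simplified sketch, not the actual engine): the parser consumes packets page by page and only carries a small state between pages, so the Tcp session resource no longer has to embed its full packet list.

```ts
// Simplified sketch of streaming parsing; the types and the line-based
// "protocol" are placeholders, not Spider's real Http parser.

interface Packet {
  seq: number;
  payload: Buffer;
}

interface ParseState {
  leftover: Buffer;            // incomplete message carried to the next page
  communicationsFound: number; // running counter, not the messages themselves
}

// Placeholder protocol: messages are newline-delimited.
function splitMessages(data: Buffer): { messages: Buffer[]; rest: Buffer } {
  const parts = data.toString('utf8').split('\n');
  const rest = Buffer.from(parts.pop() ?? '', 'utf8');
  return { messages: parts.map((p) => Buffer.from(p, 'utf8')), rest };
}

function parsePage(state: ParseState, page: Packet[]): ParseState {
  const ordered = [...page].sort((a, b) => a.seq - b.seq);
  const data = Buffer.concat([state.leftover, ...ordered.map((p) => p.payload)]);
  const { messages, rest } = splitMessages(data);
  return {
    leftover: rest,
    communicationsFound: state.communicationsFound + messages.length,
  };
}
```

Whatever the real message framing is, the point is that memory stays bounded by the page size plus the leftover, not by the length of the session.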

I also took the time to improve the 'understandability' of the process and the code quality. I will document the former soon.

Results

Users... did not see any improvement (nor any issue), except that 'it seems faster'. But the figures are here to tell us!

On the first day, I got only 65 parsing errors out of 43 million communications! They came from 2 missed bugs, which were solved straight away thanks to good observability! :)

Effects on back end

3 /v2 endpoints have been added to the API. The corresponding /v1 endpoints will be deprecated soon.

But Spider is compatible with both, which allowed an easy non-regression check by comparing the parsing results of the same network communications... from both engines ;-)!

Effects on UI

  1. Grids and details view of packets and Tcp sessions have been updated.
  2. Pcap upload feature has been updated to match new APIs.
  3. Downloading pcap packets has been fixed to match the new APIs.
  4. All this also implied changes in Tcp sessions display:
  • Details and content details pages now use infinite scroll, as there may be tens or hundreds of thousands of packets.
    • This deserves another improvement later: being able to select the time in the timeline!
  • Getting all packets of a Tcp session is only a single filter away.

Performance

Last but not least... performance!! Give me the figures :)

Statistics over:

  • 9h of run
  • 123 MB /min of parsed data
    • 318 000 packets /min
    • 31 000 tcp sessions /min
  • For a total of
    • 171 Million packets
    • 66,4 GB
    • 16 Million Tcp sessions
    • 0 error :-)

CPU usage dropped!

(Before / after charts)

Redis footprint was divided by more than 2!

  • From 80 000 items in working memory to 30 000
  • From 500 MB to 200 MB memory footprint :)

(Before / after charts)

Resource usage

CPU usage of the parsing service dropped by 6%. But the most impressive are the CPU drops of 31% and 43% on the inbound services: pack-write and tcp-write.

Whisperers --> Spider

Confirming the above figures, the response times of pack-write and tcp-write have improved by 40% and 10%!

API stats

API statistics confirm the trend, with server-side improvements of up to 50%! Geez!!

Circuit breakers stats

When seen from the circuit breakers' perspective, the difference is smaller, due to the delay of the client services' internal processing.

Conclusion

That was big work! Many changes in many places. But Spider is now faster and better than ever :)

Excel-driven refactoring! My first ever ;)

· 2 min read

One of the oldest sagas of the Network UI needed a huge refactor. Well... not a refactor: the goal was to remove it completely.

I wrote it before I established my best practices with sagas, and this method was one of those that helped me understand... not to do it like this ;) The method was called on various user and automatic actions to update various elements of the UI:

  • Timeline
  • Map
  • Grid
  • Stats
  • Nodes names
  • ...

As long as it was updating everything, it was quite simple. But for performance improvements, and to limit the queries to the servers, many parameters were added to restrict some refreshes to some situations.

However, this is not the right pattern. It is better to have each component own its saga, watching the actions it needs to refresh on. This is the pattern I implemented almost everywhere else, and it scales well while keeping the responsibility in one place.
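
For example, here is a minimal redux-saga sketch of that pattern (the action types and API call are made up for the example): the timeline owns its watcher and refreshes only itself, and the map, grid and stats get their own equivalents.

```ts
import { call, put, takeLatest } from 'redux-saga/effects';

// Illustrative sketch of the per-component saga pattern. The action types
// and endpoint below are hypothetical, not the Network UI's real names.

const FILTERS_CHANGED = 'FILTERS_CHANGED';
const TIME_RANGE_CHANGED = 'TIME_RANGE_CHANGED';
const TIMELINE_LOADED = 'TIMELINE_LOADED';

async function fetchTimeline(): Promise<unknown> {
  const res = await fetch('/api/timeline'); // hypothetical endpoint
  return res.json();
}

function* refreshTimeline() {
  const data: unknown = yield call(fetchTimeline);
  yield put({ type: TIMELINE_LOADED, payload: data });
}

// The timeline subscribes only to the actions that affect it; other
// components (map, grid, stats...) register their own watchers in the
// root saga instead of sharing one big refresh method.
export function* timelineSaga() {
  yield takeLatest([FILTERS_CHANGED, TIME_RANGE_CHANGED], refreshTimeline);
}
```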

Performing this refactor was risky, as the function was called from many places with various arguments.

So I used... Excel! ;)

1. List the calls

2. List the behaviors from the params

3. List the needs for each call

4. Find the actions behind each need, and subscribe the update sagas to those actions

5. Tests!!

All in all... 5h of preparation, 5h of refactoring + fixing, and... it rocks! :)

So much more understandable and easier to maintain. What a relief to remove this old code.

Code ages badly. Really ;)

Monitoring - New Performance view

· One min read

I added a new view to monitoring. And thanks to the big refactors of last year... this was bloody easy :)

This view adds several grids to get performance statistics over the period:

  • Services performance
    • Replicas, CPU, RAM, Errors
  • Whisperers -> Services communications
  • Services API
  • Services -> Services communications
  • Services -> Elasticsearch
  • Services -> Redis
    • For all: Load, Latency, Errors

Setup now manages Docker config upgrades

· One min read

I wanted to remove coupling between Spider setup and infrastructure configuration.

There was still one sticky bit: the configuration service was using a volume from which all the applications' configuration files were mounted.

I moved it all to Docker configs, so that you may have many replicas of the configuration service, and so that High Availability is managed by Docker. To get there, I upgraded the Spider setup script to:

  • Create Docker configs for each application configuration file
  • Inject them in Configuration service Docker stack definition
  • And also... manage updates of those configurations, to transparently change the Docker configs on the next deploy (see the sketch below).
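
The general Swarm trick behind that last point, as I understand it (a hypothetical sketch; the actual setup script may do it differently): Docker configs are immutable, so 'updating' one means creating a new config under a name derived from the file content and pointing the stack at that new name.

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';

// Hypothetical sketch of transparent Docker config rotation: each application
// configuration file gets a config name suffixed with a hash of its content.
// When the file changes, the name changes, and the next `docker stack deploy`
// creates the new config and switches the service to it.

function versionedConfigName(appName: string, filePath: string): string {
  const digest = createHash('sha256')
    .update(readFileSync(filePath))
    .digest('hex')
    .slice(0, 8);
  return `${appName}-config-${digest}`;
}

// Fragment to inject into the configuration service's stack definition
// (the shape of the returned object is illustrative, not a compose schema).
function configStackFragment(appName: string, filePath: string) {
  const name = versionedConfigName(appName, filePath);
  return {
    configs: { [name]: { file: filePath } },
    serviceConfigs: [{ source: name, target: `/conf/${appName}.json` }],
  };
}

console.log(JSON.stringify(configStackFragment('gateway', './conf/gateway.json'), null, 2));
```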

Now, more than ever, setup and upgrades of Spider are simple:

  • Setup your ES cluster
  • Setup your Docker Swarm cluster
  • Pull the Setup repo and configure setup.yml
  • To install:
    • Run make new-install config db keypair admin crons cluster
  • To update:
    • Run make update config db cluster

I could also manage Docker secret upgrades... but since only the signing key is stored as a secret, there is not much value in it :)

Technical upgrades

· One min read

I did some technical upgrades of Spider:

  • Traefik -> 2.4.8
  • Redis -> Back in Docker Swarm cluster for easier upgrade, and High Availability
  • Metricbeat & Filebeat -> 7.11

I also tested Redis threaded IO... but there was no gain, so I reverted.

Upgrade to Redis 6.2

· One min read

Just wanted to test :) I just upgraded Redis from 5.0 to 6.2...

  • Nothing to change except systemd loader
  • Performance is as fast as before (with no change of settings): 10 500 op/s for 7% CPU
  • Works like a charm

(Charts: CPU, load, processing time)

I'll let it run for some time, then I'll activate the new IO threads to check for any improvement.

Later, I'll see about using the new ACL and TLS features ;)