Profilers!

I like learning new stuff - anything, including technology. I love tinkering with new tools, systems and services, especially open source projects.
So, today, we had an issue in one of our internal systems called API Tester. It was very slow, and only today. According to our monitoring systems the CPU usage was very high, especially since this morning. Before noticing the CPU usage, we thought it was a DB issue and scaled up the DB (more CPU and RAM), but that was not the problem: our SQL queries were running fast. The API calls to the system were still slow, even though the SQL queries behind those API calls were fast. Finally we realized that something else was causing the slowness, and that it was also causing the CPU spike.
API Tester is an important internal system: a single entry point to a lot of things in our internal developer platform. For example, it helps with running automation tests and showing their results and logs, through another system called Validator. It also has an important feature: issuing access tokens that can be used to access all the internal systems that are protected by an auth wall. Apart from these, it has many other features that I’m yet to learn about.
This internal system is a web application written in Python.
I was blindly trying to debug this system to understand why CPU usage was high, using Google, Google’s AI answers and public forums (Stack Overflow etc).
Based on my noob Googling, I installed gdb along with a bunch of companion packages for debugging Python programs with it, and blindly ran it against the live process. This caused the Python program to halt (!!!!) :’) I thought I could do live debugging, but no, it paused the process as if it had hit a breakpoint (still gotta read about this), and that caused issues for all the users of the system. I stopped gdb after multiple separate runs and didn’t run it again once I realized it was causing problems for users who were already complaining about slowness.
Finally, I found py-spy, which seemed like a pretty interesting and fancy tool for debugging Python programs.
You can find its source code here: https://github.com/benfred/py-spy
It helped with understanding which functions were taking up a lot of CPU. There’s more to the tool than the basic stuff I used. I need to learn more about both gdb and py-spy. So far I have only tried to understand the ABCDs of profiling Python programs.
The idea I pieced together from my Googling was: the Linux system has high CPU usage, so check it with top and find which process is using too much CPU (here, a specific Python process). Take its process ID, check whether many threads are running under it, and find which thread is using too much CPU. Then work out which Python code started that thread, what it’s doing and why it’s slow.
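To make the "which thread is using too much CPU" step concrete, here’s a minimal sketch that ranks the threads of a process by the CPU time they’ve used so far. It assumes Linux (it reads /proc/<pid>/task/<tid>/stat), and the function names parse_cpu_seconds and busiest_threads are made up for illustration. It is not what we ran during the incident, just one way the step could look.

```python
import os
from pathlib import Path


def parse_cpu_seconds(stat_line: str, ticks_per_second: int) -> float:
    """Return user + system CPU seconds from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may contain spaces, so split
    # after the last ')' instead of splitting the whole line on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in the proc man page); utime and stime
    # are fields 14 and 15, which makes them indexes 11 and 12 here.
    utime_ticks, stime_ticks = int(fields[11]), int(fields[12])
    return (utime_ticks + stime_ticks) / ticks_per_second


def busiest_threads(pid: int, limit: int = 5) -> list[tuple[int, float]]:
    """Return (thread_id, cpu_seconds) pairs for a process, busiest first."""
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this machine
    usage = []
    for task_dir in Path(f"/proc/{pid}/task").iterdir():
        try:
            stat_line = (task_dir / "stat").read_text()
        except OSError:
            continue  # the thread exited while we were looking
        usage.append((int(task_dir.name), parse_cpu_seconds(stat_line, ticks)))
    return sorted(usage, key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    # With no real PID at hand, inspect this script's own process.
    for thread_id, seconds in busiest_threads(os.getpid()):
        print(f"thread {thread_id}: {seconds:.2f}s CPU")
```

Note that these numbers are cumulative since each thread started, not a live percentage. For a live per-thread view, top -H -p <PID> shows the same thread IDs.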
gdb helped with finding the threads inside the process, plus some more stuff which I didn’t understand. The python3.9-dbg and libpython3.9-dbg Ubuntu packages (debug builds of the Python 3.9 interpreter) also helped, with the py-bt and py-bt-full commands in gdb.
We are still nowhere close to being able to properly debug this if it happens again, but we now have some data and some guesses about what might have caused the problem. And next time, we’ll be in a better position to debug with py-spy. This time, we ran py-spy very close to the end of the problem, so we didn’t get much data except a few things. Later the issue was resolved by doing some restarts.
I’ll write more about this when the problem happens again. In the meantime, I’ll probably create some sample slow-running programs and debug them using both gdb and py-spy, to understand the differences, pros and cons of different profiling tools, and to understand CPU usage, RAM usage etc.
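As a starting point for those sample programs, here’s a rough sketch of one: a process with one thread that burns CPU and one that mostly sleeps, so a profiler has something obvious to find. Everything in it (burn_cpu, mostly_idle, run_demo, the thread names) is hypothetical and has nothing to do with the real API Tester code.

```python
import os
import threading
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`, returning the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def mostly_idle(seconds: float) -> None:
    """Sleep instead of computing, so this thread should barely show up."""
    time.sleep(seconds)


def run_demo(seconds: float) -> list[str]:
    """Run one hot thread and one idle thread; return their names."""
    threads = [
        threading.Thread(target=burn_cpu, args=(seconds,), name="hot-thread"),
        threading.Thread(target=mostly_idle, args=(seconds,), name="idle-thread"),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [thread.name for thread in threads]


if __name__ == "__main__":
    # Print the PID so a profiler can be attached from another terminal.
    print(f"PID {os.getpid()}: profile me while I run")
    run_demo(60.0)
```

While it runs, py-spy top --pid <PID> should show burn_cpu near the top, py-spy dump --pid <PID> prints the current stack of each thread, and py-spy record -o profile.svg --pid <PID> writes a flame graph. Depending on the machine, py-spy may need sudo to attach to a running process.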
Till then, see ya! :)
Apparently there are many such interesting profilers, like ones for Ruby, for PHP etc.