[I knew that our SIP team had been working on multi-core optimization for a while. So when there were signs that they'd struck oil, with some very promising results, I asked Ofer Goren, a software engineer on the SIP Stack team and one of the engineers assigned to the optimization task, to write a few words about the process. You might even come to appreciate the work we do here for our customers, toiling away with analysis tools.]
“Hardware is cheap, programmers are expensive,” says Jeff Atwood. So it’s better to spend your money on hardware than on optimizing your code. I wish we could say that to our customers and to my boss, but he has this nagging habit of asking me to do things. Like, well, optimizing our code. More specifically, optimizing it for this new multi-core technology. Hmpfff.
Therefore, multi-core optimization has been my main focus over the past several months. After identifying areas that could be optimized so that our SIP Stack's performance scales with the number of CPU cores in use, we decided to put some effort into it. Hence, the coming SIP Developer Suite version (5.5 GA2) will include code changes aimed at optimizing performance in multi-core environments.
I will briefly describe what tools we used, why, and how they helped us with the task at hand.
What and Why
So the “what and why” questions were quite easy to answer. It was an obvious choice to look for a tool that runs on MS Windows. After all, most of our development is done in a Windows environment before being ported to other operating systems, our IDEs and debuggers run on Windows, and for me, at least, working with a GUI is easier (sorry, but emacs is better than vi). We already had installations of Intel Thread Profiler and Intel Thread Checker. In an incredible coincidence, Intel approached us, suggesting we become part of the beta program for their next-generation threading tools. Need I say more? I guess the choice was made.
So we installed the new tools, and started playing with them a little. After some fooling around, an eye-opening window appeared:
Wow. Out of 204 seconds of CPU run time, almost 60 seconds are spent on doing… nothing??? It just waits there? On a freaking Mutex?! Heads will roll, I thought to myself. Let’s see who’s to blame:
Hmmm… we had several suspects, but it seemed that this “HandleIncomingRequestMsg” was just lying there, enjoying life, smoking (which is bad for you), while blocking everybody from doing their job…
I was curious whether the legacy Thread Profiler tool would agree with those observations. The Profiler makes you work harder to get the same information, but I realized they share the same understanding: some cores were getting paid for no work at all! I knew we had to do something about this code.
To get a feel for the extent of the issues at hand, Intel's Thread Profiler can give an easy-to-read picture of CPU utilization by showing the concurrency level of the running program. The higher the bars on the right, the better.
In the image above, you can see that most of the time, the executed test uses only a single thread, while the rest of the threads are idle. You can drill down from this view to your heart’s content and spend hours looking at your code.
And that’s how we identified potential areas for optimization, tackling them one at a time and focusing our time on the largest bottleneck first.
What I Learned
I was kinda clueless at the beginning of the process. I was told to optimize the code for multi-core CPUs. Well, great, how the heck should I do that? I just work here, remember?! So the Intel tools came to the rescue. They told us everything we needed to know, in my all-time favorite GUI way, identified potential areas to work on, and gave us a methodical way to scale up and improve our code (Yeah, right. As if I wrote THAT code…). The work we did improved performance on all architectures, not just Wintel. What can I say? They’ve got my vote.
My takeaways here are quite simple. When locking objects in a multi-core application that needs to scale:
- Make the locks you use as granular as possible (multi-level hashes, locks on individual objects rather than whole arrays, etc.). Bear in mind that as you do this, you increase memory and resource usage, so there’s a tradeoff here.
- Hold locks for as short a time as possible. If you lock large pieces of code, you’re going to pay for it: with your time, and with the cost of Intel’s tools.
The Bottom Line
We started at 65% core utilization. After several months of work, we increased calls per second by 50% (!) and reached 94% core utilization. So if you ask me, the Intel threading tools are the right thing for you. Smoking, in case you missed it, is not.