One frequent operation that is commonly overlooked is the subroutine or class method call. Such calls may make up 50 percent, or even as much as 80 percent, of the source code. The call itself is not very interesting, but other operations happen almost every time a subroutine is called: the operations related to parameter passing.
In the good old days, when procedural languages reigned supreme, parameter passing was simple. All it took was to push the values onto the stack and to create a stack frame:
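push A              ; pass the parameters on the stack
push B
call proc
...
proc:
push ebp            ; save the caller's frame pointer
mov ebp, esp        ; stack frame created here
; do something
mov esp, ebp        ; stack frame destroyed here
pop ebp             ; restore the caller's frame pointer
ret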
This is still a very typical scenario for a C/C++ program because there are many routines that operate on "simple" (i.e., non-object) parameters. Defining a routine with the __inline modifier does not change the picture as much as one might hope. The modifier is only a request, and when the compiler honors it, the call instruction is replaced by the subroutine code and the ret instruction is eliminated; whether the parameter pushes and the stack frame also disappear depends on the optimizer. Looking at the design of modern CPUs, it is easy to see that removing call and ret by itself buys very little: in most cases these instructions cost next to nothing thanks to successful branch prediction and instruction prefetching. Excessive inlining, on the other hand, may blow up the size of the code and ultimately reduce performance because of the increased likelihood of instruction cache misses. Inlining pays off mainly when the routines are extremely compact (just a few operations) and called very frequently. Good examples are short CString methods in C++ and complex-number arithmetic.
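As a rough illustration of the kind of routine that is compact enough to be worth inlining, here is a minimal sketch of complex-number arithmetic. The Complex type and the functions add() and mul() are hypothetical names made up for this example, and the standard inline keyword is used in place of the Microsoft-specific __inline.

```cpp
// Hypothetical minimal complex type; the names are illustrative only.
struct Complex {
    double re;
    double im;
};

// Each routine is only a handful of arithmetic operations, so the cost of
// passing parameters could rival the cost of the work itself.
inline Complex add(const Complex& a, const Complex& b) {
    return Complex{a.re + b.re, a.im + b.im};
}

inline Complex mul(const Complex& a, const Complex& b) {
    // (a.re + i*a.im) * (b.re + i*b.im)
    return Complex{a.re * b.re - a.im * b.im,
                   a.re * b.im + a.im * b.re};
}
```

Whether the compiler actually inlines these calls remains its own decision; the keyword is a hint, not a command.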
Making the Most of Your Registers
There is another modifier that can be helpful: __fastcall asks the compiler to pass subroutine parameters in registers. In 32-bit Visual C++, the first two arguments that fit into a 32-bit register travel in ECX and EDX, and any remaining arguments are still pushed onto the stack. This eliminates memory operations such as pushing parameters onto the stack and reading them back through the stack frame. Also, instructions that operate solely on registers execute faster in the internal CPU pipeline. However, in the x86 architecture there are sometimes just not enough registers to accommodate all the values.
Visual C++ also has the /Oy compiler option (frame-pointer omission; /Oy- turns it back off). It suppresses stack frame setup, which saves a few instructions and frees the EBP register for general use. Though the advantage is small, it is still an advantage. Needless to say, in the scarce pool of x86 registers an extra register may be a big asset.
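A minimal sketch of how the modifier might be applied is shown here. Because __fastcall is a Microsoft-specific keyword, the example hides it behind a FASTCALL macro so that the code still builds with other compilers; both the macro and the scaled_sum() routine are hypothetical names introduced for this illustration.

```cpp
// FASTCALL is an assumed helper macro, not part of any library. Under
// Visual C++ it expands to __fastcall; elsewhere it expands to nothing and
// the compiler's default calling convention applies.
#if defined(_MSC_VER)
#define FASTCALL __fastcall
#else
#define FASTCALL
#endif

// In a 32-bit Visual C++ build, a and b would be passed in ECX and EDX,
// while scale, the third argument, would still go on the stack.
int FASTCALL scaled_sum(int a, int b, int scale) {
    return (a + b) * scale;
}
```

The call site does not change: scaled_sum(2, 3, 4) is written the same way regardless of the calling convention.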
Simple parameters are only a part of the problem. Most programs use objects heavily and pass them as parameters frequently. Where there are objects, one finds constructors, destructors, and quite often memory allocation. And did I mention local variables? Consider what happens in the following code sample:
void foo(CString S)
{
    CString S2;
    ...
}
...
CString S1;
foo(S1);
First the constructor for S1 is called. Then the copy constructor for S is called (which also allocates memory using the new operator). Then the constructor for S2 is called, and only then does the subroutine do its work. On the way out, the destructor for S2 is called (which releases allocated memory using the delete operator), followed by the destructor for S (which again releases allocated memory using the delete operator). What if there are more parameters? What if they are complex objects with complex constructors or destructors that, among other things, allocate or free memory? And what about all those local variables? It is clear that the overhead can be quite substantial. Is there a workaround? Of course: pass objects by reference, and avoid, minimize, or consolidate local variables, or make them static. Given these guidelines, the foo() routine can be rewritten as:
void foo(const CString& S)
{
    static CString S2;
    ...
}
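The difference between the two versions of foo() can be made visible with a small sketch that counts copy constructions. Tracked is a hypothetical stand-in for a string class such as CString, and length_by_value() and length_by_reference() are made-up routines; none of these names come from a real library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an object-type parameter: it owns a string and
// counts how many times it has been copy-constructed.
struct Tracked {
    static int copies;   // number of copy constructions so far
    std::string text;

    explicit Tracked(const char* s) : text(s) {}
    Tracked(const Tracked& other) : text(other.text) { ++copies; }
};

int Tracked::copies = 0;

// By value: the argument is copy-constructed on entry and destroyed on exit.
std::size_t length_by_value(Tracked s) {
    return s.text.size();
}

// By const reference: no copy is made and no extra destructor runs.
std::size_t length_by_reference(const Tracked& s) {
    return s.text.size();
}
```

Calling length_by_value() with an existing object raises Tracked::copies by one on every call, while length_by_reference() leaves the counter untouched.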
There is nothing wrong with using static local objects, though you must remember to initialize or clear them explicitly every time the routine is called. Local variable consolidation, on the other hand, is now considered a bad practice because it violates code separability. For instance, if you have two routines foo() and faa() that both rely on a local CString variable, it is possible to consolidate both local variables into one by defining a global CString, but the two routines are then coupled through that shared object.
Also keep in mind that static or global variables are not thread safe. If several threads call the same function that uses a static variable, the value of the static variable will be unpredictable unless explicit synchronization (e.g., a lock or a mutex) is employed. Though global variables are out of favor and carry some risks, there is no reason why we should not consider using them when performance really matters (or rather, there is no reason why compilers should not attempt to consolidate local object-type variables automatically).
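The need to reset a static local on every call can be sketched as follows. The is_palindrome() routine is a hypothetical example that uses std::string rather than CString; its static scratch buffer is constructed once and then reused by every later call.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical routine: ignores case and punctuation when testing whether
// the text reads the same in both directions.
bool is_palindrome(const std::string& s) {
    static std::string scratch;   // constructed once, on the first call
    scratch.clear();              // drop whatever the previous call left behind

    for (unsigned char c : s) {
        if (std::isalnum(c)) {
            scratch += static_cast<char>(std::tolower(c));
        }
    }

    // Compare characters from both ends, moving toward the middle.
    for (std::size_t i = 0, j = scratch.size(); i + 1 < j; ++i, --j) {
        if (scratch[i] != scratch[j - 1]) {
            return false;
        }
    }
    return true;
}
```

Without the call to clear(), the second invocation would append to the characters collected by the first one and return the wrong answer. In typical implementations clear() keeps the buffer's capacity, which is where the saving comes from.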
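One possible way to add the explicit synchronization is to guard the static object with a mutex, as in the sketch below. It assumes a C++11 compiler and uses std::mutex from the standard library; joined_length() and its variables are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical routine that joins two strings in a shared static buffer
// and reports the combined length.
std::size_t joined_length(const std::string& a, const std::string& b) {
    static std::string scratch;       // shared by every caller
    static std::mutex scratch_mutex;  // protects scratch

    // Only one thread at a time may pass this point; the lock is released
    // automatically when the function returns.
    std::lock_guard<std::mutex> lock(scratch_mutex);

    scratch.clear();
    scratch += a;
    scratch += b;
    return scratch.size();
}
```

The lock has a cost of its own, so in heavily threaded code it may cancel out whatever was gained by avoiding the constructor and destructor calls.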
A Winning Strategy
To improve the performance of subroutine and method calls: pass parameters in registers (the __fastcall modifier in Visual C++); pass objects by reference; and reduce the use of expensive local variables, either by consolidating them into globals or, to avoid violating code separability, by making them static.