GCC May “Save” You Some Recursive Functions Calls: an Analysis of a Function Call Stack Length Example

We know compilers like gcc can do lots smart optimization to make the program run faster. Regarding functions call optimization, gcc can do tail-call elimination to save the cost of allocating a new stack frame, and tail recursion elimination to turn a recursive function to non-recursive iterative one. gcc can even transform some recursive functions that are not tail-recursive into a tail-recursive one so that the compiler can then do tail recursion elimination. But what will happen if a function can not be optimized using tail recursion elimination because of some non-safe operations inside of the function body? In this post, we analyze one example.

The C program we analyze

The prog.c program we analyze is as follows.

#include <stdio.h>

void RecursiveFunction( int recursion_times )
{
  printf("stack: %p\n", (void*)&recursion_times);

  if (recursion_times <= 1) return;

  return RecursiveFunction( --recursion_times );
}

int main(int argc, char* args[])
{
  RecursiveFunction(4);
  return 0;
}

We pass the &recursion_times into another function which may change its value. C/C++ require each variable, including multiple instances of the same variable in recursive calls, to have distinct locations. The number of recursion_times variables are only known during run time. So tail recursion elimination should not be done simply here by the compiler although RecursiveFunction is tail recursive. So will the compiler just stop here and do nothing to optimize it? Let’s see by running it with different optimization levels.

The execution results

We use a script run.sh to try different cases and disassemble the executable binary files generated by gcc.

#!/bin/bash

set -o errexit

for opt in 0 1 2 s 3 ; do
  echo "Begin -O$opt"
  gcc -fno-stack-protector -O$opt prog.c -o prog.$opt
  ./prog.$opt
  objdump -d ./prog.$opt > prog.$opt.as
  # gcc -fno-stack-protector -O$opt -Wa,-adhln -g prog.c > prog.$opt.list
  echo "End -O$opt"
done

The -fno-stack-protector option tells gcc not to generate stack protection code so that assembly code is cleaner to read.

I tried it on Ubuntu 18.04 with gcc 7.5.0 (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0) and the results are as follows.

$ ./run.sh
Begin -O0
stack: 0x7fff3690668c
stack: 0x7fff3690666c
stack: 0x7fff3690664c
stack: 0x7fff3690662c
End -O0
Begin -O1
stack: 0x7ffde5213d4c
stack: 0x7ffde5213d2c
stack: 0x7ffde5213d0c
stack: 0x7ffde5213cec
End -O1
Begin -O2
stack: 0x7ffd6a09587c
stack: 0x7ffd6a09585c
stack: 0x7ffd6a09583c
stack: 0x7ffd6a09581c
End -O2
Begin -Os
stack: 0x7ffef5b6ce9c
stack: 0x7ffef5b6ce7c
stack: 0x7ffef5b6ce5c
stack: 0x7ffef5b6ce3c
End -Os
Begin -O3
stack: 0x7ffd414fb53c
stack: 0x7ffd414fb54c
stack: 0x7ffd414fb50c
stack: 0x7ffd414fb51c
End -O3

For optimization levels 0,1,2 and s, the results are consistent. For each function calls, the stack frame increases (address decreases) by 0x20 bytes. The interesting one is the result when gcc optimization level is 3. The stack frame decreases by 0x10 bytes and then increases by 0x40 bytes. Why does the stack frame wigwag? Let’s take a look at the assembly code for the RecursiveFunction.

The generated code at optimization level 0

Let’s first take a look at the generated code in prog.0.as for optimization level 0. It is quite straightforward and the flow follows the C code’s order. I added annotations.

000000000000064a <RecursiveFunction>:
                                # save old stack frame base in stack, use 0x8 bytes on stack
 64a:   55                      push   %rbp
                                # new stack frame base at old stack pointer 
 64b:   48 89 e5                mov    %rsp,%rbp
                                # allocate 0x10 for the new stack
 64e:   48 83 ec 10             sub    $0x10,%rsp
                                # store recursion_times into stack
 652:   89 7d fc                mov    %edi,-0x4(%rbp)
                                # get &recursion_times
 655:   48 8d 45 fc             lea    -0x4(%rbp),%rax
                                # call printf()
 659:   48 89 c6                mov    %rax,%rsi
 65c:   48 8d 3d d1 00 00 00    lea    0xd1(%rip),%rdi        # 734 <_IO_stdin_used+0x4>
 663:   b8 00 00 00 00          mov    $0x0,%eax
 668:   e8 b3 fe ff ff          callq  520 <printf@plt>
                                # get recursion_times to %eas
 66d:   8b 45 fc                mov    -0x4(%rbp),%eax
                                # if (resursion_times <= 1) return
 670:   83 f8 01                cmp    $0x1,%eax
 673:   7e 15                   jle    68a <RecursiveFunction+0x40>
                                # do --resursion_times
 675:   8b 45 fc                mov    -0x4(%rbp),%eax
 678:   83 e8 01                sub    $0x1,%eax
 67b:   89 45 fc                mov    %eax,-0x4(%rbp)
                                # Prepare arguments and call RecursiveFunction
 67e:   8b 45 fc                mov    -0x4(%rbp),%eax
 681:   89 c7                   mov    %eax,%edi
                                # callq will store the returning address on stack, using 0x8 bytes
 683:   e8 c2 ff ff ff          callq  64a <RecursiveFunction>
 688:   eb 01                   jmp    68b <RecursiveFunction+0x41>
 68a:   90                      nop
 68b:   c9                      leaveq 
 68c:   c3                      retq   

The 0x20-byte stack frame consists of 0x8 bytes for storing old %rbp (at 64a), 0x10 bytes for this function’s usage (at 64e), and the 0x8 bytes used by callq at 683. There is no surprise.

The generated code at optimization level 3

Now let’s look at the generated code by gcc under optimization level 3. Annotations are added too.

0000000000000690 <RecursiveFunction>:
                                # allocate a stack frame of 0x28 bytes
 690:   48 83 ec 28             sub    $0x28,%rsp
 694:   48 8d 35 e9 00 00 00    lea    0xe9(%rip),%rsi        # 784 <_IO_stdin_used+0x4>
 69b:   31 c0                   xor    %eax,%eax
 69d:   48 8d 54 24 0c          lea    0xc(%rsp),%rdx
                                # store recursion_times into stack at 0xc(%rsp)
 6a2:   89 7c 24 0c             mov    %edi,0xc(%rsp)
 6a6:   bf 01 00 00 00          mov    $0x1,%edi
                                # call __printf_chk()
 6ab:   e8 90 fe ff ff          callq  540 <__printf_chk@plt>
                                # if recursion_times > 1, goto 6c0
                                # so, if recursion_times <= 1, return
 6b0:   8b 44 24 0c             mov    0xc(%rsp),%eax
 6b4:   83 f8 01                cmp    $0x1,%eax
 6b7:   7f 07                   jg     6c0 <RecursiveFunction+0x30>
 6b9:   48 83 c4 28             add    $0x28,%rsp
 6bd:   c3                      retq   
 6be:   66 90                   xchg   %ax,%ax
                                # --recursion_times
 6c0:   83 e8 01                sub    $0x1,%eax
                                # call __printf_chk()
 6c3:   48 8d 54 24 1c          lea    0x1c(%rsp),%rdx
 6c8:   48 8d 35 b5 00 00 00    lea    0xb5(%rip),%rsi        # 784 <_IO_stdin_used+0x4>
                                # store recursion_times to stack at 0xc(%rsp)
 6cf:   89 44 24 0c             mov    %eax,0xc(%rsp)
                                # store recursion_times to stack at 0x1c(%rsp)
 6d3:   89 44 24 1c             mov    %eax,0x1c(%rsp)
 6d7:   bf 01 00 00 00          mov    $0x1,%edi
 6dc:   31 c0                   xor    %eax,%eax
 6de:   e8 5d fe ff ff          callq  540 <__printf_chk@plt>
                                # if recursion_times <= 1, return
 6e3:   8b 7c 24 1c             mov    0x1c(%rsp),%edi
 6e7:   83 ff 01                cmp    $0x1,%edi
 6ea:   7e cd                   jle    6b9 <RecursiveFunction+0x29>
                                # -- recursion_times
 6ec:   83 ef 01                sub    $0x1,%edi
                                # store recursion_times into stack at 0x1c(%rsp)
 6ef:   89 7c 24 1c             mov    %edi,0x1c(%rsp)
                                # call RecursiveFunction
 6f3:   e8 98 ff ff ff          callq  690 <RecursiveFunction>
 6f8:   eb bf                   jmp    6b9 <RecursiveFunction+0x29>
 6fa:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

Before we get into the RecursiveFunction logic, let’s first check the optimizations applied here.

  • %rbp is not used and only %rsp is maintained. So the 2 instructions pushq %rbp and movq %rsp,%rbp in the code generated at level 0 are not needed.
  • __printf_chk() is directly called instead of printf() because the printf() body only contains return __printf_chk(... and the tail-call elimination takes effect here. You may use the gcc -fno-stack-protector -O3 -Wa,-adhln -g prog.c (as a technique from Generating Mixed Source and Assembly List using GCC) to verify and see the mixed source code and assembly code.

Now, let’s take a look at why the &recursion_times wigwags.

In the body of the generated code of RecursiveFunction, the __printf_chk() are called twice and recursion_times is deducted by 1 for two times. So, 2 RecursiveFunction functions’ logic from the C code may be executed in one RecursiveFunction procedure call in the generated optimized code. So the RecursiveFunction is not the same any more! That is, one RecursiveFunction C function is inlined into itself. For the 2 recursive_times variables in one RecursiveFunction procedure call, they are stored in 0xc(%rsp) and 0x1c(%rsp) so their address differs by 0x10 bytes. The stack frame size for a RecursiveFunction procedure (the generated one) call is actually 0x28+0x8 = 0x30 (by sub at 690 and callq at 6f3).

The reason is recursive inlining optimization

Now it is clear the cause is the recursive inlining optimization by gcc. gcc also has options to control the optimization behavior. From gcc manual:

max-inline-insns-recursive
max-inline-insns-recursive-auto

    Specifies the maximum number of instructions an out-of-line copy of a
    self-recursive inline function can grow into by performing recursive
    inlining.

    --param max-inline-insns-recursive applies to functions declared inline.
    For functions not declared inline, recursive inlining happens only when
    -finline-functions (included in -O3) is enabled; --param
    max-inline-insns-recursive-auto applies instead.  The default value is 450.

max-inline-recursive-depth
max-inline-recursive-depth-auto

    Specifies the maximum recursion depth used for recursive inlining.

    --param max-inline-recursive-depth applies to functions declared inline.
    For functions not declared inline, recursive inlining happens only when
    -finline-functions (included in -O3) is enabled; --param
    max-inline-recursive-depth-auto applies instead.  The default value is 8.

min-inline-recursive-probability

    Recursive inlining is profitable only for function having deep recursion in
    average and can hurt for function having little recursion depth by
    increasing the prologue size or complexity of function body to other
    optimizers.

    When profile feedback is available (see -fprofile-generate) the actual
    recursion depth can be guessed from the probability that function recurses
    via a given call expression.  This parameter limits inlining only to call
    expressions whose probability exceeds the given threshold (in percents).
    The default value is 10.

Further tries

Now, let’s use the parameters to tune gcc’s optimization parameters and see what’s the result. Let’s set min-inline-recursive-probability to 5. To make the results clearer, I changed the initial recursion_times to 10.

$ gcc -fno-stack-protector -O3 --param min-inline-recursive-probability=5 prog.c -o prog.3
$ ./prog.3 
stack: 0x7ffdfa315a7c
stack: 0x7ffdfa315a88
stack: 0x7ffdfa315a8c
stack: 0x7ffdfa315a4c
stack: 0x7ffdfa315a58
stack: 0x7ffdfa315a5c
stack: 0x7ffdfa315a1c
stack: 0x7ffdfa315a28
stack: 0x7ffdfa315a2c
stack: 0x7ffdfa3159ec

Now 3 RecursiveFunctions are merged together. We can verify this by checking assembly code which calls __printf_chk() 3 times.

0000000000000690 <RecursiveFunction>:
 690:   48 83 ec 28             sub    $0x28,%rsp
 694:   48 8d 35 19 01 00 00    lea    0x119(%rip),%rsi        # 7b4 <_IO_stdin_used+0x4>
 69b:   31 c0                   xor    %eax,%eax
 69d:   48 8d 54 24 0c          lea    0xc(%rsp),%rdx
 6a2:   89 7c 24 0c             mov    %edi,0xc(%rsp)
 6a6:   bf 01 00 00 00          mov    $0x1,%edi
 6ab:   e8 90 fe ff ff          callq  540 <__printf_chk@plt>
 6b0:   8b 44 24 0c             mov    0xc(%rsp),%eax
 6b4:   83 f8 01                cmp    $0x1,%eax
 6b7:   7f 07                   jg     6c0 <RecursiveFunction+0x30>
 6b9:   48 83 c4 28             add    $0x28,%rsp
 6bd:   c3                      retq   
 6be:   66 90                   xchg   %ax,%ax
 6c0:   83 e8 01                sub    $0x1,%eax
 6c3:   48 8d 54 24 18          lea    0x18(%rsp),%rdx
 6c8:   48 8d 35 e5 00 00 00    lea    0xe5(%rip),%rsi        # 7b4 <_IO_stdin_used+0x4>
 6cf:   89 44 24 0c             mov    %eax,0xc(%rsp)
 6d3:   89 44 24 18             mov    %eax,0x18(%rsp)
 6d7:   bf 01 00 00 00          mov    $0x1,%edi
 6dc:   31 c0                   xor    %eax,%eax
 6de:   e8 5d fe ff ff          callq  540 <__printf_chk@plt>
 6e3:   8b 44 24 18             mov    0x18(%rsp),%eax
 6e7:   83 f8 01                cmp    $0x1,%eax
 6ea:   7e cd                   jle    6b9 <RecursiveFunction+0x29>
 6ec:   83 e8 01                sub    $0x1,%eax
 6ef:   48 8d 54 24 1c          lea    0x1c(%rsp),%rdx
 6f4:   48 8d 35 b9 00 00 00    lea    0xb9(%rip),%rsi        # 7b4 <_IO_stdin_used+0x4>
 6fb:   89 44 24 18             mov    %eax,0x18(%rsp)
 6ff:   89 44 24 1c             mov    %eax,0x1c(%rsp)
 703:   bf 01 00 00 00          mov    $0x1,%edi
 708:   31 c0                   xor    %eax,%eax
 70a:   e8 31 fe ff ff          callq  540 <__printf_chk@plt>
 70f:   8b 7c 24 1c             mov    0x1c(%rsp),%edi
 713:   83 ff 01                cmp    $0x1,%edi
 716:   7e a1                   jle    6b9 <RecursiveFunction+0x29>
 718:   83 ef 01                sub    $0x1,%edi
 71b:   89 7c 24 1c             mov    %edi,0x1c(%rsp)
 71f:   e8 6c ff ff ff          callq  690 <RecursiveFunction>
 724:   eb 93                   jmp    6b9 <RecursiveFunction+0x29>
 726:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 72d:   00 00 00 

It’s great fun, right? You may try more combinations and see how gcc generate different code.

Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

Leave a Reply

Your email address will not be published. Required fields are marked *