Programmer's Reference Manual

 

Appendix C: ARM procedure call standard


This appendix relates to the implementation of compiler code-generators and language run-time library kernels for the Acorn RISC Machine (ARM) but is also a useful reference when interworking assembly language with high level language code.

The reader should be familiar with the ARM's instruction set, floating-point instruction set and assembler syntax before attempting to use this information to implement a code-generator. In order to write a run-time kernel for a language implementation, additional information specific to the relevant ARM operating system will be needed (some information is given in the sections describing the standard register bindings for this procedure-call standard).

The main topics covered in this appendix are the procedure call and stack disciplines. These disciplines are observed by Acorn's C language implementation for the ARM and, eventually, will be observed by other high level language compilers too. Because C is the first-choice implementation language for RISC OS applications and the implementation language of Acorn's UNIX product RISC iX, the utility of a new language implementation for the ARM will be related to its compatibility with Acorn's implementation of C.

At the end of this appendix are several examples of the usage of this standard, together with suggestions for generating effective code for the ARM.

The purpose of APCS

The ARM Procedure Call Standard is a set of rules, designed:

  • to facilitate calls between program fragments compiled from different source languages (eg to make subroutine libraries accessible to all compiled languages)
  • to give compilers a chance to optimise procedure call, procedure entry and procedure exit (following the reduced instruction set philosophy of the ARM).

This standard defines the use of registers, the passing of arguments at an external procedure call, and the format of a data structure that can be used by stack backtracing programs to reconstruct a sequence of outstanding calls. It does so in terms of abstract register names. The binding of some register names to register numbers and the precise meaning of some aspects of the standard are somewhat dependent on the host operating system and are described in separate sections.

Formally, this standard only defines what happens when an external procedure call occurs. Language implementors may choose to use other mechanisms for internal calls and are not required to follow the register conventions described in this appendix except at the instant of an external call or return. However, other system-specific invariants may have to be maintained if it is required, for example, to deliver reliably an asynchronous interrupt (eg a SIGINT) or give a stack backtrace upon an abort (eg when dereferencing an invalid pointer). More is said on this subject in later sections.

Design criteria

This procedure call standard was defined after a great deal of experimentation, measurement, and study of other architectures. It is believed to be the best compromise between the following important requirements:

  • Procedure call must be extremely fast.
  • The call sequence must be as compact as possible. (In typical compiled code, calls outnumber entries by a factor in the range 2:1 to 5:1.)
  • Extensible stacks and multiple stacks must be accommodated. (The standard permits a stack to be extended in a non-contiguous manner, in stack chunks. The size of the stack does not have to be fixed when it is created, avoiding a fixed partition of the available data space between stack and heap. The same mechanism supports multiple stacks for multiple threads of control.)
  • The standard should encourage the production of re-entrant programs, with writable data separated from code.
  • The standard must support variation of the procedure call sequence, other than by conventional return from procedure (eg in support of C's longjmp, Pascal's goto-out-of-block, Modula-2+'s exceptions, UNIX's signals, etc) and tracing of the stack by debuggers and run-time error handlers. Enough is defined about the stack's structure to ensure that implementations of these are possible (within limits discussed later).

The Procedure Call Standard

This section defines the standard.

Register names

The ARM has 16 visible general registers and 8 floating-point registers. In interrupt modes some general registers are shadowed and not all floating-point operations are available, depending on how the floating-point operations are implemented.

This standard is written in terms of the register names defined in this section. The binding of certain register names (the call frame registers) to register numbers is discussed separately. We do this so that:

  • Diverse needs can be more easily accommodated, as can conflicting historical usage of register numbers, yet the underlying structure of the procedure call standard - on which compilers depend critically - remains fixed.
  • Run-time support code written in assembly language can be made portable between different register bindings, if it obeys the rules given in the Defined bindings of the procedure call standard.

The register names and fixed bindings are given immediately below.

General Registers

First, the four argument registers:

a1 RN 0  ; argument 1/integer result
a2 RN 1  ; argument 2
a3 RN 2  ; argument 3
a4 RN 3  ; argument 4

Then the six 'variable' registers:

v1 RN 4  ; register variable
v2 RN 5  ; register variable
v3 RN 6  ; register variable
v4 RN 7  ; register variable
v5 RN 8  ; register variable
v6 RN 9  ; register variable

Then the call-frame registers, the bindings of which vary (see the chapter entitled Defined bindings of the procedure call standard for details):

sl       ; stack limit / stack chunk handle
fp       ; frame pointer
ip       ; temporary workspace, used in procedure entry
sp RN 13 ; lower end of current stack frame

Finally, lr and pc, which are determined by the ARM's hardware:

lr RN 14 ; link address on calls/temporary workspace
pc RN 15 ; program counter and processor status

In the obsolete APCS-A register bindings described below, sp is bound to r12; in all other APCS bindings, sp is bound to r13.

Notes

Literal register names are given in lower case, eg v1, sp, lr. In the text that follows, symbolic values denoting 'some register' or 'some offset' are given in upper case, eg R, R+N.

References to 'the stack' denoted by sp assume a stack that grows from high memory to low memory, with sp pointing at the top or front (ie lowest addressed word) of the stack.

At the instant of an external procedure call there must be nothing of value to the caller stored below the current stack pointer, between sp and the (possibly implicit, possibly explicit) stack (chunk) limit. Whether there is a single stack chunk or multiple chunks, an explicit stack limit (in sl) or an implicit stack limit, is determined by the register bindings and conventions of the target operating system.

Here and in the text that follows, for any register R, the phrase 'in R' refers to the contents of R; the phrase 'at [R]' or 'at [R, #N]' refers to the word pointed at by R or R+N, in line with ARM assembly language notation.

Floating-point Registers

The floating-point registers are divided into two sets, analogous to the subsets a1-a4 and v1-v6 of the general registers. Registers f0-f3 need not be preserved by a called procedure; f0 is used as the floating-point result register. In certain restricted circumstances (noted below), f0-f3 may be used to hold the first four floating-point arguments. Registers f4-f7, the so called 'variable' registers, must be preserved by callees.

The floating-point registers are:

f0 FN 0   ; floating point result (or 1st FP argument)
f1 FN 1   ; floating point scratch register (or 2nd FP arg)
f2 FN 2   ; floating point scratch register (or 3rd FP arg)
f3 FN 3   ; floating point scratch register (or 4th FP arg)
f4 FN 4   ; floating point preserved register
f5 FN 5   ; floating point preserved register
f6 FN 6   ; floating point preserved register
f7 FN 7   ; floating point preserved register

Data representation and argument passing

The APCS is defined in terms of N (>= 0) word-sized arguments being passed from the caller to the callee, and a single word or floating-point result passed back by the callee. The standard does not describe the layout in store of records, arrays and so forth, used by ARM-targeted compilers for C, Pascal, Fortran-77, and so on. In other words, the mapping from language-level objects to APCS words is defined by each language's implementation, not by APCS, and, indeed, there is no formal reason why two implementations of, say, Pascal for the ARM should not use different mappings and, hence, not be cross-callable.

Obviously, it would be very unhelpful for a language implementor to stand by this formal position and implementors are strongly encouraged to adopt not just the letter of APCS but also the obviously natural mappings of source language objects into argument words. Strong hints are given about this in later sections which discuss (some) language specifics.

Register usage and argument passing to external procedures
Control Arrival

We consider the passing of N (>= 0) actual argument words to a procedure which expects to receive either exactly N argument words or a variable number V (>= 1) of argument words (it is assumed that there is at least one argument word which indicates in a language-implementation-dependent manner how many actual argument words there are: for example, by using a format string argument, a count argument, or an argument-list terminator).

At the instant when control arrives at the target procedure, the following shall be true (for any M, if a statement is made about argM, and M > N, the statement can be ignored):

arg1 is in a1
arg2 is in a2
arg3 is in a3
arg4 is in a4
for all I >= 5, argI is at [sp, #4*(I-5)]

fp contains 0 or points to a stack backtrace structure (as described in the next section).

The values in sp, sl, fp are all multiples of four.

lr contains the pc+psw value that should be restored into r15 on exit from the procedure. This is known as the return link value for this procedure call.

pc contains the entry address of the target procedure.
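
As an illustration only, the placement rule for argument words can be sketched in C. The type ApcsArgLocation and the function apcs_arg_location are hypothetical names introduced here and are not part of the standard; the sketch merely restates the rule that arg1-arg4 arrive in a1-a4 and that argI, for I >= 5, is at [sp, #4*(I-5)].

```c
#include <assert.h>

/* Where argument word I (counted from 1) is found at the instant of entry. */
typedef struct {
    int in_register;  /* 1 if the word is in a1-a4, 0 if it is on the stack */
    int reg;          /* ARM register number 0-3 when in_register is 1      */
    int sp_offset;    /* byte offset from sp when in_register is 0          */
} ApcsArgLocation;

ApcsArgLocation apcs_arg_location(int i)
{
    ApcsArgLocation loc = {0, -1, -1};

    assert(i >= 1);                    /* argument words are numbered from 1 */
    if (i <= 4) {
        loc.in_register = 1;
        loc.reg = i - 1;               /* a1-a4 are bound to r0-r3 */
    } else {
        loc.sp_offset = 4 * (i - 5);   /* argI is at [sp, #4*(I-5)] */
    }
    return loc;
}
```

For example, apcs_arg_location(7) reports a stack word at byte offset 8 from sp.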

Now, let us call the lower limit to which sp may point in this stack chunk SP_LWM (Stack-Pointer Low Water Mark). Remember, it is unspecified whether there is one stack chunk or many, and whether SP_LWM is implicit, or explicitly derived from sl; these are binding-specific details. Then:

Space between sp and SP_LWM shall be (or shall be on demand) readable, writable memory which can be used by the called procedure as temporary workspace and overwritten with any values before the procedure returns.

sp >= SP_LWM + 256.

This condition guarantees that a stack extension procedure, if used, will have a reasonable amount of workspace available to it (256 bytes), probably sufficient for two or three further procedure invocations.

Control Return

At the instant when the return link value for a procedure call is placed in the pc+psw, the following statements shall be true:

fp, sp, sl, v1-v6, and f4-f7 shall contain the same values as they did at the instant of the call. If the procedure returns a word-sized result, R, which is not a floating-point value, then R shall be in a1. If the procedure returns a floating-point result, FPR, then FPR shall be in f0.

Notes

The definition of control return means that this is a 'callee saves' standard.

The requirement to pass a variable number of arguments to a procedure (as in K&R C) precludes the passing of floating-point arguments in floating-point registers (as the ARM's fixed point registers are disjoint from its floating-point registers). However, if a callee is defined to accept a fixed number K of arguments and its interface description declares it to accept exactly K arguments of matching types, then it is permissible to pass the first four floating-point arguments in floating-point registers f0-f3. Acorn's C compiler for the ARM does not yet exploit this latitude.

The values of a2-a4, ip, lr and f1-f3 are not defined at the instant of return.

The Z, N, C and V flags are set from the corresponding bits in the return link value on procedure return. For procedures called using a BL instruction, these flag values will be preserved across the call.

The flag values from lr at the instant of entry must be restored; it is not sufficient merely to preserve the flag values across the call.

Consider a procedure ProcA which has been 'tail-call optimised' and does:

        CMPS    a1, #0
        MOVLT   a2, #255
        MOVGE   a2, #0
        B       ProcB

If ProcB merely preserves the flags it sees on entry, rather than restoring those from lr, the wrong flags may be set when ProcB returns directly to ProcA's caller, because the CMPS in ProcA has already altered them.
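
The difference between restoring and preserving the flags can be modelled in C. The sketch below assumes the 26-bit layout of r15 (N, Z, C and V in bits 31-28, the word-aligned program counter in bits 25-2), which this appendix uses but does not spell out; the function and macro names are illustrative only.

```c
#include <stdint.h>

/* Assumed layout of a pc+psw word: N Z C V in bits 31-28, I F in bits
   27-26, the word-aligned pc in bits 25-2, processor mode in bits 1-0. */
#define APCS_FLAGS_MASK 0xF0000000u
#define APCS_PC_MASK    0x03FFFFFCu

/* Flags that a conforming return restores: those held in the link value. */
uint32_t flags_restored_from_link(uint32_t return_link)
{
    return return_link & APCS_FLAGS_MASK;
}

/* Address to which control returns. */
uint32_t return_address_from_link(uint32_t return_link)
{
    return return_link & APCS_PC_MASK;
}

/* Flags the caller would see if a procedure wrongly kept the flags it
   saw on entry instead of restoring those in the link value. */
uint32_t flags_if_entry_flags_kept(uint32_t flags_on_entry)
{
    return flags_on_entry & APCS_FLAGS_MASK;
}
```

With a return link value of 0x60008004 (Z and C set) and entry flags of 0xA0000000 (N and C set, as CMPS a1, #0 leaves them for a negative a1), the first function yields 0x60000000 whereas the last yields 0xA0000000, which is not what ProcA's caller expects.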

This standard does not define the values of fp, sp and sl at arbitrary moments during a procedure's execution, but only at the instants of (external) call and return. Further standards and restrictions may apply under particular operating systems, to aid event handling or debugging. In general, you are strongly encouraged to preserve fp, sp and sl, at all times.

The minimum amount of stack defined to be available is not particularly large, and as a general rule a language implementation should not expect much more, unless the conventions of the target operating system indicate otherwise. For example, code generated by the Arthur/RISC OS C compiler is able, if there is inadequate local workspace, to allocate more stack space from the C heap before continuing. Any language unable to do this may have its interaction with C impaired. That sl contains a stack chunk handle is important in achieving this. (See the chapter entitled Defined bindings of the procedure call standard for further details).

The statements about sp and SP_LWM are designed to optimise the testing of the one against the other. For example, in the RISC OS user-mode binding of APCS, sl contains SL_LWM+512, allowing a procedure's entry sequence to include something like:

        CMP  sp, sl
        BLLT |x$stack_overflow|

where x$stack_overflow is a part of the run-time system for the relevant language. If the comparison finds sp >= sl, so that x$stack_overflow is not called, there are at least 512 bytes free on the stack.

This procedure should only call other procedures when sp has been dropped by 256 bytes or less, guaranteeing that there is enough space for the called procedure's entry sequence (and, if needed, the stack extender) to work in.

If 256 bytes are not enough, the entry sequence has to drop sp before comparing it with sl in order to force stack extension (see later sections on implementation specifics for details of how the RISC OS C compiler handles this problem).
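
The stack-limit arithmetic above can be sketched in C as a rough model. It assumes sl = SL_LWM + 512, as in the RISC OS user-mode binding, and addresses below 2^31, so that the signed LT condition agrees with the comparisons used here. All names are hypothetical, and the large-frame function models only the idea of comparing a dropped stack pointer, not any particular compiler's instruction sequence.

```c
#include <stdint.h>

#define APCS_SL_MARGIN 512   /* sl = SL_LWM + 512 in this binding */

/* Models CMP sp, sl / BLLT: 1 if the overflow routine would be called. */
int stack_overflow_call_needed(uint32_t sp, uint32_t sl)
{
    return (int64_t)sp < (int64_t)sl;
}

/* Bytes between sp and the low water mark implied by sl. */
int64_t bytes_above_low_water_mark(uint32_t sp, uint32_t sl)
{
    return (int64_t)sp - ((int64_t)sl - APCS_SL_MARGIN);
}

/* Models the large-frame case: sp is dropped by frame_bytes before the
   comparison with sl, forcing extension when the frame would not fit. */
int stack_overflow_call_needed_large(uint32_t sp, uint32_t sl,
                                     uint32_t frame_bytes)
{
    return (int64_t)sp - (int64_t)frame_bytes < (int64_t)sl;
}
```

When stack_overflow_call_needed returns 0, bytes_above_low_water_mark is at least 512, which is the guarantee described above.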

The stack backtrace data structure

At the instant of an external procedure call, the value in fp is zero or it points to a data structure that gives information about the sequence of outstanding procedure calls. This structure is in the format shown below:


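    fp points here:  save mask pointer    [fp]
                     return link value    [fp, #-4]
                     return sp value      [fp, #-8]
                     return fp value      [fp, #-12]
                     [saved v6 value]
                     [saved v5 value]
                     [saved v4 value]
                     [saved v3 value]
                     [saved v2 value]
                     [saved v1 value]
                     [saved a4 value]
                     [saved a3 value]
                     [saved a2 value]
                     [saved a1 value]
                     [saved f7 value]     three words
                     [saved f6 value]     three words
                     [saved f5 value]     three words
                     [saved f4 value]     three words

Stack backtrace data structure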

This picture shows between four and 26 words of store, with those words higher on the page being at higher addresses in memory. The presence of any of the optional values does not imply the presence of any other. The floating-point values are in extended format and occupy three words each.

At the instant of procedure call, all of the following statements about this structure shall be true:

  • The return fp value is either 0 or contains a pointer to another stack backtrace data structure of the same form. Each of these corresponds to an active, outstanding procedure invocation. The statements listed here are also true of this next stack backtrace data structure and, indeed, hold true for each structure in the chain.
  • The save mask pointer value, when bits 0, 1, 26, 27, 28, 29, 30, 31 have been cleared, points twelve bytes beyond a word known as the return data save instruction.
  • The return data save instruction is a word that corresponds to an ARM instruction of the following form:

STMDB   sp!, {[a1], [a2], [a3], [a4],
              [v1], [v2], [v3], [v4], [v5], [v6],
              fp, ip, lr, pc}

Note that square brackets in the above denote optional parts: thus, there are 2 x 1024 possible values for the return data save instruction (1024 for each of the two bit patterns), corresponding to the following bit patterns:

          1110 1001 0010 1101 1101 10xx xxxx xxxx  APCS-R, APCS-U
or                          !    !  !
          1110 1001 0010 1100 1100 11xx xxxx xxxx  APCS-A (obsolete)

The least significant 10 bits represent argument and variable registers: if bit N is set, then register N will be transferred.

The optional parts a1, a2, a3, a4, v1, v2, v3, v4, v5 and v6 in this instruction correspond to those optional parts of the stack backtrace data structure that are present such that: for all M, if vM or aM is present then so is saved vM value or saved aM value, and if vM or aM is absent then so is saved vM value or saved aM value. This is as if the stack backtrace data structure were formed by the execution of this instruction, following the loading of ip from sp (as is very probably the case).

  • The sequence of up to four instructions following the return data save instruction determines whether saved floating-point registers are present in the backtrace structure. The four optional instructions allowed in this sequence are:
    STFE f7, [sp, #-12]! ; 1110 1101 0110 1101 0111 0001 0000 0011
    STFE f6, [sp, #-12]! ; 1110 1101 0110 1101 0110 0001 0000 0011
    STFE f5, [sp, #-12]! ; 1110 1101 0110 1101 0101 0001 0000 0011
    STFE f4, [sp, #-12]! ; 1110 1101 0110 1101 0100 0001 0000 0011
                                             !

    Any or all of these instructions may be missing, and any deviation from this order or any other instruction terminates the sequence.

    (A historical bug in the C compiler (now fixed) inserted a single arithmetic instruction between the return data save instruction and the first STFE. Some Acorn software allows for this.)

    The bit patterns given are for APCS-R/APCS-U register bindings. In the obsolete APCS-A bindings, the bit indicated by ! is 0.

    The optional instructions saving f4, f5, f6 and f7 correspond to those optional parts of the stack backtrace data structure that are present such that: for all M, if STFE fM is present then so is saved fM value; if STFE fM is absent then so is saved fM value.

  • At the instant when procedure A calls procedure B, the stack backtrace data structure pointed at by fp contains exactly those elements v1, v2, v3, v4, v5, v6, f4, f5, f6, f7, fp, sp and pc which must be restored into the corresponding ARM registers in order to cause a correct exit from procedure A, albeit with an incorrect result.
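
The statements above are enough to let a backtracing program work out which optional values a frame holds. The following C sketch, written for the APCS-R/APCS-U bit patterns only, is one way this might be done. The function names are hypothetical, and the sketch does not allow for the historical extra instruction mentioned above.

```c
#include <stdint.h>

/* Address of the return data save instruction, given the save mask
   pointer: clear bits 0, 1 and 26-31, then step back twelve bytes. */
uint32_t return_data_save_address(uint32_t save_mask_pointer)
{
    return (save_mask_pointer & 0x03FFFFFCu) - 12u;
}

/* 1 if insn has the APCS-R/APCS-U return data save bit pattern. */
int is_return_data_save(uint32_t insn)
{
    return (insn & 0xFFFFFC00u) == 0xE92DD800u;
}

/* Number of optional a1-a4 and v1-v6 values saved (bits 0-9). */
int saved_core_values(uint32_t insn)
{
    int count = 0;
    for (int bit = 0; bit < 10; bit++)
        count += (int)((insn >> bit) & 1u);
    return count;
}

/* Number of floating-point values saved by the (up to four) words that
   follow the return data save instruction. The STFE instructions must
   appear in the order f7, f6, f5, f4; anything else ends the sequence. */
int saved_fp_values(const uint32_t following[4])
{
    int count = 0;
    int highest_allowed = 7;

    for (int i = 0; i < 4; i++) {
        uint32_t insn = following[i];
        int reg = (int)((insn >> 12) & 7u);

        if ((insn & 0xFFFF8FFFu) != 0xED6D0103u)
            break;                       /* not STFE fN, [sp, #-12]! */
        if (reg < 4 || reg > highest_allowed)
            break;                       /* not f4-f7, or out of order */
        highest_allowed = reg - 1;
        count++;
    }
    return count;
}

/* Words in the backtrace structure: four mandatory words, one per saved
   core value and three per saved floating-point value. */
int backtrace_words(uint32_t insn, const uint32_t following[4])
{
    return 4 + saved_core_values(insn) + 3 * saved_fp_values(following);
}
```

The result of backtrace_words ranges from 4 (nothing optional saved) to 26 (all ten core values and all four floating-point values saved).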

Notes

The following example suggests what the entry and exit sequences for a procedure are likely to look like (though entry and exit are not defined in terms of these instruction sequences because that would be too restrictive; a good compiler can often do better than is suggested here):

entry   MOV     ip, sp
        STMDB   sp!, {argRegs, workRegs, fp, ip, lr, pc}
        SUB     fp, ip, #4
exit    LDMDB   fp, {workRegs, fp, sp, pc}^

Many apparent idiosyncrasies in the standard may be explained by efforts to make the entry sequence work smoothly. The example above is neither complete (no stack limit checking) nor mandatory (making arguments contiguous for C, for instance, requires a slightly different entry sequence; and storing argRegs on the stack may be unnecessary).

The workRegs registers mentioned above correspond to as many of v1 to v6 as this procedure needs in order to work smoothly. At the instant when procedure A calls any other, those workspace registers not mentioned in A's return data save instruction will contain the values they contained at the instant A was entered. Additionally, the registers f4-f7 not mentioned in the floating-point save sequence following the return data save instruction will also contain the values they contained at the instant A was entered.

This standard does not require anything of the values found in the optional parts a1, a2, a3, a4 of a stack backtrace data structure. They are likely, if present, to contain the saved arguments to this procedure call; but this is not required and should not be relied upon.
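
To show how a debugger or run-time error handler might follow the chain of backtrace structures, here is a minimal C sketch that walks a simulated memory image. It assumes frames built by the entry sequence suggested above, so that the return link value is at [fp, #-4] and the return fp value is at [fp, #-12]. The MemoryImage type and both functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A hypothetical memory image: count words starting at byte address base. */
typedef struct {
    uint32_t base;
    const uint32_t *words;
    size_t count;
} MemoryImage;

/* Read the word at addr; returns 0 if addr is unaligned or out of range. */
static int read_word(const MemoryImage *m, uint32_t addr, uint32_t *out)
{
    size_t index;

    if (addr < m->base || (addr & 3u) != 0)
        return 0;
    index = (addr - m->base) / 4;
    if (index >= m->count)
        return 0;
    *out = m->words[index];
    return 1;
}

/* Follow return fp values starting at fp, storing each return link value
   in links. Stops at fp == 0, at an unreadable frame, or after max
   frames. Returns the number of frames visited. */
size_t walk_backtrace(const MemoryImage *m, uint32_t fp,
                      uint32_t *links, size_t max)
{
    size_t n = 0;

    while (fp != 0 && n < max) {
        uint32_t link, next_fp;

        if (fp < 12 ||
            !read_word(m, fp - 4, &link) ||
            !read_word(m, fp - 12, &next_fp))
            break;
        links[n++] = link;
        fp = next_fp;
    }
    return n;
}
```

A real backtracer would also need to validate each frame (for example, by checking the return data save instruction) before trusting its contents.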

Defined bindings of the procedure call standard

APCS-R and APCS-U: The RISC OS and RISC iX PCSs

These bindings of the APCS are used by:

  • RISC OS applications running in ARM user-mode
  • compiled code for RISC OS modules and handlers running in ARM SVC-mode
  • RISC iX applications (which make no use of sl) running in ARM user mode
  • RISC iX kernels running in ARM SVC mode.

The call-frame register bindings are:

sl RN 10 ; stack limit / stack chunk handle
         ;  unused by RISC iX applications
fp RN 11 ; frame pointer
ip RN 12 ; used as temporary workspace
sp RN 13 ; lower end of current stack frame

Although not formally required by this standard, it is considered good taste for compiled code to preserve the value of sl everywhere.

The invariant sp > ip > fp has been preserved, in common with the obsolete APCS-A (described below), allowing symbolic assembly code (and compiler code-generators) written in terms of register names to be ported between APCS-R, APCS-U and APCS-A merely by relabelling the call-frame registers, provided:

  • When call-frame registers appear in LDM, LDR, STM and STR instructions they are named symbolically, never by register numbers or register ranges.
  • No use is made of the ordering of the four call-frame registers (eg in order to load/save fp or sp from a full register save).

APCS-R: Constraints on sl (For RISC OS applications and modules)

In SVC and IRQ modes (collectively called module mode) SL_LWM is implicit in sp: it is the next megabyte boundary below sp. Even though the SVC-mode and IRQ-mode stacks are not extensible, sl still points 512 bytes above a skeleton stack-chunk descriptor (stored just above the megabyte boundary). This is done for compatibility with use by applications running in ARM user-mode and to facilitate module-mode stack-overflow detection. In other words:

sl = SL_LWM + 512.

When used in user-mode, the stack is segmented and is extended on demand. Acorn's language-independent run-time kernel allows language run-time systems to implement stack extension in a manner which is compatible with other Acorn languages. sl points 512 bytes above a full stack-chunk structure and, again:

sl = SL_LWM + 512.

Mode-dependent stack-overflow handling code in the language-independent run-time kernel faults an overflow in module mode and extends the stack in application mode. This allows library code, including the run-time kernel, to be shared between all applications and modules written in C.

In both modes, the value of sl must be valid immediately before each external call and each return from an external call.

Deallocation of a stack chunk may be performed by intercepting returns from the procedure that caused it to be allocated. Tail-call optimisation complicates the relationship, so, in general, sl is required to be valid immediately before every return from an external call.

APCS-U: Constraints on sl (For RISC iX applications and RISC iX kernels)

In this binding of the APCS the user-mode stack auto-extends on demand so sl is unused and there is no stack-limit checking.

In kernel mode, sl is reserved by Acorn.

APCS-A: The obsolete Arthur application PCS

This obsolete binding of the procedure-call standard is used by Arthur applications running in ARM user-mode. The applicable call-frame register bindings are as follows:

sl RN 13 ; stack limit/stack chunk handle
fp RN 10 ; frame pointer
ip RN 11 ; used as temporary workspace
sp RN 12 ; lower end of current stack frame

(Use of r12 as sp, rather than the architecturally more natural r13, is historical and predates both Arthur and RISC OS.)

In this binding of the APCS, the stack is segmented and is extended on demand. Acorn's language-independent run-time kernel allows language run-time systems to implement stack extension in a manner which is compatible with other Acorn languages.

The stack limit register, sl, points 512 bytes above a stack-chunk descriptor, itself located at the low-address end of a stack chunk. In other words:

sl = SL_LWM + 512.

The value of sl must be valid immediately before each external call and each return from an external call.

Although not formally required by this standard, it is considered good taste for compiled code to preserve the value of sl everywhere.

Notes on APCS bindings
Invariants and APCS-M

In all future supported bindings of APCS sp shall be bound to r13. In all supported bindings of APCS the invariant sp > ip > fp shall hold. This means that the only other possible binding of APCS is APCS-M:

sl RN 12 ; stack limit/stack chunk handle
fp RN 10 ; frame pointer
ip RN 11 ; used as temporary workspace
sp RN 13 ; lower end of current stack frame

This binding of APCS is unlikely to be used (by Acorn).

Further Restrictions in SVC Mode and IRQ Mode

There are some consequences of the ARM's architecture which, while not formally acknowledged by the ARM Procedure Call Standard, need to be understood by implementors of code intended to run in the ARM's SVC and IRQ modes.

An IRQ corrupts r14_irq, so IRQ-mode code must run with IRQs off until r14_irq has been saved. Acorn's preferred solution to this problem is to enter and exit IRQ handlers written in high-level languages via hand-crafted 'wrappers' which, on entry, save r14_irq, change mode to SVC and enable IRQs and, on exit, return to the saved r14_irq (which also restores IRQ mode and the IRQ-enable state). Thus the handlers themselves run in SVC mode, avoiding this problem in compiled code.

Both SWIs and aborts corrupt r14_svc. This means that care has to be taken when calling SWIs or causing aborts in SVC mode.

In high-level languages, SWIs are usually called out of line so it suffices to save and restore r14 in the calling veneer around the SWI. If a compiler can generate in-line SWIs, then it should, of course, also generate code to save and restore r14 in-line, around the SWI, unless it is known that the code will not be executed in SVC mode.

An abort in SVC mode may be symptomatic of a fatal error or it may be caused by page faulting in SVC mode. Acorn expects SVC-mode code to be correct, so these are the only options. Page faulting can occur because an instruction needs to be fetched from a missing page (causing a prefetch abort) or because of an attempted data access to a missing page (causing a data abort). The latter may occur even if the SVC-mode code is not itself paged (consider an unpaged kernel accessing a paged user-space).

A data abort is completely recoverable provided r14 contains nothing of value at the instant of the abort. This can be ensured by:

  • saving r14 on entry to every procedure and restoring it on exit
  • not using r14 as a temporary register in any procedure
  • avoiding page faults (stack faults) in procedure entry sequences.

A prefetch abort is harder to recover from and an aborting BL instruction cannot be recovered, so special action has to be taken to protect page faulting procedure calls.

For Acorn C, r14 is saved in the second or third instruction of an entry sequence. Aligning all procedures at addresses which are 0 or 4 modulo 16 ensures that the critical part of the entry sequence cannot prefetch-abort. A compiler can do this by padding all code sections to a multiple of 16 bytes in length and being careful about the alignment of procedures within code sections.
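
The arithmetic behind this alignment rule can be checked with a short C sketch. It assumes, as an interpretation not stated in this appendix, that the aim is for the first three instructions of an entry sequence (at entry, entry+4 and entry+8) to lie within a single 16-byte block. Both function names are hypothetical.

```c
#include <stdint.h>

/* 1 if a procedure entry address is 0 or 4 modulo 16. */
int entry_alignment_ok(uint32_t entry)
{
    uint32_t remainder = entry & 15u;
    return remainder == 0u || remainder == 4u;
}

/* 1 if the instructions at entry, entry+4 and entry+8 all fall within
   the same 16-byte block. */
int first_three_share_block(uint32_t entry)
{
    return (entry >> 4) == ((entry + 8u) >> 4);
}
```

For word-aligned entry addresses the two functions agree: an entry at 8 or 12 modulo 16 pushes the third instruction into the next 16-byte block.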

Data-aborts early in procedure entry sequences can be avoided by using a software stack-limit check like that used in APCS-R.

Finally, the recommended way to protect BL instructions from prefetch-abort corruption is to precede each BL by a MOV ip, pc instruction. If the BL faults, the prefetch abort handler can safely overwrite r14 with ip before resuming execution at the target of the BL. If the prefetch abort is not caused by a BL then this action is harmless, as r14 has been corrupted anyway (and, by design, contained nothing of value at any instant a prefetch abort could occur).

Examples from Acorn language implementations

Example procedure calls in C
Here is some sample assembly code as it might be produced by the C compiler:

; gggg is a function of 2 args that needs one register variable (v1)
gggg    MOV     ip, sp
        STMFD   sp!, {a1, a2, v1, fp, ip, lr, pc}
        SUB     fp, ip, #4              ; points at saved PC
        CMPS    sp, sl
        BLLT    |x$stack_overflow|      ; handler procedure
        ...
        MOV     v1, ...                 ; use a register variable
        ...
        BL      ffff
        ...
        MOV     ..., v1                 ; rely on its value after ffff()

Within the body of the procedure, arguments are used from registers, if possible; otherwise they must be addressed relative to fp. In the two argument case shown above, arg1 is at [fp,#-24] and arg2 is at [fp,#-20]. But as discussed below, arguments are sometimes stacked with positive offsets relative to fp.
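
The offsets quoted above follow from the way a store-multiple places registers in memory: the lowest-numbered register goes to the lowest address, so the saved pc is the highest word and is the one fp points at. The C sketch below computes the offset from fp of any saved register, given the 16-bit register list of the STMFD. It assumes the entry sequence shown in the example; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset from fp of the saved value of register reg, for a frame
   built by: MOV ip, sp / STMFD sp!, {mask} / SUB fp, ip, #4.
   Bit N of mask is set if register N is in the register list. */
int saved_reg_fp_offset(uint16_t mask, int reg)
{
    int higher = 0;

    assert(reg >= 0 && reg <= 15);
    assert((mask >> reg) & 1u);          /* reg must be in the list */

    /* Each listed register numbered above reg sits one word higher. */
    for (int r = reg + 1; r <= 15; r++)
        if ((mask >> r) & 1u)
            higher++;
    return -4 * higher;
}
```

For gggg, the list {a1, a2, v1, fp, ip, lr, pc} under APCS-R is the mask 0xD813, giving -24 for a1 and -20 for a2, in agreement with the offsets stated above.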

Local variables are never addressed offset from fp; they always have positive offsets relative to sp. In code that changes sp this means that the offsets used may vary from place to place in the code. The reason for this is that it permits the procedure x$stack_overflow to recover by setting sp (and sl) to some new stack segment. As part of this mechanism, x$stack_overflow may alter memory offset from fp by negative amounts, eg [fp, #-64] and downwards, provided that it adjusts sp to provide workspace for the called routine.

If the function is going to use more than 256 bytes of stack it must do:

        SUB     ip, sp, #<my stack size>
        CMPS    ip, sl
        BLLT    |x$stack_overflow_1|

instead of the two-instruction test shown above.

If a function expects no more than four arguments it can push all of them onto the stack at the same time as saving its old fp and its return address (see the example above); arguments are then saved contiguously in memory with arg1 having the lowest address. A function that expects more than four arguments has code at its head as follows:

        MOV     ip, sp
        STMFD   sp!, {a1, a2, a3, a4}          ; put arg1-4 below stacked args
        STMFD   sp!, {v1, v2, fp, ip, lr, pc}  ; v1-v6 saved as necessary
        SUB     fp, ip, #20                    ; point at newly created call-frame
        CMPS    sp, sl
        BLLT    |x$stack_overflow|
        ...
        ...
        LDMEA   fp, {v1, v2, fp, sp, pc}^      ; restore register vars & return

The store of the argument registers shown here is not mandated by APCS and can often be omitted. It is useful in support of debuggers and run-time trace-back code and required if the address of an argument is taken.

The entry sequence arranges that arguments (however many there are) lie in consecutive words of memory and that on return sp is always the lowest address on the stack that still contains useful data.
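
A little arithmetic shows why the arguments end up contiguous. With ip holding the value of sp at entry, the first STMFD places a1-a4 in the four words immediately below the stacked arguments, and fp is then set to ip - 20. The C sketch below works this through for the sequence shown above; it is illustrative only and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset from fp of argument word I (counted from 1) after the
   entry sequence for a function of more than four arguments. */
int many_args_fp_offset(int i)
{
    assert(i >= 1);
    return 4 * i;      /* arg1 at [fp, #4], arg2 at [fp, #8], and so on */
}

/* Absolute address of argument word I, given sp at the instant of entry. */
uint32_t many_args_address(uint32_t entry_sp, int i)
{
    uint32_t fp = entry_sp - 20u;                 /* SUB fp, ip, #20 */
    return fp + (uint32_t)many_args_fp_offset(i);
}
```

For I >= 5 this gives entry_sp + 4*(I-5), which is where the caller placed the argument; for I <= 4 it gives the word stored by the first STMFD.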

The time taken for a call, enter and return, with no arguments and no registers saved, is about 22 S-cycles.

Although not required by this standard, the values in fp, sp and sl are maintained while executing code produced by the C compiler. This makes it much easier to debug compiled code.

Multi-word results other than double precision reals in C programs are represented as an implicit first argument to the call, which points to where the caller would like the result placed. It is the first, rather than the last, so that it works with a C function that is not given enough arguments.
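
The effect of this convention can be pictured in C itself. In the sketch below, make_triple is an ordinary function returning a three-word structure, and make_triple_lowered shows roughly how such a call is represented, with the result pointer as the first argument word. The type and both functions are hypothetical examples, not code produced by the compiler.

```c
/* A hypothetical three-word result. */
typedef struct {
    int x, y, z;
} Triple;

/* Source-level form: the function returns a multi-word structure. */
Triple make_triple(int a)
{
    Triple t = {a, a + 1, a + 2};
    return t;
}

/* Roughly how the call is represented: the caller passes the address at
   which it would like the result placed as an implicit first argument,
   so the declared argument a becomes the second argument word. */
void make_triple_lowered(Triple *result, int a)
{
    result->x = a;
    result->y = a + 1;
    result->z = a + 2;
}
```

Because the result pointer comes first, it is found in a1 regardless of how many further arguments the caller actually supplies.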

Procedure calls in other language implementations
Assembler

The procedure call standard is reasonably easy and natural for assembler programmers to use. The following rules should be followed:

  • Call-frame registers should always be referred to explicitly by symbolic name, never by register number or implicitly as part of a register range.
  • The offsets of the call-frame registers within a register dump should not be wired into code. Always use a symbolic offset so that you can easily change the register bindings.

Fortran

The Acorn/TopExpress Arthur/RISC OS Fortran-77 compiler violates the APCS in a number of ways that preclude inter-working with C, except via assembler veneers. This may be changed in future releases of the Fortran-77 product.

Pascal

The Acorn/3L Arthur/RISC OS ISO-Pascal compiler violates the APCS in a number of ways that preclude inter-working with C, except via assembler veneers. This may be changed in future releases of the ISO-Pascal product.

Lisp, BCPL and BASIC

These languages have their own special requirements which make it inappropriate to use a procedure call of the form described here. Naturally, all are capable of making external calls of the given form, through a small amount of assembler 'glue' code.

General

Note that there is no requirement specified by the standard concerning the production of re-entrant code, as this would place an intolerable strain on the conventional programming practices used in C and Fortran. The behaviour of a procedure in the face of multiple overlapping invocations is part of the specification of that procedure.

Various lessons

This appendix is not intended as a general guide to the writing of code-generators, but it is worth highlighting various optimisations that appear particularly relevant to the ARM and to this standard.

The use of a callee-saving standard, instead of a caller-saving one, reduces the size of large code images by about 10% (with compilers that do little or no interprocedural optimisation).

In order to make effective use of the APCS, compilers must compile code one procedure at a time. Line-at-a-time compilation is insufficient.

The preservation of condition codes over a procedure call is often useful because any short sequence of instructions (including calls) that forms the body of a short IF statement can be executed without a branch instruction. For example:

        if (a < 0) b = foo();

can compile into:

        CMP     a, #0
        BLLT    foo
        MOVLT   b, a1

In the case of a leaf or fast procedure - one that calls no other procedures - much of the standard entry sequence can be omitted. In very small procedures, such as are frequently used in data abstraction modules, the cost of the procedure can be very small indeed. For instance, consider:

        typedef struct {...; int a; ...} foo;
        int get_a(foo* f) {return(f->a);}

The procedure get_a can compile to just:

        LDR     a1, [a1, #aOffset]
        MOVS    pc, lr

This is also useful in procedures with a conditional as the top level statement, where one or other arm of the conditional is fast (ie calls no procedures). In this case there is no need to form a stack frame on the fast arm. For example, using this technique, the C program:

        int sum(int i)
        {
          if (i <= 1)
            return(i);
          else
            return(i + sum(i-1));
        }

could be compiled into:

sum     CMP     a1, #1 ; try fast case
        MOVLES  pc, lr ; and if appropriate, handle quickly!
        ; else, form a stack frame and handle the rest as normal code.
        MOV     ip, sp
        STMDB   sp!, {v1, fp, ip, lr, pc}
        SUB     fp, ip, #4             ; point at newly created call-frame
        CMP     sp, sl
        BLLT    overflow
        MOV     v1, a1                 ; register to hold i
        SUB     a1, a1, #1             ; set up argument for call
        BL      sum                    ; do the call
        ADD     a1, a1, v1             ; perform the addition
        LDMEA   fp, {v1, fp, sp, pc}^  ; and return

This is only worthwhile if the test can be compiled using only ip, and any spare of a1-a4, as scratch registers. This technique can significantly speed up certain speed-critical routines, such as read and write character. At the present time, this optimisation is not performed by the C compiler.

Finally, it is often worth applying the tail call optimisation, especially to procedures which need to save no registers. For example, the code fragment:

        extern void *malloc(size_t n)
        {
          return primitive_alloc(NOTGCABLEBIT, BYTESTOWORDS(n));
        }

is compiled by the C compiler into:

malloc  ADD     a1, a1, #3       ; 1S
        MOV     a2, a1, LSR #2   ; 1S
        MOV     a1, #1073741824  ; 1S
        B       primitive_alloc  ; 1N+2S = 4S

This avoids saving and restoring the call-frame registers and minimises the cost of interface 'sugaring' procedures. This saves five instructions and, on a 4/8MHz ARM, reduces the cost of the malloc sugar from 24S to 7S.
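
The arithmetic in the listing can be restated in C. The two macro definitions below are inferred from the compiled code (the constant loaded into a1, and the ADD #3 followed by LSR #2); they are assumptions, not definitions taken from the C library sources. The function name malloc_sugar_cost is likewise hypothetical, and the cycle costs are simply those annotated in the listing.

```c
/* Inferred from the compiled code above; treat both as assumptions. */
#define NOTGCABLEBIT     1073741824u         /* constant moved into a1 (1 << 30) */
#define BYTESTOWORDS(n)  (((n) + 3u) >> 2)   /* ADD #3, LSR #2: round up to words */

/* Cycle costs as annotated in the listing, in S-cycle units. */
enum {
    COST_DATA_OP = 1,   /* ADD or MOV: 1S                              */
    COST_BRANCH  = 4    /* B: 1N + 2S, counted as 4S in the listing    */
};

/* Total cost of the tail-calling malloc veneer: three data operations
   followed by one branch. */
unsigned malloc_sugar_cost(void)
{
    return 3 * COST_DATA_OP + COST_BRANCH;
}
```

The total agrees with the 7S figure quoted above.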

This edition Copyright © 3QD Developments Ltd 2015
Last Edit: Tue,03 Nov 2015