UC-SLS Lecture 18 : Using C to write and organize opcode bytes - Functions
18. UC-SLS Lecture 18 : Using C to write and organize opcode bytes - Functions#
create a directory
mkdir cfuncs; cd cfuncs
copy examples
add a Makefile to automate assembling and linking
we are going to run the commands by hand this time to highlight the details
add our setup.gdb to make working in gdb easier
normally you would want to track everything in git
18.1. Opcodes and C#
When we write assembly code we are free to lay out our opcodes and use registers in any way we like.
We can place labels anywhere in our opcodes
We can specify a jump to any arbitrary location
While we can use processor support for passing the return address via instructions like call and ret, we are not required to
We are not forced to use the registers in any particular way
18.1.1. C Standardizes how to organize and write opcodes#
It's all about standardizing how things are done
this way code written by different people or tools can inter-operate
there are rules they can rely on
“C” forces us to decompose and organize opcodes into “functions”
global label - single entry point
block of opcodes ending in a “return”
standardizes use of registers
standardizes use of stack
call frames - automatic storage of locals
separation into declaration (many) and definition (one)
compiler can get smart and optimize functions and variables away
in-line
dead-code elimination
register only variables
18.1.2. C Standardizes how to organize and write opcodes#
Overall summary
“C” forces us to decompose and organize opcodes into “functions”
“C” functions:
Have a unique global label that identifies a single starting address for the function
the label is formed from the function’s name which must conform to certain rules
Form a contiguous block of memory that does not overlap with another function or data
Therefore they have a clear size in bytes that spans their first opcode to the last
Are written so that there is a standard way for passing arguments to them
a fixed way for passing arguments, e.g. the basics on x86_64 (System V ABI, integer and pointer arguments)
arg1 \(\rightarrow\) rdi
arg2 \(\rightarrow\) rsi
arg3 \(\rightarrow\) rdx
arg4 \(\rightarrow\) rcx
arg5 \(\rightarrow\) r8
arg6 \(\rightarrow\) r9
the rest are pushed on the stack in reverse order (arg7 is the last to be pushed)
a fixed way for passing a return value, e.g. the basics on x86_64
return value \(\rightarrow\) rax
details for cpu and OS are in specification documents eg:
Execution of a function must end with a return to the next instruction after the call
e.g. on x86_64 call and ret are used
Support local variables that are automatically managed
eg. on x86_64 the processor stack is used
each call to a function creates a new stack frame
each frame represents a call to a function
the frame contains a version of the local variables for that call
thus each call has its own locals
when a function returns to its caller the stack frame for the call is popped
thus supporting recursion
Separates declaration from definition
declaration: specifies only the function's name, parameter types, and return type; it does not define any opcodes
the function declaration is needed to generate the assembly code to call the function
given the rules above for how arguments are passed and a unique entry point
compiler can generate the assembly code for a call with the declaration
does not need the body
declarations are placed in “header” files.
definitions are placed in “c” files.
definition: repeats the declaration but includes a body that defines the function's opcodes
there can only be one of these
A compiler is allowed to “inline” a function if certain optimizations are enabled and criteria are met
there are times in which the overhead of calling a function is not worth it
it is better to just create a version of the opcodes in place where the call is being made
inlined
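The argument and return-value rules in the summary above can be made concrete with a function that takes more arguments than there are argument registers. This is only a sketch, assuming the System V x86_64 calling convention used on Linux, where integer arguments go in rdi, rsi, rdx, rcx, r8, r9 in that order; the names weigh8 and call_weigh8 are made up for illustration.

```c
/* Assumes the System V x86_64 ABI; weigh8 is a hypothetical name. */
long weigh8(long a1, long a2, long a3, long a4,
            long a5, long a6, long a7, long a8)
{
    /* a1..a6 arrive in rdi, rsi, rdx, rcx, r8, r9.
       a7 and a8 do not fit in registers and arrive on the stack. */
    return 1*a1 + 2*a2 + 3*a3 + 4*a4 + 5*a5 + 6*a6 + 7*a7 + 8*a8;
    /* the result leaves in rax */
}

long call_weigh8(void)
{
    /* The caller places the stack arguments in reverse order,
       so a7 ends up at the lower address, closest to the return address. */
    return weigh8(1, 2, 3, 4, 5, 6, 7, 8);
}
```

Each argument is multiplied by a different weight so that any mix-up in argument order would change the result. Compiling with gcc -S should show the first six values loaded into registers and the last two written to the stack before the call.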
18.1.3. Let’s start at the beginning#
we will use the compiler and our ability to read assembly code to learn how “C” works
18.1.3.1. Function Name \(\rightarrow\) global label for its Entry point (start of its opcode)#
C:
void myfunc(void) {}
Assembly:
.file "myfunc0.c"
.intel_syntax noprefix
.text
.globl myfunc
.type myfunc, @function
myfunc:
ret
.size myfunc, .-myfunc
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
function name “myfunc” introduces a global label myfunc in the text section
return type prefixes function
void is the “no” type
a void return type means that the function does not return anything
parentheses after the function name, ( and ), delimit the parameter list
a void in the parameter list means the function takes no parameters
{ and } delimit the body
a set of statements that will be converted into opcodes
implicitly every function has at least one statement: return
if not written the compiler assumes one
generates instructions to return to the caller
X86: ret
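Because the function name is just a label for the entry point, C lets us treat the name as an address. The sketch below reuses the empty myfunc from above; the typedef void_fn and the helpers entry_point and call_through_label are hypothetical names added for illustration.

```c
void myfunc(void) {}                /* same empty function as above */

typedef void (*void_fn)(void);      /* pointer to a function like myfunc */

void_fn entry_point(void)
{
    return myfunc;                  /* the bare name evaluates to the entry address */
}

int call_through_label(void)
{
    void_fn f = entry_point();
    f();                            /* indirect call to the address of the label */
    return f == &myfunc;            /* 1: myfunc and &myfunc name the same address */
}
```

The call through f only comes back because of the implicit return at the end of myfunc's empty body.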
18.1.3.2. Calling a function#
C:
__attribute__ ((noinline))
int funcA(void) {
return 7;
}
int funcB(void) {
return 3 + funcA();
}
Assembly
.file "myfunc1.c"
.intel_syntax noprefix
.text
.globl funcA
.type funcA, @function
funcA:
mov eax, 7
ret
.size funcA, .-funcA
.globl funcB
.type funcB, @function
funcB:
call funcA
add eax, 3
ret
.size funcB, .-funcB
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
noinline and the use of the return value keep the compiler from optimizing the call away
exactly what we expected in assembly right?
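To see why noinline was needed, it can help to put both variants side by side. This is a sketch using the same GCC-style attribute as the example above; the function names are made up, and what the compiler actually emits depends on the optimization level.

```c
/* GCC/Clang-specific attribute: always emit a real call to this function. */
__attribute__((noinline))
int seven_called(void) { return 7; }

/* static inline: the compiler may paste the body into the caller when optimizing. */
static inline int seven_inlined(void) { return 7; }

int sum_with_call(void)   { return 3 + seven_called(); }   /* expect: call, then add */
int sum_with_inline(void) { return 3 + seven_inlined(); }  /* may fold to a constant */
```

Both functions compute the same value. With optimization enabled, sum_with_inline is likely to contain no call at all, while sum_with_call should keep the call and add pair seen in funcB.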
18.1.4. Interlude : _start vs main()#
As we have seen the linker marks where execution should begin in our binary via the _start symbol
However when we are writing ‘C’ we normally do not write raw assembly
all our code is in functions
In the last lectures we wrote our own _start in an assembly file that called our C generated assembly
and linked it by hand, avoiding all the defaults
However normally we don’t do this.
the C compiler comes with some startup code along with the standard C library of functions
this code is usually in object files with names of the form crt*.o
The “c” runtime: a bunch of code that runs before the code you write
runs setup code for you, initializing the c library and other aspects
when done calls the main function, passing in some standard parameters: argc, argv and, on Unix, envp
use -v to see it get added
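The parameters that the startup code hands to main are ordinary C values: a count and two NULL-terminated arrays of strings. The sketch below uses a stand-in named app_main (a hypothetical name, so that it can be called like any other function) with the same three parameters; count_entries is also a made-up helper.

```c
#include <stddef.h>

/* Hypothetical helper: counts entries in a NULL-terminated vector such as argv or envp. */
int count_entries(char *const vec[])
{
    int n = 0;
    while (vec[n] != NULL)
        n++;
    return n;
}

/* Stand-in for main with the Unix-style three parameters. */
int app_main(int argc, char **argv, char **envp)
{
    (void)envp;                           /* unused in this sketch */
    return count_entries(argv) == argc;   /* argv[argc] is guaranteed to be NULL */
}
```

The runtime builds these arrays before calling main, which is why main can rely on argv ending with a NULL pointer.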
So let's write a main and use gcc to link it in the “normal” way
C:
int main() {
myfunc();
return 0;
}
What’s the fix?
18.1.4.1. Function Declarations vs Definitions#
When C encounters code that calls another function
it cannot know how to generate the assembly for the call
unless it knows
name
arguments: type and order
and return value type
a function declaration is exactly this - just the signature of the function
does not generate any code itself just allows calls to be correctly created
a single definition either in the same file or another that will end up in a .o must exist
linker will stitch them together
Add a declaration of myfunc that matches its definition to the main.c file
Normally we would put this into a header file, e.g. myfunc0.h
this way any file in which we want to call myfunc would simply include the header
C:
void myfunc(void);
int main() {
myfunc();
return 0;
}
Assembly
.file "main0.c"
.intel_syntax noprefix
.text
.section .text.startup,"ax",@progbits
.globl main
.type main, @function
main:
push rax
call myfunc
xor eax, eax
pop rdx
ret
.size main, .-main
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
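The "many declarations, one definition" rule can be tried in a single file. In this sketch the names scale and call_scale are hypothetical; in a real project the declaration would live in a header and the definition in exactly one .c file.

```c
/* Declarations: name, parameter types, return type. No opcodes are generated. */
int scale(int x, int factor);
int scale(int x, int factor);       /* repeating a matching declaration is legal */

/* The compiler can generate this call knowing only the declaration above. */
int call_scale(void)
{
    return scale(7, 3);
}

/* The single definition: repeats the signature and adds the body. */
int scale(int x, int factor)
{
    return x * factor;
}
```

If the definition were moved to another file, call_scale would still compile unchanged, and the linker would connect the call to the scale label.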
Let's add the verbose flag so that we can see what is really going on
Much bigger than we might have expected
Why so big?
All the extra stuff
We have suppressed dynamic link/loading
Normally we just take all this on faith, but since we now know how, let’s look at what all this stuff is, at least this once
18.1.5. Variables declared in body of function are function local#
If the compiler needs memory for a local variable then it adds it to the stack frame for the call
C:
__attribute__ ((noinline))
void myfunc2(long long *i)
{
*i += 1;
}
long long myfunc(void) {
long long i=(long long)&myfunc;
myfunc2(&i);
return i;
}
Assembly
.file "myfunc2.c"
.intel_syntax noprefix
.text
.globl myfunc2
.type myfunc2, @function
myfunc2:
inc QWORD PTR [rdi]
ret
.size myfunc2, .-myfunc2
.globl myfunc
.type myfunc, @function
myfunc:
sub rsp, 16
mov QWORD PTR [rsp+8], OFFSET FLAT:myfunc
lea rdi, [rsp+8]
call myfunc2
mov rax, QWORD PTR [rsp+8]
add rsp, 16
ret
.size myfunc, .-myfunc
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
must play some games to get the compiler to create a local with such simple functions
notice what this code is doing?
sub rsp, 16: what is this?
what is rsp + 8?
Local variables have no fixed location in memory!
PS this code is dangerous and not something you would normally do
but C lets you cheat if you want to!
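One way to see that a local has no fixed location is to record where it lives in several nested calls of the same function. This is a sketch; sum_depths, seen_at, and frames_are_distinct are hypothetical names, and volatile is used only to force the local into memory, much like the games played above.

```c
#include <stdint.h>

#define DEPTHS 3
uintptr_t seen_at[DEPTHS];          /* address of the local, one slot per call depth */

/* Only meant to be started with depth 0. */
long sum_depths(int depth)
{
    volatile long local = depth;    /* volatile keeps it in this call's stack frame */
    seen_at[depth] = (uintptr_t)&local;
    long below = (depth + 1 < DEPTHS) ? sum_depths(depth + 1) : 0;
    return local + below;           /* local is still alive while the deeper calls run */
}

int frames_are_distinct(void)
{
    sum_depths(0);
    return seen_at[0] != seen_at[1] &&
           seen_at[1] != seen_at[2] &&
           seen_at[0] != seen_at[2];
}
```

All three copies of local are alive at the same time, so each one must occupy its own slot. On x86_64 the deeper calls would typically show lower addresses, since the stack grows downward.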
18.1.6. LEA#
18.1.7. Passing arguments and simple return value#
C:
__attribute__ ((noinline))
int func2(int x, int y)
{
return x + y;
}
int func1(int x)
{
return func2(x,2);
}
Assembly
.file "myfunc3.c"
.intel_syntax noprefix
.text
.globl func2
.type func2, @function
func2:
lea eax, [rdi+rsi]
ret
.size func2, .-func2
.globl func1
.type func1, @function
func1:
mov esi, 2
jmp func2
.size func1, .-func1
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
What is lea being used for here?
Why are eax, rdi and rsi (esi) being used the way they are?
LEA
The compiler is smart and figures out that it can use lea for arithmetic of the form base + index*scale + displacement that does not require a flags update
Why these registers?
But remember the real truth of the rules for any given CPU and OS is in a standard somewhere: https://www.uclibc.org/docs/psABI-x86_64.pdf
3.2.3 Parameter Passing p17
Figure 3.4: Register Usage p21
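The lea trick noted above is not limited to x + y. The functions below are a sketch of expressions that fit the base + index*scale + displacement pattern; the names are made up, and the assembly in the comments is only what a compiler may choose at higher optimization levels, not a guarantee.

```c
/* Each expression fits the addressing form base + index*scale + displacement. */
int add2(int x, int y)       { return x + y; }          /* e.g. lea eax, [rdi+rsi]     */
int times5(int x)            { return x * 5; }          /* e.g. lea eax, [rdi+rdi*4]   */
int scaled_sum(int x, int y) { return x + 4 * y + 7; }  /* e.g. lea eax, [rdi+7+rsi*4] */
```

Compiling with gcc -O2 -S -masm=intel and reading the output is the way to check what was actually generated on a given machine.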
18.2. An example: Trace this code and visualize the stack#
C: myadd.c
long myadd(long *x_ptr, long val)
{
long x = *x_ptr;
long y = x + val;
*x_ptr = y;
return x;
}
Assembly
.file "myadd.c"
.intel_syntax noprefix
.text
.globl myadd
.type myadd, @function
myadd:
mov rax, QWORD PTR [rdi]
add rsi, rax
mov QWORD PTR [rdi], rsi
ret
.size myadd, .-myadd
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
C: myadd.h
#ifndef __MY_ADD_H__
#define __MY_ADD_H__
long myadd(long *x_ptr, long val);
#endif // __MY_ADD_H__
C: callmyadd.c
#include "myadd.h"
long call_myadd(void) {
long a = 15214;
long b = myadd(&a, 5001);
return a+b;
}
long
main() {
return call_myadd();
}
Assembly
.file "callmyadd.c"
.intel_syntax noprefix
.text
.globl call_myadd
.type call_myadd, @function
call_myadd:
sub rsp, 24
mov esi, 5001
mov QWORD PTR [rsp+8], 15214
lea rdi, [rsp+8]
call myadd
add rax, QWORD PTR [rsp+8]
add rsp, 24
ret
.size call_myadd, .-call_myadd
.section .text.startup,"ax",@progbits
.globl main
.type main, @function
main:
jmp call_myadd
.size main, .-main
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",@progbits
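Before tracing the assembly it can help to work out the values the C code should produce. The sketch below repeats myadd from above and adds a hypothetical struct myadd_trace and function trace_call_myadd that record each intermediate value of call_myadd.

```c
long myadd(long *x_ptr, long val)   /* same as myadd.c above */
{
    long x = *x_ptr;
    long y = x + val;
    *x_ptr = y;
    return x;
}

/* Hypothetical record of the values seen while running call_myadd. */
struct myadd_trace { long a_before; long b; long a_after; long result; };

struct myadd_trace trace_call_myadd(void)
{
    struct myadd_trace t;
    long a = 15214;                 /* lives at [rsp+8] in the assembly above */
    t.a_before = a;
    t.b = myadd(&a, 5001);          /* returns the old value of a, in rax */
    t.a_after = a;                  /* myadd stored 15214 + 5001 through the pointer */
    t.result = a + t.b;             /* add rax, QWORD PTR [rsp+8] */
    return t;
}
```

So myadd returns 15214, a becomes 20215, and call_myadd returns 35429. These are the values to look for in rax and at rsp+8 while stepping through in gdb.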