This is a development environment where only the paranoid can survive.
– John Regehr [1]
I assume you know the basics of C.
This talk is not exhaustive because there’s tons of little fun things that are hard to fit in a talk.
Also, if you use them, you’re literally Satan and are either a cool person or an extreme violator of the psychopath rule.
Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis. Denying the existence of pointers is like living in ancient Greece and denying the existence of Krackens and then being confused about why none of your ships ever make it to Morocco, or Ur-Morocco, or whatever Morocco was called back then. Pointers are like Krackens — real, living things that must be dealt with so that polite society can exist.
– James Mickens [2]
System Programming within a SMALL team where cohesion can be maintained.
C is difficult to manage in the interface between components. It has no tools to enforce or detect cohesion.
The bigger the project/team the more components you have. The more components you have the harder it is to maintain cohesion.
Also, if you’re not doing systems programming, use Python or something…
Ye olde C from a bygone era…
void blah(a)
int a;
{
...
}
K&R still has its place in history but we’ve come a long way since…
This is our base that’s supported mostly everywhere.
… unless you do embedded, in which case… good luck?
Adds a memory model! Adds a few quality of life features.
This is what I target.
The gap between what you can do and what you’re allowed to do.
Usually manifests during optimization.
Compiler assumes that something is not possible and makes optimization decisions as a result.
Famous Linux Kernel example…
int blah(int *ptr)
{
int x = *ptr; // accessing ptr would be undefined behaviour if NULL
// therefore ptr can't be NULL
if (ptr == NULL) return -1; // we just proved that ptr can't be NULL
// therefore this `if` is a noop and can be removed.
return *ptr + 5; // kaboom boom
}
Absolutely horrible code. Never write shit like this.
Newer versions of gcc and clang will warn on this.
Never assume consistent handling across compiler versions.
What works now may not work in a year or when targeting a different arch.
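For contrast, here is a rough sketch of the same function with the check moved before the dereference, so the compiler has nothing to reason away. The name blah_checked and the out parameter are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical rewrite of blah(): validate first, dereference second. */
int blah_checked(const int *ptr, int *out)
{
    if (ptr == NULL || out == NULL) return -1; /* nothing has been dereferenced yet */
    *out = *ptr + 5;
    return 0;
}
```

Returning a status code and writing the result through out keeps the error value from colliding with a legitimate result.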
Why not have the compiler detect it?
Many classes of undefined behaviours can only be detected at runtime.
Runtime checks are expensive.
Many have tried to specify safe C.
C is used in too many use cases where a safe subset that satisfies everyone simply doesn’t exist.
Where does that leave us?
Learn your undefined behaviours and be very paranoid.
Implementation defined means the compiler chooses one option and stays consistent with its decision.
Different compilers or architectures may choose differently.
Much more rare and usually has to do with data or memory layout.
Example: struct padding
Single most common source of undefined behaviour:
int a;
int b = a;
Trying to access uninitialized memory is undefined behaviour.
Also applies to heap allocations:
int *a = malloc(sizeof(*a));
int b = *a;
malloc doesn’t initialize memory so this is also undefined behaviour.
I recommend the use of calloc to initialize everything to 0.
This is something that compilers can catch sometimes.
clang-tidy can also catch some instances statically.
valgrind can catch some of these dynamically.
valgrind can’t catch problems in code paths not executed.
Just initialize everything.
Even when it’s obviously safe.
There’s no way for you to know how the code will evolve.
Just be safe by default.
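A minimal sketch of what "initialize everything" looks like in practice, for both the heap and the stack. struct point and the two function names are invented for the example.

```c
#include <stdlib.h>

struct point { int x, y; };

/* Heap: calloc hands back zeroed memory, so reading it is well defined. */
struct point *point_new(void)
{
    return calloc(1, sizeof(struct point));
}

/* Stack: `= {0}` zero-initializes every member of the struct. */
int point_sum_default(void)
{
    struct point p = {0};
    return p.x + p.y;
}
```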
One of the worst parts about C…
… that C++ decided to inherit wholesale…
sizeof(int)
What does that return?
IT DEPENDS! Oh what joy!
int is defined as being AT LEAST 16 bits [3]
It will generally be 32 bit on most systems.
How many int types are there?
[signed|unsigned] [long] [char|short|int|long] [int]
… or kinda… who cares… it’s a mess…
short
short int
signed short
signed short int
are all equivalent to
short int
Screw all of that…
#include <stddef.h>
#include <stdint.h>
These headers are your friends.
#include <stddef.h>
size_t a;
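ssize_t b; // POSIX, declared in <sys/types.h> rather than <stddef.h>
size_t is defined in the standard to be big enough to index any element in an array.
On mainstream platforms that’s another way of saying that it’s big enough to hold any memory address.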
This is my default and what you want 90% of the time.
#include <stdint.h>
uint8_t a = UINT8_MAX;
int_fast16_t b = INT_FAST16_MAX;
Allows you to better declare your exact requirements from the int and have explicit bounds.
char is just an int type.
Whether plain char is signed is implementation defined, and on many platforms (x86 included) it’s signed.
Yes… you read that right, chars are often signed by default.
To express the notion of a byte you can do:
unsigned char x;
But semantically that’s insane…
I recommend:
uint8_t x;
They’re equivalent on any platform you’re likely to meet, but the latter has a stronger semantic meaning.
Questions about why not void* are deferred to later.
Signed ints are prone to undefined behaviours…
int a = INT_MAX + 1;
int b = INT_MIN - 1;
int c = 1 << 31; // shift into sign bit (with a 32-bit int)
Equivalent expressions for unsigned are all well defined.
As a result I tend to prefer unsigned unless I need signed expressions.
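When signed math is unavoidable, one way to stay out of trouble is to test the bounds before doing the addition. A small sketch, with add_checked being a made-up helper name:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: returns false instead of overflowing. */
bool add_checked(int a, int b, int *out)
{
    if (b > 0 && a > INT_MAX - b) return false; /* a + b would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b) return false; /* a + b would go below INT_MIN */
    *out = a + b;
    return true;
}
```

The comparisons themselves never overflow, which is the whole point: the check has to be done with values that are known to be in range.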
uint32_t a = ((uint32_t) 1) << 32; // narrower types get promoted to int before the shift
Shifting by the width of the type or more is undefined so check your bounds.
Unsuffixed literals are ints by default! This means they’re signed!
1 << 48;
This is technically undefined behaviour but most compilers will catch it and tell you to fix your shit…
There are specifiers you can use to change the type of a literal:
1U; // unsigned
1L; // long
1UL; // unsigned long
1ULL; // unsigned long long (at least 64 bits)
As a fun exercise, replace the 1 by 0xF.
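Putting the last two points together, a sketch of a shift helper that uses a 64-bit literal and checks the shift amount first. bit64 is a hypothetical name; UINT64_C comes from stdint.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: set *out to a 64-bit value with only bit n set. */
bool bit64(unsigned n, uint64_t *out)
{
    if (n >= 64) return false; /* shifting by the full width or more is undefined */
    *out = UINT64_C(1) << n;   /* the literal is already 64 bits wide before the shift */
    return true;
}
```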
ints in C are a minefield.
This is your basic primitive type and they’re not even safe!
Well it only gets worse because of…
The second worst part of C
The source of SO many security problems
const char *a = "bob";
Equivalent (or close enough):
const char a[] = {'b', 'o', 'b', 0};
That 0 at the end is the source of so much woe and misery in the world.
All these functions assume that you have a trailing 0.
strlen(str);
strdup(str); // don't forget to call free...
strcmp(s1, s2);
strcpy(dst, src);
strcat(dst, src);
strtok(str, delim); // uses global state ?!
sprintf(str, format, ...);
vsprintf(str, format, args);
If it’s missing? It’ll just keep reading memory until it finds one. Or else…
What’s that? Your string is on the stack?
It would be a shame if someone were to write a very specific number in your return address for your function…
Oh no… Is your string on the heap?
It would be a shame if someone wrote random bytes to completely unrelated data structures that you’ll only find out way after the str function finishes…
**ALWAYS** default to the strn family
strnlen(str, len);
strndup(str, len); // don't forget to call free...
strncmp(s1, s2, len);
strncpy(dst, src, len);
strncat(dst, src, len);
snprintf(str, len, format, ...);
vsnprintf(str, len, format, args);
strn variants ensure that no more than len bytes will be read or written.
Get in the habit of using them even if obviously correct otherwise.
Don’t be that guy…
Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
– man strncpy
WHY?!
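If you’re stuck with strncpy, one common workaround is to wrap it so the terminator is always written. A sketch, with copy_str being an invented helper and len being the full size of dst:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most len - 1 bytes and always terminate dst. */
size_t copy_str(char *dst, const char *src, size_t len)
{
    if (len == 0) return 0;     /* no room for even the terminator */
    strncpy(dst, src, len - 1); /* may leave dst unterminated... */
    dst[len - 1] = '\0';        /* ...so terminate it by hand */
    return strlen(dst);         /* safe now that dst is terminated */
}
```

Truncation is silent here; a caller that cares can compare the return value against the length of the source.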
Third time’s the charm…?
strlcpy(dst, src, len);
You need to link with -lbsd and make sure you have libbsd installed.
urg… you had to remember…
strtok_r(str, delim, saveptr);
_r means that the function is reentrant.
You know you fucked up when you have to specify that for a string function.
saveptr is the state that is used in between calls.
What could possibly go wrong?!
int atoi(const char *)
long atol(const char *)
long long atoll(const char *)
double atof(const char *)
Oh look, same problem as all the other basic string functions.
Also, f stands for double. Don’t question it.
int sscanf(const char *str, const char *fmt, ...)
10 pound gorilla, meet tiny tiny tiny nail! If you don’t believe me, read the man page.
Also… no size!
double strtod(const char *nptr, char **endptr);
float strtof(const char *nptr, char **endptr);
long double strtold(const char *nptr, char **endptr);
unsigned long int strtoul(const char *nptr, char **endptr, int base);
unsigned long long int strtoull(const char *nptr, char **endptr, int base);
A bit better but still lots of problems…
long long strtonum(const char *nptr, long long minval, long long maxval, const char** err)
It’s an improvement with implicit bound checking via min/max val.
Could be better though.
Yeah, so… There’s nothing very good.
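To give an idea of how much ceremony is involved, here is a sketch of a careful wrapper around strtol, the signed sibling of the strto functions listed earlier. parse_long is a made-up name and the policy (reject empty input, trailing garbage and out-of-range values) is my assumption of what "careful" means.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a base-10 long, rejecting anything suspicious. */
bool parse_long(const char *str, long *out)
{
    char *end = NULL;
    errno = 0;                         /* strtol only sets errno on failure */
    long val = strtol(str, &end, 10);
    if (end == str) return false;      /* no digits were consumed */
    if (*end != '\0') return false;    /* trailing garbage after the number */
    if (errno == ERANGE) return false; /* value doesn't fit in a long */
    *out = val;
    return true;
}
```

Note that it still relies on str being terminated, so it doesn’t fix the "no size" problem.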
The other direction is a whole different set of functions, but also a whole pile of meh.
Won’t continue down this rabbit hole as you’re probably all asleep by now.
The C standard library is insane and out to get you.
Period. No questions. That’s just the reality.
NEVER EVER use a function from the standard lib without having read its manpage first.
I still regularly recheck man pages for basic functions like strnlen.
Just to be sure…
The humble array:
size_t array[5] = {0};
The line is blurry because arrays implicitly decay to pointers.
size_t arr[5];
size_t x = arr[1];
size_t y = *(arr + 1);
Both lines do the same thing because C will implicitly convert arr to a pointer, at which point pointer arithmetic takes over.
And lo, much confusion was spread throughout the kingdom. Hooray!
In practice, FOR UNIDIMENSIONAL ARRAYS, you can use either.
static size_t arr[5];
When declared outside of a function they live in special pre-allocated segments of memory. Read up on linkers if you’re really bored.
void foo(void)
{
size_t arr[5];
}
When declared in a function, they live on the stack.
C stacks are usually 8MB (the Linux default) and there are no safety nets so…
size_t arr[1ULL << 48];
Who wants an alternative addressing system based on the arbitrary location of your function stack pointer? I do!
Although you’re most likely to get an angry compiler or a face full of segfaults.
If you want to be all fancy and have dynamic sizes
void bob(size_t len)
{
size_t *arr = alloca(len * sizeof(*arr));
}
alloca actually just increments the stack pointer and then returns it.
alloca also doesn’t initialize the memory so you gotta do that yourself.
Calling free on this pointer results in hilarity…
For me…
Watching you suffer…
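Why yes title! C99 standardized variable-length arrays as an alternative to the non-standard alloca (C11 then made them optional)
void bob(size_t len)
{
size_t arr[len]; // a VLA can't take an initializer
memset(arr, 0, sizeof(arr)); // so zero it by hand (needs <string.h>)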
}
Such Clean. Much Wow.
For heap allocations
size_t *arr = calloc(10, sizeof(*arr));
That’s one pointer you do want to free.
There’s a few more things going on here but I have a whole section on pointers so be patient.
Arrays are just as vulnerable as strings to bounds check problems
No null byte shenanigans here so people tend to pass sizes around much more.
void bob(size_t *arr, size_t len);
size_t arr[10];
bob(arr, sizeof(arr) / sizeof(*arr));
Every array parameter should be paired with a length parameter.
That length parameter should also be checked religiously.
Do it always as a defense in depth mechanism.
Never leave home without your tin foil hat.
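A tiny sketch of the tin foil hat in action: an accessor that takes the length alongside the array and refuses anything out of range. arr_get is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accessor: every read is checked against the length. */
bool arr_get(const size_t *arr, size_t len, size_t index, size_t *out)
{
    if (arr == NULL || out == NULL) return false;
    if (index >= len) return false; /* out of bounds, refuse */
    *out = arr[index];
    return true;
}
```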
memset(ptr, byte, len);
memcmp(s1, s2, len);
memcpy(dst, src, len);
memmove(dst, src, len);
These are all surprisingly sane and work as expected.
A true novelty for C standard libraries.
In fact, I don’t trust it…
Something must be wrong…
Turns out there’s one catch…
memcpy: dst and src may NOT overlap
memmove: dst and src may overlap
There are performance implications because of this.
Generally you’ll use memcpy even if it’s not safe by default.
99.9% of your use cases won’t deal with overlapping memory.
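The remaining 0.1% tends to look like this: shuffling data around inside a single buffer. A sketch where source and destination overlap, so memmove is the one to reach for. shift_right is an invented helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift every byte one slot to the right, in place.
   buf[0] keeps its old value and the last byte falls off the end. */
void shift_right(uint8_t *buf, size_t len)
{
    if (len < 2) return;
    memmove(buf + 1, buf, len - 1); /* ranges overlap, so memcpy is off the table */
}
```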
The actual declaration for memset is
void *memset(void *s, int c, size_t n);
See where int is the type of the second parameter?
Only the least-significant 8 bits will be read and used.
Even the man page calls it a byte… but it’s an int…
Why?
Because. Now stop asking questions that make sense.
What about them…
urg… why…?
FINE!
They’re an all around pain to use.
I would probably need another 10 slides to go over them.
Instead I decided to have a fun dialogue with myself in front of 20 people.
Reality is that for systems programming, I never use them.
As a result I don’t have as much experience or opinions about them.
And that’s where we’ll leave it.
Access to raw pointers is one of the main and best reasons to use C.
It’s also the easiest way to foot-gun yourself.
size_t * a, b;
is actually equivalent to:
size_t *a; size_t b;
Why? Because!
It gets even more fun with const.
To be safe, just do a single declaration per line.
void * mem = mmap(NULL, len, prot, flags, fd, 0);
assert(mem != MAP_FAILED);
struct header *hdr = mem;
hdr->...
C allows you to decide what the type of random pieces of memory is.
Quite handy when manipulating files, IPC mechanisms, network packets, etc.
WARNING: struct layouts are implementation defined so use with care.
void * mem = something();
((uint64_t *) mem) + 2;
((struct bob *) mem) + 3;
Is equivalent to:
void * mem = something();
(uint64_t *) (((uintptr_t) mem) + (sizeof(uint64_t) * 2));
(struct bob *) (((uintptr_t) mem) + (sizeof(struct bob) * 3));
void *mem = something();
mem + 1;
What does that do?
Boils down to what does this return:
sizeof(void);
The standard says that void doesn’t have a size, so arithmetic on void pointers isn’t valid standard C…
It’s so common in practice that gcc and friends all do the following:
sizeof(void) == 1;
It’s usually used to index raw binary data so this behaviour makes sense.
I personally always use uint8_t to refer to binary data
void *mem = something();
((uint8_t *)mem) + 0xb0b;
void pointers are used in a great many different contexts.
I prefer to make it explicit when working with binary data.
Speaking of which, casting to and from void pointers is implicit.
size_t* mem = (size_t *) calloc(1, sizeof(*mem));
That cast is a waste of everybody’s time and an easy way to introduce errors.
This works just fine and is clearer and more concise:
size_t* mem = calloc(1, sizeof(*mem));
const size_t * const mem = something();
I’ll let you enjoy that one for a sec…
const size_t * mem = something();
*mem = 10; // BAD
mem = something_else(); // OK
The trick to remember is that the const is with the type of the pointer, meaning that you’re pointing to a const type.
Not a very good trick but it works for me so you can complain to someone else.
size_t * const mem = something();
*mem = 10; // OK
mem = something_else(); // BAD
The trick to remember is that the const is on the variable name, meaning that the variable itself is const.
const size_t * mem = something();
*((size_t *) mem) = 10;
Remember how C allows you to cast whatever memory to be whatever?
Yeah so you can technically cast away const-ness.
If the underlying object was itself defined const, writing to it like this is undefined behaviour.
The easiest way for a language to gain a performance edge over C.
If the compiler can’t trace the source of your pointers, then it has to assume that they could be aliases
In other words, that they could point to the same data or overlapping data sets.
This SEVERELY limits compilers in what they can and can’t do.
size_t foo(size_t *x, size_t *y)
{
size_t a = *x; // read value of x once
*y = a * 2; // if x == y then we're writing to x
return *x; // which means that we must read x again
}
The compiler can’t assume that x != y and therefore must play it safe.
In pointer heavy code this can cause performance problems.
Arrays and by extension math heavy code, suffers from this.
There’s one caveat in the standard known as strict-aliasing:
Two pointers of incompatible types can’t alias (character types being the notable exception)
size_t foo(size_t *x, double *y)
{
size_t a = *x;
*y = a * 2; // we can assume x != y because typeof(x) != typeof(y)
return *x; // which means the compiler can reuse the value of a here
}
Cool, that solves a lot of problems, right?
… yeah … so there’s this thing called…
Type punning means reinterpreting the bytes of a value as something else.
double value = -1e12;
uint64_t raw = *((uint64_t *) &value);
uint64_t sign = raw >> 63;
uint64_t exponent = (raw >> 52) & ((1ULL << 11) - 1);
This is used in various contexts and breaks the strict-aliasing rules.
Welcome to undefined behaviour town!
When compiler writers (gcc) decided to enable strict-aliasing by default, it caused quite a kerfuffle (technical term).
It broke a ton of existing code that relied heavily on type-punning to do things.
A lot of projects, including the Linux kernel, disable this feature.
Proving once and for all that puns are bad.
The language actually doesn’t provide any way to type pun directly.
Most compilers do provide alternatives by either relaxing strict-aliasing rules or by allowing it through unions:
union utod
{
uint64_t u;
double d;
};
uint64_t raw = (union utod){ .d = value }.u;
uint64_t sign = raw >> 63;
uint64_t exponent = (raw >> 52) & ((1ULL << 11) - 1);
GCC has explicit allowance for this type of type punning.
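A different technique from the union trick, and one that doesn’t lean on any compiler extension, is to copy the bytes with memcpy; compilers generally turn the fixed-size copy into a plain register move. A sketch, assuming double is a 64-bit IEEE 754 value (the function names are invented):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: grab the raw bits of a double without aliasing tricks. */
uint64_t double_bits(double value)
{
    uint64_t raw = 0;
    _Static_assert(sizeof(raw) == sizeof(value), "assumes a 64-bit double");
    memcpy(&raw, &value, sizeof(raw)); /* byte copy, so no strict-aliasing violation */
    return raw;
}

/* Assumes the IEEE 754 layout: 1 sign bit, 11 exponent bits, 52 fraction bits. */
uint64_t double_sign(double value)     { return double_bits(value) >> 63; }
uint64_t double_exponent(double value) { return (double_bits(value) >> 52) & ((1ULL << 11) - 1); }
```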
int (*blah)(int);
Believe it or not this is a variable declaration.
Function pointer syntax is… special.
It even has this whole spiral rule thing for reading it.
Make everyone’s life more sane and just typedef your function pointers.
typedef int (*fn_t)(int);
fn_t blah;
It makes code way more readable
You can also ignore the whole spiral rule nonsense.
Since we don’t have closures or proper classes in C, you’ll almost always want a void pointer in your function pointers
typedef int (*fn_t)(void *, int);
This will give the user the option of passing a context object.
It’s needed like 90% of the time so it’s basically good practice at this point.
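A sketch of what the context pointer buys you: a generic loop that knows nothing about what the callback accumulates. visit_fn, for_each and add_to_total are all made-up names.

```c
#include <stddef.h>

/* Callback type: gets the caller's context plus the current value. */
typedef int (*visit_fn)(void *ctx, int value);

/* Hypothetical iterator: stops early if the callback returns non-zero. */
int for_each(const int *arr, size_t len, visit_fn fn, void *ctx)
{
    for (size_t i = 0; i < len; i++) {
        int ret = fn(ctx, arr[i]);
        if (ret != 0) return ret;
    }
    return 0;
}

/* Example callback: the context is a running total owned by the caller. */
int add_to_total(void *ctx, int value)
{
    int *total = ctx; /* implicit conversion from void *, no cast needed */
    *total += value;
    return 0;
}
```

The caller passes the address of an int as ctx and reads the total back once for_each returns.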
Good form to do memory allocations
size_t *n = calloc(1, sizeof(*n));
- uses calloc -> avoids reading uninitialized memory
- uses sizeof variable instead of explicitly specifying a type -> To change the variable type, only one change is required
- avoids explicit void pointer cast (it’s common, no idea why) -> not needed and makes code easier to read
Pointers allow for lots of fun things. Painful or impossible in most other languages.
*((uint8_t *) 10);
This kind of power is essential when dealing with hardware.
something something lisp book on cpu something something
struct bob { };
struct bob {...};
struct bob blah;
VS
typedef struct {...} bob_t;
bob_t blah;
The age old debate of absolute pointlessness
Once upon a time, the Visual Studio C++ compiler did not support writing:
struct bob blah;
As this isn’t how C++ declares classes.
But for interop reasons, we still need to be able to use C classes.
And so the eternal workaround was born.
Note that I think there’s probably more to this story but I like blaming it on visual studio so sue me.
Nowadays this is a non-issue and you can use either.
The biggest difference is explicit vs too verbose.
This applies to unions and enums as well.
It all comes down to style but I generally recommend being explicit.
And if you disagree with me then you’re wrong and your entire life is a lie.
struct bob
{
uint8_t a;
uint16_t b;
uint8_t c;
};
sizeof(struct bob);
The answer is obviously, it depends… duh!
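The compiler is free to add padding in between struct members to optimize. With typical alignment rules you effectively get:
struct bob
{
uint8_t a;
uint8_t __pad_0[1]; // aligns b on a 2-byte boundary
uint16_t b;
uint8_t c;
uint8_t __pad_1[1]; // rounds the size up to a multiple of 2
};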
Unaligned memory accesses can be very costly on some architectures.
By padding, the compiler can avoid this completely.
But it means that the size of a structure is not portable.
GCC and friends have an extension for that:
#define packed __attribute__((__packed__))
struct packed bob
{
uint8_t a;
uint16_t b;
uint8_t c;
};
static_assert(sizeof(struct bob) == 4, "");
Bitfields are also a thing where you can be more granular
#define packed __attribute__((__packed__))
struct packed bob
{
uint8_t a:4;
uint8_t b:7;
uint8_t c:5;
};
If I remember correctly, the ordering of the fields is implementation defined.
Nobody is dumb enough to screw around with ordering though.
I don’t know if the compiler is allowed to do padding here, but if you’re using bitfields you REALLY don’t want padding, so be explicit.
Also, endianness is a thing.
Personally, I rarely use bitfields.
I find that doing the bit manip by hand is about as much work.
I also haven’t memorized all the rules so I prefer to avoid land mines.
… and you all know my bias so let’s not get into that.
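For the curious, a sketch of the by-hand version using the same 4/7/5 bit split as the bitfield struct. The layout (a in the low bits, c in the high bits) and the bob_ names are my choice, which is exactly the point: you decide the ordering instead of the compiler.

```c
#include <stdint.h>

/* Hypothetical layout: bits 0-3 = a, bits 4-10 = b, bits 11-15 = c. */
uint16_t bob_pack(uint8_t a, uint8_t b, uint8_t c)
{
    /* mask each field to its width so stray high bits can't leak into a neighbour */
    return (uint16_t) ((a & 0x0Fu) | ((b & 0x7Fu) << 4) | ((c & 0x1Fu) << 11));
}

uint8_t bob_a(uint16_t packed) { return packed & 0x0Fu; }
uint8_t bob_b(uint16_t packed) { return (packed >> 4) & 0x7Fu; }
uint8_t bob_c(uint16_t packed) { return (packed >> 11) & 0x1Fu; }
```

Endianness still matters the moment the 16-bit value is written to a file or a socket.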
struct blah;
void blah(struct blah *); // OK!
void bloh(struct blah); // BWAD!
struct blah;
sizeof(struct blah *) == sizeof(uintptr_t);
sizeof(struct blah) == ???;
// blah.h
struct blah;
struct blah *make_blah(void);
// blah.c
#include <stdlib.h>
#include "blah.h"
struct blah { ... }; // fields are only visible in this file
struct blah *make_blah(void)
{
return calloc(1, sizeof(struct blah));
}
Allows you to hide implementation details.
Use it whenever possible in your interfaces.
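A slightly fuller sketch of the pattern, squeezed into one listing. The counter type and its functions are invented; in real code the first half would live in a header and the second half in a C file.

```c
#include <stdlib.h>

/* What would live in a hypothetical counter.h: no fields are exposed. */
struct counter;
struct counter *counter_new(void);
void counter_inc(struct counter *c);
size_t counter_get(const struct counter *c);
void counter_free(struct counter *c);

/* What would live in counter.c: the only place that knows the layout. */
struct counter { size_t hits; };

struct counter *counter_new(void) { return calloc(1, sizeof(struct counter)); }
void counter_inc(struct counter *c) { c->hits++; }
size_t counter_get(const struct counter *c) { return c->hits; }
void counter_free(struct counter *c) { free(c); }
```

Users can only hold a pointer and call the functions, so the fields can change without touching a single caller.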
struct buf
{
size_t len;
uint64_t data[];
};
size_t len = 10;
struct buf *b = calloc(1, sizeof(*b) + len * sizeof(b->data[0]));
*b = (struct buf) { .len = len };
b->data[3];
Everything is allocated in one call to calloc and freed with one call to free.
Data layout is contiguous which is nice for your cache.
Must be the last element of the structure, which means you can only have one in your struct.
That also includes nested structs.
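Wrapped up as a constructor, the allocation could look something like this. buf_new is a hypothetical name and the overflow guard is my addition.

```c
#include <stdint.h>
#include <stdlib.h>

struct buf
{
    size_t len;
    uint64_t data[]; /* flexible array member, must come last */
};

/* Hypothetical constructor: header and data come from a single allocation. */
struct buf *buf_new(size_t len)
{
    /* refuse lengths whose byte size would overflow size_t */
    if (len > (SIZE_MAX - sizeof(struct buf)) / sizeof(uint64_t)) return NULL;

    struct buf *b = calloc(1, sizeof(*b) + len * sizeof(b->data[0]));
    if (b == NULL) return NULL;

    b->len = len;
    return b;
}
```

One call to free(b) releases both the header and the data.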
Might sound like a good idea to do
struct vec
{
size_t len, cap;
uint64_t data[];
};
You then end up with this type of function definition:
struct vec *vec_append(struct vec *vec, uint64_t item);
Meaning that every mutable call could modify the size of the entire struct.
So every function has to potentially return the object.
You could double pointer but that’s a bit ugly and error prone.
I generally recommend using flexible array members for arrays whose size will be forever static.
In other cases I would add the extra layer of indirection
struct vec
{
size_t len, cap;
uint64_t *data;
};
size_t len = ...;
struct vec *v = calloc(1, sizeof(*v));
*v = (struct vec) {
.len = 0, .cap = len,
.data = calloc(len, sizeof(*v->data)),
};
This tends to lead to interfaces that are easier to use.
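With the extra indirection, append no longer has to hand the object back to the caller. A rough sketch; the doubling growth policy and the starting capacity of 4 are arbitrary choices of mine.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct vec
{
    size_t len, cap;
    uint64_t *data;
};

/* Hypothetical append: grows data in place, the struct vec itself never moves. */
bool vec_append(struct vec *v, uint64_t item)
{
    if (v->len == v->cap) {
        size_t cap = v->cap ? v->cap * 2 : 4;
        if (cap > SIZE_MAX / sizeof(*v->data)) return false; /* byte size would overflow */

        uint64_t *data = realloc(v->data, cap * sizeof(*v->data));
        if (data == NULL) return false; /* the old buffer is still valid on failure */

        v->data = data;
        v->cap = cap;
    }

    v->data[v->len++] = item;
    return true;
}
```

A zero-initialized struct vec is a valid empty vector here because realloc(NULL, n) behaves like malloc(n).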
Minute side diversion into poor-man’s sum type
struct blah
{
enum type type;
union
{
uint8_t u8;
uint64_t u64;
double f64;
struct { uint8_t x, y; } pos;
} d;
};
Either for polymorphism or to save on space.
sizeof is at least the size of the biggest element.
Layout is probably implementation dependent
d.u8 = 10;
return d.u64;
Not allowed to write to one element and read from another one.
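The tag is what keeps you honest: switch on it and only touch the member that was actually written. A sketch with made-up enum values and a trimmed-down version of the struct:

```c
#include <stdint.h>

enum type { TYPE_U8, TYPE_U64, TYPE_F64 }; /* hypothetical tags */

struct blah
{
    enum type type;
    union
    {
        uint8_t u8;
        uint64_t u64;
        double f64;
    } d;
};

/* Hypothetical reader: the tag decides which member is safe to read. */
double blah_as_double(const struct blah *b)
{
    switch (b->type) {
    case TYPE_U8:  return b->d.u8;
    case TYPE_U64: return (double) b->d.u64;
    case TYPE_F64: return b->d.f64;
    }
    return 0.0; /* unknown tag */
}
```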
Earlier example related to type punning and strict aliasing:
union utod
{
uint64_t u;
double d;
};
uint64_t raw = (union utod){ .d = value }.u;
uint64_t sign = raw >> 63;
uint64_t exponent = (raw >> 52) & ((1ULL << 11) - 1);
Technically undefined behaviour.
BUTT
GCC and friends override it and make it defined.
Nothing too earth shattering or ground-breaking for structs.
Only like 2-3 undefined behaviours to keep an eye out for.
That’s pretty good all things considered.
- the no-param thing
- declaration vs definition
- ordering matters
- header vs c files
- namespaces
- extern vs static
- compilation units
- gcc attributes
unsigned add(unsigned x, unsigned y)
{
return x + y;
}
void blah();
blah(12);
Valid or not?
void blah();
Accepts an unknown number of arguments of unknown type.
void blah(void);
Accepts no arguments
Ye olde C.
Modern GCC and friends will error out.
void first(void) { second(); } // second() doesn't exist :'(
void second(void) {}
C likes things to be nice and orderly
void second(void); // declaration
void first(void) { second(); } // second() has been declared :)
void second(void) {}
Declaring a function tells the compiler that this function exists somewhere
If you need a function across multiple files, you can just declare it!
// blah.c
void blah(void) {}
// bleh.c
void blah(void);
void bleh(void) { blah(); }
That works!
… but gets tedious quickly…
// blah.h
void blah(void);
// blah.c
void blah(void) {}
// bleh.c
#include "blah.h" // needs to know that blah exists!
void bleh(void) { blah(); }
The header file is where you declare things for OTHER compilation units
See headers as your interface
Keep them clean and minimal
The less you expose the easier the interface is to understand
DON’T USE HEADERS TO STORE ALL YOUR DECLARATIONS WILLY NILLY
It’s the equivalent of marking all your functions as public
Or of exporting all your functions in erlang
DON’T DO THAT
Treat every header like a package or a module
The C file will implement your package and the headers will present its interface.
Which brings up namespaces in C, or the lack thereof
[1]: https://blog.regehr.org/archives/1393
[2]: https://www.usenix.org/system/files/1311_05-08_mickens.pdf
[3]: https://en.cppreference.com/w/cpp/language/types