Capital Punishment

posted by rgm on 2011.11.21, under User Land
21:

There is no stay on these executions

After talking about fork() the next logical step is to continue with exec(). The fork()/exec() pairing, as mentioned in my previous post in this series, is a method by which new processes are started in unix like systems. exec is the stage at which a new program is loaded from disk to replace the currently running process. In this post I will cover the user land side of the exec family of functions which provide userspace the interface to start processes other than init. Most of the meat of how exec() works will be covered when I post about the kernel side of things.

Diving into the details

The library function that ultimately makes the magic happen is execve. It is the version of exec that accepts an argument vector (v) and an environment vector (e) to be used by the newly loaded program. This function is a very simple wrapper in front of the sys_execve system call. That is if you ignore the case of having bounds checking enabled in glibc. The execve function is as straightforward as they come, it simply passes it’s arguments on as is and makes the system call. In the unlikely case that bounds checking is enabled, it first iterates overs both argv and envp to make sure the strings actually terminate within legal bounding limits. The path to the executable is checked as well. The details are unimportant as it is disabled in almost all cases. When the bounds checking support is not compiled in it is a single line function.

int __execve(const char *file, char *const argv[], char *const envp[])
{
  return INLINE_SYSCALL(execve, 3, file, argv, envp);
}

Granted the INLINE_SYSCALL macro does expand to substantially more than 1 line, but that’s a topic for another day. As you can see execve only takes 3 arguments; a path to the executable (file), a null terminated array of strings to use as program arguments (argv), and a null terminated array of strings to use as the environment (envp). To see this in action lets take a look at a very contrived example of what is marginally equivalent to a small portion of what happens when you type “ls -l .” at a shell prompt.

char *envp[] = {"FOO=bar", (char *)0};
char *argv[] = {"/bin/ls", "-l", ".", (char *)0};
execve("/bin/ls", argv, envp);

The unusual thing about execve is when it succeeds it doesn’t return. This makes sense since once execve completes without error the calling process has been replaced and there isn’t a valid place to return to. In the error case the function returns -1 and errno is set to give you an idea of what went wrong. Check out the man page execve(2) for a full list of those error numbers and what they mean. So instead of returning when all is well execution instead continues at the entry point of the new program that was loaded into memory.

It is also worth noting that the handling of scripts using the #! mechanism is done entirely in the kernel. Though with one caveat, in the case the calling program makes use of a execve wrapper function glibc does some magic in certain situations to preform this it self. If a file is passed in that is executable but the kernel fails to find a handler to run the file the library functions will try again with the file as an argument to /bin/sh. This provides a way to have simple shell scripts without a #! and can also provide lots of confusion.

I could leave it at that, as for the most part that is really all that happens in user space. However there are a plethora of convenience wrapper functions in libc to make calling execve “less” of a headache. These include, execl, execlp, execle, execv, execvp, and execvpe. They all are build on top of execve and have varying call signatures for various use cases.

Letters make the difference

The various letters after exec all have different meanings and relate to the type and number of arguments accepted and how they are handled.

execl*() refers to the fact that the arguments passed to the program being executed are included as individual arguments to the execl*() function. This proves to be the simplest way of doing argument passing if the number of arguments is fixed. Since the arguments are passed directly to the execl*() function you are limited to a fixed number of arguments at compile time.

int execlp(const char *file, const char *arg, ...);

Arguments are read from the stack until a NULL character is found.

execv*() means that program arguments are passed via an array of strings in the same manner as execve. The array must be terminated with a NULL character as the last element. This allows you to have more flexibility for argument passing in comparison to execl*, at the cost of having to build the array your self.

exec*e() means that environment is passed via an array of strings also in the same manner as execve. The array must be terminated with a NULL character as the last element just as the argument list in execv*().

exec*p() functions do path expansion using the PATH environment variable (or current directory is PATH is not set). Both execvp and execlp are wrappers around execvpe that does the actual path lookup before calling execve. While these functions do the same thing that a shell like bash does to allow you to type ‘ls‘ and correctly invoke ‘/bin/ls‘, bash doesn’t actually use any of the exec*p() functions and does the path look up it self.

In the exec functions that do not specify an environment vector the new program inherits the environment of the calling process.

Call signatures

For completeness the following is a list of example calls for each of the execve helper functions:

execl:

execl("/bin/ls", "/bin/ls", "-l", ".", (char *)0);

execlp:

execlp("ls", "ls", "-l", ".", (char *)0);

execle:

char *envp[] = {"FOO=bar", (char *)0};
execle("/bin/ls", "/bin/ls", "-l", ".", (char *)0, envp);

execv:

char *argv[] = {"/bin/ls", "-l", ".", (char *)0};
execv("/bin/ls", argv);

execvp:

char *argv[] = {"ls", "-l", ".", (char *)0};
execvp("ls", argv);

execvpe:

char *envp[] = {"FOO=bar", (char *)0};
char *argv[] = {"ls", "-l", ".", (char *)0};
execvpe("ls", argv, envp);

That concludes this brief description of execve and friends. Stay tuned for further posts that dive into the kernel implementation of this functionality.

A fork() in the road

posted by rgm on 2011.01.18, under User Land
18:

Go fork yourself

Several months ago a good friend of mine suggested that I write a post about process creation. Initially I planned on writing a single post on fork, clone, exec, and friends however after thinking about the scope of the topic I’ve decided to break the subject up into several posts. This is the first and will cover the libc magic of the fork function.

For clarity i will be referring to library functions by name (ex fork) and system calls as sys_name (ie sys_fork). Several of the examples will be making use of the strace and ltrace utilities so if you’re not familiar with them, now would be a good time to read their man pages.

Back to basics

While I am assuming there is a certain level of existing knowledge of how unix style process creation works, a quick overview seems appropriate. This topic has been covered at length by many a author much more eloquent than I, and if you are new to these ideas I would suggest a more in depth review of the material else where.

In the before time new processes were created by duplicating the calling process in it’s enterty. That meant that all of the process was copied, the kernel data structures, page tables, and the memory allocated to the process. I’m sure you can imagine how slow this would get as processes eat more and more memory that would need to get copied each time a new process was created. This describes a very naive simplistic version of what happened at a fork that hasn’t actually been the way it implemented for a very long time. The modern version of the same process does mostly the same things, copies the kernel data structures and page tables associated with the process. However the actual memory allocated to the process is not copied. Instead the page table entries for both the parent and the child are marked as copy-on-write (COW). This allows the child to share the allocated memory with the parent and new pages are only allocated to the child when a write occurs. COW provides a mechanizm by which all the memory in both the parent and child is shared until one of the who writes to a page. When a write occurs a duplicate is created for the writing process (parent or child). Allowing a single set of pages to service both processes with minimal copying overhead. This process continues until all the pages have been duplicated, one of the processes exit, or exec is called. The details of how COW is implemented is outside the scope of this post but it is an important concept that I will more than likely write another post about in the future. The important thing to take from this is COW prevents needless data duplication and reduces the overhead of creating a new process to a much more manageable level. Since one of the more common situations that requires creating a new process is executing a new program it makes sense to try and duplicate as little as possible from the parent.

Once a new process has been created by fork what is it to do? Well if you type “ls” at your command prompt bash forks and then execs ls. This means that the ls image is loaded from disk and replaces the child copy of bash. The fork/exec pattern is very common indeed; but from a functional level it is no different than any other process creation that doesn’t result in an exec. The only difference is that the logic in the process doing the fork is to have the child immediate call exec. There is nothing requiring the child to do that, and in many cases it won’t. The kernel and libc don’t care what the child does and while I am going to continue talking about exec as far as the process creation portion side of things is concerned the work is complete.

So exec. While in linux there is no actual “exec” library function, conceptually exec loads a new process image from disk to replace the currently running one. Most of the environment from the calling process is preserved after an exec, however there are a few things that get cleaned up. Such as signal handlers are reset to defaults, memory mappings are unmapped, shared memory segments are detached, etc. Once exec completes execution resumes at the entry point to the loaded process.

You may be wondering if all processes come from a parent process fork‘ing where did the first one come from? Lets just say that in the beginning there was nothing and the kernel said “let there be pid 1″ and so it was. Simply put the kernel creates the first process as part of the initial boot up and from that point onwards all new processes are created with fork.

That concludes a rather brief background on the fork/exec concept the details of many of the pieces described will be covered in this and subsequent posts in this series.

Behind the libc curtain

The description of the libc side of fork will use glibc as the reference implementation. That noted there is quite a lot of linux specific stuff to follow in this section since fork is so tightly wound with the linux threading code.

The first bit of libc magic around fork is that the library wrapper function fork does not actually call sys_fork but instead uses sys_clone. So I could have named this post “a clone() in the road” but that doesn’t have the same ring to it. With the integration of the nptl (native posix thread library) into glibc (happened in v2.3.2 in case you care.) the usage of the sys_fork call on most Linux systems went the way of the dodo.

A very contrived example will show the fork -> sys_clone relationship.

[/tmp]: ltrace -e fork sh -c 'ls'
fork()                                           = 2785
<ls output omitted for brevity>
--- SIGCHLD (Child exited) ---
+++ exited (status 0) +++

Since ltrace gives us information about library calls this shows the library function fork() being called by ‘sh’ to spawn ‘ls’.

[/tmp]: strace -e fork sh -c 'ls'
<ls output omitted for brevity>

Here with strace we are looking at system calls and can see that sys_fork is not being called.

[/tmp]: strace -e clone sh -c 'ls'
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdad87d29d0) = 2789
<ls output omitted for brevity>
--- SIGCHLD (Child exited) @ 0 (0) ---

As promised a sys_clone is being done by the library call to fork(). sys_clone accepts a wide range of arguments that allow the child process to share various amounts of data with the parent. That fact is why clone is so important to threading, but that’s a topic for later. The arguments passed to sys_clone by fork result in the same behavior as when using sys_fork so this was a mostly transparent library change.

The fork wrapper is a bit more involved then one might initially imagine. Without looking into things it might appear that since fork takes no arguments all it would need to do is setup a hard coded set of arguments for sys_clone and trigger the syscall to let the kernel do it’s business. However that is not the case. Due to complications with threading there was a need for the ability to register handlers to be called before and after a call to fork. pthread_atfork provides this mechanism which is commonly used by multi-threaded libraries to protect internal state during a fork in a single threaded process making use of said library. The details of why this functionality is important is outside the scope of this article but needless to say it is important. If you are interested in why the man page for pthread_atfork is a good start. What is important right now is the fact that these handlers can be registered and they need to be dealt with during the fork process.

The handlers that are registered are stored in a single linked list which is walked by the fork code. The structure contains function pointers that perform the tasks that the code that registered the fork handler needs done. The structure is defined as follows:

 struct fork_handler
 {
   struct fork_handler *next;
   void (*prepare_handler) (void);
   void (*parent_handler) (void);
   void (*child_handler) (void);
   void *dso_handle;
   unsigned int refcntr;
   int need_signal;
 };

The fields that are most relevant to this topic are the *_handler fields. These are the call-back function pointers mentioned earlier. Their names are pretty self evident on when they are called. But for due diligence the prepare_handler is called in the parent in the preparation for a call to sys_clone. parent_handler is called after the fork in the parent process, and child_handler is called from the child also after the fork. refcntr is used to prevent the list from being removed after a call to fork has already started.

Once all of the prepare handlers have been dealt with the actual call to sys_clone happens. This is done via a macro ARCH_FORK which on linux ends up calling sys_clone. Once the “fork” has happened two different code paths are followed depending on if execution is in the parent or the child.

In the child the first order of business is to reset some libc locks so the child gets a fresh lock states. The call-backs registered for the child are then run. In the parent the call-backs for, the parent, are run. Mostly the same in both cases just subtle differences in regards to lock states.

The final step is to return the pid value returned by sys_clone. In the child value will be 0 and the parent will be the new pid of the child. This simply makes detecting and providing different behavior depending on which process the code continues in easy. Often in the child the first thing done is to exec a new program, but that is the topic for the next post in this series.

That concludes the first of the process creation series. While I do intend to start working on the next part after completing this one other posts may wiggle their way in between each new post in this series. See you next time.

NUMA NUMA – Architectural Overview

posted by rgm on 2010.06.04, under Computer Architecture, Kernel Architecture
04:

Architectural overview of NUMA

No not the silly youtube video, Non-Uniform Memory Access (NUMA) is a design model used in many newer multi-cpu computer systems. To understand NUMA it is best to first understand how things were before the its advent. The prototypical multiprocessor computer layout is symmetric multiprocessing (SMP) which uses a uniform memory access (UMA) model. The UMA nature of SMP means that each cpu is connected to a single memory bus. This methodology works well for a relatively small number of CPUs but as the number grows the contention for a single bus grows and cpus start having to wait in line for memory access for unacceptable lengths of time.

Bellow is a simple diagram of a SMP setup with 2 cpus connected to a single memory bus which is in turn connected to a single bank of memory.

Next  is a diagram of a NUMA setup with 4 cpus and 2 memory buses 2 cpus each. The 2 memory buses have their own bank of memory which in the NUMA nomenclature is referred to as local memory. The memory connected to the other bus is accessible however since it is not directly connected there are performance reductions for fetching information from a “remote” memory bank. This concept of local and remote memory is the fundamental principle of the NUMA architecture. Another important term when talking about NUMA is a node; in the diagram each group of memory and cpus that are connected to the same bus are considered a node.

The main consideration that needs to be make when building a NUMA aware system is recognizing the fact that not all memory takes the same amount of time to access. Memory allocations and process cpu locality need to be done taking account for what node the process that is requesting the allocation is in, and when doing process scheduling attempting not to evict a process from a cpu in one node to a cpu in a different node is important to providing the best performance.

Since cpus in addition to the main memory also include their own on chip caches another complication surfaces keeping all the caches consistent. This issue is referred to as cache coherence and affect all multi-cpu configurations that have a shared memory resource. Virtually all NUMA setups you will see in the wild are Cache coherent NUMA (ccNUMA) and I am not going to talk about non coherent setups.

Yet another complication with NUMA comes from the fact that accessing remote memory takes longer which can lead to problems with locking mechanisms. Taking the above diagram as an example if there is a lock structure in memory local to CPU0/1 a situation where the remote cpus are unable to take hold of the lock can occur. Say cpu0 is holding a spinlock, cpu2 then requests the lock and starts spinning, then cpu1 requests the lock and spins. Once cpu0 releases the lock due to the delay in accessing remote memory cpu2s request will likely have been beat out by cpu1.

Overall NUMA allows an operating system that is correctly accounting for the quirks of the design to scale well beyond the limitations of SMP. This mostly is due to the reduction of contention of a single memory bus which reduced the performance gained on SMP systems for each cpu added. The way most of us will see NUMA implemented is on dual or quad socket server motherboards with multi-core cpus. These setups often have a bank of memory for each socket meaning between 2-6 cpus per node.

This concludes the architectural overview. A future article will cover how linux takes advantage of NUMA hardware.

SLAB Cache Organization

posted by rgm on 2010.05.21, under Kernel Architecture
21:

Slab based memory allocation is a mechanism to provide efficient allocations for commonly used data structures that has been implemented in a variety of UNIX derived operating systems.  Linux is among the *NIXs that uses the slab allocation method and actually provides a few variations. I will be covering just the basic slab system however there are others including the slub system which is a variation intended to improve performance and reduce metadata overhead.

To begin with; what is a slab allocator?

Conceptually slab allocation is very simple, set aside some pages in memory designated for providing allocations of a specific size. The size of an object within a slab is usually based on the size of a commonly used kernel data structure. A few common examples of a structs that are allocated from a slab cashe are inodes, dentries, buffer_heads, and many more. The primary advantage to this methodology is a significant reduction in fragmentation of allocated memory as well as reducing the complexity of attempting to find an available chunk of memory for to satisfy the request.

Organization of the slab cache

(hurray for dia… starting to get the hang of that app)

Starting at the top is cache_chain a linked list containing all of the caches currently in existence. Each entry is a kmem_cache struct that organizes a cache for one size of object. Each kmem_cache contains an list of slabs for each NUMA node defined as an array of kmem_list3 structs. kmem_list3 contains 3 (could you guess?) lists of slabs for the cache; slabs_partial, slabs_full, and slabs_free. Each list is use to make decisions when servicing requests for a new object. slabs_partial is the list of slabs that are (wait for it) partially full and is the first place to look when allocating a new object. If there are no more free objects in a slab it gets moved to slabs_full, and if slabs_partial is empty slabs_free is checked for an available empty slab to be used.

Each struct slab within the 3 lists is a group of contiguous pages (quite often 1) and is the size a cache can be grown or shrunk. The process of keeping track of what objects within a slab are in use will be the subject of a future post. Each slab contains a different number of objects depending on the size of the object and the number of pages in each slab.

Creating and Destroying slabs

As objects are allocated and deallocated the number of slabs in slabs_free will change. When there are no available slabs in slabs_free a new slab must be allocated which is done by the cache_grow() function. cache_grow() kmalloc’s (indirectly) enough pages for the given slab from the NUMA node for the corresponding kmem_list3 struct, sets up the struct slab and attaches it to the slabs_free list. The slab system sets up a workqueue on each cpu to shrink caches by calling the cache_reap() function. cache_reap() walks down the cache_chain list and attempts to free pages associated with slabs in various slabs_free lists. This process happens every few seconds and is designed to keep the slabs_free lists from holding onto pages for too long. Another time that caches are “reaped” is when kswapd is attempting to free up some memory if the overall system memory is getting low.

Final Thoughts

This has been a relatively light overview of the slab system and while there is a lot left out it serves as a good jumping off point for further articles on the subject. To poke around the slab cache on your running system you can cat the proc knob /proc/slabinfo which contains a list of all of the existing caches and a number of interesting statistics about each one.

I hope you have found this useful and if you have any comments, questions, hate mail, do let me know.

digging back into computer science

posted by rgm on 2010.05.17, under Uncategorized
17:

it has been several years since i spent many brain cycles on what i would consider computer science. sure i’ve written some tools and started the process of learning the oddities of python, but nothing of real consequence. soon i will be starting a new job that will require me to get back into the cs mind set and as such this blog will serve as a repository of my studies.

a combination of topics related to linux kernel architecture and general computer science.

pagetop