Capital Punishment

posted by on 2011.11.21, under User Land
21:

There is no stay on these executions

After talking about fork() the next logical step is to continue with exec(). The fork()/exec() pairing, as mentioned in my previous post in this series, is a method by which new processes are started in unix like systems. exec is the stage at which a new program is loaded from disk to replace the currently running process. In this post I will cover the user land side of the exec family of functions which provide userspace the interface to start processes other than init. Most of the meat of how exec() works will be covered when I post about the kernel side of things.

Diving into the details

The library function that ultimately makes the magic happen is execve. It is the version of exec that accepts an argument vector (v) and an environment vector (e) to be used by the newly loaded program. This function is a very simple wrapper in front of the sys_execve system call. That is if you ignore the case of having bounds checking enabled in glibc. The execve function is as straightforward as they come, it simply passes it’s arguments on as is and makes the system call. In the unlikely case that bounds checking is enabled, it first iterates overs both argv and envp to make sure the strings actually terminate within legal bounding limits. The path to the executable is checked as well. The details are unimportant as it is disabled in almost all cases. When the bounds checking support is not compiled in it is a single line function.

int __execve(const char *file, char *const argv[], char *const envp[])
{
  return INLINE_SYSCALL(execve, 3, file, argv, envp);
}

Granted the INLINE_SYSCALL macro does expand to substantially more than 1 line, but that’s a topic for another day. As you can see execve only takes 3 arguments; a path to the executable (file), a null terminated array of strings to use as program arguments (argv), and a null terminated array of strings to use as the environment (envp). To see this in action lets take a look at a very contrived example of what is marginally equivalent to a small portion of what happens when you type “ls -l .” at a shell prompt.

char *envp[] = {"FOO=bar", (char *)0};
char *argv[] = {"/bin/ls", "-l", ".", (char *)0};
execve("/bin/ls", argv, envp);

The unusual thing about execve is when it succeeds it doesn’t return. This makes sense since once execve completes without error the calling process has been replaced and there isn’t a valid place to return to. In the error case the function returns -1 and errno is set to give you an idea of what went wrong. Check out the man page execve(2) for a full list of those error numbers and what they mean. So instead of returning when all is well execution instead continues at the entry point of the new program that was loaded into memory.

It is also worth noting that the handling of scripts using the #! mechanism is done entirely in the kernel. Though with one caveat, in the case the calling program makes use of a execve wrapper function glibc does some magic in certain situations to preform this it self. If a file is passed in that is executable but the kernel fails to find a handler to run the file the library functions will try again with the file as an argument to /bin/sh. This provides a way to have simple shell scripts without a #! and can also provide lots of confusion.

I could leave it at that, as for the most part that is really all that happens in user space. However there are a plethora of convenience wrapper functions in libc to make calling execve “less” of a headache. These include, execl, execlp, execle, execv, execvp, and execvpe. They all are build on top of execve and have varying call signatures for various use cases.

Letters make the difference

The various letters after exec all have different meanings and relate to the type and number of arguments accepted and how they are handled.

execl*() refers to the fact that the arguments passed to the program being executed are included as individual arguments to the execl*() function. This proves to be the simplest way of doing argument passing if the number of arguments is fixed. Since the arguments are passed directly to the execl*() function you are limited to a fixed number of arguments at compile time.

int execlp(const char *file, const char *arg, ...);

Arguments are read from the stack until a NULL character is found.

execv*() means that program arguments are passed via an array of strings in the same manner as execve. The array must be terminated with a NULL character as the last element. This allows you to have more flexibility for argument passing in comparison to execl*, at the cost of having to build the array your self.

exec*e() means that environment is passed via an array of strings also in the same manner as execve. The array must be terminated with a NULL character as the last element just as the argument list in execv*().

exec*p() functions do path expansion using the PATH environment variable (or current directory is PATH is not set). Both execvp and execlp are wrappers around execvpe that does the actual path lookup before calling execve. While these functions do the same thing that a shell like bash does to allow you to type ‘ls‘ and correctly invoke ‘/bin/ls‘, bash doesn’t actually use any of the exec*p() functions and does the path look up it self.

In the exec functions that do not specify an environment vector the new program inherits the environment of the calling process.

Call signatures

For completeness the following is a list of example calls for each of the execve helper functions:

execl:

execl("/bin/ls", "/bin/ls", "-l", ".", (char *)0);

execlp:

execlp("ls", "ls", "-l", ".", (char *)0);

execle:

char *envp[] = {"FOO=bar", (char *)0};
execle("/bin/ls", "/bin/ls", "-l", ".", (char *)0, envp);

execv:

char *argv[] = {"/bin/ls", "-l", ".", (char *)0};
execv("/bin/ls", argv);

execvp:

char *argv[] = {"ls", "-l", ".", (char *)0};
execvp("ls", argv);

execvpe:

char *envp[] = {"FOO=bar", (char *)0};
char *argv[] = {"ls", "-l", ".", (char *)0};
execvpe("ls", argv, envp);

That concludes this brief description of execve and friends. Stay tuned for further posts that dive into the kernel implementation of this functionality.

A fork() in the road

posted by on 2011.01.18, under User Land
18:

Go fork yourself

Several months ago a good friend of mine suggested that I write a post about process creation. Initially I planned on writing a single post on fork, clone, exec, and friends however after thinking about the scope of the topic I’ve decided to break the subject up into several posts. This is the first and will cover the libc magic of the fork function.

For clarity i will be referring to library functions by name (ex fork) and system calls as sys_name (ie sys_fork). Several of the examples will be making use of the strace and ltrace utilities so if you’re not familiar with them, now would be a good time to read their man pages.

Back to basics

While I am assuming there is a certain level of existing knowledge of how unix style process creation works, a quick overview seems appropriate. This topic has been covered at length by many a author much more eloquent than I, and if you are new to these ideas I would suggest a more in depth review of the material else where.

In the before time new processes were created by duplicating the calling process in it’s enterty. That meant that all of the process was copied, the kernel data structures, page tables, and the memory allocated to the process. I’m sure you can imagine how slow this would get as processes eat more and more memory that would need to get copied each time a new process was created. This describes a very naive simplistic version of what happened at a fork that hasn’t actually been the way it implemented for a very long time. The modern version of the same process does mostly the same things, copies the kernel data structures and page tables associated with the process. However the actual memory allocated to the process is not copied. Instead the page table entries for both the parent and the child are marked as copy-on-write (COW). This allows the child to share the allocated memory with the parent and new pages are only allocated to the child when a write occurs. COW provides a mechanizm by which all the memory in both the parent and child is shared until one of the who writes to a page. When a write occurs a duplicate is created for the writing process (parent or child). Allowing a single set of pages to service both processes with minimal copying overhead. This process continues until all the pages have been duplicated, one of the processes exit, or exec is called. The details of how COW is implemented is outside the scope of this post but it is an important concept that I will more than likely write another post about in the future. The important thing to take from this is COW prevents needless data duplication and reduces the overhead of creating a new process to a much more manageable level. Since one of the more common situations that requires creating a new process is executing a new program it makes sense to try and duplicate as little as possible from the parent.

Once a new process has been created by fork what is it to do? Well if you type “ls” at your command prompt bash forks and then execs ls. This means that the ls image is loaded from disk and replaces the child copy of bash. The fork/exec pattern is very common indeed; but from a functional level it is no different than any other process creation that doesn’t result in an exec. The only difference is that the logic in the process doing the fork is to have the child immediate call exec. There is nothing requiring the child to do that, and in many cases it won’t. The kernel and libc don’t care what the child does and while I am going to continue talking about exec as far as the process creation portion side of things is concerned the work is complete.

So exec. While in linux there is no actual “exec” library function, conceptually exec loads a new process image from disk to replace the currently running one. Most of the environment from the calling process is preserved after an exec, however there are a few things that get cleaned up. Such as signal handlers are reset to defaults, memory mappings are unmapped, shared memory segments are detached, etc. Once exec completes execution resumes at the entry point to the loaded process.

You may be wondering if all processes come from a parent process fork‘ing where did the first one come from? Lets just say that in the beginning there was nothing and the kernel said “let there be pid 1″ and so it was. Simply put the kernel creates the first process as part of the initial boot up and from that point onwards all new processes are created with fork.

That concludes a rather brief background on the fork/exec concept the details of many of the pieces described will be covered in this and subsequent posts in this series.

Behind the libc curtain

The description of the libc side of fork will use glibc as the reference implementation. That noted there is quite a lot of linux specific stuff to follow in this section since fork is so tightly wound with the linux threading code.

The first bit of libc magic around fork is that the library wrapper function fork does not actually call sys_fork but instead uses sys_clone. So I could have named this post “a clone() in the road” but that doesn’t have the same ring to it. With the integration of the nptl (native posix thread library) into glibc (happened in v2.3.2 in case you care.) the usage of the sys_fork call on most Linux systems went the way of the dodo.

A very contrived example will show the fork -> sys_clone relationship.

[/tmp]: ltrace -e fork sh -c 'ls'
fork()                                           = 2785
<ls output omitted for brevity>
--- SIGCHLD (Child exited) ---
+++ exited (status 0) +++

Since ltrace gives us information about library calls this shows the library function fork() being called by ‘sh’ to spawn ‘ls’.

[/tmp]: strace -e fork sh -c 'ls'
<ls output omitted for brevity>

Here with strace we are looking at system calls and can see that sys_fork is not being called.

[/tmp]: strace -e clone sh -c 'ls'
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdad87d29d0) = 2789
<ls output omitted for brevity>
--- SIGCHLD (Child exited) @ 0 (0) ---

As promised a sys_clone is being done by the library call to fork(). sys_clone accepts a wide range of arguments that allow the child process to share various amounts of data with the parent. That fact is why clone is so important to threading, but that’s a topic for later. The arguments passed to sys_clone by fork result in the same behavior as when using sys_fork so this was a mostly transparent library change.

The fork wrapper is a bit more involved then one might initially imagine. Without looking into things it might appear that since fork takes no arguments all it would need to do is setup a hard coded set of arguments for sys_clone and trigger the syscall to let the kernel do it’s business. However that is not the case. Due to complications with threading there was a need for the ability to register handlers to be called before and after a call to fork. pthread_atfork provides this mechanism which is commonly used by multi-threaded libraries to protect internal state during a fork in a single threaded process making use of said library. The details of why this functionality is important is outside the scope of this article but needless to say it is important. If you are interested in why the man page for pthread_atfork is a good start. What is important right now is the fact that these handlers can be registered and they need to be dealt with during the fork process.

The handlers that are registered are stored in a single linked list which is walked by the fork code. The structure contains function pointers that perform the tasks that the code that registered the fork handler needs done. The structure is defined as follows:

 struct fork_handler
 {
   struct fork_handler *next;
   void (*prepare_handler) (void);
   void (*parent_handler) (void);
   void (*child_handler) (void);
   void *dso_handle;
   unsigned int refcntr;
   int need_signal;
 };

The fields that are most relevant to this topic are the *_handler fields. These are the call-back function pointers mentioned earlier. Their names are pretty self evident on when they are called. But for due diligence the prepare_handler is called in the parent in the preparation for a call to sys_clone. parent_handler is called after the fork in the parent process, and child_handler is called from the child also after the fork. refcntr is used to prevent the list from being removed after a call to fork has already started.

Once all of the prepare handlers have been dealt with the actual call to sys_clone happens. This is done via a macro ARCH_FORK which on linux ends up calling sys_clone. Once the “fork” has happened two different code paths are followed depending on if execution is in the parent or the child.

In the child the first order of business is to reset some libc locks so the child gets a fresh lock states. The call-backs registered for the child are then run. In the parent the call-backs for, the parent, are run. Mostly the same in both cases just subtle differences in regards to lock states.

The final step is to return the pid value returned by sys_clone. In the child value will be 0 and the parent will be the new pid of the child. This simply makes detecting and providing different behavior depending on which process the code continues in easy. Often in the child the first thing done is to exec a new program, but that is the topic for the next post in this series.

That concludes the first of the process creation series. While I do intend to start working on the next part after completing this one other posts may wiggle their way in between each new post in this series. See you next time.

pagetop