Architectural overview of NUMA
No, not the silly YouTube video: Non-Uniform Memory Access (NUMA) is a design model used in many newer multi-CPU computer systems. To understand NUMA it is best to first understand how things were before its advent. The prototypical multiprocessor layout is symmetric multiprocessing (SMP), which uses a uniform memory access (UMA) model. The UMA nature of SMP means that every CPU is connected to a single memory bus. This works well for a relatively small number of CPUs, but as the count grows, so does contention for that single bus, and CPUs start having to wait in line for memory access for unacceptable lengths of time.
Below is a simple diagram of an SMP setup with 2 CPUs connected to a single memory bus, which is in turn connected to a single bank of memory.
Next is a diagram of a NUMA setup with 4 CPUs and 2 memory buses, 2 CPUs on each. Each memory bus has its own bank of memory, which in NUMA nomenclature is referred to as local memory. The memory connected to the other bus is still accessible, but since it is not directly connected there is a performance penalty for fetching information from a "remote" memory bank. This concept of local and remote memory is the fundamental principle of the NUMA architecture. Another important term when talking about NUMA is the node: in the diagram, each group of CPUs and memory connected to the same bus is considered a node.
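The topology described above can be sketched as a toy model. This is purely illustrative, not kernel code; the names `cpu_to_node` and `access_cost`, and the relative costs, are all invented for this example.

```c
/* Toy model of the 4-CPU, 2-node layout described above.
 * All names and numbers here are invented for illustration. */
#define NR_CPUS  4
#define NR_NODES 2

/* CPU0/1 share one bus (node 0); CPU2/3 share the other (node 1). */
static const int cpu_to_node[NR_CPUS] = { 0, 0, 1, 1 };

/* Made-up relative access costs: local memory = 1, remote = 2. */
static int access_cost(int cpu, int mem_node)
{
    return cpu_to_node[cpu] == mem_node ? 1 : 2;
}
```

The key property is visible in the cost function: the same bank of memory is cheap for one pair of CPUs and expensive for the other.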
The main consideration when building a NUMA-aware system is recognizing that not all memory takes the same amount of time to access. Memory allocations need to account for which node the requesting process is running in, and the scheduler should avoid evicting a process from a CPU in one node to a CPU in a different node if it wants to provide the best performance.
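The "allocate from the local node first" policy can be illustrated with a toy allocator. On a real Linux system this is the job of the kernel's page allocator (or, from userspace, libraries like libnuma); the per-node pools and the `alloc_chunk` function below are made up for the sketch.

```c
#define NR_NODES 2

/* Toy per-node memory pools: each node starts with 4 free "chunks".
 * This only illustrates the local-first policy, not real kernel code. */
static int free_chunks[NR_NODES] = { 4, 4 };

/* Try the requesting process's own node first, then fall back to a
 * remote node; return the node the allocation came from, or -1. */
static int alloc_chunk(int local_node)
{
    if (free_chunks[local_node] > 0) {
        free_chunks[local_node]--;
        return local_node;
    }
    for (int n = 0; n < NR_NODES; n++) {
        if (n != local_node && free_chunks[n] > 0) {
            free_chunks[n]--;
            return n;  /* remote: slower, but better than failing */
        }
    }
    return -1;  /* out of memory on every node */
}
```

A remote allocation still succeeds, it is just the second choice, which mirrors the performance reasoning above.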
CPUs also include their own on-chip caches in addition to main memory, which surfaces another complication: keeping all the caches consistent. This issue is referred to as cache coherence and affects every multi-CPU configuration that has a shared memory resource. Virtually all NUMA setups you will see in the wild are cache-coherent NUMA (ccNUMA), so I am not going to talk about non-coherent setups.
Yet another complication with NUMA comes from the fact that accessing remote memory takes longer, which can lead to problems with locking mechanisms. Taking the above diagram as an example, if a lock structure lives in memory local to CPU0/1, the remote CPUs can end up unable to take hold of the lock. Say CPU0 is holding a spinlock, CPU2 requests the lock and starts spinning, and then CPU1 requests the lock and spins as well. Once CPU0 releases the lock, the delay in accessing remote memory means CPU2's request will likely be beaten out by CPU1's, even though CPU2 asked first.
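One well-known way to restore fairness in this situation is a ticket lock, where waiters are served strictly in the order they asked. Below is a minimal single-file sketch using C11 atomics; it shows the FIFO ordering idea only, and leaves out the memory-ordering and pause/backoff refinements a production lock would need.

```c
#include <stdatomic.h>

/* Minimal ticket lock sketch (C11 atomics). Each waiter takes a
 * ticket, so a remote CPU that asked first is served first. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed in */
};

static unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned t = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->owner) != t)
        ;  /* spin until our ticket comes up */
    return t;
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);  /* admit the next ticket holder */
}
```

With this scheme CPU2 in the scenario above would hold an earlier ticket than CPU1, so CPU1 spins until CPU2 has had its turn, regardless of who observes the release first.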
Overall, NUMA allows an operating system that correctly accounts for the quirks of the design to scale well beyond the limitations of SMP. This is mostly due to reduced contention on a single memory bus, the factor that diminishes the performance gained on SMP systems with each CPU added. The way most of us will see NUMA implemented is on dual- or quad-socket server motherboards with multi-core CPUs. These setups often have a bank of memory for each socket, meaning somewhere between 2 and 6 cores per node.
This concludes the architectural overview. A future article will cover how Linux takes advantage of NUMA hardware.
Slab-based memory allocation is a mechanism for providing efficient allocations of commonly used data structures, and it has been implemented in a variety of UNIX-derived operating systems. Linux is among the *NIXs that use the slab allocation method, and it actually provides a few variations. I will be covering just the basic slab system; there are others, including the SLUB system, a variation intended to improve performance and reduce metadata overhead.
To begin with: what is a slab allocator?
Conceptually, slab allocation is very simple: set aside some pages in memory designated for providing allocations of a specific size. The size of an object within a slab is usually based on the size of a commonly used kernel data structure. A few common examples of structs allocated from a slab cache are inodes, dentries, and buffer_heads. The primary advantage of this methodology is a significant reduction in fragmentation of allocated memory, as well as reduced complexity when finding an available chunk of memory to satisfy a request.
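The core idea, a page carved into fixed-size objects threaded onto a free list, fits in a few lines. This is a userspace toy, not the kernel's implementation: the single static page, the sizes, and all of the names are assumptions made for the sketch.

```c
#include <stddef.h>

/* Toy slab: one "page" carved into fixed-size objects with an
 * embedded free list. Names and sizes are invented for illustration. */
#define OBJ_SIZE      64
#define PAGE_BYTES    4096
#define OBJS_PER_SLAB (PAGE_BYTES / OBJ_SIZE)

static unsigned char page[PAGE_BYTES];
static void *freelist;   /* head of the free-object list */
static int initialized;

static void slab_init(void)
{
    /* Link every object: store a pointer to the next free object
     * in the first bytes of each (currently unused) object. */
    for (int i = 0; i < OBJS_PER_SLAB; i++) {
        void **obj = (void **)(page + i * OBJ_SIZE);
        *obj = (i + 1 < OBJS_PER_SLAB) ? page + (i + 1) * OBJ_SIZE : NULL;
    }
    freelist = page;
    initialized = 1;
}

static void *slab_alloc(void)
{
    if (!initialized)
        slab_init();
    void *obj = freelist;
    if (obj)
        freelist = *(void **)obj;  /* pop the head in O(1) */
    return obj;
}

static void slab_free(void *obj)
{
    *(void **)obj = freelist;      /* push back in O(1) */
    freelist = obj;
}
```

Because every object is the same size, allocation and free are constant-time pointer pops and pushes, and a freed slot is immediately reusable, which is where the fragmentation win comes from.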
Organization of the slab cache
(hurray for dia… starting to get the hang of that app)
Starting at the top is cache_chain, a linked list containing all of the caches currently in existence. Each entry is a kmem_cache struct that organizes a cache for one size of object. Each kmem_cache contains a list of slabs for each NUMA node, defined as an array of kmem_list3 structs. kmem_list3 contains 3 (could you guess?) lists of slabs for the cache: slabs_partial, slabs_full, and slabs_free. These lists are used to make decisions when servicing requests for a new object. slabs_partial is the list of slabs that are (wait for it) partially full, and it is the first place to look when allocating a new object. When a slab has no more free objects it gets moved to slabs_full, and if slabs_partial is empty, slabs_free is checked for an empty slab that can be put to use.
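That partial-then-free lookup order can be sketched with toy list handling. The real kernel structs carry far more state; `toy_slab`, `toy_lists`, and the helper names below are invented, and `take_object` assumes it is handed the head of slabs_partial.

```c
#include <stddef.h>

/* Toy sketch of the slabs_partial / slabs_full / slabs_free decision.
 * A "slab" here is just a used-object count; names are invented. */
#define SLAB_OBJS 4

struct toy_slab {
    int used;                 /* objects handed out from this slab */
    struct toy_slab *next;
};

struct toy_lists {
    struct toy_slab *partial, *full, *free;
};

/* Pick a slab to allocate from: partial first, then an empty one. */
static struct toy_slab *pick_slab(struct toy_lists *l)
{
    if (l->partial)
        return l->partial;
    if (l->free) {            /* promote an empty slab to partial */
        struct toy_slab *s = l->free;
        l->free = s->next;
        s->next = l->partial;
        l->partial = s;
        return s;
    }
    return NULL;              /* caller must grow the cache */
}

/* Take one object from the head slab of slabs_partial. */
static void take_object(struct toy_lists *l, struct toy_slab *s)
{
    if (++s->used == SLAB_OBJS) {   /* now full: move to slabs_full */
        l->partial = s->next;
        s->next = l->full;
        l->full = s;
    }
}
```

The point of keeping three lists is visible here: the allocator never scans full slabs, and empty slabs are only touched when the partial list has run dry.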
Each struct slab within the 3 lists is a group of contiguous pages (quite often just 1) and is the granularity by which a cache can be grown or shrunk. The process of keeping track of which objects within a slab are in use will be the subject of a future post. Each slab holds a different number of objects depending on the size of the object and the number of pages in the slab.
Creating and Destroying slabs
As objects are allocated and deallocated, the number of slabs in slabs_free changes. When there are no available slabs in slabs_free, a new slab must be allocated, which is done by the cache_grow() function. cache_grow() allocates (indirectly) enough pages for the given slab from the NUMA node of the corresponding kmem_list3 struct, sets up the struct slab, and attaches it to the slabs_free list. The slab system sets up a workqueue on each CPU to shrink caches by calling the cache_reap() function. cache_reap() walks down the cache_chain list and attempts to free pages associated with slabs on the various slabs_free lists. This happens every few seconds and is designed to keep the slabs_free lists from holding onto pages for too long. Caches are also "reaped" when kswapd is attempting to free up memory because overall system memory is running low.
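The grow/reap life cycle can be reduced to a pair of toy functions. The names deliberately echo cache_grow() and cache_reap(), but the bodies are stand-ins: the page accounting is assumed to be one page per slab, and the real functions do actual page allocation and freeing.

```c
/* Toy sketch of the cache grow/reap cycle: the cache tracks how many
 * empty slabs it is holding; grow adds one, reap releases them all.
 * One page per slab is assumed, as is common for small objects. */
struct toy_cache {
    int free_slabs;    /* slabs sitting on slabs_free */
    int pages_in_use;  /* pages currently owned by the cache */
};

static void toy_cache_grow(struct toy_cache *c)
{
    /* Real code: allocate pages from the right NUMA node, set up
     * the struct slab, and attach it to slabs_free. */
    c->free_slabs++;
    c->pages_in_use++;
}

/* Run periodically (or under memory pressure): give empty slabs'
 * pages back to the system; returns how many slabs were freed. */
static int toy_cache_reap(struct toy_cache *c)
{
    int reaped = c->free_slabs;
    c->pages_in_use -= reaped;
    c->free_slabs = 0;
    return reaped;
}
```

The asymmetry matches the text: growth happens on demand, one slab at a time, while reaping is a periodic sweep that returns whatever has accumulated on slabs_free.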
This has been a relatively light overview of the slab system, and while a lot has been left out, it serves as a good jumping-off point for further articles on the subject. To poke around the slab cache on your running system you can cat the proc knob /proc/slabinfo, which contains a list of all of the existing caches and a number of interesting statistics about each one.
I hope you have found this useful, and if you have any comments, questions, or hate mail, do let me know.