Ian Ward (wardi on OFTC, freenode and GitHub)
OLS topics on day three included Lguest, SMB2, large memory allocations and the concurrent pagecache.
Lguest is a simple Linux-on-Linux paravirtualization module for x86. The Lguest source code is only about 5000 lines, including a user space utility. It is intended to remain simple so that the code may be instructional. If you need features like migration, x86-64 support or full virtualization, you will be better served by Xen or KVM.
Lguest uses a "switcher" that appears at the top of the 4GB address space. The switcher switches between the host and guest virtual machines and maintains the guest's shadow page tables, "one of the things about virtualization that is actually tricky". Lguest uses page protection to stop guests from writing to the switcher. Page protection doesn't stop the guest from reading the data in the switcher, but this information can't be used to escape from the virtual machine.
Lguest is very well documented, once you run a script to populate the code with comments. As a bonus, Rusty has aimed to include a witty rejoinder in the comments every 100 lines or so, and has promised to buy anyone who has read the whole thing a beer. I did not personally witness anyone taking Rusty up on this offer at OLS.
Future plans for Lguest include NUMA simulation by delaying certain memory accesses. Also proposed is a fork hypercall that could be used for exhaustive kernel testing by forking the guest at every potential failure case in the kernel code.
SMB2 is the new default protocol for Windows Vista. It replaces the SMB/CIFS protocol which operates on the same port, and in some ways is simpler to implement on Linux. SMB2 has 19 commands that replace SMB's 81 commands. Many of the commands in SMB were duplicated or not required. NFSv4 is also much more complicated than SMB2 when comparing the commands that must be implemented.
There have been a number of extensions to SMB that add Unix semantics to the file system. It is currently possible to use SMB to host Linux home directories if the "server inode" option is enabled on the server.
SMB is also important for MacOS clients because it is the network file system protocol with the fastest implementation available on that platform.
There are many details about SMB2 (that I couldn't read from the projector) included in the slides for this talk.
Some processors are capable of supporting memory pages of 1GB or larger. Linux doesn't currently support multiple page sizes, but using large pages can be a big performance win. Each TLB entry maps a whole page, so larger pages let the TLB cover far more memory, reducing the number of TLB misses for applications that use them.
The current memory allocators tend to leave memory quite fragmented; even when there is lots of memory free, it can be impossible to allocate a large enough contiguous block. One strategy to guarantee that large page allocations succeed is to partition memory at boot time, setting aside an area of memory that will only be used for large allocations. The obvious drawback is that this memory can't be used for the page cache or other allocations, even while it sits idle. It can also be difficult to decide how large an area to reserve.
A second strategy is to group movable and unmovable pages, so that when a large page is requested it is more likely that pages can be moved to make room. It turns out that much of the code for moving pages already exists as the page migration code. This strategy does not guarantee that large page allocations will succeed, but it makes better use of available RAM when large pages are not required and can still accommodate many large allocations.
A third strategy is to perform page compaction. Page compaction pushes all the movable pages together, defragmenting the free space so that many large pages become available. This strategy is also not guaranteed to work, and the process trying to allocate memory will stall for a long time while compaction runs, but it can make very large regions of memory available.
"Free memory is wasted memory", so why not use it to store data that is likely to be accessed and save the disk access time? The Linux page cache does exactly that by storing partial file contents in memory. The page cache uses a radix tree to keep track of the mapping between pages in memory and file blocks.
This radix tree can become a real performance bottleneck when multiple CPUs are reading and writing different parts of the same file at the same time. A single writer lock protects the data structure, so when multiple CPUs write to it the lock's cache line bounces back and forth between them. The concurrent page cache is a modification to the page cache structure designed to solve this problem.
The first change is to add a lock at each node in the radix tree. The goal is to acquire locks as far down the tree as possible so that there is no contention between CPUs modifying different parts of the same file. The highest lock that must be acquired is the lowest node with more than one child along the path to the target node. This node is called the "termination point". If the termination point is locked we know that the children of the termination point won't be detached from above.
The strategy used to modify the radix tree without locking higher in the tree (and causing the same lock contention problem we started with) is as follows: