Ottawa Linux Symposium 2007 Day 3

[Image: Puppies! (the Lguest logo)]
Posted on 2007-06-29, last modified 2007-07-03.

OLS topics from day three, including Lguest, SMB2, large memory allocations, and the concurrent pagecache.

Lguest: Implementing the Little Linux Hypervisor, Rusty Russell

Lguest is a simple Linux-on-Linux paravirtualization module for x86. The Lguest source code is only about 5000 lines, including a user-space utility. It is intended to remain simple so that the code can serve as instructional material. If you need features like migration, x86-64 support or full virtualization, you will be better served by Xen or KVM.

Lguest uses a "switcher" that is mapped at the top of the 4GB address space. The switcher switches between the host and guest virtual machines and maintains the guest's shadow page tables, "one of the things about virtualization that is actually tricky". Lguest uses page protection to stop the guest from writing to the switcher; page protection doesn't stop the guest from reading the switcher's data, but that information can't be used to escape from the virtual machine.
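
As a conceptual sketch of the shadow page table idea (illustrative C, not Lguest's actual code): the guest edits what it believes are its page tables, but the hardware only walks the shadow copy that the hypervisor maintains, so restrictions like the read-only switcher can be enforced on every mapping.

    /* Conceptual sketch of a shadow page table update; not Lguest code.
     * The guest writes an entry into its own page table, the hypervisor
     * traps the write and mirrors it into the shadow table that the MMU
     * actually walks, stripping permissions the guest must not have. */

    #include <stdint.h>

    #define ENTRIES    1024
    #define PTE_WRITE  0x2

    typedef uint32_t pte_t;

    struct mapping {
        pte_t guest_pt[ENTRIES];   /* what the guest thinks is installed */
        pte_t shadow_pt[ENTRIES];  /* what the hardware actually walks   */
    };

    /* Called when the hypervisor traps a guest page table update. */
    static void shadow_set_pte(struct mapping *m, unsigned int idx,
                               pte_t guest_pte, uint32_t switcher_pfn)
    {
        pte_t pte = guest_pte;

        /* Never let the guest map the switcher pages writable. */
        if ((pte >> 12) == switcher_pfn)
            pte &= ~PTE_WRITE;

        m->guest_pt[idx] = guest_pte;  /* the value the guest sees */
        m->shadow_pt[idx] = pte;       /* the value the MMU uses   */
    }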

Lguest is very well documented, once you run a script to populate the code with comments. As a bonus, Rusty has aimed to include a witty rejoinder in the comments every 100 lines or so, and has promised to buy anyone who has read the whole thing a beer. I did not personally witness anyone taking Rusty up on this offer at OLS.

Future plans for Lguest include NUMA simulation by delaying certain memory accesses. Also proposed is a fork hypercall that could be used for exhaustive kernel testing by forking the guest at every potential failure case in the kernel code.

A New Network Filesystem is Born: Comparison of SMB2, CIFS and NFS, Steve French

SMB2 is the new default file-sharing protocol in Windows Vista. It replaces the SMB/CIFS protocol, operates on the same port, and is in some ways simpler to implement on Linux. SMB2 has 19 commands, replacing SMB's 81; many of SMB's commands were duplicated or simply not required. NFSv4 is also much more complicated than SMB2 when comparing the commands that must be implemented.

There have been a number of extensions to SMB that add Unix semantics to the filesystem. It is currently possible to use SMB to host Linux home directories if the "server inode" option is enabled on the server.

SMB is also important for MacOS clients because it is actually the protocol with the fastest implementation available.

There are many details about SMB2 (that I couldn't read from the projector) included in the slides for this talk.

Supporting the Allocation of Large Contiguous Regions of Memory, Mel Gorman

Some processors are capable of supporting memory pages of 1GB or larger. Linux doesn't currently support multiple page sizes, but using large pages can be a big performance win: a single large page is covered by one TLB entry instead of many, reducing the number of TLB misses for applications that use them.
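
For a sense of what large pages look like from user space, here is a minimal sketch requesting a single 2MB huge page; it assumes the administrator has reserved huge pages (for example vm.nr_hugepages > 0), and it uses the MAP_HUGETLB mmap() flag from later kernels rather than the hugetlbfs mount that was current at the time of this talk.

    /* Minimal sketch: request one 2MB huge page via mmap().
     * Assumes huge pages have been reserved by the administrator;
     * MAP_HUGETLB is the newer interface, hugetlbfs the older one. */

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* typical x86 huge page */

    int main(void)
    {
        void *buf = mmap(NULL, HUGE_PAGE_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                         -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* no huge pages reserved? */
            return EXIT_FAILURE;
        }

        /* The whole 2MB region is covered by a single TLB entry. */
        memset(buf, 0, HUGE_PAGE_SIZE);

        munmap(buf, HUGE_PAGE_SIZE);
        return EXIT_SUCCESS;
    }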

The current memory allocators tend to leave memory quite fragmented; even when there is lots of free memory, it can be impossible to allocate a large enough contiguous region. One strategy to guarantee that large page allocations succeed is to partition memory at boot time, setting aside an area of memory that will only be used for large allocations. The obvious drawback is that this memory can't be used for the page cache or other allocations, even while it sits unused. It can also be difficult to decide how large the reserved area should be.
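
To make the fragmentation problem concrete, here is a toy illustration (a hypothetical page layout, not kernel code): half of the pages are free, yet no contiguous run is long enough for even a small multi-page allocation.

    /* Toy illustration of fragmentation: plenty of free pages overall,
     * but no contiguous run long enough for a large allocation.
     * Purely illustrative; not how the kernel tracks free memory. */

    #include <stdio.h>

    #define PAGES 32

    int main(void)
    {
        int used[PAGES];                 /* 1 = in use, 0 = free */
        int free_total = 0, best_run = 0, run = 0;

        for (int i = 0; i < PAGES; i++)
            used[i] = i % 2;             /* worst case: every other page */

        for (int i = 0; i < PAGES; i++) {
            if (!used[i]) {
                free_total++;
                if (++run > best_run)
                    best_run = run;
            } else {
                run = 0;
            }
        }

        printf("free pages: %d of %d\n", free_total, PAGES);
        printf("largest contiguous run: %d page(s)\n", best_run);
        /* A 4-page allocation fails even though 16 pages are free. */
        return 0;
    }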

A second strategy is to group movable and unmovable pages, so that when there is a request for a large page it is more likely that pages can be moved to make room. It turns out that much of the code for moving pages already exists as the page migration code. This strategy does not guarantee that large page allocations will succeed, but it makes better use of available RAM when large pages are not required and can still accommodate many large allocations.

A third strategy is to perform page compaction. Page compaction pushes all the movable pages together, defragmenting the free space so that many large pages become available. This strategy is also not guaranteed to work, and the process trying to allocate memory will stall for a long time while compaction runs, but it can make very large regions of memory available.
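
As a toy sketch of the compaction idea (purely illustrative; the real work builds on the page migration code), the movable pages are slid toward one end of the region so that the free pages coalesce into a single large block.

    /* Toy sketch of page compaction: migrate the movable used pages to
     * the front of the region so the free space becomes one large block.
     * Illustrative only; the kernel does this with page migration. */

    #include <stdio.h>

    #define PAGES 16

    static void show(const char *label, const int used[PAGES])
    {
        printf("%-8s", label);
        for (int i = 0; i < PAGES; i++)
            putchar(used[i] ? 'U' : '.');
        putchar('\n');
    }

    int main(void)
    {
        int used[PAGES] = {1,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1};
        int dst = 0;

        show("before:", used);

        /* Migrate each movable page to the lowest free slot. */
        for (int src = 0; src < PAGES; src++) {
            if (used[src]) {
                used[src] = 0;
                used[dst++] = 1;   /* "copy" the page, free the old one */
            }
        }

        show("after:", used);      /* the free pages are now contiguous */
        return 0;
    }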

Concurrent Pagecache, Peter Zijlstra

"Free memory is wasted memory", so why not use it to store data that is likely to be accessed and save the disk access time? The Linux page cache does exactly that by storing partial file contents in memory. The page cache uses a radix tree to keep track of the mapping between pages in memory and file blocks.

This radix tree can become a real performance bottleneck when multiple CPUs are reading and writing different parts of the same file at the same time. A single writer lock protects the data structure, and when multiple CPUs write to the structure the lock's cache line bounces back and forth between them. The concurrent pagecache is a modification to the page cache structure designed to solve this problem.
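
For reference, a stripped-down lookup over such a radix tree might look like the sketch below (a generic illustration, not the kernel's radix tree, which adds tags, RCU and a different fanout): the page's index within the file is consumed a few bits at a time, each chunk selecting the child at the next level.

    /* Stripped-down radix tree lookup, keyed by page index within a file.
     * Generic sketch, not the kernel's implementation. */

    #include <stddef.h>

    #define RADIX_BITS  4                      /* 16 children per node */
    #define RADIX_SIZE  (1 << RADIX_BITS)
    #define RADIX_MASK  (RADIX_SIZE - 1)

    struct radix_node {
        void *slots[RADIX_SIZE];   /* child nodes, or pages at the leaves */
    };

    struct radix_tree {
        struct radix_node *root;
        int height;                /* number of levels in the tree */
    };

    /* Return the cached page for this page index, or NULL if not cached. */
    static void *radix_lookup(struct radix_tree *tree, unsigned long index)
    {
        struct radix_node *node = tree->root;
        int shift = (tree->height - 1) * RADIX_BITS;

        while (node && shift >= 0) {
            node = node->slots[(index >> shift) & RADIX_MASK];
            shift -= RADIX_BITS;
        }
        return node;
    }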

The first change is to add a lock at each node in the radix tree. The goal is to acquire locks as far down the tree as possible so that there is no contention between CPUs modifying different parts of the same file. The highest lock that must be acquired is on the lowest node with more than one child along the path to the target node; this node is called the "termination point". If the termination point is locked, we know that its children won't be detached from above.

The strategy used to modify the radix tree without locking higher in the tree (and causing the same lock contention problem we started with) is as follows, with a code sketch after the list:

  1. Follow the nodes to our target, recording the termination point
  2. Take an "optimistic lock" on the termination point
  3. Verify that it is still the termination point for our target; if not, release the lock and repeat from step 1.
  4. Continue locking the nodes down to the target node
  5. Modify the target and release all locks in reverse order
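
Here is a sketch of those steps in C (a simplified illustration of the idea, not the actual patch): the tree-walking helpers find_termination_point, child_towards and modify_slot are hypothetical stand-ins, and a pthread mutex plays the role of the per-node lock.

    /* Sketch of the optimistic per-node locking protocol above.
     * The helpers declared here are hypothetical stand-ins for the real
     * tree walk; a pthread mutex stands in for the per-node lock. */

    #include <pthread.h>
    #include <stddef.h>

    #define FANOUT 64

    struct node {
        pthread_mutex_t lock;
        struct node *parent;
        void *slots[FANOUT];
        int nr_children;
    };

    /* Hypothetical helpers assumed for this sketch:
     * find_termination_point() walks to `index` without taking locks and
     * returns the lowest node with more than one child on the path;
     * child_towards() returns the next child on the path, or NULL at the
     * target; modify_slot() performs the actual change.                  */
    struct node *find_termination_point(struct node *root, unsigned long index);
    struct node *child_towards(struct node *node, unsigned long index);
    void modify_slot(struct node *target, unsigned long index, void *page);

    void pagecache_modify(struct node *root, unsigned long index, void *page)
    {
        for (;;) {
            /* 1. Walk to the target without locks, recording the
             *    termination point along the way.                   */
            struct node *tp = find_termination_point(root, index);

            /* 2. Optimistically lock the termination point.         */
            pthread_mutex_lock(&tp->lock);

            /* 3. The tree may have changed during the unlocked walk;
             *    if this is no longer the termination point, retry.  */
            if (find_termination_point(root, index) != tp) {
                pthread_mutex_unlock(&tp->lock);
                continue;
            }

            /* 4. Lock the remaining nodes down to the target; with
             *    tp held, nothing below it can be detached.          */
            struct node *node = tp, *child;
            while ((child = child_towards(node, index)) != NULL) {
                pthread_mutex_lock(&child->lock);
                node = child;
            }

            /* 5. Modify the target, then release the locks in reverse
             *    order of acquisition (target first, tp last).       */
            modify_slot(node, index, page);
            for (;;) {
                struct node *up = node->parent;
                pthread_mutex_unlock(&node->lock);
                if (node == tp)
                    break;
                node = up;
            }
            return;
        }
    }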

continue to day 4

Tags: OLS, Linux