Ian Ward's email:
first name at this domain
wardi on OFTC, freenode and github
OLS topics on the last day included Extreme High Performance Computing, the Linux Desktop Audio Mess and IOMMU Performance.
High performance computing is required for tasks such as research into physics, cosmology and biology, as well as simulating things like airflow, weather systems, molecules, dark matter and black holes. Supercomputers are often custom-built for jobs like these. NASA's 10,240 CPU supercomputer is pictured on the right. Supercomputers have a single memory space with non-uniform memory access (NUMA). This means that each CPU has some memory that is fast to access, while other parts of memory take longer to reach because they are physically much farther away.
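To make the cost of non-uniform access concrete, here is a minimal sketch of the expected latency of a memory access as a function of how often a CPU hits its own local memory. The function name and the latency figures are invented for illustration; they are not measurements from any real machine.

```python
def average_latency_ns(local_fraction, local_ns=100.0, remote_ns=400.0):
    """Expected latency of one memory access on a toy two-level NUMA machine.

    local_fraction is the share of accesses served from the CPU's own memory.
    local_ns and remote_ns are illustrative values only.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Compare a well-placed workload with a poorly placed one.
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, average_latency_ns(fraction))
```

With these assumed numbers, a workload that keeps all of its data local is four times faster per access than one whose data is all remote, which is why the operating system's placement decisions matter so much.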
Clusters can also be used for this sort of work, but they require that the applications running are split into many small parts that can run separately. With a supercomputer you must have an operating system that is smart enough to handle NUMA, but applications don't need as many changes.
It is sometimes claimed that a microkernel design is the best for supercomputers. Microkernels break up each of the tasks of an operating system into small isolated modules that can individually be restarted or replaced. Proponents say that microkernels scale better, allow for better design, make the system more reliable and are essential for dealing with the complexities of large systems.
However, there are few if any supercomputers running an operating system with a microkernel design. This may be due to some of the problems with microkernels. Separating the kernel into modules that are protected from each other involves message passing and context switching, which are slow. These modules need to maintain stable APIs, which makes them difficult to improve. Also, application state is difficult to gather in a microkernel because it is spread across a number of modules, and reading that data violates the separation.
Linux is being run on the NASA supercomputer, divided into 512-CPU virtual computers. Linux uses source code modularity and loadable modules instead of run-time separation and message passing.
This makes it easier to experiment with new types of locking, per-CPU areas and cache line optimizations.
There are improvements still to be made, as Linux doesn't scale all the way to 10,240 CPUs yet, but this work is progressing. One part that is missing is the ability to control from user space where memory allocations happen. A user space application may know which CPU is going to need quick access to a data structure, and should be able to allocate memory close to that CPU.
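The kind of placement decision an application might want to make can be sketched as a toy policy: given the node a CPU sits on, pick the closest memory node that still has room. The distance table, the function name and the page counts are all hypothetical; this is not a real kernel or library interface.

```python
# Hypothetical distance table: DISTANCE[a][b] is the relative cost for a CPU
# on node a to reach memory on node b. The values are invented.
DISTANCE = [
    [10, 20, 40],
    [20, 10, 20],
    [40, 20, 10],
]


def pick_node(cpu_node, free_pages, pages_needed):
    """Return the node closest to cpu_node with enough free pages, or None."""
    candidates = [
        node for node, free in enumerate(free_pages) if free >= pages_needed
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda node: DISTANCE[cpu_node][node])


if __name__ == "__main__":
    # Node 0 is full, so a CPU on node 0 falls back to its nearest neighbour.
    print(pick_node(0, [0, 5, 5], 1))
```

The point of the sketch is that the application supplies the one piece of information the kernel cannot easily guess: which CPU will use the data.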
Current options for audio on Linux include OSS, Alsa, EsounD, AAS, Alsa dmix, Phonon and Jack. Each of these APIs has its own set of features. If a user has more than one library installed on their system, they all fight for control of the sound device, and only one will win. If you are writing a program that uses audio, you have to decide which APIs to support.
OSS is the original standard, and it is not going away. EsounD has good network transparency features, but is not well maintained. Alsa dmix is commonly used for mixing sound output, but it has unstable sound latency. Jack requires floating point samples and is designed mainly for professional audio.
None of these options are aware of the desktop environment, so it is difficult to implement features such as pausing music when a VoIP call comes in, or adjusting volume of an application depending on whether it has the focus. It is also difficult to gracefully support devices like USB headsets that are plugged in and removed while an application is playing sound. In Linux "we need a compiz for sound."
PulseAudio is a new audio server that tries to solve these problems. It is integrated with the X server so it can act on window focus events. It can sync audio output across devices to turn two standard audio cards into one virtual surround-sound device, and it can sync playback across devices on a network. It can accurately estimate audio latency, and can interpolate latency for network connections.
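The idea of interpolating latency can be illustrated with a small sketch: between timing updates from the device or the network, the playback position is extrapolated from the local clock, and latency is whatever has been written but not yet played. The function names and the linear model are assumptions made for illustration, not PulseAudio's actual algorithm or API.

```python
def interpolate_position(update_time, update_position, now, rate=1.0):
    """Estimate the playback position (in seconds of audio) at time `now`.

    update_time and update_position come from the last timing update;
    rate is how many seconds of audio play per second of wall-clock time.
    """
    elapsed = now - update_time
    if elapsed < 0:
        raise ValueError("now must not be earlier than the last update")
    return update_position + elapsed * rate


def estimate_latency(written, update_time, update_position, now):
    """Seconds of audio written by the application but not yet played."""
    played = interpolate_position(update_time, update_position, now)
    return max(0.0, written - played)


if __name__ == "__main__":
    # Last update: at t=10.0 the device had played 2.0 s of audio.
    # The application has written 3.0 s in total and it is now t=10.5.
    print(estimate_latency(3.0, 10.0, 2.0, 10.5))
```

A sketch like this only needs an occasional timing update, which is what makes the approach attractive over a network where each update is expensive.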
PulseAudio is a replacement for EsounD, implementing its full API (in some cases better than EsounD does). It will work with Jack for professional audio setups. It uses a zero copy playback model for low CPU usage and has built-in underrun protection. Network audio devices are detected using Zeroconf. It can support about 90% of applications that use OSS by using an LD_PRELOAD setting.
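For readers unfamiliar with LD_PRELOAD, the following sketch shows how a launcher might set it for a single child process so that a wrapper library is loaded ahead of the application's other libraries. The helper name, the library path and the application name are placeholders, not real PulseAudio file names.

```python
import os


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD.

    Any existing LD_PRELOAD entries are kept after the new one.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    if existing:
        env["LD_PRELOAD"] = library_path + ":" + existing
    else:
        env["LD_PRELOAD"] = library_path
    return env


if __name__ == "__main__":
    # Hypothetical usage: both the path and the program name are placeholders.
    #   import subprocess
    #   subprocess.run(["some-oss-app"], env=preload_env("/path/to/wrapper.so"))
    print(preload_env("/path/to/wrapper.so", {})["LD_PRELOAD"])
```

Because the setting only affects the child process, the rest of the system keeps talking to the sound device in the usual way.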
PulseAudio is popular on many thin clients, but has yet to be integrated into desktop distributions. It currently requires a floating point unit, but soon won't have that requirement so it will work on more embedded devices.
Virtualization shares hardware between multiple operating system images. Paravirtualization makes a guest operating system aware that it is not running on real hardware for better I/O performance to the host. However, there is still a cost for passing I/O to the host before sending it to the hardware, and it would be better in some cases to allow a guest to "own" a particular device and talk directly to it.
On the x86 architecture this causes a security problem, because hardware that can perform DMA can access and modify all of system memory. This makes it possible for a guest to "escape" its virtual machine or cause problems for other virtual machines by programming its device to write to memory it shouldn't have access to. IOMMUs provide memory protection for DMA operations the same way MMUs protect system memory. With an IOMMU we can safely allow a guest direct access to hardware.
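The protection an IOMMU gives can be modelled in a few lines: each device gets its own list of permitted regions, and any DMA outside them faults instead of touching memory. This is a toy model with invented names; a real IOMMU also translates addresses and works on pages, neither of which is shown here.

```python
class DMAFault(Exception):
    """Raised when a device tries to access memory it has no mapping for."""


class ToyIOMMU:
    def __init__(self):
        self._regions = {}  # device name -> list of (start, length)

    def map(self, device, start, length):
        """Allow `device` to access [start, start + length)."""
        self._regions.setdefault(device, []).append((start, length))

    def check(self, device, addr, size):
        """Raise DMAFault unless the access fits inside one mapped region."""
        for start, length in self._regions.get(device, []):
            if start <= addr and addr + size <= start + length:
                return
        raise DMAFault(f"{device}: access to {addr}+{size} is not mapped")


def dma_write(iommu, memory, device, addr, data):
    """Write data into memory on behalf of a device, if the IOMMU allows it."""
    iommu.check(device, addr, len(data))
    memory[addr:addr + len(data)] = data


if __name__ == "__main__":
    memory = bytearray(64)
    iommu = ToyIOMMU()
    iommu.map("nic0", 16, 16)
    dma_write(iommu, memory, "nic0", 16, b"ok")   # inside the mapping
    try:
        dma_write(iommu, memory, "nic0", 0, b"bad")  # outside the mapping
    except DMAFault as fault:
        print("blocked:", fault)
```

Without the check in `dma_write`, the second write would silently overwrite memory belonging to someone else, which is the situation on hardware with no IOMMU.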
IBM uses similar hardware for its PowerPC and x86 X-series servers, and seemingly by accident they shipped their x86 servers with a "Calgary" IOMMU on the motherboard. As it turns out, this IOMMU mostly works. This provided a test bed for Linux x86 IOMMU work. Once support was added to the kernel something interesting happened: they found driver bugs. The IOMMU had stopped some drivers from DMA-ing to regions of memory they shouldn't have been.
The Xorg X server doesn't work with the IOMMU enabled because it bypasses the kernel and talks directly to the hardware. This is apparently only true on the x86 architecture, but it does need to be fixed.
IOMMUs can also be used with full virtualization, where the guest is unaware that it is running on a virtual machine, by mapping all of the guest memory as a DMA region. Paravirtualization can improve on this by mapping only the regions required, which also protects the guest from itself.
The performance overhead at this early stage is 15-60% for network I/O, but there are a number of obvious optimizations to make to the code. Mapping and unmapping regions is slow, so unmapping could be delayed or not done at all. The test system has 16 CPUs and lock contention could have also been an issue.
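The delayed-unmapping idea can be sketched as batching: instead of paying for an expensive hardware update on every unmap, requests are queued and retired together. The class below is a toy model with invented names and counts; it is not the Linux implementation and the numbers say nothing about real overhead.

```python
class DeferredUnmapper:
    """Queue unmap requests and retire them in batches.

    hardware_updates counts how many times the (assumed expensive)
    hardware operation would have been performed.
    """

    def __init__(self, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.pending = []
        self.hardware_updates = 0

    def unmap(self, region):
        self.pending.append(region)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Retire all pending unmaps with a single hardware update."""
        if self.pending:
            self.pending.clear()
            self.hardware_updates += 1


if __name__ == "__main__":
    eager = DeferredUnmapper(batch_size=1)
    lazy = DeferredUnmapper(batch_size=8)
    for region in range(16):
        eager.unmap(region)
        lazy.unmap(region)
    print(eager.hardware_updates, lazy.hardware_updates)
```

The trade-off in this model is that a region waiting in `pending` is still reachable by the device until the next flush, so delaying unmaps exchanges some protection for speed.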
x86 hardware with IOMMU support is not widely available yet and the best design is still an open question. Intel and AMD are both working on adding them, and within the next few years they will be widely available.