Ian Ward's email:
first name at this domain
wardi on OFTC, freenode and github
OLS topics on day two including Linux Kernel Development, EXT4, Cell Broadband Engine, Debugging Google clusters and LinuxBIOS.
All the OLS papers have now been posted. The talks were recorded this year, and the videos may be released online for free once the cost of production is covered by purchased copies. Contact Andrew Hutton if you can help by purchasing a copy.
Kernel development in the past two years has been moving faster and working more smoothly than it ever has. Initially Andrew Morton was expected to maintain the stable version of the kernel while Linus maintained the development version, as used to be the case with Linus' previous lieutenants. As it turns out, the opposite has happened. Linus is very good at maintaining the stable version of the kernel and Andrew is very good at managing the bulk of new patches being tested.
The pace of development is incredible. Over the past two years approximately 4 patches per hour have been merged, 24 hours a day, 7 days a week. About 2,000 new lines of code are added and about 3,500 are modified or removed every day. "No single company could keep up with this rate of change". By comparison, Vista's code base is small because Microsoft ships far fewer drivers with its operating system.
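To put those rates in perspective, here is a quick back-of-the-envelope calculation. It only multiplies out the figures quoted above; the 365-day year and the variable names are my own assumptions.

```python
# Rough totals implied by the quoted rates (a sketch, not official statistics).
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days ignored
YEARS = 2

patches_per_hour = 4
lines_added_per_day = 2000
lines_changed_per_day = 3500

days = DAYS_PER_YEAR * YEARS
patches_per_day = patches_per_hour * HOURS_PER_DAY
total_patches = patches_per_day * days
total_lines_added = lines_added_per_day * days
total_lines_changed = lines_changed_per_day * days

print(f"{patches_per_day} patches per day")
print(f"{total_patches} patches in {YEARS} years")
print(f"{total_lines_added} lines added, {total_lines_changed} modified or removed")
```

That works out to roughly 70,000 patches over the two years.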
The rate of change has been steadily increasing, while at the same time the number of new bugs being reported has stayed flat. The change is taking place all over the kernel from the core to the subsystems to the drivers. Kernel developers are not afraid to make core changes when they find a potential for improvement because all the drivers that depend on those changes are in the tree and easy to update.
The number of developers is also increasing, while the changes from the biggest contributors account for a smaller percentage of the total, indicating that development is scaling well for the new contributors. Individual developers, people contributing fewer than 10 patches, account for 33% of the patches that enter the kernel.
Some companies that make very heavy use of Linux are not well represented among kernel contributors. If these companies don't get involved then they are indicating that they are happy with the current state of development. Development is moving so quickly that any company maintaining kernel code outside of the tree will soon find that their changes are prohibitively expensive to maintain.
The new EXT4 file system is a backwards-compatible successor to the common Linux EXT3 file system. EXT4 now supports 48-bit block numbers, allowing file systems to grow to 1 Exabyte, which "should be enough for 5 or 10 years". Files up to 16 Terabytes are now supported. EXT4 is faster than EXT3. Recent file system benchmarks show EXT4 competing well with XFS. EXT4 is still considered experimental and is not recommended for use with important data.
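The 1 Exabyte and 16 Terabyte limits can be checked with a little arithmetic. This sketch assumes 4 KiB blocks, 48-bit block numbers for the file system and 32-bit logical block numbers within a single file. The block size and the 32-bit figure are my assumptions and were not stated in the talk summary above.

```python
# Size limits implied by the block-number widths (binary units).
BLOCK_SIZE = 4096            # assumption: 4 KiB blocks
FS_BLOCK_NUMBER_BITS = 48    # width of a file system block number
FILE_BLOCK_NUMBER_BITS = 32  # assumption: width of a block number within one file

TIB = 2 ** 40
EIB = 2 ** 60

max_fs_bytes = (2 ** FS_BLOCK_NUMBER_BITS) * BLOCK_SIZE
max_file_bytes = (2 ** FILE_BLOCK_NUMBER_BITS) * BLOCK_SIZE

print(max_fs_bytes // EIB, "EiB maximum file system size")
print(max_file_bytes // TIB, "TiB maximum file size")
```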
EXT4 now has extents for more efficient storage of large files. Extents can coexist with block maps from EXT3. A planned feature for EXT4 extents is called "persistent preallocation". With persistent preallocation a file can be allocated on disk with an extent that has a special flag to indicate that the extent is full of zeros. When this flag is set the file system will return zeros from the marked extent without reading from the actual blocks.
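The idea behind the zero flag can be modelled in a few lines. This is a toy sketch, not EXT4's on-disk format: the `Extent` fields, the `ToyDisk` class and `read_file_block` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks


@dataclass
class Extent:
    logical_start: int           # first block of the file covered by this extent
    physical_start: int          # first disk block backing it
    length: int                  # number of contiguous blocks
    uninitialized: bool = False  # the "full of zeros" flag


class ToyDisk:
    """Pretend disk whose unwritten blocks hold stale garbage."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0

    def read_block(self, number):
        self.reads += 1
        return self.blocks.get(number, b"\xaa" * BLOCK_SIZE)


def read_file_block(disk, extents, logical_block):
    """Return one block of file data, skipping the disk for flagged extents."""
    for ext in extents:
        if ext.logical_start <= logical_block < ext.logical_start + ext.length:
            if ext.uninitialized:
                return bytes(BLOCK_SIZE)  # zeros, without touching the disk
            offset = logical_block - ext.logical_start
            return disk.read_block(ext.physical_start + offset)
    return bytes(BLOCK_SIZE)  # a hole in a sparse file also reads as zeros


disk = ToyDisk()
preallocated = [Extent(0, 1000, 8, uninitialized=True)]
data = read_file_block(disk, preallocated, 3)
```

Here `data` is a block of zeros and `disk.reads` is still 0, even though the blocks underneath hold old data.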
Delayed allocation is also planned for EXT4. Delayed allocation allows better decisions about where to put file contents by allocating the disk blocks at flush instead of write. This makes it easier to avoid fragmenting a file over multiple free regions. Once a file system is heavily used some fragmentation can't be avoided, so EXT4 will support online defragmentation to maintain good performance.
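A toy model shows why waiting until flush helps. The allocator below is my own simplification (first-fit at write time, best-fit at flush time) and is not EXT4's actual allocator. Free space is a list of hypothetical `(start, length)` regions.

```python
def allocate_first_fit(free_regions, nblocks):
    """Carve nblocks out of free_regions in order; returns (start, length) pieces."""
    pieces = []
    remaining = nblocks
    for i, (start, length) in enumerate(free_regions):
        if remaining == 0:
            break
        take = min(length, remaining)
        if take:
            pieces.append((start, take))
            free_regions[i] = (start + take, length - take)
            remaining -= take
    if remaining:
        raise ValueError("out of space")
    return pieces


def merge(pieces):
    """Join pieces that happen to be adjacent on disk."""
    merged = []
    for start, length in pieces:
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged


def allocate_at_write(free_regions, write_sizes):
    """Allocate blocks immediately for every write call."""
    free = list(free_regions)
    pieces = []
    for nblocks in write_sizes:
        pieces += allocate_first_fit(free, nblocks)
    return merge(pieces)


def allocate_at_flush(free_regions, write_sizes):
    """Wait until the total size is known, then prefer one region that fits it all."""
    free = list(free_regions)
    total = sum(write_sizes)
    fitting = [region for region in free if region[1] >= total]
    if fitting:
        start, _ = min(fitting, key=lambda region: region[1])
        return [(start, total)]
    return merge(allocate_first_fit(free, total))


free = [(0, 6), (100, 20)]
writes = [4, 4, 4]  # three writes of 4 blocks each
eager = allocate_at_write(free, writes)
delayed = allocate_at_flush(free, writes)
```

With these inputs the eager allocator splits the file across both free regions, while the delayed one places all 12 blocks in a single run.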
With current tools, running fsck on a 1 Exabyte file system could potentially take 119 years to complete. "Most of us are unwilling to wait that long". To address this problem, EXT4 will check only the used inodes, not all the inodes on the disk.
The size of inodes has been increased to store extended attributes, increase the resolution of time stamps and store a version counter. The counter is incremented when the file is modified and there is talk of exporting it to user space via the stat function.
If you are planning to migrate to EXT4 you can create an EXT3 file system with "-I 256" (the mke2fs inode size option) to leave room for the larger inodes. Otherwise you can simply mount an EXT3 partition as EXT4 and not use some of the extra features, or you can expand the inodes with an offline tool.
The Cell Broadband Engine is the CPU used in the Sony PS3 and is available in server systems. The Cell BE has great floating point and I/O performance. One part of the CPU is a traditional PowerPC core called the "PPE" and the rest of the chip is occupied by 8 vector processing cores called "SPEs". The PPE by itself is slower than more traditional CPUs, so performance-sensitive applications need to be rewritten to use the SPEs as well.
The SPEs have 256KB of local RAM each and use a kind of DMA system for copying to and from main memory. They share page tables with the PPE, so they can run user code with limited access to main memory. Each SPE can be doing up to 16 simultaneous DMA operations.
The SPEs are set up for vectorized operations and use a different instruction set, but SPE code can be mixed with PPE code in C and GCC will take care of running each part in the right place. GCC now also supports overlays to allow the SPE code to grow beyond the 256KB limit. Wrappers are provided for syscalls from the SPEs that marshal data to and from the SPE formats and transfer control to the PPE for calling into the kernel.
The Cell BE is a power-hungry device, but there are hopes that newer versions produced using a 65nm process will be more efficient.
Debugging problems that occur in a cluster of thousands of machines presents some interesting challenges. There are problems such as 3-way races, intermittent delays and bugs that disappear when you try to attach a debugger. To track down these problems Google uses static probes in the kernel that can be enabled when something is not working, and configured to trigger on an event that is of interest.
These probes will continuously gather data and spool it off to disk or across the network. The information collected can be coarsely filtered in the kernel, then sent to user space using relayfs for finer filtering. Google uses LTTng and ktrace with a custom data format to reduce storage and network requirements. A number of visualization tools are used on the resulting data to try to gather information about problems.
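The probe-and-filter arrangement can be pictured with a small user-space model. This is only an illustration of the idea and does not use the LTTng, ktrace or relayfs interfaces. The `Probe` class, `fine_filter` and the event fields are all hypothetical.

```python
class Probe:
    """A named probe point that costs almost nothing while disabled."""

    def __init__(self, name):
        self.name = name
        self.enabled = False
        self.trigger = None  # optional coarse filter, run at the probe site

    def fire(self, buffer, **event):
        if not self.enabled:
            return  # disabled probes record nothing
        if self.trigger is not None and not self.trigger(event):
            return  # coarse filtering, standing in for the in-kernel stage
        buffer.append((self.name, event))


def fine_filter(buffer, predicate):
    """Finer filtering done later, standing in for the user-space stage."""
    return [(name, event) for name, event in buffer if predicate(event)]


buffer = []  # stands in for the channel that carries events out of the kernel
disk_io = Probe("disk_io")

disk_io.fire(buffer, latency_ms=900, device="sda")  # ignored: probe is off

disk_io.enabled = True
disk_io.trigger = lambda event: event["latency_ms"] > 100
disk_io.fire(buffer, latency_ms=5, device="sda")    # dropped by the coarse filter
disk_io.fire(buffer, latency_ms=250, device="sda")  # recorded
disk_io.fire(buffer, latency_ms=400, device="sdb")  # recorded

slow_sdb = fine_filter(buffer, lambda event: event["device"] == "sdb")
```

Only the two slow events reach the buffer, and the later pass narrows those down to the one from `sdb`.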
Many problems have been located using these tools, some in the kernel and others in Google's own programs or configuration. If a kernel problem is triggered by a proprietary Google application then data from the tracing (which is not proprietary) can be used to strengthen the case for a fix to go into the kernel.
Jonathan Corbet has covered this talk very well (about half-way down under the heading "ACPI Myths").
LinuxBIOS is a project to replace proprietary, slow and buggy BIOS implementations with an open source one that can be customized to suit your needs. Recently LinuxBIOS has been going into embedded devices, but it has also been deployed on clusters of traditional computers to improve manageability and reduce boot times.
LinuxBIOS uses a clever trick to let developers use C for most of the code, even before memory has been initialized for a stack. Apparently you can map memory into the CPU cache, zero it out, then "disable the cache", which stops writes from the cache back to RAM but still lets the CPU read and write in the cache. This allows room for a small stack all on-CPU so that important things like local variables and function calls will work properly.
The C code can then calculate memory timings, perform memory initialization and copy the LinuxBIOS image to RAM. Another block of software called "VSA2" is also copied to RAM. This code can be used to handle system level interrupts for video and sound, and may also be used to emulate PCI devices for embedded systems with no PCI bus.
Finally, LinuxBIOS will run plug and play device detection and PCI allocation before copying and executing the Linux kernel image or "payload". The payload is in ELF format, which can specify exactly where in memory everything will be copied or initialized. There are a number of payloads available for LinuxBIOS including ones that perform Ethernet booting with tftp, run a GRUB-like boot menu, perform memory tests and even one that starts a stripped-down X session.
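To show what "specify exactly where in memory" means, here is a minimal sketch that reads the program headers of a 32-bit little-endian ELF image and lists what a loader would copy and zero. It is not LinuxBIOS code. `load_plan` and its dictionary keys are hypothetical, and using the physical address field as the copy target is my assumption.

```python
import struct

PT_LOAD = 1  # program header type for a loadable segment

# ELF32 file header and program header layouts, little-endian.
EHDR = struct.Struct("<16sHHIIIIIHHHHHH")
PHDR = struct.Struct("<IIIIIIII")


def load_plan(image):
    """Return (entry_point, segments) describing how to place an ELF32 image."""
    (ident, _type, _machine, _version, entry, phoff, _shoff, _flags,
     _ehsize, phentsize, phnum, _shentsize, _shnum, _shstrndx) = EHDR.unpack_from(image, 0)
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("expected a 32-bit little-endian ELF image")

    plan = []
    for i in range(phnum):
        (p_type, offset, _vaddr, paddr, filesz, memsz,
         _pflags, _align) = PHDR.unpack_from(image, phoff + i * phentsize)
        if p_type == PT_LOAD:
            plan.append({
                "file_offset": offset,          # where the bytes sit in the image
                "copy_to": paddr,               # assumption: load at the physical address
                "copy_bytes": filesz,           # bytes copied from the image
                "zero_bytes": memsz - filesz,   # remainder is zero-filled (e.g. .bss)
            })
    return entry, plan


# A tiny hand-built image: one header, one loadable segment, 4 bytes of "code".
ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
ehdr = EHDR.pack(ident, 2, 3, 1, 0x100000, EHDR.size, 0, 0, EHDR.size, PHDR.size, 1, 0, 0, 0)
phdr = PHDR.pack(PT_LOAD, EHDR.size + PHDR.size, 0x100000, 0x100000, 4, 16, 5, 4)
image = ehdr + phdr + b"\x90" * 4

entry, plan = load_plan(image)
```

For the hand-built image, the plan says to copy 4 bytes to address 0x100000, zero the next 12, and then jump to the entry point.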