
Topic: Matt Dillon Interview

This is a repost of the interview we did with Matt Dillon, leader of the DragonFly BSD project. There's a lot of interesting information here, and I don't think we want to lose it. Thanks to Matt for the great answers.

How well do you think DragonFly is evolving given the project's goals, and what are the project's main achievements so far in your opinion?

I think we are making good progress towards our goals.  I am not able to do quite as much programming now as I used to due to the increased management load, but that simply means that DragonFly is becoming more popular so it is all to the better.

What are the benefits of DragonFly's serializing token model when compared to traditional mutexed kernels?

The main benefit is a simplified API with simplified programming rules. A DragonFly token may be held across a blocking condition (the 'lock' is temporarily lost during that time, unlike a normal lock).  That is, the token only serializes code while a thread is runnable, which is the key difference between a token and a normal lock, and it is all you really need 99.9% of the time.

The consequence of this model is that a subsystem can obtain a token and hold it while performing potentially complex calls into other subsystems without having to inform those other subsystems that it is holding the token.  The caller still has to be aware that the subsystem might block, thereby temporarily breaking the serialization the token provides, but this matters only for a small portion of the token use in the system and still isolates the knowledge to the caller (i.e. if it isn't sure it must assume that a call into another subsystem might block).

This is a far less onerous burden than you have with a hard mutex model.  With a hard mutex model you have to pass your mutex down into other subsystems that you call so they can release it before they potentially block, or you have to temporarily release your mutex yourself, or deal with a myriad of other issues.  Either way you are creating a coding burden and massive knowledge pollution between subsystems that makes the APIs less flexible and the coding more hazardous.
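
To make the contrast concrete, here is a minimal C sketch of the usage pattern being described; the names (token_acquire, token_release, other_subsystem_op) are invented for illustration and are not the actual DragonFly LWKT token API.

    /*
     * Illustrative sketch only: these names are invented and do not
     * match the real DragonFly kernel interfaces.
     */
    struct token;                           /* opaque serializing token  */

    void token_acquire(struct token *tok);  /* serialize while runnable  */
    void token_release(struct token *tok);
    int  other_subsystem_op(void *arg);     /* may block internally      */

    int
    subsystem_do_work(struct token *tok, void *arg)
    {
        int error;

        token_acquire(tok);
        /*
         * Serialized here.  We may call into another subsystem without
         * telling it we hold the token.  If that call blocks, the
         * token's serialization is dropped for the duration of the
         * sleep and reacquired before this thread resumes, so any state
         * cached before the call must be revalidated afterwards.
         */
        error = other_subsystem_op(arg);
        token_release(tok);
        return (error);
    }

With a hard mutex, other_subsystem_op() would instead have to know about and release the caller's mutex before blocking (or the caller would have to drop it itself), which is exactly the knowledge pollution described above.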

A lot of work has been done in the VFS subsystem recently. What were the benefits of this rewrite?

There are two major aspects to the VFS work:  (1) A reformulation of the namecache and (2) threading the VFS.  #1 is now complete.

Reformulation of the namecache is a major change in the way namespace operations work.  All namespace operations are now namecache-centric, which means that we can *lock* portions of the filesystem namespace using only the namecache without having to push down into the VFS itself. This removes the locking atomicity burden from all the VOP (vnode operations) calls a VFS must implement and greatly simplifies the call requirements and side effects.

For example, if you open a file O_CREAT|O_EXCL the system is now able to 'reserve' the filename solely via the namecache to prevent races, and once it has the namespace locked it can then make whatever VFS calls are necessary to implement the operation. This makes the VOP interface a lot more flexible. For example, VFS code is now able to block without having to worry about namespace races.
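
As a rough sketch of that flow, under the assumption of invented names (ncp_lookup_lock, vfs_create_locked) rather than the real namecache or VOP interfaces:

    /* Illustrative sketch only; these are not the real interfaces. */
    struct namecache;               /* a lockable element of the namespace */
    struct vnode;

    struct namecache *ncp_lookup_lock(const char *path);  /* reserve name */
    void              ncp_unlock(struct namecache *ncp);
    int               vfs_create_locked(struct namecache *ncp,
                                        struct vnode **vpp);

    int
    open_create_excl(const char *path, struct vnode **vpp)
    {
        struct namecache *ncp;
        int error;

        /*
         * Lock the name in the namecache.  No other thread can resolve
         * or create this path component until we unlock it, so the
         * O_CREAT|O_EXCL race is excluded before the VFS is involved.
         * (If the name already resolves, the open returns EEXIST here.)
         */
        ncp = ncp_lookup_lock(path);

        /* The filesystem may block freely; the namespace stays locked. */
        error = vfs_create_locked(ncp, vpp);

        ncp_unlock(ncp);
        return (error);
    }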

The second major aspect of the VFS work is to make it possible to thread a filesystem on a per-mount basis... either one thread per mount, or N threads per mount, depending on the filesystem. This has not been completed yet and requires a great deal of groundwork to make happen (just as the namecache work required a great deal of groundwork). The idea here is to support filesystems in a multi-threaded environment without requiring that every single last VFS implementation be SMP capable.  VFSs which do not have to be highly reentrant, such as ISO 9660 (CDROM) filesystems, could be implemented with just one thread and thus not have to be coded with concurrency in mind.  High performance VFSs such as UFS would have to be multi-threaded, but even in the multi-threading case we have a great deal of flexibility.

The threading model allows us to implement and test 85% of the code without actually having to go MP, greatly reducing the programming burden. Then, when we do start to multi-thread high performance filesystems, if a bug creeps in we can determine whether it is a concurrency issue or not by telling the system to use only one thread for the mount and observing whether the bug still occurs.  The system operator thus has direct control over the trade-off between a multi-threaded mode of operation that might potentially have more bugs versus a single-threaded mode of operation that might potentially have fewer bugs.  Plus we don't have to multi-thread the entire VFS system all at once; we can get most of the infrastructure in place and tested and then attack multi-threading issues on a filesystem by filesystem basis.
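
One way to picture the model, purely as a conceptual sketch with invented names and the locking elided, is a message port per mount that one or more service threads drain:

    /* Conceptual sketch only; types and names are invented. */
    struct vop_msg;                      /* one queued VOP request        */

    struct mount_port {
        int nthreads;                    /* 1 for ISO 9660, N for UFS     */
        /* queue of pending vop_msg structures lives here */
    };

    /* Queue the request, sleep until a service thread completes it,
     * and return the operation's error code. */
    int port_sendmsg(struct mount_port *port, struct vop_msg *msg);

    int
    vop_dispatch(struct mount_port *port, struct vop_msg *msg)
    {
        /*
         * The caller never runs filesystem code directly.  With
         * nthreads == 1 the filesystem is serialized for free and needs
         * no SMP awareness; raising nthreads later only changes how
         * many workers drain the same queue.  Dropping back to one
         * thread is also how a suspected concurrency bug is isolated.
         */
        return (port_sendmsg(port, msg));
    }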

A great deal of precursor work is necessary in order to ensure that a threaded VFS model remains efficient.  Among other things, we need to move the system data caching up a layer (similar to the namecache work) to allow the system to satisfy cached requests from the VM page cache and namecache without having to push down into the VFS.

Can you talk a bit about the journaling infrastructure you're working on currently and what benefits it will bring to users and system administrators? What are the advantages of having a kernel journaling layer, as opposed to "traditional" journaling implemented in other operating systems?

The journaling work has three goals.  First, to provide a near realtime, infinitely fine-grained off-site backup for a filesystem.  Second, to eventually be used as a transport along with an integrated cache coherency mechanism to move data between machines in a cluster.  Third, to provide feedback to a filesystem to allow the filesystem to take advantage of the knowledge that operations are being journaled.  If you consider the goals you will find that a ton of other uses become obvious.  For example, security auditing becomes trivial, mirroring becomes trivial, reverting a filesystem or portions of a filesystem to its state as of some prior date becomes very easy to do, and so forth.

I am about halfway through the first goal.  I have the basic infrastructure in place and I am working on filling it out to actually generate a meaningful data stream over a generic descriptor (socket, pipe, whatever).
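
Purely as an illustration of what a record stream over a generic descriptor could look like (this is not the actual journal format, which is still being defined):

    /* Hypothetical record framing, for illustration only. */
    #include <stdint.h>
    #include <unistd.h>

    struct jrec_hdr {
        uint32_t magic;          /* marks the start of a journal record  */
        uint32_t opcode;         /* e.g. create, write, rename, setattr  */
        uint64_t txid;           /* monotonically increasing transaction */
        uint32_t payload_len;    /* bytes of opcode-specific data        */
    };

    /*
     * Emit one record to any descriptor: a socket to an off-site
     * backup, a pipe to a mirroring or auditing daemon, etc.
     */
    static int
    jrec_emit(int fd, const struct jrec_hdr *hdr, const void *payload)
    {
        if (write(fd, hdr, sizeof(*hdr)) != (ssize_t)sizeof(*hdr))
            return (-1);
        if (hdr->payload_len != 0 &&
            write(fd, payload, hdr->payload_len) !=
            (ssize_t)hdr->payload_len)
            return (-1);
        return (0);
    }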

In the mailing lists, you mentioned journaling as one of the three legs that would support DragonFly's SSI clustering goals. Can you talk a bit about the other two legs, the cache coherency scheme and resource sharing and migration?

Whenever you have multiple machines trying to share data you need a mechanism to ensure that no machine is trying to operate on obsolete data.  When you try to combine cache coherency with real multi-master caching as we intend to do, this becomes the #1 issue.  A fully integrated cache coherency and data transport model is very difficult to implement so we have broken the problem up into two pieces.  The cache coherency part of the equation will work in a manner similar to a namespace... it will manage the data's namespace (including offset ranges within a data object) but not deal with the actual data itself. The data transport model will be responsible for dealing with the actual data.  Transaction ids will allow the two parts of the system to be integrated together into a single whole by the kernel.  For example, if the kernel is holding a piece of cached data with transaction id "A" but the cache coherency layer says that the latest version is "B", the kernel then knows it has to throw away "A" and to block until it gets "B".  That is a very simplified description of the mechanism.
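
In code, that kernel-side check reduces to something like the following sketch, with invented names and the locking and wait machinery elided:

    /* Invented names; locking and wakeup details elided. */
    #include <stdint.h>

    struct cached_obj {
        uint64_t txid;           /* transaction id of our cached copy    */
        /* ... the cached data itself ... */
    };

    /* Provided by the cache coherency layer: the current ("B") txid. */
    uint64_t coherency_latest_txid(struct cached_obj *obj);

    /* Provided by the data transport (journaling) layer. */
    void cache_invalidate(struct cached_obj *obj);
    void transport_wait(struct cached_obj *obj, uint64_t txid);

    void
    cache_validate(struct cached_obj *obj)
    {
        uint64_t latest = coherency_latest_txid(obj);

        if (obj->txid != latest) {
            /* We hold "A" but the latest is "B": discard our copy and
             * block until the transport delivers the newer version. */
            cache_invalidate(obj);
            transport_wait(obj, latest);
            obj->txid = latest;
        }
    }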

Resource sharing and migration are aspects of the integrated whole. If you have a cache coherency mechanism (not yet written) and you have a data transport mechanism (the journaling), then you effectively have data sharing.  Migration is a slight modification of the sharing scheme that tries to move mastership to a specific target, or away from a specific source machine that you might want to take down for maintenance.

What are your plans for DragonFly regarding security? You've mentioned extending jails to something more powerful a few times in the mailing lists. Could you explain what you have in mind?

Security is a tough nut to crack.  There are many aspects to system security.  An audit trail is important, and I hope to have that via the filesystem journaling.  Compartmentalization is important, and we might borrow from OpenBSD to get that (i.e. the idea that a program is only allowed to touch certain parts of the filesystem and only run certain suid programs, even if it is a root-owned or suid program). Service isolation is important.  For example, take a password lookup. Would you rather have a program like ftpd or sshd open /etc/spwd.db itself, or would you rather it NOT have access to your encrypted password file and instead connect to a service and pass the username and challenge down to it?  I would far prefer to have the service because then I only have to worry about one piece of software compromising my password file rather than 50 pieces of software.  The service could rate limit dictionary attacks and do other things to prevent a compromise of your entire encrypted password file, reducing the fallout from security events.
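
As a sketch of what that service boundary could look like from the client side (the socket path and wire protocol here are entirely hypothetical):

    /* Hypothetical client; the socket path and protocol are invented. */
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Ask the authentication daemon whether user/response is valid.
     * Returns 1 on accept, 0 on reject, -1 on error.  Only the daemon
     * can read the password database, and it can rate-limit dictionary
     * attacks on behalf of every client.
     */
    static int
    auth_check(const char *user, const char *response)
    {
        struct sockaddr_un sun;
        char reply;
        int result = -1;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return (-1);
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, "/var/run/authsvc.sock",
                sizeof(sun.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0 &&
            write(fd, user, strlen(user) + 1) > 0 &&
            write(fd, response, strlen(response) + 1) > 0 &&
            read(fd, &reply, 1) == 1)
            result = (reply != 0);
        close(fd);
        return (result);
    }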

We aren't OpenBSD, however.  We are security conscious, but if we were to focus on it to the degree that OpenBSD does we would wind up doing nothing else, and there wouldn't be much point to having a DragonFly project. Security is important, but it isn't our biggest focus.

Much has been discussed about packaging in the mailing lists. What are your views on how a package management system should work, and which DragonFly-specific features could it make use of?

I have a few simple requirements: 

    * Multiple installed versions must be supported.

    * Defaults controllable on a per user / per process basis.  One user might want colorls, another might want traditional ls.  One user might want one version of perl as his default, another might want another.

    * Ability to isolate packages so they are not readily available in an environment.  For example, if building a package depends on A, B, and C, then I only want A, B, and C available when I build the package, and not D, E, F, G... Z.  Otherwise we have no real ability to determine whether dependency lists are correct or not.

This would also allow us to build packages using the exact version of the library or libraries they depend on rather than whatever version happens to be available, resulting in a more robust package building mechanism.  Other mechanisms can deal with informing the user or system operator that a package needs to have its dependencies upgraded for security or other reasons; it shouldn't be a build-time "oh let's just try building with whatever libraries we have and pray that it works" deal.

    * Users would use binary packages rather than source-based packages by default.  This allows us to enforce additional requirements for the build environment without creating a mess of bug reports from our users.

I believe we can accomplish these goals using the combination of an existing packaging system and our VARSYM utility (dollar variables embedded in symlinks) to control visibility.
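
As a purely conceptual userland illustration of the variant-symlink idea (the real expansion happens in the kernel during name lookup, and the PERL_VER variable here is invented): a symlink such as /usr/bin/perl -> /usr/pkg/perl-${PERL_VER}/bin/perl would resolve differently depending on how PERL_VER is set for the user or process doing the lookup.

    /* Conceptual demo only; not the kernel's varsym implementation. */
    #include <stdio.h>
    #include <string.h>

    /* Expand a single ${NAME} occurrence in a symlink target. */
    static void
    expand_varsym(const char *target, const char *name, const char *value,
                  char *out, size_t outlen)
    {
        char pattern[64];
        const char *p;

        snprintf(pattern, sizeof(pattern), "${%s}", name);
        p = strstr(target, pattern);
        if (p == NULL) {
            snprintf(out, outlen, "%s", target);
            return;
        }
        snprintf(out, outlen, "%.*s%s%s",
                 (int)(p - target), target, value, p + strlen(pattern));
    }

    int
    main(void)
    {
        char path[256];

        /* One user sets PERL_VER=5.8.5, another sets PERL_VER=5.6.2;
         * the same symlink points each of them at their own default. */
        expand_varsym("/usr/pkg/perl-${PERL_VER}/bin/perl",
                      "PERL_VER", "5.8.5", path, sizeof(path));
        printf("%s\n", path);    /* /usr/pkg/perl-5.8.5/bin/perl */
        return (0);
    }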

I remember when someone asked on the mailing lists how you keep a project as big as an operating system in your head, and your answer was "Basically you have to live it". Can you give some tips or suggestions for the people out there getting started on kernel programming who want to live the code?

I'm going to be sage here.  You can only live the code if you already know what that means.  Meaning that it isn't something you learn, it is something that you already want to do and so you just do it.

Are you still a Netrek fan? :)

I used to play netrek.  A long time ago.  My handle was "fx".