AUG 19 2006

Why does Linux need defragmenting?

This oft-repeated myth is getting old and boring. And it's untrue. "Linux doesn't need defragmenting, because its filesystem handling is not as stupid as that of the decades-old FAT." Yadda yadda, blah blah. Now, the real question is: if Linux really doesn't need defragmenting, why does Windows boot faster, and why does the second startup of KDE take only roughly a quarter of the time the first one does?

Ok, first of all, talking about defragmenting is actually wrong. Defragmenting means making sure no file is fragmented, i.e. that every file occupies one contiguous area of the disk. But do you know of any application today that reads just one file? What we should be talking about instead is linearizing, i.e. making sure that related files (not one file, files) occupy one contiguous area of the disk.

Just in case you don't know, let me tell you one thing about the thing busily spinning in your computer: it's very likely it can read 50 MB or more per second without trouble, as long as it's a block read of contiguous data. However, as soon as it actually has to seek in order to reach data scattered over various areas of the disk, read performance plummets: only bloody fast drives today have an average seek time below 10 ms, and your drive is very likely not one of them. Now do the maths: how many times does 10 ms (or more) fit into one second? Right, at most 100 times. So your drive can on average read at most about 100 files a second, and that ignores the fact that reading a file usually means more than a single seek (on the other hand it also ignores the drive's built-in cache, which can avoid some seeks). Some of the pictures explaining how Linux doesn't need defragmentation actually nicely demonstrate that with files scattered like that, the disk simply has to seek.
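
To make the arithmetic concrete, here is a back-of-the-envelope sketch in C (the 50 MB/s and 10 ms figures are the ones from above; the 64 KB average file size is purely my assumption):

/* Effective throughput when every file costs one seek.
   Assumptions: 50 MB/s sequential reads, 10 ms average seek,
   64 KB average file size (the last figure is made up). */
#include <stdio.h>

int main(void)
{
    const double seq_mb_s = 50.0;  /* sequential read speed, MB/s */
    const double seek_ms  = 10.0;  /* average seek time, ms */
    const double file_kb  = 64.0;  /* assumed average file size, KB */

    double read_ms = file_kb / 1024.0 / seq_mb_s * 1000.0;
    double files_s = 1000.0 / (seek_ms + read_ms);

    printf("%.0f files/s, effective %.1f MB/s\n",
           files_s, files_s * file_kb / 1024.0);
    return 0;
}

With those numbers it prints roughly 89 files/s at an effective 5.6 MB/s, i.e. the seeking has thrown away about 90% of the drive's bandwidth.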

Now, again, how many files does an average application open during startup? One? It's actually hundreds, usually, at least. And since the Linux kernel (AFAIK) currently has next to no support for linearly reading several files, you can guess what happens. Indeed, kernel developers will undoubtedly tell you that it's the applications' fault and that they shouldn't be using so many files, but then kernel developers often have funny ideas about how userspace should work. Seriously, why do we have filesystems if they're not to be used, and applications should instead pack all their data into a single file? For people who don't know about this problem (and most don't, actually) it feels quite natural to structure data into files.
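
Just to illustrate the kind of userspace workaround preload tools resort to, here is a minimal sketch (not KDE's actual code; the paths are examples) that hints the kernel about a batch of files up front, so the I/O scheduler at least gets a chance to sort the requests instead of servicing one synchronous seek after another:

/* Sketch: ask the kernel to prefetch a batch of files before they are
   actually needed. readahead(2) is Linux-specific; posix_fadvise(fd, 0, 0,
   POSIX_FADV_WILLNEED) is the portable alternative. The paths are examples. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static void prefetch(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;
    struct stat st;
    if (fstat(fd, &st) == 0)
        readahead(fd, 0, st.st_size);  /* queue the whole file for reading */
    close(fd);
}

int main(void)
{
    const char *files[] = {            /* example paths only */
        "/opt/kde3/lib/libkdecore.so",
        "/opt/kde3/lib/libkdeui.so",
    };
    for (unsigned i = 0; i < sizeof(files) / sizeof(files[0]); ++i)
        prefetch(files[i]);
    return 0;
}

Note that even this only queues the reads; nothing guarantees the files are anywhere near each other on the disk in the first place, which is exactly the linearization problem.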

Nothing is perfect, and just blaming kernel developers for this wouldn't be quite fair, but it sometimes really upsets me when I see people "fixing" problems by claiming they don't exist. I am a KDE developer, not a kernel developer, so it may very well be that some of what I've written above is wrong, but the simple fact that the problem exists can easily be proved, even by you:

Boot your computer, log into KDE and wait for the login to finish. Log out. Log in again. Even if you use a recent distribution with some kind of preload technique that reduces this problem, there should still be a visible difference. And the only difference is that the second time almost everything is read from the kernel's disk caches instead of from the disk itself, which avoids both reading the data and seeking. The difference is the seeking, not the reading of the data: KDE is very unlikely to read more than 100 MB of data during startup, and that's 2 seconds with a 50 MB/s disk. Is the difference really only 2 seconds for you? I don't think so.
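
By the way, on kernels 2.6.16 and newer you don't even need to reboot to repeat the cold-cache half of this experiment: you can throw the caches away by writing to /proc/sys/vm/drop_caches (as root). A minimal sketch:

/* Sketch: drop the page, dentry and inode caches so the next login has to
   go to the disk again. Needs root and kernel >= 2.6.16. sync() first,
   because only clean pages get dropped. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("drop_caches");
        return 1;
    }
    fputs("3\n", f);
    fclose(f);
    return 0;
}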

So, who still believes this myth that everything in the land of Linux filesystems is nice and perfect? Fortunately, some kernel developers have started investigating this problem and possible solutions.

Comments

> why do I have to wait up to 60 seconds for KWrite to start on my computer

Because the GNU dynamic linker does a bad job with C++ code, and with code that uses the same prefix for all its functions (e.g. gaim_ in GAIM). It hashes a symbol name into a bucket quickly enough, but within a bucket it falls back to string comparisons, and each comparison has to walk the long common prefix before it can tell two names apart. Much of the startup time of KDE applications (and also of OpenOffice, Mozilla and apps like GAIM) can be traced back to this problem.
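
If you want a rough feel for this cost on your own machine, you can time how long it takes to force all the relocations of a big C++ library: dlopen() with RTLD_NOW resolves every symbol up front. This is only a crude sketch and the library path is just an example; glibc's LD_DEBUG=statistics environment variable gives far more detailed numbers:

/* Rough sketch: time the eager symbol resolution of a large C++ library.
   Run it twice; the second run is cache-warm, so what remains is mostly
   relocation work rather than disk I/O. Build with: gcc timelink.c -ldl */
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    void *h = dlopen("/opt/kde3/lib/libkdeui.so", RTLD_NOW);  /* example path */
    gettimeofday(&t1, NULL);
    if (!h) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    printf("dlopen took %ld us\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    dlclose(h);
    return 0;
}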


By vdboor at Sat, 08/19/2006 - 17:50

I have heard this many times; how is it possible that it hasn't been fixed yet?
And if it hasn't, why is it so difficult?

[If this post is off-topic, please excuse me; perhaps we should start a separate thread about this problem]


By Maurizio Monge at Sun, 08/20/2006 - 11:55

I think the most important reason why Windows boots faster than Linux is the amount of control Microsoft has, and the fact that they can simply assign someone a boring task and get stuff fixed. Windows DOES reorder files, and Linux could do it too; it's just that nobody has ever written it, and if someone did, you're right: he/she would have a hard time getting it into the kernel.

Also, in Windows you have one base set of libraries, I guess, while Linux has a more diverse set of them, so more has to be loaded. A clean KDE/Qt-only system could be very fast, I think, but it gets bloated with GTK/Wx/etc. stuff.


By superstoned at Sat, 08/19/2006 - 11:02

Last October, I did some experimenting with putting Gentoo on flash media. I don't have a snazzy BIOS that can boot off USB storage, so I built a kernel, put it in /boot, and booted it with root=/dev/sda1.

I did a little write-up at the time, but to summarize: Linux is slower at initializing USB storage devices than IDE, so you have to wait 6 seconds for the card reader to come online, and the end result is a tie. Which I consider not too bad for a device with a quarter of the linear read speed of the hard disk.


By sapphirecat at Sat, 08/19/2006 - 18:39

A while back I did a project for my systems class where I profiled the seeks done by the kernel on my machine during KDE startup.
It turned out the filesystem did a horrible job: the files kbuildsycoca stat'd ended up split between two opposite ends of the disk,
with seeks alternating between the two after every 5 files or so.
http://www.cs.cornell.edu/~maksim/trace35.pdf is the picture; in it, a cross represents a request and the circle connected to it its completion (so you can see how long it took between the app asking for the data and the drive delivering it). The X axis is time (in ms, if I recall); the Y axis has the LBA numbers, roughly the head position along the disk.

To be fair, the machine had been around forever, and being a development machine it had seen far heavier filesystem activity than normal, but still...

As a matter of fact, I am not even sure this whole "anti-fragmentation" heuristic is actually a good idea: while it makes fragmentation less likely, it can also spread files widely across the disk, increasing seek times. One would probably find that in a typical install of $DISTRO, even on a clean hard drive, the files KDE needs to start are scattered all over the disk.

And there is a further caveat to the whole use-one-file "solution":
the application has no access to information about disk geometry and no control over placement, so it has no way of structuring its indexing structures for good seek locality. And, of course, there is zilch guarantee that the one big honking file, which often needs to change size, would not itself get badly fragmented, resulting in a whole lot of head ping-pong (though perhaps madvise(MADV_WILLNEED) can help with that...). And there are further complications because a file is a natural unit of atomicity: merge multiple atomic units of information into one file, and suddenly one has to do all sorts of concurrency control in user applications. Whaaa?
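
For what it's worth, the madvise hint I mean would look something like this; just a sketch, with minimal error handling and a made-up file name:

/* Sketch: map the one big file and ask the kernel to start paging it in
   asynchronously. Error handling kept minimal; the path is made up. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("appdata.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    madvise(p, st.st_size, MADV_WILLNEED);  /* async readahead of the mapping */
    /* ... parse the data ... */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}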

The bottom line, in my view, really, is that the OS-provided file abstraction cannot provide decent performance.


By Maksim Orlovich at Sun, 08/20/2006 - 04:11

It's one of the least-known filesystems out there (or should I say, least used?), but Hans Reiser's filesystem has a lot of optimizations in there that seem to be specifically targeted at this.
I had the pleasure of seeing Hans give a presentation at FOSDEM a couple of years back, and I have to say he is an excellent speaker. But more importantly, he went into things like compressing files into fewer blocks, moving data into the directory blocks to avoid seeks, and otherwise moving things around to get higher data density.

I'd be interested to know how much time is gained by his techniques.


By Thomas Zander at Sun, 08/20/2006 - 09:22

It is quite easy to get an idea of which files are read during login. Before logging in, wait at least one minute, so that earlier accesses to the files in your home directory fall out of the one-minute window (at least if you were previously logged in). Directly after logging in, execute

find .kde -amin -1

That will list all files in the .kde directory accessed less than a minute ago (note the -1: a bare 1 would match files accessed between one and two minutes ago, not within the last minute).
Right now, when I run it, it lists 254 files and directories.


By claes at Mon, 08/21/2006 - 16:36

I don't know what OS the author uses; I tried PCLinux Full Monty and it did boot almost as slowly as Windows, but definitely not slower. I thought it was because of all the bloatware, but maybe it's KDE.
My Ubuntu 12.10 with GNOME 3 boots about 4 times faster than my Windows. Maybe if you're willing to use tools as dangerous as hibernation in Windows 8 AND keep your system extremely well optimized, it would boot nearly as fast as Linux. Still, if you used Linux for a couple of months you'd be frustrated by how slowly everything in Windows works (not to mention being spied on by your own OS, which also makes it easier for others to put spyware into your Windows).


By me at Sun, 04/27/2014 - 13:38
