Sam's Blog
Wednesday, April 06, 2005
I know it's been a while, my apologies. Nothing but a very geeky, tech-related post today, and this one's on a happy note: I solved a little mystery that was really getting under my skin. It wasn't even important, but it opened my eyes to a little quirk of Unix filesystems (and filesystems in general).

I'd never thought much about the fact that files often take up more space on disk than their actual file size. They do, mostly because files are allocated in whole blocks (or clusters; same thing, different term on different filesystems). File sizes usually don't fall on exact block boundaries, so there's a small chunk of wasted space at the end of a file: the unused portion of its last block (this is known as "internal fragmentation"). My ext3 partition uses a 4K block size, so every file can waste up to 4K minus 1 byte.
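To put numbers on that, here's a minimal sketch of the slack-space arithmetic. The function name `slack_bytes` is just something I made up for illustration, and the 4096-byte default is my partition's block size, not a universal constant.

```python
def slack_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes left unused in the last block of a file (hypothetical helper)."""
    if file_size == 0:
        return 0  # an empty file needs no data blocks at all
    remainder = file_size % block_size
    # A file that ends exactly on a block boundary wastes nothing.
    return 0 if remainder == 0 else block_size - remainder


print(slack_bytes(1))     # 4095, the worst case: 4K minus 1 byte
print(slack_bytes(4096))  # 0
print(slack_bytes(5000))  # 3192
```

This only covers the tail of the last block; it says nothing about the addressing overhead discussed further down.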

I've known about this for some time now. What I didn't know is that this isn't the only space that's "wasted" on Unix filesystems ("wasted" is a harsh term; "necessarily used for addressing" would be more appropriate).

I found this out by taking the block count given to me by the stat command and multiplying it by 512 (on Linux, stat reports blocks in 512-byte units, whatever the filesystem's own block size is). I figured this was how one calculated the actual size of a file on disk. The two sizes were different, all right. For very small files, the difference was always smaller than my block size (4K). Larger files, however, showed a much, much larger difference. It wasn't phenomenal (about 1.3 megs for a 1.4 gig file), but it was perplexing because I had no explanation for it. Why wasn't the difference fitting within a single block?
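The same comparison can be scripted. This is a rough sketch, assuming a Unix-like system where Python's `os.stat` exposes `st_blocks` in 512-byte units (as it does on Linux); `size_report` is a name I picked for illustration.

```python
import os
import sys


def size_report(path):
    """Return (apparent size, size on disk, difference) in bytes.

    Assumes st_blocks counts 512-byte units, as on Linux.
    """
    st = os.stat(path)
    apparent = st.st_size
    on_disk = st.st_blocks * 512
    return apparent, on_disk, on_disk - apparent


if __name__ == "__main__":
    # Usage: python size_report.py FILE [FILE ...]
    for path in sys.argv[1:]:
        apparent, on_disk, diff = size_report(path)
        print(f"{path}: {apparent} bytes apparent, {on_disk} on disk, {diff} extra")
```

The difference can come out negative for sparse files, so don't read it as pure overhead in every case.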

The answer took me a while to find. I hit the chat rooms on freenode like a banshee, but no one had an answer. I researched and researched ext2/3 and Unix filesystems on the web, and I finally found out why. It has to do with the way inodes address data blocks.

Basically, the first 12 blocks are addressed directly from the inode. This makes finding and reading the first 12 blocks of a file (usually the most important ones) very fast. The pointer after that isn't a direct address; it points to another block that is full of block addresses. With 4K blocks and 4-byte addresses, that block holds 1024 pointers, so it can address 1024 * (block size) more data, or 4 megs. This is known as the single indirect block. After that comes one more pointer. This one points to a block of pointers to blocks of pointers to data; it's known as the double indirect block, and it lets you address around 4 gigs more. Finally, if you need more than that, there's the triple indirect block (you get the idea), which takes you up to 2TB worth of data (the file size limit).

So at 13 blocks, an extra block is used to address up to 1024 more data blocks. At 1037 blocks, 2 additional blocks are needed at once (the double indirect block plus the first block of pointers under it), and another block is needed for every 1024 data blocks after that, until the double indirect block's 1024 * 1024 data blocks are used up. Past that point the triple indirect block kicks in and the count jumps by 3 at once, then keeps going up by 1 for every 1024 data blocks, plus 1 more for every 1024 * 1024. Basically, the filesystem grabs additional blocks for addressing as needed, as the file grows. This lets inodes stay small and fixed in size instead of being large enough to address every block of the biggest possible file (which would waste lots more space, let me tell ya). These address blocks are reported by stat as being used by the file.
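Here's a small sketch of that counting scheme, so you can plug in a file size and see how many address blocks it should need. It assumes the classic ext2/ext3 layout as I understand it: 12 direct pointers, 4-byte block addresses, and one single, one double and one third-level (triple) indirect pointer per inode. The function names are mine, not anything from the kernel.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def indirect_blocks(data_blocks: int, block_size: int = 4096, direct: int = 12) -> int:
    """Estimate the address (indirect) blocks needed for a file.

    Assumes 4-byte block pointers, so a 4K block holds 1024 of them.
    """
    ptrs = block_size // 4
    remaining = data_blocks - direct
    if remaining <= 0:
        return 0  # everything fits in the inode's direct pointers

    total = 1  # the single indirect block
    remaining -= ptrs
    if remaining <= 0:
        return total

    # Double indirect: one top block plus one pointer block per 1024 data blocks.
    n = min(remaining, ptrs * ptrs)
    total += 1 + ceil_div(n, ptrs)
    remaining -= n
    if remaining <= 0:
        return total

    # Triple indirect: one top block, then second- and first-level pointer blocks.
    n = min(remaining, ptrs ** 3)
    total += 1 + ceil_div(n, ptrs * ptrs) + ceil_div(n, ptrs)
    if remaining > n:
        raise ValueError("file too large for this addressing scheme")
    return total


print(indirect_blocks(12))    # 0
print(indirect_blocks(13))    # 1
print(indirect_blocks(1037))  # 3
```

For a file of roughly 1.4 gigs (367002 blocks of 4K) this gives 360 address blocks, or about 1.4 megs, which is in the same ballpark as the difference I measured.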

In conclusion, a large file tends to use (very, very roughly) an additional 0.1% of its size in address blocks. FAT16/32, NTFS, etc. have their own ways of using additional disk space for addressing; however, when they're mounted under Linux, stat doesn't report that space as blocks used by the file. That kind of makes it look like FAT16/32 and NTFS are doing a better job of preventing wasted space ;-) (I know, I know, the space isn't "wasted", it's necessarily used for addressing, but it looks wasted to the layman). Not true, but it sure looked that way to me for a while! I'm glad I finally managed to figure this out for myself.

And if any of this is wrong, please someone email me and straighten me out. I'd really love to understand this correctly.
