Sparse files
I was astonished the other day by the following experience:
I booted an LNX-BBC test image and started a test download of something using BitTorrent. BitTorrent was downloading into a RAM disk (actually a tmpfs, which is kind of like a dynamically-sized RAM disk) with a maximum size of about 128 MB.
I expected the download to run out of space at some point, since the file I was trying to download was about 1 GB, much larger than the RAM disk. After a little while, I went to check on it by looking at how much disk space had been used.
The system said that only about 50 MB had been downloaded. Then I remembered that BitTorrent preallocates space for the files it downloads. I couldn't understand how the download had even been able to start. The BitTorrent FAQ says:
"BitTorrent pre-allocates the entire file when your download begins, then writes in pieces in random order as it gets them. As a result the file jumps to its full size immediately. BitTorrent will tell you when the download is complete."
If that were the case, how could a BitTorrent download of a 1 GB file even start on a 128 MB RAM disk? Furthermore, how could the system claim that only 50 MB had been used?
My confusion was compounded when I went to look at the size of the preallocated file, and ls reported it as occupying 1 GB.
Nick, who was visiting, explained that Unix supports sparse files, so a file's size as recorded in the filesystem may be substantially larger than the amount of space it actually takes up. When BitTorrent preallocates a file at its full size on a Unix filesystem, the file uses only a trivial amount of actual storage at first, and the amount of storage used increases as the download progresses.
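To convince myself of what Nick described, here is a minimal Python sketch of the pattern. I'm not claiming this is how BitTorrent itself does it; the function names (preallocate, write_piece, allocated_bytes), the file name, and the sizes are all made up for illustration. It assumes a Unix system where st_blocks is counted in 512-byte units and a filesystem that supports sparse files; on one that doesn't, the allocated figure will simply start out large.

```python
import os
import random
import tempfile


def preallocate(path, size):
    """Set the file's length to `size` bytes without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file; the unwritten range is a hole


def write_piece(path, index, piece_size, data):
    """Write one piece at its final offset, as an out-of-order download would."""
    with open(path, "r+b") as f:  # r+b: update in place, do not truncate
        f.seek(index * piece_size)
        f.write(data)


def allocated_bytes(path):
    """Storage actually allocated (assumes 512-byte st_blocks units)."""
    return os.stat(path).st_blocks * 512


def demo(total_size=64 * 1024 * 1024, piece_size=256 * 1024):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "download.bin")  # hypothetical file name
        preallocate(path, total_size)
        print("length:", os.path.getsize(path),
              "allocated:", allocated_bytes(path))

        # Write a handful of pieces in random order and watch usage grow.
        pieces = list(range(total_size // piece_size))
        random.shuffle(pieces)
        for n, index in enumerate(pieces[:8], start=1):
            write_piece(path, index, piece_size, os.urandom(piece_size))
            print("after", n, "pieces, allocated:", allocated_bytes(path))


if __name__ == "__main__":
    demo()
```

The length stays fixed from the first line of output onward, while the allocated figure should climb as pieces land. The exact numbers depend on the filesystem, which may report allocation lazily.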
I found this totally astonishing. I'm familiar with sparse files, but I always thought they were a VMS thing and never realized that they've been a standard part of Unix for a long time. I don't know how I missed that.
The basic consequence of this is that the file size reported by ls -l can be totally different from the file size reported by du: ls -l shows the file's nominal length, while du shows the space actually allocated to it. Blocks not yet written are simply not allocated on disk, and reading them returns zeroes.
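The same two numbers can be read directly from stat. The sketch below is only an illustration: the helper names (make_holey_file, ls_size, du_size, read_at) are mine, and it assumes a Unix system where st_blocks is in 512-byte units. It creates a hole by seeking far past the start and writing a single byte, then compares the length with the allocation and reads from the middle of the hole.

```python
import os
import tempfile


def make_holey_file(path, size):
    """Create a file of `size` bytes by writing only its final byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)   # skip over everything before the last byte
        f.write(b"\0")     # the skipped range is never written


def ls_size(path):
    """The length in bytes, as shown by ls -l."""
    return os.stat(path).st_size


def du_size(path):
    """Bytes actually allocated (assumes 512-byte st_blocks units)."""
    return os.stat(path).st_blocks * 512


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo(size=32 * 1024 * 1024):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sparse.bin")  # hypothetical file name
        make_holey_file(path, size)
        print("ls -l style size:", ls_size(path))
        print("du style size:   ", du_size(path))
        # Reading from the unwritten region yields zero bytes, not an error.
        chunk = read_at(path, size // 2, 16)
        print("bytes from the hole:", chunk)


if __name__ == "__main__":
    demo()
```

On a filesystem with sparse file support, the second figure should be a tiny fraction of the first. On one without it, the two will be about the same.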