<D <M <Y
Y> M> D>

I was astonished the other day by the following experience:

I booted an LNX-BBC test image and started a test download of something using BitTorrent. BitTorrent was downloading into a RAM disk (actually a tmpfs, which is kind of like a dynamically-sized RAM disk) with a maximum size of about 128 MB.

I expected the download to run out of space at some point, since the file I was trying to download was about 1 GB, much larger than the RAM disk. After a little while, I went to check on it by looking at how much disk space had been used.

The system said that only about 50 MB had been downloaded. Then I remembered that BitTorrent preallocates space for the files it downloads. I couldn't understand how the download had even been able to start. The BitTorrent FAQ says that

BitTorrent pre-allocates the entire file when your download begins, then writes in pieces in random order as it gets them. As a result the file jumps to its full size immediately. BitTorrent will tell you when the download is complete.

If this was the case, how could a BitTorrent download of a 1 GB file even start on a 128 MB RAM disk? Furthermore, how could the system claim that only 50 MB had been used?

My confusion was compounded when I went to look at the size of the preallocated file, and ls reported it as occupying 1 GB.

Nick, who was visiting, explained that Unix supports sparse files and a file's size in the filesystem may be substantially larger than the amount of space it's actually taking up. When BitTorrent allocates a complete file's size on a Unix filesystem, it will only use a trivial amount of actual storage, and the amount of storage used will increase as the download progresses.

I found this totally astonishing. I'm familiar with sparse files, but I always thought they were a VMS thing and never realized that they've been a standard part of Unix for a long time. I don't know how I missed that.

The basic consequence of this is that the file size reported by ls -l can be totally different from the file size reported by du. Blocks not yet written will just not be allocated on disk, and reading them will return zeroes.

Here are four problems:

  1. you have a collection of strings G which were generated by some unknown formal grammar; find a formal grammar which generates them
  2. you have a collection of strings G which were generated by some unknown formal grammar and a collection F which could not be generated by that grammar; find a formal grammar which generates them
  3. you have a formal grammar G and a string S; determine whether S could have been generated by G
  4. you have a computer program P and a string S; determine whether, for some input, S could have been generated by P

(Another twist is to replace "grammar" by "Unix regular expression" and "generated by" with "matches".)

I would find these problems more straightforward to talk about if I had taken a compilers course, or an automata course, but I never got that far in CS. I did study formal languages only briefly, and I once read Minsky's Computation: Finite and Infinite Machines, which talks about the different classes of automata.

(1) and (2) are easy because you can just say that the grammar is ( string1 | string2 | string3 | string4 | ... ). This is a trivial solution and not very useful -- it has no economy and no predictive power. It's tantamount to saying that the laws of physics are "All observed phenomena occur and all non-observed phenomena are impossible". (What's amusing is that this isn't even true, partly because some observed phenomena do not actually occur.) It would be nice to have a more interesting solution.

(4) is as hard as the Halting Problem, but it is solvable for some programs. You can see quickly that it isn't solvable in general (even if you didn't know that the Halting Problem is unsolvable and if you just had an intuition that there is no magical way of solving all math problems with the same technique). You could just write a program which verifies whether something is an exception to some conjecture. If you can tell whether "yes" is ever an output of that program, you can tell immediately whether the conjecture is true.

(3) is pretty interesting, and it can actually be solved. The regular expression version is a standard part of many programming languages. The efficiency of a solution is also an interesting question.

I regret that I didn't get to go on this hike, because it was beautiful.

The recordings on Out There Live are very good. Some of them are better than the studio recordings on some other Dar Williams CDs. I have to recommend this CD very highly.

I went to Stacey's with Zack, at the end of a fairly unsuccessful series of errands.

I got the 2nd edition of Friedl's regular expression book from O'Reilly, which might shed some light on my problem (3) above, in that it discusses implementation issues related to RE matching. Friedl's book is one of the most useful books ever published by O'Reilly, but it's not particularly well-known. But it is a triumph.

I also got a Chomsky book on the Vietnam War because of the publisher's successful attempt to suggest that it had a new relevance or resonance today.

I also got a copy of Practical Cryptography, the latest in the series of "Schneier books that criticize earlier Schneier books". (So Bruce Schneier is like Ludwig Wittgenstein and Cliff Stoll -- interesting company.)

"What you say is very fine, Adso, and I thank you. The order that our mind imagines is like a net, or like a ladder, built to attain something. But afterward you must throw the ladder away, because you discover that, even if it was useful, it was meaningless. Er muoz gelîchesame die leiter abewerfen, sô er an ir ufgestigen . . . . Is that how you say it?"

"That is how it is said in my language. Who told you that?"

"A mystic from your land. He wrote it somewhere, I forget where. And it is not necessary for somebody one day to find that manuscript again. The only truths that are useful are instruments to be thrown away."

(Umberto Eco, The Name of the Rose)

(Er muss sozusagen die Leiter wegwerfen, nachdem er auf ihr hinaufgestiegen ist.)

Er muss diese Sätze überwinden, dann sieht er die Welt richtig.

(Ludwig Wittgenstein, Tractatus Logico-Philosophicus 6.54)

I repeat: it suffices that a book be possible for it to exist. Only the impossible is excluded. For example: no book can be a ladder, although no doubt there are books which discuss and negate and demonstrate this possibility and others whose structure corresponds to that of a ladder.

(Jorge Luis Borges, "The Library of Babel")


[Main]
Support Bloggers' Rights!
Support Bloggers' Rights!


Contact: Seth David Schoen