Interactive history sniffing and its relatives

Readers of this blog will probably already know that, up till the middle of last year, it was possible to sniff browsing history by clever tricks involving CSS, JavaScript, and the venerable tradition of drawing hyperlinks to already-visited URLs in purple instead of blue. Last year, though, David Baron came up with a defense against history sniffing which has now been adopted by every major browser except Opera. One fewer thing to worry about when visiting the internets, hooray? Not so fast.

Imagine for a moment that the next time you visited an unfamiliar website and you wanted to leave a comment without creating an account, instead of one of those illegibly distorted codes that you have to type back in, you saw this:

Please click on all the chess pawns.

[Image: a six-by-six checkerboard grid with chess pawns in random locations; one of the pawns is green, with a mouse-cursor arrow pointing to it.]

As you click on the pawns, they turn green. Nifty, innit? Much easier than an illegibly distorted code. Also easy for a spambot equipped with image processing software, but it turns out the distorted codes are not that hard for spambots anymore either, and probably no one’s written the necessary image processing code for this one yet. Possibly also easier on people with poor eyesight, and there could still be a link to an audio challenge for people with no eyesight.

… What’s this got to do with history sniffing? That chessboard isn’t really a CAPTCHA. All the squares have pawns on them. But each one is a hyperlink, and the pawns linked to sites you haven’t visited are being drawn in the same color as the square, so they’re invisible. You only click on the pawns you can see, of course, and so you reveal to the site which of those URLs you have visited. A little technical cleverness is required—the pawns have to be Unicode dingbats, not images; all the normal interactive behavior of hyperlinks has to be suppressed; etcetera—but nothing too difficult. I and three other researchers with CMU Silicon Valley’s Web Security Group have tested this and a few other such fake CAPTCHAs on 300 people. We found them to be practical, although you have to be careful not to make the task too hard; for details please see our paper (to be presented at the IEEE Symposium on Security and Privacy, aka Oakland 2011).
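For concreteness, here is roughly how the trick is put together. This is an illustrative sketch only, not the code we used in the study: the probe URLs, the colors, and the tiny one-row board are all made up, and a real page would also have to suppress the link cursor, the status-bar preview, and actual navigation on click.

#!/usr/bin/env python3
# Sketch of the fake-CAPTCHA history probe, for illustration only.
# The probe URLs and styling are made up; a real page would also suppress
# cursor changes, status-bar previews, and navigation when a pawn is clicked.

PROBE_URLS = [                       # hypothetical sites to test
    "http://bank.example.com/",
    "http://store.example.com/",
    "http://news.example.com/",
]

PAGE = """<!DOCTYPE html>
<style>
  td           { width: 3em; height: 3em; background: #eee; text-align: center; }
  td a         { text-decoration: none; font-size: 2em; }
  td a:link    { color: #eee; }  /* unvisited: same color as the square, invisible */
  td a:visited { color: #444; }  /* visited: dark pawn, clearly visible */
</style>
<p>Please click on all the chess pawns.</p>
<table><tr>
%s
</tr></table>
"""

# U+265F BLACK CHESS PAWN as a text glyph, not an image, so that it
# takes on whichever color the :link / :visited rules above select.
cells = "\n".join('<td><a href="%s">&#x265F;</a></td>' % url for url in PROBE_URLS)

with open("fake-captcha.html", "w") as out:
    out.write(PAGE % cells)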

An attacker obviously can’t use an interactive sniffing attack like this one to find out which sites out of the entire Alexa 10K your victim has visited—nobody’s going to work through that many chessboards—and for the same reason, deanonymization attacks that require the attacker to probe hundreds of thousands of URLs are out of reach. However, an attacker could reasonably probe a couple hundred URLs with an interactive attack, and according to Dongseok Jang’s study of actual history sniffing (paper), that’s about how many URLs real attackers want to sniff. It seems that the main thing real attackers want to know about your browsing history is which of their competitors you patronize, and that’s never going to need more than a few dozen URLs.

On the other hand, CAPTCHAs are such a hassle for users that they cause 10% to 33% attrition in conversion rates. And users don’t expect to see them on every visit to a site—just the first, usually, or each time they submit an anonymous comment. Even websites that were sniffing history when it was possible to do so automatically, and want to keep doing it, may consider that too high a price. But we can imagine similar attacks on higher-value information, where even a tiny success rate would be worth it. For instance, a malicious site could ask you to type a string of gibberish to continue—which happens to be your Amazon Web Services secret access key, IFRAMEd in from their management console. Amazon has taken steps to make this precise scenario difficult, but I’m not prepared to swear that it’s impossible, and other cloud services providers may have been less cautious.

Going forward, we also need to think carefully about how new web-platform capabilities might enable attackers to make similar end-runs around the browser’s security policies. In the aforementioned research project, we were also able to sniff history without user interaction by using a webcam to detect the color of the light reflecting off the user’s face; even with our remarkably crude image processing code this worked great as long as the user held still. It’s not terribly practical, because the user has to grant access to their webcam, and it involves putting an annoying flashing box on the screen, but it demonstrates the problem. We are particularly concerned about WebGL right now, since its shader programs can perform arbitrary computations and have access to cross-domain content that page JavaScript cannot see; there may well be a way for them to communicate back to page JavaScript that avoids the <canvas> element’s information leakage rules. Right now it’s not possible to put the rendering of a web page into a GL texture, so this couldn’t be used to snoop on browsing history, but there are legitimate reasons to want to do that, so it might become possible in the future.

Classical Mechanics Interlude: Acceleration to stop in a constant distance

Over on twitter, @MegaManSE asked

does anyone know the equation to find the acceleration to stop a moving object in a constant distance given some random starting velocity?

I didn’t, at the time, know … but I do know how to work it out from first principles, and it makes a decent little classical mechanics exercise, and also an excuse to figure out how to get MathJax hooked up on this blog, which might be useful in the future. So here’s how it’s done.

The first step in solving one of these problems is to rewrite the question as formally as possible:

At time \(t=0\) an object is at position \(x=0\) and moving with velocity \(\nu=v\). Find the constant acceleration \(a\) such that at some future time \(t=T\), when the object is at position \(x=d\), its velocity will be zero.

Now how do we do that? It’s time for just a little bit of integral calculus. Velocity is the rate at which a moving object’s position changes, as a function of time, and acceleration is the rate at which a moving object’s velocity changes, also as a function of time. The calculus was invented to answer the question, if I know what one of these is, what are the other two? It has a somewhat-deserved reputation for being confusing, but mostly that’s because it’s hard to explain how you come up with its rules. If you know the rules, they’re pretty easy to apply. The acceleration in this problem is constant, \(a\), and we know at time \(0\) the velocity is \(v\) and the position is \(0\). Therefore, the velocity at time \(t\) is

\[\nu(t) = v + \int_0^t a\;\text{d}t = v + at\]

and the position is

\[x(t) = 0 + \int_0^t (v + at)\;\text{d}t = 0 + vt + \frac{at^2}{2}\]

These are both functions of time, but we want to solve for the acceleration as a function of distance and starting velocity. That’s just a matter of algebra. We want \(\nu(T) = 0\), so we plug that into the first of these equations and solve for \(T\):

\[0 = v + aT \quad\rightarrow\quad T = \frac{-v}{a}\]

And we want \(x(T) = d\), so we plug both that and the formula for \(T\) into the second equation:

\[d = v\frac{-v}{a} + \frac{a}{2}\left(\frac{-v}{a}\right)^2\]

Now all we have to do is solve for \(a\):

\[d = \frac{-v^2}{a} + \frac{v^2}{2a}\]

\[d = \frac{-2v^2 + v^2}{2a}\]

\[2ad = -v^2\]

\[a = \frac{-v^2}{2d}\]

Wait, the acceleration comes out to be negative?! Yes. That’s how you know the object is slowing down rather than speeding up. (If the object weren’t moving in a straight line, its position, velocity, and acceleration would all have to be treated as 2- or 3-dimensional vectors, but the calculations would wind up being very nearly the same, only with more boldface. Also, if the velocity were negative, it would mean the object was moving backward. This is, in fact, the difference between velocity and speed: speed is the magnitude of velocity, without the direction, so it can never be negative.)
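As a sanity check on the algebra, here is a quick numerical sketch; the starting velocity and stopping distance are arbitrary numbers picked for the example.

# Numerical check of a = -v^2 / (2d), with made-up numbers.
v = 20.0            # starting velocity, m/s
d = 50.0            # distance in which to stop, m
a = -v**2 / (2*d)   # the formula derived above: -4 m/s^2

# Integrate the motion in small time steps and see where the object stops.
x, t, dt = 0.0, 0.0, 1e-4
while v > 0:
    v += a * dt
    x += v * dt
    t += dt

print("a = %.2f m/s^2; stopped after %.3f m at t = %.2f s" % (a, x, t))
# Expect approximately: a = -4.00 m/s^2; stopped after 50.000 m at t = 5.00 s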

Securing the future net

Today I had the fortune to attend a group discussion ambitiously entitled Future of Internet Security at Mozilla. What this was mostly about was, given that a recent incident has severely shaken everyone’s confidence in the PKIX (PDF, say sorry) mechanism that everyone currently uses to decide that a secure website is who it says it is, what can we do about it? I’m not going to attempt to summarize; instead I’m going to point at the Etherpad log and [2016: the Etherpad log is no longer available either from Mozilla or the Internet Archive] video record of the discussion, then plow boldly forward with my own (incontrovertibly correct, of course) opinion on the way forward, on the assumption that everyone who reads this will already be familiar enough with the context to know what I’m talking about.

I will quote in full the principles with which the discussion was kicked off, though (really they’re more like constraints on solutions acceptable to all parties).

  • Performance - large sites will not adopt solutions which bulk up the amount of data required to be exchanged to establish a secure connection.
  • Independence/Availability - large sites will not accept tying the uptime of their site to the uptime of infrastructure over which they have no control (e.g. an OCSP responder)
  • Accessibility/Usability - solutions should not put the cost of security, either in terms of single sites or large deployments, out of the reach of ordinary people
  • Simplicity - solutions should be simple to deploy, or capable of being made simple.
  • Privacy - ideally, web users should not have to reveal their browsing habits to a third party.
  • Fail-closed - new mechanisms should allow us to treat mechanism and policy failures as hard failures (not doing so is why revocation is ineffective) (however this is trading off security for availability, which has historically proven almost impossible).
  • Disclosure - the structure of the system should be knowable by all parties, and users must know the identities of who they are trusting

I should probably emphasize that this is a walk, do not run, to the exits situation. The status quo is dire, but we can afford to take the time to come up with a solution that solves the problem thoroughly; we do not need an emergency stopgap. Despite that, I think the short-term solution will be different from the long-term solution.

In the short term, the solution with the most traction, and IMO the best chance of actually helping, is DANE, an IETF draft standard for putting TLS server keys in the DNS. This can (at least on paper) completely replace the common DV certificates issued by traditional certificate authorities. However, to offer real security improvements relative to the status quo, I assert that the final version of the spec needs to:

  • Require clients to fail closed on any sort of validation failure. The current text of the spec does say this, but not clearly and not with enough RFC2119 MUSTs.
  • Provide exclusion (trust no server keys but these, possibly also trust no CA but these) rather than inclusion (you should trust this server key). The current text of the spec can be read either way. A vocal minority of the DANE working group wants inclusion. It is my considered opinion that inclusion is completely useless—all it does is add the DNS root signing key to the existing pool of trusted CAs, which doesn’t solve the untrustworthy CA problem.
  • Require the use of DNSSEC. It has recently been suggested that a signed DNS zone is not necessary for exclusion, but then a DNS-tampering attacker can deny service by injecting a bogus DANE record, which will deter deployment. (It doesn’t matter that a DNS-tampering attacker can also deny service by messing up the A records; this is a new risk, which scares people more than an existing risk.)
  • Clearly indicate that it does not provide EV-level validation, leaving a business model for traditional CAs to retreat to.

In the longer term, I think we’re going to want to move to some sort of content-based addressing. DANE gets rid of the CA mess, but it substitutes the DNS as a single point of failure. Here’s a half-baked scheme that we could start rolling out real soon for URIs that don’t need to be user comprehensible:

<!-- jQuery 1.5.2 -->
<script src="h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad"></script>

The browser somehow knows how to expand h: URIs to something it can go ask a server on the net for. What the server produces MUST be discarded if it does not have the specified hash (and the browser can go try some other server). We don’t need to worry about where the browser got the content or whether it was transferred under encryption—if it’s not what was wanted, it’ll fail the hash check. Still to be worked out: how to do the expansion without reintroducing that single point of failure; how to disseminate these URIs; how to fit dynamic content into the scheme; under what circumstances h: URIs should, or should not, be considered same-origin with the requesting page.
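To make the verification step concrete, here is a sketch in Python. The h: parsing and the mirror list are invented for illustration; the point is only that it does not matter where the bytes came from, or whether the connection was encrypted, because the hash comparison is what establishes that we got the right content.

# Sketch of how a client might resolve an "h:" URI.  The URI syntax and the
# mirror list are hypothetical; only the hash check matters.
import hashlib
import urllib.request

MIRRORS = [                                   # hypothetical expansion targets
    "http://cdn1.example.net/by-hash/",
    "http://cdn2.example.org/by-hash/",
]

def fetch_by_hash(h_uri):
    algo, expected = h_uri[len("h:"):].split(",", 1)   # e.g. "sha1", "b8dcaa1c..."
    for mirror in MIRRORS:
        try:
            body = urllib.request.urlopen(mirror + expected).read()
        except OSError:
            continue                          # that mirror is down; try the next
        if hashlib.new(algo, body).hexdigest() == expected:
            return body                       # verified, safe to use
        # wrong hash: discard it and keep looking
    raise IOError("no mirror produced content matching " + h_uri)

# e.g. the jQuery resource from the example above:
# script = fetch_by_hash("h:sha1,b8dcaa1c866905c0bdb0b70c8e564ff1c3fe27ad")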

Legibility of embedded Web fonts

It’s recently become possible to embed fonts in your website, so that you aren’t limited to using the same old fonts that everyone already has on their computer. Yay! Unfortunately, there are a lot of gotchas. Lots of people discuss the technical gotchas, but when you get past that, you’ve still got to worry about legibility.

Continued…

Fantasy setting without the hack-and-slash

Several months ago, LJ user fadethecat posted a request for help identifying a dimly-remembered video game. In passing, she mentioned that

Unfortunately, all the games I’ve found with that scale of “now direct your dudes to go chop down trees” are inextricably linked to some tedious “Also, you have to fight off enemies” game.

And, frankly, if I want to fight off enemies, I’ll go play Starcraft or something. I like my city-building separate from my fighty games, thank you very much. But they seem to always come hand in hand. No, I can’t just build my dairy farms for cheese and my lovely little castle, I need to deal with frickin’ knights and invaders…

Whereupon it occurred to me that this is one of the big reasons I don’t play Dwarf Fortress anymore. The parts of DF that I enjoy are city-building, prospecting for minerals to make stuff out of, and trade. Manufacturing is too much of a micromanagey pain in the ass, and everything to do with building up an army bores me to tears (it doesn’t help that that is also far too micromanagey for my taste, and depends on manufacturing). Further, all the development since the first 3D version has been focused on precisely the parts I don’t like. But it’s a damn shame, because I really enjoy the city-building and the mining.

At about the same time, Minecraft had just started making the big news. I tried to get it to run, discovered that this would require more than zero messing with a Java installation, and gave up. But I watched a bunch of the X’s Adventures in Minecraft Let’s Play videos, enough to get a pretty good idea of what the gameplay is like. It seems like I’d have much the same opinion of it that I do of DF: yay exploring, yay building, boo micromanagement, boo having to fight monsters.

And this was also about the time I gave up on Lord of Ultima, which was a massively-multiplayer territorial competition game that billed itself as allowing a purely economic strategy—but it turned out that past a certain point everyone has far more resources than they need, so the only remaining thing to do is fight; I have absolutely no interest in PvP army duels, and guess what? Also with the way too much micromanagement! (You can pay for UX enhancements that mitigate this, but it didn’t appear that they improved it enough, and the only way I’ll pay for a premium game is if I’m already having tons of fun with the free edition.)

Finally, longtime readers will remember that the last thing I said about that roguelike I’m not writing was that perhaps the world does not need another game where what you do is kill the monsters and take their stuff. But I’m still interested in the notion of a game where you’re exploring an otherworld that’s coming apart at the seams, and maybe trying to put it back together again. So here’s a set of design elements that feel like they add up to a game:

  • Gameplay is about exploration, construction, manufacturing, and trade.
  • Multiplayer interaction, if any, is all about the trading.
  • Micromanagement is to be avoided with extreme prejudice.
  • No fighting.
  • No crafting of legendary weapons or armor, either.
  • “Like reality unless noted” applies to geology, biology, and technology.
  • Magic is not tame.
  • Magic is not only for special people.
  • It might be nice to try to subvert “Our Dwarves Are All The Same”, but then again, there’s a reason why that characterization works so well.

The tricky part is finding a source of conflict, since we’ve taken out all the fighting. Perhaps the coming apart at the seams thing is enough of an issue to hang a plot on? If not, there is always politics and diplomacy, particularly if we make it hard for a settlement to be entirely self-sufficient. Not all minerals are found under the same mountain; not all biomes are suitable for subsistence farming.

I think this would work well with a zoom level roughly the same as DF fortress mode: player sets goals, individual simulated characters carry them out to the best of their ability. Simulated characters push back on the player to fulfill their desires and carry out their agendas, but (unlike DF nobility, for instance) there is a way for the player to say no. Which has consequences, of course, but less-bad consequences than what happens if you ignore a production order in DF.

Four Ideas for a Better Internet 2011

On Tuesday night I attended a talk at Stanford entitled Four Ideas for a Better Internet. Four groups of Harvard and Stanford Law students, having just completed a special seminar entitled Difficult Problems in Cyberspace, each presented a proposed improvement to the internets; they were then grilled on said proposal by a panel of, hm, let’s call them practitioners (many but not all were from the industry). Jonathan Zittrain moderated. In general, I thought all of the proposals were interesting, but none of them was ready to be implemented; they probably weren’t intended to be, of course, but I—and the panelists—could poke pretty serious holes in them without trying very hard.

The first proposal was to improve social network security by allowing you to specify a group of extra-trusted friends who could intervene to protect your social-network presence if it appeared to have been hijacked, or who could vouch for a request you might make that requires extra verification (for instance, a request to change the email address associated with your account). This is quite intentionally modeled on similar practices found offline; they made an analogy to the (never yet used) procedure in section 4 of the 25th Amendment to the U.S. Constitution, which allows the Vice President, together with a majority of the Cabinet, to declare the President temporarily unable to do his job. It’s not a bad idea in principle, but they should have looked harder at the failure modes of those offline practices—A25§4 itself goes on to discuss what happens if the President objects to having been relieved of duty (Congress has to decide who’s right). More down-to-earth, one might ask whether this is likely to make messy breakups worse, and why the “hey, moderators, this account looks like it’s been hijacked” button (not to be confused with the “hey, moderators, this account appears to belong to a spammer” button) couldn’t be available to everyone.

The third and fourth proposals were less technical, and quite closely related. The third group wanted to set up a data haven specializing in video documenting human rights abuses by dictatorships. Naturally, if you do this, you have to anonymize the videos so the dictatorship can’t find the people in the video and punish them; you have to have some scheme for accepting video from people who don’t have unfiltered access to the net (they suggested samizdat techniques and dead drops); and you have to decide which videos are actually showing abuses (the cat videos are easy to weed out, but the security cam footage of someone getting mugged…not so much). The fourth group wanted to set up a clearinghouse for redacting leaked classified documents—there is no plausible way to put the Wikileaks genie back in the bottle, but (we hope) everyone agrees that ruining the life of J. Afghani who did a little translation work for the U.S. Army is not what we do, so maybe there could be an organization that talks off-the-record to both leakers and governments and takes care of making sure the names are removed.

It seems to me that while the sources are different, the redactions that should be done are more or less the same in both cases. It also seems to me that an organization that redacts for people—whoever they are, wherever the documents came from—is at grave risk of regulatory capture by the governments giving advice on what needs to be redacted. The panelists made an analogy to the difficulty of getting the UN to pass any resolution with teeth, and Clay Shirky suggested that what is really wanted here is a best-practices document enabling the leakers to do their own redactions; I’d add that this also puts the authors behind the veil of ignorance, so they’re much less likely to be self-serving about it.

I’ve saved the second proposal for last because it’s the most personally interesting. They want to cut down on trolling and other toxic behavior on forums and other sites that allow participation. Making another analogy to offline practice, they point out that a well-run organization doesn’t allow just anyone who shows up to vote for the board of directors; new members are required to demonstrate their commitment to the organization and its values, usually by sticking around for several years, talking to older members, etc. Now, on the internets, there are some venues that can already do this. High-traffic discursive blogs like Making Light, Slacktivist, and Crooked Timber cultivate good dialogue by encouraging people to post under the same handle frequently. Community advice sites like StackOverflow often have explicit reputation scores which members earn by giving good advice. But if you’re a little bitty blog like this one, your commenters are likely to have no track record with you. In some contexts, you could imagine associating all the site-specific identities that use the same OpenID authenticator; StackOverflow’s network of spinoffs does this. But in other contexts, people are adamant about preserving a firewall between the pseudonym they use on one site and those they use elsewhere; witness what happened when Blizzard Entertainment tried to require real names on their forums. The proposal tries to solve all these issues with a trusted intermediary that aggregates reputation information from many sites and produces a credibility score that you can take wherever you wish to comment. Like a credit score, the details of how the score was computed are not available, so you can’t deduce someone’s identity on any other site. Further, you can have as many separate, unconnectable pseudonyms as you want, all with the same score.

People will try to game any such system, but that’s actually the easy problem, addressable with clever algorithms and human moderators. The more serious problem in my book is that “will produce quality comments” isn’t the sort of thing you can reduce to a single number. To give an extreme example, the sort of comment that gets you mad props on /b/ is exactly what most other sites do not want. The team did propose to break it down as three or four numbers, but it’s not clear to me that that helps enough. (But if you expose too much detail to sites trying to consume the data, that may leave them unable to reach a conclusion.) And finally, anonymization of this kind of data is much harder than it looks: I need only point at the successful unmasking of two users within the Netflix Challenge data set. Anonymization is in tension with utility here, because the more information you expose about what sort of reputation someone has on which sites, the easier it becomes to unmask them.

I think the idea is not totally doomed, though. We could help it a great deal by turning it on its head: rate sites on the quality of their discourse. This would be done with a publicly documented, but subject to revision, scoring scheme that humans execute against a random sample of pages from the site; we might be able to use a set of seed scores to train some sort of expert system to do it automatically, but I think it’s not a disaster if we have to have humans do the site evaluations. This would be useful in itself, in that it would be a stick to beat sites with when their discourse is terrible. Meantime, each site exports its existing member-reputation scheme (or makes one up—even something simple like average number of posts per month would probably be useful) in a standard format. When you want to introduce yourself in a new context, you can bring along a recommendation from any number of sites of your choice, which is just each site’s discourse score + your reputation on that site. It is explicit in the UX for this that you are linking your identity on the new site to your identity on the others (I might even go as far as allowing people to click through to your posting history on the other sites). You then get some reputation spillover on the new site from that, which might be as limited as doesn’t go through the mod queue the first time. Contrariwise, if you don’t provide any recommendations, your new pseud gets to stay dissociated from your other identities, but doesn’t get any rep. Sprinkle with crypto, nonrepudiation schemes, and human moderator feedback as necessary.
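To make the turned-on-its-head version a bit more concrete, here is a toy sketch of what a portable recommendation might look like. Every detail is invented: the field names, the scores, and the use of an HMAC (a real system would want public-key signatures for nonrepudiation, plus the human moderator feedback described above).

# Toy sketch of a site-issued "recommendation": the site's discourse score
# plus your reputation there, explicitly tied to a handle.  All field names
# and numbers are made up; HMAC stands in for a real signature scheme.
import hashlib, hmac, json

SITE_KEY = b"hypothetical-per-site-signing-key"

def issue_recommendation(site, discourse_score, handle, reputation):
    claim = {
        "site": site,                         # which community is vouching
        "discourse_score": discourse_score,   # how good its discourse is rated
        "handle": handle,                     # the identity being linked, openly
        "reputation": reputation,             # e.g. karma, or posts per month
    }
    blob = json.dumps(claim, sort_keys=True).encode()
    tag = hmac.new(SITE_KEY, blob, hashlib.sha256).hexdigest()
    return claim, tag

def verify_recommendation(claim, tag):
    blob = json.dumps(claim, sort_keys=True).encode()
    good = hmac.new(SITE_KEY, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, good)

rec, tag = issue_recommendation("blog.example.org", 87, "some-handle", 412)
assert verify_recommendation(rec, tag)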

Strawman: MIME type for fonts

For a little while now, it’s been possible for websites to embed fonts that all major browsers will pick up on. This of course implies fonts being served as HTTP resources. But it turns out that nobody has bothered to assign any of the common font formats a MIME type.1 Fonts being embedded on the web nowadays come in two flavors and three kinds of container: you either have TrueType or PostScript CFF-style outline glyphs, and they are in a bare OpenType (really sfnt) container, or else compressed with either WOFF or EOT. (I am ignoring SVG fonts, which are spottily supported and open several cans of worms that I don’t want to get into right now.) In the future, people might also want to embed TTC font collections, which are also in a sfnt container and could thus also be compressed with WOFF—not sure about EOT there—and bare PostScript Type 1 fonts, but neither of these is supported in any browser at present, as far as I know. There is no official MIME type for any of these combinations; therefore, people deploying fonts over HTTP have been making them up. Without trying very hard, I found real sites using all of: application/ttf, application/otf, application/truetype, application/opentype, application/woff, application/eot, any of the above with an x-prefix, or any of the above in font/ instead of application/ (with or without the x-). There is no top-level font MIME category, making this last particularly egregious.

All of these made-up types work because browsers don’t pay any attention to the content type of a web-embedded font; they look at the data stream, and if it’s recognizably a font, they use it. Such sniffing has historically caused serious problems—recall my old post regarding CSS data theft—so you might expect me to be waving red flags and arguing for the entire feature to be pulled until we can get a standard MIME category for fonts, standard subtypes for the common ones, and browsers to start ignoring fonts served with the wrong type. But I’m not. I have serious misgivings about the whole “the server-supplied Content-Type header is gospel truth, content sniffing is evil” thing, and I think the font situation makes a nice test case for moving away from that model a bit.

Content types are a security issue because many of the file formats used on the web are ambiguous. You can make a well-formed HTML document that is simultaneously a well-formed CSS style sheet or JavaScript program, and attackers can and have taken advantage of this. But this isn’t necessarily the case for fonts. The sfnt container and its compressed variants are self-describing, unambiguously identifiable binary formats. Browsers thoroughly validate fonts before using them (because an accidentally malformed font can break the OS’s text drawing code), and don’t allow them to do anything but provide glyphs for text. A good analogy is to images: browsers also completely ignore the server’s content-type header for anything sent down for an <img>, and that doesn’t cause security holes—because images are also in self-describing binary formats, are thoroughly validated before use, and can’t do anything but define the appearance of a rectangle on the screen. We do not need filtering on the metadata, because we have filtering on the data itself.

Nonetheless, there may be value in having a MIME label for fonts as opposed to other kinds of binary blobs. For instance, if the server doesn’t think the file it has is a font, shouldn’t it be able to convince the browser of that, regardless of whether the contents of the file are indistinguishable from a font? (Old hands may recognize this as one of the usual rationales for not promoting text/plain to text/html just because the HTTP response body happens to begin with <!DOCTYPE.) The current draft standard algorithm for content sniffing takes this attitude with images, recommending that browsers only treat HTTP responses as images if their declared content type is in the image/ category, but ignore the subtype and sniff for the actual image format. With that in mind, here’s my proposal: let’s standardize application/font as the MIME type for all fonts delivered over the Internet, regardless of their format. Browsers should use only fonts delivered with that MIME type, but should detect the actual format based on the contents of the response body.
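To illustrate how little sniffing is actually required, here is a sketch of the sort of check I mean. The magic numbers for the sfnt, WOFF, and TTC containers come from their format specifications; EOT is left out because its header is not a simple leading tag, and the function itself is hypothetical, not anything browsers currently implement.

# Sketch of content-based font format detection.  A browser following the
# proposal would call something like this only for HTTP responses labeled
# application/font, then validate the font thoroughly before using it.
def sniff_font_format(data):
    magic = data[:4]
    if magic in (b"\x00\x01\x00\x00", b"true"):
        return "sfnt, TrueType outlines"
    if magic == b"OTTO":
        return "sfnt, PostScript/CFF outlines"
    if magic == b"ttcf":
        return "sfnt collection (TTC)"
    if magic == b"wOFF":
        return "WOFF (compressed sfnt)"
    return None        # not a font we recognize; refuse to use it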

I can think of two potential problems with this scheme. First, it would be good if browsers could tell servers (using the normal Accept: mechanism) which specific font formats they understand. Right now, it’s reasonable to insist that browsers should be able to handle either TrueType or PostScript glyph definitions, in either bare sfnt or compressed WOFF containers, and ignore the other possibilities, but that state won’t endure forever. SVG fonts might become useful someday (if those cans of worms can be resolved to everyone’s satisfaction), or someone might come up with a new binary font format that has genuine advantages over OpenType. I think this should probably be handled with accept parameters, for instance Accept: application/font;container=sfnt could mean I understand all OpenType fonts but no others. The other problem is, what if someone comes up with a font format that can’t reliably be distinguished from an OpenType font based on the file contents? Well, this is pretty darn unlikely, and we can put it into the RFC defining application/font that future font formats need to be distinguishable or else get their own MIME type. The sfnt container keeps its magic number (and several other things that ought to be in the file header) in the wrong place, but as long as all the other font formats that we care about put their magic number at the beginning of the file where it belongs, that’s not a problem.


1 To be precise, there is a standard MIME type for a font format: RFC 3073 defines application/font-tdpfr for the Bitstream PFR font format, which nobody uses anymore, except possibly some proprietary television-related products. Bitstream appear to have been trying to get it used for web fonts back in the days of Netscape 4, and then to have given up on it, probably because the font foundries’ attitude was NO YOU CAN’T HAS LICENSE FOR WEBS until just last year.

Dead uncle Allotheria

Summaries of the rest of CCS’10 are still coming eventually, but what I want to talk about today is the Difference Engine No. 2 which I have now seen demonstrated twice: Mozilla had its annual meeting at the Computer History Museum a couple years back, and on Saturday last I was there again for SRI’s holiday dinner.

As you know, (Bob,) the Difference Engine was originally designed in the 1820s by Charles Babbage; a tabletop-sized demonstration version was built, but the full-size version was abandoned in 1833 after Babbage fell out with the mechanist he had hired. Babbage went on to design the Analytical Engine, which if completed would have been a fully operational stored-program computer, and later to redesign the Difference Engine with only eight thousand (instead of 25,000) parts. Having blown £17,500 of Parliament’s money (more than a million pounds at current rates) on the first failed project, he was not able to secure funding for either of these, and they remained drawings only. The London Science Museum built two copies of Difference Engine No. 2—that’s the 8000-part version—in the 1990s, with period-appropriate materials and manufacturing tolerances, to prove that it could have been done; one of these is on display at the Computer History Museum.

The Difference Engine is not a computer in the modern sense, or even a desk calculator. It does one thing only: it computes tables of values of mathematical functions. You approximate the function you want as a polynomial of up to 7th degree (you probably need several polynomials, for different ranges of the argument) and you can then have the Engine calculate the value of the polynomial for many different inputs, using finite differences. It stamps the numbers into a bed of plaster of Paris, which can then be used to cast a letterpress plate and run off many copies of a book. There were already such books, but they were computed by hand, and notorious for their errors. The Difference Engine’s failure meant that they kept on being computed by hand; however, Georg and Edvard Scheutz managed to build a miniature version of the Engine in 1857 and use it to print a table of logarithms; also, the first practical mechanical desk calculator went into production in 1851, and was no doubt employed in making tables more accurate.
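The method of finite differences itself is simple enough to sketch in a few lines. The polynomial below is an arbitrary example, not anything Babbage tabulated; the point is that once the initial differences are set up, every further table entry comes from nothing but repeated addition, which is all the Engine’s columns and carry levers can do.

# The method of finite differences: after setting up the initial column of
# differences, each new table entry is produced purely by addition.
def difference_table(coeffs, start, step, count):
    degree = len(coeffs) - 1
    # Evaluate the polynomial directly at the first degree+1 points...
    values = [sum(c * (start + i * step) ** k for k, c in enumerate(coeffs))
              for i in range(degree + 1)]
    # ...and collapse them, in place, into the initial finite differences.
    diffs = list(values)
    for level in range(1, degree + 1):
        for i in range(degree, level - 1, -1):
            diffs[i] -= diffs[i - 1]
    # Now crank the Engine: nothing but additions from here on.
    table = []
    for _ in range(count):
        table.append(diffs[0])
        for i in range(degree):
            diffs[i] += diffs[i + 1]
    return table

# Example: x^2 + 1 at x = 0, 1, 2, 3, 4  ->  [1, 2, 5, 10, 17]
print(difference_table([1, 0, 1], 0, 1, 5))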

Nowadays, of course, we could program an electronic computer to typeset such a book, but why would we bother, when the very same computer can just spit out exactly the values we’re interested in, of any function we want? In other words, the spiritual descendants of the Analytical Engine have not only obsoleted the Difference Engine—which I’m sure Babbage himself would have expected—but have made the job it was made to do a thing of the past. This hardly ever happens, and nobody (except possibly Vannevar Bush or Murray Leinster) saw it coming—witness the Heinlein juvenile of forgotten title in which the spaceship had an electronic computer to navigate it, but to make it work, the human navigation team had to type in numbers from a book of tables.

At the demonstration, several people in the audience did not seem to believe, even after it was repeated several times, that no Difference Engines were ever built in Victorian times. “This one is a replica?” they kept asking. Nope. Production model, serial number two, of the only production run there ever was. It does seem odd to me that, before now, it was so fervently believed that Babbage’s engines could not have been built with Victorian manufacturing technology, when they manifestly did build similar devices, such as the Scheutz brothers’ difference engine and various desk calculators.

The other telling question was on the order of “if you set the Engine up wrong, what happens?” Being so special purpose, of course it can’t have a bug of the sort we are all used to. Its gears can get out of alignment, but it’s got a mechanism to detect that and jam rather than produce an incorrect sum. (I don’t know how the operator would recover from such a jam, however.) And no matter how you set it up, it will compute some polynomial; but that might not be the polynomial you wanted. Babbage himself was asked this question, in a less-coherent form (look for “On two occasions”).

I’ll leave you all with a vision of what might have been.

CCS 2010, day 1

I’m attending the 2010 ACM Conference on Computer and Communications Security (in Chicago this year), which started yesterday (I’m skipping the workshops on Monday and Friday). I was a little too tired last night to write up what I thought of yesterday’s talks, so here are some brief thoughts about them now.

Before lunch, I probably should have gone to the security analysis session, but I really wanted to see Justin Samuel’s talk on practical advice for dealing with compromised keys, mostly aimed at people doing signed software distribution—which could also be relevant for people running secure web sites, especially if browsers start paying more attention to changes in the server certificates. The other two talks in this session didn’t really grab me.

After lunch, there was lots of good stuff in the session on wireless and phone security. Husted and Myers described how a malicious group of cooperating cell phones can track the majority of other cell phone users in an area—this is not easy now but will only get easier. Halevi and Saxena (no link available) comprehensively broke all the current schemes for acoustically pairing small widgets together (you put your Bluetooth earbud against your phone, for instance, and it vibrates a code, which your phone detects), even at a distance thanks to the magic of parabolic microphones. And a large group from Georgia Tech showed their technique for fingerprinting the networks through which a phone call passes, based on the characteristics of each network’s acoustic compression algorithm.

After that I decided to skip the tutorials and go for a walk. The conference hotel is right on the south bank of the Chicago River and only a few blocks from Lake Michigan, so I walked down the riverfront to the lake and then looped around to the south and back. I’ve never been to Chicago before and it’s very interesting, architecturally. I will post more on this when I can upload photos (left the cable at home, silly me).

The poster session was, unfortunately, a bit of a blur; by that point my brain was full. Collin introduced me to a bunch of people doing ’net security at Cal and we all went out to dinner, which involved more wandering around the downtown looking for a restaurant that had a table for eight.

URL Canonicalizer

supposing
for the sake of argument
that you had a giant list
of partial URLs
(you know, like www.example.com/blurf)
and you needed to canonicalize them
and chase redirects
and remove duplicates
and dead sites
and further you were aware
that this is much harder than it might sound
not to mention
that many websites do not like urllib
well then
you might be looking for this program
which was written by me
with a little help from serge broslavsky.
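(If you only want the general idea rather than the actual program, here is a heavily simplified sketch of the same chore. The structure is mine, not the program’s, and the real thing has to cope with far more: throttling, robots.txt, servers that dislike urllib’s default User-Agent, character-set weirdness, and so on.)

# Heavily simplified sketch: canonicalize partial URLs, chase redirects,
# drop duplicates and dead sites.  Not the program the post is about.
import urllib.request
from urllib.parse import urlsplit, urlunsplit

def canonicalize(partial):
    url = partial if "://" in partial else "http://" + partial
    try:
        # urlopen follows redirects; .geturl() tells us where we ended up.
        resp = urllib.request.urlopen(url, timeout=30)
    except OSError:
        return None                            # dead site, or just unfriendly
    scheme, netloc, path, query, _fragment = urlsplit(resp.geturl())
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

def canonicalize_all(partials):
    seen = set()
    for partial in partials:
        url = canonicalize(partial)
        if url is not None and url not in seen:
            seen.add(url)
            yield url

for url in canonicalize_all(["www.example.com/blurf", "example.com/blurf"]):
    print(url)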