Adam M. Costello : Soap Box

URLs that Name Directories

Question:

Why should a URL that names a directory have a trailing slash?

Answer:

When a document contains relative links, they are resolved by the browser, not by the HTTP server. The browser starts with the URL for the current document, removes everything after the last slash, and appends the relative URL. If the URL for the current document names a file, this works fine. But if the URL for the current document names a directory, and the URL is missing the trailing slash, then the method fails.

For example,

http://www.foo.com/bar/

names the directory bar on host www.foo.com, which might contain a file called (say) index.html, which the server returns as the document. If that document contains a relative link to blah.html, the browser will use the above method to derive the URL

http://www.foo.com/bar/blah.html

But if the browser had been using http://www.foo.com/bar as the base URL, the method would yield

http://www.foo.com/blah.html

which is not the URL the author of the document intended, so the link will most likely fail.
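The method is easy to try out. The following is a minimal sketch in Python: naive_resolve is a hypothetical helper (not part of any library) that mimics the simplified method described above and handles only plain relative names such as blah.html. The standard library's urljoin, which implements the full resolution rules, is printed alongside for comparison.

```python
from urllib.parse import urljoin


def naive_resolve(base: str, relative: str) -> str:
    """Drop everything after the last slash of base, then append relative.

    Hypothetical helper for illustration; it ignores "..", queries,
    fragments, and absolute paths.
    """
    return base[: base.rfind("/") + 1] + relative


for base in ("http://www.foo.com/bar/", "http://www.foo.com/bar"):
    # Both functions agree on these two simple cases.
    print(base, "->", naive_resolve(base, "blah.html"), "|", urljoin(base, "blah.html"))
```

With the trailing slash, both functions yield http://www.foo.com/bar/blah.html; without it, both yield http://www.foo.com/blah.html.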

More abstractly, every directory has two roles: It is a child of its parent directory, and it is a parent to its children. The path /bar refers to bar's role as a child, while the path /bar/ (which is equivalent to /bar/.) refers to bar's role as a parent. Both regular files and directories are children, but only directories are parents. Therefore, if you want a directory to be thought of as a directory rather than a regular file (which affects how relative paths are interpreted), you must refer to its role as a parent.
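The two roles can be made visible with Python's standard path and URL helpers. This is only an illustration of the idea, not a claim about how any particular browser or server is implemented; the variable names are made up for the example.

```python
import posixpath
from urllib.parse import urljoin

# As a child: the name "bar" inside the parent directory "/".
child_view = posixpath.split("/bar")
# As a parent: the directory "/bar", followed by an empty final component.
parent_view = posixpath.split("/bar/")

# Resolving "." yields the directory that relative links are interpreted against.
base_with_slash = urljoin("http://www.foo.com/bar/", ".")
base_without_slash = urljoin("http://www.foo.com/bar", ".")

print(child_view, parent_view)
print(base_with_slash, base_without_slash)
```

The first line printed is ('/', 'bar') ('/bar', ''), and the second shows that only the URL with the trailing slash keeps bar as the base directory: http://www.foo.com/bar/ versus http://www.foo.com/.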

Consequences of Missing Slashes

I think NCSA httpd 1.1 did nothing to deal with missing slashes. Relative links would simply succeed or fail depending on which form of the URL the browser was using.

There are two ways an HTTP 1.0 server could deal with missing slashes. When a browser tries to use a URL that names a directory but lacks a trailing slash, such as

http://www.foo.com/bar

the server could return an error. This would be perfectly acceptable, in my opinion, because the URL without the trailing slash is arguably wrong. I know of no HTTP server that does this.

I think all current servers, when a browser uses a URL with a missing slash, try to construct a correct URL and return it to the browser (HTTP provides a mechanism for this, the Location header field). Unfortunately, even if the constructed URL works (which it sometimes does not), it often differs from the one the browser had been using, because of hostname aliases, symbolic links in the file system, the explicit appearance of the default port number (80), or the explicit appearance of a file name that was supposed to be hidden (like index.html). When the browser follows relative links from this page, it constructs the target URL from the altered URL, so the alteration propagates.

The exposure of alternate URLs can wreak havoc on browser histories and bookmarks, because the browser can't tell that the different URLs refer to the same page. A user may find that many visited links are colored as unvisited, because those pages were reached along a different path and therefore accessed via a different URL. If a user bookmarks a page reached via an alternate URL, that bookmark may break when “internal” changes are made to the web server, because the alternate URL exposes internal names that were never meant to be exposed.
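To see why the history breaks, consider a toy model in which the browser records visited pages by their exact URL strings. This is a simplified sketch, not a description of any real browser; the alias foo.com and the two helper functions are assumptions made up for the example.

```python
# Toy browser history keyed on exact URL strings (an assumption for illustration).
visited = set()


def visit(url: str) -> None:
    """Record that the page at this exact URL string was viewed."""
    visited.add(url)


def is_visited(url: str) -> bool:
    """A link is colored as visited only if its URL string matches exactly."""
    return url in visited


# The page was reached through an alternate URL exposed by a redirect
# (hypothetical alias, explicit port, and explicit file name) ...
visit("http://foo.com:80/bar/index.html")

# ... so a link that uses the intended URL still looks unvisited.
print(is_visited("http://www.foo.com/bar/"))
```

The final line prints False, even though both URLs name the same page.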

This problem can be magnified by web search engines. A single link with a missing slash can cause a web spider to crawl an entire site using an alternate URL, so that everyone who reaches the site through that search engine will be using the wrong URL.

You may wonder why the server doesn't simply append a slash to the URL it was given by the browser. It can't, because it doesn't get the full URL. It gets only the part after the hostname. (This was changed in HTTP 1.1, which requires the browser to send the hostname in a Host header field.)
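The difference can be sketched as a small function that builds the Location value. Everything here is hypothetical: the function name, its parameters, and the internal host name web1.foo.com are invented for the example, and real servers differ in the details (for instance, in whether they include the port).

```python
from typing import Optional


def redirect_location(path: str, host_header: Optional[str],
                      configured_name: str, configured_port: int = 80) -> str:
    """Build a Location value for a directory request that lacks its slash.

    Hypothetical sketch: path is all an HTTP 1.0 request is guaranteed to carry.
    """
    if host_header:
        # HTTP 1.1 style: reuse the name the browser actually used.
        authority = host_header
    else:
        # HTTP 1.0 style: the server must fall back on its own configured
        # name and port, which may expose an alias and the default port.
        authority = f"{configured_name}:{configured_port}"
    return f"http://{authority}{path}/"


print(redirect_location("/bar", None, "web1.foo.com"))
print(redirect_location("/bar", "www.foo.com", "web1.foo.com"))
```

Without a Host header the sketch returns http://web1.foo.com:80/bar/, an alternate URL; with one it returns http://www.foo.com/bar/, which matches what the browser was using.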

Also note that whenever the server sends back a corrected URL, it closes the connection. The browser must then open a new connection and issue a second request, incurring additional delay. (This was also changed in HTTP 1.1, where connections stay open by default.)

The entire mess can be avoided if humans take it upon themselves always to include the trailing slash in URLs that name directories.

When the directory in question is the root directory, as in

http://www.foo.com/

the trailing slash is not strictly necessary, because the browser (and humans) know there must be an implicit slash there. However, the simple consistency of the rule “every URL that names a directory should end with a trailing slash” is attractive, so I recommend including the slash even in this case.


Prepared by Adam M. Costello
Last modified: 2001-Jan-30-Tue 23:24:43 GMT