Scholarly Technology Group

Pass-Through Proxying as a Solution to the Off-Campus Web-Access Problem

Abstract

This report outlines a simple, near-term solution to the universal academic problem of off-campus access to on-campus Web-based resources. This solution costs little, requires no special client software, works with a variety of authentication methods, allows fine-grained control over what services can be accessed, and offers both reasonable security and speed.


The Problem

Just two decades ago, access to vital university information resources was largely a matter of physical proximity. You had to be near the research labs, classrooms, and lounges. And in order to get access to books, you had to present credentials (i.e., your university ID and/or library card) in person at the library.

As information resources have made their way onto the Internet, and in particular onto the Web, physical proximity has taken on an increasingly secondary role. To reach Web-based information resources all one needs is a computer with a Web browser and some sort of Internet feed.

The increasingly secondary role played by physical proximity has created a critical problem for university communities and for their shared information resources. If physical presence is no longer necessary, i.e., if someone can "knock at the door" without actually being at the door, how can we tell whether to let him or her in? How can we tell if the person knocking is a member in good standing of our community? How can we grant him or her access without accidentally letting anyone else in?

If you belong to a university library reference, MIS, or computer support organization, you have struggled with these sorts of questions. In particular, you have struggled through scenarios like the following:

Scenario 1:

Adjunct medical-school professor takes up a practice in a university-affiliated clinic. He turns on his office computer, logs in via the clinic's Internet service provider (ISP), then goes to his university's BioMed website, where he expects to find the Web version of Medline (to which his university subscribes). Instead of seeing a list of available databases, he finds himself rudely turned away with a "403 Access Denied" message.

Scenario 2:

Professor leaves on sabbatical to do research on eighteenth-century English scientific terminology. She settles into her cabin in upstate Maine - only to discover that when she tries to access the online Oxford English Dictionary (OED), which she has been accustomed to using for complex search and retrieval operations from her campus office, she is rejected.

Scenario 3:

Computer support staff personnel report that the number of students, staff, and faculty coming in via the modem pool at night has tripled in the last two years, and that the university must either expand its dial-in facilities, or else farm out this service to another ISP. Preliminary figures indicate that it would be significantly less expensive to farm the service out. Users revolt, however, when they are told that key library databases will no longer be accessible from off-campus under this arrangement. The support staff revolts when they hear that the only way to make these databases available is to outfit all remote machines with special client software and plug-ins. Professors revolt when they realize that the client software and plug-ins cannot easily be installed on other universities' machines, or on kiosks made available to them while they are away at conferences.

Certificate Authorities

In efforts to overcome these sorts of web-access problems, many organizations are setting up certificate authorities (CAs), and are beginning to issue special public/private key-pairs to individual users - key-pairs that must be carried around on floppy disks, copied across the network, or generated separately on every machine used, in order to permit off-campus access. These key-pairs can be used both to authenticate users (i.e., verify that they are who they say they are) and to authorize them (in this case, to grant them selective access to campus resources through special commercial webservers and directory services).

These and other such efforts are well-intended. But they rely on volatile, changing technologies, and require that a whole new family of support software be put into place. In the vast majority of cases, all we are really looking for in an immediate solution to the Web-access problem is a way to determine whether somebody belongs to our community - and thus whether he or she should be allowed access to the same sorts of Web resources that are accessible from campus. All that is truly required, therefore, is a simple list of people who belong to the community, coupled with a list of their passwords.

Cross-Institutional Databases

Another strategy that libraries in particular are using to solve their authentication problems is to piggyback on separate cross-institutional user/password databases run by a vendor, by an institutional consortium, or by a state library system.

Although this strategy is appropriate in many instances, in others it, like the CA solution, is overkill, requiring that a whole new infrastructure be put in place - often not only for the participating institutions themselves, but also for the vendors whose information resources are being accessed. And typically the authentication methods used in such contexts are weak, depending on passwords passed as plain text over the network.

Existing Authentication Methods; Kerberos

Fortunately, many institutions already have the tools they need to authenticate users without establishing any such infrastructures. These tools come in the form of cluster login IDs or Kerberos principals and passwords (Kerberos is a distributed system for granting users secure access to networked services based on principals, or IDs, instances, and passwords). The trick to any flexible, near-term solution to the Web-access problem is to leverage this existing infrastructure, and, if possible, to leverage the hardware that implements it as well (e.g., leveraging an existing Kerberos key server and its user database).

Unfortunately, integrating existing authentication systems such as Kerberos into generic commercial Web browsers has proven an arduous task, requiring the creation of specialized software and browser plug-ins that must be installed on every client machine running every supported operating system (as, e.g., with CMU's Minotaur and Project Mandarin Inc.'s SideCar and FrontCar). Even if we cut the list of browsers and supported operating systems down to Netscape and Internet Explorer running under MacOS, Windows 95, and NT, we are still talking about a significant distribution, maintenance, and support job. We also place an extra burden on people who want to use kiosks or public machines at other institutions. And we may, depending on how well the vendor supports its software, be leaving users running research operating systems like Linux and Solaris out in the cold.

One can, of course, tack Kerberos support onto just the server end of the equation, and leave the clients out (one can do this by piggybacking Kerberos authentication on the usual basic Web authentication methods). For some servers, most notably Apache, all that is required in order to do this is to add a new module, re-compile the server software, and make a few changes to the configuration files. Such alterations, however, force clients (i.e., people's Web browsers) to pass IDs and passwords over the network clear-text. And such alterations can make it possible for someone with physical access to the network (say, someone in a neighboring office or building) to steal IDs and passwords - as seems far too easy, for example, with NWU and UCI's proxies (cf. CWU's hybrid SSL-based system).

It is also possible to set up the remote server(s) to accept Kerberos tickets or other authentication information passed on as so-called "cookies" (i.e., as session information sent over as part of the HTTP protocol). Such systems also require special plug-ins or modifications on the server end (as, e.g., does the system being tested at UCDavis). Many users also have their browsers set not to accept cookies. And many servers already make extensive use of cookies for something else (e.g., session tracking, and their own authentication systems).

In our opinion, a clean, workable solution to the Web-access problem can't (at least at this stage in the evolution of the protocol) pass around extra cookies. It also can't pass any IDs or passwords clear-text over the network (as many libraries, amazingly, are currently doing). Yet it must still manage to leverage existing authentication equipment and/or methods. And it must not require special add-on packages or browser plug-ins.

Reluctant Vendors

Yet another problem that a clean, workable solution to the Web-access problem must solve is that of reluctant vendors. The fundamental reason why so many Web-based campus information resources are protected from casual outside use is that they are licensed from vendors that require us to do this. That is, as part of our licenses we must stipulate that we will make the resource in question accessible only to certain people. Usually the cost of the license is tied to how many of those people there are (most often the entire campus). And normally the licensee is held responsible for due diligence in excluding everyone else.

It turns out that many vendors are reluctant to extend their licenses to cover people who are part of the community, but who happen to be temporarily or permanently off-campus. They are reluctant because they are skeptical, and rightly so, that we can accurately distinguish members of the community from non-members when those members are coming in over the network from "outside" machines - machines whose return IP (Internet Protocol) addresses do not match any of the telltale institutional network signatures.

Because of these reluctant vendors, any legally defensible solution to the Web-access problem must offer a straightforward means of turning on and off access to specific resources - at least until the individual vendors that license those resources have seen and approved the access control mechanism being used. And such a solution should require no other action on the part of the vendor (such as, e.g., augmenting their authentication scheme; cf. CMU's Shelob, which would require vendors to install additional encrypting server software).

Proxy Servers

One strategy for solving the Web-access problem that has come increasingly to the fore is the establishment of a proxy server. Among other things, what a proxy server can do is make off-campus machines look like on-campus ones by allowing the off-campus machines to pass their Web traffic through an intermediate on-campus "proxy" machine. For this to work securely, however, some form of encrypted password or key-based authentication must be enabled between the off-campus machine and the proxy. Authentication based on the IP address of the client machine obviously won't work, since the whole idea here is to deal with machines coming in from unknown, outside networks. Nor will the plain-text authentication methods used by most non-IP-authenticated proxy servers.

It turns out that some proxy servers can be outfitted with specially secured authentication packages, but in fact doing this requires special client software and/or plug-ins to work. This is a problem, because it creates a need for a lot of additional client-side support. And it ultimately reduces the accessibility of the proxy.

A further problem with proxy servers is that they are often geared specifically for a given user's Internet service provider (ISP). Most large ISPs run caching proxies that keep temporary copies of Web pages their customers fetch. This makes it so that if the customers keep requesting the same pages (as frequently happens with popular sites), they don't have to connect over and over again to the same remote sites, but rather can just look at the temporary copies of those pages that are sitting in the ISP's cache (all of which happens unbeknownst to the user). In order to enforce caching, and to shore up security generally, many ISPs also run firewalls, which in this case force customers to use their ISP's proxy, and no other. In practical terms, what this means is that if you try to use some other proxy in place of your ISP's, your browser may no longer work properly - as they have discovered at the Penn Library. In many cases (e.g., public cluster machines) the whole issue of proxying is moot, because users simply are not allowed to go in and change the browsers' basic settings.

For a solid, working example of a proxy server of this type, see the University of Wisconsin Library's proxy service.

Pass-Through Proxy Servers

The only way to run a proxy service that anybody can use is to run a reverse- or pass-through proxy (when outfitted with a cache, you may also hear it called an accelerator). A pass-through proxy is a proxy that masquerades as the server it is proxying for, such that the proxy appears to hold a mirror image of whatever is on the proxied server.

The process works like this. An outside client (presumably a web browser) requests a page from the pass-through proxy over a secured channel. The proxy, in turn, prompts the client for an ID and a password, if one was not provided. After clearing the ID and password with the local campus Kerberos key server, the pass-through proxy then fetches the requested page from the remote server (i.e., from the server it is mirroring). Finally, it sends the requested information to the client. Throughout this process, the client never talks directly to the remote server; and the remote server never talks directly to the client (all it sees is the on-campus pass-through proxy server).

It's kind of like the proverbial project coordinator who reports on work "he" has been doing - when really all he is doing is passing on information about work done by the other members of his team. Just as our coordinator presents himself as the source of the information, so also the pass-through proxy can make itself look like the source of pages it fetches from the site it mirrors.

Because pass-through proxies look like origin (i.e., "normal") webservers, they can be used in conjunction with other proxies, or through a firewall - just like any other webserver. And because pass-through proxies can be configured to run fully encrypted Secure Socket Layer (SSL) sessions, IDs and passwords can be passed back and forth through various networks without fear of snooping. SSL overhead can be compensated for, partially, by enabling caching on the pass-through proxy (which also helps compensate for browsers that do not cache SSL-enabled pages).

Pass-through proxies also require no set-up on the client (browser) end.

Problems with Pass-Through Proxying

One obstacle to using a pass-through proxy for Web access is that you really need to set up a new proxy server for every remote server being proxied. In this sense, it's not like our proverbial project coordinator, who can take credit for any number of team members' work. It assumes, rather, a one-to-one relationship. To work reliably, without relying on extra information being passed between the client and the proxy server, there has to be a single proxy server for every remote server being proxied. For sites with significant IP-restricted information resources, therefore, you have to run what might seem like a daunting number of proxy servers.

Fortunately, most webservers today allow port-based virtual hosting. That is, they allow you to set up distinct webservers on the same physical machine that differ only in the port number given in the URL (e.g., http://proxy.stg.brown.edu:443/, where "443" is the port number). Port-based virtual hosting can be accomplished without adding new machine names, and without requiring extra (e.g., "Host") headers to be exchanged between client browsers and the server. So, even in cases where there are several dozen vendors with servers that must be proxied, port-based virtual hosting makes it easy to set up enough proxy servers to cover them all.
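As a rough illustration of the one-port-per-vendor arrangement, here is a minimal Python sketch. It is not our Apache configuration; the port numbers, the vendor host names, and the function name `origin_url_for` are all hypothetical placeholders chosen for the example.

```python
from typing import Optional

# Hypothetical table: one local proxy port per remote (vendor) server.
# Removing an entry switches off proxying for that vendor.
PORT_TO_ORIGIN = {
    8001: "http://vendor-a.example.com",
    8002: "http://vendor-b.example.com",
}


def origin_url_for(local_port: int, request_path: str) -> Optional[str]:
    """Map a request arriving on a proxy port to the URL on the mirrored server.

    Returns None when the port is not assigned to any proxied server.
    """
    origin = PORT_TO_ORIGIN.get(local_port)
    if origin is None:
        return None
    return origin + request_path


print(origin_url_for(8001, "/search?q=x"))
```

Because the port alone identifies the remote server, nothing extra has to travel between browser and proxy, and deleting a table entry is all it takes to withdraw a resource whose vendor has not yet approved off-campus access.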

The real problem with using a pass-through proxy to solve the Web-access problem is that links branching off of pages fetched from the mirrored site will often lead users out of the pass-through proxy, and back to the original server. The only way to avoid this problem is to insert a parsing module on the pass-through proxy that rewrites pages sent back to the user so that they contain no reference to the server of origin - i.e., so that links back to the server of origin are replaced by links to the proxy (our system for doing this uses Lex; cf. larger scale work of the same kind being done by OSF).
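The rewriting step can be sketched in a few lines of Python. This is only an illustration: our own parser is written in Lex, whereas the sketch below substitutes a regular expression, and the host names and the function name `rewrite_links` are assumptions made for the example.

```python
import re


def rewrite_links(page: str, origin: str, proxy: str) -> str:
    """Replace absolute references to the origin server with the proxy's address.

    The lookahead keeps longer host names that merely begin with the origin's
    name (e.g. vendor.example.com.other.org) from being rewritten.
    """
    pattern = re.compile(
        re.escape(origin) + r"""(?=[/"'\s>?#]|$)""",
        re.IGNORECASE,
    )
    # A function is used as the replacement so that any backslashes in the
    # proxy string are inserted literally.
    return pattern.sub(lambda match: proxy, page)


html = '<a href="http://vendor.example.com/oed/">OED</a>'
print(rewrite_links(html, "http://vendor.example.com",
                    "https://proxy.example.edu:8001"))
```

A sketch like this only catches absolute URLs that appear literally in the page. URLs assembled on the fly by scripts would slip past it.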

STG's Implementation

At Brown University's Scholarly Technology Group, we implemented a pass-through proxy system that works along the above-described lines. To review some of what has been said above, our basic constraints were:

  • Must leverage our existing Kerberos IV infrastructure
  • Must not pass passwords insecurely (e.g., as clear-text) over the network
  • Must work with any recent version of Netscape
  • Must work on any machine that can run Netscape
  • Must not require special client software or plug-ins
  • Must not require changes to webservers (other than the proxy)
  • Must allow control over what machines are proxied
  • Must serve users coming in via firewalled ISPs
  • Must not create a significant performance problem
  • Must not cost a great deal

In the "nice, but not necessary" category were the following constraints:

  • Utilize off-the-shelf commercial software
  • Incorporate no significant local modifications
  • Work in a completely transparent fashion, from the user's standpoint

Although we expect commercial software to become available within the next two years that can take over the role of our pass-through proxy, nothing available right now fits the bill. As a result, we ended up having little choice but to use free software (i.e., Apache) and modify it locally to suit our needs. Although our system ended up reasonably robust and functional, it did not end up completely transparent to the user - due mainly to the problems with buggy Web clients, firewalled ISPs, and reluctant vendors.

Our approach to making the system tractable for users was to provide "in case you're off campus" documentation and a bounce page for local webmasters to use to redirect users who have found themselves locked out of a resource when coming in from off-site.

Compare the necessarily more elaborate system developed by the United States Navy Virtual Library (D-Lib magazine, March 1997). A system more closely resembling STG's, but that uses encapsulated URLs (URLs within URLs), is being developed by the University of Virginia (UVa mIm). Although mIm uses clear-text passwords, and does not yet deal well with cookies, it is a good system in that it requires no browser-side configuration.

Implementation Problems

In the four months that we have been testing it, we have found our pass-through proxy to be fast, flexible, extensible, fairly robust, and (now that the initial overhead of developing it is past), very low-cost. The only serious, systematic problems we have found to date are either obscure or rare, or else they affect functions that lie outside our main area of concern (which was to make IP-controlled resources available to off-campus users):

  1. Java-based programs requiring direct origin-server -> client connectivity may fail if they also check source IP addresses
  2. JavaScript code may fail if it is assembling URLs on the fly that refer back to the origin server
  3. Clients will not return cookies to the proxy if the server that set the cookie utilized domain restrictions (as a workaround, we simply wipe domain information out of the remote server's Set-Cookie header; in practice this actually works quite well)
  4. The reverse proxy does not handle resources protected by extra passwords or (SSL) encryption; it is for use only with IP-protected resources
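The cookie workaround mentioned in item 3 can be illustrated with a short Python sketch. It is not the code in our modified Apache proxy module; the function name `strip_cookie_domain` and the sample cookie are hypothetical.

```python
def strip_cookie_domain(set_cookie_value: str) -> str:
    """Drop the Domain attribute from a Set-Cookie header value.

    With no Domain attribute, the browser ties the cookie to the host it
    came from, which (from the browser's point of view) is the proxy.
    """
    parts = [part.strip() for part in set_cookie_value.split(";")]
    name_value, attributes = parts[0], parts[1:]
    kept = [
        attr for attr in attributes
        if attr and not attr.lower().startswith("domain=")
    ]
    return "; ".join([name_value] + kept)


print(strip_cookie_domain("session=abc123; Domain=.vendor.example.com; Path=/"))
```

Only attributes after the first `name=value` pair are filtered, so a cookie that happens to be named "domain" is left alone.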

Problems 1-3 above were only rarely seen (our vendors use little Java and JavaScript; only a few currently use domain-restricted cookies). Problem (4), however, became a significant nuisance in a few instances. The technical details here are as follows: Netscape does not react correctly to a 407 error code returned by a proxy server that mimics an origin webserver. Error code 407 is what prompts the browser to send proxy authentication tokens, so Netscape, in effect, cannot supply proxy authentication to our pass-through proxy server (which, as noted above, looks like an origin server). To get around this problem, we had to tell Netscape clients to use normal authentication instead. Unfortunately, we could not arrange for normal authentication tokens to be sent through the proxy to the remote server, because the proxy needs them, and because blind forwarding of such credentials might give away Kerberos principals and passwords.

In practice, the workaround for problem (4) was quite simple: We explained in our documentation pages that webmasters had to set up their "bounce" page so that it would only appear when an off-campus user tried to access a directory containing pages with links to on-campus resources. That way, on-campus users would never see the bounce page. Off-campus users would only see it when they were supposed to (i.e., whenever they attempted to use an IP-restricted resource). And the proxy would never get advertised for services it wasn't designed to offer.
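The decision a webmaster's bounce page has to make can be sketched as follows. This is an assumption about how such a check might be written, not a description of any particular webmaster's setup; the network shown is a reserved documentation range standing in for real campus address blocks, and `needs_bounce_page` is a hypothetical name.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is a documentation range, used here in
# place of an institution's real campus networks.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def needs_bounce_page(client_ip: str) -> bool:
    """Return True when the client is off campus and should be bounced
    toward the pass-through proxy; on-campus clients go straight through."""
    address = ipaddress.ip_address(client_ip)
    return not any(address in network for network in CAMPUS_NETWORKS)


print(needs_bounce_page("192.0.2.17"))   # on-campus address in this sketch
print(needs_bounce_page("203.0.113.5"))  # off-campus address in this sketch
```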

Some other minor aches and pains we experienced include the problem of cached credentials. For example, if a Brown user walks away from a remote kiosk without exiting the browser after having used the pass-through proxy, the next user at that kiosk may be able to access Brown-only resources. (Our pass-through proxy requires users to authenticate separately for each remote server, so the potential damage here is minimal.) We also found the whole notion of entering Kerberos principals through standard Web-authentication forms to be unpalatable, especially since this required SSL-encrypting the entire session (which cut transmission speeds in half for slow modem connections). Finally, we found it inconvenient that, in order to get into the pass-through proxy's document hierarchy, one needed to come in via a specific entry point. One could not, for example, just go to our Oxford English Dictionary page directly and expect to get redirected automatically. One needed, rather, to come in via a page with links to the OED (and to other such resources) - pages that used URLs that would take users through the pass-through proxy.

At first we did, in fact, attempt to set up automatic redirection using a proxy auto configuration (PAC) file. We found the PAC file, however, to be problematic, because of bugs in Internet Explorer 3.02-4 (which is supposed to be able to use PAC files, but doesn't always work properly). We also found that users made many mistakes configuring PAC files, and that PAC files - misconfigured or not - introduced various performance and access problems. E.g., we found that some ISPs used firewalls that rendered browsers set to use our PAC file nonfunctional.

Our ultimate solution to this difficulty was just to provide (as noted) simple "in case you're off campus" docs that (e.g., Library) webmasters could link to from appropriate places in their hierarchies. This, along with our bounce page(s), provided faster, easier, and better-targeted access to the reverse proxy than was possible with PAC files.

As of March 1998, our pass-through system is being evaluated by Brown's Computing Information Systems department for use by the Library and by other departments offering other IP-restricted resources.

Conclusions

Throughout this study, I have emphasized the near-term nature of our solution. The reason for this is that our solution is basically a workaround for a problem whose fundamental terms (how we authenticate and authorize people for use of networked resources) are likely to change within two years, if not much sooner.

Also, whether we like it or not, issues of authentication, authorization, and access control touch on virtually every aspect of life in a networked computing environment: from access to services like development and alumni-relations databases, to tracking purchase orders, updating student records, and using library resources; from collaborative research and editing to file sharing, backups, license servers, institutional directories, and public/private key repositories. Ideally, every platform, from the desktop Macintosh to the server in the air-conditioned room, should understand authentication and authorization, and be able to use the same language to deal with them.

For now, though, the notion of a common environment in which diverse vendors' products all speak a standard authentication language is at best a pipe dream. Vendors have been unable (or unwilling) to form, and agree on, appropriate standards. Some of them, for instance, have invested heavily in a public/private key infrastructure that doesn't yet exist - and that may prove problematic in the future. Others continue to use shared-key systems like Kerberos, which lack broad support in the commercial marketplace. There are authentication methods for nearly every networked filing protocol, from AppleShare to AFS to LDAP to NFS. There are NIS+, Kerberos IV, Kerberos V, DCE, Lotus Notes, CORBA (+ COSS) and other protocols and products, many of which are unable to talk to each other without gateways or compatibility modes and libraries.

For the near-term, therefore, the most sensible way of obtaining a simple solution to a limited problem like off-campus access to source-IP authenticated campus Web resources is to pick a cheap, solid near-term solution, to make it work, then to wait a year or two until the landscape changes and see if another solution presents itself.

Fortunately, our solution requires no special client-side software; it necessitates no basic changes to client browser configurations; it doesn't depend on a particular authentication method. It offers fine-grained control over what services can be accessed; it's cheap to run, easy to use, reasonably fast; and it's secure. Although it only handles IP-authenticated resources, and although it is not as transparent as we would have liked, we have found it to be well suited to our modest needs during this transitional time.

Appendix: Implementation Specifics

The components used to create STG's pass-through proxy are as follows:

  • Operating System: RedHat Linux 5.0 (+ kernel 2.0.33)
  • Web server: Apache 1.2.5 + SSL patches + local mods
  • Public/private key encryption library: RSAREF 2.0
  • Secure-socket layer library: SSLeay-0.8.1
  • Kerberos libraries: kth-krb 0.9.8
  • Authentication system: PAM-0.59 + krb4 (Feb 6) module
  • Other: Miscellaneous Perl scripts to add and subtract entries from the list of proxied machines

Note that RSAREF (one of the components mentioned above) is for nonprofit use only. US businesses can't use it royalty-free. Outside the US, RSAREF may, in some cases, not be needed. Nevertheless, the very fact that you are using secure encrypted transmission technology may put you in violation of other laws. Bottom line: Be careful.

Note also that, in order to leverage Brown's existing Kerberos IV infrastructure, we found it advantageous to add Linux Pluggable Authentication Module (i.e., PAM) support to our proxy server. This allowed us to plug in any authentication method that the operating system on the server would permit. Because the operating system did not ship with Kerberos IV support, and because Kerberos IV was our mandated authentication system, we added a PAM-Kerberos IV module, and then re-installed the PAM subsystem. We then configured the operating system to use Kerberos IV when authenticating for the webserver and to refer all Kerberos IV authentication calls to the campus Kerberos key server. In Kerberos parlance, all we are doing for authentication is checking to see if the campus key server can generate a ticket-granting ticket for a given principal/password pair. This is enough for our purposes.

Using PAM, it would be a trivial operation to switch to some other form of authentication.

For those interested in the details of how the various parts of the system interact in practice, here is how a normal transaction between a client and the pass-through proxy server looks (assume: the client has downloaded and accepted the signer's cert, and is now offering a Kerberos principal and password as basic authentication credentials as part of the web page request):

  1. Client (web browser) makes an encrypted request to the pass-through proxy server (Apache) for a web page
  2. Server (using SSLeay/RSAREF) decrypts the request and passes the client's credentials on to the local host's PAM subsystem
  3. PAM determines that Kerberos authentication is needed and passes the client's credentials on to the local host's Kth-Krb (Kerberos) subsystem
  4. The Kerberos subsystem contacts the local campus key server over the network, and requests a ticket-granting ticket
  5. If the user's credentials are valid, the key server sends back a ticket to the Kerberos subsystem
  6. The Kerberos subsystem then informs PAM that the user has been authenticated
  7. PAM then informs Apache that the user has been authenticated
  8. Apache then passes on the client's original request to the remote server (possibly caching the page for faster access by the next user)
  9. The remote server processes the request, and sends the result back to Apache (which makes some changes to the data via its locally-modified proxy module)
  10. Apache then sends the (modified) data back to the client
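The ten steps above can be condensed into a small Python sketch of the control flow. It is a simplification under stated assumptions: SSL decryption is taken to have happened already, the PAM/Kerberos check is stood in for by a caller-supplied `authenticate` function, and `fetch` and `rewrite` are likewise hypothetical callables rather than parts of Apache.

```python
import base64
from typing import Callable, Optional, Tuple


def parse_basic_auth(header: Optional[str]) -> Optional[Tuple[str, str]]:
    """Extract (user, password) from an HTTP Basic Authorization header."""
    if not header or not header.startswith("Basic "):
        return None
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    user, separator, password = decoded.partition(":")
    return (user, password) if separator else None


def handle_request(
    path: str,
    auth_header: Optional[str],
    authenticate: Callable[[str, str], bool],  # stands in for PAM + Kerberos (steps 2-7)
    fetch: Callable[[str], str],               # stands in for the request to the remote server (steps 8-9)
    rewrite: Callable[[str], str],             # stands in for the locally modified proxy module (step 9)
) -> Tuple[int, str]:
    """Authenticate first; only then fetch from the remote server and rewrite."""
    credentials = parse_basic_auth(auth_header)
    if credentials is None or not authenticate(*credentials):
        return 401, "Authentication required"
    return 200, rewrite(fetch(path))
```

The point of the ordering is that the remote server is never contacted on behalf of a user who has not been cleared, and the user's credentials stop at the proxy rather than being forwarded.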

The labor required to bring up the proxy server breaks down, approximately, as follows. Forty hours went into studying and modifying the Apache source code. An additional forty hours was required to configure the pass-through proxy specifically for Brown's environment, and to fix bugs and problems that surfaced during later testing. A comparable chunk of systems administration time was required to spec out a Linux system, order it, bring it up, "harden" it, and compile in the basic tools needed by the proxy server. An as yet undetermined amount of time (at least eighty hours) was needed to document the system, outfit it with maintenance scripts, and give it a full alpha and beta testing.

Now that we have done the basic coding, documentation, and testing, it seems safe to say that if we were called on to construct another similar server, or to consult on one's construction, we could cut these figures in half. Other institutions should expect similar start-up costs.


Richard Goerwitz

Richard_Goerwitz@Brown.EDU