A review of the technology, basis, and features of open-source and commercial web caching proxies.
Squid 3.0 is the development version of Squid; the latest release, however, is 2.5. Squid 3.0 is written in C++ and appears to (mostly) follow an object-oriented paradigm, in contrast to Squid 2.x, which was written in C.
Notes on some of Squid's algorithms and heuristics:
Squid's primary cache replacement algorithm is based on LRU (Least Recently Used). It uses a queue to record when each cached object was last used; items towards the bottom of the queue are searched periodically for removal. Source: How does Squid's cache replacement algorithm work? Since version 2.4, Squid also offers other replacement algorithms, detailed in Enhancement and Validation of Squid's Cache Replacement Policy.
A large portion of the current Squid architecture is devoted to the configuration of Access Control Lists. These lists have two types of lines: class definitions and operators. The reference for this is the Squid Configuration Manual.
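The LRU idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Squid's code: the class name `LRUCache` and its methods are hypothetical, and the sketch evicts eagerly on insert, whereas Squid is described as scanning the tail of its queue periodically.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU store (hypothetical names; not Squid's implementation)."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Oldest (least recently used) entries sit at the front.
        self._items = OrderedDict()

    def get(self, url: str):
        """Return the cached body, or None on a miss; a hit refreshes recency."""
        if url not in self._items:
            return None
        self._items.move_to_end(url)
        return self._items[url]

    def put(self, url: str, body: bytes) -> None:
        """Store an object and evict least recently used entries if over capacity."""
        self._items[url] = body
        self._items.move_to_end(url)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        """URLs ordered from least to most recently used."""
        return list(self._items)
```

With a capacity of two, storing `a` and `b`, reading `a`, and then storing `c` evicts `b`, because `a` was touched more recently.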
A class definition line looks something like one of the following:
acl name type string1 [string2 [… file?]]
acl name type "file"
The acl 'name' is a unique descriptive string which explains the class. There are several ACL 'types', which can define the class as corresponding to a particular set of IP addresses, a particular time of day, a particular web browser, or almost countless other options. The 'strings' are parameters which describe the class.
Some of the most common acl class types are listed below; some also have a variant that takes regular expressions:
- Source/Destination IP address
- Source/Destination Domain
- Words in the requested URL
- Words in the source or destination domain
- Current day/time
- Destination port
- Protocol (FTP, HTTP, SSL)
- Method (HTTP GET or HTTP POST)
- Browser type
- MIME type
- Name (according to the Ident protocol)
- Autonomous System (AS) number
- Username/Password pair
- SNMP Community
Operators allow or deny actions based on which acl classes a request matches. The most common is http_access. The suggested minimum http_access configuration looks something like:
http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access deny all
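To make the evaluation order concrete, here is a small Python model of how such a rule list is commonly understood to be processed: rules are checked top to bottom, all ACLs on a line must match (a leading `!` negates), and the first matching line decides. The function name, the rule representation, and the idea of passing in a set of already-matched ACL names are assumptions made for this sketch; it is not Squid's implementation.

```python
from typing import List, Set, Tuple

# (action, [acl names]); a leading "!" negates the ACL, as in squid.conf.
Rule = Tuple[str, List[str]]

MINIMUM_RULES: List[Rule] = [
    ("allow", ["manager", "localhost"]),
    ("deny", ["manager"]),
    ("deny", ["!Safe_ports"]),
    ("deny", ["CONNECT", "!SSL_ports"]),
    ("deny", ["all"]),
]


def acl_holds(acl: str, matched: Set[str]) -> bool:
    """True if one ACL term holds for a request that matched the given classes."""
    if acl.startswith("!"):
        return acl[1:] not in matched
    return acl in matched


def http_access(matched: Set[str], rules: List[Rule] = MINIMUM_RULES) -> str:
    """Return 'allow' or 'deny' using first-match semantics."""
    for action, acls in rules:
        if all(acl_holds(a, matched) for a in acls):
            return action
    # No line matched: the default is the opposite of the last line's action.
    return "deny" if rules[-1][0] == "allow" else "allow"
```

Under this model, a cache-manager request from localhost is allowed by the first line, while an ordinary request to a safe port falls through to the final `deny all`.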
Other operators include:
- http_reply_access - allow/deny client replies
- miss_access - force other caches to use you as sibling, not parent
- cache_peer_access - sends requests to a specific cache server
- ident_lookup_access - do an ident lookup if the ACL is matched
- tcp_outgoing_tos - allows setting the TOS value
- tcp_outgoing_address - allows mapping requests to different IPs based on ACL
- reply_max_body_size - stop large file downloads
- log_access - log or don't log
Other config options:
'Delay pools' are bandwidth limiters, which can limit bandwidth based on ACLs. It is possible to create several numbered delay pools, each of which has a class (5 classes in total). The following descriptions are quoted verbatim from the Squid documentation:
- class 1 -- Everything is limited by a single aggregate bucket.
- class 2 -- Everything is limited by a single aggregate bucket as well as an "individual" bucket chosen from bits 25 through 32 of the IP address.
- class 3 -- Everything is limited by a single aggregate bucket as well as a "network" bucket chosen from bits 17 through 24 of the IP address and a "individual" bucket chosen from bits 17 through 32 of the IP address.
- class 4 -- Everything in a class 3 delay pool, with an additional limit on a per user basis. This only takes effect if the username is established in advance - by forcing authentication in your http_access rules.
- class 5 -- Requests are grouped according their tag (see external_acl tag= reply).
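The bit ranges quoted above can be turned into a short Python sketch showing which buckets a client IPv4 address would map to for classes 1 to 3. Bits are counted from the most significant bit (bit 1) to the least significant (bit 32). The function name `bucket_keys` and the returned dictionary layout are hypothetical; classes 4 and 5 are left out because they depend on a username or tag rather than on the address alone.

```python
import ipaddress
from typing import Dict


def bucket_keys(ip: str, pool_class: int) -> Dict[str, int]:
    """Map a client IPv4 address to bucket indexes for delay pool classes 1-3."""
    addr = int(ipaddress.IPv4Address(ip))
    keys = {"aggregate": 0}  # a single shared bucket for every class
    if pool_class == 1:
        return keys
    if pool_class == 2:
        keys["individual"] = addr & 0xFF          # bits 25-32: last octet
        return keys
    if pool_class == 3:
        keys["network"] = (addr >> 8) & 0xFF      # bits 17-24: third octet
        keys["individual"] = addr & 0xFFFF        # bits 17-32: last two octets
        return keys
    raise ValueError("only classes 1-3 are address-based in this sketch")
```

For example, a client at 10.1.2.3 in a class 3 pool shares network bucket 2 with every other client whose third octet is 2, and has its own individual bucket derived from the last two octets.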
The delay_access tag determines which delay pool a request belongs to, using an ACL. The delay_parameters tag sets the actual bandwidth limits.
It is possible to configure just about everything that a Squid installation does, including how many resources it uses, how often it rotates its logs, numerous cache sizing options, etc.
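As a rough illustration of how bucket-based limits behave, the following Python sketch models each bucket as a classic token bucket with a refill rate and a maximum size. Those two parameters, and the names `Bucket` and `take`, are assumptions for this sketch rather than a description of Squid's internals. A request limited by several buckets (for example an aggregate and an individual one) is granted no more than the emptiest bucket allows.

```python
from typing import List


class Bucket:
    """Token bucket holding a byte allowance (hypothetical model)."""

    def __init__(self, restore_per_sec: int, maximum: int):
        self.restore_per_sec = restore_per_sec
        self.maximum = maximum
        self.level = maximum  # buckets start full in this sketch

    def refill(self, seconds: int) -> None:
        """Add allowance for elapsed time, capped at the bucket's maximum."""
        self.level = min(self.maximum, self.level + self.restore_per_sec * seconds)


def take(buckets: List[Bucket], wanted: int) -> int:
    """Grant up to `wanted` bytes, limited by every bucket, and charge them all."""
    granted = min([wanted] + [b.level for b in buckets])
    for b in buckets:
        b.level -= granted
    return granted
```

A small individual bucket therefore throttles one client even while the aggregate bucket still has plenty of allowance left.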
Squid development projects
devel.squid-cache.org is a centralized list of developmental features for Squid that have not yet landed in mainline Squid. Some interesting ones: HTML prefetching, Duplicate Storage Avoidance, Duplicate Transfer Detection, etc.
RabbIT is touted as a "dial-up acceleration" proxy written in Java, designed to increase web browsing speed on low-bandwidth connections. It is focused on features, implemented through a "filters" system:
- GZip compression of HTML pages (those not already compressed)
- Image recompression to a small JPEG quality level
- Ad removal
- HTTP/1.1 pipelining support
Because application of these filters is considered "heavy", RabbIT can also serve as a simple cache.
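A filter of the first kind listed above (gzip compression of HTML that is not already compressed) might look roughly like this in Python. RabbIT itself is written in Java and its actual filter interface is not shown in these notes, so the function name `gzip_filter` and the headers-dictionary representation are purely illustrative assumptions.

```python
import gzip
from typing import Dict, Tuple


def gzip_filter(headers: Dict[str, str], body: bytes) -> Tuple[Dict[str, str], bytes]:
    """Compress HTML responses that do not already carry a Content-Encoding."""
    is_html = headers.get("Content-Type", "").startswith("text/html")
    already_encoded = "Content-Encoding" in headers
    if not is_html or already_encoded:
        return headers, body  # leave other content untouched
    compressed = gzip.compress(body)
    new_headers = dict(headers)
    new_headers["Content-Encoding"] = "gzip"
    new_headers["Content-Length"] = str(len(compressed))
    return new_headers, compressed
```

Responses that are not HTML, or that already declare an encoding, pass through unchanged.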
Blue Coat's SG appliances are general-purpose proxies which seem to be targeted mainly at businesses looking to block users from accessing particular content. The appliances run on a custom operating system, kernel, and filesystem, which supposedly have no ties to Windows or Unix.
Multi-protocol Accelerated Caching Hierarchy:
- Bandwidth management to prioritize various types of traffic
- Protocol optimization (for instance turning serial communications into parallel ones)
- Byte caching
- Object Caching
The proxy SG appliances:
- Use existing authentication sources
- Support complicated authentication rules
- Support custom logging
- Act as SSL middle-man for all SSL sessions
- Can time out after inactivity
- Can erase cached authentication cookies
JAGUAR3000 by ARA Networks is another multipurpose (can operate in an explicit, transparent, and gateway mode) caching proxy.
- 64-bit support (no 4 GB memory limitation)
- Adjustable TTL per content type (e.g. images rarely expire)
- MCT (Minimal Context Thread) architecture: a light-weight threading system consisting of a stack and a minimal amount of context (see the introduction and manual). The thread manager is also implemented as an optional Linux kernel module.
- Raw disk access through special-purpose object storage file system (FreeBSD and Linux)
- The company offers some other interesting projects, such as P2P (peer-to-peer) client-side browser cache sharing
Oracle Application Server Web Cache is Oracle's caching product.
- Compressed object storage
- ESI (Edge Side Include) support: queries dynamic web applications concerning only parts that change
- Caches on the target URL, as well as on the basis of associated headers/cookies
- Specially designed for caching of dynamic content (details sparse)
- Custom API for cache invalidation; accepts URLs or regular expressions in an XML request sent to the cache via HTTP POST (documented in the Administrator's Guide)
- Patent-pending algorithm for deciding when to serve stale objects and when to refresh them, to avoid overloading origin web servers with too many requests. The invalidation API mentioned above can specify a "grace period" during which stale objects may be served, as well as objects that must never be served stale.
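The idea of caching on the target URL together with associated headers and cookies, listed above, can be illustrated with a generic cache-key function in Python. This is not Oracle's algorithm, whose details are not covered here; the function name `cache_key` and the `vary_headers` and `vary_cookies` parameters are hypothetical choices for the sketch.

```python
import hashlib
from typing import Dict, Iterable


def cache_key(
    url: str,
    headers: Dict[str, str],
    cookies: Dict[str, str],
    vary_headers: Iterable[str] = (),
    vary_cookies: Iterable[str] = (),
) -> str:
    """Build a cache key from the URL plus selected header and cookie values."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    parts = [url]
    for name in sorted(h.lower() for h in vary_headers):
        parts.append(f"h:{name}={lowered.get(name, '')}")
    for name in sorted(vary_cookies):
        parts.append(f"c:{name}={cookies.get(name, '')}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

Two requests for the same URL then share a cached object only if the selected headers and cookies also agree, which is one way a cache can hold several variants of a dynamic page.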
Stratacache's Stratacore offers multiple blackbox caching devices, divided into "tiers" which can be selected by price/performance ratio as well as by the number of users serviced. The software appears to be the same from the low-end sub-$1k device to the $125k+ device.
- Different management options: web-based administration UI, custom command-line interface available via telnet, conventional CLI administration (i.e. via SSH)
- Directory support for NTLM, Active Directory, LDAP, and Radius
- Streaming media caching
- Anti-virus filtering
- DNS caching
- Content filtering (WebSense, SquidGuard, Secure Computing)
- Content pre-population
- Enhanced transaction-based logging and reporting
While its features appear similar to Squid's, it is not based on Squid but on an old proprietary caching product that is no longer developed and whose IP now belongs to Novell.
- Cisco Content Engines: Cisco's appliance caching solution. It is difficult to find any useful product information on the website.