When your apt-mirror is always downloading

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2008/01/24/apt-mirror-2.html

When I started building our apt-mirror, I ran into a problem: the
machine was throttled against ubuntu.com’s servers, but I had completed
much of the download (which took weeks to get multiple distributions).
I really wanted to roll out the solution quickly, particularly because
the service from the remote servers was worse than ever due to the
throttling that the mirroring created. But with the mirror incomplete,
I couldn’t simply publish the partial repositories.

The solution was to simply let Apache redirect users on to the real
servers if the mirror doesn’t have the file. The first order of
business for that is to rewrite and redirect URLs when files aren’t
found. This is a straightforward Apache configuration:

           RewriteEngine on
           RewriteLogLevel 0
           RewriteCond %{REQUEST_FILENAME} !^/cgi/
           RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-F
           RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-d
           RewriteCond %{REQUEST_URI} !(Packages|Sources)\.bz2$
           RewriteCond %{REQUEST_URI} !/index\.[^/]*$ [NC]
           RewriteRule ^(http://%{HTTP_HOST})?/(.*) http://91.189.88.45/$2 [P]
         

Note a few things there:

  • I have to hard-code an IP address, because as I mentioned in
    the last post on this subject, I’ve faked out DNS
    for archive.ubuntu.com and other sites I’m mirroring. (Note:
    this has the unfortunate side-effect that I can’t easily take advantage
    of round-robin DNS on the other side.)

  • I avoid taking Packages.bz2 from the other site, because
    apt-mirror actually doesn’t mirror the bz2 files (although I’ve
    submitted a patch to it so it will eventually).

  • I make sure that index files get built by my Apache and not
    redirected.

  • I am using Apache proxying, which gives me Yet Another type of
    cache temporarily while I’m still downloading the other packages. (I
    should actually work out a way to have these caches used by apt-mirror
    itself in case a user has already requested a new package while waiting
    for apt-mirror to get it.)
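
To make the order of those conditions easier to follow, here is a rough
Python sketch of the decision the rewrite rules make for each request. It
is only a model and not part of the setup: the function name upstream_url
is made up for illustration, and a plain existence check stands in for
Apache’s -F test, which also takes access controls into account.

```python
import os
import re

# Values copied from the Apache configuration above.
MIRROR_ROOT = "/var/spool/apt-mirror/mirror/archive.ubuntu.com"
UPSTREAM = "http://91.189.88.45"


def upstream_url(path, mirror_root=MIRROR_ROOT, upstream=UPSTREAM):
    """Return the upstream URL to proxy to, or None to serve locally.

    Hypothetical helper: it mirrors the RewriteCond chain, in order.
    """
    # CGI scripts always run on the mirror host itself.
    if path.startswith("/cgi/"):
        return None
    # Anything already mirrored (file or directory) is served locally.
    local = mirror_root + path
    if os.path.isfile(local) or os.path.isdir(local):
        return None
    # Packages.bz2 and Sources.bz2 are never taken from upstream.
    if re.search(r"(Packages|Sources)\.bz2$", path):
        return None
    # Index files (matched case-insensitively) are built by the local Apache.
    if re.search(r"/index\.[^/]*$", path, re.IGNORECASE):
        return None
    # Everything else is passed through to the real server.
    return upstream + path
```

A request for a package that apt-mirror has not fetched yet comes back
with an upstream URL, while anything already on disk returns None and is
served by the mirror.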

Once I do a rewrite like this for each of the hosts I’m replacing with
a mirror, I’m almost done. The problem is that if for any reason my
site needs to give a 403 to the clients, I would actually like to
double-check to be sure that the URL doesn’t happen to work at the place
I’m mirroring from.

My hope was that I could write a RewriteRule based on what the
HTTP return code would be when the request completed. This seemed
really hard to do, and perhaps impossible. The quickest
solution I found was to write a CGI script to do the redirect. So, in
the Apache config I have:

        ErrorDocument 403 /cgi/redirect-forbidden.cgi
        

And, the CGI script looks like this:

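        #!/usr/bin/perl
        
        use strict;
        use warnings;
        use CGI qw(:standard);
        
        # URL of the request that failed, as handed to the ErrorDocument.
        my $val = $ENV{REDIRECT_SCRIPT_URI} || '';
        my $host = '';
        
        # Strip our scheme and host name, keeping only the path.
        # The dots are escaped so they match literally.
        if ($val =~ s%^http://(\S+)\.sflc\.info(/.*)$%$2%) {
           $host = $1;
        }
        
        # Send the client to the matching upstream server.
        if ($host eq "ubuntu-security") {
           $val = "http://91.189.88.37$val";
        } else {
           $val = "http://91.189.88.45$val";
        }
        
        print redirect($val);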
        

With these changes, the user will be redirected to the original when
the files aren’t available on the mirror, and as the mirror gets more
complete, they’ll get more files from the mirror.
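
If more hosts end up behind the mirror, the if/else in the CGI script
could become a lookup table. The following Python sketch shows the idea
under that assumption. The names UPSTREAMS and redirect_target are
hypothetical, and unlike the Perl script it returns None for a URL that
does not look like one of the mirror’s own.

```python
import re

# Upstream addresses taken from the CGI script above.
UPSTREAMS = {"ubuntu-security": "http://91.189.88.37"}
DEFAULT_UPSTREAM = "http://91.189.88.45"


def redirect_target(script_uri):
    """Map a mirror URL that produced a 403 to the same path upstream."""
    match = re.match(r"^http://(\S+)\.sflc\.info(/.*)$", script_uri)
    if match is None:
        # Not one of our mirror host names: nothing sensible to redirect to.
        return None
    host, path = match.groups()
    # Hosts without their own entry fall back to the main archive address.
    return UPSTREAMS.get(host, DEFAULT_UPSTREAM) + path
```

Adding another mirrored host would then mean adding one entry to the
table instead of another branch.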

I still have problems if for any reason the user gets a Packages or
Sources file from the original site before the mirror is synchronized,
but this rarely happens since apt-mirror is pretty careful. The only
time it might happen is if the user did an apt-get update when
not connected to our VPN and only a short time later did one while
connected.