FTP Mirror Tracker: How it works.

From: Alexei Novikov <anovikov@dont-contact.us>
Date: Mon, 20 Sep 1999 17:41:02 +0400

Some of you probably already know about this work, as it has been
discussed a number of times on the TF-CACHE mailing list. If so, you
can safely ignore this message.

The original idea was to create a redirector script (or URN
redirector) that points to the nearest FTP mirror (or set of mirrors)
when processing requests from cache servers (or any other
software). To do this we have to be able to establish a direct
correspondence between files and directories on different FTP
servers. We use quite a nice algorithm that can do this.

The FTP Mirror Tracker Server (Mirror Tracker) looks at the directory
contents of FTP servers worldwide and calculates an MD5 checksum over
the concatenated file names, file sizes, and optionally timestamps.
This information is propagated to the parent directories so that only
complete replicas of the original source are used for redirection.
We can therefore be fairly confident that the files in the directory
requested by the user are the same as those at a nearby mirror (a
hit). Unfortunately, the file timestamps on FTP servers are sometimes
completely wrong, so we have to keep digests both with timestamps
(hits on such digests are reported as "Exact Hits") and without
("Time Blind"). We store these two checksums together with the URL of
the directory and its hexadecimal MD5 checksum.
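
As an illustration of the two digests described above, here is a
minimal Python sketch. It is not the Mirror Tracker code: the Entry
type, the directory_digests() function, the field separators, and the
way child digests are folded into the parent are all assumptions made
for the example.

```python
import hashlib
from typing import NamedTuple


class Entry(NamedTuple):
    """One file from an FTP directory listing (hypothetical type)."""
    name: str
    size: int
    mtime: int  # timestamp as reported by the server, may be wrong


def directory_digests(files, subdir_digests=()):
    """Return (exact, time_blind) hex MD5 digests for one directory.

    files          -- iterable of Entry for the files in the directory
    subdir_digests -- iterable of (name, exact_hex, blind_hex) for the
                      child directories, so that a parent only matches
                      when everything below it matches too
    """
    exact = hashlib.md5()
    blind = hashlib.md5()
    # Sort so that the listing order of a server does not matter.
    for e in sorted(files):
        blind.update(f"{e.name}\0{e.size}\n".encode())
        exact.update(f"{e.name}\0{e.size}\0{e.mtime}\n".encode())
    for name, sub_exact, sub_blind in sorted(subdir_digests):
        exact.update(f"{name}/\0{sub_exact}\n".encode())
        blind.update(f"{name}/\0{sub_blind}\n".encode())
    return exact.hexdigest(), blind.hexdigest()
```

Two directories with the same names and sizes but different
timestamps get the same "Time Blind" digest and different exact
digests in this sketch.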

In order to handle links, we store an additional table containing the
complete tree produced by traversing the links, with pointers to the
real directories. In this way we were able to remove the
time-consuming processing of links from the URN resolver script.
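
A links table of this kind could be precomputed roughly as in the
following Python sketch. The function names and the input format (a
set of real directory paths and a dictionary of links) are
hypothetical, and the sketch only follows links whose target is a
real directory or another link.

```python
def build_link_table(real_dirs, links):
    """Map every path reachable through a link to its real directory.

    real_dirs -- set of real directory paths, without trailing slash,
                 e.g. {"/pub/linux", "/pub/linux/kernel"}
    links     -- dict of link path -> target path,
                 e.g. {"/linux": "/pub/linux"}
    """
    table = {}
    for link, target in links.items():
        # Follow chains of links, remembering what we have seen so
        # that a cycle cannot loop forever.
        seen = set()
        while target in links and target not in seen:
            seen.add(target)
            target = links[target]
        if target in links:
            continue  # the chain ended in a cycle, skip this link
        # Expand the whole subtree below the link target.
        for d in real_dirs:
            if d == target or d.startswith(target + "/"):
                table[link + d[len(target):]] = d
    return table


def resolve(table, path):
    """Resolver-side lookup: one dictionary access, no traversal."""
    return table.get(path, path)
```

All the traversal work happens when the table is built, so the
resolver only performs a single lookup per request.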

Both the directory checksum information and the links table are stored
in a database. (I'm using MySQL, as it is well optimized for the
lookups made by the redirector.)
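
To give an idea of the lookups involved, here is a small Python
sketch. It uses the sqlite3 module from the standard library as a
stand-in for MySQL so that it runs without a server, and the table
and column names are invented for the example; the real schema is not
shown in this message.

```python
import sqlite3


def open_tracker_db():
    """Create an in-memory database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dirs (
            url       TEXT PRIMARY KEY,
            exact_md5 TEXT NOT NULL,   -- digest with timestamps
            blind_md5 TEXT NOT NULL    -- digest without timestamps
        );
        CREATE INDEX dirs_blind ON dirs (blind_md5);
        CREATE TABLE links (
            link_url TEXT PRIMARY KEY,
            real_url TEXT NOT NULL
        );
    """)
    return db


def find_mirrors(db, url):
    """Return [(mirror_url, hit_type)] for the requested directory."""
    # Replace a link by the real directory it points to, if any.
    row = db.execute(
        "SELECT real_url FROM links WHERE link_url = ?", (url,)
    ).fetchone()
    if row is not None:
        url = row[0]
    row = db.execute(
        "SELECT exact_md5, blind_md5 FROM dirs WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return []
    exact, blind = row
    rows = db.execute(
        "SELECT url, exact_md5 = ? FROM dirs"
        " WHERE blind_md5 = ? AND url != ? ORDER BY url",
        (exact, blind, url),
    )
    return [(u, "Exact Hit" if is_exact else "Time Blind")
            for u, is_exact in rows]
```

Every step is an indexed lookup by key, which is the kind of query
the redirector needs to be fast.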

The drawback of this approach is that we can never be 100% sure that
the files are really the same. It sometimes happens that files with
the same timestamps and sizes belong to different projects (these are
mainly system files such as CVS data). In order to remove such
files/directories from the list offered for redirection, a
longest-prefix match algorithm is used, i.e. we compare the complete
paths to the directories and assign different weights based on the
similarity of the directory names.
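
The exact weighting is not described here, so the following Python
sketch is only one possible reading. It assumes that mirrors tend to
differ in their leading directories, and therefore compares path
components starting from the end (the longest common prefix of the
reversed component lists). The function names and the min_weight
threshold are hypothetical.

```python
from urllib.parse import urlsplit


def path_weight(requested_url, candidate_url):
    """Count the trailing path components two directory URLs share."""
    req = [p for p in urlsplit(requested_url).path.split("/") if p]
    cand = [p for p in urlsplit(candidate_url).path.split("/") if p]
    weight = 0
    for a, b in zip(reversed(req), reversed(cand)):
        if a != b:
            break
        weight += 1
    return weight


def rank_candidates(requested_url, candidates, min_weight=1):
    """Drop weakly matching candidates and put the best match first."""
    scored = [(path_weight(requested_url, c), c) for c in candidates]
    kept = [(w, c) for w, c in scored if w >= min_weight]
    kept.sort(key=lambda wc: -wc[0])  # stable: ties keep input order
    return [c for _, c in kept]
```

With a request for .../pub/linux/kernel/CVS/, a candidate ending in
/linux/kernel/CVS/ gets weight 3, while a CVS directory from an
unrelated project such as .../gnu/emacs/CVS/ gets only weight 1 and
can be filtered out by raising min_weight.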

Each Mirror Tracker is supposed to prepare information about the
servers in only one or two domains, as this is quite a time-consuming
operation. Information about other domains should be fetched from
fellow servers. It is organized as follows: as soon as a Mirror
Tracker has the required files, it requests information about the
other servers (from the master server at <URL http://squid.itep.ru>)
and sends them a request to download these new files. After the
download is complete (we compare checksums), these files are stored in
the database and used for redirection.
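
The checksum comparison after a download could look like the Python
sketch below. The use of MD5 for this step, the function names, and
the store callback are assumptions; the message only says that
checksums are compared before the files are loaded.

```python
import hashlib


def verify_download(data: bytes, expected_md5: str) -> bool:
    """Check a downloaded file against the checksum announced for it."""
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()


def import_if_valid(data: bytes, expected_md5: str, store) -> bool:
    """Pass the data to store() only if the download is intact.

    store is a hypothetical callback that loads the file into the
    local database.
    """
    if not verify_download(data, expected_md5):
        return False
    store(data)
    return True
```

A truncated or corrupted transfer is rejected here and never reaches
the database used for redirection.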
Received on Mon Sep 20 1999 - 07:59:27 MDT
