News downloading stuff

Last updated 24-Nov-98


Slurp patches

Slurp is a news downloader written by Stephen Hebditch some time ago. As far as I know, the last version he released was slurp-1.10.tar.Z.

Around the same time I released some patches that allow articles from various sites to be ignored, on the basis of the Message-ID. The aim is to avoid tying up the limited modem bandwidth downloading news articles that are likely to be spam.

This is a rather blunt weapon in the anti-spam stakes, but it has the advantage of not having to download the article header for more extensive filtering. (A lot of spam articles have most of their length in the header, so you may as well download the whole article as just the header.) Get these patches from ftp.demon.co.uk or here.
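To make the idea concrete, the test the patches perform boils down to something like the following shell sketch. The real code is C inside slurp itself, and the badsites file name and location here are just my illustration; the point is that the decision needs nothing beyond the Message-ID, which slurp sees before fetching anything else.

    msgid='<12345.67890@annoying.example.com>'

    # Strip the angle brackets and everything up to the '@',
    # leaving just the originating site.
    site=`echo "$msgid" | sed -e 's/[<>]//g' -e 's/^.*@//'`

    # badsites holds one site name per line (illustrative location).
    if grep -i "^$site\$" /var/lib/news/badsites >/dev/null 2>&1
    then
        echo "ignoring article from $site"
    fi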

I've continually improved my version of slurp to cope better with errors and the like, so here you can get the complete set of patches needed to reach the slurp I use.

Unfortunately, the set also includes some of my site-specific stuff for history file format selection and so on. You may want to save your makefile and conf.h files before applying the patches.
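Something along these lines should do; the patch file name is just for illustration, and you may need a -p option to patch depending on how the diffs were made:

    cd slurp-1.10
    cp Makefile Makefile.local      # keep your own settings safe
    cp conf.h conf.h.local
    patch < slurp-patches.diff      # apply the patch set
    # ...then merge your local settings back into Makefile and conf.h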

News-related scripts etc.

As there are many newshosts, I have developed some scripts to generate the slurp control files. These take a list of newsgroups and a list of newshosts and generate the appropriate slurp control lists. I maintain a list of internet addresses for the individual news hosts in my local /etc/hosts file, which I keep up to date by running nslookup against news.demon.co.uk every time I connect. Install these in /var/lib/news/.
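The connect-time update amounts to something like this sketch (the real script is the one above; this assumes the classic nslookup output format and shows just the one host name):

    # Pull the current address for the news host out of nslookup's output.
    addr=`nslookup news.demon.co.uk | awk '/^Name:/ { getline; print $2; exit }'`

    if [ -n "$addr" ]
    then
        # Replace any existing entry with the fresh address.
        grep -v 'news\.demon\.co\.uk' /etc/hosts > /etc/hosts.new
        echo "$addr news.demon.co.uk" >> /etc/hosts.new
        mv /etc/hosts.new /etc/hosts
    fi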

This is the news download script, which gets run whenever a connection is established for getting news; it shows how I vary the newsgroups collected depending on the day of the week. The script is fairly dependent on the fact that I use cnews as my news software (cnews-crg). It should probably be installed in /etc, though so long as your ppp startup script can find it, it doesn't really matter.
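The day-of-week part boils down to something like this sketch. The control file names are made up, and I'm assuming slurp reads its control file from /var/lib/news/slurp.sys and is invoked with the server name; check your own setup.

    case `date +%a` in
    Sat|Sun)
        # Weekend: time for the bulkier groups.
        ctl=/var/lib/news/slurp.weekend
        ;;
    *)
        ctl=/var/lib/news/slurp.weekday
        ;;
    esac

    cp $ctl /var/lib/news/slurp.sys     # put the chosen list where slurp looks
    slurp news.demon.co.uk              # then fetch as usual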

I like to archive FAQs and the like, so at the end of the cnews newsrun script I run this script. It will obviously need changing to suit local circumstances. It should be installed in /var/lib/news/bin/.
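For flavour, the core of the idea is along these lines. The spool path matches a typical cnews layout, but the archive directory and timestamp file are my inventions for illustration:

    # Copy anything new in news.answers since the last run into the archive.
    # (touch /var/lib/news/faq.stamp once by hand before the first run.)
    find /var/spool/news/news/answers -type f -newer /var/lib/news/faq.stamp \
        -exec cp {} /home/archive/faqs \;

    touch /var/lib/news/faq.stamp       # remember where we got to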

My newsreader needs overview files, which cnews doesn't create (or at least I couldn't get it to work properly), so I use mkover, which I got so long ago that I can't remember where from. I had to tweak it a bit, but it works for me. You can see it being run at the end of the previous script.


Bandwidth regulation script

To maximise bandwidth without overloading the modem connection, I've developed a script to restrict access to the network. The script is designed to regulate the number of processes trying to use a ppp link.

Simply call this script immediately before starting a program that accesses or downloads something off the internet. The script looks at the number of interrupts on the serial line used by the modem and waits until there are fewer than some threshold in a second before returning. Note that it also checks that the line is still up and, if not, blocks until it comes back up. This holds up my web page fetching script and stops all the fetches failing.
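In outline it works like this sketch, assuming a Linux /proc/interrupts, a modem on IRQ 4 and an ifconfig test for the ppp link; the real script is the one above, and you would need to adjust the IRQ and threshold for your own hardware:

    THRESHOLD=150       # interrupts per second (explained below)
    IRQ=4               # the serial port the modem is on

    while :
    do
        # Block while the ppp link is down, rather than letting the
        # caller's fetch fail.
        until /sbin/ifconfig ppp0 >/dev/null 2>&1
        do
            sleep 10
        done

        # Count serial interrupts over one second.
        before=`grep "^ *$IRQ:" /proc/interrupts | awk '{ print $2 }'`
        sleep 1
        after=`grep "^ *$IRQ:" /proc/interrupts | awk '{ print $2 }'`

        # Quiet enough? Let the caller go ahead.
        if [ `expr $after - $before` -lt $THRESHOLD ]
        then
            exit 0
        fi
    done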

It turns out that about 7 characters are transferred per interrupt, and I have a 14400 modem, so I've set the threshold at 150 interrupts per second. (At 14400 bps the line carries roughly 1440 characters per second, which at 7 characters per interrupt is about 205 interrupts per second flat out, so 150 waits for the line to be less than about three-quarters loaded.)

Using this new technique to limit the number of processes using the dial-up line, I regularly get 6 kbytes per second over 10 seconds downloading news and various web pages with my 14400 baud modem. Downloading some pdf files, I've even seen the UART maxed out at 115200 baud over a 10 second period. The greatest improvement, though, has been the reduction in the number of web page fetches that hang due to throttling.

Previous attempts to regulate bandwidth used a weighting function assessing the line usage as reported by netstat, but the variation in bandwidth to individual sites made it ineffective.

Simply running accesses serially suffered in that if one access hung, the line would go down while many more waited behind it.

Allowing all accesses to run in parallel, even with a small delay between startups, overloaded the connection, causing many of the accesses to hang.

The advantage of using the interrupt count to measure line usage is that it is much more accurate than the netstat-based approaches above and, to a certain extent, adapts to the variations in traffic. The main problem is that it takes a second to make a decision, and as http accesses are very short, this can lead to gaps in line usage. I'm hoping that tuning the threshold will improve this.


MD5 checksums of all the files are available, should you want to check for modification or corruption.

This page was designed by the Roestock webmaster; if you have any comments on the site, please mail me.