Swish-e and Document Abstracting


Note: This document, and the features it describes adding to swish-e, refer to a now obsolete version of swish-e.

Since this was written (in 1999), all of the features found here, and a bunch more, have been added to the base distribution of swish-e. This document is here for historical reasons only.






Abstract:

If you want to get this kind of search result output from swish-e -- each hit followed by the first bit of the document it points to -- read on.

[ sample search output ]

[ Skip to the code section of this document if you don't want to read the discussion part. ]

Discussion:

I have been using swish-e to index the pages on my employer's web site since about the start of 1999.

After hacking together a little CGI interface to it -- http://www.lhsc.on.ca/cgibin/search -- it wasn't long before I wanted to start adding features like result filtering, result paging, and "start of document" contents to the search output. Adding these things to the C source for swish-e itself sounded like digging a large support hole for myself, so I decided to take the easiest route (at the expense of some speed) by modifying the swish-e spider and my own swish-e CGI front end.

The result filtering and result paging features are relatively simple additions to the CGI front-end, a small perl program.
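
Roughly, that part of the front end looks like the sketch below. This is a minimal illustration rather than the real code: the parameter names, the 20-results-per-page figure, and the crude keyword sanitizing are assumptions made for the example; only the swish-e binary and index paths come from the nightly script shown later.

use CGI;

my $query = CGI->new;

# Crude sanitizing of the search words before handing them to the shell.
my $words = $query->param('keywords') || '';
$words =~ s/[^\w\s]//g;

# Run swish-e against the nightly index and keep only the result lines,
# which start with a numeric rank.
my @results = `/opt/lhsc/www/swish/swish-e -w '$words' -f /www/database/swish-e/lhsc.index`;
@results = grep { /^\d/ } @results;

# Result filtering: drop hits that match a user-supplied "skip" pattern.
my $skip = $query->param('skip') || '';
@results = grep { !/\Q$skip\E/ } @results if $skip;

# Result paging: keep only the slice belonging to the requested page.
my $per_page = 20;
my $page     = $query->param('page') || 1;
my $first    = ($page - 1) * $per_page;
my $last     = $first + $per_page - 1;
$last = $#results if $last > $#results;
my @page_of_results = @results[$first .. $last];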

The "start of document" contents feature, which I refer to as "document abstracting", is arrived at by doing two things: having the spider save the start of each page it fetches into a small database while it builds the index, and having the CGI front end look those saved abstracts up by URL and fold them into the search results.

Note: Since the document abstracting is done with the spider, you can only use it when building indexes using the HTTP method.
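
On the spider side, that first step boils down to something like this: a minimal sketch, assuming the abstract database lives where the nightly script below expects it. The tag stripping, the 250-character cutoff, and the store_abstract name are illustrative only, not the actual swishspider changes.

use GDBM_File;

# Save the start of a fetched page in the abstract database, keyed by URL.
# Meant to be called once per page from inside the spider's retrieval loop.
sub store_abstract {
    my ($url, $content) = @_;

    tie my %abstract, 'GDBM_File',
        '/www/database/swish-e/abstract.gdbm.working',
        &GDBM_WRCREAT, 0644
        or die "can't open abstract database: $!";

    # Strip the markup down to plain text.
    my $text = $content;
    $text =~ s/<head>.*?<\/head>//is;   # drop the HTML head
    $text =~ s/<[^>]*>/ /g;             # strip the remaining tags
    $text =~ s/\s+/ /g;                 # collapse whitespace

    # Keep only the first few hundred characters: the "start of document" contents.
    $abstract{$url} = substr($text, 0, 250);

    untie %abstract;
}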

Code:

  1. Get and install the prerequisites: swish-e itself, perl, and the GDBM_File module that the modified spider and the CGI front end use for the abstract database.

    Get swish-e running and building indexes using the HTTP method before making any changes. That way you know that everything is working before you start modifying it.

  2. Examine and then replace the distributed swishspider with this one.

  3. Examine, modify, copy pieces of, ridicule and improve upon, or otherwise use the CGI code that does all the URL filtering, pagination, and combining of the abstracts and the search results.
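
The part of that CGI code which marries the abstracts to swish-e's results is essentially one database lookup per hit. Again a simplified sketch, assuming a page of raw swish-e result lines like the ones produced in the paging sketch earlier; the print_hit name and the stripped-down result-line parsing are illustrative only.

use GDBM_File;

# Open the abstract database that the spider built (read-only).
tie my %abstract, 'GDBM_File', '/www/database/swish-e/abstract.gdbm',
    &GDBM_READER, 0644
    or die "can't open abstract database: $!";

# Print one search hit with its abstract underneath.
# Only the first two fields (rank and URL) of each result line are used here.
sub print_hit {
    my ($line) = @_;
    my ($rank, $url) = split ' ', $line;
    my $blurb = exists $abstract{$url} ? $abstract{$url} : '';
    print qq{<p><a href="$url">$url</a><br>\n$blurb</p>\n};
}

print_hit($_) for @page_of_results;   # @page_of_results as built in the paging sketch

untie %abstract;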

Notes:

You'll notice that spidering your site to build an index now takes longer, since the spider is doing more work than it used to (parsing each page, updating the abstract database, and so on).

Here's how I update the index and abstract database for my site every night. Notice that both swish-e and the spider write their output to files whose names end in "working", so that the finished files can be renamed into place here.
(From a shell script launched from the web server user's crontab)

# Create swish-e index:
#
/opt/lhsc/www/swish/swish-e -i http://www.lhsc.on.ca/ \
                            -S http \
                            -c /www/database/swish-e/http.conf \
                            -f /www/database/swish-e/lhsc.index-working -v 0

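# Swap the freshly built abstract database and index into place:
#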
mv /www/database/swish-e/abstract.gdbm.working  \
   /www/database/swish-e/abstract.gdbm
mv /www/database/swish-e/lhsc.index-working \
   /www/database/swish-e/lhsc.index


Steve van der Burg
September 9, 1999

Comments and questions should go to steve.vanderburg@lhsc.on.ca.