SWAN: SubWeb Analyzer
[ Last modified: 18 February 1998. Latest version of SWAN: 1.3b, 16 Feb 1998 ]
SWAN is a program that analyzes a local subweb of the WWW.
Its purpose is to help maintain independent subwebs.
Definitions are given below.
SWAN can collect some statistics about the subweb and
it can report internal inconsistencies in the subweb.
SWAN can also generate a cross reference for the subweb,
consisting of one (often huge) HTML document, and/or
an HTML document for each HTML document in the subweb.
These cross references constitute, in a sense,
the inverse of the subweb;
that is, they enable one to follow hyperlinks in the reverse direction.
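The idea of inverting a subweb can be illustrated with a small shell sketch. It is not taken from SWAN. The input format (one hyperlink per line, written as `source target') and the function name invert_links are assumptions made for this illustration only.

```shell
# Hypothetical input: one hyperlink per line, "source target".
# Output: one line per target, listing the documents that refer to it.
invert_links() {
    # Bring all hyperlinks to the same target together, dropping duplicates.
    LC_ALL=C sort -k2,2 -k1,1 -u |
    awk '
        $2 != target {
            if (NR > 1) print line      # finish the previous target
            target = $2
            line = $2 " <-"
        }
        { line = line " " $1 }          # append one referencing document
        END { if (NR > 0) print line }
    '
}

invert_links <<'EOF'
index.html doc/a.html
doc/b.html doc/a.html
index.html doc/b.html
EOF
```

For the three sample links this prints `doc/a.html <- doc/b.html index.html' followed by `doc/b.html <- index.html', that is, each document together with the documents that point to it.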
Here are two applications for cross references:
- When you (as webmaster) change a document in a subweb,
it is not uncommon that you have to change referencing documents as well.
The cross reference provides easy access to all such referencing documents.
- When you (as user) read an interesting document in a subweb,
you might wonder where this material is actually used.
A cross reference helps you find reverse related documents.
The example page shows
the application of SWAN to the subweb with SWAN material.
SWAN is available as a UNIX shell script;
see Implementation for details.
The current version is 1.3b dated 16 February 1998
(the source file contains a version history).
A subweb is a subset of the resources on the World-Wide Web (WWW).
The subweb rooted in a directory (on a particular net location)
contains as resources all world-readable files
accessible from that directory
via a path of world-readable-and-searchable directories.
We sometimes refer to the directory inducing a subweb as the
root directory or just the root of the subweb.
Whether the root directory itself is world-readable-and-searchable or not
plays no role in the definition above.
Of course,
the subweb is accessible to the outside world only if its root is
accessible.
N.B. Directories themselves are not considered resources of the subweb.
See the discussion of independence below for a motivation.
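The definition above can be approximated with find. The following is a minimal sketch, not SWAN's actual code. The function name list_resources is hypothetical, and the sketch does not distinguish visible from invisible files.

```shell
# List the resources of the subweb rooted in directory $1 (default: .).
# Assumption: "world-readable" means the read bit for others is set, and
# "world-readable-and-searchable" means read and execute bits for others.
list_resources() {
    root=${1:-.}
    (
        cd "$root" &&
        # Skip every directory below the root that is not world-readable-
        # and-searchable; the permissions of the root itself play no role.
        # Print world-readable regular files only; directories are never
        # printed, since they are not resources.
        find . ! -name . -type d ! -perm -005 -prune -o \
               -type f -perm -004 -print
    )
}

# Example (the path is only an illustration):
#   list_resources "$HOME/public_html"
```

The paths printed all start with `./', because find is started in the root directory itself.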
In the context of a given subweb, a hyperlink is called
- internal if it originates inside and
points inside the subweb;
- external if it originates inside and
points outside the subweb.
Hyperlinks originating outside a subweb are (necessarily)
ignored by SWAN.
A subweb is called independent when
- it can be accessed via various schemes
(in particular, http, ftp, or file)
without a user noticing the difference (except possibly for speed), and
- it can be transferred to another location
without affecting its appearance.
To accomplish independence,
a subweb must adhere to the following conventions:
- Internal hyperlinks use relative URLs
(see RFC 1808).
- Hyperlinks to index.html are explicit,
that is, use `dirpath/index.html',
rather than `dirpath/' or just `dirpath'.
The reason for the latter convention is the following.
When a directory with URL dirpath/ or dirpath is accessed via http,
the file dirpath/index.html is returned, if present,
and a directory listing, otherwise.
Accessing such a directory by ftp or file always
returns a directory listing.
When a directory contains no file index.html,
a direct access of it results in a directory listing under all access schemes.
Currently,
SWAN flags all hyperlinks to directories as inconsistencies
(even hyperlinks to directories not containing a file index.html).
SWAN does not flag as inconsistencies internal hyperlinks using absolute URLs.
However, these hyperlinks are grouped among the external hyperlinks
where they can easily be found.
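The two conventions can be checked for a single hyperlink, at least roughly. The sketch below is an illustration only. The function name check_link and the words it prints (absolute, rooted, directory, ok) are made up here and do not correspond to SWAN's own report format. The optional second argument is the directory of the referring document, so that a bare `dirpath' can be recognized as a directory.

```shell
# Classify the value of an HREF or SRC attribute.
#   $1  the URL as written in the document
#   $2  (optional) directory containing the referring document
check_link() {
    href=$1 dir=${2:-}
    # A leading scheme name followed by ":" makes the URL absolute.
    if printf '%s\n' "$href" | grep -Eq '^[A-Za-z][A-Za-z0-9+.-]*:'; then
        echo absolute
        return
    fi
    path=${href%%#*}        # drop the fragment, if any
    path=${path%%\?*}       # drop the query, if any
    case $path in
        /*) echo rooted ;;                          # form /path
        */ | . | .. | */. | */..) echo directory ;; # clearly a directory
        *)  if [ -n "$path" ] && [ -n "$dir" ] && [ -d "$dir/$path" ]; then
                echo directory                      # bare `dirpath'
            else
                echo ok
            fi ;;
    esac
}
```

For example, `check_link doc/index.html' prints ok, whereas `check_link doc/' prints directory.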
Synopsis
swan [ options ] [ directory ]
Description
SWAN analyzes the subweb rooted in directory.
If no directory is supplied,
then the current directory is assumed.
Without options,
SWAN constructs an inventory of the subweb and
reports on standard output all detected inconsistencies and some statistics.
The following inconsistencies are reported:
- internal hyperlinks to resources that are not available,
- internal hyperlinks to document fragments (anchor names)
that are not defined,
- duplicate fragment definitions, and
- resources that are not referenced.
Inconsistencies concerning internal hyperlinks are reported
at the destination only.
It is for you to determine the actual cause of the problem,
which could also be with the source.
N.B. External hyperlinks are not checked for availability.
(There are other tools for that.)
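The four kinds of inconsistency can be made concrete with a small awk sketch. It works on a simplified, hypothetical inventory, not on SWAN's real one: a line `d#path#fragment' defines a resource (empty fragment) or a fragment, and a line `r#path#fragment#source' records a hyperlink from source. The function name and the message texts are likewise assumptions.

```shell
# Report inconsistencies in a simplified inventory read from standard input.
report_inconsistencies() {
    awk -F'#' '
        $1 == "d" {
            key = $2 "#" $3
            # Report a fragment the second time it is defined.
            if (++defined[key] == 2 && $3 != "")
                print "duplicate fragment: " key
        }
        $1 == "r" {
            referenced[$2] = 1          # the resource is referenced
            wanted[$2 "#" $3] = 1       # and so is this destination
        }
        END {
            for (key in wanted) {
                split(key, part, "#")
                res = part[1] "#"
                if (!(res in defined))
                    print "missing resource: " part[1]
                else if (!(key in defined))
                    print "undefined fragment: " key
            }
            for (key in defined) {
                split(key, part, "#")
                if (part[2] == "" && !(part[1] in referenced))
                    print "unreferenced resource: " part[1]
            }
        }
    ' | LC_ALL=C sort
}
```

As in SWAN, each problem is reported at the destination only; the source field of a hyperlink is not used in this sketch.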
Options
-a
- Analyze all resources of the subweb, also invisible ones,
and all hyperlinks, also to invisible resources.
-h
- Provide help; no processing is done.
-i
- Read the subweb inventory from .SWAN
in the root directory of the subweb being analyzed.
-o
- Write the subweb inventory to .SWAN
in the root directory of the subweb being analyzed.
-q
- Do not report anything on standard output (quiet mode).
+r / -r
- Force world-read permission on/off for cross reference files
(by default, your umask is in effect).
-s
- Report only statistics on standard output, do not list inconsistencies.
-x
- Create an individual cross reference .file.html
for each HTML document file.html.
-X
- Create an overall cross reference ..SWAN.html.
Files
- standard output
- The report is written to standard output.
- directory/.SWAN
- Inventory for subweb rooted in directory.
- directory/..SWAN.html
- Overall cross reference for subweb rooted in directory.
- directory/.file.html
- Cross reference for directory/file.html.
- /tmp/SWANpid.inv
- Temporary file with subweb inventory for -o option.
- /tmp/SWANpid.sh
- Temporary file with shell script to create individual cross references
for -x option.
- /tmp/SWANpid.html
- Temporary file with overall cross references for -X option.
Linking a SubWeb to Its Cross Reference
There are some pitfalls when linking a subweb to the cross reference
generated by SWAN.
Once a cross reference has been generated and SWAN is rerun,
there is the danger that the old cross reference gets incorporated in
the new cross reference (like a virus).
However, two features of SWAN help avoid this problem.
- The cross references reside in invisible files.
They are not seen by SWAN (unless option -a is used)
but they can be available to the outside world.
- Internal hyperlinks to invisible files are ignored by SWAN
(again, unless option -a is used).
The thing you want to avoid is running SWAN with option -a
while cross reference files already exist.
(Maybe there should be an option to remove them on the fly.)
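Until such an option exists, a small guard can warn you before you use option -a. The sketch below is not part of SWAN, and the function names are hypothetical. It assumes that every invisible file whose name ends in .html is a cross reference file, which also matches invisible HTML files that SWAN did not create.

```shell
# List files that look like SWAN cross references below directory $1:
# both ..SWAN.html and .file.html match the pattern .*.html
list_xref_files() {
    find "${1:-.}" -type f -name '.*.html' -print
}

# Succeed only if no such files exist; otherwise list them on standard error.
safe_to_analyze_all() {
    stale=$(list_xref_files "${1:-.}")
    if [ -n "$stale" ]; then
        echo "cross reference files present; remove them before using -a:" >&2
        printf '%s\n' "$stale" >&2
        return 1
    fi
}

# Example (directory name is an illustration):
#   safe_to_analyze_all subweb && swan -a subweb
```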
The current version of SWAN is a prototype
implemented as a UNIX shell script.
Its operation proceeds in four phases.
The first three phases (front end) generate an
inventory of the subweb.
The fourth phase (back end) produces the output from the inventory.
- SWAN uses find and egrep
- to supply definitions for all resources in the subweb, and
- to select fragment definitions and hyperlinks, together with
their line numbers, from all HTML documents in the subweb.
Definitions and hyperlinks are selected by the occurrence of
an anchor or image tag.
The output of this phase is a text file with lines of the form
filepath:linenumber:text
where
- filepath always starts with `./',
- linenumber is empty for supplied (implicit) definitions, and
- text contains `<A' or `<IMG'.
- SWAN uses sed to transform the output of the preceding phase.
It does so in four steps:
- Isolate definitions and hyperlinks, splitting lines when necessary.
SWAN looks for the following patterns (in some disguise):
<A ...NAME=...
<A ...HREF=...
<IMG ...SRC=...
N.B. When the tag name and attribute definition are not on the same line,
they are invisible to SWAN.
Lines now are put in the form
kind#filepath#linenumber#url#fragment
where kind equals `d' for definitions and `r' for hyperlinks (references).
We switched to `#' as field separator instead of `:'
because the former is less likely to cause trouble
when splitting URLs (see next step).
- Split (possibly relative) URLs into their components
(see RFC 1808).
Lines now have the form
kind#dirpath#filename#linenumber#scheme#netloc#path#params#query#fragment
- Resolve relative destination URLs, taking the document's source URL
as base URL (see RFC 1808).
Lines retain their form.
- Remove `./' and `segment/../' from paths (see RFC 1808).
Lines retain their form.
- SWAN uses sort to make each definition and all
hyperlinks to it consecutive lines.
Definitions, if present, appear before hyperlinks.
The output of this phase is called the inventory
of the subweb.
- SWAN uses awk to report statistics and inconsistencies, and
to create the cross references.
It generates a shell script which is fed into sh
to create the individual cross reference files
(working around the limit of 10 output files in awk).
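To give an impression of the front end, here is a greatly simplified sketch of the first three phases. It is not SWAN's code. It assumes a flat subweb (no subdirectories), handles at most one double-quoted attribute per line, skips absolute URLs, rooted URLs and links within the same document, and resolves a relative URL simply by prefixing `./'. For definitions, the url field is assumed to hold the defining file itself, so that sorting on destination works. The function name inventory_sketch is hypothetical, and grep -E stands in for egrep.

```shell
# Print a simplified inventory for the flat subweb in the current directory.
inventory_sketch() {
    {
        # Phase 1: supply an implicit definition "filepath::" per visible
        # resource, and select tag lines as "filepath:linenumber:text".
        find . -type f ! -name '.*' -print | sed 's/$/::/'
        find . -type f -name '*.html' ! -name '.*' \
            -exec grep -En '<(A |IMG)' /dev/null {} +
    } |
    # Phase 2 (one step only): kind#filepath#linenumber#url#fragment
    sed -n \
        -e 's/^\([^:]*\)::$/d#\1##\1#/p' \
        -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*NAME="\([^"]*\)".*/d#\1#\2#\1#\3/p' \
        -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*HREF="\([^"#:\/][^"#:]*\)#\([^"]*\)".*/r#\1#\2#.\/\3#\4/p' \
        -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*HREF="\([^"#:\/][^"#:]*\)".*/r#\1#\2#.\/\3#/p' \
        -e 's/^\([^:]*\):\([0-9]*\):.*<IMG [^>]*SRC="\([^"#:\/][^"#:]*\)".*/r#\1#\2#.\/\3#/p' |
    # Phase 3: sort on destination (url, fragment), then on kind, so that
    # a definition ("d") precedes the hyperlinks ("r") to it.
    LC_ALL=C sort -t'#' -k4,4 -k5,5 -k1,1
}
```

In the output, a hyperlink line that does not directly follow a definition line for the same destination points to something undefined, which is what the back end can then report.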
The correctness of SWAN hinges on the following assumptions about the subweb:
- All HTML documents are syntactically correct.
Actually, SWAN is pretty tolerant, since it ignores most HTML code anyway.
The only tags that play a role are
anchor tags (recognized by `<A ',
i.e. a closing `>' is not required, but the space after A is),
image tags (recognized by `<IMG'),
and (as of version 1.3b) body tags (recognized by `<BODY')
and frame tags (recognized by `<FRAME').
Although the HTML 2.0 Standard requires that values of NAME, HREF, and SRC
attributes are properly delimited by double quotes ("),
SWAN does not require them.
- The complete start tag of each definition and hyperlink in an HTML
document appears on a single line.
- Resources do not contain embedded base URL definitions
(<BASE HREF="..."> in HTML).
When resolving embedded relative URLs,
SWAN always uses the document's retrieval URL as base URL.
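The last assumption can be illustrated by a sketch of how a relative path is resolved against the path of the referring document and then cleaned of `./' and `segment/../'. This is an illustration in the spirit of RFC 1808, not SWAN's code: SWAN does this with sed, whereas the sketch uses awk. The function name resolve_path is hypothetical. Only the path component is handled (no scheme, net location, parameters, query, or fragment), rooted paths of the form /path are not supported, and empty segments and a trailing `/' are dropped.

```shell
# Resolve relative path $2 against the document path $1 (e.g. ./doc/a.html).
resolve_path() {
    base=$1 rel=$2
    case $base in
        */*) dir=${base%/*} ;;      # directory part of the base document
        *)   dir=. ;;
    esac
    printf '%s\n' "$dir/$rel" |
    awk -F/ '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "." || $i == "") continue          # remove "./"
            if ($i == ".." && n > 0 && seg[n] != "..") { # remove "segment/../"
                n--
                continue
            }
            seg[++n] = $i
        }
        out = "."
        for (i = 1; i <= n; i++) out = out "/" seg[i]
        print out
    }'
}
```

For example, `resolve_path ./doc/a.html ../img/pic.gif' prints ./img/pic.gif. A result that still starts with `./../' points outside the subweb.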
Here are some other limitations and possibilities for future development:
- Hyperlinks to URLs of the form /path are
counted as external to the subweb, but are listed as internal
(due to sorting).
- SWAN could be extended to handle other hyperlinks, such as occur in
<BODY BACKGROUND="URL" (done in 1.3b)
<IMG USEMAP="URL"
<IMG LOWSRC="URL"
<AREA HREF="URL"
<META URL="URL"
<EMBED SRC="URL"
<FORM ACTION="URL"
<FRAME SRC="URL" (done in 1.3b)
- SWAN could be extended to `know' more HTML syntax.
That way it could flag syntax errors and be (even) more foolproof when
isolating definitions and hyperlinks.
Currently, you have to use other tools to help
guarantee syntactical correctness.
- The definition of which resources and hyperlinks in a subweb are visible
to SWAN could be more flexible.
Actually,
visibility to SWAN is only an issue when generating a cross reference,
not when checking consistency.
Of course, there may be other reasons why you want to make some
resources in a subweb invisible to SWAN.
- It would be nice for the cross references to have a facility to link
directly to an absolute line number in an HTML document.
For this to be a useful feature,
browsers should more explicitly indicate where a hyperlink leads to.
More particularly, when you follow a hyperlink to a document fragment,
many browsers do not show clearly what is actually referenced,
especially near the end of a document.
(Click here and find out where it is
intended to get you.)
- SWAN could be extended to check availability of external resources.
- Sometimes you might wish that SWAN could also provide the reverse
of hyperlinks to the subweb originating from outside it.
Problem: How to find out about those in a reasonable amount of time?
- In the overall cross reference,
`coarser' grouping could be done.
Now all fragments are grouped per (internal) resource,
but each (internal) resource is separated from others.
As of version 1.3b,
external resources are grouped per net location
(`mailto:' and `news:' links are all put together).
- In cross references, it might be nice for users to include,
besides the resource file name, also the resource title, and
besides the fragment name, also the named text
(appearing between <A NAME=...> and </A>).
- A UNIX-style manual page could be supplied.
- Additional options that might be considered:
- Verify that the subweb is strongly connected, that is,
each HTML document is reachable from every other HTML document.
- Omit line numbers from cross references.
They are intended for webmasters;
they make little sense to users,
though their number might be a useful indicator.
- Omit error messages from cross references (same reason as above).
- Include only error messages in overall cross reference.
- Omit unreferenced resources from overall cross reference.
- Omit external resources from overall cross reference.
- Omit internal non-HTML resources from overall cross reference.
- Omit internal HTML resources from overall cross reference,
optionally include a link to the individual cross reference page.
- Omit statistics from overall cross reference.
- Include statistics in individual cross references.
- Report only inconsistencies
involving certain files (either as source or destination of a hyperlink).
- Make cross references for certain files only.
- Use alternate directory for temporary files.
- Allow the specification of alternate file names for the inventory
and overall cross reference.
Below is a brief list of tools related to SWAN.
The Weblint Home Page also has ``information which may be of interest''.
- EIT's Link Verifier
- An optional extension of the Webmaster's Starter Kit made available by
Enterprise Integration Technologies.
- htmlchek
- Developed and maintained by Henry Churchyard. [ FTP ]
- html_analyzer
- Developed, but no longer maintained, by James Pitkow. [ FTP ]
- MOMspider
- Developed (and maintained?) by Roy Fielding. [ FTP ]
- weblint
- Developed and maintained by Neil Bowers. [ FTP ]
- MacWebLint
- Ported by Jon S. Stevens. [ FTP ]
Also needs MacPerl 5.
- webxref
- Developed and maintained by Rick Jansen.
The support of the MAVERIC Research Group and the
Department of Computer Science at the University of Waterloo, Canada,
enabled me to develop a prototype for SWAN.
Tom Verhoeff / wstomv@win.tue.nl