htmlchek version 4.1, February 20 1995

htmlchek.awk, htmlchek.pl - Syntactically checks HTML 2.0 or 3.0 files for a number of possible errors; can do local link cross-reference checking, and generate a rudimentary reference-dependency map. Runs under awk or perl. Includes a number of supplemental utilities for HTML file processing.
Note: The current version of htmlchek.pl has a bug which causes it to claim fatal errors in perfectly good TABLEs. I have made a modification which works around the bug, but it causes the </TR> tag to be required rather than optional.
-- G. J. Perkins -- 1996.06.25.

Author: H. Churchyard churchh@uts.cc.utexas.edu

README.40:
htmlchek version 4.0, January 17 1995

     htmlchek  --  Syntactically checks HTML 2.0 or 3.0 files for a
                   number of possible errors; can do local link
                   cross-reference checking, and generate a
                   rudimentary reference-dependency map.  Runs
                   under awk or perl.  Includes a number of
                   supplemental utilities for HTML file processing.


This release of htmlchek (version 4.0) is a moderately significant
upgrade over previous versions.  The documentation for all programs and
shell scripts other than htmlsrpl.pl is in htmlchek.man/htmlchek.html.
The release includes the following files:

     README.40    This file
   htmlchek.man   Documentation
   htmlchek.html  HTML version of Documentation

   htmlchek.awk   Awk version of htmlchek HTML error checker
   htmlchek.pl    Port of htmlchek to perl
    example.cfg   Sample htmlchek configuration file
   html2dtd.cfg   Config. file for stricter compliance with 2.0 DTD

   htmlqref.txt   Yet another HTML quick reference (plain text)
   htmlqref.html  HTML version of yet another HTML quick reference

   htmlsrpl.pl    HTML-aware search-and-replace program (perl)
   htmlsrpl.man   Documentation for htmlsrpl.pl
   htmlsrpl.html  HTML version of documentation for htmlsrpl.pl

   xtraclnk.pl    Extracts links and link/title text from HTML files (perl)

   makemenu.awk   Makes simple menu for HTML files using <TITLE>; can also
   makemenu.pl      make table of contents using <H1>-<H6> (awk/perl)

     dehtml.awk   Remove all HTML markup, preliminary to spell check (awk)
     dehtml.pl    Perl version of dehtml

     entify.awk   Replace high Latin 1 alphabetic characters with ampersand
     entify.pl      entities for safe 7-bit transport (awk/perl)

   metachar.awk   Trivial program to protect HTML/SGML "&<>" metacharacters
   metachar.pl      in text to be included in an HTML file (awk/perl)

(Unix shell files:)

   htmlchek.sh    Run htmlchek.awk under the best available interpreter,
                    and with options checking
   htmlchkp.sh    Run htmlchek.pl with external options checking
   runachek.sh    Do cross-reference checking using htmlchek.awk
   runpchek.sh    Do cross-reference checking using htmlchek.pl
   rducfila.sh    Reduce .NAME/.HREF files (external xref check, awk)
   rducfilp.sh    Reduce .NAME/.HREF files (external xref check, perl)
   makemenu.sh    Run makemenu.awk under the best available interpreter,
                    and with options checking
     dehtml.sh    Run dehtml.awk under the best available interpreter
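
As an illustration of what "protecting" metacharacters means for
metachar.awk/metachar.pl in the list above, here is a minimal shell
sketch using sed.  It is not the shipped program: the function name
protect_metachars is made up for this example, and it assumes that only
the three characters "&<>" need replacing.  The ampersand is handled
first so that the entities just produced are not escaped a second time.

```shell
# Hypothetical stand-in for metachar.awk/metachar.pl (an assumption,
# not the distributed code): read text on stdin and replace the three
# HTML/SGML metacharacters with their entities.
protect_metachars() {
    # '&' must be replaced first; in a sed replacement a bare '&'
    # means "the matched text", so the literal ampersand is written '\&'.
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g'
}

# Usage:
#   protect_metachars < notes.txt > notes.inc
```

The output can then be pasted into an HTML file without the original
"&<>" characters being read as markup.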


   The htmlchek program checks for quite a number of possible defects
in the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used
on the World-Wide Web.  (Preliminary HTML 3.0 files for the Arena
browser, or files with Netscape extensions, can also be checked by
specifying the appropriate options.)  The program makes no claim to
understand all of SGML, but is easy and relatively simple to use,
gives lots of information (including about many stylistically bad
practices), can do local cross-reference checking and generate
rudimentary reference-dependency maps, and can be run on any platform
for which the language interpreter (awk or perl) is available.

   This release of htmlchek also includes a number of supplemental
utilities, including the htmlsrpl.pl HTML-aware search-and-replace
program, which uses either literal strings or regular expressions;
acts either only outside HTML/SGML tags, or only within tags; can be
restricted to operate only within and/or only outside specified
elements; and can also upper-case tag names.

   The accompanying .sh files are for greater ease of use under Unix
(or, more precisely, under any Posix 1003.2 system, including VMS
Posix).  However, nothing in htmlchek.awk or htmlchek.pl themselves,
or in the accompanying supplemental programs, depends on the Unix
operating system; in particular, the perl programs do not use any of
the Unix-specific systems-programming features of the perl language.
This package can therefore be used on non-Unix systems.

   If you seem to get a million errors the first time you run htmlchek
on a file, don't be dismayed -- sometimes htmlchek can't compensate
for an error, so that the invalid HTML code it has encountered affects
its interpretation of valid HTML code later on in the file.  Just go
back and fix the _first_ error, or first few errors, in the HTML file,
then run htmlchek again and see what you get.  Iterate as necessary.
(However, I have tried to eliminate many of the cascades of redundant
error messages that some earlier versions of this program tended to
generate.)

   The htmlchek program performs a fairly comprehensive job of
checking for HTML errors, but does not always exactly follow the
official standard (currently this is version 1.22 of the HTML 2.0
DTD).  Bad stylistic practices are warned against, as well as actual
HTML errors, and in some cases htmlchek is stricter than the standard,
in order to accommodate the peculiarities of some browsers.  The idea
is that HTML code should be ruggedized for the real world, rather than
just being SGML-ically correct -- especially since the official
standard allows many SGML features which are hardly understood by any
HTML-specific applications; for example, according to the official
standard the following is a completely valid HTML 2.0 file (without
even any omitted tags!):

 <><HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<![IGNORE[ </HTML>]]>/</>

 Version 4.0 of the htmlchek distribution has the following new features:

Main changes to htmlchek: added internal cross-reference checking (not as
hard as I thought it would be!); added option of generating dependency
map; added command-line options to allow `<' and `>' characters within
quoted attribute values and <!-- --> comments, and `>' characters outside
tags.  Other changes: added HTML quick reference, in plain text and .html
versions; added htmlsrpl.pl; added xtraclnk.pl; added makemenu.awk/
makemenu.pl; added metachar.awk/metachar.pl; added Perl version of
entify; enhanced the Unix/Posix-1003.2 shell scripts to redirect
non-program output to STDERR, detect non-zero exit status of awk/perl,
and add required trailing slashes automatically.  Minor changes to
htmlchek: added sample configuration files; added check for content of
<ADDRESS> element; now detect multiple <HEAD> elements in document;
<OPTION>, <TEXTAREA>, and <TITLE> elements should not contain any tags;
<INPUT>, <SELECT> and <TEXTAREA> do not have to be _immediately_
contained within a <FORM> (inclusion exception); allow reqopts=
command-line option to specify multiple required attributes for a single
tag; added dlstrict= option and changed default strictness to that of
dlstrict=1; differentiated novalopts= from tagopts=; added subtract="..."
command-line option (to facilitate checking files outside current
directory); updated Arena/HTML3 language definition; tinkered with the
Netscape language definition (in the absence of any definitive
documentation); improved internal htmlchek.pl options checking; other
minor fixes and enhancements.

Both the awk program htmlchek.awk and a port of this awk program to
perl are included in the distribution (the original reason for doing
the perl port in the first place was to make it possible to add full
off-site cross-reference checking over the Web; however, this
project may never be completed, and at present the awk and perl
programs have the same functionality); similarly, most of the
supplemental programs also have both awk and perl versions.  You might
use one or the other based on personal preference, or because some
vendor-supplied awks on Unix boxes have proven to exhibit unendearing
peculiarities.  You can also get around such peculiarities by using GNU
gawk if it is on your system, or by getting it from one of the ftp sites
listed at the end of almost every posting to the Usenet group
gnu.announce and compiling it; the program htmlchek.sh will
automatically run gawk in preference to nawk or awk, if gawk is on your
system and in your PATH.  Gawk for MS-DOS (and a pointer to OS/2 gawk)
is available from ftp://oak.oakland.edu/SimTel/msdos/awk/.  (See
awk-perl.html.)
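
The preference order described above (gawk, then nawk, then awk) can be
sketched in a few lines of Posix shell.  This is only an illustration
of the idea, not the contents of htmlchek.sh; the function name
pick_interp is hypothetical, and the sketch assumes a shell that
provides the standard `command -v' lookup.

```shell
# Hypothetical helper (not taken from htmlchek.sh): print the first of
# the named commands that can be found in PATH; fail if none is found.
pick_interp() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage: prefer gawk, then nawk, then plain awk.
#   AWK=$(pick_interp gawk nawk awk) || { echo "no awk found" >&2; exit 1; }
#   "$AWK" -f htmlchek.awk infiles.html > outfile.check
```

Passing the candidates as arguments keeps the order visible at the call
site and makes it easy to try a different preference on a system where
one of the awks misbehaves.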

The anonymous ftp site for htmlchek is at:
 ftp://ftp.cs.buffalo.edu/pub/htmlchek/

The htmlchek documentation can be browsed online at:
 http://uts.cc.utexas.edu/~churchh/htmlchek.html



Typical command lines:

   awk -f htmlchek.awk [options] infiles.html > outfile.check

   perl htmlchek.pl [options] infiles.html > outfile.check

The options are in the form "option=value" (see htmlchek.html or
htmlchek.man).  Remember that on some Unix systems ``awk'' is an
archaic incompatible program, so you should use ``nawk'' or ``gawk''
instead; the shell script htmlchek.sh will do this automatically (and
do some options checking as well):

   sh htmlchek.sh [options] infiles.html > outfile.check


Author:  Henry Churchyard  churchh@uts.cc.utexas.edu

README.41:
htmlchek version 4.1, February 20 1995

This is a bugfix and update to version 4.0, adding several minor
features for greater convenience of use.

Changes are: Don't warn about null <TEXTAREA></TEXTAREA> element; only
check for inappropriate whitespace within elements commonly rendered
as underlined (<A> and <U>); check ordering of head tags before body
tags even in absence of explicit <head>...</head>; allow comments
between list items; only output non-numeric unquoted option values in
each file; corrected processing of HTML3 <LH>; updated HTML 3 language
definition to January 19 1995 draft; tinkered with Netscape extensions
language-definition yet again; added inline=1 command-line parameter;
added listfile=/lf= command-line parameter (especially for greater
MS-DOS convenience); allow cf= as abbreviation of configfile=;
ampersands followed by non-alphabetics generate warnings rather than
errors (so corresponding error message was removed from entify); added
"changed"/"unchanged" STDERR messages to htmlsrpl.pl output; added
.gif's to documentation; added awk-perl.html to documentation; added
index.html menu to documentation.

New files in this release are:

     README.41    This file
      index.html  HTML version of README.40, README.41, and menu
   awk-perl.html  Where to obtain Awk and Perl
     geterr.sh    Trivial script to extract only ERROR! messages
                    from htmlchek output
   geterwrn.sh    Trivial script to extract only ERROR!/Warning!
                    messages from htmlchek output
                  ___
        awk.gif      |    .gif files used
      camel.gif      |     in htmlchek HTML
        ftp.gif      |     documentation
   htmlchek.gif      |    (uuencoded as .uue
   htmlchks.gif      |     files in the
   valdhtml.gif      |     comp.sources.misc
    warning.gif   ___|     Usenet distribution)
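
For readers without the distribution at hand, the kind of filtering
that geterr.sh and geterwrn.sh are described as doing can be
approximated with grep.  The following is a rough sketch, not the
shipped scripts: the function names are made up, and it assumes that
each message of interest sits on a line containing the literal text
"ERROR!" or "Warning!", which may not match the real output layout
exactly.

```shell
# Hypothetical filters (assumptions, not geterr.sh/geterwrn.sh
# themselves): read htmlchek output on stdin and keep matching lines.

# Keep only lines that contain "ERROR!".
only_errors() {
    grep 'ERROR!'
}

# Keep lines that contain either "ERROR!" or "Warning!".
only_errors_and_warnings() {
    grep -e 'ERROR!' -e 'Warning!'
}

# Usage:
#   only_errors < outfile.check
#   only_errors_and_warnings < outfile.check
```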


Author:  Henry Churchyard  churchh@uts.cc.utexas.edu