[GRASS-dev] HTML files

I have been through and fixed some problems which prevented some of
the HTML files from validating. AFAICT, everything now validates (with
the sole exception of missing "alt" attributes within <img> tags).

Please ensure that all HTML files continue to validate against the
HTML 4.0 Transitional DTD. At some point, I want to replace g.html2man
with something more robust (e.g. something which handles tables), and
I don't particularly want to make a "smart" (i.e. fault-tolerant) HTML
parser (e.g. Beautiful Soup) a required dependency.

If you have OpenSP or OpenJade, you can validate an HTML file with
e.g.:

  nsgmls -s -c /usr/share/sgml/openjade-1.3.2/pubtext/HTML4.soc <filename>.html

[The program may be called nsgmls or onsgmls, and the exact location
where the catalogues are installed will vary.]

This needs to be done on the completed HTML file in
dist.<arch>/docs/html; the <module>.html files in the module
directories won't normally validate, as they lack the header which is
added by running the module with the --html-description option.
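
As a rough sketch of how the check above could be run over the whole
directory, the following Python wrapper builds the same command line
for every .html file it finds. The catalogue path is the one from the
example above and will differ between systems. The function names, the
choice of trying onsgmls before nsgmls, and the assumption that a
non-zero exit status means "did not validate" are mine, for
illustration only.

```python
import shutil
import subprocess
from pathlib import Path

# Catalogue path taken from the example above; adjust for your system.
DEFAULT_CATALOG = "/usr/share/sgml/openjade-1.3.2/pubtext/HTML4.soc"


def find_validator():
    """Return the first of onsgmls/nsgmls found on PATH, or None."""
    for name in ("onsgmls", "nsgmls"):
        if shutil.which(name):
            return name
    return None


def build_command(html_file, program="nsgmls", catalog=DEFAULT_CATALOG):
    """Build the argument list used in the example: -s and -c <catalogue>."""
    return [program, "-s", "-c", catalog, str(html_file)]


def validate_tree(html_dir, catalog=DEFAULT_CATALOG):
    """Run the validator on every *.html file in html_dir.

    Returns the list of files for which the validator exited with a
    non-zero status (assumed here to signal validation errors).
    """
    program = find_validator()
    if program is None:
        raise RuntimeError("neither onsgmls nor nsgmls was found on PATH")
    failed = []
    for path in sorted(Path(html_dir).glob("*.html")):
        result = subprocess.run(build_command(path, program, catalog))
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Calling validate_tree() on dist.<arch>/docs/html (with <arch> filled in
for your build) would then list the files which still need attention.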

FWIW, the most common error was using block elements (e.g. <div>,
<pre>, <p>) in contexts where only inline elements are allowed
(primarily <dt>).

You can determine which elements are allowed where from the DTD:

http://www.w3.org/TR/1998/REC-html40-19980424/sgml/loosedtd.html

E.g. the definition:

<!ELEMENT DT - O (%inline;)* -- definition term -->

indicates that only inline elements are allowed inside DT, while e.g.:

<!ELEMENT DD - O (%flow;)* -- definition description -->

indicates that both block and inline elements are allowed inside DD.
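
If it helps to see the pieces of such a declaration spelled out, here
is a minimal Python sketch which splits one into its parts. The two
characters after the element name say whether the start and end tags
may be omitted ("-" means required, "O" means omissible). The class
and function names are made up for this example, and the regular
expression only copes with simple one-line declarations like the two
above, not with everything that appears in the DTD.

```python
import re
from typing import NamedTuple, Optional


class ElementDecl(NamedTuple):
    name: str
    start_tag_optional: bool
    end_tag_optional: bool
    content_model: str
    comment: Optional[str]


# Simplified pattern: name, two tag-omission flags, content model,
# and an optional "-- comment --" before the closing ">".
_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-O])\s+(?P<end>[-O])\s+"
    r"(?P<model>.*?)\s*(?:--\s*(?P<comment>.*?)\s*--)?\s*>"
)


def parse_element_decl(text):
    """Parse a simple one-line <!ELEMENT ...> declaration."""
    match = _DECL.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a simple ELEMENT declaration: %r" % text)
    return ElementDecl(
        name=match.group("name"),
        start_tag_optional=match.group("start") == "O",
        end_tag_optional=match.group("end") == "O",
        content_model=match.group("model"),
        comment=match.group("comment"),
    )


dt = parse_element_decl("<!ELEMENT DT - O (%inline;)* -- definition term -->")
dd = parse_element_decl("<!ELEMENT DD - O (%flow;)* -- definition description -->")
print(dt.name, dt.content_model)
print(dd.name, dd.content_model)
```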

If you don't want to read the DTD, here's a rough summary:

Entity classes:

  %StyleSheet = <CSS stylesheet>
  %Script = <JavaScript code>
  
  %html.content = HEAD, BODY
  %head.content = TITLE, ISINDEX, BASE
  %heading = H1, H2, H3, H4, H5, H6
  %fontstyle = TT, I, B, U, S, STRIKE, BIG, SMALL
  %phrase = EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR,
        ACRONYM
  %special = A, IMG, APPLET, OBJECT, FONT, BASEFONT, BR, SCRIPT,
        MAP, Q, SUB, SUP, SPAN, BDO, IFRAME
  %formctrl = INPUT, SELECT, TEXTAREA, LABEL, BUTTON
  %list = UL, OL, DIR, MENU
  %head.misc = SCRIPT, STYLE, META, LINK, OBJECT
  %pre.exclusion = IMG, OBJECT, APPLET, BIG, SMALL, SUB, SUP,
        FONT, BASEFONT
  %preformatted = PRE
  %block = P, DL, DIV, CENTER, NOSCRIPT, NOFRAMES,
        BLOCKQUOTE, FORM, ISINDEX, HR, TABLE, FIELDSET,
        ADDRESS, %heading, %list, %preformatted
  %inline = #PCDATA, %fontstyle, %phrase, %special, %formctrl
  %flow = %block, %inline
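
The entity classes nest, so working out whether e.g. DIV is "inline"
means following the references. A small Python sketch of that
expansion, using the class lists from the summary above (the table and
function names are only illustrative, and only the classes needed for
%flow are included):

```python
# Entity classes copied from the summary above; entries starting with
# "%" refer to other classes.
ENTITIES = {
    "%heading": ["H1", "H2", "H3", "H4", "H5", "H6"],
    "%fontstyle": ["TT", "I", "B", "U", "S", "STRIKE", "BIG", "SMALL"],
    "%phrase": ["EM", "STRONG", "DFN", "CODE", "SAMP", "KBD", "VAR",
                "CITE", "ABBR", "ACRONYM"],
    "%special": ["A", "IMG", "APPLET", "OBJECT", "FONT", "BASEFONT", "BR",
                 "SCRIPT", "MAP", "Q", "SUB", "SUP", "SPAN", "BDO", "IFRAME"],
    "%formctrl": ["INPUT", "SELECT", "TEXTAREA", "LABEL", "BUTTON"],
    "%list": ["UL", "OL", "DIR", "MENU"],
    "%preformatted": ["PRE"],
    "%block": ["P", "DL", "DIV", "CENTER", "NOSCRIPT", "NOFRAMES",
               "BLOCKQUOTE", "FORM", "ISINDEX", "HR", "TABLE", "FIELDSET",
               "ADDRESS", "%heading", "%list", "%preformatted"],
    "%inline": ["#PCDATA", "%fontstyle", "%phrase", "%special", "%formctrl"],
    "%flow": ["%block", "%inline"],
}


def expand(name, table=ENTITIES):
    """Recursively expand an entity class into a flat set of names."""
    result = set()
    for item in table[name]:
        if item.startswith("%"):
            result |= expand(item, table)
        else:
            result.add(item)
    return result


print("DIV" in expand("%inline"))   # is DIV an inline element?
print("DIV" in expand("%flow"))     # is DIV allowed where %flow is?
```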

The immediate children permitted for each element are:
  
  A: %inline
  ABBR: %inline
  ACRONYM: %inline
  ADDRESS: %inline, P
  APPLET: %flow, PARAM
  B: %inline
  BDO: %inline
  BIG: %inline
  BLOCKQUOTE: %flow
  BODY: %flow, INS, DEL
  BUTTON: %flow
  CAPTION: %inline
  CENTER: %flow
  CITE: %inline
  CODE: %inline
  COLGROUP: COL
  DD: %flow
  DEL: %flow
  DFN: %inline
  DIR: LI
  DIV: %flow
  DL: DT, DD
  DT: %inline
  EM: %inline
  FIELDSET: %flow, LEGEND
  FONT: %inline
  FORM: %flow
  FRAMESET: FRAMESET, FRAME, NOFRAMES
  H1: %inline
  H2: %inline
  H3: %inline
  H4: %inline
  H5: %inline
  H6: %inline
  HEAD: %head.content, %head.misc
  HTML: %html.content
  I: %inline
  IFRAME: %flow
  INS: %flow
  KBD: %inline
  LABEL: %inline
  LEGEND: %inline
  LI: %flow
  MAP: %block, AREA
  MENU: LI
  NOFRAMES: %flow
  NOSCRIPT: %flow
  OBJECT: %flow, PARAM
  OL: LI
  OPTGROUP: OPTION
  OPTION: #PCDATA
  P: %inline
  PRE: %inline
  Q: %inline
  S: %inline
  SAMP: %inline
  SCRIPT: %Script
  SELECT: OPTGROUP, OPTION
  SMALL: %inline
  SPAN: %inline
  STRIKE: %inline
  STRONG: %inline
  STYLE: %StyleSheet
  SUB: %inline
  SUP: %inline
  TABLE: CAPTION, COL, COLGROUP, THEAD, TFOOT, TBODY
  TBODY: TR
  TD: %flow
  TEXTAREA: #PCDATA
  TFOOT: TR
  TH: %flow
  THEAD: TR
  TITLE: #PCDATA
  TR: TH, TD
  TT: %inline
  U: %inline
  UL: LI
  VAR: %inline
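
To illustrate how the table above can be used, here is a minimal
Python sketch which looks up whether one element may be an immediate
child of another. It only covers a handful of parents, and it ignores
ordering and repetition (e.g. that TABLE allows at most one CAPTION),
so it is a rough aid rather than a validator. The names are invented
for the example.

```python
# Flattened %inline and %block, taken from the entity classes above.
INLINE = set(
    "#PCDATA TT I B U S STRIKE BIG SMALL EM STRONG DFN CODE SAMP KBD VAR "
    "CITE ABBR ACRONYM A IMG APPLET OBJECT FONT BASEFONT BR SCRIPT MAP Q "
    "SUB SUP SPAN BDO IFRAME INPUT SELECT TEXTAREA LABEL BUTTON".split()
)
BLOCK = set(
    "P DL DIV CENTER NOSCRIPT NOFRAMES BLOCKQUOTE FORM ISINDEX HR TABLE "
    "FIELDSET ADDRESS H1 H2 H3 H4 H5 H6 UL OL DIR MENU PRE".split()
)
FLOW = INLINE | BLOCK

# A subset of the "immediate children" table above.
CHILDREN = {
    "DL": {"DT", "DD"},
    "DT": INLINE,
    "DD": FLOW,
    "P": INLINE,
    "UL": {"LI"},
    "OL": {"LI"},
    "LI": FLOW,
    "TABLE": {"CAPTION", "COL", "COLGROUP", "THEAD", "TFOOT", "TBODY"},
    "TBODY": {"TR"},
    "TR": {"TH", "TD"},
    "TH": FLOW,
    "TD": FLOW,
}


def allowed_child(parent, child):
    """True if child may appear directly inside parent (subset only)."""
    return child.upper() in CHILDREN[parent.upper()]


print(allowed_child("dt", "div"))
print(allowed_child("dd", "div"))
```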

Some elements don't allow certain elements as descendants:

  A: A
  BUTTON: %formctrl, A, FORM, ISINDEX, FIELDSET, IFRAME
  DIR: %block
  FORM: FORM
  LABEL: LABEL
  MENU: %block
  PRE: %pre.exclusion
  TITLE: %head.misc
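
Unlike the child table, these exclusions apply at any depth, so a
check has to look at every open ancestor rather than just the parent.
A hedged Python sketch for a few of the entries above (the function
name and the idea of passing the ancestors as a list are my own
choices for illustration):

```python
PRE_EXCLUSION = {"IMG", "OBJECT", "APPLET", "BIG", "SMALL", "SUB", "SUP",
                 "FONT", "BASEFONT"}

# A subset of the exclusion table above.
EXCLUSIONS = {
    "A": {"A"},
    "FORM": {"FORM"},
    "LABEL": {"LABEL"},
    "PRE": PRE_EXCLUSION,
}


def excluded_by(ancestors, element):
    """Return the ancestors (outermost first) which forbid element.

    ancestors is the list of currently open elements, outermost first.
    """
    return [a for a in ancestors if element in EXCLUSIONS.get(a, ())]


# <pre><b><img ...> is still an error, even though B allows IMG.
print(excluded_by(["BODY", "PRE", "B"], "IMG"))
```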

Notes:

1. The children of DIR/MENU are LI, which normally allows %flow
content, but LI elements inside DIR/MENU can't contain block elements.
UL/OL don't have this restriction.

2. DT cannot contain block elements, but DD can. This means that you
can't use <div class="code"><pre> in a DT; use <span class="code"><tt>
instead. DIV and PRE are block elements; SPAN and TT are inline.
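
As this was the most common mistake, a quick way to hunt for it may be
useful. The sketch below uses Python's standard html.parser module to
report block elements which start while a DT is open. It is only a
rough lint under simplifying assumptions (it treats a <dd> start tag
or a </dl> end tag as closing the DT); it is not a substitute for
validating with nsgmls. The class and function names are made up.

```python
from html.parser import HTMLParser

# %block from the summary above; html.parser reports tag names in lower case.
BLOCK = set(
    "p dl div center noscript noframes blockquote form isindex hr table "
    "fieldset address h1 h2 h3 h4 h5 h6 ul ol dir menu pre".split()
)


class DtBlockLint(HTMLParser):
    """Record block-level start tags seen while a DT is open."""

    def __init__(self):
        super().__init__()
        self.in_dt = False
        self.findings = []  # list of (tag, line number)

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.in_dt = True
        elif tag == "dd":
            # The end tag of DT may be omitted; <dd> closes it.
            self.in_dt = False
        elif self.in_dt and tag in BLOCK:
            self.findings.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in ("dt", "dl"):
            self.in_dt = False


def blocks_in_dt(html):
    """Return (tag, line) pairs for block elements found inside DT."""
    lint = DtBlockLint()
    lint.feed(html)
    lint.close()
    return lint.findings


bad = '<dl><dt><div class="code"><pre>r.info map</pre></div></dt><dd>ok</dd></dl>'
good = '<dl><dt><span class="code"><tt>r.info map</tt></span></dt><dd>ok</dd></dl>'
print(blocks_in_dt(bad))
print(blocks_in_dt(good))
```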

3. TABLE cannot have TR as a child. But TBODY can have TR, and TBODY
allows both the start and end tags to be omitted, so
<table><tr>....</tr></table> is really just a shorthand for
<table><tbody><tr>....</tr></tbody></table>.

4. P cannot contain blocks. So <p>...<div> is actually shorthand for
<p>...</p><div>. But <p>...<div>...</div>...</p> is an error, as the
</p> doesn't match any open element (the <div> implicitly closed the
original <p>, and P doesn't allow the start tag to be omitted).
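
The implicit closing of P can be imitated with a simple stack of open
elements. The following Python sketch (again using the standard
html.parser module) is a simplification, not a validator: it only
implies </p> when a block element starts directly inside a P, and it
reports end tags which match no open element. The names are
illustrative.

```python
from html.parser import HTMLParser

BLOCK = set(
    "p dl div center noscript noframes blockquote form isindex hr table "
    "fieldset address h1 h2 h3 h4 h5 h6 ul ol dir menu pre".split()
)
# Elements with no content, which are never left open on the stack.
EMPTY = {"br", "hr", "img", "input", "isindex", "basefont", "area", "col",
         "param", "meta", "link", "base"}


class PNesting(HTMLParser):
    """Track open elements, implying </p> before a block start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []  # list of (tag, line number)

    def handle_starttag(self, tag, attrs):
        # A block start tag implies </p> if P is the innermost open element.
        if tag in BLOCK and self.stack and self.stack[-1] == "p":
            self.stack.pop()
        if tag not in EMPTY:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close the element, along with anything still open inside it.
            while self.stack.pop() != tag:
                pass
        else:
            self.errors.append((tag, self.getpos()[0]))


def unmatched_end_tags(html):
    """Return (tag, line) pairs for end tags with no matching open element."""
    checker = PNesting()
    checker.feed(html)
    checker.close()
    return checker.errors


print(unmatched_end_tags("<p>intro<div>code</div>outro</p>"))
print(unmatched_end_tags("<p>intro</p><div>code</div><p>outro</p>"))
```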

5. HTML, HEAD, BODY, and TBODY allow the start tag to be omitted. With
the exception of TBODY, this feature shouldn't be used (inferring an
omitted start tag is a nuisance for a parser to implement if the
number of valid child tags is large).

--
Glynn Clements <glynn@gclements.plus.com>