PROOFREADER -- An Interim English Text Proofreader




A collection  of subsystems  and data bases  for doing  proofreading of
English text on Altos is currently under construction. These subsystems
will hardly  venture into the  semantic gulf at  all. They  will simply
read a piece of text and decide whether each string of letters  in that
text is an English word. For the near future they will make  no attempt
to correct a non-word into a  word, but will simply bring all  the non-
words to the user's attention.

At  the  moment, the  collection  consists of  three  programs  and one
massive data base. One program, BTREETEST, is a data base interrogation
and maintenance program which operates at about level 1 (the author can
run it, and you might be able to learn to run it too). Another program,
PROOFREADER, actually uses the data base to proofread a piece  of text.
Since it only requires one parameter and operates non-interactively, we
could probably call it  a level 2 program  (you could run it  with ease
the first time) even now. The third program, SORTDICT,  sorts MW-format
files (of which Appendix B is  one) alphabetically. It is of no  use to
BTREETEST or PROOFREADER, but may be of use to you. The data base  is a
giant  Alto  file  (called  PAGING.PG)  containing  an  encoded English
dictionary with  about 35000 English  words. It corresponds  roughly to
the New Merriam-Webster Pocket Dictionary, except that  the definitions
have been left out, and a  number of technical terms and names  and the
like  have been  added.  The programs  and data  base  together consume
something like 1500 pages of Alto disk space.

The programs will now be presented in greater detail, and the format of
a dictionary entry set forth so  that if you have a few  favorite words
of your  very own, you  can add  them to your  copy of  the dictionary.
L'Academie Parc reserves the right  to control the content of  the main
dictionary, but what you add to (or delete from) your own copy  is your
business.



1. PROOFREADER


The program is invoked by

PROOFREADER TEXTFILE OUTPUTFILE DICTFILE

where  OUTPUTFILE  currently  defaults  to  "QSPELL.TX"   and  DICTFILE
currently defaults to "PAGING.PG".  It reads through the text  file and
associates the characters together into atoms (roughly corresponding to
normal word  boundaries) which  are looked up  in the  dictionary file.
Some words are not present in the dictionary because they are inflected
forms of words which are present: e.g. "dogs". If the program  fails to
find an  exact match  in the dictionary,  it then  investigates various
ways in  which the  atom might have  been formed  by inflecting  a word
which is in the dictionary. Atoms known to be correctly-spelled English
words, as well as atoms known not to be English words, are stored  in a
hashed LRU cache to speed the look-up process.
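
The caching of both words and non-words described above might be
sketched roughly as follows. This is a small Python illustration, not
the program's own code; the names VerdictCache and check_atom, the
capacity of 4, and the three-word stand-in dictionary are all
assumptions made for the example.

```python
from collections import OrderedDict


class VerdictCache:
    """Small LRU cache mapping an atom to True (word) or False (non-word)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()    # least recently used entry first

    def get(self, atom):
        if atom not in self.entries:
            return None                 # no verdict cached for this atom
        self.entries.move_to_end(atom)  # mark as most recently used
        return self.entries[atom]

    def put(self, atom, verdict):
        self.entries[atom] = verdict
        self.entries.move_to_end(atom)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def check_atom(atom, dictionary, cache, lookups):
    """Return True if atom is in dictionary, consulting the cache first.

    `lookups` records each real dictionary probe (a stand-in for a
    disk access), so the effect of the cache can be observed.
    """
    verdict = cache.get(atom)
    if verdict is None:
        lookups.append(atom)
        verdict = atom in dictionary
        cache.put(atom, verdict)        # cache words and non-words alike
    return verdict


dictionary = {"the", "dog", "barks"}    # hypothetical stand-in for PAGING.PG
cache = VerdictCache()
lookups = []
results = [check_atom(a, dictionary, cache, lookups)
           for a in ["the", "dgo", "the", "dgo", "barks"]]
```

In this run only three of the five atoms reach the dictionary; the
repeated word and the repeated non-word are both answered from the
cache.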

                             ------------
                   Copyright Xerox Corporation 1979


PROOFREADER               September 10, 1975                          2




If words include embedded punctuation or spaces ("e.g.", "a  la carte")
then the original  decisions about where to  break the text  into atoms
may have been in error. If the surrounding context is contained  in the
dictionary, then the word is considered correctly spelled. For example,
the dictionary contains the following entries:

        "e: #.g."
        ".g: e#."
        "a: # la carte"
        "la: a # carte"
        "carte: a la #"

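One way to read the entries above is that "#" marks where the atom
itself sits within its required surroundings. A minimal Python sketch
of that check follows. The function name context_matches is
hypothetical, and the sketch assumes the text's spacing matches the
pattern exactly, which the real program may well not require.

```python
def context_matches(text, start, end, pattern):
    """True if the atom text[start:end] sits in the context given by pattern.

    The pattern uses "#" for the atom itself, e.g. "a # carte" for "la".
    """
    before, after = pattern.split("#", 1)
    left_ok = (start >= len(before)
               and text[start - len(before):start] == before)
    right_ok = text[end:end + len(after)] == after
    return left_ok and right_ok


text = "We dined a la carte, e.g. on soup."
la = text.index("la")
eg = text.index("e.g.")
la_ok = context_matches(text, la, la + 2, "a # carte")
e_ok = context_matches(text, eg, eg + 1, "#.g.")
stray = context_matches("la la land", 0, 2, "a # carte")
```

Here la_ok and e_ok come out true, while the stray "la" in the last
call is rejected because nothing precedes it.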

At the end of processing (or  perhaps earlier in a very long  text file
with many  questionably-spelled words) the  program produces a  list of
questionably-spelled words on OUTPUTFILE. The list is ordered according
to first  occurrence, and following  each word is  the number  of times
(since the word was last mentioned, if ever) that the word  appeared in
the text. For purposes of the output listing, two words  are considered
the same only if their capitalizations and spellings are identical.
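
The ordering and counting just described can be sketched in a few lines
of Python. The function name questionable_report and the is_word
callback are hypothetical, and the sketch ignores the early flushing of
the list that may happen in a very long text file.

```python
def questionable_report(atoms, is_word):
    """List (atom, count) pairs for non-words, in order of first occurrence.

    Atoms are compared exactly, so "teh" and "Teh" are counted separately.
    """
    counts = {}                     # Python dicts preserve insertion order
    for atom in atoms:
        if not is_word(atom):
            counts[atom] = counts.get(atom, 0) + 1
    return list(counts.items())


report = questionable_report(
    ["teh", "cat", "Teh", "teh", "zzz"],
    lambda atom: atom == "cat",     # stand-in for the dictionary look-up
)
```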

It seems to be difficult to take advantage of the supposed fact that in
English, only proper nouns and the first words of sentences or sentence
fragments  are capitalized.  In fact,  at least  in the  Parc documents
proofread  thus far,  the general  practice seems  to be  to capitalize
anything that  needs emphasis. Issues  of capitalization  are currently
being punted by providing two switches:

        /A says that any atom which begins with a capital letter
                is by definition correctly spelled. This avoids
                having personal names and places flagged as
                questionably spelled. It also means that the first
                word of each sentence is not checked.

        /C says that an atom which begins with a capital letter
                must have the attribute "proper noun" in the
                dictionary, unless it begins a sentence
                (immediately follows a ".", "?", or "!").


The default case is that  words which begin with small letters  must be
something other than a proper noun, but everything else is acceptable.
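
The two switches and the default case can be summarized as a small
decision function. This Python sketch is one reading of the rules
above, not the program's code: the name capitalization_ok is
hypothetical, `classes` stands for the set of attributes the dictionary
holds for the atom (empty if it was not found), and the sketch assumes
that "everything else is acceptable" means a capitalized atom need only
be present in the dictionary under some attribute.

```python
def capitalization_ok(atom, classes, begins_sentence, switch=None):
    """Decide whether an atom passes, given its dictionary attributes.

    classes: set of attributes found for the atom, e.g. {"N", "NPR"}.
    switch:  "A", "C", or None for the default case.
    """
    capitalized = atom[:1].isupper()
    if capitalized:
        if switch == "A":
            return True                  # /A: accepted without checking
        if switch == "C" and not begins_sentence:
            return "NPR" in classes      # /C: must be a proper noun
        return bool(classes)             # otherwise any entry will do
    # Small initial letter: need some attribute other than proper noun.
    return bool(classes - {"NPR"})
```

For example, with /C the atom "Dog" in mid-sentence fails if the
dictionary knows "dog" only as a common noun, but passes at the start
of a sentence.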

For  your added  proofreading pleasure,  a modification  of  USER.CM is
available  which  includes a  P  quit switch  for  Bravo,  which causes
PROOFREADER to be invoked on the file whose name is in Buffer  3, after
which Bravo is  re-invoked with two  windows, one for  the questionable
spellings and one for the original file.



2. BTREETEST


The dictionary  is stored  on the  disk in  the form  of a  B-Tree with
variable-length records (for background on B-Trees, see Knuth,  vol. 3,
pp. 473-479).  Each dictionary entry  is in the  form of  a BCPL-format
string followed by one 16-bit word which encodes the inflection classes
to which the  word belongs. The program  BTREETEST is a  rather fragile
dictionary maintenance program  built on top  of some very  nice B-Tree
maintenance  subroutines which  I shall  be glad  to release  if anyone
asks.  BTREETEST has  an interactive  command structure,  including the
following commands:

N(ame of paging file: ) FILENAME
        Allows the user to specify a dictionary file other than
        PAGING.PG. May be done only when dictionary file is closed.
        Do you really want to do this?

I(nitialize and open tree...)
        Creates and opens an empty tree, which can subsequently be
        added to by various commands. Do you really want to do
        this?

O(pen pre-initialized tree...)
        Opens the dictionary file, reads buffers into memory.

M(erriam-Webster input from file: ) FILENAME
        Reads MW-format entries from FILENAME, and modifies the
        dictionary file accordingly. More about MW format in
        section 3.

F(ind keys: ) KEY
        Displays the greatest key in the dictionary less than or
        equal to the one you typed. LF or CR advances to the next
        key ad nauseam. N or DEL terminates the command. The
        "Info" field may be decoded using Appendix A if you
        persevere.

S(how page: #) OCTALNUMBER
        This enables one to trace out the actual structure of
        the B-Tree. Starting with the page number printed out
        by the Open, one can wend his way down through the tree.
        Notice, by the way, that the keys in the root of the tree
        seem to be considerably shorter than those in the leaves.
        This is no accident. I plan to write a paper about it when
        I get a chance.

C(lose tree...)
        Closes the dictionary file, writes out dirty pages.
        Extremely important to do if you have modified the
        dictionary.

Q(uit)
        Return to the command processor.
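
The behavior of the F(ind keys) command, which starts at the greatest
key less than or equal to the one typed and then steps forward, can be
imitated on an ordinary sorted list. In this Python sketch the list
stands in for the B-Tree, the name find_from is hypothetical, plain
string ordering is assumed, and starting at the first key when the
probe precedes every key is also an assumption.

```python
from bisect import bisect_right


def find_from(keys, probe):
    """Yield keys starting at the greatest key <= probe, then onward."""
    i = bisect_right(keys, probe) - 1   # index of greatest key <= probe
    if i < 0:
        i = 0                           # assumption: probe precedes all keys
    yield from keys[i:]


keys = ["bias", "binary", "binder", "byte"]   # must already be sorted
near = list(find_from(keys, "bind"))
exact = list(find_from(keys, "binder"))
```

Here the probe "bind" is not a key, so the listing starts at "binary";
the probe "binder" is a key, so the listing starts there.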


In the all-too-likely event that  something goes wrong and you  want to
quit and start  over, it is very  important to close the  tree (thereby
writing out dirty pages) if you have modified it at all. If you are not
in a position to do this, you can call BTREETEST from SWAT, then  say C
Q, and finally control-K to SWAT.



3. MW Format


This is the format of the files from which dictionary entries are built
and modified.  Each such file  consists of a  sequence of  entries, not
necessarily in alphabetical order. Changes are made in sequence order.

An entry consists of the following:

[WORD ATR1 VAL1 ATR2 VAL2 ... ATRn VALn ]

where no space is allowed between the left bracket and the word, and
there must be at least one  space between the last value and  the right
bracket. Blanks within the  word are represented by the  character "@".
The dictionary is  used by several groups,  and many of  the attributes
which are  useful in other  applications are ignored  in this  one. The
following attributes and values are used by this application:

        DELETE *        Delete this dictionary entry

        N       S       This is a common noun which pluralizes
                        by adding "s". (tool, toy)

        N       ES      This is a common noun which pluralizes
                        by changing a final "y", if any, to "i"
                        and adding "es". (church, flunky)

        N       *       This is a common noun which is irregular,
                        or normally spoken of in the plural,
                        or whose plural doesn't make sense.
                        (man, men, analyses)

        N       FALSE   Any common noun meanings for this word
                        should be deleted.

        V       S-ED    This is a verb which adds "s" and "ed"
                        and "ing". (talk, fool)

        V       S-D     This is a verb which adds "s" and "d" and
                        drops an "e" before adding "ing".
                        (use, revise)

        V       S-#ED   This is a verb which adds "s" and doubles
                        its final consonant before adding "ed" or
                        "ing". (clap, trot)

        V       ES-ED   This is a verb which adds "es" and "ed"
                        and "ing" (box, crouch)

        V       *       This is an irregular verb form.

        V       FALSE   Any verb meanings for this word
                        should be deleted.

        ADJ     R-ST    This is an adjective which gets stronger
                        by adding "r" or "st". (wide)

        ADJ     ER-EST  This is an adjective which gets stronger
                        by adding "er" or "est". (tall)

        ADJ     *       This is an adjective which doesn't get
                        stronger (joyful).
        ADJ     FALSE   Any adjective meanings for this word
                        should be deleted.

        ADV     ER-EST  This is an adverb which gets stronger
                        by adding "er" or "est". (fast)

        ADV     *       This is an adverb which doesn't get
                        stronger (joyfully).

        ADV     FALSE   Any adverb meanings for this word
                        should be deleted.

        NPR     S       This is a proper noun which pluralizes
                        by adding "s". (Alto)

        NPR     ES      This is a proper noun which pluralizes
                        by adding "es". (Jones)

        NPR     *       This is a proper noun which is irregular,
                        or normally spoken of in the plural,
                        or whose plural doesn't make sense.

        NPR     FALSE   Any proper noun meanings for this word
                        should be deleted.

        NPR     (SIC S) This is a proper noun which is capitalized
                        exactly as shown and which pluralizes
                        by adding "s". (OISystem)

        NPR    (SIC ES) This is a proper noun which is capitalized
                        exactly as shown and which pluralizes
                        by adding "es".

        NPR     (SIC *) This is a proper noun which is capitalized
                        exactly as shown and which is irregular,
                        or normally spoken of in the plural,
                        or whose plural doesn't make sense.

        NPR (SIC FALSE) Any proper noun meanings for this word,
                        capitalized exactly as shown,
                        should be deleted.

        COMP    *       These are all other words which do not
        CONJ    *       inflect. My data structure lumps them
        DET     *       all together as "OTHERPART *"
        NUMBER  *
        ORD     *
        PREP    *
        PRO     *
        PUNCT   *
        QUANT   *
        PREFIX  *
        INTJ    *
        SPECIAL *

        OTHERPART FALSE Any "other part" meanings for this
                        word should be deleted.
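
The regular inflection classes in the table above can be turned into a
small generator of inflected forms. PROOFREADER itself works in the
other direction, stripping an inflection and looking up what is left,
so this Python sketch only illustrates what each class means. The
function name inflections is hypothetical, and the (SIC ...) values are
left out.

```python
def inflections(word, attr, value):
    """Return the inflected forms implied by one attribute/value pair."""
    if attr == "N" and value == "ES":
        # Change a final "y", if any, to "i" before adding "es".
        stem = word[:-1] + "i" if word.endswith("y") else word
        return [stem + "es"]
    if attr in ("N", "NPR") and value == "S":
        return [word + "s"]
    if attr == "NPR" and value == "ES":
        return [word + "es"]
    if attr == "V":
        if value == "S-ED":
            return [word + "s", word + "ed", word + "ing"]
        if value == "S-D":
            # Drop a final "e" before adding "ing".
            stem = word[:-1] if word.endswith("e") else word
            return [word + "s", word + "d", stem + "ing"]
        if value == "S-#ED":
            doubled = word + word[-1]   # double the final consonant
            return [word + "s", doubled + "ed", doubled + "ing"]
        if value == "ES-ED":
            return [word + "es", word + "ed", word + "ing"]
    if attr == "ADJ" and value == "R-ST":
        return [word + "r", word + "st"]
    if attr in ("ADJ", "ADV") and value == "ER-EST":
        return [word + "er", word + "est"]
    return []                           # "*" and FALSE add no regular forms
```

For instance, inflections("flunky", "N", "ES") gives ["flunkies"] and
inflections("clap", "V", "S-#ED") gives ["claps", "clapped",
"clapping"].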


Included for  your amusement as  Appendix B is  a page  randomly chosen
from a file  called XEROXWORDS.DICT. It  is sorted not  for BTREETEST's
convenience, but for the reader's.

By the way,  it would be appreciated  if you would keep  separate .DICT
files for names, for technical jargon, and for English words  which are
just   missing  from   the   dictionary.  This   would   facilitate  my
incorporating these words into the main dictionary later. A  great many
in- and un- and re- and -able and -ly, etc., words are missing from the
dictionary. Techniques similar to those of Kaplan and Kay may  later be
employed to disassemble prefixes  and suffixes, in order to  reduce the
necessary size  of the dictionary  and accommodate the  productivity of
English.



4. SORTDICT


This subsystem  sorts MW-format files  alphabetically (Appendix B  is a
fragment of an MW-format file). This can be useful if you are trying to
eliminate  duplicate entries  or want  to print  them.  BTREETEST could
hardly care less whether its  MW-format files are sorted. To  sort, you
say

SORTDICT INPUTFILE OUTPUTFILE

INPUTFILE and OUTPUTFILE can be the same.
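
What such a sort does can be sketched in a few lines of Python. This is
not SORTDICT's code. The function name sort_mw is hypothetical, the
sketch assumes that no entry contains a right bracket other than its
closing one, and the case-insensitive ordering is inferred from
Appendix B (where "Bravo" falls between "boule" and "breadboard").

```python
import re


def sort_mw(text):
    """Return the MW-format entries in text, sorted by their words."""
    entries = re.findall(r"\[[^\]]*\]", text)   # one match per [ ... ] entry

    def word_of(entry):
        return entry[1:].split()[0].lower()     # first token after "["

    return "\n".join(sorted(entries, key=word_of)) + "\n"


unsorted = (
    "[byte\n        N S ]\n"
    "[Bravo\n        NPR * ]\n"
    "[bias\n        N ES\n        V ES-ED ]\n"
)
tidy = sort_mw(unsorted)
```

After the call, tidy holds the three entries in the order bias, Bravo,
byte, each with its attribute lines intact.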



5. How to Get All This Wonderful Stuff


The file  <ALTO>PROOFREADER.DM contains  PROOFREADER and  BTREETEST and
SORTDICT and their  symbol files. The dictionary  tree file is  in mode
binary  on <MCCREIGHT>PARCENGLISH.TREE.  You should  read it  into your
Alto as  PAGING.PG. If  you are just  interested in  trying it  out and
don't feel like using 1500 pages on your disk to do it, a model 31 disk
with  all the  goodies on  it is  available, which  you are  welcome to
borrow or copy.


Appendix A


structure DE:   // Dictionary Entry

        [ Key:  [ Length byte   // Number of bytes in key
                Char^1,1 byte
                ]
        Info word       // The coding
        ]


//      Coding of parts of speech in the Info field

manifest
        [
        ImproperNoun = #3
        NS = #1
        NEs = #2
        NOther = #3

        Verb = #34
        VSEd = #4
        VEsEd = #10
        VSXed = #14
        VSD = #20
        VOther = #24

        Adj = #140
        AjRSt = #40
        AjErEst = #100
        AjOther = #140

        Adv = #600
        AvErEst = #200
        AvOther = #400

        ProperNoun = #3000
        NPS = #1000
        NPEs = #2000
        NPOther = #3000

        SicNoun = #14000
        NSS = #4000
        NSEs = #10000
        NSOther = #14000

        OtherPart = #20000


        AnyNZValue = #177777
        ]
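
If you would rather not decode the "Info" field by hand, the constants
above can be applied mechanically. The Python sketch below treats each
group's first constant as a bit mask and the rest as the values of that
field, with "#" read as octal. The function name decode_info is
hypothetical, and the pairing of each constant with an MW attribute and
value (for example VSXed with "V S-#ED") is inferred from the names
rather than stated anywhere in this document.

```python
FIELDS = [
    # (mask, {field value: MW attribute and value})
    (0o3,     {0o1: "N S", 0o2: "N ES", 0o3: "N *"}),
    (0o34,    {0o4: "V S-ED", 0o10: "V ES-ED", 0o14: "V S-#ED",
               0o20: "V S-D", 0o24: "V *"}),
    (0o140,   {0o40: "ADJ R-ST", 0o100: "ADJ ER-EST", 0o140: "ADJ *"}),
    (0o600,   {0o200: "ADV ER-EST", 0o400: "ADV *"}),
    (0o3000,  {0o1000: "NPR S", 0o2000: "NPR ES", 0o3000: "NPR *"}),
    (0o14000, {0o4000: "NPR (SIC S)", 0o10000: "NPR (SIC ES)",
               0o14000: "NPR (SIC *)"}),
    (0o20000, {0o20000: "OTHERPART *"}),
]


def decode_info(info):
    """Return the attribute/value labels encoded in a 16-bit Info word."""
    labels = []
    for mask, values in FIELDS:
        code = info & mask              # isolate this field's bits
        if code:
            labels.append(values.get(code, "unknown code #%o" % code))
    return labels
```

Under these assumptions an Info word of #5 decodes to ["N S",
"V S-ED"], which is what an entry such as "breadboard" in Appendix B
would carry.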


Appendix B


[BCPL
        NPR (SIC *) ]
[became
        V * ]
[benchmark
        N S ]
[bias
        N ES
        V ES-ED ]
[biaxial
        ADJ * ]
[bibliog.
        N *
        SUBSTITUTE ((bibliography)) ]
[binary
        ADJ * ]
[binder
        N S ]
[bipolar
        ADJ * ]
[bode
        V S-D ]
[boded
        DELETE * ]
[bootstrap
        V S-#ED ]
[boule
        N S ]
[Bravo
        NPR * ]
[breadboard
        N S
        V S-ED ]
[broad
        ADJ ER-EST ]
[brush
        N ES
        V ES-ED ]
[buffer
        V S-ED ]
[Burroughs
        NPR * ]
[byte
        N S ]
[ca.
        ADJ *
        SUBSTITUTE ((circa)) ]
[CACM
        NPR (SIC *)
        SUBSTITUTE ((communications of the association for
                                computing machinery)) ]
[callee
        N S ]
[capacitance
        N S ]
[capacitor
        N S ]
[CCD
        NPR (SIC S)
        SUBSTITUTE ((charge-coupled device)) ]
[CDC
        NPR (SIC *)
        SUBSTITUTE ((control data corporation)) ]
[cei
        DELETE * ]
[cf.
        V *
        SUBSTITUTE ((compare)) ]
[ch.
        N *
        SUBSTITUTE ((chapter)) ]
[checkout
        N *
        ADJ * ]
[checksum
        N S ]
[cholesteric
        ADJ * ]
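
For anyone who wants to process entries like these outside BTREETEST,
the following Python sketch reads one entry into its word and its
attribute/value pairs. It is an illustration only: the names
parse_entry and tokenize are hypothetical, and treating a
parenthesized value such as (SIC *) or ((circa)) as a single value is
an assumption based on the entries shown here.

```python
def tokenize(text):
    """Split on whitespace, keeping a parenthesized group as one token."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i] == "(":
            depth, j = 0, i
            while j < len(text):        # scan to the matching ")"
                depth += text[j] == "("
                depth -= text[j] == ")"
                j += 1
                if depth == 0:
                    break
            tokens.append(" ".join(text[i:j].split()))  # tidy inner spacing
            i = j
        else:
            j = i
            while j < len(text) and not text[j].isspace():
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


def parse_entry(entry):
    """Parse "[WORD ATR1 VAL1 ... ]" into (word, [(attr, value), ...])."""
    body = entry.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("entry must be enclosed in brackets")
    tokens = tokenize(body[1:-1])
    word = tokens[0].replace("@", " ")  # "@" stands for a blank in the word
    rest = tokens[1:]
    if len(rest) % 2:
        raise ValueError("attributes and values must come in pairs")
    return word, list(zip(rest[0::2], rest[1::2]))


cacm = parse_entry(
    "[CACM\n"
    "        NPR (SIC *)\n"
    "        SUBSTITUTE ((communications of the association for\n"
    "                                computing machinery)) ]"
)
```

The pairs are kept as a list rather than a table so that an entry
naming the same attribute twice would lose nothing.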