PROOFREADER -- An Interim English Text Proofreader

A collection of subsystems and data bases for doing proofreading of English text on Altos is currently under construction. These subsystems will hardly venture into the semantic gulf at all. They will simply read a piece of text and decide whether each string of letters in that text is an English word. For the near future they will make no attempt to correct a non-word into a word, but will simply bring all the non-words to the user's attention.

At the moment, the collection consists of three programs and one massive data base. One program, BTREETEST, is a data base interrogation and maintenance program which operates about at level 1 (the author can run it, and you might be able to learn to run it too). Another program, PROOFREADER, actually uses the data base to proofread a piece of text. Since it only requires one parameter and operates non-interactively, we could probably call it a level 2 program (you could run it with ease the first time) even now. The third program, SORTDICT, sorts MW-format files (of which appendix B is one) alphabetically. It is of no use to BTREETEST or PROOFREADER, but may be of use to you.

The data base is a giant Alto file (called PAGING.PG) containing an encoded English dictionary with about 35000 English words. It corresponds roughly to the New Merriam-Webster Pocket Dictionary, except that the definitions have been left out, and a number of technical terms and names and the like have been added. The programs and data base together consume something like 1500 pages of Alto disk space.

The programs will now be presented in greater detail, and the format of a dictionary entry set forth so that if you have a few favorite words of your very own, you can add them to your copy of the dictionary. L'Academie Parc reserves the right to control the content of the main dictionary, but what you add to (or delete from) your own copy is your business.

1. 
PROOFREADER

The program is invoked by

    PROOFREADER TEXTFILE OUTPUTFILE DICTFILE

where OUTPUTFILE currently defaults to "QSPELL.TX" and DICTFILE currently defaults to "PAGING.PG". It reads through the text file and associates the characters together into atoms (roughly corresponding to normal word boundaries) which are looked up in the dictionary file. Some words are not present in the dictionary because they are inflected forms of words which are present: e.g. "dogs". If the program fails to find an exact match in the dictionary, it then investigates various ways in which the atom might have been formed by inflecting a word which is in the dictionary. Atoms known to be correctly-spelled English words, as well as atoms known not to be English words, are stored in a hashed LRU cache to speed the look-up process.

------------
Copyright Xerox Corporation 1979

PROOFREADER September 10, 1975

If words include embedded punctuation or spaces ("e.g.", "a la carte") then the original decisions about where to break the text into atoms may have been in error. If the surrounding context is contained in the dictionary, then the word is considered correctly spelled. For example, the dictionary contains the following entries:

    "e: #.g."
    ".g: e#."
    "a: # la carte"
    "la: a # carte"
    "carte: a la #"

At the end of processing (or perhaps earlier in a very long text file with many questionably-spelled words) the program produces a list of questionably-spelled words on OUTPUTFILE. The list is ordered according to first occurrence, and following each word is the number of times (since the word was last mentioned, if ever) that the word appeared in the text. For purposes of the output listing, two words are considered the same only if their capitalizations and spellings are identical.

It seems to be difficult to take advantage of the supposed fact that in English, only proper nouns and the first words of sentences or sentence fragments are capitalized.
In fact, at least in the Parc documents proofread thus far, the general practice seems to be to capitalize anything that needs emphasis. Issues of capitalization are currently being punted by providing two switches:

    /A  says that any atom which begins with a capital letter is by definition correctly spelled. This avoids having personal names and places flagged as questionably spelled. It also means that the first word of each sentence is not checked.

    /C  says that an atom which begins with a capital letter must have the attribute "proper noun" in the dictionary, unless it begins a sentence (immediately follows a ".", "?", or "!").

The default case is that words which begin with small letters must be something other than a proper noun, but everything else is acceptable.

For your added proofreading pleasure, a modification of USER.CM is available which includes a P quit switch for Bravo, which causes PROOFREADER to be invoked on the file whose name is in Buffer 3, after which Bravo is re-invoked with two windows, one for the questionable spellings and one for the original file.

2. BTREETEST

The dictionary is stored on the disk in the form of a B-Tree with variable-length records (for background on B-Trees, see Knuth, vol. 3, pp. 473-479). Each dictionary entry is in the form of a BCPL-format string followed by one 16-bit word which encodes the inflection classes to which the word belongs. The program BTREETEST is a rather fragile dictionary maintenance program built on top of some very nice B-Tree maintenance subroutines which I shall be glad to release if anyone asks. BTREETEST has an interactive command structure, including the following commands:

N(ame of paging file: ) FILENAME
    Allows the user to specify a dictionary file other than PAGING.PG. May be done only when dictionary file is closed. Do you really want to do this?

I(nitialize and open tree...)
    Creates and opens an empty tree, which can subsequently be added to by various commands. Do you really want to do this?

O(pen pre-initialized tree...)
    Opens the dictionary file, reads buffers into memory.

M(erriam-Webster input from file: ) FILENAME
    Reads MW-format entries from FILENAME, and modifies the dictionary file accordingly. More about MW format in section 3.

F(ind keys: ) KEY
    Displays the greatest key in the dictionary less than or equal to the one you typed. LF or CR advances to the next key ad nauseam. N or DEL terminates the command. The "Info" field may be decoded using Appendix A if you persevere.

S(how page: #) OCTALNUMBER
    This enables one to trace out the actual structure of the B-Tree. Starting with the page number printed out by the Open, one can wend his way down through the tree. Notice, by the way, that the keys in the root of the tree seem to be considerably shorter than those in the leaves. This is no accident. I plan to write a paper about it when I get a chance.

C(lose tree...)
    Closes the dictionary file, writes out dirty pages. Extremely important to do if you have modified the dictionary.

Q(uit)
    Return to the command processor.

In the all-too-likely event that something goes wrong and you want to quit and start over, it is very important to close the tree (thereby writing out dirty pages) if you have modified it at all. If you are not in a position to do this, you can call BTREETEST from SWAT, then say C Q, and finally control-K to SWAT.

3. MW Format

This is the format of the files from which dictionary entries are built and modified. Each such file consists of a sequence of entries, not necessarily in alphabetical order. Changes are made in sequence order. An entry consists of the following:

    [WORD ATR1 VAL1 ATR2 VAL2 ... ATRn VALn ]

where there may be no space between the left bracket and the word, and there must be at least one space between the last value and the right bracket.
Blanks within the word are represented by the character "@".

The dictionary is used by several groups, and many of the attributes which are useful in other applications are ignored in this one. The following attributes and values are used by this application:

DELETE *          Delete this dictionary entry.
N S               This is a common noun which pluralizes by adding "s". (tool, toy)
N ES              This is a common noun which pluralizes by changing a final "y", if any, to "i" and adding "es". (church, flunky)
N *               This is a common noun which is irregular, or normally spoken of in the plural, or whose plural doesn't make sense. (man, men, analyses)
N FALSE           Any common noun meanings for this word should be deleted.
V S-ED            This is a verb which adds "s" and "ed" and "ing". (talk, fool)
V S-D             This is a verb which adds "s" and "d" and drops an "e" before adding "ing". (use, revise)
V S-#ED           This is a verb which adds "s" and doubles its final consonant before adding "ed" or "ing". (clap, trot)
V ES-ED           This is a verb which adds "es" and "ed" and "ing". (box, crouch)
V *               This is an irregular verb form.
V FALSE           Any verb meanings for this word should be deleted.
ADJ R-ST          This is an adjective which gets stronger by adding "r" or "st". (wide)
ADJ ER-EST        This is an adjective which gets stronger by adding "er" or "est". (tall)
ADJ *             This is an adjective which doesn't get stronger. (joyful)
ADJ FALSE         Any adjective meanings for this word should be deleted.
ADV ER-EST        This is an adverb which gets stronger by adding "er" or "est". (fast)
ADV *             This is an adverb which doesn't get stronger. (joyfully)
ADV FALSE         Any adverb meanings for this word should be deleted.
NPR S             This is a proper noun which pluralizes by adding "s". (Alto)
NPR ES            This is a proper noun which pluralizes by adding "es". (Jones)
NPR *             This is a proper noun which is irregular, or normally spoken of in the plural, or whose plural doesn't make sense.
NPR FALSE         Any proper noun meanings for this word should be deleted.
NPR (SIC S)       This is a proper noun which is capitalized exactly as shown and which pluralizes by adding "s". (OISystem)
NPR (SIC ES)      This is a proper noun which is capitalized exactly as shown and which pluralizes by adding "es".
NPR (SIC *)       This is a proper noun which is capitalized exactly as shown and which is irregular, or normally spoken of in the plural, or whose plural doesn't make sense.
NPR (SIC FALSE)   Any proper noun meanings for this word, capitalized exactly as shown, should be deleted.
COMP *            These are all other words which do not
CONJ *            inflect. My data structure lumps them
DET *             all together as "OTHERPART *".
NUMBER *
ORD *
PREP *
PRO *
PUNCT *
QUANT *
PREFIX *
INTJ *
SPECIAL *
OTHERPART FALSE   Any "other part" meanings for this word should be deleted.

Included for your amusement as Appendix B is a page randomly chosen from a file called XEROXWORDS.DICT. It is sorted not for BTREETEST's convenience, but for the reader's. By the way, it would be appreciated if you would keep separate .DICT files for names, for technical jargon, and for English words which are just missing from the dictionary. This would facilitate my incorporating these words into the main dictionary later.

A great many in- and un- and re- and -able and -ly, etc., words are missing from the dictionary. Techniques similar to those of Kaplan and Kay may later be employed to disassemble prefixes and suffixes, in order to reduce the necessary size of the dictionary and accommodate the productivity of English.

4. SORTDICT

This subsystem sorts MW-format files alphabetically (appendix B is a fragment of an MW-format file). This can be useful if you are trying to eliminate duplicate entries or want to print them. BTREETEST could hardly care less whether its MW-format files are sorted. To sort, you say

    SORTDICT INPUTFILE OUTPUTFILE

INPUTFILE and OUTPUTFILE can be the same.

5. 
How to Get All This Wonderful Stuff

The file <ALTO>PROOFREADER.DM contains PROOFREADER and BTREETEST and SORTDICT and their symbol files. The dictionary tree file is in mode binary on <MCCREIGHT>PARCENGLISH.TREE. You should read it into your Alto as PAGING.PG. If you are just interested in trying it out and don't feel like using 1500 pages on your disk to do it, a model 31 disk with all the goodies on it is available, which you are welcome to borrow or copy.

Appendix A

structure DE:                 // Dictionary Entry
  [
  Key:
    [
    Length byte               // Number of bytes in key
    1 Char ,1 byte
    ]
  Info word                   // The coding
  ]

// Coding of parts of speech in the Info field
manifest
  [
  ImproperNoun = #3
    NS = #1
    NEs = #2
    NOther = #3
  Verb = #34
    VSEd = #4
    VEsEd = #10
    VSXed = #14
    VSD = #20
    VOther = #24
  Adj = #140
    AjRSt = #40
    AjErEst = #100
    AjOther = #140
  Adv = #600
    AvErEst = #200
    AvOther = #400
  ProperNoun = #3000
    NPS = #1000
    NPEs = #2000
    NPOther = #3000
  SicNoun = #14000
    NSS = #4000
    NSEs = #10000
    NSOther = #14000
  OtherPart = #20000
  AnyNZValue = #177777
  ]

Appendix B

[BCPL NPR (SIC *) ]
[became V * ]
[benchmark N S ]
[bias N ES V ES-ED ]
[biaxial ADJ * ]
[bibliog. N * SUBSTITUTE ((bibliography)) ]
[binary ADJ * ]
[binder N S ]
[bipolar ADJ * ]
[bode V S-D ]
[boded DELETE * ]
[bootstrap V S-#ED ]
[boule N S ]
[Bravo NPR * ]
[breadboard N S V S-ED ]
[broad ADJ ER-EST ]
[brush N ES V ES-ED ]
[buffer V S-ED ]
[Burroughs NPR * ]
[byte N S ]
[ca. ADJ * SUBSTITUTE ((circa)) ]
[CACM NPR (SIC *) SUBSTITUTE ((communications of the association for computing machinery)) ]
[callee N S ]
[capacitance N S ]
[capacitor N S ]
[CCD NPR (SIC S) SUBSTITUTE ((charge-coupled device)) ]
[CDC NPR (SIC *) SUBSTITUTE ((control data corporation)) ]
[cei DELETE * ]
[cf. V * SUBSTITUTE ((compare)) ]
[ch. N * SUBSTITUTE ((chapter)) ]
[checkout N * ADJ * ]
[checksum N S ]
[cholesteric ADJ * ]
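
For readers who would like to experiment with entries like those in Appendix B, here is a minimal sketch, in Python rather than the BCPL of the programs themselves, of how MW-format text might be read and turned into the Info word of Appendix A. It is not the code BTREETEST uses. The names tokenize, parse_mw, info_word, FIELDS and SAMPLE are hypothetical. The octal values are copied from Appendix A, but the pairing of MW values with those constants is inferred from their names, and the rule that a later value for an attribute replaces an earlier one (with FALSE clearing the field) is an assumption. The entry [a@la@carte ADV * ] is made up to show the "@" convention.

```python
import re

# (mask, {MW value: code}) per attribute. Octal values are from Appendix A;
# which MW value maps to which constant is inferred from the names (assumption).
FIELDS = {
    "N":   (0o3,     {"S": 0o1, "ES": 0o2, "*": 0o3}),
    "V":   (0o34,    {"S-ED": 0o4, "ES-ED": 0o10, "S-#ED": 0o14,
                      "S-D": 0o20, "*": 0o24}),
    "ADJ": (0o140,   {"R-ST": 0o40, "ER-EST": 0o100, "*": 0o140}),
    "ADV": (0o600,   {"ER-EST": 0o200, "*": 0o400}),
    "NPR": (0o3000,  {"S": 0o1000, "ES": 0o2000, "*": 0o3000}),
    "SIC": (0o14000, {"S": 0o4000, "ES": 0o10000, "*": 0o14000}),
}
OTHER_PART = 0o20000
OTHER_ATTRS = {"COMP", "CONJ", "DET", "NUMBER", "ORD", "PREP", "PRO",
               "PUNCT", "QUANT", "PREFIX", "INTJ", "SPECIAL", "OTHERPART"}


def tokenize(body):
    """Split on blanks, keeping parenthesized values such as (SIC *) whole."""
    tokens, depth, current = [], 0, ""
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def parse_mw(text):
    """Return [(word, [(attribute, value), ...]), ...] in file order."""
    entries = []
    for body in re.findall(r"\[([^\[\]]*)\]", text):
        tokens = tokenize(body)
        if len(tokens) % 2 == 0:  # need WORD plus attribute/value pairs
            raise ValueError(f"malformed entry: [{body}]")
        word = tokens[0].replace("@", " ")  # "@" stands for a blank
        entries.append((word, list(zip(tokens[1::2], tokens[2::2]))))
    return entries


def info_word(pairs):
    """Build a 16-bit Info word; attributes not listed above are ignored."""
    info = 0
    for attr, value in pairs:
        if attr == "NPR" and value.startswith("(SIC "):
            attr, value = "SIC", value[5:-1]  # "(SIC ES)" -> "ES"
        if attr in OTHER_ATTRS:
            info = info & ~OTHER_PART if value == "FALSE" else info | OTHER_PART
        elif attr in FIELDS:
            mask, codes = FIELDS[attr]
            code = 0 if value == "FALSE" else codes[value]
            info = (info & ~mask) | code  # assumed: later value replaces earlier
    return info


SAMPLE = """
[bias N ES V ES-ED ]
[CACM NPR (SIC *) SUBSTITUTE ((communications of the association for computing machinery)) ]
[a@la@carte ADV * ]
"""

for word, pairs in parse_mw(SAMPLE):
    print(word, oct(info_word(pairs)))
```

Attributes this application does not use, such as SUBSTITUTE, are parsed but contribute nothing to the Info word. DELETE is likewise left to the caller, since it removes an entry rather than encoding one.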