FindPkg - a fast file searching package This package uses the Alto microinstruction RAM, if available, to search standard Alto files for certain simple kinds of patterns at very high speed (it normally keeps up with the disk). It is written in Bcpl. Note: this release is incompatible with the previous one in an important way: it uses the Alto Operating System's ScanStream facility for scanning the file, rather than the (now defunct) ScanFile package. This required a change to the way you initialize a search (FindInit, now called FindInitScan) and the way you clean up afterwards. To use FindPkg, one first "compiles" the pattern into specialized microcode which is loaded into the RAM, or into tables which are interpreted by software if no RAM exists, and then scans as many files as desired using this microcode. To compile the pattern, call FindCompile(pattern, chartab[, wildchar, fuzz, outstream, storeproc, regtable, lvTables, zone]) where all the arguments beyond chartab are optional (may be omitted, or supplied as 0). The arguments have the following significance. Pattern is a Bcpl string, the pattern being searched for. The search ignores the high-order bit of characters in both the file and the pattern. In addition, the following 3 arguments affect how the pattern is interpreted. The maximum length of the pattern is the number of R and S registers available (see below), rounded down to an even number if necessary. Chartab is a 200b-word array which specifies how characters in the file are to be interpreted. Chartab!j specifies how occurrences of the character whose code is j are to be treated. The possible contents of each chartab entry are: classSkip, meaning ignore the character completely; classOther, meaning that the character is to be taken literally; or a code between 0 and 177b inclusive, meaning that the character is to be treated as though it were that character (which, in turn, must be of classOther in the table). For example, to cause lower case letters in the file to be treated as though they were the corresponding upper case letter, set chartab!$a = $A, etc. Wildchar is a character whose appearance in the pattern string means "match any character in the file". For example, if the pattern string is "A?B" and wildchar is $?, any occurrence of A followed by any character followed by B in the file will be considered an occurrence of the pattern. If wildchar is not a character code, it is ignored, and all characters in the pattern are taken literally. Wildchar defaults to -1 (take the pattern literally). Fuzz is the number of mismatches between the pattern and the corresponding string in the file that will be tolerated. For example, if the pattern is ABCD, then with fuzz=0, only the string ABCD in the file (after interpretation through chartab) will match; with fuzz=1, the strings ABCX, ABXD, AZCD, or ZBCD ------------ Copyright Xerox Corporation 1980 File searching package October 29, 1980 2 would match, and so on. Note that fuzz only applies to replacement mismatches, not insertions (e.g. ABXCD), deletions (e.g. ABD), or transpositions (e.g. ABDC). Fuzz defaults to 0 (exact match required). Outstream, if non-zero, is a character stream on which FindCompile will write a listing of the microcode it generates. This is only useful for debugging. Outstream defaults to 0 (no listing). Storeproc determines what will be done with the microcode. Storeproc=false means discard it (although a listing will still be produced if outstream is non-zero). Storeproc=true means store it in the RAM for execution. Otherwise, FindCompile calls storeproc(location, insvec) for each instruction it generates, where insvec is a 2-word vector containing the microinstruction. Storeproc defaults to true (store for execution). Regtable is a 4-word bit table that specifies what R and S registers are available for use by the microcode. These registers must not be used by other tasks, or by the Nova instruction set, although they may be used by BitBlt or other Alto-specific instructions. Also, registers 14b through 16b are assumed usable, and should not appear in the bit table. Regtable defaults to a table that lets the microcode use register 17b and registers 41b through 76b, which will accommodate a 30-character pattern. LvTables is the address of a cell in which FindCompile will store the address of the table space it allocates, or 0 if it did not need table space. After the search, your program should do something like if tables ne 0 then Free(zone, tables) where lvTables is lv tables and zone is the zone argument to FindCompile. If lvTables is defaulted, your program is responsible for finding and freeing the table space some other way (e.g. by providing special Allocate code for the zone, or by reinitializing the zone, neither of which is applicable to sysZone). Zone is a zone in which FindCompile will allocate table space if no RAM is available. This space must remain allocated while doing the actual file search, but can (should) be freed after the search is finished. Zone defaults to sysZone. Note that the outstream, storeproc, and regtable arguments have rather specialized purposes; the usual call on FindCompile will only supply pat, chartab, lvTables, and possibly wildchar, fuzz, and zone. The awkward order of the arguments results from backward compatibility requirements. FindCompile normally returns zero. If it encounters any difficulties, it returns a string which describes the difficulty. This string is meant to be printed for the user, not interpreted by the calling program. After calling FindCompile to load the RAM or set up the tables, one scans files as follows. First, create an ordinary OS disk stream for the file to be searched, using OpenFile, CreateDiskStream, etc. To start searching the file, call FindInitScan(stream, buf, bufsize, fa) File searching package October 29, 1980 3 where st is the stream, buf is the address of a buffer of bufsize words, and fa is a file address (FA) structure into which FindPkg will store each time it finds a match. FindInitScan returns an object called a scan stream descriptor (SSD), which you need to save for cleaning up. Then to find each match in turn, call FindNext() FindNext either finds the next match or scans to the end of the file. In the former case, it returns a non-negative number that says how many characters of the pattern had been examined before it decided it had a match, and stores the disk address, page number, and character position at that time into the fa given to FindInit. For example, if the pattern is "ABCD" and fuzz=1, then if the file contains ABXD, FindNext will stop after the D and return 4, while if it contains ABCX, it will stop after the C and return 3, since it knows it has a match at that point regardless of the next character. If FindNext runs off the end of the file, it returns -n-1 where n is the number of pages in the file. Your program should then call FinishScanStream(ssd) to clean up the ScanStream data structures, where ssd is the SSD returned by FindInitScan; close the disk stream; and call Free (as described above) to release any tables allocated by FindCompile. FindPkg consists of 5 files: FindNext.BR, containing the procedures FindInit and FindNext; FindNextAsm.BR, containing some assembly language procedures needed by FindNext; FindCompile.BR, containing the procedure FindCompile; FindCompMu.BR, containing some Alto microcode needed by FindCompile; FindPkgDefs.D, a Bcpl source file containing the definitions for the character classes.