FindPkg - a fast file searching package



This  package  uses the  Alto  microinstruction RAM,  if  available, to
search standard Alto files for certain simple kinds of patterns at very
high speed  (it normally  keeps up with  the disk).   It is  written in
Bcpl.

Note:  this  release  is  incompatible  with  the  previous  one  in an
important way: it uses the Alto Operating System's  ScanStream facility
for scanning the file, rather than the (now defunct)  ScanFile package.
This required a  change to the way  you initialize a  search (FindInit,
now called FindInitScan) and the way you clean up afterwards.

To  use  FindPkg, one  first  "compiles" the  pattern  into specialized
microcode  which is  loaded  into the  RAM,  or into  tables  which are
interpreted by software if no RAM exists, and then scans as  many files
as desired using this microcode.  To compile the pattern, call
      FindCompile(pattern,   chartab[,   wildchar,   fuzz,   outstream,
storeproc, regtable, lvTables, zone])
where all the arguments beyond chartab are optional (may be omitted, or
supplied as 0).  The arguments have the following significance.
   Pattern is a  Bcpl string, the  pattern being searched  for.  The
   search ignores the high-order bit of characters in both  the file
   and the pattern.  In  addition, the following 3  arguments affect
   how  the  pattern  is interpreted.   The  maximum  length  of the
   pattern is the number of R and S registers available (see below),
   rounded down to an even number if necessary.

   Chartab is  a 200b-word array  which specifies how  characters in
   the  file  are  to  be  interpreted.   Chartab!j   specifies  how
   occurrences of the character whose  code is j are to  be treated.
   The  possible  contents  of each  chartab  entry  are: classSkip,
   meaning ignore the character completely; classOther, meaning that
   the character is to be  taken literally; or a code between  0 and
   177b inclusive, meaning  that the character  is to be  treated as
   though  it  were  that  character (which,  in  turn,  must  be of
   classOther  in  the table).   For  example, to  cause  lower case
   letters  in  the file  to  be  treated as  though  they  were the
   corresponding upper case letter, set chartab!$a = $A, etc.

   Wildchar is  a character whose  appearance in the  pattern string
   means "match  any character  in the file".   For example,  if the
   pattern string is "A?B" and  wildchar is $?, any occurrence  of A
   followed  by any  character followed  by B  in the  file  will be
   considered an occurrence  of the pattern.   If wildchar is  not a
   character code, it is ignored, and all characters in  the pattern
   are taken literally.  Wildchar  defaults to -1 (take  the pattern
   literally).

   Fuzz  is the  number of  mismatches between  the pattern  and the
   corresponding string  in the  file that  will be  tolerated.  For
   example,  if the  pattern  is ABCD,  then with  fuzz=0,  only the
   string ABCD  in the file  (after interpretation  through chartab)
   will match; with  fuzz=1, the strings  ABCX, ABXD, AZCD,  or ZBCD


                             ------------
                   Copyright Xerox Corporation 1980


File searching package     October 29, 1980                           2




   would  match,  and  so  on.   Note  that  fuzz  only  applies  to
   replacement  mismatches, not  insertions (e.g.  ABXCD), deletions
   (e.g. ABD),  or transpositions (e.g.  ABDC).  Fuzz defaults  to 0
   (exact match required).

   Outstream,  if   non-zero,  is  a   character  stream   on  which
   FindCompile will write a  listing of the microcode  it generates.
   This is only useful  for debugging.  Outstream defaults to  0 (no
   listing).

   Storeproc  determines  what  will  be  done  with  the microcode.
   Storeproc=false means discard  it (although a listing  will still
   be  produced  if outstream  is  non-zero).   Storeproc=true means
   store it in the RAM for execution.  Otherwise,  FindCompile calls
   storeproc(location,  insvec) for  each instruction  it generates,
   where insvec is a 2-word vector containing  the microinstruction.
   Storeproc defaults to true (store for execution).

   Regtable  is a  4-word  bit table  that  specifies what  R  and S
   registers  are  available  for  use  by  the   microcode.   These
   registers  must  not be  used  by  other tasks,  or  by  the Nova
   instruction set,  although they  may be used  by BitBlt  or other
   Alto-specific instructions.  Also, registers 14b through  16b are
   assumed usable, and should not appear in the bit table.  Regtable
   defaults to a table that lets the microcode use register  17b and
   registers 41b through 76b, which will accommodate  a 30-character
   pattern.

   LvTables is the address of a cell in which FindCompile will store
   the address of the table space  it allocates, or 0 if it  did not
   need  table  space.  After  the  search, your  program  should do
   something like
         if tables ne 0 then Free(zone, tables)
   where lvTables  is lv  tables and  zone is  the zone  argument to
   FindCompile.   If   lvTables  is   defaulted,  your   program  is
   responsible for  finding and freeing  the table space  some other
   way (e.g. by providing special Allocate code for the zone,  or by
   reinitializing  the  zone,  neither  of  which  is  applicable to
   sysZone).

   Zone is a zone in which FindCompile will allocate table  space if
   no  RAM is  available.  This  space must  remain  allocated while
   doing the actual file search, but can (should) be freed after the
   search is finished.  Zone defaults to sysZone.
Note that the outstream, storeproc, and regtable arguments  have rather
specialized purposes;  the usual call  on FindCompile will  only supply
pat, chartab,  lvTables, and  possibly wildchar,  fuzz, and  zone.  The
awkward  order of  the  arguments results  from  backward compatibility
requirements.

FindCompile normally returns zero.  If it encounters  any difficulties,
it returns  a string  which describes the  difficulty.  This  string is
meant  to be  printed  for the  user,  not interpreted  by  the calling
program.

After calling  FindCompile to load  the RAM or  set up the  tables, one
scans files as follows.  First,  create an ordinary OS disk  stream for
the file  to be  searched, using  OpenFile, CreateDiskStream,  etc.  To
start searching the file, call
      FindInitScan(stream, buf, bufsize, fa)


File searching package     October 29, 1980                           3




where st  is the  stream, buf  is the  address of  a buffer  of bufsize
words, and fa is a file address (FA) structure into which  FindPkg will
store  each time  it  finds a  match.  FindInitScan  returns  an object
called  a scan  stream descriptor  (SSD), which  you need  to  save for
cleaning up.  Then to find each match in turn, call
      FindNext()
FindNext either finds the next match  or scans to the end of  the file.
In the former case, it returns a non-negative number that says how many
characters of the pattern had been examined before it decided it  had a
match, and stores the disk address, page number, and character position
at  that time  into the  fa  given to  FindInit.  For  example,  if the
pattern is "ABCD" and fuzz=1, then if the file contains  ABXD, FindNext
will stop after the D and return 4, while if it contains ABCX,  it will
stop after the C  and return 3, since it  knows it has a match  at that
point regardless of the next  character.  If FindNext runs off  the end
of the  file, it returns  -n-1 where n  is the number  of pages  in the
file.  Your program should then call FinishScanStream(ssd) to  clean up
the  ScanStream  data structures,  where  ssd is  the  SSD  returned by
FindInitScan; close the disk stream; and call Free (as described above)
to release any tables allocated by FindCompile.

FindPkg consists of 5 files:
      FindNext.BR, containing the procedures FindInit and FindNext;
      FindNextAsm.BR,  containing  some  assembly  language  procedures
needed by FindNext;
      FindCompile.BR, containing the procedure FindCompile;
      FindCompMu.BR,   containing   some  Alto   microcode   needed  by
FindCompile;
      FindPkgDefs.D, a Bcpl source file containing the  definitions for
the character classes.