pkg/doc/search.txt

pkg
SEARCH

1. Goals

   i.   Provide relevant information
   ii.  Provide a consistently fast response
   iii. Make responses consistent between local and remote search
   iv.  Provide the user with a good interface to the information
   v.   Allow seamless recovery when search fails
   vi.  Ensure the index is (almost) always in a consistent state

2. Approach

   From a high level, there are two components to search: the
   indexer, which maintains the information needed for search; the
   query engine, which actually performs a search of the information
   provided. The indexer is responsible for creating and updating the
   indexes and ensuring they're always in a consistent state. It does this
   by maintaining a set of inverted indexes as text files (details of which
   can be found in the comments at the top of indexer.py). On the server
   side, it's hooked into the publishing code so that the index is updated
   each  time a package is published. If indexing is already happening when
   packages are published, they're queued and another update to the indexes
   happens once the current run is finished. On the client side, it's
   hooked into the install, update, and uninstall code so that each
   of those actions are reflected in the index.

   The query engine is responsible for processing the text from the user,
   searching for that token in its information, and giving the client code
   the information needed for a reasonable response to the user. It must
   ensure that the information it uses is in a consistent state. On the
   server, an engine is created during the server initialization. It reads
   in the files it needs and stores the data internally. When the server gets
   a search request from a client, it hands the search token to the query
   engine. The query engine ensures that it has the most recent information
   (locking and rereading the files from disk if necessary) and then searches
   for the token in its dictionaries. On the client, the process is the same
   except that the indexes are read from disk each time instead of being stored
   because a new instance of pkg is started for each search.

3. Details

   Search reserves the $ROOT/index directory for its use on both the client
   and the server. It also creates a TMP directory inside index which it stores
   indexes in until it's ready to migrate them to the the proper directory.

   indexer.py contains detailed information about the files used to store the
   index and their formats.

   3.1 Locking

       The indexes use a version locking protocol. The requirements for the
       protocol are:
        the writer never blocks on readers
        any number of readers are allowed
        readers must always have consistent data regardless the
            writer's actions
       To implement these features, several conventions must be observed. The
       writer is responsible for updating these files in another location,
       then moving them on top of existing files so that from a reader's
       perspective, file updates are always atomic. Each file in the index has
       a version in the first line. The writer is responsible for ensuring that
       each time it updates the index, the files all have the same version
       number and that version number has not been previously used. The writer
       is not responsible for moving multiple files atomically, but it should
       make an effort to have files in $ROOT/index be out of sync for as short
       a time as is possible.

       The readers are responsible for ensuring that the files their reading
       the indexes from are a consistent set (have identical version
       numbers). consistent_open in search_storage takes care of this
       functionality.