.. This document is formatted using reStructuredText, which is a Markup
   Syntax and Parser Component of Docutils for Python.  An html version
   of this document can be generated using the following command:

     rst2html.py doc/parallel-linked-images.txt >doc/parallel-linked-images.html

======================
Parallel Linked Images
======================

:Author: Edward Pilatowicz
:Version: 0.1


Problems
========

Currently, linked image recursion is done serially and in stages.  For
example, when we perform a "pkg update" on an image, then for each
child image we will execute multiple pkg.1 cli operations.  The
multiple pkg.1 invocations on a single child image correspond to the
following sequential stages of pkg.1 execution:

1) publisher check: sanity check child publisher configuration against
   parent publisher configuration.

2) planning: plan fmri and action changes.

3) preparation: download content needed to execute planned changes.

4) execution: execute planned changes.

So to update an image with children, we invoke pkg.1 four times for
each child image.

This architecture is inefficient for multiple reasons:

- We don't do any operations on child images in parallel.

- When executing multiple pkg.1 invocations to perform a single
  operation on a child image, we are constantly throwing out and
  re-initializing lots of pkg.1 state.

To make matters worse, as we execute stages 3 and 4 on a child image,
the pkg client also re-executes previous stages.  For example, when we
start stage 4 (execution) we re-execute stages 2 and 3.  So for each
child we update, we end up invoking stage 2 three times and stage 3
twice.  This leads to bugs like 18393 (where it seems that we download
packages twice).  It also means that we have caching code buried
within the packaging system that attempts to cache internal state to
disk in an effort to speed up subsequent re-runs of previous stages.


Solutions
=========

Eliminate duplicate work
------------------------

We want to eliminate much of the duplicate work done when executing
packaging operations on children in stages.  To do this we will update
the pkg client api to allow callers to:

- Save an image plan to disk.

- Load an image plan from disk.

- Execute a loaded plan from disk without first "preparing" it.  (This
  assumes that the caller has already "prepared" the plan in a
  previous invocation.)

In addition to eliminating duplicated work during staged execution,
this will also allow us to stop caching intermediate state internally
within the packaging system.  Instead, client.py will be enhanced to
cache the image plan, and it will be the only component that knows
about "staging".

To allow us to save and restore plans, all image plan data will be
saved within a PlanDescription object, and we will support serializing
this object into a json format.  The json format for saved image plans
is an internal, unstable, and unversioned private interface.  We will
not support saving an image plan to disk and then executing it later
with a different version of the packaging system on a different host.
Also, even though we will be adding data to the PlanDescription
object, we will not be exposing any new information about an image
plan via the PlanDescription object to api consumers.
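
To illustrate the intended flow, here is a minimal, hypothetical
sketch of how a staged driver in client.py might use such an api.  The
save and load entry points (tojson() and load_plan()) and the plan
cache location are illustrative assumptions, not the actual pkg api;
gen_plan_update(), describe(), prepare(), and execute_plan() follow
the style of existing api entry points::

  # Hypothetical staged driver; tojson(), load_plan(), and PLAN_PATH
  # are assumed names, not the actual pkg api.
  PLAN_PATH = "/var/pkg/state/plan.json"    # assumed cache location

  def stage_prepare(api_inst):
      # Stages 2-3: plan and prepare once, then save the image plan
      # to disk so that a later invocation can reuse it.
      for pd in api_inst.gen_plan_update():
          pass                              # drain the plan generator
      api_inst.prepare()
      with open(PLAN_PATH, "w") as f:
          f.write(api_inst.describe().tojson())     # hypothetical

  def stage_execute(api_inst):
      # Stage 4: load the already-prepared plan and execute it
      # directly, without re-running planning or preparation.
      with open(PLAN_PATH) as f:
          api_inst.load_plan(f.read())              # hypothetical
      api_inst.execute_plan()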

An added advantage of allowing api consumers to save an image plan to
disk is that it should help with our plans to have the
api.gen_plan_*() functions return PlanDescription objects for child
images.  A file descriptor (or path) associated with a saved image
plan would be one way for child images to pass image plans back to
their parent (which could then load them and yield them as results to
api.gen_plan_*()).


Update children in parallel
---------------------------

We want to enhance the package client so that it can update child
images in parallel.

Due to potential resource constraints (cpu, memory, and disk io) we
cannot entirely remove the ability to operate on child images
serially.  Instead, we plan to allow for a concurrency setting that
specifies how many child images we are willing to update in parallel.
By default, when operating on child images we will use a concurrency
setting of 1, which maintains the current behavior of the packaging
system.  If a user wants to specify a higher concurrency setting, they
can use the "-C N" option to subcommands that recurse (like "install",
"update", etc.) or they can set the environment variable
"PKG_CONCURRENCY=N".  (In both cases N is an integer which specifies
the desired concurrency level.)

Currently, pkg.1 worker subprocesses are invoked via the pkg.1 cli
interfaces.  When switching to parallel execution this will be changed
to use a json encoded rpc execution model.  This richer interface is
needed to allow worker processes to pause and resume execution between
stages so that we can do multi-staged operations in a single process.

Unfortunately, the current implementation does not yet retain child
processes across different stages of execution.  Instead, whenever we
start a new stage of execution, we spawn one process for each child
image, then we make a remote procedure call into N images at once
(where N is our concurrency level).  When an RPC returns, that child
process exits and we start a call for the next available child.
Ultimately, we'd like to move to a model where we have a pool of N
worker processes, and those processes can operate on different images
as necessary.  These processes would be persistent across all stages
of execution, and ideally, when moving from one stage to another,
these processes could cache in memory the state for at least N child
images so that the processes could simply resume execution where they
last left off.

The client side of this rpc interface will live in a new module called
PkgRemote.  The linked image subsystem will use the PkgRemote module
to initiate operations on child images.  One PkgRemote instance will
be allocated for each child that we are operating on.  Currently, this
PkgRemote module will only support the sync and update operations used
within linked images, but in the future it could easily be expanded to
support other remote pkg.1 operations so that we can support recursive
linked image operations (see 7140357).

When PkgRemote invokes an operation on a child image, it will fork off
a new pkg.1 worker process as follows::

  pkg -R /path/to/linked/image remote --ctlfd=5

This new pkg.1 worker process will function as an rpc server to which
the client will make requests.

Communication between the client and server will be done via json
encoded rpc.  These requests will be sent between the client and
server via a pipe.  The communication pipe is created by the client,
and its file descriptor is passed to the server via fork/exec.  The
server is told about the pipe file descriptor via the --ctlfd
parameter.

To avoid issues with blocking IO, all communication via this pipe will
be done by passing file descriptors.  For example, if the client wants
to send an rpc request to the server, it will write that rpc request
into a temporary file and then send the fd associated with the
temporary file over the pipe.  Any reply from the server will be
similarly serialized and then sent via a file descriptor over the
pipe.  This should ensure that no matter the size of the request or
the response, we will not block when sending or receiving requests via
the pipe.  (Currently, the limit of fds that can be queued in a pipe
is around 700.  Given that our rpc model includes matched requests and
responses, it seems unlikely that we'd ever hit this limit.)
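
To make the fd-passing scheme concrete, here is a minimal,
self-contained sketch of the idea.  It is not the actual PkgRemote
code: it uses a Unix domain socket pair in place of the pipe (Python
exposes fd passing, i.e. SCM_RIGHTS, through socket.send_fds() and
socket.recv_fds() as of Python 3.9), and both endpoints live in a
single process for brevity::

  import json
  import os
  import socket
  import tempfile

  def send_request(chan, request):
      # Serialize the rpc request into a temporary file and pass only
      # its fd across the channel.  The payload itself never transits
      # the channel, so a large request cannot block the sender.
      tmp = tempfile.TemporaryFile(mode="w+")
      json.dump(request, tmp)
      tmp.flush()
      tmp.seek(0)
      socket.send_fds(chan, [b"req"], [tmp.fileno()])

  def recv_request(chan):
      # Receive the fd and read the json request body back out of it.
      msg, fds, flags, addr = socket.recv_fds(chan, 16, 1)
      with os.fdopen(fds[0], "r") as f:
          return json.load(f)

  client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
  send_request(client, {"method": "update", "params": {"stage": "plan"}})
  print(recv_request(server))    # {'method': 'update', ...}

A reply would travel the same way in the opposite direction: the
server serializes its result to a temporary file and passes the fd
back over the same channel.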

In the pkg.1 worker server process, we will have a simple json rpc
server that lives within client.py.  This server will listen for
requests from the client and invoke client.py subcommand interfaces
(like update()).  The client.py subcommand interfaces were chosen to
be the target of remote rpc calls for the following reasons:

- Least amount of encoding / decoding.  Since these interfaces are
  invoked just after parsing user arguments, they mostly take simple
  arguments (strings, integers, etc.) which have a direct json
  encoding.  Additionally, the return values from these calls are
  simple return code integers, not objects, which means the results
  are also easy to encode.  This means that we don't need lots of
  extra serialization / de-serialization logic (for things like api
  exceptions, etc.).

- Output and exception handling.  The client.py interfaces already
  handle exceptions and output for the client.  This means that we
  don't have to create new output classes or build our own output and
  exception handling code; instead we leverage the existing code.

- Future recursion support.  Currently, when recursing into child
  images we only execute "sync" and "update" operations.  Eventually
  we want to support pkg.1 subcommand recursion into linked images
  (see 7140357) for many more operations.  If we do this, the
  client.py interfaces provide a nice boundary, since there will be an
  almost 1:1 mapping between parent and child subcommand operations.


Child process output and progress management
--------------------------------------------

Currently, since child execution happens serially, all child images
have direct access to standard out and display their progress directly
there.  Once we start updating child images in parallel this will no
longer be possible.  Instead, all output from children will be logged
to temporary files and displayed by the parent when a child completes
a given stage of execution.

Additionally, since child images will no longer have access to
standard out, we will need a new mechanism to indicate progress while
operating on child images.  To do this we will have a progress pipe
between each parent and child image.  The child image will write one
byte to this pipe whenever one of the ProgressTracker *_progress()
interfaces is invoked.  The parent process can read from this pipe to
detect progress within children and update its user visible progress
tracker accordingly.
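
As a rough sketch of this mechanism (using bare os.pipe() and
os.fork() here; the real implementation would wire the write end into
the child's ProgressTracker callbacks), the parent might observe child
progress like this::

  import os

  r, w = os.pipe()
  pid = os.fork()
  if pid == 0:
      # Child: write one byte per progress event, as a ProgressTracker
      # *_progress() callback would.
      os.close(r)
      for _ in range(5):
          os.write(w, b".")
      os.close(w)
      os._exit(0)

  # Parent: every byte read represents one unit of child progress; a
  # real client would tick its user visible progress tracker here.
  os.close(w)
  ticks = 0
  while os.read(r, 1):
      ticks += 1
  os.close(r)
  os.waitpid(pid, 0)
  print("observed %d progress ticks from child" % ticks)

With N children running in parallel, the parent would presumably poll
the set of progress pipes (e.g. with select) rather than blocking on a
single pipe as this sketch does.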