# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# $Id: CACHE_SPEC,v 3.6.0.0 1998/01/05 22:55:19 idumois Exp $
#

	"sd" cache layer
	----------------
#include <sys/sd/sd.h>

The "sd" layer provides a common interface to the functionality
described below.  It will also allow switching to a direct-to-disk
version, so that a new cache module could be loaded.
The functions are basically the same as those below,
but named without the leading underscore
(i.e. sd_alloc_buf instead of _sd_alloc_buf).


	"sdbc" -- storage device block cache (aka blkc)
	-----------------------------------------------

#include "uts/sd/sdbc/sd_cache.h"	/* for SDBC interface */
#include "sys/sd/sd.h"			/* for generic SD interface */

All interaction is in terms of the buf_handle.

Currently buf_handle is declared as:

#define _SD_MAX_BLKS	64
#define _SD_MAX_FBAS	(_SD_MAX_BLKS << FBA_SHFT)

typedef struct _sd_buf_handle {
	int bh_cd;		/* actually bh_buf.sb_cd */
	int bh_fba_pos;		/* bh_buf.sb_pos */
	int bh_fba_len;		/* bh_buf.sb_len */
	int bh_flag;		/* bh_buf.sb_flag */
	int bh_error;		/* bh_buf.sb_error */
	_sd_vec_t bh_bufvec[_SD_MAX_BLKS]; /* bh_buf.sb_vec */
	void (*bh_disconnect_cb)();
	void (*bh_read_cb)();
	void (*bh_write_cb)();
	......
} _sd_buf_handle_t;


typedef struct sd_vec_s {		/* Scatter gather element */
	unsigned char	*sv_addr;	/* Virtual address of data */
	unsigned int	sv_vme;		/* VME address of data */
	int		sv_len;		/* Data length in bytes */
} sd_vec_t;

The upper level routines should reference only handle->bh_error and
handle->bh_bufvec.  The bh_bufvec is an array of _sd_vec_t, with the
last item in the array having a NULL address (sv_addr).
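
A minimal sketch of how an upper level routine might walk bh_bufvec,
assuming the scatter/gather layout shown above and assuming that the
terminating item is the one whose sv_addr is NULL. The helper name
sd_vec_total_bytes is hypothetical and is not part of the sdbc interface.

```c
#include <stddef.h>

/* Local copy of the scatter/gather element shown above (illustration only). */
typedef struct sd_vec_s {
	unsigned char	*sv_addr;	/* Virtual address of data */
	unsigned int	sv_vme;		/* VME address of data */
	int		sv_len;		/* Data length in bytes */
} sd_vec_t;

/*
 * Hypothetical helper: sum the byte lengths of every element up to,
 * but not including, the terminating element (sv_addr == NULL).
 * The vector is only read, never modified.
 */
long
sd_vec_total_bytes(const sd_vec_t *vec)
{
	long total = 0;

	for (; vec->sv_addr != NULL; vec++)
		total += vec->sv_len;
	return (total);
}
```

A caller would pass handle->bh_bufvec after a successful allocation; the
loop never writes through the pointer, in keeping with the read-only rule
for the handle.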

IMPORTANT: The handle should be treated read-only and never be modified.

	1) Multiple accesses to a single file will be supported.
	(Side effect: If a process owning cache blocks of a file attempts
	to allocate overlapping cache blocks, the result will be a
	deadlock condition.)

	2) Multiple writes to an allocated block will be supported. It
	is no longer necessary to free and re-allocate between writes.

	3) _SD_NOBLOCK is the equivalent of async_io -- the io will be
	initiated if required, with the call returning _SD_PENDING. A
	callback (read or write) will be called when the io completes.

	4) Disconnect hints to ckd will be provided by the use of
	either psema or thread_bind() when io needs to be initiated.


NOTE:
	fba_pos = disk block number, each block being 512 bytes.
	fba_len = len in disk blocks, each block being 512 bytes.
		Thus, 512 bytes = 1 fba_len, 1024 = 2 fba_len etc...
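
The arithmetic in the NOTE above can be sketched as follows. Both helper
names and the FBA_BYTES macro are hypothetical; only the 512 byte block
size comes from the text. Rounding a partial block up to a whole FBA is
an assumption about what a caller would normally want.

```c
#define	FBA_BYTES	512	/* one FBA is 512 bytes, per the NOTE above */

/* Hypothetical helper: number of FBAs needed to cover nbytes (rounds up). */
int
bytes_to_fba_len(long nbytes)
{
	return ((int)((nbytes + FBA_BYTES - 1) / FBA_BYTES));
}

/* Hypothetical helper: byte offset on disk of the block at fba_pos. */
long long
fba_pos_to_byte_offset(int fba_pos)
{
	return ((long long)fba_pos * FBA_BYTES);
}
```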

Hints:
	_SD_WRTHRU: write through mode.
		This hint can be set on a node, a device or per access.
	_SD_FORCED_WRTHRU: forced write through (node down or flow control)
		If this hint is cleared,  when only one node is up,
		_sd_uncommit() will not work properly, and a second
		failure could result in lost data.
		This is a node hint.
	_SD_NOCACHE: reuse cache blocks immediately instead of keeping
		in lru order.
		This hint can be set on a device or per access.

Interface:

_sd_buf_handle_t *
_sd_alloc_handle(discon_cb, read_cb, write_cb)
	void (*discon_cb)();
	void (*read_cb)();
	void (*write_cb)();

	The callbacks can be NULL if you do not want any callbacks.
	Otherwise, the callbacks will be stored in the handle, and will
	be called at specific points in the cache. (It's up to the
	callback to do what is necessary, including disconnecting
	from the channel.)

	Usage: for better performance, an application could allocate
	a handle (or as many handles as are required) up front and
	use it later on in the cache calls.

	Not allocating and managing the handles would mean a new
	handle will be allocated and freed during _sd_alloc_buf
	and _sd_free_buf.

int
_sd_free_handle(handle)
	_sd_buf_handle_t *handle;

	Only handles that are allocated through _sd_alloc_handle
	should be freed with this call.

int
_sd_alloc_buf (cd, fba_pos, fba_len, flag, handle_p)
	int cd;
	int fba_pos;
	int fba_len;
	int flag;
	_sd_buf_handle_t **handle_p;

	cd = cache descriptor. Results in an error if this node does
		not own this disk and the other node has not crashed.
		(i.e. requests must be routed to the correct node)
		(see fault tolerant aspects discussed elsewhere)

	fba_pos = disk position in multiples of 512 byte blocks.
	fba_len = length in multiples of 512 bytes blocks.
		(NOTE: This cannot exceed _SD_MAX_FBAS)

	flag = None, one or more of the following (described below):
		_SD_RDBUF | _SD_WRBUF | _SD_RDWRBUF | _SD_PINNABLE |
		_SD_NOBLOCK | _SD_NOCACHE | _SD_WRTHRU

	handle_p = (*handle_p = handle to be used for this call)
		If *handle_p == NULL, a new handle will be
		allocated. _sd_free_buf will free up any handles
		allocated in this fashion.
		NOTE: Handles allocated in this fashion will not have
			any callbacks registered in them. As such,
			_SD_NOBLOCK flag along with a NULL handle would
			result in the io being lost.

	return: Error number if > 0
		possible errors:
			EINVAL if arguments are incorrect or
				cache not initialized or
				device not open.
			E2BIG if this request is a read and such a large
			request cannot be currently satisfied. (break up
			the io or re-issue at a later point)
			EIO or any other errno that the driver might return.
		Note: on error, the handle is not active,
		and also is freed if *handle_p was NULL.

	if 0 or less, status will be one of:
	   _SD_DONE: buffer is allocated and ready to be used.
		(with the blocks valid if _SD_RDBUF is set)
	   _SD_PENDING:
		read callback, if one has been registered in the handle,
		will be called to complete this request.
	   _SD_HIT:  Same as _SD_DONE, read was satisfied by cache,
		or no blocking required for write buffer.

	Note:	_SD_RDBUF will issue the read if necessary.
		_SD_WRBUF allocates a network address to reflect to
			mirror node on _sd_write().
		~_SD_RDBUF allocates buffers but does NOT pre-read;
			use _sd_read() to fill in (portions) as req'd.

	Note:	flag == (_SD_RDBUF|_SD_WRTHRU|_SD_NOCACHE) will
		clear valid bits (that are not dirty) thus read direct
		from disk, without requiring a hash invalidate.
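
A minimal sketch of how a caller might sort the return value of
_sd_alloc_buf into the three cases described above. The text gives 0 for
_SD_DONE; the values used here for _SD_PENDING and _SD_HIT are
placeholders chosen for illustration, and rc_class_t and classify_rc are
hypothetical names.

```c
#include <errno.h>

#define	_SD_DONE	0	/* the text gives 0 for _SD_DONE */
#define	_SD_PENDING	(-1)	/* placeholder value, an assumption */
#define	_SD_HIT		(-2)	/* placeholder value, an assumption */

typedef enum {
	RC_ERROR,	/* rc > 0: an errno, the handle is not active */
	RC_READY,	/* _SD_DONE or _SD_HIT: buffer can be used now */
	RC_WAIT		/* _SD_PENDING: wait for the registered callback */
} rc_class_t;

/* Hypothetical helper: classify a return value from an sdbc call. */
rc_class_t
classify_rc(int rc)
{
	if (rc > 0)
		return (RC_ERROR);
	if (rc == _SD_PENDING)
		return (RC_WAIT);
	/* Any other non-positive value is treated like _SD_DONE here. */
	return (RC_READY);
}
```

Only in the RC_WAIT case does the caller depend on a callback, which is
why _SD_NOBLOCK is unsafe with a handle that has no callbacks registered.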


int
_sd_write (handle, fba_pos, fba_len, flag)
	_sd_buf_handle_t *handle;
	int fba_pos, fba_len;
	int flag;

	handle = handle previously allocated in allocate buf.
	fba_pos and fba_len have to be within the allocated portion.
	flag: _SD_NOBLOCK | _SD_WRTHRU

	Attempting to write to a handle that was not allocated for write
	will return an error (EINVAL).

	returns:  errno if return > 0
	if 0 or less, return  will be one of:
	   _SD_PENDING: will be returned only if _SD_NOBLOCK is set AND
		either the flag is _SD_WRTHRU or the other node is down,
		or the device/node is in write through mode
	   _SD_DONE: is returned if the block has been written to the disk.
	   _SD_HIT: block written to the cache.
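
The requirement that fba_pos and fba_len fall within the allocated
portion can be checked before the call is made. This is a sketch only:
fba_range_within is a hypothetical helper, and the caller is assumed to
pass the position and length it originally gave to _sd_alloc_buf, since
the handle itself should not be inspected beyond bh_error and bh_bufvec.

```c
/*
 * Hypothetical helper: nonzero if [pos, pos + len) lies entirely inside
 * the allocated range [alloc_pos, alloc_pos + alloc_len).
 */
int
fba_range_within(int alloc_pos, int alloc_len, int pos, int len)
{
	if (alloc_pos < 0 || alloc_len <= 0 || len <= 0 || pos < alloc_pos)
		return (0);
	/* Compare offsets rather than end points to avoid int overflow. */
	return (pos - alloc_pos <= alloc_len - len);
}
```

The same check applies to the other calls that take a handle together
with fba_pos and fba_len.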

int
_sd_read (handle, fba_pos, fba_len, flag)
	_sd_buf_handle_t *handle;
	int fba_pos, fba_len;
	int flag;

	handle = handle previously allocated in allocate buf.
	fba_pos and fba_len have to be within the allocated portion.
	flag: _SD_NOBLOCK

	returns:  errno if return > 0
		error E2BIG if this request is big and cannot be currently
		 satisfied. (break up the io or re-issue at a later point)

	if 0 or less, return  will be one of:
	   _SD_PENDING: will be returned only if _SD_NOBLOCK is set and
		we need to do an io.
	   _SD_HIT: is returned if the blocks were satisfied by cache.
	   _SD_DONE: some blocks were read from disk.
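
Where E2BIG is returned, the text suggests breaking up the io. A minimal
sketch of that splitting, under the assumption that the caller picks its
own piece size (max_fbas) and supplies a function that issues one piece.
All names below are hypothetical; in real use the callback would wrap the
relevant sdbc call, and count_chunk is only a stand-in for demonstration.

```c
#include <errno.h>

/* Hypothetical callback type: issue one sub-request; return 0 on success. */
typedef int (*fba_chunk_fn)(int fba_pos, int fba_len, void *arg);

/*
 * Hypothetical helper: split [fba_pos, fba_pos + fba_len) into pieces of
 * at most max_fbas blocks and hand each piece to fn in ascending order.
 * Stops at the first nonzero return from fn and passes it back.
 */
int
fba_for_each_chunk(int fba_pos, int fba_len, int max_fbas,
    fba_chunk_fn fn, void *arg)
{
	int rc;

	if (fba_len < 0 || max_fbas <= 0)
		return (EINVAL);
	while (fba_len > 0) {
		int n = (fba_len < max_fbas) ? fba_len : max_fbas;

		rc = fn(fba_pos, n, arg);
		if (rc != 0)
			return (rc);
		fba_pos += n;
		fba_len -= n;
	}
	return (0);
}

/* Example callback: records how many pieces were seen and their total size. */
struct chunk_stats {
	int pieces;
	int blocks;
};

int
count_chunk(int fba_pos, int fba_len, void *arg)
{
	struct chunk_stats *st = arg;

	(void) fba_pos;
	st->pieces++;
	st->blocks += fba_len;
	return (0);
}
```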

int
_sd_uncommit(handle, fba_pos, fba_len, flag)
	_sd_buf_handle_t *handle;
	int fba_pos, fba_len;
	int flag;

	handle = handle previously allocated in allocate buf.
	fba_pos and fba_len have to be within the allocated portion.
	flag: reserved for future use.

	_sd_uncommit could block and cannot be called from a
		"non-blocking" context.
	(This is under review, from the ckd point of view)

	returns 0 (_SD_DONE) else errno;


int
_sd_zero (handle, fba_pos, fba_len, flag)
	_sd_buf_handle_t *handle;
	int fba_pos, fba_len;
	int flag;

	handle = handle previously allocated in allocate buf.
	fba_pos and fba_len have to be within the allocated portion.
	zero the buffer described by the handle.
	flag: _SD_NOBLOCK | _SD_WRTHRU

	The call commits data to disk.
	This call has characteristics similar to _sd_write.

	returns: errno if return > 0
		if 0 or less, return will be one of:
		_SD_DONE
		_SD_PENDING

_sd_copy (handle1, handle2, fba_pos1, fba_pos2, fba_len)
	_sd_buf_handle_t *handle1, *handle2;
	int fba_pos1, fba_pos2, fba_len;

	Copies relevant data from handle1 to handle2.
	Useful for mirroring, remote dual copy, backup while open,
	in-house tests, etc.

	This call does not commit data to disk - you must explicitly
	call _sd_write() on handle2 if that is what you want.

	returns: errno if return > 0:
			 EIO - if sd module should do a generic bcopy
			 others - real error (passed to user)
		 if 0 or less, return will be:
			 _SD_DONE - success

_sd_free_buf(handle)
	_sd_buf_handle_t *handle;

	handle = handle previously allocated in allocate buf.

	returns 0 (_SD_DONE) else errno;

_sd_open(filename, flag)
	char *filename;
	int flag;

	returns a cache descriptor, or negative error number.
	Typically use _sd_attach_cd(cd) before accessing the device.
	Note: if the device is already open, it returns the same cache descriptor.
	Currently there is no reference count; so one _sd_close() closes
	the cache descriptor (in all contexts).

_sd_close(cd)
	int cd;
	Similar to _sd_detach_cd below.
	Note: intended to be called when terminating the cache; and not during
	normal operation.  No reference count (see above).
	Returns: 0 success, EIO.

_sd_detach_cd(cd)
	re-reflect any pinned blocks to the other side,
	or wait for writes to flush; and invalidate that device's hash entries,
	and relinquish device responsibility.
	Returns: 0 success, EIO, EAGAIN.

_sd_attach_cd(cd)
	If device has pinned blocks then scan for and re-pin those blocks
	(same idea as "node recovery" process, but per-device);
	and assert device responsibility.

_sd_notify_all_pin(cd)
	rescan list of failed blocks and re-issue the pinned callback to
	simulation.


_sd_register_pinned(func)
	void (*func)();
    callback (*func)(cd, fba_pos, fba_len) when disk write fails,
    and _SD_PINNABLE was specified on alloc.

_sd_register_unpinned(func)
	void (*func)();
    callback (*func)(cd, fba_pos, fba_len) when data previously pinned
    is successfully written to disk.

_sd_register_down(func)
	void (*func)();
    callback (*func)() when health monitor detects the other node went down.

_sd_set_hint(cd, hint)
_sd_clear_hint(cd, hint)
_sd_get_cd_hint(cd, &hint)
_sd_set_node_hint(hint)
_sd_clear_node_hint(hint)
_sd_get_node_hint(&hint)

    where hint is _SD_NOCACHE or _SD_WRTHRU. (Write through is a synchronous
	write, and will be the default if the second node dies.)

   _SD_NOCACHE: hint indicating that the current access need not be
	cached for later consumption.


_sd_discard_pinned(cd, fba_pos, fba_len)
	call from ckd into cache, made when data that was pinned
	earlier can be discarded from the cache.

	returns: 0 or error.
	(error = EINVAL if the discard could not be done)

(note: there is an inherent race between the unpinned callback and
_sd_discard_pinned which could put the data on disk in an inconsistent
state)


Failover support:

The Nodedown callback will be called, if one has been registered. This
will happen as soon as the other node has been detected to have gone down,
or when the cache is disabled on the other node.

The amount of time for this callback to happen after the node goes down
is not deterministic.

Access to a mirror node's devices is only valid from the point the
nodedown callback is called till the other node is determined to be back
in operation.

Access to mirror node's devices while recovery is in progress will
block the access till the recovery is complete.
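
The access rules above can be summarised as a small decision function.
This is an illustrative model only: the state and result names are
hypothetical, and the real cache tracks node state internally rather than
taking it as an argument.

```c
typedef enum {
	PEER_UP,		/* other node is in operation */
	PEER_DOWN,		/* nodedown callback has been called */
	PEER_RECOVERING		/* recovery is in progress */
} peer_state_t;

typedef enum {
	ACCESS_REFUSED,		/* not valid: route to the owning node */
	ACCESS_ALLOWED,		/* valid: mirror node's devices may be used */
	ACCESS_BLOCKS		/* caller blocks until recovery completes */
} mirror_access_t;

/* Hypothetical helper: outcome of accessing a mirror node's device. */
mirror_access_t
mirror_device_access(peer_state_t state)
{
	switch (state) {
	case PEER_DOWN:
		return (ACCESS_ALLOWED);
	case PEER_RECOVERING:
		return (ACCESS_BLOCKS);
	case PEER_UP:
	default:
		return (ACCESS_REFUSED);
	}
}
```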