Basic Principles of the LogCache implementation
===============================================

0. Table of Contents
--------------------

1. Overview
2. Components
3. Streams storage layer
4. Log data model
5. Log cache management


1. Overview
-----------

The log cache allows the storage of SVN log information
on the client side. Basic features:

- extensible, compact storage format
- generic, high-throughput storage layer
- all data is in-memory; r/w granularity is a whole cache file

- data model covers SVN log -v output
- cache file management handles concurrent access and corruptions

Interface details can be derived from the documentation
in the respective header files.


2. Components
-------------

Currently, the functionality is split over 4 libs that
roughly represent different layers of abstraction.
But they are NOT a strict hierarchy of layers. In particular,
not every layer is used at any given time.

LogCacheLib

    - manages the log cache files
    - uniform interface (ILogQuery) to log cache and
          direct SVN log calls
    - manipulates the log cache data model

LogCacheAccessLib

    - iterator objects that follow the history of a given path
    - import / export in SVN log --xml format
    - export in CSV format

LogCacheContainersLib

    - the actual log cache data model (passive)
    - serialization / deserialization code

LogCacheStreamsLib

    - generic external data format:
      hierarchy of compressed streams
    - collection of different stream classes


3. Streams Storage Layer
------------------------

3.1 Basic Concepts

Data structures like a log cache have a well-defined root object
and form a hierarchy of container objects. The storage layer
allows for a natural mapping of these data structures onto a
hierarchy of compressed data streams.

Currently, there is no way to modify an existing file.
Instead, the client will read it in its entirety and overwrite
it completely with new data, if necessary.

A stream contains

    - a chunk of arbitrary binary data (= content)
    - a possibly empty list of sub-streams

It also stores

    - a STREAM_TYPE_ID to identify the stream type
      (i.e. how to interpret the binary content)
    - length of the binary content (in bytes)
    - number of and references to the sub-streams

The code paths for read and write are entirely separate.
IHierarchicalInStream and IHierarchicalOutStream, respectively,
define the basic interface that every stream must implement.

Sub-streams are identified by a user-defined number of type
SUB_STREAM_ID. These IDs must be unique within their parent stream.
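
The structure described above can be pictured roughly as follows.
This is only an illustrative sketch, not the actual class layout:
StreamNode, AddSubStream() and FindSubStream() are hypothetical
names, and the two ID types are assumed to be plain integers.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Assumed stand-ins for the real ID types.
typedef std::uint32_t STREAM_TYPE_ID;
typedef std::uint32_t SUB_STREAM_ID;

// Hypothetical in-memory picture of a single stream.
struct StreamNode
{
    STREAM_TYPE_ID type;                  // how to interpret 'content'
    std::vector<unsigned char> content;   // arbitrary binary data

    // Keyed by SUB_STREAM_ID, so IDs are unique within this parent.
    std::map<SUB_STREAM_ID, std::unique_ptr<StreamNode>> subStreams;

    explicit StreamNode (STREAM_TYPE_ID t) : type (t) {}

    // Returns the new sub-stream or nullptr if the ID is already taken.
    StreamNode* AddSubStream (SUB_STREAM_ID id, STREAM_TYPE_ID t)
    {
        auto result = subStreams.emplace (id, nullptr);
        if (!result.second)
            return nullptr;

        result.first->second.reset (new StreamNode (t));
        return result.first->second.get();
    }

    // Returns nullptr if there is no sub-stream with that ID.
    const StreamNode* FindSubStream (SUB_STREAM_ID id) const
    {
        auto it = subStreams.find (id);
        return it == subStreams.end() ? nullptr : it->second.get();
    }
};
```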

See CHierachical[In|Out]StreamBase for further details of the
sub-stream table encoding. This knowledge is not necessary
for extending the stream hierarchy or even adding new types.


3.2 Stream Content

How the content is actually to be interpreted is handled by
specific stream classes. Most of the classes provided store
std::vector<primitive_type> data. This is because the data
is either integer IDs, flags, offsets etc. or large string
buffers.

Depending on the data type at hand, typical value ranges
(check the Out class headers for info!) and possible correlation
between neighboring values, there are many specific stream
classes. They all try to reduce the data as much as possible.
E.g. diff integer streams work well for sequences of ascending
or descending values, i.e. if abs (v[i] - v[i-1]) << abs (v[i]).
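
To illustrate why storing differences pays off, here is a minimal
sketch of a diff encoding followed by a 7-bit variable-length
packing. It is not the encoding used by the actual stream classes;
FoldSign(), UnfoldSign(), EncodeDiff() and DecodeDiff() are
hypothetical names and the sign folding is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a two's complement delta to a small unsigned number:
// 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
inline std::uint32_t FoldSign (std::uint32_t delta)
{
    return (delta & 0x80000000u) ? ~(delta << 1) : (delta << 1);
}

inline std::uint32_t UnfoldSign (std::uint32_t folded)
{
    return (folded >> 1) ^ (0u - (folded & 1u));
}

// Store v[i] - v[i-1] instead of v[i], 7 bits per byte.
// A set high bit means "more bytes follow".
std::vector<unsigned char> EncodeDiff (const std::vector<std::int32_t>& values)
{
    std::vector<unsigned char> bytes;
    std::uint32_t last = 0;
    for (std::int32_t value : values)
    {
        std::uint32_t current = static_cast<std::uint32_t>(value);
        std::uint32_t packed = FoldSign (current - last);
        last = current;

        while (packed >= 0x80)
        {
            bytes.push_back (static_cast<unsigned char>((packed & 0x7f) | 0x80));
            packed >>= 7;
        }
        bytes.push_back (static_cast<unsigned char>(packed));
    }
    return bytes;
}

std::vector<std::int32_t> DecodeDiff (const std::vector<unsigned char>& bytes)
{
    std::vector<std::int32_t> values;
    std::uint32_t last = 0;
    std::size_t pos = 0;
    while (pos < bytes.size())
    {
        std::uint32_t packed = 0;
        unsigned shift = 0;
        unsigned char byte = 0;
        do
        {
            // debug-only check for truncated or oversized values
            assert (pos < bytes.size() && shift <= 28);
            byte = bytes[pos++];
            packed |= static_cast<std::uint32_t>(byte & 0x7f) << shift;
            shift += 7;
        }
        while (byte & 0x80);

        last += UnfoldSign (packed);
        values.push_back (static_cast<std::int32_t>(last));
    }
    return values;
}
```

With this sketch, the sequence 1000, 1001, 1003, 1002, 1010 takes
6 bytes instead of the 20 bytes needed for five plain 32 bit values.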

A Huffman encoding pass over any kind of stream will remove
any redundancy left or introduced by the stream encoding scheme.


3.3 Writing Data

There can be only one open stream at a time and existing
streams cannot be extended. That implies you have to write
the entire content of a stream before sub-streams can be
added. Closing streams and writing their data to disk is
handled transparently.

Objects will usually define an operator<<(IHierarchicalOutStream*)
method that contains the serialization code. For every
sub-structure, create a sub-stream and serialize the
respective data to it. The stream type / class is specified
by its STREAM_TYPE_ID.

Arrays / lists of structs will require multiple sub-streams:
one per struct member. Rationale: the value correlation
along a table column is much higher than within a row.
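
The column-wise layout can be sketched like this. ChangeRow,
ChangeColumns, SplitIntoColumns() and JoinColumns() are hypothetical
names used for illustration; each vector in ChangeColumns would be
written to a sub-stream of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record type (one table row).
struct ChangeRow
{
    std::uint32_t pathID;
    std::uint8_t action;
};

// One vector per struct member (one table column each).
struct ChangeColumns
{
    std::vector<std::uint32_t> pathIDs;
    std::vector<std::uint8_t> actions;
};

ChangeColumns SplitIntoColumns (const std::vector<ChangeRow>& rows)
{
    ChangeColumns columns;
    columns.pathIDs.reserve (rows.size());
    columns.actions.reserve (rows.size());
    for (const ChangeRow& row : rows)
    {
        columns.pathIDs.push_back (row.pathID);
        columns.actions.push_back (row.action);
    }
    return columns;
}

std::vector<ChangeRow> JoinColumns (const ChangeColumns& columns)
{
    // all columns of one table must have the same length
    assert (columns.pathIDs.size() == columns.actions.size());

    std::vector<ChangeRow> rows (columns.pathIDs.size());
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        rows[i].pathID = columns.pathIDs[i];
        rows[i].action = columns.actions[i];
    }
    return rows;
}
```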

PackedDWORDOutStream.h contains some utility function
templates that handle the serialization of std::vector<>.

If you need other serialization schemes, you can call
AddValue() (or your specific stream method) directly.
Since it is your responsibility to interpret the stream
content correctly, you should put size information first
as you often cannot derive it from the binary stream size.
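
A minimal sketch of the "size first" convention, assuming a
simplified integer stream. IntegerStreamSketch, GetValue(),
WriteList() and ReadList() are hypothetical; only the name
AddValue() is taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an integer stream (read and write
// side folded into one object to keep the sketch short).
struct IntegerStreamSketch
{
    std::vector<std::uint32_t> values;
    std::size_t readPos = 0;

    void AddValue (std::uint32_t value)
    {
        values.push_back (value);
    }

    std::uint32_t GetValue()
    {
        assert (readPos < values.size());   // debug-only overflow check
        return values[readPos++];
    }
};

// Write the element count first, then the elements.
void WriteList (IntegerStreamSketch& stream, const std::vector<std::uint32_t>& list)
{
    stream.AddValue (static_cast<std::uint32_t>(list.size()));
    for (std::uint32_t value : list)
        stream.AddValue (value);
}

// Read the count, then exactly that many elements.
std::vector<std::uint32_t> ReadList (IntegerStreamSketch& stream)
{
    std::uint32_t count = stream.GetValue();

    std::vector<std::uint32_t> list;
    list.reserve (count);
    for (std::uint32_t i = 0; i < count; ++i)
        list.push_back (stream.GetValue());

    return list;
}
```

Because each list carries its own length, several lists can share
one stream and still be separated again when reading.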

See the operator<< implementations in LogCacheContainersLib
for examples of how to serialize various types of data.


3.4 Reading Data

Reading data is much more flexible than the write path.
You may read any data at any time. Since the type info
is stored persistently in the external file, OpenSubStream()
will automatically return an instance of the respective
stream class.

Implement your operator>> symmetrically to your operator<<.

Please note that only the DEBUG builds contain stream
overflow checks. Correct (de-)serialization is your responsibility.

Very large streams may use the Prefetch(), AutoOpen()
and AutoClose() methods to exploit parallelism and to
reduce temporary memory usage. This is useful for stream
sizes >> 100MB.


3.5 Format Versioning

The whole file is tagged with two version info IDs to handle
backward compatibility with older clients. Clients that are
too old will usually delete or ignore and overwrite the cache.

OUR_LOG_CACHE_FILE_VERSION

    - format version used by the last client that wrote
      this file (i.e. "our" client version)
    - bump if you add new sub-streams
    - convention: use the current date 0xYYYYMMDD

MIN_LOG_CACHE_FILE_VERSION

    - format version that must be supported by the client
      to be able to interpret its content correctly
    - try to be backward compatible
    - must be bumped if new stream types are being used
      (even if they would not be accessed by old clients)

OLDEST_LOG_CACHE_FILE_VERSION

    - oldest format this client can read
      (must be <= OUR_LOG_CACHE_FILE_VERSION)

NEWEST_LOG_CACHE_FILE_VERSION

    - newest format this client can read
      (client too old, if < writer's MIN_LOG_CACHE_FILE_VERSION)

Constraints

    - newer version numbers must have greater numerical values

    - OUR_LOG_CACHE_FILE_VERSION >= MIN_LOG_CACHE_FILE_VERSION
    - OUR_LOG_CACHE_FILE_VERSION >= OLDEST_LOG_CACHE_FILE_VERSION
    - NEWEST_LOG_CACHE_FILE_VERSION >= OLDEST_LOG_CACHE_FILE_VERSION
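
One plausible reading of these rules, written as a compatibility
check. This is a sketch under assumptions, not the actual client
code: FileVersionInfo, ClientVersionInfo and CanRead() are
hypothetical names, and the two file values stand for the writer's
OUR_ and MIN_LOG_CACHE_FILE_VERSION as found in the file.

```cpp
#include <cassert>
#include <cstdint>

// Version IDs as found in the cache file (written by some client).
struct FileVersionInfo
{
    std::uint32_t writerVersion;   // writer's OUR_LOG_CACHE_FILE_VERSION
    std::uint32_t minVersion;      // writer's MIN_LOG_CACHE_FILE_VERSION
};

// Limits of the client that wants to read the file.
struct ClientVersionInfo
{
    std::uint32_t oldestReadable;  // OLDEST_LOG_CACHE_FILE_VERSION
    std::uint32_t newestReadable;  // NEWEST_LOG_CACHE_FILE_VERSION
};

bool CanRead (const FileVersionInfo& file, const ClientVersionInfo& client)
{
    // constraints from the list above
    assert (file.writerVersion >= file.minVersion);
    assert (client.newestReadable >= client.oldestReadable);

    // file too old for this client?
    if (file.writerVersion < client.oldestReadable)
        return false;

    // client too old for this file?
    return file.minVersion <= client.newestReadable;
}
```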


3.6 Extending the Format

3.6.1 New Sub-Streams

New sub-streams can be introduced anywhere in the hierarchy
without breaking backward compatibility.

Since clients will only access specific sub-streams via GetSubStream(),
old clients will not see new sub-streams. However, that implies
they will also not write them, i.e. their content gets lost
once an old client writes the file.

This problem can be addressed by the upgrade code that the
client needs to handle older cache files, because the file
*is* in old format once it got overwritten. A client should
use HasSubStream() to test for the presence of its additional
streams.


3.6.2 New Sub-Stream Types

If none of the provided stream types fits your requirements,
choose the one closest to it and derive your implementation
from it. You need to provide

- a new STREAM_TYPE_ID definition
- an input stream class
- an output stream class
- make sure to derive from C[In|Out]StreamImplBase<>,
  which handles the whole factory stuff

See CBLOB[In|Out]Stream, CComposite[In|Out]Stream and
CPackedInteger[In|Out]Stream for typical examples.

Don't forget to bump the file version numbers.


4. Log Data Model
-----------------

4.1 Basics

The log cache is organized as a set of tables. All references
are expressed through IDs that are nothing more than the
(fixed) position within the respective table.

New records are appended to the tables because reordering
requires all references to be updated. That is only done by
special, expensive optimization methods (e.g. Optimize() ).
Selective data removal is likewise unsupported.


4.2 Standard Containers

A small number of container types cover standard data structures,
string dictionaries in particular. They are (re-)used in several
places as part of the specific log cache containers.

Also, most of these standard containers store unique values,
i.e. no two entries may have the same value. That has
the following advantages:

    (1) entry content can be *identified* by index_t
        (e.g. two strings are equal exactly if they
        have the same index value)
    (2) reduces data model size
    (3) -1 always represents an unknown / invalid object

The standard containers are:


CIndexPairDictionary

    - maps index_t <-> pair<index_t, index_t>
      (pairs must be unique)
    - useful to store associations between two tables
      (pair.first  = index in table A,
       pair.second = index in table B)

    - encapsulates a vector<pair<index_t, index_t> >
    - preserves element order (= insertion order)
    - allows for efficient lookup of <index_t, index_t> pairs
      (forward and backward)


CStringDictionary

    - maps index_t <-> 0-terminated sequences of chars
      (strings must be unique, match is case-sensitive)
    - useful to store plain strings
      (outside these containers, all strings are represented
      by their respective index_t value)

    - encapsulates a single vector<char> buffer and a vector<offset>
    - preserves element order (= insertion order)
    - allows for efficient lookup of strings
      (forward and backward)
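
The basic idea of the string dictionary can be sketched as follows.
This is a simplified illustration, not the real implementation:
StringDictionarySketch, Find(), GetString() and NO_INDEX are
hypothetical names, index_t is assumed to be a 32 bit unsigned
integer, and the reverse lookup simply keeps a second copy of every
string in a hash map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::uint32_t index_t;                      // assumption
const index_t NO_INDEX = static_cast<index_t>(-1);  // "unknown / invalid"

class StringDictionarySketch
{
    std::vector<char> buffer;           // all strings, each 0-terminated
    std::vector<std::size_t> offsets;   // offsets[i] = start of string i

    // simplification: duplicates the strings for the reverse lookup
    std::unordered_map<std::string, index_t> lookup;

public:
    // string -> index (case-sensitive); NO_INDEX if unknown
    index_t Find (const std::string& s) const
    {
        auto it = lookup.find (s);
        return it == lookup.end() ? NO_INDEX : it->second;
    }

    // index -> string
    const char* GetString (index_t index) const
    {
        assert (index < offsets.size());
        return &buffer[offsets[index]];
    }

    // Appends s unless it is already known. Returns its index.
    index_t Insert (const std::string& s)
    {
        index_t existing = Find (s);
        if (existing != NO_INDEX)
            return existing;

        index_t index = static_cast<index_t>(offsets.size());
        offsets.push_back (buffer.size());
        buffer.insert (buffer.end(), s.begin(), s.end());
        buffer.push_back ('\0');
        lookup.emplace (s, index);
        return index;
    }
};
```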


CBlobDictionary

    - similar to CStringDictionary but stores
      non-0-terminated byte sequences
    - not used by log cache


4.3 Composite Containers

These build upon the standard containers to handle
slightly more complex data.


CPathDictionary

    - maps index_t <-> path (elements separated by '/')
      (strings must be unique, match is case-sensitive)
    - pathA is parent of pathB -> index (pathA) < index(pathB)
    - index ("/") = 0

    - stores path elements in a CStringDictionary
    - represents paths as <index of parent, index of element>
      pairs in a CIndexPairDictionary
    - allows for efficient traversal of path hierarchies

    - CDictionaryBasedPath and CDictionaryBasedTempPath
      encapsulate an index_t, dictionary pair and provide
      typical path operations based on it (see their header comments)
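
A minimal sketch of the <parent, element> representation. The
class and method names are hypothetical, std::map replaces the real
dictionaries, and index_t is assumed to be a 32 bit unsigned
integer. Because parents are inserted before their children, a
parent always gets the smaller index.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::uint32_t index_t;                      // assumption
const index_t NO_INDEX = static_cast<index_t>(-1);

class PathDictionarySketch
{
    typedef std::pair<index_t, index_t> Pair;       // (parent path, element)

    std::vector<std::string> elements;              // unique path elements
    std::map<std::string, index_t> elementIndex;
    std::vector<Pair> paths;                        // path index -> pair
    std::map<Pair, index_t> pathIndex;              // pair -> path index

    index_t InternElement (const std::string& element)
    {
        auto it = elementIndex.find (element);
        if (it != elementIndex.end())
            return it->second;

        index_t index = static_cast<index_t>(elements.size());
        elements.push_back (element);
        elementIndex.emplace (element, index);
        return index;
    }

public:
    PathDictionarySketch()
    {
        // index 0 is the root "/", it has no parent
        paths.push_back (Pair (NO_INDEX, InternElement ("")));
    }

    // Inserts the path and all missing parents. Returns its index.
    index_t Insert (const std::string& path)
    {
        index_t current = 0;
        std::istringstream stream (path);
        std::string element;
        while (std::getline (stream, element, '/'))
        {
            if (element.empty())
                continue;               // leading or doubled '/'

            Pair key (current, InternElement (element));
            auto it = pathIndex.find (key);
            if (it == pathIndex.end())
            {
                it = pathIndex.emplace (key, static_cast<index_t>(paths.size())).first;
                paths.push_back (key);
            }
            current = it->second;
        }
        return current;
    }

    index_t GetParent (index_t index) const
    {
        assert (index < paths.size());
        return paths[index].first;
    }

    std::string GetPath (index_t index) const
    {
        assert (index < paths.size());
        if (index == 0)
            return "/";

        std::string result;
        for (; index != 0; index = paths[index].first)
            result = "/" + elements[paths[index].second] + result;
        return result;
    }
};
```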


CTokenizedStringContainer

    - maps index_t -> std::string
      (one way, i.e. duplicate strings are possible)
    - used to store large, text-like strings, e.g. comments

    - complex internal structure (see header for details)
      extracting common sub-strings to conserve memory


4.4 Log Data

The primary key to access log information is the revision.
Log info is retrieved in unknown order (accumulates over many
requests) and the containers can only append efficiently
(even if the methods are called Insert() ). Therefore,
we must map the revision number onto our internal record
numbers:


CRevisionIndex

    - maps revision_t -> index_t, called revision index
      (index is unique but reverse mapping is not supported)

    - uses a vector<index_t> plus a single offset value
      to reduce memory consumption in case only the
      top revisions of a large repo are stored
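
The "vector plus offset" idea can be sketched like this. The names
are hypothetical and both revision_t and index_t are assumed to be
32 bit unsigned integers. Only the range between the lowest and the
highest known revision occupies memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t index_t;                      // assumption
typedef std::uint32_t revision_t;                   // assumption
const index_t NO_INDEX = static_cast<index_t>(-1);

class RevisionIndexSketch
{
    revision_t firstRevision = 0;       // revision stored in indices[0]
    std::vector<index_t> indices;       // NO_INDEX marks unknown revisions

public:
    // revision -> revision index; NO_INDEX if not cached
    index_t GetRevisionIndex (revision_t revision) const
    {
        if (revision < firstRevision)
            return NO_INDEX;

        std::size_t pos = revision - firstRevision;
        return pos < indices.size() ? indices[pos] : NO_INDEX;
    }

    void SetRevisionIndex (revision_t revision, index_t index)
    {
        if (indices.empty())
        {
            firstRevision = revision;
        }
        else if (revision < firstRevision)
        {
            // grow at the front and move the offset down
            indices.insert (indices.begin(), firstRevision - revision, NO_INDEX);
            firstRevision = revision;
        }

        std::size_t pos = revision - firstRevision;
        if (pos >= indices.size())
            indices.resize (pos + 1, NO_INDEX);

        assert (pos < indices.size());
        indices[pos] = index;
    }

    std::size_t StoredEntries() const
    {
        return indices.size();
    }
};
```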


With the index value read from this container, we can directly
address the data for that revision in the actual log data
container. That container has a more complex structure:


CRevisionInfoContainer

    - dictionaries (see 4.2 and 4.3) for mapping
      names and paths onto IDs

        * paths (index is called pathID)
        * authorPool (index is called authorID)
        * userRevPropsPool (index is called propID)

    - all table info is actually stored in per-column
      vectors. See header file for the precise names.

    - revision header info

        * index_t(rev) -> ( validityFlags
                          , comment // in CTokenizedStringContainer
                          , authorID -> authorPool
                          , timeStamp )

    - changed paths involve sub-tables

        * index_t(rev) -> ( firstChangeOffset -> changes sub-table
                          , firstCopyFromOffset -> copy-from sub-table
                          , pathID (common root of all changes) -> paths
                          , combined action flags over all changes)

        * changeOffset -> ( change action // ADD, MOD etc. + copy-from presence
                          , pathID -> paths
                          , path / node type )

          valid range for changeOffset is
          [firstChangeOffset (index (rev)), firstChangeOffset (index (rev)+1))

        * copyFromOffset -> ( pathID -> paths
                            , revision_t )

          valid range for copyFromOffset is
          [firstCopyFromOffset (index (rev)), firstCopyFromOffset (index (rev)+1))

      In other words, changes are stored in a common list
      for all revisions and the relevant section for a
      given revision index is from its start offset value
      up to but not including the start offset value of
      the next entry. The last entry is always terminated
      by an internal dummy entry.

      Copy-from info is NOT stored with the changes.
      Instead, a flag in the change action field indicates
      that there is an entry in the copy-from table.
      Therefore, copy-from info can only be read by
      forward-iteration over both tables. Random access
      within a revision is not supported.

      CChangesIterator encapsulates the necessary iteration
      logic. GetChangesBegin() and GetChangesEnd() return
      the range of changes for a given revision index.

    - user revprops

        * index_t(rev) -> ( firstPropOffset -> revProps sub-table )

        * propOffset -> ( propID -> userRevPropsPool
                        , value // in CTokenizedStringContainer )

          valid range for propOffset is
          [firstPropOffset (index (rev)), firstPropOffset (index (rev)+1))

      All revision properties except for the 3 standard
      properties are considered user-defined. The
      container interface does not provide random
      access within a revision. Use CUserRevPropsIterator
      for iteration, instead.

    - merge info

      Not used ATM as it cannot be retrieved from SVN
      in the required format. The table structure is
      similar to the path changes and user revprops.
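
The offset ranges used by the changes, copy-from and revprops
sub-tables all follow the same pattern. A minimal sketch for the
changes sub-table, reduced to a single column; ChangesTableSketch,
AppendRevision() and GetChangeRange() are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::uint32_t index_t;      // assumption

struct ChangesTableSketch
{
    // One entry per revision index plus the terminating dummy entry.
    std::vector<index_t> firstChangeOffset;

    // One entry per change, shared by all revisions.
    std::vector<index_t> changePathIDs;

    // Start with the dummy entry only.
    ChangesTableSketch() : firstChangeOffset (1, 0) {}

    // Appends a revision and its changes. Returns the revision index.
    index_t AppendRevision (const std::vector<index_t>& pathIDs)
    {
        index_t index = static_cast<index_t>(firstChangeOffset.size() - 1);
        changePathIDs.insert (changePathIDs.end(), pathIDs.begin(), pathIDs.end());

        // the new dummy entry terminates the range just added
        firstChangeOffset.push_back (static_cast<index_t>(changePathIDs.size()));
        return index;
    }

    // Returns the [first, second) range of change offsets.
    std::pair<index_t, index_t> GetChangeRange (index_t index) const
    {
        assert (static_cast<std::size_t>(index) + 1 < firstChangeOffset.size());
        return std::make_pair ( firstChangeOffset[index]
                              , firstChangeOffset[index + 1]);
    }
};
```

A revision without changes simply yields an empty range, i.e. its
start offset equals that of the next entry.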

Also, log info containers can be merged. That becomes necessary,
if the original log request did not provide all information
(e.g. no path changes). Since insertion is not possible,
a new container instance receives the data from the repository
and is afterwards merged with the original container in a single
Update() method call.


4.5 Skip Ranges

An svn log will usually not return a consecutive sequence
of revisions. Instead, there will be "gaps" for those
revisions that did not change the respective path.

For a request to be served from the cache efficiently,
the cache must know that certain revisions did not
modify a given path, even if the revisions are not
present in the cache.


CSkipRevisionInfo

    - stores (index (path), startRev, numOfRevs) triples
      marking the respective revision range as irrelevant
      for the given path

Add() will trim known revisions from the range (try to store
skip ranges only for unknown revs).

Compress() gets called upon serialization and will try to trim
ranges as well (maybe, we received more revision data in the
meanwhile).

GetPreviousRevision() will return the next revision that
might be relevant for the given path. It returns -1, if
no suitable skip range has been found.
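
One plausible reading of the lookup, as a sketch. The names
SkipRangesSketch and NO_REVISION are hypothetical, the ID types are
assumed to be 32 bit unsigned integers, ranges are assumed not to
overlap, and the trimming done by Add() and Compress() is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

typedef std::uint32_t index_t;                              // assumption
typedef std::uint32_t revision_t;                           // assumption
const revision_t NO_REVISION = static_cast<revision_t>(-1);

class SkipRangesSketch
{
    // per path index: startRev -> numOfRevs
    std::map<index_t, std::map<revision_t, revision_t>> ranges;

public:
    void Add (index_t path, revision_t startRev, revision_t numOfRevs)
    {
        assert (numOfRevs > 0);
        ranges[path][startRev] = numOfRevs;
    }

    // If 'revision' lies within a skip range of 'path', return the
    // revision just below that range. Otherwise return NO_REVISION.
    // (A range starting at revision 0 also yields NO_REVISION.)
    revision_t GetPreviousRevision (index_t path, revision_t revision) const
    {
        auto pathIter = ranges.find (path);
        if (pathIter == ranges.end())
            return NO_REVISION;

        const std::map<revision_t, revision_t>& perPath = pathIter->second;

        // first range that starts after 'revision'
        auto iter = perPath.upper_bound (revision);
        if (iter == perPath.begin())
            return NO_REVISION;

        // last range that starts at or before 'revision'
        --iter;
        if (revision - iter->first >= iter->second)
            return NO_REVISION;         // 'revision' is behind that range

        return iter->first - 1;
    }
};
```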


4.6 (De-)Serialization

Every container has an input and an output stream operator
(operator>> and operator<<) that expect a hierarchical stream
as source / destination.

Sub-containers (e.g. paths in CRevisionInfoContainer) are
serialized into sub-streams of the current one. Since every
container "owns" its stream, it can freely decide upon
serialization order and sub-stream IDs.


4.7 Extensions

Additional information will probably be on a per-revision
basis. This info should then be stored in CRevisionInfoContainer.

Due to the simple internal per-column data model, new columns
and sub-tables can be added easily. Just follow the model of
the existing data. That means, you should extend existing
iterators for extra columns or introduce new ones for new
sub-tables.

Moreover, make sure to extend the Update() and Optimize()
methods accordingly. Otherwise, your data may no longer be
associated with the right revision index.


5. Log Cache Management
-----------------------

5.1 Granularity

There is one log cache (file) per repository. All updates
and access control work on that granularity.

The registry settings SupportAmbiguousURL and SupportAmbiguousUUID
control when two possibly different repositories are considered
the same.


5.2 Cache Pool


5.3 Coordinating Concurrent Access


5.4 Offline Access


5.5 Temporary Caches

