Leaf Protocol Draft

The Leaf Protocol is a data format on top of the Willow Protocol. Willow deals with how data is replicated and stored amongst peers, providing access control as well as key-value-based byte storage. On top of Willow, Leaf provides:

  • A schema system and serialization format.
  • A standard for creating an Entity-Component "Web of Data".

See Introducing the Leaf Protocol for a high-level overview of the ideas and motivation.

ℹ️ Note: This specification is a work in progress and has not yet been updated with the latest ideas. See the Leaf issues for pending modifications.

Data Model

The three major components of the Leaf Protocol data model are Entities, Components, and Schemas.

Entities

An Entity represents any distinct "thing". This could be a chat message, a blog post, a comment, a feed, a profile, or anything else.

Entities are able to store data by attaching Components to them, and they may also be the target of Links.

Each Entity is stored in a Willow Namespace, under a specific Subspace and Path. The PathComponents must follow the rules below.

Note: In this spec we refer to Willow path components as PathComponents, to distinguish them from the Components that are attached to Entities.

Entity Path Components

Each PathComponent for an Entity, other than the last one, must be Borsh serialized data matching the following format:

enum PathComponent {
    Null,
    Bool(bool),
    Uint(u64),
    Int(i64),
    String(String),
    Bytes(Vec<u8>),
}

Additionally, the last PathComponent in the Path for an Entity must always be empty, i.e. zero bytes.

ℹ️ Explanation: The empty PathComponent at the end of each Path makes sure that creating an entity will never accidentally trigger prefix pruning and cause other entities to be deleted. See "prefix pruning" in the Willow data model.

This means it is possible to store an entity at "Hello"/"World"/1/[empty], and still be able to store an entity at "Hello"/"World"/[empty] without overwriting the first one.

This is useful for allowing "Feed" entities, or other similar group entities, that describe the purpose of, or add metadata relevant to, the entities in their sub-paths.

This also means that each NamespaceId + SubspaceId has one "default" entity, the one with only a single, empty PathComponent, that can be used to describe the subspace.
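As a rough illustration of the path rules above, the following sketch hand-encodes PathComponents with Borsh and appends the required empty component. It assumes Borsh's usual convention of numbering enum variants in declaration order; `to_borsh` and `entity_path` are hypothetical helper names, not part of this specification.

```rust
enum PathComponent {
    Null,
    Bool(bool),
    Uint(u64),
    Int(i64),
    String(String),
    Bytes(Vec<u8>),
}

impl PathComponent {
    /// Borsh encoding: a one-byte variant tag followed by the payload.
    /// Assumption: tags follow declaration order (Null = 0 ... Bytes = 5).
    fn to_borsh(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            PathComponent::Null => out.push(0),
            PathComponent::Bool(b) => out.extend_from_slice(&[1, *b as u8]),
            PathComponent::Uint(n) => {
                out.push(2);
                out.extend_from_slice(&n.to_le_bytes());
            }
            PathComponent::Int(n) => {
                out.push(3);
                out.extend_from_slice(&n.to_le_bytes());
            }
            PathComponent::String(s) => {
                out.push(4);
                // Strings are a little-endian u32 length followed by UTF-8 bytes.
                out.extend_from_slice(&(s.len() as u32).to_le_bytes());
                out.extend_from_slice(s.as_bytes());
            }
            PathComponent::Bytes(b) => {
                out.push(5);
                out.extend_from_slice(&(b.len() as u32).to_le_bytes());
                out.extend_from_slice(b);
            }
        }
        out
    }
}

/// Builds the Willow path for an Entity: every component is Borsh
/// encoded, and a final zero-byte component is appended.
fn entity_path(components: &[PathComponent]) -> Vec<Vec<u8>> {
    let mut path: Vec<Vec<u8>> = components.iter().map(PathComponent::to_borsh).collect();
    path.push(Vec::new());
    path
}
```

With this helper, an empty slice of components yields a path made of a single empty PathComponent, which corresponds to the "default" entity of a subspace.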

Entity Payloads

The Payload of an entity must be a sorted list of ComponentIds.

Note: Since ComponentIds are each PayloadDigests, they must be sorted according to the total order of the PayloadDigest. The Willow protocol requires that PayloadDigests have a total order.

This sorted list of ComponentIds is called an EntitySnapshot.

The PayloadDigest of the EntitySnapshot is called an EntitySnapshotId.
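A minimal sketch of building an EntitySnapshot follows. It assumes, purely for illustration, that a PayloadDigest is a 32-byte hash whose total order is plain bytewise comparison; the actual digest type and order depend on how Willow is instantiated. `entity_snapshot` is a hypothetical helper name.

```rust
/// Assumption: a PayloadDigest is a 32-byte hash ordered bytewise.
type ComponentId = [u8; 32];

/// Returns the ComponentIds in the sorted order required for an EntitySnapshot.
fn entity_snapshot(mut component_ids: Vec<ComponentId>) -> Vec<ComponentId> {
    // Arrays of bytes compare lexicographically, which gives a total order.
    component_ids.sort();
    component_ids
}
```

One consequence of sorting is that the order in which Components were attached does not affect the resulting list, so the same set of Components should always produce the same EntitySnapshot.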

Components

Components are pieces of data that may be attached to an Entity. The PayloadDigest of a Component is called its ComponentId.

Components are stored individually in a content addressable store. This is usually the same store used by the Willow implementation for storing Entity Payloads.

The data of a Component is Borsh serialized data matching the following format:

enum Component {
    Unencrypted(ComponentData),
    Encrypted {
        algorithm: EncryptionAlgorithmId,
        key_id: [u8; 32],
        encrypted_data: Vec<u8>,
    }
}

struct ComponentData {
    schema: SchemaId,
    data: Vec<u8>,
}

In the ComponentData struct, the SchemaId is the ID of the Schema that describes the data field.
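To make the byte layout concrete, here is a hand-rolled Borsh encoder for the two types above. It is only a sketch: it assumes that SchemaId and EncryptionAlgorithmId are 32-byte digests and that enum variants are tagged in declaration order, and `put_bytes` and `to_borsh` are hypothetical helper names.

```rust
/// Assumption: both IDs are 32-byte digests.
type SchemaId = [u8; 32];
type EncryptionAlgorithmId = [u8; 32];

struct ComponentData {
    schema: SchemaId,
    data: Vec<u8>,
}

enum Component {
    Unencrypted(ComponentData),
    Encrypted {
        algorithm: EncryptionAlgorithmId,
        key_id: [u8; 32],
        encrypted_data: Vec<u8>,
    },
}

/// Borsh encodes a Vec<u8> as a little-endian u32 length followed by the bytes.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

impl ComponentData {
    fn to_borsh(&self) -> Vec<u8> {
        // Fixed-size arrays are written without a length prefix.
        let mut out = self.schema.to_vec();
        put_bytes(&mut out, &self.data);
        out
    }
}

impl Component {
    fn to_borsh(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            Component::Unencrypted(data) => {
                out.push(0); // variant tag
                out.extend(data.to_borsh());
            }
            Component::Encrypted { algorithm, key_id, encrypted_data } => {
                out.push(1); // variant tag
                out.extend_from_slice(algorithm);
                out.extend_from_slice(key_id);
                put_bytes(&mut out, encrypted_data);
            }
        }
        out
    }
}
```

The ComponentId would then be the PayloadDigest of the bytes returned by `Component::to_borsh`; the digest function itself is left out because it depends on the Willow configuration.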

Component Encryption

Components may be either encrypted or unencrypted. The Leaf Protocol allows any number of EncryptionAlgorithms to be implemented. It is up to implementations to choose which algorithms to support.

When a component is encrypted, the Borsh serialized data matching the ComponentData struct will be encrypted using the algorithm and stored in the encrypted_data field.

The interpretation of key_id may be different between different encryption algorithms. This field may be used to store the public key in asymmetric key algorithms for example. An algorithm may choose to put the entire key into the key_id field, or it may choose to store the key in the content addressed store and put its PayloadDigest in the key_id field.
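The wrapping step can be sketched as follows. The cipher is passed in as a closure because this specification does not define any particular EncryptionAlgorithm; `EncryptedComponent`, `component_data_to_borsh`, and `encrypt_component` are hypothetical names, and the 32-byte ID types are assumptions.

```rust
/// Assumption: both IDs are 32-byte digests.
type SchemaId = [u8; 32];
type EncryptionAlgorithmId = [u8; 32];

struct ComponentData {
    schema: SchemaId,
    data: Vec<u8>,
}

/// Mirrors the fields of the `Component::Encrypted` variant.
struct EncryptedComponent {
    algorithm: EncryptionAlgorithmId,
    key_id: [u8; 32],
    encrypted_data: Vec<u8>,
}

/// Borsh layout of ComponentData: 32 schema bytes, u32 length, data bytes.
fn component_data_to_borsh(component: &ComponentData) -> Vec<u8> {
    let mut out = component.schema.to_vec();
    out.extend_from_slice(&(component.data.len() as u32).to_le_bytes());
    out.extend_from_slice(&component.data);
    out
}

/// Serializes the whole ComponentData first, then encrypts the result.
/// `encrypt` stands in for whatever cipher the algorithm's specification describes.
fn encrypt_component(
    component: &ComponentData,
    algorithm: EncryptionAlgorithmId,
    key_id: [u8; 32],
    encrypt: impl Fn(&[u8]) -> Vec<u8>,
) -> EncryptedComponent {
    let plaintext = component_data_to_borsh(component);
    EncryptedComponent {
        algorithm,
        key_id,
        encrypted_data: encrypt(&plaintext),
    }
}
```

Because the schema field is part of the encrypted bytes, a reader without the key sees only the algorithm and key_id, not which Schema the Component uses.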

ℹ️ Explanation: This design allows for individual Components of an Entity to be encrypted, even if other components are not encrypted. This could be useful, for example, on a user profile, where the Email for the user profile might be encrypted so that the user can choose to share it only with specific users or services.

This does not prevent you from using Willow's own encryption mechanisms to encrypt the entire Entity or its Path.

Schemas

A Schema is a description of the data that is in a component. The PayloadDigest of a Schema is called a SchemaId.

The data of a schema is Borsh serialized data matching the following format:

struct Schema {
    name: String,
    format: BorshSchema,
    specification: EntitySnapshotId,
}

The name of the schema is a human-readable name, for documentation purposes only.

The format is a Borsh serialized BorshSchema. This BorshSchema may be used to deserialize the Component's data.

The specification is an EntitySnapshotId that represents the human-readable specification describing how the component data is meant to be interpreted.

ℹ️ Explanation (Component Specifications): While the format in a schema is enough information to deserialize the component data, it does not give humans enough information to understand how it should be used in an application. For example, two different components might have exactly the same format containing a single String type, even though one is meant to be an email address and the other is meant to be a name. It is the specification that distinguishes them from each other and provides guidance on how applications are meant to use the data.
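The email/name example can be illustrated with a small sketch that serializes two schemas whose format is a single String. It assumes BorshSchema variants are tagged in declaration order, which puts String at tag 14, and that an EntitySnapshotId is a 32-byte digest. `StringSchema` is a hypothetical type covering only this one format.

```rust
/// Assumption: an EntitySnapshotId is a 32-byte digest.
type EntitySnapshotId = [u8; 32];

/// Tag of `BorshSchema::String`, assuming variants are numbered in
/// declaration order (Null = 0, Bool = 1, ..., F64 = 13, String = 14).
const BORSH_SCHEMA_STRING_TAG: u8 = 14;

/// A Schema whose format is always `BorshSchema::String`.
struct StringSchema {
    name: String,
    specification: EntitySnapshotId,
}

impl StringSchema {
    /// Borsh layout of Schema: name, then format, then specification.
    fn to_borsh(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&(self.name.len() as u32).to_le_bytes());
        out.extend_from_slice(self.name.as_bytes());
        out.push(BORSH_SCHEMA_STRING_TAG);
        out.extend_from_slice(&self.specification);
        out
    }
}
```

Two such schemas that point at different specification snapshots serialize to different bytes, so their SchemaIds differ even though their formats are identical.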

ℹ️ Explanation (Documenting Specifications): Schemas, EncryptionAlgorithms, and KeyResolvers all use an EntitySnapshot to document their specifications. This means that the documentation itself is described by the Components in that EntitySnapshot.

The simplest form of documentation would be to add a single UTF-8 Component to the EntitySnapshot, containing a human explanation of the specification. Alternatives could include using a Markdown component or an HTML component. This is intentionally flexible, and may even include WASM modules if useful. (See the note on EncryptionAlgorithms.)

Since each Component used to document a specification must have its own Schema, with its own specification, you will always be able to follow the chain of specification components and their schemas until you get to a UTF-8 component.

Bootstrapping Schemas

Because Schema specifications are EntitySnapshots that, in turn, contain Components with their own Schemas, and all of them are linked by digest, it is impossible to create a Schema that uses itself in its specification documentation. In other words, Schemas and specifications create a Directed Acyclic Graph (DAG).

This situation means that the first schema that is ever created must have a specification that is set to an empty EntitySnapshot. This is called an unspecified schema.

Each specified schema must eventually, down the chain of components and their schemas, be documented by an unspecified schema. This is not ideal, so we define one special case of unspecified schema, the UTF-8 Schema.

The UTF-8 Schema

When all the following are true of a Schema, it describes the UTF-8 schema:

All BorshSchema::String types are required to be UTF-8 strings. The UTF-8 Schema has no specific meaning beyond its own contents, and it is primarily meant for use in the specifications of other components.

Unspecified Schemas

When any schema that is not the UTF-8 Schema has a specification set to the empty EntitySnapshot, it is called an unspecified schema.

Unspecified schemas are generally discouraged, because although the format will describe the data layout of a component with the schema, it does not give any indication how that data is meant to be interpreted by apps or humans. This makes unspecified schemas ambiguous, and one app may interpret an unspecified schema in a different way than another app.

Still, nothing prevents the creation of unspecified schemas, so they are allowed to exist and be used on Entities.

Borsh Schemas

BorshSchemas are used to describe the binary format of Component data. BorshSchemas are themselves serialized with Borsh according to the following format:

enum BorshSchema {
    Null,
    Bool,
    U8,
    U16,
    U32,
    U64,
    U128,
    I8,
    I16,
    I32,
    I64,
    I128,
    F32,
    F64,
    String,
    Option {
        schema: BorshSchema
    },
    Array {
        schema: BorshSchema,
        len: u32,
    },
    Struct {
        fields: Vec<(String, BorshSchema)>,
    },
    Enum {
        variants: Vec<(String, BorshSchema)>,
    },
    Vector {
        schema: BorshSchema
    },
    Map {
        key: BorshSchema,
        value: BorshSchema,
    },
    Set {
        schema: BorshSchema
    },
    Blob,
    Snapshot,
    Link,
}

The BorshSchema allows us to represent the Borsh data model so that we can deserialize component data with it. We make a couple of modifications to the standard Borsh data model:

  • We remove tuples. Structs are clearer and take up no more space for per-component storage.
  • We add Blob, Snapshot, and Link types.
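As a sketch of what schema-driven deserialization could look like, the decoder below walks a BorshSchema and reads a matching value out of a byte slice. It covers only a subset of the variants, uses `Box` for the nested schemas so that the recursive type compiles in Rust, and introduces a hypothetical `Value` type for the decoded result; none of these names are defined by this specification.

```rust
/// Subset of the BorshSchema variants, with nested schemas boxed.
#[derive(Debug, Clone, PartialEq)]
enum BorshSchema {
    Bool,
    U8,
    U32,
    String,
    Option { schema: Box<BorshSchema> },
    Vector { schema: Box<BorshSchema> },
    Struct { fields: Vec<(String, BorshSchema)> },
}

/// Hypothetical dynamically typed result of decoding.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Uint(u64),
    String(String),
    Option(Option<Box<Value>>),
    Vector(Vec<Value>),
    Struct(Vec<(String, Value)>),
}

/// Splits `n` bytes off the front of `input`, or returns None if too short.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *input;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *input = tail;
    Some(head)
}

/// Reads a little-endian u32, used by Borsh for lengths.
fn read_u32(input: &mut &[u8]) -> Option<u32> {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(take(input, 4)?);
    Some(u32::from_le_bytes(buf))
}

/// Decodes one value described by `schema`, advancing `input` past it.
fn decode(schema: &BorshSchema, input: &mut &[u8]) -> Option<Value> {
    Some(match schema {
        // Lenient: any non-zero byte is treated as true.
        BorshSchema::Bool => Value::Bool(take(input, 1)?[0] != 0),
        BorshSchema::U8 => Value::Uint(take(input, 1)?[0] as u64),
        BorshSchema::U32 => Value::Uint(read_u32(input)? as u64),
        BorshSchema::String => {
            let len = read_u32(input)? as usize;
            Value::String(String::from_utf8(take(input, len)?.to_vec()).ok()?)
        }
        BorshSchema::Option { schema: inner } => match take(input, 1)?[0] {
            0 => Value::Option(None),
            _ => Value::Option(Some(Box::new(decode(inner, input)?))),
        },
        BorshSchema::Vector { schema: inner } => {
            let len = read_u32(input)?;
            let mut items = Vec::new();
            for _ in 0..len {
                items.push(decode(inner, input)?);
            }
            Value::Vector(items)
        }
        BorshSchema::Struct { fields } => {
            // Struct fields are stored back to back, in declaration order.
            let mut values = Vec::new();
            for (name, field_schema) in fields {
                values.push((name.clone(), decode(field_schema, input)?));
            }
            Value::Struct(values)
        }
    })
}
```

The field names live only in the schema, not in the component bytes, which is why an application needs the Schema to make sense of a Component's data.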

Blobs

A BorshSchema::Blob is serialized/deserialized as a PayloadDigest.

A blob allows you to separate large binary data from the other data in a component. For example, an Image component might describe the mime_type and the size of an image and store the data of the image as a Blob. Doing this allows you to download the image metadata without having to download the entire image when you read the component.
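The Image example might look like the sketch below. The field names follow the paragraph above, but their types are assumptions: `size` is taken to be the image's length in bytes, and a PayloadDigest is taken to be 32 bytes.

```rust
/// Assumption: a PayloadDigest is a 32-byte hash.
type PayloadDigest = [u8; 32];

/// Hypothetical Image component; `data` is a Blob, so only the
/// digest of the image bytes is stored in the component itself.
struct Image {
    mime_type: String,
    size: u64,
    data: PayloadDigest,
}

impl Image {
    /// Borsh layout: length-prefixed string, little-endian u64, 32 digest bytes.
    fn to_borsh(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&(self.mime_type.len() as u32).to_le_bytes());
        out.extend_from_slice(self.mime_type.as_bytes());
        out.extend_from_slice(&self.size.to_le_bytes());
        out.extend_from_slice(&self.data);
        out
    }
}
```

Under these assumptions the serialized component for a PNG is 53 bytes regardless of how large the image is; the image bytes are fetched separately by digest only when needed.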

Snapshots

A BorshSchema::Snapshot is serialized/deserialized as an EntitySnapshotId.

A snapshot is similar in purpose to a Link but without a path. This may be useful for things like edit history components, where the older versions of the entity are not stored at any entity path anymore, but their snapshots are stored in a component on the new version of the entity.

Links

A BorshSchema::Link is serialized/deserialized using Borsh with the following structure:

struct Link {
    namespace: KeyResolverKind,
    subspace: KeyResolverKind,
    path: Vec<PathComponent>,
    snapshot: Option<EntitySnapshotId>
}

enum KeyResolverKind {
    Inline([u8; 32]),
    Custom {
        id: KeyResolverId,
        data: Vec<u8>,
    },
}

A Link is a reference to an Entity. Links allow us to build expressive graphs with our entities.

The path in the Link specifies the path to the entity in the namespace and subspace.

The optional snapshot allows you to put the EntitySnapshotId in the link, so that even if the entity is changed, moved, or deleted you can still load the data of the entity, at the time that the link was made.

The namespace and subspace specify the KeyResolverKind used to look up the public keys that identify the spaces. The simplest KeyResolverKind is the Inline variant, which lets you hard-code the key. The Custom variant allows you to use any KeyResolver, together with the data input to that KeyResolver.
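For illustration, a Link can be serialized by hand as shown below. The sketch assumes 32-byte IDs and Borsh's declaration-order variant tags, and the `put_*` helpers and `link_to_borsh` are hypothetical names.

```rust
/// Assumption: both IDs are 32-byte digests.
type EntitySnapshotId = [u8; 32];
type KeyResolverId = [u8; 32];

enum PathComponent {
    Null,
    Bool(bool),
    Uint(u64),
    Int(i64),
    String(String),
    Bytes(Vec<u8>),
}

enum KeyResolverKind {
    Inline([u8; 32]),
    Custom { id: KeyResolverId, data: Vec<u8> },
}

struct Link {
    namespace: KeyResolverKind,
    subspace: KeyResolverKind,
    path: Vec<PathComponent>,
    snapshot: Option<EntitySnapshotId>,
}

/// Little-endian u32 length followed by the bytes.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

fn put_component(out: &mut Vec<u8>, component: &PathComponent) {
    match component {
        PathComponent::Null => out.push(0),
        PathComponent::Bool(b) => out.extend_from_slice(&[1, *b as u8]),
        PathComponent::Uint(n) => { out.push(2); out.extend_from_slice(&n.to_le_bytes()); }
        PathComponent::Int(n) => { out.push(3); out.extend_from_slice(&n.to_le_bytes()); }
        PathComponent::String(s) => { out.push(4); put_bytes(out, s.as_bytes()); }
        PathComponent::Bytes(b) => { out.push(5); put_bytes(out, b); }
    }
}

fn put_resolver(out: &mut Vec<u8>, kind: &KeyResolverKind) {
    match kind {
        KeyResolverKind::Inline(key) => { out.push(0); out.extend_from_slice(key); }
        KeyResolverKind::Custom { id, data } => {
            out.push(1);
            out.extend_from_slice(id);
            put_bytes(out, data);
        }
    }
}

/// Fields are written in declaration order, as Borsh does for structs.
fn link_to_borsh(link: &Link) -> Vec<u8> {
    let mut out = Vec::new();
    put_resolver(&mut out, &link.namespace);
    put_resolver(&mut out, &link.subspace);
    // A Vec is a u32 element count followed by each element.
    out.extend_from_slice(&(link.path.len() as u32).to_le_bytes());
    for component in &link.path {
        put_component(&mut out, component);
    }
    // An Option is a 0 or 1 byte, followed by the value when present.
    match &link.snapshot {
        None => out.push(0),
        Some(id) => { out.push(1); out.extend_from_slice(id); }
    }
    out
}
```

A link with two Inline keys and the path "Hello"/1 takes 90 bytes in this encoding, or 122 bytes when a snapshot is pinned.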

Key Resolvers

A KeyResolver is a specification that describes a way to resolve some data to a NamespaceId or a SubspaceId. Key resolvers allow Links to have a level of indirection, possibly using DNS or other mechanisms such as DIDs to look up a key, instead of hardcoding it.

The PayloadDigest of a KeyResolver is called a KeyResolverId.

Key resolvers contain a specification, similar to a Schema, that documents how to implement the key resolver. Each app must decide which key resolvers to implement.

KeyResolvers are stored in a content addressed store, and their data is Borsh serialized data matching the following format:

struct KeyResolver {
    name: String,
    specification: EntitySnapshotId,
}

The name of the KeyResolver is a human-readable name, for documentation purposes only.

The specification is the ID of an EntitySnapshot documenting the key resolution process. The specification is usually human documentation that preferably includes all of the information necessary to implement the key resolver.
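Since each app decides which key resolvers it implements, one plausible shape for an implementation is a registry keyed by KeyResolverId. The sketch below is an assumption about application structure rather than something this specification prescribes: `ResolverRegistry` is a hypothetical type, and keys and IDs are assumed to be 32 bytes.

```rust
use std::collections::HashMap;

/// Assumption: IDs and resolved keys are 32 bytes.
type KeyResolverId = [u8; 32];
type Key = [u8; 32];

enum KeyResolverKind {
    Inline(Key),
    Custom { id: KeyResolverId, data: Vec<u8> },
}

/// Hypothetical per-app table of the key resolvers it has implemented.
struct ResolverRegistry {
    resolvers: HashMap<KeyResolverId, Box<dyn Fn(&[u8]) -> Option<Key>>>,
}

impl ResolverRegistry {
    fn new() -> Self {
        ResolverRegistry { resolvers: HashMap::new() }
    }

    /// Registers an implementation for the resolver with the given ID.
    fn register(&mut self, id: KeyResolverId, resolver: impl Fn(&[u8]) -> Option<Key> + 'static) {
        self.resolvers.insert(id, Box::new(resolver));
    }

    /// Resolves a KeyResolverKind to a key, or None if the resolver is
    /// not implemented by this app or cannot resolve the data.
    fn resolve(&self, kind: &KeyResolverKind) -> Option<Key> {
        match kind {
            KeyResolverKind::Inline(key) => Some(*key),
            KeyResolverKind::Custom { id, data } => {
                let resolver = self.resolvers.get(id)?;
                resolver(data.as_slice())
            }
        }
    }
}
```

A Link whose Custom resolver is unknown to the app simply fails to resolve here; how an app should surface that case is not covered by this sketch.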

Encryption Algorithms

An EncryptionAlgorithm is a specification describing how data may be encrypted and decrypted.

The PayloadDigest of an EncryptionAlgorithm is called an EncryptionAlgorithmId.

The data of an EncryptionAlgorithm is Borsh serialized data matching the following format:

struct EncryptionAlgorithm {
    name: String,
    specification: EntitySnapshotId,
}

The name is a human-readable name, for documentation purposes only.

The specification is the ID of an EntitySnapshot that documents the encryption algorithm. The specification is usually human documentation that preferably includes all of the information necessary to encrypt and decrypt data with the encryption algorithm.

Note: It is an interesting consideration that while the specification for an EncryptionAlgorithm should always include human documentation describing the algorithm, it might also contain additional machine-readable Components, such as a WASM module that can be used to actually perform the encryption and decryption.

If a standardized interface for encryption modules were developed, it might be possible to allow clients to automatically download and execute compatible encryption modules.

This kind of standard is allowed to develop on top of the Leaf protocol independently. Details on how this might be done are out of scope for this specification.

Notes

The Leaf Protocol specifies a data format on top of Willow storage, and not much else. All of the Willow features such as the Meadowcap capability system can work with Leaf seamlessly.

The goal of Leaf is simply to provide a more expressive format for storing rich data that can be incrementally understood by different applications.