Starting a Deriva-Py Data Management Project

This document provides a best practice for starting a new project with a new ERMrest catalog and Hatrac object store, using Deriva-py client APIs to setup the project.

When starting, many projects do not have very complex access-control policy requirements. However, establishing a few good practices from the start can help immensely as the project evolves later:

  1. Create Globus groups representing access roles in the catalog.
  2. Create an initial ERMrest catalog and obtain a Hatrac object storage namespace.
  3. Set ERMrest catalog and Hatrac object store policies to grant rights to those groups.
  4. Add users to groups based on their expected role in the project.
  5. Gain experience and revisit policy design as you go.

Python Client Bindings

Throughout this document, we give snippets of example Python code assuming you are using our derivapy client library and have already established active server credentials using the deriva-auth graphical authentication agent. When combining these tools with project-specific Globus groups, it is important to stop and restart your deriva-auth sessions after your group memberships have changed, to avoid any confusion with stale membership affecting your subsequent API requests.

This code establishes the presumed Python environment and all subsequent Python examples should be read as incrementally extending this environment in the order presented:

from deriva.core import DerivaServer, ErmrestCatalog, HatracStore, AttrDict, get_credential
from deriva.core.ermrest_model import builtin_types, Table, Column, Key, ForeignKey

# replace this with your real server FQDN
servername = "yourserver.example.com"
credentials = get_credential(servername)

Your Project-Specific Groups

A successful project will require several Globus groups to be created, each representing a meaningful role.

Suggested Groups

You may consider starting with a few designated roles in your project. The anonymous role is not represented by a Globus group, but rather embodies all clients without authentication status giving them any other role-based privileges, including real anonymous users without any login status as well as logged in users who do not belong to any of your project-designated roles:

Role Purpose
admin Owns catalog and can manipulate models, policies, and data.
curator Trusted to create/read/update/delete all data but not model nor policy.
writers Trusted to create and edit their own records and to read records created by others.
readers Trusted to read all data in catalog.
anonymous Not allowed to do anything useful.

NOTE: A small project may skip the “writers” and “readers” groups during initial work, if all initial project members are equally trusted to managed and consume project data. The other roles may be introduced incrementally as the need arises. However, we will give examples as if these roles are all useful from the start.

Keeping Track of Group Identifiers

Each group has a unique hexadecimal identifier and corresponds to a simple role within the project. For use in our code snippets, we want to keep track of these identifiers:

# replace these with your real group IDs
grps = AttrDict({
  "admin":   "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "curator": "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "writer":  "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "reader":  "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
})

this data structure allows us to access them by shorter role names, e.g. grps.admin when constructing policies later.

Your First Hatrac Namespace

An adminstrative client requests creation of a top-level namespace in the Hatrac object store and sets appropriate ACLs to allow subsequent management of data objects:

object_store = HatracStore('https', servername, credentials)
namespace = '/hatrac/project_data'
object_store.create_namespace(namespace)
object_store.set_acl(namespace, 'owner', [grps.admin])
object_store.set_acl(namespace, 'subtree-create', [grps.curator, grps.writer])
object_store.set_acl(namespace, 'subtree-update', [grps.curator])
object_store.set_acl(namespace, 'subtree-read', [grps.reader])

These requests can only succeed if the client represented by the credentials is granted namepsace creation rights on the root namepsace of the server. Often, a server administrator must assist with this first step.

The recommended policies above will allow curators and writers to create new objects, curators to update objects, and readers to read objects under the common project namespace. General Hatrac behavior also sets a per-object ownership ACL to the requesting client, so writers are able to update objects they have created but not those created by other writers nor curators.

Your First Catalog

An administrative client requests creation of a new catalog on an ERMrest server:

server = DerivaServer('https', servername, credentials)
catalog = server.create_ermrest_catalog()

This request can only succeed if the client represented by the credentials is granted catalog creation rights on the server (via an ACL in the ermrest_config.json service configuration file).

Initial Catalog Content

Starting with a new catalog, your model is almost empty. Every new catalog already has:

  • The administrative client is set as owner ACL on the catalog.
  • A public schema.
  • An ermrest_client table in the public schema, hidden by default.

The administrative client can reconfigure the initial catalog policies and start creating project-specific model definitions.

You can inspect the ermrest_client table definition:

model = catalog.getCatalogModel()
client_table = model.schemas["public"].tables["ermrest_client"]

# you can access the column definitions in their defined order in the table
for column in client_table.column_definitions:
  print((column.name, column.type.typename, column.nullok))

# you can also look up columns by name or by position
column = client_table.column_definitions[0]
assert column == client_table.column_definitions[column.name]

The column definitions are summarized in the following table.

Column Name Typename Null OK Description
RID ermrest_rid false System-managed record identifier
RCT ermrest_rct false System-managed record creation timestamp
RMT ermrest_rmt false System-managed record last modification timestamp
RCB ermrest_rcb true System-managed record created by provenance
RMB ermrest_rmb true System-managed record last modified by provenance
id text false Authenticated client identifier i.e. https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
display_name text true Human-readable display name of client, if known
full_name text true Human-readable full name of client, if known
email text true Email address of client, if known
client_obj jsonb false JSON object containing all attributes known from identity provider

The first five columns RID, RCT, RMT, RCB, and RMB are ERMrest system columns present in every table. The remaining columns are specific to the ermrest_client table’s purpose.

Your First Catalog Policy

We can configure the catalog representing most of our desired roles:

# modify local representation of catalog ACL config
model.acls.update({
  "owner": [grps.admin],
  "insert": [grps.curator, grps.writer],
  "update": [grps.curator],
  "delete": [grps.curator],
  "select": [grps.writer, grps.reader],
  "enumerate": ["*"],
})
# apply these local config changes to the server
model.apply(catalog)

This policy is sufficient to allow initial project work using the admin, curator, and reader roles. However, to fully enable the less privileged writer role, you will also need to deploy additional data-dependent policies on each table where those writers are allowed to modify their own existing records. The preceding catalog-wide policy alone will allow them to create records on any table, but only curators and admins may edit existing records.

The enumerate ACL configuration supports our companion ERMresolve service but also exposes basic model structures to anonymous clients. Please see further discussion on enumerable models.

Static Policy Inheritance

Because ERMrest allows elements of the model to inherit ACLs from enclosing scopes, a small project can set one catalog-wide ACL configuration and omit any localized ACLs on specific schemas, tables, columns, etc. This provides a consistent data access-control behavior across the entire catalog.

Later, if certain tables or columns require more sophisticated access control rules, you can override the catalog-wide ACL and supply a more specific local policy instead.

Exposing the ERMrest Client Table

The ermrest_client table automatically records each newly signed on user, providing a useful reference for catalog administrators to map opaque Globus user identifiers into human-readable user profile information such as full name, display name, or email addresses.

During catalog creation, this table receives a table-level policy which sets all table ACLs to the empty list [] to suppress any ACLs inherited from the public schema or the catalog as a whole. This effectively hides the table and its content from all but administrative clients.

You may consider relaxing this policy to allow your other roles to see basic user information. However, we recommend that you restrict general users from accessing the client_obj column unless you have reviewed the available content and your own project’s privacy policies and determine that it is appropriate to share this raw identity provider information:

# mutate local configuration of client_table which is part of model
client_table.acls.update({
  "select": [grps.curator, grps.writer, grps.reader],
})
client_table.column_definitions["client_obj"].acls.update({
  "select": [],    
})

# apply these local changes to server
model.apply(catalog)

This is particularly useful if you create foreign key relationships from system-generated provenance columns to the id key column of this table. The RCB and RMB columns store the opaque identifiers of clients creating or modifying records, respectively.

Extending Your Catalog

The ERMrest service provides a set of HTTP operations for managing the model of a catalog and our Python bindings provide some helper APIs to perform these tasks.

The preceding code snippets for configuring policies already use some features specific to managing ACLs. There are also specific API methods for managing the structure of your model, i.e. adding and removing tables, columns, and other model elements.

These Python API bindings are relatively simple wrappers for the raw HTTP methods, using Python data structures to represent document structures sent over the wire in JSON format. A Python client programmer may familiarize themselves with the behavior of the json.dumps and json.loads serialization/deserialization routines to understand how Python dict, list, string, number, and boolean types are mapped to JSON documents.

As a convenience, we provide some “factory” routines to help construct documents suitable as input to these model management APIs.

Adding a Table

When adding a table for our project, we only need to define project-specific columns while allowing the API to fill in the ERMrest system columns for us. Here is a trivial “Journal” example where we simply define a “Notes” column to store formatted markdown text, while relying on the system columns to provide keys and basic timestamp metadata:

model.schemas["public"].create_table(
  Table.define(
    "Journal",
    [
      Column.define(
        "Notes",
        builtin_types["markdown"],
        nullok=False,
        comment="User-provided notes.",
      )
    ],
    comment="A journal of user-provided notes."
  )
)

# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

Adding an Asset Table

Let’s embellish our Journal to allow the user to submit file attachments linked to their journal entries:

model.schemas["public"].create_table(
  Table.define(
    "Journal_Attachment",
    [
      Column.define(
        "journal_rid",
        builtin_types["text"],
        nullok=False,
        comment="The journal entry to which this asset is attached."
      ),
      Column.define(
        "url",
        builtin_types["text"],
        nullok=False,
        comment="The URL of the stored attachment."
      ),
      Column.define(
        "length",
        builtin_types["int8"],
        nullok=False,
        comment="The asset length (byte count)."
      ),
      Column.define(
        "md5",
        builtin_types["text"],
        nullok=False,
        comment="The hexadecimal encoded MD5 checksum of the asset."
      ),
      Column.define(
        "content_type",
        builtin_types["text"],
        nullok=True,
        comment="The content-type of the asset."
      ),
      Column.define(
        "file_name",
        builtin_types["text"],
        nullok=True,
        comment="The suggested local filename on client systems."
      ),
    ],
    fkey_defs=[
      ForeignKey.define(
        ["journal_rid"],
        "public",
        "Journal",
        ["RID"],
        on_delete="CASCADE",
        constraint_names=[["public", "Journal_Attachment_journal_rid_fkey"]]
      ),
    ],
    comment="Assets (files) attached to Journal entries.",
  )
)

# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

This example defines a number of standard metadata columns to represent an uploaded file, as well as a foreign key journal_rid linking each asset to a specific Journal entry.

Asset Configuration

To get the best experience from the Chaise GUI, we also want to add model annotations declaring how we prefer to present this table:

attachment_table = model.schemas["public"].tables["Journal_Attachment"]
url_column = attachment_table.column_definitions["url"]

url_column.annotations.update({
  "tag:isrd.isi.edu,2017:asset": {
    "filename_column": "file_name",
    "byte_count_column": "length",
    "md5": "md5",
    "url_pattern": "/hatrac/project_data/journal_attachment/{{{journal_rid}}}/{{{_url.md5_hex}}}"
  }
})

attachment_table.annotations.update({
  "tag:isrd.isi.edu,2015:display": {
    "name_style": { "underline_space": True }
  },
  "tag:isrd.isi.edu,2016:visible-columns": {
    "entry": [
      ["public", "Journal_Attachment_journal_rid_fkey"],
      "url"
    ]
  }
})

model.apply(catalog)

The first tag:isrd.isi.edu,2017:asset annotation tells Chaise to provide a more specialized asset behavior with upload/download features when presenting the url column:

  • Present existing url content as a download link.
  • Offer a file-chooser to submit new files in data-entry forms.
  • Name uploaded assets in the object store based on linked Journal entry RID and asset checksum.
  • Store computed asset metadata to additional metadata columns: MD5, length, and original file name.

Without this, Chaise would just present it as a textual column containing a URL.

The second tag:isrd.isi.edu,2015:display annotation simply tells Chaise that we prefer to display the table name of Journal_Attachements as Journal Attachments by rewriting the underscore as whitespace.

The third tag:isrd.isi.edu,2016:visible-columns annotation further customizes data-entry in Chaise, while using default presentation in other read-only contexts:

  • Instead of display the raw value in journal_rid, interpret the foreign key linkage and present an item-chooser to link this Journal Attachment to an existing Journal record.
  • Prompt for url input, which will be displayed as a file-chooser due to the preceding asset annotation.
  • Don’t prompt for input on other excluded columns, since they will automatically be populated by the Chaise file uploader based on the preceding asset annotation.

Allowing Self-Service Editing by Writers

Our initial catalog policy allows writers to insert new records in our Journal table, but after that only curators can revise them. If we wish to enable self-service editing of journal entries by writers, we need to add a data-dependent policy to the table:

journal_table = model.schemas["public"].tables["Journal"]
attachment_table = model.schemas["public"].tables["Journal"]

# we can re-use this generic policy on any table since they all have an RCB column
self_service_policy = {
  "self_service": {
    "types": ["update", "delete"],
    "projection": ["RCB"],
    "projection_type": "acl"
  }
}

journal_table.acl_bindings.update(self_service_policy)
attachment_table.acl_bindings.update(self_service_policy)

# apply these local config changes to the server
model.apply(catalog)

This slightly cryptic policy states that ERMrest should project the RCB (aka row created by) column and treat it like an ACL; when the requesting client matches this extra ACL for an existing row, the client is granted the update and delete access rights which they otherwise would not have.

In practice, this means that after a client with writer role inserts a Journal entry or Journal attachment, their client ID will be listed as the RCB and they will be permitted to make updates to or delete their own journal entries.

Preventing Attachments by Unrelated Writers

The policy so far is a little bit trusting. It does not prevent one writer from introducing attachments to a different writer’s Journal entries. To restrict this, we can leverage a more subtle ERMrest feature which allows policies on foreign keys.

Rather than determining row permissions based on row-specific content like RCB of the row being modified, such policies can restrict the expression of a foreign key value in the controlled foreign key column based on row-specific content of the referenced entity. In this case, we can make sure writers can only link an attachment to a Journal record if they are the creator of that Journal entry:

journal_fkey = attachment_table.foreign_keys[
  ("public", "Journal_Attachment_journal_rid_fkey")
]

journal_fkey.acls.update({
  "insert": [grps.curator],
  "update": [grps.curator],
})

journal_fkey.acl_bindings.update({
  "self_linkage": {
    "types": ["insert", "update"],
    "projection": ["RCB"],
    "projection_type": "acl",
  }
})

The preceding static ACLs overrides the ERMrest default behavior assuming an ACL of ["*"] to allow unconstrained use of foreign key values; in this case, we still want to allow curators to adjust attachment linkage regardless of whether they created a Journal entry or not.

The dynamic ACL binding is much like the self-service binding introduced earlier. There are two subtle differences:

  1. The projected RCB column comes from the referred row in the Journal table rather than the Journal_Attachment table row being modified.
  2. Because the projected row already exists, we can set a policy for the insert access type, i.e. while creating a new Journal_Attachment record. Regular table (and column) ACL bindings cannot control insert access since there is no existing row to consult when determining access privileges.

Pitfall of Foreign Key Policies

While this policy mechanism is useful and important in some scenarios, there is a potential user-experience problem. The Chaise GUI does not understand the restriction and will offer the user choices which will subsequently fail when submitted to the server.

For now, use of such policies should include user training so they understand the policy of the project. Eventually, we hope to make the policy visible in some way to Chaise so that it can automatically restrict the offered choices to those which may succeed.

Adding a Column

Our trivial Journal example only has the system-managed timestamp information for when an entry is created. There is no option for an ERMrest client to override and “backdate” these timestamps if they wish to enter past information that was collected offline. We can add a Date column to model a user-provided timestamp:

journal_table.add_column(
  "Date",
  builtin_types["timestamptz"],
  nullok=False,
  comment="User-provided timestamp."
)
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

With no other special configuration, this column will experience the same policy as the existing Notes column.

Removing a Column

Assuming our users never really enter offline journal entries, they may report that the new Date column is cumbersome. We can remove it from the model:

journal_table.column_definitions["Date"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

Removing a Table

In the same manner, we can prune an entire table when we realize users do not actually find it useful:

model.schemas["public"].tables["Journal"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

This request will actually fail unless we first remove the dependent Journal_Attachment table (or at least the foreign key constraint linking the two):

model.schemas["public"].tables["Journal_Attachment"].delete()
model.schemas["public"].tables["Journal"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()

Note, even though we deleted the attachment table, all existing assets in the Hatrac object store will remain.

A project data administrator needs to consider data retention policies when deciding whether they should periodically search for abandoned/orphaned objects or retain them in case there are external URL references to these objects outside the ERMrest metadata catalog.

Other Model Changes…?

Further Discussion

Here we discuss related topics and clarify the background for certain recommendations.

Curators versus Writers

We suggest starting with the “curator” role. In a simple project, this may be the only data-writing role you need. As project needs evolve, you may introduce one or more other less-privileged groups who have some write privileges on specific bits of the catalog. However, you should still retain a curator role given more holistic privileges over the entire catalog, as embodied in this initial policy.

In our example configuration above, we designated one “writer” role. However, many projects may find that they have more than one class of writer, each with different granted privileges. For example, some tables may be written by one subgroup of users but not others. Or, some tables may have records associated with individual project sites, and a per-site writer group should be able to modify rows associated with their own site regardless of whether they were the user creating the record.

Anonymous Access

Our proposed policy configuration requires all users to be added to at least one group and to sign in before being able to access catalog content. Some projects reach a point where they wish to expose some of their data to the public, without requiring users to obtain accounts, be enrolled in groups, nor even sign in.

The "*" wildcard in an ACL grants permission to anonymous clients. So, projects can revise their policies to grant additional rights to anonymous clients as projects evolve. However, the policy design needed to provide appropriate anonymous access is beyond the scope of this initial policy sketch. Oftentimes, a project’s intention to make content visible to the public is actually more nuanced than simply extending a blanket select privilege to all comers. For example, they may want to track a curation state or “release status” and only expose records to the public when they have been reviewed and marked as publication-ready.

Extending ERMrest Client Table Definition

More sophisticated projects may wish to extend the ERMrest client table with more localized user information. This is allowed, but you should remember that ERMrest automatically maintains the core client information columns and will update records on each return of the client. Thus, attempting to amend user profile information would be fruitless as it would be reset to the values provided by the authentication system. There are two reasonable approaches to extending the table:

  1. Add columns edited by curators (or by users themselves with appropriate self-service policies in place).
  2. Link the client table to other project tables via foreign keys or associations.

Enumerable Catalogs support ERMresolve

The enumerate ACL means that the catalog can be detected even by anonymous clients. This is useful to allow our companion ERMresolve service to easily support resolution of short record IDs against a catalog:

  • Ermresolve URL: /id/xyz
  • ERMrest resolution support URL: /ermrest/catalog/1/entity_rid/xyz
  • ERMrest record URL: /ermrest/catalog/1/entity/Schema1:Table1/RID=xyz
  • Chaise record URL: /chaise/record/#1/Schema1:Table1/RID=xyz

During normal operation, the resolver probes the resolution support URL which is governed by the enumerate ACL. On successful resolution, the resolver redirects HTTP clients to Chaise or ERMrest record URLs. The client encounters normal data access control enforcement when they attempt to access the real record.

However, this resolution mechanism also means that anonymous clients may be able to discover the model definitions in your catalog. If you plan to use the resolver service you should avoid using model element names or structures which you would consider too sensitive to reveal to anonymous clients. Conversely, if you have sensitive model structures you can render your system safe by setting the enumerate ACL to the empty list [] and hence disabling ERMresolve.

If you think you have a need to both hide sensitive parts of your model and use the resolution service, please reach out to us so we can better understand your use case requirements and provide better advice or possibly expand the capabilities of ERMrest to cover your scenario!