Starting a Deriva-Py Data Management Project¶
This document provides a best practice for starting a new project with a new ERMrest catalog and Hatrac object store, using Deriva-py client APIs to setup the project.
When starting, many projects do not have very complex access-control policy requirements. However, establishing a few good practices from the start can help immensely as the project evolves later:
- Create Globus groups representing access roles in the catalog.
- Create an initial ERMrest catalog and obtain a Hatrac object storage namespace.
- Set ERMrest catalog and Hatrac object store policies to grant rights to those groups.
- Add users to groups based on their expected role in the project.
- Gain experience and revisit policy design as you go.
Python Client Bindings¶
Throughout this document, we give snippets of example Python code
assuming you are using our derivapy
client library and have already
established active server credentials using the deriva-auth
graphical authentication agent. When combining these tools with
project-specific Globus groups, it is important to stop and restart
your deriva-auth
sessions after your group memberships have changed,
to avoid any confusion with stale membership affecting your subsequent
API requests.
This code establishes the presumed Python environment and all subsequent Python examples should be read as incrementally extending this environment in the order presented:
from deriva.core import DerivaServer, ErmrestCatalog, HatracStore, AttrDict, get_credential
from deriva.core.ermrest_model import builtin_types, Table, Column, Key, ForeignKey
# replace this with your real server FQDN
servername = "yourserver.example.com"
credentials = get_credential(servername)
Your Project-Specific Groups¶
A successful project will require several Globus groups to be created, each representing a meaningful role.
Suggested Groups¶
You may consider starting with a few designated roles in your project. The anonymous role is not represented by a Globus group, but rather embodies all clients without authentication status giving them any other role-based privileges, including real anonymous users without any login status as well as logged in users who do not belong to any of your project-designated roles:
Role | Purpose |
---|---|
admin | Owns catalog and can manipulate models, policies, and data. |
curator | Trusted to create/read/update/delete all data but not model nor policy. |
writers | Trusted to create and edit their own records and to read records created by others. |
readers | Trusted to read all data in catalog. |
anonymous | Not allowed to do anything useful. |
NOTE: A small project may skip the “writers” and “readers” groups during initial work, if all initial project members are equally trusted to managed and consume project data. The other roles may be introduced incrementally as the need arises. However, we will give examples as if these roles are all useful from the start.
Keeping Track of Group Identifiers¶
Each group has a unique hexadecimal identifier and corresponds to a simple role within the project. For use in our code snippets, we want to keep track of these identifiers:
# replace these with your real group IDs
grps = AttrDict({
"admin": "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"curator": "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"writer": "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"reader": "https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
})
this data structure allows us to access them by shorter role names,
e.g. grps.admin
when constructing policies later.
Your First Hatrac Namespace¶
An adminstrative client requests creation of a top-level namespace in the Hatrac object store and sets appropriate ACLs to allow subsequent management of data objects:
object_store = HatracStore('https', servername, credentials)
namespace = '/hatrac/project_data'
object_store.create_namespace(namespace)
object_store.set_acl(namespace, 'owner', [grps.admin])
object_store.set_acl(namespace, 'subtree-create', [grps.curator, grps.writer])
object_store.set_acl(namespace, 'subtree-update', [grps.curator])
object_store.set_acl(namespace, 'subtree-read', [grps.reader])
These requests can only succeed if the client represented by the credentials is granted namepsace creation rights on the root namepsace of the server. Often, a server administrator must assist with this first step.
The recommended policies above will allow curators and writers to create new objects, curators to update objects, and readers to read objects under the common project namespace. General Hatrac behavior also sets a per-object ownership ACL to the requesting client, so writers are able to update objects they have created but not those created by other writers nor curators.
Your First Catalog¶
An administrative client requests creation of a new catalog on an ERMrest server:
server = DerivaServer('https', servername, credentials)
catalog = server.create_ermrest_catalog()
This request can only succeed if the client represented by the
credentials is granted catalog creation rights on the server (via an
ACL in the ermrest_config.json
service configuration file).
Initial Catalog Content¶
Starting with a new catalog, your model is almost empty. Every new catalog already has:
- The administrative client is set as
owner
ACL on the catalog. - A
public
schema. - An
ermrest_client
table in thepublic
schema, hidden by default.
The administrative client can reconfigure the initial catalog policies and start creating project-specific model definitions.
You can inspect the ermrest_client
table definition:
model = catalog.getCatalogModel()
client_table = model.schemas["public"].tables["ermrest_client"]
# you can access the column definitions in their defined order in the table
for column in client_table.column_definitions:
print((column.name, column.type.typename, column.nullok))
# you can also look up columns by name or by position
column = client_table.column_definitions[0]
assert column == client_table.column_definitions[column.name]
The column definitions are summarized in the following table.
Column Name | Typename | Null OK | Description |
---|---|---|---|
RID | ermrest_rid | false | System-managed record identifier |
RCT | ermrest_rct | false | System-managed record creation timestamp |
RMT | ermrest_rmt | false | System-managed record last modification timestamp |
RCB | ermrest_rcb | true | System-managed record created by provenance |
RMB | ermrest_rmb | true | System-managed record last modified by provenance |
id | text | false | Authenticated client identifier i.e. https://auth.globus.org/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
display_name | text | true | Human-readable display name of client, if known |
full_name | text | true | Human-readable full name of client, if known |
text | true | Email address of client, if known | |
client_obj | jsonb | false | JSON object containing all attributes known from identity provider |
The first five columns RID
, RCT
, RMT
, RCB
, and RMB
are
ERMrest system columns present in every table. The remaining columns are
specific to the ermrest_client
table’s purpose.
Your First Catalog Policy¶
We can configure the catalog representing most of our desired roles:
# modify local representation of catalog ACL config
model.acls.update({
"owner": [grps.admin],
"insert": [grps.curator, grps.writer],
"update": [grps.curator],
"delete": [grps.curator],
"select": [grps.writer, grps.reader],
"enumerate": ["*"],
})
# apply these local config changes to the server
model.apply(catalog)
This policy is sufficient to allow initial project work using the admin, curator, and reader roles. However, to fully enable the less privileged writer role, you will also need to deploy additional data-dependent policies on each table where those writers are allowed to modify their own existing records. The preceding catalog-wide policy alone will allow them to create records on any table, but only curators and admins may edit existing records.
The enumerate
ACL configuration supports our companion ERMresolve
service but also exposes basic model structures to anonymous
clients. Please
see
further discussion on enumerable models.
Static Policy Inheritance¶
Because ERMrest allows elements of the model to inherit ACLs from enclosing scopes, a small project can set one catalog-wide ACL configuration and omit any localized ACLs on specific schemas, tables, columns, etc. This provides a consistent data access-control behavior across the entire catalog.
Later, if certain tables or columns require more sophisticated access control rules, you can override the catalog-wide ACL and supply a more specific local policy instead.
Exposing the ERMrest Client Table¶
The ermrest_client
table automatically records each newly signed on
user, providing a useful reference for catalog administrators to map
opaque Globus user identifiers into human-readable user profile information
such as full name, display name, or email addresses.
During catalog creation, this table receives a table-level policy
which sets all table ACLs to the empty list []
to suppress any ACLs
inherited from the public
schema or the catalog as a whole. This
effectively hides the table and its content from all but
administrative clients.
You may consider relaxing this policy to allow your other roles to see
basic user information. However, we recommend that you restrict
general users from accessing the client_obj
column unless you have
reviewed the available content and your own project’s privacy policies
and determine that it is appropriate to share this raw identity
provider information:
# mutate local configuration of client_table which is part of model
client_table.acls.update({
"select": [grps.curator, grps.writer, grps.reader],
})
client_table.column_definitions["client_obj"].acls.update({
"select": [],
})
# apply these local changes to server
model.apply(catalog)
This is particularly useful if you create foreign key relationships
from system-generated provenance columns to the id
key column of
this table. The RCB
and RMB
columns store the opaque identifiers
of clients creating or modifying records, respectively.
Extending Your Catalog¶
The ERMrest service provides a set of HTTP operations for managing the model of a catalog and our Python bindings provide some helper APIs to perform these tasks.
The preceding code snippets for configuring policies already use some features specific to managing ACLs. There are also specific API methods for managing the structure of your model, i.e. adding and removing tables, columns, and other model elements.
These Python API bindings are relatively simple wrappers for the raw
HTTP methods, using Python data structures to represent document
structures sent over the wire in JSON format. A Python client
programmer may familiarize themselves with the behavior of the
json.dumps
and json.loads
serialization/deserialization routines
to understand how Python dict
, list
, string, number, and boolean
types are mapped to JSON documents.
As a convenience, we provide some “factory” routines to help construct documents suitable as input to these model management APIs.
Adding a Table¶
When adding a table for our project, we only need to define project-specific columns while allowing the API to fill in the ERMrest system columns for us. Here is a trivial “Journal” example where we simply define a “Notes” column to store formatted markdown text, while relying on the system columns to provide keys and basic timestamp metadata:
model.schemas["public"].create_table(
Table.define(
"Journal",
[
Column.define(
"Notes",
builtin_types["markdown"],
nullok=False,
comment="User-provided notes.",
)
],
comment="A journal of user-provided notes."
)
)
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
Adding an Asset Table¶
Let’s embellish our Journal to allow the user to submit file attachments linked to their journal entries:
model.schemas["public"].create_table(
Table.define(
"Journal_Attachment",
[
Column.define(
"journal_rid",
builtin_types["text"],
nullok=False,
comment="The journal entry to which this asset is attached."
),
Column.define(
"url",
builtin_types["text"],
nullok=False,
comment="The URL of the stored attachment."
),
Column.define(
"length",
builtin_types["int8"],
nullok=False,
comment="The asset length (byte count)."
),
Column.define(
"md5",
builtin_types["text"],
nullok=False,
comment="The hexadecimal encoded MD5 checksum of the asset."
),
Column.define(
"content_type",
builtin_types["text"],
nullok=True,
comment="The content-type of the asset."
),
Column.define(
"file_name",
builtin_types["text"],
nullok=True,
comment="The suggested local filename on client systems."
),
],
fkey_defs=[
ForeignKey.define(
["journal_rid"],
"public",
"Journal",
["RID"],
on_delete="CASCADE",
constraint_names=[["public", "Journal_Attachment_journal_rid_fkey"]]
),
],
comment="Assets (files) attached to Journal entries.",
)
)
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
This example defines a number of standard metadata columns to
represent an uploaded file, as well as a foreign key journal_rid
linking
each asset to a specific Journal entry.
Asset Configuration¶
To get the best experience from the Chaise GUI, we also want to add model annotations declaring how we prefer to present this table:
attachment_table = model.schemas["public"].tables["Journal_Attachment"]
url_column = attachment_table.column_definitions["url"]
url_column.annotations.update({
"tag:isrd.isi.edu,2017:asset": {
"filename_column": "file_name",
"byte_count_column": "length",
"md5": "md5",
"url_pattern": "/hatrac/project_data/journal_attachment/{{{journal_rid}}}/{{{_url.md5_hex}}}"
}
})
attachment_table.annotations.update({
"tag:isrd.isi.edu,2015:display": {
"name_style": { "underline_space": True }
},
"tag:isrd.isi.edu,2016:visible-columns": {
"entry": [
["public", "Journal_Attachment_journal_rid_fkey"],
"url"
]
}
})
model.apply(catalog)
The first tag:isrd.isi.edu,2017:asset
annotation tells Chaise to
provide a more specialized asset behavior with upload/download
features when presenting the url
column:
- Present existing
url
content as a download link. - Offer a file-chooser to submit new files in data-entry forms.
- Name uploaded assets in the object store based on linked Journal entry RID and asset checksum.
- Store computed asset metadata to additional metadata columns: MD5, length, and original file name.
Without this, Chaise would just present it as a textual column containing a URL.
The second tag:isrd.isi.edu,2015:display
annotation simply tells
Chaise that we prefer to display the table name of
Journal_Attachements
as Journal Attachments
by rewriting the
underscore as whitespace.
The third tag:isrd.isi.edu,2016:visible-columns
annotation further
customizes data-entry in Chaise, while using default presentation in
other read-only contexts:
- Instead of display the raw value in
journal_rid
, interpret the foreign key linkage and present an item-chooser to link this Journal Attachment to an existing Journal record. - Prompt for
url
input, which will be displayed as a file-chooser due to the preceding asset annotation. - Don’t prompt for input on other excluded columns, since they will automatically be populated by the Chaise file uploader based on the preceding asset annotation.
Allowing Self-Service Editing by Writers¶
Our initial catalog policy allows writers to insert new records in our Journal table, but after that only curators can revise them. If we wish to enable self-service editing of journal entries by writers, we need to add a data-dependent policy to the table:
journal_table = model.schemas["public"].tables["Journal"]
attachment_table = model.schemas["public"].tables["Journal"]
# we can re-use this generic policy on any table since they all have an RCB column
self_service_policy = {
"self_service": {
"types": ["update", "delete"],
"projection": ["RCB"],
"projection_type": "acl"
}
}
journal_table.acl_bindings.update(self_service_policy)
attachment_table.acl_bindings.update(self_service_policy)
# apply these local config changes to the server
model.apply(catalog)
This slightly cryptic policy states that ERMrest should project the
RCB
(aka row created by) column and treat it like an ACL; when the
requesting client matches this extra ACL for an existing row, the
client is granted the update
and delete
access rights which they
otherwise would not have.
In practice, this means that after a client with writer role inserts a
Journal entry or Journal attachment, their client ID will be listed as
the RCB
and they will be permitted to make updates to or delete
their own journal entries.
Pitfall of Foreign Key Policies¶
While this policy mechanism is useful and important in some scenarios, there is a potential user-experience problem. The Chaise GUI does not understand the restriction and will offer the user choices which will subsequently fail when submitted to the server.
For now, use of such policies should include user training so they understand the policy of the project. Eventually, we hope to make the policy visible in some way to Chaise so that it can automatically restrict the offered choices to those which may succeed.
Adding a Column¶
Our trivial Journal example only has the system-managed timestamp information for when an entry is created. There is no option for an ERMrest client to override and “backdate” these timestamps if they wish to enter past information that was collected offline. We can add a Date column to model a user-provided timestamp:
journal_table.add_column(
"Date",
builtin_types["timestamptz"],
nullok=False,
comment="User-provided timestamp."
)
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
With no other special configuration, this column will experience the same policy as the existing Notes column.
Removing a Column¶
Assuming our users never really enter offline journal entries, they may report that the new Date column is cumbersome. We can remove it from the model:
journal_table.column_definitions["Date"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
Removing a Table¶
In the same manner, we can prune an entire table when we realize users do not actually find it useful:
model.schemas["public"].tables["Journal"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
This request will actually fail unless we first remove the dependent Journal_Attachment table (or at least the foreign key constraint linking the two):
model.schemas["public"].tables["Journal_Attachment"].delete()
model.schemas["public"].tables["Journal"].delete()
# retrieve catalog model again to ensure we reflect latest structural changes
model = catalog.getCurrentModel()
Note, even though we deleted the attachment table, all existing assets in the Hatrac object store will remain.
A project data administrator needs to consider data retention policies when deciding whether they should periodically search for abandoned/orphaned objects or retain them in case there are external URL references to these objects outside the ERMrest metadata catalog.
Other Model Changes…?¶
Further Discussion¶
Here we discuss related topics and clarify the background for certain recommendations.
Curators versus Writers¶
We suggest starting with the “curator” role. In a simple project, this may be the only data-writing role you need. As project needs evolve, you may introduce one or more other less-privileged groups who have some write privileges on specific bits of the catalog. However, you should still retain a curator role given more holistic privileges over the entire catalog, as embodied in this initial policy.
In our example configuration above, we designated one “writer” role. However, many projects may find that they have more than one class of writer, each with different granted privileges. For example, some tables may be written by one subgroup of users but not others. Or, some tables may have records associated with individual project sites, and a per-site writer group should be able to modify rows associated with their own site regardless of whether they were the user creating the record.
Anonymous Access¶
Our proposed policy configuration requires all users to be added to at least one group and to sign in before being able to access catalog content. Some projects reach a point where they wish to expose some of their data to the public, without requiring users to obtain accounts, be enrolled in groups, nor even sign in.
The "*"
wildcard in an ACL grants permission to anonymous clients.
So, projects can revise their policies to grant additional rights to
anonymous clients as projects evolve. However, the policy design
needed to provide appropriate anonymous access is beyond the scope of
this initial policy sketch. Oftentimes, a project’s intention to make
content visible to the public is actually more nuanced than simply
extending a blanket select
privilege to all comers. For example,
they may want to track a curation state or “release status” and only
expose records to the public when they have been reviewed and marked
as publication-ready.
Extending ERMrest Client Table Definition¶
More sophisticated projects may wish to extend the ERMrest client table with more localized user information. This is allowed, but you should remember that ERMrest automatically maintains the core client information columns and will update records on each return of the client. Thus, attempting to amend user profile information would be fruitless as it would be reset to the values provided by the authentication system. There are two reasonable approaches to extending the table:
- Add columns edited by curators (or by users themselves with appropriate self-service policies in place).
- Link the client table to other project tables via foreign keys or associations.
Enumerable Catalogs support ERMresolve¶
The enumerate
ACL means that the catalog can be detected even by
anonymous clients. This is useful to allow our companion ERMresolve
service to easily support resolution of short record IDs against a
catalog:
- Ermresolve URL: /id/xyz
- ERMrest resolution support URL: /ermrest/catalog/1/entity_rid/xyz
- ERMrest record URL: /ermrest/catalog/1/entity/Schema1:Table1/RID=xyz
- Chaise record URL: /chaise/record/#1/Schema1:Table1/RID=xyz
During normal operation, the resolver probes the resolution support
URL which is governed by the enumerate
ACL. On successful
resolution, the resolver redirects HTTP clients to Chaise or ERMrest
record URLs. The client encounters normal data access control
enforcement when they attempt to access the real record.
However, this resolution mechanism also means that anonymous clients
may be able to discover the model definitions in your catalog. If you
plan to use the resolver service you should avoid using model element
names or structures which you would consider too sensitive to reveal
to anonymous clients. Conversely, if you have sensitive model
structures you can render your system safe by setting the enumerate
ACL to the empty list []
and hence disabling ERMresolve.
If you think you have a need to both hide sensitive parts of your model and use the resolution service, please reach out to us so we can better understand your use case requirements and provide better advice or possibly expand the capabilities of ERMrest to cover your scenario!