deriva-download-cli¶
The deriva-download-cli is a command-line utility for orchestrating the bulk export of tabular data (stored in ERMRest) and the download of asset data (stored in Hatrac, or in other supported HTTP-accessible object stores). It supports transferring data directly to local filesystems, or packaging results into the bagit container format. The program is driven by the combined usage of command-line arguments and a JSON-based configuration (“spec”) file, which contains the processing directives used to orchestrate the creation of the result data set.
Features¶
- Transfer both tabular data and file assets from Deriva catalogs.
- Create bag containers, which may reference files stored in remote locations.
- Support an extensible processing pipeline whereby data may be run through transform functions or other arbitrary processing before final result packaging.
Command-Line options¶
usage: deriva-download-cli.py [-h] [--version] [--quiet] [--debug]
[--credential-file <file>] [--catalog <1>]
[--token <auth-token>]
<host> <config file> <output dir> ...
Deriva Data Download Utility - CLI
positional arguments:
<host> Fully qualified host name.
<config file> Path to a configuration file.
<output dir> Path to an output directory.
[key=value key=value ...]
Variable length of whitespace-delimited key=value pair
arguments used for string interpolation in specific
parts of the configuration file. For example:
key1=value1 key2=value2
optional arguments:
-h, --help show this help message and exit
--version Print version and exit.
--quiet Suppress logging output.
--debug Enable debug logging output.
--credential-file <file>
Optional path to a credential file.
--catalog <1> Catalog number. Default: 1
--token <auth-token> Authorization bearer token.
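For example, a typical invocation (the host name, spec file path, output directory, and key=value pair below are illustrative) might look like:

deriva-download-cli --catalog 1 www.example.org ./download-spec.json /tmp/downloads accession=XYZ123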
Positional arguments:¶
<host>¶
All operations are performed with respect to a specific host, and most hosts will require authentication. If a fully qualified host name is not given, localhost will be assumed.
<config file>¶
A path to a configuration file is required. The format and syntax of the configuration file are described below.
<output dir>¶
A path to an output base directory is required. This can be an absolute path or a path relative to the current working directory.
Optional arguments:¶
--token¶
The CLI accepts an authentication token with the --token TOKEN option. If this option is not given, the program will look in the user home directory where the DERIVA-Auth client would store the credentials.
--credential-file¶
If --token is not specified, the program will look in the user home directory where the DERIVA-Auth client would store the credentials. Use the --credential-file argument to override this behavior and specify an alternative credential file.
--catalog¶
The number of the catalog to operate against. If this option is not given, catalog 1 is assumed.
Configuration file format¶
The configuration JSON file (or “spec”) is the primary mechanism for orchestrating the export and download of data for a given host. There are three primary objects that comprise the configuration spec: the catalog element, the env element, and the bag element.
The catalog object is a REQUIRED element, and is principally composed of an array named queries, which is a set of configuration stanzas, executed in declared order, that individually describe what data to retrieve, how the data should be processed, and where the result data should be placed in the target filesystem.
The env object is an OPTIONAL element which, if present, is expected to be a dictionary of key-value pairs that are available for use as interpolation variables for various keywords in the queries section of the configuration file. The string substitution is performed using the keyword interpolation syntax of Python str.format. NOTE: when specifying arbitrary key-value pairs on the command line, such pairs will OVERRIDE any matching keys found in the env element of the configuration file.
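As a minimal illustration (using the env values from the example configuration below), the substitution behaves like Python's str.format called with keyword arguments:

env = {"accession": "XYZ123", "term": "Chip-seq"}
template = "{accession}/{accession}-RNA-Seq"
print(template.format(**env))  # prints: XYZ123/XYZ123-RNA-Seq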
The bag object is an OPTIONAL element which, if present, declares that the aggregate output from all configuration stanzas listed in the catalog:queries array should be packaged as a bagit formatted container. The bag element contains various optional parameters which control bag creation specifics.
Example configuration file:¶
{
"env": {
"accession": "XYZ123",
"term": "Chip-seq"
},
"bag": {
"bag_name": "test-bag",
"bag_archiver": "zip",
"bag_metadata": {
"Source-Organization": "USC Information Sciences Institute, Informatics Systems Research Division"
}
},
"catalog": {
"queries": [
{
"processor": "csv",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,techreplicate_num:=R:technical_replicate_number,species:=SPEC:term,paired:=PAIRED:term,stranded:=STRAND:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
"output_path": "{accession}/{accession}-RNA-Seq"
}
},
{
"processor": "download",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:RID,experiment:=E:RID,biosample:=SAMPLE:RID,file:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
"output_path": "{dataset}/{experiment}/{biosample}/seq"
}
},
{
"processor": "csv",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,control:=E:control_assay,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,technical_replicate_num:=R:technical_replicate_number,species:=SPEC:term,target:=TARGET:term,paired:=PAIRED:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
"output_path": "{accession}/{accession}-ChIP-Seq"
}
},
{
"processor": "fetch",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,technical_replicate_num:=R:technical_replicate_number,rid:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
"output_path": "{dataset}/{experiment}/{biosample}/seq",
"output_filename": "{rid}_{filename}"
}
}
]
}
}
Configuration file element: catalog¶
Example:
{
"catalog": {
"queries": [
{
"processor": "csv",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,control:=E:control_assay,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,technical_replicate_num:=R:technical_replicate_number,species:=SPEC:term,target:=TARGET:term,paired:=PAIRED:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
"output_path": "{accession}/{accession}-ChIP-Seq"
}
},
{
"processor": "fetch",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,technical_replicate_num:=R:technical_replicate_number,rid:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
"output_path": "{dataset}/{experiment}/{biosample}/seq",
"output_filename": "{rid}_{filename}"
}
}
]
}
}
Parameters:
Parent Object | Parameter | Description | Interpolatable |
---|---|---|---|
root | catalog | This is the parent object for all catalog-related parameters. | No |
catalog | queries | This is an array of objects representing a list of ERMRest queries and the logical outputs of these queries. The logical outputs of each query are then in turn processed by an output format processor, which can either be one of a set of default processors, or an external class conforming to a specified interface. | No |
queries | processor | This is a string value used to select from one of the built-in query output processor formats. Valid values are env, csv, json, json-stream, download, or fetch. | No |
queries | processor_type | A fully qualified Python class name declaring an external processor class instance to use. If this parameter is present, it OVERRIDES the default value mapped to the specified processor. This class MUST be derived from the base class deriva.transfer.download.processors.BaseDownloadProcessor. For example: "processor_type": "deriva.transfer.download.processors.CSVDownloadProcessor". | No |
queries | processor_params | This is an extensible JSON object that contains processor implementation-specific parameters. | No |
processor_params | query_path | This is a string representing the actual ERMRest query path to be used in the HTTP(S) GET request. It SHOULD already be percent-encoded per RFC 3986 if it contains any characters outside of the unreserved set. | Yes |
processor_params | output_path | This is a POSIX-compliant path fragment indicating the target location of the retrieved data relative to the specified base download directory. | Yes |
processor_params | output_filename | This is a POSIX-compliant path fragment indicating the OVERRIDE filename of the retrieved data relative to the specified base download directory and the value of output_path, if any. | Yes |
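For example, a stanza that explicitly selects the processor class via processor_type (the query path and output path values here are illustrative) might look like:

{
  "processor": "csv",
  "processor_type": "deriva.transfer.download.processors.CSVDownloadProcessor",
  "processor_params": {
    "query_path": "/entity/isa:dataset",
    "output_path": "datasets"
  }
}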
Configuration file element: bag¶
Example:
{
"bag": {
"bag_name": "test-bag",
"bag_archiver": "zip",
"bag_algorithms": ["sha256"],
"bag_metadata": {
"Source-Organization": "USC Information Sciences Institute, Informatics Systems Research Division"
}
}
}
Parameters:
Parent Object | Parameter | Description |
---|---|---|
root | bag | This is the parent object for all bag-related defaults. |
bag | bag_algorithms | This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512". |
bag | bag_archiver | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". |
bag | bag_metadata | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. |
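For reference, each bag_metadata key-value pair is emitted as a Key: Value line in the bag's bag-info.txt, per the bagit specification. The example above would produce the following line (alongside any fields generated by the bagging software itself):

Source-Organization: USC Information Sciences Institute, Informatics Systems Research Division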
Configuration file element: env¶
Example:
{
"env": {
"accession": "XYZ123",
"term": "Chip-seq"
}
}
Parameters:
Parent Object | Parameter | Description |
---|---|---|
root | env | This is the parent object for all global "environment" variables. Note that the usage of "env" in this case does not refer to the set of OS environment variables, but rather a combination of key-value pairs from the JSON configuration file and CLI arguments. |
env | key:value, ... | Any number of entries of the form key:value, where value is a string. |
Supported processors¶
The following processor tag values are supported by default:

Tag | Type | Description |
---|---|---|
env | Metadata | Populates the context metadata ("environment") with values returned by the query. |
csv | CSV | CSV format with a column header row. |
json | JSON | JSON array of row objects. |
json-stream | "Streaming" JSON | Newline-delimited, multi-object JSON. |
download | Asset download | File assets referenced by URL are downloaded to local storage relative to output_path. |
fetch | Asset reference | Bag-based. File assets referenced by URL are assigned as remote file references via fetch.txt. |
Processor details¶
Each processor is designed for a specific task, and a single data export may combine several of them. Some processors handle the export of tabular data from the catalog, while others handle the export of file assets that are referenced by tables in the catalog. Additional processors may be implemented to perform a combination of these tasks, implement a new format, or perform some kind of data transformation.
env¶
This processor performs a catalog query in JSON mode and stores the key-value pairs of the first row of data returned into the metadata context, or “working environment”, for the download. These key-value pairs can then be used as interpolation variables in subsequent stages of processing.
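A hypothetical env stanza might look like the following (the query path is illustrative; the columns of the first result row become interpolation variables):

{
  "processor": "env",
  "processor_params": {
    "query_path": "/attribute/D:=isa:dataset/accession={accession}/title:=D:title"
  }
}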
csv¶
This processor generates a standard Comma-Separated Values formatted text file. The first row is a comma-delimited list of column names, and all subsequent rows are comma-delimited values. Fields are not enclosed in quotation marks.
Example output:
subject_id,sample_id,snp_id,gt,chipset
CNP0001_F09,600009963128,rs6265,0/1,HumanOmniExpress
CNP0002_F15,600018902293,rs6265,0/0,HumanOmniExpress
json¶
This processor generates a text file containing a JSON array of row data, where each JSON object in the array represents one row.
Example output:
[{"subject_id":"CNP0001_F09","sample_id":"600009963128","snp_id":"rs6265","gt":"0/1","chipset":"HumanOmniExpress"},
{"subject_id":"CNP0002_F15","sample_id":"600018902293","snp_id":"rs6265","gt":"0/0","chipset":"HumanOmniExpress"}]
json-stream¶
This processor generates a text file containing multiple lines of individual JSON objects, each terminated by the newline character \n. This format is generally used when the result set is prohibitively large to parse as a single JSON object and instead can be processed on a line-by-line basis.
Example output:
{"subject_id":"CNP0001_F09","sample_id":"600009963128","snp_id":"rs6265","gt":"0/1","chipset":"HumanOmniExpress"}
{"subject_id":"CNP0002_F15","sample_id":"600018902293","snp_id":"rs6265","gt":"0/0","chipset":"HumanOmniExpress"}
download¶
This processor performs multiple actions. First, it issues a json-stream catalog query against the specified query_path in order to generate a file download manifest named download-manifest.json. This manifest is simply a set of rows which MUST contain at least one field named url, MAY contain a field named filename, and MAY contain other arbitrary fields.

If the filename field is present, it will be appended to the final (calculated) output_path; otherwise, the application will issue an HTTP HEAD request against the url and attempt to determine the filename from the Content-Disposition header of the referenced file asset. If this query fails to determine the filename, the application falls back to using the final string component of the url field after the last / character.

The output_filename field may be used to override all of the output_path filename computation logic stated above, in order to explicitly declare the desired filename.

If other fields are present, they are available for variable substitution in other parameters that support interpolation, e.g., output_path and output_filename.

After the file download manifest is generated, the application attempts to download the files referenced in each url field to the local filesystem, storing them at the base relative path specified by output_path.
For example, the following configuration stanza:
{
"processor": "download",
"processor_params": {
"query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:RID,experiment:=E:RID,biosample:=SAMPLE:RID,file:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
"output_path": "{dataset}/{experiment}/{biosample}/seq"
}
}
Produces a download-manifest.json with rows like:
{
"dataset":13641,
"experiment":51203,
"biosample":50233,
"file":55121,
"filename":"LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz",
"size":2976697043,
"md5":"9139b1626a35122fa85688cbb7ae6a8a",
"url":"/hatrac/facebase/data/fb2/FB00000806.2/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz"
}
After the output_path template string is interpolated with the values of the example row above, the file is downloaded to the following relative path:
./13641/51203/50233/seq/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz
fetch¶
This processor performs multiple actions. First, it issues a json-stream catalog query against the specified query_path in order to generate a file download manifest. This manifest is simply a set of rows which MUST contain at least one field named url, and SHOULD contain two additional fields: length, which is the size of the referenced file in bytes, and (at least) one of the following checksum fields: md5, sha1, sha256, sha512. If the length and appropriate checksum fields are missing, an attempt will be made to dynamically determine these values from the remote url by issuing an HTTP HEAD request and parsing the result headers for the missing information. If the required values cannot be determined this way, it is an error condition and the transfer will abort.
Similar to the download processor, the output of the catalog query MAY contain other fields. If the filename field is present, it will be appended to the final (calculated) output_path; otherwise, the application will issue an HTTP HEAD request against the url and attempt to determine the filename from the Content-Disposition header of the referenced file asset. If this query fails to determine the filename, the application falls back to using the final name component of the url field after the last / character.

The output_filename field may be used to override all of the output_path filename computation logic stated above, in order to explicitly declare the desired filename.

If other fields are present, they are available for variable substitution in other parameters that support interpolation, e.g., output_path and output_filename.
Unlike the download processor, the fetch processor does not actually download any asset files, but rather uses the query results to create a bag with check-summed manifest entries that reference each remote asset via the bag's fetch.txt file.
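For reference, the bagit fetch.txt format records one remote asset per line as whitespace-separated url, length, and bag-relative filename fields. A line produced from a row like the download example above (host name and values illustrative) might look like:

https://www.example.org/hatrac/facebase/data/fb2/FB00000806.2/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz 2976697043 data/13641/51203/50233/seq/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz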
Supported transform_processors¶
The following transform_processor tag values are supported by default:

Tag | Type | Description |
---|---|---|
strsub | Transform | String substitution transformation. |
interpolation | Transform | Performs a string interpolation. |
cat | Transform | Concatenates multiple files. |
Transform Processor details¶
Each transform processor performs a transformation over the input stream(s). Some transform processors alter specific fields of the input (e.g., strsub), while others alter the entire contents and format of the input (e.g., interpolation).
strsub¶
This transform_processor performs a string substitution on a designated property of the input stream. The input must be in json-stream format. The spec allows multiple substitutions, where pattern is a regular expression following Python re conventions, repl is the replacement string to substitute for each matched pattern, input is the name of the object attribute to process, and output is the name of the object attribute to set with the result.

The following example would strip off the version suffix (...:version-id) from Hatrac versioned URLs.
{
"transform_processors": [
{
"processor":"strsub",
"processor_params": {
"input_path": "track-metadata.json",
"output_path": "track-metadata-unversioned.json",
"substitutions": [
{
"pattern": ":[^/]*$",
"repl": "",
"input": "url",
"output": "url"
}
]
}
}
]
}
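In isolation, the substitution above is equivalent to the following Python snippet (the URL and its version suffix are illustrative):

import re

url = "/hatrac/facebase/data/fb2/track.bigWig:2QXZKR3EWWJVZJ5ZJ3PYIEXLEY"
# strip everything from the last ":" to the end, provided no "/" follows it
print(re.sub(r":[^/]*$", "", url))  # prints: /hatrac/facebase/data/fb2/track.bigWig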
interpolation¶
This transform_processor performs a string interpolation on each line of the input stream. The input must be in json-stream format. Each row of the input is passed as the environment for the string interpolation parameters.

The following example would take metadata for genomic annotation tracks and create a line for the “custom tracks” specification used by the UCSC and other Genome Browsers.
{
"processor":"interpolation",
"processor_params": {
"input_path": "track-metadata-unversioned.json",
"output_path": "customtracks.txt",
"template": "track type=$type name=\"$RID\" description=\"$filename\" bigDataUrl=https://www.facebase.org$url\n"
}
}
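Given the $-prefixed variables in the template, the substitution behaves like Python's string.Template (a sketch under that assumption; the row values are illustrative):

from string import Template

# one row of the json-stream input, used as the substitution environment
row = {"type": "bigWig", "RID": "1-ABCD", "filename": "track.bigWig",
       "url": "/hatrac/facebase/data/fb2/track.bigWig"}
template = Template('track type=$type name="$RID" description="$filename" bigDataUrl=https://www.facebase.org$url\n')
print(template.safe_substitute(row))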
cat¶
This transform_processor performs a concatenation of multiple input streams into a single output stream. In the following example, two input files are concatenated into one. (More than two input files are allowed.)
{
"processor":"cat",
"processor_params": {
"input_paths": ["super-track.txt", "track.txt"],
"output_path": "trackDb.txt"
}
}