deriva-download-cli

The deriva-download-cli is a command-line utility for orchestrating the bulk export of tabular data (stored in ERMRest) and download of asset data (stored in Hatrac, or other supported HTTP-accessible object store). It supports the transfer of data directly to local filesystems, or packaging results into the bagit container format. The program is driven by the combined usage of command-line arguments and a JSON-based configuration (“spec”) file, which contains the processing directives used to orchestrate the creation of the result data set.

Features

Transfer both tabular data and file assets from Deriva catalogs.
Create bag containers, which may reference files stored in remote locations.
Supports an extensible processing pipeline whereby data may be run through transform functions or other arbitrary processing before final result packaging.

Command-Line options

usage: deriva-download-cli.py [-h] [--version] [--quiet] [--debug]
                              [--credential-file <file>] [--catalog <1>]
                              [--token <auth-token>]
                              <host> <config file> <output dir> ...

Deriva Data Download Utility - CLI

positional arguments:
  <host>                Fully qualified host name.
  <config file>         Path to a configuration file.
  <output dir>          Path to an output directory.
  [key=value key=value ...]
                        Variable length of whitespace-delimited key=value pair
                        arguments used for string interpolation in specific
                        parts of the configuration file. For example:
                        key1=value1 key2=value2

optional arguments:
  -h, --help            show this help message and exit
  --version             Print version and exit.
  --quiet               Suppress logging output.
  --debug               Enable debug logging output.
  --credential-file <file>
                        Optional path to a credential file.
  --catalog <1>         Catalog number. Default: 1
  --token <auth-token>  Authorization bearer token.

Positional arguments:

`<host>`

All operations are performed with respect to a specific host and most hosts will require authentication. If the --host HOSTNAME option is not given, localhost will be assumed.

`<config file>`

A path to a configuration file is required. The format and syntax of the can be configuration file is described below.

`<output dir>`

A path to a output base directory is required. This can be an absolute path or a path relative to the current working directory.

Optional arguments:

`--token`

The CLI accepts an authentication token with the --token TOKEN option. If this option is not given, it will look in the user home dir where the DERIVA-Auth client would store the credentials.

`--credential-file`

If --token is not specified, the program will look in the user home dir where the DERIVA-Auth client would store the credentials. Use the --credential file argument to override this behavior and specify an alternative credential file.

`--catalog`

The catalog number (or path specifier). Defaults to 1.

Configuration file format

The configuration JSON file (or “spec”) is the primary mechanism for orchestrating the export and download of data for a given host. There are three primary objects that comprise the configuration spec; the catalog element, the env element, and the bag element.

The catalog object is a REQUIRED element, and is principally composed of an array named queries which is a set of configuration stanzas, executed in declared order, that individually describe what data to retrieve, how the data should be processed, and where the result data should be placed in the target filesystem.

The env object is an OPTIONAL element which, if present, is expected to be a dictionary of key-value pairs that are available to use as interpolation variables for various keywords in the queries section of the configuration file. The string substitution is performed using the keyword interpolation syntax of Python str.format. NOTE: when specifying arbitrary key-value pairs on the command-line, such pairs will OVERRIDE any matching keys found in the env element of the configuration file.

The bag object is an OPTIONAL element which, if present, declares that the aggregate output from all configuration stanzas listed in the catalog:queries array should be packaged as a bagit formatted container. The bag element contains various optional parameters which control bag creation specifics.

Example configuration file:

{
  "env": {
    "accession": "XYZ123",
    "term": "Chip-seq"
  },
  "bag": {
    "bag_name": "test-bag",
    "bag_archiver": "zip",
    "bag_metadata": {
      "Source-Organization": "USC Information Sciences Institute, Informatics Systems Research Division"
    }
  },
  "catalog": {
    "queries": [
      {
        "processor": "csv",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,techreplicate_num:=R:technical_replicate_number,species:=SPEC:term,paired:=PAIRED:term,stranded:=STRAND:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
          "output_path": "{accession}/{accession}-RNA-Seq"
        }
      },
      {
        "processor": "download",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:RID,experiment:=E:RID,biosample:=SAMPLE:RID,file:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
          "output_path": "{dataset}/{experiment}/{biosample}/seq"
        }
      },
      {
        "processor": "csv",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,control:=E:control_assay,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,technical_replicate_num:=R:technical_replicate_number,species:=SPEC:term,target:=TARGET:term,paired:=PAIRED:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
          "output_path": "{accession}/{accession}-ChIP-Seq"
        }
      },
      {
        "processor": "fetch",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,technical_replicate_num:=R:technical_replicate_number,rid:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
          "output_path": "{dataset}/{experiment}/{biosample}/seq",
          "output_filename": "{rid}_{filename}"
        }
      }
    ]
  }
}

Configuration file element: `catalog`

Example:

{
  "catalog": {
    "queries": [
      {
        "processor": "csv",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,control:=E:control_assay,biosample:=SAMPLE:RID,replicate:=R:RID,bioreplicate_num:=R:bioreplicate_number,technical_replicate_num:=R:technical_replicate_number,species:=SPEC:term,target:=TARGET:term,paired:=PAIRED:term,read:=SEQ:read,file:=SEQ:RID,filename:=SEQ:filename,url:=SEQ:url",
          "output_path": "{accession}/{accession}-ChIP-Seq"
        }
      },
      {
        "processor": "fetch",
        "processor_params": {
          "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=Chip-seq/$E/TARGET:=vocabulary:target_of_assay/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:accession,experiment:=E:RID,biosample:=SAMPLE:RID,technical_replicate_num:=R:technical_replicate_number,rid:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
          "output_path": "{dataset}/{experiment}/{biosample}/seq",
          "output_filename": "{rid}_{filename}"
        }
      }
    ]
  }
}

Parameters:

Parent Object	Parameter	Description	Interpolatable
root	catalog	This is the parent object for all catalog related parameters.	No
catalog	queries	This is an array of objects representing a list of `ERMRest` queries and the logical outputs of these queries. The logical outputs of each query are then in turn processed by an output format processor, which can either be one of a set of default processors, or an external class conforming to a specified interface.	No
queries	processor	This is a string value used to select from one of the built-in query output processor formats. Valid values are `env`, `csv`, `json`, `json-stream`, `download`, or `fetch`.	No
queries	processor_type	A fully qualified Python class name declaring an external processor class instance to use. If this parameter is present, it OVERRIDES the default value mapped to the specified `processor`. This class MUST be derived from the base class `deriva.transfer.download.processors.BaseDownloadProcessor`. For example, `"processor_type": "deriva.transfer.download.processors.CSVDownloadProcessor"`.	No
queries	processor_params	This is an extensible JSON Object that contains processor implementation-specific parameters.	No
processor_params	query_path	This is string representing the actual `ERMRest` query path to be used in the HTTP(S) GET request. It SHOULD already be percent-encoded per RFC 3986 if it contains any characters outside of the unreserved set.	Yes
processor_params	output_path	This is a POSIX-compliant path fragment indicating the target location of the retrieved data relative to the specified base download directory.	Yes
processor_params	output_filename	This is a POSIX-compliant path fragment indicating the OVERRIDE filename of the retrieved data relative to the specified base download directory and the value of `output_path`, if any.	Yes

Configuration file element: `bag`

Example:

{
    "bag": {
        "bag_name": "test-bag",
        "bag_archiver": "zip",
        "bag_algorithms": ["sha256"],
        "bag_metadata": {
            "Source-Organization": "USC Information Sciences Institute, Informatics Systems Research Division"
        }
    }
}

Parameters:

Parent Object	Parameter	Description
root	bag	This is the parent object for all bag-related defaults.
bag	bag_algorithms	This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512".
bag	bag_archiver	This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz".
bag	bag_metadata	This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt.

Configuration file element: `env`

Example:

{
    "env": {
        "accession": "XYZ123",
        "term": "Chip-seq"
    }
}

Parameters:

Parent Object	Parameter	Description
root	env	This is the parent object for all global "environment" variables. Note that the usage of "env" in this case does not refer to the set of OS environment variables, but rather a combination of key-value pairs from the JSON configuration file and CLI arguments.
env	key:value, ...	Any number of arguments in the form `key:value` where `value` is a `string`.

Supported processors

The following processor tag values are supported by default:

Tag	Type	Description
`env`	Metadata	Populates the context metadata ("environment") with values returned by the query.
`csv`	CSV	CSV format with column header row
`json`	JSON	JSON Array of row objects.
`json-stream`	"Streaming" JSON	Newline-delimited, multi-object JSON.
`download`	Asset download	File assets referenced by URL are download to local storage relative to `output_path`.
`fetch`	Asset reference	`Bag`-based. File assets referenced by URL are assigned as remote file references via `fetch.txt`.

Processor details

Each processor is designed for a specific task, and the task types may vary for a given data export task. Some processors are designed to handle the export of tabular data from the catalog, while others are meant to handle the export of file assets that are referenced by tables in the catalog. Other processors may be implemented that could perform a combination of these tasks, implement a new format, or perform some kind of data transformation.

`env`

This processor processor performs a catalog query in JSON mode and stores the key-value pairs of the first row of data returned into the metadata context or “working environment” for the download. These key-value pairs can then be used as interpolation variables in subsequent stages of processing.

`csv`

This processor generates a standard Comma Separated Values formatted text file. The first row is a comma-delimited list of column names, and all subsequent rows are comma-delimted values. Fields are not enclosed in quotation marks.

Example output:

subject_id,sample_id,snp_id,gt,chipset
CNP0001_F09,600009963128,rs6265,0/1,HumanOmniExpress
CNP0002_F15,600018902293,rs6265,0/0,HumanOmniExpress

`json`

This processor generates a text file containing a JSON Array of row data, where each JSON object in the array represents one row.

Example output:

[{"subject_id":"CNP0001_F09","sample_id":"600009963128","snp_id":"rs6265","gt":"0/1","chipset":"HumanOmniExpress"},
 {"subject_id":"CNP0002_F15","sample_id":"600018902293","snp_id":"rs6265","gt":"0/0","chipset":"HumanOmniExpress"}]

`json-stream`

This processor generates a text file containing multiple lines of individual JSON objects terminated by the newline line terminator \n. This format is generally used when the result set is too prohibitively large to parse as a single JSON object and instead can be processed on a line-by-line basis.

Example output:

{"subject_id":"CNP0001_F09","sample_id":"600009963128","snp_id":"rs6265","gt":"0/1","chipset":"HumanOmniExpress"}
{"subject_id":"CNP0002_F15","sample_id":"600018902293","snp_id":"rs6265","gt":"0/0","chipset":"HumanOmniExpress"}

`download`

This processor performs multiple actions. First, it issues a json-stream catalog query against the specified query_path, in order to generate a file download manifest file named download-manifest.json. This manifest is simply a set of rows which MUST contain at least one field named url, MAY contain a field named filename, and MAY contain other arbitrary fields.

If the filename field is present, it will be appended to the final (calculated) output_path, otherwise the application will perform a HEAD HTTP request against the url for the Content-Disposition of the referenced file asset. If this query fails to determine the filename, the application falls back to using the final string component of the url field after the last / character. The output_filename field may be used to override all of the output_path filename computation logic stated above, in order to explicitly declare the desired filename. If other fields are present, they are available for variable substitution in other parameters that support interpolation, e.g., output_path and output_filename.

After the file download manifest is generated, the application attempts to download the files referenced in each url field to the local filesystem, storing them at the base relative path specified by output_path.

For example, the following configuration stanza:

{
  "processor": "download",
  "processor_params": {
    "query_path": "/attribute/D:=isa:dataset/accession={accession}/E:=isa:experiment/experiment_type:=isa:experiment_type/term=RNA%20expression%20%28RNA-seq%29/$E/STRAND:=vocabulary:strandedness/$E/R:=isa:replicate/SAMPLE:=isa:biosample/SPEC:=vocabulary:species/$R/SEQ:=isa:sequencing_data/PAIRED:=vocabulary:paired_end_or_single_read/$SEQ/file_format:=vocabulary:file_format/term=FastQ/$SEQ/dataset:=D:RID,experiment:=E:RID,biosample:=SAMPLE:RID,file:=SEQ:RID,filename:=SEQ:filename,size:=SEQ:byte_count,md5:=SEQ:md5,url:=SEQ:url",
    "output_path": "{dataset}/{experiment}/{biosample}/seq"
  }
}

Produces a download-manifest.json with rows like:

{
  "dataset":13641,
  "experiment":51203,
  "biosample":50233,
  "file":55121,
  "filename":"LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz",
  "size":2976697043,
  "md5":"9139b1626a35122fa85688cbb7ae6a8a",
  "url":"/hatrac/facebase/data/fb2/FB00000806.2/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz"
}

After the output_path template string is interpolated with the values of the example row above, the file is then downloaded to the following relative path:

./13641/51203/50233/seq/LPHW_111414_001A_e11.5_facebase_md_rna_R1.fastq.gz

`fetch`

This processor performs multiple actions. First, it issues a json-stream catalog query against the specified query_path, in order to generate a file download manifest. This manifest is simply a set of rows which MUST contain at least one field named url, and SHOULD contain two additional fields: length, which is the size of the referenced file in bytes, and (at least) one of the following checksum fields; md5, sha1, sha256, sha512. If the length and appropriate checksum fields are missing, an attempt will be made to dynamically determine these fields from the remote url by issuing a HEAD HTTP request and parsing the result headers for the missing information. If the required values cannot be determined this way, it is an error condition and the transfer will abort.

Similar to the download processor, the output of the catalog query MAY contain other fields. If the filename field is present, it will be appended to the final (calculated) output_path, otherwise the application will perform a HEAD HTTP request against the url for the Content-Disposition of the referenced file asset. If this query fails to determine the filename, the application falls back to using the final name component of the url field after the last / character. The output_filename field may be used to override all of the output_path filename computation logic stated above, in order to explicitly declare the desired filename. If other fields are present, they are available for variable substitution in other parameters that support interpolation, e.g., output_path and output_filename.

Unlike the download processor, the fetch processor does not actually download any asset files, but rather uses the query results to create a bag with check-summed manifest entries that reference each remote asset via the bag’s fetch.txt file.

Supported transform_processors

The following transform_processor tag values are supported by default:

Tag	Type	Description
`strsub`	Transform	String substitution transformation.
`interpolation`	Transform	Performs a string interpolation.
`cat`	Transform	Concatenates multiple files.

Transform Processor details

Each transform processor performs a transformation over the input stream(s). The transform processors may alter specific fields of the input (e.g., strsub) while others alter the entire contents and format of the input (e.g., interpolation).

`strsub`

This transform_processor processor performs a string substitution on a designated property of the input stream. The input must be json-stream. The spec allows multiple substitutions where pattern is given as a regular expresison following Python re conventions, reply is the replacement string to substitute for each matched pattern, input is the name of the object attribute to process, and output is the name of the object attribute to set with the result. The following example would strip off the version suffix (...:version-id) from Hatrac versioned URLs.

{
  "transform_processors": [
    {
      "processor":"strsub",
      "processor_params": {
        "input_path": "track-metadata.json",
        "output_path": "track-metadata-unversioned.json",
        "substitutions": [
          {
            "pattern": ":[^/]*$",
            "repl": "",
            "input": "url",
            "output": "url"
          }
        ]
      }
    }
  ]
}

`interpolation`

This transform_processor processor performs a string interpolation on each line of the input stream. The input must be json-stream format. Each row of the input is passed as the environment for the string interpolation parameters. The following example would take metadata for genomic annotation tracks and create a line for the “custom tracks” specification used by UCSC and other Genome Browsers.

    {
      "processor":"interpolation",
      "processor_params": {
        "input_path": "track-metadata-unversioned.json",
        "output_path": "customtracks.txt",
        "template": "track type=$type name=\"$RID\" description=\"$filename\" bigDataUrl=https://www.facebase.org$url\n"
      }
    }

`cat`

This transform_processor processor performs a concatenation of multiple input streams into a single output stream. In the following example, 2 input files are concatenated into one. (More than 2 input files are allow.)

    {
      "processor":"cat",
      "processor_params": {
        "input_paths": ["super-track.txt", "track.txt"],
        "output_path": "trackDb.txt"
      }
    }

deriva-download-cli

Features

Command-Line options

Positional arguments:

<host>

<config file>

<output dir>

Optional arguments:

--token

--credential-file

--catalog

Configuration file format

Example configuration file:

Configuration file element: catalog

Configuration file element: bag

Configuration file element: env

Supported processors

Processor details

env

csv

json

json-stream

download

fetch

Supported transform_processors

Transform Processor details

strsub

interpolation

cat

`<host>`

`<config file>`

`<output dir>`

`--token`

`--credential-file`

`--catalog`

Configuration file element: `catalog`

Configuration file element: `bag`

Configuration file element: `env`

`env`

`csv`

`json`

`json-stream`

`download`

`fetch`

`strsub`

`interpolation`

`cat`