ETL¶
The Carrot ETL process runs via a Command Line Interface, installed with the carrot-cdm package.
In the context of the Carrot workflow, ETL stands for (and means):
- Extract: input data is extracted into .csv format and (optionally) pseudonymised.
- Transform: a CDM model is created and processed, given the extracted data and a json transformation rules file which tells the software how to map (transform) the data.
- Load: inserts the data into a database or other destination.
Workflow¶
Our ETL workflow is provided in the form of the ETL-Tool, a command line interface (CLI):
carrot etl --help
Usage: carrot etl [OPTIONS] COMMAND [ARGS]...
Command group for running the full ETL of a dataset
Options:
--config, --config-file TEXT specify a yaml configuration file
-d, --daemon run the ETL as a daemon process
-l, --log-file TEXT specify the log file to write to
--help Show this message and exit.
This is an automated tool: it can run (optionally as a background process) and detect changes in the input data or configuration files in order to process new data dumps.
Currently, automation using the carrot etl CLI is possible for loading to BC-Link or outputting to a local file storage system.
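To illustrate what this automation amounts to, the sketch below polls an input directory and re-runs the ETL when a data dump changes. This is a conceptual sketch only, not the carrot-cdm implementation: the directory, configuration file name and polling interval are assumptions, and in practice the --daemon option does this for you.

import subprocess
import time
from pathlib import Path

# Conceptual sketch only: the ETL-Tool implements its own change detection
# when run with --daemon. The paths and interval below are assumptions.
INPUT_DIR = Path("data/input")   # hypothetical location of new data dumps
CONFIG = "config.yaml"           # hypothetical ETL configuration file

def snapshot(directory):
    """Record the modification time of every file under the input directory."""
    return {p: p.stat().st_mtime for p in directory.rglob("*") if p.is_file()}

previous = snapshot(INPUT_DIR)
while True:
    current = snapshot(INPUT_DIR)
    if current != previous:
        # A new or updated data dump was detected: re-run the ETL.
        subprocess.run(["carrot", "etl", "--config", CONFIG], check=True)
        previous = current
    time.sleep(60)  # poll once a minute (arbitrary choice)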
Carrot--BC-LINK Workflow¶
The whole point of transforming data into the OMOP CDM format is so that the data can be uploaded to BC-Link. This workflow can be performed in one step, given the correct configuration of the input yaml file, by running carrot etl --config <yaml file>.
However, the process may need to be decoupled into multiple steps, for example if BC-Link is not hosted on a machine that has access to the original data. In this scenario carrot etl or carrot run map can be used to perform the transform (OMOP mapping); the output files can then be transferred to the machine hosting BC-Link and uploaded there (from the command line, or using the BC-Link GUI).
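As a rough sketch of the decoupled scenario, the snippet below bundles the mapped output files into a single archive ready to be copied to the machine hosting BC-Link. The directory and archive names are assumptions; the transfer itself and the final upload depend on your environment.

import shutil
from pathlib import Path

# Hypothetical location: wherever the transform step wrote its output files.
output_dir = Path("output_data")

# Pack the mapped output into one archive for transfer to the BC-Link host.
archive = shutil.make_archive("omop_mapped_output", "gztar", root_dir=output_dir)
print(f"Transfer {archive} to the BC-Link host and upload it there "
      "(from the command line or the BC-Link GUI).")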
Architecture Overview¶
A schematic diagram of the Carrot/bclink ETL is given below:
The following sections break down what each part of the process does.
Extract¶
- Formats the input data to abide by the co-connect data-standards [manual]
- Creates data-dumps of the input datasets in the form of csv files for each dataset table [manual]
- Pseudonymises the input datasets, masking any person identifiers or person IDs (see the sketch below) [optionally automated]
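To make the pseudonymisation step concrete, here is a minimal sketch of the idea, in which person identifiers in a csv file are replaced with salted hashes. This is an illustration only, not the Carrot pseudonymisation tool; the file names, column name and salt are assumptions.

import csv
import hashlib

# Minimal illustration of pseudonymisation, not the Carrot tool itself.
# The file names, ID column name and salt are all assumptions.
SALT = "example-salt"      # in practice the salt must be kept secret
ID_COLUMN = "PersonID"     # hypothetical person identifier column

def pseudonymise(value):
    """Replace an identifier with an irreversible salted hash."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()

with open("demographics.csv", newline="") as src, \
     open("demographics_pseudonymised.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[ID_COLUMN] = pseudonymise(row[ID_COLUMN])
        writer.writerow(row)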
Transform¶
- The transform mapping is executed with the command carrot run map [arguments], where additional arguments pass the paths to a mapping-rules json file and input data csv files [optionally automated]:
    - A new pythonic CommonDataModel is created.
        - A DataCollection is created to handle/chunk the input files and is added to the CommonDataModel.
        - The mapping-rules json is used to create new CDM Tables (e.g. Person).
            - For each CDM Table, multiple tables can be created. E.g. there may be multiple Condition Occurrences defined across multiple input data files and columns (fields).
            - The rules json encodes so-called "term-mapping": how to map raw values into OHDSI concept IDs for the output. These are set up as lambda functions and passed to the object's define function.
    - Processing of the CommonDataModel is triggered (a conceptual sketch of this loop is given after the list):
        - [optionally chunked] A new chunk of DataCollection is grabbed.
        - Each CDM table is looped over:
            - All objects of this CDM table are found and looped over:
                - The define function is called to apply the rules.
                - A new dataframe is retrieved for the object.
            - All retrieved dataframes are merged.
            - The merged dataframe has a formatting check applied.
            - The merged dataframe's primary keys are correctly indexed.
            - The final dataframe is saved/appended to an output tsv file.
        - This is repeated until all chunks are processed.
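The sketch below mirrors the processing loop described above. It is a conceptual outline, not the carrot-cdm API: the names CommonDataModel, DataCollection and define come from the description, but every signature, column name and rule here is invented for illustration.

import pandas as pd

# Conceptual outline of the chunked transform loop, NOT the carrot-cdm API.
class DataCollection:
    """Serves an input csv file in chunks so large files fit in memory."""
    def __init__(self, path, chunksize=100_000):
        self.path = path
        self.chunksize = chunksize

    def chunks(self):
        yield from pd.read_csv(self.path, chunksize=self.chunksize)

class PersonObject:
    """One object of the Person CDM table, built from the mapping-rules json."""
    def define(self, raw):
        # Apply the term-mapping rules: raw values become OHDSI concept IDs.
        gender_map = {"M": 8507, "F": 8532}   # hypothetical rule
        return pd.DataFrame({
            "gender_concept_id": raw["Sex"].map(gender_map),
            "year_of_birth": pd.to_datetime(raw["DateOfBirth"]).dt.year,
        })

def process(collection, cdm_objects, output_file="person.tsv"):
    """Loop over chunks, apply each object's rules, merge and append to tsv."""
    first, start = True, 0
    for chunk in collection.chunks():
        # Every object of the CDM table yields a dataframe; merge them all.
        merged = pd.concat(obj.define(chunk) for obj in cdm_objects)
        # Formatting checks would be applied here; index the primary keys
        # so they keep counting up across chunks.
        merged.index = range(start, start + len(merged))
        merged.index.name = "person_id"
        start += len(merged)
        merged.to_csv(output_file, sep="\t", mode="w" if first else "a",
                      header=first)
        first = False

process(DataCollection("demographics.csv"), [PersonObject()])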
Load¶
- Uploads the output tsv files into the bclink job-queue, which formats and uploads these into the bclink database [optionally automated]
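As a minimal sketch of what the load step amounts to (not the actual BC-Link tooling), the snippet below gathers the output tsv files and hands each one to a placeholder submission function. submit_to_bclink is hypothetical; how a file actually enters the bclink job-queue depends on your BC-Link installation.

from pathlib import Path

# Sketch only: submit_to_bclink is a hypothetical placeholder. In a real
# deployment the tsv files enter the bclink job-queue, which formats and
# uploads them into the bclink database.
def submit_to_bclink(tsv_file):
    print(f"would submit {tsv_file} to the bclink job-queue")

output_dir = Path("output_data")   # assumed location of the mapped tsv files
for tsv_file in sorted(output_dir.glob("*.tsv")):
    submit_to_bclink(tsv_file)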