ETL Workbook Example
Introduction¶
The ETL transform to the CDM, using the classes defined in carrot.cdm
, is documented here as a Python notebook, as an example of how the classes can be used. Developers can follow this workbook, changing the rules file and the input files to suit their own data.
Installing¶
The recommended way to install the module is via pip:
!pip3 install carrot-cdm -q
!carrot --version
0.6.2
Loading the Rules¶
Given the full path to a json
file containing the rules, the first step is to load this up into a json
object/dict.
import carrot.tools
import json
import os
carrot.data_folder = os.path.join(os.path.dirname(carrot.__file__),'data')
rules = carrot.tools.load_json(f'{carrot.data_folder}/test/rules/rules_14June2021.json')
print(json.dumps(rules, indent=6))
{
"metadata": {
"date_created": "2021-06-14T15:27:37.123947",
"dataset": "Test"
},
"cdm": {
"observation": {
"observation_0": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Asian": 35825508
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Asian": 35825508
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_1": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Bangladeshi": 35825531
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Bangladeshi": 35825531
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_2": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Indian": 35826241
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Indian": 35826241
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_3": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White": 35827394
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White": 35827394
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_4": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Black": 35825567
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Black": 35825567
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_5": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White and Asian": 35827395
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White and Asian": 35827395
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
}
},
"condition_occurrence": {
"condition_occurrence_0": {
"condition_concept_id": {
"source_table": "Symptoms.csv",
"source_field": "symptom1",
"term_mapping": {
"Y": 254761
}
},
"condition_end_datetime": {
"source_table": "Symptoms.csv",
"source_field": "visit_date"
},
"condition_source_concept_id": {
"source_table": "Symptoms.csv",
"source_field": "symptom1",
"term_mapping": {
"Y": 254761
}
},
"condition_source_value": {
"source_table": "Symptoms.csv",
"source_field": "symptom1"
},
"condition_start_datetime": {
"source_table": "Symptoms.csv",
"source_field": "visit_date"
},
"person_id": {
"source_table": "Symptoms.csv",
"source_field": "PersonID"
}
}
},
"person": {
"female": {
"birth_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"gender_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"F": 8532
}
},
"gender_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"F": 8532
}
},
"gender_source_value": {
"source_table": "Demographics.csv",
"source_field": "sex"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"male": {
"birth_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"gender_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"M": 8507
}
},
"gender_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"M": 8507
}
},
"gender_source_value": {
"source_table": "Demographics.csv",
"source_field": "sex"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
}
},
"measurement": {
"covid_antibody": {
"value_as_number": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG"
},
"measurement_source_value": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG"
},
"measurement_concept_id": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG",
"term_mapping": 37398191
},
"measurement_source_concept_id": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG",
"term_mapping": 37398191
},
"measurement_datetime": {
"source_table": "covid19_antibody.csv",
"source_field": "date"
},
"person_id": {
"source_table": "covid19_antibody.csv",
"source_field": "PersonID"
}
}
}
}
}
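Since the rules are just a nested dict, they can be inspected programmatically. For example, a quick summary of how many CDM objects are defined per destination table (using a minimal excerpt of the structure printed above):

```python
# Minimal excerpt of the rules structure printed above
rules = {
    "metadata": {"date_created": "2021-06-14T15:27:37.123947", "dataset": "Test"},
    "cdm": {
        "observation": {f"observation_{i}": {} for i in range(6)},
        "condition_occurrence": {"condition_occurrence_0": {}},
        "person": {"female": {}, "male": {}},
        "measurement": {"covid_antibody": {}},
    },
}

# Count the number of CDM objects defined for each destination table
counts = {table: len(objects) for table, objects in rules["cdm"].items()}
print(counts)
# {'observation': 6, 'condition_occurrence': 1, 'person': 2, 'measurement': 1}
```

These counts match the per-table summary the tool itself logs when processing starts.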
f_map = carrot.tools.get_file_map_from_dir(f'{carrot.data_folder}/test/inputs/')
print (json.dumps(f_map,indent=6))
{
"Symptoms.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Symptoms.csv",
"Covid19_test.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Covid19_test.csv",
"covid19_antibody.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/covid19_antibody.csv",
"vaccine.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/vaccine.csv",
"Demographics.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Demographics.csv"
}
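If get_file_map_from_dir is not convenient, the same map can be built by hand; it is just a plain dict of {file name: full path}. A minimal sketch, using a temporary directory and hypothetical file names:

```python
import glob
import os
import tempfile

# Hypothetical stand-in for a real input folder: create two CSVs in a
# temporary directory, then build the {file name: full path} map by hand.
input_dir = tempfile.mkdtemp()
for name in ("Demographics.csv", "Symptoms.csv"):
    with open(os.path.join(input_dir, name), "w") as f:
        f.write("PersonID\n")

f_map = {os.path.basename(p): p for p in glob.glob(os.path.join(input_dir, "*.csv"))}
print(sorted(f_map))  # ['Demographics.csv', 'Symptoms.csv']
```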
Use the f_map
to load all the inputs into a map between each file name and a dataframe object. This map can also be created manually via any preferred method.
inputs = carrot.tools.load_csv(f_map)
inputs
2022-06-17 14:46:49 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x10df1b040>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Covid19_test.csv [<carrot.io.common.DataBrick object at 0x10df1b0d0>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering covid19_antibody.csv [<carrot.io.common.DataBrick object at 0x10df1b310>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering vaccine.csv [<carrot.io.common.DataBrick object at 0x110d473d0>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Demographics.csv [<carrot.io.common.DataBrick object at 0x10deb5dc0>]
<carrot.io.plugins.local.LocalDataCollection at 0x10deb5d30>
inputs.keys()
dict_keys(['Symptoms.csv', 'Covid19_test.csv', 'covid19_antibody.csv', 'vaccine.csv', 'Demographics.csv'])
inputs['Symptoms.csv']
2022-06-17 14:46:49 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
795 | 62f6d46c48c7d9ff3d09a408d0ec880f167a5dc9c8fd34... | 2020-11-04 00:00:00.000000 | N | Y | N |
796 | c62510afc57db491f9f993387b76dd9a7d08f09c013269... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
797 | bdc5d8a48c23897906b09a9a3680bd2e9c8b3121edbda3... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
798 | fa88d374b9cf5e059fad4a2fe406feae4c49cbf4803083... | 2020-12-24 00:00:00.000000 | N | N | N |
799 | 6a97982dccf77dd3dafa27fcbdf75c017301f730ba186b... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
800 rows × 5 columns
Chunked CSV¶
For large datasets, it is better to chunk the data so as not to overload your computer's memory. This can be achieved by supplying a chunksize
argument:
inputs_chunked = carrot.tools.load_csv(f_map,chunksize=100)
inputs_chunked['Symptoms.csv']
2022-06-17 14:46:49 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:46:49 - LocalDataCollection - INFO - Using a chunksize of '100' nrows
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x10deb5850>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Covid19_test.csv [<carrot.io.common.DataBrick object at 0x111d3c490>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering covid19_antibody.csv [<carrot.io.common.DataBrick object at 0x111d3c580>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering vaccine.csv [<carrot.io.common.DataBrick object at 0x111d3c970>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Demographics.csv [<carrot.io.common.DataBrick object at 0x111d3cc40>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
The internal workings of the InputData
object will move on to the next slice of data when running the process, until all data has been processed.
inputs_chunked.next()
2022-06-17 14:46:49 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:46:49 - LocalDataCollection - INFO - Getting the next chunk of size '100' for 'Symptoms.csv'
2022-06-17 14:46:49 - LocalDataCollection - INFO - --> Got 100 rows
inputs_chunked['Symptoms.csv']
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
100 | 43974ed74066b207c30ffd0fed5146762e6c60745ac977... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
101 | fc56dbc6d4652b315b86b71c8d688c1ccdea9c5f1fd077... | 2020-02-04 00:00:00.000000 | N | N | N |
102 | f8809aff4d69bece79dabe35be0c708b890d7eafb841f1... | 2020-06-24 00:00:00.000000 | N | N | N |
103 | 5cf4e26bd3d87da5e03f80a43a64f1220a1f4ba9e1d634... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
104 | f8809aff4d69bece79dabe35be0c708b890d7eafb841f1... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
195 | a0f8b2c4cb1ac82abdb37f0fe5203b97be556c4468c83b... | 2020-12-24 00:00:00.000000 | N | N | N |
196 | 4c15f47afe7f817fd559e12ddbc276f4930c5822f20490... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
197 | 983bd614bb5afece5ab3b6023f71147cd7b6bc2314f9d2... | 2020-11-04 00:00:00.000000 | N | Y | N |
198 | c3ea99f86b2f8a74ef4145bb245155ff5f91cd856f2875... | 2020-06-24 00:00:00.000000 | N | N | N |
199 | f32828acecb4282c87eaa554d2e1db74e418cd68458430... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
100 rows × 5 columns
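A reasonable mental model for this chunked iteration is pandas' own chunksize option to read_csv (the tool's exact internals may differ): the data is delivered in fixed-size slices until it is exhausted.

```python
import io
import pandas as pd

# Ten rows of toy data, read back in chunks of four
csv = io.StringIO("PersonID,symptom1\n" + "\n".join(f"p{i},Y" for i in range(10)))
sizes = [len(chunk) for chunk in pd.read_csv(csv, chunksize=4)]
print(sizes)  # [4, 4, 2]
```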
SQL¶
Alternatively, if your input data is not in csv
format, you can load the data yourself from SQL / Spark / Databricks views etc.
First, initialise an input data handler object:
inputs_sql = carrot.io.DataCollection()
inputs_sql
2022-06-17 14:46:49 - DataCollection - INFO - DataCollection Object Created
<carrot.io.common.DataCollection at 0x111c49130>
Load your data, for example from a PostgreSQL server, using sqlalchemy
and pandas:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost:5432/coconnect_data_test')
df = pd.read_sql("my_data_table",engine)
df
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
Set the input object to this dataframe. Note: the name of the input table must be the same as the name in the json
rules. In this example, the name in the json
mapping for this table is Symptoms.csv
, so the dataframe is associated with that name.
inputs_sql['Symptoms.csv'] = carrot.io.DataBrick(df)
inputs_sql['Symptoms.csv']
2022-06-17 14:46:50 - DataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x111d44940>]
2022-06-17 14:46:50 - DataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
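Because the table name is the lookup key, it is worth checking that every source_table referenced in the rules has a registered input before processing. A minimal, hypothetical check using a small excerpt of the rules' "cdm" section:

```python
# Hypothetical excerpt of the rules' "cdm" section
rules_cdm = {
    "person": {
        "female": {
            "person_id": {"source_table": "Demographics.csv", "source_field": "PersonID"}
        }
    },
    "condition_occurrence": {
        "condition_occurrence_0": {
            "person_id": {"source_table": "Symptoms.csv", "source_field": "PersonID"}
        }
    },
}

# Every source_table referenced anywhere in the rules...
required = {
    field["source_table"]
    for objects in rules_cdm.values()
    for obj in objects.values()
    for field in obj.values()
}
# ...must appear among the registered input names
available = {"Demographics.csv", "Symptoms.csv"}
missing = required - available
print(missing)  # set()
```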
Spark Databricks¶
If you want to use something like Spark for integration with Databricks, you can use pyspark
to load the data:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/Users/calummacdonald/Downloads/postgresql-42.3.1.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/coconnect_data_test") \
.option("dbtable", "my_data_table") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
root
|-- PersonID: string (nullable = true)
|-- visit_date: string (nullable = true)
|-- symptom1: string (nullable = true)
|-- symptom2: string (nullable = true)
|-- symptom3: string (nullable = true)
inputs_databricks = carrot.io.DataCollection()
inputs_databricks['Symptoms.csv'] = carrot.io.DataBrick(df.select("*").toPandas())
inputs_databricks['Symptoms.csv']
2022-06-17 14:47:04 - DataCollection - INFO - DataCollection Object Created
2022-06-17 14:47:07 - DataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x111c490d0>]
2022-06-17 14:47:07 - DataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
Creating a CDM¶
As CO-CONNECT-Tools contains a pythonic version of the CDM, we can create an instance of the CommonDataModel
class.
from carrot.cdm import CommonDataModel
outputs = carrot.tools.create_csv_store(output_folder='output_dir/')
cdm = CommonDataModel(name=rules['metadata']['dataset'],
inputs=inputs,
outputs=outputs)
cdm
2022-06-17 14:47:07 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:47:07 - CommonDataModel - INFO - CommonDataModel (5.3.1) created with co-connect-tools version 0.0.0
2022-06-17 14:47:07 - CommonDataModel - INFO - Running with an DataCollection object
2022-06-17 14:47:07 - CommonDataModel - INFO - Turning on automatic cdm column filling
<carrot.cdm.model.CommonDataModel at 0x112313820>
Adding CDM Objects to the CDM¶
The next step is to loop over all the rules in the json
, creating and adding a new CDM object (e.g. Person) to the CDM for each one.
Within the loop, each CDM object's define function is set to a lambda function that applies the rules. This means that at runtime the tool (via the CommonDataModel
class) will execute the define function and know how to apply the mapping rules.
cdm.create_and_add_objects(rules)
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_0 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_1 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_2 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_3 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_4 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_5 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added condition_occurrence_0 of type condition_occurrence
2022-06-17 14:47:07 - CommonDataModel - INFO - Added female of type person
2022-06-17 14:47:07 - CommonDataModel - INFO - Added male of type person
2022-06-17 14:47:07 - CommonDataModel - INFO - Added covid_antibody of type measurement
After the initialisation and creation of the CDM objects, we can see which objects have been registered in the model:
cdm.objects()
{'observation': {'observation_0': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112313460>,
'observation_1': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112305f70>,
'observation_2': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x11236d790>,
'observation_3': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112313340>,
'observation_4': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112399fa0>,
'observation_5': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112305d60>},
'condition_occurrence': {'condition_occurrence_0': <carrot.cdm.objects.versions.v5_3_1.condition_occurrence.ConditionOccurrence at 0x11239ce50>},
'person': {'female': <carrot.cdm.objects.versions.v5_3_1.person.Person at 0x1123135e0>,
'male': <carrot.cdm.objects.versions.v5_3_1.person.Person at 0x11239deb0>},
'measurement': {'covid_antibody': <carrot.cdm.objects.versions.v5_3_1.measurement.Measurement at 0x11239d970>}}
Process The CDM¶
Processing the CDM will execute all objects: a pandas dataframe will be created for each object, based on the rules that have been provided.
Importantly, the CDM will also format, finalise and merge all the individual dataframes for each object:
- Formatting makes sure the columns are in the correct format, i.e. a date is YYYY-MM-DD.
- Finalising merges the dataframes of all objects of the same type into a single output table (e.g. the two person objects are merged into one person table, as seen in the logs).
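The date formatting step can be pictured with plain pandas (a sketch, not the tool's exact implementation): raw timestamps are parsed and rendered back as YYYY-MM-DD strings.

```python
import pandas as pd

# Raw timestamps as they appear in the input tables above
raw = pd.Series(["2020-11-15 00:00:00.000000", "2020-01-04 00:00:00.000000"])
dates = pd.to_datetime(raw).dt.strftime("%Y-%m-%d")
print(list(dates))  # ['2020-11-15', '2020-01-04']
```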
cdm.process()
2022-06-17 14:47:07 - CommonDataModel - INFO - Starting processing in order: ['person', 'observation', 'condition_occurrence', 'measurement']
2022-06-17 14:47:07 - CommonDataModel - INFO - Number of objects to process for each table...
{
"observation": 6,
"condition_occurrence": 1,
"person": 2,
"measurement": 1
}
2022-06-17 14:47:07 - CommonDataModel - INFO - for person: found 2 objects
2022-06-17 14:47:07 - CommonDataModel - INFO - working on person
2022-06-17 14:47:07 - CommonDataModel - INFO - starting on female
2022-06-17 14:47:07 - Person - INFO - Called apply_rules
2022-06-17 14:47:07 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Demographics.csv' for the first time
2022-06-17 14:47:07 - Person - INFO - Mapped birth_datetime
2022-06-17 14:47:07 - Person - INFO - Mapped gender_concept_id
2022-06-17 14:47:07 - Person - INFO - Mapped gender_source_concept_id
2022-06-17 14:47:07 - Person - INFO - Mapped gender_source_value
2022-06-17 14:47:08 - Person - INFO - Mapped person_id
2022-06-17 14:47:08 - Person - WARNING - Requiring non-null values in gender_concept_id removed 400 rows, leaving 600 rows.
2022-06-17 14:47:08 - Person - INFO - Automatically formatting data columns.
2022-06-17 14:47:08 - Person - INFO - created df (0x1123bb490)[female]
2022-06-17 14:47:08 - CommonDataModel - INFO - finished female (0x1123bb490) ... 1/2 completed, 600 rows
2022-06-17 14:47:08 - LocalDataCollection - INFO - saving person_ids to output_dir//person_ids.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - starting on male
2022-06-17 14:47:08 - Person - INFO - Called apply_rules
2022-06-17 14:47:08 - Person - INFO - Mapped birth_datetime
2022-06-17 14:47:08 - Person - INFO - Mapped gender_concept_id
2022-06-17 14:47:08 - Person - INFO - Mapped gender_source_concept_id
2022-06-17 14:47:08 - Person - INFO - Mapped gender_source_value
2022-06-17 14:47:08 - Person - INFO - Mapped person_id
2022-06-17 14:47:08 - Person - WARNING - Requiring non-null values in gender_concept_id removed 600 rows, leaving 400 rows.
2022-06-17 14:47:08 - Person - INFO - Automatically formatting data columns.
2022-06-17 14:47:08 - Person - INFO - created df (0x1123f5760)[male]
2022-06-17 14:47:08 - CommonDataModel - INFO - finished male (0x1123f5760) ... 2/2 completed, 400 rows
2022-06-17 14:47:08 - LocalDataCollection - INFO - updating person_ids in output_dir//person_ids.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - saving dataframe (0x1123bb190) to <carrot.io.plugins.local.LocalDataCollection object at 0x112313ee0>
2022-06-17 14:47:08 - LocalDataCollection - INFO - saving person to output_dir//person.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - finalised person on iteration 0 producing 1000 rows from 2 tables
2022-06-17 14:47:08 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:47:08 - LocalDataCollection - INFO - All input files for this object have now been used.
2022-06-17 14:47:08 - LocalDataCollection - INFO - resetting used bricks
2022-06-17 14:47:08 - CommonDataModel - INFO - for observation: found 6 objects
2022-06-17 14:47:08 - CommonDataModel - INFO - working on observation
2022-06-17 14:47:08 - CommonDataModel - INFO - starting on observation_0
2022-06-17 14:47:08 - Observation - INFO - Called apply_rules
2022-06-17 14:47:08 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Demographics.csv' for the first time
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x1123bb5e0)[observation_0]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_0 (0x1123bb5e0) ... 1/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_1
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x11241ce20)[observation_1]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_1 (0x11241ce20) ... 2/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_2
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x112427580)[observation_2]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_2 (0x112427580) ... 3/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_3
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 700 rows, leaving 300 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x1124357f0)[observation_3]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_3 (0x1124357f0) ... 4/6 completed, 300 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_4
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 800 rows, leaving 200 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x112440490)[observation_4]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_4 (0x112440490) ... 5/6 completed, 200 rows
2022-06-17 14:47:10 - CommonDataModel - INFO - starting on observation_5
2022-06-17 14:47:10 - Observation - INFO - Called apply_rules
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:10 - Observation - INFO - Mapped person_id
2022-06-17 14:47:10 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:10 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:10 - Observation - INFO - created df (0x112456f70)[observation_5]
2022-06-17 14:47:10 - CommonDataModel - INFO - finished observation_5 (0x112456f70) ... 6/6 completed, 100 rows
2022-06-17 14:47:10 - CommonDataModel - INFO - saving dataframe (0x112427280) to <carrot.io.plugins.local.LocalDataCollection object at 0x112313ee0>
2022-06-17 14:47:10 - LocalDataCollection - INFO - saving observation to output_dir//observation.csv
2022-06-17 14:47:10 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:10 - CommonDataModel - INFO - finalised observation on iteration 0 producing 900 rows from 6 tables
2022-06-17 14:47:10 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:47:10 - LocalDataCollection - INFO - All input files for this object have now been used.
2022-06-17 14:47:10 - LocalDataCollection - INFO - resetting used bricks
2022-06-17 14:47:10 - CommonDataModel - INFO - for condition_occurrence: found 1 object
2022-06-17 14:47:10 - CommonDataModel - INFO - working on condition_occurrence
2022-06-17 14:47:10 - CommonDataModel - INFO - starting on condition_occurrence_0
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Called apply_rules
2022-06-17 14:47:10 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_concept_id
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_end_datetime
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_source_concept_id
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_source_value
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_start_datetime
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped person_id
2022-06-17 14:47:10 - ConditionOccurrence - WARNING - Requiring non-null values in condition_concept_id removed 400 rows, leaving 400 rows.
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Automatically formatting data columns.
2022-06-17 14:47:10 - ConditionOccurrence - INFO - created df (0x112456790)[condition_occurrence_0]
2022-06-17 14:47:10 - CommonDataModel - INFO - finished condition_occurrence_0 (0x112456790) ... 1/1 completed, 400 rows
Inspect Outputs¶
cdm.keys()
cdm['person'].dropna(axis=1,how='all')
cdm['observation'].dropna(axis=1,how='all')
cdm['condition_occurrence'].dropna(axis=1,how='all')
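The dropna(axis=1, how='all') calls above simply hide CDM columns that the rules never populated. The effect on a small stand-in frame:

```python
import pandas as pd

# A person-like frame where race_concept_id was never filled by any rule
df = pd.DataFrame({
    "person_id": [1, 2],
    "gender_concept_id": [8532, 8507],
    "race_concept_id": [None, None],
})
trimmed = df.dropna(axis=1, how="all")
print(list(trimmed.columns))  # ['person_id', 'gender_concept_id']
```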