ETL Workbook Example
Introduction¶
The ETL transform to the CDM, using the classes defined in carrot.cdm
, is documented here as a Python notebook, as an example of how the classes can be used. Developers can follow this workbook, changing the rules file and the input files to suit their own data.
Installing¶
The recommended way to install the module is via pip:
!pip3 install carrot-cdm -q
!carrot --version
0.6.2
Loading the Rules¶
Given the full path to a json
file containing the rules, the first step is to load this up into a json
object/dict.
import carrot.tools
import json
import os
carrot.data_folder = os.path.join(os.path.dirname(carrot.__file__),'data')
rules = carrot.tools.load_json(f'{carrot.data_folder}/test/rules/rules_14June2021.json')
print(json.dumps(rules, indent=6))
{
"metadata": {
"date_created": "2021-06-14T15:27:37.123947",
"dataset": "Test"
},
"cdm": {
"observation": {
"observation_0": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Asian": 35825508
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Asian": 35825508
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_1": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Bangladeshi": 35825531
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Bangladeshi": 35825531
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_2": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Indian": 35826241
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Indian": 35826241
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_3": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White": 35827394
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White": 35827394
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_4": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Black": 35825567
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"Black": 35825567
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"observation_5": {
"observation_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White and Asian": 35827395
}
},
"observation_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"observation_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "ethnicity",
"term_mapping": {
"White and Asian": 35827395
}
},
"observation_source_value": {
"source_table": "Demographics.csv",
"source_field": "ethnicity"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
}
},
"condition_occurrence": {
"condition_occurrence_0": {
"condition_concept_id": {
"source_table": "Symptoms.csv",
"source_field": "symptom1",
"term_mapping": {
"Y": 254761
}
},
"condition_end_datetime": {
"source_table": "Symptoms.csv",
"source_field": "visit_date"
},
"condition_source_concept_id": {
"source_table": "Symptoms.csv",
"source_field": "symptom1",
"term_mapping": {
"Y": 254761
}
},
"condition_source_value": {
"source_table": "Symptoms.csv",
"source_field": "symptom1"
},
"condition_start_datetime": {
"source_table": "Symptoms.csv",
"source_field": "visit_date"
},
"person_id": {
"source_table": "Symptoms.csv",
"source_field": "PersonID"
}
}
},
"person": {
"female": {
"birth_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"gender_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"F": 8532
}
},
"gender_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"F": 8532
}
},
"gender_source_value": {
"source_table": "Demographics.csv",
"source_field": "sex"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
},
"male": {
"birth_datetime": {
"source_table": "Demographics.csv",
"source_field": "date_of_birth"
},
"gender_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"M": 8507
}
},
"gender_source_concept_id": {
"source_table": "Demographics.csv",
"source_field": "sex",
"term_mapping": {
"M": 8507
}
},
"gender_source_value": {
"source_table": "Demographics.csv",
"source_field": "sex"
},
"person_id": {
"source_table": "Demographics.csv",
"source_field": "PersonID"
}
}
},
"measurement": {
"covid_antibody": {
"value_as_number": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG"
},
"measurement_source_value": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG"
},
"measurement_concept_id": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG",
"term_mapping": 37398191
},
"measurement_source_concept_id": {
"source_table": "covid19_antibody.csv",
"source_field": "IgG",
"term_mapping": 37398191
},
"measurement_datetime": {
"source_table": "covid19_antibody.csv",
"source_field": "date"
},
"person_id": {
"source_table": "covid19_antibody.csv",
"source_field": "PersonID"
}
}
}
}
}
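Since the rules are just a nested dict, they can be inspected programmatically. For example, a quick summary of how many CDM objects are defined per destination table (using a minimal excerpt of the structure printed above):

```python
# Minimal excerpt of the rules structure printed above
rules = {
    "metadata": {"date_created": "2021-06-14T15:27:37.123947", "dataset": "Test"},
    "cdm": {
        "observation": {f"observation_{i}": {} for i in range(6)},
        "condition_occurrence": {"condition_occurrence_0": {}},
        "person": {"female": {}, "male": {}},
        "measurement": {"covid_antibody": {}},
    },
}

# Count the number of CDM objects defined for each destination table
counts = {table: len(objects) for table, objects in rules["cdm"].items()}
print(counts)
# {'observation': 6, 'condition_occurrence': 1, 'person': 2, 'measurement': 1}
```

These counts match the per-table summary the tool itself logs when processing starts.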
f_map = carrot.tools.get_file_map_from_dir(f'{carrot.data_folder}/test/inputs/')
print (json.dumps(f_map,indent=6))
{
"Symptoms.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Symptoms.csv",
"Covid19_test.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Covid19_test.csv",
"covid19_antibody.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/covid19_antibody.csv",
"vaccine.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/vaccine.csv",
"Demographics.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/CaRROT-CDM/carrot/data/test/inputs/Demographics.csv"
}
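If get_file_map_from_dir is not convenient, the same map can be built by hand; it is just a plain dict of {file name: full path}. A minimal sketch, using a temporary directory and hypothetical file names:

```python
import glob
import os
import tempfile

# Hypothetical stand-in for a real input folder: create two CSVs in a
# temporary directory, then build the {file name: full path} map by hand.
input_dir = tempfile.mkdtemp()
for name in ("Demographics.csv", "Symptoms.csv"):
    with open(os.path.join(input_dir, name), "w") as f:
        f.write("PersonID\n")

f_map = {os.path.basename(p): p for p in glob.glob(os.path.join(input_dir, "*.csv"))}
print(sorted(f_map))  # ['Demographics.csv', 'Symptoms.csv']
```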
Use the f_map
to load all the inputs into a map between each file name and a dataframe object. This map can also be created manually via any preferred method.
inputs = carrot.tools.load_csv(f_map)
inputs
2022-06-17 14:46:49 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x10df1b040>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Covid19_test.csv [<carrot.io.common.DataBrick object at 0x10df1b0d0>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering covid19_antibody.csv [<carrot.io.common.DataBrick object at 0x10df1b310>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering vaccine.csv [<carrot.io.common.DataBrick object at 0x110d473d0>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Demographics.csv [<carrot.io.common.DataBrick object at 0x10deb5dc0>]
<carrot.io.plugins.local.LocalDataCollection at 0x10deb5d30>
inputs.keys()
dict_keys(['Symptoms.csv', 'Covid19_test.csv', 'covid19_antibody.csv', 'vaccine.csv', 'Demographics.csv'])
inputs['Symptoms.csv']
2022-06-17 14:46:49 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
795 | 62f6d46c48c7d9ff3d09a408d0ec880f167a5dc9c8fd34... | 2020-11-04 00:00:00.000000 | N | Y | N |
796 | c62510afc57db491f9f993387b76dd9a7d08f09c013269... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
797 | bdc5d8a48c23897906b09a9a3680bd2e9c8b3121edbda3... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
798 | fa88d374b9cf5e059fad4a2fe406feae4c49cbf4803083... | 2020-12-24 00:00:00.000000 | N | N | N |
799 | 6a97982dccf77dd3dafa27fcbdf75c017301f730ba186b... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
800 rows × 5 columns
Chunked CSV¶
For large datasets, it is better to chunk the data so as not to overload your computer's memory. This can be achieved by supplying a chunksize
argument:
inputs_chunked = carrot.tools.load_csv(f_map,chunksize=100)
inputs_chunked['Symptoms.csv']
2022-06-17 14:46:49 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:46:49 - LocalDataCollection - INFO - Using a chunksize of '100' nrows
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x10deb5850>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Covid19_test.csv [<carrot.io.common.DataBrick object at 0x111d3c490>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering covid19_antibody.csv [<carrot.io.common.DataBrick object at 0x111d3c580>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering vaccine.csv [<carrot.io.common.DataBrick object at 0x111d3c970>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Registering Demographics.csv [<carrot.io.common.DataBrick object at 0x111d3cc40>]
2022-06-17 14:46:49 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
The internal workings of the InputData
object will move on to the next slice of data when running the process, until all data has been processed.
inputs_chunked.next()
2022-06-17 14:46:49 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:46:49 - LocalDataCollection - INFO - Getting the next chunk of size '100' for 'Symptoms.csv'
2022-06-17 14:46:49 - LocalDataCollection - INFO - --> Got 100 rows
inputs_chunked['Symptoms.csv']
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
100 | 43974ed74066b207c30ffd0fed5146762e6c60745ac977... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
101 | fc56dbc6d4652b315b86b71c8d688c1ccdea9c5f1fd077... | 2020-02-04 00:00:00.000000 | N | N | N |
102 | f8809aff4d69bece79dabe35be0c708b890d7eafb841f1... | 2020-06-24 00:00:00.000000 | N | N | N |
103 | 5cf4e26bd3d87da5e03f80a43a64f1220a1f4ba9e1d634... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
104 | f8809aff4d69bece79dabe35be0c708b890d7eafb841f1... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
195 | a0f8b2c4cb1ac82abdb37f0fe5203b97be556c4468c83b... | 2020-12-24 00:00:00.000000 | N | N | N |
196 | 4c15f47afe7f817fd559e12ddbc276f4930c5822f20490... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
197 | 983bd614bb5afece5ab3b6023f71147cd7b6bc2314f9d2... | 2020-11-04 00:00:00.000000 | N | Y | N |
198 | c3ea99f86b2f8a74ef4145bb245155ff5f91cd856f2875... | 2020-06-24 00:00:00.000000 | N | N | N |
199 | f32828acecb4282c87eaa554d2e1db74e418cd68458430... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
100 rows × 5 columns
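A reasonable mental model for this chunked iteration is pandas' own chunksize option to read_csv (the tool's exact internals may differ): the data is delivered in fixed-size slices until it is exhausted.

```python
import io
import pandas as pd

# Ten rows of toy data, read back in chunks of four
csv = io.StringIO("PersonID,symptom1\n" + "\n".join(f"p{i},Y" for i in range(10)))
sizes = [len(chunk) for chunk in pd.read_csv(csv, chunksize=4)]
print(sizes)  # [4, 4, 2]
```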
SQL¶
Alternatively, if your input data is not in csv
format, you can load the data yourself from SQL / Spark / Databricks views etc.
First, initialise an input data handler object:
inputs_sql = carrot.io.DataCollection()
inputs_sql
2022-06-17 14:46:49 - DataCollection - INFO - DataCollection Object Created
<carrot.io.common.DataCollection at 0x111c49130>
Load your data, for example from a PostgreSQL server, using sqlalchemy
and pandas:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost:5432/coconnect_data_test')
df = pd.read_sql("my_data_table",engine)
df
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
Set the input object to this dataframe. Note: the name of the input table must be the same as the name in the json
rules. In this example, the name in the json
mapping for this table is Symptoms.csv
, so the dataframe is associated with that name.
inputs_sql['Symptoms.csv'] = carrot.io.DataBrick(df)
inputs_sql['Symptoms.csv']
2022-06-17 14:46:50 - DataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x111d44940>]
2022-06-17 14:46:50 - DataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
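Because the table name is the lookup key, it is worth checking that every source_table referenced in the rules has a registered input before processing. A minimal, hypothetical check using a small excerpt of the rules' "cdm" section:

```python
# Hypothetical excerpt of the rules' "cdm" section
rules_cdm = {
    "person": {
        "female": {
            "person_id": {"source_table": "Demographics.csv", "source_field": "PersonID"}
        }
    },
    "condition_occurrence": {
        "condition_occurrence_0": {
            "person_id": {"source_table": "Symptoms.csv", "source_field": "PersonID"}
        }
    },
}

# Every source_table referenced anywhere in the rules...
required = {
    field["source_table"]
    for objects in rules_cdm.values()
    for obj in objects.values()
    for field in obj.values()
}
# ...must appear among the registered input names
available = {"Demographics.csv", "Symptoms.csv"}
missing = required - available
print(missing)  # set()
```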
Spark Databricks¶
If you want to use something like Spark for integration with Databricks, you can use pyspark
to load the data:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/Users/calummacdonald/Downloads/postgresql-42.3.1.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/coconnect_data_test") \
.option("dbtable", "my_data_table") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
root
|-- PersonID: string (nullable = true)
|-- visit_date: string (nullable = true)
|-- symptom1: string (nullable = true)
|-- symptom2: string (nullable = true)
|-- symptom3: string (nullable = true)
inputs_databricks = carrot.io.DataCollection()
inputs_databricks['Symptoms.csv'] = carrot.io.DataBrick(df.select("*").toPandas())
inputs_databricks['Symptoms.csv']
2022-06-17 14:47:04 - DataCollection - INFO - DataCollection Object Created
2022-06-17 14:47:07 - DataCollection - INFO - Registering Symptoms.csv [<carrot.io.common.DataBrick object at 0x111c490d0>]
2022-06-17 14:47:07 - DataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
PersonID | visit_date | symptom1 | symptom2 | symptom3 | |
---|---|---|---|---|---|
0 | 16dc368a89b428b2485484313ba67a3912ca03f2b2b424... | 2020-11-15 00:00:00.000000 | Y | Y | Y |
1 | 37834f2f25762f23e1f74a531cbe445db73d6765ebe608... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
2 | 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51... | 2020-03-27 00:00:00.000000 | Y | Y | Y |
3 | 5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f... | 2020-06-24 00:00:00.000000 | N | N | N |
4 | 1253e9373e781b7500266caa55150e08e210bc8cd8cc70... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
... | ... | ... | ... | ... | ... |
95 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-11-04 00:00:00.000000 | N | Y | N |
96 | 8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6... | 2020-07-27 00:00:00.000000 | Y | Y | Y |
97 | a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2... | 2020-01-04 00:00:00.000000 | Y | Y | Y |
98 | 5a39cadd1b007093db50744797c7a04a34f73b35ed4447... | 2020-12-24 00:00:00.000000 | N | N | N |
99 | 27badc983df1780b60c2b3fa9d3a19a00e46aac798451f... | 2020-11-04 00:00:00.000000 | N | Y | N |
100 rows × 5 columns
Creating a CDM¶
As CO-CONNECT-Tools contains a pythonic version of the CDM, we can create an instance of the CommonDataModel
class.
from carrot.cdm import CommonDataModel
outputs = carrot.tools.create_csv_store(output_folder='output_dir/')
cdm = CommonDataModel(name=rules['metadata']['dataset'],
inputs=inputs,
outputs=outputs)
cdm
2022-06-17 14:47:07 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 14:47:07 - CommonDataModel - INFO - CommonDataModel (5.3.1) created with co-connect-tools version 0.0.0
2022-06-17 14:47:07 - CommonDataModel - INFO - Running with an DataCollection object
2022-06-17 14:47:07 - CommonDataModel - INFO - Turning on automatic cdm column filling
<carrot.cdm.model.CommonDataModel at 0x112313820>
Adding CDM Objects to the CDM¶
The next step is to loop over all the rules in the json
, creating and adding a new CDM object (e.g. Person) to the CDM for each one.
Within the loop, each CDM object's define function is set to a lambda function that applies the rules. This means that at runtime the tool (via the CommonDataModel
class) will execute the define function and know how to apply the mapping rules.
cdm.create_and_add_objects(rules)
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_0 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_1 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_2 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_3 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_4 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added observation_5 of type observation
2022-06-17 14:47:07 - CommonDataModel - INFO - Added condition_occurrence_0 of type condition_occurrence
2022-06-17 14:47:07 - CommonDataModel - INFO - Added female of type person
2022-06-17 14:47:07 - CommonDataModel - INFO - Added male of type person
2022-06-17 14:47:07 - CommonDataModel - INFO - Added covid_antibody of type measurement
After the initialisation and creation of the CDM objects, we can see which objects have been registered in the model:
cdm.objects()
{'observation': {'observation_0': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112313460>,
'observation_1': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112305f70>,
'observation_2': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x11236d790>,
'observation_3': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112313340>,
'observation_4': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112399fa0>,
'observation_5': <carrot.cdm.objects.versions.v5_3_1.observation.Observation at 0x112305d60>},
'condition_occurrence': {'condition_occurrence_0': <carrot.cdm.objects.versions.v5_3_1.condition_occurrence.ConditionOccurrence at 0x11239ce50>},
'person': {'female': <carrot.cdm.objects.versions.v5_3_1.person.Person at 0x1123135e0>,
'male': <carrot.cdm.objects.versions.v5_3_1.person.Person at 0x11239deb0>},
'measurement': {'covid_antibody': <carrot.cdm.objects.versions.v5_3_1.measurement.Measurement at 0x11239d970>}}
Process The CDM¶
Processing the CDM will execute all objects: a pandas dataframe will be created for each object, based on the rules that have been provided.
Importantly, the CDM will also format, finalise and merge all the individual dataframes for each object:
- Formatting makes sure the columns are in the correct format, i.e. a date is YYYY-MM-DD.
- Finalising merges the dataframes of all objects of the same type into a single output table (e.g. the two person objects are merged into one person table, as seen in the logs).
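The date formatting step can be pictured with plain pandas (a sketch, not the tool's exact implementation): raw timestamps are parsed and rendered back as YYYY-MM-DD strings.

```python
import pandas as pd

# Raw timestamps as they appear in the input tables above
raw = pd.Series(["2020-11-15 00:00:00.000000", "2020-01-04 00:00:00.000000"])
dates = pd.to_datetime(raw).dt.strftime("%Y-%m-%d")
print(list(dates))  # ['2020-11-15', '2020-01-04']
```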
cdm.process()
2022-06-17 14:47:07 - CommonDataModel - INFO - Starting processing in order: ['person', 'observation', 'condition_occurrence', 'measurement']
2022-06-17 14:47:07 - CommonDataModel - INFO - Number of objects to process for each table...
{
"observation": 6,
"condition_occurrence": 1,
"person": 2,
"measurement": 1
}
2022-06-17 14:47:07 - CommonDataModel - INFO - for person: found 2 objects
2022-06-17 14:47:07 - CommonDataModel - INFO - working on person
2022-06-17 14:47:07 - CommonDataModel - INFO - starting on female
2022-06-17 14:47:07 - Person - INFO - Called apply_rules
2022-06-17 14:47:07 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Demographics.csv' for the first time
2022-06-17 14:47:07 - Person - INFO - Mapped birth_datetime
2022-06-17 14:47:07 - Person - INFO - Mapped gender_concept_id
2022-06-17 14:47:07 - Person - INFO - Mapped gender_source_concept_id
2022-06-17 14:47:07 - Person - INFO - Mapped gender_source_value
2022-06-17 14:47:08 - Person - INFO - Mapped person_id
2022-06-17 14:47:08 - Person - WARNING - Requiring non-null values in gender_concept_id removed 400 rows, leaving 600 rows.
2022-06-17 14:47:08 - Person - INFO - Automatically formatting data columns.
2022-06-17 14:47:08 - Person - INFO - created df (0x1123bb490)[female]
2022-06-17 14:47:08 - CommonDataModel - INFO - finished female (0x1123bb490) ... 1/2 completed, 600 rows
2022-06-17 14:47:08 - LocalDataCollection - INFO - saving person_ids to output_dir//person_ids.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - starting on male
2022-06-17 14:47:08 - Person - INFO - Called apply_rules
2022-06-17 14:47:08 - Person - INFO - Mapped birth_datetime
2022-06-17 14:47:08 - Person - INFO - Mapped gender_concept_id
2022-06-17 14:47:08 - Person - INFO - Mapped gender_source_concept_id
2022-06-17 14:47:08 - Person - INFO - Mapped gender_source_value
2022-06-17 14:47:08 - Person - INFO - Mapped person_id
2022-06-17 14:47:08 - Person - WARNING - Requiring non-null values in gender_concept_id removed 600 rows, leaving 400 rows.
2022-06-17 14:47:08 - Person - INFO - Automatically formatting data columns.
2022-06-17 14:47:08 - Person - INFO - created df (0x1123f5760)[male]
2022-06-17 14:47:08 - CommonDataModel - INFO - finished male (0x1123f5760) ... 2/2 completed, 400 rows
2022-06-17 14:47:08 - LocalDataCollection - INFO - updating person_ids in output_dir//person_ids.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - saving dataframe (0x1123bb190) to <carrot.io.plugins.local.LocalDataCollection object at 0x112313ee0>
2022-06-17 14:47:08 - LocalDataCollection - INFO - saving person to output_dir//person.csv
2022-06-17 14:47:08 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:08 - CommonDataModel - INFO - finalised person on iteration 0 producing 1000 rows from 2 tables
2022-06-17 14:47:08 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:47:08 - LocalDataCollection - INFO - All input files for this object have now been used.
2022-06-17 14:47:08 - LocalDataCollection - INFO - resetting used bricks
2022-06-17 14:47:08 - CommonDataModel - INFO - for observation: found 6 objects
2022-06-17 14:47:08 - CommonDataModel - INFO - working on observation
2022-06-17 14:47:08 - CommonDataModel - INFO - starting on observation_0
2022-06-17 14:47:08 - Observation - INFO - Called apply_rules
2022-06-17 14:47:08 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Demographics.csv' for the first time
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:08 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x1123bb5e0)[observation_0]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_0 (0x1123bb5e0) ... 1/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_1
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x11241ce20)[observation_1]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_1 (0x11241ce20) ... 2/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_2
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x112427580)[observation_2]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_2 (0x112427580) ... 3/6 completed, 100 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_3
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 700 rows, leaving 300 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x1124357f0)[observation_3]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_3 (0x1124357f0) ... 4/6 completed, 300 rows
2022-06-17 14:47:09 - CommonDataModel - INFO - starting on observation_4
2022-06-17 14:47:09 - Observation - INFO - Called apply_rules
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:09 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:09 - Observation - INFO - Mapped person_id
2022-06-17 14:47:09 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 800 rows, leaving 200 rows.
2022-06-17 14:47:09 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:09 - Observation - INFO - created df (0x112440490)[observation_4]
2022-06-17 14:47:09 - CommonDataModel - INFO - finished observation_4 (0x112440490) ... 5/6 completed, 200 rows
2022-06-17 14:47:10 - CommonDataModel - INFO - starting on observation_5
2022-06-17 14:47:10 - Observation - INFO - Called apply_rules
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_concept_id
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_datetime
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_source_concept_id
2022-06-17 14:47:10 - Observation - INFO - Mapped observation_source_value
2022-06-17 14:47:10 - Observation - INFO - Mapped person_id
2022-06-17 14:47:10 - Observation - WARNING - Requiring non-null values in observation_concept_id removed 900 rows, leaving 100 rows.
2022-06-17 14:47:10 - Observation - INFO - Automatically formatting data columns.
2022-06-17 14:47:10 - Observation - INFO - created df (0x112456f70)[observation_5]
2022-06-17 14:47:10 - CommonDataModel - INFO - finished observation_5 (0x112456f70) ... 6/6 completed, 100 rows
2022-06-17 14:47:10 - CommonDataModel - INFO - saving dataframe (0x112427280) to <carrot.io.plugins.local.LocalDataCollection object at 0x112313ee0>
2022-06-17 14:47:10 - LocalDataCollection - INFO - saving observation to output_dir//observation.csv
2022-06-17 14:47:10 - LocalDataCollection - INFO - finished save to file
2022-06-17 14:47:10 - CommonDataModel - INFO - finalised observation on iteration 0 producing 900 rows from 6 tables
2022-06-17 14:47:10 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 14:47:10 - LocalDataCollection - INFO - All input files for this object have now been used.
2022-06-17 14:47:10 - LocalDataCollection - INFO - resetting used bricks
2022-06-17 14:47:10 - CommonDataModel - INFO - for condition_occurrence: found 1 object
2022-06-17 14:47:10 - CommonDataModel - INFO - working on condition_occurrence
2022-06-17 14:47:10 - CommonDataModel - INFO - starting on condition_occurrence_0
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Called apply_rules
2022-06-17 14:47:10 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Symptoms.csv' for the first time
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_concept_id
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_end_datetime
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_source_concept_id
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_source_value
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped condition_start_datetime
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Mapped person_id
2022-06-17 14:47:10 - ConditionOccurrence - WARNING - Requiring non-null values in condition_concept_id removed 400 rows, leaving 400 rows.
2022-06-17 14:47:10 - ConditionOccurrence - INFO - Automatically formatting data columns.
2022-06-17 14:47:10 - ConditionOccurrence - INFO - created df (0x112456790)[condition_occurrence_0]
2022-06-17 14:47:10 - CommonDataModel - INFO - finished condition_occurrence_0 (0x112456790) ... 1/1 completed, 400 rows
Inspect Outputs¶
cdm.keys()
cdm['person'].dropna(axis=1,how='all')
cdm['observation'].dropna(axis=1,how='all')
cdm['condition_occurrence'].dropna(axis=1,how='all')
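The dropna(axis=1, how='all') calls above simply hide CDM columns that the rules never populated. The effect on a small stand-in frame:

```python
import pandas as pd

# A person-like frame where race_concept_id was never filled by any rule
df = pd.DataFrame({
    "person_id": [1, 2],
    "gender_concept_id": [8532, 8507],
    "race_concept_id": [None, None],
})
trimmed = df.dropna(axis=1, how="all")
print(list(trimmed.columns))  # ['person_id', 'gender_concept_id']
```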