Data Science for Merged FHIR and PacBio VCF Data on Azure Machine Learning Notebooks
Published Aug 09 2022 02:20 PM
1. Introduction

This blog post is an extension of a previous article, “Convert Synthetic FHIR and PacBio VCF Data to parquet and Explore with Azure Synapse Analytics”, which covers generating synthetic clinical data in FHIR format [1] using Synthea [2] and converting FHIR and genomic VCF data into tabular Parquet format [3] for further analysis. If you haven’t yet read that article, or its accompanying Jupyter Notebook walkthrough, please do so first!


This article, which also has a corresponding Jupyter notebook, demonstrates an example use case of merged clinical and genomics patient data. Whereas our previous notebook ended with both data modalities in a convenient tabular format, this notebook merges both tables and demonstrates a basic machine learning application: patient clustering. Each row in the final table represents a patient, and the columns contain information about that patient, whether clinical (e.g. gender or medications) or genomic (e.g. variant genotypes or allele fractions). We recommend Azure Machine Learning Studio for testing the sample notebooks.



Note: all FHIR patient data in the above table is fully synthetic, generated by Synthea [2].


Note that we are providing an example architectural design to illustrate how Microsoft tools can be utilized to connect the pieces together (data + interoperability + secure cloud + AI tools), enabling researchers to conduct research analyzing genomics+clinical data. We are not providing or recommending specific instructions for how investigators should conduct their research – we will leave that to the professionals!


2. Methodology

First, we parsed the Parquet FHIR data to retrieve relevant information from the stored Patient, MedicationRequest, and DiagnosticReport Parquet resources, and loaded this information into a pandas [4] DataFrame. Next, we merged individual patient PacBio VCFs into a single joint VCF file. Then, we extracted certain fields from the joint VCF by converting it to a TSV and loading it into a pandas DataFrame. Lastly, we merged the two DataFrames (results shown above) and clustered patients using scikit-learn [5]. An overview of this pipeline is shown in the figure below:
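As a rough illustration of the final merge step, here is a minimal pandas sketch. The column names (patient_id, variant, alt_allele_count) and all values are hypothetical stand-ins, not the schema used in the notebook; the point is only to show a per-variant genomic table being reshaped to one row per patient and then joined to the clinical table.

```python
import pandas as pd

# Hypothetical clinical table: one row per patient (values are made up).
clinical = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3"],
    "gender": ["female", "male", "female"],
    "num_medications": [2, 0, 5],
})

# Hypothetical genomic table in "long" form: one row per (patient, variant).
genomic_long = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "variant": ["chr1:100:A>G", "chr2:200:C>T"] * 3,
    "alt_allele_count": [0, 1, 2, 0, 1, 1],
})

# Pivot so that each variant becomes a column and each patient a single row.
genomic = genomic_long.pivot(
    index="patient_id", columns="variant", values="alt_allele_count"
).reset_index()
genomic.columns.name = None

# Inner join keeps only patients present in both tables; validate raises
# an error if either side has duplicate patient ids.
merged = clinical.merge(genomic, on="patient_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join is used here so that every row has both modalities; a left join would instead keep patients without genomic data and leave their variant columns empty.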




3. Results

Several clustering algorithms require knowing the final number of clusters a priori. To select a reasonable number of clusters, we used the “elbow method”, which consists of plotting the number of clusters against cluster inertia (the sum of squared distances from each datapoint to its nearest cluster center). As the number of clusters increases, the inertia of the best clustering can only decrease, but once the data has already been clustered fairly well, there is less benefit in introducing additional clusters. This shows up on an “elbow plot” as an elbow, or a slight bend towards the horizontal, which we can see at n = 5.
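The elbow method can be sketched with scikit-learn as follows. The data here is synthetic (50 random points drawn around 5 made-up centers), standing in for the scaled patient table, and the range of cluster counts is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled patient table: 50 samples, 4 features,
# drawn around 5 made-up centers.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(5, 4))
X = np.vstack([c + rng.normal(0, 0.02, size=(10, 4)) for c in centers])

# Fit K-means for a range of cluster counts and record the inertia
# (scikit-learn reports the sum of squared distances to the nearest center).
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    inertias[k] = model.fit(X).inertia_

for k, value in inertias.items():
    print(f"k={k:2d}  inertia={value:.4f}")
```

Plotting these values against k (for example with matplotlib) gives the elbow plot; the elbow is the point after which the curve flattens.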




We then clustered our dataset of 50 patients using three different clustering methods: K-means++, an iterative approach that updates cluster centroid locations to minimize total inertia, starting from repeated seeded random initializations; DBSCAN, a density-based approach that builds clusters outward from high-density areas containing “core samples”; and Spectral Clustering, which performs a low-dimensional embedding of the samples’ affinity matrix prior to clustering. Details on each clustering method are available on scikit-learn’s webpage [5].
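For readers who want to try the three methods side by side, the following is a minimal sketch using scikit-learn. The data is synthetic, and the hyperparameters (eps, min_samples, and the default RBF affinity for Spectral Clustering) are illustrative assumptions rather than the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering

# Synthetic stand-in for the scaled patient table: 5 well-separated blobs
# of 10 points each in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.9, 0.9]])
X = np.vstack([c + rng.normal(0, 0.03, size=(10, 2)) for c in centers])

labels = {
    # K-means with k-means++ seeding, restarted 10 times.
    "kmeans++": KMeans(
        n_clusters=5, init="k-means++", n_init=10, random_state=0
    ).fit_predict(X),
    # DBSCAN does not take a cluster count; points labelled -1 are noise.
    "dbscan": DBSCAN(eps=0.15, min_samples=3).fit_predict(X),
    # Spectral clustering on the default RBF affinity matrix.
    "spectral": SpectralClustering(n_clusters=5, random_state=0).fit_predict(X),
}

for name, assigned in labels.items():
    print(name, np.unique(assigned))
```

Note that DBSCAN chooses the number of clusters itself from the density parameters, so its result is not directly tied to the n = 5 found with the elbow method.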



We evaluated these three clustering methods using the Davies-Bouldin Index, the Calinski-Harabasz Index, and the Silhouette Coefficient [5]. These three evaluation metrics were selected because none of them requires a ground truth assignment of samples to classes. Since we are simply trying to discover similarities between patients, no ground truth is available. As the following table shows, K-Means++ selected the best clustering overall (for the Davies-Bouldin Index a lower score is better, whereas for the Calinski-Harabasz Index and the Silhouette Coefficient a higher score is better).
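The three metrics are all available in sklearn.metrics and take only the feature matrix and the predicted labels. The sketch below, on synthetic data, compares a K-means labelling against a purely random labelling to show how the scores separate a good clustering from a poor one; the helper function and data are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data: 5 well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.9, 0.9]])
X = np.vstack([c + rng.normal(0, 0.03, size=(10, 2)) for c in centers])

candidates = {
    "kmeans": KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X),
    "random": rng.integers(0, 5, size=len(X)),  # baseline: arbitrary labels
}


def evaluate(features, assigned):
    """Score a labelling with three metrics that need no ground truth."""
    return {
        "davies_bouldin": davies_bouldin_score(features, assigned),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(features, assigned),  # higher is better
        "silhouette": silhouette_score(features, assigned),  # in [-1, 1], higher is better
    }


scores = {name: evaluate(X, assigned) for name, assigned in candidates.items()}
for name, result in scores.items():
    print(name, result)
```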




Lastly, we performed bottom-up clustering of our patients using an agglomerative clustering algorithm and displayed the results below in a dendrogram. The dendrogram can be cut at any height to select a clustering of patients at the desired granularity.
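One way to build such a tree and cut it at different granularities is with SciPy's hierarchical clustering routines, sketched below on synthetic data. The use of SciPy and of Ward linkage is an assumption made for illustration; the notebook may use a different library or linkage criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in data: 5 blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.9, 0.9]])
X = np.vstack([c + rng.normal(0, 0.03, size=(10, 2)) for c in centers])

# Bottom-up merging: each row of Z records one merge of two clusters.
Z = linkage(X, method="ward")

# Cut the same tree at two granularities without refitting.
coarse = fcluster(Z, t=2, criterion="maxclust")
fine = fcluster(Z, t=5, criterion="maxclust")

# no_plot=True returns the layout (leaf order etc.) without needing matplotlib;
# drop it to draw the dendrogram on the current matplotlib axes.
tree = dendrogram(Z, no_plot=True)
print("leaf order:", tree["leaves"][:10], "...")
print("clusters at coarse cut:", len(set(coarse)), "fine cut:", len(set(fine)))
```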




4. Discussion

We encountered several challenges in working with the FHIR data. Firstly, although Parquet is a tabular format, the original format of our FHIR data was JSON, which allows nesting fields to arbitrary depth. As a result, some Parquet data columns stored flattened strings of JSON data which required custom parsing. Secondly, data formats that are easiest for clinicians to work with are not necessarily ideal for machine learning applications. We converted all dates to elapsed years and applied min-max scaling to all numeric fields to aid clustering. Lastly, not all information related to a patient is stored within the FHIR Patient resource. In addition to parsing simple fields from Patient Parquet resources, our notebook demonstrates how to extract information from other resources (i.e. MedicationRequest and DiagnosticReport) and associate it with the corresponding patients using the medical record number (MRN) field.
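The preprocessing steps described above can be sketched roughly as follows. All column names (mrn, birthDate, name, medication), the sample rows, and the reference date are hypothetical; the actual Parquet schema and parsing logic in the notebook may differ.

```python
import json

import pandas as pd

# Hypothetical Patient rows; the "name" column holds a JSON string.
patients = pd.DataFrame({
    "mrn": ["001", "002", "003"],
    "birthDate": ["1950-03-01", "1985-07-15", "2001-11-30"],
    "name": [
        '[{"family": "Doe", "given": ["Jane"]}]',
        '[{"family": "Roe", "given": ["Rick"]}]',
        '[{"family": "Poe", "given": ["Pat"]}]',
    ],
})

# Hypothetical MedicationRequest rows, linked to patients by MRN.
meds = pd.DataFrame({
    "mrn": ["001", "001", "003"],
    "medication": ["drug_a", "drug_b", "drug_a"],
})

# 1. Parse the JSON string column into a simple field.
patients["family_name"] = patients["name"].map(lambda s: json.loads(s)[0]["family"])

# 2. Convert dates to elapsed years relative to a fixed reference date.
reference = pd.Timestamp("2022-08-09")
birth = pd.to_datetime(patients["birthDate"])
patients["age_years"] = (reference - birth).dt.days / 365.25

# 3. Summarize another resource per patient and join it on the MRN.
med_counts = meds.groupby("mrn").size().rename("num_medications").reset_index()
features = patients.merge(med_counts, on="mrn", how="left")
features = features.fillna({"num_medications": 0})

# 4. Min-max scale the numeric columns to the range [0, 1].
numeric = ["age_years", "num_medications"]
col_min, col_max = features[numeric].min(), features[numeric].max()
features[numeric] = (features[numeric] - col_min) / (col_max - col_min)

print(features[["mrn", "family_name"] + numeric])
```

The left join keeps patients with no medication requests, which are then given a count of zero before scaling.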

In future work, we plan to use GVCF files as inputs rather than VCFs. In contrast to VCF files, GVCF files contain variant calling information about all (including reference/non-variant) positions on the genome, instead of only reporting the confidently called variants. In our notebook, we simply imputed reference quality and depth scores using the genome-wide average, akin to a GVCF with an exceptionally large block size.
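The imputation idea can be illustrated with a small pandas sketch. It assumes that "genome-wide average" means each patient's own mean over the sites where a value was reported; the site names and depth values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical read depth per patient at three variant sites. NaN marks a
# site with no record in that patient's VCF (assumed to be a reference call).
depth = pd.DataFrame(
    {
        "site_a": [30.0, np.nan, 28.0],
        "site_b": [np.nan, 35.0, 31.0],
        "site_c": [32.0, 29.0, np.nan],
    },
    index=["p1", "p2", "p3"],
)

# Per-patient mean over the sites that were actually reported.
patient_mean = depth.mean(axis=1)

# Fill each missing entry with that patient's mean (aligned on the row index).
imputed = depth.apply(lambda column: column.fillna(patient_mean))
print(imputed)
```

With GVCF inputs, the reference blocks would supply real depth and quality values at these positions, so this imputation step would no longer be needed.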

As a next step, we plan to integrate the upstream PacBio variant calling pipeline and demonstrate the whole pipeline, from raw data to final analysis. We also plan to expand our analysis from simple clustering to a realistic example workflow: analyzing the effect of specific germline variants on a decision support system’s effectiveness.



[1] HL7 International. “Welcome to FHIR: HL7 FHIR Release 4B Documentation”

[2] Walonoski, Jason, et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25.3 (2018): 230-238.

[3] Ivanov, Todor, and Matteo Pergolesi. "The impact of columnar file formats on SQL‐on‐Hadoop engine performance: A study on ORC and Parquet." Concurrency and Computation: Practice and Experience 32.5 (2020): e5523.

[4] Reback, Jeff, et al. "pandas-dev/pandas: Pandas 1.0.5." Zenodo (2020).

[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research 12 (2011): 2825-2830.


Last update: Aug 09 2022 02:34 PM