a Basel Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel, Switzerland
b Division of Infectious Diseases and Hospital Epidemiology, University Hospital Basel, Switzerland
c University of Basel, Switzerland
d Focal Area of Computational and Systems Biology, Biozentrum, University of Basel, Switzerland
e SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
f Department of Finance, University Hospital Basel, Switzerland
g Department of Internal Medicine, Kantonsspital Luzern, Switzerland
AIMS OF THE STUDY: Based on large sets of routine hospital data from inpatient cases, we aimed to explore multimorbidity and intervention clusters showing high risks for in-hospital mortality and unplanned readmissions using data-driven analytical methods.
METHODS: We performed an explorative, historical cohort study of consecutive inpatient cases at a tertiary care centre with an integrated platform for routine healthcare data in Switzerland. From January 2012 through to December 2017, all inpatients aged ≥18 years at hospital admission were eligible for study inclusion. We predefined all-cause in-hospital death and unplanned hospital readmission as co-primary outcomes. In a first step, we explored and visualised multimorbidity and intervention clusters using mutual information analysis. In a subsequent step, we trained multi-layer Bayesian networks to identify clusters associated with in-hospital death and/or unplanned hospital readmission.
RESULTS: Among 190,837 inpatient cases, 7994 unique diagnoses and 6639 interventions were routinely recorded during the six-year study period. Based on the mutual information analysis, we identified 32 multimorbidity clusters and 24 intervention clusters – of which several were directly related to in-hospital mortality and/or unplanned readmission in the subsequent Bayesian network analysis.
CONCLUSIONS: Bayesian network analysis may be used as a tool to mine large healthcare databases in order to explore intervention targets for quality improvement programmes. However, the resulting associations should be substantiated in consecutive investigations using specific causal models. (Trial registration no. EKNZ 2016-02128.)
Keywords: Bayesian network, cluster, digital epidemiology, electronic medical record, high-risk
A main goal of healthcare providers should be to deliver high-quality patient care and to continuously improve preventive and treatment services where needed [1, 2]. At present, many quality improvement and patient safety programmes focus on a few specific patient groups, health outcomes or high-risk interventions (e.g., prevention of postoperative infections, pressure ulcers), without any overarching, hospital-wide systematic prevention strategy. The increasing digitisation and integration of healthcare data may provide ideal conditions for the use of routine healthcare databases to identify high-risk patient and intervention groups and to define data-driven prevention targets [3, 4]. Clusters or groups of patients may be defined based on any relevant characteristic, such as specific disease combinations (multimorbidity clusters) or networks of therapeutic interventions (intervention clusters).
Using the large healthcare database of a tertiary care institution and based on previous work on multimorbidity clustering, we aimed to explore high-risk multimorbidity and intervention clusters by using data mining methods, namely mutual information and Bayesian network analysis, to assess the interrelationship between inpatient characteristics, diagnoses and interventions and health outcomes (unplanned readmissions and in-hospital mortality). We hypothesised that Bayesian networks may be useful tools with which to explore high-risk inpatient groups based on routine data from electronic medical records.
Materials and methods
Study design and patient selection
We performed an explorative, historical cohort study at the University Hospital Basel, an 850-bed tertiary care hospital in Switzerland with more than 1,000,000 ambulatory patient contacts and over 37,000 inpatient cases per year. Treatment services comprise all main medical and surgical disciplines, including hematopoietic stem cell and kidney transplantations.
From January 2012 through to December 2017, all consecutive inpatients aged ≥18 years at hospital admission were eligible for study inclusion. We excluded patients who declined to participate in observational studies that make secondary use of their routine health data (i.e. general research consent). We defined inpatients as patients who were hospitalised for at least 24 hours.
The Ethics Committee of Northwestern and Central Switzerland approved this study with a waiver of informed consent (project number 2016-02128). We reported our study according to the ‘REporting of studies Conducted using Observational Routinely-collected health Data’ (RECORD) recommendations.
Data extraction and study definitions
A data manager extracted the relevant study data of the included patients on an inpatient case level (for each hospitalisation period, lasting from hospital admission to discharge) from an in-house, in-memory data platform (SAP® HANA; SAP AG, Walldorf, Germany), as described previously [5, 7]. No filtering was required, as the relevant variables were available for all eligible inpatient cases. No linkage of the data with other databases was performed. The investigators had full access to the complete database population.
For each inpatient case, diagnoses and interventions (diagnostic, medical and surgical interventions) were routinely coded by medical coders for billing and controlling purposes according to the International Classification of Diseases, 10th revision (ICD-10; German modification) and the Swiss classification of medical interventions (“Schweizerische Operationsklassifikation”) respectively. Only the active discharge diagnoses of each inpatient case, which were either present on hospital admission (e.g., previously diagnosed chronic diseases) or newly diagnosed during the hospital stay, were coded.
Based on their clinical relevance, we defined all-cause in-hospital death and unplanned hospital readmission as a priori co-primary outcome measures. Unplanned readmissions (to the index hospital) due to the same diagnosis as during the previous hospitalisation (e.g., surgical site infection after heart surgery) are routinely coded for inpatients in Switzerland if they occur within 18 days after discharge.
We conducted the computations at the High Performance Computing Centre of the University of Basel. We performed all statistical analyses using the programming language Python, version 3.6.4 (Python Software Foundation, Wilmington, DE). Analysis codes can be requested from the corresponding author.
Mutual information cluster analysis
In order to quantify the co-occurrence relationship between pairs of diagnoses and between pairs of interventions, and building upon our previous work, we calculated pairwise mutual information scores for all pairs of diagnoses and all pairs of interventions (diagnostic, medical and surgical interventions combined). Mutual information is an information theory concept that quantifies the general interdependence of two random variables: it measures how much information about one random variable can be obtained through the other.
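As a minimal sketch of the pairwise calculation described above, the following computes the mutual information between 0/1 indicator columns (code present or absent per inpatient case). The toy data, the column names and the use of the natural logarithm with a plug-in frequency estimate are assumptions for illustration; the text does not state the logarithm base or estimator used in the study.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


# Hypothetical toy data: one entry per inpatient case, one 0/1 column per code.
cases = {
    "code_A": [1, 1, 1, 1, 0, 0, 0, 0],
    "code_B": [1, 1, 1, 1, 0, 0, 0, 0],  # always co-occurs with code_A
    "code_C": [1, 1, 0, 0, 1, 1, 0, 0],  # independent of code_A
}

# Score every unordered pair of codes.
mi_scores = {
    (u, v): mutual_information(cases[u], cases[v])
    for u, v in combinations(sorted(cases), 2)
}
```

In this toy example the perfectly co-occurring pair scores log 2 (the entropy of a balanced binary variable), whereas the independent pair scores zero. Because the numerical scale depends on the logarithm base, any cut-off applied to such scores is only comparable when the same base is used.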
We used these mutual information scores to graph separate multimorbidity/diagnosis and intervention clusters, with diagnoses and interventions as vertices and mutual information scores as edges. Based on clinical opinion and to avoid over-clustering, we considered only mutual information scores >0.01 and we chose to cluster diagnoses and interventions separately. Mutual information cut-offs ≤0.01 led to a multitude of clinically implausible diagnosis and intervention pairs/patterns.
We identified separate clusters by looking for connected components in the mutual information graphs. We considered any inpatient to be part of a given cluster if they had at least one diagnosis or intervention from that cluster. Therefore, inpatients may have had more than one cluster association.
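The clustering steps above (threshold the scores, build a graph, take its connected components, then assign cases to clusters) can be sketched as follows. The code labels and score values are invented placeholders, and the helper names `connected_components` and `cluster_membership` are hypothetical; only the 0.01 cut-off and the "at least one code" membership rule come from the text.

```python
from collections import defaultdict

MI_THRESHOLD = 0.01  # cut-off stated in the text; only scores above it form edges


def connected_components(edges):
    """Connected components (as sets of vertices) of an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start vertex
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def cluster_membership(case_codes, clusters):
    """Indices of all clusters that share at least one code with the case."""
    codes = set(case_codes)
    return [i for i, cluster in enumerate(clusters) if cluster & codes]


# Hypothetical pairwise scores; the values are invented for illustration.
mi_scores = {
    ("dx_A", "dx_B"): 0.020,
    ("dx_B", "dx_C"): 0.015,
    ("dx_C", "dx_D"): 0.004,  # below the cut-off, so this edge is dropped
    ("dx_D", "dx_E"): 0.030,
}
edges = [pair for pair, score in mi_scores.items() if score > MI_THRESHOLD]
clusters = connected_components(edges)
```

Here the dropped edge splits the codes into two clusters, and a case coded with both `dx_B` and `dx_E` belongs to both, which mirrors the statement that inpatients may have more than one cluster association. Codes without any edge above the cut-off do not appear in any cluster in this sketch.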
Learning Bayesian networks
Bayesian networks are directed acyclic graphs in which the vertices represent variables and the edges/arrows represent conditional dependencies. We included the following inpatient variables in our analysis: discretised age category at hospital admission (years; <40, 40–59, 60–79, ≥80), emergency admission status (yes/no), sex (male/female), and the discretised number of secondary ICD-10 diagnoses/comorbidities (<3, 3–9, ≥10). The respective strata were defined a priori based on consensus within the study group. For each multimorbidity/diagnosis and intervention cluster, we learned a separate Bayesian network by adding the respective cluster and outcome status.
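The discretisation of the inpatient variables can be sketched as below. The strata boundaries are those given in the text; the function names, the string labels and the tuple layout are assumptions chosen for illustration only.

```python
def age_category(age_years):
    """Age stratum at hospital admission (<40, 40-59, 60-79, >=80 years)."""
    if age_years < 40:
        return "<40"
    if age_years < 60:
        return "40-59"
    if age_years < 80:
        return "60-79"
    return ">=80"


def comorbidity_category(n_secondary_diagnoses):
    """Stratum for the number of secondary diagnoses (<3, 3-9, >=10)."""
    if n_secondary_diagnoses < 3:
        return "<3"
    if n_secondary_diagnoses < 10:
        return "3-9"
    return ">=10"


def encode_case(age_years, emergency, sex, n_secondary_diagnoses, in_cluster, outcome):
    """One discrete record per inpatient case: four inpatient variables,
    cluster membership and outcome status (hypothetical layout)."""
    return (
        age_category(age_years),
        "yes" if emergency else "no",
        sex,
        comorbidity_category(n_secondary_diagnoses),
        "yes" if in_cluster else "no",
        "yes" if outcome else "no",
    )


# Example: a 63-year-old woman admitted as an emergency with 4 secondary diagnoses,
# belonging to the cluster under study and without the outcome.
record = encode_case(63, True, "female", 4, True, False)
```

One such record per case, for a given cluster and outcome, would form the input to the structure learning for that cluster's network.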
For probabilistic modelling, we used the Python package Pomegranate to learn the Bayesian networks. Based on clinical opinion and for better interpretability (with the level of complexity limited due to the explorative nature of these analyses), we added a few a priori structure constraints: the inpatient variables age, emergency admission status, sex and number of secondary diagnoses/comorbidities were restricted to layer one. Furthermore, the cluster association/membership and the outcome variable were restricted to layers two and three respectively. Only connections from layer one to layer two and/or three and from layer two to layer three were allowed. The rationale behind these restrictions was that we were interested in whether age, emergency admission status, sex and number of secondary diagnoses have an effect on cluster membership and, furthermore, whether these variables combined have an effect on the outcome. Any effects of the outcome on cluster membership were not of interest, since there cannot be a causal effect in this direction. We used the “exact” algorithm of the Bayesian network class of the Pomegranate package, i.e. an exhaustive search over all possible structures, to learn the Bayesian networks. This procedure optimises the minimum description length (MDL) score. The respective conditional probability table is provided in the supplementary table in appendix 4.
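To make the layering concrete, the sketch below enumerates the edges permitted by the three layers and runs an exhaustive, MDL-scored parent search on toy data. It is a standard-library illustration and not Pomegranate's implementation: the MDL form used here (negative log-likelihood in bits plus half of log2(N) per free parameter), the variable names and the toy data are all assumptions.

```python
import math
from collections import Counter
from itertools import chain, combinations, product

# Layer assignment as described in the text.
LAYERS = {"age": 1, "emergency": 1, "sex": 1, "n_comorbidities": 1,
          "cluster": 2, "outcome": 3}


def allowed_edges(layers):
    """Directed edges permitted by the layering: lower layer to higher layer only."""
    return [(u, v) for u, v in product(layers, repeat=2) if layers[u] < layers[v]]


def mdl_node(rows, child, parents, arity):
    """Assumed MDL contribution of one node given a candidate parent set."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)
    neg_loglik = -sum(c * math.log2(c / parent_counts[pa])
                      for (pa, _), c in joint.items())
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return neg_loglik + 0.5 * math.log2(len(rows)) * n_params


def best_parents(rows, child, candidates, arity):
    """Exhaustive search over all subsets of the candidate parents."""
    subsets = chain.from_iterable(
        combinations(candidates, k) for k in range(len(candidates) + 1))
    return min(subsets, key=lambda ps: mdl_node(rows, child, ps, arity))


# Toy data (invented): a full factorial design in which the outcome simply
# mirrors cluster membership and nothing depends on the layer-one variables.
values = {"age": ["<40", "40-59", "60-79", ">=80"], "emergency": ["no", "yes"],
          "sex": ["female", "male"], "n_comorbidities": ["<3", "3-9", ">=10"],
          "cluster": ["no", "yes"]}
arity = {name: len(levels) for name, levels in values.items()}
arity["outcome"] = 2
rows = []
for combo in product(*values.values()):
    row = dict(zip(values, combo))
    row["outcome"] = row["cluster"]
    rows.append(row)

edges = allowed_edges(LAYERS)
learned = {
    child: best_parents(rows, child, [u for u, v in edges if v == child], arity)
    for child in ("cluster", "outcome")
}
```

With six variables the layering leaves nine permissible edges and therefore 2^9 = 512 candidate structures. Because every permitted edge points to a higher layer, any combination of parent sets is acyclic, so with a score that decomposes per node the search can be run for each node separately. On the toy data the search recovers a single edge from cluster membership to the outcome.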
During the six-year study period there were 198,972 inpatient cases overall. After excluding 8135 cases who declined to give their general research consent, we included the remaining 190,837 cases (114,651 unique inpatients) in the final analysis. During the study period, 7994 unique diagnoses and 6639 interventions were recorded. The main characteristics of the study population are presented in table 1. The median age of the patient cases was 63 years (interquartile range 45–76), with a median length of hospital stay of 4 days (interquartile range 2–8). 3834 out of the 190,837 patient cases (2.0%) died during the hospital stay and 15,643 out of the 190,837 patient cases (8.2%) had an unplanned hospital readmission.
Table 1: Main characteristics of the study population.
|Characteristic|Value|
|No. of unique inpatients|114,651|
|No. of cases per inpatient during the study period, mean|1.66|
|Age in years at hospital admission, median (IQR)|63 (45–76)|
|Female sex, n (%)|97,521 (51.1)|
|Length of hospital stay in days, median (IQR)|4 (2–8)|
|Emergency admission, n (%)|90,969 (47.7)|
|No. of secondary diagnoses/comorbidities*, median (IQR)|4 (2–7)|
|No. of interventions during the hospital stay†, median (IQR)|2 (1–4)|
|Discharge discipline‡, n cases (%)||
|– Medicine|84,327 (44.2)|
|– Surgery and orthopaedics|21,979 (11.5)|
|– Gynaecology and obstetrics|77,286 (40.5)|
|– Other|7245 (3.8)|
|Health outcomes, n cases (%)||
|– Unplanned hospital readmission§|15,643 (8.2)|
|– In-hospital all-cause death|3834 (2.0)|
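The counts and percentages reported for the study population can be cross-checked with a few lines of arithmetic. This is only a consistency check on the published figures, with percentages taken over included inpatient cases; the variable and function names are illustrative.

```python
total_cases = 198_972        # inpatient cases during the study period
declined_consent = 8_135     # cases without general research consent
included_cases = total_cases - declined_consent
unique_inpatients = 114_651


def pct(numerator, denominator=None):
    """Percentage of included cases, rounded to one decimal place."""
    if denominator is None:
        denominator = included_cases
    return round(100 * numerator / denominator, 1)


cases_per_inpatient = round(included_cases / unique_inpatients, 2)

# Discharge disciplines as listed in the table (number of cases).
discipline_cases = {
    "Medicine": 84_327,
    "Surgery and orthopaedics": 21_979,
    "Gynaecology and obstetrics": 77_286,
    "Other": 7_245,
}

in_hospital_death_pct = pct(3_834)
unplanned_readmission_pct = pct(15_643)
female_pct = pct(97_521)
emergency_pct = pct(90_969)
```

The discipline counts add up to the number of included cases, and the recomputed percentages agree with the reported values to one decimal place.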
We identified 32 diagnosis/multimorbidity clusters (overview in fig. 1A) and 24 intervention clusters (overview in fig. 1B) in the mutual information analysis. These clusters covered a wide spectrum of diagnoses as well as medical and surgical interventions. The respective clusters are described in detail in the supplementary text (appendix 1) and the figures in appendices 2 and 3. Several of these clusters were found to be directly related to in-hospital mortality and/or unplanned readmissions in the subsequent Bayesian network analysis. Furthermore, various multimorbidity and intervention clusters were only indirectly related to in-hospital mortality and unplanned readmissions via patient and admission characteristics (sex, age, emergency admission status and number of comorbidities). The Bayesian networks for all multimorbidity and intervention clusters are presented in the supplementary materials (appendices 2 and 3). For all-cause in-hospital death and unplanned readmission, we chose, for illustrative purposes and for better interpretability, to describe eight Bayesian networks, which are depicted in figures 1C to 1F.
The diagnosis cluster presented in figure 1C (cluster 8) consists of three diagnoses, describing a patient group with incidents caused by medical devices and products (ICD-10 Y82.8), with infections or inflammatory reactions through prostheses/implants in the urinary tract (ICD-10 T83.5), or with mechanical complications of a joint endoprosthesis (ICD-10 T84.0). Overall, 6107 out of 190,837 cases (3.2%) had at least one of these three diagnoses during their hospital stay. In the corresponding Bayesian network, a direct relationship between cluster 8 and in-hospital death was observed, and the cases’ age, sex, emergency admission status and number of comorbidities were directly linked with the probability of belonging to cluster 8. There was, however, no direct link between the cluster and unplanned hospital readmissions. In cluster 8, 160 out of 6107 (2.6%) patients died during the hospital stay and 1140 out of 6107 patients (18.7%) had an unplanned hospital readmission. The patients’ underlying characteristics (age, sex, emergency admission status and number of comorbidities), but not the multimorbidity cluster association, seem to explain the high readmission proportion.
Figure 1D depicts a large neurological and diagnostic cluster (cluster 4) consisting of 19 interrelated interventions and diagnostic procedures. These ranged from computed tomography of the abdomen (87.41.99 and 33.24.13) and thorax (87.41.99) to different tracheobronchoscopic interventions (33.24.13, 33.24.11 and 33.24.14) and neurological diagnostics and treatments. Overall, 33,942 out of 190,837 cases (17.8%) had at least one of these interventions or diagnostic procedures during their hospital stay. In the respective Bayesian network, a direct relationship between cluster 4 and in-hospital death was observed. As in cluster 8, there was no direct link with unplanned hospital readmissions. In cluster 4, 1315 out of 33,942 (3.9%) patients died during the hospital stay and 3158 out of 33,942 patients (9.3%) had an unplanned hospital readmission.
Figure 1E characterises a surgical complication cluster (cluster 2) consisting of nine diagnoses. These were primarily linked via the general diagnostic node ‘incidents caused by medical measures’ (ICD-10 T84.9). Related diagnoses included unspecified surgical complications (ICD-10 T81.8), suture dehiscence (ICD-10 T81.3), postoperative haemorrhage and haematoma (ICD-10 T81.0), acute bleeding anaemia (ICD-10 D62), temporary blood clotting disorder and acquired coagulation deficiencies (ICD-10 U69.12 and D68.4), unspecified delirium (ICD-10 F05.8), and surgical site infections (ICD-10 T81.4). Overall, 25,870 out of 190,837 cases (13.6%) had at least one of these nine clustered diagnoses during the hospital stay. Interestingly – and in contrast to the unplanned readmissions outcome – we observed no direct relationship between cluster 2 and in-hospital mortality. In cluster 2, 1229 out of 25,870 (4.8%) patients died during the hospital stay and 4366 out of 25,870 patients (16.9%) had an unplanned hospital readmission.
In the last example (fig. 1F), nine cardiological interventions and diagnostic procedures – including different diagnostics and ablation techniques in cases with tachyarrhythmia, such as ablation using three-dimensional mapping (37.34.14), conventional radiofrequency ablation of tachyarrhythmia (37.34.11), and transoesophageal echocardiography with contrast medium (88.72.24) – formed a single cluster (cluster 6). Overall, 3619 out of 190,837 cases (1.9%) had at least one of these nine interventions or diagnostic procedures recorded during their hospital stay. This tachyarrhythmia cluster was directly linked to unplanned hospital readmissions but not to in-hospital mortality. In cluster 6, 11 out of 3619 (0.3%) patients died during the hospital stay and 210 out of 3619 patients (5.8%) had an unplanned hospital readmission. Our data suggest that the cases’ underlying characteristics (age, emergency admission status and number of comorbidities) explain the low mortality in this cluster.
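The crude outcome proportions quoted for the four example clusters can be reproduced from the reported counts, as sketched below. The crude ratio against the whole cohort is an added illustration that is not reported in the study; it is unadjusted, and the cohort baseline includes the cluster cases themselves, so it should not be read as an effect estimate.

```python
OVERALL = {"cases": 190_837, "deaths": 3_834, "readmissions": 15_643}

# Counts as reported in the text for the four example clusters.
CLUSTERS = {
    "diagnosis cluster 8 (fig. 1C)": {"cases": 6_107, "deaths": 160, "readmissions": 1_140},
    "intervention cluster 4 (fig. 1D)": {"cases": 33_942, "deaths": 1_315, "readmissions": 3_158},
    "diagnosis cluster 2 (fig. 1E)": {"cases": 25_870, "deaths": 1_229, "readmissions": 4_366},
    "intervention cluster 6 (fig. 1F)": {"cases": 3_619, "deaths": 11, "readmissions": 210},
}


def summarise(counts, overall=OVERALL):
    """Crude outcome percentage and crude ratio versus the whole cohort."""
    result = {}
    for outcome in ("deaths", "readmissions"):
        risk = counts[outcome] / counts["cases"]
        baseline = overall[outcome] / overall["cases"]
        result[outcome] = {
            "percent": round(100 * risk, 1),
            "crude_ratio": round(risk / baseline, 2),  # illustrative, unadjusted
        }
    return result


summary = {name: summarise(counts) for name, counts in CLUSTERS.items()}
```

The contrast between a high crude readmission proportion in cluster 8 and the absence of a direct cluster-to-readmission edge in its network illustrates why the crude figures alone are not sufficient: in the network, the inpatient variables account for the association.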
We explored the clustering of diagnoses, interventions and diagnostic procedures in a large, real-world dataset of medical and surgical inpatient cases. Moreover, we exemplified the potential use of Bayesian networks to mine healthcare databases for potential direct and/or indirect cluster effects on important health outcomes and quality measures. To our knowledge, this is one of the first studies in which Bayesian networks have been used to mine a large hospital database in order to explore inpatient complications and related high-risk clusters which may be amenable to quality improvement measures. Previous studies have applied the concept of multi-level Bayesian networks to explore multimorbidity at a population level [11–13]. Bayesian network analysis and consecutive directed acyclic graphs may be of particular interest, as they allow complex interrelationships to be easily visualised based on a priori knowledge, and potential direct and indirect effects to be differentiated. Nevertheless, as with any observational analysis method, the proposed data mining technique may not allow the demonstration of definite causal relationships. Potentially relevant detrimental or beneficial effects of a given set of diagnoses and/or interventions observed in the Bayesian networks should be substantiated through specific causal investigations. In the present proof-of-concept study, we used a limited set of variables (age, sex, emergency admission status and number of secondary diagnoses), which may act – based on external, expert knowledge – as potential confounders in the specified cluster-health outcomes relationships. Further variables could be added to the Bayesian networks to examine, for instance, time trends (e.g., years, months or seasons) and to further reduce the propensity for unmeasured confounding.
Regarding in-hospital death and unplanned readmission, we observed several potential high-risk clusters which are clinically plausible, are in line with previous investigations (e.g., [14–16]), and which have not yet been selected as specific prevention targets in our hospital. A wide variety of questions could be posed, such as: Would inpatients with ablations for tachyarrhythmia or patients with certain postoperative complications (e.g., postoperative haemorrhage) benefit from a targeted quality programme to reduce unplanned hospital readmissions? Why are mechanical complications of joint endoprostheses linked to in-hospital mortality? These questions should be addressed by focused and more detailed investigations (e.g., a case-control study to evaluate risk factors for all-cause death in specific surgical populations) and subsequent quality and patient safety programmes such as the multimodal “Enhanced Recovery After Surgery” (ERAS) strategy, if appropriate. Such programmes have been shown to result in substantial quality improvements.
Cluster analysis combined with Bayesian network learning can be used to systematically generate hypotheses about potential high-risk clusters for a wide spectrum of health and quality outcomes. Such explorative data mining analysis may be able to provide a data-driven, objective overview of potential quality improvement targets, as many quality improvement and patient safety programmes have a narrow focus on specific patient groups, health outcomes or high-risk interventions, without any overarching, hospital-wide prevention strategy. The presented analytical strategy could be particularly important for large healthcare institutions and healthcare networks (e.g., Kaiser Permanente in the United States) with high admission numbers, allowing reasonable stratification in Bayesian network analysis. Nonetheless, it is still unclear whether an institution-wide, data-driven quality strategy is more effective and/or more cost-effective than regular, more focused quality approaches.
Our study has limitations. Firstly, we relied on routine codes and classifications, which may have led to the misclassification of diagnoses and/or interventions during the coding process. We did not have internal validation data on the inter-rater agreement for coding certain diagnoses or interventions; however, all diagnoses and interventions were coded and verified by professional medical coders using standard criteria based on a list of diagnoses and interventions, making it highly unlikely that diagnoses have been misclassified. Secondly, in our Bayesian network analyses we cannot exclude the possibility of unmeasured or residual confounding. Thirdly, our study results were derived from a single healthcare centre and may not be generalisable to other healthcare settings and institutions that cover different patient populations or that use other classifications to routinely code inpatient diagnoses and interventions. Lastly, we used a pragmatic approach and performed our analyses on a case level rather than on a longitudinal inpatient level; a longitudinal analysis might have helped us to interpret certain relationships and signals over time.
As part of an explorative analysis, we identified potential high-risk multimorbidity and intervention clusters in a tertiary care inpatient population. Bayesian network analysis may be used as a tool to mine large healthcare databases in order to systematically explore intervention targets for quality improvement and patient safety programmes. However, relevant associations derived from Bayesian network analysis should be substantiated in consecutive investigations using specific causal models.
We performed all analyses at the sciCORE scientific computing centre, University of Basel, Switzerland.
JAR, TS, MG and BLH contributed to the study conception. TS performed the main data analyses with input from JAR. JAR and TS wrote the first manuscript draft. All authors contributed to interpreting the study findings and drafting the manuscript. All authors reviewed and approved the final manuscript.
No financial support and no other potential conflict of interest relevant to this article was reported.
Dr Balthasar L. Hug, MD, MBA, MPH, Department of Internal Medicine, Kantonsspital Luzern, Spitalstrasse, 6000 Luzern 16, Switzerland, balthasar.hug[at]luks.ch.
2 Bates DW, Larizgoitia I, Prasopa-Plaizier N, Jha AK; Research Priority Setting Working Group of the WHO World Alliance for Patient Safety. Global priorities for patient safety research. BMJ. 2009;338(may14 1):b1775. doi: http://dx.doi.org/10.1136/bmj.b1775
3 Einbinder JS, Bates DW. Leveraging information technology to improve quality and safety. Yearb Med Inform. 2007:22–9.
4 Roth JA, Battegay M, Juchler F, Vogt JE, Widmer AF. Introduction to machine learning in digital healthcare epidemiology. Infect Control Hosp Epidemiol. 2018;39(12):1457–62. doi: http://dx.doi.org/10.1017/ice.2018.265
5 Roth JA, Sakoparnig T, Neubauer S, Kuenzel-Pawlik E, Gerber M, Widmer AF, et al.; PATREC Study Group. Medical diagnoses showed low relatedness in an explorative mutual information analysis of 190,837 inpatient cases. J Clin Epidemiol. 2019;109:42–50. doi: http://dx.doi.org/10.1016/j.jclinepi.2019.01.003
6 Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al.; RECORD Working Committee. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Med. 2015;12(10):e1001885. doi: http://dx.doi.org/10.1371/journal.pmed.1001885
7 Roth JA, Goebel N, Sakoparnig T, Neubauer S, Kuenzel-Pawlik E, Gerber M, et al.; PATREC Study Group. Secondary use of routine data in hospitals: description of a scalable analytical platform based on a business intelligence system. JAMIA Open. 2018;1(2):172–7. doi: http://dx.doi.org/10.1093/jamiaopen/ooy039
8 Kutz A, Gut L, Ebrahimi F, Wagner U, Schuetz P, Mueller B. Association of the Swiss Diagnosis-Related Group reimbursement system with length of stay, mortality, and readmission rates in hospitalized adult patients. JAMA Netw Open. 2019;2(2):e188332. doi: http://dx.doi.org/10.1001/jamanetworkopen.2018.8332
9 Tourassi GD, Frederick ED, Markey MK, Floyd CE. Application of the mutual information criterion for feature selection in computer-aided diagnosis. Med Phys. 2001;28(12):2394–402. doi: http://dx.doi.org/10.1118/1.1418724
10 Schreiber J. Pomegranate: fast and flexible probabilistic modeling in python. Journal of Machine Learning Research. 2018;18:1–6.
11 Lappenschaar M, Hommersom A, Lucas PJF, Lagro J, Visscher S. Multilevel Bayesian networks for the analysis of hierarchical health care data. Artif Intell Med. 2013;57(3):171–83. doi: http://dx.doi.org/10.1016/j.artmed.2012.12.007
12 Lappenschaar M, Hommersom A, Lucas PJF, Lagro J, Visscher S, Korevaar JC, et al. Multilevel temporal Bayesian networks can model longitudinal change in multimorbidity. J Clin Epidemiol. 2013;66(12):1405–16. doi: http://dx.doi.org/10.1016/j.jclinepi.2013.06.018
13 Lappenschaar M, Hommersom A, Lucas PJF. Probabilistic causal models of multimorbidity concepts. In: AMIA Proceedings of the 2012 Annual Symposium, pages 475–484, Chicago, United States.
14 Ahmad S, Munir MB, Sharbaugh MS, Althouse AD, Pasupula DK, Saba S. Causes and predictors of 30-day readmission after cardiovascular implantable electronic devices implantation: Insights from Nationwide Readmissions Database. J Cardiovasc Electrophysiol. 2018;29(3):456–62. doi: http://dx.doi.org/10.1111/jce.13396
15 Bruns BR, Lissauer M, Tesoriero R, Narayan M, Buchanan L, Galvagno SM, et al. Infectious complications and mortality in an American acute care surgical service. Eur J Trauma Emerg Surg. 2016;42(2):243–7. doi: http://dx.doi.org/10.1007/s00068-015-0538-4
16 Chikuda H, Yasunaga H, Horiguchi H, Takeshita K, Sugita S, Taketomi S, et al. Impact of age and comorbidity burden on mortality and major complications in older adults undergoing orthopaedic surgery: an analysis using the Japanese diagnosis procedure combination database. BMC Musculoskelet Disord. 2013;14(1):173. doi: http://dx.doi.org/10.1186/1471-2474-14-173
18 Hanauer DA, Ramakrishnan N. Modeling temporal relationships in large scale clinical associations. J Am Med Inform Assoc. 2013;20(2):332–41. doi: http://dx.doi.org/10.1136/amiajnl-2012-001117
Appendix 1: Description of diagnosis/multimorbidity clusters (coded according to the German modification of the International Classification of Diseases, 10th revision [ICD-10]). Available as separate PDF file at https://smw.ch/article/doi/smw.2020.20299.
Appendix 2: All-cause mortality and unplanned hospital readmission: Bayesian networks of multimorbidity clusters. Available as separate PDF file at https://smw.ch/article/doi/smw.2020.20299.
Appendix 3: All-cause mortality and unplanned hospital readmission: Bayesian networks of intervention clusters. Available as separate PDF file at https://smw.ch/article/doi/smw.2020.20299.
Appendix 4: Conditional probability table. Available as separate PDF file at https://smw.ch/article/doi/smw.2020.20299.
Published under the copyright license CC BY-NC-SA: This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.