
Văn Lang Digital Library: Secondary Analysis of Electronic Health Records

Nguyễn Gia Hào

Academic year: 2023


In the case of the IAC study, the control group is fairly simple: patients who received mechanical ventilation (MV) and did not have an indwelling arterial catheter (IAC) placed. The research question, which compares patients who had an IAC placed with patients who did not, would be a comparative effectiveness study.

Table 9.1 Major types of observational research, and their purpose

Putting It Together

Images or other third-party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.

Booth CM, Tannock IF (2014) Randomized controlled trials and population-based observational research: partners in developing medical evidence.

Introduction

PART 1 — Theoretical Concepts

  • Exposure and Outcome of Interest
  • Comparison Group
  • Building the Study Cohort
  • Hidden Exposures
  • Data Visualization
  • Study Cohort Fidelity

Ideally, this group should consist of patients who are phenotypically similar to those in the study cohort but who lack the exposure of interest. Based on the size of the study cohort, 5-10% of the clinical charts should be reviewed to ensure the presence or absence of the exposure of interest.
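The chart-review step described above can be sketched as a reproducible random draw. This is only an illustration: the function name `sample_charts_for_review`, the cohort of 1,000 IDs, and the fixed seed are assumptions made for the example, not part of the original study.

```python
import math
import random


def sample_charts_for_review(patient_ids, fraction=0.05, seed=0):
    """Draw a reproducible random sample of chart IDs for manual review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(patient_ids)
    if not ids:
        return []
    # Round before taking the ceiling to avoid floating-point overshoot.
    n_review = max(1, math.ceil(round(fraction * len(ids), 9)))
    rng = random.Random(seed)  # fixed seed so the same charts are drawn on re-run
    return rng.sample(ids, n_review)


cohort_ids = list(range(1, 1001))  # hypothetical cohort of 1,000 patients
review_ids = sample_charts_for_review(cohort_ids, fraction=0.05)
print(len(review_ids), "charts selected for review")
```

Fixing the seed keeps the review list stable, so a second reviewer can audit exactly the same charts.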

PART 2 — Case Study: Cohort Selection

The authors began their cohort selection with all 24,581 patients included in the MIMIC II database. Ultimately, there were 984 patients in the group who received an IAC and 792 patients who did not.

Introduction

Part 1 — Theoretical Concepts

  • Categories of Hospital Data
  • Context and Collaboration
  • Quantitative and Qualitative Data
  • Data Files and Databases
  • Reproducibility

Coding practices can be influenced by issues such as financial compensation and associated paperwork, consciously or otherwise. Version control systems such as Git can be used to track the changes in the code over time and are also becoming an increasingly popular tool for researchers [8].

Fig. 11.1 Comma separated value (CSV) file formatted to the RFC 4180 specification
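As a small illustration of what RFC 4180 formatting implies in practice, the sketch below parses a CSV string in which one field contains a comma and another contains doubled quotes. The column names and values are invented for the example; only Python's standard `csv` module is used.

```python
import csv
import io

# Hypothetical CSV text: CRLF line endings, a quoted field containing a comma,
# and a quoted field containing an escaped (doubled) double quote.
raw = (
    'subject_id,note\r\n'
    '1,"Admitted to ICU, intubated"\r\n'
    '2,"Patient said ""fine"""\r\n'
)

# newline="" lets the csv module handle line endings itself.
rows = list(csv.reader(io.StringIO(raw, newline="")))

for row in rows:
    print(row)
```

Splitting each line on commas by hand would break the first data row into three fields, which is why a CSV-aware parser is usually the safer choice.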

Part 2 — Practical Examples of Data Preparation

  • MIMIC Tables
  • SQL Basics
  • Joins
  • Ranking Across Rows Using a Window Function
  • Making Queries More Manageable Using WITH
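The last two topics in this outline can be sketched together: a `WITH` clause names an intermediate result, and a window function ranks each patient's ICU stays so that only the first stay is kept. The table `icustays` and its columns are hypothetical and deliberately simplified; they are not the MIMIC schema. The example runs on an in-memory SQLite database and assumes SQLite 3.25 or later, which introduced window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not the actual MIMIC schema.
conn.execute(
    "CREATE TABLE icustays (subject_id INTEGER, icustay_id INTEGER, intime TEXT)"
)
conn.executemany(
    "INSERT INTO icustays VALUES (?, ?, ?)",
    [(1, 10, "2101-01-01"), (1, 11, "2101-03-05"), (2, 20, "2102-06-10")],
)

query = """
WITH ranked AS (
    SELECT subject_id,
           icustay_id,
           ROW_NUMBER() OVER (
               PARTITION BY subject_id ORDER BY intime
           ) AS stay_rank
    FROM icustays
)
SELECT subject_id, icustay_id
FROM ranked
WHERE stay_rank = 1
ORDER BY subject_id
"""

# One row per patient: the earliest ICU stay.
first_stays = conn.execute(query).fetchall()
print(first_stays)
```

Keeping the ranking step in a named `WITH` block leaves the final `SELECT` short and makes the de-duplication rule easy to change later.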

Because a patient may have been in multiple ICUs, the same patient ID sometimes appears multiple times in the result of a previous query.

Open Access This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any non-commercial use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as the original author(s) and source are properly credited, a link to the Creative Commons license is provided, and any changes made are noted.

Introduction

Understand the requirements for a "clean" database that is "tidy" and ready for use in statistical analysis. Preprocessing is sometimes iterative and may involve repeating this series of steps until the data are satisfactorily organized for the purposes of statistical analysis.

Part 1 — Theoretical Concepts

Data Cleaning

Data Integration

Data integration is the process of combining data from different data sources (such as databases, flat files, etc.) into a consistent data set. In the MIMIC database, this mainly becomes a problem when certain information is entered into the EHR during another phase of the patient's care pathway, such as before admission to the emergency department, or from external data.
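A minimal sketch of the idea, assuming two small in-memory sources that share a `subject_id` key. The records, column names, and the helper `left_join` are invented for illustration; a real pipeline would typically use database joins or a data-frame library.

```python
# Two hypothetical sources keyed on subject_id.
demographics = [
    {"subject_id": 1, "age": 64},
    {"subject_id": 2, "age": 71},
    {"subject_id": 3, "age": 58},
]
labs = [
    {"subject_id": 1, "bun": 18.0},
    {"subject_id": 3, "bun": 42.0},
]


def left_join(left, right, key):
    """Keep every row of `left`; fill columns from `right` with None when absent."""
    right_by_key = {row[key]: row for row in right}
    right_cols = sorted({col for row in right for col in row if col != key})
    merged = []
    for row in left:
        match = right_by_key.get(row[key], {})
        combined = dict(row)
        for col in right_cols:
            combined[col] = match.get(col)  # None marks a value missing in `right`
        merged.append(combined)
    return merged


integrated = left_join(demographics, labs, key="subject_id")
for row in integrated:
    print(row)
```

Patient 2 has no laboratory record, so the merged row carries an explicit missing value rather than silently disappearing from the data set.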

Data Transformation

Data Reduction

PART 2 — Examples of Data Pre-processing in R

  • R—The Basics
  • Data Integration
  • Data Transformation
  • Data Reduction

FROM mimic2v26.comorbidity_scores WHERE subject_id IN (SELECT subject_id FROM mimic2v26.icustay_detail WHERE subject_icustay_seq = 1 AND icustay_age_group = 'adult' AND hadm_id IS NOT NULL)

SELECT subject_id, value1num FROM mimic2v26.chartevents WHERE subject_id IN (SELECT subject_id .. WHERE subject_icustay_seq = 1 AND icustay_age_group = 'adult' AND hadm_id IS NOT NULL) AND itemid = 456

Conclusion

Raw data for secondary analysis are often "messy", meaning that they are not in a form suitable for statistical analysis; the data must be "cleaned" into a valid, complete and efficiently organized "tidy" database that can be analyzed. The goal of data preprocessing is to prepare the available raw data for analysis without introducing bias by changing information in the data or otherwise influencing the final results.

Introduction

Salgado, Carlos Azevedo, Hugo Proença, and Susana M.

What are the different types of missing data and the sources of missing data?

On the other hand, when data are missing for unspecified reasons, values are assumed to be missing due to random and unintentional causes.

Part 1 — Theoretical Concepts

  • Types of Missingness
  • Proportion of Missing Data
  • Dealing with Missing Data
  • Choice of the Best Imputation Method

For example, if only 4 of 20 variables are needed for a study, this method would discard only the missing observations of the 4 variables of interest. Another advantage of this method is that it takes into account the correlation structure of the data.
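The difference between discarding every incomplete row and discarding only rows that are incomplete in the variables of interest can be sketched as follows. The records and the helper `drop_missing` are hypothetical and use `None` to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 64, "bun": 18.0, "sofa": 5, "weight": None},
    {"age": None, "bun": 25.0, "sofa": 7, "weight": 80.0},
    {"age": 71, "bun": 30.0, "sofa": None, "weight": 75.0},
    {"age": 58, "bun": 12.0, "sofa": 3, "weight": 90.0},
]


def drop_missing(rows, variables=None):
    """Keep rows with no missing value in `variables` (all columns if None)."""
    kept = []
    for row in rows:
        cols = variables if variables is not None else list(row)
        if all(row[col] is not None for col in cols):
            kept.append(row)
    return kept


complete_case = drop_missing(records)                  # every column must be present
available_case = drop_missing(records, ["age", "bun"])  # only the variables of interest

print(len(complete_case), "rows kept when all columns are required")
print(len(available_case), "rows kept when only age and bun are required")
```

In this toy data set, restricting the check to the two variables actually used keeps three of four rows instead of one.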

Table 13.1 Examples of

Part 2 — Case Study

Proportion of Missing Data and Possible Reasons for Missingness

Choose the method that works best at the level of missing data in your data set. In both cases, the fact that the data is missing contains information about the answer, so it is MNAR.

Univariate Missingness Analysis

In this case, the imputed data distribution fits the original data better than the previous methods (Fig. 13.8). The multivariate normal distribution with multiple imputation gave more importance to the center values of the distribution (Fig. 13.10).

Fig. 13.5 Histogram of variable age in the IAC group before and after univariate complete case method

Evaluating the Performance of Imputation Methods on Mortality Prediction

The performance of the imputation methods was also evaluated in the presence of multivariate missingness, with a uniform probability of missingness in all variables (Fig. 13.13). It should be noted that obtaining results for more than 40% missingness in all variables is practically impossible in most cases, and there is no guarantee of good performance with any of the methods.
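A rough sketch of how such an experiment can be set up: values are removed from every variable with the same probability, and the gaps are then filled. Mean imputation is used here only as a simple stand-in and is not claimed to be one of the methods compared in the chapter; the data, function names, and the 20% rate are assumptions for the example.

```python
import random
import statistics


def inject_missingness(rows, rate, seed=0):
    """Set each value to None with the same probability `rate` in every column."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace None with the column mean of the observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values cannot be imputed and stays None.
        means.append(statistics.fmean(observed) if observed else None)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Hypothetical complete data set: 200 rows, 3 numeric variables.
complete = [[float(i), float(i % 7), float(i % 3)] for i in range(200)]

with_gaps = inject_missingness(complete, rate=0.2)
imputed = mean_impute(with_gaps)

n_missing = sum(v is None for row in with_gaps for v in row)
print(n_missing, "values removed out of", 200 * 3)
```

Repeating the two steps for increasing values of `rate`, and refitting the prediction model each time, gives the kind of performance-versus-missingness curve the text refers to.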

Fig. 13.13 Mean AUC of the logistic regression models for different degrees of multivariate missingness

Conclusion

Alosh M (2009) The impact of missing data in a generalized integer-valued autoregression model for count data.

Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman L-W, Moody G, Heldt T, Kyaw TH, Moody B, Mark RG (2011) Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public access intensive care unit database.

Introduction

How to evaluate the performance of an outlier detection method and how to compare different methods. Evaluating the effectiveness of an outlier detection algorithm and comparing the different approaches is complex.

Part 1 — Theoretical Concepts

While this chapter provides a description of some of the most common methods for detecting outliers, there are many others. Outliers can be identified by visual inspection, highlighting data points that appear to be relatively outside the inherent groupings of the 2-D data.

Statistical Methods

  • Tukey’s Method
  • Z-Score
  • Modified Z-Score
  • Interquartile Range with Log-Normal Distribution
  • Ordinary and Studentized Residuals
  • Cook’s Distance
  • Mahalanobis Distance

However, in many real-world datasets the underlying distribution of the data is unknown or complex. The following sections describe some of the most commonly used statistical tests for identifying outliers.
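Three of the univariate rules listed above can be sketched with the standard library. The thresholds (1.5 for Tukey's fences, 3 for the z-score, 3.5 with the 0.6745 constant for the modified z-score) are the conventional defaults, and the BUN-like values are invented for the example; the sketch also assumes a nonzero standard deviation and median absolute deviation.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumed nonzero
    return [v for v in values if abs(v - mean) / sd > threshold]


def modified_zscore_outliers(values, threshold=3.5):
    """Same idea, but using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])  # assumed nonzero
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical measurements with one extreme value.
bun = [10, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 20, 95]

print(tukey_outliers(bun))
print(zscore_outliers(bun))
print(modified_zscore_outliers(bun))
```

The mean and standard deviation are themselves pulled toward extreme values, so the z-score rule can miss outliers in small samples; the median-based variant is less affected.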

Proximity Based Models

The algorithm minimizes the within-cluster sum of squares, the sum of squared distances between each point in a cluster and the cluster centroid. Too many clusters will increase the cost function even if they reduce the within-cluster sum of squares [12,13].

A problem with this algorithm is the need to determine k, the number of clusters, in advance. Metrics such as the Akaike Information Criterion or the Bayesian Information Criterion, which add a factor proportional to k to the cost function used during clustering, can help determine this.
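The trade-off can be illustrated with a small one-dimensional k-means and a cost that adds a term proportional to k. The penalty used here is a deliberately simplified stand-in, not the actual AIC or BIC formula, and the data, the deterministic initialization, and the penalty weight of 5 are assumptions for the example.

```python
def kmeans_1d(values, k, n_iter=50):
    """Tiny Lloyd's algorithm for 1-D data with a deterministic initialization."""
    data = sorted(values)
    # Spread the initial centroids across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Within-cluster sum of squares for the final assignment.
    wcss = sum((v - centroids[j]) ** 2 for j, c in enumerate(clusters) for v in c)
    return centroids, wcss


values = [1, 2, 3, 10, 11, 12]  # hypothetical data with two obvious groups
penalty_per_cluster = 5.0       # illustrative weight, not an AIC/BIC constant

costs = {}
for k in (1, 2, 3):
    _, wcss = kmeans_1d(values, k)
    costs[k] = wcss + penalty_per_cluster * k

best_k = min(costs, key=costs.get)
print(costs, "-> chosen k:", best_k)
```

The within-cluster sum of squares keeps falling as k grows, but the added penalty makes the three-cluster solution cost more than the two-cluster one.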

Criteria for Outlier Detection

If the ratio of the distance from the nearest point to the cluster center to these calculated distances is less than a certain threshold, the point is considered an outlier. The threshold value is defined by the user and should depend on the number of clusters selected, since the higher the number of clusters, the closer the points within the cluster are, i.e. the threshold value should decrease with increasing c.

Figure 14.2 provides a graphical example of the effect of varying values of w in the creation of boundaries for outlier detection

Supervised Outlier Detection

Outlier Analysis Using Expert Knowledge

Case Study: Identification of Outliers in the Indwelling Arterial Catheter

Expert Knowledge Analysis

Univariate Analysis

Figure 14.3 shows a distribution of all data points and outliers identified in the IAC cluster. On the other hand, when values follow approximately a normal distribution, as in the case of chloride (see Fig. 14.4), the IQ method identifies fewer outliers than log-IQ.

Fig. 14.3 Outliers identified by statistical analysis for the variable BUN, in the IAC cohort

Multivariable Analysis

For illustrative purposes, we present only the graphical results of patients who died in the IAC group (class 1). The detection of outliers seems to be more influenced by binary features than by continuous features: red lines, with a few exceptions, are quite close to the black lines for the continuous variables (1 to 2 and 15 to 25) and far away for the binary variables.

Fig. 14.5 Outliers identified by clustering based approaches for patients that died after IAC.

Classification of Mortality in IAC and Non-IAC Patients

Conclusions and Summary

Therefore, the “outliers” in this study appear to contain useful information in their extreme values, and automatic exclusion resulted in the loss of this information. Some modeling methods already accommodate outliers so that they have minimal impact on the model, and can be tuned to be more or less sensitive to them.

Introduction

Part 1 — Theoretical Concepts

Suggested EDA Techniques

Non-graphical EDA

These characteristics can express the central tendency of the data (arithmetic mean, median, mode), their spread (variance, standard deviation, interquartile range, maximum and minimum value) or some features of their distribution (skewness, kurtosis). Kurtosis is a summary statistic that provides information about the tails (the smallest and largest values) of a distribution.
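These summaries can be computed in a few lines. The sketch below uses the standard `statistics` module plus moment-based formulas for skewness and excess kurtosis (population moments, one of several common definitions); the sample values are invented and chosen to be right-skewed.

```python
import statistics


def describe(values):
    """Non-graphical EDA summary: central tendency, spread, and shape."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    pop_sd = statistics.pstdev(values)
    # Third and fourth central moments for the shape statistics.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": median,
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "sd": statistics.stdev(values),           # sample standard deviation
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
        "skewness": m3 / pop_sd ** 3,
        "excess_kurtosis": m4 / pop_sd ** 4 - 3,
    }


sample = [2, 3, 3, 4, 5, 6, 20]  # hypothetical right-skewed values
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For this sample the mean exceeds the median and the skewness is positive, the usual signature of a long right tail.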

Fig. 15.1 Symmetrical versus asymmetrical (skewed) distribution, showing mode, mean and median

Graphical EDA

The interpretation of a QN plot is visual (Fig. 15.10): either the points fall randomly around the line (data set normally distributed) or they follow a curved pattern instead of following the line (non-normality). Presenting several boxplots side by side makes it easy to compare the characteristics of various groups of data (example Fig. 15.11).

Fig. 15.5 Example of the effect of a log transformation on the distribution of the dataset

Part 2 — Case Study

Non-graphical EDA

The benefits of this are twofold: firstly it is useful to identify potentially confounding variables contributing to an outcome other than the predictor (exposure) variable. Identifying these variables is important as it is possible to attempt to control for these using adjustment methods such as multivariate logistic regression.

Graphical EDA

For example, to compare disease severity between cohorts of patients, SOFA score histograms can be plotted side by side (Figure 15.17). Similarly, to investigate differences in blood pressure by disease severity, subjects could be grouped by disease severity and their baseline blood pressure values plotted (Figure 15.18).

Fig. 15.17 Histograms of SOFA scores by intra-arterial catheter status

Conclusion

Introduction to Data Analysis

Introduction

Identifying Data Types and Study Objectives

Determining the objective of the study is an extremely important aspect of data analysis planning for health data. Once you have identified study outcomes and covariates, determining the types of outcome data will often be critical in selecting an appropriate analysis technique.

Fig. 16.1 Flow diagram of simplified process for choosing an analysis method based on the study objective and outcome data types

Case Study Data

Linear Regression

  • Section Goals
  • Introduction
  • Model Selection
  • Reporting and Interpreting Linear Regression
  • Caveats and Conclusions

In summary, we would conclude that we need both an intercept and a slope in the model. This quantity is a proportion (a number between 0 and 1), and describes how much of the total variability in the data is explained by the model.
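To make the intercept, slope, and explained-variability proportion concrete, here is a least-squares fit written out by hand. The five (x, y) pairs are made-up numbers, not values from the case-study data set, and the function names are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Proportion of the total variability in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical predictor and outcome values.
x = [30, 35, 40, 45, 50]
y = [20, 22, 25, 27, 31]

intercept, slope = fit_line(x, y)
r2 = r_squared(x, y, intercept, slope)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, R^2={r2:.3f}")
```

A value near 1 means the fitted line accounts for most of the variability in the outcome; it does not by itself show that a straight line is the right functional form.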

Fig. 16.2 Scatterplot of PCO2 (x-axis) and TCO2 (y-axis) along with linear regression estimates from the quadratic model (co2.quad.lm) and linear only model (co2.lm)

Logistic Regression

  • Section Goals
  • Introduction
  • Introducing Logistic Regression
  • Hypothesis Testing and Model Selection
  • Confidence Intervals
  • Prediction
  • Presenting and Interpreting Logistic Regression Analysis
  • Caveats and Conclusions

If you remember, $\log(\mathrm{Odds}_{x=0}) = \beta_0$, so $\beta_0$ is the log odds of the outcome in the youngest group. In the logistic regression setting, this involves trying to estimate the probability of an outcome given a patient's characteristics (covariates).
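The link between coefficients, log odds, and predicted probabilities can be sketched numerically. The coefficient values below are hypothetical and chosen only to show the arithmetic; they are not estimates from the case study.

```python
import math


def log_odds(p):
    """Logit: log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Inverse logit: convert a log odds value back to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients: beta0 for the reference group (x = 0),
# beta1 for a one-unit increase in the covariate.
beta0, beta1 = -2.0, 0.5

p_reference = inv_logit(beta0)        # predicted probability when x = 0
p_exposed = inv_logit(beta0 + beta1)  # predicted probability when x = 1
odds_ratio = math.exp(beta1)          # fold change in odds per unit of x

print(f"P(outcome | x=0) = {p_reference:.3f}")
print(f"P(outcome | x=1) = {p_exposed:.3f}")
print(f"odds ratio = {odds_ratio:.3f}")
```

The odds ratio stays the same at every baseline, whereas the change in probability depends on where the reference probability sits.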

Fig. 16.5 Plot of log-odds of mortality for each of the five age and temperature groups

Survival Analysis

  • Section Goals
  • Introduction
  • Kaplan-Meier Survival Curves
  • Cox Proportional Hazards Models
  • Caveats and Conclusions

This is consistent with the exploratory figures we created in the previous section using Kaplan-Meier curves. The biggest challenge in doing this lies mainly in the construction of the dataset, which is discussed in some references at the end of this chapter.
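For readers who want to see what a Kaplan-Meier curve computes, here is a compact product-limit estimator. The follow-up times and event indicators are invented (1 = death observed, 0 = censored), and the function is a sketch rather than a replacement for a survival analysis package.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated survival)] at each distinct event time."""
    order = sorted(zip(times, events))
    n_at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = removed = 0
        # Group all subjects who share the same follow-up time.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical follow-up in days; event = 1 means death, 0 means censored.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"day {t}: S(t) = {s:.2f}")
```

Censored subjects never lower the curve themselves, but they shrink the risk set, which makes each later death count for more.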

Fig. 16.7 Kaplan-Meier plot of the estimated survivor function stratified by service unit

Case Study and Summary

Section Goals

Introduction

Assessment of the proportional hazards assumption is an important part of any Cox regression analysis. Now that we have examined the basic general characteristics of the patients, we can begin the next steps in the analysis.

Table 16.1 Overall patient

Logistic Regression Analysis

A common approach is to fit all univariate models (one covariate at a time, as we did with aline_flg, but separately for each covariate and without aline_flg) and perform a hypothesis test on each model. You could just cut and paste the mva.full.glm command and remove + cad_flg, but the simpler, less error-prone way is to use update.

Table 16.3 Patient characteristics stratified by 28 day mortality

Conclusion and Summary

An estimate of the expected change in outcome per one unit increase in a covariate, holding all other covariates constant. An estimate of the fold change in the odds of an outcome per unit increase in a covariate, holding all other covariates constant.

Table 16.4 Multivariable logistic regression analysis for mortality at 28 days outcome (final model)

Introduction

Part 1 — Theoretical Concepts

Bias and Variance

Common Evaluation Tools

However, we will briefly mention the two most common techniques: the $R^2$ value, used for regressions, and the receiver operating characteristic (ROC) curve, used for a binary classifier (dichotomous score). High $R^2$ values mean that a large proportion of the variance is explained by the regression model.
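The area under the ROC curve has a simple pairwise interpretation that can be coded directly: it is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The labels and scores below are invented, and the quadratic-time loop is meant for illustration rather than large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = event) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to a classifier that ranks cases no better than chance, and 1.0 to perfect separation of the two classes.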

Sensitivity Analysis

$R^2$ ranges from 0 to 1, where values close to 0 reflect situations in which the model does not appreciably capture the variation in the outcome of interest, and values close to 1 indicate that the model captures nearly all of that variation. The principles of sensitivity analysis are: (a) to allow the researcher to quantify the uncertainty in the model, (b) to test the model of interest using a secondary experimental design, and (c) to use the results of the secondary experimental design to calculate the overall sensitivity of the model of interest.

Validation

External validation is defined as testing the model on a sample of subjects drawn from a different population than the original group. External validation is usually a more robust approach to model testing in that the maximum amount of information is used from the initial data set to derive a model and an entirely independent data set is then used to verify the suitability of the model of interest.

Case Study: Examples of Validation and Sensitivity Analysis

  • Analysis 1: Varying the Inclusion Criteria of Time to Mechanical Ventilation
  • Analysis 2: Changing the Caliper Level for Propensity Matching
  • Analysis 3: Hosmer-Lemeshow Test
  • Implications for a ‘Failing’ Model

Changing the inclusion criteria for subjects included in the model is a common sensitivity analysis. In the favorable situation of a robust model, each sensitivity analysis and validation technique supports the model as an appropriate summary of the data.
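As one way to picture the caliper analysis named in the list above, the sketch below performs greedy 1:1 nearest-neighbor matching on a propensity score and shows how the number of matched pairs changes with the caliper. The scores, IDs, and the choice to apply the caliper on the raw score scale are assumptions for the example; it is not the matching procedure used in the case study.

```python
def caliper_match(treated, controls, caliper):
    """Greedy 1:1 nearest-neighbor matching within a maximum score distance."""
    available = dict(controls)  # control id -> propensity score
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control to this treated subject.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Hypothetical propensity scores.
treated = {"t1": 0.30, "t2": 0.55, "t3": 0.80}
controls = {"c1": 0.32, "c2": 0.52, "c3": 0.62}

narrow = caliper_match(treated, controls, caliper=0.05)
wide = caliper_match(treated, controls, caliper=0.20)
print(len(narrow), "pairs with caliper 0.05:", narrow)
print(len(wide), "pairs with caliper 0.20:", wide)
```

A tighter caliper yields closer matches but fewer pairs, so re-running the outcome analysis across several caliper values shows whether the conclusion depends on that choice.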

Conclusion

Sekhon JS (2011) Multivariate and propensity score matching software with automated balance optimization: the Matching package for R.

Robin for R and S+ to analyze and compare ROC curves.

Images

Table 11.1 Overview of common categories of hospital data and common issues to consider during analysis
Fig. 11.1 Comma separated value (CSV) file formatted to the RFC 4180 specification
Fig. 11.2 Relational databases consist of multiple data tables linked by primary and foreign keys.
Fig. 11.3 Jupyter Notebooks enable documentation and code to be combined into a reproducible analysis