For the IAC study, the comparison group is fairly simple: patients who received mechanical ventilation (MV) but did not have an IAC placed. The research question, comparing patients who had an IAC placed with patients who did not, makes this a comparative effectiveness study.

## Putting It Together

Images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material. Booth CM, Tannock IF (2014) Randomized controlled trials and population-based observational research: partners in the evolution of medical evidence.

## Introduction

## Part 1 — Theoretical Concepts

### Exposure and Outcome of Interest

- *Comparison Group*
- *Building the Study Cohort*
- *Hidden Exposures*
- *Data Visualization*
- *Study Cohort Fidelity*

Ideally, this group should consist of patients who are phenotypically similar to those in the study cohort but who lack the exposure of interest. Depending on the size of the study cohort, 5-10% of the clinical charts should be reviewed to confirm the presence or absence of the exposure of interest.
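Such a chart-review sample can be drawn at random so the validation subset is not biased toward any part of the cohort. A minimal Python sketch (the chapter's own examples use R and SQL; the ID list and 5% fraction here are purely illustrative):

```python
import random

def sample_charts_for_review(subject_ids, fraction=0.05, seed=42):
    """Randomly sample a fraction of charts (here 5%) for manual review
    of the exposure of interest. subject_ids is a hypothetical ID list."""
    rng = random.Random(seed)
    n = max(1, round(len(subject_ids) * fraction))
    return sorted(rng.sample(subject_ids, n))

# e.g. a cohort of 200 hypothetical chart IDs
cohort = list(range(1, 201))
review_set = sample_charts_for_review(cohort, fraction=0.05)
print(len(review_set))  # 10 charts, i.e. 5% of 200
```

Seeding the generator makes the review sample reproducible, which is worth recording alongside the study protocol.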

## Part 2 — Case Study: Cohort Selection

The authors began their cohort selection with all 24,581 patients included in the MIMIC II database. Ultimately, there were 984 patients in the group who received an IAC and 792 patients who did not.

## Introduction

## Part 1 — Theoretical Concepts

### Categories of Hospital Data

- *Context and Collaboration*
- *Quantitative and Qualitative Data*
- *Data Files and Databases*
- *Reproducibility*

Coding practices can be influenced by issues such as financial compensation and associated paperwork, consciously or otherwise. Version control systems such as Git can be used to track the changes in the code over time and are also becoming an increasingly popular tool for researchers [8].

## Part 2 — Practical Examples of Data Preparation

### MIMIC Tables

- *SQL Basics*
- *Joins*
- *Ranking Across Rows Using a Window Function*
- *Making Queries More Manageable Using WITH*

Because a patient may have been in multiple ICUs, the same patient ID sometimes appears multiple times in the result of a previous query.
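One way to remove those duplicates is the window-function ranking described above: number each patient's ICU stays by admission time and keep only the first. A sketch using SQLite from Python (the `icustay_detail` table and its columns here are simplified stand-ins for the MIMIC schema):

```python
import sqlite3

# In-memory toy database standing in for the MIMIC icustay_detail table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE icustay_detail (subject_id INT, icustay_id INT, intime TEXT);
INSERT INTO icustay_detail VALUES
  (1, 10, '2020-01-01'), (1, 11, '2020-03-01'),  -- patient 1: two stays
  (2, 12, '2020-02-01');                         -- patient 2: one stay
""")

# Rank each patient's stays by admission time, keep only the first stay,
# so each subject_id appears exactly once in the result.
rows = conn.execute("""
SELECT subject_id, icustay_id FROM (
  SELECT subject_id, icustay_id,
         ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY intime) AS seq
  FROM icustay_detail
) WHERE seq = 1
ORDER BY subject_id
""").fetchall()
print(rows)  # [(1, 10), (2, 12)] — one row per patient
```

The same `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` pattern works in PostgreSQL, where MIMIC is usually hosted.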

## Introduction

Understand the requirements for a "clean" database that is "tidy" and ready for use in statistical analysis. Preprocessing is sometimes iterative and may involve repeating this series of steps until the data are satisfactorily organized for the purposes of statistical analysis.

## Part 1 — Theoretical Concepts

### Data Cleaning

### Data Integration

Data integration is the process of combining data from different data sources (such as databases, flat files, etc.) into a consistent data set. In the MIMIC database, this mainly becomes a problem when certain information is entered into the EHR during another phase of the patient's care pathway, such as before admission to the emergency department, or from external data.
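In the simplest case, integration amounts to merging records from each source on a shared patient identifier, keeping whatever fields each source provides. A hedged Python sketch (the two sources and their field names are made up for illustration):

```python
# Combine demographics from an admissions source with lab values from a
# separate source into one record per patient, keyed on subject_id.
admissions = {1: {"age": 64}, 2: {"age": 71}}
labs = {1: {"creatinine": 1.2}, 3: {"creatinine": 0.9}}

integrated = {}
for sid in set(admissions) | set(labs):
    record = {"subject_id": sid}
    record.update(admissions.get(sid, {}))  # a patient may be absent from
    record.update(labs.get(sid, {}))        # either source
    integrated[sid] = record

print(integrated[1])  # {'subject_id': 1, 'age': 64, 'creatinine': 1.2}
```

Note that patients 2 and 3 end up with partial records; deciding how to handle those gaps is the missing-data problem treated in a later chapter.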

### Data Transformation

### Data Reduction

## Part 2 — Examples of Data Pre-processing in R

- *R—The Basics*
- *Data Integration*
- *Data Transformation*
- *Data Reduction*

FROM mimic2v26.comorbidity_scores WHERE subject_id IN (SELECT subject_id FROM mimic2v26.icustay_detail WHERE subject_icustay_seq = 1 AND icustay_age_group = 'adult' AND hadm_id IS NOT NULL). SELECT subject_id, value1num FROM mimic2v26.chartevents WHERE subject_id IN (SELECT subject_id .. WHERE subject_icustay_seq = 1 AND icustay_age_group = 'adult' AND hadm_id IS NOT NULL) AND itemid = 456.

## Conclusion

Raw data for secondary analysis are often "messy", meaning that they are not in a form suitable for statistical analysis; the data must be "cleaned" into a valid, complete and efficiently organized "tidy" database that can be analyzed. The goal of data preprocessing is to prepare the available raw data for analysis without introducing bias by changing the information in the data or otherwise influencing the final results.

## Introduction

Salgado, Carlos Azevedo, Hugo Proença, and Susana M.

What are the different types and sources of missing data? When data are missing for unspecified reasons, values are assumed to be missing due to random and unintentional causes.

## Part 1 — Theoretical Concepts

- *Types of Missingness*
- *Proportion of Missing Data*
- *Dealing with Missing Data*
- *Choice of the Best Imputation Method*

For example, if only 4 of 20 variables are needed for a study, this method would discard only the observations with missing values in the 4 variables of interest. Another advantage of this method is that it takes into account the correlation structure of the data.
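This "available-case" variant can be sketched as a filter that only checks the analysis variables. A minimal Python example (the rows, variable names, and the `None` missing-value convention are all illustrative):

```python
# Complete-case analysis restricted to the variables of interest: a row is
# dropped only if one of the analysis variables is missing; missingness in
# unused columns is ignored.
rows = [
    {"age": 60, "sofa": 5,    "lactate": None, "weight": None},
    {"age": 70, "sofa": None, "lactate": 2.1,  "weight": 80},
    {"age": 55, "sofa": 3,    "lactate": 1.4,  "weight": None},
]
variables_of_interest = ["age", "sofa"]

complete_cases = [r for r in rows
                  if all(r[v] is not None for v in variables_of_interest)]
print(len(complete_cases))  # 2 — only the row missing 'sofa' is dropped
```

Had the filter checked every column, all three rows would have been discarded, which is exactly the waste this method avoids.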

## Part 2 — Case Study

### Proportion of Missing Data and Possible Reasons for Missingness

Choose the method that works best for the level of missingness in your data set. In both cases, the fact that the data are missing contains information about the answer, so the data are MNAR.

### Univariate Missingness Analysis

In this case, the imputed data distribution fits the original data better than the previous methods (Fig. 13.8). The multivariate normal distribution with multiple imputation gave more importance to the center values of the distribution (Fig. 13.10).
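The tendency of simple imputation to over-weight the centre of the distribution can be seen numerically: mean imputation preserves the mean but shrinks the variance. A toy Python illustration (the values are made up, not drawn from MIMIC):

```python
import statistics

# Filling missing values with the observed mean leaves the mean unchanged
# but shrinks the variance, concentrating mass at the centre.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4
mean_obs = statistics.mean(observed)          # 5.0
imputed = observed + [mean_obs] * n_missing

assert statistics.mean(imputed) == mean_obs   # centre preserved
assert statistics.pvariance(imputed) < statistics.pvariance(observed)
print(statistics.pvariance(observed), statistics.pvariance(imputed))  # 5.0 2.5
```

Here half the values are imputed and the variance is halved, which is why multiple imputation methods that draw from a distribution are generally preferred.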

### Evaluating the Performance of Imputation Methods on Mortality Prediction

The quality of the imputation methods was also evaluated in the presence of multivariate missingness introduced with uniform probability across all variables (Fig. 13.13). It should be noted that obtaining results for more than 40% missingness in all variables is rarely feasible in practice, and neither method is guaranteed to perform well at that level.

## Conclusion

Alosh M (2009) The impact of missing data in a generalized integer-valued autoregression model for count data. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman L-W, Moody G, Heldt T, Kyaw TH, Moody B, Mark RG (2011) Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public access intensive care unit database.

## Introduction

How to evaluate the performance of an outlier detection method and how to compare different methods. Evaluating the effectiveness of an outlier detection algorithm and comparing the different approaches is complex.

## Part 1 — Theoretical Concepts

While this chapter provides a description of some of the most common methods for detecting outliers, there are many others. Outliers can be identified by visual inspection, highlighting data points that lie visibly outside the inherent groupings of the 2-D data.

## Statistical Methods

- *Tukey's Method*
- *Z-Score*
- *Modified Z-Score*
- *Interquartile Range with Log-Normal Distribution*
- *Ordinary and Studentized Residuals*
- *Cook's Distance*
- *Mahalanobis Distance*

However, in many real-world datasets the underlying distribution of the data is unknown or complex. The following sections describe some of the most commonly used statistical tests for identifying outliers.
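As an illustration of two of the tests listed above, the following Python sketch implements Tukey's fences and the modified Z-score (which replaces the mean and standard deviation with the median and MAD, making it robust to the outliers themselves). The data are made up, and the conventional cutoffs of 1.5 and 3.5 are used:

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Tukey's method: flag points beyond k*IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def modified_z_outliers(values, threshold=3.5):
    """Modified Z-score: uses median and median absolute deviation (MAD),
    so the statistic is not itself distorted by the outliers."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

data = [10, 11, 12, 11, 10, 12, 11, 95]   # 95 is an obvious outlier
print(tukey_outliers(data))        # [95]
print(modified_z_outliers(data))   # [95]
```

Both tests flag only the extreme value here; on skewed clinical variables such as lactate they can disagree, which is the motivation for the log-normal variant listed above.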

## Proximity Based Models

The algorithm minimizes the within-cluster sum of squares, the sum of distances between each point in a cluster and the cluster centroid. A c value that is too high will increase the cost function even if it reduces the within-cluster sum of squares [12, 13].

A problem with this algorithm is the need to determine k, the number of clusters, in advance. Metrics such as the Akaike Information Criterion or the Bayesian Information Criterion, which add a factor proportional to k to the cost function used during clustering, can help determine this.
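The idea of penalizing k can be sketched directly: run k-means for each candidate k and pick the k minimizing the within-cluster sum of squares plus a cost proportional to k. The following Python sketch uses a simple one-dimensional k-means and a made-up penalty constant rather than the formal AIC/BIC expressions:

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain 1-D k-means; returns the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # WCSS under the final centroids
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

def choose_k(values, k_max=5, penalty=10.0):
    """BIC-style selection sketch: adding a cost proportional to k means
    a larger k must 'pay for' its reduction in WCSS. The penalty constant
    is a hypothetical tuning value, not a formal BIC derivation."""
    return min(range(1, k_max + 1),
               key=lambda k: kmeans_1d(values, k) + penalty * k)

data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]   # two clear groups
print(choose_k(data))  # 2
```

With two well-separated groups, k = 2 collapses the WCSS almost to zero, and any further cluster only adds penalty.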

### Criteria for Outlier Detection

If the ratio of the distance from the nearest point to the cluster center to these calculated distances is less than a certain threshold, the point is considered an outlier. The threshold value is defined by the user and should depend on the number of clusters selected, since the higher the number of clusters, the closer the points within the cluster are, i.e. the threshold value should decrease with increasing c.

### Supervised Outlier Detection

### Outlier Analysis Using Expert Knowledge

## Case Study: Identification of Outliers in the Indwelling Arterial Catheter

### Expert Knowledge Analysis

## Univariate Analysis

Figure 14.3 shows the distribution of all data points and the outliers identified in the IAC cluster. On the other hand, when values approximately follow a normal distribution, as in the case of chloride (see Fig. 14.4), the IQR method identifies fewer outliers than log-IQR.

## Multivariable Analysis

For illustrative purposes, we present only the graphical results of patients who died in the IAC group (class 1). The detection of outliers seems to be more influenced by binary features than by continuous features: red lines, with a few exceptions, are quite close to the black lines for the continuous variables (1 to 2 and 15 to 25) and far away for the binary variables.

### Classification of Mortality in IAC and Non-IAC Patients

## Conclusions and Summary

Therefore, the “outliers” in this study appear to contain useful information in their extreme values, and automatically excluding them resulted in the loss of this information. Some modeling methods already adjust for outliers so that they have minimal impact on the model, and can be set to be more or less sensitive to them.

## Introduction

## Part 1 — Theoretical Concepts

### Suggested EDA Techniques

### Non-graphical EDA

These characteristics can express the central tendency of the data (arithmetic mean, median, mode), their spread (variance, standard deviation, interquartile range, maximum and minimum value) or the shape of their distribution (skewness, kurtosis). Kurtosis is a summary statistic that provides information about the tails (minimum and maximum values) of a distribution.
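Python's standard library covers most of these summaries directly; skewness and kurtosis can be computed from the standardized third and fourth moments. A sketch on made-up data (population-moment formulas, with sample-size corrections omitted for brevity):

```python
import statistics

def skewness(xs):
    """Skewness: third standardized moment; positive means a right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def kurtosis(xs):
    """Kurtosis: fourth standardized moment; large values indicate heavy
    tails (extreme minimum/maximum values). Normal data gives about 3."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4)

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]   # toy right-skewed sample
print(statistics.fmean(data), statistics.median(data))  # central tendency
print(statistics.pvariance(data))                       # spread
print(round(skewness(data), 2))   # positive → right-skewed
```

The single high value (14) pulls the mean above the median, drives the skewness positive, and inflates the kurtosis: exactly the features these summaries are meant to expose before any modeling.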

### Graphical EDA

The interpretation of a Q-Q plot is visual (Fig. 15.10): either the points fall randomly around the line (data set normally distributed) or they follow a curved pattern instead of following the line (non-normality). Presenting several boxplots side by side makes it easy to compare the characteristics of various groups of data (example Fig. 15.11).

## Part 2 — Case Study

### Non-graphical EDA

The benefits of this are twofold: firstly it is useful to identify potentially confounding variables contributing to an outcome other than the predictor (exposure) variable. Identifying these variables is important as it is possible to attempt to control for these using adjustment methods such as multivariate logistic regression.

### Graphical EDA

For example, to compare disease severity between cohorts of patients, SOFA score histograms can be plotted side by side (Figure 15.17). For example, to investigate differences in blood pressure by disease severity, subjects could be categorized by disease severity by plotting baseline blood pressure values (Figure 15.18).

## Conclusion

## Introduction to Data Analysis

### Identifying Data Types and Study Objectives

Determining the objective of the study is an extremely important aspect of data analysis planning for health data. Once you have identified study outcomes and covariates, determining the types of outcome data will often be critical in selecting an appropriate analysis technique.

### Case Study Data

## Linear Regression

### Section Goals

- *Introduction*
- *Model Selection*
- *Reporting and Interpreting Linear Regression*
- *Caveats and Conclusions*

In summary, we would conclude that we need both an intercept and a slope in the model. This quantity is a proportion (a number between 0 and 1), and describes how much of the total variability in the data is explained by the model.
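That proportion is computed as R² = 1 − SS_res/SS_tot. A small self-contained Python sketch on made-up data (the chapter's own worked examples use R):

```python
# Fit a one-variable least-squares line and compute R², the proportion of
# total variability in the outcome explained by the model.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly linear toy data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))   # close to 1: the line explains nearly all variation
```

Because the toy outcomes lie almost exactly on a line, R² is close to 1; a model with no explanatory value would give R² near 0.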

## Logistic Regression

### Section Goals

- *Introduction*
- *Introducing Logistic Regression*
- *Hypothesis Testing and Model Selection*
- *Confidence Intervals*
- *Prediction*
- *Presenting and Interpreting Logistic Regression Analysis*
- *Caveats and Conclusions*

Recall that log(Odds_{x=0}) = b0, so b0 is the log odds of the outcome in the youngest group. In the logistic regression setting, this involves trying to estimate the probability of an outcome given a patient's characteristics (covariates).
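Converting a log odds such as b0 back to odds, and odds to a probability, is a one-liner. A Python sketch with a hypothetical coefficient value:

```python
import math

# b0 is the log odds of the outcome in the reference (youngest) group;
# exponentiating recovers the odds, and odds/(1 + odds) the probability.
b0 = -1.5                      # hypothetical fitted intercept
odds = math.exp(b0)            # odds of the outcome when x = 0
prob = odds / (1 + odds)       # equivalently 1 / (1 + exp(-b0))
print(round(odds, 3), round(prob, 3))
```

The second form, 1 / (1 + exp(−b0)), is the familiar logistic (sigmoid) function evaluated at the linear predictor.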

## Survival Analysis

### Section Goals

- *Introduction*
- *Kaplan-Meier Survival Curves*
- *Cox Proportional Hazards Models*
- *Caveats and Conclusions*

This is consistent with the exploratory figures we created in the previous section using Kaplan-Meier curves. The biggest challenge in doing this lies mainly in the construction of the dataset, which is discussed in some references at the end of this chapter.
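The Kaplan-Meier estimate behind those curves is a simple product over event times: at each time with d deaths among n subjects still at risk, the survival estimate is multiplied by (1 − d/n). A minimal Python sketch on a toy cohort (real analyses would use a survival package such as R's survival library):

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t). times: follow-up times; events: 1 if
    death observed, 0 if censored. Returns [(t, S(t))] at event times."""
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= sum(1 for ti in times if ti == t)  # deaths and censored leave
    return curve

# toy cohort: deaths at t=2 and t=4, one subject censored at t=3
print(kaplan_meier([2, 3, 4, 5], [1, 0, 1, 0]))  # [(2, 0.75), (4, 0.375)]
```

Note how the subject censored at t = 3 still counts in the risk set at t = 2 but not at t = 4; handling censoring this way is the whole point of the estimator.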

## Case Study and Summary

### Section Goals

### Introduction

Assessment of the proportional hazards assumption is an important part of any Cox regression analysis. Now that we have examined the basic general characteristics of the patients, we can begin the next steps in the analysis.

### Logistic Regression Analysis

A common approach is to fit all univariate models (one covariate at a time, as we did with aline_ but separately for each covariate and without aline_fg) and perform a hypothesis test on each model. You could simply cut and paste the mva.full.glm command and remove + cad_g, but the simpler, less error-prone way is to use update().

### Conclusion and Summary

An estimate of the expected change in the outcome per one-unit increase in a covariate, holding all other covariates constant. An estimate of the fold change in the odds of an outcome per one-unit increase in a covariate, holding all other covariates constant.

## Introduction

## Part 1 — Theoretical Concepts

### Bias and Variance

### Common Evaluation Tools

However, we will briefly mention the two most common techniques: the R2 value, used for regressions, and the receiver operating characteristic (ROC) curve, used for a binary classifier (dichotomous score). High R2 values mean that a large proportion of the variance is explained by the regression model.
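The area under the ROC curve has a useful probabilistic reading: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. A Python sketch on made-up classifier scores:

```python
# AUC computed directly from its probabilistic definition, by comparing
# every (positive, negative) pair of scores.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.4, 0.3]   # hypothetical predicted risks
labels = [1,   1,   0,   1,   0]     # observed outcomes
print(auc(scores, labels))   # 5/6 ≈ 0.833: one discordant pair out of six
```

This pairwise form is O(n²); packages such as pROC in R compute the same quantity efficiently from the ranked scores.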

### Sensitivity Analysis

R2 ranges from 0 to 1, where values close to 0 reflect situations in which the model does not substantially capture the variation in the outcome of interest, and values close to 1 indicate that the model captures nearly all of it. The principles of sensitivity analysis are: (a) to allow the researcher to quantify the uncertainty in the model, (b) to test the model of interest using a secondary experimental design, and (c) to use the results of the secondary experimental design to calculate the overall sensitivity of the model of interest.

### Validation

External validation is defined as testing the model on a sample of subjects drawn from a different population than the original group. External validation is usually a more robust approach to model testing in that the maximum amount of information is used from the initial data set to derive a model and an entirely independent data set is then used to verify the suitability of the model of interest.

## Case Study: Examples of Validation and Sensitivity Analysis

- *Analysis 1: Varying the Inclusion Criteria of Time to Mechanical Ventilation*
- *Analysis 2: Changing the Caliper Level for Propensity Matching*
- *Analysis 3: Hosmer-Lemeshow Test*
- *Implications for a 'Failing' Model*

Changing the inclusion criteria for subjects included in the model is a common sensitivity analysis. In the favorable situation of a robust model, each sensitivity analysis and validation technique supports the model as an appropriate summary of the data.

## Conclusion

Sekhon JS (2011) Multivariate and propensity score matching software with automated balance optimization: the Matching package for R. Robin X et al. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves.