Project 2 | Clinical Data Remediation 🏥🧹

This project develops an automated Python pipeline for clinical data remediation and multi-dimensional quality assessment of the UCI Diabetes 130-US Hospitals dataset. The primary goal was to achieve a measurable 25% improvement in the Data Quality Index (DQI) using advanced MICE and KNN imputation methods. This analysis demonstrates skills in data cleaning, statistical imputation, consistency normalization, and visualization for high-fidelity, predictive-ready clinical datasets. 🧹📊💻

Links 🔗

Team Repo 👥 DQI Framework My Version

Project Overview 📄

This project focuses on improving the quality of clinical data by addressing three main dimensions: completeness, consistency, and validity. Using Python-based pipelines, missing values were imputed using MICE/KNN methods, ICD-9 codes were standardized into primary clinical categories, and numeric features were validated against expected ranges. The final output is a high-fidelity dataset ready for predictive modeling and visualization.

Project Workflow 🪜

P2 - S1 Data Profiling: Baseline assessment, null detection, and DQI calculation.
P2 - S2 Data Remediation: MICE/KNN imputation and ICD-9 normalization.
P2 - S3 Visualization: Comparative distribution plots & KDE charts for communicating improvement.
P2 - S4 Reporting: Pipeline export, documentation, and GitHub Pages deployment.

Visuals 📷

Figure 1: Sentinel Placeholder Frequency — **Figure C1:** (Diagnostic audit of the raw dataset identifying “sentinel” null values (encoded as “?”). The analysis revealed a 97% missingness rate in the ‘Weight’ attribute, establishing the baseline requirement for Multivariate Imputation (MICE) to prevent significant loss of statistical power. )

Figure 2: Feature Correlation Matrix — **Figure C2:** (Interdependency heatmap illustrating the mathematical relationships between clinical markers (Age, Lab Procedures, and Medication Count). These correlations provided the logical foundation for the MICE algorithm, allowing the pipeline to predict missing values based on observed patient patterns rather than random estimation. )

Figure 3: Statistical Fidelity Validation (KDE) — **Figure C3:** (Kernel Density Estimate (KDE) plot comparing pre-remediation and post-remediation distributions for ‘Time in Hospital’. The near-perfect overlap of the distribution curves serves as forensic proof that the automated remediation preserved the original clinical “shape” and statistical integrity of the dataset. )

Figure 4: Data Quality Index (DQI) Remediation Delta — **Figure C4:** (Comparative analysis of the Data Quality Index (DQI) showing the transition from fragmented raw data (pre-remediation) to a high fidelity clinical asset (post-remediation). The automated pipeline achieved a 25% aggregate improvement in data health, primarily by resolving systemic missingness in the weight and payer code attributes.)

Results 🟰

Completeness: Reduced null density for weight from ~97% missing to [XX%]

Consistency: Standardized diag_1 to primary clinical categories

Validity: Ensured num_lab_procedures fell within 1–132, flagging outliers

Overall DQI improvement: [XX% increase]

Technical Stack 🔨

Data Wrangling: pandas, NumPy

Data Remediation: Scikit-Learn (IterativeImputer), Regex normalization

Visualization: Matplotlib, Seaborn

Environment: Python 3.x, Jupyter Notebook, GitHub Pages deployment

Ethical Considerations ⚖️

Patient data anonymized — no personal identifiers included
Pipeline designed for transparency
Remediation methods aimed at improving predictive integrity, not manipulating outcomes