This project develops an automated Python pipeline for clinical data remediation and multi-dimensional quality assessment of the UCI Diabetes 130-US Hospitals dataset. The primary goal was to achieve a measurable 25% improvement in the Data Quality Index (DQI) using advanced MICE and KNN imputation methods. This analysis demonstrates skills in data cleaning, statistical imputation, consistency normalization, and visualization for high-fidelity, predictive-ready clinical datasets. 🧹📊💻
This project focuses on improving the quality of clinical data by addressing three main dimensions: completeness, consistency, and validity. Using Python-based pipelines, missing values were imputed using MICE/KNN methods, ICD-9 codes were standardized into primary clinical categories, and numeric features were validated against expected ranges. The final output is a high-fidelity dataset ready for predictive modeling and visualization.
Project Workflow 🪜
P2 - S1 Data Profiling: Baseline assessment, null detection, and DQI calculation.
P2 - S2 Data Remediation: MICE/KNN imputation and ICD-9 normalization.
P2 - S3 Visualization: Comparative distribution plots & KDE charts for communicating improvement.
Figure C1: (Diagnostic audit of the raw dataset identifying “sentinel” null values (encoded as “?”). The analysis revealed a 97% missingness rate in the ‘Weight’ attribute, establishing the baseline requirement for Multivariate Imputation (MICE) to prevent significant loss of statistical power. )Figure C2: (Interdependency heatmap illustrating the mathematical relationships between clinical markers (Age, Lab Procedures, and Medication Count). These correlations provided the logical foundation for the MICE algorithm, allowing the pipeline to predict missing values based on observed patient patterns rather than random estimation. )Figure C3: (Kernel Density Estimate (KDE) plot comparing pre-remediation and post-remediation distributions for ‘Time in Hospital’. The near-perfect overlap of the distribution curves serves as forensic proof that the automated remediation preserved the original clinical “shape” and statistical integrity of the dataset. )Figure C4: (Comparative analysis of the Data Quality Index (DQI) showing the transition from fragmented raw data (pre-remediation) to a high fidelity clinical asset (post-remediation). The automated pipeline achieved a 25% aggregate improvement in data health, primarily by resolving systemic missingness in the weight and payer code attributes.)
Results 🟰
Completeness: Reduced null density for weight from ~97% missing to [XX%]
Consistency: Standardized diag_1 to primary clinical categories
Validity: Ensured num_lab_procedures fell within 1–132, flagging outliers
Overall DQI improvement: [XX% increase]
Technical Stack 🔨
Data Wrangling: pandas, NumPy
Data Remediation: Scikit-Learn (IterativeImputer), Regex normalization