Exploratory data analysis regarding demographic, lifestyle, and clinical correlations with diabetes diagnosis | October 2025
Author: Andrew Castro
This project performs an exploratory analysis of a diabetes risk dataset sourced from Kaggle. It focuses on
cleaning demographic and clinical data, transforming binary fields for clarity, and utilizing Power BI to
test hypotheses regarding physical activity and diet. The final output is an interactive dashboard
leveraging DAX for dynamic measures to identify risk factors such as BMI, LDL cholesterol, and socioeconomic
influence. Note that the analysis is based on a simulated dataset and may not reflect
real-world data.
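The binary-field transformation mentioned above can be pictured with a minimal Python sketch. The project itself did this work in Power BI, so this is only an illustration: the column names (`diagnosed_diabetes`, `family_history_diabetes`, `hypertension_history`) and the "Yes"/"No" labels are assumptions, not the dataset's confirmed schema.

```python
# Hypothetical field names; the real dataset's schema is not shown in this write-up.
BINARY_LABELS = {0: "No", 1: "Yes"}
BINARY_FIELDS = ("diagnosed_diabetes", "family_history_diabetes", "hypertension_history")


def label_binary_fields(row, fields=BINARY_FIELDS):
    """Return a copy of `row` with 0/1 values in `fields` replaced by 'No'/'Yes'."""
    labeled = dict(row)  # copy so the original record is not modified
    for field in fields:
        if field in labeled:
            value = labeled[field]
            # Missing or unexpected values are left untouched so they stay visible.
            labeled[field] = BINARY_LABELS.get(value, value)
    return labeled


sample = {
    "age": 52,
    "diagnosed_diabetes": 1,
    "family_history_diabetes": 0,
    "hypertension_history": None,  # unknown value, kept as-is
}
labeled_sample = label_binary_fields(sample)
print(labeled_sample)
```

Leaving unknown values unchanged, rather than coercing them to "No", keeps them distinguishable in later percentage calculations.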
More info below.
Key DAX Measures Created:
Age Influence
Positive diagnoses followed a typical bell curve, peaking between the ages of 45 and 55, with an average
diagnostic age of approximately 50.
Socioeconomic Influence
Positivity rates were highest among the White population (45.10%), followed by a significantly lower
rate in the Hispanic population (19.90%).
Education level displayed a clear inverse relationship with positive diagnosis rate. Positivity rates were highest (44.90%) among
those with a high school education and lowest (14.80%) among those with a postgraduate education.
Income levels followed a similar pattern, with the middle class affected the most.
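A per-group percentage like those above can be computed in more than one way, and this write-up does not state which definition the dashboard measures use. The sketch below, on made-up records with hypothetical field names (`education`, `diagnosed`), computes both the rate within each group and each group's share of all positive diagnoses, since the two can differ noticeably.

```python
from collections import defaultdict


def positivity_by_group(records, group_key, outcome_key="diagnosed"):
    """Return, per group, the within-group positive rate and the share of all positives."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[outcome_key]:
            positives[group] += 1

    all_positives = sum(positives.values())
    summary = {}
    for group, total in totals.items():
        summary[group] = {
            # Positives in the group divided by everyone in the group.
            "rate_within_group": positives[group] / total,
            # Positives in the group divided by positives across all groups.
            "share_of_positives": positives[group] / all_positives if all_positives else 0.0,
        }
    return summary


# Toy data for illustration only; not taken from the Kaggle dataset.
toy_records = [
    {"education": "High school", "diagnosed": 1},
    {"education": "High school", "diagnosed": 1},
    {"education": "High school", "diagnosed": 0},
    {"education": "High school", "diagnosed": 0},
    {"education": "Postgraduate", "diagnosed": 1},
    {"education": "Postgraduate", "diagnosed": 0},
]
summary = positivity_by_group(toy_records, "education")
for group, stats in summary.items():
    print(group, stats)
```

In this toy data both groups have the same within-group rate (0.5), yet the larger group accounts for two thirds of all positives, which is why the choice of denominator matters when comparing groups of different sizes.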
Among individuals with a positive diabetes diagnosis, 28.63% reported having a family history of diabetes. This was true for 28.68% of females, 28.50% of males, and 30.43% of individuals with a non-specified gender. Some individuals have an unknown family history, as is to be expected, and they lower this overall percentage. Additionally, 29.51% of positively diagnosed individuals had a history of hypertension.
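The effect of unknown family history on the percentage can be sketched as follows. This is an illustrative Python example on made-up records, assuming hypothetical fields `diagnosed` (0/1) and `family_history` (1, 0, or `None` for unknown); it is not the measure used in the dashboard.

```python
def family_history_share(records, exclude_unknown=False):
    """Share of positively diagnosed individuals who report a family history.

    With exclude_unknown=False, unknown values stay in the denominator and
    pull the percentage down; with True, they are dropped first.
    """
    positives = [r for r in records if r["diagnosed"] == 1]
    if exclude_unknown:
        positives = [r for r in positives if r["family_history"] is not None]
    if not positives:
        return None  # no basis for a percentage
    with_history = sum(1 for r in positives if r["family_history"] == 1)
    return with_history / len(positives)


# Toy data for illustration only.
toy_records = [
    {"diagnosed": 1, "family_history": 1},
    {"diagnosed": 1, "family_history": 0},
    {"diagnosed": 1, "family_history": 0},
    {"diagnosed": 1, "family_history": None},  # unknown family history
    {"diagnosed": 0, "family_history": 1},     # not diagnosed, ignored
]
share_with_unknown = family_history_share(toy_records)
share_without_unknown = family_history_share(toy_records, exclude_unknown=True)
print(share_with_unknown, share_without_unknown)
```

Here the share is 1 in 4 when unknowns are counted and 1 in 3 when they are excluded, so reporting which denominator was used makes the figure easier to interpret.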
A clear correlation exists between an individual’s weekly physical activity and a reduced risk of diabetes. Conversely, individuals with higher BMI and LDL cholesterol levels faced escalating risk of a positive diagnosis. Interestingly, while these physical metrics were indicators, diet scores and sleep patterns showed negligible correlation with the diabetic risk score in this dataset.
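One common way to quantify relationships like these is the Pearson correlation coefficient. The write-up does not say which statistic the dashboard relies on, so the following is only a sketch under that assumption, using made-up values and hypothetical variable names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy data for illustration only; not taken from the Kaggle dataset.
activity_minutes_per_week = [0, 60, 120, 180, 240]
risk_score = [40, 35, 30, 25, 20]   # falls steadily as activity rises
diet_score = [3, 9, 1, 8, 4]        # no obvious pattern against risk

r_activity = pearson(activity_minutes_per_week, risk_score)
r_diet = pearson(diet_score, risk_score)
print(round(r_activity, 3), round(r_diet, 3))
```

A coefficient near -1 or +1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. Neither case establishes causation, which is especially relevant for a simulated dataset.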
lower = less risk
higher = more risk
Hypothesis 2 was not supported in this case study. Average post-prandial glucose was recorded
at 160.04 mg/dL.
Individuals with a poor diet score averaged ~171 mg/dL post-prandial glucose,
compared to ~166 mg/dL for those with a perfect score; the small difference suggests minimal correlation in this
dataset. Both glucose averages are considered elevated blood sugar levels.
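To put the gap between the two diet groups in perspective, the reported averages can be compared in absolute and relative terms. The short sketch below uses only the approximate figures quoted above; the function name `gap` is illustrative.

```python
def gap(mean_a, mean_b):
    """Return the absolute difference and the difference relative to mean_b."""
    absolute = mean_a - mean_b
    relative = absolute / mean_b
    return absolute, relative


poor_diet_glucose = 171.0     # mg/dL, approximate average reported above
perfect_diet_glucose = 166.0  # mg/dL, approximate average reported above

absolute_gap, relative_gap = gap(poor_diet_glucose, perfect_diet_glucose)
print(f"{absolute_gap:.1f} mg/dL difference, {relative_gap:.1%} relative to the perfect-score group")
```

The difference is 5 mg/dL, roughly 3% of the perfect-score average, which is consistent with the reading that diet score separates the groups only slightly in this dataset.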
Additionally, those positively diagnosed had slightly higher triglycerides (123.21 mg/dL)
than those negatively diagnosed with diabetes, but both averages fell within the healthy range (below 150.00 mg/dL).
Employment status and income levels were explored to find a correlation between an individual’s ability to afford a healthy lifestyle and their diabetic risk. However, the positive diagnosis rates for the unemployed (39.78%) and employed (39.85%) groups were nearly identical.
Physical activity was associated with significantly lower diabetes risk, while diet score showed negligible correlation with glucose levels in this dataset.
LDL cholesterol and BMI emerged as strong clinical risk factors, and socioeconomic influences such as education also
correlated with diabetes prevalence. However, it is vital to understand that the analysis and correlations are based
solely on the simulated dataset and its sampling biases. They cannot be taken to reflect real-world data.
This exploratory data analysis reveals the importance of combining lifestyle,
demographic, and clinical data to understand complex health risks.