
Recently I’ve been analysing 10,000 patient records as part of the proof of concept for a medical diagnosis application being developed by a start-up company.

This article describes the data analysis I have done on that dataset. The objective of the analysis was to use machine learning to identify patients suffering from a pulmonary embolism, as defined by the ICD-10 codes I26-I28. The analysis was done on a professional-level cluster hosted at CoCalc.

I developed the code in Python using a series of Jupyter notebooks; the PDF versions are available in my GitHub repository. I began with an exploratory data analysis using Pandas DataFrames. This showed that the laboratory tests done at admission were not by themselves useful for diagnosing pulmonary embolism. Both visually and statistically the lab results were indistinguishable across the different groups of patients: a t-test gave a t-statistic of 0.0017 and a p-value of 0.9985, so there was no evidence of a difference in means between the groups.
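As a rough illustration of that kind of group comparison, here is a minimal sketch of a two-sample t-test using SciPy. The data is synthetic, the variable names are placeholders, and the choice of Welch's variant (unequal variances) is my assumption for the example; it is not taken from the notebooks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for one lab value in two patient groups (not the real data).
lab_with_diagnosis = rng.normal(loc=5.0, scale=1.0, size=200)
lab_without_diagnosis = rng.normal(loc=5.0, scale=1.0, size=800)

# Welch's t-test: compares the group means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(
    lab_with_diagnosis, lab_without_diagnosis, equal_var=False
)

print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```

A t-statistic close to zero together with a p-value close to one is what you would expect to see when the two groups have essentially the same mean.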

Secondly, I did an analysis using scikit-learn. A DecisionTreeClassifier was used to see whether the ICD-10 codes I26-I28 relevant to pulmonary embolism could be differentially classified. This technique was chosen because it was important that the trained machine learning model be transparent and explicable to users. This was quite successful and produced a decision tree model which could be used within an app – a sample is shown below.
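To show why a decision tree counts as transparent, here is a small sketch (not the project model) that trains a shallow DecisionTreeClassifier on synthetic data and prints its rules as text. The column names, the label rule and the max_depth setting are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
n = 500

# Hypothetical feature table; these column names are placeholders.
X = pd.DataFrame({
    "age": rng.integers(18, 95, size=n),
    "lab_value": rng.normal(5.0, 1.0, size=n),
    "gender": rng.integers(0, 2, size=n),
})

# Synthetic label loosely tied to age so the tree has something to learn.
y = ((X["age"] > 60) & (rng.random(n) < 0.7)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree keeps the rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)

# export_text renders the fitted tree as nested if/else style rules.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
print(f"held-out accuracy: {accuracy:.3f}")
```

The printed rules are plain threshold tests on named features, which is the property that makes this kind of model easy to explain to a user.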

In addition, I compared the DecisionTreeClassifier to other classifier models and confirmed its effectiveness in this context. However, the RandomForestClassifier also has potential because of its ability to rank features by their importance. The metrics for the various models are shown below.
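A comparison of this kind can be sketched with cross-validation, followed by a feature ranking from a random forest. Everything here is illustrative: the data comes from make_classification, the feature names are placeholders, and LogisticRegression and the F1 metric are my own choices for the example rather than the models and metrics used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the patient data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean F1 score over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")

# Fit a forest on all the data and rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The importances sum to one, so each value can be read as that feature's share of the forest's total impurity reduction.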

Finally, I performed a correlation analysis which investigated disorders associated with pulmonary embolism – typically disorders of the circulatory system and blood clotting. I did this by calculating the Cramér's V statistic for categorical-categorical association. I found that investigating the associated disorders was very effective in producing potentially actionable insights.
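For readers unfamiliar with Cramér's V, it is derived from the chi-squared statistic of a contingency table: V = sqrt(chi2 / (n * (min(rows, cols) - 1))), giving a value between 0 (no association) and 1 (perfect association). The sketch below is one straightforward way to compute it; the helper name cramers_v and the toy columns are hypothetical, and it applies no bias correction, which may differ from the notebooks.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical variables (no bias correction)."""
    # Contingency table of observed counts for each pair of categories.
    table = pd.crosstab(np.asarray(x), np.asarray(y))
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    if n == 0 or k == 0:
        # A variable with a single category carries no association.
        return 0.0
    # Plain chi-squared statistic, without Yates' continuity correction.
    chi2 = chi2_contingency(table, correction=False)[0]
    return float(np.sqrt(chi2 / (n * k)))


# Toy data with placeholder column names (not the patient records).
df = pd.DataFrame({
    "disorder_a": ["yes", "yes", "no", "no"],
    "disorder_b": ["yes", "yes", "no", "no"],  # mirrors disorder_a exactly
    "disorder_c": ["yes", "no", "yes", "no"],  # unrelated to disorder_a
})

v_same = cramers_v(df["disorder_a"], df["disorder_b"])
v_unrelated = cramers_v(df["disorder_a"], df["disorder_c"])
print(v_same, v_unrelated)
```

In practice the same function can be applied to every pair of diagnosis columns to build an association matrix, which can then be sorted to find the disorders most strongly associated with the target.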

My report on the viability of the proposed app is generally positive, though we all understand this is at a quite preliminary stage. Overall, I found that although the exploratory data analysis indicated that the admission lab tests would not by themselves provide a basis for diagnosing pulmonary embolism, the classification analysis indicated that factors such as age, income, gender and ethnicity in combination with the LabValues could have diagnostic value. The correlation analysis was able to provide practical, actionable insights and is a valuable complement to the other analyses.