An Interactive Web Solution for Electronic Health Records Segmentation and Prediction


A wide variety of patient data is collected and monitored through Electronic Health Records (EHR) using various tools in the clinical research industry, and healthcare providers must ensure the safety of patients participating in clinical trials. There is an evident need for a centralized analytics solution for EHR datasets that delivers insights and predictions.

The paper focuses on the healthcare industry, which can benefit immensely when medical practitioners are able to gain insights from EHR data. It aims to provide a platform for exploring the data, obtaining descriptive statistics, and performing patient segmentation and recommendation.

The objective of the paper is to begin with data acquisition and data understanding, and then to create a web interface for data exploration, segmentation, and classification. In the data modeling phase, the objective is to create machine learning models for segmentation and classification.

The first step is data acquisition from the MIMIC-III v1.4 (clinical database) data mart. In the data understanding phase, the relationships among multiple tables are evaluated. In the data wrangling phase, SQL and Python are used to combine different tables into a single dataset for analysis and modeling. The combined dataset is then used with k-means clustering to obtain clusters of congestive heart failure patients. In the following phase, the diagnosis text is extracted from the diagnosis dataset and cleaned by removing punctuation, numbers, and stopwords. From the cleaned text, TF-IDF (Term Frequency-Inverse Document Frequency) vectors and count vectors are created, multiple classification techniques are applied to predict the occurrence of death, and the best-performing model is selected for deployment.
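The text-cleaning and TF-IDF steps can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' code: the stopword list, the sample diagnosis strings, and the function names are all assumptions, and the weighting uses the plain tf * log(N / df) form, whereas library implementations typically add smoothing and normalization.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "of", "and", "with", "a", "an", "in", "to", "for"}


def clean_text(text):
    """Lowercase, drop punctuation and numbers, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up diagnosis strings, for illustration only.
raw = [
    "Heart failure, with pneumonia (2 days).",
    "Acute renal failure and sepsis",
    "Chest pain; rule out myocardial infarction",
]
tokens = [clean_text(r) for r in raw]
weights = tfidf(tokens)
```

Terms that occur in fewer documents receive higher weights, so in the first sample "heart" outweighs "failure", which also appears in the second. Vectors of this kind would then be passed to a classifier such as logistic regression.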

In the model evaluation phase, six clusters were found to be optimal when training the clustering model, which is incorporated into the application to predict patient segments based on risk level. Several machine learning models were trained on patients' historical diagnosis text; the logistic regression model achieved an AUC score of 89% on the test data and is deployed in the application for prediction.
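For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the labels and scores are made up for illustration and are not the paper's data, and the function name is an assumption.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = event occurred) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.5, 0.6, 0.8]
auc = auc_score(labels, scores)  # 7 of 9 pairs are ranked correctly
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a library routine would normally be used on a real test set.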

Finally, a web interface is created using the Python Streamlit framework, which allows users to bring in raw EHR datasets and explore the data. The segmentation and classification models are deployed with the web application and thus provide recommendations to the business.
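A Streamlit page for uploading and inspecting a raw dataset might look like the sketch below. It is an assumption-laden illustration, not the application described in the paper: the page title, the CSV input format, and the helper `summarize_csv` are hypothetical. The guarded import only lets the helper be exercised where Streamlit is not installed.

```python
import csv
import io

try:
    import streamlit as st  # third-party: pip install streamlit
except ImportError:  # lets the helper below be used without Streamlit
    st = None


def summarize_csv(csv_text):
    """Return (row_count, column_names) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return len(rows), list(reader.fieldnames or [])


def main():
    # Hypothetical page layout: a title, an upload widget, and a short summary.
    st.title("EHR data explorer (sketch)")
    uploaded = st.file_uploader("Upload a raw EHR extract (CSV)", type="csv")
    if uploaded is not None:
        n_rows, columns = summarize_csv(uploaded.getvalue().decode("utf-8"))
        st.write(f"{n_rows} rows, {len(columns)} columns")
        st.write(columns)


if st is not None:
    main()
```

Saved as `app.py` (a hypothetical file name), this would be launched with `streamlit run app.py`. Trained segmentation and classification models could be loaded in `main` and applied to the uploaded data in the same way.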


Keywords: Natural Language Processing, EHR, Segmentation, Serious Adverse Event Prediction


Journal Name: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering


Sudeep Mathew

Mithun D J

Rashmi Agarwal
