Learning clinical vectors from structured electronic health records

Learning efficient, dense, real-valued representations of high-dimensional clinical codes remains a key challenge in data-driven clinical applications such as predictive modelling and comparative effectiveness research, among others. In this study, we used a slightly modified version of word2vec (a word embedding technique) to vectorise diagnostic and medication codes in MIMIC-III, a structured electronic health record (EHR) dataset. Given a list of clinical codes corresponding to a hospital visit, a shallow neural network with one hidden layer was trained to predict all neighbouring codes in the visit from a single code. Once trained, the input-to-hidden layer weights were used as the embedding matrix to represent each clinical code. Compared with the traditional one-hot format, the learned representations are low-dimensional and dense, and they capture contextual similarity among codes. Moreover, the representations are learned in a fully unsupervised manner with minimal pre-processing. Representations for diagnostic codes can be visualised here.
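The training setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain skip-gram objective with a full softmax, treats every other code in a visit as context, and uses hypothetical names (`visit_pairs`, `train_embeddings`) and toy visit data. The exact modification made to word2vec in the study is not specified here, so none is reproduced.

```python
import itertools
import numpy as np

def visit_pairs(visits):
    """Yield (input_code, context_code) pairs: each code predicts every
    other code in the same visit. Assumes codes are unique within a visit."""
    for visit in visits:
        for a, b in itertools.permutations(visit, 2):
            yield a, b

def train_embeddings(visits, dim=8, epochs=50, lr=0.1, seed=0):
    """Train a one-hidden-layer network with a full softmax output
    (an assumption made for brevity) and return (vocab, embedding matrix)."""
    vocab = sorted({code for visit in visits for code in visit})
    index = {code: i for i, code in enumerate(vocab)}
    rng = np.random.default_rng(seed)
    w_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input-to-hidden weights
    w_out = rng.normal(scale=0.1, size=(dim, len(vocab)))  # hidden-to-output weights
    pairs = [(index[a], index[b]) for a, b in visit_pairs(visits)]
    for _ in range(epochs):
        for i, j in pairs:
            h = w_in[i].copy()                    # hidden activation for input code i
            scores = h @ w_out
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                  # softmax over all codes
            grad = probs
            grad[j] -= 1.0                        # gradient of cross-entropy w.r.t. scores
            grad_h = w_out @ grad                 # computed before w_out is updated
            w_out -= lr * np.outer(h, grad)
            w_in[i] -= lr * grad_h
    return vocab, w_in                            # rows of w_in are the code vectors

# Toy visits with made-up code labels, for illustration only.
visits = [["dx_A", "dx_B", "rx_C"], ["dx_A", "dx_B"], ["dx_D", "rx_C"]]
vocab, embeddings = train_embeddings(visits)
```

Each row of `embeddings` is the dense vector for the code at the same position in `vocab`. For a vocabulary of realistic size, a full softmax would be costly, and an approximation such as negative sampling would normally be substituted.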

Based on the ICD-9 grouping schema, we found an average recall@10 of 70% (the number of same-category diagnoses among the 10 most similar diagnoses) for common disease codes. The learned representations are thus useful for discovering interesting comorbidities (or relationships between codes) from EHRs. The results also motivate using a similar technique to discover groups of patients that are identical in terms of their clinical profile.
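The evaluation metric can be sketched as follows, under stated assumptions: similarity between code vectors is taken to be cosine similarity (the similarity measure is not specified above), every vector is non-zero, and `categories` holds one ICD-9 group label per code. The function name `same_category_at_k` and the toy data are hypothetical.

```python
import numpy as np

def same_category_at_k(embeddings, categories, k=10):
    """For each code, return the fraction of its k most similar codes
    (by cosine similarity, an assumption) that share its category."""
    vectors = np.asarray(embeddings, dtype=float)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = vectors @ vectors.T
    np.fill_diagonal(similarity, -np.inf)          # a code is not its own neighbour
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    labels = np.asarray(categories)
    hits = labels[top_k] == labels[:, None]        # same-category neighbours
    return hits.mean(axis=1)

# Toy example: two well-separated groups of two codes each.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_groups = ["circulatory", "circulatory", "endocrine", "endocrine"]
scores = same_category_at_k(toy_vectors, toy_groups, k=1)
```

Averaging the returned per-code scores over the codes of interest gives a single summary figure comparable in form to the one reported above.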

Awais Ashfaq, Anita Sant’Anna, Markus Lingman, Slawomir Nowaczyk