Last week I attended yet another hackathon - the MedTech SuperConnector challenge. The topic of the weekend was mental health, and coincidentally it fell at the end of my 6-week psychiatric attachment. Instead of talking about our idea, which I feel is not developed enough for a post, I want to write about some fundamental difficulties in taking a data-driven approach to psychiatric conditions. I will give a brief introduction to how such conditions are defined clinically, discuss the current approaches, and end with my opinion on how we should approach this field.
There are two main classification guidelines used by clinicians for mental disorders: the ICD, endorsed by the World Health Organisation (ICD-10 is still the edition in clinical use; its successor, ICD-11, was released in 2018), and the DSM-5, endorsed by the American Psychiatric Association. For the purpose of this post, I will use the ICD-10 guideline.
The ICD-10 groups mental disorders into the following broad categories:
I have rephrased and combined the last few categories for non-clinical readers. Of these, I will discuss mood disorders, as they currently seem to be among the most popular topics for machine learning applications.
This category also includes mania-related conditions. However, due to its relative frequency, most recent work has focused on depression. Here is how depression is described:
In typical depressive episodes of all three varieties described below (mild, moderate, and severe) the individual usually suffers from depressed mood, loss of interest and enjoyment, and reduced energy leading to increased fatiguability and diminished activity. Marked tiredness after only slight effort is common. [F32 Depressive episode, ICD-10]
In clinical practice, we often use the PHQ-9 questionnaire, which was developed by Pfizer based on the above description (they cited the APA guidelines, but that is irrelevant here). The questionnaire begins with the following question: “Over the last two weeks, how often have you been bothered by any of the following problems?”. The nine problems that follow include “Little interest or pleasure in doing things?”, “Feeling down, depressed, or hopeless?”, and so on. Patients answer each question with one of four choices (not at all, several days, more than half the days, nearly every day), which are converted to scores of 0 to 3 respectively.
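To make the scoring concrete, here is a minimal sketch of how a PHQ-9 total is computed. The response wording and the conventional severity bands are taken from the published questionnaire, not from this post:

```python
# Sketch of PHQ-9 scoring; response labels and severity bands follow
# the published questionnaire (assumed here, not defined in this post).
RESPONSE_SCORES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

def phq9_total(answers):
    """Sum the nine item scores; the total ranges from 0 to 27."""
    if len(answers) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    return sum(RESPONSE_SCORES[a] for a in answers)

def phq9_severity(total):
    """Conventional severity bands for the total score."""
    if total < 5:
        return "minimal"
    if total < 10:
        return "mild"
    if total < 15:
        return "moderate"
    if total < 20:
        return "moderately severe"
    return "severe"
```

For example, nine answers of “more than half the days” give a total of 18, which falls in the “moderately severe” band.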
As subjective as the above method may seem, a validation study of the PHQ-9 showed that a score >= 10 had an 88% specificity and 88% sensitivity for major depression. So far so good. However, the same paper states that the positive predictive value at the mild and moderate thresholds drops significantly, to 31%. Furthermore, the original study does not assess how sensitive the scores are to changes in a patient's state. The only two studies1,2 I found that attempted this both conclude that a better trial is required to assess the hypothesis. Considering the inherent subjectivity of not only the questionnaire but also the definition of a depressive disorder, any quantitative analysis will likely carry huge uncertainty.
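The drop in positive predictive value is a direct consequence of Bayes' rule: with sensitivity and specificity held fixed, PPV falls as prevalence falls. A small sketch (the 7% prevalence figure below is illustrative, not a number from the cited study):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule:
    P(disease | positive test) = TP rate / (TP rate + FP rate)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# With 88% sensitivity and specificity, an illustrative 7% prevalence
# yields a PPV of roughly 36% - the same ballpark as the 31% above.
print(round(ppv(0.88, 0.88, 0.07), 2))
```

In other words, the low PPV is not necessarily a flaw of the questionnaire itself; it is what happens when even a reasonably accurate screen is applied to a low-prevalence population.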
We need to understand the character of the uncertainty in this situation. Yarin Gal, in his PhD thesis, explains the possible sources of uncertainty in machine learning as follows:
These were written in the context of machine learning, but they are closely related to our discussion if you view the PHQ-9 as the first layer of a neural network model that extracts a certain set of features. The above can be rephrased accordingly.
Understanding the source of uncertainty matters because it changes how we should respond to it.
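One common way to separate these responses, following the ensemble / MC-dropout framing used in Gal's line of work, is to decompose predictive variance into disagreement between models (epistemic, reducible with more data) and the noise each model attributes to the data itself (aleatoric, irreducible). A toy sketch with made-up numbers:

```python
# Each "model" in a hypothetical ensemble predicts a mean symptom score
# and a noise variance; the numbers are illustrative only.
predictions = [(12.0, 4.0), (14.0, 4.5), (13.0, 3.5)]  # (mean, noise var)

means = [m for m, _ in predictions]
mu = sum(means) / len(means)

# Epistemic uncertainty: disagreement between ensemble members.
epistemic = sum((m - mu) ** 2 for m in means) / len(means)
# Aleatoric uncertainty: average noise the models assign to the data.
aleatoric = sum(v for _, v in predictions) / len(predictions)

print(mu, epistemic, aleatoric)
```

If the epistemic term dominates, collecting more labelled data helps; if the aleatoric term dominates, the measurement process itself (here, the questionnaire) is the bottleneck.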
There are several things we can do to address each source of uncertainty. The first, which stems from the lack of a quantitative definition of depression, can be mitigated to some degree with a probabilistic modelling approach. Fundamentally, we will need to gather more information about the subject's spatiotemporal environment to rule out unrelated effects. The methods of causal inference researchers may also be worth a try.
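As a sketch of what "a probabilistic modelling approach" could mean here (my own toy formulation, not from any cited work): treat the true severity as a latent variable and each questionnaire result as a noisy observation of it, so that repeated measurements shrink, but never eliminate, our uncertainty about the latent state.

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate Gaussian update: combine a prior belief about the
    latent severity with one noisy observation of it."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Vague prior over a 0-27 severity scale, then three noisy "survey" scores
# (all values here are illustrative).
mean, var = 13.0, 25.0
for score in [16.0, 14.0, 15.0]:
    mean, var = gaussian_update(mean, var, score, obs_var=9.0)
```

The posterior variance falls with each observation but never reaches zero while `obs_var` is positive: the residual floor is exactly the irreducible, aleatoric part of the problem.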
The other two sources of uncertainty concern the PHQ-9 survey itself - a hand-crafted feature extraction model. The development of deep neural networks has made such hand-crafted features obsolete in many domains, and I believe depression is no exception. We need to look at the raw data, which in this case is the audio recording of a consultation in which the clinician assesses the descriptive criteria of depression. Although there is work attempting to predict depression from vocal prosody, much of it is limited by the fact that the labels are based on PHQ-9 (or equivalent) scores - the very source of uncertainty that we want to address. Attempts to predict depression from facial expressions are also interesting but problematic from the uncertainty perspective. As much as I agree with the potential benefit, we need to be careful: facial expressions are likely to be ‘associated with’ depression but may not have a direct causal relationship with the clinical definition of depression.