Estimating the diagnosis of health conditions in England

I am a member of Data Unlocked, a co-operative business formed to work on data projects for social good. We are currently working on a project with Inside Outcomes and Birmingham City University to estimate the rate of diagnosis of over twenty health conditions in England using openly available datasets. This blog post explains how we have made our estimations.

Earlier this year Data Unlocked spoke with Darren Wright of Inside Outcomes about the possibility of estimating the diagnosis of health conditions in England by area, something we were surprised wasn’t already available.

The National Health Service (NHS) does produce statistics, called Quality and Outcomes Framework (QOF) measures, that break down the prevalence of diagnosis of twenty-five different health conditions, from Atrial Fibrillation to Stroke, by GP practice.

They also produce statistics about the number of patients registered at GP practices, broken down by the Lower Layer Super Output Area (LSOA) they live in.

Lower Layer Super Output Areas (LSOAs) are small areas of the country where between 1000 and 3000 people live. They are the smallest areas we know of that have a whole range of statistics produced about them. For instance, census statistics are broken down by LSOA.

Darren suggested that there might be a way of estimating the prevalence of diagnosis of these health conditions at LSOA level, using these two datasets.

The following is a non-technical overview of how we created the estimated dataset.

Firstly, we calculated the prevalence of each indicator within a GP practice. Here we used the data made available by NHS Digital on the number of Patients Registered at a GP Practice, January 2018.

As a made up example, take a GP practice that has the following breakdown of patients:

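gp_code gp_name lsoa patients
AGP001 The AVERAGE SURGERY LSOA001 100
AGP001 The AVERAGE SURGERY LSOA002 400
AGP001 The AVERAGE SURGERY LSOA003 500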

In total it has 1000 patients. 10% of them live in LSOA001, 40% in LSOA002 and 50% in LSOA003.
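As a rough illustration, the percentages above can be reproduced with a few lines of Python. This is only a sketch using the made-up figures from the example; the names `patients_by_lsoa` and `lsoa_shares` are our own for illustration and are not part of any NHS dataset.

```python
# Made-up patient counts for one practice, keyed by LSOA (illustrative only).
patients_by_lsoa = {"LSOA001": 100, "LSOA002": 400, "LSOA003": 500}


def lsoa_shares(counts):
    """Return each LSOA's share of the practice's registered patients."""
    total = sum(counts.values())
    return {lsoa: n / total for lsoa, n in counts.items()}


shares = lsoa_shares(patients_by_lsoa)
for lsoa, share in shares.items():
    # Print each share as a whole-number percentage.
    print(f"{lsoa}: {share:.0%}")
```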

Next, we looked at the prevalence of diagnosis of health conditions for GP practices from the QOF measures. Again, here is a made up example:

gp_code Indicator_group register patient_list_type list_size
AGP001 AF 100 TOTAL 1000
AGP001 AST 300 TOTAL 1000
AGP001 CAN 125 TOTAL 1000


From this we have made estimates of how many patients from The AVERAGE SURGERY have Atrial Fibrillation based on where they live.

From the example in the above table there are 100 of The AVERAGE SURGERY’s patients who have Atrial Fibrillation.

We know from our earlier table that 10% of The AVERAGE SURGERY’s patients live in LSOA001.

So, we would take 10% of 100 and estimate that there are 10 patients registered at The AVERAGE SURGERY who live in LSOA001 and have Atrial Fibrillation.
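The same arithmetic can be sketched in Python for every LSOA the practice serves. The sketch relies on the assumption built into the method: within a practice, diagnosed patients are spread across LSOAs in the same proportions as all registered patients. The function name `apportion` is hypothetical, and the figures are the made-up ones from the example.

```python
def apportion(register, shares):
    """Split a practice-level register count across LSOAs by patient share.

    Assumes diagnosed patients are distributed across LSOAs in the same
    proportions as the practice's registered patients as a whole.
    """
    return {lsoa: register * share for lsoa, share in shares.items()}


# Made-up figures: 100 patients on the AF register, split 10% / 40% / 50%.
af_shares = {"LSOA001": 0.10, "LSOA002": 0.40, "LSOA003": 0.50}
af_estimates = apportion(100, af_shares)
for lsoa, estimate in af_estimates.items():
    print(f"{lsoa}: about {estimate:.0f} patients")
```

The estimates will not usually be whole numbers with real data, so they are best read as expected counts rather than head counts.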

We would then add up the estimated numbers of patients with Atrial Fibrillation in LSOA001 across every surgery where residents of that LSOA are registered.

Using this method we have estimated across all surgeries and all conditions.
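Putting the steps together, a minimal sketch of the whole calculation might look like the following. It is not our production code. The second practice (AGP002) and all of the numbers are invented for illustration, and the sketch takes each practice's total from the registration data rather than from the QOF list size, which is an assumption on our part.

```python
from collections import defaultdict

# (gp_code, lsoa, registered patients). AGP002 is a hypothetical second practice.
registrations = [
    ("AGP001", "LSOA001", 100),
    ("AGP001", "LSOA002", 400),
    ("AGP001", "LSOA003", 500),
    ("AGP002", "LSOA001", 300),
    ("AGP002", "LSOA002", 200),
]

# (gp_code, indicator group, patients on the register). All figures made up.
qof = [
    ("AGP001", "AF", 100),
    ("AGP001", "AST", 300),
    ("AGP002", "AF", 50),
]


def estimate_by_lsoa(registrations, qof):
    """Estimate diagnosed patients per (lsoa, indicator) across all practices."""
    # Total registered patients per practice.
    practice_totals = defaultdict(int)
    for gp_code, _lsoa, patients in registrations:
        practice_totals[gp_code] += patients

    # Apportion each practice's register by LSOA share, then sum over practices.
    estimates = defaultdict(float)
    for gp_code, indicator, register in qof:
        for reg_gp, lsoa, patients in registrations:
            if reg_gp == gp_code:
                share = patients / practice_totals[gp_code]
                estimates[(lsoa, indicator)] += register * share
    return dict(estimates)


estimates = estimate_by_lsoa(registrations, qof)
for (lsoa, indicator), value in sorted(estimates.items()):
    print(lsoa, indicator, round(value, 1))
```

In this toy data, LSOA001 receives Atrial Fibrillation estimates from both practices, which are simply added together.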

It is worth emphasising that what we are doing here is a rough estimate. We can’t, and don’t, claim that it is precise and certainly wouldn’t advise making decisions based on it. We do think that it is reasonably accurate (not a statistical term).

I am currently using some machine learning techniques to produce some analyses of the data. At the moment I am concentrating on clustering and I will post the initial results here soon.

This post originally appeared on Data Unlocked
