Estimating the diagnosis of health conditions in England

I am a member of Data Unlocked, a co-operative business formed to work on data projects for social good. We are currently working on a project with Inside Outcomes and Birmingham City University to estimate the rate of diagnosis of over twenty health conditions in England using openly available datasets. This blog post explains how we have made our estimations.

Earlier this year Data Unlocked spoke with Darren Wright of Inside Outcomes about the possibility of estimating the diagnosis of health conditions in England by area, something we were surprised wasn’t already available.

The National Health Service (NHS) do produce statistics, called Quality and Outcomes Framework (QOF) measures, that break down the prevalence of diagnosis of twenty-five different health conditions, from Atrial Fibrillation to Stroke, by GP practice.

They also produce statistics about the number of patients registered at each GP practice, broken down by the Lower Layer Super Output Area (LSOA) they live in.

Lower Layer Super Output Areas (LSOAs) are small areas of the country where between 1000 and 3000 people live. They are the smallest areas we know of that have a whole range of statistics produced about them. For instance, census statistics are broken down by LSOA.

Darren suggested that there might be a way of estimating the prevalence of diagnosis of these health conditions at LSOA level, using these two datasets.

The following is a non-technical overview of how we created the estimated dataset.

Firstly, we calculated how each GP practice’s patients are distributed across the LSOAs they live in. Here we used the data made available by NHS Digital on the number of Patients Registered at a GP Practice, January 2018.

As a made up example, take a GP practice that has the following breakdown of patients:

gp_code	gp_name	lsoa	patients
AGP001	THE AVERAGE SURGERY	LSOA001	100
AGP001	THE AVERAGE SURGERY	LSOA002	400
AGP001	THE AVERAGE SURGERY	LSOA003	500

In total it has 1000 patients. 10% of them live in LSOA001, 40% in LSOA002 and 50% in LSOA003.
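
The percentages above can be reproduced in a few lines of Python. This is only a minimal sketch using the made-up figures from the table; the names `registrations` and `lsoa_shares` are illustrative and are not taken from the project's own code.

```python
# Made-up rows from the example table: (gp_code, lsoa, patients).
registrations = [
    ("AGP001", "LSOA001", 100),
    ("AGP001", "LSOA002", 400),
    ("AGP001", "LSOA003", 500),
]


def lsoa_shares(rows):
    """Return {(gp_code, lsoa): share of that practice's patients living in the LSOA}."""
    # First pass: total number of registered patients per practice.
    totals = {}
    for gp_code, _lsoa, patients in rows:
        totals[gp_code] = totals.get(gp_code, 0) + patients
    # Second pass: each row's patients as a fraction of its practice total.
    return {
        (gp_code, lsoa): patients / totals[gp_code]
        for gp_code, lsoa, patients in rows
    }


shares = lsoa_shares(registrations)
for (gp_code, lsoa), share in shares.items():
    print(gp_code, lsoa, f"{share:.0%}")
```

For the example practice this prints shares of 10%, 40% and 50%.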

Next, we looked at the prevalence of diagnosis of health conditions for GP practices from the QOF measures. Again, here is a made up example:

gp_code	Indicator_group	register	patient_list_type	list_size
AGP001	AF	100	TOTAL	1000
AGP001	AST	300	TOTAL	1000
AGP001	CAN	125	TOTAL	1000

From this we have made estimates of how many patients from THE AVERAGE SURGERY have Atrial Fibrillation, based on where they live.

From the example in the above table, 100 of THE AVERAGE SURGERY’s patients have Atrial Fibrillation.

We know from our earlier table that 10% of THE AVERAGE SURGERY’s patients live in LSOA001.

So, we would take 10% of 100 and estimate that 10 of the patients who are registered at THE AVERAGE SURGERY and live in LSOA001 have Atrial Fibrillation.
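
As a sketch of that single step, the calculation can be written as a small Python function. The function name `apportion` and its parameter names are hypothetical. The comment spells out the assumption the method relies on: a condition is spread evenly across a practice's patients, wherever they live.

```python
def apportion(register, lsoa_patients, practice_patients):
    """Estimate how many patients on a condition register live in one LSOA.

    Assumes the condition is spread evenly across the practice's patients,
    so the LSOA receives the same share of the register as it has of the
    practice's patient list.
    """
    return register * lsoa_patients / practice_patients


# THE AVERAGE SURGERY: 100 patients on the AF register,
# 100 of its 1000 patients live in LSOA001.
af_in_lsoa001 = apportion(register=100, lsoa_patients=100, practice_patients=1000)
print(af_in_lsoa001)  # 10.0
```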

We would then add up the estimated numbers of patients with Atrial Fibrillation who live in LSOA001, regardless of the surgery where they are registered.

Using this method we have produced estimates for every LSOA, across all surgeries and all conditions.
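
Put together, the whole procedure might look something like the sketch below. It is an illustration rather than the project's actual code. AGP001 uses the made-up figures from the tables above, while AGP002 is a second, entirely hypothetical practice added only to show estimates being summed across surgeries. The function name `estimate_by_lsoa` is also an assumption.

```python
from collections import defaultdict

# (gp_code, lsoa, patients). AGP002 is a hypothetical second practice.
registrations = [
    ("AGP001", "LSOA001", 100),
    ("AGP001", "LSOA002", 400),
    ("AGP001", "LSOA003", 500),
    ("AGP002", "LSOA001", 300),
    ("AGP002", "LSOA002", 200),
]

# (gp_code, indicator_group, register). The AGP002 figure is invented.
registers = [
    ("AGP001", "AF", 100),
    ("AGP002", "AF", 20),
]


def estimate_by_lsoa(registrations, registers):
    """Return {(lsoa, indicator_group): estimated diagnosed patients}."""
    # Total registered patients per practice.
    practice_totals = defaultdict(int)
    for gp_code, _lsoa, patients in registrations:
        practice_totals[gp_code] += patients

    # Split each practice's register across its LSOAs in proportion to
    # where its patients live, then sum the pieces for each LSOA.
    estimates = defaultdict(float)
    for gp_code, indicator, register in registers:
        for reg_gp, lsoa, patients in registrations:
            if reg_gp == gp_code:
                share = patients / practice_totals[gp_code]
                estimates[(lsoa, indicator)] += register * share
    return dict(estimates)


estimates = estimate_by_lsoa(registrations, registers)
for (lsoa, indicator), value in sorted(estimates.items()):
    print(lsoa, indicator, round(value, 1))
```

With these figures LSOA001 ends up with an estimated 22 patients with Atrial Fibrillation: 10 from AGP001 plus 12 from the hypothetical AGP002.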

It is worth emphasising that what we are doing here is a rough estimate. We can’t, and don’t, claim that it is precise, and we certainly wouldn’t advise making decisions based on it. We do think that it is reasonably accurate (not a statistical term).

I am currently using some machine learning techniques to produce some analyses of the data. At the moment I am concentrating on clustering and I will post the initial results here soon.

This post originally appeared on Data Unlocked
