Wales Gene Park, embedded within the School of Medicine at Cardiff University, provides DNA sequencing facilities for cancer and rare genetic disease researchers.
“Cancer research colleagues may be interested, for example, in sequencing the DNA of a tumour and comparing it with the individual’s blood DNA to identify aberrant mutations, or rare genetic disease researchers might be interested in particular mutations within particular genes,” says Data Strategy and IT Infrastructure Lead Kevin Ashelford.
“We also look at the expression of those genes. We each have a whole portfolio of genes that are switched on and off to be expressed at different levels within each of our cells. By profiling that expression, we can assist colleagues in exploring different diseases,” he says.
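As a rough illustration of the tumour-versus-blood comparison described above, the sketch below treats each sample as a set of variants and keeps those found only in the tumour. The tuple layout, the function name `somatic_candidates` and the example data are assumptions made for illustration. Real analyses rely on dedicated variant-calling software, not a plain set difference.

```python
# Toy illustration only: a variant is (chromosome, position, reference base, alternate base).

def somatic_candidates(tumour_variants, blood_variants):
    """Return variants seen in the tumour but absent from the matched blood sample."""
    return sorted(set(tumour_variants) - set(blood_variants))

# Hypothetical example data, not taken from any real sample.
blood = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
tumour = [
    ("chr1", 1000, "A", "G"),   # also in blood: inherited, not tumour-specific
    ("chr2", 500, "C", "T"),    # also in blood
    ("chr17", 7500, "G", "A"),  # tumour only: a candidate aberrant mutation
]

candidates = somatic_candidates(tumour, blood)
print(candidates)
```

Variants shared by both samples are assumed to be inherited, so only the tumour-specific ones remain as candidates for closer inspection.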
The sheer volume of data involved makes this research challenging. Sequencing an individual’s DNA produces vast files of ‘sequence reads’, short DNA fragments that must be interpreted by comparing them with the reference genome: the human genome sequence completed in 2003.
“That requires sufficient computer processing power and memory to take all of those reads and map them, to produce a resulting alignment that we can then examine further. Once we have done that, we need to interpret the genome that we have put together, and then we need to visualise it. So there are various stages where the compute power and storage capability of Supercomputing Wales are required,” Ashelford says.
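To give a feel for what mapping reads to a reference involves, here is a minimal sketch of a seed-and-verify approach: index every short substring of the reference, look up the start of each read, then count mismatches at each candidate position. The function names, the tiny reference string and the parameter values are all hypothetical. Production aligners use far more sophisticated indexing and scoring, and nothing here is meant to represent the software Wales Gene Park actually runs.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-letter substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for the best placement of the read, or None."""
    best = None
    # Use the first k letters of the read as a seed to find candidate positions.
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

# Hypothetical toy data: a 23-letter "genome" and three short reads.
reference = "ACGTTGCAAGGCTTACGGATCCA"
index = build_index(reference, k=4)
reads = ["GCAAGGCT", "TTACGCAT", "GGGGGGGG"]

for read in reads:
    print(read, map_read(read, reference, index, k=4))
```

Even in this toy form the cost is visible: the index grows with the reference and every read needs its own lookup and comparison. With a human-sized reference and very large numbers of reads, that is where the processing power and memory Ashelford describes come in.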
“At Wales Gene Park we do have staff who are familiar with this technology and so we worked with Supercomputing Wales to develop our own separate partition on the system, dedicated to our requirements. It’s a collaboration, really. They provide the essential IT engineering system administration skillsets and we bring the data science skills needed to run the software,” says Ashelford.
Supercomputing Wales Research Software Engineer (RSE) Anna Price has also worked with Wales Gene Park on a natural language processing algorithm that simplifies the task of curating information from research papers.
“A lot of research papers are being generated with details of medically important mutations, and the Human Gene Mutation Database, based here in Cardiff, gathers that information together for clinicians and researchers, providing the largest resource for finding disease-causing mutations in the world. They’ve always done it manually, but there’s a huge amount of papers to go through. So Anna took some natural language processing algorithms and applied them to the problem. Her program identifies those papers that are likely to be of interest, and flags them – so it’s not eliminating the need for a curator, but certainly begins to automate the process.”
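The article does not say which natural language processing algorithms were used, so the following is only a simplified stand-in for the general idea of flagging papers for a curator. It scores each abstract by a handful of weighted cue words and flags those above a threshold. The cue words, weights, threshold and example abstracts are all invented for illustration.

```python
import re

# Hypothetical cue terms and weights, chosen only for illustration.
CUE_WEIGHTS = {
    "mutation": 2.0,
    "pathogenic": 2.0,
    "variant": 1.5,
    "missense": 1.5,
    "patient": 1.0,
}

def score_abstract(text):
    """Sum the weights of the cue terms that appear in the abstract."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(CUE_WEIGHTS.get(token, 0.0) for token in tokens)

def flag_for_curation(abstracts, threshold=3.0):
    """Return titles whose abstracts score at or above the threshold."""
    return [title for title, text in abstracts.items()
            if score_abstract(text) >= threshold]

# Invented example abstracts, not real publications.
abstracts = {
    "Paper A": "A novel missense mutation was identified in a patient with cardiomyopathy.",
    "Paper B": "We describe a new cooling system for data centre racks.",
}

flagged = flag_for_curation(abstracts)
print(flagged)
```

As in the process Ashelford describes, the output is a shortlist for a human curator to review, not a final decision about any paper.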