What is natural language processing? Six questions with Amy Olex
The machines are learning. But that’s OK, because Amy Olex, M.S., is there to teach them.
The senior bioinformatics specialist at the Wright Center is extracting de-identified information from troves of clinical notes so that health researchers at VCU and VCU Health can create meaningful studies and bring research results to patients more quickly.
Natural language processing (NLP) is her specialty, and she is here to help us understand it.
What is the technology behind NLP text analysis and when did it come about?
NLP’s been around for a while, but the algorithms and the science have improved. It started in linguistics, analyzing properties of texts and looking at word frequencies and distributions. The earliest NLP application to become widely available was the digital spellcheck. From there, it’s grown to the point where it can predict your sentences. There are now NLP applications that can write actual papers that get accepted to journals.
NLP has really taken off in recent years because of the ability to store and process massive amounts of information. We haven’t had that capacity until recently. That’s why Google is so advanced: they’ve had this massive database of text, and they have the ability to run statistics and machine learning algorithms over it and process the linguistic patterns from everyone who enters data into their system. That’s how you get text prediction in Gmail. Now, more of us have the technology to store and process large volumes of data.
Before we had all of that data available, NLP algorithms were rule-based, built on manually written rules. So, if you’re looking for terms associated with cancer, you build out manual dictionaries of all the different ways the word ‘cancer’ can be represented in clinical notes. Then you’d have regular expressions and rules go through the text and find the patterns you manually specified. Honestly, rule-based systems are still very popular because they’re easily interpretable. When they miss something, we immediately know why: ‘Oh, we didn’t have a rule for that instance.’
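To make that concrete, here is a minimal Python sketch of the rule-based approach described above. The term list and patterns are hypothetical stand-ins, not the Wright Center’s actual rules; a real clinical system would use much larger, curated dictionaries and far more sophisticated rules.

```python
import re

# Hypothetical, hand-built dictionary of surface forms for "cancer".
# A real rule-based system would use much larger, curated term lists.
CANCER_TERMS = [
    r"cancer",
    r"carcinoma",
    r"malignan(?:t|cy)",
    r"neoplasm",
    r"tumou?r",
]

# One compiled rule per term: whole-word, case-insensitive matches.
RULES = [re.compile(rf"\b{term}\b", re.IGNORECASE) for term in CANCER_TERMS]

def find_cancer_mentions(note: str):
    """Return (matched_text, start, end) for every rule that fires on the note."""
    hits = []
    for rule in RULES:
        for match in rule.finditer(note):
            hits.append((match.group(), match.start(), match.end()))
    return hits

note = "Pt presents with invasive ductal carcinoma; no other malignancy noted."
print(find_cancer_mentions(note))
# [('carcinoma', 33, 42), ('malignancy', 53, 63)]
```

The appeal of this style is exactly the interpretability Olex mentions: if a mention is missed, the fix is visible by inspection, such as a missing term in the list.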
Now, with machine learning, when you’re using the large datasets and something is wrong, you don’t necessarily know why it found what it found. It’s a black box. Clinicians don’t like black boxes, especially when you’re pulling out medical concepts. It’s gotten a whole lot better over the years, for sure. But it’s an active research field, figuring out how these machine learning algorithms make their decisions about what is, for example, a cancer concept and what is not.
What’s a typical research project that you might employ NLP for?
One of the most frequent types of project is extracting de-identified concepts or information from unstructured clinical notes, information that is not stored in the structured Electronic Health Record (EHR) data. In structured EHR data, there are set fields and values, so there’s a specific line for ‘blood pressure’ and a space for the clinician to enter those numbers.
But if, say, a doctor types up a narrative of a person’s history, with all their symptoms and their symptom progression, that’s not going to be in the structured EHR data. That’s going to be an unstructured blob of notes.
For example, in cancer research, researchers might want to pull out notes on a tumor’s stage. A lot of the time, tumor stage is written into the unstructured, narrative text. But some of that information may actually be important for defining the cohort for a clinical trial or for your study. You’re going to want to pull out those notes, extract specific terms, measurements and so on, and discretize them into a structured database for processing.
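As an illustration of that discretization step, here is a minimal, hypothetical Python sketch that pulls tumor-stage mentions out of narrative text and emits structured rows. The pattern is an assumption for demonstration; real pipelines must handle far more variation (TNM codes, abbreviations, negated mentions) than this.

```python
import re

# Hypothetical pattern for stage mentions such as "stage II" or "Stage 3".
# Real clinical notes contain far more variation than this sketch covers.
STAGE_PATTERN = re.compile(r"\bstage\s+(IV|III|II|I|[1-4])\b", re.IGNORECASE)

def extract_stage(note_id: str, note_text: str):
    """Turn one free-text note into structured rows, one per stage mention."""
    rows = []
    for match in STAGE_PATTERN.finditer(note_text):
        rows.append({
            "note_id": note_id,                     # link back to the source note
            "stage_raw": match.group(),             # text exactly as written
            "stage_value": match.group(1).upper(),  # normalized, queryable field
        })
    return rows

note = "Biopsy consistent with stage ii adenocarcinoma of the left lung."
print(extract_stage("note-001", note))
# [{'note_id': 'note-001', 'stage_raw': 'stage ii', 'stage_value': 'II'}]
```

Rows like these can then be loaded into a structured database and queried alongside the regular EHR fields when defining a study cohort.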
Fifty years ago, how might a researcher have conducted a study where they knew there was important information they needed inside clinical notes?
Manually going through the notes! They would pull out the information by hand, and honestly, they still do that to this day, because NLP is not widely available. And it’s still not generalizable, especially in the clinical domain. Before NLP, if you needed stuff that was in written text, you hired students to dig through the data and pull out the information you needed.
How has NLP played into COVID-19 research?
There’s a group of informatics people like myself working with NLP as part of the National COVID Cohort Collaborative (N3C), a national, centralized data resource for researchers studying COVID-19.
The NLP group is working to build out the algorithmic infrastructure that a researcher could use to study not only COVID-19 but also how it interacts with comorbidities like obesity or depression, where descriptors for those conditions are not going to be in the structured data. It will be very impactful once we can get that additional information into the N3C repository.
How does being at a Clinical and Translational Science Award (CTSA) institution matter when it comes to NLP work?
It’s a big advantage. The Wright Center is like a large networking hub. We have connections on the medical campus. We have connections on the academic campus. So, working in NLP, I have access to the clinical data that someone on an academic campus does not have easy access to, and I have access to the clinicians who understand that data. But I also have access to people in the College of Engineering’s Department of Computer Science, who develop the NLP algorithms and are really knowledgeable about machine learning.
I’m not a clinician. I came from a computer science background. So when I’m working on a project that’s dealing with a specific type of condition like cancer or depression, I need access to subject matter experts to figure out what concepts I need to pull out. What’s important for differentiating patient A from patient B?
That network, all at one university, is a really valuable thing to have for any researcher who comes to us. I’ve had researchers on the clinical side, and I’ve connected them with people on the academic side, and vice versa. The Wright Center is a bridge between worlds – two worlds that sometimes speak different languages. And researchers need to know that it’s here and that they can take advantage of it.
What else would you want a researcher to know about NLP and how it can help them?
NLP can be very powerful. It can do a lot of cool things, but it’s still not easy. The projects require very close collaboration between me, as the coder developing the algorithm, and the principal investigator (PI) in charge of the research, who understands the clinical background of the problem they’re studying.
NLP is not a quick ‘Hey, I need these concepts.’ It’s ‘Okay, you need these concepts. Our pipeline isn’t set up for that right now, so we need to work together to figure out what we need to change to get you the answers that you need.’ It’s very project specific. You can’t just throw data at us and have us throw answers back.
And we’re always happy to talk with people if they have questions. People sometimes hesitate to contact us because they don’t have a specific research project. But sometimes you don’t know what your research project is going to be until you know what services we can provide. It helps to call us before you even write a grant and say, ‘Hey, what do you guys do?’
Use this form to request an NLP consultation and other informatics services at the Wright Center.