Patients who withdraw from medical research studies sometimes can’t be contacted; in medical jargon, they’re said to have been “lost to follow-up.” For those patients, researchers never learn whether the experimental drug or procedure worked. The smaller the number of study participants (known as the “sample size”), the less reliable the researchers’ conclusions, and every dropout with an unknown outcome effectively shrinks that number. So it’s important for researchers to learn the fate of as many patients as possible. That’s where data mining comes in. Scientists are using obituaries to fill in some of the missing outcomes. Finding this information would otherwise require a tedious process of tracking down outpatient medical records, hospital charts, or death certificates.

Another group of researchers employed obituary data mining to look for a relationship between childbirth and cancer, as well as other diseases. They culled data from the online obituaries of more than 79,000 women with cancers of the breast, ovary, lung, colon, and pancreas. This data mining found a “significantly reduced risk of cancer,” especially cancers of the breast and ovary, in women who had given birth.

In a separate study published a month later, the same author group showed that its novel approach to obituary data mining could be used for ongoing surveillance of trends in cancer deaths for both men and women. This data-driven approach to epidemiology is so important to disease prediction and surveillance that some have taken to calling it “infodemiology.” With current methods, incidence and mortality statistics lag about 3 years behind the dates of diagnosis and death, respectively. An uptick in cancer incidence or death, then, isn’t immediately evident to researchers and public health officials. Because the data collection and curation process drags on so long, it’s difficult to draw cause-and-effect conclusions linking prevention efforts to any reduction in cancer incidence. As a result, policymakers often lack solid information on which to base their decisions. Should funds go to screening programs or to lifestyle-based prevention programs, for example?

Online obituary data mining can give epidemiologists almost real-time feedback, helping them identify trends sooner and make more accurate predictions, and giving public health policymakers better data with which to make decisions. Data can also be analyzed retrospectively and compared with annual National Vital Statistics Reports to assess the accuracy of published data. If there’s a discrepancy, either the official stats were incorrect or the data mining approach was flawed—another study would be needed to make that determination. The good news is that so far, data mining has turned up results that are “highly correlated” with official published statistics.
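To make “highly correlated” concrete, here is a minimal sketch of one common way to compare two series of yearly counts: Pearson’s correlation coefficient. The researchers’ actual measure isn’t stated here, so treat the choice of Pearson’s r as an assumption. The function name `pearson_r` is hypothetical, and the counts are made-up placeholders, not figures from any study or from the National Vital Statistics Reports.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length series of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up placeholder counts for five consecutive years (NOT real data).
mined_counts = [120, 135, 150, 160, 158]          # deaths found in obituaries
official_counts = [1190, 1340, 1510, 1580, 1600]  # deaths in official reports

r = pearson_r(mined_counts, official_counts)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 means the two series rise and fall together even when their absolute sizes differ, which is what matters for spotting trends. It does not tell you which source is wrong when they diverge.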

The researchers use an innovative but complex data mining process known as rule-based information extraction. In other words, they feed a set of rules into a computer, which then crawls obituaries to collect or eliminate data based on those rules. For example, the researchers use pronouns to infer the obituee’s sex. In the aforementioned study of cancer and parity (childbirth), obituaries in which the pronouns “he,” “him,” and “his” appeared were eliminated from further analysis. Another rule tells the computer to infer age at death from the date of death if it’s given. If it’s not, the computer is instructed to use the date of publication as the date of death. That much is straightforward enough. But even the seemingly simple task of recording cause of death can be tricky. Here are just a few of the rules built into the algorithm the computer uses to crawl thousands of obituaries during the mining process:

  • Cause of death. To determine cause of death, the researchers can’t just look for the word “cancer,” for example, because someone who dies by suicide or dies in a car wreck might have left a will specifying that any donations be given to the American Cancer Society or to some other charity with “cancer” in its name. So the infodemiologists use phrases like “in lieu of flowers” to eliminate data that could be misleading. Likewise, they must filter out any reference to the person’s having been a cancer survivor. Any mention of cancer that remains can be assumed to have been the cause of death.
  • Duplicates. Online obituaries often appear on multiple sites, so the researchers devised a rule to eliminate duplicates by comparing name and age.
  • Geospatial information. Geographic data is important in disease surveillance (tracking), so the researchers use explicit mention of the obituee’s location to determine place of death. If that information is missing, the algorithm extracts the location of the funeral home as a proxy for place of death.
  • Exclusion criteria. If an obituee’s age at death, place or cause of death, or sex can’t be ascertained from the obituary text, that person’s obituary data is excluded from the study altogether.
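The rules above can be sketched in a few lines of Python. This is a simplified illustration, not the researchers’ actual algorithm: the `Obituary` fields, the function names, the sentence-splitting regex, and the use of a birth date to compute age are all assumptions made for the example, and the sample obituaries are invented.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

MALE = re.compile(r"\b(he|him|his)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers)\b", re.IGNORECASE)
# Sentences containing these phrases are dropped before looking for "cancer".
MISLEADING = ("in lieu of flowers", "cancer survivor")


@dataclass
class Obituary:  # hypothetical input record, assumed already scraped
    name: str
    text: str
    published: date
    born: Optional[date] = None
    died: Optional[date] = None
    location: Optional[str] = None
    funeral_home_location: Optional[str] = None


def infer_sex(text):
    # Simplification: any male pronoun marks the obituary as male.
    if MALE.search(text):
        return "M"
    return "F" if FEMALE.search(text) else None


def infer_cause(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences
            if not any(p in s.lower() for p in MISLEADING)]
    return "cancer" if any("cancer" in s.lower() for s in kept) else None


def age_at_death(ob):
    if ob.born is None:
        return None
    died = ob.died or ob.published  # fall back to the publication date
    had_birthday = (died.month, died.day) >= (ob.born.month, ob.born.day)
    return died.year - ob.born.year - (0 if had_birthday else 1)


def extract(obits):
    seen, records = set(), []
    for ob in obits:
        age = age_at_death(ob)
        key = (ob.name.strip().lower(), age)  # duplicate rule: name + age
        if key in seen:
            continue
        seen.add(key)
        record = {  # the name is deliberately left out to protect privacy
            "sex": infer_sex(ob.text),
            "age": age,
            "place": ob.location or ob.funeral_home_location,
            "cause": infer_cause(ob.text),
        }
        # Exclusion rule: drop the record if any field is missing.
        if all(value is not None for value in record.values()):
            records.append(record)
    return records


sample = [  # invented obituaries for illustration only
    Obituary("Jane Doe", "She died of breast cancer. She loved gardening.",
             date(2017, 5, 3), born=date(1950, 3, 2), died=date(2017, 5, 1),
             location="Columbus, OH"),
    Obituary("Jane Doe", "She died of breast cancer.",  # same person, other site
             date(2017, 5, 4), born=date(1950, 3, 2),
             funeral_home_location="Dayton, OH"),
    Obituary("Mary Roe", "She died in a car accident. In lieu of flowers, "
             "donate to the American Cancer Society.",
             date(2017, 4, 5), born=date(1963, 1, 15), location="Akron, OH"),
    Obituary("John Doe", "He died of lung cancer.",
             date(2017, 6, 10), born=date(1946, 11, 20),
             funeral_home_location="Cleveland, OH"),
]
records = extract(sample)
print(records)
```

In this sample, the second Jane Doe entry is skipped as a duplicate, and Mary Roe is excluded because the only mention of cancer sits in the “in lieu of flowers” sentence, leaving no cause of death. John Doe’s record survives with the funeral home standing in for place of death. A study limited to women, like the parity study, would then filter on the `sex` field.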

Online obits have proved useful, then, not just to historians and sociologists, but also to doctors. Together, obits form a unique data set that helps medical researchers confirm treatment outcomes, spot trends, and make better predictions. It’s important to note that only the data is retained; patients’ individual names are discarded to protect their privacy. But success in the data mining process, researchers say, “depends on the societal and cultural trends of publishing online obituary articles [and] disclosing the cause of death.” Next time you write an obituary, that’s something to consider. Listing your loved one’s cause of death lets her make a final contribution to the world as she takes her leave of it.

(Image courtesy mcmurryjulie/Pixabay.)