Dr. Lee Harland, founder and scientific director of SciBite (an Elsevier company).
EC Segar’s famous cartoon creation Popeye was miraculously transformed into a superhero by eating a can of spinach, a premise no doubt influenced by numerous scientific reports from the 1920s touting the vegetable as a “superfood.” Thanks to Popeye’s endorsement of spinach’s strength-building properties, attributed to its high iron content, Americans’ spinach consumption increased by 33%, according to Popeye’s official website.
However, in 1981, scientist TJ Hamblin discovered conflicting reports showing that the iron content of spinach had been overstated tenfold, and concluded that Segar, Popeye and the general public had been misled by a single misplaced decimal point. This cautionary tale of human fallibility became legend and can still be found on the internet today.
In 2010, Dr. Mike Sutton set out to study the science behind “The Spinach Popeye Decimal Iron Error Story,” or “SPIDES.” After months of detective work, he concluded that the story, while convincing, was sadly not true. The tenfold discrepancy Hamblin found is real, but Hamblin himself was unable to locate a source for the misplaced decimal point. The likely explanation is that the discrepancy came from the way iron was measured (e.g., using wet or dry spinach). Neither value was wrong, and the fact that they differed by a factor of ten and therefore “looked like” a decimal point error was pure coincidence. The correct “answer” depended on the exact experiment that was performed.
As a result, two competing beliefs now persist. The first is the incorrect truism that Segar got it “wrong.” The second is that spinach is a good source of iron for humans (one of the “lies that won’t die,” according to Dr. Sutton [pg. 28], because spinach is rich in iron, but much of it is not metabolizable by humans).
Although at first glance this anecdote may seem trivial, it is a parable of the critical importance of metadata (data about data), something that is vital for those who work with AI in the life sciences today.
Always let the facts get in the way of a good story.
Looking back on Dr. Sutton’s detective work, we can see how incorrect assumptions, once accepted, become embedded in belief and repeated. The decimal point story would be a good one if it weren’t for the facts. In the life sciences, good data hygiene and an understanding of the origin of the datasets used in models and analyses are essential. Researchers need to know how the data they are using was generated to be sure it is reliable, verifiable and reproducible. Otherwise, bad science may follow.
The recent controversy over alleged manipulation in a “historic” 2006 article on Alzheimer’s disease is a good example. If the allegations are proven, any assumptions built on the data in what the BMJ describes as the fifth most-cited paper on Alzheimer’s disease since its publication will be called into question.
AI could be responsible for several SPIDES-like cases if we feed computer systems with datasets whose origin and content are poorly described. Researchers must adhere to three fundamental elements of data hygiene:
• Data standards: The explosion of data in the life sciences has created the need for new standards. The FAIR data principles (findable, accessible, interoperable and reusable) are most important and ensure that data can be shared and reused for an unlimited number of potentially disparate projects. FAIR encourages standardization in how organizations capture and manage data and is essential for creating the quality training data required by machine learning algorithms.
• Adoption of ontology: As organizations generate and collect more data from internal and external sources, it must be harmonized to meet FAIR standards and allow comparison and analysis. Well-managed ontologies are crucial here. Ontologies are human-generated, machine-readable knowledge descriptions that describe classes of things and the relationships between them. They transform unstructured scientific texts into clean data that can be exploited by AI.
• Good metadata: The adoption of standards and ontologies is essential to the goal of good metadata. These data descriptions give context to data sets and are a critical part of ensuring that the data can be understood and retrieved. In the SPIDES example, metadata describing how iron content was measured would have provided vital context explaining apparent data discrepancies.
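The SPIDES point about measurement metadata can be made concrete with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `IronMeasurement` record, its field names, the two reports and their numeric values are invented, not taken from Hamblin's or Sutton's sources. The sketch shows how two figures that differ tenfold stop looking contradictory once each carries a record of the basis on which it was measured.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class IronMeasurement:
    """A reported iron value plus the metadata needed to interpret it.

    All field names are assumptions made for this sketch.
    """
    value_mg_per_100g: float
    basis: str                  # "fresh_weight" or "dry_weight"
    dry_matter_fraction: float  # share of fresh sample mass left after drying


def to_dry_weight_basis(m: IronMeasurement) -> float:
    """Express a measurement in mg per 100 g of dry matter."""
    if not 0.0 < m.dry_matter_fraction <= 1.0:
        raise ValueError("dry_matter_fraction must be in (0, 1]")
    if m.basis == "dry_weight":
        return m.value_mg_per_100g
    if m.basis == "fresh_weight":
        # The same iron is spread over less mass once the water is removed.
        return m.value_mg_per_100g / m.dry_matter_fraction
    raise ValueError(f"unknown measurement basis: {m.basis!r}")


def agree(a: IronMeasurement, b: IronMeasurement, rel_tol: float = 0.05) -> bool:
    """Compare two measurements only after putting them on a common basis."""
    return math.isclose(
        to_dry_weight_basis(a), to_dry_weight_basis(b), rel_tol=rel_tol
    )


# Invented values: two reports that look tenfold apart at face value.
report_a = IronMeasurement(3.0, basis="fresh_weight", dry_matter_fraction=0.1)
report_b = IronMeasurement(30.0, basis="dry_weight", dry_matter_fraction=0.1)

naive_match = math.isclose(
    report_a.value_mg_per_100g, report_b.value_mg_per_100g, rel_tol=0.05
)
print("Raw values agree:", naive_match)
print("Values agree once the basis is known:", agree(report_a, report_b))
```

Without the `basis` field, the only available comparison is the naive one, and the gap invites a guess such as a misplaced decimal point. Raising an error on an unknown basis is a deliberate choice in this sketch: a value whose measurement method is undescribed is rejected rather than silently compared.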
AI should not be a “black box” solution.
Ultimately, researchers cannot take the data at face value. This is the case whether the data is used for entry-level use cases, such as semantic search and big data integration, or for more sophisticated computational approaches, such as machine learning and deep learning. Better data representation is needed to ensure that data is verifiable and that faulty conclusions do not become viruses that infect datasets and reproduce.
Today, many life science companies are unable to realize the true value of their data due to the way it has been captured and then managed. Data hygiene best practices that combine human curation of ontologies and metadata with thorough data standards are essential. Basically, adhering to these practices minimizes the chances of researchers creating their own “lies that won’t die”. Rather, researchers will be empowered to unlock the wealth of information hidden in large datasets and apply it effectively and efficiently.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs, and technology executives.