Journal:Data management challenges for artificial intelligence in plant and agricultural research

From LIMSWiki
Full article title Data management challenges for artificial intelligence in plant and agricultural research
Journal F1000Research
Author(s) Williamson, Hugh F.; Brettschneider, Julia; Caccamo, Mario; Davey, Robert P.; Goble, Carole; Kersey, Paul J.; May, Sean; Morris, Richard J.; Ostler, Richard
Author affiliation(s) University of Exeter, University of Warwick, National Research Institute of Brewing, Earlham Institute, University of Manchester, Royal Botanic Gardens, University of Nottingham, John Innes Centre, Rothamsted Research, Alan Turing Institute, University of Edinburgh
Primary contact S dot Leonelli at exeter dot ac dot uk
Year published 2023
Volume and issue 10
Article # 324
DOI 10.12688/f1000research.52204.2
ISSN 2046-1402
Distribution license Creative Commons Attribution 4.0 International
Website https://f1000research.com/articles/10-324/v2
Download https://f1000research.com/articles/10-324/v2/pdf (PDF)

Abstract

Artificial intelligence (AI) is increasingly used within plant science, yet it is far from being routinely and effectively implemented in this domain. Particularly relevant to the development of novel food and agricultural technologies is the development of validated, meaningful, and usable ways to integrate, compare, and visualize large, multi-dimensional datasets from different sources and scientific approaches. After a brief summary of the reasons for the interest in data science and AI within plant science, the paper identifies and discusses eight key challenges in data management that must be addressed to further unlock the potential of AI in crop and agronomic research, and particularly the application of machine learning (ML), which holds much promise for this domain.

Keywords: data science, plant science, crop science, agricultural research, machine learning, data management, data quality, data sharing

Introduction

Data science is central to the development of plant and agricultural research and its application to social and environmental problems of a global scale, such as food security, biodiversity, and climate change. Artificial intelligence (AI) offers great potential towards elucidating and managing the complexity of biological data, organisms, and systems. It constitutes a particularly promising approach for the plant sciences, which are marked by the distinctive challenge of understanding not only complex genotype-environment (GxE) interactions that span multiple scales from the cellular through the microbiome to climate systems, but also GxE interactions with rapidly shifting human management practices (GxExM) in agricultural and other settings, whose reliance on digital innovations is growing at a rapid pace. [Wang et al. 2020; Harfouche et al. 2019] Accordingly, examples of useful applications of AI—and particularly machine learning (ML)—to plant science contexts are increasing, with the COVID-19 pandemic crisis further accelerating interest in this approach. [King 2020]

Nevertheless, we are still far from a research landscape in which AI can be routinely and effectively implemented. A key obstacle concerns the development and implementation of effective and reliable data management strategies. Developing reliable and reproducible AI applications depends on having validated, meaningful, and usable ways to integrate large, multi-dimensional datasets from different sources and scientific approaches. This is especially relevant to the development of novel food and agricultural technologies, which rely on research from diverse fields including fundamental plant biology, crop research, conservation science, soil science, plant pathology, pest/pollinator ecology and management, water and land management, climate modelling, agronomy, and economics.

This paper explores data-related challenges to potential applications of AI in plant science, with particular attention paid to the analysis of GxExM interactions of relevance to crop science and agricultural implementations. It brings together the experiences of an interdisciplinary set of researchers from the plant and agricultural sciences, the engineering and computational sciences, and the social studies of science, all of whom are working with complex datasets spanning genomic, physiological, and environmental data and computational methods of analysis. The first part of the paper provides a brief overview of contemporary AI and data science applications within plant science, with particular attention paid to the UK and European landscape where the authors are based. The second part identifies and discusses eight challenges in data management that must be addressed to further unlock the potential of AI for plant science and agronomic research. We conclude with a reflection on how transdisciplinary and international collaborations on data management can foster impactful and socially responsible AI in this domain.

AI in plant research: Current status and challenges

Following wider trends in the biosciences, both basic and applied plant sciences have increasingly emphasized data-intensive modes of research over the last two decades. [Leonelli et al. 2017; Leonelli 2016, 2019] The capacity to measure biological complexity at the molecular, organismal, and environmental scales has increased dramatically, as demonstrated by [Tardieu et al. 2017]:

  • advances in high-throughput genomics and norms and tools that have supported the development of a commons of publicly shared genomic data;
  • the development of platforms for high-throughput plant phenotyping in the laboratory, the greenhouse, and the field; and
  • the proliferation of remote sensing devices on crop-growing fields.

Such platforms and associated data generation have contributed to a booming AI industry in commercial agriculture, focused on the delivery of “precision” farming strategies, with estimates that the market will be worth US $1.55 billion by 2025.[1] Indeed, AI applications in plant research and agriculture have so far primarily benefited large-scale industrial farming [Carbonell 2016], with R&D investment focused on commodity crops such as wheat, rice and maize; high-value horticulture crops such as soft fruits; and the enhancement of large-scale orchards and vineyards. In addition to this, however, the amount and type of data being collected, alongside advancements in AI methods, offer the opportunity to ask and address new questions of great importance to plant scientists and agricultural stakeholders around the world. [Tsiligiridis & Ainali 2018]

AI is the field of study and development of computer hardware and software that perform functions, such as problem solving or learning, which have traditionally been considered properties of intelligent life. A range of research fields have contributed to the development of AI, currently the most prominent of which is ML, the design of algorithms for data processing, prediction, and decision support that are able to learn from a priori (“supervised”), inductive (“unsupervised”), and reward-based (“reinforcement”) experience. [Mitchell 1997][a] This approach is particularly significant for applications that do not require an exact understanding of how the algorithm has reached its decision, as long as it has predictive power and it is possible to reproduce it. [Napoletani et al. 2015]
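The difference between the supervised and unsupervised settings mentioned above can be made concrete with a small sketch. The Python example below is illustrative only: the leaf-length measurements, the class labels, and the function names are invented for this purpose and are not drawn from any of the cited studies. A nearest-centroid classifier stands in for supervised learning and a two-cluster k-means routine stands in for unsupervised learning; reinforcement learning is not shown.

```python
# Hypothetical 1-D measurements (leaf length in cm) and labels, invented for illustration.
leaf_lengths = [2.0, 2.2, 1.9, 5.1, 4.8, 5.3]
growth_stages = ["seedling", "seedling", "seedling", "mature", "mature", "mature"]


def fit_centroids(values, labels):
    """Supervised step: learn one mean value per labelled class."""
    totals, counts = {}, {}
    for value, label in zip(values, labels):
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def classify(centroids, value):
    """Assign a new value to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=20):
    """Unsupervised step: split unlabelled values into two clusters (k-means, k=2)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical: only one cluster exists
            break
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


centroids = fit_centroids(leaf_lengths, growth_stages)
print(classify(centroids, 3.0))   # relies on the labels supplied during fitting
print(two_means(leaf_lengths))    # never sees the labels
```

In the supervised case the labels define what is to be predicted; in the unsupervised case the grouping emerges from the measurements alone, and whether it is biologically meaningful still has to be checked separately.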

ML has been the dominant AI technology applied to plant and agricultural research so far. Many successful examples come from bioinformatics, where researchers may not need to worry about why a sequence of amino acids was classified as alpha-helical in structure, for instance, as long as they know how reliable that prediction is. Indeed, ML has been widely used in the analysis of sequence data, for example to identify signal peptides and functional domains in amino-acid sequences via neural nets and profile hidden Markov models, such as Pfam and SMART [El-Gebali et al. 2019], as well as other classic examples. [Larrañaga et al. 2006] One key example from genomics that goes back to the 1990s is the use of models to identify genes and predict their functions based on training data from multiple species. [Hayes & Borodovsky 1998; Birney et al. 2004; Zou et al. 2019] This has ongoing relevance for orphan and non-model crop research, where experimental approaches such as CRISPR knockouts to identify and validate gene function for individual species may not be feasible or cost-effective, but results may be inferred from experiments in model species. [Zou et al. 2019] Other challenges in genomics that can be addressed include the inference of gene regulatory networks [Mochida et al. 2018] and the identification of pathogen virulence effector genes from genomic sequence data. [Sperschneider 2019] Thus, ML can help to identify correlations not readily picked up by more traditional approaches and in turn suggest fruitful directions for further research. To date, whether or not correlations have biological meaning typically needs to be ascertained via experiment and/or observational data. [Leonelli 2014; Smith & Cordes 2019] Efforts towards explainable AI are, however, gaining momentum, and both methodological and computational techniques are emerging which promise to support biological use of ML. [Schramowski et al. 2020]

Alongside applications in genomics, AI offers new opportunities for linking genotypes to phenotypes. [Wang et al. 2020] Image-based plant phenotyping has proven a particularly fertile area for the application of ML techniques, with the rapid development of non-destructive methods for the evaluation of plant responses to biotic and abiotic stress [Singh et al. 2016; Mohanty et al. 2016; Ramcharan et al. 2017] and estimation of photosynthetic capacity [Fu et al. 2019], as well as a variety of feature detection, counting, classification, and semantic segmentation tasks. [Jiang & Li 2020] With the arrival of deep supervised convolutional networks, the performance of ML algorithms in predicting leaf counts improved considerably. [Dobrescu et al. 2017] Convolutional neural networks (CNNs) were also shown to be capable of performing challenging tasks of point feature detection [Pound et al. 2017] and pixelwise segmentation [Yasrab et al. 2019; Soltaninejad et al. 2020] on both roots and shoots in a variety of imaging modalities in both laboratory and field environments. [Gao et al. 2020] These technologies present substantial new opportunities for analyzing and understanding GxExM interactions through the integration of high-throughput phenotyping data with other forms of research data, including genomic, field evaluation, and climatic data. As well as addressing fundamental research questions, AI applications in this area offer the opportunity to understand and improve a range of practical activities from crop breeding through agricultural management. Three are discussed in more detail here, while Table 1 collates all ML and AI application examples discussed in this paper.

Table 1. Examples of machine learning (ML) and artificial intelligence (AI) applications in plant and agricultural science discussed in this paper, and methods used in those referenced papers.
Example | Section discussed | Key ML/AI methods used | Source(s)
Gene identification and function prediction across species | This section | Various; see citations in review paper | [Zou et al. 2019]
Inference of gene regulatory networks | This section | Bayesian networks, random forest, Markov random fields, tree-based models, dynamic factor graph models | [Mochida et al. 2018]
Identification of pathogen virulence effector genes from genomic sequence data | This section | Support vector machine, random forest, convolutional neural networks, ensemble learning, Bayesian networks, tree-based models | [Sperschneider 2019]
Non-destructive evaluation of plant responses to biotic and abiotic stress | This section | Support vector machine, artificial neural networks, convolutional neural networks | [Singh et al. 2016; Mohanty et al. 2016; Ramcharan et al. 2017]
Automatic estimation of photosynthetic capacity | This section | Artificial neural networks, support vector machine, least absolute shrinkage and selection operator (LASSO), random forest, Gaussian process regression | [Fu et al. 2019]
Convolutional neural networks for plant phenotyping image analysis | This section | Convolutional neural networks, support vector machine, random forest, encoder-decoder model, multi-loss multi-resolution network, deep residual network | [Jiang & Li 2020; Dobrescu et al. 2017; Pound et al. 2017; Yasrab et al. 2019; Soltaninejad et al. 2020]
Augmenting genomic selection models in plant breeding with machine learning | This section | Bayesian regularized neural networks, radial basis function neural networks, reproducing kernel Hilbert space, random forest regression | [Gonzalez-Camacho et al. 2018; Harfouche et al. 2019]
Prediction of soil characteristics from near-infrared and mid-infrared soil spectroscopy data | This section | Regularised linear models, support vector machines, tree-based models | [Data Study Group Team 2020]
Automatic identification of crop pest insects using bioacoustics data | This section | Support vector machines, random forest, randomized trees classifier, gradient boosting classifier | [Potamitis et al. 2015]
Automatic digitisation of herbaria specimens and specimen metadata | Next section | Convolutional neural networks | [Carranza-Rojas et al. 2017; Younis et al. 2018]
Leaf-counting models for plant phenotyping image analysis | This and next section | Multi-task learning, adversarial learning, layerwise relevance propagation, guided back propagation | [Dobrescu et al. 2017, 2019, 2020; Giuffrida et al. 2019]
Computer Vision Problems in Plant Phenotyping (CVPPP) workshops | Next section | Various; see citations in review paper | [Tsaftaris & Scharr 2019]
Image analysis for automatic disease diagnosis in multiple crops using PlantVillage Nuru | Next section | Convolutional neural networks | [Ramcharan et al. 2019]

One example of the opportunities offered by AI is genomic selection (GS), an approach for estimating breeding values for individual plants that can guide breeders’ decisions for selection and crossing [Crossa et al. 2017], based on modelling associations between quantitative traits and a genome-wide set of markers. Accuracy of predictive models for GS and rate of genetic gain can be increased by employing ML, although the utility of ML in comparison to existing statistical models varies depending on the characteristics of the trait of interest. [Gonzalez-Camacho et al. 2018] A promising opportunity for the improvement of GS lies in using ML for the integration and the analysis of data from different omics layers (e.g., proteomics, metabolomics, and metagenomics) that mediate between genotype and phenotype, facilitating the prediction of quantitative traits based on biological mechanisms rather than genetic marker associations and thereby increasing the reliability and utility of models for a wider range of populations than is currently possible. [Harfouche et al. 2019]
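To illustrate what "modelling associations between quantitative traits and a genome-wide set of markers" involves, the following Python sketch fits a ridge regression of a trait on marker genotypes and predicts trait values for unseen plants. It is a minimal sketch under stated assumptions, not the method of any cited study: ridge regression is a common statistical baseline for GS, whereas the ML models discussed above (neural networks, random forests) are not implemented here. The population is simulated, and the marker count, penalty value, and function names are hypothetical choices made for illustration.

```python
import random


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_ridge(markers, phenotypes, penalty):
    """Estimate an intercept and one effect per marker; penalty shrinks effects towards 0."""
    n, p = len(markers), len(markers[0])
    means = [sum(row[j] for row in markers) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    xc = [[row[j] - means[j] for j in range(p)] for row in markers]
    yc = [y - y_mean for y in phenotypes]
    # Normal equations of ridge regression: (X'X + penalty * I) effects = X'y
    xtx = [[sum(r[j] * r[k] for r in xc) + (penalty if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(xc, yc)) for j in range(p)]
    effects = solve(xtx, xty)
    intercept = y_mean - sum(m * e for m, e in zip(means, effects))
    return intercept, effects


def predict(intercept, effects, markers):
    """Predicted trait value: intercept plus the sum of estimated marker effects."""
    return [intercept + sum(g * e for g, e in zip(row, effects)) for row in markers]


def correlation(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Simulated population: 60 plants, 8 markers coded 0/1/2, additive trait plus noise.
rng = random.Random(1)
true_effects = [rng.gauss(0, 1) for _ in range(8)]
genotypes = [[rng.choice([0, 1, 2]) for _ in range(8)] for _ in range(60)]
traits = [sum(g * e for g, e in zip(row, true_effects)) + rng.gauss(0, 0.5)
          for row in genotypes]

# Train on the first 40 plants, then predict the remaining 20 from markers alone.
intercept, effects = fit_ridge(genotypes[:40], traits[:40], penalty=1.0)
predicted = predict(intercept, effects, genotypes[40:])
print("predictive correlation:", round(correlation(predicted, traits[40:]), 3))
```

The correlation between predicted and observed values in the held-out plants is one common way of comparing a statistical baseline of this kind with an ML alternative trained on the same data.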

A second example concerns long-term experiments. Long-term experiments (LTEs), in which the same crop or crop rotation is grown for many years subject to a range of different management or treatment options, have an important place in agricultural research. Data from these experiments enable separation of agronomic and environmental (weather) influences on crop yield, as well as soil health over time, and have done much to influence modern farming practices. [Poulton et al. 2018; Jensen et al. 2020] The "Classical Experiments" at Rothamsted Research [Parolini 2015; Macdonald et al. 2018] are important examples. The data from these experiments, some of which were started in 1843, are available and documented in the Electronic Rothamsted Archive (e-RA) data resource. [Perryman et al. 2018] Data from LTEs continue to be the subject of new analytical methods [Addy et al. 2020], yet remain a relatively untapped resource for knowledge discovery, in part because of the complexity of the experimental designs and the difficulty in accounting properly for the changes that might have occurred during their lifespans. To make LTEs more accessible for knowledge discovery, a recent initiative was launched by the Global Long Term Experiment Network to catalogue LTEs using a standard metadata schema. The use of ML methods combining data from LTEs with local weather data might, for example, reveal hidden patterns linked to long-term or higher-order interactions, which could provide useful insights into the impact of future climate change.

Agricultural monitoring is a vital third example. AI offers many opportunities to improve the cost and labor efficiency of longstanding research and monitoring tasks in research and agricultural settings. While such possibilities are most developed in commercial agricultural settings, there are many opportunities too for the public research sector, as well as for small or non-commercial farmers, for example in agricultural settings where there is limited access to relevant scientific expertise. Take, for example, the assessment of soil health, a key driver of crop yields. Wet soil chemistry analyses are both expensive and time-consuming, and are generally not accessible to growers in low- and middle-income countries. Using near-infrared (NIR) and mid-infrared (MIR) soil spectroscopy data, ML models that are faster and cheaper to run can be developed to predict soil characteristics and nutrient content. [Data Study Group Team 2020] Such models could be integrated with plant physiology models in the future to predict optimal crop performance in a given soil, and they open the possibility of developing hand-held soil devices for use directly by farmers or local advisors in countries where lab access and resources are limited. In another example of agricultural monitoring, conventional suction and light traps for monitoring the appearance and migration of airborne insects, including crop pests, currently require manual identification. Such methods can be augmented by ML models trained to recognize and classify insect species based on bioacoustics data [Potamitis et al. 2015], connected to in-field sonic sensors. Such developments are directed at increasing the scalability of insect pest monitoring networks and also potentially removing the need for manual steps for some insect species.

While these three examples give grounds for optimism about AI and ML applications in plant and agricultural science, the effective implementation of these and similar methods depends in large measure on establishing a favorable data landscape, consisting of the networks and practices of sourcing, managing, and maintaining data. This is particularly important for research undertaken outside of resource-intensive commercial sites, including research in and for low- and middle-income countries. Identifying the primary challenges faced by users and would-be users of AI in the contemporary data landscape of plant science is necessary in order to understand the possibilities and limitations afforded by AI for public as well as private plant and agricultural research. Here we build on the experiences of leading UK-based researchers in these areas to identify and discuss eight key data challenges, summarized in Table 2. These challenges span technical, social, and governmental domains, and will require concerted international and transdisciplinary efforts from a range of stakeholders to address.

Table 2. Synoptic view of the data challenges of effectively implementing ML and AI in plant and agricultural science, possible solutions, and what can be lost and gained by investment in those areas.
Data challenges | Solutions | Risks | Payoff | Trade-offs
Heterogeneity of data types and sources in biology and agriculture | Implement FAIR (findability, accessibility, interoperability, and reusability) principles [Wilkinson et al. 2018] for all data types. Acknowledge and reward data sources. | Inconsistent standardization between domains and communities | New possibilities for multi-scale analysis integrating diverse data types | There are difficulties in implementing standards while retaining domain-specific insights.
Selection and digitization of data that is viable for AI applications | Provide clear and accessible guidance on data requirements for AI. Develop new procedures for priority setting and selecting data. | High labor costs of digitization and analysis on resources that may not prove to be significant | AI tools and outputs that push forward the cutting edge of plant science research | Data management procedures may take up a considerable budget and effort.
Ensuring sufficient linkage between biological materials and data used for AI applications | Have clear documentation of material provenance when producing data and throughout analytical workflows. | Increased documentation costs and exposure of commercially or otherwise sensitive materials | Clear understanding of the biological scope of AI tools | Analysis of documentation around materials requires specific expertise and effort.
Standardization and curation of data and related software to a level appropriate for AI applications | Develop and use shared semantic standards. Standardize data at the point of collection. | Potential to lose system-specific information that does not fit a common standard | Reusable multi-source data sets and easier validation and sharing between groups | Some plant data (e.g., phenotypic observations) remain very difficult to standardize.
Obtaining adequate training and ground truth data for model validation and development | Ensure that data quality benchmarking is tailored to analytical purposes. Expand collections of ground truth and training datasets. | Data quality assessment requires error estimates and information on data collection, which are often lacking. | Reproducible and sound inferences with clear scope of validity | Tailoring data to specific research goals runs counter to the popular narrative of AI relying on "representative" training data and "generalizable" solutions.
Access to and use of computing and modeling platforms, and related expertise | Make software and models open and adaptable where appropriate, and/or have clear documentation on their scope. Provide researchers with full workflows, not just software. | Software used outside its range of proven usefulness and danger of extrapolation and overfitting | A suite of tools with clearly marked utility and relevance for a wide range of analytical tasks in the plant sciences | There are difficulties in getting the required know-how to travel together with software and models.
Improving responsible data access | Open access to datasets held by government and research institutions. Implement data governance regimes to protect sensitive data and ensure benefit sharing. | “Digital feudalism”; unequal distribution of benefits from public or personal data | Greater data resources of direct relevance to agricultural and other plant science applications | There are ongoing difficulties in identifying and implementing non-exploitative, equitable models for data sharing.
Engagement across plant scientists, data scientists, and other stakeholders | Invest in and promote data services for plant scientists. Additionally, promote plant science problems, especially GxE interactions, to ML researchers. Identify and invest in grand challenges and engagement. | High cost with potentially limited impact unless closely targeted to needs and interests of researchers and wider stakeholders | Greater community participation in the development of ML as a resource for plant science | There is long-term investment involved, and its value depends on active and regular engagement of stakeholders.

In the remainder of the paper, we review these challenges in detail, drawing on a range of examples from fundamental and translational plant science. Several of the challenges are shared with the biosciences more broadly, reflecting the conditions and complexity of biological research, while others are specific to plant science and agriculture. In the conclusion, we offer some reflections on how these challenges could be overcome.

Data challenges

Data diversity and continuing obstacles to data sharing

Biological research tends to be very fragmented compared with other sciences, and biological data is highly heterogeneous as a result. [Hey et al. 2009; Marx 2013; Leonelli 2019; Strasser 2019] A key reason for this is the attention paid by biologists to the unique characteristics of the target systems that they are studying: different species of mushrooms, bacteria, trees, ferns, and mammals can behave and interact with their environment in fundamentally different ways, which in turn affects their different structures, functioning and reproduction. Biodiversity thus encourages the production of research methods and instruments specifically tailored to the "endless forms most beautiful" in question, with different laboratories producing data in a wide variety of ways. Added to this, there is the multiplicity of purposes for which biological research is conducted, which in the plant and crop sciences include the production of genetically engineered crops, understanding growth conditions, improving crop yield, and identifying medically useful compounds, many of which also require the study of key environmental features such as soil and climate conditions. Moreover, the translation of plant research into agronomic spaces is made especially complex by the multiplicity of stakeholders, with breeders focused on the specific conditions in their target markets, farmers producing a large variety of data of potential research interest as part of their everyday work, and many companies working in agritech (including companies producing sensing devices for farms), although many data producers remain secretive around their own data practices and datasets. Furthermore, there is a divergence between the large emphasis on omics data within academic plant science and the equally strong focus on phenotypic data for crop evaluation favored in more applied domains, which is only partly mitigated by ongoing efforts to bridge this gap and exploit the complementary nature of these data resources through integration and interoperability.

Last but not least, there is no consensus on data formats, standards and methods of analysis. Datasets are typically collected with a specific hypothesis or practical use in mind, with much data not generated in machine-readable formats and data standards rarely prioritized when developing new methods or technologies. Data circulation is also limited, due to a lack of targeted incentives and necessary infrastructures as well as a general reluctance from researchers to share their data beyond their immediate communities of collaborators. Many research funders and institutions do not yet provide concrete incentives to make data publicly available, including rewards and resources to match the significant labor involved. This has significant implications for researchers, especially given the competitive culture predominant within the life sciences and the well-founded fear that spending resources on data curation may lower the publication rate of any one group, with negative effects on their reputation and future endeavors. [Leonelli et al. 2017; European Commission 2017]

This fragmented data landscape limits the opportunities for the application of AI to plant research and agronomy. For example, when object recognition software is applied to human faces, relatively homogeneous reference sets of photographs are available for training, but equivalent data is not available when the same technologies are aimed at identifying morphological traits in plants. The introduction of the FAIR principles [Wilkinson et al. 2018]—stating that data should be findable, accessible, interoperable, and reusable—has greatly helped to address some of these issues.[b] Some organizations are promoting the “FAIRification” of data using semantic web technologies (e.g., https://www.go-fair.org), but even more limited forms of annotation, semantification, and standardization would significantly facilitate applications within more restricted domains. Many molecular biology data are already integrated in structured, curated, and interlinked public repositories [Rigden & Fernández 2020], which are widely used by the research community. This is not surprising given the historical ties between the development of sequencing technologies and the emergence of computation [November 2012; Stevens 2013; Strasser 2019] and related database standards and classification initiatives [Mackenzie et al. 2013], often starting with data from model organisms grown in standard conditions (like Arabidopsis thaliana) with large associated research communities. [Leonelli & Ankeny 2012]

At the same time, many other types of data are not as standardized, and the heterogeneity of data formats and methods across different areas of the life sciences is likely to affect the ways in which FAIR principles are implemented. Such differential adoption of FAIR principles and resources may, again, constrain the potential for ML to integrate data across multiple domains. Indeed, while the FAIR data principles are increasingly being applied across the plant sciences [Rodriguez-Iglesias et al. 2016; Pommier et al. 2019; Reiser et al. 2018], different projects have developed different elements of FAIR depending on their specific goals and context. Some applications, such as FAIDARE (FAIR Data-finder for Agronomic REsearch)[2], have focused on findability. Others, such as the Crop Ontology and related ontologies in the Planteome project, have focused on interoperability and semantic standards. AI and ML applications depend heavily on the interoperability and reusability dimensions of FAIR, but these have received less attention overall than findability and accessibility. As well as the semantic efforts mentioned above, more recent initiatives such as BrAPI (Breeding API) [Selby et al. 2019] and MIAPPE (Minimum Information about a Plant Phenotyping Experiment) [Papoutsoglou et al. 2020] have addressed these aspects in a more targeted way.

Acknowledging and rewarding those who generate data would go a long way towards encouraging effective data sharing. One approach to this issue is exemplified by the Annotated Crop Image Database[3], which is set up to show only fragments of annotated images of plant phenotypes, without necessarily showing the detailed metadata that would allow others to re-use those images for biological research. This encourages biologists to share their data as early as possible to support the development of methods such as feature detection, while at the same time protecting those data from re-use by other biologists for as long as the original data producers need to publish their own results. This is only one among many possible ways of adequately acknowledging data sources, with other approaches favoring early data publication (for instance in data journals) as a way to reward data producers while also fast-tracking data sharing. The Research Data Alliance is one among many organizations engaged in developing conventions and methods, for instance the CARE and TRUST principles, to reassure those providing data that their own research and publications will not be adversely affected.[4] [Lin et al. 2020] It is imperative that such guidance is visibly implemented and that researchers are trained to understand its significance for their own work and data management strategies.

Selecting and digitizing data

Given the wide variety of data types, formats, and sources in the plant sciences, determining which data resources could be selected for AI-informed analysis constitutes a serious challenge. Are there datasets of immediate potential if suitably curated, and what metadata is needed to describe datasets so that their suitability for inclusion in a given analysis can be assessed? Establishing clear criteria and priorities for data selection is crucial given the considerable amount of work required to digitize, curate, and process datasets and related metadata. Such criteria should consider the ML task at hand, the scientific goals, and the concerns of individuals and groups holding the data.
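As a purely illustrative sketch of how such selection criteria might be made explicit, the following Python snippet screens a dataset description against a list of required metadata fields. The record structure, the field names, and the screening rule (digitized, licensed, of the right data type, and with no required metadata missing) are all assumptions made for this example; they are not drawn from the article or from any existing standard.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DatasetRecord:
    """Hypothetical description of a candidate dataset (illustrative only)."""
    name: str
    data_type: str                      # e.g. "image" or "genotype"
    digitized: bool                     # is the data available in digital form?
    licence: Optional[str]              # None if re-use terms are unknown
    metadata_fields: Set[str] = field(default_factory=set)


def missing_metadata(record: DatasetRecord, required: Set[str]) -> Set[str]:
    """Return the required metadata fields that the record does not provide."""
    return set(required) - record.metadata_fields


def is_candidate(record: DatasetRecord, required: Set[str], data_type: str) -> bool:
    """Assumed screening rule: digitized, licensed, right type, metadata complete."""
    return (
        record.digitized
        and record.licence is not None
        and record.data_type == data_type
        and not missing_metadata(record, required)
    )
```

A screen of this kind only captures the technical side of selection. The scientific goals and the concerns of those holding the data, which the criteria should also reflect, would still need to be weighed by the researchers involved.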

Consider herbarium specimens as a promising potential substrate for ML. Collectively, the world’s herbaria contain an estimated 392,353,689 plant specimens as of December 2019 [Thiers 2020], associated with metadata describing the place and time of their collection. ML can be used to infer useful information from the physical and molecular characteristics of the specimens to support automatic identification of plants [Carranza-Rojas et al. 2017], or to find material with potentially useful traits. [Younis et al. 2018] Recent efforts have combined specimen images, their associated metadata (including descriptive labeling), and associated field images. [Carranza-Rojas et al. 2017] These approaches could be used to monitor ex situ conservation efforts, to track changes in natural and farmed distribution of species in response to environmental changes, to trace the spread of invasive weeds, or for many other applications not strictly related to crop research. However, many herbaria are only partially digitized, if at all. Most specimens have not been imaged or subjected to molecular analysis, and even basic metadata is often not databased, existing only as hand-written or typed annotations attached to the physical specimen. As a result, even taking an inventory of stock is not possible, and the material can only be accessed through a physical visit. Thus, while the new technologies of imaging, molecular analysis, and ML have created new possibilities to exploit these historic collections [Soltis 2017], these will remain unrealized until the information they contain is extracted, digitized, and made publicly available, all of which are very labor-intensive tasks.

Interestingly, ML itself may be able to help solve this problem: the transcription of physical herbarium labels may be supported by the use of ML to interpret handwriting. A useful step towards this is the recent production of a benchmark dataset of transcribed herbarium labels [Dillen et al. 2019], which could be used to assess the performance of algorithms. This does not, however, help to address questions of data selection. Researchers still need to decide which specimens and related data and metadata to prioritize given limited resources and the vast scale of existing collections. In turn, the selection of usable and relevant data and the digitization of records are tightly associated with the prioritization of research problems and questions on which to work. There is relatively little investment in improving procedures and methods in this area, and yet there is a need for processes through which researchers explicitly consider and debate which data should take precedence and why. Without such processes, the ensemble of data being curated risks being patchy and fragmentary, the random result of individual efforts by separate and uncoordinated projects rather than of a community effort to locate and invest in data of most relevance to all. Indeed, without such processes, pressure to use automatic methods, and to be seen using them, can aggravate the problem, with researchers investing resources in the creation of large datasets without considering whether and how those data could be used.
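To illustrate how a benchmark of transcribed labels could be used to assess a handwriting-recognition algorithm, the sketch below computes a character error rate (CER): the edit distance between a reference transcription and an algorithm's output, divided by the length of the reference. CER is a commonly used metric for transcription tasks, but its use here is an assumption for illustration; it is not stated to be the measure adopted by Dillen et al., and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + cost,         # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, comparing a reference label reading "Quercus robur" with an output of "Quercus rohur" gives one substitution over 13 characters. Averaging such scores over a benchmark set would give one possible, if coarse, summary of transcription quality.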

Linking data to material samples

Footnotes

  1. We recognize that there is disagreement over whether ML can always be classified as AI, given that the application of ML techniques often requires extensive manual feature extraction in order to process data more effectively for analysis. In this regard, ML may be considered closer to statistical methods than to AI. For the purposes of this paper, where many of the data challenges are shared between existing methods of ML and AI sensu stricto, we will treat the two as a continuum of techniques where "AI" is the more encompassing and general term.
  2. In short, the existence of the data should be published, procedures for accessing the data should be available, sufficient metadata should be provided to allow the data to be understood and appropriately repurposed, and common formats and APIs should be used to facilitate the integration of different data sets.

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and updates to spelling and grammar (including to the title). In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; this version lists them in order of appearance, by design. Many of the footnotes in the original are URLs essentially acting as references; for this version, a majority of the original footnotes were turned into full citations, making the citation list longer and footnotes shorter. The original put examples of AI opportunities into Boxes 1–3; this version converted that content into inline paragraphs to keep the text flow smooth. In the original, the term "Global South" is used to describe "research undertaken outside of resource-intensive commercial sites"; this term is largely unwarranted as a synonym of "Third World," and for this version "low- and middle-income nations" is used in its place, guided by the authors' use of that phrasing in the prior paragraph on agricultural monitoring (originally Box 3). Tables 1 and 2 in the original are swapped in this version, based on order of mention.