Journal:Data management challenges for artificial intelligence in plant and agricultural research

Full article title	Data management challenges for artificial intelligence in plant and agricultural research
Journal	F1000Research
Author(s)	Williamson, Hugh F.; Brettschneider, Julia; Caccamo, Mario; Davey, Robert P.; Goble, Carole; Kersey, Paul J.; May, Sean; Morris, Richard J.; Ostler, Richard
Author affiliation(s)	University of Exeter, University of Warwick, National Research Institute of Brewing, Earlham Institute, University of Manchester, Royal Botanic Gardens, University of Nottingham, John Innes Centre, Rothamsted Research, Alan Turing Institute, University of Edinburgh
Primary contact	S dot Leonelli at exeter dot ac dot uk
Year published	2021
Volume and issue	10
Article #	324
DOI	10.12688/f1000research.52204.2
ISSN	2046-1402
Distribution license	Creative Commons Attribution 4.0 International
Website	https://f1000research.com/articles/10-324/v2
Download	https://f1000research.com/articles/10-324/v2/pdf (PDF)

This article is a work in progress and should be considered incomplete until this notice is removed.

Abstract

Artificial intelligence (AI) is increasingly used within plant science, yet it is far from being routinely and effectively implemented in this domain. Particularly relevant to the development of novel food and agricultural technologies is the development of validated, meaningful, and usable ways to integrate, compare, and visualize large, multi-dimensional datasets from different sources and scientific approaches. After a brief summary of the reasons for the interest in data science and AI within plant science, the paper identifies and discusses eight key challenges in data management that must be addressed to further unlock the potential of AI in crop and agronomic research, and particularly the application of machine learning (ML), which holds much promise for this domain.

Keywords: data science, plant science, crop science, agricultural research, machine learning, data management, data quality, data sharing

Introduction

Data science is central to the development of plant and agricultural research and its application to social and environmental problems of a global scale, such as food security, biodiversity, and climate change. Artificial intelligence (AI) offers great potential towards elucidating and managing the complexity of biological data, organisms, and systems. It constitutes a particularly promising approach for the plant sciences, which are marked by the distinctive challenge of understanding not only complex genotype-environment (GxE) interactions that span multiple scales from the cellular through the microbiome to climate systems, but also GxE interactions with rapidly shifting human management practices (GxExM) in agricultural and other settings, whose reliance on digital innovations is growing at a rapid pace. [Wang et al. 2020; Harfouche et al. 2019] Accordingly, examples of useful applications of AI—and particularly machine learning (ML)—to plant science contexts are increasing, with the COVID-19 pandemic crisis further accelerating interest in this approach. [King 2020]

Nevertheless, we are still far from a research landscape in which AI can be routinely and effectively implemented. A key obstacle concerns the development and implementation of effective and reliable data management strategies. Developing reliable and reproducible AI applications depends on having validated, meaningful, and usable ways to integrate large, multi-dimensional datasets from different sources and scientific approaches. This is especially relevant to the development of novel food and agricultural technologies, which rely on research from diverse fields including fundamental plant biology, crop research, conservation science, soil science, plant pathology, pest/pollinator ecology and management, water and land management, climate modelling, agronomy, and economics.

This paper explores data-related challenges to potential applications of AI in plant science, with particular attention paid to the analysis of GxExM interactions of relevance to crop science and agricultural implementations. It brings together the experiences of an interdisciplinary set of researchers from the plant and agricultural sciences, the engineering and computational sciences, and the social studies of science, all of whom are working with complex datasets spanning genomic, physiological, and environmental data and computational methods of analysis. The first part of the paper provides a brief overview of contemporary AI and data science applications within plant science, with particular attention paid to the UK and European landscape where the authors are based. The second part identifies and discusses eight challenges in data management that must be addressed to further unlock the potential of AI for plant science and agronomic research. We conclude with a reflection on how transdisciplinary and international collaborations on data management can foster impactful and socially responsible AI in this domain.

AI in plant research: Current status and challenges

Following wider trends in the biosciences, both basic and applied plant sciences have increasingly emphasized data-intensive modes of research over the last two decades. [Leonelli et al. 2017; Leonelli 2016, 2019] The capacity to measure biological complexity at the molecular, organismal, and environmental scales has increased dramatically, as demonstrated by [Tardieu et al. 2017]:

advances in high-throughput genomics and norms and tools that have supported the development of a commons of publicly shared genomic data;
the development of platforms for high-throughput plant phenotyping in the laboratory, the greenhouse, and the field; and
the proliferation of remote sensing devices on crop-growing fields.

Such platforms and associated data generation have contributed to a booming AI industry in commercial agriculture, focused on the delivery of “precision” farming strategies, with estimates that the market will be worth US $1.55 billion by 2025.^[1] Indeed, AI applications in plant research and agriculture have so far primarily benefited large-scale industrial farming [Carbonell 2016], with R&D investment focused on commodity crops such as wheat, rice, and maize; high-value horticulture crops such as soft fruits; and the enhancement of large-scale orchards and vineyards. Beyond these commercial uses, however, the amount and type of data being collected, alongside advancements in AI methods, offer the opportunity to ask and address new questions of great importance to plant scientists and agricultural stakeholders around the world. [Tsiligiridis & Ainali 2018]

AI is the field of study and development of computer hardware and software that perform functions, such as problem solving or learning, which have traditionally been considered properties of intelligent life. A range of research fields have contributed to the development of AI, currently the most prominent of which is ML, the design of algorithms for data processing, prediction, and decision support that are able to learn from a priori (“supervised”), inductive (“unsupervised”), and reward-based (“reinforcement”) experience. [Mitchell 1997]^[a] This approach is particularly significant for applications that do not require an exact understanding of how the algorithm has reached its decision, as long as it has predictive power and it is possible to reproduce it. [Napoletani et al. 2015]

ML has been the dominant AI technology applied to plant and agricultural research so far. Many successful examples come from bioinformatics, where researchers may not need to worry about why a sequence of amino acids was classified as alpha-helical in structure, for instance, as long as they know how reliable that prediction is. Indeed, ML has been widely used in the analysis of sequence data, for example to identify signal peptides and functional domains in amino-acid sequences via neural nets and profile hidden Markov models, such as Pfam and SMART [El-Gebali et al. 2019], as well as other classic examples. [Larrañaga et al. 2006] One key example from genomics that goes back to the 1990s is the use of models to identify genes and predict their functions based on training data from multiple species. [Hayes & Borodovsky 1998; Birney et al. 2004; Zou et al. 2019] This has ongoing relevance for orphan and non-model crop research, where experimental approaches such as CRISPR knockouts to identify and validate gene function for individual species may not be feasible or cost-effective, but results may be inferred from experiments in model species. [Zou et al. 2019] Other challenges in genomics that can be addressed include the inference of gene regulatory networks [Mochida et al. 2018] and the identification of pathogen virulence effector genes from genomic sequence data [Sperschneider 2019], for example. Thus, ML can help to identify correlations not readily picked up by more traditional approaches and in turn suggest fruitful directions for further research. To date, whether or not correlations have biological meaning typically needs to be ascertained via experiment and/or observational data. [Leonelli 2014; Smith & Cordes 2019] Efforts towards explainable AI are, however, gaining momentum, and both methodological and computational techniques are emerging which promise to support biological use of ML. [Schramowski et al. 2020]

Alongside applications in genomics, AI offers new opportunities for linking genotypes to phenotypes. [Wang et al. 2020] Image-based plant phenotyping has proven a particularly fertile area for the application of ML techniques, with the rapid development of non-destructive methods for the evaluation of plant responses to biotic and abiotic stress [Singh et al. 2016; Mohanty et al. 2016; Ramcharan et al. 2017] and estimation of photosynthetic capacity [Fu et al. 2019], as well as a variety of feature detection, counting, classification, and semantic segmentation tasks. [Jiang & Li 2020] With the arrival of deep supervised convolutional networks, the performance of ML algorithms in predicting leaf counts improved considerably. [Dobrescu et al. 2017] Convolutional neural networks (CNNs) were also shown to be capable of performing challenging tasks of point feature detection [Pound et al. 2017] and pixelwise segmentation [Yasrab et al. 2019; Soltaninejad et al. 2020] on both roots and shoots in a variety of imaging modalities in both laboratory and field environments. [Gao et al. 2020] These technologies present substantial new opportunities for analyzing and understanding GxExM interactions through the integration of high-throughput phenotyping data with other forms of research data, including genomic, field evaluation, and climatic data. As well as addressing fundamental research questions, AI applications in this area offer the opportunity to understand and improve a range of practical activities from crop breeding through agricultural management. Three such opportunities are discussed in more detail here, while Table 1 collates all ML and AI application examples discussed in this paper.

Table 1. Examples of machine learning (ML) and artificial intelligence (AI) applications in plant and agricultural science discussed in this paper, and methods used in those referenced papers.
Example	Section discussed	Key ML/AI methods used	Source(s)
Gene identification and function prediction across species	This section	Various; see citations in review paper	[Zou et al. 2019]
Inference of gene regulatory networks	This section	Bayesian networks, random forest, Markov random fields, tree-based models, dynamic factor graph models	[Mochida et al. 2018]
Identification of pathogen virulence effector genes from genomic sequence data	This section	Support vector machine, random forest, convolutional neural networks, ensemble learning, Bayesian networks, tree-based models	[Sperschneider 2019]
Non-destructive evaluation of plant responses to biotic and abiotic stress	This section	Support vector machine, artificial neural networks, convolutional neural networks	[Singh et al. 2016; Mohanty et al. 2016; Ramcharan et al. 2017]
Automatic estimation of photosynthetic capacity	This section	Artificial neural networks, support vector machine, least absolute shrinkage and selection operator (LASSO), random forest, Gaussian process regression	[Fu et al. 2019]
Convolutional neural networks for plant phenotyping image analysis	This section	Convolutional neural networks, support vector machine, random forest, encoder-decoder model, multi-loss multi-resolution network, deep residual network	[Jiang & Li 2020; Dobrescu et al. 2017; Pound et al. 2017; Yasrab et al. 2019; Soltaninejad et al. 2020]
Augmenting genomic selection models in plant breeding with machine learning	This section	Bayesian regularized neural networks, radial basis function neural networks, reproducing kernel Hilbert space, random forest regression	[Gonzalez-Camacho et al. 2018; Harfouche et al. 2019]
Prediction of soil characteristics from near-infrared and mid-infrared soil spectroscopy data	This section	Regularised linear models, support vector machines, tree-based models	[Data Study Group Team 2020]
Automatic identification of crop pest insects using bioacoustics data	This section	Support vector machines, random forest, randomized trees classifier, gradient boosting classifier	[Potamitis et al. 2015]
Automatic digitisation of herbaria specimens and specimen metadata	Next section	Convolutional neural networks	[Carranza-Rojas et al. 2017; Younis et al. 2018]
Leaf-counting models for plant phenotyping image analysis	This and next section	Multi-task learning, adversarial learning, layerwise relevance propagation, guided back propagation	[Dobrescu et al. 2017, 2019, 2020; Giuffrida et al. 2019]
Computer Vision Problems in Plant Phenotyping (CVPPP) workshops	Next section	Various; see citations in review paper	[Tsaftaris & Scharr 2019]
Image analysis for automatic disease diagnosis in multiple crops using PlantVillage Nuru	Next section	Convolutional neural networks	[Ramcharan et al. 2019]

One example of AI opportunities is found with genomic selection. Genomic selection (GS) is an approach for estimating breeding values for individual plants that can guide breeders’ decisions for selection and crossing [Crossa et al. 2017], based on modelling associations between quantitative traits and a genome-wide set of markers. Accuracy of predictive models for GS and rate of genetic gain can be increased by employing ML, although the utility of ML in comparison to existing statistical models varies depending on the characteristics of the trait of interest. [Gonzalez-Camacho et al. 2018] A promising opportunity for the improvement of GS lies in using ML for the integration and the analysis of data from different omics layers (e.g., proteomics, metabolomics, and metagenomics) that mediate between genotype and phenotype, facilitating the prediction of quantitative traits based on biological mechanisms rather than genetic marker associations and thereby increasing the reliability and utility of models for a wider range of populations than is currently possible. [Harfouche et al. 2019]
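The marker-trait modelling that underlies GS can be illustrated with a minimal sketch. It assumes ridge regression, a common statistical baseline that is swapped in here for illustration and is not necessarily the model used in the cited studies. It also assumes markers coded as 0/1/2 allele counts and phenotypes that are already centered. The function names `solve`, `ridge_fit`, and `predict_breeding_values`, and all numbers, are hypothetical.

```python
# Toy sketch of marker-based prediction with ridge regression.
# All names and numbers are illustrative assumptions, not taken from the cited studies.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Estimate marker effects beta = (X'X + lam*I)^-1 X'y; lam > 0 keeps the system solvable."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict_breeding_values(X, beta):
    """Predicted value of each individual = sum of its marker codes times the marker effects."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Invented toy data: 4 training plants x 3 markers, centered trait values.
train_markers = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 1]]
train_trait = [0.4, -0.3, 0.1, -0.2]
effects = ridge_fit(train_markers, train_trait, lam=1.0)

# Score two unphenotyped selection candidates from their genotypes alone.
candidates = [[1, 0, 2], [2, 2, 0]]
print(predict_breeding_values(candidates, effects))
```

The penalty `lam` shrinks all marker effects towards zero, which is what makes estimation feasible when markers outnumber phenotyped plants. ML methods such as those listed in Table 1 replace this linear model with more flexible ones.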

A second example concerns long-term experiments. Long-term experiments (LTE), where the same crop or crop rotation is grown for many years subject to a range of different management or treatment options, have an important place in agricultural research. Data from these experiments enable separation of agronomic and environmental (weather) influences on crop yield, as well as soil health over time, and have done much to influence modern farming practices. [Poulton et al. 2018; Jensen et al. 2020] The "Classical Experiments" at Rothamsted Research [Parolini 2015; Macdonald et al. 2018] are important examples. The data from these experiments, some of which were started in 1843, are available and documented in the Electronic Rothamsted Archive (e-RA) data resource. [Perryman et al. 2018] Data from LTEs continue to be the subject of new analytical methods [Addy et al. 2020], yet remain a relatively untapped resource for knowledge discovery, in part because of the complexity of the experimental designs and the difficulty in accounting properly for the changes that might have occurred during their lifespans. To make LTEs more accessible for knowledge discovery, a recent initiative was launched by the Global Long Term Experiment Network to catalogue LTEs using a standard metadata schema. The use of ML methods combining data from LTEs with local weather data might, for example, reveal hidden patterns in the data linked to long-term or higher order interactions within the data, which could provide useful insights into the impact of future climate change.
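Combining LTE records with local weather data is, at its simplest, a join on a shared key. The sketch below assumes yearly records and a hypothetical `PlotYield` record type and `join_on_year` helper. The field names and all values are invented for illustration and do not reflect the e-RA or any network's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotYield:
    """One yield observation from a long-term experiment (hypothetical fields)."""
    year: int
    treatment: str
    yield_t_ha: float


def join_on_year(yields, weather_by_year):
    """Attach weather covariates to each yield record.

    Returns (merged_rows, missing_years) so that gaps in the weather
    series are reported rather than silently dropped.
    """
    merged, missing = [], []
    for rec in yields:
        weather = weather_by_year.get(rec.year)
        if weather is None:
            missing.append(rec.year)
            continue
        merged.append({
            "year": rec.year,
            "treatment": rec.treatment,
            "yield_t_ha": rec.yield_t_ha,
            **weather,
        })
    return merged, sorted(set(missing))


# Invented toy data for illustration only.
yields = [
    PlotYield(2001, "control", 1.0),
    PlotYield(2001, "N1", 3.0),
    PlotYield(2002, "control", 1.2),
    PlotYield(2003, "N1", 3.5),
]
weather_by_year = {
    2001: {"rain_mm": 600.0, "mean_temp_c": 9.5},
    2002: {"rain_mm": 720.0, "mean_temp_c": 10.1},
}

merged, missing = join_on_year(yields, weather_by_year)
print(len(merged), "rows ready for modelling; years without weather data:", missing)
```

Returning the unmatched years explicitly is a deliberate choice: in experiments spanning many decades, changes in design or recording are easy to lose during integration, and a model trained on silently filtered data may mislead.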

Agricultural monitoring is a vital third example. AI offers many opportunities to improve the cost and labor efficiency of longstanding research and monitoring tasks in research and agricultural settings. While such possibilities are most developed in commercial agricultural settings, there are many opportunities too for the public research sector, as well as for small or non-commercial farmers, for example in agricultural settings where there is limited access to relevant scientific expertise. Take, for example, the assessment of soil health, a key driver of crop yields. Wet soil chemistry analyses are both expensive and time-consuming, and they are generally not accessible to growers in low- and middle-income countries. Using near-infrared (NIR) and mid-infrared (MIR) soil spectroscopy data, ML models that are faster and cheaper to run can be developed to predict soil characteristics and nutrient content. [Data Study Group Team 2020] Such models could be integrated with plant physiology models in the future to predict optimal crop performance in a given soil, and they open the possibility of developing hand-held soil devices for use directly by farmers or local advisors in countries where lab access and resources are limited. In another example of agricultural monitoring, conventional suction and light traps for monitoring the appearance and migration of airborne insects, including crop pests, currently require manual identification. Such methods can be augmented by ML models trained to recognize and classify insect species based on bioacoustics data [Potamitis et al. 2015], connected to in-field sonic sensors. Such developments are directed at increasing the scalability of insect pest monitoring networks and also potentially removing the need for manual steps for some insect species.
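The idea of classifying insects from acoustic features can be sketched with a deliberately simple nearest-centroid classifier. This technique is swapped in for brevity and is not among the methods used in the cited study (see Table 1 for those). The two features (a dominant frequency in Hz and a signal duration in seconds), the species labels, and all values are hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(samples):
    """Average the feature vectors of each labelled class.

    samples: list of (feature_vector, label) pairs.
    Returns a dict mapping label -> centroid (list of floats).
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Invented training recordings: [dominant frequency (Hz), duration (s)].
training = [
    ([100.0, 0.20], "species_a"),
    ([110.0, 0.30], "species_a"),
    ([200.0, 0.50], "species_b"),
    ([220.0, 0.70], "species_b"),
]
centroids = fit_centroids(training)

# Classify a new, unlabelled recording from an in-field sensor.
print(classify(centroids, [120.0, 0.30]))
```

In practice the features would need to be standardized first, since here the frequency axis dominates the distance, and performance would have to be checked against manually identified trap catches before any manual step is removed.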

While these three examples offer grounds for optimism about using AI and ML applications in plant and agricultural science, the effective implementation of these and similar methods depends in large measure on establishing a favorable data landscape, consisting of the networks and practices of sourcing, managing, and maintaining data. This is particularly important for research undertaken outside of resource-intensive commercial sites, including research in and for low- and middle-income countries. Identifying the primary challenges faced by users and would-be users of AI in the contemporary data landscape of plant science is necessary in order to understand the possibilities and limitations afforded by AI for public as well as private plant and agricultural research. Here we build on the experiences of leading UK-based researchers in these areas to identify and discuss eight key data challenges, summarized in Table 2. These challenges span technical, social, and governmental domains, and will require concerted international and transdisciplinary efforts from a range of stakeholders to address.

Table 2. Synoptic view of the data challenges of effectively implementing ML and AI in plant and agricultural science, possible solutions, and what can be lost and gained by investment in those areas.
Data challenges	Solutions	Risks	Payoff	Trade-offs
Heterogeneity of data types and sources in biology and agriculture	Implement FAIR (findability, accessibility, interoperability, and reusability) principles [Wilkinson et al. 2018] for all data types. Acknowledge and reward data sources.	Inconsistent standardization between domains and communities	New possibilities for multi-scale analysis integrating diverse data types	There are difficulties in implementing standards while retaining domain-specific insights.
Selection and digitization of data that is viable for AI applications	Provide clear and accessible guidance on data requirements for AI. Develop new procedures for priority setting and selecting data.	High labor costs of digitization and analysis on resources that may not prove to be significant	AI tools and outputs that push forward the cutting edge of plant science research	Data management procedures may take up a considerable budget and effort.
Ensuring sufficient linkage between biological materials and data used for AI applications	Have clear documentation of material provenance when producing data and throughout analytical workflows.	Increased documentation costs and exposure of commercially or otherwise sensitive materials	Clear understanding of the biological scope of AI tools	Analysis of documentation around materials requires specific expertise and effort.
Standardization and curation of data and related software to a level appropriate for AI applications	Develop and use shared semantic standards. Standardize data at the point of collection.	Potential to lose system-specific information that does not fit common standard	Reusable multi-source data sets and easier validation and sharing between groups	Some plant data (e.g., phenotypic observations) remain very difficult to standardize.
Obtaining training and adequate ground truth data for model validation and development	Ensure that data quality benchmarking is tailored to analytical purposes. Expand collections of ground truth and training datasets.	Data quality assessment requires error estimates and information on data collection, which are often lacking.	Reproducible and sound inferences with clear scope of validity	Tailoring data to specific research goals runs counter to the popular narrative of AI relying on "representative" training data and "generalizable" solutions.
Access to and use of computing and modeling platforms, and related expertise	Make software and models open and adaptable where appropriate, and/or have clear documentation on their scope. Provide researchers with full workflows, not just software.	Software used outside its range of proven usefulness and danger of extrapolation and overfitting	A suite of tools with clearly marked utility and relevance for a wide range of analytical tasks in the plant sciences	There are difficulties in getting the required know-how to travel together with software and models.
Improving responsible data access	Open access to datasets held by government and research institutions. Implement data governance regimes to protect sensitive data and ensure benefit sharing.	“Digital feudalism”; unequal distribution of benefits from public or personal data	Greater data resources of direct relevance to agricultural and other plant science applications	There are ongoing difficulties in identifying and implementing non-exploitative, equitable models for data sharing.
Engagement across plant scientists, data scientists, and other stakeholders	Invest in and promote data services for plant scientists. Additionally, promote plant science problems, especially GxE interactions, to ML researchers. Identify and invest in grand challenges and engagement.	High cost with potentially limited impact unless closely targeted to needs and interests of researchers and wider stakeholders	Greater community participation in the development of ML as a resource for plant science	There is long-term investment involved, and its value depends on active and regular engagement of stakeholders.

In the remainder of the paper, we review these challenges in detail, drawing on a range of examples from fundamental and translational plant science. Several of the challenges are shared with the biosciences more broadly, reflecting the conditions and complexity of biological research, while others are specific to plant science and agriculture. In the conclusion, we offer some reflections on how these challenges could be overcome.

Data challenges

Data diversity and continuing obstacles to data sharing

Footnotes

↑ We recognise that there is disagreement over whether ML can always be classified as AI, given that the application of ML techniques often requires extensive manual feature extraction in order to process data more effectively for analysis. In this regard, ML may be considered closer to statistical methods than to AI. For the purposes of this paper, where many of the data challenges are shared between existing methods of ML and AI sensu stricto, we will treat the two as a continuum of techniques where "AI" is the more encompassing and general term.

References

↑ Market Reports World (26 March 2019). "Global Artificial Intelligence (AI) in Agriculture Market Size, Status and Forecast 2019–2025". https://www.marketreportsworld.com/global-artificial-intelligence-ai-in-agriculture-market-13268433.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and updates to spelling and grammar (including to the title). In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; this version lists them in order of appearance, by design. Many of the footnotes in the original are URLs essentially acting as references; for this version, a majority of the original footnotes were turned into full citations, making the citation list longer and footnotes shorter. The original put examples of AI opportunities into Boxes 1–3; this version converted that content into inline paragraphs to keep the text flow smooth. In the original, the term "Global South" is used to describe "research undertaken outside of resource-intensive commercial sites"; this term is largely unwarranted as a synonym of "Third World," and for this version "low- and middle-income countries" is used in its place, guided by the authors' use of that phrasing in the prior paragraph on agricultural monitoring (originally Box 3). Tables 1 and 2 in the original are swapped in this version, based on order of mention.