Journal:Judgements of research co-created by generative AI: Experimental evidence

Full article title Judgements of research co-created by generative AI: Experimental evidence
Journal Economics and Business Review
Author(s) Niszczota, Paweł; Conway, Paul
Author affiliation(s) Poznań University of Economics and Business, University of Southampton
Primary contact Email: pawel dot niszczota at ue dot poznan dot pl
Year published 2023
Volume and issue 9(2)
Page(s) 101–114
DOI 10.18559/ebr.2023.2.744
ISSN 2450-0097
Distribution license Creative Commons Attribution 4.0 International
Website https://journals.ue.poznan.pl/ebr/article/view/744
Download https://journals.ue.poznan.pl/ebr/article/view/744/569 (PDF)

Abstract

The introduction of ChatGPT has fuelled a public debate on the appropriateness of using generative artificial intelligence (AI) (large language models or LLMs) in work, including a debate on how they might be used (and abused) by researchers. In the current work, we test whether delegating parts of the research process to LLMs leads people to distrust researchers and devalue their scientific work. Participants (N = 402) considered a researcher who delegates elements of the research process to a PhD student or LLM and rated three aspects of such delegation. Firstly, they rated whether it is morally appropriate to do so. Secondly, they judged whether they would trust the scientist who decided to delegate to oversee future projects. Thirdly, they rated the expected accuracy and quality of the output from the delegated research process. Our results show that people judged delegating to an LLM as less morally acceptable than delegating to a human (d = –0.78). Delegation to an LLM also decreased trust in the scientist to oversee future research projects (d = –0.80), and people thought the results would be less accurate and of lower quality (d = –0.85). We discuss how this devaluation might translate into the underreporting of generative AI use.

Keywords: trust in science, metascience, ChatGPT, GPT, large language models, generative AI, experiment

Introduction

The introduction of ChatGPT appears to have become a tipping point for large language models (LLMs). It is expected that LLMs—such as those released by OpenAI (i.e., ChatGPT and GPT-4) [OpenAI, 2022, 2023], but also those from major technology firms such as Google and Meta—will impact the work of many white-collar professions. [Alper & Yilmaz, 2020; Eloundou et al., 2023; Korzynski et al., 2023] This impact extends to top academic journals such as Nature and Science, which have already acknowledged the impact artificial intelligence (AI) has on the scientific profession and have started setting out guidelines on how to use LLMs. [Thorp, 2023; ‘Tools Such as ChatGPT Threaten Transparent Science; Here Are Our Ground Rules for Their Use’, 2023] For example, listing ChatGPT as a co-author was deemed inappropriate. [Stokel-Walker, 2023; Thorp, 2023] However, the use of such models is not explicitly forbidden; rather, it is suggested that researchers report which parts of the research process ChatGPT assisted with.

Important questions remain regarding how scientists employing LLMs in their work are perceived by society. [Dwivedi et al., 2023] Do people view the use of LLMs as diminishing the importance, value, and worth of scientific efforts, and if so, which elements of the scientific process does LLM usage most impact? We examine these questions with a study on the perceptions of scientists who rely on an LLM for various aspects of the scientific process.

We anticipated that, overall, people would view the delegation of aspects of the research process to an LLM as morally worse than delegating to a human, and that doing so would reduce trust in the delegating scientist. Moreover, insofar as people view creativity as a core human trait, especially in comparison to AI [Cha et al., 2020], and some aspects of the research process may entail more creativity than others—such as idea generation and prior literature synthesis [King, 2023], compared to data identification and preparation, testing framework determination and implementation, or results analysis—we tested the exploratory prediction that the effect of delegation to AI versus a human on moral ratings and trust might differ across these aspects.

We contribute to an emerging literature exploring how large language models can assist research in economics and financial economics. The reader can find a valuable discussion on the use of LLMs in economic research in Korinek [2023] and Wach et al. [2023]. A noteworthy empirical study is Dowling and Lucey [2023], who asked finance academics to rate ChatGPT-generated research ideas on cryptocurrency; the academics judged the output to be of fair quality.

Research questions

We ask two research questions concerning laypeople’s perception of the use of LLMs in science. First, we tested the hypothesis that people perceive research assistance from LLMs less favorably than the very same assistance from a junior human researcher. In both cases, we assume that the assistance is minor enough not to warrant co-authorship. This levels the playing field for human and AI assistance, as prominent journals have already stated that LLMs cannot be listed as co-authors [Thorp, 2023], as had been done in some papers. [Kung et al., 2022]

Second, we examined for which aspects of the research process the prospective human-AI disparities are strongest. If, as we hypothesize, delegating to AI is perceived less favorably, then delegating those aspects to AI will have the greatest potential to devalue work done by scientists.

Participants

To assess the consequences of delegating research processes to LLMs, 441 participants were recruited from Prolific. [Palan & Schitter, 2018] Prolific is an online crowdsourcing platform used to collect primary data from humans, including experimental data. [Peer et al., 2017] For a long time, Amazon Mechanical Turk was the dominant online labor market, i.e., a marketplace where individuals can complete tasks—such as participating in a research study—for compensation. [Buhrmester et al., 2011] However, our experience, as well as some research, has shown that data gathered using Prolific is superior [Peer et al., 2022], and thus we decided to use this platform. To further ensure a high quality of data and a relatively homogenous sample, we recruited participants who had a 98% or higher approval rating, were located and born in the United States, and whose first language was English. As preregistered, thirty-nine participants who did not correctly answer both attention check questions were excluded, leaving a final sample size of 402 (48.3% female, 49.8% male, and 1.9% selected non-binary or did not disclose). The mean age of participants was 42.0 years (SD = 13.9). Of the final sample, 97.5% had heard of ChatGPT, and 38.1% had interacted with it.
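
As a minimal sketch of the exclusion step, the base-R filter below reproduces the 441 to 402 reduction under the assumption of a raw data file and attention-check columns with hypothetical names (data_raw.csv, attn_check_1, attn_check_2, age); the actual variable names are documented in the shared OSF data file.

```r
# Sketch only: the file name and column names are hypothetical placeholders;
# see the OSF data file for the actual variables.
raw <- read.csv("data_raw.csv")

# Keep only participants who answered both attention checks correctly.
analysed <- subset(raw, attn_check_1 == "correct" & attn_check_2 == "correct")

nrow(raw)                                # expected: 441 recruited participants
nrow(analysed)                           # expected: 402 after excluding 39
c(mean(analysed$age), sd(analysed$age))  # reported: M = 42.0, SD = 13.9
```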

The study was pre-registered at https://aspredicted.org/GVL_MR5. Data and materials are available at https://osf.io/fsavc/. The data file includes a short description of all variables used in the analysis.

Experimental design

We conducted a mixed-design experiment. We randomly allocated participants to one of two between-subjects conditions. Participants rated a distinguished senior researcher who delegated a part of the research process either to another person (specifically, a PhD student with two years’ experience in the area; the human condition) or to an LLM such as ChatGPT (the LLM condition). Each participant rated the effect of such delegation for each of the five parts of the research process discussed in Cargill and O’Connor [2021]: idea generation, prior literature synthesis, data identification and preparation, testing framework determination and implementation, and results analysis. Notably, Dowling and Lucey [2023] used all of these except results analysis to assess the quality of ChatGPT’s output. We rephrased the last two research processes for clarity.

For each research process, the participants rated the extent to which they agreed with three items, on a Likert scale of 1 (strongly disagree) to 7 (strongly agree):

  • I think that it is morally acceptable for a scientist to delegate—in such a scenario—the following part of the research process (after giving credit in the acknowledgments);
  • I think that a scientist that delegated the part of the research process shown below should be trusted to oversee future research projects; and
  • I think that delegating this part of the research process will produce correct output and stand up to scientific scrutiny (e.g., results would be robust, reliable, and correctly interpreted).

We expected the first two items to correlate with one another but not necessarily with the third. While people might acknowledge that AI can be better than humans at some tasks, they often exhibit an aversion toward the use of algorithms. [Dietvorst et al., 2015]

Given that each participant rated three different items for five different research processes, we obtained fifteen data points per participant. The main analysis (see Table 2, in the next section) is performed at several levels: on the pooled dataset (with 15 data points per participant), and separately for (1) each of the three items and (2) each of the five research processes.
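
The resulting data can be pictured as a long-format table with one row per combination of participant, item, and research process; the sketch below (with illustrative variable names, not those of the shared data file) shows why the pooled model in Table 2 has 6,030 observations.

```r
# Illustrative long format: 402 participants x 3 items x 5 processes = 6,030 rows.
# Variable names are placeholders, not those of the OSF data file.
design <- expand.grid(
  id      = 1:402,
  item    = c("moral acceptability", "trust", "correctness"),
  process = c("idea generation",
              "prior literature synthesis",
              "data identification and preparation",
              "testing and interpreting the theoretical framework",
              "statistical result analysis")
)
nrow(design)  # 6030, matching N for the pooled model in Table 2

# The delegation condition (LLM vs. PhD student) is between-subjects,
# so it is constant within each id; the allocation below is a placeholder.
design$llm <- as.integer(design$id %% 2 == 0)
```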

Results

Preliminary analysis

Prior to presenting the regression results, we examined how answers correlated with each other. As expected, moral acceptability ratings (with correlations based on mean ratings from the five research processes) correlated highly with trust to oversee future projects, r = 0.81, p < 0.001. However, moral acceptability ratings also correlated highly with accuracy ratings, r = 0.81, p < 0.001. Similarly, trust ratings correlated highly with accuracy ratings, r = 0.80, p < 0.001.
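
These correlations are computed on participant-level means (each item averaged over the five research processes). A sketch, assuming a long data frame named ratings with hypothetical columns id, item, and rating, is:

```r
# Sketch, assuming a long data frame `ratings` with hypothetical columns
# id, item ("moral", "trust", "accuracy"), and rating (1-7).
library(dplyr)
library(tidyr)

item_means <- ratings %>%
  group_by(id, item) %>%
  summarise(m = mean(rating), .groups = "drop") %>%   # mean over the 5 processes
  pivot_wider(names_from = item, values_from = m)

cor.test(item_means$moral, item_means$trust)     # reported: r = 0.81, p < 0.001
cor.test(item_means$moral, item_means$accuracy)  # reported: r = 0.81, p < 0.001
cor.test(item_means$trust, item_means$accuracy)  # reported: r = 0.80, p < 0.001
```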

However, it remains possible that the relationship between such perceptions was weaker when the scientist delegated to an LLM instead of a human. To determine this, we conducted a regression analysis treating one item as the dependent variable and another as the independent variable, and added an interaction with a dummy variable for the delegation condition. The results, presented in Table 1, suggest that the strength of the relationship between moral acceptability, trust, and accuracy either becomes stronger when delegating to an LLM (rather than a human) or is not statistically different. Therefore, people evaluated moral acceptability, trust, and accuracy in a similar manner in each condition.
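
Each model in Table 1 therefore takes the general form sketched below, with standardized participant-level item means (as stated in the note to Table 1) and a 0/1 dummy for the delegation condition; the variable names are placeholders.

```r
# Moderation sketch for Table 1: one item as the outcome, another as the
# predictor, plus an interaction with the delegation dummy
# (llm: 1 = delegated to the LLM, 0 = delegated to the PhD student).
# `item_means` is assumed to hold participant-level item means that have
# already been standardized, as described in the note to Table 1.
fit <- lm(moral ~ trust * llm, data = item_means)
summary(fit)  # the trust:llm term tests whether the slope differs by condition
```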

Table 1. The interrelationship between ratings of three items (moral acceptability, trust to oversee, and accuracy). Notes: Ratings are means for five research processes. Moral acceptability, trust, and accuracy scores are standardized to facilitate the interpretability of the coefficient for LLM (which corresponds to the effect of delegating to the LLM [relative to the human] when trust or accuracy is at its mean level). * p < 0.05; ** p < 0.01; *** p < 0.001.
Standard errors are given in parentheses.

|                        | Moral acceptability | Trust          | Accuracy       |
|------------------------|---------------------|----------------|----------------|
| (Intercept)            | 0.12* (0.05)        | 0.11* (0.05)   | 0.10 (0.05)    |
| Trust                  | 0.68*** (0.06)      |                |                |
| LLM (1 = yes, 0 = no)  | –0.17** (0.06)      | –0.15* (0.06)  | –0.16* (0.07)  |
| Trust ∙ LLM            | 0.14* (0.07)        |                |                |
| Accuracy               |                     | 0.67*** (0.06) | 0.72*** (0.06) |
| Accuracy ∙ LLM         |                     | 0.16* (0.07)   | 0.07 (0.07)    |
| N                      | 402                 | 402            | 402            |
| R² adjusted            | 0.660               | 0.656          | 0.636          |

Pre-registered analysis

We present the results of the pre-registered analysis in Table 2. Consistent with the hypothesis, people rated delegating the research process to an LLM as less morally acceptable and reported lower trust in this scientist to oversee future research projects. Moreover, people also rated delegating to an LLM as producing less correct output. The effect of delegating to an LLM (relative to delegating the same tasks to a PhD student) was similar for all three items, and thus results from the combined dataset (“All items and processes”) can serve as a benchmark for future studies.
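
The models summarized in Table 2 are linear mixed-effects models with a random intercept per participant, estimated with the lme4 and lmerTest packages named in the table note. A minimal sketch of the pooled specification, with placeholder variable names (the exact covariate set varies by column), is:

```r
# Minimal sketch of the mixed-effects specification behind Table 2:
# fixed effects for the delegation condition, item, and research process
# dummies, plus a random intercept per participant. Variable names are
# placeholders; the covariates included differ across the table's columns.
library(lme4)
library(lmerTest)

fit <- lmer(rating ~ llm + item + process + gender + (1 | id), data = ratings)
summary(fit)  # with lmerTest loaded, fixed effects are reported with p-values
```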

For readers accustomed to Cohen’s d [Cohen, 1988], the effect sizes (and 95% confidence intervals) of delegating to an LLM instead of a human were large: d = –0.78 [–0.99, –0.58] for moral acceptability, d = –0.80 [–1.00, –0.60] for trust, and d = –0.85 [–1.06, –0.65] for accuracy.
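
As a reference for readers computing such effect sizes themselves, Cohen's d for this between-subjects contrast is the difference in condition means divided by the pooled standard deviation; a small sketch (operating on hypothetical vectors of participant-level means) is:

```r
# Cohen's d for a between-subjects contrast: mean difference divided by the
# pooled standard deviation. `x` and `y` would be vectors of participant-level
# mean ratings in the LLM and human conditions (hypothetical inputs).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}
# Example call: cohens_d(moral_llm, moral_human)  # reported: d = -0.78
```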

Table 2. Perceptions of delegating parts of the research process. Notes: Linear mixed effect models were estimated using R packages lme4 [Bates et al., 2015] and lmerTest. [Kuznetsova et al., 2017] The baseline values are “Moral acceptability” for Item, “Idea generation” for Research process, and "female" for gender. All variables bar age and gender are dummy variables, taking the value "1" if the variable is equal to what the variable’s name implies, and "0" otherwise. * p < 0.05; ** p < 0.01; *** p < 0.001.
Each column is a separate model: the first pools all items and research processes, the next three analyze each item separately, and the last five analyze each research process separately. Standard errors are given in parentheses.

| | All items and processes | Item: Moral acceptability | Item: Trust | Item: Correctness | Process: Idea generation | Process: Prior literature synthesis | Process: Data identification and preparation | Process: Testing and interpreting the theoretical framework | Process: Statistical result analysis |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5.34*** (0.09) | 4.92*** (0.46) | 4.90*** (0.50) | 5.16*** (0.46) | 4.83*** (0.53) | 5.24*** (0.48) | 5.34*** (0.48) | 5.52*** (0.53) | 4.67*** (0.51) |
| LLM | –1.07*** (0.12) | –1.01*** (0.13) | –1.13*** (0.14) | –1.09*** (0.13) | –0.89*** (0.15) | –1.12*** (0.14) | –1.23*** (0.13) | –1.30*** (0.15) | –0.85*** (0.14) |
| Item = Correctness | –0.16*** (0.03) | | | | –0.08 (0.06) | –0.16** (0.06) | –0.25*** (0.06) | –0.07 (0.06) | –0.22*** (0.06) |
| Item = Trust | 0.07* (0.03) | | | | 0.11 (0.06) | 0.04 (0.06) | –0.05 (0.06) | 0.20*** (0.06) | 0.02 (0.06) |
| Research process = Prior literature synthesis | 0.21*** (0.04) | 0.26*** (0.08) | 0.19** (0.06) | 0.18** (0.07) | | | | | |
| Research process = Data identification and preparation | 0.28*** (0.04) | 0.39*** (0.08) | 0.23*** (0.06) | 0.22** (0.07) | | | | | |
| Research process = Testing and interpreting the theoretical framework | –0.11** (0.04) | –0.15 (0.08) | –0.05 (0.06) | –0.14* (0.07) | | | | | |
| Research process = Statistical result analysis | 0.11* (0.04) | 0.18* (0.08) | 0.09 (0.06) | 0.05 (0.07) | | | | | |
| Age | | 0.01 (0.00) | 0.01 (0.01) | 0.01 (0.00) | 0.00 (0.01) | 0.01 (0.00) | 0.01* (0.00) | 0.00 (0.01) | 0.02** (0.01) |
| Heard of ChatGPT | | 0.01 (0.42) | 0.14 (0.45) | –0.23 (0.41) | 0.26 (0.48) | 0.09 (0.44) | –0.10 (0.43) | –0.38 (0.48) | 0.00 (0.47) |
| Interacted with ChatGPT | | 0.02 (0.13) | 0.06 (0.14) | –0.12 (0.13) | 0.26 (0.16) | –0.00 (0.14) | –0.01 (0.14) | –0.12 (0.16) | –0.19 (0.15) |
| Gender | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Random effects | | | | | | | | | |
| σ² | 1.12 | 1.20 | 0.77 | 0.93 | 0.79 | 0.64 | 0.64 | 0.70 | 0.71 |
| τ00 (participant id) | 1.40 | 1.41 | 1.78 | 1.45 | 1.96 | 1.62 | 1.58 | 1.99 | 1.84 |
| ICC | 0.56 | 0.54 | 0.70 | 0.61 | 0.71 | 0.72 | 0.71 | 0.74 | 0.72 |
| N (participants) | 402 | 402 | 402 | 402 | 402 | 402 | 402 | 402 | 402 |
| N | 6030 | 2010 | 2010 | 2010 | 1206 | 1206 | 1206 | 1206 | 1206 |
| Marginal R² / Conditional R² | 0.111 / 0.606 | 0.106 / 0.589 | 0.119 / 0.733 | 0.127 / 0.660 | 0.078 / 0.736 | 0.131 / 0.752 | 0.162 / 0.757 | 0.151 / 0.779 | 0.096 / 0.748 |

Acknowledgements

Funding

This research was supported by grant 2021/42/E/HS4/00289 from the National Science Centre, Poland.

Conflict of interest

None stated.

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added. The original lists references in alphabetical order; this version lists them in order of appearance, by design.