==Sandbox begins below==


{{Infobox journal article
|name        =
|image        =
|alt          = <!-- Alternative text for images -->
|caption      =
|title_full  = Application of text analytics to extract and analyze material–application pairs from a large scientific corpus
|journal      = ''Frontiers in Research Metrics and Analytics''
|authors      = Kalathil, Nikhil; Byrnes, John J.; Randazzese, Lucien; Hartnett, Daragh P.; Freyman, Christina A.
|affiliations = Center for Innovation Strategy and Policy and the Artificial Intelligence Center, SRI International
|contact      = Email: christina dot freyman at sri dot com
|editors      = Boyack, Kevin
|pub_year    = 2018
|vol_iss      = '''2'''
|pages        = 15
|doi          = [https://doi.org/10.3389/frma.2017.00015 10.3389/frma.2017.00015]
|issn        = 2504-0537
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]
|website      = [https://www.frontiersin.org/articles/10.3389/frma.2017.00015/full https://www.frontiersin.org/articles/10.3389/frma.2017.00015/full]
|download    = [https://www.frontiersin.org/articles/10.3389/frma.2017.00015/pdf https://www.frontiersin.org/articles/10.3389/frma.2017.00015/pdf] (PDF)
}}
{{ombox
| type      = content
| style    = width: 500px;
| text      = This article should not be considered complete until this message box has been removed. This is a work in progress.
}}
==Abstract==
When assessing the importance of materials (or other components) to a given set of applications, machine analysis of a very large corpus of scientific abstracts can provide an analyst with a base of insights to develop further. The use of text analytics reduces the time required to conduct an evaluation, while allowing analysts to experiment with a multitude of different hypotheses. Because the scope and quantity of [[metadata]] analyzed can, and should, be large, any divergence between what a human analyst determines and what the text analysis shows prompts the human analyst to reassess any preliminary findings. In this work, we have successfully extracted material–application pairs and ranked them on their importance. This method provides a novel way to map scientific advances in a particular material to the application for which it is used. Approximately 438,000 titles and abstracts of scientific papers published from 1992 to 2011 were used to examine 16 materials. This analysis used coclustering text analysis to associate individual materials with specific clean energy applications, evaluate the importance of materials to specific applications, and assess their importance to clean energy overall. Our analysis reproduced the judgments of experts in assigning material importance to applications. The validated methods were then used to map the replacement of one material with another material in a specific application (batteries).


'''Keywords''': machine learning classification, science policy, coclustering, text analytics, critical materials, big data
 
==Introduction==
Scientific research and technological development are inherently combinatorial practices.<ref name="ArthurTheNature09">{{cite book |title=The Nature of Technology: What It Is and How It Evolves |author=Arthur, W.B. |publisher=Simon and Schuster |page=256 |year=2009 |isbn=9781439165782}}</ref> Researchers draw from, and build on, existing work in advancing the state of the art. Increasing the ability of researchers to review and understand previous research can stimulate and accelerate scientific progress. However, the number of scientific publications grows exponentially every year, both in the aggregate and within individual fields.<ref name="NSBScience16">{{cite web |url=https://www.nsf.gov/statistics/2016/nsb20161/uploads/1/nsb20161.pdf |format=PDF |title=Science and Engineering Indicators 2016 |author=National Science Board |publisher=National Science Foundation |pages=899 |date=11 January 2016}}</ref> It is impossible for any single researcher or organization to keep up with this vast body of new scientific publications. The ability to use text analytics to map the current state of the art and detect progress would enable more efficient analysis of these data.
 
The Intelligence Advanced Research Projects Activity recognized the scale problem in 2011, creating the research program Foresight and Understanding from Scientific Exposition. Under this program, SRI and other performers processed “the massive, multi-discipline, growing, noisy, and multilingual body of scientific and patent literature from around the world and automatically generated and prioritized technical terms within emerging technical areas, nominated those that exhibit technical emergence, and provided compelling evidence for the emergence.”<ref name="DNIIARPA11">{{cite web |url=https://www.dni.gov/index.php/newsroom/press-releases/press-releases-2011/item/327-iarpa-launches-new-program-to-enable-the-rapid-discovery-of-emerging-technical-capabilities |title=IARPA Launches New Program to Enable the Rapid Discovery of Emerging Technical Capabilities |publisher=Office of the Director of National Intelligence |date=27 September 2011}}</ref> The work presented here applies and extends that platform to efficiently identify and describe the past and present evolution of research on a given set of materials. It demonstrates how these computational tools allow analysts to analyze much larger sets of data and to develop more iterative and adaptive material assessments, better informing government and industry research strategy and resource allocation.
 
==Materials==
===Ground truth===
The Department of Energy (DOE) has a specific interest in critical materials related to the energy economy. The DOE identifies critical materials through analysis of their use (demand) and supply. The approach balances an analysis of market dynamics (the vulnerability of materials to economic, geopolitical, and natural supply shocks) with technological analysis (the reliance of certain technologies on various materials). The DOE's R&D agenda is directly informed by assessments of material criticality. The DOE, the National Research Council, and the European Economic and Social Committee have all articulated a need for better measurements of material criticality. However, criticality depends on a multitude of different factors, including socioeconomic factors.<ref name="PoultonState13">{{cite journal |title=State of the World's Nonfuel Mineral Resources: Supply, Demand, and Socio-Institutional Fundamentals |journal=Annual Review of Environment and Resources |author=Poulton, M.M.; Jagers, S.C.; Linde, S. et al. |volume=38 |pages=345–371 |year=2013 |doi=10.1146/annurev-environ-022310-094734}}</ref> Various organizations across the world define resource criticality according to their own independent metrics and methodologies, and designations of criticality tend to vary dramatically.<ref name="PoultonState13" /><ref name="CommitteeMinerals08">{{cite book |url=https://www.nap.edu/catalog/12034/minerals-critical-minerals-and-the-us-economy |title=Minerals, Critical Minerals, and the U.S. Economy |author=Committee on Critical Mineral Impacts on the U.S. Economy |publisher=National Academies Press |pages=262 |year=2008 |isbn=9780309112826 |doi=10.17226/12034}}</ref><ref name="CommitteeManaging08">{{cite book |url=https://www.nap.edu/catalog/12028/managing-materials-for-a-twenty-first-century-military |title=Managing Materials for a Twenty-first Century Military |author=Committee on Assessing the Need for a Defense Stockpile |publisher=National Academies Press |pages=206 |year=2008 |isbn=9780309177924 |doi=10.17226/12028}}</ref><ref name="ErdmannCritic11">{{cite journal |title=Criticality of non-fuel minerals: A review of major approaches and analyses |journal=Environmental Science and Technology |author=Erdmann, L.; Graedel, T.E. |volume=45 |issue=18 |pages=7620–30 |year=2011 |doi=10.1021/es200563g |pmid=21834560}}</ref><ref name="EurLex52011PC0025">{{cite web |url=https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52011DC0025 |title=Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions Tackling the Challenges in Commodity Markets and on Raw Materials |work=Eur-Lex |publisher=European Union |date=02 February 2011}}</ref><ref name="ECReport14">{{cite web |url=https://ec.europa.eu/docsroom/documents/10010/attachments/1/translations/en/renditions/pdf |format=PDF |title=Report on Critical Raw Materials for the E.U.: Report of the Ad hoc Working Group on defining critical raw materials |publisher=European Commission |pages=41 |date=May 2014}}</ref><ref name="GraedelCritic15">{{cite journal |title=Criticality of metals and metalloids |journal=Proceedings of the National Academy of Sciences of the United States of America |author=Graedel, T.E.; Harper, E.M.; Nassar, N.T. et al. |volume=112 |issue=14 |pages=4257-62 |year=2015 |doi=10.1073/pnas.1500415112 |pmid=25831527 |pmc=PMC4394315}}</ref>
 
Experts tasked with assessing the role of materials must make decisions about what materials to focus on, what applications to review, what data sources to consult, and what analyses to pursue.<ref name="GraedelCritic15" /> The amount of data available to assess is vast and far too large for any single analyst or organization to address comprehensively. In addition, to the best of our knowledge, previous assessments of material criticality have not involved a comprehensive review of scientific research on material use. [Graedel and colleagues have published extensively using raw data on supply and other indicators to measure criticality, see Graedel ''et al.'' (2012<ref name="GraedelMethod12">{{cite journal |title=Methodology of metal criticality determination |journal=Environmental Science and Technology |author=Graedel, T.E.; Barr, R.; Chandler, C. et al. |volume=46 |issue=2 |pages=1063–70 |year=2012 |doi=10.1021/es203534z |pmid=22191617}}</ref>, 2015<ref name="GraedelCritic15" />) and Panousi ''et al.'' (2016<ref name="PanousiCritic15">{{cite journal |title=Criticality of Seven Specialty Metals |journal=Journal of Industrial Ecology |author=Panousi, S.; Harper, E.M.; Nuss, P. et al. |volume=20 |issue=4 |pages=837-853 |year=2016 |doi=10.1111/jiec.12295}}</ref>) and the references contained within.] Recent developments in text analytic computational approaches present a unique opportunity to develop new analytic approaches for assessing material criticality in a comprehensive, replicable, iterative manner.
 
The Department of Energy’s 2011 Critical Materials Strategy (CMS) Report uses importance to clean energy as one dimension of the criticality matrix (see Figure 1).<ref name="DOECritical11">{{cite web |url=https://www.energy.gov/sites/prod/files/DOE_CMS2011_FINAL_Full.pdf |format=PDF |title=Critical Materials Strategy |author=Office of Policy and International Affairs |publisher=U.S. Department of Energy |date=December 2011}}</ref> In this regard, the DOE report serves as a form of ground truth for the validation of our technique, though the DOE report considered supply risk as the second dimension of criticality, which the analysis described in this paper does not address.
 
 
[[File:Fig1 Kalathil FrontInResMetAnal2018 2.jpg|500px]]
{{clear}}
{| 
  | STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="500px"
  |-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' 2011 Critical Materials Importance Analysis matrix, published by experts at the Department of Energy. This matrix served as ground truth for validation.</blockquote>
|-
|}
|}
 
===Scientific publications===
Data on scientific research articles were obtained from the Web of Science (WoS) database available from Thomson Reuters (now Clarivate Analytics). WoS contains metadata records for published research articles. In principle, we could have analyzed this entire database; however, for budget-related reasons, the document set was limited by a topic search of keywords appearing in a document's title, abstract, author-provided keywords, and WoS-added keywords for the following:
 
* the 16 materials listed in the 2011 CMS, or
 
* the 285 unique alloys/composites of the 16 critical materials.
 
The document set was also limited to articles published between 1992 and 2011, the 20-year period leading up to the DOE's most recent critical material assessment.
 
The 16 materials listed in the 2011 CMS include europium, terbium, yttrium, dysprosium, neodymium, cerium, tellurium, lanthanum, indium, lithium, gallium, praseodymium, nickel, manganese, cobalt, and samarium. Excluded from our document set was any publication appearing in 80 fields considered unlikely to cover research in scope (e.g., fields in the social sciences, biological sciences, etc.). We used the 16 materials listed above because we were interested in validating a methodology against the 2011 CMS report, and these are the materials mentioned therein. The resulting data set consisted of approximately 438,000 abstracts of scientific papers published from 1992 to 2011.
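To make the selection criteria concrete, the sketch below shows how such a topic filter might be applied to bibliographic records. It is illustrative only: the record layout (hypothetical <code>year</code>, <code>field</code>, <code>title</code>, <code>abstract</code>, and <code>keywords</code> fields) is a simplification, not the actual WoS export format.

<syntaxhighlight lang="python">
def in_scope(record, materials, excluded_fields, start=1992, end=2011):
    """Keep a record if its title/abstract/keywords mention a target material,
    it falls within the study window, and its field is in scope."""
    if not (start <= record["year"] <= end):
        return False
    if record["field"] in excluded_fields:
        return False
    topic = " ".join([record["title"], record["abstract"],
                      " ".join(record["keywords"])]).lower()
    return any(m in topic for m in materials)

# Hypothetical record; real WoS exports use a different schema.
record = {"year": 2005, "field": "Materials Science",
          "title": "Lithium intercalation in layered oxides",
          "abstract": "We study lithium cobalt oxide cathodes.",
          "keywords": ["batteries"]}
print(in_scope(record, ["lithium", "cobalt"], {"Sociology"}))  # True
</syntaxhighlight>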
 
==Methods==
===Text analytics and coclustering===
The principle behind coclustering is the statistical analysis of the occurrences of terms in the text. This includes the processing of the relationships both between terms and neighboring (or nearby) terms, and between terms and the documents in which they occur. The approach presented here grouped papers by looking for sets of papers containing similar sets of terms. As detailed below, our analytic methods process meaning beyond simple counts of words, and thus, for example, can put papers about earthquakes and papers about tremors in the same group while excluding medical papers that discuss hand tremors.
 
Coclustering is based on an important technique in natural language processing which involves the embedding of terms into a real vector space; i.e., each word of the language is assigned a point in a high-dimensional space. Given a vector representation for a term, terms can be clustered using standard clustering techniques (also known as cluster analysis), such as hierarchical agglomeration, principal components analysis, K-means, and distribution mixture modeling.<ref name="HastieTheElem04">{{cite book |title=The Elements of Statistical Learning |author=Hastie, T.; Tibshirani, R.; Friedman, J. |publisher=Springer |pages=745 |year=2009 |isbn=9780387848570}}</ref> This was first done in the 1980s under the name latent semantic analysis (LSA).<ref name="FurnasTheVocab87">{{cite journal |title=The vocabulary problem in human-system communication |journal=Communications of the ACM |author=Furnas, G.W.; Landauer, T.K.; Gomez, L.M.; Dumais, S.T. |volume=30 |issue=11 |pages=964–971 |year=1987 |doi=10.1145/32206.32212}}</ref> In the 1990s, neural networks were applied to find embeddings for terms using a technique called context vectors.<ref name="GallantHNC92">{{cite journal |title=HNC's MatchPlus system |journal=ACM SIGIR Forum |author=Gallant, S.I.; Caid, W.R.; Carleton, J. et al. |volume=26 |issue=2 |pages=34–38 |year=1992 |doi=10.1145/146565.146569}}</ref><ref name="CaidLearned95">{{cite journal |title=Learned vector-space models for document retrieval |journal=Information Processing and Management |author=Caid, W.R.; Dumais, S.T.; Gallant, S.I. et al. |volume=31 |issue=3 |pages=419–429 |year=1995 |doi=10.1016/0306-4573(94)00056-9}}</ref> A Bayesian analysis of context vectors in the late 1990s provided a probabilistic interpretation and enabled the application of information-theoretic techniques.<ref name="GallantHNC92" /><ref name="ZhuBayesian95">{{cite journal |title=Bayesian invariant measurements of generalization |journal=Neural Processing Letters |author=Zhu, H.; Rohwer, R. |volume=2 |issue=6 |pages=28–31 |year=1995 |doi=10.1007/BF02309013}}</ref> We refer to this technique as Association Grounded Semantics (AGS). A similar Bayesian analysis of LSA resulted in a technique referred to as probabilistic-LSA<ref name="HofmannProb99">{{cite journal |title=Probabilistic latent semantic indexing |journal=Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval |author=Hofmann, T. |pages=50–57 |year=1999 |doi=10.1145/312624.312649}}</ref>, which was later extended to a technique known as latent Dirichlet allocation (LDA).<ref name="BleiLatent03">{{cite journal |title=Latent Dirichlet Allocation |journal=Journal of Machine Learning Research |author=Blei, D.M.; Ng, A.Y.; Jordan, M.I. |volume=3 |issue=1 |pages=993–1022 |year=2003 |url=http://www.jmlr.org/papers/v3/blei03a.html}}</ref> LDA is commonly referred to as “topic modeling” and is probably the most widely applied technique for discovering groups of similar terms and similar documents. Much more recently, Google directly extended the context vector approach of the early 1990s to derive word embeddings using much greater computing power and much larger datasets than had been used in the past, resulting in word2vec, which is now widely known.<ref name="MikolovEfficient13">{{cite web |url=https://arxiv.org/abs/1301.3781 |title=Efficient Estimation of Word Representations in Vector Space |author=Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. |publisher=Cornell University Library |date=07 September 2013}}</ref><ref name="BellemareTheArcade15">{{cite journal |title=The arcade learning environment: An evaluation platform for general agents |journal=Proceedings of the 24th International Conference on Artificial Intelligence |author=Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. |pages=4148–4152 |year=2015}}</ref>
 
The LDA model assumes that documents are collections of topics, and those topics generate terms, making it difficult to apply LDA to terms other than in the contexts of documents. Because coclustering treats the document as a context for a term, any other context of a term can be substituted in the coclustering model. For example, contexts may be neighboring terms, capitalization, or punctuation. This allows us to apply coclustering to a much wider variety of feature types than is accommodated by LDA. In particular, “distributional clustering” (clustering of terms based on their distributions over nearby terms), which has proven useful in information extraction<ref name="FreitagToward04">{{cite journal |title=Toward unsupervised whole-corpus tagging |journal=Proceedings of the 20th International Conference on Computational Linguistics |author=Freitag, D. |pages=357 |year=2004 |doi=10.3115/1220355.1220407}}</ref><ref name="FreitagTrained04">{{cite journal |title=Trained Named Entity Recognition using Distributional Clusters |journal=Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing |author=Freitag, D. |pages=262–69 |year=2004 |url=http://www.aclweb.org/anthology/W04-3234}}</ref>, is captured by coclustering. In future work, we anticipate recognizing material names and application references using these techniques.
 
Word embeddings are primarily used to solve the “vocabulary problem” in natural language, which is that many ways exist to describe the same thing, so that a query for “earthquakes” will not necessarily pick up a report on a “tremor” unless some generalization can be provided to produce soft matching. The embeddings produce exactly such a mapping. Applying the information-theoretic approach called AGS led to the development of coclustering<ref name="ByrnesText05">{{cite journal |title=Text Modeling for Real-Time Document Categorization |journal=Proceedings of the 2005 IEEE Aerospace Conference |author=Byrnes, J.; Rohwer, R. |year=2005 |doi=10.1109/AERO.2005.1559610}}</ref>, one of the key text analytic tools used in this research.
 
Information-theoretic coclustering is the simultaneous estimation of two partitions (a mutually exclusive, collectively exhaustive collection of sets) of the values of two categorical variables (such as “term” and “document”). Each member of a partition is referred to as a cluster. Formally, if ''X'' ranges over terms ''x''<sub>0</sub>, ''x''<sub>1</sub>, …, ''Y'' ranges over documents ''y''<sub>0</sub>, ''y''<sub>1</sub>, …, and Pr(''X'' = ''x'', ''Y'' = ''y'') is the probability of selecting an occurrence of term ''x'' in document ''y'' given that an arbitrary term occurrence is selected from a document corpus, then the mutual information ''I''(''X'';''Y'') between ''X'' and ''Y'' is given by:
 
<math>I\left( {X;Y} \right) = \sum\limits_{x,y}\text{Pr}\left( {X = x,Y = y} \right)\text{log}\frac{\text{Pr}\left( {X = x,Y = y} \right)}{\text{Pr}\left( {X = x} \right)\text{Pr}\left( {Y = y} \right)}</math>
 
We seek a partition ''A'' = {''a''<sub>0</sub>, ''a''<sub>1</sub>, …} over ''X'' and a partition ''B'' = {''b''<sub>0</sub>, ''b''<sub>1</sub>, …} over ''Y'' such that ''I''(''A'';''B'') is as high as possible. Since the information in the ''A'', ''B'' co-occurrence matrix is derived from the information in the ''X'', ''Y'' co-occurrence matrix, maximizing ''I''(''A'';''B'') is equivalent to minimizing ''I''(''X'';''Y'') − ''I''(''A'';''B''): we are compressing the original data (by replacing terms with term clusters and documents with document clusters) while minimizing the information lost to that compression.
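As a minimal illustration of this objective, the sketch below computes ''I''(''X'';''Y'') for a toy term–document count matrix and ''I''(''A'';''B'') for one candidate pair of partitions. It is not the coclustering algorithm itself, which searches over partitions to maximize ''I''(''A'';''B''); the counts and partitions here are invented for illustration.

<syntaxhighlight lang="python">
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table Pr(X=x, Y=y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal Pr(X=x)
    py = joint.sum(axis=0, keepdims=True)   # marginal Pr(Y=y)
    nz = joint > 0                          # skip empty cells (0 log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Toy term-by-document co-occurrence counts (rows: terms, columns: documents).
counts = np.array([[4, 3, 0, 0],
                   [5, 2, 0, 1],
                   [0, 0, 6, 3],
                   [1, 0, 4, 5]], dtype=float)
joint = counts / counts.sum()               # Pr(X=x, Y=y)

# One candidate partition pair: term clusters A and document clusters B.
term_clusters = [[0, 1], [2, 3]]
doc_clusters = [[0, 1], [2, 3]]

# Compressed table Pr(A=a, B=b): sum the cells of each co-cluster block.
compressed = np.array([[joint[np.ix_(a, b)].sum() for b in doc_clusters]
                       for a in term_clusters])

print(mutual_information(joint))       # I(X;Y), an upper bound on I(A;B)
print(mutual_information(compressed))  # I(A;B), what coclustering maximizes
</syntaxhighlight>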
 
Compound terms were discovered from the data through a common technique<ref name="ManningFound99">{{cite book |title=Foundations of Statistical Natural Language Processing |author=Manning, C.D.; Schütze, H. |publisher=The MIT Press |page=620 |year=1999 |isbn=9780262133609}}</ref> in which sequences are considered to be compound terms if the frequency of the sequence is significantly greater than that predicted from the frequency of the individual terms under the assumption that their occurrences are independent. As an example, when reading an American newspaper, the term “York” occurs considerably more frequently after the term “New” than it occurs in the newspaper overall, leading to the conclusion that “New York” should be treated as a single compound term. We formalize this as follows. Let Pr(''X<sub>i</sub>'' = ''x''<sub>0</sub>) be the probability that the term occurring at an arbitrarily selected position ''X<sub>i</sub>'' in the corpus is the term ''x''<sub>0</sub>. Then, Pr(''X<sub>i</sub>'' = ''x''<sub>0</sub>, ''X''<sub>''i''+1</sub> = ''x''<sub>1</sub>) is the probability of the event that ''x''<sub>0</sub> is seen immediately followed by ''x''<sub>1</sub>. If ''x''<sub>0</sub> and ''x''<sub>1</sub> occur independently of each other, then we would predict Pr(''X<sub>i</sub>'' = ''x''<sub>0</sub>, ''X''<sub>''i''+1</sub> = ''x''<sub>1</sub>) = Pr(''X<sub>i</sub>'' = ''x''<sub>0</sub>)Pr(''X''<sub>''i''+1</sub> = ''x''<sub>1</sub>). To measure the amount that the occurrences do seem to depend on each other, we measure the ratio:
 
<math>\frac{\text{Pr}\left( {X_{i} = x_{0},X_{i + 1} = x_{1}} \right)}{\text{Pr}\left( {X_{i} = x_{0}} \right)\text{Pr}\left( {X_{i + 1} = x_{1}} \right)}</math>
 
As this ratio becomes significantly higher than one, we become more confident that the sequence of terms should be treated as a single unit. This technique was iterated to provide for the construction of longer compound terms, such as “superconducting quantum interference device magnetometer.” The compound terms that either start or end with a word from a fixed list of prepositions and determiners (commonly referred to as “stopwords”) were deleted, in order to avoid having sequences such as “of platinum” become terms. This technique removes any reliance on domain-specific dictionaries or word lists, other than the list of prepositions and determiners.
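The sketch below illustrates this ratio test on a toy token stream. The threshold and minimum count are illustrative choices, and the iteration that builds longer compounds from already-merged pairs is omitted.

<syntaxhighlight lang="python">
from collections import Counter

# Illustrative stopword list; the method needs only such a fixed list,
# not a domain-specific dictionary.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "that"}

def compound_candidates(tokens, threshold=2.0, min_count=2):
    """Score adjacent pairs by Pr(x0, x1) / (Pr(x0) * Pr(x1))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    results = []
    for (x0, x1), c in bigrams.items():
        if c < min_count or x0 in STOPWORDS or x1 in STOPWORDS:
            continue  # drop rare pairs and "of platinum"-style sequences
        ratio = (c / (n - 1)) / ((unigrams[x0] / n) * (unigrams[x1] / n))
        if ratio > threshold:
            results.append(((x0, x1), round(ratio, 2)))
    return sorted(results, key=lambda kv: -kv[1])

tokens = ("the new york times reported that new york hosted "
          "a new conference in new york").split()
print(compound_candidates(tokens))  # [(('new', 'york'), 4.02)]
</syntaxhighlight>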
 
Coclustering algorithms (detailed above) produce clusters of similar terms based on the titles and abstracts in which they appear, while grouping similar documents based on the terms they contain: hence the name coclustering. In this project, a "document" is defined as a combined title and abstract. The process can be thought of as analogous to solving two equations with two unknowns. The process partitions the data into a set of collectively exhaustive document and term clusters, resulting in term clusters that are maximally predictive of document clusters and document clusters that are maximally predictive of term clusters.
 
Coclustering results can be portrayed as an ''M'' × ''N'' matrix of term clusters (defined by the terms they contain) and document clusters (defined by the documents they contain). Terms that appear frequently in similar sets of documents are grouped together, while documents that mention similar terms are grouped together. Each term appears in one, and only one, term cluster, while each document appears in one, and only one, document cluster. When deciding how many clusters to start with (term or document, i.e., ''M'' and ''N''), there is a tradeoff between breadth and depth: the goal is to differentiate between sub-topics while including a reasonable range of technical discussion within a single topic. Partitioning into a larger number of clusters results in more narrowly defined initial clusters, but might mischaracterize topics that span multiple clusters. On the other hand, partitioning into fewer clusters captures broader topics. Researchers interested in a more fine-grained understanding of materials use, e.g., in isolating a set of documents focused on a narrower scientific or technological topic, would need to sub-cluster these initial broad clusters in later analyses.
 
Terms in term clusters, along with titles and abstracts (i.e., document data) from document clusters, reveal the content of each cluster. This information provides the basis for identifying what each term cluster “is about” and for selecting term clusters for further scrutiny. Term clusters of interest were filtered using a glossary, in this case a list of terms pertaining to common applications of the 16 critical materials. This list was manually created through a brief literature review of common applications associated with the materials in question.
 
Each term cluster was correlated with each of the 437,978 scientific abstracts in our corpus, and the degree of similarity between each term cluster and each abstract was determined through an assessment of their mutual information. As discussed, the term clusters and document clusters described above were selected so as to maximize the mutual information between term clusters and document clusters. In order to find the document abstracts most strongly associated with a given term cluster, we want to choose those abstracts which are most predictive of the term cluster. These are the abstracts that contain the words in the cluster, especially those words in the cluster that are rare in the corpus in general. We formalize this by defining the association of a term cluster ''t'' and document abstract ''d'' as the value of the (''t'',''d'')-term of the mutual information formula.
 
To formalize this, we consider selecting an arbitrary term occurrence uniformly at random from the entire document set, and we write Pr(''T'' = ''t'', ''D'' = ''d'') for the probability that the term is a member of cluster ''t'' and that the occurrence is in document ''d''. We adopt the maximum likelihood estimate for this probability:
 
<math>\text{Pr}\left( {T = t,D = d} \right) = \frac{n\left( {t,d} \right)}{N}</math>
 
where ''n''(''t'',''d'') is the number of occurrences of terms from term cluster ''t'' in document abstract ''d'' and ''N'' is the total number of term occurrences in all documents: ''N'' = ∑<sub>''t'',''d''</sub>''n''(''t'',''d'').
 
We define the association between ''t'' and ''d'' by the score:
 
<math>\text{Assoc}\left( {t,d} \right) = \text{Pr}\left( {T = t,D = d} \right)\text{log}\frac{\text{Pr}\left( {T = t,D = d} \right)}{\text{Pr}\left( {T = t} \right)\text{Pr}\left( {D = d} \right)}</math>
 
Given this score, document abstracts can be arranged from most associated to least associated with a given term cluster. This was done for all 437,978 abstracts in our corpus and all term clusters. This methodology allowed the identification of those abstracts most closely associated with each term cluster.
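A minimal sketch of this scoring, assuming a toy term-by-document count matrix: it computes ''n''(''t'',''d''), the maximum likelihood probabilities, and Assoc(''t'',''d'') for one term cluster, then ranks the documents.

<syntaxhighlight lang="python">
import numpy as np

def rank_documents(counts, cluster_terms):
    """Rank documents by Assoc(t, d) for one term cluster t.

    counts: term-by-document occurrence counts (rows: terms, columns: docs).
    cluster_terms: row indices of the terms belonging to cluster t.
    """
    N = counts.sum()                          # total term occurrences
    n_td = counts[cluster_terms].sum(axis=0)  # n(t, d) for each document
    p_td = n_td / N                           # Pr(T=t, D=d), maximum likelihood
    p_t = n_td.sum() / N                      # Pr(T=t)
    p_d = counts.sum(axis=0) / N              # Pr(D=d)
    with np.errstate(divide="ignore", invalid="ignore"):
        assoc = np.where(p_td > 0,
                         p_td * np.log(p_td / (p_t * p_d)), 0.0)
    return np.argsort(-assoc)                 # most- to least-associated

# Toy counts: three terms (rows) across three documents (columns).
counts = np.array([[4, 0, 1],
                   [2, 0, 0],
                   [0, 5, 2]], dtype=float)
print(rank_documents(counts, cluster_terms=[0, 1]))  # [0 1 2]
</syntaxhighlight>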
 
Not all term clusters were easy to interpret. As with the construction of an appropriate filtering glossary, deciding on the appropriate number of sub-clusters can be subjective. In general, the more diffuse a term cluster appears to be in its subject matter, the more it should be sub-clustered as a means to separate its diverse topic matter. Even imprecise sub-clustering is effective in narrowing the focus of these clusters. Once we determined the number of sub-clusters to generate, sub-clustering of terms was done in the same general manner as the original coclustering: the same coclustering algorithms were applied only to the terms in the initial cluster, but they were instructed not to discover any new abstract clusters. Rather, the set of terms in a cluster was grouped into the pre-specified number of bins according to similarity over the already existing groups of documents in which the terms appear.
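The sketch below illustrates this fixed-document-cluster regrouping, using scikit-learn's ''k''-means as a stand-in for the coclustering algorithm actually used: each term of a diffuse cluster is represented by its distribution over the existing document clusters, and the terms are regrouped into a pre-specified number of bins. The counts are invented for illustration.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def sub_cluster_terms(term_doccluster_counts, n_sub):
    """Regroup one cluster's terms into n_sub bins by the similarity of
    their distributions over the already-fixed document clusters."""
    counts = np.asarray(term_doccluster_counts, dtype=float)
    # Each term's profile: its conditional distribution over document clusters.
    profiles = counts / counts.sum(axis=1, keepdims=True)
    return KMeans(n_clusters=n_sub, n_init=10,
                  random_state=0).fit_predict(profiles)

# Rows: terms of one diffuse "bulb" cluster; columns: document clusters.
# The first three terms co-occur with one set of document clusters (plants),
# the last three with another (lighting).
counts = [[9, 8, 0, 1],
          [7, 9, 1, 0],
          [8, 7, 0, 0],
          [0, 1, 9, 8],
          [1, 0, 7, 9],
          [0, 0, 8, 7]]
print(sub_cluster_terms(counts, n_sub=2))  # e.g., [0 0 0 1 1 1]
</syntaxhighlight>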
 
==Results==
===Workflow===
This research resulted in the creation of the following workflow, detailed in Figure 2. The workflow was developed to assess the material–application pairing matrices, and it illustrates how text analytics can be used to aid in assessment of material importance to specific application areas, identify pairings of materials and applications, and augment a human expert’s ability to monitor the use and importance of materials.
 
 
[[File:Fig2 Kalathil FrontInResMetAnal2018 2.jpg|900px]]
{{clear}}
{| 
  | STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="900px"
  |-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 2.''' Diagram of the text analytics workflow that this project developed. This workflow was utilized to identify material–application pairs.</blockquote>
|-
|}
|}


The final data set acquired and ingested into the Copernicus platform contained 437,978 documents. We extracted 83,059 terms from titles and abstracts, including compound (multi-part) terms such as “chloroplatinic acid” and “alluvial platinum.” Initial clustering showed that some term clusters were very precise while others were less focused. We experimented with multiple clustering sizes to find an optimal size, settling on 400 document clusters and 400 term clusters to start, creating a 400 × 400 matrix of document and term clusters. By manually analyzing the terms in each term cluster, we identified which clusters focused narrowly on areas most relevant to our study, namely those that use materials in clean energy applications. The term cluster in Figure 3 below, for example, focused on lithium-ion batteries, as indicated by the representative term list and an analysis of the most closely associated papers.
 
 
[[File:Fig3 Kalathil FrontInResMetAnal2018 2.jpg|600px]]
{{clear}}
{| 
  | STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="600px"
  |-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 3.''' The coclustering algorithm produced term clusters. This is an example of a narrowly focused cluster.</blockquote>
|-
|}
|}
Analysis relies on our ability to extract meaningful statistics about the dataset from term clusters, which can be poorly defined, as seen in the example term cluster in Figure 4. When term clusters are poorly defined, we have limited ability to interpret the statistics we extract. There are multiple ways to differentiate term clusters. Principal among these was the division between term clusters that cover basic scientific research and those that focus on specific technological applications. In addition, because this project used the 2011 DOE CMS as a validation source, a differentiation between clean energy and non-clean energy relevance was also necessary, especially when the same material discoveries were used in both clean energy and non-clean energy contexts. More broadly, if a researcher is interested in a specific application area, a division between that application area and others can be used as the means for differentiation.

[[File:Fig4 Kalathil FrontInResMetAnal2018 2.jpg|600px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="600px"
  |-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 4.''' The coclustering algorithm produced term clusters. This is an example of an unfocused cluster that spans basic science research and non-clean energy research.</blockquote>
|-
|}
|}


As discussed in the “Methods” section for this project, we manually determined the number of necessary sub-clusters. For example, one of the glossary terms we used was “bulbs.” A term cluster was identified as the bulb cluster. On closer examination, however, it clearly contained material about both plant bulbs and light bulbs, as shown in Figure 5. Accordingly, this cluster was split (sub-clustered); after sub-clustering, it separated into two clearly defined clusters.
 
 
[[File:Fig5 Kalathil FrontInResMetAnal2018 2.jpg|1100px]]
{{clear}}
{| 
  | STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="1100px"
  |-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 5.''' Unfocused clusters were sub-clustered to produce more usable, focused clusters. Here, the bulb term cluster was split; the sub-clustering algorithms produced the two new, smaller, and more focused clusters shown above.</blockquote>
|-
|}
|}
The process of sub-clustering expanded the initial list of 42 application term clusters into a total of 134 relevant term clusters for analysis. Clusters were considered “relevant” based on the weights assigned to them. For this project, we developed a methodology for weighting clusters to replicate the ground truth of the 2011 DOE CMS.
===Weighting===
One of the principal questions addressed by the 2011 DOE CMS was the importance of specific materials to clean energy applications. The DOE ranked the 16 materials based on their importance to clean energy and their supply risk, as detailed in Figure 1. We developed a methodology to replicate the ''y''-axis of this figure (material importance to clean energy) by combining the data on material distribution over clusters with an assessment of the clean energy importance of each cluster. To assess the clean energy importance of each of these clusters, we developed a clean energy importance weighting. A set of keywords was constructed manually, and the top 500 associated abstracts of each term cluster were searched for these keywords. Constructing this glossary is a crucial step that allows researchers to define their keywords of interest. For each cluster, document abstracts were analyzed for mentions of any clean energy field and for the number of different clean energy fields mentioned across the cluster. The clean energy weighting considers the extent and depth of impact within a cluster, and is equal to:
<math>\frac{\text{Total number of abstracts mentioning any clean energy field}}{\text{Number of different clean energy fields mentioned across the cluster}}</math>
This weighting essentially captures the number of abstracts per clean energy field. The goal of the weighting is to discount clusters that mention many different clean energy fields but do not discuss any of them in a substantial or significant manner. Materials mentioned frequently in clusters with high clean energy weights can be thought of as “important” to clean energy. The importance of each material to clean energy was determined by counting the number of document mentions in clean-energy-important clusters (i.e., clusters above some cutoff of minimum clean energy weight).
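A minimal sketch of this weighting, assuming a hypothetical glossary of clean energy fields and simple substring matching (the actual analysis used the manually constructed glossary described above):

<syntaxhighlight lang="python">
def clean_energy_weight(abstracts, fields):
    """Weight = (abstracts mentioning any clean energy field) /
    (distinct clean energy fields mentioned across the cluster)."""
    mentioning = [a for a in abstracts if any(f in a.lower() for f in fields)]
    fields_seen = {f for f in fields if any(f in a.lower() for a in abstracts)}
    return len(mentioning) / len(fields_seen) if fields_seen else 0.0

# Hypothetical glossary entries and toy cluster abstracts.
fields = ["photovoltaic", "wind turbine", "battery"]
abstracts = [
    "Thin-film photovoltaic cells based on tellurium compounds.",
    "Cathode materials for lithium battery performance.",
    "A photovoltaic absorber layer study.",
    "Crystal growth of cerium oxide.",
]
print(clean_energy_weight(abstracts, fields))  # 3 mentions / 2 fields = 1.5
</syntaxhighlight>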
Material-to-clean-energy-field application pairings were derived from term clusters. Recall that for each term cluster, the number of abstracts mentioning any of the 16 materials was determined, as was the number of abstracts mentioning any of 33 clean energy field keywords. In the importance analysis previously discussed, mentions of the 33 clean energy fields were aggregated together to establish the clean energy weighting system. For each clean energy field, all term clusters were ranked based on a normalized count of document mentions of that field.
From this ranking, the “top” three term clusters were identified for a specific field for review. The terms and keywords were then manually reviewed to ensure that the term clusters were actually relevant, and the occasional false positives (term clusters that score high on keyword counts but are not in fact relevant based on human content analysis) were discarded. This manual review, while requiring human intervention, is done on a significantly narrowed set of documents and thus does not represent a major bottleneck in the process. Having linked clusters to a specific clean energy field, material importance to each field was evaluated by examining the distribution of material mentions across associated documents. The number of mentions was divided by the average number of mentions across all 134 relevant term clusters to account for keywords or phrases mentioned at high frequency. In the photovoltaics case, the normalized mention counts dropped significantly after the first three term clusters, and those three clusters were manually reviewed to ensure that they were relevant.
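The ranking step can be sketched as follows; the per-cluster mention counts are hypothetical, and the normalization divides each count by the average across all clusters, as described above.

<syntaxhighlight lang="python">
import numpy as np

def top_clusters_for_field(field_mentions, k=3):
    """Rank term clusters for one clean energy field by mention counts
    normalized by the average count across all clusters."""
    counts = np.asarray(field_mentions, dtype=float)
    normalized = counts / counts.mean()
    order = np.argsort(-normalized)
    return [(int(i), round(float(normalized[i]), 2)) for i in order[:k]]

# Hypothetical per-cluster counts of "photovoltaics" mentions; the sharp
# drop after the first three clusters mirrors the pattern described above.
print(top_clusters_for_field([120, 95, 80, 12, 9, 4, 2]))
# [(0, 2.61), (1, 2.07), (2, 1.74)]
</syntaxhighlight>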


===Extracting statistics===
Statistics related to material importance and material–application pairings were extracted from this final set of 134 relevant term clusters. For the purposes of this project, the top 500 abstracts from each of the 134 final clusters were analyzed, resulting in 49,573 abstracts. Each of these abstracts was automatically searched for mentions of the 16 materials identified in the DOE strategy report (these counts included any compounds or alloys associated with those minerals), providing a count and distribution of material mentions across the final set of granular term clusters that served as the basis for a subsequent manual analysis. An alternative to coclustering is hierarchical clustering, in which terms are joined with their nearest matches only, then each cluster is joined with its nearest match only, and so on. Such an approach makes sub-clustering trivial by reducing it to undoing the merges that generated a given cluster. We opted against this structure because, in past experience, hierarchical clustering has generated qualitatively inferior clusterings of terms.
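A simplified sketch of the mention counting, using the 16 material names from the 2011 CMS; the actual analysis also counted associated compounds and alloys, which would require a synonym list not shown here.

<syntaxhighlight lang="python">
from collections import Counter

# The 16 materials from the 2011 CMS.
MATERIALS = ["europium", "terbium", "yttrium", "dysprosium", "neodymium",
             "cerium", "tellurium", "lanthanum", "indium", "lithium",
             "gallium", "praseodymium", "nickel", "manganese", "cobalt",
             "samarium"]

def material_mention_counts(abstracts):
    """Count how many abstracts mention each material at least once."""
    tally = Counter()
    for text in abstracts:
        lower = text.lower()
        for material in MATERIALS:
            if material in lower:
                tally[material] += 1
    return tally

print(material_mention_counts([
    "Neodymium-iron-boron magnets for wind turbines.",
    "Cathode design for lithium and cobalt utilization.",
]))  # Counter({'neodymium': 1, 'lithium': 1, 'cobalt': 1})
</syntaxhighlight>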
Figure 6 displays how many times each material was mentioned in the 49,573 abstracts with the highest mutual information with the final 134 relevant term clusters (i.e., the 500 abstracts with the most mutual information with each of those clusters). Counting material mentions in the papers most closely associated with each term cluster reveals which materials are most strongly associated with which terms. The methodology thus provides a way to analyze the distribution of materials over different topics, setting the foundation for the material importance and material–application pair analysis.
 
 
[[File:Fig6 Kalathil FrontInResMetAnal2018 2.jpg|900px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 6.''' The number of documents in the 134 relevant clusters that contained a specific element varied from under 1,000 to over 12,000, as shown above.</blockquote>
|}
|}


===Validation of method===
To compare the results with the CMS, the clusters must be weighted for clean energy relevance, as described above. Multiple clean energy cutoffs were reviewed, and ultimately a cutoff of 26 was employed; this value corresponds to the beginning of the step up in the distribution presented in Figure 7. Term clusters with a clean energy weight of less than 26 were therefore excluded.
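Applying the cutoff is a one-line filter once weights are computed. A minimal sketch, reusing the hypothetical clean_energy_weight() helper from the earlier example; the cutoff value of 26 is the one reported above.

<syntaxhighlight lang="python">
CLEAN_ENERGY_CUTOFF = 26  # start of the step up in Figure 7

def filter_clean_energy_clusters(weight_by_cluster,
                                 cutoff=CLEAN_ENERGY_CUTOFF):
    """Drop term clusters whose clean energy weight is below the cutoff."""
    return {c: w for c, w in weight_by_cluster.items() if w >= cutoff}
</syntaxhighlight>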
 
 
[[File:Fig7 Kalathil FrontInResMetAnal2018 2.jpg|900px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 7.''' To determine the importance rankings for clean energy, clusters were assigned a weight reflecting their relevance to clean energy, as shown here. Weights were determined using a “clean energy” glossary manually created by the analysts. When applied, the revised plant “bulb” cluster has a low clean energy weight, while the revised LED “bulb” cluster has a high clean energy weight.</blockquote>
|}
|}


Material mentions in the top 500 abstracts associated with the 19 “clean energy” term clusters were aggregated together, yielding a measure of overall importance to clean energy, as seen in Figure 8. This measure counted the number of times each material was mentioned in the top 500 abstracts associated with the 19 term clusters highly important to clean energy, treating those clusters as a single set. The top and the bottom of our importance determination matched the top and bottom of the DOE’s list, though there was some variation in the middle. In a real-world analysis, such variation would prompt the analyst to consider where further study is warranted and whether some aspects of a material’s clean energy use and importance merit reassessment.
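The aggregation reduces to summing per-cluster counts over one pooled document set. A brief sketch, assuming the hypothetical material_mentions() helper sketched earlier and taking one list of top abstracts per clean energy cluster:

<syntaxhighlight lang="python">
from collections import Counter

def overall_importance(top_abstracts_by_cluster):
    """Aggregate material mentions over the top abstracts of every
    clean energy cluster, treating them all as one document set."""
    totals = Counter()
    for abstracts in top_abstracts_by_cluster:  # one list per cluster
        totals.update(material_mentions(abstracts))
    return totals.most_common()  # materials ranked by overall mentions
</syntaxhighlight>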
[[File:Fig8 Kalathil FrontInResMetAnal2018 2.jpg|400px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="400px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 8.''' The importance rankings as determined by the SRI methodology. Red denotes more important, while green denotes less important. Results from the clustering algorithms matched the expert-produced high and low importance rankings.</blockquote>
|}
|}
====Material–application pairings====
SRI utilized the term and document clusters to develop a matrix of material–application pairs for comparison against ground truth. This matrix was compared to the DOE 2011 CMS matrix, which mapped materials to specific clean energy technologies (see Figure 9). The prevalence of different materials in a given clean energy field was considered, as defined and described above.
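Constructing such a matrix is straightforward once per-field normalized scores exist. The sketch below builds a materials-by-fields table from hypothetical score inputs and bins scores into importance tiers for comparison against the DOE matrix; the tier thresholds are illustrative placeholders, not values from the paper.

<syntaxhighlight lang="python">
def pairing_matrix(scores, high=2.0, medium=1.0):
    """Bin normalized material-to-field scores into importance tiers.

    scores: dict mapping (material, field) -> normalized mention score.
    The high/medium thresholds are illustrative only.
    """
    matrix = {}
    for (material, field), score in scores.items():
        if score >= high:
            tier = "high"
        elif score >= medium:
            tier = "medium"
        else:
            tier = "low"
        matrix.setdefault(material, {})[field] = tier
    return matrix

example = {("indium", "photovoltaics"): 3.1,
           ("indium", "wind turbines"): 0.2}
print(pairing_matrix(example))
</syntaxhighlight>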


[[File:Fig9 Kalathil FrontInResMetAnal2018 2.jpg|700px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 9.''' Material–application pairing matrix from the U.S. Department of Energy Critical Materials Strategy Report, December 2011. These classifications were used to validate the results from the clustering algorithms.</blockquote>
|}
|}
Figure 10 displays three term clusters as an illustration of the results and the filtering step. The term cluster labeled 388 discussed the economic implications of photovoltaic technologies and so was discarded after manual review. The other two clusters can be used to measure the distribution of material mentions across their associated abstracts, an indicator of material importance to that specific application.


[[File:Fig10 Kalathil FrontInResMetAnal2018 2.jpg|1000px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 10.''' Three term clusters as an illustration of the results and filtering step. Analysis of the terms and associated abstracts from the “top” three photovoltaic term clusters reveals that two are relevant while one is not.</blockquote>
|}
|}


Each material was scored based on a normalized number of mentions across the abstracts associated with each term cluster under analysis. In this case, the results from our methodology mirrored the results from the DOE’s report exactly: indium, gallium, and tellurium were identified as the most important materials to photovoltaics (see Table 1). Similarly, the results for magnets mirrored the DOE’s results (see Table 2).
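Checking a computed ranking against the DOE matrix can be as simple as comparing ordered lists. A minimal sketch, with both rankings as hypothetical inputs; only the indium/gallium/tellurium result comes from the text above.

<syntaxhighlight lang="python">
def ranking_agreement(ours, reference, k=3):
    """Fraction of the reference's top-k materials that also appear
    in our top-k (order-insensitive), e.g., indium, gallium, and
    tellurium for photovoltaics."""
    return len(set(ours[:k]) & set(reference[:k])) / k

print(ranking_agreement(
    ["indium", "gallium", "tellurium", "lanthanum"],
    ["indium", "tellurium", "gallium", "cerium"]))  # -> 1.0
</syntaxhighlight>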


[[File:Tab1 Kalathil FrontInResMetAnal2018 2.jpg|400px]]
{{clear}}
{| 
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="400px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Table 1.''' Material–application pairing matrix comparing the results from our methodology with the U.S. Department of Energy's (DOE's) Critical Materials Strategy report for the photovoltaic technology and coatings component.</blockquote>
|}
|}
  | style="padding:5px; width:500px;" |'''a.''' The system supports a library of common electronic data deliverable (EDD) formats.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''b.''' The system transfers data to and from another record management system.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''c.''' The system [[LIS feature#Third-party software integration|integrates]] with Microsoft Exchange services.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''d.''' The system [[LIS feature#Import data|imports data]] from and exports data to Microsoft Word, Excel, and/or Access.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''e.''' The system can interface with non-Microsoft programs.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''f.''' The system interfaces with external billing systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''g.''' The system interfaces with [[electronic medical record]] or [[electronic health record]] systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''h.''' The system interfaces with [[hospital information system]]s.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''i.''' The system interfaces with enterprise resource planning (ERP) systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''j.''' The system interfaces with external contract or reference laboratories to electronically send or retrieve datasheets, analysis reports, and other related information.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''k.''' The system exchanges data with National Identification System (NAIS) tracking systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''l.''' The system generates and exchanges data with other systems using [[Health Level 7]] (HL7) standards.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''m.''' The system leverages the application programming interface (API) of other systems to establish integration between systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''n.''' The system provides a real-time interface for viewing live and stored data transactions and errors generated by interfaced instruments and systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''o.''' The system transmits status changes of specimens, inventory, equipment, etc. to an external system.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''p.''' The system directs output from ad-hoc queries to a computer file for subsequent analysis by other software.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''q.''' The system supports the manual retransmission of data to interfaced systems.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''r.''' The system supports [[LIS feature#Mobile device integration|dockable mobile devices]] and handle information exchange between them and the system.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''s.''' The system supports the use of optical character recognition (OCR) software.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
|}


===1.4.8 Reporting===
{| class="wikitable collapsible" border="1" cellpadding="10" cellspacing="0"
|-
  ! colspan="3" style="color:DarkSlateGray;text-align:left; padding-left:40px;"| 1.4.8 '''Reporting'''
|-
  ! style="color:brown; background-color:#ffffee; width:500px;"| Request for information
  ! style="color:brown; background-color:#ffffee; width:100px;"| Requirement code
  ! style="color:brown; background-color:#ffffee; width:700px;"| Vendor response
|-
  | style="padding:5px; width:500px;" |'''a.''' The system includes a versatile report writer and forms generator that can generate reports from any data in tables.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''b.''' The system includes a custom graphic generator for forms.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''c.''' The system interfaces with a third-party reporting application.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''d.''' The system allows the development of [[LIS feature#Configurable templates and forms|custom templates]] for different types of reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''e.''' The system maintains template versions and renditions, allowing management and tracking of the template over time.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''f.''' The system generates template letters for semi-annual reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''g.''' The system supports report queries by fields/keys, status, completion, or other variables.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''h.''' The system use Microsoft Office tools for formatting reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''i.''' The system supports multiple web browsers for viewing online reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''j.''' The system generates, stores, reproduces, and displays laboratory, statistical, and inventory reports on demand, including narrative.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''k.''' The system includes several standard reports and query routines to access all specimens with the pending status through a backlog report that includes the following criteria: all laboratory, department, analysis, submission date, collection date, prep test complete, location, project, specimen delivery group, and other user-selectable options.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''l.''' The system indicates whether a report is preliminary, amended, corrected, or final while retaining revision history.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''m.''' The system supports both structured and [[LIS feature#Synoptic reporting|synoptic reporting]].
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''n.''' The system generates management and turn-around time reports and graphs.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''o.''' The system generates customized final reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''p.''' The system automatically generates laboratory reports of findings and other written documents.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''q.''' The system automatically generates individual and aggregate workload and productivity reports on all operational and administrative activities.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''r.''' The system automatically generates and transmits exception trails and exception reports for all entered and/or stored out-of-specification data.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''s.''' The system generates a read-only progress report that allows for printed reports of specimen status and data collected to date.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''t.''' The system provides an ad-hoc web reporting interface to report on user-selected criteria.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''u.''' The system automatically generates and updates control charts.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''v.''' The system generates QA/QC charts for all recovery, precision, and lab control specimens via a full statistics package, including Levy-Jennings plots and Westgard multi-rule.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''w.''' The system displays history of previous results for an analyte's specimen point in a tabular report, graphic trend chart, and statistical summary.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''x.''' The system automatically generates and posts periodic static summary reports on an internal web server.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''y.''' The system transmits results in a variety of ways including fax, e-mail, print, and website in formats like RTF, PDF, HTML, XML, DOC, XLS, and TXT.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''z.''' The system electronically transmits results via final report only when all case reviews have been completed by the case coordinator.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''aa.''' The system includes a rules engine to determine the recipients of reports and other documents based on definable parameters.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''ab.''' The system allows database access using user-friendly report writing and inquiry tools.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''ac.''' The system supports automatic reporting to the state based on state-level health department rules.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''ad.''' The system reports molecular results in both clinical and anatomical pathology environments.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''ae.''' The system can produce specialized reports for microbiology, hematology, assay trend, and/or cardiac risk.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''af.''' The system can produce epidemiology and antibiogram reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''ag.''' The system provides the tools for creating and maintaining licensing, proficiency testing, inspection, and regulatory records and reports.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
|}


===1.4.9 Laboratory management===
[[File:Tab2 Kalathil FrontInResMetAnal2018 2.jpg|400px]]
{{clear}}
<blockquote>'''Table 2.''' Materials/application pairing matrix comparing the results from our methodology with the Department of Energy (DOE)’s CM strategy report for wind turbines and vehicle technologies, and magnet component.</blockquote>
{| class="wikitable collapsible" border="1" cellpadding="10" cellspacing="0"
|-
  ! colspan="3" style="color:DarkSlateGray;text-align:left; padding-left:40px;"| 1.4.9 '''Laboratory management'''
|-
  ! style="color:brown; background-color:#ffffee; width:500px;"| Request for information
  ! style="color:brown; background-color:#ffffee; width:100px;"| Requirement code
  ! style="color:brown; background-color:#ffffee; width:700px;"| Vendor response
|-
  | style="padding:5px; width:500px;" |'''a.''' The system allows the creation, modification, and duplication of user profiles.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''b.''' The system allows entry, maintenance, and administration of patient records and is able to track multiple patient encounters.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''c.''' The system allows entry, maintenance, and administration of [[LIS feature#Customer, supplier, and physician management|customers]], suppliers, and other outside entities.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''d.''' The system allows patients, customers, suppliers, physicians, and other such entities to be flagged as either active or inactive.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''e.''' The system allows the creation, modification, and maintenance of user [[LIS feature#Performance evaluation|training records]] and associated training materials.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''f.''' The system supports the ability to set up separate security, inventory, reporting, etc. profiles across multiple facilities.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''g.''' The system allows the management of information workflow, including notifications for requests and exigencies.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''h.''' The system allows the [[LIS feature#Document creation and management|management of documents]] like SOPs, MSDS, etc. to better ensure they are current and traceable.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''i.''' The system allows the management and monitoring of resources by analyst, priority, analysis, and instrument.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''j.''' The system allows authorized persons to select and assign tasks by analysts, work group, instrument, test, specimen, and priority.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''k.''' The system allows authorized persons to review unassigned work by discipline and by lab.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''l.''' The system allows authorized persons to review pending work by analyst prior to assigning additional work.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''m.''' The system manages and reports on reference specimens, reagents, and other inventory, including by department.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''n.''' The system automatically warns specified users when [[LIS feature#Inventory management|inventory counts]] reach a definable threshold and either prompt for or process a reorder.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''o.''' The system allows authorized users to monitor and report on reference and reagent creation, use, and expiration.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''p.''' The system allows authorized users to search [[LIS feature#Billing and revenue management|invoice information]] by invoice number, account number, accession, payment types, client, or requested diagnostic test(s).
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''q.''' The system includes [[LIS feature#Performance evaluation|performance assessment]] tracking.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''r.''' The system receives, records, and maintains customer and employee feedback and applies tools to track the investigation, resolution, and success of any necessary corrective action.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''s.''' The system includes an incident tracking system for recording, investigating, and managing safety and accident violations in the laboratory.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''t.''' The system monitors proficiency test assignment, completion, and casework qualification for analytical staff.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''u.''' The system includes revenue management functionality, including medical necessity checks and profitability analysis.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''v.''' The system provides analysis tools to better support laboratory functions like resource planning, productivity projections, workload distribution, and work scheduling, and those tools display information in a consolidated view, with the ability to drill down to more detailed data.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''w.''' The system calculates administrative and lab costs.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''x.''' The system captures and maintains patient, submitter, supplier, and other client demographics and billing information for costing, invoicing, collecting, reporting, and other billing activities.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''y.''' The system supports multiple customer [[LIS feature#Billing and revenue management|payment sources]] (e.g. grants).
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''z.''' The system supports multi-tiered pricing based on patient type and location.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
  | style="padding:5px; width:500px;" |'''aa.''' The system tracks number of visits per specific industry.
  | style="background-color:white; padding:5px;" |
  | style="background-color:white;" |
|-
|}


==1.5 Custom requirements==
===Topic replacement===
One potential application of text analytics is the ability to examine trends in material use both within and between term and document clusters to measure how technology may be changing or “trending” within specific application areas. Work on detection of technology emergence has focused on keyword occurrence, sometimes within the context of an existing taxonomy.<ref name="EusabiIdent14">{{cite web |url=https://www.rand.org/pubs/research_reports/RR629.html |title=Identification and Analysis of Technology Emergence Using Patent Classification |author=Eusabi, C.; Silberglitt, R. |publisher=RAND Corporation |date=2014}}</ref> SRI's method produces both the terms and applications from the corpus. Manual evaluation determined that the term cluster labeled 361 mentioned nickel–metal hydride and other battery technologies, specifically in the context of electric vehicles. In 2008, document counts for lithium surpassed document counts for nickel in term cluster 361 (see Figure 11). The graph shows that around 2008, nickel–metal hydride battery technology for electric vehicles had reached a point of relative maturity; lithium-ion batteries for use in electric vehicles, however, were a less mature technology, and research shifted to focus on advancing that less mature technology. While we cannot draw any conclusions about the actual usage of these technologies from these data, the case study demonstrates the potential of text analytics to analyze trends over time and identify how materials and technologies may replace one another. We have also seen this in dye-sensitized solar cells in a previous project.<ref name="RandazzeseHelios16">{{cite web |url=https://www.osti.gov/biblio/1336902-helios-understanding-solar-evolution-through-text-analytics |title=Helios: Understanding Solar Evolution Through Text Analytics |author=Randazzese, L. |work=OSTI.gov |publisher=U.S. Department of Energy |date=02 December 2016 |doi=10.2172/1336902}}</ref>
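
A minimal sketch of the trend computation just described: four-year trailing moving averages of per-material document counts within a single term cluster, with a check for the year one material's smoothed count overtakes another's. The yearly counts below are illustrative placeholders, not the study's data.

<syntaxhighlight lang="python">
# Sketch of the material-replacement trend analysis: 4-year moving averages
# of per-material document counts within one term cluster (e.g., cluster 361).
# The yearly counts are hypothetical values for illustration only.

def moving_average(counts, window=4):
    """Trailing moving average over a list of yearly counts."""
    out = []
    for i in range(len(counts)):
        lo = max(0, i - window + 1)
        out.append(sum(counts[lo:i + 1]) / (i - lo + 1))
    return out

years = list(range(1992, 2012))
counts_by_material = {
    "nickel":  [5, 6, 8, 9, 11, 12, 14, 15, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 5],
    "lithium": [1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 11, 13, 16, 19, 23],
}

smoothed = {m: moving_average(c) for m, c in counts_by_material.items()}
# Find the first year where lithium's smoothed count overtakes nickel's
for yr, ni, li in zip(years, smoothed["nickel"], smoothed["lithium"]):
    if li > ni:
        print(f"Lithium overtakes nickel (4-year moving average) in {yr}")
        break
</syntaxhighlight>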
 
 
[[File:Fig11 Kalathil FrontInResMetAnal2018 2.jpg|700px]]
{{clear}}
<blockquote>'''Figure 11.''' 4-year moving averages for lithium and nickel document counts in term cluster 361. These results from SRI algorithms show decreasing activity in battery research using nickel hydrides by 2008, while lithium research increased for battery applications.</blockquote>
{| class="wikitable collapsible" border="1" cellpadding="10" cellspacing="0"
|-
  ! colspan="3" style="color:DarkSlateGray;text-align:left; padding-left:40px;"| 1.5 '''Custom requirements'''
|-
  ! style="color:brown; background-color:#ffffee; width:500px;"| Request for information
  ! style="color:brown; background-color:#ffffee; width:100px;"| Requirement code
  ! style="color:brown; background-color:#ffffee; width:700px;"| Vendor response
|-
  | style="padding:15px; width:470px;" |'''a.'''
  | style="background-color:white; padding:15px;" |                                                                                                                                                   
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''b.'''
  | style="background-color:white; padding:15px;" |                                                                                                                                                         
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''c.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''d.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''e.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''f.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''g.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''h.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''i.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''h.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''i.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
  | style="padding:15px; width:470px;" |'''j.'''
  | style="background-color:white; padding:15px;" |
  | style="background-color:white;" |
|-
|}


==Discussion==
The results presented show the application of a text analytics method to extract meaningful information for use in evaluating the progress of research and development. This tool automates away a large amount of manual labor; however, human intervention is still necessary. Human experts are required to define the parameters of evaluation. Fundamentally, this methodology is not designed to replace human analysis or input, nor is it intended to act independently. Instead, it is a tool that researchers can use to more effectively and efficiently analyze their entire domain of research, while also reaching into tangential domains that contain relevant concepts, components, and ideas. Researchers can utilize this methodology to perform objective, replicable, and adaptable reviews of the relative importance of individual components to work toward an understanding of how different pieces fit together. This methodology can be applied to help researchers and inventors better understand how specific components or materials are involved in a given technology or research stream, thereby increasing their potential to create new inventions or discover new scientific findings.

The manual construction of the glossary is a crucial step in this methodology. The choice of screening terms has an obvious impact on which clusters are chosen for analysis, so the glossary must be assembled with the end goal in mind. Alternatively, one could use a more limited glossary if the domain of interest were narrow and known. Instead of screening term clusters in and out, one could keep all clusters and use weightings to rate the relevance of each cluster. In this case, we created our glossary by selecting keywords from a broad list of material applications (both clean energy relevant and not) to ensure that we did not artificially restrict our results early in the process. This selection was done manually, based on our analysts’ background knowledge of applications of critical materials. This screening identified 42 application term clusters of interest, about 10% of all term clusters.
===Future work===
Obvious extensions to this work include text analysis of additional document corpora, more analysis of trends over time, and more sophisticated use of text analytics, for example, incorporating natural language processing approaches. We did preliminary clustering of patent data that suggested a path forward similar to what we did for papers. New types of insight, for example on the influence of external conditions on materials research, might be obtained by looking at non-technical corpora such as news articles.
The workflow developed for this project allows a subject matter analyst to leverage state-of-the-art text analytic tools without requiring those tools to produce perfect output (which is significant, because the state of the art in text analysis produces enormous amounts of noise when applied in any practical setting). We are able to reduce the overall manual effort required to understand the content of large volumes of scientific reporting, at the cost of shifting some of that effort to tasks dealing with the text analytic tools. For a targeted investigation such as ours, analysts will always select the target and shape the investigation.
We used term clustering primarily as a way for analysts to collect “concepts” from text, but of course there are many ways to carve up the concept space. In our workflow, analysts were able to specify sets of terms that should be further subdivided, but an improved interface to these tools would allow a user to suggest a split and see the resulting changes immediately, as opposed to the current semi-manual process in which the data analyst invokes a standalone process to subdivide individual clusters. This was applied, for example, to distinguish the two different senses of the term “bulb”: for lighting and as a type of plant. Polysemy of this type is common in natural language, and experimental techniques exist for automatically identifying and resolving it.<ref name="FreitagToward04" /><ref name="FreitagTrained04" /><ref name="FreitagNewExp05">{{cite journal |title=New experiments in distributional representations of synonymy |journal=Proceedings of the Ninth Conference on Computational Natural Language Learning |author=Freitag, D.; Blume, M.; Byrnes, J. et al. |pages=25–32 |year=2005}}</ref> Incorporating these techniques into our workflow and extending them where possible would reduce analyst effort by providing superior semantic distinctions with each round of clustering. It is possible that the alternative clustering technique LDA, referred to in the background section, would address this problem in ways complementary to the coclustering approach, as it allows terms to be members of multiple topics rather than requiring each term to belong to a single cluster. In addition, metrics such as entropy or information gain could be used to automatically flag the clusters most likely to need splitting, although we expect such metrics will not be sufficient to replace human judgment.
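
As a rough illustration of the entropy metric mentioned above, one could score each term cluster by the Shannon entropy of its term-frequency distribution and flag high-entropy clusters as split candidates; this is a heuristic sketch, not the paper's implementation, and the clusters and counts below are invented for illustration.

<syntaxhighlight lang="python">
import math

def cluster_entropy(term_counts):
    """Shannon entropy (bits) of a cluster's term-frequency distribution.
    Higher entropy suggests a more diffuse cluster that may need splitting."""
    total = sum(term_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in term_counts.values() if c > 0)

# Hypothetical clusters: term -> occurrence count
focused = {"lithium-ion": 120, "anode": 40, "cathode": 38, "electrolyte": 30}
diffuse = {"bulb": 50, "tulip": 45, "lighting": 48, "soil": 44, "lamp": 47, "flower": 46}

for name, cluster in [("focused", focused), ("diffuse", diffuse)]:
    print(name, round(cluster_entropy(cluster), 2))  # diffuse scores higher
</syntaxhighlight>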
Set expansion techniques<ref name="XuQuery96">{{cite journal |title=Query expansion using local and global document analysis |journal=Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval |author=Xu, J.; Croft, W.B. |pages=4–11 |year=1996 |doi=10.1145/243199.243202}}</ref><ref name="WangAuto09">{{cite journal |title=Automatic set instance extraction using the web |journal=Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP |author=Wang, R.C.; Cohen, W.W. |pages=441–449 |year=2009}}</ref> have been developed for finding a set of terms which are related to a given set of terms in the same way that those terms are related to each other. For example, the word “Achilles” is related to many words in Greek mythology, the word “Alabama” is related to U.S. state names and famous Alabamans, and the word “Queen Mary” is related to historical political figures. When taken together, however, the main thing that these terms have in common is that they are all ship names, and set expansion techniques find such hidden connections and then find additional terms sharing the relationship. Applying such techniques to keyword lists for target technologies would reduce the burden on the analyst to come up with comprehensive and specific keyword lists for targeting concepts.
==Acknowledgements==
===Author contributions===
NK contributed to the design and performed analysis. JB designed the text analytics workflow and extended the Copernicus tools to apply to this project. LR led the development of the overall analytical approach. DH managed the data, applied Copernicus, and implemented extensions designed by JB. CF contributed to the design of the study and methodology. All authors contributed to the drafting of the manuscript.
===Conflict of interest statement===
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
===Funding===
This work was supported by the National Renewable Energy Laboratory through contract number AEU-6-62527-01. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Renewable Energy Laboratory or its staff. The authors acknowledge John Chase for his contributions to the initial proposal and project plan. The authors wish to thank Alberta Carpenter, Joe Cresko, Fletcher Fields, and Rod Eggert for their helpful comments during the project. We thank the reviewers for their helpful comments.
==References==
{{Reflist|colwidth=30em}}
==Notes==
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version — by design — lists them in order of appearance. The single footnote from the original article was turned into inline parenthetical text for this version.
<!--Place all category tags here-->
[[Category:LIMSwiki journal articles (added in 2018)‎]]
[[Category:LIMSwiki journal articles (all)‎]]
[[Category:LIMSwiki journal articles on materials informatics‎‎]]


Introduction


The Intelligence Advanced Research Projects Activity recognized the scale problem in 2011, creating the research program Foresight and Understanding from Scientific Exposition. Under this program, SRI and other performers processed “the massive, multi-discipline, growing, noisy, and multilingual body of scientific and patent literature from around the world and automatically generated and prioritized technical terms within emerging technical areas, nominated those that exhibit technical emergence, and provided compelling evidence for the emergence.”[3] The work presented here applies and extends that platform to efficiently identify and describe the past and present evolution of research on a given set of materials. It demonstrates how these computational tools can be used by analysts to analyze much larger sets of data and to develop more iterative and adaptive material assessments that better inform and shape government and industry research strategy and resource allocation.

Materials

Ground truth

The Department of Energy (DOE) has a specific interest in critical materials related to the energy economy. The DOE identifies critical materials through analysis of their use (demand) and supply. The approach balances an analysis of market dynamics (the vulnerability of materials to economic, geopolitical, and natural supply shocks) with technological analysis (the reliance of certain technologies on various materials). The DOE's R&D agenda is directly informed by assessments of material criticality. The DOE, the National Research Council, and the European Economic and Social Committee have all articulated a need for better measurements of material criticality. However, criticality depends on a multitude of different factors, including socioeconomic factors.[4] Various organizations across the world define resource criticality according to their own independent metrics and methodologies, and designations of criticality tend to vary dramatically.[4][5][6][7][8][9][10]

Experts tasked with assessing the role of materials must make decisions about what materials to focus on, what applications to review, what data sources to consult, and what analyses to pursue.[10] The amount of data available to assess is vast and far too large for any single analyst or organization to address comprehensively. In addition, to the best of our knowledge, previous assessments of material criticality have not involved a comprehensive review of scientific research on material use. [Graedel and colleagues have published extensively using raw data on supply and other indicators to measure criticality, see Graedel et al. (2012[11], 2015[10]) and Panousi et al. (2016[12]) and the references contained within.] Recent developments in text analytic computational approaches present a unique opportunity to develop new analytic approaches for assessing material criticality in a comprehensive, replicable, iterative manner.

The Department of Energy’s 2011 Critical Materials Strategy (CMS) Report uses importance to clean energy as one dimension of the criticality matrix (see Figure 1).[13] In this regard, the DOE report serves as a form of ground truth for the validation of our technique, though the DOE report considered supply risk as the second dimension to criticality, which the analysis described in this paper does not address.


Fig1 Kalathil FrontInResMetAnal2018 2.jpg

Figure 1. 2011 Critical Materials Importance Analysis matrix, published by experts at the Department of Energy. This matrix served as ground truth for validation.

Scientific publications

Data on scientific research articles was obtained from the Web of Science (WoS) database available from Thomson Reuters (now Clarivate Analytics). WoS contains bibliographic metadata records for published research articles. In principle, we could have analyzed this entire database; however, for budget-related reasons, the document set was limited by a topic search of keywords appearing in a document's title, abstract, author-provided keywords, or WoS-added keywords for the following:

  • the 16 materials listed in the 2011 CMS, or
  • the 285 unique alloys/composites of the 16 critical materials.

The document set was also limited to articles published between 1992 and 2011, the 20-year period leading up to the DOE's most recent critical material assessment.

The 16 materials listed in the 2011 CMS include europium, terbium, yttrium, dysprosium, neodymium, cerium, tellurium, lanthanum, indium, lithium, gallium, praseodymium, nickel, manganese, cobalt, and samarium. Excluded from our document set was any publication appearing in 80 fields considered not likely to cover research in scope (e.g., fields in the social and biological sciences). We used the 16 materials listed above because we were interested in validating a methodology against the 2011 CMS report, and these are the materials mentioned therein. The resulting data set consisted of approximately 438,000 abstracts of scientific papers published from 1992 to 2011.
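
As a rough illustration of this screening, a minimal Python sketch, assuming each record carries a combined title/abstract/keyword text field, a publication year, and a subject field; the record layout and the two excluded fields shown are hypothetical stand-ins, and the alloy/composite keyword list is omitted for brevity:

MATERIALS = {"europium", "terbium", "yttrium", "dysprosium", "neodymium",
             "cerium", "tellurium", "lanthanum", "indium", "lithium",
             "gallium", "praseodymium", "nickel", "manganese", "cobalt",
             "samarium"}
EXCLUDED_FIELDS = {"sociology", "zoology"}  # stand-ins for the 80 excluded fields

def in_scope(record):
    """record: dict with 'text' (title + abstract + keywords), 'year', 'field'."""
    text = record["text"].lower()
    return (1992 <= record["year"] <= 2011
            and record["field"] not in EXCLUDED_FIELDS
            and any(m in text for m in MATERIALS))

records = [
    {"text": "Lithium-ion battery cathode materials", "year": 2005, "field": "chemistry"},
    {"text": "Attitudes toward battery recycling", "year": 2005, "field": "sociology"},
]
print([r["text"] for r in records if in_scope(r)])  # keeps only the first record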

Methods

Text analytics and coclustering

The principle behind coclustering is the statistical analysis of the occurrences of terms in the text. This includes the processing of the relationships both between terms and neighboring (or nearby) terms, and between terms and the documents in which they occur. The approach presented here grouped papers by looking for sets of papers containing similar sets of terms. As detailed below, our analytic methods process meaning beyond simple counts of words, and thus, for example, put papers about earthquakes and papers about tremors in the same group, but would exclude papers in the medical space that discuss hand tremors.

Coclustering is based on an important technique in natural language processing which involves the embedding of terms into a real vector space; i.e., each word of the language is assigned a point in a high-dimensional space. Given a vector representation for a term, terms can be clustered using standard clustering techniques (also known as cluster analysis), such as hierarchical agglomeration, principal components analysis, K-means, and distribution mixture modeling.[14] This was first done in the 1980s under the name latent semantic analysis (LSA).[15] In the 1990s, neural networks were applied to find embeddings for terms using a technique called context vectors.[16][17] A Bayesian analysis of context vectors in the late 1990s provided a probabilistic interpretation and enabled the application of information-theoretic techniques.[16][18] We refer to this technique as Association Grounded Semantics (AGS). A similar Bayesian analysis of LSA resulted in a technique referred to as probabilistic LSA[19], which was later extended to a technique known as latent Dirichlet allocation (LDA).[20] LDA is commonly referred to as “topic modeling” and is probably the most widely applied technique for discovering groups of similar terms and similar documents. Much more recently, Google directly extended the context vector approach of the early 1990s to derive word embeddings using much greater computing power and much larger datasets than had been used in the past, resulting in the word2vec product, which is now widely known.[21][22]

The LDA model assumes that documents are collections of topics, and those topics generate terms, making it difficult to apply LDA to terms other than in the contexts of documents. Because coclustering treats the document as a context for a term, any other context of a term can be substituted in the coclustering model; for example, contexts may be neighboring terms, capitalization, or punctuation. This allows us to apply coclustering to a much wider variety of feature types than is accommodated by LDA. In particular, “distributional clustering” (clustering of terms based on their distributions over nearby terms), which has been proven useful in information extraction,[23][24] is captured by coclustering. In future work, we anticipate recognizing material names and application references using these techniques.

Word embeddings are primarily used to solve the “vocabulary problem” in natural language, which is that many ways exist to describe the same thing, so that a query for “earthquakes” will not necessarily pick up a report on a “tremor” unless some generalization can be provided to produce soft matching. The embeddings produce exactly such a mapping. Applying the information-theoretic approach called AGS led to the development of coclustering[25], one of the key text analytic tools used in this research.

Information-theoretic coclustering is the simultaneous estimation of two partitions (a mutually exclusive, collectively exhaustive collection of sets) of the values of two categorical variables (such as “term” and “document”). Each member of a partition is referred to as a cluster. Formally, if X ranges over terms x0, x1, …, Y ranges over documents y0, y1, …, and Pr(X = x, Y = y) is the probability of selecting an occurrence of term x in document y given that an arbitrary term occurrence is selected from a document corpus, then the mutual information I(X;Y) between X and Y is given by:

I(X;Y) = Σx,y Pr(X = x, Y = y) log [ Pr(X = x, Y = y) / ( Pr(X = x) Pr(Y = y) ) ]

We seek a partition A = {a0, a1, …} over X and a partition B = {b0, b1, …} over Y such that I(A;B) is as high as possible. Since the information in the A, B co-occurrence matrix is derived from the information in the X, Y co-occurrence matrix, maximizing I(A;B) is the same as minimizing I(X;Y) − I(A;B): we are compressing the original data (by replacing terms with term clusters and documents with document clusters) while minimizing the information lost due to compression.
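
To make the compression view concrete, the following self-contained Python sketch computes I(X;Y) from a toy term-by-document count matrix and I(A;B) after merging rows and columns according to candidate partitions; the counts and partitions are invented for illustration, not taken from the study:

import math

def mutual_information(counts):
    """I(X;Y) in bits for a joint count matrix (list of rows)."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c:
                p = c / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def compress(counts, row_part, col_part):
    """Merge rows/columns according to cluster assignments (lists of cluster ids)."""
    out = [[0] * (max(col_part) + 1) for _ in range(max(row_part) + 1)]
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            out[row_part[i]][col_part[j]] += c
    return out

# Toy term-by-document counts: terms 0-1 co-occur in docs 0-1, terms 2-3 in docs 2-3
counts = [[4, 3, 0, 0],
          [3, 4, 0, 0],
          [0, 0, 4, 3],
          [0, 0, 3, 4]]
A = [0, 0, 1, 1]  # term partition: two term clusters
B = [0, 0, 1, 1]  # document partition: two document clusters
print(mutual_information(counts))                  # I(X;Y), about 1.01 bits
print(mutual_information(compress(counts, A, B)))  # I(A;B) = 1.0 bit, little loss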

Compound terms were discovered from the data through a common technique[26] in which sequences are considered to be compound terms if the frequency of the sequence is significantly greater than that predicted from the frequencies of the individual terms under the assumption that their occurrences are independent. As an example, when reading an American newspaper, the term “York” occurs considerably more frequently after the term “New” than it occurs in the newspaper overall, leading to the conclusion that “New York” should be treated as a single compound term. We formalize this as follows. Let Pr(Xi = x0) be the probability that the term occurring at an arbitrarily selected position Xi in the corpus is the term x0. Then, Pr(Xi = x0, Xi+1 = x1) is the probability of the event that x0 is seen immediately followed by x1. If x0 and x1 occur independently of each other, then we would predict Pr(Xi = x0, Xi+1 = x1) = Pr(Xi = x0)Pr(Xi+1 = x1). To measure the amount that the occurrences depend on each other, we measure the ratio:

Pr(Xi = x0, Xi+1 = x1) / [ Pr(Xi = x0) Pr(Xi+1 = x1) ]

As this ratio becomes significantly higher than one, we become more confident that the sequence of terms should be treated as a single unit. This technique was iterated to provide for the construction of longer compound terms, such as “superconducting quantum interference device magnetometer.” The compound terms that either start or end with a word from a fixed list of prepositions and determiners (commonly referred to as “stopwords”) were deleted, in order to avoid having sequences such as “of platinum” become terms. This technique removes any reliance on domain-specific dictionaries or word lists, other than the list of prepositions and determiners.
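
A minimal Python sketch of the bigram test just described, over a toy corpus; the threshold and minimum count are illustrative knobs, and in practice the test is iterated to build longer compounds:

from collections import Counter

STOPWORDS = {"of", "the", "in", "."}  # illustrative stopword list

def compound_candidates(tokens, threshold=3.0, min_count=2):
    """Flag adjacent pairs whose observed frequency clearly exceeds the
    frequency predicted if the two terms occurred independently."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    out = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count or w1 in STOPWORDS or w2 in STOPWORDS:
            continue  # skip rare pairs and compounds like "of platinum"
        expected = (unigrams[w1] / n) * (unigrams[w2] / n) * (n - 1)
        out[(w1, w2)] = c / expected
    return {pair: r for pair, r in out.items() if r > threshold}

text = ("new york is large . new york has papers . york alone is rare . "
        "the new idea was new").split()
print(compound_candidates(text))  # {('new', 'york'): ~3.5}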

Coclustering algorithms (detailed above) produce clusters of similar terms based on the titles and abstracts in which they appear, while grouping similar documents based on the terms they contain: thus, the name cocluster. In this project, a "document" is defined as a combined title and abstract. The process can be thought of as analogous to solving two equations with two unknowns. The process partitions the data into a set of collectively exhaustive document and term clusters resulting in term clusters that are maximally predictive of document clusters and document clusters that are maximally predictive of term clusters.

Coclustering results can be portrayed as an M × N matrix of term clusters (defined by the terms they contain) and document clusters (defined by the documents they contain). Terms that appear frequently in similar sets of documents are grouped together, while documents that mention similar terms are grouped together. Each term will appear in one, and only one, term cluster, while each document will appear in one, and only one, document cluster. When deciding on how many clusters to start with (term or document, i.e., M and N), there is a tradeoff between breadth and depth. The goal is to differentiate between sub-topics while including a reasonable range of technical discussion within a single topic. The balance between breadth and depth is reflected in the number of clusters that are created. Partitioning into a larger number of clusters would result in more narrowly defined initial clusters, but might mischaracterize topics that span multiple clusters. On the other hand, partitioning into fewer clusters would capture broader topics. Researchers interested in a more fine-grained understanding of materials use, e.g., interested in isolating a set of documents focused on a narrower scientific or technological topic, would need to sub-cluster these initial broad clusters in later analyses.

Terms in term clusters, along with titles and abstracts (i.e., document data) from document clusters, reveal the content of each cluster. This information provides the basis for identifying what each term cluster “is about” and for selecting term clusters for further scrutiny. Term clusters of interest were filtered using a glossary, in this case, terms pertaining to common applications of the 16 critical materials. This list of terms was manually created through a brief literature review of common applications associated with the materials in question.
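
A small sketch of this glossary screen, assuming each term cluster is represented simply as the set of its member terms; the glossary and clusters shown are stand-ins for the applications list and the hundreds of clusters used in the study:

glossary = {"battery", "magnet", "photovoltaic", "phosphor", "catalyst"}  # stand-ins

clusters = {
    101: {"anode", "cathode", "battery", "electrolyte"},
    102: {"soil", "tulip", "bulb", "flower"},
}

# Keep only term clusters that mention at least one glossary application term
of_interest = {cid: terms for cid, terms in clusters.items() if terms & glossary}
print(sorted(of_interest))  # -> [101]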

Each term cluster was correlated with each of the 437,978 scientific abstracts in our corpus, and the degree of similarity between each term cluster and each abstract was determined through an assessment of their mutual information. As discussed, the term clusters and document clusters described above were selected so as to maximize the mutual information between term clusters and document clusters. In order to find the document abstracts most strongly associated with a given term cluster, we want to choose those abstracts which are most predictive of the term cluster. These are the abstracts that contain the words in the cluster, but especially the words in the cluster that are rare in the corpus in general. We formalize this by defining the association of a term cluster t and document abstract d as the value of the (t,d)-term of the mutual information formula.

To formalize this, we consider uniformly randomly selecting an arbitrary term occurrence from the entire document set, and we write Pr(T = t, D = d) for the probability that the term was a member of cluster t and that the occurrence was in document d. We adopt the maximum likelihood estimate for this probability:

Pr(T = t, D = d) = n(t,d) / N

where n(t,d) is the number of occurrences of terms from term cluster t in document abstract d, and N is the total number of term occurrences in all documents: N = ∑t,d n(t,d).

We define the association between t and d by the score:

score(t,d) = Pr(T = t, D = d) log [ Pr(T = t, D = d) / ( Pr(T = t) Pr(D = d) ) ]

Given this score, document abstracts can be arranged from most associated to least associated with a given term cluster. This was done for all 437,978 abstracts in our corpus and all term clusters. This methodology allowed the identification of those abstracts most closely associated with each term cluster.
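
Putting the estimate and the score together, a short Python sketch that ranks toy abstracts by their association with a term cluster; the counts n(t,d) are invented for illustration:

import math

# n[t][d]: occurrences of terms from term cluster t in document abstract d (toy data)
n = {"t_battery": {"doc1": 6, "doc2": 1, "doc3": 0},
     "t_magnet":  {"doc1": 0, "doc2": 2, "doc3": 5}}

N = sum(c for row in n.values() for c in row.values())
p_t = {t: sum(row.values()) / N for t, row in n.items()}
p_d = {d: sum(row[d] for row in n.values()) / N for d in ["doc1", "doc2", "doc3"]}

def score(t, d):
    """(t,d)-term of the mutual information: Pr(t,d) * log(Pr(t,d) / (Pr(t) Pr(d)))."""
    p_td = n[t][d] / N
    if p_td == 0.0:
        return 0.0
    return p_td * math.log2(p_td / (p_t[t] * p_d[d]))

ranked = sorted(n["t_battery"], key=lambda d: score("t_battery", d), reverse=True)
print(ranked)  # abstracts most associated with the battery term cluster come first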

Not all term clusters were easy to interpret. As with the construction of an appropriate filtering glossary, deciding on the appropriate number of sub-clusters can be subjective. In general, the more diffuse a term cluster appears to be in its subject matter, the more it should be sub-clustered as a means to separate its diverse topic matter. Even imprecise sub-clustering is effective in narrowing the focus of these clusters. Once we determined the number of sub-clusters to generate, sub-clustering of terms was done in the same general manner as the original coclustering: the same coclustering algorithms were applied only to the terms in the initial cluster, but the algorithm was instructed not to discover any new abstract clusters. Rather, the set of terms in the cluster is grouped into the pre-specified number of bins according to similarity in the already existing groups of documents in which the terms appear.
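
The following sketch imitates this sub-clustering step with an off-the-shelf k-means in place of the paper's coclustering pass: each term is represented by its counts over the already-fixed document clusters and grouped into a pre-specified number of bins. The vectors are toy data, and scikit-learn is assumed to be available:

from sklearn.cluster import KMeans  # stand-in for re-running the coclustering pass

# Each row is a term, represented by its counts over the already-fixed
# document clusters (toy data for the two senses of "bulb")
term_vectors = [
    [9, 1, 0],   # "bulb" in lighting documents
    [8, 2, 0],   # "lamp"
    [0, 1, 9],   # "tulip" in plant documents
    [1, 0, 8],   # "soil"
]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_vectors)
print(labels)  # e.g., [0 0 1 1]: two sub-clusters separating the two senses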

Results

Workflow

This research resulted in the creation of the following workflow, detailed in Figure 2. The workflow was developed to assess the material–application pairing matrices, and it illustrates how text analytics can be used to aid in the assessment of material importance to specific application areas, identify pairings of materials and applications, and augment a human expert’s ability to monitor the use and importance of materials.


Fig2 Kalathil FrontInResMetAnal2018 2.jpg

Figure 2. Diagram of the text analytics workflow that this project developed. This work flow was utilized to identify material–application pairs.

The final data set acquired and ingested into the Copernicus platform contained 437,978 documents. We extracted 83,059 terms from titles and abstracts, including compound (multi-part) terms such as “chloroplatinic acid” and “alluvial platinum.” Initial clustering of terms showed that some term clusters were very precise, while others were less focused. We experimented with multiple clustering sizes to find an optimal size, settling on 400 document clusters and 400 term clusters to start, creating a 400 × 400 matrix of document and term clusters. By manually analyzing the terms in each term cluster, we identified which clusters focused narrowly on areas most relevant to our study, namely those that use materials in clean energy applications. The term cluster in Figure 3 below, for example, focused around lithium-ion batteries, as indicated by the representative term list and an analysis of the most closely associated papers.


Fig3 Kalathil FrontInResMetAnal2018 2.jpg

Figure 3. The coclustering algorithm produced term clusters. This is an example of a narrowly focused cluster.

Analysis relies on our ability to extract meaningful statistics about the dataset from term clusters, which can be poorly defined, as seen in the example term cluster in Figure 4. When term clusters are poorly defined, our ability to interpret the statistics we extract is limited. There are multiple ways to differentiate term clusters. Principal among these was the division between term clusters that cover basic scientific research and those that focus on specific technological applications. In addition, because this project used the 2011 DOE CMS as a validation source, a differentiation between clean energy and non-clean energy relevance was also necessary, especially when the same material discoveries were used in both a clean energy and a non-clean energy context. More broadly, if a researcher is interested in a specific application area, a division between that application area and others can be used as the means for differentiation.


Fig4 Kalathil FrontInResMetAnal2018 2.jpg

Figure 4. The coclustering algorithm produced term clusters. This is an example of an unfocused cluster that spans basic science research and non-clean energy research.

As discussed in the “Methods” section for this project, we manually determined the number of necessary sub-clusters. For example, one of the glossary terms we used was “bulbs,” and a term cluster was identified as the bulb cluster. Once it was examined closely, however, it was clear that it contained material about both plant bulbs and light bulbs, as shown in Figure 5. Accordingly, this cluster was split (sub-clustered), and the result was two clearly defined clusters.


Fig5 Kalathil FrontInResMetAnal2018 2.jpg

Figure 5. Unfocused clusters were then sub-clustered to produce more usable and focused clusters, as shown in this figure. A bulb term cluster was split; the sub-clustering algorithms produced two new clusters, as shown above. The new, smaller clusters were more focused than the larger cluster.

The process of sub-clustering expanded the initial list of 42 application term clusters into a total of 134 relevant term clusters for analysis. Clusters were considered “relevant” based on the weights assigned to them. For this project, we developed a methodology for weighting clusters to replicate the ground truth of the 2011 DOE CMS.

===Weighting===

One of the principal questions addressed by the 2011 DOE CMS was the importance of specific materials to clean energy applications. The DOE ranked the 16 materials based on their importance to clean energy and their supply risk, as detailed in Figure 1. We developed a methodology to replicate the y-axis of this figure (material importance to clean energy) by combining the data on material distribution over clusters with an assessment of the clean energy importance of each cluster. To assess the clean energy importance of each of these clusters, we developed a clean energy importance weighting. A set of key words was constructed manually, and the top 500 associated abstracts of each term cluster were searched for these key words. Constructing this glossary is a crucial step that allows researchers to define their key words of interest. Document abstracts were analyzed for mentions of any clean energy field per cluster, as well as for the number of different clean energy fields mentioned across the cluster. The clean energy weighting considers the extent and depth of impact within a cluster, and is equal to:

:<math>\text{clean energy weight} = \frac{\text{number of abstracts mentioning at least one clean energy field}}{\text{number of distinct clean energy fields mentioned in the cluster}}</math>
This weighting essentially captures the number of abstracts per clean energy field. The goal of the weighting is to discount clusters that mention a number of different clean energy fields, but do not discuss these clean energy fields in a substantial or significant manner. Materials that were mentioned frequently in clusters with high clean energy weights can be thought of as “important” to clean energy. The importance of each material to clean energy was determined by counting the number of document mentions in clean energy important clusters (i.e., clusters above some cutoff of minimum clean energy weight).
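
A minimal sketch of this weighting as we read the prose description above, with field_keywords assumed to map each of the 33 clean energy fields to its manually chosen key words (the mapping itself is not published in the paper):

<syntaxhighlight lang="python">
def clean_energy_weight(abstracts, field_keywords):
    """Weight one term cluster: abstracts (of its top 500) mentioning at
    least one clean energy field, divided by the number of distinct
    fields mentioned anywhere in the cluster. Clusters that touch many
    fields only shallowly are thereby discounted."""
    fields_seen, hits = set(), 0
    for text in abstracts:
        text = text.lower()
        matched = {field for field, kws in field_keywords.items()
                   if any(kw in text for kw in kws)}
        if matched:
            hits += 1
            fields_seen |= matched
    return hits / len(fields_seen) if fields_seen else 0.0
</syntaxhighlight>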

Material-to-clean-energy-field application pairings were derived from term clusters. Recall that for each term cluster, the number of abstracts mentioning any of the 16 materials was determined, as well as the number of abstracts mentioning any of 33 clean energy field keywords. In the importance analysis previously discussed, mentions of the 33 clean energy fields were aggregated together to establish the clean energy weighting system. For each clean energy field, all term clusters were ranked based on a normalized count of document mentions of that field.

From this ranking, the “top” three term clusters were identified for a specific field for review. The terms and keywords were then manually reviewed to ensure that the term clusters were actually relevant, and the occasional false positives (term clusters that score high on keyword counts but are not in fact relevant based on human content analysis) were discarded. This manual review, while requiring human intervention, is done on a significantly narrowed set of documents and thus does not represent a major bottleneck in the process. Having linked clusters to a specific clean energy field, we evaluated material importance to each field by examining the distribution of material mentions across associated documents. The number of mentions divided by the average number of mentions across all 134 relevant term clusters was used to account for keywords or phrases that were mentioned at high frequency. In the photovoltaics case, this normalized number of mentions dropped significantly after the first three term clusters, which were manually reviewed to ensure that they were relevant.
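
A sketch of this ranking and normalization step, assuming a per-cluster count of documents mentioning the field of interest has already been computed:

<syntaxhighlight lang="python">
import numpy as np

def top_clusters_for_field(field_mention_counts, k=3):
    """Rank term clusters for one clean energy field by document mentions
    normalized by the mean across all relevant clusters; return the top
    k (cluster index, score) pairs for manual relevance review."""
    counts = np.asarray(field_mention_counts, dtype=float)
    mean = counts.mean()
    normalized = counts / mean if mean > 0 else counts  # mean over all 134 clusters
    order = np.argsort(-normalized)
    return [(int(i), float(normalized[i])) for i in order[:k]]
</syntaxhighlight>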

===Extracting statistics===

Statistics related to material importance and material–application pairings were extracted from this final set of 134 relevant term clusters. For the purpose of this project, the top 500 abstracts from each of the 134 final clusters were analyzed, for a resulting set of 49,573 abstracts. Each of these abstracts was automatically searched for mentions of the 16 materials identified in the DOE strategy report (these counts included any compounds or alloys associated with those minerals), providing a count and distribution of material mentions across the final set of granular term clusters that served as the basis for subsequent manual analysis. An alternative to coclustering is hierarchical clustering, in which terms are joined with their nearest matches only, then each cluster is joined with its nearest match only, and so on. Such an approach makes sub-clustering trivial by reducing it to undoing the merges that generated a given cluster. We opted against this structure because, in past experience, hierarchical clustering has generated qualitatively inferior clusterings of terms.
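
The mention-counting step can be sketched as a simple substring search, under the assumption (not enumerated in the paper) that each material comes with a hand-built list of its compounds and alloys:

<syntaxhighlight lang="python">
from collections import Counter

def material_mentions(abstracts, material_terms):
    """Count how many of a cluster's top abstracts mention each material,
    where material_terms maps a material to its assumed variants, e.g.
    {"lithium": ["lithium", "li-ion"], "tellurium": ["tellurium", "cdte"]}."""
    counts = Counter()
    for text in abstracts:
        text = text.lower()
        for material, variants in material_terms.items():
            if any(v in text for v in variants):
                counts[material] += 1
    return counts
</syntaxhighlight>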

Figure 6 displays the count of how many times each material was mentioned in the 49,573 abstracts that had the highest mutual information with the final 134 relevant term clusters (the 500 abstracts with the most mutual information with each of the 134 relevant term clusters). Material mentions were counted in the papers most closely associated with each term cluster, revealing which materials were most strongly associated with which terms. Thus, the methodology provides a way to analyze the distribution of materials over different topics, setting the foundation for material importance and material–application pair analysis.


Fig6 Kalathil FrontInResMetAnal2018 2.jpg

Figure 6. The number of documents in the 134 relevant clusters that contained a specific element varied from under 1,000 to over 12,000, as shown above.

===Validation of method===

To compare the results with the CMS, we need to weight for clean energy, as described above. Multiple clean energy cutoffs were reviewed, and ultimately a cutoff of 26 was employed; this corresponds to the beginning of the step up in the distribution presented in Figure 7. Term clusters with a clean energy weight of less than 26 were thus excluded.


Fig7 Kalathil FrontInResMetAnal2018 2.jpg

Figure 7. To determine the importance rankings for clean energy, clusters were weighted using a weight reflecting their relevance to clean energy, as shown here. Weights were determined using a “clean energy” glossary manually created by the analysts. When applied, the revised plant “bulb” cluster has a low clean energy weight, while the revised LED “bulb” cluster has a high clean energy weight.

Material mentions over the top 500 abstracts associated with the 19 “clean energy” term clusters were aggregated, yielding a measure of overall importance to clean energy, as seen in Figure 8. This measure counted the number of times each material was mentioned in the top 500 abstracts associated with the 19 term clusters that are highly important to clean energy, treating those term clusters as one set. The top and the bottom of our importance determination matched the top and bottom of the DOE’s list, though there was some variation in the middle. In a real-world analysis, such variation would prompt the analyst to consider further study and to ask whether some aspects of a material’s clean energy use and importance merit reassessment.
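
Put together, this overall importance ranking reduces to pooling the mention counts of clusters that clear the weight cutoff. A sketch, with cluster_weights and cluster_mentions assumed to be outputs of the weighting and counting steps sketched earlier:

<syntaxhighlight lang="python">
from collections import Counter

def clean_energy_importance(cluster_weights, cluster_mentions, cutoff=26):
    """Aggregate material mention counts over every cluster whose clean
    energy weight meets the cutoff (19 clusters in this study), treating
    their top abstracts as one pooled set."""
    totals = Counter()
    for cluster_id, weight in cluster_weights.items():
        if weight >= cutoff:
            totals.update(cluster_mentions[cluster_id])
    return totals.most_common()   # materials, most to least "important"
</syntaxhighlight>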


Fig8 Kalathil FrontInResMetAnal2018 2.jpg

Figure 8. The importance rankings as determined by SRI methodology. Red denotes more important, while green denotes less important. Results from the clustering algorithms matched the expert-produced high and low importance rankings.

===Material–application pairings===

SRI utilized the term and document clusters to develop a matrix of material–application pairs for comparison against the ground truth: the DOE 2011 CMS matrix that mapped materials to specific clean energy technologies (see Figure 9). The prevalence of different materials in a given clean energy field was considered, as defined and described above.


Fig9 Kalathil FrontInResMetAnal2018 2.jpg

Figure 9. Material–application pairing matrix from the U.S. Department of Energy Critical Materials Strategy Report, December 2011. These classifications were used to validate the results from the clustering algorithms.

Figure 10 displays three term clusters as an illustration of the results and filtering step. The term cluster labeled 388 discussed the economic implications of photovoltaic technologies, and so it was discarded after manual review. The other two clusters can be used to measure the distribution of material mentions across the associated abstracts of the top term clusters, an indicator of material importance to that specific application.


Fig10 Kalathil FrontInResMetAnal2018 2.jpg

Figure 10. Three term clusters as an illustration of the results and filtering step. Analysis of the terms and associated abstracts from the “top” three photovoltaic term clusters reveals that two are relevant while one is not.

Each material was scored based on a normalized number of mentions across abstracts associated with each of the term clusters of analysis. In this case, the results from our methodology mirror the results from the DOE’s report exactly: indium, gallium, and tellurium are considered the most important materials to photovoltaics (see Table 1). Similarly, the results for magnets mirrored the DOE’s results (see Table 2).


Tab1 Kalathil FrontInResMetAnal2018 2.jpg

Table 1. Materials/application pairing matrix comparing the results from our methodology with the Department of Energy (DOE)’s CM strategy report for the photovoltaic technology and coatings component.

Tab2 Kalathil FrontInResMetAnal2018 2.jpg

Table 2. Materials/application pairing matrix comparing the results from our methodology with the Department of Energy (DOE)’s CM strategy report for wind turbines and vehicle technologies, and magnet component.

===Topic replacement===

One potential application of text analytics is the ability to examine trends in material use both within and between term and document clusters to measure how technology may be changing or “trending” within specific application areas. Work on detection of technology emergence has focused on keyword occurrence, sometimes within the context of an existing taxonomy.[27] SRI's method produces both the terms and the applications from the corpus itself. Manual evaluation determined that the term cluster labeled 361 mentioned nickel–metal hydride and other battery technologies, specifically in the context of electric vehicles. In 2008, document counts for lithium surpassed document counts for nickel in term cluster 361 (see Figure 11). The graph suggests that around 2008, nickel–metal hydride battery technology for electric vehicles had reached a point of relative maturity, while lithium-ion batteries for use in electric vehicles were a less mature technology, and research shifted to focus on advancing that immature technology. While we cannot draw any conclusions about the actual usage of these technologies from these data, the case study demonstrates the potential of text analytics to analyze trends over time and to identify how materials and technologies may replace one another. We also observed this pattern in dye-sensitized solar cells in a previous project.[28]
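
The trend comparison behind Figure 11 can be sketched as a simple moving average over yearly document counts (the counts themselves are assumed to have been tallied per material and per year for cluster 361):

<syntaxhighlight lang="python">
import numpy as np

def moving_average(yearly_counts, window=4):
    """4-year moving average of per-year document counts."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(yearly_counts, dtype=float), kernel,
                       mode="valid")

# Find the first smoothed year in which lithium overtakes nickel, e.g.:
# li_ma, ni_ma = moving_average(li_counts), moving_average(ni_counts)
# crossover = next(i for i, (a, b) in enumerate(zip(li_ma, ni_ma)) if a > b)
</syntaxhighlight>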


Fig11 Kalathil FrontInResMetAnal2018 2.jpg

Figure 11. 4-year moving averages for lithium and nickel document counts in term cluster 361. These results from SRI algorithms show decreasing activity in battery research using nickel hydrides by 2008 while lithium research increased for battery applications.

==Discussion==

The results presented show the application of a text analytics method to extract meaningful information for use to evaluate the progress of research and development. This tool automates away a large amount of manual labor; however, human intervention is still necessary. Human experts are required to define the parameters of evaluation. Fundamentally, this methodology is not designed to replace human analysis or input nor is it intended to act independently. Instead, it is a tool that researchers can use to more effectively and efficiently analyze their entire domain of research, while also reaching into tangential domains that contain relevant concepts, components, and ideas. Researchers can utilize this methodology to perform objective, replicable, and adaptable reviews of the relative importance of individual components to work toward an understanding of how different pieces fit together. This methodology can be applied to help researchers and inventors better understand how specific components or materials are involved in a given technology or research stream, thereby increasing their potential to create new inventions or discover new scientific findings.

The specific manual construction of the glossary is a crucial step in this methodology. The choice of screening terms has an obvious impact on which clusters are chosen for analysis, so the glossary must be assembled with the end goal in mind. One could use a more limited glossary if the domain of interest were narrow and known. Alternatively, instead of screening various term clusters in and out, one could keep all clusters and use weightings to rate their relevance. In this case, we created our glossary by selecting keywords from a broad list of material applications (both clean energy relevant and not) to ensure that we did not artificially restrict our results early in the process. This selection was done purely manually, based on our analysts’ background knowledge of applications of critical materials. This screening identified 42 application term clusters of interest, about 10% of all term clusters.

==Future work==

Obvious extensions to this work include text analysis of additional document corpora, more analysis of trends over time, and more sophisticated use of text analytics; for example, to include natural language processing approaches. We did preliminary clustering of patent data that suggested a path forward similar to what we did for papers. New types of insight, for example on the influence of external conditions on materials research, might be obtained from looking at non-technical corpora such as news articles.

The workflow developed for this project allows a subject matter analyst to leverage state-of-the-art text analytic tools without requiring those tools to produce perfect output (which is significant, because the state of the art in text analysis produces enormous amounts of noise when applied in any practical setting). We are able to reduce the overall manual effort required to understand the content of large volumes of scientific reporting, at the cost of shifting some of that effort to tasks dealing with the text analytic tools. For a targeted investigation such as ours, analysts will always select the target and shape the investigation.

We used term clustering primarily as a way for analysts to collect “concepts” from text, but of course there are many ways to carve up the concept space. In our workflow, analysts were able to specify sets of terms that should be further subdivided, but an improved interface to these tools would allow a user to suggest a split and see the resulting changes immediately, as opposed to the semi-manual process we currently have, in which the data analyst invokes a standalone process to subdivide individual clusters. This was applied, for example, to distinguish the two different senses of the term “bulb”: as lighting and as a type of plant. Polysemy of this type is common in natural language, and experimental techniques exist for automatically identifying and resolving it.[23][24][29] Incorporating these techniques into our workflow, and extending them where possible, would reduce the effort required of the analyst by providing superior semantic distinctions with each round of clustering. It is possible that the alternative clustering technique LDA, referred to in the background section, would address this problem in ways complementary to the coclustering approach, as it allows terms to be members of multiple topics rather than requiring each term to belong to a single cluster. In addition, metrics such as entropy or information gain could be used to automatically recognize the clusters most likely to need splitting, although we expect that such metrics will not be sufficient to replace human judgment.

Set expansion techniques[30][31] have been developed for finding a set of terms which are related to a given set of terms in the same way that those terms are related to each other. For example, the word “Achilles” is related to many words in Greek mythology, the word “Alabama” is related to U.S. state names and famous Alabamans, and the word “Queen Mary” is related to historical political figures. When taken together, however, the main thing that these terms have in common is that they are all ship names, and set expansion techniques find such hidden connections and then find additional terms sharing the relationship. Applying such techniques to keyword lists for target technologies would reduce the burden on the analyst to come up with comprehensive and specific keyword lists for targeting concepts.

==Acknowledgements==

===Author contributions===

NK contributed to the design and performed analysis. JB designed the text analytics workflow and extended the Copernicus tools to apply to this project. LR led the development of the overall analytical approach. DH managed the data, applied Copernicus, and implemented extensions designed by JB. CF contributed to the design of the study and methodology. All authors contributed to the drafting of the manuscript.

===Conflict of interest statement===

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

===Funding===

This work was supported by the National Renewable Energy Laboratory through contract number AEU-6-62527-01. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Renewable Energy Laboratory or its staff. The authors acknowledge John Chase for his contributions to the initial proposal and project plan. The authors wish to thank Alberta Carpenter, Joe Cresko, Fletcher Fields, and Rod Eggert for their helpful comments during the project. We thank the reviewers for their helpful comments.


==References==

  1. Arthur, W.B. (2009). The Nature of Technology: What It Is and How It Evolves. Simon and Schuster. p. 256. ISBN 9781439165782. 
  2. National Science Board (11 January 2016). "Science and Engineering Indicators 2016" (PDF). National Science Foundation. pp. 899. https://www.nsf.gov/statistics/2016/nsb20161/uploads/1/nsb20161.pdf. 
  3. "IARPA Launches New Program to Enable the Rapid Discovery of Emerging Technical Capabilities". Office of the Director of National Intelligence. 27 September 2011. https://www.dni.gov/index.php/newsroom/press-releases/press-releases-2011/item/327-iarpa-launches-new-program-to-enable-the-rapid-discovery-of-emerging-technical-capabilities. 
  4. Poulton, M.M.; Jagers, S.C.; Linde, S. et al. (2013). "State of the World's Nonfuel Mineral Resources: Supply, Demand, and Socio-Institutional Fundamentals". Annual Review of Environment and Resources 38: 345–371. doi:10.1146/annurev-environ-022310-094734. 
  5. Committee on Critical Mineral Impacts on the U.S. Economy (2008). Minerals, Critical Minerals, and the U.S. Economy. National Academies Press. pp. 262. doi:10.17226/12034. ISBN 9780309112826. https://www.nap.edu/catalog/12034/minerals-critical-minerals-and-the-us-economy. 
  6. Committee on Assessing the Need for a Defense Stockpile (2008). Managing Materials for a Twenty-first Century Military. National Academies Press. pp. 206. doi:10.17226/12028. ISBN 9780309177924. https://www.nap.edu/catalog/12028/managing-materials-for-a-twenty-first-century-military. 
  7. Erdmann, L.; Graedel, T.E. (2011). "Criticality of non-fuel minerals: A review of major approaches and analyses". Environmental Science and Technology 45 (18): 7620–30. doi:10.1021/es200563g. PMID 21834560. 
  8. "Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions Tackling the Challenges in Commodity Markets and on Raw Materials". Eur-Lex. European Union. 2 February 2011. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52011DC0025. 
  9. "Report on Critical Raw Materials for the E.U.: Report of the Ad hoc Working Group on defining critical raw materials" (PDF). European Commission. May 2014. pp. 41. https://ec.europa.eu/docsroom/documents/10010/attachments/1/translations/en/renditions/pdf. 
  10. Graedel, T.E.; Harper, E.M.; Nassar, N.T. et al. (2015). "Criticality of metals and metalloids". Proceedings of the National Academy of Sciences of the United States of America 112 (14): 4257–62. doi:10.1073/pnas.1500415112. PMC 4394315. PMID 25831527. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394315. 
  11. Graedel, T.E.; Barr, R.; Chandler, C. et al. (2012). "Methodology of metal criticality determination". Environmental Science and Technology 46 (2): 1063–70. doi:10.1021/es203534z. PMID 22191617. 
  12. Panousi, S.; Harper, E.M.; Nuss, P. et al. (2016). "Criticality of Seven Specialty Metals". Journal of Industrial Ecology 20 (4): 837–853. doi:10.1111/jiec.12295. 
  13. Office of Policy and International Affairs (December 2011). "Critical Materials Strategy" (PDF). U.S. Department of Energy. https://www.energy.gov/sites/prod/files/DOE_CMS2011_FINAL_Full.pdf. 
  14. Hastie, T.; Tibshirani, R.; Friedman, J. (2009). The Elements of Statistical Learning. Springer. pp. 745. ISBN 9780387848570. 
  15. Furnas, G.W.; Landauer, T.K.; Gomez, L.M.; Dumais, S.T. (1987). "The vocabulary problem in human-system communication". Communications of the ACM 30 (11): 964–971. doi:10.1145/32206.32212. 
  16. Gallant, S.I.; Caid, W.R.; Carleton, J. et al. (1992). "HNC's MatchPlus system". ACM SIGIR Forum 26 (2): 34–38. doi:10.1145/146565.146569. 
  17. Caid, W.R.; Dumais, S.T.; Gallant, S.I. et al. (1995). "Learned vector-space models for document retrieval". Information Processing and Management 31 (3): 419–429. doi:10.1016/0306-4573(94)00056-9. 
  18. Zhu, H.; Rohwer, R. (1995). "Bayesian invariant measurements of generalization". Neural Processing Letters 2 (6): 28–31. doi:10.1007/BF02309013. 
  19. Hofmann, T. (1999). "Probabilistic latent semantic indexing". Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 50–57. doi:10.1145/312624.312649. 
  20. Blei, D.M.; Ng, A.Y.; Jordan, M.I. (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3 (1): 993–1022. http://www.jmlr.org/papers/v3/blei03a.html. 
  21. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. (7 September 2013). "Efficient Estimation of Word Representations in Vector Space". Cornell University Library. https://arxiv.org/abs/1301.3781. 
  22. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. (2015). "The arcade learning environment: An evaluation platform for general agents". Proceedings of the 24th International Conference on Artificial Intelligence: 4148–4152. 
  23. Freitag, D. (2004). "Toward unsupervised whole-corpus tagging". Proceedings of the 20th International Conference on Computational Linguistics: 357. doi:10.3115/1220355.1220407. 
  24. Freitag, D. (2004). "Trained Named Entity Recognition using Distributional Clusters". Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing: 262–69. http://www.aclweb.org/anthology/W04-3234. 
  25. Byrnes, J.; Rohwer, R. (2005). "Text Modeling for Real-Time Document Categorization". Proceedings of the 2005 IEEE Aerospace Conference. doi:10.1109/AERO.2005.1559610. 
  26. Manning, C.D.; Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press. p. 620. ISBN 9780262133609. 
  27. Eusabi, C.; Silberglitt, R. (2014). "Identification and Analysis of Technology Emergence Using Patent Classification". RAND Corporation. https://www.rand.org/pubs/research_reports/RR629.html. 
  28. Randazzese, L. (2 December 2016). "Helios: Understanding Solar Evolution Through Text Analytics". OSTI.gov. U.S. Department of Energy. doi:10.2172/1336902. https://www.osti.gov/biblio/1336902-helios-understanding-solar-evolution-through-text-analytics. 
  29. Freitag, D.; Blume, M.; Byrnes, J. et al. (2005). "New experiments in distributional representations of synonymy". Proceedings of the Ninth Conference on Computational Natural Language Learning: 25–32. 
  30. Xu, J.; Croft, W.B. (1996). "Query expansion using local and global document analysis". Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 4–11. doi:10.1145/243199.243202. 
  31. Wang, R.C.; Cohen, W.W. (2009). "Automatic set instance extraction using the web". Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: 441–449. 

==Notes==

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically; this version, by design, lists them in order of appearance. The singular footnote from the original article was turned into inline text in parentheses for this version.