Difference between revisions of "User:Shawndouglas/Sandbox"

From LIMSWiki
{{ombox
| type      = notice
| style     = width: 960px;
| text      = This is my primary sandbox page, where I play with features and test MediaWiki code. If you wish to leave a comment for me, please see [[User_talk:Shawndouglas|my discussion page]] instead.<p></p>
}}


==Sandbox begins below==
{{Infobox journal article
|name        =
|image        =
|alt          = <!-- Alternative text for images -->
|caption      =
|title_full  = Application of text analytics to extract and analyze material–application pairs from a large scientific corpus
|journal      = ''Frontiers in Research Metrics and Analytics''
|authors      = Kalathil, Nikhil; Byrnes, John J.; Randazzese, Lucien; Hartnett, Daragh P.; Freyman, Christina A.
|affiliations = Center for Innovation Strategy and Policy and the Artificial Intelligence Center, SRI International
|contact      = Email: christina dot freyman at sri dot com
|editors      =
|pub_year    = 2018
|vol_iss      = '''2'''
|pages        = 15
|doi          = [https://doi.org/10.3389/frma.2017.00015 10.3389/frma.2017.00015]
|issn        = 2504-0537
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]
|website      = [https://www.frontiersin.org/articles/10.3389/frma.2017.00015/full https://www.frontiersin.org/articles/10.3389/frma.2017.00015/full]
|download    = [https://www.frontiersin.org/articles/10.3389/frma.2017.00015/pdf https://www.frontiersin.org/articles/10.3389/frma.2017.00015/pdf] (PDF)
}}
{{ombox
| type      = content
| style    = width: 500px;
| text      = This article should not be considered complete until this message box has been removed. This is a work in progress.
}}
==Abstract==
When assessing the importance of materials (or other components) to a given set of applications, machine analysis of a very large corpus of scientific abstracts can provide an analyst with a base of insights to develop further. The use of text analytics reduces the time required to conduct an evaluation, while allowing analysts to experiment with a multitude of different hypotheses. Because the scope and quantity of [[metadata]] analyzed can, and should, be large, any divergence between what a human analyst determines and what the text analysis shows provides a prompt for the human analyst to reassess any preliminary findings. In this work, we have successfully extracted material–application pairs and ranked them on their importance. This method provides a novel way to map scientific advances in a particular material to the application for which it is used. Approximately 438,000 titles and abstracts of scientific papers published from 1992 to 2011 were used to examine 16 materials. This analysis used coclustering text analysis to associate individual materials with specific clean energy applications, evaluate the importance of materials to specific applications, and assess their importance to clean energy overall. Our analysis reproduced the judgments of experts in assigning material importance to applications. The validated methods were then used to map the replacement of one material with another material in a specific application (batteries).
'''Keywords''': machine learning classification, science policy, coclustering, text analytics, critical materials, big data
==Introduction==
Scientific research and technological development are inherently combinatorial practices.<ref name="ArthurTheNature09">{{cite book |title=The Nature of Technology: What It Is and How It Evolves |author=Arthur, W.B. |publisher=Simon and Schuster |page=256 |year=2009 |isbn=9781439165782}}</ref> Researchers draw from, and build on, existing work in advancing the state of the art. Increasing the ability of researchers to review and understand previous research can stimulate and accelerate scientific progress. However, the number of scientific publications grows exponentially every year, both at the aggregate level and within individual fields.<ref name="NSBScience16">{{cite web |url=https://www.nsf.gov/statistics/2016/nsb20161/uploads/1/nsb20161.pdf |format=PDF |title=Science and Engineering Indicators 2016 |author=National Science Board |publisher=National Science Foundation |pages=899 |date=11 January 2016}}</ref> It is impossible for any single researcher or organization to keep up with the vastness of new scientific publications. The ability to use text analytics to map the current state of the art and detect progress would enable more efficient analyses of data.
The Intelligence Advanced Research Projects Activity recognized the scale problem in 2011, creating the research program Foresight and Understanding from Scientific Exposition. Under this program, SRI and other performers processed “the massive, multi-discipline, growing, noisy, and multilingual body of scientific and patent literature from around the world and automatically generated and prioritized technical terms within emerging technical areas, nominated those that exhibit technical emergence, and provided compelling evidence for the emergence.”<ref name="DNIIARPA11">{{cite web |url=https://www.dni.gov/index.php/newsroom/press-releases/press-releases-2011/item/327-iarpa-launches-new-program-to-enable-the-rapid-discovery-of-emerging-technical-capabilities |title=IARPA Launches New Program to Enable the Rapid Discovery of Emerging Technical Capabilities |publisher=Office of the Director of National Intelligence |date=27 September 2011}}</ref> The work presented here applies and extends that platform to efficiently identify and describe the past and present evolution of research on a given set of materials. This work applies text analytics to demonstrate how these computational tools can be used by analysts to analyze much larger sets of data and develop more iterative and adaptive material assessments to better inform and shape government and industry research strategy and resource allocation.
==Materials==
===Ground truth===
The Department of Energy (DOE) has a specific interest in critical materials related to the energy economy. The DOE identifies critical materials through analysis of their use (demand) and supply. The approach balances an analysis of market dynamics (the vulnerability of materials to economic, geopolitical, and natural supply shocks) with technological analysis (the reliance of certain technologies on various materials). The DOE's R&D agenda is directly informed by assessments of material criticality. The DOE, the National Research Council, and the European Economic and Social Committee have all articulated a need for better measurements of material criticality. However, criticality depends on a multitude of different factors, including socioeconomic factors.<ref name="PoultonState13">{{cite journal |title=State of the World's Nonfuel Mineral Resources: Supply, Demand, and Socio-Institutional Fundamentals |journal=Annual Review of Environment and Resources |author=Poulton, M.M.; Jagers, S.C.; Linde, S. et al. |volume=38 |pages=345–371 |year=2013 |doi=10.1146/annurev-environ-022310-094734}}</ref> Various organizations across the world define resource criticality according to their own independent metrics and methodologies, and designations of criticality tend to vary dramatically.<ref name="PoultonState13" /><ref name="CommitteeMinerals08">{{cite book |url=https://www.nap.edu/catalog/12034/minerals-critical-minerals-and-the-us-economy |title=Minerals, Critical Minerals, and the U.S. Economy |author=Committee on Critical Mineral Impacts on the U.S. 
Economy |publisher=National Academies Press |pages=262 |year=2008 |isbn=9780309112826 |doi=10.17226/12034}}</ref><ref name="CommitteeManaging08">{{cite book |url=https://www.nap.edu/catalog/12028/managing-materials-for-a-twenty-first-century-military |title=Managing Materials for a Twenty-first Century Military |author=Committee on Assessing the Need for a Defense Stockpile |publisher=National Academies Press |pages=206 |year=2008 |isbn=9780309177924 |doi=10.17226/12028}}</ref><ref name="ErdmannCritic11">{{cite journal |title=Criticality of non-fuel minerals: A review of major approaches and analyses |journal=Environmental Science and Technology |author=Erdmann, L.; Graedel, T.E. |volume=45 |issue=18 |pages=7620–30 |year=2011 |doi=10.1021/es200563g |pmid=21834560}}</ref><ref name="EurLex52011PC0025">{{cite web |url=https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52011DC0025 |title=Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions Tackling the Challenges in Commodity Markets and on Raw Materials |work=Eur-Lex |publisher=European Union |date=02 February 2011}}</ref><ref name="ECReport14">{{cite web |url=https://ec.europa.eu/docsroom/documents/10010/attachments/1/translations/en/renditions/pdf |format=PDF |title=Report on Critical Raw Materials for the E.U.: Report of the Ad hoc Working Group on defining critical raw materials |publisher=European Commission |pages=41 |date=May 2014}}</ref><ref name="GraedelCritic15">{{cite journal |title=Criticality of metals and metalloids |journal=Proceedings of the National Academy of Sciences of the United States of America |author=Graedel, T.E.; Harper, E.M.; Nassar, N.T. et al. |volume=112 |issue=14 |pages=4257-62 |year=2015 |doi=10.1073/pnas.1500415112 |pmid=25831527 |pmc=PMC4394315}}</ref>
Experts tasked with assessing the role of materials must make decisions about what materials to focus on, what applications to review, what data sources to consult, and what analyses to pursue.<ref name="GraedelCritic15" /> The amount of data available to assess is vast and far too large for any single analyst or organization to address comprehensively. In addition, to the best of our knowledge, previous assessments of material criticality have not involved a comprehensive review of scientific research on material use. [Graedel and colleagues have published extensively using raw data on supply and other indicators to measure criticality, see Graedel ''et al.'' (2012<ref name="GraedelMethod12">{{cite journal |title=Methodology of metal criticality determination |journal=Environmental Science and Technology |author=Graedel, T.E.; Barr, R.; Chandler, C. et al. |volume=46 |issue=2 |pages=1063–70 |year=2012 |doi=10.1021/es203534z |pmid=22191617}}</ref>, 2015<ref name="GraedelCritic15" />) and Panousi ''et al.'' (2016<ref name="PanousiCritic15">{{cite journal |title=Criticality of Seven Specialty Metals |journal=Journal of Industrial Ecology |author=Panousi, S.; Harper, E.M.; Nuss, P. et al. |volume=20 |issue=4 |pages=837-853 |year=2016 |doi=10.1111/jiec.12295}}</ref>) and the references contained within.] Recent developments in text analytic computational approaches present a unique opportunity to develop new analytic approaches for assessing material criticality in a comprehensive, replicable, iterative manner.
The Department of Energy’s 2011 Critical Materials Strategy (CMS) Report uses importance to clean energy as one dimension of the criticality matrix (see Figure 1).<ref name="DOECritical11">{{cite web |url=https://www.energy.gov/sites/prod/files/DOE_CMS2011_FINAL_Full.pdf |format=PDF |title=Critical Materials Strategy |author=Office of Policy and International Affairs |publisher=U.S. Department of Energy |date=December 2011}}</ref> In this regard, the DOE report serves as a form of ground truth for the validation of our technique, though the DOE report considered supply risk as the second dimension to criticality, which the analysis described in this paper does not address.
[[File:Fig1 Kalathil FrontInResMetAnal2018 2.jpg|500px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="500px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' 2011 Critical Materials Importance Analysis matrix, published by experts at the Department of Energy. This matrix served as ground truth for validation.</blockquote>
|-
|}
|}
===Scientific publications===
Data on scientific research articles was obtained from the Web of Science (WoS) database available from Thomson Reuters (now Clarivate Analytics). WoS contains metadata records. In principle, we could have analyzed this entire database; however, for budget-related reasons, the document set was limited by a topic search of keywords that appear in a document title, abstract, author-provided keywords, and WoS-added keywords for the following:
* the 16 materials listed in the 2011 CMS, or
* the 285 unique alloys/composites of the 16 critical materials.
The document set was also limited to articles published between 1992 and 2011, the 20-year period leading up to the DOE's most recent critical material assessment.
The 16 materials listed in the 2011 CMS are europium, terbium, yttrium, dysprosium, neodymium, cerium, tellurium, lanthanum, indium, lithium, gallium, praseodymium, nickel, manganese, cobalt, and samarium. Excluded from our document set was any publication appearing in 80 fields considered not likely to cover research in scope (e.g., fields in the social sciences, biological sciences, etc.). We used the 16 materials listed above because we were interested in validating a methodology against the 2011 CMS report, and these are the materials mentioned therein. The resulting data set consisted of approximately 438,000 abstracts of scientific papers published from 1992 to 2011.
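The selection rules described above (topic match, publication window, excluded fields) can be illustrated with a minimal sketch. It is not the authors' query against WoS: the <code>Record</code> type, the <code>in_scope</code> function, and the two example excluded fields are hypothetical, and the 285 alloy/composite terms are omitted because the source does not list them.

```python
import re
from dataclasses import dataclass, field
from typing import List

# The 16 materials named in the 2011 CMS, as listed in the text above.
MATERIALS = {
    "europium", "terbium", "yttrium", "dysprosium", "neodymium", "cerium",
    "tellurium", "lanthanum", "indium", "lithium", "gallium", "praseodymium",
    "nickel", "manganese", "cobalt", "samarium",
}

# Hypothetical placeholders: the source says 80 fields were excluded
# but does not name them.
EXCLUDED_FIELDS = {"sociology", "cell biology"}


@dataclass
class Record:
    """Hypothetical stand-in for one bibliographic metadata record."""
    title: str
    abstract: str
    field: str
    year: int
    keywords: List[str] = field(default_factory=list)


def in_scope(rec: Record) -> bool:
    """Return True if the record passes the year, field, and topic filters."""
    if not 1992 <= rec.year <= 2011:
        return False
    if rec.field.lower() in EXCLUDED_FIELDS:
        return False
    text = " ".join([rec.title, rec.abstract, *rec.keywords]).lower()
    words = set(re.findall(r"[a-z]+", text))
    return bool(words & MATERIALS)


if __name__ == "__main__":
    sample = Record("Lithium cobalt oxide cathodes", "", "electrochemistry", 2005)
    print(in_scope(sample))
```

Matching on whole words keeps a term such as "nickel" from matching inside unrelated strings, but a real topic search would also need to handle chemical symbols and alloy names.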
==Methods==
===Text analytics and coclustering===
The principle behind coclustering is the statistical analysis of the occurrences of terms in the text. This includes the processing of the relationships both between terms and neighboring (or nearby) terms, and between terms and the documents in which they occur. The approach presented here grouped papers by looking for sets of papers containing similar sets of terms. As detailed below, our analytic methods process meaning beyond simple counts of words. They would, for example, put papers about earthquakes and papers about tremors in the same group, but exclude papers in the medical space that discuss hand tremors.
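As a simplified illustration of grouping papers by shared terms, the sketch below scores pairs of papers by the overlap of their term sets (Jaccard similarity). This is an assumption chosen for brevity, not the coclustering method itself, which goes beyond set overlap; the stopword list, function names, and example titles are all hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"the", "of", "and", "in", "a", "for", "on", "with", "before"}


def term_set(text):
    """Lowercase the text and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


papers = {
    "p1": "Seismic tremors recorded before the earthquake",
    "p2": "Earthquake tremors and fault rupture",
    "p3": "Hand tremors in patients with neurological disease",
}
terms = {name: term_set(text) for name, text in papers.items()}

if __name__ == "__main__":
    for x, y in [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]:
        print(x, y, round(jaccard(terms[x], terms[y]), 3))
```

Here the two seismology titles share more terms with each other than either shares with the medical title, even though all three contain "tremors".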
Coclustering is based on an important technique in natural language processing which involves the embedding of terms into a real vector space; i.e., each word of the language is assigned a point in a high-dimensional space. Given a vector representation for a term, terms can be clustered using standard clustering techniques (also known as cluster analysis), such as hierarchical agglomeration, principal components analysis, K-means, and distribution mixture modeling.<ref name="HastieTheElem04">{{cite book |title=The Elements of Statistical Learning |author=Hastie, T.; Tibshirani, R.; Friedman, J. |publisher=Springer |pages=745 |year=2009 |isbn=9780387848570}}</ref> This was first done in the 1980s under the name latent semantic analysis (LSA).<ref name="FurnasTheVocab87">{{cite journal |title=The vocabulary problem in human-system communication |journal=Communications of the ACM |author=Furnas, G.W.; Landauer, T.K.; Gomez, L.M.; Dumais, S.T. |volume=30 |issue=11 |pages=964–971 |year=1987 |doi=10.1145/32206.32212}}</ref> In the 1990s, neural networks were applied to find embeddings for terms using a technique called context vectors.<ref name="GallantHNC92">{{cite journal |title=HNC's MatchPlus system |journal=ACM SIGIR Forum |author=Gallant, S.I.; Caid, W.R.; Carleton, J. et al. |volume=26 |issue=2 |pages=34–38 |year=1992 |doi=10.1145/146565.146569}}</ref><ref name="CaidLearned95">{{cite journal |title=Learned vector-space models for document retrieval |journal=Information Processing and Management |author=Caid, W.R.; Dumais, S.T.; Gallant, S.I. et al. 
|volume=31 |issue=3 |pages=419–429 |year=1995 |doi=10.1016/0306-4573(94)00056-9}}</ref> A Bayesian analysis of context vectors in the late 1990s provided probabilistic interpretation and enabled applying information-theoretic techniques.<ref name="GallantHNC92" /><ref name="ZhuBayesian95">{{cite journal |title=Bayesian invariant measurements of generalization |journal=Neural Processing Letters |author=Zhu, H.; Rohwer, R. |volume=2 |issue=6 |pages=28–31 |year=1995 |doi=10.1007/BF02309013}}</ref> We refer to this technique as Association Grounded Semantics (AGS). A similar Bayesian analysis of LSA resulted in a technique referred to as probabilistic-LSA<ref name="HofmannProb99">{{cite journal |title=Probabilistic latent semantic indexing |journal=Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval |author=Hofmann, T. |pages=50–57 |year=1999 |doi=10.1145/312624.312649}}</ref>, which was later extended to a technique known as latent Dirichlet allocation (LDA).<ref name="BleiLatent03">{{cite journal |title=Latent Dirichlet Allocation |journal=Journal of Machine Learning Research |author=Blei, D.M.; Ng, A.Y.; Jordan, M.I. |volume=3 |issue=1 |pages=993–1022 |year=2003 |url=http://www.jmlr.org/papers/v3/blei03a.html}}</ref> LDA is commonly referred to as “topic modeling” and is probably the most widely applied technique for discovering groups of similar terms and similar documents. Much more recently, Google directly extended the context vector approach of the early 1990s to derive word embeddings using much greater computing power and much larger datasets than had been used in the past, resulting in the word2vec product, which is now widely known.<ref name="MikolovEfficient13">{{cite web |url=https://arxiv.org/abs/1301.3781 |title=Efficient Estimation of Word Representations in Vector Space |author=Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. 
|publisher=Cornell University Library |date=07 September 2013}}</ref><ref name="BellemareTheArcade15">{{cite journal |title=The arcade learning environment: An evaluation platform for general agents |journal=Proceedings of the 24th International Conference on Artificial Intelligence |author=Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. |pages=4148–4152 |year=2015}}</ref>
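The idea of assigning each term a point in a vector space can be sketched with raw term-document counts and cosine similarity. This is a deliberately simple stand-in: it performs no dimensionality reduction (so it is not LSA) and no neural training (so it is not context vectors or word2vec). The toy documents and function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each string stands in for one abstract.
docs = [
    "lithium battery cathode",
    "lithium battery anode",
    "neodymium magnet motor",
    "dysprosium magnet motor",
]


def term_vectors(documents):
    """Map each term to a vector of its counts across the documents."""
    counts = [Counter(d.split()) for d in documents]
    vocab = sorted({t for c in counts for t in c})
    return {t: [c[t] for c in counts] for t in vocab}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = term_vectors(docs)

if __name__ == "__main__":
    print(cosine(vectors["lithium"], vectors["battery"]))
    print(cosine(vectors["lithium"], vectors["magnet"]))
```

Terms that occur in the same documents end up close together, so a standard clustering routine such as K-means could then be run on these vectors.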
The LDA model assumes that documents are collections of topics, and those topics generate terms, making it difficult to apply LDA to terms other than in the contexts of documents. Because coclustering treats the document as a context for a term, any other context of a term can be substituted in the coclustering model. For example, contexts may be neighboring terms or capital letters or punctuation. This allows us to apply coclustering to a much wider variety of feature types than is accommodated by LDA. In particular, “distributional clustering” (clustering of terms based on their distributions over nearby terms), which has proven useful in information extraction,<ref name="FreitagToward04">{{cite journal |title=Toward unsupervised whole-corpus tagging |journal=Proceedings of the 20th International Conference on Computational Linguistics |author=Freitag, D. |pages=357 |year=2004 |doi=10.3115/1220355.1220407}}</ref><ref name="FreitagTrained04">{{cite journal |title=Trained Named Entity Recognition using Distributional Clusters |journal=Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing |author=Freitag, D. |pages=262–69 |year=2004 |url=http://www.aclweb.org/anthology/W04-3234}}</ref> is captured by coclustering. In future work, we anticipate recognizing material names and application references using these techniques.
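Distributional clustering compares terms by their distributions over nearby terms. The sketch below estimates such a distribution from a one-word window and compares two terms with the Jensen–Shannon divergence. The choice of divergence, the window size, and the toy sentences are assumptions made for illustration; the source does not specify them.

```python
import math
from collections import Counter, defaultdict


def context_distributions(sentences, window=1):
    """For each term, estimate a probability distribution over neighboring terms."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[term][tokens[j]] += 1
    return {
        term: {ctx: n / sum(ctxs.values()) for ctx, n in ctxs.items()}
        for term, ctxs in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy sentences (hypothetical).
sentences = [
    "the lithium anode degrades",
    "the cobalt anode degrades",
    "the magnet motor spins",
]
dists = context_distributions(sentences)

if __name__ == "__main__":
    print(js_divergence(dists["lithium"], dists["cobalt"]))
    print(js_divergence(dists["lithium"], dists["magnet"]))
```

A divergence of zero means two terms appear in identical contexts in this toy corpus; terms with low divergence would be candidates for the same cluster.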
==References==
{{Reflist|colwidth=30em}}
==Notes==
This presentation is faithful to the original, with only a few minor changes to formatting. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version, by design, lists them in order of appearance.
<!--Place all category tags here-->
[[Category:LIMSwiki journal articles (added in 2018)]]
[[Category:LIMSwiki journal articles (all)]]
[[Category:LIMSwiki journal articles on materials informatics]]

Latest revision as of 17:47, 1 February 2022
