
Full article title	Developing a file system structure to solve healthcare big data storage and archiving problems using a distributed file system
Journal	Applied Sciences
Author(s)	Ergüzen, Atilla; Ünver, Mahmut
Author affiliation(s)	Kırıkkale University
Primary contact	Email: munver at kku dot edu dot tr
Year published	2018
Volume and issue	8(6)
Page(s)	913
DOI	10.3390/app8060913
ISSN	2076-3417
Distribution license	Creative Commons Attribution 4.0 International
Website	http://www.mdpi.com/2076-3417/8/6/913/htm
Download	http://www.mdpi.com/2076-3417/8/6/913/pdf (PDF)

Revision as of 22:04, 11 June 2018


Abstract

Recently, internet use has become widespread, increasing the use of mobile phones, tablets, computers, internet of things (IoT) devices, and other digital sources. In the healthcare sector, next-generation digital medical equipment has driven similarly unpredictable growth, such that nearly 10 percent of global data is healthcare-related, and this share continues to grow faster than in other sectors. This progress has greatly enlarged the amount of data produced, which can no longer be managed with conventional methods. In this work, an efficient model for the storage of medical images using a distributed file system structure has been developed. The result is a robust, available, scalable, and serverless solution structure, intended especially for storing large amounts of data in the medical field. Furthermore, the system achieves a high level of security through the use of static Internet Protocol (IP) addresses, user credentials, and synchronously encrypted file contents. Among the most important features of the system are high performance and easy scalability. As a result, the system can work with fewer hardware elements and be more robust than others that use a name node architecture. According to the test results, the designed system performed 97% better than a Not Only Structured Query Language (NoSQL) system, 80% better than a relational database management system (RDBMS), and 74% better than an operating system (OS).

Keywords: big data, distributed file system, health data, medical imaging

Introduction

In recent years, information technology has advanced worldwide, and internet usage has rapidly accelerated the amount of data generated in all fields. The number of internet users was 16 million in 1995. This number reached 304 million in 2000, 888 million in 2005, 1.996 billion in 2010, 3.270 billion in 2015, and 3.885 billion in 2017.^[1]^[2]^[3] Every day, 2.5 exabytes (EB) of data are produced worldwide. Also, 90% of globally generated data has been produced since 2015. The data are generated in many different fields, such as aviation, meteorology, IoT applications, health, and energy. Likewise, the data produced through social media has reached enormous volumes. Facebook.com stored 600 terabytes (TB) of data a day in 2014, and Google processed hundreds of petabytes (PB) of data per day in the same year.^[4]^[5] Data production has also increased at a remarkable rate in the healthcare sector, triggered by the widespread use of digital medical imaging peripherals. The data generated in the healthcare sector has reached such a point that it cannot be managed easily with traditional data management tools and hardware. Healthcare has accumulated a large volume of data by keeping patients’ records, creating medical images that help doctors with diagnoses, outputting digital files from various devices, and creating and storing the results of different surveys. Different types of data sources produce data in various structured and unstructured formats; examples include patient information, laboratory results, X-ray devices, computed tomography (CT) devices, and magnetic resonance imaging (MRI). The world population and average human lifespan are increasing continuously, which means a steadily growing number of patients to be served. As the number of patients increases, the amount of collected data also increases dramatically. Additionally, widely used digital healthcare devices produce higher-resolution graphical output, adding readily to the growing body of data. In 2011, the amount of data in the healthcare sector in the U.S. reached 150 EB. In 2013, it appeared to have reached 153 EB. By 2020, this number is estimated to reach 2.3 zettabytes (ZB). For example, electronic medical record (EMR) use increased 31% from 2001 to 2005 and more than 50% from 2005 to 2008.^[6]^[7] While neuroimaging data volumes were approximately 200 GB per year between 1985 and 1989, they rose to 5 PB annually between 2010 and 2014, yet another indicator of the increase in data in the healthcare sector.^[8]

In this way, new problems have emerged due to the increasing volume of data generated in all fields at the global level. Storing and analyzing the data now pose substantial challenges. The storage of data has become costlier than gathering it.^[9] Thus, the amount of data that is produced, stored, and manipulated has increased dramatically, and because of this increase, big data and data science/knowledge have begun to develop.^[10] Big data refers to the variety, velocity, and volume of data; for healthcare records, finding an acceptable approach that covers all three is particularly difficult.

Based on the preceding arguments, the big data problems in healthcare that motivate this study are as follows:

1. Increasing number of patients: The global population and average human lifespan are apparently increasing. For example, in Turkey, the number of visits to a physician has increased by about 4% per year since 2012.^[11] Moreover, the total number of per capita visits to a physician in healthcare facilities in 2016 was 8.6 while this value was 8.2 in 2012. As the number of patients increases, the amount of collected data also increases dramatically, which creates much more data to be managed.

2. Technological devices: Extensively used digital healthcare devices create high-resolution graphical outputs, which means huge amounts of data to be stored.

3. Expert personnel needs: To manage big data in institutions using software platforms such as Hadoop, Spark, Kubernetes, Elasticsearch, etc., qualified information technology specialists must be brought in to deploy and manage big data solutions.^[12]

4. Small file size problem: Current solutions for healthcare, including Hadoop-based solutions, have a block size of 64 MB (detailed in the next section). This degrades performance and wastes storage space, a problem called "internal fragmentation" that is difficult to resolve.

5. Hospital information systems (HIS): These systems represent comprehensive software and related tools that help healthcare providers produce, store, fetch, and exchange patient information more efficiently and enable better patient tracking and care. A HIS must have essential non-functional properties such as (a) robustness, (b) performance, (c) scalability, and (d) availability. These properties depend largely on the underlying data management architecture, which includes the configured hardware devices and installed software tools. A HIS is often left to solve big data problems on its own, even though it is much more than an IT project or a traditional application. As such, third-party software tools are needed to achieve the objectives of the healthcare providers.

This study seeks to provide a mid-layer software platform that will help to address these healthcare gaps. To that end, we have implemented a server-cluster platform to store and retrieve digital health image data. It acts as a bridge between the HIS and various hardware resources located on the network. There are five primary aims of this study:

1. to overcome growing data problems by implementing a distributed data layer between the HIS and server-cluster platform;

2. to reduce operational costs, with no need to employ IT specialists to install and to deploy popular big data solutions;

3. to implement a new distributed file system architecture to achieve non-functional properties like performance, security, and scalability, which are of crucial importance for a HIS;

4. to show and prove that there can be different successful big data solutions; and, especially,

5. to solve these gaps efficiently for our university HIS.

In this study, the first section describes general data processing methods. The second part discusses the work and related literature on the subject, while the third part is the materials and methods section that describes the implemented approach. The last section is the conclusion of the evaluation that emerges as the result of our work.

Big data architecture in medicine

Big data solutions in healthcare worldwide fall primarily into three categories.

The first is a database system, which has two popular application architectures: relational database management systems (RDBMS) and NoSQL database systems. RDBMSs, the most widely known and used systems for this purpose, store data in a structured format. The data to be processed must be of the appropriate type and format. In these systems, a single database can serve multiple users and applications. Since these systems are built to grow vertically, the data structure must be defined in advance. They also enforce many constraints, such as atomicity, consistency, isolation, and durability. The strict rules that make these systems indispensable are beginning to be questioned today. Moreover, because of the hardware and software they require, their initial installation costs are high. Especially when the volume of data increases, horizontal scalability becomes quite unsatisfactory and difficult to manage, which is a major reason they are not part of an overall big data solution. These systems are also more complex than file systems, which, most importantly, makes them unsuitable for big data. Because RDBMSs are deficient at managing big data, NoSQL database systems have emerged as an alternative. The main purpose of these systems is to store the increasing volume of unstructured data associated with the internet and to respond to the needs of high-traffic systems via unstructured or semi-structured formats. NoSQL databases provide higher accessibility than RDBMSs, and their data are easily scaled horizontally.^[13] Their read and write performance may also be better than that of an RDBMS. One of their most important features is that they are horizontally expandable: thousands of servers can work together as a cluster and operate on big data. They are easy to program and manage due to their flexible structures. Another feature of these systems is that they perform grid computing in clusters that consist of many machines connected to a network, which increases data processing speed. However, NoSQL does not yet have data security features as advanced as those of RDBMSs. Some NoSQL projects also lack documentation and professional technical support. Finally, the concept of "transactions" is generally not available in NoSQL database systems, meaning loss of data may occur, so they are considered unsuitable for use in banking and financial systems.^[14]

The second solution relies on the basic file management functions of the operating system (OS), an approach called "file servers" in the literature. In this approach, medical image data is stored in files and folders in the underlying operating system's file structure. Operating systems use a hierarchical file system in which files are organized in a tree of directories. File servers store the files in a layout determined by the HIS, according to image type, file creation date, polyclinic name, and patient information. The HIS executes read and write operations by invoking system calls, which act as a low-level interface to storage devices with the help of the operating system. The advantages of file servers are that they are simple to deploy, easy to implement, and have acceptable file operation performance. Writing, deleting, reading, and searching files on the operating system is fairly fast because the operating system is specialized for file management. In particular, operating systems offer more flexibility and performance than RDBMS and NoSQL systems. However, the OS cannot turn these advantages into a satisfactory solution model for big data because it lacks horizontal scalability. The main task of OS file management is to serve system calls to other applications, so the OS is part of the solution rather than the solution alone. In addition, the stored data is not as secure as with other methods, storage cannot be scaled according to data size, and backup and file transfer cannot be done safely. The operating system alone is therefore not suitable for solving big data problems.

The third method involves distributed file systems (DFS). These systems represent the most up-to-date way to support machines in various locations as a single framework and provide the most appropriate solution to the big data problem. Hadoop DFS and MapReduce are primarily used for big data storage and analytics. Hybrid solutions that include Hadoop and NoSQL are also used and criticized in the literature. However, there are some drawbacks to using the Hadoop ecosystem in the healthcare setting. The first one is the small files problem: Hadoop cannot store and manage these types of files efficiently.^[15] Hadoop is primarily designed to manage large files of 64 MB or more, and this size is also the default block size of Hadoop clusters.^[15] For example, a file that is one gigabyte in size, consisting of 16 Hadoop blocks of 64 MB each, occupies 2.4 KB of space in a name node. However, 10,000 files of 100 KB each, which also occupy one gigabyte of space in data nodes, take up 1.5 MB in a name node. This means more MapReduce operations are required when processing small files. The average medical image file size in the healthcare sector is 10 MB; taking this into consideration, a new DFS is needed to accommodate systems that have large numbers of small files.
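
The name node figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, not a measurement of a real cluster: it assumes that each block entry costs about 150 bytes of name node memory (a value inferred here from 2.4 KB divided by 16 blocks, not stated in the article) and uses decimal kilobytes and megabytes. At that rate, the 1.5 MB figure corresponds to roughly 10,000 small files of 100 KB, which together hold about one gigabyte. The function name and constants are illustrative.

```python
import math

BLOCK_SIZE_MB = 64       # Hadoop default block size cited in the text
BYTES_PER_ENTRY = 150    # assumption: 2.4 KB / 16 blocks, inferred from the text


def name_node_bytes(file_size_mb: float, file_count: int = 1) -> int:
    """Estimate name node memory for `file_count` files of `file_size_mb` each."""
    # Every file needs at least one block entry, however small it is.
    blocks_per_file = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    return file_count * blocks_per_file * BYTES_PER_ENTRY


one_large_file = name_node_bytes(1024)            # one 1 GB file, 16 blocks
many_small_files = name_node_bytes(0.1, 10_000)   # 10,000 files of 100 KB

print(f"one 1 GB file:       {one_large_file / 1_000:.1f} KB")
print(f"10,000 100 KB files: {many_small_files / 1_000_000:.1f} MB")
```

Under these assumptions the same gigabyte of image data costs several hundred times more name node memory when it arrives as small files, which is the imbalance the small files problem refers to.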

This study proposes a new methodology for this issue via a small block size and a "no name node" structure. Wang et al. have identified five different strategies for how big data analytics can be effectively used to create business value for healthcare providers. This work was carried out for a total of 26 health institutions, from Europe and the U.S., that use Hadoop and MapReduce.^[16] However, the same study also states that, with a growing amount of unstructured data, more comprehensive analytic techniques such as deep learning algorithms are required. A significant analysis and discussion of the Hadoop, MapReduce, and STORM frameworks for big data in healthcare was presented by Liu and Park.^[17] They stated that Hadoop and MapReduce cannot be used in real-time systems due to a performance gap. Therefore, they proposed a novel health service model called BDeHS (Big Data e-Health Service), which has three key benefits. Additionally, Spark and STORM can be used more effectively than MapReduce for real-time data analytics of large databases.^[18] One study provided detailed information about the architectural design of a personal health record system called “MedCloud” constructed on Hadoop’s ecosystem.^[19] In another study on big data in healthcare, Ergüzen and Erdal developed a new file structure and archiving system to store regions of interest (ROIs) from MRIs. In other words, they extracted the ROI portions, which contained vital information about the patient, from the image, discarded the rest (called non-ROIs), and stored the ROIs in the newly designed file structure with a success ratio of approximately 30%. However, this work was done only to decrease the image sizes, not to effectively store big data on DFS.^[7] Another study conducted by Raghupathi and Raghupathi showed that the Hadoop ecosystem has significant drawbacks for medium- or large-size healthcare providers: (a) it requires a great deal of programming skills for data analytics tasks using MapReduce; and (b) it is typically difficult to install, to configure, and to manage the Hadoop ecosystem completely. As such, it does not seem to be a feasible solution for medium- or large-scale healthcare providers.^[12]

Today, Hadoop is one of the enterprise-scale open-source solutions that makes it possible to store big data with thousands of data nodes, as well as analyze the data with MapReduce. However, there are three disadvantages to Hadoop. First, Hadoop’s default block size is 64 MB, which presents an obstacle in managing numerous small files.^[15] When a file smaller than 64 MB is placed in a 64 MB Hadoop block, it leaves a gap, which is called internal fragmentation. In our system, the block size is 10 MB, chosen according to the average MRI file size, meaning less internal fragmentation. Second, performance issues arise when the system needs to run in a real-time environment. Third, Hadoop requires professional support to construct, operate, and maintain the system properly. These drawbacks are the key reasons why we developed this solution. As such, an original distributed file system has been developed for storing and managing big data in healthcare. The developed system has been shown to be quite successful for applications that follow a write once read many (WORM) model, which has many uses, such as in electronic record management systems and the healthcare sector.
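
The effect of block size on internal fragmentation, as described above, can be illustrated with a short calculation. This is a sketch of the article's model only, in which a file is stored in whole blocks and the unused remainder of the last block counts as fragmentation. The sample file sizes are hypothetical and are not taken from the study's data.

```python
import math


def internal_fragmentation_mb(file_size_mb: float, block_size_mb: float) -> float:
    """Unused space left in the last block when a file is stored in whole blocks."""
    blocks = max(1, math.ceil(file_size_mb / block_size_mb))
    return blocks * block_size_mb - file_size_mb


sample_sizes_mb = [4, 10, 12, 25]  # hypothetical medical image sizes in MB

for block_size in (64, 10):
    unused = sum(internal_fragmentation_mb(s, block_size) for s in sample_sizes_mb)
    print(f"block size {block_size} MB: {unused} MB unused across {len(sample_sizes_mb)} files")
```

In this model a 10 MB image leaves 54 MB of a 64 MB block unused but fits a 10 MB block exactly, which is the motivation for matching the block size to the typical image size.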

Related studies

Big data represents data that cannot be stored and administered on a single computer. Today, it is administered by computers that are connected to a distributed file system and work together in a network. DFSs are separated into clusters consisting of nodes. Performance, data security, scalability, availability, easy accessibility, robustness, and reliability are the most important features of big data systems. Big data management problems can be solved by using DFS and network infrastructure. DFS-related work began in the 1970s^[20], with one of the first advances being the Roe File System, developed for replica consistency, easy setup, secure file authorization, and network transparency.^[21]

LOCUS, developed in 1981, is another DFS that features network transparency, high performance, and high reliability.^[22] Sun Microsystems began developing the network file system (NFS) in 1984. It is the most widely used DFS on UNIX. Remote procedure call (RPC) is used for communication.^[23] It is designed to enable the Unix file system to function as a "distributed" system, with the virtual file system acting as a layer. Therefore, clients can run different file systems easily, and fault tolerance is high in the NFS. File status information is kept, and when an error occurs, the client reports this error status to the server immediately. Individual files are not replicated in NFS; instead, the whole system is replicated.^[24] Only the file system is shared on NFS; no printer or modem can be shared. The unit of sharing can be a directory as well as a file. With NFS, it is not necessary to install every application on a local disk; applications can be shared from the server. Moreover, the same computer can be both a server and a client. As a result, an NFS reduces data storage costs.

The Andrew File System (AFS, 1983) and its successors, CODA (1992) and OpenAFS^[25], are open-source distributed file systems. These systems are scalable and support larger cluster sizes. They can also reduce server load and cache whole files. CODA replicates data on multiple servers to increase accessibility. Whereas AFS only supports Unix, OpenAFS and CODA support MacOS and Microsoft Windows. In these systems, the same namespace is created for all clients. However, replication is limited, and a read-one/write-all (ROWA) schema is used for it.^[26]^[27]
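
The read-one/write-all schema mentioned above can be sketched in a few lines. This is a toy, in-memory illustration of the general ROWA idea, not code from AFS, CODA, or OpenAFS; the class name, keys, and values are hypothetical, and failure handling is omitted.

```python
class RowaStore:
    """Toy read-one/write-all replica set; each replica is a plain dict."""

    def __init__(self, replica_count: int) -> None:
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, key: str, value: bytes) -> None:
        # Write-all: the update is applied to every replica before returning.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str, replica_index: int = 0) -> bytes:
        # Read-one: any single replica can answer, since all hold the same data.
        return self.replicas[replica_index][key]


store = RowaStore(replica_count=3)
store.write("scan-001", b"image bytes")
print(store.read("scan-001", replica_index=2))
```

The trade-off is that reads are cheap and can be served by any replica, while a write has to reach every replica, which is one reason replication under this schema is described as limited.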

Frangipani was developed in 1997 as a new distributed file system with two layers. The bottom layer consists of virtual disks that provide storage services and can be scaled and managed automatically. On the top layer, there are several machines that use the Frangipani file system. These machines run distributed on the shared virtual disk. The Frangipani file system provides consistent and shared access to the same set of files. As the data used in the system grows, more storage space and higher-performance hardware elements are needed. If one of the system components fails, the system continues to serve, which keeps it available. As the system grows, the added components do not complicate management, and thus there is less need for human administration.^[28]

FARSITE (2002) is a serverless file system that runs distributed on a network, even one of physically unreliable computers. It does not require centralized management, so it avoids the staffing costs of a server-based system. FARSITE is designed to support the file I/O workload of desktop computers in a university or a large company. It provides reasonable performance using client caching; availability and accessibility using replication; authentication using encryption; and scalability using namespace delegation. One of the most important design goals of FARSITE is to use the benefits of Byzantine fault-tolerance.^[29]

Described in 2006, the CEPH file system sits as a top layer over similar systems that do object storage. This layer separates data and metadata management. This is accomplished by the pseudo-random data distribution function (CRUSH), which is designed for unreliable object storage devices (OSDs). This function replaces the file allocation table. With CEPH, distributed data replication, error detection, and recovery operations are delegated to object storage devices running on the local file system, which enhances system performance. A distributed set of metadata makes its management extremely efficient. The Reliable Autonomic Distributed Object Store (RADOS) layer manages all filing processes. Measurements were taken under various workloads to test the performance of CEPH, which can also work with different disk sizes. The results show that I/O performance is extremely high and that metadata management is scalable. According to the measurements, it supports 250,000 metadata transactions per second, making CEPH a high-performance, reliable, and scalable distributed file system.^[30]
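
The idea of a distribution function standing in for an allocation table can be illustrated with a small sketch. The code below is not CRUSH; it swaps in a much simpler technique (highest-random-weight, or rendezvous, hashing) purely to show how any client can compute an object's location from its name instead of looking it up. The device names and object name are hypothetical.

```python
import hashlib


def place(object_name: str, devices: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` devices for an object from a hash, with no lookup table."""

    def score(device: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{device}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring devices win; every client computes the same answer.
    return sorted(devices, key=score, reverse=True)[:replicas]


osds = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical device names
print(place("patient-42/mri-0001", osds))
```

Because placement is a pure function of the object name and the device list, no central table has to be consulted or kept consistent, which is the property the text attributes to the distribution function.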

In 2007, Hadoop was developed, consisting of the Hadoop distributed file system (HDFS) and the MapReduce parallel computing tool. Hadoop is a framework that provides analysis and transformation of very large datasets. HDFS distributes big data by dividing it into blocks across clusters of standard servers. To ensure data security, it backs the blocks up on the servers by copying them.^[31] Hadoop/MapReduce is used to process and manage big data. The "map" function distributes the data on the cluster and makes it available for processing. The "reduce" function ensures the data will be combined. Hadoop is scalable, and it can easily handle petabytes of data.^[32] Today, Hadoop is used by many major companies and is preferred in industrial and academic fields. Companies such as LinkedIn, eBay, AOL, Yahoo, Facebook, and IBM use Hadoop.^[33]
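
The division of labor between the "map" and "reduce" functions can be shown with a single-process sketch. This is plain Python rather than the Hadoop API: in a real cluster the map tasks would run on many nodes and the framework would group pairs by key before reduction. The record layout and the modality field are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, 1) pairs; here the key is the imaging modality of each record."""
    for record in records:
        yield record["modality"], 1


def reduce_phase(pairs):
    """Combine all values that share a key into one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical image metadata records.
records = [{"modality": "MRI"}, {"modality": "CT"}, {"modality": "MRI"}]
counts = reduce_phase(map_phase(records))
print(counts)
```

The map step only looks at one record at a time, which is what lets it be spread across a cluster, while the reduce step is where the partial results are combined.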

References

1. "Internet Live Stats". InternetLiveStats.com. http://www.internetlivestats.com/. Retrieved 16 July 2016.
2. Kemp, S. (27 January 2016). "Digital in 2016". We Are Social. We Are Social Ltd. https://wearesocial.com/uk/special-reports/digital-in-2016. Retrieved 27 June 2016.
3. "Internet Growth Statistics". Internet World Stats. Miniwatts Marketing Group. https://www.internetworldstats.com/emarketing.htm. Retrieved 21 May 2018.
4. Vagata, P.; Wilfong, K. (10 April 2014). "Scaling the Facebook data warehouse to 300 PB". Facebook Code. Facebook. https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/. Retrieved 27 June 2016.
5. Dhavalchandra, P.; Jignasu, M.; Amit, R. (2016). "Big data – A survey of big data technologies". International Journal Of Science Research and Technology 2 (1): 45–50. http://www.ijsrt.us/vol2issue1.aspx.
6. Dean, B.B.; Lam, J.; Natoli, J.L. et al. (2009). "Review: Use of electronic medical records for health outcomes research: A literature review". Medical Care Research and Review 66 (6): 611–38. doi:10.1177/1077558709332440. PMID 19279318.
7. Ergüzen, A.; Erdal, E. (2017). "Medical Image Archiving System Implementation with Lossless Region of Interest and Optical Character Recognition". Journal of Medical Imaging and Health Informatics 7 (6): 1246–1252. doi:10.1166/jmihi.2017.2156.
8. Dinov, I.D. (2016). "Volume and Value of Big Healthcare Data". Journal of Medical Statistics and Informatics 4: 3. doi:10.7243/2053-7662-4-3. PMC4795481. PMID 26998309. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481.
9. Elgendy, N.; Elragal, A. (2014). "Big Data Analytics: A Literature Review Paper". In Perner, P. Advances in Data Mining. Applications and Theoretical Aspects. Lecture Notes in Computer Science. 8557. Springer. doi:10.1007/978-3-319-08976-8_16. ISBN 9783319089768.
10. Gürsakal, N. (2014). Büyük Veri. Dora Yayıncılık. p. 2.
11. Başara, B.B.; Güler, C. (2017). "Sağlık İstatistikleri Yıllığı 2016 Haber Bülteni". Republic of Turkey Ministry of Health General Directorate for Health Research. http://www.deik.org.tr/contents-fileaction-15401.
12. Raghupathi, W.; Raghupathi, V. (2014). "Big data analytics in healthcare: Promise and potential". Health Information Science and Systems 2: 3. doi:10.1186/2047-2501-2-3. PMC4341817. PMID 25825667. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4341817.
13. Klein, J.; Gorton, I.; Ernst, N. et al. (2015). "Application-Specific Evaluation of No SQL Databases". Proceedings of the 2015 IEEE International Congress on Big Data: 83. doi:10.1109/BigDataCongress.2015.83.
14. Davaz, S. (28 March 2014). "NoSQL Nedir Avantajları ve Dezavantajları Hakkında Bilgi". Kodcu Blog. Kodcu. https://blog.kodcu.com/2014/03/nosql-nedir-avantajlari-ve-dezavantajlari-hakkinda-bilgi/. Retrieved 13 June 2017.
15. He, H.; Du, Z.; Zhang, W.; Chen, A. (2016). "Optimization strategy of Hadoop small file storage for big data in healthcare". The Journal of Supercomputing 72 (10): 3696–3707. doi:10.1007/s11227-015-1462-4.
16. Wang, Y.; Kung, L.; Byrd, T.A. (2018). "Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations". Technological Forecasting and Social Change 126: 3–13. doi:10.1016/j.techfore.2015.12.019.
17. Liu, W.; Park, E.K. (2014). "Big Data as an e-Health Service". Proceedings from the 2014 International Conference on Computing, Networking and Communications. doi:10.1109/ICCNC.2014.6785471.
18. Mishra, S. (2018). "A Review on Big Data Analytics in Medical Imaging". International Journal of Computer Engineering and Applications 12 (1). http://www.ijcea.com/review-big-data-analytics-medical-imaging/.
19. Sobhy, D.; El-Sonbaty, Y.; Elnasr, M.A. (2012). "MedCloud: Healthcare cloud computing system". Proceedings from the 2012 International Conference for Internet Technology and Secured Transactions. https://ieeexplore.ieee.org/document/6470935/.
20. Alsberg, P.A.; Day, J.D. (1976). "A principle for resilient sharing of distributed resources". Proceedings of the 2nd International Conference on Software Engineering: 562–570. https://dl.acm.org/citation.cfm?id=807732.
21. Ellis, C.S.; Floyd, R.A. (March 1983). "The ROE File System". Office of Naval Research. http://hdl.handle.net/1802/9796.
22. Popek, G.; Walker, B.; Chow, J. et al. (1981). "LOCUS: A network transparent, high reliability distributed system". Proceedings of the Eighth ACM Symposium on Operating Systems Principles: 169–77. https://dl.acm.org/citation.cfm?id=806605.
23. Sandberg, R.; Goldberg, D.; Kleiman, S. et al. (1985). "Design and implementation of the Sun Network Filesystem". Proceedings of the USENIX Conference & Exhibition: 119–30.
24. Coulouris, G.; Dollimore, J.; Kindberg, T.; Blair, G. (2011). Distributed Systems: Concepts and Design (5th ed.). Pearson. pp. 1008. ISBN 9780132143011.
25. Heidl, S. (July 2001). "Evaluierung von AFS/OpenAFS als Clusterdateisystem". Zuse-Institut Berlin. http://docplayer.org/995545-Evaluierung-von-afs-openafs-als-clusterdateisystem.html.
26. Bžoch, P.; Šafařík, J. (2012). "Algorithms for increasing performance in distributed file systems". Acta Electrotechnica Et Informatica 12 (2): 24–30. doi:10.2478/v10198-012-0005-7.
27. Karasula, B.; Korukoğlu, S. (2008). "Modern Dağıtık Dosya Sistemlerinin Yapısal Karşılaştırılması". Proceedings of the Akademik Bilişim’2008: 601–610. http://ab.org.tr/ab08/kitap/Bildiriler/Karasulu_Korukoglu_AB08.pdf.
28. Thekkath, C.A.; Mann, T.; Lee, E.K. (1997). "Frangipani: A scalable distributed file system". Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles: 224–37. https://dl.acm.org/citation.cfm?id=266694.
29. Adya, A.; Bolosky, W.J.; Castro, M. et al. (2002). "Farsite: Federated, available, and reliable storage for an incompletely trusted environment". ACM SIGOPS Operating Systems Review 36 (SI): 1–14. doi:10.1145/844128.844130.
30. Weil, S.A.; Brandt, S.A.; Miller, E.L. et al. (2006). "Ceph: A scalable, high-performance distributed file system". Proceedings of the 7th Symposium on Operating Systems Design and Implementation: 307–20. https://dl.acm.org/citation.cfm?id=1298485.
31. Shvachko, K.; Kuang, H.; Radia, S. (2010). "The Hadoop Distributed File System". Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies: 1–10. http://storageconference.us/2010/Papers/MSST/Shvachko.pdf.
32. Yavuz, G.; Aytekin, S.; Akçay, M. (2012). "Apache Hadoop Ve Dağıtık Sistemler Üzerindeki Rolü". Dumlupinar Üniversitesi Fen Bilimleri Enstitüsü Dergisi (27): 43–54. http://fbe.dpu.edu.tr/index/sayfa/2399/fen-bilimleri-enstitusu-27sayili-dergi.
33. Khidairi, S. (4 January 2012). "The Apache Software Foundation Announces Apache Hadoop v1.0". The Apache Software Foundation Blog. Apache Software Foundation. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces21.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article mentions "Liu and Park," yet they did not include a citation for those authors; this article adds the presumed citation associated with those names. The original URL to the Heidl citation led to a security warning from Google about the site; a substitute URL to DocPlayer has been added in its place. The original used Wikipedia as a citation about companies using Hadoop, which is frowned upon; updated with an improved source.



@@ Line 95: / Line 95: @@
 Described in 2006, the CEPH file system sits as a top layer over similar systems that do object storage. This layer separates data and [[metadata]] management. This is accomplished by the random data distribution function (CRUSH), which is designed for unreliable object storage devices (OSDs). This function replaces the file allocation table. With CEPH, distributed data replication, error detection, and recovery operations are transferred to object storage devices running on the local file system, which enhances system performance. A distributed set of metadata makes its management extremely efficient. The Reliable Autonomic Distributed Object Store (RADOS) layer manages all filing processes. Measurements were taken under various workloads to test the performance of CEPH, which can also work with different disk sizes. The results show that I/O performance is extremely high and that metadata management is scalable. According to the measurements, it supports 250,000 metadata transactions per second, making CEPH a high-performance, reliable, and scalable distributed file system.<ref name="WeilCeph06">{{cite journal |title=Ceph: A scalable, high-performance distributed file system |journal=Proceedings of the 7th Symposium on Operating Systems Design and Implementation |author=Weil, S.A.; Brandt, S.A.; Miller, E.L. et al. |pages=307–20 |year=2006 |url=https://dl.acm.org/citation.cfm?id=1298485}}</ref>
+In 2007, Hadoop was developed, consisting of the Hadoop distributed file system (HDFS) and the MapReduce parallel computing tool. Hadoop is a framework that provides analysis and transformation of very large datasets. HDFS distributes big data by dividing it into blocks across clusters of standard servers. To ensure data security, it backs the blocks up by copying them to other servers.<ref name="ShvachkoTheHadoop10">{{cite journal |title=The Hadoop Distributed File System |journal=Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies |author=Shvachko, K.; Kuang, H.; Radia, S. |pages=1–10 |year=2010 |url=http://storageconference.us/2010/Papers/MSST/Shvachko.pdf}}</ref> Hadoop/MapReduce is used to process and manage big data. The "map" function distributes the data on the cluster and makes it available for processing. The "reduce" function ensures the data will be combined. Hadoop is scalable and can easily handle petabytes of data.<ref name="YavuzApache12">{{cite journal |title=Apache Hadoop Ve Dağıtık Sistemler Üzerindeki Rolü |journal=Dumlupinar Üniversitesi Fen Bilimleri Enstitüsü Dergisi |author=Yavuz, G.; Aytekin, S.; Akçay, M. |issue=27 |pages=43–54 |year=2012 |url=http://fbe.dpu.edu.tr/index/sayfa/2399/fen-bilimleri-enstitusu-27sayili-dergi}}</ref> Today, Hadoop is used by many major companies and is preferred in industrial and academic fields. Companies such as LinkedIn, eBay, AOL, Yahoo, Facebook, and IBM use Hadoop.<ref name="KhudairiTheApache12">{{cite web |url=https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces21 |title=The Apache Software Foundation Announces Apache Hadoop v1.0 |author=Khudairi, S. |work=The Apache Software Foundation Blog |publisher=Apache Software Foundation |date=04 January 2012}}</ref>
 ==References==
@@ Line 100: / Line 101: @@
 ==Notes==
-This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article mentions "Liu and Park," yet they did not include a citation for those authors; this article adds the presumed citation associated with those names. The original URL to the Heidl citation led to a security warning from Google about the site; a substitute URL to DocPlayer has been added in its place.
+This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article mentions "Liu and Park," yet they did not include a citation for those authors; this article adds the presumed citation associated with those names. The original URL to the Heidl citation led to a security warning from Google about the site; a substitute URL to DocPlayer has been added in its place. The original cited Wikipedia for the list of companies using Hadoop, which is frowned upon; that citation has been replaced with an improved source.
 <!--Place all category tags here-->
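
The "map" and "reduce" roles described in the Hadoop paragraph of the diff above can be illustrated with a small, single-process sketch. This is not Hadoop's API and not code from the original article; the function names (`map_record`, `shuffle`, `reduce_group`, `run_job`) and the sample image records are hypothetical, chosen only to show how per-record key/value pairs are grouped and then combined. A real Hadoop job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (key, value) pair for each input record."""
    # Hypothetical record layout: imaging modality and file size in MB.
    yield record["modality"], record["size_mb"]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_group(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample data, not taken from the article.
records = [
    {"modality": "CT", "size_mb": 120},
    {"modality": "MR", "size_mb": 80},
    {"modality": "CT", "size_mb": 95},
]
totals = run_job(records)
```

Here `totals` holds the combined storage size per modality. The same shape applies to other aggregations: only `map_record` and `reduce_group` change, while the grouping step stays the same.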
