Efficient Management of Disk Throughput in Distributed Architectures

Authors

  • Naveen Srikanth Pasupuleti

Keywords:

Etcd, Distributed, SMR, Raft, Consistency, Fault-Tolerance, Replication, Throughput, Durability, Performance, Scalability, Logging, Synchronization, Availability, Reliability.

Abstract

etcd is a distributed key-value store used primarily for configuration management and service discovery in cloud-native applications. It is built on the Raft consensus protocol, which keeps the nodes of a distributed system consistent. etcd's primary responsibility is to store and replicate critical data, such as metadata, configuration settings, and service discovery information, across a cluster of nodes, so that the cluster keeps a consistent, up-to-date view of the system's state even when individual nodes fail. Raft is a state machine replication (SMR) mechanism: it provides strong consistency by requiring every change to the system's state to be replicated to a majority of the nodes before it is considered committed. SMR is a fundamental technique for achieving fault tolerance and consistency in distributed systems. It ensures that all nodes agree on the order of transactions or log entries, even in the presence of network partitions or node failures, by replicating logs under a consensus algorithm such as Raft. In etcd, SMR ensures that all changes to the key-value store are applied in the same order on every node, so that all nodes converge on the same state. One of the key performance metrics for systems like etcd is disk throughput, the rate at which data can be read from or written to disk. In SMR-based systems, disk throughput is critical because every update to the system's state is written to an on-disk log on each replica for durability. It therefore directly affects performance, particularly under a large volume of data or a high rate of changes.
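The majority rule described above can be illustrated with a minimal sketch, which is not etcd's actual code. The function names and the `match_index` list (the highest log index known to be stored on each node, leader included) are assumptions introduced for illustration. Raft's additional requirement that the entry belong to the leader's current term is omitted for brevity.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of nodes.

    match_index is a hypothetical list holding, for each node
    (leader included), the last log index known to be on its disk.
    """
    ranked = sorted(match_index, reverse=True)
    # The entry at position quorum-1 is held by at least `quorum` nodes.
    return ranked[quorum_size(len(ranked)) - 1]
```

With three nodes at indexes 5, 5, and 3, the sketch reports index 5 as committed because two of the three nodes hold it. The slowest replica does not delay the commit, but each node in the majority must still persist the entry to its own disk.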
As the number of nodes in a distributed system like etcd increases, disk throughput tends to decrease because of the added overhead of replicating logs across more nodes. This overhead includes the communication and synchronization costs of ensuring that all nodes apply the same log entries in the same order. In summary, etcd relies on SMR for consistency and fault tolerance, and on disk throughput to write and replicate log entries efficiently, which in turn supports high availability and reliability. Optimizing disk throughput is therefore key to improving the overall performance of systems like etcd that rely on SMR for consistency and durability. This paper addresses these disk throughput issues using a write-ahead log (WAL) algorithm.
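The abstract does not describe the write-ahead log algorithm itself, so the following is only a generic sketch of the technique under stated assumptions, not the paper's method and not etcd's on-disk format. The `WriteAheadLog` class, the `replay` function, and the record layout (4-byte length, 4-byte CRC32, payload) are all hypothetical. It shows one common way a WAL can conserve disk throughput: several entries are appended and made durable with a single `fsync` instead of one per entry.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")


class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.file = open(path, "ab")
        self.sync_count = 0  # number of fsync calls issued

    def append_batch(self, entries):
        """Append a batch of byte-string entries with a single fsync."""
        buf = bytearray()
        for payload in entries:
            buf += HEADER.pack(len(payload), zlib.crc32(payload))
            buf += payload
        self.file.write(buf)
        self.file.flush()              # push Python's buffer to the OS
        os.fsync(self.file.fileno())   # force the OS to write to disk
        self.sync_count += 1

    def close(self):
        self.file.close()


def replay(path):
    """Read entries back, stopping at the first torn or corrupt record."""
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # end of log or incomplete header
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # incomplete or corrupted record
            entries.append(payload)
    return entries
```

In this sketch, calling `append_batch` with two entries issues one `fsync` rather than two. The trade-off is that an entry may wait for the rest of its batch before it becomes durable, so larger batches raise throughput at the cost of per-entry latency.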

Published

26.02.2021

How to Cite

Naveen Srikanth Pasupuleti. (2021). Efficient Management of Disk Throughput in Distributed Architectures. International Journal of Intelligent Systems and Applications in Engineering, 9(1), 102–112. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7608

Section

Research Article