Storage systems Archives

How to synchronize OneDrive and OneDrive for Business files in Linux using Insync

By David Yang, Jan 28, 2020

OneDrive is a popular cloud storage service, and it has a business version called OneDrive for Business. Microsoft’s Office 365 plans, which include the Exchange email service and OneDrive for Business, are widely used. However, there is no official client for Linux users yet. Insync is a third-party cloud storage syncing software…

Storage systems | Systems

How to handle missing blocks and blocks with corrupt replicas in HDFS?

By Eric Ma, Mar 24, 2018 (updated Feb 20, 2020)

The hdfs dfsadmin -report output of one HDFS cluster shows:

Under replicated blocks: 139016
Blocks with corrupt replicas: 9
Missing blocks: 0

The “Under replicated blocks” can be re-replicated automatically after some time. How should the missing blocks and the blocks with corrupt replicas be handled in HDFS?

Understanding these blocks

A block is “with corrupt replicas” in HDFS…
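The three counters quoted in the excerpt above can be pulled out of the report text programmatically. The following is a minimal Python sketch, assuming the report has already been captured as a string (for example from the command's standard output). The names parse_block_counters and needs_attention are hypothetical helpers for illustration, not part of HDFS.

```python
import re

# Counter labels exactly as they appear in the report excerpt.
COUNTERS = (
    "Under replicated blocks",
    "Blocks with corrupt replicas",
    "Missing blocks",
)

# Sample text taken from the excerpt; a real report contains many more lines.
SAMPLE_REPORT = """\
Under replicated blocks: 139016
Blocks with corrupt replicas: 9
Missing blocks: 0
"""


def parse_block_counters(report_text):
    """Return {label: count} for every known counter found in the text."""
    counts = {}
    for label in COUNTERS:
        match = re.search(re.escape(label) + r":\s*(\d+)", report_text)
        if match:
            counts[label] = int(match.group(1))
    return counts


def needs_attention(counts):
    """List the non-zero counters that are not repaired automatically.

    Under-replicated blocks are skipped because, as the excerpt notes,
    they can be re-replicated automatically after some time.
    """
    manual = ("Blocks with corrupt replicas", "Missing blocks")
    return [label for label in manual if counts.get(label, 0) > 0]


if __name__ == "__main__":
    counts = parse_block_counters(SAMPLE_REPORT)
    for label in needs_attention(counts):
        print(f"{label}: {counts[label]}")
```

For the sample values, only the corrupt-replica counter is flagged, since the missing-block count is zero.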

Storage systems | Systems | Tutorial

How to force a metadata checkpointing in HDFS

By Eric Ma, Sep 9, 2017 (updated Sep 11, 2017)

Metadata checkpointing in HDFS is done by the Secondary NameNode, which periodically merges the fsimage and the edits log files and keeps the edits log size within a limit. For various reasons, checkpointing by the Secondary NameNode may fail. For example, the HDFS SecondaryNameNode log shows errors as follows. 2017-08-06 10:54:14,488…

Storage systems | Tutorial

How to Upload Large Files to Amazon S3 with AWS CLI

By Eric Ma, Nov 29, 2015 (updated Aug 30, 2020)

Amazon S3 is a widely used public cloud storage system. S3 allows an object/file to be up to 5 TB, which is enough for most applications. The AWS Management Console provides a Web-based interface for users to upload and manage files in S3 buckets. However, uploading a large file that is hundreds of GB is not…

Computing systems | Resource management | Storage systems | Systems | Tutorial

Hadoop Installation Tutorial (Hadoop 2.x)

By Eric Ma, Sep 14, 2014 (updated Dec 29, 2019)

Hadoop 2, also known as YARN, is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was initially designed and implemented by Google for processing and generating large data…

Computing systems | Storage systems | Systems

Big Data Benchmark from AMPLab of UC Berkeley

By Eric Ma, Mar 17, 2014 (updated Sep 5, 2020)

Benchmarks are important for understanding the performance of different systems and for comparing them quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark The…

Storage systems | Systems

Data Consistency Models of Public Cloud Storage Services: Amazon S3, Google Cloud Storage and Windows Azure Storage

By Eric Ma, Feb 4, 2014 (updated Sep 5, 2020)

Public cloud storage services such as Amazon S3, Google Cloud Storage and Windows Azure Storage replicate data to ensure high availability. On the other hand, because the data is replicated, the storage services exhibit certain data consistency models. Different cloud service providers employ different data consistency models. In this post, we survey the data…

Computing systems | Insights | Storage systems | Systems

Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean

By Eric Ma, Jul 18, 2013 (updated Aug 30, 2020)

You can download the slides of Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean. These slides contain the “Numbers everyone should know”, which everyone working on systems should be familiar with. Numbers Everyone Should Know L1 cache reference 0.5 ns Branch…

Computing systems | Storage systems

Large-scale Data Storage and Processing System in Datacenters

By Eric Ma, Dec 11, 2012 (updated Aug 30, 2020)

Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows.

Storage systems

Google File System (GFS): http://research.google.com/archive/gfs.html
HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Colossus (GFS2): Colossus: Successor to the Google File System (GFS)…

Computing systems | Resource management | Storage systems

Microsoft’s Cosmos Service

By Eric Ma, Dec 10, 2012 (updated May 31, 2020)

Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. No paper or technical report about Cosmos has been published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth…

Storage systems | Systems

Colossus: Successor to the Google File System (GFS)

By Eric Ma, Nov 29, 2012 (updated Aug 2, 2020)

Colossus is the successor to the Google File System (GFS), as mentioned in the paper on Spanner at OSDI 2012. Colossus is also used by Spanner to store its tablets. Information about Colossus is scarce compared with GFS, which was published in a paper at SOSP 2003. There is still some information about Colossus…

Computing systems | Storage systems | Systems

Hadoop Installation Tutorial (Hadoop 1.x)

By Eric Ma, Oct 9, 2012 (updated Nov 28, 2020)

Update: If you are new to Hadoop and trying to install it, please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce that was initially designed…