Why are keys overlap in SST files at Level 0 in LevelDB system?

As is known, keys may be overlap in SST files of level 0 in LevelDB. I am wondering why it needs to be overlap?

An sstable is read-only after being written to the disk from a memtable. Accumulated K/Vs in the log up to a certain size are organized as a memtable and written to the disk as an sstable in level 0.

K/Vs coming in to the log file are not sorted. Hence, a later written sstable in level 0 may have “smaller” keys than keys in previous sstables in level 0.

Example (keys):

write/updates:        2, 9, 1,   3, 9, 8
sstables in level 0: {1, 2, 9}, {3, 8, 9}

This makes sstables in level 0 written to the disk be “like” a log (append only) which is good for write performance to disks. From level 1, overlappings are avoided through the process of merging sstables.

Answered by Eric Z Ma.

What do you mean by “Accumulated K/Vs in the log up to a certain size are organized as a memtable and written to the disk as an sstable in level 0.”?

When one KV record is coming for writing, LevelDB would firstly append commit log to Log File (not sorted) and then insert this KV record to Memtable (SkipList Insert Operation). When Memtable (In sorted orders) achieves the threshold value, it would become the Immutable Table (sorted and read only). At last, Immutable Table would be dumped into disk as SSTables. Therefore, KV records in Level0 SSTables are coming from Immutable Table.

It seems that you mean SSTables in level0 are coming from the Log File. I am not very clear about your point. Would you please clarify this.


It seems that there are some overlaps in Immutable Memtable so it would be flushed into SSTables files in level 0 (Level 0 is not made by compaction). However, other levels (except level 0) would be compacted and merged so there are no overlaps in other levels.


The “accumulated K/Vs in the log” to be written to a sstable is the same set of K/Vs as in the corresponding memtable (turned to be the immutable table).


However, the minor compaction is from Immutable Memtable (records have already been sorted) to SSTables in Level0. There is no merge or delete in this process (Immutable Memtable to SSTable files in Level 0).

It seems like this.

1, write/updates: 2, 9, 1, 3, 9, 8
2, Immutable Memtable: {1, 2, 3, 8, 9, 9}
3, sstable file 1 in level 0: {1, 2, 3, 8, 9, 9}

1, write/updates: 2, 90, 1, 30, 90, 8
2, Immutable Memtable: {1, 2, 8, 30, 90, 90}
3, sstable file 2 in level 0: {1, 2, 8, 30, 90, 90}


The immutable memtable should be identical to the sstable, or sstable is just a dump of an immutable memtable to disk.

Merges are done from level 0 to level 1:

sstable file 1 in level 0: {1, 2, 3, 8, 9, 9}
sstable file 2 in level 0: {1, 2, 8, 30, 90, 90}
sstable file 3 in level 1: {1, 90}

will be merged to (assume no other sstables to be merged)

sstable file 4 in level 1: {1, 2, 3, 8, 9, 30, 90} (assume no deletion). In level 1, there are no other sstables covering key range [1, 90].


ACK. for your last comments. Thank you Zhiqiang ;-)

Eric Z Ma

Eric is a father and systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

Leave a Reply

Your email address will not be published. Required fields are marked *