Lab Paper Accepted by ACM Transactions on Storage

Posted by: Deng Yuhui (邓玉辉) | Published: 2022-07-25

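The paper "InDe: An inline data deduplication approach via adaptive detection of valid container utilization", co-authored by Lin Lifang (林丽芳), a master's student in the lab, faculty member Deng Yuhui (邓玉辉), and others, has been accepted by ACM Transactions on Storage. ACM Transactions on Storage is a CCF-recommended Class A international journal. The paper will be formally published in 2023.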

The abstract of the paper is as follows:

Inline deduplication removes redundant data in real time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, they often fail to select an appropriate set of old containers for rewrite, generating a substantial number of invalid chunks in retrieved containers when backups are restored.
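As a rough illustration of the fixed-threshold behaviour described above, the following toy sketch marks a duplicate chunk as fragmented when its old container is referenced fewer than a preset number of times within the segment. It is not the code of any published rewrite algorithm: the function names, the `min_refs` parameter, and the representation of a segment as a list of container IDs are all assumptions made for this example.

```python
from collections import Counter


def container_reference_counts(segment):
    """Count how many duplicate chunks of a segment live in each old container.

    `segment` is a list of container IDs, one per duplicate chunk,
    in logical (backup stream) order. This layout is hypothetical.
    """
    return Counter(segment)


def fixed_threshold_rewrite(segment, min_refs):
    """Return the positions of chunks a fixed-threshold policy would rewrite.

    A chunk counts as fragmented when its container is referenced fewer
    than `min_refs` times in this segment. The same `min_refs` is applied
    to every segment, whatever its data characteristics.
    """
    counts = container_reference_counts(segment)
    return [i for i, cid in enumerate(segment) if counts[cid] < min_refs]


if __name__ == "__main__":
    # Container 1 holds four of the chunks, container 2 two, container 3 one.
    segment = [1, 1, 1, 2, 3, 1, 2]
    print(fixed_threshold_rewrite(segment, min_refs=2))
    print(fixed_threshold_rewrite(segment, min_refs=3))
```

Because `min_refs` never changes, a value that suits one segment may rewrite too much or too little in another, which is the limitation the abstract points to.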

To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewrite. We design a rewrite algorithm, F-greedy, that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm, called F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimate of the valid utilization of old containers by detecting trends in VCRC changes in two directions, and selects referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3x to 2.4x while achieving almost the same backup performance.
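The abstract does not give the exact definition of VCRC or the steps of F-greedy, so the sketch below only illustrates the general idea of letting the number of retained old containers vary per segment. It uses a plain per-segment reference count as a stand-in for VCRC, and the stopping rule and its `drop_ratio` parameter are hypothetical choices made for this example, not the paper's method.

```python
from collections import Counter


def select_containers_greedy(segment, drop_ratio=0.5):
    """Greedily pick the old containers a segment keeps referencing.

    `segment` is a list of container IDs, one per duplicate chunk.
    Containers are visited from most to least referenced. Selection stops
    at the first container whose count falls below `drop_ratio` times the
    average count of the containers kept so far (a hypothetical rule).
    """
    counts = Counter(segment)
    # Sort by descending count; break ties by container ID for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept, total = [], 0
    for cid, refs in ranked:
        if kept and refs < drop_ratio * (total / len(kept)):
            break
        kept.append(cid)
        total += refs
    return kept


def chunks_to_rewrite(segment, drop_ratio=0.5):
    """Return positions of chunks whose containers were not selected."""
    kept = set(select_containers_greedy(segment, drop_ratio))
    return [i for i, cid in enumerate(segment) if cid not in kept]


if __name__ == "__main__":
    skewed = [1] * 8 + [2] * 6 + [3] + [4]   # two dense containers, two sparse
    uniform = [1, 1, 2, 2, 3, 3]             # evenly shared containers
    print(select_containers_greedy(skewed), chunks_to_rewrite(skewed))
    print(select_containers_greedy(uniform), chunks_to_rewrite(uniform))
```

In this toy version the skewed segment keeps two containers and rewrites the two sparse chunks, while the uniform segment keeps all three, so the number of retained containers follows the distribution of each segment instead of a single preset value.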