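The paper "MGRM: A Multi-segment Greedy Rewriting Method to Alleviate Data Fragmentation in Deduplication-based Cloud Backup Systems", co-authored by Datong Zhang (张大统), a master's student in our lab, Prof. Yuhui Deng (邓玉辉), and colleagues, has been accepted by IEEE Transactions on Cloud Computing. The paper will be formally published in 2023.
The abstract of the paper is as follows: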
Abstract—Data deduplication has been broadly used in the cloud due to its ability to save storage space. One issue with deduplication is that the contiguous data chunks in a segment may be scattered across different containers. This phenomenon is called data fragmentation. Because of data fragmentation, a restore process must reference various containers across a wide variety of segments, thereby hurting restore performance. Capping methods, which rewrite the data chunks of containers with a low Container Reference Ratio (CRR), have been developed to alleviate data fragmentation. We analyze real traces and observe that a number of segments only point to low-CRR containers, while some others only contain high-CRR containers. This interesting observation is ignored by the existing capping methods, which sort containers from a single segment and fall short in searching multiple segments collectively. Thus, the reference count of the containers selected by the existing capping methods is still high. To address this problem, we propose a multi-segment greedy rewriting method named MGRM. MGRM sorts the containers of segments in a sequential way. More specifically, given the i-th segment currently being processed, MGRM sorts all the containers in the first i segments. This salient searching feature enables MGRM to select and rewrite the true low-reference container set. Moreover, to achieve a good balance between deduplication ratio and restore performance, MGRM has two working modes: an optimal rewriting mode and a radical rewriting mode. When working in the optimal rewriting mode, MGRM aims to improve the deduplication ratio; when working in the radical rewriting mode, MGRM strives to improve the restore performance. MGRM adaptively switches the working mode according to the workload. Furthermore, unlike the existing capping methods that improve restore performance at the cost of the deduplication ratio, MGRM pays attention to both aspects.
Our extensive experimental results show that MGRM achieves high restore performance coupled with a high deduplication ratio. In particular, compared with the two state-of-the-art schemes FC and FLC, MGRM improves the deduplication ratio and restore performance by up to 114.83% and 99.34%, respectively.
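To make the contrast in the abstract concrete, here is a toy Python sketch. It is not the paper's algorithm. It only illustrates how the set of low-reference containers can differ when containers are ranked within one segment versus over the first i segments. Everything in it is an assumption made for illustration: the function names, the `cap` parameter, the use of a plain chunk count as the reference measure, and the representation of a segment as a list of container IDs (one entry per duplicate chunk).

```python
from collections import Counter


def per_segment_low_ref(segments, cap):
    """Single-segment view (assumed): rank the containers of each segment by
    how many of that segment's chunks they hold, keep the `cap` most
    referenced, and report the rest as rewrite candidates."""
    result = []
    for seg in segments:
        counts = Counter(seg)
        # Sort by descending count; break ties by container ID for determinism.
        ranked = sorted(counts, key=lambda c: (-counts[c], c))
        result.append(set(ranked[cap:]))
    return result


def cumulative_low_ref(segments, cap):
    """Multi-segment view (assumed): when processing segment i, rank its
    containers by counts accumulated over segments 1..i."""
    running = Counter()
    result = []
    for seg in segments:
        running.update(seg)
        ranked = sorted(set(seg), key=lambda c: (-running[c], c))
        result.append(set(ranked[cap:]))
    return result


# Hypothetical trace: each inner list names the container holding each
# duplicate chunk of one segment.
segments = [
    ["A", "A", "A", "B"],
    ["B", "C", "C", "A"],
]

single = per_segment_low_ref(segments, cap=1)
pooled = cumulative_low_ref(segments, cap=1)
```

In this made-up trace, container A is referenced only once by the second segment, so the single-segment ranking marks it as a rewrite candidate there, while the pooled ranking keeps it because A is heavily referenced across both segments. How MGRM actually selects containers, and how its two working modes choose what to rewrite, is specified in the paper rather than here.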