Cleaning Up Duplicate Objects in OBS Object Storage

Introduction

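Back when cloud drives took off, I backed up personal files to several different drives, and inconsistent syncing left me with a large number of duplicate files. To keep everything intact, when I reinstalled my system before the New Year I backed up all 43K files, 108 GB in total, to Huawei Cloud object storage (OBS). To clean up the duplicates, I used the OBS Python SDK to export the object names (keys) and their identifiers (ETags) to Excel, filtered out the duplicate keys, and deleted them. The cleanup freed about half of the space.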

Environment Preparation

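1. Preparing the cloud space

  • Register a Huawei Cloud account
  • Purchase a resource package
  • Obtain the access keys and the endpoint
  • Create an OBS bucket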
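Once the access keys and endpoint are in hand, one option is to keep them out of the scripts by reading them from environment variables. The sketch below is only an illustration: the variable names OBS_ACCESS_KEY, OBS_SECRET_KEY and OBS_SERVER and the helper load_obs_config are hypothetical names chosen here, not something defined by the OBS SDK.

```python
import os

# Hypothetical variable names chosen for this sketch; not defined by the OBS SDK.
REQUIRED_VARS = ("OBS_ACCESS_KEY", "OBS_SECRET_KEY", "OBS_SERVER")


def load_obs_config(env=None):
    """Return the OBS settings found in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "access_key_id": env["OBS_ACCESS_KEY"],
        "secret_access_key": env["OBS_SECRET_KEY"],
        "server": env["OBS_SERVER"],
    }
```

The returned dictionary uses the same keyword names that the scripts in this post pass to ObsClient, so it could be unpacked as ObsClient(**load_obs_config()).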

2. Setting up the development environment

PS: setting a proxy when upgrading pip:

python -m pip install --upgrade pip --proxy="https://127.0.0.1:8080"

3. Data migration

Use OBS Browser+ to upload, download, and browse personal files.

4. Cleaning up duplicate files with the Python SDK

Exporting the file list

# coding=UTF-8
import traceback

import openpyxl
from obs import ObsClient

AK = '*** Provide your Access Key ***'
SK = '*** Provide your Secret Key ***'
server = 'https://yourdomainname'
bucketName = 'my-obs-bucket-demo'

# Construct an OBS client instance with your account for accessing OBS
obsClient = ObsClient(
    access_key_id=AK,
    secret_access_key=SK,
    server=server
)

# Raw string so the backslashes in the Windows path are not read as escapes
path = r'D:\huaweiobs\list-all.xlsx'
sheet_name = 'keys'
workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = sheet_name

print('List all the objects in way of pagination: \n')
pageSize = 1000
index = 1
rows = 0
nextMarker = None
while True:
    try:
        resp = obsClient.listObjects(bucketName, max_keys=pageSize, marker=nextMarker)
    except Exception:
        # Stop instead of retrying the same page forever
        print(traceback.format_exc())
        break
    if resp.status >= 300:
        print('errorCode:', resp.errorCode)
        print('errorMessage:', resp.errorMessage)
        break
    print('Page:' + str(index) + '\n')
    for content in resp.body.contents:
        # Column 1 holds the key, column 2 the ETag
        rows += 1
        sheet.cell(row=rows, column=1, value=str(content.key))
        sheet.cell(row=rows, column=2, value=str(content.etag))
    # Save after every page so a partial list survives a later failure
    workbook.save(path)
    if not resp.body.is_truncated:
        break
    nextMarker = resp.body.next_marker
    index += 1

workbook.close()
obsClient.close()




Filtering out the redundant duplicates

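1. Copy the sheet 'keys' from the exported list file list-all.xlsx to another sheet, 'baoliu' ("keep"), and delete the rows with duplicate ETags so that one key remains for each unique ETag.
2. In the original sheet 'keys', use VLOOKUP against the new sheet 'baoliu' to find the keys that are to be kept, mark them at the end of the row, and delete those rows. The rows that remain are the keys that need to be deleted from OBS.
3. Save the keys to be deleted from OBS in a separate file ('D:\huaweiobs\list-r.xlsx') for later use.
PS: The quickest way to find redundant ETags is to sort by ETag, compare each row from the second onward with the row above it, flag the row if the two are the same, and finally filter out the corresponding keys.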
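The sort-and-compare approach from the PS can also be sketched in Python instead of Excel. This is only an illustration, not the procedure used above: the function name split_keep_delete and the sample rows are made up, and it assumes that two objects with the same ETag have the same content, which may not hold for objects uploaded in multiple parts.

```python
def split_keep_delete(rows):
    """Split (key, etag) rows into keys to keep and keys to delete.

    Rows are sorted by ETag (then key); a row whose ETag equals the
    previous row's ETag is treated as a duplicate.
    """
    keep, delete = [], []
    previous_etag = None
    for key, etag in sorted(rows, key=lambda row: (row[1], row[0])):
        if etag == previous_etag:
            delete.append(key)      # same ETag as the row above: duplicate
        else:
            keep.append(key)        # first key seen for this ETag: keep it
            previous_etag = etag
    return keep, delete


# Made-up sample rows in the same (key, etag) layout as the exported sheet
rows = [
    ("photos/a.jpg", '"e1"'),
    ("backup/a.jpg", '"e1"'),
    ("docs/b.txt", '"e2"'),
]
keep, delete = split_keep_delete(rows)
print(keep)
print(delete)
```

Because ties are broken by key, the copy that is kept for each ETag is the one whose key sorts first; a different tie-break (for example, preferring a particular folder) would need a different sort key.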

Deleting the duplicate files

# coding=UTF-8
import traceback

from obs import ObsClient, DeleteObjectsRequest, Object
from openpyxl import load_workbook

AK = '*** Provide your Access Key ***'
SK = '*** Provide your Secret Key ***'
server = 'https://yourdomainname'
bucketName = 'my-obs-bucket-demo'

# Construct an OBS client instance with your account for accessing OBS
obsClient = ObsClient(
    access_key_id=AK,
    secret_access_key=SK,
    server=server
)

# Raw string so the backslashes in the Windows path are not read as escapes
path = r'D:\huaweiobs\list-r.xlsx'
sheet_name = 'keys'
wb = load_workbook(filename=path)
sheet_ranges = wb[sheet_name]

# Read every non-empty key from column A, so no empty cells past the
# last row are sent to OBS
allKeys = [row[0] for row in
           sheet_ranges.iter_rows(min_col=1, max_col=1, values_only=True)
           if row[0]]
print('Keys to delete:', len(allKeys))

batchSize = 50
pageindex = 0
for start in range(0, len(allKeys), batchSize):
    # Delete the keys in batches of batchSize
    keys = [Object(key=str(key)) for key in allKeys[start:start + batchSize]]
    try:
        resp = obsClient.deleteObjects(bucketName, DeleteObjectsRequest(False, keys))
    except Exception:
        print(traceback.format_exc())
        continue
    if resp.status < 300:
        pageindex += 1
        print('Delete results:' + str(pageindex))
        if resp.body.deleted:
            for delete in resp.body.deleted:
                print('\t' + str(delete))
        if resp.body.error:
            for err in resp.body.error:
                print('\t' + str(err))
    else:
        print('errorCode:', resp.errorCode)
        print('errorMessage:', resp.errorMessage)

wb.close()
obsClient.close()
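Because deleted objects cannot be recovered unless the bucket keeps versions, a sanity check could be run on the two lists before the delete script above. The sketch below is a hypothetical helper, not part of the OBS SDK: it takes the (key, etag) rows from list-all.xlsx and the keys from list-r.xlsx as plain Python lists and reports any key whose deletion would leave no surviving object with the same ETag.

```python
def unsafe_deletions(all_rows, delete_keys):
    """Return the keys in delete_keys that are not safe to delete.

    A key is unsafe if it is missing from all_rows, or if no object
    with the same ETag would remain after all deletions.
    """
    delete_set = set(delete_keys)
    # ETags that still have at least one object after the deletions
    surviving_etags = {etag for key, etag in all_rows if key not in delete_set}
    etag_of = dict(all_rows)
    return [key for key in delete_keys
            if key not in etag_of or etag_of[key] not in surviving_etags]


# Made-up sample data: "a" and "b" share an ETag, "c" is unique
all_rows = [("a", "e1"), ("b", "e1"), ("c", "e2")]
print(unsafe_deletions(all_rows, ["b"]))
print(unsafe_deletions(all_rows, ["a", "b"]))
```

An empty result means every ETag in the delete list keeps at least one copy; anything else is worth reviewing by hand before deleting.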


