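Introduction
Back when cloud drives were taking off, I backed up personal files to several of them. Inconsistent syncing left me with a large number of duplicate files. To keep everything intact, when I reinstalled my system before the New Year I backed up all 43K files, 108 GB in total, to Huawei Cloud Object Storage Service (OBS). To clean up the duplicates, I used the OBS Python SDK to export each file name (key) and its unique identifier (ETag) to Excel, filtered out the duplicated files (keys), and deleted them. The cleanup freed up half of the space.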
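Environment setup
1. Cloud storage
- Register a Huawei Cloud account
- Purchase a resource package
- Obtain the access keys (AK/SK) and the endpoint
- Create an OBS bucket
2. Development environment
- Download and install a suitable Python version from the official Python website.
- Download and install the latest PyCharm Community edition from the official PyCharm website.
- Install the OBS Python SDK (published on PyPI as esdk-obs-python).
PS: to upgrade pip through a proxy:
python -m pip install --upgrade pip --proxy="https://127.0.0.1:8080"
3. Data migration
Use OBS Browser+ to upload, download, and browse the personal files.
4. Cleaning up duplicate files with the Python SDK
Export the file list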
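The access key, secret key, and endpoint obtained in this step are needed by every script later in this post. A minimal sketch, assuming they are kept in environment variables rather than typed into the scripts; the variable names OBS_ACCESS_KEY_ID, OBS_SECRET_ACCESS_KEY, and OBS_SERVER and the helper load_obs_config are hypothetical, not something the SDK defines:

```python
import os


def load_obs_config(env=os.environ):
    """Read the OBS credentials and endpoint from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    names = {
        'access_key_id': 'OBS_ACCESS_KEY_ID',
        'secret_access_key': 'OBS_SECRET_ACCESS_KEY',
        'server': 'OBS_SERVER',
    }
    # Fail early with a readable message if anything is missing or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {arg: env[var] for arg, var in names.items()}
```

The returned dictionary uses the same keyword names that the scripts in this post pass to ObsClient, so it could be used as ObsClient(**load_obs_config()).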
# coding=UTF-8
import traceback

import openpyxl
from obs import ObsClient

AK = '*** Provide your Access Key ***'
SK = '*** Provide your Secret Key ***'
server = 'https://yourdomainname'
bucketName = 'my-obs-bucket-demo'

# Raw string so the backslashes in the Windows path are taken literally
path = r'D:\huaweiobs\list-all.xlsx'
sheet_name = 'keys'

# Construct an OBS client instance with your account for accessing OBS
obsClient = ObsClient(
    access_key_id=AK,
    secret_access_key=SK,
    server=server
)

workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = sheet_name

print('List all the objects in way of pagination: \n')
pageSize = 1000
index = 1
rows = 0
nextMarker = None
try:
    while True:
        resp = obsClient.listObjects(bucketName, max_keys=pageSize, marker=nextMarker)
        if resp.status >= 300:
            # Stop on failure; otherwise the same page would be requested forever
            print('errorCode:', resp.errorCode)
            print('errorMessage:', resp.errorMessage)
            break
        print('Page:' + str(index) + '\n')
        for content in resp.body.contents:
            rows += 1
            # Column 1: object key; column 2: ETag
            sheet.cell(row=rows, column=1, value=str(content.key))
            sheet.cell(row=rows, column=2, value=str(content.etag))
        # Save once per page so a later failure keeps the pages already listed
        workbook.save(path)
        if not resp.body.is_truncated:
            break
        nextMarker = resp.body.next_marker
        index += 1
except Exception:
    print(traceback.format_exc())
finally:
    workbook.close()
    obsClient.close()
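The export script relies on marker-based pagination: each request returns one page plus a marker telling the next request where to continue. The pattern can be sketched on its own, without touching OBS. Here iter_pages and make_fake_fetch are hypothetical helpers, and the fake fetch function is only an in-memory stand-in for a bucket listing:

```python
def iter_pages(fetch_page):
    """Yield pages until the listing is no longer truncated.

    fetch_page(marker) must return (items, is_truncated, next_marker).
    """
    marker = None
    while True:
        items, is_truncated, next_marker = fetch_page(marker)
        yield items
        if not is_truncated:
            break
        marker = next_marker


def make_fake_fetch(all_keys, page_size):
    """Build an in-memory stand-in for a paginated listing (illustration only)."""
    def fetch_page(marker):
        # A marker means "continue with the keys after this one"
        start = 0 if marker is None else all_keys.index(marker) + 1
        page = all_keys[start:start + page_size]
        truncated = start + page_size < len(all_keys)
        return page, truncated, (page[-1] if truncated else None)
    return fetch_page


pages = list(iter_pages(make_fake_fetch(['a', 'b', 'c', 'd', 'e'], 2)))
print(pages)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Against a real bucket, fetch_page would wrap the listObjects call and return resp.body.contents, resp.body.is_truncated, and resp.body.next_marker.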
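Filter out the redundant duplicates
1. Copy the 'keys' sheet of the exported list-all.xlsx into a second sheet named 'baoliu' ("keep"). In that sheet, remove the rows with duplicate ETags so that one file name (key) remains for each unique ETag.
2. In the original 'keys' sheet, use VLOOKUP against the 'baoliu' sheet to find the keys that should be kept, mark them at the end of the row, and delete those rows. The remaining rows are the keys that need to be deleted from OBS.
3. Save the keys to be deleted from the 'keys' sheet as a separate file ('D:\huaweiobs\list-r.xlsx') for the next step.
PS: the quickest way to find the redundant ETags is to sort by ETag, compare each row with the previous one starting from the second row, flag the row if the two match, and finally filter out the corresponding keys.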
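The same filtering can also be done in Python instead of Excel. A minimal sketch, assuming the rows are (key, ETag) pairs as exported above and that the first key seen for each ETag is the one to keep; split_by_etag is a hypothetical helper, and the sample rows are made up for illustration:

```python
def split_by_etag(rows):
    """Split (key, etag) rows into keys to keep and keys to delete.

    The first key seen for each ETag is kept; later keys with the
    same ETag are treated as duplicates.
    """
    seen = set()
    keep, delete = [], []
    for key, etag in rows:
        if etag in seen:
            delete.append(key)
        else:
            seen.add(etag)
            keep.append(key)
    return keep, delete


# Made-up sample rows, for illustration only
sample = [
    ('photos/a.jpg', '"e1"'),
    ('backup/a.jpg', '"e1"'),
    ('docs/b.txt', '"e2"'),
    ('old/a-copy.jpg', '"e1"'),
]
keep, delete = split_by_etag(sample)
print(keep)    # ['photos/a.jpg', 'docs/b.txt']
print(delete)  # ['backup/a.jpg', 'old/a-copy.jpg']
```

The rows could be read from the exported sheet with openpyxl, for example via sheet.iter_rows(min_col=1, max_col=2, values_only=True). Which copy of a duplicate survives depends on row order, so sort the rows first if a particular folder should win.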
Delete the duplicate files
# coding=UTF-8
import traceback

from obs import ObsClient, DeleteObjectsRequest, Object
from openpyxl import load_workbook

AK = '*** Provide your Access Key ***'
SK = '*** Provide your Secret Key ***'
server = 'https://yourdomainname'
bucketName = 'my-obs-bucket-demo'

# Raw string so the backslashes in the Windows path are taken literally
path = r'D:\huaweiobs\list-r.xlsx'
sheet_name = 'keys'
batchSize = 50  # number of keys sent in one delete request

# Construct an OBS client instance with your account for accessing OBS
obsClient = ObsClient(
    access_key_id=AK,
    secret_access_key=SK,
    server=server
)

wb = load_workbook(filename=path)
sheet_ranges = wb[sheet_name]
print(sheet_ranges['A2'].value)  # preview one key before deleting anything

pageindex = 0
rowsindex = 0
maxrows = sheet_ranges.max_row
try:
    while rowsindex < maxrows:
        keys = []
        # Collect up to batchSize keys without reading past the last row
        while len(keys) < batchSize and rowsindex < maxrows:
            rowsindex += 1
            key = sheet_ranges.cell(rowsindex, 1).value
            if key:  # skip empty cells
                keys.append(Object(key=str(key)))
        if not keys:
            continue
        resp = obsClient.deleteObjects(bucketName, DeleteObjectsRequest(False, keys))
        if resp.status < 300:
            pageindex += 1
            print('Delete results:' + str(pageindex))
            if resp.body.deleted:
                for delete in resp.body.deleted:
                    print('\t' + str(delete))
            if resp.body.error:
                for err in resp.body.error:
                    print('\t' + str(err))
        else:
            print('errorCode:', resp.errorCode)
            print('errorMessage:', resp.errorMessage)
except Exception:
    print(traceback.format_exc())
finally:
    wb.close()
    obsClient.close()
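The batching step in the delete script, sending the keys in groups of 50, can be looked at in isolation. A minimal sketch, assuming the keys have already been read from the sheet into a plain list with empty cells dropped; chunked is a hypothetical helper and the key names are made up:

```python
def chunked(items, size):
    """Yield consecutive slices of items, each holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up keys, with None standing in for an empty spreadsheet cell
column = ['a.jpg', 'b.jpg', None, 'c.jpg', 'd.jpg', 'e.jpg']
keys = [k for k in column if k]
batches = list(chunked(keys, 2))
print(batches)  # [['a.jpg', 'b.jpg'], ['c.jpg', 'd.jpg'], ['e.jpg']]
```

Each batch would then be wrapped in Object(key=...) entries and passed to one deleteObjects call. Slicing never reads past the end of the list, so the last batch is simply shorter.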