Massive Data Duplicate Situation

pseudomeaningful

Hello, fellow Carbies

I'm hoping your combined knowledge can help me with the problem I'm facing. Well, one I usually turn my back on, because it makes me cringe to see external HDDs sitting in a box on top of a computer tower.

Unfortunately I am the son of a massive data hoarder, and not the cool r/datahoarders kind. My mother has approximately 16TB of data spread across 8-10 external hard drives. To make matters worse, they are literally all 2.5" Seagate drives 😖, one of which has already failed because... Seagate.

The drives mostly contain backups of her pictures as well as backups of her two work desktops. The issue is that there is an astonishing number of duplicate files. Of the approximately 16TB, I estimate there is at most 2-3TB of actual data. I plugged as many drives as I could into my computer at once and ran CCleaner's simple duplicate file tool over them, and the resulting .txt report was 130MB... I have entire microbiology textbook PDFs that are smaller. There are literally five copies of the same image on some of the drives.

I looked into all the types of de-duplication software available, and the most comprehensive I could find was Diskover, which requires a working knowledge of Python; I simply don't have time to learn that right now with university.

Thus I decided the simplest solution would be to build an Unraid server (I've also always wanted a home server) that she can back her pictures and local machines up to. I'll probably also get the company to switch to Macrium Reflect instead of the Cobian they're currently using, to provide some redundancy and safety for storage. I will add offsite storage through AWS, Backblaze, or similar at a later stage. I have already acquired most of the components for the server.

FINALLY, the actual question. I rather quickly came to the realization that de-duping this much data is way, way beyond my abilities. Thus, could anyone recommend an individual or company in the Durban area that deals with data management and would be able to create two master copies of the de-duped files? Obviously there will be a cost involved, but continuing to buy hard drives is ridiculous.

I figured this would be the best place to start and if necessary the thread can be moved to a more appropriate section by the mods.

TL;DR - Much data, so duplicated. Let me know if you know of anyone who can de-dupe a lot of hard drives.
 
I have a weird way of going through duplicate data:

Search *.* for all files and then sort by file size or type.

Really manual and not great for a ton of data but it does work.

I had some software that could find duplicate data for you but it wasn’t good unless you paid for it.
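
If you want to automate that size-then-hash idea, a short Python script can do it without paid software. This is just a rough, untested sketch with placeholder drive letters and output filename; it groups files by size first (cheap) and only hashes the files that share a size:

```python
import hashlib
import os
from collections import defaultdict

ROOTS = ["E:/", "F:/"]  # placeholders: the external drives to scan

def file_hash(path, chunk=1024 * 1024):
    """Hash a file in chunks so large files don't use much memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Pass 1: group files by size; files of different sizes can't be duplicates.
by_size = defaultdict(list)
for root in ROOTS:
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # skip unreadable files

# Pass 2: hash only the files that share a size, then group by hash.
by_hash = defaultdict(list)
for size, paths in by_size.items():
    if len(paths) < 2:
        continue
    for path in paths:
        try:
            by_hash[file_hash(path)].append(path)
        except OSError:
            pass

# Write a report of every group of identical files.
with open("duplicates.txt", "w", encoding="utf-8") as out:
    for digest, paths in sorted(by_hash.items()):
        if len(paths) > 1:
            out.write(digest + "\n")
            out.writelines("  " + p + "\n" for p in paths)
            out.write("\n")
```

It only reports the duplicate groups; deciding what to delete is still manual, which is probably safer with data like this.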
 
I struggle with this with my mother too, lol. And clients... Check out FreeNAS, it has a data deduplication thingy; not sure if it will do what you want though.

Look at an app called FolderSize. It might help you see where the large files/folders are. Before you start on this, copy the NB data to another drive and to one cloud service, and verify both the copy and the cloud backup before moving forward.
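
For the "verify the copy" step, a quick way is to walk the source tree and compare each file byte-for-byte against the copy. A minimal Python sketch, assuming the copy keeps the same folder layout (both paths are placeholders):

```python
import filecmp
import os

SRC = "E:/NB-data"         # placeholder: the original drive/folder
DST = "D:/backup/NB-data"  # placeholder: the copy you just made

missing, mismatched = [], []
for dirpath, _dirs, files in os.walk(SRC):
    rel = os.path.relpath(dirpath, SRC)
    for name in files:
        src_file = os.path.join(dirpath, name)
        dst_file = os.path.join(DST, rel, name)
        if not os.path.exists(dst_file):
            missing.append(dst_file)
        # shallow=False compares actual file contents, not just size/metadata.
        elif not filecmp.cmp(src_file, dst_file, shallow=False):
            mismatched.append(dst_file)

print(f"missing: {len(missing)}, mismatched: {len(mismatched)}")
```

For the cloud copy you'd compare checksums or spot-check downloads instead.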
 
For the Photos...

If you aren't pedantic about Google and cloud storage, use Google Photos. They will store unlimited photos for free. They will, however, compress the photos, so if these pics were taken with a DSLR and a single image is 100 MB, then this won't work.
 

Apparently the dedupe on FreeNAS is really slow, but I'm breaking the 4TB mirror on my main computer so I can create a ZFS pool big enough. Created an AWS account and I'm uploading the raw data there, yay for fiber. I should have finished flashing the HBA by the time that's done 😂 and literally all my side projects too.
 

Interesting problem lol.
What if you pushed everything into S3 or something, and filtered out the duplicates by MD5 hash?

Not sure how insane the bill would be though.
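
That can work quite neatly if you use each file's hash as the object key, so identical files collapse into a single object no matter how many drives they came from. A rough boto3 sketch, with a made-up bucket name and drive path; you'd still pay for storage and PUT requests, and MD5 here is only for dedup, not security:

```python
import hashlib
import os

import boto3
from botocore.exceptions import ClientError

BUCKET = "mums-photo-archive"  # placeholder bucket name
ROOT = "E:/"                   # placeholder: one of the external drives

s3 = boto3.client("s3")

def md5_of(path, chunk=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def object_exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

with open("manifest.txt", "a", encoding="utf-8") as manifest:
    for dirpath, _dirs, files in os.walk(ROOT):
        for name in files:
            path = os.path.join(dirpath, name)
            key = md5_of(path)  # content-addressed key: identical files share it
            if not object_exists(key):
                s3.upload_file(path, BUCKET, key)
            # Record hash -> original path so filenames can be recovered later.
            manifest.write(f"{key}\t{path}\n")
```

Hashing locally and only uploading what's new also keeps the transfer, and the bill, down a bit.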
 
Glary Utilities has a duplicate file finder; it works 100% and is free.
 
