Digital Cruft, Part 1

I bet you have duplicate files on your computer (or on drives you have nearby) - maybe you even have duplicates of duplicates. Or maybe you're just not sure.

Let me give you an example. One of my drives:

So this is a mess, right? If I bring in backups, it’s far, far worse.

I have a plan.

I'm going to write a program to scan through all my files and stick them in a database, and I'm going to bring you along on the adventure.

Part of software development is having at least a rough plan. Here's mine:

  • Database - Probably PostgreSQL, just because it's what I'm currently using at work, and that simplifies things for me.
  • Program - C# for now, probably on .NET 5 or 6. It's a language I know well, and I want this to be as fast as possible without delving into the intricacies of C++.
    • Scan each folder.
    • Record any folders that can't be scanned. Some, like the Windows folder, will have access restrictions, I think - noting the path should help.
    • Record each folder into a table, along with all the folders under it, recursively. We'll work out what to do with extra drives later.
    • Once we've scanned all the folders, I'm going to scan all the files, recording each file's name, size, and creation date. If all of these match for two files, I'm going to hash them to confirm they're actually duplicates and that one can safely be removed.
  • We'll go from there - developing queries along the way to look at this data and see what sense we can make of things.
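To make the plan concrete, here's a rough sketch of what the scanning and duplicate-detection steps might look like in C#. This is just an illustration of the approach, not the real program - the class and method names are made up, and the database part is left out entirely:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class Scanner
{
    // Folders we couldn't read (e.g. under C:\Windows); per the plan,
    // we just note the path and move on.
    public List<string> SkippedFolders { get; } = new();

    // Recursively walk a folder, yielding every file we can reach.
    public IEnumerable<FileInfo> ScanFolder(string path)
    {
        string[] subfolders = Array.Empty<string>();
        string[] files = Array.Empty<string>();
        try
        {
            subfolders = Directory.GetDirectories(path);
            files = Directory.GetFiles(path);
        }
        catch (UnauthorizedAccessException)
        {
            SkippedFolders.Add(path);   // record it for later review
        }

        foreach (var file in files)
            yield return new FileInfo(file);

        foreach (var sub in subfolders)
            foreach (var fi in ScanFolder(sub))
                yield return fi;
    }

    // Only hash files whose name, size, and creation date all match
    // some other file - hashing everything would be far too slow.
    public List<IGrouping<string, FileInfo>> FindDuplicates(IEnumerable<FileInfo> files)
    {
        var candidates = files
            .GroupBy(f => (f.Name, f.Length, f.CreationTimeUtc))
            .Where(g => g.Count() > 1)
            .SelectMany(g => g);

        using var sha = SHA256.Create();
        return candidates
            .GroupBy(f =>
            {
                using var stream = File.OpenRead(f.FullName);
                return Convert.ToHexString(sha.ComputeHash(stream));
            })
            .Where(g => g.Count() > 1)   // same hash => genuine duplicates
            .ToList();
    }
}
```

The real version would write each folder and file into PostgreSQL tables instead of holding everything in memory, but the shape of the logic - walk, record, group by cheap metadata, confirm with a hash - should stay about the same.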

I plan on doing all of this as an open source project.

Next

SSH Quirks