I encountered a stream of tweets of which the following are typical:
Hmmm, is cf.7z a different set of files from ebd-cf.7z?
You could “eye-ball” the directory listings, but that is tedious and error-prone.
Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s concatenate cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt and sort the combined listing into cf-7z-and-ebd-cf-files-Sorted.txt.
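A minimal sketch of that step, assuming the two listing files sit in the current working directory (the file names are the ones used above):

# Concatenate the two directory listings, sort the result,
# and save it as the combined file used in the checks below.
cat cf-7z-file-Sorted-Uniq.txt ebd-cf-file-Sorted-Uniq.txt | sort > cf-7z-and-ebd-cf-files-Sorted.txt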
Running
uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l
(“-d” prints only the duplicated lines, and wc -l counts them) on the combined file gives 2177 duplicates. (The total length of the file is 4354 lines, i.e. 2 × 2177, so every line appears exactly twice.)
Running
uniq -u cf-7z-and-ebd-cf-files-Sorted.txt
(“-u” prints only the lines that are not repeated) returns nothing: there are no unique lines.
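Putting the two checks together, a quick verification pass over the combined listing might look like this (the expected counts are the ones reported above):

# Total number of lines in the combined listing (4354 here)
wc -l cf-7z-and-ebd-cf-files-Sorted.txt

# Count the lines that appear more than once (2177 here)
uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

# Show any lines that appear only once (none here)
uniq -u cf-7z-and-ebd-cf-files-Sorted.txt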
With a little experience, you will be able to check very large file archives for duplicates this way. In this particular case, despite circulating under different names, the two archives appear to contain the same files.
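If you want to repeat the check for other pairs of archives, a small shell script along these lines would do. The script name and argument handling are my own invention, and it assumes you have already produced one sorted, de-duplicated listing file per archive (as we did here):

#!/bin/sh
# compare-listings.sh (hypothetical name): report whether two sorted,
# de-duplicated directory listings contain exactly the same entries.
# Usage: sh compare-listings.sh listing-A.txt listing-B.txt

combined=$(mktemp)
cat "$1" "$2" | sort > "$combined"

echo "Total lines:     $(wc -l < "$combined")"
echo "Duplicate lines: $(uniq -d "$combined" | wc -l)"

uniques=$(uniq -u "$combined" | wc -l)
if [ "$uniques" -eq 0 ]; then
    echo "The two listings contain the same set of files."
else
    echo "Entries found in only one of the listings:"
    uniq -u "$combined"
fi

rm -f "$combined"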
BTW, do you think a similar technique could be applied to spreadsheets?