Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 8, 2016

Chasing File Names – Check My Work

Filed under: Government,Hillary Clinton,Politics — Patrick Durusau @ 9:04 pm

I encountered a stream of tweets of which the following are typical:

[Image: Guccifer 2.0 tweets mentioning cf.7z]

Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt, and sort that file into cf-7z-and-ebd-cf-files-Sorted.txt.
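The combining step might look like the sketch below. It assumes both listings are plain text with one file name per line. The function name combine_listings and the three-line sample data are made up for illustration and stand in for the real listings.

```shell
# Merge two sorted, de-duplicated listings into one sorted file.
# Arguments: listing A, listing B, output file.
combine_listings() {
    # sort reads both inputs and emits a single sorted stream;
    # LC_ALL=C pins the collation order so results do not vary by locale.
    LC_ALL=C sort "$1" "$2" > "$3"
}

# Demonstration on hypothetical stand-in data, not the real listings.
demo_dir=$(mktemp -d)
printf '%s\n' 'a.doc' 'b.xls' 'c.pdf' > "$demo_dir/cf-7z-file-Sorted-Uniq.txt"
printf '%s\n' 'a.doc' 'b.xls' 'c.pdf' > "$demo_dir/ebd-cf-file-Sorted-Uniq.txt"

combine_listings "$demo_dir/cf-7z-file-Sorted-Uniq.txt" \
                 "$demo_dir/ebd-cf-file-Sorted-Uniq.txt" \
                 "$demo_dir/cf-7z-and-ebd-cf-files-Sorted.txt"

# Each sample name now appears on two adjacent lines.
cat "$demo_dir/cf-7z-and-ebd-cf-files-Sorted.txt"
```

With the real listings in the current directory, the same function would be called with cf-7z-file-Sorted-Uniq.txt, ebd-cf-file-Sorted-Uniq.txt, and cf-7z-and-ebd-cf-files-Sorted.txt as its three arguments.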

Running

uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

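(“-d” for duplicate lines) on the resulting file, with the output piped into wc -l, gives a count of 2177 duplicated lines. The file is 4354 lines long in total, which is exactly 2 × 2177. Assuming neither input listing repeats a name internally (as the -Sorted-Uniq file names suggest), that means every name occurs once in each listing.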

Running

uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

(“-u” for unique lines) prints nothing: no file name appears only once.
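The two checks can be folded into one small helper, sketched here under the assumption that the combined listing is sorted and that neither source listing repeats a name internally. The name summarize_listing and the sample data are hypothetical.

```shell
# Report how many names are shared and how many appear only once
# in a combined, sorted listing (path passed as the first argument).
summarize_listing() {
    total=$(wc -l < "$1" | tr -d ' ')
    shared=$(uniq -d "$1" | wc -l | tr -d ' ')
    single=$(uniq -u "$1" | wc -l | tr -d ' ')
    echo "total=$total shared=$shared single=$single"
    # If each source listing was already de-duplicated, every shared name
    # occupies exactly two lines, so a full match means total = 2 * shared.
    if [ "$single" -eq 0 ] && [ "$total" -eq $((shared * 2)) ]; then
        echo "listings match"
    else
        echo "listings differ"
    fi
}

# Demonstration on hypothetical stand-in data, not the real listing.
demo_file=$(mktemp)
printf '%s\n' 'a.doc' 'a.doc' 'b.xls' 'b.xls' > "$demo_file"
summarize_listing "$demo_file"
```

Fed the figures reported above (4354 lines, 2177 duplicated, none unique), the same test passes, since 2 × 2177 = 4354.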

With experience, you will be able to check very large file archives for duplicates this way. In this particular case, despite circulating under different names, the two archives appear to contain the same files, at least as far as their file names go.
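When two listings do not match completely, the counts alone do not say which names differ. One option, not used in this post, is comm, which compares two sorted files line by line. The sketch below is only an illustration: names_in_one_only and the sample data are hypothetical.

```shell
# List names that occur in only one of two listings.
# Unindented lines appear only in the first file;
# tab-indented lines appear only in the second.
names_in_one_only() {
    # comm needs both inputs sorted in the same collation order,
    # so re-sort (and de-duplicate) each one under LC_ALL=C first.
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    # -3 suppresses the column of lines common to both files.
    LC_ALL=C comm -3 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Demonstration on hypothetical stand-in data.
demo_a=$(mktemp)
demo_b=$(mktemp)
printf '%s\n' 'a.doc' 'b.xls' 'c.pdf' > "$demo_a"
printf '%s\n' 'a.doc' 'b.xls' 'd.txt' > "$demo_b"
names_in_one_only "$demo_a" "$demo_b"
```

Empty output means the two listings hold the same set of names, which is what the uniq checks above indicate for cf.7z and ebd-cf.7z.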

BTW, do you think a similar technique could be applied to spreadsheets?
