Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 4, 2016

#Guccifer 2.0 Drop – Oct. 4, 2016 – File List

Filed under: Cybersecurity,Government,Politics — Patrick Durusau @ 9:37 pm

While you wait for your copy of the October 4, 2016 drop by #Guccifer 2.0 to download, you may want to peruse the file list for that drop: ebd-cf-file-list.gz.

A good starting place for comments on this drop is Guccifer 2.0 posts DCCC docs, says they’re from Clinton Foundation – Files appear to be from Democratic Congressional Campaign Committee and DNC hacks, by Sean Gallagher.

The paragraph in Sean’s post that I find the most interesting is:


However, a review by Ars found that the files are clearly not from the Clinton Foundation. While some of the individual files contain real data, much of it came from other breaches Guccifer 2.0 has claimed credit for at the Democratic National Committee and the Democratic Congressional Campaign Committee—hacks that researchers and officials have tied to “threat groups” connected to the Russian Government. Other data could have been aggregated from public information, while some appears to be fabricated as propaganda.

To verify Sean’s claim of duplication, compare the file names in this dump against those from prior dumps.
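A minimal R sketch of that comparison, assuming you have extracted plain-text file lists from each drop (the list file names below are hypothetical placeholders):

old <- readLines("dnc-dccc-file-list.txt")  # file names from a prior drop
new <- readLines("ebd-cf-file-list.txt")    # file names from the Oct. 4 drop
length(intersect(new, old))                 # how many names repeat across drops
setdiff(new, old)                           # names that appear only in the new drop

Anything that turns up in both lists is a candidate duplicate; only the names unique to the new drop need closer inspection.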

Sean is not specific about which files/data are alleged to be “fabricated as propaganda.”

I continue to be amused by allegations of Russian Government involvement. When funding is being sought, Russians (substitute other nationalities) possess super-human hacking capabilities. Yet in cases like this one, which regurgitates old data, Russian Government involvement is still presumed.

The inconsistency between Russian Government super-hackers and Russian Government copy-n-paste data leaks doesn’t seem to be getting much play in the media.

Perhaps you can help on that score.

Enjoy!

An introduction to data cleaning with R

Filed under: Data Quality,R — Patrick Durusau @ 7:33 pm

An introduction to data cleaning with R by Edwin de Jonge and Mark van der Loo.

Summary:

Data cleaning, or data preparation is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. These lecture notes describe a range of techniques, implemented in the R statistical environment, that allow the reader to build data cleaning scripts for data suffering from a wide range of errors and inconsistencies, in textual format. These notes cover technical as well as subject-matter related aspects of data cleaning. Technical aspects include data reading, type conversion and string matching and manipulation. Subject-matter related aspects include topics like data checking, error localization and an introduction to imputation methods in R. References to relevant literature and R packages are provided throughout.

These lecture notes are based on a tutorial given by the authors at the useR!2013 conference in Albacete, Spain.

Pure gold!

Plus this tip (among others):

Tip. To become an R master, you must practice every day.

The more data you clean, the better you will become!
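For a taste of the technical side the notes cover (type conversion, string manipulation, simple data checking), here is a minimal base-R sketch; the messy values are made up for illustration:

raw <- c(" 23 ", "31", "thirty", "27,5", NA)    # messy age values as read from a text file
age <- trimws(raw)                              # strip stray whitespace
age <- gsub(",", ".", age, fixed = TRUE)        # normalize decimal separators
age <- as.numeric(age)                          # coerce; "thirty" becomes NA with a warning
age[!is.na(age) & (age < 0 | age > 120)] <- NA  # simple range check flags impossible values
age

The lecture notes go much further, covering error localization and imputation, but the pattern is the same: read, normalize, check, repair.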

Enjoy!

Deep-Fried Data […money laundering for bias…]

Filed under: Ethics,Machine Learning — Patrick Durusau @ 6:45 pm

Deep-Fried Data by Maciej Ceglowski (paper; video of the same presentation), part of the Collections as Data event at the Library of Congress.

If the “…money laundering for bias…” quote doesn’t capture your attention, try:


I find it helpful to think of algorithms as a dim-witted but extremely industrious graduate student, whom you don’t fully trust. You want a concordance made? An index? You want them to go through ten million photos and find every picture of a horse? Perfect.

You want them to draw conclusions on gender based on word use patterns? Or infer social relationships from census data? Now you need some adult supervision in the room.

Besides these issues of bias, there’s also an opportunity cost in committing to computational tools. What irks me about the love affair with algorithms is that they remove a lot of the potential for surprise and serendipity that you get by working with people.

If you go searching for patterns in the data, you’ll find patterns in the data. Whoop-de-doo. But anything fresh and distinctive in your digital collections will not make it through the deep frier.

We’ve seen entire fields disappear down the numerical rabbit hole before. Economics came first, sociology and political science are still trying to get out, bioinformatics is down there somewhere and hasn’t been heard from in a while.

A great read and equally enjoyable presentation.

Enjoy!

Moral Machine [Research Design Failure]

Filed under: Ethics,Research Methods — Patrick Durusau @ 3:45 pm

Moral Machine

From the webpage:

Welcome to the Moral Machine! A platform for gathering a human perspective on moral decisions made by machine intelligence, such as self-driving cars.

We show you moral dilemmas, where a driverless car must choose the lesser of two evils, such as killing two passengers or five pedestrians. As an outside observer, you judge which outcome you think is more acceptable. You can then see how your responses compare with those of other people.

If you’re feeling creative, you can also design your own scenarios, for you and others to browse, share, and discuss.

The first time I recall hearing this type of discussion was over thirty years ago, when a friend taking an ethics class related the following problem:

You are driving a troop transport with twenty soldiers in the back and are about to enter a one-lane bridge. You see a baby sitting in the middle of the bridge. Do you swerve, going down an embankment and killing all on board, or do you go straight?

A lively college classroom discussion erupted and continued for the entire class. Various theories and justifications were offered, etc. When the class bell rang, the professor announced that the child had perished 59 minutes and 59 seconds earlier.

As you may guess, not a single person in the class called out “Swerve” when the question was posed.

The exercise was to illustrate that many “moral” decisions are made at the limits of human reaction time, typically between 150 and 300 milliseconds. (Speedy Science: How Fast Can You React? is a great activity from Scientific American to test your reaction time.)

The examples in MIT’s Moral Machine perpetuate the myth that moral decisions are the result of reflection and consideration of multiple factors.

Considered moral decisions do exist. Dietrich Bonhoeffer deciding to participate in a conspiracy to assassinate Adolf Hitler. Lyndon Johnson supporting civil rights in the South. But those are not the subject of the “Moral Machine.”

Nor is the “Moral Machine” even a useful simulation of what a driven and/or driverless car would confront. Visibility isn’t an issue as it often is; there are no distractions, no smart phones ringing, no conflicting input from passengers, etc.

In short, the “Moral Machine” creates a fictional choice, about which to solicit your “moral” advice, under conditions you will never experience.

Separating pedestrians from vehicles (once suggested by Buckminster Fuller, I think) is a far more useful exercise than college-level discussion questions.

Resource: Malware analysis – …

Filed under: Cybersecurity,Programming,Security — Patrick Durusau @ 2:04 pm

Resource: Malware analysis – learning How To Reverse Malware: A collection of guides and tools by Claus Cramon Houmann.

This resource will provide you theory around learning malware analysis and reverse engineering malware. We keep the links up to date as the infosec community creates new and interesting tools and tips.

Some technical reading to enjoy instead of political debates!

Enjoy!

October 3, 2016

“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

Filed under: Colin Powell Emails,Government,Politics,Uncategorized — Patrick Durusau @ 7:55 pm

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say whether that result reflects a lack of interest in making the full emails easily available or whether I simply didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachments from the emails, quickly and easily, not to parse the .eml files in preparation for further processing.

How does a 22-character, command-line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

The reasoning: the first attachment (in the event of multiple attachments) will include the string “base64”, so I pass a range expression that starts at the first matching line and runs to the end of the message (“$”), delete that range (“d”), and write each file in place (“-i”).

There are far more sophisticated solutions to this problem but, crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too, but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks, or waiting until I find a host for a consolidated 2.5 GB file, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download. With Chrome, the file size inflates to 294MB. With Mozilla, the file size is 65MB. Both unpack properly. Suggestions?

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

October 2, 2016

Security Community “Reasoning” About Botnets (and malware)

Filed under: Bots,Cybersecurity,Security — Patrick Durusau @ 8:41 pm

In case you missed it: Source Code for IoT Botnet ‘Mirai’ Released by Brian Krebs offers this “reasoning” about a recent release of botnet software:

The source code that powers the “Internet of Things” (IoT) botnet responsible for launching the historically large distributed denial-of-service (DDoS) attack against KrebsOnSecurity last month has been publicly released, virtually guaranteeing that the Internet will soon be flooded with attacks from many new botnets powered by insecure routers, IP cameras, digital video recorders and other easily hackable devices.

The leak of the source code was announced Friday on the English-language hacking community Hackforums. The malware, dubbed “Mirai,” spreads to vulnerable devices by continuously scanning the Internet for IoT systems protected by factory default or hard-coded usernames and passwords.

Being a recent victim of a DDoS attack, perhaps Krebs’ anger about the release of Mirai is understandable. But only to a degree.

Non-victims of such DDoS attacks have been quick to take up the “sky is falling” refrain.

Consider, as samples, Hacker releases code for huge IoT botnet, or Hacker Releases Code That Powered Record-Breaking Botnet Attack, or Brace yourselves—source code powering potent IoT DDoSes just went public: Release could allow smaller and more disciplined Mirai botnet to go mainstream.

Mirai is now available to “anyone,” but where the reasoning of Krebs and others breaks down is that there is no evidence “everyone” wants to run a botnet.

Even if the botnet was as easy (sic) to use as Outlook.

For example, gun ownership in the United States now stands at 36% of the adult population, yet that roughly one-third of the population will not commit murder this coming week.

As of 2010, there were roughly 210 million licensed drivers in the United States. Yet, this coming week, it is highly unlikely that any of them will commandeer a truck and run down pedestrians with it.

The point is that the vast majority of users, even if they were competent to read and use the Mirai code, aren’t criminals. Nor does possession of the Mirai code make them criminals.

It could be they are just curious. Or interested in how it was coded. Or, by some off chance, they could even have good intentions and want to study it to fight botnets.

Attempting to prevent the spread of information hasn’t resulted in any apparent benefit, at least to the cyber community at large.

Perhaps it’s time to treat the cyber community as adults, some of whom will make good decisions and some less so.

Value-Add Of Mapping The Food Industry

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:39 pm

Did you know that ten (10) companies control all of the major food/drink brands in the world?

[Image: Behind the Brands “illusion of choice” graphic]

(From These 10 companies control everything you buy, where you can find a larger version of this image.)

You could, with enough searching, have put together all ten of these mini-maps, but then that effort would have to be repeated by everyone seeking the same information.

But, instead of duplicating an initial investment to identify players and their relationships, you can focus on identifying their IP addresses, process control machinery, employees, and other useful data.

What are your examples of the value-add of mapping?

October 1, 2016

Nuremberg Trial Verdicts [70th Anniversary]

Filed under: Text Analytics,Text Extraction,Text Mining,Texts,TF-IDF — Patrick Durusau @ 8:46 pm

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.
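If you want to take up that challenge, here is a minimal R sketch of building a TF-IDF weighted term matrix over the volumes. It assumes the PDFs have already been downloaded into a local directory; the directory name is a hypothetical placeholder.

library(pdftools)   # pdf_text() extracts text from PDF pages
library(tm)         # corpus handling and TF-IDF weighting

files <- list.files("blue-series", pattern = "\\.pdf$", full.names = TRUE)
text  <- vapply(files, function(f) paste(pdf_text(f), collapse = " "), character(1))

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm)   # one row per volume, terms weighted by TF-IDF

That only gets you to “text” similarity, of course; the harder part follows.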

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

Data Science Toolbox

Filed under: Data Science,Education,Teaching — Patrick Durusau @ 8:31 pm

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

  • Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.
  • R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.
  • dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

Type-driven Development … [Further Reading]

Filed under: Functional Programming,Types,Uncategorized — Patrick Durusau @ 3:49 pm

The Further Reading slide from Edwin Brady’s presentation Type-driven Development of Communicating Systems in Idris (Lambda World, 2016) was tweeted as an image, eliminating the advantages of hyperlinks.

I have reproduced that slide with the links as follows:

Further Reading

  • On total functional programming
  • On interactive programming with dependent types
  • On types for communicating systems:

For Wadler’s paper, you may enjoy the video of his presentation, Propositions as Sessions, or his slides (2016): Propositions as Sessions, Philip Wadler, University of Edinburgh, Betty Summer School, Limassol, Monday 27 June 2016.

Government Contractor Persistence

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 12:59 pm

Persistence of data is a hot topic in computer science, but did you know government contractors exhibit persistence as well?

Remember the 22,000,000+ record leak from the US Office of Personnel Management?

Leaks don’t happen on their own, and it turns out that Keypoint Government Solutions was the weak link in the chain that resulted in that loss.

Cory Doctorow reports in Company suspected of blame in Office of Personnel Management breach will help run new clearance agency:


It’s still not clear how OPM got hacked, but signs point to a failure at one of its contractors, Keypoint Government Solutions, who appear to have lost control of their logins/passwords for sensitive OPM services.

In the wake of the hacks, the job of giving out security clearances has been given to a new government agency, the National Background Investigations Bureau.

NBIB is about to get started, and they’ve announced that they’re contracting out significant operations to Keypoint. Neither Keypoint nor the NBIB would comment on this arrangement.

The loss of 22,000,000 records? Well, that could happen to anybody.

WRONG!

Initiatives, sprints, proclamations, collaborations with industry, academia, etc., are unlikely to change the practice of cybersecurity in the U.S. government.

Changing cybersecurity practices in government requires:

  • Elimination of contractor persistence. One failure is enough.
  • Immediate and permanent separation of management and staff who fail to implement and follow standard security practices.
  • Separated staff and management permanently barred from employment with any government contractor.
  • Staff of prior failed contractors barred from employment at present contractors. (An incentive for contractor staff to report shortfalls in current contracts.)
  • Multi-year funded contracts that include funding for independent red team testing of security.

A policy of no consequences for security failures defeats all known security policies.
