Were You Pwned by the “Human Cat” Story?

February 7th, 2018

Overseas Fake News Publishers Use Facebook’s Instant Articles To Bring In More Cash by Jane Lytvynenko

Fake stories first:

From the post:

While some mainstream publishers are abandoning Facebook’s Instant Articles, fake news sites based overseas are taking advantage of the format — and in some cases Facebook itself is earning revenue from their false stories.

BuzzFeed News found 29 Facebook pages, and associated websites, that are using Instant Articles to help their completely false stories load faster on Facebook. At least 24 of these pages are also signed up with Facebook Audience Network, meaning Facebook itself earns a share of revenue from the fake news being read on its platform.

Launched in 2015, Instant Articles offer a way for publishers to have their articles load quickly and natively within the Facebook mobile app. Publishers can insert their own ads or use Facebook’s ad network, Audience Network, to automatically place advertisements into their articles. Facebook takes a cut of the revenue when sites monetize with Audience Network.

“We’re against false news and want no part of it on our platform; including in Instant Articles,” said an email statement from a Facebook spokesperson. “We’ve launched a comprehensive effort across all products to take on these scammers, and we’re currently hosting third-party fact checkers from around the world to understand how we can more effectively solve the problem.”

The spokesperson did not respond to questions about the use of Instant Articles by spammers and fake news publishers, or about the fact that Facebook’s ad network was also being used for monetization. The articles sent to Facebook by BuzzFeed News were later removed from the platform. The company also removes publishers from Instant Articles if they’ve been flagged by third-party fact-checkers.

Really? You could be pwned by a “human cat” story?

Why should I be morally outraged and/or willing to devote attention to stopping that type of fake news?

Or ask anyone else to devote their resources to it?

Would you seek out Flat Earthers to dispel their delusions? If not, leave the “fake news” to people who seem to enjoy it. It’s their dime.

What the f*ck Python! 🐍

February 6th, 2018

What the f*ck Python! 🐍

From the post:

Python, being a beautifully designed high-level and interpreter-based programming language, provides us with many features for the programmer’s comfort. But sometimes, the outcomes of a Python snippet may not seem obvious to a regular user at first sight.

Here is a fun project to collect such tricky & counter-intuitive examples and lesser-known features in Python, attempting to discuss what exactly is happening under the hood!

While some of the examples you see below may not be WTFs in the truest sense, but they’ll reveal some of the interesting parts of Python that you might be unaware of. I find it a nice way to learn the internals of a programming language, and I think you’ll find them interesting as well!

If you’re an experienced Python programmer, you can take it as a challenge to get most of them right in first attempt. You may be already familiar with some of these examples, and I might be able to revive sweet old memories of yours being bitten by these gotchas 😅

If you’re a returning reader, you can learn about the new modifications here.

So, here we go…

What better way to learn than being really pissed off that your code isn’t working? Or isn’t working as expected.


This looks like a real hoot! Too late today to do much with it but I’ll be returning to it.
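In the meantime, here is one well-known gotcha of the sort the project collects. This is my own minimal sketch, not copied from the project, and the function names are made up for illustration: a mutable default argument is created once, when the function is defined, and is then shared between calls.

```python
def append_item(item, bucket=[]):
    # The default list is built once, at definition time, and is
    # shared by every call that does not pass its own `bucket`.
    bucket.append(item)
    return bucket


def append_item_safe(item, bucket=None):
    # Using None as a sentinel gives each call a fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)
print(second)           # [1, 2]
print(first is second)  # True: both names point at the shared default
print(append_item_safe(1), append_item_safe(2))  # [1] [2]
```

The usual fix is the `None` sentinel shown in `append_item_safe`.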


Dive into BPF: a list of reading material

February 6th, 2018

Dive into BPF: a list of reading material by Quentin Monnet.

From the post:

BPF, as in Berkeley Packet Filter, was initially conceived in 1992 so as to provide a way to filter packets and to avoid useless packet copies from kernel to userspace. It initially consisted in a simple bytecode that is injected from userspace into the kernel, where it is checked by a verifier—to prevent kernel crashes or security issues—and attached to a socket, then run on each received packet. It was ported to Linux a couple of years later, and used for a small number of applications (tcpdump for example). The simplicity of the language as well as the existence of an in-kernel Just-In-Time (JIT) compiling machine for BPF were factors for the excellent performances of this tool.

Then in 2013, Alexei Starovoitov completely reshaped it, started to add new functionalities and to improve the performances of BPF. This new version is designated as eBPF (for “extended BPF”), while the former becomes cBPF (“classic” BPF). New features such as maps and tail calls appeared. The JIT machines were rewritten. The new language is even closer to native machine language than cBPF was. And also, new attach points in the kernel have been created.

Thanks to those new hooks, eBPF programs can be designed for a variety of use cases, that divide into two fields of applications. One of them is the domain of kernel tracing and event monitoring. BPF programs can be attached to kprobes and they compare with other tracing methods, with many advantages (and sometimes some drawbacks).

The other application domain remains network programming. In addition to socket filter, eBPF programs can be attached to tc (Linux traffic control tool) ingress or egress interfaces and perform a variety of packet processing tasks, in an efficient way. This opens new perspectives in the domain.

And eBPF performances are further leveraged through the technologies developed for the IO Visor project: new hooks have also been added for XDP (“eXpress Data Path”), a new fast path recently added to the kernel. XDP works in conjunction with the Linux stack, and relies on BPF to perform very fast packet processing.

Even some projects such as P4, Open vSwitch, consider or started to approach BPF. Some others, such as CETH, Cilium, are entirely based on it. BPF is buzzing, so we can expect a lot of tools and projects to orbit around it soon…

I haven’t even thought about the Berkeley Packet Filter in more than a decade.

But such a wonderful reading list merits mention in its own right. What a great model for reading lists on other topics!

And one or more members of your team may want to get closer to the metal on packet traffic.
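To get a feel for the "simple bytecode" of classic BPF described in the quote, here is a toy interpreter. Treat it as a sketch under assumptions: it handles only three instructions, it is not the kernel's implementation, and it has no verifier. The opcode numbers are the classic BPF values from the Linux headers, the four-instruction program is modeled on the kind of listing `tcpdump -d` prints for an IPv4 filter, and `run_filter` and the sample frames are my own inventions.

```python
import struct

# Classic BPF opcode values (class | size | mode), as in <linux/filter.h>.
LDH_ABS = 0x28  # load the half-word at absolute offset k into the accumulator
JEQ_K = 0x15    # if accumulator == k skip jt instructions, else skip jf
RET_K = 0x06    # return k: number of bytes to accept, 0 means drop

# Each instruction is (code, jt, jf, k).
ACCEPT_IPV4 = [
    (LDH_ABS, 0, 0, 12),     # EtherType sits at byte offset 12 of the frame
    (JEQ_K, 0, 1, 0x0800),   # IPv4: fall through, otherwise skip one
    (RET_K, 0, 0, 0x40000),  # accept
    (RET_K, 0, 0, 0),        # drop
]


def run_filter(program, packet):
    """Run a tiny subset of classic BPF over one packet."""
    acc = 0
    pc = 0
    while pc < len(program):
        code, jt, jf, k = program[pc]
        pc += 1
        if code == LDH_ABS:
            if k + 2 > len(packet):
                return 0  # out-of-bounds load: reject the packet
            (acc,) = struct.unpack_from("!H", packet, k)
        elif code == JEQ_K:
            pc += jt if acc == k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
    return 0


# Made-up frames: 12 bytes of MAC addresses, an EtherType, then padding.
ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)
arp_frame = bytes(12) + b"\x08\x06" + bytes(28)

print(run_filter(ACCEPT_IPV4, ipv4_frame))  # 262144
print(run_filter(ACCEPT_IPV4, arp_frame))   # 0
```

The real thing runs inside the kernel on every received packet, which is why the verifier and the JIT mentioned in the quote matter so much.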

PS: I don’t subscribe to the idea that only governments can build nation-state-level tooling for hacks. Loose confederations of people built the Internet. Something to keep in mind while sharing code and hacks.

Finally! A Mainstream Use for Deep Learning!

February 6th, 2018

Using deep learning to generate offensive license plates by Jonathan Nolis.

From the post:

If you’ve been on the internet for long enough you’ve seen quality content generated by deep learning algorithms. This includes algorithms trained on band names, video game titles, and Pokémon. As a data scientist who wants to keep up with modern tends in the field, I figured there would be no better way to learn how to use deep learning myself than to find a fun topic to generate text for. After having the desire to do this, I waited for a year before I found just the right data set to do it,

I happened to stumble on a list of banned license plates in Arizona. This list contains all of the personalized license plates that people requested but were denied by the Arizona Motor Vehicle Division. This dataset contained over 30,000 license plates which makes a great set of text for a deep learning algorithm. I included the data as text in my GitHub repository so other people can use it if they so choose. Unfortunately the data is from 2012, but I have an active Public Records Request to the state of Arizona for an updated list. I highly recommend you look through it, it’s very funny.

What a great idea! Not only are you learning deep learning but you are being offensive at the same time. A double-dipper!

A script for banging against your state license registration is left as an exercise for the reader.

A password generator using phonetics to spell offensive phrases for c-suite users would be nice.

Balisage: The Markup Conference 2018 – 77 Days To Paper Submission Deadline!

February 5th, 2018

Call for Participation

Submission dates/instructions have dropped!


  • 22 March 2018 — Peer review applications due
  • 22 April 2018 — Paper submissions due
  • 21 May 2018 — Speakers notified
  • 8 June 2018 — Late-breaking News submissions due
  • 15 June 2018 — Late-breaking News speakers notified
  • 6 July 2018 — Final papers due from presenters of peer reviewed papers
  • 6 July 2018 — Short paper or slide summary due from presenters of late-breaking news
  • 30 July 2018 — Pre-conference Symposium
  • 31 July – 3 August 2018 — Balisage: The Markup Conference
Submit full papers in XML to info@balisage.net
See the pages Instructions for Authors and
Tag Set and Submission Guidelines for details.
Apply to the Peer Review panel

I’ve heard inability to submit valid markup counts in the judging of papers. That may just be rumor or it may be true. I suggest validating your submission.
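As a first, very rough check, something like the following sketch will at least tell you whether a file is well-formed XML. It uses only the Python standard library, and `is_well_formed` is a name I made up. Note the limitation: it does not validate against the Balisage tag set. For that you need a validating parser and the schema described in the Tag Set and Submission Guidelines.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<article><title>Draft</title></article>"))  # True
print(is_well_formed("<article><title>Draft</article>"))          # False
```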

You should be on the fourth or fifth draft of your paper by now, but be aware the paper submission deadline is April 22, 2018, or 77 days from today!

Looking forward to seeing exceptionally strong papers in the review process and being presented at Balisage!

New Draft Morphological Tags for MorphGNT

February 5th, 2018

New Draft Morphological Tags for MorphGNT by James Tauber.

From the post:

At least going back to my initial collaboration with Ulrik Sandborg-Petersen in 2005, I’ve been thinking about how I would do morphological tags in MorphGNT if I were starting from scratch.

Much later, in 2014, I had some discussions with Mike Aubrey at my first SBL conference and put together a straw proposal. There was a rethinking of some parts-of-speech, handling of tense/aspect, handling of voice, handling of syncretism and underspecification.

Even though some of the ideas were more drastic than others, a few things have remained consistent in my thinking:

  • there is value in a purely morphological analysis that doesn’t disambiguate on syntactic or semantic grounds
  • this analysis does not need the notion of parts-of-speech beyond purely Morphological Parts of Speech
  • this analysis should not attempt to distinguish middles and passives in the present or perfect system

As part of the handling of syncretism and underspecification, I had originally suggested a need for a value for the case property that didn’t distinguish nominative and accusative and a need for a value for the gender property like “non-neuter”.

If you are interested in language encoding, Biblical Greek, or morphology, Tauber has a project for you!

Be forewarned that what you tag has a great deal to do with what you can and/or will see. You have been warned.


Unfairness By Algorithm

February 5th, 2018

Unfairness By Algorithm: Distilling the Harms of Automated Decision-Making by Lauren Smith.

From the post:

Analysis of personal data can be used to improve services, advance research, and combat discrimination. However, such analysis can also create valid concerns about differential treatment of individuals or harmful impacts on vulnerable communities. These concerns can be amplified when automated decision-making uses sensitive data (such as race, gender, or familial status), impacts protected classes, or affects individuals’ eligibility for housing, employment, or other core services. When seeking to identify harms, it is important to appreciate the context of interactions between individuals, companies, and governments—including the benefits provided by automated decision-making frameworks, and the fallibility of human decision-making.

Recent discussions have highlighted legal and ethical issues raised by the use of sensitive data for hiring, policing, benefits determinations, marketing, and other purposes. These conversations can become mired in definitional challenges that make progress towards solutions difficult. There are few easy ways to navigate these issues, but if stakeholders hold frank discussions, we can do more to promote fairness, encourage responsible data use, and combat discrimination.

To facilitate these discussions, the Future of Privacy Forum (FPF) attempted to identify, articulate, and categorize the types of harm that may result from automated decision-making. To inform this effort, FPF reviewed leading books, articles, and advocacy pieces on the topic of algorithmic discrimination. We distilled both the harms and potential mitigation strategies identified in the literature into two charts. We hope you will suggest revisions, identify challenges, and help improve the document by contacting lsmith@fpf.org. In addition to presenting this document for consideration for the FTC Informational Injury workshop, we anticipate it will be useful in assessing fairness, transparency and accountability for artificial intelligence, as well as methodologies to assess impacts on rights and freedoms under the EU General Data Protection Regulation.

The primary attractions are two tables, Potential Harms from Automated Decision-Making and Potential Mitigation Sets.

Take the tables as a starting point for analysis.

Some “unfair” practices, such as increased auto insurance prices for night-shift workers, which result in differential access to insurance, are actuarial questions. Insurers are not public charities and can legally discriminate based on perceived risk.


February 5th, 2018


From the webpage:

From February 5-9, 2018, libraries, archives, and other cultural institutions around the world are sharing free coloring sheets and books based on materials in their collections.

Something fun to start the week!

In addition to more than one hundred participating institutions, you can also find instructions for creating your own coloring pages.

Any of the images you find at Mardi Gras New Orleans will make great coloring pages (modulo non-commercial use and/or permissions as appropriate).

The same instructions will help you make “adult” coloring pages as well.

I wasn’t able to get attractive results for Pedro Berruguete’s Saint Dominic Presiding over an Auto-da-fe (1495) using the simple instructions, but I will continue to play with it.

High hopes for an Auto-da-fe coloring page. FBI leaders who violate the privacy of American citizens as the focal point. (There are honest, decent and valuable FBI agents, but like other groups, only the bad apples get the press.)

Mapping Militant Selfies: …Generating Battlefield Data

February 3rd, 2018

Mapping Militant Selfies – Application of Entity Recognition/Extraction Methods to Generate Battlefield Data in Northern Syria (video) – presentation by Akin Unver.

From the seminar description:

As the Middle East goes through one of its most historic, yet painful episodes, the fate of the region’s Kurds have drawn substantial interest. Transnational Kurdish awakening—both political and armed—has attracted unprecedented global interest as individual Kurdish minorities across four countries, Turkey, Iraq, Iran, and Syria, have begun to shake their respective political status quo in various ways. In order to analyse this trend in a region in flux, this paper introduces a new methodology in generating computerised geopolitical data. Selfies of militants from three main warring non-state actors, ISIS, YPG and FSA, through February 2014 – February 2016, was sorted and operationalized through a dedicated repository of geopolitical events, extracted from a comprehensive open source archive of Turkish, Kurdish, Arabic, and Farsi sources, and constructed using entity extraction and recognition algorithms. These selfies were crosschecked against events related to conflict, such as unrest, attack, sabotage and bombings were then filtered based on human-curated lists of actors and locations. The result is a focused data set of more than 2000 events (or activity nodes) with a high level of geographical and temporal granularity. This data is then used to generate a series of four heat maps based on six-month intervals. They highlight the intensity of armed group events and the evolution of multiple fronts in the border regions of Turkey, Syria, Iraq and Iran.

Great presentation that includes the goal of:

With no reliance on ‘official’ (censored) data

Unfortunately, the technical infrastructure isn’t touched upon nor were any links given. I have written to Professor Unver asking for further information.
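Absent details from the presentation, here is my guess at what the last step (events to heat maps by six-month interval) could look like. Everything in it is an assumption: the event tuples, the half-degree grid, the coordinates, and the function names are invented for illustration and are not taken from Unver's work.

```python
from collections import Counter
from datetime import date

START = date(2014, 2, 1)  # the seminar description covers Feb 2014 - Feb 2016


def interval_index(when, start=START, months_per_interval=6):
    """Index of the six-month window (0, 1, 2, ...) that a date falls in."""
    months = (when.year - start.year) * 12 + (when.month - start.month)
    return months // months_per_interval


def grid_cell(lat, lon, cell_deg=0.5):
    """Snap a coordinate to a coarse grid cell."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def heat_counts(events, cell_deg=0.5):
    """Count events per (interval, grid cell); the counts drive cell color."""
    counts = Counter()
    for when, lat, lon in events:
        counts[(interval_index(when), grid_cell(lat, lon, cell_deg))] += 1
    return counts


# Made-up events: (date, latitude, longitude).
events = [
    (date(2014, 3, 10), 36.89, 38.35),
    (date(2014, 5, 2), 36.80, 38.40),
    (date(2015, 1, 15), 36.20, 37.15),
]
counts = heat_counts(events)
print(counts[(0, (73, 76))])  # 2
print(counts[(1, (72, 74))])  # 1
```

The hard parts, of course, are upstream: extracting dates, places and actors from the source material in the first place.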

Although Unver focuses on the Kurds, these techniques support ad-hoc battlefield data systems, putting irregular forces on information parity with better-funded adversaries.

Replace selfies with time-stamped, geo-located images of government forces, add image recognition, and with a little discipline you have a start towards a highly effective force, even if badly outnumbered.

If you are interested in a more academic application of this technology, see:

Schrödinger’s Kurds: Transnational Kurdish Geopolitics In The Age Of Shifting Borders


As the Middle East goes through one of its most historic, yet painful episodes, the fate of the region’s Kurds have drawn substantial interest. Transnational Kurdish awakening—both political and armed—has attracted unprecedented global interest as individual Kurdish minorities across four countries, Turkey, Iraq, Iran, and Syria, have begun to shake their respective political status quo in various ways. It is in Syria that the Kurds have made perhaps their largest impact, largely owing to the intensification of the civil war and the breakdown of state authority along Kurdish-dominated northern borderlands. However, in Turkey, Iraq, and Iran too, Kurds are searching for a new status quo, using multiple and sometimes mutually defeating methods. This article looks at the future of the Kurds in the Middle East through a geopolitical approach. It begins with an exposition of the Kurds’ geographical history and politics, emphasizing the natural anchor provided by the Taurus and Zagros mountains. That anchor, history tells us, has both rendered the Kurds extremely resilient to systemic changes to larger states in their environment, and also provided hindrance to the materialization of a unified Kurdish political will. Then, the article assesses the theoretical relationship between weak states and strong non-states, and examines why the weakening of state authority in Syria has created a spillover effect on all Kurds in its neighborhood. In addition to discussing classical geopolitics, the article also reflects upon demography, tribalism, Islam, and socialism as additional variables that add and expand the debate of Kurdish geopolitics. The article also takes a big-data approach to Kurdish geopolitics by introducing a new geopolitical research methodology, using large-volume and rapid-processed entity extraction and recognition algorithms to convert data into heat maps that reveal the general pattern of Kurdish geopolitics in transition across four host countries.

A basic app should run on Tails, in memory, such that if your coordinating position is compromised, powering down (jerking out the power cord) destroys all the data.

Hmmm, encrypted delivery of processed data from a web service to the coordinator, such that their computer is only displaying data.

Other requirements?

Where Are Topic Mappers Today? Lars Marius Garshol

February 3rd, 2018

Some are creating new children’s games:

If you’re interested, Ian Rogers has a complete explanation with examples at: The Google Pagerank Algorithm and How It Works or a different take with a table of approximate results at: RITE Wiki: Page Rank.
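For readers who would rather poke at code than read explanations, here is a small power-iteration sketch of the commonly published damping-factor form of PageRank. The assumptions are mine: the ranks are normalized to sum to 1 (one of several conventions in circulation), rank from pages with no outgoing links is spread evenly, each page lists a target at most once, and the three-page graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {}
        for page in pages:
            incoming = sum(
                rank[q] / len(links[q]) for q in pages if page in links.get(q, ())
            )
            new_rank[page] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each pass hands every page a small base share plus a damped portion of the rank of the pages linking to it, repeated until the numbers settle.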

Unfortunately, both Garshol and Wikipedia’s PageRank page get the Google PageRank algorithm wrong.

The correct formulation reads:

The results of the reported algorithm are divided by U.S. Government Interference, an unknown quantity.

Perhaps that is why Google keeps its pagerank calculation secret. If I were an allegedly sovereign nation, I would keep Google’s lapdog relationship to the U.S. government firmly in mind.

IDA v7.0 Released as Freeware – Comparison to The IDA Pro Book?

February 3rd, 2018

IDA v7.0 Released as Freeware

From the download page:

The freeware version of IDA v7.0 has the following limitations:

  • no commercial use is allowed
  • lacks all features introduced in IDA > v7.0
  • lacks support for many processors, file formats, debugging etc…
  • comes without technical support

Copious amounts of documentation are online.

I haven’t seen The IDA Pro Book by Chris Eagle, but it was published in 2011. Do you know anyone who has compared The IDA Pro Book to version 7.0?

Two promising pages: IDA Support Overview and IDA Support: Links (external).

PubMed Commons to be Discontinued

February 2nd, 2018

PubMed Commons to be Discontinued

From the post:

PubMed Commons has been a valuable experiment in supporting discussion of published scientific literature. The service was first introduced as a pilot project in the fall of 2013 and was reviewed in 2015. Despite low levels of use at that time, NIH decided to extend the effort for another year or two in hopes that participation would increase. Unfortunately, usage has remained minimal, with comments submitted on only 6,000 of the 28 million articles indexed in PubMed.

While many worthwhile comments were made through the service during its 4 years of operation, NIH has decided that the low level of participation does not warrant continued investment in the project, particularly given the availability of other commenting venues.

Comments will still be available, see the post for details.

Good time for the reminder that even negative results from an experiment are valuable.

Even more so in this case because discussion/comment facilities are non-trivial components of a content delivery system. Time and resources not spent on comment facilities could be put in other directions.

Where do discussions of medical articles take place and can they be used to automatically annotate published articles?

The Unix Workbench

February 2nd, 2018

The Unix Workbench by Sean Kross is unlikely to help you, but it is a great resource to pass along to new Unix users.

Some day, Microsoft will complete the long transition to Unix. Start today and you will arrive years ahead of it. 😉

Discrediting the FBI?

February 2nd, 2018

Whatever your opinion of the accidental U.S. president (that’s a dead giveaway), what does it mean to “discredit” the FBI?

Just hitting the high points:

The FBI has a long history of lying and abuse, these being only some of the more recent examples.

So my question remains: What does it mean to “discredit” the FBI?

The FBI and its agents are unworthy of any belief by anyone. Their own records and admissions are a story of staggering from one lie to the next.

I’ll grant the FBI is large enough that honorable, hard-working, honest agents must exist. But not enough of them to prevent the repeated fails at the FBI.

Anyone who credits any FBI investigation has motivations other than the factual record of the FBI.

PS: The Nunes memo confirms what many have long suspected about the FISA court: It exercises no more meaningful oversight over FISA warrants than a physical rubber stamp would in its place.

How To Secure Sex Toys – End to End (so to speak)

February 2nd, 2018

Thursday began innocently enough and then I encountered:

The tumult of articles started (I think) with Internet of Dildos: A Long Way to a Vibrant Future – From IoT to IoD, which covers security flaws in the Vibratissimo PantyBuster, MagicMotion Flamingo, and Realov Lydia and reads in part:

The results are the foundations for a Master thesis written by Werner Schober in cooperation with SEC Consult and the University of Applied Sciences St. Pölten. The first available results can be found in the following chapters of this blog post.

The sex toys of the “Vibratissimo” product line and their cloud platform, both manufactured and operated by the German company Amor Gummiwaren GmbH, were affected by severe security vulnerabilities. The information we present is not only relevant from a technological perspective, but also from a data protection and privacy perspective. The database containing all the customer data (explicit images, chat logs, sexual orientation, email addresses, passwords in clear text, etc.) was basically readable for everyone on the internet. Moreover, an attacker was able to remotely pleasure individuals without their consent. This could be possible if an attacker is nearby a victim (within Bluetooth range), or even over the internet. Furthermore, the enumeration of explicit images of all users is possible because of predictable numbers and missing authorization checks.
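The last flaw in that quote, enumeration via "predictable numbers and missing authorization checks," has a well-worn defensive pattern: hand out identifiers that cannot be guessed and still check ownership on every request. The sketch below is only an illustration of that pattern. The in-memory store and the function names are hypothetical and have nothing to do with the vendor's actual code.

```python
import secrets

# Hypothetical in-memory store: image id -> owning user.
_owner_of = {}


def store_image(owner):
    """Register an image under a random, non-sequential identifier."""
    image_id = secrets.token_urlsafe(16)  # 128 bits from the OS CSPRNG
    _owner_of[image_id] = owner
    return image_id


def can_view(image_id, user):
    """Authorization check: only the owner may fetch the image."""
    return image_id in _owner_of and _owner_of[image_id] == user


alice_image = store_image("alice")
print(can_view(alice_image, "alice"))    # True
print(can_view(alice_image, "mallory"))  # False
print(can_view("1337", "mallory"))       # False
```

Random identifiers only make guessing impractical. The ownership check is what actually keeps other users out.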

Other coverage of the vulnerability includes:

Vibratissimo product line (includes the PantyBuster).

The cited coverage doesn’t answer the question: how do you incentivize end-to-end encrypted sex toys?

Here’s one suggestion: Buy the PantyBuster or other “smart” sex toys in bulk. Re-ship these sex toys, after duly noting their serial numbers and other access information, to your government representatives, sports or TV figures, judges, military officers, etc. People whose privacy matters to the government.

If someone were to post a list of such devices, well, you can imagine the speed with which sex toys will be required to be encrypted in your market.

Some people see vulnerabilities and see problems.

I see the same vulnerabilities and see endless possibilities.

Weird Machines, exploitability, and proven unexploitability – Video

February 2nd, 2018

Thomas Dullien/Halvar Flake’s presentation Weird Machines, exploitability, and proven unexploitability won’t embed but you can watch it on Vimeo.

Great presentation of the paper I mentioned at: Weird machines, exploitability, and provable unexploitability.

Includes this image of a “MitiGator:”

Views “software as an emulator for the finite state machine I would like to have.” (rough paraphrase)

Another gem, attackers don’t distinguish between data and programming:

OK, one more gem and you have to go watch the video:

Proof of unexploitability:

Mostly rote exhaustion of the possible weird state transitions.

The example used is “several orders of magnitude” less complicated than most software. Possible to prove but difficult even with simple examples.

Definitely a “watch this space” field of computer science.

Appendices with code: http://www.dullien.net/thomas/weird-machines-exploitability.pdf

NSA Exploits – Mining Malware – Ethics Question

February 1st, 2018

New Monero mining malware infected 500K PCs by using 2 NSA exploits

From the post:

It looks like the craze of cryptocurrency mining is taking over the world by storm as every new day there is a new malware targeting unsuspecting users to use their computing power to mine cryptocurrency. Recently, the IT security researchers at Proofpoint have discovered a Monero mining malware that uses leaked NSA (National Security Agency) EternalBlue exploit to spread itself.

The post also mentions use of the NSA exploit, EsteemAudit.

A fair number of leads and worth your time to read in detail.

I suspect most of the data science ethics crowd will downvote the use of NSA exploits (EternalBlue, EsteemAudit) for cryptocurrency mining.

Here’s a somewhat harder data science ethics question:

Is it ethical to infect 500,000+ Windows computers belonging to a government for the purpose of obtaining internal documents?

Does your answer depend upon which government and what documents?

Governments don’t take your rights into consideration. Should you take their laws into consideration?

George “Machine Gun” Kelly (Bank Commissioner), DJ Patil (Data Science Ethics)

February 1st, 2018

A Code of Ethics for Data Science by DJ Patil. (Former U.S. Chief Data Scientist)

From the post:

With the old adage that with great power comes great responsibility, it’s time for the data science community to take a leadership role in defining right from wrong. Much like the Hippocratic Oath defines Do No Harm for the medical profession, the data science community must have a set of principles to guide and hold each other accountable as data science professionals. To collectively understand the difference between helpful and harmful. To guide and push each other in putting responsible behaviors into practice. And to help empower the masses rather than to disenfranchise them. Data is such an incredible lever arm for change, we need to make sure that the change that is coming, is the one we all want to see.

So how do we do it? First, there is no single voice that determines these choices. This MUST be community effort. Data Science is a team sport and we’ve got to decide what kind of team we want to be.

Consider the specifics of Patil’s regime (2015-2017), when government data scientists:

  • Mined information on U.S. citizens. (check)
  • Mined information on non-U.S. citizens. (check)
  • Hacked computer systems of both citizens and non-citizens. (check)
  • Spread disinformation both domestically and abroad. (check)

Unless you want to resurrect George “Machine Gun” Kelly to be your banking commissioner, Patil is a poor choice to lead a charge on ethics.

Despite violations of U.S. law during his tenure as U.S. Chief Data Scientist, Patil was responsible for NO prosecutions, investigations or even whistle-blowing on a single government data scientist.

Patil’s lemming traits come to the fore when he says:

And finally, our democratic systems have been under attack using our very own data to incite hate and sow discord.

Patil ignores two very critical aspects of that claim:

  1. There has been no, repeat no, forensic evidence released to support that claim. All that supports it are claims by people who claim to have seen something, but they can’t say what.
  2. The United States (that would be us) tried to overthrow governments seventy-two times during the Cold War. Sometimes the U.S. succeeded. Posts on Twitter and Facebook pale by comparison.

Don’t mistake Patil’s use of the term “ethics” as meaning what you mean by “ethics.” Based on his prior record and his post, you can guess that Patil’s “ethics” gives a wide berth to abusive governments and corporations.

Python’s One Hundred and Thirty-Nine Week Lectionary Cycle

January 31st, 2018

Python 3 Module of the Week by Doug Hellmann

From the webpage:

PyMOTW-3 is a series of articles written by Doug Hellmann to demonstrate how to use the modules of the Python 3 standard library….

Hellmann documents one hundred and thirty-nine (139) modules in the Python standard library.

How many of them can you name?

To improve your score, use Hellmann’s list as a one hundred and thirty-nine (139) week lectionary cycle on Python.

Some modules may take less than a week, but some, such as re — Regular Expressions, will take more than a week.

Even if you don’t finish a longer module, push on after two weeks so you can keep that feeling of progress and encountering new material.
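If you want the calendar to nag you, a cycle like this is a few lines of date arithmetic. A sketch, with assumptions: the four-entry module list is a placeholder for the full 139, the start date is arbitrary, `module_for_week` is a name I made up, and it assigns strictly one module per week (the two-week allowance above is not modeled).

```python
from datetime import date


def module_for_week(modules, start, today):
    """Pick this week's module, wrapping around when the list runs out."""
    weeks_elapsed = (today - start).days // 7
    return modules[weeks_elapsed % len(modules)]


# Placeholder list; the real cycle would hold all 139 entries.
modules = ["string", "textwrap", "re", "difflib"]
start = date(2018, 2, 5)

print(module_for_week(modules, start, date(2018, 2, 5)))   # string
print(module_for_week(modules, start, date(2018, 2, 19)))  # re
print(module_for_week(modules, start, date(2018, 3, 5)))   # string
```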

GraphDBLP [“dblp computer science bibliography” as a graph]

January 31st, 2018

GraphDBLP: a system for analysing networks of computer scientists through graph databases by Mario Mezzanzanica, et al.


This paper presents GraphDBLP, a system that models the DBLP bibliography as a graph database for performing graph-based queries and social network analyses. GraphDBLP also enriches the DBLP data through semantic keyword similarities computed via word-embedding. In this paper, we discuss how the system was formalized as a multi-graph, and how similarity relations were identified through word2vec. We also provide three meaningful queries for exploring the DBLP community to (i) investigate author profiles by analysing their publication records; (ii) identify the most prolific authors on a given topic, and (iii) perform social network analyses over the whole community. To date, GraphDBLP contains 5+ million nodes and 24+ million relationships, enabling users to explore the DBLP data by referencing more than 3.3 million publications, 1.7 million authors, and more than 5 thousand publication venues. Through the use of word-embedding, more than 7.5 thousand keywords and related similarity values were collected. GraphDBLP was implemented on top of the Neo4j graph database. The whole dataset and the source code are publicly available to foster the improvement of GraphDBLP in the whole computer science community.

Although the article is behind a paywall, GraphDBLP as a tool is not! https://github.com/fabiomercorio/GraphDBLP.

From the webpage:

GraphDBLP is a tool that models the DBLP bibliography as a graph database for performing graph-based queries and social network analyses.

GraphDBLP also enriches the DBLP data through semantic keyword similarities computed via word-embedding.

GraphDBLP provides to users three meaningful queries for exploring the DBLP community:

  1. investigate author profiles by analysing their publication records;
  2. identify the most prolific authors on a given topic;
  3. perform social network analyses over the whole community;
  4. perform shortest-paths over DBLP (e.g., the shortest-path between authors, the analysis of co-author networks, etc.)

… (emphasis in original)
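GraphDBLP runs its queries inside Neo4j, but the idea behind the fourth item, the shortest path between two authors, can be tried on a toy co-author graph without a database. The sketch below is plain Python with invented author names and paper ids; it is not GraphDBLP's code or schema, only a breadth-first search over "wrote a paper together" links.

```python
from collections import defaultdict, deque


def coauthor_graph(publications):
    """Build an undirected co-author graph: author -> set of co-authors."""
    graph = defaultdict(set)
    for authors in publications.values():
        for a in authors:
            graph[a].update(b for b in authors if b != a)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the authors on a shortest path, or None."""
    if source == target:
        return [source]
    previous = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == target:
                # Walk the predecessor links back to the source.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Invented toy data: paper id -> authors.
papers = {
    "p1": ["Ada", "Ben"],
    "p2": ["Ben", "Cho"],
    "p3": ["Cho", "Dee"],
    "p4": ["Eve"],
}
graph = coauthor_graph(papers)
print(shortest_path(graph, "Ada", "Dee"))
```

An author with no co-authors, like the hypothetical Eve here, is simply unreachable and the function returns None.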

Sorry to see author, title, venue, publication, keyword all as flat strings but that’s not uncommon. Disappointing but not uncommon.

Viewing these flat strings as parts of structured representations will be in addition to this default.

Not to minimize the importance of improving the usefulness of the dblp, but imagine integrating the GraphDBLP into your local library system. Without a massive data mapping project. That’s what lies just beyond the reach of this data project.


January 31st, 2018


From the webpage:

As the name might suggest AutoSploit attempts to automate the exploitation of remote hosts. Targets are collected automatically as well by employing the Shodan.io API. The program allows the user to enter their platform specific search query such as; Apache, IIS, etc, upon which a list of candidates will be retrieved.

After this operation has been completed the ‘Exploit’ component of the program will go about the business of attempting to exploit these targets by running a series of Metasploit modules against them. Which Metasploit modules will be employed in this manner is determined by programatically comparing the name of the module to the initial search query. However, I have added functionality to run all available modules against the targets in a ‘Hail Mary’ type of attack as well.

The available Metasploit modules have been selected to facilitate Remote Code Execution and to attempt to gain Reverse TCP Shells and/or Meterpreter sessions. Workspace, local host and local port for MSF facilitated back connections are configured through the dialog that comes up before the ‘Exploit’ component is started.

Operational Security Consideration

Receiving back connections on your local machine might not be the best idea from an OPSEC standpoint. Instead consider running this tool from a VPS that has all the dependencies required, available.

What a great day to be alive!

“Security experts,” such as Richard Bejtlich, @taosecurity, are already crying:

There is no need to release this. The tie to Shodan puts it over the edge. There is no legitimate reason to put mass exploitation of public systems within the reach of script kiddies. Just because you can do something doesn’t make it wise to do so. This will end in tears.

The same “security experts” who never complain about script kiddies that work for the CIA for example.

Script kiddies at the CIA? Sure! Who do you think uses the tools described in: Vault7: CIA Hacking Tools Revealed, Vault 7: ExpressLane, Vault 7: Angelfire, Vault 7: Protego, Vault 8: Hive?

You didn’t think CIA staff only use tools they develop themselves from scratch did you? Neither do “security experts,” even ones capable of replicating well known tools and exploits.

So why the complaints present and forthcoming from “security experts?”

Well, for one thing, they are no longer special guardians of secret knowledge.

Ok, in practical economic terms, AutoSploit means any business, corporation or individual can run a robust penetration test against their own systems.

You don’t need a “security expert” for the task. The “security experts” with all the hoarded knowledge and expertise.

Considering that “security experts” as a class (with notable exceptions) have sided with governments and corporations for decades, any downside for them is just an added bonus.

Email Address Vacuuming – Infoga

January 31st, 2018

Infoga – Email Information Gathering

From the post:

Infoga is a tool for gathering e-mail accounts information (ip,hostname,country,…) from different public sources (search engines, pgp key servers). Is a really simple tool, but very effective for the early stages of a penetration test or just to know the visibility of your company in the Internet.

It’s not COMINT:

COMINT or communications intelligence is intelligence gained through the interception of foreign communications, excluding open radio and television broadcasts. It is a subset of signals intelligence, or SIGINT, with the latter being understood as comprising COMINT and ELINT, electronic intelligence derived from non-communication electronic signals such as radar. (COMINT (Communications Intelligence))

as practiced by the NSA, but that doesn’t keep it from being useful.

Not gathering useless data means a smaller haystack and a greater chance of finding needles.

Other focused information mining tools you would recommend?

Don’t Mix Public and Dark Web Use of A Bitcoin Address

January 31st, 2018

Bitcoin payments used to unmask dark web users by John E Dunn.

From the post:

Researchers have discovered a way of identifying those who bought or sold goods on the dark web, by forensically connecting them to Bitcoin transactions.

It sounds counter-intuitive. The dark web comprises thousands of hidden services accessed through an anonymity-protecting system, usually Tor.

Bitcoin transactions, meanwhile, are supposed to be pseudonymous, which is to say visible to everyone but not in a way that can easily be connected to someone’s identity.

If you believe that putting these two technologies together should result in perfect anonymity, you might want to read When A Small Leak Sinks A Great Ship to hear some bad news:

Researchers matched Bitcoin addresses found on the dark web with those found on the public web. Depending on the amount of information on the public web, they identified named individuals.
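The matching step is not exotic. Here is a minimal sketch, assuming you already hold page text from both sides; it is not the researchers' pipeline. The page names and text are invented, the addresses are sample strings, and the regular expression only covers legacy Base58 addresses beginning with 1 or 3.

```python
import re

# Legacy Base58 address pattern (starts with 1 or 3); other formats are ignored.
ADDRESS_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")


def index_addresses(pages):
    """Map each address to the set of page names that mention it."""
    index = {}
    for name, text in pages.items():
        for address in ADDRESS_RE.findall(text):
            index.setdefault(address, set()).add(name)
    return index


def shared_addresses(dark_pages, public_pages):
    """Addresses seen on both sides, with the pages that mention them."""
    dark = index_addresses(dark_pages)
    public = index_addresses(public_pages)
    return {
        address: (sorted(dark[address]), sorted(public[address]))
        for address in dark.keys() & public.keys()
    }


# Invented pages with sample address strings.
dark_pages = {"vendor-listing": "Pay to 1BoatSLRHtKNngkdXEeobR76b53LETtpyT only."}
public_pages = {
    "forum-signature": "Tips welcome: 1BoatSLRHtKNngkdXEeobR76b53LETtpyT",
    "charity-page": "Donate: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa",
}
print(shared_addresses(dark_pages, public_pages))
```

One reused address is enough to tie the hypothetical vendor listing to the forum signature, which is the whole point of the rules that follow.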

Black Letter Rule: Maintain separate Bitcoin accounts for each online persona.

Black Letter Rule: Never use a public persona on the dark web or a dark web persona on the public web.

Black Letter Rule: Never make Bitcoin transactions between public and dark web personas.

Remind yourself of basic OpSec rules every day.

Better OpSec – Black Hat Webcast – Thursday, February 15, 2018 – 2:00 PM EST

January 30th, 2018

How the Feds Caught Russian Mega-Carder Roman Seleznev by Norman Barbosa and Harold Chun.

From the webpage:

How did the Feds catch the notorious Russian computer hacker Roman Seleznev – the person responsible for over 400 point of sale hacks and at least $169 million in credit card fraud? What challenges did the government face piecing together the international trail of electronic evidence that he left? How was Seleznev located and ultimately arrested?

This presentation will review the investigation that will include a summary of the electronic evidence that was collected and the methods used to collect that evidence. The team that convicted Seleznev will show how that evidence of user attribution was used to finger Seleznev as the hacker and infamous credit card broker behind the online nics nCuX, Track2, Bulba and 2Pac.

The presentation will also discuss efforts to locate Seleznev, a Russian national, and apprehend him while he vacationed in the Maldives. The presentation will also cover the August 2016 federal jury trial with a focus on computer forensic issues, including how prosecutors used Microsoft Windows artifacts to successfully combat Seleznev’s trial defense.

If you want to improve your opsec, study hackers who have been caught.

Formally it’s called avoiding survivorship bias. Survivorship bias – lessons from World War Two aircraft by Nick Ingram.

Abraham Wald was tasked with deciding where to add extra armour to improve the survival of airplanes in combat. Abraham Wald and the Missing Bullet Holes (An excerpt from How Not To Be Wrong by Jordan Ellenberg).
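If the missing bullet holes seem abstract, a small simulation makes them visible. This is a sketch under loudly stated assumptions: the three sections and their lethality figures are invented for illustration and are not Wald's data, and every simulated plane takes exactly one hit.

```python
import random

# Assumed numbers for illustration only, not Wald's data:
# probability that a single hit in each section brings the plane down.
LETHALITY = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1}


def simulate(planes, seed=0):
    """Give every plane one hit in a random section; tally all hits and surviving hits."""
    rng = random.Random(seed)
    all_hits = dict.fromkeys(LETHALITY, 0)
    seen_hits = dict.fromkeys(LETHALITY, 0)
    for _ in range(planes):
        section = rng.choice(list(LETHALITY))
        all_hits[section] += 1
        if rng.random() >= LETHALITY[section]:  # plane returns to base
            seen_hits[section] += 1
    return all_hits, seen_hits


all_hits, seen_hits = simulate(10_000)
print("hits on every plane:     ", all_hits)
print("hits on returning planes:", seen_hits)
```

Hits land evenly across sections, yet the returning planes show few engine hits. Counting only the survivors would put the armour in exactly the wrong place.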

It’s a great story and one you should remember.

Combating State of the Uniom Brain Damage – Malware Reversing – Burpsuite Keygen

January 30th, 2018

Malware Reversing – Burpsuite Keygen by @lkw.

From the post:

Some random new “user” called @the_heat_man posted some files on the forums multiple times (after being deleted by mods) caliming it was a keygen for burpsuite. Many members of these forums were suspicious of it being malware. I, along with @Leeky, @dtm, @Cry0l1t3 and @L0k1 (please let me know if I missed anyone) decided to reverse engineer it to see if it is. Surprisingly as well as containing a remote access trojan (RAT) it actually contains a working keygen. As such, for legal reasons I have not included a link to the original file.

The following is a writeup of the analysis of the RAT.

In the event that you, a friend or a family member is accidentally exposed to the State of the Uniom speech tonight, permanent brain damage can be avoided by repeated exposure to intellectually challenging material. For an extended time period.

With that in mind, I mention Malware Reversing – Burpsuite Keygen.

Especially challenging if you aren’t familiar with reverse engineering but the extra work of understanding each step will exercise your brain that much harder.

How serious can the brain damage be?

A few tweets from Potus and multiple sources report Democratic Senators and Representatives extolling the FBI as a bulwark of democracy.

Really? The same FBI that infiltrated civil rights groups, anti-war protesters, 9/11 defense, Black Panthers, SCLC, etc. That FBI? The same FBI that continues such activities to this very day?

A few tweets produce that level of brain dysfunction. Imagine the impact of 20 to 30 continuous minutes of exposure.

State of the Uniom is scheduled for 9 PM EST on 30 January 2018.

Readers are strongly advised to turn off all TVs and radios, to minimize the chances of accidental exposure to the State of the Uniom or repetition of the same. The New York Times will be streaming it live on its website. I have omitted that URL for your safety.

Safe activities include reading a book, consensual sex, knitting, baking, board games and crossword puzzles, to name only a few. Best of luck to us all.

Have You Been Drafted by Data Science Ethics?

January 29th, 2018

I ask because Strava’s recent heatmap release (Fitness tracking app Strava gives away location of secret US army bases) is being used as a platform to urge unpaid consideration of government and military interests by data scientists.

Consider Ray Crowell’s Strava Heatmaps: Why Ethics in Design Matters which presumes data scientists have an unpaid obligation to consider the interests of the military:

From the post:

These organizations have been warned for years (including by myself) of the information/operational security (specifically with pattern of life, that is, the data collected and analyzed establish an individual’s past behavior, determine their current behavior, and predict their future behavior) implications associated with social platforms and advanced analytical technology. I spent my career stabilizing this intersection between national security and progress — having a deep understanding of the protection of lives, billion-dollar weapon systems, and geopolitical assurances and on the other side, the power of many of these technological advancements in enabling access to health and wellness for all.

Getting at this balance requires us to not get enamored by the idea or implications of ethically sound solutions, but rather exposing our design practices to ethical scrutiny.

These tools are not only beneficial for the designer, but for the user as well. I mention these specifically for institutions like the Defense Department, impacted from the Strava heatmap and frankly many other technologies being employed both sanctioned and unsanctioned by military members and on military installations. These tools are beneficial the institution’s leadership to “reverse engineer” what technologies on the market can do by way of harm … in balance with the good. I learned a long time ago, from wiser mentors than myself, that you don’t know what you’re missing, if you’re not looking to begin with.

Crowell imposes an unpaid ethical obligation on any unsuspecting reader/data scientist to consider their impact on government or military organizations.

In that effort, Crowell is certainly not alone:

If you contract to work for a government or military group, you owe them an ethical obligation of your best efforts. Just as for any other client.

However, volunteering unpaid assistance for military or government organizations damages the market for data scientists.

Now that’s unethical!

PS: I agree there are ethical obligations to consider the impact of your work on disenfranchised, oppressed or abused populations. Governments and military organizations don’t qualify as any of those.

‘Learning to Rank’ (No Unique Feature Name Fail – Update)

January 24th, 2018

Elasticsearch ‘Learning to Rank’ Released, Bringing Open Source AI to Search Teams

From the post:

Search experts at OpenSource Connections, the Wikimedia Foundation, and Snagajob, deliver open source cognitive search capabilities to the Elasticsearch community. The open source Learning to Rank plugin allows organizations to control search relevance ranking with machine learning. The plugin is currently delivering search results at Wikipedia and Snagajob, providing significant search quality improvements over legacy solutions.

Learning to Rank lets organizations:

  • Directly optimize sales, conversions and user satisfaction in search
  • Personalize search for users
  • Drive deeper insights from a knowledge base
  • Customize ranking down for complex nuance
  • Avoid the sticker shock & lock-in of a proprietary "cognitive search" product

“Our mission is to empower search teams. This plugin gives teams deep control of ranking, allowing machine learning models to be directly deployed to the search engine for relevance ranking” said Doug Turnbull, author of Relevant Search and CTO, OpenSource Connections.

I need to work through all the documentation and examples but:

Feature Names are Unique

Because some model training libraries refer to features by name, Elasticsearch LTR enforces unique names for each features. In the example above, we could not add a new user_rating feature without creating an error.

is a warning of what you (and I) are likely to find.

Really? Someone involved in the design thought globally unique feature names were a good idea? Or at a minimum didn’t realize it is a very bad idea?

Scope anyone? Either in the programming or topic map sense?

Despite the unique feature name fail, I’m sure ‘Learning to Rank’ will be useful. But not as useful as it could have been.

Doug Turnbull (https://twitter.com/softwaredoug) advises that features are scoped by feature stores, so the correct prose would read: “…LTR enforces unique names for each feature within a feature store.”

No fail, just bad writing.
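The difference between globally unique and unique within a store is easy to show. The class below is a toy stand-in written for this post, not the Elasticsearch LTR plugin's API; `FeatureStore`, `add_feature` and the store names are hypothetical.

```python
class FeatureStore:
    """Toy stand-in for a feature store: names must be unique inside one store."""

    def __init__(self, name):
        self.name = name
        self._features = {}

    def add_feature(self, feature_name, definition):
        if feature_name in self._features:
            raise ValueError(
                f"feature {feature_name!r} already exists in store {self.name!r}"
            )
        self._features[feature_name] = definition

    def feature_names(self):
        return sorted(self._features)


movies = FeatureStore("movies")
books = FeatureStore("books")

movies.add_feature("user_rating", "hypothetical query on a rating field")
books.add_feature("user_rating", "a different hypothetical query")  # separate scope

try:
    movies.add_feature("user_rating", "duplicate")
except ValueError as err:
    print(err)
```

The same name lives happily in two stores and collides only inside one, which is what the corrected sentence says.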

Eset’s Guide to DeObfuscating and DeVirtualizing FinFisher

January 24th, 2018

Eset’s Guide to DeObfuscating and DeVirtualizing FinFisher

From the introduction:

Thanks to its strong anti-analysis measures, the FinFisher spyware has gone largely unexplored. Despite being a prominent surveillance tool, only partial analyses have been published on its more recent samples.

Things were put in motion in the summer of 2017 with ESET’s analysis of FinFisher surveillance campaigns that ESET had discovered in several countries. In the course of our research, we have identified campaigns where internet service providers most probably played the key role in compromising the victims with FinFisher.

When we started thoroughly analyzing this malware, the main part of our effort was overcoming FinFisher’s anti-analysis measures in its Windows versions. The combination of advanced obfuscation techniques and proprietary virtualization makes FinFisher very hard to de-cloak.

To share what we learnt in de-cloaking this malware, we have created this guide to help others take a peek inside FinFisher and analyze it. Apart from offering practical insight into analyzing FinFisher’s virtual machine, the guide can also help readers to understand virtual machine protection in general – that is, proprietary virtual machines found inside a binary and used for software protection. We will not be discussing virtual machines used in interpreted programming languages to provide compatibility across various platforms, such as the Java VM.

We have also analyzed Android versions of FinFisher, whose protection mechanism is based on an open source LLVM obfuscator. It is not as sophisticated or interesting as the protection mechanism used in the Windows versions, thus we will not be discussing it in this guide.

Hopefully, experts from security researchers to malware analysts will make use of this guide to better understand FinFisher’s tools and tactics, and to protect their customers against this omnipotent security and privacy threat.

Beyond me at the moment but one should always try to learn from the very best. Making note of what can’t be understood/used today in hopes of revisiting it in the future.

Numerous reports describe FinFisher as spyware sold exclusively to governments and their agencies. Perhaps less “exclusively” than previously thought.

In any event, FinFisher is reported to be in the wild so perhaps governments that bought FinFisher will be uncovered by FinFisher.

A more deserving group of people is hard to imagine.

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text

January 24th, 2018

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text by Nicholas Carlini and David Wagner.


We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our iterative optimization-based attack to Mozilla’s implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduce a new domain to study adversarial examples.

You can consult the data used and code at: http://nicholas.carlini.com/code/audio_adversarial_examples.

Important not only for defeating automatic speech recognition but also for establishing that the properties of audio recognition differ from those of visual recognition.

A hint that automatic recognition properties cannot be assumed for unexplored domains.
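For readers new to adversarial examples, the core move of an iterative, optimization-based attack fits in a few lines on a toy model. To be clear about assumptions: this is a linear scorer with made-up weights, not DeepSpeech and not Carlini and Wagner's code. The sign of the score plays the part of the transcription, and `targeted_example` is a name of my own.

```python
def score(weights, bias, x):
    """Linear score; its sign is the toy model's 'transcription'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def targeted_example(weights, bias, x, target_sign, step=0.01, max_steps=1000):
    """Take small gradient steps on the input until the score has the target sign."""
    adversarial = list(x)
    for _ in range(max_steps):
        if score(weights, bias, adversarial) * target_sign > 0:
            break
        # The gradient of a linear score with respect to the input is the weight vector.
        adversarial = [a + target_sign * step * w for a, w in zip(adversarial, weights)]
    return adversarial


# Made-up weights and input.
weights, bias = [1.0, -2.0, 0.5], 0.1
original = [0.2, 0.4, 0.1]
adversarial = targeted_example(weights, bias, original, target_sign=1)
largest_change = max(abs(a - o) for a, o in zip(adversarial, original))
print(score(weights, bias, original), score(weights, bias, adversarial), largest_change)
```

The input moves only slightly while the output flips to the chosen target. The real attack does the same kind of thing against a full speech-to-text network, with a loss function in place of the bare score.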

Visualizing trigrams with the Tidyverse (Who Reads Jane Austen?)

January 24th, 2018

Visualizing trigrams with the Tidyverse by Emil Hvitfeldt.

From the post:

In this post I’ll go though how I created the data visualization I posted yesterday on twitter:

Great post and R code, but who reads Jane Austen? 😉

I have a serious weakness for academic and ancient texts so the Jane Austen question is meant in jest.

The more direct question is to what other texts would you apply this trigram/visualization technique?


I have some texts in mind but defer mentioning them while I prepare a demonstration applying Hvitfeldt’s technique to them.
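Hvitfeldt works in R with the Tidyverse. For anyone who wants to try the counting step on another text first, here is a rough Python equivalent. It is a sketch of trigram counting only, not a port of his code: the tokenizer is a crude regular expression of my choosing and the sample sentence is invented.

```python
import re
from collections import Counter


def trigrams(text):
    """Return a Counter of consecutive three-word sequences in the text."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    # Trigrams are allowed to run across sentence boundaries.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:], words[2:]))


sample = "She said that she would. He said that he would not."
for gram, count in trigrams(sample).most_common(3):
    print(count, " ".join(gram))
```

The visualization is a separate step; this only produces the counts you would feed into it.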

PS: I ran across an odd comment in the janeaustenr package:

Each text is in a character vector with elements of about 70 characters.

You have to hunt for a bit but 70 characters is the default plain text line length at Gutenberg. Some poor decisions are going to be with us for a very long time.
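Undoing the line breaks is simple enough. A minimal sketch, assuming blank elements mark paragraph breaks, which I have not verified against every text in the package; the function name `paragraphs` is mine and the sample is the opening of Pride and Prejudice broken by hand.

```python
def paragraphs(lines):
    """Join consecutive non-blank lines into paragraphs; blank lines separate them."""
    result, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            result.append(" ".join(current))
            current = []
    if current:
        result.append(" ".join(current))
    return result


# Hand-broken sample standing in for the roughly 70-character elements.
lines = [
    "It is a truth universally acknowledged, that a single man in",
    "possession of a good fortune, must be in want of a wife.",
    "",
    "However little known the feelings or views of such a man may be",
]
for paragraph in paragraphs(lines):
    print(paragraph)
```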