LOFAR Transients Pipeline (“TraP”)

May 24th, 2015

LOFAR Transients Pipeline (“TraP”)

From the webpage:

The LOFAR Transients Pipeline (“TraP”) provides a means of searching a stream of N-dimensional (two spatial, frequency, polarization) image “cubes” for transient astronomical sources. The pipeline is developed specifically to address data produced by the LOFAR Transients Key Science Project, but may also be applicable to other instruments or use cases.

The TraP codebase provides the pipeline definition itself, as well as a number of supporting routines for source finding, measurement, characterization, and so on. Some of these routines are also available as stand-alone tools.

High-level overview

The TraP consists of a tightly-coupled combination of a “pipeline definition” – effectively a Python script that marshals the flow of data through the system – with a library of analysis routines written in Python and a database, which not only contains results but also performs a key role in data processing.

Broadly speaking, as images are ingested by the TraP, a Python-based source-finding routine scans them, identifying and measuring all point-like sources. Those sources are ingested by the database, which associates them with previous measurements (both from earlier images processed by the TraP and from other catalogues) to form a lightcurve. Measurements are then performed at the locations of sources which were expected to be seen in this image but which were not detected. A series of statistical analyses are performed on the lightcurves constructed in this way, enabling the quick and easy identification of potential transients. This process results in two key data products: an archival database containing the lightcurves of all point-sources included in the dataset being processed, and community alerts of all transients which have been identified.

Exploiting the results of the TraP involves understanding and analysing the resulting lightcurve database. The TraP itself provides no tools directly aimed at this. Instead, the Transients Key Science Project has developed the Banana web interface to the database, which is maintained separately from the TraP. The database may also be interrogated by end-user developed tools using SQL.

While it uses the term “association,” I think you will conclude it is much closer to merging in a topic map sense:

The association procedure knits together (“associates”) the measurements in extractedsource which are believed to originate from a single astronomical source. Each such source is given an entry in the runningcatalog table which ties together all of the measurements by means of the assocxtrsource table. Thus, an entry in runningcatalog can be thought of as a reference to the lightcurve of a particular source.
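The TraP does this inside its database, but as a hedged, minimal sketch of the idea in Python (the radius, field names and flat-sky distance below are illustrative assumptions, not TraP code), positional association might look like:

  import math

  running_catalog = []          # one dict per source, holding its lightcurve
  ASSOC_RADIUS_DEG = 0.01       # placeholder association radius

  def associate(measurement):
      """Attach a measurement to the nearest catalogue source, or start a new one."""
      best, best_dist = None, ASSOC_RADIUS_DEG
      for source in running_catalog:
          # Flat-sky approximation; a real pipeline uses spherical distances.
          d = math.hypot(measurement["ra"] - source["ra"],
                         measurement["dec"] - source["dec"])
          if d < best_dist:
              best, best_dist = source, d
      if best is None:
          best = {"ra": measurement["ra"], "dec": measurement["dec"], "lightcurve": []}
          running_catalog.append(best)
      best["lightcurve"].append(measurement)
      return best

  associate({"ra": 123.456, "dec": -12.345, "flux": 0.8, "image": "img-001"})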

Perhaps not of immediate use but good reading and a diversion from corruption, favoritism, oppression and other usual functions of government.

Beyond TPP: An International Agreement to Screw Security Researchers and Average Citizens

May 24th, 2015

US Govt proposes to classify cybersecurity or hacking tools as weapons of war by Manish Singh.

From the post:

Until now, only when someone possessed a chemical, biological or nuclear weapon was it considered to be a weapon of mass destruction in the eyes of the law. But we could have an interesting — and equally controversial — addition to this list soon. The Bureau of Industry and Security (BIS), an agency of the United States Department of Commerce that deals with issues involving national security and high technology, has proposed tighter export rules for computer security tools — first brought up in the Wassenaar Arrangement (WA) at the Plenary meeting in December 2013. This proposal could potentially revise an international agreement aimed at controlling weapons technology as well as hinder the work of security researchers.

At the meeting, a group of 41 like-minded states discussed ways to bring cybersecurity tools under the umbrella of law, just as any other global arms trade. This includes guidelines on export rules for licensing technology and software as it crosses an international border. Currently, these tools are controlled based on their cryptographic functionality. While BIS is yet to clarify things, the new proposed rule could disallow encryption license exceptions.

This is like attempting to control burglary by prohibiting the possession of hammers. Hammers are quite useful for a number of legitimate tasks. But you can’t have one because some homeowners put weak locks on their doors.

If you want to see all the detail: Wassenaar Arrangement 2013 Plenary Agreements Implementation: Intrusion and Surveillance Items.

Deadline for comments is: 07/20/2015.

Warning: It is not written for casual reading. I may try to work through it in case anyone wants to point to actual parts of the proposal that are defective.

The real danger of such a proposal isn’t that the Feds will run amok prosecuting people, but that it will give them a basis for leaning on innocent researchers and intimating that a friendly U.S. Attorney and district judge might just buy an argument that you have violated an export restriction.

Most people don’t have the resources to face such threats (think Aaron Swartz) and so the Feds win by default.

If you don’t want to be an unnamed victim of federal intimidation or a known victim like Aaron Swartz, the time to stop this atrocity is now.

10 Expert Search Tips for Finding Who, Where, and When

May 24th, 2015

10 Expert Search Tips for Finding Who, Where, and When by Henk van Ess.

From the post:

Online research is often a challenge. Information from the web can be fake, biased, incomplete, or all of the above.

Offline, too, there is no happy hunting ground with unbiased people or completely honest authorities. In the end, it all boils down to asking the right questions, digital or not. Here are some strategic tips and tools for digitizing three of the most asked questions: who, where and when? They all have in common that you must “think like the document” you search.

Whether you are writing traditional news, blogging or writing a topic map, it’s difficult to think of a subject where “who, where, and when” aren’t going to come up.

Great tips! The start of a great one-pager to keep close at hand.

VA fails cybersecurity audit for 16th straight year

May 24th, 2015

VA fails cybersecurity audit for 16th straight year by Katie Dvorak.

From the post:

The U.S. Department of Veterans Affairs, which failed its Federal Information Security Management Act Audit for Fiscal Year 2014, is taking major steps to fix its cybersecurity in the wake of increasing scrutiny over vulnerabilities and cyberdeficiencies at the agency, according to an article at Federal News Radio.

This marks the 16th consecutive year the VA has failed the cybersecurity audit, according to the article. While the audit found that the agency has made progress in creating security policies and procedures, it also determined that problems remain in implementing its security risk management program.

“Weaknesses in access and configuration management controls resulted from VA not fully implementing security standards on all servers, databases, and network devices,” the report reads. “VA also has not effectively implemented procedures to identify and remediate system security vulnerabilities on network devices, database, and server platforms VA-wide.”

The first cybersecurity lesson here is that if you are exchanging data with or interacting with the Veterans Administration, do so on a computer that is completely isolated from your network. In the event of data transfers, be sure to scan and clean all incoming data from the VA.

The second lesson is that requiring cybersecurity, in the absence of incentives for its performance or penalties for its lack, will always end badly.

Astoria, the Tor client designed to beat the NSA surveillance

May 24th, 2015

Astoria, the Tor client designed to beat the NSA surveillance by Pierluigi Paganini.

From the post:

A team of security researchers has announced the development of Astoria, a new Tor client designed to beat the NSA and reduce the efficiency of timing attacks.

Tor and the Deep Web are becoming popular terms even among ordinary Internet users; the use of the anonymizing network is constantly increasing, and for this reason intelligence agencies are focusing their efforts on monitoring it.

Edward Snowden has revealed that intelligence agencies belonging to the Five Eyes alliance have tried several techniques to de-anonymize Tor users.

Today I want to introduce you to the result of a joint effort by security researchers from American and Israeli organizations, who have developed a new advanced Tor client called Astoria.

The Astoria Tor client was specially designed to protect Tor users from surveillance activities; it implements a series of features that make eavesdropping harder.

Time to upgrade and to help support the Tor network!

Every day that you help degrade NSA surveillance, you contribute to your own safety and the safety of others.

Summarizing and understanding large graphs

May 24th, 2015

Summarizing and understanding large graphs by Danai Koutra, U Kang, Jilles Vreeken and Christos Faloutsos. (DOI: 10.1002/sam.11267)

Abstract:

How can we succinctly describe a million-node graph with a few simple sentences? Given a large graph, how can we find its most “important” structures, so that we can summarize it and easily visualize it? How can we measure the “importance” of a set of discovered subgraphs in a large graph? Starting with the observation that real graphs often consist of stars, bipartite cores, cliques, and chains, our main idea is to find the most succinct description of a graph in these “vocabulary” terms. To this end, we first mine candidate subgraphs using one or more graph partitioning algorithms. Next, we identify the optimal summarization using the minimum description length (MDL) principle, picking only those subgraphs from the candidates that together yield the best lossless compression of the graph—or, equivalently, that most succinctly describe its adjacency matrix.

Our contributions are threefold: (i) formulation: we provide a principled encoding scheme to identify the vocabulary type of a given subgraph for six structure types prevalent in real-world graphs, (ii) algorithm: we develop VoG, an efficient method to approximate the MDL-optimal summary of a given graph in terms of local graph structures, and (iii) applicability: we report an extensive empirical evaluation on multimillion-edge real graphs, including Flickr and the Notre Dame web graph.

Use the DOI if you need the official citation; otherwise, the title link takes you to a non-firewalled copy.

A must read if you are trying to extract useful information out of multimillion-edge graphs.
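To get a feel for the “vocabulary” idea without the authors’ VoG code, here is a hedged networkx sketch that mines two of the structure types (cliques and stars) from a toy graph; the MDL-based selection step, which is the heart of the paper, is omitted:

  import networkx as nx

  # Toy graph containing one clique and one star, joined by a single edge.
  G = nx.complete_graph(5)                              # a 5-clique on nodes 0..4
  G.add_edges_from((100, v) for v in range(101, 112))   # a star: hub 100, 11 spokes
  G.add_edge(4, 100)                                    # connect the two structures

  # Candidate cliques of size >= 4.
  cliques = [c for c in nx.find_cliques(G) if len(c) >= 4]

  # Candidate stars: a hub whose spokes connect only to the hub.
  stars = []
  for hub in G.nodes:
      spokes = [v for v in G.neighbors(hub) if G.degree(v) == 1]
      if len(spokes) >= 5:
          stars.append((hub, len(spokes)))

  print("cliques:", cliques)   # e.g. [[0, 1, 2, 3, 4]]
  print("stars:", stars)       # e.g. [(100, 11)]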

I first saw this in a tweet by Kirk Borne.

.sucks

May 24th, 2015

New gTLDs: .SUCKS Illustrates Potential Problems for Security, Brand Professionals by Camille Stewart.

From the post:

The launch of the .SUCKS top-level domain name (gTLD) has reignited and heightened concerns about protecting brands and trademarks from cybersquatters and malicious actors. This new extension, along with more than a thousand others, has been approved by the Internet Corporation for Assigned Names and Numbers (ICANN) as part of their new gTLD program. The program was designed to spark competition and innovation by opening up the market to additional gTLDs.

Not surprisingly, though, complaints are emerging that unscrupulous operators are using .SUCKS to extort money from companies by threatening to use it to create websites that could damage their brands. ICANN is now reportedly asking the Federal Trade Commission (FTC) and Canada’s Office of Consumer Affairs to weigh in on potential abuses so it can address them. Recently, Congress weighed in on the issue, holding a hearing about .SUCKS and other controversial domains like .PORN.

Vox Populi Registry Ltd. began accepting registrations for .SUCKS domains on March 30 from trademark holders and celebrities before it opened to public applicants. It recommended charging $2,499 a year for each domain name registration, and according to Vox Populi CEO John Berard, resellers are selling most of the names for around $2,000 a year. Berard asserts that the extension is meant to create destinations for companies to interact with their critics, and called his company’s business “well within the lines of ICANN rules and the law.”

If you follow the link to the statement by Vox Populi CEO John Berard, that post concludes with:

The new gTLD program is about increasing choice and competition in the TLD space, it’s not supposed to be about applicants bilking trademark owners for whatever they think they can get away with.

A rather surprising objection considering that trademark (and copyright) owners have been bilking/gouging consumers for centuries.

Amazing how sharp the pain can be when a shoe pinches on a merchant’s foot.

How many Disney properties could end in .sucks? (Research question)

Kafka and the Foreign Intelligence Surveillance Court (FISA)

May 24th, 2015

Quiz: Just how Kafkaesque is the court that oversees NSA spying? by Alvaro Bedoya and Ben Sobel.

From the post:

When Edward Snowden first went public, he did it by leaking a 4-page order from a secret court called the Foreign Intelligence Surveillance Court, or FISA court. Founded in 1978 after the Watergate scandal and investigations by the Church Committee, the FISA court was supposed to be a bulwark against secret government surveillance. In 2006, it authorized the NSA call records program – the single largest domestic surveillance program in American history.

“The court” in Franz Kafka’s novel The Trial is a shadowy tribunal that tries (and executes) Josef K., the story’s protagonist, without informing him of the crime he’s charged with, the witnesses against him, or how he can defend himself. (Worth noting: The FISA court doesn’t “try” anyone. Also, it doesn’t kill people.)

Congress is debating a bill that would make the FISA court more transparent. In the meantime, can you tell the difference between the FISA court and Kafka’s court?

After you finish the quiz, if you haven’t read The Trial by Franz Kafka, you should.

I got 7/11. What’s your score?

The FISA court is an illusion of due process that has been foisted off on American citizens.

To be fair, the number of rejected search or arrest warrants in regular courts is as tiny as the number of rejected applications in FISA court. (One study reports 1 rejected search warrant out of 1,748. Craig D. Uchida, Timothy S. Bynum, Search Warrants, Motions to Suppress and Lost Cases: The Effects of the Exclusionary Rule in Seven Jurisdictions, 81 J. Crim. L. & Criminology 1034 (1990-1991), at page 1058.)

However, any warrant issued by a regular court, including the affidavit setting forth “probable cause” becomes public. Both the police and judicial officers know the basis for warrants will be seen by others, which encourages following the rules for probable cause.

Contrast that with the secret warrants and the basis for secret warrants from the FISA court. There is no opportunity for the public to become informed about the activities of the FISA court or the results of the warrants that it issues. The non-public nature of the FISA court deprives voters of the ability to effectively voice concerns about the FISA court.

The only effective way to dispel the illusion that secrecy is required for the FISA court is for there to be massive and repetitive leaks of FISA applications and opinions. Just like with the Pentagon Papers, the sky will not fall and the public will learn the FISA court was hiding widespread invasions of privacy based on the thinnest tissues of fantasy from intelligence officers.

If you think I am wrong about the FISA court, name a single government leak that did not reveal the government breaking the law, attempting to conceal incompetence or avoid accountability. Suggestions?

Global Investigative Journalism Conference (Lillehammer, October 8th-11th 2015)

May 24th, 2015

Global Investigative Journalism Conference (Lillehammer, October 8th-11th 2015)

From the news page:

This year’s global event for muckrakers is approaching! Today we’re pleased to reveal the first glimpse of the program for the 9th Global Investigative Journalism Conference — #GIJC15 — in Lillehammer, Norway.

First in line are the data tracks. We have 56 sessions dedicated to data-driven journalism already confirmed, and there is more to come.

Three of the four data tracks will be hands-on, while a fourth will be showcases. In addition to that, the local organizing committee has planned a Data Pub.

The heavy security and scraping stuff will be in a special room, with three days devoted to security issues and webscraping with Python. The attendees will be introduced to how to encrypt emails, their own laptop and USB-sticks. They will also be trained to install security apps for text and voice. For those who think Python is too difficult, import.io is an option.

For the showcases, we hope the audience will appreciate demonstrations from some of the authors behind the Verification Handbook, on advanced internet search techniques and using social media in research. There will be sessions on how to track financial crime, and the journalists behind LuxLeaks and SwissLeaks will conduct different sessions.

BTW, you can become a sponsor for the conference:

Interested in helping sponsor the GIJC? Here’s a chance to reach and support the “special forces” of journalism around the world – the reporters, editors, producers and programmers on the front lines of battling crime, corruption, abuse of trust, and lack of accountability. You’ll join major media organizations, leading technology companies, and influential foundations. Contact us at hello@gijn.org.

Opposing “crime, corruption, abuse of trust, and lack of accountability?” There are easier ways to make a living but few are as satisfying.

PS: Looks like a good venue for discussing how topic maps could integrate resources from different sources or researchers.

Mastering Emacs (new book)

May 23rd, 2015

Mastering Emacs by Mickey Petersen.

I can’t recommend Mastering Emacs as light beach reading but next to a computer, it has few if any equals.

I haven’t ordered a copy (yet) but based on the high quality of Mickey’s Emacs posts, I recommend it sight unseen.

You can look inside at the TOC.

If you still need convincing, browse Mickey’s Full list of Tips, Tutorials and Articles for a generous sampling of his writing.

Enjoy!

Deep Learning (MIT Press Book) – Update

May 22nd, 2015

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow and Aaron Courville.

I mentioned this book last August and wanted to point out that a new draft appeared on 19/05/2015.

Typos and opportunities for improvement still exist! Now is your chance to help the authors make this a great book!

Enjoy!

The Unreasonable Effectiveness of Recurrent Neural Networks

May 22nd, 2015

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.

From the post:

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?

I try to blog or reblog about worthy posts by others but every now and again, I encounter a post that is stunning in its depth and usefulness.

This post by Andrej Karpathy is one of the stunning ones.

In addition to covering RNNs in general, he takes the reader on a tour of “Fun with RNNs.”

Which covers the application of RNNs to:

  • A Paul Graham generator
  • Shakespeare
  • Wikipedia
  • Algebraic Geometry (Latex)
  • Linux Source Code

Along with source code, Andrej provides a list of further reading.
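Karpathy’s released code is a multi-layer LSTM trained in Torch; as a rough, untrained numpy sketch of the character-by-character sampling loop he describes (the alphabet, sizes and weights below are arbitrary placeholders), the generation step looks like this:

  import numpy as np

  chars = list("abcdefghijklmnopqrstuvwxyz ")
  vocab, hidden = len(chars), 32
  rng = np.random.default_rng(0)
  Wxh = rng.normal(0, 0.01, (hidden, vocab))    # input-to-hidden weights
  Whh = rng.normal(0, 0.01, (hidden, hidden))   # hidden-to-hidden (recurrent) weights
  Why = rng.normal(0, 0.01, (vocab, hidden))    # hidden-to-output weights

  def sample(seed_ix, n):
      """Generate n characters, feeding each sampled character back in."""
      x = np.zeros(vocab); x[seed_ix] = 1.0
      h = np.zeros(hidden)
      out = []
      for _ in range(n):
          h = np.tanh(Wxh @ x + Whh @ h)        # recurrent state update
          p = np.exp(Why @ h); p /= p.sum()     # softmax over the vocabulary
          ix = rng.choice(vocab, p=p)
          out.append(chars[ix])
          x = np.zeros(vocab); x[ix] = 1.0      # feed the sample back in
      return "".join(out)

  print(sample(0, 40))   # gibberish until the weights are actually trained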

What’s your example of using RNNs?

Harvesting Listicles

May 22nd, 2015

Scrape website data with the new R package rvest by hkitson@zevross.com.

From the post:

Copying tables or lists from a website is not only a painful and dull activity but it’s error prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python’s Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, we opted to use, as the example, data on wineries and breweries here in the Finger Lakes region of New York.

Lists and listicles are a common form of web content. Unfortunately, both are difficult to improve without harvesting the content and recasting it.

This post will put you on the right track to harvesting with rvest!
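The post works in R with rvest; for the Python route it mentions (Beautiful Soup), a minimal equivalent sketch looks like this, with the URL and selectors as placeholders rather than the post’s winery listing:

  import requests
  from bs4 import BeautifulSoup

  url = "https://example.com/wineries"          # placeholder listicle URL
  html = requests.get(url, timeout=30).text
  soup = BeautifulSoup(html, "html.parser")

  rows = []
  for tr in soup.select("table tr")[1:]:        # skip the header row
      cells = [td.get_text(strip=True) for td in tr.find_all("td")]
      if len(cells) >= 2:                       # assume name and address columns
          rows.append((cells[0], cells[1]))

  for name, address in rows:
      print(f"{name}: {address}")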

BTW, as a benefit to others, post data that you clean/harvest in a clean format. Yes?

First experiments with Apache Spark at Snowplow

May 22nd, 2015

First experiments with Apache Spark at Snowplow by Justin Courty.

From the post:

As we talked about in our May post on the Spark Example Project release, at Snowplow we are very interested in Apache Spark for three things:

  1. Data modeling i.e. applying business rules to aggregate up event-level data into a format suitable for ingesting into a business intelligence / reporting / OLAP tool
  2. Real-time aggregation of data for real-time dashboards
  3. Running machine-learning algorithms on event-level data

We’re just at the beginning of our journey getting familiar with Apache Spark. I’ve been using Spark for the first time over the past few weeks. In this post I’ll share back with the community what I’ve learnt, and will cover:

  1. Loading Snowplow data into Spark
  2. Performing simple aggregations on Snowplow data in Spark
  3. Performing funnel analysis on Snowplow data

I’ve tried to write the post in a way that’s easy to follow-along for other people interested in getting up the Spark learning curve.

What a great post to find just before the weekend!

You will enjoy this one and others in this series.
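If you want a feel for the “simple aggregations” step before diving in, here is a hedged PySpark sketch; the file name and column positions are placeholders, not Snowplow’s enriched event format:

  from pyspark import SparkContext

  sc = SparkContext(appName="event-aggregation")

  # Placeholder input: tab-separated events, user id in column 0, event name in column 1.
  events = sc.textFile("enriched-events.tsv").map(lambda line: line.split("\t"))

  counts = (events
            .map(lambda f: ((f[0], f[1]), 1))   # key by (user, event type)
            .reduceByKey(lambda a, b: a + b)    # count events per key
            .sortBy(lambda kv: -kv[1]))         # most frequent first

  for (user, event), n in counts.take(10):
      print(user, event, n)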

Have you ever considered aggregating into a business dashboard what is known about particular subjects? We have all seen dashboards with increasing counts, graphs, charts, etc., but what about non-tabular data?

A non-tabular dashboard?

Rosetta’s Way Back to the Source

May 22nd, 2015

Rosetta’s Way Back to the Source – Towards Reverse Engineering of Complex Software by Herman Bos.

From the webpage:

The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software that is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).

In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.

Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used

Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.
… (emphasis added)

Herman gives a great thumbnail sketch of the difficulties and potential for this project.

Looking forward to news of a demonstration that “irreversibility of compilation” is false.

One important use case is verifying that software which claims to use buffer overflow prevention techniques has in fact done so. Not the sort of thing I would entrust to statements in marketing materials.

MINIX 3

May 22nd, 2015

MINIX 3

From the read more page:

MINIX 3 is a free open-source operating system that can be used for studying operating systems, as a base for research projects, or for commercial (embedded) systems where microkernel systems dominate the market. Much of the focus on the project is on achieving high reliability through fault tolerance and self-healing techniques.

MINIX is based on a small (about 12K lines of code) microkernel that runs in kernel mode. The rest of the operating system runs as a collection of server processes, each one protected by the hardware MMU. These processes include the virtual file system, one or more actual file systems, the memory manager, the process manager, the reincarnation server, and the device drivers, each one running as a separate user-mode process.

One consequence of this design is that failures of the system due to bugs or attacks are isolated. For example, a failure or takeover of the audio driver due to a bug or exploit can lead to strange sounds but cannot lead to a full takeover of the operating system. Similarly, crashes of a system component can in many cases be automatically and transparently recovered without human intervention. Few, if any, other operating systems are as self-healing as MINIX 3.

Are you still using an insecure and bug-riddled OS?

Your choice.

Authorea

May 22nd, 2015

Authorea is the collaborative typewriter for academia.

From the website:

Write on the web.
Writing a scientific article should be as easy as writing a blog post. Every document you create becomes a beautiful webpage, which you can share.
Collaborate.
More and more often, we write together. A recent paper coauthored on Authorea by a CERN collaboration counts over 200 authors. If we can solve collaboration for CERN, we can solve it for you too!
Version control.
Authorea uses Git, a robust versioning control system to keep track of document changes. Every edit you and your colleagues make is recorded and can be undone at any time.
Use many formats.
Authorea lets you write in LaTeX, Markdown, HTML, Javascript, and more. Different coauthors, different formats, same document.
Data-rich science.
Did you ever wish you could share with your readers the data behind a figure? Authorea documents can take data alongside text and images, such as IPython notebooks and d3.js plots to make your articles shine with beautiful data-driven interactive visualizations.

Sounds good? Sign up or log in to get started immediately, for free.

Authorea uses a gentle form of open source persuasion. You can have one (1) private article for free but unlimited public articles. As your monthly rate goes up, you can have an increased number of private articles. Works for me because most if not all of my writing/editing is destined to be public anyway.

Standards are most useful when they are writ LARGE so private or “secret” standards have never made sense to me.

Sharif University

May 22nd, 2015

Treadstone 71 continues to act as an unpaid (so far as I know) advertising agent for Sharif University.

From the university homepage:

Sharif University of Technology is one of the largest engineering schools in the Islamic Republic of Iran. It was established in 1966 under the name of Aryarmehr University of Technology and, at that time, there were 54 faculty members and a total of 412 students who were selected by national examination. In 1980, the university was renamed Sharif University of Technology. SUT now has a total of 300 full-time faculty members, approximately 430 part-time faculty members and a student body of about 12,000.

Treadstone 71 has published course notes from an Advanced Network Security course on honeypots.

In part Treadstone 71 comments:

There are many documents available on honeypot detection. Not too many are found as a Master’s course at University levels. Sharif University as part of the Iranian institutionalized efforts to build a cyber warfare capability for the government in conjunction with AmnPardaz, Ashiyane, and shadowy groups such as Ajax and the Iranian Cyber Army is highly focused on such an endeavor. With funding coming from the IRGC, infiltration of classes and as members of academia with Basij members, Sharif University is the main driver of information security and cyber operations in Iran. Below is another of many such examples. Honeypots and how to detect them is available for your review.

It is difficult to find a Master’s degree in CS that doesn’t include coursework on network security in general and honeypots in particular. I spot checked some of the degrees offered by schools listed at: Best Online Master’s Degrees in Computer Science and found no shortage of information on honeypots.

I recognize the domestic (U.S.) political hysteria surrounding Iran but security decisions based on rumor and unfounded fears aren’t the best ones.

Project Naptha

May 22nd, 2015

Project Naptha – highlight, copy, and translate text from any image by Kevin Kwok.

From the webpage:

Project Naptha automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.

The homepage has examples of Project Naptha being used on comics, scans, photos, diagrams, Internet memes, screenshots, along with sneak peeks at beta features, such as translation, erase text (from images) and change text. (You can select multiple regions with the shift key.)

This should be especially useful for journalists, bloggers, researchers, basically anyone who spends a lot of time looking for content on the Web.

If the project needs a slogan, I would suggest:

Naptha Frees Information From Image Prisons!

FBI Director Comey As Uninformed or Not Fair-Minded

May 21st, 2015

Transcript: Comey Says Authors of Encryption Letter Are Uninformed or Not Fair-Minded by Megan Graham.

From the post:

Earlier today, FBI Director James Comey implied that a broad coalition of technology companies, trade associations, civil society groups, and security experts were either uninformed or were not “fair-minded” in a letter they sent to the President yesterday urging him to reject any legislative proposals that would undermine the adoption of strong encryption by US companies. The letter was signed by dozens of organizations and companies in the latest part of the debate over whether the government should be given built-in access to encrypted data (see, for example, here, here, here, and here for previous iterations).

The comments were made at the Third Annual Cybersecurity Law Institute held at Georgetown University Law Center. The transcript of his encryption-related discussion is below (emphasis added).

A group of tech companies and some prominent folks wrote a letter to the President yesterday that I frankly found depressing. Because their letter contains no acknowledgment that there are societal costs to universal encryption. Look, I recognize the challenges facing our tech companies. Competitive challenges, regulatory challenges overseas, all kinds of challenges. I recognize the benefits of encryption, but I think fair-minded people also have to recognize the costs associated with that. And I read this letter and I think, “Either these folks don’t see what I see or they’re not fair-minded.” And either one of those things is depressing to me. So I’ve just got to continue to have the conversation.

Governments have a long history of abusing citizens and data entrusted to them.

Director Comey is very uninformed if he is unaware of the role that Hollerith machines and census data played in the Holocaust.

Holland embraced Hollerith machines for its 1930’s census, with the same good intentions as FBI Director Comey for access to encrypted data.

France never made effective use of Hollerith machines prior to or during WWII.

What difference did that make?

Country    Jewish Population   Deported   Murdered   Death Ratio
Holland    140,000             107,000    102,000    73%
France     300,000 to 350,000  85,000     82,000     25%

(IBM and the Holocaust : the strategic alliance between Nazi Germany and America’s most powerful corporation by Edwin Black, at page 332.)

A difference of nearly fifty percentage points in the effectiveness of government oppression due to technology sounds like a lot to me. Bear in mind that Comey wants to make government collection of data even more widespread and efficient.

If invoking the Holocaust seems like a reflex action, consider the Hollywood blacklist, the McCarthy era, government infiltration of the peace movement of the 1960s, oppression of the Black Panther Party, long-term discrimination against homosexuals, and ongoing discrimination based on race, gender, religion and ethnicity, to name just a few instances where government intrusion has been facilitated by its data gathering capabilities.

If anyone is being “not fair-minded” in the debate over encryption, it’s FBI Director Comey. The pages of history, past, recent and contemporary, bleed from the gathering and use of data by government at different levels.

Making a huge leap of faith and saying the current government is in good faith, in no way guarantees that a future government will be in good faith.

Strong encryption won’t save us from a corrupt government, but it may give us a fighting chance of a better government.

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

May 21st, 2015

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

From the post:

The ancient skill of creating and performing spoken rhyme is thriving today because of the inexorable rise in the popularity of rapping. This art form is distinct from ordinary spoken poetry because it is performed to a beat, often with background music.

And the performers have excelled. Adam Bradley, a professor of English at the University of Colorado has described it in glowing terms. Rapping, he says, crafts “intricate structures of sound and rhyme, creating some of the most scrupulously formal poetry composed today.”

The highly structured nature of rap makes it particularly amenable to computer analysis. And that raises an interesting question: if computers can analyze rap lyrics, can they also generate them?

Today, we get an affirmative answer thanks to the work of Eric Malmi at the University of Aalto in Finland and few pals. These guys have trained a machine-learning algorithm to recognize the salient features of a few lines of rap and then choose another line that rhymes in the same way on the same topic. The result is an algorithm that produces rap lyrics that rival human-generated ones for their complexity of rhyme.

The review is a fun read but I rather like the original paper title as well: DopeLearning: A Computational Approach to Rap Lyrics Generation by Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, Aristides Gionis.

Abstract:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow. We present a method for capturing both of these aspects. Our approach is based on two machine-learning techniques: the RankSVM algorithm, and a deep neural network model with a novel structure. For the problem of distinguishing the real next line from a randomly selected one, we achieve an 82 % accuracy. We employ the resulting prediction method for creating new rap lyrics by combining lines from existing songs. In terms of quantitative rhyme density, the produced lyrics outperform best human rappers by 21 %. The results highlight the benefit of our rhyme density metric and our innovative predictor of next lines.

You should also visit BattleBot (a rap engine):

BattleBot is a rap engine which allows you to “spit” any line that comes to your mind after which it will respond to you with a selection of rhyming lines found among 0.5 million lines from existing rap songs. The engine is based on a slightly improved version of the Raplyzer algorithm and the eSpeak speech synthesizer.

You can try out BattleBot simply by hitting “Spit” or “Random”. The latter will randomly pick a line among the whole database of lines and find the rhyming lines for that. The underlined part shows approximately the rhyming part of a result. To understand better, why it’s considered as a rhyme, you can click on the result, see the phonetic transcriptions of your line and the result, and look for matching vowel sequences starting from the end.
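As a crude sketch of that “matching vowel sequences starting from the end” idea (a rough stand-in for the paper’s phonetic rhyme-density metric, not the Raplyzer code), consider:

  VOWELS = "aeiouy"

  def vowel_seq(line):
      """Crude stand-in for a phonetic transcription: keep only the vowels."""
      return [c for c in line.lower() if c in VOWELS]

  def rhyme_length(a, b):
      """Length of the longest matching vowel sequence at the ends of two lines."""
      va, vb = vowel_seq(a), vowel_seq(b)
      n = 0
      while n < min(len(va), len(vb)) and va[-1 - n] == vb[-1 - n]:
          n += 1
      return n

  print(rhyme_length("keep it real", "seal the deal"))   # 2 matching end vowels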

BTW, the MIT review concludes with:

What’s more, this and other raps generated by DeepBeat have a rhyming density significantly higher than any human rapper. “DeepBeat outperforms the top human rappers by 21% in terms of length and frequency of the rhymes in the produced lyrics,” they point out.

I can’t help but wonder when DeepBeat is going to hit the charts! ;-)

Top 10 data mining algorithms in plain English

May 21st, 2015

Top 10 data mining algorithms in plain English by Raymond Li.

From the post:

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Raymond covers:

  1. C4.5
  2. k-means
  3. Support vector machines
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART
  11. Interesting Resources
  12. Now it’s your turn…

Would be nice if we all had a similar ability to explain algorithms!
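If you want to try one of them right away, here is a minimal scikit-learn sketch of item 2, k-means, on synthetic data (the data and parameters are illustrative only):

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs

  # Synthetic data: 300 points scattered around 3 centres.
  X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

  model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
  print(model.cluster_centers_)   # the 3 learned centroids
  print(model.labels_[:10])       # cluster assignments for the first 10 points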

Enjoy!

Twitter As Investment Tool

May 21st, 2015

Social Media, Financial Algorithms and the Hack Crash by Tero Karppi and Kate Crawford.

Abstract:

@AP: Breaking: Two Explosions in the White House and Barack Obama is injured’. So read a tweet sent from a hacked Associated Press Twitter account @AP, which affected financial markets, wiping out $136.5 billion of the Standard & Poor’s 500 Index’s value. While the speed of the Associated Press hack crash event and the proprietary nature of the algorithms involved make it difficult to make causal claims about the relationship between social media and trading algorithms, we argue that it helps us to critically examine the volatile connections between social media, financial markets, and third parties offering human and algorithmic analysis. By analyzing the commentaries of this event, we highlight two particular currents: one formed by computational processes that mine and analyze Twitter data, and the other being financial algorithms that make automated trades and steer the stock market. We build on sociology of finance together with media theory and focus on the work of Christian Marazzi, Gabriel Tarde and Tony Sampson to analyze the relationship between social media and financial markets. We argue that Twitter and social media are becoming more powerful forces, not just because they connect people or generate new modes of participation, but because they are connecting human communicative spaces to automated computational spaces in ways that are affectively contagious and highly volatile.

The social sciences lag behind the computer sciences in making their publications publicly accessible, publishing behind firewalls instead, so all I can report on is the abstract.

On the other hand, I’m not sure how much practical advice you could gain from the article as opposed to the volumes of commentary following the incident itself.

The research reminds me of Malcolm Gladwell, author of The Tipping Point and similar works.

While I have greatly enjoyed several of Gladwell’s books, including the Tipping Point, it is one thing to look back and say: “Look, there was a tipping point.” It is quite another to be in the present and successfully say: “Look, there is a tipping point and we can make it tip this way or that.”

In retrospect, we all credit ourselves with near omniscience when our plans succeed and we invent fanciful explanations about what we knew or realized at the time. Others, equally skilled, dedicated and competent, who started at the same time, did not succeed. Of course, the conservative media (and ourselves if we are honest), invent narratives to explain those outcomes as well.

Of course, deliberate manipulation of the market with false information, via Twitter or not, is illegal. The best you can do is look for a pattern of news and/or tweets that result in downward changes in a particular stock, which then recovers and then apply that pattern more broadly. You won’t make $millions off of any one transaction but that is the sort of thing that draws regulatory attention.

LogJam – Postel’s Law In Action

May 21st, 2015

The seriousness of the LogJam vulnerability was highlighted by John Leyden in Average enterprise ‘using 71 services vulnerable to LogJam’


Based on analysis of 10,000 cloud applications and data from more than 17 million global cloud users, cloud visibility firm Skyhigh Networks reckons that 575 cloud services are potentially vulnerable to man-in-the-middle attacks. The average company uses 71 potentially vulnerable cloud services.

[Details from Skyhigh Networks]

The historical source of LogJam?


James Maude, security engineer at Avecto, said that the LogJam flaw shows how internet regulations and architecture decisions made more than 20 years ago are continuing to throw up problems.

“The LogJam issue highlights how far back the long tail of security stretches,” Maude commented. “As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies. This effectively undermines the security model you are trying to build. Several recent vulnerabilities such as POODLE and FREAK have harnessed this type of weakness, tricking clients into using old, less secure forms of encryption,” he added.

Graham Cluley in Logjam vulnerability – what you need to know has better coverage of the history of weak encryption that resulted in the LogJam vulnerability.

What does that have to do with Postel’s Law?

TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others. [RFC761]

As James Maude noted earlier:

As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies.

Probably not what Postel intended at the time, but it is certainly more “robust” in one sense of the word: technologies remain compatible with other technologies that still rely on vulnerable encryption.

In other words, robustness is responsible for the maintenance of weak encryption and hence the current danger from LogJam.

This isn’t an entirely new idea. Eric Allman (Sendmail), warns of security issues with Postel’s Law in The Robustness Principle Reconsidered: Seeking a middle ground:

In 1981, Jon Postel formulated the Robustness Principle, also known as Postel’s Law, as a fundamental implementation guideline for the then-new TCP. The intent of the Robustness Principle was to maximize interoperability between network service implementations, particularly in the face of ambiguous or incomplete specifications. If every implementation of some service that generates some piece of protocol did so using the most conservative interpretation of the specification and every implementation that accepted that piece of protocol interpreted it using the most generous interpretation, then the chance that the two services would be able to talk with each other would be maximized. Experience with the Arpanet had shown that getting independently developed implementations to interoperate was difficult, and since the Internet was expected to be much larger than the Arpanet, the old ad-hoc methods needed to be enhanced.

Although the Robustness Principle was specifically described for implementations of TCP, it was quickly accepted as a good proposition for implementing network protocols in general. Some have applied it to the design of APIs and even programming language design. It’s simple, easy to understand, and intuitively obvious. But is it correct?

For many years the Robustness Principle was accepted dogma, failing more when it was ignored rather than when practiced. In recent years, however, that principle has been challenged. This isn’t because implementers have gotten more stupid, but rather because the world has become more hostile. Two general problem areas are impacted by the Robustness Principle: orderly interoperability and security.

Eric doesn’t come to a definitive conclusion with regard to Postel’s Law but the general case is always difficult to decide.

However, the specific case, supporting encryption known to be vulnerable, shouldn’t be.

If there were a programming principles liability checklist, one of the tick boxes should read:

___ Supports (list of encryption schemes), Date:_________

Lawyers doing discovery can compare lists of known vulnerabilities as of the date given for liability purposes.

Programmers would be on notice that supporting encryption with known vulnerabilities is opening the door to legal liability.
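As a hedged sketch of the kind of check such a list implies, Python’s ssl module can at least report which protocol version and cipher suite a given server actually negotiates (the host name below is a placeholder):

  import socket
  import ssl

  def report_tls(host, port=443):
      """Connect and report the negotiated protocol version and cipher suite."""
      ctx = ssl.create_default_context()
      with socket.create_connection((host, port), timeout=10) as sock:
          with ctx.wrap_socket(sock, server_hostname=host) as tls:
              name, version, bits = tls.cipher()
              print(f"{host}: {tls.version()} / {name} ({bits} bits)")

  report_tls("example.com")   # flag anything export-grade or using small keys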

Format String Bug Exploration

May 20th, 2015

Format String Bug Exploration by AJ Kumar.

From the post:

Abstract

The format string vulnerability came to prominence in the year 2000 when remote hackers gained root access on a host running an FTP daemon with an anonymous authentication mechanism. This was an entirely new tactic for exploiting common programming glitches in software, and this deadly threat is now everywhere because programmers inadvertently leave coding loopholes that are targeted by none other than the format string attack. The format string vulnerability is an implication of misinterpreting the stack when handling functions with variable arguments, especially the printf function. This article demonstrates this subtle bug in a C programming context on the Windows operating system. Although this class of bug is not operating system specific, as with buffer overflow attacks, you can detect vulnerable programs for Mac OS, Linux, and BSD. This article is drafted to delve deeper into what format strings are, how they operate relative to the stack, as well as how they are manipulated in the context of the C programming language.

Essentials

To be cognizant of the format string bug explained in this article, you will require rudimentary knowledge of the C family of programming languages, as well as a basic knowledge of IA32 assembly on the Windows operating system, by means of the Visual Studio development editor. Moreover, know-how about ‘buffer overflow’ exploitation will definitely add an advantage.

Format String Bug

The format string bug was first explained in June 2000 in a renowned journal. This notorious exploitation tactic enables a hacker to subvert memory stack protections and alter arbitrary memory segments by unsolicited writes. Overall, the sole cause is failing to handle or properly validate user-supplied input. Blindly trusting user-supplied arguments eventually leads to disaster. When a hacker controls the arguments of the printf function, the details in the variable argument list enable him to read or overwrite arbitrary data. The format string bug is unlike a buffer overrun: no memory stack is damaged, nor is data corrupted to a large extent. Hackers often execute this attack to disclose or retrieve sensitive information from the stack, for instance pass keys, cryptographic private keys, etc.

Now the curiosity here is how exactly hackers perform this deadly attack. Consider a program where we are trying to print the string “kmaraj” to the screen by employing the simple C library printf method;

A bit deeper than most of my posts on bugs, but the lesson isn’t just the bug; it’s that the bug has persisted for going on fifteen (15) years now.
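The %n-style printf attack is specific to C, but the root cause, treating user-supplied input as a format string, has a close Python analogue; this is a hedged illustration, not code from the article:

  class Config:
      def __init__(self):
          self.secret_key = "hunter2"   # pretend this must stay private

  def render_greeting(template, cfg):
      # BUG: the caller-supplied template is used directly as a format string.
      return template.format(cfg=cfg)

  cfg = Config()
  print(render_greeting("Hello, welcome back!", cfg))   # benign
  print(render_greeting("{cfg.secret_key}", cfg))       # attacker-controlled: leaks the secret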

As a matter of fact, Karl Chen and David Wagner in Large-Scale Analysis of Format String Vulnerabilities in Debian Linux (2007) found:


We successfully analyze 66% of C/C++ source packages in the Debian 3.1 Linux distribution. Our system finds 1,533 format string taint warnings. We estimate that 85% of these are true positives, i.e., real bugs; ignoring duplicates from libraries, about 75% are real bugs.

We suggest that the technology exists to render format string vulnerabilities extinct in the near future. (emphasis added)

“…[N]ear future?” Maybe not because Mathias Payer and Thomas R. Gross report in 2013, String Oriented Programming: When ASLR is not Enough:

One different class of bugs has not yet received adequate attention in the context of DEP, stack canaries, and ASLR: format string vulnerabilities. If an attacker controls the first parameter to a function of the printf family, the string is parsed as a format string. Using such a bug and special format markers result in arbitrary memory writes. Existing exploits use format string vulnerabilities to mount stack or heap-based code injection attacks or to set up return oriented programming. Format string vulnerabilities are not a vulnerability of the past but still pose a significant threat (e.g., CVE-2012-0809 reports a format string bug in sudo and allows local privilege escalation; CVE-2012-1152 reports multiple format string bugs in perl-YAML and allows remote exploitation, CVE-2012-2369 reports a format string bug in pidgin-otr and allows remote exploitation) and usually result in full code execution for the attacker.

Should I assume that in the computer literature six (6) years doesn’t qualify as the “near future”?

Would liability for string format bugs result in greater effort to avoid the same?

Hard to say in the abstract but the results could hardly be worse than fifteen (15) years of format string bugs.

Not to mention that liability would put the burden of avoiding the bug squarely on the shoulders of those best able to avoid it.

Math for Journalists Made Easy:…

May 20th, 2015

Math for Journalists Made Easy: Understanding and Using Numbers and Statistics – Sign up now for new MOOC

From the post:

Journalists who squirm at the thought of data calculation, analysis and statistics can arm themselves with new reporting tools during the new Massive Open Online Course (MOOC) from the Knight Center for Journalism in the Americas: “Math for Journalists Made Easy: Understanding and Using Numbers and Statistics” will be taught from June 1 to 28, 2015.

Click here to sign up and to learn more about this free online course.

“Math is crucial to things we do every day. From covering budgets to covering crime, we need to understand numbers and statistics,” said course instructor Jennifer LaFleur, senior editor for data journalism for the Center for Investigative Reporting, one of the instructors of the MOOC.

Two other instructors will be teaching this MOOC: Brant Houston, a veteran investigative journalist who is a professor and the Knight Chair in Journalism at the University of Illinois; and freelance journalist Greg Ferenstein, who specializes in the use of numbers and statistics in news stories.

The three instructors will teach journalists “how to be critical about numbers, statistics and research and to avoid being improperly swayed by biased researchers.” The course will also prepare journalists to relay numbers and statistics in ways that are easy for the average reader to understand.

“It is true that many of us became journalists because sometime in our lives we wanted to escape from mathematics, but it is also true that it has never been so important for journalists to overcome any fear or intimidation to learn about numbers and statistics,” said professor Rosental Alves, founder and director of the Knight Center. “There is no way to escape from math anymore, as we are nowadays surrounded by data and we need at least some basic knowledge and tools to understand the numbers.”

The MOOC will be taught over a period of four weeks, from June 1 to 28. Each week focuses on a particular topic taught by a different instructor. The lessons feature video lectures and are accompanied by readings, quizzes and discussion forums.

This looks excellent.

I will be looking forward to very tough questions of government and corporate statistical reports from anyone who takes this course.
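For a taste of the number sense on offer, here is a tiny, hedged example of distinguishing percent change from percentage-point change:

  def percent_change(old, new):
      """Relative change, in percent."""
      return (new - old) / old * 100

  # A budget rising from $2.0M to $2.5M is a 25% increase:
  print(percent_change(2_000_000, 2_500_000))   # 25.0

  # A poll moving from 40% to 45% is 5 percentage points, but a 12.5% change:
  print(percent_change(40, 45))                 # 12.5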

A Call for Collaboration: Data Mining in Cross-Border Investigations

May 20th, 2015

A Call for Collaboration: Data Mining in Cross-Border Investigations by Jonathan Stray and Drew Sullivan.

From the post:

Over the past few years we have seen the huge potential of data and document mining in investigative journalism. Tech savvy networks of journalists such as the Organized Crime and Corruption Reporting Project (OCCRP) and the International Consortium of Investigative Journalists (ICIJ) have teamed together for astounding cross-border investigations, such as OCCRP’s work on money laundering or ICIJ’s offshore leak projects. OCCRP has even incubated its own tools, such as VIS, Investigative Dashboard and Overview.

But we need to do better. There is enormous duplication and missed opportunity in investigative journalism software. Many small grants for technology development have led to many new tools, but very few have become widely used. For example, there are now over 70 tools just for social network analysis. There are other tools for other types of analysis, document handling, data cleaning, and on and on. Most of these are open source, and in various states of completeness, usability, and adoption. Developer teams lack critical capacities such as usability testing, agile processes, and business development for sustainability. Many of these tools are beautiful solutions in search of a problem.

The fragmentation of software development for investigative journalism has consequences: Most newsrooms still lack capacity for very basic knowledge management tasks, such as digitally filing new documents where they can be searched and found later. Tools do not work or do not inter-operate. Ultimately the reporting work is slower, or more expensive, or doesn’t get done. Meanwhile, the commercial software world has so far ignored investigative journalism because it is a small, specialized user-base. Tools like Nuix and Palantir are expensive, not networked, and not extensible for the inevitable story-specific needs.

But investigative journalists have learned how to work in cross-border networks, and investigative journalism developers can too. The experience gained from collaborative data-driven journalism has led OCCRP and other interested organizations to focus on the following issues:

The issues:

  • Usability
  • Delivery
  • Networked Investigation
  • Sustainability
  • Interoperability and extensibility

The next step is reported to be:

The next step for us is a small meeting: the very first conference on Knowledge Management in Investigative Journalism. This event will bring together key developers and journalists to refine the problem definition and plan a way forward. OCCRP and the Influence Mappers project have already pledged support. Stay tuned…

Jonathan Stray (jonathanstray@gmail.com) and Drew Sullivan (drew@occrp.org) want to know if you are interested too.

See the original post and email Jonathan and Drew if you are interested. It sounds like a very good idea to me.

PS: You already know one of the technologies that I think is important for knowledge management: topic maps!
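
On the “very basic knowledge management tasks” point (digitally filing new documents where they can be searched and found later), even the Python standard library gets a newsroom part of the way. A minimal sketch, assuming your Python’s sqlite3 build includes the FTS4 full-text extension (most standard builds do); the table name and sample documents are invented for illustration:

  # Toy "digital filing cabinet": full-text indexing with SQLite FTS4.
  # Table name and sample documents are invented for illustration.
  import sqlite3

  conn = sqlite3.connect("newsroom.db")
  conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts4(title, body)")

  conn.executemany(
      "INSERT INTO docs (title, body) VALUES (?, ?)",
      [
          ("Offshore accounts memo", "Shell companies routed funds through three banks..."),
          ("Procurement records 2014", "Contract awarded without tender to a relative of..."),
      ],
  )
  conn.commit()

  # Keyword search over everything filed so far.
  for (title,) in conn.execute("SELECT title FROM docs WHERE docs MATCH ?", ("offshore OR tender",)):
      print(title)

That is no substitute for a shared, networked system (or a topic map), but it shows the filing-and-finding problem itself is not exotic.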

H2O 3.0

May 20th, 2015

H2O 3.0

From the webpage:

Why H2O?

H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine.

Get H2O!

What is H2O?

H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

  • Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache™ Hadoop® and Spark™ to give customers the flexibility to solve their most challenging data problems.
  • Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
  • Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
  • Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
  • Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Note the caveat near the bottom of the page:

With H2O, you can:

  • Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
  • Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

The operative word being “can.” Your results with H2O depend upon your knowledge of machine learning, knowledge of your data and the effort you put into using H2O, among other things.
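
To give you a feel for what that effort looks like in practice, here is a minimal sketch using H2O’s Python bindings. The CSV file, column names and the choice of a gradient boosting model are my own invented example, not anything taken from the H2O documentation:

  # Minimal H2O sketch (Python bindings, `pip install h2o`).
  # "customers.csv" and its "churn" column are invented for illustration.
  import h2o
  from h2o.estimators.gbm import H2OGradientBoostingEstimator

  h2o.init()  # start (or connect to) a local H2O cluster

  data = h2o.import_file("customers.csv")   # parsed into a distributed H2OFrame
  data["churn"] = data["churn"].asfactor()  # treat the target as categorical

  train, test = data.split_frame(ratios=[0.8], seed=42)

  features = [c for c in data.columns if c != "churn"]
  model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
  model.train(x=features, y="churn", training_frame=train)

  print(model.model_performance(test_data=test))  # AUC, confusion matrix, etc.
  predictions = model.predict(test)               # per-row class and probabilities

Whether the resulting model is worth anything still turns on the points above: how well you know your data, which algorithm and parameters you choose, and how carefully you validate.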

Bin Laden’s Bookshelf

May 20th, 2015

Bin Laden’s Bookshelf

From the webpage:

On May 20, 2015, the ODNI released a sizeable tranche of documents recovered during the raid on the compound used to hide Usama bin Ladin. The release, which followed a rigorous interagency review, aligns with the President’s call for increased transparency–consistent with national security prerogatives–and the 2014 Intelligence Authorization Act, which required the ODNI to conduct a review of the documents for release.

The release contains two sections. The first is a list of non-classified, English-language material found in and around the compound. The second is a selection of now-declassified documents.

The Intelligence Community will be reviewing hundreds more documents in the near future for possible declassification and release. An interagency taskforce under the auspices of the White House and with the agreement of the DNI is reviewing all documents which supported disseminated intelligence cables, as well as other relevant material found around the compound. All documents whose publication will not hurt ongoing operations against al-Qa‘ida or their affiliates will be released.

From the website:

[Image: bin-laden-bookcase]

The one expected work missing from Bin Laden’s library?

The Anarchist Cookbook!

Possession of the same books as Bin Laden will be taken as a sign of terrorist sympathies. Weed your collection responsibly.

Political Futures Tracker

May 20th, 2015

Political Futures Tracker.

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE), an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in different senses by the general population and by political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case someone tries to sell you on the notion that what political candidates say today has some relevance to what they will do after election. President Obama has been in office for six (6) years, the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, and banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.
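
If you want to experiment with the kind of tagging Nesta describes (flagging tweets as sentiment-bearing and/or “future thinking”), here is a toy sketch. It uses a hand-rolled keyword lexicon rather than the University of Sheffield’s GATE pipeline, and the lexicons and example tweets are invented, so treat it as an illustration of the idea rather than their method:

  # Toy classifier: tag a tweet with coarse sentiment and a "future thinking" flag.
  # Lexicons, patterns and example tweets are invented for illustration.
  import re

  POSITIVE = {"secure", "growth", "better", "invest", "protect"}
  NEGATIVE = {"cuts", "crisis", "failed", "risk", "waste"}
  FUTURE_MARKERS = [r"\bwill\b", r"\bby 20\d\d\b", r"\bnext parliament\b", r"\bgoing to\b", r"\bpledge\b"]

  def tag_tweet(text):
      words = set(re.findall(r"[a-z']+", text.lower()))
      score = len(words & POSITIVE) - len(words & NEGATIVE)
      sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
      future = any(re.search(pattern, text.lower()) for pattern in FUTURE_MARKERS)
      return {"sentiment": sentiment, "future_thinking": future}

  for tweet in [
      "We will invest in the NHS and secure growth by 2020.",
      "Years of cuts have failed working families.",
  ]:
      print(tag_tweet(tweet), "<-", tweet)

Aggregating tags like these by party and by week is simple bookkeeping; the hard part, which the Sheffield NLP work is aimed at, is doing the tagging accurately and at scale.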