Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 11, 2015

Spy vs. Spy

Filed under: Cybersecurity,Security — Patrick Durusau @ 12:25 pm

[Image: Spy vs. Spy]

This image came to mind while reading: Cebit 2015: Find out what your apps are really doing.

The story reads in part:

These tiny programs on Internet-connected mobile phones are increasingly becoming entryways for surveillance and fraud. Computer scientists from the Center for IT-Security, Privacy and Accountability (CISPA) have developed a program that can show users whether the apps on their smartphone are accessing private information, and what they do with that data. This year, the researchers will present an improved version of their system again at the CeBIT computer fair in Hanover (Hall 9, Booth E13).

RiskIQ, an IT security-software company, recently examined 350,000 apps that offer monetary transactions, and found more than 40,000 of these specialized programs to be little more than scams. Employees had downloaded the apps from around 90 recognized app store websites worldwide, and analyzed them. They discovered that a total of eleven percent of these apps contained malicious executable functions – they could read personal messages or remove password protections. And all this would typically take place unnoticed by the user.

Computer scientists from Saarbrücken have now developed a software system that allows users to detect malicious apps at an early stage. This is achieved by scanning the program code, with an emphasis on those parts where the respective app is accessing or transmitting personal information. The monitoring software will detect whether a data request is related to the subsequent transmission of data, and will flag the code sequence in question as suspicious accordingly. “Imagine your address book is read out, and hundreds of lines of code later, without you noticing, your phone will send your contacts to an unknown website,” Erik Derr says. Derr is a PhD student at the Graduate School for Computer Science at Saarland University, and a researcher at the Saarbrücken Research Center for IT Security, CISPA. An important feature of the software he developed is its ability to monitor precisely which websites an app is accessing, or which phone number a text message was sent to.

With 11% of apps being malicious, you need to spy on the spies who created the apps. Fortunately, Erik Derr and his group are working on precisely that solution.
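
If the “data request followed by transmission” idea in the quote is hard to picture, here is a toy Python sketch of source-to-sink flagging. It is not Bati (which is context-, flow-, object- and path-sensitive and works on real app code); the call names and trace format below are invented for illustration.

PRIVATE_SOURCES = {"read_contacts", "read_sms", "get_device_id"}  # invented call names
NETWORK_SINKS = {"http_post", "send_sms", "open_url"}             # invented call names
def flag_suspicious(trace):
    """trace: ordered list of (call_name, argument) tuples."""
    pending_sources = []
    findings = []
    for index, (call, arg) in enumerate(trace):
        if call in PRIVATE_SOURCES:
            pending_sources.append((index, call))
        elif call in NETWORK_SINKS and pending_sources:
            for src_index, src_call in pending_sources:
                findings.append("step %d: %s flows to %s(%s) at step %d"
                                % (src_index, src_call, call, arg, index))
    return findings
trace = [("read_contacts", None),
         ("format_payload", "contacts"),
         ("http_post", "http://unknown.example.com/upload")]
print("\n".join(flag_suspicious(trace)))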

The CeBIT exhibition listing doesn’t offer much detail but does have this graphic:

[Image: CeBIT exhibition graphic]

The most detailed account I could find is reported in: Taking Android App Vetting to the Next Level with Path-sensitive Value Analysis by Michael Backes, Sven Bugiel, Erik Derr and Christian Hammer.

Abstract:

Application vetting at app stores and market places is the first line of defense to protect mobile end-users from malware, spyware, and immoderately curious apps. However, the lack of a highly precise yet large-scaling static analysis has forced market operators to resort to less reliable and only small-scaling dynamic or even manual analysis techniques.

In this paper, we present Bati, an analysis framework specifically tailored to perform highly precise static analysis of Android apps. Building on established static analysis frameworks for Java, we solve two important challenges to reach this goal: First, we extend this ground work with an Android application lifecycle model that includes the asynchronous communication of multi-threading. Second, we introduce a novel value analysis algorithm that builds on control-flow ordered backwards slicing and techniques from partial and symbolic evaluation. As a result, Bati is the first context-, flow-, object-, and path-sensitive analysis framework for Android apps and improves the status-quo for static analysis on Android. In particular, we empirically demonstrate the benefits of Bati in dissecting Android malware by statically detecting behavior that previously required manual reverse engineering. Noticeably, in contrast to the common conjecture about path-sensitive analyses, our evaluation of 19,700 apps from Google Play shows that highly precise path-sensitive value analysis of Android apps is possible in a reasonable amount of time and is hence amenable for large-scale vetting processes.

One measure of security testing software is whether it confirms the findings of others and makes new findings not previously reported. Both have been reported for Bati, but there is one malware case in particular that will be of interest.

Bati confirmed the malware reported by Symantec as Android.Adrd, whose write-up mentions the malware driving page rank:

It also receives search parameters from the above URLs. The Trojan then uses the obtained parameters to silently issue multiple HTTP search requests to the following location:

wap.baidu.com/s?word=[ENCODED SEARCH STRING]&vit=uni&from=[ID]

The purpose of these search requests is to increase site rankings for a website.

In addition to the baidu.com destination, other destinations were detected, including google.cn.
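
To make the quoted request format concrete, here is how such a URL would be assembled in Python. Illustration only: the parameter names come from Symantec’s write-up, while the scheme, search string and ID are placeholders.

from urllib.parse import urlencode
def build_search_url(search_string, affiliate_id):
    # Parameter names taken from the write-up above; values here are made up.
    query = urlencode({"word": search_string, "vit": "uni", "from": affiliate_id})
    return "http://wap.baidu.com/s?" + query
print(build_search_url("example search term", "12345"))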

The only Symantec report that mentions google.cn, Backdoor.Ripinip is from March 15, 2012 and mentions it as a source of links for redirection, not building page rank.

Spying on whoever is spying on you (and others) is only going to increase in importance.

Lawyers watching spies watching spies?

March 10, 2015

MIT Group Cites “Data Prep” as a Data Science Bottleneck

Filed under: Data Science,ETL,Topic Maps — Patrick Durusau @ 7:38 pm

MIT Group Cites “Data Prep” as a Data Science Bottleneck

The bottleneck is varying data semantics, no stranger to anyone interested in topic maps. The traditional means of solving that problem is to clean the data for one purpose, which, unless the basis for cleaning is recorded, leaves the data dirty for the next round of integration.

What do you think is being described in this text?

Much of Veeramachaneni’s recent research has focused on how to automate this lengthy data prep process. “Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say ‘Great, but we’re still stuck with the problem of getting the raw data to a place where we can use all these tools,’” Veeramachaneni says.

The proliferation of data sources and the time it takes to prepare these massive reserves of data are the core problems Tamr is attacking. The knee-jerk reaction to this next-gen integration and preparation problem tends to be “Machine Learning” — a cure for all ills. But as Veeramachaneni points out, machine learning can’t resolve all data inconsistencies:

Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, “so it won’t take up too much of their time,” he says. “Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?”

Tamr has built the very sort of collaborative framework Veeramachaneni mentions, drawing from the best of machine and human learning to connect hundreds or thousands of data sources.

Top-down, deterministic data unification approaches (such as ETL, ELT and MDM) were not designed to scale to the variety of hundreds or thousands or even tens of thousands of data silos (perpetual and proliferating). Traditional deterministic systems depend on a highly trained architect developing a “master” schema — “the one schema to rule them all” — which we believe is a red herring. Embracing the fundamental diversity and ever-changing nature of enterprise data and semantics leads you towards a bottom-up, probabilistic approach to connecting data sources from various enterprise silos.

You also have to engage the source owners collaboratively to curate the variety of data at scale, which is Tamr’s core design pattern. Advanced algorithms automatically connect the vast majority of the sources while resolving duplications, errors and inconsistencies among source data of sources, attributes and records — a bottom-up, probabilistic solution that is reminiscent of Google’s full-scale approach to web search and connection. When the Tamr system can’t resolve connections automatically, it calls for human expert guidance, using people in the organization familiar with the data to weigh in on the mapping and improve its quality and integrity.

Offhand, I would say it is a topic map authoring solution that features algorithms to assist the authors, where the authoring has been crowd-sourced.

What I don’t know is whether the insight of experts is captured as dark data (A matches B) or if their identifications are preserved so they can be re-used in the future (The properties of A that result in a match with the properties of B).
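
To make that distinction concrete, here is a minimal sketch (not Tamr’s algorithm) of a probabilistic match that preserves its basis, the per-property scores, rather than recording only that A matches B, and that routes weak evidence to a domain expert:

from difflib import SequenceMatcher
def property_scores(record_a, record_b):
    """Per-property similarity for the properties the two records share."""
    return {key: SequenceMatcher(None, str(record_a[key]), str(record_b[key])).ratio()
            for key in set(record_a) & set(record_b)}
def match(record_a, record_b, threshold=0.85):
    scores = property_scores(record_a, record_b)
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    if overall >= threshold:
        return {"match": True, "basis": scores}      # the identification and its basis
    return {"match": None, "basis": scores, "route_to": "domain expert"}
a = {"name": "Acme Corp.", "city": "New York", "phone": "212-555-0100"}
b = {"name": "ACME Corporation", "city": "New York", "phone": "(212) 555-0100"}
print(match(a, b))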

I didn’t register, so I can’t see the “white paper.” Let me know how close I came if you decide to get the “white paper.” Scientists are donating research data in the name of open science but startups are still farming registration data.

PredictionIO [ML Too Easy? Too Fast?]

Filed under: Machine Learning,Predictive Analytics — Patrick Durusau @ 7:17 pm

PredictionIO

From the what is page:

PredictionIO is an open-source Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

PredictionIO template gallery offers a wide range of predictive engine templates for download, developers can customize them easily. The DASE architecture of engine is the “MVC for Machine Learning”. It enables developers to build predictive engine components with separation-of-concerns. Data scientists can also swap and evaluate algorithms as they wish. The core part of PredictionIO is an engine deployment platform built on top of Apache Spark. Predictive engines are deployed as distributed web services. In addition, there is an Event Server. It is a scalable data collection and analytics layer built on top of Apache HBase.

PredictionIO eliminates the friction between software development, data science and production deployment. It takes care of the data infrastructure routine so that your data science team can focus on what matters most.

The most attractive feature of PredictionIO is the ability to configure and test multiple engines with less overhead.
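
For a feel of the workflow, here is a rough sketch of the round trip using PredictionIO’s Python SDK. Treat the class and method names as assumptions to check against the current SDK docs; the access key, ports, event fields and query are placeholders.

import predictionio
# Send a user->item event to the Event Server (default port 7070).
event_client = predictionio.EventClient(access_key="YOUR_ACCESS_KEY",
                                        url="http://localhost:7070")
event_client.create_event(event="view",
                          entity_type="user", entity_id="u1",
                          target_entity_type="item", target_entity_id="i42")
# Query a deployed engine (default port 8000) for recommendations.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "u1", "num": 4}))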

At the same time, I am not altogether sure that “…accelerat[ing] scalable machine learning infrastructure management” is necessarily a good idea.

You may want to remember that the current state of cyberinsecurity, where all programs are suspect and security software may add more bugs than it cures, is a result, in part, of shipping code because “it works,” and not because it is free (or relatively so) of security issues.

I am really not looking forward to machine learning having the kind of uncertainty we now have with cyberinsecurity.

That isn’t a reflection on PredictionIO but the thought occurred to me because of the emphasis on accelerated use of machine learning.

Apache Tajo brings data warehousing to Hadoop

Filed under: Apache Tajo,BigData,Hadoop — Patrick Durusau @ 6:47 pm

Apache Tajo brings data warehousing to Hadoop by Joab Jackson.

From the post:

Organizations that want to extract more intelligence from their Hadoop deployments might find help from the relatively little known Tajo open source data warehouse software, which the Apache Software Foundation has pronounced as ready for commercial use.

The new version of Tajo, Apache software for running a data warehouse over Hadoop data sets, has been updated to provide greater connectivity to Java programs and third-party databases such as Oracle and PostgreSQL.

While less well-known than other Apache big data projects such as Spark or Hive, Tajo could be a good fit for organizations outgrowing their commercial data warehouses. It could also be a good fit for companies wishing to analyze large sets of data stored on Hadoop data processing platforms using familiar commercial business intelligence tools instead of Hadoop’s MapReduce framework.

Tajo performs the necessary ETL (extract-transform-load process) operations to summarize large data sets stored on an HDFS (Hadoop Distributed File System). Users and external programs can then query the data through SQL.

The latest version of the software, issued Monday, comes with a newly improved JDBC (Java Database Connectivity) driver that its project managers say makes Tajo as easy to use as a standard relational database management system. The driver has been tested against a variety of commercial business intelligence software packages and other SQL-based tools. (Just so you know, I took out the click following stuff and inserted the link to the Tajo project page only.)
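
Since the JDBC driver is the headline feature, here is a hedged sketch of querying Tajo from Python through that driver via jaydebeapi. The driver class name, JDBC URL, port, jar location and table are assumptions to verify against the Tajo documentation, and jaydebeapi’s connect() signature has varied between releases.

import jaydebeapi
conn = jaydebeapi.connect(
    "org.apache.tajo.jdbc.TajoDriver",               # assumed driver class name
    "jdbc:tajo://localhost:26002/default",           # assumed JDBC URL and port
    jars="/opt/tajo/share/jdbc-dist/tajo-jdbc.jar")  # assumed jar location
cursor = conn.cursor()
cursor.execute("SELECT l_orderkey, count(*) FROM lineitem "
               "GROUP BY l_orderkey LIMIT 10")       # placeholder table and query
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()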

Being surprised by Apache Tajo, I looked at the list of top-level projects at Apache and, while I recognized a fair number of them by name, I could tell you the status of only those I actively follow. Hard to say what other jewels are hidden there.

Joab cites several large data consumers who have found Apache Tajo faster than Hive for their purposes. Certainly an option to keep in mind.

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

Filed under: BigData,Bioinformatics,Medical Informatics,Open Science — Patrick Durusau @ 6:23 pm

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

From the post:

A National Institutes of Health-led public-private partnership to transform and accelerate drug development achieved a significant milestone today with the launch of a new Alzheimer’s Big Data portal — including delivery of the first wave of data — for use by the research community. The new data sharing and analysis resource is part of the Accelerating Medicines Partnership (AMP), an unprecedented venture bringing together NIH, the U.S. Food and Drug Administration, industry and academic scientists from a variety of disciplines to translate knowledge faster and more successfully into new therapies.

The opening of the AMP-AD Knowledge Portal and release of the first wave of data will enable sharing and analyses of large and complex biomedical datasets. Researchers believe this approach will ramp up the development of predictive models of Alzheimer’s disease and enable the selection of novel targets that drive the changes in molecular networks leading to the clinical signs and symptoms of the disease.

“We are determined to reduce the cost and time it takes to discover viable therapeutic targets and bring new diagnostics and effective therapies to people with Alzheimer’s. That demands a new way of doing business,” said NIH Director Francis S. Collins, M.D., Ph.D. “The AD initiative of AMP is one way we can revolutionize Alzheimer’s research and drug development by applying the principles of open science to the use and analysis of large and complex human data sets.”

Developed by Sage Bionetworks, a Seattle-based non-profit organization promoting open science, the portal will house several waves of Big Data to be generated over the five years of the AMP-AD Target Discovery and Preclinical Validation Project by multidisciplinary academic groups. The academic teams, in collaboration with Sage Bionetworks data scientists and industry bioinformatics and drug discovery experts, will work collectively to apply cutting-edge analytical approaches to integrate molecular and clinical data from over 2,000 postmortem brain samples.

Big data and open science, now that sounds like a winning combination:

Because no publication embargo is imposed on the use of the data once they are posted to the AMP-AD Knowledge Portal, it increases the transparency, reproducibility and translatability of basic research discoveries, according to Suzana Petanceska, Ph.D., NIA’s program director leading the AMP-AD Target Discovery Project.

“The era of Big Data and open science can be a game-changer in our ability to choose therapeutic targets for Alzheimer’s that may lead to effective therapies tailored to diverse patients,” Petanceska said. “Simply stated, we can work more effectively together than separately.”

Imagine that, academics who aren’t hoarding data for recruitment purposes.

Works for me!

Does it work for you?

Help Anthem Do A Security Audit!

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:01 pm

US regulator says Anthem “refuses to cooperate” in security audit by John Zorabedian.

From the post:

Anthem “refused to cooperate” with US regulators attempting to conduct vulnerability scans and configuration tests on its IT systems.

The Inspector General of the US Office of Personnel Management (OPM) recently attempted to schedule a security audit of the health insurance giant.

This was in the wake of Anthem’s massive data breach that exposed sensitive data on nearly 80 million customers – and non-customers, it later turned out.

Because Anthem provides insurance coverage to federal employees, the OPM’s Office of the Inspector General (OIG) is entitled to request to audit the company, but the company is allowed to decline.

Anthem turned down the OIG’s request, citing corporate policy against allowing third parties to connect to its network.

Corporate policy was insufficient to keep out the hackers who stole 80 million records, just as passing new penalties for security breaches is insufficient to increase computer security.

I suspect corporate policy is an excuse to avoid admitting their security is managed by a part-time sysadmin who is moonlighting from their day job as an NSA programmer. 😉

It’s too bad the law is in such a state that hackers can’t volunteer to help Anthem with penetration testing, etc., and then tweet the issues with the hashtag #AnthemAudit.

When President Obama isn’t busy declaring sanctions on our next nation-victim, he talks a lot about cooperation to increase security. Hackers cooperating to help with penetration testing sounds like an example of that sort of cooperation. Does it to you?

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

Filed under: bigdata®,GPU,Graphs,MapGraph — Patrick Durusau @ 4:40 pm

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

From the post:

MapGraph is Massively Parallel Graph processing on GPUs. (Previously known as “MPGraph”).

  • The MapGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MapGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.
  • New algorithms can be implemented in a few hours that fully exploit the data-level parallelism of the GPU and offer throughput of up to 3 billion traversed edges per second on a single GPU.
  • Preliminary results for the multi-GPU version of MapGraph have traversal rates of nearly 30 GTEPS (30 billion traversed edges per second) on a scale-free random graph with 4.3 billion directed edges using a 64 GPU cluster. See the multi-GPU paper referenced below for details.
  • The MapGraph API also comes in a CPU-only version that is currently packaged and distributed with the bigdata open-source graph database. GAS programs operate over the graph data loaded into the database and are accessed via either a Java API or a SPARQL 1.1 Service Call. Packaging the GPU version inside bigdata will be in a future release.

MapGraph is under the Apache 2 license. You can download MapGraph from http://sourceforge.net/projects/mpgraph/. For the latest version of this documentation, see http://mapgraph.io. You can subscribe to receive notice for future updates on the project home page. For open source support, please ask a question on the MapGraph mailing lists or file a ticket. To inquire about commercial support, please email us at licenses@bigdata.com. You can follow MapGraph and the bigdata graph database platform at http://www.bigdata.com/blog.

This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029.

MapGraph Publications

You do have to wonder when the folks at Systap sleep. 😉 This is the same group that produced BlazeGraph, recently adopted by Wikidata. Granted, Wikidata only has 13.6 million data items as of today, but it isn’t “small” data.

The rest of the page has additional pointers and explanations for MapGraph.
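
If the Gather-Apply-Scatter model in the feature list above is unfamiliar, here is the idea sketched in plain Python for a breadth-first “level” computation. This is only a conceptual illustration of the programming model, not MapGraph’s CUDA API or its frontier compaction.

def gas_bfs(edges, source):
    neighbors = {}
    for u, v in edges:                       # build an adjacency list
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, [])
    level = {vertex: None for vertex in neighbors}
    level[source] = 0
    frontier = {source}
    while frontier:
        # Gather/Apply are trivial for BFS levels; Scatter pushes updates
        # along out-edges to build the next frontier.
        next_frontier = set()
        for u in frontier:
            for v in neighbors[u]:
                if level[v] is None:
                    level[v] = level[u] + 1
                    next_frontier.add(v)
        frontier = next_frontier
    return level
print(gas_bfs([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)], source=0))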

Enjoy!

NationBuilder (organizing, fund raising, canvassing, elections)

Filed under: Government,Politics — Patrick Durusau @ 4:09 pm

NationBuilder (organizing, fund raising, canvassing, elections)

With the elections coming in 2016, you may want to look at NationBuilder sooner rather than later.

Including 190M voters (free), NationBuilder appears to be a one-stop shop for organizing, fund raising, canvassing, etc.

You could build your own system, but the time and resources spent will not be furthering your cause.

Disclaimer: I have not used NationBuilder nor do I have any relationship, financial or otherwise with NationBuilder.

U.S. declares Venezuela a national security threat, sanctions top officials (Venezuela?)

Filed under: Government,Politics — Patrick Durusau @ 3:41 pm

U.S. declares Venezuela a national security threat, sanctions top officials by Jeff Mason and Roberta Rampton.

Venezuela? Really?

FearDept-Venezuela

(Image tweeted by the U.S. Dept. of Fear)

I scanned the Wikipedia article on Venezuela but nothing jumped out at me as a threat to U.S. national security. I have to concede that most Venezuelans speak Spanish, are Roman Catholic (92%) and non-whites are a majority of the population. None of that strikes me as threats to our national security. But, I’m not trying to distract the press from some other breaking or about to break story.

I wasn’t able to find a copy of the executive order. If you check Executive Orders, the most recent one listed is February 13, 2015. I did find a press release, which indicates surveillance of U.S. bank data.

From the press release:

Individuals designated or identified for the imposition of sanctions under this E.O., including the seven individuals that have been listed today in the Annex of this E.O., will have their property and interests in property in the United States blocked or frozen, and U.S. persons are prohibited from doing business with them. The E.O. also suspends the entry into the United States of individuals meeting the criteria for economic sanctions.

1. Antonio José Benavides Torres: Commander of the Strategic Region for the Integral Defense (REDI) of the Central Region of Venezuela’s Bolivarian National Armed Forces (FANB) and former Director of Operations for Venezuela’s Bolivarian National Guard (GNB).

2. Gustavo Enrique González López: Director General of Venezuela’s Bolivarian National Intelligence Service (SEBIN) and President of Venezuela’s Strategic Center of Security and Protection of the Homeland (CESPPA).

3. Justo José Noguera Pietri: President of the Venezuelan Corporation of Guayana (CVG), a state-owned entity, and former General Commander of Venezuela’s Bolivarian National Guard (GNB).

4. Katherine Nayarith Haringhton Padron: national level prosecutor of the 20th District Office of Venezuela’s Public Ministry.

5. Manuel Eduardo Pérez Urdaneta: Director of Venezuela’s Bolivarian National Police.

6. Manuel Gregorio Bernal Martínez: Chief of the 31st Armored Brigade of Caracas of Venezuela’s Bolivarian Army and former Director General of Venezuela’s Bolivarian National Intelligence Service (SEBIN).

7. Miguel Alcides Vivas Landino: Inspector General of Venezuela’s Bolivarian National Armed Forces (FANB) and former Commander of the Strategic Region for the Integral Defense (REDI) of the Andes Region of Venezuela’s Bolivarian National Armed Forces.

What do you think? Would a hunt for assets of those seven individuals turn up empty in the United States? If your answer is no, how would the president know to name those particular individuals in the executive order?

Does that sound like surveillance of the financial system to you?

Otherwise, someone would have to apply to the FISA court to obtain the records of financial institutions… and, you know, there may be FISA court orders on that very point. FOIA anyone?

Ironic that the gangsta administration in Washington, which kills without due process and engages in widespread abuses of human rights and the rights of its own citizens, attempts to justify this executive order as follows:

This new authority is aimed at persons involved in or responsible for the erosion of human rights guarantees, persecution of political opponents, curtailment of press freedoms, use of violence and human rights violations and abuses in response to antigovernment protests, and arbitrary arrest and detention of antigovernment protestors, as well as the significant public corruption by senior government officials in Venezuela.

The citizens of the United States would benefit from executive orders or their equivalents from other world powers that read:

This new authority is aimed at persons involved in or responsible for the erosion of human rights guarantees, persecution of political opponents, curtailment of press freedoms, use of violence and human rights violations and abuses in response to antigovernment protests, and arbitrary arrest and detention of antigovernment protestors, as well as the significant public corruption by senior government officials in the United States.

If you or your country needs a list of specific individuals, ping me.

How to Speak Data Science

Filed under: Data Science,Humor — Patrick Durusau @ 2:49 pm

How to Speak Data Science by DataCamp.

One of my personal favorites:

“We booked these results with a small sample” – Our financial budget wasn’t large enough to perform a statistically significant data analysis.

See: Use of SurveyMonkey by mid-level managers. Managers without a clue on survey construction, testing, validation, much less data analysis of the results.

Others that DataCamp missed?

NIH RFI on National Library of Medicine

Filed under: BigData,Machine Learning,Medical Informatics,NIH — Patrick Durusau @ 2:16 pm

NIH Announces Request for Information Regarding Deliberations of the Advisory Committee to the NIH Director (ACD) Working Group on the National Library of Medicine

Deadline: Friday, March 13, 2015.

Responses to this RFI must be submitted electronically to: http://grants.nih.gov/grants/rfi/rfi.cfm?ID=41.

Apologies for having missed this announcement. Perhaps the title lacked urgency? 😉

From the post:

The National Institutes of Health (NIH) has issued a call for participation in a Request for Information (RFI), allowing the public to share its thoughts with the NIH Advisory Committee to the NIH Director Working Group charged with helping to chart the course of the National Library of Medicine, the world’s largest biomedical library and a component of the NIH, in preparation for recruitment of a successor to Dr. Donald A.B. Lindberg, who will retire as NLM Director at the end of March 2015.

As part of the working group’s deliberations, NIH is seeking input from stakeholders and the general public through an RFI.

Information Requested

The RFI seeks input regarding the strategic vision for the NLM to ensure that it remains an international leader in biomedical data and health information. In particular, comments are being sought regarding the current value of and future need for NLM programs, resources, research and training efforts and services (e.g., databases, software, collections). Your comments can include but are not limited to the following topics:

  • Current NLM elements that are of the most, or least, value to the research community (including biomedical, clinical, behavioral, health services, public health and historical researchers) and future capabilities that will be needed to support evolving scientific and technological activities and needs.
  • Current NLM elements that are of the most, or least, value to health professionals (e.g., those working in health care, emergency response, toxicology, environmental health and public health) and future capabilities that will be needed to enable health professionals to integrate data and knowledge from biomedical research into effective practice.
  • Current NLM elements that are of most, or least, value to patients and the public (including students, teachers and the media) and future capabilities that will be needed to ensure a trusted source for rapid dissemination of health knowledge into the public domain.
  • Current NLM elements that are of most, or least, value to other libraries, publishers, organizations, companies and individuals who use NLM data, software tools and systems in developing and providing value-added or complementary services and products and future capabilities that would facilitate the development of products and services that make use of NLM resources.
  • How NLM could be better positioned to help address the broader and growing challenges associated with:
    • Biomedical informatics, “big data” and data science;
    • Electronic health records;
    • Digital publications; or
    • Other emerging challenges/elements warranting special consideration.

If I manage to put something together, I will post it here as well as to the NIH.

Experiences with big data and machine learning, for all of the hype, have been falling short of the promised land. Not that I think topic maps/subject identity can get you there but certainly closer than wandering in the woods of dark data.

Root Linux Via DRAM

Filed under: Cybersecurity,Linux OS,Security — Patrick Durusau @ 10:57 am

Ouch! Google crocks capacitors and deviates DRAM to root Linux by Iain Thomson.

From the post:


Last summer Google gathered a bunch of leet [elite] security researchers as its Project Zero team and instructed them to find unusual zero-day flaws. They’ve had plenty of success on the software front – but on Monday announced a hardware hack that’s a real doozy.

The technique, dubbed “rowhammer”, rapidly writes and rewrites memory to force capacitor errors in DRAM, which can be exploited to gain control of the system. By repeatedly recharging one line of RAM cells, bits in an adjacent line can be altered, thus corrupting the data stored.

This corruption can lead to the wrong instructions being executed, or control structures that govern how memory is assigned to programs being altered – the latter case can be used by a normal program to gain kernel-level privileges.

The “rowhammer” routines are something to consider adding to your keychain USB (Edward Snowden) or fake Lady Gaga CD (writeable media) (Private Manning), in case you become curious about the security of a networked environment.

Iain’s post is suitable for passing on to middle-level worriers but if you need to read the details consider:

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors: Paper on rowhammer by Yoongu Kim et al.

Program for testing for the DRAM “rowhammer” problem: Google’s GitHub repository on rowhammer.

Rowhammer Discuss (mailing list): Google’s mailing list for discussion of rowhammer.

The Linux faithful turned out to comment that the problem was in hardware and all operating systems were vulnerable. That is obvious from “hardware hack” and “rapidly writes and rewrites memory to force capacitor errors in DRAM.” But you do have to read more than the title to get that information.

Windows-based spies are waiting for someone to write a rowhammer application with a Windows installer, so I don’t think the title is necessarily unfair to Linux. Personally I would just use a USB-based Linux OS to reboot a Windows machine. I don’t know if there is a “looks like MS Windows” interface for Linux or not. So long as you weren’t too productive, that could cover the fact you are not running Windows.

BTW, Iain, unlike many writers, included hyperlinks to non-local resources on rowhammer. That is how the Web is supposed to work. Favor the work of Iain and others like Iain if you want a better Web.

March 9, 2015

Kalman and Bayesian Filters in Python

Filed under: Bayesian Models,Filters,Kalman Filter,Python — Patrick Durusau @ 6:39 pm

Kalman and Bayesian Filters in Python by Roger Labbe.

Apologies for the lengthy quote but Roger makes a great case for interactive textbooks, IPython notebooks, writing for the reader as opposed to making the author feel clever, and finally, making content freely available.

It is a quote that I am going to make a point to read on a regular basis.

And all of that before turning to the subject at hand!

Enjoy!

From the preface:


This is a book for programmers that have a need or interest in Kalman filtering. The motivation for this book came out of my desire for a gentle introduction to Kalman filtering. I’m a software engineer that spent almost two decades in the avionics field, and so I have always been ‘bumping elbows’ with the Kalman filter, but never implemented one myself. They always has a fearsome reputation for difficulty, and I did not have the requisite education. Everyone I met that did implement them had multiple graduate courses on the topic and extensive industrial experience with them. As I moved into solving tracking problems with computer vision the need to implement them myself became urgent. There are classic textbooks in the field, such as Grewal and Andrew’s excellent Kalman Filtering. But sitting down and trying to read many of these books is a dismal and trying experience if you do not have the background. Typically the first few chapters fly through several years of undergraduate math, blithely referring you to textbooks on, for example, Itō calculus, and presenting an entire semester’s worth of statistics in a few brief paragraphs. These books are good textbooks for an upper undergraduate course, and an invaluable reference to researchers and professionals, but the going is truly difficult for the more casual reader. Symbology is introduced without explanation, different texts use different words and variables names for the same concept, and the books are almost devoid of examples or worked problems. I often found myself able to parse the words and comprehend the mathematics of a definition, but had no idea as to what real world phenomena these words and math were attempting to describe. “But what does that mean?” was my repeated thought.

However, as I began to finally understand the Kalman filter I realized the underlying concepts are quite straightforward. A few simple probability rules, some intuition about how we integrate disparate knowledge to explain events in our everyday life and the core concepts of the Kalman filter are accessible. Kalman filters have a reputation for difficulty, but shorn of much of the formal terminology the beauty of the subject and of their math became clear to me, and I fell in love with the topic.

As I began to understand the math and theory, more difficulties presented themselves. A book or paper’s author makes some statement of fact and presents a graph as proof. Unfortunately, why the statement is true is not clear to me, nor is the method by which you might make that plot obvious. Or maybe I wonder “is this true if R=0?” Or the author provides pseudocode – at such a high level that the implementation is not obvious. Some books offer Matlab code, but I do not have a license to that expensive package. Finally, many books end each chapter with many useful exercises. Exercises which you need to understand if you want to implement Kalman filters for yourself, but exercises with no answers. If you are using the book in a classroom, perhaps this is okay, but it is terrible for the independent reader. I loathe that an author withholds information from me, presumably to avoid ‘cheating’ by the student in the classroom.

None of this is necessary, from my point of view. Certainly if you are designing a Kalman filter for an aircraft or missile you must thoroughly master all of the mathematics and topics in a typical Kalman filter textbook. I just want to track an image on a screen, or write some code for my Arduino project. I want to know how the plots in the book are made, and choose different parameters than the author chose. I want to run simulations. I want to inject more noise in the signal and see how a filter performs. There are thousands of opportunities for using Kalman filters in everyday code, and yet this fairly straightforward topic is the provenance of rocket scientists and academics.

I wrote this book to address all of those needs. This is not the book for you if you program avionics for Boeing or design radars for Raytheon. Go get a degree at Georgia Tech, UW, or the like, because you’ll need it. This book is for the hobbyist, the curious, and the working engineer that needs to filter or smooth data.

This book is interactive. While you can read it online as static content, I urge you to use it as intended. It is written using IPython Notebook, which allows me to combine text, python, and python output in one place. Every plot, every piece of data in this book is generated from Python that is available to you right inside the notebook. Want to double the value of a parameter? Click on the Python cell, change the parameter’s value, and click ‘Run’. A new plot or printed output will appear in the book.

This book has exercises, but it also has the answers. I trust you. If you just need an answer, go ahead and read the answer. If you want to internalize this knowledge, try to implement the exercise before you read the answer.

This book has supporting libraries for computing statistics, plotting various things related to filters, and for the various filters that we cover. This does require a strong caveat; most of the code is written for didactic purposes. It is rare that I chose the most efficient solution (which often obscures the intent of the code), and in the first parts of the book I did not concern myself with numerical stability. This is important to understand – Kalman filters in aircraft are carefully designed and implemented to be numerically stable; the naive implementation is not stable in many cases. If you are serious about Kalman filters this book will not be the last book you need. My intention is to introduce you to the concepts and mathematics, and to get you to the point where the textbooks are approachable.

Finally, this book is free. The cost for the books required to learn Kalman filtering is somewhat prohibitive even for a Silicon Valley engineer like myself; I cannot believe they are within the reach of someone in a depressed economy, or a financially struggling student. I have gained so much from free software like Python, and free books like those from Allen B. Downey here [1]. It’s time to repay that. So, the book is free, it is hosted on free servers, and it uses only free and open software such as IPython and mathjax to create the book.
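
To give a flavor of what the notebooks walk you through, here is a bare-bones scalar filter, just the predict/update cycle for a constant value observed through noisy measurements. The book’s own examples use its supporting libraries; this standalone sketch is mine, not Roger’s code.

import random
def kalman_1d(measurements, process_var=1e-4, measurement_var=0.25):
    x, p = 0.0, 1000.0                 # initial estimate and (large) uncertainty
    estimates = []
    for z in measurements:
        p += process_var               # predict: state assumed constant, uncertainty grows
        k = p / (p + measurement_var)  # update: Kalman gain blends prediction and measurement
        x += k * (z - x)
        p *= (1 - k)
        estimates.append(x)
    return estimates
zs = [12.3 + random.gauss(0, 0.5) for _ in range(50)]
print(kalman_1d(zs)[-1])               # should land close to 12.3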

I first saw this in a tweet by nixCraft.

Machine learning and magic [Or, Big Data and magic]

Filed under: BigData,Machine Learning,Marketing — Patrick Durusau @ 6:14 pm

Machine learning and magic by John D. Cook.

From the post:

When I first heard about a lie detector as a child, I was puzzled. How could a machine detect lies? If it could, why couldn’t you use it to predict the future? For example, you could say “IBM stock will go up tomorrow” and let the machine tell you whether you’re lying.

Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot generate information that isn’t there any more than a polygraph machine can predict the future.

I supplied the alternative title because of the advocacy of “big data” as a necessity for all enterprises, with no knowledge at all of the data being collected or of the issues for a particular enterprise that it might address. Machine learning suffers from the same affliction.

Specific case studies don’t answer the question of whether machine learning and/or big data is a fit for your enterprise or its particular problems. Some problems are quite common but incompetency in management is the most prevalent of all (Dilbert) and neither big data nor machine learning can help with that problem.

Take John’s caution to heart for both machine learning and big data. You will be glad you did!

On Newspapers and Being Human

Filed under: Journalism,News — Patrick Durusau @ 5:57 pm

On Newspapers and Being Human by Abby Mullen.

From the post:

Last week, an opinion piece appeared in the New York Times, arguing that the advent of algorithmically derived human-readable content may be destroying our humanity, as the lines between technology and humanity blur. A particular target in this article is the advent of “robo-journalism,” or the use of algorithms to write copy for the news. 1 The author cites a study that alleges that “90 percent of news could be algorithmically generated by the mid-2020s, much of it without human intervention.” The obvious rebuttal to this statement is that algorithms are written by real human beings, which means that there are human interventions in every piece of algorithmically derived text. But statements like these also imply an individualism that simply does not match the historical tradition of how newspapers are created. 2

In the nineteenth century, algorithms didn’t write texts, but neither did each newspaper’s staff write its own copy with personal attention to each article. Instead, newspapers borrowed texts from each other—no one would ever have expected individualized copy for news stories. 3 Newspapers were amalgams of texts from a variety of sources, cobbled together by editors who did more with scissors than with a pen (and they often described themselves this way).

Newspapers have never been about individual human effort. They’ve always been about collaboration toward a common goal–giving every newspaper in every town enough material to print their papers, daily, semi-weekly, weekly, however often they went to press. Shelley Podolny states that digital outlets have caused us to “demand content with an appetite that human effort can no longer satisfy,” but news outlets have never been able to satiate that demand, as the Fremont Journal of December 29, 1854, acknowledges.

The borrowing, copying, and plagiarism that Abby describes is said to be alien to our modern intellectual landscape. Or at least it is if you try to use the “O” word for a once every four (4) years sporting event at the behest of the unspeakable, or if you attempt to use any likeness of a Disney character.

The Viral Texts project, in which Abby participates, is attempting to map networks of reprinting in 19th-century newspapers and magazines.

A comparison of the spread of news and ideas in the 21st century may well reveal that the public marketplace of ideas has been severely impoverished by excessive notions of intellectual property and its accompanying legal regime.

If it were shown, by empirical measure, that modern intellectual property practices have in fact impaired the growth and discussion of ideas, would that accelerate the movement towards greater access to news and ideas?

Programs and Proofs: Mechanizing Mathematics with Dependent Types

Filed under: Coq,Functional Programming,Proof Theory,Software Engineering,Types — Patrick Durusau @ 3:49 pm

Programs and Proofs: Mechanizing Mathematics with Dependent Types by Ilya Sergey.

From the post:


The picture “Le coq mécanisé” is courtesy of Lilia Anisimova

These lecture notes are the result of the author’s personal experience of learning how to structure formal reasoning using the Coq proof assistant and employ Coq in large-scale research projects. The present manuscript offers a brief and practically-oriented introduction to the basic concepts of mechanized reasoning and interactive theorem proving.

The primary audience of the manuscript are the readers with expertise in software development and programming and knowledge of discrete mathematic disciplines on the level of an undergraduate university program. The high-level goal of the course is, therefore, to demonstrate how much the rigorous mathematical reasoning and development of robust and intellectually manageable programs have in common, and how understanding of common programming language concepts provides a solid background for building mathematical abstractions and proving theorems formally. The low-level goal of this course is to provide an overview of the Coq proof assistant, taken in its both incarnations: as an expressive functional programming language with dependent types and as a proof assistant providing support for mechanized interactive theorem proving.

By aiming these two goals, this manuscript is, thus, intended to provide a demonstration how the concepts familiar from the mainstream programming languages and serving as parts of good programming practices can provide illuminating insights about the nature of reasoning in Coq’s logical foundations and make it possible to reduce the burden of mechanical theorem proving. These insights will eventually give the reader a freedom to focus solely on the essential part of the formal development instead of fighting with the proof assistant in futile attempts to encode the “obvious” mathematical intuition.

One approach to changing the current “it works, let’s ship” software development model. Users prefer software that works, but in these security-conscious times, having software that works and is to some degree secure is even better.

Looking forward to software with a warranty as a major disruption of the software industry. Major vendors are organized around there being no warranty/liability for software failures. A startup, organized to account for warranty/liability, would be a powerful opponent.

Proof techniques are one way to enable the offering of limited warranties for software products.

I first saw this in a tweet by Comp Sci Fact.

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

Filed under: Corpora,Corpus Linguistics,Linguistics — Patrick Durusau @ 3:16 pm

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts by Mark Davies.

From the post:

This announcement is for those who are interested in historical corpora and who may want a large dataset to work with on their own machine. This is a real corpus, rather than just n-grams (as with the Google Books n-grams; see a comparison at http://googlebooks.byu.edu/compare-googleBooks.asp).

We are pleased to announce that the Corpus of Historical American English (COHA; http://corpus.byu.edu/coha/) is now available in downloadable full-text format, for use on your own computer.
http://corpus.byu.edu/full-text/

COHA joins COCA and GloWbE, which have been available in downloadable full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more than 115,000 separate texts, covering fiction, popular magazines, newspaper articles, and non-fiction books from the 1810s to the 2000s (see http://corpus.byu.edu/full-text/coha_full_text.asp).

At 385 million words in size, the downloadable COHA corpus is much larger than any other structured historical corpus of English. With this large amount of data, you can carry out many types of research that would not be possible with much smaller 5-10 million word historical corpora of English (see http://corpus.byu.edu/compare-smallCorpora.asp).

The corpus is available in several formats: sentence/paragraph, PoS-tagged and lemmatized (one word per line), and for input into a relational database. Samples of each format (3.6 million words each) are available at the full-text website.

We hope that this new resource is of value to you in your research and teaching.

Mark Davies
Brigham Young University
http://davies-linguistics.byu.edu/
http://corpus.byu.edu/

I haven’t ever attempted a systematic ranking of American universities but in terms of contributions to the public domain in the humanities, Brigham Young is surely in the top ten (10), however you might rank the members of that group individually.

Correction: A comment pointed out that this data set is for sale and not in the public domain. My bad, I read the announcement and not the website. Still, given the amount of work required to create such a corpus, I don’t find the fees offensive.

Take the data set being formatted for input into a relational database as a reason for inputting it into a non-relational database.
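
Whatever store you choose, walking the “one word per line” (word/lemma/PoS) format is straightforward. A minimal sketch, using SQLite only for brevity; the column order, tab delimiter and file name are assumptions to check against the samples on the full-text site.

import csv
import sqlite3
conn = sqlite3.connect("coha.db")
conn.execute("CREATE TABLE IF NOT EXISTS tokens "
             "(text_id TEXT, word TEXT, lemma TEXT, pos TEXT)")
with open("coha_sample_wlp.txt", encoding="utf-8") as handle:  # hypothetical file name
    reader = csv.reader(handle, delimiter="\t")
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)",
                     (row[:4] for row in reader if len(row) >= 4))
conn.commit()
print(conn.execute("SELECT count(*) FROM tokens").fetchone())
conn.close()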

Enjoy!

I first saw this in a tweet by the Linguist List (https://twitter.com/linguistlist).

March 8, 2015

The Open-Source Data Science Masters

Filed under: Data Science,Education — Patrick Durusau @ 7:05 pm

The Open-Source Data Science Masters by Clare Corthell.

Clare recites all the numbing stats on the coming shortage of data scientists but then takes a turn that most don’t.

Clare outlines a master’s of data science curriculum using, for the most part, free resources on the Web.

Will you help reduce the coming shortage of data scientists?

Making Statistics Files Accessible

Filed under: Files,Statistics — Patrick Durusau @ 6:54 pm

Making Statistics Files Accessible by Evan Miller.

From the post:

There’s little in modern society more frustrating than receiving a file from someone and realizing you’ll need to buy a jillion-dollar piece of software in order to open it. It’s like, someone just gave you a piece of birthday cake, but you’re only allowed to eat that cake with a platinum fork encrusted with diamonds, and also the fork requires you to enter a serial number before you can use it.

Wizard often receives praise for its clean statistics interface and beautiful design, but I’m just as proud of another part of the software that doesn’t receive much attention, ironically for the very reason that it works so smoothly: the data importers. Over the last couple of years I’ve put a lot of effort into understanding and picking apart various popular file formats; and as a result, Wizard can slurp down Excel, Numbers, SPSS, Stata, and SAS files like it was a bowl of spaghetti at a Shoney’s restaurant.

Of course, there are a lot of edge cases and idiosyncrasies in binary files, and it takes a lot of mental effort to keep track of all the peculiarities; and to be honest I’d rather spend that effort making a better interface instead of bashing my head against a wall over some binary flag field that I really, honestly have no interest in learning more about. So today I’m happy to announce that the file importers are about to get even smoother, and at the same time, I’ll be able to put more of my attention on the core product rather than worrying about file format issues.

The astute reader will ask: how will a feature that starts receiving less attention from me get better? It’s simple: I’ve open-sourced Wizard’s core routines for reading SAS, Stata, and SPSS files, and as of today, these routines are available to anyone who uses R — quite a big audience, which means that many more people will be available to help me diagnose and fix issues with the file importers.

In case you don’t recognize the Wizard software, there’s a reason the site has “mac” in its name: http://www.wizardmac.com. 😉

clf – Command line tool to search snippets on Commandlinefu.com

Filed under: Linux OS,Python — Patrick Durusau @ 6:30 pm

clf – Command line tool to search snippets on Commandlinefu.com by Nicolas Crocfer.

From the webpage:

Commandlinefu.com is the place to record awesome command-line snippets. This tool allows you to search and view the results in your terminal.

What a very clever idea!

Imagine if all the sed/awk scripts were collected from various archive sites, deduped and made searchable via such an interface!
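
For a rough idea of the kind of lookup clf performs, here is a Python sketch against commandlinefu.com’s JSON endpoint. The URL scheme and response fields are assumptions about how such clients work; read clf’s source or the site’s API notes before relying on them.

import base64
import json
from urllib.request import urlopen
def search_commandlinefu(query):
    encoded = base64.b64encode(query.encode("utf-8")).decode("ascii")
    url = ("http://www.commandlinefu.com/commands/matching/%s/%s/json"
           % (query.replace(" ", "-"), encoded))    # assumed URL scheme
    return json.loads(urlopen(url).read().decode("utf-8"))
for entry in search_commandlinefu("find duplicate files")[:5]:
    print(entry.get("summary"), "->", entry.get("command"))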

Enjoy!

Data Viz News March 2 – 7, 2015 (Delivery Format Challenge)

Filed under: Graphics,Visualization — Patrick Durusau @ 6:19 pm

Tiago Veloso has posted hundreds of links to visualizations and resources that have never appeared on Data Viz News, with one post per day between March 2 – 7, 2015.

Which highlights a problem Tiago needs your assistance to solve. From the first post:

At long last, we return to our weekly round ups of the best links about data visualization. Well, it hasn’t been that long, but when you look at what has already taken place since our last post, well, it does seem like an eternity. So much has happened in the first two months of 2015!

This means, of course, that we have a lot of catching up to do! Yes, we could just bring you the most recent articles, interviews and resources. But we’ll try to mix in some of the amazing content already published during this past 60 days, so that we may continue to feature the very best content related to visualization, infographic design, visual journalism, cartography, and much more.

That said, we have also been thinking hard about alternatives to these long, many times overwhelming, gigantic posts. When we created Data Viz News, we were sure that there was enough content to make an appealing, interesting weekly round up just with links about the fields closer to our interests. Now, almost two years later, the question is sort of if we have content for such a post… every day!

So, while today’s post – and the upcoming ones, all to be posted this week – are still in that very same format, we are intensively looking for alternatives, and your help would be very much appreciated: just let us know on Twitter (@visualoop) what you think would be the best way to deliver this much amount of articles. Looking forward for your ideas.

Our cup runs over with data visualization content.

Taking those six days as a data set, how would you organize the same material?

How to Use R for Connecting Neo4j and Tableau (A Recommendation Use Case)

Filed under: Neo4j,R,Tableau — Patrick Durusau @ 5:57 pm

How to Use R for Connecting Neo4j and Tableau (A Recommendation Use Case) by Roberto Rösler.

From the post:

Year is just a little bit more than two months old and we got the good news from Tableau – beta testing for version 9.0 started. But it looks like that one of my most favored features didn’t manage to be part of the first release – the Tableau Web Data Connector (it’s mentioned in Christian Chabot keynote at 01:18:00 you can find here). The connector can be used to access REST APIs from within Tableau.
Instead of waiting for the unknown release containing the Web Data Connector, I will show in this post how you can still use the current version of Tableau together with R to build your own “Web Data Connector”. Specifically, this means we connect to an instance of the graph database Neo4j using Neo4j’s REST API. However, that is not the only good news: our approach that will create a live connection to the “REST API data source” goes beyond any attempt that utilizes Tableau’s Data Extract API, static tde files that could be loaded in Tableau.

In case you aren’t familiar with Tableau, it is business analytics/visualization software that has both commercial and public versions.

Roberto moves data crunching off of Tableau (into Neo4j) and builds a dashboard (playing to Tableau’s strengths) for display of the results.
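
Roberto works in R; for comparison, here is a minimal Python sketch of the same idea, sending a Cypher statement to Neo4j’s transactional REST endpoint (the /db/data/transaction/commit path of the Neo4j 2.x line) and flattening the result into rows a dashboard tool could consume. The labels, relationship and query are hypothetical, and newer Neo4j releases require authentication headers.

import json
from urllib.request import Request, urlopen
def run_cypher(statement, url="http://localhost:7474/db/data/transaction/commit"):
    payload = json.dumps({"statements": [{"statement": statement}]}).encode("utf-8")
    request = Request(url, data=payload, headers={"Content-Type": "application/json"})
    return json.loads(urlopen(request).read().decode("utf-8"))
result = run_cypher("MATCH (p:Product)<-[:BOUGHT]-(c:Customer) "
                    "RETURN p.name AS product, count(c) AS buyers LIMIT 10")
for row in result["results"][0]["data"]:
    print(row["row"])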

If you don’t have the time to follow R-Bloggers, you should make the time to follow Roberto’s blog, Data * Science + R. His posts explore interesting issues at length, with data and code.

I first saw this in a tweet by DataSciGeek.

The internet of things and big data: Unlocking the power

Filed under: BigData,IoT - Internet of Things — Patrick Durusau @ 4:58 pm

The internet of things and big data: Unlocking the power by Charles McLellan.

From the post:

If you have somehow missed the hype, the IoT is a fast-growing constellation of internet-connected sensors attached to a wide variety of ‘things’. Sensors can take a multitude of possible measurements, internet connections can be wired or wireless, while ‘things’ can literally be any object (living or inanimate) to which you can attach or embed a sensor. If you carry a smartphone, for example, you become a multi-sensor IoT ‘thing’, and many of your day-to-day activities can be tracked, analysed and acted upon.

Big data, meanwhile, is characterised by ‘four Vs‘: volume, variety, velocity and veracity. That is, big data comes in large amounts (volume), is a mixture of structured and unstructured information (variety), arrives at (often real-time) speed (velocity) and can be of uncertain provenance (veracity). Such information is unsuitable for processing using traditional SQL-queried relational database management systems (RDBMSs), which is why a constellation of alternative tools — notably Apache’s open-source Hadoop distributed data processing system, plus various NoSQL databases and a range of business intelligence platforms — has evolved to service this market.

The IoT and big data are clearly intimately connected: billions of internet-connected ‘things’ will, by definition, generate massive amounts of data. However, that in itself won’t usher in another industrial revolution, transform day-to-day digital living, or deliver a planet-saving early warning system. As EMC and IDC point out in their latest Digital Universe report, organisations need to hone in on high-value, ‘target-rich’ data that is (1) easy to access; (2) available in real time; (3) has a large footprint (affecting major parts of the organisation or its customer base); and/or (4) can effect meaningful change, given the appropriate analysis and follow-up action.

As we shall see, there’s a great deal less of this actionable data than you might think if you simply looked at the size of the ‘digital universe’ and the number of internet-connected ‘things’.

On the question of business opportunities, you may want to look at: 5 Ways the Internet of Things Drives New $$$ Opportunities by Bill Schmarzo.

A graphic from the report summarizes those opportunities:

IoT-5oppotunities

Select the image to see a larger (and legible) version. Most of the posts where I have encountered it leave it barely legible.

See the EMC Digital Universe study – with research and analysis by IDC.

From the executive summary:

In 2013, only 22% of the information in the digital universe would be a candidate for analysis, i.e., useful if it were tagged (more often than not, we know little about the data, unless it is somehow characterized or tagged – a practice that results in metadata); less than 5% of that was actually analyzed. By 2020, the useful percentage could grow to more than 35%, mostly because of the growth of data from embedded systems.

Ouch! I had been wondering when the ships of opportunity were going to run aground on semantic incompatibility and/or a lack of semantics.

Where does your big data solution store “metadata” about your data (both keys and values)?

Or have you built a big silo for big data?

How to install RethinkDB on Raspberry PI 2

Filed under: Parallel Programming,RethinkDB,Supercomputing — Patrick Durusau @ 3:40 pm

How to install RethinkDB on Raspberry PI 2 by Zaher Ghaibeh.

A short video to help you install RethinkDB on your Raspberry PI 2.
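Once RethinkDB is installed, a quick smoke test is worth the minute it takes. A minimal sketch in Python, assuming the official rethinkdb driver of that era (pip install rethinkdb) and the default client port 28015; the hostname and the database/table names are assumptions.

import rethinkdb as r

conn = r.connect(host="raspberrypi.local", port=28015)        # hostname is an assumption

# Create a database and table on first run.
if "sensors" not in r.db_list().run(conn):
    r.db_create("sensors").run(conn)
if "readings" not in r.db("sensors").table_list().run(conn):
    r.db("sensors").table_create("readings").run(conn)

# Insert a document and read back the count.
r.db("sensors").table("readings").insert({"pin": 4, "value": 21.5}).run(conn)
print(r.db("sensors").table("readings").count().run(conn))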

Install it when your Raspberry Pi 2 is not part of a larger cluster of Raspberry Pi 2s tracking social data on your intelligence services.

Your advantage in that regard is that you aren't (or shouldn't be) piling up ever bigger haystacks to search for needles with a pitchfork.

Focused intelligence (beginning with HUMINT and incorporating SIGINT and other types of intelligence) can result in much higher quality intelligence at lower operational cost than the data vacuum approach.

For one thing, knowing the signal you are seeking boosts the chances of detecting it. Searching for an unknown signal adrift in a sea of data is a low-percentage proposition.

How do you plan to use your RethinkDB to track intelligence on local or state government?

Data Journalism (Excel)

Filed under: Excel,Journalism,Spreadsheets — Patrick Durusau @ 2:01 pm

Data Journalism by Ken Blake.

From the webpage:

Learning to use a spreadsheet will transform how you do what you do as a media professional. The YouTube-hosted videos below demonstrate some of the possibilities. If you like, download the datasets and follow along.

I use the PC version of Excel 2010, Microsoft’s cheap, ubiquitous spreadsheet program. It can do some things that the Mac version can’t. But if you’re a Mac user, you’ll still find plenty worth watching.

Everything is free to watch and download. If you’d like to use these materials elsewhere, please just e-mail me first and ask permission. Questions, suggestions and requests are welcome, too. See my contact information.

And check back now and then. I’ll be adding more stuff soon.

Disclaimer: These tutorials will not help you perform NLP on Klingon, nor are they a guide to GPU-based deep learning to improve your play in Mortal Kombat X.

Having gotten those major disappointments out of the way, these tutorials will help you master Excel and use it effectively in uncovering the misdeeds of small lives in local and state government.

To use “big data” tools with small data is akin to hunting rats with an elephant gun. Doable, but expensive and difficult to master.

As an added bonus, processing small data will give you experience with the traps and pitfalls of data, which remain important whether your data sets are big or small.

Enjoy!

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service

Filed under: Blazegraph,Graphs,RDF,SPARQL — Patrick Durusau @ 11:03 am

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service by Brad Bebee.

From the post:

Blazegraph™ has been selected by the Wikimedia Foundation to be the graph database platform for the Wikidata Query Service. Read the Wikidata announcement here. Blazegraph™ was chosen over Titan, Neo4j, Graph-X, and others by Wikimedia in their evaluation. There’s a spreadsheet link in the selection message, which has quite an interesting comparison of graph database platforms.

Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others. The Wikidata Query Service is a new capability being developed to allow users to be able to query and curate the knowledge base contained in Wikidata.

We’re super-psyched to be working with Wikidata and think it will be a great thing for Wikidata and Blazegraph™.

From the Blazegraph™ SourceForge page:

Blazegraph™ is SYSTAP's flagship graph database. It is specifically designed to support big graphs offering both Semantic Web (RDF/SPARQL) and Graph Database (tinkerpop, blueprints, vertex-centric) APIs. It is built on the same open source GPLv2 platform and maintains 100% binary and API compatibility with Bigdata®. It features robust, scalable, fault-tolerant, enterprise-class storage and query and high-availability with online backup, failover and self-healing. It is in production use with enterprises such as Autodesk, EMC, Yahoo7!, and many others. Blazegraph™ provides both embedded and standalone modes of operation.

Blazegraph has a High Availability and Scale Out architecture. It provides robust support for Semantic Web (RDF/SPARQL) and Property Graph (Tinkerpop) APIs. Highly scalable, Blazegraph can handle 50 billion edges.
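Once the query service is live, it will expose a SPARQL endpoint backed by Blazegraph. A minimal Python sketch of what querying it could look like, assuming the SPARQLWrapper library; the endpoint URL and the property/item IDs are illustrative assumptions, since the service is still under development.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")   # assumed endpoint URL
sparql.setQuery("""
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .   # P31 = instance of, Q5 = human (illustrative)
}
LIMIT 5
""")
sparql.setReturnFormat(JSON)

# Print the item URIs from the JSON result bindings.
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["item"]["value"])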

See the Blazegraph wiki, which has forty-three (43) substantive links to further details on Blazegraph.

For an even deeper look, consider these white papers:

Enjoy!

Lies, Damned Lies, and Clapper (2015)

Filed under: Government,Intelligence,Politics — Patrick Durusau @ 9:10 am

Worldwide Threat Assessment of the US Intelligence Community 2015 by James R Clapper (Director of National Intelligence).

The amazing thing about Director of National Intelligence (DNI) Clapper is that he remains out of prison and uncharged for his prior lies to Congress.

Clapper should get points for an amazing lack of self-awareness when he addresses the issue of unknown integrity of information due to cyber attacks:

Decision making by senior government officials (civilian and military), corporate executives, investors, or others will be impaired if they cannot trust the information they are receiving.

Decision making by members of Congress (senior government officials) and members of the public is impaired when they can't obtain trustworthy information from government agencies and their leaders.

In that regard, the 2015 threat assessment is incomplete. It should have included the threats the US public faces, cyber and otherwise, from its own government.

March 7, 2015

The ISIS Twitter Census

Filed under: Social Media,Social Networks,Twitter — Patrick Durusau @ 8:38 pm

The ISIS Twitter Census: Defining and describing the population of ISIS supporters on Twitter by J.M. Berger and Jonathon Morgan.

This is the Brookings Institute report that I said was forthcoming in: Losing Your Right To Decide, Needlessly.

From the Executive Summary:

The Islamic State, known as ISIS or ISIL, has exploited social media, most notoriously Twitter, to send its propaganda and messaging out to the world and to draw in people vulnerable to radicalization.

By virtue of its large number of supporters and highly organized tactics, ISIS has been able to exert an outsized impact on how the world perceives it, by disseminating images of graphic violence (including the beheading of Western journalists and aid workers and more recently, the immolation of a Jordanian air force pilot), while using social media to attract new recruits and inspire lone actor attacks.

Although much ink has been spilled on the topic of ISIS activity on Twitter, very basic questions remain unanswered, including such fundamental issues as how many Twitter users support ISIS, who they are, and how many of those supporters take part in its highly organized online activities.

Previous efforts to answer these questions have relied on very small segments of the overall ISIS social network. Because of the small, cellular nature of that network, the examination of particular subsets such as foreign fighters in relatively small numbers, may create misleading conclusions.

My suggestion is that you skim the “group think” sections on ISIS and move quickly to Section 3, Methodology. That will put you into a position to evaluate the various and sundry claims about ISIS and what may or may not be supported by their methodology.

I am still looking for a metric for “successful” use of social media. So far, no luck.

Hands-on with machine learning

Filed under: Journalism,Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 5:20 pm

Hands-on with machine learning by Chase Davis.

From the webpage:

First of all, let me be clear about one thing: You’re not going to “learn” machine learning in 60 minutes.

Instead, the goal of this session is to give you some sense of how to approach one type of machine learning in practice, specifically supervised learning (http://en.wikipedia.org/wiki/Supervised_learning).

For this exercise, we’ll be training a simple classifier that learns how to categorize bills from the California Legislature based only on their titles. Along the way, we’ll focus on three steps critical to any supervised learning application: feature engineering, model building and evaluation.

To help us out, we'll be using a Python library called scikit-learn (http://scikit-learn.org/), which is the easiest to understand machine learning library I've seen in any language.

That’s a lot to pack in, so this session is going to move fast, and I’m going to assume you have a strong working knowledge of Python. Don’t get caught up in the syntax. It’s more important to understand the process.

Since we only have time to hit the very basics, I’ve also included some additional points you might find useful under the “What we’re not covering” heading of each section below. There are also some resources at the bottom of this document that I hope will be helpful if you decide to learn more about this on your own.

A great starting place for journalists or anyone else who wants to understand basic machine learning.
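To make the shape of the exercise concrete, here is a minimal sketch (not Chase's session code) of the feature engineering / model building / evaluation loop, using a handful of made-up bill titles and hypothetical categories.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data: bill titles and made-up category labels.
titles = [
    "An act relating to public school funding",
    "An act relating to water rights and irrigation districts",
    "An act relating to charter school accountability",
    "An act relating to groundwater management",
]
labels = ["education", "water", "education", "water"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # feature engineering
    ("clf", MultinomialNB()),                         # model building
])
model.fit(titles, labels)

# Informal evaluation: predict the category of a new title.
print(model.predict(["An act relating to school district budgets"]))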

I first saw this in a tweet by Hanna Wallach.

Deep Learning for Natural Language Processing (March – June, 2015)

Filed under: Deep Learning,Natural Language Processing — Patrick Durusau @ 4:57 pm

CS224d: Deep Learning for Natural Language Processing by Richard Socher.

Description:

Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate most everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. In this spring quarter course students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long-short-term-memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.
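If you want a feel for the simplest model on that list before the course starts, here is a toy numpy forward pass of a window-based classifier over word vectors; the vectors are random and every size and word is chosen purely for illustration.

import numpy as np

rng = np.random.RandomState(0)
vocab = {"the": 0, "museums": 1, "in": 2, "paris": 3, "are": 4}
d, window, hidden, classes = 5, 3, 8, 2           # tiny sizes for illustration

E = rng.randn(len(vocab), d)                      # word embedding matrix (random stand-in)
W = rng.randn(hidden, window * d)                 # hidden layer weights
b = rng.randn(hidden)
U = rng.randn(classes, hidden)                    # output layer weights
c = rng.randn(classes)

words = ["museums", "in", "paris"]                # one 3-word window
x = np.concatenate([E[vocab[w]] for w in words])  # concatenated word vectors
h = np.tanh(W.dot(x) + b)                         # hidden layer
scores = U.dot(h) + c
probs = np.exp(scores) / np.exp(scores).sum()     # softmax over the classes
print(probs)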

Assignments, course notes and slides will all be posted online. You are free to "follow along," but no credit is offered.

Are you ready for the cutting-edge?

I first saw this in a tweet by Randall Olson.
