Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2014

Not just the government’s playbook

Filed under: Programming,Project Management,Software Engineering — Patrick Durusau @ 3:59 pm

Not just the government’s playbook by Mike Loukides.

From the post:

Whenever I hear someone say that “government should be run like a business,” my first reaction is “do you know how badly most businesses are run?” Seriously. I do not want my government to run like a business — whether it’s like the local restaurants that pop up and die like wildflowers, or megacorporations that sell broken products, whether financial, automotive, or otherwise.

If you read some elements of the press, it’s easy to think that healthcare.gov is the first time that a website failed. And it’s easy to forget that a large non-government website was failing, in surprisingly similar ways, at roughly the same time. I’m talking about the Common App site, the site high school seniors use to apply to most colleges in the US. There were problems with pasting in essays, problems with accepting payments, problems with the app mysteriously hanging for hours, and more.

I don’t mean to pick on Common App; you’ve no doubt had your own experience with woefully bad online services: insurance companies, Internet providers, even online shopping. I’ve seen my doctor swear at the Epic electronic medical records application when it crashed repeatedly during an appointment. So, yes, the government builds bad software. So does private enterprise. All the time. According to TechRepublic, 68% of all software projects fail. We can debate why, and we can even debate the numbers, but there’s clearly a lot of software #fail out there — in industry, in non-profits, and yes, in government.

With that in mind, it’s worth looking at the U.S. CIO’s Digital Services Playbook. It’s not ideal, and in many respects, its flaws reveal its origins. But it’s pretty good, and should certainly serve as a model, not just for the government, but for any organization, small or large, that is building an online presence.

See Mike’s post for the extracted thirteen (13) principles (plays in Obama-speak) for software projects.

While everybody needs a reminder, what puzzles me is that none of the principles are new. That being the case, shouldn’t we be asking:

Why haven’t projects been following these rules?

Reasoning that if we (collectively) know what makes software projects succeed, what are the barriers to implementing those steps in all software projects?

Re-stating rules that we already know to be true, without more, isn’t very helpful. Projects that start tomorrow will have a fresh warning in their ears and commit the same errors that doom 68% of all other projects.

My favorite suggestion and the one I have seen violated most often is:

Bring in experienced teams

I am told, “…our staff don’t know how to do X, Y or Z….” That sounds to me like a personnel problem, and in an IT recession, not a hard one to fix. But no, the project has to succeed with IT staff known to lack the project management or technical skills to succeed. You can guess the outcome of such projects in advance.

The restatement of project rules isn’t a bad thing to have but your real challenge is going to be following them. Suggestions for everyone’s benefit welcome!

International Conference on Functional Programming 2014

Filed under: Functional Programming — Patrick Durusau @ 3:39 pm

The complete proceedings of the 19th ACM SIGPLAN International Conference on Functional Programming (ICFP 2014) are available for free for one year.

I count thirty-one (31) papers that you can access for the next year.

Be aware that the ACM has imposed a petty 10-second “wait” screen even though the articles are available without charge.

I first saw this in a tweet by David Van Horn.

Exposing Resources in Datomic…

Filed under: Datomic,Linked Data,RDF — Patrick Durusau @ 2:39 pm

Exposing Resources in Datomic Using Linked Data by Ratan Sebastian.

From the post:

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data, and providing access to it for our application is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems

  1. Adding a new data set to make data visualizations with. This one was a high-dimensional data set and we were certain that the queries that would be needed to make the charts had to be very parameterizable.

  2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts given that we were serving up the data using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog or have to set up a Scala console to look at the raw data that was being served up.

  3. Different data sets that we use had semantically equivalent data that was being accessed in ways specific to that data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are very orthogonal goals to be sure. We embarked on a project which we thought might move us in those three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol. And what’s more accessible than REST. Goal 1 meant that we’d have expose quite a bit of Datalog expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform. A W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

If you are happy with Datomic and RDF primitives, for semantic purposes, this may be all you need.
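
To make that concrete, here is a toy version of the idea in Python with rdflib: a hypothetical Datomic entity, pulled as a plain map (the attribute names and namespace are invented for illustration, not Pellucid’s schema), rendered as RDF triples.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.com/finance/")

    # A Datomic entity pulled as a plain map; attribute names are hypothetical.
    entity = {
        ":db/id": 17592186045418,
        ":security/ticker": "ACME",
        ":security/price": 42.17,
    }

    g = Graph()
    subject = EX[f"security/{entity[':db/id']}"]
    for attr, value in entity.items():
        if attr == ":db/id":
            continue
        # Map each Datomic attribute keyword to an RDF predicate in our namespace.
        predicate = EX[attr.lstrip(":").replace("/", ".")]
        g.add((subject, predicate, Literal(value)))

    print(g.serialize(format="turtle"))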

You have to appreciate Ratan’s closing sentiments:

We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closeted world of handling financial data.

Even though we know as a practical matter that no “shared ontology of financial data” is likely to emerge.

In the absence of such a shared ontology, there are always topic maps.

Deep Learning (MIT Press Book)

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 2:12 pm

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow, and Aaron Courville.

From the webpage:

Draft chapters available for feedback – August 2014
Please help us make this a great book! This draft is still full of typos and can be improved in many ways. Your suggestions are more than welcome. Do not hesitate to contact any of the authors directly by e-mail or Google+ messages: Yoshua, Ian, Aaron.

Teaching a subject isn’t the only way to learn it cold. Proofreading a book on the subject is another.

Ready to dig in?

I first saw this in a tweet by Gregory Piatetsky.

Wandora 2014-08-20 Release

Filed under: Topic Map Software,Wandora — Patrick Durusau @ 10:09 am

Wandora 2014-08-20 Release

From the change log:

Download

For a file with the distribution date, in case you have multiple versions, try http://sourceforge.net/projects/wandora/files/?source=navbar.

In the latest round of new features, Rekognition extractor and Alchemy API image keyword extractor are the two I am most likely to try first. Images are one of the weakest forms of evidence but they still carry the imprimatur of being “photographic.”

What photo collection will you be tagging first?

August 19, 2014

High Performance With Apache Tez (webinar)

Filed under: Hadoop,Tez — Patrick Durusau @ 7:28 pm

Build High Performance Data Processing Application Using Apache Tez by Ajay Singh.

From the post:

This week we continue our YARN webinar series with detailed introduction and a developer overview of Apache Tez. Designed to express fit-to-purpose data processing logic, Tez enables batch and interactive data processing applications spanning TB to PB scale datasets. Tez offers a customizable execution architecture that allows developers to express complex computations as dataflow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.

Tez graduated to Apache top-level project in July 2014 and is now the workhorse of Apache Hive. With Tez, Hive 0.13 is an order of magnitude faster than its previous generation. To learn more on Tez, join us on Thursday August 21st at 9 AM Pacific Time. We will review

  • Tez Architecture
  • Developer APIs
  • Sample code

Discover and Learn

Something to get you in shape for the Fall!

CRDTs: Consistency without consensus

Filed under: Consistency,CRDT,Distributed Systems — Patrick Durusau @ 7:17 pm

CRDTs: Consistency without consensus by Peter Bourgon.

Abstract:

When you think of distributed systems, you probably think in terms of consistency via consensus. That is, enabling a heterogeneous group of systems to agree on facts, while remaining robust in the face of failure. But, as any distributed systems developer can attest, it’s harder than it sounds. Failure happens in myriad, byzantine ways, and failure modes interact unpredictably. Reliable distributed systems need more than competent engineering: they need a robust theoretical foundation. CRDTs, or Convergent Replicated Data Types, are a set of properties or behaviors, discovered more than invented, which enable a distributed system to achieve consistency without consensus, and sidestep entire classes of problems altogether. This talk provides a practical introduction to CRDTs, and describes a production CRDT system built at SoundCloud to serve high-volume time-series data.

Slides: bbuzz14-peter_bourgon_0.pdf

This is very much worth your time!

Great discussion of data models after time mark 23:00 (approximately).

BTW, the system discussed is open source and in production: http://github.com/soundcloud/roshi
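
Roshi describes itself as an LWW-element-set CRDT for time-series events. A minimal Python sketch of that data type (not Roshi’s Go implementation; timestamps and tie-breaking are simplified):

    from dataclasses import dataclass, field

    @dataclass
    class LWWSet:
        """Last-writer-wins element set: adds and removes carry timestamps,
        and merge is a commutative, associative, idempotent max over them."""
        adds: dict = field(default_factory=dict)     # element -> latest add timestamp
        removes: dict = field(default_factory=dict)  # element -> latest remove timestamp

        def add(self, element, ts):
            self.adds[element] = max(ts, self.adds.get(element, ts))

        def remove(self, element, ts):
            self.removes[element] = max(ts, self.removes.get(element, ts))

        def contains(self, element):
            # Present if the most recent operation we know about is an add.
            return self.adds.get(element, 0) > self.removes.get(element, 0)

        def merge(self, other):
            # Keep the larger timestamp on each side; no consensus required.
            for e, ts in other.adds.items():
                self.adds[e] = max(ts, self.adds.get(e, 0))
            for e, ts in other.removes.items():
                self.removes[e] = max(ts, self.removes.get(e, 0))

    # Two replicas diverge and then converge by merging, in either order.
    a, b = LWWSet(), LWWSet()
    a.add("event-1", ts=1)
    b.add("event-1", ts=2)
    b.remove("event-1", ts=3)
    a.merge(b)
    print(a.contains("event-1"))  # False: the remove at ts=3 wins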

Solr-Wikipedia

Filed under: Solr,Wikipedia — Patrick Durusau @ 3:59 pm

Solr-Wikipedia

From the webpage:

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

I haven’t tried this, yet, but utilities for major data sources are always welcome!

Getting started in Clojure…

Filed under: Clojure,Programming — Patrick Durusau @ 3:50 pm

Getting started in Clojure with IntelliJ, Cursive, and Gorilla

part 1: setup

part 2: workflow

From Part 1:

This video goes through, step-by-step, how to setup a productive Clojure development environment from scratch. This part looks at getting the software installed and running. The second part to this video (vimeo.com/103812557) then looks at the sort of workflow you could use with this environment.

If you follow through both videos you’ll end up with Leiningen, IntelliJ, Cursive Clojure and Gorilla REPL all configured to work together 🙂

Some links:

leiningen.org
jetbrains.com/idea/
cursiveclojure.com
gorilla-repl.org

Nothing surprising, but useful if you are just starting out.

Seeing Things Art Historians Don’t

Filed under: Art,Artificial Intelligence,Machine Learning — Patrick Durusau @ 3:33 pm

When A Machine Learning Algorithm Studied Fine Art Paintings, It Saw Things Art Historians Had Never Noticed: Artificial intelligence reveals previously unrecognised influences between great artists

From the post:

The task of classifying pieces of fine art is hugely complex. When examining a painting, an art expert can usually determine its style, its genre, the artist and the period to which it belongs. Art historians often go further by looking for the influences and connections between artists, a task that is even trickier.

So the possibility that a computer might be able to classify paintings and find connections between them at first glance seems laughable. And yet, that is exactly what Babak Saleh and pals have done at Rutgers University in New Jersey.

These guys have used some of the latest image processing and classifying techniques to automate the process of discovering how great artists have influenced each other. They have even been able to uncover influences between artists that art historians have never recognised until now.

At first I thought the claim was that the computer saw something art historians did not. That’s not hard. The question is whether you can convince anyone else to see what you saw. 😉

I stumbled a bit on figure 1 both in the post and in the paper. The caption for figure 1 in the article says:

Figure 1: An example of an often cited comparison in the context of influence. Left: Diego Velázquez’s Portrait of Pope Innocent X (1650), and, Right: Francis Bacon’s Study After Velázquez’s Portrait of Pope Innocent X (1953). Similar composition, pose, and subject matter but a different view of the work.

Well, not exactly. Bacon never saw the original Portrait of Pope Innocent X but produced over forty-five variants of it. It wasn’t a question of “influence” but of subsequent interpretations of the portrait. Not really the same thing as influence. See: Study after Velázquez’s Portrait of Pope Innocent X

I feel certain this will be a useful technique for exploration, but naming objects in a painting would result in a large number of paintings of popes sitting in chairs. Some of which may or may not have been “influences” on subsequent artists.

Or to put it another way, concluding influence, based on when artists lived, is a post hoc ergo propter hoc fallacy. Good technique to find possible places to look but not a definitive answer.

The original post was based on: Toward Automated Discovery of Artistic Influence

Abstract:

Considering the huge amount of art pieces that exist, there is valuable information to be discovered. Examining a painting, an expert can determine its style, genre, and the time period that the painting belongs. One important task for art historians is to find influences and connections between artists. Is influence a task that a computer can measure? The contribution of this paper is in exploring the problem of computer-automated suggestion of influences between artists, a problem that was not addressed before in a general setting. We first present a comparative study of different classification methodologies for the task of fine-art style classification. A two-level comparative study is performed for this classification problem. The first level reviews the performance of discriminative vs. generative models, while the second level touches the features aspect of the paintings and compares semantic-level features vs. low-level and intermediate-level features present in the painting. Then, we investigate the question “Who influenced this artist?” by looking at his masterpieces and comparing them to others. We pose this interesting question as a knowledge discovery problem. For this purpose, we investigated several painting-similarity and artist-similarity measures. As a result, we provide a visualization of artists (Map of Artists) based on the similarity between their works

I first saw this in a tweet by yarapavan.

Deep Learning for NLP (without Magic)

Filed under: Deep Learning,Machine Learning,Natural Language Processing — Patrick Durusau @ 2:47 pm

Deep Learning for NLP (without Magic) by Richard Socher and Christopher Manning.

Abstract:

Machine learning is everywhere in today’s NLP, but by and large machine learning amounts to numerical optimization of weights for human designed representations and features. The goal of deep learning is to explore how computers can take advantage of data to develop features and representations appropriate for complex interpretation tasks. This tutorial aims to cover the basic motivation, ideas, models and learning algorithms in deep learning for natural language processing. Recently, these methods have been shown to perform very well on various NLP tasks such as language modeling, POS tagging, named entity recognition, sentiment analysis and paraphrase detection, among others. The most attractive quality of these techniques is that they can perform well without any external hand-designed resources or time-intensive feature engineering. Despite these advantages, many researchers in NLP are not familiar with these methods. Our focus is on insight and understanding, using graphical illustrations and simple, intuitive derivations. The goal of the tutorial is to make the inner workings of these techniques transparent, intuitive and their results interpretable, rather than black boxes labeled “magic here”. The first part of the tutorial presents the basics of neural networks, neural word vectors, several simple models based on local windows and the math and algorithms of training via backpropagation. In this section applications include language modeling and POS tagging. In the second section we present recursive neural networks which can learn structured tree outputs as well as vector representations for phrases and sentences. We cover both equations as well as applications. We show how training can be achieved by a modified version of the backpropagation algorithm introduced before. These modifications allow the algorithm to work on tree structures. Applications include sentiment analysis and paraphrase detection. We also draw connections to recent work in semantic compositionality in vector spaces. The principle goal, again, is to make these methods appear intuitive and interpretable rather than mathematically confusing. By this point in the tutorial, the audience members should have a clear understanding of how to build a deep learning system for word-, sentence- and document-level tasks. The last part of the tutorial gives a general overview of the different applications of deep learning in NLP, including bag of words models. We will provide a discussion of NLP-oriented issues in modeling, interpretation, representational power, and optimization.

A tutorial on deep learning from NAACL 2013, Atlanta. The webpage offers links to the slides (205), video of the tutorial, and additional resources.

Definitely a place to take a dive into deep learning.

On page 35 of the slides the following caught my eye:

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk.

In vector space terms, this is a vector with one 1 and a lot of zeroes.

[000000000010000]

Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a “one-hot” representation. Its problem:

motel [000000000010000] AND
hotel [000000010000000] = 0
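
In code, the problem the slide is pointing at looks like this (a minimal numpy sketch):

    import numpy as np

    vocab = ["motel", "hotel", "conference", "walk"]
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0
        return v

    motel, hotel = one_hot("motel"), one_hot("hotel")
    # Two different one-hot vectors are always orthogonal: the representation
    # itself carries no notion of similarity between motel and hotel.
    print(np.dot(motel, hotel))  # 0.0
    print(np.dot(motel, motel))  # 1.0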

Another aspect of topic maps comes to the fore!

You can have “one-hot” representations of subjects in a topic map, that is a single identifier, but that’s not required.

You can have multiple “one-hot” representations for a subject or you can have more complex collections of properties that represent a subject. Depends on your requirements, not a default of the technology.
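
For example, a subject proxy might look something like the following sketch, where every name and identifier is illustrative rather than drawn from any particular topic map API:

    # A subject proxy with several identifiers plus properties; every name,
    # identifier, and property here is illustrative.
    motel_proxy = {
        "identifiers": {
            "http://example.com/id/motel",
            "http://example.org/lodging/motor-hotel",
        },
        "properties": {
            "names": ["motel", "motor hotel"],
            "type": "lodging",
        },
    }

    def should_merge(a, b):
        # Merge two proxies when any identifier is shared.
        return bool(a["identifiers"] & b["identifiers"])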

If “one-hot” representations of subjects are insufficient for deep learning, shouldn’t they be insufficient for humans as well?

Complete Antarctic Map

Filed under: Environment,Mapping,Maps — Patrick Durusau @ 1:31 pm

Waterloo makes public most complete Antarctic map for climate research

From the post:

The University of Waterloo has unveiled a new satellite image of Antarctica, and the imagery will help scientists all over the world gain new insight into the effects of climate change.

Thanks to a partnership between the Canadian Space Agency (CSA), MacDonald, Dettwiler and Associates Ltd. (MDA), the prime contractor for the RADARSAT-2 program, and the Canadian Cryospheric Information Network (CCIN) at UWaterloo, the mosaic is free and fully accessible to the academic world and the public.

Using Synthetic Aperture Radar with multiple polarization modes aboard the RADARSAT-2 satellite, the CSA collected more than 3,150 images of the continent in the autumn of 2008, comprising a single pole-to-coast map covering all of Antarctica. This is the first such map of the area since RADARSAT-1 created one in 1997.

You can access the data at: Polar Data Catalogue.

From the Catalogue homepage:

The Polar Data Catalogue is a database of metadata and data that describes, indexes, and provides access to diverse data sets generated by Arctic and Antarctic researchers. The metadata records follow ISO 19115 and Federal Geographic Data Committee (FGDC) standard formats to provide exchange with other data centres. The records cover a wide range of disciplines from natural sciences and policy, to health and social sciences. The PDC Geospatial Search tool is available to the public and researchers alike and allows searching data using a mapping interface and other parameters.

What data would you associate with such a map?

I first saw this at: Most complete Antarctic map for climate research made public.

August 18, 2014

Dissertation draft readers wanted!

Filed under: Computer Science,LVars,Parallel Programming — Patrick Durusau @ 6:55 pm

Dissertation draft readers wanted!

From the post:

Inspired by Brent Yorgey, I’m finally going public with a draft of my dissertation!

My thesis is that a certain kind of data structures, which I call “lattice-based data structures” or “LVars” for short, lend themselves well to guaranteed-deterministic parallel programming. My dissertation combines material from various already-published papers, making it a three-papers-stapled-together dissertation in some sense, but I’m also retconning a lot of my work to make it tell the story I want to tell now.

When people ask what the best introduction to LVars is, I have trouble recommending the first LVars paper; even though it was only published a year ago, my thinking has changed quite a lot as my collaborators and I have figured things out since then, and the paper doesn’t match the way I like to present things now. So I’m hoping that my dissertation will be something I can point to as the definitive introduction to LVars.1

The latest draft is here; it’s automatically updated every time I commit to the repo.2 Because I thought it might be useful to my committee to see my thought process, I left my “peanut gallery” comments in there: notes to myself begin with “LK: ” and are in red, and TODOs are in a brighter red and begin with “TODO: ”. And, as you can see, there are still many TODOs — but it’s more or less starting to look like a dissertation. (Unlike Brent, I’m doing this a little late in the game: I’ve actually already sent a draft to my committee, and my defense is in only three weeks, on September 8. Still, I’m happy for any feedback, even at this late date; I probably won’t turn in a final version until some time after my defense, so there’s no rush.)

I’ll echo Brent in saying that if you notice typos or grammatical errors, feel free to put in a pull request. However, if you have any more substantial comments or suggestions, please send me an email (lindsey at this domain) instead.

Thanks so much for reading!

What do you say?

Ready to offer some eyes on a proposal for guaranteed-deterministic parallel programming?

I’m interested both from the change-tracking perspective of ODF and for the parallel processing of topic maps.

Data Science at the Command Line [Webcast Weds. 20 Aug. 2014]

Filed under: Data Science — Patrick Durusau @ 6:18 pm

Data Science at the Command Line by Jeroen Janssens.

From the post:

Data Science at the Command Line is a new book written by Jeroen Janssens. This website currently contains information about this Wednesday’s webcast, instructions on how to install the Data Science Toolbox, and an overview of all the command-line tools discussed in the book.

I count eighty-one (81) command line tools listed with short explanations. That alone makes it worth visiting the page.

BTW, there is a webcast Wednesday:

On August 20, 2014 at 17:00 UTC, I’ll be doing a two-hour webcast hosted by O’Reilly Media. Attendance is free, but you do need to sign up. This event will be recorded and shared afterwards.

During this hands-on webcast, you’ll be able to interact not only with me, but also with other attendants. (So far, about 1200 people have signed up!) This means that in two hours, you can learn a lot about how to use the command line for doing data science.

Enjoy!

I first saw this in a tweet by Stat Fact.

Topic Maps Are For Data Janitors

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:20 am

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights by Steve Lohr.

From the post:

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

Data formats are one challenge, but so is the ambiguity of human language. Iodine, a new health start-up, gives consumers information on drug side effects and interactions. Its lists, graphics and text descriptions are the result of combining the data from clinical research, government reports and online surveys of people’s experience with specific drugs.

But the Food and Drug Administration, National Institutes of Health and pharmaceutical companies often apply slightly different terms to describe the same side effect. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.

Plenty of progress is still to be made in easing the analysis of data. “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff,” said Michael Cavaretta, a data scientist at Ford Motor, which has used big data analysis to trim inventory levels and guide changes in car design.

Mr. Cavaretta is familiar with the work of ClearStory, Trifacta, Paxata and other start-ups in the field. “I’d encourage these start-ups to keep at it,” he said. “It’s a good problem, and a big one.”

Topic maps were only fifteen (15) years ahead of Big Data’s need for them.

How do you avoid:

That kind of painstaking work must be repeated, time and again, on data projects.

?

By annotating data once using a topic map and re-using that annotation over and over again.

By creating already annotated data using a topic map and reusing that annotation over and over again.

Recalling that topic map annotations can represent “logic” but more importantly, can represent any human insight that can be expressed about data.
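
As a tiny illustration using the drowsiness/somnolence/sleepiness example from the article (the subject identifier below is invented for the sketch):

    # One reusable mapping from provider-specific terms to a single subject.
    # The three terms come from the article; the subject identifier is illustrative.
    SIDE_EFFECT_SUBJECTS = {
        "drowsiness": "side-effect:somnolence",
        "somnolence": "side-effect:somnolence",
        "sleepiness": "side-effect:somnolence",
    }

    def normalize(term):
        return SIDE_EFFECT_SUBJECTS.get(term.strip().lower(), term)

    records = [
        {"source": "FDA", "effect": "Drowsiness"},
        {"source": "NIH", "effect": "somnolence"},
        {"source": "survey", "effect": "Sleepiness"},
    ]
    # Apply the mapping once; downstream projects reuse it instead of
    # re-doing the same wrangling.
    for r in records:
        r["effect"] = normalize(r["effect"])
    print({r["effect"] for r in records})  # {'side-effect:somnolence'}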

See Lohr’s post for startups and others who are talking about a problem the topic maps community solved fifteen years ago.

August 17, 2014

AverageExplorer:…

Filed under: Clustering,Image Recognition,Indexing,Users — Patrick Durusau @ 4:22 pm

AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections, Jun-Yan Zhu, Yong Jae Lee, and Alexei Efros.

Abstract:

This paper proposes an interactive framework that allows a user to rapidly explore and visualize a large image collection using the medium of average images. Average images have been gaining popularity as means of artistic expression and data visualization, but the creation of compelling examples is a surprisingly laborious and manual process. Our interactive, real-time system provides a way to summarize large amounts of visual data by weighted average(s) of an image collection, with the weights reflecting user-indicated importance. The aim is to capture not just the mean of the distribution, but a set of modes discovered via interactive exploration. We pose this exploration in terms of a user interactively “editing” the average image using various types of strokes, brushes and warps, similar to a normal image editor, with each user interaction providing a new constraint to update the average. New weighted averages can be spawned and edited either individually or jointly. Together, these tools allow the user to simultaneously perform two fundamental operations on visual data: user-guided clustering and user-guided alignment, within the same framework. We show that our system is useful for various computer vision and graphics applications.

Applying averaging to images, particularly in an interactive context with users, seems like a very suitable strategy.
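
The core operation is simple. A minimal numpy sketch of a user-weighted average over an aligned image collection (shapes and weights are illustrative; the paper’s brush, stroke, and warp interactions are omitted):

    import numpy as np

    def weighted_average(images, weights):
        # images: (N, H, W, 3) array of aligned images; weights: user importance.
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                          # normalize to sum to 1
        return np.tensordot(w, images, axes=1)   # sum_i w_i * image_i

    rng = np.random.default_rng(0)
    images = rng.random((5, 64, 64, 3))          # stand-in for an image collection
    avg = weighted_average(images, [1, 1, 1, 5, 1])  # emphasize the fourth image
    print(avg.shape)  # (64, 64, 3)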

What would it look like to have interactive merging of proxies based on data ranges controlled by the user?

Value-Loss Conduits?

Filed under: W3C,Web Browser — Patrick Durusau @ 3:52 pm

Do you remove links from materials that you quote?

I ask because of the following example:

The research, led by Alexei Efros, associate professor of electrical engineering and computer sciences, will be presented today (Thursday, Aug. 14) at the International Conference and Exhibition on Computer Graphics and Interactive Techniques, or SIGGRAPH, in Vancouver, Canada.

“Visual data is among the biggest of Big Data,” said Efros, who is also a member of the UC Berkeley Visual Computing Lab. “We have this enormous collection of images on the Web, but much of it remains unseen by humans because it is so vast. People have called it the dark matter of the Internet. We wanted to figure out a way to quickly visualize this data by systematically ‘averaging’ the images.”

Which is a quote from: New tool makes a single picture worth a thousand – and more – images by Sarah Yang.

Those passages were reprinted by Science Daily reading:

The research, led by Alexei Efros, associate professor of electrical engineering and computer sciences, was presented Aug. 14 at the International Conference and Exhibition on Computer Graphics and Interactive Techniques, or SIGGRAPH, in Vancouver, Canada.

“Visual data is among the biggest of Big Data,” said Efros, who is also a member of the UC Berkeley Visual Computing Lab. “We have this enormous collection of images on the Web, but much of it remains unseen by humans because it is so vast. People have called it the dark matter of the Internet. We wanted to figure out a way to quickly visualize this data by systematically ‘averaging’ the images.”

Why leave out the hyperlinks for SIGGRAPH and the Visual Computing Laboratory?

Or for that matter, the link to the original paper: AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections (ACM Transactions on Graphics, SIGGRAPH paper, August 2014) which appeared in the news release.

All three hyperlinks enhance your ability to navigate to more information. Isn’t navigation to more information a prime function of the WWW?

If so, we need to clue in ScienceDaily and other content repackagers to include the hyperlinks passed on to them, at least.

If you can’t be a value-add, at least don’t be a value-loss conduit.

TCP Stealth

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:31 pm

New “TCP Stealth” tool aims to help sysadmins block spies from exploiting their systems by David Meyer.

From the post:

System administrators who aren’t down with spies commandeering their servers might want to pay attention to this one: A Friday article in German security publication Heise provided technical detail on a GCHQ program called HACIENDA, which the British spy agency apparently uses to port-scan entire countries, and the authors have come up with an Internet Engineering Task Force draft for a new technique to counter this program.

The refreshing aspect of this vulnerability is that the details are being discussed in public, as is a partial solution.

Perhaps this is a step towards transparency for cybersecurity. Keeping only malicious actors and “security researchers” in the loop hasn’t worked out so well.

Whether governments fall into “malicious actors” or “security researchers” I leave to your judgement.

Bizarre Big Data Correlations

Filed under: BigData,Correlation,Humor,Statistics — Patrick Durusau @ 3:16 pm

Chance News 99 reported the following story:

The online lender ZestFinance Inc. found that people who fill out their loan applications using all capital letters default more often than people who use all lowercase letters, and more often still than people who use uppercase and lowercase letters correctly.

ZestFinance Chief Executive Douglas Merrill says the company looks at tens of thousands of signals when making a loan, and it doesn’t consider the capital-letter factor as significant as some other factors—such as income when linked with expenses and the local cost of living.

So while it may take capital letters into consideration when evaluating an application, it hasn’t held a loan up because of it.

Submitted by Paul Alper

If it weren’t an “online lender,” ZestFinance could take into account applications signed in crayon. 😉

Chance News collects stories with a statistical or probability angle. Some of them can be quite amusing.

August 16, 2014

Titan 0.5 Released!

Filed under: Graphs,Titan — Patrick Durusau @ 7:30 pm

Titan 0.5 Released!

From the Titan documentation:

1.1. General Titan Benefits

  • Support for very large graphs. Titan graphs scale with the number of machines in the cluster.
  • Support for very many concurrent transactions and operational graph processing. Titan’s transactional capacity scales with the number of machines in the cluster and answers complex traversal queries on huge graphs in milliseconds.
  • Support for global graph analytics and batch graph processing through the Hadoop framework.
  • Support for geo, numeric range, and full text search for vertices and edges on very large graphs.
  • Native support for the popular property graph data model exposed by Blueprints.
  • Native support for the graph traversal language Gremlin.
  • Easy integration with the Rexster graph server for programming language agnostic connectivity.
  • Numerous graph-level configurations provide knobs for tuning performance.
  • Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous super node problem.
  • Provides an optimized disk representation to allow for efficient use of storage and speed of access.
  • Open source under the liberal Apache 2 license.

A major milestone in the development of Titan!

If you are interested in serious graph processing, Titan is one of the systems that should be on your short list.

PS: Matthias Broecheler has posted Titan 0.5.0 GA Release, which has links to upgrade instructions and comments about a future Titan 1.0 release!

August 15, 2014

our new robo-reader overlords

Filed under: Artificial Intelligence,Machine Learning,Security — Patrick Durusau @ 6:18 pm

our new robo-reader overlords by Alan Jacobs.

After you read this post by Jacobs, be sure to spend time with Flunk the robo-graders by Les Perelman (quoted by Jacobs).

Both raise the question: what sort of writing can be taught by algorithms that have no understanding of writing?

In a very real sense, the outcome can only be writing that meets but does not exceed what has been programmed into an algorithm.

That is frightening enough for education, but if you are relying on AI or machine learning for intelligence analysis, your stakes may be far higher.

To be sure, software can recognize “send the atomic bomb triggers by Federal Express to this address….,” or at least I hope that is within the range of current software. But what if the message is: “The destroyer of worlds will arrive next week.” Alert? Yes/No? What if it was written in Sanskrit?

I think computers, along with AI and machine learning can be valuable tools, but not if they are setting the standard for review. At least if you don’t want to dumb down writing and national security intelligence to the level of an algorithm.

I first saw this in a tweet by James Schirmer.

Data Science (StackExchange Beta)

Filed under: Data Science — Patrick Durusau @ 4:31 pm

Data Science

Data science has a StackExchange in beta!

A great place to demonstrate your data science chops!

I first saw this in a tweet by Christophe Lalanne.

Applauding The Ends, Not The Means

Filed under: Cybersecurity,Porn,Security — Patrick Durusau @ 4:25 pm

Microsoft scans email for child abuse images, leads to arrest by Lisa Vaas.

From the post:

It’s not just Google.

Microsoft is also scanning for child-abuse images.

A recent tip-off from Microsoft to the National Center for Missing & Exploited Children (NCMEC) hotline led to the arrest on 31 July 2014 of a 20-year-old Pennsylvanian man in the US.

According to the affidavit of probable cause, posted on Smoking Gun, Tyler James Hoffman has been charged with receiving and sharing child-abuse images.

Shades of the days when Kodak would censor film submitted for development.

Lisa reviews the PhotoDNA techniques used by Microsoft and concludes:

The recent successes of PhotoDNA in leading both Microsoft and Google to ferret out child predators is a tribute to Microsoft’s development efforts in coming up with a good tool in the fight against child abuse.

In this particular instance, given this particular use of hash identifiers, it sounds as though those innocent of this particular type of crime have nothing to fear from automated email scanning.
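
For background on what that matching step looks like, here is a deliberately simplified sketch. PhotoDNA is a proprietary perceptual hash designed to survive resizing and re-encoding; the exact cryptographic hash below only stands in for it to show the set-membership check, and the known-hash entries are placeholders:

    import hashlib

    # Placeholder entries standing in for a database of known-image hashes.
    KNOWN_HASHES = {
        "placeholder-hash-1",
        "placeholder-hash-2",
    }

    def attachment_hash(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def flag_attachment(data: bytes) -> bool:
        # No image "understanding" is involved: either the hash is in the set or not.
        return attachment_hash(data) in KNOWN_HASHES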

No sane person supports child abuse so the outcome of the case doesn’t bother me.

However, the use of PhotoDNA isn’t limited to photos of abused children. The same technique could be applied to photos of police officers abusing protesters (wonder where you would find those?), etc.

Before anyone applauds Microsoft for taking the role of censor (in the Roman sense), remember that corporate policies change. The goals of email scanning may not be so agreeable tomorrow.

XPERT (Xerte Public E-learning ReposiTory)

Filed under: Education,Open Source — Patrick Durusau @ 12:43 pm

XPERT (Xerte Public E-learning ReposiTory)

From the about page:

XPERT (Xerte Public E-learning ReposiTory) project is a JISC funded rapid innovation project (summer 2009) to explore the potential of delivering and supporting a distributed repository of e-learning resources created and seamlessly published through the open source e-learning development tool called Xerte Online Toolkits. The aim of XPERT is to progress the vision of a distributed architecture of e-learning resources for sharing and re-use.

Learners and educators can use XPERT to search a growing database of open learning resources suitable for students at all levels of study in a wide range of different subjects.

Creators of learning resources can also contribute to XPERT via RSS feeds created seamlessly through local installations of Xerte Online Toolkits. Xpert has been fully integrated into Xerte Online Toolkits, an open source content authoring tool from The University of Nottingham.

Other useful links:

Xerte Project Toolkits

Xerte Community.

You may want to start with the browse option because the main interface is rather stark.

The Google interface is “stark” in the same sense, but Google has indexed a substantial portion of all online content. I’m not very likely to draw a blank. With XPERT’s base of 364,979 resources, the odds of my drawing a blank are far higher.

The keywords appear in three distinct alphabetical segments: each begins with a digit or “a,” runs to the end of the alphabet, and then the next segment starts, one after the other. Hebrew and what appears to be Chinese appear at the end of the keyword list, in no particular order. I don’t know if that is an artifact of the software or of its use.

The same repeated alphabetical segments occur in Author. Under Type there are some true types, such as “color print,” but the majority of the listing is file sizes in bytes. Not sure why file size would be a “type.” Institution has similar issues.

If you are looking for a volunteer opportunity, helping XPERT with alphabetization would enhance the browsing experience for the resources it has collected.

I first saw this in a tweet by Graham Steel.

Photoshopping The Weather

Filed under: Graphics,Visualization,Weather Data — Patrick Durusau @ 10:23 am

Photo editing algorithm changes weather, seasons automatically

From the post:

We may not be able control the weather outside, but thanks to a new algorithm being developed by Brown University computer scientists, we can control it in photographs.

The new program enables users to change a suite of “transient attributes” of outdoor photos — the weather, time of day, season, and other features — with simple, natural language commands. To make a sunny photo rainy, for example, just input a photo and type, “more rain.” A picture taken in July can be made to look a bit more January simply by typing “more winter.” All told, the algorithm can edit photos according to 40 commonly changing outdoor attributes.

The idea behind the program is to make photo editing easy for people who might not be familiar with the ins and outs of complex photo editing software.

“It’s been a longstanding interest on mine to make image editing easier for non-experts,” said James Hays, Manning Assistant Professor of Computer Science at Brown. “Programs like Photoshop are really powerful, but you basically need to be an artist to use them. We want anybody to be able to manipulate photographs as easily as you’d manipulate text.”

A paper describing the work will be presented next week at SIGGRAPH, the world’s premier computer graphics conference. The team is continuing to refine the program, and hopes to have a consumer version of the program soon. The paper is available at http://transattr.cs.brown.edu/. Hays’s coauthors on the paper were postdoctoral researcher Pierre-Yves Laffont, and Brown graduate students Zhile Ren, Xiaofeng Tao, and Chao Qian.

For all the talk about photoshopping models, soon the Weather Channel won’t send reporters to windy, rain-soaked beaches, snow-bound roads, or even out chasing tornadoes.

With enough information, the reporters can have weather effects around them simulated and eliminate the travel cost for such assignments.

Something to keep in mind when people claim to have “photographic” evidence. Goes double for cellphone video. A cellphone only captures the context selected by its user. A non-photographic distortion that is hard to avoid.

I first saw this in a tweet by Gregory Piatetsky.

John Chambers: Interfaces, Efficiency and Big Data

Filed under: BigData,Interface Research/Design,R — Patrick Durusau @ 10:07 am

John Chambers: Interfaces, Efficiency and Big Data

From the description:

At useR! 2014, John Chambers was generous enough to provide us with insight into the very early stages of user-centric interactive data exploration. He explains, step by step, how his insight to provide an interface into algorithms, putting the user first has placed us on the fruitful path which analysts, statisticians, and data scientists enjoy to this day. In his talk, John Chambers also does a fantastic job of highlighting a number of active projects, new and vibrant in the R ecosystem, which are helping to continue this legacy of “a software interface into the best algorithms.” The future is bright, and new and dynamic ideas are building off these thoughtful, well measured, solid foundations of the past.

To understand why this past is so important, I’d like to provide a brief view of the historical context that underpins these breakthroughs. In 1976, John Chambers was concerned with making software supported interactive numerical analysis a reality. Let’s talk about what other advances were happening in 1976 in the field of software and computing:

You should read the rest of the back story before watching the keynote by Chambers.

Will your next interface build upon the collective experience with interfaces or will it repeat some earlier experience?

I first saw this in John Chambers: Interfaces, Efficiency and Big Data by David Smith.

August 14, 2014

Mo’ money, less scrutiny:

Filed under: Intellectual Property (IP) — Patrick Durusau @ 7:33 pm

Mo’ money, less scrutiny: Why higher-paid examiners grant worse patents by Derrick Harris.

From the post:

As people get better at their jobs, it’s logical to assume they’re able to perform their work more efficiently. However, a new study suggests that when it comes to issuing patents, there’s a point at which the higher expectations placed on promoted examiners actually become a detriment.

The study used resources from the National Center for Supercomputing Applications to analyze 1.4 million patent applications against a database of patent examiner records, measuring each examiner’s grant rate as they moved up the USPTO food chain. What the researchers found essentially, according to a University of Illinois News Bureau article highlighting the study, is:

“[A]s an examiner is given less time to review an application, they become less inclined to search for prior art, which, in turn, makes it less likely that the examiner makes a prior art-based rejection. In particular, ‘obviousness’ rejections, which are especially time-intensive, decrease.”

….

See Harris’ post for charts, details, etc.

Great to have scientific confirmation but every literate person knows the USPTO has been problematic for years. The real question, beyond the obvious need for intellectual property reform, is what to do with the USPTO?

Any solution that leaves the current leadership, staff, contractors, suppliers, etc., intact is doomed to fail. The culture of the present USPTO fostered this situation, which has festered for years. Asking the current USPTO to change itself fits Einstein’s definition of insanity:

Insanity: doing the same thing over and over again and expecting different results.

Start with a clean slate, including building new indices, technology and regulations and put an end to the mummer’s farce known as the current USPTO.

Model building with the iris data set for Big Data

Filed under: BigData,Data — Patrick Durusau @ 7:09 pm

Model building with the iris data set for Big Data by Joseph Rickert.

From the post:

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

  • It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
  • The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
  • There are interesting things to learn from the data set. (This exercise from Kane and Emerson for example)
  • The data set is tidy, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes)

Joseph reviews what may become the iris data set of “big data,” airline data.

Its variables:

No. Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) – 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes

Source: http://stat-computing.org/dataexpo/2009/the-data.html

Waiting for the data set to download. Lots of questions suggest themselves. For example, the variation, or lack thereof, in the use of fields 25-29.
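
A quick way to start on that question, assuming you grab one of the yearly files (the file name below is an assumption) and have pandas handy:

    import pandas as pd

    # One yearly file from http://stat-computing.org/dataexpo/2009/the-data.html;
    # the file name below is an assumption.
    delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
                  "SecurityDelay", "LateAircraftDelay"]
    df = pd.read_csv("2008.csv.bz2", usecols=["Year", "Month"] + delay_cols)

    # How often are fields 25-29 actually populated, and what do their ranges look like?
    print(df[delay_cols].notna().mean())
    print(df[delay_cols].describe())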

Enjoy!

I first saw this in a tweet by David Smith.

EMACS: The Extensible, Customizable Display Editor

Filed under: Computer Science,Editor — Patrick Durusau @ 2:49 pm

EMACS: The Extensible, Customizable Display Editor by Richard Stallman. (1981, Delivered in the ACM Conference on Text Processing)

From the introduction:

EMACS(1) is a real-time display editor which can be extended by the user while it is running.

Extensibility means that the user can add new editing commands or change old ones to fit his editing needs, while he is editing. EMACS is written in a modular fashion, composed of many separate and independent functions. The user extends EMACS by adding or replacing functions, writing their definitions in the same language that was used to write the original EMACS system. We will explain below why this is the only method of extension which is practical in use: others are theoretically equally good but discourage use, or discourage nontrivial use.

Extensibility makes EMACS more flexible than any other editor. Users are not limited by the decisions made by the EMACS implementors. What we decide is not worth while to add, the user can provide for himself. He can just as easily provide his own alternative to a feature if he does not like the way it works in the standard system.

A coherent set of new and redefined functions can be bound into a library so that the user can load them together conveniently. Libraries enable users to publish and share their extensions, which then become effectively part of the basic system. By this route, many people can contribute to the development of the system, for the most part without interfering with each other. This has led the EMACS system to become more powerful than any previous editor.

User customization helps in another, subtler way, by making the whole user community into a breeding and testing ground for new ideas. Users think of small changes, try them, and give them to other users–if an idea becomes popular, it can be incorporated into the core system. When we poll users on suggested changes, they can respond on the basis of actual experience rather than thought experiments.

To help the user make effective use of the copious supply of features, EMACS provides powerful and complete interactive self-documentation facilities with which the user can find out what is available.

A sign of the success of the EMACS design is that EMACS has been requested by over a hundred sites and imitated at least ten times. (emphasis in the original)

This may not inspire you to start using EMACS but it is a bit of software history that is worth visiting.

Software development doesn’t always result in better software. Or at least the thirty-three years spent on other editors haven’t produced such a result.

P2P to Freedom?

Filed under: Cybersecurity,P2P,Security — Patrick Durusau @ 2:22 pm

Anti-censorship app Lantern wants to become the SETI@home of free speech by Janko Roettgers.

From the post:

Facebook? Blocked. YouTube? Blocked. Twitter? Definitely blocked. Countries like China and Iran have been trying to control the flow of online information for years. Now, there’s an app that wants to take a page from the playbook of crowdsourced computing projects like SETI and poke holes in China’s so-called great firewall and other censorship efforts.

Lantern, as the project is called, is offering users in countries with internet censorship a proxy that unblocks social networks, news sites and political blogs. People in China and elsewhere have been using commercial proxies for years, resulting in a game of whack-a-mole, where censors would simply block access to the IP address of a proxy, forcing users to move on to the next available service.

Lantern wants to solve this issue through a P2P architecture: Users in censored countries don’t connect to a central server with an easily recognizable IP address, but instead route their website requests through a computer run by a volunteer in the U.S. or elsewhere. Lantern is a simple app available for Windows, OS X and Linux. Upon running it for the first time, users indicate whether they want to give access or get access to censored sites. And once it runs, the main UI is actually a data visualization that shows usage of the app around the world in real-time.

Janko has a great summary of the current status of Lantern and its bid to help users avoid censorship.

Lantern is focused on the problem of censorship, which is an important issue in many parts of the world.

But the same principle, that of a P2P network, should be applicable to a no-censorship but track-everything network such as in the United States and elsewhere.

If you don’t think a record of your web traffic, so-called “metadata,” might provide information about you, exchange browser histories with another user.

A robust P2P solution would not prevent tracking entirely, but it could make it too burdensome to be practical in most cases. Like all security, it is a question of secure enough for some purpose and length of time.

Lantern is a project to track and contribute to if you find censorship and/or tracking problematic.
