Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 14, 2014

Introducing Streamtools:…

Filed under: News,Reporting,Visualization — Patrick Durusau @ 7:46 pm

Introducing Streamtools: A Graphical Tool for Working with Streams of Data by Mike Dewar.

From the post:

We see a moment coming when the collection of endless streams of data is commonplace. As this transition accelerates it is becoming increasingly apparent that our existing toolset for dealing with streams of data is lacking. Over the last 20 years we have invested heavily in tools that deal with tabulated data, from Excel, MySQL, and MATLAB to Hadoop, R, and Python+Numpy. These tools, when faced with a stream of never-ending data, fall short and diminish our creative potential.

In response to this shortfall we have created streamtools—a new, open source project by the New York Times R&D Lab which provides a general purpose, graphical tool for dealing with streams of data. It offers a vocabulary of operations that can be connected together to create live data processing systems without the need for programming or complicated infrastructure. These systems are assembled using a visual interface that affords both immediate understanding and live manipulation of the system.

I’m quite excited about this tool, although I would not go so far as to say it will “encourage new forms of reasoning” (emphasis in original). 😉

Still, this is an exciting new tool and I commend both the post and the tool to you.
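
Streamtools itself is graphical, but the idea it packages (small operations wired together over a never-ending stream) is easy to sketch with plain Python generators. This is my own toy illustration of the concept, not streamtools code; all the block names are made up:

```python
import random
import time

def source():
    """Endless stream of fake sensor readings (a stand-in for a live feed)."""
    while True:
        yield {"t": time.time(), "value": random.gauss(0, 1)}
        time.sleep(0.01)

def threshold(stream, limit):
    """Pass along only readings above a limit."""
    for msg in stream:
        if msg["value"] > limit:
            yield msg

def moving_count(stream, window=5.0):
    """Count how many messages arrived in the last `window` seconds."""
    seen = []
    for msg in stream:
        seen.append(msg["t"])
        seen = [t for t in seen if t > msg["t"] - window]
        yield {"t": msg["t"], "count": len(seen)}

# Wiring blocks together, roughly as you would connect them on the streamtools canvas.
pipeline = moving_count(threshold(source(), limit=1.0))
for out in pipeline:
    print(out)   # runs until interrupted; the stream never "finishes"
```

The point the post makes is that the tabular tools never had to cope with the “never finishes” part; a graphical canvas over exactly this kind of pipeline is what streamtools offers without the programming.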

Annotation Use Cases

Filed under: Annotation,HyTime — Patrick Durusau @ 7:29 pm

Annotation Use Cases

From the Introduction:

Annotation is a pervasive activity when reading or otherwise engaging with publications. In the physical world, highlighting and sticky notes are common paradigms for marking up and associating one’s own content with the work being read, and many digital solutions exist in the same space. These digital solutions are, however, not interoperable between systems, even when there is only one user with multiple devices.

This document lays out the use cases for annotations on digital publications, as envisioned by the W3C Digital Publishing Interest Group, the W3C Open Annotation Community Group and the International Digital Publishing Forum. The use cases are provided as a means to drive forwards the conversation about standards in this arena.

Just for the record, all of these use cases and more were doable with HyTime more than twenty (20) years ago.

The syntax was ugly but the underlying concepts are as valid now as they were then. Something to keep in mind while watching this activity.

Papers: ACL 2014

Filed under: Computational Linguistics,Conferences,Linguistics — Patrick Durusau @ 7:21 pm

Papers: ACL 2014

The list of accepted papers for the Association for Computational Linguistics (ACL) 2014 conference has been posted. The conference runs June 22-27 in Baltimore, Maryland.

I am sure that out of the one hundred and forty-six (146) accepted papers you will find at least a few of interest. 😉

I first saw this in a tweet by Shane Bergsma.

An R “meta” book

Filed under: Probability,R,Statistics — Patrick Durusau @ 7:13 pm

An R “meta” book by Joseph Rickert.

From the post:

Recently, however, while crawling around CRAN, it occurred to me that there is a tremendous amount of high quality material on a wide range of topics in the Contributed Documentation page that would make a perfect introduction to all sorts of people coming to R. Maybe, all it needs is a little marketing and reorganization. So, from among this treasure cache (and a few other online sources), I have assembled an R “meta” book in the following table that might be called: An R Based Introduction to Probability and Statistics with Applications.

What a very clever idea! There is lots of documentation already written, and organizing it is simpler than re-doing it all from scratch. Not to mention less time-consuming.

Take a close look at Joseph’s “meta” book and see what you think.

Perhaps there are other “meta” books hiding in the Contributed Documentation.

I first saw this in a tweet by David Smith.

Apache MarkMail

Filed under: Indexing,MarkLogic,Searching — Patrick Durusau @ 6:56 pm

Apache MarkMail

Just in case you don’t have your own index of the 10+ million messages in Apache mailing list archives, this is the site for you.

😉

I ran across it today while debugging an error in a Solr config file.

If I could add one thing to MarkMail it would be software release date facets. Posts are not limited by release dates but I suspect a majority of posts between release dates are about the current release. Enough so that I would find it a useful facet.

You?

Science self-corrects – instantly

Filed under: Peer Review — Patrick Durusau @ 6:47 pm

Science self-corrects – instantly

A highly amusing account of how post-publication review uncovered serious flaws in a paper published with great fanfare in Nature.

To give you the tone of the post:

Publishing a paper is still considered a definitive event. And what could be more definitive than publishing two Nature papers back to back on the same subject? Clearly a great step forward must have occurred. Just such a seismic event happened on the 29th of January, when Haruko Obokata and colleagues described a revolutionarily simple technique for producing pluripotent cells. A short dunk in the acid bath or brief exposure to any one of a number of stressors sufficed to produce STAP (Stimulus-Triggered Acquisition of Pluripotency) cells, offering enormous simplification in stem cell research and opening new therapeutic avenues.

As you may be guessing, the “three overworked referees and a couple of editors” did not catch serious issues with the papers.

But some 4000 viewers at PubPeer did.

If traditional peer review had independent and adequately compensated peer reviewers, the results might be different. But the lack of independence and compensation is designed to produce a minimum review, not a peer review.

Ironic that electronic journals and publications aren’t given weight in scholarly circles due to a lack of “peer review,” when some “peer review” is nothing more than a hope the author has performed well. A rather vain hope in a number of cases.

I do disagree with the PubPeer policy on anonymity.

Authors could be retaliated against, but revolutions are never bloodless. What would the civil rights movement have accomplished with anonymous letters to editors? It was only the outrages and excesses of their oppressors that finally resulted in some change (an ongoing process even now).

Serious change will occur if and only if the “three overworked referees and a couple of editors” are publicly outed by named colleagues. And for that process to be repeated over and over again. Until successful peer review is a mark of quality of research and writing, not just another step at a publication mill.

March 13, 2014

Kite Software Development Kit

Filed under: Cloudera,Hadoop,Kite SDK,MapReduce — Patrick Durusau @ 7:12 pm

Kite Software Development Kit

From the webpage:

The Kite Software Development Kit (Apache License, Version 2.0), or Kite for short, is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

  • Codifies expert patterns and practices for building data-oriented systems and applications
  • Lets developers focus on business logic, not plumbing or infrastructure
  • Provides smart defaults for platform choices
  • Supports gradual adoption via loosely-coupled modules

Version 0.12.0 was released March 10, 2014.

Do note that unlike some “pattern languages,” these are legitimate patterns based on expert patterns and practices. (There are “patterns” produced like Uncle Bilius (Harry Potter and the Deathly Hallows, Chapter Eight) after downing a bottle of firewhiskey. You should avoid such patterns.)

12 Steps for Research Programming

Filed under: Programming,Research Methods — Patrick Durusau @ 6:49 pm

How effective is your research programming workflow? by Philip Guo.

From the post:

For my Ph.D. dissertation, I investigated research programming, a common type of programming activity where people write computer programs to obtain insights from data. Millions of professionals in fields ranging from science, engineering, business, finance, public policy, and journalism, as well as numerous students and computer hobbyists, all perform research programming on a daily basis.

Inspired by The Joel Test for rating software engineering teams, here is my informal “Philip test” to determine whether your research programming workflow is effective:

  1. Do you have reliable ways of taking, organizing, and reflecting on notes as you’re working?
  2. Do you have reliable to-do lists for your projects?
  3. Do you write scripts to automate repetitive tasks?
  4. Are your scripts, data sets, and notes backed up on another computer?
  5. Can you quickly identify errors and inconsistencies in your raw data sets?
  6. Can you write scripts to acquire and merge together data from different sources and in different formats?
  7. Do you use version control for your scripts?
  8. If you show analysis results to a colleague and they offer a suggestion for improvement, can you adjust your script, re-run it, and produce updated results within an hour?
  9. Do you use assert statements and test cases to sanity check the outputs of your analyses? (see the sketch after this list)
  10. Can you re-generate any intermediate data set from the original raw data by running a series of scripts?
  11. Can you re-generate all of the figures and tables in your research paper by running a single command?
  12. If you got hit by a bus, can one of your lab-mates resume your research where you left off with less than a week of delay?
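
Item 9 is the cheapest habit on the list to adopt. A minimal sketch of what such sanity checks can look like; the data file and column names are invented for illustration:

```python
import pandas as pd

def sanity_check(df: pd.DataFrame) -> pd.DataFrame:
    """Cheap assertions that catch silent data problems before the real analysis runs."""
    assert not df.empty, "analysis input is empty"
    assert df["id"].is_unique, "duplicate identifiers in raw data"
    assert df["age"].between(0, 120).all(), "implausible ages present"
    assert df["weight"].notna().mean() > 0.95, "more than 5% of weights are missing"
    return df

# Fail fast: the expensive analysis never sees bad input.
df = sanity_check(pd.read_csv("survey_raw.csv"))
```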

Philip suggests a starting point in his post.

His post alone is pure gold, I would say.

I came to this by following a tweet by Neil Saunders that pointed to: How effective is my research programming workflow? The Philip Test – Part 1, and from there I found the link to Philip’s post.

This sounds a lot like the recent controversy over the ability to duplicate research published in scientific journals. Can someone else replicate your results?

2014 SIAM Conference Program on Discrete Mathematics

Filed under: Conferences,Graphs — Patrick Durusau @ 2:52 pm

2014 SIAM Conference Program on Discrete Mathematics

If you are interested in the more formal side of graph work, there are a number of sessions of interest at the upcoming 2014 SIAM conference.

This program listing gives authors and abstracts which should be enough for you to find their work prior to the conference.

The conference runs from June 16-19, 2014 in Minneapolis, Minnesota.

Audit Trails Anyone?

Filed under: Auditing,BigData,Semantics — Patrick Durusau @ 2:44 pm

Instrumenting collaboration tools used in data projects: Built-in audit trails can be useful for reproducing and debugging complex data analysis projects by Ben Lorica.

From the post:

As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, companies are beginning to treat analytics as mission-critical software and have real-time dashboards to track model performance.

Once a model is deemed to be underperforming or misbehaving, diagnostic tools are needed to help determine appropriate fixes. It could well be models need to be revisited and updated, but there are instances when underlying data sources and data pipelines are what need to be fixed. Beyond the formal systems put in place specifically for monitoring analytic products, tools for reproducing data science workflows could come in handy.

Ben goes on to suggest that an “activity log” is a great idea for capturing a workflow for later analysis/debugging. And so it is, but I would go one step further and capture some of the semantics of the workflow.

I knew a manager who had a “cheat sheet” of report writer jobs to run every month. They would pull the cheat sheet, enter the commands and produce the report. They were a roadblock to ever changing the system because then the “cheat sheet” would not work.

I am sure none of you have ever encountered the same situation. But I have seen it in at least one case.

Category Theory References

Filed under: Category Theory,Mathematics — Patrick Durusau @ 2:20 pm

Category Theory References

Ten (10) pages of category theory citations that I bookmarked recently.

The citations are not annotated so they are of limited utility, but the list looked worth passing along.

Are there any ongoing annotated lists of references for category theory?

The American Mathematical Society (AMS) indexing scheme for 18 Category theory; homological algebra isn’t detailed enough to substitute for an annotated listing. (Be aware that category theory appears under other classifications, so use the search function for the 2010 Mathematics Subject Classification if you want to find all appearances of category theory.)

Enjoy!

March 12, 2014

“The Upshot”

Filed under: Journalism,News,Reporting — Patrick Durusau @ 8:03 pm

“The Upshot” is the New York Times’ replacement for Nate Silver’s FiveThirtyEight by John McDuling.

From the post:

“The Upshot.” That’s the name the New York Times is giving to its new data-driven venture, focused on politics, policy and economic analysis and designed to fill the void left by Nate Silver, the one-man traffic machine whose statistical approach to political reporting was a massive success.

David Leonhardt, the Times’ former Washington bureau chief, who is in charge of The Upshot, told Quartz that the new venture will have a dedicated staff of 15, including three full-time graphic journalists, and is on track for a launch this spring. “The idea behind the name is, we are trying to help readers get to the essence of issues and understand them in a contextual and conversational way,” Leonhardt says. “Obviously, we will be using data a lot to do that, not because data is some secret code, but because it’s a particularly effective way, when used in moderate doses, of explaining reality to people.”

The New York Times’ own public editor admitted that Silver, a onetime baseball stats geek, never really fit into the paper’s culture, and that “a number of traditional and well-respected Times journalists disliked his work.” But Leonhardt says being part of the Times is an “enormous advantage” for The Upshot. “The Times is in an extremely strong position digitally. We are going to be very much a Times product. Having said that, we are not going to do stuff the same way the Times does.” The tone, he said, will be more like having “a journalist sitting next to you, or sending you an email.”

I really like the New York Times for its long tradition of excellence in news gathering. Couple that with technologies to connect its staff’s collective insights with the dots and it would be a formidable enterprise.

AntWeb

Filed under: Data,R,Science — Patrick Durusau @ 7:46 pm

AntWeb by rOpenScience.

From the webpage:

AntWeb is a repository of ant specimen records maintained by the California Academy of Sciences. From the website’s description:

AntWeb is the world’s largest online database of images, specimen records, and natural history information on ants. It is community driven and open to contribution from anyone with specimen records, natural history comments, or images.

Resources

An R wrapper for the AntWeb API.

Listing functions + descriptions:

  • aw_data – Search for data by taxonomic level, full species name, a bounding box, habitat, elevation or type
  • aw_unique – Obtain a list of unique levels by various taxonomic ranks
  • aw_images – Search photos by type or time since added
  • aw_coords – Search for specimens by location and radius
  • aw_code – Search for a specimen by record number
  • aw_map – Map georeferenced data

Doesn’t hurt to have a few off-beat data sets at your command. Can’t tell when someone’s child will need help with a science fair project, etc.

PS: I did resist the temptation to list this post under “bugs.”

Building a tweet ranking web app using Neo4j

Filed under: Graphs,MongoDB,Neo4j,node-js,Python,Tweets — Patrick Durusau @ 7:28 pm

Building a tweet ranking web app using Neo4j by William Lyon.

From the post:


I spent this past weekend hunkered down in the basement of the local Elk’s club, working on a project for a hackathon. The project was a tweet ranking web application. The idea was to build a web app that would allow users to login with their Twitter account and view a modified version of their Twitter timeline that shows them tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

The project uses the following components:

  • Node.js web application (using Express framework)
  • MongoDB database for storing basic user data
  • Integration with Twitter API, allowing for Twitter authentication
  • Python script for fetching Twitter data from Twitter API
  • Neo4j graph database for storing Twitter network data
  • Neo4j unmanaged server extension, providing additional REST endpoint for querying / retrieving ranked timelines per user

Looks like a great project and good practice as well!

Curious what you think of the ranking of tweets:

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.

That’s one measure of “importance.” Being able to assign a rank would be useful as well, say for the British Library.

Do take notice of the Jaccard similarity index.
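
For anyone who has not met it, the Jaccard similarity of two sets is just the size of their intersection over the size of their union. A minimal sketch (my toy data, not code from the post):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, ranging from 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: two users represented by the accounts they interact with.
alice = {"@nytimes", "@neo4j", "@graphemedb"}
bob = {"@neo4j", "@graphemedb", "@mongodb"}
print(jaccard(alice, bob))   # 0.5 -> two shared accounts out of four distinct accounts
```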

Would you say that possessing at least one identical string (id, subject identifier, subject indicator) is a form of similarity measure?

What other types of similarity measures do you think would be useful for topic maps?

I first saw this in a tweet by GraphemeDB.

Data Mining the Internet Archive Collection [Librarians Take Note]

Filed under: Archives,Data Mining,Librarian/Expert Searchers,MARC,MARCXML,Python — Patrick Durusau @ 4:48 pm

Data Mining the Internet Archive Collection by Caleb McDaniel.

From the “Lesson Goals:”

The collections of the Internet Archive (IA) include many digitized sources of interest to historians, including early JSTOR journal content, John Adams’s personal library, and the Haiti collection at the John Carter Brown Library. In short, to quote Programming Historian Ian Milligan, “The Internet Archive rocks.”

In this lesson, you’ll learn how to download files from such collections using a Python module specifically designed for the Internet Archive. You will also learn how to use another Python module designed for parsing MARC XML records, a widely used standard for formatting bibliographic metadata.

For demonstration purposes, this lesson will focus on working with the digitized version of the Anti-Slavery Collection at the Boston Public Library in Copley Square. We will first download a large collection of MARC records from this collection, and then use Python to retrieve and analyze bibliographic information about items in the collection. For example, by the end of this lesson, you will be able to create a list of every named place from which a letter in the antislavery collection was written, which you could then use for a mapping project or some other kind of analysis.
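
In code terms the lesson boils down to something like the sketch below. It is not the lesson’s own code: the collection identifier, the *_marc.xml naming convention, and the use of MARC field 260$a for place names are my guesses, so follow the lesson for the verified details.

```python
import os
import internetarchive as ia
from pymarc import parse_xml_to_array

os.makedirs("marc", exist_ok=True)
search = ia.search_items("collection:bplscas")   # hypothetical collection identifier

places = []
for result in list(search)[:25]:                 # a small sample, not all 7,000 items
    item = ia.get_item(result["identifier"])
    for f in item.get_files(glob_pattern="*_marc.xml"):
        f.download(file_path=f"marc/{f.name}")
        for record in parse_xml_to_array(f"marc/{f.name}"):
            field = record["260"]                # imprint: place, publisher, date
            if field is not None and field["a"]:
                places.append(field["a"].strip(" :[]"))

print(sorted(set(places)))                       # every named place in the sample
```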

This rocks!

In particular for librarians and library students who will already be familiar with MARC records.

Some 7,000 items from the Boston Public Library’s anti-slavery collection at Copley Square are the focus of this lesson.

That means historians have access to rich metadata, full images, and partial descriptions for thousands of antislavery letters, manuscripts, and publications.

Would original anti-slavery materials, written by actual participants, have interested you as a student? Do you think such materials would interest students now?

I first saw this in a tweet by Gregory Piatetsky.

Towards Web-scale Web querying [WWW vs. Internet]

Filed under: Linked Data,SPARQL — Patrick Durusau @ 4:17 pm

Towards Web-scale Web querying: The quest for intelligent clients starts with simple servers that scale, by Ruben Verborgh.

From the post:

Most public SPARQL endpoints are down for more than a day per month. This makes it impossible to query public datasets reliably, let alone build applications on top of them. It’s not a performance issue, but an inherent architectural problem: any server offering resources with an unbounded computation time poses a severe scalability threat. The current Semantic Web solution to querying simply doesn’t scale. The past few months, we’ve been working on a different model of query solving on the Web. Instead of trying to solve everything at the server side—which we can never do reliably—we should build our servers in such a way that enables clients to solve queries efficiently.

The Web of Data is filled with an immense amount of information, but what good is that if we cannot efficiently access those bits of information we need?

SPARQL endpoints aim to fulfill the promise of querying on the Web, but their notoriously low availability rates make that impossible. In particular, if you want high availability for your SPARQL endpoint, you have to compromise one of these:

  • offering public access,
  • allowing unrestricted queries,
  • serving many users.

Any SPARQL endpoint that tries to fulfill all of those inevitably has low availability. Low availability means unreliable query access to datasets. Unreliable access means we cannot build applications on top of public datasets.

Sure, you could just download a data dump and have your own endpoint, but then you move from Web querying to local querying, and that problem has been solved ages ago. Besides, it doesn’t give you access to up to date information, and who has enough storage to download a dump of the entire Web?

The whole “endpoint” concept will never work on a Web scale, because servers are subject to arbitrarily complex requests by arbitrarily many clients. (emphasis in original)

The prelude to an interesting proposal on Linked Data Fragments.

See the Linked Data Fragments website or Web-Scale Querying through Linked Data Fragments by Ruben Verborgh, et al. (LDOW2014 workshop).

The paper gives a primary motivation as:

There is one issue: it appears to be very hard to make a sparql endpoint available reliably. A recent survey examining 427 public endpoints concluded that only one third of them have an availability rate above 99%; not even half of all endpoints reach 95% [6]. To put this into perspective: 95% availability means the server is unavailable for one and a half days every month. These figures are quite disturbing given the fact that availability is usually measured in “number of nines” [5, 25], counting the number of leading nines in the availability percentage. In comparison, the fairly common three nines (99.9%) amounts to 8.8 hours of downtime per year. The disappointingly low availability of public sparql endpoints is the Semantic Web community’s very own “Inconvenient Truth”.

Curious that on the twenty-fifth anniversary of the WWW I would realize the WWW re-created a networking problem solved by the Internet.

Unlike the WWW, to say nothing of Linked Data and its cousins in the SW activity, the Internet doesn’t have a single point of failure.

Or put more positively, the Internet is fault-tolerant by design. In contrast, the SW is fragile, by design.

While I applaud the Linked Data Fragments exploration of the solution space, focusing on the design flaw of a single point of failure might be more profitable.
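
To make the single point of failure concrete: every client of a public endpoint looks roughly like the sketch below, and it does no work at all when that one server is down or overloaded. The endpoint URL and query are only examples.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# One server does all of the query evaluation for every client.
endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?city ?population WHERE {
      ?city a <http://dbpedia.org/ontology/City> ;
            <http://dbpedia.org/ontology/populationTotal> ?population .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

try:
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        print(row["city"]["value"], row["population"]["value"])
except Exception as exc:
    # The Linked Data Fragments argument in one line: the client has no
    # fallback when the single endpoint is unavailable or times out.
    print("endpoint unavailable:", exc)
```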

I first saw this in a tweet by Thomas Steiner.

What do policymakers want from researchers?…

Filed under: Government,Marketing,Topic Maps — Patrick Durusau @ 2:20 pm

What do policymakers want from researchers? Blogs, elevator pitches and good old fashioned press mentions. by Duncan Green.

From the post:

Interesting survey of US policymakers in December’s International Studies Quarterly journal. I’m not linking to it because it’s gated, thereby excluding more or less everyone outside a traditional academic institution (open data anyone?) but here’s a draft of What Do Policymakers Want From Us?, by Paul Avey and Michael Desch. The results are as relevant to NGO advocacy people trying to influence governments as they are to scholars. Maybe more so. I’ve added my own running translation.

Two tidbits to get you interested in the report:

First, unclassified newspaper articles were as important to policymakers as the classified information generated inside the government.

[role of scholars] The main contribution of scholars, in their view, was research. Second, and again somewhat surprisingly, they expressed a preference for scholars to produce “arguments” (what we would call theories) over the generation of specific “evidence” (what we think of as facts). In other words, despite their jaundiced view of cutting-edge tools and rarefied theory, the thing policymakers most want from scholars are frameworks for making sense of the world they have to operate in.’

While the article focuses on international relations, I suspect the same attitudes hold true for other areas as well.

The impact of newspaper articles suggests that marketing semantic technologies at geek conferences isn’t the road to broad success.

As for making sense of the world, topic maps support frameworks with that result but not without effort.

Perhaps a topic map-based end product that is such a framework would be a better product?

I first saw this in a tweet by Coffeehouse.

Raft Consensus Algorithm

Filed under: Algorithms,Consensus,Consistency,Paxos — Patrick Durusau @ 1:34 pm

Raft: Understandable Distributed Consensus

A compelling visualization of the Raft consensus algorithm!

I first saw the visualization link in a tweet by Aaron Bull Schaefer.

The visualization closes with pointers to more information on Raft.

One pointer is to the Raft Consensus Algorithm website.

From the homepage:

Raft is a consensus algorithm that is designed to be easy to understand. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

There are links to videos + slides, the raft-dev Google Group, and numerous implementations of the Raft algorithm.

The other pointer from the visualization is to the Raft paper: In Search of an Understandable Consensus Algorithm (PDF) by Diego Ongaro and John Ousterhout.

From the paper (section 4):

We had several goals in designing Raft: it must provide a complete and appropriate foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

Who would have thought that choosing more obvious/understandable approaches would have practical benefits?

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications? Given a choice between an alternative that was concise but subtle and one that was longer (either in lines of code or explanation) but more obvious, we chose the more obvious approach. Fortunately, in most cases the more obvious approach was also more concise. (emphasis added)

Understandability, now there’s a useful requirement.
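
As a small test of that understandability claim, the voting rule at the heart of leader election fits comfortably in a screenful. This is a simplified sketch of the RequestVote receiver rules from the paper, not a complete or production implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RaftServer:
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)   # entries of (term, command)

    def last_log_term(self) -> int:
        return self.log[-1][0] if self.log else 0

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Grant or refuse a vote, following the RequestVote rules in the Raft paper."""
        if term < self.current_term:               # stale candidate: refuse
            return self.current_term, False
        if term > self.current_term:                # newer term: adopt it, forget old vote
            self.current_term, self.voted_for = term, None
        # The candidate's log must be at least as up-to-date as ours.
        up_to_date = (last_log_term, last_log_index) >= (self.last_log_term(), len(self.log))
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False

# A candidate becomes leader once a majority of servers grant their votes for its term.
server = RaftServer()
print(server.handle_request_vote(term=1, candidate_id="s2",
                                 last_log_index=0, last_log_term=0))   # (1, True)
```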

Office of Incisive Analysis

Filed under: Funding,Government,Research Methods — Patrick Durusau @ 10:07 am

Office of Incisive Analysis Office Wide – Broad Agency Announcement (BAA) IARPA-BAA-14-02
BAA Release Date: March 10, 2014

FedBizOpps Reference

IARPA-BAA-14-02 with all Supporting Documents

From the webpage:

Synopsis

IARPA invests in high-risk, high-payoff research that has the potential to provide our nation with an overwhelming intelligence advantage over future adversaries. This BAA solicits abstracts/proposals for Incisive Analysis.

IA focuses on maximizing insights from the massive, disparate, unreliable and dynamic data that are – or could be – available to analysts, in a timely manner. We are pursuing new sources of information from existing and novel data, and developing innovative techniques that can be utilized in the processes of analysis. IA programs are in diverse technical disciplines, but have common features: (a) Create technologies that can earn the trust of the analyst user by providing the reasoning for results; (b) Address data uncertainty and provenance explicitly.

The following topics (in no particular order) are of interest to IA:

  • Methods for estimation and communication of uncertainty and risk;
  • Methods for understanding the process of analysis and potential impacts of technology;
  • Methods for measuring and improving human judgment and human reasoning;
  • Multidisciplinary approaches to processing noisy audio and speech;
  • Methods and approaches to quantifiable representations of uncertainty simultaneously accounting for multiple types of uncertainty;
  • Discovering, tracking and sorting emerging events and participating entities found in reports;
  • Accelerated system development via machine learning;
  • Testable methods for identifying individuals’ intentions;
  • Methods for developing understanding of how knowledge and ideas are transmitted and change within groups, organizations, and cultures;
  • Methods for analysis of social, cultural, and linguistic data;
  • Methods to construct and evaluate speech recognition systems in languages without a formalized orthography;
  • Multidisciplinary approaches to assessing linguistic data sets;
  • Mechanisms for detecting intentionally falsified representations of events and/or personas;
  • Methods for understanding and managing massive, dynamic data in images, video, and speech;
  • Analysis of massive, unreliable, and diverse data;
  • Methods to make machine learning more useful and automatic;
  • 4D geospatial/temporal representations to facilitate change detection and analysis;
  • Novel approaches for mobile augmented reality applied to analysis and collection;
  • Methods for assessments of relevancy and reliability of new data;
  • Novel approaches to data and knowledge management facilitating discovery, retrieval and manipulation of large volumes of information to provide greater access to interim analytic and processing products.

This announcement seeks research ideas for topics that are not addressed by emerging or ongoing IARPA programs or other published IARPA solicitations. It is primarily, but not solely, intended for early stage research that may lead to larger, focused programs through a separate BAA in the future, so periods of performance generally will not exceed 12 months.

Offerors should demonstrate that their proposed effort has the potential to make revolutionary, rather than incremental, improvements to intelligence capabilities. Research that primarily results in evolutionary improvement to the existing state of practice is specifically excluded.

Contracting Office Address:
Office of Incisive Analysis
Intelligence Advanced Research Projects Activity
Office of the Director of National Intelligence
ATTN: IARPA-BAA-14-02
Washington, DC 20511
Fax: 301-851-7673

Primary Point of Contact:
dni-iarpa-baa-14-02@iarpa.gov

The “topics … of interest” that caught my eye for topic maps are:

  • Methods for measuring and improving human judgment and human reasoning;
  • Discovering, tracking and sorting emerging events and participating entities found in reports;
  • Methods for developing understanding of how knowledge and ideas are transmitted and change within groups, organizations, and cultures;
  • Methods for analysis of social, cultural, and linguistic data;
  • Novel approaches to data and knowledge management facilitating discovery, retrieval and manipulation of large volumes of information to provide greater access to interim analytic and processing products.

Thinking of capturing the insights of users as they use and add content to a topic map as “evolutionary change.”

Others?

March 11, 2014

Cataloguing projects

Filed under: Archives,Cataloging,Law - Sources,Legal Informatics,Library — Patrick Durusau @ 8:27 pm

Cataloguing projects (UK National Archive)

From the webpage:

The National Archives’ Cataloguing Strategy

The overall objective of our cataloguing work is to deliver more comprehensive and searchable catalogues, thus improving access to public records. To make online searches work well we need to provide adequate data and prioritise cataloguing work that tackles less adequate descriptions. For example, we regard ranges of abbreviated names or file numbers as inadequate.

I was led to this delightful resource by a tweet from David Underdown, advising that his presentation from National Catalogue Day in 2013 was now online.

His presentation along with several others and reports about projects in prior years are available at this projects page.

I thought the presentation titled: Opening up of Litigation: 1385-1875 by Amanda Bevan and David Foster, was quite interesting in light of various projects that want to create new “public” citation systems for law and litigation.

I haven’t seen such a proposal yet that gives sufficient consideration to the enormity of the question: what do you do with old legal materials?

The litigation presentation could be a poster child for topic maps.

I am looking forward to reading the other presentations as well.

Number Theory and Algebra

Filed under: Algebra,Cryptography,Mathematics — Patrick Durusau @ 6:28 pm

A Computational Introduction to Number Theory and Algebra by Victor Shoup.

The first and second editions, published by Cambridge University Press, are available for download under a Creative Commons license.

From the preface of the second edition:

Number theory and algebra play an increasingly significant role in computing and communications, as evidenced by the striking applications of these subjects to such fields as cryptography and coding theory. My goal in writing this book was to provide an introduction to number theory and algebra, with an emphasis on algorithms and applications, that would be accessible to a broad audience. In particular, I wanted to write a book that would be appropriate for typical students in computer science or mathematics who have some amount of general mathematical experience, but without presuming too much specific mathematical knowledge.

Even though reliance on cryptography and vendors of cryptography is fading, you are likely to encounter people still using cryptography or legacy data “protected” by cryptography.
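
For a taste of the book’s algorithmic flavor, the extended Euclidean algorithm, which underlies modular inverses and hence RSA-style key arithmetic, fits in a few lines. A generic textbook sketch, not code from Shoup’s book:

```python
def extended_gcd(a: int, b: int):
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y = g."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mod_inverse(a: int, m: int) -> int:
    """Multiplicative inverse of a modulo m; exists only when gcd(a, m) == 1."""
    g, x, _ = extended_gcd(a % m, m)
    if g != 1:
        raise ValueError(f"{a} has no inverse modulo {m}")
    return x % m

# Toy RSA-style check: choosing d as the inverse of e mod φ(n) makes decryption undo encryption.
p, q, e = 61, 53, 17
n, phi = p * q, (p - 1) * (q - 1)
d = mod_inverse(e, phi)
message = 42
assert pow(pow(message, e, n), d, n) == message
```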

BTW, this is only one of several books that Cambridge University Press has published and allowed the final text to remain available.

Should you pen something appropriate and hopefully profitable for you and a publisher, Cambridge University Press should be on your short list.

Cambridge University Press is a great press and a good citizen of the academic world.
I first saw this in a tweet by Algebra Fact.

NASA’s Asteroid Grand Challenge Series

Filed under: Astroinformatics,Challenges — Patrick Durusau @ 6:10 pm

NASA’s Asteroid Grand Challenge Series

From the webpage:

Welcome to the Asteroid Grand Challenge Series sponsored by the NASA Tournament Lab! The Asteroid Grand Challenge Series will be comprised of a series of topcoder challenges to get more people from around the planet involved in finding all asteroid threats to human populations and figuring out what to do about them. In an increasingly connected world, NASA recognizes the value of the public as a partner in addressing some of the country’s most pressing challenges. Click here to learn more and participate in our debut challenge, Asteroid Data Hunter – launching 03/17/14!

From the details page:

The Asteroid Data Hunter challenge tasks competitors to develop significantly improved algorithms to identify asteroids in images from ground-based telescopes. The winning solution must increase the detection sensitivity, minimize the number of false positives, ignore imperfections in the data, and run effectively on all computers.

This is radically cool!

Lots of data, difficult problem, high stakes (ELE (extinction level event) prevention).

30,000 comics, 7,000 series – How’s Your Collection?

Filed under: Data,History,Social Sciences — Patrick Durusau @ 4:53 pm

Marvel Comics opens up its metadata for amazing Spider-Apps by Alex Dalenberg.

From the post:

It’s not as cool as inheriting superpowers from a radioactive spider, but thanks to Marvel Entertainment’s new API, you can now build Marvel Comics apps to your heart’s content.

That is, as long as you’re not making any money off of them. Nevertheless, it’s a comic geek’s dream. The Disney-owned company is opening up the data trove from its 75-year publishing history, including cover art, characters and comic book crossover events, for developers to tinker with.

That’s metadata for more than 30,000 comics and 7,000 series.

Marvel Developer.
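
If you want to poke at the metadata, a request looks roughly like the sketch below. The authentication scheme (a timestamp plus an MD5 hash of timestamp + private key + public key) and the response fields are as I recall them from the developer portal, so verify both there before relying on this:

```python
import hashlib
import time
import requests

# Keys come from the Marvel developer portal; these are placeholders.
PUBLIC_KEY = "your-public-key"
PRIVATE_KEY = "your-private-key"

ts = str(int(time.time()))
digest = hashlib.md5((ts + PRIVATE_KEY + PUBLIC_KEY).encode()).hexdigest()

resp = requests.get(
    "https://gateway.marvel.com/v1/public/characters",
    params={"name": "Spider-Man", "ts": ts, "apikey": PUBLIC_KEY, "hash": digest},
    timeout=30,
)
resp.raise_for_status()

for character in resp.json()["data"]["results"]:
    print(character["name"], "-", character["comics"]["available"], "comics")
```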

I know, another one of those non-commercial use licenses. I mean, Marvel paid for all of this content and then has the gall to not just give it away for free. What is the world coming to?

😉

Personally I think Marvel has the right to allow as much or as little access to their data as they please. If you come up with a way to make money using this content, ask Marvel for commercial permissions. I deeply suspect they will be more than happy to accommodate any reasonable request.

The comic book zealot uses are obvious but aren’t you curious about the comic books your parents read? Or that your grandparents read?

Speaking of contemporary history, a couple of other cultural goldmines: Playboy Cover to Cover Hard Drive – Every Issue From 1953 to 2010 and Rolling Stone.

I don’t own either one so I don’t know how hard it would be to get the content into machine-readable format.

Still, both would be a welcome contrast to main stream news sources.

I first saw this in a tweet by Bob DuCharme.

Data Science Challenge

Filed under: Challenges,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:22 pm

Data Science Challenge

Some details from the registration page:

Prerequisite: Data Science Essentials (DS-200)
Schedule: Twice per year
Duration: Three months from launch date
Next Challenge Date: March 31, 2014
Language: English
Price: USD $600

From the webpage:

Cloudera will release a Data Science Challenge twice each year. Each bi-quarterly project is based on a real-world data science problem involving a large data set and is open to candidates for three months to complete. During the open period, candidates may work on their project individually and at their own pace.

Current Data Science Challenge

The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014, and will cost USD $600.

In the U.S., Medicare reimburses private providers for medical procedures performed for covered individuals. As such, it needs to verify that the type of procedures performed and the cost of those procedures are consistent and reasonable. Finally, it needs to detect possible errors or fraud in claims for reimbursement from providers. You have been hired to analyze a large amount of data from Medicare and try to detect abnormal data — providers, areas, or patients with unusual procedures and/or claims.
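
The challenge’s data and scoring sit behind registration, but the shape of the task (flagging providers whose claim patterns sit far from their peers) can be sketched with a simple robust z-score. The file, column names, and threshold below are entirely illustrative, not Cloudera’s method:

```python
import pandas as pd

# Hypothetical columns: provider_id, procedure_code, claim_amount.
claims = pd.read_csv("medicare_claims_sample.csv")

# Average claim amount per provider within each procedure code.
per_provider = (claims.groupby(["procedure_code", "provider_id"])["claim_amount"]
                      .mean()
                      .rename("avg_amount")
                      .reset_index())

def robust_z(group: pd.DataFrame) -> pd.DataFrame:
    """Score each provider against peers billing the same procedure."""
    median = group["avg_amount"].median()
    mad = (group["avg_amount"] - median).abs().median() or 1e-9
    group["score"] = (group["avg_amount"] - median) / (1.4826 * mad)
    return group

scored = per_provider.groupby("procedure_code", group_keys=False).apply(robust_z)

# Providers whose average charge is wildly out of line for the procedure they billed.
print(scored[scored["score"].abs() > 5].sort_values("score", ascending=False))
```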

Register for the challenge.

Build a Winning Model

CCP candidates compete against each other and against a benchmark set by a committee including some of the world’s elite data scientists. Participants who surpass evaluation benchmarks receive the CCP: Data Scientist credential.

Lead the Field

Those with the highest scores from each Challenge will have an opportunity to share their solutions and promote their work on cloudera.com and via press and social media outlets. All candidates retain the full rights to their own work and may leverage their models outside of the Challenge as they choose.

Useful way to develop some street cred in data science.

The FIRST Act, Retro Legislation?

Filed under: Government,Government Data,Legal Informatics — Patrick Durusau @ 1:38 pm

Language in FIRST act puts United States at Severe Disadvantage Against International Competitors by Ranit Schmelzer.

From the press release:

The Scholarly Publishing and Academic Research Coalition (SPARC), an international alliance of nearly 800 academic and research libraries, today announced its opposition to Section 303 of H.R. 4186, the Frontiers in Innovation, Research, Science and Technology (FIRST) Act. This provision would impose significant barriers to the public’s ability to access the results of taxpayer-funded research.

Section 303 of the bill would undercut the ability of federal agencies to effectively implement the widely supported White House Directive on Public Access to the Results of Federally Funded Research and undermine the successful public access program pioneered by the National Institutes of Health (NIH) – recently expanded through the FY14 Omnibus Appropriations Act to include the Departments Labor, Education and Health and Human Services. Adoption of Section 303 would be a step backward from existing federal policy in the directive, and put the U.S. at a severe disadvantage among our global competitors.

“This provision is not in the best interests of the taxpayers who fund scientific research, the scientists who use it to accelerate scientific progress, the teachers and students who rely on it for a high-quality education, and the thousands of U.S. businesses who depend on public access to stay competitive in the global marketplace,” said Heather Joseph, SPARC Executive Director. “We will continue to work with the many bipartisan members of the Congress who support open access to publicly funded research to improve the bill.”

[the parade of horribles follows]

SPARC‘s press release never quotes a word from H.R. 4186. Not one. Commentary but nary a part of its object.

I searched at Thomas (the Congressional information service at the Library of Congress) for H.R. 4186 and came up empty by bill number. Switching to the Congressional Record for Monday, March 10, 2014, I did find the bill being introduced and the setting of a hearing on it. The GPO has not (as of today) posted the text of H.R. 4186, but when it does, follow this link: H.R. 4186.

Even more importantly, SPARC doesn’t point out who is responsible for the objectionable section appearing in the bill. Bills don’t write themselves and as far as I know, Congress doesn’t have a random bill generator.

The bottom line is that someone, an identifiable someone, asked for longer embargo wording to be included. If the SPARC press release is accurate, the most likely someone’s asked are Chairman Lamar Smith (R-TX 21st District) or Rep. Larry Bucshon (R-IN 8th District).

The Wikipedia page on the 8th Congressional District in Illinois needs to be updated but it also fails to mention that the 8th district is to the West and North-West of Chicago. You might want to check Bucshon‘s page at Wikipedia and links there to other resources.

Wikipedia on the 21st Congressional District of Texas places it north of San Antonio, the seventh largest city in the United States. Lamar Smith‘s page at Wikipedia has some interesting reading.

Odds are in and around Chicago and San Antonio there are people interested in longer embargo periods on federally funded research.

Those are at least some starting points for effective opposition to this legislation, assuming it was reported accurately by SPARC. Let’s drop the pose of disinterested legislators trying valiantly to serve the public good. Not impossible, just highly unlikely. Let’s argue about who is getting paid and for what benefits.

Or as Captain Ahab advises:

All visible objects, man, are but as pasteboard masks. But in each event –in the living act, the undoubted deed –there, some unknown but still reasoning thing puts forth the mouldings of its features from behind the unreasoning mask. If man will strike, strike through the mask! [Melville, Moby Dick, Chapter XXXVI]

Legislation as a “pasteboard mask” is a useful image. There is not a contour, dimple, shade or expression that wasn’t bought and paid for by someone. You have to strike through the mask to discover who.

Are you game?

PS: Curious, where would you go next (data wise, I don’t have the energy to lurk in garages) in terms of searching for the buyers of longer embargoes in H.R. 4186?

March 10, 2014

Data Science 101: Deep Learning Methods and Applications

Filed under: Data Science,Deep Learning,Machine Learning,Microsoft — Patrick Durusau @ 7:56 pm

Data Science 101: Deep Learning Methods and Applications by Daniel Gutierrez.

From the post:

Microsoft Research, the research arm of the software giant, is a hotbed of data science and machine learning research. Microsoft has the resources to hire the best and brightest researchers from around the globe. A recent publication is available for download (PDF): “Deep Learning: Methods and Applications” by Li Deng and Dong Yu, two prominent researchers in the field.

Deep sledding with twenty (20) pages of bibliography and pointers to frequently updated lists of resources (at page 8).

You did say you were interested in deep learning. Yes? 😉

Enjoy!

Orbital Computing – Electron Orbits That Is.

Filed under: Computer Science,HPC — Patrick Durusau @ 7:41 pm

Physicist proposes a new type of computing at SXSW. Check out orbital computing by Stacey Higginbotham.

From the post:

The demand for computing power is constantly rising, but we’re heading to the edge of the cliff in terms of increasing performance — both in terms of the physics of cramming more transistors on a chip and in terms of the power consumption. We’ve covered plenty of different ways that researchers are trying to continue advancing Moore’s Law — this idea that the number of transistors (and thus the performance) on a chip doubles every 18 months — especially the far out there efforts that take traditional computer science and electronics and dump them in favor of using magnetic spin, quantum states or probabilistic logic.

We’re going to add a new impossible that might become possible to that list thanks to Joshua Turner, a physicist at the SLAC National Accelerator Laboratory, who has proposed using the orbits of electrons around the nucleus of an atom as a new means to generate the binary states (the charge or lack of a charge that transistors use today to generate zeros and ones) we use in computing. He calls this idea orbital computing and the big takeaway for engineers is that one can switch the state of an electron’s orbit 10,000 times faster than you can switch the state of a transistor used in computing today.

That means you can still have the features of computing in that you use binary programming, but you just can compute more in less time. To get us to his grand theory, Turner had to take the SXSW audience through how computing works, how transistors work, the structure of atoms, the behavior of subatomic particles and a bunch of background on X-rays.

This would have been a presentation to see: Bits, Bittier Bits & Qubits: Physics of Computing

Try this SLAC Search for some publications by Joshua Turner.

It’s always fun to read about how computers will be able to process data more quickly. A techie sort of thing.

On the other hand, going 10,000 times faster with semantically heterogeneous data will get you to the wrong answer 10,000 times faster.

If you realize the answer is wrong, you may have time to try again.

What if you don’t realize the answer is wrong?

Do you really want to be the customs agent who stops a five-year-old because their name is similar to that of a known terrorist? Because the machine said they could not fly?

Excited about going faster, worried about data going by too fast for anyone to question its semantics.

Hubble Source Catalog

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:51 pm

Beta Version 0.3 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Search with Summary Form now (one row per match)
Search with Detailed Form now (one row per source)

Beta Version 0.3 of the HSC contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists in HLA version DR7.2 (data release 7.2) that are considered to be valid detections because they have flag values less than 5 (see more flag information).

The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012) .
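
Positional crossmatching of the kind described is easy to experiment with yourself. A minimal sketch using astropy, with toy coordinates rather than HSC data:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Two source lists from overlapping "visits" (toy positions in degrees).
visit_a = SkyCoord(ra=[10.0010, 10.5000, 11.0200] * u.deg,
                   dec=[41.0010, 41.2500, 40.9000] * u.deg)
visit_b = SkyCoord(ra=[10.0012, 10.4999, 12.0000] * u.deg,
                   dec=[41.0009, 41.2502, 40.0000] * u.deg)

# For each source in visit_a, find its nearest neighbour in visit_b.
idx, sep2d, _ = visit_a.match_to_catalog_sky(visit_b)

# Accept matches tighter than some radius; the HSC quotes residuals under ~10 mas,
# but a generous 1 arcsecond is used here for the made-up data.
matched = sep2d < 1 * u.arcsec
for i, (j, ok, sep) in enumerate(zip(idx, matched, sep2d)):
    status = f"matches B[{j}] at {sep.to(u.arcsec):.3f}" if ok else "no counterpart"
    print(f"A[{i}]: {status}")
```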

If you need training with this data set, see: A Hubble Source Catalog (HSC) Walkthrough.

Apache Tez 0.3 Released!

Filed under: GPU,MapReduce,Tez — Patrick Durusau @ 4:12 pm

Apache Tez 0.3 Released! by Bikas Saha.

From the post:

The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focusing on fundamentals and ironing out several key functions. The major action areas in this release were

  1. Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.
  2. Scalability. We tested the software on large clusters, very large data sets and large applications processing tens of TB each to make sure it scales well with both data-sets and machines.
  3. Fault Tolerance. Apache Tez executes a complex DAG workflow that can be subject to multiple failure conditions in clusters of commodity hardware and is highly resilient to these and other sorts of failures.
  4. Stability. A large number of bug fixes went into this release as early adopters and testers put the software through its paces and reported issues.

To prove the stability and performance of Tez, we executed complex jobs comprised of more than 50 different stages and tens of thousands of tasks on a fairly large cluster (> 300 Nodes, > 30TB data). Tez passed all our tests and we are certain that new adopters can integrate confidently with Tez and enjoy the same benefits as Apache Hive & Apache Pig have already.

I am curious how the Hadoop community is going to top 2013. I suspect Tez is going to be part of that answer!

CORDIS – EU research projects under FP7 (2007-2013)

Filed under: EU,Funding — Patrick Durusau @ 2:23 pm

CORDIS – EU research projects under FP7 (2007-2013)

Description:

This dataset contains projects funded by the European Union under the seventh framework programme for research and technological development (FP7) from 2007 to 2013. Grant information is provided for each project, including reference, acronym, dates, funding, programmes, participant countries, subjects and objectives. A smaller file is also provided without the texts for objectives.

The column separator is the “;” character.

The “Achievements” column is blank for all 22,653 projects/rows.
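
A quick way to confirm both points once you have the download; the file name below is whatever you saved it as, and you may need to adjust the encoding:

```python
import pandas as pd

# File name is illustrative; use the name of your CORDIS download.
projects = pd.read_csv("cordis-fp7projects.csv", sep=";", low_memory=False)

print(len(projects))                           # expect 22,653 rows
print(projects["Achievements"].notna().sum())  # expect 0: the column is blank throughout
print(projects.columns.tolist())               # reference, acronym, dates, funding, ...
```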

Can you suggest other sources with machine-readable data on the results from EU research projects under FP7 (2007-2013)?

Thanks!

I first saw this in a tweet by Stefano Bertolo.
