Archive for the ‘Crowd Sourcing’ Category

Tactical Advantage: I don’t have to know everything, just more than you.

Friday, January 12th, 2018

Mapping the Ghostly Traces of Abandoned Railroads – An interactive, crowdsourced atlas plots vanished transit routes by Jessica Leigh Hester.

From the post:

In the 1830s, a rail line linked Elkton, Maryland, with New Castle, Delaware, shortening the time it took to shuttle people and goods between the Delaware River and Chesapeake Bay. Today you’d never know it had been there. A photograph snapped years after the line had been abandoned captures a stone culvert halfway to collapse into the creek it spanned. Another image, captured even later, shows a relict trail that looks more like a footpath than a railroad right-of-way. The compacted dirt seems wide enough to accommodate no more than two pairs of shoes at a time.

The scar of the New Castle and Frenchtown Railroad barely whispers of the railcars that once barreled through. That’s what earned it a place on Andrew Grigg’s map.

For the past two years, Grigg, a transit enthusiast, has been building an interactive atlas of abandoned railroads. Using Google Maps, he lays the ghostly silhouettes of the lines over modern aerial imagery. His recreation of the 16-mile New Castle and Frenchtown Line crosses state lines and modern highways, marches through suburban housing developments, and passes near a cineplex, a Walmart, and a paintball field.
… (emphasis in original)

Great example of a project capturing travel paths that may be omitted from modern maps. Being omitted from a map doesn’t impact the potential use of an abandoned railway as an alternative to other routes.

Be sure to check ahead of time but digital navigation systems may have omitted discontinued railroads.

The same advantage obtains if you know which underpasses flood after a heavy rain, which streets are impassable, when trains are passing over certain crossings, all manner of information that isn’t captured by standard digital navigation systems.

What information can you add to a map that isn’t known to or thought to be important by others?

An Initial Reboot of Oxlos

Tuesday, April 18th, 2017

An Initial Reboot of Oxlos by James Tauber.

From the post:

In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.

Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.

The original project was called oxlos after the betacode transcription of the Greek word for “crowd”, a nod to “crowd-sourcing”. Work didn’t continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.

At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you’d like added to the site).
… (emphasis in the original)
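For the curious, here is a rough sketch of how votes on a single verb form might be resolved once several volunteers have weighed in. It is not how oxlos actually works: the function name, the thresholds, and the parse code strings are all my own placeholders (they are not real CATSS codes).

```python
from collections import Counter

def resolve_parse_code(votes, min_votes=3, min_share=0.6):
    """Return the winning parse code, or None if the crowd has not agreed yet.

    min_votes and min_share are assumed thresholds, chosen only for illustration.
    """
    if len(votes) < min_votes:
        return None  # too few volunteers have looked at this form
    code, count = Counter(votes).most_common(1)[0]
    # Accept the most common code only if enough of the crowd backs it.
    return code if count / len(votes) >= min_share else None

# Hypothetical votes for one verb form; the codes are made-up placeholders.
votes = ["CODE-A", "CODE-A", "CODE-B", "CODE-A"]
print(resolve_parse_code(votes))  # CODE-A (3 of 4 votes, above the 0.6 share)
```

Forms that come back as None would simply stay in the queue for more volunteers.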

Crowd sourcing, parse code errors in the CATSS LXX morphological analysis, Patrick Altman and James Tauber! What more could you ask for?

Well, assuming you enjoy Django development, or have Greek morphology, sign up at:

After mastering Greek, you don’t really want to lose it from lack of practice. Yes? Perfect opportunity for recent or even not so recent Classics and divinity majors.

I suppose that’s a nice way to say you won’t be encountering LXX Greek on ESPN or CNN. 😉

Uniting Journalists and Hackers?

Friday, April 22nd, 2016

Kevin Gosztola’s post: US News Editors Find It Increasingly Difficult to Defend First Amendment is very sad, especially where he covers the inability to obtain records/information:

Forty-four percent of editors indicated their news organization was less able to go on the offense and sue to open up access to information.

“Newspaper-based (and especially TV-based) companies have tougher budgets and are less willing to spend on lawyers to challenge sunshine and public records violations,” one editor acknowledged.

Another editor declared, “The loss of journalist jobs and publishers’ declining profits means there’s less opportunity to pursue difficult stories and sue for access to information.” The costs of litigation constrain organizations.

“Government agencies are well aware that we do not have the money to fight. More and more, their first response to our records request is, ‘Sue us if you want to get the records,’” one editor stated.

What if the journalism and hacker communities can unite to change:

‘Sue us if you want to get the records’

into a crowd-sourced:

‘Hack us if you want to get the records’

The effectiveness of crowd-sourcing requires no documentation.

Public service hacking by crowds of hackers would greatly reduce the legal fees expended to obtain records.

There are two elements missing for effective crowd-sourced hacking in support of journalists:

  1. Notice of what records journalists want.
  2. Disconnecting hackers from journalists.

Both elements could be satisfied by a public records request board that enables journalists to anonymously request records and allows anonymous responses with pointers to the requested records.

If subpoenaed, give the authorities the records that were posted anonymously. (One assumes hackers won’t leave their fingerprints on them.)

There may be such a notice board already frequented by journalists and hackers so please pardon my ignorance if that is the case.

From Kevin’s post I got the impression that isn’t the case.

PS: If you have ethical qualms about this approach, recall the executive branch decided to lie at will to judicial fact-finders, thereby rendering judicial review a farce. They have no one but themselves to blame for suggestions to by-pass that process.

History Unfolded: US Newspapers and the Holocaust [Editors/Asst. Editors?]

Tuesday, April 12th, 2016

History Unfolded: US Newspapers and the Holocaust

From the webpage:

What did American newspapers report about the Holocaust during World War II? Citizen historians participating in History Unfolded: US Newspapers and the Holocaust will help the US Holocaust Memorial Museum answer this question.

Your Role

Participants will explore their local newspapers for articles about the Holocaust, and submit their research into a centralized database. The collected data will show trends in American reporting.

Citizen historians like you will explore Holocaust history as both an American story and a local story, learn how to use primary sources in historical research, and challenge assumptions about American knowledge of and responses to the Holocaust.

Project Outcomes

Data from History Unfolded: U.S. Newspapers and the Holocaust will be used for two main purposes:
to inform the Museum’s upcoming exhibition on Americans and the Holocaust, and to enhance scholarly research about the American press and the Holocaust.

Our Questions

  • What did people in your community know about the event?
  • Was the information accurate?
  • What do the newspapers tell us about how local and national leaders and community members reacted to news about the event?

Historical Background

During the 1930s, a deeply rooted isolationism pervaded American public opinion. Americans were scornful of Europe’s inability to organize its affairs following the destruction of WWI and feared being drawn into European matters. As a result, news about the Holocaust arrived in an America fraught with isolation, cynicism, and fear of being deceived by government propaganda. Even so, the way the press told the story of the Holocaust—the space allocated, the location of the news in the paper, and the editorial opinions—shaped American reactions.

U.S. Press Coverage of the Holocaust

The press has influence on public opinion. Media attention enhances the importance of an issue in the eyes of the public. The U.S. press had reported on Nazi violence against Jews in Germany as early as 1933. It covered extensively the Nuremberg Laws of 1935 and the expanded German antisemitic legislation of 1938 and 1939. The nationwide state-sponsored violence of November 9-10, 1938, known as Kristallnacht, made front page news in dailies across the U.S.

As the magnitude of anti-Jewish violence increased in 1939-1941, many American newspapers ran descriptions of German shooting operations, first in Poland and later after the invasion of the Soviet Union. As early as July 2, 1942, the New York Times reported on the operations of the killing center in Chelmno, based on sources from the Polish underground. The article, however, appeared on page six of the newspaper.

During the Holocaust, the American press did not always publicize reports of Nazi atrocities in full or with prominent placement. For example, the New York Times, the nation’s leading newspaper, generally deemphasized the murder of the Jews in its news coverage. Although the Times covered the December 1942 statement of the Allies condemning the mass murder of European Jews on its front page, it placed coverage of the more specific information released on page ten, significantly minimizing its importance. Similarly, on July 3, 1944, the Times provided on page 3 a list by country of the number of Jews “eradicated”; the Los Angeles Times places the report on page 5.

How did your hometown cover these events?

I first saw this in What did Americans know as the Holocaust unfolded? Quite a lot, it turns out. by Tara Bahrampour, follow @TaraBahrampour.

I have registered for the project and noticed that although author bylines are captured, there doesn’t seem to be a routine to capture editors, assistant editors, etc. Newspapers don’t assemble themselves.

The site focuses on twenty (20) major events, starting with “Dachau Opens,” March 22, 1933 and ending with “FDR Delivers His Fourth Inaugural Address,” January 20, 1945.

The interfaces seem very intuitive and I am looking forward to searching my local newspaper for one or more of these events.

PS: Anti-Semites didn’t and don’t exist in isolation. Graphing relationships over history in your community may help explain some of the news coverage you do or don’t find.

Tracking Down Superdelegates – Data In Action

Saturday, April 9th, 2016

The Republican party has long been known for its efforts to exclude voters from the voting process altogether. Voter ID laws, purging voter rolls, literacy tests, etc.

The Democratic party skips the difficulty of dealing with voters because votes are irrelevant in the selection of superdelegates for presidential nominations. Gives the appearance of greater democracy while being on par with the Republicans.

To counter the Democratic Party’s lack of democracy, Spenser Thayer created the Superdelegate Hit List.

There was a predictable reaction of the overly sensitive media types and the privileged (by definition, superdelegates number among the privileged) to the use of “hit list” in the name.

Avoidance of personal accountability is a characteristic of the privileged.

The goal being to persuade the privileged, not to ride them down, the more appropriate response to privilege, the site name was changed to “Superdelegate List,” removing “hit.”

Here is the current list of superdelegates.

As you can see, the list is “sparse” on contact information.

You can help the Democratic Party be democratic by contributing data.

Pentagon Confirms Crowdsourcing of Map Data

Tuesday, April 5th, 2016

I have mentioned before (Tracking NSA/CIA/FBI Agents Just Got Easier; The DEA is Stalking You!) how citizens can invite federal agents to join the goldfish bowl being prepared for the average citizen.

Of course, that’s just me saying it, unless and until the Pentagon confirms the crowdsourcing of map data!

Aliya Sternstein writes in Soldiers to Help Crowdsource Spy Maps:

“What a great idea if we can get our soldiers adding fidelity to the maps and operational picture that we already have” in Defense systems, Gordon told Nextgov. “All it requires is pushing out our product in a manner that they can add data to it against a common framework.”

Comparing mapping parties to combat support activities, she said, soldiers are deployed in some pretty remote areas where U.S. forces are not always familiar with the roads and the land, partly because they tend to change.

If troops have a base layer, “they can do basically the same things that that social party does and just drop pins and add data,” Gordon said from a meeting room at the annual Esri conference. “Think about some of the places in Africa and some of the less advantaged countries that just don’t have addresses in the way we do” in the United States.

Of course, you already realize the value of crowd-sourcing surveillance of government agents but for the c-suite crowd, confirmation from a respected source (the Pentagon) may help push your citizen surveillance proposal forward.

BTW, while looking at Army GeoData research plans (pages 228-232), I ran across this passage:

This effort integrates behavior and population dynamics research and analysis to depict the operational environment including culture, demographics, terrain, climate, and infrastructure, into geospatial frameworks. Research exploits existing open source text, leverages multi-media and cartographic materials, and investigates data collection methods to ingest geospatial data directly from the tactical edge to characterize parameters of social, cultural, and economic geography. Results of this research augment existing conventional geospatial datasets by providing the rich context of the human aspects of the operational environment, which offers a holistic understanding of the operational environment for the Warfighter. This item continues efforts from Imagery and GeoData Sciences, and Geospatial and Temporal Information Structure and Framework and complements the work in PE 0602784A/Project T41.

Doesn’t that just reek with subjects that would be identified differently in intersecting information systems?

One solution would be to fashion top down mapping systems that are months if not years behind demands in an operational environment. Sort of like tanks that overheat in jungle warfare.

Or you could do something a bit more dynamic that provides a “good enough” mapping for operational needs and yet also has the information necessary to integrate it with other temporary solutions.

Photo-Reconnaissance For Your Revolution

Sunday, February 21st, 2016

Using Computer Vision to Analyze Aerial Big Data from UAVs During Disasters by Patrick Meier.

From the post:

Recent scientific research has shown that aerial imagery captured during a single 20-minute UAV flight can take more than half-a-day to analyze. We flew several dozen flights during the World Bank’s humanitarian UAV mission in response to Cyclone Pam earlier this year. The imagery we captured would’ve taken a single expert analyst a minimum 20 full-time workdays to make sense of. In other words, aerial imagery is already a Big Data problem. So my team and I are using human computing (crowdsourcing), machine computing (artificial intelligence) and computer vision to make sense of this new Big Data source.

Revolutionaries are chronically understaffed so Meier’s advice for natural disasters is equally applicable to disasters known as governments.

Imagine the Chicago police riot or Watts or the Rodney King riot where NGO leadership had real time data on government forces.

Meier’s book, Digital Humanitarians, is a good advocacy book for the use of technology during “disasters.” It is written for non-specialists so you will have to look to other resources to build up your technical infrastructure.

PS: With the advent of cheap drones, imagine stitching together images from multiple drones with overlapping coverage. Could provide better real-time combat intelligence than more expensive options.

I first saw this in a tweet by Kirk Borne.

Illusory Truth (Illusory Publication)

Monday, January 18th, 2016

On Known Unknowns: Fluency and the Neural Mechanisms of Illusory Truth by Wei-Chun Wang, et al. Journal of Cognitive Neuroscience, Posted Online January 14, 2016. (doi:10.1162/jocn_a_00923)


The “illusory truth” effect refers to the phenomenon whereby repetition of a statement increases its likelihood of being judged true. This phenomenon has important implications for how we come to believe oft-repeated information that may be misleading or unknown. Behavioral evidence indicates that fluency or the subjective ease experienced while processing a statement underlies this effect. This suggests that illusory truth should be mediated by brain regions previously linked to fluency, such as the perirhinal cortex (PRC). To investigate this possibility, we scanned participants with fMRI while they rated the truth of unknown statements, half of which were presented earlier (i.e., repeated). The only brain region that showed an interaction between repetition and ratings of perceived truth was PRC, where activity increased with truth ratings for repeated, but not for new, statements. This finding supports the hypothesis that illusory truth is mediated by a fluency mechanism and further strengthens the link between PRC and fluency.

Whether you are crowd sourcing authoring of a topic map, measuring sentiment or having content authored by known authors, you are unlikely to want it populated by illusory truths. That is, truths your sources would swear to but that are in fact false (from a certain point of view).

I would like to say more about what this article reports but it is an “illusory publication” that resides behind a pay-wall so I don’t know what it says in fact.

Isn’t that ironic? An article on illusory truth that cannot substantiate its own claims. It can only repeat them.

I first saw this in a tweet by Stefano Bertolo.


Monday, October 19th, 2015


From the webpage:

The CrowdTruth Framework implements an approach to machine-human computing for collecting annotation data on text, images and videos. The approach is focussed specifically on collecting gold standard data for training and evaluation of cognitive computing systems. The original framework was inspired by the IBM Watson project for providing improved (multi-perspective) gold standard (medical) text annotation data for the training and evaluation of various IBM Watson components, such as Medical Relation Extraction, Medical Factor Extraction and Question-Answer passage alignment.

The CrowdTruth framework supports the composition of CrowdTruth gathering workflows, where a sequence of micro-annotation tasks can be configured and sent out to a number of crowdsourcing platforms (e.g. CrowdFlower and Amazon Mechanical Turk) and applications (e.g. Expert annotation game Dr. Detective). The CrowdTruth framework has a special focus on micro-tasks for knowledge extraction in medical text (e.g. medical documents, from various sources such as Wikipedia articles or patient case reports). The main steps involved in the CrowdTruth workflow are: (1) exploring & processing of input data, (2) collecting of annotation data, and (3) applying disagreement analytics on the results. These steps are realised in an automatic end-to-end workflow, that can support a continuous collection of high quality gold standard data with feedback loop to all steps of the process. Have a look at our presentations and papers for more details on the research.
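To make step (3), “disagreement analytics,” a little more concrete, here is a toy sketch of my own. It is not the CrowdTruth framework’s code or its actual metrics; the function names, the relation labels, and the choice of cosine similarity are assumptions made purely for illustration. The idea is to keep the disagreement as data: score each label by the share of workers who chose it, and score each worker by how closely their choices track everyone else’s.

```python
import math
from collections import Counter

def label_scores(annotations):
    """annotations: {worker: set of labels}. Returns {label: share of workers choosing it}."""
    counts = Counter(label for labels in annotations.values() for label in labels)
    n = len(annotations)
    return {label: c / n for label, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse label-count vectors."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def worker_agreement(annotations, worker):
    """How closely one worker's labels match the pooled labels of all the others."""
    mine = Counter(annotations[worker])
    rest = Counter(
        label
        for other, labels in annotations.items()
        if other != worker
        for label in labels
    )
    return cosine(mine, rest)

# Hypothetical annotations of one sentence with candidate medical relations.
annotations = {
    "w1": {"treats"},
    "w2": {"treats"},
    "w3": {"treats", "causes"},
    "w4": {"causes"},
}
print(label_scores(annotations))
print(round(worker_agreement(annotations, "w1"), 3))
print(round(worker_agreement(annotations, "w4"), 3))
```

Note that nothing here declares a single “correct” label. A sentence where the scores are split is flagged as ambiguous rather than forced into one answer.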

An encouraging quote from Truth is a Lie by Lora Aroyo.

the idea of truth is a fallacy for semantic interpretation and needs to be changed

I don’t disagree but observe a “crowdtruth” with disagreements is a variant of “truth.” What variant of “truth” is of interest to your client is an important issue.

CIA analysts, for example, have little interest in crowdtruths that threaten their prestige and/or continued employment. “Accuracy” is only one aspect of any truth.

If your client is sold on crowdtruths, by all means take up the banner on their behalf. Always remembering:

There are no facts, only interpretations. (Nietzsche)

Which interpretation interests you?

1.5 Million Slavery Era Documents Will Be Digitized…

Thursday, June 25th, 2015

1.5 Million Slavery Era Documents Will Be Digitized, Helping African Americans to Learn About Their Lost Ancestors

From the post:

The Freedmen’s Bureau Project — a new initiative spearheaded by the Smithsonian, the National Archives, the Afro-American Historical and Genealogical Society, and the Church of Jesus Christ of Latter-Day Saints — will make available online 1.5 million historical documents, finally allowing ancestors [sic. descendants] of former African-American slaves to learn more about their family roots. Near the end of the US Civil War, The Freedmen’s Bureau was created to help newly-freed slaves find their footing in postbellum America. The Bureau “opened schools to educate the illiterate, managed hospitals, rationed food and clothing for the destitute, and even solemnized marriages.” And, along the way, the Bureau gathered handwritten records on roughly 4 million African Americans. Now, those documents are being digitized with the help of volunteers, and, by the end of 2016, they will be made available in a searchable database. According to Hollis Gentry, a Smithsonian genealogist, this archive “will give African Americans the ability to explore some of the earliest records detailing people who were formerly enslaved,” finally giving us a sense “of their voice, their dreams.”

You can learn more about the project by watching the video below, and you can volunteer your own services here.

A crowd sourced project that has a great deal of promise with regard to records on 4 million African Americans, who were previously held as slaves.

Making the documents “searchable” will be of immense value. However, imagine capturing the myriad relationships documented in these records so that subsequent searchers can more quickly find relationships you have already documented.

Finding former slaves with a common owner or other commonalities, could be the clues others need to untangle a past we only see dimly.

Topic maps are a nice fit for this work.

Crowdsourcing Courses

Tuesday, April 28th, 2015

Kurt Luther is teaching a crowdsourcing course this Fall and has a partial list of crowdsourcing courses.

Any more to suggest?

Kurt tweets about crowdsourcing and history so you may want to follow him on Twitter.

Flock: Hybrid Crowd-Machine Learning Classifiers

Monday, March 16th, 2015

Flock: Hybrid Crowd-Machine Learning Classifiers by Justin Cheng and Michael S. Bernstein.


We present hybrid crowd-machine learning classifiers: classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weigh these features using machine learning to produce models that are accurate and use human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models that can improve on both algorithm performance and human judgment, and accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach. To generate informative features, Flock asks the crowd to compare paired examples, an approach inspired by analogical encoding. The crowd’s efforts can be focused on specific subsets of the input space where machine-extracted features are not predictive, or instead used to partition the input space and improve algorithm performance in subregions of the space. An evaluation on six prediction tasks, ranging from detecting deception to differentiating impressionist artists, demonstrated that aggregating crowd features improves upon both asking the crowd for a direct prediction and off-the-shelf machine learning features by over 10%. Further, hybrid systems that use both crowd-nominated and machine-extracted features can outperform those that use either in isolation.
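A minimal sketch of the general idea, as I read the abstract: the crowd nominates yes/no features and labels each example with them, and a learner then weighs those features. This is not Flock’s code. The feature names, the tiny data set, and the use of plain logistic regression trained by gradient descent are all my assumptions.

```python
import math

def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit a logistic regression by stochastic gradient descent.

    rows: lists of 0/1 crowd-labelled feature values; labels: 0/1 outcomes.
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict(weights, bias, x):
    """Probability that example x belongs to class 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical crowd-nominated features for a "is this statement deceptive?" task.
feature_names = ["vague_on_details", "overly_emphatic", "mentions_time"]
rows = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
labels = [1, 1, 0, 0, 1, 0]  # made-up ground truth, 1 = deceptive

weights, bias = train_logistic(rows, labels)
for name, w in zip(feature_names, weights):
    print(name, round(w, 2))
```

The learned weights stay attached to features a person can read, which is the selling point: you can see which crowd suggestions carried the prediction.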

Let’s see, suggest predictive features (subject identifiers in the non-topic map technical sense) and label data (identify instances of a subject), sounds a lot easier than some of the tedium I have seen for authoring a topic map.

I particularly like the “inducing” of features versus relying on a crowd to suggest identifying features. I suspect that would work well in a topic map authoring context, sans the machine learning aspects.

This paper is being presented this week, CSCW 2015, so you aren’t too far behind. 😉

How would you structure an inducement mechanism for authoring a topic map?

“The Whole Is Greater Than the Sum of Its Parts”

Tuesday, March 3rd, 2015

“The Whole Is Greater Than the Sum of Its Parts”: Optimization in Collaborative Crowdsourcing by Habibur Rahman, et al.


In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science resort to this special form of human-based computing, where, crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collaborative crowdsourcing process is the aspect of successful collaboration among the workers, which, for the first time, is formalized and then optimized in this work. Our formalism considers two main collaboration-related human factors, affinity and upper critical mass, appropriately adapted from organizational science and social theories. Our contributions are (a) proposing a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing application using Amazon Mechanical Turk, as well as conduct synthetic data analyses on scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.
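To get a feel for the two human factors, here is a deliberately naive greedy heuristic of my own. It is not one of the paper’s algorithms and carries none of their guarantees. The worker names and affinity numbers are invented; “critical mass” is treated simply as a cap on group size.

```python
def greedy_groups(workers, affinity, critical_mass):
    """Form groups of at most critical_mass workers.

    affinity maps frozenset({a, b}) to a pairwise score. Each group is seeded
    with the next unassigned worker, then repeatedly takes the remaining worker
    with the highest total affinity to the current members.
    """
    remaining = list(workers)
    groups = []
    while remaining:
        group = [remaining.pop(0)]
        while remaining and len(group) < critical_mass:
            best = max(
                remaining,
                key=lambda w: sum(affinity[frozenset((w, g))] for g in group),
            )
            remaining.remove(best)
            group.append(best)
        groups.append(group)
    return groups

# Hypothetical workers and pairwise affinities.
workers = ["a", "b", "c", "d"]
affinity = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("a", "d")): 0.2,
    frozenset(("b", "c")): 0.3,
    frozenset(("b", "d")): 0.1,
    frozenset(("c", "d")): 0.8,
}
print(greedy_groups(workers, affinity, critical_mass=2))  # [['a', 'b'], ['c', 'd']]
```

A greedy pass like this can be badly wrong on larger inputs, which is presumably why the authors bother with hardness results and approximation algorithms.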

Heavy sledding but given the importance of crowd sourcing and the potential for any increase in productivity, well worth the effort!

I first saw this in a tweet by Dave Rubal.


Saturday, February 28th, 2015


Algorithmia was born in 2013 with the goal of advancing the art of algorithm development, discovery and use. As developers ourselves we believe that given the right tools the possibilities for innovation and discovery are limitless.

Today we build what we believe to be the next era of programming: a collaborative, always live and community driven approach to making the machines that we interact with better.

The community drives the Algorithmia API. One API that exposes the collective knowledge of algorithm developers across the globe.

Currently in private beta but sounds very promising!

I first saw Algorithmia mentioned in Algorithmia API Exposes Collective Knowledge of Developers by Martin W. Brennan.

Intelligence Sharing, Crowd Sourcing and Good News for the NSA

Monday, February 16th, 2015

Lisa Vaas posted an entertaining piece today with the title: Are Miami cops really flooding Waze with fake police sightings?. Apparently an NBC affiliate (not FOX, amazing) tried its hand at FUD, alleging that Miami police officers were gaming Waze.

There is a problem with that theory, which Lisa points out quoting Julie Mossler, a spokesperson for Waze:

Waze algorithms rely on crowdsourcing to confirm or negate what has been reported on the road. Thousands of users in Florida do this, both passively and actively, every day. In addition, we place greater trust in reports from heavy users and terminate accounts of those whose behavior demonstrate a pattern of contributing false information. As a result the Waze map will remain reliable and updated to the minute, reflecting real-time conditions.
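Waze’s actual algorithms are not public as far as I know, so take the following as a toy illustration of the principle in that quote and nothing more. The trust weights, user names, and function are all hypothetical: each confirmation or denial of a report counts in proportion to how much the system trusts the user.

```python
def report_confidence(votes, trust, default_trust=1.0):
    """Trust-weighted share of users confirming a report.

    votes: list of (user, confirms) pairs; trust: {user: weight}.
    Users missing from trust get default_trust.
    """
    total = sum(trust.get(user, default_trust) for user, _ in votes)
    if total == 0:
        return 0.0
    confirmed = sum(trust.get(user, default_trust) for user, ok in votes if ok)
    return confirmed / total

# Three fresh accounts report a police sighting; one heavy user says nothing is there.
trust = {"heavy_user": 5.0}
votes = [
    ("new_a", True),
    ("new_b", True),
    ("new_c", True),
    ("heavy_user", False),
]
print(report_confidence(votes, trust))  # 0.375
```

Three against one by head count, but the weighted score lands below one half, so the fake sighting loses.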


See Lisa’s post for the blow-by-blow account of this FUD attempt by the NBC affiliate.

However foolish an attempt to game Waze would be, it is a good example to promote the sharing of intelligence.

Think about it. Rather than the consensus poop that emerges as the collaboration of the senior management in intelligence agencies, why not share all intelligence between agencies, among working analysts addressing the same areas or issues? Make the “crowd” people who have similar security clearances and common subject areas. And while contributions are trackable within an agency, to the “crowd,” everyone has a handle and their contributions to shared intelligence are voted up or down. Just like with Waze, people will develop reputations within the system.
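One way the handle idea could be wired up, sketched under my own assumptions (the key, the names, and the class are hypothetical, and no agency system is being described): each agency derives a stable handle from an analyst’s identity with a keyed hash, so the agency can trace a handle back to its owner while the crowd sees only the handle and its running vote tally.

```python
import hashlib
import hmac
from collections import defaultdict

def handle_for(analyst_id, agency_key):
    """Derive a stable pseudonymous handle; only holders of agency_key can recompute it."""
    digest = hmac.new(agency_key, analyst_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "analyst-" + digest[:12]

class SharedPool:
    """Tracks up/down votes per handle across agencies."""

    def __init__(self):
        self.reputation = defaultdict(int)

    def vote(self, handle, up):
        self.reputation[handle] += 1 if up else -1

agency_key = b"hypothetical-agency-secret"  # placeholder, not a real key
alice = handle_for("alice.smith", agency_key)
bob = handle_for("bob.jones", agency_key)

pool = SharedPool()
pool.vote(alice, up=True)
pool.vote(alice, up=True)
pool.vote(bob, up=False)
print(dict(pool.reputation))
```

The same trick could put handles on the intelligence items themselves, hiding their agency of origin until trust in the system builds.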

I assume for turf reasons you could put handles on the intelligence so the participants would not know its origins as well, just until people started building up trust in the system.

Changing the cultures at the intelligence agencies, which hasn’t succeeded since 9/11, would require a more dramatic approach than has been tried to date. My suggestion is to give the Inspectors General the ability to block promotions and/or fire people in the intelligence agencies who don’t actively promote the sharing of intelligence. Where “actively promotes” is measured by intelligence shared and not activities to plan to share intelligence, etc.

Unless and until there are consequences for the failure of members of the intelligence community to put the interests of their employers (in this case, citizens of the United States) above their own or that of their agency, the failure to share intelligence since 9/11 will continue.

PS: People will object that the staff in question have been productive, loyal, etc., etc. in the past. The relevant question is whether they have the skills and commitment that are required now. The answer to that last question is either yes or no. Employment is an opportunity to perform, not an entitlement.

Yet More “Hive” Confusion

Wednesday, December 10th, 2014

The New York Times R&D Lab releases Hive, an open-source crowdsourcing tool by Justin Ellis.

From the post:

A few months ago we told you about a new tool from The New York Times that allowed readers to help identify ads inside the paper’s massive archive. Madison, as it was called, was the first iteration on a new crowdsourcing tool from The New York Times R&D Lab that would make it easier to break down specific tasks and get users to help an organization get at the data they need.

Today the R&D Lab is opening up the platform that powers the whole thing. Hive is an open-source framework that lets anyone build their own crowdsourcing project. The code responsible for Hive is now available on GitHub. With Hive, a developer can create assignments for users, define what they need to do, and keep track of their progress in helping to solve problems.

Not all that long ago, I penned: Avoiding “Hive” Confusion, which pointed out the possible confusion between Apache Hive and High-performance Integrated Virtual Environment (HIVE), in mid to late October, 2014. Now, barely two months later we have another “Hive” in the information technology field.

I have no idea how many “hives” there are inside or outside of IT but as of today, I can name at least three (3).

Have you ever thought that semantic confusion is part and parcel of the human condition? Can be allowed for, can be compensated for, but can never be eliminated.

Madison: Semantic Listening Through Crowdsourcing

Tuesday, October 28th, 2014

Madison: Semantic Listening Through Crowdsourcing by Jane Friedhoff.

From the post:

Our recent work at the Labs has focused on semantic listening: systems that obtain meaning from the streams of data surrounding them. Chronicle and Curriculum are recent examples of tools designed to extract semantic information (from our corpus of news coverage and our group web browsing history, respectively). However, not every data source is suitable for algorithmic analysis–and, in fact, many times it is easier for humans to extract meaning from a stream. Our new projects, Madison and Hive, are explorations of how to best design crowdsourcing projects for gathering data on cultural artifacts, as well as provocations for the design of broader, more modular kinds of crowdsourcing tools.

(image omitted)

Madison is a crowdsourcing project designed to engage the public with an under-viewed but rich portion of The New York Times’s archives: the historical ads neighboring the articles. News events and reporting give us one perspective on our past, but the advertisements running alongside these articles provide a different view, giving us a sense of the culture surrounding these events. Alternately fascinating, funny and poignant, they act as commentary on the technology, economics, gender relations and more of that time period. However, the digitization of our archives has primarily focused on news, leaving the ads with no metadata–making them very hard to find and impossible to search for them. Complicating the process further is that these ads often have complex layouts and elaborate typefaces, making them difficult to differentiate algorithmically from photographic content, and much more difficult to scan for text. This combination of fascinating cultural information with little structured data seemed like the perfect opportunity to explore how crowdsourcing could form a source of semantic signals.

From the project’s homepage:

Help preserve history with just one click.

The New York Times archives are full of advertisements that give glimpses into daily life and cultural history. Help us digitize our historic ads by answering simple questions. You’ll be creating a unique resource for historians, advertisers and the public — and leaving your mark on history.

Get started with our collection of ads from the 1960s (additional decades will be opened later)!

I would like to see a Bible transcription project that was that user friendly!

But, then the goal of the New York Times is to include as many people as possible.

Looking forward to more news on Madison!

Bringing chemical synthesis to the masses

Monday, September 8th, 2014

Bringing chemical synthesis to the masses by Michael Gross.

From the post:

You too can create thousands of new compounds and screen them for a desired activity. That is the promise of a novel approach to building chemical libraries, which only requires simple building blocks in water, without any additional reagents or sample preparation.1

Jeffrey Bode from ETH Zurich and his co-worker Yi-Lin Huang took inspiration both from nature’s non-ribosomal peptide synthesis and from click chemistry. Nature uses specialised non-ribosomal enzymes to create a number of unusual peptides outside the normal paths of protein biosynthesis including, for instance, pharmaceutically relevant peptides like the antibiotic vancomycin. Bode and Huang have now produced these sorts of compounds without cells or enzymes, simply relying on the right chemistry.

Given the simplicity of the process and the absence of toxic reagents and by-products, Bode anticipates that it could even be widely used by non-chemists. ‘Our idea is to provide a quick way to make bioactive molecules just by mixing the components in water,’ Bode explains. ‘We would like to use this as a platform for chemistry that anyone can do, including scientists in other fields, high school students and farmers. Anyone could prepare libraries in a few hours with a micropipette, explore different combinations of building blocks and culture conditions along with simple assays to find novel molecules.’

Bode either wasn’t a humanities major or he missed the class on keeping lay people away from routine tasks. Everyone knows that routine tasks, like reading manuscripts, must be reserved for graduate students under the fiction that only an “expert” can read non-printed material.

To be fair, there are manuscript characters or usages that require an expert opinion but those can be quickly isolated by statistical analysis of disagreement between different readers, assuming effective transcription interfaces for manuscripts and a large enough body of readers.
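A minimal sketch of what "statistical analysis of disagreement" might look like. Nothing here comes from the article: the function names, the sample readings, and the 0.75 threshold are all assumptions for illustration, and the sketch presumes the readers' transcriptions have already been aligned token by token.

```python
from collections import Counter

def agreement_by_position(transcriptions):
    """Return (majority_reading, agreement_fraction) for each token position.

    Assumes every reader's transcription is already aligned to the same
    number of tokens, which a real project would have to arrange first.
    """
    results = []
    for readings in zip(*transcriptions):
        counts = Counter(readings)
        majority, votes = counts.most_common(1)[0]
        results.append((majority, votes / len(readings)))
    return results

def flag_for_expert(transcriptions, threshold=0.75):
    """Indices of token positions whose agreement falls below threshold."""
    per_position = agreement_by_position(transcriptions)
    return [i for i, (_, agreement) in enumerate(per_position)
            if agreement < threshold]

# Hypothetical readings of one manuscript line by four volunteers.
readers = [
    ["in", "principio", "erat", "verbum"],
    ["in", "principio", "erat", "verbum"],
    ["in", "principio", "crat", "verbum"],
    ["in", "principio", "erit", "verbum"],
]
hard_spots = flag_for_expert(readers)
```

Only the third token draws conflicting readings here, so only that position would be routed to an expert. The rest is settled by the volunteers themselves.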

That would reduce the number of personal fiefdoms built on access to particular manuscripts but that prospect finds me untroubled.

You can imagine the naming issues that will ensue from widespread chemical synthesis by the masses. But, there is too much to be discovered to be miserly with means of discovery or dissemination of those results.

Non-Moral Case For Diversity

Monday, July 21st, 2014

Groups of diverse problem solvers can outperform groups of high-ability problem solvers by Lu Hong and Scott E. Page.


We introduce a general framework for modeling functionally diverse problem-solving agents. In this framework, problem-solving agents possess representations of problems and algorithms that they use to locate solutions. We use this framework to establish a result relevant to group composition. We find that when selecting a problem-solving team from a diverse population of intelligent agents, a team of randomly selected agents outperforms a team comprised of the best-performing agents. This result relies on the intuition that, as the initial pool of problem solvers becomes large, the best-performing agents necessarily become similar in the space of problem solvers. Their relatively greater ability is more than offset by their lack of problem-solving diversity.

I have heard people say that diverse teams are better, but always in the context of contending for members of one group or another to be included on a team.

Reading the paper carefully, I don’t think that is the authors’ point at all.

From the conclusion:

The main result of this paper provides conditions under which, in the limit, a random group of intelligent problem solvers will outperform a group of the best problem solvers. Our result provides insights into the trade-off between diversity and ability. An ideal group would contain high-ability problem solvers who are diverse. But, as we see in the proof of the result, as the pool of problem solvers grows larger, the very best problem solvers must become similar. In the limit, the highest-ability problem solvers cannot be diverse. The result also relies on the size of the random group becoming large. If not, the individual members of the random group may still have substantial overlap in their local optima and not perform well. At the same time, the group size cannot be so large as to prevent the group of the best problem solvers from becoming similar. This effect can also be seen by comparing Table 1. As the group size becomes larger, the group of the best problem solvers becomes more diverse and, not surprisingly, the group performs relatively better.

A further implication of our result is that, in a problem-solving context, a person’s value depends on her ability to improve the collective decision (8). A person’s expected contribution is contextual, depending on the perspectives and heuristics of others who work on the problem. The diversity of an agent’s problem-solving approach, as embedded in her perspective-heuristic pair, relative to the other problem solvers is an important predictor of her value and may be more relevant than her ability to solve the problem on her own. Thus, even if we were to accept the claim that IQ tests, Scholastic Aptitude Test scores, and college grades predict individual problem-solving ability, they may not be as important in determining a person’s potential contribution as a problem solver as would be measures of how differently that person thinks. (emphasis added)

Some people accept gender, race, nationality, etc. as markers for thinking differently and no doubt that is true in some cases. But presuming it is just as uninformed as presuming there are no differences in how people of different genders, races, and nationalities think.

You could ask, for example by presenting candidates for a team with open-ended problems that admit multiple solutions. Group similar solutions together and then pick randomly across the solution groups.
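Here is a rough sketch of that selection step, offered as one possible reading rather than anything from the paper. The candidate names, the approach labels, and the function `pick_diverse_team` are hypothetical, and the hard part (deciding which solutions count as "similar") is assumed to have been done by hand.

```python
import random
from collections import defaultdict

def pick_diverse_team(candidate_approach, team_size, seed=None):
    """Pick a team by drawing at random across solution groups.

    candidate_approach maps a candidate name to a label for the kind of
    solution they offered (labels are assumed to be assigned beforehand).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for candidate, approach in candidate_approach.items():
        groups[approach].append(candidate)
    for members in groups.values():
        rng.shuffle(members)

    team = []
    # Visit groups in random order, taking one candidate per group per pass,
    # so no approach appears twice before every approach has appeared once.
    while len(team) < team_size and any(groups.values()):
        approaches = [a for a, members in groups.items() if members]
        rng.shuffle(approaches)
        for approach in approaches:
            if len(team) == team_size:
                break
            team.append(groups[approach].pop())
    return team

# Hypothetical candidates with hand-assigned solution labels.
candidates = {
    "Ann": "brute force", "Bob": "brute force", "Cy": "brute force",
    "Dee": "divide and conquer", "Eli": "divide and conquer",
    "Fay": "analogy",
}
team = pick_diverse_team(candidates, team_size=3, seed=1)
```

With three solution groups and a team of three, each approach is represented exactly once, whichever individuals happen to be drawn.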

You may have a gender, race, nationality diverse team but if they think the same way, say Antonin Scalia and Clarence Thomas, then your team isn’t usefully diverse.

Diversity of thinking should be your goal, not diversity of markers of diversity.

I first saw this in a tweet by Chris Dixon.

Crowdscraping – You Game?

Tuesday, July 8th, 2014

Launching #FlashHacks: a crowdscraping movement to release 10 million data points in 10 days. Are you in? by Hera.

From the post:

The success story that is OpenCorporates is very much a team effort – not just the tiny OpenCorporates core team, but the whole open data community, who from the beginning have been helping us in so many ways, from writing scrapers for company registers, to alerting us when new data is available, to helping with language or data questions.

But one of the most common questions has been, “How can I get data into OpenCorporates”. Given that OpenCorporates’ goal is not just every company in the world but also all the public data that relates to those companies, that’s something we’ve wanted to allow, as we would not achieve that alone, and it’s something that will make OpenCorporates not just the biggest open database of company data in the world, but the biggest database of company data, open or proprietary.

To launch this new era in corporate data, we are launching a #FlashHacks campaign.

Flash What? #FlashHacks.

We are inviting all Ruby and Python botwriters to help us crowdscrape 10 million data points into OpenCorporates in 10 days.

How you can join the crowdscraping movement

  • Join and sign up!
  • Have a look at the datasets we have listed on the Campaign page as inspiration. You can either write bots for these or even chose your own!
  • Sign up to a mission! Send a tweet pledge to say you have taken on a mission.
  • Write the bot and submit on the platform.
  • Tweet your success with the #FlashHacks tag! Don’t forget to upload the FlashHack design as your twitter cover photo and facebook cover photo to get more people involved.

Join us on our Google Group, share problems and solutions, and help build the open corporate data community.

If you are interested in covering this story, you can view the press release here.

Also of interest: Ruby and Python coders – can you help us?

To join this crowdscrape, sign up at:

Tweet, email, post, etc.

Could be the start of a new social activity, the episodic crowdscrape.

Are crowdscrapes an answer to massive data dumps from corporate interests?

I first saw this in a tweet by Martin Tisne.

Asteroid Hunting!

Thursday, June 26th, 2014

Planetary Resources Wants Public to Help Find Asteroids by Doug Messier.

From the post:

Planetary Resources, the asteroid mining company, and Zooniverse today launched Asteroid Zoo, empowering students, citizen scientists and space enthusiasts to aid in the search for previously undiscovered asteroids. The program allows the public to join the search for Near Earth Asteroids (NEAs) of interest to scientists, NASA and asteroid miners, while helping to train computers to better find them in the future.

Asteroid Zoo joins the Zooniverse’s family of more than 25 citizen science projects! It will enable participants to search terabytes of imaging data collected by Catalina Sky Survey (CSS) for undiscovered asteroids in a fun, game-like process from their personal computers or devices. The public’s findings will be used by scientists to develop advanced automated asteroid-searching technology for telescopes on Earth and in space, including Planetary Resources’ ARKYD.

“With Asteroid Zoo, we hope to extend the effort to discover asteroids beyond astronomers and harness the wisdom of crowds to provide a real benefit to Earth,” said Chris Lewicki, President and Chief Engineer, Planetary Resources, Inc. “Furthermore, we’re excited to introduce this program as a way to thank the thousands of people who supported Planetary Resources through Kickstarter. This is the first of many initiatives we’ll introduce as a result of the campaign.”

The post doesn’t say who names an asteroid that qualifies for an Extinction Event. 😉 If it is a committee, it may go forever nameless.

Expert vs. Volunteer Semantics

Thursday, April 17th, 2014

The variability of crater identification among expert and community crater analysts by Stuart J. Robbins, et al.


The identification of impact craters on planetary surfaces provides important information about their geological history. Most studies have relied on individual analysts who map and identify craters and interpret crater statistics. However, little work has been done to determine how the counts vary as a function of technique, terrain, or between researchers. Furthermore, several novel internet-based projects ask volunteers with little to no training to identify craters, and it was unclear how their results compare against the typical professional researcher. To better understand the variation among experts and to compare with volunteers, eight professional researchers have identified impact features in two separate regions of the moon. Small craters (diameters ranging from 10 m to 500 m) were measured on a lunar mare region and larger craters (100s m to a few km in diameter) were measured on both lunar highlands and maria. Volunteer data were collected for the small craters on the mare. Our comparison shows that the level of agreement among experts depends on crater diameter, number of craters per diameter bin, and terrain type, with differences of up to ∼±45%. We also found artifacts near the minimum crater diameter that was studied. These results indicate that caution must be used in most cases when interpreting small variations in crater size-frequency distributions and for craters ≤10 pixels across. Because of the natural variability found, projects that emphasize many people identifying craters on the same area and using a consensus result are likely to yield the most consistent and robust information.
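To make "using a consensus result" concrete, here is a small sketch of one way to combine several people's marks on the same image. It is not the paper's method: the greedy clustering, the `tolerance` and `min_votes` parameters, and the sample marks are all assumptions chosen for illustration.

```python
import math

def _mean(cluster):
    """Mean (x, y, diameter) of the marks in a cluster."""
    n = len(cluster)
    return (sum(m[1] for m in cluster) / n,
            sum(m[2] for m in cluster) / n,
            sum(m[3] for m in cluster) / n)

def consensus_craters(marks, min_votes=3, tolerance=0.5):
    """Greedy consensus over crater marks from several people.

    marks: list of (person_id, x, y, diameter) in pixel units.
    A mark joins the first cluster whose mean center lies within
    tolerance * mean diameter of it and whose mean diameter differs
    from the mark's by no more than that same amount.
    Returns mean (x, y, diameter) for clusters marked by at least
    min_votes distinct people.
    """
    clusters = []
    for mark in marks:
        _, x, y, d = mark
        for cluster in clusters:
            cx, cy, cd = _mean(cluster)
            close = math.hypot(x - cx, y - cy) <= tolerance * cd
            similar = abs(d - cd) <= tolerance * cd
            if close and similar:
                cluster.append(mark)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([mark])
    return [_mean(c) for c in clusters
            if len({m[0] for m in c}) >= min_votes]

# Hypothetical marks from four people on one image.
marks = [
    ("a", 100, 100, 20), ("b", 102, 99, 22), ("c", 98, 101, 18),
    ("a", 300, 50, 12), ("b", 303, 52, 10),
    ("d", 500, 500, 40),
]
craters = consensus_craters(marks)
```

Only the feature marked by three different people survives the default threshold. Lowering `min_votes` to 2 would also admit the second feature, which is the kind of trade-off a real project would have to tune.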

The identification of craters on the Moon may seem far removed from your topic map authoring concerns but I would suggest otherwise.

True, the paper is domain specific in some of its concerns (crater age, degradation, etc.) but the most important question was whether volunteers in aggregate could be as useful as experts in the identification of craters.

The authors conclude:

Except near the minimum diameter, volunteers are able to identify craters just as well as the experts (on average) when using the same interface (the Moon Mappers interface), resulting in not only a similar number of craters, but also a similar size distribution. (page 34)

I find that suggestive for mapping semantics because, unlike moon craters, what words mean (and implicitly why) is a daily concern for users, including the ones in your enterprise.

You can, of course, employ experts to re-interpret what they have been told by some of your users into the expert’s language and produce semantic integration based on the expert’s understanding or mis-understanding of your domain.

Or, you can use your own staff, with experts to facilitate encoding their understanding of your enterprise semantics, as in a topic map.

Recalling that the semantics for your enterprise aren’t “out there” in the ether but residing within the staff that make up your enterprise.

I still see an important role for experts but it isn’t as the source of your semantics, rather as the hunters who assist in capturing your semantics.

I first saw this in a tweet by astrobites that led me to: Crowd-Sourcing Crater Identification by Brett Deaton.

The GATE Crowdsourcing Plugin:…

Monday, March 24th, 2014

The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy by Kalina Bontcheva, Ian Roberts, Leon Derczynski, and Dominic Rout.


Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowd-sourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowd-sourcing interfaces for NLP classification and selection tasks. The entire work-flow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.

From the introduction:

A big outstanding challenge for crowdsourcing projects is that the cost to define a single annotation task remains quite substantial. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units, as well as automatically generated, reusable user interfaces [1] for NLP classification and selection tasks. Their use will be demonstrated on annotating named entities (selection task), disambiguating words and named entities with respect to DBpedia URIs (classification task), annotation of opinion holders and targets (selection task), as well as sentiment (classification task).


Are the difficulties associated with annotation UIs a matter of creating the UI or the choices that underlie the UI?

This plugin may shed light on possible answers to that question.

Citizen Science and the Modern Web…

Friday, March 21st, 2014

Citizen Science and the Modern Web – Talk by Amit Kapadia by Bruce Berriman.

From the post:

Amit Kapadia gave this excellent talk at CERN on Citizen Science and The Modern Web. From Amit’s abstract: “Beginning as a research project to help scientists communicate, the Web has transformed into a ubiquitous medium. As the sciences continue to transform, new techniques are needed to analyze the vast amounts of data being produced by large experiments. The advent of the Sloan Digital Sky Survey increased throughput of astronomical data, giving rise to Citizen Science projects such as Galaxy Zoo. The Web is no longer exclusively used by researchers, but rather, a place where anyone can share information, or even, partake in citizen science projects.

As the Web continues to evolve, new and open technologies enable web applications to become more sophisticated. Scientific toolsets may now target the Web as a platform, opening an application to a wider audience, and potentially citizen scientists. With the latest browser technologies, scientific data may be consumed and visualized, opening the browser as a new platform for scientific analysis.”

Bruce points to the original presentation here.

The emphasis is on astronomy but the talk makes many good points on citizen science.

Curious if citizen involvement in the sciences and humanities could lead to greater awareness and support for them?

Quizz: Targeted Crowdsourcing…

Friday, March 7th, 2014

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users by Panagiotis G. Ipeirotis and Evgeniy Gabrilovich.


We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user’s competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost- quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.
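The core loop in the abstract (grade users on questions with known answers, then lean on the competent ones for questions without) can be sketched in a few lines. The paper scores contributions with information theory; the smoothed accuracy and log-odds vote below are a simpler stand-in and not the authors' method. All names and sample data are hypothetical.

```python
import math
from collections import defaultdict

def estimate_competence(answers, gold, smoothing=1.0):
    """Smoothed fraction of known-answer questions each user got right.

    answers: {user: {question: answer}}; gold: {question: correct answer}.
    """
    competence = {}
    for user, given in answers.items():
        graded = [q for q in given if q in gold]
        correct = sum(given[q] == gold[q] for q in graded)
        # Laplace smoothing keeps estimates away from exactly 0 or 1.
        competence[user] = (correct + smoothing) / (len(graded) + 2 * smoothing)
    return competence

def weighted_answer(question, answers, competence):
    """Pick the answer with the highest total log-odds weight."""
    scores = defaultdict(float)
    for user, given in answers.items():
        if question in given:
            p = competence[user]
            scores[given[question]] += math.log(p / (1 - p))
    return max(scores, key=scores.get)

# Hypothetical quiz: three calibration questions plus one open question.
gold = {"q1": "A", "q2": "B", "q3": "C"}
answers = {
    "expert":   {"q1": "A", "q2": "B", "q3": "C", "new": "X"},
    "guesser1": {"q1": "B", "q2": "B", "q3": "A", "new": "Y"},
    "guesser2": {"q1": "C", "q2": "A", "q3": "C", "new": "Y"},
}
competence = estimate_competence(answers, gold)
best = weighted_answer("new", answers, competence)
```

A plain majority vote on the open question would pick "Y". Weighting by measured competence picks the one reliable user's "X" instead, which is the point of mixing calibration questions in with the open ones.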

Crowd sourcing isn’t an automatic slam-dunk but with research like this, it will start moving towards being a repeatable experience.

What do you want to author using a crowd?

I first saw this at Greg Linden’s More quick links.

A Gresham’s Law for Crowdsourcing and Scholarship?

Friday, February 28th, 2014

A Gresham’s Law for Crowdsourcing and Scholarship? by Ben W. Brumfield.

Ben examines the difficulties of involving both professionals and “amateurs” in crowd-sourced projects.

The point of controversy being whether or not professionals will decline to be identified with projects that include amateurs.

There isn’t any smoking gun evidence and I suspect the reaction of both professionals and amateurs varies from field to field.

Still, it is something you may run across if you use crowd-sourcing to build semantic annotations and/or data archives.

Monuments Men

Friday, February 7th, 2014

Monuments Men

From the post:

During World War II, an unlikely team of soldiers was charged with identifying and protecting European cultural sites, monuments, and buildings from Allied bombing. Officially named the Monuments, Fine Arts, and Archives (MFAA) Section, this U.S. Army unit included art curators, scholars, architects, librarians, and archivists from the U.S. and Britain. They quickly became known as The Monuments Men. These documents are drawn from MFAA members’ personal papers held at the Archives of American Art.

Towards the end of the war, their mission changed to one of locating and recovering works of art that had been looted by the Nazis. The Monuments Men uncovered troves of stolen art hidden across Germany and Austria—some in castles, others in salt mines. They rescued some of history’s greatest works of art.

Among the holdings of the Archives of American Art are the papers of Monuments Men George Leslie Stout, James J. Rorimer, Walker Hancock, Thomas Carr Howe, S. Lane Faison, Walter Horn, and Otto Wittman. These personal archives tell a fascinating story.

These documents—and many more including photographs of the recovery operations—are on display in the Lawrence A. Fleischman Gallery at the Donald W. Reynolds Center for American Art and Portraiture in Washington D.C. between February 7 and April 20, 2014 to see the original documents in person. The exhibition is also available online at Monuments Men: On the Frontline to Save Europe’s Art, 1942–1946.

Wouldn’t you know, there is a movie with the same name, just to invite confusion with this project: The Monuments Men. 😉

This is one of many transcription projects at the Smithsonian Transcription Center. Site navigation is problematic, particularly since projects are listed under departments known mostly to insiders.

Crowd sourced transcription helps correct the impression that knowledge starts with digital documents.

Should it happen to spread, someday, to biblical studies, even the average reader would realize the eclectic nature of any modern Bible translation.

Star Date: M83

Tuesday, January 14th, 2014

Star Date: M83 – Uncovering the ages of star clusters in the Southern Pinwheel Galaxy

From the homepage:

Most of the billions of stars that reside in galaxies start their lives grouped together into clusters. In this activity, you will pair your discerning eye with Hubble’s detailed images to identify the ages of M83’s many star clusters. This info helps us learn how star clusters are born, evolve and eventually fall apart in spiral galaxies.

A great citizen scientist project for when it is too cold to go outside (even if CNN doesn’t make it headline news).

The success of citizen science at “recognition” tasks (what else would you call subject identification?) has me convinced the average person is fully capable of authoring a topic map.

They will not author a topic map the same way I would but that’s a relief. I don’t want more than one me around. 😉

Has anyone done a systematic study of the “citizen science” interfaces? What appears to work better or worse?


A Brand New Milky Way Project

Thursday, December 12th, 2013

A Brand New Milky Way Project by Robert Simpson.

From the post:

Just over three years the Zooniverse launched the Milky Way Project (MWP), my first citizen science project. I have been leading the development and science of the MWP ever since. 50,000 volunteers have taken part from all over the world, and they’ve helped us do real science, including creating astronomy’s largest catalogue of infrared bubbles – which is pretty cool.

Today the original Milky Way Project (MWP) is complete. It took about three years and users have drawn more than 1,000,000 bubbles and several million other objects, including star clusters, green knots, and galaxies. It’s been a huge success but: there’s even more data! So it is with glee that we have announced the brand new Milky Way Project! It’s got more data, more objects to find, and it’s even more gorgeous.

Another great crowd sourced project!

Bear in mind that the Greek New Testament has approximately 138,000 words and the Hebrew Bible approximately 469,000.

The success of the Milky Way and other crowd sourced projects makes you wonder why images of biblical manuscripts aren’t set up for crowd transcription, doesn’t it?

DARPA’s online games crowdsource software security

Friday, December 6th, 2013

DARPA’s online games crowdsource software security by Kevin McCaney.

From the post:

Flaws in commercial software can cause serious problems if cyberattackers take advantage of them with their increasingly sophisticated bag of tricks. The Defense Advanced Research Projects Agency wants to see if it can speed up discovery of those flaws by making a game of it. Several games, in fact.

DARPA’s Crowd Sourced Formal Verification (CSFV) program has just launched its Verigames portal, which hosts five free online games designed to mimic the formal software verification process traditionally used to look for software bugs.

Verification, both dynamic and static, has proved to be the best way to determine if software free of flaws, but it requires software engineers to perform “mathematical theorem-proving techniques” that can be time-consuming, costly and unable to scale to the size of some of today’s commercial software, according to DARPA. With Verigames, the agency is testing whether untrained (and unpaid) users can verify the integrity of software more quickly and less expensively.

“We’re seeing if we can take really hard math problems and map them onto interesting, attractive puzzle games that online players will solve for fun,” Drew Dean, DARPA program manager, said in announcing the portal launch. “By leveraging players’ intelligence and ingenuity on a broad scale, we hope to reduce security analysts’ workloads and fundamentally improve the availability of formal verification.”

If program verification is possible with online games, I don’t know of any principled reason why topic map authoring should not be possible.

Maybe fill-in-the-blank topic map authoring is just a poor authoring technique for topic maps.

Imagine gamifying data streams to be like Missile Command. 😉

Can you even count the number of hours that you played Missile Command?

Now consider the impact of a topic map authoring interface that addictive.

Particularly if the user didn’t know they were doing useful work.