## Archive for the ‘Crowd Sourcing’ Category

### Cascade: Crowdsourcing Taxonomy Creation

Tuesday, May 14th, 2013

Cascade: Crowdsourcing Taxonomy Creation by Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, James A. Landay.

Abstract:

Taxonomies are a useful and ubiquitous way of organizing information. However, creating organizational hierarchies is difficult because the process requires a global understanding of the objects to be categorized. Usually one is created by an individual or a small group of people working together for hours or even days. Unfortunately, this centralized approach does not work well for the large, quickly-changing datasets found on the web. Cascade is an automated workflow that creates a taxonomy from the collective efforts of crowd workers who spend as little as 20 seconds each. We evaluate Cascade and show that on three datasets its quality is 80-90% of that of experts. The cost of Cascade is competitive with expert information architects, despite taking six times more human labor. Fortunately, this labor can be parallelized such that Cascade will run in as fast as five minutes instead of hours or days.

In the introduction the authors say:

Crowdsourcing has become a popular way to solve problems that are too hard for today’s AI techniques, such as translation, linguistic tagging, and visual interpretation. Most successful crowdsourcing systems operate on problems that naturally break into small units of labor, e.g., labeling millions of independent photographs. However, taxonomy creation is much harder to decompose, because it requires a global perspective. Cascade is a unique, iterative workflow that emergently generates this global view from the distributed actions of hundreds of people working on small, local problems.

The authors demonstrate the potential for time and cost savings in the creation of taxonomies but I take the significance of their paper to be something different.

As the paper demonstrates, taxonomy creation does not require a global perspective.

Each individual who participated contributed localized knowledge that, when combined with other localized knowledge, can be formed into what an observer would call a taxonomy.

A critical point, since every user represents/reflects slightly varying experiences and viewpoints, while even the most learned expert represents only one.

Does “your” taxonomy reflect your views or some expert’s?
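The local-to-global point can be sketched in a toy form. This is not Cascade's actual workflow; the categories, items and vote counts below are invented, and only the final subset-nesting step is shown:

```python
# Toy illustration: nobody holds a global view, yet a hierarchy falls
# out of many tiny, local judgments. NOT Cascade's actual algorithm;
# categories, items and vote tallies are invented.
from itertools import permutations

# Each entry is the tally of tiny worker tasks: "does ITEM belong in CATEGORY?"
votes = {
    ("animals", "dog"): 3, ("animals", "cat"): 3, ("animals", "oak"): 0,
    ("pets", "dog"): 3, ("pets", "cat"): 1, ("pets", "oak"): 0,
    ("plants", "oak"): 3, ("plants", "dog"): 0, ("plants", "cat"): 0,
}

def members(category, threshold=2):
    """Items that a majority of workers placed in the category."""
    return {item for (cat, item), n in votes.items()
            if cat == category and n >= threshold}

categories = ["animals", "pets", "plants"]
sets = {c: members(c) for c in categories}

# Global structure emerges from local data: nest child under parent
# when the child's items are a proper subset of the parent's items.
edges = [(child, parent) for child, parent in permutations(categories, 2)
         if sets[child] < sets[parent]]
# edges nests "pets" under "animals"; "plants" remains a root
```

No single vote required a global perspective, yet the subset test recovers a hierarchy an observer would call a taxonomy.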

### Help Map Historical Weather From Ship Logs

Thursday, May 9th, 2013

Help Map Historical Weather From Ship Logs by Caitlin Dempsey.

From the post:

The Old Weather project is a crowdsourcing data gathering endeavor to understand and map historical weather variability. The data collected will be used to understand past weather patterns and extremes in order to better predict future weather and climate. The project is headed by a team of collaborators from a range of agencies such as NOAA, the Met Office, the National Archives, and the National Maritime Museum.

Information about historical weather, in the form of temperature and pressure measurements, can be gleaned from old ship logbooks. For example, Robert FitzRoy, the Captain of the Beagle, and his crew recorded weather conditions in their logs at every point the ship visited during Charles Darwin’s expedition. English East India Company ships from the 1780s to the 1830s made numerous trips between the United Kingdom and China and India, with the ship crews recording weather measurements in their log books. Other expeditions to Antarctica provide rare historical measurements for that region of the world.

By utilizing a crowdsourcing approach, the Old Weather project team aims to use the collective efforts of public participation to gather data and to fact check data recorded from log books. There are 250,000 log books stored in the United Kingdom alone. Clive Wilkinson, a climate historian and research manager for the Recovery of Logbooks and International Marine Data (RECLAIM) Project, a part of NOAA’s Climate Database Modernisation Program, notes there are billions of unrecorded weather observations stored in logbooks around the world that could be captured and used to improve climate prediction models.

In addition to climate data, I suspect that ships’ logs would make interesting records to dovetail, using a topic map, with other records, such as records of the ports along their voyages.

Tracking the identities of passengers and crew, cargoes, social events/conditions along the way.

Standing on their own, logs and other historical materials are of interest, but integrated with other historical records a fuller historical tapestry emerges.

### Crowdsourced Astronomy…

Thursday, May 9th, 2013

Crowdsourced Astronomy – A Talk By Carolina Ödman-Govender by Bruce Berriman.

From the post:

This is a talk by Carolina Ödman-Govender, given at the re:publica 13 meeting on May 8, 2013. She gives a fine general introduction to the value of crowdsourcing in astronomy, and invites people to get in touch with her if they want to get involved.

Have you considered crowdsourcing for development of a topic map corpus?

### The Motherlode of Semantics, People

Saturday, April 27th, 2013

1st International Workshop on “Crowdsourcing the Semantic Web” (CrowdSem2013)

Submission deadline: July 12, 2013 (23:59 Hawaii time)

From the post:

1st International Workshop on “Crowdsourcing the Semantic Web” in conjunction with the 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, in Sydney, Australia. This interactive workshop takes stock of the emergent work and charts the research agenda, with interactive sessions to brainstorm ideas and potential applications of collective intelligence to solving AI-hard semantic web problems.

The Global Brain Semantic Web—a Semantic Web interleaving a large number of human and machine computations—has great potential to overcome some of the issues of the current Semantic Web. In particular, semantic technologies have been deployed in the context of a wide range of information management tasks in scenarios that are increasingly significant in both technical (data size, variety and complexity of data sources) and economic terms (industries addressed and their market volume). For many of these tasks, machine-driven algorithmic techniques aiming at full automation do not reach a level of accuracy that many production environments require. Enhancing automatic techniques with human computation capabilities is becoming a viable solution in many cases. We believe that there is huge potential at the intersection of these disciplines – large scale, knowledge-driven, information management and crowdsourcing – to solve technically challenging problems purposefully and in a cost-effective manner.

I’m encouraged.

The Semantic Web is going to start asking the entities (people) that originate semantics about semantics.

Going for the motherlode of semantics.

Now to see what they do with the answers.

### Crowdsourcing Chemistry for the Community…

Friday, April 5th, 2013

Crowdsourcing Chemistry for the Community — 5 Years of Experiences by Antony Williams.

From the description:

ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.

This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.

Perhaps not encouraging in terms of the rate of participation but certainly encouraging in terms of the impact of those who do participate.

I suspect the ratio of contributors to users isn’t that far off from those observed in open source projects.

On the whole, I take this as a plus sign for crowd-sourced curation projects, including topic maps.

I first saw this in a tweet by ChemConnector.

### Crowdsourced Chemistry… [Documents vs. Data]

Monday, March 18th, 2013

Crowdsourced Chemistry: Why Online Chemistry Data Needs Your Help by Antony Williams. (video)

From the description:

This is the Ignite talk that I gave at ScienceOnline2010 #sci010 in the Research Triangle Park in North Carolina on January 16th 2010. This was supposed to be a 5 minute talk highlighting the quality of chemistry data on the internet. Ok, it was a little tongue in cheek because it was an after dinner talk and late at night but the data are real, the problem is real and the need for data curation of chemistry data online is real. On ChemSpider we have provided a platform to deposit and curate data. Other videos will show that in the future.

Great demonstration of the need for curation in chemistry.

And of the impact that re-usable information can have on the quality of information.

The errors in chemical descriptions you see in this video could be corrected:

• In an article.
• In a monograph.
• In a webpage.
• In an online resource that can be incorporated by reference.

Which one do you think would propagate the corrected information more quickly?

Documents are a great way to convey information to a reader.

They are an incredibly poor way to store/transmit information.

Every reader has to extract the information in a document for themselves.

Not to mention that the data in a document is fixed, unless information has been incorporated by reference.

Funny isn’t it? We are still storing data as we did when clay tablets were the medium of choice.

Isn’t it time we separated presentation (documents) from storage/transmission (data)?

### Tom Sawyer and Crowdsourcing

Sunday, March 10th, 2013

Crowdsource from your Community the Tom Sawyer Way – Community Nuggets Vol.1 (video by Dave Olson)

Crowdsource From Your Community – the Tom Sawyer Way (article by Connor Meakin)

Deeply impressive video/article.

More of the nuts and bolts of the social side of crowd sourcing.

The side that makes it so successful (or not) depending on how well you do the social side.

Makes me wonder how to adapt the lessons of crowd sourcing both for the development of topic maps and for topic maps standardization.

### Crowdsourcing Cybersecurity: A Proposal (Part 2)

Wednesday, February 20th, 2013

As you may already suspect, my proposal for increasing cybersecurity is transparency.

A transparency borne of crowdsourcing cybersecurity.

What are the consequences of the current cult of secrecy around cybersecurity?

Here’s my short list (feel free to contribute):

• Governments have no source of reliable information on the security of their contractors, vendors, etc.
• Corporations have no source of reliable information on the security of their contractors, partners and others.
• Sysadmins outside the “inner circle” have no notice of the details of hacks, with which to protect their systems.
• Consumers of software have no source of reliable information on how insecure software may or may not be.

Secrecy puts everyone at greater cybersecurity risk, not less.

Let’s end cybersecurity secrecy and crowdsource cybersecurity.

Here is a sketch of one way to do just that:

1. Establish or re-use an agency or organization to offer bounties on hacks into systems.
2. Sliding scale where penetration using published root passwords are worth less than more sophisticated hacks. But even a minimal hack is worth say $5,000. 3. To collect the funds, a hacker must provide full hack details and proof of the hack. 4. A hacker submitting a “proof of hackability” attack has legal immunity (civil and criminal). 5. Hack has to be verified using the hack as submitted. 6. Upon verification of the hack, the hacker is paid the bounty. 7. One Hundred and Eighty (180) days after the verification of the hack, the name of the hacked organization, the full details of the hack and the hacker’s identity (subject to their permission), are published to a public website. Finance such a proposal, if run by a government, by fines on government contractors who get hacked. Defense contractors who aren’t cybersecure should not be defense contractors. That’s how you stop loss of national security information. Surprised it hasn’t occurred to anyone inside the beltway. With greater transparency, hacks, software, origins of software, authors of software, managers of security, all become subject to mapping. Would you hire your next security consultant from a firm that gets hacked on a regular basis? Or would you hire a defense contractor that changed its skin to avoid identification as an “easy” hack? Or retain a programmer who keeps being responsible for security flaws? Transparency and a topic map could give you better answers to those questions than you have today. ### Crowdsourcing Cybersecurity: A Proposal (Part 1) Wednesday, February 20th, 2013 Mandiant’s provocative but hardly conclusive report has created a news wave on cybersecurity. Hardly conclusive because as Mandiant states: we have analyzed the group’s intrusions against nearly 150 victims over seven years (page 2) A little over twenty-one victims a year. And I thought hacking was common place. 
Allegations of hacking should require a factual basis other than “more buses were going the other way.” (A logical fallacy, because you get on the first bus going your way.)

Here we have a tiny subset (if general hacking allegations have any credibility) of all hacking every year.

Who is responsible for the intrusions?

It is easy and commonplace to blame hackers, but there are other responsible parties.

The security industry that continues to protect the identity of the “victims” of hacks and shares hacking information with a group of insiders comes to mind.

That long-standing cult of secrecy has not prevented, if you believe the security PR, a virtual crime wave of hacking.

In fact, every non-disclosed hack leaves thousands if not hundreds of thousands of users, institutions, governments and businesses with no opportunity to protect themselves.

And, if you are hiring a contractor, say a defense contractor, isn’t their record of protecting your data from hackers a relevant concern?

If users, institutions, governments and businesses had access to the details of hacking reports (who was hacked, who in the organization was responsible for computer security, how the hack was performed, etc.), then we could all better secure our computers.

Or be held accountable for failing to secure our computers. By management, customers and/or governments.

Decades of diverting attention from poor security practices, hiding those who practice poor security, and cultivating a cult of secrecy around computer security hasn’t diminished hacking.

What part of that lesson is unclear? Or do you deny the reports by Mandiant and others?

It really is that clear: either Mandiant and others are inventing hacking figures out of whole cloth, or the cult of cybersecurity secrecy has failed to stop hacking.

Interested? See Crowdsourcing Cybersecurity: A Proposal (Part 2) for my take on a solution.
Just as a side note, President Obama’s Executive Order — Improving Critical Infrastructure Cybersecurity appeared on February 12, 2013. Compare: Mandiant Releases Report Exposing One of China’s Cyber Espionage Groups, released February 19, 2013.

Is Mandiant trying to ride on the President’s coattails, as they say? Or just being opportunistic with the news cycle? Connected into the beltway security cult?

Hard to say, probably impossible to know. Interesting timing nonetheless.

I wonder who will be on the various panels, experts, contractors under the Cybersecurity executive order? Don’t you?

### Models and Algorithms for Crowdsourcing Discovery

Sunday, February 17th, 2013

Models and Algorithms for Crowdsourcing Discovery by Siamak Faridani. (PDF)

From the abstract:

The internet enables us to collect and store unprecedented amounts of data. We need better models for processing, analyzing, and making conclusions from the data. In this work, crowdsourcing is presented as a viable option for collecting data and extracting patterns and insights from big data. Humans in collaboration, when provided with appropriate tools, can collectively see patterns, extract insights and draw conclusions from data. We study different models and algorithms for crowdsourcing discovery. In each section of this dissertation a problem is proposed, its importance is discussed, and solutions are proposed and evaluated. Crowdsourcing is the unifying theme for the projects presented in this dissertation. In the first half of the dissertation we study different aspects of crowdsourcing, like pricing, completion times, incentives, and consistency, with in-lab and controlled experiments. In the second half of the dissertation we focus on Opinion Space and the algorithms and models that we designed for collecting innovative ideas from participants. This dissertation specifically studies how to use crowdsourcing to discover patterns and innovative ideas.
We start by looking at the CONE Welder project, which uses a robotic camera in a remote location to study the effect of climate change on the migration of birds. In CONE, an amateur birdwatcher can operate a robotic camera at a remote location from within her web browser. She can take photos of different bird species and classify different birds using the user interface in CONE. This allowed us to compare the species present in the area from 2008 to 2011 with the species present in the area as reported by Blacklock in 1984 [Blacklock, 1984]. Citizen scientists found eight avian species previously unknown to have breeding populations within the region. CONE is an example of using crowdsourcing for discovering new migration patterns.

Crowdsourcing has great potential. Especially if you want to discover the semantics people are using rather than dictating the semantics they ought to be using.

I think the former is more accurate than the latter. You?

I first saw this at Christophe Lalanne’s A bag of tweets / January 2013.

### The Power of Semantic Diversity

Sunday, February 10th, 2013

Prize-based contests can provide solutions to computational biology problems by Karim R Lakhani, et al. (Nature Biotechnology 31, 108–111 (2013) doi:10.1038/nbt.2495)

From the article:

Advances in biotechnology have fueled the generation of unprecedented quantities of data across the life sciences. However, finding analysts who can address such ‘big data’ problems effectively has become a significant research bottleneck. Historically, prize-based contests have had striking success in attracting unconventional individuals who can overcome difficult challenges. To determine whether this approach could solve a real big-data biologic algorithm problem, we used a complex immunogenomics problem as the basis for a two-week online contest broadcast to participants outside academia and biomedical disciplines.
Participants in our contest produced over 600 submissions containing 89 novel computational approaches to the problem. Thirty submissions exceeded the benchmark performance of the US National Institutes of Health’s MegaBLAST. The best achieved both greater accuracy and speed (1,000 times greater). Here we show the potential of using online prize-based contests to access individuals without domain-specific backgrounds to address big-data challenges in the life sciences.

….

Over the last ten years, online prize-based contest platforms have emerged to solve specific scientific and computational problems for the commercial sector. These platforms, with solvers in the range of tens to hundreds of thousands, have achieved considerable success by exposing thousands of problems to larger numbers of heterogeneous problem-solvers and by appealing to a wide range of motivations to exert effort and create innovative solutions. The large number of entrants in prize-based contests increases the probability that an ‘extreme-value’ (or maximally performing) solution can be found through multiple independent trials; this is also known as a parallel-search process. In contrast to traditional approaches, in which experts are predefined and preselected, contest participants self-select to address problems and typically have diverse knowledge, skills and experience that would be virtually impossible to duplicate locally. Thus, the contest sponsor can identify an appropriate solution by allowing many individuals to participate and observing the best performance. This is particularly useful for highly uncertain innovation problems in which prediction of the best solver or approach may be difficult and the best person to solve one problem may be unsuitable for another.

An article that merits wider reading than it is likely to get behind a pay-wall.

A semantically diverse universe of potential solvers is more effective than a semantically monotone group of selected experts.
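The “parallel search” claim is easy to illustrate: if solver performance is an independent random draw, the expected best of a large self-selected crowd exceeds the expected best of a small preselected panel. A minimal simulation, with an invented score distribution:

```python
# Minimal sketch of the parallel-search / extreme-value effect.
# Solver scores are invented draws from a normal distribution; only
# the qualitative comparison matters.
import random

random.seed(42)

def expected_best(n_solvers, trials=2000):
    """Average of the best score among n_solvers independent draws."""
    return sum(max(random.gauss(0.0, 1.0) for _ in range(n_solvers))
               for _ in range(trials)) / trials

small_panel = expected_best(5)     # a few preselected experts
open_contest = expected_best(500)  # a large, diverse, self-selected crowd
# more independent trials push the observed maximum further into the tail,
# so open_contest comes out well above small_panel
```

The same logic explains why the contest sponsor need only observe the best performance rather than predict in advance who the best solver will be.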
An indicator of what to expect from the monotone logic of the Semantic Web. Good for scheduling tennis matches with Tim Berners-Lee.

For more complex tasks, rely on semantically diverse groups of humans.

### Human Computation and Crowdsourcing

Saturday, January 26th, 2013

From the conference website:

Where: Palm Springs, California (venue information coming soon)

When: November 7-9, 2013

Important Dates (all deadlines are 5pm Pacific time unless otherwise noted):

Papers

• Submission deadline: May 1, 2013
• Author rebuttal period: June 21-28
• Notification: July 16, 2013
• Camera Ready: September 4, 2013

Workshops & Tutorials

• Proposal deadline: May 10, 2013
• Notification: July 16, 2013
• Camera Ready: September 4, 2013

Posters & Demonstrations

• Submission deadline: July 25, 2013
• Notification: August 26, 2013
• Camera Ready: September 4, 2013

From the post:

Announcing HCOMP 2013, the Conference on Human Computation and Crowdsourcing, Palm Springs, November 7-9, 2013. Paper submission deadline is May 1, 2013. Thanks to the HCOMP community for bringing HCOMP to life as a full conference, following on the successful workshop series.

The First AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2013) will be held November 7-9, 2013 in Palm Springs, California, USA. The conference was created by researchers from diverse fields to serve as a key focal point and scholarly venue for the review and presentation of the highest quality work on principles, studies, and applications of human computation. The conference is aimed at promoting the scientific exchange of advances in human computation and crowdsourcing among researchers, engineers, and practitioners across a spectrum of disciplines.

Paper submissions are due May 1, 2013 with author notification on July 16, 2013. Workshop and tutorial proposals are due May 10, 2013. Posters & demonstrations submissions are due July 25, 2013.

I suppose it had to happen.
Instead of asking adding machines for their opinions, someone would decide to ask the creators of adding machines for theirs.

I first saw this at: New AAAI Conference on Human Computation and Crowdsourcing by Shar Steed.

### Crowdsourcing campaign spending: …

Thursday, December 13th, 2012

From the post:

This fall, ProPublica set out to Free the Files, enlisting our readers to help us review political ad files logged with the Federal Communications Commission. Our goal was to take thousands of hard-to-parse documents and make them useful, helping to reveal hidden spending in the election. Nearly 1,000 people pored over the files, logging detailed ad spending data to create a public database that otherwise wouldn’t exist. We logged as much as $1 billion in political ad buys, and a month after the election, people are still reviewing documents.

So what made Free the Files work?

A quick backstory: Free the Files actually began last spring as an effort to enlist volunteers to visit local TV stations and request access to the “public inspection file.” Stations had long been required to keep detailed records of political ad buys, but they were only available on paper and required actually traveling to the station.

In August, the FCC ordered stations in the top 50 markets to begin posting the documents online. Finally, we would be able to access a stream of political ad data based on the files. Right?

Wrong. It turns out the FCC didn’t require stations to submit the data in anything that approaches an organized, standardized format. The result was that stations sent in a jumble of difficult-to-search PDF files. So we decided that if the FCC or stations wouldn’t organize the information, we would.

Enter Free the Files 2.0. Our intention was to build an app to help translate the mishmash of files into structured data about the ad buys, ultimately letting voters sort the files by market, contract amount and candidate or political group (which isn’t possible on the FCC’s web site), and to do it with the help of volunteers.

In the end, Free the Files succeeded in large part because it leveraged data and community tools toward a single goal. We’ve compiled a bit of what we’ve learned about crowdsourcing and a few ideas on how news organizations can adapt a Free the Files model for their own projects.

The team who worked on Free the Files included Amanda Zamora, engagement editor; Justin Elliott, reporter; Scott Klein, news applications editor; Al Shaw, news applications developer, and Jeremy Merrill, also a news applications developer. And thanks to Daniel Victor and Blair Hickman for helping create the building blocks of the Free the Files community.

The entire story is golden but a couple of parts shine brighter for me than the others.

Design consideration:

The success of Free the Files hinged in large part on the design of our app. The easier we made it for people to review and annotate documents, the higher the participation rate, the more data we could make available to everyone. Our maxim was to make the process of reviewing documents like eating a potato chip: “Once you start, you can’t stop.”

Let me re-say that: The easier it is for users to author topic maps, the more topic maps they will author.

Yes?

Semantic Diversity:

But despite all of this, we still can’t get an accurate count of the money spent. The FCC’s data is just too dirty. For example, TV stations can file multiple versions of a single contract with contradictory spending amounts — and multiple ad buys with the same contract number means radically different things to different stations. But the problem goes deeper. Different stations use wildly different contract page designs, structure deals in idiosyncratic ways, and even refer to candidates and groups differently.

All true but knowing the semantics vary ahead of time, station to station, why not map the semantics in the markets ahead of time?

Granted, I second their request to the FCC for standardized data, but having standardized blocks doesn’t mean the information has the same semantics.

The OMB can’t keep the same semantics for a handful of terms in one document.

What chance is there with dozens and dozens of players in multiple documents?
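Mapping the semantics ahead of time could be as simple as a per-station field map maintained alongside the crowdsourced review, translating each station’s idiosyncratic contract fields into one shared vocabulary. A minimal sketch; the station call signs and field names below are invented:

```python
# Sketch of per-station semantic mapping: rewrite each station's
# idiosyncratic contract fields into a shared vocabulary before
# comparing the numbers. Call signs and field names are invented.
FIELD_MAPS = {
    "WAAA": {"Agency Buyer": "buyer", "Gross Amt": "amount_usd"},
    "WBBB": {"Purchaser": "buyer", "Total $": "amount_usd"},
}

def normalize(station, record):
    """Rewrite one station's record into the shared vocabulary."""
    field_map = FIELD_MAPS[station]
    return {field_map.get(field, field): value
            for field, value in record.items()}

a = normalize("WAAA", {"Agency Buyer": "Smith Media", "Gross Amt": 12000})
b = normalize("WBBB", {"Purchaser": "Smith Media", "Total $": 9500})
# both records now answer "who bought, and for how much?" the same way
```

The mapping table, not the standardized block format, is what makes the records comparable, which is the topic map point exactly.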

### Georeferencer: Crowdsourced Georeferencing for Map Library Collections

Monday, November 19th, 2012

Georeferencer: Crowdsourced Georeferencing for Map Library Collections by Christopher Fleet, Kimberly C. Kowal and Petr Přidal.

Abstract:

Georeferencing of historical maps offers a number of important advantages for libraries: improved retrieval and user interfaces, better understanding of maps, and comparison/overlay with other maps and spatial data. Until recently, georeferencing has involved various relatively time-consuming and costly processes using conventional geographic information system software, and has been infrequently employed by map libraries. The Georeferencer application is a collaborative online project allowing crowdsourced georeferencing of map images. It builds upon a number of related technologies that use existing zoomable images from library web servers. Following a brief review of other approaches and georeferencing software, we describe Georeferencer through its five separate implementations to date: the Moravian Library (Brno), the Nationaal Archief (The Hague), the National Library of Scotland (Edinburgh), the British Library (London), and the Institut Cartografic de Catalunya (Barcelona). The key success factors behind crowdsourcing georeferencing are presented. We then describe future developments and improvements to the Georeferencer technology.
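At its core, georeferencing fits a transform from image pixel coordinates to geographic coordinates using user-supplied control points. A minimal affine least-squares sketch (the coordinates are invented, and real tools such as Georeferencer support higher-order transforms as well):

```python
# Sketch of the core georeferencing step: fit an affine transform from
# pixel coordinates to (lon, lat) from volunteer-clicked control points,
# via least squares. All coordinates are invented.
import numpy as np

# (pixel_x, pixel_y) -> (lon, lat) control points from a volunteer
pixels = np.array([[0, 0], [1000, 0], [0, 800], [1000, 800]], dtype=float)
geo = np.array([[-3.4, 55.9], [-3.1, 55.9], [-3.4, 55.7], [-3.1, 55.7]])

# Solve geo = [px, py, 1] @ A for the 3x2 affine matrix A.
X = np.hstack([pixels, np.ones((len(pixels), 1))])
A, *_ = np.linalg.lstsq(X, geo, rcond=None)

def to_geo(px, py):
    """Map a pixel on the scanned map to (lon, lat)."""
    return np.array([px, py, 1.0]) @ A

lon, lat = to_geo(500, 400)  # centre of the scanned image
```

Once the transform is fitted, the historical map can be overlaid on modern maps and spatial data, which is exactly the payoff the abstract describes.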

There is an introduction video if you prefer: http://www.klokantech.com/georeferencer/.

Either way, you will be deeply impressed by this project.

And wondering: Can the same lessons be applied to crowdsource the creation of topic maps?

### Why I decided to crowdfund my research

Sunday, November 11th, 2012

From the post:

For the last five years, I ran a lab at Princeton University as an independent researcher through a $1 million grant. That money ran out in September. Now my option is to apply for government grants where I have a slim chance of success. And, if unsuccessful, I have to stop research.

Over 80% of grant applications to funding agencies in the United States fail. The government is planning to make further cuts to the science budget. More disturbing is the fact that now scientists receive their first big grant at the age of 42, nearly a decade after surviving graduate school, postdoctoral fellowships and temporary faculty appointments.

That’s why I decided to experiment with the way experiments are funded. I am trying to crowdfund a basic research project. Kickstarter brought the concept of crowdfunding to my attention years ago. However, it was only in the last year that I learned about the SciFund Challenge, a “by scientists, for scientists” initiative to finance small-scale ($200 – $2,000) projects, mostly in ecology and related fields, but not much in the biomedical sciences.

Ethan researched the models used by other crowdfunded projects, and this post includes pointers to that research as well as other lessons he learned along the way. Including how to visualize the network of supporters for his campaign and consequently how to reach out to new supporters.

Not for the first time, I wonder if crowdfunding would work for the production of subject-specific topic maps? That is, to pick some area, a defined data set with a proposed deliverable, and then promote it for funding?

I would shy away from secret government documents unless I ran across a funder who read the Pentagon Papers from cover to cover. It’s a classic, “something that everybody wants to have read and nobody wants to read.”

My problem, which you may share, is that I know what I like, not so much what other people like. As in other people willing to contribute money.
Suggestions as to sources on what “other” people like? Twitter trends? News programs? Movie/music reviews?

The next big question: How can topic maps increase their enjoyment of X?

I first saw news of Ethan O. Perlstein in a tweet by Duncan Hall.

### New version of Get-Another-Label available

Monday, October 22nd, 2012

New version of Get-Another-Label available by Panos Ipeirotis.

From the post:

I am often asked what type of technique I use for evaluating the quality of the workers on Mechanical Turk (or on oDesk, or …). Do I use gold tests? Do I use redundancy? Well, the answer is that I use both. In fact, I use the code “Get-Another-Label” that I have developed together with my PhD students and a few other developers. The code is publicly available on Github.

We have updated the code recently to add some useful functionality, such as the ability to pass (for evaluation purposes) the true answers for the different tasks, and get back answers about the quality of the estimates of the different algorithms.

Panos continues his series on the use of crowd sourcing.

Just a thought experiment at the moment, but could semantic gaps between populations be “discovered” by use of crowd sourcing? That is, create tasks that require “understanding” some implicit semantic in the task and then collect the answers. There being no “incorrect” answers, only answers that reflect the differing perceptions of the semantics of the task.

A way to get away from using small groups of college students for such research? (Nothing against small groups of college students, but they best represent small groups of college students. May need a broader semantic range.)

### Why oDesk has no spammers

Saturday, October 20th, 2012

Why oDesk has no spammers by Panos Ipeirotis.
From the post:

So, in my last blog post, I described a brief outline of how to use oDesk to execute a set of tasks automatically, in a "Mechanical Turk" style (i.e., no interviews for hiring and a completely computer-mediated process for posting a job, hiring, and ending a contract). A legitimate question appeared in the comments: "Well, the concept is certainly interesting. But is there a compelling reason to do microtasks on oDesk? Is it because oDesk has a rating system?"

So, here is my answer: If you hire contractors on oDesk you will not run into any spammers, even without any quality control. Why is that? Is there a magic ingredient at oDesk? Short answer: Yes, there is an ingredient: Lack of anonymity!

Well, when you put it that way.

Question: How open are your topic maps?

Question: Would you use lack of anonymity to prevent spam in a publicly curated topic map?

Question: If we want a lack of anonymity to provide transparency and accountability in government, why isn't that the case with public speech?

### Using oDesk for microtasks [Data Semantics – A Permanent Wait?]

Monday, October 15th, 2012

Using oDesk for microtasks by Panos Ipeirotis.

From the post:

Quite a few people keep asking me about Mechanical Turk. Truth be told, I have not used MTurk for my own work for quite some time. Instead I use oDesk to get workers for my tasks, and, increasingly, for my microtasks as well.

When I mention that people can use oDesk for micro-tasks, people often get surprised: "oDesk cannot be used through an API, it is designed for human interaction, right?" Oh well, yes and no. Yes, most jobs require some form of interviewing, but there are certainly jobs where you do not need to manually interview a worker before engaging them. In fact, with most crowdsourcing jobs having both the training and the evaluation component built into the working process, the manual interview is often not needed.
For such crowdsourcing-style jobs, you can use the oDesk API to automate the hiring of workers to work on your tasks. You can find the API at http://developers.odesk.com/w/page/12364003/FrontPage (Saying that the API page is, ahem, badly designed, is an understatement. Nevertheless, it is possible to figure out how to use it relatively quickly, so let's move on.)

Panos promises future posts with the results of crowdsourcing experiments with oDesk.

I am looking forward to it, because waiting for owners of data to disclose semantics looks like a long wait. Perhaps a permanent wait.

And why not? If the owners of data "know" the semantics of their data, what advantage do they get from telling you? What is their benefit? If you guessed "none," go to the head of the class.

We can either wait for crumbs of semantics to drop off the table or we can set up our own table to produce semantics. Which one sounds quicker to you?

### Verification: In God We Trust, All Others Pay Cash

Thursday, October 11th, 2012

Crowdsourcing is a valuable technique, at least if accurate information is the result. Incorrect information or noise is still incorrect information or noise, crowdsourced or not.

From PLOS ONE (not Nature or Science) comes news of progress on the verification of crowdsourced information: Naroditskiy V, Rahwan I, Cebrian M, Jennings NR (2012) Verification in Referral-Based Crowdsourcing. PLoS ONE 7(10): e45924. doi:10.1371/journal.pone.0045924.

Abstract:

Online social networks offer unprecedented potential for rallying a large number of people to accomplish a given task. Here we focus on information gathering tasks where rare information is sought through "referral-based crowdsourcing": the information request is propagated recursively through invitations among members of a social network.
Whereas previous work analyzed incentives for the referral process in a setting with only correct reports, misreporting is known to be both pervasive in crowdsourcing applications, and difficult/costly to filter out. A motivating example for our work is the DARPA Red Balloon Challenge where the level of misreporting was very high. In order to undertake a formal study of verification, we introduce a model where agents can exert costly effort to perform verification and false reports can be penalized. This is the first model of verification and it provides many directions for future research, which we point out. Our main theoretical result is the compensation scheme that minimizes the cost of retrieving the correct answer. Notably, this optimal compensation scheme coincides with the winning strategy of the Red Balloon Challenge.

UCSD Jacobs School of Engineering, in Making Crowdsourcing More Reliable, reported the following experience with this technique:

The research team has successfully tested this approach in the field. Their group accomplished a seemingly impossible task by relying on crowdsourcing: tracking down "suspects" in a jewel heist on two continents, in five different cities, within just 12 hours. The goal was to find five suspects. Researchers found three. That was far better than their nearest competitor, which located just one "suspect" at a much later time. It was all part of the "Tag Challenge," an event sponsored by the U.S. Department of State and the U.S. Embassy in Prague that took place March 31.

Cebrian's team promised $500 to those who took winning pictures of the suspects. If these people had been recruited to be part of "CrowdScanner" by someone else, that person would get $100. To help spread the word about the group, people who recruited others received $1 per person for the first 2,000 people to join the group.
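The recursive incentive scheme described above can be sketched in a few lines. The dollar figures come from the post; the referral tree, names, and join order below are hypothetical.

```python
# Toy model of the Tag Challenge / Red Balloon-style incentive scheme.
WINNER_PRIZE = 500     # paid to whoever photographs a "suspect"
RECRUITER_BONUS = 100  # paid to whoever recruited that person
SIGNUP_BONUS = 1       # per recruit, for the first 2,000 members
SIGNUP_CAP = 2000

def payouts(recruited_by, winners, members_in_order):
    """Compute each member's payout.

    recruited_by: dict mapping member -> recruiter (or None)
    winners: set of members who took winning pictures
    members_in_order: join order, used for the first-2,000 signup bonus
    """
    earned = {m: 0 for m in members_in_order}
    for w in winners:
        earned[w] += WINNER_PRIZE
        r = recruited_by.get(w)
        if r is not None:
            earned[r] += RECRUITER_BONUS
    for m in members_in_order[:SIGNUP_CAP]:
        r = recruited_by.get(m)
        if r is not None:
            earned[r] += SIGNUP_BONUS
    return earned

# Hypothetical three-person chain: alice recruits bob, bob recruits carol,
# and carol photographs a suspect.
order = ["alice", "bob", "carol"]
tree = {"alice": None, "bob": "alice", "carol": "bob"}
print(payouts(tree, {"carol"}, order))
# carol earns the prize; bob earns the recruiter bonus plus a signup
# bonus for carol; alice earns a signup bonus for bob
```

The point of the scheme is that rewarding recruiters, not just finders, makes spreading the word individually rational.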

This has real potential!

Could use money, but what of other inducements?

What if department professors agree to substitute participation in a verified crowdsourced bibliography in place of the usual 10% class participation?

Motivation, structuring the task, are all open areas for experimentation and research.

Suggestions on areas for topic maps using this methodology?

Some other resources you may find of interest:

Tag Challenge website

Tag Challenge – Wikipedia (Has links to team pages, etc.)

### Experts vs. Crowds (How to Distinguish, CIA and Drones)

Monday, August 27th, 2012

Reporting on the intelligence community’s view of crowd-sourcing, Ken Dilanian reports:

"I don't believe in the wisdom of crowds," said Mark Lowenthal, a former senior CIA and State Department analyst (and 1988 "Jeopardy!" champion) who now teaches classified courses about intelligence. "Crowds produce riots. Experts produce wisdom."

I would modify Lowenthal’s assessment to read:

Crowds produce diverse judgements. Experts produce highly similar judgements.

Or to put it another way: the smaller the group, the less variation you will find in its opinion over time, and the further that opinion diverges from reality as experienced by non-group members.

No real surprise that Beltway denizens failed to predict the Arab Spring. None of the concerns that led to the Arab Spring were part of the "experts'" concerns, not just on a conscious level but as a social experience.

The more diverse the opinion/experience pool, the less likely a crowd judgement is to be completely alien to reality as experienced by others.

Which is how I would explain the performance of the crowd thus far in the experiment.
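A toy simulation (hypothetical numbers throughout) illustrates the argument: a small expert group with low individual noise but a shared bias can be beaten by a large, noisy, but diverse crowd whose independent errors cancel out.

```python
import random

random.seed(42)
TRUE_VALUE = 100.0

# Experts: low individual noise, but a shared bias (the common "social
# experience"). Crowd: high individual noise, but independent errors.
shared_bias = 20.0
experts = [TRUE_VALUE + shared_bias + random.gauss(0, 2) for _ in range(10)]
crowd = [TRUE_VALUE + random.gauss(0, 30) for _ in range(1000)]

expert_estimate = sum(experts) / len(experts)
crowd_estimate = sum(crowd) / len(crowd)

print(f"expert mean: {expert_estimate:.1f} (error {abs(expert_estimate - TRUE_VALUE):.1f})")
print(f"crowd mean:  {crowd_estimate:.1f} (error {abs(crowd_estimate - TRUE_VALUE):.1f})")
```

Averaging shrinks independent noise by roughly the square root of the crowd size, but no amount of averaging removes a bias everyone shares.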

Dilanian’s speculation:

Crowd-sourcing would mean, in theory, polling large groups across the 200,000-person intelligence community, or outside experts with security clearances, to aggregate their views about the strength of the Taliban, say, or the likelihood that Iran is secretly building a nuclear weapon.

reflects a failure to appreciate the nature of crowd-sourced judgements.

First, crowd-sourcing will be more effective if the "intelligence community" is only a small part of the crowd. Choosing only people with security clearances, I suspect, automatically excludes many Taliban sympathizers. You are not going to get good results if the crowd is poorly chosen.

Think of it as trying to re-create the “dance” that bees do as a means of communicating the location of pollen. I would trust the CIA to build a bee hive with only drones. And then complain that crowd behavior didn’t work.

Second, crowd-sourcing can answer factual questions, like guessing the weight of an animal, but only if everyone has the same information. Otherwise, use crowd-sourcing to gauge the likely impact of policies, changes in policies, etc. The pulse of the "public," as it were.

The "likelihood that Iran is secretly building a nuclear weapon" isn't a crowd-source question. No crowd can make up for the lack of information about a "secret" effort. There is no information because, yes, Iran is keeping it secret.

Properly used, crowd-sourcing can be a very valuable tool.

The ad agencies call it public opinion polling.

Imagine appropriate polling activities on the ground in the Middle East, asking ordinary people about their hopes, desires, and dreams. If credited over the summarized and sanitized results of experts, it could lead to policies that benefit the people, not to say the governments, of the Middle East. (Another reason some prefer experts: experts support current governments.)

### [C]rowdsourcing … knowledge base construction

Friday, August 10th, 2012

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications by Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, and Dean F Sittig. (J Am Med Inform Assoc 2012; 19:713-718 doi:10.1136/amiajnl-2012-000852)

Abstract:

Objective We describe a novel, crowdsourcing method for generating a knowledge base of problem–medication pairs that takes advantage of manually asserted links between medications and problems.

Methods Through iterative review, we developed metrics to estimate the appropriateness of manually entered problem–medication links for inclusion in a knowledge base that can be used to infer previously unasserted links between problems and medications.

Results Clinicians manually linked 231,223 medications (55.30% of prescribed medications) to problems within the electronic health record, generating 41,203 distinct problem–medication pairs, although not all were accurate. We developed methods to evaluate the accuracy of the pairs, and after limiting the pairs to those meeting an estimated 95% appropriateness threshold, 11,166 pairs remained. The pairs in the knowledge base accounted for 183,127 total links asserted (76.47% of all links). Retrospective application of the knowledge base linked 68,316 medications not previously linked by a clinician to an indicated problem (36.53% of unlinked medications). Expert review of the combined knowledge base, including inferred and manually linked problem–medication pairs, found a sensitivity of 65.8% and a specificity of 97.9%.

Conclusion Crowdsourcing is an effective, inexpensive method for generating a knowledge base of problem–medication pairs that is automatically mapped to local terminologies, up-to-date, and reflective of local prescribing practices and trends.
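The 95% appropriateness cut-off in the Results might be sketched as a simple frequency filter. The example links and the ratio metric below are simplifications; the paper's actual appropriateness estimate is more involved.

```python
from collections import Counter

# Hypothetical manually asserted problem-medication links.
links = (
    [("hypertension", "lisinopril")] * 97
    + [("diabetes", "lisinopril")] * 3      # likely mis-assertions
    + [("diabetes", "metformin")] * 50
)

pair_counts = Counter(links)
med_counts = Counter(med for _, med in links)

THRESHOLD = 0.95  # the 95% appropriateness cut-off from the abstract

# Keep a pair when its share of all links for that medication
# meets the threshold.
knowledge_base = {
    pair: count / med_counts[pair[1]]
    for pair, count in pair_counts.items()
    if count / med_counts[pair[1]] >= THRESHOLD
}
print(knowledge_base)
# hypertension-lisinopril (0.97) and diabetes-metformin (1.0) survive;
# the 3% diabetes-lisinopril links fall below the threshold
```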

I would not apply the term "crowdsourcing" here, in part because the "crowd" is hardly unknown. It is not a crowd at all, but an identifiable group of clinicians.

That doesn't invalidate the results, which show the utility of data mining for creating knowledge bases.

As a matter of usage, let’s not confuse anonymous “crowds,” with specific groups of people.

### Readersourcing — A Manifesto

Monday, July 2nd, 2012

Readersourcing—a manifesto by Stefano Mizzaro. (Mizzaro, S. (2012), Readersourcing—a manifesto. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22668)

Abstract:

This position paper analyzes the current situation in scholarly publishing and peer review practices and presents three theses: (a) we are going to run out of peer reviewers; (b) it is possible to replace referees with readers, an approach that I have named “Readersourcing”; and (c) it is possible to avoid potential weaknesses in the Readersourcing model by adopting an appropriate quality control mechanism. The readersourcing.org system is then presented as an independent, third-party, nonprofit, and academic/scientific endeavor aimed at quality rating of scholarly literature and scholars, and some possible criticisms are discussed.

Mizzaro touches a number of issues that have speculative answers in his call for “readersourcing” of research. There is a website in progress, www.readersourcing.org.

I am interested in the approach as an aspect of crowdsourcing the creation of topic maps.

FYI, his statement that:

Readersourcing is a solution to a problem, but it immediately raises another problem, for which we need a solution: how to distinguish good readers from bad readers. If 200 undergraduate students say that a paper is good, but five experts (by reputation) in the field say that it is not, then it seems obvious that the latter should be given more importance when calculating the paper’s quality.
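Mizzaro's weighting problem can be made concrete with a sketch. The weights here are arbitrary assumptions; choosing them fairly is exactly the open problem he raises.

```python
# A minimal sketch: combine ratings from many low-weight readers with
# a few high-weight experts via a weighted mean.

def weighted_quality(ratings):
    """ratings: list of (score_0_to_10, reader_weight) pairs."""
    total_weight = sum(w for _, w in ratings)
    return sum(s * w for s, w in ratings) / total_weight

# 200 undergraduates rate the paper 8/10; five experts rate it 3/10.
# The 50x expert weight is a made-up knob, not a recommendation.
undergrads = [(8, 1.0)] * 200
experts = [(3, 50.0)] * 5

print(round(weighted_quality(undergrads + experts), 2))  # → 5.22
```

Note how sensitive the result is to the weight: the entire dispute about "good readers" versus "bad readers" is hidden inside that one number.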

That seems problematic to me, particularly for graduate students. If professors at their school rate research high or low, should that be calculated into a rating for that particular reader?

If that seems pessimistic, read: Fish, Stanley, "Transmuting the Lump: Paradise Lost, 1942–1979," in Doing What Comes Naturally (Duke University Press, 1989), which treats changing "expert" opinions on the closing chapters of Paradise Lost. So far as I know, the text did not change between 1942 and 1979 but "expert" opinion certainly did.

I offer that as a caution that all of our judgements are a matter of social consensus that changes over time. On some issues more quickly than others. Our information systems should reflect the ebb and flow of that semantic renegotiation.

### Citizen Archivist Dashboard ["...help the next person discover that record"]

Sunday, June 10th, 2012

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

• Tag – Tagging is a fun and easy way for you to help make National Archives records easier to find online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow "Tag It Tuesdays," a weekly feature on the NARAtions Blog. [includes "missions" (sets of materials for tagging), rated as "beginner," "intermediate," and "advanced." Or you can create your own mission.]
• Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

[A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

• Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

Here are just a few of the ways you may want to participate:

• Create new pages and edit pre-existing pages
• Store useful information discovered during research
• Expand upon a description in our online catalog

• Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

• Index the Census – Citizen Archivists, you can help index the 1940 census!

The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.

### Identifying And Weighting Integration Hypotheses On Open Data Platforms

Wednesday, May 16th, 2012

Identifying And Weighting Integration Hypotheses On Open Data Platforms by Julian Eberius, Katrin Braunschweig, Maik Thiele, and Wolfgang Lehner.

Abstract:

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with these problems. However, these methods still require input in the form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data Platforms, and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation using one of the largest Open Data platforms.

This is interesting work on Open Data platforms but it is marred by claims such as:

Open Data Platforms have some unique integration problems that do not appear in classical integration scenarios and which can only be identified using a global view on the level of datasets. These problems include partial- or duplicated datasets, partitioned datasets, versioned datasets and others, which will be described in detail in Section 4.

Really?

That would come as a surprise to the World Data Centre for Aerosols, whose project Synthesis and INtegration of Global Aerosol Data Sets (Contract No. ENV4-CT98-0780, DG 12 – EHKN) worked on data sets from 1999 to 2001. One of the specific issues it addressed was duplicate data sets.

More than a decade ago counts for a “classical integration scenario” I think.

Another quibble. Cited sources do not support the text.

New forms of data management such as dataspaces and pay-as-you-go data integration [2, 6] are a hot topic in database research. They are strongly related to Open Data Platforms in that they assume large sets of heterogeneous data sources lacking a global or mediated schemata, which still should be queried uniformly.

[2] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34:27–33, December 2005.

[6] J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR-07, 2007.

Articles written seven (7) and five (5) years ago do not justify a "hot topic in database research" claim today.

There are other issues, major and minor but for all that, this is important work.

I want to see reports that do justice to its importance.

### TREC 2012 Crowdsourcing Track

Saturday, May 12th, 2012

Panos Ipeirotis writes:

TREC 2012 Crowdsourcing Track - Call for Participation
June 2012 – November 2012

Goals

As part of the National Institute of Standards and Technology (NIST)‘s annual Text REtrieval Conference (TREC), the Crowdsourcing track investigates emerging crowd-based methods for search evaluation and/or developing hybrid automation and crowd search systems.

This year, our goal is to evaluate approaches to crowdsourcing high quality relevance judgments for two different types of media:

1. textual documents
2. images

For each of the two tasks, participants will be expected to crowdsource relevance labels for approximately 20k topic-document pairs (i.e., 40k labels when taking part in both tasks). In the first task, the documents will be from an English news text corpus, while in the second task the documents will be images from Flickr and from a European news agency.

Participants may use any crowdsourcing methods and platforms, including home-grown systems. Submissions will be evaluated against a gold standard set of labels and against consensus labels over all participating teams.
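The two evaluation modes the track describes, against a gold standard and against consensus labels over all participating teams, can be sketched as follows. The labels and document ids are invented.

```python
from collections import Counter

def accuracy(labels, reference):
    """Fraction of documents where a team's label matches the reference."""
    hits = sum(labels[doc] == ref for doc, ref in reference.items())
    return hits / len(reference)

def consensus(team_labels):
    """Majority label per document across all teams."""
    docs = team_labels[0].keys()
    return {
        doc: Counter(t[doc] for t in team_labels).most_common(1)[0][0]
        for doc in docs
    }

gold = {"d1": "rel", "d2": "nonrel", "d3": "rel"}
teams = [
    {"d1": "rel", "d2": "nonrel", "d3": "rel"},
    {"d1": "rel", "d2": "rel", "d3": "rel"},
    {"d1": "nonrel", "d2": "nonrel", "d3": "rel"},
]

print(accuracy(teams[1], gold))              # score vs gold labels
print(accuracy(teams[1], consensus(teams)))  # score vs consensus of all teams
```

The interesting cases are where the two scores diverge: a team can agree with the consensus and still be wrong against gold, which is exactly why the track reports both.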

Tentative Schedule

• Jun 1: Document corpora, training topics (for image task) and task guidelines available
• Jul 1: Training labels for the image task
• Aug 1: Test data released
• Sep 15: Submissions due
• Oct 1: Preliminary results released
• Oct 15: Conference notebook papers due
• Nov 6-9: TREC 2012 conference at NIST, Gaithersburg, MD, USA
• Nov 15: Final results released
• Jan 15, 2013: Final papers due

As you know, I am interested in crowd sourcing of paths through data and assignment of semantics.

Although I am puzzled: why do we continue to put the emphasis on post-creation assignment of semantics?

After data is created, we look around, surprised that the data has no explicit semantics.

Like realizing you are on Main Street without your pants.

Why don’t we look to the data creation process to assign explicit semantics?

Thoughts?

### Crowdsourcing – A Solution to your "Bad Data" Problems

Friday, May 11th, 2012

Crowdsourcing – A Solution to your “Bad Data” Problems by Hollis Tibbetts.

Hollis writes:

Data problems – whether they be inaccurate data, incomplete data, data categorization issues, duplicate data, data in need of enrichment – are age-old.

IT executives consistently agree that data quality/data consistency is one of the biggest roadblocks to them getting full value from their data. Especially in today’s information-driven businesses, this issue is more critical than ever.

Technology, however, has not done much to help us solve the problem – in fact, technology has resulted in the increasingly fast creation of mountains of “bad data”, while doing very little to help organizations deal with the problem.

One “technology” holds much promise in helping organizations mitigate this issue – crowdsourcing. I put the word technology in quotation marks – as it’s really people that solve the problem, but it’s an underlying technology layer that makes it accurate, scalable, distributed, connectable, elastic and fast. In an article earlier this week, I referred to it as “Crowd Computing”.

Crowd Computing – for Data Problems

The Human “Crowd Computing” model is an ideal approach for newly entered data that needs to either be validated or enriched in near-realtime, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:

• Verification of correctness
• Data conflict and resolution between different data sources
• Judgment calls (such as determining relevance, format or general “moderation”)
• “Fuzzy” referential integrity judgment
• Data error corrections
• Data enrichment or enhancement
• Classification of data based on attributes into categories
• De-duplication of data items
• Sentiment analysis
• Data merging
• Image data – correctness, appropriateness, appeal, quality
• Transcription (e.g. hand-written comments, scanned content)
• Translation

In areas such as the Data Warehouse, Master Data Management or Customer Data Management, Marketing databases, catalogs, sales force automation data, inventory data – this approach is ideal – or any time that business data needs to be enriched as part of a business process.
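One task from the list above, de-duplication, might be routed to crowd workers like this: a cheap automatic similarity screen nominates candidate pairs, and each surviving pair becomes a human microtask ("Are these the same company?"). The records and threshold are made up.

```python
import itertools
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Corp."},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Bluebird Ltd."},
]

def candidate_pairs(records, threshold=0.6):
    """Yield (id_a, id_b, score) for record pairs worth human review."""
    for a, b in itertools.combinations(records, 2):
        score = SequenceMatcher(
            None, a["name"].lower(), b["name"].lower()
        ).ratio()
        if score >= threshold:
            yield a["id"], b["id"], round(score, 2)

# Each yielded pair would be posted as a microtask for a crowd worker.
for task in candidate_pairs(records):
    print(task)
```

The machine does the scalable part (narrowing millions of possible pairs to a few plausible ones); the crowd does the judgment call, which is the division of labor Hollis is describing.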

Hollis has a number of good points. But the choice doesn’t have to be “big data/iron” versus “crowd computing.”

More likely to get useful results out of some combination of the two.

Make “big data/iron” responsible for raw access, processing, visualization in an interactive environment with semantics supplied by the “crowd computers.”

And vet participants on both sides in real time. Would be a novel thing to have firms competing to supply the interactive environment and being paid on the basis of the “crowd computers” that preferred it or got better results.

That is a ways past where Hollis is going but I think it leads naturally in that direction.

### Syrian crowdmapping project documents reports of rape

Sunday, April 1st, 2012

Syrian crowdmapping project documents reports of rape

Niall Firth, technology editor for the New Scientist, writes:

Earlier this month, an unnamed woman in the village of Sahl Al-Rawj, Syria, left the safety of her hiding place to plead for the lives of her husband and son as government forces advanced. She was captured and five soldiers took turns raping her as she was forced to watch her husband die.

Her shocking story – officially unverified – is just one of many reports of sexual violence against women that have come out of Syria as fighting continues between government forces and rebels. Now a crowd-mapping website, launched this week, will attempt to detail every such rape and incident of sexual violence against women throughout the conflict.

The map is the creation of the Women Under Siege initiative, and uses the same crowdsourcing technology developed by Washington DC-based Ushahidi, which is also being used to calculate the death toll in the recent fighting.

I read not all that long ago that under-reporting of rape runs 60% among civilians and 80% among the military: Military Sexual Abuse: A Greater Menace Than Combat.

Would a mapping service such as the one created for the conflict in Syria help with the under reporting of rape in the United States? That would at least document the accounts of rape victims and the locations of their attacks.

Greater reporting of rapes and their locations is a first step.

Topic maps could help with the next step: Outing Rapists.

Outing Rapists means binding the accounts and locations of rapes to Facebook, faculty, department, and government listings of rapists.

Outing a rapist may prevent a future rape.

A couple of resources out of thousands on domestic or sexual violence: National Center on Domestic and Sexual Violence or U.S. Military Violence Against Women.

### SoSlang Crowdsources a Dictionary

Wednesday, March 21st, 2012

SoSlang Crowdsources a Dictionary

Stephen E. Arnold writes:

Here’s a surprising and interesting approach to dictionaries: have users build their own. SoSlang allows anyone to add a slang term and its definition. Beware, though, this site is not for everyone. Entries can be salty. R-rated, even. You’ve been warned.

I would compare this approach:

speakers -> usages -> dictionary

to a formal dictionary:

speakers -> usages -> editors -> formal dictionary

That is to say a formal dictionary reflects the editor’s sense of the language and not the raw input of the speakers of a language.

It would be a very interesting text mining task to eliminate duplicate usages of terms so that the changing uses of a term can be tracked.
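That mining task might start something like this sketch, which collapses duplicate senses of a term while keeping the earliest date each distinct sense was seen, so a term's changing use can be tracked. The entries are invented.

```python
from collections import defaultdict

# Hypothetical (term, definition, date-first-posted) entries.
entries = [
    ("sick", "ill, unwell", "1995-04-01"),
    ("sick", "excellent, impressive", "2003-06-12"),
    ("sick", "excellent, impressive", "2004-01-30"),  # duplicate sense
    ("sick", "ill, unwell", "2001-08-19"),            # duplicate sense
]

def sense_timeline(entries):
    """Map each term to its distinct senses, ordered by first appearance."""
    first_seen = {}
    for term, sense, date in entries:
        key = (term, sense.lower())
        if key not in first_seen or date < first_seen[key]:
            first_seen[key] = date
    timeline = defaultdict(list)
    for (term, sense), date in first_seen.items():
        timeline[term].append((date, sense))
    return {term: sorted(uses) for term, uses in timeline.items()}

print(sense_timeline(entries))
# {'sick': [('1995-04-01', 'ill, unwell'),
#           ('2003-06-12', 'excellent, impressive')]}
```

A real pass would need fuzzier matching than exact lowercase strings, since slang definitions rarely repeat verbatim, but the shape of the output is the timeline we want.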

### Crowdsourcing and the end of job interviews

Thursday, March 1st, 2012

Crowdsourcing and the end of job interviews by Panos Ipeirotis.

From the post:

When you discuss crowdsourcing solutions with people that have not heard the concept before, they tend to ask the question: “Why is crowdsourcing so much cheaper than existing solutions that depend on ‘classic’ outsourcing?

Interestingly enough, this is not a phenomenon that appears only in crowdsourcing. The Sunday edition of the New York Times has an article titled Why Are Harvard Graduates in the Mailroom? The article discusses the job-searching strategy in some fields (e.g., Hollywood, academia), where talented young applicants are willing to start with jobs that pay well below what their skills deserve, in exchange for having the ability to make it big later:

[This is] the model lottery industry. For most companies in the business, it doesn’t make economic sense to, as Google does, put promising young applicants through a series of tests and then hire only the small number who pass. Instead, it’s cheaper for talent agencies and studios to hire a lot of young workers and run them through a few years of low-paying drudgery…. This occupational centrifuge allows workers to effectively sort themselves out based on skill and drive. Over time, some will lose their commitment; others will realize that they don’t have the right talent set; others will find that they’re better at something else.

Interestingly enough, this occupational centrifuge is very close to the model of employment in crowdsourcing.

The author’s take is that esoteric interview questions aren’t as effective as using a crowdsourcing model. I suspect he may be right.

If that is true, how would you go about structuring a topic map authoring project for crowdsourcing? What framework would you erect going into the project? What sort of quality checks would you implement? Would you “prime the pump” with already public data to be refined?

Are we on the verge of a meritocracy of performance?

As opposed to fields that were once meritocracies of performance, but are now the land of clannish and odd interview questions?

### Orev: The Apache OpenRelevance Viewer

Tuesday, December 13th, 2011

Orev: The Apache OpenRelevance Viewer

From the webpage:

The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.

These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.

Having no such tool, the Viewer – Orev – is meant for being exactly that, and so to minimize the overhead required from both the project managers and the people who will be doing the actual work. By providing nice and easy facilities to add new Topics and Corpora, and to feed documents into a corpus, it will make it very easy to manage the surrounding infrastructure. And with a nice web UI to be judging documents with, the work of the recruits is going to be very easy to grok.

Orev focuses on the judging of documents, but that is a common level of granularity for relevance these days.

I don’t know of anything more granular but if you find such a tool, please sing out!