## Cataloguing projects

March 11th, 2014

Cataloguing projects (UK National Archives)

From the webpage:

The National Archives’ Cataloguing Strategy

The overall objective of our cataloguing work is to deliver more comprehensive and searchable catalogues, thus improving access to public records. To make online searches work well we need to provide adequate data and prioritise cataloguing work that tackles less adequate descriptions. For example, we regard ranges of abbreviated names or file numbers as inadequate.

I was led to this delightful resource by a tweet from David Underdown, advising that his presentation from National Catalogue Day in 2013 was now online.

His presentation, along with several others and reports on projects from prior years, is available at this projects page.

I thought the presentation titled Opening up of Litigation: 1385-1875, by Amanda Bevan and David Foster, was quite interesting in light of various projects that want to create new “public” citation systems for law and litigation.

I haven’t seen such a proposal yet that gives sufficient consideration to the enormity of the question: what do you do with old legal materials?

The litigation presentation could be a poster child for topic maps.

I am looking forward to reading the other presentations as well.

## Number Theory and Algebra

March 11th, 2014

A Computational Introduction to Number Theory and Algebra by Victor Shoup.

From the preface of the second edition:

Number theory and algebra play an increasingly significant role in computing and communications, as evidenced by the striking applications of these subjects to such fields as cryptography and coding theory. My goal in writing this book was to provide an introduction to number theory and algebra, with an emphasis on algorithms and applications, that would be accessible to a broad audience. In particular, I wanted to write a book that would be appropriate for typical students in computer science or mathematics who have some amount of general mathematical experience, but without presuming too much specific mathematical knowledge.

Even though reliance on cryptography and vendors of cryptography is fading, you are likely to encounter people still using cryptography or legacy data “protected” by cryptography.

BTW, this is only one of several books that Cambridge University Press has published while allowing the final text to remain freely available.

Should you pen something appropriate and hopefully profitable for you and a publisher, Cambridge University Press should be on your short list.

Cambridge University Press is a great press and a good citizen of the academic world.

I first saw this in a tweet by Algebra Fact.

## NASA’s Asteroid Grand Challenge Series

March 11th, 2014

NASA’s Asteroid Grand Challenge Series

From the webpage:

Welcome to the Asteroid Grand Challenge Series sponsored by the NASA Tournament Lab! The Asteroid Grand Challenge Series will be comprised of a series of topcoder challenges to get more people from around the planet involved in finding all asteroid threats to human populations and figuring out what to do about them. In an increasingly connected world, NASA recognizes the value of the public as a partner in addressing some of the country’s most pressing challenges. Click here to learn more and participate in our debut challenge, Asteroid Data Hunter – launching 03/17/14!

From the details page:

The Asteroid Data Hunter challenge tasks competitors to develop significantly improved algorithms to identify asteroids in images from ground-based telescopes. The winning solution must increase the detection sensitivity, minimize the number of false positives, ignore imperfections in the data, and run effectively on all computers.

Lots of data, difficult problem, high stakes (ELE (extinction level event) prevention).
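The stated requirements boil down to a sensitivity (recall) versus false-positive tradeoff. A minimal scoring sketch in that spirit; the formula is illustrative, not topcoder’s official metric, and real entries would match sky coordinates within a tolerance rather than exact labels:

```python
def score_detections(predicted, truth):
    """Score a detector the way the challenge criteria read: reward
    sensitivity (recall), penalize false positives.

    `predicted` and `truth` are collections of hashable detection labels.
    Returns recall, precision and the raw false-positive count.
    """
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)            # asteroids correctly detected
    fp = len(pred - true)            # spurious detections
    recall = tp / len(true) if true else 0.0
    precision = tp / len(pred) if pred else 0.0
    return {"recall": recall, "precision": precision, "false_positives": fp}
```

A winning entry maximizes recall while keeping the false-positive count near zero; how those two are weighted against each other is up to the challenge organizers.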

## 30,000 comics, 7,000 series – How’s Your Collection?

March 11th, 2014

Marvel Comics opens up its metadata for amazing Spider-Apps by Alex Dalenberg.

From the post:

It’s not as cool as inheriting superpowers from a radioactive spider, but thanks to Marvel Entertainment’s new API, you can now build Marvel Comics apps to your heart’s content.

That is, as long as you’re not making any money off of them. Nevertheless, it’s a comic geek’s dream. The Disney-owned company is opening up the data trove from its 75-year publishing history, including cover art, characters and comic book crossover events, for developers to tinker with.

That’s metadata for more than 30,000 comics and 7,000 series.

I know, another one of those non-commercial use licenses. I mean, Marvel paid for all of this content and then has the gall to not just give it away for free. What is the world coming to?

Personally I think Marvel has the right to allow as much or as little access to their data as they please. If you come up with a way to make money using this content, ask Marvel for commercial permissions. I deeply suspect they will be more than happy to accommodate any reasonable request.

Speaking of contemporary history, a couple of other cultural goldmines, Playboy Cover to Cover Hard Drive – Every Issue From 1953 to 2010 and Rolling Stone.

I don’t own either one so I don’t know how hard it would be to get the content into machine readable format.

Still, both would be a welcome contrast to mainstream news sources.

I first saw this in a tweet by Bob DuCharme.

## Data Science Challenge

March 11th, 2014

Data Science Challenge

Some details from the registration page:

Prerequisite: Data Science Essentials (DS-200)
Schedule: Twice per year
Duration: Three months from launch date
Next Challenge Date: March 31, 2014
Language: English
Price: USD $600

From the webpage:

Cloudera will release a Data Science Challenge twice each year. Each bi-quarterly project is based on a real-world data science problem involving a large data set and is open to candidates for three months to complete. During the open period, candidates may work on their project individually and at their own pace.

Current Data Science Challenge

The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014, and will cost USD $600.

In the U.S., Medicare reimburses private providers for medical procedures performed for covered individuals. As such, it needs to verify that the type of procedures performed and the cost of those procedures are consistent and reasonable. Finally, it needs to detect possible errors or fraud in claims for reimbursement from providers. You have been hired to analyze a large amount of data from Medicare and try to detect abnormal data — providers, areas, or patients with unusual procedures and/or claims.

Build a Winning Model

CCP candidates compete against each other and against a benchmark set by a committee including some of the world’s elite data scientists. Participants who surpass evaluation benchmarks receive the CCP: Data Scientist credential.

Those with the highest scores from each Challenge will have an opportunity to share their solutions and promote their work on cloudera.com and via press and social media outlets. All candidates retain the full rights to their own work and may leverage their models outside of the Challenge as they choose.

Useful way to develop some street cred in data science.
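The Medicare task described above is classic outlier screening. A toy sketch, with an assumed data layout (provider id, claim amount) and an assumed z-score threshold; real claims screening would need far more robust statistics:

```python
from statistics import mean, stdev

def flag_anomalies(claims, threshold=3.0):
    """Flag providers whose average claim amount is a statistical outlier.

    claims: iterable of (provider_id, amount) pairs.
    Returns the set of provider ids whose mean claim lies more than
    `threshold` standard deviations from the population mean.
    """
    # Aggregate claim amounts per provider.
    totals = {}
    for provider, amount in claims:
        totals.setdefault(provider, []).append(amount)
    averages = {p: mean(a) for p, a in totals.items()}

    # Simple z-score screen over the per-provider averages.
    values = list(averages.values())
    mu, sigma = mean(values), stdev(values)
    return {p for p, v in averages.items() if abs(v - mu) > threshold * sigma}
```

With small samples a single extreme provider inflates the standard deviation, so a median-based (MAD) estimate is usually preferred in practice.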

## The FIRST Act, Retro Legislation?

March 11th, 2014

From the press release:

The Scholarly Publishing and Academic Research Coalition (SPARC), an international alliance of nearly 800 academic and research libraries, today announced its opposition to Section 303 of H.R. 4186, the Frontiers in Innovation, Research, Science and Technology (FIRST) Act. This provision would impose significant barriers to the public’s ability to access the results of taxpayer-funded research.

Section 303 of the bill would undercut the ability of federal agencies to effectively implement the widely supported White House Directive on Public Access to the Results of Federally Funded Research and undermine the successful public access program pioneered by the National Institutes of Health (NIH) – recently expanded through the FY14 Omnibus Appropriations Act to include the Departments of Labor, Education and Health and Human Services. Adoption of Section 303 would be a step backward from existing federal policy in the directive, and put the U.S. at a severe disadvantage among our global competitors.

“This provision is not in the best interests of the taxpayers who fund scientific research, the scientists who use it to accelerate scientific progress, the teachers and students who rely on it for a high-quality education, and the thousands of U.S. businesses who depend on public access to stay competitive in the global marketplace,” said Heather Joseph, SPARC Executive Director. “We will continue to work with the many bipartisan members of the Congress who support open access to publicly funded research to improve the bill.”

SPARC’s press release never quotes a word from H.R. 4186. Not one. Commentary, but nary a word from its object.

I searched at Thomas (the Congressional information service at the Library of Congress) for H.R. 4186 and came up empty by bill number. Switching to the Congressional Record for Monday, March 10, 2014, I did find the bill being introduced and the setting of a hearing on it. The GPO has not (as of today) posted the text of H.R. 4186, but when it does, follow this link: H.R. 4186.

Even more importantly, SPARC doesn’t point out who is responsible for the objectionable section appearing in the bill. Bills don’t write themselves and as far as I know, Congress doesn’t have a random bill generator.

The bottom line is that someone, an identifiable someone, asked for longer embargo wording to be included. If the SPARC press release is accurate, the most likely someones to ask are Chairman Lamar Smith (R-TX 21st District) or Rep. Larry Bucshon (R-IN 8th District).

The Wikipedia page on the 8th Congressional District of Indiana needs to be updated, but the district covers southwestern Indiana. You might want to check Bucshon’s page at Wikipedia and the links there to other resources.

Wikipedia, on the 21st Congressional District of Texas, places it north of San Antonio, the seventh largest city in the United States. Lamar Smith’s page at Wikipedia makes for some interesting reading.

Odds are that in and around these two districts there are people interested in longer embargo periods on federally funded research.

Those are at least some starting points for effective opposition to this legislation, assuming it was reported accurately by SPARC. Let’s drop the pose of disinterested legislators trying valiantly to serve the public good. Not impossible, just highly unlikely. Let’s argue about who is getting paid and for what benefits.

All visible objects, man, are but as pasteboard masks. But in each event –in the living act, the undoubted deed –there, some unknown but still reasoning thing puts forth the mouldings of its features from behind the unreasoning mask. If man will strike, strike through the mask! [Melville, Moby Dick, Chapter XXXVI]

Legislation as a “pasteboard mask” is a useful image. There is not a contour, dimple, shade or expression that wasn’t bought and paid for by someone. You have to strike through the mask to discover who.

Are you game?

PS: Curious, where would you go next (data wise, I don’t have the energy to lurk in garages) in terms of searching for the buyers of longer embargoes in H.R. 4186?

## Data Science 101: Deep Learning Methods and Applications

March 10th, 2014

Data Science 101: Deep Learning Methods and Applications by Daniel Gutierrez.

From the post:

Microsoft Research, the research arm of the software giant, is a hotbed of data science and machine learning research. Microsoft has the resources to hire the best and brightest researchers from around the globe. A recent publication is available for download (PDF): “Deep Learning: Methods and Applications” by Li Deng and Dong Yu, two prominent researchers in the field.

Deep sledding with twenty (20) pages of bibliography and pointers to frequently updated lists of resources (at page 8).

You did say you were interested in deep learning. Yes?

Enjoy!

## Orbital Computing – Electron Orbits That Is.

March 10th, 2014

Physicist proposes a new type of computing at SXSW. Check out orbital computing by Stacey Higginbotham.

From the post:

The demand for computing power is constantly rising, but we’re heading to the edge of the cliff in terms of increasing performance — both in terms of the physics of cramming more transistors on a chip and in terms of the power consumption. We’ve covered plenty of different ways that researchers are trying to continue advancing Moore’s Law — this idea that the number of transistors (and thus the performance) on a chip doubles every 18 months — especially the far out there efforts that take traditional computer science and electronics and dump them in favor of using magnetic spin, quantum states or probabilistic logic.

We’re going to add a new impossible that might become possible to that list thanks to Joshua Turner, a physicist at the SLAC National Accelerator Laboratory, who has proposed using the orbits of electrons around the nucleus of an atom as a new means to generate the binary states (the charge or lack of a charge that transistors use today to generate zeros and ones) we use in computing. He calls this idea orbital computing and the big takeaway for engineers is that one can switch the state of an electron’s orbit 10,000 times faster than you can switch the state of a transistor used in computing today.

That means you can still have the features of computing in that you use binary programming, but you just can compute more in less time. To get us to his grand theory, Turner had to take the SXSW audience through how computing works, how transistors work, the structure of atoms, the behavior of subatomic particles and a bunch of background on X-rays.

This would have been a presentation to see: Bits, Bittier Bits & Qubits: Physics of Computing

Try this SLAC Search for some publications by Joshua Turner.

It’s always fun to read about how computers will be able to process data more quickly. A techie sort of thing.

On the other hand, going 10,000 times faster with semantically heterogeneous data will get you to the wrong answer 10,000 times faster.

If you realize the answer is wrong, you may have time to try again.

What if you don’t realize the answer is wrong?

Do you really want to be the customs agent who stops a five-year-old because their name is similar to that of a known terrorist? Because the machine said they could not fly?

I’m excited about going faster, but worried about data going by too fast for anyone to question its semantics.

## Hubble Source Catalog

March 10th, 2014

Beta Version 0.3 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Search with Summary Form now (one row per match)
Search with Detailed Form now (one row per source)

Beta Version 0.3 of the HSC contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists in HLA version DR7.2 (data release 7.2) that are considered to be valid detections because they have flag values less than 5 (see more flag information).

The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).

If you need training with this data set, see: A Hubble Source Catalog (HSC) Walkthrough.
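The crossmatching described in the quote pairs detections from overlapping source lists when their positional offset falls under a tolerance (10 mas here, echoing the residuals quoted above). A toy planar sketch that skips the astrometric correction the HLA pipeline actually performs:

```python
import math

MAS_PER_DEG = 3.6e6  # milliarcseconds per degree

def crossmatch(list_a, list_b, tol_mas=10.0):
    """Greedy nearest-neighbour crossmatch between two source lists.

    Each list holds (source_id, ra_deg, dec_deg) tuples.  Sources are
    paired when their small-angle separation is below `tol_mas`.
    Returns a list of (id_a, id_b) pairs.
    """
    matches = []
    used = set()
    for id_a, ra_a, dec_a in list_a:
        best = None
        for id_b, ra_b, dec_b in list_b:
            if id_b in used:
                continue
            # Small-angle (planar) separation; scale RA by cos(dec).
            d_ra = (ra_a - ra_b) * math.cos(math.radians(dec_a))
            d_dec = dec_a - dec_b
            sep_mas = math.hypot(d_ra, d_dec) * MAS_PER_DEG
            if sep_mas < tol_mas and (best is None or sep_mas < best[1]):
                best = (id_b, sep_mas)
        if best:
            used.add(best[0])
            matches.append((id_a, best[0]))
    return matches
```

A production catalog would use spherical geometry and a spatial index rather than this quadratic loop, but the matching criterion is the same idea.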

## Apache Tez 0.3 Released!

March 10th, 2014

Apache Tez 0.3 Released! by Bikas Saha.

From the post:

The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focusing on fundamentals and ironing out several key functions. The major action areas in this release were

1. Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.
2. Scalability. We tested the software on large clusters, very large data sets and large applications processing tens of TB each to make sure it scales well with both data-sets and machines.
3. Fault Tolerance. Apache Tez executes a complex DAG workflow that can be subject to multiple failure conditions in clusters of commodity hardware and is highly resilient to these and other sorts of failures.
4. Stability. A large number of bug fixes went into this release as early adopters and testers put the software through its paces and reported issues.

To prove the stability and performance of Tez, we executed complex jobs comprised of more than 50 different stages and tens of thousands of tasks on a fairly large cluster (> 300 Nodes, > 30TB data). Tez passed all our tests and we are certain that new adopters can integrate confidently with Tez and enjoy the same benefits as Apache Hive & Apache Pig have already.

I am curious how the Hadoop community is going to top 2013. I suspect Tez is going to be part of that answer!

## CORDIS – EU research projects under FP7 (2007-2013)

March 10th, 2014

CORDIS – EU research projects under FP7 (2007-2013)

Description:

This dataset contains projects funded by the European Union under the seventh framework programme for research and technological development (FP7) from 2007 to 2013. Grant information is provided for each project, including reference, acronym, dates, funding, programmes, participant countries, subjects and objectives. A smaller file is also provided without the texts for objectives.

The column separator is the “;” character.

The “Achievements” column is blank for all 22,653 projects/rows.
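Reading such an export requires setting the “;” delimiter explicitly; the sketch below also reports columns that are blank on every row, like “Achievements”. The column names in the sample are illustrative, not the exact FP7 schema:

```python
import csv
import io

def load_cordis(text):
    """Parse a CORDIS-style export: ';'-separated, quoted fields.

    Returns the rows as dicts plus the set of columns that are blank
    on every row.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = list(reader)
    blank = {col for col in reader.fieldnames
             if all(not row[col].strip() for row in rows)}
    return rows, blank
```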

Can you suggest other sources with machine readable data on the results of EU research projects under FP7 (2007-2013)?

Thanks!

I first saw this in a tweet by Stefano Bertolo.

## The Elements According to Relative Abundance

March 10th, 2014

The Elements According to Relative Abundance (A Periodic Chart by Prof. Wm. F. Sheehan, University of Santa Clara. CA 95053. Ref. Chemistry. Vol. 49.No.3. p. 17-18, 1976)

From the caption:

Roughly, the size of an element’s own niche is proportioned to its abundance on Earth’s surface, and in addition, certain chemical similarities.

Very nice.

A couple of suggestions for the graphically inclined:

• How does a proportionate periodic table of your state (in the United States, substitute other appropriate geographic subdivisions if outside the United States) compare to other states?
• Adjust your periodic table to show the known elements at important dates in history.

I first saw this in a tweet by Maxime Duprez.

## A New Entity Salience Task with Millions of Training Examples

March 10th, 2014

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.

Abstract:

Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here: https://code.google.com/p/nyt-salience

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results?”

Which presumes a working definition of “acceptable” that you have hammered out with your client.
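As a toy illustration of the entity salience task (my own stand-in, not the paper’s NLP-pipeline features), a first cut at per-entity features might be mention frequency and normalized first-mention position:

```python
def salience_features(doc_tokens, entity_mentions):
    """Crude per-entity salience features.

    entity_mentions maps entity id -> list of token indices where it is
    mentioned.  Returns id -> (mention count, first mention position
    normalized by document length).  Entities mentioned often and early
    tend to be more salient.
    """
    n = len(doc_tokens)
    return {
        ent: (len(positions), min(positions) / n)
        for ent, positions in entity_mentions.items()
        if positions
    }
```

A real system would feed features like these (plus centrality over an entity graph) into a trained classifier rather than reading them off directly.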

I first saw this in a tweet by Stefano Bertolo.

## Open Source: Option of the Security Conscious

March 10th, 2014

International Space Station attacked by ‘virus epidemics’ by Samuel Gibbs.

From the post:

Malware made its way aboard the International Space Station (ISS) causing “virus epidemics” in space, according to security expert Eugene Kaspersky.

Kaspersky, head of security firm Kaspersky labs, revealed at the Canberra Press Club 2013 in Australia that before the ISS switched from Windows XP to Linux computers, Russian cosmonauts managed to carry infected USB storage devices aboard the station spreading computer viruses to the connected computers.

…..

In May, the United Space Alliance, which oversees the running of the ISS in orbit, migrated all the computer systems related to the ISS over to Linux for security, stability and reliability reasons.

If you or your company is at all concerned with security issues, open source software is the only realistic option.

Not that open source software in fact has fewer bugs on release, but there is the potential for a large community of users to seek those bugs out and fix them.

The recent Apple “goto fail” farce would not have happened in an open source product. Some tester, intentionally or accidentally, would have used invalid credentials and the problem would have surfaced.

If we are lucky, Apple had one tester who was also tasked with other duties and so we got what Apple chose to pay for.

This is not a knock against software companies that sell software for a profit. Rather it is a challenge to the current marketing of software for a profit.

Imagine that MS SQL Server were open source but commercial software. That is, the source code is freely available but the licensing prohibits its use for commercial resale.

Do you really think that banks, insurance companies, enterprises are going to be grabbing source code and compiling it to avoid license fees?

I admit to having a low opinion of the morality of banks, insurance companies, etc., but they also have finely tuned senses of risk. They might save a few bucks in the short run, but the consequences of getting caught are quite severe.

So there would be lots of hobbyists hacking on, trying to improve, etc. MS SQL Server source code.

You know that hackers can no more keep a secret than a member of Congress, albeit hackers don’t usually blurt out secrets on the evening news. Every bug, improvement, etc. would become public knowledge fairly quickly.

MS could even make contributing bug reports and fixes a condition of the open source download.

MS could continue to sell MS SQL Server as commercial software, just as it did before making it open source.

The difference would be instead of N programmers working to find and fix bugs, there would be N + Internet community working to find and fix bugs.

The other difference being that the security conscious in military, national security, and government organizations would not have to plan migrations away from closed source software.

Post-Snowden, open source software is the only viable security option.

PS: Yes, I have seen the “we are not betraying you now” and/or “we betray you only when required by law to do so,” statements from various vendors.

I much prefer to not be betrayed at all.

You?

PPS: There is another advantage to vendors from an all open source policy on software. Vendors worry about others copying their code, etc. With open source that should be easy enough to monitor and prove.

## Algebraic and Analytic Programming

March 10th, 2014

Algebraic and Analytic Programming by Luke Palmer.

In a short post Luke does a great job contrasting algebraic versus analytic approaches to programming.

In an even shorter summary, I would say the difference is “truth” versus “acceptable results.”

Oddly enough, that difference shows up in other areas as well.

The major ontology projects, including linked data, are pushing one and only one “truth.”

Versus other approaches, such as topic maps (at least in my view), that tend towards “acceptable results.”

I am not sure what measure of success you could have other than “acceptable results.”

Or what other measure there would be for a semantic technology.

Whether the universal-truth-of-the-world folks admit it or not, they just have a different definition of “acceptable results.” Their “acceptable results” means their world view.

I appreciate the work they put into their offer but I have to decline. I already have a world view of my own.

You?

I first saw this in a tweet by Computer Science.

## Mapillary to OpenStreetMap

March 10th, 2014

Mapillary to OpenStreetMap by Johan Gyllenspetz.

From the post:

We have been working with the OpenStreetMap community lately and we wanted to investigate how Mapillary can be used as a tool for some serious mapping.

First of all I needed to find a possible candidate area for mapping. After some investigation I found this little park in West Hollywood, called West Hollywood park. The park was under construction on the Bing images in the Id editor and nobody has traced the park yet.

If a physical map lacks your point of interest, you have to mark on the map or use some sort of overlay.

Like a topic map, with Mapillary and OpenStreetMap, you can add your point of interest with a suitable degree of accuracy.

You don’t need the agreement of your local department of highways or civil defense authorities.

Enjoy!

I first saw this in a tweet by Map@Syst.

## The Books of Remarkable Women

March 10th, 2014

The Books of Remarkable Women by Sarah J. Biggs.

From the post:

In 2011, when we blogged about the Shaftesbury Psalter (which may have belonged to Adeliza of Louvain; see below), we wrote that medieval manuscripts which had belonged to women were relatively rare survivals. This still remains true, but as we have reviewed our blog over the past few years, it has become clear that we must emphasize the relative nature of the rarity – we have posted literally dozens of times about manuscripts that were produced for, owned, or created by a number of medieval women.

A good example of why I think topic maps have so much to offer for preservation of cultural legacy.

While each of the books covered in this post is an important historical artifact in its own right, its value is enhanced by the context of its production, ownership, contemporary practices, etc.

All of which lies outside the books proper. Just as data about data, the so-called “metadata,” usually lies outside its information artifact.

If future generations are going to have better historical context than we do for many items, we had best get started writing it down.

## Lucene 4 Essentials for Text Search and Indexing

March 9th, 2014

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short!

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.

You?

PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.

## Getty – 35 Million Free Images

March 9th, 2014

From the post:

Getty Images has single-handedly redefined the entire photography market with the launch of a new embedding feature that will make more than 35 million images freely available to anyone for non-commercial usage. BJP’s Olivier Laurent finds out more.

The controversial move is set to draw professional photographers’ ire at a time when the stock photography market is marred by low prices and under attack from new mobile photography players. Yet, Getty Images defends the move, arguing that it’s not strong enough to control how the Internet has developed and, with it, users’ online behaviours.

“We’re really starting to see the extent of online infringement,” says Craig Peters, senior vice president of business development, content and marketing at Getty Images. “In essence, everybody today is a publisher thanks to social media and self-publishing platforms. And it’s incredibly easy to find content online and simply right-click to utilise it.”

In the past few years, Getty Images found that its content was “incredibly used” in this manner online, says Peters. “And it’s not used with a watermark; instead it’s typically found on one of our valid licensing customers’ websites or through an image search. What we’re finding is that the vast majority of infringement in this space happen with self publishers who typically don’t know anything about copyright and licensing, and who simply don’t have any budget to support their content needs.”

To solve this problem, Getty Images has chosen an unconventional strategy. “We’re launching the ability to embed our images freely for non-commercial use online,” Peters explains. In essence, anyone will be able to visit Getty Images’ library of content, select an image and copy an embed HTML code to use that image on their own websites. Getty Images will serve the image in a embedded player – very much like YouTube currently does with its videos – which will include the full copyright information and a link back to the image’s dedicated licensing page on the Getty Images website.

More than 35 million images from Getty Images’ news, sports, entertainment and stock collections, as well as its archives, will be available for embedding from 06 March.

What a clever move by Getty!

Think about it. Who do you sue for copyright infringement? Is it some hobbyist blogger or use of an image in a school newspaper? OK, the RIAA would, but what about sane people?

Your first question: Did the infringement result in a substantial profit?

Your second question: Does the guilty party have enough assets to make recovery of that profit likely?

You only want to catch infringement by other major for-profit players.

All of whom have to use your images publicly. Hiding infringement isn’t possible.

None of the major media outlets or publishers are going to cheat on use of your images. Whether that is because they are honest with regard to IP or so easily caught, doesn’t really matter.

In one fell swoop, Getty has secured for itself free advertising for every image that is used for free. Advertising it could not have bought for any sum of money.

Makes me wonder when the ACM, IEEE, Springer, Elsevier and others are going to realize that free and public access to their journals and monographs will drive demand for libraries to have enhanced access to those publications.

It isn’t like EBSCO and the others are going to start using data that is limited to non-commercial use for their databases. That would be too obvious, not to mention incurring significant legal liability.

Ditto for libraries. Libraries want legitimate access to the materials they provide and/or host.

As I told an academic society once upon a time, “It’s time to stop grubbing for pennies when there are $100 bills blowing overhead.” The case involved replacement of “lost in the mail” journals. At a replacement cost of $3.50 (plus postage) per claim, they were employing a full-time person to research eligibility to request a replacement copy. For a time I convinced them to simply replace upon request in the mailroom. Track requests, but just do it. Worked quite well.

Over the years management has changed and I suspect they have returned to protecting members’ rights by ensuring that only people entitled to a copy of the journal got one. I kid you not, that was the explanation for the old policy. Bizarre.

I first saw this at: Getty Set 35 Million Images Free, But Who Can Use Them? by David Godsall.

PS: The thought does occur to me that suitable annotations could be prepared ahead of time for these images so that when a for-profit publisher purchases the rights to a Getty image, someone could offer robust metadata to accompany the image.

## IMDB Top 100K Movies Analysis in Depth (Parts 1-4)

March 9th, 2014

IMDB Top 100K Movies Analysis in Depth Part 1 by Bugra Akyildiz.

IMDB Top 100K Movies Analysis in Depth Part 2

IMDB Top 100K Movies Analysis in Depth Part 3

IMDB Top 100K Movies Analysis in Depth Part 4

From part 1:

Data is from IMDB and it includes all of the popularly voted 100042 movies from 1950 to 2013.(I know why 100000 is there but have no idea how 42 movies get squeezed. Instead of blaming my web scraping skills, I blame the universe, though).

The reason why I chose the number of votes as a metric to order the movies is because, generally the information (title, certificate, outline, director and so on) about movie are more likely to be complete for the movies that have high number of votes. Moreover, IMDB uses number of votes as a metric to determine the ranking as well so number of votes also correlate with the rating as well. Further, everybody at least has an idea on IMDB Top 250 or IMDB Top 1000 which are ordered by the ratings computed by IMDB.

Although the data is quite rich in terms of basic information, only year, rating and votes are complete for all of the movies. Only ~80% of the movies have runtime information(minutes). The categories are mostly 90% complete which could be considered good but the certificate information of the movies is the most sparse (only ~25% of them have it).

This post aims to explore data for different aspects of data (categories, rating and categories) and also useful information (best movie in terms of rating or votes for each year).

An interesting analysis of the Internet Movie Database (IMDB) that incorporates other sources, such as for revenue and actors’ and actresses’ age and height information.
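The completeness figures quoted above are easy to reproduce for any scraped dataset. A minimal sketch (the field names are illustrative assumptions, not the author’s actual schema):

```python
# Compute per-field completeness for a scraped movie dataset.
# Field names are illustrative; the post does not publish its schema.
def completeness(rows, fields):
    """Return the fraction of rows with a non-empty value for each field."""
    total = len(rows)
    return {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / total
        for f in fields
    }

movies = [
    {"title": "A", "year": 1994, "rating": 9.2, "votes": 1000, "runtime": 142},
    {"title": "B", "year": 1972, "rating": 9.1, "votes": 900, "runtime": None},
    {"title": "C", "year": 2008, "rating": 8.9, "votes": 800, "certificate": "PG-13"},
]

stats = completeness(movies, ["year", "rating", "votes", "runtime", "certificate"])
```

Running this over the full scrape would reproduce numbers like the ~80% runtime and ~25% certificate coverage the author reports.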

Suggestions on other data to include or representation techniques?

I first saw this in a tweet by Gregory Piatetsky.

## Building a Database-backed Clojure Web App…

March 8th, 2014

From the post:

Some time ago I wrote a post about Java In the Auto-Scaling Cloud. In the post, I mentioned Heroku. In today’s post, I want to take time to point back to Heroku again, this time with the focus on building web applications. Heroku Dev Center recently posted a great tutorial on building a database-backed Clojure web application. In this example, a twitter-like app is built that stores “shouts” to a PostgreSQL database. It covers a lot of territory, from connecting to PostgreSQL, to web bindings with Compojure, HTML templating with Hiccup, and assembling, testing and finally deploying the application.

If you aren’t working on a weekend project already, here is one for your consideration!
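The tutorial itself uses Clojure and PostgreSQL; as a language-neutral illustration of its “store and list shouts” core, here is a sketch using Python’s stdlib sqlite3 (the names are mine, not the tutorial’s):

```python
import sqlite3

# A minimal "shouts" store, mirroring the shape of the tutorial's app.
# The tutorial uses Clojure + PostgreSQL; this sketch substitutes sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shouts (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

def add_shout(body):
    with conn:  # the connection context manager commits on success
        conn.execute("INSERT INTO shouts (body) VALUES (?)", (body,))

def list_shouts():
    return [row[0] for row in conn.execute("SELECT body FROM shouts ORDER BY id")]

add_shout("hello, world")
add_shout("shouting into the void")
```

The web layer (Compojure routes, Hiccup views) wraps exactly this kind of insert/select pair.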

## LongoMatch

March 8th, 2014

LongoMatch

From the “Features” page:

LongoMatch has been designed to be very easy to use, exposing the basic functionalities of video analysis in an intuitive interface. Tagging, playback and editing of stored events can be easily done from the main window, while more specific features can be accessed through menus when needed.

Flexible and customizable for all sports

LongoMatch can be used for any kind of sports, allowing to create custom templates with an unlimited number of tagging categories. It also supports defining custom subcategories and creating templates for your teams with detailed information of each player which is the perfect combination for a fine-grained performance analysis.

Post-match and real time analysis

LongoMatch can be used for post-match analysis supporting the most common video formats as well as for live analysis, capturing from Firewire, USB video capturers, IP cameras or without any capture device at all, decoupling the capture process from the analysis, but having it ready as soon as the recording is done. With live replay, without stopping the capture, you can review tagged events and export them while still analyzing the game live.

Although pitched as software for analyzing sports events, it occurs to me this could be useful in a number of contexts.

Such as analyzing news footage of police encounters with members of the public.

Or video footage of particular locations. Foot or vehicle traffic.

The possibilities are endless.

Then it’s just a question of tying that information together with data from other information feeds.
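Tagging events against a timeline, as LongoMatch does for sports footage, reduces to a simple data model. A hypothetical sketch (categories and field names are mine, not LongoMatch’s):

```python
from dataclasses import dataclass

@dataclass
class TaggedEvent:
    start: float       # seconds into the footage
    stop: float
    category: str      # e.g. "stop", "search" for the police-footage use case
    notes: str = ""

def events_in_category(events, category):
    """Filter tagged events, e.g. to review or export one category."""
    return [e for e in events if e.category == category]

log = [
    TaggedEvent(12.0, 45.5, "stop"),
    TaggedEvent(50.0, 80.0, "search", "consent unclear"),
    TaggedEvent(95.0, 120.0, "stop"),
]
```

Tying this together with other feeds is then a matter of joining on the timestamps.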

## papers-we-love

March 8th, 2014

papers-we-love

From the webpage:

Repository related to the following meetups:

Let us know if you are interested in starting a chapter!

A GitHub repository of CS papers.

If you decide to start a virtual “meetup” be sure to ping me. Nothing against the F2F meetings, absolutely needed, but some of us can’t make F2F meetings.

PS: There is also a list of other places to search for good papers.

## Merge Mahout item based recommendations…

March 8th, 2014

Merge Mahout item based recommendations results from different algorithms

From the post:

Apache Mahout is a machine learning library that leverages the power of Hadoop to implement machine learning through the MapReduce paradigm. One of the implemented algorithms is collaborative filtering, the most successful recommendation technique to date. The basic idea behind collaborative filtering is to analyze the actions or opinions of users to recommend items similar to the one the user is interacting with.

Similarity isn’t restricted to a particular measure or metric.

How similar is enough to be considered the same?

That is a question topic map designers must answer on a case by case basis.
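The merging step the post describes can be sketched with simple rank aggregation. This uses a Borda-style count rather than Mahout’s actual API, so treat it as an illustration only:

```python
def merge_recommendations(*ranked_lists):
    """Merge ranked item lists from different algorithms via Borda count:
    an item scores (list_length - position) in each list, summed across lists."""
    scores = {}
    for ranked in ranked_lists:
        n = len(ranked)
        for pos, item in enumerate(ranked):
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])

item_based = ["A", "B", "C"]   # e.g. from item-item similarity
user_based = ["B", "D", "A"]   # e.g. from user-user similarity
merged = merge_recommendations(item_based, user_based)
```

The choice of aggregation (Borda, score normalization, interleaving) is exactly the kind of case-by-case design decision mentioned above.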

## Black Hat Asia 2014: The Weaponized Web

March 8th, 2014

Black Hat Asia 2014: The Weaponized Web

From the post:

The World Wide Web has grown exponentially since its birth 21 years ago, and it now serves as the interface for many of the apps we use every day. It’s hard to imagine a more enticing target for hacks and exploits. Today’s trio of Black Hat Briefings explore ways the Web can be weaponized … and how to defend against it.

Even as HTML 5 proliferates as an enabler of rich interactive Web applications, cross-site scripting (XSS) remains one of the top three Web application vulnerabilities. DOM-based XSS is growing in popularity, but its client-side nature makes it difficult to monitor for malicious payloads. Ultimate Dom Based XSS Detection Scanner on Cloud delves into this thorny issue. Nera W. C. Liu and Albert Yu will show how they managed to introduce and propagate tainted attributes to a DOM input interface, and then devised a system to detect such breaches by harnessing the power of PhantomJS, a headless browser for automation.

JavaScript’s ubiquity makes it the subject of aggressive security-community research, boosting its effective security level every day. Sounds good, but in JS Suicide: Using JavaScript Security Features to Kill JS Security, Ahamed Nafeez will demonstrate that these security features can be a double-edged sword, sometimes allowing an attacker to disable certain other JS protection mechanisms. In particular, the sandboxing features of ECMAScript 5 can break security in many JS applications. Real-world examples of other JS security lapses are also on the agenda.

Ready-made exploit kits make it easier than ever for malicious parties to victimize unwary Internet users. Jose Miguel Esparza will take us down that rabbit hole in PDF Attack: A Journey From the Exploit Kit to the Shellcode, in which he’ll teach how to manually extract obfuscated URLs and binaries from these weaponized pages. You’ll also learn how to modify a malicious PDF payload yourself to bypass AV software, a useful trick for pentesting.

Looking to register? Please visit Black Hat Asia 2014’s registration page to get started.

One of the things I like about Black Hat is their honesty. Computer enthusiasts include the usual high school/college nerds and the white shirt/blue tie crowd, but there are those who follow a different track. And some of those don’t work for national governments.

If you need more evidence for the argument that software (not just the WWW) is systematically broken (Back to Basics: Beyond Network Hygiene by Felix ‘FX’ Lindner and Sandro Gaycken), review the agenda for this Black Hat conference or for preceding years.

As long as software security remains a separate security product or patch to existing software issue, Black Hat isn’t going to go lacking for conference material.

## Who Are the Customers for Intelligence?

March 7th, 2014

Who Are the Customers for Intelligence? by Peter C. Oleson.

From the paper:

Who uses intelligence and why? The short answer is almost everyone and to gain an advantage. While nation-states are most closely identified with intelligence, private corporations and criminal entities also invest in gathering and analyzing information to advance their goals. Thus the intelligence process is a service function, or as Australian intelligence expert Don McDowell describes it,

Information is essential to the intelligence process. Intelligence… is not simply an amalgam of collected information. It is instead the result of taking information relevant to a specific issue and subjecting it to a process of integration, evaluation, and analysis with the specific purpose of projecting future events and actions, and estimating and predicting outcomes.

It is important to note that intelligence is prospective, or future oriented (in contrast to investigations that focus on events that have already occurred).

As intelligence is a service, it follows that it has customers for its products. McDowell differentiates between “clients” and “customers” for intelligence. The former are those who commission an intelligence effort and are the principal recipients of the resulting intelligence product. The latter are those who have an interest in the intelligence product and could use it for their own purposes. Most scholars of intelligence do not make this distinction. However, it can be an important one as there is an implied priority associated with a client over a customer. (footnote markers omitted)

If you want to sell the results of topic maps, that is highly curated data that can be viewed from multiple perspectives, this essay should spark your thinking about potential customers.

You may also find this website useful: Association of Former Intelligence Officers.

I first saw this at Full Text Reports as Who Are the Customers for Intelligence? (draft).

## Quizz: Targeted Crowdsourcing…

March 7th, 2014

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users by Panagiotis G. Ipeirotis and Evgeniy Gabrilovich.

Abstract:

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.

Crowdsourcing isn’t an automatic slam-dunk, but with research like this it will start moving toward being a repeatable experience.

What do you want to author using a crowd?

I first saw this at Greg Linden’s More quick links.

## Introducing the ProPublica Data Store

March 7th, 2014

From the post:

We work with a lot of data at ProPublica. It's a big part of almost everything we do — from data-driven stories to graphics to interactive news applications. Today we're launching the ProPublica Data Store, a new way for us to share our datasets and for them to help sustain our work.

Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers. Those wanting to use data commercially should reach out to us to discuss pricing. If you're unsure whether a premium dataset will suit your purposes, you can try a sample first. It's a free download of a small sample of the data and a readme file explaining how to use it.

The datasets contain a wealth of information for researchers and journalists. The premium datasets are cleaned and ready for analysis. They will save you months of work preparing the data. Each one comes with documentation, including a data dictionary, a list of caveats, and details about how we have used the data here at ProPublica.

A data store you can feel good about supporting!

I first saw this at Nathan Yau’s ProPublica opened a data store.

## Trapping Users with Linked Data (WorldCat)

March 7th, 2014

WorldCat Works Linked Data – Some Answers To Early Questions by Richard Wallis.

The most interesting question Richard answers:

No there is no bulk download available. This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.

I will give you one guess on who is deciding on the entities, identifiers and relationships to be maintained.

Hint: It’s not you.

Which in my view is one of the principal weaknesses of Linked Data.

In order to participate, you have to forfeit your right to organize your world differently than it has been organized by Richard Wallis, WorldCat and others.

I am sure they all have good intentions and WorldCat will come close enough for most of my purposes, but I’m not interested in a one world view, whoever agrees with it. Even me.

If you are good with graphics, take the original Apple 1984 commercial and reverse it.

Show users a screen of vivid diversity and show a Richard Wallis look-alike touching the side of the projection screen as the uniform grayness of linked data starts to spread across it. As it does, the users in the audience who have been in traditional dress start to look like the starting audience in Apple’s 1984 commercial.

That’s the intellectual landscape that linked data promises. Do you really want to go there?

Nothing against standards, I have helped write one or two of them. But I do oppose uniformity for the sake of empowering self-appointed guardians.

Particularly when that uniformity is a tepid grey that doesn’t reflect the rich and discordant hues of human intellectual history.
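Setting the editorial argument aside, the mechanics Wallis relies on — identifiers that always dereference to the current data — amount to plain HTTP content negotiation. A sketch (the URI and path are illustrative assumptions, not WorldCat’s actual API, and no request is actually sent here):

```python
import urllib.request

def linked_data_request(uri):
    """Build a request that asks a Linked Data identifier for JSON-LD
    rather than the HTML landing page (standard content negotiation)."""
    return urllib.request.Request(uri, headers={"Accept": "application/ld+json"})

# Illustrative WorldCat-style work URI; the exact path is an assumption.
req = linked_data_request("http://worldcat.org/entity/work/id/12345")
```

Sending such a request with `urllib.request.urlopen(req)` would return whatever the publisher currently says the entity is — which is precisely the control the post objects to.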

## Using Lucene’s search server to search Jira issues

March 7th, 2014

Using Lucene’s search server to search Jira issues by Michael McCandless.

From the post:

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene’s capabilities, I point them to this application so they can see for themselves.

Recently, I’ve made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I’ve been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene’s current modules in a server context with minimal “thin server” additional source code.

Separately, to test this new Lucene based server, and to complete the “dog food,” I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira’s REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.
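One of the features mentioned above, infix suggestion, is easy to demystify with a toy version. A naive sketch — a linear scan, nothing like Lucene’s actual AnalyzingInfixSuggester, which works from an index:

```python
def infix_suggest(query, titles, limit=5):
    """Naive infix suggester: match the query anywhere inside each
    candidate string, not just at the start (toy stand-in for Lucene's
    AnalyzingInfixSuggester)."""
    q = query.lower()
    return [t for t in titles if q in t.lower()][:limit]

# Hypothetical issue titles, for illustration only.
issues = [
    "LUCENE-5376: Add a demo search server",
    "Block-join queries for nested documents",
    "Postings highlighter performance",
]
```

An infix match on “join” finds the block-join issue even though the query is not a title prefix — the behavior that makes the feature useful for issue search.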

This is of particular interest to me because OASIS is about to start using JIRA 6.2 (the version in use at Apache).

I haven’t looked closely at the documentation for JIRA 6.2.

Thoughts on where it has specific weaknesses that are addressed by Michael’s solution?