Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 10, 2014

The Elements According to Relative Abundance

Filed under: Graphics,Visualization — Patrick Durusau @ 1:44 pm

The Elements According to Relative Abundance (A Periodic Chart by Prof. Wm. F. Sheehan, University of Santa Clara, CA 95053. Ref: Chemistry, Vol. 49, No. 3, pp. 17-18, 1976)

From the caption:

Roughly, the size of an element’s own niche is proportioned to its abundance on Earth’s surface, and in addition, certain chemical similarities.

Very nice.

A couple of suggestions for the graphically inclined:

  • How does a proportionate periodic table for your state (or, outside the United States, another appropriate geographic subdivision) compare to other states?
  • Adjust your periodic table to show the known elements at important dates in history.

I first saw this in a tweet by Maxime Duprez.

A New Entity Salience Task with Millions of Training Examples

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.

Abstract:

Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here: https://code.google.com/p/nyt-salience

A classic approach to a CS article: new approach/idea, data + experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results”?

Which presumes a working definition of “acceptable” that you have hammered out with your client.
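
Not code from the paper, but a minimal sketch of the kind of salience features the authors describe (mention count, position of first mention), assuming entity mentions have already been resolved by an upstream NLP pipeline; the data format here is hypothetical:

```python
from collections import Counter

def salience_features(doc_tokens, entity_mentions):
    """Toy per-entity salience features: mention count and first-mention position.

    entity_mentions: list of (entity_id, token_offset) pairs, assumed to come
    from an upstream pipeline (NER + coreference). Hypothetical format.
    """
    counts = Counter(eid for eid, _ in entity_mentions)
    first_pos = {}
    for eid, offset in entity_mentions:
        first_pos.setdefault(eid, offset)
    n = max(len(doc_tokens), 1)
    return {
        eid: {
            "mention_count": counts[eid],
            "first_mention_frac": first_pos[eid] / n,  # earlier mention -> smaller value
        }
        for eid in counts
    }

# Toy usage: entity "e1" is mentioned twice and early, "e2" once and late.
doc = ["Acme", "Corp", "announced", "that", "its", "rival", "Widgets", "Inc", "..."]
mentions = [("e1", 0), ("e1", 4), ("e2", 6)]
print(salience_features(doc, mentions))
```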

I first saw this in a tweet by Stefano Bertolo.

Open Source: Option of the Security Conscious

Filed under: Cybersecurity,Linux OS,Open Source,Security — Patrick Durusau @ 10:00 am

International Space Station attacked by ‘virus epidemics’ by Samuel Gibbs.

From the post:

Malware made its way aboard the International Space Station (ISS) causing “virus epidemics” in space, according to security expert Eugene Kaspersky.

Kaspersky, head of security firm Kaspersky labs, revealed at the Canberra Press Club 2013 in Australia that before the ISS switched from Windows XP to Linux computers, Russian cosmonauts managed to carry infected USB storage devices aboard the station spreading computer viruses to the connected computers.

…..

In May, the United Space Alliance, which oversees the running of the ISS in orbit, migrated all the computer systems related to the ISS over to Linux for security, stability and reliability reasons.

If you or your company is at all concerned with security issues, open source software is the only realistic option.

Not because open source software in fact has fewer bugs on release, but because there is the potential for a large community of users to be seeking those bugs out and fixing them.

The recent Apple “goto fail” farce would not have happened in an open source product. Some tester, intentionally or accidentally, would have used invalid credentials and the problem would have surfaced.

If we are lucky, Apple had one tester who was also tasked with other duties and so we got what Apple chose to pay for.

This is not a knock against software companies that sell software for a profit. Rather it is a challenge to the current marketing of software for a profit.

Imagine that MS SQL Server were open source but still commercial software. That is, the source code would be freely available but the licensing would prohibit its use for commercial resale.

Do you really think that banks, insurance companies, enterprises are going to be grabbing source code and compiling it to avoid license fees?

I admit to having a low opinion of the morality of banks, insurance companies, etc., but they also have finely tuned senses of risk. They might save a few bucks in the short run, but the consequences of getting caught are quite severe.

So there would be lots of hobbyists hacking on, trying to improve, and otherwise poking at the MS SQL Server source code.

You know that hackers can no more keep a secret than a member of Congress, albeit hackers don’t usually blurt out secrets on the evening news. Every bug, improvement, etc. would become public knowledge fairly quickly.

MS could even make contributing bug reports and fixes a condition of the open source download.

MS could continue to sell MS SQL Server as commercial software, just as it did before making it open source.

The difference would be instead of N programmers working to find and fix bugs, there would be N + Internet community working to find and fix bugs.

The other difference being that the security conscious in military, national security, and government organizations would not have to be planning migrations away from closed source software.

Post-Snowden, open source software is the only viable security option.

PS: Yes, I have seen the “we are not betraying you now” and/or “we betray you only when required by law to do so” statements from various vendors.

I much prefer to not be betrayed at all.

You?

PPS: There is another advantage to vendors from an all open source policy on software. Vendors worry about others copying their code; with open source, that should be easy enough to monitor and prove.

Algebraic and Analytic Programming

Filed under: Algebra,Analytics,Cyc,Mathematics,Ontology,Philosophy,SUMO — Patrick Durusau @ 9:17 am

Algebraic and Analytic Programming by Luke Palmer.

In a short post Luke does a great job contrasting algebraic versus analytic approaches to programming.

In an even shorter summary, I would say the difference is “truth” versus “acceptable results.”

Oddly enough, that difference shows up in other areas as well.

The major ontology projects, including linked data, are pushing one and only one “truth.”

Versus other approaches, such as topic maps (at least in my view), that tend towards “acceptable results.”

I am not sure what other measure of success you could have than “acceptable results.”

Or what other measure there could be for a semantic technology.

Whether the universal-truth-of-the-world folks admit it or not, they just have a different definition of “acceptable results.” Their “acceptable results” means their world view.

I appreciate the work they put into their offer but I have to decline. I already have a world view of my own.

You?

I first saw this in a tweet by Computer Science.

Mapillary to OpenStreetMap

Filed under: Mapillary,Mapping,OpenStreetMap — Patrick Durusau @ 8:58 am

Mapillary to OpenStreetMap by Johan Gyllenspetz.

From the post:

We have been working with the OpenStreetMap community lately and we wanted to investigate how Mapillary can be used as a tool for some serious mapping.

First of all I needed to find a possible candidate area for mapping. After some investigation I found this little park in West Hollywood, called West Hollywood park. The park was under construction on the Bing images in the Id editor and nobody has traced the park yet.

If a physical map lacks your point of interest, you have to mark it on the map or use some sort of overlay.

As with a topic map, with Mapillary and OpenStreetMap you can add your point of interest with a suitable degree of accuracy.

You don’t need the agreement of your local department of highways or civil defense authorities.

Enjoy!

I first saw this in a tweet by Map@Syst.

The Books of Remarkable Women

Filed under: History,Preservation,Topic Maps — Patrick Durusau @ 8:32 am

The Books of Remarkable Women by Sarah J. Biggs.

From the post:

In 2011, when we blogged about the Shaftesbury Psalter (which may have belonged to Adeliza of Louvain; see below), we wrote that medieval manuscripts which had belonged to women were relatively rare survivals. This still remains true, but as we have reviewed our blog over the past few years, it has become clear that we must emphasize the relative nature of the rarity – we have posted literally dozens of times about manuscripts that were produced for, owned, or created by a number of medieval women.

A good example of why I think topic maps have so much to offer for preservation of cultural legacy.

While each of the books covered in this post is an important historical artifact, their value is enhanced by the context of their production, ownership, contemporary practices, etc.

All of which lies outside the books proper. Just as data about data, the so-called “metadata,” usually lies outside its information artifact.

If future generations are going to have better historical context than we do for many items, we had best get started writing it down.

March 9, 2014

Lucene 4 Essentials for Text Search and Indexing

Filed under: Indexing,Java,Lucene,Searching — Patrick Durusau @ 5:06 pm

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short! 😉

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.

You?

PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.

Getty – 35 Million Free Images

Filed under: Image Understanding,Intellectual Property (IP) — Patrick Durusau @ 3:40 pm

Getty Images makes 35 million images free in fight against copyright infringement by Olivier Laurent.

From the post:

Getty Images has single-handedly redefined the entire photography market with the launch of a new embedding feature that will make more than 35 million images freely available to anyone for non-commercial usage. BJP’s Olivier Laurent finds out more.

(skipped image)

The controversial move is set to draw professional photographers’ ire at a time when the stock photography market is marred by low prices and under attack from new mobile photography players. Yet, Getty Images defends the move, arguing that it’s not strong enough to control how the Internet has developed and, with it, users’ online behaviours.

“We’re really starting to see the extent of online infringement,” says Craig Peters, senior vice president of business development, content and marketing at Getty Images. “In essence, everybody today is a publisher thanks to social media and self-publishing platforms. And it’s incredibly easy to find content online and simply right-click to utilise it.”

In the past few years, Getty Images found that its content was “incredibly used” in this manner online, says Peters. “And it’s not used with a watermark; instead it’s typically found on one of our valid licensing customers’ websites or through an image search. What we’re finding is that the vast majority of infringement in this space happen with self publishers who typically don’t know anything about copyright and licensing, and who simply don’t have any budget to support their content needs.”

To solve this problem, Getty Images has chosen an unconventional strategy. “We’re launching the ability to embed our images freely for non-commercial use online,” Peters explains. In essence, anyone will be able to visit Getty Images’ library of content, select an image and copy an embed HTML code to use that image on their own websites. Getty Images will serve the image in a embedded player – very much like YouTube currently does with its videos – which will include the full copyright information and a link back to the image’s dedicated licensing page on the Getty Images website.

More than 35 million images from Getty Images’ news, sports, entertainment and stock collections, as well as its archives, will be available for embedding from 06 March.

What a clever move by Getty!

Think about it. Who do you sue for copyright infringement? Some hobbyist blogger, or a school newspaper that used an image? OK, the RIAA would, but what about sane people?

Your first question: Did the infringement result in a substantial profit?

Your second question: Does the guilty party have enough assets to make recovering that profit likely?

You only want to catch infringement by other major for-profit players.

All of whom have to use your images publicly. Hiding infringement isn’t possible.

None of the major media outlets or publishers are going to cheat on use of your images. Whether that is because they are honest with regard to IP or so easily caught, doesn’t really matter.

In one fell swoop, Getty has secured for itself free advertising for every image that is used for free. Advertising it could not have bought for any sum of money.

Makes me wonder when the ACM, IEEE, Springer, Elsevier and others are going to realize that free and public access to their journals and monographs will drive demand for libraries to have enhanced access to those publications.

It isn’t like EBSCO and the others are going to start using data that is limited to non-commercial use for their databases. That would be too obvious, not to mention incurring significant legal liability.

Ditto for libraries. Libraries want legitimate access to the materials they provide and/or host.

As I told an academic society once upon a time, “It’s time to stop grubbing for pennies when there are $100 bills blowing overhead.” The issue was replacement of “lost in the mail” journals. At a replacement cost of $3.50 (plus postage) per claim, they were employing a full-time person to research eligibility to request a replacement copy. For a time I convinced them to simply replace upon request in the mailroom: track requests, but just do it. It worked quite well.

Over the years management has changed and I suspect they have returned to protecting the rights of members by ensuring that only people entitled to a copy of the journal got one. I kid you not, that was the explanation for the old policy. Bizarre.

I first saw this at: Getty Set 35 Million Images Free, But Who Can Use Them? by David Godsall.

PS: The thought does occur to me that suitable annotations could be prepared ahead of time for these images so that when a for-profit publisher purchases the rights to a Getty image, someone could offer robust metadata to accompany the image.

IMDB Top 100K Movies Analysis in Depth (Parts 1- 4)

Filed under: Graphics,IMDb,Visualization — Patrick Durusau @ 2:27 pm

IMDB Top 100K Movies Analysis in Depth Part 1 by Bugra Akyildiz.

IMDB Top 100K Movies Analysis in Depth Part 2

IMDB Top 100K Movies Analysis in Depth Part 3

IMDB Top 100K Movies Analysis in Depth Part 4

From part 1:

Data is from IMDB and it includes all of the popularly voted 100042 movies from 1950 to 2013.(I know why 100000 is there but have no idea how 42 movies get squeezed. Instead of blaming my web scraping skills, I blame the universe, though).

The reason why I chose the number of votes as a metric to order the movies is because, generally the information (title, certificate, outline, director and so on) about movie are more likely to be complete for the movies that have high number of votes. Moreover, IMDB uses number of votes as a metric to determine the ranking as well so number of votes also correlate with the rating as well. Further, everybody at least has an idea on IMDB Top 250 or IMDB Top 1000 which are ordered by the ratings computed by IMDB.

Although the data is quite rich in terms of basic information, only year, rating and votes are complete for all of the movies. Only ~80% of the movies have runtime information(minutes). The categories are mostly 90% complete which could be considered good but the certificate information of the movies is the most sparse (only ~25% of them have it).

This post aims to explore data for diffferent aspects of data(categories, rating and categories) and also useful information(best movie in terms of rating or votes for each year).

An interesting analysis of the Internet Movie Database (IMDB) that incorporates other sources, such as revenue figures and actors’ and actresses’ age and height information.
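
The completeness figures quoted above are easy to reproduce on your own scrape; a quick pandas sketch, with entirely made-up column names and values:

```python
import pandas as pd

# Hypothetical column names; the scraped IMDb data would have its own schema.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1999, 2005, 2011, 2013],
    "rating": [7.1, 6.4, 8.0, 5.9],
    "votes": [1200, 560, 98000, 3400],
    "runtime": [112, None, 95, None],            # ~80% complete in the original data
    "certificate": [None, None, "PG-13", None],  # ~25% complete
})

# Fraction of non-missing values per column, highest first.
completeness = movies.notna().mean().sort_values(ascending=False)
print(completeness)
```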

Suggestions on other data to include or representation techniques?

I first saw this in a tweet by Gregory Piatetsky.

March 8, 2014

Building a Database-backed Clojure Web App…

Filed under: Clojure,Database,Web Applications — Patrick Durusau @ 8:49 pm

Building a Database-backed Clojure Web App On Top of Heroku Cloud App Platform by Charles Ditzel.

From the post:

Some time ago I wrote a post about Java In the Auto-Scaling Cloud. In the post, I mentioned Heroku. In today’s post, I want to take time to point back to Heroku again, this time with the focus on building web applications. Heroku Dev Center recently posted a great tutorial on building a database-backed Clojure web application. In this example, a twitter-like app is built that stores “shouts” to a PostgreSQL database. It covers a lot of territory, from connecting to PostgreSQL, to web bindings with Compojure, HTML templating with Hiccup, and assembling the application and testing it. Finally, deploying it.

If you aren’t working on a weekend project already, here is one for your consideration!
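
The tutorial itself is in Clojure; for comparison only, a rough Python/Flask analog of the same shape (one route backed by a PostgreSQL “shouts” table) might look like the sketch below. The connection string and table are assumptions, not the tutorial’s code, and the table must already exist.

```python
import os
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)
# Heroku-style configuration; the default URL here is a placeholder.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/shouter")

def get_conn():
    return psycopg2.connect(DATABASE_URL)

@app.route("/shouts", methods=["GET", "POST"])
def shouts():
    # Assumes a table: CREATE TABLE shouts (id serial PRIMARY KEY, body text);
    conn = get_conn()
    with conn, conn.cursor() as cur:
        if request.method == "POST":
            cur.execute("INSERT INTO shouts (body) VALUES (%s)", (request.form["body"],))
            return "", 201
        cur.execute("SELECT id, body FROM shouts ORDER BY id DESC")
        rows = cur.fetchall()
    return jsonify(shouts=[{"id": r[0], "body": r[1]} for r in rows])

if __name__ == "__main__":
    app.run()
```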

LongoMatch

Filed under: Analytics,Annotation,Video — Patrick Durusau @ 8:40 pm

LongoMatch

From the “Features” page:

Performance analysis made easy

LongoMatch has been designed to be very easy to use, exposing the basic functionalities of video analysis in an intuitive interface. Tagging, playback and edition of stored events can be easily done from the main window, while more specific features can be accessed through menus when needed.

Flexible and customizable for all sports

LongoMatch can be used for any kind of sports, allowing to create custom templates with an unlimited number of tagging categories. It also supports defining custom subcategories and creating templates for your teams with detailed information of each player which is the perfect combination for a fine-grained performance analysis.

Post-match and real time analysis

LongoMatch can be used for post-match analysis supporting the most common video formats as well as for live analysis, capturing from Firewire, USB video capturers, IP cameras or without any capture device at all, decoupling the capture process from the analysis, but having it ready as soon as the recording is done. With live replay, without stopping the capture, you can review tagged events and export them while still analyzing the game live.

Although pitched as software for analyzing sports events, it occurs to me this could be useful in a number of contexts.

Such as analyzing news footage of police encounters with members of the public.

Or video footage of particular locations. Foot or vehicle traffic.

The possibilities are endless.

Then it’s just a question of tying that information together with data from other information feeds. 😉

papers-we-love

Filed under: Computer Science,CS Lectures — Patrick Durusau @ 8:20 pm

papers-we-love

From the webpage:

Repository related to the following meetups:

Let us know if you are interested in starting a chapter!

A GitHub repository of CS papers.

If you decide to start a virtual “meetup” be sure to ping me. Nothing against the F2F meetings, absolutely needed, but some of us can’t make F2F meetings.

PS: There is also a list of other places to search for good papers.

Merge Mahout item based recommendations…

Filed under: Hadoop,Hive,Mahout,MapReduce,Recommendation — Patrick Durusau @ 8:08 pm

Merge Mahout item based recommendations results from different algorithms

From the post:

Apache Mahout is a machine learning library that leverages the power of Hadoop to implement machine learning through the MapReduce paradigm. One of the implemented algorithms is collaborative filtering, the most successful recommendation technique to date. The basic idea behind collaborative filtering is to analyze the actions or opinions of users to recommend items similar to the one the user is interacting with.

Similarity isn’t restricted to a particular measure or metric.

How similar is enough to be considered the same?

That is a question topic map designers must answer on a case by case basis.
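
Not Mahout itself, but one simple way to merge ranked recommendation lists produced by different algorithms is a normalized, weighted score combination; a sketch with made-up item ids and weights:

```python
def merge_recommendations(result_sets, weights):
    """Merge item->score dicts from different algorithms via weighted, normalized scores.

    result_sets: {"algo_name": {item_id: score, ...}, ...}
    weights:     {"algo_name": float, ...}
    """
    combined = {}
    for algo, scores in result_sets.items():
        top = max(scores.values()) or 1.0  # normalize each algorithm to its own top score
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weights.get(algo, 1.0) * (score / top)
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

item_based = {"item42": 4.5, "item17": 3.9, "item8": 2.1}
content_based = {"item17": 0.92, "item99": 0.80}
print(merge_recommendations({"item": item_based, "content": content_based},
                            {"item": 0.6, "content": 0.4}))
```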

Black Hat Asia 2014: The Weaponized Web

Filed under: Conferences,Cybersecurity,Security — Patrick Durusau @ 7:56 pm

Black Hat Asia 2014: The Weaponized Web

From the post:

The World Wide Web has grown exponentially since its birth 21 years ago, and it now serves as the interface for many of the apps we use every day. It’s hard to imagine a more enticing target for hacks and exploits. Today’s trio of Black Hat Briefings explore ways the Web can be weaponized … and how to defend against it.

Even as HTML 5 proliferates as an enabler of rich interactive Web applications, cross-site scripting (XSS) remains one of the top three Web application vulnerabilities. DOM-based XSS is growing in popularity, but its client-side nature makes it difficult to monitor for malicious payloads. Ultimate Dom Based XSS Detection Scanner on Cloud delves into this thorny issue. Nera W. C. Liu and Albert Yu will show how they managed to introduce and propagate tainted attributes to a DOM input interface, and then devised a system to detect such breaches by harnessing the power of PhantomJS, a headless browser for automation.

JavaScript’s ubiquity makes it the subject of aggressive security-community research, boosting its effective security level every day. Sounds good, but in JS Suicide: Using JavaScript Security Features to Kill JS Security, AhamedNafeez will demonstrate that these security features can be a double-edged sword, sometimes allowing an attacker to disable certain other JS protection mechanisms. In particular, the sandboxing features of ECMAScript 5 can break security in many JS applications. Real-world examples of other JS security lapses are also on the agenda.

Ready-made exploit kits make it easier than ever for malicious parties to victimize unwary Internet users. Jose Miguel Esparza will take us down that rabbit hole in PDF Attack: A Journey From the Exploit Kit to the Shellcode, in which he’ll teach how to manually extract obfuscated URLs and binaries from these weaponized pages. You’ll also learn how to do modify a malicious PDF payload yourself to bypass AV software, a useful trick for pentesting.

Looking to register? Please visit Black Hat Asia 2014’s registration page to get started.

One of the things I like about Black Hat is their honesty. Computer enthusiasts include the usual high school/college nerds and the white shirt/blue tie crowd, but there are those who follow a different track. And some of those don’t work for national governments.

If you need more evidence for the argument that software (not just the WWW) is systematically broken (Back to Basics: Beyond Network Hygiene by Felix ‘FX’ Lindner and Sandro Gaycken), review the agenda for this Black Hat conference or for preceding years.

As long as software security remains a separate security product or patch to existing software issue, Black Hat isn’t going to go lacking for conference material.

March 7, 2014

Who Are the Customers for Intelligence?

Filed under: Intelligence,Marketing — Patrick Durusau @ 8:37 pm

Who Are the Customers for Intelligence? by Peter C. Oleson.

From the paper:

Who uses intelligence and why? The short answer is almost everyone and to gain an advantage. While nation-states are most closely identified with intelligence, private corporations and criminal entities also invest in gathering and analyzing information to advance their goals. Thus the intelligence process is a service function, or as Australian intelligence expert Don McDowell describes it,

Information is essential to the intelligence process. Intelligence… is not simply an amalgam of collected information. It is instead the result of taking information relevant to a specific issue and subjecting it to a process of integration, evaluation, and analysis with the specific purpose of projecting future events and actions, and estimating and predicting outcomes.

It is important to note that intelligence is prospective, or future oriented (in contrast to investigations that focus on events that have already occurred).

As intelligence is a service, it follows that it has customers for its products. McDowell differentiates between “clients” and “customers” for intelligence. The former are those who commission an intelligence effort and are the principal recipients of the resulting intelligence product. The latter are those who have an interest in the intelligence product and could use it for their own purposes. Most scholars of intelligence do not make this distinction. However, it can be an important one as there is an implied priority associated with a client over a customer. (footnote markers omitted)

If you want to sell the results of topic maps, that is, highly curated data that can be viewed from multiple perspectives, this essay should spark your thinking about potential customers.

You may also find this website useful: Association of Former Intelligence Officers.

I first saw this at Full Text Reports as Who Are the Customers for Intelligence? (draft).

Quizz: Targeted Crowdsourcing…

Filed under: Authoring Topic Maps,Crowd Sourcing — Patrick Durusau @ 8:19 pm

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users by Panagiotis G. Ipeirotis and Evgeniy Gabrilovich.

Abstract:

We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user’s competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost- quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.

Crowdsourcing isn’t an automatic slam-dunk, but with research like this it will start moving toward being a repeatable experience.

What do you want to author using a crowd?
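
A back-of-the-envelope sketch of the information-theoretic idea (my simplification, not the paper’s exact formulation): estimate a user’s competence from calibration questions, then score each answer by how much it reduces uncertainty over a multiple-choice question.

```python
import math

def competence(correct, total, prior=0.5, strength=2):
    """Smoothed estimate of the probability that a user answers correctly."""
    return (correct + prior * strength) / (total + strength)

def information_gain_bits(p_correct, n_choices=4):
    """Entropy reduction (bits) of one answer from a user with accuracy p_correct,
    relative to a uniform guess over n_choices. A simplification, not the paper's measure."""
    h_uniform = math.log2(n_choices)
    if p_correct >= 1.0:
        return h_uniform  # a perfectly reliable answer removes all uncertainty
    h_given_answer = -(p_correct * math.log2(p_correct) +
                       (1 - p_correct) * math.log2((1 - p_correct) / (n_choices - 1)))
    return max(h_uniform - h_given_answer, 0.0)

p = competence(correct=9, total=10)
print(f"estimated competence {p:.2f}, ~{information_gain_bits(p):.2f} bits per answer")
```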

I first saw this at Greg Linden’s More quick links.

Introducing the ProPublica Data Store

Filed under: Data,News,Reporting — Patrick Durusau @ 8:07 pm

Introducing the ProPublica Data Store by Scott Klein and Ryann Grochowski Jones.

From the post:

We work with a lot of data at ProPublica. It's a big part of almost everything we do — from data-driven stories to graphics to interactive news applications. Today we're launching the ProPublica Data Store, a new way for us to share our datasets and for them to help sustain our work.

Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers. Those wanting to use data commercially should reach out to us to discuss pricing. If you're unsure whether a premium dataset will suit your purposes, you can try a sample first. It's a free download of a small sample of the data and a readme file explaining how to use it.

The datasets contain a wealth of information for researchers and journalists. The premium datasets are cleaned and ready for analysis. They will save you months of work preparing the data. Each one comes with documentation, including a data dictionary, a list of caveats, and details about how we have used the data here at ProPublica.

A data store you can feel good about supporting!

I first saw this at Nathan Yau’s ProPublica opened a data store.

Trapping Users with Linked Data (WorldCat)

Filed under: Linked Data,WorldCat — Patrick Durusau @ 5:33 pm

WorldCat Works Linked Data – Some Answers To Early Questions by Richard Wallis.

The most interesting question Richard answers:

Q Is there a bulk download available?
No there is no bulk download available. This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.

I will give you one guess as to who decides on the entities, identifiers and relationships to be maintained.

Hint: It’s not you.

Which in my view is one of the principal weaknesses of Linked Data.

In order to participate, you have to forfeit your right to organize your world differently than it has been organized by Richard Wallis, WorldCat and others.

I am sure they all have good intentions and WorldCat will come close enough for most of my purposes, but I’m not interested in a one world view, whoever agrees with it. Even me.

If you are good with graphics, take the original Apple 1984 commercial and reverse it.

Show users a screen of vivid diversity, then show a Richard Wallis look-alike touching the side of the projection screen as the uniform grayness of linked data spreads across it. As it does, the users in the audience, who started out in traditional dress, come to look like the opening audience in Apple’s 1984 commercial.

That’s the intellectual landscape that linked data promises. Do you really want to go there?

Nothing against standards, I have helped write one or two of them. But I do oppose uniformity for the sake of empowering self-appointed guardians.

Particularly when that uniformity is a tepid grey that doesn’t reflect the rich and discordant hues of human intellectual history.

Using Lucene’s search server to search Jira issues

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:02 pm

Using Lucene’s search server to search Jira issues by Michael McCandless.

From the post:

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene’s capabilities, I point them to this application so they can see for themselves.

Recently, I’ve made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I’ve been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene’s current modules in a server context with minimal “thin server” additional source code.

Separately, to test this new Lucene based server, and to complete the “dog food,” I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira’s REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

Of particular interest to me because OASIS is about to start using JIRA 6.2 (the version in use at Apache).

I haven’t looked closely at the documentation for JIRA 6.2.

Thoughts on where it has specific weaknesses that are addressed by Michael’s solution?
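
If you want to experiment with the extraction side against your own Jira instance, the REST search endpoint is the usual starting point. A hedged sketch: the base URL and JQL are placeholders, and most instances will also require authentication, which is omitted here.

```python
import requests

JIRA_BASE = "https://issues.example.org"      # placeholder: point at your own instance
JQL = "project = FOO ORDER BY updated DESC"   # placeholder query

def fetch_issues(jql, page_size=50):
    """Page through Jira's REST search endpoint and yield raw issue dicts."""
    start = 0
    while True:
        resp = requests.get(
            f"{JIRA_BASE}/rest/api/2/search",
            params={"jql": jql, "startAt": start, "maxResults": page_size},
            # auth=("user", "password"),  # add credentials if your instance needs them
        )
        resp.raise_for_status()
        data = resp.json()
        for issue in data.get("issues", []):
            yield issue
        start += page_size
        if start >= data.get("total", 0):
            break

for issue in fetch_issues(JQL):
    print(issue["key"], issue["fields"]["summary"])
```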

Data Scientist Solution Kit

Filed under: Cloudera,Data Analysis,Training,Web Analytics — Patrick Durusau @ 3:57 pm

Data Scientist Solution Kit

From the post:

The explosion of data is leading to new business opportunities that draw on advanced analytics and require a broader, more sophisticated skills set, including software development, data engineering, math and statistics, subject matter expertise, and fluency in a variety of analytics tools. Brought together by data scientists, these capabilities can lead to deeper market insights, more focused product innovation, faster anomaly detection, and more effective customer engagement for the business.

The Data Science Challenge Solution Kit is your best resource to get hands-on experience with a real-world data science challenge in a self-paced, learner-centric environment. The free solution kit includes a live data set, a step-by-step tutorial, and a detailed explanation of the processes required to arrive at the correct outcomes.

Data Science at Your Desk

The Web Analytics Challenge includes five sections that simulate the experience of exploring, then cleaning, and ultimately analyzing web log data. First, you will work through some of the common issues a data scientist encounters with log data and data in JSON format. Second, you will clean and prepare the data for modeling. Third, you will develop an alternate approach to building a classifier, with a focus on data structure and accuracy. Fourth, you will learn how to use tools like Cloudera ML to discover clusters within a data set. Finally, you will select an optimal recommender algorithm and extract ratings predictions using Apache Mahout.

With the ongoing confusion about what it means to be a “data scientist,” having a certification or two isn’t going to hurt your chances for employment.

And you may learn something in the bargain. 😉
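
The first step the kit describes, wrangling JSON-formatted web log records, looks roughly like the sketch below; the field names are made up for illustration, not Cloudera’s schema:

```python
import json

def parse_log_lines(lines):
    """Parse newline-delimited JSON log records, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would count and inspect these instead

raw = [
    '{"user_id": "u1", "url": "/products/42", "ts": "2014-03-07T12:00:00Z"}',
    'not json at all',
    '{"user_id": "u2", "url": "/cart", "ts": "2014-03-07T12:00:05Z"}',
]

visits_per_user = {}
for rec in parse_log_lines(raw):
    visits_per_user[rec["user_id"]] = visits_per_user.get(rec["user_id"], 0) + 1
print(visits_per_user)  # {'u1': 1, 'u2': 1}
```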

Handling and Processing Strings in R

Filed under: R,String Matching — Patrick Durusau @ 3:22 pm

Handling and Processing Strings in R by Gaston Sanchez. (free ebook)

From the post:

Many years ago I decided to apply for a job in a company that developed data mining applications for big retailers. I was invited for an on-site visit and I went through the typical series of interviews with the members of the analytics team. Everything was going smoothly and I was enjoying all the conversations. Then it came turn to meet the computer scientist. After briefly describing his role in the team he started asking me a bunch of technical questions and tests. Although I was able to answer those questions related with statistics and multivariate analysis, I had a really hard time trying to answer a series of questions related with string manipulations.

I will remember my interview with that guy as one of the most embarrassing moments of my life. That day, the first thing I did when I went back home was to open my laptop, launch R, and start reproducing the tests I failed to solve. It didn’t take me that much to get the right answers. Unfortunately, it was too late and the harm was already done. Needless to say I wasn’t offered the job. That shocking experience showed me that I was not prepared for manipulating character strings. I felt so bad that I promised myself to learn the basics of strings manipulation and text processing. “Handling and Processing Strings in R” is one of the derived results of that old promise.

The content of this ebook is the byproduct of my experience working with character string data in R. It is based on my notes, scripts, projects, and uncountable days and nights in which I’ve been struggling with text data. Briefly, I’ve tried to document and organize several topics related with manipulating character strings.

At one hundred and twelve (112) pages, “Handling and Processing Strings in R” may not answer every question you have about strings and R, but it answers a lot of them.

Enjoy and pass this along!

I first saw this in a tweet by Sharon Machlis.

Language: Vol 89, Issue 1 (March 2013)

Filed under: Linguistics — Patrick Durusau @ 3:05 pm

Language: Vol 89, Issue 1 (March 2013)

Language is a publication of the Linguistic Society of America:

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. As such, the LSA plays a critical role in supporting and disseminating linguistic scholarship, as well as facilitating the application of current research to scientific, educational, and social issues concerning language.

Language is a defining characteristic of the human species and impacts virtually all aspects of human experience. For this reason linguists seek not only to discover properties of language in general and of languages in particular but also strive to understand the interface of the phenomenon of language with culture, cognition, history, literature, and so forth.

With over 5,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied. As part of its outreach activities, the LSA attempts to provide information and educate both officials and the public about language.

You might want to note that access to all of Language is subject to a one-year embargo.

Quite reasonable when compared to embargoes calculated to give those with institutional subscriptions an advantage. I guess if you can’t get published without such advantages that sounds reasonable as well.

Enjoy!

Building fast Bayesian computing machines…

Filed under: Artificial Intelligence,Bayesian Data Analysis,Precision — Patrick Durusau @ 11:41 am

Building fast Bayesian computing machines out of intentionally stochastic, digital parts by Vikash Mansinghka and Eric Jonas.

Abstract:

The brain interprets ambiguous sensory information faster and more reliably than modern computers, using neurons that are slower and less reliable than logic gates. But Bayesian inference, which underpins many computational models of perception and cognition, appears computationally challenging even given modern transistor speeds and energy budgets. The computational principles and structures needed to narrow this gap are unknown. Here we show how to build fast Bayesian computing machines using intentionally stochastic, digital parts, narrowing this efficiency gap by multiple orders of magnitude. We find that by connecting stochastic digital components according to simple mathematical rules, one can build massively parallel, low precision circuits that solve Bayesian inference problems and are compatible with the Poisson firing statistics of cortical neurons. We evaluate circuits for depth and motion perception, perceptual learning and causal reasoning, each performing inference over 10,000+ latent variables in real time – a 1,000x speed advantage over commodity microprocessors. These results suggest a new role for randomness in the engineering and reverse-engineering of intelligent computation.

Ironic that the greater precision and repeatability of our digital computers may be choices that are holding back advancements in Bayesian digital computing machines.
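
A toy illustration of the low-precision point (my example, not the paper’s circuits): estimate a coin’s bias from data using only coarse parameter values and stochastic accept/reject gates, and compare against the exact Beta posterior mean.

```python
import random

def exact_posterior_mean(heads, tails, a=1, b=1):
    """Conjugate Beta-Bernoulli posterior mean with a Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

def stochastic_posterior_mean(heads, tails, n_trials=300_000, grid=16):
    """Rejection sampling built from stochastic binary parts: draw a coarse
    candidate bias, then accept it only if a Bernoulli gate per observation fires."""
    data = [1] * heads + [0] * tails
    accepted = []
    for _ in range(n_trials):
        theta = random.randrange(1, grid) / grid  # low-precision candidate value
        if all((random.random() < theta) == bool(x) for x in data):
            accepted.append(theta)
    return sum(accepted) / len(accepted) if accepted else float("nan")

heads, tails = 7, 3
print("exact:     ", exact_posterior_mean(heads, tails))
print("stochastic:", stochastic_posterior_mean(heads, tails))
```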

I have written before about the RDF ecosystem being overly complex and precise for use by everyday users.

We should strive to capture semantics as understood by scientists, researchers, students, and others. Less precise than professional semantics, but precise enough to be usable.

I first saw this in a tweet by Stefano Bertolo.

Monitoring Real-Time Bidding at AdRoll

Filed under: Concurrent Programming,Erlang — Patrick Durusau @ 11:09 am

Monitoring Real-Time Bidding at Adroll by Brian Troutwine.

From the description:

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation by default approach taken in the AdRoll real-time bidding team, discuss the technical details of the libraries we use and lessons learned to adapt your organization to deal with the onslaught of data from instrumentation.

The problem domain:

  • Low latency ( < 100ms per transaction )
  • Firm real-time system
  • Highly concurrent ( > 30 billion transactions per day )
  • Global, 24/7 operation

(emphasis in original)

They are not doing semantic processing subject to those requirements. 😉

But, that’s ok. If needed, you can assign semantics to the data and its containers separately.

A very impressive use of Erlang.
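
The instrumentation-by-default idea translates easily to other languages; a minimal Python sketch (not AdRoll’s code) that records per-call latency for anything you wrap:

```python
import time
from collections import defaultdict
from functools import wraps

LATENCIES_MS = defaultdict(list)  # metric name -> recorded call latencies

def instrumented(name):
    """Decorator: record the wall-clock latency of every call under `name`."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                LATENCIES_MS[name].append((time.perf_counter() - start) * 1000.0)
        return wrapper
    return decorate

@instrumented("bid_request")
def handle_bid_request(payload):
    time.sleep(0.005)  # stand-in for real bidding work
    return {"bid": 0.42}

for _ in range(10):
    handle_bid_request({})
samples = LATENCIES_MS["bid_request"]
print(f"n={len(samples)} max={max(samples):.2f}ms avg={sum(samples)/len(samples):.2f}ms")
```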

March 6, 2014

Algebird 0.5.0 Released

Filed under: Algebird,Mathematics,Scalding,Storm — Patrick Durusau @ 9:24 pm

Algebird 0.5.0

From the webpage:

Abstract algebra for Scala. This code is targeted at building aggregation systems (via Scalding or Storm). It was originally developed as part of Scalding’s Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

Other links you will find helpful:

0.5.0 Release notes.

Algebird mailing list.

Algebird Wiki.
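
The core idea, aggregation built from an associative combine plus an identity element, is easy to sketch outside Scala; a toy Python version (not Algebird’s API):

```python
from functools import reduce

class Monoid:
    """Minimal monoid: an identity element plus an associative combine operation."""
    def __init__(self, zero, plus):
        self.zero = zero
        self.plus = plus

    def sum(self, items):
        return reduce(self.plus, items, self.zero)

# Two example monoids: integer addition and element-wise dictionary merging.
int_add = Monoid(0, lambda a, b: a + b)
dict_sum = Monoid({}, lambda a, b: {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)})

print(int_add.sum([1, 2, 3, 4]))                             # 10
print(dict_sum.sum([{"x": 1}, {"x": 2, "y": 5}, {"y": 1}]))  # {'x': 3, 'y': 6}
# Associativity is what lets a system like Scalding or Storm split the fold
# across many machines and combine the partial results in any grouping.
```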

The “Tube” as History of Music

Filed under: Maps,Music,Visualization — Patrick Durusau @ 9:14 pm

The history of music shown by the London Underground

I have serious difficulties with the selection of music to be mapped, but that should not diminish your enjoyment of this map if you find it more to your taste.

Great technique if somewhat lacking in content. 😉

It does illustrate the point that every map is from a point of view, even if it is an incorrect one (IMHO).

I first saw this in a tweet by The O.C.R.

Visualising UK Ministerial Lobbying…

Filed under: Government,Government Data,Visualization — Patrick Durusau @ 9:01 pm

Visualising UK Ministerial Lobbying & “Buddying” Over Eight Months by Roland Dunn.

From the post:


[This is a companion piece to our visualisation of ministerial lobbying – open it up and take a look!].

Eight Months Worth of Lobbying Data

Turns out that James Ball, together with the folks at Who’s Lobbying had collected together all the data regarding ministerial meetings from all the different departments across the UK’s government (during May to December 2010), tidied the data up, and put them together in one spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AhHlFdx-QwoEdENhMjAwMGxpb2kyVnlBR2QyRXJVTFE.

It’s important to understand that despite the current UK government stating that it is the most open and transparent ever, each department publishes its ministerial meetings in ever so slightly different formats. On that page for example you can see Dept of Health Ministerial gifts, hospitality, travel and external meetings January to March 2013, and DWP ministers’ meetings with external organisations: January to March 2013. Two lists containing slightly different sets of data. So, the work that Who’s Lobbying and James Ball did in tallying this data up is considerable. But not many people have the time to tie such data-sets together, meaning the data contained in them is somewhat more opaque than you might at first be led to believe. What’s needed is one pan-governmental set of data.

An example to follow in making “open” data a bit more “transparent.”

Not entirely transparent, for as the author notes, minutes from the various meetings are not available.

Or I suppose when minutes are available, their completeness would be questionable.
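
Most of the tallying-up work is normalizing slightly different column layouts into one shape; a pandas sketch with hypothetical department layouts and column names:

```python
import pandas as pd

# Hypothetical examples of two departments' slightly different layouts.
doh = pd.DataFrame({"Minister": ["A. Minister"], "Date": ["2010-06-01"],
                    "Organisation": ["Barclays"], "Purpose of meeting": ["Finance reform"]})
dwp = pd.DataFrame({"Name of Minister": ["B. Minister"], "Date of meeting": ["2010-07-12"],
                    "External organisation": ["Unite"], "Purpose": ["Welfare policy"]})

COLUMN_MAPS = {
    "DoH": {"Minister": "minister", "Date": "date",
            "Organisation": "organisation", "Purpose of meeting": "purpose"},
    "DWP": {"Name of Minister": "minister", "Date of meeting": "date",
            "External organisation": "organisation", "Purpose": "purpose"},
}

frames = []
for dept, df in [("DoH", doh), ("DWP", dwp)]:
    renamed = df.rename(columns=COLUMN_MAPS[dept])
    renamed["department"] = dept
    frames.append(renamed)

meetings = pd.concat(frames, ignore_index=True)
meetings["date"] = pd.to_datetime(meetings["date"])
print(meetings)
```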

I first saw this in a tweet by Steve Peters.

Crisis News on Twitter

Filed under: Authoring Topic Maps,News,Reporting — Patrick Durusau @ 3:50 pm

Who to Follow on Twitter for Crisis News, Part 2: Venezuela by David Godsall.

From the post:

With political strife dominating so much of our news cycle these past months, and events from Ukraine to Venezuela rapidly unfolding, Twitter is one of the best ways to stay informed in real time. But when social media turns everyone into an information source, it can be a challenge to sort the signal from the noise and figure out who to trust.

To help you find reliable sources for some of the most timely geopolitical news stories, we’ve created a series of Twitter lists compiling trusted journalists, activists and citizens on the ground in the conflict regions. These are the people sharing the most up-to-date information, often from their own first hand experiences. In Part 1 of this series, we talked about sources of news from Ukraine.

Our second list in the series focuses on the events currently taking place in Venezuela:

If you are building a topic map for current events, you need information feeds. Twitter has some suggestions if you want to follow events in Ukraine or Venezuela.

As with any information feed, use even the best feeds with caution. I saw Henry Kissinger on Charlie Rose. Kissinger was very even-handed while Rose was an “America lectures the world” advocate. If you haven’t read The Ugly American by William J. Lederer and Eugene Burdick, you should.

It is a very crowded field for who would qualify as the “ugliest” American these days.

Metaphor: Web-based Functorial Data Migration

Filed under: Category Theory,Data Integration,SQL — Patrick Durusau @ 3:14 pm

Metaphor: Web-based Functorial Data Migration

From the webpage:

Metaphor is a web-based implementation of functorial data migration. David Spivak and Scott Morrison are the primary contributors.

I discovered this while running some of the FQL material to ground.

While I don’t doubt the ability of category theory to create mappings between relational schemas, what I am not seeing is the basis for the mapping.

In other words, assume I have two schemas with only one element in each one, firstName in one and givenName in the other. Certainly I can produce a mapping between those schemas.

Question: On what basis did I make such a mapping?

In other words, what properties of those subjects had to be the same or different in order for me to make that mapping?

Unless and until you know that, how can you be sure that your mappings agree with those I have made?
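
To make the question concrete, the firstName/givenName mapping is trivial to write down as code; what the code does not record is the basis for the judgment behind it:

```python
# The mapping itself is easy to state...
schema_a_to_b = {"firstName": "givenName"}

def migrate(record, mapping):
    """Rename keys in a record according to a schema mapping."""
    return {mapping.get(key, key): value for key, value in record.items()}

print(migrate({"firstName": "Ada"}, schema_a_to_b))  # {'givenName': 'Ada'}

# ...but the mapping table carries no evidence for the judgment behind it:
# which properties of the two elements (label, datatype, usage, documentation)
# had to agree before "firstName" and "givenName" were declared equivalent?
```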

FQL: A Functorial Query Language

Filed under: Category Theory,Query Engine,Query Language,SQL — Patrick Durusau @ 3:00 pm

FQL: A Functorial Query Language

From the webpage:

The FQL IDE is a visual schema mapping tool for developing FQL programs. It can run FQL programs, generate SQL from FQL, generate FQL from SQL, and generate FQL from schema correspondences. Using JDBC, it can run transparently using an external SQL engine and on external database instances. It can output RDF/OWL/XML and comes with many built-in examples. David Spivak and Ryan Wisnesky are the primary contributors. Requires Java 7.

As if FQL and the IDE weren’t enough, papers, slides, and source code await you.

I first saw this in a tweet by Computer Science.
