Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 21, 2012

Topincs 6.4.0

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 7:47 am

Topincs 6.4.0 by Robert Cerny.

Robert details the new features and enhancements to Topincs.

Large Steam network visualization with Google Maps + Gephi

Filed under: Gephi,Graphs,Networks,Visualization — Patrick Durusau @ 7:33 am

Large Steam network visualization with Google Maps + Gephi

From the post:

I’ve used Google Maps API to visualize a relatively large network collected from Steam Community members. The data is collected from public player profiles that Valve reveals through their Steam Web API. For each player their links to friends and links to Steam Groups they belong are collected. This creates a social network which can be visualized using Gephi.

Graph consists of 212600 nodes and 4045203 edges. Before filtering outliers and low/high degree nodes there are approximately 800 000 groups and over 11 million users.

Very impressive visualization.
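If you want to try a small-scale version yourself, here is a minimal sketch, assuming a Steam Web API key and Valve's documented GetFriendList call (treat the endpoint, parameters and response fields as assumptions to check against the API documentation). It crawls two hops of friendships and writes a GEXF file that Gephi can open:

# Minimal sketch: crawl two hops of Steam friendships and export for Gephi.
# API_KEY and SEED_ID are placeholders; verify the GetFriendList endpoint,
# parameters and response layout against Valve's Steam Web API docs.
import requests
import networkx as nx

API_KEY = "YOUR_STEAM_WEB_API_KEY"   # placeholder
SEED_ID = "76561197960435530"        # placeholder 64-bit Steam ID

FRIENDS_URL = "http://api.steampowered.com/ISteamUser/GetFriendList/v0001/"

def get_friends(steam_id):
    """Return friend Steam IDs for a public profile (empty list on failure)."""
    params = {"key": API_KEY, "steamid": steam_id, "relationship": "friend"}
    resp = requests.get(FRIENDS_URL, params=params, timeout=10)
    if resp.status_code != 200:
        return []
    friends = resp.json().get("friendslist", {}).get("friends", [])
    return [f["steamid"] for f in friends]

graph = nx.Graph()
for friend in get_friends(SEED_ID):
    graph.add_edge(SEED_ID, friend)
    # One more hop; a real crawl needs rate limiting and a visited set.
    for friend_of_friend in get_friends(friend):
        graph.add_edge(friend, friend_of_friend)

nx.write_gexf(graph, "steam_friends.gexf")   # open this file in Gephi
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")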

Enjoy!

Open Data Is Not for Sprinters [Or Open Data As Religion]

Filed under: Open Data — Patrick Durusau @ 7:13 am

Open Data Is Not for Sprinters by Andrea Di Maio.

Commenting on the UK special envoy who was “disappointed” with open data usage, Andrea pointed out that government should make better internal use of open data to justify its open data programs.

His view was challenged by a member of the audience who said:

open data is for the sake of economic development and transparency, not for internal use.

Andrea’s response:

I do not disagree of course. All I am saying, and I have been saying for a while now, is that to realize this vision will take quite some time. Indeed more data must be available, of higher quality and timeliness; more entrepreneurs or “appreneurs” must be lured to extract value for businesses and the public at large from this data; and we need a stream of examples across sectors and regions to show that value can be generated everywhere.

A more direct answer would be to point out that statements like:

Opening up data is fundamentally about more efficient use of resources and improving service delivery for citizens. The effects of that are far reaching: innovation, transparency, accountability, better governance and economic growth. (Sir Tim Berners-Lee: Raw data, now!)

are religious dogma. That may be useful if you want to run your enterprise or government on dogma, but you may as well use a Ouija board.

The astronomy community has a history of “open data” that spans decades. I find the data very interesting, and it has led to discoveries in astronomy, but economic development?

The biological community apparently has a competition to see who can make more useful data available than the next lab. And it leads to better research, discoveries and innovation, but economic development?

The same holds true for the chemical community and numerous others.

The point being that claims such as “open data leads to economic development” are sure to disappoint.

Some open data might, but that is a question of research and proof, not mere cant.

A government, for example, could practice open data with regard to its tax policies and how it decides to audit taxpayers. I am sure startups would quickly take up the task of using that data to advise clients on how to avoid audits. (They are called tax advisors now.)

Or a government could practice open data on the White House visitor list and include some of the non-tour visitors among the thousands who visit every day. It’s “open data,” just not useful data. And not data that is likely to lead to economic development or transparency.

Governments should practice open data but with an eye towards selecting data that is likely to lead to economic development, innovation, etc. By tracking the use of “open data” now, governments can make rational decisions about what data to “open” in the future.

Digesting Big Data [Egestion vs. Excretion]

Filed under: BigData,Machine Learning — Patrick Durusau @ 6:17 am

Digesting Big Data by Jos Verwoerd.

From the post:

Phase One: Ingestion.
The art of collecting data and storing it.

Phase Two: Digestion.
Digestion is processing your raw data into something that you can extract value from.

Phase Three: Absorption.
This stage is all about extracting insights from your data.

Phase Four: Assimilation.
In the fourth stage you want to put the insights to action.

Phase Five: Egestion.
This fifth phase runs parallel to all others and is about getting rid of the unwanted, unclean, unnecessary parts of your data, invalid insights and predictions at every step of the process.

An analogy to processing big data that I have not encountered before.

Jos points out that most PR is about phases one and two, but the business value payoff doesn’t come until phases three and four.

Interesting reading, and I learned a new word, “Egestion.”

From Biology Online: Egestion –

noun

The act or process of voiding or discharging undigested food as faeces.

Supplement

Egestion is the discharge or expulsion of undigested material (food) from a cell in case of unicellular organisms, and from the digestive tract via the anus in case of multicellular organisms.

It should not be confused with excretion, which is getting rid of waste formed from the chemical reaction of the body, such as in urine, sweat, etc.

Word origin: from Latin ēgerere, ēgest-, to carry out.
Related forms: egest (verb).

According to Webster’s, “excrement” makes no such fine distinction, covering waste of any origin.

Interactive Data Visualization for the Web [D3]

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 5:24 am

Interactive Data Visualization for the Web by Scott Murray.

A preview version explaining D3.

Skimming the first couple of chapters, I found a number of references/examples, but the text could use one or more editing passes.

Still, if you are new to D3, it’s free and correcting any mistakes you find will be good practice.

I first saw this in a tweet by Jen Lowe.


Update: Final ebook and print copies are now available!

November 20, 2012

… ‘disappointed’ with open data use

Filed under: Open Data,Open Government — Patrick Durusau @ 8:20 pm

Prime minister’s special envoy ‘disappointed’ with open data use by Derek du Preez.

From the post:

Prime Minister David Cameron’s special envoy on the UN’s post-2015 development goals has said that he is ‘disappointed’ by how much the government’s open datasets have been used so far.

Speaking at a Reform event in London this week on open government and data transparency, Anderson said he recognises that the public sector needs to improve the way it pushes out the data so that it is easier to use.

“I am going to be really honest with you. As an official in a government department that has worked really hard to get a lot of data out in the last two years, I have been pretty disappointed by how much it has been used,” he said.

Easier-to-use data is one issue.

But the expectation that the effort of making data open translates into people interested in using it is another.

The article later reports there are 9,000 datasets available at data.gov.uk.

How relevant to everyday concerns are those 9,000 datasets?

When the government starts disclosing the financial relationships between members of government, their families and contributors, I suspect interest in open data will go up.

Benefits stigma: how newspapers report on welfare

Filed under: News,Politics — Patrick Durusau @ 7:53 pm

Benefits stigma: how newspapers report on welfare by Randeep Ramesh.

From the post:

New research out today looks at the benefits stigma in Britain. The Guardian’s social affairs editor takes a look at the most common myths and sees how content on welfare differs by newspapers.

Those working in benefits and with claimants have become increasingly exasperated with the gap between the reality of poor peoples’ lives and the rhetoric of welfare reform.

Such is the scale of successive governments’ disinformation that the report by Turn2us, part of anti-poverty charity Elizabeth Finn, calls for ministers to abandon briefing journalists in advance of their speeches and asks departments to seek corrections for “predictable and repeated media misinterpretations”.

It is articles like this one that have me contemplating a hard copy subscription to the Guardian.

Mapping the distortions won’t stop them but might sharpen your aim on their sources.

Gaza-Israel crisis 2012: every verified incident mapped

Filed under: Government,Mapping,Maps — Patrick Durusau @ 7:42 pm

Gaza-Israel crisis 2012: every verified incident mapped by Ami Sedghi, John Burn-Murdoch and Simon Rogers.

From the post:

What has happened in Gaza and Israel since the assassination of Hamas leader Ahmed al-Jabari last week? This map shows all the verified incidents reported by news sources and wires across the region since then. Click on a dot to see an event – or download data for yourself. Search an address or share view to get the precise url

How can you help?

If you know of an event we’ve missed, help us add it to the map by giving us the details below at this Google Form or email us data@guardian.co.uk. We’re also looking for your photos and videos

Nice to know someone trusts us with the real data, instead of sound-bite summaries.

Hadoop on Azure : Introduction

Filed under: Azure Marketplace,Hadoop — Patrick Durusau @ 7:28 pm

Hadoop on Azure : Introduction by BrunoTerkaly.

From the post:

I am in complete awe on how this technology is resonating with today’s developers. If I invite developers for an evening event, Big Data is always a sellout.

This particular post is about getting everyone up to speed about what Hadoop is at a high level.

Big data is a technology that manages voluminous amount of unstructured and semi-structured data.

Due to its size and semi-structured nature, it is inappropriate for relational databases for analysis.

Big data is generally in the petabytes and exabytes of data.

A very high-level view, but a series to watch as the details emerge on using Hadoop on Azure.

Big Data Security Part Two: Introduction to PacketPig

Filed under: BigData,Network Security,Security — Patrick Durusau @ 7:20 pm

Big Data Security Part Two: Introduction to PacketPig by Michael Baker.

From the post:

Packetpig is the tool behind Packetloop. In Part One of the Introduction to Packetpig I discussed the background and motivation behind the Packetpig project and problems Big Data Security Analytics can solve. In this post I want to focus on the code and teach you how to use our building blocks to start writing your own jobs.

The ‘building blocks’ are the Packetpig custom loaders that allow you to access specific information in packet captures. There are a number of them but two I will focus in this post are;

  • Packetloader() allows you to access protocol information (Layer-3 and Layer-4) from packet captures.
  • SnortLoader() inspects traffic using Snort Intrusion Detection software.

Just in case you get bored with holiday guests, you can spend some quality time looking around on the other side of your cable router. 😉

Or deciding how you would model such traffic using a topic map.

Both would be a lot of fun.
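Packetpig itself is built on Pig, but if you just want a feel for the Layer-3/Layer-4 information those loaders expose, here is a minimal Python stand-in (not Packetpig) using scapy against a capture file:

# Not Packetpig: a small scapy sketch that pulls the same kind of
# Layer-3/Layer-4 fields (addresses, ports) out of a pcap file.
from collections import Counter
from scapy.all import rdpcap, IP, TCP, UDP

packets = rdpcap("capture.pcap")   # placeholder capture file
talkers = Counter()

for pkt in packets:
    if not pkt.haslayer(IP):
        continue
    src, dst = pkt[IP].src, pkt[IP].dst
    if pkt.haslayer(TCP):
        dport = pkt[TCP].dport
    elif pkt.haslayer(UDP):
        dport = pkt[UDP].dport
    else:
        dport = None
    talkers[(src, dst, dport)] += 1

for (src, dst, dport), count in talkers.most_common(10):
    print(src, "->", dst, "port", dport, ":", count, "packets")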

Towards a Scalable Dynamic Spatial Database System [Watching Watchers]

Filed under: Database,Geographic Data,Geographic Information Retrieval,Spatial Index — Patrick Durusau @ 5:07 pm

Towards a Scalable Dynamic Spatial Database System by Joaquín Keller, Raluca Diaconu, Mathieu Valero.

Abstract:

With the rise of GPS-enabled smartphones and other similar mobile devices, massive amounts of location data are available. However, no scalable solutions for soft real-time spatial queries on large sets of moving objects have yet emerged. In this paper we explore and measure the limits of actual algorithms and implementations regarding different application scenarios. And finally we propose a novel distributed architecture to solve the scalability issues.

At least in this version, you will find two copies of the same paper, the second copy sans the footnotes. So read the first twenty (20) pages and ignore the second eighteen (18) pages.

I thought the limitation of location to two dimensions understandable for the use cases given, but I am less convinced that treating a third dimension as an extra attribute will always be suitable.

Still, the results here are impressive compared to current solutions, so an additional dimension can be a future improvement.
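To make the problem concrete, here is a toy sketch of the kind of structure such systems build on: a uniform grid over 2D positions with cheap updates as objects move and simple rectangular range queries. It illustrates the general technique, not the architecture the paper proposes:

# Toy uniform-grid index for moving objects in 2D: cheap position updates
# and rectangular range queries. Illustrative only, not the paper's design.
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(set)     # (cx, cy) -> set of object ids
        self.positions = {}               # object id -> (x, y)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, obj_id, x, y):
        """Move obj_id to (x, y), re-bucketing only if its cell changed."""
        old = self.positions.get(obj_id)
        if old is not None and self._cell(*old) != self._cell(x, y):
            self.cells[self._cell(*old)].discard(obj_id)
        self.cells[self._cell(x, y)].add(obj_id)
        self.positions[obj_id] = (x, y)

    def range_query(self, x_min, y_min, x_max, y_max):
        """Return ids of objects inside the rectangle."""
        result = []
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                for obj_id in self.cells[(cx, cy)]:
                    x, y = self.positions[obj_id]
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        result.append(obj_id)
        return result

index = GridIndex(cell_size=0.01)          # roughly 1 km cells in degrees
index.update("phone-42", 48.8566, 2.3522)  # hypothetical device position
print(index.range_query(48.85, 2.35, 48.86, 2.36))

The hard part, which is what the paper tackles, is doing this distributed, at scale, and in soft real time.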

The use case that I see missing is an ad hoc network of users feeding geo-based information back to a collection point.

While the watchers are certainly watching us, technology may be on the cusp of answering the question: “Who watches the watchers?” (The answer may be us.)

I first saw this in a tweet by Stefano Bertolo.

DBpedia 3.8 Downloads

Filed under: DBpedia,RDF — Patrick Durusau @ 4:34 pm

DBpedia 3.8 Downloads

From the webpage:

This page provides downloads of the DBpedia datasets. The DBpedia datasets are licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. The downloads are provided as N-Triples and N-Quads, where the N-Quads version contains additional provenance information for each statement. All files are bzip2 packed.

I had to ask to find this one.

One interesting feature that would bear repetition elsewhere is the ability to see a sample of a data file.

For example, at Links to Wikipedia Article, next to “nt” (N-Triple), there is a “?” that when followed displays in part:

<http://dbpedia.org/resource/AccessibleComputing> <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <http://en.wikipedia.org/wiki/AccessibleComputing> .
<http://en.wikipedia.org/wiki/AccessibleComputing> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/AccessibleComputing> .
<http://en.wikipedia.org/wiki/AccessibleComputing> <http://purl.org/dc/elements/1.1/language> "en"@en .
<http://dbpedia.org/resource/AfghanistanHistory> <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <http://en.wikipedia.org/wiki/AfghanistanHistory> .
<http://en.wikipedia.org/wiki/AfghanistanHistory> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/AfghanistanHistory> .
<http://en.wikipedia.org/wiki/AfghanistanHistory> <http://purl.org/dc/elements/1.1/language> "en"@en .
<http://dbpedia.org/resource/AfghanistanGeography> <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <http://en.wikipedia.org/wiki/AfghanistanGeography> .
<http://en.wikipedia.org/wiki/AfghanistanGeography> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/AfghanistanGeography> .
<http://en.wikipedia.org/wiki/AfghanistanGeography> <http://purl.org/dc/elements/1.1/language> "en"@en .

This enabled me to conclude that, for my purposes, the reverse pointing between DBpedia and Wikipedia was repetitious. And since the entire dataset is only for the English version of Wikipedia, the declaration of language was superfluous.

That may not be true for your intended use of DBpedia data.
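If it is true for your use as well, the cleanup is a quick one-pass job. A minimal sketch (line-based, so it only copes with simple triples like the sample above; the file names are placeholders) that drops one of the two mirrored link directions plus the language triples:

# Minimal sketch: strip one of the mirrored link directions (here, the
# primaryTopic triples) and the language triples from an N-Triples dump
# before loading it anywhere. Swap or extend the tests to suit your use.
import bz2

SOURCE = "wikipedia_links_en.nt.bz2"   # placeholder name for the bzip2 dump
TARGET = "wikipedia_links_slim.nt"

kept = dropped = 0
with bz2.open(SOURCE, "rt", encoding="utf-8") as src, \
        open(TARGET, "w", encoding="utf-8") as dst:
    for line in src:
        if "/primaryTopic>" in line or "/language>" in line:
            dropped += 1
            continue
        dst.write(line)
        kept += 1

print("kept", kept, "triples, dropped", dropped)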

My point being that seeing sample data allows a quick evaluation before downloading large amounts of data.

A feature I would like to see for other data sets.

Wikipedia:Database download

Filed under: Data,Wikipedia — Patrick Durusau @ 3:30 pm

Wikipedia:Database download

From the webpage:

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

I know you are already aware of this as a data source but every time I want to confirm something about it, I have a devil of a time finding it at Wikipedia.

If I remember that I wrote about it here, perhaps it will be easier to find. 😉

What I need to do is get one of those multi-terabyte network appliances for Christmas. Then copy large data sets that I don’t need updated as often as I need to consult their structures. (Like the next one I am about to mention.)
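If, like me, you mostly want to consult the structure of a dump rather than load it into anything, it can be streamed straight off the compressed file. A minimal sketch, assuming the usual pages-articles file name (a placeholder; adjust to whatever you actually download):

# Minimal sketch: stream page titles out of a Wikipedia XML dump without
# decompressing it to disk or loading a database. File name is a placeholder.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"

titles_seen = 0
with bz2.open(DUMP, "rb") as fh:
    for _, elem in ET.iterparse(fh):
        # MediaWiki export XML is namespaced, so match on the tag suffix.
        if elem.tag.endswith("}title") or elem.tag == "title":
            titles_seen += 1
            if titles_seen <= 5:          # peek at the first few titles
                print(elem.text)
        elem.clear()                      # free element contents as we go

print(titles_seen, "page titles seen")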

Balisage 2013 – Dates/Location

Filed under: Conferences,XML,XML Database,XML Query Rewriting,XML Schema,XPath,XQuery,XSLT,XTM — Patrick Durusau @ 3:19 pm

Tommie Usdin just posted email with the Balisage 2013 dates and location:

Montreal, Hotel Europa, August 5 – 9 , 2013

Hope that works with everything else.

That’s the entire email so I don’t know what was meant by:

Hope that works with everything else.

Short of it being your own funeral, open-heart surgery or giving birth (to your first child), I am not sure what “everything else” there could be?

You get a temporary excuse for the second two cases and a permanent excuse for the first one.

Now’s a good time to hint about plane fare plus hotel and expenses for Balisage as a stocking stuffer.

And to wish a happy holiday to Tommie Usdin and to all the folks at Mulberry Technology who make Balisage possible for all of us. Each and every one.

November 19, 2012

Psychological Studies of Policy Reasoning

Filed under: Psychology,Subject Identity,Users — Patrick Durusau @ 7:47 pm

Psychological Studies of Policy Reasoning by Adam Wyner.

From the post:

The New York Times had an article on the difficulties that the public has to understand complex policy proposals – I’m Right (For Some Reason). The points in the article relate directly to the research I’ve been doing at Liverpool on the IMPACT Project, for we decompose a policy proposal into its constituent parts for examination and improved understanding. See our tool live: Structured Consultation Tool

Policy proposals are often presented in an encapsulated form (a sound bite). And those receiving it presume that they understand it, the illusion of explanatory depth discussed in a recent article by Frank Keil (a psychology professor at Cornell when and where I was a Linguistics PhD student). This is the illusion where people believe they understand a complex phenomena with greater precision, coherence, and depth than they actually do; they overestimate their understanding. To philosophers, this is hardly a new phenomena, but showing it experimentally is a new result.

In research about public policy, the NY Times authors, Sloman and Fernbach, describe experiments where people state a position and then had to justify it. The results showed that participants softened their views as a result, for their efforts to justify it highlighted the limits of their understanding. Rather than statements of policy proposals, they suggest:

An approach to get people to state how they would, or would not, distinguish two subjects?

Would it make a difference if the questions were oral or in writing?

Since a topic map is an effort to capture a domain expert’s knowledge, tools to elicit that knowledge are important.

Accelerating literature curation with text-mining tools:…

Filed under: Bioinformatics,Curation,Literature,Text Mining — Patrick Durusau @ 7:35 pm

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents, doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery, triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.
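As a rough illustration of that scaffolding, here is a minimal sketch that turns a few made-up gene annotations from curated abstracts into topic and association stubs. The identifiers and data layout are hypothetical; a real pipeline would start from PubTator output:

# Minimal sketch: turn curated annotations (hypothetical data) into
# topic and association stubs a topic map author could start from.
annotations = [
    {"pmid": "12345678", "mention": "ras", "gene_id": "FBgn0003204"},
    {"pmid": "12345678", "mention": "Ras85D", "gene_id": "FBgn0003205"},
    {"pmid": "23456789", "mention": "dpp", "gene_id": "FBgn0000490"},
]

topics = {}        # subject identifier -> topic stub
associations = []  # "discussed-in" associations between gene and paper

for ann in annotations:
    gene_sid = "http://flybase.org/reports/" + ann["gene_id"]
    paper_sid = "http://www.ncbi.nlm.nih.gov/pubmed/" + ann["pmid"]
    topics.setdefault(gene_sid, {"names": set()})["names"].add(ann["mention"])
    topics.setdefault(paper_sid, {"names": {"PMID " + ann["pmid"]}})
    associations.append({"type": "discussed-in",
                         "gene": gene_sid, "paper": paper_sid})

for sid, topic in topics.items():
    print(sid, sorted(topic["names"]))
print(len(associations), "associations")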

The “Ask Bigger Questions” Contest!

Filed under: Cloudera,Contest,Hadoop — Patrick Durusau @ 7:17 pm

The “Ask Bigger Questions” Contest! by Ryan Goldman. (Deadline, Feb. 1 2013)

From the post:

Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.

Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.

Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?

The most compelling stories chosen from all entrants will receive prizes like Amazon gift cards, discounted Cloudera University training, autographed copies of Hadoop books from O’Reilly Media, and Cloudera swag. We may even turn your story into a case study!

Sign up to participate here. Submissions must be received by Friday, Feb. 1, 2013 to qualify for a prize.

A good marketing technique that might bear imitation.

You don’t have to seek out success stories when there are incentives for people to bring them to you.

You get good marketing material that is likely to resonate with other users.

Something to think about.

FindTheData

Filed under: Data,Data Source,Dataset — Patrick Durusau @ 7:03 pm

FindTheData

From the about page:

At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

FindTheData is organized into 9 broad categories

Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

Traditional search is a great hammer but sometimes you need a wrench.

Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

If you think in the same categories as the authors, sorta like using LCSH, you are in like Flint. If you don’t, well, your mileage may vary.

While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

Still, you may find it useful to browse for datasets that are new to you.

Georeferencer: Crowdsourced Georeferencing for Map Library Collections

Georeferencer: Crowdsourced Georeferencing for Map Library Collections by Christopher Fleet, Kimberly C. Kowal and Petr Přidal.

Abstract:

Georeferencing of historical maps offers a number of important advantages for libraries: improved retrieval and user interfaces, better understanding of maps, and comparison/overlay with other maps and spatial data. Until recently, georeferencing has involved various relatively time-consuming and costly processes using conventional geographic information system software, and has been infrequently employed by map libraries. The Georeferencer application is a collaborative online project allowing crowdsourced georeferencing of map images. It builds upon a number of related technologies that use existing zoomable images from library web servers. Following a brief review of other approaches and georeferencing software, we describe Georeferencer through its five separate implementations to date: the Moravian Library (Brno), the Nationaal Archief (The Hague), the National Library of Scotland (Edinburgh), the British Library (London), and the Institut Cartografic de Catalunya (Barcelona). The key success factors behind crowdsourcing georeferencing are presented. We then describe future developments and improvements to the Georeferencer technology.

If your institution has a map collection or if you are interested in maps at all, you need to read this article.

There is an introduction video if you prefer: http://www.klokantech.com/georeferencer/.

Either way, you will be deeply impressed by this project.

And wondering: Can the same lessons be applied to crowd source the creation of topic maps?

D-Lib

Filed under: Archives,Digital Library,Library — Patrick Durusau @ 4:26 pm

D-Lib

From the about page:

D-Lib Magazine is an electronic publication with a focus on digital library research and development, including new technologies, applications, and contextual social and economic issues. D-Lib Magazine appeals to a broad technical and professional audience. The primary goal of the magazine is timely and efficient information exchange for the digital library community to help digital libraries be a broad interdisciplinary field, and not a set of specialties that know little of each other.

I am about to post about an article in D-Lib and realize I don’t have a blog entry on D-Lib!

Not that it is topic map specific, but it is digital library specific, with all the issues that entails. Those issues are remarkably similar to the ones any topic map author or software will face.

D-Lib has proven what many of us suspected:

The quality of content is not related to the medium of delivery.

Enjoy!

BioInformatics: A Data Deluge with Hadoop to the Rescue

Filed under: Bioinformatics,Cloudera,Hadoop,Impala — Patrick Durusau @ 4:10 pm

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera, but it walks you through using Impala with the FDA data on adverse drug reactions (FAERS).

It demonstrates that getting started with Impala isn’t hard. Which is true.

What’s lacking is a measure of how hard it is to get good results.

Any old result, good or bad, probably isn’t of interest to most users.

Update: TabLinker & UnTabLinker

Filed under: CSV,Excel,RDF,TabLinker/UnTabLinker — Patrick Durusau @ 2:48 pm

Update: TabLinker & UnTabLinker

From the post:

TabLinker, introduced in an earlier post, is a spreadsheet to RDF converter. It takes Excel/CSV files as input, and produces enriched RDF graphs with cell contents, properties and annotations using the DataCube and Open Annotation vocabularies.

TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.

Features:

  • Raw, model-agnostic conversion from spreadsheets to RDF
  • Interactive spreadsheet marking within Excel
  • Automatic annotation recognition and export with OA
  • Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)

Even with conversion tools, the question has to be asked:

What was gained by the conversion? Yes, yes, the data is now an RDF graph, but what can I do now that I could not do before?

With the caveat that it has to be something I want to do.

Popcorn infographics

Filed under: Graphics,Infographics,Visualization — Patrick Durusau @ 2:39 pm

Popcorn infographics by Aleksey Nozdryn-Plotnicki.

From the post:

On my way to Crete recently, I was flipping through the in-flight magazine when I stumbled upon this treat. This full-page piece was about Claire Cock-Starkey’s upcoming (at the time) book, Seeing the Bigger Picture.

You have to see the graphics for most of the post to work, but I thought it worth calling it to your attention.

The book appears to be one of those that claim: “…no one has ever done it this way before.” You don’t have to read very carefully to discover there is a good reason why that is true.

Or to quote Aleksey when he cites a review of the book:

“Bought this for my 14 yr old – absolutely loves it and showed friends who were also suitably impressed. Thank you”

Seeing the Bigger Picture by Claire Cock-Starkey. Publisher: London : Michael O’Mara, 2012. I could not find any professional reviews as of November 19, 2012.

NFL Rushing Tendencies Visualization…or Ahhhh, It’s a Spider!

Filed under: Graphs,Visualization — Patrick Durusau @ 6:17 am

NFL Rushing Tendencies Visualization…or Ahhhh, It’s a Spider! by Zach Gemignani.

From the post:

The mad scientists over in Juice Labs cooked up a new treat for you all just in time for Halloween Thanksgiving. We’ve crossbred some NFL Data (courtesy of our fellow friend in data Brian Burke) with a visualization we’ve kept under wraps for a while called The Spider. Now before any of you that suffer from arachnophobia start to freak out, this isn’t the spider that you may be accustomed to. The Spider visualization helps you understand the offensive rush tendencies for every NFL team for the last 4 seasons. EVERY rush that ever happened in the NFL from 2008-2011 is captured in the visualization. Information about the average yards gained, total yards gained, and the number of plays in each direction are also included. You may be surprised to find out which direction the Green Bay Packers run to on 4th and inches or how unsuccessful Arian Foster has historically been running the football on 1st and long situations (he’s a pretty awesome RB otherwise). You can filter the data by week, season, down, distance, player, and/or the opposing defense. Try it out for yourself here for some deliciously filling insights.

How to Read the Chart:
The thickness of the spider leg represents the number of rushes in that direction. A thicker line means more rushes were attempted to that area. The length of the spider leg stops on the average yards gained per rush in that direction. Click or hover over each leg for more detailed information about the team’s offensive rush tendencies.

About time they did something useful with data and visualization. 😉

Especially useful for armchair quarterbacks and coaches.

Now imagine an application that could track all the records for a player across their career starting in high school, including members of opposing teams, complete with plays and results.

Constructing a true LCSH tree of a science and engineering collection

Filed under: Cataloging,Classification,Classification Trees,Hierarchy,LCSH,Library,Trees — Patrick Durusau @ 5:49 am

Constructing a true LCSH tree of a science and engineering collection by Charles-Antoine Julien, Pierre Tirilly, John E. Leide and Catherine Guastavino.

Abstract:

The Library of Congress Subject Headings (LCSH) is a subject structure used to index large library collections throughout the world. Browsing a collection through LCSH is difficult using current online tools in part because users cannot explore the structure using their existing experience navigating file hierarchies on their hard drives. This is due to inconsistencies in the LCSH structure, which does not adhere to the specific rules defining tree structures. This article proposes a method to adapt the LCSH structure to reflect a real-world collection from the domain of science and engineering. This structure is transformed into a valid tree structure using an automatic process. The analysis of the resulting LCSH tree shows a large and complex structure. The analysis of the distribution of information within the LCSH tree reveals a power law distribution where the vast majority of subjects contain few information items and a few subjects contain the vast majority of the collection.

After a detailed analysis of records from the McGill University Libraries (204,430 topical authority records) and 130,940 bibliographic records (Schulich Science and Engineering Library), the authors conclude in part:

This revealed that the structure was large, highly redundant due to multiple inheritances, very deep, and unbalanced. The complexity of the LCSH tree is a likely usability barrier for subject browsing and navigation of the information collection.

For me the most compelling part of this research was the focus on LCSH as used and not as it imagines itself. Very interesting reading. A slow walk through the bibliography will interest those researching LCSH or classification more generally.

Demonstration of the power law with the use of LCSH makes one wonder about other classification systems as used.
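The core of the transformation is easy to prototype. A minimal sketch, with a made-up broader-term table, that forces multiple inheritance into a tree by keeping one parent per heading and then looks at how records are distributed across subjects:

# Minimal sketch: force a multi-parent subject structure into a tree by
# keeping one parent per heading, then look at how records are distributed.
# The headings and counts below are made up for illustration.
from collections import Counter

# heading -> list of broader headings (multiple inheritance allowed)
broader = {
    "Machine learning": ["Artificial intelligence", "Statistics"],
    "Artificial intelligence": ["Computer science"],
    "Statistics": ["Mathematics"],
    "Computer science": [],
    "Mathematics": [],
}

# keep a single parent (here: the first listed) to obtain a valid tree
parent = {h: (bs[0] if bs else None) for h, bs in broader.items()}

# bibliographic records indexed by heading (made-up counts)
records_per_heading = Counter({
    "Machine learning": 120, "Artificial intelligence": 40,
    "Statistics": 15, "Computer science": 3, "Mathematics": 2,
})

def depth(heading):
    d = 0
    while parent.get(heading):
        heading = parent[heading]
        d += 1
    return d

for heading, count in records_per_heading.most_common():
    print(heading, "depth", depth(heading), "records", count)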

November 18, 2012

… text mining in the FlyBase genetic literature curation workflow

Filed under: Curation,Genomics,Text Mining — Patrick Durusau @ 5:47 pm

Opportunities for text mining in the FlyBase genetic literature curation workflow by Peter McQuilton. (Database (2012) 2012 : bas039 doi: 10.1093/database/bas039)

Abstract:

FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.

Database URL: http://flybase.org

Would you believe that ambiguity is problem #1 and describing relationships is another one?

The most common problem encountered during curation is an ambiguous genetic entity (gene, mutant allele, transgene, etc.). This situation can arise when no unique identifier (such as a FlyBase gene identifier (FBgn) or a computed gene (CG) number for genes), or an accurate and explicit reference for a mutant or transgenic line is given. Ambiguity is a particular problem when a generic symbol/ name is used (e.g. ‘Actin’ or UAS-Notch), or when a symbol/ name is used that is a synonym for a different entity (e.g. ‘ras’ is the current FlyBase symbol for the ‘raspberry’ gene, FBgn0003204, but is often used in the literature to refer to the ‘Ras85D’ gene, FBgn0003205). A further issue is that some symbols only differ in case-sensitivity for the first character, for example, the genes symbols ‘dl’ (dorsal) and ‘Dl’ (Delta). These ambiguities can usually be resolved by searching for associated details about the entity in the article (e.g. the use of a specific mutant allele can identify the gene being discussed) or by consulting the supplemental information for additional details. Sometimes we have to do some analysis ourselves, such as performing a BLAST search using any sequence data present in the article or supplementary files or executing an in-house script to report those entities used by a specified author in previously curated articles. As a final step, if we cannot resolve a problem, we email the corresponding author for clarification. If the ambiguity still cannot be resolved, then a curator will either associate a generic/unspecified entry for that entity with the article, or else omit the entity and add a (non-public) note to the curation record explaining the situation, with the hope that future publications will resolve the issue.

One of the more esoteric problems found in curation is the fact that multiple relationships exist between the curated data types. For example, the ‘dppEP2232 allele’ is caused by the ‘P{EP}dppEP2232 insertion’ and disrupts the ‘dpp gene’. This can cause problems for text-mining assisted curation, as the data can be attributed to the wrong object due to sentence structure or the requirement of back- ground or contextual knowledge found in other parts of the article. In cases like this, detailed knowledge of the FlyBase proforma and curation rules, as well as a good knowledge of Drosophila biology, is necessary to ensure the correct proforma field is filled in. This is one of the reasons why we believe text-mining methods will assist manual curation rather than replace it in the near term.

I like the “manual curation” line. Curation is a task best performed by a sentient being.
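That said, a first pass at the symbol problem is mechanical enough to sketch. A minimal example with a hypothetical, case-sensitive synonym table; as the authors say, real resolution needs context and sometimes an email to the corresponding author:

# Minimal sketch: first-pass, case-sensitive resolution of gene symbols
# against a hypothetical synonym table. Ambiguous or unknown symbols are
# flagged for a human curator rather than silently resolved.
symbol_table = {
    "dl": ["FBgn:dorsal"],              # case matters: 'dl' vs 'Dl'
    "Dl": ["FBgn:Delta"],
    "ras": ["FBgn:raspberry", "FBgn:Ras85D"],   # ambiguous in the literature
    "Ras85D": ["FBgn:Ras85D"],
}

def resolve(symbol):
    candidates = symbol_table.get(symbol, [])
    if len(candidates) == 1:
        return ("resolved", candidates[0])
    if len(candidates) > 1:
        return ("ambiguous", candidates)   # needs context or curator review
    return ("unknown", None)

for mention in ["dl", "Dl", "ras", "Actin"]:
    print(mention, "->", resolve(mention))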

LucidWorks Announces Lucene Revolution 2013

Filed under: Conferences,Lucene,LucidWorks — Patrick Durusau @ 4:50 pm

LucidWorks Announces Lucene Revolution 2013 by Paul Doscher, CEO of LucidWorks.

From the webpage:

LucidWorks, the trusted name in Search, Discovery and Analytics, today announced that Lucene Revolution 2013 will take place at The Westin San Diego on April 29 – May 2, 2013. Many of the brightest minds in open source search will convene at this 4th annual Lucene Revolution to discuss topics and trends driving the next generation of search. The conference will be preceded by two days of Apache Lucene, Solr and Big Data training.

BTW, the call for papers opened up on November 12, 2012, but you still have time left: http://lucenerevolution.org/2013/call-for-papers

Jan. 13, 2013: CFP closes
Feb 1, 2013: Speakers notified

European Parliament Proceedings Parallel Corpus 1996-2011

European Parliament Proceedings Parallel Corpus 1996-2011

From the webpage:

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.

Version 7, released in May of 2012, has around 60 million words per language.

Just in case you need a corpus for the EU.

I would be mindful of its parliamentary context. Semantic equivalence or similarity there may not hold true for other contexts.
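Reading an aligned language pair is straightforward: the release ships one plain-text file per language, aligned line by line. A minimal sketch, assuming the v7 French-English file names (check them against the files you actually download):

# Minimal sketch: walk a sentence-aligned Europarl language pair.
# File names assume the v7 release layout; verify against your download.
en_path = "europarl-v7.fr-en.en"
fr_path = "europarl-v7.fr-en.fr"

pairs = 0
with open(en_path, encoding="utf-8") as en, open(fr_path, encoding="utf-8") as fr:
    for en_line, fr_line in zip(en, fr):
        pairs += 1
        if pairs <= 3:                      # peek at the first few pairs
            print("EN:", en_line.strip())
            print("FR:", fr_line.strip())

print(pairs, "aligned sentence pairs")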

Authors and Articles, Keywords, SOMs and Graphs [Oh My!]

Filed under: Graphs,Keywords,Self Organizing Maps (SOMs) — Patrick Durusau @ 4:28 pm

Analyzing Authors and Articles Using Keyword Extraction, Self-Organizing Map and Graph Algorithms by Tommi Vatanen, Mari-Sanna Paukkeri, Ilari T. Nieminen and Timo Honkela.

An attempt to enable participants at an interdisciplinary conference to find others with similar interests and to learn about other participants.

Be aware the URL given in the article for the online demo now returns a 404.

Interesting approach, but be aware that if it used Likey as described in A Language-Independent Approach to Keyphrase Extraction and Evaluation, phrases absent from the reference corpus may be omitted from the results.

I mention that because the reference corpus was Europarl (European Parliament Proceedings Parallel Corpus).

I would not bet on the similarities between the “European Parliament Proceedings” and the “International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.” Would you?
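To see why the reference corpus matters, here is a toy sketch of a Likey-style keyness score as I understand it: the ratio of a phrase's frequency rank in the document to its rank in the reference corpus, lower meaning more characteristic. Treat the formula and the handling of missing phrases as my assumptions, not the authors' implementation:

# Toy sketch of a Likey-style keyness score: rank in the document divided by
# rank in the reference corpus, lower meaning more characteristic of the text.
# The formula and the missing-phrase handling are assumptions for illustration.
def ranks(freqs):
    """Map each phrase to its 1-based frequency rank (most frequent = 1)."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {phrase: i + 1 for i, phrase in enumerate(ordered)}

doc_freqs = {"knowledge representation": 14, "the committee": 9, "session": 5}
ref_freqs = {"the committee": 10000, "session": 50000}   # Europarl-like counts

doc_rank = ranks(doc_freqs)
ref_rank = ranks(ref_freqs)
# Crude default rank for phrases unseen in the reference corpus; a real
# system may handle them differently, or, per the point above, drop them.
max_ref_rank = len(ref_rank) + 1

for phrase in doc_freqs:
    keyness = doc_rank[phrase] / ref_rank.get(phrase, max_ref_rank)
    note = "" if phrase in ref_rank else "(absent from reference corpus)"
    print(phrase, round(keyness, 3), note)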

Leaving those quibbles to one side, interesting work, particularly if viewed as the means to explore textual data for later editing.

CiteSeer does not report a date for this paper and it does not appear in DBLP for any of the authors. Timo Honkela’s publications page gives it the following suggested BibTeX entry:

@inproceedings{sompapaper,
author = {Tommi Vatanen and Mari-Sanna Paukkeri and Ilari T. Nieminen and Timo Honkela},
booktitle = {Proceedings of the AKRR08},
pages = {105--111},
title = {Analyzing Authors and Articles Using Keyword Extraction, Self-Organizing Map and Graph Algorithms},
year = {2008},
}

Level Up: Study Reveals Keys to Gamer Loyalty [Tips For TM Interfaces]

Filed under: Interface Research/Design,Marketing,Usability,Users — Patrick Durusau @ 11:24 am

Level Up: Study Reveals Keys to Gamer Loyalty

For topic maps that aspire to be common meeting places, there are a number of lessons in this study. The study is forthcoming but quoting from the news coverage:

One strategy found that giving players more control and ownership of their character increased loyalty. The second strategy showed that gamers who played cooperatively and worked with other gamers in “guilds” built loyalty and social identity.

“To build a player’s feeling of ownership towards its character, game makers should provide equal opportunities for any character to win a battle,” says Sanders. “They should also build more selective or elaborate chat rooms and guild features to help players socialize.”

In an MMORPG, players share experiences, earn rewards and interact with others in an online world that is ever-present. It’s known as a “persistent-state-world” because even when a gamer is not playing, millions of others around the globe are.

Some MMORPGs operate on a subscription model where gamers pay a monthly fee to access the game world, while others use the free-to-play model where access to the game is free but may feature advertising, additional content through a paid subscription or optional purchases of in-game items or currency.

The average MMORPG gamer spends 22 hours per week playing.

Research on loyalty has found that increasing customer retention by as little as 5 percent can increase profits by 25 to 95 percent, Sanders points out.

So, how would you like to have people paying to use your topic map site 22 hours per week?

There are challenges in adapting these strategies to a topic map context but that would be your value-add.

I first saw this at ScienceDaily.

The study will be published in the International Journal of Electronic Commerce.

That link is: http://www.ijec-web.org/. For the benefit of ScienceDaily and the University at Buffalo.

Either they were unable to find that link or are unfamiliar with the practice of placing hyperlinks in HTML texts to aid readers in locating additional resources.
