Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 12, 2012

Cinemetrics creates a visual fingerprint for movies

Filed under: Visualization — Patrick Durusau @ 7:31 pm

Cinemetrics creates a visual fingerprint for movies by Nathan Yau.

From the post:

Each film is broken into segments, where each segment represents ten shots. Color changes with each movie and with each ten-shot chapter. And then the segments are set in motion based on the amount of movement in that chapter so that action sequences show rapid pulsations. For example, the first circle in the top left is Alien, whereas the last one in the second row is The Simpsons.

Frederic Brodbeck’s bachelor graduation project at the Royal Academy of Arts (KABK), Den Haag.

You really have to go to Frederic’s page and scroll to the bottom to see his “book” about the project. Particularly if you think you can read fast. 😉

I like the visualization but I am less convinced that animating the visualization adds anything to it. That may be because I am more accustomed to visualizations that sit still and allow me to study them.

Bugs, features, and risk

Filed under: Bugs,Proofing — Patrick Durusau @ 7:29 pm

Bugs, features, and risk by John D. Cook.

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.
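Cook’s risk formula is easy to turn into a crude triage tool. Here is a minimal sketch in Python (the code paths, probabilities and costs are invented for illustration, not taken from his post):

    # Minimal sketch (not from Cook's post): rank code paths by expected risk,
    # where risk = probability of execution * cost of failure.
    # The paths and the numbers are made up for illustration.

    paths = [
        # (name, probability a given run executes this path, cost if it is wrong)
        ("input validation",  0.95, 10),
        ("error recovery",    0.05, 500),
        ("report formatting", 0.60, 5),
        ("payment posting",   0.30, 1000),
    ]

    # Sort by expected risk so testing effort goes where it matters most.
    ranked = sorted(paths, key=lambda p: p[1] * p[2], reverse=True)

    for name, prob, cost in ranked:
        print(f"{name:20s} expected risk = {prob * cost:8.1f}")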

Has anyone done error studies on RDF/OWL/LinkedData? Asking because obviously topic maps, Semantic Web, and other semantic applications are going to have errors.

Some obvious questions:

  • How does your application respond to bad data (errors)?
  • What data is most critical to be correct?
  • What is your acceptable error rate? (0 is not an acceptable answer)
  • What is the error rate for data entry with your application?

If you are interested in error correction, in semantic contexts or otherwise, start with General Error Detection, a set of pages maintained by Roy Panko.

From General Error Detection homepage:

Proofreading catches about 90% of all nonword spelling errors and about 70% of all word spelling errors. The table below shows that error detection varies widely by the type of task being done.

In general, our error detection rate only approaches 90% for simple mechanical errors, such as mistyping a number.

For logic errors, error detection is far worse, often 50% or less.

For omission errors, where we have left something out, correction rates are very low.
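Those rates compound in interesting ways. A small sketch (mine, not Panko’s, and it assumes review passes are independent, which real proofreading passes are not) of how many errors survive repeated review:

    # Back-of-the-envelope sketch: if one proofreading pass catches a fraction p
    # of errors of a given type, and passes are treated as independent, the
    # chance an error survives n passes is (1 - p) ** n. Treat this as an
    # optimistic bound on what repetition alone can buy you.

    def survival_rate(catch_rate: float, passes: int) -> float:
        return (1 - catch_rate) ** passes

    for label, p in [("nonword spelling (0.90)", 0.90),
                     ("word spelling (0.70)", 0.70),
                     ("logic errors (0.50)", 0.50)]:
        rates = ", ".join(f"{survival_rate(p, n):.3f}" for n in (1, 2, 3))
        print(f"{label}: error survival after 1, 2, 3 passes = {rates}")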

Probably Overthinking It

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:28 pm

Probably Overthinking It: A blog by Allen Downey about statistics and probability.

If your work has any aspect of statistics/probability about it, you probably need to be reading this blog.

I commend it to topic mappers because claims about data are often expressed as statistics.

Not to mention that the results of statistics are subjects themselves, which you may wish to include in your topic map.

Complexity and Computation

Filed under: Books,Complexity,Computation — Patrick Durusau @ 7:27 pm

Complexity and Computation by Allen B. Downey.

Another free (you can order hard copy) book from Allen B. Downey. See my post: Think Stats: Probability and Statistics for Programmers or jump to Green Tea Press to see these and other titles for free download.

Description:

This book is about complexity science, data structures and algorithms, intermediate programming in Python, and the philosophy of science:

  • Data structures and algorithms: A data structure is a collection that contains data elements organized in a way that supports particular operations. For example, a dictionary organizes key-value pairs in a way that provides fast mapping from keys to values, but mapping from values to keys is generally slower.

    An algorithm is a mechanical process for performing a computation. Designing efficient programs often involves the co-evolution of data structures and the algorithms that use them. For example, the first few chapters are about graphs, a data structure that is a good implementation of a graph—nested dictionaries—and several graph algorithms that use this data structure.

  • Python programming: This book picks up where Think Python leaves off. I assume that you have read that book or have equivalent knowledge of Python. As always, I will try to emphasize fundamental ideas that apply to programming in many languages, but along the way you will learn some useful features that are specific to Python.
  • Computational modeling: A model is a simplified description of a system that is useful for simulation or analysis. Computational models are designed to take advantage of cheap, fast computation.
  • Philosophy of science: The models and results in this book raise a number of questions relevant to the philosophy of science, including the nature of scientific laws, theory choice, realism and instrumentalism, holism and reductionism, and Bayesian epistemology.

This book focuses on discrete models, which include graphs, cellular automata, and agent-based models. They are often characterized by structure, rules and transitions rather than by equations. They tend to be more abstract than continuous models; in some cases there is no direct correspondence between the model and a physical system.

Complexity science is an interdisciplinary field—at the intersection of mathematics, computer science and physics—that focuses on these kinds of models. That’s what this book is about.
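If the nested-dictionaries-as-graphs idea sounds abstract, here is a minimal sketch of what it looks like in Python (my own illustration, not code from the book):

    # Outer keys are nodes; inner dictionaries map neighbors to edge data.
    # Looking up a node's neighbors is a fast key lookup, as the book notes.
    from collections import deque

    graph = {
        "a": {"b": 1, "c": 1},
        "b": {"a": 1, "d": 1},
        "c": {"a": 1},
        "d": {"b": 1},
    }

    def reachable(graph, start):
        """Breadth-first search: return the set of nodes reachable from start."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    print(reachable(graph, "a"))   # {'a', 'b', 'c', 'd'}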

Call for Papers on Big Data: Theory and Practice

Filed under: BigData,Heterogeneous Data,Semantic Web — Patrick Durusau @ 7:27 pm

SWJ 2012 : Semantic Web Journal Call for Papers on Big Data: Theory and Practice

Dates:

Manuscript submission due: 13. February 2012
First notification: 26. March 2012
Issue publication: Summer 2012

From the post:

The Semantic Web journal calls for innovative and high-quality papers describing theory and practice of storing, accessing, searching, mining, processing, and visualizing big data. We especially invite papers that describe or demonstrate how ontologies, Linked Data, and Semantic Web technologies can handle the problems arising when integrating massive amounts of multi-thematic and multi-perspective information from heterogeneous sources to answer complex questions that cut through domain boundaries.

We welcome all paper categories, i.e., full research papers, application reports, systems and tools, ontology papers, as well as surveys, as long as they clearly relate to challenges and opportunities arising from processing big data – see our listing of paper types in the author guidelines. In other words, we expect all submitted manuscripts to address how the presented work can exploit massive and/or heterogeneous data.

That Semantic Web technologies represent subjects, as well as being subjects themselves, should enable demonstrations of integrating diverse Semantic Web approaches to the same data, where the underlying data is heterogeneous as well. Now that would be an interesting paper.

FOSDEM 2012 – Call for Volunteers

Filed under: Conferences — Patrick Durusau @ 7:26 pm

FOSDEM 2012 – Call for Volunteers

From the post:

FOSDEM 2012 is almost upon us, and we’re looking for motivated people to help us make it a success again. If you’ve visited FOSDEM in the past, you’ve probably seen our enthusiastic army of volunteers that helped us make FOSDEM a pleasant experience for all our visitors. If you want to be a part of this great team, here’s your chance to sign up!

More specifically, we need help with the following tasks:

  • Setting up the venue
  • Manning the infodesk
  • Handling the donations
  • Translating for our foreign guests
  • Handling taxis for speakers
  • Moderating the main tracks and the lightning talks
  • Room security: barring entrance to overcrowded rooms
  • Cleaning up after the event

Also new this year, we’re looking for volunteers to help with the video recordings. If you have experience handling digital video equipment, be sure to mention this in your application!

This is a chance for you to be a goodwill ambassador for topic maps.

Volunteering for FOSDEM is good for topic maps because:

  1. You will meet new people. (That is you won’t be facing a keyboard/monitor when introduced.)
  2. New people may ask about your interests. (It’s like an online survey but verbal.)
  3. You can talk about your interests. (Same as #2.)
  4. You can ask new people about their interests. (It’s called being polite, if you listen to the response.)
  5. Real time, full bandwidth interaction with others can socialize topic maps. (Particularly over beer.)

Seriously, give the folks at FOSDEM a hand. It is truly a great way to meet people and to get a feel for what it really takes to run a conference.

It will make you truly appreciate smooth-running conferences.

FOSDEM, the Free and Open Source Software Developers' European Meeting

January 11, 2012

Coming of Age: R and Spatial Data Visualisation

Filed under: Graphics,R,Visualization — Patrick Durusau @ 8:09 pm

Coming of Age: R and Spatial Data Visualisation by James Cheshire.

From the post:

I have been using R (a free statistics and graphics software package) now for the past four years or so and I have seen it become an increasingly powerful method of both analysing and visualising spatial data. Crucially, more and more people are writing accessible tutorials (see here) for beginners and intermediate users and the development of packages such as ggplot2 have made it simpler than ever to produce fantastic graphics. You don’t get the interactivity you would with conventional GIS software such as ArcGIS when you produce the visualisation but you are much more flexible in terms of the combinations of plot types and the ease with which they can be combined. It is, for example, time consuming to produce multivariate symbols (such as those varying in size and colour) in ArcGIS but with R it is as simple* as one line of code. I have, for example, been able to add subtle transitions in the lines of the migration map above. Unless you have massive files, plotting happens quickly and can be easily saved to vector formats for tweaking in a graphics package.

Truly impressive visualization work.

Take it as encouragement to go out and do likewise.

New Techniques Turbo-Charge Data Mining

Filed under: Data Mining,Spectral Feature Selection,Spectral Graph Theory — Patrick Durusau @ 8:08 pm

New Techniques Turbo-Charge Data Mining by Nicole Hemsoth.

From the post:

While the phrase “spectral feature selection” may sound cryptic (if not ghostly) this concept is finding a welcome home in the realm of high performance data mining.

We talked with an expert in the spectral feature selection for data mining arena, Zheng Zhao from the SAS Institute, about how trends like this, as well as a host of other new developments, are reshaping data mining for both researchers and industry users.

Zhao says that when it comes to major trends in data mining, cloud and Hadoop represent the key to the future. These developments, he says, offer the high performance data mining tools required to tackle the types of large-scale problems that are becoming more prevalent.

In an interview this week, Zhao predicted that over the next few years, large-scale analytics will be at the forefront of both academic research and industry R&D efforts. On one side, industry has strong requirements for new techniques, software and hardware for solving their real problems at the large scale, while on the other hand, academics find this to be an area laden with interesting new challenges to pursue.

For more details, you may want to see our earlier posts:

Spectral Feature Selection for Data Mining

Spectral Graph Theory

Designing Google Maps

Filed under: Geographic Information Retrieval,Mapping,Maps — Patrick Durusau @ 8:07 pm

Designing Google Maps by Nathan Yau.

From the post:

Google Maps is one of Google’s best applications, but the time, energy, and thought put into designing it often goes unnoticed because of how easy it is to use, for a variety of purposes. Willem Van Lancker, a user experience and visual designer for Google Maps, describes the process of building a map application — color scheme, icons, typography, and “Googley-ness” — that practically everyone can use, worldwide.

I don’t normally disagree with anything Nathan says, particularly about design, but I have to depart from him on why we don’t notice the excellence of Google Maps.

I think we have become accustomed to its excellence and, since most of us don’t look elsewhere, we don’t notice that it isn’t commonplace.

In fact for most of us it is a universe with one inhabitant, Google Maps.

That takes a lot of very hard work and skill.

The question is do you have the chops to make your topic map of one or more infoverses the “only” inhabitant, by user choice?

Algorithms exercise: Find mistakes in Wikipedia articles

Filed under: Algorithms,Critical Reading — Patrick Durusau @ 8:06 pm

Algorithms exercise: Find mistakes in Wikipedia articles by René Pickhardt.

From the post:

Today I started an experiment: I created an exercise for coursework in algorithms and data structures that is very unusual, and many people have been critical of whether this was a good idea. The idea behind the exercise is that students should read Wikipedia articles on topics related to lectures and find mistakes or suggest things that could be improved. Thereby I hope that people will do something that many people in science don’t do often enough: read something critically and carefully and question the things that you have learnt. (more discussions after the exercise)

This is great! Not only can students practice thinking critically, but there is a forum to test their answers: other users of Wikipedia.

Read the article, do the exercises and see how your critical reading skills fare.

MongoGraph One Ups MongoDB With Semantic Power (Humor)

Filed under: AllegroGraph,MongoDB,MongoGraph — Patrick Durusau @ 8:05 pm

MongoGraph One Ups MongoDB With Semantic Power by Jennifer Zaino.

From the post:

But Franz Inc. proposes an alternative for those who want more sophisticated functionality: Use the semantic power of its AllegroGraph Web 3.0 database to deal with complicated queries, via MongoGraph, a MongoDB API to AllegroGraph technology.

So, MongoGraph “One Ups” MongoDB by copying their API?

If MongoDB is as difficult to use as the article implies, wouldn’t that copying be going the other way?

Heard of anyone copying the Franz API lately?

Certainly not MongoDB. 😉

PS: As MongoDB points out at http://www.mongodb.org/display/DOCS/MongoDB+Data+Modeling+and+Rails, there are things that other technologies do better than MongoDB. (shrugs) That is true for all technologies. At least MongoDB is up front about it.

Fractal Tree Indexes and Mead – MySQL Meetup

Filed under: Fractal Trees,MySQL — Patrick Durusau @ 8:04 pm

Fractal Tree Indexes and Mead – MySQL Meetup

From the post:

As a brief overview – most databases employ B-trees to achieve a good tradeoff between the ability to update data quickly and to search it quickly. It turns out that B-trees are far from the optimum in this tradeoff space. This led to the development at MIT, Rutgers and Stony Brook of Fractal Tree indexes. Fractal Tree indexes improve MySQL® scalability and query performance by allowing greater insertion rates, supporting rich indexing and offering efficient compression. They can also eliminate operational headaches such as dump/reloads, inflexible schemas and partitions.

The presentation provides an overview on how Fractal Tree indexes work, and then gets into some specific product features, benchmarks, and customer use cases that show where people have deployed Fractal Tree indexes via the TokuDB® storage engine.

Whether you are just browsing or seriously looking for better performance, I think you will like this presentation.

Performance of data stores is an issue for topic maps whether you store a final “merged” result or simply present “merged” results to users.

Monthly Twitter activity for all members of the U.S. Congress

Filed under: Data Source,Government Data,Tweets — Patrick Durusau @ 8:04 pm

Monthly Twitter activity for all members of the U.S. Congress by Drew Conway.

From the post:

Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.

Looks like an interesting data set to match up to the ages of addresses, doesn’t it?

Social Networks and Archival Context Project (SNAC)

Filed under: Archives,Networks,Social Graphs,Social Networks — Patrick Durusau @ 8:03 pm

Social Networks and Archival Context Project (SNAC)

From the homepage:

The Social Networks and Archival Context Project (SNAC) will address the ongoing challenge of transforming description of and improving access to primary humanities resources through the use of advanced technologies. The project will test the feasibility of using existing archival descriptions in new ways, in order to enhance access and understanding of cultural resources in archives, libraries, and museums.

Archivists have a long history of describing the people who—acting individually, in families, or in formally organized groups—create and collect primary sources. They research and describe the people who create and are represented in the materials comprising our shared cultural legacy. However, because archivists have traditionally described records and their creators together, this information is tied to specific resources and institutions. Currently there is no system in place that aggregates and interrelates those descriptions.

Leveraging the new standard Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), the SNAC Project will use digital technology to “unlock” descriptions of people from finding aids and link them together in exciting new ways.

On the Prototype page you will find the following description:

While many of the names found in finding aids have been carefully constructed, frequently in consultation with LCNAF, many other names present extraction and matching challenges. For example, many personal names are in direct rather than indirect (or catalog entry) order. Life dates, if present, sometimes appear in parentheses or brackets. Numerous names sometimes appear in the same <persname>, <corpname>, or <famname>. Many names are incorrectly tagged, for example, a personal name tagged as a .

We will continue to refine the extraction and matching algorithms over the course of the project, but it is anticipated that it will only be possible to address some problems through manual editing, perhaps using “professional crowd sourcing.”

While the project is still a prototype, it occurs to me that it would make a handy source of identifiers.

Try:

Or one of the many others you will find at: Find Corporate, Personal, and Family Archival Context Records.

OK, now I have a question for you: All of the foregoing also appear in Wikipedia.

For your comparison:

If you could choose only one identifier for a subject, would you choose the SNAC or the Wikipedia links?

I ask because some semantic approaches take a “one ring” approach to identification, ignoring the existence of multiple identifiers, even URL identifiers, for the same subjects.

Of course, you already know that with topic maps you can have multiple identifiers for any subject.

In CTM syntax:

bush-vannevar
    http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml ;
    http://en.wikipedia.org/wiki/Vannevar_Bush ;
    - "Vannevar Bush" ;
    - varname: "Bush, Vannevar, 1890-1974" ;
    - varname: "Bush, Vannevar, 1890-" .

Which of course means that if I want to make a statement about the webpage for Vannevar Bush at Wikipedia, I can do so without any confusion:

wikipedia-vannevar-bush
    = http://en.wikipedia.org/wiki/Vannevar_Bush ;
    descr: "URL as subject locator." .

Or I can comment on a page at SNAC and map additional information to it. And you will always know if I am using the URL as an identifier or to point you towards a subject.
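To make the multiple-identifier point concrete, here is a rough sketch of identifier-based merging in Python (my own illustration, not CTM and not any particular engine’s API):

    # A topic can carry several subject identifiers; two topics merge when any
    # identifier is shared. Subject locators are kept separate so "the page
    # about Bush" is never confused with "Bush himself".

    class Topic:
        def __init__(self, name, identifiers=(), locators=()):
            self.names = {name}
            self.identifiers = set(identifiers)   # identify the subject
            self.locators = set(locators)         # address an information resource

        def should_merge(self, other):
            return bool(self.identifiers & other.identifiers)

        def merge(self, other):
            self.names |= other.names
            self.identifiers |= other.identifiers
            self.locators |= other.locators

    snac = Topic("Vannevar Bush",
                 identifiers={"http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml"})
    wiki = Topic("Bush, Vannevar, 1890-1974",
                 identifiers={"http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml",
                              "http://en.wikipedia.org/wiki/Vannevar_Bush"})

    if snac.should_merge(wiki):
        snac.merge(wiki)

    print(snac.names)        # both names survive the merge
    print(snac.identifiers)  # both identifiers survive too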

Bio4j release 0.7 is out !

Filed under: Bioinformatics,Biomedical,Cypher,Graphs,Gremlin,Medical Informatics,Visualization — Patrick Durusau @ 8:02 pm

Bio4j release 0.7 is out !

A quick list of the new features:

  • Expasy Enzyme database integration
  • Node type indexing
  • Amazon web services Availability in all Regions
  • New CloudFormation templates
  • Bio4j REST server
  • Explore your database with the Data browser
  • Run queries with Cypher
  • Querying Bio4j with Gremlin

Wait! Did I say Cypher and Gremlin!?

Looks like this graph querying stuff is spreading. 🙂

Even if you are not working in bioinformatics, Bio4j is worth more than a quick look.

January 10, 2012

Oracle: “Open Source isn’t all that weird” (Cloudera)

Filed under: Cloudera,Hadoop,Oracle — Patrick Durusau @ 8:12 pm

OK, maybe that’s not an exact word-for-word quotation. 😉

Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance

Ed Albanese (Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.) writes:

Summary: Oracle has selected Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager software as core technologies on the Oracle Big Data Appliance, a high performance “engineered system.” Oracle and Cloudera announced a multiyear agreement to provide CDH, Cloudera Manager, and support services in conjunction with Oracle Support for use on the Oracle Big Data Appliance.

Announced at Oracle Open World in October 2011, the Big Data Appliance was received with significant market interest. Oracle reported then that it would be released in the first half of 2012. Just 10 days into that period, Oracle has announced that the Big Data Appliance is available immediately.

The product itself is noteworthy. Oracle has combined Oracle hardware and software innovations with Cloudera technology to deliver what it calls an “engineered system.” Oracle has created several such systems over the past few years, including the Exadata, Exalogic, and Exalytics products. The Big Data Appliance combines Apache Hadoop with a purpose-built hardware platform and software that includes platform components such as Linux and Java, as well as data management technologies such as the Oracle NoSql database and Oracle integration software.

Read the post to get Ed’s take on what this will mean for both Cloudera and Oracle customers (positive).

I’m glad for Cloudera but also take this as validation of the overall Hadoop ecosystem. Not that it is appropriate for every application but where it is, it deserves serious consideration.

The Semantic Web & the Right to be Forgotten (there is a business case – read on)

Filed under: Document Retention,Semantic Web — Patrick Durusau @ 8:10 pm

The Semantic Web & the Right to be Forgotten by Angela Guess.

From the post:

Dr. Kieron O’Hara has examined how the semantic web might be used to implement a so-called ‘right to be forgotten.’ O’Hara writes, “During the revision of the EU’s data protection directive, attention has focused on a ‘right to be forgotten’. Though the discussion has been largely confined to the legal profession, and has been overlooked by technologists, it does raise technical issues – UK minister Ed Vaizey, and the UK’s Information Commissioner’s Office have pointed out that rights are only meaningful when they can be enforced and implemented (Out-law.com 2011, ICO 2011). In this article, I look at how such a right might be interpreted and whether it could be enforced using the specific technology of the Semantic Web or the Linked Data Web.”

O’Hara continues, “Currently, the Semantic Web and the Linked Data Web approach access control via licences and waivers. In many cases, those who wish to gain the benefits of linking are keen for their data to be used and linked, and so are happy to invite access. Copyrightable content can be governed by Creative Commons licences, requiring the addition of a single RDF triple to the metadata. With other types of data, controllers use waivers, and for that purpose a waiver vocabulary, http://vocab.org/waiver/terms/.html, has been created.”

The case Dr. O’Hara is concerned with is:

the right of individuals to have their data no longer processed and deleted when they are no longer needed for legitimate purposes. This is the case, for example, when processing is based on the person’s consent and when he or she withdraws consent or when the storage period has expired.

As you would expect, the EU completely overlooks the business case for forgetting: it’s called document retention. Major corporations have established policies for how long materials have to be retained and procedures to be followed for their destruction.

Transpose that into a topic maps or semantic web context where you have links into those materials. Perhaps links that you don’t control.

So, what is your policy about “forgetting” by erasure of links to documents that no longer exist?

Or do you have a policy about the creation of links to documents? And if so, how do you track them? Or even monitor the enforcement of the policy?

It occurs to me that if you used enterprise search software, you could create topics that represent documents that are being linked to by other documents. Topics that could carry the same destruction date information as your other information systems.

Interesting uses suggest themselves. Upon destruction of some documents, you could visualize whether odd or inconvenient holes in the document record are going to be created by a widely linked record’s destruction.

Depending on the skill of your indexing and document recognition, you could even uncover non-hyperlink references between documents. And perform the same analysis.

Or, you could wait until some enterprising lawyer who isn’t representing your interest decides to perform the same analysis.

Your call.
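To make the suggestion concrete: once the links have been extracted, finding the links that will dangle after a retention run is only a few lines of code. A minimal sketch with hypothetical documents, dates and links:

    # Give each document topic a destruction date, record links between
    # documents, and ask which links will dangle once retention policy has
    # run on a given date. All data below is invented for illustration.
    from datetime import date

    destruction = {
        "memo-17":    date(2012, 6, 30),
        "contract-3": date(2020, 1, 1),
        "report-9":   date(2013, 3, 15),
    }

    links = [            # (linking document, linked-to document)
        ("contract-3", "memo-17"),
        ("report-9", "memo-17"),
        ("contract-3", "report-9"),
    ]

    def dangling_links(on):
        """Links whose target is destroyed on or before `on` but whose source survives."""
        return [(src, dst) for src, dst in links
                if destruction[dst] <= on < destruction[src]]

    print(dangling_links(date(2012, 12, 31)))
    # [('contract-3', 'memo-17'), ('report-9', 'memo-17')]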

Another way to think about geeks and repetitive tasks

Filed under: Marketing,Semantic Diversity,Semantics — Patrick Durusau @ 8:09 pm

Another way to think about geeks and repetitive tasks

John Udell writes:

The other day Tim Bray tweeted a Google+ item entitled Geeks and repetitive tasks along with the comment: “Geeks win, eventually.”

…(material omitted)

In geek ideology the oppressors are pointy-haired bosses and clueless users. Geeks believe (correctly) that clueless users can’t imagine, never mind implement, automated improvements to repetitive manual chores. The chart divides the world into geeks and non-geeks, and it portrays software-assisted process improvement as a contest that geeks eventually win. This Manichean worldview is unhelpful.

I have no doubt that John’s conclusion:

Software-assisted automation of repetitive work isn’t an event, it’s a process. And if you see it as a contest with winners and losers you are, in my view, doing it wrong.

is the right one but I think it misses an important insight.

That “geeks” and their “oppressors” view the world with very different semantics. If neither one tries to communicate those semantics to the other, then software will continue to fail to meet the needs of its users. An unhappy picture for all concerned, geeks as well as their oppressors.

Being semantics, there is no “right” or “wrong” semantic.

True enough, the semantics of geeks works better with computers but if that fails to map in some meaningful way to the semantics of their oppressors, what’s the point?

Geeks can write highly efficient software for tasks but if the tasks aren’t something anyone is willing to pay for or even use, what’s the point?

Users and geeks need to both remember that communication is a two-way street. The only way for it to fail completely is for either side to stop trying to communicate with the other.

Have no doubt, I have experienced the annoyance of trying to convince a geek that the particular way they have written software has little to no bearing on some user request. (The case in point was a UI where the geek had decided on a “better” means of data entry. The users, who were going to be working with the data, thought otherwise. I heard the refrain, “…if they would just use it they would get used to it.” Of course, the geek had written the interface without asking the users first.)

To be fair, users have to be willing to understand there are limitations on what can be requested.

And users failing to provide written and detailed requirements for all aspects of a request is almost a guarantee that the software result isn’t going to satisfy anyone.

Written requirements are where semantic understandings, mis-understandings and clashes can be made visible, resolved (hopefully) and documented. Burdensome, annoying, non-productive in the view of geeks who want to get to coding, but absolutely necessary in any sane software development environment.

That is to say any software environment that is going to represent a happy (well, workable) marriage of the semantics of geeks and users.

Vanity, vanity, all is vanity…

Filed under: Marketing — Patrick Durusau @ 8:07 pm

Manning Publications has Big Data: Principles and best practices of scalable realtime data systems by Nathan Marz and Samuel E. Ritchie out in an EARLY ACCESS EDITION. The book is due out in the summer of 2012.

You can order today, either in paper or ebook formats + “MEAP” (Manning Early Access Program). And, for that you get early access to the content and are invited to provide feedback to the author.

Used to, publishers paid for editors. Now editors are paying for the privilege of commenting. That’s a pretty good trick.

Who among us isn’t vain enough to “need” early access to a new book in our field?

Who among us isn’t vain enough to have a “contribution” to make to a new book in our field?

Vanity has a cost, we pay ahead of time for the MEAP edition and we contribute our expertise to the final work.

I don’t object to this model, in fact I think other publishers, who will go nameless, could benefit from something quite similar.

If you think about it, this is quite similar to the motivational model used by Wikipedia to solicit contributions.

Except they have not stumbled upon the notion of paying to contribute to it. A yearly charge for the privilege of submitting (not necessarily accepted) edits, and the ensuing competition as articles in Wikipedia improve, would ensure its existence for the foreseeable future. If you know anyone in the inner circle at Wikipedia, please feel free to make that suggestion.

I mention the Manning/Vanity model because I think it is one that topic maps, public ones at any rate, should consider. You are always going to need more editors than you can afford to pay for and a topic map of any size, see the example of Wikipedia, is going to need ongoing maintenance and support. Unless you are going to sell subscriptions or otherwise limit access, you need another income model.

Taking a page from the Manning book and starting from the presumption that people are vain enough to pay to contribute and/or see their names with “other” experts, I think a yearly editing/contribution fee might be the way to go. After all, someone with less expertise might say something wrong that needs correction, so there would be an incentive to keep up editing/contributing privileges.

I would not take on established prestige venues where publication counts for promotion, at least not just yet. Think of alternative delivery or subject areas.

Some quick examples:

  • Book Reviews to cellphones – Local reviews while you are in the stacks.
  • Citizen Crime Reports – The stories w/locations before it hits the local news. A 1-900 number possibility?
  • Restaurant Reviews – These are already appearing on cellphones but think of this as more of a filtered Craigslist.

The traditional information venues aren’t going away and it is better to take them on from a strong base. Think of Netflix: an alternative delivery mechanism, offering convenience that traditional channels were slow to follow. Now we’ll have to see what Netflix decides to do with that power.

Proceedings…Information Heterogeneity and Fusion in Recommender Systems

Filed under: Conferences,Heterogeneous Data,Recommendation — Patrick Durusau @ 8:05 pm

Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems

I am still working on the proceedings for the main conference but thought these might be of interest:

  • Information market based recommender systems fusion
    Efthimios Bothos, Konstantinos Christidis, Dimitris Apostolou, Gregoris Mentzas
    Pages: 1-8
    doi: 10.1145/2039320.2039321
  • A kernel-based approach to exploiting interaction-networks in heterogeneous information sources for improved recommender systems
    Oluwasanmi Koyejo, Joydeep Ghosh
    Pages: 9-16
    doi: 10.1145/2039320.2039322
  • Learning multiple models for exploiting predictive heterogeneity in recommender systems
    Clinton Jones, Joydeep Ghosh, Aayush Sharma
    Pages: 17-24
    doi: 10.1145/2039320.2039323
  • A generic semantic-based framework for cross-domain recommendation
    Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, Francesco Ricci
    Pages: 25-32
    doi: 10.1145/2039320.2039324
  • Hybrid algorithms for recommending new items
    Paolo Cremonesi, Roberto Turrin, Fabio Airoldi
    Pages: 33-40
    doi: 10.1145/2039320.2039325
  • Expert recommendation based on social drivers, social network analysis, and semantic data representation
    Maryam Fazel-Zarandi, Hugh J. Devlin, Yun Huang, Noshir Contractor
    Pages: 41-48
    doi: 10.1145/2039320.2039326
  • Experience Discovery: hybrid recommendation of student activities using social network data
    Robin Burke, Yong Zheng, Scott Riley
    Pages: 49-52
    doi: 10.1145/2039320.2039327
  • Personalizing tags: a folksonomy-like approach for recommending movies
    Alan Said, Benjamin Kille, Ernesto W. De Luca, Sahin Albayrak
    Pages: 53-56
    doi: 10.1145/2039320.2039328
  • Personalized pricing recommender system: multi-stage epsilon-greedy approach
    Toshihiro Kamishima, Shotaro Akaho
    Pages: 57-64
    doi: 10.1145/2039320.2039329
  • Matrix co-factorization for recommendation with rich side information and implicit feedback
    Yi Fang, Luo Si
    Pages: 65-69
    doi: 10.1145/2039320.2039330

An Application Driven Analysis of the ParalleX Execution Model (here be graph’s mention)

Filed under: Graphs,HPC,ParalleX — Patrick Durusau @ 8:03 pm

An Application Driven Analysis of the ParalleX Execution Model by Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser and Thomas Sterling.

Just in case you feel the need for more information about ParalleX after that post about the LSU software release. 😉

Abstract:

Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of parallel applications. It is suggested that the root cause of these challenges is the incoherent coupling between the current enabling technologies, such as Non-Uniform Memory Access of present multicore nodes equipped with optional hardware accelerators and the decades older execution model, i.e., the Communicating Sequential Processes (CSP) model best exemplified by the message passing interface (MPI) application programming interface. A new execution model, ParalleX, is introduced as an alternative to the CSP model. In this paper, an overview of the ParalleX execution model is presented along with details about a ParalleX-compliant runtime system implementation called High Performance ParalleX (HPX). Scaling and performance results for an adaptive mesh refinement numerical relativity application developed using HPX are discussed. The performance results of this HPX-based application are compared with a counterpart MPI-based mesh refinement code. The overheads associated with HPX are explored and hardware solutions are introduced for accelerating the runtime system.

Graphaholics should also note:

Today’s conventional parallel programming methods such as MPI [1] and systems such as distributed memory massively parallel processors (MPPs) and Linux clusters exhibit poor efficiency and constrained scalability for this class of applications. This severely hinders scientific advancement. Many other classes of applications exhibit similar properties, especially graph/tree data structures that have non uniform data access patterns. (emphasis added)

I like that, “non uniform data access patterns.”

My “gut” feeling is that this will prove very useful for processing semantics. Since semantics originate from us and have “non uniform data access patterns.”

Granted a lot of work between here and there, especially since the semantics side of the house is fond of declaring victory in favor of the latest solution.

You would think after years, decades, centuries, no, millennia of one “ultimate” solution after another, we would be a little more wary of such pronouncements. I suspect the problem is that programmers come by their proverbial laziness honestly. They get it from us. It is easier to just fall into line with whatever seems like a passable solution and to not worry about all the passable solutions that went before.

That is no doubt easier but imagine where medicine, chemistry, physics, or even computers would be if they had adopted such a model. True, we have to use models that work now, but at the same time we should encourage new, different, even challenging models that may (or may not) be better at capturing human semantics. Models that change even as we do.

LSU Releases First Open Source ParalleX Runtime Software System

Filed under: HPC,ParalleX — Patrick Durusau @ 8:01 pm

LSU Releases First Open Source ParalleX Runtime Software System

From the press release:

Louisiana State University’s Center for Computation & Technology (CCT) has delivered the first freely available open-source runtime system implementation of the ParalleX execution model. The HPX, or High Performance ParalleX, runtime software package is a modular, feature-complete, and performance oriented representation of the ParalleX execution model targeted at conventional parallel computing architectures such as SMP nodes and commodity clusters.

HPX is being provided to the open community for experimentation and application to achieve high efficiency and scalability for dynamic adaptive and irregular computational problems. HPX is a library of C++ functions that supports a set of critical mechanisms for dynamic adaptive resource management and lightweight task scheduling within the context of a global address space. It is solidly based on many years of experience in writing highly parallel applications for HPC systems.

The two-decade success of the communicating sequential processes (CSP) execution model and its message passing interface (MPI) programming model has been seriously eroded by challenges of power, processor core complexity, multi-core sockets, and heterogeneous structures of GPUs. Both efficiency and scalability for some current (strong scaled) applications and future Exascale applications demand new techniques to expose new sources of algorithm parallelism and exploit unused resources through adaptive use of runtime information.

The ParalleX execution model replaces CSP to provide a new computing paradigm embodying the governing principles for organizing and conducting highly efficient scalable computations greatly exceeding the capabilities of today’s problems. HPX is the first practical, reliable, and performance-oriented runtime system incorporating the principal concepts of ParalleX model publicly provided in open source release form.

Aggregation and Restructuring data (from “R in Action”)

Filed under: Aggregation,R — Patrick Durusau @ 8:00 pm

Aggregation and Restructuring data (from “R in Action”) by Dr. Robert I. Kabacoff.

From the post:

R provides a number of powerful methods for aggregating and reshaping data. When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This article describes a variety of methods for accomplishing these tasks.

We’ll use the mtcars data frame that’s included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 34 automobiles. To learn more about the dataset, see help(mtcars).

How do you recognize what data you want to aggregate or transpose?

Or communicate that knowledge to future users?

The data set for Motor Trend magazine is an easy one.

If you have access to the electronic text for Motor Trend magazine (one or more issues) for 1974, drop me a line. I am thinking of a way to illustrate the “semantic” problem.
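The post is about R, but for anyone coming from Python the same two operations look like this. A rough analogue using pandas (assuming a CSV export of mtcars with the usual column names):

    # Not R, but an analogous sketch in Python/pandas.
    import pandas as pd

    mtcars = pd.read_csv("mtcars.csv")

    # Aggregation: replace groups of observations with summary statistics,
    # here the mean mpg and horsepower per cylinder count, the same kind of
    # step aggregate() performs in R.
    summary = mtcars.groupby("cyl")[["mpg", "hp"]].mean()
    print(summary)

    # Reshaping: melt into long format, one (car, variable, value) row per
    # measurement, which is what the reshape packages do in R.
    long_form = mtcars.reset_index().melt(id_vars="index",
                                          value_vars=["mpg", "hp", "wt"])
    print(long_form.head())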

Stanford Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 7:59 pm

Stanford Machine Learning by Alex Holehouse.

From the webpage:

The following notes represent a complete, stand alone interpretation of Stanford’s machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. The topics covered are shown below, although for a more detailed summary see lecture 19. The only content not covered here is the Octave/MATLAB programming.

All diagrams are my own or are directly taken from the lectures, full credit to Professor Ng for a truly exceptional lecture course.

The (Real) Semantic Web Requires Machine Learning

Filed under: Machine Learning,Semantic Web — Patrick Durusau @ 7:57 pm

The (Real) Semantic Web Requires Machine Learning by John O’Neil.

From the longer quote below:

…different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged?

And the same is true for relationships between entities.(full stop)

The author thinks statistical analysis will be able to distinguish both entities and relationships between them, which I am sure will be true to some degree.

I would characterize that as a topic map authoring aid but it would also be possible to simply accept the statistical results.

It is refreshing to see someone recognize the “semantic web” is the one created by users and not as dictated by other authorities.

From the post:

We think about the semantic web in two complementary (and equivalent) ways. It can be viewed as:

  • A large set of subject-verb-object triples, where the verb is a relation and the subject and object are entities

OR

  • As a large graph or network, where the nodes of the graph are entities and the graph’s directed edges or arrows are the relations between nodes.

As a reminder, entities are proper names, like people, places, companies, and so on. Relations are meaningful events, outcomes or states, like BORN-IN, WORKS-FOR, MARRIED-TO, and so on. Each entity (like “John O’Neil”, “Attivio” or “Newton, MA”) has a type (like “PERSON”, “COMPANY” or “LOCATION”) and each relation is constrained to only accept certain types of entities. For example, WORKS-FOR may require a PERSON as the subject and a COMPANY as the object.

How semantic web information is organized and transmitted is described by a blizzard of technical standards and XML namespaces. Once you escape from that, the basic goals of the semantic web are (1) to allow a lot of useful information about the world to be simply expressed, in a way that (2) allows computers to do useful things with it.

Almost immediately, some problems crop up. As generations of artificial intelligence researchers have learned, it can be really difficult to encode real-world knowledge into predicate logic, which is more-or-less what the semantic web is. The same AI researchers also learned that different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged? For example, do an Internet search for “John O’Neil”, and try to decide which of the results refer to how many different people. Believe me, all the results are not for the same person.

As for relations, it’s difficult to tell when they really mean the same thing across different knowledge encodings. No matter how careful you are, if you want to use relations to infer new facts, you have few resources to check to see if the combined information is valid.

So, when each web site can define its own entities and relations, independently of any other web site, how do you reconcile entities and relations defined by different people?
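To keep the two views straight, here is a minimal sketch (mine, with invented data, not code from the post) of typed triples and the kind of type constraints O’Neil describes:

    # Entities have types, relations constrain the types they accept, and the
    # same triples can be read as a graph with entities as nodes.

    ENTITY_TYPES = {
        "John O'Neil": "PERSON",
        "Attivio": "COMPANY",
        "Newton, MA": "LOCATION",
    }

    RELATION_SIGNATURES = {
        "WORKS-FOR": ("PERSON", "COMPANY"),
        "BORN-IN":   ("PERSON", "LOCATION"),
    }

    triples = [
        ("John O'Neil", "WORKS-FOR", "Attivio"),
        ("John O'Neil", "BORN-IN", "Attivio"),   # violates the signature
    ]

    def valid(subject, verb, obj):
        subj_type, obj_type = RELATION_SIGNATURES[verb]
        return (ENTITY_TYPES.get(subject) == subj_type and
                ENTITY_TYPES.get(obj) == obj_type)

    for t in triples:
        print(t, "ok" if valid(*t) else "type violation")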

January 9, 2012

SIMI 2012 : Semantic Interoperability in Medical Informatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 1:48 pm

SIMI 2012 : Semantic Interoperability in Medical Informatics

Dates:

When May 27, 2012 – May 27, 2012
Where Heraklion (Crete), Greece
Submission Deadline Mar 4, 2012
Notification Due Apr 1, 2012
Final Version Due Apr 15, 2012

From the call for papers:

To gather data on potential application to new diseases and disorders is increasingly to be not only a means for evaluating the effectiveness of new medicine and pharmaceutical formulas but also for experimenting on existing drugs and their appliance to new diseases and disorders. Although the wealth of published non-clinical and clinical information is increasing rapidly, the overall number of new active substances undergoing regulatory review is gradually falling, whereas pharmaceutical companies tend to prefer launching modified versions of existing drugs, which present reduced risk of failure and can generate generous profits. In the meanwhile, market numbers depict the great difficulty faced by clinical trials in successfully translating basic research into effective therapies for the patients. In fact, success rates, from first dose in man in clinical trials to registration of the drug and release in the market, are only about 11% across indications. But, even if a treatment reaches the broad patient population through healthcare, it may prove not to be as effective and/or safe as indicated in the clinical research findings.

Within this context, bridging basic science to clinical practice comprises a new scientific challenge which can result in successful clinical applications with low financial cost. The efficacy of clinical trials, in combination with the mitigation of patients’ health risks, requires the pursuit of a number of aspects that need to be addressed ranging from the aggregation of data from various heterogeneous distributed sources (such as electronic health records – EHRs, disease and drug data sources, etc) to the intelligent processing of this data based on the study-specific requirements for choosing the “right” target population for the therapy and in the end selecting the patients eligible for recruitment.

Data collection poses a significant challenge for investigators, due to the non-interoperable heterogeneous distributed data sources involved in the life sciences domain. A great amount of medical information crucial to the success of a clinical trial could be hidden inside a variety of information systems that do not share the same semantics and/or structure or adhere to widely deployed clinical data standards. Especially in the case of EHRs, the wealth of information within them, which could provide important information and allow of knowledge enrichment in the clinical trial domain (during test of hypothesis generation and study design) as well as act as a fast and reliable bridge between study requirements for recruitment and patients who would like to participate in them, still remains unlinked from the clinical trial lifecycle posing restrictions in the overall process. In addition, methods for efficient literature search and hypothesis validation are needed, so that principal investigators can research efficiently on new clinical trial cases.

The goal of the proposed workshop is to foster exchange of ideas and offer a suitable forum for discussions among researchers and developers on great challenges that are posed in the effort of combining information underlying the large number of heterogeneous data sources and knowledge bases in life sciences, including:

  • Strong multi-level (semantic, structural, syntactic, interface) heterogeneity issues in clinical research and healthcare domains
  • Semantic interoperability both at schema and data/instance level
  • Handling of unstructured information, i.e., literature articles
  • Reasoning on the wealth of existing data (published findings, background knowledge on diseases, drugs, targets, Electronic Health Records) to boost and enhance clinical research and clinical care processes
  • Acquisition/extraction of new knowledge from published information and Electronic Health Records
  • Enhanced matching between clinicians’ as well as patients’ needs and available informational content

Apologies for the length of the quote, but this is a tough nut that simply saying “topic maps” isn’t going to solve. As described above, there is a set of domains, each with its own information gathering, processing and storage practices, none of which are going to change rapidly, or consistently.

Although I think topic maps can play a role in solving this sort of issue, it will be by being the “integration rain drop” that starts with some obvious integration issue and solves it, and only it, without trying to be a solution for every issue or requirement. Having solved one, it then spreads out to solve another.

The key is going to be the delivery of clear and practical advantages in concrete situations.

One approach could be to identify current semantic integration efforts (which tend to have global aspirations) and effect semantic mappings between those solutions. That has the advantage of allowing the advocates of those systems to continue, while a topic map offers other systems an integration of data from those parts.

Functional Programming with Python – Part 1

Filed under: Functional Programming,Python — Patrick Durusau @ 1:46 pm

Functional Programming with Python – Part 1 by Dhananjay Nene.

From the post:

Lately there has been a substantial increase in interest and activity in Functional Programming. Functional Programming is sufficiently different from the conventional mainstream programming style called Imperative Programming to warrant some discussion on what it is, before we delve into the specifics of how it can be used in Python.

And why Python? The author answers:

Python is not the best functional programming language. But it was not meant to be. Python is a multi paradigm language. Want to write good old ‘C’ style procedural code? Python will do it for you. C++/Java style object oriented code? Python is at your service as well. Functional Programming ? As this series of posts is about to demonstrate – Python can do a decent job at it as well. Python is probably the most productive language I have worked with (across a variety of different types of programming requirements). Add to that the fact that python is a language thats extremely easy to learn, suffers from excellent readability, has fairly good web frameworks such as django, has excellent mathematical and statistical libraries such as numpy, and cool network oriented frameworks such as twisted. Python may not be the right choice if you want to write 100% FP. But if you want to learn more of FP or use FP techniques along with other paradigms Python’s capabilities are screaming to be heard.

He forgot to mention that more people know Python than any of the various “functional” programming languages. That increases the potential audience for an article on the advantages of functional programming. 😉
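For a concrete taste of the style before you dive into the series, here is a tiny example (mine, not from the article): small pure functions combined with map, filter and reduce instead of a loop that mutates an accumulator.

    from functools import reduce

    def square(x):
        return x * x

    def is_odd(x):
        return x % 2 == 1

    numbers = range(1, 11)

    # Sum of the squares of the odd numbers, with no mutable state in sight.
    result = reduce(lambda acc, x: acc + x,
                    map(square, filter(is_odd, numbers)),
                    0)
    print(result)   # 165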

Triggers in MySQL

Filed under: MySQL,SQL,Triggers — Patrick Durusau @ 1:44 pm

Triggers in MySQL

From the post:

Almost all developers have heard about triggers, and everyone knows that MySQL supports triggers and that triggers add advantages to MySQL. Triggers are SQL statements that are stored in the database.

Triggers are SQL statements which add functionality to your tables so that they perform a certain series of actions when particular queries are executed. Put simply, triggers are actions performed when INSERT, UPDATE or DELETE events occur on a table, without the need for two separate queries.

Sometimes developers prefer to use stored procedures rather than triggers, but a trigger is a kind of stored procedure that contains procedural code in its body. The difference between a trigger and a stored procedure is that a trigger is called when an event occurs in a table, whereas a stored procedure must be called explicitly.

Short overview of triggers in MySQL.

Possibly useful if you are using a relational backend for your topic map engine.
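As a concrete illustration of the quoted description, here is a minimal sketch of creating an audit trigger from Python via MySQL Connector/Python. The table, columns and connection details are all hypothetical:

    # An audit trigger that records every UPDATE on a hypothetical topics table.
    import mysql.connector

    TRIGGER_SQL = """
    CREATE TRIGGER topic_audit
    AFTER UPDATE ON topics
    FOR EACH ROW
    INSERT INTO topic_audit_log (topic_id, old_name, new_name, changed_at)
    VALUES (OLD.id, OLD.name, NEW.name, NOW())
    """

    conn = mysql.connector.connect(host="localhost", user="tm", password="secret",
                                   database="topicmap")
    cur = conn.cursor()
    cur.execute(TRIGGER_SQL)   # the trigger now fires on every UPDATE of topics
    conn.commit()
    conn.close()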

Should topic map engines support the equivalent of triggers?

As declared by a topic map?

Now that would be clever, to have a topic map carry its triggers around with it.

Admittedly, interactive data structures aren’t the norm, yet, but they are certainly worth thinking about.

The 2015 Digital Marketing Rule Book. Change or Perish.

Filed under: Advertising,Web Analytics — Patrick Durusau @ 1:42 pm

The 2015 Digital Marketing Rule Book. Change or Perish.

Avinash Kaushik writes:

It is the season to be predicting the future, but that is almost always a career-limiting move. So I’m not going to do that.

It is a lot easier to predict the present. So I’m not going to do that either.

Rather, I’m going to share a clump of realities/rules garnered from the present to help ready you for the predictable near future. Now here is the great part… if you follow these rules and act on these insights I believe you’ll be significantly better prepared for the unpredictable future.

Awesome right?

Now here’s another surprise: These rules/insights/mind shifts are not about data!

He covers a lot of interesting ground to conclude:

Do you agree with my learning that our primary problem is not web analytics/data but, rather, it is unimaginative web strategies?

My “take away” was much earlier in his post:

All while constantly optimizing your portfolio via controlled experiments…

For me the primary problem is two-fold:

  • web analytics/data as understood by management (not the users they are trying to reach), and
  • unimaginative web strategies

How can you have an imaginative or even intelligible web strategy unless and until you understand user behavior or their understanding of the data?

See my post on testing relevance tuning with the top ten actresses for 2011 as an example of questioning web analytics.

Fusion-io passes one billion IOPS barrier thanks to better software, not hardware

Filed under: I/O,IOPS — Patrick Durusau @ 1:41 pm

Fusion-io passes one billion IOPS barrier thanks to better software, not hardware

From the post:

At the DEMO Enterprise Disruption event yesterday, Fusion-io had a big announcement — it’s broken the one billion IOPS mark, having reached one million less than two years ago. IOPS are Input / Output Operations per second, a measure of computer storage access speeds based on the number of read / write operations that can be completed per second.

You will see a comment in the post that lower latency is “…crucial for the cloud-based world we’re heading towards.”

I suppose but that presumes the other components in a “cloud-based world” (or ones closer to the ground) are capable of taking advantage of one billion IOPS performance. Both in terms of storing as well as reading data.

Unused capacity is like a car that will go 160 MPH. Looks good on the sales sticker, not much use in Atlanta traffic.

At least with computer systems there is some hope for components that can handle the increased speed.

Overall systems speedups will improve topic map response/processing times. Would be good to see research on topic map processing per se.
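To put the headline number in perspective, a quick back-of-the-envelope calculation (my own arithmetic, with an assumed 4 KB operation size that the article does not state):

    iops = 1_000_000_000
    op_size_bytes = 4 * 1024

    bytes_per_second = iops * op_size_bytes
    print(f"{bytes_per_second / 1e12:.1f} TB/s of data movement")   # ~4.1 TB/s

    # Average time budget per operation if they were handled serially:
    print(f"{1 / iops * 1e9:.1f} ns per operation")                 # 1.0 ns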

First seen at myNoSQL.
