Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 19, 2012

Apache HBase 0.90.6 is now available

Filed under: HBase — Patrick Durusau @ 6:54 pm

Apache HBase 0.90.6 is now available

Jimmy Xiang writes:

Apache HBase 0.90.6 is now available. It is a bug fix release covering 31 bugs and 5 improvements. Among them, 3 are blockers and 3 are critical, such as:

  • HBASE-5008: HBase can not provide services to a region when it can’t flush the region, but considers it stuck in flushing,
  • HBASE-4773: HBaseAdmin may leak ZooKeeper connections,
  • HBASE-5060: HBase client may be blocked forever when there is a temporary network failure.

This release has improved system robustness and availability by fixing bugs that cause potential data loss, system unavailability, possible deadlocks, read inconsistencies and resource leakage.

The 0.90.6 release is backward compatible with 0.90.5. The fixes in this release will be included in CDH3u4.

ggplot posixct cheat sheet

Filed under: Ggplot2,Graphics — Patrick Durusau @ 6:54 pm

ggplot posixct cheat sheet

From the brain of Mat Kelcey, a cheatsheet of common plots using ggplot.

Any common plots in your toolbox?

Big-data Naive Bayes and Classification Trees with R and Netezza

Filed under: Bayesian Data Analysis,Classification Trees,Netezza,R — Patrick Durusau @ 6:54 pm

Big-data Naive Bayes and Classification Trees with R and Netezza

From the post:

The IBM Netezza analytics appliances combine high-capacity storage for Big Data with a massively-parallel processing platform for high-performance computing. With the addition of Revolution R Enterprise for IBM Netezza, you can use the power of the R language to build predictive models on Big Data.

In the demonstration below, Revolution Analytics’ Derek Norton analyzes loan approval data stored on the IBM appliance. You’ll see the R code used to:

  • Explore the raw data (with summary statistics and charts)
  • Prepare the data for statistical analysis, and create training and test sets
  • Create predictive models using classification trees and Naïve Bayes
  • Predict using the models, and evaluate model performance using confusion matrices

[embedded presentation omitted]

Note that while R code is being run on Derek’s laptop, the raw data is never moved from the appliance, and the analytic computations take place “in-database” within the appliance itself (where the Revolution R Enterprise engine is also running on each parallel core).

Another incentive for you to be learning R.

Does it sound to you like “Derek’s computer” is a terminal for entering instructions that are executed elsewhere? 😉 (If the computing fabric develops fast enough, we may lose the distinction of a “personal” computer. There will simply be computing.)

Meant to mention this the other day. Enjoy!
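
If you want a feel for the modeling steps themselves (train/test split, a classification tree, Naive Bayes, confusion matrices) before going anywhere near an appliance, here is a minimal local sketch in Python with scikit-learn rather than Revolution R Enterprise. The loan data is synthetic and the column choices are mine, purely for illustration.

```python
# Toy stand-in for the loan-approval workflow: train/test split,
# a classification tree, Naive Bayes, and confusion matrices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(50000, 15000, n),   # hypothetical applicant income
    rng.uniform(300, 850, n),      # hypothetical credit score
])
# Hypothetical approval rule plus 5% label noise, standing in for real data
y = ((X[:, 1] > 600) & (X[:, 0] > 35000)).astype(int)
flip = rng.random(n) < 0.05
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for model in (DecisionTreeClassifier(max_depth=4), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(confusion_matrix(y_test, model.predict(X_test)))
```

The point of the appliance setup, of course, is that the equivalent R code pushes these computations into the database instead of pulling the data out.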

A Parallel Architecture for In-Line Data De-duplication

Filed under: Deduplication,Parallel Programming — Patrick Durusau @ 6:54 pm

A Parallel Architecture for In-Line Data De-duplication by Seetendra Singh Sengar, Manoj Mishra. (2012 Second International Conference on Advanced Computing & Communication Technologies)

Abstract:

Recently, data de-duplication, the hot emerging technology, has received broad attention from both academia and industry. Some research focuses on approaches by which more redundant data can be reduced, while other work investigates how to do data de-duplication at high speed. In this paper, we show the importance of data de-duplication in the current digital world and aim at reducing the time and space requirement for data de-duplication. Then, we present a parallel architecture with one node designated as a server and multiple storage nodes. All the nodes, including the server, can do block level in-line de-duplication in parallel. We have built a prototype of the system and present some performance results. The proposed system uses magnetic disks as a storage technology.

Apologies but all I have at the moment is the abstract.
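
Until the full paper turns up, here is a toy sketch of what block-level, in-line de-duplication means in practice: store each unique block once, keyed by a fingerprint, and keep per-file lists of fingerprints. This is just the core bookkeeping in Python, not the parallel server/storage-node architecture the authors propose.

```python
# Toy block-level in-line de-duplication: store each unique block once,
# keyed by its fingerprint, and keep per-file lists of fingerprints.
import hashlib

BLOCK_SIZE = 4096
block_store = {}   # fingerprint -> block bytes (stored once)
file_index = {}    # filename -> list of fingerprints

def write_file(name, data):
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        block_store.setdefault(fp, block)   # duplicate blocks are not stored again
        recipe.append(fp)
    file_index[name] = recipe

def read_file(name):
    return b"".join(block_store[fp] for fp in file_index[name])

write_file("a.bin", b"hello world" * 2000)
write_file("b.bin", b"hello world" * 2000)   # identical content, no new blocks stored
print(len(block_store), "unique blocks stored for two files")
assert read_file("b.bin") == b"hello world" * 2000
```

Real systems also worry about hash collisions, the size of the fingerprint index, and disk layout; the paper's parallel architecture is about spreading exactly this bookkeeping across storage nodes.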

53 Books APIs: Google Books, Goodreads and SharedBook

Filed under: Books,Library — Patrick Durusau @ 6:54 pm

53 Books APIs: Google Books, Goodreads and SharedBook

Wendell Santos has posted, at ProgrammableWeb, a list of fifty-three (53) book APIs!

Fairly good listing but it could be better.

For example, it is missing the Springer API (http://dev.springer.com/). And although ProgrammableWeb doesn’t list Elsevier, marking http://www.programmableweb.com/api/elsevier-article as historical only, you should be aware that Elsevier does offer an extensive API, called SciVerse, at http://www.developers.elsevier.com/cms/index.

I am sure there are others. Any you would like to mention in particular?
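
As a taste of what these APIs look like in practice, here is a small Python sketch against the Google Books volumes search endpoint. As far as I know basic queries work without an API key, but check the current documentation before depending on that.

```python
# Query the Google Books volumes search endpoint and print a few titles.
import json
import urllib.parse
import urllib.request

query = urllib.parse.quote("topic maps")
url = "https://www.googleapis.com/books/v1/volumes?q=" + query

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for item in data.get("items", [])[:5]:
    info = item.get("volumeInfo", {})
    print(info.get("title"), "-", ", ".join(info.get("authors", [])))
```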

Now that I think about it, guess who doesn’t have a public API?

Would you believe the ACM? Check out the ACM Digital Library and tell me if I am wrong.

Or for that matter, the IEEE. See CS Digital Library.

Maybe they don’t have anyone to build an API for them? Please write the ACM and/or IEEE offering your services at your usual rates.

Intelligence Community (U.S.)

Filed under: Government,Intelligence — Patrick Durusau @ 6:53 pm

At the Intelligence and National Security Alliance (INSA) I ran across: Cloud Computing: Risks, Benefits, and Mission Enhancement for the Intelligence Community, which I thought might be of interest.

The document is important for learning the “lingo” being used to describe cloud computing in the intelligence community.

And to understand the grave misunderstandings of cloud computing in the intelligence community.

At page 7 you will find:

Within the IC, information is often the decisive discriminator. Studies of recent mission failures found that many were caused by:

  • The compartmentalization of information in data silos;
  • Weaknesses of the human-based document exploitation process; and
  • A reliance on “operationally proven” processes and filters typically used to address the lack of computational power or decision time.

In most of these cases, the critical piece of information necessary for mission success was already possessed. The failure was not in obtaining the information but in locating and applying it to the mission. Cloud computing can address such issues, as well as enabling multi-use intelligence. Cloud solutions can now be used to work on all of the data, all of the time. With the ability to leverage the power of a supercomputer at will, critical decision timelines can now be more easily met. (Emphasis added)

Hard to make that many mistakes in one passage, short of misspelling one’s own name.

Cloud computing cannot address the sharing of intelligence, or as the document says: “…work on all of the data, all of the time.” That is an utter and complete falsehood.

Intelligence sharing is possible with cloud computing, just as it is with file folders with sticky labels. But the mechanism of sharing has not enabled, cannot enable, and will not enable the sharing of intelligence or data in the intelligence community.

To say otherwise is to ignore the realities that produced the current culture of not sharing intelligence and data.

Sharing data and intelligence can only be accomplished by creating cultures, habits, social mechanisms that reward and promote the sharing of data and intelligence. Some of those can be represented or facilitated in information systems but it will be people who authorize, create and reward the use of those mechanisms.

So long as the NSA views the CIA (to just pick two agencies at random) as a leaky sieve, its staff are not going to take responsibility for initiating the sharing of information. Or even responding favorably to requests for information. You can pick any other pairing and get the same result.

Developing incentives, and ridding the relevant agencies of people who aren’t incentivized to share, will go much further toward promoting the sharing of intelligence than any particular technology solution.

If you start to pitch a topic map solution in the intelligence community, I would mention sharing but also that without incentives they won’t be making the highest and best use of your topic map solution.

Let’s Blame Powerpoint (or Rethinking Powerpoint:…)

Filed under: Graphics,Visualization — Patrick Durusau @ 6:53 pm

Let’s Blame Powerpoint (or Rethinking Powerpoint: The New Wave of Presentation Tools)

Gideon Hayden covers several potentially disruptive approaches but:

Faruk Ateş has been a public speaker for a few years now. He started off using Keynote but began to feel that sharing them was very difficult, and they could not deliver the story he was looking to tell. With this pain in mind, he sought out to create a tool that would solve both of these issues. Ateş and his co-founder created thepit.ch (not the company name, but a demo), a stealth startup that aims to provide the tools to create an amazing presentation, and the ability to share it while still getting the point across. According to their research, there are over 75 million desktop users of presentation software, and 20 million users of cloud-based presentation software, so they believe the potential market is huge.

caught my eye in terms of a potential market. Yes, the market is huge. +1!

The software featured in Gideon’s post, while answering important questions, fails to answer:

What makes a good narrator for a narrative?

If you are not a good narrator, no software will make you one.

There isn’t anything wrong with Powerpoint that having a gifted presenter would not cure.

For that matter, a gifted presenter can use chalk + blackboard, overhead slides, or electronic media.

A close friend of mine is a gifted presenter. He recorded and timed a Powerpoint presentation that included video of himself. And he “interacted” with the presentation. To the point that the audience was trying to answer questions looking at the presentation instead of the person in front of us. 😉 His point was that we get captured by video to the exclusion of real people.

The point is that we blame Powerpoint for our poor presentations when a mirror would do a much better job of isolating the point of failure.

Nothing against new tools and capabilities, but you can get as much, if not more, return by learning to be a good presenter.

Clever Algorithms: Statistical Machine Learning Recipes

Filed under: Algorithms,Programming — Patrick Durusau @ 6:52 pm

Clever Algorithms: Statistical Machine Learning Recipes by Jason Brownlee PhD. First Edition, Lulu Enterprises, [Expected mid 2012]. ISBN: xxx.

From the website cleveralgorithms.com:

Implementing Machine Learning algorithms is difficult. Algorithm descriptions may be incomplete, inconsistent, and distributed across a number of papers, chapters and even websites. This can result in varied interpretations of algorithms, undue attrition of algorithms, and ultimately bad science.

This book is an effort to address these issues by providing a handbook of algorithmic recipes drawn from the field of Machine Learning, described in a complete, consistent, and centralized manner. These standardized descriptions were carefully designed to be accessible, usable, and understandable.

An encyclopedic algorithm reference, this book is intended for research scientists, engineers, students, and interested amateurs. Each algorithm description provides a working code example in R.

Also see: Clever Algorithms: Nature-Inspired Programming Recipes.

Led to this by Experiments in Genetic Programming.

Experiments in genetic programming

Filed under: Algorithms,Authoring Topic Maps,Genetic Algorithms,Record Linkage — Patrick Durusau @ 6:52 pm

Experiments in genetic programming

Lars Marius Garshol writes:

I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

You need to read about the experiments in the post but I can almost hear Lars saying the conclusion:

The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.

😉

Excellent post and approach by the way!

Lars also posted a link to Reddit about his experiments. Several links appear in comments that I have turned into short posts to draw more attention to them.

Another tool for your topic mapping toolbox.
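
To give a flavor of the idea (this is my toy sketch, not Lars's Duke script), here is a tiny genetic algorithm in Python that evolves per-field weights and a match threshold for a made-up record comparator. The labeled pairs and the fitness function are stand-ins; Duke's configurations are far richer.

```python
# Tiny GA: evolve field weights and a threshold for a toy record matcher.
import random

random.seed(1)
FIELDS = ["name", "address", "phone"]

# Toy labeled pairs: per-field similarity scores plus a true match/non-match label.
PAIRS = [({"name": 0.9, "address": 0.8, "phone": 1.0}, True),
         ({"name": 0.9, "address": 0.1, "phone": 0.0}, False),
         ({"name": 0.4, "address": 0.9, "phone": 0.9}, True),
         ({"name": 0.5, "address": 0.4, "phone": 0.2}, False)]

def random_config():
    return {"threshold": random.random(),
            "weights": {f: random.random() for f in FIELDS}}

def fitness(cfg):
    correct = 0
    for sims, is_match in PAIRS:
        score = sum(cfg["weights"][f] * sims[f] for f in FIELDS) / sum(cfg["weights"].values())
        correct += (score >= cfg["threshold"]) == is_match
    return correct / len(PAIRS)

def mutate(cfg):
    return {"threshold": min(1, max(0, cfg["threshold"] + random.gauss(0, 0.1))),
            "weights": {f: min(1, max(0.01, w + random.gauss(0, 0.1)))
                        for f, w in cfg["weights"].items()}}

population = [random_config() for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:5]                      # keep the best configurations
    population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

best = max(population, key=fitness)
print("best accuracy:", fitness(best), best)
```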

Question: I wonder what it would look like to have the intermediate results used for mapping, only to be replaced as “better” mappings become available? That would have a terminating condition, but new content could trigger additional cycles, limited to what is relevant to that content.

Or would queries count as new content? If they expressed synonymy or other relations?

March 18, 2012

Drug data reveal sneaky side effects

Filed under: Bioinformatics,Biomedical,Knowledge Economics,Medical Informatics — Patrick Durusau @ 8:54 pm

Drug data reveal sneaky side effects

From the post:

An algorithm designed by US scientists to trawl through a plethora of drug interactions has yielded thousands of previously unknown side effects caused by taking drugs in combination.

The work, published today in Science Translational Medicine [Tatonetti, N. P., Ye, P. P., Daneshjou, R. and Altman, R. B. Sci. Transl. Med. 4, 125ra31 (2012).], provides a way to sort through the hundreds of thousands of ‘adverse events’ reported to the US Food and Drug Administration (FDA) each year. “It’s a step in the direction of a complete catalogue of drug–drug interactions,” says the study’s lead author, Russ Altman, a bioengineer at Stanford University in California.

From later in the post:

The team then used this method to compile a database of 1,332 drugs and possible side effects that were not listed on the labels for those drugs. The algorithm came up with an average of 329 previously unknown adverse events for each drug — far surpassing the average of 69 side effects listed on most drug labels.

Double trouble

The team also compiled a similar database looking at interactions between pairs of drugs, which yielded many more possible side effects than could be attributed to either drug alone. When the data were broken down by drug class, the most striking effect was seen when diuretics called thiazides, often prescribed to treat high blood pressure and oedema, were used in combination with a class of drugs called selective serotonin reuptake inhibitors, used to treat depression. Compared with people who used either drug alone, patients who used both drugs were significantly more likely to experience a heart condition known as prolonged QT, which is associated with an increased risk of irregular heartbeats and sudden death.

A search of electronic medical records from Stanford University Hospital confirmed the relationship between these two drug classes, revealing a roughly 1.5-fold increase in the likelihood of prolonged QT when the drugs were combined, compared to when either drug was taken alone. Altman says that the next step will be to test this finding further, possibly by conducting a clinical trial in which patients are given both drugs and then monitored for prolonged QT.
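
For anyone unsure what a “roughly 1.5-fold increase” means mechanically, it is just a rate ratio. The counts below are invented for illustration and are not from the study:

```python
# Hypothetical counts of prolonged-QT reports per 10,000 exposed patients.
combo_events, combo_patients = 150, 10_000    # thiazide + SSRI together (invented)
single_events, single_patients = 100, 10_000  # either drug alone (invented)

rate_ratio = (combo_events / combo_patients) / (single_events / single_patients)
print(rate_ratio)   # 1.5, i.e. a 1.5-fold increase
```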

This data could be marketed to drug companies, trial lawyers (both sides), medical malpractice insurers, etc. This is an example of the data marketing I mentioned in Knowledge Economics II.

Gisgraphy

Filed under: Geo Analytics,Geographic Data,Geographic Information Retrieval,Gisgraphy — Patrick Durusau @ 8:53 pm

Gisgraphy

From the website:

Gisgraphy is a free, open source framework that offers the possibility to do geolocalisation and geocoding via Java APIs or REST webservices. Because geocoding is nothing without data, it provides an easy to use importer that will automagically download and import the necessary (free) data to your local database (Geonames and OpenStreetMap: 42 million entries). You can also add your own data with the Web interface or the importer connectors provided. Gisgraphy is production ready, and has been designed to be scalable (load balanced), performant and used in other languages than just java: results can be output in XML, JSON, PHP, Python, Ruby, YAML, GeoRSS, and Atom. One of the most popular GPS tracking Systems (OpenGTS) also includes a Gisgraphy client.

Free webservices:

  • Geocoding
  • Street Search
  • Fulltext Search
  • Reverse geocoding / street search
  • Find nearby
  • Address parser

Services you could use in smartphone apps or in creating topic map-based collections of data that involve geographic spaces.

Class Central

Filed under: CS Lectures — Patrick Durusau @ 8:53 pm

Class Central

From the webpage:

A complete list of free online courses offered by Stanford’s Coursera, MIT’s MITx, and Udacity

Well, except that the latest offering listed is from Caltech. 😉

Looks like a resource that is going to see a lot of traffic, as well as new content.

Learning from Data

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 8:53 pm

Learning from Data

Outline:

This is an introductory course on machine learning that covers the basic theory, algorithms and applications. Machine learning (ML) uses data to recreate the system that generated the data. ML techniques are widely applied in engineering, science, finance, and commerce to build systems for which we do not have full mathematical specification (and that covers a lot of systems). The course balances theory and practice, and covers the mathematical as well as the heuristic aspects. Detailed topics are listed below.

From the webpage:

Real Caltech course, not watered-down version
Broadcast live from the lecture hall at Caltech

And so, the competition of online course offerings begins. 😉

The Pyed Piper

Filed under: Linux OS,Pyed Piper,Python — Patrick Durusau @ 8:52 pm

The Pyed Piper: A Modern Python Alternative to awk, sed and Other Unix Text Manipulation Utilities

Toby Rosen presents on Pyed Piper. Text processing for Python programmers.

Interesting that many movie studios use Python and Linux.

If you work in a Python environment, you probably want to give this a close look.

The project homepage.
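
For comparison, the kind of job this targets, say awk's '{print $2}', looks like this in plain Python reading from stdin. (pyp's own syntax is more compact; see the talk and the project page for the real thing.)

```python
# Plain-Python equivalent of: awk '{print $2}'  (print the second field of each line)
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 1:
        print(fields[1])
```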

Annotations in Data Streams

Filed under: Annotation,Data Streams — Patrick Durusau @ 8:52 pm

Annotations in Data Streams by Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler.

Abstract:

The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a non-trivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend — including recent studies of multi-pass streaming, read/write streams and randomly ordered streams — of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.

I have a fairly simple question as I start to read this paper: When is digital data not a stream?

When it is read from a memory device, it is a stream.

When it is read into a memory device, it is a stream.

When it is read into a cache on a CPU, it is a stream.

When it is read from the cache by a CPU, it is a stream.

When it is placed back in a cache by a CPU, it is a stream.

What would you call digital data on a storage device? May not be a stream but you can’t look at it without it becoming a stream. Yes?
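
For readers who haven't met the streaming model, here is a classic example of sublinear-space stream processing, the Misra-Gries frequent-items sketch, in Python. It is not one of the annotated-stream protocols from the paper, just the baseline setting those protocols improve on.

```python
# Misra-Gries: approximate the heavy hitters of a stream using at most k-1 counters.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters   # any item occurring > n/k times is guaranteed to survive

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "e"]
print(misra_gries(stream, k=3))   # "a" dominates the stream
```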

GraphLab workshop, Why should you care ?

Filed under: GraphLab,Graphs — Patrick Durusau @ 8:51 pm

GraphLab workshop, Why should you care ?

Danny Bickson has announced the first GraphLab workshop.

The “…Why should you care ?” post reads in part as follows:

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

In short it is a way to perform iterative algorithms on sparse graphs (parallel processing is also included). With the advent of cheap cloud computing, and the underlying need for post-processing in sparse recovery or advanced matrix factorization like dictionary learning, robust PCA and the like, it might be interesting to investigate the matter and even present something at this workshop….

Read the rest of “…Why should you care ?” for links to resources and examples. You will care. Promise.

And, if that doesn’t completely convince you, try:

A small Q&A with Danny Bickson on GraphLab.

Me? I am just hopeful for a small video cam somewhere in the audience with the slides/resources being posted.
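
Back to the abstraction itself: if “asynchronous iterative algorithms with sparse computational dependencies” sounds abstract, a synchronous toy of the same shape may help. Here is PageRank over a plain adjacency list in Python; GraphLab's contribution is running this style of local update asynchronously, in parallel, with consistency guarantees.

```python
# Toy PageRank over a sparse adjacency list: repeated local updates on a graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
nodes = list(graph)
rank = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(50):
    incoming = {n: 0.0 for n in nodes}
    for src, targets in graph.items():
        share = rank[src] / len(targets)
        for dst in targets:
            incoming[dst] += share
    rank = {n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes}

print(rank)   # "c" collects rank from everyone else
```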

Hawaii International Conference on System Sciences – Proceedings – TM Value-Add

Filed under: Conferences,Knowledge Economics,Systems Research — Patrick Durusau @ 8:51 pm

Hawaii International Conference on System Sciences

The Hawaii International Conference on System Sciences (HICSS) is the sponsor of the Knowledge Economics conference I mentioned earlier today.

It has a rich history (see below) and, just as importantly, free access to its proceedings back to 2005 via the CS Digital Library (ignore the wording about needing to log in).

I did have to locate the new page, which is: HICSS Proceedings 1995 –.

The proceedings illustrate why a topic map that captures prior experience can be beneficial.

For example, the entry for 1995 reads:

  • 28th Hawaii International Conference on System Sciences (HICSS’95)
  • 28th Hawaii International Conference on System Sciences (HICSS’95)
  • 28th Hawaii International Conference on System Sciences
  • 28th Hawaii International Conference on System Sciences (HICSS’95)
  • 28th Hawaii International Conference on System Sciences (HICSS’95)

Those are not duplicate entries. They all lead to unique content.

The entry for 2005 reads:

  • Volume No. 5 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 5
  • Volume No. 7 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 7
  • Volume No. 3 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 3
  • Volume No. 6 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 6
  • Volume No. 9 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 9
  • Volume No. 2 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 2
  • Volume No. 1 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 1
  • Volume No. 8 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 8
  • Volume No. 4 – Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) – Track 4

Now that’s a much more useful listing. 😉

Don’t despair! That changed in 2009 and in the latest (2011), we find:

  • 2011 44th Hawaii International Conference on System Sciences

OK, so we follow that link to find (in part):

  • David J. Nickles, Daniel D. Suthers, “A Study of Structured Lecture Podcasting to Facilitate Active Learning,” hicss, pp.1-10, 2011 44th Hawaii International Conference on System Sciences, 2011
  • J. Lucca, R. Sharda, J. Ruffner, U. Shimp, D. Biros, A. Clower, “Ammunition Multimedia Encyclopedia (AME): A Case Study,” hicss, pp.1-10, 2011 44th Hawaii International Conference on System Sciences, 2011
(emphasis added)

Do you notice anything odd about the pagination numbers?

Just for grins: in 2008, all the articles are one page long (pp. 28, pp. 29).

BTW, all the articles appear (I haven’t verified this) to have unique DOI entries.

I am deeply interested in the topics covered by HICSS, so if I organize the proceedings into a more useful form, how do I make that extensible by other researchers?

That is, I have no interest in duplicating the work already done for these listings, but rather in adding value to them while remaining open to more value being added to my work product.

Here are some observations to start a requirements process.

The pages with the HTML abstracts of the articles have helpful links (such as more articles by the same author and co-authors). I could duplicate the author/co-author links, but why? That would just be an additional maintenance duty.

Conferences should have a one page display, 1995 to current year, with each conference expanding into tracks and each track into papers (with authors listed). Mostly for browsing purposes.

Should be searchable across these proceedings only. (A feature apparently not available at the CS Digital Library site.)

Search should include title, author, exact phrase (as CS Digital Library does) but also subjects.

What am I missing? (Lots I know so be gentle. 😉 )

BTW, be aware of: HICSS Symposium and Workshop Reports and Monographs (also free for downloading)

Knowledge Economics II

My notes about Steve Newcomb’s economic asset approach to knowledge/information integration were taking me too far afield from the conference proper.

As an economic asset, take information liberated from alphabet soup agency (ASA) #1 to be integrated with your information. Your information could be from unnamed sources, public records, your own records, etc. Reliable integration requires semantic knowledge of ASA #1’s records or trial-and-error processing. Unless, of course, you have a mapping enabling reliable integration of ASA #1 information with your own.

How much is that “mapping” worth to you? Is it reusable? Or should I say, “retargetable?”

You can, as people are wont to do, hire a data mining firm to go over thousands of documents (like the State Department cables, which revealed the trivial nature of State Department secrets) and get a one-off result. But what happens the next time? Do you do the same thing over again? And how does that fit into your prior results?

That’s really the question isn’t it? Not how do you process the current batch of information (although that can be important) but how does that integrate into your prior data? So that your current reporters will not have to duplicate all the searching your earlier reporters did to find the same information.

Perhaps they will uncover relationships that were not apparent from only one batch of leaked information. Perhaps they will purchase travel data from the airlines to be integrated with reported sightings from their own sources. Or telephone records from carriers not based in the United States.

But data integration opportunities are not just for governments and the press.

Your organization has lots of information. Information on customers. Information on suppliers. Information on your competition. Information on which patients were taking which drugs, with what results. (Would you give that information to drug companies or sell it to drug companies? I know my answer. How about yours?)

What will you answer when a shareholder asks: What is our KROI? Knowledge Return on Investment?

You have knowledge to sell. How are you going to package it up to attract buyers? (inquiries welcome)

Knowledge Economics (Grand Wailea Maui, HI 2013)

Filed under: Conferences,Knowledge Economics,Knowledge Management — Patrick Durusau @ 8:50 pm

KE HICSS 2013 : Knowledge Economics: HICSS-46 (2013)

When: Jan 7, 2013 – Jan 10, 2013
Where: Grand Wailea Maui, HI
Submission Deadline: Jun 15, 2012
Notification Due: Aug 15, 2012
Final Version Due: Sep 15, 2012

A conference running to catch up with Steve Newcomb’s advocacy of knowledge integration mappings as economic assets. (See Knowledge Economics II)

Additional details about the Minitrack may be found at: http://www.hicss.hawaii.edu/hicss_46/KMEconomics.pdf

Join our “Knowledge Economics” LinkedIn group: http://www.linkedin.com/groups/Knowledge-Economics-4351854?trk=myg_ugrp_ovr

Knowledge Management is continuously gaining importance in research and practice, since growing economies are more reliant on the contribution of knowledge-intensive businesses. Various methodologies to identify, capture, model and simulate knowledge transfers have been elaborated within the business scope. These methodologies comprise both the technical as well as the organizational aspects of knowledge being transferred in organizations.

This minitrack aims to provide insight on the knowledge economics and emphasizes a holistic view on the economic implications of knowledge, including the value and economics of repositories and the overall value of knowledge. Further on, implications of the knowledge society and knowledge based policy are covered within the scope of this minitrack.

Possible contributions regarding the economics of knowledge management and transfer may include, but are not limited to the following:

  • Value and economics of repositories
  • Implications of the knowledge society
  • Policy generation and implementation in the knowledge society
  • Knowledge based theory
  • Knowledge based society
  • Economics of knowledge co-creation and Business Process Management (BPM)
  • Costs associated with knowledge management and knowledge transfer
  • Tangible and intangible (business) value of knowledge management systems
  • Methods for measuring the costs and benefits of projects involving knowledge management systems
  • Measuring, managing and promoting intellectual capital
  • Economics of inner and cross-organizational knowledge transfer
  • Business models involving knowledge management and knowledge transfer
  • The role of human, intellectual and social capital in knowledge management and knowledge transfer
  • Economics of knowledge transfer across developed and emerging economies
  • Value creation through education based knowledge transfer
  • Benefits and costs of considering knowledge in the analysis of business processes
  • Economics of sustainable knowledge management – potentials, barriers and critical success factors
  • Motivations and financial expectations of cross-border knowledge transfer
  • Contribution of knowledge management systems to firm performance and competitiveness
  • Economics of talent management
  • Financial effects of the Chief Knowledge Officer (CKO) position, knowledge managers, and other knowledge management related resources
  • Financial rewards systems related to knowledge management and knowledge transfer
  • Frameworks, models and theories related to the economics of knowledge management and transfer

Both conceptual and empirical papers with a sound research background are welcomed. All submissions must include a separate contribution section, explaining how the work contributes to a better understanding of knowledge management and economics.

March 17, 2012

Yahoo! Search Scientists Break New Ground on Search Results

Filed under: Searching,Yahoo! — Patrick Durusau @ 8:20 pm

Yahoo! Search Scientists Break New Ground on Search Results

From the post:

Understanding a person’s intent when searching on the web is critical to the quality of search results offered and at Yahoo! Search, the science team is constantly working to refine our technology and provide people with more relevant answers, not links, to their search query.

Recently, Yahoo! Search scientists built a new search platform from the ground up with machine learning technology that improves Yahoo!’s vertical intent triggering system and, as a result, our ability to better anticipate the needs of the individual user as he or she searches online. With this new platform, our search algorithm has the ability to adapt to what users are really interested in, by continuously monitoring how they engage with the search results. The system then continuously and automatically improves itself to provide the most engaging web search experience.

This technology was recently launched for news and movie search queries, two categories that tested extremely well with the technology. For example, with breaking news search terms constantly changing, humans can’t instantly track which queries are now breaking news stories. The intended result for a user can change for the same search query on a daily or even hourly basis. The technology can determine what the users are looking for and bring it to the top. And the key results that may have been at the top this morning, can be moved to the middle of the search results page at the end of day if user behavior determines other content is now more relevant.

Based on the positive feedback we’ve received in testing this platform for news and movie searches, we plan to roll out this new technology to support shopping, local, travel and mobile searches in the coming months, as well as other experiences across the Yahoo! network.

There wasn’t enough information in the post to evaluate the claims of improvement. I tried to post a comment asking when more details would appear, but it was with Firefox on Ubuntu so it may not have taken.

If you know what Yahoo! has done differently and can say what it is, please do. I am sure we would all like to know.

As you know, enabling users to state their intent seems like a better strategy to me. At least better than simply running the numbers like a network ratings sweep. It works, but only just.

Advanced Analytics with R and SAP HANA

Filed under: Hadoop,Oracle,R,SAP — Patrick Durusau @ 8:20 pm

Advanced Analytics with R and SAP HANA. Slides by Jitender Aswani and Jens Doerpmund.

Ajay Ohri reports that SAP is following Oracle in using R. And we have all heard about Hadoop and R.

Question: What language for analytics are you going to start learning for Oracle, SAP and Hadoop? (To say nothing of mining/analysis for topic maps.)

The Teradata add-on package for R

Filed under: R,Terrastore — Patrick Durusau @ 8:20 pm

The Teradata add-on package for R

Ajay Ohri writes:

teradataR is a package or library that allows R users to easily connect to Teradata, establish data frames (R data formats) to Teradata and to call in-database analytic functions within Teradata. This allows R users to work within their R console environment while leveraging the in-database functions developed with Teradata Warehouse Miner. This package provides 44 different analytical functions and an additional 20 data connection and R infrastructure functions. In addition, we’ve added a function that will list the stored procedures within Teradata and provide the capability to call them from R.

I won’t repeat more of the post as it is detailed and long (but usefully so).

For details on the Teradata Warehouse Miner see Teradata.

Analytical packages are important but analysis/data never makes decisions. Insightful analysts and decision makers do, the successful ones at any rate.

Lifebrowser: Data mining gets (really) personal at Microsoft

Filed under: Data Mining,Microsoft,Privacy — Patrick Durusau @ 8:20 pm

Lifebrowser: Data mining gets (really) personal at Microsoft

Nancy Owano writes:

Microsoft Research is doing research on software that could bring you your own personal data mining center with a touch of Proust for returns. In a recent video, Microsoft scientist Eric Horvitz demonstrated the Lifebrowser, which is prototype software that helps put your digital life in meaningful shape. The software uses machine learning to help a user place life events, which may span months or years, to be expanded or contracted selectively, in better context.

Navigating the large stores of personal information on a user’s computer, the program goes through the piles of personal data, including photos, emails and calendar dates. A search feature can pull up landmark events on a certain topic. Filtering the data, the software calls up memory landmarks and provides a timeline interface. Lifebrowser’s timeline shows items that the user can associate with “landmark” events with the use of artificial intelligence algorithms.

A calendar crawler, working with Microsoft Outlook extracts various properties from calendar events, such as location, organizer, and relationships between participants. The system then applies Bayesian machine learning and reasoning to derive atypical features from events that make them memorable. Images help human memory, and an image crawler analyzes a photo library. By associating an email with a relevant calendar date with a relevant document and photos, significance is gleaned from personal life events. With a timeline in place, a user can zoom in on details of the timeline around landmarks with a “volume control” or search across the full body of information.

Sounds like the start towards a “personal” topic map authoring application.

One important detail: With MS Lifebrowser the user is gathering information on themselves.

Not the same as having Google or Facebook gathering information on you. Is it?

Updated Google Prediction API

Filed under: Google Prediction,Predictive Analytics — Patrick Durusau @ 8:20 pm

Updated Google Prediction API

From the post:

Although we can’t reliably compare its future-predicting abilities to a crystal ball, the Google Prediction API unlocks a powerful mechanism to use machine learning in your applications.

The Prediction API allows developers to train their own predictive models, taking advantage of Google’s world-class machine learning algorithms. It can be used for all sorts of classification and recommendation problems from spam detection to message routing decisions. In the latest release, the Prediction API has added more detailed debugging information on trained models and a new App Engine sample, which illustrates how to use the Google Prediction API for the Java and Python runtimes.

To help App Engine developers get started with the prediction API, we’ve published an article and walkthrough detailing how to create and manage predictive models in App Engine apps with simple authentication using OAuth2 and service accounts. Check out the walkthrough and let us know what you think on the group. Happy coding!

OK, so what do I do when I leave my crystal ball at home?

Oh, that is why this is on the “cloud” I suppose. 😉

Are you using the Google Prediction API? Would appreciate hearing from satisfied/unsatisfied users. Certainly the sort of thing that could be important in authoring/curating a topic map.

Paper Review: “Recovering Semantic Tables on the WEB”

Filed under: Searching,Semantic Annotation,Semantics — Patrick Durusau @ 8:19 pm

Paper Review: “Recovering Semantic Tables on the WEB”

Sean Golliher writes:

A paper entitled “Recovering Semantics of Tables on the Web” was presented at the 37th Conference on Very Large Databases in Seattle, WA . The paper’s authors included 6 Google engineers along with professor Petros Venetis of Stanford University and Gengxin Miao of UC Santa Barbara. The paper summarizes an approach for recovering the semantics of tables with additional annotations other than what the author of a table has provided. The paper is of interest to developers working on the semantic web because it gives insight into how programmers can use semantic data (database of triples) and Open Information Extraction (OIE) to enhance unstructured data on the web. In addition they compare how a “maximum-likelihood” model, used to assign class labels to tables, compares to a “database of triples” approach. The authors show that their method for labeling tables is capable of labeling “an order of magnitude more tables on the web than is possible using Wikipedia/YAGO and many more than freebase.”

The authors claim that “the Web offers approximately 100 million tables but the meaning of each table is rarely explicit from the table itself”. Tables on the Web are embedded within HTML which makes extracting meaning from them a difficult task. Since tables are embedded in HTML search engines typically treat them like any other text in the document. In addition, authors of tables usually have labels that are specific to their own labeling style and assigned attributes are usually not meaningful. As the authors state: “Every creator of a table has a particular Schema in mind”. In this paper the authors describe a system where they automatically add additional annotations to a table in order to extract meaningful relationships between the entities in the table and other columns within table. The authors reference the table example shown below in Table. 1.1 . The table has no row or column labels and there is no title associated to it. To extract the meaning from this table, using text analysis, a search engine would have to relate the table entries to the text surrounding the document and/or analyze the text entries in the table.

The annotation process, first with class/instance and then out of a triple database, reminds me of Newcomb’s “conferral” of properties. That is, some content in the text (or in a subject representative/proxy) causes additional key/value pairs to be assigned/conferred. Nothing particularly remarkable about that process.

I am not suggesting that the isA/triple database strategy will work equally well for all areas. Which annotation/conferral strategy works best for you will depend on your data and the requirements imposed upon a solution. I would like to hear from you about annotation/conferral strategies that work with particular data sets.
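
A crude way to see the “database of triples” labeling idea in code: look up each cell value in an instance database and let the column take the majority class. This sketch and its mini-database are mine, invented for illustration, not the system described in the paper.

```python
# Confer a class label on a table column by majority vote over an isA database.
from collections import Counter

# Hypothetical instance -> classes facts, standing in for (value, isA, class) triples.
ISA = {
    "paris":  {"city", "person"},
    "london": {"city"},
    "tokyo":  {"city"},
    "obama":  {"person"},
}

def label_column(cells, min_support=0.5):
    votes = Counter()
    for cell in cells:
        for cls in ISA.get(cell.strip().lower(), ()):
            votes[cls] += 1
    if not votes:
        return None
    cls, count = votes.most_common(1)[0]
    return cls if count / len(cells) >= min_support else None

print(label_column(["Paris", "London", "Tokyo", "Springfield"]))   # -> "city"
```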

The Anachronism Machine: The Language of Downton Abbey

Filed under: Language,Linguistics — Patrick Durusau @ 8:19 pm

The Anachronism Machine: The Language of Downton Abbey

David Smith writes:

I’ve recently become hooked on the TV series Downton Abbey. I’m not usually one for costume dramas, but the mix of fine acting, the intriguing social relationships, and the larger WW1-era story make for compelling viewing. (Also: Maggie Smith is a treasure.)

Despite the widespread critical acclaim, Downton has met with criticism for some period-inappropriate uses of language. For example, at one point Lady Mary laments “losing the high ground”, a phrase that didn’t come into use until the 1960s. But is this just a random slip, or are such anachronistic phrases par for the course on Downton? And how does it compare to other period productions in its use of language?

To answer these questions, Ben Schmidt (a graduate student in history at Princeton University and Visiting Graduate Fellow at the Cultural Observatory at Harvard) created an “Anachronism Machine“. Using the R statistical programming language and Google n-grams, it analyzes all of the two-word phrases in a Downton Abbey script, and compares their frequency of use with that in books written around the WW1 era (when Downton is set). For example, Schmidt finds that Downton characters, if they were to follow societal norms of the 1910’s (as reflected in books from that period), would rarely use the phrase “guest bedroom”, but in fact it’s regularly uttered during the series. Schmidt charts the frequency these phrases appear in the show versus the frequency they appear in contemporaneous books below:

Good post on the use of R for linguistic analysis!

As a topic map person, I am more curious about what should be substituted for “guest bedroom” in a 1910s series. It would be interesting to have a mapping between the “normal” speech patterns of various time periods.
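
The mechanics are simple enough to sketch: pull the two-word phrases out of a script and compare their frequencies against a period reference. The reference counts below are invented; Schmidt's version uses the Google n-grams data and R.

```python
# Flag script bigrams that never appear in a period reference table.
import re
from collections import Counter

def bigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

script_line = "I am just sick of losing the high ground in the guest bedroom"

# Hypothetical per-million frequencies from a 1910s reference corpus.
# A real table (e.g. built from Google n-grams) would cover the common phrases too.
reference = {("i", "am"): 900.0, ("am", "just"): 50.0, ("just", "sick"): 2.0,
             ("sick", "of"): 8.0, ("of", "losing"): 5.0, ("losing", "the"): 6.0,
             ("the", "high"): 40.0, ("high", "ground"): 12.0, ("ground", "in"): 1.0,
             ("in", "the"): 2000.0, ("the", "guest"): 3.0}

for bigram in bigrams(script_line):
    if reference.get(bigram, 0.0) == 0.0:
        print("possible anachronism:", " ".join(bigram))
```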

Every open spending data site in the US ranked and listed

Filed under: Government Data,Transparency — Patrick Durusau @ 8:19 pm

Every open spending data site in the US ranked and listed

Lisa Evans (Guardian, UK) writes:

The Follow the Money 2012 report has this week revealed the good news that more US states are being open about their public spending by publishing their transactions on their websites. It has also exposed the states of Arkansas, Idaho, Iowa, Montana and Wyoming that are keeping their finances behind a password protected wall or are just not publishing at all.

A network of US Public Interest Research Groups (US PIRGs) which produced the report, revealed that 46 states now “allow residents to access checkbook-level information about government expenditures online”.

The checkbook means a digital copy of who receives state money, how much, and for what purpose. Perhaps to make sense of this ‘checkbook’ concept it’s useful to compare US and UK public finance transparency.

A lot of data to be sure and far more than was available as little as ten (10) years ago.

It is “checkbook-level” type information but that is only a starting point for transparency.

Citizens can spot double billing/payments or “odd” billing/payments but that isn’t transparency. Or rather it is transparency of a sort but not its full potential.

For example, say your local government spends over $300,000 a year of its IT budget on GIS services, and you see that the monthly billings and payments are all correct and proper. What you are missing is that local developers, who have long-standing relationships with elected officials, benefit from those GIS services when planning new developments. The public doesn’t benefit from the construction of new developments, which strain existing infrastructure for the benefit of the very few.

To develop that level of transparency you would need electronic records of campaign support, government phone records, property records, visitor logs, and other data. And quite possibly a topic map to make sense of it all. Interesting to think about.

NASA Releases Atlas Of Entire Sky

Filed under: Astroinformatics,Data Mining,Dataset — Patrick Durusau @ 8:19 pm

NASA Releases Atlas Of Entire Sky

J. Nicholas Hoover (InformationWeek) writes:

NASA this week released to the Web an atlas and catalog of 18,000 images consisting of more than 563 million stars, galaxies, asteroids, planets, and other objects in the sky–many of which have never been seen or identified before–along with data on all of those objects.

The space agency’s Wide-field Infrared Survey Explorer (WISE) mission, which was a collaboration of NASA’s Jet Propulsion Laboratory and the University of California Los Angeles, collected the data over the past two years, capturing more than 2.7 million images and processing more than 15 TB of astronomical data along the way. In order to make the data easier to use, NASA condensed the 2.7 million digital images down to 18,000 that cover the entire sky.

The WISE mission, which mapped the entire sky, uncovered a number of never-before-seen objects in the night sky, including an entirely new class of stars and the first “Trojan” asteroid that shares the Earth’s orbital path. The study also determined that there were far fewer mid-sized asteroids near Earth than had been previously thought. Even before the mass release of data to the Web, there have already been at least 100 papers published detailing the more limited results that NASA had already released.

Hoover also says that NASA has developed tutorials to assist developers in working with the data and that the entire database will be available in the not too distant future.

When I see releases like this one, I am reminded of Jim Gray (MS). Jim was reported to like astronomy data sets because they are big and free. See what you think about this one.

March 16, 2012

Dex 4.5 release

Filed under: DEX,Graphs — Patrick Durusau @ 7:36 pm

Dex 4.5 release

From the webpage:

DEX 4.5 includes the following new features:

  • Graph Algorithm package for all APIs
  • Graph Algorithm package includes the following algorithms:
    • Traversal algorithms: To traverse the graph using DFS or BFS techniques. Your choice!
    • Find shortest path algorithms: Find the shortest way between two nodes, using BFS or Dijkstra techniques, whatever fits your code best!
    • Connected components algorithms: Find strongly or weakly connected components

I was tipped off to this release by Alex Popescu’s myNoSQL.
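
I don't have DEX 4.5 code to show here, but as a reminder of what a “find shortest path” call computes, here is BFS shortest path over a plain adjacency list in Python. The DEX API will of course look different.

```python
# BFS shortest path on an unweighted graph.
from collections import deque

def shortest_path(graph, start, goal):
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_path(graph, "a", "e"))   # ['a', 'b', 'd', 'e']
```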

Wyner and Hoekstra on A Legal Case OWL Ontology with an Instantiation of Popov v. Hayashi

Filed under: Ontology,OWL — Patrick Durusau @ 7:35 pm

Wyner and Hoekstra on A Legal Case OWL Ontology with an Instantiation of Popov v. Hayashi

From Legalinformatics:

Dr. Adam Wyner of the University of Leeds Centre for Digital Citizenship and Dr. Rinke Hoekstra of the University of Amsterdam’s Leibniz Center for Law have published A legal case OWL ontology with an instantiation of Popov v. Hayashi, forthcoming in Artificial Intelligence and Law. Here is the abstract:

The legal case ontology here.

I have a history with logic and the law that stretches over decades. Rather than comment now, I am more interested in what you think. What is strong/weak about this proposal?
