Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 25, 2011

MonoTable

Filed under: MonoTable,Ruby — Patrick Durusau @ 7:47 pm

MonoTable – Zero-admin, no single-point-of-failure, scalable NoSQL Data-Store in Ruby

From the webpage:

Install

It’ll be available as a gem.

Status

We are in the early design/implementation phase.

Primary Goals

  • Ordered key-value store / document store
  • REST api
  • Scales with ease
  • Easy setup and admin

….

The MonoTable Data Structure

“Everything should be made as simple as possible, but no simpler.” -Einstein

MonoTable stores all data in a single table. The table consists of records sorted by their keys. Each record, in addition to its key, can have 0 or more named fields. Basically, it’s a 2-dimensional hash where the first dimension supports range selects.

Sounds interesting but remember that Einstein may have been wrong about other issues: Models, Relativity & Reality.
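To make the data model concrete, here is a minimal Python sketch of an ordered record store with range selects over the key dimension, as described above. It is an illustration of the idea only, not MonoTable’s (Ruby) code; all names are invented.

```python
import bisect

class OrderedRecordStore:
    """Toy model of a single sorted table: records ordered by key, each with
    zero or more named fields (a 2-D hash whose first dimension supports
    range selects)."""

    def __init__(self):
        self._keys = []        # sorted record keys
        self._records = {}     # key -> {field_name: value}

    def put(self, key, **fields):
        if key not in self._records:
            bisect.insort(self._keys, key)
            self._records[key] = {}
        self._records[key].update(fields)

    def get(self, key, field=None):
        record = self._records.get(key, {})
        return record if field is None else record.get(field)

    def range(self, start, stop):
        """Range select: all records with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._records[key]

store = OrderedRecordStore()
store.put("user/0001", name="Ada", city="London")
store.put("user/0002", name="Grace")
print(list(store.range("user/", "user/~")))
```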

September 24, 2011

ORCID (Open Researcher & Contributor ID)

Filed under: Ambiguity,Researchers,Subject Identifiers — Patrick Durusau @ 6:59 pm

ORCID (Open Researcher & Contributor ID)

From the About page:

ORCID, Inc. is a non-profit organization dedicated to solving the name ambiguity problem in scholarly research and brings together the leaders of the most influential universities, funding organizations, societies, publishers and corporations from around the globe. The ideal solution is to establish a registry that is adopted and embraced as the de facto standard by the whole of the community. A resolution to the systemic name ambiguity problem, by means of assigning unique identifiers linkable to an individual’s research output, will enhance the scientific discovery process and improve the efficiency of funding and collaboration. The organization is managed by a fourteen member Board of Directors.

ORCID’s principles will guide the initiative as it grows and operates. The principles confirm our commitment to open access, global communication, and researcher privacy.

Accurate identification of researchers and their work is one of the pillars for the transition from science to e-Science, wherein scholarly publications can be mined to spot links and ideas hidden in the ever-growing volume of scholarly literature. A disambiguated set of authors will allow new services and benefits to be built for the research community by all stakeholders in scholarly communication: from commercial actors to non-profit organizations, from governments to universities.

Thomson Reuters and Nature Publishing Group convened the first Name Identifier Summit in Cambridge, MA in November 2009, where a cross-section of the research community explored approaches to address name ambiguity. The ORCID initiative officially launched as a non-profit organization in August 2010 and is moving ahead with broad stakeholder participation (view participant gallery). As ORCID develops, we plan to engage researchers and other community members directly via social media and other activity. Participation from all stakeholders at all levels is essential to fulfilling the Initiative’s mission.

I am not altogether certain that elimination of ambiguity in identification will enable “…min[ing] to spot links and ideas hidden in the ever-growing volume of scientific literature.” Or should I say there is no demonstrated connection between unambiguous identification of researchers and such gains?

True enough, the claim is made but I thought science was based on evidence, not simply making claims.

And, like most researchers, I have discovered unexpected riches when mistaking one researcher’s name for another’s. Reducing ambiguity in identification will reduce the incidence of, well, ambiguity in identification.

Jack Park forwarded this link to me.

newLISP® for Mac OS X, GNU Linux, Unix and Win32

Filed under: Lisp,MapReduce — Patrick Durusau @ 6:59 pm

newLISP® for Mac OS X, GNU Linux, Unix and Win32

From the website:

newLISP is a Lisp-like, general-purpose scripting language. It is especially well-suited for applications in AI, web search, natural language processing, and machine learning. Because of its small resource requirements, newLISP is also excellent for embedded systems applications. Most of the functions you will ever need are already built in. This includes networking functions, support for distributed and parallel processing, and Bayesian statistics.

At version 10.3.3, newLISP says it has over 350 functions and is about 200K in size.

Interesting that one of the demo applications written in 2007 is MapReduce.

There are some posts on its mailing lists, but I would not call them high traffic. 😉

Flowing Data Tutorials

Filed under: Visualization — Patrick Durusau @ 6:59 pm

Flowing Data Tutorials by Nathan Yau.

Nathan has created tutorials from time to time on visualization of data.

You will quickly learn that visualization techniques + data != useful presentation.

Useful presentation of data requires insight into the data, the purpose of the presentation, the audience who will see it and related issues, as well as the technical aspects of the visualization proper.

These tutorials will give you a sense of the range of possibilities that exist for visualization of data.

Enjoy!

The Impact of online reviews: An annotated bibliography

Filed under: Marketing,Reviews — Patrick Durusau @ 6:59 pm

The Impact of online reviews: An annotated bibliography by Panos Ipeirotis.

From the post:

A few weeks back, I received some questions about online consumer reviews, their impact on sales, and other related questions. At that point, I realized that while I had a good grasp of the technical literature within Computer Science venues, my grasp of the overall empirical literature within Marketing and Information Systems venues was rather shaky, so I had to do a better job of preparing a literature review.

So, I did whatever a self-respecting professor would do in such a situation: I asked my PhD student, Beibei Li, to compile a list of such papers, write a brief summary of each, and send me the list. She had passed her qualification exam by studying exactly this area, so she was the resident expert in the topic.

Beibei did not disappoint me. A few hours later I had a very good list of papers in my mailbox, together with the descriptions. It was so good that I thought many other people would be interested in the list.

Questions:

  1. When was the last time you read a review of topic map software?
  2. When was the last time you read a review of a topic map?

I mention this bibliography in part to show the usefulness of online reviews and possibly how to make them effective.

If that sounds like cold-blooded marketing, there is a good reason. It is.

What topic map software or topic map would you suggest for review?

Where would you publish the review?

In case you are having trouble thinking of one, check the Topic Maps Lab projects listing.

How do I become a data scientist?

Filed under: Computer Science,Data Analysis,Data Science — Patrick Durusau @ 6:58 pm

How do I become a data scientist?

Whether you call yourself a “data scientist” or not is up to you.

Acquiring the skills relevant to your area of interest is the first step towards success with topic maps.

What are the best blogs about data? Why?

Filed under: Data,Data Science — Patrick Durusau @ 6:58 pm

What are the best blogs about data? Why?

A very extensive listing (as you can imagine) of blogs about data.

Quora reports:

This question has been viewed 15577 times; it has 2 monitors with 21188 topic followers and 0 aliases exist.

831 people are following this question.

So the question is popular.

How would you make the answer more useful?

Introducing Fech

Filed under: Dataset,Marketing — Patrick Durusau @ 6:58 pm

Introducing Fech by Michael Strickland.

From the post:

Ten years ago, the Federal Election Commission introduced electronic filing for political committees that raise and spend money to influence elections to the House and the White House. The filings contain aggregate information about a committee’s work (what it has spent, what it owes) and more detailed listings of its interactions with the public (who has donated to it, who it has paid for services).

Journalists who work with these filings need to extract their data from complex text files that can reach hundreds of megabytes. Turning a new set into usable data involves using the F.E.C.’s data dictionaries to match all the fields to their positions in the data. But the available fields have changed over time, and subsequent versions don’t always match up. For example, finding a committee’s total operating expenses in version 7 means knowing to look in column 52 of the “F3P” line. It used to be found at column 50 in version 6, and at column 44 in version 5. To make this process faster, my co-intern Evan Carmi and I created a library to do that matching automatically.

Fech (think “F.E.C.h,” say “fetch”), is a Ruby gem that abstracts away any need to map data points to their meanings by hand. When you give Fech a filing, it checks to see which version of the F.E.C.’s software generated it. Then, when you ask for a field like “total operating expenses,” Fech knows how to retrieve the proper value, no matter where in the filing that particular software version stores it.

At present Fech only parses presidential filings but can be extended to other filings.
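The core trick Fech automates is mapping field names to column positions per filing version. Here is a hedged Python sketch of that idea; the column numbers for total operating expenses (44, 50 and 52 in versions 5, 6 and 7 of the “F3P” line) come from the post, while the function and data structure names are invented for illustration and are not Fech’s API.

```python
# Field-to-column maps per F.E.C. software version for the "F3P" line.
# The columns for total operating expenses (44, 50, 52) come from the post;
# everything else here is invented for illustration.
F3P_FIELDS = {
    "5": {"total_operating_expenses": 44},
    "6": {"total_operating_expenses": 50},
    "7": {"total_operating_expenses": 52},
}

def get_field(row, version, name):
    """Look up a named field in an F3P row regardless of filing version."""
    column = F3P_FIELDS[version][name]   # 1-based column from the data dictionary
    return row[column - 1]

# A filing written by version 7 software: pad a fake row out to 52 columns.
row_v7 = ["F3P"] + [""] * 50 + ["1234567.89"]
print(get_field(row_v7, "7", "total_operating_expenses"))   # -> 1234567.89
```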

OK, so now it is easier to get campaign finance information. Now what?

So members of congress live in the pockets of their largest supporters. Is that news?

How would you use topic maps to make that news? Serious question.

Or how would you use topic maps to make that extraction a value-add when combined with other New York Times content?


Update: Fech 1.1 Released.

SWJ-SoM 2012 : Semantic Web Journal – Special Issue on The Semantics of Microposts

Filed under: Semantic Web,Semantics — Patrick Durusau @ 6:58 pm

SWJ-SoM 2012 : Semantic Web Journal – Special Issue on The Semantics of Microposts

Dates:

Submission Deadline Nov 15, 2011
Notification Due Jan 15, 2012

From the call:

The aim of this special issue is to publish a collection of papers covering the range of topics relevant to the analysis, use and reuse of Micropost data. This should cover a wide scope of work that represents current efforts in the fields collaborating with the Semantic Web community to address the challenges identified for the extraction of semantics in Microposts, and the development of intuitive, effective tools that make use of the rich, collective knowledge. We especially solicit new research in the field that explores the particular challenges due to, and the influence of the mainstream user, as compared to publication and management by technical experts.

Additionally, we encourage revised versions of research papers and practical demonstrations presented at relevant workshops, symposia and conferences, extended to increase depth and review the authors’ own and other relevant work, and take into account also feedback from discussions and panels at such events.

Perhaps starting with “microposts” will allow researchers to work their way up to the semantics of full texts? Personally I am betting on semantics to continue to be the clear winner that refuses to “fit” into various models and categories. We can create useful solutions but that isn’t the same thing as mastering semantics.

Recommendation Engine

Filed under: Recommendation — Patrick Durusau @ 6:57 pm

Recommendation Engine by Ricky Ho.

From the post:

In a classical model of recommendation system, there are “users” and “items”. User has associated metadata (or content) such as age, gender, race and other demographic information. Items also has its metadata such as text description, price, weight … etc. On top of that, there are interaction (or transaction) between user and items, such as userA download/purchase movieB, userX give a rating 5 to productY … etc.

Ricky does a good job of stepping through the different approaches to making recommendations. That is important for topic map interfaces that recommend additional topics to their users.
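As a taste of the simplest interaction-based approach Ricky covers, here is a toy item-to-item co-occurrence recommender in Python. The data and names are made up; it is a sketch of the pattern, not Ricky’s code.

```python
from collections import Counter, defaultdict

# user -> set of items the user interacted with (downloads, purchases, ratings...)
interactions = {
    "userA": {"movieB", "movieC"},
    "userX": {"movieB", "movieC", "productY"},
    "userZ": {"movieC", "productY"},
}

# Build item -> co-occurring item counts from the interaction data.
co_counts = defaultdict(Counter)
for items in interactions.values():
    for item in items:
        for other in items:
            if other != item:
                co_counts[item][other] += 1

def recommend(user, k=3):
    """Recommend items that co-occur most often with what the user already has."""
    seen = interactions.get(user, set())
    scores = Counter()
    for item in seen:
        scores.update(co_counts[item])
    for item in seen:                  # never recommend something already seen
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(k)]

print(recommend("userA"))              # -> ['productY']
```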

September 23, 2011

Models, Relativity & Reality

Filed under: Ontology — Patrick Durusau @ 6:30 pm

Particles Appear to Travel Faster Than Light: OPERA Experiment Reports Anomaly in Flight Time of Neutrinos from Science Daily.

From the post:

Scientists with the OPERA experiment, which observes a neutrino beam from CERN 730 km away at Italy’s INFN Gran Sasso Laboratory, are presenting surprising new results (in a seminar at CERN on Sept. 23, 2011) that appear to show neutrinos traveling faster than light.

The OPERA result is based on the observation of over 15000 neutrino events measured at Gran Sasso, and appears to indicate that the neutrinos travel at a velocity 20 parts per million above the speed of light, nature’s cosmic speed limit. Given the potential far-reaching consequences of such a result, independent measurements are needed before the effect can either be refuted or firmly established. This is why the OPERA collaboration has decided to open the result to broader scrutiny. The collaboration’s result is available on the preprint server arXiv (http://arxiv.org/list/hep-ex/new).

It will take weeks, months or perhaps years to confirm or refute these findings. And that lies firmly in the province of high-energy physics. So why mention it here?

Whatever the outcome, I take this as a reminder that we create models of reality. Relativity, both special and general, is such a model. A useful one, but then so was Newtonian physics (which remains useful, by the way).

Our ontologies, data structures, identification systems, etc., are all models. The only thing that separates them, one from another, is whether they are useful for some specified purpose or not.

5 misconceptions about visualization

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 6:28 pm

From the post:

Last month, I had the pleasure of spending a week at the Census Bureau as a “visiting scholar.” They’re looking to boost their visualization efforts across all departments, and I put in my two cents on how to go about doing it. For being a place where there is so much data, the visual side of things is still in the early stages, generally speaking.

During all the meetings, there were recurring themes about what visualization is and what it is used for. Some people really got it, but others were new to the subject, and we ran into a few misconceptions that I think are worth repeating.

Here we go, in no particular order.

Yeah, I moved the link.

Before you see Nathan’s list, take a piece of paper and write down why you have used visualization of data.

Got that? Now for the link:

5 misconceptions about visualization by Nathan Yau

Any to add to Nathan’s list? Perhaps from your own?

Top JavaScript Dynamic Table Libraries

Filed under: Interface Research/Design,Javascript — Patrick Durusau @ 6:26 pm

Top JavaScript Dynamic Table Libraries from Sematext.

From the post:

Since @sematext focuses on Search, Data and Text Analytics, and similar areas that typically involve exclusively backend, server-side work, we rarely publish anything that deals with UI, UX, with JavaScript, front ends, and so on. However, our Search Analytics and Scalable Performance Monitoring services/products do have rich, data-driven UIs (think reports, graphs, charts, tables), so we are increasingly thinking (obsessing?) about usability, intuitive and clean interfaces, visual data representations, etc. (in fact, we have an opening for a UI/UX designer and developer). Recently, we decided to upgrade a group of Search Analytics reports that, until now, used a quickly-thrown-together HTML table that, as much as we loved its simplicity, needed a replacement. So we set out to look for something better, more functional and elegant. In the process we identified a number of JavaScript libraries for rendering and browsing tabular data. Eventually we narrowed our selection to 6 JavaScript libraries whose characteristics we analyzed. In the spirit of sharing and helping others, please find their feature matrix below.

I suppose there is a time in everyone’s life when they finally have to show their front end. This will help you with yours. 😉

Free and Public Data Sets

Filed under: Dataset — Patrick Durusau @ 6:23 pm

Free and Public Data Sets

Some of these will be familiar, some not.

I am aware of a number of government sites that offer a variety of data sets. What I don’t know of is a list of data sets by characteristics. That would include subject matter, format, age, etc.

Suggestions?

Pivot Faceting (Decision Trees) in Solr 1.4.

Filed under: Facets,Solr — Patrick Durusau @ 6:22 pm

Pivot Faceting (Decision Trees) in Solr 1.4.

From the post:

Solr faceting breaks down searches for terms, phrases, and fields in the Solr into aggregated counts by matched fields or queries. Facets are a great way to “preview” further searches, as well as a powerful aggregation tool in their own right.

Before Solr 4.0, facets were only available at one level, meaning something like “counts for field ‘foo’” for a given query. Solr 4.0 introduced pivot facets (also called decision trees) which enable facet queries to return “counts for field ‘foo’ for each different field ‘bar’” – a multi-level facet across separate Solr fields.

Decision trees come up a lot, and at work, we need results along multiple axes – typically in our case “field/query by year” for a time series. However, we use Solr 1.4.1 and are unlikely to migrate to Solr 4.0 in the meantime. Our existing approach was to simply query for the top “n” fields for a first query, then perform a second-level facet query by year for each field result. So, for the top 20 results, we would perform 1 + 20 queries – clearly not optimal, when we’re trying to get this done in the context of a blocking HTTP request in our underlying web application.

Hoping to get something better than our 1 + n separate queries approach, I began researching the somewhat more obscure facet features present in Solr 1.4.1. And after some investigation, experimentation and a good amount of hackery, I was able to come up with a “faux” pivot facet scheme that mostly approximates true pivot faceting using Solr 1.4.1.

We’ll start by examining some real pivot facets in Solr 4.0, then look at the components and full technique for simulated pivot facets in Solr 1.4.1.

Not only a good introduction to a new feature in Solr 4.0 but how to sorta duplicate it in Solr 1.4.1!
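For a concrete feel of the difference, here is a hedged Python sketch of both request patterns against a local Solr instance. facet.pivot is the Solr 4.0 parameter the post describes; the URL, field names (“foo”, “year”) and response handling are assumptions for illustration, not code from the post.

```python
import requests

SOLR = "http://localhost:8983/solr/select"   # assumed local Solr URL

# Solr 4.0: one request, multi-level counts via pivot facets.
pivot = requests.get(SOLR, params={
    "q": "*:*", "rows": 0, "wt": "json",
    "facet": "true", "facet.pivot": "foo,year",
}).json()

# Solr 1.4.1: the 1 + n pattern the post starts from -- top n values of
# "foo" first, then one facet-by-year query per value.
top = requests.get(SOLR, params={
    "q": "*:*", "rows": 0, "wt": "json",
    "facet": "true", "facet.field": "foo", "facet.limit": 20,
}).json()
foo_counts = top["facet_counts"]["facet_fields"]["foo"]
values = foo_counts[0::2]                     # Solr returns [value, count, value, count, ...]

by_year = {}
for value in values:
    resp = requests.get(SOLR, params={
        "q": "*:*", "rows": 0, "wt": "json",
        "fq": 'foo:"%s"' % value,
        "facet": "true", "facet.field": "year",
    }).json()
    by_year[value] = resp["facet_counts"]["facet_fields"]["year"]
```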

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction

Filed under: Bioinformatics,Biomedical,Classifier,Machine Learning,Prediction — Patrick Durusau @ 6:15 pm

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction by Ping Shi, Surajit Ray, Qifu Zhu and Mark A Kon.

BMC Bioinformatics 2011, 12:375 doi:10.1186/1471-2105-12-375 Published: 23 September 2011

Abstract:

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

Knowing the tools that are already in use in bioinformatics will help you design topic map applications of interest to those in that field. And this is a very nice combination of methods to study on its own.
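If you want to experiment with the feature selection component, here is a small numpy sketch of the pairwise ranking score as I understand it from the k-TSP literature: how differently the ordering of two genes’ expression values behaves in the two classes. It is a sketch with toy data, not the authors’ code.

```python
import numpy as np
from itertools import combinations

def tsp_scores(X, y):
    """Score every gene pair (i, j) by the difference between classes of the
    probability that gene i's expression is below gene j's.
    X : (samples, genes) expression matrix, y : binary labels."""
    X0, X1 = X[y == 0], X[y == 1]
    scores = {}
    for i, j in combinations(range(X.shape[1]), 2):
        p0 = np.mean(X0[:, i] < X0[:, j])     # P(gene_i < gene_j | class 0)
        p1 = np.mean(X1[:, i] < X1[:, j])     # P(gene_i < gene_j | class 1)
        scores[(i, j)] = abs(p0 - p1)
    return scores

# Toy data: 6 samples, 4 genes; pair (0, 1) flips ordering between classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
X[y == 0, 0] -= 5.0                           # gene 0 low in class 0
X[y == 1, 0] += 5.0                           # gene 0 high in class 1

scores = tsp_scores(X, y)
best = max(scores, key=scores.get)
print(best, scores[best])
```

The top-scoring pairs then become the reduced feature set handed to an SVM or another classifier, which is the hybrid scheme the paper evaluates.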

ParLearning 2012 (silos or maps?)

ParLearning 2012 : Workshop on Parallel and Distributed Computing for Machine Learning and Inference Problems

Dates:

When May 25, 2012 – May 25, 2012
Where Shanghai, China
Submission Deadline Dec 19, 2011
Notification Due Feb 1, 2012
Final Version Due Feb 21, 2012

From the notice:

HIGHLIGHTS

  • Foster collaboration between HPC community and AI community
  • Applying HPC techniques for learning problems
  • Identifying HPC challenges from learning and inference
  • Explore a critical emerging area with strong industry interest without overlapping with existing IPDPS workshops
  • Great opportunity for researchers worldwide for collaborating with Chinese Academia and Industry

CALL FOR PAPERS

Authors are invited to submit manuscripts of original unpublished research that demonstrate a strong interplay between parallel/distributed computing techniques and learning/inference applications, such as algorithm design and libraries/framework development on multicore/ manycore architectures, GPUs, clusters, supercomputers, cloud computing platforms that target applications including but not limited to:

  • Learning and inference using large scale Bayesian Networks
  • Large scale inference algorithms using parallel topic models, clustering and SVM etc.
  • Parallel natural language processing (NLP).
  • Semantic inference for disambiguation of content on web or social media
  • Discovering and searching for patterns in audio or video content
  • On-line analytics for streaming text and multimedia content
  • Comparison of various HPC infrastructures for learning
  • Large scale learning applications in search engine and social networks
  • Distributed machine learning tools (e.g., Mahout and IBM parallel tool)
  • Real-time solutions for learning algorithms on parallel platforms

If you are wondering what role topic maps have to play in this arena, ask yourself the following question:

Will the systems and techniques demonstrated at this conference use the same means to identify the same subjects?*

If your answer is no, what would you suggest is the solution for mapping different identifications of the same subjects together?

My answer to that question is to use topic maps.

*Whatever you ascribe as its origin, semantic diversity is part and parcel of the human condition. We can either develop silos or maps across silos. Which do you prefer?
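For the “maps” option, here is a toy Python sketch of the core move: merging records that share a subject identifier, so different identifications of the same subject end up together. It illustrates the idea only; the identifiers and field names are invented, and this is not any particular topic map engine.

```python
# Two systems identify the same researcher differently; a topic-map-style
# merge keys on shared subject identifiers rather than on local names.
system_a = {"name": "J. Smith",
            "ids": {"http://example.org/researcher/42"},
            "affil": "MIT"}
system_b = {"name": "Smith, Jane",
            "ids": {"http://example.org/researcher/42", "mailto:jsmith@example.edu"}}

def merge(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["ids"] & topic["ids"]:        # any identifier in common?
                existing["ids"] |= topic["ids"]
                existing["names"].add(topic["name"])
                break
        else:
            merged.append({"ids": set(topic["ids"]),
                           "names": {topic["name"]},
                           **{k: v for k, v in topic.items() if k not in ("ids", "name")}})
    return merged

print(merge([system_a, system_b]))
```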

Oresoft Live Web Class

Filed under: CS Lectures,Data Mining — Patrick Durusau @ 6:11 pm

Oresoft Live Web Class YouTube Channel

I ran across this YouTube channel on a data mining alert I get from a search service. The data mining course looks like one of the more complete ones.

It stems from the Oresoft Academy, which conducts live virtual classes. If you have an interest in teaching, see the FAQ for what is required to contribute to this effort.

The Oresoft playlist offers (as of 22 September 2011):

  • Algorithms (101 sessions)
  • Compiler Design (42 sessions)
  • Computer Graphics (7 sessions)
  • Finite Automata (5 sessions)
  • Graph Theory (9 sessions)
  • Heap Sort (13 sessions)
  • Java Tutorials (16 sessions)
  • Non-Deterministic Finite Automata (14 sessions)
  • Oracle PL/SQL (27 sessions)
  • Oracle Server Concept (48 sessions, numbered to 49 due to a numbering error)
  • Oracle SQL (17 sessions)
  • Pumping Lemma (6 sessions)
  • Regular Expression (14 sessions)
  • Turing Machines (10 sessions)
  • Web Data Mining (127 sessions)

Foundations for Ontology

Filed under: Ontology — Patrick Durusau @ 6:10 pm

Foundations for Ontology by John Sowa.

A highly amusing combination of recent slides by John Sowa on issues surrounding the use and construction of ontologies.

I particularly enjoyed slide 4, “Prospects for a Universal Ontology: Attempts to create a universal classification of concepts,” which lists just the highlights of attempts at universal classification systems.

Lots of references to other Sowa presentations/publications. And others.

Which slide do you like best?

FYI, my test for any ontology is its usefulness for some specified purpose. That allows us to clear away all the factionalism over notation, foundations, and most theoretical issues. The remaining questions are: What do you want to do? How does this ontology help you do it? What will it cost? Those are empirical questions that don’t require a review of Western Civilization to answer.

How To Write an Academic Paper in Text Mining

Filed under: Conferences — Patrick Durusau @ 6:07 pm

How To Write an Academic Paper in Text Mining by Matthew Hurst.

From the post:

I’m completing a set of reviews for a reasonably high quality conference that touches on data mining and text mining problems. Perhaps the industrial setting has jaded me with respect to academic papers, but there seems to be some key points that – for me – really matter in the writing of a good paper (and implicitly in the selection of interesting areas of research).

Please read and take Matthew’s advice to heart.

He missed my favorite: check your citations! I read published papers with citations to non-existent materials, or at least to materials not found under the cited title or in the cited journal. Most can be tracked down, but poor citation practice doesn’t give a lot of confidence in the more important aspects of your paper.

Facebook and the Semantic Web

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 7:42 am

Jesse Weaver, Ph.D. Student, Patroon Fellow, Tetherless World Constellation, Rensselaer Polytechnic Institute, http://www.cs.rpi.edu/~weavej3/, announces that:

I would like to bring to subscribers’ attention that Facebook now supports RDF with Linked Data URIs from its Graph API. The RDF is in Turtle syntax, and all of the HTTP(S) URIs in the RDF are dereferenceable in accordance with httpRange-14. Please take some time to check it out.

If you have a vanity URL (mine is jesserweaver), you can get RDF about you:

curl -H 'Accept: text/turtle' http://graph.facebook.com/jesserweaver

If you don’t have a vanity URL but know your Facebook ID, you can use that instead (which is actually the fundamental method).

curl -H 'Accept: text/turtle' http://graph.facebook.com/1340421292

From there, try dereferencing URIs in the Turtle. Have fun!
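If you want to do more than eyeball the Turtle, here is a small Python sketch that fetches and parses it with rdflib. The endpoint behavior is as described in the announcement above; rdflib is simply one convenient parser and is my addition, not part of the announcement.

```python
import requests
from rdflib import Graph

# Fetch Turtle from the Graph API as in the announcement, then walk the triples.
headers = {"Accept": "text/turtle"}
turtle = requests.get("http://graph.facebook.com/jesserweaver", headers=headers).text

g = Graph()
g.parse(data=turtle, format="turtle")

for s, p, o in g:
    print(s, p, o)

# Dereference any linked-data URI found among the objects the same way:
# requests.get(str(o), headers=headers) for each object that is a URIRef.
```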

And I thought everyone had moved to that other service and someone left the lights on at Facebook. 😉

No flames! Just kidding.

September 22, 2011

A Graph-Based Movie Recommender Engine

Filed under: Graphs,Gremlin,Neo4j,Recommendation — Patrick Durusau @ 6:32 pm

A Graph-Based Movie Recommender Engine by Marko A. Rodriguez.

From the post:

A recommender engine helps a user find novel and interesting items within a pool of resources. There are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. This post will demonstrate how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. Feel free to follow along in the Gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.

As important as graph engines, algorithms and research are at present, and as important as they will become, I think the Neo4j community itself is worthy of direct study. There are stellar contributors to the technology and the community, but is that what makes it such an up and coming community? Or perhaps how they contributed? It would take a raft (is that the term for a group of sociologists?) of sociologists and perhaps there are existing studies of online communities that might have some clues. I mention that because there are other groups I would like to see duplicate the success of the Neo4j community.

Marko takes you from data import to a useful (albeit limited) application in less than 2500 words. (measured to the end of the conclusion, excluding further reading)

And leaves you with suggestions for further exploring.

That is a blog post that promotes a paradigm. (And for anyone who takes offense at that observation, it applies to my efforts as well. There are other ways to promote a paradigm but you have to admit, this is a fairly compelling one.)

Put Marko’s post on your read with evening coffee list.

Sparse Machine Learning Methods for Understanding Large Text Corpora

Filed under: Machine Learning,Sparse Learning,Text Analytics — Patrick Durusau @ 6:30 pm

Sparse Machine Learning Methods for Understanding Large Text Corpora (pdf) by Laurent El Ghaoui, Guan-Cheng Li, Viet-An Duong, Vu Pham, Ashok Srivastava, and Kanishka Bhaduri. Status: Accepted for publication in Proc. Conference on Intelligent Data Understanding, 2011.

Abstract:

Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.

I suppose it depends on your background (mine includes a law degree and a decade of practice) but when I read:

The ASRS data contains several of the crucial challenges involved under the general banner of “large-scale text data understanding”. First, its scale is huge, and growing rapidly, making the need for automated analyses of the processed reports more crucial than ever. Another issue is that the reports themselves are far from being syntactically correct, with lots of abbreviations, orthographic and grammatical errors, and other shortcuts. Thus we are not facing a corpora with well-structured language having clearly defined rules, as we would if we were to consider a corpus of laws or bills or any other well-redacted data set.

I thought I would fall out of my chair. I don’t think I have ever heard of a “corpus of laws or bills” being described as a “…well-redacted data set.”

There was a bill in the US Congress last year that, despite being acted on by both Houses and who knows how many production specialists, was passed without a name.

Apologies for the digression.

From the paper:

Our paper makes the claim that sparse learning methods can be very useful to the understanding of large text databases. Of course, machine learning methods in general have already been successfully applied to text classification and clustering, as evidenced for example by [21]. We will show that sparsity is an important added property that is a crucial component in any tool aiming at providing interpretable statistical analysis, allowing in particular efficient multi-document summarization, comparison, and visualization of huge-scale text corpora.

You will need to read the paper for the details but I think it clearly demonstrates that sparse learning methods are useful for exploring large text databases. While it may be the case that your users have a view of their data, it is equally likely that you will be called upon to mine a text database and to originate a navigation overlay for it. That will require exploring the data and developing an understanding of it.

For all the projections of the need for data analysts and required technical skills, without insight and imagination they will just be going through the motions.

(Applying sparse learning methods to new areas is an example of imagination.)
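For readers who want to try the comparative-summarization ingredient on their own text, here is a hedged scikit-learn sketch: an L1-penalized classifier drives most term weights to exactly zero, and the few surviving terms act as a comparative summary of two corpora. The toy “corpora” are made up and this is not the authors’ code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for two corpora to compare (e.g. two groups of ASRS reports).
corpus_a = ["runway incursion during taxi", "taxi clearance misunderstood on runway"]
corpus_b = ["smooth departure no issues", "routine climb and cruise"]

docs = corpus_a + corpus_b
labels = np.array([1] * len(corpus_a) + [0] * len(corpus_b))

vec = CountVectorizer()
X = vec.fit_transform(docs)

# The L1 penalty zeroes out most term weights; the survivors distinguish A from B.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

terms = np.array(vec.get_feature_names_out())
weights = clf.coef_.ravel()
print([(t, round(w, 2)) for t, w in zip(terms, weights) if w != 0])
```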

Riak 1.0.0 RC 1

Filed under: Erlang,Riak — Patrick Durusau @ 6:27 pm

Riak 1.0.0 RC 1

From the post:

We are pleased to announce the first release candidate for Riak 1.0.0 is now available.

The packages are available on our downloads page: http://downloads.basho.com/riak/riak-1.0.0rc1/

As a release candidate, we consider this to be a functionally complete representation of Riak 1.0.0. From now until the 1.0.0 release, only critical bug fixes will be merged into the repository. We would like to thank everyone who took the time to download, install, and run the pre-releases. The Riak community has always been one of the great strengths of Riak, and this release period has been no different with feedback and bug reports we’ve been given.

Cool!

Skills Matter – Autumn Update

Filed under: Conferences,Government Data,NoSQL,Scala — Patrick Durusau @ 6:26 pm

Skills Matter – Autumn Update

Given the state of UK airport security, about the only reason I would go to the UK would be for a Skills Matter (un)conference, eXchange, or tutorial! And that is from having only enjoyed them as recorded presentations, slides and code. Actual attendance must bring a lot of repeat customers.

On the schedule for this Fall:

Skills Matter Partner Conferences

Skills Matter has partnered with Silicon Valley Comes to the UK, WIP, Novoda, FuseSource and David Pollak, to provide you with the following fantastic (un)Conferences & Hackathons:

Skills Matter eXchanges

We’ll also be running some pretty cool one- and two-day long Skills Matter eXchanges, which are conferences featuring 45 minute long expert talks and lots of breaks to discuss what you have learned. Expect in-depth, hands-on talks led by real experts who are there to be quizzed, questioned and interrogated until you know as much as they do, or thereabouts! In the paragraphs below, you’ll be able to find out about the following eXchanges we have planned for the coming months:

Skills Matter Progressive Technology Tutorials

Skills Matter Progressive Technology Tutorials offer a collection of 4-hour tutorials, featuring a mix in-depth and hands-on workshops on technology, agile and software craftsmanship. In the paragraphs below, you’ll be able to find out about the following eXchanges we have planned for the coming months:

Neo4j and Spring – Practical Primer 28 Sept 2011 (London)

Filed under: Java,Neo4j,Spring Data — Patrick Durusau @ 6:24 pm

Neo4j and Spring – Practical Primer 28 Sept 2011 (London)

From the announcement:

In this talk Aleksa will introduce how you can integrate Neo4j with Spring – the popular Java enterprise framework.

Topics covered will include declarative transactions, object-to-graph mapping using Spring-data-graph components, as well as collections mapping using Cypher and Gremlin annotation support introduced in newly released version 1.1.0 of spring-data-graph. You can expect a lot of hands on coding and practical examples while exploring the latest features of neo4j and spring-data-graph.

Aleksa Vukotic is a Data Management Practice lead and Spring veteran with extensive experience as author, trainer, architect and developer. Working with graph data models on several projects, Neo4j has become Aleksa’s technology of choice for solving complex graph related problems.

Looks interesting to me!

Introduction to RavenDB

Filed under: NoSQL,RavenDB — Patrick Durusau @ 6:22 pm

Introduction to RavenDB by Rob Ashton.

From the description:

In this session we will give a brief introduction to the concept of a document database and how it relates to what we already know before launching into a series of code demos using the RavenDB .NET Client API.

We will cover basic document structure, persistence, unit of work, querying / searching, and demonstrate real world use for map/reduce in our applications.

The usual (should I say expected?) difficulties apply: the slides and examples are hard to read in the video. Slides really should be provided with video presentations.

If you are interested in RavenDB, there are more presentations and blogs entries at Rob Ashton’s blog.

Khan Academy

Filed under: Mathematics — Patrick Durusau @ 6:20 pm

Khan Academy

From the “about” page:

The Khan Academy is an organization on a mission. We’re a not-for-profit with the goal of changing education for the better by providing a free world-class education to anyone anywhere.

All of the site’s resources are available to anyone. It doesn’t matter if you are a student, teacher, home-schooler, principal, adult returning to the classroom after 20 years, or a friendly alien just trying to get a leg up in earthly biology. The Khan Academy’s materials and resources are available to you completely free of charge.

If you need to brush up on probability or linear algebra, you will find some helpful video lectures here.

Visualizing Lexical Novelty in Literature

Filed under: Data,Visualization — Patrick Durusau @ 6:19 pm

Visualizing Lexical Novelty in Literature by Matthew Hurst.

From the post:

Novels are full of new characters, new locations and new expressions. The discourse between characters involves new ideas being exchanged. We can get a hint of this by tracking the introduction of new terms in a novel. In the below visualizations (in which each column represents a chapter and each small block a paragraph of text), I maintain a variable which represents novelty. When a paragraph contains more than 25% new terms (i.e. words that have not been observed thus far) this variable is set at its maximum of 1. Otherwise, the variable decays. The variable is used to colour the paragraph with red being 1.0 and blue being 0. The result is that we can get an idea of the introduction of new ideas in novels.

As always, interesting ideas on text visualization from Matthew Hurst.

I am curious how much novelty (change?) you would see between SEC filings from the same law firm. Or, put another way, how much boilerplate is there in regulatory filings? I am mindful of the disaster plan for BP that included saving polar bears in the Gulf of Mexico.

Another interesting tool for exploring data and data sets in preparation to create topic maps.
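The novelty variable Hurst describes is easy to prototype. Here is a minimal Python sketch: a paragraph with more than 25% new terms sets the variable to 1.0, otherwise it decays (the decay rate is my assumption; the post does not give one).

```python
def novelty_trace(paragraphs, threshold=0.25, decay=0.8):
    """Per-paragraph novelty as described in the post: jump to 1.0 when more
    than `threshold` of a paragraph's terms are new, otherwise decay."""
    seen = set()
    novelty = 0.0
    trace = []
    for text in paragraphs:
        terms = text.lower().split()
        new = [t for t in terms if t not in seen]
        if terms and len(new) / len(terms) > threshold:
            novelty = 1.0
        else:
            novelty *= decay
        seen.update(terms)
        trace.append(novelty)          # map 1.0 -> red, 0.0 -> blue when drawing
    return trace

paragraphs = ["call me ishmael",
              "some years ago never mind how long",
              "call me some years ago"]
print(novelty_trace(paragraphs))        # -> [1.0, 1.0, 0.8]
```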
