Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 16, 2012

structr – update

Filed under: Graphs,Neo4j,Ontology,structr,Topic Map Software,Topic Maps — Patrick Durusau @ 2:29 pm

structr

One of the real pleasures of going over my older posts is checking up on projects I have mentioned in the past. Particularly when they show significant progress since the last time I looked.

Structr is one of those projects.

A lot of progress and I saw today that the homepage advertises:

With structr, you can build web sites or portals, but also interactive web applications.

And if you like, you can add topic maps or ontologies to the content graph. (emphasis added)

Guess I need to throw a copy on my “big box” and see what happens!

January 15, 2012

Buried Alive Fiance Gets 20 Years in Prison – Replace Turing Test?

Filed under: Artificial Intelligence,Humor — Patrick Durusau @ 9:20 pm

Unambiguous crash blossom, filed by Mark Liberman under Crash blossoms.

From the post:

This one isn’t ambiguous, as far as I can tell — it just doesn’t mean what the headline writer wanted it to mean: “Buried Alive Fiance Gets 20 Years in Prison”, ABC News 1/13/2012.

See Mark’s post for the answer.

Maybe this and similar headlines + the news stories should replace the Turing Test as the test for artificial intelligence.

Or would that make it too hard?

Comments?

House Launches Transparency Portal

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 9:19 pm

House Launches Transparency Portal by Daniel Schuman.

From the post:

Making good on part of the House of Representative’s commitment to increase congressional transparency, today the House Clerk’s office launched http://docs.house.gov/, a one stop website where the public can access all House bills, amendments, resolutions for floor consideration, and conference reports in XML, as well as information on floor proceedings and more. Information will ultimately be published online in real time and archived for perpetuity.

The Clerk is hosting the site, and the information will primarily come from the leadership, the Committee on House Administration, the Rules Committee, and the Clerk’s office. The project has been driven by House Republican leaders as part of a push for transparency. Important milestones include the adoption of the new House Rules in January 2011 that gave the Committee on House Administration the power to establish standards for publishing documents online, an April 2011 letter from the Speaker and Majority Leader to the Clerk calling for better public access to House information, a Committee on House Administration hearing in June 2011 on modernizing information delivery in the House, a December 2011 public meeting on public access to congressional information, and finally the late December adoption of online publication standards.

Some immediate steps to take:

  • Contact the House Clerk’s office to express your appreciation for their efforts.
  • If you are a US citizen, contact your representatives to express your support for this effort and your hope for more transparency.
  • Write to your local TV/radio/newspaper to point out this important resource and express your interest. (Keep it really non-technical. Transparency = Good.)
  • Write to your local school board/school, etc., to suggest they could use this as a classroom resource. (Offer to help as well.)
  • Make use of the data and credit your source.
  • Urge others to do the foregoing steps.

I have doubts about the transparency efforts but also think we should give credit where credit is due. A lot of people have worked very hard to make this much transparency possible, so let’s make the best use of it we can.

Machine Learning: Ensemble Methods

Filed under: Ensemble Methods,Machine Learning — Patrick Durusau @ 9:16 pm

Machine Learning: Ensemble Methods by Ricky Ho.

Ricky gives a brief overview of ensemble methods in machine learning.

Not enough for practical application but enough to orient yourself to learn more.

From the post:

Ensemble Method is a popular approach in Machine Learning based on the idea of combining multiple models. For example, by mixing different machine learning algorithms (e.g. SVM, Logistic regression, Bayesian network), ensemble method can automatically pick the best algorithmic model that fits the data the best. On the other hand, by mixing different parameter set of the same algorithmic model (e.g. Random forest, Boosting tree), it can pick the best set of parameters of the same algorithmic model.
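To make the quoted idea concrete, here is a minimal hard-voting sketch over three different scikit-learn classifiers. It is only an illustration; the synthetic dataset, the model choices, and the current scikit-learn module layout are my assumptions, not anything from Ricky's post.

```python
# Minimal majority-vote ensemble sketch; illustrative only, not from the post.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different algorithmic models, as in the quoted example.
models = [LogisticRegression(max_iter=1000), GaussianNB(), SVC()]
preds = np.array([m.fit(X_train, y_train).predict(X_test) for m in models])

# Hard voting: each test point gets the label chosen by most of the models.
majority = (preds.sum(axis=0) > len(models) / 2).astype(int)
print("ensemble accuracy:", (majority == y_test).mean())
```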

Pbm: A new dataset for blog mining

Filed under: Blogs,Dataset — Patrick Durusau @ 9:15 pm

Pbm: A new dataset for blog mining by Mehwish Aziz and Muhammad Rafi.

Abstract:

Text mining is becoming vital as Web 2.0 offers collaborative content creation and sharing. Now Researchers have growing interest in text mining methods for discovering knowledge. Text mining researchers come from variety of areas like: Natural Language Processing, Computational Linguistic, Machine Learning, and Statistics. A typical text mining application involves preprocessing of text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, evaluating and interpreting the results. There are numerous approaches for performing text mining tasks, like: clustering, categorization, sentimental analysis, and summarization. There is a growing need to standardize the evaluation of these tasks. One major component of establishing standardization is to provide standard datasets for these tasks. Although there are various standard datasets available for traditional text mining tasks, but there are very few and expensive datasets for blog-mining task. Blogs, a new genre in web 2.0 is a digital diary of web user, which has chronological entries and contains a lot of useful knowledge, thus offers a lot of challenges and opportunities for text mining. In this paper, we report a new indigenous dataset for Pakistani Political Blogosphere. The paper describes the process of data collection, organization, and standardization. We have used this dataset for carrying out various text mining tasks for blogosphere, like: blog-search, political sentiments analysis and tracking, identification of influential blogger, and clustering of the blog-posts. We wish to offer this dataset free for others who aspire to pursue further in this domain.

This paper details construction of the blog data set used in Sentence based semantic similarity measure for blog-posts.

The aspect I found most interesting was the restriction of the data set to a particular domain. When I was using physical research tools (books) in libraries, there was no “index to everything” available. Nor would I have used it had it been available.

If I had a social science question (political science major) or later a law question (law school), I would pick a physical research tool (PRT) that was appropriate to the search request. Why? Because specialized publications were curated to facilitate research in a particular area, including identification of synonyms and cross-referencing of information you might otherwise not notice.

Is this blogging dataset a clue that if we created subsets of the entire WWW, we could create indexing/analysis routines specific to those datasets? And hence give users a measurably better search experience?

RFI: Public Access to Digital Data Resulting From Federally Funded Scientific Research

Filed under: Government Data,Marketing,RFI-RFP,Topic Maps — Patrick Durusau @ 9:14 pm

RFI: Public Access to Digital Data Resulting From Federally Funded Scientific Research

Summary:

In accordance with Section 103(b)(6) of the America COMPETES Reauthorization Act of 2010 (ACRA; Pub. L. 111-358), this Request for Information (RFI) offers the opportunity for interested individuals and organizations to provide recommendations on approaches for ensuring long-term stewardship and encouraging broad public access to unclassified digital data that result from federally funded scientific research. The public input provided through this Notice will inform deliberations of the National Science and Technology Council’s Interagency Working Group on Digital Data.

I responded to the questions on: Standards for Interoperability, Re-Use and Re-Purposing

(10) What digital data standards would enable interoperability, reuse, and repurposing of digital scientific data? For example, MIAME (minimum information about a microarray experiment; see Brazma et al., 2001, Nature Genetics 29, 371) is an example of a community-driven data standards effort.

(11) What are other examples of standards development processes that were successful in producing effective standards and what characteristics of the process made these efforts successful?

(12) How could Federal agencies promote effective coordination on digital data standards with other nations and international communities?

(13) What policies, practices, and standards are needed to support linking between publications and associated data?

The deadline was 12 January 2012 so what I have written below is my final submission.

I am tracking the Federal Register for other opportunities to comment, particularly those that bring topic maps to the attention of agencies and other applicants.

Please comment on this response so I can sharpen the language for the next opportunity. Examples would be very helpful, from different fields. For example, if it is a police type RFI, examples of use of topic maps in law enforcement would be very useful.

In the future I will try to rough out responses (with no references) early so I can ask for your assistance in refining the response.

BTW, it was a good thing I asked about the response format (the RFI didn’t say) b/c I was about to send in five (5) separate formats, OOo, MS Word, PDF, RTF, text. Suspect that would have annoyed them. 😉 Oh, they wanted plain email format. Just remember to ask!

Patrick Durusau
patrick@durusau.net

Patrick Durusau (consultant)

Covington, Georgia 30014

Comments on questions (10) – (13), under “Standards for Interoperability, Re-Use and Re-Purposing.”

(10) What digital data standards would enable interoperability, reuse, and repurposing of digital scientific data?

The goals of interoperability, reuse, and repurposing of digital scientific data are not usually addressed by a single standard on digital data.

For example, in astronomy, the FITS (http://en.wikipedia.org/wiki/FITS) format is routinely used to ensure digital data interoperability. In some absolute sense, if the data is in a proper FITS format, it can be “read” by FITS conforming software.

But being in FITS format is no guarantee of reuse or repurposing. Many projects adopt “local” extensions to FITS and their FITS files can be reused or repurposed, if and only if the local extensions are understood. (Local FITS Conventions (http://fits.gsfc.nasa.gov/fits_local_conventions.html), FITS Keyword Dictionaries (http://fits.gsfc.nasa.gov/fits_dictionary.html))

That is not to fault projects for having “local” conventions but to illustrate that scientific research can require customization of digital data standards and reuse and repurposing will depend upon documentation of those extensions.

Reuse and repurposing would be enhanced by the use of a mapping standard, such as ISO/IEC 13250, Topic Maps (http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=38068). Briefly stated, topic maps enable the creation of mapping/navigational structures over digital (and analog) scientific data, furthering the goals of reuse and repurposing.

To return to the “local” conventions for FITS, it isn’t hard to imagine future solar research missions that develop different “local” conventions from the SDAC FITS Keyword Conventions (http://www.lmsal.com/solarsoft/ssw_standards.html). Interoperable to be sure because of the conformant FITS format, but reuse and repurposing become problematic with files from both data sets.

Topic maps enable experts to map the “local” conventions of the projects, one to the other, without any prior limitation on the basis for that mapping. It is important that experts be able to use their “present day” reasons to map data sets together, not just reasons from the dusty past.
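As a toy illustration of that kind of mapping, consider two headers that use different local keywords for the same subjects. The keyword names and subject identifiers below are hypothetical, invented for this sketch rather than taken from any actual FITS convention.

```python
# Toy sketch of mapping "local" FITS keyword conventions to shared subjects.
# Keyword names and subject identifiers are hypothetical examples.
mission_a_header = {"EXPTIME": 2.5, "TELESCOP": "Mission-A"}
mission_b_header = {"XPOSURE": 2.5, "OBSRVTRY": "Mission-B"}

# Topic-map-style mapping: local names -> a common subject identifier.
subject_for_keyword = {
    "EXPTIME": "exposure-time-seconds",
    "XPOSURE": "exposure-time-seconds",
    "TELESCOP": "observing-facility",
    "OBSRVTRY": "observing-facility",
}

def by_subject(header):
    """Re-key a header by subject so differently named keywords line up."""
    return {subject_for_keyword[k]: v for k, v in header.items()
            if k in subject_for_keyword}

print(by_subject(mission_a_header))
print(by_subject(mission_b_header))
```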

Some data may go unmapped. Or should we say that not all data will be found equally useful? Mapping can and will make it easier to reuse and repurpose data but that is not without cost. The participants in a field should be allowed to make the decision if mappings to legacy data are needed.

Some Babylonian astronomical texts (http://en.wikipedia.org/wiki/Babylonian_astronomy) have survived but they haven’t been translated into modern astronomical digital format. The point being that no rule for mapping between data sets will fit all occasions.

When mapping is appropriate, topic maps offer the capacity to reuse data across shifting practices of nomenclature and styles. Twenty years ago asking about “Dublin Core” would have evoked a puzzled look. Asking about a current feature in “Dublin Core” twenty years from now, is likely to have the same response.

Planning on change, and mapping it when useful, is a better response than pretending change stops with the current generation.

(11) What are other examples of standards development processes that were successful in producing effective standards and what characteristics of the process made these efforts successful?

The work of the IAU (International Astronomical Union (http://www.iau.org/)) and its maintenance of the FITS standard mentioned above is an example of a successful data standard effort.

Though not formally part of the standards process, the most important factor was the people involved. They were dedicated to the development of data and placing that data in the hands of others engaged in the same enterprise.

To put a less glowing, and perhaps more repeatable, explanation on their sharing, one could say members of the astronomical community had a mutual interest in sharing data.

Where gathering of data is dependent upon the vagaries of the weather, equipment, observing schedules and the like, data has to be taken from any available source. That being the case, there is an incentive to share data with others in like circumstances.

Funding decisions for research should depend not only on the use of standards that enable sharing but should also give heavy consideration to active sharing.

(12) How could Federal agencies promote effective coordination on digital data standards with other nations and international communities?

The answer here depends on what is meant by “effective coordination.” It wasn’t all that long ago that the debates were raging about whether ODF (ISO/IEC 26300) and OOXML (ISO/IEC 29500) should both be ISO standards. Despite (or perhaps because of) being the ODF editor, I thought it would be to the advantage of both proposals to be ISO standards.

Several years later, I stand by that position. Progress has been slower than I would like at seeing the standards draw closer together but there are applications that support both so that is a start.

Different digital standards have and will develop for the same areas of research. Some for reasons that aren’t hard to see, some for historical accidents, others for reasons we may never know. Semantic diversity expressed in the existence of different standards is going to be with us always.

Attempting to force different communities (the source of different standards) together will have unhappy results all the way around. Instead, federal agencies should take the initiative to act as the cross-walk, as it were, between diverse groups working in the same areas. As semantic brokers who are familiar with two or three or perhaps more perspectives, federal agencies will offer a level of expertise that will be hard to match.

It will be a slow, evolutionary process but contributions based on understanding different perspectives will bring diverse efforts closer together. It won’t be quick or easy but federal agencies are uniquely positioned to bring the long term commitment to develop such expertise.

(13) What policies, practices, and standards are needed to support linking between publications and associated data?

Linking between publications and associated data presumes availability of the associated data. To recall the comments on incentives for sharing, making data available should be a requirement for present funding and a factor to be considered for future funding.

Applications for funding should also be judged on the extent to which they plan on incorporating existing data sets and/or provide reasons why that data should not be reused. Agencies can play an important “awareness” role by developing and maintaining resources that catalog data in given fields.

It isn’t clear that any particular type of linking between publication and associated data should be mandated. The “type” of linking is going to vary based on available technologies.

What is clear is that the publication and its dependency on associated data should be clearly identified. Moreover, the data should be documented such that in the absence of the published article, a researcher in the field could use or reuse the data.

I added categories for RFI-RFP to make it easier to find this sort of analysis.

If you have any RFI-RFP responses that you feel like you can post, please do and send me links.

Oil Drop Semantics?

Filed under: Authoring Semantics,Authoring Topic Maps,Semantic Diversity,Semantics — Patrick Durusau @ 9:13 pm

Interconnection of Communities of Practice: A Web Platform for Knowledge Management and some related material made me think of the French “oil drop” counter-insurgency strategy.

With one important difference.

In a counter-insurgency context, the oil drop strategy is being used to further the goals of counter-insurgency force. Whatever you think of those goals or the alleged benefits for the places covered by the oil drops, the fundamental benefit is to the counter-insurgency force.

In a semantic context, one that seeks to elicit the local semantics of a group, the goal is not the furtherance of an outside semantic, but the exposition of a local semantic with the goal of benefiting the group covered by the oil spot. As the oil drop spreads, those semantics may be combined with other oil drop semantics, but that is a cost and effort borne by the larger community seeking that benefit.

There are several immediate advantages to this approach with semantics.

First, the discussion of semantics at every level is taking place with the users of those semantics. You can hardly get closer to a useful answer than being able to ask the users of a semantic what was meant or for examples of usage. I don’t have a formalism for it, but I would postulate that as the distance from users increases, the usefulness of the semantics captured for those users decreases.

Ask the FBI about the Virtual Case Management project. They didn’t ask users, or at least not enough of them, and flushed lots of cash. Lesson: Asking management, IT, etc., about the semantics of users is an utter waste of time. Really.

If you want to know the semantics of user group X, then ask group X. If you ask Y about X, you will get Y’s semantics about X. If that is what you want, fine, but if you want the semantics of group X, you have wasted your time and resources.

Second, asking the appropriate group of users for their semantics means that you can make explicit the ROI from making their semantics explicit. That is to say, if asked, the group will ask about semantics that are meaningful to them, semantics that solve some task or issue they encounter. They may or may not be the semantics that interest you, but recall the issue is the group’s semantics, not yours.

The reason for the ROI question at the appropriate group level is so that the project is justified both to the group being asked to make the effort as well as those who must approve the resources for such a project. Answering that question up front helps get buy-in from group members and makes them realize this isn’t busy work but will have a positive benefit for them.

Third, such a bottom-up approach, whether you are using topic maps, RDF, etc. will mean that only the semantics that are important to users and justified by some positive benefit are being captured. Your semantics may not have the rigor of SUMO, for example, but they are a benefit to you. What other test would you apply?

On the Hyperbolicity of Small-World Networks and Tree-Like Graphs

Filed under: Graphs,Networks,Small World — Patrick Durusau @ 9:11 pm

On the Hyperbolicity of Small-World Networks and Tree-Like Graphs by Wei Chen, Wenjie Fang, Guangda Hu and Michael W. Mahoney.

Abstract:

Hyperbolicity is a property of a graph that may be viewed as being a “soft” version of a tree, and recent empirical and theoretical work has suggested that many graphs arising in Internet and related data applications have hyperbolic properties. Here, we consider Gromov’s notion of $\delta$-hyperbolicity, and we establish several positive and negative results for small-world and tree-like random graph models. In particular, we show that small-world random graphs built from underlying grid structures do not have strong improvement in hyperbolicity, even when the rewiring greatly improves decentralized navigation. On the other hand, for a class of tree-like graphs called ringed trees that have constant hyperbolicity, adding random links among the leaves in a manner similar to the small-world graph constructions may easily destroy the hyperbolicity of the graphs, except for a class of random edges added using an exponentially decaying probability function based on the ring distance among the leaves. In addition to being of interest in their own right, our main results shed light on the relationship between hyperbolicity and navigability, as well as the relationship between $\delta$-hyperbolicity and the use of randomness in common random graph constructions.

Something to keep you off the streets after work this coming week. 😉

To understand why this work (and work like it) is important:

Hyperbolicity, a property of metric spaces that generalizes the idea of Riemannian manifolds with negative curvature, has received considerable attention in both mathematics and computer science. When applied to graphs, as is typical in computer science applications, one may think of hyperbolicity as characterizing a “soft” version of a tree—trees are graphs that have hyperbolicity equal to zero, and graphs that “look like” trees in terms of their metric structure have “small” hyperbolicity. Since trees are an important class of graphs and since tree-like graphs arise in numerous applications, the idea of hyperbolicity has received attention in a range of applications. For example, it has found usefulness in the visualization of the Internet, the Web, and other large graphs []; it has been applied to questions of compact routing, navigation, and decentralized search in Internet graphs and small-world social networks []; and it has been applied to a range of other problems such as distance estimation, network security, sensor networks, and traffic flow and congestion minimization []. (cites omitted)
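For readers who want to poke at the definition, the four-point condition gives a brute-force way to compute Gromov δ-hyperbolicity. A minimal sketch with networkx follows (O(n^4), so small graphs only; the example graphs are arbitrary choices of mine):

```python
# Brute-force Gromov delta-hyperbolicity via the four-point condition.
# O(n^4) over all 4-node subsets, so practical only for small graphs.
from itertools import combinations
import networkx as nx

def delta_hyperbolicity(G):
    dist = dict(nx.all_pairs_shortest_path_length(G))
    delta = 0.0
    for w, x, y, z in combinations(G.nodes(), 4):
        s1 = dist[w][x] + dist[y][z]
        s2 = dist[w][y] + dist[x][z]
        s3 = dist[w][z] + dist[x][y]
        a, b, _ = sorted((s1, s2, s3), reverse=True)
        delta = max(delta, (a - b) / 2.0)  # gap between the two largest sums
    return delta

print(delta_hyperbolicity(nx.balanced_tree(2, 3)))  # trees give 0
print(delta_hyperbolicity(nx.cycle_graph(10)))      # cycles are "fatter"
```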

Interesting question to which I don’t know the answer off hand: Do topic maps exhibit the characteristics of a “small-world network?” Or for that matter, has anyone investigated the nature of the graphs that result from topic maps?

I will be blogging more about the graph nature of topic maps. Please post pointers, suggestions, questions.

Interconnection of Communities of Practice: A Web Platform for Knowledge Management

Filed under: Communities of Practice,Knowledge Management — Patrick Durusau @ 9:10 pm

Interconnection of Communities of Practice: A Web Platform for Knowledge Management by Elise Garrot-Lavoué (LIESP).

Abstract:

Our works aim at developing a Web platform to connect various Communities of Practice (CoPs) and to capitalise on all their knowledge. This platform addresses CoPs interested in a same general activity, for example tutoring. For that purpose, we propose a general model of Interconnection of Communities of Practice (ICP), based on the concept of Constellation of Practice (CCP) developed by Wenger (1998). The model of ICP was implemented and has been used to develop the TE-Cap 2 platform which has, as its field of application, educational tutoring activities. In particular, we propose an indexation and search tool for the ICP knowledge base. The TE-Cap 2 platform has been used in real conditions. We present the main results of this descriptive investigation to validate this work.

I started reading this article because of the similarity of “Communities of Practice (CoPs)” to Jack Park’s “tribes,” which Jack uses to describe different semantic communities. Then I ran across:

The most important difficulty to overcome is to arouse interactions between persons except any frame imposed by an organisation. For that purpose, it is necessary to bring them to become aware that they have shared practices and to provide the available means to get in touch with people from different CoPs.
(emphasis added)

Admittedly the highlighted sentence would win no prizes for construction but I think its intent is clear. I would restate it as:

The most important difficulty is enabling interactions between persons across the structures of their Communities of Practice (CoPs).

While Communities of Practice (CoPs) can be, and often are, based in organizations such as employers, I think it is important not to limit the idea of such communities to formal organizational structures, which some CoPs may transcend. The project uses “Interconnection of Communities of Practice (ICP)” to describe communication that transcends institutional barriers.

The other modification I made was to make it clear that enabling interactions is the goal. Creating a framework of interactions isn’t the goal. Unless the interactions emerge from the members of the CoPs, all we have is a set of interactions imposed on the CoPs and their members.

I need to look at more Communities of Practice (CoPs) literature because I wonder whether ontologies are seen as the product of a community, as opposed to being the basis for a community itself.

I have done some quick searches on “Communities of Practice (CoPs)” and as with all things connected to topic maps, there is a vast sea of literature. 😉

January 14, 2012

Sufficient Conditions for Formation of a Network Topology by Self-interested Agents

Filed under: Graphs,Networks — Patrick Durusau @ 7:41 pm

Sufficient Conditions for Formation of a Network Topology by Self-interested Agents by Swapnil Dhamal and Y. Narahari.

Abstract:

The current literature on network formation primarily addresses the problem: given a set of self-interested nodes and a set of conditions, what topologies are pairwise stable and hence are likely to emerge. A pairwise stable network is one in which no node wants to delete any of its links and no two nodes would want to create a link between them. Pairwise stable networks span a wide range of topologies and some of them might be far from desirable. Motivated by the necessity for ensuring that the emerging network is exactly a desired one, we study the following reverse problem: given a network topology, what conditions are required so that best response strategies played by self-interested agents ultimately result in that network topology. We propose a sequential network formation game model that captures principal determinants of network formation, namely benefits from immediate neighbors, costs of maintaining links with immediate neighbors, benefits from indirect neighbors, and bridging benefits. Based on this model, we analyze some common network topologies, namely star, complete graph and bipartite Turán graph, and derive a set of sufficient conditions under which these network topologies emerge.

I mention this to suggest that as future work, merging of nodes based on the subject they represent should be considered as a benefit. Depending upon linking, merger could result in a reduction of linking between nodes, something the authors already recognize as a potential benefit. The potential exists for some merging operations to be found to be too expensive, either temporarily or longer term.

Rather than an axiomatic merging rule, consider a cost/benefit merging rule that is invoked only in some circumstances.
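A back-of-the-envelope sketch of what such a rule might look like follows; the benefit and cost functions and their weights are placeholders I made up, not a worked-out model.

```python
# Hypothetical cost/benefit test for merging two nodes that represent the
# same subject. The benefit/cost functions and weights are placeholders.
def merge_benefit(links_a, links_b):
    shared = links_a & links_b               # duplicate links removed by merging
    return len(shared)

def merge_cost(links_a, links_b):
    return 1 + 0.1 * len(links_a | links_b)  # fixed cost + rewiring cost

def should_merge(links_a, links_b):
    return merge_benefit(links_a, links_b) > merge_cost(links_a, links_b)

a = {"n1", "n2", "n3", "n4"}
b = {"n2", "n3", "n4", "n5"}
print(should_merge(a, b))  # merge only when link reduction pays for itself
```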

Sentence based semantic similarity measure for blog-posts

Filed under: Nutch,Semantics,Similarity — Patrick Durusau @ 7:40 pm

Sentence based semantic similarity measure for blog-posts by Mehwish Aziz and Muhammad Rafi.

Abstract:

Blogs-Online digital diary like application on web 2.0 has opened new and easy way to voice opinion, thoughts, and like-dislike of every Internet user to the World. Blogosphere has no doubt the largest user-generated content repository full of knowledge. The potential of this knowledge is still to be explored. Knowledge discovery from this new genre is quite difficult and challenging as it is totally different from other popular genre of web-applications like World Wide Web (WWW). Blog-posts unlike web documents are small in size, thus lack in context and contain relaxed grammatical structures. Hence, standard text similarity measure fails to provide good results. In this paper, specialized requirements for comparing a pair of blog-posts is thoroughly investigated. Based on this we proposed a novel algorithm for sentence oriented semantic similarity measure of a pair of blog-posts. We applied this algorithm on a subset of political blogosphere of Pakistan, to cluster the blogs on different issues of political realm and to identify the influential bloggers.

I am not sure I agree that “relaxed grammatical structures” are peculiar to blog posts. 😉

A “measure” of similarity that I have not seen (I would appreciate a citation if you have one) is the listing of one blog by another in its “blogroll.” The theory being that blogs may cite blogs they disagree with, semantically and otherwise, in posts, but are unlikely to list blogs in their “blogroll” that they find disagreeable.
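A quick sketch of that blogroll-based measure (the blog names below are invented): score two blogs by the overlap of their blogrolls, for example with a Jaccard coefficient.

```python
# Blogroll-overlap similarity sketch; blog names are invented examples.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

blogroll = {
    "blog-one": ["stats-daily", "nlp-notes", "graph-gazette"],
    "blog-two": ["nlp-notes", "graph-gazette", "ml-musings"],
    "blog-three": ["cooking-corner"],
}

print(jaccard(blogroll["blog-one"], blogroll["blog-two"]))    # high overlap
print(jaccard(blogroll["blog-one"], blogroll["blog-three"]))  # no overlap
```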

Faster reading through math

Filed under: Data Mining,Natural Language Processing,Searching — Patrick Durusau @ 7:39 pm

Faster reading through math

John Johnson writes:

Let’s face it, there is a lot of content on the web, and one thing I hate worse is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after a link), and am disappointed.

My strategy now is to use automatic summaries, which are now a lot more accessible than they used to be. The algorithm has been around since 1958 (!) by H. P. Luhn and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, provides short and long summaries, and links to the original post, and packages it up in a neat HTML page.

Or you can use the cute interface in Safari, if you care to switch.

The woes of ambiguity!

I jumped to John’s post thinking it had some clever way to read math faster. 😉 Some of the articles I am reading take a lot longer than others. I have one on homology that I am just getting comfortable enough with to post about it.

Any reading assistant tools that you would care to recommend?

Of particular interest would be software that I could feed a list of URLs that resolve to PDF files (possibly with authentication, although I could log in to start it off) and that produces a single HTML page summary.
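For anyone who wants to experiment before reaching for a packaged tool, here is a minimal sketch of the Luhn-style scoring the quoted post describes: rank sentences by how densely they use the document's most frequent content words. The stopword list is deliberately tiny and the thresholds are arbitrary.

```python
# Minimal Luhn-style summarizer sketch: rank sentences by clusters of
# frequent "significant" words. Illustrative only; tiny stopword list.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def summarize(text, n_sentences=2, n_keywords=10):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    significant = {w for w, _ in Counter(words).most_common(n_keywords)}

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        hits = sum(1 for t in tokens if t in significant)
        return hits * hits / len(tokens) if tokens else 0.0

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]  # keep original order

print(summarize("Replace this with the scraped text of a blog post or PDF."))
```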

Amalgame

Filed under: Alignment,Vocabularies — Patrick Durusau @ 7:39 pm

Amalgame

Amalgame is the software that Tim Wray mentions in his post, vocabulary alignment, meaning and understanding in the world museum, as using a technique called “interactive alignment.”

From the homepage:

Amalgame (AMsterdam ALignment GenerAtion MEtatool) is a tool for finding, evaluating and managing vocabulary alignments. We explicitly do not aim to produce ‘yet another alignment method’ but rather seek to combine existing matching techniques and methods such as those developed within the context of the Ontology Alignment Evaluation Initiative (OAEI), in which different alignment methods can be combined using a workflow setup. The Amalgame Alignment server will feature:

  • A workflow composition functionality, where various alignment generators can be positioned. Their resulting mapping sets can be used as input for filtering methods, other alignment generators or combined into overlap sets.
  • A statistics function, where statistics for alignment sets will be shown
  • An evaluation tool, where subsets of alignments can be evaluated manually

Vocabulary and metadata workflow

The Amalgame toolkit realises the second step of a workflow specified by the Europeana Connect project for SKOSifying vocabularies and converting collection metadata into the EDM (Europeana Data Model). The first step, conversion of XML data into RDF, is supported by the XMLRDF tool.

Amalgame paper at TPDL 2011

We’re happy to announce our paper about Amalgame was accepted as a full paper for the International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011).

The extensive online appendix also contains a rich use case description.

I think you will want to grab the paper, which has the following abstract:

In many heritage institutes, objects are routinely described using terms from predefined vocabularies. When object collections need to be merged or linked, the question arises how those vocabularies relate. In practice it is often unclear for data providers how well alignment tools will perform on their specific vocabularies. This creates a bottleneck to align vocabularies, as data providers want to have tight control over the quality of their data. We will discuss the key limitations of current tools in more detail and propose an alternative approach. We will show how this approach has been used in two alignment use cases, and demonstrate how it is currently supported by our Amalgame alignment platform.

I am downloading/installing the software.

I am curious whether a similar approach, albeit without converting data into RDF, could be used to create alignments of unstructured vocabularies, along with reasons for the mappings between them.

My reasoning, in part, is that there are far more unstructured vocabularies whose access could be enhanced with mappings to other vocabularies, along with the reasons for those mappings.
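A toy sketch of that kind of alignment over plain (non-RDF) term lists, recording a human-readable "reason" with each candidate mapping; the vocabularies and the similarity threshold are arbitrary choices, and real alignment work would use stronger matchers than raw string similarity.

```python
# Toy vocabulary alignment sketch: propose mappings between two plain term
# lists and keep a human-readable reason for each proposed mapping.
from difflib import SequenceMatcher

vocab_a = ["automobile", "bicycle", "sailing ship"]
vocab_b = ["auto", "bike", "ship", "aircraft"]

def align(a_terms, b_terms, threshold=0.5):
    mappings = []
    for a in a_terms:
        best = max(b_terms, key=lambda b: SequenceMatcher(None, a, b).ratio())
        score = SequenceMatcher(None, a, best).ratio()
        if score >= threshold:
            mappings.append((a, best, f"string similarity {score:.2f}"))
    return mappings

for a, b, reason in align(vocab_a, vocab_b):
    print(a, "->", b, "|", reason)
```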

Vocabulary alignment, meaning and understanding in the world museum

Filed under: Vocabularies — Patrick Durusau @ 7:37 pm

vocabulary alignment, meaning and understanding in the world museum by Tim Wray.

From the post:

We live in a world of silos. Silos of data. Silos of culture. Linked Open Data aims to tear down these silos and create unity among the collections, their data and their meaning. The World Museum awaits us.

It comes to no surprise that I begin this post with such Romantic allusions. Our discussions of vocabularies – as technical behemoths and cultural artefacts – were lively and florid at a recent gathering of researchers library and museum professionals at LODLAM-NZ. Metaphors of time and tide – depicted beautifully in this companion post by Ingrid Mason, highlight issues of their expressive power of their meaning over time and across cultures. I present a very broad technical perspective on the matter beginning with a metaphor for what I believe represents the current state of digital cultural heritage : a world of silos.

Among these silos lie vocabularies that describe their collections and induce meaning to their objects. Originally employed to assist cataloguers and disambiguate terms, vocabularies have grown to encompass rich semantic information, often pertaining to the needs of that institution, their collection or their creator communities. Vocabularies themselves are cultural artefacts representing a snapshot of sense making. Like the objects that they describe, vocabularies can depict a range of substance from Cold War paranoia to escapist and consumerist Disneyfication. Inherent within them are the world views, biases, and focal points of their creators. An object’s source vocabulary should always be recorded as a significant part of it’s provenance. Welcome to the recursive hell of meta-meta-data.

Within the context of the museum, vocabularies form the backbone from which collection descriptions are tagged, catalogued or categorised. But there are many vocabularies, and the World Museum needs a universal language. LODLAM-NZ embraced the enthusiasm of a universal language but also understood the immense technical challenges that follow vocabulary alignment and, in many cases, natural language processing in general. However, if done successfully, alignment does a few great things: it normalises the labels that we assign to objects so that a unity of inferencing, reasoning and understanding can occur across vast swathes of collections; it can provide semantic context to those labels for even deeper, more compelling relations among the objects and it can be used to disambiguate otherwise flat or “unsemantified” meta-data, such as small free-text fields and social tags.

Vocabulary alignment is the process of putting two vocabularies side-by-side, finding the best matches, and joining the dots.

Tim’s message is not one of despair.

In fact, he describes how researchers have brought humans back into the picture, seeking to take advantage of what machines do best (simple matches) and what humans do better, more complex matching.

He references a paper and software that I will posting about separately that allow humans to refine mappings.

My only caution is that even human refinements are time and culture bound. That is, a refinement that is useful today is a time and cultural artifact (in the archaeological sense) that may need replacement even today for visitors from another culture, and certainly when the users are in a later time period.

That is, we need to build systems that manage (record/track?) changes in semantic meaning rather than attempting to create semantic edifices designed to hold back the tides of semantic change.

Extract meta concepts through co-occurrences analysis and graph theory

Filed under: Classification,co-occurrence,Indexing — Patrick Durusau @ 7:36 pm

Extract meta concepts through co-occurrences analysis and graph theory

Cristian Mesiano writes:

During The Christmas period I had finally the chance to read some papers about probabilistic latent semantic and its applications in auto classification and indexing.

The main concept behind “latent semantic” lays on the assumption that words that occurs close in the text are related to the same semantic construct.

Based on this principle the LSA (and partially also the PLSA ) builds a matrix to keep track of the co-occurrences of the words in text, and it assign a score to these co-occurrences considering the distribution in the corpus as well.

Often TF-IDF score is used to rank the words.

Anyway, I was wondering if this techniques could be useful also to extract key concepts from the text.

Basically I thought: “in LSA we consider some statistics over the co-occurrences, so: why not consider the link among the co-occurrences as well?”.
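A compact sketch of the idea Cristian describes: build a window-based co-occurrence graph and read candidate "meta concepts" off its most central nodes. The sample text, window size, and use of PageRank for centrality are placeholder choices of mine, not Cristian's code.

```python
# Window-based co-occurrence graph sketch; centrality picks out candidate
# "meta concepts". Sample text, window size, and scoring are placeholders.
import re
from itertools import combinations
import networkx as nx

text = "media networks shape society and society shapes media networks"
tokens = re.findall(r"[a-z]+", text.lower())

G = nx.Graph()
window = 3
for i in range(len(tokens) - window + 1):
    for a, b in combinations(set(tokens[i:i + window]), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)  # count co-occurrences as weights

ranked = sorted(nx.pagerank(G, weight="weight").items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[:5])
```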

Using the first three chapters of “The Media in the Network Society, author: Gustavo Cardoso,” Cristian creates a series of graphs.

Cristian promises his opinion on classification of texts using this approach.

In the meantime, what’s yours?

A Taxonomy of Ideas?

Filed under: Language,Taxonomy — Patrick Durusau @ 7:35 pm

A Taxonomy of Ideas?

David McCandless writes:

Recently, when throwing ideas around with people, I’ve noticed something. There seems to be a hidden language we use when evaluating ideas.

Neat idea. Brilliant idea. Dumb idea. Bad idea. Strange idea. Cool idea.

There’s something going on here. Each one of these ideas is subtly different in character. Each adjective somehow conveys the quality of the concept in a way we instantly and unconsciously understand.

Good point. There is always a “hidden language” that will be understood by members of a social group. But that also means that the “hidden language” and its implications will not be understood, at least not in the same way, by people in other groups.

That same “hidden language” shapes our choices of subjects out of a grab bag of subjects (a particular data set if not the world).

We can name some things that influence our choices, but it is always far from being all of them. Which means that no set of rules will always lead to the choices we would make. We are incapable of specifying the rules in the required degree of detail.

ToChildBlockJoinQuery in Lucene

Filed under: Lucene,Search Engines — Patrick Durusau @ 7:34 pm

ToChildBlockJoinQuery in Lucene.

Mike McCandless writes:

In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.

This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.

This will be included in Lucene 3.6.0 and 4.0.

Custom Search JavaScript API is now fully documented!

Filed under: Google CSE,Javascript,Search Engines,Searching — Patrick Durusau @ 7:33 pm

Custom Search JavaScript API is now fully documented!

Riona MacNamara writes:

The Custom Search engineers spent 2011 launching great features. But we still hear from our users that our documentation could do with improvement. We hear you. Today we’re launching some updates to our docs:

  • Comprehensive JavaScript reference for the Custom Search Element. We’ve completely overhauled our Custom Search Element API documentation to provide a comprehensive overview of all the JavaScript methods available. We can’t wait to see what you build with it.
  • More languages. The Help Center is now available in Danish, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Spanish, and Swedish.
  • Easier navigation and cleaner design. We’ve reorganized the Help Center to make it easier to find the information you’re looking for. Navigation is simpler and more streamlined. Individual articles have been revised and updated, and designed to be more readable.

Documentation is an ongoing effort, and we’ll be continuing to improve both our Help Center and our developer documentation. If you have comments or suggestions, we’d love to see them in our user forum.

Granting that a Google CSE could give you more focused results (along with ads), don’t you still have the problem of re-using search results?

It’s a good thing that users can more quickly find relevant content in a particular domain, but do you really want your users searching for the same information over and over again?

Hmmm, what if you kept a search log with the “successful” results as chosen by your users? That could be a start in terms of locating subjects and information about them that is important to your users. Subjects that could then be entered into your topic map.
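A small sketch of that bookkeeping (the log records and field layout are invented for illustration): count which results users actually chose for each query and surface the winners as candidate subjects for the topic map.

```python
# Sketch: mine a click log for "successful" results per query.
# The log records and field layout are invented for illustration.
from collections import Counter, defaultdict

click_log = [
    ("neo4j tutorial", "https://example.org/neo4j-intro"),
    ("neo4j tutorial", "https://example.org/neo4j-intro"),
    ("neo4j tutorial", "https://example.org/graph-db-list"),
    ("topic maps", "https://example.org/tmdm-overview"),
]

clicks_by_query = defaultdict(Counter)
for query, clicked_url in click_log:
    clicks_by_query[query][clicked_url] += 1

# The most-chosen result per query is a candidate subject for the topic map.
for query, counts in clicks_by_query.items():
    url, n = counts.most_common(1)[0]
    print(f"{query!r}: candidate subject {url} ({n} clicks)")
```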

January 13, 2012

Duke 0.4

Filed under: Deduplication,Entity Resolution,Record Linkage — Patrick Durusau @ 8:17 pm

Duke 0.4

New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.

From the release notes:

This version of Duke introduces:

  • Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
  • In-memory data source added (thanks to FMitzlaff).
  • Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
  • Record linkage API refactored slightly to be more flexible (with FMitzlaff).
  • Added utilities for building equivalence classes from Duke output.
  • Made the XML config loader more robust.
  • Added a special cleaner for English person names.
  • Fixed bug in NumericComparator (issue 66).
  • Uses own Lucene query parser to avoid issues with search strings.
  • Upgraded to Lucene 3.5.0.
  • Added many more tests.
  • Many small bug fixes to core, NTriples reader, etc.

BTW, the documentation is online only: http://code.google.com/p/duke/wiki/GettingStarted.

scikit-learn 0.10

Filed under: Machine Learning,Python — Patrick Durusau @ 8:16 pm

scikit-learn 0.10

With a list of 27 items that include words like “new,” “added,” “fixed,” “refactored,” etc., you know this is a change log you want to do more than skim.

In case you have been under a programming rock somewhere, scikit-learn is a Python machine learning library. (See the scikit-learn homepage.)

Meronymy SPARQL Database Server

Filed under: RDF,SPARQL — Patrick Durusau @ 8:15 pm

Meronymy SPARQL Database Server

Inge Henriksen writes:

We are pleased to announce that the Meronymy SPARQL Database Server is ready for release later in 2012. Those interested in our RDF database server software should consider registering today; those that do get exclusive early access to beta software in the upcoming closed beta testing period, insider news on the development progress, get to submit feature requests, and otherwise directly influence the finished product.

From the FAQ we learn some details:

A: All components in the database server and its drivers have been programmed from scratch so that we could optimize them in terms of their performance.
We developed the database server in C++ since this programming language has the most potential for optimalization, there are also some inline assembly at key locations in the programming code.
Some more components that makes our database management system very fast:

  • In-process query optimizer; determines the most efficient way to execute a query.
  • In-process memory manager; for much faster memory allocation and deallocation than the operating system can provide.
  • In-process multithreaded HTTP server; for much faster SPARQL Protocol endpoint than through a standard out-of-process web server.
  • In-process multithreaded TCP/IP-listener with thread pooling; for efficient thread management.
  • In-process directly coded lexical analyzer; for efficient query parsing.
  • Snapshot isolation; for fast transaction processing.
  • B+ trees; for fast indexing
  • In-process stream-oriented XML parser; for fast RDF/XML parsing.
  • A RDF data model; for no data model abstraction layers which results in faster processing of data.

I’m signing up for the beta. How about you?

OpenSearch.org

Filed under: OpenSearch.org,Search Engines,Searching — Patrick Durusau @ 8:14 pm

OpenSearch.org

From the webpage:

OpenSearch is a collection of simple formats for the sharing of search results.

The OpenSearch description document format can be used to describe a search engine so that it can be used by search client applications.

The OpenSearch response elements can be used to extend existing syndication formats, such as RSS and Atom, with the extra metadata needed to return search results.

I like this line from the FAQ:

Different types of content require different types of search engines. The best search engine for a particular type of content is frequently the search engine written by the people that know the content the best.

Not a lot of people are using the OpenSearch description document, but I wonder if you could write something similar for websites? That is, declare for a website (your own or someone else’s) what is to be found there, or what vocabulary is used there.
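To make that concrete, here is a sketch that builds an OpenSearch-style description for a site and bolts on a vocabulary hint. The UsesVocabulary element and its namespace are pure invention on my part to illustrate the idea; they are not part of the OpenSearch specification.

```python
# Sketch: an OpenSearch description document plus a *hypothetical*
# vocabulary declaration. The "vocab" element and namespace are invented
# here; they are not part of the OpenSearch specification.
import xml.etree.ElementTree as ET

OS_NS = "http://a9.com/-/spec/opensearch/1.1/"
VOCAB_NS = "http://example.org/site-vocabulary#"  # invented namespace

ET.register_namespace("", OS_NS)
ET.register_namespace("vocab", VOCAB_NS)

root = ET.Element(f"{{{OS_NS}}}OpenSearchDescription")
ET.SubElement(root, f"{{{OS_NS}}}ShortName").text = "Example Site Search"
ET.SubElement(root, f"{{{OS_NS}}}Description").text = "Search of example.org"
ET.SubElement(root, f"{{{OS_NS}}}Url",
              type="text/html",
              template="https://example.org/search?q={searchTerms}")

# Hypothetical extension: declare the controlled vocabulary the site uses.
ET.SubElement(root, f"{{{VOCAB_NS}}}UsesVocabulary",
              href="https://example.org/vocabulary/astronomy-terms")

print(ET.tostring(root, encoding="unicode"))
```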

With just a small boost from their users, search engines could do a lot better in terms of producing sane results.

I will have to think about a way to test the declaration of a vocabulary for a website, or even a group of websites, and compare that to a search without the benefit of such a vocabulary. Suggestions welcome!

Cyberspace Science and Technology RFI (U.S. Air Force)

Filed under: Military — Patrick Durusau @ 8:13 pm

Cyberspace Science and Technology RFI (U.S. Air Force)

Response Date: February 24, 2012 4 pm Eastern.

From the background information:

The Air Force is requesting information on revolutionary cyberspace science and technologies that address the challenge of future Air Force cyberspace needs in cyberspace exploitation, defense, and operations for potential inclusion in the Air Force Cyber Vision 2025 study. Cyber Vision 2025 is a study to create an integrated, Air Force-wide, near-, mid- and far-term S&T vision to advance revolutionary cyber capabilities to support core Air Force missions. Cyber Vision 2025 will identify state of the art S&T and best practices in government and the private sector. It will analyze current and forecasted capabilities, threats, vulnerabilities, and consequences across core AF missions to identify key S&T gaps and opportunities. It will articulate an AF near- (FY2012-15), mid- (FY2016-20) and far-term (FY2021-25) S&T vision to fill gaps, indicating where AF should lead (creating or inventing novel solutions for core AF missions), follow (by adopting, adapting, or augmenting others investments), or watch key technologies. In alignment with the national security cyber strategy, the study is intended to address cyber S&T across Air Force core missions (air, space, cyber, and Command and Control Intelligence, Surveillance and Reconnaissance (C2ISR)) including DOTMLPF (Doctrine, Organization, Training, Materiel, Leadership and Education, Personnel and Facilities) considerations, engaging with industry, academia, national laboratories, Federally Funded Research and Development Centers (FFRDCs), University Affiliated Research Centers (UARCs), and government to leverage capabilities and experience.

The ability to make sense out of big (heterogeneous) data should qualify as one aspect of supporting core Air Force missions.

Read the RFI, plus other suggested documents and see what you think.

Map-Reduce-Merge

Filed under: MapReduceMerge,Peregrine — Patrick Durusau @ 8:12 pm

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters by Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao and D. Stott Parker.

Abstract:

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning.

However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.

We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

As of today, I count sixty-three (63) citations of this paper. I just discovered it today and it is going to take some time to work through all the citing materials and then materials that cite those papers.

The Peregrine software I mentioned in another post implements this map-reduce-merge framework.
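A toy rendering of the three phases on two tiny "relations" (the data, keys, and the equi-join in the merge step are invented for illustration), just to show where the extra merge phase sits:

```python
# Toy Map-Reduce-Merge sketch: map and reduce each dataset by key, then
# merge (here, an equi-join) the two reduced, key-partitioned results.
# Data and keys are invented for illustration.
from collections import defaultdict

orders = [("cust1", 10), ("cust2", 5), ("cust1", 7)]   # (key, amount)
customers = [("cust1", "Ada"), ("cust2", "Grace")]     # (key, name)

def map_reduce(records, reduce_fn):
    grouped = defaultdict(list)
    for key, value in records:                         # map: emit (key, value)
        grouped[key].append(value)
    return {k: reduce_fn(vs) for k, vs in grouped.items()}  # reduce per key

order_totals = map_reduce(orders, sum)
names = map_reduce(customers, lambda vs: vs[0])

# merge: walk the two reduced outputs on the shared key.
merged = {k: (names[k], order_totals[k]) for k in names if k in order_totals}
print(merged)  # {'cust1': ('Ada', 17), 'cust2': ('Grace', 5)}
```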

Peregrine

Filed under: MapReduce,MapReduceMerge,Peregrine — Patrick Durusau @ 8:11 pm

Peregrine

From the webpage:

Peregrine is a map reduce framework designed for running iterative jobs across partitions of data. Peregrine is designed to be FAST for executing map reduce jobs by supporting a number of optimizations and features not present in other map reduce frameworks.

Among its many “modern” features, Peregrine includes “MapReduceMerge style computations including a new merge() operation.”

I will have a separate blog entry on a paper describing MapReduceMerge computations for heterogeneous data sets.

This looks very important for the future of topic maps in a big data (heterogeneous) universe.

First steps in data visualisation using d3.js, by Mike Dewar

Filed under: JSON,Visualization — Patrick Durusau @ 8:11 pm

First steps in data visualisation using d3.js, by Mike Dewar

Drew Conway writes:

Last night Mike Dewar presented a wonderful talk to the New York Open Statistical Programming Meetup titled, “First steps in data visualisation using d3.js.” Mike took the audience through an excellent review of d3.js fundamentals, as well as showed off some of the features of working with Chrome Web Developer Tools. This is one of the best talks we have ever had, and if you have had any interest in exploring d3.js, but were intimidated by the design concepts or syntax, this is exactly the talk for you.

Follow the link to Drew’s blog to see the video or link to Mike’s slides (a product of d3.js).

This is an impressive presentation but I hesitated before making this post since Mike refers to XML as “clunky.” 😉 Oh, well, the rest of the presentation made up for it. Sorta. The audio quality leaves something to be desired as Mike wanders away from the microphone.

BTW, presentation question: What’s wrong with the bar chart in Mike’s first example? I count at least two. How many do you see?

January 12, 2012

A Backstage Tour of ggplot2 with Hadley Wickham

Filed under: Ggplot2,Graphics,Visualization — Patrick Durusau @ 7:35 pm

A Backstage Tour of ggplot2 with Hadley Wickham

Details:

Date: Wednesday, February 8, 2012
Time: 11:00AM – 12:00PM Pacific Time
Presenter: Hadley Wickham, Professor of Statistics, Rice University

From the webpage:

GGplot2 is one of R’s most popular, widely used packages, developed by Rice University’s Hadley Wickham. Ggplot2’s exploratory graphics capabilities are driving the use of R as a complement to legacy analytics tools such as SAS. SAS is well-regarded for its strength in data management and “production” statistics, where you know what you want to do and need to do it repeatedly. On the other hand, R is strong in data analysis and exploration in situations where figuring out what is needed is the biggest challenge. In this important way, SAS and R are strong companions.

This webinar will provide an all-access pass to Hadley’s latest work. He’ll discuss:

  • A brief overview of ggplot2, and how it’s different to other plotting systems
  • A sneak peek at some of the new features coming to the next version of ggplot2
  • What’s been learned about good development practices in the 5 years since first starting to develop ggplot
  • Some of the internals of ggplot2, and talk about how he is gradually making it easier for others to contribute

Join this webinar to understand how ggplot2 adds valuable, unmatched capabilities to your analytics toolbox.

Introducing DataFu: an open source collection of useful Apache Pig UDFs

Filed under: DataFu,Hadoop,MapReduce,Pig — Patrick Durusau @ 7:34 pm

Introducing DataFu: an open source collection of useful Apache Pig UDFs

From the post:

At LinkedIn, we make extensive use of Apache Pig for performing data analysis on Hadoop. Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. For more advanced tasks, Pig also supports User Defined Functions (UDFs), which let you integrate custom code in Java, Python, and JavaScript into your Pig scripts.

Over time, as we worked on data intensive products such as People You May Know and Skills, we developed a large number of UDFs at LinkedIn. Today, I’m happy to announce that we have consolidated these UDFs into a single, general-purpose library called DataFu and we are open sourcing it under the Apache 2.0 license:

Check out DataFu on GitHub!

DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.
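To give a flavor of what such UDFs do, here is a plain-Python sketch of a per-bag quantile, the kind of statistic DataFu exposes to Pig. It is only an illustration: it is not DataFu's implementation, and the Pig registration and schema declarations a real UDF needs are omitted.

```python
# Plain-Python sketch of a per-bag quantile, the kind of statistic DataFu
# provides as a Pig UDF. Illustration only; not DataFu's implementation,
# and Pig registration/schema plumbing is omitted.
def quantile(values, q):
    """Nearest-rank quantile of an iterable of numbers, 0 <= q <= 1."""
    ordered = sorted(values)
    if not ordered:
        return None
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

bag = [3, 1, 4, 1, 5, 9, 2, 6]
print(quantile(bag, 0.5))  # middle of the bag
print(quantile(bag, 0.9))  # upper tail
```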

This is way cool!

Read the rest of Matthew’s post (link above) or get thee to GitHub!

TEXT RETRIEVAL CONFERENCE (TREC) 2012

Filed under: Conferences,TREC — Patrick Durusau @ 7:33 pm

TEXT RETRIEVAL CONFERENCE (TREC) 2012 February 2012 – November 2012.

Schedule:

As soon as possible — submit your application to participate in TREC 2012 as described below.

Submitting an application will add you to the active participants’ mailing list. On Feb 23, NIST will announce a new password for the “active participants” portion of the TREC web site.

Beginning March 1

Document disks used in some existing TREC collections distributed to participants who have returned the required forms. Please note that no disks will be shipped before March 1.

July–August

Results submission deadline for most tracks. Specific deadlines for each track will be included in the track guidelines, which will be finalized in the spring.

September 30 (estimated)

Relevance judgments and individual evaluation scores due back to participants.

Nov 6-9

TREC 2012 conference at NIST in Gaithersburg, Md. USA

Applications:

Organizations wishing to participate in TREC 2012 should respond to this call for participation by submitting an application. Participants in previous TRECs who wish to participate in TREC 2012 must submit a new application. To apply, follow the instructions at

http://ir.nist.gov/trecsubmit.open/application.html

to submit an online application. The application system will send an acknowledgement to the email address supplied in the form once it has processed the form.

Conference blurb:

The Text Retrieval Conference (TREC) workshop series encourages research in information retrieval and related applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Now in its 21st year, the conference has become the major experimental effort in the field. Participants in the previous TREC conferences have examined a wide variety of retrieval techniques and retrieval environments, including cross-language retrieval, retrieval of web documents, multimedia retrieval, and question answering. Details about TREC can be found at the TREC web site, http://trec.nist.gov.

You are invited to participate in TREC 2012. TREC 2012 will consist of a set of tasks known as “tracks”. Each track focuses on a particular subproblem or variant of the retrieval task as described below. Organizations may choose to participate in any or all of the tracks. Training and test materials are available from NIST for some tracks; other tracks will use special collections that are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but the conditions of participation specifically preclude any advertising claims based on TREC results. All retrieval results submitted to NIST are published in the Proceedings and are archived on the TREC web site. The workshop in November is open only to participating groups that submit retrieval results for at least one track and to selected government invitees.

Look at the data sets and tracks. This is not a venture for the faint of heart.

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Filed under: Common Crawl,Hadoop,MapReduce — Patrick Durusau @ 7:32 pm

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

From the post:

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!
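If you would rather sketch the same "Hello World" in Python before tackling the post's Java walkthrough, an mrjob version looks roughly like this. The class name is mine, and pointing it at the Common Crawl data still requires the S3/EMR setup the post describes.

```python
# Canonical word-count sketch with mrjob; the class name is ours, and the
# Common Crawl S3/EMR configuration from the post is still required to run
# it against the actual crawl data.
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every token on the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```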

Any takers?

It will be this weekend but I will be reporting back next Monday.
