Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 26, 2013

Register To Watch? Let’s Not.

Filed under: ElasticSearch — Patrick Durusau @ 5:55 pm

webinar: Getting Started with Elasticsearch by Drew Raines.

I am sure Drew does a great job in this webinar. Just as I am sure if you really are a search newbie, it would be useful to you.

But let’s all start passing on the “register to watch/download” dance.

If they want to count views or downloads, hell, YouTube does that (without registering).

Many people have given “skip registering to view” advice before and many will in the future.

Mark me down as just one more.

Do blog about your decision and which “register to view” you decided to skip.

If that happens enough times, maybe marketing departments will look elsewhere for spam addresses.

NoSQL: Data Grids and Graph Databases

Filed under: Graph Databases,Neo4j — Patrick Durusau @ 2:27 pm

NoSQL: Data Grids and Graph Databases by Al Rubinger.

Chapter Six of Continuous Enterprise Development in Java by Andrew Lee Rubinger and Aslak Knutsen. Accompanying website.

From chapter six:

Until relatively recently, the RDBMS reigned over data in enterprise applications by a wide margin when contrasted with other approaches. Commercial offerings from Oracle and established open-source projects like MySQL (reborn MariaDB) and PostgreSQL became de facto choices when it came to storing, querying, archiving, accessing, and obtaining data. In retrospect, it’s shocking that given the varying requirements from those operations, one solution was so heavily lauded for so long.

In the late 2000s, a trend away from the strict ACID transactional properties could be clearly observed given the emergence of data stores that organized information differently from the traditional table model:

  • Document-oriented
  • Object-oriented
  • Key/Value stores
  • Graph models

In addition, many programmers were beginning to advocate for a release from strict transactions; in many use cases it appeared that this level of isolation wasn’t enough of a priority to warrant the computational expense necessary to provide ACID guarantees.

No, what’s shocking is the degree of historical ignorance among people who criticize RDBMS systems. Either that, or they are simply parroting what other ignorant people are saying about RDBMS systems.

Don’t get me wrong, I strongly prefer NoSQL solutions in some cases. But it is a question of requirements and not making up tales about RDBMS systems.

For example, in A transient hypergraph-based model for data access Carolyn Watters and Michael A. Shepherd write:

Two major methods of accessing data in current database systems are querying and browsing. The more traditional query method returns an answer set that may consist of data values (DBMS), items containing the answer (full text), or items referring the user to items containing the answer (bibliographic). Browsing within a database, as best exemplified by hypertext systems, consists of viewing a database item and linking to related items on the basis of some attribute or attribute value. A model of data access has been developed that supports both query and browse access methods. The model is based on hypergraph representation of data instances. The hyperedges and nodes are manipulated through a set of operators to compose new nodes and to instantiate new links dynamically, resulting in transient hypergraphs. These transient hypergraphs are virtual structures created in response to user queries, and lasting only as long as the query session. The model provides a framework for general data access that accommodates user-directed browsing and querying, as well as traditional models of information and data retrieval, such as the Boolean, vector space, and probabilistic models. Finally, the relational database model is shown to provide a reasonable platform for the implementation of this transient hypergraph-based model of data access. (Emphasis added.)

Oh, did I say that paper was written in 1990, some twenty-three years ago?

So twenty-three (23) years ago that bad old RDBMS model was capable of implementing a hypergraph.

A hypergraph that had, wait for it, true hyperedges, not the faux hyperedges claimed by some graph databases.

It’s that lack of accuracy that makes me wonder what else has been missed.
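
To make the point concrete, here is a minimal sketch, mine and not from the Watters and Shepherd paper, of how a relational platform can carry true hyperedges, that is, edges joining any number of nodes. Table and column names are invented for the example.

```python
import sqlite3

# A hyperedge may join any number of nodes, so the edge is a row of its
# own and membership lives in a join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node      (node_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE hyperedge (edge_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE member    (edge_id INTEGER REFERENCES hyperedge,
                            node_id INTEGER REFERENCES node);
""")

# One hyperedge connecting three nodes at once -- something a plain
# binary edge table cannot express directly.
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "author"), (2, "paper"), (3, "journal")])
conn.execute("INSERT INTO hyperedge VALUES (1, 'published-in')")
conn.executemany("INSERT INTO member VALUES (1, ?)", [(1,), (2,), (3,)])

# Every node touched by the hyperedge.
rows = conn.execute("""
    SELECT n.label
    FROM member m JOIN node n ON n.node_id = m.node_id
    WHERE m.edge_id = 1
""").fetchall()
print([r[0] for r in rows])   # e.g. ['author', 'paper', 'journal']
```

Transient hypergraphs, in the paper's sense, would be built by composing such rows at query time rather than storing them permanently.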

CCNx®

Filed under: CCNx,Networks — Patrick Durusau @ 1:44 pm

CCN® (Content-Centric Networking)

From the about page:

Project CCNx® exists to develop, promote, and evaluate a new approach to communication architecture we call content-centric networking. We seek to carry out this mission by creating and publishing open protocol specifications and an open source software reference implementation of those protocols. We provide support for a community of people interested in experimentation, research, and building applications with this technology, all contributing to its evolution.

Research Origins and Current State

CCNx technology is still at an early stage of development, with pure infrastructure and no applications, best suited to researchers and adventurous network engineers or software developers. If you’re looking for cool applications ready to download and use, you are a little too early.

Project CCNx is sponsored by the Palo Alto Research Center (PARC) and is based upon the PARC Content-Centric Networking (CCN) architecture, which is the focus of a major, long-term research and development program. There are interesting problems in many areas still to be solved to fully realize and apply the vision, but we believe that enough of an architectural foundation is in place to enable significant experiments to begin. Since this new approach to networking can be deployed through middleware software communicating in an overlay on existing networks, it is possible to start applying it now to solve communication problems in new ways. Project CCNx is an invitation to join us and participate in this exploration of the frontier of content networking.

An odd echo of my post earlier today on HSA – Heterogeneous System Architecture, where heterogeneous processors share the same data.

The abstract from the paper, Networking Named Content by Van Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass, Nicholas H. Briggs and Rebecca L. Braynard (2009), gives a good overview:

Network use has evolved to be dominated by content distribution and retrieval, while networking technology still speaks only of connections between hosts. Accessing content and services requires mapping from the what that users care about to the network’s where. We present Content-Centric Networking (CCN) which treats content as a primitive – decoupling location from identity, security and access, and retrieving content by name. Using new approaches to routing named content, derived heavily from IP, we can simultaneously achieve scalability, security and performance. We implemented our architecture’s basic features and demonstrate resilience and performance with secure file downloads and VoIP calls.

I rather like that: “…requires mapping from the what that users care about to the network’s where.”

As a user I don’t care nearly as much where content is located as I do about the content itself.

Do you?

You may have to get out your copy of TCP/IP Illustrated by W. Richard Stevens, but it will be worth the effort.

I haven’t gone over all the literature, but I haven’t seen any mention of the same data originating from multiple addresses. Not the caching of content, which is pretty obvious, but the same named content at different locations.

That raises the usual content semantics issues, plus the question of being able to say that two or more named contents are the same content.
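
A toy sketch of the “what, not where” idea (my illustration, not CCNx code): content is requested by name, and any node holding a copy can satisfy the request. The store names and content names below are made up, and real CCN involves interests, routing, and signed data.

```python
# Toy content stores keyed by name rather than by host.
content_stores = {
    "cache-tokyo":   {"/acme/videos/intro.mp4": b"...bytes..."},
    "cache-paris":   {"/acme/videos/intro.mp4": b"...bytes..."},
    "origin-server": {"/acme/videos/intro.mp4": b"...bytes...",
                      "/acme/docs/readme.txt":  b"hello"},
}

def fetch_by_name(name):
    """Return the content for a name from whichever store can answer."""
    for location, store in content_stores.items():
        if name in store:
            return location, store[name]
    raise KeyError(name)

where, data = fetch_by_name("/acme/videos/intro.mp4")
print(where)   # the consumer asked for a name, never a host
```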

Kindred Britain

Filed under: D3,Genealogy,Geography,History,PHP,PostgreSQL — Patrick Durusau @ 12:48 pm

Kindred Britain by Nicholas Jenkins, Elijah Meeks and Scott Murray.

From the website:

Kindred Britain is a network of nearly 30,000 individuals — many of them iconic figures in British culture — connected through family relationships of blood, marriage, or affiliation. It is a vision of the nation’s history as a giant family affair.

A quite remarkable resource.

Family relationships connecting people, a person’s relationship to geographic locations and a host of other associated details for 30,000 people await you!

From the help page:

ESSAYS

Originating Kindred Britain by Nicholas Jenkins

Developing Kindred Britain by Elijah Meeks and Karl Grossner

Designing Kindred Britain by Scott Murray

Kindred Britain: Statistics by Elijah Meeks

GENERAL INFORMATION

User’s Guide by Hannah Abalos and Nicholas Jenkins

FAQs

Glossary by Hannah Abalos and Emma Townley-Smith

Acknowledgements

Terms of Use

If you notice a problem with the site or have a question or copyright concern, please contact us at kindredbritain@stanford.edu

An acronym that may puzzle you: ODNB – Oxford Dictionary of National Biography.

In Developing Kindred Britain you will learn that Kindred Britain has no provision for reader annotation or contribution of content.

Given a choice between the rich presentation and capabilities of Kindred Britain, which required several technical innovations, and a less capable site that allowed reader annotation, I would always choose the former over the latter.

You should forward the link to Kindred Britain to anyone working on robust exploration and display of data, academic or otherwise.

Apache Hadoop 2 (beta)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:24 am

Announcing Beta Release of Apache Hadoop 2 by Arun Murthy.

From the post:

It’s my great pleasure to announce that the Apache Hadoop community has declared Hadoop 2.x as Beta with the vote closing over the weekend for the hadoop-2.1.0-beta release.

As noted in the announcement to the mailing lists, this is a significant milestone across multiple dimensions: not only is the release chock-full of significant features (see below), it also represents a very stable set of APIs and protocols on which we can continue to build for the future. In particular, the Apache Hadoop community has spent an enormous amount of time paying attention to stability and long-term viability of our APIs and wire protocols for both HDFS and YARN. This is very important as we’ve already seen a huge interest in other frameworks (open-source and proprietary) move atop YARN to process data and run services *in* Hadoop.

It is always nice to start the week with something new.

Your next four steps:

  1. Download and install Hadoop 2.
  2. Experiment with and use Hadoop 2.
  3. Look for and report bugs (and fixes if possible) for Hadoop 2.
  4. Enjoy!

Third Age of Computing?

Filed under: Architecture,Computation,CUDA,HSA,NVIDIA,Systems Research — Patrick Durusau @ 10:07 am

The ‘third era’ of app development will be fast, simple, and compact by Rik Myslewski.

From the post:

The tutorial was conducted by members of the HSA – heterogeneous system architecture – Foundation, a consortium of SoC vendors and IP designers, software companies, academics, and others including such heavyweights as ARM, AMD, and Samsung. The mission of the Foundation, founded last June, is “to make it dramatically easier to program heterogeneous parallel devices.”

As the HSA Foundation explains on its website, “We are looking to bring about applications that blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing of DSP via high bandwidth shared memory access with greater application performance at low power consumption.”

Last Thursday, HSA Foundation president and AMD corporate fellow Phil Rogers provided reporters with a pre-briefing on the Hot Chips tutorial, and said the holy grail of transparent “write once, use everywhere” programming for shared-memory heterogeneous systems appears to be on the horizon.

According to Rogers, heterogeneous computing is nothing less than the third era of computing, the first two being the single-core era and the multi-core era. In each era of computing, he said, the first programming models were hard to use but were able to harness the full performance of the chips.

(…)

Exactly how HSA will get there is not yet fully defined, but a number of high-level features are accepted. Unified memory addressing across all processor types, for example, is a key feature of HSA. “It’s fundamental that we can allocate memory on one processor,” Rogers said, “pass a pointer to another processor, and execute on that data – we move the compute rather than the data.”
(…)

Rik does a deep dive, with references ranging from the HSA Programmer’s Reference Manual to Project Sumatra, which aims to bring data-parallel algorithms to Java 9 (2015).

The only discordant note is that Nvidia and Intel are both missing from the HSA Foundation. Invited but not present.

Customers of Nvidia and/or Intel (I’m both) should contact Nvidia (Contact us) and Intel (contact us) and urge them to join the HSA Foundation. And pass this request along.

Sharing of memory is one of the advantages of HSA (heterogeneous system architecture), and it is where the semantics of shared data will come to the fore.

I haven’t read the available HSA documents in detail, but the HSA Programmer’s Reference Manual appears to presume that shared data has only one semantic. (It never says that but that is my current impression.)

We have seen that the semantics of data is not “transparent.” The same demonstration illustrates that data doesn’t always have the same semantic.

Simply because I am pointed to a particular memory location, there is no reason to presume I should approach that data with the same semantics.

For example, what if I have a Social Security Number (SSN)? In processing that number for the Social Security Administration, it may serve to recall claim history, eligibility, etc. If I am accessing the same data to compare it to SSN records maintained by the Federal Bureau of Investigation (FBI), it may no longer be a unique identifier in the same sense as at the SSA.

Same “data,” but different semantics.

Who you gonna call? Topic Maps!

PS: Perhaps not as part of the running code but to document the semantics you are using to process data. Same data, same memory location, multiple semantics.
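
A minimal sketch of that last point, assuming nothing about the actual SSA or FBI systems: the same stored value, read from the same place, carries different semantics depending on who is processing it. The agencies, field names, and SSN value are all illustrative.

```python
# Same "data," different semantics: identical bytes read from the same
# record, interpreted under two contexts.
record = {"ssn": "123-45-6789"}   # made-up value

def ssa_view(rec):
    # At the SSA, the number keys claim history and eligibility.
    return {"claim_key": rec["ssn"], "treated_as_unique": True}

def fbi_view(rec):
    # Matched against FBI records, the same string may collide with
    # stolen or duplicated numbers: evidence, not identity.
    return {"candidate_identifier": rec["ssn"], "treated_as_unique": False}

print(ssa_view(record))
print(fbi_view(record))
```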

DHS Bridging Siloed Databases [Comments?]

Filed under: Database,Silos,Topic Maps — Patrick Durusau @ 8:51 am

DHS seeks to bridge siloed databases by Adam Mazmanian.

From the post:

The Department of Homeland Security plans to connect databases containing information on legal foreign visitors as a prototype of a system to consolidate identity information from agency sources. The prototype is a first step in what could turn into comprehensive records overhaul that would erase lines between the siloed databases kept by DHS component agencies.

Currently, DHS personnel can access information from across component databases under the “One DHS” policy, but access can be hindered by the need to log into multiple systems and make multiple queries. The Common Entity Index (CEI) prototype pulls biographical information from DHS component agencies and correlates the data into a single comprehensive record. The CEI prototype is designed to find linkages inside source data – names and addresses as well as unique identifiers like passport and alien registration numbers – and connect the dots automatically, so DHS personnel do not have to.

DHS is trying to determine whether it is feasible to create “a centralized index of select biographic information that will allow DHS to provide a consolidated and correlated record, thereby facilitating and improving DHS’s ability to carry out its national security, homeland security, law enforcement, and benefits missions,” according to a notice in the Aug. 23 Federal Register.
(…) (emphasis added)

Adam goes on to summarize the data sources that DHS wants to include in its “centralized index of select biographic information.”

There isn’t enough information in the Federal Register notice to support technical comments on the prototype.

However, some comments about subject identity and the role of topic maps in collating information from diverse resources would not be inappropriate.

Especially since all public comments are made visible at: http://www.regulations.gov.
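
For a sense of what “correlates the data into a single comprehensive record” means in practice, here is a rough sketch of linking records from two silos on a shared identifier. All field names and values are invented.

```python
from collections import defaultdict

# Invented sample records from two hypothetical component systems.
silo_a = [{"name": "A. Example", "passport": "X1234567"}]
silo_b = [{"full_name": "Alice Example", "passport_no": "X1234567",
           "alien_reg": "A99999999"}]

# Correlate on a shared identifier (here, a passport number).
index = defaultdict(list)
for rec in silo_a:
    index[rec["passport"]].append(("silo_a", rec))
for rec in silo_b:
    index[rec["passport_no"]].append(("silo_b", rec))

# Identifiers seen in more than one silo become consolidated records.
consolidated = {k: v for k, v in index.items() if len(v) > 1}
print(consolidated)
```

Deciding when two such identifiers actually identify the same subject, and in what scope, is exactly where topic maps have something to say.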

August 25, 2013

Big Data Sets you can use with R

Filed under: BigData,R — Patrick Durusau @ 7:36 pm

Big Data Sets you can use with R by Joseph Rickert.

From the post:

The world may indeed be awash with data, however, it is not always easy to find a suitable data set when you need one. As the number of people becoming involved with R and data science increases so does the need for interesting data sets for creating examples, showcasing machine learning algorithms and developing statistical analyses. The most difficult data sets to find are those that would provide the foundation for impressive big data examples: data sets with a 100 million rows and hundreds of variables. The problem with big data, however, is that most of it is proprietary and locked away. Consequently, when constructing examples it is often necessary to “make do” with data sets that are considerably smaller than an analyst is likely to be faced with in practice. To help with this problem, we have added some new data sets to the lists of data sets on inside-r.org that we began keeping almost two years ago. So, if you are looking for a sample data set or if you are the kind of person who enjoys browsing data repositories as some people enjoy browsing bookstores, have a look at what is available there. The following presents some of the highlights.

Joseph highlights airline, medicare, and Australian weather data sets.

There are a number of other data sets, but more would be appreciated by inside-r.org.

Better synonym handling in Solr

Filed under: Solr,Synonymy — Patrick Durusau @ 7:05 pm

Better synonym handling in Solr by Nolan Lawson.

A very deep dive into synonym handling in Solr, along with a proposed fix.

The problems Nolan uncovers are now in a JIRA issue, SOLR-4381.

And Nolan has a Github repository with his proposed fix.

The Solr JIRA lists the issue as still “open.”

Start with the post and then go onward to the JIRA issue and Github repository. I say that because Nolan does a great job detailing the issue he discovered and his proposed solution.

I can think of several other improvements to synonym handling in Solr.

Such as allowing specification of tokens and required values in other fields for synonyms. (An indexing analog to scope.)

Or even allowing Solr queries in a synonym table.

Not to mention making Solr synonym tables by default indexed.

Just to name a few.
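
To make the first of those suggestions concrete, here is a rough illustration (plain Python, not Solr code) of a synonym entry that only applies when another field carries a required value, the indexing analog of scope mentioned above. The table entries are invented.

```python
# Illustration only: a synonym entry that is active only when another
# field carries a required value.
scoped_synonyms = [
    # (token, synonyms, required_field, required_value) -- all invented
    ("apple", ["apple inc", "aapl"], "category", "finance"),
    ("apple", ["fruit", "pome"],     "category", "produce"),
]

def expand(token, doc_fields):
    """Expand a token only with synonyms whose scope matches the document."""
    expansions = [token]
    for term, syns, field, value in scoped_synonyms:
        if term == token and doc_fields.get(field) == value:
            expansions.extend(syns)
    return expansions

print(expand("apple", {"category": "finance"}))   # ['apple', 'apple inc', 'aapl']
print(expand("apple", {"category": "produce"}))   # ['apple', 'fruit', 'pome']
```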

August 24, 2013

Citing data (without tearing your hair out)

Filed under: Citation Practices,Data — Patrick Durusau @ 7:00 pm

Citing data (without tearing your hair out) by Bonnie Swoger

From the post:

The changing nature of how and where scientists share raw data has sparked a growing need for guidelines on how to cite these increasingly available datasets.

Scientists are producing more data than ever before due to the (relative) ease of collecting and storing this data. Often, scientists are collecting more than they can analyze. Instead of allowing this un-analyzed data to die when the hard drive crashes, they are releasing the data in its raw form as a dataset. As a result, datasets are increasingly available as separate, stand-alone packages. In the past, any data available for other scientists to use would have been associated with some other kind of publication – printed as table in a journal article, included as an image in a book, etc. – and cited as such.

Now that we can find datasets “living on their own,” scientists need to be able to cite these sources.

Unfortunately, the traditional citation manuals do a poor job of helping a scientist figure out what elements to include in the reference list, either ignoring data or over-complicating things.

If you are building a topic map that relies upon data sets you didn’t create, get ready to cite data sets.

Citations, assuming they are correct, can give your users confidence in the data you present.

Bonnie does a good job providing basic rules that you should follow when citing data.

You can always do more than she suggests but you should never do any less.

Name Search in Solr

Filed under: Searching,Solr — Patrick Durusau @ 6:46 pm

Name Search in Solr by Doug Turnbull.

From the post:

Searching names is a pretty common requirement for many applications. Searching by book authors, for example, is a pretty crucial component to a book store. And as it turns out names are actually a surprisingly hard thing to get perfect. Regardless, we can get something pretty good working in Solr, at least for the vast-majority of Anglicized representations.

We can start with the assumption that aside from all the diversity in human names, that a name in our Authors field is likely going to be a small handful of tokens in a single field. We’ll avoid breaking these names up by first, last, and middle names (if these are even appropriate in all cultural contexts). Let’s start by looking at some sample names in our “Authors” field:

Doug has a photo of library shelves in his post with the caption:

Remember the good ole days of “Alpha by Author”?

True, but books listed their authors in various forms. Librarians were the ones who imposed a canonical representation on author names.

Doug goes through basic Solr techniques for matching author names when you don’t have the benefit of librarians.
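
A tiny illustration of one part of the problem Doug is solving (my sketch, not his Solr configuration): author names need to match across differences in token order and punctuation, which is roughly what tokenizing and lowercasing at index and query time buys you.

```python
import re

def name_tokens(name):
    """Lowercase, strip punctuation, and ignore token order."""
    return frozenset(re.findall(r"[a-z]+", name.lower()))

# Two surface forms of the same author match on their token sets.
print(name_tokens("Stevens, W. Richard") == name_tokens("W. Richard Stevens"))  # True
```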

Agenda for Lucene/Solr Revolution EU! [Closes September 9, 2013]

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 6:34 pm

Help Us Set the Agenda for Lucene/Solr Revolution EU! by Laura Whalen.

From the post:

Thanks to all of you who submitted an abstract for the Lucene/Solr Revolution EU 2013 conference in Dublin. We had an overwhelming response to the Call for Papers, and narrowing the topics from the many great submissions was a difficult task for the Conference Committee. Now we need your help in making the final selections!

Vote now! Community voting will close September 9, 2013.

The Lucene/Solr Revolution free voting system allows you to vote on your favorite topics. The sessions that receive the highest number of votes will be automatically added to the Lucene/Solr Revolution EU 2013 agenda. The remaining sessions will be selected by a committee of industry experts who will take into account the community’s votes as well as their own expertise in the area. Click here to start voting for your favorites.

Your chance to influence the Lucene/Solr Revolution agenda for Dublin! (November 4-7)

PS: As of August 24, 2013, about 11:33 UTC, I was getting a server error from the voting link. Maybe overload of voters?

Hard/Soft Commits, Transaction Logs (SolrCloud)

Filed under: SolrCloud — Patrick Durusau @ 6:22 pm

Understanding Transaction Logs, Soft Commit and Commit in SolrCloud by Erick Erickson.

From the post:

As of Solr 4.0, there is a new “soft commit” capability, and a new parameter for hard commits – openSearcher. Currently, there’s quite a bit of confusion about the interplay between soft and hard commit actions, and especially what it all means for the transaction log. The stock solrconfig.xml file explains the options, but with the usual documentation-in-example limits, if there was a full explanation of everything, the example file would be about 10M and nobody would ever read through the whole thing. This article outlines the consequences of hard and soft commits and the new openSearcher option for hard commits. The release documentation can be found in the Solr Reference Guide; this post is a more leisurely overview of this topic. I persuaded a couple of the committers to give me some details. I’m sure I was told the accurate information, any transcription errors are mine!

The mantra

Repeat after me “Hard commits are about durability, soft commits are about visibility“. Hard and soft commits are related concepts, but serve different purposes. Concealed in this simple statement are many details; we’ll try to illuminate some of them.

Interested? 😉

No harm in knowing the details. Could come in very handy.
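
One way to see the distinction from the client side is to issue the two kinds of commit explicitly. The URL and collection name below are placeholders, and in production you would normally rely on autoCommit/autoSoftCommit settings in solrconfig.xml rather than client-side commits.

```python
import requests

UPDATE = "http://localhost:8983/solr/collection1/update"   # placeholder URL

docs = [{"id": "1", "title_t": "hello"}]

# Soft commit: the document becomes visible to searchers, but durability
# still rests on the transaction log, not on flushed index segments.
requests.post(UPDATE, json=docs, params={"softCommit": "true"})

# Hard commit: index segments are flushed and the transaction log rolls
# over -- this is the durability boundary.
requests.post(UPDATE, json=docs, params={"commit": "true"})
```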

SCOAP3

Filed under: Open Access,Publishing — Patrick Durusau @ 3:52 pm

SCOAP3

I didn’t recognize the acronym either. 😉

From the “about” page:

The Open Access (OA) tenets of granting unrestricted access to the results of publicly-funded research are in contrast with current models of scientific publishing, where access is restricted to journal customers. At the same time, subscription costs increase and put considerable strain on libraries, forcing them to cancel an increasing number of journals subscriptions. This situation is particularly acute in fields like High-Energy Physics (HEP), where pre-prints describing scientific results are timely available online. There is a growing concern within the academic community that the future of high-quality journals, and the peer-review system they administer, is at risk.

To address this situation for HEP and, as an experiment, Science at large, a new model for OA publishing has emerged: SCOAP3 (Sponsoring Consortium for Open Access Publishing in Particle Physics). In this model, HEP funding agencies and libraries, which today purchase journal subscriptions to implicitly support the peer-review service, federate to explicitly cover its cost, while publishers make the electronic versions of their journals free to read. Authors are not directly charged to publish their articles OA.

SCOAP3 will, for the first time, link quality and price, stimulating competition and enabling considerable medium- and long-term savings. Today, most publishers quote a price in the range of 1’000–2’000 Euros per published article. On this basis, we estimate that the annual budget for the transition of HEP publishing to OA would amount to a maximum of 10 Million Euros/year, sensibly lower than the estimated global expenditure in subscription to HEP journals.

Each SCOAP3 partner will finance its contribution by canceling journal subscriptions. Each country will contribute according to its share of HEP publishing. The transition to OA will be facilitated by the fact that the large majority of HEP articles are published in just six peer-reviewed journals. Of course, the SCOAP3 model is open to any, present or future, high-quality HEP journal aiming at a dynamic market with healthy competition and broader choice.

HEP funding agencies and libraries are currently signing Expressions of Interest for the financial backing of the consortium. A tendering procedure will then take place. Provided that SCOAP3 funding partners are prepared to engage in long-term commitments, many publishers are expected to be ready to enter into negotiations.

The example of SCOAP3 could be rapidly followed by other fields, directly related to HEP, such as nuclear physics or astro-particle physics, also similarly compact and organized with a reasonable number of journals.

Models like this one may result in increasing the amount of information available for topic mapping and the amount of semantic diversity in traditional search results.

Delivery models are changing but search interfaces leave us to our own devices at the document level.

If we are going to have better access in the physical sense, shouldn’t we be working on better access in the content sense?

PS: To show this movement has legs, consider the recent agreement of Elsevier, IOPp and Springer to participate.

Information Extraction from the Internet

Filed under: Data Mining,Information Retrieval,Information Science,Publishing — Patrick Durusau @ 3:30 pm

Information Extraction from the Internet by Nan Tang.

From the description at Amazon ($116.22):

As the Internet continues to become part of our lives, there now exists an overabundance of reliable information sources on this medium. The temporal and cognitive resources of human beings, however, do not change. “Information Extraction from the Internet” provides methods and tools for Web information extraction and retrieval. Success in this area will greatly enhance business processes and provide information seekers new tools that allow them to reduce their searching time and cost involvement. This book focuses on the latest approaches for Web content extraction, and analyzes the limitations of existing technology and solutions. “Information Extraction from the Internet” includes several interesting and popular topics that are being widely discussed in the area of information extraction: data spasity and field-associated knowledge (Chapters 1–2), Web agent design and mining components (Chapters 3–4), extraction skills on various documents (Chapters 5–7), duplicate detection for music documents (Chapter 8), name disambiguation in digital libraries using Web information (Chapter 9), Web personalization and user-behavior issues (Chapters 10–11), and information retrieval case studies (Chapters 12–14). “Information Extraction from the Internet” is suitable for advanced undergraduate students and postgraduate students. It takes a practical approach rather than a conceptual approach. Moreover, it offers a truly reader-friendly way to get to the subject related to information extraction, making it the ideal resource for any student new to this subject, and providing a definitive guide to anyone in this vibrant and evolving discipline. This book is an invaluable companion for students, from their first encounter with the subject to more advanced studies, while the full-color artworks are designed to present the key concepts with simplicity, clarity, and consistency.

I discovered this volume while searching for the publisher of: On-demand Synonym Extraction Using Suffix Arrays.

As you can see from the description, a wide ranging coverage of information extraction interests.

All of the chapters are free for downloading at the publisher’s site.

iConcept Press has a number of books and periodicals you may find interesting.

On-demand Synonym Extraction Using Suffix Arrays

Filed under: Authoring Topic Maps,Suffix Array,Synonymy — Patrick Durusau @ 3:19 pm

On-demand Synonym Extraction Using Suffix Arrays by Minoru Yoshida, Hiroshi Nakagawa, and Akira Terada. (Yoshida, M., Nakagawa, H. & Terada, A. (2013). On-demand Synonym Extraction Using Suffix Arrays. Information Extraction from the Internet. ISBN: 978-1463743994. iConcept Press. Retrieved from http://www.iconceptpress.com/books//information-extraction-from-the-internet/)

From the introduction:

The amount of electronic documents available on the World Wide Web (WWW) is continuously growing. The situation is the same in a limited part of the WWW, e.g., Web documents from specific web sites such as ones of some specific companies or universities, or some special-purpose web sites such as www.wikipedia.org, etc. This chapter mainly focuses on such a limited-size corpus. Automatic analysis of this large amount of data by text-mining techniques can produce useful knowledge that is not found by human efforts only.

We can use the power of on-memory text mining for such a limited-size corpus. Fast search for required strings or words available by putting whole documents on memory contributes to not only speeding up of basic search operations like word counting, but also making possible more complicated tasks that require a number of search operations. For such advanced text-mining tasks, this chapter considers the problem of extracting synonymous strings for a query given by users. Synonyms, or paraphrases, are words or phrases that have the same meaning but different surface strings. “HDD” and “hard drive” in documents related to computers and “BBS” and “message boards” in Web pages are examples of synonyms. They appear ubiquitously in different types of documents because the same concept can often be described by two or more expressions, and different writers may select different words or phrases to describe the same concept. In such cases, the documents that include the string “hard drive” might not be found if the query “HDD” is used, which results in a drop in the coverage of the search system. This could become a serious problem, especially for searches of limited-size corpora. Therefore, being able to find such synonyms significantly improves the usability of various systems. Our goal is to develop an algorithm that can find strings synonymous with the user input. The applications of such an algorithm include augmenting queries with synonyms in information retrieval or text-mining systems, and assisting input systems by suggesting expressions similar to the user input.
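
The data structure at the heart of the method is easy to sketch, even if the chapter's optimized implementation is another matter. A naive construction (mine, not the authors') that locates every occurrence of a query string, which is presumably the starting point for gathering the contexts a synonym candidate must share:

```python
import bisect

def suffix_array(text):
    """Naive construction: starting positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, query):
    """Positions where `query` occurs, via binary search over the suffix array."""
    suffixes = [text[i:] for i in sa]              # fine for a toy corpus
    lo = bisect.bisect_left(suffixes, query)
    hi = bisect.bisect_right(suffixes, query + "\uffff")
    return sorted(sa[lo:hi])

corpus = "the hard drive failed, replace the hard drive or the HDD"
sa = suffix_array(corpus)
print(occurrences(corpus, sa, "hard drive"))       # -> [4, 35]
```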

The authors concede the results of their method are inferior to the best results of other synonym extraction methods but go on to say:

However, note that the main advantage of our method is not its accuracy, but its ability to extract synonyms of any query without a priori construction of thesauri or preprocessing using other linguistic tools like POS taggers or dependency parsers, which are indispensable for previous methods.

An important point to remember about all semantic technologies: how appropriate a technique is for your project depends on your requirements, not on qualities of a technique in the abstract.

Technique N may not support machine reasoning but sending coupons to mobile phones “near” a restaurant doesn’t require that overhead. (Neither does standing outside the restaurant with flyers.)

Choose semantic techniques based on their suitability for your purposes.

Missing Layer in the Semantic Web Stack!

Filed under: Humor,Semantic Web — Patrick Durusau @ 2:48 pm

Samantha Bail has discovered a missing layer in the Semantic Web Stack!

Revised Semantic Web Stack

In topic maps we call that semantic diversity. 😉

August 23, 2013

Cypher shell with logging

Filed under: Cypher,Documentation,Neo4j — Patrick Durusau @ 6:12 pm

Cypher shell with logging by Alex Frieden.

From the post:

For those who don’t know, Neo4j is a graph database built with Java. The internet is abound with examples, so I won’t bore you with any.

Our problem was a data access problem. We built a loader, loaded our data into neo4j, and then queried it. However we ran into a little problem: Neo4j at the time of release logs in the home directory (at least on linux redhat) what query was ran (its there as a hidden file). However, it doesn’t log what time it was run at. One other problem as an administrator point of view is not having a complete log of all queries and data access. So we built a cypher shell that would do the logging the way we needed to log. Future iterations of this shell will have REST cypher queries and not use the embedded mode (which is faster but requires a local connection to the data). We also wanted a way in the future to output results to a file.
(…)

Excellent!

Logs are a form of documentation. You may remember that documentation was #1 in the Solr Usability contest.

Documentation is important! Don’t neglect it.
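
If you only need the flavor of the idea, here is a minimal sketch that timestamps each query before sending it to Neo4j's REST Cypher endpoint. The endpoint URL and log file path are placeholders, and Alex's shell does considerably more.

```python
import json
import logging
import requests

logging.basicConfig(filename="cypher.log",                 # placeholder path
                    format="%(asctime)s %(message)s",
                    level=logging.INFO)

CYPHER = "http://localhost:7474/db/data/cypher"            # REST Cypher endpoint

def run(query, params=None):
    """Log the query and its parameters with a timestamp, then execute it."""
    logging.info("QUERY %s PARAMS %s", query, json.dumps(params or {}))
    resp = requests.post(CYPHER, json={"query": query, "params": params or {}})
    resp.raise_for_status()
    return resp.json()

print(run("RETURN 1"))
```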

Light Table 0.5.0

Filed under: Documentation,Programming — Patrick Durusau @ 6:04 pm

Light Table 0.5.0 by Chris Granger.

A little later than the first week or two of August, 2013, but not by much!

Chris says Light Table is a next-gen IDE.

He may be right but to evaluate that claim, you will need to download the alpha here.

I must confess I am curious about his claim:

With the work we did to add background processing and a good deal of effort toward ensuring everything ran fast, LightTable is now comparable in speed to Vim and faster than Emacs or Sublime in most things. (emphasis added)

I want to know what “most things” Light Table does faster than Emacs. 😉

Are you downloading a copy yet?

Data miners strike gold on copyright

Filed under: Data Mining,Licensing,NSA — Patrick Durusau @ 5:40 pm

Data miners strike gold on copyright by Paul Jump.

From the post:

From early September, the biomedical publisher, which is owned by Springer, will publish all datasets under a Creative Commons CC0 licence, which waives all rights to the material.

Data miners, who use software to analyse data drawn from numerous papers, have called for CC0, also known as “no rights reserved”, to be the standard licence for datasets. Even the CC-BY licence, which is required by the UK research councils, is deemed to be a hindrance to data mining: although it does not impose restrictions on reuse, it requires every paper mined to be credited.

In a statement, the publisher says that “the true research potential of knowledge that is captured in data will only be released if data mining and other forms of data analysis and re-use are not in any form restricted by licensing requirements.

“The inclusion of the Creative Commons CC0 public domain dedication will make it clear that data from articles in BioMed Central journals is clearly and unambiguously available for sharing, integration and re-use without legal restrictions.”

As of September, the NSA won’t be violating copyright restrictions when it mines BioMed Central.

Being illegal does not bother the NSA, but the BioMed Central news reduces the number of potential plaintiffs to less than the world population + N (where N = legal entities entitled to civil damages).

You will be able to mine, manipulate and merge data from BioMed Central as well.

Aggregation Options on Big Data Sets Part 1… [MongoDB]

Filed under: Aggregation,MongoDB — Patrick Durusau @ 5:26 pm

Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set by Daniel Alabi and Sweet Song, MongoDB Summer Interns.

From the post:

Flights Dataset Overview

This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight from 1987, but we narrowed down our project to focus on the most recent available data for the past year (April 2012-March 2013).

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

To get started, we wanted to answer a few basic questions concerning the dataset:

  1. When is the best time of day/day of week/time of year to fly to minimize delays?
  2. What types of planes suffer the most delays? How old are these planes?
  3. How often does a delay cascade into other flight delays?
  4. What was the effect of Hurricane Sandy on air transportation in New York? How quickly did the state return to normal?

A series of blog posts to watch!

I thought the comment:

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

was remarkably honest.

The Department of Transportation Table/Field guide reveals that the fields are mostly populated by codes, IDs and date/time values.

Values that lend themselves to easy aggregation.
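
For example, a minimal pymongo sketch of the first question on their list, which day of the week has the fewest delays. The database, collection, and field names (DayOfWeek, ArrDelay) follow the BTS data dictionary but are assumptions about how the data was loaded.

```python
from pymongo import MongoClient

# Database, collection, and field names are assumptions; adjust to your load.
flights = MongoClient().flights_db.flights

pipeline = [
    {"$match": {"ArrDelay": {"$ne": None}}},
    {"$group": {"_id": "$DayOfWeek",
                "avgArrDelay": {"$avg": "$ArrDelay"},
                "flights": {"$sum": 1}}},
    {"$sort": {"avgArrDelay": 1}},      # least-delayed day first
]

for row in flights.aggregate(pipeline):
    print(row["_id"], round(row["avgArrDelay"], 2), row["flights"])
```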

Looking forward to harder aggregation examples as this series develops.

August 22, 2013

Antepedia…

Filed under: Programming,Search Engines,Software — Patrick Durusau @ 6:18 pm

Antepedia Open Source Project Search Engine

From the “more” information link on the homepage:

Antepedia is the largest knowledge base of open source components with over 2 million current projects, and 1,000 more added daily. Antepedia continuously aggregates data from various directories that include Google Code, Apache, GitHub, Maven, and many more. These directories allow Antepedia to consistently grow as the world’s largest knowledge base of open source components.

Antepedia helps companies protect and secure their software assets, by providing a multi-source tracking solution that assists them in their management of open source governance. This implementation of Antepedia allows an organization to reduce licensing risks and security vulnerabilities in your open source component integration.

Antepedia is a public site that provides a way for anyone to search for an open source project. In cases where a project is not currently indexed in the knowledge base, you can manually submit that project, and help build upon the Antepedia knowledge base. These various benefits allow Antepedia to grow and offer the necessary functionalities, which provide the information you need, when you need it. With Antepedia you can assure that you have the newest & most relevant information for all your open source management and detection projects.

See also: Antepedia Reporter Free Edition for tracking open source projects.

If you like open source projects, take a look at: http://www.antelink.com/ (sponsor of Antepedia).

Do navigate on and off the Antelink homepage and watch the Antepedia counter increment, to the same number. 😉 I’m sure the total changes day to day but it was funny to see it reach the same number more than twice.

Indexing use cases and technical strategies [Hadoop]

Filed under: Hadoop,HDFS,Indexing — Patrick Durusau @ 6:02 pm

Indexing use cases and technical strategies

From the post:

In this post, let us look at 3 real life indexing use cases. While Hadoop is commonly used for distributed batch index building, it is desirable to optimize the index capability in near real time. We look at some practical real life implementations where the engineers have successfully worked out their technology stack combinations using different products.

Resources on:

  1. Near Real Time index at eBay
  2. Distributed indexing strategy at Trovit
  3. Incremental Processing by Google’s Percolator

Presentations and a paper for the weekend!

IBM on Data Security

Filed under: Humor,Security — Patrick Durusau @ 4:36 pm

How Not To Use Company Data

Animated graphic with an important lesson about data security.

Snooping isn’t limited to being convenient or even electronic.

Three RDFa Recommendations Published

Filed under: HTML5,RDF,RDFa,WWW — Patrick Durusau @ 2:52 pm

Three RDFa Recommendations Published

From the announcement:

  • HTML+RDFa 1.1, which defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules.
  • The group also published two Second Editions for RDFa Core 1.1 and XHTML+RDFa 1.1, folding in the errata reported by the community since their publication as Recommendations in June 2012; all changes were editorial.
  • The group also updated the RDFa 1.1 Primer.

The deeper I get into HTML+RDFa 1.1, the more I think a random RDFa generator would be an effective weapon against government snooping.

Something copies some percentage of your text, places it in a comment, and generates random RDFa 1.1 markup for it, thus: <!-- your content + RDFa -->.

Improves the stats for the usage of RDFa 1.1 and if the government tries to follow all the RDFa 1.1 rules, well, let’s just say they will have less time for other mischief. 😉
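
In the same tongue-in-cheek spirit, a sketch of such a generator. The vocabulary, types, and properties are picked arbitrarily from schema.org; nothing here is meant as serious RDFa.

```python
import random

TYPES = ["Person", "Article", "Event", "Organization"]
PROPS = ["name", "author", "about", "genre", "datePublished"]

def decoy(text, fraction=0.3):
    """Wrap a slice of the text in an HTML comment decorated with random RDFa."""
    sample = text[: max(1, int(len(text) * fraction))]
    typ, prop = random.choice(TYPES), random.choice(PROPS)
    return (f'<!-- <span vocab="http://schema.org/" typeof="{typ}" '
            f'property="{prop}">{sample}</span> -->')

print(decoy("Nothing to see here, just improving the RDFa usage statistics."))
```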

You complete me

Filed under: AutoSuggestion,ElasticSearch,Interface Research/Design,Lucene — Patrick Durusau @ 2:03 pm

You complete me by Alexander Reelsen.

From the post:

Effective search is not just about returning relevant results when a user types in a search phrase, it’s also about helping your user to choose the best search phrases. Elasticsearch already has did-you-mean functionality which can correct the user’s spelling after they have searched. Now, we are adding the completion suggester which can make suggestions while-you-type. Giving the user the right search phrase before they have issued their first search makes for happier users and reduced load on your servers.

Warning: The completion suggester Alexander describes may “change/break in future releases.”

Two features that made me read the post were readability and custom ordering.

Under readability, the example walks you through returning one output for several search completions.

Suggestions don’t have to be presented in TF/IDF relevance order. A weight assigned to the target of a completion controls the ordering of suggestions.
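
Roughly what those two features look like from the client side, circa Elasticsearch 0.90.3. The index, type, and field names are made up, and, per the warning above, the API may change.

```python
import requests

ES = "http://localhost:9200/music"        # placeholder index name

# One-time setup: a mapping with a completion field.
requests.put(ES, json={"mappings": {"song": {"properties": {
    "suggest": {"type": "completion"}}}}})

# Several inputs share one human-readable output; weight drives ordering.
requests.put(ES + "/song/1", json={"suggest": {
    "input":  ["Nevermind", "Nirvana"],
    "output": "Nirvana - Nevermind",
    "weight": 34}})
requests.post(ES + "/_refresh")

# Ask for suggestions while the user is still typing "n".
resp = requests.post(ES + "/_suggest", json={
    "song-suggest": {"text": "n", "completion": {"field": "suggest"}}})
print(resp.json())
```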

The post covers several other features and if you are using or considering using Elasticsearch, it is a good read.

duplitector

Filed under: Duplicates,ElasticSearch,Lucene — Patrick Durusau @ 1:17 pm

duplitector by Paweł Rychlik.

From the webpage:

duplitector

A duplicate data detector engine based on Elasticsearch. It’s been successfully used as a proof of concept, piloting a full-blown enterprise solution.

Context

In certain systems we have to deal with lots of low-quality data, containing some typos, malformatted or missing fields, erraneous bits of information, sometimes coming from different sources, like careless humans, faulty sensors, multiple external data providers, etc. This kind of datasets often contain vast numbers of duplicate or similar entries. If this is the case – then these systems might struggle to deal with such unnatural, often unforeseen, conditions. It might, in turn, affect the quality of service delivered by the system.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of various sorts of organizations (e.g. NPO databases). Still, it’s small and generic enough, so that it can be easily adjusted to handle other data schemes or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from IRS, partially intentionally modified, partially made up), so that it’s convenient to test the algorithm’s pieces.

Paweł also points to this article by Andrei Zmievski: Duplicates Detection with ElasticSearch. Andrei merges tags for locations based on their proximity to particular coordinates.

I am looking forward to the use of indexing engines for deduplication of data in situ, as it were. That is, without transforming the data into some other format for processing.
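
A rough sketch of the general approach (not Paweł's code): query the index with relaxed matching on an incoming record and treat high-scoring hits as duplicate candidates. The index, field names, and score threshold are all invented.

```python
import requests

SEARCH = "http://localhost:9200/orgs/org/_search"    # placeholder index/type

incoming = {"name": "Acme Charitable Fondation", "city": "Springfield"}

# Relaxed match on the name, with the city as supporting evidence.
query = {"query": {"bool": {
    "must":   [{"match": {"name": incoming["name"]}}],
    "should": [{"match": {"city": incoming["city"]}}],
}}}

hits = requests.post(SEARCH, json=query).json()["hits"]["hits"]
candidates = [h for h in hits if h["_score"] > 1.0]   # threshold is arbitrary
print("possible duplicates:", [c["_source"]["name"] for c in candidates])
```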

Knowledge Leakage:..

Filed under: Knowledge,Knowledge Capture,Organizational Memory — Patrick Durusau @ 1:00 pm

Knowledge Leakage: The Destructive Impact of Failing to Train on ERP Projects by Cushing Anderson.

Abstract:

This IDC study refines the concept of knowledge leakage and the factors that compound and mitigate the impact of knowledge leakage on an IT organization. It also suggests strategies for IT management to reduce the impact of knowledge leakage on organizational performance.

There is a silent killer in every IT organization — knowledge leakage. IT organizations are in a constant state of flux. The IT environment, the staff, and the organizational goals change continuously. At the same time, organizational performance must be as high as possible, but the impact of changing staff and skill leakage can cause 50% of an IT organization’s skills to be lost in six years.

“Knowledge leak is the degradation of skills over time, and it occurs in every organization, every time. It doesn’t discriminate based on operating system or platform, but it can kill organizational performance in as little as a couple of years.” — Cushing Anderson, vice president, IT Education and Certification research

I don’t have an IDC account so I can’t share with you what goodies may be inside this article.

I do think that “knowledge leakage” is a good synonym for “organizational memory.” Or should that be “organizational memory loss?”

I also don’t think that “knowledge leakage” is confined to IT organizations.

Ask the nearest supervisor who has had a long-time administrative assistant retire. That’s real “knowledge leakage.”

The problem with capturing organizational knowledge, the unwritten rules of who to ask, for what and when, is that such rules are almost never written down.

And if they were, how would you find them?

Let me leave you with a hint:

The user writing down the unwritten rules needs to use their vocabulary and not one ordained by IT or your corporate office. And they need to walk you through it so you can add your vocabulary to it.

Or to summarize: Say it your way. Find it your way.

If you are interested, you know how to contact me.

August 21, 2013

Groklaw Goes Dark

Filed under: Government,NSA,Security — Patrick Durusau @ 7:45 pm

Just in case you missed it, Groklaw has gone dark.

In Forced Exposure, Pamela Jones outlines why Groklaw cannot continue when all email is subject to constant monitoring by the government.

From the post:

I hope that makes it clear why I can’t continue. There is now no shield from forced exposure. Nothing in that parenthetical thought list is terrorism-related, but no one can feel protected enough from forced exposure any more to say anything the least bit like that to anyone in an email, particularly from the US out or to the US in, but really anywhere. You don’t expect a stranger to read your private communications to a friend. And once you know they can, what is there to say? Constricted and distracted. That’s it exactly. That’s how I feel.

So. There we are. The foundation of Groklaw is over. I can’t do Groklaw without your input. I was never exaggerating about that when we won awards. It really was a collaborative effort, and there is now no private way, evidently, to collaborate.

I’m really sorry that it’s so. I loved doing Groklaw, and I believe we really made a significant contribution. But even that turns out to be less than we thought, or less than I hoped for, anyway. My hope was always to show you that there is beauty and safety in the rule of law, that civilization actually depends on it. How quaint.

I won’t say that I always agreed with Groklaw but I am sad to see it go.

While I respect Pamela’s judgement to go offline, I won’t be following her, nor should you.

What revolution ever started and continued without innocent victims?

Would the march from Selma to Montgomery have been the same if the police had issued misdemeanor summonses?

We know the use of Bull Connor‘s police dogs:

[Image: Birmingham police dogs]

and fire hoses:

[Image: Children’s Crusade fire hoses]

during the Children’s Crusade, led to the passage of the Civil Rights Act of 1964.

The sacrifices of many nameless (to television viewers) victims gave the civil rights movement its moral impetus.

How could others turn away after watching victims simply accepting abuse?

The same will be true for the current police state. The longer it exists the more mistakes it will make and the more victims it will accumulate.

Innocent people are going to be harassed, innocent people are going to lose their jobs, innocent people are going to die.

Without innocent victims, there will be no moral impetus to dismantle Obama‘s police state.

Ask yourself, would you have been on the Edmund Pettus Bridge with the marchers, or on the other side?

[Image: Bloody Sunday, Edmund Pettus Bridge]

Child of the Library

Filed under: Library — Patrick Durusau @ 6:29 pm

Public libraries, for me, are in a category all their own.

I was fortunate to grow up in a community that supported public libraries. And I have spent most of my life across several careers using libraries of one sort or another, including public ones.

Public libraries offer, at no cost to patrons, opportunities to be informed about current events, to learn more than is taught in any university, and to be entertained by stories from near and far and even long ago.

Public libraries are also community centers where anyone can meet, where economic limitations don’t prevent access to the latest technologies or information streams.

Visit and support your local public library.

Every public library is a visible symbol that government thought control may be closing in, but it hasn’t won, yet.
