Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 22, 2012

VIVO – An interdisciplinary national network

Filed under: Semantic Web,VIVO — Patrick Durusau @ 7:42 pm

VIVO – An interdisciplinary national network

From the “about” page:

VIVO enables the discovery of researchers across institutions. Participants in the network include institutions with local installations of VIVO or those with research discovery and profiling applications that can provide semantic web-compliant data. The information accessible through VIVO’s search and browse capability will reside and be controlled locally, within institutional VIVOs or other semantic web-compliant applications.

VIVO is an open source semantic web application originally developed and implemented at Cornell. When installed and populated with researcher interests, activities, and accomplishments, it enables the discovery of research and scholarship across disciplines at that institution and beyond. VIVO supports browsing and a search function which returns faceted results for rapid retrieval of desired information. Content in any local VIVO installation may be maintained manually, brought into VIVO in automated ways from local systems of record, such as HR, grants, course, and faculty activity databases, or from database providers such as publication aggregators and funding agencies.

The rich semantically structured data in VIVO support and facilitate research discovery. Examples of applications that consume these rich data include: visualizations, enhanced multi-site search through VIVO Search, and applications such as VIVO Searchlight, a browser bookmarklet which uses text content of any webpage to search for relevant VIVO profiles, and the Inter-Institutional Collaboration Explorer, an application which allows visualization of collaborative institutional partners, among others.

Download the VIVO flyer.

I would be very interested to hear from adopters outside the current “collaborative institutional partners.”

I don’t doubt that VIVO will prove to be useful, but as you know, I am interested in collaborations that lie just beyond the reach of any particular framework.

Spring MVC 3.1 – Implement CRUD with Spring Data Neo4j

Filed under: CRUD,MongoDB,Neo4j,Spring Data — Patrick Durusau @ 7:42 pm

Spring MVC 3.1 – Implement CRUD with Spring Data Neo4j

The title of the post includes “(Part-1)” but all five parts have been posted.

From the post:

In this tutorial, we will create a simple CRUD application using Spring 3.1 and Neo4j. We will base this tutorial on a previous guide for MongoDB. This means we will re-use our existing design and implement only the data layer to use Neo4j as our data store.

I would start with the original MongoDB post, Spring MVC 3.1 – Implement CRUD with Spring Data MongoDB. (It won’t hurt you to learn some MongoDB as well.)

It will quite definitely repay the time you spend.
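To give a concrete feel for what the tutorial covers, here is a minimal sketch of a node entity and repository in the Spring Data Neo4j style. The annotation and interface names (@NodeEntity, @GraphId, GraphRepository) are from memory of the 2.x-era API, so treat this as a sketch and follow the tutorial for the exact imports and configuration.

    import org.springframework.data.neo4j.annotation.GraphId;
    import org.springframework.data.neo4j.annotation.NodeEntity;
    import org.springframework.data.neo4j.repository.GraphRepository;

    // A user stored as a node in Neo4j.
    @NodeEntity
    public class User {
        @GraphId
        private Long id;          // node id assigned by Neo4j
        private String username;
        private String email;

        public Long getId() { return id; }
        public String getUsername() { return username; }
        public void setUsername(String username) { this.username = username; }
        public String getEmail() { return email; }
        public void setEmail(String email) { this.email = email; }
    }

    // Spring Data derives the CRUD implementation at runtime; no query code is
    // needed for basic save/find/delete operations.
    interface UserRepository extends GraphRepository<User> {
        User findByUsername(String username);
    }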

Web Data Commons

Filed under: Common Crawl,Microdata,Microformats,PageRank,RDFa — Patrick Durusau @ 7:42 pm

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to transfer easily, and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

Web Data Commons is much the same thing: it frees most of us from crawling the web and extracting structured data from it, or at least gives us a basis for more pointed crawling of the web.
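If you want to peek at the Web Data Commons dumps before wiring in a full RDF toolkit, the quads come one per line (subject, predicate, object, then the URL of the page the data was extracted from, terminated by “ .”). A toy reader is sketched below; a real application should use a proper RDF parser, since literals can contain characters that defeat this kind of string splitting.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class QuadPeek {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#") || !line.endsWith(".")) continue;
                    String body = line.substring(0, line.length() - 1).trim();
                    int s = body.indexOf(' ');                // end of subject
                    int p = body.indexOf(' ', s + 1);         // end of predicate
                    int c = body.lastIndexOf(' ');            // start of context (source page)
                    String subject = body.substring(0, s);
                    String predicate = body.substring(s + 1, p);
                    String object = body.substring(p + 1, c); // URI or literal
                    String context = body.substring(c + 1);   // page the data came from
                    System.out.printf("%s | %s | %s | %s%n", subject, predicate, object, context);
                }
            }
        }
    }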

A very welcome development!

Vista Stares Deep Into the Cosmos:…

Filed under: Astroinformatics,Data,Dataset — Patrick Durusau @ 7:42 pm

Vista Stares Deep Into the Cosmos: Treasure Trove of New Infrared Data Made Available to Astronomers

From the post:

The European Southern Observatory’s VISTA telescope has created the widest deep view of the sky ever made using infrared light. This new picture of an unremarkable patch of sky comes from the UltraVISTA survey and reveals more than 200 000 galaxies. It forms just one part of a huge collection of fully processed images from all the VISTA surveys that is now being made available by ESO to astronomers worldwide. UltraVISTA is a treasure trove that is being used to study distant galaxies in the early Universe as well as for many other science projects.

ESO’s VISTA telescope has been trained on the same patch of sky repeatedly to slowly accumulate the very dim light of the most distant galaxies. In total more than six thousand separate exposures with a total effective exposure time of 55 hours, taken through five different coloured filters, have been combined to create this picture. This image from the UltraVISTA survey is the deepest [1] infrared view of the sky of its size ever taken.

The VISTA telescope at ESO’s Paranal Observatory in Chile is the world’s largest survey telescope and the most powerful infrared survey telescope in existence. Since it started work in 2009 most of its observing time has been devoted to public surveys, some covering large parts of the southern skies and some more focused on small areas. The UltraVISTA survey has been devoted to the COSMOS field [2], an apparently almost empty patch of sky which has already been extensively studied using other telescopes, including the NASA/ESA Hubble Space Telescope [3]. UltraVISTA is the deepest of the six VISTA surveys by far and reveals the faintest objects.

Another six (6) terabytes of images, just in case you are curious.

And the rate of acquisition of astronomical data is only increasing.

Clever insights into how to more efficiently process and analyze the resulting data are surely welcome.

Milking Performance from Riak Search

Filed under: Erlang,Riak — Patrick Durusau @ 7:42 pm

Milking Performance from Riak Search by Gary William Flake.

From the post:

The primary backend store of Clipboard is built on top of Riak, one of the lesser known NoSQL solutions. We love Riak and are really happy with our experiences with it — both in terms of development and operations — but to get to where we are, we had to use some tricks. In this post I want to share with you why we chose Riak and also arm you with some of the best tricks that we learned along the way. Individually, these tricks gave us better than a 100x performance boost, so they may make a big difference for you too.

If you don’t know what Clipboard is, you should try it out. We’re in private beta now, but here’s a backdoor that will bypass the invitation system: Register at Clipboard.

A good discussion of term-based partitioning (the scheme native to Riak Search) and its disadvantages, which the Clipboard team solved in part by judging likely queries in advance and precomputing inner joins. Not a bad method, depending on your confidence in your guesses about likely queries.

You will also have to determine whether sorting on a primary key meets your needs; if it does, it is good for a 10X to 100X performance gain.
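The precomputation trick is easier to see in miniature. The sketch below uses plain maps standing in for Riak buckets; nothing in it is Riak’s actual client API. The point is the shape of the trade: do the join work at write time, under the key you expect to query by, so that reads become single key lookups instead of term searches plus joins.

    import java.util.*;

    public class PrecomputedJoin {
        // "users" bucket: userId -> user record (a JSON blob in real life)
        static Map<String, String> users = new HashMap<>();
        // "clips" bucket: clipId -> owning userId
        static Map<String, String> clipOwner = new HashMap<>();
        // precomputed "join": userId -> clip ids, maintained at write time
        static Map<String, List<String>> clipsByUser = new HashMap<>();

        static void saveClip(String clipId, String userId) {
            clipOwner.put(clipId, userId);
            List<String> clips = clipsByUser.get(userId);
            if (clips == null) {
                clips = new ArrayList<>();
                clipsByUser.put(userId, clips);
            }
            clips.add(clipId);
        }

        public static void main(String[] args) {
            users.put("u1", "{\"name\":\"alice\"}");
            saveClip("c1", "u1");
            saveClip("c2", "u1");
            // Query time: one key lookup, no term search, no join.
            System.out.println(clipsByUser.get("u1")); // [c1, c2]
        }
    }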

Sehrch.com: … Powered by Hypertable (Semantic Web/RDF Performance by Design)

Filed under: Hypertable,NoSQL,Sehrch.com — Patrick Durusau @ 7:41 pm

Sehrch.com: A Structured Search Engine Powered By Hypertable

From the introduction:

Sehrch.com is a structured search engine. It provides powerful querying capabilities that enable users to quickly complete complex information retrieval tasks. It gathers conceptual awareness from the Linked Open Data cloud, and can be used as (1) a regular search engine or (2) as a structured search engine. In both cases conceptual awareness is used to build entity centric result sets.

To facilitate structured search we have introduced a new simple search query syntax that allows for bound properties and relations (contains, less than, more than, between, etc). The initial implementation of Sehrch.com was built over an eight month period. The primary technologies used are Hypertable and Lucene. Hypertable is used to store all Sehrch.com data which is in RDF (Resource Description Framework). Lucene provides the underlying searcher capability. This post provides an introduction to structured web search and an overview of how we tackled our big data problem.

A bit later you read:

We achieved a stable loading throughput of 22,000 triples per second (tps), peaking at 30,000 tps. Within 24 hours we had loaded the full 1.3 billion triples on a single node, on hardware that was at least two years out of date. We were shocked, mostly because on the same hardware the SPARQL compliant triplestores had managed 80 million triples (Virtuoso) at the most and Hypertable had just loaded all 1.3 billion. The 500GB of input RDF had become a very efficient 50GB of Hypertable data files. But then, loading was only half of the problem, could we query? We wrote a multi-threaded data exporter that would query Hypertable for entities by subject (Hypertable row key) randomly. We ran the exporter, and achieved speeds that peaked at 1,800 queries per second. Again we were shocked. Now that the data the challenge had set forth was loaded, we wondered how far Hypertable could go on the same hardware.

So we reloaded the data, this time appending the row keys with 1. Hypertable completed the load again, in approximately the same time. So we ran it again, now appending the keys with 2. Hypertable completed again, again in the same time frame. We now had a machine which was only 5% of our eventual production specification that stored 3.6 billion triples, three copies each of DBpedia and Freebase. We reran our data exporter and achieved query speeds that ranged between 1,000-1,300 queries per second. From that day on we have never looked back, Hypertable solved our data storage problem, it smashed the challenge that we set forth that would determine if Sehrch.com was at all possible. Hypertable made it possible.

That’s performance by design, not brute force.
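The storage shape behind those numbers is worth spelling out: every triple for a subject lives under that subject as the row key, with predicates as columns, so fetching an entity is a single row read rather than a join. A toy sketch of the layout (plain Java maps, not the Hypertable client API):

    import java.util.*;

    public class SubjectRows {
        // row key (subject) -> column (predicate) -> values (objects)
        static Map<String, Map<String, List<String>>> table = new HashMap<>();

        static void load(String subject, String predicate, String object) {
            Map<String, List<String>> row = table.get(subject);
            if (row == null) { row = new HashMap<>(); table.put(subject, row); }
            List<String> values = row.get(predicate);
            if (values == null) { values = new ArrayList<>(); row.put(predicate, values); }
            values.add(object);
        }

        public static void main(String[] args) {
            load("dbpedia:Miley_Cyrus", "rdf:type", "dbo:MusicalArtist");
            load("dbpedia:Miley_Cyrus", "dbo:birthDate", "1992-11-23");
            // Entity lookup is one row read by subject, no joins:
            System.out.println(table.get("dbpedia:Miley_Cyrus"));
        }
    }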

On the other hand, the results for a query like “pop singers less than 20 years old” could be improved. Page after page of Miley Cyrus results gets old in a hurry. 😉

I am sure the team at Sehrch.com would appreciate your suggestions and comments.

Secondary Indices Have Arrived! (Hypertable)

Filed under: Hypertable,Indexing — Patrick Durusau @ 7:41 pm

Secondary Indices Have Arrived! (Hypertable)

From the post:

Until now, SELECT queries in Hypertable had to include a row key, row prefix or row interval specification in order to be fast. Searching for rows by specifying a cell value or a column qualifier involved a full table scan which resulted in poor performance and scaled badly because queries took longer as the dataset grew. With 0.9.5.6, we’ve implemented secondary indices that will make such SELECT queries lightning fast!

Hypertable supports two kinds of indices: a cell value index and a column qualifier index. This blog post explains what they are, how they work and how to use them.

I am glad to hear about the new indexing features, but how do “cell value indexes” and “column qualifier indexes” differ from secondary indexes as described in the PostgreSQL 9.1 documentation:

All indexes in PostgreSQL are what are known technically as secondary indexes; that is, the index is physically separate from the table file that it describes. Each index is stored as its own physical relation and so is described by an entry in the pg_class catalog. The contents of an index are entirely under the control of its index access method. In practice, all index access methods divide indexes into standard-size pages so that they can use the regular storage manager and buffer manager to access the index contents.

It would be helpful in evaluating new features to know when (if?) they are substantially the same as features known in other contexts.

Einstein Archives Online

Filed under: Archives,Dataset — Patrick Durusau @ 7:41 pm

Einstein Archives Online

From the “about” page:

The Einstein Archives Online Website provides the first online access to Albert Einstein’s scientific and non-scientific manuscripts held by the Albert Einstein Archives at the Hebrew University of Jerusalem, constituting the material record of one of the most influential intellects in the modern era. It also enables access to the Einstein Archive Database, a comprehensive source of information on all items in the Albert Einstein Archives.

DIGITIZED MANUSCRIPTS

From 2003 to 2011, the site included approximately 3,000 high-quality digitized images of Einstein’s writings. This digitization of more than 900 documents written by Einstein was made possible by generous grants from the David and Fela Shapell family of Los Angeles. As of 2012, the site will enable free viewing and browsing of approximately 7,000 high-quality digitized images of Einstein’s writings. The digitization of close to 2,000 documents written by Einstein was produced by the Albert Einstein Archives Digitization Project and was made possible by the generous contribution of the Polonsky foundation. The digitization project will continue throughout 2012.

FINDING AID

The site enables access to the online version of the Albert Einstein Archives Finding Aid, a comprehensive description of the entire repository of Albert Einstein’s personal papers held at the Hebrew University. The Finding Aid, presented in Encoded Archival Description (EAD) format, provides the following information on the Einstein Archives: its identity, context, content, structure, conditions of access and use. It also contains a list of the folders in the Archives which will enable access to the Archival Database and to the Digitized Manuscripts.

ARCHIVAL DATABASE

From 2003 to 2011, the Archival Database included approximately 43,000 records of Einstein and Einstein-related documents. Supplementary archival holdings and databases pertaining to Einstein documents have been established at both the Einstein Papers Project and the Albert Einstein Archives for scholarly research. As of 2012 the Archival Database allows direct access to all 80,000 records of Einstein and Einstein-related documents in the original and the supplementary archive. The records published in this online version pertain to Albert Einstein’s scientific and non-scientific writings, his professional and personal correspondence, notebooks, travel diaries, personal documents, and third-party items contained in both the original collection of Einstein’s personal papers and in the supplementary archive.

Unless you are a professional archivist, I suspect you will want to start with the Gallery, which for some UI design reason appears at the bottom of the homepage in small type. (Hint: It really should be a logo at top left, to interest the average visitor.)

When you do reach the manuscript images, the zoom/navigation is quite responsive, although a slightly larger overview image to clue the reader in on location would be better. In fact, one that is readable and yet subject to zoom would be ideal.

Another improvement would be to display a URL to allow exchange of links to particular images, along with X/Y coordinates to the images. As presented, every reader has to re-find information in images for themselves.

Archiving material is good. Digital archives that enable wider access is better. Being able to reliably point into digital archives for commentary, comparison and other purposes is great.

Text Analytics in Telecommunications – Part 3

Filed under: Machine Learning,Text Analytics — Patrick Durusau @ 7:41 pm

Text Analytics in Telecommunications – Part 3 by Themos Kalafatis.

From the post:

It is well known that FaceBook contains a multitude of information that can be potentially analyzed. A FaceBook page contains several entries (Posts, Photos, Comments, etc) which in turn generate Likes. This data can be analyzed to better understand the behavior of consumers towards a Brand, Product or Service.

Let’s look at the analysis of the three FaceBook pages of MT:S, Telenor and VIP Mobile Telcos in Serbia as an example. The question that this analysis tries to answer is whether we can identify words and phrases that frequently appear in posts that generate any kind of reaction (a “Like”, or a Comment) vs words and topics that do not tend to generate reactions. If we are able to differentiate these words then we get an idea of what consumers tend to value more: if a post is of no value to us then we will not tend to Like it and/or comment on it.

To perform this analysis we need a list of several thousands of posts (their text) and also the number of Likes and Comments that each post has received. If any post has generated a Like and/or a Comment then we flag that post as having generated a reaction. The next step is to feed that information to a machine learning algorithm to identify which words have discriminative power (=which words appear more frequently in posts that are liked and/or commented and also which words do not produce any reaction.)

It would be more helpful if the “machine learning algorithm” used in this case were identified, along with the data set in question.
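Absent those details, here is a minimal sketch of one way to get at “discriminative power”: score each word by how much more probable it is in posts that drew a reaction than in posts that did not, with add-one smoothing. This is my guess at the general shape of the analysis, not the unnamed algorithm from the post, and for Serbian text the crude tokenizer below would need to be replaced with a Unicode-aware one.

    import java.util.*;

    public class DiscriminativeWords {
        // Returns log( P(word | reacted) / P(word | ignored) ); values above zero lean
        // toward posts that generated a Like or a Comment.
        public static Map<String, Double> score(List<String> reacted, List<String> ignored) {
            Map<String, Integer> r = counts(reacted), i = counts(ignored);
            Set<String> vocab = new HashSet<>(r.keySet());
            vocab.addAll(i.keySet());
            long rTotal = total(r), iTotal = total(i);
            Map<String, Double> scores = new HashMap<>();
            for (String w : vocab) {
                double pr = (r.getOrDefault(w, 0) + 1.0) / (rTotal + vocab.size());
                double pi = (i.getOrDefault(w, 0) + 1.0) / (iTotal + vocab.size());
                scores.put(w, Math.log(pr / pi));
            }
            return scores;
        }

        private static Map<String, Integer> counts(List<String> posts) {
            Map<String, Integer> c = new HashMap<>();
            for (String post : posts)
                for (String w : post.toLowerCase().split("\\W+")) // crude ASCII tokenizer
                    if (!w.isEmpty()) c.merge(w, 1, Integer::sum);
            return c;
        }

        private static long total(Map<String, Integer> c) {
            long t = 0;
            for (int v : c.values()) t += v;
            return t;
        }
    }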

I suppose we will learn more after the presentation at the European Text Analytics Summit, although we would like to learn more sooner! 😉

March 21, 2012

Taking a Look at Version 2.1 of Objectivity’s InfiniteGraph

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 3:31 pm

Taking a Look at Version 2.1 of Objectivity’s InfiniteGraph

Paul Williams writes:

InfiniteGraph is a distributed graph database application developed by the California-based company, Objectivity. Companies focused on the relationships within their data make up the primary market for InfiniteGraph. The database is known for its ability to find connections inside large datasets, as well as its robust performance and easy scalability.

InfiniteGraph uses a unique load-based pricing model that allows interested parties to try the software, including full database development, essentially free of charge. Two pricing options exist for companies deciding to fully deploy InfiniteGraph. The first option is “pay as you go,” which sports a run-time usage-based pricing model. Companies with large or classified applications can take advantage of a site-wide license option.

Those interested in using the free demo version of InfiniteGraph need to either have a Java compiler combined with some skill using a command line, or an installed IDE such as Eclipse. Database model development in InfiniteGraph requires at least some basic familiarity with the Java language, considering models and relationships (vertices and edges in InfiniteGraph nomenclature) are defined as Java classes. Once a database is compiled, the InfiniteGraph Visualizer app (included with package) allows for graph navigation and data browsing.

Objectivity might benefit from providing a pre-compiled database with the InfiniteGraph download package, so interested parties can investigate the Visualizer app and the software’s data mining capabilities without the need of a Java compiler and/or having to engage their development staff.

Just going off the review for the moment, I think having a common navigation of data and metadata is a good thing. That is offset by the rather unnatural requirement to load a graph database and search for a node before there is any display.

I think I understand the reasoning for requiring a search for a node before anything is displayed, but it is counter-intuitive, and counter-intuitive = burden on the user. Better to provide (and use) a default node that is displayed upon load, so that “load” produces some action the user can see, and to offer the user the ability to pick another “default” node that displays automatically when the graph is loaded.

It sounds like the documentation could be better integrated into the application (and not left on the vendor’s website).

Comments or suggestions on strong/weak points to look for with InfiniteGraph?

SoSlang Crowdsources a Dictionary

Filed under: Crowd Sourcing,Dictionary — Patrick Durusau @ 3:31 pm

SoSlang Crowdsources a Dictionary

Stephen E. Arnold writes:

Here’s a surprising and interesting approach to dictionaries: have users build their own. SoSlang allows anyone to add a slang term and its definition. Beware, though, this site is not for everyone. Entries can be salty. R-rated, even. You’ve been warned.

I would compare this approach:

speakers -> usages -> dictionary

to a formal dictionary:

speakers -> usages -> editors -> formal dictionary

That is to say a formal dictionary reflects the editor’s sense of the language and not the raw input of the speakers of a language.

It would be a very interesting text mining task to eliminate duplicate usages of terms so that the changing uses of a term can be tracked.

FDsys – Topic Maps – Concrete Example

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:31 pm

FDsys – Topic Maps – Concrete Example

Have you ever wanted a quick, concrete example of the need for a topic map to give someone?

Today I saw Liberating America’s secret, for-pay laws, which is a great read on how pay-for standards are cited by federal regulations: you have to pay for the standards to know what the rules say.

That’s a rip-off, isn’t it? You not only have to follow the rules, on pain of enforcement, but you have to pay to know what the rules are.

Being a member of the OASIS Technical Advisory Board and a general advocate of open standards, I see an opportunity for OASIS to claim some PR here.

So I go to the FDsys site, choose advanced search, limit it to the Code of Federal Regulations and enter “OASIS.”

I get 298 “hits” for “collection:CFR and content:OASIS.”

Really?

Well, the first one is: http://www.gpo.gov/fdsys/pkg/CFR-2011-title18-vol1/pdf/CFR-2011-title18-vol1-sec37-6.pdf. Title 18?

In the event that an OASIS user makes an error in a query, the Responsible Party can block the affected query and notify the user of the nature of the error. The OASIS user must correct the error before making any additional queries. If there is a dispute over whether an error has occurred, the procedures in paragraph (d) of this section apply.

FYI, Title 18 is the Federal Energy Regulatory Commission, so this doesn’t sound right.

To cut to the chase, I find:

http://www.gpo.gov/fdsys/pkg/CFR-2009-title47-vol1/pdf/CFR-2009-title47-vol1-sec10-10.pdf

as one of the relevant examples.

Two questions:

  1. How to direct OASIS members to citations in U.S. and foreign laws/regs to promote OASIS?
  2. How to make U.S. and foreign regulators aware of relevant OASIS materials?

Hint: The answer is not:

  • Waiting for them to discover OASIS and its fine work.
  • Advertising the fine work of OASIS to its own membership and staff.

European Legislation Identifier: Document and Slides

Filed under: EU,Government,Law,Legal Informatics — Patrick Durusau @ 3:31 pm

European Legislation Identifier: Document and Slides

From LegalInformatics:

John Dann of the Luxembourg Service Central de Législation has kindly given his permission for us to post the following documents related to the proposed European Legislation Identifier (ELI) standard:

If you are interested in legal identifiers or legislative materials in Europe more generally, this should be of interest.

An Asymmetric Data Conversion Scheme based on Binary Tags

Filed under: Binary Tags,Data Conversion — Patrick Durusau @ 3:30 pm

An Asymmetric Data Conversion Scheme based on Binary Tags by Zhu Wang, Chonglei Mei, Hai Jiang, and G. A. Wilkin.

Abstract:

In distributed systems with homogeneous or heterogeneous computers, data generated on one machine might not always be used by another machine directly. For a particular data type, its endianness, size and padding situation cause incompatibility issues. A data conversion procedure is indispensable, especially in open systems. So far, there is no widely accepted data format standard in the high performance computing community. Most of the time, programmers have to handle data formats manually. In order to achieve high programmability and efficiency in both homogeneous and heterogeneous open systems, a novel asymmetric binary-tag-based data conversion scheme (BinTag) is proposed to share data smoothly. Each data item carries one binary tag generated by BinTag’s parser without much programmer involvement. Data conversion only happens when it is absolutely necessary. Experimental results have demonstrated its effectiveness and performance gains in terms of productivity and data conversion speed. BinTag can be used in both memory and secondary storage systems.

Homogeneous and heterogeneous in the sense of padding, size, endianness? Serious issue for high performance computing.
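For anyone who has not been bitten by it, the byte-order half of the incompatibility is easy to demonstrate. The sketch below is plain Java byte handling, not BinTag itself; it simply shows why some tag or agreement about endianness has to travel with the data.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class EndianDemo {
        public static void main(String[] args) {
            int value = 0x01020304;
            byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
            byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
            System.out.println("big-endian bytes:    " + toHex(big));    // 01 02 03 04
            System.out.println("little-endian bytes: " + toHex(little)); // 04 03 02 01
            // Reading little-endian bytes with the wrong assumed order gives the wrong value:
            int misread = ByteBuffer.wrap(little).order(ByteOrder.BIG_ENDIAN).getInt();
            System.out.printf("misread value: 0x%08X%n", misread);       // 0x04030201
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            return sb.toString().trim();
        }
    }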

Are there lessons to be taught or learned here for other notions of homogeneous/heterogeneous data?

Do we need binary tags to track semantics at a higher level?

Or can we view data as though it had particular semantics? At higher and lower levels?

A graphical overview of your MySQL database

Filed under: Data,Database,MySQL — Patrick Durusau @ 3:30 pm

A graphical overview of your MySQL database by Christophe Ladroue.

From the post:

If you use MySQL, there’s a default schema called ‘information_schema‘ which contains lots of information about your schemas and tables among other things. Recently I wanted to know whether a table I use for storing the results of a large number of experiments was anywhere near maxing out. To cut a brief story even shorter, the answer was “not even close” and could be found in ‘information_schema.TABLES‘. Not being one to avoid any opportunity to procrastinate, I went on to write a short script to produce a global overview of the entire database.

information_schema.TABLES contains the following fields: TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH and MAX_DATA_LENGTH (and a few others). We can first have a look at the relative sizes of the schemas with the MySQL query “SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA“.

Christophe includes R code to generate graphics that you will find useful in managing (or just learning about) MySQL databases.
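If you would rather pull the same numbers into Java than into R, a minimal JDBC version of Christophe’s query is sketched below. The connection URL and credentials are placeholders, and it assumes the MySQL JDBC driver is on the classpath.

    import java.sql.*;

    public class SchemaSizes {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:mysql://localhost:3306/?user=me&password=secret"; // placeholder
            String sql = "SELECT TABLE_SCHEMA, SUM(DATA_LENGTH) AS SCHEMA_LENGTH "
                       + "FROM information_schema.TABLES "
                       + "WHERE TABLE_SCHEMA != 'information_schema' "
                       + "GROUP BY TABLE_SCHEMA";
            try (Connection con = DriverManager.getConnection(url);
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%-20s %,d bytes%n",
                            rs.getString("TABLE_SCHEMA"), rs.getLong("SCHEMA_LENGTH"));
                }
            }
        }
    }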

While the parts of the schema Christophe is displaying graphically are obviously subjects, the graphical display pushed me in another direction.

If we can visualize the schema of a MySQL database, then shouldn’t we be able to visualize the database structures a bit closer to the metal?

And if we can visualize those database structures, shouldn’t we be able to represent them and the relationships between them as a graph?

Or perhaps better, can we “view” those structures and relationships “on demand” as a graph?

That is in fact what is happening when we display a table at the command prompt for MySQL. It is a “display” of information, not a report of information.

I don’t know enough about the internal structures of MySQL or PostgreSQL to start such a mapping. But ignorance is curable, at least that is what they say. 😉

I have another post today that suggests a different take on conversion methodology.

R Data Import/Export

Filed under: Data,R — Patrick Durusau @ 3:30 pm

R Data Import/Export

After posting about the Excel -> R utility, I started to wonder about R -> Excel, and in researching that question I ran across this page.

Here is the table of contents as of 2012-02-29:

  • Acknowledgements
  • Introduction
  • Spreadsheet-like data
  • Importing from other statistical systems
  • Relational databases
  • Binary files
  • Connections
  • Network interfaces
  • Reading Excel spreadsheets
  • References
  • Function and variable index
  • Concept index

Enjoy!

Reading Excel data is easy with JGR and XLConnect

Filed under: Data,Excel — Patrick Durusau @ 3:30 pm

Reading Excel data is easy with JGR and XLConnect

From the post:

Despite the fact that Excel is the most widespread application for data manipulation and (perhaps) analysis, R’s support for the xls and xlsx file formats has left a lot to be desired. Fortunately, the XLConnect package has been created to fill this void, and now JGR 1.7-8 includes integration with XLConnect package to load .xls and .xlsx documents into R.

For JGR, see: http://rforge.net/JGR/

Text Analytics for Telecommunications – Part 2

Filed under: Telecommunications,Text Analytics — Patrick Durusau @ 3:30 pm

Text Analytics for Telecommunications – Part 2 by Themos Kalafatis.

From the post:

In the previous post we have seen the problems that a highly inflected language creates and also a very basic example of Competitive Intelligence. The Case Study that I will present in the forthcoming European Text Analytics Summit is about the analysis of Telco Subscriber conversations on FaceBook and Twitter that involve Telenor, MT:S and VIP Mobile located in Serbia.

It is time to see what Topics are found in subscriber conversations. Each Telco has its own FaceBook page which contains posts and comments generated by page curators and subscribers. Each post and comment also generates “Likes” and “Shares”. Several types of analysis can be performed to find out:

  1. What kind of Topics are discussed in posts and comments of each Telco FaceBook page?
  2. What is the sentiment?
  3. Which posts (and comments) tend to be liked and shared (=generate Interest and reactions)?

Themos continues his series on text analytics for Telcos.

Here he moves into Facebook comments and analysis of the same.

Topic Maps as Indexing Tools in the Educational Sphere:…

Filed under: Education,Topic Maps — Patrick Durusau @ 3:29 pm

Topic Maps as Indexing Tools in the Educational Sphere: Theoretical Foundations, Review of Empirical Research and Future Challenges by Vivek Venkatesh, Kamran Shaikh and Amna Zuberi.

Lars Marius Garshol sent a note concerning this chapter on education and topic maps (which appears in Cognitive Maps).

From the introduction:

Topic Maps (International Organization of Standardization [ISO 13250], 1999; 2002) are a form of indexing that describe the relationships between concepts within a domain of knowledge and link these concepts to descriptive resources. Topic maps are malleable – the concept and relationship creation process is dynamic and user-driven. In addition, topic maps are scalable and can hence be conjoined and merged. Perhaps, most impressively, topic maps provide a distinct separation between resources and concepts, thereby facilitating migration of the data models therein (Garshol, 2004).

Topic map technologies are extensively employed to navigate databases of information in the fields of medicine, military, and corporations. Many of these proprietary topic maps are machine-generated through the use of context-specific algorithms which read a corpus of text, and automatically produce a set of topics along with the relationships among them. However, there has been little, if any, research on how to use cognitive notions of mental models, knowledge representation and decision-making processes employed in problem-solving situations as a basis for the design of ontologies for topic maps.

This chapter will first outline the theoretical foundations in educational psychology and cognitive information retrieval that should underlie the development of ontologies that describe topic maps. The conjectural analyses presented will reveal how various modes of online interaction between key stakeholders (e.g., instructors, learners, content and graphical user interfaces), as well as the classic information processing model, mental models and related research on problem representation must be integrated into our current understanding of how the design of topic maps can better reflect the relationships between concepts in any given domain. Next, the chapter outlines a selective review of empirical research conducted on the use of topic maps in educational contexts, with a focus on learner perceptions and cognitions. Finally, the chapter provides comments on what the future holds for researchers who are committed to the development, implementation, and evaluation of topic map indexes in educational contexts.

This chapter presents a very useful review of the literature that exists on topic maps in education. It is clear that much remains to be done to investigate the possible roles of topic maps in education.

Of particular interest is the suggestion that topic maps be used for learners to see themselves from multiple perspectives. An introspective use of topic maps as opposed to organization of knowledge external to ourselves.

March 20, 2012

Text Analytics for Telecommunications – Part 1

Filed under: Telecommunications,Text Analytics,Text Extraction — Patrick Durusau @ 3:54 pm

Text Analytics for Telecommunications – Part 1 by Themos Kalafatis.

From the post:

As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study which I will present in the European Text Analytics Summit is about analyzing and understanding thousands of Non-English FaceBook posts and Tweets for Telco Brands and their Topics, leading to what is known as Competitive Intelligence.

The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile which are located in Serbia. The analysis aims to identify the perception of Customers for each of the three Companies mentioned and understand the Positive and Negative elements of each Telco as this is captured from the Voice of the Customers – Subscribers.

The start of a very useful series on non-English text analysis. The sort that is in demand by agencies of various governments.

Come to think of it, text analysis of English/non-English government information is probably in demand by non-government groups. 😉

Graphs in Operations

Filed under: Graphs,IT,Operations — Patrick Durusau @ 3:53 pm

Graphs in Operations by John E. Vincent.

From the post:

Anyone who has ever used Puppet or Git has dabbled in graphs even if they don’t know it. However my interest in graphs in operations relates to the infrastructure as a whole. James Turnbull expressed it very well last year in Mt. View when discussing orchestration. Obviously this is a topic near and dear to my heart.

Right now much of orchestration is in the embryonic stages. We define relationships manually. We register watches on znodes. We define hard links between components in a stack. X depends on Y depends on Z. We’re not really being smart about it. If someone disagrees, I would LOVE to see a tool addressing the space.

An interesting post from a sysadmin perspective on the relationships that graphs could make explicit. Once made explicit, we could attach properties to those relationships (or associations, in topic map talk).

Imagine the various *nix tools monitoring a user’s activities at multiple locations on the network, and that data, along with the relationships, being merged with other data.

First saw this at Alex Popescu’s myNoSQL.

Wheel Re-invention: Change Data Capture systems

Filed under: Change Data,Data,Databus — Patrick Durusau @ 3:53 pm

LinkedIn: Creating a Low Latency Change Data Capture System with Databus

Siddharth Anand, a senior member of LinkedIn’s Distributed Data Systems team, writes:

Having observed two high-traffic web companies solve similar problems, I cannot help but notice a set of wheel-reinventions. Some of these problems are difficult and it is truly unfortunate for each company to solve its problems separately. At the same time, each company has had to solve these problems due to an absence of a reliable open-source alternative. This clearly has implications for an industry dominated by fast-moving start-ups that cannot build 50-person infrastructure development teams or dedicate months away from building features.

Siddharth goes on to address a particular re-invention of the wheel: change data capture systems.

And he has a solution to this wheel re-invention problem: Databus. (Not good for all situations but worth your time to read carefully, along with following the other resources.)

From the post:

Databus is an innovative solution in this space.

It offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

Authorization and Authentication In Hadoop

Filed under: Hadoop — Patrick Durusau @ 3:53 pm

Authorization and Authentication In Hadoop by Jon Natkins.

From the post:

One of the more confusing topics in Hadoop is how authorization and authentication work in the system. The first and most important thing to recognize is the subtle, yet extremely important, differentiation between authorization and authentication, so let’s define these terms first:

Authentication is the process of determining whether someone is who they claim to be.

Authorization is the function of specifying access rights to resources.

In simpler terms, authentication is a way of proving who I am, and authorization is a way of determining what I can do.

Let me see if I can summarize the authentication part: If you are responsible for the Hadoop cluster and unauthenticated users can access it, you need to have a backup job.

Hadoop doesn’t have authentication enabled by default, but authentication for access to the cluster could be performed by some other mechanism, such as controlling access to the network where the cluster resides.

There are any number of ways to do authentication, but leaving a network asset without any authentication is a recipe for being fired as soon as it is discovered.
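For the Kerberos route, here is a hedged sketch of what an authenticated Java client looks like, assuming security is already switched on in core-site.xml (hadoop.security.authentication set to kerberos, hadoop.security.authorization set to true) and that a keytab exists for the principal. The principal name and keytab path are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml, hdfs-site.xml
            UserGroupInformation.setConfiguration(conf);
            // Authenticate: prove who we are, using a keytab instead of a password.
            UserGroupInformation.loginUserFromKeytab(
                    "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");
            // From here on, requests carry the authenticated identity. Whether they are
            // authorized is decided separately by the cluster's ACLs and file permissions.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/user/etl")));
        }
    }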

Authorization regulates access and usage of cluster assets.

Here’s the test for authentication and authorization on your mission-critical Hadoop cluster. While sitting in front of your cluster admin’s desk, ask for a copy of the authentication and authorization policies and settings for your cluster. If they can’t send it to a printer, you need another cluster admin. It is really that simple.

Designing Search (part 3): Keeping on track

Filed under: Interface Research/Design,Search Behavior,Search Interface,Searching — Patrick Durusau @ 3:52 pm

Designing Search (part 3): Keeping on track by Tony Russell-Rose

From the post:

In the previous post we looked at techniques to help us create and articulate more effective queries. From auto-complete for lookup tasks to auto-suggest for exploratory search, these simple techniques can often make the difference between success and failure.

But occasionally things do go wrong. Sometimes our information journey is more complex than we’d anticipated, and we find ourselves straying off the ideal course. Worse still, in our determination to pursue our original goal, we may overlook other, more productive directions, leaving us endlessly finessing a flawed strategy. Sometimes we are in too deep to turn around and start again.

(graphic omitted)

Conversely, there are times when we may consciously decide to take a detour and explore the path less trodden. As we saw earlier, what we find along the way can change what we seek. Sometimes we find the most valuable discoveries in the most unlikely places.

However, there’s a fine line between these two outcomes: one person’s journey of serendipitous discovery can be another’s descent into confusion and disorientation. And there’s the challenge: how can we support the former, while unobtrusively repairing the latter? In this post, we’ll look at four techniques that help us keep to the right path on our information journey.

Whether you are writing a search interface or simply want to know more about what factors to consider in evaluating a search interface, this series by Tony Russell-Rose is well worth your time.

If you are writing a topic map, you already have as a goal the collection of information for some purpose. It would be sad if the information you collect isn’t findable due to poor interface design.

From counting citations to measuring usage (help needed!)

Filed under: Citation Indexing,Classification,Data — Patrick Durusau @ 3:52 pm

From counting citations to measuring usage (help needed!)

Daniel Lemire writes:

We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about two dozen papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected.

But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant: for example it can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much following papers build on your results. Why isn’t it done?

Daniel wants to distinguish important papers that cite his papers from ho-hum papers that cite him. (My characterization, not his.)

That isn’t happening now, so Daniel has teamed up with Peter Turney and Andre Vellino to gather data from published authors (that would be you) to use in investigating this problem.

Topic maps of scholarly and other work face the same problem. How do you distinguish the important from the less so? For that matter, what criteria do you use? If an author who cites you wins the Nobel Prize for work that doesn’t cite you, does the importance of your paper go up? Stay the same? Go down? 😉

It is an important issue so if you are a published author, see Daniel’s post and contribute to the data gathering.

Worst-case Optimal Join Algorithms

Filed under: Algorithms,Database,Joins — Patrick Durusau @ 3:52 pm

Worst-case Optimal Join Algorithms by Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra.

Abstract:

Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer’s entropy inequality which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running time achieve these optimal bounds. An answer to this question may be interesting to database practice, as it is known that any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS are asymptotically slower than the optimal for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer’s inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.

With reference to the optimal join problem the authors say:

Implicitly, this problem has been studied for over three decades: a modern RDBMS uses decades of highly tuned algorithms to efficiently produce query results. Nevertheless, as we described above, such systems are asymptotically suboptimal – even in the above simple example of (1). Our main result is an algorithm that achieves asymptotically optimal worst-case running times for all conjunctive join queries.

The authors’ strategy involves evaluating the keys in a join and dividing those keys into separate sets. The information used by the authors has always been present, just not used in join processing (pp. 2-3 of the article).
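As a toy illustration of the flavor of the result (this is not the paper’s algorithm): the triangle query is the standard example where pairwise join plans can blow up well beyond the size of the final answer. Intersecting the neighbour sets of each edge’s endpoints, always iterating over the smaller set, avoids materializing the pairwise join at all.

    import java.util.*;

    public class Triangles {
        // adj maps each vertex to its neighbour set (both directions stored).
        public static int count(Map<Integer, Set<Integer>> adj) {
            int triangles = 0;
            for (Map.Entry<Integer, Set<Integer>> e : adj.entrySet()) {
                int a = e.getKey();
                for (int b : e.getValue()) {
                    if (a >= b) continue;                      // consider each edge once
                    Set<Integer> na = adj.get(a), nb = adj.get(b);
                    Set<Integer> small = na.size() <= nb.size() ? na : nb;
                    Set<Integer> large = (small == na) ? nb : na;
                    for (int c : small)                        // intersect smaller into larger
                        if (c > b && large.contains(c)) triangles++;
                }
            }
            return triangles;
        }

        public static void main(String[] args) {
            Map<Integer, Set<Integer>> adj = new HashMap<>();
            int[][] edges = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
            for (int[] e : edges) {
                adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
                adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
            }
            System.out.println(count(adj)); // 1
        }
    }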

There are myriad details to be mastered in the article, but I suspect this line of thinking may be profitable in many situations where “join” operations are relevant.

FDB: A Query Engine for Factorised Relational Databases

Filed under: Data Factorization,Factorised Databases,Graph Databases — Patrick Durusau @ 3:51 pm

FDB: A Query Engine for Factorised Relational Databases by Nurzhan Bakibayev, Dan Olteanu, and Jakub Závodný.

Abstract:

Factorised databases are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. This paper introduces FDB, an in-memory query engine for select-project-join queries on factorised databases. Key components of FDB are novel algorithms for query optimisation and evaluation that exploit the succinctness brought by data factorisation. Experiments show that for data sets with many-to-many relationships FDB can outperform relational engines by orders of magnitude.
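To make “factorised representation” concrete: when a block of a relation is a Cartesian product of independent value sets, you can store the sets and the product structure instead of enumerating every tuple, which is where the orders-of-magnitude gains on many-to-many joins come from. A toy sketch of the idea, under my reading of the paper rather than its actual data structures:

    import java.util.*;
    import java.util.function.BiConsumer;

    public class FactorisedRelation {
        // Represents (A x B): |A| + |B| stored values instead of |A| * |B| tuples.
        final List<String> a;
        final List<String> b;

        FactorisedRelation(List<String> a, List<String> b) { this.a = a; this.b = b; }

        long tupleCount() { return (long) a.size() * b.size(); }

        // Enumerate flat tuples only when a consumer actually needs them.
        void forEachTuple(BiConsumer<String, String> consume) {
            for (String x : a) for (String y : b) consume.accept(x, y);
        }

        public static void main(String[] args) {
            FactorisedRelation bought = new FactorisedRelation(
                    Arrays.asList("alice", "bob", "carol"),
                    Arrays.asList("book", "pen"));
            System.out.println(bought.tupleCount() + " tuples represented by "
                    + (bought.a.size() + bought.b.size()) + " stored values");
            bought.forEachTuple((customer, item) ->
                    System.out.println(customer + " bought " + item));
        }
    }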

It is twelve pages of dense slogging but I wonder if you have a reaction to:

Finally, factorised representations are relational algebra expressions with well-understood semantics. Their relational nature sets them apart from XML documents, object-oriented databases, and nested objects [2], where the goal is to avoid the rigidity of the relational model. (on the second page)

Where [2] is: S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

Online version of Foundations of Databases

DBLP has a nice listing of the references (with links) in Foundations of Databases

Abiteboul and company are cited without a page reference (my printed edition is 685 pages long), and the only comparison I can uncover between the relational model and any of the alternatives mentioned is that an object-oriented database has oids, which, unlike keys, aren’t members of a “printable class.”

I am not sure what sort of oid isn’t a member of a “printable” class but am willing to leave that to one side for the moment.

My problem is with the characterization “…to avoid the rigidity of the relational model.”

The relational model has been implemented in any number of rigid ways, but is that the fault of a model based on operations on tuples, which can be singletons?

What if factorisation were applied to a graph database, composed of singletons, enabling the use of “…relational algebra expressions with well-understood semantics.”?

It sounds like factorisation could speed up classes of “expected” queries across graph databases. I don’t think anyone creates a database, graph or otherwise, without some classes of queries in mind. The user would be no worse off when they create an unexpected query.

…trimming the spring algorithm for drawing hypergraphs

Filed under: Graphs,Hypergraphs,Visualization — Patrick Durusau @ 3:51 pm

…trimming the spring algorithm for drawing hypergraphs by Harri Klemetti, Ismo Lapinleimu, Erkki Mäkinen, and Mika Sieranta. ACM SIGCSE Bulletin, Volume 27 Issue 3, Sept. 1995.

Abstract:

Graph drawing problems provide excellent material for programming projects. As an example, this paper describes the results of an undergraduate project which dealt with hypergraph drawing. We introduce a practical method for drawing hypergraphs. The method is based on the spring algorithm, a well-known method for drawing normal graphs.
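For readers who have not met it, the spring algorithm treats the graph as a physical system: every pair of vertices repels, every edge pulls its endpoints together like a spring, and the layout is iterated until it settles. A minimal sketch of one iteration for an ordinary graph is below; the constants are arbitrary and the paper’s contribution is in adapting this kind of step to hyperedges.

    public class SpringStep {
        // pos[i] = {x, y}; edge[i][j] = true if i and j are adjacent; k = ideal edge length.
        public static void step(double[][] pos, boolean[][] edge, double k) {
            int n = pos.length;
            double[][] force = new double[n][2];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double dx = pos[j][0] - pos[i][0], dy = pos[j][1] - pos[i][1];
                    double d = Math.max(1e-9, Math.hypot(dx, dy));
                    // All pairs repel; adjacent vertices also attract (the "spring").
                    double f = -(k * k) / d + (edge[i][j] ? (d * d) / k : 0.0);
                    force[i][0] += f * dx / d;
                    force[i][1] += f * dy / d;
                }
            }
            double cool = 0.1; // step size ("temperature"), normally decreased over iterations
            for (int i = 0; i < n; i++) {
                pos[i][0] += cool * force[i][0];
                pos[i][1] += cool * force[i][1];
            }
        }
    }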

Neither the earliest nor the latest work on drawing hypergraphs (for which there is apparently no consensus method), but something I ran across while researching the issue. I thought it best to write it down so I can refer to it from other posts.

Hypergraphs have a long history in the analysis of relational databases, and I suspect their application to modeling NoSQL databases has already happened, or at least isn’t far off. Not to mention their relevance to graph databases.

In any event, being able to visualize hypergraphs, by one or more methods, is likely to be useful not only for topic map authors and users but for other investigators as well.

March 19, 2012

Document Frequency Limited MultiTermQuerys

Filed under: Lucene,Query Expansion,Searching — Patrick Durusau @ 6:55 pm

Document Frequency Limited MultiTermQuerys

From the post:

If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.

Querying this kind of data while only matching the traditional spelling of a word can lead to many valid results being missed. One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.

You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.

Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.

Not to give too much away, but you will learn how to tune a fuzzy match of terms. (To account for misspellings, for example.)
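As a taste of the machinery involved, here is a hedged sketch using Lucene’s built-in rewrite methods, which simply cap the number of expanded terms; the post goes further and filters candidate terms by document frequency. Class names follow the Lucene 3.x-era API, so check them against the version you are running.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class LimitedExpansion {
        public static void main(String[] args) {
            FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "hotel"));
            WildcardQuery wild = new WildcardQuery(new Term("body", "h*t*l"));
            // Keep only the 50 best expanded terms instead of every variation in the index.
            MultiTermQuery.RewriteMethod rewrite =
                    new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(50);
            fuzzy.setRewriteMethod(rewrite);
            wild.setRewriteMethod(rewrite);
            // Hand fuzzy/wild to an IndexSearcher as usual.
        }
    }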

This is a very good site and blog for search issues.
