Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

December 4, 2010

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following up on Kafka would lead me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Zoie was donated by LinkedIn.com on July 19, 2008, and has been deployed on a real-time, large-scale consumer website: LinkedIn.com handles millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.
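
If near-real-time search is new to you, here is a minimal sketch against bare Lucene (class names from the Lucene 2.9 API as best I recall them, so treat the details as assumptions; Zoie layers batching and index view management on top of this):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NearRealTime {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("title", "breaking news",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // Lucene 2.9's near-real-time reader: the add is searchable
        // without waiting for a full commit to disk.
        IndexReader reader = writer.getReader();
        IndexSearcher searcher = new IndexSearcher(reader);
        int hits = searcher.search(
            new TermQuery(new Term("title", "breaking")), 10).totalHits;
        System.out.println("hits before any commit: " + hits); // 1
      }
    }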

Design Goals:

  • Additions of documents must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions of documents must not fragment the index (which hurts search performance)
  • Deletes and/or updates of documents must not affect search performance.

In topic map terms:

  • Additions to a topic map must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions to a topic map must not fragment the index (which hurts search performance)
  • Deletes and/or updates of a topic map must not affect search performance.

I would say that goals 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

  • merging to occur
  • merging to be undone
  • roles to be played
  • roles to no longer be played
  • associations to become valid
  • associations to become invalid

to name only a few.

It may be possible to formally prove the impact that certain events will have but I am not aware of any definitive analysis on the subject.

Suggestions?

Kafka: A high-throughput distributed messaging system – Post

Filed under: Software,Topic Map Systems — Patrick Durusau @ 5:54 am

Kafka: A high-throughput distributed messaging system

Caught my eye:

Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.

Depending on your message passing requirements for your topic map application, this could be of interest. Better to concentrate on the semantic heavy lifting than re-inventing message passing when solutions like this exist.
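
To make “per-partition ordering” concrete, here is a toy in-process sketch of the contract (this is not Kafka's API, just the idea): messages with the same key always land in the same partition, so a consumer of that partition sees them in send order.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ToyPartitionedLog {
      private final Map<Integer, List<String>> partitions =
          new HashMap<Integer, List<String>>();
      private final int numPartitions;

      public ToyPartitionedLog(int numPartitions) { this.numPartitions = numPartitions; }

      // Same key -> same partition, so per-key ordering is preserved.
      public void send(String key, String message) {
        int p = Math.abs(key.hashCode() % numPartitions);
        List<String> log = partitions.get(p);
        if (log == null) { log = new ArrayList<String>(); partitions.put(p, log); }
        log.add(message);
      }

      public List<String> readPartition(int p) {
        List<String> log = partitions.get(p);
        return log == null ? new ArrayList<String>() : log;
      }

      public static void main(String[] args) {
        ToyPartitionedLog log = new ToyPartitionedLog(4);
        log.send("topic-map-42", "merge topic A into B");
        log.send("topic-map-42", "add name to B"); // same key: guaranteed to follow
        int p = Math.abs("topic-map-42".hashCode() % 4);
        System.out.println(log.readPartition(p));
      }
    }

For a topic map application, the key might be a topic or map identifier, so updates to the same item always arrive in order.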

December 2, 2010

Apache Tika – a content analysis toolkit

Filed under: Authoring Topic Maps,Data Mining,Software — Patrick Durusau @ 7:57 pm

Apache Tika – a content analysis toolkit

From the website:

Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Formats include:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document format
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

Sounds like we are getting close to pipelines for topic map production.
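
As a hint of what the front end of such a pipeline could look like, a minimal Tika sketch (class names as I recall the Tika API, so verify against the version you download):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractText {
      public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]);
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // AutoDetectParser picks a parser for any of the formats above.
        new AutoDetectParser().parse(in, handler, metadata);
        System.out.println("type: " + metadata.get(Metadata.CONTENT_TYPE));
        System.out.println(handler.toString()); // plain text for the next stage
      }
    }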

Comments?

Silverlight Pivotviewer – Addendum

Filed under: Interface Research/Design,Pivotviewer,Software — Patrick Durusau @ 11:19 am

I was amused to read:

At a high-level, CXML can be thought of as a set of property/value pairings. Facets are like property values on an item, and facet categories are groups of facets. For example: if a collection had a facet category called “U.S. State,” then “Georgia” could be a facet in that category. Depending on authoring choices, these facets may be displayed as filters in the PivotViewer collection experience, or included in the details of an item. (Collection XML Schema)

Sounds a lot like the Topic Maps Reference Model.

Or, the game of twenty questions.

That is, the subject you are identifying is broader or narrower depending upon the number of key/value pairs you specify.

The Pivotviewer allows you to go from a very broad subject to a very narrow, even a specific one, by adding key/value pairs.
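
A toy illustration of that narrowing (facet names borrowed from the quoted example, everything else hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    public class TwentyQuestions {
      // An item matches when every specified key/value pair agrees with it.
      static boolean matches(Map<String, String> item, Map<String, String> facets) {
        for (Map.Entry<String, String> f : facets.entrySet()) {
          if (!f.getValue().equals(item.get(f.getKey()))) return false;
        }
        return true;
      }

      public static void main(String[] args) {
        Map<String, String> item = new HashMap<String, String>();
        item.put("U.S. State", "Georgia");
        item.put("Crop", "Peach");

        Map<String, String> facets = new HashMap<String, String>();
        facets.put("U.S. State", "Georgia");       // broad subject...
        System.out.println(matches(item, facets)); // true
        facets.put("Crop", "Pecan");               // ...narrowed further
        System.out.println(matches(item, facets)); // false: narrower subject excludes it
      }
    }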

Legends enable users to arrive at the same broad or narrow subject, even if they have different key/value pairs for that subject.

Hey, that is rather neat and practical, isn't it? (Take note, Lars. He knows which one.)

Will have to investigate how to combine collection XML schemas to make that point.

More to follow on this topic (sorry) anon.

Silverlight Pivotviewer – A Turning Point For Visualizing Topic Maps?

Filed under: Interface Research/Design,Pivotviewer,Software — Patrick Durusau @ 9:32 am

Andrew Townley has suggested Gary Flake: is Pivot a turning point for web exploration? as an example of “wow” factor.

From a search on Pivot I found: Silverlight Pivotviewer is no longer experimental.

On the issue of “wow” factor, I have to agree. This is truly awesome.

I am sure there are corner cases and bugs, but I think kudos are due to the developers of Silverlight Pivotviewer.

Now the question is what do “we,” as in the topic maps community, do with this nice shiny tool?

Questions:

  1. What are the factors you would consider for navigation of your topic map? (3-5 pages, no citations)
  2. How would you test your navigation choices? (3-5 pages, no citations)
  3. Demonstration of navigation of your topic map. (class demonstration)

November 30, 2010

Apache Mahout – Website

Filed under: Classification,Clustering,Data Mining,Mahout,Pattern Recognition,Software — Patrick Durusau @ 8:54 pm

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)

A topic maps class will only have enough time to show some examples of using Mahout. Perhaps an informal group?
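
By way of a first example, a user-based recommender with Mahout's Taste API might look like the sketch below (class names from memory of the Mahout 0.4 era, and prefs.csv is a hypothetical userID,itemID,rating file):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class FirstRecommender {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv")); // hypothetical data
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recs = recommender.recommend(1L, 3); // top 3 for user 1
        for (RecommendedItem item : recs) {
          System.out.println(item.getItemID() + " scored " + item.getValue());
        }
      }
    }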

RDF Extension for Google Refine

Filed under: Data Mining,RDF,Software — Patrick Durusau @ 1:09 pm

RDF Extension for Google Refine

From the website:

This project adds a graphical user interface (GUI) for exporting data of Google Refine projects in RDF format. The export is based on mapping the data to a template graph using the GUI.

See my earlier post on Google Refine 2.0.

BTW, if you don’t know the folks at DERI – Digital Enterprise Research Institute take a few minutes (it will stretch into hours) to explore their many projects. (I will be doing a separate post on projects of particular interest for topic maps from DERI soon.)

November 29, 2010

Cloud9: a MapReduce library for Hadoop

Filed under: Hadoop,MapReduce,Software — Patrick Durusau @ 1:40 pm

Cloud9: a MapReduce library for Hadoop

From the website:

Cloud9 is a MapReduce library for Hadoop designed to serve as both a teaching tool and to support research in data-intensive text processing. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. Hadoop provides an open-source implementation of the programming model. The library itself is available on github and distributed under the Apache License.

See Data-Intensive Text Processing with MapReduce by Lin and Dyer for more details on MapReduce.

Guide to using the Cloud9 library, including its use on particular data sets, such as Wikipedia.
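
For readers new to the programming model, the canonical word count in Hadoop's own API gives the flavor (a sketch with the job setup omitted; Cloud9 supplies its own helpers and data set support on top of this):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // map: emit (token, 1) for every token in the input line
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer st = new StringTokenizer(value.toString());
          while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce: sum the 1s gathered for each token
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }
    }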

Data-Intensive Information Processing Apps

Filed under: MapReduce,Software — Patrick Durusau @ 10:02 am

Data-Intensive Information Processing Applications

University of Maryland course by Jimmy Lin (jimmylin@umd.edu) and Nitin Madnani (nmadnani@umiacs.umd.edu).

From the course site:

This course is about scalable approaches to processing large amounts of information (terabytes and even petabytes). We focus mostly on MapReduce, which is presently the most accessible and practical means of computing at this scale, but will discuss other approaches as well.

Includes slides, data sets, etc.

Google App Engine

Filed under: Google App Engine,Software — Patrick Durusau @ 9:44 am

Google App Engine 1.4.0 pre-release is out!

I ran across this today.

From the webpage:

Every Google App Engine application gets enough free resources to serve approximately 5 million monthly page views.

I should be able to fit into that limitation. 😉

Is anyone working on a topic map application using the Google App Engine?

Lucene / Solr for Academia: PhD Thesis Ideas
(Disambiguated Indexes/Indexing)

Filed under: Lucene,Software,Solr — Patrick Durusau @ 5:57 am

Lucene / Solr for Academia: PhD Thesis Ideas

Excellent opportunity to make suggestions that could result not only in more academic work but also in advancement of useful open source software.

My big idea (I don't mind if you borrow/steal it for implementation):

We all know how traditional indexes work. They gather up single tokens and then point back to the locations in documents where they are found.

So they can’t distinguish among differing uses of the same string. One aspect of the original indexing problem that lead to topic maps.

What if indexers could be given configuration files that said: when indexing http://www.medicalsite.org, create a tuple for indexing that includes site=www.medicalsite.org, type=medical, etc., and index the entire tuple as a single entry.

And enable indexers to index by members of the tuples so that if I decide that all uses of a term of type=medical mean the same subject, I can produce an index that represents that choice.
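
Here is a toy sketch of the idea (all names hypothetical): the tuple is the index key, and a lookup can collapse tuples that share a term and type.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TupleIndex {
      // Postings keyed by the whole tuple, not the bare token.
      private final Map<String, List<Integer>> postings =
          new HashMap<String, List<Integer>>();

      public void index(String term, String type, String site, int docId) {
        String tuple = term + "|type=" + type + "|site=" + site;
        List<Integer> docs = postings.get(tuple);
        if (docs == null) { docs = new ArrayList<Integer>(); postings.put(tuple, docs); }
        docs.add(docId);
      }

      // Treat every tuple with this term and type as the same subject,
      // regardless of site.
      public List<Integer> lookupByType(String term, String type) {
        String prefix = term + "|type=" + type + "|";
        List<Integer> result = new ArrayList<Integer>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
          if (e.getKey().startsWith(prefix)) result.addAll(e.getValue());
        }
        return result;
      }

      public static void main(String[] args) {
        TupleIndex idx = new TupleIndex();
        idx.index("cold", "medical", "www.medicalsite.org", 1);
        idx.index("cold", "weather", "www.weather.example", 2);
        System.out.println(idx.lookupByType("cold", "medical")); // [1]
      }
    }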

Sounds a lot like merging, doesn't it?

I don’t know of any index that does what I just described but I don’t know all indexes so if I have overlooked something, please sing out.

If successful, this would be an entirely different way of authoring topic maps against large information stores.

Not to mention creating the opportunity to monetize indexes as separate from the information resources themselves. The Readers' Guide to Periodical Literature is a successful example of that approach as a product.

Hmmm, needs a name, how about: Disambiguated Indexes/Indexing?

Suggestions?

November 25, 2010

Virtuoso Open-Source Edition

Filed under: Linked Data,RDF,Semantic Web,Software — Patrick Durusau @ 7:06 am

Virtuoso Open-Source Edition

I ran across Virtuoso while running down the references in the article on SIREn. (Yes, I check references, not all of them, just the most interesting ones, as time permits.)

Has partial support for a variety of “Semantic Web” technologies.

Is the basis for OpenLink Data Spaces.

A named structured data cluster within a distributed data network where each item of data (each “datum”) has a unique identifier. Fundamental characteristics of data spaces include:

  • Each Data Item (or Entity) is endowed with a unique HTTP-based Identifier
  • Entity Identity, Access, and Representation are each distinct from the others
  • Entities are interlinked via attributes and relationship properties
  • Creation, Update, and Deletion privileges are controlled by the space owner

I can think of lots of “data spaces” (Large Hadron Collider data, radio and optical astronomy data dumps, TCP/IP data streams, bioinformatics data, commercial transaction databases) that don't fit this description. Please submit your own.

Still, if you want to learn the ins and outs as well as the limitations of this approach, it costs nothing more than the time to download the software.

November 24, 2010

IRODS

Filed under: Astroinformatics,Software,Space Data — Patrick Durusau @ 2:49 pm

iRODS: Data Grids, Digital Libraries, Persistent Archives, and Real-time Data Systems

From the website:

iRODS™, the Integrated Rule-Oriented Data System, is a data grid software system developed by the Data Intensive Cyber Environments research group (developers of the SRB, the Storage Resource Broker), and collaborators. The iRODS system is based on expertise gained through a decade of applying the SRB technology in support of Data Grids, Digital Libraries, Persistent Archives, and Real-time Data Systems. iRODS management policies (sets of assertions these communities make about their digital collections) are characterized in iRODS Rules and state information. At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions. iRODS is open source under a BSD license. (emphasis in original)

Provides an umbrella over data sources to present a uniform view to users.

The rules and metadata don’t appear to be as granular as one expects with topic maps.

I mention it here because of its use/importance with space data and as a current research platform into sharing data.
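
For readers who have not met a rule engine, the general shape is: a rule pairs a condition with an action, and the engine fires the actions whose conditions match an event. A toy sketch of that shape (illustrative only; iRODS has its own rule language and state model):

    import java.util.ArrayList;
    import java.util.List;

    public class ToyRuleEngine {
      interface Rule {
        boolean matches(String event);
        void apply(String event);
      }

      private final List<Rule> rules = new ArrayList<Rule>();

      public void add(Rule r) { rules.add(r); }

      // Fire every rule whose condition matches the incoming event.
      public void handle(String event) {
        for (Rule r : rules) if (r.matches(event)) r.apply(event);
      }

      public static void main(String[] args) {
        ToyRuleEngine engine = new ToyRuleEngine();
        engine.add(new Rule() { // hypothetical replication policy
          public boolean matches(String e) { return e.startsWith("put:"); }
          public void apply(String e) { System.out.println("replicate " + e.substring(4)); }
        });
        engine.handle("put:/archive/scan-001.fits"); // fires the rule
      }
    }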

Questions:

  1. Prepare a current, annotated bibliography for the project.
  2. What are the main strengths/weaknesses of this approach? (3-5 pages, citations)

November 15, 2010

SecondString

Filed under: Searching,Software,String Matching — Patrick Durusau @ 4:56 am

SecondString is a Java library of string matching techniques.

The Levenshtein distance test mentioned in the LIMES post is an example of a string matching technique.

The results are not normalized, so compare results across the techniques cautiously.
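
To see why that caveat matters, here is a self-contained Levenshtein sketch (not SecondString's API) with one common way of normalizing the raw distance into [0,1]:

    public class EditDistance {
      // Classic dynamic-programming Levenshtein distance.
      static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
          for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + cost);
          }
        }
        return d[a.length()][b.length()];
      }

      // Normalize by the longer string so scores are comparable across pairs.
      static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
      }

      public static void main(String[] args) {
        System.out.println(levenshtein("Durusau", "Durusow")); // raw distance: 2
        System.out.println(similarity("Durusau", "Durusow"));  // normalized: ~0.71
      }
    }

A raw distance of 2 means something very different for 4-character strings than for 40-character ones; normalization is what makes scores from different pairs comparable.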

Questions:

  1. Suggest 1 – 2 survey articles on string matching for the class. (The Navarro article cited in Wikipedia on the Levenshtein distance is almost ten years old and, despite numerous exclusions, still runs 58 pages. Excellent article but it needs updating with more recent material.)
  2. What one technique would you use in constructing your topic map? Why? (2-3 pages, citing examples of why it would be the best for your data)

November 14, 2010

Orient: The Database For The Web – Presentation

Filed under: NoSQL,OrientDB,Software — Patrick Durusau @ 9:02 am

Orient: The Database For The Web

Nice slide deck if you need something for the company CTO.

Perhaps to justify a NoSQL conference or further investigation into NoSQL as an option.

I was deeply amused by slide 19’s claim of “Ø Config.”

Maybe true if I am running it on my laptop during a conference presentation.

A bit more thought required for use in or with a topic map system.

Orient is an impressive bit of software and is likely to be used or encountered by topic mappers.

Questions:

  1. Uses of OrientDB in library contexts? (3-5 pages, citations/links)
  2. Download and install OrientDB. How do you evaluate its claim of “Ø Config”? (3-5 pages, no citations)
  3. Extra credit: As librarians you will be asked to evaluate vendor claims about software. Develop a finding aid on software evaluation for librarians faced with that task. (3-5 pages, citations)

November 13, 2010

LIMES – LInk discovery framework for MEtric Spaces

Filed under: Linked Data,Semantic Web,Software — Patrick Durusau @ 7:46 am

LIMES – LInk discovery framework for MEtric Spaces

From the website:

LIMES is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. It is easily configurable via a web interface. It can also be downloaded as standalone tool for carrying out link discovery locally.

LIMES detects “duplicates” in a single source or between sources by use of string metrics.

The current version of LIMES supports exclusively the string metrics Levenshtein, QGrams, BlockDistance and Euclidean as implemented by the SimMetrics library. Further metrics will be included in following versions.

An interesting approach to use as a topic map authoring aid.
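
The metric-space trick at the heart of such frameworks, as I understand it: for a metric d and any exemplar e, the triangle inequality gives d(x, y) >= |d(x, e) - d(y, e)|, so any pair whose lower bound already exceeds the matching threshold can be skipped without computing d(x, y) at all. A toy sketch with numbers standing in for instances:

    public class TrianglePruning {
      public static void main(String[] args) {
        double[] sources = {1.0, 5.0, 9.3};
        double[] targets = {1.1, 7.2, 42.0};
        double exemplar = 0.0, threshold = 0.5;
        int computed = 0;

        for (double s : sources) {
          double ds = Math.abs(s - exemplar); // distance to exemplar, once per source
          for (double t : targets) {
            double dt = Math.abs(t - exemplar);
            // Triangle inequality: |ds - dt| <= d(s, t). If even the lower
            // bound exceeds the threshold, the real distance must too.
            if (Math.abs(ds - dt) > threshold) continue;
            computed++;
            if (Math.abs(s - t) <= threshold) {
              System.out.println(s + " links to " + t);
            }
          }
        }
        System.out.println("distances actually computed: " + computed);
      }
    }

With real string metrics the bound is not exact, but the same kind of inequality is what makes a metric-space approach pay off.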

Questions:

  1. Using the online LIMES interface, develop and run five (5) link discovery requests. Name and save the result files. Upload them to your class project directory. Be prepared to discuss your requests and results in class.
  2. Sign up to be discussion leader for one of the algorithms supported by LIMES. Prepare a two (2) page summary for the class on your algorithm.
  3. What suggestions would you have for the project on its current UI?
  4. Use LIMES to augment your topic map authoring. Comments? (3-5 pages, no citations)
  5. In an actual run, I got the following as owl:sameAs – http://bio2rdf.org/mesh:D016889 and http://data.linkedct.org/page/condition/4398. Your evaluation? You may follow any links you find to make your evaluation. (2-3 pages, include URLs for other locations that you visit)

JUNG Graph Implementation

Filed under: Graphs,Software,Visualization — Patrick Durusau @ 6:48 am

JUNG Graph Implementation

From the website:

The Java Universal Network/Graph Framework is a software library that provides a common and extensible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. It is written in Java, which allows JUNG-based applications to make use of the extensive built-in capabilities of the Java API, as well as those of other existing third-party Java libraries.

JUNG can be used to process property graphs in Gremlin.

The JUNG team notes that some classes may run into memory issues, since JUNG was designed to work with in-memory graphs.

Still, it looks like an effective tool for experimenting with exploration and delivery of information as visualized graphs.
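
Getting a small graph into JUNG takes only a few lines (JUNG 2.x class names as I recall them):

    import edu.uci.ics.jung.graph.Graph;
    import edu.uci.ics.jung.graph.SparseMultigraph;

    public class TinyTopicGraph {
      public static void main(String[] args) {
        // Topics as vertices, an association as a named edge.
        Graph<String, String> g = new SparseMultigraph<String, String>();
        g.addVertex("Puccini");
        g.addVertex("Tosca");
        g.addEdge("composed-by", "Tosca", "Puccini");

        System.out.println(g.getVertexCount() + " vertices, "
            + g.getEdgeCount() + " edges");
        System.out.println("neighbors of Tosca: " + g.getNeighbors("Tosca"));
      }
    }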

November 3, 2010

Aperture: a Java framework for getting data and metadata

Filed under: Data Mining,Software — Patrick Durusau @ 7:07 pm

Aperture: a Java framework for getting data and metadata

From the website:

Aperture is an open source library for crawling and indexing information sources such as file systems, websites and mail boxes. Aperture supports a number of common source types and document formats out-of-the-box and provides easy ways to extend it with custom implementations.

Aperture wiki

Example applications include:

  • bibsonomycrawler.bat – crawls Bibsonomy accounts, extracts bookmarks and tags
  • deliciouscrawler.bat – crawls delicious accounts, extracts bookmarks and tags
  • filecrawler.bat – crawls filesystems, extracts the folder structure, the file metadata and the file content
  • flickrcrawler.bat – crawls flickr accounts, extracts tags, and photos metadata
  • icalcrawler.bat – crawls calendars stored in the well-known iCalendar format, extracts events, todos, journal entries etc.
  • imapcrawler.bat – crawls remote mailboxes accessible with IMAP
  • mboxcrawler.bat – crawls local mailboxes stored in mbox-format files (e.g. those from thunderbird)
  • outlookcrawler.bat – makes a connection with the Outlook instance and crawls appointments, contacts and emails; note that this crawler will obviously only work on Windows if MS Outlook is installed
  • thunderbirdcrawler.bat – crawls a Thunderbird addressbook, extracts contacts; note that for crawling emails you should use the mboxcrawler
  • webcrawler.bat – crawls websites

More tools for your topic map toolbox!

October 31, 2010

OpenII

Filed under: Data Structures,Heterogeneous Data,Information Retrieval,Software — Patrick Durusau @ 7:20 pm

OpenII

From the website:

OpenII is a collaborative effort spearheaded by The MITRE Corporation and Google to create a suite of open-source tools for information integration. The project is leveraging the latest developments in research on information integration to create a platform on which integration applications can be built and further research can be conducted.

The motivation for OpenII is that although a significant amount of research has been conducted on information integration, and several commercial systems have been deployed, many information integration applications are still hard to build. In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. As a result, the research prototypes that have been built are generally not reusable and do not inter-operate with each other. On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts and can easily be tailored to the needs of particular domains.

OpenII tools include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching collections of schemas and extending schemas, and run-time tools for processing queries over heterogeneous data sources.

The M3 metamodel:

The fundamental building block in M3 is the entity. An entity represents information about a set of related real-world objects. Associated with each entity is a set of attributes that indicate what information is captured about each entity. For simplicity, we assume that at most one value can be associated with each attribute of an entity.
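
That building block translates almost directly into code; a Map enforces the at-most-one-value rule by construction. A toy sketch:

    import java.util.HashMap;
    import java.util.Map;

    public class M3Entity {
      private final String name;
      // One value per attribute: a second set() replaces, never adds.
      private final Map<String, String> attributes = new HashMap<String, String>();

      public M3Entity(String name) { this.name = name; }

      public void set(String attribute, String value) {
        attributes.put(attribute, value);
      }

      public static void main(String[] args) {
        M3Entity person = new M3Entity("Person");
        person.set("surname", "Durusau");
        person.set("surname", "Smith"); // overwrites: at most one value survives
        System.out.println(person.name + " " + person.attributes);
      }
    }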

The project could benefit from a strong injection of subject identity based thinking and design.

October 28, 2010

19th ACM International Conference on Information and Knowledge Management

Filed under: Conferences,Information Retrieval,Knowledge Management,Software — Patrick Durusau @ 5:50 am

The front matter for the 19th ACM International Conference on Information and Knowledge Management is a great argument for ACM membership + Digital Library.

There are 126 papers, any one of which would make for a pleasant afternoon.

I will be mining these for those particularly relevant to topic maps but your suggestions would be appreciated.

  1. What conferences do you follow?
  2. What journals do you follow?
  3. What blogs/websites do you follow?

Visit the ACM main site or its membership page: ACM Membership.

Gource

Filed under: Graphs,Software,Visualization — Patrick Durusau @ 5:25 am

Gource – Software Version Control Visualization.

Relevance to topic maps:

Consider the visualization of Blacklight, an open source library software project.

Imagine visualizing source code across an enterprise (are you listening, MS/HP/Oracle/IBM?) so that code, coders, and use of classes can be compared.

Questions:

  1. Sign up with the computer lab to learn about version control.
  2. Use Gource to visualize an open source project’s versions.
  3. Subject Identity and Software (project)

October 22, 2010

Neo4j 1.2 Milestone 2 – Release

Filed under: Graphs,Indexing,Neo4j,Software — Patrick Durusau @ 6:02 am

Neo4j 1.2 Milestone 2 has been released!

Relevant to topic maps in general, and TMQL in particular, are the improvements to indexing and querying capabilities.

Neo4j uses Lucene as a back-end.

Would Neo4j be a good way to prototype proposals for TMQL?

It could be used to evaluate concerns about implementation difficulties.

And quite possibly to encourage the non-invention of new query syntaxes.

A side effect would be demonstrating that Neo4j could be used as a topic map platform.
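
The embedded API makes such a prototype cheap to start. A sketch using the Neo4j 1.x embedded API as I recall it, with topics as nodes and an association as a typed relationship:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class TopicsOnNeo4j {
      enum Assoc implements RelationshipType { COMPOSED_BY }

      public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/topicmap-db");
        Transaction tx = db.beginTx();
        try {
          Node tosca = db.createNode();
          tosca.setProperty("name", "Tosca");
          Node puccini = db.createNode();
          puccini.setProperty("name", "Puccini");
          // The association, reified as a typed relationship.
          tosca.createRelationshipTo(puccini, Assoc.COMPOSED_BY);
          tx.success();
        } finally {
          tx.finish();
        }
        db.shutdown();
      }
    }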

October 17, 2010

SGDB – Simple Graph Database Optimized for Activation Spreading Computation

Filed under: Graphs,Software,Topic Map Software — Patrick Durusau @ 3:38 am

SGDB – Simple Graph Database Optimized for Activation Spreading Computation
Authors: Marek Ciglan and Kjetil Nørvåg
Keywords: spreading activation, graph, link graph, graph database, persistent media

Abstract:

In this paper, we present SGDB, a graph database with a storage model optimized for computation of Spreading Activation (SA) queries. The primary goal of the system is to minimize the execution time of spreading activation algorithm over large graph structures stored on a persistent media; without pre-loading the whole graph into the memory. We propose a storage model aiming to minimize number of accesses to the storage media during execution of SA and we propose a graph query type for the activation spreading operation. Finally, we present the implementation and its performance characteristics in scope of our pilot application that uses the activation spreading over the Wikipedia link graph.

Useful if your topic map won't fit into memory or you want to use spreading activation with your topic map. Not to mention that SGDB has some interesting performance characteristics versus a general graph database. Or so the article says; I haven't verified the claims, so you need to make your own judgment.
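
If spreading activation is new to you, the algorithm itself is short: seed some nodes with energy, push a decayed share to each neighbor every round, and keep whatever stays above a threshold. A minimal in-memory sketch (SGDB's contribution is doing this efficiently against persistent storage, which this toy ignores):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    public class SpreadingActivation {
      // graph: node -> neighbors; seeds: node -> initial energy.
      public static Map<String, Double> spread(Map<String, Set<String>> graph,
          Map<String, Double> seeds, double decay, double threshold, int rounds) {
        Map<String, Double> activation = new HashMap<String, Double>(seeds);
        for (int r = 0; r < rounds; r++) {
          Map<String, Double> next = new HashMap<String, Double>(activation);
          for (Map.Entry<String, Double> e : activation.entrySet()) {
            Set<String> neighbors = graph.get(e.getKey());
            if (neighbors == null || neighbors.isEmpty()) continue;
            double share = decay * e.getValue() / neighbors.size(); // decayed, split evenly
            for (String n : neighbors) {
              Double cur = next.get(n);
              next.put(n, (cur == null ? 0.0 : cur) + share);
            }
          }
          activation = next;
        }
        // Keep only nodes that ended above the activation threshold.
        Iterator<Double> it = activation.values().iterator();
        while (it.hasNext()) if (it.next() < threshold) it.remove();
        return activation;
      }
    }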

Software: SGDB

October 14, 2010

Using text animated transitions to support navigation in document histories

Filed under: Authoring Topic Maps,Interface Research/Design,Software,Trails,Visualization — Patrick Durusau @ 10:34 am

Using text animated transitions to support navigation in document histories
Authors: Fanny Chevalier, Pierre Dragicevic, Anastasia Bezerianos, Jean-Daniel Fekete
Keywords: animated transitions, revision control, text editing

Abstract:

This article examines the benefits of using text animated transitions for navigating in the revision history of textual documents. We propose an animation technique for smoothly transitioning between different text revisions, then present the Diffamation system. Diffamation supports rapid exploration of revision histories by combining text animated transitions with simple navigation and visualization tools. We finally describe a user study showing that smooth text animation allows users to track changes in the evolution of textual documents more effectively than flipping pages.

Project website: http://www.aviz.fr/diffamation/

The video of tracking changes to a document has to be seen to be appreciated.

How to visualize changes/revisions to a topic map remains a research question. This is one starting place.

October 10, 2010

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA

Filed under: Computation,CUDA,Graphic Processors,Graphs,Software — Patrick Durusau @ 11:58 am

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA
Authors: S.A. Arul Shalom, Manoranjan Dash, Minh Tue
Keywords: CUDA, hierarchical clustering, high performance computing, computations using graphics hardware, complete linkage

Abstract:

Graphics Processing Units in today’s desktops can well be thought of as a high performance parallel processor. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the domain of Data mining. Two types of Hierarchical clustering algorithms are realized on GPU using CUDA. Speed gains from 15 times up to about 90 times have been realized. The challenges involved in invoking Graphical hardware for such Data mining algorithms and effects of CUDA blocks are discussed. It is interesting to note that block size of 8 is optimal for GPU with 128 internal processors.
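
For reference, here is the CPU-side shape of complete-linkage agglomerative clustering that such GPU kernels parallelize, as a naive O(n^3) sketch:

    import java.util.ArrayList;
    import java.util.List;

    public class CompleteLinkage {
      // Repeatedly merge the two clusters whose farthest members are closest.
      public static List<List<Integer>> cluster(double[][] dist, int targetClusters) {
        List<List<Integer>> clusters = new ArrayList<List<Integer>>();
        for (int i = 0; i < dist.length; i++) {
          List<Integer> c = new ArrayList<Integer>();
          c.add(i);
          clusters.add(c);
        }
        while (clusters.size() > targetClusters) {
          int bestA = 0, bestB = 1;
          double best = Double.MAX_VALUE;
          // This all-pairs scan is the hot loop a GPU version distributes.
          for (int a = 0; a < clusters.size(); a++) {
            for (int b = a + 1; b < clusters.size(); b++) {
              double d = linkage(dist, clusters.get(a), clusters.get(b));
              if (d < best) { best = d; bestA = a; bestB = b; }
            }
          }
          clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
      }

      // Complete linkage: cluster distance = maximum pairwise point distance.
      private static double linkage(double[][] dist, List<Integer> x, List<Integer> y) {
        double max = 0;
        for (int i : x) for (int j : y) max = Math.max(max, dist[i][j]);
        return max;
      }
    }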

GPUs offer a great deal of processing power and programming them may provoke deeper insights into subject identification and mapping.

Topic mappers may be able to claim NVIDIA based software/hardware and/or Sony Playstation 3 and 4 units (Cell Broadband Engine) as a business expense (check with your tax advisor).

A GPU based paper for TMRA 2011 anyone?

October 8, 2010

Inside Neo4j: Intro and roadmap

Filed under: Graphs,Neo4j,NoSQL,Software — Patrick Durusau @ 6:07 am

Inside Neo4j: Intro and roadmap

Chris Gioran has started a series of posts at A Digital Stain covering the internals of Neo4j.

Whether you are interested in Neo4j in particular or graph databases in general, this is a series of posts to watch closely.

October 7, 2010

KP-Lab System: A Collaborative Environment for Design, Realization and Examination of Different Knowledge Practices

Filed under: Collaboration,Exercises,Interface Research/Design,Software — Patrick Durusau @ 6:18 am

KP-Lab System: A Collaborative Environment for Design, Realization and Examination of Different Knowledge Practices
Authors: Ján Paralič, František Babič
Keywords: collaborative system – practices – patterns – time-line – summative information

Abstract:

This paper presents a collaborative working and learning environment called KP-Lab System. It provides a complex and multifunctional application built on principles of the semantic web, also exploiting some web 2.0 approaches such as Google Apps or mashups. This system offers a virtual user environment with different, necessary and advanced features for collaborative learning or working on knowledge-intensive activities. This paper briefly presents the whole system with special emphasis on its semantic-based aspects and analytical tools.

Public Site: http://2d.mobile.evtek.fi/shared-space (Be aware that Firefox will say this is an untrusted site as of 6 October 2010. Not sure why, but I just added a security exception to access the site.)

Software: http://www.kp-lab.org/tools

Exploration of semantic user interfaces is in its infancy and this is another attempt to explore that space.

Questions/Activities:

  1. Create account and login to public site (Organization: none)
  2. Comments on the interface?
  3. Suggestions for changes to interface?
  4. Download/install software (geeks)
  5. Create content (with other class members)
  6. Likes/dislikes managing content on basis of subject identity?

WebGraph

Filed under: Graphs,Indexing,Navigation,Searching,Software — Patrick Durusau @ 5:56 am

WebGraph was mentioned in the article Fast and Compact Web Graph Representations.

Great work on the web graph, with software and data sets for exploring!

(Warning: If you like this sort of thing you will lose hours if not days here.)

Questions:

  1. Is the Web Graph different from a graph of a topic map?
  2. How would you go about researching question #1?
  3. Would your answer to #1 vary depending on the topic map you chose?
  4. Would the size of a topic map affect your answer?
  5. How would you test your answer to #4?
  6. What other aspects of graphs would you want to explore on topic maps?

October 6, 2010

Fast and Compact Web Graph Representations

Filed under: Data Structures,Graphs,Navigation,Searching,Software — Patrick Durusau @ 7:10 am

Fast and Compact Web Graph Representations
Authors: Francisco Claude, Gonzalo Navarro
Keywords: Compression, Web Graph, Data Structures

Abstract:

Compressed graph representations, in particular for Web graphs, have become an attractive research topic because of their applications in the manipulation of huge graphs in main memory. The state of the art is well represented by the WebGraph project, where advantage is taken of several particular properties of Web graphs to offer a trade-off between space and access time. In this paper we show that the same properties can be exploited with a different and elegant technique that builds on grammar-based compression. In particular, we focus on Re-Pair and on Ziv-Lempel compression, which, although cannot reach the best compression ratios of WebGraph, achieve much faster navigation of the graph when both are tuned to use the same space. Moreover, the technique adapts well to run on secondary memory and in distributed scenarios. As a byproduct, we introduce an approximate Re-Pair version that works efficiently with severely limited main memory.

Software & Examples: Fast and Compact Web Graph Representations

As topic maps grow larger and/or memory space becomes smaller (comparatively speaking), compressed graph work becomes increasingly relevant.

Gains in navigation speed are always welcome.
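
For a feel of the grammar-based compression the abstract mentions, here is a toy Re-Pair step in Java: find the most frequent adjacent pair of symbols, replace every occurrence with a fresh symbol, and repeat until no pair occurs twice. (Real Re-Pair uses far cleverer bookkeeping; symbols are assumed non-negative here.)

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ToyRePair {
      public static List<Integer> compress(List<Integer> seq, int firstFreshSymbol) {
        int fresh = firstFreshSymbol;
        while (true) {
          // Count adjacent pairs (two non-negative ints packed into a long).
          Map<Long, Integer> counts = new HashMap<Long, Integer>();
          for (int i = 0; i + 1 < seq.size(); i++) {
            long pair = ((long) seq.get(i) << 32) | seq.get(i + 1);
            Integer c = counts.get(pair);
            counts.put(pair, c == null ? 1 : c + 1);
          }
          // Pick the most frequent pair; stop when nothing repeats.
          long bestPair = 0;
          int bestCount = 1;
          for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { bestCount = e.getValue(); bestPair = e.getKey(); }
          }
          if (bestCount < 2) return seq;
          int a = (int) (bestPair >> 32), b = (int) bestPair;
          // Rewrite the sequence, replacing each (a, b) with the fresh symbol.
          List<Integer> out = new ArrayList<Integer>();
          for (int i = 0; i < seq.size(); i++) {
            if (i + 1 < seq.size() && seq.get(i) == a && seq.get(i + 1) == b) {
              out.add(fresh);
              i++;
            } else {
              out.add(seq.get(i));
            }
          }
          fresh++;
          seq = out;
        }
      }
    }

Each replacement becomes a grammar rule (fresh -> a b); the repeated substructure of Web graph adjacency lists is what makes this pay off.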
