Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 10, 2011

Language Really Does Matter For Search – Post

Filed under: Marketing,Searching,Semantics — Patrick Durusau @ 2:51 pm

Language Really Does Matter For Search

Matthew Hurst writes:

While most pundits regard the deep, formal semantics promised by the likes of Powerset as not important to search I feel that I am personally finding search dead-ends in my long tail queries that clearly indicate the need for this type of feature. I will commit the sin of using a single example to support my point.

I don’t know if I qualify as a “pundit” but I certainly disagree that “deep, formal semantics” are not important for searching.

Well, assuming you want to find useful results.

I suspect part of the problem is that we have become accustomed to very low-quality answers and to sifting through page after page of duplicated and/or irrelevant material.

As more data comes online, the return on searches is only going to get worse.

And the greater the opportunity for topic maps.

Ten years ago, when the first topic map standard was approved, there was web searching, but the quality and quantity of data wasn’t nearly what it is today.

I don’t know of any hard statistics on it, but I would venture to guess that among staff allowed to use the WWW at work, at least an hour a day, every day, is spent not finding information on the WWW.

Think about that. At least 250 hours per person per year (an hour a day over roughly 250 working days).

And the real figure is probably much higher.

So if you have a staff of 1,000 people, 250,000 hours are being lost every year, not finding information on the WWW.

The only bright side is that the lost 250,000 hours aren’t a line item in the budget.

Topic maps can’t save all of that time for you, but they can help create a “find once, re-use many” situation for your staff.

Graph Exploration with Apache Hama

Filed under: Bulk Synchronous Parallel (BSP),Graphs,Hama — Patrick Durusau @ 2:51 pm

Graph Exploration with Apache Hama

From the website:

Hey guys,

I’ve been busy for a tiny bit of time, but I finished the graph exploration algorithm with Apache Hama recently. This post is about the BSP port of this post.
I already explained in this post how BSP basically works. Now I’m going to tell you what you can do with it in terms of graph exploration. Last post I did this with MapReduce, so let’s go and get into Hama!

Thomas Jungblut on graph exploration.

Take the time. It will be time well spent.
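
Hama’s API aside, the BSP pattern itself is easy to sketch. Below is a minimal Python sketch of BSP-style graph exploration, with my own names rather than Hama’s API: each superstep delivers “messages” to a frontier of vertices, expands it one hop, and ends at an implicit barrier (the end of the loop body).

  # Minimal sketch of BSP-style graph exploration (my names, not Hama's API).
  graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

  def bsp_explore(graph, start):
      visited = {start}
      frontier = {start}              # vertices activated by messages
      superstep = 0
      while frontier:                 # run until no messages are in flight
          messages = set()
          for vertex in frontier:
              for neighbor in graph[vertex]:
                  if neighbor not in visited:
                      messages.add(neighbor)    # "send" to the neighbor
          visited |= messages
          frontier = messages         # barrier: the next superstep begins
          superstep += 1
          print(f"superstep {superstep}: reached {sorted(messages)}")
      return visited

  print(sorted(bsp_explore(graph, 1)))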

SIGMA: Large Scale Machine Learning Toolkit

Filed under: Machine Learning — Patrick Durusau @ 2:50 pm

SIGMA: Large Scale Machine Learning Toolkit

From the website:

The goal of this project is to provide a group of parallel machine learning functionalities which can meet the requirements of research work and applications, typically with large scale data/features. The toolkit includes, but is not limited to: classification, clustering, ranking, statistical analysis, etc., and makes them run on hundreds of machines and thousands of CPU cores in parallel. We also provide an SDK for researchers/developers to invent their own algorithms and add them to the toolkit.

Algorithms in the toolkit:

  • Parallel Classification
    • Logistic Regression
    • Boosting
    • SVM
      • PSVM
      • PPegasos
    • Neural Network
  • Parallel Ranking
    • LambdaRank
    • RankBoost
  • Parallel Clustering
    • Kmeans
    • Random Walk
  • Parallel Regression
    • Linear Regression
    • Regression Tree
  • Others
    • Parallel-Regularized-SVD
    • Parallel-LDA
  • Optimization Library
    • OWL-QN
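
To make the “hundreds of machines, thousands of CPU cores” claim concrete, here is a sketch of the data-parallel pattern such toolkits rest on, with logistic regression as the example. This is my own Python illustration, not SIGMA’s SDK: each worker computes a partial gradient over its shard of the data, and the partials are summed before the model update.

  # Data-parallel logistic regression step (an illustration, not SIGMA's SDK).
  from concurrent.futures import ProcessPoolExecutor
  import math

  def partial_gradient(args):
      shard, weights = args
      grad = [0.0] * len(weights)
      for features, label in shard:
          z = sum(w * x for w, x in zip(weights, features))
          error = 1.0 / (1.0 + math.exp(-z)) - label    # prediction minus label
          for j, x in enumerate(features):
              grad[j] += error * x
      return grad

  def train_step(shards, weights, lr=0.1):
      with ProcessPoolExecutor() as pool:               # one worker per shard
          partials = list(pool.map(partial_gradient,
                                   [(s, weights) for s in shards]))
      total = [sum(g) for g in zip(*partials)]          # "reduce" the partials
      return [w - lr * g for w, g in zip(weights, total)]

  if __name__ == "__main__":
      data = [([1.0, x / 10.0], 1 if x > 5 else 0) for x in range(10)]
      print(train_step([data[:5], data[5:]], [0.0, 0.0]))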

Parallelizing Machine Learning – Functionally

Filed under: Graphs,Machine Learning,Scala — Patrick Durusau @ 2:49 pm

Parallelizing Machine Learning – Functionally

A Framework and Abstractions for Parallel Graph Processing

Abstract:

Implementing machine learning algorithms for large data, such as the Web graph and social networks, is challenging. Even though much research has focused on making sequential algorithms more scalable, their running times continue to be prohibitively long. Meanwhile, parallelization remains a formidable challenge for this class of problems, despite frameworks like MapReduce which hide much of the associated complexity. We present a framework for implementing parallel and distributed machine learning algorithms on large graphs, flexibly, through the use of functional programming abstractions. Our aim is a system that allows researchers and practitioners to quickly and easily implement (and experiment with) their algorithms in a parallel or distributed setting. We introduce functional combinators for the flexible composition of parallel, aggregation, and sequential steps. To the best of our knowledge, our system is the first to avoid inversion of control in a (bulk) synchronous parallel model.

I am particularly interested in the authors’ claim that:

While also based on graphs, Pregel is a closed system that was designed to solve large-scale “graph processing” problems, which are usually simpler in nature than typical real-world ML problems. In an effort to capitalize on Pregel’s strengths while focusing on a framework more aptly-suited to ML problems, we introduce a more flexible programming model, based on high-level functional abstractions.

Mostly because it is important to distinguish areas we research because our algorithms happen to work there from areas where the right algorithms await discovery.

But also, in part, so that we know where it is appropriate to apply our usual algorithms and where they are likely to break down.
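
The combinator idea is easy to appreciate in miniature. Here is a toy Python sketch of composing parallel, aggregation, and sequential steps as plain functions; the names are mine, not the authors’ Scala abstractions.

  # Toy combinators for composing steps (my names, not the paper's API).
  from functools import reduce

  def parallel(f):                    # apply f to every item (distributable)
      return lambda items: [f(x) for x in items]

  def aggregate(f, init):             # fold the items down to one value
      return lambda items: reduce(f, items, init)

  def sequential(*steps):             # run the steps one after another
      def run(state):
          for step in steps:
              state = step(state)
          return state
      return run

  # Sum of squares, expressed as composed steps.
  pipeline = sequential(
      parallel(lambda x: x * x),
      aggregate(lambda a, b: a + b, 0),
  )
  print(pipeline([1, 2, 3, 4]))       # 30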

April 9, 2011

DBpedia4Neo

Filed under: Blueprints,DBpedia,Graphs,Neo4j — Patrick Durusau @ 3:43 pm

DBpedia4Neo

Claudio Martella walks through loading DBpedia into a graph database.

DISCLAIMER: this is a bit of a hack, but it should get you started. I managed to get the core dataset of DBpedia into Neo4j, but this procedure should actually work for any Blueprints-ready vendor, like OrientDB.

Ok, a little background first: we want to store DBpedia inside a GraphDB, instead of the typical TripleStore, and run SPARQL queries over it. DBpedia is a project aiming to extract structured content from Wikipedia: information such as what you find in the infoboxes, the links, the categorization info, geo-coordinates, etc. This information is extracted and exported as triples to form a graph, a network of properties and relationships between Wikipedia resources.

Noting that queries run more efficiently against the graph database than against a triple store.
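
The basic move is simple enough to sketch: resources become vertices, literal-valued triples become vertex properties, and resource-valued triples become edges. A hypothetical Python sketch, not Claudio’s Blueprints code:

  # Triples to property graph (hypothetical, not the Blueprints code).
  vertices = {}   # uri -> dict of properties
  edges = []      # (source_uri, predicate, target_uri)

  def add_triple(subject, predicate, obj, obj_is_literal):
      vertices.setdefault(subject, {})
      if obj_is_literal:
          vertices[subject][predicate] = obj        # literal -> property
      else:
          vertices.setdefault(obj, {})
          edges.append((subject, predicate, obj))   # resource -> edge

  add_triple("dbpedia:Berlin", "rdfs:label", "Berlin", True)
  add_triple("dbpedia:Berlin", "dbo:country", "dbpedia:Germany", False)
  print(vertices)
  print(edges)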

David Rumsey Map Collection

Filed under: Mapping,Maps — Patrick Durusau @ 3:42 pm

David Rumsey Map Collection

From the website:

Welcome to the David Rumsey Map Collection Database and Blog. The Map Database has many viewers and the Blog has numerous categories.

The historical map collection has over 26,000 maps and images online. The collection focuses on rare 18th and 19th century North American and South American maps and other cartographic materials. Historic maps of the World, Europe, Asia, and Africa are also represented.

It is going to take a while to form even an impression of such a collection of maps and map-related resources.

graph-tool

Filed under: Graphs,Visualization — Patrick Durusau @ 3:42 pm

graph-tool

From the website:

graph-tool is an efficient python module for manipulation and statistical analysis of graphs (a.k.a. networks). With graph-tool you can do the following:

  • Easily create directed or undirected graphs and manipulate them in an arbitrary fashion, using the convenience and expressiveness of the python language!
  • Associate arbitrary information to the vertices, edges or even the graph itself, by means of property maps.
  • Filter vertices and/or edges “on the fly”, such that they appear to have been removed from the graph, but can be easily recovered.
  • Instantaneously reverse the edge direction of directed graphs, and easily transform directed graphs into undirected, and vice-versa.
  • Save and load your graphs from files using the graphml and dot file formats, which provide interoperability with other software. You can also pickle your graphs at will!
  • Conveniently draw your graphs, using a variety of algorithms and output formats (including to the screen). graph-tool works as a very comfortable interface to the excellent graphviz package.
  • Collect all sorts of statistics: degree/property histogram, combined degree/property histogram, vertex-vertex correlations, assortativity, average vertex-vertex shortest distance, etc.
  • Run several topological algorithms on your graphs, such as isomorphism, minimum spanning tree, connected components, dominator tree, maximum flow, etc.
  • Generate random graphs, with arbitrary degree distribution and degree correlation.
  • Calculate clustering coefficients, motif statistics, communities, centrality measures, etc.
  • Ad-hoc compilation and execution of C++ code, for efficient implementation of throw-away code for specific projects.
  • And probably more stuff I’m forgetting…

Now there’s a feature list!
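
For the curious, a minimal sketch against graph-tool’s core API as I read its documentation; treat the details as assumptions and check the project docs.

  # Tiny graph-tool session (details as I read the docs; verify before use).
  from graph_tool import Graph

  g = Graph(directed=True)
  name = g.new_vertex_property("string")   # a property map for vertex names

  v1 = g.add_vertex()
  v2 = g.add_vertex()
  name[v1] = "topic"
  name[v2] = "occurrence"
  g.add_edge(v1, v2)

  g.save("tiny.graphml")                   # GraphML for interoperability
  print(g.num_vertices(), g.num_edges())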

Comments?

Graph Exploration with Hadoop MapReduce

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 3:41 pm

Graph Exploration with Hadoop MapReduce

From the post:

Hi all,

sometimes you will have data where you don’t know how the elements of that data are connected. This is a common use case for graphs, because they are really abstract.

So if you don’t know what your data looks like, or if you know what it looks like and you just want to determine various graph components, this post is a good chance for you to get the “MapReduce way” of graph exploration. As mentioned in my previous post, I ranted about message passing through DFS and how much overhead it is in comparison to BSP.

I will have to keep an eye out for the Apache Hama BSP post.
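
For comparison with the BSP sketch above, here is the MapReduce-style counterpart: one iteration of connected-components label propagation, where the map phase emits each vertex’s label to itself and its neighbors and the reduce phase keeps the minimum. The shapes are hypothetical Python, not Thomas Jungblut’s Hadoop code.

  # Connected components by iterated MapReduce (hypothetical shapes).
  from collections import defaultdict

  def map_phase(labels, graph):
      emitted = defaultdict(list)
      for vertex, label in labels.items():
          emitted[vertex].append(label)             # emit to self
          for neighbor in graph[vertex]:
              emitted[neighbor].append(label)       # emit to neighbors
      return emitted

  def reduce_phase(emitted):
      return {v: min(candidates) for v, candidates in emitted.items()}

  graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
  labels = {v: v for v in graph}    # start: every vertex labels itself
  while True:
      new_labels = reduce_phase(map_phase(labels, graph))
      if new_labels == labels:      # in Hadoop, each iteration is a full job
          break
      labels = new_labels
  print(labels)                     # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}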

Cluster Computing and MapReduce

Filed under: MapReduce — Patrick Durusau @ 3:40 pm

Cluster Computing and MapReduce

A nice lecture series from Google that introduces cluster computing and MapReduce.

Globalsdb

Filed under: Globalsdb,Multidimensional — Patrick Durusau @ 3:39 pm

Globalsdb

Jack Park forwarded this to my attention.

I am puzzling over:

At its core, the Globals database is powered by an extremely efficient multidimensional data engine. The exposed interface supports access to the multidimensional structures – providing the highest performance and greatest range of storage possibilities. A multitude of applications can be implemented entirely using this data engine directly.

There is no data dictionary, and thus no data definitions, for the multidimensional data engine.

I “get” the part about the extremely efficient multidimensional data engine (they say it often enough) but am curious: why is there no data dictionary? Or at least, why is that a claim to put up front?

Granted, I don’t consider data dictionaries to be self-describing, but then neither are multidimensional arrays. Necessarily.
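
The claim is easier to see with a toy model of a “global”: a sparse, multidimensional array addressed by arbitrary subscript tuples, with no schema declared anywhere. A hypothetical Python sketch, not the Globals API:

  # Toy schema-less multidimensional store (hypothetical, not the Globals API).
  global_store = {}

  def set_node(*subscripts_and_value):
      *subscripts, value = subscripts_and_value
      global_store[tuple(subscripts)] = value       # no declared structure

  set_node("person", 1, "name", "Ada Lovelace")
  set_node("person", 1, "born", 1815)
  set_node("index", "name", "Ada Lovelace", 1)      # an ad hoc index, same shape

  print(global_store[("person", 1, "name")])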

This database apparently lies at the core of a commercial application or line of commercial applications by InterSystems Corporation.

April 8, 2011

Riak Core – An Erlang Distributed Systems Toolkit

Filed under: Erlang,Riak — Patrick Durusau @ 7:20 pm

Riak Core – An Erlang Distributed Systems Toolkit

Abstract:

Riak Core is the distributed systems foundation for the Riak distributed database and the Riak Search full-text indexing system. Riak Core provides a proven architecture for building scalable, distributed applications quickly. This talk will cover the origins of Riak Core, the abstractions and functionality it provides, and some guidance on building distributed systems.

Something for those interested in building distributed topic map applications.

MongoDB Videos

Filed under: MongoDB — Patrick Durusau @ 7:20 pm

MongoDB Videos

Since no doubt your favorite contestant has been voted off American Idol, some MongoDB videos to pass the weekend. 😉

Strategies for Exploiting Large-scale Data in the Federal Government

Filed under: Hadoop,Marketing — Patrick Durusau @ 7:19 pm

Strategies for Exploiting Large-scale Data in the Federal Government

Yes, that federal government. The one in the United States that is purportedly going to shut down. Except that those responsible for the shutdown will still get paid. There’s logic in there somewhere, or so I have been told.

Nothing specifically useful here, but it gives the flavor of the conversations taking place wherever people have large datasets.

Perry-Castañeda Library Map Collection

Filed under: Maps — Patrick Durusau @ 7:19 pm

Perry-Castañeda Library Map Collection

A very nice collection of maps, particularly topical ones.

If you are creating topic maps that involve current news or geographic locations, certainly a source to consider.

April 7, 2011

The Beauty of Maps

Filed under: Mapping,Maps — Patrick Durusau @ 7:27 pm

The Beauty of Maps

A BBC special from last year that is now available in twelve (12) parts on YouTube.

The mapping side of topic maps remains largely unexplored.

Perhaps this series will spark expeditions into the wilds of mapping semantics.

How to search the documentation of all CRAN packages

Filed under: R,Search Engines,Searching — Patrick Durusau @ 7:27 pm

How to search the documentation of all CRAN packages

Now there is a damned odd title for a post these days. 😉

I mean after releases of Lucene 3.1, Solr 3.1, not to mention other indexing/searching clients/platforms, why would anyone need a post on finding a specific function or algorithm?

You just put what you are looking for into your favorite search tool and…, oh yeah, it isn’t just “put your lips together and blow,” is it?

Rather than saying you can find it, this post should say you can search for it.

Because functions and algorithms may not have the names you expect.

To handle that problem you would need a topic map.
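
The point about names is concrete enough to sketch: a tiny synonym-aware index, which is the topic map move in miniature. The mapping below is illustrative, though prcomp and princomp really are two R names for principal component analysis.

  # Many names, one subject (an illustrative sketch).
  subjects = {
      "principal component analysis": ["prcomp", "princomp", "PCA", "KLT"],
  }

  name_to_subject = {
      name.lower(): subject
      for subject, names in subjects.items()
      for name in names + [subject]
  }

  def find(query):
      return name_to_subject.get(query.lower(), "no match; try another name")

  print(find("princomp"))   # principal component analysis
  print(find("KLT"))        # the same subject under another community's name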

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

Filed under: Algorithms,BigData,Subject Identity — Patrick Durusau @ 7:26 pm

Third Workshop on Massive Data Algorithmics (MASSIVE 2011)

From the website:

Tremendous advances in our ability to acquire, store and process data, as well as the pervasive use of computers in general, have resulted in a spectacular increase in the amount of data being collected. This availability of high-quality data has led to major advances in both science and industry. In general, society is becoming increasingly data driven, and this trend is likely to continue in the coming years.

The increasing number of applications processing massive data means that in general focus on algorithm efficiency is increasing. However, the large size of the data, and/or the small size of many modern computing devices, also means that issues such as memory hierarchy architecture often play a crucial role in algorithm efficiency. Thus the availability of massive data also means many new challenges for algorithm designers.

Forgive me for mentioning it, but what is the one thing all algorithms have in common, whether for massive data or not?

Ah, yes, some presumption about the identity of the subjects to be processed.

It would be rather difficult to process anything efficiently unless you knew where you were starting and with what.

Making the subjects processed by algorithms efficiently interchangeable seems like a good thing to me.

TeXMaker 3.0 Released!

Filed under: TeX/LaTeX — Patrick Durusau @ 7:26 pm

TeXMaker 3.0 Released!

From the post:

Version 3.0 of the free LaTeX editor Texmaker was released yesterday. The most notable changes are:

  • Extensively modified user interface: no tabs and a fully integrated pdf previewer
  • The auto-complete commands list can be extended by users
  • Label checking in master/child documents
  • A new full-screen mode
  • Mouse-over tooltips for mathematical symbols in the panels
  • New keyboard shortcuts
  • Important bugfixes

If you are going to do serious publishing about topic maps, it is most likely going to be with TeX/LaTeX.


Update: TeXmaker 3.2 released.

Comparative Study of Probabilistic Logic Languages and Systems

Filed under: Probabilistic Programming,Probabilistic Models — Patrick Durusau @ 7:24 pm

Comparative Study of Probabilistic Logic Languages and Systems

This was mentioned in a response to one of the posts in Chris Diehl’s series, Exploring Complex, Dynamic Graph Data (see April 6, below).

Good source of software/information.

(I checked, all the links work. That is something these days.)

April 6, 2011

Exploring Complex, Dynamic Graph Data

Filed under: Analytics,Graphs,Visualization — Patrick Durusau @ 6:22 pm

Chris Diehl has an interesting series:

Exploring Complex, Dynamic Graph Data, part 1

Exploring Complex, Dynamic Graph Data, part 2

Exploring Complex, Dynamic Graph Data, part 3

According to Chris, Exploratory Data Analysis (EDA) requires:

  • Persistence – Provides a non-volatile representation of the data we intend to explore.
  • Query – Supports filtering and transformation operations to condition the data for analysis.
  • Analysis – Enables the synthesis and execution of complex analytics on the data.
  • Visualization – Facilitates rapid composition of a range of visualizations to interpret results.
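
Those four requirements map onto even the smallest toolchain. A standard-library Python sketch of my own, not Chris’s stack:

  # EDA in miniature: persistence, query, analysis, visualization.
  import sqlite3
  import statistics

  conn = sqlite3.connect(":memory:")                # persistence
  conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, weight REAL)")
  conn.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                   [("a", "b", 1.0), ("a", "c", 3.0), ("b", "c", 2.0)])

  rows = conn.execute(                              # query: filter/transform
      "SELECT weight FROM edges WHERE weight > 0.5").fetchall()
  weights = [w for (w,) in rows]

  print(f"mean weight: {statistics.mean(weights):.2f}")   # analysis

  for w in weights:                                 # visualization, crudely
      print("#" * int(w * 4))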

Check it out.

OPEN! Government Data

Filed under: Dataset — Patrick Durusau @ 6:21 pm

OPEN! Government Data

Another listing of government data sets and other materials.

Relevant for topic maps as more grist for a topic map mill.

This may have a certain sense of urgency in the United States, as several of the government-sponsored data sites will be going dark later this year. Budget cuts.

Why the transparency-minded Obama administration and the secretive opposition would agree on less government transparency isn’t clear.

I note that agreement only to point out that if you are going to copy data currently available for later use in topic maps, the time to do so is now.

*****
PS: Not that access to data = transparency but in the absence of data, there isn’t even a basis for transparency.

List of European Open Data Catalogues

Filed under: Dataset — Patrick Durusau @ 6:20 pm

List of European Open Data Catalogues

From the website:

Following is a list of open data catalogues from around European member states, sorted by country. This list is very much a work in progress.

EU oriented listing of open data catalogues.

publicdata.eu — Europe’s public data

Filed under: Dataset — Patrick Durusau @ 6:19 pm

publicdata.eu — Europe’s public data

Noticed this in the Open Data Challenge materials and thought it merited a separate entry.

Deserves a visit, if for no other reason than the home page that lists “Places” as: United Kingdom, England, Wales, Scotland, Northern Ireland, and International.

Where “International” includes the United States, Australia, Afghanistan, and oh, yes, the rest of Europe.

The first fourteen entries from International will give you an idea of the range of the data sets:

* German federal budget (OffenerHaushalt)
* 2000 U.S. Census in RDF (rdfabout.com)
* 32000 Naples Florida Businesses in KML format
* Airborne Antarctic Ozone Experiment (AAOE-87)
* AcaWiki
* Acupuncture & Moxibustion in London
* Asian Development Bank (ADB) – Statistical Database System (SDBS)
* Addgene
* Adopt a Roadside (Victoria, Australia)
* Advances in Dental Research
* Aegean Archaeomalacology
* Afghanistan Election Data
* Agricultural and forestry exports from New Zealand
* AGROVOC

Open Data Challenge

Filed under: Contest,Dataset — Patrick Durusau @ 6:19 pm

Open Data Challenge

EU residents and organizations with operations in the EU can compete in four basic categories:

  • Ideas – Anyone can suggest an idea for projects which reuse public information to do something interesting or useful.
  • Apps – Teams of developers can submit working applications which reuse public information.
  • Visualisations – Designers, artists and others can submit interesting or insightful visual representations of public information.
  • Datasets – Public bodies can submit newly opened up datasets, or developers can submit derived datasets which they’ve cleaned up or linked together.

Runs 5 April to 5 June, 2011

See the site for various rules and details.

April 5, 2011

Tutorial on Crowdsourcing and Human Computation

Filed under: Crowd Sourcing — Patrick Durusau @ 4:30 pm

Tutorial on Crowdsourcing and Human Computation

From the post:

Last week, together with Praveen Paritosh from Google, we presented a 6-hour tutorial at the WWW 2011 conference, on crowdsourcing and human computation. The title of the tutorial was “Managing Crowdsourced Human Computation”.

Check the post for other links, resources.

Perhaps the lesson is to automate when possible and to use human computation when necessary. And the trick is to know when to switch.

LingPipe Book Draft 0.4

Filed under: LingPipe — Patrick Durusau @ 4:29 pm

LingPipe Book Draft 0.4

A report that version 0.4 of the LingPipe book has appeared.

The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

Filed under: Information Retrieval,Natural Language Processing — Patrick Durusau @ 4:29 pm

The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text by Tony Russell-Rose.

Abstract:

Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name which I co-authored with Mark Stevenson of Sheffield University and appears in the book called “Information Retrieval: Searching in the 21st Century“. The session was scheduled for 3 hours, and to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of Powerpoint – the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the Slideshare version, but I am happy to make them available to folks who want to enjoy the full experience.

If you don’t look at another presentation slide deck this week, do yourself a favor and look at this one. Very well done.

I’m going to write to ask for the exercises. Comments to follow.

Budget Climb

Filed under: Graphics,Visualization — Patrick Durusau @ 4:28 pm

Budget Climb

First noticed on Flowing Data.

Interesting way to navigate historical budget data for the United States.

It pairs visualization with Microsoft Kinect.

It relies on pre-computed subjects, but I suspect we are not far from users being able to choose/construct subjects visually from data as they explore data sets.

Making them reliably shareable could open up entire new lines of employment.

structr

Filed under: Neo4j — Patrick Durusau @ 4:28 pm

structr

An open source CMS based on Neo4j.

First public beta, 0.3, is due in May, 2011.

With a graph database underpinning it, one has to wonder how difficult it would be to add topic map characteristics.

Solr + Hadoop = Big Data Love

Filed under: Hadoop,Solr — Patrick Durusau @ 4:27 pm

Solr + Hadoop = Big Data Love

Interesting combination, using Solr as a key/value store.

The article mentions that it is for “smaller” data sets and later says that it handles approximately 200M “records” with reasonable response times.

That is something that gets overlooked in the rush to scale.

There are a lot of interesting data sets that are < 200M "records." The Library of Congress, for example, has 143 million items in its catalogs.

Perhaps your data set is smaller than the Library of Congress’s?
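
As a sketch of the key/value idea, assuming a running Solr core and the pysolr client (check pysolr’s documentation for current signatures):

  # Solr as a key/value store (assumes a local core named "kvstore" and pysolr).
  import pysolr

  solr = pysolr.Solr("http://localhost:8983/solr/kvstore", timeout=10)

  def put(key, value):
      solr.add([{"id": key, "value_s": value}], commit=True)  # id is the key

  def get(key):
      for doc in solr.search(f'id:"{key}"'):        # zero or one hit expected
          return doc.get("value_s")
      return None

  put("record:42", "hello, big data")
  print(get("record:42"))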

