Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 11, 2011

Writing Tetris in Clojure

Filed under: Clojure — Patrick Durusau @ 5:53 pm

Writing Tetris in Clojure

From the post:

Good evening to everyone. Today I want to guide you step-by-step through the process of writing a game of Tetris in Clojure. My goal was not to write the shortest version possible but the most concise one and the one that would use idiomatic Clojure techniques (like relying on the sequence processing functions and making a clear distinction between purely functional and side-effect code). The result I got is about 300 lines of code in size but it is very comprehensible and simple. If you are interested then fire up your editor_of_choice and let’s get our hands dirty.

Not that I really think you need a personal clone of Tetris but it looks like a very good way to explore Clojure.

Dart: Structured Web Programming

Filed under: Programming,Web Applications — Patrick Durusau @ 5:53 pm

Dart: Structured Web Programming

Home for a new browser based application language.

From the homepage:

From quick prototypes to serious apps

Dart’s optional types let you prototype quickly and then revise your code to be more maintainable.

Wherever you need structured code

You can use the same Dart code in most modern web browsers (Chrome, Safari 5+, Firefox 4+) and on servers. Look for more browser support shortly.

Familiar yet new

Dart code should look familiar if you know a language or two, and you can use time-tested features such as classes and closures. Dart’s new features make it easier for you to develop and maintain software. Dart is still in the early stages of development, so please take a look and tell us what you think.

Charges, counter-charges, attacks/defenses of Dart are underway in a number of social media forums and I won’t waste your time with reports of them here.

Bibliographic Wilderness

Filed under: Bibliography,Librarian/Expert Searchers — Patrick Durusau @ 2:40 pm

Bibliographic Wilderness

An interesting bibliographic/library blog that I encountered. Posts on URLs, microdata, etc.

October 10, 2011

Introducing Truffler – Advanced search made easy

Filed under: ElasticSearch,Truffler — Patrick Durusau @ 6:20 pm

Introducing Truffler – Advanced search made easy

From the post:

Last week during a presentation at a user group I showed the project that me and my two partners Henrik Lindström and Marcus Granström have been working on for quite a while now – Truffler. Truffler is a search engine that we offer both as Software as a Service and as dedicated servers for rent. It’s a commercial product but we offer free trial indexes as well as personal indexes to developers that want to use it for their blogs or hobby projects as long as they link to us.

Built on ElasticSearch, offering a .Net API. Windows developers take note.

Curious, if you try it out, what do you make of the claim that “advanced search [is] made easy”? Lots of people make that claim, not just Truffler. How would you evaluate it? Is ease of programming/configuration enough? What of the results? How do you judge those?

For my class, consider a project proposal for how you would compare two search engines, Truffler and another. Not actually doing the comparison but writing up the process by which you would test one against the other. I think you will find simply designing how you would compare the two a reasonably sized project. It could be interesting to pitch your project design to the search engines in question to see if they would fund the comparison as a research project.

Integrating Zend Framework Lucene with your Cake Application

Filed under: Lucene,PHP — Patrick Durusau @ 6:19 pm

Integrating Zend Framework Lucene with your Cake Application

From the post:

This is a short tutorial that teaches you how to integrate Zend Framework’s Lucene implementation (100% PHP) into your application. It requires your server to have PHP5 installed, since ZF only runs on PHP5, and is likely to be deprecated very soon.

Another search implementation guide.

Curious (for my students): could I take the results of a search at one site and combine them with the results from another site? If that were your task, what questions would you ask? Is using the same search engine enough? If not, what more would you need to know? Is there anything you would like to do as part of the combining process? Assume that you have free access to any needed data.
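
If I were sketching an answer to my own question, the first obstacles are that the two engines’ scores are not on the same scale and that the same document may appear in both lists. A toy sketch in Python (the result fields and the min-max normalization are my assumptions for illustration, nothing from the post):

```python
# Sketch: merge ranked results from two sites after normalizing their scores.
# Each result is assumed to be a dict with "url" and "score" fields; both the
# fields and the normalization scheme are assumptions for illustration.

def normalize(results):
    """Min-max normalize scores into [0, 1] so the two lists are comparable."""
    if not results:
        return []
    scores = [r["score"] for r in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [dict(r, score=(r["score"] - lo) / span) for r in results]

def merge(results_a, results_b):
    """Combine two result lists, de-duplicating by URL and keeping the best score."""
    best = {}
    for r in normalize(results_a) + normalize(results_b):
        if r["url"] not in best or r["score"] > best[r["url"]]["score"]:
            best[r["url"]] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

site_a = [{"url": "http://example.com/1", "score": 12.4},
          {"url": "http://example.com/2", "score": 3.1}]
site_b = [{"url": "http://example.com/2", "score": 0.92},
          {"url": "http://example.com/3", "score": 0.40}]

for r in merge(site_a, site_b):
    print(r["url"], round(r["score"], 2))
```

Even this toy version raises the questions above: are the two scores comparable at all, and is a URL a reliable identity for a result?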

OrientDB JDBC Driver

Filed under: OrientDB — Patrick Durusau @ 6:19 pm

OrientDB JDBC Driver

From the project page:

OrientDB (http://code.google.com/p/orient/) is a NoSQL DBMS that supports a subset of SQL as query language.

This project is an effort to develop a JDBC driver for OrientDB.

Perhaps the familiar may tempt DB programmers into unfamiliar territory?

FPGA-based MapReduce Framework for Machine Learning

Filed under: Machine Learning,MapReduce — Patrick Durusau @ 6:19 pm

FPGA-based MapReduce Framework for Machine Learning by Ningyi XU.

From the description:

Machine learning algorithms are becoming increasingly important in our daily life. However, training on very large scale datasets is usually very slow. FPGA is a reconfigurable platform that can achieve high parallelism and data throughput. Many works have been done on accelerating machine learning algorithms on FPGA. In this paper, we adapt Google’s MapReduce model to FPGA by realizing an on-chip MapReduce framework for machine learning algorithms. A processor scheduler is implemented for the maximum computation resource utilization and load balancing. In accordance with the characteristics of many machine learning algorithms, a common data access scheme is carefully designed to maximize data throughput for large scale dataset. This framework hides the task control, synchronization and communication away from designers to shorten development cycles. In a case study of RankBoost acceleration, up to 31.8x speedup is achieved versus CPU-based design, which is comparable with a fully manually designed version. We also discuss the implementations of two other machine learning algorithms, SVM and PageRank, to demonstrate the capability of the framework.

Not quite ready for general use but this looks very promising.

The usual discussion of “big data” made me start thinking: do we need lots of instances of data to have “big data,” or have we trimmed down the data that surrounds us so we can manage it without the requirements of “big data”?

There are lots of examples of the former, can you think of examples of the latter?

Dataflow Programming:…

Filed under: Flow-Based Programming (FBP),Pipes — Patrick Durusau @ 6:18 pm

Dataflow Programming: Handling Huge Data Loads Without Adding Complexity by Jim Falgout.

From the post:

Because the dataflow operators in a graph work in parallel, the model allows overlapping I/O operations with computation. This is a “whole application” approach to parallelization as opposed to many thread-oriented performance frameworks that focus on hot sections of code such as for loops. This addresses a key problem in processing “big data” for today’s many-core processors: feeding data fast enough to the processors.

While dataflow does this, it scales down easily as well. This distinguishes it from technologies, such as Hadoop and to a lesser extent, Map Reduce, which don’t scale downward well due to their innate complexity.

We’ve discussed how a dataflow architecture exploits multicore. These same principles can be applied to multi-node clusters by extending dataflow queues over networks with a dataflow graph executed on multiple systems in parallel. The compositional model of building dataflow graphs allows for replication of pieces of the graph across multiple nodes. Scaling out extends the reach of dataflow to solve large data problems.
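
The compositional idea in the quote can be illustrated, very loosely, with nothing more than generators: each stage pulls from the stage upstream, so reading, transforming and aggregating overlap. A toy sketch, not the framework the post describes:

```python
# Toy dataflow-style pipeline built from Python generators; purely illustrative,
# not the dataflow framework described in the post.

def read_lines(path):
    """Source operator: stream lines from a file without loading it whole."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def tokenize(lines):
    """Transform operator: split each line into lower-cased words."""
    for line in lines:
        for word in line.split():
            yield word.lower()

def count(words):
    """Sink operator: aggregate word frequencies."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# Compose the graph: read -> tokenize -> count (file name is a placeholder).
# counts = count(tokenize(read_lines("some_big_file.txt")))
```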

Read the first comment, by J.P. Morrison, author of Flow-Based Programming, the inspiration for Pipes. (Shamelessly repeated from a post by Marko Rodriguez on the gremlin-users list; Achim first noticed the article.)

Be aware that Amazon lists the Kindle edition for $29.00 and a hardback edition for $69.00. Sadly, one reader reports the book has no index.

Bio4jExplorer

Filed under: Bio4j,Bioinformatics,Biomedical,Cloud Computing,Graphs — Patrick Durusau @ 6:17 pm

Bio4jExplorer: familiarize yourself with Bio4j nodes and relationships

From the post:

I just uploaded a new tool aimed to be used both as a reference manual and initial contact for Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

  • Navigate through all nodes and relationships
  • Access the javadocs of any node or relationship
  • Graphically explore the neighbourhood of a node/relationship
  • Look up for the different indexes that may serve as an entry point for a node
  • Check incoming/outgoing relationships of a specific node
  • Check start/end nodes of a specific relationship

And take note:

For those interested on how this was done, on the server side I created an AWS SimpleDB database holding all the information about the model of Bio4j, i.e. everything regarding nodes, relationships, indexes… (here you can check the program used for creating this database using java aws sdk)

Meanwhile, in the client side I used Flare prefuse AS3 library for the graph visualization.

When people are this productive as well as a benefit to the community, I am deeply envious but glad for them (and the rest of us) at the same time. Simply must work harder. 😉

A Basic Full Text Search Server in Erlang

Filed under: Erlang,Search Engines,Searching — Patrick Durusau @ 6:17 pm

A Basic Full Text Search Server in Erlang

From the post:

This post explains how to build a basic full text search server in Erlang. The server has the following features:

  • indexing
  • stemming
  • ranking
  • faceting
  • asynchronous search results
  • web frontend using websockets

Familiarity with the OTP design principles is recommended.

Looks like a good way to become familiar with Erlang and text search issues.
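
If Erlang itself is the hurdle, the first two bullets (indexing and ranking) fit in a few lines of any language. A toy sketch in Python, with crude term-frequency ranking standing in for the real thing; stemming, faceting and the websocket frontend are left to the post:

```python
# Toy full-text search: build an inverted index and rank hits by term frequency.
# Illustrates only the "indexing" and "ranking" features from the post.
from collections import defaultdict

docs = {
    1: "erlang is a functional language",
    2: "full text search in erlang",
    3: "search engines rank documents",
}

index = defaultdict(lambda: defaultdict(int))   # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(query):
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("erlang search"))   # [(2, 2), (1, 1), (3, 1)]
```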

Nearest Neighbor Search: the Old, the New, and the Impossible

Filed under: Edit Distance,Levenshtein Distance,Nearest Neighbor,Neighbors — Patrick Durusau @ 6:16 pm

Nearest Neighbor Search: the Old, the New, and the Impossible, the MIT PhD thesis of Alexandr Andoni.

To be honest, it is the discovery of gems like this one that keeps me prowling journals, pre-publication sites, homepages, etc.

Alexandr walks the reader through a very complete review of the literature on nearest neighbor search, all the while laying a foundation for the significant progress he has made.

Not for the faint of heart but it promises to be well worth the effort.
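
For anyone new to the area, the problem is easy to state and easy to solve badly: given a set of points, find the one closest to a query. The brute-force baseline below is what the techniques Andoni surveys (locality-sensitive hashing among them) are trying to beat in high dimensions; a minimal Python sketch:

```python
# Brute-force nearest neighbor: the O(n) per-query baseline that approximate
# methods such as locality-sensitive hashing aim to improve on.
import math

def nearest(points, query):
    """Return the point with the smallest Euclidean distance to the query."""
    return min(points, key=lambda p: math.dist(p, query))

points = [(0.0, 0.0), (1.0, 5.0), (3.0, 4.0), (9.0, 2.0)]
print(nearest(points, (2.5, 3.5)))   # (3.0, 4.0)
```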

October 9, 2011

Approximating Edit Distance in Near-Linear Time

Filed under: Algorithms,Edit Distance,Levenshtein Distance — Patrick Durusau @ 6:44 pm

Approximating Edit Distance in Near-Linear Time

Abstract:

We show how to compute the edit distance between two strings of length $n$ up to a factor of $2^{\tilde{O}(\sqrt{\log n})}$ in $n^{1+o(1)}$ time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^{1/3+o(1)}$ approximation. Previously, approximation of $2^{\tilde{O}(\sqrt{\log n})}$ was known only for embedding edit distance into $\ell_1$, and it is not known if that embedding can be computed in less than quadratic time.

Deeply important research for bioinformatics and text searching. Note that the edit distance is “approximated.”

If you are not familiar with this area, Levenshtein Distance, in Three Flavors by Michael Gilleland is a nice starting point with source code in three languages.
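
For a concrete sense of what is being approximated, here is the exact quadratic dynamic-programming version (a fourth flavor, if you like) in Python; this is the baseline the near-linear algorithms are racing against:

```python
# Exact Levenshtein (edit) distance by dynamic programming: O(len(a) * len(b))
# time, O(len(b)) space. The quadratic cost is what approximation algorithms
# like the one above try to avoid for very long strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
```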

Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web

Filed under: Artificial Intelligence,P2P,Semantic Web — Patrick Durusau @ 6:43 pm

Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web by P. Adjiman, P. Chatalic, F. Goasdoue, M. C. Rousset, and L. Simon.

Abstract:

In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The main contribution of this paper is to provide the first consequence finding algorithm in a peer-to-peer setting: DeCA. It is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. Another important contribution is to apply this general distributed reasoning setting to the setting of the Semantic Web through the Somewhere semantic peer-to-peer data management system. The last contribution of this paper is to provide an experimental analysis of the scalability of the peer-to-peer infrastructure that we propose, on large networks of 1000 peers.

Interesting research on its own, but I was struck by the phrase: “but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary.”

Can we say that our “peers” share our “mappings”?

That is, mappings between terms and our expectation of others with regard to those terms.

Not the mapping between the label and the subject for which it is a label.

Or is the second mapping encompassed in the first? Or merely a partial expression of the first? (That seems more likely.)

Not immediately applicable to anything but may be important in terms of the mappings we are seeking to capture.

Execution in the Kingdom of Nouns

Filed under: Java,Language,Language Design — Patrick Durusau @ 6:43 pm

Execution in the Kingdom of Nouns

From the post:

They’ve a temper, some of them—particularly verbs: they’re the proudest—adjectives you can do anything with, but not verbs—however, I can manage the whole lot of them! Impenetrability! That’s what I say!
— Humpty Dumpty

Hello, world! Today we’re going to hear the story of Evil King Java and his quest for worldwide verb stamp-outage.1

Caution: This story does not have a happy ending. It is neither a story for the faint of heart nor for the critical of mouth. If you’re easily offended, or prone to being a disagreeable knave in blog comments, please stop reading now.

Before we begin the story, let’s get some conceptual gunk out of the way.

What I find compelling is the notion that a programming language should follow how we think, that is, how most of us think.

If you want a successful topic map, should it follow/mimic the thinking of:

  1. the author
  2. the client
  3. intended user base?

#1 is easy, that’s the default and requires the least work.

#2 is instinctive, but you will need to educate the client to #3.

#3 is golden if you can hit that mark.

Yes, Virginia, Scala is Learnable

Filed under: Scala — Patrick Durusau @ 6:42 pm

Yes, Virginia, Scala is Learnable

Paul Snively writes:

We’re using Databinder Dispatch a lot in the Cloud Services Engineering group at VMware, and late last week I was discussing it with one of my colleagues, a very senior (not average!) Java developer. I showed him a snippet of Dispatch code and said I wouldn’t expect anyone to understand it on first reading. He seemed surprised by that, unfortunately in the sense that he seemed to believe that it was expected that team members understand Dispatch code on first reading. Then Dave Pollak’s excellent Yes, Virginia, Scala is Hard post appeared, calling me out by name. 🙂 While it’s extremely flattering that Dave thinks I’m a statistical outlier with respect to programming language expertise, his comment, along with my disappointment to find that a very capable colleague apparently felt pressure to understand something that I expect no one to understand immediately, impels me to try to address the question of Scala’s complexity.

He concludes:

My one-sentence summary, though, would be: there’s no substitute for actually learning the language, and yes, Virginia, Scala is learnable.

Perhaps a bit unfair, but I am reminded of efforts to make metadata more “accessible” to people innocent of any formal information/library training who have built data sets used by millions daily. Interesting, I suppose, but then I recall when Alta Vista was “the” search site. How many users today could even correctly identify the name? There will always be far more users looking for simple facts, surmise and rumor than those interested in more sophisticated analysis.

My counsel is to learn both the more sophisticated and perhaps even historical systems. You can always dumb delivery down.

The RDA Vocabularies: Implementation, Extension, and Mapping

Filed under: Library,RDA — Patrick Durusau @ 6:42 pm

NISO/DCMI Webinar: “The RDA Vocabularies: Implementation, Extension, and Mapping”

If you are unfamiliar with RDA, I would suggest the D-Lib article, RDA Vocabularies: Process, Outcome, Use, by Diane Hillmann, Karen Coyle, Jon Phipps, and Gordon Dunsire, cited by the webinar announcement as a starting point.

The article concludes (in part, emphasis added):

But the benefit of using a modern and fully registered standard is not only to others — library reliance on data standards that require that all data be created by hand by highly trained individuals is clearly unsustainable. In a recent presentation to an audience at ALA Annual in Chicago, Jon Phipps demonstrated that continued library use of a standard only we understand has cut us off from reuse of data being built exponentially by entities such as DBpedia, which are clearly, for a host of reasons, choosing not to access and reuse library data [Phipps] [DBpedia]. Only by changing what we do in library environments can we hope to participate with other large users of data in building better descriptive data that we can then hope to reuse to improve our own services.

I don’t disagree with the assessment but am not altogether sure about the solution. That is to say that what constitutes the “common standard” varies from time to time. Cataloging in Latin would have been the most accessible once upon a time. And those records still exist.

As we chase another “standard,” what provision have we made not to cut ourselves (and users) off from prior information?

Meanwhile, we pursue exposing users to the equivalent of a coarsely edited annual almanac of fact, surmise and rumor.

I don’t think you will find the answer with RDA.

Still, you may find the webinar informative (if a bit pricey).

From the post:

DATE: 16 November 2011
TIME: 1:00pm – 2:30pm EDT (17:00-19:30 UTC)
REGISTRATION: http://www.niso.org/news/events/2011/dcmi/rda

ABOUT THE WEBINAR

During a meeting at the British Library in May 2007 between the Joint Steering Committee for the Development of RDA and DCMI, important recommendations were forged for the development of an element vocabulary, application profile, and value vocabularies [1], based on the Resource Description and Access (RDA) standard, then in final draft. A DCMI/RDA Task Group [2] has completed much of the work, and described their process and decisions in a recent issue of D-Lib Magazine [3]. A final, pre-publication technical review of this work is underway, prior to adoption by early implementers.

This webinar will provide an up-to-the-minute update on the review process, as well as progress on the RDA-based application profiles. The webinar will discuss practical implementation issues raised by early implementers and summarize issues surfaced in virtual and face-to-face venues where the vocabularies and application profiles have been discussed.

[1] http://www.bl.uk/bibliographic/meeting.html
[2] http://dublincore.org/dcmirdataskgroup/
[3] http://dlib.org/dlib/january10/hillmann/01hillmann.html

SPEAKERS:

Diane Hillmann is Vocabulary Management Officer for DCMI and a partner in the consulting firm Metadata Management Associates. She is co-chair (with Gordon Dunsire) of the DCMI/RDA Task Group and is the DCMI Liaison to the ALA Committee on Cataloging: Description and access, the US body providing feedback on RDA Development.

Thomas Baker, DCMI Chief Information Officer (Communications, Research and Development), was recently co-chair of the W3C Semantic Web Deployment Working Group and a W3C Incubator Group on Library Linked Data (report pending).

REGISTRATION:

For registration and webinar technical information, see http://www.niso.org/news/events/2011/dcmi/rda. Registration closes at 12:00 pm Eastern on 16 November 2011.

Large Scale Machine Learning and Other Animals

Filed under: GraphLab — Patrick Durusau @ 6:41 pm

Large Scale Machine Learning and Other Animals

From About Me:

Danny Bickson

I am a project scientist at Carnegie Mellon University, Machine Learning Department. I am interested in distributed/parallel large scale machine learning algorithms and applications on various computing platforms such as clouds and clusters. Check out our GraphLab project page. I am quite excited about multicore matrix factorization code that I recently wrote, that has been downloaded about 1500 times, and helped us win the 5th place in ACM KDD CUP 2011 contest (out of more than 1000 participants). We are looking for industrial collaboration around GraphLab. Contact me if you are interested in hearing more.

I encountered this blog due to its current post on installing GraphLab on Ubuntu.

I am sure I will be pointing out other specific posts in the future but thought I should direct your attention to his blog and the GraphLab project while it was on my mind.

Parallel frameworks for graph processing

Filed under: Graphs,Parallel Programming — Patrick Durusau @ 6:41 pm

Parallel frameworks for graph processing from Lambda the Ultimate.

Summaries and then comments on GraphLab and John Gilbert’s Parallel Combinatorial BLAS: A Toolbox for High-Performance Graph Computation (papers, slides).

Contribute your comments, pointers to other resources?

Open Relevance Project

Filed under: Dataset,Relevance — Patrick Durusau @ 6:40 pm

Open Relevance Project

From the website:

What Is the Open Relevance Project?

The Open Relevance Project (ORP) is a new Apache Lucene sub-project aimed at making materials for doing relevance testing for Information Retrieval (IR), Machine Learning and Natural Language Processing (NLP) into open source.

Our initial focus is on creating collections, judgments, queries and tools for the Lucene ecosystem of projects (Lucene Java, Solr, Nutch, Mahout, etc.) that can be used to judge relevance in a free, repeatable manner.

One dataset that needs attention from this project is the Apache Software Foundation Public Mail Archives, which is accessible on the Amazon cloud.

Project work products would benefit Apache software users, vendors with Apache software bases, historians, sociologists and others interested in the dynamics, technical and otherwise, of software development.

I am willing to try to learn cloud computing and the skills necessary to turn this dataset into a test collection. Are you?

Apache Software Foundation Public Mail Archives

Filed under: Cloud Computing,Dataset — Patrick Durusau @ 6:40 pm

Apache Software Foundation Public Mail Archives

From the webpage:

Submitted By: Grant Ingersoll
US Snapshot ID (Linux/Unix): snap-17f7f476
Size: 200 GB
License: Public Domain (See http://apache.org/foundation/public-archives.html)
Source: The Apache Software Foundation (http://www.apache.org)
Created On: August 15, 2011 10:00 PM GMT
Last Updated: August 15, 2011 10:00 PM GMT

A collection of all publicly available mail archives from the Apache Software Foundation (ASF), taken on July 11, 2011.

This collection contains all publicly available email archives from the ASF’s 80+ projects (http://mail-archives.apache.org/mod_mbox/), including mailing lists such as Apache HTTPD Server, Apache Tomcat, Apache Lucene and Solr, Apache Hadoop and many more.

Generally speaking, most projects have at least three lists: user, dev and commits, but some have more, some have less. The user lists are where users of the software ask questions on usage, while the dev list usually contains discussions on the development of the project (code, releases, etc.)

The commit lists usually consists of automated notifications sent by the various ASF version control tools, like Subversion or CVS, and contain information about changes made to the project’s source code.

Both tarballs and per project sets are available in the snapshot. The tarballs are organized according to project name. Thus, a-d.tar.gz contains all ASF projects that begin with the letters a, b, c or d, such as abdera.apache.org. Files within the project are usually gzipped mbox files. (I split the first paragraph into several paragraphs for readability reasons.)

Rather meager documentation for a 200 GB data set, don’t you think? I think a first step would be to create basic documentation on what projects are present, their mailing lists, and some basic statistical counts to serve as reference points.

You have been waiting for a motivation to “get into” cloud computing, well, now you have the motivation and an interesting dataset!
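
Even the simplest of those reference counts (messages per list) is within easy reach once a project’s archives are unpacked. A sketch using Python’s standard-library mailbox module; the directory layout and file names below are my assumptions, not something verified against the snapshot:

```python
# Sketch: count messages per gzipped mbox file from an unpacked project archive.
# The glob pattern is a placeholder; check the snapshot's actual layout first.
import glob
import gzip
import mailbox
import os
import tempfile

def count_messages(gz_path):
    """Decompress one gzipped mbox file and count the messages in it."""
    with gzip.open(gz_path, "rb") as src, \
         tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(src.read())
    try:
        return len(mailbox.mbox(tmp.name))
    finally:
        os.unlink(tmp.name)

for path in sorted(glob.glob("abdera.apache.org/*.gz")):
    print(path, count_messages(path))
```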

Ancestry.com Forum Dataset

Filed under: Dataset — Patrick Durusau @ 6:40 pm

Ancestry.com Forum Dataset

From the post:

The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.

In addition to the document collection, queries from Ancestry.com’s query log and pairwise preference relevance judgements for a message thread retrieval task using this online forum are distributed.

This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.

Contact: Jonathan Elsas.


Document Collection

The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com from July 2010.

  • Number of Messages: 22,054,728
  • Number of Threads: 9,040,958
  • Number of Sub-forums: 165,358
  • Number of Unique Authors: 3,775,670
  • Message Date Range: December 1995 – July 2010
  • Size: 5 GB (compressed)

The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference.

As you will read, creation of a dataset for use as a test set is a non-trivial project.

Curious, what questions would you ask of such a dataset? Or perhaps better, what tools would you use to ask those questions and why?
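
On the tools question: since the collection is in TRECTEXT SGML, my first step would be a small reader that pulls out document IDs and text for whatever counting or indexing comes next. A rough sketch; the tag names are the conventional TREC ones (<DOC>, <DOCNO>, <TEXT>) and should be checked against the collection’s tech report:

```python
# Rough TRECTEXT reader: yields (docno, text) pairs from a file of <DOC> blocks.
# Tag names follow common TREC conventions; verify them against the dataset's
# own documentation before relying on this.
import re

DOC = re.compile(r"<DOC>(.*?)</DOC>", re.S)
DOCNO = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.S)
TEXT = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)

def trectext_docs(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        data = f.read()
    for block in DOC.finditer(data):
        body = block.group(1)
        docno = DOCNO.search(body)
        text = TEXT.search(body)
        yield (docno.group(1) if docno else None,
               text.group(1).strip() if text else "")

# Example (file name is a placeholder):
# for docno, text in trectext_docs("ancestry-part-001.sgml"):
#     print(docno, len(text.split()))
```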

Grant Ingersoll mentioned this collection in email on the openrelevance-dev@apache.org mailing list.

Kernel Methods and Support Vector Machines de-Mystified

Filed under: Kernel Methods,Support Vector Machines — Patrick Durusau @ 6:39 pm

Kernel Methods and Support Vector Machines de-Mystified

From the post:

We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical.

Did the authors succeed in their goal of a “simple explanation”?

You might want to compare the Wikipedia entry they cite on support vector machines before making your comment. Success is often a relative term.
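
One way to test whether the mystery is gone is to compute a kernel yourself and notice that nothing more than a matrix of similarities comes out. A small numpy sketch of the Gaussian (RBF) kernel, one of the standard choices; not code from the post:

```python
# Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2). A kernel method
# only ever sees this matrix of pairwise similarities, never explicit features.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared distances via ||x - y||^2 = x.x + y.y - 2 x.y
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(np.round(rbf_kernel(X, X, gamma=0.5), 3))
# Nearby points get values close to 1, distant points close to 0.
```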

October 8, 2011

Data Mining Research Notes – Wiki

Filed under: Data Mining,Vocabularies — Patrick Durusau @ 8:15 pm

Data Mining Research Notes – Wiki

You can go to the parent resource but I am deliberately pointing to the “wiki” resource page.

It is a collection of terms from data mining with pointers to Wikipedia pages for each one.

While I may quibble with the readability of some of the work at Wikipedia, I must confess to having created no competing explanations for their consideration.

Perhaps that is something that I could use to fill the idle hours. 😉 Seriously, readable explanations of technical material are both an art form and welcomed by most technical types. They save time on explanations if nothing else and possibly help others become interested.

Wiki PageRank with Hadoop

Filed under: Hadoop,PageRank — Patrick Durusau @ 8:15 pm

Wiki PageRank with Hadoop

From the post:

In this tutorial we are going to create a PageRanking for Wikipedia with the use of Hadoop. This was a good hands-on exercise to get started with Hadoop. The page ranking is not a new thing, but a suitable use case and way cooler than a word counter! The Wikipedia (en) has 3.7M articles at the moment and is still growing. Each article has many links to other articles. With those incoming and outgoing links we can determine which page is more important than others, which basically is what PageRanking does.

Excellent tutorial! Non-trivial data set and gets your hands wet with Hadoop, one of the rising stars in data processing. What’s not to like?
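
If the Hadoop plumbing obscures the algorithm, the iteration itself is short. A plain-Python sketch of the same computation the tutorial spreads across mappers and reducers; the damping factor of 0.85 is the conventional choice:

```python
# PageRank by power iteration on a toy link graph. Each page divides its rank
# among its outgoing links, plus a damping term; the Hadoop job distributes
# this same per-page step. Every linked page must appear as a key in `links`.
def pagerank(links, damping=0.85, iterations=30):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing) if outgoing else 0.0
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```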

Question: What other processing looks interesting for the Wiki pages?

The running time on some jobs would be short enough to plan a job at the start of class from live suggestions, run the job during the presentation/lecture, and present the results/post-mortem of mistakes after the break.

Now that would make an interesting class. Suggestions?

Counting Triangles

Filed under: Hadoop,MPP,Vectors — Patrick Durusau @ 8:14 pm

Counting Triangles

From the post:

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use-cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar jobs are run or data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

I don’t think you will be surprised at the outcome, but the exercise is instructive in a number of ways. Primarily: don’t assume performance without testing. If all the bellowing leads to more competition and close work on software and algorithms, I think there will be some profit from it.
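
For anyone who wants to follow along with neither Hadoop nor Vertica, the core computation is small: a triangle is an edge (u, v) plus a common neighbor of both endpoints, which is essentially the intersection the join-based approaches compute. A toy sketch with adjacency sets:

```python
# Count triangles in an undirected graph: for each edge (u, v), count common
# neighbors w with u < v < w so every triangle is counted exactly once.
# Assumes each undirected edge appears once in the edge list.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

triangles = 0
for u, v in edges:
    u, v = min(u, v), max(u, v)
    for w in adj[u] & adj[v]:
        if w > v:                 # enforce u < v < w to avoid double counting
            triangles += 1

print(triangles)   # 2: (a, b, c) and (b, c, d)
```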

Tree Traversal in O(1) Space

Filed under: Algorithms,Graphs,Software,Trees — Patrick Durusau @ 8:14 pm

Tree Traversal in O(1) Space by Sanjoy.

From the post:

I’ve been reading some texts recently, and came across a very interesting way to traverse a graph, called pointer reversal. The idea is this — instead of maintaining an explicit stack (of the places you’ve visited), you try to store the relevant information in the nodes themselves. One approach that works on directed graphs with two (outgoing) arcs per node is called the Deutsch-Schorr-Waite algorithm. This was later extended by Thorelli to work for directed graphs with an unknown number of (outgoing) arcs per node.

Implemented here for a tree; care to go for a more general graph?
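
Pointer reversal proper (Deutsch-Schorr-Waite) takes some care to get right, so here is a neighboring O(1)-extra-space trick for binary trees, Morris in-order traversal, which temporarily rewires predecessor pointers instead of keeping a stack. To be clear, this is not the algorithm from the post, just a cousin that is easy to sketch:

```python
# Morris in-order traversal: O(1) extra space by threading each node's in-order
# predecessor back to it, visiting, then undoing the thread. A relative of
# pointer reversal, not the Deutsch-Schorr-Waite algorithm itself.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def morris_inorder(root):
    node, out = root, []
    while node:
        if node.left is None:
            out.append(node.value)
            node = node.right
        else:
            pred = node.left
            while pred.right and pred.right is not node:
                pred = pred.right
            if pred.right is None:
                pred.right = node        # create the temporary thread
                node = node.left
            else:
                pred.right = None        # remove the thread, restoring the tree
                out.append(node.value)
                node = node.right
    return out

tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
print(morris_inorder(tree))   # [1, 2, 3, 4, 5, 6, 7]
```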

An Introduction to Tinkerpop

Filed under: Blueprints,Gremlin,Pipes,Rexster,TinkerPop — Patrick Durusau @ 8:13 pm

An Introduction to Tinkerpop by Takahiro Inoue.

Excellent introduction to the Tinkerpop stack.

The Definition of GraphDB

Filed under: GraphDB,Graphs — Patrick Durusau @ 8:13 pm

The Definition of GraphDB by Takahiro Inoue.

Yes, GraphDB is also a product name from Sones but in this context it means graph database in the generic sense.

Good thing we don’t have naming issues in the topic map/semantic integration area, would make it hard to find things. 😉

These are some of the best graphics I have ever seen for introducing graphs and their capabilities as data structures.

Definitely worth spending some time with them and forwarding to others.

An Introduction to Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:13 pm

An Introduction to Neo4j by Takahiro Inoue (Leader of MongoDB JP)

I have seen a number of introductions to Neo4j that cover the same basic themes.

This one is different. You need to take the time to review this one and forward it to others.

Not just because it goes into more technical detail, but because I sense a difference in appreciation for the power of Neo4j.

BTW, when you get to slide 24, what does that graphic remind you of? That’s what I thought too.

Definitely a researcher to follow. Twitter handle: @doryokujin.

DIY – Topic Modeling

Filed under: Latent Dirichlet Allocation (LDA),Software — Patrick Durusau @ 8:12 pm

How to do Your Own Topic Modeling

From the post:

In the first Teaching with Technology Tuesday of the fall 2011 semester, David Newman delivered a presentation on topic modeling to a full house in Bass’s L01 classroom. His research concentrates on data mining and machine learning, and he has been working with Yale for the past three years in an IMLS funded project on the applications of topic modeling in museum and library collections. In Tuesday’s talk, David broke down what topic modeling is, how it can be useful, and introduced a tool he designed to make the process accessible to anyone who can use a computer.

Summary of what sounds like an interesting presentation on the use of topic modeling (Latent Dirichlet Allocation/LDA) along with links to software. Enough detail that if topic modeling is unfamiliar, you will get the gist of it.

The usual cautions about LDA apply: It can’t model what’s not present, works at the document level (too coarse for many purposes), your use of the software has a dramatic impact on the results, etc. Useful tool, just be careful how much you rely upon it without checking the results.
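
If you would rather experiment than watch, the workflow is short with the gensim library (assumed installed; not the tool from the presentation). The tiny corpus and topic count below are placeholders; real runs need many more documents and real preprocessing:

```python
# Minimal LDA run with gensim: tokenize, build a dictionary and bag-of-words
# corpus, fit the model, inspect topics. Corpus and parameters are toy values.
from gensim import corpora, models

documents = [
    "graphs and graph databases for topic maps",
    "machine learning on large graphs",
    "library metadata and cataloging standards",
    "metadata vocabularies for library catalogs",
]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic_id, words in lda.print_topics(num_topics=2, num_words=4):
    print(topic_id, words)
```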
