Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

June 20, 2012

Green-Marl – The Paper

Filed under: DSL,Graphs,Green-Marl — Patrick Durusau @ 4:36 pm

Green-Marl: A DSL for Easy and Efficient Graph Analysis by Sungpack Hong, Hassan Chafi, Eric Sedlar, and Kunle Olukotun.

I previously reported on the Green-Marl website/software and mentioned this paper: Green-Marl. Catching up on a severe backlog of papers during a slow summer weekend, I read in the Green-Marl paper:

The above mathematical descriptions imply two important assumptions that Green-Marl makes:

1. The graph is immutable and is not modified during analysis.

2. There are no aliases between graph instances nor between graph properties.

We assume an immutable graph so that we can focus on the task of graph analysis, rather than worry about orthogonal issues such as how graphs are constructed or modified. Since Green-Marl is designed to be used in re-writing only parts of the user application (Section 3.1), one can construct or modify the graph in their own preferred way (e.g. from a data file, from a database, etc.) but when a Green-Marl generated implementation is handed a graph, the assumption is that the graph will not be modified while a Green-Marl procedure is analyzing it.

I can imagine assumption #1 being in place for processing a topic map but certainly not assumption #2.

But given the type of graph analysis they want to perform, the assumptions are justifiable.

Given a graph, \( G = (V, E) \), and a set of properties defined on the graph, \( \Pi = \{P_1, P_2, \ldots, P_n\} \), our language is specifically designed for the following types of graph analysis:


  • Computing a scalar value from \( (G, \Pi) \), e.g. the conductance of a sub-graph
  • Computing a new property \( P_{n+1} \) from \( (G, \Pi) \), e.g. the pagerank of each node of a graph
  • Selecting a subgraph of interest out of the original graph, \( (V', E') \subset (V, E) \), e.g. strongly connected components of a graph
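
As a concrete instance of the first analysis type above (a scalar computed from \( (G, \Pi) \)), here is a minimal Python sketch of sub-graph conductance over a plain adjacency-list graph. The function and variable names are mine, and it has nothing to do with Green-Marl itself:

```python
def conductance(adj, S):
    """Conductance of node set S in an undirected graph given as adjacency lists.

    conductance(S) = cut(S, V \\ S) / min(vol(S), vol(V \\ S)),
    where vol(X) is the sum of degrees of nodes in X.
    """
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

# Tiny example: a 4-cycle 0-1-2-3-0
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(conductance(adj, {0, 1}))  # 2 cut edges / min(4, 4) = 0.5
```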

What if we share the assumption:

1. The graph is immutable and is not modified during analysis.

And have a different second assumption:

2. There are aliases between nodes and between graph properties.

We want to say:

Given a graph, \( G = (V, E) \), and a set of properties defined on the graph, \( \Pi = \{P_1, P_2, \ldots, P_n\} \),

  1. Compute the aliases for each property defined on the graph
  2. Compute the aliases for each node of the graph

(My suspicion being that aliases of properties must be computed before aliases on nodes, although that would depend upon how aliases between nodes are defined.)

Where actions based on the computation of aliases for both properties and nodes are a separate step in the analysis.
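
To make the two-step idea concrete, here is a rough Python sketch. The alias rules are hypothetical (properties are aliases if they assign identical values everywhere; nodes are aliases if they agree on all canonical properties), chosen only to illustrate why property aliases would be computed before node aliases:

```python
from collections import defaultdict

def property_aliases(props):
    """Group properties that assign identical values to every node (hypothetical rule)."""
    groups = defaultdict(list)
    for name, values in props.items():
        groups[tuple(sorted(values.items()))].append(name)
    return [names for names in groups.values() if len(names) > 1]

def node_aliases(nodes, props, alias_groups):
    """Group nodes that agree on one canonical property per alias group (hypothetical rule)."""
    aliased = {n for names in alias_groups for n in names}
    canonical = {names[0] for names in alias_groups} | (set(props) - aliased)
    groups = defaultdict(list)
    for node in nodes:
        key = tuple(sorted((p, props[p].get(node)) for p in canonical))
        groups[key].append(node)
    return [ns for ns in groups.values() if len(ns) > 1]

# Hypothetical graph properties: "author" and "creator" are aliases here.
props = {
    "author":  {"n1": "Hong", "n2": "Chafi"},
    "creator": {"n1": "Hong", "n2": "Chafi"},
    "year":    {"n1": 2012,   "n2": 2012},
}
groups = property_aliases(props)                  # step 1: property aliases
print(groups)                                     # [['author', 'creator']]
print(node_aliases(["n1", "n2"], props, groups))  # step 2: node aliases (none here)
```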

(There are some other complexities that I haven’t fully teased out so suggestions/comments welcome. But then that is always the case.)

Neo4j in the Trenches [webinar notes]

Filed under: Design,Graphs,Neo4j — Patrick Durusau @ 4:18 pm

Neo4j in the Trenches

From the description:

OpenCredo discusses Opigram: a social recommendation engine

In this webinar, Nicki Watt of OpenCredo presents the lessons learned (and being learned) on an active Neo4j project: Opigram. Opigram is a socially oriented recommendation engine which is already live, with some 150k users and growing. The webinar will cover Neo4j usage, challenges encountered, and solutions to these challenges.

I was scheduled to watch it live but it conflicted, unexpectedly, with nap time. 😉

Watching it now and it is very impressive!

Lots of details and code!

Some specific points that I found interesting:

  • Know what questions you are going to ask the graph
  • Important things => nodes (can you say subjects?)
  • batch deleting (experiment with # of nodes) (Is this still an issue?)
  • reservoir sampling algorithm (you need to look deeply at this; a quick sketch follows this list)
  • multi-threading fixed in 1.7 or later (issue discovered by profiling but should profile in any case)
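
Since the reservoir sampling item deserves a closer look, here is a minimal sketch of the classic single-pass version (Algorithm R): keep the first k items, then replace a random slot with probability k/i as item i streams by.

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length in one pass (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randint(1, i)        # 1..i inclusive
            if j <= k:
                reservoir[j - 1] = item     # replace a random slot with probability k/i
    return reservoir

# e.g. sample 5 node ids from a large iterator without holding it all in memory
print(reservoir_sample(range(1000000), 5))
```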

Highly recommended!


Curious about your thoughts on the deletion issue?

On one hand, you can do “soft deletes” as discussed in this presentation, but at some point that may have an adverse impact on graph size and complexity.

On the other hand, “actual” deletion seems to be disfavored.

But change (read: deletion/update) is a fact of enterprise data. (Data generally, but it sounds more impressive to say “enterprise data.”)

June 18, 2012

Regrets – June 18-19, 2012

Filed under: Uncategorized — Patrick Durusau @ 8:18 pm

Apologies but I will not be making technical posts to Another Word For It on June 18th or June 19th, 2012.

Medical testing that was supposed to end mid-day on Monday has spilled over into Tuesday. And most of Tuesday at that.

I don’t want to post unless I think the information is useful and/or I have something useful to say about the information. I’m ok but can’t focus enough to promise either one.

On the “bright” side, I hope to return to posting on Wednesday (June 20, 2012) and am only a few posts away from #5,000!

I appreciate well wishes but be aware that I won’t be answering emails during this time period as well. I stole a few minutes to make this post.

June 17, 2012

MapR Now Available as an Option on Amazon Elastic MapReduce

Filed under: Amazon Web Services AWS,Hadoop,MapR,MapReduce — Patrick Durusau @ 3:59 pm

MapR Now Available as an Option on Amazon Elastic MapReduce

From the post:

MapR Technologies, Inc., the provider of the open, enterprise-grade distribution for Apache Hadoop, today announced the immediate availability of its MapR Distribution for Hadoop as an option within the Amazon Elastic MapReduce service. Customers can now provision dynamically scalable MapR clusters while taking advantage of the flexibility, agility and massive scalability of Amazon Web Services (AWS). In addition, AWS has made its own Hadoop enhancements available to MapR customers, allowing them to seamlessly use MapR with other AWS offerings such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB and Amazon CloudWatch.

“We’re excited to welcome MapR’s feature-rich distribution as an option for customers running Hadoop in the cloud,” said Peter Sirota, general manager of Amazon Elastic MapReduce, AWS. “MapR’s innovative high availability data protection and performance features combined with Amazon EMR’s managed Hadoop environment and seamless integration with other AWS services provides customers a powerful tool for generating insights from their data.”

Customers can provision MapR clusters on-demand and automatically terminate them after finishing data processing, reducing costs as they only pay for the resources they consume. Customers can augment their existing on-premise deployments with AWS-based clusters to improve disaster recovery and access additional compute resources as required.

“For many customers there is no longer a compelling business case for deploying an on-premise Hadoop cluster given the secure, flexible and highly cost effective platform for running MapR that AWS provides,” said John Schroeder, CEO and co-founder, MapR Technologies. “The combination of AWS infrastructure and MapR’s technology, support and management tools enables organizations to potentially lower their costs while increasing the flexibility of their data intensive applications.”

Are you doing topic maps in the cloud yet?

A rep from one of the “big iron” companies was telling me how much more reliable owning your own hardware with their software is than the cloud.

True, but that has the same answer as the question: Who needs the capacity to process petabytes of data in real time?

If the truth were told, there are a few companies and organizations that could benefit from that capability.

But the rest of us don’t have that much data or the talent to process it if we did.

Over the summer I am going to try the cloud out, both generally and for topic maps.

Suggestions/comments?

Neo4j [Perl client for Neo4j]

Filed under: Neo4j,Perl — Patrick Durusau @ 3:45 pm

Neo4j [Perl client for Neo4j]

One of my daily searches turned up:

Neo4j – A client for the Neo4j graph database

Perl client for Neo4j. A very early version, Neo4j-0.01_01.

Neo4j is increasing in popularity by leaps and bounds. So is the number of clients in a variety of languages.

If the string “Neo4j” were reserved for the graph software from the Neo4j project, and some other string used for the clients that work with it, searching for both the core package and additional software would be easier.

Imagine the confusion if dozens of clients are all called “Neo4j” and you are searching for email or newsgroup reports on some issue.

Semantic confusion is going to happen. I don’t mean to imply otherwise.

But let’s not help it along if we can avoid it.

How to Read Mathematics

Filed under: Mathematics — Patrick Durusau @ 3:30 pm

How to Read Mathematics by Shai Simonson and Fernando Gouvea.

From the post:

A reading protocol is a set of strategies that a reader must use in order to benefit fully from reading the text. Poetry calls for a different set of strategies than fiction, and fiction a different set than non-fiction. It would be ridiculous to read fiction and ask oneself what is the author’s source for the assertion that the hero is blond and tanned; it would be wrong to read non-fiction and not ask such a question. This reading protocol extends to a viewing or listening protocol in art and music. Indeed, much of the introductory course material in literature, music and art is spent teaching these protocols.

Mathematics has a reading protocol all its own, and just as we learn to read literature, we should learn to read mathematics. Students need to learn how to read mathematics, in the same way they learn how to read a novel or a poem, listen to music, or view a painting. Ed Rothstein’s book, Emblems of Mind, a fascinating book emphasizing the relationship between mathematics and music, touches implicitly on the reading protocols for mathematics.

When we read a novel we become absorbed in the plot and characters. We try to follow the various plot lines and how each affects the development of the characters. We make sure that the characters become real people to us, both those we admire and those we despise. We do not stop at every word, but imagine the words as brushstrokes in a painting. Even if we are not familiar with a particular word, we can still see the whole picture. We rarely stop to think about individual phrases and sentences. Instead, we let the novel sweep us along with its flow and carry us swiftly to the end. The experience is rewarding, relaxing and thought provoking.

Novelists frequently describe characters by involving them in well-chosen anecdotes, rather than by describing them by well-chosen adjectives. They portray one aspect, then another, then the first again in a new light and so on, as the whole picture grows and comes more and more into focus. This is the way to communicate complex thoughts that defy precise definition.

Mathematical ideas are by nature precise and well defined, so that a precise description is possible in a very short space. Both a mathematics article and a novel are telling a story and developing complex ideas, but a math article does the job with a tiny fraction of the words and symbols of those used in a novel. The beauty in a novel is in the aesthetic way it uses language to evoke emotions and present themes which defy precise definition. The beauty in a mathematics article is in the elegant efficient way it concisely describes precise ideas of great complexity.

What are the common mistakes people make in trying to read mathematics? How can these mistakes be corrected?

Interesting post, along with a plug for the author’s new book, Rediscovering Mathematics. At a list price of $55 for a ~ 200 page hardback, I will be waiting for the paperback version.

Data mining is becoming more deeply intertwined with computer science and mathematics.

You can’t develop a facility for reading mathematics too soon. This is a good place to start.

(I first saw this at: Math Comprehension Made Easy.)

Data Mining with Microsoft SQL Server 2008 [Book Review]

Filed under: Data Mining,Microsoft,SQL Server — Patrick Durusau @ 3:10 pm

Data Mining with Microsoft SQL Server 2008

Sandro Saitta writes:

If you are using Microsoft data mining tools, this book is a must have. Written by MacLennan, Tang and Crivat, it describes how to perform data mining using SQL Server 2008. The book is huge – more than 630 pages – but it is normal since authors give detailed explanation for each data mining function. The book covers topics such as general data mining concepts, DMX, Excel add-ins, OLAP cubes, data mining architecture and many more. The seven data mining algorithms included in the tool are described in separate chapters.

The book is well written, so it can be read from A to Z or by selecting specific chapters. Each theoretical concept is explained through examples. Using screenshots, each step of a given method is presented in details. It is thus more a user manual than a book explaining data mining concepts. Don’t expect to read any detailed algorithms or equations. A good surprise of the book are the case studies. They are present in most chapters and show real examples and how to solve them. It really shows the experience of the authors in the field.

I haven’t seen the book, yet, but that can be corrected. 😉

User Interface Design and Implementation [MIT]

Filed under: Interface Research/Design — Patrick Durusau @ 3:00 pm

User Interface Design and Implementation

Description:

6.831/6.813 examines human-computer interaction in the context of graphical user interfaces. The course covers human capabilities, design principles, prototyping techniques, evaluation techniques, and the implementation of graphical user interfaces. Deliverables include short programming assignments and a semester-long group project. Students taking the graduate version also have readings from current literature and additional assignments.

This is a “traditional” courseware offering and not the recent Harvard/MIT edx venture.

Having said that, if you are looking for a reading list in the field, see the “recommended” books for the class.

Or for that matter, check out the lecture notes.

June 16, 2012

Getting Started with Apache Camel

Filed under: Apache Camel,Data Integration — Patrick Durusau @ 4:09 pm

Getting Started with Apache Camel

6/28/2012 10:00 AM EST

From the webpage:

Description:

This session will teach you how to get a good start with Apache Camel. It will cover the basic concepts of Camel such as Enterprise Integration Patterns and Domain Specific Languages, all explained with simple examples demonstrating the theory applied in practice using code. We will then discuss how you can get started developing with Camel and how to setup a new project from scratch—using Maven and Eclipse tooling. This session includes live demos that show how to build Camel applications in Java, Spring, OSGi Blueprint and alternative languages such as Scala and Groovy. We demonstrate how to build custom components and we will share highlights of the upcoming Apache Camel 2.10 release.

Speaker:

Claus Ibsen has worked on Apache Camel for years and he shares a great deal of his expertise as a co-author of Manning’s Camel in Action book. He is a principal engineer working for FuseSource specializing in the enterprise integration space. He lives in Sweden near Malmo with his wife and dog.

Data integration talents are great, but coupled with integration tools, they are even better! See you at the webinar!

Knowledge Design Patterns

Filed under: Design,Knowledge,Knowledge Organization,Ontology — Patrick Durusau @ 3:58 pm

Knowledge Design Patterns

John Sowa announced these slides as:

Last week, I presented a 3-hour tutorial on Knowledge Design Patterns at the Semantic Technology Conference in San Francisco. Following are the slides:

http://www.jfsowa.com/talks/kdptut.pdf

The talk was presented on June 4, but these are the June 10th version of the slides. They include a few revisions and extensions, which I added to clarify some of the issues and to answer some of the questions that were asked during the presentation.

And John posted an outline of the 130 slides:

Outline of This Tutorial

1. What are knowledge design patterns?
2. Foundations of ontology.
3. Syllogisms, categorical and hypothetical.
4. Patterns of logic.
5. Combining logic and ontology.
6. Patterns of patterns of patterns.
7. Simplifying the user interface.

Particularly if you have never seen a Sowa presentation, take a look at the slides.

Deep Dive with MongoDB [Virtual Conference]

Filed under: Conferences,MongoDB — Patrick Durusau @ 3:46 pm

Deep Dive with MongoDB (online conference)

Wednesday July 11th 11:00 AM EDT / 8:00 AM PDT

From the webpage:

This four hour online conference will introduce you to some MongoDB basics and get you up to speed with why and how you should choose MongoDB for your next project. The conference will begin at 8:00am PST with a brief introduction and last until 12:00pm PST covering four topics with plenty of time for Q&A.

The program:

  • Introduction 8:00-8:10am PST
  • 8:10am PST Building Your First App – Asya Kamsky, Senior Solutions Architect, 10gen
  • 9:00am PST Schema Design with MongoDB: Principles and Practices – Antoine Girbal, Solutions Architect, 10gen
  • 9:50am PST Replication and Replica Sets – Asya Kamsky, Senior Solutions Architect, 10gen
  • 11:05am PST – Introducing MongoDB into Your Organization – Edouard Servan-Schreiber, Director for Solution Architecture, 10gen

What I wonder about is which startup or startup conference is going to put out a call for papers, do peer review, and then, on the days of the meeting, video conference speakers in with inexpensive software and tweet the presentations right before they start.

Imagine having 1,000 people listening to your presentation instead of < 50. Could increase the impact of your ideas and the reach of your startup. (Jack Park forwarded this to my attention.)

Third International Workshop on Consuming Linked Data (COLD2012)

Filed under: Conferences,Linked Data — Patrick Durusau @ 3:30 pm

Third International Workshop on Consuming Linked Data (COLD2012)

Important dates:

Paper submission deadline: July 31, 2012, 23.59 Hawaii time
Acceptance notification: August 21, 2012
Camera-ready versions of accepted papers: September 10, 2012
Workshop date: November, 2012

Abstract:

The quantity of published Linked Data is increasing dramatically. However, applications that consume Linked Data are not yet widespread. Current approaches lack methods for seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces. Addressing these issues requires well-founded research, including the development and investigation of concepts that can be applied in systems which consume Linked Data from the Web. Following the success of the 1st International Workshop on Consuming Linked Data, we organize the second edition of this workshop in order to provide a platform for discussion and work on these open research problems. The main objective is to provide a venue for scientific discourse — including systematic analysis and rigorous evaluation — of concepts, algorithms and approaches for consuming Linked Data.

….

Objectives

The term Linked Data refers to a practice for publishing and interlinking structured data on the Web. Since the practice has been proposed in 2006, a grass-roots movement has started to publish and to interlink multiple open databases on the Web following the Linked Data principles. Due to conference workshops, tutorials, and general evangelism an increasing number of data publishers such as the BBC, Thomson Reuters, The New York Times, the Library of Congress, and the UK and US governments have adopted Linked Data principles. The ongoing effort resulted in bootstrapping the Web of Data which, today, comprises billions of RDF triples including millions of links between data sources. The published datasets include data about books, movies, music, radio and television programs, reviews, scientific publications, genes, proteins, medicine, and clinical trials, geographic locations, people, companies, statistical and census data, etc.

Several open issues that make the development of Linked Data based applications a challenging or still impossible task. These issues include the lack of approaches for seamless integration of Linked Data from multiple sources, for dynamic, on-the-fly discovery of available data, for information quality assessment, and for elaborate end user interfaces. These open issues can only be addressed appropriately when they are conceived as research problems that require the development and systematic investigation of novel approaches. The International Workshop on Consuming Linked Data (COLD) aims to provide a platform for the presentation and discussion of such approaches. Our main objective is to receive submissions that present scientific discussion (including systematic evaluation) of concepts and approaches, instead of exposition of features implemented in Linked Data based applications. For practical systems without formalization or evaluation we refer interested participants to other offerings at ISWC, such as the Semantic Web Challenge or the Demo Track. As such, we see our workshop as orthogonal to these events.

Probably prejudice on my part but I think topic maps would make a very viable approach for “…seamless integration of Linked Data from multiple sources…” Integration of dynamic resources is going to require a potentially semantically dynamic solution. One like topic maps.

SharePoint Module 3.2 HotFix 3 Now Available [Javascript bug]

Filed under: .Net,SharePoint — Patrick Durusau @ 3:19 pm

SharePoint Module 3.2 HotFix 3 Now Available

From the post:

A new hotfix package is available for version 3.2 of the TMCore SharePoint Module.

Systems Affected

This hotfix should be applied to any installation of the TMCore SharePoint Module 3.2 downloaded before 15th June 2012. If you downloaded your copy of the software from our site on or after this date, the hotfix is included in the package and you do not need to apply it again.

To determine if your system is affected, check the File Version property of the assembly NetworkedPlanet.SharePoint in the GAC (browse to C:\Windows\ASSEMBLY, locate the NetworkedPlanet.SharePoint assembly, right-click and choose Properties. The File Version can be found on the Version tab above Description and Copyright). This hotfix updates the File Version of the NetworkedPlanet.SharePoint assembly to 2.2.3.0 – if the file version shown is greater than or equal to 2.2.3.0, then you do not need to apply this hotfix.

The change log reports:

BUGFIX: Hierarchy topic selector was experiencing a javascript error when topic names contained apostrophes

Does She or Doesn’t She?

Filed under: Image Processing,Image Understanding,Information Integration,Topic Maps — Patrick Durusau @ 2:57 pm

Information Processing: Adding a Touch of Color

From the post:

An innovative computer program brings color to grayscale images.

Creating a high-quality realistic color image from a grayscale picture can be challenging. Conventional methods typically require the user’s input, either by using a scribbling tool to color the image manually or by using a color transfer. Both options can result in poor colorization quality limited by the user’s degree of skill or the range of reference images available.

Alex Yong-Sang Chia at the A*STAR’s Institute for Infocomm Research and co-workers have now developed a computer program that utilizes the vast amount of imagery available on the internet to find suitable color matches for grayscale images. The program searches hundreds of thousands of online color images, cross-referencing their key features and objects in the foreground with those of grayscale pictures.

“We have developed a method that takes advantage of the plentiful supply of internet data to colorize gray photos,” Chia explains. “The user segments the image into separate major foreground objects and adds semantic labels naming these objects in the gray photo. Our program then scans the internet using these inputs for suitable object color matches.”

If you think about it for a moment, it appears that subject recognition in images is being performed here. As the researchers concede, it’s not 100% but then it doesn’t need to be.

I wonder if the human users have to correct the coloration for an image more than once per source color image? That is, does the system “remember” earlier choices?

The article doesn’t say so I will follow up with an email.

Keeping track of user-corrected subject recognition would create a bread crumb trail for other users confronted with the same images. (In other words, a topic map.)

Binary Jumbled String Matching: Faster Indexing in Less Space

Filed under: Binary Search,String Matching — Patrick Durusau @ 2:11 pm

Binary Jumbled String Matching: Faster Indexing in Less Space by Golnaz Badkobeh, Gabriele Fici, Steve Kroon, and Zsuzsanna Lipták.

Abstract:

We introduce a new algorithm for the binary jumbled string matching problem, where the aim is to decide whether a given binary string of length n has a substring whose multiplicity equals a query vector (x, y). For example, for the string abaaababab, the query (3, 1) would return “yes”, and the query (5, 1) “no”. Previous solutions answered queries in constant time by creating an index of size O(n) in a pre-processing step. The fastest known approach to constructing this index is O(n^2 / log n) [Burcsi et al., FUN 2010; Moosa and Rahman, IPL 2010] resp. O(n^2 / log^2 n) in the word-RAM model [Moosa and Rahman, JDA, 2012]. We propose an algorithm which creates an index for an input string s by using the string’s run-length encoding. This index can be queried in logarithmic time. Our index has worst-case size n, but extensive experimentation has consistently yielded a size which is between 0.8 and 3 times the length of the run-length encoding of s. The algorithm runs in time O(r^2 log r), where r is the number of a-runs of s, i.e., half the length of the run-length encoding of the string. This is no worse than previous solutions if r = O(n / log n) and better if r = o(n / log n), which is the case for binary strings in many domains. Our experimentation further shows that in the vast majority of cases, the construction algorithm does not exceed the space needed for the index, and when it does, it does so only by a tiny constant.

Queries of binary strings in logarithmic time anyone?

I cite this in part because some topic mappers may be indexing binary strings but also as encouragement to consider the properties of your data carefully.

You may discover it has properties that lend themselves to very efficient processing.
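
To see why constant-time (or here, logarithmic-time) answers are possible at all: for a binary string, the number of b’s over all windows of a fixed length forms a contiguous interval, so an index only needs the minimum and maximum per window length. Below is a naive O(n^2) Python sketch of that interval idea; the paper’s actual contribution, building a much smaller index from the run-length encoding, is not reproduced here:

```python
def build_index(s):
    """For each window length L, store (min, max) count of 'b' over all windows of length L."""
    n = len(s)
    index = {}
    for L in range(1, n + 1):
        count = s[:L].count("b")
        lo = hi = count
        for i in range(L, n):                        # slide the window one position
            count += (s[i] == "b") - (s[i - L] == "b")
            lo, hi = min(lo, count), max(hi, count)
        index[L] = (lo, hi)
    return index

def query(index, x, y):
    """Is there a substring with x 'a's and y 'b's? (interval property of binary strings)"""
    L = x + y
    if L not in index:
        return False
    lo, hi = index[L]
    return lo <= y <= hi

idx = build_index("abaaababab")
print(query(idx, 3, 1))   # True  (e.g. "baaa" has three a's and one b)
print(query(idx, 5, 1))   # False
```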

Wavelet Trees for All

Filed under: Graphs,Indexing,Wavelet Trees — Patrick Durusau @ 1:51 pm

Wavelet Trees for All (free version) (official publication reference)

Abstract:

The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.

Good survey article but can be tough sledding depending on your math skills. Fortunately the paper covers enough uses and has references to freely available applications of this technique. I am sure you will find one that trips your understanding of wavelet trees.
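
If you want something hands-on before (or instead of) the math, here is a small Python sketch of the core wavelet tree operation, rank(c, i): how many times symbol c occurs in the first i positions of a sequence. It is the naive pointer-based version, without the compressed bitvectors that make the real structure space-efficient:

```python
class WaveletTree:
    """Naive wavelet tree over a sequence; supports rank(c, i) in O(log sigma) node hops."""
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.bits = None                     # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        self.left_set = set(self.alphabet[:mid])
        self.bits = [0 if c in self.left_set else 1 for c in seq]   # 0 = left, 1 = right
        self.left = WaveletTree([c for c in seq if c in self.left_set], self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c in seq[:i]."""
        if self.bits is None:                    # leaf: every symbol here is c
            return i
        if c in self.left_set:
            return self.left.rank(c, i - sum(self.bits[:i]))   # zeros before position i
        return self.right.rank(c, sum(self.bits[:i]))          # ones before position i

wt = WaveletTree("abracadabra")
print(wt.rank("a", 11))   # 5
print(wt.rank("r", 8))    # 1
```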

Semantic Technology For Intelligence, Defense, and Security STIDS 2012

Filed under: Conferences,Defense,Intelligence — Patrick Durusau @ 1:37 pm

SEMANTIC TECHNOLOGY FOR INTELLIGENCE, DEFENSE, AND SECURITY STIDS 2012

Paper submissions due: July 24, 2012
Notification of acceptance: August 28, 2012
Camera-ready papers due: September 18, 2012
Presentations due: October 17, 2012

Tutorials October 23
Main Conference October 24-26
Early Bird Registration rates until September 25

From the call for papers:

The conference is an opportunity for collaboration and cross-fertilization between researchers and practitioners of semantic-based technologies with particular experience in the problems facing the Intelligence, Defense, and Security communities. It will feature invited talks from prominent ontologists and recognized leaders from the target application domains.

To facilitate interchange among communities with a clear commonality of interest but little history of interaction, STIDS will host two separate tracks. The Research Track will showcase original, significant research on semantic technologies applicable to problems in intelligence, defense or security. Submissions to the research track are expected to clearly present their contribution, demonstrate its significance, and show the applicability to problems in the target applications domain. The Applications Track provides a forum for presenting implemented semantic-based applications to intelligence, defense, or security, as well as to discuss and evaluate the use of semantic techniques in these areas. Of particular interest are comparisons between different technologies or approaches and lessons learned from applications. By capitalizing on this opportunity, STIDS could spark dramatic progress toward transitioning semantic technologies from research to the field.

A hidden area where it will be difficult to cut IT budgets. Mostly because it is “hidden.” 😉

Not the only reason you should participate but perhaps an extra incentive to do well!

June 15, 2012

ActionGenerator, Part One

Filed under: ActionGenerator — Patrick Durusau @ 4:22 pm

ActionGenerator, Part One

Rafał Kuć writes:

In this post we’ll introduce you to ActionGenerator, one of several open source projects we are working on. ActionGenerator lets you generate actions (you can also think of actions as events) from action sources and play those actions with ActionGenerator’s action player to one of the sinks. The rest is done by ActionGenerator. ActionGenerator comes with several action sources and sinks, but one can easily implement custom action sources and sinks and play them with ActionGenerator. Let’s dig into the details.

This is the first part of the two-part post series where we show what ActionGenerator is, how you can use it for your needs, how you can extend it, and finally what existing action generators are there for you to use out-of-the-box.

What is ActionGenerator?

ActionGenerator is focused on generating actions (aka events) of your choice. Imagine you want to feed your search engine with millions of documents or you want to run stress test and see if your application can work under load for hours. That’s exactly where you can use ActionGenerator. If existing sources and sinks don’t fit your needs all you have to do is write a simple action type, your action source, and a sink to consume those actions, and you are ready to go. The rest, which includes playing those actions and their parallelization with multiple-threads, as well as performance metrics/stats gathering, is done by ActionGenerator itself. You only need to worry about your actions.

Current Status

So far at Sematext we’ve written all code needed to generate data and query actions for search engines like Apache Solr, ElasticSearch and SenseiDB. All this is included in ActionGenerator so you could use it, too. We’ll expand the number of sources and sinks over time based on our own needs, but if you would like to add support for other sources and sinks, please issue a pull request — contributions are always very welcome! 🙂

Sounds like something interesting to look at over the weekend!

What would you like to be testing?
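
To make the source/player/sink vocabulary concrete, here is the shape of the pattern in a few lines of Python. These names are hypothetical, not ActionGenerator’s actual (Java) API; see the project itself for the real classes:

```python
import random
import time

def search_query_source(n):
    """A hypothetical action source: yields n query 'actions'."""
    terms = ["hadoop", "solr", "elasticsearch", "topic maps"]
    for _ in range(n):
        yield {"type": "query", "q": random.choice(terms)}

class PrintSink:
    """A hypothetical sink: consumes actions (a real sink might POST to Solr or ElasticSearch)."""
    def consume(self, action):
        print("executing", action)

def play(source, sink):
    """The 'player': drives actions from source to sink and gathers simple stats."""
    start, count = time.time(), 0
    for action in source:
        sink.consume(action)
        count += 1
    print("played %d actions in %.3fs" % (count, time.time() - start))

play(search_query_source(5), PrintSink())
```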

Data Mining Music

Filed under: Humor,Music — Patrick Durusau @ 3:34 pm

Data Mining Music by Ajay Ohri.

Ajay points to a 1985 paper by Donald Knuth, “The Complexity of Songs.”

Not the right time of year but I will forget it by the appropriate time next year.

DataArt with BBC Backstage

Filed under: Graphics,Maps,News,Visualization — Patrick Durusau @ 3:22 pm

DataArt with BBC Backstage

From the post:

Locus is a news archive visualisation that maps Guardian articles to places over time – a spatial & temporal mapping of events and media attention in the last decade. We’re using the Guardian Open Platform because it provides an API that can be queried by date, and an archive going back over 10 years.

Each place is represented as a geo-located dot that changes scale in proportion to that place’s appearance in news articles over time. As the time slider selection changes the circles grow and shrink giving a picture of which locations are in the news at any given time. To see all the news articles mapped, you can extend the time slider to the full search period. You can click on the places to see the news headlines for that place and time period. The headlines link through to the online articles at the Guardian.

There are two versions of the project: Locus Afghanistan and Locus Iraq.

Very cool!

Now just imagine that time were your scope for a location you selected on the map, so that by choosing a location + time, a merged set of results was returned.

That may or may not help to answer the question of who knew what when, but it is a place to start.

(I first saw this at: Is it Data or Art? Check out these Newsworthy Visualizations from the BBC)

VMware’s Project Serengeti And What It Means For Enterprise Hadoop

Filed under: Hadoop,MapReduce,Serengeti — Patrick Durusau @ 3:11 pm

VMware’s Project Serengeti And What It Means For Enterprise Hadoop by Chuck Hollis.

From the post:

Virtualize something — anything — and you make it easier for everyone to consume: IT vendors, enterprise IT organizations — and, most importantly, business users. The vending machine analogy is a powerful and useful one.

At a macro level, cloud is transforming IT, and virtualization is playing a starring role.

Enterprise-enhanced flavors of Hadoop are starting to earn prized roles in an ever-growing variety of enterprise applications. At a macro level, big data is transforming business, and Hadoop is playing an important role.

The two megatrends intersect nicely in VMware’s recently announced Project Serengeti: an encapsulation of popular Hadoop distros that make big data analytics tools far easier to deploy and consume in enterprise — or service provider — settings.

And if you’re interested in big data, virtualization, cloud et. al. — you’ll want to take a moment to get more familiar with what’s going on here.

Chuck has some really nice graphics and illustrations, pitched to a largely non-technical audience.

If you want the full monty, see: Project Serengeti: There’s a Virtual Elephant in my Datacenter by Richard McDougall.

The main project page for Serengeti.

User mailing list for Serengeti.

OData Extensions for Data Aggregation

Filed under: Aggregation,Data Aggregation,Odata — Patrick Durusau @ 2:23 pm

OData Extensions for Data Aggregation by Chris Webb.

Chris writes:

I was just reading the following blog post on the OASIS OData Technical Committee Call for Participation: http://www.odata.org/blog/2012/6/11/oasis-odata-technical-committee-call-for-participation

…when I saw this:

In addition to the core OData version 3.0 protocol found here, the Technical Committee will be defining some key extensions in the first version of the OASIS Standard:

OData Extensions for Data Aggregation – Business Intelligence provides the ability to get the right set of aggregated results from large data warehouses. OData Extensions for Analytics enable OData to support Business Intelligence by allowing services to model data analytic “cubes” (dimensions, hierarchies, measures) and consumers to query aggregated data

Follow the link in the quoted text – it’s very interesting reading! Here’s just one juicy quote:

You have to go to Chris’ post to see the “juicy quote.” 😉

With more data becoming available, at higher speeds, data aggregation is going to be the norm.

Some people will do it well. Some people will do it not so well.

Which one will describe you?

Participation in the OData TC at OASIS may help shape that answer: http://www.odata.org/blog/2012/6/11/oasis-odata-technical-committee-call-for-participation

First meeting details:

The first meeting of the Technical Committee will be a face-to-face meeting to be held in Redmond, Washington on July 26-27, 2012 from 9 AM PT to 5 PM PT. This meeting will be sponsored by Microsoft. Dial-in conference calling bridge numbers will be available for those unable to attend in person.

At least the meeting is in a Thursday/Friday slot! Any comments on the weather to expect in late July?

Hands-on with Google Docs’s new research tool [UI Idea?]

Filed under: Authoring Topic Maps,Interface Research/Design — Patrick Durusau @ 1:54 pm

Hands-on with Google Docs’s new research tool by Joel Mathis, Macworld.com.

From the post:

Google Docs has unveiled a new research tool meant to help writers streamline their browser-based research, making it easier for them to find and cite the information they need while composing text.

The feature, announced Tuesday, appears as an in-page vertical pane on the right side of your Google Doc. (You can see an example of the pane at left.) It can be accessed either through the page’s Tools menu, or with a Command-Option-R keyboard shortcut on your Mac.

The tool offers three types of searches: A basic “everything” search, another just for images, and a third featuring quotes about—or by—the subject of your search.

In “everything” mode, a search for GOP presidential candidate Mitt Romney brought up a column of images and information. At the top of the column, a scrollable set of thumbnail pictures of the man, followed by some basic dossier information—birthday, hometown, and religion—followed by a quote from Romney, taken from an ABC News story that had appeared within the last hour.

The top Web links for a topic are displayed underneath that roster of information. You’re given three options with the links: First, you can “preview” the linked page within the Google Docs page—though you’ll have to open a new tab if you want to conduct a more thorough perusal of the pertinent info. The second option is to create a link to that page directly from the text you’re writing. The third is to create a footnote in the text that cites the link.

Interfaces are forced to make assumptions about the “average” user and their needs. This one sounds like it comes close to, or even hits, needs that are fairly common.

Makes me wonder if topic map authoring interfaces should place more emphasis on incorporation of content and authoring, with correspondingly less emphasis on the topic mappishness of the result.

Perhaps cleaning up a map is something that should be a separate task anyway.

Authors write and editors edit.

Is there some reason to combine those two tasks?

(I first saw this at Research Made Easy With Google Docs by Stephen Arnold.)

Deep Learning Tutorials

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 1:29 pm

Deep Learning Tutorials

From the main page:

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms.

Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

For more about deep learning algorithms, see for example:

The tutorials presented here will introduce you to some of the most important deep learning algorithms and will also show you how to run them using Theano. Theano is a python library that makes writing deep learning models easy, and gives the option of training them on a GPU.

The algorithm tutorials have some prerequisites. You should know some python, and be familiar with numpy. Since this tutorial is about using Theano, you should read over the Theano basic tutorial first. Once you’ve done that, read through our Getting Started chapter — it introduces the notation, and [downloadable] datasets used in the algorithm tutorials, and the way we do optimization by stochastic gradient descent.

The tutorial materials reflect the content of Yoshua Bengio’s Learning Algorithms (ITF6266) course.

These tutorials are part of the resources you will find at Deep Learning … moving beyond shallow machine learning since 2006! There is a break between 2010 and 2012, with a few entries, such as in the blog, dated 2012. There has been a considerable amount of work in the meantime, so you might want to contribute to the site.
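
If you want a taste before committing to the tutorials, the flavor of Theano they use looks roughly like this: a minimal logistic-regression training function in the spirit of the first tutorial chapter, with the sizes and learning rate as my own placeholders:

```python
import numpy
import theano
import theano.tensor as T

x = T.matrix("x")            # minibatch of inputs, one row per example
y = T.ivector("y")           # integer class labels

n_in, n_out = 784, 10        # e.g. MNIST-sized placeholders
W = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX), name="W")
b = theano.shared(numpy.zeros((n_out,), dtype=theano.config.floatX), name="b")

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])   # negative log-likelihood

g_W, g_b = T.grad(cost, [W, b])
learning_rate = 0.13
train_model = theano.function(
    inputs=[x, y],
    outputs=cost,
    updates=[(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)],
)
# each call to train_model(batch_x, batch_y) is one step of stochastic gradient descent
```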

mSDA: A fast and easy-to-use way to improve bag-of-words features

Filed under: Bag-of-Words (BOW),mSDA,SDA,Sentiment Analysis — Patrick Durusau @ 9:50 am

mSDA: A fast and easy-to-use way to improve bag-of-words features by Kilian Weinberger.

From the description:

Machine learning algorithms rely heavily on the representation of the data they are presented with. In particular, text documents (and often images) are traditionally expressed as bag-of-words feature vectors (e.g. as tf-idf). Recently Glorot et al. showed that stacked denoising autoencoders (SDA), a deep learning algorithm, can learn representations that are far superior over variants of bag-of-words. Unfortunately, training SDAs often requires a prohibitive amount of computation time and is non-trivial for non-experts. In this work, we show that with a few modifications of the SDA model, we can relax the optimization over the hidden weights into convex optimization problems with closed form solutions. Further, we show that the expected value of the hidden weights after infinitely many training iterations can also be computed in closed form. The resulting transformation (which we call marginalized-SDA) can be computed in no more than 20 lines of straight-forward Matlab code and requires no prior expertise in machine learning. The representations learned with mSDA behave similar to those obtained with SDA, but the training time is reduced by several orders of magnitudes. For example, mSDA matches the world-record on the Amazon transfer learning benchmark, however the training time shrinks from several days to a few minutes.

The Glorot et al. reference is to: Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach by Xavier Glorot, Antoine Bordes, and Yoshua Bengio, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Superficial searching reveals this to be a very active area of research.

I rather like the idea of training being reduced from days to minutes.
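
For the curious, here is my reading of the paper’s closed form for a single marginalized denoising layer, in NumPy rather than the authors’ Matlab. The ridge constant and the stacking loop are my own additions, so consult the paper and its code for the exact recipe:

```python
import numpy as np

def mda_layer(X, p):
    """One marginalized denoising layer (a sketch of the closed form described in the paper).

    X : d x n data matrix (features x examples), p : feature corruption probability.
    Returns the d x (d+1) mapping W and the nonlinear hidden representation tanh(W [X; 1]).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])            # append a bias feature
    q = np.full((d + 1, 1), 1.0 - p)
    q[-1] = 1.0                                     # the bias is never corrupted
    S = Xb.dot(Xb.T)                                # scatter matrix of the clean data
    Q = S * q.dot(q.T)                              # expected corrupted x corrupted scatter
    np.fill_diagonal(Q, q.ravel() * np.diag(S))     # diagonal uses q_i, not q_i^2
    P = S[:d, :] * q.T                              # expected clean x corrupted scatter
    W = np.linalg.solve(Q + 1e-5 * np.eye(d + 1), P.T).T   # W = P Q^{-1}, small ridge added
    return W, np.tanh(W.dot(Xb))

# stacking (the "S" in mSDA): feed each layer's hidden output into the next layer
X = np.random.randn(50, 200)
W1, H1 = mda_layer(X, p=0.5)
W2, H2 = mda_layer(H1, p=0.5)
```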

Mozilla Ignite [Challenge – $15,000]

Filed under: Challenges,Data Integration,Data Mining,Filters,Topic Maps — Patrick Durusau @ 8:21 am

Mozilla Ignite

From the webpage:

Calling all developers, network engineers and community catalysts. Mozilla and the National Science Foundation (NSF) invite designers, developers and everyday people to brainstorm and build applications for the faster, smarter Internet of the future. The goal: create apps that take advantage of next-generation networks up to 250 times faster than today, in areas that benefit the public — like education, healthcare, transportation, manufacturing, public safety and clean energy.

Designing for the internet of the future

The challenge begins with a “Brainstorming Round” where anyone can submit and discuss ideas. The best ideas will receive funding and support to become a reality. Later rounds will focus specifically on application design and development. All are welcome to participate in the brainstorming round.

BRAINSTORM

What would you do with 1 Gbps? What apps would you create for deeply programmable networks 250x faster than today? Now through August 23rd, let’s brainstorm. $15,000 in prizes.

The challenge is focused specifically on creating public benefit in the U.S. The deadline for idea submissions is August 23, 2012.

Here is the entry website.

I assume the 1Gbps is actual and not as measured by the marketing department of the local cable company. 😉

That would have to be from a source that can push 1 Gbps to you, and you would have to be capable of handling it. (Upstream limitations being what choke my local speed down.)

I went looking for an example of what that would mean and came up with: “…[you] can download 23 episodes of 30 Rock in less than two minutes.”

On the whole, I would rather not.
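
The arithmetic roughly checks out, assuming something like 650 MB per standard-definition episode: \( 1\,\text{Gbps} \approx 125\,\text{MB/s} \), so two minutes moves \( 125 \times 120 = 15{,}000\,\text{MB} \), and \( 23 \times 650\,\text{MB} \approx 14{,}950\,\text{MB} \).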

What other uses would you suggest for 1Gbps network speeds?

Assuming you have the capacity to push back at the same speed, I wonder what that means in terms of querying/viewing data as a topic map?

Transformation to a topic map for only a subset of data?

Looking forward to seeing your entries!

Musical Spheres Anyone?

Filed under: Music,Sound — Patrick Durusau @ 4:23 am

Making Music With Real Stars: Kepler Telescope Star Data Creates Musical Melody reports on the creation of music from astronomical data.

By itself an amusing curiosity but in the larger context of data exploration, perhaps something more.

I would have trouble carrying a tune in a sack but we shouldn’t evaluate data exploration techniques based solely on our personal capabilities. Any more than colors should be ignored in visualization because some researchers are color blind.

A starting place for conversations about sonification would be the Georgia Tech Sonification Lab.

Or you can download the Sonification Sandbox.

BTW, question for music librarians/researchers:

Is there an autocomplete feature for music searches? Where a user can type in the first few notes and is offered a list of continuations?

June 14, 2012

Neo4j 1.8 Milestone 4

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:02 pm

Neo4j 1.8 Milestone 4

From the post:

Neo4j 1.8 Milestone 4 is available today, offering a few new ways to help you find happy paths. To query a graph you use a traversal, which identifies paths of nodes and relationships. This release updates the capabilities of Neo4j’s core Traversal Framework and introduces new ways to use paths in Cypher.

Graph Sherpa Mattias Persson

Mattias Persson works throughout the Neo4j code base, but is particularly well acquainted with the Traversal Framework, a core component of the Neo4j landscape. He’s agreed to guide us on a traversal tour:

Why My Soap Film is Better than Your Hadoop Cluster

Filed under: Algorithms,Hadoop,Humor — Patrick Durusau @ 6:56 pm

Why My Soap Film is Better than Your Hadoop Cluster

From the post:

The ever amazing slime mold is not the only way to solve complex compute problems without performing calculations. There is another: soap film. Unfortunately for soap film it isn’t nearly as photogenic as slime mold, all we get are boring looking pictures, but the underlying idea is still fascinating and ten times less spooky.

As a quick introduction we’ll lean on Long Ouyang, who has a really straightforward explanation of how soap film works in Approaching P=NP: Can Soap Bubbles Solve The Steiner Tree Problem In Polynomial.

And no, this isn’t what I am writing about on Hadoop for next Monday. 😉

I point this out partially for humor.

But considering unconventional computational methods may give you ideas about more conventional things to try.

tm2o

Filed under: Odata,Topic Maps — Patrick Durusau @ 6:44 pm

TM2O – Topic Maps 2 OData Transformation

From the webpage:

TM2O is a generic tool to provide information stored within a topic map as an OData service. This OData service uses majortom as a local topic maps backend or a remote majortom server. The implementation is based on odata4j and heavily uses the TMQL engine TMQL4J.

Getting started

We have prepared a getting started guide for you. Here you will learn to set up the TM2O service, and to create and use OData services out of your topic maps.

Stop me if I get off into the ditch but as interesting as this sounds, isn’t that going in the wrong direction?

Going in the opposite direction, an OData -> TM tool that views content served as OData as topics, associations, etc., would be more interesting. You can hang more information off subjects and relationships in a topic map, which is a form of value-add.

If nothing else, just think about the amount of content available as topic maps versus available as OData.

Do I need to say more?
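
To sketch what the opposite direction could look like (purely hypothetical names and feed layout, not TM2O and not any existing tool): OData entities become topics, their properties become occurrences, and navigation links become associations.

```python
def odata_to_topic_map(entities):
    """Hypothetical OData -> topic map sketch: entities become topics, properties become
    occurrences, navigation links become associations. Not TM2O; just the direction argued above."""
    topics, associations = {}, []
    for entity in entities:
        topic_id = entity["id"]
        topics[topic_id] = {
            "subject_identifiers": [entity["id"]],
            "occurrences": dict(entity.get("properties", {})),
        }
        for nav_name, target in entity.get("links", {}).items():
            associations.append({"type": nav_name,
                                 "roles": {"source": topic_id, "target": target}})
    return topics, associations

# toy feed entries with made-up URIs
entities = [
    {"id": "http://example.org/Products(1)",
     "properties": {"Name": "Bicycle", "Price": 120},
     "links": {"Category": "http://example.org/Categories(7)"}},
    {"id": "http://example.org/Categories(7)",
     "properties": {"Name": "Sporting Goods"}},
]
topics, assocs = odata_to_topic_map(entities)
print(len(topics), "topics,", len(assocs), "associations")
```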

