Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 22, 2011

LexisNexis Open-Sources its Hadoop Alternative

Filed under: Hadoop,HPCC — Patrick Durusau @ 6:16 pm

LexisNexis Open-Sources its Hadoop Alternative

Ryan Rosario writes:

A month ago, I wrote about alternatives to the Hadoop MapReduce platform, and HPCC was included in that article. For more information, see here.

LexisNexis has open-sourced its alternative to Hadoop, called the High Performance Computing Cluster (HPCC). The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:

  • Thor (Thor Data Refinery Cluster) is the data processing framework. It “crunches, analyzes and indexes huge amounts of data a la Hadoop.”
  • Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

The protocol that drives the whole process is the Enterprise Control Language (ECL), which is said to be faster and more efficient than Hadoop’s version of MapReduce. A picture is a much better way to show how the system works; see the diagram in the Gigaom article from which most of this information originates.

Interesting times ahead.

Explore large image collections with ImagePlot

Filed under: Image Understanding — Patrick Durusau @ 6:15 pm

Explore large image collections with ImagePlot from Flowing Data.

From the post:

When we make charts and graphs, we usually think of the data abstractions in terms of bars, dots, and other geometric shapes. ImagePlot, from UCSD-based Software Studies, instead makes it easier to use images to understand large collections.

Existing visualization tools show data as points, lines, and bars. ImagePlot’s visualizations show the actual images in your collection. The images can be scaled to any size and organized in any order – according to their dates, content, visual characteristics, etc. Because digital video is just a set of individual still images, you can also use ImagePlot to explore patterns in films, animations, video games, and any other moving image data.

You can do this with other software (like R, for example), but ImagePlot is specifically built to handle lots of images (in the millions), so it is much more robust. It is also GUI-based, so no programming is required, and it works on Windows, OS X, and Linux. The interface is pretty basic and not totally clear at first, but play around with the sample datasets and you should be able to pick it up fairly quickly.

The example I found particularly interesting was plotting Van Gogh paintings by date on one axis and color on another.

A great deal of potential for exploring image collections from a variety of sources.

Malmgren: Towards a Theory of Jurisprudential Relevance Ranking – Using Link Analysis on EU Case Law

Filed under: Law - Sources,Legal Informatics,Relevance — Patrick Durusau @ 6:13 pm

Malmgren: Towards a Theory of Jurisprudential Relevance Ranking – Using Link Analysis on EU Case Law

From the post:

Staffan Malmgren of Stockholm University and the free access to law service of Sweden, lagen.nu, has posted his Master’s thesis, Towards a Theory of Jurisprudential Relevance Ranking – Using Link Analysis on EU Case Law (2011).

Staffan is going to be posting his thesis a chapter at a time to solicit feedback on it.

Any takers?

Slides and replay from “R and Hadoop” webinar

Filed under: Hadoop,R — Patrick Durusau @ 6:10 pm

Slides and replay from “R and Hadoop” webinar

From the post:

So … there’s clearly a lot of interest in integrating R and Hadoop. Today’s webinar was a record-setter for Revolution Analytics, with more than 1000 people signing up to learn how to access Hadoop data from R with the packages from the open-source RHadoop project. If you didn’t catch the live webinar, don’t fret: the slides and replay are available for download, and you can learn more about the RHadoop packages in the white paper from CTO David Champagne, “Advanced ‘Big Data’ Analytics with R and Hadoop“.

I don’t know what the average numbers are for webinars but I suspect they are below 1,000. Way below 1,000. Does anyone have those numbers handy?

September 21, 2011

NoSQL, The Web And The Enterprise

Filed under: BigData,Enterprise Integration,Neo4j — Patrick Durusau @ 7:18 pm

NoSQL, The Web And The Enterprise

Emil Eifrem, the CEO of Neo Technology and co-founder of the Neo4j project, waxes eloquent after Neo Technology raises $10M+. (I can wax too but it would probably be Emil’s car.)

I can’t believe it’s already been two years since we raised our seed round! Oct 2009 saw a nascent NOSQL movement and Neo Technology as a two-man band in Malmö, Sweden. Today, NOSQL is exploding and Neo has grown to a 25-person orchestra across two continents and five countries.

During these two years we've released Neo4j 1.0 (after 10 years of development!), coined the NOSQL = Not Only SQL expansion at NoSQL East, heard the CTO of Amazon proclaim that “Neo4j absolutely ROCKS,” watched Facebook tell the world that it’s all about graphs, co-founded the Spring Data project to provide excellent support for NOSQL in the world’s most popular Enterprise Java middleware, changed our open source licensing to enable graph database ubiquity, made several kickass releases and started putting graph databases in the cloud.

But all of that is dwarfed by the amazing things our customers and community have done with Neo4j! Neo4j downloads grew by 10x last year and this year our growth has accelerated even more. Neo4j is clearly taking off.

Read the three things that stand out for enterprise users.

Here are three that I think could carry Neo4j into the future:

  1. “You like tomato and I like tomahto:” query using whatever identifier the user has for a subject, and get back all the information in the enterprise (or beyond?) about it.
  2. Deduplication of findings: how many lawyers, on average, at $X per hour, find the same document in your files? (Or insert staff, etc.; each repeated finding has a cost.)
  3. Capturing Serendipity: You accidentally find a useful (critical?) document. Will you be able to find it again?

BTW, congratulations to Neo Technology on its fundraising success!

Build an MLM Engine with Neo4j and MassPay, Part I

Filed under: Neo4j,News,Topic Maps — Patrick Durusau @ 7:17 pm

Build an MLM Engine with Neo4j and MassPay, Part I by John Wheeler.

From the post:

I. What is multi-level marketing?

Multi-level marketing (MLM), or network selling, is a strategy for maximizing sales through a network of distributors. Distributors are paid commissions for personal sales and the sales of others they recruit. A hierarchy forms where new recruits are placed under the distributors who recruit them, and commissions are paid several levels up. Consequently, the earlier a person joins, the larger his or her downline and potential commission will be.

MLM strategies come in different shapes and sizes and vary in sophistication. We implement one called the unilevel plan, which is easy to understand. Basically, distributors recruit as many as they can into their frontlines, and frontlines recruit as many as they can into second-lines, and so on. Further examples of MLM strategies are the binary plan, in which frontlines can have no more than two distributors apiece, and the forced matrix plan, which stipulates a maximum downline width and depth, for example 3×9. These two plans usually involve membership fees paid upline as new recruits spill over into leaf positions.

Unless you have been living under a rock all your life, you have probably heard of or encountered “multi-level marketing.” Some that come to mind (feel free to contribute others) are Amway, Avon, Electrolux, Tupperware. You may be interested in the Wikipedia article Multi-level marketing.

Multi-level marketing is a subject of some controversy but I mention this post because it is a good illustration of using Neo4j.
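
To make the illustration concrete, here is a minimal sketch of a unilevel downline in Neo4j’s 1.x embedded API, driven from Scala. Everything here is hypothetical — the names, the store path, and especially the commission rule, which I made up for the example:

    import org.neo4j.kernel.EmbeddedGraphDatabase
    import org.neo4j.graphdb.{Direction, DynamicRelationshipType, Node}

    object MlmSketch extends App {
      val db = new EmbeddedGraphDatabase("target/mlm-db") // hypothetical store path
      val RECRUITED = DynamicRelationshipType.withName("RECRUITED")

      val tx = db.beginTx()
      try {
        // A three-level downline: Alice recruits Bob, Bob recruits Carol.
        val alice = db.createNode(); alice.setProperty("name", "Alice")
        val bob   = db.createNode(); bob.setProperty("name", "Bob")
        val carol = db.createNode(); carol.setProperty("name", "Carol")
        alice.createRelationshipTo(bob, RECRUITED)
        bob.createRelationshipTo(carol, RECRUITED)

        // Walk the RECRUITED edges upward, paying a (made-up) commission
        // that halves at each level.
        def payUpline(n: Node, sale: Double, rate: Double, levels: Int) {
          if (levels > 0) {
            val rel = n.getSingleRelationship(RECRUITED, Direction.INCOMING)
            if (rel != null) {
              val sponsor = rel.getStartNode
              println(sponsor.getProperty("name") + " earns " + sale * rate)
              payUpline(sponsor, sale, rate / 2, levels - 1)
            }
          }
        }
        payUpline(carol, 100.0, 0.10, 3) // Carol's sale pays Bob, then Alice
        tx.success()
      } finally {
        tx.finish()
      }
      db.shutdown()
    }

The graph is the point: the upline traversal is one relationship lookup per level, no join tables required.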

Just out of curiosity, I wonder how you could make data mining a multi-level marketing activity with a topic map? That is to say, information researchers could recruit other researchers, with results rolling up the line to an organization that has the contacts to sell information either in pieces or in bulk. Like a news feed but curated and linked to other data.

It is one thing to have a report of Princess Y with some X but quite another to have that report with identification of X (say from a photo) along with background information, etc., delivered as a package.

I suppose news bloggers add information but it isn’t packaged. My impression is that you have to winnow a lot of chaff and repetition. I have very little interest in chaff and repetition. How about you?

Scala School!

Filed under: Scala — Patrick Durusau @ 7:09 pm

Scala School!

From the webpage:

About

Scala school was started as a series of lectures at Twitter to prepare experienced engineers to be productive Scala programmers. Being a relatively new language, but also one that draws on many familiar concepts, we found this an effective way of getting new engineers up to speed quickly. This is the written material that accompanied those lectures. We have found that these are useful in their own right.

Approach

We think it makes the most sense to approach teaching Scala not as if it’s an improved Java but as a new language. Experience in Java is not expected. Focus will be around the interpreter and the object-functional style as well as the style of programming we do here. An emphasis will be placed on maintainability, clarity of expression, and leveraging the type system.

Most of the lessons require no software other than a Scala REPL. The reader is encouraged to follow along, and go further! Use these lessons as a starting point to explore the language.

Excellent!
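
In that spirit, a small taste of the object-functional style the lessons teach — immutable data plus higher-order functions, all of it pasteable into the REPL (the data is invented):

    // An immutable value type.
    case class Lesson(title: String, minutes: Int)

    val lessons = List(
      Lesson("Basics", 50),
      Lesson("Collections", 40),
      Lesson("Concurrency", 60))

    // Transform and aggregate without mutation.
    val totalMinutes = lessons.map(_.minutes).sum
    val shortOnes    = lessons.filter(_.minutes < 55).map(_.title)

    println(totalMinutes) // 150
    println(shortOnes)    // List(Basics, Collections)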

XQuery Survey

Filed under: Query Language,XQuery — Patrick Durusau @ 7:09 pm

XQuery Survey

From the webpage:

I am looking for feedback on the XQuery programming language. Please answer all questions as completely as possible. This poll and other information is forming the basis of a talk I am giving at GOTO 2011 Aarhus, Denmark (http://lanyrd.com/2011/gotocon-aarhus/shqhc/). I will share all results at the end of October 2011.

Please help Jim Fuller out with his survey on XQuery!

Cassandra Write Performance – A quick look inside

Filed under: Cassandra,NoSQL,Software — Patrick Durusau @ 7:09 pm

Cassandra Write Performance – A quick look inside

From the post:

I was looking at Cassandra, one of the major NoSQL solutions, and I was immediately impressed with its write speed even on my notebook. But I also noticed that it was very volatile in its response time, so I took a deeper look at it.

Michael Kopp uses dynaTrace to look inside Cassandra. Lots of information along the way, and hopefully his conclusion will make you read this post and those he promises to follow.

Conclusion

NoSQL or BigData Solutions are very very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and most importantly how it is used! Although Cassandra is lightning fast and mostly I/O bound it’s still Java and you have the usual problems – e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn’t explain here, but seeing the flow end-to-end really helps to understand whether the time is spent on the client, network or server and makes the runtime dynamics of Cassandra much clearer.

Understanding is really the key for effective usage of NoSQL solutions as we shall see in my next blogs. New problem patterns emerge and they cannot be solved by simply adding an index here or there. It really requires you to understand the usage pattern from the application point of view. The good news is that these new solutions allow us a really deep look into their inner workings, at least if you have the right tools at hand.

What tools are you using to “look inside” your topic map engine?

Best Practices…Columnar Databases

Filed under: Column-Oriented,Columnar Database,InfiniDB — Patrick Durusau @ 7:09 pm

“Best Practices in the Use of Columnar Databases: How to select the workloads for columnar databases based on the benefits” by William McKnight. (pdf)

Focuses on Calpont’s InfiniDB.

It is a nice summary of the principles of columnar databases.

Also has amusing observations such as:

MapReduce is a method of parallel reduction of tasks; a 25-year-old idea that came out of the Lisp programming language. There are popular implementations of the framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.
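
The Lisp lineage is easy to see: map and reduce are just ordinary higher-order functions in any functional language. A toy word count in Scala, only to make the observation concrete:

    val docs = List("to be or not to be", "be here now")

    // Map phase: emit (word, 1) pairs. Reduce phase: sum the counts per word.
    val counts = docs
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }

    println(counts) // e.g. Map(be -> 3, to -> 2, or -> 1, not -> 1, here -> 1, now -> 1)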

It does make me curious about the use of columnar store databases for particular situations.

Read the whitepaper and see what you think. Comments welcome!

What’s new in Cassandra 1.0: Compression

Filed under: Cassandra,NoSQL — Patrick Durusau @ 7:08 pm

What’s new in Cassandra 1.0: Compression

From the post:

Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.
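
If the feature works as described, enabling it per column family from cassandra-cli should look something like the following (a hedged sketch — the column family name, compressor, and chunk size are illustrative):

    create column family users
      with compression_options = {sstable_compression: SnappyCompressor,
                                  chunk_length_kb: 64};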

OK, maybe someone can help me here.

Cassandra, an Apache project, just released version 0.8.6. Here are the release notes for 0.8.6.

As a standards editor I understand being optimistic about what is “…going to appear…” in a future release, but isn’t version 0.8.6 a little early to be touting features for 1.0? (I don’t find “compression” mentioned in the cumulative release notes as of 0.8.6.)

May just be me.

CITRIS – Center for Information Technology Research in the Interest of Society

Filed under: Biomedical,Environment,Funding,Health care,Information Retrieval — Patrick Durusau @ 7:08 pm

CITRIS – Center for Information Technology Research in the Interest of Society

The mission statement:

The Center for Information Technology Research in the Interest of Society (CITRIS) creates information technology solutions for many of our most pressing social, environmental, and health care problems.

CITRIS was created to “shorten the pipeline” between world-class laboratory research and the creation of start-ups, larger companies, and whole industries. CITRIS facilitates partnerships and collaborations among more than 300 faculty members and thousands of students from numerous departments at four University of California campuses (Berkeley, Davis, Merced, and Santa Cruz) with industrial researchers from over 60 corporations. Together the groups are thinking about information technology in ways it’s never been thought of before.

CITRIS works to find solutions to many of the concerns that face all of us today, from monitoring the environment and finding viable, sustainable energy alternatives to simplifying health care delivery and developing secure systems for electronic medical records and remote diagnosis, all of which will ultimately boost economic productivity. CITRIS represents a bold and exciting vision that leverages one of the top university systems in the world with highly successful corporate partners and government resources.

I mentioned CITRIS as an aside (News: Summarization and Visualization) yesterday but then decided it needed more attention.

Its grants are limited to the four University of California campuses mentioned above. Shades of EU funding restrictions. Location has a hand in the selection process.

Still, the projects funded by CITRIS could likely profit from the use of topic maps and as they say, a rising tide lifts all boats.

Neo4j and Scala hacking notes

Filed under: Neo4j,Scala — Patrick Durusau @ 7:08 pm

Neo4j and Scala hacking notes

From the post:

This week at FOSS4G, though it has nothing in particular to do with geospatial (…yet), I’ve started hacking around with the graph database Neo4j in Scala because I’m convinced both are the future. I’ve had almost no experience with either.

Dwins kindly held my hand through this process. He knows a hell of a lot about Scala and guided me through how some of the language features could help me work with the Neo4j API. In this post, I will try to describe the process and problems we ran into and parrot his explanations.

Very nice introduction to using Neo4j and Scala.

I am not sure if the lesson is to read documentation first or not. See what you think. 😉
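
One example of what “language features helping with the Neo4j API” can mean in practice: a loan-pattern helper, sketched against the Neo4j 1.x transaction API, that spares you the begin/success/finish boilerplate every time you touch the graph:

    import org.neo4j.graphdb.GraphDatabaseService

    // Run `body` inside a transaction, marking it successful if the body
    // returns normally and always releasing it, even on exceptions.
    def inTx[A](db: GraphDatabaseService)(body: => A): A = {
      val tx = db.beginTx()
      try {
        val result = body
        tx.success()
        result
      } finally {
        tx.finish()
      }
    }

    // Usage, given an open GraphDatabaseService:
    //   val node = inTx(db) { db.createNode() }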

MongoUK – September 2011

Filed under: MongoDB — Patrick Durusau @ 7:07 pm

MongoUK – September 2011

A full day with three (3) tracks on MongoDB.

Just some titles at random to awaken your interest: Indexes, What Indexes?, Scaling MongoDB for Real-Time Analytics, Intelligent Stream Filtering Using MongoDB, GeoSpatial Indexing. That’s 4 out of 25.

My suggestion is that you visit and find presentations relevant to your topic map interests. Enjoy!

BTW, another 20 or so presentations on MongoDB from the MongoUK event in March, 2011.

Using Machine Learning to Detect Malware Similarity

Filed under: Machine Learning,Malware,Similarity — Patrick Durusau @ 7:07 pm

Using Machine Learning to Detect Malware Similarity by Sagar Chaki.

From the post:

Malware, which is short for “malicious software,” consists of programming aimed at disrupting or denying operation, gathering private information without consent, gaining unauthorized access to system resources, and other inappropriate behavior. Malware infestation is of increasing concern to government and commercial organizations. For example, according to the Global Threat Report from Cisco Security Intelligence Operations, there were 287,298 “unique malware encounters” in June 2011, double the number of incidents that occurred in March. To help mitigate the threat of malware, researchers at the SEI are investigating the origin of executable software binaries that often take the form of malware. This posting augments a previous posting describing our research on using classification (a form of machine learning) to detect “provenance similarities” in binaries, which means that they have been compiled from similar source code (e.g., differing by only minor revisions) and with similar compilers (e.g., different versions of Microsoft Visual C++ or different levels of optimization).

Interesting study in the development of ways to identify a subject that is trying to hide. Not to mention some hard core disassembly and other techniques.

Dydra

Filed under: Dydra,RDF,SPARQL — Patrick Durusau @ 7:07 pm

Dydra

From What is Dydra?:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

From the QuickStart

Dydra is an RDF store meant to be quick and easy for developers. Getting started quickly will require already being familiar with RDF and SPARQL.
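
For readers who have not seen SPARQL, a query over a (hypothetical) social graph has this flavor — find everyone Alice knows:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?friend
    WHERE {
      <http://example.org/alice> foaf:knows ?friend .
    }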

OK, so yes a “graph database,” but in the sense of being an RDF store.

Under What is RDF? -> Overview, the site authors say:

The use of URIs allows multiple data sources to talk about the same entities using the same language.

Really? That must mean all the 303 stuff that no less than Tim Berners-Lee and others have been talking about is unnecessary. I understand that several years ago that was the W3C “position,” but leaving aside all my ranting, it isn’t quite the current position.

There is a fundamental ambiguity when an address is used as an identifier. Does it identify what you find at the location it specifies or is it simply an identifier and what is at the location is additional information about what the address has identified?

The prose is out of date or the authors have a seriously dated view of RDF. Either way, it doesn’t inspire a lot of confidence.


Online Master of Science in Predictive Analytics

Filed under: Computer Science,CS Lectures,Degree Program,Library,Prediction — Patrick Durusau @ 7:07 pm

Online Master of Science in Predictive Analytics

As businesses seek to maximize the value of vast new stores of available data, Northwestern University’s Master of Science in Predictive Analytics program prepares students to meet the growing demand in virtually every industry for data-driven leadership and problem solving.

Advanced data analysis, predictive modeling, computer-based data mining, and marketing, web, text, and risk analytics are just some of the areas of study offered in the program. As a student in the Master of Science in Predictive Analytics program, you will:

  • Prepare for leadership-level career opportunities by focusing on statistical concepts and practical application
  • Learn from distinguished Northwestern faculty and from the seasoned industry experts who are redefining how data improve decision-making and boost ROI
  • Build statistical and analytic expertise as well as the management and leadership skills necessary to implement high-level, data-driven decisions
  • Earn your Northwestern University master’s degree entirely online

Just so you know, library schools were offering mostly online degrees a decade or so ago. Nice to see other disciplines catching up. 😉

It would be interesting to see short courses in subject analysis, as in subject identity and the properties that compose a particular identity, in specific domains.

September 20, 2011

News: Summarization and Visualization

Filed under: News,Summarization,Visualization — Patrick Durusau @ 7:54 pm

News: Summarization and Visualization (CITRIS* i4Science Lecture Series) by Laurent El Ghaoui.

I don’t know that I agree with the point: “Yet we can’t do without news!”

As much noise as is in the news, I think I could read about how it comes out a year later and not have missed much. 😉

Do watch this lecture; it is very interesting in that it counts and visualizes words in ways you might not expect. Great way to explore text resources.

There is an argument against normalization for search purposes.

For extra credit: How would you test a search engine to see how normalization was affecting its results?

*CITRIS – Center for Information Technology Research in the Interest of Society

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Filed under: Analytics,Prediction,Subject Identity — Patrick Durusau @ 7:53 pm

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Webinar: Tuesday, October 18, 2011 10:00 AM – 11:00 AM PDT

Presented by Caroline Junkin, Director of Analytics Solutions for Predixion Software.

That’s about all the webinar form says, so I went looking for more information. 😉

Predixion Insight™ Video Library

From that page:

Predixion Software’s video library contains tutorials that explore the predictive analytics features currently available in Predixion Insight™, demonstrations that walk you through various applications for predictive analytics and Webinar Replays.

If subjects can include things that some people don’t think exist, then subjects can certainly include things we think may exist at some point in the future. And no doubt our references to them will change over time.

ElasticSearch 0.17.7 Released!

Filed under: ElasticSearch,NoSQL — Patrick Durusau @ 7:52 pm

ElasticSearch 0.17.7 Released!

From the post:

This release includes the usual list of bug fixes, and also includes an upgrade to Lucene 3.4.0 (fixes critical bugs, so make sure you upgrade), as well as improvements to the couchdb river (memory usage wise).

Release Notes

Estimating Memory and Storage for Lucene/Solr

Filed under: Lucene,Solr — Patrick Durusau @ 7:52 pm

Estimating Memory and Storage for Lucene/Solr

This is very cool!

Grant Ingersoll has put together an Excel spreadsheet to enable modeling of memory and disk space based on the formula in Lucene in Action (2nd ed.) with caveats for its use.

Starting a Search Application

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 7:52 pm

Starting a Search Application

A useful whitepaper by Marc Krellenstein, CTO at Lucid Imagination.

I am interested in your reaction to Marc’s listing of the use cases for full-text searching:

Full-text search is good at a variety of information requests that can be hard to satisfy with other technologies. These include:

  • Finding the most relevant information about a specific topic, or an answer to a particular question,
  • Locating a specific document or content item, and
  • Exploring information in a general area, or even browsing the collection of documents or other content as a whole (this is often supported by clustering; see below).

For my class, write a one-page “reaction” to each of these points (that’s 3 pages total), including what “other” technologies you might use.

For class discussion, it would be nice if you can offer an example of either full-text searching meeting the requests or “other” technologies meeting these requests.

Testing/exploring Marc’s “information requests”:

Two teams.

Team One has a set of the Great Books of the Western World and uses the Syntopicon to answer information requests.

Team Two has access to a full-text version of Great Books of the Western World to answer information requests.

The class, including the teams, creates questions that are sent to me privately, and I will prepare the final list of questions to be submitted to the teams. Questions are given to both teams at the same time and the first team with the correct answer (must have a citation in the Great Books) wins.

I am open to suggestions for prizes.

In the class following the contest, we will discuss why some questions were better for full-text and why some worked better with the Syntopicon. It will give you insight into the choices you will have to make when creating a topic map.

BTW, the requirements section of Marc’s paper will help you in designing any information system. If you don’t know what is expected and can’t test for it, you are unlikely to satisfy anyone’s needs.

Running Mahout in the Cloud using Apache Whirr

Filed under: Cloud Computing,Hadoop,Mahout — Patrick Durusau @ 7:51 pm

Running Mahout in the Cloud using Apache Whirr

From the post:

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

Running Mahout in the cloud with Apache Whirr will prepare you for using Whirr or similar tools to run services in the cloud.
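
For flavor, the shape of a Whirr cluster recipe from that era — a hedged sketch, since the provider id varies by Whirr release and the cluster name, instance counts, and credentials below are placeholders:

    # hadoop.properties (illustrative)
    whirr.cluster-name=mahout-demo
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

Then bin/whirr launch-cluster --config hadoop.properties brings the cluster up and bin/whirr destroy-cluster --config hadoop.properties tears it down once your Mahout jobs are done.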

September 19, 2011

Enrycher

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 7:56 pm

Enrycher

Interesting site but you have to dig for further information.

“Enrycher: Service Oriented Text Enrichment” is a paper that I located about the site.

From the introduction:

In our experience, many knowledge extraction scenarios generally consist of multiple steps, starting with natural language processing, which are in turn used in higher level annotations, either as entities or document-level annotations. This in turn yields a rather complex dependency scheme between separate components. Such complexity growth is a common scenario in general information systems development. Therefore, we decided to mitigate this by applying a service-oriented approach to integration of a knowledge extraction component stack. The motivation behind Enrycher[17] is to have a single web service endpoint that could perform several of these steps, which we refer to as ‘enrichments’, without requiring the user to bother with setting up pre-processing infrastructure himself.

Note the critical statement: “…without requiring the user to bother with setting up pre-processing infrastructure himself.”

The lower the bar to entry, the more participants you will have. What’s unclear about that?

The Joy of Indexing

Filed under: Indexing,MongoDB — Patrick Durusau @ 7:55 pm

The Joy of Indexing by Kyle Banker.

From the post:

We spend quite a lot of time at 10gen supporting MongoDB users. The questions we receive are truly legion but, as you might guess, they tend to overlap. We get frequent queries on sharding, replica sets, and the idiosyncrasies of JavaScript, but the one subject that never fails to appear each day on our mailing list is indexing.

Now, to be clear, I’m not talking about how to create an index. That’s easy. The trouble runs much deeper. It’s knowing how indexes work and having the intuition to create the best indexes for your queries and your data set. Lacking this intuition, your production database will eventually slow to a crawl, you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.

This need not be your fate. You can understand indexing! All that’s required is the right mental model, and over the course of this series, that’s just what I hope to provide.

But caveat emptor: what follows is a thought experiment. To get the most out of this post, you can’t skim it. Read every word. Use your imagination. Think through the quizzes. Do this, and your indexing struggles may soon be no more.

Very useful post and one that anyone starting to create indexes by automated means needs to read.

I’m curious how readers with a background in indexing feel about the description.

What would you instruct a reader to do differently if they were manually creating an index to this cookbook?
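
To make the payoff concrete, a minimal sketch using the era’s MongoDB Java driver from Scala — index the field your queries actually use, then let explain() confirm the index is being hit (assumes a local mongod and a hypothetical cookbook database):

    import com.mongodb.{Mongo, BasicDBObject}

    val mongo   = new Mongo("localhost", 27017)
    val recipes = mongo.getDB("cookbook").getCollection("recipes")

    // Index the field the queries filter on.
    recipes.ensureIndex(new BasicDBObject("ingredient", 1))

    // explain() reports which cursor the query used;
    // a BtreeCursor here means the index did its job.
    val plan = recipes.find(new BasicDBObject("ingredient", "saffron")).explain()
    println(plan.get("cursor"))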

Amazed by neo4j, gwt and my apache tomcat webserver

Filed under: Music Retrieval,Neo4j — Patrick Durusau @ 7:55 pm

Amazed by neo4j, gwt and my apache tomcat webserver

From the post:

Besides reading papers I am currently implementing the infrastructure of my social news stream for the new metalcon version. For the very first time I was really using neo4j on a remote webserver in a real web application built on gwt. This combined the advantages of all these technologies and our new fast server! After seeing the results I am so excited I almost couldn’t sleep last night!

Setting

I selected a very small bipartite subgraph of metalcon, which means just the fans and bands together with the fanship relation between them. This graph consists of 12’198 nodes (6’870 bands and 5’328 users) and 119’379 edges.

Results

  • For every user I displayed all the favourite bands
  • for each of those band I calculated similar bands (on the fly while page request!)
  • this was done by breadth first search (depth 2) and counting nodes on the fly

A page load for a random user with 56 favourite bands ends up in a traversal of 555’372. Together with sending the result via GWT over the web this was done in about 0.9 seconds!
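
The counting trick is simple to state. A toy Scala version of the same idea on an in-memory bipartite graph (the data is invented): hop from a band to its fans, then out to the other bands those fans like, counting arrivals — the bands reached most often share the most fans:

    val fansOfBand: Map[String, Set[String]] = Map(
      "Opeth"     -> Set("u1", "u2", "u3"),
      "Katatonia" -> Set("u2", "u3"),
      "Abba"      -> Set("u3"))
    val bandsOfFan: Map[String, Set[String]] = Map(
      "u1" -> Set("Opeth"),
      "u2" -> Set("Opeth", "Katatonia"),
      "u3" -> Set("Opeth", "Katatonia", "Abba"))

    // Depth-2 breadth-first walk: band -> fans -> bands, with counting.
    def similarBands(band: String): List[(String, Int)] =
      fansOfBand(band).toList
        .flatMap(fan => bandsOfFan(fan))
        .filter(_ != band)
        .groupBy(identity)
        .map { case (b, hits) => (b, hits.size) }
        .toList
        .sortBy(-_._2)

    println(similarBands("Opeth")) // List((Katatonia,2), (Abba,1))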

See the post to see how MySQL fared.

And yes, I thought about you, Mary Jane, when I saw this post!

Recommender Systems

Filed under: Recommendation,Similarity — Patrick Durusau @ 7:55 pm

Recommender Systems

This website provides support for “Recommender Systems: An Introduction” and “Recommender Systems Handbook.”

Recommender systems are an important area of research for topic maps because recommendation of necessity involves recognition (or attempted recognition) of subjects similar to an example subject. That recommendation may be captured in relationship to a particular set of user characteristics or it can be used as the basis for identifying a subject.

The site offers pointers to very strong teaching materials (as of 19 September 2011):

  • Slides
  • Tutorials
  • Courses

If you want to contribute teaching materials, please contact dietmar.jannach (at) udo.edu.

SDD Contest!

Filed under: Humor,Marketing — Patrick Durusau @ 7:54 pm

As a follow-up to my posting about SDD systems yesterday, I wanted to uncover some more material on the subject.

It occurred to me to search a well known computer science publisher’s site using just the acronym, “SDD.”

Here are the results from the first ten (10) hits (in no particular order):

  • semi-discrete matrix decomposition (SDD) method
  • soft decision-directed (SDD) adaptation
  • self-organizing link layer protocol (SDD)
  • Hierarchical Set Decision Diagrams (SDD)
  • SDD (strictly diagonally dominant)
  • secure Directed Diffusion protocol (SDD)
  • structured dialogic design
  • strong disjunctive database
  • solid-state-storage-device
  • storytest-driven development

Don’t bother counting, it’s ten (10) out of ten (10).

A better search engine would have computed a dissimilarity for terms and used that to separate terms that are likely to not have the same meaning. It should then group those “dissimilar” terms and say to the user: Your term(s) may have more than one meaning. We have created probable meanings for you to use as filters. (Then display snippets like those above to the user.)
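
As a toy version of “computing a dissimilarity for terms”: treat each hit’s snippet as a bag of words and use one minus the Jaccard overlap, so snippets about different senses of SDD land far apart. A sketch only — real engines use far better models:

    // Dissimilarity of two snippets: 1 - |A ∩ B| / |A ∪ B| over their
    // word sets. 0.0 means identical vocabulary, 1.0 means disjoint.
    def words(s: String): Set[String] =
      s.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSet

    def dissimilarity(a: String, b: String): Double = {
      val (wa, wb) = (words(a), words(b))
      1.0 - wa.intersect(wb).size.toDouble / wa.union(wb).size
    }

    val hits = List(
      "semi-discrete matrix decomposition (SDD) method",
      "soft decision-directed (SDD) adaptation")

    println(dissimilarity(hits(0), hits(1))) // near 1.0: probably different senses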

That would avoid my having to sort through 236 “hits” as of today for “SDD.”

True, that is a very poor search term but if we can boost performance for the edge cases, think of what we will do for the more mainstream searches.

Oh, sorry, almost forgot the contest part! Please contribute other expansions for the acronym SDD (non-obscene expansions only). No prizes, I am just curious about the number of unique expansions. It does make a good example of semantic ambiguity.

On a more serious note, a search interface that enabled readers to disambiguate content by choosing from a listing of terms would, over time, improve its search offerings to users. One can imagine professors having their graduate students disambiguate their articles for the same reason people write HTML pages: they want their content found by others.

Philosophy of Language, Logic, and Linguistics – The Very Basics

Filed under: Language,Linguistics,Logic — Patrick Durusau @ 7:53 pm

Philosophy of Language, Logic, and Linguistics – The Very Basics

Any starting reading list in this area is going to have its proponents and its detractors.

I would add to this list John Sowa’s “Knowledge Representation: Logical, Philosophical, and Computational Foundations”. There is a website as well. Just be aware that John’s reading of Charles Peirce gives him a unique view of the development of logic in the 20th century. Still an excellent bibliography of materials for reading. And, as always, you should read the original texts for yourself. You may reach different conclusions from those reported by others.

Future Integration Needs: Embracing Complex Data

Filed under: Data Integration — Patrick Durusau @ 7:53 pm

Future Integration Needs: Embracing Complex Data is a report from the Aberdeen Group that I found posted at the Informatica website.

It is the sort of white paper that you can leave with executives so they can evaluate the costs of not integrating their data streams.

Two points for your amusement:

First, data integration isn’t a new topic, nor did someone wake up last week and realize that data integration could lead to all the benefits extolled in this white paper. I suspect the advantages of integrated data systems have been touted to businesses for as long as data systems, manual or otherwise, have existed.

The question the white paper does not answer (or even raise) is why do data integration issues persist? Just in the digital age, decades have been spent pointing the problem out and proposing solutions. A white paper that answered that question might help find solutions.

As it is, the white paper says “if you had a solution to this problem, for which we don’t know the cause, you would be better off.” No doubt but not very comforting.

BTW, in case you didn’t notice, the “n = 122” you keep seeing in the article means the sweeping claims are made on the basis of 122 respondents to a survey. It doesn’t say whether it was one of those call-you-during-dinner phone surveys or not.

The second point to notice is that the conclusion of the paper is that you need a single product to use for data integration. Gee, I wonder where you would find software like that! 😉

I am sure the Informatica software is quite capable, but my concern remains: how do we transition from one software/format to another? Legacy formats and even code have proven to be more persistent than anyone imagined. Software/formats don’t so much migrate as expand to fill the increasing amount of digital data.

Now that would be an interesting metric to ask the “digital universe is expanding” crowd. How many formats are coming online to represent the expanding amount of data? And where are we going to get the maps to move from one to another?
