Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 24, 2011

Leaflet & GeoCommons JSON

Filed under: Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 3:50 pm

Leaflet & GeoCommons JSON by Tim Waters.

From the post:

Hi, in this quick tutorial we will have a look at a new JavaScript mapping library, Leaflet using it to help load JSON features from a GeoCommons dataset. We will add our Acetate tile layer to the map, and use the cool API feature filtering functionalities to get just the features we want from the server, show them on a Leaflet map, add popups to the features, style the features according to what the feature is, and add some further interactivity. This blog follows up from two posts on my personal blog, showing GeoCommons features with OpenLayers and with Polymaps.

We have all read about tweets being used to plot reports or locations from and about the various “occupy” movements. I suspect that effective civil unrest is going to require greater planning for the distribution of support and resources in particular locales. Conveniently, current authorities have created, or allowed to be created, maps and other resources that can be used for such purposes. This is one of those resources.

I don’t know of any research on such algorithms, but occupiers might want to search for clusters of dense and confusing paths in urban areas. Such clusters proved effective at times in medieval struggles for control of walled cities: once the walls were breached, would-be occupiers were confronted with warrens of narrow and confusing paths, as opposed to broad, open pathways that would enable a concentration of forces.

Is there an algorithm for longest, densest path?
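As an aside, the literal “longest, densest path” question is hard — finding a longest simple path is NP-hard in general — but scoring local neighborhoods of a street graph by edge density is a tractable proxy. A minimal sketch, assuming the street network is already available as a NetworkX graph of intersections and segments (the toy grid at the end is mine):

```python
# Minimal sketch: rank neighborhoods of a street network by local edge density.
# Assumes the street network is already loaded as a NetworkX graph whose
# nodes are intersections and whose edges are street segments.
import networkx as nx

def dense_confusing_areas(g, radius=2, top=5):
    """Score each intersection by the edge density of its local neighborhood.

    A high score means many short, interconnected segments around the node,
    i.e. the kind of tangled warren discussed above.
    """
    scores = {}
    for node in g.nodes():
        ball = nx.ego_graph(g, node, radius=radius)
        if ball.number_of_nodes() > 2:
            # density = actual edges / possible edges in the neighborhood
            scores[node] = nx.density(ball)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

if __name__ == "__main__":
    # Toy example: a small grid with one densely meshed corner.
    g = nx.grid_2d_graph(5, 5)
    g.add_edges_from([((0, 0), (1, 1)), ((0, 1), (1, 0)), ((1, 1), (2, 0))])
    for node, score in dense_confusing_areas(g):
        print(node, round(score, 3))
```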

However discovered, annotating a cluster of dense and confusing paths with tactical information and location of resources would be a natural use of topic maps. Or what to anticipate in such areas, if one is on the “other” side.

The Lazy Developer’s Guide to Loading Datasets into GeoCommons

Filed under: Geographic Data — Patrick Durusau @ 3:47 pm

The Lazy Developer’s Guide to Loading Datasets into GeoCommons

From the post:

Loading KML Files

So lets say you have a bunch of kml files you want to load into Geocommons. Of course, its fairly easy to load these through the web UI, but if you need to do this often enough, it would be nice to have a program to do it for you – after all, as Larry Wall said, laziness is one of the three virtues of great programmers.

Frankly, its not exactly obvious from our API documentation what the best way to do this is. And if you aren’t familiar with Curl, the examples are probably not going to help you much, so I’ll be doing this code in Java. Of course, we here at GeoIQ are Ruby programmers, and thus have a natural disdain for anything to do with Java, so I’m probably losing serious Ruby street cred just posting this, but anything for the good of the cause. We will be using the occasionally obtuse Geocommons REST API, but I’ll try to steer you around some of the not so obvious pitfalls.

The ability to load datasets into GeoCommons is one that may come in handy.

Front-end view generation with Hadoop

Filed under: Hadoop,Web Applications — Patrick Durusau @ 3:44 pm

Front-end view generation with Hadoop by Pere Ferrera.

From the post:

One of the most common uses for Hadoop is building “views”. The usual case is that of websites serving data in a front-end that uses a search index. Why do we want to use Hadoop to generate the index being served by the website? There are several reasons:

  • Parallelism: When the front-end needs to serve a lot of data, it is a good idea to divide them into “shards”. With Hadoop we can parallelize the creation of each of these shards so that both the generation of the view and service of it will be scaled and efficient.
  • Efficiency: In order to maximize the efficiency and the speed of a front-end, it is convenient to separate the generation from the serving of the view. The generation will be done by a back-end process whereas the serving will be done by a front-end; in this way we are freeing the front-end from the load that can be generated while indexing.
  • Atomicity: It is often convenient to have a method for generating and deploying views atomically. In this way, if the deployment fails, we can always go back to previous complete versions (rollback) easily. If the generation went badly we can always generate a new full view where the error will be solved in all the registers. Hadoop allows us to generate views atomically because it is batch-oriented. Some search engines / databases allow atomic deployment by doing a hot-swap of their data.

Covers use of Solr and Voldemort by example.

Concludes by noting this isn’t a solution for real-time updating but one suspects that isn’t a universal requirement across the web.

Plus see the additional resources suggested at the end of the post. You won’t be (or at least shouldn’t be) disappointed.
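The “shards” point in the list above is easy to make concrete: all the back-end job and the front-end need to share is a deterministic rule assigning each record key to a shard, so generation can be parallelized and the serving layer can find the right shard later. A minimal sketch under that assumption (hash-based assignment, not taken from the post, which uses Solr and Voldemort):

```python
# Minimal sketch of deterministic shard assignment for batch view generation.
# The record keys are hypothetical.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard deterministically.

    Both the back-end job that builds shard i and the front-end node that
    serves shard i can compute this independently.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

records = ["user:1001", "user:1002", "product:42", "product:43"]
for key in records:
    print(key, "->", "shard", shard_for(key, num_shards=4))
```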

ASTER Global Digital Elevation Model (ASTER GDEM)

Filed under: Geographic Data,Geographic Information Retrieval,Mapping — Patrick Durusau @ 3:42 pm

ASTER Global Digital Elevation Model (ASTER GDEM)

From the webpage:

ASTER GDEM is an easy-to-use, highly accurate DEM covering all the land on earth, and available to all users regardless of size or location of their target areas.

Anyone can easily use the ASTER GDEM to display a bird’s-eye-view map or run a flight simulation, and this should realize visually sophisticated maps. By utilizing the ASTER GDEM as a platform, institutions specialized in disaster monitoring, hydrology, energy, environmental monitoring etc. can perform more advanced analysis.

In addition to the data, there is a GDEM viewer (freeware) at this site.

All that is missing is your topic map and you.

What topics science lovers link to the most

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 3:41 pm

What topics science lovers link to the most

From FlowingData a visualization by Hilary Mason, chief scientist at bitly, of links to 600 science pages and the pages people visited next.

Ask interesting questions and sometimes you get interesting “answers” or at least observations.

When I see this sort of graphic, it just screams “interface,” even if not suitable for everyone.

November 23, 2011

Coming Attractions: Apache Hive 0.8.0

Filed under: Hadoop,Hive — Patrick Durusau @ 7:52 pm

Coming Attractions: Apache Hive 0.8.0 by Carl Steinbach.

Apache Hive 0.8.0 won’t arrive for several weeks yet, but Carl’s preview covers:

  • Bitmap Indexes
  • TIMESTAMP datatype
  • Plugin Developer Kit
  • JDBC Driver Improvements

Are you interested now? Wondering what else will be included? Could always visit the Apache Hive project to find out. 😉

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

Filed under: Cyc,Hadoop,MapReduce — Patrick Durusau @ 7:50 pm

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

From the post:

Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. The FDA makes a well-formatted sample of the reports available for download from their website, to the benefit of data scientists everywhere.

Methodology

Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:

For All Drugs/Reactions:

|            | Reaction = Rj | Reaction != Rj | Total         |
|------------|---------------|----------------|---------------|
| Drug = Di  | A             | B              | A + B         |
| Drug != Di | C             | D              | C + D         |
| Total      | A + C         | B + D          | A + B + C + D |

Based on the values in the cells of the tables, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if they were independent.
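As a concrete illustration of such measures — these are standard pharmacovigilance statistics, not necessarily the ones used in the post, and the counts below are invented — here is how two of them are computed from the A, B, C, D cells above:

```python
# Two standard disproportionality measures computed from the 2x2 table above.
# Shown for illustration only; the post itself goes on to use the MGPS estimator.

def prr(a, b, c, d):
    """Proportional reporting ratio: P(reaction | drug) / P(reaction | other drugs)."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: (A/B) / (C/D)."""
    return (a * d) / (b * c)

# Hypothetical counts for one drug/reaction pair.
a, b, c, d = 25, 975, 500, 98500
print("PRR:", round(prr(a, b, c, d), 2))
print("ROR:", round(ror(a, b, c, d), 2))
```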

For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper, “Empirical bayes screening for multi-item associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses and analyzing defects in automobiles.

Apologies for the long setup:

Solving the Hard Problem with Apache Hadoop

At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.

The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.
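The counting step that dominates the runtime has the same shape as a word count: for every report, emit each (drug, drug, reaction) combination and sum. A single-machine sketch of that mapper/reducer logic, with made-up reports (in the real pipeline each stage runs as a Hadoop job over the AERS files):

```python
# Single-machine sketch of the counting step described above: for every report,
# emit each (drug, drug, reaction) triple and tally how often each occurs.
from collections import Counter
from itertools import combinations

reports = [
    {"drugs": ["warfarin", "aspirin"], "reactions": ["gi bleed"]},
    {"drugs": ["warfarin", "aspirin", "lisinopril"], "reactions": ["gi bleed"]},
    {"drugs": ["metformin", "lisinopril"], "reactions": ["cough"]},
]

def map_report(report):
    """Mapper: emit one key per drug-pair/reaction combination in a report."""
    for d1, d2 in combinations(sorted(report["drugs"]), 2):
        for reaction in report["reactions"]:
            yield (d1, d2, reaction)

counts = Counter()          # Reducer: sum the emitted keys.
for report in reports:
    counts.update(map_report(report))

for triple, n in counts.most_common():
    print(triple, n)
```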

Where have I heard about combinatorial explosions before?

If you think about it, semantic environments (except for artificial ones) are inherently noisy and the signal we are looking for to trigger merging may be hard to find.

Semantic environments like Cyc are noise free, but they are also not the semantic environments in which most data exists and in which we have to make decisions.

Questions: To what extent are “clean” semantic environments artifacts of adapting to the capacities of existing hardware/software? What aspects of then-current hardware/software would you point to in making that case?

Hadoop World 2011 Presentations

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:48 pm

Hadoop World 2011 Presentations

Slides from Hadoop World 2011 with videos of presentations following as quickly as possible.

A real treasure trove of Hadoop materials.

When the presentations are posted, look for annotated posts on some of them.

In the meantime, enjoy!

Building and Deploying MR2

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:45 pm

Building and Deploying MR2

From the post:

A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Hadoop 0.23.

A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the applications individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).

In MR2, the JobTracker aggregated functionality is separated across two new components:

  1. Central Resource Manager (RM): Management of resources in the cluster.
  2. Application Master (AM): Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.

The new design enables scaling Hadoop to run on much larger clusters, in addition to the ability to run non-mapreduce applications on the same Hadoop cluster. For more architecture details, the interested reader may refer to the design document at: https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf.

The objective of this blog is to outline the steps for building, configuring, deploying and running a single-node NextGen MR cluster.

…(see the post for the rest)

If you want to get a jump on experience with the next generation of Hadoop, here is a place to start!

Black Duck Software Joins GENIVI Alliance

Filed under: Marketing,Open Source,Topic Maps — Patrick Durusau @ 7:44 pm

Black Duck Software Joins GENIVI Alliance

From the post:

Black Duck Software, the leader in open source software knowledge, adoption and governance, today announced it has joined the GENIVI Alliance as an Associate Member. Black Duck will work with the GENIVI Alliance to provide open source compliance strategy, program development and training to Alliance members, which include top automakers and automotive software suppliers.

The GENIVI Alliance is an automotive and consumer electronics industry association driving the development and adoption of an open in-vehicle infotainment (IVI) reference platform. Among the Alliance’s goals are the delivery of a reusable, open source IVI platform consisting of Linux-based core services, middleware and open application layer interfaces; development and support of an open source community of IVI developers; and training and support programs to help software developers create compliant IVI applications.

I would think that infotainment for vehicles would need topic maps as much as any other information stream.

Not to mention that getting on the inside track with someone like Black Duck could not hurt topic maps. 😉

More from the post:

About Black Duck Software

Black Duck Software is the leading provider of strategy, products and services for automating the management, governance and secure use of open source software, at enterprise scale, in a multi-source development process. Black Duck enables companies to shorten time-to-solution and reduce development costs while mitigating the management, compliance and security challenges associated with open source software. Black Duck Software powers Koders.com, the industry’s leading code search engine for open source, and Ohloh.net, the largest free public directory of open source software and a vibrant web community of free and open source software developers and users. Black Duck is among the 500 largest software companies in the world, according to Softwaremag.com. For more information, visit www.blackducksoftware.com.

About GENIVI Alliance

GENIVI Alliance is a non-profit industry association whose mission is to drive the broad adoption of an In-Vehicle Infotainment (IVI) open source development platform. GENIVI will accomplish this by aligning requirements, delivering reference implementations, offering certification programs and fostering a vibrant open source IVI community. GENIVI’s work will result in shortened development cycles, quicker time-to-market, and reduced costs for companies developing IVI equipment and software. GENIVI is headquartered in San Ramon, Calif. www.genivi.org.

Do bear in mind that koders.com searches about 3.3+ billion lines of open source code. I am sure you can think of ways topic maps could improve that search experience.

Google Plugin for Eclipse (GPE) is Now Open Source

Filed under: Cloud Computing,Eclipse,Interface Research/Design,Java — Patrick Durusau @ 7:41 pm

Google Plugin for Eclipse (GPE) is Now Open Source by Eric Clayberg.

From the post:

Today is quite a milestone for the Google Plugin for Eclipse (GPE). Our team is very happy to announce that all of GPE (including GWT Designer) is open source under the Eclipse Public License (EPL) v1.0. GPE is a set of software development tools that enables Java developers to quickly design, build, optimize, and deploy cloud-based applications using the Google Web Toolkit (GWT), Speed Tracer, App Engine, and other Google Cloud services.

….

As of today, all of the code is available directly from the new GPE project and GWT Designer project on Google Code. Note that GWT Designer itself is based upon the WindowBuilder open source project at Eclipse.org (contributed by Google last year). We will be adopting the same guidelines for contributing code used by the GWT project.

Important for the reasons given, but also one possible model for topic map services. What if your topic map service were hosted in the cloud and developers could write against it? That is, they would not have to concern themselves with the niceties of topic maps but would simply request the information of interest to them, using tools you have provided to make that easier for them.

Take for example the Statement of Disbursements that I covered recently. If that were hosted as a topic map in the cloud, a developer, say one working for a restaurant promoter, might want to query the topic map for who frequents eateries in a particular area. They are not concerned with the merging that has to take place between various budgets and the alignment of those merges with individuals, etc. They are looking for a list of places, each followed by an alphabetical list of the House members who frequent it.
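Purely as a sketch of what that developer experience could look like — the endpoint, parameters, and response fields below are invented, not any real API — the query might be as small as this:

```python
# Hypothetical sketch only: a developer-facing query against a hosted topic map
# of the Statement of Disbursements. The endpoint, parameters, and response
# fields are made up for illustration.
import requests

def eateries_by_member(zip_code):
    resp = requests.get(
        "https://example.org/api/disbursements/query",   # hypothetical service
        params={"subject-type": "eatery", "near": zip_code, "group-by": "member"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# The caller never sees the merging of budget line items with members and
# vendors -- only the merged answer.
# print(eateries_by_member("20515"))
```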

Apache Hadoop 0.23 is Here!

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:38 pm

Apache Hadoop 0.23 is Here! by Arun Murthy.

Arun isolates two major improvements:

HDFS Federation

HDFS has undergone a transformation to separate out Namespace management from the Block (storage) management to allow for significant scaling of the filesystem – in the current architecture they are intertwined in the NameNode.

More details are available in the HDFS Federation release documentation or in the recent HDFS Federation talk by Suresh Srinivas, a Hortonworks co-founder at Hadoop World, 2011.

NextGen MapReduce aka YARN

MapReduce has undergone a complete overhaul in hadoop-0.23 with the fundamental change to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform where we can support MapReduce and other application execution frameworks such as MPI etc.

More details are available in the YARN release documentation or in the recent YARN presentation by Mahadev Konar, a Hortonworks co-founder at Hadoop World, 2011.

Arun also notes that Hadoop 0.23 is an alpha release so don’t use this in a production environment (unless you are feeling lucky. Are you?)

More details in the Hadoop World presentation.

So, in addition to a production quality Hadoop ecosystem you are going to need to setup a test Hadoop ecosystem. Well, winter is coming on and a couple of more boxes to heat the office won’t be a bad thing. 😉

Search Solutions 2011: Highlights and Reflections

Filed under: Conferences,Findability,Searching — Patrick Durusau @ 7:37 pm

Search Solutions 2011: Highlights and Reflections by Tony Russell-Rose.

Of particular interest:

Probably the main one for me was Ricardo Baeza-Yates presentation “Beyond the Ten Blue Links”, which discussed Yahoo’s ongoing quest to satisfy the implicit and explicit needs of web search users, presented as a set of seven “challenges”. Some of these challenges you might have expected, such as Query Assistance (e.g. suggestions, related searches, and so on) and Universal Search (i.e. dealing with mixed media results). But other challenges were more unprecedented, e.g. “Post Search User Experience” and “Application Integration”. Both of these suggest a wider re-framing of the search problem, in which findability is just one (small) part of the overall search experience. In this context, the focus is no longer on low-level activities such as selecting relevant documents, but on recognising and providing support for the completion of higher-level tasks. This is interesting in its own right, but it also underlines Search Solutions policy of bringing together the web and enterprise search communities: in this instance, we clearly can learn a lot from each other.

The entire post merits your attention (the proceedings are online, by the way), but I think Tony’s point about findability illustrates a weakness in at least how I have approached topic maps from time to time.

That is, to approach topic maps as an excellent solution for authoring, finding, or maintaining information about subjects, without stopping to ask why we want to author, find, or maintain that information in the first place.

However interesting or clever I find search, string comparison, networks/graphs, graph algorithms, etc., they are unlikely to be of interest to mainstream users.

Or at best, such concerns are a means to an end and not considered interesting enough to bother learning the names of the algorithms that others (including me) think are so bloody important.

Not that I think SC 34/WG 3 needs to expand its brief to include “higher-level tasks” but that in promoting topic maps, we should try to find “higher-level” tasks where topic maps can offer a substantial advantage.

Comments?

Crowdsourcing Maps

Filed under: Authoring Topic Maps,Crowd Sourcing,Maps — Patrick Durusau @ 7:35 pm

Crowdsourcing Maps by Mikhil Masli appears in the November 2011 issue of Computer.

Mikhil describes geowikis as having three characteristics that enable crowdsourcing of maps:

  • simple, WYSIWYG editing of geographic features like roads and landmarks
  • versioning that works with a network of tightly coupled objects rather than independent documents, and
  • spatial monitoring tools that make it easier for users to “watch” a geographic area for possibly malicious edits and to interpret map changes visually.

How would those translate into characteristics of topic maps?

  • simple WYSIWYG interface
  • versioning at lowest level
  • subject monitoring tools to enable watching for edits (a minimal sketch follows this list)
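A toy version of that third item, just to make it concrete — the data structures and function names are mine, not from any topic map engine:

```python
# Minimal sketch of "subject monitoring": users register watches on subjects
# and are notified when an edit touches one of them. Illustrative only.
from collections import defaultdict

watches = defaultdict(set)          # subject id -> set of watching users

def watch(user, subject_id):
    watches[subject_id].add(user)

def record_edit(subject_id, editor, description):
    for user in watches[subject_id] - {editor}:
        print(f"notify {user}: {editor} changed {subject_id}: {description}")

watch("alice", "subject:main-street")
record_edit("subject:main-street", "bob", "added an occurrence")
```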

Oh, I forgot, the topic map originator would have to supply the basic content of the map. Not going to be very interesting to have an empty map for others to fill in.

That is where geographic maps have the advantage: there is already some framework into which any user can add their smaller bit of information.

In creating environments where we want users to add to topic maps, we need to populate those “maps” and make it easy for users to contribute.

For example, a library catalog is already populated with information, and one possible goal (it may or may not be yours) would be to annotate library holdings with comments/reviews from anonymous or named library patrons. The binding could be based on the library’s internal identifier, with other subjects (such as roles) being populated transparently to the user.

Could you do that without a topic map? Possibly, depending on your access to the internals of your library catalog software. But could you then also associate all those reviews with a particular author and not just a particular book they had written? 😉 Yes, it gets dicey when requirements for information delivery change over time. Topic maps excel at such situations because the subjects you want need only be defined. (Well, there is a bit more to it than that but the margin is too small to write it all down.)
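An illustrative sketch of the catalog example, with made-up identifiers: reviews are bound to the library’s internal record identifier, and because each record also carries an author identity, the same reviews can later be regrouped by author without touching them:

```python
# Illustrative sketch only: reviews attached by catalog record id, regrouped by
# author identity when the delivery requirement changes. Identifiers invented.
catalog = {
    "rec:000123": {"title": "Example Novel", "author": "author:doe-jane"},
    "rec:000124": {"title": "Another Novel", "author": "author:doe-jane"},
}
reviews = [
    {"record": "rec:000123", "patron": "anon", "text": "Loved it."},
    {"record": "rec:000124", "patron": "p42", "text": "Slow start."},
]

def reviews_by_author(author_id):
    records = {rid for rid, rec in catalog.items() if rec["author"] == author_id}
    return [r for r in reviews if r["record"] in records]

print(reviews_by_author("author:doe-jane"))
```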

My point here is that topic maps can be authored and vetted by small groups of experts but that they can also, with some planning, be usefully authored by large groups of individuals. That places a greater burden on the implementer of the authoring interface but experience with that sort of thing appears to be growing.

SUMMER SCHOOL ON ONTOLOGY ENGINEERING AND THE SEMANTIC WEB

Filed under: Ontology,Semantic Web — Patrick Durusau @ 5:39 pm

9TH SUMMER SCHOOL ON ONTOLOGY ENGINEERING AND THE SEMANTIC WEB (SSSW 2012), 8-14 July, 2012, Cercedilla, near Madrid, Spain.

Applications open: 30 January 2012, close: 31 March 2012

From the webpage:

The groundbreaking SSSW series of summer schools started in 2003. It is now a well-establish event within the research community and a role model for several other initiatives. Presented by leading researchers in the field, it represents an opportunity for both students and practitioners to equip themselves with the range of theoretical, practical, and collaboration skills necessary for full engagement with the challenges involved in developing Ontologies and Semantic Web applications. To ensure a high ratio between tutors and students the school will be limited to 50 participants. Applications for the summer school will open on the 30th January 2012 and will close by the 31st March 2012.

From the very beginning the school pioneered an innovative pedagogical approach, combining the practical with the theoretical, and adding teamwork and a competitive element to the mix. Specifically, tutorial/lecture material is augmented with hands-on practical workshops and we ensure that the sessions complement each other by linking them to a group project. Work on developing and presenting a project in cooperation with other participants serves as a means of consolidating the knowledge and skills gained from lectures and practical sessions. It also introduces an element of competition among teams, as prizes are awarded to the best projects at the end of the week. Participants will be provided with electronic versions of all course lectures and all necessary tools and environments for the hands-on sessions. PC access with all tools pre-installed will be available on site as well. SSSW 2011 will provide a stimulating and enjoyable environment in which participants will benefit not only from the formal and practical sessions but also from informal and social interactions with established researchers and the other participants to the school. To further facilitate communication and feedback all attendees will present a poster on their research.

It may just be me but I never cared for conferences/meetings that were “near” major locations. Academic and professional meetings should be held at or near large international airports. People who want vacation junkets should become politicians.

November 22, 2011

NoSQL Zone (at DZone)

Filed under: NoSQL — Patrick Durusau @ 7:01 pm

NoSQL Zone (at DZone)

A collection of high quality content on NoSQL.

Enjoy!

Introduction to Spring Data Neo4j

Filed under: Neo4j,Spring Data — Patrick Durusau @ 7:01 pm

Introduction to Spring Data Neo4j by Michael Hunger.

From the post:

The Spring Data Neo4j project has evolved to support the Neo4j graph data store within the Spring paradigm. With version 2.0 already at release candidate stage, now is a great time to learn how to extend your application’s persistence model to start using graphs instead of traditional relational stores. Neo4j expert, Michael Hunger, provides a guided tour of the technology and provides details on how to get started in this Introduction to Spring Data Neo4j.

Be forewarned that the audio is fairly poor quality.

Interesting, see time mark 11:12, Google Image Search: “graph OR network”

It displays a slide full of different images, but consider how they were obtained.

User had to specify “graph” or “network.”

This is what they call a “teaching moment.” 😉

First, a user who knows only “graph” or only “network” as a search term will retrieve less than all of the possible results.

Second, as users who do know both terms, we might decide to create a mapping between “graph” and “network” so that any user who searches for one gets those results, plus the results for the other.

Third, if all we do is map two English terms together, with nothing more, on what basis is some subsequent user going to map terms to our mapping?

No topic map offers a universal solution to these issues but it can offer a solution for specified cases.
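The mapping in the second point is a one-liner to implement. A minimal sketch of such a mapping used for query expansion (documents and terms invented) — note that nothing in it records why “graph” and “network” were mapped, which is exactly the gap the third point describes and the information a topic map’s subject identity machinery is meant to carry:

```python
# Minimal sketch: a declared mapping between "graph" and "network" used to
# expand a query so either term retrieves both result sets. Documents invented.
synonyms = {
    "graph": {"graph", "network"},
    "network": {"graph", "network"},
}

documents = {
    "doc1": "a social network of researchers",
    "doc2": "shortest paths in a weighted graph",
    "doc3": "pipeline diagram",
}

def search(term):
    terms = synonyms.get(term, {term})
    return [doc for doc, text in documents.items()
            if any(t in text for t in terms)]

print(search("graph"))    # ['doc1', 'doc2']
```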

Smart Swarms of Bacteria-Inspired Agents with Performance Adaptable Interactions

Filed under: Agents,Swarms — Patrick Durusau @ 7:00 pm

Smart Swarms of Bacteria-Inspired Agents with Performance Adaptable Interactions by Adi Shklarsh, Gil Ariel, Elad Schneidman, and Eshel Ben-Jacob, appeared in September 2011 Issue of PLoS Computational Biology.

I mentioned to Jack Park that I had been thinking about swarms and mining semantics and he forwarded a link to the ScienceDaily article, Smart Swarms of Bacteria Inspire Robotics: Adaptable Decision-Making Found in Bacteria Communities, which was an adaptation of the PLoS Computational Biology article I cite above.

Abstract (from the PLoS article):

Collective navigation and swarming have been studied in animal groups, such as fish schools, bird flocks, bacteria, and slime molds. Computer modeling has shown that collective behavior of simple agents can result from simple interactions between the agents, which include short range repulsion, intermediate range alignment, and long range attraction. Here we study collective navigation of bacteria-inspired smart agents in complex terrains, with adaptive interactions that depend on performance. More specifically, each agent adjusts its interactions with the other agents according to its local environment – by decreasing the peers’ influence while navigating in a beneficial direction, and increasing it otherwise. We show that inclusion of such performance dependent adaptable interactions significantly improves the collective swarming performance, leading to highly efficient navigation, especially in complex terrains. Notably, to afford such adaptable interactions, each modeled agent requires only simple computational capabilities with short-term memory, which can easily be implemented in simple swarming robots.

This research has a number of aspects that are relevant to semantic domains.

First, although bacteria move in “complex terrains,” those terrains are no more complex and probably less so than the semantic terrains that are presented to agents (whether artificial or not). Whatever we can learn about navigation mechanisms that are successful for other, possibly less complex terrains, will be useful for semantic terrains.

Second, the notion of “performance” as increasing or decreasing influence over other agents sounds remarkably similar to “boosting” except that “boosting” is crude when compared to the mechanisms discussed in this paper.

Third, rather than complex and possibly rigid/fragile modeling (read ontologies, description logic), perhaps simpler computations in agents with short memories may be more successful.

No proof, just airy speculation at this point but experimental proof, the 19th century logicists may concede, is the best kind.
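To make the adaptive-interaction mechanism concrete, here is a deliberately crude 1-D toy — not the authors’ model — in which each agent blends its own noisy gradient estimate with the group’s average heading and trusts the group less when its own recent progress has been good:

```python
# Toy 1-D caricature of performance-adaptive interactions: agents down-weight
# peer influence while making progress and up-weight it otherwise.
import random

def gradient(x):
    return -2 * (x - 10.0)      # toy "terrain": benefit peaks at x = 10

agents = [{"x": random.uniform(0, 5), "last_gain": 0.0} for _ in range(10)]

for step in range(50):
    mean_heading = sum(gradient(a["x"]) for a in agents) / len(agents)
    for a in agents:
        own = gradient(a["x"]) + random.gauss(0, 0.5)     # noisy local sensing
        # good recent progress -> trust yourself more, peers less
        peer_weight = 0.2 if a["last_gain"] > 0 else 0.8
        heading = (1 - peer_weight) * own + peer_weight * mean_heading
        new_x = a["x"] + 0.05 * heading
        a["last_gain"] = abs(10.0 - a["x"]) - abs(10.0 - new_x)
        a["x"] = new_x

print("mean position after 50 steps:",
      round(sum(a["x"] for a in agents) / len(agents), 2))
```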

Gephi: Graph Streaming API

Filed under: Gephi,Graphs — Patrick Durusau @ 7:00 pm

Gephi: Graph Streaming API

Matt O’Donnell, @mdbod, wanted more information on the graph streaming API for Gephi, then tweeted the URL you see above.

I have collaborated with Matt before. It is like working with a caffeinated fire hose. 😉

Seriously, Matt does extremely good work, from biblical languages, linguistics, and markup languages to NLP and beyond.

Looking forward to him working on topic maps and related areas.

Modelling with Graphs

Filed under: Data Models,Graphs,Modeling — Patrick Durusau @ 6:59 pm

Modelling with Graphs by Alistair Jones at NoSQL Br 2011

From the description:

Neo4j is a powerful and expressive tool for storing, querying and manipulating data. However modelling data as graphs is quite different from modelling data under with relational databases. In this talk we’ll cover modelling business domains using graphs and show how they can be persisted and queried in the popular open source graph database Neo4j. We’ll contrast this approach with the relational model, and discuss the impact on complexity, flexibility and performance. We’ll also discuss strategies for deciding how to proceed when a graph allows multiple ways to represent the same concept, and explain the trade-offs involved. As a side-effect, we’ll examine some of the new tools for how to query graph data in Neo4j, and discuss architectures for using Neo4j in enterprise applications.

Alistair is a Software Engineer with Neo Technology, the company behind the popular open source graph database Neo4j.

Alistair has extensive experience as a developer, technical lead and architect for teams building enterprise software across a range of industries. He has a particular focus on Domain Driven Design, and is an expert on Agile methodologies. Alistair often writes and presents on applying Agile principles to the discipline of performance testing.

Excellent presentation!

Anyone care to suggest a book on modeling or modeling with graphs?

NetworkX – 1.6

Filed under: Graphs,NetworkX — Patrick Durusau @ 6:59 pm

NetworkX – 1.6

NetworkX released a new version today.

From the home page:

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Features:

  • Python language data structures for graphs, digraphs, and multigraphs.
  • Nodes can be “anything” (e.g. text, images, XML records)
  • Edges can hold arbitrary data (e.g. weights, time-series)
  • Generators for classic graphs, random graphs, and synthetic networks
  • Standard graph algorithms
  • Network structure and analysis measures
  • Basic graph drawing
  • Open source BSD license
  • Well tested: more than 1500 unit tests
  • Additional benefits from Python: fast prototyping, easy to teach, multi-platform
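A quick taste of several items on that list — arbitrary node objects, edge data, a classic generator, and a standard algorithm — using calls that exist in current NetworkX (the graph itself is made up):

```python
# Small NetworkX example touching several of the listed features.
import networkx as nx

g = nx.Graph()
g.add_node("paper:123", kind="document")            # nodes can be "anything"
g.add_edge("paper:123", "author:doe", weight=1.0)   # edges hold arbitrary data
g.add_edge("author:doe", "paper:456", weight=0.5)

print(nx.shortest_path(g, "paper:123", "paper:456"))  # standard graph algorithm
print(nx.degree_centrality(g))                        # structure/analysis measure

random_g = nx.erdos_renyi_graph(20, 0.2)              # classic random generator
print(nx.number_connected_components(random_g))
```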

That list of features seems quite different from when I first covered it in January of this year.

Matthew O’Donnell tweeted about it.

Social Network Analysis — Finding communities and influencers

Filed under: Social Networks — Patrick Durusau @ 6:58 pm

Social Network Analysis — Finding communities and influencers

Webcast: Date: Tuesday, December 6, 2011
Time: 10 PT, San Francisco

Presented by: Maksim Tsvetovat
Duration: Approximately 60 minutes.
Cost: Free

Description:

A follow-on to Analyzing Social Networks on Twitter, this webcast will concentrate on the social component of Twitter data rather then the questions of data gathering and decomposition. Using a predefined dataset, we will attempt to find communities of people on Twitter that express particular interests. We will also mine Twitter streams for cascades of information diffusion, and determine most influential individuals in these cascades. The webcast will contain an initial introduction to Social Network Analysis methods and metrics.

About Maksim Tsvetovat

Maksim Tsvetovat is an interdisciplinary scientist, a software engineer, and a jazz musician. He has received his doctorate from Carnegie Mellon University in the field of Computation, Organizations and Society, concentrating on computational modeling of evolution of social networks, diffusion of information and attitudes, and emergence of collective intelligence. Currently, he teaches social network analysis at George Mason University. He is also a co-founder of DeepMile Networks, a startup company concentrating on mapping influence in social media. Maksim also teaches executive seminars in social network analysis, including “Social Networks for Startups” and “Understanding Social Media for Decisionmakers”.

Matt O’Donnell tweeted about this webcast and Social Network Analysis for Startups. (@mdbod)
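For a flavor of the two tasks in the title, here is a small NetworkX sketch on a stock toy graph (the webcast works from gathered Twitter data instead; this assumes a recent NetworkX with the community module):

```python
# Small sketch of "finding communities and influencers" on a toy social graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.karate_club_graph()                 # stand-in for a gathered social graph

# Finding communities: modularity-based clustering.
communities = greedy_modularity_communities(g)
print("communities:", [sorted(c) for c in communities])

# Finding influencers: nodes that sit on many shortest paths.
influencers = sorted(nx.betweenness_centrality(g).items(),
                     key=lambda kv: kv[1], reverse=True)[:3]
print("top influencers:", influencers)
```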

Topic Maps as Alternative to Open Graph Database Protocol?

Filed under: Open Graph Database Protocol,Topic Maps — Patrick Durusau @ 6:58 pm

Matt O’Donnell replied to my tweet on a post about Open Graph Database Protocol, saying:

@patrickDurusau re #opengraph thoughts how #topicmaps can be used as alternatives for organic user driven practice for this sort of thing?

A bit more of a question than I can answer in 140 characters, so I am replying here and will tweet the post. 😉

I am not sure that we should be seeking alternatives to “organic user driven practice” for adding information to web pages.

I didn’t always think so and it may be helpful if I say why my opinion has changed.

Set the way-back machine for the late 1980s and early 1990s: SciAM had coverage of SGML, the TEI was formed, and great days were ahead for document re-use, preservation and processing. Except that the toolkits, for the average user, really sucked. For a variety of reasons, most of which are covered elsewhere, SGML found its way into very sophisticated systems but not much further.

XML, a subset of SGML, with simplified processing rules, was created to enter the market spaces not occupied by SGML. In fact, XML parsers were touted as being a weekend project for the average computer programmer. And certainly, given the gains of this simplified (some would say bastardized) system of markup, users would flock to its use.

Ahem, well, XML has certainly proven to be useful for data interchange (one of its original use cases) and with the advent of better interfaces (i.e., users don’t know they are using XML), XML has finally entered the broad consumer market.

And without a doubt, for a large number of use cases, XML is an excellent solution. Unfortunately, it is one that the average user cannot stand. So solutions involving XML markup for the most part display the results and not the process to the average user.

I think there is a lesson to be learned from the journey in markup. I read text with markup almost as easily as I do clear text, but that isn’t the average experience. Most users want their titles to be centered, special terms bolded, paragraphs, lists, etc., and they don’t really care how it got that way.

In semantic domains, users want to be able to find the information they need. (full stop) They could care less whether we use enslaved elves, markup, hidden Markov models or some other means to deliver that result.

True enough we have to enlist the assistance of users in such quests but expecting all but a few to use Topic Maps as topic maps is as futile as the Open Graph Database Protocol.

What we need to do is learn from user behavior what semantics they intend and create mechanisms to map that into our information systems. Since that information exists in different information systems, we will need topic maps to merge the resulting content together.

For example, if we determine as a matter of practice that when a user writes a link whose text is someText, and that someText appears in the resource pointed to by someURL, what is intended is an identification, then we should treat all the same uses of someURL as identifying the same subject.

May be wrong in some cases but it is a start towards capturing what users intend without asking them to do more than they are right now.
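A minimal sketch of that practice — the (anchor text, URL) pairs below are invented — is simply to collect the links users already write and treat every link pointing at the same URL as another name for the same subject:

```python
# Minimal sketch: group observed anchor texts by the URL they point to and
# treat same-URL anchors as names for the same subject. Pairs are invented.
from collections import defaultdict

observed_links = [
    ("W3C", "http://www.w3.org/"),
    ("World Wide Web Consortium", "http://www.w3.org/"),
    ("the consortium", "http://www.w3.org/"),
    ("ISO", "http://www.iso.org/"),
]

names_for_subject = defaultdict(set)
for anchor_text, url in observed_links:
    names_for_subject[url].add(anchor_text)

for url, names in names_for_subject.items():
    print(url, "->", sorted(names))
```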

Next move (this works for the new ODF metadata mechanism): we associate metadata vocabularies, written in RDF or other forms, with documents, for the express purpose of being matched against the content of those documents. For example, if I write a document about WG3 (the ISO topic maps working group), I should be able to associate a vocabulary with that document that identifies all the likely subjects with no help from me. And when the document is saved, links to identifiers are inserted into the document.

That is we start to create smarter documents rather than trying to harvest dumb ones. Could be done with medical reports, patient charts, economic reports, computer science articles, etc.

Still haven’t reached topic maps, have I? Well, all those somewhat smarter documents are going to have different vocabularies. At least if we ever expect usage of that sort of system to take off. Emory Hospital (just to pick one close to home) isn’t going to have exactly the same vocabulary as the Mayo Clinic. And we should not wait for them to have the same vocabulary.

Topic maps come in when we decide that the cost of the mapping is less than the benefit we will gain from mapping across the domains. We may never map the physical plant records between Emory Hospital and the Mayo Clinic, but we might well map nursing reports on particular types of patients or the results of medical trials.

My gut instinct is that we need to ask users “what do you mean when you say X?” And whenever possible (a form of validation) ask them if that is another way to say Y?

So, back to your original question: Yes, we can use topic maps to capture more semantics than the Open Graph Database Protocol but I would start with Open Graph Database as a way to gather rough semantics that we could then refine.

It is the jumping from rough or even uncontrolled semantics to a polished result that is responsible for so many problems. A woodworker does not go from a bark-covered piece of wood to a finished vase in one step. They slowly remove the bark, bring the wood into round and then begin several steps of refinement. We should go and do likewise.

Google BigQuery Service: Big data analytics at Google speed

Filed under: Google BigQuery — Patrick Durusau @ 6:57 pm

Google BigQuery Service: Big data analytics at Google speed

From the post:

Rapidly crunching terabytes of big data can lead to better business decisions, but this has traditionally required tremendous IT investments. Imagine a large online retailer that wants to provide better product recommendations by analyzing website usage and purchase patterns from millions of website visits. Or consider a car manufacturer that wants to maximize its advertising impact by learning how its last global campaign performed across billions of multimedia impressions. Fortune 500 companies struggle to unlock the potential of data, so it’s no surprise that it’s been even harder for smaller businesses.

We developed Google BigQuery Service for large-scale internal data analytics. At Google I/O last year, we opened a preview of the service to a limited number of enterprises and developers. Today we’re releasing some big improvements, and putting one of Google’s most powerful data analysis systems into the hands of more companies of all sizes.

  • We’ve added a graphical user interface for analysts and developers to rapidly explore massive data through a web application.
  • We’ve made big improvements for customers accessing the service programmatically through the API. The new REST API lets you run multiple jobs in the background and manage tables and permissions with more granularity.
  • Whether you use the BigQuery web application or API, you can now write even more powerful queries with JOIN statements. This lets you run queries across multiple data tables, linked by data that tables have in common.
  • It’s also now easy to manage, secure, and share access to your data tables in BigQuery, and export query results to the desktop or to Google Cloud Storage.

Did I remember to mention that this service is free? 😉 Customers will get 30 days’ notice when that is about to end.

Sorta like an early present isn’t it?

What did you do with Google BigQuery?

MIT OpenCourseware / OCW Scholar

Filed under: CS Lectures — Patrick Durusau @ 6:57 pm

MIT OpenCourseware / OCW Scholar

For some unknown reason I haven’t included a mention of these resources on my blog. Perhaps I assumed “everyone” knew about them or it was just oversight on my part.

MIT OpenCourseware is described as:

MIT OpenCourseWare (OCW) is a web-based publication of virtually all MIT course content. OCW is open and available to the world and is a permanent MIT activity.

What is MIT OpenCourseWare?

MIT OpenCourseWare is a free publication of MIT course materials that reflects almost all the undergraduate and graduate subjects taught at MIT.

  • OCW is not an MIT education.
  • OCW does not grant degrees or certificates.
  • OCW does not provide access to MIT faculty.
  • Materials may not reflect entire content of the course.

I would add: “You don’t have classmates working on the same material for discussion, etc.” but even with all those limitations, this is an incredible resource. Self-study is always more difficult but this is one of the best study aids on the Web!

OCW Scholar is described as:

OCW Scholar courses are designed for independent learners who have few additional resources available to them. The courses are substantially more complete than typical OCW courses and include new custom-created content as well as materials repurposed from MIT classrooms. The materials are also arranged in logical sequences and include multimedia such as video and simulations.

Only five courses listed but the two math courses (single variable and multivariable calculus) are fundamental to further CS work. And the courses include study groups.

Highly recommended and worthy of your support!

November 21, 2011

TokuDB v5.2 Beta Program

Filed under: TokuDB — Patrick Durusau @ 7:38 pm

TokuDB v5.2 Beta Program

From the webpage:

With the release of TokuDB v5.0 last March, we delivered a powerful and agile storage engine that broke through traditional MySQL scalability and performance barriers. As deployments of TokuDB have grown more varied, one request we have repeatedly heard from customers and prospects, especially in areas such as online advertising, social media, and clickstream analysis, is for improved performance for multi-client workloads.

Tokutek is now pleased to announce limited beta availability for TokuDB v5.2. The latest version of our flagship product offers a significant improvement over TokuDB v5.0 in multi-client scaling as well as performance gains in point queries, range queries, and trickle load speed. There are a host of other smaller changes and improvements that are detailed in our release notes (available to beta participants).

Here’s your chance for your topic map backend to have a jump over your competitors. And to help make an impressive product even more so.

Your impressions or comments most welcome!

Cryptography (class)

Filed under: Cryptography,CS Lectures — Patrick Durusau @ 7:37 pm

Cryptography with Dan Boneh. (Stanford)

Looks like competition to have an online class is heating up at Stanford. 😉

From the description:

Cryptography is an indispensable tool for protecting information in computer systems. This course explains the inner workings of cryptographic primitives and how to correctly use them. Students will learn how to reason about the security of cryptographic constructions and how to apply this knowledge to real-world applications. The course begins with a detailed discussion of how two parties who have a shared secret key can communicate securely when a powerful adversary eavesdrops and tampers with traffic. We will examine many deployed protocols and analyze mistakes in existing systems. The second half of the course discusses public-key techniques that let two or more parties generate a shared secret key. We will cover the relevant number theory and discuss public-key encryption, digital signatures, and authentication protocols. Towards the end of the course we will cover more advanced topics such as zero-knowledge, distributed protocols such as secure auctions, and a number of privacy mechanisms. Throughout the course students will be exposed to many exciting open problems in the field.

The course will include written homeworks and programming labs. The course is self-contained, however it will be helpful to have a basic understanding of discrete probability theory.

I mention this because topic mappers are going to face security issues and they had better be ready to at least discuss them. Even if the details are handed off to experts in security, including cryptography. Like law, security/cryptography aren’t good areas for self-help.
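As the smallest possible taste of the course’s opening scenario — two parties with a shared secret key exchanging messages an eavesdropper can neither read nor tamper with — here is a sketch using the third-party Python cryptography package (not anything from the course itself):

```python
# Tiny illustration of authenticated symmetric encryption with a shared key,
# using the "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

shared_key = Fernet.generate_key()      # agreed on out of band
alice = Fernet(shared_key)
bob = Fernet(shared_key)

token = alice.encrypt(b"merge topics 42 and 97")
print(bob.decrypt(token))               # b'merge topics 42 and 97'
# Any modification of `token` makes decrypt() raise InvalidToken.
```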

BTW, if this interests you, see Bruce Schneier’s homepage. Really nice collection of resources and other information on cryptography.

Open Graph Database Protocol

Filed under: Graphs,Open Graph Database Protocol — Patrick Durusau @ 7:36 pm

Open Graph Database Protocol

From the introduction:

The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.

While many different technologies and schemas exist and could be combined together, there isn’t a single technology which provides enough information to richly represent any web page within the social graph. The Open Graph protocol builds on these existing technologies and gives developers one thing to implement. Developer simplicity is a key goal of the Open Graph protocol which has informed many of the technical design decisions.

The usual suspects, Google, Microsoft, Facebook, etc.

There are a number of open source parsing/publishing tools listed at this page.

Additional information about web pages and their contents would be very useful, if we could just get users to enter the information.

The Open Graph Database Protocol is yet another attempt to find the minimum users are willing to do.

But not by observing user behavior.

By deciding what is the minimum useful set of data and developing a way for users to enter it.

I am sure there will be users who will enter information and it will improve the pages where it is used.

But not every user will use it, and most likely those that do will use it inconsistently.

Open Graph Database Protocol will become another lingo to parse and track across the Web.

How to combine Neo4j with GWT and Eclipse

Filed under: Eclipse,GWT,Neo4j — Patrick Durusau @ 7:35 pm

How to combine Neo4j with GWT and Eclipse by René Pickhardt.

From the post:

As stated before I did my first testings with Neo4j. Now I wanted to include Neo4j to GWT which is actually very straight forward but for some reasons I was fighting with it for quite a while. I even had to emberass myself by asking stupid questions on the neo4j mailinglist to which Peter and John Doran kindly responded.

Any way now I am excited to follow my research topics which I hope to answer by using neo4j.

A short (12 minutes or so) screencast of getting Neo4j, GWT and Eclipse together.

Probabilistic Graphical Models (class)

Probabilistic Graphical Models (class) by Daphne Koller. (Stanford University)

From the web page:

What are Probabilistic Graphical Models?

Uncertainty is unavoidable in real-world applications: we can almost never predict with certainty what will happen in the future, and even in the present and the past, many important aspects of the world are not observed with certainty. Probability theory gives us the basic foundation to model our beliefs about the different possible states of the world, and to update these beliefs as new evidence is obtained. These beliefs can be combined with individual preferences to help guide our actions, and even in selecting which observations to make. While probability theory has existed since the 17th century, our ability to use it effectively on large problems involving many inter-related variables is fairly recent, and is due largely to the development of a framework known as Probabilistic Graphical Models (PGMs). This framework, which spans methods such as Bayesian networks and Markov random fields, uses ideas from discrete data structures in computer science to efficiently encode and manipulate probability distributions over high-dimensional spaces, often involving hundreds or even many thousands of variables. These methods have been used in an enormous range of application domains, which include: web search, medical and fault diagnosis, image understanding, reconstruction of biological networks, speech recognition, natural language processing, decoding of messages sent over a noisy communication channel, robot navigation, and many more. The PGM framework provides an essential tool for anyone who wants to learn how to reason coherently from limited and noisy observations.

About The Course

In this class, you will learn the basics of the PGM representation and how to construct them, using both human knowledge and machine learning techniques; you will also learn algorithms for using a PGM to reach conclusions about the world from limited and noisy evidence, and for making good decisions under uncertainty. The class covers both the theoretical underpinnings of the PGM framework and practical skills needed to apply these techniques to new problems. Topics include: (i) The Bayesian network and Markov network representation, including extensions for reasoning over domains that change over time and over domains with a variable number of entities; (ii) reasoning and inference methods, including exact inference (variable elimination, clique trees) and approximate inference (belief propagation message passing, Markov chain Monte Carlo methods); (iii) learning methods for both parameters and structure in a PGM; (iv) using a PGM for decision making under uncertainty. The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply PGM methods to computer vision, text understanding, medical decision making, speech recognition, and many other areas.

Another very strong resource from Stanford.

Serious (or aspiring) data miners will be lining up for this course!

