Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 21, 2013

NetGestalt for Data Visualization in the Context of Pathways

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 7:06 pm

NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.

From the post:

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Stephen also points to documentation and video tutorials.

NetGestalt uses gene symbols as gene identifiers. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)

An impressive alignment of data sources even with the restriction to gene symbols.
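The manual's requirement to map other gene identifiers to gene symbols before upload amounts to a simple lookup pass over the data. A minimal sketch, assuming a hypothetical identifier-to-symbol table (the Entrez-style IDs and symbols below are illustrative, not a real annotation resource):

```python
# Hypothetical mapping table: made-up Entrez-style IDs to gene symbols.
# A real pipeline would load this from an annotation database.
ID_TO_SYMBOL = {
    "7157": "TP53",
    "1956": "EGFR",
    "672": "BRCA1",
}

def to_gene_symbols(rows):
    """Convert (gene_id, value) rows to (gene_symbol, value) rows,
    dropping rows whose identifier has no known symbol."""
    out = []
    for gene_id, value in rows:
        symbol = ID_TO_SYMBOL.get(gene_id)
        if symbol is not None:
            out.append((symbol, value))
    return out

data = [("7157", 2.4), ("672", -1.1), ("9999", 0.3)]  # last ID is unmapped
print(to_gene_symbols(data))  # → [('TP53', 2.4), ('BRCA1', -1.1)]
```

Rows whose identifiers cannot be mapped are silently dropped here; a production script would want to report them instead.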

February 19, 2013

Literature Survey of Graph Databases

Filed under: 4store,Accumulo,Diplodocus,Graphs,Networks,RDF,SHARD,Urika,Virtuoso,YARS2 — Patrick Durusau @ 3:39 pm

Literature Survey of Graph Databases by Bryan Thompson.

I can understand why Danny Bickson (Literature survey of graph databases) is excited about the coverage of GraphChi in this survey.

However, there are other names you will recognize as well (TOC order):

  • RDF3X
  • Diplodocus
  • GraphChi
  • YARS2
  • 4store
  • Virtuoso
  • Bigdata
  • SHARD
  • Graph partitioning
  • Accumulo
  • Urika
  • Scalable RDF query processing on clusters and supercomputers (a system with no name at Rensselaer Polytechnic)

As you can tell from the system names, the survey focuses on processing of RDF.

In reviewing one system, Bryan remarks:

Only small data sets were considered (100s of millions of edges). (emphasis added)

I think that captures the focus of the paper better than any comment I can make.

A must read for graph heads!

Using a WHERE clause to filter paths

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 1:52 pm

neo4j/cypher: Using a WHERE clause to filter paths by Mark Needham.

From the post:

One of the cypher queries that I wanted to write recently was one to find all the players that have started matches for Arsenal this season and the number of matches that they’ve played in.

Mark sorts out the use of a where clause on paths.

Visualization of a query as it occurs, tracing a path from node to node, slowed down for the human observer, could be an interesting debugging technique.

Will have to give that some thought.

Could be instructive for debugging topic map merging as well.

Either one would be subject to visual “clutter” so it might work best with a test set that illustrates the problem.

Or perhaps by starting with the larger data set and slowly excluding content until only the problem area remains.
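Outside of Cypher, the same "filter paths with a condition" idea can be sketched in plain Python: enumerate the simple paths, then keep only those satisfying a predicate. The graph and the predicate below are invented for illustration, not taken from Mark's post:

```python
def simple_paths(graph, start, end, path=None):
    """Yield all simple (cycle-free) paths from start to end
    in an adjacency-list graph."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in graph.get(start, []):
        if nxt not in path:  # avoid revisiting nodes on this path
            yield from simple_paths(graph, nxt, end, path)

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

# The "WHERE clause": keep only paths that pass through node "b".
filtered = [p for p in simple_paths(graph, "a", "d") if "b" in p]
print(filtered)  # → [['a', 'b', 'd']]
```

A graph database pushes this filtering into the traversal itself rather than enumerating every path first, which is what makes the declarative form efficient.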

February 15, 2013

Using molecular networks to assess molecular similarity

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto others, less well known domains.
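The point that similarity tests are domain specific can be made concrete with a toy sketch. The synonym table and measures below are invented for illustration: a chemistry-flavoured sameness test and a lexical similarity measure disagree about the very same pair of strings:

```python
# Invented synonym table standing in for a domain's documented sameness test.
SYNONYMS = {("NaCl", "sodium chloride"), ("H2O", "water")}

def chem_same(a, b):
    """Chemistry domain: same substance if listed as synonyms."""
    return a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS

def lexical_similarity(a, b):
    """Lexical domain: Jaccard similarity of character sets."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

print(chem_same("NaCl", "sodium chloride"))                 # → True
print(lexical_similarity("NaCl", "sodium chloride") > 0.5)  # → False
```

Same pair, opposite verdicts: the chemistry test says "same substance," the lexical test says "barely similar." Neither is wrong; each is answering its own domain's question.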

February 14, 2013

Hypergraph-based multidimensional data modeling…

Filed under: Graphs,Hyperedges,Hypergraphs,Networks — Patrick Durusau @ 1:48 pm

Hypergraph-based multidimensional data modeling towards on-demand business analysis by Duong Thi Anh Hoang, Torsten Priebe and A Min Tjoa. (Proceedings of iiWAS ’11, the 13th International Conference on Information Integration and Web-based Applications and Services, pages 36-43.)

Abstract:

In the last few years, web-based environments have witnessed the emergence of new types of on-demand business analysis that facilitate complex and integrated analytical information from multidimensional databases. In these on-demand environments, users of business intelligence architectures can have very different reporting and analytical needs, requiring much greater flexibility and adaptability of today’s multidimensional data modeling. While structured data models for OLAP have been studied in detail, a majority of current approaches has not put its focus on the dynamic aspect of the multidimensional design and/or semantic enriched impact model. Within the scope of this paper, we present a flexible approach to model multidimensional databases in the context of dynamic web-based analysis and adaptive users’ requirements. The introduced formal approach is based on hypergraphs with the ability to provide formal constructs specifying the different types of multidimensional elements and relationships which enable the support of highly customized business analysis. The introduced hypergraphs are used to formally define the semantics of multidimensional models with unstructured ad-hoc analytic activities. The proposed model also supports a formal representation of advanced concepts like dynamic hierarchies, many-to-many associations, additivity constraints etc. Some scenario examples are also provided to motivate and illustrate the proposed approach.

If you like illustrations of technologies with examples from the banking industry, this is the paper on hypergraphs for you.

Besides, banks are where they keep the money. 😉

Seriously, a very well illustrated introduction to the use of hypergraphs and multidimensional data modeling, plus why multidimensional data models matter to clients. (Another place where they keep money.)
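What makes hypergraphs a fit for many-to-many multidimensional relationships is that a hyperedge may connect any number of nodes, not just two. A minimal sketch, with banking-flavoured names invented for illustration (not taken from the paper):

```python
# Each hyperedge maps a name to the set of nodes it connects.
# A joint loan naturally ties several customers and a branch together
# in one relationship, with no intermediate "join" entity required.
hyperedges = {
    "loan-123": {"customer-A", "customer-B", "branch-1"},  # joint loan
    "account-9": {"customer-A", "branch-2"},
}

def incident_edges(hyperedges, node):
    """Return the names of the hyperedges a node participates in."""
    return {name for name, members in hyperedges.items() if node in members}

print(sorted(incident_edges(hyperedges, "customer-A")))
# → ['account-9', 'loan-123']
```

In an ordinary graph the joint loan would need either multiple pairwise edges or an extra node standing for the loan; the hyperedge expresses it directly.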

Survey of graph database models

Filed under: Database,Graphs,Networks — Patrick Durusau @ 5:07 am

Survey of graph database models by Renzo Angles and Claudio Gutierrez. (ACM Computing Surveys (CSUR), Volume 40, Issue 1, February 2008, Article No. 1.)

Abstract:

Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.

If you need an antidote for graph database hype, look no further than this thirty-nine (39) page survey article.

You will come away with a deeper appreciation for graph databases and their history.

If you are looking for a self-improvement reading program, you could do far worse than starting with this article and reading the cited references one by one.

February 10, 2013

Basic planning algorithm

Filed under: Constraint Programming,Graphs,Networks,Searching — Patrick Durusau @ 4:02 pm

Basic planning algorithm by Ricky Ho.

From the post:

Planning can be thought of as a graph search problem, where each node in the graph represents a possible “state” of reality. A directed edge from nodeA to nodeB represents an “action” that is available to transition stateA to stateB.

Planning can be thought of as another form of constraint optimization problem, quite different from the one I described in my last blog. In the planning case, the constraint is the goal state we want to achieve, and a sequence of actions needs to be found to meet the constraint. The sequence of actions will incur cost, and our objective is to minimize the cost associated with our chosen actions.

Makes me curious about topic maps that perform merging based on the “cost” of the merge.

That is upon a query, an engine may respond with a merger of topics found on one node but not request data from remote nodes.

In particular thinking of network performance issues which we all experience, waiting for ads to download for example.

Depending upon my requirements, I should be able to evaluate those costs and avoid them.

I may not have the most complete information but that may not be a requirement for some use cases.
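The cost-minimizing search Ricky describes can be sketched as uniform-cost (Dijkstra-style) search: states are nodes, actions are weighted directed edges, and the planner returns the cheapest action sequence reaching the goal. The state and action names below are invented for illustration:

```python
import heapq

def plan(actions, start, goal):
    """actions: {state: [(next_state, action_name, cost), ...]}
    Returns (total_cost, [action_names]) for the cheapest plan, or None."""
    frontier = [(0, start, [])]  # priority queue ordered by cost so far
    seen = set()
    while frontier:
        cost, state, steps = heapq.heappop(frontier)
        if state == goal:
            return cost, steps
        if state in seen:
            continue
        seen.add(state)
        for nxt, name, c in actions.get(state, []):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + c, nxt, steps + [name]))
    return None  # goal unreachable

actions = {
    "start": [("a", "cheap-hop", 1), ("goal", "direct", 10)],
    "a": [("goal", "finish", 2)],
}
print(plan(actions, "start", "goal"))  # → (3, ['cheap-hop', 'finish'])
```

Note the two-step route beats the single "direct" action: minimizing total cost, not path length, is the point.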

How Neo4j beat Oracle Database

Filed under: Graphs,Neo4j,Networks,Oracle — Patrick Durusau @ 11:56 am

Neo Technology execs: How Neo4j beat Oracle Database by Paul Krill.

From the post:

Neo Technology, which was formed in 2007, offers Neo4j, a Java-based open source NoSQL graph database. With a graph database, which can search social network data, connections between data are explored. Neo4j can solve problems that require repeated network probing (the database is filled with nodes, which are then linked), and the company stresses Neo4j’s high performance. InfoWorld Editor at Large Paul Krill recently talked with Neo CEO Emil Eifrem and Philip Rathle, Neo senior director of products, about the importance of graph database technology as well as Neo4j’s potential in the mobile space. Eifrem also stressed his confidence in Java, despite recent security issues affecting the platform.

InfoWorld: Graph database technology is not the same as NoSQL, is it?

Eifrem: NoSQL is actually four different types of databases: There’s key value stores, like Amazon DynamoDB, for example. There’s column-family stores like Cassandra. There’s document databases like MongoDB. And then there’s graph databases like Neo4j. There are actually four pillars of NoSQL, and graph databases is one of them. Cisco is building a master data management system based on Neo4j, and this is actually our first Fortune 500 customer. They found us about two years ago when they tried to build this big, complex hierarchy inside of Oracle RAC. In Oracle RAC, they had response time in minutes, and then when they replaced it [with] Neo4j, they had response times in milliseconds. (emphasis added)

It is a great story and one I would repeat if I were marketing Neo4j (which I like a lot).

However, there are a couple of bits missing from the story that would make it more informative.

Such as what “…big, complex hierarchy…” was Cisco trying to build? Details please.

There are things that relational databases don’t do well.

Not realizing that up front is a design failure, not one of software or of relational databases.

Another question I would ask: What percentage of Cisco databases are relational vs. graph?

Fewer claims/stories and more data would go a long way towards informed IT decision making.

February 9, 2013

The Perfect Case for Social Network Analysis [Maybe yes. Maybe no.]

Filed under: Graphs,Networks,Security,Social Networks — Patrick Durusau @ 8:21 pm

New Jersey-based Fraud Ring Charged this Week: The Perfect Case for Social Network Analysis by Mike Betron.

When I first saw the headline, I thought the New Jersey legislature had gotten busted. 😉

No such luck, although with real transparency on contributions, relationships and state contracts, prison building would become a growth industry in New Jersey and elsewhere.

From the post:

As reported by MSN Money this week, eighteen members of a fraud ring have just been charged in what may be one of the largest international credit card scams in history. The New Jersey-based fraud ring is reported to have stolen at least $200 million, fooling credit card agencies by creating thousands of fake identities to create accounts.

What They Did

The FBI claims the members of the ring began their activity as early as 2007, and over time, used more than 7,000 fake identities to get more than 25,000 credit cards, using more than 1,800 addresses. Once they obtained credit cards, ring members started out by making small purchases and paying them off quickly to build up good credit scores. The next step was to send false reports to credit agencies to show that the account holders had paid off debts – and soon, their fake account holders had glowing credit ratings and high spending limits. Once the limits were raised, the fraudsters would “bust out,” taking out cash loans or maxing out the cards with no intention of paying them back.

But here’s the catch: The criminals in this case created synthetic identities with fake identity information (social security numbers, names, addresses, phone numbers, etc.). Addresses for the account holders were used multiple times on multiple accounts, and the members created at least 80 fake businesses which accepted credit card payments from the ring members.

This is exactly the kind of situation that would be caught by Social Network Analysis (SNA) software. Unfortunately, the credit card companies in this case didn’t have it.

Well, yes and no.

Yes, if Social Network Analysis (SNA) software were looking for the right relationships, then it could catch the fraud in question.

No, if Social Network Analysis (SNA) software were looking at the wrong relationships, then it would not catch the fraud in question.

Analysis isn’t a question of technology.

For example, what one policy change would do more to prevent future 9/11 type incidents than all the $billions spent since 9/11/2001?

Would you believe: Don’t open the cockpit door for hijackers. (full stop)

The 9/11 hijackers took advantage of the “Common Strategy” flaw in U.S. hijacking protocols.

One of the FAA officials most involved with the Common Strategy in the period leading up to 9/11 described it as an approach dating back to the early 1980s, developed in consultation with the industry and the FBI, and based on the historical record of hijackings. The point of the strategy was to “optimize actions taken by a flight crew to resolve hijackings peacefully” through systematic delay and, if necessary, accommodation of the hijackers. The record had shown that the longer a hijacking persisted, the more likely it was to have a peaceful resolution. The strategy operated on the fundamental assumptions that hijackers issue negotiable demands, most often for asylum or the release of prisoners, and that “suicide wasn’t in the game plan” of hijackers.

Hijackers may blow up a plane, kill or torture passengers, but not opening the cockpit door prevents a future 9/11 type event.

But before 9/11, there was no historical experience with hijacking a plane to use as a weapon.

Historical experience is just as important for detecting fraud.

Once a pattern of fraud is identified, SNA or topic maps or several other technologies can spot it.

But it has to be identified that first time.
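One pattern the article does identify is address reuse: thousands of synthetic identities sharing a much smaller pool of addresses. Once named, that pattern is easy to check for. A toy sketch, with account and address data invented for illustration:

```python
from collections import defaultdict

# Invented data: each account's mailing address.
accounts = {
    "acct-1": "12 Elm St", "acct-2": "12 Elm St", "acct-3": "12 Elm St",
    "acct-4": "7 Oak Ave", "acct-5": "9 Pine Rd",
}

def clusters_by_address(accounts):
    """Group accounts that share a mailing address."""
    by_addr = defaultdict(list)
    for acct, addr in accounts.items():
        by_addr[addr].append(acct)
    return list(by_addr.values())

# Flag any address shared by 3+ accounts as a candidate synthetic ring.
suspicious = [c for c in clusters_by_address(accounts) if len(c) >= 3]
print(suspicious)  # → [['acct-1', 'acct-2', 'acct-3']]
```

The hard part is not the code: it is knowing, before the first bust-out, that shared addresses (and not some other relationship) are the signal worth clustering on.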

February 4, 2013

Large Scale Network Analysis

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 7:12 pm

2nd International Workshop on Large Scale Network Analysis (LSNA 2013)

Dates:

Submission Deadline: February 25, 2013

Acceptance Notification: March 13, 2013

Camera-Ready Submission: March 27, 2013

Workshop Date: May 14, 2013

From the website:

Large amounts of network data are being produced by various modern applications at an ever growing speed, ranging from social networks such as Facebook and Twitter, scientific citation networks such as CiteSeerX, to biological networks such as protein interaction networks. Network data analysis is crucial to exploit the wealth of information encoded in these network data. An effective analysis of these data must take into account the complex structure including social, temporal and sometimes spatial dimensions, and an efficient analysis of these data demands scalable solutions. As a result, there has been increasing research in developing scalable solutions for novel network analytics applications.

This workshop will provide a forum for researchers to share new ideas and techniques for large scale network analysis. We expect novel research works that address various aspects of large scale network analysis, including network data acquisition and integration, novel applications for network analysis in different problem domains, scalable and efficient network analytics algorithms, distributed network data management, novel platforms supporting network analytics, and so on.

Topics of Interest

Topics of interest for this workshop include but are not limited to the following:

  • Large scale network data acquisition, filtering, navigation, integration, search and analysis
  • Novel applications for network data with interesting analytics results
  • Exploring scalability issues in network analysis or modeling
  • Distributed network data management
  • Discussing the deficiency of current network analytics or modeling approaches and proposing new directions for research
  • Discovering unique features of emerging network datasets (e.g new linked data, new form of social networks)

This workshop will include invited talks as well as presentation of accepted papers.

Being held in conjunction with WWW 2013, Rio de Janeiro, Brazil.

January 31, 2013

Demining the “Join Bomb” with graph queries

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 7:26 pm

Demining the “Join Bomb” with graph queries by Rik Van Bruggen.

From the post:

For the past couple of months, and even more so since the beer post, people have been asking me a question that I have been struggling to answer myself for quite some time: what is so nice about the graphs? What can you do with a graph database that you could not, or only at great pains, do in a traditional relational database system. Conceptually, everyone understands that this is because of the inherent query power in a graph traversal – but how to make this tangible? How to show this to people in a real and straightforward way?

And then Facebook Graph Search came along, along with its many crazy search examples – and it sort of hit me: we need to illustrate this with *queries*. Queries that you would not – or only with a substantial amount of effort – be able to do in a traditional database system – and that are trivial in a graph.

This is what I will be trying to do in this blog post, using an imaginary dataset that was inspired by the Telecommunications industry. You can download the dataset here, but really it is very simple: a number of “general” data elements (countries, languages, cities), a number of “customer” data elements (person, company) and a number of more telecom-related data elements (operators – I actually have the full list of all mobile operators in the countries in the dataset coming from here and here, phones and conference call service providers).

Great demonstration using simulated telecommunications data of the power of graph queries.

Highly recommended!

January 25, 2013

2013: What’s Coming Next in Neo4j!

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 8:18 pm

2013: What’s Coming Next in Neo4j! by Philip Rathle.

From the post:

Even though roadmaps can change, and it’s nice not to spoil all of the surprises, we do feel it’s important to discuss priorities within our community. We’ve spent a lot of time over the last year taking to heart all of the discussions we’ve had, publicly and privately, with our users, and closely looking at the various ways in which Neo4j is used. Our aim in 2013 is to build upon the strengths of today’s Neo4j database, and make a great product even better.

The 2013 product plan breaks down into a few main themes. This post is dedicated to the top two, which are:

1. Ease of Use. Making the product easier to learn, use, and maintain, for new & existing users, and

2. Big(ger) Data. Handling ever-bigger data and transaction volumes.

Philip shares some details (but not all) in the post.

It sounds like 2013 is going to be a good year for Neo4j (and by extension, its users)!

Neo4j Milestone 1.9.M04 released

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 8:15 pm

Neo4j Milestone 1.9.M04 released by Michael Hunger.

From the post:

Today we’re happy to announce Neo4j 1.9.M04, the next milestone on our way to the Neo4j 1.9 release.

For this milestone we have worked on further improvements in Cypher, resolving several issues and continued to improve performance.

Something many users have asked for is Scala 2.10 support, which we are providing now that a stable Scala 2.10 release is available.

There were some binary changes in the Scala runtime, so by adapting to these, Cypher became incompatible with Scala 2.9. Please ping us if that is an issue for you.

In the Kernel we finally resolved a recovery problem that caused the recovery process to fail under certain conditions.

Due to a report from Jérémie Grodziski we identified a performance issue with REST-batch-operations which caused a massive slowdown on large requests (more than a thousand commands).

Solving this, we got a 30-times performance increase for these kinds of operations. So if you are inserting large amounts of data into Neo4j using the REST-batch-API, then please try 1.9.M04 to see if that improves things for you.

If you are tracking development of Neo4j, a good time to update your installation.

January 24, 2013

Depth- and Breadth-First Search

Filed under: Graphs,Networks,Searching — Patrick Durusau @ 8:08 pm

Depth- and Breadth-First Search by Jeremy Kun.

From the post:

The graph is among the most common data structures in computer science, and it’s unsurprising that a staggeringly large amount of time has been dedicated to developing algorithms on graphs. Indeed, many problems in areas ranging from sociology, linguistics, to chemistry and artificial intelligence can be translated into questions about graphs. It’s no stretch to say that graphs are truly ubiquitous. Even more, common problems often concern the existence and optimality of paths from one vertex to another with certain properties.

Of course, in order to find paths with certain properties one must first be able to search through graphs in a structured way. And so we will start our investigation of graph search algorithms with the most basic kind of graph search algorithms: the depth-first and breadth-first search. These kinds of algorithms existed in mathematics long before computers were around. The former was ostensibly invented by a man named Pierre Tremaux, who lived around the same time as the world’s first formal algorithm designer Ada Lovelace. The latter was formally discovered much later by Edward F. Moore in the 50′s. Both were discovered in the context of solving mazes.

These two algorithms nudge gently into the realm of artificial intelligence, because at any given step they will decide which path to inspect next, and with minor modifications we can “intelligently” decide which to inspect next.

Of course, this primer will expect the reader is familiar with the basic definitions of graph theory, but as usual we provide introductory primers on this blog. In addition, the content of this post will be very similar to our primer on trees, so the familiar reader may benefit from reading that post as well.

As always, an excellent “primer,” this time on searching graphs.
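The two searches Jeremy covers differ only in the order they expand the frontier: breadth-first uses a FIFO queue, depth-first follows each branch to its end before backtracking. A minimal sketch over an adjacency-list graph (the graph itself is invented for illustration):

```python
from collections import deque

def bfs_order(graph, start):
    """Visit nodes in breadth-first order using a FIFO queue."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs_order(graph, start, seen=None):
    """Visit nodes in depth-first order by recursion."""
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for nxt in graph.get(start, []):
        if nxt not in seen:
            order.extend(dfs_order(graph, nxt, seen))
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_order(graph, "a"))  # → ['a', 'b', 'c', 'd']
print(dfs_order(graph, "a"))  # → ['a', 'b', 'd', 'c']
```

Swapping the queue for a stack (or recursion, as here) is the entire difference between the two algorithms, which is why they make such a good first primer on graph search.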

GraphChi version 0.2 released!

Filed under: GraphChi,Graphs,Networks — Patrick Durusau @ 8:07 pm

GraphChi version 0.2 released! by Danny Bickson.

From the post:

GraphChi version 0.2 is the first major update to the GraphChi software for disk-based computation on massive graphs. This upgrade brings two major changes: compressed data storage (shards) and support for dynamically sized edges.

We also thank you for your interest in GraphChi so far: since the release on July 9th, 2012, there have been over 8,000 unique visitors to the Google Code project page, at least 2,000 downloads of the source package, several blog posts and hundreds of tweets. GraphChi is a research project, and your feedback has helped us tremendously in our work.

Excellent news!

A link for the Dynamic Edge Data tutorial was omitted from the original post.

January 23, 2013

Complex Adaptive Systems Modeling

Filed under: Adaptive Networks,Networks,Social Networks — Patrick Durusau @ 7:42 pm

Complex Adaptive Systems Modeling, Editor-in-Chief: Muaz A. Niazi, ISSN: 2194-3206 (electronic version)

From the webpage:

Complex Adaptive Systems Modeling is a peer-reviewed open access journal published under the brand SpringerOpen.

Complex Adaptive Systems Modeling (CASM) is a highly multidisciplinary modeling and simulation journal that serves as a unique forum for original, high-quality peer-reviewed papers with a specific interest and scope limited to agent-based and complex network-based modeling paradigms for Complex Adaptive Systems (CAS). The highly multidisciplinary scope of CASM spans any domain of CAS. Possible areas of interest range from the Life Sciences (E.g. Biological Networks and agent-based models), Ecology (E.g. Agent-based/Individual-based models), Social Sciences (Agent-based simulation, Social Network Analysis), Scientometrics (E.g. Citation Networks) to large-scale Complex Adaptive COmmunicatiOn Networks and environmentS (CACOONS) such as Wireless Sensor Networks (WSN), Body Sensor Networks, Peer-to-Peer (P2P) networks, pervasive mobile networks, service oriented architecture, smart grid and the Internet of Things.

In general, submitted papers should have the following key elements:

  • A clear focus on a specific area of CAS (e.g. ecology, social sciences, large scale communication networks, biological sciences, etc.)
  • Either focus on an agent-based simulation model or else a complex network model based on data from CAS (e.g. Citation networks, Gene regulatory Networks, Social networks, Ecological Networks etc.).

A new open access journal from Springer with a focus on complex adaptive systems.

Adaptive-network simulation library

Filed under: Adaptive Networks,Complex Networks,Networks,Simulations — Patrick Durusau @ 7:42 pm

Adaptive-network simulation library by Gerd Zschaler.

From the webpage:

The largenet2 library is a collection of C++ classes providing a framework for the simulation of large discrete adaptive networks. It provides data structures for an in-memory representation of directed or undirected networks, in which every node and link can have an integer-valued state.

Efficient access to (random) nodes and links as well as (random) nodes and links with a given state value is provided. A limited number of graph-theoretical measures is implemented, such as the (state-resolved) in- and out-degree distributions and the degree correlations (same-node and nearest-neighbor).

Read the tutorial here. Source code is available here.
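The data structure largenet2 describes, a network whose nodes and links carry integer states with efficient state-resolved access, can be sketched in a few lines. This is a Python illustration of the idea, not largenet2's C++ API; the states and edges below are invented:

```python
import random

# Every node carries an integer-valued state.
node_state = {"n1": 0, "n2": 1, "n3": 1, "n4": 0}
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]

def nodes_in_state(state):
    """All nodes currently in the given integer state."""
    return [n for n, s in node_state.items() if s == state]

def random_node_in_state(state, rng=random):
    """Uniformly random node with the given state (state-resolved access)."""
    return rng.choice(nodes_in_state(state))

print(sorted(nodes_in_state(1)))  # → ['n2', 'n3']
```

A real implementation keeps per-state index lists updated on every state change, so the random lookup is O(1) instead of a scan; that bookkeeping is what the library provides.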

A static topic map would not qualify as an adaptive network, but a dynamic, real time topic map system might have the characteristics of complex adaptive systems:

  • The number of elements is sufficiently large that conventional descriptions (e.g. a system of differential equations) are not only impractical, but cease to assist in understanding the system, the elements also have to interact and the interaction must be dynamic. Interactions can be physical or involve the exchange of information.
  • Such interactions are rich, i.e. any element in the system is affected by and affects several other systems.
  • The interactions are non-linear which means that small causes can have large results.
  • Interactions are primarily but not exclusively with immediate neighbours and the nature of the influence is modulated.
  • Any interaction can feed back onto itself directly or after a number of intervening stages, such feedback can vary in quality. This is known as recurrency.
  • Such systems are open and it may be difficult or impossible to define system boundaries
  • Complex systems operate under far from equilibrium conditions, there has to be a constant flow of energy to maintain the organization of the system
  • All complex systems have a history, they evolve and their past is co-responsible for their present behaviour
  • Elements in the system are ignorant of the behaviour of the system as a whole, responding only to what is available to it locally

The more dynamic the connections between networks, the closer we will move towards networks with the potential for adaptation.

That isn’t to say that all networks will adapt, or that those that do will do it well.

I suspect adaptation, like integration, is going to depend upon the amount of semantic information on hand.

You may also want to review: Largenet2: an object-oriented programming library for simulating large adaptive networks by Gerd Zschaler, and Thilo Gross. Bioinformatics (2013) 29 (2): 277-278. doi: 10.1093/bioinformatics/bts663

January 22, 2013

Click Dataset [HTTP requests]

Filed under: Dataset,Graphs,Networks,WWW — Patrick Durusau @ 2:41 pm

Click Dataset

From the webpage:

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.

Data available under terms and restrictions, including transfer by physical hard drive (~ 2.5 TB of data).

Intrigued by the notion of a “subset of the Web graph actually traversed by users.”

Does that mean that semantic annotation should occur on the portion of the “…Web graph actually traversed by users” before reaching other parts?

If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?

January 18, 2013

Graph Algorithms

Filed under: Algorithms,Graphs,Networks — Patrick Durusau @ 7:16 pm

Graph Algorithms by David Eppstein.

Graph algorithms course with a Wikipedia book, Graph Algorithms, made up of articles from Wikipedia.

The syllabus does include some materials not found at Wikipedia, so be sure to check there as well.

Strong components of the Wikipedia graph

Filed under: Algorithms,Graphs,Networks,Wikipedia — Patrick Durusau @ 7:16 pm

Strong components of the Wikipedia graph

From the post:

I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?
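Strong connectivity analysis of this kind fits in a few lines of standard-library Python. The miniature link graph below is hypothetical, loosely echoing the Freudenthal example, and uses Kosaraju's algorithm (not necessarily what the course used):

```python
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: graph is {node: [successor, ...]}."""
    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    # Pass 2: DFS on the reversed graph, in decreasing finish order.
    reverse = defaultdict(list)
    for node, succs in graph.items():
        for s in succs:
            reverse[s].append(node)
    assigned, components = set(), []
    for node in reversed(order):
        if node in assigned:
            continue
        comp, stack = [], [node]
        assigned.add(node)
        while stack:
            n = stack.pop()
            comp.append(n)
            for p in reverse[n]:
                if p not in assigned:
                    assigned.add(p)
                    stack.append(p)
        components.append(comp)
    return components

links = {  # a made-up fragment of the Wikipedia link graph
    "Homotopy group": ["Heinz Hopf", "Algebraic topology"],
    "Heinz Hopf": ["Algebraic topology"],
    "Algebraic topology": ["Homotopy group"],
    "Hans Freudenthal": ["Homotopy group", "Utrecht University"],
    "Utrecht University": [],
}
print(sorted(sorted(c) for c in strongly_connected_components(links)))
```

Only the three mutually linked topology articles form a nontrivial component; the university pages, lacking links back, fall out as singletons, which is one answer to how universities end up inside or outside such components.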

One of the responses to this post suggests grabbing the entire Wikipedia dataset for purposes of trying out algorithms.

A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?

Recall the research that claims people can only remember seven or so things at a time.

January 17, 2013

Graph Database Resources

Filed under: Graphs,Networks — Patrick Durusau @ 7:26 pm

Graph Database Resources by Danny Bickson.

Danny provides a short list of graph database resources.

Do be careful with:

A paper that summarizes the state of graph databases that might be worth reading: http://swp.dcc.uchile.cl/TR/2005/TR_DCC-2005-010.pdf

Summarizes the state of the art as of 2005.

Still worth reading because many of the techniques and insights are relevant for today.

And if you pay attention to the citations, you will discover that “graphs as a new way of thinking” is either ignorance or marketing hype.

The earliest paper cited in the 2005 state of art for graphs dates from 1965:

D. J. de S. Price. Networks of Scientific papers. Science, 149:510–515, 1965.

And there are plenty of citations from the 1970’s and 1980’s on hypergraphs, etc.

I am very much a graph enthusiast but the world wasn’t created anew because we came of age.

January 15, 2013

Graphs as a New Way of Thinking [Really?]

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 8:30 pm

Graphs as a New Way of Thinking by Emil Eifrem.

From the post:

Faced with the need to generate ever-greater insight and end-user value, some of the world’s most innovative companies — Google, Facebook, Twitter, Adobe and American Express among them — have turned to graph technologies to tackle the complexity at the heart of their data.

To understand how graphs address data complexity, we need first to understand the nature of the complexity itself. In practical terms, data gets more complex as it gets bigger, more semi-structured, and more densely connected.

We all know about big data. The volume of net new data being created each year is growing exponentially — a trend that is set to continue for the foreseeable future. But increased volume isn’t the only force we have to contend with today: On top of this staggering growth in the volume of data, we are also seeing an increase in both the amount of semi-structure and the degree of connectedness present in that data.

He later concludes with:

Graphs are a new way of thinking for explicitly modeling the factors that make today’s big data so complex: Semi-structure and connectedness. As more and more organizations recognize the value of modeling data with a graph, they are turning to the use of graph databases to extend this powerful modeling capability to the storage and querying of complex, densely connected structures. The result is the opening up of new opportunities for generating critical insight and end-user value, which can make all the difference in keeping up with today’s competitive business environment.

I know it is popular rhetoric to say that X technology is a “new way of thinking.” Fashionable perhaps but also false.

People have always written about “connections” between people, institutions, events, etc. If you don’t believe me, find an online version of Plutarch.

Where I do think Emil has a good point is when he says: “Graphs are…for explicitly modeling the factors…,” which is no mean feat.

The key to disentangling big data isn’t “new thinking” or navel gazing about its complexity.

One key step is making connections between data (big or otherwise), explicit. Unless it is explicit, we can’t know for sure if we are talking about the same connection or not.

Another key step is identifying the data we are talking about (in topic maps terms, the subject of conversation) and how we identify it.

It isn’t rocket science nor does it require a spiritual or intellectual re-birth.

It does require some effort to make explicit what we usually elide over in conversation or writing.

For example, earlier in this post I used the term “Emil” and you instantly knew who I meant. A mechanical servant reading the same post might not be so lucky. Nor would it supply the connection to Neo4j.

A low-effort way of making those connections and identifications explicit would go a long way toward managing big data, with no “new way of thinking” required.
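A minimal sketch of what making identification explicit might look like; the example.org identifiers below are placeholders for illustration, not real published subject identifiers:

```python
# Attach explicit identifiers to names a human reader resolves implicitly.
# The URIs are hypothetical stand-ins for stable subject identifiers.
subject_identifiers = {
    "Emil": "http://example.org/people/emil-eifrem",
    "Neo4j": "http://example.org/software/neo4j",
}

# The elided connection a reader supplies mentally, stated as a triple.
connections = [
    ("http://example.org/people/emil-eifrem", "founder-of",
     "http://example.org/software/neo4j"),
]

def resolve(name):
    """Turn a bare name into an explicit subject identifier, if known."""
    return subject_identifiers.get(name)

print(resolve("Emil"))
```

A mechanical reader armed with the mapping can now agree with a human one about who “Emil” is, and follow the connection to Neo4j.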

I first saw this at Thinking Differently with Graph Databases by Angela Guess.

January 14, 2013

Fun with Beer – and Graphs

Filed under: Graphs,Networks — Patrick Durusau @ 8:36 pm

Fun with Beer – and Graphs by Rik van Bruggen.

From the post:

I make no excuses: My name is Rik van Bruggen and I am a salesperson. I think it is one of the finest and nicest professions in the world, and I love what I do. I love it specifically, because I get to sell great, awesome, fantastic products really – and I get to work with fantastic people along the way. But the point is I am not a technical person – at all. But, I do have a passion for technology, and feel the urge to understand and taste the products that I sell. And that’s exactly what happened a couple of months ago when I joined Neo Technology, the makers and maintainers of the popular Neo4j open source graph database.

So I decided to get my hands dirty and dive in head first. But also, to have some fun along the way.

The fun part would be coming from something that I thoroughly enjoy: Belgian beer. Some of you may know that Stella Artois, Hoegaerden, Leffe and the likes come from Belgium, but few of you know that this tiny little country in the lowlands around Brussels actually produces several thousand beers.

Belgian Beer

You can read about it on the Wikipedia page: Belgian beers are good, and numerous. So how would I go about putting Belgian beers into Neo4j? Interesting challenge.

Very useful post if you:

  • Don’t want to miss any Belgian beers as you drink your way through them.
  • Want to drink your way up or down in terms of alcohol percentage.
  • Want to memorize the names of all Belgian beers.
  • Oh, want to gain experience with Neo4j. 😉
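The bullet list hints at the queries such a graph supports. A plain-Python analogue of Rik's brewery-to-beer graph (the beers are real but the ABV figures are approximate, and this is an illustrative sketch rather than his actual Neo4j dataset):

```python
# Brewery -> beer edges, plus an alcohol-percentage property per beer.
brews = {
    "Duvel Moortgat": ["Duvel"],
    "Brouwerij Westmalle": ["Westmalle Dubbel", "Westmalle Tripel"],
    "Brasserie d'Orval": ["Orval"],
}
abv = {"Duvel": 8.5, "Westmalle Dubbel": 7.0,
       "Westmalle Tripel": 9.5, "Orval": 6.2}

# "Drink your way up": beers ordered by alcohol percentage.
print(sorted(abv, key=abv.get))
# ['Orval', 'Westmalle Dubbel', 'Duvel', 'Westmalle Tripel']

# A one-hop traversal: which brewery makes the strongest beer?
strongest = max(abv, key=abv.get)
brewery = next(b for b, beers in brews.items() if strongest in beers)
print(brewery, strongest)
# Brouwerij Westmalle Westmalle Tripel
```

In Neo4j the same shape would be brewery and beer nodes joined by edges, with ABV as a node property, and the traversal written as a Cypher query.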

On Graph Computing [Shared Vertices/Merging]

Filed under: Graphs,Merging,Networks — Patrick Durusau @ 8:35 pm

On Graph Computing by Marko A. Rodriguez.

Marko writes elegantly about graphs and I was about to put this down as another graph advocacy post. Interesting but graph followers have heard this story before.

But I do read the materials I cite and Marko proceeds to define three separate graphs, software, discussion and concept. Each of which has some vertexes in common with one or both of the others.

Then he has this section:

A Multi-Domain Graph

The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Shared vertices sounds a lot like merging in the topic map sense to me.

It isn’t clear from the post what requirements may or may not exist for vertices to be “shared.”

Or how you would state the requirements for sharing vertices?

Or how to treat edges that become duplicates when the separate vertices they connect become the same vertex?
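Under the simplest reading, where the shared vertices already carry identical identifiers, the mechanics reduce to set union, which also disposes of exactly duplicated edges. A sketch (the triples paraphrase Marko's example; they are not his data model):

```python
# Two domain graphs as sets of (source, label, target) edges that happen
# to reuse vertex identifiers ("Titan", "Matthias").
software = {("Gremlin", "dependency-of", "Titan"),
            ("Titan", "developed-by", "Matthias")}
discussion = {("Matthias", "writes-on", "Aurelius mailing list"),
              ("Titan", "developed-by", "Matthias")}  # duplicate edge

merged = software | discussion  # set union: shared edges collapse to one
print(len(software) + len(discussion), "->", len(merged))
# 4 -> 3
```

Note what the sketch assumes away: the vertices merge only because their identifiers already agree character-for-character. Deciding when two differently named vertices represent the same subject is exactly the merging question topic maps are built to answer.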

If “shared vertices” support what we call merging in topic maps, perhaps there are graph installations waiting to wake up as topic maps.

January 2, 2013

Wine industry network in the US

Filed under: Networks,Visualization — Patrick Durusau @ 3:03 pm

Wine industry network in the US by Nathan Yau.

Nathan points to an exploration of the wine network in the US. Like other markets, a few vendors dominate.

See what you make of the visualization and the underlying data.

Is there a mobile app for wine choices and locations? Perhaps with prices?

Thinking that could be extended to “tag” the “variety” that is actually the same vendor.

December 29, 2012

GRADES: Graph Data-management Experiences & Systems

Filed under: Graph Database Benchmark,Graph Databases,Graph Traversal,Graphs,Networks — Patrick Durusau @ 7:26 pm

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

  • life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
  • social network marketing, e.g., identifying influential speakers and trends propagating through a community.
  • digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
  • telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
  • digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.

Perspectives

The GRADES workshop solicits contributions from two perspectives:

  • Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped data, as well as benchmark descriptions and benchmark results.
  • Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems) and the benchmarking of RDF and graph database systems is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

  • vision papers describing potential applications and benefits of graph data management.
  • descriptions of graph data management use cases and query workloads.
  • experiences with applying data management technologies in such situations.
  • experiences or techniques for specific operations such as traversals or RDF reasoning.
  • proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
  • proposals for benchmarks for RDF and graph database workloads.
  • evaluation of benchmark performance results on RDF or graph database systems.
  • system descriptions covering RDF or graph database technology.
  • data and index structures that can be used in RDF and graph database systems.
  • query processing and optimization algorithms for RDF and graph database systems.
  • methods and technique for measuring graph characteristics.
  • methods and techniques for visualizing graphs and graph query results.
  • proposals and experiences with graph query languages.

The GRADES workshop is co-located with and sponsored by SIGMOD, in recognition that these problems are only interesting at large scale, and that the SIGMOD community’s contribution to handling such topics on data with many millions or even billions of nodes and edges is of critical importance.

That sounds promising doesn’t it? (Please email, copy, post, etc.)

December 27, 2012

HyperGraphDB 1.2 Final

Filed under: Graphs,Hypergraphs,Networks — Patrick Durusau @ 10:25 am

HyperGraphDB 1.2 Final

From the post:

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.

This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the Changes for HyperGraphDB, Release 1.2 wiki page.

  1. Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind are documented in the blog post HyperNodes Are Contexts.
  2. Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.
  3. Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn’t require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).
  4. Implementation of parameterized pre-compiled queries for improved query performance. This is documented in the Variables in HyperGraphDB Queries blog post.

HyperGraphDB is a Java-based product built on top of the Berkeley DB storage library.

This release dates from November 4, 2012. Apologies for missing the news until now.

December 26, 2012

Want some hackathon friendly altmetrics data?…

Filed under: Citation Analysis,Graphs,Networks,Tweets — Patrick Durusau @ 7:30 pm

Want some hackathon friendly altmetrics data? arXiv tweets dataset now up on figshare by Euan Adie.

From the post:

The dataset contains details of approximately 57k tweets linking to arXiv papers, found between 1st January and 1st October this year. You’ll need to supplement it with data from the arXiv API if you need metadata about the preprints linked to. The dataset does contain follower counts and lat/lng pairs for users where possible, which could be interesting to plot.

Euan has some suggested research directions and more details on the data set.

Something to play with during the holiday “down time.” 😉

I first saw this in a tweet by Jason Priem.

Titan-Android

Filed under: Graphs,Gremlin,Networks,TinkerPop,Titan — Patrick Durusau @ 3:34 pm

Titan-Android by David Wu.

From the webpage:

Titan-Android is a port/fork of Titan for the Android platform. It is meant to be a light-weight implementation of a graph database on mobile devices. The port removes HBase and Cassandra support as their usage makes little sense on a mobile device (convince me otherwise!). Gremlin is only supported via the Java interface as I have not been able to port Groovy successfully. Nevertheless, Titan-Android supports a local storage backend via BerkeleyDB and supports the TinkerPop stack natively.

Just in case there was an Android under the tree!

I first saw this in a tweet by Marko A. Rodriguez.

sigma.js

Filed under: Graphs,Networks,Sigma.js,Visualization — Patrick Durusau @ 11:46 am

sigma.js – Web network visualization made easy by Alexis Jacomy.

From the webpage:

sigma.js is an open-source lightweight JavaScript library to draw graphs, using the HTML canvas element. It has been especially designed to:

  • Display interactively static graphs exported from a graph visualization software – like Gephi
  • Display dynamically graphs that are generated on the fly

From October of 2012:

osdc2012-sigmajs-demo – French OSDC 2012 (demo)

osdc2012-sigmajs-presentation – French OSDC 2012 (Landslide presentation)

See also: Using Sigma.js with Neo4j.

A tweet from Andreas Müller reminded me to create a separate post on sigma.js.
