Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 29, 2012

Twitter Social Network by @aneeshs (video lecture)

Filed under: Graphs,Networks,Social Networks,Tweets — Patrick Durusau @ 3:37 pm

Video Lecture: Twitter Social Network by @aneeshs, posted by Marti Hearst.

From the post:

Learn about weak ties, triadic closures, and personal pagerank, and how they all relate to the Twitter social graph from Aneesh Sharma:

Just when you think the weekend can’t get any better!

Enjoy!

September 28, 2012

Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges

Filed under: Graphs,Networks,Social Media,Social Networks,Time,Time Series — Patrick Durusau @ 12:59 pm

Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges by Michael J. Bannister, Christopher DuBois, David Eppstein, Padhraic Smyth.

Abstract:

We consider the problem of analyzing social network data sets in which the edges of the network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous subintervals of these timestamps. We provide data structures for these problems that use near-linear preprocessing time, linear space, and sublogarithmic query time to handle queries that ask for the number of connected components, number of components that contain cycles, number of vertices whose degree equals or is at most some predetermined value, number of vertices that can be reached from a starting set of vertices by time-increasing paths, and related queries.

Among other interesting questions, the paper raises the issue of what time span of connections constitutes a network of interest. That is more than a matter of being “dynamic”; it is a definitional issue for the social network in question.

If you are working with social networks, a must read.
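
If you want a feel for the queries involved, here is a naive Python baseline (nothing like the paper’s near-linear structures): given timestamped edges, count the connected components among the edges falling in a time window, using a throwaway union-find. Names and data are mine, for illustration only.

    def components_in_window(edges, t_start, t_end):
        """Count connected components induced by edges with timestamps
        in [t_start, t_end]. Naive O(window size) per query; the paper's
        structures answer such queries far faster."""
        parent = {}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        def union(u, v):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv

        for u, v, t in edges:
            if t_start <= t <= t_end:
                for x in (u, v):
                    parent.setdefault(x, x)
                union(u, v)

        return len({find(v) for v in parent})

    # Relational events as (u, v, timestamp) tuples.
    events = [("a", "b", 1), ("b", "c", 2), ("d", "e", 5)]
    print(components_in_window(events, 1, 3))  # -> 1
    print(components_in_window(events, 1, 5))  # -> 2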

PS: You probably need to read: Relational events vs graphs, a posting by David Eppstein.

David details several different terms for “relational event data,” and says there are probably others he and his coauthors did not find. (Topic maps anyone?)

September 27, 2012

Faunus

Filed under: Faunus,Graphs,Networks — Patrick Durusau @ 5:14 pm

Faunus

From the home page:

Faunus is a Hadoop based distributed computing framework for processing property graphs. A breadth-first version of the graph traversal language Gremlin operates on a vertex-centric property graph data structure. Faunus provides adaptors to the distributed graph database Titan, any Rexster fronted graph database, and to text and binary graphs stored in HDFS. The provided Gremlin operations and Hadoop graph tools can be extended using MapReduce and Blueprints.

Warning: Limitation on Vertices

Faunus Vertex

  • id: a vertex id is a positive long value and therefore, a graph in Faunus can not have more than 9,223,372,036,854,775,807 vertices.
  • properties: the size of the properties map is denoted by a positive short and therefore there can not exist more than 32,767 properties per vertex.
  • edges:

    • unique labels: edges are indexed by their label using a short and therefore, there can not be more than 32,767 unique labels for the incoming (or outgoing) edges of a vertex.
    • total edges: the edge size for any one label is represented by an int and therefore, for any direction and any label, there can not be more than 2,147,483,647 edges.

Warning: Limitation on Edges

Faunus Edge

  • id: an edge id is a positive long value and therefore, a graph in Faunus can not have more than 9,223,372,036,854,775,807 edges.
  • properties: the size of the properties map is denoted by a positive short and therefore there can not exist more than 32,767 properties per edge.
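
These ceilings follow directly from the widths of Java’s primitive types. A quick sanity check (my arithmetic, not Faunus code):

    # Java primitive ranges behind the Faunus limits quoted above.
    LONG_MAX = 2**63 - 1    # vertex/edge ids (positive long)
    SHORT_MAX = 2**15 - 1   # properties per element, unique edge labels
    INT_MAX = 2**31 - 1     # edges per label and direction

    assert LONG_MAX == 9223372036854775807
    assert SHORT_MAX == 32767
    assert INT_MAX == 2147483647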

I don’t like putting limitation warnings in my first post on software but thought you needed to be forewarned. 😉

September 21, 2012

WeST researcher’s summary of SocialCom 2012 in Amsterdam

Filed under: Graphity,Graphs,Networks — Patrick Durusau @ 7:26 pm

WeST researcher’s summary of SocialCom 2012 in Amsterdam by René Pickhardt.

René has a new principal blogging site and reports on SocialCom 2012.

In the beginning of this month I was attending my first major conference, IEEE SocialCom 2012, in Amsterdam. I was presenting my work Graphity. On the following URL you can find the slides, videos, data sets, source code and of course the full paper!

www.rene-pickhardt.de/graphity

In this article I want to talk about the conference itself. About what presentations I particularly liked and share some impressions.

I got distracted by the Graphity paper but promise I will read the rest of René’s comments on the conference this weekend!

September 19, 2012

GT-VMT…Graph Transformation and Visual Modeling Techniques

Filed under: Conferences,Graphs,Networks,Visualization — Patrick Durusau @ 4:43 pm

GT-VMT 2013 : 12th International Workshop on Graph Transformation and Visual Modeling Techniques

Abstract Submission: December 7, 2012

Paper Submission: December 14, 2012

Notification to Authors: January 18, 2013

Camera Ready Submission: February 1, 2013

Workshop Dates: March 23-24, 2013

From the call for papers:

GT-VMT 2013 is the twelfth workshop of a series that serves as a forum for all researchers and practitioners interested in the use of visual notations (especially graph-based), techniques and tools for the specification, modeling, validation, manipulation and verification of complex systems. The aim of the workshop is to promote engineering approaches that provide effective sound tool support for visual modeling languages, enhancing formal reasoning at the syntactic as well as semantic level (e.g., for model specification, model analysis, model transformation, and model consistency management) in different domains, such as UML, Petri Nets, Graph Transformation or Business Process/Workflow Models.

This year’s workshop has a special theme of the analysis of non-functional / extra-functional / quality properties like performance, real-time, safety, reliability, energy consumption. We particularly encourage submissions that focus on the definition and the evaluation of such properties using visual/graph specification techniques, ranging from underlying theory through to their utility in complex system design.

As a summary, topics relevant to the scope of the workshop include (but are not restricted to) the following:

  • visual languages definition and syntax (incl. meta-modelling, grammars and graphical parsing);
  • static and dynamic semantics of visual languages (incl. OCL, graph constraints, simulation, animation, compilation);
  • visual/graph-based analysis in software engineering (incl. testing, verification & validation, static & dynamic analysis techniques);
  • visual/graph constraints (incl. definition, expressiveness, analysis techniques involving constraints);
  • model transformations and their application in model-driven development (incl. in particular, transformations between graphical and textual formalisms);
  • visual modeling techniques and graph transformation applied to patterns;
  • visual modeling techniques and graph transformations for systems with quality properties like performance, real-time, safety, reliability, energy consumption;
  • case studies and novel application areas (e.g. within engineering, biology, etc);
  • tool support and efficient algorithms.

Did I forget to mention the workshop will be held in Rome, Italy? 😉 (March is a great time to be in Rome.)

September 14, 2012

First Party Fraud (In Four Parts)

Filed under: Business Intelligence,Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 1:00 pm

Mike Betron has written a four-part series on first party fraud that merits your attention:

First Party Fraud [Part 1]

What is First Party Fraud?

First-party fraud (FPF) is defined as when somebody enters into a relationship with a bank using either their own identity or a fictitious identity with the intent to defraud. First-party fraud is different from third-party fraud (also known as “identity fraud”) because in third-party fraud, the perpetrator uses another person’s identifying information (such as a social security number, address, phone number, etc.). FPF is often referred to as a “victimless” crime, because no consumers or individuals are directly affected. The real victim in FPF is the bank, which has to eat all of the financial losses.

First-Party Fraud: How Do We Assess and Stop the Damage? [Part 2]

Mike covers the cost of first party fraud and then why it is so hard to combat.

Why is it so hard to detect FPF?

Given the amount of financial pain incurred by bust-out fraud, you might wonder why banks haven’t developed a solution and process for detecting and stopping it.

There are three primary reasons why first-party fraud is so hard to identify and block:

1) The fraudsters look like normal customers

2) The crime festers in multiple departments

3) The speed of execution is very fast

Fighting First Party Fraud With Social Link Analysis (3 of 4)

And you know, those pesky criminals won’t use their universally assigned identifiers for financial transactions. (Any security system that relies on good faith isn’t a security system, it’s an opportunity.)

A Trail of Clues Left by Criminals

Although organized fraudsters are sophisticated, they often leave behind evidence that can be used to uncover networks of organized crime. Fraudsters know that due to Know Your Customer (KYC) and Customer Due Diligence (CDD) regulations, their identification will be verified when they open an account with a financial institution. To pass these checks, the criminals will either modify their own identity slightly or else create a synthetic identity, which consists of combining real identity information (e.g., a social security number) with fake identity information (names, addresses, phone numbers, etc.).

Fortunately for banks, false identity information can be expensive and inconvenient to acquire and maintain. For example, apartments must be rented out to maintain a valid address. Additionally, there are only so many cell phones a person can carry at one time and only so many aliases that can be remembered. Because of this, fraudsters recycle bits and pieces of these valuable assets.

The reuse of identity information has inspired Infoglide to begin to create new technology on top of its IRE platform called Social Link Analysis (SLA). SLA works by examining the “linkages” between the recycled identities, therefore identifying potential fraud networks. Once the networks are detected, Infoglide SLA applies advanced analytics to determine the risk level for both the network and for every individual associated with that network.

First Party Fraud (post 4 of 4) – A Use Case

As discussed in our previous blog in this series, Social Link Analysis works by identifying linkages between individuals to create a social network. Social Link Analysis can then analyze the network to identify organized crime, such as bust-out fraud and internal collusion.

During the Social Link Analysis process, every individual is connected to a single network. An analysis at a large tier 1 bank will turn up millions of networks, but the majority of individuals only belong to very small networks (such as a husband and wife, and possibly a child). However, the social linking process will certainly turn up a small percentage of larger networks of interconnected individuals. It is in these larger networks where participants of bust-out fraud are hiding.

Due to the massive number of networks within a system, the analysis is performed mathematically (e.g. without user interface) and scores and alerts are generated. However, any network can be “visualized” using the software to create a graphic display of information and connections. In this example, we’ll look at a visualization of a small network that the social link analysis tool has alerted as a possible fraud ring.

A word of caution.

To leap from the example individuals being related to each other to:

As a result, Social Link Analysis has detected four members of a network, each with various amounts of charged-off fraud.

Is quite a leap.

Having charged off loans, with re-use of telephone numbers and a mobile population, doesn’t necessarily mean anyone is guilty of “charged-off fraud.”

Could be, but you should tread carefully and with legal advice before jumping to conclusions of fraud.

For good customer relations, if not avoiding bad PR and legal liability.

PS: Topic maps can help with this type of data. Including mapping in the bank locations or even personnel who accepted particular loans.
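
To see why recycled identifiers create detectable networks, here is a minimal sketch of the linking step (my own construction, not Infoglide’s SLA, and it assumes the networkx library): treat any two applications that share an identifier as linked, then read candidate rings off the connected components.

    from collections import defaultdict
    from itertools import combinations

    import networkx as nx  # assumes networkx is installed

    applications = [
        {"id": 1, "ssn": "111-22-3333", "phone": "555-0100"},
        {"id": 2, "ssn": "111-22-3333", "phone": "555-0199"},  # shares SSN with 1
        {"id": 3, "ssn": "999-88-7777", "phone": "555-0199"},  # shares phone with 2
        {"id": 4, "ssn": "444-55-6666", "phone": "555-0142"},  # stands alone
    ]

    G = nx.Graph()
    G.add_nodes_from(app["id"] for app in applications)

    # Link applications that reuse an identifier.
    by_value = defaultdict(list)
    for app in applications:
        for field in ("ssn", "phone"):
            by_value[(field, app[field])].append(app["id"])
    for ids in by_value.values():
        for a, b in combinations(ids, 2):
            G.add_edge(a, b)

    # Larger components are candidates for review, not proof of fraud.
    for ring in nx.connected_components(G):
        if len(ring) > 2:
            print("review:", sorted(ring))  # -> review: [1, 2, 3]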

September 13, 2012

4th Workshop on Complex Networks, 2013

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 10:42 am

4th Workshop on Complex Networks, 2013

From the call for papers:

The 4th international workshop on complex networks (CompleNet 2013) aims at bringing together researchers and practitioners working on areas related to complex networks. In the past two decades we have been witnessing an exponential increase on the number of publications in this field. From biological systems to computer science, from economic to social systems, complex networks are becoming pervasive in many fields of science. It is this interdisciplinary nature of complex networks that this workshop aims at addressing.

Authors are encouraged to submit previously unpublished papers/abstracts on their research in complex networks. Both theoretical and applied papers are of interest. Specific topics of interest are (but not limited to):

  • Applications of Network Science
  • Behavioral & Social Influence
  • Community Structure in Networks
  • Complex Network in Technology
  • Complex Networks and Epidemics
  • Complex Networks and Mobility
  • Complex Networks in Biological Systems
  • Emergence in Complex Networks
  • Geometry in Complex Networks
  • Information Spreading in Social Media
  • Link Analysis and Ranking
  • Modeling Human Behavior in Complex Networks
  • Models of Complex Networks
  • Network Evolution
  • Networks as Frameworks
  • Rumor Spreading
  • Search in Complex Networks
  • Shocks and Bursts
  • Social Networks
  • Structural Network Properties and Analysis
  • Synchronization in Networks

Prior programs as a guide to what you will encounter:

CompleNet 2012

CompleNet 2010 (PDF file)

September 10, 2012

Graphlet-based edge clustering reveals pathogen-interacting proteins

Filed under: Clustering,Graphs,Networks,Similarity,Topology — Patrick Durusau @ 1:06 pm

Graphlet-based edge clustering reveals pathogen-interacting proteins by R. W. Solava, R. P. Michaels and T. Milenković. (Bioinformatics (2012) 28 (18): i480-i486. doi: 10.1093/bioinformatics/bts376)

Abstract:

Motivation: Prediction of protein function from protein interaction networks has received attention in the post-genomic era. A popular strategy has been to cluster the network into functionally coherent groups of proteins and assign the entire cluster with a function based on functions of its annotated members. Traditionally, network research has focused on clustering of nodes. However, clustering of edges may be preferred: nodes belong to multiple functional groups, but clustering of nodes typically cannot capture the group overlap, while clustering of edges can. Clustering of adjacent edges that share many neighbors was proposed recently, outperforming different node clustering methods. However, since some biological processes can have characteristic ‘signatures’ throughout the network, not just locally, it may be of interest to consider edges that are not necessarily adjacent.

Results: We design a sensitive measure of the ‘topological similarity’ of edges that can deal with edges that are not necessarily adjacent. We cluster edges that are similar according to our measure in different baker’s yeast protein interaction networks, outperforming existing node and edge clustering approaches. We apply our approach to the human network to predict new pathogen-interacting proteins. This is important, since these proteins represent drug target candidates.

Availability: Software executables are freely available upon request.

Contact: tmilenko@nd.edu

Of interest for bioinformatics but more broadly for its insights into topological similarity and edge clustering by topological similarity.

Be mindful that an “edge” is, for all intents and purposes, a “node” that records a connection between two other nodes (setting aside hyperedges and self-loops). Nodes could record their connections to other nodes, but generally don’t; that connection is represented by an edge.
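
A crude stand-in for “topological similarity” helps make the idea concrete (the paper’s measure is graphlet-based; this is plain neighborhood overlap): score two edges, adjacent or not, by the Jaccard similarity of their combined endpoint neighborhoods.

    def edge_similarity(graph, e1, e2):
        """Jaccard overlap of the neighborhoods around two edges.
        A crude proxy; Solava et al. use graphlet degrees instead."""
        def around(edge):
            u, v = edge
            return graph[u] | graph[v] | {u, v}
        a, b = around(e1), around(e2)
        return len(a & b) / len(a | b)

    # Adjacency sets; the measure also works for non-adjacent edges.
    graph = {
        "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
        "d": {"c", "e"}, "e": {"d"},
    }
    print(edge_similarity(graph, ("a", "b"), ("c", "d")))  # 0.6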

September 9, 2012

Reverse engineering of gene regulatory networks from biological data [self-conscious?]

Filed under: Bioinformatics,Biomedical,Networks — Patrick Durusau @ 4:25 pm

Reverse engineering of gene regulatory networks from biological data by Li-Zhi Liu, Fang-Xiang Wu, Wen-Jun Zhang. (Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 2, Issue 5, pages 365–385, September/October 2012)

Abstract:

Reverse engineering of gene regulatory networks (GRNs) is one of the most challenging tasks in systems biology and bioinformatics. It aims at revealing network topologies and regulation relationships between components from biological data. Owing to the development of biotechnologies, various types of biological data are collected from experiments. With the availability of these data, many methods have been developed to infer GRNs. This paper firstly provides an introduction to the basic biological background and the general idea of GRN inferences. Then, different methods are surveyed from two aspects: models that those methods are based on and inference algorithms that those methods use. The advantages and disadvantages of these models and algorithms are discussed.

As you might expect, heterogeneous data is one topic of interest in this paper:

Models Based on Heterogeneous Data

Besides the dimensionality problem, the data from microarray experiments always contain many noises and measurement errors. Therefore, an accurate network can hardly be obtained due to the limited information in microarray data. With the development of technologies, a large amount of other diverse types of genomic data are collected. Many researchers are motivated to study GRNs by combining these data with microarray data. Because different types of the genomic data reflect different aspects of underlying networks, the inferences of GRNs based on the integration of different types of data are expected to provide more accurate and reliable results than based on microarray data alone. However, effectively integrating heterogeneous data is currently a hot research topic and a nontrivial task because they are generally collected along with much noise and related to each other in a complex way. (emphasis added)

Truth be known, high dimensionality and heterogeneous data are more accurate reflections of the objects of our study.

Conversely, the lower the dimensions of a model or the greater the homogeneity of the data, the less accurate they become.

Are we creating less accurate reflections to allow for the inabilities of our machines?

Will that make our machines less self-conscious about their limitations?

Or will that make us less self-conscious about our machines’ limitations?

September 7, 2012

Networks, Crowds, and Markets

Filed under: Graphs,Networks — Patrick Durusau @ 10:46 am

Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. (post by Max De Marzi)

Max has pointers to the Cambridge University Press, web, and PDF versions of the book, along with three (3) Jon Kleinberg videos and the complete table of contents.

At 833 pages, this should keep you occupied for a while.

Graph Motif Resume

Filed under: Graph Coloring,Graphs,Networks — Patrick Durusau @ 9:01 am

An (almost complete) state of the art around the Graph Motif problem by Florian Sikora.

A listing of current (March, 2012) results on the Graph Motif problem, including references to software.

Starts with an intuitive illustration of the problem.

Makes it easy to see why this problem is going to command a lot of attention as more complex (and realistic) modeling becomes commonplace.

Or to put it another way, normalized data is just that, normalized data. That’s why we don’t call it reality.
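
For newcomers, the problem in one gulp: given a vertex-colored graph and a multiset of colors, is there a connected subgraph whose vertices carry exactly that multiset? A brute-force sketch (exponential; the parameterized algorithms in the listing do far better):

    from collections import Counter
    from itertools import combinations

    def has_motif(adj, colors, motif):
        """Try every vertex subset of the motif's size, checking the
        color multiset and connectivity. For illustration only."""
        k, target = len(motif), Counter(motif)
        for subset in combinations(adj, k):
            if Counter(colors[v] for v in subset) != target:
                continue
            seen, stack = {subset[0]}, [subset[0]]  # DFS inside subset
            while stack:
                v = stack.pop()
                for w in adj[v]:
                    if w in subset and w not in seen:
                        seen.add(w)
                        stack.append(w)
            if len(seen) == k:
                return True
        return False

    adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
    colors = {1: "red", 2: "blue", 3: "red", 4: "blue"}
    print(has_motif(adj, colors, ["red", "blue", "red"]))  # True: {1, 2, 3}
    print(has_motif(adj, colors, ["blue", "blue"]))        # False: 2, 4 not connected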

Graph motif in O*(2^k) time by narrow sieves [Comparing Graph Software/Databases]

Filed under: Graphs,Networks — Patrick Durusau @ 8:51 am

Graph motif in O*(2^k) time by narrow sieves by Lukasz Kowalik.

Abstract:

We show an O*(2^k)-time polynomial space algorithm for the Graph Motif problem. Moreover, we show an evidence that our result might be essentially tight: the existence of an O((2-\epsilon)^k)-time algorithm for the Graph Motif problem implies an O((2-\epsilon’)^n)-time algorithm for Set Cover.

The assault on graph problems continues!

It does make me wonder: Is there a comparison listing of graph software/databases and the algorithms they support?

It is one thing to support arbitrary nodes and edges, quite another to do something useful with them.

September 6, 2012

Progress on Partial Edge Drawings

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 4:55 pm

Progress on Partial Edge Drawings by Till Bruckdorfer, Sabine Cornelsen, Carsten Gutwenger, Michael Kaufmann, Fabrizio Montecchiani, Martin Nöllenburg and Alexander Wolff.

Abstract:

Recently, a new way of avoiding crossings in straight-line drawings of non-planar graphs has been investigated. The idea of partial edge drawings (PED) is to drop the middle part of edges and rely on the remaining edge parts called stubs. We focus on a symmetric model (SPED) that requires the two stubs of an edge to be of equal length. In this way, the stub at the other endpoint of an edge assures the viewer of the edge’s existence. We also consider an additional homogeneity constraint that forces the stub lengths to be a given fraction $\delta$ of the edge lengths ($\delta$-SHPED). Given length and direction of a stub, this model helps to infer the position of the opposite stub.

We show that, for a fixed stub–edge length ratio $\delta$, not all graphs have a $\delta$-SHPED. Specifically, we show that $K_{241}$ does not have a 1/4-SHPED, while bandwidth-$k$ graphs always have a $\Theta(1/\sqrt{k})$-SHPED. We also give bounds for complete bipartite graphs. Further, we consider the problem \textsc{MaxSPED} where the task is to compute the SPED of maximum total stub length that a given straight-line drawing contains. We present an efficient solution for 2-planar drawings and a 2-approximation algorithm for the dual problem.

I like the hairball, brightly colored graphs as much as anyone but have to confess that discerning useful information from them is problematic.

As graphs become more popular as a methodology, I suspect you will see more and more “default” presentations of hairball visualizations.

This and similar research may help you move beyond cluttered visualizations to useful ones.
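
The geometry behind a $\delta$-SHPED is simple enough to sketch: keep only the first and last fraction $\delta$ of every edge. A throwaway example with $\delta$ = 1/4, the ratio from the $K_{241}$ result:

    def shped_stubs(p, q, delta=0.25):
        """Return the two stubs of edge (p, q): the fraction delta at
        each end, dropping the middle 1 - 2*delta of the segment."""
        (x1, y1), (x2, y2) = p, q
        at = lambda t: (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
        return (p, at(delta)), (at(1.0 - delta), q)

    stub_a, stub_b = shped_stubs((0.0, 0.0), (8.0, 4.0))
    print(stub_a)  # ((0.0, 0.0), (2.0, 1.0))
    print(stub_b)  # ((6.0, 3.0), (8.0, 4.0))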

A dynamic data structure for counting subgraphs in sparse graphs

Filed under: Graphs,Networks,Subject Identity — Patrick Durusau @ 4:35 pm

A dynamic data structure for counting subgraphs in sparse graphs by Zdenek Dvorak and Vojtech Tuma.

Abstract:

We present a dynamic data structure representing a graph G, which allows addition and removal of edges from G and can determine the number of appearances of a graph of a bounded size as an induced subgraph of G. The queries are answered in constant time. When the data structure is used to represent graphs from a class with bounded expansion (which includes planar graphs and more generally all proper classes closed on topological minors, as well as many other natural classes of graphs with bounded average degree), the amortized time complexity of updates is polylogarithmic.

Work on data structures seems particularly appropriate when discussing graphs.

Subject identity, beyond string equivalence, can be seen as a graph isomorphism or subgraph problem.

Has anyone proposed “bounded” subject identity mechanisms that correspond to the bounds necessary on graphs to make them processable?

We know how to do string equivalence and the “ideal” solution would be unlimited relationships to other subjects, but that is known to be intractable. For one thing we don’t know every relationship for any subject.

Thinking there may be boundary conditions for constructing subject identities that are more complex than string equivalence but that result in tractable identifications.

Suggestions?

Ternary graph isomorphism in polynomial time, after Luks

Filed under: Graphs,Networks,Python — Patrick Durusau @ 4:04 pm

Ternary graph isomorphism in polynomial time, after Luks by Adria Alcala Mena and Francesc Rossello.

Abstract:

The graph isomorphism problem has a long history in mathematics and computer science, with applications in computational chemistry and biology, and it is believed to be neither solvable in polynomial time nor NP-complete. E. Luks proposed in 1982 the best algorithm so far for the solution of this problem, which moreover runs in polynomial time if an upper bound for the degrees of the nodes in the graphs is taken as a constant. Unfortunately, Luks’ algorithm is purely theoretical, very difficult to use in practice, and, in particular, we have not been able to find any implementation of it in the literature. The main goal of this paper is to present an efficient implementation of this algorithm for ternary graphs in the SAGE system, as well as an adaptation to fully resolved rooted phylogenetic networks on a given set of taxa.

Building on his masters thesis, Adria focuses on implementation issues of Luks’ graph isomorphism algorithm.
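
If you need isomorphism tests in practice, rather than Luks’ worst-case guarantees, a generic backtracking matcher such as VF2 is the usual tool. A minimal sketch with networkx (my example, unrelated to the SAGE implementation in the paper):

    import networkx as nx  # assumes networkx is installed

    # Two trivalent (3-regular) graphs on four vertices: both are K4.
    g1 = nx.complete_graph(4)
    g2 = nx.relabel_nodes(nx.complete_graph(4),
                          {0: "a", 1: "b", 2: "c", 3: "d"})

    # VF2 backtracking: exponential in the worst case, fast in practice.
    print(nx.is_isomorphic(g1, g2))  # True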

Trivalent Graph isomorphism in polynomial time

Filed under: Graphs,Networks — Patrick Durusau @ 3:57 pm

Trivalent Graph isomorphism in polynomial time by Adria Alcala Mena.

Abstract:

It’s important to design polynomial time algorithms to test if two graphs are isomorphic at least for some special classes of graphs.

An approach to this was presented by Eugene M. Luks (1981) in the work Isomorphism of Graphs of Bounded Valence Can Be Tested in Polynomial Time. Unfortunately, it was a theoretical algorithm and was very difficult to put into practice. On the other hand, there is no known implementation of the algorithm, although Galil, Hoffman and Luks (1983) shows an improvement of this algorithm running in $O(n^3 \log n)$.

The two main goals of this master thesis are to explain more carefully the algorithm of Luks (1981), including a detailed study of the complexity, and then to provide an efficient implementation in the SAGE system. It is divided into four chapters plus an appendix.

Work like this makes graph isomorphism sound vulnerable.

Other resources you may find useful:

Eugene M. Luks (homepage)

Isomorphism of graphs of bounded valence can be tested in polynomial time, Eugene M. Luks, Journal of Computer and System Sciences, 25, 1982, 42-65.

Eugene M. Luks (DBLP)

TwitterScope gets best paper at GD

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 3:24 pm

TwitterScope gets best paper at GD by David Eppstein.

From the post:

This year’s Graph Drawing Symposium is coming up in less than two weeks, and has already announced the winners of the best paper awards in each of the two submission tracks, “combinatorial and algorithmic aspects” and “visualization systems and interfaces”. The winner in the theory track was my paper on Lombardi drawing, but I already posted here about that, so instead I wanted to say some more about the other winner, “Visualizing Streaming Text Data with Dynamic Graphs and Maps” by Emden Gansner, Yifan Hu, and Stephen North. A preprint version of their paper is online at arXiv:1206.3980.

The paper describes the TwitterScope project, which provides visualization tools for high-volume streams of text data (e.g. from Twitter). Currently it exists as a publicly viewable prototype, set up to choose among nine preselected topics. Recent tweets are shown as small icons, grouped into colored regions (representing subtopics) within what looks like a map. Hovering the mouse over an icon shows the corresponding tweet. It’s updated dynamically as new tweets come in, and has a timeline for viewing past tweets. My feeling from the description is that the work involved in putting this system together was less about coming up with new technical methods for visualization (although there is some of that there, particularly in how they handle disconnected graphs) and more about making a selection among many different ideas previously seen in this area and putting them together into a single coherent and well-engineered system. Which I guess should be what that track is all about.

My musings on the same paper: Visualizing Streaming Text Data with Dynamic Maps.

Human Ignorance, Deep and Profound

Filed under: Bioinformatics,Biomedical,Graphs,Networks — Patrick Durusau @ 3:33 am

Scientists have discovered over 4 million gene switches, formerly known as “junk” (a scientific shorthand for “we don’t know what this means”), in the human genome. From the New York Times article Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role (Gina Kolata):

…The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches.

The findings, which are the fruit of an immense federal project involving 440 scientists from 32 laboratories around the world, will have immediate applications for understanding how alterations in the non-gene parts of DNA contribute to human diseases, which may in turn lead to new drugs. They can also help explain how the environment can affect disease risk. In the case of identical twins, small changes in environmental exposure can slightly alter gene switches, with the result that one twin gets a disease and the other does not.

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. The result of the work is an annotated road map of much of this DNA, noting what it is doing and how. It includes the system of switches that, acting like dimmer switches for lights, control which genes are used in a cell and when they are used, and determine, for instance, whether a cell becomes a liver cell or a neuron.

Reminds me of the discovery that glial cells aren’t packing material to support neurons. We were missing about half the human brain by size.

While I find both discoveries exciting, I am also mindful that we are not getting any closer to complete knowledge.

Rather, it opens up opportunities to correct prior mistakes and, at some future time, to discover present ones.

PS: As you probably suspect, relationships between gene switches are extremely complex. New graph databases/algorithms anyone?

September 2, 2012

Entity disambiguation using semantic networks

Filed under: Entity Resolution,Graphs,Networks,Semantic Graph — Patrick Durusau @ 7:20 pm

Entity disambiguation using semantic networks by Jorge H. Román, Kevin J. Hulin, Linn M. Collins and James E. Powell. Journal of the American Society for Information Science and Technology, published 29 August 2012.

Abstract:

A major stumbling block preventing machines from understanding text is the problem of entity disambiguation. While humans find it easy to determine that a person named in one story is the same person referenced in a second story, machines rely heavily on crude heuristics such as string matching and stemming to make guesses as to whether nouns are coreferent. A key advantage that humans have over machines is the ability to mentally make connections between ideas and, based on these connections, reason how likely two entities are to be the same. Mirroring this natural thought process, we have created a prototype framework for disambiguating entities that is based on connectedness. In this article, we demonstrate it in the practical application of disambiguating authors across a large set of bibliographic records. By representing knowledge from the records as edges in a graph between a subject and an object, we believe that the problem of disambiguating entities reduces to the problem of discovering the most strongly connected nodes in a graph. The knowledge from the records comes in many different forms, such as names of people, date of publication, and themes extracted from the text of the abstract. These different types of knowledge are fused to create the graph required for disambiguation. Furthermore, the resulting graph and framework can be used for more complex operations.

To give you a sense of the author’s approach:

A semantic network is the underlying information representation chosen for the approach. The framework uses several algorithms to generate subgraphs in various dimensions. For example: a person’s name is mapped into a phonetic dimension, the abstract is mapped into a conceptual dimension, and the rest are mapped into other dimensions. To map a name into its phonetic representation, an algorithm translates the name of a person into a sequence of phonemes. Therefore, two names that are written differently but pronounced the same are considered to be the same in this dimension. The “same” qualification in one of these dimensions is then used to identify potential coreferent entities. Similarly, an algorithm for generating potential alternate spellings of a name has been used to find entities for comparison with similarly spelled names by computing word distance.

The hypothesis underlying our approach is that coreferent entities are strongly connected on a well-constructed graph.

Question: What if the nodes to which the coreferent entities are strongly connected are themselves ambiguous?
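
To make the phonetic dimension concrete, here is classic Soundex as a stand-in for whatever phoneme mapping the authors actually use (the excerpt does not name their algorithm):

    def soundex(name):
        """Classic Soundex: names pronounced alike get the same code.
        A stand-in for the paper's phoneme mapping, not a quote of it."""
        codes = {**dict.fromkeys("bfpv", "1"),
                 **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = name.lower()
        out, last = name[0].upper(), codes.get(name[0], "")
        for ch in name[1:]:
            code = codes.get(ch, "")
            if code and code != last:
                out += code
            if ch not in "hw":  # h/w do not reset the previous code
                last = code
        return (out + "000")[:4]

    # Written differently, pronounced alike: same in this dimension.
    print(soundex("Smith"), soundex("Smyth"))  # S530 S530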

Efficient Subgraph Matching on Billion Node Graphs [Parallel Graph Processing]

Filed under: Graphs,Neo4j,Networks,Trinity — Patrick Durusau @ 4:31 pm

Efficient Subgraph Matching on Billion Node Graphs by Zhao Sun (Fudan University, China), Hongzhi Wang (Harbin Institute of Technology, China), Haixun Wang (Microsoft Research Asia, China), Bin Shao (Microsoft Research Asia, China) and Jianzhong Li (Harbin Institute of Technology, China).

Abstract:

The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.

Did you say you were interested in parallel graph processing?

This paper and the materials cited in the bibliography make a nice introduction to the current options for graph processing.

I first saw this at Alex Popescu’s myNoSQL, citing it from the VLDB proceedings.

With the DBLP enhanced version of the VLDB proceedings, VLDB 2012 Ice Breaker v0.1, DBLP links for the authors were easy.
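
The core operation, stripped of indices and distribution, is backtracking exploration of the data graph. A toy single-machine version (mine, not the paper’s code):

    def match_subgraph(pattern, data):
        """Find one embedding of `pattern` in `data` (both given as
        adjacency sets) by plain backtracking. The paper distributes
        this exploration over a memory cloud; this version is naive."""
        p_nodes = list(pattern)

        def extend(mapping):
            if len(mapping) == len(p_nodes):
                return dict(mapping)
            pv = p_nodes[len(mapping)]
            for dv in data:
                if dv in mapping.values():
                    continue
                # mapped pattern neighbors must map to data neighbors
                if all(mapping[pu] in data[dv]
                       for pu in pattern[pv] if pu in mapping):
                    mapping[pv] = dv
                    result = extend(mapping)
                    if result:
                        return result
                    del mapping[pv]
            return None

        return extend({})

    pattern = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}  # a 3-node path
    data = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}        # a 4-node path
    print(match_subgraph(pattern, data))  # {'x': 1, 'y': 2, 'z': 3}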

August 24, 2012

Giant Network Links All Known Compounds and Reactions

Filed under: Cheminformatics,Networks — Patrick Durusau @ 2:01 pm

Let’s start with the “popular” version: Scientists Create Chemical ‘Brain’: Giant Network Links All Known Compounds and Reactions

From the post:

Northwestern University scientists have connected 250 years of organic chemical knowledge into one giant computer network — a chemical Google on steroids. This “immortal chemist” will never retire and take away its knowledge but instead will continue to learn, grow and share.

A decade in the making, the software optimizes syntheses of drug molecules and other important compounds, combines long (and expensive) syntheses of compounds into shorter and more economical routes and identifies suspicious chemical recipes that could lead to chemical weapons.

“I realized that if we could link all the known chemical compounds and reactions between them into one giant network, we could create not only a new repository of chemical methods but an entirely new knowledge platform where each chemical reaction ever performed and each compound ever made would give rise to a collective ‘chemical brain,'” said Bartosz A. Grzybowski, who led the work. “The brain then could be searched and analyzed with algorithms akin to those used in Google or telecom networks.”

Called Chematica, the network comprises some seven million chemicals connected by a similar number of reactions. A family of algorithms that searches and analyzes the network allows the chemist at his or her computer to easily tap into this vast compendium of chemical knowledge. And the system learns from experience, as more data and algorithms are added to its knowledge base.

Details and demonstrations of the system are published in three back-to-back papers in the Aug. 6 issue of the journal Angewandte Chemie.

Well, true enough, except for the “share” part. Chematica is in the process of being commercialized.

If you are interested in the non-“popular” version:

Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry (pages 7922–7927) by Dr. Chris M. Gothard, Dr. Siowling Soh, Nosheen A. Gothard, Dr. Bartlomiej Kowalczyk, Dr. Yanhu Wei, Dr. Bilge Baytekin and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202155.

Abstract:

Computational algorithms are used to identify sequences of reactions that can be performed in one pot. These predictions are based on over 86 000 chemical criteria by which the putative sequences are evaluated. The “raw” algorithmic output is then validated experimentally by performing multiple two-, three-, and even four-step sequences. These sequences “rewire” synthetic pathways around popular and/or important small molecules.

Parallel Optimization of Synthetic Pathways within the Network of Organic Chemistry (pages 7928–7932) by Dr. Mikołaj Kowalik, Dr. Chris M. Gothard, Aaron M. Drews, Nosheen A. Gothard, Alex Weckiewicz, Patrick E. Fuller, Prof. Bartosz A. Grzybowski and Prof. Kyle J. M. Bishop. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202209.

Abstract:

Finding a needle in a haystack: The number of possible synthetic pathways leading to the desired target of a synthesis can be astronomical (10^19 within five synthetic steps). Algorithms are described that navigate through the entire known chemical-synthetic knowledge to identify optimal synthetic pathways. Examples are provided to illustrate single-target optimization and parallel optimization of syntheses leading to multiple targets.

Chemical Network Algorithms for the Risk Assessment and Management of Chemical Threats (pages 7933–7937) by Patrick E. Fuller, Dr. Chris M. Gothard, Nosheen A. Gothard, Alex Weckiewicz and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202210.

Abstract:

A network of chemical threats: Current regulatory protocols are insufficient to monitor and block many short-route syntheses of chemical weapons, including those that start from household products. Network searches combined with game-theory algorithms provide an effective means of identifying and eliminating chemical threats. (Picture: an algorithm-detected pathway that yields sarin (bright red node) in three steps from unregulated substances.)

Do you see any potential semantic issues in such a network, arising as our understanding of reactions changes?

Recalling that semantics isn’t simply a question of yesterday, today and tomorrow but also of tomorrows, 10, 50, or 100 or more years from now.

We may fancy our present understanding as definitive, but it is just a fancy.
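
The route-finding piece is, at heart, search over a bipartite network of compounds and reactions. A toy sketch (invented compounds and reactions, nothing from Chematica):

    # Reactions consume a set of compounds and produce another set.
    # All names are made up for illustration.
    reactions = {
        "r1": ({"A", "B"}, {"C"}),  # A + B -> C
        "r2": ({"C"}, {"D"}),       # C -> D
        "r3": ({"B"}, {"E"}),       # B -> E
    }

    def route_to(start_compounds, target):
        """Expand wave by wave: fire every reaction whose inputs are
        in hand, until the target appears or nothing new fires."""
        have, route = set(start_compounds), []
        while target not in have:
            fired = [r for r, (ins, outs) in reactions.items()
                     if ins <= have and not outs <= have]
            if not fired:
                return None  # target unreachable from these inputs
            for r in fired:
                have |= reactions[r][1]
            route.append(sorted(fired))
        return route

    print(route_to({"A", "B"}, "D"))  # [['r1', 'r3'], ['r2']]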

August 20, 2012

Deploying Neo4j Graph Database Server across AWS regions with Puppet

Filed under: Amazon Web Services AWS,Graphs,Neo4j,Networks — Patrick Durusau @ 7:36 am

Deploying Neo4j Graph Database Server across AWS regions with Puppet by Jussi Heinonen.

From the post:

It’s been more than a year now since I rolled out Neo4j Graph Database Server image in Amazon EC2.

In May 2011 the version of Neo4j was 1.3 and just recently guys at Neo Technologies published version 1.7.2 so I thought now is the time to revisit this exercise and make fresh AMIs available.

Last year I created Neo4j AMI manually in one region then copied it across to the remaining AWS regions. Due to the size of the AMI and the latency between regions this process was slow.

If you aren’t already familiar with AWS, perhaps this will be your incentive to take the plunge.

Learning Puppet and Neo4j is just a lagniappe.

August 17, 2012

Character social networks in movies

Filed under: Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 4:28 pm

Character social networks in movies

Nathan Yau started the weekend early with:

We’ve seen a lot of network charts for Twitter, Facebook, and real people. Screw that. I want to see social networks for movie characters. That’s where Movie Galaxies comes in.

That’s really cool but what if you combined a topic map with a social graph?

Suppose everyone contributed their vision of the social graph at their office, homeroom, class, etc.

Then the social graphs could be merged and viewed from different perspectives.
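
The merge itself is the easy part once identities line up; the hard part, which a topic map would handle, is deciding that the “Jo” in the office graph and the “Jo” in the classroom graph are the same person. A sketch assuming the networkx library:

    import networkx as nx  # assumes networkx is installed

    office = nx.Graph([("Jo", "Sam"), ("Sam", "Lee")])
    classroom = nx.Graph([("Jo", "Ana"), ("Ana", "Lee")])

    # compose() unions nodes and edges, merging nodes on equal labels,
    # which is exactly the identity question a topic map would answer.
    merged = nx.compose(office, classroom)
    print(sorted(map(sorted, merged.edges())))
    # [['Ana', 'Jo'], ['Ana', 'Lee'], ['Jo', 'Sam'], ['Lee', 'Sam']]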

August 14, 2012

Tracking Down an Epidemic’s Source

Filed under: Networks,Record Linkage — Patrick Durusau @ 12:49 pm

Tracking Down an Epidemic’s Source (Physics 5, 89 (2012) | DOI: 10.1103/Physics.5.89)

From the post:

Epidemiologists often have to uncover the source of a disease outbreak with only limited information about who is infected. Mathematical models usually assume a complete dataset, but a team reporting in Physical Review Letters demonstrates how to find the source with very little data. Their technique is based on the principles used by telecommunication towers to pinpoint cell phone users, and they demonstrate its effectiveness with real data from a South African cholera outbreak. The system could also work with other kinds of networks to help governments locate contamination sources in water systems or find the leaders in a network of terrorist contacts.

A rumor can spread across a user network on Twitter, just as a disease spreads throughout a network of personal contacts. But there’s a big difference when it comes to tracking down the source: online social networks have volumes of time-stamped data, whereas epidemiologists usually have information from only a fraction of the infected individuals.

To address this problem, Pedro Pinto and his colleagues at the Swiss Federal Institute of Technology in Lausanne (EPFL) developed a model based on the standard network picture for epidemics. Individuals are imagined as points, or “nodes,” in a plane, connected by a network of lines. Each node has several lines connecting it to other nodes, and each node can be either infected or uninfected. In the team’s scenario, all nodes begin the process uninfected, and a single source node spreads the infection from neighbor to neighbor, with a random time delay for each transmission. Eventually, every node becomes infected and records both its time of infection and the identity of the infecting neighbor.

To trace back to the source using data from a fraction of the nodes, Pinto and his colleagues adapted methods used in wireless communications networks. When three or more base stations receive a signal from one cell phone, the system can measure the difference in the signal’s arrival time at each base station to triangulate a user’s position. Similarly, Pinto’s team combined the arrival times of the infection at a subset of “observer” nodes to find the source. But in the infection network, a given arrival time could correspond to multiple transmission paths, and the time from one transmission to the next varies randomly. To improve their chances of success, the team used the fact that the source had to be one of a finite set of nodes, unlike a cell phone user, who could have any of an infinite set of coordinates within the coverage area.

Summarizes: Locating the Source of Diffusion in Large-Scale Networks, by Pedro C. Pinto, Patrick Thiran, and Martin Vetterli, Phys. Rev. Lett. 109, 068702 (2012).
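
A toy version of the idea, with unit transmission delays and freely invented data, is almost embarrassingly simple: pick the node whose graph distances best explain the observed infection times. The contribution of Pinto et al. is making this work with sparse observers and random delays.

    from collections import deque

    def bfs_dist(adj, src):
        dist, q = {src: 0}, deque([src])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        return dist

    def locate_source(adj, observed):
        """Guess the source as the node whose BFS distances to the
        observers best fit their infection times (least squares over
        an unknown common start time). Assumes a connected graph."""
        best, best_err = None, float("inf")
        for cand in adj:
            d = bfs_dist(adj, cand)
            offsets = [observed[o] - d[o] for o in observed]
            mean = sum(offsets) / len(offsets)
            err = sum((x - mean) ** 2 for x in offsets)
            if err < best_err:
                best, best_err = cand, err
        return best

    adj = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5}, 5: {4}}
    observed = {1: 2.0, 3: 2.1, 5: 3.9}  # infection times at observers
    print(locate_source(adj, observed))  # -> 2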

One wonders if participation in multiple networks, some social, some electronic, some organizational, would be amenable to record linkage type techniques?

Leaks from government could be tracked using only one type of network but that is likely to be incomplete and misleading.

August 13, 2012

Caleydo Project

Filed under: Bioinformatics,Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 4:04 pm

Caleydo Project

From the webpage:

Caleydo is an open source visual analysis framework targeted at biomolecular data. The biggest strength of Caleydo is the visualization of interdependencies between multiple datasets. Caleydo can load tabular data and groupings/clusterings. You can explore relationships between multiple groupings, between different datasets and see how your data maps onto pathways.

Caleydo has been successfully used to analyze mRNA, miRNA, methylation, copy number variation, mutation status and clinical data as well as other dataset types.

The screenshot from mybiosoftware.com really caught my attention:

Caleydo Screenshot

Targets biomolecular data but may have broader applications.

August 12, 2012

4th International Workshop on Graph Data Management: Techniques and Applications (GDM 2013)

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 6:13 pm

4th International Workshop on Graph Data Management: Techniques and Applications (GDM 2013)

Important Dates:

Paper submission deadline: October 7, 2012
Author Notification: November 21, 2012
Final Camera-ready Copy Deadline: November 28, 2012
Workshop: April 11, 2013 (Brisbane, Australia)

From the call for papers:

The GDM workshop targets researchers that concentrate on all aspects of managing graph data such as: “How to store graph data?”, “How to efficiently index graph data?” and “How to query graph databases?”. Hence, we are interested in applications of graph databases in different practical domains. The workshop invites original research contributions as well as reports on prototype systems from research communities dealing with different theoretical and applied aspects of graph data management. Submitted papers will be evaluated on the basis of significance, originality, technical quality, and exposition. Papers should clearly establish the research contribution, and relation to previous research. Position and survey papers are also welcome.

Topics of interest include, but are not limited to:

  • Methods/Techniques for storing, indexing and querying graph data.
  • Methods/Techniques for estimating the selectivity of graph queries.
  • Methods/Techniques for graph mining.
  • Methods/Techniques for compact (compressed) representation of graph data.
  • Methods/Techniques for measuring graph similarity.
  • Methods/Techniques for large scale and distributed graph processing.
  • Tools/Techniques for graph data management for social network applications.
  • Tools/Techniques for graph data management of chemical compounds.
  • Tools/Techniques for graph data management of protein networks.
  • Tools/Techniques for graph data management of multimedia databases.
  • Tools/Techniques for graph data management of semantic web data (RDF).
  • Tools/Techniques for graph data management for spatio-temporal applications.
  • Tools/Techniques for graph data management for Business Process Management applications.
  • Tools/Techniques for visualizing, browsing, or navigating graph data
  • Analysis/Proposals for graph query languages.
  • Advanced applications and tools for managing graph databases in different domains.
  • Benchmarking and testing of graph data management techniques.

Being held in conjunction with the 29th IEEE International Conference on Data Engineering, Brisbane, Australia, April 8-12, 2013.

August 11, 2012

Scale and NoSQL Data Models

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 3:42 pm

I am sure you have seen the Neo4j graphic:

NoSQL graphic

in almost every Neo4j presentation.

Seeing the graphic dozens, if not hundreds of times, made me realize it has two fundamental flaws.

First, if the dotted line represents 90% on the size axis, the scale of the size axis must change at the 90% mark or thereabouts.

Otherwise, key/value stores are at 180% of size. A marketing point for them but an unlikely one for anyone to credit.

Second, the complexity axis has no scale at all. Or at least not one that I can discern.

If you take a standard document database, say a CMS system, why is it more complex than a key/value store?

Or a bigtable clone for that matter?

Don’t get me wrong, I still think the future of data processing lies with graph databases.

Or more accurately, with the explicit identification/representation of relationships.

But I don’t need misleading graphics to make that case.

August 7, 2012

RESQUE: Network reduction…. [Are you listening NSA?]

Filed under: Bioinformatics,Graphs,Networks — Patrick Durusau @ 4:23 pm

RESQUE: Network reduction using semi-Markov random walk scores for efficient querying of biological networks by Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon. (Bioinformatics (2012) 28 (16): 2129-2136. doi: 10.1093/bioinformatics/bts341)

Abstract:

Motivation: Recent technological advances in measuring molecular interactions have resulted in an increasing number of large-scale biological networks. Translation of these enormous network data into meaningful biological insights requires efficient computational techniques that can unearth the biological information that is encoded in the networks. One such example is network querying, which aims to identify similar subnetwork regions in a large target network that are similar to a given query network. Network querying tools can be used to identify novel biological pathways that are homologous to known pathways, thereby enabling knowledge transfer across different organisms.

Results: In this article, we introduce an efficient algorithm for querying large-scale biological networks, called RESQUE. The proposed algorithm adopts a semi-Markov random walk (SMRW) model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores, which are also iteratively re-estimated to improve accuracy until the best matching subnetwork emerges. We demonstrate that the proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology and yields accurate querying results.

Availability: The source code of RESQUE is freely available at http://www.ece.tamu.edu/~bjyoon/RESQUE/

If you promise not to tell, you can get a preprint version of the article at the source code link.

RESQUE: REduction-based scheme using Semi-Markov scores for network QUErying.

Sounds like a starting point if you are interested in the Visualize This! (NSA Network Visualization Contest).


If you know of anyone looking for a tele-commuting researcher who finds interesting things, pass my name along. Thanks!

Visualize This! (NSA Network Visualization Contest)

Filed under: Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 3:54 pm

Visualize This! (NSA Network Visualization Contest)

From the webpage:

Are you a visual designer who can distill complex ideas and information into a clear and elegant display? Do you love testing your skills against the most difficult creative challenges? Then Visualize This! is the competition for you! The National Security Agency is looking for breakthrough visualizations that can bring order to the chaotic displays of large-scale computer networks.

Network performance and security depend on being able to quickly and effectively identify the changes occurring in network activity. However, current visualization tools are not suited to displaying these changes in ways that are clear and actionable.

You will be presented with a scenario of events and challenged to design a next-generation display that enables a network manager to immediately take appropriate action. This is the opportunity to channel your creative energies toward safer networks – make your mark on the world VISIBLE!

The challenge is due to appear in about two (2) months but I thought you might want to start sharpening your graph, tuple and topic map skills on simulated network topologies ahead of time.

I wonder if “clear and actionable” includes targeting information? 😉

BTW, be aware that the NSA markets Renoir:

Renoir: General Network Visualization and Manipulation Program

and it is covered by US Patent 6,515,666 which reads in part:

Method for constructing graph abstractions

Abstract

A method of constructing graph abstractions using a computer is described. The abstraction is presented on a computer display and used by a human viewer to understand a more complicated set of raw graphs. The method provides rapid generation of an abstraction that offers an arbitrary composition graph of vertices into composite vertices, dispersing and marshaling of composite vertices, arbitrary hiding and showing of portions of the composition, and marking of points of elision.

Patents like this remind me I need to finish my patent on addition/subtraction, if someone hasn’t beaten me to it.

August 3, 2012

De novo assembly and genotyping of variants using colored de Bruijn graphs

Filed under: Bioinformatics,De Bruijn Graphs,Genome,Graphs,Networks — Patrick Durusau @ 2:06 pm

De novo assembly and genotyping of variants using colored de Bruijn graphs by Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek & Gil McVean. (Nature Genetics 44, 226–232 (2012))

Abstract:

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

You will need access to Nature Genetics, but this rounds out today’s posts on de Bruijn graphs with a recent research article.

Comments on the Cortex software appreciated.
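
The underlying structure is easy to sketch: (k-1)-mers as nodes, k-mers as edges, and a “color” per sample recording who contributed each k-mer; divergence between samples shows up as bubbles in the graph. A minimal illustration (my code, nothing from Cortex):

    from collections import defaultdict

    def colored_debruijn(samples, k=4):
        """Nodes are (k-1)-mers, edges are k-mers; each edge carries
        the set of samples ('colors') containing that k-mer."""
        edges = defaultdict(set)
        for color, seq in samples.items():
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                edges[(kmer[:-1], kmer[1:])].add(color)
        return edges

    # Reference vs. individual, differing by one substitution.
    samples = {"ref": "ACGTTACG", "ind": "ACGATACG"}
    for (u, v), colors in sorted(colored_debruijn(samples).items()):
        print(u, "->", v, sorted(colors))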

