Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 7, 2013

Distributed Graph Computing with Gremlin

Filed under: Distributed Systems,Faunus,Graph Databases,Graphs,Gremlin,Titan — Patrick Durusau @ 2:53 pm

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible to evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

Linked Data Platform 1.0 (W3C)

Filed under: Linked Data,W3C — Patrick Durusau @ 2:13 pm

Linked Data Platform 1.0 (W3C)

Abstract:

A set of best practices and simple approach for a read-write Linked Data architecture, based on HTTP access to web resources that describe their state using the RDF data model.

Just in case you ever encounter such a platform.

Complex Graphics (lattice) [Division of Labor?]

Filed under: Graphics,R,Visualization — Patrick Durusau @ 2:07 pm

Complex Graphics (lattice) by Dr. Tom Philippi.

From the webpage:

Clear communication of pattern via graphing data is no accident; a number of people spend their careers developing approaches based on human perceptions and cognitive science. While Edward Tufte and his “The Visual Display of Quantitative Information” is more widely known, William Cleveland’s “The Elements of Graphing Data” is perhaps more influential: reduced clutter, lowess or other smoothed curves through data, banking to 45° to emphasize variation in slopes, emphasizing variability as well as trends, paired plots to illustrate multiple components of the data such as fits and residuals, and dot plots all come from his work.

It should come as no surprise that tools to implement principles from the field of graphical communication have been developed in R. The trellis package was originally developed to implement many of William Cleveland’s methods in S-PLUS. Deepayan Sarkar wrote the lattice package as a port and extension of trellis graphs to R.

There is a second major package for advanced graphics in R; ggplot (now ggplot2), based on the Grammar of Graphics. Hadley Wickham wrote most of the ggplot2 package, as well as the book in the Use R! series on ggplot. My limited understanding of the Grammar of Graphics is that layers are specifications of data, mapping or transformations, geoms (geometric objects such as scatterplots or histograms), statistics (bin, smooth, density, etc.), and positions. Graphs are composed of defaults, one or more layers, scales, and coordinate systems. Again, each component has a default, so informative graphs may be produced by simple calls, but every detail may be tweaked if desired.

I do not recommend one package over the other. I started with lattice, and it may be a bit more complementary to analyses because of the ease of recasting formulas from analytical functions to formulas for lattice graphs. ggplot2 may be more familiar to folks used to Photoshop and other graphics and image processing tools, and it may be a better foundation for development over the next years. Both lattice and ggplot2 are built upon the grid graphics primitives, so a real wizard could compose graphics objects via a combination of both tools. This web page presents lattice graphics solely because I have more experience with them and thus understand them better.

Very thorough coverage of the lattice package for R from Dr. Tom Philippi of the National Park Service.

Includes examples and further resources.

Visualization of data is a useful division of labor.

Machines label, sort, display data based on our instructions, but users/viewers determine the significance, if any, of the display.

I first saw this in a tweet from One R Tip a Day.
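The layering idea in the quoted passage is easier to see in code. Here is a minimal sketch using plotnine, a Python port of the same Grammar of Graphics that ggplot2 implements (Python only to keep the examples on this page in one language); the data frame and column names are invented for illustration.

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point, stat_smooth, facet_wrap, labs

# Invented example data: two groups, a predictor and a response.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "score": [52, 57, 63, 70, 74, 48, 55, 59, 66, 71],
    "group": ["A"] * 5 + ["B"] * 5,
})

# Grammar-of-graphics style: defaults plus layers (data, aesthetic mapping,
# geoms, statistics) compose into a single plot object.
plot = (
    ggplot(df, aes(x="hours", y="score", color="group"))  # data + mapping
    + geom_point()                          # geometric object layer
    + stat_smooth(method="lm")              # statistic layer (linear fit)
    + facet_wrap("~group")                  # small multiples, Cleveland-style
    + labs(x="Hours studied", y="Test score")
)
print(plot)  # rendering the object draws the figure
```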

Neo4j 1.9.M05 released – wrapping up

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 1:30 pm

Neo4j 1.9.M05 released – wrapping up by Peter Neubauer.

From the post:

We are very proud to announce the next milestone of the Neo4j 1.9 release cycle. This time, we have been trying to introduce as few big changes as possible and instead concentrate on things that make the production environment a more pleasant experience. That means Monitoring, Cypher profiling, Java7 and High Availability were the targets for this work.

Everyone likes improvements, new features, etc.

I am leaning towards profiling cypher statements as my favorite in this release.

What’s yours?

March 6, 2013

Importing data into Neo4j – the spreadsheet way

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:23 pm

Importing data into Neo4j – the spreadsheet way by Rik Van Bruggen.

From the post:

I am sure that many of you are very technical people, very knowledgeable about all things Java, Dr. Who and many other things – but in case you have ever met me, you would probably have noticed that I am not. And I don’t want to be. I love technology, but have never had the talent, inclination or education to program – so I don’t. But I still want to get data into Neo4j – so how do I do that?

There are many technical tools out there (definitely look here, here and here), but I needed something simple. So my friend and colleague Michael Hunger came to the rescue, and offered some help to create a spreadsheet to import into Neo4j.

You will find the spreadsheet here, and you will find two components:

  1. an instruction sheet. I will get to that later.
  2. a data import sheet. Let’s look at that first.

Getting Neo4j closer to the average business user.

Are spreadsheets becoming (or are they already?) the bridge between “unstructured” data and more sophisticated data structures?

Thinking of tools like Data Explorer for example.

If so, focusing subject identity/mapping tools on spreadsheet tables might be a good move.
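A guess at the general approach (not the actual formulas in Michael Hunger’s spreadsheet): each row becomes a fragment of a Cypher CREATE statement. A minimal Python sketch of the same idea, with made-up column names and data:

```python
import csv
import io

# Hypothetical input: a people sheet and a relationships sheet exported as CSV.
people_csv = io.StringIO("name,city\nAlice,Ghent\nBob,Antwerp\n")
rels_csv = io.StringIO("from,to,type\nAlice,Bob,KNOWS\n")

statements = []
ids = {}
for i, row in enumerate(csv.DictReader(people_csv)):
    ident = f"n{i}"
    ids[row["name"]] = ident
    statements.append(
        f'({ident} {{name: "{row["name"]}", city: "{row["city"]}"}})'
    )
for row in csv.DictReader(rels_csv):
    statements.append(f'({ids[row["from"]]})-[:{row["type"]}]->({ids[row["to"]]})')

# One CREATE statement the Neo4j console can execute.
print("CREATE " + ",\n       ".join(statements))
```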

State Sequester Numbers [Is This Transparency?]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 7:22 pm

A great visualization of the impact of sequestration state by state.

And, a post on the process followed to produce the visualization.

The only caveat being that one person read the numbers from PDF files supplied by the White House and another person typed them into a spreadsheet.

Doable with a small data set such as this one, but why was it necessary at all?

Once you have the data in machine-readable form, the next step should be putting faces from the local community to the abstract categories.

Topic maps anyone?

Hadoop MapReduce: to Sort or Not to Sort

Filed under: Hadoop,MapReduce,Sorting — Patrick Durusau @ 7:22 pm

Hadoop MapReduce: to Sort or Not to Sort by Tendu Yogurtcu.

From the post:

What is the big deal about Sort? Sort is fundamental to the MapReduce framework, the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows native Hadoop sort to be replaced by an alternative sort implementation, for both Map and Reduce sides, i.e. it makes Sort phase pluggable.

[Image: MapReduce data flow, showing the Sort phase between Map and Reduce]

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

The use cases include:

  • Optimized sort implementations.
  • Hash-based aggregations.
  • Ability to run a job with a subset of data.
  • Optimized full joins.

See Tendu’s post for the details.
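The role of the Sort phase is easiest to see in a streaming job. Below is a generic Hadoop Streaming word count in Python (not Syncsort’s plug-in, just a sketch): the reducer only works because the framework sorts and groups the mapper output by key before it arrives, which is exactly the step that becomes pluggable.

```python
#!/usr/bin/env python
"""Generic Hadoop Streaming word count (sketch). Run the same script as the
mapper ("map") and as the reducer ("reduce"), e.g. via hadoop-streaming's
-mapper / -reducer options."""
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Relies on Hadoop having SORTED mapper output by key, so all counts
    # for one word arrive as consecutive lines.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

# A hash-based (unsorted) aggregation would instead keep a dict of
# word -> count in the reducer, trading memory for skipping the sort.
if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```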

I first saw this at Use Cases for Hadoop’s New Pluggable Sort by Alex Popescu.

URL Search Tool!

Filed under: Common Crawl,Search Data,Search Engines,Searching — Patrick Durusau @ 7:22 pm

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs since you can now run them only on the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!
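A sketch of how the downloaded metadata might be used; the file name and field names below are assumptions for illustration, not the URL Index’s actual JSON schema.

```python
import json

# Hypothetical metadata file saved from URL Search; real field names may differ.
with open("example.com_prefix.json") as f:
    entries = json.load(f)

# Collect the (crawl file, byte offset, length) slices a job should read,
# instead of scanning the entire Common Crawl corpus.
slices = [(e["fileName"], e["offset"], e["length"]) for e in entries]
print(f"{len(slices)} segments to fetch instead of the full corpus")
```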

Data Governance needs Searchers, not Planners

Filed under: Data Governance,Data Management,Data Models,Design — Patrick Durusau @ 7:22 pm

Data Governance needs Searchers, not Planners by Jim Harris.

From the post:

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance. A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible. Change management efforts are resisted when they impose new methods by emphasizing bad business and technical processes, as well as bad data-related employee behaviors, while ignoring unheralded processes and employees whose existing methods are preventing other problems from happening.

If you don’t remember any line from any post you read here or elsewhere, remember this one:

“…they rely on their own common sense to reason about the behavior of people who are different from them.”

Whenever you encounter a situation where that description fits, you will find failed projects, waste and bad morale.

PolyBase

Filed under: Hadoop,HDFS,MapReduce,PolyBase,SQL,SQL Server — Patrick Durusau @ 11:20 am

PolyBase

From the webpage:

PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.

Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.

I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”

But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).

So, fair to say it was short on details. 😉

The closest thing I found to a clue was in the PolyBase datasheet (under PolyBase Use Cases, if you are reading along), which says:

PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.

I assume that means multiple external tables could be defined over the same data in HDFS? Depending upon the query?

Curious if the external tables and/or data types are going to have MapReduce capabilities built-in? To take advantage of parallel processing of the data?

BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables. In case you want to merge data.

Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?

I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.

Social Graphs and Applied NoSQL Solutions [Merging Graphs?]

Filed under: Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 11:20 am

Social Graphs and Applied NoSQL Solutions by John L. Myers.

From the post:

Recent postings have been more about the “theory” behind the wonderful world of NoSQL and less about how to implement a solution with a NoSQL platform. Well it’s about time that I changed that. This posting will be about how the graph structure and graph databases in particular can be an excellent “applied solution” of NoSQL technologies.

When Facebook released its Graph Search, the general public finally got a look at what the “backend” of Facebook looked like or its possible uses … For many the consumer to consumer (c2c) version of Facebook’s long available business-to-business and business-to-consumer offerings was a bit more of the “creepy” vs. the “cool” of the social media content. However, I think it will have the impact of opening people’s eyes on how their content can and probably should be used for search and other analytical purposes.

With graph structures, unlike tabular structures such as row and column data schemas, you look at the relationships between the nodes (i.e. customers, products, locations, etc.) as opposed to looking at the attributes of a particular object. For someone like me, who has long advocated that we should look at how people, places and things interact with each other versus how their “demographics” (i.e. size, shape, income, etc.) make us “guess” how they interact with each other. In my opinion, demographics and now firmographics have been used as “substitutes” for how people and organizations behave. While this can be effective in the aggregate, as we move toward a “bucket of one” treatment model for customers or clients, for instance, we need to move away from using demographics/firmographics as a primary analysis tool.

Let’s say that graph databases become as popular as SQL databases. You can’t scratch an enterprise without finding a graph database.

And they are all as different from each other as the typical SQL database is today.

How do you go about merging graph databases?

Can you merge graph databases and retain the characteristics of each graph database separately?

If graph databases become as popular as they should, those are going to be real questions in the not too distant future.
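To make the question concrete, here is a tiny sketch with networkx standing in for two graph databases (names and identifiers invented): a naive union merges nodes only when their identifiers happen to match exactly, which is precisely where subject identity questions show up.

```python
import networkx as nx

# Two "databases" describing overlapping people with different identifiers.
g1 = nx.Graph()
g1.add_edge("alice@example.com", "bob@example.com", rel="KNOWS")

g2 = nx.Graph()
g2.add_edge("Alice Smith", "bob@example.com", rel="WORKS_WITH")

# compose() unions nodes and edges; nodes merge only on exact identifier match.
merged = nx.compose(g1, g2)
print(list(merged.nodes()))
# 'alice@example.com' and 'Alice Smith' remain separate nodes even if they are
# the same subject -- deciding that they are is the topic-map style problem.
```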

ViralSearch: How Viral Content Spreads over Twitter

Filed under: Graphics,Social Media,Tweets,Visualization — Patrick Durusau @ 11:20 am

ViralSearch: How Viral Content Spreads over Twitter by Andrew Vande Moere.

From the post:

ViralSearch [microsoft.com], developed by Jake Hofman and others of Microsoft Research, visualizes how content spreads over social media, and Twitter in particular.

ViralSearch is based on hundreds of thousands of stories that are spread through billions of mentions of these stories, over many generations. In particular, it reveals the typical, hidden structures behind the sharing of viral videos, photos and posts as a hierarchical generation tree or as an animated bubble graph. The interface contains an interactive timeline of events, as well as a search field to explore specific phrases, stories, or Twitter users to provide an overview of how the independent actions of many individuals make content go viral.

As this tool seems only to be available within Microsoft, you can only enjoy it by watching the documentary video below.

See also NYTLabs Cascade: How Information Propagates through Social Media for a visualization of a very similar concept.

Impressive graphics!

Question: If and when you have an insight while viewing a social networking graphic, where do you capture that insight?

That is, how do you link your insight to a particular point in the graphic?

NLTK 1.3 – Computing with Language: Simple Statistics

Filed under: Lisp,Natural Language Processing,NLTK — Patrick Durusau @ 11:20 am

NLTK 1.3 – Computing with Language: Simple Statistics by Vsevolod Dyomkin.

From the post:

Most of the remaining parts of the first chapter of NLTK book serve as an introduction to Python in the context of text processing. I won’t translate that to Lisp, because there’re much better resources explaining how to use Lisp properly. First and foremost I’d refer anyone interested to the appropriate chapters of Practical Common Lisp:

List Processing
Collections
Variables
Macros: Standard Control Constructs

It’s only worth noting that Lisp has a different notion of lists than Python. Lisp’s lists are linked lists, while Python’s are essentially vectors. Lisp also has vectors as a separate data-structure, and it also has multidimensional arrays (something Python mostly lacks). And the set of Lisp’s list operations is somewhat different from Python’s. List is the default sequence data-structure, but you should understand its limitations and know when to switch to vectors (when you will have a lot of elements and often access them at random). Also Lisp doesn’t provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing which isn’t easily reproducible in Lisp is assigning to a slice:
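For readers who have not seen it, the Python slice assignment being referred to looks like this (a generic example, not the snippet from the original post):

```python
# Python lists are vector-like, so a contiguous range can be replaced in place.
words = ["the", "quick", "brown", "fox"]
words[1:3] = ["slow", "grey"]          # assign to a slice
print(words)                           # ['the', 'slow', 'grey', 'fox']

words[1:3] = ["very", "very", "lazy"]  # the slice may even change length
print(words)                           # ['the', 'very', 'very', 'lazy', 'fox']
```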

Vsevolod continues his journey through Chapter 1 of the NLTK book, focusing on the simple statistics (with examples).

VIAF: The Virtual International Authority File

Filed under: Authority Record,Library,Library Associations,Merging,Subject Identity — Patrick Durusau @ 11:19 am

VIAF: The Virtual International Authority File

From the webpage:

VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

The “about” link at the bottom of the page is broken (in the English version). A working “about” link for VIAF reports:

At a glance

  • A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
  • All authority data for a given entity is linked together into a “super” authority record
  • A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

If you were to substitute the term “topic” for “super” authority record, you would be part of the way towards a topic map.

Topics gather information about a given entity into a single location.

Topics differ from the authority records you find at VIAF in two very important ways:

  1. First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
  2. Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.

There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.
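A back-of-the-envelope sketch of the difference: merge two records whenever they share an identifier, and let anyone contribute records. The structures and identifiers below are invented for illustration and are not VIAF’s or the TMDM’s actual formats.

```python
# Minimal "topic" records: a set of identifiers plus a bag of names.
def merge(t1, t2):
    """Merge two topics if they share at least one identifier."""
    if t1["ids"] & t2["ids"]:
        return {"ids": t1["ids"] | t2["ids"], "names": t1["names"] | t2["names"]}
    return None  # different subjects, nothing to merge

authority_like = {"ids": {"authority:12345"},
                  "names": {"Twain, Mark, 1835-1910"}}
my_topic = {"ids": {"authority:12345", "wikipedia:Mark_Twain"},
            "names": {"Samuel Clemens"}}

merged = merge(authority_like, my_topic)
print(merged)
# The merged topic carries more identifiers and names than either input,
# and nothing stops a non-"authority" from having contributed my_topic.
```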

March 5, 2013

Transparency and the Digital Oil Drop

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 3:52 pm

I left off yesterday pointing out three critical failures in the Digital Accountability and Transparency Act (DATA Act).

Those failures were:

  • Undefined goals with unrealistic deadlines.
  • Lack of incentives for performance.
  • Lack of funding for assigned duties.

Digital Accountability and Transparency Act (DATA Act) [DOA]

Make no mistake, I think transparency, particularly in government spending is very important.

Important enough that proposals for transparency should take it seriously.

In broad strokes, here is my alternative to the Digital Accountability and Transparency Act (DATA Act) proposal:

  • Ask the GAO, the federal agency with the most experience auditing other federal agencies, to prepare an estimate for:
    • Cost/Time for preparing a program internal to the GAO to produce mappings of agency financial records to a common report form.
    • Cost/Time to train GAO personnel on the mapping protocol.
    • Cost/Time for additional GAO staff for the creation of the mapping protocol and permanent GAO staff as liaisons with particular agencies.
    • Recommendations for incentives to promote assistance from agencies.
  • Upon approval and funding of the GAO proposal, which should include at least two federal agencies as test cases, that:
    • Test case agencies are granted additional funding for training and staff to cooperate with the GAO mapping team.
    • Test case agencies are granted additional funding for training and staff to produce reports as specified by the GAO.
    • Staff in test case agencies are granted incentives to assist in the initial mapping effort and maintenance of the same. (Positive incentives.)
  • The program of mapping accounts expands no more often than every two to three years, and only if prior agencies have achieved and remain in conformance.

Some critical differences between my sketch of a proposal and the Digital Accountability and Transparency Act (DATA Act):

  1. Additional responsibilities and requirements will be funded for agencies, including additional training and personnel.
  2. Agency staff will have incentives to learn the new skills and procedures necessary for exporting their data as required by the GAO.
  3. Instead of trying to swallow the Federal whale, the project proceeds incrementally and with demonstrable results.

Topic maps can play an important role in such a project but we should be mindful that projects rarely succeed or fail because of technology.

Projects fail because, like the DATA Act, they ignore basic human needs, experience in similar situations (9/11), and substitute abuse for legitimate incentives.

A Simple, Combinatorial Algorithm for Solving…

Filed under: Algorithms,Data Structures,Mathematics — Patrick Durusau @ 2:55 pm

A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time by Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, Zeyuan Allen Zhu.

Abstract:

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a “nice” spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.

In one popular account, the importance of the discovery was described this way:

The real value of the MIT paper, Spielman says, is in its innovative theoretical approach. “My work and the work of the folks at Carnegie Mellon, we’re solving a problem in numeric linear algebra using techniques from the field of numerical linear algebra,” he says. “Jon’s paper is completely ignoring all of those techniques and really solving this problem using ideas from data structures and algorithm design. It’s substituting one whole set of ideas for another set of ideas, and I think that’s going to be a bit of a game-changer for the field. Because people will see there’s this set of ideas out there that might have application no one had ever imagined.”

Thirty-two pages of tough sledding but if the commentaries are correct, this paper may have a major impact on graph processing.

Six Years of Many Eyes:… [New Release March 25th 2013]

Filed under: Graphics,Visualization — Patrick Durusau @ 2:23 pm

Six Years of Many Eyes: A Personal Retrospective by Frank Van Ham.

From the post:

As Many Eyes is being brought out of hibernation, I thought it would be appropriate to reflect on how it came to be and some of the paths the original research team followed.

As the title says, these are my personal views and recollections and they should be taken as such. Some of these I had to piece back from memory, and some from old presentation decks still on my drive.

….

Many Eyes v2 launches at the end of March with several new enhancements that will continue the heritage of the site to advance visualization on the web, including:

  • A comprehensive site redesign with updated layout and presentation. Plus, new affinity areas to find and navigate visualizations by industry or topic, such as finance, healthcare and risk.
  • Addition of the Expert Eyes blog dedicated to helping you learn how to create effective and engaging visualizations that provide maximum insight and tell a story. IBM visualization luminaries and IBM Researchers from the Center for Advanced Visualization will contribute their perspectives regularly.
  • Two new visualization types, including a heatmap and view-in-context visualization built on IBM’s Rapidly Adaptive Visualization Engine (RAVE). RAVE, a declarative language based on the Grammar of Graphics, provides a flexible way to create visual mappings by describing what the visualization should look like, without having to write any tedious rendering code.
  • New options to share visualizations across the web and on social networks, including popular visual social networks, such as Pinterest.

Discover the newest version of Many Eyes beginning March 25 by visiting www.ibm.com/manyeyes (yes, a new URL as well).

Don’t skip the history as I did above! But I did want to get you excited about the new drop!

A nice collaborative filtering tutorial “for dummies”

Filed under: Data Mining,Filters — Patrick Durusau @ 2:12 pm

A nice collaborative filtering tutorial “for dummies”

Danny Bickson writes:

I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called: A Programmer’s Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.

They are Chapter 2: Collaborative Filtering and Chapter 3: Implicit Ratings and Item Based Filtering.
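As a taste of what Chapter 2 covers, here is a minimal user-based collaborative filtering sketch in Python (a generic cosine-similarity approach, not code from the book or from GraphChi):

```python
from math import sqrt

# Invented ratings: user -> {item: rating}.
ratings = {
    "Ann":  {"m1": 5, "m2": 3, "m3": 4},
    "Bob":  {"m1": 4, "m2": 2, "m4": 5},
    "Cara": {"m2": 5, "m3": 2, "m4": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (sqrt(sum(u[i] ** 2 for i in shared)) *
                  sqrt(sum(v[i] ** 2 for i in shared)))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    scores = [(cosine(ratings[user], r), r[item])
              for name, r in ratings.items()
              if name != user and item in r]
    total = sum(s for s, _ in scores)
    return sum(s * rating for s, rating in scores) / total if total else None

print(predict("Ann", "m4"))  # Ann has not rated m4; estimate from Bob and Cara
```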

Rebuilding Gephi’s core for the 0.9 version

Filed under: Gephi,Graphs,Networks,Visualization — Patrick Durusau @ 2:07 pm

Rebuilding Gephi’s core for the 0.9 version by Mathieu Bastian.

From the post:

This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and we’re giving today a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance, stability and add new features. The core modules represent and store the graph and attributes in memory so it’s available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and didn’t change much in multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.

Deeply interesting work!

To follow, consider subscribing to: gephi-dev — List for core developers.

“Do Bees” / “Don’t Bees” and Neo4j

Filed under: Cypher,Graphs,Neo4j,Networks — Patrick Durusau @ 1:12 pm

According to Michael Hunger in a Neo4j Google Groups message, the Neo4j team is drowning in its own success!

Now there’s a problem to have!

“Do Bees” for Neo4j will:

…ask questions on Stack Overflow that relate to Neo4j:

Please tag your questions with “neo4j” and “cypher”, “gremlin” or “spring data neo4j” accordingly. See the current list:

http://stackoverflow.com/questions/tagged/neo4j

Currently questions on SO are answered quickly by a group of very active people which we hope you will join. We try to chime in as often as possible (especially with unanswered questions).

So PLEASE post your questions there on Stack Overflow, we will start asking individuals to move their questions to that platform and if they don’t manage it, move them ourselves.

We will also monitor this badge: http://stackoverflow.com/badges/1785/neo4j and award cool stuff for people that make it there.

This google group shall return to its initial goals of having broader discussions about graph topics, modeling, architectures, roadmap, announcements, cypher evolution, open source etc. So we would love everyone who has questions or problems in these areas to reach out and start a conversation.

Hope for your understanding to make more breathing room in this group and more interesting discussions in the future while keeping an interactive FAQ around Neo4j going on SO with quick feedback loops and turnaround times.

The Neo4j community will be healthier if we are all “Do Bees” so I won’t cover the alternative.

If you don’t know “Do Bees” / “Don’t Bees,” see: Romper Room.

See you at Stackoverflow!

How Search Works [Promoting Your Interface]

Filed under: Search Engines,Searching — Patrick Durusau @ 12:51 pm

How Search Works (Google)

Clever graphics and I rather liked the:

By the way, in the **** seconds you’ve been on this page, approximately *********** searches were performed.

Not that you want that sort of tracking if your topic map interface only gets two or three “hits” a day, but in an enterprise context it might be worth thinking about.

Evidence of the popularity of your topic map interface with the troops.

I first saw this in a tweet by Christian Jansen.

Marketing Data Sets (Read Topic Maps)

Filed under: Marketing,News,Reporting — Patrick Durusau @ 11:50 am

The National Institute for Computer-Assisted Reporting (NICAR) has forty-seven (47) databases for sale in bulk or by geographic region.

Data sets range from “AJC School Test Scores” and “FAA Accidents and Incidents” to “Social Security Administration Death Master File” and “Wage and Hour Enforcement.”

The data sets cover decades of records.

There is a one hundred (100) record sample for each database.

The samples offer an avenue to show paying customers, based upon a familiar dataset, what more is possible with topic maps.

With all the talk of gun control in the United States, consider the Federal Firearms/Explosives Licensees database.

For free you can see:

Main documentation (readme.txt)

Sample Data (sampleatf_ffl.xls)

Record layout (Layout.txt)

Do remember that NICAR already has the attention of an interested audience, should you need a partner in marketing a fuller result.

Tools, Slides and Links from NICAR13 [News Investigation/Reporting]

Filed under: News,Reporting — Patrick Durusau @ 11:14 am

Tools, Slides and Links from NICAR13 by Chrys Wu.

The acronyms were new to me: NICAR (National Institute for Computer-Assisted Reporting), a program of IRE (Investigative Reporters & Editors).

From the post:

NICAR13 brings together some of the sharpest minds and most experienced hands in investigative journalism. Over four days, people share, discuss and teach techniques for hunting leads, gathering data, and presenting stories. Of all the conferences I go to, this one gets the highest marks from attendees for intensive, immediately applicable learning; networking and fun.

NICAR 2014 will be in Baltimore from Feb. 27 to March 2. You should be there.

For additional tutorials, videos, presentations and tips see the lists from 2012 and 2011.

A real wealth of material if you are interested in mining, analyzing and reporting data.

Enjoy!

I first saw this in a tweet by Chrys Wu.

Addictive Topic Map Forums

Filed under: Interface Research/Design,Marketing,Topic Maps,Users — Patrick Durusau @ 10:37 am

They exist in theory at this point and I would like to see that change. But I don’t know how.

Here are three examples of addictive forums:

Y Hacker News: It has default settings to keep you from spending too much time on the site.

Facebook: Different in organization and theme from Y Hacker News.

Stack Overflow: Different from the other two but also qualifies as addictive.

There are others but those represent a range of approaches that have produced addictive forums.

I’m not looking for a “smoking gun” sort of answer but some thoughts on what lessons these sites have for creating other sites.

Not just for use in creating a topic map forum but for creating topic map powered information resources that have those same characteristics.

An addictive information service would be quite a marketing coup.

Some information resource interfaces are better than others but I have yet to see one I would voluntarily seek out just for fun.

March 4, 2013

Digital Accountability and Transparency Act (DATA Act) [DOA]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 5:44 pm

I started this series of posts in Digital Accountability and Transparency Act (DATA Act) [The Details], where I concluded the DATA Act had the following characteristics:

  • Secretary of the Treasury has one (1) year to design a common data format for unknown financial data in Federal agencies.
  • Federal agencies have one (1) year to comply with the common data format from the Secretary of the Treasury.
  • No penalties or bonuses for the Secretary of the Treasury.
  • No penalties or bonuses for Federal agencies failing to comply.
  • No funding for the Secretary of the Treasury to carry out the assigned duties.
  • No funding for Federal agencies to carry out the assigned duties.

As written, the Digital Accountability and Transparency Act (DATA Act) will be DOA (Dead On Arrival) in the current or any future session of Congress.

There are three (3) main reasons why that is the case.

A Common Data Format

Let me ask a dumb question: Do you remember 9/11?

Of course you do. And the United States has been in a state of war on terrorism ever since.

I point that out because intelligence sharing (read common data format) was identified as a reason why the 9/11 attacks weren’t stopped and has been a high priority to solve since then.

Think about that: a reason why the attacks weren’t stopped and a high priority to correct.

This next September 11th will be the twelfth anniversary of those attacks.

Progress on intelligence sharing: Progress Made and Challenges Remaining in Sharing Terrorism-Related Information which I gloss in Read’em and Weep, along with numerous other GAO reports on intelligence sharing.

The good news is that we are less than five (5) years away from some unknown level of intelligence sharing.

The bad news is that puts us sixteen (16) years after 9/11 with some unknown level of intelligence sharing.

And that is for a subset of the entire Federal government.

A smaller set than will be addressed by the Secretary of the Treasury.

Common data format in a year? Really?

To say nothing of the likelihood of agencies changing the multitude of systems they have in place in a year.

No penalties or bonuses

You can think of this as the proverbial carrot and stick if you like.

What incentive do the Secretary of the Treasury and Federal agencies have to engage in this fool’s errand of pursuing a common data format?

In case you have forgotten, both the Secretary of the Treasury and Federal agencies have obligations under their existing missions.

Missions which they are designed by legislation and habit to discharge before they turn to additional reporting duties.

And what happens if they discharge their primary mission but don’t do the reporting?

Oh, they get reported to Congress. And ranked in public.

As Ben Stein would say, “Wow.”

No Funding

To add insult to injury, there is no additional funding for either the Secretary of the Treasury or Federal agencies to engage in any of the activities specified by the Digital Accountability and Transparency Act (DATA Act).

As I noted above, the Secretary of the Treasury and Federal agencies already have full plates with their current missions.

Now they are to be asked to undertake unfamiliar tasks, creation of a chimerical “common data format” and submitting reports based upon it.

Without any additional staff, training, or other resources.

Directives without resources to fulfill them are directives that are going to fail. (full stop)

Tentative Conclusion

If you are asking yourself, “Why would anyone advocate the Digital Accountability and Transparency Act (DATA Act)?”, five points for your house!

I don’t know of anyone who understands:

  1. the complexity of Federal data,
  2. the need for incentives,
  3. the need for resources to perform required tasks,

who thinks the Digital Accountability and Transparency Act (DATA Act) is viable.

Why advocate non-viable legislation?

Its non-viability makes it an attractive fund-raising mechanism.

Advocates can email, fund raise, telethon, rant, etc., to their heart’s content.

Advocating non-viable transparency lines an organization’s pocket at no risk of losing its rationale for existence.


The third post in this series, suggesting a viable way forward, will appear tomorrow under: Transparency and the Digital Oil Drop.

Digital Accountability and Transparency Act (DATA Act) [The Details]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 5:42 pm

The Data Transparency Coalition, the Sunlight Foundation and others are calling for reintroduction of the Digital Accountability and Transparency Act (DATA Act) in order to make U.S. government spending more transparent.

Transparency in government spending is essential for an informed electorate. An electorate that can call attention to spending that is inconsistent with policies voted for by the electorate. Accountability as it were.

But saying “transparency” is easy. Achieving transparency, not so easy.

Let’s look at some of the details in the DATA Act.

(2) DATA STANDARDS-

    ‘(A) IN GENERAL- The Secretary of the Treasury, in consultation with the Director of the Office of Management and Budget, the General Services Administration, and the heads of Federal agencies, shall establish Government-wide financial data standards for Federal funds, which may–

      ‘(i) include common data elements, such as codes, unique award identifiers, and fields, for financial and payment information required to be reported by Federal agencies;

      ‘(ii) to the extent reasonable and practicable, ensure interoperability and incorporate–

        ‘(I) common data elements developed and maintained by an international voluntary consensus standards body, as defined by the Office of Management and Budget, such as the International Organization for Standardization;

        ‘(II) common data elements developed and maintained by Federal agencies with authority over contracting and financial assistance, such as the Federal Acquisition Regulatory Council; and

        ‘(III) common data elements developed and maintained by accounting standards organizations; and

      ‘(iii) include data reporting standards that, to the extent reasonable and practicable–

        ‘(I) incorporate a widely accepted, nonproprietary, searchable, platform-independent computer-readable format;

        ‘(II) be consistent with and implement applicable accounting principles;

        ‘(III) be capable of being continually upgraded as necessary; and

        ‘(IV) incorporate nonproprietary standards in effect on the date of enactment of the Digital Accountability and Transparency Act of 2012.

    ‘(B) DEADLINES-

      ‘(i) GUIDANCE- The Secretary of the Treasury, in consultation with the Director of the Office of Management and Budget, shall issue guidance on the data standards established under subparagraph (A) to Federal agencies not later than 1 year after the date of enactment of the Digital Accountability and Transparency Act of 2012.

      ‘(ii) AGENCIES- Not later than 1 year after the date on which the guidance under clause (i) is issued, each Federal agency shall collect, report, and maintain data in accordance with the data standards established under subparagraph (A).

OK, I have a confession to make: I was a lawyer for ten years and reading this sort of thing is second nature to me. Haven’t practiced law in decades but I still read legal stuff for entertainment. 😉

First, read section A and write down the types of data you would have to collect for each of those items.

Don’t list the agencies/organizations you would have to contact, you probably don’t have enough paper in your office for that task.

Second, read section B and notice that the Secretary of the Treasury has one (1) year to issue guidance for all the data you listed under Section A.

That means gathering, analyzing, testing and designing a standard for all that data, most of which is unknown. Even to the GAO.

And, if they meet that one (1) year deadline, the various agencies have only one (1) year to comply with the guidance from the Secretary of the Treasury.

Do I need to comment on the likelihood of success?

As far as the Secretary of the Treasury, what happens if they don’t meet the one year deadline? Do you see any penalties?

Assuming some guidance emerges, what happens to any Federal agency that does not comply? Any penalties for failure? Any incentives to comply?

My reading is:

  • Secretary of the Treasury has one (1) year to design a common data format for unknown financial data in Federal agencies.
  • Federal agencies have one (1) year to comply with the common data format from the Secretary of the Treasury.
  • No penalties or bonuses for the Secretary of the Treasury.
  • No penalties or bonuses for Federal agencies failing to comply.
  • No funding for the Secretary of the Treasury to carry out the assigned duties.
  • No funding for Federal agencies to carry out the assigned duties.

Do you disagree with that reading of the Digital Accountability and Transparency Act (DATA Act)?

My analysis of that starting point appears in Digital Accountability and Transparency Act (DATA Act) [DOA].

Pattern Based Graph Generator

Filed under: Graph Databases,Graph Generator,Graphs,Networks — Patrick Durusau @ 5:35 pm

Pattern Based Graph Generator by Hong-Han Shuai, De-Nian Yang, Philip S. Yu, Chih-Ya Shen, Ming-Syan Chen.

Abstract:

The importance of graph mining has been widely recognized thanks to a large variety of applications in many areas, while real datasets always play important roles to examine the solution quality and efficiency of a graph mining algorithm. Nevertheless, the size of a real dataset is usually fixed and constrained according to the available resources, such as the efforts to crawl an on-line social network. In this case, employing a synthetic graph generator is a possible way to generate a massive graph (e.g., billions of nodes) for evaluating the scalability of an algorithm, and current popular statistical graph generators are properly designed to maintain statistical metrics such as total node degree, degree distribution, diameter, and clustering coefficient of the original social graphs. Nevertheless, in addition to the above metrics, recent studies on graph mining point out that graph frequent patterns are also important to provide useful implications for the corresponding social networking applications, but this crucial criterion has not been noticed in the existing graph generators. This paper first manifests that numerous graph patterns, especially large patterns that are crucial with important domain-specific semantic, unfortunately disappear in the graphs created by popular statistic graph generators, even though those graphs enjoy the same statistical metrics with the original real dataset. To address this important need, we make the first attempt to propose a pattern based graph generator (PBGG) to generate a graph including all patterns and satisfy the user-specified parameters on supports, network size, degree distribution, and clustering coefficient. Experimental results show that PBGG is efficient and able to generate a billion-node graph in about 10 minutes, and PBGG is released for free download.

Download at: http://arbor.ee.ntu.edu.tw/~hhshuai/PBGG.html.

OK, this is not a “moderate size database” program.
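PBGG itself is available at the link above, but for contrast, this is roughly what the “statistical” generators criticized in the abstract do: match aggregate metrics such as a heavy-tailed degree distribution, with no guarantee about which frequent patterns survive. A minimal networkx sketch (not PBGG):

```python
import networkx as nx

# A synthetic 10,000-node graph with a heavy-tailed degree distribution.
g = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

print(g.number_of_nodes(), g.number_of_edges())
print("average clustering:", nx.average_clustering(g))
print("triangles (a very small 'pattern'):", sum(nx.triangles(g).values()) // 3)
# Aggregate statistics like these can match a real network while larger,
# domain-specific frequent patterns present in the original data disappear.
```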

D3 World Maps: Tooltips, Zooming, and Queue

Filed under: D3,Graphics,Mapping,Maps,Visualization — Patrick Durusau @ 5:16 pm

D3 World Maps: Tooltips, Zooming, and Queue

From the post:

D3 has a lot of built in support (a powerful geographic projection system) for creating Maps from GeoJSON. If you have never used D3 for maps, I think you should take a look at this D3 Map Tutorial. It covers the essentials of making a map with D3 and TopoJSON, which I will use below in more advanced examples. TopoJson encodes topology and eliminates redundancy, resulting in a much smaller file and the GeoJSON to TopoJSON converter is built with NodeJS.

Thus, I encourage you all to start using TopoJSON and below, I will go over a couple examples of building a D3 World Map with colors, tooltips, different zooming options, plotting points from geo coordinates, and listening to click events to load new maps. I will also use Mike Bostock’s queue script to load the data asynchronously.

Creating geographic maps with D3? This is a must stop.

What I need to look for is a library not for geo-coordinates but one that supports arbitrary, user-defined coordinates.

The sort of thing that could map locations in library stacks.

Suggestions/pointers?
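One partial answer, sketched with matplotlib rather than D3 only to keep these examples in one language: any plain 2-D plotting library already accepts arbitrary user-defined coordinates, so a library stack map is just labeled shapes on an invented floor grid. The call numbers and coordinates below are made up.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Invented floor plan: stacks as rectangles in an arbitrary (bay, row) grid.
stacks = [("PS 3500-3599", 1, 1), ("QA 75-76", 3, 1), ("Z 665-718", 5, 2)]

fig, ax = plt.subplots()
for label, x, y in stacks:
    ax.add_patch(Rectangle((x, y), width=1.5, height=0.5, fill=False))
    ax.annotate(label, (x + 0.1, y + 0.2), fontsize=8)

ax.set_xlim(0, 8)
ax.set_ylim(0, 4)
ax.set_xlabel("bay")
ax.set_ylabel("row")
ax.set_title("Library stack locations on a user-defined grid")
plt.savefig("stack_map.png")
```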

Data, Data, Data: Thousands of Public Data Sources

Filed under: Data,Dataset — Patrick Durusau @ 5:05 pm

Data, Data, Data: Thousands of Public Data Sources

From the post:

We love data, big and small and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data.

A rather remarkable list of data sets. You are sure to find something of interest!
