Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 9, 2012

Crowdsourced Legal Case Annotation

Filed under: Annotation,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:38 pm

Crowdsourced Legal Case Annotation

From the post:

This is an academic research study on legal informatics (information processing of the law). The study uses an online, collaborative tool to crowdsource the annotation of legal cases. The task is similar to legal professionals’ annotation of cases. The result will be a public corpus of searchable, richly annotated legal cases that can be further processed, analysed, or queried for conceptual annotations.

Adam and Wim are computer scientists who are interested in language, law, and the Internet.

We are inviting people to participate in this collaborative task. This is a beta version of the exercise, and we welcome comments on how to improve it. Please read through this blog post, look at the video, and get in contact.

Non-trivial annotation of complex source documents.

What you do with the annotations, such as creating topic maps, would be a separate step.

The early evidence that our own work is enhanced by building on the work of others (see Picking the Brains of Strangers…) should make this approach even more exciting.

PS: I saw this at Legal Informatics but wanted to point you directly to the source article.

Just musing for a moment: what if the conclusion about collaboration and access is that by restricting access we impoverish not only others, but ourselves as well?

Bruce on the Functions of Legislative Identifiers

Filed under: Identifiers,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:06 pm

Bruce on the Functions of Legislative Identifiers

From Legal Informatics:

In this post, Tom [Bruce] discusses the multiple functions that legislative document identifiers serve. These include “unique naming,” “navigational reference,” “retrieval hook / container label,” “thread tag / associative marker,” “process milestone,” and several more.

A promised second post will examine issues of identifier design.

Enjoy and pass along!

Structural Abstractions in Brains and Graphs

Filed under: Graphs,Neural Networks,Neuroinformatics — Patrick Durusau @ 10:31 am

Structural Abstractions in Brains and Graphs.

Marko Rodriguez compares the brain to a graph, saying (in part):

A graph database is a software system that persists and represents data as a collection of vertices (i.e. nodes, dots) connected to one another by a collection of edges (i.e. links, lines). These databases are optimized for executing a type of process known as a graph traversal. At various levels of abstraction, both the structure and function of a graph yield a striking similarity to neural systems such as the human brain. It is posited that as graph systems scale to encompass more heterogenous data, a multi-level structural understanding can help facilitate the study of graphs and the engineering of graph systems. Finally, neuroscience may foster an appreciation and understanding of the various structural abstractions that exist within the graph.

It is a very suggestive post for thinking about graphs and I commend it to you for reading, close reading.
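To make the vertex/edge/traversal vocabulary concrete, here is a minimal sketch of a graph as an adjacency map with a breadth-first traversal. It is my illustration, not Marko's code and not a graph database; the node names are invented.

```python
from collections import deque

# A toy graph: each vertex (node) maps to the vertices it links to (edges).
graph = {
    "neuron_a": ["neuron_b", "neuron_c"],
    "neuron_b": ["neuron_d"],
    "neuron_c": ["neuron_d"],
    "neuron_d": [],
}

def traverse(graph, start):
    """Breadth-first traversal: visit every vertex reachable from `start` once."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for neighbor in graph[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

print(traverse(graph, "neuron_a"))  # ['neuron_a', 'neuron_b', 'neuron_c', 'neuron_d']
```

A graph database performs the same kind of walk, but over billions of persisted vertices and edges rather than an in-memory dictionary.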

Making Intelligence Systems Smarter (or Dumber)

Filed under: Data Silos,Distributed Sensemaking,Intelligence,Sensemaking — Patrick Durusau @ 10:02 am

Picking the Brains of Strangers….[$507 Billion Dollar Prize (at least)] had three keys to its success:

  • Use of human analysts
  • Common access to data and prior efforts
  • Reuse of prior efforts by human analysts

Intelligence analysts spend their days with snippets and bits of data, trying to wring sense out of them, only to pigeonhole their results in silos.

Other analysts have to know the data exists before they can even request it. Or analysts who hold information must recognize that it could help others with their own sensemaking.

All contrary to the results in Picking the Brains of Strangers….

What information will result in sensemaking for one or more analysts is unknown. And cannot be known.

Every firewall, every silo, every compartment, every clearance level, makes every intelligence agency and the overall intelligence community dumber.

Until now, the intelligence community has chosen to be dumber and more secure.

In a time of budget cuts and calls for efficiency in government, it is time for more effective intelligence work, even if less secure.

Take the leak of the diplomatic cables. The only people unaware of the general nature of the cables were the public and perhaps the intelligence agency of Zambia. All other intelligence agencies probably had them, or their own versions, pigeonholed in their own systems.

With robust intelligence sharing, the NSA could do all the signal capture and expense it out to other agencies, rather than various agencies maintaining duplicate systems.

And perhaps there could be a public data flow of analysis of foreign news sources in their original languages. Members of the public may not have clearances, but they may have insights into cultures and languages that are rare inside intelligence agencies.

But that presumes an interest in smarter intelligence systems, not dumber ones by design.

Picking the Brains of Strangers….[$507 Billion Dollar Prize (at least)]

Filed under: BigData,Distributed Sensemaking,Intelligence,Sensemaking — Patrick Durusau @ 9:17 am

Picking the Brains of Strangers Helps Make Sense of Online Information

Science Daily carried this summary (the official abstract and link are below):

People who have already sifted through online information to make sense of a subject can help strangers facing similar tasks without ever directly communicating with them, researchers at Carnegie Mellon University and Microsoft Research have demonstrated.

This process of distributed sensemaking, they say, could save time and result in a better understanding of the information needed for whatever goal users might have, whether it is planning a vacation, gathering information about a serious disease or trying to decide what product to buy.

The researchers explored the use of digital knowledge maps — a means of representing the thought processes used to make sense of information gathered from the Web. When participants in the study used a knowledge map that had been created and improved upon by several previous users, they reported that the quality of their own work was better than when they started from scratch or used a newly created knowledge map.

“Collectively, people spend more than 70 billion hours a year trying to make sense of information they have gathered online,” said Aniket Kittur, assistant professor in Carnegie Mellon’s Human-Computer Interaction Institute. “Yet in most cases, when someone finishes a project, that work is essentially lost, benefitting no one else and perhaps even being forgotten by that person. If we could somehow share those efforts, however, all of us might learn faster.”

Three take away points:

  • “people spend more than 70 billion hours a year trying to make sense of information they have gathered online”
  • “when someone finishes a project, that work is essentially lost, benefitting no one else and perhaps even being forgotten by that person”
  • using knowledge maps created and improved upon by others improved the quality of their own work

At the current US minimum wage of $7.25 an hour, those 70 billion hours come to roughly $507,500,000,000 a year. Some of us make more than minimum wage, so that figure should be adjusted upwards.

The key to success was improvement upon efforts already improved upon by others.

The study is based on a small sample (21 people), so there is an entire research field waiting to be explored: whether this holds true with different types of data, what group dynamics make it work best, which individual characteristics influence outcomes, interfaces (that help or hinder), processing models, software, hardware, integrating the results from different interfaces, etc.

Start here:

Distributed sensemaking: improving sensemaking by leveraging the efforts of previous users
by Kristie Fisher, Scott Counts, and Aniket Kittur.

Abstract:

We examine the possibility of distributed sensemaking: improving a user’s sensemaking by leveraging previous users’ work without those users directly collaborating or even knowing one another. We asked users to engage in sensemaking by organizing and annotating web search results into “knowledge maps,” either with or without previous users’ maps to work from. We also recorded gaze patterns as users examined others’ knowledge maps. Our findings show the conditions under which distributed sensemaking can improve sensemaking quality; that a user’s sensemaking process is readily apparent to a subsequent user via a knowledge map; and that the organization of content was more useful to subsequent users than the content itself, especially when those users had differing goals. We discuss the role distributed sensemaking can play in schema induction by helping users make a mental model of an information space and make recommendations for new tool and system development.

May 8, 2012

Reading Other People’s Mail For Fun and Profit

Filed under: Analytics,Data Analysis,Intelligence — Patrick Durusau @ 6:16 pm

Bob Gourley writes much better content than he does titles: Osama Bin Laden Letters Analyzed: A rapid assessment using Recorded Future’s temporal analytic technologies and intelligence analysis tools. (Sorry Bob.)

Bob writes:

The Analysis Intelligence site provides open source analysis and information on a variety of topics based on the temporal analytic technology and intelligence analysis tools of Recorded Future. Shortly after the release of 175 pages of documents from the Combatting Terrorism Center (CTC) a very interesting assessment was posted on the site. This assessment sheds light on the nature of these documents and also highlights some of the important context that the powerful capabilities of Recorded Future can provide.

The analysis by Recorded Future is succinct and well done so I cite most of it below. I’ll conclude with some of my own thoughts as an experienced intelligence professional and technologist on some of the “So What” of this assessment.

If you are interested in analytics, particularly visual analytics, you will really appreciate this piece.

Recorded Future has a post on the US Presidential Election. Just to be on the safe side, I would “fuzz” the data when it got close to the election. 😉

BFF (Best Friends Forever or …)

Filed under: Fuzzing,Security — Patrick Durusau @ 5:58 pm

Basic Fuzzing Framework (BFF) From CERT – Linux & Mac OSX Fuzzer Tool

Opportunities for topic maps are just about everywhere! 😉

From the post:

The CERT Basic Fuzzing Framework (BFF) is a software testing tool that finds defects in applications that run on the Linux and Mac OS X platforms. BFF performs mutational fuzzing on software that consumes file input. (Mutational fuzzing is the act of taking well-formed input data and corrupting it in various ways, looking for cases that cause crashes.) The BFF automatically collects test cases that cause software to crash in unique ways, as well as debugging information associated with the crashes. The goal of BFF is to minimize the effort required for software vendors and security researchers to efficiently discover and analyze security vulnerabilities found via fuzzing.

Traditionally fuzzing has been very effective at finding security vulnerabilities, but because of its inherently stochastic nature results can be highly dependent on the initial configuration of the fuzzing system. BFF applies machine learning and evolutionary computing techniques to minimize the amount of manual configuration required to initiate and complete an effective fuzzing campaign. BFF adjusts its configuration parameters based on what it finds (or does not find) over the course of a fuzzing campaign. By doing so it can dramatically increase both the efficacy and efficiency of the campaign. As a result, expert knowledge is not required to configure an effective fuzz campaign, and novices and experts alike can start finding and analyzing vulnerabilities very quickly.

Topic maps would be useful for mapping vulnerabilities across networks by application/OS and other uses.
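To make the "mutational fuzzing" definition in the quote concrete, here is a minimal sketch of the idea: corrupt a few random bytes in a well-formed seed file, run a target program on the result, and keep the cases that crash. This is my illustration, not CERT's BFF; the seed file name and target program are hypothetical.

```python
import random
import subprocess
import tempfile

def mutate(data: bytes, flips: int = 8) -> bytes:
    """Corrupt a well-formed input by overwriting a few random bytes."""
    buf = bytearray(data)
    for _ in range(flips):
        buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

seed = open("sample.pdf", "rb").read()   # hypothetical well-formed seed file
crashes = []

for _ in range(1000):
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        f.write(mutate(seed))
    result = subprocess.run(["./target_app", f.name], capture_output=True)  # hypothetical target
    if result.returncode < 0:            # process killed by a signal, e.g. SIGSEGV
        crashes.append((f.name, result.returncode))

print(len(crashes), "crashing test cases collected")
```

BFF adds the parts that matter at scale: automatic tuning of the mutation parameters over the course of a campaign, collection of debugging information, and grouping of crashes that are caused in unique ways.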

What’s new in Cassandra 1.1

Filed under: Cassandra — Patrick Durusau @ 3:47 pm

What’s new in Cassandra 1.1 by Jonathan Ellis.

From the post:

Cassandra 1.1 was just released with some useful improvements over 1.0. We’ve been describing these as 1.1 was developed, but it’s useful to list them all in one place:

There is a lot to digest here!

Pointers to posts taking advantage of these new capacities?

Enjoy!

@Zotero 4 Law and OpenCongress.org

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 3:39 pm

@Zotero 4 Law and OpenCongress.org

I don’t suppose one more legal resource from Legal Informatics for today will hurt anything. 😉

The post covers MLZ (Multilingual Zotero), a legal research and citation processor that operates as a plugin to Firefox.

Even if you don’t visit the original post, do watch the video on using MLZ. Not slick but you will see the potential that it offers.

It should also give you some ideas about user friendly interfaces and custom topic map applications.

Intent vs. Inference

Filed under: Data,Data Analysis,Inference,Intent — Patrick Durusau @ 3:03 pm

Intent vs. Inference by David Loshin.

David writes:

I think that the biggest issue with integrating external data into the organization (especially for business intelligence purposes) is related to the question of data repurposing. It is one thing to consider data sharing for cross-organization business processes (such as brokering transactions between two different trading partners) because those data exchanges are governed by well-defined standards. It is another when your organization is tapping into a data stream created for one purpose to use the data for another purpose, because there are no negotiated standards.

In the best of cases, you are working with some published metadata. In my previous post I referred to the public data at www.data.gov, and those data sets are sometimes accompanied by their data layouts or metadata. In the worst case, you are integrating a data stream with no provided metadata. In both cases, you, as the data consumer, must make some subjective judgments about how that data can be used.

A caution about “intent,” or as I knew it in literary criticism, the intentional fallacy. It is popular in some legal circles in the United States as well.

One problem is that there is no common basis for determining authorial intent.

Another problem is that “intent” is often used to privilege one view over others as representing the “intent” of the author. The “original” view is beyond questioning or criticism because it is the “intent” of the original author.

It should come as no surprise that for law (Scalia and the Constitution) and the Bible (you pick ’em), “original intent” usually means “agrees with the speaker.”

It isn’t entirely clear where David is going with this thread but I would simply drop the question of intent and ask two questions:

  1. What is the purpose of this data?
  2. Is the data suited to that purpose?

Where #1 may include what inferences we want to make, etc.

Cuts to the chase as it were.

New Version of Code of Federal Regulations Launched by Cornell LII

Filed under: Law,Law - Sources,Linked Data — Patrick Durusau @ 2:40 pm

New Version of Code of Federal Regulations Launched by Cornell LII

From Legal Informatics, news of improved access to the Code of Federal Regulations.

US Government site: Code of Federal Regulations.

Cornell LII site: Code of Federal Regulations

You tell me, which one do you like better?

Note that the Government Printing Office (GPO, originator of the “official” version), Cornell LII and the Cornell Law Library have been collaborating for the last two years to make this possible.

The Legal Informatics post has a summary of the new features. You won’t gain anything from my repeating them.

Cornell LII plans on using Linked Data so you can link into the site.

Being able to link into this rich resource will definitely be a boon to other legal resource sites and topic maps. (Despite the limitations of linked data.)

The complete announcement can be found here.

PS: Donate to support the Cornell LII project.

Mill: US Code Citation Extraction Library in JavaScript, with Node API

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 10:51 am

Mill: US Code Citation Extraction Library in JavaScript, with Node API

Legal Informatics brings news of new scripts by Eric Mill of Sunlight Labs to extract US Code citations in texts.

Since legal citations are a popular means of identifying laws, these scripts should be of interest for law-related topic maps.
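Mill is JavaScript with a Node API; purely to show what citation extraction involves, here is a hedged Python sketch that pulls simple "Title U.S.C. § Section" citations out of text with a regular expression. Real citation grammars (subsections, ranges, "et seq.") are messier than this, and the pattern below is mine, not Mill's.

```python
import re

# Matches citations such as "42 U.S.C. § 1983" or "5 USC 552(b)(7)".
USC_CITATION = re.compile(
    r"\b(?P<title>\d+)\s+U\.?S\.?C\.?\s*(?:§|Sec\.?)?\s*"
    r"(?P<section>\d+[a-z]?(?:\([0-9a-zA-Z]+\))*)"
)

def extract_citations(text):
    """Return (title, section) pairs for each citation found in the text."""
    return [(m.group("title"), m.group("section")) for m in USC_CITATION.finditer(text)]

sample = "Plaintiff sues under 42 U.S.C. § 1983 and cites 5 USC 552(b)(7)."
print(extract_citations(sample))  # [('42', '1983'), ('5', '552(b)(7)')]
```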

Monique da Silva Moore, et al. v. Publicis Groupe SA, et al., 11 Civ. 1279

Filed under: e-Discovery,Law,Legal Informatics — Patrick Durusau @ 10:44 am

Monique da Silva Moore, et al. v. Publicis Groupe SA, et al., 11 Civ. 1279

The foregoing link is something of a novelty. It is a link to the opinion by US Magistrate Andrew Peck, approving the use of predictive coding (computer-assisted review) as part of e-discovery.

It is not a pointer to an article with no link to the opinion. It is not a pointer to an article on the district judge’s opinion, upholding the magistrate’s order but adding nothing of substance on the use of predictive coding. It is not a pointer to a law journal that requires “free” registration.

I think readers have a reasonable expectation that articles contain pointers to primary source materials. Otherwise, why not write for the tabloids?

Sorry, I just get enraged when resources do not point to primary sources. Not only is it poor writing, it is discourteous to readers.

Magistrate Peck’s opinion is said to be the first that approves the use of predictive coding as part of e-discovery.

In very summary form, the plaintiff (the person suing) has requested that the defendant (the person being sued) produce documents, including emails, in its possession that are responsive to a discovery request. A discovery request is one in which the plaintiff specifies what documents it wants the defendant to produce, usually described as classes of documents. For example: all documents with statements about [plaintiff’s name]’s employment with X, prior to date N.

In this case, there are 3 million emails to be searched and then reviewed by the defense lawyers (for claims of privilege, non-disclosure authorized by law, such as advice of counsel in some cases) prior to production for review by the plaintiff, who may then use one or more of the emails at trial.

The question is: Should the defense lawyers use a few thousand documents to train a computer to search the 3 million documents or should they use other methods, which will result in much higher costs because lawyers have to review more documents?
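For readers who want a concrete picture of what "training a computer on a few thousand documents" can look like, here is a hedged sketch of one common approach: TF-IDF features plus a linear classifier, using scikit-learn. It is not the protocol approved in this case, and the documents and labels below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Seed set: emails lawyers have already coded as responsive (1) or not (0).
seed_docs = ["re: plaintiff's transfer request ...", "lunch on friday?"]   # placeholders
seed_labels = [1, 0]

# The remaining corpus (in this posture, roughly 3 million emails).
unreviewed_docs = ["fw: staffing decisions for the account ..."]           # placeholder

vectorizer = TfidfVectorizer(stop_words="english")
X_seed = vectorizer.fit_transform(seed_docs)
X_rest = vectorizer.transform(unreviewed_docs)

model = LogisticRegression(max_iter=1000).fit(X_seed, seed_labels)

# Rank unreviewed emails by predicted probability of responsiveness;
# human reviewers then work down the ranking instead of reading everything.
scores = model.predict_proba(X_rest)[:, 1]
ranked = sorted(zip(scores, unreviewed_docs), reverse=True)
```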

The law, facts, and e-discovery issues weave in and out of Magistrate Peck’s decision, but if you skip past the obvious legalese you will get the gist of what is being said. (If you have e-discovery issues, please seek professional assistance.)

I think topic maps could be very relevant in this situation because subjects permeate the discovery process, under different names and perspectives, to say nothing of sharing analysis and data with co-counsel.

I am also mindful that analysis of presentations, speeches, written documents, emails, discovery from other cases, could well develop profiles of potential witnesses in business litigation in particular. A topic map could be quite useful in mapping the terminology most likely to be used by a particular defendant.

BTW, it will be a long time coming, in part because it would reduce the fees of the defense bar, but I would say, “OK, here are the 3 million emails. We reserve the right to move to exclude any on the basis of privilege, relevancy, etc.”

That ends all the dancing around about discovery, and if the plaintiff wants to slog through 3 million emails, fine. They still have to disclose what they intend to produce as exhibits at trial.

Downloading the XML data from the Exome Variant Server

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:44 am

Downloading the XML data from the Exome Variant Server

Pierre Lindenbaum writes:

From EVS: “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog). The following Java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.

If you are interested in tools that will assist you in populating a genome-centric topic map, Pierre’s blog is an important one to follow.
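Pierre's program is in Java and speaks SOAP; purely as a sketch of the "one request per 150000 bp window" idea he describes, here is a hedged Python version. The endpoint URL, SOAP body, and chromosome lengths are placeholders, not the real EVS service details.

```python
import requests  # any HTTP client would do

WINDOW = 150_000  # base pairs per request, as in Pierre's program

# Placeholder chromosome lengths; the real values come from the genome build in use.
CHROM_LENGTHS = {"1": 249_250_621, "2": 243_199_373}

ENDPOINT = "http://example.org/evs-soap"  # placeholder, not the real EVS endpoint

SOAP_TEMPLATE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body><!-- hypothetical request body for {chrom}:{start}-{end} --></soap:Body>
</soap:Envelope>"""

def fetch_windows(chrom, length):
    """Yield one XML response per fixed-size genomic window on a chromosome."""
    for start in range(1, length, WINDOW):
        end = min(start + WINDOW - 1, length)
        body = SOAP_TEMPLATE.format(chrom=chrom, start=start, end=end)
        resp = requests.post(ENDPOINT, data=body, headers={"Content-Type": "text/xml"})
        yield resp.text  # XML to be parsed and loaded into a local database

for chrom, length in CHROM_LENGTHS.items():
    for xml in fetch_windows(chrom, length):
        pass  # parse with xml.etree.ElementTree and store
```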

May 7, 2012

Startups are Creating a New System of the World for IT

Filed under: Cloud Computing — Patrick Durusau @ 7:19 pm

Startups are Creating a New System of the World for IT

Todd Hoff writes:

It remains that, from the same principles, I now demonstrate the frame of the System of the World. — Isaac Newton

The practice of IT reminds me a lot of the practice of science before Isaac Newton. Aristotelianism was dead, but there was nothing to replace it. Then Newton came along, created a scientific revolution with his System of the World. And everything changed. That was New System of the World number one.

New System of the World number two was written about by the incomparable Neal Stephenson in his incredible Baroque Cycle series. It explores the singular creation of a new way of organizing society grounded in new modes of thought in business, religion, politics, and science. Our modern world emerged Enlightened as it could from this roiling cauldron of forces.

In IT we may have had a Leonardo da Vinci or even a Galileo, but we’ve never had our Newton. Maybe we don’t need a towering genius to make everything clear? For years startups, like the frenetically inventive age of the 17th and 18th centuries, have been creating a New System of the World for IT from a mix of ideas that many thought crazy at first, but have turned out to be the founding principles underlying our modern world of IT.

If you haven’t guessed it yet, I’m going to make the case that the New System of the World for IT is that much over hyped word: cloud. I hope to show, using many real examples from real startups, that the cloud is built on a powerful system of ideas and technologies that make it a superior model for delivering IT.

Interesting piece but Todd misses a couple of critical points:

First, Newton was wrong. (Full stop.) True, his imagining of the world was sufficient, even over-determining, for centuries, but it wasn’t true. It took until the 20th century for his hegemony to be overturned, but overturned it was.

Newtonian mechanics is still taught, but for how much longer, as our understanding of quantum systems grows and our designs move closer and closer to realms unimagined by Newton?

Second, every effort you find at SourceForge or Freshmeat or similar locales is the project of someone, or a small group of someones, all utterly convinced that their project has some unique insight that isn’t contained in the other N projects of the same type. That may well be true, at least for some of them.

But the point remains that the “cloud” enables that fracturing of IT services to a degree not seen up until now. Well, at least not for a long time.

I remember there being 300 or so formats that conversion software offered to handle. How many exist in the cloud today? How many do you think there will be a year from now? (Or perhaps better, how many clouds do you think there will be a year from now?)

With or without the cloud, greater data access is going to drive the need for an understanding and modeling of the subject identities that underlie data and its structures. Brave new world or no.

Enjoy your Newtonian (or is that Napoleonic?) dreams.

List of Hand-Picked and Recommended Data Visualization Tools

Filed under: Graphics,Visualization — Patrick Durusau @ 7:19 pm

List of Hand-Picked and Recommended Data Visualization Tools

Information aesthetics points us to a collection of “recommended” data visualization tools.

There are lots of tools out there and more appearing every day.

These will be enough to get you started and to gain enough experience to choose wisely from other offerings.

Enjoy!

(Please re-post and/or forward.)

Indexing Reverse Top-k Queries

Filed under: Indexing,Reverse Data Management,Top-k Query Processing — Patrick Durusau @ 7:19 pm

Indexing Reverse Top-k Queries by Sean Chester, Alex Thomo, S. Venkatesh, and Sue Whitesides.

Abstract:

We consider the recently introduced monochromatic reverse top-k queries which ask for, given a new tuple q and a dataset D, all possible top-k queries on D ∪ {q} for which q is in the result. Towards this problem, we focus on designing indexes in two dimensions for repeated (or batch) querying, a novel but practical consideration. We present the insight that by representing the dataset as an arrangement of lines, a critical k-polygon can be identified and used exclusively to respond to reverse top-k queries. We construct an index based on this observation which has guaranteed worst-case query cost that is logarithmic in the size of the k-polygon.

We implement our work and compare it to related approaches, demonstrating that our index is fast in practice. Furthermore, we demonstrate through our experiments that a k-polygon is comprised of a small proportion of the original data, so our index structure consumes little disk space.
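The contribution of the paper is the index; purely to pin down the problem it indexes, here is a hedged brute-force sketch in two dimensions: sample weight vectors and report those for which the new tuple q lands in the top-k of D ∪ {q}. The k-polygon construction in the paper exists precisely to avoid this kind of enumeration. The data below is made up.

```python
import heapq

def reverse_topk_bruteforce(D, q, k, samples=100):
    """Return sampled weight vectors (w, 1-w) whose top-k over D ∪ {q} contains q.

    Tuples are 2-d points; a weighted query scores a tuple t as w[0]*t[0] + w[1]*t[1],
    and "top-k" means the k highest-scoring tuples.
    """
    hits = []
    data = D + [q]
    for i in range(samples + 1):
        w = (i / samples, 1 - i / samples)
        topk = heapq.nlargest(k, data, key=lambda t: w[0] * t[0] + w[1] * t[1])
        if q in topk:
            hits.append(w)
    return hits

D = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
print(reverse_topk_bruteforce(D, q=(0.8, 0.3), k=2))
```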

This was the article that made me start looking for resources on “reverse data management.”

Interesting in its own right but I commend it to your attention in part because of the recognition that interesting challenges lie ahead with higher-dimensional indexing.

If you think about it, “indexing” in the sense of looking for a simple string token, is indexing in one dimension.

If you index a simple string token, but with the further requirement that it appear in JAMA (Journal of the American Medical Association), that is indexing in two dimensions.

If you index a simple string token, appearing in JAMA, but it must also be in an article with the “tag” cancer, now you are indexing in three dimensions.

A human indexer, creating an annual index of cancer publications would move along those dimensions with ease.

Topic maps are an attempt to capture that insight, so the human indexer’s work can be replicated automatically.

Reverse Data Management

Filed under: Reverse Data Management — Patrick Durusau @ 7:19 pm

Reverse Data Management

I encountered “reverse data management” today.

Being unfamiliar I searched for more information and encountered this page, which reads in part as follows:

Forward and Reverse Data Transformations

Database research mainly focuses on forward-moving data flows: source data is subjected to transformations and evolves through queries, aggregations, and view definitions to form a new target instance, possibly with a different schema. This forward paradigm underpins most data management tasks today, such as querying, data integration, data mining, etc. We contrast this forward processing with Reverse Data Management (RDM), where the action needs to be performed on the input data, on behalf of desired outcomes in the output data. Some data management tasks already fall under this paradigm, for example updates through views, data generation, data cleaning and repair. RDM is, by necessity, conceptually more difficult to define, and computationally harder to achieve. Today, however, as increasingly more of the available data is derived from other data, there is an increased need to be able to modify the input in order to achieve a desired effect on the output, motivating a systematic study of RDM.

[graphic omitted]

In general, RDM problems are harder to formulate and implement, because of the simple fact that the inverse of a function is not always a function. Given a desired output (or change to the output), there are multiple inputs (or none at all) that can produce it. This is a prevailing difficulty in all RDM problems. This project aims to develop a unified framework for Reverse Data Management problems, which will bring together several subfields of database research. RDM problems can be classified along two dimensions, as shown in the table below. On the “target” dimension, problems are divided into those that have explicit and those that have implicit specifications. The former means that the desired target effect is given as a tuple-level data instance; this is the case in causality and view updates. The latter means that the target effect is described indirectly, through statistics and constraints; examples include how-to queries and data generation. On the “source” dimension, problems are divided in those that use a reference source, and those that do not. For example, view updates and how-to queries fall under the former category, while data generation under the latter.
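A tiny illustration of why the reverse direction is harder: the forward query below is a function of its input, but the reverse question, "what input change produces a desired output?", has many answers (or none). The names and numbers are made up.

```python
# Forward direction: a simple aggregate query over the input.
sales = {"north": 120, "south": 80, "west": 100}
total = sum(sales.values())           # 300 -- a function of the input

# Reverse direction: "make the total 320" does not determine the input.
# Any of these edits (and infinitely many others) achieves the target:
option_a = {**sales, "north": 140}    # raise one existing value
option_b = {**sales, "south": 100}    # raise a different one
option_c = {**sales, "east": 20}      # even the set of keys can change

assert all(sum(o.values()) == 320 for o in (option_a, option_b, option_c))
```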

I am encouraged by the view that changes in inputs can cause changes in outputs.

It sounds trivial to say.

It is one step down the slippery path where outputs aren’t determinate, in some manner “out there” and inter-subjective.

Outputs depend upon what we make of the inputs.

If I don’t like the outcome, I just need a new set of inputs. Or a new reading of the old ones.

And I need to be mindful that is always the case, whatever I think of the current outputs.

If the edges of database research are exploring RDM issues, that should be a warning to the intelligence community, which appears to think value can be derived from crunching enough data.

Perhaps, but be mindful that data crunching produces outcomes based on inputs. If the inputs change, so may the outputs. Something to think about.

Particularly if your integration solution is “lite” on enabling you to probe (alter?) the subject identities as inputs that are shaping your outputs.

Make no mistake, whether we acknowledge it or not, every datum, every data structure, every operation, every input and every output represents choices. Choices that can be explicit and accounted for, or not.

RDM looks like a coming “hot” field of research that addresses some of those issues.

RANK: Top-k Query Processing

Filed under: Top-k Query Processing — Patrick Durusau @ 7:18 pm

RANK: Top-k Query Processing

Although a bit dated, the project offers useful illustrations of top-k query processing in a variety of areas.

Graphs in the Cloud: Neo4j and Heroku

Filed under: Graphs,Heroku,Neo4j — Patrick Durusau @ 7:18 pm

Graphs in the Cloud: Neo4j and Heroku, a presentation by James Ward (Heroku) and Andreas Kollegger (Neo Technology).

From the webpage:

Thursday May 10 10:00 PDT / 17:00 GMT

With more and more applications in the cloud, developers are looking for a fast solution to deploy their applications. This webinar is intended for developers that are interested in the value of launching your application in the cloud, and the power of using a graph database.

In this session, you will learn:

  • how to build Java applications that connect to the Neo4j graph database.
  • how to instantly deploy and scale those applications on the cloud with Heroku.

I’m registered, how about you?
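The webinar is Java-focused; just to give a flavor of "an application that connects to the Neo4j graph database," here is a hedged sketch using the official Neo4j Python driver (which post-dates this webinar). The URI, credentials, and data are placeholders.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

URI = "bolt://localhost:7687"    # placeholder; on Heroku this would come from configuration
AUTH = ("neo4j", "password")     # placeholder credentials

driver = GraphDatabase.driver(URI, auth=AUTH)

with driver.session() as session:
    # Create two nodes and a relationship, then traverse it back.
    session.run(
        "MERGE (a:Person {name: $a})-[:KNOWS]->(b:Person {name: $b})",
        a="Emil", b="Andreas",
    )
    result = session.run("MATCH (a:Person)-[:KNOWS]->(b) RETURN a.name AS a, b.name AS b")
    for record in result:
        print(record["a"], "knows", record["b"])

driver.close()
```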

Clusters & Communities (overlapping dense groups in networks)

Filed under: CFinder,Clustering,Networks — Patrick Durusau @ 7:18 pm

Clusters & Communities (overlapping dense groups in networks) (CFinder)

From the webpage:

CFinder is a free software for finding and visualizing overlapping dense groups of nodes in networks, based on the Clique Percolation Method (CPM) of Palla et al., Nature 435, 814-818 (2005). CFinder was recently applied to the quantitative description of the evolution of social groups: Palla et al., Nature 446, 664-667 (2007).

CFinder offers a fast and efficient method for clustering data represented by large graphs, such as genetic or social networks and microarray data. CFinder is also very efficient for locating the cliques of large sparse graphs.

I rather like the title of the webpage, as opposed to simply “CFinder,” which is what I started to use. That would be accurate, but it wouldn’t capture the notion of discovering overlapping dense groups in networks.

Whether we realize it or not, the choice of a basis for relationships can produce or conceal any number of dense overlapping groups in a network.

I have mentioned CFinder elsewhere but wanted to call it out, in part to raise its position on my horizon.
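If you want to experiment with the clique percolation idea without installing CFinder, networkx ships an implementation, k_clique_communities. A minimal sketch, not a substitute for CFinder's visualization; the toy network is mine:

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# A toy network with two dense groups that share one node ("dana").
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dana"),
    ("alice", "dana"), ("bob", "dana"),
    ("dana", "erin"), ("dana", "frank"), ("erin", "frank"), ("erin", "gia"),
    ("frank", "gia"), ("dana", "gia"),
])

# Communities are unions of adjacent k-cliques (here k=3, i.e. triangles).
for community in k_clique_communities(G, 3):
    print(sorted(community))
# "dana" appears in both communities -- the groups overlap.
```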

Parallel clustering with CFinder

Filed under: CFinder,Clustering,Networks,Parallel Programming — Patrick Durusau @ 7:18 pm

Parallel clustering with CFinder by Peter Pollner, Gergely Palla, and Tamas Vicsek.

Abstract:

The amount of available data about complex systems is increasing every year, measurements of larger and larger systems are collected and recorded. A natural representation of such data is given by networks, whose size is following the size of the original system. The current trend of multiple cores in computing infrastructures call for a parallel reimplementation of earlier methods. Here we present the grid version of CFinder, which can locate overlapping communities in directed, weighted or undirected networks based on the clique percolation method (CPM). We show that the computation of the communities can be distributed among several CPU-s or computers. Although switching to the parallel version not necessarily leads to gain in computing time, it definitely makes the community structure of extremely large networks accessible.

If you aren’t familiar with CFinder, you should be.
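This is not the authors' grid implementation, but one easy way to see how the work can be split: communities never span disconnected components, so each component can be handled by a separate process. A hedged sketch using networkx and multiprocessing:

```python
import multiprocessing as mp

import networkx as nx
from networkx.algorithms.community import k_clique_communities

def communities_of(component_edges, k=3):
    """Worker: rebuild a component from its edge list and find its k-clique communities."""
    H = nx.Graph(component_edges)
    return [frozenset(c) for c in k_clique_communities(H, k)]

def parallel_cpm(G, k=3, processes=4):
    # Each connected component is an independent unit of work.
    edge_lists = [list(G.subgraph(c).edges()) for c in nx.connected_components(G)]
    with mp.Pool(processes) as pool:
        results = pool.starmap(communities_of, [(edges, k) for edges in edge_lists])
    return [community for per_component in results for community in per_component]

if __name__ == "__main__":
    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)])  # two triangles
    print(parallel_cpm(G, k=3, processes=2))
```

As the abstract notes, parallelizing does not always save wall-clock time, but it does make very large networks fit into the available memory per worker.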

Neo4j Podcasts

Filed under: Graphs,Neo4j — Patrick Durusau @ 9:18 am

Neo4j Podcasts

Peter Hoffman calls our attention to Neo4j podcasts:

  • Emil Eifrem, the man behind Neo4j (mp3)
  • Agile architect Peter Bell on Neo4j, a graph-oriented datastore built in Java (mp3)

I always thought Dr. Who was behind Neo4j?

Could be wrong.

Will check with the old hands at Neo4j.

May 6, 2012

Avengers, Assembled (and Visualized) – Part 2

Filed under: Graphics,Visualization — Patrick Durusau @ 7:45 pm

Avengers, Assembled (and Visualized) – Part 2

Jer writes:

Last week I shared a set of visualizations I made, exploring the history of The Avengers – the Marvel comic series which first appeared in 1963, and was last week released as a bombastic, blockbuster film (which, by the way, I enjoyed tremendously). I looked at the 570-issue archive as a whole, and tried to dig out some interesting patterns concerning female characters, robots, and gods (as far as I know, there are no female robot god avengers – though I guess Jocasta comes pretty close). If you missed that first post, you might want to give it a quick read right now, as I’ll be picking up where I left off.

So far, the discussion has been mostly around the characters of the Avengers, at a collective level. Lots of data is available about each individual character, as well – for example we can look at any Avenger and see every appearance they’ve made over the last 50 years. Here’s Captain America’s record number of appearances:

What a delightful way to start the week!

This is way cool!

Enjoy!

Explicit Semantic Analysis (ESA) using Wikipedia

Filed under: Explicit Semantic Analysis,TF-IDF — Patrick Durusau @ 7:45 pm

Explicit Semantic Analysis (ESA) using Wikipedia

From the webpage:

The Explicit Semantic Analysis (ESA) method (Gabrilovich and Markovitch, 2007) is a measure to compute the semantic relatedness (SR) between two arbitrary texts. The Wikipedia-based technique represents terms (or texts) as high-dimensional vectors, each vector entry presenting the TF-IDF weight between the term and one Wikipedia article. The semantic relatedness between two terms (or texts) is expressed by the cosine measure between the corresponding vectors.

I have no objection to machine-based techniques, or human-based ones for that matter, so long as the limitations of both are kept firmly in mind.

Some older resources on Explicit Semantic Analysis.
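A minimal sketch of the ESA idea using scikit-learn: represent each text as a vector of TF-IDF weights against a set of "concept" articles (three stand-in snippets here, not Wikipedia), then compare texts by the cosine of those concept vectors. The articles and example texts are mine, purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in "concept" articles; in real ESA these are Wikipedia articles.
articles = [
    "A graph database stores vertices and edges and answers traversal queries.",
    "The supreme court interprets the constitution and reviews lower court rulings.",
    "Neurons in the brain connect through synapses and fire electrical signals.",
]

vectorizer = TfidfVectorizer(stop_words="english")
concept_term = vectorizer.fit_transform(articles)   # (concepts x terms) TF-IDF matrix

def concept_vector(text):
    """Project a text into concept space: one weight per 'article'."""
    term_vec = vectorizer.transform([text])          # (1 x terms)
    return term_vec @ concept_term.T                 # (1 x concepts)

def semantic_relatedness(text_a, text_b):
    return cosine_similarity(concept_vector(text_a), concept_vector(text_b))[0, 0]

print(semantic_relatedness("court ruling on database evidence",
                           "judge reviews the constitution"))      # related via the law concept
print(semantic_relatedness("graph traversal of vertices",
                           "brain neurons and synapses"))          # unrelated concepts, near zero
```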

Why Your Brain Isn’t A Computer

Filed under: Artificial Intelligence,Semantics,Subject Identity — Patrick Durusau @ 7:45 pm

Why Your Brain Isn’t A Computer by Alex Knapp.

Alex writes:

“If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.”
– Emerson M. Pugh

Earlier this week, io9 featured a primer, of sorts, by George Dvorsky regarding how an artificial human brain could be built. It’s worth reading, because it provides a nice overview of the philosophy that underlies some artificial intelligence research, while simultaneously – albeit unwittingly – demonstrating some of the fundamental flaws underlying artificial intelligence research based on the computational theory of mind.

The computational theory of mind, in essence, says that your brain works like a computer. That is, it takes input from the outside world, then performs algorithms to produce output in the form of mental state or action. In other words, it claims that the brain is an information processor where your mind is “software” that runs on the “hardware” of the brain.

Dvorsky explicitly invokes the computational theory of mind by stating “if brain activity is regarded as a function that is physically computed by brains, then it should be possible to compute it on a Turing machine, namely a computer.” He then sets up a false dichotomy by stating that “if you believe that there’s something mystical or vital about human cognition you’re probably not going to put too much credence” into the methods of developing artificial brains that he describes.

I don’t normally read Forbes but I made an exception in this case and am glad I did.

Not that I particularly care about which side of the AI debate you come out on.

I do think that the notion of “emergent” properties is an important one for judging subject identities, whether those subjects occur in text messages, intercepted phone calls, or signal “intel” of any sort.

Properties that identify subjects “emerge” from a person who speaks the language in question, who has social/intellectual/cultural experiences that give them a grasp of the matters under discussion and perhaps the underlying intent of the parties to the conversation.

A computer program can be trained to mindlessly sort through large amounts of data. It can even be trained to acceptable levels of mis-reading, mis-interpretation.

What will our evaluation be when it misses the one conversation prior to another 9/11? Because the context or language was not anticipated? Because the connection would only emerge out of a living understanding of cultural context?

Computers are deeply useful, but not when emergent properties, of the sort that identify subjects, targets, and the like, are at issue.

The Cost of Rediscovery

Filed under: Rediscovery — Patrick Durusau @ 7:44 pm

Bob Carpenter writes in: Mavandadi et al. (2012) Distributed Medical Image Analysis and Diagnosis through Crowd-Sourced Games: A Malaria Case Study:

I found a link from Slashdot of all places to this forthcoming paper:

None of the nine authors, the reviewer(s) or editor(s) knew that their basic technique has been around for over 30 years. (I’m talking about the statistical technique here, not the application to distributed diagnosis of diseases, which I don’t know anything about.)

Of course, many of us reinvented this particular wheel over the past three decades, and the lack of any coherent terminology for the body of work across computer science, statistics, and epidemiology is part of the problem. But now, in 2012, a simple web search for crowdsourcing should reveal the existing literature because enough of us have found it and cited it.

Have you ever wondered how much reinvention costs your company or organization every year?

Or how little benefit is derived from funded research that reinvents a wheel?

It is popular to talk about more cost-effective and efficient government, but shouldn’t being cost-effective and efficient be goals of private organizations as well?

How would you go about detecting reinvention of the wheel in your company or organization? (Leaving to one side how you would preserve that discovery once made.)

Big Data Equivalent to CRM?

Filed under: BigData — Patrick Durusau @ 7:44 pm

Wharton Professor Pokes Hole in Big Data Balloon by Robert Gelber.

Just a reminder that useful results may matter more than what terminology you use. Maybe.

Just as we enter the upswing on the hype cycle for big data, the academics are stepping in and making it clear that in this case, there might be much ado about nothing.

According to Dr. Peter Fader, co-director of the Wharton School of Business and marketing professor at the University of Pennsylvania, there are some glaring problems with the way vendors and enterprise execs are framing the conversation around big data.

He doesn’t consider himself a data Luddite, and understands that captured information can generate more value. The argument is that too much information is being captured in order to answer questions that don’t require it.

In an interview this week, Dr. Fader poked holes in big data practices: collecting masses of data to enable granular customer analysis. He says that while the concept itself has high potential (after all, it’s just data), it reminds him of another bubble that infiltrated enterprise IT not so long ago: customer relationship management (CRM). Like big data, he says, the CRM concept focused on collecting and analyzing transactional information, but failed to achieve its stated goal.

Snapshots from the Edge of Big Visualization

Filed under: BigData,Graphics,Visualization — Patrick Durusau @ 7:44 pm

Snapshots from the Edge of Big Visualization

Want inspiration about visualization?

The staff at Datanami have a set of real visual and analytical delights for you:

Visualization is becoming even more critical as datasets continue to outgrow their containers.

Even the most robust analytics applications can sometimes fail to present the big picture of a particular dataset. This realization is the driving force behind the ongoing integration of specialized visualization suites for nearly every analytics offering.

Data scientists of all stripes are continually asking new questions of their data and increasingly want robust tools to help them see what new questions are on the horizon.

Effective visualization allows previously undiscovered questions to emerge and provides a human element to bridge the divide between the plethora of new tools and frameworks for processing and managing big data.

Online resources for handling big data and parallel computing in R

Filed under: BigData,Parallel Programming,R — Patrick Durusau @ 7:44 pm

Online resources for handling big data and parallel computing in R by Yanchang Zhao.

Resources to spice up your reading list for this week:

Compared with many other programming languages, such as C/C++ and Java, R is less efficient and consumes much more memory. Fortunately, there are some packages that enable parallel computing in R and also packages for processing big data in R without loading all data into RAM. I have collected some links to online documents and slides on handling big data and parallel computing in R, which are listed below. Many online resources on other topics related to data mining with R can be found at http://www.rdatamining.com/resources/onlinedocs.

