Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 6, 2012

Functional Thinking: Thinking Functionally, Part 3

Filed under: Functional Programming — Patrick Durusau @ 11:36 am

Functional Thinking: Thinking Functionally, Part 3

From the summary:

Functional thinking series author Neal Ford continues his guided tour of functional programming constructs and paradigms. You’ll look at number-classification code in Scala and take a glance at unit testing in the functional world. Then you’ll learn about partial application and currying — two functional approaches that facilitate code reuse — and see how recursion fits into the functional way of thinking.

Perhaps you will (also) learn a new way to think this year! 😉
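
As a quick taste of the partial application and currying the article covers, here is a minimal sketch in Ruby (not Scala, but the idea carries over; all names are mine, not Ford's):

```ruby
# Currying: a three-argument lambda becomes a chain of one-argument calls.
add = ->(a, b, c) { a + b + c }

curried  = add.curry      # curried proc
add_five = curried[2][3]  # partially apply the first two arguments
puts add_five[10]         # => 15

# Partial application for reuse: a generic logger specialized per level.
log      = ->(level, msg) { puts "[#{level}] #{msg}" }
warn_log = log.curry["WARN"]
warn_log["disk space low"]  # => [WARN] disk space low
```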

Question: If data is too big to move, is it also too big to change while maintaining referential integrity? Or is size a real issue, or simply an excuse? In an open and distributed architecture we cannot know (or find) all the references to our data.

Querying Semi-Structured Data

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and, in Section 4, the issue of the query language.

A bit dated (1996), but still worth reading. Updating the paper would make a nice semester-sized project.
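
For a concrete feel for what "neither raw nor strictly typed" looks like in the labelled-graph style the paper surveys, here is a minimal Ruby sketch (all names and values invented):

```ruby
# Two "person" records in the labelled-graph style: edges are hash keys,
# leaves are atomic values, and the two records need not share a schema.
person1 = {
  "name"    => "Alice",
  "email"   => "alice@example.org",
  "address" => { "city" => "Palo Alto", "zip" => "94301" }
}
person2 = {
  "name"        => { "first" => "Serge", "last" => "Abiteboul" },
  "affiliation" => "INRIA"   # no email edge at all
}

# A query over semi-structured data must tolerate both shapes:
def display_name(person)
  n = person["name"]
  n.is_a?(Hash) ? "#{n['first']} #{n['last']}" : n
end

puts display_name(person1)  # => Alice
puts display_name(person2)  # => Serge Abiteboul
```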

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people downloading it in January/February of 2011 were looking for. Perhaps they are working on an update of the paper? Or would like to collaborate on one?

Seems like a small “feature” that would allow researchers to contact each other without disclosure of email addresses (other than the sender’s, of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)

Neography

Filed under: Neo4j,Neography,Ruby — Patrick Durusau @ 11:36 am

Neography

From the webpage:

Neography is a thin Ruby wrapper to the Neo4j REST API.

If you want to use the full power of Neo4j, you will want to use JRuby and the excellent Neo4j.rb gem at github.com/andreasronge/neo4j by Andreas Ronge

A complement to Neography is the Neology Gem at github.com/lordkada/neology by Carlo Alberto Degli Atti

An alternative is the Architect4r Gem at github.com/namxam/architect4r by Maximilian Schulz

For all you Ruby hackers out there!
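
Not from the webpage, but a minimal sketch of what working with Neography looks like, assuming a Neo4j server on the default localhost port (node properties invented):

```ruby
require 'rubygems'
require 'neography'

# Connects to a Neo4j server at the default http://localhost:7474.
neo = Neography::Rest.new

# Create two nodes with properties and a relationship between them.
johnathan = neo.create_node("name" => "Johnathan", "age" => 31)
mark      = neo.create_node("name" => "Mark",      "age" => 33)
neo.create_relationship("friends", johnathan, mark)
```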

Neovigator

Filed under: Neography,Neovigator,Processing.js — Patrick Durusau @ 11:36 am

Neovigator

From the webpage:

An attempt to use Neography and processing.js to navigate a Neo4j graph via its REST API.

Be sure to visit the demo site: http://neovigator.herokuapp.com/

Mouse-over and enjoy!

Aside to Kirk: Thinking of the potential to represent morphemes, morphological annotations, and syntactic analysis, with variants, in a first-person shooter, sorry, point-and-click interface. First step: display, then authoring.

Hmmm, some relationships could be auto-generated: an emendation need only be entered with its relationship to a morpheme, and its relationships to the word, verse, syntactic divisions, etc. could be added automatically. Could have “display changes” to facilitate review.
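
A plain-Ruby sketch of that auto-generation idea (no graph database; all identifiers invented):

```ruby
# An editor enters only (emendation, morpheme); relationships to the
# containing word, verse, division, etc. are derived from the
# morpheme's existing containment links.
def derived_relationships(emendation, morpheme, parent_of)
  rels = [[emendation, "emends", morpheme]]
  node = morpheme
  while (parent = parent_of[node])
    rels << [emendation, "applies_to", parent]
    node = parent
  end
  rels
end

# Invented: morpheme m1 sits in word w7, verse v3, division d1.
parent_of = { "m1" => "w7", "w7" => "v3", "v3" => "d1" }
derived_relationships("e42", "m1", parent_of).each { |r| p r }
# ["e42", "emends", "m1"]
# ["e42", "applies_to", "w7"]
# ["e42", "applies_to", "v3"]
# ["e42", "applies_to", "d1"]
```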

Semi-structured data and P2P graph databases

Filed under: Graphs,NoSQL,Plasma — Patrick Durusau @ 11:34 am

Semi-structured data and P2P graph databases by Jeff Rose.

From the post:

In a previous post I introduced the Plasma graph query engine that I’ve been working on as part of my thesis project. With Plasma you can declaratively define queries and evaluate them against a graph database. The heart of the system is a library of dataflow query operators, and on top of them sits a fairly simplistic query “language”. (I put it in quotes because in a lisp based language like Clojure the line between a mini-language and an API gets blurry.) In this post I’ll write a bit about why I think graph databases could be an interesting foundation for next generation P2P networks, and then I’ll give some examples of performing distributed graph queries using Plasma. First I think it is important to motivate the use of a graph database though. While most of the marketing speak on the web regarding graph databases is all about representing social network data, this is just one of many potential applications.

I am not convinced the categories of “structured,” “semi-structured,” and “unstructured” data are all that helpful.

For example, when did the New Testament become a structured text? With division into chapters (13th century)? Division into verses (mid-16th century)? Or is it still “unstructured”? Ask the same question of the Tanakh, except there relying on a much richer system of divisions.

If you mean by “structured” a particular form of internal representation and reference, such as are represented to users as relational tables, why not say so? That is a particular form of structuring data, not the only one.

And as Wikipedia observes (Table (Database)):

An equally valid representation of a relation is as an n-dimensional chart, where n is the number of attributes (a table’s columns). For example, a relation with two attributes and three values can be represented as a table with two columns and three rows, or as a two-dimensional graph with three points. The table and graph representations are only equivalent if the ordering of rows is not significant, and the table has no duplicate rows.

I take that to mean that I can treat a graph as a data structure with more “structure” as it were.
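
A toy Ruby illustration of the quoted equivalence (data invented):

```ruby
require 'set'

# The same relation as a table (ordered rows) and as a set of points.
rows   = [[1, "a"], [2, "b"], [3, "a"]]  # table: 2 columns, 3 rows
points = rows.to_set                     # 2-D chart: 3 unordered points

# Equivalence holds exactly when row order is insignificant and there
# are no duplicate rows -- which is what the set view enforces.
puts points == [[3, "a"], [2, "b"], [1, "a"]].to_set  # => true
```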

I am equally unconvinced that P2P networks are the key to avoiding the control and censorship issues of architectures like the Internet. If you think the telcos rolled over quickly when asked for information in the name of “national security,” just think about your CIO or even your local network administrator. And being P2P means arbitrary peers can pick up the data stream. Want to see the folks in dark shades and cheap suits?

P2P may be a better technological choice for lessening the chances of censorship, but social institutions that oppose censorship or make it more difficult are equally important, if not more so.

January 5, 2012

Graph Algorithms

Filed under: Algorithms,Cypher,Graphs,Gremlin,Neo4j — Patrick Durusau @ 4:14 pm

Graph Algorithms

I ran across this Wikipedia book while working on one of the data structures posts for today.

I think you may find it useful but some cautions:

First, being a collection of Wikipedia articles, it doesn’t have a consistent editorial voice. That is more than fussiness: the depth and usefulness of the explanations vary from article to article.

Second, you will find topics that are “stubs,” and hence not very useful.

Third, I think that with the advent of Neo4j, Gremlin, Cypher and other graph databases/software, future entries should include, in addition to text, exercises that users can perform with common software to reinforce their understanding of the entries. (A sketch of such an exercise follows.)
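
For example, a hedged sketch of such an exercise using Neography’s Cypher hook, assuming a local Neo4j server and the START-based Cypher syntax current at the time of writing (the query and node id are illustrative):

```ruby
require 'neography'

neo = Neography::Rest.new

# Exercise: how many neighbors does node 0 have?
result = neo.execute_query("START n=node(0) MATCH n--m RETURN count(m)")
puts result["data"].first.first  # the single count value
```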

Running along the graph using Neo4J Spatial and Gephi

Filed under: Gephi,Graphs,Neo4j — Patrick Durusau @ 4:14 pm

Running along the graph using Neo4J Spatial and Gephi

Just to whet your appetite:

When I started running some years ago, I bought a Garmin Forerunner 405. It’s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including distance, average pace, elevation loss/gain and lap splits. It also visualizes your run, by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my last runs can be found here.

Apart from simple aggregations such as total distance and average speed, the Garmin Connect website provides little or no support to gain deeper insights in all of my runs. As I often run the same course, it would be interesting to calculate my average pace at specific locations. When combining the data of all of my courses, I could deduct frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.

Suggestion: If you want to know where you go and/or how you spend your time, try tracking both for a week. Faithfully record how you spend your time (reading, commuting, TV, exercise, work, etc.) in, say, 30-minute intervals. Also keep track of your physical location. Don’t try to be overly precise; use big buckets. And no peeking at how the week is shaping up until it’s over. I think you will be surprised.

Interoperability Driven Integration of Biomedical Data Sources

Interoperability Driven Integration of Biomedical Data Sources by Douglas Teodoro, Rémy Choquet, Daniel Schober, Giovanni Mels, Emilie Pasche, Patrick Ruch, and Christian Lovis.

Abstract:

In this paper, we introduce a data integration methodology that promotes technical, syntactic and semantic interoperability for operational healthcare data sources. ETL processes provide access to different operational databases at the technical level. Furthermore, data instances have their syntax aligned according to biomedical terminologies using natural language processing. Finally, semantic web technologies are used to ensure common meaning and to provide ubiquitous access to the data. The system’s performance and solvability assessments were carried out using clinical questions against seven healthcare institutions distributed across Europe. The architecture managed to provide interoperability within the limited heterogeneous grid of hospitals. Preliminary scalability result tests are provided.

Appears in:

Studies in Health Technology and Informatics
Volume 169, 2011
User Centred Networked Health Care – Proceedings of MIE 2011
Edited by Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen
ISBN 978-1-60750-805-2

I have been unable to find a copy online, well, other than the publisher’s copy, at $20 for four pages. I have written to one of the authors requesting a personal use copy as I would like to report back on what it proposes.

New Year’s Resolution: Learn How to Code

Filed under: Programming — Patrick Durusau @ 4:12 pm

New Year’s Resolution: Learn How to Code by Stephen Turner

From the post:

Q&A sites for biologists are littered with questions from researchers asking for non-technical, code-free ways of doing a particular analysis. Your friendly bioinformatics or computational biology neighbor can often point to a resource or design a solution that can get you 90% of the way, but usually won’t grok the biological problem as truly as you do. By learning even the smallest bit of programming, you can at least be equipped with the knowledge of what is programmatically possible, and collaborations with your bioinformatician can be more fruitful. As every field of biological research becomes more computational in nature, learning how to code is becoming more important than ever. (emphasis added)

The line “…usually won’t grok the biological problem as truly as you do….” is the key to the article, but you will find a number of excellent resources cited further down in it.

I say that because programmers are going to code to the implicit subjects that they recognize and understand as important for the program. Nothing wrong with that and it would be quite odd if they didn’t. The problem is those may not represent your understanding of what you want to accomplish, including the subjects that you think are important to be recognized.

Yes, programs consist of subjects, even though we don’t normally use topic maps syntax to identify them. Nor should we if we want acceptable running times. What we can do is be sure that the subjects that are important to us, perhaps identified by a topic map in the planning stages of a project, are represented in the acceptable inputs and results of a program. Knowing how to program, or even read code a bit, will help you achieve that goal.

Getting started with Ruby and Neo4j

Filed under: Neo4j,Ruby — Patrick Durusau @ 4:11 pm

Getting started with Ruby and Neo4j

Max De Marzi walks you through installation of neography and then to making a social network graph. Nothing new but a gentle introduction to Neo4j with promises of more to come on Gremlin and Cypher (ways to walk across the graph).

Pass along to any Rubyists that need an introduction to Neo4j.

Digging into Data Challenge

Filed under: Archives,Contest,Data Mining,Library,Preservation — Patrick Durusau @ 4:09 pm

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

I am interested to hear your comments on the projects, as I am sure the project teams would be as well.

Two Journalist Databases

Filed under: News — Patrick Durusau @ 4:07 pm

Two Journalist Databases

Matthew Hurst has found two databases about “them,” you know, members of the fourth estate. 😉

Curious how you would combine the information from these two sources?

Or taking that combination and creating a window for viewing stories written by a particular reporter and providing access to the information from these databases at the same time?

Perhaps we should just say “the press” and leave the “public” out of it, to avoid the implication that it is the “public’s” interest that is being served by the press.

Baltimore gun offenders and where academics don’t live

Filed under: Data Analysis,Geographic Data,Statistics — Patrick Durusau @ 4:06 pm

Baltimore gun offenders and where academics don’t live

An interesting plotting of the residential addresses (not crime locations) of gun offenders. You need to see the post to observe how stark the “island” of academics appears on the map.

Illustration of non-causation, unless you want to contend that the presence of academics in a neighborhood drives out gun offenders. Which would argue in favor of more employment and wider residential patterns for academics. I would favor that but suspect that is personal bias.

A cross between this map and a map of gun offenses would be a good guide for housing prospects in Baltimore.

What other data would be useful for such a map? Education, libraries, fire protection, other crime rates…. Easy enough since there are geographic boundaries as the binding points but “summing up” information as you zoom out might be interesting.

That is to say, crime statistics are kept on a police district basis and, as you zoom out, you want information from multiple districts merged and resorted. Or you have overlapping districts for water, electricity, police, fire, etc. A geographic grid becomes your starting place, but only a starting place.
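
A toy Ruby sketch of that “summing up” as you zoom out (district names, regions, and counts all invented):

```ruby
# Per-district counts merged into larger regions via a
# district -> region mapping.
district_counts = { "NE" => 120, "NW" => 95, "SE" => 210, "SW" => 164 }
region_of       = { "NE" => "North", "NW" => "North",
                    "SE" => "South", "SW" => "South" }

zoomed_out = district_counts.each_with_object(Hash.new(0)) do |(d, n), acc|
  acc[region_of[d]] += n
end
p zoomed_out  # => {"North"=>215, "South"=>374}
```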

Data Structures and Algorithms

Filed under: Data Structures — Patrick Durusau @ 4:05 pm

Data Structures and Algorithms with Object-Oriented Design Patterns in Java by Bruno R. Preiss.

From Goals:

The primary goal of this book is to promote object-oriented design using Java and to illustrate the use of the emerging object-oriented design patterns. Experienced object-oriented programmers find that certain ways of doing things work best and that these ways occur over and over again. The book shows how these patterns are used to create good software designs. In particular, the following design patterns are used throughout the text: singleton, container, enumeration, adapter and visitor.

Virtually all of the data structures are presented in the context of a single, unified, polymorphic class hierarchy. This framework clearly shows the relationships between data structures and it illustrates how polymorphism and inheritance can be used effectively. In addition, algorithmic abstraction is used extensively when presenting classes of algorithms. By using algorithmic abstraction, it is possible to describe a generic algorithm without having to worry about the details of a particular concrete realization of that algorithm.

A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. In the past when the topics in this book were taught at the graduate level, an author could rely on students having the needed background in mathematics. However, because the book is targeted for second- and third-year students, it is necessary to fill in the background as needed. To the extent possible without compromising correctness, the presentation fosters intuitive understanding of the concepts rather than mathematical rigor.

Noticed in David Eppstein’s Link Roundup.

Open Data Structures

Filed under: Data Structures,Java — Patrick Durusau @ 4:04 pm

Open Data Structures by Pat Morin.

From “about:”

Open Data Structures covers the implementation and analysis of data structures for sequences (lists), queues, priority queues, unordered dictionaries, and ordered dictionaries.

Data structures presented in the book include stacks, queues, deques, and lists implemented as arrays and linked-list; space-efficient implementations of lists; skip lists; hash tables and hash codes; binary search trees including treaps, scapegoat trees, and red-black trees; and heaps, including implicit binary heaps and randomized meldable heaps.

The data structures in this book are all fast, practical, and have provably good running times. All data structures are rigorously analyzed and implemented in Java and C++. The Java implementations implement the corresponding interfaces in the Java Collections Framework.

The book and accompanying source code are free (libre and gratis) and are released under a Creative Commons Attribution License. Users are free to copy, distribute, use, and adapt the text and source code, even commercially. The book’s LaTeX sources, Java/C++ sources, and build scripts are available through github.

Noticed in David Eppstein’s Link Roundup.

January 4, 2012

To Know, but Not Understand: David Weinberger on Science and Big Data

Filed under: Books,Epistemology,Knowledge,Philosophy of Science — Patrick Durusau @ 2:21 pm

To Know, but Not Understand: David Weinberger on Science and Big Data

From the introduction:

In an edited excerpt from his new book, Too Big to Know, David Weinberger explains how the massive amounts of data necessary to deal with complex phenomena exceed any single brain’s ability to grasp, yet networked science rolls on.

Well, it is a highly entertaining excerpt, with passages like:

For example, the biological system of an organism is complex beyond imagining. Even the simplest element of life, a cell, is itself a system. A new science called systems biology studies the ways in which external stimuli send signals across the cell membrane. Some stimuli provoke relatively simple responses, but others cause cascades of reactions. These signals cannot be understood in isolation from one another. The overall picture of interactions even of a single cell is more than a human being made out of those cells can understand. In 2002, when Hiroaki Kitano wrote a cover story on systems biology for Science magazine — a formal recognition of the growing importance of this young field — he said: “The major reason it is gaining renewed interest today is that progress in molecular biology … enables us to collect comprehensive datasets on system performance and gain information on the underlying molecules.” Of course, the only reason we’re able to collect comprehensive datasets is that computers have gotten so big and powerful. Systems biology simply was not possible in the Age of Books.

Weinberger slips betwixt and between philosophy of science, epistemology, various aspects of biology, and computational science. Not to mention the odd bald-faced assertion, such as: “…the biological system of an organism is complex beyond imagining.” At one time the same could have been said about the atom. I think some progress has been made on understanding that last item, or so physicists claim.

Don’t get me wrong, I have a copy on order and look forward to reading it.

But, no single reader will be able to discover all the factual errors and leaps of logic in Too Big to Know. Perhaps a website or wiki, Too Big to Correct?

Google Correlate expands to 49 additional countries

Filed under: Google Correlate,Search Behavior,Searching — Patrick Durusau @ 12:06 pm

Google Correlate expands to 49 additional countries

From the post by Matt Mohebbi, Software Engineer:

In May of this year we launched Google Correlate on Google Labs. This system enables a correlation search between a user-provided time series and millions of time series of Google search traffic. Since our initial launch, we’ve graduated to Google Trends and we’ve seen a number of great applications of Correlate in several domains, including economics (consumer spending, unemployment rate and housing inventory), sociology and meteorology. The correspondence of gas prices and search activity for fuel efficient cars was even briefly discussed in a Fox News presidential debate and NPR recently covered correlations related to political commentators.

Google has added 49 countries for use with Correlate, bringing the total to 50.

Just in case you are curious:

Country Table for Google Correlate – 4 Jan. 2012
  • Argentina
  • Australia
  • Austria
  • Belgium
  • Brazil
  • Bulgaria
  • Canada
  • Chile
  • China
  • Colombia
  • Croatia
  • Czech Republic
  • Denmark
  • Egypt
  • Finland
  • France
  • Germany
  • Greece
  • Hungary
  • India
  • Indonesia
  • Ireland
  • Israel
  • Italy
  • Japan
  • Malaysia
  • Mexico
  • Morocco
  • Netherlands
  • New Zealand
  • Norway
  • Peru
  • Philippines
  • Poland
  • Portugal
  • Romania
  • Russian Federation
  • Saudi Arabia
  • Singapore
  • Spain
  • Sweden
  • Switzerland
  • Taiwan
  • Thailand
  • Turkey
  • Ukraine
  • United Kingdom
  • United States
  • Venezuela
  • Viet Nam

What correlations are you going to find? (Bearing in mind that correlation is not causation.)

Algorithm estimates who’s in control

Filed under: Data Analysis,Discourse,Linguistics,Social Graphs,Social Networks — Patrick Durusau @ 10:43 am

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. have developed an algorithm that ranks people based on how they speak to each other.

This, on the heels of Big Brother’s Name is…, has to have you wondering if you even want Internet access at all. 😉

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData.

Top Holiday Gifts For Data Scientists

Filed under: Books,Data Science — Patrick Durusau @ 10:28 am

Top Holiday Gifts For Data Scientists by Jeff Hammerbacher.

Hammerbacher is the chief scientist for Cloudera. Need I say more?

Missed the holidays but I do have a birthday coming up. 😉

Enjoy!

Hadoop for Archiving Email – Part 2

Filed under: Hadoop,Indexing,Lucene,Solr — Patrick Durusau @ 9:40 am

Hadoop for Archiving Email – Part 2 by Sunil Sitaula.

From the post:

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.
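
As a hedged sketch of the Solr side from Ruby (using the rsolr gem, which the article itself does not use; field names and values invented):

```ruby
require 'rsolr'

solr = RSolr.connect(url: "http://localhost:8983/solr")

# Index a parsed email as a Solr document, then commit.
solr.add(
  id:      "msg-0001",
  subject: "Q4 numbers",
  from:    "alice@example.org",
  body:    "Attached are the Q4 numbers we discussed."
)
solr.commit

# Search the archive.
response = solr.get("select", params: { q: "subject:Q4" })
puts response["response"]["numFound"]
```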

Data Structure for Social News Streams on Graph Databases

Filed under: Graphs,News,Social Media — Patrick Durusau @ 8:34 am

Data Structure for Social News Streams on Graph Databases

René Pickhardt writes (in part):

I also looked into the case of saving the news stream as a flat file for every user in which the events from his friends are saved for every user. For some reason I thought I had picked up somewhere that facebook is running on such a system. But right now I can’t find the resource anymore. If you can, please tell me! Anyway while studying these different approaches I realized that the flat file approach even though it seems to be primitive makes perfect sense. It scales to infinity and is very fast for reading! Even though I can’t find the resource anymore I will still call this approach the Facebook approach.

I was now wondering how you would store a social news stream in a graph data base like neo4j in a way that you get some nice properties. More specifically I wanted to combine the advantages of both the facebook and the twitter approach and try to get rid of the downfalls. And guess what! To me this seems actually possible on graph data bases. The key Idea is to store the social network and content items created by the users not only in a star topology but also in a list topology ordered by time of occuring events. The crucial part is to maintain this topology which is actually possible in O(1) while Updates occure to the graph. (emphasis in original)

See the post for links to his poster, paper and other interesting material.
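
A plain-Ruby sketch of the list-topology idea, showing why prepending a new event is O(1) (names invented; the real work in the post is doing this inside the graph database):

```ruby
# Besides the star topology (user -> content), each user's events form
# a time-ordered linked list; publishing prepends at the head.
Event = Struct.new(:payload, :created_at, :next_event)

class UserStream
  attr_reader :head

  def initialize
    @head = nil
  end

  # O(1): the new event simply points at the previous head.
  def publish(payload)
    @head = Event.new(payload, Time.now, @head)
  end

  def each_event
    e = @head
    while e
      yield e
      e = e.next_event
    end
  end
end

stream = UserStream.new
stream.publish("went running")
stream.publish("posted a photo")
stream.each_event { |e| puts "#{e.created_at}: #{e.payload}" }  # newest first
```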

Riak NoSQL Database: Use Cases and Best Practices

Filed under: NoSQL,Riak — Patrick Durusau @ 7:49 am

Riak NoSQL Database: Use Cases and Best Practices

From the post:

Riak is a key-value based NoSQL database that can be used to store user session related data. Andy Gross from Basho Technologies recently spoke at QCon SF 2011 Conference about Riak use cases. InfoQ spoke with Andy and Mark Phillips (Community Manager) about Riak database features and best practices when using Riak.

Not a lot of technical detail but enough to get a feel for whether you want/need to learn more about Riak.

Big Brother’s Name is…

Filed under: Marketing,Networks,Social Media,Social Networks — Patrick Durusau @ 7:09 am

not the FBI, CIA, Interpol, Mossad, NSA or any other government agency.

Walmart all but claims that name at: Social Genome.

From the webpage:

In a sense, the social world — all the millions and billions of tweets, Facebook messages, blog postings, YouTube videos, and more – is a living organism itself, constantly pulsating and evolving. The Social Genome is the genome of this organism, distilling it to the most essential aspects.

At the labs, we have spent the past few years building and maintaining the Social Genome itself. We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.

For example, when Susan Boyle was first mentioned on the Web, we quickly detected that she was becoming an interesting person in the world of social media. So we added her to the Social Genome, then monitored social media to collect more information about her. Her appearances became events, and the bigger events were added to the Social Genome as well. As another example, when a new coffee maker was mentioned on the Web, we detected and added it to the Social Genome. We strive to keep the Social Genome up to date. For example, we typically detect and add information from a tweet into the Social Genome within two seconds, from the moment the tweet arrives in our labs.

As a result of our effort, the Social Genome is a vast, constantly changing, up-to-date knowledge base, with hundreds of millions of entities and relationships. We then use the Social Genome to perform semantic analysis of social media, and to power a broad array of e-commerce applications. For example, if a user never uses the word “coffee”, but has mentioned many gourmet coffee brands (such as “Kopi Luwak”) in his tweets, we can use the Social Genome to detect the brands, and infer that he is interested in gourmet coffee. As another example, using the Social Genome, we may find that a user frequently mentions movies in her tweets. As a result, when she tweeted “I love salt!”, we can infer that she is probably talking about the movie “salt”, not the condiment (both of which appear as entities in the Social Genome).

Two seconds after you hit “send” on your tweet, it has been stripped, analyzed and added to the Social Genome at WalMart. For every tweet. Plus other data.
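
A toy Ruby sketch of the kind of entity-to-interest inference described above (the dictionary is invented, and it lacks the disambiguation Walmart claims, as the “salt” case shows):

```ruby
# A tiny "genome": entity mentions mapped to interest categories.
GENOME = {
  "kopi luwak"  => :gourmet_coffee,
  "blue bottle" => :gourmet_coffee,
  "salt"        => :movies
}

def inferred_interests(tweet)
  text = tweet.downcase
  GENOME.select { |entity, _| text.include?(entity) }.values.uniq
end

p inferred_interests("Finally tried Kopi Luwak this morning!")
# => [:gourmet_coffee]
p inferred_interests("I love salt!")  # naive: movie or condiment?
# => [:movies]
```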

How should we respond to this news?

One response is to trust that WalMart, and whoever it sells this data trove to, will use the information to enhance your shopping experience and help you achieve greater fulfilment by balancing shopping against your credit limit.

Another response is to ask for legislation to attempt regulation of a multi-national corporation that is larger than many governments.

Another response is to hold sit-ins and social consciousness raising events at WalMart locations.

My suggestion? One good turn deserves another.

WalMart is owned by someone. Walmart has a board of directors. Walmart has corporate officers. Walmart has managers, sales representatives, attorneys and advertising executives. All of whom have information footprints. Perhaps not as public as ours, but they exist. Why not gather up information on who is running Walmart? Fighting fire with fire, as they say. Publish that information so that regulators, stock brokers, divorce lawyers and others can have access to it.

Let’s welcome WalMart as “Little Big Brothers.”

Now in JAGS! Now in JAGS!

Filed under: Bayesian Data Analysis,R — Patrick Durusau @ 7:09 am

Now in JAGS! Now in JAGS!

John K. Kruschke writes:

I have created JAGS versions of all the BUGS programs in Doing Bayesian Data Analysis. Unlike BUGS, JAGS runs on MacOS, Linux, and Windows. JAGS has other features that make it more robust and user-friendly than BUGS. I recommend that you use the JAGS versions of the programs. Please let me know if you encounter any errors or inaccuracies in the programs. (hyperlink to book added)

First spotted by Matthew O’Donnell (@mdbod).

January 3, 2012

Knowledge Federation 2010: Self-Organizing Collective Mind

Filed under: Conferences,Federation,Knowledge Organization — Patrick Durusau @ 5:15 pm

Knowledge Federation 2010: Self-Organizing Collective Mind

The proceedings from the Second International Workshop on Knowledge Federation, Dubrovnik, Croatia, October 3-6, 2010, edited by Dino Karabeg and Jack Park, have just appeared online.

Table of Contents

Preface

  1. The Praxis of Social Knowledge Federation
    Arnim Bleier, Patrick Jahnichen, Uta Schulze, Lutz Maicher
  2. Steps Towards a Federated Course Model
    Dino Karabeg
  3. On Nature and Control of Creativity: Tesla as a Case Study
    Dejan Rakovic
    Dejan Rakovic
  4. Semiotic Perspective on Sensemaking Software and Consequences for Journalism
    Shiqin “Eddie” Choo
  5. Towards a Federated Framework for Self-evolving Educational Experience Design on Massive Scale (SEED-M)
    George Pór
  6. Context-Driven Social Network Visualisation: Case Wiki Co-Creation
    Jukka Huhtamaki, Jaakko Salonen, Jarno Marttila, Ossi Nykanen
  7. Boundary Infrastructures for Conversational Knowledge Federation
    Jack Park
  8. Combinatorial Inquiries into Knowledge Federation
    Karl F. Hebenstreit Jr.
  9. An Ark for the Exaflood Rushing upon Us
    Mei Lin Fung, Robert S. Stephenson
  10. Webbles: Programmable and Customizable Meme Media Objects in a Knowledge Federation Framework Environment on the Web
    Micke N. Kuwahara, Yuzuru Tanaka
  11. Causality in collective filtering
    Mario Paolucci, Stefano Picascia, Walter Quattrociocchi
  12. Images of knowledge. Interfaces for knowledge access in an epistemic transition
    Marco Quaggiotto
  13. New ecosystem in journalism: Decentralized newsrooms empowered by self-organized crowds
    Tanja Aitamurto

Tribeforth

Filed under: Networks,Politics — Patrick Durusau @ 5:14 pm

Tribeforth

From the homepage:

Tribeforth Foundation is a group of people developing and promoting a collective intelligence computer system to assist in stimulating new solutions, ideas and connections on a global scale. The system as it is planned is not unlike an everyday wiki. The key difference is that millions can speak as one without losing a voice and the software tunes the conversation into reason. This keeps us from getting lost in syntax and helps us to work with the real semantics.

Heavily rooted in collective intelligence and the semantic web (Web 3.0) we are organizing a collection of open source software and then extending them to create the most high tech discussion platform in human history. Available to anyone, anywhere as a basic standard of living.

A handful of powerful, fundamental principles and values guide us here at Tribeforth. We use these principles to create new tools for all of us

A project built on the principles of self-reflection, an echo of human ingenuity.

I don’t know if topic maps would be of assistance or not but when you are talking about making connections that persist across semantic boundaries (my words, not theirs), then you are going to need topic maps or something very similar.

I suppose I am a bit old school for the disclaimer:

THE TRIBEFORTH SYSTEM WILL NOT COLLECT ANY INFORMATION REGARDING MILITARY PERSONNEL, SYSTEMS, EQUIPMENT, PLANNING OR DEPLOYMENT. INCIDENTS REGARDING HUMAN RIGHTS VIOLATIONS ARE NOT SUBJECT TO THIS POLICY.

Existing solutions/structures are not going to go into the night quietly. That is a historical certainty. I would rather be prepared for the push back.

List of cities/states with open data – help me find more!

Filed under: Data,Government Data — Patrick Durusau @ 5:13 pm

List of cities/states with open data – help me find more!

A plea from “Simply Statistics” to increase its listing of cities with open data.

Mostly American and Canadian, with a few others, Berlin for example, suggested in comments.

I haven’t looked (yet) but since European libraries led the charge, in many ways, toward greater access to their collections (my recollection, yours may differ), I would expect to find European cities and authorities also ahead in the race to publish public data.

Pointers from European readers? (Or I can look them up later this week, just not today.)

Voting Networks in the Danish Parliament [2004-2011]

Filed under: Graphs,Networks,R — Patrick Durusau @ 5:12 pm

Voting Networks in the Danish Parliament [2004-2011]

From the post:

One of my Christmas presents was the book Beautiful Visualization. Chapter 8 by Andrew Odewahn is a very nice piece on visualizing the U.S. Senate social graph. Odewahn basically builds an affinity network, where ties represent whether two senators have voted in the same manner during a given time period. The rules for creating the network are nicely broken down into the following steps:

  1. Nodes represent senators
  2. Nodes are colored according to party affiliation
  3. Nodes are connected with an edge if two senators voted together more than 65% of the time during a given timeframe

Based on the above rules Odewahn builds a series of interesting graphs, showing that there are a few consistently bipartisan senators on both sides in almost every session of the Congress.

But rather than just grousing about American politics (don’t get me started), the author produces voting network graphs of the Danish Parliament!

I leave it for you to decide if the results signal hope or despair. 😉
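
A minimal Ruby sketch of Odewahn’s 65% edge rule quoted above, on invented vote data:

```ruby
# Connect two legislators if they voted the same way on more than 65%
# of their shared votes.
votes = {
  "A" => [:yes, :yes, :no, :yes],
  "B" => [:yes, :no,  :no, :yes],
  "C" => [:no,  :no,  :yes, :no]
}

edges = votes.keys.combination(2).select do |a, b|
  pairs     = votes[a].zip(votes[b])
  agreement = pairs.count { |x, y| x == y }.fdiv(pairs.size)
  agreement > 0.65
end
p edges  # => [["A", "B"]]  (A and B agree 75% of the time)
```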

Topical Classification of Biomedical Research Papers – Details

Filed under: Bioinformatics,Biomedical,Medical Informatics,MeSH,PubMed,Topic Maps — Patrick Durusau @ 5:11 pm

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With a use of our automatic tagging algorithm, which we will describe in details after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical value. Intuitively, they can be interpreted as values of a rough membership function that measures a degree in which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have little (or even none) non-zero values (corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.
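
A minimal Ruby sketch of loading the files as described above (file names as given in the task description, including “trainingLables.txt”):

```ruby
require 'csv'

# Each row: association strengths (0..1000), one document per row.
training = CSV.read("trainingData.csv", col_sep: "\t")
              .map { |row| row.map(&:to_i) }

# Each row: comma-separated topic ids (1..83) for the matching document.
labels = File.readlines("trainingLables.txt")  # (sic, name as given)
             .map { |line| line.strip.split(",").map(&:to_i) }

# Sparsity check: fraction of non-zero cells in the training matrix.
nonzero = training.sum { |row| row.count { |v| v != 0 } }
total   = training.size * training.first.size
puts "non-zero fraction: #{(nonzero.to_f / total).round(4)}"
```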

I am looking at it as an opportunity to learn a good bit about automatic text classification and what, if any, role that topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!
