Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 6, 2012

Functional Thinking: Thinking Functionally, Part 3

Filed under: Functional Programming — Patrick Durusau @ 11:36 am

Functional Thinking: Thinking Functionally, Part 3

From the summary:

Functional thinking series author Neal Ford continues his guided tour of functional programming constructs and paradigms. You’ll look at number-classification code in Scala and take a glance at unit testing in the functional world. Then you’ll learn about partial application and currying — two functional approaches that facilitate code reuse — and see how recursion fits into the functional way of thinking.

Perhaps you will (also) learn a new way to think this year! 😉
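
As a quick taste of the partial application and currying the article covers, here is a minimal sketch in Ruby (not Scala, but the idea carries over; all names are mine, not Ford's):

```ruby
# Currying: a three-argument lambda becomes a chain of one-argument calls.
add = ->(a, b, c) { a + b + c }

curried  = add.curry      # curried proc
add_five = curried[2][3]  # partially apply the first two arguments
puts add_five[10]         # => 15

# Partial application for reuse: a generic logger specialized per level.
log      = ->(level, msg) { puts "[#{level}] #{msg}" }
warn_log = log.curry["WARN"]
warn_log["disk space low"]  # => [WARN] disk space low
```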

Question: If data is too big to move, is it also too big to change while maintaining referential integrity? Or is size a real issue, or simply an excuse? In an open and distributed architecture we cannot know (or find) all the references to our data.

Querying Semi-Structured Data

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and, in Section 4, the issue of the query language.

A bit dated (1996), but still worth reading. Updating the paper would make a nice semester-sized project.
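
For a concrete feel for what "neither raw nor strictly typed" looks like in the labelled-graph style the paper surveys, here is a minimal Ruby sketch (all names and values invented):

```ruby
# Two "person" records in the labelled-graph style: edges are hash keys,
# leaves are atomic values, and the two records need not share a schema.
person1 = {
  "name"    => "Alice",
  "email"   => "alice@example.org",
  "address" => { "city" => "Palo Alto", "zip" => "94301" }
}
person2 = {
  "name"        => { "first" => "Serge", "last" => "Abiteboul" },
  "affiliation" => "INRIA"   # no email edge at all
}

# A query over semi-structured data must tolerate both shapes:
def display_name(person)
  n = person["name"]
  n.is_a?(Hash) ? "#{n['first']} #{n['last']}" : n
end

puts display_name(person1)  # => Alice
puts display_name(person2)  # => Serge Abiteboul
```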

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people downloading it in January/February of 2011 were looking for. Perhaps they are working on an update of the paper? Or would like to collaborate on one?

Seems like a small “feature” that would allow researchers to contact each other without disclosure of email addresses (other than the sender’s, of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)

Neography

Filed under: Neo4j,Neography,Ruby — Patrick Durusau @ 11:36 am

Neography

From the webpage:

Neography is a thin Ruby wrapper to the Neo4j REST API.

If you want to use the full power of Neo4j, you will want to use JRuby and the excellent Neo4j.rb gem at github.com/andreasronge/neo4j by Andreas Ronge

A complement to Neography is the Neology Gem at github.com/lordkada/neology by Carlo Alberto Degli Atti

An alternative is the Architect4r Gem at github.com/namxam/architect4r by Maximilian Schulz

For all you Ruby hackers out there!
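
Not from the webpage, but a minimal sketch of what working with Neography looks like, assuming a Neo4j server on the default localhost port (node properties invented):

```ruby
require 'rubygems'
require 'neography'

# Connects to a Neo4j server at the default http://localhost:7474.
neo = Neography::Rest.new

# Create two nodes with properties and a relationship between them.
johnathan = neo.create_node("name" => "Johnathan", "age" => 31)
mark      = neo.create_node("name" => "Mark",      "age" => 33)
neo.create_relationship("friends", johnathan, mark)
```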

Neovigator

Filed under: Neography,Neovigator,Processing.js — Patrick Durusau @ 11:36 am

Neovigator

From the webpage:

An attempt to use Neography and processing.js to navigate a Neo4j graph via its REST API.

Be sure to visit the demo site: http://neovigator.herokuapp.com/

Mouse-over and enjoy!

Aside to Kirk: Thinking of the potential to represent morphemes, morphological annotations, and syntactic analysis, with variants, in a first-person shooter, sorry, point-and-click interface. First step: display, then authoring.

Hmmm, some relationships could be auto-generated: an emendation need only be entered with its relationship to a morpheme, and its relationships to the word, verse, syntactic divisions, etc. could be added automatically. Could have “display changes” to facilitate review.
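
A plain-Ruby sketch of that auto-generation idea (no graph database; all identifiers invented):

```ruby
# An editor enters only (emendation, morpheme); relationships to the
# containing word, verse, division, etc. are derived from the
# morpheme's existing containment links.
def derived_relationships(emendation, morpheme, parent_of)
  rels = [[emendation, "emends", morpheme]]
  node = morpheme
  while (parent = parent_of[node])
    rels << [emendation, "applies_to", parent]
    node = parent
  end
  rels
end

# Invented: morpheme m1 sits in word w7, verse v3, division d1.
parent_of = { "m1" => "w7", "w7" => "v3", "v3" => "d1" }
derived_relationships("e42", "m1", parent_of).each { |r| p r }
# ["e42", "emends", "m1"]
# ["e42", "applies_to", "w7"]
# ["e42", "applies_to", "v3"]
# ["e42", "applies_to", "d1"]
```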

Semi-structured data and P2P graph databases

Filed under: Graphs,NoSQL,Plasma — Patrick Durusau @ 11:34 am

Semi-structured data and P2P graph databases by Jeff Rose.

From the post:

In a previous post I introduced the Plasma graph query engine that I’ve been working on as part of my thesis project. With Plasma you can declaratively define queries and evaluate them against a graph database. The heart of the system is a library of dataflow query operators, and on top of them sits a fairly simplistic query “language”. (I put it in quotes because in a lisp based language like Clojure the line between a mini-language and an API gets blurry.) In this post I’ll write a bit about why I think graph databases could be an interesting foundation for next generation P2P networks, and then I’ll give some examples of performing distributed graph queries using Plasma. First I think it is important to motivate the use of a graph database though. While most of the marketing speak on the web regarding graph databases is all about representing social network data, this is just one of many potential applications.

I am not convinced the categories of “structured,” “semi-structured,” and “unstructured” data are all that helpful.

For example, when did the New Testament become a structured text? With division into chapters (13th century)? Division into verses (mid-16th century)? Or is it still “unstructured”? Ask the same question of the Tanakh, except there relying on a much richer system of divisions.

If you mean by “structured” a particular form of internal representation and reference, such as are represented to users as relational tables, why not say so? That is a particular form of structuring data, not the only one.

And as Wikipedia observes (Table (Database)):

An equally valid representation of a relation is as an n-dimensional chart, where n is the number of attributes (a table’s columns). For example, a relation with two attributes and three values can be represented as a table with two columns and three rows, or as a two-dimensional graph with three points. The table and graph representations are only equivalent if the ordering of rows is not significant, and the table has no duplicate rows.

I take that to mean that I can treat a graph as a data structure with more “structure” as it were.
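
A toy Ruby illustration of the quoted equivalence (data invented):

```ruby
require 'set'

# The same relation as a table (ordered rows) and as a set of points.
rows   = [[1, "a"], [2, "b"], [3, "a"]]  # table: 2 columns, 3 rows
points = rows.to_set                     # 2-D chart: 3 unordered points

# Equivalence holds exactly when row order is insignificant and there
# are no duplicate rows -- which is what the set view enforces.
puts points == [[3, "a"], [2, "b"], [1, "a"]].to_set  # => true
```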

I am equally unconvinced that P2P networks are the key to avoiding the control and censorship issues of architectures like the Internet. If you think the telcos rolled over quickly when asked for information in the name of “national security,” just think about your CIO or even your local network administrator. And being P2P means arbitrary peers can pick up the data stream. Want to see the folks in dark shades and cheap suits?

P2P may be a better technological choice for lessening the chances of censorship, but social institutions that oppose censorship or make it more difficult are equally important, if not more so.

January 5, 2012

Graph Algorithms

Filed under: Algorithms,Cypher,Graphs,Gremlin,Neo4j — Patrick Durusau @ 4:14 pm

Graph Algorithms

I ran across this Wikipedia book while working on one of the data structures posts for today.

I think you may find it useful but some cautions:

First, being a collection of Wikipedia articles, it doesn’t have a consistent editorial voice. That is more than fussiness: the depth and usefulness of the explanations vary from article to article.

Second, you will find topics that are “stubs,” and hence not very useful.

Third, I think that with the advent of Neo4j, Gremlin, Cypher and other graph databases/software, future entries should include, in addition to text, exercises that users can perform with common software to reinforce their understanding of the entries. (A sketch of such an exercise follows.)
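
For example, a hedged sketch of such an exercise using Neography’s Cypher hook, assuming a local Neo4j server and the START-based Cypher syntax current at the time of writing (the query and node id are illustrative):

```ruby
require 'neography'

neo = Neography::Rest.new

# Exercise: how many neighbors does node 0 have?
result = neo.execute_query("START n=node(0) MATCH n--m RETURN count(m)")
puts result["data"].first.first  # the single count value
```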

Running along the graph using Neo4J Spatial and Gephi

Filed under: Gephi,Graphs,Neo4j — Patrick Durusau @ 4:14 pm

Running along the graph using Neo4J Spatial and Gephi

Just to whet your appetite:

When I started running some years ago, I bought a Garmin Forerunner 405. It’s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including distance, average pace, elevation loss/gain and lap splits. It also visualizes your run, by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my last runs can be found here.

Apart from simple aggregations such as total distance and average speed, the Garmin Connect website provides little or no support to gain deeper insights in all of my runs. As I often run the same course, it would be interesting to calculate my average pace at specific locations. When combining the data of all of my courses, I could deduct frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.

Suggestion: If you want to know where you go and/or how you spend your time, try tracking both for a week. Faithfully record how you spend your time (reading, commuting, TV, exercise, work, etc.) in, say, 30-minute intervals. Also keep track of your physical location. Don’t try to be overly precise; use big buckets. And no peeking at how the week is shaping up until it’s over. I think you will be surprised.

Interoperability Driven Integration of Biomedical Data Sources

Interoperability Driven Integration of Biomedical Data Sources by Douglas Teodoro, Rémy Choquet, Daniel Schober, Giovanni Mels, Emilie Pasche, Patrick Ruch, and Christian Lovis.

Abstract:

In this paper, we introduce a data integration methodology that promotes technical, syntactic and semantic interoperability for operational healthcare data sources. ETL processes provide access to different operational databases at the technical level. Furthermore, data instances have their syntax aligned according to biomedical terminologies using natural language processing. Finally, semantic web technologies are used to ensure common meaning and to provide ubiquitous access to the data. The system’s performance and solvability assessments were carried out using clinical questions against seven healthcare institutions distributed across Europe. The architecture managed to provide interoperability within the limited heterogeneous grid of hospitals. Preliminary scalability result tests are provided.

Appears in:

Studies in Health Technology and Informatics
Volume 169, 2011
User Centred Networked Health Care – Proceedings of MIE 2011
Edited by Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen
ISBN 978-1-60750-805-2

I have been unable to find a copy online, well, other than the publisher’s copy, at $20 for four pages. I have written to one of the authors requesting a personal use copy as I would like to report back on what it proposes.

New Year’s Resolution: Learn How to Code

Filed under: Programming — Patrick Durusau @ 4:12 pm

New Year’s Resolution: Learn How to Code by Stephen Turner

From the post:

Q&A sites for biologists are littered with questions from researchers asking for non-technical, code-free ways of doing a particular analysis. Your friendly bioinformatics or computational biology neighbor can often point to a resource or design a solution that can get you 90% of the way, but usually won’t grok the biological problem as truly as you do. By learning even the smallest bit of programming, you can at least be equipped with the knowledge of what is programmatically possible, and collaborations with your bioinformatician can be more fruitful. As every field of biological research becomes more computational in nature, learning how to code is becoming more important than ever. (emphasis added)

The line “…usually won’t grok the biological problem as truly as you do….” is the key to the article, but you will find a number of excellent resources cited further down in it.

I say that because programmers are going to code to the implicit subjects that they recognize and understand as important for the program. Nothing wrong with that and it would be quite odd if they didn’t. The problem is those may not represent your understanding of what you want to accomplish, including the subjects that you think are important to be recognized.

Yes, programs consist of subjects, even though we don’t normally use topic maps syntax to identify them. Nor should we if we want acceptable running times. What we can do is be sure that the subjects that are important to us, perhaps identified by a topic map in the planning stages of a project, are represented in the acceptable inputs and results of a program. Knowing how to program, or even read code a bit, will help you achieve that goal.

Getting started with Ruby and Neo4j

Filed under: Neo4j,Ruby — Patrick Durusau @ 4:11 pm

Getting started with Ruby and Neo4j

Max De Marzi walks you through installation of neography and then to making a social network graph. Nothing new but a gentle introduction to Neo4j with promises of more to come on Gremlin and Cypher (ways to walk across the graph).

Pass along to any Rubyists that need an introduction to Neo4j.

Digging into Data Challenge

Filed under: Archives,Contest,Data Mining,Library,Preservation — Patrick Durusau @ 4:09 pm

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

I am interested to hear your comments on the projects, as I am sure the project teams would be as well.

Two Journalist Databases

Filed under: News — Patrick Durusau @ 4:07 pm

Two Journalist Databases

Matthew Hurst has found two databases about “them,” you know, members of the fourth estate. 😉

Curious how you would combine the information from these two sources?

Or taking that combination and creating a window for viewing stories written by a particular reporter and providing access to the information from these databases at the same time?

Perhaps we should just say “the press” and leave the “public” out of it, to avoid the implication that it is the “public’s” interest that is being served by the press.

Baltimore gun offenders and where academics don’t live

Filed under: Data Analysis,Geographic Data,Statistics — Patrick Durusau @ 4:06 pm

Baltimore gun offenders and where academics don’t live

An interesting plotting of the residential addresses (not crime locations) of gun offenders. You need to see the post to observe how stark the “island” of academics appears on the map.

Illustration of non-causation, unless you want to contend that the presence of academics in a neighborhood drives out gun offenders. Which would argue in favor of more employment and wider residential patterns for academics. I would favor that but suspect that is personal bias.

A cross between this map and a map of gun offenses would be a good guide for housing prospects in Baltimore.

What other data would be useful for such a map? Education, libraries, fire protection, other crime rates…. Easy enough since there are geographic boundaries as the binding points but “summing up” information as you zoom out might be interesting.

That is to say, crime statistics are kept on a police district basis and, as you zoom out, you want information from multiple districts merged and resorted. Or you have overlapping districts for water, electricity, police, fire, etc. A geographic grid becomes your starting place, but only a starting place.
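
A toy Ruby sketch of that “summing up” as you zoom out (district names, regions, and counts all invented):

```ruby
# Per-district counts merged into larger regions via a
# district -> region mapping.
district_counts = { "NE" => 120, "NW" => 95, "SE" => 210, "SW" => 164 }
region_of       = { "NE" => "North", "NW" => "North",
                    "SE" => "South", "SW" => "South" }

zoomed_out = district_counts.each_with_object(Hash.new(0)) do |(d, n), acc|
  acc[region_of[d]] += n
end
p zoomed_out  # => {"North"=>215, "South"=>374}
```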

Data Structures and Algorithms

Filed under: Data Structures — Patrick Durusau @ 4:05 pm

Data Structures and Algorithms with Object-Oriented Design Patterns in Java by Bruno R. Preiss.

From Goals:

The primary goal of this book is to promote object-oriented design using Java and to illustrate the use of the emerging object-oriented design patterns. Experienced object-oriented programmers find that certain ways of doing things work best and that these ways occur over and over again. The book shows how these patterns are used to create good software designs. In particular, the following design patterns are used throughout the text: singleton, container, enumeration, adapter and visitor.

Virtually all of the data structures are presented in the context of a single, unified, polymorphic class hierarchy. This framework clearly shows the relationships between data structures and it illustrates how polymorphism and inheritance can be used effectively. In addition, algorithmic abstraction is used extensively when presenting classes of algorithms. By using algorithmic abstraction, it is possible to describe a generic algorithm without having to worry about the details of a particular concrete realization of that algorithm.

A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. In the past when the topics in this book were taught at the graduate level, an author could rely on students having the needed background in mathematics. However, because the book is targeted for second- and third-year students, it is necessary to fill in the background as needed. To the extent possible without compromising correctness, the presentation fosters intuitive understanding of the concepts rather than mathematical rigor.

Noticed in David Eppstein’s Link Roundup.

Open Data Structures

Filed under: Data Structures,Java — Patrick Durusau @ 4:04 pm

Open Data Structures by Pat Morin.

From “about:”

Open Data Structures covers the implementation and analysis of data structures for sequences (lists), queues, priority queues, unordered dictionaries, and ordered dictionaries.

Data structures presented in the book include stacks, queues, deques, and lists implemented as arrays and linked-list; space-efficient implementations of lists; skip lists; hash tables and hash codes; binary search trees including treaps, scapegoat trees, and red-black trees; and heaps, including implicit binary heaps and randomized meldable heaps.

The data structures in this book are all fast, practical, and have provably good running times. All data structures are rigorously analyzed and implemented in Java and C++. The Java implementations implement the corresponding interfaces in the Java Collections Framework.

The book and accompanying source code are free (libre and gratis) and are released under a Creative Commons Attribution License. Users are free to copy, distribute, use, and adapt the text and source code, even commercially. The book’s LaTeX sources, Java/C++ sources, and build scripts are available through github.

Noticed in David Eppstein’s Link Roundup.

January 4, 2012

To Know, but Not Understand: David Weinberger on Science and Big Data

Filed under: Books,Epistemology,Knowledge,Philosophy of Science — Patrick Durusau @ 2:21 pm

To Know, but Not Understand: David Weinberger on Science and Big Data

From the introduction:

In an edited excerpt from his new book, Too Big to Know, David Weinberger explains how the massive amounts of data necessary to deal with complex phenomena exceed any single brain’s ability to grasp, yet networked science rolls on.

Well, it is a highly entertaining excerpt, with passages like:

For example, the biological system of an organism is complex beyond imagining. Even the simplest element of life, a cell, is itself a system. A new science called systems biology studies the ways in which external stimuli send signals across the cell membrane. Some stimuli provoke relatively simple responses, but others cause cascades of reactions. These signals cannot be understood in isolation from one another. The overall picture of interactions even of a single cell is more than a human being made out of those cells can understand. In 2002, when Hiroaki Kitano wrote a cover story on systems biology for Science magazine — a formal recognition of the growing importance of this young field — he said: “The major reason it is gaining renewed interest today is that progress in molecular biology … enables us to collect comprehensive datasets on system performance and gain information on the underlying molecules.” Of course, the only reason we’re able to collect comprehensive datasets is that computers have gotten so big and powerful. Systems biology simply was not possible in the Age of Books.

Weinberger slips betwixt and between philosophy of science, epistemology, various aspects of biology, and computational science. Not to mention the odd bald-faced assertion, such as: “…the biological system of an organism is complex beyond imagining.” At one time the same could have been said about the atom. I think some progress has been made on understanding that last item, or so physicists claim.

Don’t get me wrong, I have a copy on order and look forward to reading it.

But, no single reader will be able to discover all the factual errors and leaps of logic in Too Big to Know. Perhaps a website or wiki, Too Big to Correct?

Google Correlate expands to 49 additional countries

Filed under: Google Correlate,Search Behavior,Searching — Patrick Durusau @ 12:06 pm

Google Correlate expands to 49 additional countries

From the post by Matt Mohebbi, Software Engineer:

In May of this year we launched Google Correlate on Google Labs. This system enables a correlation search between a user-provided time series and millions of time series of Google search traffic. Since our initial launch, we’ve graduated to Google Trends and we’ve seen a number of great applications of Correlate in several domains, including economics (consumer spending, unemployment rate and housing inventory), sociology and meteorology. The correspondence of gas prices and search activity for fuel efficient cars was even briefly discussed in a Fox News presidential debate and NPR recently covered correlations related to political commentators.

Google has added 49 countries for use with Correlate, bringing the total to 50.

Just in case you are curious:

Country Table for Google Correlate – 4 Jan. 2012
  • Argentina
  • Australia
  • Austria
  • Belgium
  • Brazil
  • Bulgaria
  • Canada
  • Chile
  • China
  • Colombia
  • Croatia
  • Czech Republic
  • Denmark
  • Egypt
  • Finland
  • France
  • Germany
  • Greece
  • Hungary
  • India
  • Indonesia
  • Ireland
  • Israel
  • Italy
  • Japan
  • Malaysia
  • Mexico
  • Morocco
  • Netherlands
  • New Zealand
  • Norway
  • Peru
  • Philippines
  • Poland
  • Portugal
  • Romania
  • Russian Federation
  • Saudi Arabia
  • Singapore
  • Spain
  • Sweden
  • Switzerland
  • Taiwan
  • Thailand
  • Turkey
  • Ukraine
  • United Kingdom
  • United States
  • Venezuela
  • Viet Nam

What correlations are you going to find? (Bearing in mind that correlation is not causation.)

Algorithm estimates who’s in control

Filed under: Data Analysis,Discourse,Linguistics,Social Graphs,Social Networks — Patrick Durusau @ 10:43 am

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. have developed an algorithm that ranks people based on how they speak to each other.

This, on the heels of Big Brother’s Name is…, has to have you wondering if you even want Internet access at all. 😉

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData.

Top Holiday Gifts For Data Scientists

Filed under: Books,Data Science — Patrick Durusau @ 10:28 am

Top Holiday Gifts For Data Scientists by Jeff Hammerbacher.

Hammerbacher is the chief scientist for Cloudera. Need I say more?

Missed the holidays but I do have a birthday coming up. 😉

Enjoy!

Hadoop for Archiving Email – Part 2

Filed under: Hadoop,Indexing,Lucene,Solr — Patrick Durusau @ 9:40 am

Hadoop for Archiving Email – Part 2 by Sunil Sitaula.

From the post:

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.
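
As a hedged sketch of the Solr side from Ruby (using the rsolr gem, which the article itself does not use; field names and values invented):

```ruby
require 'rsolr'

solr = RSolr.connect(url: "http://localhost:8983/solr")

# Index a parsed email as a Solr document, then commit.
solr.add(
  id:      "msg-0001",
  subject: "Q4 numbers",
  from:    "alice@example.org",
  body:    "Attached are the Q4 numbers we discussed."
)
solr.commit

# Search the archive.
response = solr.get("select", params: { q: "subject:Q4" })
puts response["response"]["numFound"]
```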

Data Structure for Social News Streams on Graph Databases

Filed under: Graphs,News,Social Media — Patrick Durusau @ 8:34 am

Data Structure for Social News Streams on Graph Databases

René Pickhardt writes (in part):

I also looked into the case of saving the news stream as a flat file for every user in which the events from his friends are saved for every user. For some reason I thought I had picked up somewhere that facebook is running on such a system. But right now I can’t find the resource anymore. If you can, please tell me! Anyway while studying these different approaches I realized that the flat file approach even though it seems to be primitive makes perfect sense. It scales to infinity and is very fast for reading! Even though I can’t find the resource anymore I will still call this approach the Facebook approach.

I was now wondering how you would store a social news stream in a graph data base like neo4j in a way that you get some nice properties. More specifically I wanted to combine the advantages of both the facebook and the twitter approach and try to get rid of the downfalls. And guess what! To me this seems actually possible on graph data bases. The key Idea is to store the social network and content items created by the users not only in a star topology but also in a list topology ordered by time of occuring events. The crucial part is to maintain this topology which is actually possible in O(1) while Updates occure to the graph. (emphasis in original)

See the post for links to his poster, paper and other interesting material.
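
A plain-Ruby sketch of the list-topology idea, showing why prepending a new event is O(1) (names invented; the real work in the post is doing this inside the graph database):

```ruby
# Besides the star topology (user -> content), each user's events form
# a time-ordered linked list; publishing prepends at the head.
Event = Struct.new(:payload, :created_at, :next_event)

class UserStream
  attr_reader :head

  def initialize
    @head = nil
  end

  # O(1): the new event simply points at the previous head.
  def publish(payload)
    @head = Event.new(payload, Time.now, @head)
  end

  def each_event
    e = @head
    while e
      yield e
      e = e.next_event
    end
  end
end

stream = UserStream.new
stream.publish("went running")
stream.publish("posted a photo")
stream.each_event { |e| puts "#{e.created_at}: #{e.payload}" }  # newest first
```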

Riak NoSQL Database: Use Cases and Best Practices

Filed under: NoSQL,Riak — Patrick Durusau @ 7:49 am

Riak NoSQL Database: Use Cases and Best Practices

From the post:

Riak is a key-value based NoSQL database that can be used to store user session related data. Andy Gross from Basho Technologies recently spoke at QCon SF 2011 Conference about Riak use cases. InfoQ spoke with Andy and Mark Phillips (Community Manager) about Riak database features and best practices when using Riak.

Not a lot of technical detail but enough to get a feel for whether you want/need to learn more about Riak.

Big Brother’s Name is…

Filed under: Marketing,Networks,Social Media,Social Networks — Patrick Durusau @ 7:09 am

not the FBI, CIA, Interpol, Mossad, NSA or any other government agency.

Walmart all but claims that name at: Social Genome.

From the webpage:

In a sense, the social world — all the millions and billions of tweets, Facebook messages, blog postings, YouTube videos, and more – is a living organism itself, constantly pulsating and evolving. The Social Genome is the genome of this organism, distilling it to the most essential aspects.

At the labs, we have spent the past few years building and maintaining the Social Genome itself. We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.

For example, when Susan Boyle was first mentioned on the Web, we quickly detected that she was becoming an interesting person in the world of social media. So we added her to the Social Genome, then monitored social media to collect more information about her. Her appearances became events, and the bigger events were added to the Social Genome as well. As another example, when a new coffee maker was mentioned on the Web, we detected and added it to the Social Genome. We strive to keep the Social Genome up to date. For example, we typically detect and add information from a tweet into the Social Genome within two seconds, from the moment the tweet arrives in our labs.

As a result of our effort, the Social Genome is a vast, constantly changing, up-to-date knowledge base, with hundreds of millions of entities and relationships. We then use the Social Genome to perform semantic analysis of social media, and to power a broad array of e-commerce applications. For example, if a user never uses the word “coffee”, but has mentioned many gourmet coffee brands (such as “Kopi Luwak”) in his tweets, we can use the Social Genome to detect the brands, and infer that he is interested in gourmet coffee. As another example, using the Social Genome, we may find that a user frequently mentions movies in her tweets. As a result, when she tweeted “I love salt!”, we can infer that she is probably talking about the movie “salt”, not the condiment (both of which appear as entities in the Social Genome).

Two seconds after you hit “send” on your tweet, it has been stripped, analyzed and added to the Social Genome at WalMart. For every tweet. Plus other data.
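
A toy Ruby sketch of the kind of entity-to-interest inference described above (the dictionary is invented, and it lacks the disambiguation Walmart claims, as the “salt” case shows):

```ruby
# A tiny "genome": entity mentions mapped to interest categories.
GENOME = {
  "kopi luwak"  => :gourmet_coffee,
  "blue bottle" => :gourmet_coffee,
  "salt"        => :movies
}

def inferred_interests(tweet)
  text = tweet.downcase
  GENOME.select { |entity, _| text.include?(entity) }.values.uniq
end

p inferred_interests("Finally tried Kopi Luwak this morning!")
# => [:gourmet_coffee]
p inferred_interests("I love salt!")  # naive: movie or condiment?
# => [:movies]
```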

How should we respond to this news?

One response is to trust that WalMart, and whoever it sells this data trove to, will use the information to enhance your shopping experience and help you achieve greater fulfilment by balancing shopping against your credit limit.

Another response is to ask for legislation to attempt regulation of a multi-national corporation that is larger than many governments.

Another response is to hold sit-ins and social consciousness raising events at WalMart locations.

My suggestion? One good turn deserves another.

WalMart is owned by someone. Walmart has a board of directors. Walmart has corporate officers. Walmart has managers, sales representatives, attorneys and advertising executives. All of whom have information footprints. Perhaps not as public as ours, but they exist. Why not gather up information on who is running Walmart? Fighting fire with fire, as they say. Publish that information so that regulators, stock brokers, divorce lawyers and others can have access to it.

Let’s welcome WalMart as “Little Big Brothers.”

Now in JAGS! Now in JAGS!

Filed under: Bayesian Data Analysis,R — Patrick Durusau @ 7:09 am

Now in JAGS! Now in JAGS!

John K. Kruschke writes:

I have created JAGS versions of all the BUGS programs in Doing Bayesian Data Analysis. Unlike BUGS, JAGS runs on MacOS, Linux, and Windows. JAGS has other features that make it more robust and user-friendly than BUGS. I recommend that you use the JAGS versions of the programs. Please let me know if you encounter any errors or inaccuracies in the programs. (hyperlink to book added)

First spotted by Matthew O’Donnell (@mdbod).

January 3, 2012

Knowledge Federation 2010: Self-Organizing Collective Mind

Filed under: Conferences,Federation,Knowledge Organization — Patrick Durusau @ 5:15 pm

Knowledge Federation 2010: Self-Organizing Collective Mind

The proceedings from the Second International Workshop on Knowledge Federation, Dubrovnik, Croatia, October 3-6, 2010, edited by Dino Karabeg and Jack Park, have just appeared online.

Table of Contents

Preface

  1. The Praxis of Social Knowledge Federation
    Arnim Bleier, Patrick Jahnichen, Uta Schulze, Lutz Maicher
  2. Steps Towards a Federated Course Model
    Dino Karabeg
  3. On Nature and Control of Creativity: Tesla as a Case Study
    Dejan Rakovic
    Dejan Rakovic
  4. Semiotic Perspective on Sensemaking Software and Consequences for Journalism
    Shiqin “Eddie” Choo
  5. Towards a Federated Framework for Self-evolving Educational Experience Design on Massive Scale (SEED-M)
    George Pór
  6. Context-Driven Social Network Visualisation: Case Wiki Co-Creation
    Jukka Huhtamaki, Jaakko Salonen, Jarno Marttila, Ossi Nykanen
  7. Boundary Infrastructures for Conversational Knowledge Federation
    Jack Park
  8. Combinatorial Inquiries into Knowledge Federation
    Karl F. Hebenstreit Jr.
  9. An Ark for the Exaflood Rushing upon Us
    Mei Lin Fung, Robert S. Stephenson
  10. Webbles: Programmable and Customizable Meme Media Objects in a Knowledge Federation Framework Environment on the Web
    Micke N. Kuwahara, Yuzuru Tanaka
  11. Causality in collective filtering
    Mario Paolucci, Stefano Picascia, Walter Quattrociocchi
  12. Images of knowledge. Interfaces for knowledge access in an epistemic transition
    Marco Quaggiotto
  13. New ecosystem in journalism: Decentralized newsrooms empowered by self-organized crowds
    Tanja Aitamurto

Tribeforth

Filed under: Networks,Politics — Patrick Durusau @ 5:14 pm

Tribeforth

From the homepage:

Tribeforth Foundation is a group of people developing and promoting a collective intelligence computer system to assist in stimulating new solutions, ideas and connections on a global scale. The system as it is planned is not unlike an everyday wiki. The key difference is that millions can speak as one without losing a voice and the software tunes the conversation into reason. This keeps us from getting lost in syntax and helps us to work with the real semantics.

Heavily rooted in collective intelligence and the semantic web (Web 3.0) we are organizing a collection of open source software and then extending them to create the most high tech discussion platform in human history. Available to anyone, anywhere as a basic standard of living.

A handful of powerful, fundamental principles and values guide us here at Tribeforth. We use these principles to create new tools for all of us

A project built on the principles of self-reflection, an echo of human ingenuity.

I don’t know if topic maps would be of assistance or not but when you are talking about making connections that persist across semantic boundaries (my words, not theirs), then you are going to need topic maps or something very similar.

I suppose I am a bit old school for the disclaimer:

THE TRIBEFORTH SYSTEM WILL NOT COLLECT ANY INFORMATION REGARDING MILITARY PERSONNEL, SYSTEMS, EQUIPMENT, PLANNING OR DEPLOYMENT. INCIDENTS REGARDING HUMAN RIGHTS VIOLATIONS ARE NOT SUBJECT TO THIS POLICY.

Existing solutions/structures are not going to go into the night quietly. That is a historical certainty. I would rather be prepared for the push back.

List of cities/states with open data – help me find more!

Filed under: Data,Government Data — Patrick Durusau @ 5:13 pm

List of cities/states with open data – help me find more!

A plea from “Simply Statistics” to increase its listing of cities with open data.

Mostly American and Canadian, with a few others, Berlin for example, suggested in comments.

I haven’t looked (yet) but since European libraries led the charge, in many ways, toward greater access to their collections (my recollection, yours may differ), I would expect to find European cities and authorities also ahead in the race to publish public data.

Pointers from European readers? (Or I can look them up later this week, just not today.)

Voting Networks in the Danish Parliament [2004-2011]

Filed under: Graphs,Networks,R — Patrick Durusau @ 5:12 pm

Voting Networks in the Danish Parliament [2004-2011]

From the post:

One of my Christmas presents was the book Beautiful Visualization. Chapter 8 by Andrew Odewahn is a very nice piece on visualizing the U.S. Senate social graph. Odewahn basically builds an affinity network, where ties represent whether two senators have voted in the same manner during a given time period. The rules for creating the network are nicely broken down into the following steps:

  1. Nodes represent senators
  2. Nodes are colored according to party affiliation
  3. Nodes are connected with an edge if two senators voted together more than 65% of the time during a given timeframe

Based on the above rules Odewahn builds a series of interesting graphs, showing that there are a few consistently bipartisan senators on both sides in almost every session of the Congress.

But rather than just grousing about American politics (don’t get me started), the author produces voting network graphs of the Danish Parliament!

I leave it for you to decide if the results signal hope or despair. 😉
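
A minimal Ruby sketch of Odewahn’s 65% edge rule quoted above, on invented vote data:

```ruby
# Connect two legislators if they voted the same way on more than 65%
# of their shared votes.
votes = {
  "A" => [:yes, :yes, :no, :yes],
  "B" => [:yes, :no,  :no, :yes],
  "C" => [:no,  :no,  :yes, :no]
}

edges = votes.keys.combination(2).select do |a, b|
  pairs     = votes[a].zip(votes[b])
  agreement = pairs.count { |x, y| x == y }.fdiv(pairs.size)
  agreement > 0.65
end
p edges  # => [["A", "B"]]  (A and B agree 75% of the time)
```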

Topical Classification of Biomedical Research Papers – Details

Filed under: Bioinformatics,Biomedical,Medical Informatics,MeSH,PubMed,Topic Maps — Patrick Durusau @ 5:11 pm

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With a use of our automatic tagging algorithm, which we will describe in details after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical value. Intuitively, they can be interpreted as values of a rough membership function that measures a degree in which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have little (or even none) non-zero values (corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.
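
A minimal Ruby sketch of loading the files as described above (file names as given in the task description, including “trainingLables.txt”):

```ruby
require 'csv'

# Each row: association strengths (0..1000), one document per row.
training = CSV.read("trainingData.csv", col_sep: "\t")
              .map { |row| row.map(&:to_i) }

# Each row: comma-separated topic ids (1..83) for the matching document.
labels = File.readlines("trainingLables.txt")  # (sic, name as given)
             .map { |line| line.strip.split(",").map(&:to_i) }

# Sparsity check: fraction of non-zero cells in the training matrix.
nonzero = training.sum { |row| row.count { |v| v != 0 } }
total   = training.size * training.first.size
puts "non-zero fraction: #{(nonzero.to_f / total).round(4)}"
```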

I am looking at it as an opportunity to learn a good bit about automatic text classification and what, if any, role that topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!
