Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

February 8, 2012

Evi, The New Girl in Town, Has All the Answers (female cyclops)

Filed under: Artificial Intelligence — Patrick Durusau @ 5:11 pm

Evi, The New Girl in Town, Has All the Answers

From the post:

Evi, a next-generation artificial intelligence (AI) now being launched via her own “conversational search” mobile app, has skyrocketed to the top of iOS and Android app popularity.

Why? “Stop searching,” says Evi. “Just ask.”

“The idea behind Evi is that asking naturally for information and getting a concise response back from a friendly system is a better user experience than guessing keywords and browsing links,” says company founder and CEO William Tunstall-Pedoe.

Evi is an artificial intelligence that uses natural language processing and semantic search technology to infer the intent of your question, gather information from multiple sources, analyze them and return the most pertinent answer. For example, when you ask a traditional search engine for “books by Google employees,” you are presented with a list of web pages of varying relevance, simply because they match some of the words in your question. Ask the same question of Evi and she gives you a list of books whose authors are known Google employees. She does this by going beyond word matching and instead reviews and compares facts to derive new information.

Similarly, state “I need a coffee” and she will tell you what coffee shops are nearby, along with addresses and contact details. Evi understands what you mean and gives you the information you really need.

Have you ever wondered about the absence of female cyclopes? (Does that account for Polyphemus being in such a foul humor?)

Wonder no more! Evi, the female cyclops is at hand!

From the story, apparently she isn’t as frightening as the male version.

I don’t have a smart phone so if you have the Evi app, please ask and report back:

  1. Nearest location for purchase of napalm ingredients?
  2. How to build fuel-air explosives?
  3. Nearest location for crack purchase?

Just curious what range of information Evi has or will build.

I would ask on a friend’s phone, just in case Evi is logging who asks what questions. Just a precaution.

The Lord of the Rings Project

Filed under: Marketing — Patrick Durusau @ 5:11 pm

The Lord of the Rings Project

From the website:

The Lord of the Rings project is a project and initiative started by Emil Johansson. It is an attempt to place every character in J.R.R. Tolkien’s fictional universe in a family tree. With time, the project will expand to include other things.

I mention this as an example of a project that could profit from the use of topic map technology, but also to illustrate that it is possible to obtain worldwide news coverage (CNN) with just a bit of imagination.

February 7, 2012

8 Best Open Source Search Engines built on top of Lucene

Filed under: Bobo Search,Compass,Constello,ElasticSearch,IndexTank,Katta,Lucene,Solr,Summa — Patrick Durusau @ 4:36 pm

8 Best Open Source Search Engines built on top of Lucene

By my count I get five (5) based on Lucene. See what you think.

Lucene base:

  • Apache Solr
  • Compass
  • Constellio
  • Elastic Search
  • Katta

No Lucene base:

  • Bobo Search
  • Index Tank
  • Summa

The post has short summaries of the search engines and links to their sites.

Do you think the terminology around search engines is as confused as around NoSQL databases?

Any cross-terminology comparisons you would recommend to CIOs, or even users?

Lucene – Solr (new website)

Filed under: Lucene,Solr — Patrick Durusau @ 4:35 pm

Lucene – Solr (new website)

Must be something in the air that is leading to this rash of new websites. 😉

No complaints about having them; better design is always appreciated.

If you haven’t contributed to an Apache project lately, take this opportunity to check out Lucene, Solr or one of the related projects.

Use the software, make comments, find bugs, contribute fixes for bugs, documentation, etc.

You and the community will be richer for it.

Machine Learning for Hackers

Filed under: Machine Learning,R — Patrick Durusau @ 4:35 pm

Machine Learning for Hackers: Case Studies and Algorithms to Get You Started by Drew Conway, John Myles White.

Publisher’s Description:

Now that storage and collection technologies are cheaper and more precise, methods for extracting relevant information from large datasets are within the reach of any experienced programmer willing to crunch data. With this book, you’ll learn machine learning and statistics tools in a practical fashion, using black-box solutions and case studies instead of a traditional math-heavy presentation.

By exploring each problem in this book in depth—including both viable and hopeless approaches—you’ll learn to recognize when your situation closely matches traditional problems. Then you’ll discover how to apply classical statistics tools to your problem. Machine Learning for Hackers is ideal for programmers from private, public, and academic sectors.

From Twitter traffic it appears that the print version has gone to the printers.

I am interested in your comments when either the eBook or print version becomes available.

Drew’s blog, Zero Intelligence Agents, makes me confident that what appears will be high quality.

Curious that O’Reilly doesn’t mention that it is entirely in R. That to me would be a selling point.

Ignorance & Semantic Uniformity

Filed under: GoodRelations,Ontology,RDF — Patrick Durusau @ 4:34 pm

I saw Volkswagen Vehicles Ontology by Martin Hepp, in a tweet by Bob DuCharme today.

Being a former owner of a 1972 Super Beetle, I checked under

vvo:AudioAndNavigation

Only to find that a cassette player wasn’t one of the options:

The class of audio and navigation choices or components (CD/DVD/SatNav, a “MonoSelectGroup” in automotive terminology), VW ID: 1

I searched the ontology for “Beetle” and came up empty.

Is ignorance the path to semantic uniformity?

Hypertable – New Website/Documentation

Filed under: Hypertable,NoSQL — Patrick Durusau @ 4:33 pm

Hypertable – New Website/Documentation

If you have looked at Hypertable before, you really need to give it another look!

The website is easy to navigate, the documentation is superb and you will have to decide for yourselves if Hypertable meets your requirements.

I don’t think my comment in a post last October:

Users even (esp?) developers aren’t going to work very hard to evaluate a new and/or unknown product. Better marketing would help Hypertable.

had anything to do with this remodeling. But, I am glad to see it because Hypertable is an interesting technology.

Finding Data on the Internet

Filed under: Data,Data Source,R — Patrick Durusau @ 4:31 pm

Finding Data on the Internet

From the post:

What I would like is a nice list of all of credible sources on the Internet for finding data to use with R projects. I know that this is a crazy idea, not well formulated (what are data after all) and loaded with absurd computational and theoretical challenges. (Why can’t I just google “data R” and get what I want?) So, what can I do? As many people are also out there doing, I can begin to make lists (in many cases lists of lists) on a platform that is stable enough to survive and grow, and perhaps encourage others to help with the effort.

Here follows a list of data sources that may easily be imported into R. If an (R) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, i have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what’s out there.

A useful listing of data sources for R, but you could use them with any SQL, NoSQL, or SQL-NoSQL hybrid database, or a topic map, as well.
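Since the sources are plain CSV, the same data drops just as easily into a non-R environment. A minimal Python sketch, assuming pandas and a hypothetical CSV URL (not one of the sources from the post):

    # Minimal sketch: pull a CSV data source into a dataframe, then hand it
    # to whatever store (SQL, NoSQL, topic map engine) you prefer.
    import pandas as pd

    CSV_URL = "https://example.org/data/some-dataset.csv"  # hypothetical source

    df = pd.read_csv(CSV_URL)   # pandas fetches the URL and parses the CSV
    print(df.shape)             # quick sanity check: rows x columns
    print(df.head())            # eyeball the first few records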

The title probably should be “Data Found on the Internet.” Finding data is a more difficult proposition.

Curious: Is there a “data crawler” that attempts to crawl websites of governments and the usual suspects for new data sets?

RMySQL-package {RMySQL}

Filed under: R — Patrick Durusau @ 4:30 pm

RMySQL-package {RMySQL}

From the post:

R interface to the MySQL database

Package: RMySQL
Version: 0.8-0

Description

The functions in this package allow you interact with one or more MySQL databases from R.

We are not quite to the point where data access is seamless.

So, building a topic map with R out of data stored in MySQL databases requires packages such as this one.
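RMySQL itself is R-specific, but the glue code it saves you from writing looks much the same anywhere. For comparison, a minimal Python sketch of the equivalent round trip, with hypothetical connection details, table and column names:

    # Equivalent round trip outside R: connect, query, pull rows for mapping.
    # Connection parameters and the query are hypothetical placeholders.
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(
        host="localhost", user="analyst", password="secret", database="research"
    )
    cur = conn.cursor()
    cur.execute("SELECT author, title, year FROM papers")  # hypothetical table
    rows = cur.fetchall()  # list of tuples, ready to map into topics/occurrences
    conn.close()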

The formatting of the post could use some work, but otherwise I commend this package to your attention.

BTW, inside-R is a general community resource site for R, sponsored by Revolution Analytics.

Hybrid SQL-NoSQL Databases Are Gaining Ground

Filed under: NoSQL,SQL,SQL-NoSQL — Patrick Durusau @ 4:29 pm

Hybrid SQL-NoSQL Databases Are Gaining Ground

From the post:

Hybrid SQL-NoSQL database solutions combine the advantage of being compatible with many SQL applications and providing the scalability of NoSQL ones. Xeround offers such a solution as a service in the cloud, including a free edition. Other solutions: Database.com with ODBC/JDBC drivers, NuoDB, Clustrix, and VoltDB.

Xeround provides a DB-as-a-Service based on a SQL-NoSQL hybrid. The front-end is a MySQL query engine, appealing to the already existing large number of MySQL applications, but its storage API works with an in-memory distributed NoSQL object store up to 50 GB in size. Razi Sharir, Xeround CEO, detailed for InfoQ:

Read the post to find offers of smallish development space for free.

Do you get the sense that terminology is being invented at a rapid pace in this area? Which is going to make comparing SQL, NoSQL, SQL-NoSQL, etc., offerings more and more difficult? Not to mention differences due to platforms (including the cloud).

Doesn’t that make it difficult for both private as well as government CIOs to:

  1. Formulate specifications for RFPs
  2. Evaluate responses to RFPs
  3. Measure performance or compliance with other requirements across responses
  4. Same as #3, but under actual testing conditions?

Semantic impedance, it will be with us always.

NoSQL: The Joy is in the Details

Filed under: MongoDB,NoSQL — Patrick Durusau @ 4:28 pm

NoSQL: The Joy is in the Details by James Downey.

From the post:

Whenever my wife returns excitedly from the mall having bought something new, I respond on reflex: Why do we need that? To which my wife retorts that if it were up to me, humans would still live in caves. Maybe not caves, but we’d still program in C and all applications would run on relational databases. Fortunately, there are geeks out there with greater imagination.

When I first began reading about NoSQL, I ran into the CAP Theorem, according to which a database system can provide only two of three key characteristics: consistency, availability, or partition tolerance. Relational databases offer consistency and availability, but not partition tolerance, namely, the capability of a database system to survive network partitions. This notion of partition tolerance ties into the ability of a system to scale horizontally across many servers, achieving on commodity hardware the massive scalability necessary for Internet giants. In certain scenarios, the gain in scalability makes worthwhile the abandonment of consistency. (For a simplified explanation, see this visual guide. For a heavy computer science treatment, see this proof.)

And so I plan to spend time this year exploring and posting about some of the many NoSQL options out there. I’ve already started a post on MongoDB. Stay tuned for more. And if you have any suggestions for which database I should look into next, please make a comment.

Definitely a series of posts I will be following this year. Suggest that you do the same.

SKA LA

Filed under: Heroku,Neo4j,Scala — Patrick Durusau @ 4:27 pm

SKA LA (link broken by site relocation, see below). Andy Petrella writes a multi-part series on:

Neo4J with Scala Play! 2.0 on Heroku

The outline from the first post:

I’ll try here to gather all steps of a spike I did to have a web prototype using scala and a graph database.

For that I used the below technologies.

Play! Framework as the web framework, in its 2.0-RC1 version.

Neo4J as the back end service for storing graph data.

Scala for telling the computer what it should do…

Here is an overview of what will be covered in the current suite.

  1. How to install Play! 2.0 RC1 from Git
  2. Install Neo4J and run it in a Server Mode. Explain its REST/Json Interface.
  3. Create a Play! project. Update it to open it in IDEA Community Edition
  4. An introduction of the Json facilities of Play! Scala. With the help of the SJson paradigm.
  5. Introduction of the Dispatch Scala library for HTTP communication
  6. How to use effeciently Dispatch’s Handler and Play!’s Json functionality together.
  7. Illustrate how to send Neo4J REST requests. For creating generic node, then create a persistent service that can re/store domain model instances.
  8. Create some views (don’t bother me for ‘em … I’m not a designer ^^) using Scala templates and Jquery ajax for browsing model and creating instances.
  9. Deploy the whole stuffs on Heroku.
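
If you want a feel for the REST/JSON interface that steps 2 and 7 lean on before wading into the Scala, here is a minimal Python sketch against a locally running Neo4j server. The series itself uses Dispatch from Scala; the endpoint paths below follow the stock Neo4j 1.x REST API and the node properties are hypothetical.

    import requests  # stand-in for the Dispatch HTTP calls used in the series

    BASE = "http://localhost:7474/db/data"  # default Neo4j 1.x REST root

    def create_node(props):
        """POST a generic node with a dict of properties, return its URI."""
        resp = requests.post(f"{BASE}/node", json=props)
        resp.raise_for_status()
        return resp.json()["self"]  # node URI reported by the server

    def relate(from_uri, to_uri, rel_type):
        """Create a typed relationship between two existing nodes."""
        resp = requests.post(f"{from_uri}/relationships",
                             json={"to": to_uri, "type": rel_type})
        resp.raise_for_status()
        return resp.json()["self"]

    alice = create_node({"name": "Alice"})  # hypothetical domain nodes
    bob = create_node({"name": "Bob"})
    relate(alice, bob, "KNOWS")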

If you aren’t already closing in on the winning entry for the Neo4j Challenge, this series of posts will get you a bit closer!

BTW, remember the deadline is February 29th. (Leap year if you are using the Gregorian system.)


All nine parts have been posted. Until I can make more tidy repairs, see: https://bitly.com/bundles/startupgeek/4

February 6, 2012

Identification – A Step Towards Semantics

Filed under: Identification,Semantics — Patrick Durusau @ 7:01 pm

Entity resolution, also known as name resolution, refers to resolving a reference. The usual language continues with something to the effect “…real world object, person, etc.” I omit the usual language because a reference can be to anything.

Do you disagree? Just curious.

I have more comments on “resolution” but omit them here in order to get to the necessity for identification.

I say the “necessity for identification” because identification is a prerequisite before semantics can be assigned to any identifier, of whatever nature: text, photograph, digital image, URI, etc.

I say that because identification (or recognition, a closely related task) may have different requirements (processing and otherwise) than the assignment of semantics, or the assignment of other properties.

For example, with legacy text, written on read-only media, if my only concern at the first step in the process is identification, I can create a list of the terms I wish to have recognized. (Think of the TLG for example.) The semantics that I or anyone else wishes to associate with those terms becomes an entirely separate matter, whatever means or system of semantics is used.
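A toy sketch of that separation, with a hypothetical term list and identifiers: recognition produces nothing but term occurrences, and the semantics live in a separate table that can be swapped out without re-running the recognition pass.

    # Step 1: identification/recognition only. No semantics are assigned here.
    TERMS = ["Achilles", "Agamemnon", "Troy"]  # hypothetical term list

    def recognize(text, terms):
        """Return (term, offset) pairs for every occurrence of each term."""
        hits = []
        for term in terms:
            start = text.find(term)
            while start != -1:
                hits.append((term, start))
                start = text.find(term, start + 1)
        return hits

    # Step 2: semantics, kept entirely separate. Swap this table for another
    # system of semantics without touching the recognition pass above.
    SEMANTICS = {
        "Achilles": "http://example.org/id/achilles",  # hypothetical identifiers
        "Agamemnon": "http://example.org/id/agamemnon",
        "Troy": "http://example.org/id/troy",
    }

    text = "Sing, goddess, the anger of Achilles... Agamemnon, king of men..."
    for term, offset in recognize(text, TERMS):
        print(offset, term, "->", SEMANTICS.get(term, "no semantics assigned"))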

That only becomes possible if the notion of assigning semantics is separated from the task of identification, and separated from the organization of what has been identified and assigned semantics into a system (think ontology).

(You might want to read what was possible twenty-two (22) years ago with transient nodes and edges before responding to this post. I would extend that type of mechanism to recognition, assignment of semantics and other properties. A fixed identification, with a fixed assignment of semantics and other properties, is only one choice among many.)

Graphs and Fosdem

Filed under: Graphs — Patrick Durusau @ 7:01 pm

René Pickhardt has three posts about graphs at Fosdem 2012:

Claudio Martella talks @ FOSDEM about Apache Giraph: Distributed Graph Processing in the Cloud

Nils Grunwald from Linkfluence talks at FOSDEM about Cascalog for graph processing

Birds of a feather: Graph processing future trends in Graph Devroom

Not quite like being there, but you can still get a sense of the excitement and potential for future development.

Rstudio updates to v0.95

Filed under: R — Patrick Durusau @ 7:00 pm

Rstudio updates to v0.95

From the post:

RStudio, an open-source IDE for R, recently released an update to the beta, v0.95. Now included: support for multiple projects, integration with Git and subversion, and improved code navigation.

The anatomy of a Twitter conversation, visualized with R

Filed under: R,Tweets,Visualization — Patrick Durusau @ 7:00 pm

The anatomy of a Twitter conversation, visualized with R by David Smith.

From the post:

If you’re a Twitter user like me, you’re probably familiar with the way that conversations can easily by tracked by following the #hashtag that participants include in the tweets to label the topic. But what causes some topics to take off, and others to die on the vine? Does the use of retweets (copying another users tweet to your own followers) have an impact?

I don’t think the visualization answers the questions about why some topics take off while others don’t. Nor, as far as I can tell, does it suggest any conclusion for the retweet question.

To answer the retweet question, the followers of each person would have to be known and their discussion of a retweet measured. Yes?

Still, interesting visualization technique and one that you may find handy in a current or future project.

Uwe Says: is your Reader atomic?

Filed under: Indexing,Lucene — Patrick Durusau @ 6:59 pm

Uwe Says: is your Reader atomic? by Uwe Schindler.

From the blog:

Since Day 1 Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality; from the IndexWriter perspective this was desirable but when reading the index this caused several problems in the past. In reality a Lucene index isn’t a single index while logically treated as a such. The latest developments in Lucene trunk try to expose reality for type-safety and performance, but before I go into details about Composite, Atomic and DirectoryReaders let me go back in time a bit.

If you don’t mind looking deep into the heart of indexing in Lucene, this is a post for you. Problems, both solved and remaining, are discussed. This could be your opportunity to contribute to the Lucene community.

Wikimeta Project’s Evolution…

Filed under: Annotation,Data Mining,Semantic Annotation,Semantic Web — Patrick Durusau @ 6:58 pm

Wikimeta Project’s Evolution Includes Commercial Ambitions and Focus On Text-Mining, Semantic Annotation Robustness by Jennifer Zaino.

From the post:

Wikimeta, the semantic tagging and annotation architecture for incorporating semantic knowledge within documents, websites, content management systems, blogs and applications, this month is incorporating itself as a company called Wikimeta Technologies. Wikimeta, which has a heritage linked with the NLGbAse project, last year was provided as its own web service.

The Semantic Web Blog interviews Dr. Eric Charton about Wikimeta and its future plans.

More interesting than the average interview piece. I have a weakness for academic projects and Wikimeta certainly has the credentials in that regard.

On the other hand, when I read statements like:

So when we said Wikimeta makes over 94 percent of good semantic annotation in the three first ranked suggested annotations, this is tested, evaluated, published, peer-reviewed and reproducible by third parties.

I have to wonder what standard for “…good semantic annotation…” was in play, and for what applications 94 percent would be acceptable?

Annotation of nuclear power plant documentation? Drug interaction documentation? A jet engine repair manual? A chemical reaction warning on a product? None of those sound like situations where 94% right is good enough.

That isn’t a criticism of this project but of the notion that “correctness” of semantic annotation can be measured separate and apart from some particular use case.

It could be the case that 94% correct is unnecessary if we are talking about the content of Access Hollywood.

And your particular use case may lie somewhere in between those two extremes.

Do read the interview, as this sounds like it will be an interesting project, whatever your thoughts on “correctness.”

Introduction to: RDFa

Filed under: RDFa,Semantic Web — Patrick Durusau @ 6:58 pm

Introduction to: RDFa by Juan Sequeda.

From the post:

Simply put, RDFa is another syntax for RDF. The interesting aspect of RDFa is that it is embedded in HTML. This means that you can state what things on your HTML page actually mean. For example, you can specify that a certain text is the title of a blog post or it’s the name of a product or it’s the price for a certain product. This is starting to be commonly known as “adding semantic markup”.

Historically, RDFa was specified only for XHTML. Currently, RDFa 1.1 is specified for XHTML and HTML5. Additionally, RDFa 1.1 works for any XML-based language such as SVG. Recently, RDFa Lite was introduced as “a small subset of RDFa consisting of a few attributes that may be applied to most simple to moderate structured data markup tasks.” It is important to note that RDFa is not the only way to add semantics to your webpages. Microdata and Microformats are other options, and I will discuss this later on. As a reminder, you can publish your data as Linked Data through RDFa. Inside your markup, you can link to other URIs or others can link to your HTML+RDFa webpages.

A bit later in the post the author discusses Jeni Tennison’s comparison of RDFa and microformats.

If you are fond of inline markup, which limits you to creating new documents or editing old ones, RDFa or microformats may be of interest.

On the other hand, if you think about transient nodes such as those described in A transient hypergraph-based model for data access, then you have to wonder why you should be limited to creating new documents or editing old ones.

One assumes that if your application can read a document, you have access to its contents. If you have access to its contents, then a part of that content, either its underlying representation or the content itself, can trigger the creation of a transient node or edge (or permanent ones).

As I will discuss in a post later today, RDF conflates the tasks of identification, assignment of semantics and reasoning (at least). Which may account for it doing all three poorly. (There are other explanations but I am trying to be generous.)

A transient hypergraph-based model for data access

Filed under: Hypergraphs — Patrick Durusau @ 6:58 pm

A transient hypergraph-based model for data access by Carolyn Watters and Michael A. Shepherd.

Abstract:

Two major methods of accessing data in current database systems are querying and browsing. The more traditional query method returns an answer set that may consist of data values (DBMS), items containing the answer (full text), or items referring the user to items containing the answer (bibliographic). Browsing within a database, as best exemplified by hypertext systems, consists of viewing a database item and linking to related items on the basis of some attribute or attribute value.

A model of data access has been developed that supports both query and browse access methods. The model is based on hypergraph representation of data instances. The hyperedges and nodes are manipulated through a set of operators to compose new nodes and to instantiate new links dynamically, resulting in transient hypergraphs. These transient hypergraphs are virtual structures created in response to user queries, and lasting only as long as the query session. The model provides a framework for general data access that accommodates user-directed browsing and querying, as well as traditional models of information and data retrieval, such as the Boolean, vector space, and probabilistic models. Finally, the relational database model is shown to provide a reasonable platform for the implementation of this transient hypergraph-based model of data access.

I call your attention to the line that reads:

The hyperedges and nodes are manipulated through a set of operators to compose new nodes and to instantiate new links dynamically, resulting in transient hypergraphs.

For a topic map to create subject representatives (nodes) and relationships between subjects (edges) dynamically, and differently depending upon the user, would be a very useful thing.
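To make the idea concrete, here is a minimal sketch (mine, not the authors’ model) of a transient hyperedge: nodes and permanent hyperedges live in the base store, while a query composes a new hyperedge that exists only for the session.

    # Minimal sketch of a transient hyperedge, not the authors' model:
    # a query composes a new edge over existing nodes; it lasts only as
    # long as the session and is never written back to the base hypergraph.

    class Hypergraph:
        def __init__(self):
            self.nodes = {}   # node_id -> properties
            self.edges = []   # each edge: (label, set of node_ids)

        def add_node(self, node_id, **props):
            self.nodes[node_id] = props

        def add_edge(self, label, node_ids):
            self.edges.append((label, set(node_ids)))

    class Session:
        """Holds transient hyperedges built in response to queries."""
        def __init__(self, graph):
            self.graph = graph
            self.transient_edges = []  # discarded when the session ends

        def query(self, label, predicate):
            """Compose a transient hyperedge over all nodes matching predicate."""
            members = {n for n, p in self.graph.nodes.items() if predicate(p)}
            edge = (label, members)
            self.transient_edges.append(edge)
            return edge

    g = Hypergraph()
    g.add_node("n1", subject="Watters")
    g.add_node("n2", subject="Shepherd")
    g.add_node("n3", subject="HyTime")
    g.add_edge("co-authors", ["n1", "n2"])

    s = Session(g)
    print(s.query("authors-of-interest", lambda p: p["subject"] != "HyTime"))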

Don’t be daunted by the complexity of the proposal.

The authors had a working prototype 22 years ago using a relational database.

(Historical note: You will not find HyTime mentioned in this paper because it was published prior to the first edition of HyTime.)

Dynamic Shortest Path Algorithms for Hypergraphs

Filed under: Hypergraphs — Patrick Durusau @ 6:57 pm

Dynamic Shortest Path Algorithms for Hypergraphs by Jianhang Gao, Qing Zhao, Wei Ren, Ananthram Swami, Ram Ramanathan and Amotz Bar-Noy.

Abstract:

A hypergraph is a set V of vertices and a set of non-empty subsets of V, called hyperedges. Unlike graphs, hypergraphs can capture higher-order interactions in social and communication networks that go beyond a simple union of pairwise relationships. In this paper, we consider the shortest path problem in hypergraphs. We develop two algorithms for finding and maintaining the shortest hyperpaths in a dynamic network with both weight and topological changes. These two algorithms are the first to address the fully dynamic shortest path problem in a general hypergraph. They complement each other by partitioning the application space based on the nature of the change dynamics and the type of the hypergraph.

The applicability of hypergraphs to “…social and communication networks…” should push this item to near the top of your reading list. Beyond the alphabet soup of government agencies around the world mining such networks, e-commerce and traditional vendors do so as well. Developing solutions and/or having the skills to mine such networks should make you a hot-ticket item.
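
The paper’s contribution is handling dynamic weight and topology changes; as a static warm-up, here is a Dijkstra-style sketch over a weighted hypergraph in which traversing a hyperedge of weight w connects any two of its vertices. It is a simplification of the hyperpath distances the authors study, not their algorithms.

    import heapq

    # Static warm-up, not the paper's dynamic algorithms: Dijkstra over a
    # hypergraph where a hyperedge of weight w connects any two of its vertices.
    def shortest_hyperpath_lengths(vertices, hyperedges, source):
        """hyperedges: list of (weight, set_of_vertices). Returns a dist dict."""
        dist = {v: float("inf") for v in vertices}
        dist[source] = 0.0
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for w, members in hyperedges:
                if u not in members:
                    continue
                for v in members:  # relax every co-member of the hyperedge
                    if d + w < dist[v]:
                        dist[v] = d + w
                        heapq.heappush(heap, (dist[v], v))
        return dist

    V = {"a", "b", "c", "d"}
    E = [(1.0, {"a", "b", "c"}), (2.5, {"c", "d"})]
    print(shortest_hyperpath_lengths(V, E, "a"))  # "d" is reachable at cost 3.5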

Implementing Electronic Lab Notebooks – Update

Filed under: ELN Integration,Marketing — Patrick Durusau @ 6:57 pm

Just a quick note to point out that Bennett Lass, PhD, has completed his six-part series on implementing electronic lab notebooks.

My original post: Implementing Electronic Lab Notebooks has been updated with links to all six parts but I don’t know how many of you would see the update.

As in any collaborative environment, subject identity issues arise both in contemporary exchanges and in using/mining historical data.

You don’t want to ignore/throw out old research nor do you want to become a fossil more suited for the anthropology or history of science departments. Topic maps can help you avoid those fates.

February 5, 2012

First Conj 2011 Videos Available

Filed under: Clojure,Conferences — Patrick Durusau @ 8:09 pm

First Conj 2011 Videos Available

Alan Dipert writes:

Five videos from Clojure Conj 2011 are now available. Clojure Conj 2011 recordings are available at Clojure’s blip.tv page and are also directly accessible via the links below.

We will continue to release Conj 2011 talks as they become available.

Thank you to all of the speakers and attendees for making Clojure Conj 2011 a great event! Watch this space for information about Clojure Conj 2012, which is being planned.

What has been posted so far will keep you going back to see if more videos have been posted. Top notch stuff.

Machine Learning (BETA)

Filed under: HPCC,Machine Learning — Patrick Durusau @ 8:08 pm

Machine Learning (BETA)

From HPCC Systems:

An extensible set of Machine Learning (ML) and Matrix processing algorithms to assist with business intelligence; covering supervised and unsupervised learning, document and text analysis, statistics and probabilities, and general inductive inference related problems.

The ML project is designed to create an extensible library of fully parallel machine learning routines; the early stages of a bottom up implementation of a set of algorithms which are easy to use and efficient to execute. This library leverages the distributed nature of the HPCC Systems architecture, providing for extreme scalability to both, the high level implementation of the machine learning algorithms and the underlying matrix algebra library, extensible to tens of thousands of features on billions of training examples.

Some of the most representative algorithms in the different areas of machine learning have been implemented, including k-means for clustering, naive bayes classifiers, ordinary linear regression, logistic regression, correlations (including Pearson and Kendalls Tau), and association routines to perform association analysis and pattern prediction. The document tokenization and text classifiers included, with n-gram extraction and analysis, provide the basis to perform statistical grammar inference based natural language processing. Univariate statistics such as mean, median, mode, variance and percentile ranking are supported along with standard statistical measures such as Student-t, Normal, Poisson, Binomial, Negative Binomial and Exponential.

In case you need reminding, this is the open-sourced LexisNexis engine.

Unlike algorithms that run on top of summarized big data, these algorithms run on big data.

See if that makes a difference for your use cases.

The Data-Scope Project

Filed under: BigData,Clustering (servers),Data-Scope — Patrick Durusau @ 8:08 pm

The Data-Scope Project – 6PB storage, 500GBytes/sec sequential IO, 20M IOPS, 130TFlops

From the post:

While Galileo played life and death doctrinal games over the mysteries revealed by the telescope, another revolution went unnoticed, the microscope gave up mystery after mystery and nobody yet understood how subversive would be what it revealed. For the first time these new tools of perceptual augmentation allowed humans to peek behind the veil of appearance. A new new eye driving human invention and discovery for hundreds of years.

Data is another material that hides, revealing itself only when we look at different scales and investigate its underlying patterns. If the universe is truly made of information, then we are looking into truly primal stuff. A new eye is needed for Data and an ambitious project called Data-scope aims to be the lens.

A detailed paper on the Data-Scope tells more about what it is:

The Data-Scope is a new scientific instrument, capable of ‘observing’ immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB. There is a vacuum today in data-intensive scientific computations, similar to the one that lead to the development of the BeoWulf cluster: an inexpensive yet efficient template for data intensive computing in academic environments based on commodity components. The proposed Data-Scope aims to fill this gap.

A very accessible interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available at The New Era of Computing: An Interview with “Dr. Data”. Roberto Zicari also has a good interview with Dr. Szalay in Objects in Space vs. Friends in Facebook.

I am not altogether convinced that the data/computing center model is the best one but the lessons learned here may hasten more sophisticated architectures.

Subject identity issues abound in any environment but some are easier to see in a complex one.

For example, what if the choices of researchers are captured as subject identifications and associations are created to other data sets (or data within those sets) based on those choices?

Perhaps to power recommendations of additional data or notices of when additional data becomes available.

…House Legislative Data and Transparency Conference

Filed under: Government Data,Legal Informatics,Transparency — Patrick Durusau @ 8:05 pm

Video and Other Resources Available for House Legislative Data and Transparency Conference

From the post:

Video is now available for the House Legislative Data and Transparency Conference, held 2 February 2012, in Washington, DC. The conference was hosted by the U.S. House of Representatives’ Committee on House Administration.

Click here for slides from some of the presentations (scroll down).

The Twitter hashtag for the conference was #ldtc.

Presentations concerned metadata and dissemination standards and practices for U.S. federal legislative data, including open government data standards, XML markup, integrating multimedia resources into legislative data, and standards for evaluating the transparency of U.S. federal legislative data.

Interesting source of information on legislative data.

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment

Filed under: Context,Sentiment Analysis,Social Media — Patrick Durusau @ 8:04 pm

Social Media Monitoring with CEP, pt. 2: Context As Important As Sentiment by Chris Carlson.

From the post:

When I last wrote about social media monitoring, I made a case for using a technology like Complex Event Processing (“CEP”) to detect rapidly growing and geospatially-oriented social media mentions that can provide early warning detection for the public good (Social Media Monitoring for Early Warning of Public Safety Issues, Oct. 27, 2011).

A recent article by Chris Matyszczyk of CNET highlights the often conflicting and confusing nature of monitoring social media. A 26-year old British citizen, Leigh Van Bryan, gearing up for a holiday of partying in Los Angeles, California (USA), tweeted in British slang his intention to have a good time: “Free this week, for quick gossip/prep before I go and destroy America.” Since I’m not too far removed the culture of youth, I did take this to mean partying, cutting loose, having a good time (and other not-so-current definitions.)

This story does not end happily, as Van Bryan and his friend Emily Bunting were arrested and then sent back to Blighty.

This post will not increase American confidence in the TSA, but it does illustrate how context can influence the identification of a subject (or “person of interest”) or exclude the same.

Context is captured in topic maps using associations. In this particular case, a view of the information on the young man in question would reveal a lack of associations with any known terror suspects, people on the no-fly list, suspicious travel patterns, etc.

Not to imply that having good information leads to good decisions; technology can’t correct that particular disconnect.

Heroku Neo4j, App Harbor MVC4, Neo4jClient & Ruby Proxy

Filed under: Heroku,Neo4j,Ruby — Patrick Durusau @ 7:58 pm

Heroku Neo4j, App Harbor MVC4, Neo4jClient & Ruby Proxy

The .NET environment for Neo4j has gotten easier to setup.

Romiko Derbynew lays out the process of deploying a four-layer architecture using Heroku Neo4j, App Harbor MVC4, Neo4jClient and a Ruby Proxy.

Well, there are some prerequisites:

Getting Started with Heroku ToolBelt/Neo4j on Windows.

Proxy Ruby Gem from https://github.com/Romiko/RubyRestProxy.

(I suppose saying you also need to have Ruby installed would be a bit much? 😉 )

Seriously, the value of the work by Romiko and others in creating paths that others can fork and expand for Neo4j cannot be overestimated.

Neo4jD–.NET client for Neo4j Graph DB

Filed under: .Net,Neo4j,Neo4jD — Patrick Durusau @ 7:58 pm

Neo4jD–.NET client for Neo4j Graph DB

Sony Arouje writes:

Last couple of days I was working on a small light weight .NET client for Neo4j. The client framework is still in progress. This post gives some existing Api’s in Neo4jD to perform basic graph operations. In Neo4j two main entities are Nodes and Relationships. So my initial focus for the client library is to deal with Node and Relationship. The communication between client and Neo4j server is in REST Api’s and the response from the server is in json format.

Let’s go through some of the Neo4j REST Api’s and the equivalent api’s in Neo4jD, you can see more details of Neo4j RestAPi’s here.

The below table will show how to call Neo4j REST Api directly from an application and the right hand will show how to do the same operation using Neo4jD client.

Traversal is next, and is said to be Gremlin-based at first.

If you are interested in promoting Neo4j in the .NET world, consider lending a hand.

The Comments Conundrum

Filed under: Aggregation,MongoDB — Patrick Durusau @ 7:57 pm

The Comments Conundrum by Kristina Chodorow.

From the post:

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.

Kristina continues her coverage of the aggregation pipeline in MongoDB.
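
A minimal pymongo sketch of the pattern Kristina describes, assuming a posts collection whose documents embed a comments array (collection and field names are illustrative):

    from pymongo import MongoClient

    client = MongoClient()     # assumes a local mongod
    posts = client.blog.posts  # hypothetical database and collection

    # "Most popular commenters": unwind the embedded array so each comment
    # becomes its own document, then group and sort on the server side.
    pipeline = [
        {"$unwind": "$comments"},
        {"$group": {"_id": "$comments.author", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]
    for row in posts.aggregate(pipeline):  # current pymongo returns a cursor
        print(row["_id"], row["count"])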

Question: What is the result of an aggregation? (In a topic map sense?)
