Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 14, 2012

The Elephant in the Enterprise

Filed under: Hadoop — Patrick Durusau @ 6:14 pm

The Elephant in the Enterprise by Jon Zuanich.

From the post:

On Tuesday, June 12th The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to becoming a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.

Panelists included:

  • Michael Driscoll, CEO, Metamarkets
  • Andrew Mendelsohn, SVP, Oracle Server Technologies
  • Mike Olson, CEO, Cloudera
  • Jay Parikh, VP Infrastructure Engineering, Facebook
  • John Schroeder, CEO, MapR

By the end of the evening, this much was clear: Hadoop has arrived as a required technology. Whether provisioned in the cloud, on-premise, or using a hybrid approach, companies need Hadoop to harness the massive data volumes flowing through their organizations today and into the future. The power of Hadoop is due in part to the way it changes the economics of large-scale computing and storage, but — even more importantly — because it gives organizations a new platform to discover, analyze and ultimately monetize all of their data. To learn more about how market leaders view Hadoop and the reasons for its accelerated adoption into the heart of the enterprise, view the above video.

If you have time, try to watch this between now and next Monday.

I may be out most of the day but I have a post I will be working on over the weekend to post early Monday.

What is missing from this discussion of scale?

Titan: The Rise of Big Graph Data [SLF4J conflicts] + Solution to Transaction issue

Filed under: BigData,Graphs,Titan — Patrick Durusau @ 3:14 pm

Titan: The Rise of Big Graph Data by Marko O. Rodriguez and Matthias Broecheler.

Description:

A graph is a data structure composed of vertices/dots and edges/lines. A graph database is a software system used to persist and process graphs. The common conception in today’s database community is that there is a tradeoff between the scale of data and the complexity/interlinking of data. To challenge this understanding, Aurelius has developed Titan under the liberal Apache 2 license. Titan supports both the size of modern data and the modeling power of graphs to usher in the era of Big Graph Data. Novel techniques in edge compression, data layout, and vertex-centric indices that exploit significant orders are used to facilitate the representation and processing of a single atomic graph…

Some minor corrections:

Correct: http://thinkaurelius/titan.zip to: https://github.com/thinkaurelius/titan (slide 109)

Correct: titan/ to: titan-0.1-alpha/

Another clash with Ontopia: multiple SLF4J bindings. (Recall that for Ontopia 5.2.1 I had to put slf4j-log4j12-1.5.11.jar explicitly in my CLASSPATH.) That binding clashes with /home/patrick/working/titan-0.1-alpha/lib/slf4j-log4j12-1.6.1.jar.

Fixed that, but I still need a solution for switching classpaths easily.

It did start up after that fix.

I got to slide 116, trying to load 'data/graph-of-the-gods.xml', when Titan started returning error messages. Sent the messages + stack trace to Marko.

Will report back when I find where I went wrong or that bug in the software is fixed.

This is a very exciting project, so I suggest that you take a look at it sooner rather than later.


Update (I should get this sort of response time from commercial vendors):

From Matthias:

there is a bit of a transactional hiccup in Titan/Blueprints right now. In Titan, every operation on the graph occurs in the context of a transaction. In Blueprints, calling startTransaction requires that no other transaction is currently running for that thread. loadGraphML calls “startTransaction”. Putting all of those together you get the exception below.

So, to get around it, you would have to call “stopTransaction(SUCCESS)” before loading the data. We should add that to the slides.

However, we are hoping that this situation (having to call stopTransaction explicitly) is temporary.

One proposal is to have startTransaction automatically “attach” to the previous transaction since this is completely acceptable behavior in most if not all situations. This is currently in the pipeline.

So, the slides read:

[Incorrect]
gremlin> g.createKeyIndex('name', Vertex.class)
==>null
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null

Should read:

[Correct]
gremlin> g.createKeyIndex('name', Vertex.class)
==>null
gremlin> g.stopTransaction(SUCCESS)
==>null
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null
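If you are driving Titan from Java rather than the Gremlin shell, the same workaround looks roughly like the sketch below. This is only my reading of the Blueprints 2.x / Titan 0.1-alpha APIs, not code from the slides, so treat the class and method names as assumptions to check against the Getting Started page.

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.TransactionalGraph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.util.io.graphml.GraphMLReader;
import java.io.FileInputStream;

public class GraphOfTheGodsLoader {
    public static void main(String[] args) throws Exception {
        // Open (or create) a local Titan graph; the directory is a placeholder.
        TitanGraph g = TitanFactory.open("/tmp/titan");

        // Index the "name" property key, as on the slides.
        g.createKeyIndex("name", Vertex.class);

        // Close the transaction opened implicitly above so the GraphML load
        // can start its own (the workaround Matthias describes).
        g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS);

        // Load the sample data set.
        GraphMLReader.inputGraph(g, new FileInputStream("data/graph-of-the-gods.xml"));

        g.shutdown();
    }
}

The point to notice is the explicit stopTransaction call between creating the key index and loading the GraphML file.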

The Getting Started page for Titan appears to be more accurate than the slides (at least so far). 😉

Ontopia 5.2.1 and HBase [Port 8080 Conflicts]

Filed under: HBase,Ontopia — Patrick Durusau @ 12:58 pm

I have HBase 0.92-1 and Ontopia 5.2.1 installed on the same Ubuntu box. (An obvious “gotcha” and its correction follows.)

The Catalina server will not start, and the following error message appears in {$basedir}/apache-tomcat/logs/catalina.(date).log:

SEVERE: Error initializing endpoint
java.net.BindException: Address already in use <null>:8080

To discover what is using port 8080, use:

sudo lsof -i :8080

Result (partial):

COMMAND PID USER
java 5314 hbase

You can either reset the port on HBase or open up {$basedir}/apache-tomcat/conf/server.xml and change:

<Connector port="8080" protocol="HTTP/1.1"

to:

<!-- Redefined port to avoid conflict with HBase -->
<Connector port="8090" protocol="HTTP/1.1"

The comment about avoiding conflict with HBase isn't necessary, but it is good practice.

(Cloudera has a note about the 8080 issue in its HBase Installation documentation.)

June 13, 2012

SeRSy 2012

Filed under: Conferences,Recommendation,Semantic Web — Patrick Durusau @ 2:21 pm

SeRSy 2012: International Workshop on Semantic Technologies meet Recommender Systems & Big Data

Important Dates:

Submission of papers: July 31, 2012
Notification of acceptance: August 21, 2012
Camera-ready versions: September 10, 2012

[In connection with the 11th International Semantic Web Conference, Boston, USA, November 11-15, 2012.]

The scope statement:

People generally need more and more advanced tools that go beyond those implementing the canonical search paradigm for seeking relevant information. A new search paradigm is emerging, where the user perspective is completely reversed: from finding to being found. Recommender Systems may help to support this new perspective, because they have the effect of pushing relevant objects, selected from a large space of possible options, to potentially interested users. To achieve this result, recommendation techniques generally rely on data referring to three kinds of objects: users, items and their relations.

Recent developments of the Semantic Web community offer novel strategies to represent data about users, items and their relations that might improve the current state of the art of recommender systems, in order to move towards a new generation of recommender systems which fully understand the items they deal with.

More and more semantic data are published following the Linked Data principles, which enable links to be set up between objects in different data sources, connecting information in a single global data space: the Web of Data. Today, the Web of Data includes different types of knowledge represented in a homogeneous form: sedimentary knowledge (encyclopedic, cultural, linguistic, common-sense) and real-time knowledge (news, data streams, …). This data might be useful to interlink diverse information about users, items, and their relations and implement reasoning mechanisms that can support and improve the recommendation process.

The challenge is to investigate whether and how this large amount of wide-coverage and linked semantic knowledge can be automatically introduced into systems that perform tasks requiring human-level intelligence. Examples of such tasks include understanding a health problem in order to make a medical decision, or simply deciding which laptop to buy. Recommender systems support users exactly in those complex tasks.

The primary goal of the workshop is to showcase cutting edge research on the intersection of Semantic Technologies and Recommender Systems, by taking the best of the two worlds. This combination may provide the Semantic Web community with important real-world scenarios where its potential can be effectively exploited into systems performing complex tasks.

Should be interesting to see whether the semantic technologies or the recommender systems or both get the “rough” or inexact edges.

Autocompletion and Heavy Metal

Filed under: AutoComplete,AutoSuggestion,Music,Music Retrieval,Searching — Patrick Durusau @ 2:08 pm

Building an Autocompletion on GWT with RPC, ContextListener and a Suggest Tree: Part 0

René Pickhardt has started a series of posts that should interest anyone with search applications (or an interest in metal bands).

From the post:

Over the last weeks there was quite some quality programming time for me. First of all I built some indices on the typology data base, in which way I was able to increase the retrieval speed of typology by a factor of over 1000, which is something that rarely happens in computer science. I will blog about this soon. But having those techniques at hand I also used them to build a better auto completion for the search function of my online social network metalcon.de.

The search functionality is not deployed to the real site yet. But on the demo page you can find a demo showing how the completion is helping you typing. Right now the network requests are faster than google search (which I admit is quite easy if you only have to handle a request a second and also have a much smaller concept space). Still I was amazed by the ease and beauty of the program and the fact that the suggestions for autocompletion are actually more accurate than our current data base search. So feel free to have a look at the demo:

http://gwt.metalcon.de/GWT-Modelling/#AutoCompletionTest

Right now it consists of about 150 thousand concepts which come from 4 different data sources (Metal Bands, Metal records, Tracks and German venues for Heavy metal). I am pretty sure that increasing the size of the concept space by 2 orders of magnitude should not be a problem. And if everything works out fine I will be able to test this hypothesis on my joint project related work which will have a data base with at least 1 million concepts that need to be autocompleted.
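The core idea behind prefix autocompletion is easy to sketch. The toy below is not René's Suggest Tree or the GWT/RPC plumbing, just a sorted-set lookup over a few made-up band names, to show what the server has to answer on each keystroke.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class PrefixSuggester {
    private final TreeSet<String> concepts = new TreeSet<>();

    public PrefixSuggester(Iterable<String> names) {
        for (String n : names) {
            concepts.add(n.toLowerCase());
        }
    }

    // Return up to 'limit' concepts starting with the given prefix.
    public List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase();
        List<String> out = new ArrayList<>();
        // tailSet returns everything >= the prefix in sorted order;
        // stop as soon as an entry no longer starts with it.
        for (String candidate : concepts.tailSet(p)) {
            if (!candidate.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggester s = new PrefixSuggester(
                Arrays.asList("iron maiden", "iced earth", "in flames", "immortal"));
        System.out.println(s.suggest("i", 3)); // [iced earth, immortal, in flames]
    }
}

A real deployment would, as René describes, sit behind an RPC endpoint and answer from an in-memory index rather than rebuilding the set per request.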

Well, I must admit that 150,000 concepts sounds a bit “lite” for heavy metal but then being an admirer of the same, that comes as no real surprise. 😉

Still, it also sounds like a very good starting place.

Enjoy!

Faster Ranking As A Goal?

Filed under: PageRank,Quantum — Patrick Durusau @ 1:25 pm

When I read in Quantum Computers Could Help Search Engines Keep Up With the Internet’s Growth:

Most people don’t think twice about how Internet search engines work. You type in a word or phrase, hit enter, and poof — a list of web pages pops up, organized by relevance.

Behind the scenes, a lot of math goes into figuring out exactly what qualifies as most relevant web page for your search. Google, for example, uses a page ranking algorithm that is rumored to be the largest numerical calculation carried out anywhere in the world. With the web constantly expanding, researchers at USC have proposed — and demonstrated the feasibility — of using quantum computers to speed up that process.

“This work is about trying to speed up the way we search on the web,” said Daniel Lidar, corresponding author of a paper on the research that appeared in the journal Physical Review Letters on June 4.

As the Internet continues to grow, the time and resources needed to run the calculation — which is done daily — grow with it, Lidar said.

I thought of my post earlier today about inexact computing and how our semantics are inexact. (On the value of being inexact)

Is it the case that quantum computing is going to help us be more exact more quickly?

I am not sure what the advantage of being wrong more quickly could be. Do you?


The full reference:

Silvano Garnerone, Paolo Zanardi, Daniel Lidar. Adiabatic Quantum Algorithm for Search Engine Ranking. Physical Review Letters, 2012; 108 (23) DOI: 10.1103/PhysRevLett.108.230506

Chance discovery of an interesting journal feature:

Abstract:

We propose an adiabatic quantum algorithm for generating a quantum pure state encoding of the PageRank vector, the most widely used tool in ranking the relative importance of internet pages. We present extensive numerical simulations which provide evidence that this algorithm can prepare the quantum PageRank state in a time which, on average, scales polylogarithmically in the number of web pages. We argue that the main topological feature of the underlying web graph allowing for such a scaling is the out-degree distribution. The top-ranked log(n) entries of the quantum PageRank state can then be estimated with a polynomial quantum speed-up. Moreover, the quantum PageRank state can be used in “q-sampling” protocols for testing properties of distributions, which require exponentially fewer measurements than all classical schemes designed for the same task. This can be used to decide whether to run a classical update of the PageRank.

Physics Synopsis:

Although quantum computing has only been demonstrated for small calculations so far, researchers are interested in finding problems where its potentially massive parallelism would pay off if scaled-up versions can be made. In Physical Review Letters, Silvano Garnerone of the Institute for Quantum Computing at the University of Waterloo, Canada, and colleagues simulate the speedup achieved by using a quantum approach to rank websites.

The PageRank method, implemented by Google, assigns each website a score based on how many other sites link to it and what their scores are. Starting with an enormous matrix that represents which sites link to which others, the algorithm evaluates the probability that a steady stream of surfers starting at random sites and following random links will be found at each site. This information helps determine which search results should be listed highest. The PageRank calculation currently requires a time that is roughly proportional to the number of sites. This slowdown with size is not as bad as for many complex problems, but it can still take many days to rank the entire worldwide web.

Garnerone and colleagues propose an approach to page ranking that uses an “adiabatic quantum algorithm,” in which a simple matrix with a known solution is gradually transformed into the real problem, producing the desired solution. They simulated many relatively small networks that had similar link topology to the worldwide web, and found that reconstructing and reading out the most relevant part of the PageRank required a time that grows more slowly than the best classical algorithms available. – Don Monroe
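For contrast with the quantum proposal, the classical calculation the synopsis describes, the random-surfer steady state, can be sketched as a short power iteration. The toy below is my own illustration over four made-up pages, not anything from the paper, and it ignores dangling nodes and every other web-scale complication.

public class PageRankToy {
    public static void main(String[] args) {
        // links[i] lists the pages that page i links to (a tiny four-page web).
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;
        double damping = 0.85;                 // probability the surfer follows a link
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);  // start with a uniform distribution

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - damping) / n);
            for (int i = 0; i < n; i++) {
                // each page shares its current rank equally among its out-links
                for (int j : links[i]) {
                    next[j] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        System.out.println(java.util.Arrays.toString(rank));
    }
}

Each pass touches every link once, which is why the classical running time grows with the size of the web graph, and why a polylogarithmic quantum alternative is interesting.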

The abstract-plus-synopsis pairing looks like a really cool feature to me.

Abstract for the initiated. Synopsis for those who may be interested.

Are there IR/KD/etc. journals following that model?

Seems like a good way to create “trading zones” where we will become aware of work in other areas.

…the reasoning that people actually engage in

Filed under: Argumentation,Reasoning — Patrick Durusau @ 1:08 pm

Informal Logic: Reasoning and Argumentation in Theory and Practice

A self-description of the journal appears in the first issue, July of 1978:

However, as we found out at the Windsor Symposium, informal logic means many things to many people. Let us then declare our conception of it. For the time being, we shall use this term to denote a wide spectrum of interests and questions, whose only common link may appear to be that they do not readily lend themselves to treatment in the pages of “The Journal of Symbolic Logic.” More positively, we think of informal logic as covering the gamut of theoretical and practical issues that come into focus when one examines closely, from a normative viewpoint, the reasoning that people actually engage in. Subtract from this the exclusively formal issues and what remains is informal logic. Thus our conception is very broad and liberal, and covers everything from theoretical issues (theory of fallacy and argument) to practical ones (such as how best to display the structure of ordinary arguments) to pedagogical questions (how to design critical thinking courses; what sorts of material to use). [I changed the underlining of “The Journal of Symbolic Logic” to quotes to avoid confusion with hyperlinking. Emphasis added.]

“…the reasoning that people actually engage in” sounds like it would interest topic map authors.

Jack Park forwarded this to my attention.

On the value of being inexact

Filed under: Computation,Computer Science,Inexact,Semantics — Patrick Durusau @ 12:31 pm

Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling by Avinash Lingamneni, Kirthi Krishna Muntimadugu, Richard M. Karp, Krishna V. Palem, and Christian Piguet.

The following non-technical blurb caught my eye:

Researchers have unveiled an “inexact” computer chip that challenges the industry’s dogmatic 50-year pursuit of accuracy. The design improves power and resource efficiency by allowing for occasional errors. Prototypes unveiled this week at the ACM International Conference on Computing Frontiers in Cagliari, Italy, are at least 15 times more efficient than today’s technology.

[ads deleted]

The research, which earned best-paper honors at the conference, was conducted by experts from Rice University in Houston, Singapore’s Nanyang Technological University (NTU), Switzerland’s Center for Electronics and Microtechnology (CSEM) and the University of California, Berkeley.

“It is exciting to see this technology in a working chip that we can measure and validate for the first time,” said project leader Krishna Palem, who also serves as director of the Rice-NTU Institute for Sustainable and Applied Infodynamics (ISAID). “Our work since 2003 showed that significant gains were possible, and I am delighted that these working chips have met and even exceeded our expectations.” [From: Computing experts unveil superefficient ‘inexact’ chip, which I saw in a list of links by Greg Linden.]

Think about it. We are inexact and so are our semantics.

But we attempt to model our inexact semantics with increasingly exact computing platforms.

Does that sound like a modeling mis-match to you?

BTW, if you are interested in the details, see: Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling

Abstract:

Owing to a growing desire to reduce energy consumption and widely anticipated hurdles to the continued technology scaling promised by Moore’s law, techniques and technologies such as inexact circuits and probabilistic CMOS (PCMOS) have gained prominence. These radical approaches trade accuracy at the hardware level for significant gains in energy consumption, area, and speed. While holding great promise, their ability to influence the broader milieu of computing is limited due to two shortcomings. First, they were mostly based on ad-hoc hand designs and did not consider algorithmically well-characterized automated design methodologies. Also, existing design approaches were limited to particular layers of abstraction such as physical, architectural and algorithmic or more broadly software. However, it is well-known that significant gains can be achieved by optimizing across the layers. To respond to this need, in this paper, we present an algorithmically well-founded cross-layer co-design framework (CCF) for automatically designing inexact hardware in the form of datapath elements, specifically adders and multipliers, and show that significant associated gains can be achieved in terms of energy, area, and delay or speed. Our algorithms can achieve these gains without adding any additional hardware overhead. The proposed CCF framework embodies a symbiotic relationship between architecture and logic-layer design through the technique of probabilistic pruning combined with the novel confined voltage scaling technique introduced in this paper, applied at the physical layer. A second drawback of the state of the art with inexact design is the lack of physical evidence established through measuring fabricated ICs that the gains and other benefits that can be achieved are valid. Again, in this paper, we have addressed this shortcoming by using CCF to fabricate a prototype chip implementing inexact data-path elements; a range of 64-bit integer adders whose outputs can be erroneous. Through physical measurements of our prototype chip wherein the inexact adders admit expected relative error magnitudes of 10% or less, we have found that cumulative gains over comparable and fully accurate chips, quantified through the area-delay-energy product, can be a multiplicative factor of 15 or more. As evidence of the utility of these results, we demonstrate that despite admitting error while achieving gains, images processed using the FFT algorithm implemented using our inexact adders are visually discernible.

Why the link to the ACM Digital Library or to the “unofficial version” was not included in any of the press stories, I cannot say.

ML-Flex

Filed under: Machine Learning,ML-Flex — Patrick Durusau @ 10:46 am

ML-Flex by Stephen Piccolo.

From the webpage:

ML-Flex uses machine-learning algorithms to derive models from independent variables, with the purpose of predicting the values of a dependent (class) variable. For example, machine-learning algorithms have long been applied to the Iris data set, introduced by Sir Ronald Fisher in 1936, which contains four independent variables (sepal length, sepal width, petal length, petal width) and one dependent variable (species of Iris flowers = setosa, versicolor, or virginica). Deriving prediction models from the four independent variables, machine-learning algorithms can often differentiate between the species with near-perfect accuracy.

Machine-learning algorithms have been developed in a wide variety of programming languages and offer many incompatible ways of interfacing to them. ML-Flex makes it possible to interface with any algorithm that provides a command-line interface. This flexibility enables users to perform machine-learning experiments with ML-Flex as a harness while applying algorithms that may have been developed in different programming languages or that may provide different interfaces.

ML-Flex is described at: jmlr.csail.mit.edu/papers/volume13/piccolo12a/piccolo12a.pdf

I don’t see any inconsistency between my interest in machine learning and my view that users are the ultimate judges of semantics. Machine learning is a tool, much like indexes, concordances and other tools before it.
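As a concrete version of the Iris example in the description above, here is a minimal sketch using Weka (another Java machine-learning toolkit with a command-line interface of the kind ML-Flex can drive) rather than ML-Flex itself. I have not checked ML-Flex's own experiment configuration, so treat this as an adjacent illustration of the task, not a recipe for the harness.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrisCrossValidation {
    public static void main(String[] args) throws Exception {
        // Fisher's Iris data set; iris.arff ships with the Weka distribution.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);   // species is the class variable

        J48 tree = new J48();                           // C4.5-style decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));

        // Ten-fold cross-validation typically classifies the 150 flowers
        // with the near-perfect accuracy the description mentions.
        System.out.println(eval.toSummaryString());
    }
}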

I first saw ML-Flex at KDnuggets.

Social Annotations in Web Search

Social Annotations in Web Search by Aditi Muralidharan, Zoltan Gyongyi, and Ed H. Chi. (CHI 2012, May 5–10, 2012, Austin, Texas, USA)

Abstract:

We ask how to best present social annotations on search results, and attempt to find an answer through mixed-method eye-tracking and interview experiments. Current practice is anchored on the assumption that faces and names draw attention; the same presentation format is used independently of the social connection strength and the search query topic. The key findings of our experiments indicate room for improvement. First, only certain social contacts are useful sources of information, depending on the search topic. Second, faces lose their well-documented power to draw attention when rendered small as part of a social search result annotation. Third, and perhaps most surprisingly, social annotations go largely unnoticed by users in general due to selective, structured visual parsing behaviors specific to search result pages. We conclude by recommending improvements to the design and content of social annotations to make them more noticeable and useful.

The entire paper is worth your attention but the first paragraph of the conclusion gives much food for thought:

For content, three things are clear: not all friends are equal, not all topics benefit from the inclusion of social annotation, and users prefer different types of information from different people. For presentation, it seems that learned result-reading habits may cause blindness to social annotations. The obvious implication is that we need to adapt the content and presentation of social annotations to the specialized environment of web search.

The complexity and subtlety of semantics on the human side keeps bumping into the search/annotate-with-a-hammer approach on the computer side.

Or as the authors say: “…users prefer different types of information from different people.”

Search engineers/designers who use their preferences/intuitions as the designs to push out to the larger user universe are always going to fall short.

Because all users have their own preferences and intuitions about searching and parsing search results. What is so surprising about that?

I have had discussions with programmers who would say: “But it will be better for users to do X (as opposed to Y) in the interface.”

Know what? Users are the only measure of the fitness of an interface or success of a search result.

A “pull” model (user preferences) based search engine will gut all existing (“push” model, engineer/programmer preference) search engines.


PS: You won’t discover the range of user preferences with study groups of 11 participants. Ask one of the national survey companies and have them select several thousand participants. Then refine which preferences get used the most. It won’t happen overnight, but every percentage gain will be one the existing search engines won’t regain.

PPS: Speaking of interfaces, I would pay for a web browser that put webpages back under my control (the early WWW model).

Enabling me to defeat those awful “page is loading” ads from major IT vendors who should know better. As well as strip other crap out. It is a data stream that is being parsed. I should be able to clean it up before viewing. That could be a real “hit” and make page load times faster.

I first saw this article in a list of links from Greg Linden.

Network of data visualization references

Filed under: BigData,Curation,Graphics,Visualization — Patrick Durusau @ 9:57 am

Network of data visualization references by Nathan Yau.

From the post:

Developer Santiago Ortiz places Delicious tags for visualization references in a discovery context. There are two views. The first is a network view with tags and resources as nodes. A fisheye effect is used to zoom in on nodes and make them more readable. Mouse over a tag, and the labels for related resources get bigger, and likewise, mouse over a resource, and the related tags get bigger.

The second view lets you compare resources. In the network view, select two or more resources, and then click on the bottom button to compare the selected.

On the left hand side, top, you will see:

  • blogs
  • studios
  • people
  • tools
  • books

I had to select one of those before getting the option to switch to the second view.

The graph view seems to move too quickly but that may just be me.

I am sure there is a “big data” view of visualization but I find this more limited view quite useful.

As a matter of fact, I suspect finding sub-communities that share semantics is going to be more of a growth area than “big data.” To be sure, you may start with “big data” but you will quickly boil it down to “small data” that is both useful and relevant to your user community.

Small enough for machine-assisted curation no doubt. Where the curation is the value-add resulting in a product.

Andreas Kollegger and Neo4j [podcast]

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:40 am

Andreas Kollegger and Neo4j [podcast]

From the post:

Our own Ines Sombra interviews Andreas Kollegger about the present and future of Neo4j, how to get started, and a likely Graph Conference in San Francisco.

Andreas on Twitter: https://twitter.com/#!/akollegger Neo Technology: http://www.neotechnology.com/

Discussion:

  • 0:00 All about Andreas
  • 3:00 Working with graphs
  • 5:00 Trends with Neo4j
  • 6:00 Mistakes people make with Neo4j
  • 8:45 Areas of improvement for Graph Databases
  • 14:00 Getting started with Neo4j for Ruby and PHP
  • 15:45 What’s coming up in the Neo4j world
  • 22:30 Neo4j user groups
  • 23:15 Cat’s out of the bag: Graph Conference in SF

Links:

  • Neo4j – http://neo4j.org/
  • Neography by Max de Marzi – https://github.com/maxdemarzi/neography
  • NoSQL – http://nosql-database.org/

You may be tempted to think of this as a beginner’s guide to Neo4j, which is true, but you should also consider it a reminder of issues that you were thinking about not so long ago. 😉

Azure Changes Dress Code, Allows Tuxedos

Filed under: Cloud Computing,Linux OS,Marketing — Patrick Durusau @ 4:12 am

Azure Changes Dress Code, Allows Tuxedos by Robert Gelber.

Had it on my list to mention that Azure is now supporting Linux. Robert summarizes as follows:

Microsoft has released previews of upcoming services on their Azure cloud platform. The company seems focused on simplifying the transition of in-house resources to hybrid or external cloud deployments. Most notable is the ability for end users to create virtual machines with Linux images. The announcement will be live streamed later today at 1 p.m. PST.

Azure’s infrastructure will support CentOS 6.2, OpenSUSE 12.1, SUSE Linux Enterprise Server SP2 and Ubuntu 12.04 VM images. Microsoft has already updated their Azure site to reflect the compatibility. Other VM features include:

  • Virtual Hard Disks – Allowing end users to migrate data between on-site and cloud premises.
  • Workload Migration – Moving SQL Server, Sharepoint, Windows Server or Linux images to cloud services.
  • Common Virtualization Format – Microsoft has made the VHD file format freely available under an open specification promise.

Cloud offerings are changing, perhaps evolving would be a better word, at a rapid pace.

Although standardization may be premature, it is certainly a good time to start gathering information on services, vendors, in a way that cuts across the verbal jungle that is cloud computing PR.

Topic maps anyone?

June 12, 2012

Network Medicine: Using Visualization to Decode Complex Diseases

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 6:26 pm

Network Medicine: Using Visualization to Decode Complex Diseases

From the post:

Albert-László Barabási is a physicist, but maybe best known for his work in the field of network theory. In his TEDMED talk titled “Network Medicine: A Network Based Approach to Decode Complex Diseases” [tedmed.com], Albert-László applies advanced network theory to the field of biology.

Using a metaphor of Manhattan maps, he explains how an all-encompassing map of the relationships between genes, proteins and metabolites can form the key to truly understanding the mechanisms behind many diseases. He further makes the point that diseases should not be divided up into separate organ-based branches of medicine, but rather seen as a tightly interconnected network.

More information and movies at the post (information aesthetics)

Turns out that relationships (can you say graph/network?) are going to be critical in the treatment of disease. (Not treatment of symptoms, treatment of disease.)

Introducing Hortonworks Data Platform v1.0

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 3:27 pm

Introducing Hortonworks Data Platform v1.0

John Kreisa writes:

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore, Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering into, and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step forward for achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Hortonworks Data Platform was created to make it easier for organizations and solution providers to install, integrate, manage and use Apache Hadoop. It includes the latest stable versions of the essential Hadoop components in an integrated and tested package. Here is a diagram that shows the Apache Hadoop components included in Hortonworks Data Platform:

And I thought this was going to be a slow news week. 😉

Excellent news!

Open Content (Index Data)

Filed under: Data,Data Source — Patrick Durusau @ 3:20 pm

Open Content

From the webpage:

The searchable indexes below expose public domain ebooks, open access digital repositories, Wikipedia articles, and miscellaneous human-cataloged Internet resources. Through standard search protocols, you can make these resources part of your own information portals, federated search systems, catalogs etc. Connection instructions for SRU and Z39.50 are provided. If you have comments, questions, or suggestions for resources you would like us to add, please contact us, or consider joining the mailing list. This service is powered by Index Data’s Zebra and Metaproxy.

Looking around after reading the post on the interview with Sebastian Hammer on Federated Search I found this listing of resources.

Database name, record count, and description:

  • gutenberg (22,194 records) – Project Gutenberg. High-quality clean-text ebooks, some audio-books.
  • oaister (9,988,376 records) – OAIster. A union catalog of digital resources, chiefly open archives of journals, etc.
  • oca-all (135,673 records) – All of the ebooks made available by the Internet Archive as part of the Open Content Alliance (OCA). Includes high-quality, searchable PDFs, online book-readers, audio books, and much more. Excludes the Gutenberg sub-collection, which is available as a separate database.
  • oca-americana (49,056 records) – The American Libraries collection of the Open Content Alliance.
  • oca-iacl (669 records) – The Internet Archive Children’s Library. Books for children from around the world.
  • oca-opensource (2,616 records) – Collection of community-contributed books at the Internet Archive.
  • oca-toronto (37,241 records) – The Canadian Libraries collection of the Open Content Alliance.
  • oca-universallibrary (30,888 records) – The Universal Library, a digitization project founded at Carnegie-Mellon University. Content hosted at the Internet Archive.
  • wikipedia (1,951,239 records) – Titles and abstracts from Wikipedia, the open encyclopedia.
  • wikipedia-da (66,174 records) – The Danish Wikipedia. Many thanks to Fujitsu Denmark for their support for the indexing of the national Wikipedias.
  • wikipedia-sv (243,248 records) – The Swedish Wikipedia.
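Since these indexes are exposed over SRU as well as Z39.50, the quickest way to experiment is a plain HTTP searchRetrieve request. The sketch below uses the standard SRU 1.1 parameters; the base URL and the CQL query are placeholders of my own, so take the real connection details from the instructions on the Open Content page.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SruSearch {
    public static void main(String[] args) throws Exception {
        String base = "http://example.indexdata.com/gutenberg";   // placeholder host/database
        String query = URLEncoder.encode("dc.title = \"moby dick\"", "UTF-8");
        URL url = new URL(base
                + "?version=1.1&operation=searchRetrieve"
                + "&query=" + query
                + "&maximumRecords=5");

        // SRU responses are XML; here we simply dump them to stdout.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}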

Latency is an issue, but I wonder what my reaction would be if a search quickly offered 3 or 4 substantive resources and invited me to read/manipulate them while it seeks additional information/data.

Most of the articles you see cited in this blog aren’t the sort of thing you can skim and some take more than one pass to jell.

I suppose I could be offered 50 highly relevant articles in milliseconds, but I am not capable of assimilating them that quickly.

So how many resources have been wasted to give me a capacity I can’t effectively use?

Sebastian Hammer on Federated Search

Filed under: Federated Search,Federation,Searching — Patrick Durusau @ 2:57 pm

Sebastian Hammer on Federated Search

From the post:

Recently our own Sebastian Hammer was interviewed by David Weinberger of the Harvard Library Innovation Lab and a member of the Dev Core for the DPLA. Sebastian explained the strengths and limitations of federated search. You can listen to the 23-minute podcast here or read the transcript attached.

Hammer_Weinberger_Interview.pdf.

Short but interesting exploration of federated search.

I first saw this at Federated Search: Federated Data Explained.

One Trillion Stored (and counting) [new uncertainty principle?]

Filed under: Amazon Web Services AWS — Patrick Durusau @ 2:34 pm

Amazon S3 – The First Trillion Objects

Jeff Barr writes:

Late last week the number of objects stored in Amazon S3 reached one trillion (1,000,000,000,000 or 10^12). That’s 142 objects for every person on Planet Earth or 3.3 objects for every star in our Galaxy. If you could count one object per second it would take you 31,710 years to count them all.

We knew this day was coming! Lately, we’ve seen the object count grow by up to 3.5 billion objects in a single day (that’s over 40,000 new objects per second).

Old news because no doubt the total is greater than one trillion a week later. Or perhaps any time period greater than 1/40,000 of a second?

Is there a new uncertainty principle? Overall counts for S3 are estimates for some time X?

how much of commonsense and legal reasoning is formalizable? A review of conceptual obstacles

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 2:19 pm

how much of commonsense and legal reasoning is formalizable? A review of conceptual obstacles by James Franklin.

Abstract:

Fifty years of effort in artificial intelligence (AI) and the formalization of legal reasoning have produced both successes and failures. Considerable success in organizing and displaying evidence and its interrelationships has been accompanied by failure to achieve the original ambition of AI as applied to law: fully automated legal decision-making. The obstacles to formalizing legal reasoning have proved to be the same ones that make the formalization of commonsense reasoning so difficult, and are most evident where legal reasoning has to meld with the vast web of ordinary human knowledge of the world. Underlying many of the problems is the mismatch between the discreteness of symbol manipulation and the continuous nature of imprecise natural language, of degrees of similarity and analogy, and of probabilities.

I haven’t (yet) been able to access a copy of this article.

From the abstract,

….mismatch between the discreteness of symbol manipulation and the continuous nature of imprecise natural language, of degrees of similarity and analogy, and of probabilities.

I suspect it will be a useful reminder of the boundaries of formal information systems.

I first saw this at Legal Informatics: Franklin: How Much of Legal Reasoning Is Formalizable?

Data distillation with Hadoop and R

Filed under: Data Mining,Data Reduction,Hadoop,R — Patrick Durusau @ 1:55 pm

Data distillation with Hadoop and R by David Smith.

From the post:

We’re definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there’s the rub.

One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip through our grasp. It doesn’t matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device “exhaust” (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can’t simply make an SQL-style query to extract a ready-to-analyze table. That’s where Map-Reduce comes in.

Think of unstructured data in Hadoop as being a bit like crude oil: it’s a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.
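To make the distillation step concrete in code terms, here is a toy Map step written against the standard Hadoop MapReduce API: it pulls one structured fact out of raw log lines stored in HDFS so a reducer can turn the stream into a tidy table. The line format and field positions are invented for illustration; the downstream analysis would then happen in R, as the post describes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Toy "distillation" mapper: reads unstructured log lines and emits
// (userId, 1) pairs that a reducer can sum into a ready-to-analyze table.
public class DistillMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 3) {
            return;                 // skip lines that don't parse; noise is expected
        }
        userId.set(fields[2]);      // assume the third field is a user identifier
        context.write(userId, ONE);
    }
}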

I find the crude-oil metaphor a useful one, perhaps because I grew up in Louisiana, where land-based oil wells were abundant and there was an oil refinery only a couple of miles from my home.

Not a metaphor that will work for everyone but one you should keep in mind.

Painting with Numbers

Filed under: Graphics,Visualization — Patrick Durusau @ 1:47 pm

Painting with Numbers

From John D. Cook, some comments on: Painting with numbers : presenting financials and other numbers so people will understand you by Randall Bolten (WorldCat).

From the summary at WorldCat:

“Painting with Numbers is a thoughtful, yet practical and readable, guide to presenting numerical information effectively. The chapters are divided into two sections: “Technique” – focuses on how readers and listeners perceive numbers and what they look for in reports, describes how reports should be laid out and organized, and provides tips on how to take advantage of the available tools – principally Microsoft Excel – to minimize time spent “beautifying” reports.”Content” – provides thoughts and guidelines on specific types of reports. One chapter alone is devoted to the “Natural P&L”, the statement that is the cornerstone of management financial reporting. Other chapters discuss other reports common to businesses, reports relevant to other walks of our lives, and information that especially lends itself to presentation in graphs rather than tables”-

I did like John’s quoting:

I like #17: “I know most of you can’t read the numbers on this slide, but …”

We have all heard that about markup, code, diagrams, charts, etc.

I would like to never hear it again.

Suggestions?

Predicting link directions via a recursive subgraph-based ranking

Filed under: Graphs,Prediction,Ranking,Subgraphs — Patrick Durusau @ 1:13 pm

Predicting link directions via a recursive subgraph-based ranking by Fangjian Guo, Zimo Yang, and Tao Zhou.

Abstract:

Link directions are essential to the functionality of networks and their prediction is helpful towards a better knowledge of directed networks from incomplete real-world data. We study the problem of predicting the directions of some links by using the existence and directions of the rest of links. We propose a solution by first ranking nodes in a specific order and then predicting each link as stemming from a lower-ranked node towards a higher-ranked one. The proposed ranking method works recursively by utilizing local indicators on multiple scales, each corresponding to a subgraph extracted from the original network. Experiments on real networks show that the directions of a substantial fraction of links can be correctly recovered by our method, which outperforms either purely local or global methods.
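The prediction step itself, once a ranking is in hand, is simple to sketch. The toy below substitutes a placeholder score for the paper's recursive subgraph-based ranking and only illustrates the rule from the abstract: a link is predicted to stem from the lower-ranked node towards the higher-ranked one.

import java.util.HashMap;
import java.util.Map;

public class DirectionPredictor {

    // Predict the direction of a link between a and b from per-node scores.
    // The scores would come from the paper's recursive subgraph ranking;
    // any placeholder (normalized degree, say) works for a demonstration.
    public static String predict(String a, String b, Map<String, Double> score) {
        return score.get(a) <= score.get(b) ? a + " -> " + b : b + " -> " + a;
    }

    public static void main(String[] args) {
        Map<String, Double> score = new HashMap<>();
        score.put("alice", 0.4);   // placeholder scores
        score.put("bob", 0.9);
        System.out.println(predict("alice", "bob", score)); // alice -> bob
    }
}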

This paper focuses mostly on prediction of direction of links, relying on other research for the question of link existence.

I mention it because predicting links and their directions will be important for planning graph database deployments in particular.

It will be a little late to find out, under full load, that other modeling choices should have been made. (It is usually under “full load” conditions that retrospectives on modeling choices come up.)

Dreams of Universality, Reality of Interdisciplinarity [Indexing/Mapping Pidgin]

Filed under: Complexity,Indexing,Mapping — Patrick Durusau @ 12:55 pm

Complex Systems Science: Dreams of Universality, Reality of Interdisciplinarity by Sebastian Grauwin, Guillaume Beslon, Eric Fleury, Sara Franceschelli, Jean-Baptiste Rouquier, and Pablo Jensen.

Abstract:

Using a large database (~ 215 000 records) of relevant articles, we empirically study the “complex systems” field and its claims to find universal principles applying to systems in general. The study of references shared by the papers allows us to obtain a global point of view on the structure of this highly interdisciplinary field. We show that its overall coherence does not arise from a universal theory but instead from computational techniques and fruitful adaptations of the idea of self-organization to specific systems. We also find that communication between different disciplines goes through specific “trading zones”, ie sub-communities that create an interface around specific tools (a DNA microchip) or concepts (a network).

If disciplines don’t understand each other…:

Where do the links come from then? In an illuminating analogy, Peter Galison [32] compares the difficulty of connecting scientific disciplines to the difficulty of communicating between different languages. History of language has shown that when two cultures are strongly motivated to communicate – generally for commercial reasons – they develop simplified languages that allow for simple forms of interaction. At first, a “foreigner talk” develops, which becomes a “pidgin” when social uses consolidate this language. In rare cases, the “trading zone” stabilizes and the expanded pidgin becomes a creole, initiating the development of an original, autonomous culture. Analogously, biologists may create a simplified and partial version of their discipline for interested physicists, which may develop to a full-blown new discipline such as biophysics. Specifically, Galison has studied [32] how Monte Carlo simulations developed in the postwar period as a trading language between theorists, experimentalists, instrument makers, chemists and mechanical engineers. Our interest in the concept of a trading zone is to allow us to explore the dynamics of the interdisciplinary interaction instead of ending analysis by reference to a “symbiosis” or “collaboration”.

My interest is in how to leverage “trading zones” for the purpose of indexing and mapping (as in topic maps).

Noting that “trading zones” are subject to empirical discovery and no doubt change over time.

Discovering and capitalizing on such “trading zones” will be a real value-add for users.

June 11, 2012

Social Design – Systems of Engagement

Filed under: Design,Interface Research/Design — Patrick Durusau @ 4:28 pm

I almost missed this series except this caught my eye skimming posts:

“We were really intrigued when we heard Cognizant’s Malcolm Frank and industry guru Geoffrey Moore discuss the enterprise/consumer IT divide. They liken it to the Sunday night vs. Monday morning experience.

It goes something like this … on a typical Sunday night, we use our iPhones/iPads and interact with our friends via Facebook/Twitter etc. It is a delightful experience. Malcolm and Geoff call these environments “Systems of Engagement”. Then Monday morning arrives and we show up in the office and are subjected to applications like the Timesheet I described in the previous post. Adding to the misery is the landline phone, a gadget clearly from the previous millennium (and alien to most millennials who came of age with mobile phones).

We then asked ourselves this additional question – did any of us attend a training program, seminar or e-learning program to use the iPhone, iPad, Facebook, Twitter, etc? The answer, obviously is NO. Why then, we concluded, do users need training to use corporate IT applications!

(http://dealarchitect.typepad.com/deal_architect/2012/06/the-pursuit-of-employee-delight-part-2.html)

Let me make that question personal: Do your users require training to use your apps?

Or do any of these sound familiar?

1. Confusing navigation. There were just too many steps to reach the target screen. Developed by different groups, each application had its own multi-level menu structure. Lack of a common taxonomy further complicated usability.

2. Each screen had too many fields which frustrated users. Users had to go through several hours of training to use the key applications.

3. Some applications were slow, especially when accessed from locations far away from our data centers.

4. Each application had its own URL and required a separate login. Sometimes, one application had many URLs. Bookmarking could never keep up with this. Most importantly, new associates could never easily figure out which application was available at which URL.

5. All applications were generating huge volumes of email alerts to keep the workflow going. This resulted in tremendous e-mail overload.

(http://dealarchitect.typepad.com/deal_architect/2012/06/the-real-deal-sukumar-rajagopal-on-a-cios-pursuit-of-employee-delight.html)

Vinnie Mirchandani covers systems of engagement in five posts that I won’t attempt to summarize.

The Real Deal: Sukumar Rajagopal on a CIO’s Pursuit of Employee Delight

The Pursuit of Employee Delight – Part 2

The Pursuit of Employee Delight – Part 3

The Pursuit of Employee Delight – Part 4

The Pursuit of Employee Delight – Part 5

There is much to learn here.

Monday Fun: Seven Databases in Song

Filed under: Database,Humor — Patrick Durusau @ 4:27 pm

Monday Fun: Seven Databases in Song

From the post:

If you understand things best when they’re formatted as a musical, this video is for you. It teaches the essentials of PostgreSQL, Riak, HBase, MongoDB, CouchDB, Neo4J and Redis in the style of My Fair Lady. And for a change, it’s very SFW.

This is a real hoot!

It went by a little too quickly to make sure it covered everything but it covered a lot. 😉

All kidding aside, there have been memorization techniques that relied upon rhyme and song.

Not saying you will have a gold record with an album of Hadoop commands with options, but you might gain some notoriety.

If you start setting *nix commands to song, I don’t think Stairway to Heaven is long enough for sed and all its options.

HBase Con2012

Filed under: Conferences,HBase — Patrick Durusau @ 4:26 pm

HBase Con2012

Slides and in some cases videos from HBase Con2012.

Mostly slides at this point; I counted six (6) videos as of June 11, 2012, in the late afternoon on the East Coast.

Will keep checking back as there is a lot of good content.

Machine Learning in Java has never been easier! [Java App <-> BigML Rest API]

Filed under: Java,Machine Learning — Patrick Durusau @ 4:25 pm

Machine Learning in Java has never been easier!

From the post:

Java is by far one of the most popular programming languages. It’s on the top of the TIOBE index and thousands of the most robust, secure, and scalable backends have been built in Java. In addition, there are many wonderful libraries available that can help accelerate your project enormously. For example, most of BigML’s backend is developed in Clojure which runs on top of the Java Virtual Machine. And don’t forget the ever-growing Android market, with 850K new devices activated each day!

There are number of machine learning Java libraries available to help build smart data-driven applications. Weka is one of the more popular options. In fact, some of BigML’s team members were Weka users as far back as the late 90s. We even used it as part of the first BigML backend prototype in early 2011. Apache Mahout is another great Java library if you want to deal with bigger amounts of data. However in both cases you cannot avoid “the fun of running servers, installing packages, writing MapReduce jobs, and generally behaving like IT ops folks“. In addition you need to be concerned with selecting and parametrizing the best algorithm to learn from your data as well as finding a way to activate and integrate the model that you generate into your application.

Thus we are thrilled to announce the availability of the first Open Source Java library that easily connects any Java application with the BigML REST API. It has been developed by Javi Garcia, an old friend of ours. A few of the BigML team members have been lucky enough to work with Javi in two other companies in the past.

With this new library, in just a few lines of code you can create a predictive model and generate predictions for any application domain. From finding the best price for a new product to forecasting sales, creating recommendations, diagnosing malfunctions, or detecting anomalies.
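For orientation, the kind of raw REST interaction such a library wraps looks roughly like the sketch below. The endpoint, resource names and JSON payload are placeholders rather than anything taken from BigML's documentation or from the new library's API, so check both before borrowing any of it.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class RestApiSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; consult the BigML API docs,
        // or let the new Java library wrap calls like this for you.
        String endpoint = "https://bigml.io/prediction?username=USER&api_key=KEY";
        String payload = "{\"model\": \"model/PLACEHOLDER_ID\","
                       + " \"input_data\": {\"sepal length\": 5.1}}";

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // Read back the JSON response describing the created prediction resource.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}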

It won’t be as easy as “…in just a few lines of code…” but it will, what’s the term, modularize the building of machine learning applications. Someone has to run/maintain the servers, do security patches, backups but it doesn’t have to be you.

Specialization, that’s the other term. So that team members can be really good at what they do, as opposed to sorta good at a number of things.

If you need a common example, consider documentation, most of which is written by developers when they can spare the time. Reads like it. Costs your clients time and money trying to get their developers to work with poor documentation.

Not to mention costing you time and money when the software is no longer totally familiar to one person.

PS: As of today, June 11, 2012, Java is now #2 and C is #1 on the TIOBE list.

Flowchart: Connections in Stephen King novels

Filed under: Flowchart,Humor,Mapping — Patrick Durusau @ 4:24 pm

Flowchart: Connections in Stephen King novels by Nathan Yau.

For your modeling exercise and amusement, a flowchart of connections in Stephen King novels (excluding the Dark Tower series). I’m not sure what impact excluding the Dark Tower series has on the flowchart. If you discover it, please report back.

Topic map and other semantic modeling groups could use this flowchart as the answer to Google employment questions. 😉

Speaking of modeling, I wonder how many degrees of separation there are between characters in novels?

And how would they be connected? Family names, places of employment, physical locations, perhaps even fictional connections?

That could be an interesting mapping exercise.

Neo4j in the Trenches [webinar – Thursday June 14 10:00 PDT / 19:00 CEST]

Filed under: Graphs,Neo4j,Recommendation — Patrick Durusau @ 4:23 pm

Neo4j in the Trenches

Thursday June 14 10:00 PDT / 19:00 CEST

From the webpage:

OpenCredo discusses Opigram: a social recommendation engine

In this webinar, Nicki Watt of OpenCredo presents the lessons learned (and being learned) on an active Neo4j project: Opigram. Opigram is a socially oriented recommendation engine which is already live, with some 150k users and growing. The webinar will cover Neo4j usage, challenges encountered, and solutions to these challenges.

I was curious enough to run down the homepage for OpenCredo.

Now there is an interesting homepage!

The blog post titles promise some interesting reading.

I will report back as I find items of interest.

Open-Source R software driving Big Data analytics in government

Filed under: Analytics,BigData,R — Patrick Durusau @ 4:22 pm

Open-Source R software driving Big Data analytics in government by David Smith.

From the post:

As government agencies and departments expand their capabilities for collecting information, the volume and complexity of digital data stored for public purposes is far outstripping departments’ ability to make sense of it all. Even worse, with data siloed within individual departments and little cross-agency collaboration, untold hours and dollars are being spent on data collection and storage with return on investment in the form of information-based products and services for the public good.

But that may now be starting to change, with the Obama administration’s Big Data Research and Development Initiative.

In fact, the administration has had a Big Data agenda since its earliest days, with the appointment of Aneesh Chopra as the nation’s first chief technology officer in 2009. (Chopra passed the mantle to current CTO Todd Park in March.) One of Chopra’s first initiatives was the creation of data.gov, a vehicle to make government data and open-source tools available in a timely and accessible format for a community of citizen data scientists to make sense of it all.

For example, independent statistical analysis of data released by data.gov revealed a flaw in the 2000 Census results that apparently went unnoticed by the Census Bureau.

David goes on to give some other examples of the use of R with government data.

The US federal government is diverse enough that its IT solutions will be diverse as well. But R will be familiar to some potential clients.

I first saw this at the Revolutions blog on R.

