Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 22, 2011

Designing Faceted Searches

Filed under: Facets,Search Interface,Searching — Patrick Durusau @ 6:09 pm

Tony Russell-Rose has been doing a series of posts on faceted searches.

Since topic maps capture information that can be presented as facets, I thought it would be helpful to gather up the links to Tony’s posts for your review.

Interaction Models for Faceted Search

Where am I? Techniques for wayfinding and navigation in faceted search

Designing Faceted Search: Getting the basics right (part 1)

Designing Faceted Search: Getting the basics right (part 2)

Designing Faceted Search: Getting the basics right (part 3)

And a couple of related goodies:

A Taxonomy of Search Strategies and their Design Implications

From Search to Discovery: Information Search Strategies and Design Solutions

Word of warning: You can easily lose hours if not days chasing down design insights that remain just out of reach. Have fun!

Hoop – Hadoop HDFS over HTTP

Filed under: Hadoop — Patrick Durusau @ 6:08 pm

Hoop – Hadoop HDFS over HTTP

From the webpage:

Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.

Hoop can be used to:

  • Access HDFS using HTTP REST.
  • Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).
  • Access data in an HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system that is allowed to go through the firewall.

Hoop has a Hoop client and a Hoop server component:

  • The Hoop server component is a REST HTTP gateway to HDFS supporting all file system operations. It can be accessed using standard HTTP tools (e.g., curl and wget), HTTP libraries from different programming languages (e.g., Perl, JavaScript), as well as the Hoop client. The Hoop server component is a standard Java web application and it has been implemented using Jersey (JAX-RS).
  • The Hoop client component is an implementation of the Hadoop FileSystem client that allows using the familiar Hadoop filesystem API to access HDFS data through a Hoop server.
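If you are wondering what "HDFS over HTTP" looks like on the wire, here is a small Python sketch of how a client might build Hoop request URLs. The parameter names (op, user.name) and the host and port are my assumptions, following the convention that later surfaced in WebHDFS; check the Hoop documentation for your server version before relying on them.

```python
from urllib.parse import urlencode

def hoop_url(host, port, path, op, user, **params):
    # Build a Hoop-style REST URL for an HDFS operation.
    # The query parameter names (op, user.name) are assumptions based
    # on the WebHDFS-style convention; verify against the Hoop docs.
    query = {"op": op, "user.name": user}
    query.update(params)
    return "http://{}:{}{}?{}".format(host, port, path, urlencode(query))

# Reading a file through the gateway might look like:
print(hoop_url("hoop-gw.example.com", 14000, "/user/alice/data.txt", "open", "alice"))
```

The point is that any HTTP client, from curl to a browser, can speak this protocol; no Hadoop RPC libraries required, which is exactly what makes the cross-version and firewall use cases above work.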

Clojurescript

Filed under: Clojure,Javascript — Patrick Durusau @ 6:07 pm

Clojurescript

From the homepage:

ClojureScript is a dialect of Clojure that targets JavaScript as a deployment platform.

From the rationale:

ClojureScript seeks to address the weak link in the client/embedded application development story by replacing JavaScript with Clojure, a robust, concise and powerful programming language. In its implementation, ClojureScript adopts the strategy of the Google Closure library and compiler, and is able to effectively leverage both tools, gaining a large, production-quality library and whole-program optimization. ClojureScript brings the rich data structure set, functional programming, macros, reader, destructuring, polymorphism constructs, state discipline and many other features of Clojure to every place JavaScript reaches.

Just in case you ever work on the client side of topic maps. 🙂

Random Graphs Anyone?

Filed under: Graphs,Mathematics,Wandora — Patrick Durusau @ 6:06 pm

I saw a tweet from @CompSciFact (John Cook) pointing out Luc Devroye’s (McGill University) Non-Uniform Random Variate Generation (Springer-Verlag, New York, 1986) was available for free download.

Amazon lists used copies starting at $180.91 and one new copy for $618.47, so you are better off with the scanned PDF, unless you are simply trying to burn up grant funding before the end of a year.

Chapter XIII. RANDOM COMBINATORIAL OBJECTS includes random graphs and notes:

Graphs are the most general combinatorial objects dealt with in this chapter. They have applications in nearly all fields of science and engineering. It is quite impossible to give a thorough overview of the different subclasses of graphs, and how objects in these subclasses can be generated uniformly and at random. Instead, we will just give a superficial treatment, and refer the reader to general principles or specific articles in the literature whenever necessary.

For one use of random graphs in topic maps work, see the Random Graph Generator in Wandora.
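If you have never generated a random graph, one of the simplest models is the Erdos-Renyi G(n, p) model: include each possible edge independently with probability p. It fits in a few lines of Python (my own sketch, not code from the book):

```python
import random

def random_graph(n, p, seed=None):
    # Erdos-Renyi G(n, p): include each of the n*(n-1)/2 possible
    # undirected edges independently with probability p.
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

edges = random_graph(100, 0.05, seed=42)
print(len(edges), "edges out of a possible", 100 * 99 // 2)
```

Uniform generation over more constrained subclasses (connected graphs, trees, fixed degree sequences) is where the chapter gets interesting; rejection and transformation methods carry most of the load there.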

Hadoop for Intelligence Analysis???

Filed under: Hadoop,Intelligence — Patrick Durusau @ 6:05 pm

Hadoop for Intelligence Analysis

From the webpage:

CTOlabs.com, a subsidiary of the technology research, consulting and services firm Crucial Point LLC and a peer site of CTOvision.com, has just published a white paper providing context, tips and strategies around Hadoop titled “Hadoop for Intelligence Analysis.” This paper focuses on use cases selected to be informative to any organization thinking through ways to make sense out of large quantities of information.

I’m curious. How little would you have to know about Hadoop or intelligence analysis to get something from the “white paper?”

Or is having “Hadoop” in a title these days enough to gain a certain number of readers?

Unless you want to answer my first question, I suggest that you avoid this “white paper” as “white noise.”

Your time can be better spent doing almost anything.

EC Tender for Open Data Portal

Filed under: EU,Open Data — Patrick Durusau @ 6:03 pm

Deadline: 19 September 2011

From the announcement:

Today, [19 July 2011] the European Commission has taken a new step in realising a European Data Portal. They have published a call for tenders to develop the data portal on its electronic Tender Portal ted.europa.eu. All information can be found on this page.

Luxembourg, 19 July 2011

(by Tom Kronenburg)

At the Digital Agenda Assembly workshop on Open Data in June, Mr. Khalil Rouhana of the European Commission announced the intention (slide 7) to build a European Open Data portal. Rouhana said that an EC portal should become operational in 2012, holding a significant amount of EC datasets. It is also planned that by 2013 a pan-European data portal should present datasets published by the Member States.

Today, the European Commission has taken a new step in realizing the European Data Portal. The EC has published a call for tenders to develop the data portal on its electronic Tender Portal ted.europa.eu. The call for tenders is one of the necessary steps for realizing the ambition of creating one pan-European Open Data portal.

The tender procedure will result in a contract that encompasses four types of services:

  • to develop and administer a web portal to act as a single point of access to data sets produced and held by European Commission services (and by extension to data sets produced and held by other European institutions/bodies and other public bodies),
  • to assist the Commission with the definition and implementation of a data set publication process,
  • to assist the Commission with the preparation of data sets for publication via the portal,
  • to assist the Commission in supporting and engaging the stakeholders’ community interested in re-using the published data sets.

I checked to be sure and the tender is open to people based in the United States.

This looks like it could be both interesting and fun.

Check with your usual major players to see if you can contract out for part of the action in case they are successful.

July 21, 2011

Wonderdog

Filed under: ElasticSearch,Hadoop,Pig — Patrick Durusau @ 6:30 pm

Wonderdog

From the webpage:

Wonderdog is a Hadoop interface to Elastic Search. While it is specifically intended for use with Apache Pig, it does include all the necessary Hadoop input and output formats for Elastic Search. That is, it’s possible to skip Pig entirely and write custom Hadoop jobs if you prefer.

I may just be paying more attention but the search scene seems to be really active.

That’s good for topic maps because the more data that is searched, the greater the likelihood of heterogeneous data. Text messages between teens are probably heterogeneous but who cares?

Medical researchers using different terminologies produce heterogeneous data, not just today’s data but yesteryear’s as well. Now that could be important.

Oracle, Sun Burned, and Solr Exposure

Filed under: Data Mining,Database,Facets,Lucene,SQL,Subject Identity — Patrick Durusau @ 6:27 pm

Oracle, Sun Burned, and Solr Exposure

From the post:

Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.

We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise grade enterprise search solutions, is.

If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.

HBase at YFrog

Filed under: HBase — Patrick Durusau @ 6:26 pm

HBase at YFrog

Alex Popescu’s summary of slides on the use of HBase at YFrog.

Impressive numbers!

Hadoop Advancing Information Management

Filed under: Conferences,Hadoop — Patrick Durusau @ 6:25 pm

Hadoop Advancing Information Management

I first saw this at Alex Popescu’s myNoSQL site and decided to explore further.

From the Ventana Research site:

Newly conducted benchmark research from Ventana Research shows organizations are recognizing that addressing big data needs requires new approaches to data and information management. New processes and new technologies have begun to take hold, as have the beginnings of a set of best practices. In particular, Hadoop is emerging at the forefront as a solution for managing large-scale data. The research findings indicate that Hadoop is already being used in one third of big data environments and evaluated in nearly another fifth. The research also found that Hadoop is additive to existing technologies according to almost two thirds of research participants.

Topping the lists of benefits in Hadoop adoption are newly found capabilities – 87% of organizations using Hadoop report being able to do new things with big data versus 52% of other organizations, 94% perform new types of analytics on large volumes of data, and 88% analyze data at a greater level of detail. These research statistics already validate the arrival of Hadoop as a key component of organizations’ information management efforts. However, challenges remain, with over half the organizations indicating some level of dissatisfaction with Hadoop.

Like me, you probably want more than breathless numbers. More detail is going to be available:

Ventana Research will detail the findings of this benchmark research in a live interactive webinar on July 28, 2011 at 10:00 AM Pacific time [1 PM Eastern] that will discuss the research findings and offer recommendations for improvement. Key research findings to be discussed will include:

  • The current state of organizations’ thinking on how best to apply Big Data management techniques.
  • Top patterns in the adoption of new methods and technologies.
  • The current state, future direction, and potential investments.
  • The competencies required to manage large-scale data.
  • Recommendations for organizations to act on immediately.

Of course I am looking for places in the use of Hadoop where subject identity is likely to be recognized as an issue.

Graph Databases

Filed under: Graphs,Topic Maps — Patrick Durusau @ 6:23 pm

Graph Databases by Josh Adell.

You won’t learn anything new, but it may be a nice slide deck to pass along to others who aren’t familiar with graph databases.

How would you use his observation that relationships are first class citizens in a graph database to support a discussion of what subjects need to be first class citizens in a topic map?

ELN Integration: Avoiding the Spaghetti Bowl

Filed under: Data Integration,ELN Integration — Patrick Durusau @ 6:11 pm

ELN Integration: Avoiding the Spaghetti Bowl by Michael H. Elliott. (Scientific Computing, May 2011)

Michael writes:

…over 20 percent of the average scientist’s time is spent on non-value-added data aggregation, transcription, formatting and manual documentation. [p.19]

…in a recent survey of over 400 scientists, “integrating data from multiple systems” was cited as the number one laboratory data management challenge. [p. 19]

The multiple terminologies various groups use can also impact integration. For example, what a “lot” or “batch” means can vary by who you ask: the medicinal chemist, formulator, or biologics process development scientist. A common vocabulary can be one of the biggest stumbling blocks, as it involves either gaining consensus, defining semantic relationships and/or data transformations. [p.21]

Good article that highlights the on-going difficulty that scientists face with ELN (Electronic Lab Notebook) solutions.

It was refreshing to hear someone mention organizational and operational issues being “…more difficult to address than writing code.”

Technical solutions cannot address personnel, organizational or semantic issues.

However tempting it may be to “wait and see,” the personnel, organizational and semantic issues you had before an integration solution will be there post-integration solution. That’s a promise.

Edward Tufte’s “Slopegraphs”

Filed under: Graphs,Visualization — Patrick Durusau @ 6:10 pm

Edward Tufte’s “Slopegraphs”

From the post:

Back in 2004, Edward Tufte defined and developed the concept of a “sparkline”. Odds are good that — if you’re reading this — you’re familiar with them and how popular they’ve become.

What’s interesting is that over 20 years before sparklines came on the scene, Tufte developed a different type of data visualization that didn’t fare nearly as well. To date, in fact, I’ve only been able to find three examples of it, and even they aren’t completely in line with his vision.

It’s curious that it hasn’t become more popular, as the chart type is quite elegant and aligns with all of Tufte’s best practices for data visualization, and was created by the master of information design. Why haven’t these charts (christened “slopegraphs” by Tufte about a month ago) taken off the way sparklines did?

In this post, we’re going to look at slopegraphs — what they are, how they’re made, why they haven’t seen a massive uptake so far, and why I think they’re about to become much more popular in the near future.

How to “best” visualize data is in part an aesthetic choice and this article expands your range of choices.
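If you want to play with the idea before reading the post, the data transform behind a slopegraph is trivial: two value columns per label, sorted, with the direction of each slope carrying the message. A toy text version in Python (the figures below are made up for illustration, not taken from Tufte):

```python
def slopegraph_rows(data):
    # data maps label -> (left_value, right_value).
    # Sort by the left column, descending, and note each slope's direction.
    rows = []
    for label, (left, right) in sorted(data.items(), key=lambda kv: -kv[1][0]):
        trend = "rising" if right > left else "falling" if right < left else "flat"
        rows.append("{:<10} {:>6} -> {:>6}  ({})".format(label, left, right, trend))
    return rows

# Hypothetical two-period figures, purely to show the shape of the input.
data = {"Sweden": (46.9, 57.4), "Norway": (43.5, 39.0), "Canada": (35.2, 35.8)}
print("\n".join(slopegraph_rows(data)))
```

A real slopegraph would draw connecting lines between the two columns; matplotlib or D3 can do that once the rows are in this shape.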

July 20, 2011

Harnessing the Power of Apache Hadoop:…

Filed under: Hadoop — Patrick Durusau @ 3:38 pm

Harnessing the Power of Apache Hadoop: How to Rapidly and Reliably Operationalize Apache Hadoop and Unlock the Value of Data

I swear! That really is the title! From the page just below it:

Webinar: Thursday, July 21, 2011 10:00 AM – 10:45 AM PDT

Join Cloudera’s CEO Mike Olson on July 21st for a webinar about optimizing your Apache Hadoop deployment and leveraging this completely open source technology in production for Big Data analytics and asking questions across structured and unstructured data that were previously impossible to solve.

  • Learn why you must consider this open source technology in order to evolve your company
  • Understand how Apache Hadoop can provide your organization with a holistic view and insight into data
  • Learn how you can easily configure Apache Hadoop for your enterprise
  • Find out how several well-known organizations are using Apache Hadoop to solve real-world business problems such as increasing revenue, delivering better business solutions and ensuring network performance

Ambitious goals for forty-five (45) minutes in a forum where a lot of folks will also be reading email, tweets, etc., but maybe Mike really is that good. 😉

I am sure it will be interesting, but hopefully it will also be recorded.

My experience is that most webinars are good for picking up memes and themes that you can then explore at your leisure.


Suggested future title: Hadoop: Where Data Hits The Road. Zero hits in Google for “Where Data Hits The Road” as of 16:30 Eastern Time, 20 July 2011.

To be fair, the present title, “Harnessing the Power of Apache Hadoop: How to Rapidly and Reliably Operationalize Apache Hadoop and Unlock the Value of Data” gets five (5) hits in Google.

Wonder which one would propagate better?

The Britney Spears Problem

Filed under: Data Streams,Data Structures,Topic Maps — Patrick Durusau @ 1:05 pm

The Britney Spears Problem by Brian Hayes.

From the article:

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a “teen songstress,” later a “pop tart”—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That’s a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it’s not the question I’m going to take up here. What I’m trying to understand is how we can know Britney’s ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent? (emphasis added)

Deeply interesting discussion on the analysis of stream data and algorithms for the same. Very much worth a close read if you are working on or interested in such issues.
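To give a taste of the territory: one classic answer to “count frequent items in a stream you cannot store” is the Misra-Gries summary, which keeps at most k-1 counters no matter how long the stream runs. A sketch in Python (my own toy version, not code from the article):

```python
def misra_gries(stream, k):
    # Keep at most k-1 counters. Any item occurring more than
    # len(stream)/k times is guaranteed to survive in the summary,
    # though the surviving counts are underestimates.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement everything, dropping zeroed counters.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

queries = ["britney"] * 5 + ["pamela"] * 3 + ["paris", "lycos", "weather"]
print(misra_gries(queries, 3))
```

A second pass over the stream, when you can afford one, turns the surviving candidates into exact counts.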

The article concludes:

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous “pop tart.” But I like to think the effort might be justified. Years from now, someone will type “Britney Spears” into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry. (emphasis added)

But what if the user enters “pop tart?” Will they still find this article? Or will it be “hit” number 100,000, which almost no one reaches? As of 20 July 2011, there were some 13 million “hits” for “pop tart” on a popular search engine. I suspect at least some of them are not about Britney Spears.

So, should I encounter a resource about Britney Spears, using the term “pop tart,” how am I going to accumulate those for posterity?

Or do we all have to winnow search chaff for ourselves?*

*Question for office managers: How much time do you think your staff spends winnowing search chaff already winnowed by another user in your office?

…Develop[ing] Personal Search Engine

Filed under: Marketing,Search Engines — Patrick Durusau @ 1:01 pm

Ness Computing Announces $5M Series A Financing to Develop Personal Search Engine

From the post:

SILICON VALLEY, Calif., July 19, 2011 /PRNewswire/ — Ness Computing is announcing that it raised a $5M Series A round of financing in November 2010. The round was led by Vinod Khosla and Ramy Adeeb of Khosla Ventures, with participation from Alsop Louie Partners, TomorrowVentures, Bullpen Capital, a co-founder of Palantir Technologies and several angel investors. This financing is enabling the company’s team of engineers and scientists, with expertise in information retrieval and machine learning, to pursue their vision to change the nature of search by building technology that delivers results and recommendations that are unique to each person using it.

The technology, which the company calls a Likeness Engine, represents a new approach to this complex engineering challenge by fusing a search engine and a recommendation engine, and will power the company’s first product, a mobile service called Ness. The Likeness Engine is different from traditional search engines that are useful for finding fact-based objective information that is the same for everyone, such as weather reports, dictionary terms, and stock prices. Ness Computing’s vision is to answer questions of a more subjective nature by understanding each person’s likes and dislikes, to deliver results that match his or her personal tastes. This can be seen in the difference between a person asking, “Which concerts are playing in New York City?” and “Which concerts would I most enjoy in New York City?” Ultimately, Ness aims to help people make decisions about dining, nightlife, entertainment, shopping, music, travel and more, culled expressly for them from the world’s almost limitless options.

Impressive array of previously successful talent.

I am not sure I buy the “objective” versus “subjective” information divide but clearly Ness is interested in learning the user’s view of the world in order to “better” answer their questions.

Depending on how successful the searches by Ness become, a user could become insulated in a cocoon of previous expressions of likes and dislikes.

That isn’t an original insight; I saw it somewhere in an article about personalized search results from search engines. Nor is it a problem that arose due to personalization of search engines.

The average user (read: not a librarian) tends to search for terms in a field or subject area that they already know. So they are unlikely to encounter information that uses different terminology. In a very real way, users’ searches are already highly personalized.

Personalization isn’t a bad thing but it is a limiting thing. That is, it puts a border on the information that you will get back from a search, and you won’t have much of an opportunity to go beyond that. It simply never comes up. And information overload being what it is, having limited, safe results can be quite useful. Particularly if you like sleeping at the Holiday Inn, eating at McDonald’s and watching American Idol.

Hopefully Ness will address the semantic diversity issue in order to provide users, at least the ones who are interested, with a richer search experience. Topic maps would be useful in such an attempt.

Growing a DSL with Clojure

Filed under: Clojure,DSL — Patrick Durusau @ 12:58 pm

Growing a DSL with Clojure by Ambrose Bonnaire-Sergeant.

From the post:

From seed to full bloom, Ambrose takes us through the steps to grow a domain-specific language in Clojure.

Lisps like Clojure are well suited to creating rich DSLs that integrate seamlessly into the language.

You may have heard Lisps boasting about code being data and data being code. In this article we will define a DSL that benefits handsomely from this fact.

We will see our DSL evolve from humble beginnings, using successively more of Clojure’s powerful and unique means of abstraction.

You know, the “…code being data and data being code” line reminds me of DATATAG in ISO 8879 (SGML).

I suspect this gets us keys as first-class citizens, but that will have to await another post.

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs!

Filed under: Algorithms,Search Algorithms — Patrick Durusau @ 12:56 pm

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs! by Kiran Kumar Sundararajan, Mita Pal, Soubhik Chakraborty, N.C. Mahanti.

From the description:

Sundararajan and Chakraborty (2007) introduced a new version of Quick sort removing the interchanges. Khreisat (2007) found this algorithm to be competing well with some other versions of Quick sort. However, it uses an auxiliary array thereby increasing the space complexity. Here, we provide a second version of our new sort where we have removed the auxiliary array. This second improved version of the algorithm, which we call K-sort, is found to sort elements faster than Heap sort for an appreciably large array size (n <= 70,00,000) for uniform U[0, 1] inputs.

OK, so some people have small data, n <= 70 lakhs (7,000,000 elements), to sort at one time. 😉 It also shows that there are still interesting things to say about sorting, an operation that is of interest to topic mappers and others.
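I don’t have a K-sort implementation handy, but the paper’s style of comparison is easy to reproduce for the algorithms you do have. A rough Python harness, pitting heap sort (built on the stdlib heapq) against the built-in sort on uniform U[0, 1] inputs, as in the paper:

```python
import heapq
import random
import time

def heap_sort(items):
    # Heap sort built on the stdlib heapq module.
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

def time_sort(sort_fn, n, seed=0):
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]  # uniform U[0, 1], as in the paper
    start = time.perf_counter()
    result = sort_fn(data)
    elapsed = time.perf_counter() - start
    assert result == sorted(data)  # sanity check
    return elapsed

n = 100000  # scale toward 70 lakhs (7,000,000) if you have the patience
print("heap sort:", time_sort(heap_sort, n))
print("built-in: ", time_sort(sorted, n))
```

One n and one input distribution prove little on their own; the paper’s claim is itself specific to uniform inputs and a bounded range of n, which is worth keeping in mind when reading any sorting benchmark.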

Cassandra SF 2011

Filed under: Cassandra,Conferences — Patrick Durusau @ 12:55 pm

Cassandra SF 2011

Slides with videos to follow!

From the website:

Keynote Presentation

  • Jonathan Ellis (DataStax): State of Cassandra, 2011 (Slides)

Cassandra Internals

  • Ed Anuff: Indexing in Cassandra (Slides)
  • Gary Dusbabek (RackSpace): Cassandra Internals (Slides)
  • Sylvain Lesbresne (DataStax): Counters in Cassandra (Slides)

High-Level Cassandra Development

  • Eric Evans (Rackspace): CQL – Not just NoSQL, It’s MoSQL (Slides)
  • Jake Luciani (DataStax): Scaling Solr with Cassandra (Slides)

Lightning Talks

  • Ben Coverston (DataStax): Redesigned Compaction LevelDB (Slides)
  • Joaquin Casares (DataStax): The Auto-Clustering Brisk AMI (Slides)
  • Matt Dennis (DataStax): Cassandra Anti-Patterns (Slides)
  • Mike Bulman (DataStax): OpsCenter: Cluster Management Doesn’t Have To Be Hard (Slides)
  • Stu Hood (Twitter): Prometheus’ Patch: #674 and You (Slides)

Practical Development

  • Jeremy Hanna (Dachis): Using Pig alongside Cassandra (Slides)
  • Matt Dennis (DataStax): Data Modeling Workshop (Slides)
  • Nate McCall (DataStax): Cassandra for Java Developers (Slides)
  • Yewei Zhang (DataStax): Hive Over Brisk (Slides)

Products

  • Jake Luciani (DataStax): Introduction to Brisk (Slides)
  • Kyle Roche (Isidorey): Cloudsandra: Multi-tenant Platform Built on Brisk (Slides)

Use Cases

  • Adrian Cockcroft (Netflix): Migrating Netflix from DataCenter Oracle to Global Cassandra (Slides)
  • Chris Goffinet (Twitter): Cassandra at Twitter (Slides)
  • David Strauss (Pantheon): Highly Available DNS and Request Routing Using Apache Cassandra (Slides)
  • Edward Capriolo (media6degrees): Real World Capacity Planning: Cassandra on Blades and Big Iron (Slides)
  • Eric Onnen (Urban Airship): From 100s to 100’s of Millions (Slides)

Voldemort V0.9 Released: NIO, Pipelined FSM, Hinted Handoff

Filed under: Key-Value Stores,NoSQL,Voldemort — Patrick Durusau @ 12:54 pm

Voldemort V0.9 Released: NIO, Pipelined FSM, Hinted Handoff

From Alex Popescu’s myNoSQL, links to commentary on the latest release.

See also: http://project-voldemort.com/

July 19, 2011

Overview: Visualization to Connect the Dots

Filed under: Analytics,Java,Scala,Visualization — Patrick Durusau @ 7:54 pm

Overview is Hiring!

I don’t think I have ever re-posted a job ad but this one merits wide distribution:

We need two Java or Scala ninjas to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

From the about page:

Overview is an open-source tool to help journalists find stories in large amounts of data, by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.

There are good tools for searching within large document sets for names and keywords, but that doesn’t help find stories we’re not looking for. Overview will display relationships among topics, people, places and dates to help journalists to answer the question, “What’s in there?”

We’re building an interactive system where computers do the visualization, while a human guides the exploration. We will also produce documentation and training to help people learn how to use this system. The goal is to make this capability available to anyone who needs it.

Overview is a project of The Associated Press, supported by the John S. and James L. Knight Foundation as part of its Knight News Challenge. The Associated Press invests its resources to advance the news industry, delivering fast, unbiased news from every corner of the world to all media platforms and formats. The Knight News Challenge is an international contest to fund digital news experiments that use technology to inform and engage communities.

Sounds like a project that is worth supporting to me!

Analytics are great, but subject identity would be more useful.

Apply if you have the skill sets, repost the link, and/or volunteer to carry the good news of topic maps to the project.

Build your own internet search engine

Filed under: Erlang,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:53 pm

Build your own internet search engine by Daniel Himmelein.

Uses Erlang but also surveys the Apache search stack.

Not that you have to roll your own search engine, but it will give you a different appreciation for the issues they face.


Update: Build your own internet search engine – Part 2

I ran across part 2 while cleaning up at year’s end. Enjoy!
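Whatever language you pick, the heart of the exercise is the inverted index: a map from each term to the documents containing it. A toy version in Python (Himmelein works in Erlang, so this is my own sketch, not his code):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: return documents containing every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {1: "topic maps and semantic diversity",
        2: "build your own search engine",
        3: "semantic search with topic maps"}
index = build_index(docs)
print(search(index, "topic maps"))
```

Everything hard about real engines (tokenization, stemming, ranking, crawling, scale) starts from this data structure, which is where the appreciation for the issues they face comes from.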

Building your own Facebook Realtime Analytics System

Filed under: Analytics — Patrick Durusau @ 7:52 pm

Building your own Facebook Realtime Analytics System

Much more interesting than most of the content I see on Facebook.

From the post:

Recently, I was reading Todd Hoff’s write-up on Facebook’s real-time analytics system. As usual, Todd did an excellent job in summarizing this video from Facebook Engineering Manager Alex Himel.

In the first post, I’d like to summarize the case study, and consider some things that weren’t mentioned in the summaries. This will lead to an architecture for building your own Realtime Analytics for Big Data that might be easier to implement, using Facebook’s experience as a starting point and guide, as well as the experience gathered through recent work with a few of GigaSpaces’ customers. The second post provides a summary of that new approach as well as a pattern and a demo for building your own Real Time Analytics system.

Excel DataScope

Filed under: Algorithms,Cloud Computing,Excel Datascope,Hadoop — Patrick Durusau @ 7:51 pm

Excel DataScope

From the webpage:

From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.

And:

Excel DataScope is a technology ramp between Excel on the user’s client machine, the resources that are available in the cloud, and a new class of analytics algorithms that are being implemented in the cloud. An Excel user can simply select an analytics algorithm from the Excel DataScope Research Ribbon without concern for how to move their data to the cloud, how to start up virtual machines in the cloud, or how to scale out the execution of their selected algorithm in the cloud. They simply focus on exploring their data by using a familiar client application.

Excel DataScope is an ongoing research and development project. We envision a future in which a model developer can publish their latest data analysis algorithm or machine learning model to the cloud and within minutes Excel users around the world can discover it within their Excel Research Ribbon and begin using it to explore their data collection. (emphasis added)

I added emphasis to the last sentence because that is the sort of convenience that will make cloud computing and collaboration meaningful.

Imagine that sort of sharing across MS and non-MS cloud resources. Well, you would have to have an Excel DataScope interface on non-MS cloud resources, but one hopes that will be a product offering in the near future.

July 18, 2011

…Neo4j is 377 times faster than MySQL

Filed under: Graphs,MySQL,Neo4j — Patrick Durusau @ 6:46 pm

Time lines and news streams: Neo4j is 377 times faster than MySQL by René Pickhardt.

From the post:

Over the last weeks I did some more work on Neo4j, and I am ready to present some more results on speed (in my use case Neo4j outperformed MySQL by a factor of 377! That is more than two orders of magnitude). As you may know, one part of my PhD thesis is to create a social newsstream application around my social networking site metalcon.de. It is very obvious that a graph structure is natural for social newsstreams: you go to a user, traverse to all his friends or objects of interest, and then traverse one step deeper to the newly created content items. A problem with this kind of application is sorting the content items by time or relevance. But before I discuss those problems, I just want to present another comparison between MySQL and Neo4j.

It is as exciting as the title implies.
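The two-hop traversal Pickhardt describes — go to a user, traverse to friends, traverse again to their content, then sort by time — can be sketched with plain dictionaries. This is only an illustration of the access pattern (all names and data below are made up, not his actual schema); the point of a graph database like Neo4j is that these hops are cheap pointer-chasing rather than the joins a relational store would need.

```python
from datetime import datetime

# Toy social graph: user -> friends, and user -> content items (timestamp, text).
# All data here is hypothetical, purely to show the traversal shape.
friends = {"alice": ["bob", "carol"]}
posts = {
    "bob":   [(datetime(2011, 7, 10), "New album review")],
    "carol": [(datetime(2011, 7, 12), "Concert photos"),
              (datetime(2011, 7, 1),  "Forum thread")],
}

def newsstream(user, limit=10):
    """Traverse user -> friends -> content items, then sort newest-first."""
    items = []
    for friend in friends.get(user, []):
        for ts, text in posts.get(friend, []):
            items.append((ts, friend, text))
    items.sort(reverse=True)  # the time-sorting step the post flags as the hard part
    return items[:limit]

print(newsstream("alice"))
```

Sorting (or ranking by relevance) is the step that does not parallelize away with the traversal, which is exactly the problem the post says it will return to.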

Microsoft Research Releases Another Hadoop Alternative for Azure

Filed under: Daytona,Hadoop — Patrick Durusau @ 6:45 pm

Microsoft Research Releases Another Hadoop Alternative for Azure

From the post:

Today Microsoft Research announced the availability of a free technology preview of Project Daytona MapReduce Runtime for Windows Azure. Using a set of tools for working with big data based on Google’s MapReduce paper, it provides an alternative to Apache Hadoop.

Daytona was created by the eXtreme Computing Group at Microsoft Research. It’s designed to help scientists take advantage of Azure for working with large, unstructured data sets. Daytona is also being used to power a data-analytics-as-a-service offering the team calls Excel DataScope.
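The programming model that Daytona shares with Hadoop — from Google’s MapReduce paper — is small enough to sketch in a few lines. This is a single-machine sketch of the model itself, not Daytona’s or Hadoop’s actual API:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal sketch of the MapReduce model: map each record to
    (key, value) pairs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
lines = ["big data big clusters", "big data"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

What frameworks like Daytona and Hadoop add is everything around this loop: partitioning the records across machines, shuffling the grouped values to reducers, and recovering from failures.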

Excellent coverage of this latest release along with information about related software from Microsoft.

I don’t think anyone disputes that Hadoop is difficult to use effectively, so why not offer an MS product that makes Apache Hadoop easier to use? Even with all the consumer software skills at Microsoft it would be a challenge, but one that Microsoft is the most likely candidate to overcome.

And that would give Microsoft a window (sorry) into non-Azure environments as well as an opportunity to promote an Excel-like interface. (Hard to argue against the familiar.)

We are going to reach the future of computing more quickly the fewer times we stop to build product silos.

Products yes, product silos, no.

The Future of Hadoop in Bioinformatics

Filed under: BigData,Bioinformatics,Hadoop,Heterogeneous Data — Patrick Durusau @ 6:44 pm

The Future of Hadoop in Bioinformatics: Hadoop and its ecosystem including MapReduce are the dominant open source Big Data solution by Bob Gourley.

From the post:

Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics“, on what’s changed in the half-year since its publication and what’s to come.

As Dr. Taylor expected, Hadoop and its “ecosystem,” including MapReduce, are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log n; assuming reads that are 100 base pairs in length and a human genome of 3 billion base pairs, analyzing an entire human genome will take roughly 7.5 times longer than if it scaled linearly. By dividing the task up across a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community, makes Hadoop and related software the parallelization solution of choice in next generation sequencing.

In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months this will change, due to the following trends:
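The 7.5× figure in the quote is easy to check: a 3-billion-base genome at 100 bases per read gives n = 3 × 10⁷ reads, and the overhead of n log n work relative to linear work is just log n — about 7.5 with a base-10 logarithm, which is evidently what the quote assumes:

```python
import math

genome_bases = 3_000_000_000      # human genome, ~3 billion base pairs (from the quote)
read_length = 100                 # bases per read (from the quote)
n = genome_bases // read_length   # 30 million reads

# n log n work divided by linear n work is just log n;
# the quote's 7.5x figure matches a base-10 logarithm.
overhead = math.log10(n)
print(f"{n:,} reads -> n log n is ~{overhead:.1f}x the linear cost")
```

With a base-2 or natural logarithm the factor would be larger (about 25 or 17), so the exact multiple depends on constants the quote glosses over; the shape of the argument is what matters.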

So, over the next year to eighteen months, what do you see as the evolution of topic map software and services?

Or what problems do you see becoming apparent in bioinformatics or other areas (like the Department of Energy’s knowledgebase) that will require topic maps?

(More on the DOE project later this week.)

IBM Targets the Future of Social Media Analytics

Filed under: Analytics,Hadoop — Patrick Durusau @ 6:42 pm

IBM Targets the Future of Social Media Analytics

This is from back in April 2011, but I thought it was worthy of a note. The post reads in part:

The new product, called Cognos Consumer Insight, is built upon IBM’s Cognos business intelligence technology along with Hadoop to process the piles of unstructured social media data. According to Deepak Advani, IBM’s VP of predictive analytics, there’s a lot of value in performing text analytics on data derived from Twitter, Facebook and other social forums to determine how companies or their products are faring among consumers. Cognos lets customers view sentiment levels over time to determine how efforts are working, he added, and skilled analysts can augment their Cognos Consumer Insight usage with IBM’s SPSS product to bring predictive analytics into the mix.

The partnership with Yale is designed to address the current dearth of analytic skills among business leaders, Advani said. Although the program will involve training on analytics technologies, Advani explained that business people still need some grounding in analytic theory and thinking rather than just knowing how to use a particular piece of software. “I think the primary goal is for students to learn analytically,” he said, which will help them know which technology to put to work on what data, and how.

Within many organizations, he added, the main problem is that they’re not using analytics at the point of decision or across all their business processes. Advani says partnerships like those with Yale will help instill the thought process of using mathematical algorithms instead of gut feelings.

I was with them up to the point that it says: “….instill the thought process of using mathematical algorithms instead of gut feelings.”

I don’t take “analytical thinking” to be limited to mathematical algorithms.

Moreover, we have been down this road before, when Jack Kennedy was president and Robert McNamara was Secretary of Defense. Operations analysis, they called it back then. Analysts thought they could determine, mathematically, how much equipment was needed at any particular location without asking for local “gut” opinions about it. True, some bases don’t need snow plows every year, but when planes are trying to land, plows are very nice to have.

If you object that this is an abuse of operations theory, I would have to concede you are correct; but abused it was, on a regular basis.

I suspect the program will be a very good one, along with the software. My only caution concerns any analytical technique that gives an answer at variance with years of experience in a trade. That is at least a reason to pause and ask why.
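The sentiment-over-time view the quoted post describes is, at its core, time-bucketed averaging of per-message sentiment scores. The sketch below shows that core idea with made-up data; it is not Cognos Consumer Insight’s actual API, and the scoring of individual messages (the hard text-analytics part) is assumed to have already happened.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, sentiment score in [-1, 1]) pairs,
# e.g. from already-scored tweets or forum posts.
scored = [
    (date(2011, 4, 4),  0.6),
    (date(2011, 4, 5), -0.2),
    (date(2011, 4, 12), 0.8),
    (date(2011, 4, 14), 0.4),
]

def sentiment_by_week(items):
    """Bucket scores by ISO week number and average each bucket."""
    buckets = defaultdict(list)
    for day, score in items:
        buckets[day.isocalendar()[1]].append(score)
    return {week: sum(s) / len(s) for week, s in sorted(buckets.items())}

print(sentiment_by_week(scored))
```

A rising weekly average after a campaign launch is the kind of signal the post says analysts would then feed into predictive models.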

The Joy of Erlang; Or, How To Ride A Toruk

Filed under: Erlang,Marketing,Topic Maps — Patrick Durusau @ 6:42 pm

The Joy of Erlang; Or, How To Ride A Toruk by Evan Miller.

From the post:

In the movie Avatar, there’s this big badass bird-brained pterodactyl thing called a Toruk that the main character must learn to ride in order to regain the trust of the blue people. As a general rule, Toruks do not like to be ridden, but if you fight one, subdue it, and then link your Blue Man ponytail to the Toruk’s ptero-tail, you get to own the thing for life. Owning a Toruk is awesome; it’s like owning a flying car you can control with your mind, which comes in handy when battling large chemical companies, impressing future colleagues, or delivering a pizza. But learning to ride a Toruk is dangerous, and very few people succeed.

I like to think of the Erlang programming language as a Toruk. Most people are frightened of Erlang. Legends of its abilities abound. In order to master it, you have to fight it, subdue it, and (finally) link your mind to it. But assuming you survive, you then get to control the world’s most advanced server platform, usually without even having to think. And let me tell you: riding a Toruk is great fun.

This guide is designed to teach you the Erlang state of mind, so that you are not afraid to go off and commandeer a Toruk of your own. I am going to introduce only a handful of Erlang language features, but we’re going to use them to solve a host of practical problems. The purpose is to give you the desire and confidence to go out and master the rest of the language yourself.

You are welcome to type the examples into your own Erlang shell and play around with them, but examples are foremost designed to be read. I recommend printing this document out and perusing it in a comfortable chair, away from email, compilers, 3-D movies, and other distractions.

Do you think people view topic maps as a Toruk?

How would you train them to ride rather than be eaten?

Introduction to Logic Programming with Clojure

Filed under: Clojure,Logic — Patrick Durusau @ 6:41 pm

Introduction to Logic Programming with Clojure

From the post:

How to use this Tutorial

This tutorial is meant to be used with a Clojure REPL handy. An example project has been set up.

You should be able to run all code examples in the logic-introduction.core namespace.

It won’t hurt you and could prove useful when having a go at logicians.

