Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 9, 2013

Scrapely

Filed under: Data Mining,Python — Patrick Durusau @ 9:36 am

Scrapely

From the webpage:

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

A tool for data mining similar HTML pages.

Supports a command line interface.
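If you want a quick feel for the library, the snippet below follows the train/scrape workflow documented in the Scrapely README; the URLs and field names are hypothetical placeholders.

  # Minimal Scrapely sketch: teach it one example page, then apply the
  # learned template to a similar page. URLs and fields are hypothetical.
  from scrapely import Scraper

  scraper = Scraper()

  # One example page plus the data it should extract.
  scraper.train('http://example.com/products/1',
                {'name': 'Widget', 'price': '9.99'})

  # A structurally similar page; returns the extracted records.
  print(scraper.scrape('http://example.com/products/2'))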

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 9:27 am

Apache Hadoop Patterns of Use: Refine, Enrich and Explore by Jim Walter.

From the post:

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?” Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic. The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Enrich: Collect data, analyze and present salient results for online apps.
Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

If you are looking for detailed patterns of use, you will be disappointed.

Runs about nine (9) pages in very high level summary mode.

What remains to be written (to my knowledge) is a collection of use patterns with a realistic amount of detail from a cross-section of Hadoop users.

That would truly be a compelling resource for the community.

One Hour Hadoop Cluster

Filed under: Ambari,Hadoop,Virtual Machines — Patrick Durusau @ 5:02 am

How to setup a Hadoop cluster in one hour using Ambari?

A guide to setting up a 3-node Hadoop cluster using Oracle’s VirtualBox and Apache Ambari.

HPC may not be the key to semantics but it can still be useful. 😉

High-Performance and Parallel Computing with R

Filed under: HPC,R — Patrick Durusau @ 4:48 am

High-Performance and Parallel Computing with R by Dirk Eddelbuettel.

From the webpage:

This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing (HPC) with R. In this context, we are defining ‘high-performance computing’ rather loosely as just about anything related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling.

Here you will find R packages for:

  • Explicit parallelism
  • Implicit parallelism
  • Grid computing
  • Hadoop
  • Random numbers
  • Resource managers and batch schedulers
  • Applications
  • GPUs
  • Large memory and out-of-memory data
  • Easier interfaces for Compiled code
  • Profiling tools

Despite HPC advances over the last decade, semantics remain an unsolved problem.

Perhaps raw computational capacity isn’t the key to semantics.

If not, some different approach awaits to be discovered.

I first saw this in a tweet by One R Tip a Day.

April 8, 2013

Knight News Challenge – 40 Finalists

Filed under: Challenges,Government,News — Patrick Durusau @ 7:09 pm

Knight News Challenge – 40 Finalists

There are 78 days (as of today) before the evaluation of the forty (40) finalists in the Knight News Challenge closes.

You will need to review better than one every two (2) days in order to see all of them before then.

Worthwhile because:

  • Your comments may help improve a project.
  • Your comments may assist in evaluation of a project.
  • You may get some great ideas for another project.
  • You may see ways to incorporate topic maps in one or more projects. (or not)

It is important to learn to contribute to projects that are not your own and may not be your top choice.

You may discover ideas, techniques and even people who you would otherwise miss.

Splitting a Large CSV File into…

Filed under: CSV,Government Data — Patrick Durusau @ 4:46 pm

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column by Tony Hirst.

From the post:

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio
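Tony’s post does this with an R script; for readers who live in Python, here is a rough equivalent that splits a large CSV into one file per value of a chosen column. The file name and column name are hypothetical stand-ins for the IPSA data.

  import csv

  def split_csv(path, column, out_prefix):
      """Write one CSV per distinct value found in `column`."""
      writers = {}  # column value -> (file handle, DictWriter)
      with open(path, newline="") as f:
          reader = csv.DictReader(f)
          for row in reader:
              key = row[column]
              if key not in writers:
                  safe = "".join(c if c.isalnum() else "_" for c in key)
                  handle = open(f"{out_prefix}_{safe}.csv", "w", newline="")
                  writer = csv.DictWriter(handle, fieldnames=reader.fieldnames)
                  writer.writeheader()
                  writers[key] = (handle, writer)
              writers[key][1].writerow(row)
      # One open handle per distinct value; fine for ~650 MPs.
      for handle, _ in writers.values():
          handle.close()

  # Hypothetical file and column names; one output file per MP.
  split_csv("expenses.csv", "MP Name", "expenses")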

Just because data is “open” doesn’t mean it will be easy to use. (Leaving the question of usefulness to one side.)

We have been kicking around ideas for a “killer” topic map application.

What about a plug-in for a browser that recognizes file types and suggests tools for processing them?

I am unlikely to remember this post a year from now when I have a CSV file from some site.

But if a browser plugin recognized the extension, .csv, and suggested a list of tools for exploring it….

Particularly if the plug-in called upon a maintained site of tools, so the list of tools stays current.

Or, for that matter, if it pointed to other data explorers who have examined the same file (voluntary disclosure).

Not the full monty of topic maps but a start towards collectively enhancing our experience with data files.

Beginners Guide To Enhancing Solr/Lucene Search…

Filed under: Lucene,Mahout,Solr — Patrick Durusau @ 4:33 pm

Beginners Guide To Enhancing Solr/Lucene Search With Mahout’s Machine Learning by Doug Turnbull.

From the post:

Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing — calculating and exploiting the semantic relationships between keywords. While we were there, I realized, a lot of people could benefit from a bigger picture, less in-depth, point of view outside of our specific story. In general where do Mahout and Solr fit together? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I’d blog about how you too can start to put these pieces together to simultaneously exploit Solr’s search and Mahout’s machine learning capabilities.

The root of how this all works is with a slightly obscure feature of Lucene based search — Term Vectors. Lucene based search applications give you the ability to generate term vectors from documents in the search index. Its a feature often turned on for specific search features, but other than that can appear to be a weird opaque feature to beginners. What is a term vector, you might ask? And why would you want to get one?

You know my misgivings about metric approaches to non-metric data (such as semantics) but there is no denying that Latent Semantic Indexing can be useful.

Think of Latent Semantic Indexing as a useful tool.

A saw is a tool too but not every cut made with a saw is a correct one.

Yes?
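For a concrete sense of what latent semantic indexing does with term vectors, here is a toy Python sketch using a plain truncated SVD; it illustrates the idea only and is not the Mahout/Solr pipeline Doug describes.

  # Toy LSI: build term vectors, reduce them with an SVD, compare documents
  # in the reduced "semantic" space. Illustration only, not Mahout/Solr.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD
  from sklearn.metrics.pairwise import cosine_similarity

  docs = [
      "solr search indexing with lucene",
      "lucene term vectors drive search",
      "machine learning with mahout",
      "mahout clustering and recommendation",
  ]

  term_vectors = TfidfVectorizer().fit_transform(docs)
  latent = TruncatedSVD(n_components=2).fit_transform(term_vectors)

  # Documents sharing a latent topic land close together, even when they
  # share few literal keywords.
  print(cosine_similarity(latent))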

Neo4j 2.0.0-M01

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:57 pm

Nodes are people, too by Philip Rathle.

From the post:

Today we are releasing Milestone Release Neo4j 2.0.0-M01 of the Neo4j 2.0 series which we expect to be generally available (GA) in the next couple months. This release is significant in that it is the first time since the inception of Neo4j thirteen years ago that we are making a change to the property graph model. Specifically, we will be adding a new construct: labels.

We’ve completed a first cut at a significant addition to the data model: one that we believe nearly every graph will benefit from. Because this is a major change, it merits feedback, and we are opening the code up now for early comment. Please therefore consider 2.0 to be an experimental release. This first milestone is intended to solicit your input. In addition to the new technology being work-in-progress, some of the new terminology is also work-in-progress. We look forward to making 2.0 a better release together, with your feedback. Please tell us how you’d like to use these changes. We can’t wait to hear what you think.

Read the post for the introduction to “labels” for nodes.

Suggest you run 1.8/1.9 alongside the experimental release 2.0.0-M01.

Something about the modeling of person by adding a property to the node strikes me as odd.

Rather than creating a “person” node with an edge to the original node.

Can’t put my finger on it but will be playing with it this week.

What other features would you like to see?

I’m thinking scope on properties would be high on my list.

A Tour through the Visualization Zoo

Filed under: Graphics,Visualization — Patrick Durusau @ 3:34 pm

A Tour through the Visualization Zoo by Jeffrey Heer, Michael Bostock, Vadim Ogievetsky.

From the article:

Thanks to advances in sensing, networking, and data management, our society is producing digital information at an astonishing rate. According to one estimate, in 2010 alone we will generate 1,200 exabytes—60 million times the content of the Library of Congress. Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives. To put the information to good use, we must find ways to explore, relate, and communicate the data meaningfully.

The goal of visualization is to aid our understanding of data by leveraging the human visual system’s highly tuned ability to see patterns, spot trends, and identify outliers. Well-designed visual representations can replace cognitive calculations with simple perceptual inferences and improve comprehension, memory, and decision making. By making data more accessible and appealing, visual representations may also help engage more diverse audiences in exploration and analysis. The challenge is to create effective and engaging visualizations that are appropriate to the data.

Creating a visualization requires a number of nuanced judgments. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color. The challenge is that for any given data set the number of visual encodings—and thus the space of possible visualization designs—is extremely large. To guide this process, computer scientists, psychologists, and statisticians have studied how well different encodings facilitate the comprehension of data types such as numbers, categories, and networks. For example, graphical perception experiments find that spatial position (as in a scatter plot or bar chart) leads to the most accurate decoding of numerical data and is generally preferable to visual variables such as angle, one-dimensional length, two-dimensional area, three-dimensional volume, and color saturation. Thus, it should be no surprise that the most common data graphics, including bar charts, line charts, and scatter plots, use position encodings. Our understanding of graphical perception remains incomplete, however, and must appropriately be balanced with interaction design and aesthetics.

This article provides a brief tour through the “visualization zoo,” showcasing techniques for visualizing and interacting with diverse data sets. In many situations, simple data graphics will not only suffice, they may also be preferable. Here we focus on a few of the more sophisticated and unusual techniques that deal with complex data sets. After all, you don’t go to the zoo to see Chihuahuas and raccoons; you go to admire the majestic polar bear, the graceful zebra, and the terrifying Sumatran tiger. Analogously, we cover some of the more exotic (but practically useful!) forms of visual data representation, starting with one of the most common, time-series data; continuing on to statistical data and maps; and then completing the tour with hierarchies and networks. Along the way, bear in mind that all visualizations share a common “DNA”—a set of mappings between data properties and visual attributes such as position, size, shape, and color—and that customized species of visualization might always be constructed by varying these encodings.

Most of the visualizations shown here are accompanied by interactive examples. The live examples were created using Protovis, an open source language for Web-based data visualization. To learn more about how a visualization was made (or to copy and paste it for your own use), simply “View Source” on the page. All example source code is released into the public domain and has no restrictions on reuse or modification. Note, however, that these examples will work only on a modern, standards-compliant browser supporting SVG (scalable vector graphics ). Supported browsers include recent versions of Firefox, Safari, Chrome, and Opera. Unfortunately, Internet Explorer 8 and earlier versions do not support SVG and so cannot be used to view the interactive examples.
….

Somewhat dated but still a useful overview of visualization.

I first saw this at A Tour Through the Visualization Zoo by Alex Popescu.

Permission Resolution With Neo4j — Part 3

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:22 pm

Permission Resolution With Neo4j — Part 3 by Max De Marzi.

From the post:

Let’s add a couple of performance tests to the mix. We learned about Gatling in a previous blog post, we’re going to use it here again. The first test will randomly choose users and documents (from the graph we created in part 2) and write the results to a file, the second test will re-use the results of the first one and run consistently so we can change hardware, change Neo4j parameters, tune the JVM, etc. and see how they affect our performance.

Interesting post on testing a graph database to the point of:

How well (lousy) would this fare on a relational database?

A 10 Million row table joined to itself 100 times…omgwtfbbq.

Maybe we should ask Facebook?

Facebook releases Linkbench MySQL benchmark

In case you don’t know, MySQL is a relational database.

Last I heard the Facebook graph is one (1) billion users+.

Relational database technology is managing their graph fairly well.

What do you think?

Permission Resolution with Neo4j — Part 2

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:07 pm

Permission Resolution with Neo4j — Part 2 by Max De Marzi.

From the post:

Let’s try tackling something a little bigger. In Part 1 we created a small graph to test our permission resolution graph algorithm and it worked like a charm on our dozen or so nodes and edges. I don’t have fast hands, so instead of typing out a million node graph, we’ll build a graph generator and use the batch importer to load it into Neo4j. What I want to create is a set of files to feed to the batch-importer.

Nice walk through on generating graphs and importing them into Neo4j.

Curious, have you encountered a real world graph with only three (3) relationship types?

How do you think a higher number of relationship types would impact performance?
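If you want to experiment with that question, here is a rough Python sketch of the kind of generator the post builds: random users, groups and documents with four relationship types, written out as tab-separated node and relationship files. The column headers and positional node ids are my assumptions; check them against the batch-importer documentation before loading.

  import csv
  import random

  NUM_USERS, NUM_GROUPS, NUM_DOCS = 1000, 50, 5000
  REL_TYPES = ["MEMBER_OF", "OWNS", "CAN_READ", "CAN_WRITE"]  # more than three

  # Node ids are assumed positional: users first, then groups, then documents.
  with open("nodes.csv", "w", newline="") as f:
      w = csv.writer(f, delimiter="\t")
      w.writerow(["name", "kind"])
      for i in range(NUM_USERS):
          w.writerow([f"user{i}", "user"])
      for i in range(NUM_GROUPS):
          w.writerow([f"group{i}", "group"])
      for i in range(NUM_DOCS):
          w.writerow([f"doc{i}", "document"])

  with open("rels.csv", "w", newline="") as f:
      w = csv.writer(f, delimiter="\t")
      w.writerow(["start", "end", "type"])
      for user in range(NUM_USERS):
          w.writerow([user, NUM_USERS + random.randrange(NUM_GROUPS), "MEMBER_OF"])
      for doc in range(NUM_DOCS):
          group = NUM_USERS + random.randrange(NUM_GROUPS)
          doc_id = NUM_USERS + NUM_GROUPS + doc
          w.writerow([group, doc_id, random.choice(REL_TYPES[1:])])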

WTF: [footnote 1]

Filed under: Cassovary,Graphs,Hadoop,Recommendation — Patrick Durusau @ 1:52 pm

WTF: The Who to Follow Service at Twitter by Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh.

Abstract:

WTF (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.

You know it is going to be an amusing paper when footnote 1 reads:

The confusion with the more conventional expansion of the acronym is intentional and the butt of many internal jokes. Also, it has not escaped our attention that the name of the service is actually ungrammatical; the pronoun should properly be in the objective case, as in “whom to follow”.

😉

Algorithmic recommendations may miss the mark for an end user.

On the other hand, what about an authoring interface that supplies recommendations of associations and other subjects?
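To get a feel for the graph-walk style of recommendation the paper describes, here is a toy personalized random walk with restart in Python. It is a cartoon of the idea, not Twitter’s Cassovary/SALSA implementation; the graph and parameters are made up.

  # Toy personalized random walk with restart over a tiny "follows" graph.
  import random
  from collections import Counter

  follows = {
      "alice": ["bob", "carol"],
      "bob":   ["carol", "dave"],
      "carol": ["dave"],
      "dave":  ["erin"],
      "erin":  ["bob"],
  }

  def recommend(user, steps=100_000, restart=0.15, seed=42):
      random.seed(seed)
      visits = Counter()
      current = user
      for _ in range(steps):
          if random.random() < restart or not follows.get(current):
              current = user                      # jump back to the seed user
          else:
              current = random.choice(follows[current])
          visits[current] += 1
      already = set(follows[user]) | {user}
      return [(node, count) for node, count in visits.most_common()
              if node not in already]

  print(recommend("alice"))   # accounts alice might want to follow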

A paper definitely worth a slow read!

I first saw this at: WTF: The Who to Follow Service at Twitter (Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh).

HDFS File Operations Made Easy with Hue (demo)

Filed under: Hadoop,HDFS,Hue — Patrick Durusau @ 1:33 pm

HDFS File Operations Made Easy with Hue by Romain Rigaux.

From the post:

Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.

The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.

Very nice 2:18 video.

Brings the usual graphical file interface to Hadoop (no small feat) but reminds me of every other graphical file interface.

To step beyond the common graphical file interface, why not add:

  • Links to scripts that call a file
  • File ownership – show all files owned by a user
  • Navigation of files by content type(s)
  • Grouping of files by common scripts
  • Navigation of files by content
  • Grouping of files by script owners calling the files

These are just a few of the possibilities that come to mind.

I would make the roles in those relationships explicit but that is probably my topic map background showing through.

Spring/Summer Reading – 2013

Filed under: Books,CS Lectures — Patrick Durusau @ 1:05 pm

The ACM has released:

Best Reviews (2012)

and,

Notable Computing Books and Articles of 2012

Before you hit the summer conference or vacation schedule, visit your local bookstore or load up your ebook reader!

I first saw this at Best Reviews & Notable Books and Articles of 2012 by Shar Steed.

April 7, 2013

RSSOwl and Feed Validation

Filed under: RSS,XML — Patrick Durusau @ 6:17 pm

I rather hate to end the day on a practical note, ;-), but after going off Google Reader, I started using RSSOwl.

I have been adding feeds to RSSOwl but there were two that simply refused to load.

Feed Validator reported the feed was:

not well-formed (invalid token)

with a pointer to the letter “f” in the word “find.”

Helpful but not a bunch.

Captured the feed as XML and loaded it into oXygen.

A form feed character was immediately in front of the “f” in “find” but of course was not displaying.

The culprit in one case was a form feed character, 0x0C, and in the other an end-of-text character, 0x03.

ASCII characters 0 through 31 are non-printing control characters known as the C0 controls (DEL, 127, is also non-printing).

Of the C0 control characters, only carriage return (0x0D), linefeed (0x0A) and horizontal tab (0x09) may appear in an XML document.

For loading and parsing RSS feeds into a topic map, you may want to filter out the C0 controls that are not allowed in XML, along the lines of the sketch below.
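A minimal Python filter along those lines, assuming the feed has already been fetched to disk; the file name is hypothetical.

  # Strip the C0 controls that XML 1.0 does not allow, keeping tab,
  # linefeed and carriage return.
  import re

  ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

  def clean_feed(raw: str) -> str:
      return ILLEGAL_C0.sub("", raw)

  with open("feed.xml", encoding="utf-8", errors="replace") as f:  # hypothetical file
      cleaned = clean_feed(f.read())
  # `cleaned` can now be handed to your XML/RSS parser.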

PS: I suspect in both cases the control characters were introduced by copy-n-paste operations.

Open PHACTS

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. 😉

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

Visuals as Starting Point for Analysis

Filed under: Astroinformatics,BigData — Patrick Durusau @ 4:02 pm

In Stat models to solve astronomical mysteries – application to business data, Mirko Krivanek uses an image of the Pleiades to argue that big data analysis could profit from signal amplification.

In astronomy, multiple measures are taken and then combined to amplify a weak signal.
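A back-of-the-envelope NumPy illustration of that stacking idea: averaging repeated noisy measurements of the same weak signal shrinks the noise roughly with the square root of the number of measurements. The numbers are invented.

  # Stacking demo: averaging N noisy measurements of a weak signal
  # reduces the noise roughly by sqrt(N).
  import numpy as np

  rng = np.random.default_rng(0)
  true_signal = 0.5                      # weak signal buried in noise
  noise_sigma = 5.0

  for n in (1, 16, 256, 4096):
      measurements = true_signal + rng.normal(0, noise_sigma, size=n)
      estimate = measurements.mean()
      print(f"n={n:5d}  estimate={estimate:6.2f}  "
            f"expected error ~ {noise_sigma / np.sqrt(n):.2f}")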

I suspect the signal in astronomy is easier to separate from the noise than in big data.

Perhaps not.

It is certainly an idea that bears watching.

Not to mention documenting any amplification in your analysis.

Like “boosting” a search term, what if the basis for amplification is a computational artifact?

Phoenix in 15 Minutes or Less

Filed under: HBase,Phoenix,SQL — Patrick Durusau @ 3:50 pm

Phoenix in 15 Minutes or Less by Justin Kestelyn.

An amusing FAQ by “James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase.”

From the post:

What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves as good or likely better performance than if you hand-coded it yourself (not to mention with a heck of a lot less code) by:

  • compiling your SQL queries to native HBase scans
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by
    • pushing the predicates in your where clause to a server-side filter
    • executing aggregate queries through server-side hooks (called co-processors)

In addition to these items, we’ve got some interesting enhancements in the works to further optimize performance:

  • secondary indexes to improve performance for queries on non row key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load

…..

Sounds authentic to me!

You?

Big Data Is Not the New Oil

Filed under: BigData,Data Analysis — Patrick Durusau @ 3:05 pm

Big Data Is Not the New Oil by Jer Thorp.

From the post:

Every 14 minutes, somewhere in the world, an ad exec strides on stage with the same breathless declaration:

“Data is the new oil!”

It’s exciting stuff for marketing types, and it’s an easy equation: big data equals big oil, equals big profits. It must be a helpful metaphor to frame something that is not very well understood; I’ve heard it over and over and over again in the last two years.

The comparison, at the level it’s usually made, is vapid. Information is the ultimate renewable resource. Any kind of data reserve that exists has not been lying in wait beneath the surface; data are being created, in vast quantities, every day. Finding value from data is much more a process of cultivation than it is one of extraction or refinement.

Jer’s last point, “more a process of cultivation than it is one of extraction or refinement,” and his last recommendation:

…we need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.

resonate the most with me.

Everyone can apply the same processes to oil and get out largely the same results.

Data on the other hand, cannot be processed or analyzed until some user assigns it values.

Data and the results of analysis of data, have value only because of the assignment of meaning by some user.

Assignment of meaning is fraught with peril, as we saw in K-Nearest Neighbors: dangerously simple.

You can turn the crank on big data, but the results will disappoint unless there is an understanding of the data.

I first saw this at: Big Data Is Not the New Oil

The Pragmatic Haskeller – Episode 1

Filed under: Functional Programming,Haskell,JSON — Patrick Durusau @ 2:40 pm

The Pragmatic Haskeller – Episode 1 by Alfredo Di Napoli.

The first episode of “The Pragmatic Haskeller” starts with:

In the beginning was XML, and then JSON.

When I read that sort of thing, it is hard to know whether to weep or pitch a fit.

Neither one is terribly productive but if you are interested in the rich heritage that XML relies upon drop me a line.

The first lesson is a flying start on Haskell data and moving it between JSON and XML formats.

imMens: Real-time Visual Querying of Big Data

Filed under: BigData,Query Language,Visualization — Patrick Durusau @ 2:00 pm

imMens: Real-time Visual Querying of Big Data by Zhicheng Liu, Biye Jiang and Jeffrey Heer.

Abstract:

Data analysts must make sense of increasingly large data sets, sometimes with billions or more records. We present methods for interactive visualization of big data, following the principle that perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records. We first describe a design space of scalable visual summaries that use data reduction methods (such as binned aggregation or sampling) to visualize a variety of data types. We then contribute methods for interactive querying (e.g., brushing & linking) among binned plots through a combination of multivariate data tiles and parallel query processing. We implement our techniques in imMens, a browser-based visual analysis system that uses WebGL for data processing and rendering on the GPU. In benchmarks imMens sustains 50 frames-per-second brushing & linking among dozens of visualizations, with invariant performance on data sizes ranging from thousands to billions of records.

Code is available at: https://github.com/StanfordHCI/imMens
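As a toy illustration of the binned-aggregation principle (not the imMens WebGL implementation), the summary below has a fixed size set by the bin grid, no matter how many records go in.

  # Binned aggregation: the visual summary is a 50 x 50 grid of counts,
  # so its size is set by the chosen resolution, not the record count.
  import numpy as np

  rng = np.random.default_rng(1)
  n_records = 10_000_000          # could be far larger in principle
  x = rng.normal(0, 1, n_records)
  y = 0.5 * x + rng.normal(0, 1, n_records)

  counts, xedges, yedges = np.histogram2d(x, y, bins=50)
  print(counts.shape)             # (50, 50) regardless of n_records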

The emphasis on “real-time” with “big data” continues.

Impressive work but I wonder if there is a continuum of “big data” for “real-time” access, analysis and/or visualization?

Some types of big data are simple enough for real-time analysis, but other types are less so and there are types of big data where real-time analysis is inappropriate.

What I don’t know is what factors you would evaluate to place one big data set at one point on that continuum and another data set at another. Closer to one end or the other.

Research that you are aware of on the appropriateness of “real-time” analysis of big data?

I first saw this in This Week in Research by Isaac Lopez.

GeoLocation Friends Visualizer

Filed under: Geographic Data,Graphics,Visualization — Patrick Durusau @ 1:37 pm

GeoLocation Friends Visualizer by Marcel Caraciolo.

Slides from a presentation at the XXVI Pernambuco’s Python User Group meeting.

Code at: https://github.com/marcelcaraciolo/Geo-Friendship-Visualization

Just to get you interested, the slides include a social network visualization.

If you had the phone records (cell and land) from elected and appointed government officials, you could begin to build a visualization of the government network.

In terms of an “effective” data leak, it is hard to imagine a better one.
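A sketch of what that might look like with NetworkX, assuming a call-record CSV with caller and callee columns (the file layout is hypothetical):

  # Build a weighted network from call records and rank the best-connected
  # parties with degree centrality. CSV layout is hypothetical.
  import csv
  from collections import Counter

  import networkx as nx

  calls = Counter()
  with open("call_records.csv", newline="") as f:      # hypothetical file
      for row in csv.DictReader(f):                    # columns: caller, callee
          calls[(row["caller"], row["callee"])] += 1

  G = nx.Graph()
  for (a, b), n in calls.items():
      G.add_edge(a, b, weight=n)

  central = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])
  print(central[:10])        # the ten best-connected officials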

Advances in Neural Information Processing Systems (NIPS)

Filed under: Decision Making,Inference,Machine Learning,Neural Networks,Neuroinformatics — Patrick Durusau @ 5:47 am

Advances in Neural Information Processing Systems (NIPS)

From the homepage:

The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

Links to videos from NIPS 2012 meetings are featured on the homepage. The topics are as wide ranging as the foundation’s description.

A tweet from Chris Diehl, wondering what to do with “old hardbound NIPS proceedings (NIPS 11)” led me to: Advances in Neural Information Processing Systems (NIPS) [Online Papers], which has the papers from 1987 to 2012 by volume and a search interface to the same.

Quite a remarkable collection just from a casual skim of some of the volumes.

Unless you need to fill book shelf space, suggest you bookmark the NIPS Online Papers.

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Filed under: Algorithms,GPU,Graphs,Networks — Patrick Durusau @ 5:46 am

Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)

From the post:

Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.

In a development that may surprise some graph software vendors, there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPUs.

This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”
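A plain Python sketch of level-synchronous BFS makes the irregularity visible: the frontier, which is the work a GPU would process in parallel, changes size from level to level and from graph to graph. The tiny graph here is made up.

  # Level-synchronous BFS. On a GPU each frontier would be expanded in
  # parallel; printing the frontier sizes shows how much the available
  # parallelism swings between levels.
  from collections import defaultdict

  edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (6, 7)]
  adj = defaultdict(list)
  for u, v in edges:
      adj[u].append(v)
      adj[v].append(u)

  def bfs_levels(source):
      visited = {source}
      frontier = [source]
      level = 0
      while frontier:
          print(f"level {level}: frontier size {len(frontier)}")
          next_frontier = []
          for u in frontier:            # the loop a GPU would parallelize
              for v in adj[u]:
                  if v not in visited:
                      visited.add(v)
                      next_frontier.append(v)
          frontier = next_frontier
          level += 1

  bfs_levels(0)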

I first saw this in a tweet by Stefano Bertolo.

April 6, 2013

ODBMS.ORG

Filed under: Database,ODBMS — Patrick Durusau @ 6:55 pm

ODBMS.ORG – Object Database Management Systems

From the “about” page:

Launched in 2005, ODBMS.ORG was created to serve faculty and students at educational and research institutions as well as software developers in the open source community or at commercial companies.

It is designed to meet the fast-growing need for resources focusing on Big Data, Analytical data platforms, Scalable Cloud platforms, Object databases, Object-relational bindings, NoSQL databases, Service platforms, and new approaches to concurrency control

This portal features an easy introduction to ODBMSs as well as free software, lecture notes, tutorials, papers and other resources for free download. It is complemented by listings of relevant books and vendors to provide a comprehensive and up-to-date overview of available resources.

The Expert Section contains exclusive contributions from 130+ internationally recognized experts such as Suad Alagic, Scott Ambler, Michael Blaha, Jose Blakeley, Rick Cattell, William Cook, Ted Neward, and Carl Rosenberger.

The ODBMS Industry Watch Blog is part of this portal and contains up to date Information, Trends, and Interviews with industry leaders on Big Data, New Data Stores (NoSQL, NewSQL Databases), New Developments and New Applications for Objects and Databases, New Analytical Data Platforms, Innovation.

The portal’s editor, Roberto V. Zicari, is Professor of Database and Information Systems at Frankfurt University and representative of the Object Management Group (OMG) in Europe. His interest in object databases dates back to his work at the IBM Research Center in Almaden, CA, in the mid ’80s, when he helped craft the definition of an extension of the relational data model to accommodate complex data structures. In 1989, he joined the design team of the Gip Altair project in Paris, later to become O2, one of the world’s first object database products.

All materials and downloads are free and anonymous.

Non-profit ODBMS.ORG is made possible by contributions from ODBMS.ORG’s Panel of Experts,and the support of the sponsors displayed in the right margin of these pages.


The free download page is what first attracted my attention.

By any measure, a remarkable collection of material.

Ironic, isn’t it?

CS needs to develop better access strategies for its own output.

Scientific Computing and Numerical Analysis FAQ

Filed under: Numerical Analysis,Scientific Computing — Patrick Durusau @ 6:37 pm

Scientific Computing and Numerical Analysis FAQ

From the webpage:


Note: portions of this document may be out of date. Search the web for more recent information!

This is a summary of Internet-related resources for a handful of fields related to Scientific Computing, primarily:

  • scientific and engineering numerical computing
  • numerical analysis
  • symbolic algebra
  • statistics
  • operations research

Some parts may be out of date but it makes for an impressive starting place.

I first saw this in a tweet by Scientific Python.

The Pragmatic Haskeller

Filed under: Haskell,Programming — Patrick Durusau @ 6:30 pm

Announcing The Pragmatic Haskeller by Alfredo Di Napoli.

From the post:

We are working programmers. Even though we are carried away by the beauty and formalism of functional languages, at the end of the day we still need to get the job done. The Haskell community has been accused during the years of navel gazing, and usually the misconception is that “Haskell is an academic language, but you can’t tackle real world problem with it”.

Despite internet is full of fantastic resources for learning Haskell (including the awesome and recently-released School of Haskell) I wanted to use a peculiar approach: I tried to create a web application in a similar fashion of what I would have done during my everyday job in Scala, to see if I was able to replicate a large subset of the features normally present in my Scala apps. Not only was the experiment a success, but I finally had the chance of experimenting with a lot of cool libraries which were in my “to-play-with-list” for a while. So without further ado I’m happy to announce The Pragmatic Haskeller, a collection of small building blocks for creating a full-fledged webapp:

https://github.com/cakesolutions/the-pragmatic-haskeller

As you might notice, we have a lot of small subfolders. Each folder is a self-contained, working, incremental example experimenting with a new library or technique. Even though you could rush and play with each app all at once, I aim to write a dedicated blog post to each sub-project, introducing the used library, in a similar way of what Oliver Charles did with 24 days of Hackage, with the difference we’ll use the same example episode after episode, until transforming our naive bunch of modules into a complete app.

If the future of big data is with functional programming, where’s yours?

KairosDB

Filed under: KairosDB,Time,Time Series — Patrick Durusau @ 5:05 pm

KairosDB

From the webpage:

KairosDB is a fast distributed scalable time series database written primarily for Cassandra but works with HBase as well.

It is a rewrite of the original OpenTSDB project started at Stumble Upon. Many thanks go out to the original authors for laying the groundwork and direction for this great product. See a list of changes here.

Because it is written on top of Cassandra (or HBase) it is very fast and scalable. With a single node we are able to capture 40,000 points of data per second.

Why do you need a time series database? The quick answer is so you can be data driven in your IT decisions. With KairosDB you can use it to track the number of hits on your web server and compare that with the load average on your MySQL database.

Getting Started

Metrics

KairosDB stores metrics. Each metric consists of a name, data points (measurements), and tags. Tags are used to classify the metric.

Metrics can be submitted to KairosDB via telnet protocol or a REST API.

Metrics can be queried using a REST API. Aggregators can be used to manipulate the data as it is returned. This allows downsampling, summing, averaging, etc.

Do be aware that values must be either longs or doubles.
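Here is a sketch of pushing one data point over the REST API with the requests library; the endpoint path and payload shape follow my reading of the KairosDB documentation and should be verified against the current API reference.

  # Submit a single metric value (a long, per the note above) to KairosDB.
  import time
  import requests

  payload = [{
      "name": "web.hits",                              # metric name
      "datapoints": [[int(time.time() * 1000), 42]],   # [timestamp_ms, value]
      "tags": {"host": "web01"},                       # tags classify the metric
  }]

  resp = requests.post("http://localhost:8080/api/v1/datapoints", json=payload)
  resp.raise_for_status()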

If your data can be mapped into metric space, KairosDB may be quite useful.

The intersection of time series data with non-metric data or events awaits a different solution.

I first saw this at Alex Popescu’s Kairosdb – Fast Scalable Time Series Database.

Graphs for Gaming [Neo4j]

Filed under: Games,Graphs,Neo4j — Patrick Durusau @ 4:52 pm

Graphs for Gaming by Toby O’Rourke and Rik van Bruggen.

From the description:

Graph Databases have many use cases in many industries, but one of the most interesting ones that are emerging is in the Gaming industry. Because of its real-time nature, games are a perfect environment to make use of graph-based queries that are the basis for in-game recommendations. These recommendations make games more interesting for the users (they get to play cooler games with other people in their area, of their level, sharing their social network profile, etc) but also more profitable for the game providers, developers and publishers. After all: the latter want to be recommending specific games to specific target audiences, and thereby maximising their potential revenues.

Just in case tonight is movie night at your house and you forgot to pick up any videos. 😉

Or not.

Review comments:

Rik van Bruggen covers two centuries of math (Euler as the inventor of graphs), skips to Neo4j, then to NoSQL, criticisms of relational databases, new definition of complexity, and examples of complexity. Hits games at time mark 12:30, but discusses them very vaguely. Graphs in gaming, harnessing social networks. A demo of finding the games two people have played.

Nice demo of the Neo4j console.

Working for the same company as the basis for a recommendation to play against Rik?

Query response in milliseconds? Think about the company size for the query.

Demonstrates querying but nothing to do with using graphs in gaming. (Mining networks of users, yes, but that’s a generic problem.)

At time mark 28:00, the Neo4j infomercial finally ends.

Toby O’Rourke takes over. Bingo business case was to obtain referrals of friends. Social network problem. General comments on future use of graphs for recommendations and fraud/collusion detection. (Yes, I know, friend referrals and recommendations sound a lot alike. Not to the presenter.)

There are informative and useful Neo4j videos so don’t judge them all by this one.

However, spend your forty-eight plus minutes somewhere other than on this video.

Ultimate library challenge: taming the internet

Filed under: Data,Indexing,Preservation,Search Data,WWW — Patrick Durusau @ 3:40 pm

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of results from archiving only 100 sites).

The content gathered by the project will be made available to the public.

A welcome venture, particularly since the results will be made available to the public.

An unanswerable question, but I do wonder how we would view Greek drama if all of it had been preserved.

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.
