Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 27, 2013

snapLogic

Filed under: Data Integration,Enterprise Integration,snapLogic — Patrick Durusau @ 2:36 pm

snapLogic

From What We Do:

SnapLogic is the only cloud integration solution built on modern web standards and “containerized” Snaps, allowing you to easily connect any combination of Cloud, SaaS or On-premise applications and data sources.

We’ve now entered an era in which the Internet is the network, much of the information companies need to coordinate is no longer held in relational databases, and the number of new, specialized cloud applications grows each day. Today, organizations are demanding a faster and more modular way to interoperate with all these new cloud applications and data sources.

Prefab mapping components ("Snaps") are offered for data sources such as Salesforce, Oracle's PeopleSoft and SAP (all for sale), along with free components for Google Spreadsheet, HDFS, Hive and others.

Two observations:

First, the “snaps” are all for data sources and not data sets, although I don’t see any reason why data sets could not be the subject of snaps.

Second, the mapping examples I saw (caveat: I did not see them all) did not provide for recording the basis for data operations (read: subject identity).

With regard to the second observation, my impression is that snaps can be extended to provide capabilities such as we would associate with a topic map.

Something to consider even if you are fielding your own topic map application.

I am going to be reading more about snapLogic and its products.

Sing out if you have pointers or suggestions.

The Motherlode of Semantics, People

Filed under: Conferences,Crowd Sourcing,Semantic Web,Semantics — Patrick Durusau @ 8:08 am

1st International Workshop on “Crowdsourcing the Semantic Web” (CrowdSem2013)

Submission deadline: July 12, 2013 (23:59 Hawaii time)

From the post:

1st International Workshop on “Crowdsourcing the Semantic Web” in conjunction with the 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, in Sydney, Australia. This interactive workshop takes stock of the emergent work and charts the research agenda with interactive sessions to brainstorm ideas and potential applications of collective intelligence to solving AI-hard semantic web problems.

The Global Brain Semantic Web—a Semantic Web interleaving a large number of human and machine computation—has great potential to overcome some of the issues of the current Semantic Web. In particular, semantic technologies have been deployed in the context of a wide range of information management tasks in scenarios that are increasingly significant in both technical (data size, variety and complexity of data sources) and economical terms (industries addressed and their market volume). For many of these tasks, machine-driven algorithmic techniques aiming at full automation do not reach a level of accuracy that many production environments require. Enhancing automatic techniques with human computation capabilities is becoming a viable solution in many cases. We believe that there is huge potential at the intersection of these disciplines – large scale, knowledge-driven, information management and crowdsourcing – to solve technically challenging problems purposefully and in a cost effective manner.

I’m encouraged.

The Semantic Web is going to start asking the entities (people) that originate semantics about semantics.

Going to the motherlode of semantics.

Now to see what they do with the answers.

Strange Loop 2013

Filed under: Conferences,Data Structures,Database,NoSQL,Programming — Patrick Durusau @ 4:20 am

Strange Loop 2013

Dates:

  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”

2009, 2010, 2011, 2012

As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?

April 26, 2013

The Wikidata revolution is here:…

Filed under: Data,Wikidata,Wikipedia — Patrick Durusau @ 5:52 pm

The Wikidata revolution is here: enabling structured data on Wikipedia by Tilman Bayer.

From the post:

A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions,” said Wikimedia Foundation Executive Director Sue Gardner. “Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, can automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.”

This is a great source of curated data!
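
If you want to poke at the data directly, Wikidata exposes items through the standard MediaWiki web API. A rough sketch in Python; the item id Q42 and the exact response keys are illustrative assumptions to check against the API documentation:

import json
import urllib.parse
import urllib.request

# Fetch one Wikidata item and print its English label. Q42 is only an
# example id; verify the response layout against the Wikidata API docs.
def fetch_item(item_id):
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": item_id,
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + params
    with urllib.request.urlopen(url) as response:
        return json.load(response)

if __name__ == "__main__":
    data = fetch_item("Q42")
    entity = data["entities"]["Q42"]
    # Labels are keyed by language code; fall back gracefully if absent.
    label = entity.get("labels", {}).get("en", {}).get("value", "<no label>")
    print(label)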

Bitsy

Filed under: Bitsy,Graphs — Patrick Durusau @ 3:33 pm

Bitsy

From the webpage:

Bitsy is a small, fast, embeddable, durable in-memory graph database that implements the Blueprints API.

Features

  • Support for most Blueprints features including key indices and threaded transactions
  • ACID guarantees on transactions
  • Designed for multi-threaded OLTP applications
  • Implements optimistic concurrency control
  • Data stored in readable text files
  • Serialization using the Jackson JSON processor
  • Recovers cleanly from power failures and crashes provided the underlying file system supports metadata journaling, like NTFS, ext3, ext4, XFS and JFS (not FAT32 or ext2)
  • Supports online backups through a JMX interface

Just in time for the weekend!

Enjoy!

In-browser topic modeling

Filed under: Latent Dirichlet Allocation (LDA),Topic Models (LDA) — Patrick Durusau @ 3:28 pm

In-browser topic modeling by David Mimno.

From the post:

Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations.

About as easy an introduction/exploration as I can imagine.
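
If you want to see how little machinery the underlying model needs, here is a toy collapsed Gibbs sampler for LDA in Python. It is only a sketch: the corpus, hyperparameters and topic count are made up, and the project's actual JavaScript implementation is the thing to read for the real details.

import random
from collections import defaultdict

# Toy collapsed Gibbs sampling for LDA: assign each token a topic at random,
# then repeatedly resample each token's topic given all other assignments.
def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []

    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][word] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return topic_word

if __name__ == "__main__":
    corpus = [
        "topic models explore large text collections".split(),
        "browsers run javascript visualizations".split(),
        "statistical models of text collections".split(),
        "web visualizations in the browser".split(),
    ]
    for k, counts in enumerate(lda_gibbs(corpus, num_topics=2)):
        top_words = sorted(counts, key=counts.get, reverse=True)[:4]
        print(f"topic {k}: {top_words}")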

Enjoy!

Once Under Wraps, Supreme Court Audio Trove Now Online

Filed under: Data,History,Law,Law - Sources — Patrick Durusau @ 3:09 pm

Once Under Wraps, Supreme Court Audio Trove Now Online

From the post:

On Wednesday, the U.S. Supreme Court heard oral arguments in the final cases of the term, which began last October and is expected to end in late June after high-profile rulings on gay marriage, affirmative action and the Voting Rights Act.

Audio from Wednesday’s arguments will be available at week’s end at the court’s website, but that’s a relatively new development at an institution that has historically been somewhat shuttered from public view.

The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn’t available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn’t duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.

But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.

A truly incredible resource for U.S. history in general and legal history in particular.

The transcripts and tapes are synchronized so your task, if you are interested, is to map these resources to other historical accounts and resources. 😉

The only disappointment is that the recordings begin with the October term of 1955. One of the best-known cases of the 20th century, Brown v. Board of Education, was argued in 1952 and re-argued in 1953. Hearing Thurgood Marshall argue that case would be a real treat.

I first saw this at: NPR: oyez.org finishes Supreme Court oral arguments project.

What is The ROI of Ignorance?

Filed under: Humor — Patrick Durusau @ 2:46 pm

What is The ROI of Ignorance? by Timo Elliott.

Some quants will be disappointed but it’s a fair estimate:

[Chart: Ignorance ROI]

Bad Practices

Filed under: Design,Interface Research/Design,Programming — Patrick Durusau @ 2:39 pm

Why Most People Don’t Follow Best Practices by Kendra Little.

Posted in a MS SQL Server context but the lesson applies to software, systems, and processes alike:

Unfortunately, human nature makes people persist all sorts of bad practices. I find everything in the wild from weekly reboots to crazy settings in Windows and SQL Server that damage performance and can cause outages. When I ask why the settings are in place, I usually hear a story that goes like this:

  • Once upon a time, in a land far far away there was a problem
  • The people of the land were very unhappy
  • A bunch of changes were made
  • Some of the changes were recommended by someone on the internet. We think.
  • The problem went away
  • The people of the land were happier
  • We hunkered down and just hoped the problem would never come back
  • The people of the land have been growing more and more unhappy over time again

Most of the time “best practices” are implemented to try and avoid pain rather than to configure things well. And most of the time they aren’t thought out in terms of long term performance. Most people haven’t really implemented any best practices, they’ve just reacted to situations.

How are the people of the land near you?

Introducing Categories

Filed under: Category Theory,Mathematics — Patrick Durusau @ 5:13 am

Introducing Categories by Jeremy Kun.

From the post:

It is time for us to formally define what a category is, to see a wealth of examples. In our next post we’ll see how the definitions laid out here translate to programming constructs. As we’ve said in our soft motivational post on categories, the point of category theory is to organize mathematical structures across various disciplines into a unified language. As such, most of this post will be devoted to laying down the definition of a category and the associated notation. We will be as clear as possible to avoid a notational barrier for newcomers, so if anything is unclear we will clarify it in the comments.

Definition of a Category

Let’s recall some examples of categories we’ve seen on this blog that serve to motivate the abstract definition of a category. We expect the reader to be comfortable with sets, and to absorb or glaze over the other examples as comfort dictates. The reader who is uncomfortable with sets and functions on sets should stop here. Instead, visit our primers on proof techniques, which doubles as a primer on set theory (or our terser primer on set theory from two years ago).

The go-to example of a category is that of sets: sets together with functions between sets form a category. We will state exactly what this means momentarily, but first some examples of categories of “sets with structure” and “structure-preserving maps.”

Not easy but not as difficult as some introductions to category theory.

Jeremy promises that the very next post jumps into code to show the “definition of a category as a type in ML.”

Plus some pros and cons of proceeding this way.

April 25, 2013

Why Hypergraphs?

Filed under: Graphs,Hypergraphs — Patrick Durusau @ 6:06 pm

Why Hypergraphs? by Linas Vepstas.

From the post:

OpenCog uses hypergraphs to represent knowledge. Why? I don’t think this is clearly, succinctly explained anywhere, so I will try to do so here. This is a very important point: I can’t begin to tell you how many times I went searching for some whiz-bang logic programming system, or inference engine, or theorem-prover, or some graph re-writing engine, or some probabilistic programming system, only to throw my hands up and realize that, after many wasted hours, none of them do what I want. If you’re interested in AGI, then let me assure you: they don’t do what you want, either. So, what do I want them to do, and why?

Well, lets begin easy: with graph re-writing systems. These days, almost everyone agrees that a great way to represent knowledge is with graphs. The structure IsA(Cat, Animal) looks like a graph with two vertexes, Cat and Animal, and a labelled edge, IsA, between them. If I also know that IsA(Binky, Cat), then, in principle, I should be able to deduce that IsA(Binky, Animal). This is a simple transitive relationship, and the act of logical deduction, for this example, is a simple graph re-write rule: If you see two IsA edges in a row, you should draw a third IsA edge between the first and the last vertex. Easy, right?

So perhaps you’d think that all logic induction and reasoning engines have graph rewrite systems at their core, right? So you’d think. In fact, almost none of them do. And those that do, do it in some internal, ad hoc, non-public, undocumented way: there’s no API, it’s not exposed externally; it’s not an ‘official’ part of the system for you to use or tinker with.

You know how I feel about AI triumphalism so I won’t bother to repeat the rant.

However, the hypergraph part of this work looks interesting. Whatever your views on AI.

A good place to start would be the OpenCog Development page.
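
The IsA example in the quoted passage amounts to a single rewrite rule, which is easy to make concrete. A minimal sketch in Python with an invented edge set; OpenCog's AtomSpace and pattern matcher are of course far richer than this:

# The IsA rewrite rule from the post: if IsA(x, y) and IsA(y, z) both hold,
# add IsA(x, z). Repeat until nothing new appears. Edges are illustrative.
def isa_closure(edges):
    graph = set(edges)
    changed = True
    while changed:
        changed = False
        new_edges = {
            (x, z)
            for (x, y1) in graph
            for (y2, z) in graph
            if y1 == y2 and (x, z) not in graph
        }
        if new_edges:
            graph |= new_edges
            changed = True
    return graph

if __name__ == "__main__":
    edges = {("Binky", "Cat"), ("Cat", "Animal")}
    print(isa_closure(edges))
    # adds ('Binky', 'Animal') to the original two edges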

Client-side search

Filed under: Javascript,Lucene — Patrick Durusau @ 3:07 pm

Client-side search by Gene Golovchinsky.

From the post:

When we rolled out the CHI 2013 previews site, we got a couple of requests for being able to search the site with keywords. Of course interfaces for search are one of my core research interests, so that request got me thinking. How could we do search on this site? The problem with the conventional approach to search is that it requires some server-side code to do the searching and to return results to the client. This approach wouldn’t work for our simple web site, because from the server’s perspective, our site was static — just a few HTML files, a little bit of JavaScript, and about 600 videos. Using Google to search the site wouldn’t work either, because most of the searchable content is located on two pages, with hundreds of items on each page. So what to do?

I looked around briefly trying to find some client-side indexing and retrieval code, and struck out. Finally, I decided to take a crack at writing a search engine in JavaScript. Now, before you get your expectations up, I was not trying to re-implement Lucene in JavaScript. All I wanted was some rudimentary keyword search capability. Building that in JavaScript was not so difficult.

One simplifying assumption I could make was that my document collection was static: sorry, the submission deadline for the conference has passed. Thus, I could have a static index that could be made available to each client, and all the client needed to do was match and rank.

Each of my documents had a three character id, and a set of fields. I didn’t bother with the fields, and just lumped everything together in the index. The approach was simple, again due to lots of assumptions. I treated the inverted index as a hash table that maps keywords onto lists of document ids. OK, document ids and term frequencies. Including positional information is an exercise left to the reader.
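
Gene's design translates almost line for line into code. Here is a hedged sketch of the same idea in Python rather than JavaScript: a static inverted index mapping terms to (document id, term frequency) postings, with ranking by summed frequency. The documents are invented, and fields and positions are omitted, just as in the post.

import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Built once offline, shipped to the client; the client only matches and ranks.
def build_index(documents):
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term].append((doc_id, freq))
    return dict(index)

# Rank documents by the summed frequency of matched query terms.
def search(index, query):
    scores = Counter()
    for term in tokenize(query):
        for doc_id, freq in index.get(term, []):
            scores[doc_id] += freq
    return scores.most_common()

if __name__ == "__main__":
    docs = {
        "a1b": "interfaces for exploratory search",
        "c2d": "client side search in the browser",
        "e3f": "video previews for the conference program",
    }
    index = build_index(docs)
    print(search(index, "search interfaces"))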

A refreshing reminder that simplified requirements can lead to successful applications.

Or to put it another way, not every application has to meet every possible use case.

For example, I might want to have a photo matching application that only allows users to pick match/no match for any pair of photos.

Not why, nor the reasons behind a match/no match judgment, etc.

But it does capture the user's identity in an association saying that photo X and photo Y are of the same person.

That doesn’t provide any basis for automated comparison of those judgments, but not every judgment is required to do so.

I am starting to think of subject identification as a continuum of practices, some of which enable more reuse than others.

Which of those you choose, depends upon your requirements, your resources and other factors.

PubMed Watcher (beta)

Filed under: News,PubMed,PubMed Watcher — Patrick Durusau @ 2:39 pm

PubMed Watcher (beta)

After logging it with a Google account:

Welcome on PubMed Watcher!

Thanks for registering, here is what you need to know to get quickly started:

Step 1 – Add a Key Article

Define your research topic by setting up to four Key Articles. For instance you can use your own work as input or the papers of the lab you are working in at the moment. Key Articles describe the science you care about. The articles must be referenced on PubMed.

Step 2 – Read relevant stuff

PubMed Watcher will provide you with a feed of related articles, sorted by relevance and similarity in regards to the Key Articles content. The more Key Articles you have, the more tailored the list will be. PubMed Watcher helps to abstract away from journals, impact factors and date of publishing. Spend time reading, not searching! Come back every now and then to monitor your field and to get relevant literature to read.

Ready? Add your first Key Article or learn more about PubMed Watcher machinery.

OK, so I picked four seed articles and then read the “about,” where a “pinch of heuristics” says:

Now the idea behind PubMed Watcher is to pool the feeds coming from each one of your Key Articles. If an article is present in more than one feed, it means that this article seems to be even more interesting to you, that’s the heuristic. The redundant article then gets a new higher score which is the sum of all its individual scores. Example, let’s say you have two Key Articles named A and B. A has two similar articles F and G with respective similarity scores of 4 and 2. The Key Article B has two similar articles too: M and G with scores 7 and 6. The feed presented to you by PubMed Watcher will then be: G first (score of 6+2=8), M (score of 7) and finally F (4). This score is standardised in percentages (relative relatedness, the blue bars in the application), so here we would get: G (100%), M (88%) and F (50%). This metric is not perfect yet it’s intuitive and gives good enough results; plus it’s fast to compute.
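
The arithmetic in that example is easy to reproduce. A small sketch of the pooling heuristic in Python, using the A/B and F/G/M numbers from the quote (an illustration of the description, not PubMed Watcher's code):

from collections import defaultdict

# Sum the similarity scores an article receives across Key Article feeds,
# then rescale to a percentage of the best total. Scores are from the quote.
def pool_feeds(feeds):
    totals = defaultdict(int)
    for feed in feeds.values():          # one feed per Key Article
        for article, score in feed.items():
            totals[article] += score
    best = max(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(article, score, round(100 * score / best)) for article, score in ranked]

if __name__ == "__main__":
    feeds = {
        "A": {"F": 4, "G": 2},
        "B": {"M": 7, "G": 6},
    }
    for article, score, percent in pool_feeds(feeds):
        print(article, score, f"{percent}%")
    # G 8 100%
    # M 7 88%
    # F 4 50%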

Paper on the technique:

PubMed related articles: a probabilistic topic-based model for content similarity by Jimmy Lin and W John Wilbur.

Code on Github.

The interface is fairly “lite” and you can change your four articles easily.

One thing I like from the start is that all I need do is pick one to four articles and I'm set up.

Hard to imagine an easier setup process that comes close to matching your interests.

PODC and SPAA 2013 Accepted Papers

Filed under: Conferences,Distributed Computing,Parallel Programming,Parallelism — Patrick Durusau @ 2:03 pm

ACM Symposium on Principles of Distributed Computing [PODC] accepted papers. (Montréal, Québec, Canada, July 22-24, 2013) Main PODC page.

Symposium on Parallelism in Algorithms and Architectures [SPAA] accepted papers. (Montréal, Québec, Canada, July 23 – 25, 2013) Main SPAA page.

Just scanning the titles reveals a number of very interesting papers.

Suggest you schedule a couple of weeks of vacation in Canada following SPAA before attending the Balisage Conference, August 6-9, 2013.

The weather is quite temperate and the outdoor dining superb.

I first saw this at: PODC AND SPAA 2013 ACCEPTED PAPERS.

Hadoop Summit North America (June 26-27, 2013)

Filed under: Conferences,Hadoop,MapReduce — Patrick Durusau @ 1:44 pm

Hadoop Summit North America

From the webpage:

Hortonworks and Yahoo! are pleased to host the 6th Annual Hadoop Summit, the leading conference for the Apache Hadoop community. This two-day event will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture. It will also be an excellent networking event for developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending or implementing Apache Hadoop.

Community Choice Selectees:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

If you need another reason to attend, it’s located in San Jose, California.

2nd best US location for a conference. #1 being New Orleans.

A different take on data skepticism

Filed under: Algorithms,Data,Data Models,Data Quality — Patrick Durusau @ 1:26 pm

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
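
The k-means point is easy to demonstrate: the algorithm will carve any data into exactly k pieces whether or not k pieces exist. A rough sketch in Python with synthetic one-dimensional data and a from-scratch Lloyd's loop (no claim that this is the setup Beau or Cathy had in mind):

import random

# k-means happily returns k clusters for any k you ask for, which is the
# hidden assumption the post warns about. Data and k values are invented.
def kmeans(points, k, iterations=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster went empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    # Within-cluster sum of squares: a "fit" score that never says "stop".
    wcss = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, wcss

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.uniform(0, 100) for _ in range(500)]  # uniform noise, no real clusters
    for k in (2, 4, 7):
        centers, wcss = kmeans(data, k)
        print(f"k={k}  wcss={wcss:.1f}  centers={[round(c, 1) for c in sorted(centers)]}")
    # Every k "works"; nothing in the output reveals that the structure is imaginary.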

Beau makes several good points on questioning data methods.

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute about those designs: what choices were made, by whom, and for what reasons?

Giving voice to what can be known about methods and data falls to human users.

Beginner Tips For Elastic MapReduce

Filed under: Cloud Computing,Elastic Map Reduce (EMR),Hadoop,MapReduce — Patrick Durusau @ 1:08 pm

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils. Wait, that was a replica of the early cyclotron. Sorry, wrong experiment. 😉

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic MapReduce (EMR) is one.

You can learn Hadoop with Hortonworks Sandbox and when you need production power, EMR awaits.
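
One nice property of EMR's streaming mode is that the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines. A hedged word-count sketch in Python; the file name and invocation are assumptions, and the actual step wiring belongs in the EMR streaming documentation:

#!/usr/bin/env python
# A minimal Hadoop Streaming word count, usable as an EMR streaming step.
# Illustrative invocation: "wordcount.py" as mapper, "wordcount.py reduce"
# as reducer. Hadoop sorts by key between the two phases.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()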

From a cost effectiveness standpoint, that sounds like a good deal to me.

You?

PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the web services for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money off of you.

Gmail Email analysis with Neo4j – and spreadsheets

Filed under: Email,Neo4j — Patrick Durusau @ 10:36 am

Gmail Email analysis with Neo4j – and spreadsheets by Rik Van Bruggen.

From the post:

A bunch of different graphistas have pointed out to me in recent months that there is something funny about Graphs and email. Specifically, about graphs and email analysis. From my work in previous years at security companies, I know that Email Forensics is actually big business. Figuring out who emails whom, about what topics, with what frequency, at what times – is important. Especially when the proverbial sh*t hits the fan and fraud comes to light – like in the Enron case. How do I get insight into email traffic? How do I know what was communicated to who? And how do I get that insight, without spending a true fortune?

An important demonstration that sophisticated data analysis may originate with fairly pedestrian authoring tools.

For the Enron emails, see: Enron Email Dataset. Reported to be 0.5M messages, approximately 423Mb, tarred and gzipped.
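
Before loading anything into Neo4j, the who-emails-whom edges can be pulled straight from the raw messages with Python's standard library. A rough sketch against a maildir-style dump such as the Enron dataset (the directory name is an assumption and the header parsing is deliberately naive):

import os
from collections import Counter
from email import message_from_file
from email.utils import getaddresses, parseaddr

# Walk a maildir-style dump and count sender -> recipient edges from the
# From/To/Cc headers. "maildir" is a placeholder path; error handling is minimal.
def email_edges(root):
    edges = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as handle:
                    msg = message_from_file(handle)
            except OSError:
                continue
            sender = parseaddr(msg.get("From", ""))[1].lower()
            recipients = [addr.lower() for _, addr in
                          getaddresses(msg.get_all("To", []) +
                                       msg.get_all("Cc", []))]
            for recipient in recipients:
                if sender and recipient:
                    edges[(sender, recipient)] += 1
    return edges

if __name__ == "__main__":
    for (src, dst), count in email_edges("maildir").most_common(10):
        print(f"{src} -> {dst}: {count}")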

The topic map question is what to do with separate graphs of:

  • Enron emails,
  • Enron corporate structure,
  • Social relationships between Enron employees and others,
  • Documents of other types interchanged or read inside of Enron,
  • Travel and expense records, and,
  • Phone logs inside Enron?

Graphs of any single data set can be interesting.

Merging graphs of inter-related data sets can be powerful.

Open Data On The Web : April 2013

Filed under: Open Data,W3C — Patrick Durusau @ 10:09 am

Open Data On The Web : April 2013 by Kal Ahmed.

From the post:

I was privileged to be one of the attendees of the Open Data on the Web workshop organized by W3C and hosted by Google in London this week. I say privileged because the gathering brought together researchers, developers and entrepreneurs from all around the world together in a unique mix that I’m sure won’t be achieved again until Phil Archer at W3C organizes the next one.

In the following I have not used direct quotes from those named as I didn’t make many notes of direct quotations. I hope that I have not misrepresented anyone, but if I have, please let me know and I will fix the text. This is not a journalistic report, it’s more a reflection of my concerns through the prism of a lot of people way smarter than me saying a lot of interesting things.

Covers sustainability, make it simpler?, data as a service, discoverability, attribution & licensing.

Kal has an engaging writing style and you will gain a lot just from his summary.

The issues he reports are largely the same across the datasphere, whatever your technological preference.

April 24, 2013

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Filed under: Graphic Processors,Graphs,Machine Learning,R,Sparse Data,Sparse Matrices — Patrick Durusau @ 7:05 pm

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices by Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber.

Abstract:

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.

Your mileage may vary but the paper reports that for PageRank, Presto is 40X faster than Hadoop and 15X faster than Spark.

Unfortunately I can’t point you to any binary or source code for Presto.

Still, the description is an interesting one at a time of rapid development of computing power.

History of the Modern GPU Series

Filed under: GPU,Programming — Patrick Durusau @ 6:06 pm

History of the Modern GPU Series

From the post:

Graham Singer over at Techspot posted a series of articles a few weeks ago covering the history of the modern GPU. It is well-written and in-depth.

For GPU aficionados, this is a nice read. There are 4 parts to the series:

  1. Part 1: (1976 – 1995) The Early Days of 3D Consumer Graphics
  2. Part 2: (1995 – 1999) 3Dfx Voodoo: The Game-changer
  3. Part 3: (2000 – 2006) The Nvidia vs. ATI Era Begins
  4. Part 4: (2006 – 2013) The Modern GPU: Stream processing units a.k.a. GPGPU

Just in case you are excited about the GPU news reported below, a bit of history might not hurt.

😉

Fast Database Emerges from MIT Class… [Think TweetMap]

Filed under: GPU,MapD,SQL — Patrick Durusau @ 4:39 pm

Fast Database Emerges from MIT Class, GPUs and Student’s Invention by Ian B. Murphy.

Details the invention of MapD by Todd Mostak.

From the post:

MapD, At A Glance:

MapD is a new database in development at MIT, created by Todd Mostak.

  • MapD stands for “massively parallel database.”
  • The system uses graphics processing units (GPUs) to parallelize computations. Some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce.
  • A MapD server costs around $5,000 and runs on the same power as five light bulbs.
  • MapD runs at between 1.4 and 1.5 teraflops, roughly equal to the fastest supercomputer in 2000.
  • MapD uses SQL to query data.
  • Mostak intends to take the system open source sometime in the next year.

Sam Madden (MIT) describes MapD this way:

Madden said there are three elements that make Mostak’s database a disruptive technology. The first is the millisecond response time for SQL queries across “huge” datasets. Madden, who was a co-creator of the Vertica columnar database, said MapD can do in milliseconds what Vertica can do in minutes. That difference in speed is everything when doing iterative research, he said.

The second is the very tight coupling between data processing and visually rendering the data; this is a byproduct of building the system from GPUs from the beginning. That adds the ability to visualize the results of the data processing in under a second. Third is the cost to build the system. MapD runs in a server that costs around $5,000.

“He can do what a 1000 node MapReduce cluster would do on a single processor for some of these applications,” Madden said.

Not a lot of technical detail but you could start learning CUDA while waiting for the open source release.

At 1.4 to 1.5 teraflops on $5,000 worth of hardware, how will clusters retain their customer base?

Welcome to TweetMap ALPHA

Filed under: GPU,Maps,SQL,Tweets — Patrick Durusau @ 3:57 pm

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak, (currently a researcher at MIT), and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

Threat Assessment Glossary

Filed under: Vocabularies — Patrick Durusau @ 3:33 pm

Threat Assessment Glossary by Denise Bulling and Mario Scalora.

If you are working in the public/national security area, you may need some vocabulary help.

I would check the definitions against other sources.

Here’s why:

Hunters (AKA Biters): Individuals who intend to follow a path toward violence and behave in ways to further that goal

I’m sure the NRA will like that one.

Identification: Thoughts of the necessity and utility of violence by a subject that are made evident through behaviors such as researching previous attackers and collecting, practicing, and fantasizing about weapons

That looks like a typo but I can’t tell where it should go.

Terrorism: Act of violence or threats of violence used to further the agenda of the perpetrator while causing fear and psychological distress

I would have included physical harm but I’m no expert on terrorism.

So you want to look at a graph

Filed under: Graphics,Graphs,Visualization — Patrick Durusau @ 1:54 pm

So you want to look at a graph by Carlos Scheidegger.

From the post:

Say you are given a graph and are told: “Tell me everything that is interesting about this graph”. What do you do? We visualization folks like to believe that good pictures show much of what is interesting about data; this series of posts will carve a path from graph data to good graph plots. The path will take us mostly through well-known research results and techniques; the trick here is I will try to motivate the choices from first principles, or at least as close to it as I can manage.

One of the ideas I hope to get across is that, when designing a visualization, it pays to systematically consider the design space. Jock MacKinlay’s 1986 real breakthrough was not the technique for turning a relational schema into a drawing specification. It was the realization that this systematization was possible and desirable. That his technique was formal enough to be encoded in a computer program is great gravy, but the basic insight is deeper.

Of course, the theory and practice of visualization in general is not ready for a complete systematization, but there are portions ripe for the picking. In this series, I want to see what I can do about graph visualization.

If you like this introduction, be sure to follow the series to:

So you want to look at a graph, part 1

This series of posts is a tour through of the design space of graph visualization. As I promised, I will do my best to objectively justify as many visualization decisions as I can. This means we will have to go slow; I won’t even draw anything today! In this post, I will only take the very first step: all we will do is think about graphs, and what might be interesting about them.

So you want to look at a graph, part 2

This series of posts is a thorough examination of the design space of graph visualization (Intro, part 1). In the previous post, we talked about graphs and their properties. We will now talk about constraints arising from the process of transforming our data into a visualization.

So you want to look at a graph, part 3

This series of posts is a tour of the design space of graph visualization. I’ve written about graphs and their properties, and how the encoding of data into a visual representation is crucial. In this post, I will use those ideas to justify the choices behind a classic algorithm for laying out directed, mostly-acyclic graphs.

More posts are coming!

Brain: … [Topic Naming Constraint Reappears]

Filed under: Bioinformatics,OWL,Semantic Web — Patrick Durusau @ 1:41 pm

Brain: biomedical knowledge manipulation by Samuel Croset, John P. Overington and Dietrich Rebholz-Schuhmann. (Bioinformatics (2013) 29 (9): 1238-1239. doi: 10.1093/bioinformatics/btt109)

Abstract:

Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).

Availability and implementation: The Java source code and the library are freely available at https://github.com/loopasam/Brain and on the Maven Central repository (GroupId: uk.ac.ebi.brain). The documentation is available at https://github.com/loopasam/Brain/wiki.

Contact: croset@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Odd how things like the topic naming constraint show up in unexpected contexts. 😉

This article may be helpful if you are required to create or read OWL based data.

But as I read the article I saw:

The names (short forms) of OWL entities handled by a Brain object have to be unique. It is for instance not possible to add an OWL class, such as http://www.example.org/Cell to the ontology if an OWL entity with the short form ‘Cell’ already exists.

The explanation?

Despite being in contradiction with some Semantic Web principles, this design prevents ambiguous queries and hides as much as possible the cumbersome interaction with prefixes and Internationalized Resource Identifiers (IRI).

I suppose, but doesn’t ambiguity exist in the mind of the user? That is, they use a term that can have more than one meaning?

Having unique terms simply means inventing odd terms that no user will know.

Rather than unambiguous, isn't that just unfound?
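
To make the trade-off concrete, here is a rough sketch (plain Python, not Brain's API) of what a unique short form registry implies: the second IRI whose local name collides is simply rejected, which keeps queries unambiguous and penalizes any domain that genuinely reuses names.

# Illustrates the design choice discussed above, not Brain's actual code:
# a registry that refuses two different IRIs sharing the same short form.
class ShortFormRegistry:
    def __init__(self):
        self._by_short_form = {}

    def add(self, iri):
        # Take the fragment or last path segment as the "short form".
        short = iri.rstrip("/").rsplit("/", 1)[-1].split("#")[-1]
        existing = self._by_short_form.get(short)
        if existing is not None and existing != iri:
            raise ValueError(f"short form {short!r} already bound to {existing}")
        self._by_short_form[short] = iri

registry = ShortFormRegistry()
registry.add("http://www.example.org/Cell")
# A second, distinct 'Cell' is rejected even though the IRIs differ:
try:
    registry.add("http://purl.example.net/anatomy#Cell")
except ValueError as err:
    print(err)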

Weapons of Mass Destruction Were In Iraq

Filed under: Government,Language — Patrick Durusau @ 1:26 pm

It is commonly accepted that no weapons of mass destruction were found after the invasion of Iraq by Bush II.

But is that really true?

To credit that claim, you would have to be unable to find a common pressure cooker in Iraq.

The FBI apparently considers bombs made using pressure cookers to be “weapons of mass destruction.”

How remarkable. I have one of the big pressure canners. That must be the H-Bomb of pressure cookers. 😉

“Weapon of mass destruction” gets even vaguer when you get into the details.

18 USC § 2332a – Use of weapons of mass destruction, which refers you to another section, “any destructive device as defined in section 921 of this title;” to find the definition.

And, 18 USC § 921 – Definitions reads in relevant part:

(4) The term “destructive device” means—
(A) any explosive, incendiary, or poison gas—
(i) bomb,
(ii) grenade,
(iii) rocket having a propellant charge of more than four ounces,
(iv) missile having an explosive or incendiary charge of more than one-quarter ounce,
(v) mine, or
(vi) device similar to any of the devices described in the preceding clauses;

Maybe Bush II should have asked the FBI to hunt for “weapons of mass destruction” in Iraq.

They would not have come home empty handed.


If this seems insensitive, remember government debasement of language contributes to the lack of sane discussions about national security.

Discussions that could have led to better information sharing and possibly the stopping of some crimes.

Yes, crimes, not acts of terrorism. Crimes are solved by old fashioned police work.

Fear of acts of terrorism leads to widespread monitoring of electronic communications, loss of privacy, etc.

As shown in the Boston incident, national security monitoring played no role in stopping the attack or apprehending the suspects.

Traditional law enforcement did.

Why is the most effective tool against crime not a higher priority?

How to Go Viral, Every Time

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:52 am

How to Go Viral, Every Time by Jess Bachman.

From the post:

Everyone wants their content to go viral. It’s the holy grail of marketing. It can turn companies and product into the talk of the town, even if they sell toiletries. The ROI on content with more than a million views is almost unmeasurable. So how do you make sure your content will go viral?

The secret is simple. Be incredibly lucky.

Luck is the third piece of the virality triumvirate and obviously the hardest to bank on. In fact, you cannot achieve true virality without it. With great content and powerful tactics you can certainly get millions of views on a consistent basis, but if lady luck doesn’t give her blessing, you will end up with a good – but not great – ROI.

What do you think would make good viral material for a topic map video?

And of course:

Anyone with skills at producing videos interested in a topic map video?

Balisage Advice: How To Organize A Talk

Filed under: Marketing — Patrick Durusau @ 8:01 am

How To Organize A Talk

From the post:

Say you are speaking for an hour to an audience of 100. It’s just a fact of human nature that nobody in the audience is going to be paying close attention to what you are saying for more than 1/4 of the time. The other 45 minutes of the time people will be thinking, talking, or just daydreaming. You must accept this as an unavoidable constraint.

Absent any intervention on your part then you will get a randomly selected 15 minutes of attention from each member of the audience. This means that at any one point in time you will have the attention of only 1/4 of your audience or 25 out of the 100 people. The very important things you will have to say will be processed and potentially remembered by 1/4 of your audience, the same fraction that will be paying attention to the least important things you have to say.
….

A forty-five minute time slot means you have about 11 minutes to say your important ideas.

See the post for some tips on doing exactly that.

I suspect the same is true for discussions with potential/actual customers as well.

They are not stupid, they just aren’t paying attention to what you are saying.

One response would be to wire them up like mice that get shocked at random. (That may be illegal in some jurisdictions.)

Another response would be to accept that people are as they are and not as we might want them to be.

The second response is likely to be the more successful, if less satisfying. 😉

Not easy to do, but explanations built on complex diagrams, with ever more complexity added, haven't set the woods on fire as a marketing tool.

April 23, 2013

Meet @InfoVis_Ebooks, …

Filed under: Tweets,Visualization — Patrick Durusau @ 7:15 pm

Meet @InfoVis_Ebooks, Your Source for Random InfoVis Paper Snippets by Robert Kosara.

From the post:

InfoVis Ebooks takes a random piece of text from a random paper in its repository and tweets it. It has read all of last year’s InfoVis papers, and is now getting started with the VAST proceedings. After that, it will start reading infovis papers published in last year’s EuroVis and CHI conferences, and then work its way back to previous years.

Each tweet contains a reference to the paper the snippet is from. For InfoVis, VAST, and CHI, these are DOIs rather than links. Links get long and distracting, whereas DOIs are much easier to tune out in a tweet. If you want to see the paper, google the DOI string (keep the “doi:” part). You can also take everything but the “doi:” and append it to http://dx.doi.org/ to be redirected to the paper page. For other sources, I will probably have to use links.

As the name suggests, InfoVis Ebooks is about infovis papers. If you want to do the same for SciVis, HCI, or anything else, the code is available on github.
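
The DOI-to-link step Robert describes is a one-liner. A hedged sketch in Python, with a placeholder DOI rather than a real reference:

# Turn the "doi:" string in a tweet into a resolvable link by appending it
# to dx.doi.org, as the post explains. The DOI value below is a placeholder.
def doi_to_url(doi_reference):
    prefix = "doi:"
    if doi_reference.startswith(prefix):
        doi_reference = doi_reference[len(prefix):]
    return "http://dx.doi.org/" + doi_reference

print(doi_to_url("doi:10.1234/example"))
# http://dx.doi.org/10.1234/example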

When I first saw this, I thought it would be a source of spam.

But it lingered on a browser tab for a day or so and when I looked back at it, I started to get interested.

Not that this would help a machine, but for human readers, seeing the right snippet at the right time could lead to a good (or bad) idea.

Can’t tell which one in advance but seems like it would be worth the risk.

Perhaps we can’t guarantee serendipity but we can create conditions where it is more likely to happen.

Yes?

PS: If you start one of these feeds, let me know so I can point to it.
