Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 15, 2014

Google details new “Poodle” bug…

Filed under: Cybersecurity,Programming,Security — Patrick Durusau @ 9:26 am

Google details new “Poodle” bug, making browsers susceptible to hacking by Jonathan Vanian.

From the post:

Google’s security team detailed today a new bug that takes advantage of a design flaw in SSL version 3.0, a security protocol created by Netscape in the mid 1990s. The researchers called it a Padding Oracle on Downgraded Legacy Encryption bug, or POODLE.

Although the protocol is old, Google said that “nearly all browsers support it” and it’s available for hackers to exploit. Even though many modern-day websites use the TLS security protocol (essentially, the next-generation SSL) as their means of encrypting data for a secure network connection between a browser and a website, things can run amok if the connection goes down for some reason.

See Jonathan’s post for more “Poodle” details.
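On the practical side, if you maintain Python clients you can refuse SSL 3.0 outright, so a POODLE-style downgrade has nothing to fall back to. A minimal sketch, assuming a Python recent enough to expose ssl.OP_NO_SSLv3 and the context parameter to urlopen (2.7.9+/3.4+):

    import ssl
    import urllib.request

    # Build a TLS context that refuses SSLv2 and SSLv3 entirely, so a
    # downgrade to SSL 3.0 cannot succeed even if the server offers it.
    context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)  # negotiate the best version available
    context.options |= ssl.OP_NO_SSLv2 | ssl.OP_NO_SSLv3

    # Use the hardened context for an ordinary HTTPS request.
    with urllib.request.urlopen("https://example.com/", context=context) as response:
        print(response.status)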

Suggestions for a curated and relatively comprehensive collection of security bugs as they are discovered? I ask because I follow a couple of fairly active streams but I haven’t found one that I would call “curated,” in the sense that each bug is reported once and only once, with related material linked to it.

Is it just me or would others find that to be a useful resource?

Inductive Graph Representations in Idris

Filed under: Functional Programming,Graphs,Types — Patrick Durusau @ 8:54 am

Inductive Graph Representations in Idris by Michael R. Bernstein.

Michael’s post is an early exploration of the ideas in Inductive Graphs and Functional Graph Algorithms by Martin Erwig.

Abstract (of Erwig’s paper):

We propose a new style of writing graph algorithms in functional languages which is based on an alternative view of graphs as inductively defined data types. We show how this graph model can be implemented efficiently and then we demonstrate how graph algorithms can be succinctly given by recursive function definitions based on the inductive graph view. We also regard this as a contribution to the teaching of algorithms and data structures in functional languages since we can use the functional-graph algorithms instead of the imperative algorithms that are dominant today.
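If the inductive view of graphs is new to you, a rough Python sketch of the idea may help: decompose a graph into one node’s context (the node plus its incident edges) and the remaining graph, then let algorithms recurse on the remainder. This is only my illustration of Erwig’s idea, not Idris code and not his library:

    def match(node, graph):
        """Split an adjacency-dict graph into `node`'s neighbours and the
        remaining graph with `node` removed; (None, graph) if absent."""
        if node not in graph:
            return None, graph
        neighbors = list(graph[node])
        rest = {n: [m for m in adj if m != node]
                for n, adj in graph.items() if n != node}
        return neighbors, rest

    def dfs(stack, graph):
        """Depth-first search against the inductive view: recurse on the
        *remaining* graph, so no separate visited set is needed."""
        if not stack or not graph:
            return []
        v, rest_stack = stack[0], stack[1:]
        neighbors, rest_graph = match(v, graph)
        if neighbors is None:
            return dfs(rest_stack, graph)
        return [v] + dfs(neighbors + rest_stack, rest_graph)

    g = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a'], 'd': ['b']}
    print(dfs(['a'], g))  # ['a', 'b', 'd', 'c']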

You can follow Michael at: @mrb_bk or https://github.com/mrb or his blog: http://michaelrbernste.in/.

More details on Idris: A Language With Dependent Types.

October 14, 2014

Cryptic genetic variation in software:…

Filed under: Bioinformatics,Bugs,Programming,Software — Patrick Durusau @ 6:18 pm

Cryptic genetic variation in software: hunting a buffered 41 year old bug by Sean Eddy.

From the post:

In genetics, cryptic genetic variation means that a genome can contain mutations whose phenotypic effects are invisible because they are suppressed or buffered, but under rare conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I’ve just found an example in our code, and traced the origin of the problem back 41 years to the algorithm’s description in a 1973 applied mathematics paper. The algorithm — for sampling from a Gaussian distribution — is used worldwide, because it’s implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me that the only reason code has been working is that a compensatory “mutation” has been selected for in everyone else’s code except mine.

…
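For a concrete picture of the kind of routine involved, here is a minimal Gaussian sampler using the Marsaglia polar method. It is one common approach, offered only for orientation; it is not the 1973 algorithm or the RANLIB code Sean is describing:

    import math
    import random

    def gaussian_pair(mu=0.0, sigma=1.0):
        """Draw two independent N(mu, sigma^2) samples via the polar method."""
        while True:
            u = 2.0 * random.random() - 1.0
            v = 2.0 * random.random() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:  # reject points outside the unit disc
                factor = math.sqrt(-2.0 * math.log(s) / s)
                return mu + sigma * u * factor, mu + sigma * v * factor

    samples = [gaussian_pair()[0] for _ in range(100000)]
    print(round(sum(samples) / len(samples), 3))  # should be close to 0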

A bug hunting story to read and forward! Sean just bagged a forty-one (41) year old bug. What’s the oldest bug you have ever found?

When you reach the crux of the problem, you will understand why ambiguous, vague, incomplete and poorly organized standards annoy me to no end.

No guarantees of unambiguous results but if you need extra eyes on IT standards you know where to find me.

I first saw this in a tweet by Neil Saunders.

Classifying Shakespearean Drama with Sparse Feature Sets

Filed under: Feature Spaces,Sparse Data,Text Analytics — Patrick Durusau @ 4:06 pm

Classifying Shakespearean Drama with Sparse Feature Sets by Douglas Duhaime.

From the post:

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: “Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage” (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.
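As an illustration of the kind of sparse feature set Duhaime has in mind, you could count how many named characters are on stage in each scene and hand those counts to a classifier. A sketch under my own assumptions (the play data structure here is invented, not Duhaime’s code):

    from collections import Counter

    # Hypothetical input: each play is a list of scenes, each scene a list
    # of the named characters on stage in that scene.
    plays = {
        "Hamlet": [["Hamlet", "Horatio"], ["Hamlet"], ["Hamlet", "Ghost"]],
        "Twelfth Night": [["Viola", "Orsino", "Feste", "Maria"],
                          ["Viola", "Olivia", "Malvolio", "Feste"]],
    }

    def on_stage_features(scenes):
        """Sparse feature vector: how often each on-stage head count occurs."""
        counts = Counter(len(scene) for scene in scenes)
        total = sum(counts.values())
        return {"chars_on_stage=%d" % k: v / total for k, v in counts.items()}

    for title, scenes in plays.items():
        print(title, on_stage_features(scenes))
    # Tragedies should skew toward small head counts, comedies toward large ones.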

A quick reminder that not all text analysis is concerned with 140 character strings. 😉

Do you prefer:

high dimensional

where every letter in “high dimensional” is a hyperlink with an unknown target, or a fuller listing:

Allison, Sarah, and Ryan Heuser, Matthew Jockers, Franco Moretti, Michael Witmore. Quantitative Formalism: An Experiment

Jockers, Matthew. Machine-Classifying Novels and Plays by Genre

Hope, Jonathan and Michael Witmore. “The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches to Shakespeare’s Language of Genre

Hope, Jonathan. Shakespeare by the numbers: on the linguistic texture of the Late Plays

Hope, Jonathan and Michael Witmore. The Very Large Textual Object: A Prosthetic Reading of Shakespeare

Lenthe, Victor. Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

Stumpf, Mike. How Quickly Nature Falls Into Revolt: On Revisiting Shakespeare’s Genres

Stumpf, Mike. This Thing of Darkness (Part III)

Tootalian, Jacob A. Shakespeare, Without Measure: The Rhetorical Tendencies of Renaissance Dramatic Prose

Ullyot, Michael. Encoding Shakespeare

Witmore, Michael. A Genre Map of Shakespeare’s Plays from the First Folio (1623)

Witmore, Michael. Shakespeare Out of Place?

Witmore, Michael. Shakespeare Quarterly 61.3 Figures

Witmore, Michael. Visualizing English Print, 1530-1800, Genre Contents of the Corpus

Decompiling Shakespeare (Site is down. Was also down when the WayBack machine tried to archive the site in July of 2014)

I prefer the longer listing.

If you are interested in Shakespeare, Folger Digital Texts has free XML and PDF versions of his work.

I first saw this in a tweet by Gregory Piatetsky.

RNeo4j: Neo4j graph database combined with R statistical programming language

Filed under: Graphs,Neo4j,R — Patrick Durusau @ 2:46 pm

From the description:

RNeo4j combines the power of a Neo4j graph database with the R statistical programming language to easily build predictive models based on connected data. From calculating the probability of friends of friends connections to plotting an adjacency heat map based on graph analytics, the RNeo4j package allows for easy interaction with a Neo4j graph database.

Nicole is the author of the RNeo4j R package. Don’t be dismayed by the “What is a Graph” and “What is R” in the presentation outline. Mercifully only three minutes followed by a rocking live coding demonstration of the package!

Beyond Neo4j and R, use this webinar as a standard for the useful content that should appear in a webinar!

RNeo4j at Github.

How designers prototype at GDS

Filed under: Design,Interface Research/Design — Patrick Durusau @ 10:53 am

How designers prototype at GDS by Rebecca Cottrell.

From the post:

All of the designers at GDS can code or are learning to code. If you’re a designer who has used prototyping tools like Axure for a large part of your professional career, the idea of prototyping in code might be intimidating. Terrifying, even.

I’m a good example of that. When I joined GDS I felt intimidated by the idea of using Terminal and things like Git and GitHub, and just the perceived slowness of coding in HTML.

At first I felt my workflow had slowed down significantly, but the reason for that was the learning curve involved – I soon adapted and got much faster.

GDS has lots of tools (design patterns, code snippets, front-end toolkit) to speed things up. Sharing what I learned in the process felt like a good idea to help new designers get to grips with how we work.

Not a rigid set of prescriptions but experience at prototyping and pointers to other resources. Whether you have a current system of prototyping or not, you are very likely to gain a tip or two from this post.

I first saw this in a tweet by Ben Terrett.

ADW (Align, Disambiguate and Walk) [Semantic Similarity]

Filed under: Natural Language Processing,Semantics,Similarity — Patrick Durusau @ 10:28 am

ADW (Align, Disambiguate and Walk) version 1.0 by Mohammad Taher Pilehvar.

From the webpage:

This package provides a Java implementation of ADW, a state-of-the-art semantic similarity approach that enables the comparison of lexical items at different lexical levels: from senses to texts. For more details about the approach please refer to: http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf

The abstract for the paper reads:

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.
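The unifying move in the abstract, representing every linguistic item as a probability distribution over word senses, can be illustrated with a toy comparison. This is not the ADW algorithm (which builds its distributions by random walks over WordNet and uses its own comparison measure); it is only my sketch of what comparing two sense distributions looks like:

    import math

    def cosine(p, q):
        """Cosine similarity between two sparse distributions over senses."""
        dot = sum(p[s] * q[s] for s in set(p) & set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    # Hypothetical probability mass over WordNet-style senses.
    bank_finance = {"bank.n.01": 0.7, "bank.n.09": 0.3}   # institution, building
    bank_river   = {"bank.n.02": 0.8, "bank.n.01": 0.2}   # sloping land, institution
    savings_bank = {"bank.n.01": 0.9, "bank.n.09": 0.1}

    print(cosine(bank_finance, savings_bank))  # high
    print(cosine(bank_finance, bank_river))    # much lower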

Online Demo.

The strength of this approach is the use of multiple levels of semantic similarity. It relies on WordNet but the authors promise to extend their approach to named entities and other tokens not appearing in WordNet (like your company or industry’s internal vocabulary).

The bibliography of the paper cites much of the recent work in this area so that will be an added bonus for perusing the paper.

I first saw this in a tweet by Gregory Piatetsky.

October 13, 2014

The Dirty Little Secret of Cancer Research

Filed under: BigData,Bioinformatics,Genomics,Medical Informatics — Patrick Durusau @ 8:06 pm

The Dirty Little Secret of Cancer Research by Jill Neimark.

From the post:


Across different fields of cancer research, up to a third of all cell lines have been identified as imposters. Yet this fact is widely ignored, and the lines continue to be used under their false identities. As recently as 2013, one of Ain’s contaminated lines was used in a paper on thyroid cancer published in the journal Oncogene.

“There are about 10,000 citations every year on false lines—new publications that refer to or rely on papers based on imposter (human cancer) cell lines,” says geneticist Christopher Korch, former director of the University of Colorado’s DNA Sequencing Analysis & Core Facility. “It’s like a huge pyramid of toothpicks precariously and deceptively held together.”

For all the worry about “big data,” where is the concern over “big bad data?”

Or is “big data” too big for correctness of the data to matter?

Once you discover that a paper is based on “imposter (human cancer) cell lines,” how do you pass that information along to anyone who attempts to cite the article?

In other words, where do you write down that data about the paper, where the paper is the subject in question?

And how do you propagate that data across a universe of citations?

The post ends on a high note of current improvements but it is far from settled how to prevent reliance on compromised research.

I first saw this in a tweet by Dan Graur.

Scrape the Gibson: Python skills for data scrapers

Filed under: Python,Web Scrapers — Patrick Durusau @ 7:32 pm

Scrape the Gibson: Python skills for data scrapers by Brian Abelson.

From the post:

Two years ago, I learned I had superpowers. Steve Romalewski was working on some fascinating analyses of CitiBike locations and needed some help scraping information from the city’s data portal. Cobbling together the little I knew about R, I wrote a simple scraper to fetch the json files for each bike share location and output it as a csv. When I opened the clean data in Excel, the feeling was tantamount to this scene from Hackers:

Ever since then I’ve spent a good portion of my life scraping data from websites. From movies, to bird sounds, to missed connections, and john boards (don’t ask, I promise it’s for good!), there’s not much I haven’t tried to scrape. In many cases, I don’t even analyze the data I’ve obtained, and the whole process amounts to a nerdy version of sport hunting, with my comma-delimited trophies mounted proudly on Amazon S3.
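In the spirit of Brian’s first scraper, a minimal Python version of the fetch-JSON, write-CSV workflow. The endpoint and field names below are made up, so substitute those of whatever portal you are scraping:

    import csv
    import requests  # pip install requests

    URL = "https://example.com/data/bike_share_locations.json"  # hypothetical endpoint

    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    stations = response.json()  # assume a list of dicts

    with open("stations.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "lat", "lon"])
        writer.writeheader()
        for station in stations:
            writer.writerow({
                "id": station.get("id"),
                "name": station.get("name"),
                "lat": station.get("latitude"),
                "lon": station.get("longitude"),
            })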

Important post for two reasons:

  • Good introduction to the art of scraping data
  • Set the norm for sharing scraped data
    • The people who force scraping of data don’t want it shared, combined, merged or analyzed.

      You can help in disappointing them! 😉

Making of: Introduction to A*

Filed under: Graphics,Visualization — Patrick Durusau @ 4:23 pm

Making of: Introduction to A* by Amit Patel.

From the post:

(Warning: these notes are rough – the main page is here and these are some notes I wrote for a few colleagues and then I kept adding to it until it became a longer page)

Several people have asked me how I make the diagrams on my tutorials.

I need to learn the algorithm and data structures I want to demonstrate. Sometimes I already know them. Sometimes I know nothing about them. It varies a lot. It can take 1 to 5 months to make a tutorial. It’s slow, but the more I make, the faster I am getting.

I need to figure out what I want to show. I start with what’s in the algorithm itself: inputs, outputs, internal variables. With A*, the input is (start, goal, graph), the output is (parent pointers, distances), and the internal variables are (open set, closed set, parent pointers, distances, current node, neighbors, child node). I’m looking for the main idea to visualize. With A* it’s the frontier, which is the open set. Sometimes the thing I want to visualize is one of the algorithm’s internal variables, but not always.
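Amit’s tutorials are themselves built around Python, so a compact sketch of the pieces he lists (the frontier/open set, parent pointers, distances) may help make the prose concrete. This is my own minimal A*, written from the description above rather than copied from his code:

    import heapq

    def a_star(start, goal, neighbors, cost, heuristic):
        """Return (parent pointers, distances) after searching start -> goal."""
        frontier = [(0, start)]    # the open set, ordered by f = g + h
        parent = {start: None}     # parent pointers
        distance = {start: 0}      # g-values

        while frontier:
            _, current = heapq.heappop(frontier)
            if current == goal:
                break
            for nxt in neighbors(current):
                new_dist = distance[current] + cost(current, nxt)
                if nxt not in distance or new_dist < distance[nxt]:
                    distance[nxt] = new_dist
                    parent[nxt] = current
                    heapq.heappush(frontier, (new_dist + heuristic(nxt, goal), nxt))
        return parent, distance

    # Toy 4-connected grid example with unit step costs.
    def neighbors(p):
        x, y = p
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    parent, dist = a_star((0, 0), (2, 3), neighbors, lambda a, b: 1, manhattan)
    print(dist[(2, 3)])  # 5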

Pure gold on making diagrams for tutorials here. You may make different choices but it isn’t often that the process of making a choice is exposed.

Pass this along. We all benefit from better illustrations in tutorials!

The Big List of D3.js Examples (Approx. 2500 Examples)

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 3:48 pm

The Big List of D3.js Examples by Christophe Viau.

The interactive version has 2523 examples, whereas the numbered list has 1897 examples, as of 13 October 2014.

There is a rudimentary index of the examples. That’s an observation, not a complaint. Effective indexing of the examples would be a real challenge to the art of indexing.

The current index uses chart type, a rather open ended category. The subject matter of the chart would be another way to index. Indexing by the D3 techniques used would be useful, as would indexing by the data being combined with other data.

Effective access to the techniques and data represented by this collection would be awesome!

Give it some thought.

I first saw this in a tweet by Michael McGuffin.

Introduction to Graphing with D3.js

Filed under: D3,Graphics — Patrick Durusau @ 3:28 pm

Introduction to Graphing with D3.js by Jan Milosh.

From the post:

D3.js (d3js.org) stands for Data-Driven Documents, a JavaScript library for data visualization. It was created by Mike Bostock, based on his PhD studies in the Stanford University data visualization program. Mike now works at the New York Times who sponsors his open source work.

D3 was designed for more than just graphs and charts. It’s also capable of presenting maps, networks, and ordered lists. It was created for the efficient manipulation of documents based on data.

This demonstration will focus on creating a simple scatter plot.

If you are not already using D3 for graphics, Jan’s post is an easy introduction with additional references to take you further.

Enjoy!

I first saw this in a tweet by Christophe Viau.

Mirrors for Princes and Sultans:…

Filed under: History,Politics,Text Analytics — Patrick Durusau @ 3:07 pm

Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds by Lisa Blaydes, Justin Grimmer, and Alison McQueen.

Abstract:

Among the most significant forms of political writing to emerge from the medieval period are texts offering advice to kings and other high-ranking officials. Books of counsel varied considerably in their content and form; scholars agree, however, that such texts reflected the political exigencies of their day. As a result, writings in the “mirrors for princes” tradition offer valuable insights into the evolution of medieval modes of governance. While European mirrors (and Machiavelli’s Prince in particular) have been extensively studied, there has been less scholarly examination of a parallel political advice literature emanating from the Islamic world. We compare Muslim and Christian advisory writings from the medieval period using automated text analysis, identify sixty conceptually distinct topics that our method automatically categorizes into three areas of concern common to both Muslim and Christian polities, and examine how they evolve over time. We offer some tentative explanations for these trends.

If you don’t know the phrase, “mirrors for princes,”:

texts that seek to offer wisdom or guidance to monarchs and other high-ranking advisors.

Since nearly all bloggers and everyone with a byline in traditional media considers themselves qualified to offer advice to “…monarchs and other high-ranking advisors,” one wonders how the techniques presented would fare with modern texts?

Certainly a different style of textual analysis than is seen outside the humanities and so instructive for that purpose.
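If you want to experiment with the same general technique on modern advice texts, a rough sketch of topic modeling with gensim follows. The library and model are my choices for illustration; the paper does not say this is how its sixty topics were produced:

    from gensim import corpora, models  # pip install gensim

    # Hypothetical corpus: each document is a tokenized advice text.
    documents = [
        "the prince must keep faith with his counselors".split(),
        "a ruler should levy taxes justly and guard the treasury".split(),
        "the sultan ought to appoint honest viziers and judges".split(),
    ]

    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Three toy documents cannot support sixty topics, so ask for two.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic in lda.print_topics():
        print(topic)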

I do wonder about the comparison of texts in translation into English. Obviously easier but runs the risk of comparing translators to translators and not so much the thoughts of the original authors.

I first saw this in a tweet by Christopher Phipps.

Measuring Search Relevance

Filed under: Relevance,Searching — Patrick Durusau @ 10:33 am

Measuring Search Relevance by Hugh E. Williams.

From the post:

The process of asking many judges to assess search performance is known as relevance judgment: collecting human judgments on the relevance of search results. The basic task goes like this: you present a judge with a search result, and a search engine query, and you ask the judge to assess how relevant the item is to the query on (say) a four-point scale.

Suppose the query you want to assess is ipod nano 16Gb. Imagine that one of the results is a link to Apple’s page that describes the latest Apple iPod nano 16Gb. A judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale). They’d then click on a radio button to record their vote and move on to the next task. If the result we showed them was a story about a giraffe, the judge might decide this result is “irrelevant” (say the lowest rating on the four point scale). If it were information about an iPhone, it might be “partially relevant” (say the second-to-lowest), and if it were a review of the latest iPod nano, the judge might say “relevant” (it’s not perfect, but it sure is useful information about an Apple iPod).

The human judgment process itself is subjective, and different people will make different choices. You could argue that a review of the latest iPod nano is a “great result” — maybe you think it’s even better than Apple’s page on the topic. You could also argue that the definitive Apple page isn’t terribly useful in making a buying decision, and you might only rate it as relevant. A judge who knows everything about Apple’s products might make a different decision to someone who’s never owned a digital music player. You get the idea. In practice, judging decisions depend on training, experience, context, knowledge, and quality — it’s an art at best.

There are a few different ways to address subjectivity and get meaningful results. First, you can ask multiple judges to assess the same results to get an average score. Second, you can judge thousands of queries, so that you can compute metrics and be confident statistically that the numbers you see represent true differences in performance between algorithms. Last, you can train your judges carefully, and give them information about what you think relevance means.
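Once you have graded judgments, the usual next step is to roll them up into a metric. Hugh’s post is about the judging process itself; the metric below, NDCG, is simply one common choice, computed here over hypothetical grades:

    import math

    def dcg(grades):
        """Discounted cumulative gain for a ranked list of graded results."""
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

    def ndcg(grades):
        """DCG normalized by the best possible ordering of the same grades."""
        ideal = dcg(sorted(grades, reverse=True))
        return dcg(grades) / ideal if ideal else 0.0

    # Hypothetical judgments for the top five results of one query, on a
    # 0-3 scale: 3 = great, 2 = relevant, 1 = partially relevant, 0 = irrelevant.
    print(round(ndcg([3, 1, 2, 0, 2]), 3))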

An illustrated walk through measuring search relevance. Useful for a basic understanding of the measurement process and its parameters.

Bookmark this post so that when you tell your judges what “…relevance means,” you can return here and post what you told them.

I ask because I deeply suspect that our ideas of “relevance” vary widely from subject to subject.

Thanks!

October 12, 2014

Twitter Mapping: Foundations

Filed under: Mapping,Maps,Tweets — Patrick Durusau @ 10:32 am

Twitter Mapping: Foundations by Simon Rogers.

From the post:

With more than 500 million tweets sent every day, Twitter data as a whole can seem huge and unimaginable, like cramming the contents of the Library of Congress into your living room.

One way of trying to make that big data understandable is by making it smaller and easier to handle by giving it context; by putting it on a map.

It’s something I do a lot—I’ve published over 1,000 maps in the past five years, mostly at Guardian Data. At Twitter, with 77% of users outside the US, it’s often aimed at seeing if regional variations can give us a global picture, an insight into the way a story spreads around the globe. Here’s what I’ve learned about using Twitter data on maps.

… (lots of really cool maps and links omitted)

Creating data visualizations is simpler now than it’s ever been, with a plethora of tools (free and paid) meaning that any journalist working in any newsroom can make a chart or a map in a matter of minutes. Because of time constraints, we often use CartoDB to animate maps of tweets over time. The process is straightforward—I’ve written a how-to guide on my blog that shows how to create an animated map of dots using the basic interface, and if the data is not too big it won’t cost you anything. CartoDB is also handy for other reasons: as it has access to Twitter data, you can use it to get the geotagged tweets too. And it’s not the only one: Trendsmap is a great way to see location of conversations over time.

Have you made a map with Twitter Data that tells a compelling story? Share it with us via @TwitterData.

While composing this post I looked at the CartoDB solution for geotagged tweets and while impressive, it is currently in beta with a starting price of $300/month. That works if you get your expenses paid but is a bit pricey for occasional use.

There is a free option for CartoDB (up to 50 MB of data) but I don’t think it includes the Twitter capabilities.

Sample mapping tweets on your favorite issues. Maps are persuasive in ways that are not completely understood.

October 11, 2014

Making Your First Map

Filed under: MapBox,Mapping,Maps — Patrick Durusau @ 1:09 pm

Making Your First Map from Mapbox.

From the webpage:

Regardless of your skill level, we have the tools that allow you to quickly build maps and share them online in minutes.

In this guide, we’ll cover the basics of our online tool, the Mapbox Editor, by creating a store location map for a bike shop.

A great example of the sort of authoring interface that is needed by topic maps.

Hmmm, by the way, did you notice that “…creating a store location map for a bike shop” is creating an association between the “bike shop” and a “street location?” True, Mapbox doesn’t include roles or the association type but the role players are present.

For a topic map authoring interface, you could default the role of any geographic point on the map to location and the association type to “street-location.”

The user would only have to pick, possibly from a pick list, the role of the role player, bike shop, bar, etc.
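A sketch of what that defaulting might look like in an authoring interface’s data model. The structures are entirely hypothetical, just to make the roles concrete:

    # Dropping a point on the map creates an association whose type and
    # "location" role are defaulted; the user only picks the other role.
    def make_association(point, player_topic, player_role):
        return {
            "type": "street-location",
            "roles": [
                {"role": "location", "player": {"lat": point[0], "lon": point[1]}},
                {"role": player_role, "player": player_topic},
            ],
        }

    assoc = make_association((38.9, -77.03), "Joe's Bike Shop", "bike shop")
    print(assoc["type"], [r["role"] for r in assoc["roles"]])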

Mapbox could have started their guide with a review of map projections, used and theoretical.

Or covered the basics of surveying and a brief overview of surveying instruments. They didn’t.

I think there is a lesson there.

Microsoft’s Quantum Mechanics

Filed under: Microsoft,Quantum,Semantics — Patrick Durusau @ 11:52 am

Microsoft’s Quantum Mechanics by Tom Simonite.

From the post:

In 2012, physicists in the Netherlands announced a discovery in particle physics that started chatter about a Nobel Prize. Inside a tiny rod of semiconductor crystal chilled cooler than outer space, they had caught the first glimpse of a strange particle called the Majorana fermion, finally confirming a prediction made in 1937. It was an advance seemingly unrelated to the challenges of selling office productivity software or competing with Amazon in cloud computing, but Craig Mundie, then heading Microsoft’s technology and research strategy, was delighted. The abstruse discovery—partly underwritten by Microsoft—was crucial to a project at the company aimed at making it possible to build immensely powerful computers that crunch data using quantum physics. “It was a pivotal moment,” says Mundie. “This research was guiding us toward a way of realizing one of these systems.”

Microsoft is now almost a decade into that project and has just begun to talk publicly about it. If it succeeds, the world could change dramatically. Since the physicist Richard Feynman first suggested the idea of a quantum computer in 1982, theorists have proved that such a machine could solve problems that would take the fastest conventional computers hundreds of millions of years or longer. Quantum computers might, for example, give researchers better tools to design novel medicines or super-efficient solar cells. They could revolutionize artificial intelligence.

Fairly upbeat review of current efforts to build a quantum computer.

You may want to offset it by reading Scott Aaronson’s blog, Shtetl-Optimized, which has the following header note:

If you take just one piece of information from this blog:
Quantum computers would not solve hard search problems
instantaneously by simply trying all the possible solutions at once. (emphasis added)

See in particular: Speaking Truth to Parallelism at Cornell

Whatever speedups are possible with quantum computers, getting a semantically incorrect answer faster isn’t an advantage.

Assumptions about faster computing platforms include an assumption of correct semantics. There have been no proofs of default correct handling of semantics by present day or future computing platforms.

I first saw this in a tweet by Peter Lee.

PS: I saw the reference to Scott Aaronson’s blog in a comment to Tom’s post.

The 100 Worst Landlords in New York City [Here Be Bastards]

Filed under: Mapping,Maps — Patrick Durusau @ 11:17 am

The 100 Worst Landlords in New York City

A great illustration of the power of mapping to bring information together! (Like a topic map does.)

I don’t live in New York so the classes of violations (mis-named “details”) weren’t helpful to me. Nor were the actual “details” of particular violations available. (If I am wrong on that, please post a response saying how to obtain the details via the map interface.)

Suggested Improvement: Names of owners as hyperlinks to their residences on a map with the notation “Here Be Bastards” (a riff off of the “Here Be Dragons” from early sea maps).

Spark Breaks Previous Large-Scale Sort Record

Filed under: BigData,Hadoop,Spark — Patrick Durusau @ 10:28 am

Spark Breaks Previous Large-Scale Sort Record by Reynold Xin.

From the post:

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.

To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

Additionally, while no official petabyte (PB) sort competition exists, we pushed Spark further to also sort 1 PB of data (10 trillion records) on 190 machines in under 4 hours. This PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines). To the best of our knowledge, this is the first petabyte-scale sort ever done in a public cloud.

Bottom line: Sorted 100 TB of data in 23 minutes, beat old record of 72 minutes, on fewer machines.
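Back-of-envelope arithmetic on the reported figures shows the per-node improvement is even more striking than the headline numbers:

    # Figures from the post: 100 TB in 72 min on 2100 nodes (Hadoop MR)
    # versus 100 TB in 23 min on 206 nodes (Spark).
    hadoop_gb_per_min_per_node = 100000 / 72 / 2100
    spark_gb_per_min_per_node = 100000 / 23 / 206

    print(round(hadoop_gb_per_min_per_node, 2))  # ~0.66
    print(round(spark_gb_per_min_per_node, 2))   # ~21.11
    print(round(spark_gb_per_min_per_node / hadoop_gb_per_min_per_node, 1))  # ~31.9x per node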

Read Reynold’s post and then get thee to Apache Spark!

I first saw this in a tweet by paco nathan.

October 10, 2014

Visualizing MNIST: An Exploration of Dimensionality Reduction

Filed under: Dimension Reduction,Machine Learning — Patrick Durusau @ 7:00 pm

Visualizing MNIST: An Exploration of Dimensionality Reduction by Christopher Olah.

From the post:

At some fundamental level, no one understands machine learning.

It isn’t a matter of things being too complicated. Almost everything we do is fundamentally very simple. Unfortunately, an innate human handicap interferes with us understanding these simple things.

Humans evolved to reason fluidly about two and three dimensions. With some effort, we may think in four dimensions. Machine learning often demands we work with thousands of dimensions – or tens of thousands, or millions! Even very simple things become hard to understand when you do them in very high numbers of dimensions.

Reasoning directly about these high dimensional spaces is just short of hopeless.

As is often the case when humans can’t directly do something, we’ve built tools to help us. There is an entire, well-developed field, called dimensionality reduction, which explores techniques for translating high-dimensional data into lower dimensional data. Much work has also been done on the closely related subject of visualizing high dimensional data.

These techniques are the basic building blocks we will need if we wish to visualize machine learning, and deep learning specifically. My hope is that, through visualization and observing more directly what is actually happening, we can understand neural networks in a much deeper and more direct way.

And so, the first thing on our agenda is to familiarize ourselves with dimensionality reduction. To do that, we’re going to need a dataset to test these techniques on.
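If you would rather follow along in code than in the post’s interactive figures, scikit-learn’s small digits dataset makes a convenient stand-in for MNIST. This is my own minimal example, not Christopher’s (his visualizations are custom-built):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # 8x8 handwritten digits: 64 dimensions per image.
    digits = load_digits()
    coords = PCA(n_components=2).fit_transform(digits.data)

    plt.scatter(coords[:, 0], coords[:, 1], c=digits.target, s=8)
    plt.colorbar(label="digit")
    plt.title("64-dimensional digits projected to 2D with PCA")
    plt.show()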

Extremely useful illustration of dimensionality reduction, exploring “recognition” of handwritten digits.

I agree that “Reasoning directly about these high dimensional spaces is just short of hopeless.” However, unlike our machines, humans don’t need high dimensional spaces in order to recognize handwritten digits. 😉

I first saw this in a tweet by Christophe Lalanne.

Supporting Open Annotation

Filed under: Annotation,WWW — Patrick Durusau @ 4:34 pm

Supporting Open Annotation by Gerben.

From the post:

In its mission to connect the world’s knowledge and thoughts, the solution Hypothes.is pursues is a web-wide mechanism to create, share and discover annotations. One of our principal steps towards this end is providing a browser add-on that works with our annotation server, enabling people to read others’ annotations on any web page they visit, and to publish their own annotations for others to see.

I spent my summer interning at Hypothes.is to work towards a longer term goal, taking annotation sharing to the next level: an open, decentralised approach. In this post I will describe how such an approach could work, how standardisation efforts are taking off to get us there, and how we are involved in making this happen – the first step being support for the preliminary Open Annotation data model.

An annotation ecosystem

While we are glad to provide a service enabling people to create and share annotations on the web, we by no means want to become the sole annotation service provider, as this would imply creating a monopoly position and a single point of failure. Rather, we encourage anyone to build annotation tools and services, possibly using the same code we use. Of course, a problematic consequence of having multiple organisations each running separate systems is that even more information silos emerge on the web, and quite likely the most popular service would obtain a monopoly position after all.

To prevent either fragmentation or monopolisation of the world’s knowledge, we would like an ecosystem to evolve, comprising interoperable annotation services and client implementations. Such an ecosystem would promote freedom of innovation, prevent dependence on a single party, and provide scalability and robustness. It would be like the architecture of the web itself.

Not a bad read if you accept the notion that interoperable annotation servers are an acceptable architecture for annotation of web resources.

Me? I would just as soon put:

target

on my annotation and mail the URL to the CIA, FBI, NSA and any foreign intelligence agencies that I can think of with a copy of my annotation.

You can believe that government agencies will follow the directives of Congress with regard to spying on United States citizens, but then that was already against the law. Remember the old saying, “Fool me once, shame on you. Fool me twice, shame on me.”? That is applicable to government surveillance.

We need robust annotation mechanisms but not ones that make sitting targets out of our annotations. Local, encrypted annotation mechanisms that can cooperate with other local, encrypted annotation mechanisms would be much more attractive to me.

I first saw this in a tweet by Ivan Herman.

Flexible Neo4j Batch Import with Groovy

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:03 pm

Flexible Neo4j Batch Import with Groovy by Michael Hunger.

From the post:

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of million lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

  • not create legacy indexes
  • not index properties at all that you just need for connecting data
  • create schema indexes
  • skip certain columns
  • rename properties from the column names
  • create your own labels based on the data in the row
  • convert column values into Neo4j types (e.g. split strings or parse JSON)

Michael helps you avoid the defaults of batch importing into Neo4j.
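Michael’s examples are in Groovy. Purely as an illustration of the same kind of column shaping (skip columns, rename properties, convert values) before import, here is a Python sketch of my own, not taken from his post:

    import csv
    import json

    RENAME = {"person_name": "name", "person_age": "age"}  # hypothetical columns
    SKIP = {"internal_id"}  # columns only needed for joining, not as properties

    def shape_row(row):
        """Rename, skip and convert columns before handing rows to an importer."""
        props = {}
        for col, value in row.items():
            if col in SKIP:
                continue
            key = RENAME.get(col, col)
            if key == "age":
                value = int(value)         # store as an integer property
            elif key == "tags":
                value = json.loads(value)  # parse a JSON-encoded column
            props[key] = value
        return props

    with open("people.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(shape_row(row))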

Lance’s Lesson – Gödel Incompleteness

Filed under: Mathematical Reasoning,Mathematics,Philosophy — Patrick Durusau @ 3:45 pm

Lance’s Lesson – Gödel Incompleteness by Lance Fortnow.

The “entertainment” category on YouTube is very flexible since it included this lesson on Gödel Incompleteness. 😉

Lance uses Turing machines to “prove” the first and second incompleteness theorems in under a page of notation.

Recognizing patterns in genomic data

Filed under: Bioinformatics,Biomedical,Genomics,Medical Informatics — Patrick Durusau @ 3:13 pm

Recognizing patterns in genomic data – New visualization software uncovers cancer subtypes from a vast repository of biomedical information by Stephanie Dutchen.

From the post:

Much of biomedical research these days is about big data—collecting and analyzing vast, detailed repositories of information about health and disease. These data sets can be treasure troves for investigators, often uncovering genetic mutations that drive a particular kind of cancer, for example.

Trouble is, it’s impossible for humans to browse that much data, let alone make any sense of it.

“It’s [StratomeX] a tool to help you make sense of the data you’re collecting and find the right questions to ask,” said Nils Gehlenborg, research associate in biomedical informatics at HMS and co-senior author of the correspondence in Nature Methods. “It gives you an unbiased view of patterns in the data. Then you can explore whether those patterns are meaningful.”

The software, called StratomeX, was developed to help researchers distinguish subtypes of cancer by crunching through the incredible amount of data gathered as part of The Cancer Genome Atlas, a National Institutes of Health–funded initiative. Identifying distinct cancer subtypes can lead to more effective, personalized treatments.

When users input a query, StratomeX compares tumor data at the molecular level that was collected from hundreds of patients and detects patterns that might indicate significant similarities or differences between groups of patients. The software presents those connections in an easy-to-grasp visual format.

“It helps you make meaningful distinctions,” said co-first author Alexander Lex, a postdoctoral researcher in the Pfister group.

Other than the obvious merits of this project, note the role of software as the assistant to the user. It crunches the numbers in a specific domain and presents those results in a meaningful fashion.

It is up to the user to decide which patterns are useful and which are not. Shades of “recommending” other instances of the “same” subject?

StratomeX is available for download.

I first saw this in a tweet by Harvard SEAS.

No Query Language Needed: Using Python with an Ordered Key-Value Store

Filed under: FoundationDB,Key-Value Stores,Python — Patrick Durusau @ 10:50 am

No Query Language Needed: Using Python with an Ordered Key-Value Store by Stephen Pimentel.

From the post:

FoundationDB is a complex and powerful database, designed to handle sharding, replication, network hiccups, and server failures gracefully and automatically. However, when we designed our Python API, we wanted most of that complexity to be hidden from the developer. By utilizing familiar features, such as generators, itertools, and comprehensions, we tried to make FoundationDB’s API as easy to use as a Python dictionary.

In the video below, I show how FoundationDB lets you query data directly using Python language features, rather than a separate query language.

Most applications have back-end data stores that developers need to query. This talk presents an approach to storing and querying data that directly employs Python language features. Using the Key-Value Store, we can make our data persistent with an interface similar to a Python dictionary. Python then gives us a number of tools “out of the box” that we can use to form queries:

  • generators for memory-efficient data retrieval;
  • itertools to filter and group data;
  • comprehensions to assemble the query results.

Taken together, these features give us a query capability using straight Python. The talk walks through a number of example queries using the Enron email dataset.
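To show the querying style Stephen describes while keeping the example self-contained, here is a sketch that uses a plain dict with tuple keys as a stand-in for the ordered key-value store (the talk itself uses the FoundationDB Python bindings; the data below is invented):

    from itertools import groupby

    # Stand-in for an ordered key-value store: tuple keys, sorted iteration.
    store = {
        ("email", "alice@example.com", "2001-05-01"): "quarterly numbers...",
        ("email", "alice@example.com", "2001-06-12"): "re: meeting",
        ("email", "bob@example.com", "2001-05-03"): "forecast attached",
    }

    def scan(prefix):
        """Generator over (key, value) pairs whose key starts with `prefix`."""
        for key in sorted(store):
            if key[:len(prefix)] == prefix:
                yield key, store[key]

    # Comprehension: every message body sent by alice.
    alice = [value for _, value in scan(("email", "alice@example.com"))]

    # itertools.groupby: message counts per sender.
    per_sender = {sender: len(list(group))
                  for sender, group in groupby(scan(("email",)),
                                               key=lambda kv: kv[0][1])}
    print(alice, per_sender)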

Code and details: https://github.com/stephenpiment/object-store.

More motivation to take a look at FoundationDB!

I do wonder about the “no query language needed.” Users, despite their poor results, appear to be committed to querying and query languages.

Whether it is the illusion of “empowerment” of users, the current inability to measure the cost of ineffectual searching, or acceptance of poor search results, search and search operators continue to be the preferred means of interaction. Plan accordingly.

I first saw this in a tweet by Hari Kishan.

Super-Detailed Interactive 3-D Seafloor Map

Filed under: Mapping,Maps,Oceanography — Patrick Durusau @ 9:47 am

Super-Detailed Interactive 3-D Seafloor Map by Nick Stockton.

From the post:

This super-detailed map of the ocean floor’s topography is based on satellite measurements of subtle lumps on the ocean’s surface. These lumps of water, which are subtle, low, and wide on the ocean’s surface, are caused by the gravitational pull of underwater features like mountains and ridges. The team of scientists wrapped their data around a Google Earth globe, so you and I could explore it ourselves, in the visualization above.

The map has more than twice the resolution of previous seafloor maps, and shows a plethora of never-before-seen features. These include thousands of volcanoes and what could be the ridge where two plates pulled apart to create the Gulf of Mexico. The map is part of new research published last week in Science.

The visualization at the top of the page (click here for a full screen view) lets you play with the vertical exaggeration of both continental and subsea topography using the upper left drop-down menu. (They might seem huge to us at ground level, but the planet’s mountains and valleys are almost imperceptible from the vantage of space.) Another visualization of the study’s map lets you drag a time bar to simulate the movement of tectonic plates.

Great seafloor map and visualization techniques!

Read Nick’s post to get some background on “gravitational mapping.” In short, gravitational mapping relies on the impact of features of the seafloor on ocean height to create detailed seafloor maps.

Sounds like very interesting data sets with many discoveries left to be made.

Last Call: XQuery 3.1 and XQueryX 3.1; and additional supporting documents

Filed under: JSON,XPath,XQuery,XSLT — Patrick Durusau @ 9:02 am

Last Call: XQuery 3.1 and XQueryX 3.1; and additional supporting documents

From the post:

Today the XQuery Working Group published a Last Call Working Draft of XQuery 3.1 and XQueryX 3.1. Additional supporting documents were published jointly with the XSLT Working Group: a Last Call Working Draft of XPath 3.1, together with XPath Functions and Operators, XQuery and XPath Data Model, and XSLT and XQuery Serialization. XQuery 3.1 and XPath 3.1 introduce improved support for working with JSON data with map and array data structures as well as loading and serializing JSON; additional support for HTML class attributes, HTTP dates, scientific notation, cross-scaling between XSLT and XQuery and more. Comments are welcome through 7 November 2014. Learn more about the XML Activity.

How closely do you read?

To answer that question, read all the mentioned documents by 7 November 2014, keeping a list of errors you spot.

Submit your list to the XQuery Working Group by 7 November 2014 and score your reading based on the number of “errors” accepted by the working group.

What is your W3C Proofing Number? (Average number of accepted “errors” divided by the number of W3C drafts where “errors” were submitted.)

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Filed under: Cheminformatics,Chemistry,Corpora,Text Corpus,Text Mining — Patrick Durusau @ 8:37 am

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)

Abstract:

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

Highly recommended both as a “gold standard” for chemical patent text mining but also as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

October 9, 2014

These aren’t the reducing functions you are looking for

Filed under: Clojure,Functional Programming,Programming — Patrick Durusau @ 7:06 pm

These aren’t the reducing functions you are looking for by Christophe Grand.

From the post:

Transducers are powerful and easy to grasp when they claim they transform reducing functions. However once you scratch their surface you quickly realize that’s not their true nature: they transform stateful processes.

In a previous post, I explained why seeded transduce forces transducers to return stateful reducing functions. However this can be fixed. The current implementation of transduce reads:

Heavy sledding but the leading edge always is. Relevance depends on your need for/dedication to advancing a paradigm.
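If Clojure is not your daily driver, a rough Python analogue may help fix the core idea of a transducer as something that transforms a reducing function. This is my sketch of the stateless surface picture only, not Christophe’s code and not how Clojure implements transduce:

    from functools import reduce

    def mapping(f):
        """A transducer: takes a reducing function, returns a new one."""
        return lambda rf: (lambda acc, x: rf(acc, f(x)))

    def filtering(pred):
        return lambda rf: (lambda acc, x: rf(acc, x) if pred(x) else acc)

    def transduce(xform, rf, init, coll):
        return reduce(xform(rf), coll, init)

    add = lambda acc, x: acc + x
    xform = lambda rf: mapping(lambda x: x * x)(filtering(lambda x: x % 2 == 0)(rf))

    # Sum of the even squares of 0..9, with no intermediate lists built.
    print(transduce(xform, add, 0, range(10)))  # 120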

I first saw this in a tweet by Johnathan Winandy.

2014 State of Clojure & ClojureScript Survey

Filed under: Clojure,ClojureScript,Conferences,Functional Programming — Patrick Durusau @ 6:44 pm

2014 State of Clojure & ClojureScript Survey by Alex Miller.

From the post:

For the past four years, Chas Emerick has run an annual State of Clojure survey (expanded last year to cover ClojureScript as well). Due to recent happy arrivals in the Emerick household, Chas has asked Cognitect to run the survey this year.

The survey has been broken into two parts. A link to the second survey will appear after you submit the first, or you can use these links directly:

The surveys will be open until October 17th. Shortly thereafter we will release all of the data and some analysis.

If you’re not yet planning to attend the Clojure/conj in Washington DC, Nov 20-22, tickets are on sale now (regular registration rate ends Oct. 17th)!

Summary:

Two Important Dates:

Oct. 17th – Deadline for State of Clojure/ClojureScript 2014 surveys and regular registration for Clojure/conj.

Nov. 20-22 – Clojure/conj in Washington, DC.

Avoid entering Oct. 17th in your calendar by completing the surveys and purchasing your ticket to Clojure/conj after reading this post.
