Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 8, 2014

Rent Data – USA

Filed under: Data,Politics — Patrick Durusau @ 3:46 pm

These 7 Charts Show Why the Rent Is Too Damn High…and what can be done about it. by Erika Eichelberger and AJ Vicens.

Nothing really compares to a Mother Jones article when they get into full voice. 😉

For example:

More Americans than ever before are unable to afford rent. Here’s a look at why the rent is too damn high and what can be done about it.

Part of the problem has to do with simple supply and demand. Millions of Americans lost their homes during the foreclosure crisis, and many of those folks flooded into the rental market. In 2004, 31 percent of US households were renters, according to HUD. Today that number is 35 percent. “With more people trying to get into same number of units you get an incredible pressure on prices,” says Shaun Donovan*, the former secretary of housing and urban development for the Obama administration.

If you are interested in a data set to crunch on a current public policy issue, the problem of affordable housing is as good as any. All the data cited in this article is available for downloading.

It would take more data mining, but identifying those who benefit from a tight rental market versus those who would profit from public housing assistance (such as construction firms and rental management agencies), and then comparing both groups against political donations and support, would make an interesting exercise.

Housing assistance does benefit people being oppressed by high rent but others benefit as well. If you would like to pursue that question, ping me. I have some ideas on where to look for evidence.

ROpenSci News – August 2014

Filed under: R,Topic Maps — Patrick Durusau @ 3:23 pm

Community conversations and a new package for full text by Scott Chamberlain and Karthik Ram.

ROpenSci announces they are reopening their public Google list.

We encourage you to sign up and post ideas for packages, solicit feedback on new ideas, and most importantly find other collaborators who share your domain interests. We also plan to use the list to solicit feedback on some of the bigger rOpenSci projects early on in the development phase allowing our community to shape future direction and also collaborate where appropriate.

Among the work that is underway:

Through time we have been attempting to unify our R packages that interact with individual data sources into single packages that handle one use case. For example, spocc aims to create a single entry point to many different sources (currently 6) of species occurrence data, including GBIF, AntWeb, and others.

Another area we hope to simplify is acquiring text data, specifically text from scholarly journal articles. We call this R package fulltext. The goal of fulltext is to allow a single user interface to searching for and retrieving full text data from scholarly journal articles. Rather than learning a different interface for each data source, you can learn one interface, making your work easier. fulltext will likely only get you data, and make it easy to browse that data, and use it downstream for manipulation, analysis, and vizualization.

We currently have R packages for a number of sources of scholarly article text, including for Public Library of Science (PLOS), Biomed Central (BMC), and eLife – which could all be included in fulltext. We can add more sources as they become available.

Instead of us rOpenSci core members planning out the whole package, we'd love to get the community involved at the beginning.

The “individual data sources into single packages” sounds particularly ripe for enhancement with topic map based ideas.

This isn’t a plea for topic map syntax or modeling, although either would make a nice output option. The critical idea is to identify central subjects with key/value pairs, enabling robust identification of those subjects by later users.

Surface tokens with unexpressed contexts set hard boundaries on the usefulness and accuracy of search results. If we capture what is known to identify surface tokens, we enrich our world and the world of others.
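For what the “single entry point to many sources” pattern might look like, stripped to its bones, here is a hedged Python sketch (not the rOpenSci R API); the source names, fields, and results are placeholders:

```python
# A minimal Python sketch of the "single entry point to many sources" idea,
# not the rOpenSci fulltext API. Source names and fields are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Article:
    title: str
    source: str
    # Key/value pairs identifying the subject (DOI, source-specific IDs, etc.),
    # so later users can decide whether two records describe the same article.
    identifiers: Dict[str, str] = field(default_factory=dict)


# Registry of per-source search functions behind one interface.
SOURCES: Dict[str, Callable[[str], List[Article]]] = {}


def register(name: str):
    def wrap(fn: Callable[[str], List[Article]]):
        SOURCES[name] = fn
        return fn
    return wrap


@register("plos")
def search_plos(query: str) -> List[Article]:
    # Placeholder: a real implementation would call the PLOS search API.
    return [Article("Example PLOS hit", "plos", {"doi": "10.1371/example"})]


@register("bmc")
def search_bmc(query: str) -> List[Article]:
    # Placeholder: a real implementation would call the BMC API.
    return [Article("Example BMC hit", "bmc", {"doi": "10.1186/example"})]


def search(query: str, sources: List[str] = None) -> List[Article]:
    """One call fans out to every registered source."""
    results = []
    for name in sources or SOURCES:
        results.extend(SOURCES[name](query))
    return results


if __name__ == "__main__":
    for hit in search("topic maps"):
        print(hit.source, hit.title, hit.identifiers)
```

The registry plus the identifier key/value pairs is the part I would want any “unified” package to expose, whatever syntax it ships with.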

Understanding Clojure Transducers Through Types

Filed under: Clojure,Functional Programming — Patrick Durusau @ 3:02 pm

Understanding Clojure Transducers Through Types by Franklin Chen.

From the post:

Yesterday, Rich Hickey published a blog post, “Transducers are Coming”, which attracted a lot of attention.

I have a confession to make, which I have made before: I find it very difficult to understand ideas or code not presented with types. So I decided that the only way I could possibly understand what “transducers” are would be to actually implement them in a typed language. I ended up doing so and am sharing my findings here.

Franklin’s post is some of the discussion that Rich Hickey promised would follow!

Read both the post and the comments, which includes further comments from Rich Hickey on transducers.

I first saw this in a tweet by Bruno Lara Tavares.

Juju Charm (HPCC Systems)

Filed under: BigData,HPCC,Unicode — Patrick Durusau @ 1:44 pm

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.

….

5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here: https://github.com/hpcc-systems
The HPCC Systems platform can be found here: http://hpccsystems.com/download/free-community-edition

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u’ma’ AND u’me’), it would bring back records for ‘mai’, ’Mai’ and ‘may’. However a query on the index for that dataset, idx(ufield BETWEEN u’ma’ AND u’me’), only brings back a record for ‘mai’.

This is a result of the way unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches or is less than or greater than another value. Integers are stored in bigendian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using the ICU locale sensitive comparisons to ensure correct ordering. However, index lookup operations need to be fast and therefore the lookup operations perform binary comparisons on fixed length blocks of data. Equality checks will return data correctly, but queries involving between, > or < may fail.

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
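To see the flavor of the problem outside ECL, here is a small Python sketch. It illustrates the general binary-versus-collation mismatch (the case-variant record dropped from a range), not HPCC’s exact byte layout, and casefold() stands in for a real ICU collation, which is an oversimplification:

```python
# Why a byte/code-point range check can drop records that a locale-aware
# comparison would keep. Plain Python; casefold() is only a stand-in for a
# real ICU collation.

records = ["mai", "Mai", "may"]
low, high = "ma", "me"

# Index-style lookup: binary comparison of the raw code points.
binary_hits = [r for r in records if low <= r <= high]
print(binary_hits)     # ['mai', 'may'] -- 'Mai' drops out: 'M' (U+004D) sorts before 'm' (U+006D)

# Dataset-style comparison: collation-like key (here, simple case folding).
collated_hits = [r for r in records if low <= r.casefold() <= high]
print(collated_hits)   # ['mai', 'Mai', 'may']
```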

August 7, 2014

Sex Only When Lights Are Off

Filed under: Security — Patrick Durusau @ 7:44 pm

Extracting audio from visual information: Algorithm recovers speech from the vibrations of a potato-chip bag filmed through soundproof glass. by Larry Hardesty.

From the post:

Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.

In other experiments, they extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and even the leaves of a potted plant. The researchers will present their findings in a paper at this year’s Siggraph, the premier computer graphics conference.

“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author on the new paper. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”

A big shout-out to MIT, Microsoft, and Adobe for taking privacy to a new low!

The article cites “…obvious applications in law enforcement and forensics….”

Illegitimate governments, I could name a few, think of their activities as “law enforcement.”

You may want to read up on laser microphones. Being entirely passive, this latest technique will avoid some of the detection difficulties with laser microphones.

I don’t know if wavy glass, one defense against laser microphones, will be effective against this new privacy threat.

On the other hand, there’s always the light switch. 😉

Mapbox GL For The Web

Filed under: Graphics,Javascript,MapBox,Visualization — Patrick Durusau @ 4:37 pm

Mapbox GL For The Web: An open source JavaScript framework for client-side vector maps by Eric Gundersen.

From the post:

Announcing Mapbox GL JS — a fast and powerful new system for web maps. Mapbox GL JS is a client-side renderer, so it uses JavaScript and WebGL to dynamically draw data with the speed and smoothness of a video game. Instead of fixing styles and zoom levels at the server level, Mapbox GL puts power in JavaScript, allowing for dynamic styling and freeform interactivity. Vector maps are the next evolution, and we’re excited to see what developers build with this framework. Get started now.

This rocks!

I’m not going to try to reproduce the examples here, so see the original post!

What high performance maps are you going to create?

Ebola: “Highly contagious…” or…

Filed under: Semantics,Topic Maps — Patrick Durusau @ 2:30 pm

NPR has developed a disturbing range of semantics for the current Ebola crisis.

Consider these two reports, one on August 7th and one on August 2nd, 2014.

Aug. 7th: Officials Fear Ebola Will Spread Across Nigeria

Dave Greene – Reports on there only being two or three cases in Lagos, but Nigeria is declaring a state of emergency because Ebola is “…highly contagious….”

Aug. 2nd: Atlanta Hospital Prepares To Treat 2 Ebola Patients

Jim Burress – comments on a news conference at Emory:

“He downplayed any threat to public safety because the virus can only be spread through close contact with an infected person.”

To me, “highly contagious” and “close contact with an infected person” are worlds apart. Why the shift in semantics in only five days?

Curious if you have noticed this or other shifting semantics around the Ebola outbreak from other news outlets?

Not that I would advocate any one “true” semantic for the crisis, but I wonder who would benefit from an Ebola-fear panic in Nigeria? Or who would benefit from no panic and a possible successful treatment for Ebola?

Working on the assumption that semantics vary depending on who benefits from a particular semantic.

Topic maps could help you “out” the beneficiaries. Or help you plan to conceal connections to the beneficiaries, depending upon your business model.


Update: A close friend pointed me to: FILOVIR: Scientific Resource for Research on Filoviruses. Website, twitter feed, etc. In case you are looking for a current feed of Ebola information, both public and professional.

August 6, 2014

CAB Thesaurus 2014

Filed under: Social Sciences,Thesaurus — Patrick Durusau @ 2:46 pm

CAB Thesaurus 2014

From the webpage:

The CAB Thesaurus is the essential search tool for all users of the CAB ABSTRACTS™ and Global Health databases and related products. The CAB Thesaurus is not only an invaluable aid for database users but it has many potential uses by individuals and organizations indexing their own information resources for both internal use and on the Internet.

Its strengths include:

  • Controlled vocabulary that has been in constant use since 1983
  • Regularly updated (current version released July 2014)
  • Broad coverage of pure and applied life sciences, technology and social sciences
  • Approximately 264,500 terms, including 144,900 preferred terms and 119,600 non-preferred terms
  • Specific terminology for all subjects covered
  • Includes about 206,400 plant, animal and microorganism names
  • Broad, narrow and related terms to help users find relevant terminology
  • Cross-references from non-preferred synonyms to preferred terms
  • Multi-lingual, with Dutch, Portuguese and Spanish equivalents for most English terms, plus lesser content in Danish, Finnish, French, German, Italian, Norwegian and Swedish
  • American and British spelling variants
  • Relevant CAS registry numbers for chemicals
  • Commission notation for enzymes

Impressive work and one that you should consult before venturing out to make a “standard” vocabulary for some area. It may already exist.

As a traditional thesaurus, CAB lists equivalent terms in other languages. That is to say, it omits any properties of its primary or “matching” terms that would enable the reader to judge for themselves whether the terms represent the same subject.

Once you become accustomed to asking what criteria were used to say two or more words represent the same subject, the lack of that information becomes glaring.
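Here is a hedged sketch of what a mapping that carries its identifying properties might look like; every field and value below is hypothetical, not CAB data:

```python
# A hedged sketch of the kind of record that would let a reader judge "same
# subject" for themselves: the cross-language equivalence plus the properties
# used to assert it. Field names and values are made up for illustration.

term_mapping = {
    "preferred_term": "maize",
    "equivalents": {"es": "maíz", "pt": "milho", "nl": "mais"},
    # Properties identifying the subject behind the term, so the mapping can
    # be checked rather than taken on faith.
    "subject_identity": {
        "scientific_name": "Zea mays",
        "scope_note": "the cereal crop, not sweet corn cultivars",
        "basis_for_equivalence": "shared taxon (Zea mays) in all four languages",
    },
}

for lang, word in term_mapping["equivalents"].items():
    print(f"{word} ({lang}) == {term_mapping['preferred_term']} "
          f"because: {term_mapping['subject_identity']['basis_for_equivalence']}")
```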

I first saw this at New edition of CAB Thesaurus published by Anton Doroszenko.

Data Science Cheat Sheet

Filed under: Data Science — Patrick Durusau @ 2:14 pm

Data Science Cheat Sheet by Vincent Granville.

Vincent has resources and suggestions in eleven (11) different categories:

  1. Hardware
  2. Linux environment on Windows laptop
  3. Basic UNIX commands
  4. Scripting language
  5. R language
  6. Advanced Excel
  7. Visualization
  8. Machine Learning
  9. Projects
  10. Data Sets
  11. Miscellaneous

The only suggestion where I part company with Vincent is on hardware and OS. I prefer *nix as an OS and run Windows in a VM.

A good starting set of suggestions until you develop your own preferences.

Transducers Are Coming

Filed under: Clojure,Functional Programming — Patrick Durusau @ 1:47 pm

Transducers Are Coming by Rich Hickey.

From the post:

Transducers are a powerful and composable way to build algorithmic transformations that you can reuse in many contexts, and they’re coming to Clojure core and core.async.

Two years ago, in a blog post describing how reducers work, I described the reducing function transformers on which they were based, and provided explicit examples like ‘mapping‘, ‘filtering‘ and ‘mapcatting‘. Because the reducers library intends to deliver an API with the same ‘shape’ as existing sequence function APIs, these transformers were never exposed a la carte, instead being encapsulated by the macrology of reducers.

In working recently on providing algorithmic combinators for core.async, I became more and more convinced of the superiority of reducing function transformers over channel->channel functions for algorithmic transformation. In fact, I think they are a better way to do many things for which we normally create bespoke replicas of map, filter etc.

So, reducing function transformers are getting a name – ‘transducers‘, and first-class support in Clojure core and core.async.

A glimpse of ongoing work on Clojure.

It would not be too late to start reading about transducers now.
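If it helps to see the shape of the idea outside Clojure, here is a minimal Python sketch of reducing function transformers; it follows the description above, not Clojure’s actual implementation:

```python
# A minimal sketch of "reducing function transformers" in Python: a transducer
# takes a reducing function and returns a new reducing function, independent
# of the data source being reduced.

from functools import reduce


def mapping(f):
    def transducer(rf):
        def new_rf(acc, x):
            return rf(acc, f(x))
        return new_rf
    return transducer


def filtering(pred):
    def transducer(rf):
        def new_rf(acc, x):
            return rf(acc, x) if pred(x) else acc
        return new_rf
    return transducer


def append(acc, x):
    acc.append(x)
    return acc


def compose(*transducers):
    # The left-most transducer sees each input element first.
    def composed(rf):
        for t in reversed(transducers):
            rf = t(rf)
        return rf
    return composed


xform = compose(filtering(lambda x: x % 2 == 0), mapping(lambda x: x * 10))

# The same xform works for any reducing context; here, building a list.
print(reduce(xform(append), range(10), []))   # [0, 20, 40, 60, 80]
```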

I first saw this in a tweet by Cognitect, Inc.

datamash

Filed under: Datamash,Statistics — Patrick Durusau @ 1:24 pm

GNU datamash

From the homepage:

GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.

To which you then reasonably ask: What basic numeric, textual and statistical operations?

From the manual:

File operations: transpose, reverse

Numeric operations: sum, min, max, absmin, absmax

Textual/Numeric operations: count, first, last, rand, unique, collapse, countunique

Statistical operations: mean, median, q1, q3, iqr, mode, antimode, pstdev, sstdev, pvar, svar, mad, madraw, sskew, pskew, skurt, pkurt, jarque, dpo

The default column separator is TAB but another character can be substituted for TAB.

Looks like a great utility to have in your data mining toolbox.
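As a quick taste, here is a hedged sketch of driving datamash from Python; the flags (-g for group-by, operation names on 1-based field numbers) are from memory, so check them against the manual before relying on them:

```python
# Hedged sketch: pipe TAB-separated rows into datamash and read the result.
# Verify the exact flags with `datamash --help` on your system.

import subprocess

rows = [
    ("setosa", 5.1),
    ("setosa", 4.9),
    ("virginica", 6.3),
    ("virginica", 5.8),
]
tsv = "\n".join(f"{name}\t{value}" for name, value in rows) + "\n"

# Group by field 1, mean of field 2. Input is already sorted on the group
# field; otherwise add -s to have datamash sort it first.
out = subprocess.run(
    ["datamash", "-g", "1", "mean", "2"],
    input=tsv, capture_output=True, text=True, check=True,
)
print(out.stdout)
# Expected (roughly):
# setosa      5
# virginica   6.05
```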

I first saw this in a tweet by Joe Pickrell.

Israel, Gaza, War & Data…

Filed under: News,Personalization,Reporting — Patrick Durusau @ 10:05 am

Israel, Gaza, War & Data – social networks and the art of personalizing propaganda by Gilad Lotan.

From the post:

It’s hard to shake away the utterly depressing feeling that comes with news coverage these days. IDF and Hamas are at it again, a vicious cycle of violence, but this time it feels much more intense. While war rages on the ground in Gaza and across Israeli skies, there’s an all-out information war unraveling in social networked spaces.

Not only is there much more media produced, but it is coming at us at a faster pace, from many more sources. As we construct our online profiles based on what we already know, what we’re interested in, and what we’re recommended, social networks are perfectly designed to reinforce our existing beliefs. Personalized spaces, optimized for engagement, prioritize content that is likely to generate more traffic; the more we click, share, like, the higher engagement tracked on the service. Content that makes us uncomfortable, is filtered out.
….

You are familiar with the “oooh” and “aaah” social network graphs. Interesting but too dense in most cases to be useful.

The first thing you will notice about Gilad’s post is that he is making effective use of fairly dense social network graphs. The second thing you will notice is the post is one of the relatively few that can be considered sane on the topic of Israel and Gaza. It is worth reading for its sanity if nothing else.

Gilad argues algorithms are creating information cocoons about us “…where never is heard a discouraging word…” or at least any that we would find disagreeable.

Social network graphs are used to demonstrate such information cocoons for the IDF and Hamas and to show possible nodes that may be shared by those cocoons.

I encourage you to read Gilad’s post as an illustration of good use of social network graphics, an interesting analysis of bridging information cocoons and a demonstration that relatively even-handed reporting remains possible.

I first saw this in a tweet by Wandora which read: “Thinking of #topicmaps and #LOD.”

August 5, 2014

Dangerous Data Democracy

Filed under: Data,Data Science — Patrick Durusau @ 7:03 pm

K-Nearest Neighbors: dangerously simple by Cathy O’Neil (aka mathbabe).

From the post:

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

Cathy’s post is a real hoot! You may not roll out of your chair but memories of prior similar episodes will flash by.

She makes a compelling case that the “democratization of data science” effort is not only misguided, it is dangerous to boot. Dangerous at least to users who take advantage of data democracy services.

Or should I say that data democracy services are taking advantage of users? 😉

The only reason to be concerned is that users may blame data science rather than their own incompetence with data tools for their disasters. (That seems like the most likely outcome.)
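To see how little it takes to go wrong, here is a small sketch (not necessarily the example from Cathy’s post, and with made-up numbers) of 1-nearest-neighbor where the raw feature scales, not the data, pick the answer:

```python
# With raw features, the distance is dominated by whichever feature has the
# biggest scale (salary in dollars vs. age in years), so "nearest" neighbors
# can be nonsense until you rescale. Data below is invented for illustration.

import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, labeled_points):
    return min(labeled_points, key=lambda item: euclidean(query, item[1]))


# (age, salary)
people = [
    ("A: 25yo, $52,000", (25, 52_000)),
    ("B: 60yo, $51,200", (60, 51_200)),
    ("C: 30yo, $90,000", (30, 90_000)),
]
query = (27, 51_500)

print(nearest(query, people)[0])
# -> "B: 60yo, $51,200" -- the 60-year-old wins because salary swamps age.

# Min-max scale each feature to [0, 1], then ask again.
cols = list(zip(*([query] + [p for _, p in people])))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]

def scale(point):
    return tuple((v - l) / (h - l) for v, l, h in zip(point, lo, hi))

scaled_people = [(name, scale(p)) for name, p in people]
print(nearest(scale(query), scaled_people)[0])
# -> "A: 25yo, $52,000" -- with comparable scales, age matters again.
```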

Suggested counters to the “data democracy for everyone” rhetoric?

PS: Sam Hunting reminded me of this post from Cathy O’Neil.

Speaking of Automata

Filed under: Automata,Computer Science — Patrick Durusau @ 6:44 pm

Since I just mentioned Michael McCandless’ post on automata in Lucene 4.10, it seems like a good time to say that Jeffrey Ullman will be teaching automata-003 starting Monday Sept. 1, 2014.

I got a bulk email from the course administrators saying that Stanford had given its permission and that arrangements are underway with Coursera.

If you want to see the prior course: https://class.coursera.org/automata-002. Or you could start watching early!

A new proximity query for Lucene, using automatons

Filed under: Automata,Lucene,Search Engines — Patrick Durusau @ 6:34 pm

A new proximity query for Lucene, using automatons by Michael McCandless.

From the post:


As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop.


This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that’s done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

Michael walks through current proximity queries before diving into the new proximity query for Lucene 4.10.
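If you want a feel for what “build the automaton state-by-state and transition-by-transition” means without touching the Lucene API, here is a conceptual Python sketch; the class and its methods are hypothetical stand-ins, not the actual Lucene classes:

```python
# Conceptual sketch (plain Python, not Lucene): match a token stream against a
# hand-built automaton with states, term-labeled transitions, and an "any"
# transition standing in for a slop position.

ANY = object()  # wildcard transition


class TokenAutomaton:
    def __init__(self):
        self.transitions = []   # transitions[state] -> {label: next_state}
        self.accept = set()

    def add_state(self):
        self.transitions.append({})
        return len(self.transitions) - 1

    def add_transition(self, src, dst, label):
        self.transitions[src][label] = dst

    def set_accept(self, state):
        self.accept.add(state)

    def matches(self, tokens, state=0):
        # NFA-style: follow exact-term and wildcard transitions from the start.
        if state in self.accept:
            return True
        if not tokens:
            return False
        head, rest = tokens[0], tokens[1:]
        for label, nxt in self.transitions[state].items():
            if label is ANY or label == head:
                if self.matches(rest, nxt):
                    return True
        return False


# Match "quick <any one token> fox", i.e. a phrase with one token of slop.
a = TokenAutomaton()
s0, s1, s2, s3 = (a.add_state() for _ in range(4))
a.add_transition(s0, s1, "quick")
a.add_transition(s1, s2, ANY)
a.add_transition(s2, s3, "fox")
a.set_accept(s3)

print(a.matches("quick brown fox".split()))   # True
print(a.matches("quick fox".split()))         # False: the slop position must be filled
```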

As always, this is a real treat!

The World’s Most Hackable Cars

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:07 pm

The World’s Most Hackable Cars by Kelly Jackson Higgins.

From the post:

If you drive a 2014 Jeep Cherokee, a 2014 Infiniti Q50, or a 2015 Escalade, your car not only has state-of-the-art network-connected functions and automated features, but it’s also the most likely to get hacked.

That’s what renowned researchers Charlie Miller and Chris Valasek concluded in their newest study of vulnerabilities in modern automobiles, which they will present Wednesday at Black Hat USA in Las Vegas. The researchers focused on the potential for remote attacks, where a nefarious hacker could access the car’s network from afar — breaking into its wireless-enabled radio, for instance, and issuing commands to the car’s steering or other automated driving feature.

Since the Dog Days of summer have arrived (by most ancient accounts), this looked like interesting news and possibly a fun summer activity.

In doing some background research, I noticed that http://www.cars.com does not list the hackability of the 2014 Jeep Cherokee, 2014 Infiniti Q50, or 2015 Escalade as a “What We Don’t Like” in their expert review.

If you are collating information about cars I would definitely list “[remote] issuing commands to the car’s steering or other automated driving feature” as a “don’t like.”

I have checked with a couple of car owner mailing list houses and 2014 owner lists aren’t available, yet. Check back at the end of this year or early next.

Deep Learning in Java

Filed under: Deep Learning,Feature Learning,Machine Learning — Patrick Durusau @ 6:03 pm

Deep Learning in Java by Ryan Swanstrom.

From the post:

Deep Learning is the hottest topic in all of data science right now. Adam Gibson, cofounder of Blix.io, has created an open source deep learning library for Java named DeepLearning4j. For those curious, DeepLearning4j is open sourced on github.

Ryan has some other deep learning goodies at his post so don’t skip directly to DeepLearning4j. 😉

Like all machine learning techniques, the more you know about it the easier it will be to ask uncomfortable questions when someone overplays their results.

It’s a useful technique but it is also useful to be an intelligent consumer of its results.

Bioinformatics Data and Microsoft Word

Filed under: Bioinformatics,Microsoft — Patrick Durusau @ 4:25 pm

Is there ever a valid reason for storing bioinformatics data in a Microsoft Word document? by Keith Bradnam.

You already know the answer from the title so I will skip to the conclusion:

This is not an acceptable practice! Use of Microsoft Word to store bioinformatics data will only ever result in unhappiness, frustration, and anger.

I think Keith, myself and many others who make the same or similar points are missing one critical issue:

Why is MS Word (or Excel) so much easier to use than other applications for bioinformatics?

Or perhaps even more to the point:

Why hasn’t bioinformatics lobbied for extensions to MS Word or Excel to work with their workflow?

For the most part, users aren’t really interested in a personal relationship with their computer or a religious experience with their software. They want to get some non-hardware/non-software task done. (full stop)

Rather than trying to fix users, why don’t we try to fix their tools?

Shouldn’t I be able to create a new MS Word or OpenOffice document, indicate that it contains gene names and simply type them in? And have them intelligently extracted for use with genome databases?
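As a rough sketch of the extraction half, assuming the python-docx package is installed and that a naive HGNC-style pattern is good enough for a first pass (real gene name recognition needs a curated symbol list):

```python
# Pull candidate gene symbols out of a Word document. The file name is
# hypothetical, and the regex (2-10 uppercase letters/digits, starting with a
# letter) is a deliberately crude first approximation.

import re
from docx import Document

GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")


def candidate_gene_symbols(path):
    doc = Document(path)
    seen = set()
    for para in doc.paragraphs:
        for match in GENE_PATTERN.findall(para.text):
            seen.add(match)
    return sorted(seen)


if __name__ == "__main__":
    for symbol in candidate_gene_symbols("gene_list.docx"):
        print(symbol)
```

That is thirty minutes of tooling, not a change in user behavior, which is rather the point.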

“Fixing” users isn’t a winning strategy. Let’s try fixing their tools. No promises, but we know the other approach fails.

Function Graphs and Other Applications for PostScript

Filed under: Humor,PostScript — Patrick Durusau @ 3:57 pm

Function Graphs and Other Applications for PostScript by Gernot Hoffmann.

From the introduction:

This document shows some examples for drawing smooth curves by PostScript.

The main purpose is the creation of mathematical function graphs for placing in PageMaker.

The CIE Chromaticity Diagram is an advanced example for accurate 2D-graphics, based on mathematical data.

New developments show that accurate 3D graphics are possible, though with many restrictions.

The programming techniques differ considerably from the typical PostScript style.

PostScript uses the stack extensively. Optimized programs are hardly readable.
….

I don’t think you will find this to be useful but in a burst of nostalgia I had to post it.

I remember hand coding PostScript for output to an Apple IIe printer. It is an experience that will make you appreciate more recent software. 😉

I first saw this in a tweet by onepaperperday.

Mapping Phone Calls

Filed under: Mapping,Maps,Visualization — Patrick Durusau @ 12:52 pm

Map: Every call Obama has made to a foreign leader in 2014 by Max Fisher.

From the post:

What foreign leaders has Obama spoken to this year? Reddit user nyshtick combed through official White House press releases to make this map showing every phone call Obama has made in 2014 to another head of state or head of government. The results are revealing, a great little window into the year in American foreign policy so far:

It’s a visual so you need to visit Max’s post to see the resulting world map.

I think you will be surprised.

There is another lesson lurking in the post.

The analysis did not require big data, distributed GPU computations or category theory.

What it did require was:

  • An interesting question: “What foreign leaders has Obama spoken to this year?”
  • A likely data set: press releases
  • A user willing to dig through the data and to create a visualization.

Writing as much to myself as to anyone:

Don’t overlook smallish data with simple visualizations. (If your goal is impact and not the technology you used.)

August 4, 2014

User Experience Research at Scale

Filed under: Interface Research/Design,Users,UX — Patrick Durusau @ 7:14 pm

User Experience Research at Scale by Nick Cawthon.

From the post:

An important part of any user experience department should be a consistent outreach effort to users both familiar and unfamiliar. Yet, it is hard to both establish and sustain a continued voice amongst the business of our schedules.

Recruiting, screening, and scheduling daily or weekly one-on-one walkthroughs can be daunting for someone in a small department having more than just user research responsibilities, and the investment of time eventually outweighs the returns as both the number of participants and size of the company grow.

This article is targeted at user experience practitioners at small- to mid-size companies who want to incorporate a component of user research into their workflow.

It first outlines a point of advocacy around why it is important to build user research into a company’s ethos from the very start and states why relying upon standard analytics packages are not enough. The article then addresses some of the challenges around being able to automate, scale, document, and share these efforts as your user base (hopefully) increases.

Finally, the article goes on to propose a methodology that allows for an adjustable balance between a department’s user research and product design and highlights the evolution of trends, best practices, and common avoidances found within the user research industry, especially as they relate to SaaS-based products.

If you have an interest in producing products/services that meet users’ needs, i.e., the kind of products or services that sell, this is an article for you.

Summingbird:… [VLDB 2014]

Filed under: Hadoop,Scala,Storm,Summingbird,Tweets — Patrick Durusau @ 4:07 pm

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.

Abstract:

Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”
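The “certain algebraic structures” are, roughly, commutative monoids: an associative merge with an identity element. Here is a hedged Python sketch (nothing to do with Summingbird’s actual code) of why that property lets batch and online partial results combine cleanly:

```python
# If the aggregation is a monoid, per-batch results and streaming increments
# can be merged in any grouping and still agree with one pass over all data.

from collections import Counter
from functools import reduce


def merge(a: Counter, b: Counter) -> Counter:
    # Counter addition is associative and commutative, with Counter() as identity.
    return a + b


def count_terms(events):
    return Counter(events)


# "Batch" results computed over historical chunks (think nightly Hadoop jobs)...
batch_parts = [count_terms(["scala", "storm", "scala"]),
               count_terms(["hadoop", "scala"])]

# ...and "online" increments from the stream (think Storm, one event at a time).
online_parts = [count_terms(["storm"]), count_terms(["hadoop", "storm"])]

combined = reduce(merge, batch_parts + online_parts, Counter())
single_pass = count_terms(["scala", "storm", "scala", "hadoop", "scala",
                           "storm", "hadoop", "storm"])

print(combined == single_pass)   # True: the grouping of merges doesn't matter
print(combined)
```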

I first saw this in a tweet by Jimmy Lin.

Guessing user needs doesn’t work.

Filed under: Interface Research/Design,UX — Patrick Durusau @ 3:49 pm

Why governments need hack days by Amy Whitney.

Amy describes a hackathon at the DVLA (Driver and Vehicle Licensing Agency) and it includes this jewel on user input:

First they went to talk to others in the DVLA to fully understand the problem. Then they went out onto the street to talk to real users about their needs. In some cases the results were eye-opening and unexpected. User research is a team sport, and users are a crucial part of that team. Guessing user needs doesn’t work. (emphasis in original)

I would emphasize: Guessing user needs doesn’t work.

Are you guessing user needs or do you have some other method to establish their needs?

I first saw this in a tweet by Mark Hurrell.

PS: Would “Guessing user needs doesn’t work.” also apply to FOL? 😉

August 3, 2014

550 talks related to big data [GPU Technology Conference]

Filed under: BigData,GPU — Patrick Durusau @ 6:55 pm

550 talks related to big data by Amy.

Amy has forged links to forty-four (44) topic areas at the GPU Technology Conference 2014.

Definitely a post to bookmark!

You may remember that GPUs were what Bryan Thompson and others are using to achieve 3 Billion Traversed Edges Per Second (TEPS) for graph work. Non-kiddie graph work.

Enjoy!

Side by side with Elasticsearch and Solr

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 4:26 pm

Side by side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe.

Abstract:

We all know that Solr and Elasticsearch are different, but what those differences are and which solution is the best fit for a particular use case is a frequent question. We will try to make those differences clear, not by showing slides and comparing them, but by showing an online demo of both Elasticsearch and Solr:

  • Set up and start both search servers. See what you need to prepare and launch Solr and Elasticsearch
  • Index data right after the server was started using the “schemaless” mode
  • Create index structure and modify it using the provided API
  • Explore different query use cases
  • Scale by adding and removing nodes from the cluster, creating indices and managing shards. See how that affects data indexing and querying.
  • Monitor and administer clusters. See what metrics can be seen out of the box, how to get them and what tools can provide you with the graphical view of all the goodies that each search server can provide.

Slides

Very impressive split-screen comparison of Elasticsearch and Solr by two presenters on the same data set.
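If you want to try the first two bullets yourself, here is a hedged sketch from Python; the ports, collection name, and endpoints are the defaults as I remember them for 2014-era releases, so treat them as assumptions and adjust to your install:

```python
# Index one document into each server right after startup, then query it back.
# Ports/endpoints are assumptions: Elasticsearch on 9200, Solr on 8983 with a
# schemaless "gettingstarted" collection; older Solr releases may want
# /update/json instead of /update.

import requests

doc = {"title": "Side by side with Elasticsearch and Solr", "year": 2014}

# Elasticsearch: PUT a document; the index and mapping are created on the fly.
es = requests.put("http://localhost:9200/talks/talk/1", json=doc)
print("elasticsearch:", es.status_code)

# Solr (schemaless example): POST the same document as a JSON array and commit.
solr = requests.post(
    "http://localhost:8983/solr/gettingstarted/update",
    params={"commit": "true"},
    json=[doc],
)
print("solr:", solr.status_code)

# Query both for the same term.
print(requests.get("http://localhost:9200/talks/_search",
                   params={"q": "title:elasticsearch"}).json()["hits"]["total"])
print(requests.get("http://localhost:8983/solr/gettingstarted/select",
                   params={"q": "title:elasticsearch", "wt": "json"}).json()
      ["response"]["numFound"])
```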

I first saw this at: Side-By-Side with Solr and Elasticsearch : A Comparison by Charles Ditzel.

August 2, 2014

MapGraph:… [3 billion Traversed Edges Per Second (TEPS) on a GPU]

Filed under: GPU,Graphs,Parallel Programming — Patrick Durusau @ 6:58 pm

MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs by Zhisong Fu, Michael Personick, and Bryan Thompson.

Abstract:

High performance graph analytics are critical for a long list of application domains. In recent years, the rapid advancement of many-core processors, in particular graphical processing units (GPUs), has sparked a broad interest in developing high performance parallel graph programs on these architectures. However, the SIMT architecture used in GPUs places particular constraints on both the design and implementation of the algorithms and data structures, making the development of such programs difficult and time-consuming.

We present MapGraph, a high performance parallel graph programming framework that delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU. MapGraph provides a high-level abstraction that makes it easy to write graph programs and obtain good parallel speedups on GPUs. To deliver high performance, MapGraph dynamically chooses among different scheduling strategies depending on the size of the frontier and the size of the adjacency lists for the vertices in the frontier. In addition, a Structure Of Arrays (SOA) pattern is used to ensure coalesced memory access. Our experiments show that, for many graph analytics algorithms, an implementation, with our abstraction, is up to two orders of magnitude faster than a parallel CPU implementation and is comparable to state-of-the-art, manually optimized GPU implementations. In addition, with our abstraction, new graph analytics can be developed with relatively little effort.

Those of us who remember Bryan Thompson from the early days of topic maps are not surprised to see his name on a paper with phrases like: “…delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU,” and “…is up to two orders of magnitude faster than a parallel CPU implementation….”

Heavy sledding but definitely worth the effort.

Oh, btw, did I mention this is an open source project? http://sourceforge.net/projects/mpgraph/

I first saw this in MapGraph: speeding up graph processing with GPUs by Danny Bickson.

User Interfaces

Filed under: Interface Research/Design,Marketing,UX — Patrick Durusau @ 4:20 pm

[image: “user interface” poster]

My future response to all interface statements that begin:

  • users must understand
  • users don’t know what they are missing
  • users have been seduced by search
  • users need training
  • etc.

These statements and others mean that “users,” those folks who are going to pay money for services/products, aren’t going to be happy.

Making a potential customer unhappy is a very poor sales technique.

I saw this in a tweet by Startup Vitamins.

Hire Al-Qaeda Programmers

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:24 pm

Hire Al-Qaeda Programmers was all I could think of after reading:

Big Data Firm Says It Can Link Snowden Data To Changed Terrorist Behavior by Dina Temple-Raston (August 01, 2014). Which was based upon:

How Al-Qaeda Uses Encryption Post-Snowden (Part 2) – New Analysis in Collaboration With ReversingLabs by C (August 01, 2014). Which is a follow up to:

How Al-Qaeda Uses Encryption Post-Snowden (Part 1) by C.

C says in part 1 of the posts from Recorded Future:

Following the June 2013 Edward Snowden leaks we observe an increased pace of innovation, specifically new competing jihadist platforms and three (3) major new encryption tools from three (3) different organizations – GIMF, Al-Fajr Technical Committee, and ISIS – within a three to five-month time frame of the leaks.

The May 2014 “research” post created a stir in the media, such as:

The Telegraph:

The latest example of how hostile groups are adapting their behaviour in the post-Snowden world comes with reports that al-Qaeda has created new encryption software so that its activities can avoid detection by surveillance agencies, such as Britain’s GCHQ listening post at Cheltenham. According to the American big data firm Recorded Future, al-Qaeda-affiliated cells have developed three new versions of encryption software to replace the previous format, known as “Mujahideen Secrets”, that Osama bin Laden’s organisation had used prior to Snowden’s revelations. The company says the most likely explanation for the new software is that Islamist terror groups are seeking to change their computer systems in the wake of Snowden’s revelations, thereby making it harder for Western intelligence agencies to track their activities and thwart potential terrorist attacks.

Ars Technica

The influx of new programs for al Qaeda members came amid revelations that the NSA was able to decode vast amounts of encrypted data traveling over the Internet. Among other things, according to documents Snowden provided, government-sponsored spies exploited backdoors or crippling weaknesses that had been surreptitiously and intentionally built in to widely used standards.

There were a number of others but I won’t bore you with repeating their uncritical repetition of the original Recorded Future report.

What is the one thing that jumps out as odd in the May 2014 Recorded Future report?

Let’s review the summary of that report:

Following the June 2013 Edward Snowden leaks we observe an increased pace of innovation, specifically new competing jihadist platforms and three (3) major new encryption tools from three (3) different organizations – GIMF, Al-Fajr Technical Committee, and ISIS – within a three to five-month time frame of the leaks. (emphasis added)

According to their blog, “over 14,000 intelligence analysts and security professionals” viewed that summary. Have you heard any concerns about the absurdity of the time frame projected by Recorded Future?

three (3) major new encryption tools from (3) different organizations…within a three to five-month time frame of the leaks

Really? Not unless Al-Qaeda programmers are better than any other known programmers or they are following a methodology that is several orders of magnitude better than any other known one.

Think about it. I don’t have to quote experts to convince you that writing, writing mind you, the RFP for an encryption tool would take more than three to five months. That leaves out any programming, testing, etc.

I realize that Edward Snowden is the whipping boy of the year for contracts and funding and I like to believe several impossible things before breakfast. But not absurd things.

If you want to cripple Al-Qaeda, hire their programmers. According to Recorded Future, they are the best programmers who have ever lived.

PS: Apologies but I don’t know how to legally obtain the Recorded Future mailing list for its blog. It would be helpful in terms of knowing who to not consult on probable time frames for software development.

Data Science Master

Open Source Data Science Master – The Plan by Fras and Sabine.

From the post:

Free!! education platforms have put some of the world’s most prestigious courses online in the last few years. This is our plan to use these and create our own custom open source data science Master.

Free online courses are selected to cover: Data Manipulation, Machine Learning & Algorithms, Programming, Statistics, and Visualization.

Be sure to take note of the prerequisites the authors completed before embarking on their course work.

No particular project component is suggested because the course work will suggest ideas.

What other choices would you suggest? Either for broader basics or specialization?

August 1, 2014

OpenGM

Filed under: C/C++,Graphs — Patrick Durusau @ 4:37 pm

OpenGM

From the webpage:

OpenGM is a C++ template library for discrete factor graph models and distributive operations on these models. It includes state-of-the-art optimization and inference algorithms beyond message passing. OpenGM handles large models efficiently, since (i) functions that occur repeatedly need to be stored only once and (ii) when functions require different parametric or non-parametric encodings, multiple encodings can be used alongside each other, in the same model, using included and custom C++ code. No restrictions are imposed on the factor graph or the operations of the model. OpenGM is modular and extendible. Elementary data types can be chosen to maximize efficiency. The graphical model data structure, inference algorithms and different encodings of functions inter-operate through well-defined interfaces. The binary OpenGM file format is based on the HDF5 standard and incorporates user extensions automatically.

Documentation lists algorithms with references.
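For readers new to the terminology, here is a tiny conceptual example of a discrete factor graph with brute-force MAP inference; it is plain Python for illustration, not the OpenGM API (which does ship Python bindings, but I am not reproducing them here):

```python
# Two binary variables, a unary factor on each, one pairwise factor, and
# brute-force MAP inference by enumerating all joint states. Real toolkits
# exist precisely because enumeration does not scale past toy models.

from itertools import product

num_labels = {"x0": 2, "x1": 2}

# Each factor maps a tuple of labels for its variables to an energy (lower is better).
factors = [
    (("x0",), {(0,): 0.2, (1,): 1.0}),                 # unary: x0 prefers label 0
    (("x1",), {(0,): 0.8, (1,): 0.3}),                 # unary: x1 prefers label 1
    (("x0", "x1"), {(0, 0): 0.0, (0, 1): 0.9,          # pairwise: prefers equal labels
                    (1, 0): 0.9, (1, 1): 0.0}),
]


def energy(assignment):
    total = 0.0
    for variables, table in factors:
        key = tuple(assignment[v] for v in variables)
        total += table[key]
    return total


best = min(
    (dict(zip(num_labels, labels))
     for labels in product(*(range(n) for n in num_labels.values()))),
    key=energy,
)
print(best, energy(best))   # {'x0': 0, 'x1': 0} 1.0
```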

I first saw this in a post by Danny Bickson, OpenGM graphical models toolkit.
