Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 22, 2013

DCAT Application Profile for Data Portals in Europe – Final Draft

Filed under: DCAT,EU,Integration — Patrick Durusau @ 4:27 pm

DCAT Application Profile for Data Portals in Europe – Final Draft

From the post:

The DCAT Application profile for data portals in Europe (DCAT-AP) is a specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. This can be achieved by the exchange of descriptions of data sets among data portals.

This final draft is open for public review until 10 June 2013. Members of the public are invited to download the specification and post their comments directly on this page. To be able to do so you need to be registered and logged in.

If you are interested in integrating data from European data portals, it is worth the time to register and comment.

This isn't all the data you are going to need to integrate a data set, but it is at least a start in the right direction.
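If DCAT is new to you, a minimal dataset description is easy to sketch with Python's rdflib. The dataset URI and metadata below are invented for illustration:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)

    # Hypothetical dataset URI and metadata, for illustration only.
    ds = URIRef("http://example.org/dataset/energy-usage-2012")
    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Energy usage 2012", lang="en")))
    g.add((ds, DCTERMS.publisher, URIRef("http://example.org/agency/energy")))
    g.add((ds, DCAT.keyword, Literal("energy")))

    print(g.serialize(format="turtle"))

Exchanging descriptions shaped like this one, profiled consistently across portals, is the cross-portal search use case DCAT-AP targets.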

Open Access to Weather Data for International Development

Filed under: Agriculture,Open Data,Weather Data — Patrick Durusau @ 3:28 pm

Open Access to Weather Data for International Development

From the post:

Farming communities in Africa and South Asia are becoming increasingly vulnerable to shock as the effects of climate change become a reality. This increased vulnerability, however, comes at a time when improved technology makes critical information more accessible than ever before. aWhere Weather, an online platform offering free weather data for locations in Western, Eastern and Southern Africa and South Asia provides instant and interactive access to highly localized weather data, instrumental for improved decision making and providing greater context in shaping policies relating to agricultural development and global health.

Weather Data in 9km Grid Cells

Weather data is collected at meteorological stations around the world and interpolated to create accurate data in detailed 9km grids. Within each cell, users can access historical, daily-observed and 8 days of daily forecasted ‘localized’ weather data for the following variables:

  • Precipitation 
  • Minimum and Maximum Temperature
  • Minimum and Maximum Relative Humidity 
  • Solar Radiation 
  • Maximum and Morning Wind Speed
  • Growing degree days (dynamically calculated for your base and cap temperature) 

These data prove essential for risk adaptation efforts, food security interventions, climate-smart decision making, and agricultural or environmental research activities.
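The growing degree days entry is the only derived variable in the list. A rough sketch of the standard averaging method (assuming aWhere uses something close to it, with your chosen base and cap temperatures):

    def growing_degree_days(t_min, t_max, base, cap):
        """Average-method GDD for one day: clamp the daily max at the
        cap, average with the min, subtract the base, floor at zero."""
        t_max = min(t_max, cap)
        avg = (t_min + t_max) / 2.0
        return max(0.0, avg - base)

    # One warm day: min 18, max 34, base 10, cap 30 (degrees C) -> 14.0
    print(growing_degree_days(18, 34, base=10, cap=30))

Some formulations also clamp the minimum at the base temperature; which variant aWhere uses is one more thing worth asking them about.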

Sign up Now

Access is free and easy. Register at http://www.awhere.com/en-us/weather-p. Then, you can log back in anytime at me.awhere.com.  

For questions on the platform, please contact weather@awhere.com

At least as a public observer, I could not determine how much “interpolation” goes into the weather data. That would have a major impact on the risk of accepting the data provided at face value.

I suspect it varies from little interpolation at all in heavily instrumented areas to quite a bit in areas with sparser readings. How much is unclear.

It may be that the amount of interpolation in the data depends on whether you use the free version or some upgraded commercial version.

Still, an interesting data source to combine with others, if you are mindful of the risks.

Introduction to Artificial Intelligence (Berkeley CS188.1x)

Filed under: Artificial Intelligence,Programming — Patrick Durusau @ 2:17 pm

Introduction to Artificial Intelligence (Berkeley CS188.1x)

The schedule for CS188.2x hasn’t been announced, yet.

In the meantime, you can register for CS188.1x and peruse the videos, exercises, etc. while you wait for the second part of the course.

From the description:

CS188.1x is a new online adaptation of the first half of UC Berkeley’s CS188: Introduction to Artificial Intelligence. The on-campus version of this upper division computer science course draws about 600 Berkeley students each year.

Artificial intelligence is already all around you, from web search to video games. AI methods plan your driving directions, filter your spam, and focus your cameras on faces. AI lets you guide your phone with your voice and read foreign newspapers in English. Beyond today’s applications, AI is at the core of many new technologies that will shape our future. From self-driving cars to household robots, advancements in AI help transform science fiction into real systems.

CS188.1x focuses on Behavior from Computation. It will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision–theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in stochastic and in adversarial settings. CS188.2x (to follow CS188.1x, precise date to be determined) will cover Reasoning and Learning. With this additional machinery your agents will be able to draw inferences in uncertain environments and optimize actions for arbitrary reward structures. Your machine learning algorithms will classify handwritten digits and photographs. The techniques you learn in CS188x apply to a wide variety of artificial intelligence problems and will serve as the foundation for further study in any application area you choose to pursue.

Dynamic faceting with Lucene

Filed under: Faceted Search,Facets,Indexing,Lucene,Search Engines — Patrick Durusau @ 2:08 pm

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch‘s and Solr‘s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.
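To make the index-time side of that concrete, here is a toy sketch of the label-to-ordinal mapping the taxonomy index maintains and the array-based counting done at search time. It mimics the shape of the approach, not Lucene's actual API:

    class TinyTaxonomy:
        """Toy stand-in for Lucene's taxonomy index: maps each unique
        facet label to a stable integer ordinal at indexing time."""
        def __init__(self):
            self.ords = {}    # label -> ordinal
            self.labels = []  # ordinal -> label

        def ordinal(self, label):
            if label not in self.ords:
                self.ords[label] = len(self.labels)
                self.labels.append(label)
            return self.ords[label]

    taxo = TinyTaxonomy()

    # "Indexing": each document stores only integer ids for its labels.
    docs = [
        [taxo.ordinal(l) for l in ("lang/java", "status/open")],
        [taxo.ordinal(l) for l in ("lang/java", "status/closed")],
        [taxo.ordinal(l) for l in ("lang/python", "status/open")],
    ]

    # "Search": for each matched doc, bump a counter per ordinal.
    counts = [0] * len(taxo.labels)
    for doc in docs:  # pretend all three docs matched the query
        for ord_ in doc:
            counts[ord_] += 1

    print(sorted(zip(taxo.labels, counts), key=lambda p: -p[1]))

The search-time work is just integer lookups and array increments, which is exactly why the facet module's query-time cost stays minimal.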

30 Days to Data Storytelling

Filed under: Data Storytelling — Patrick Durusau @ 1:07 pm

30 Days to Data Storytelling by James Lytle.

From the post:

We learn best from a diversity of inputs. That’s partly why our previous 30 days exercise sheet was such a huge hit.

It’s critical for analysts and presenters of data to share information in a way that people just get it. Enter data storytelling – a magical elixir to all your data communication woes! Well, maybe not quite. But you should be aware of recent efforts using this timeless approach to deliver information so naturally – through stories.

That’s why we’ve created 30 Days to Data Storytelling.

This exercise breaks down a structured (yet casual) introduction to data storytelling through a variety of resources. We wanted to provide a diversity of depth and inspiration. Feel free to skip around or follow our 4 week sequence. Print it and post it near the water cooler or slap it to your virtual desktop.

I don’t have a water cooler but I will post “30 Days to Data Storytelling” next to my monitors.

Whatever the subject, knowledge you can’t communicate to others is lost.

May 21, 2013

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Filed under: Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 4:54 pm

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA! by John Kreisa.

From the post:

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away by the number and size of organizations that have downloaded the beta bits of this 100% open source, native-to-Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.

Two lessons here:

First, Hadoop is a very popular way to address enterprise big data.

Second, going where users are, not where they ought to be, is a smart business move.

JSME: a free molecule editor in JavaScript

Filed under: Cheminformatics,Editor,Interface Research/Design,Javascript — Patrick Durusau @ 4:48 pm

JSME: a free molecule editor in JavaScript by Bruno Bienfait and Peter Ertl. (Journal of Cheminformatics 2013, 5:24 doi:10.1186/1758-2946-5-24)

Abstract:

Background

A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however, the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.

Summary

The actual molecule editing Java code of the JME editor was translated into JavaScript with the help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.

Conclusions

A free molecule editor written in JavaScript was developed and is released under the terms of permissive BSD license. The editor is compatible with JME, has practically the same user interface as well as the web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/

Just in case you were having any doubts about using JavaScript to power an annotation editor.

Better now?

Neo4j 1.9 General Availability Announcement!

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:38 pm

Neo4j 1.9 General Availability Announcement! by Philip Rathle.

From the post:

After over a year of R&D, five milestone releases, and two release candidates, we are happy to release Neo4j 1.9 today! It is available for download effective immediately. And the latest source code is available, as always, on Github.

The 1.9 release adds primarily three things:

  • Auto-Clustering, which makes Neo4j Enterprise clustering more robust & easier to administer, with fewer moving parts
  • Cypher language improvements make the language more functionally powerful and more performant, and
  • New welcome pages make learning easier for new users

Beyond Enterprise Search…

Filed under: Linked Data,MarkLogic,Searching,Semantic Web — Patrick Durusau @ 2:49 pm

Beyond Enterprise Search… by adamfowleruk.

From the post:

Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…

Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.

This approach is commonly called Enterprise Search, or Search and Discovery, which is where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, metadata and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.

Content Search

We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contains all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stem to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.

Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.

Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.

These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.

Content search only gets you so far though.
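Before getting to the semantic part, the content-search baseline Adam describes is easy to caricature in a few lines: term matching broadened by a crude stemmer, weighted by relative frequency. Real engines do far more, but the shape is this:

    def stem(word):
        """Crude suffix stripper: plays, played, playing -> play."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    docs = {
        "d1": "the children played in the park",
        "d2": "a play about parks and children",
        "d3": "parking regulations for the city",
    }

    def score(query, text):
        terms = [stem(w) for w in text.lower().split()]
        hits = sum(terms.count(stem(q)) for q in query.lower().split())
        return hits / len(terms)  # weight by relative frequency

    for doc_id, text in docs.items():
        print(doc_id, round(score("play", text), 3))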

I was amening with the best of them until Adam reached the part about MarkLogic 7 going to add Semantic Web capabilities. 😉

I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.

Making data more accessible isn’t going to make it less diverse.

Although making data more accessible may drive the development of ways to manage semantic diversity.

So perhaps there is a useful side to linked data after all.

Named Entity Tutorial

Filed under: Entity Resolution,LingPipe,Named Entity Mining — Patrick Durusau @ 2:31 pm

Named Entity Tutorial (LingPipe)

While looking for something else I ran across this named entity tutorial at LingPipe.

Other named entity tutorials that I should collect?

Cloud Computing as a Low-Cost Commodity

Filed under: Cloud Computing — Patrick Durusau @ 11:35 am

A Revolution in Cloud Pricing: Minute By Minute Cloud Billing for Everyone by Sean Murphy.

From the post:

Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.

The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.

Let’s say that I have a job that will take just a thousand instances each a little bit over an hour to finish (a total of just over a thousand “instance hours”). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let’s also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption) and that Amazon’s and Google’s instances have the same computational horsepower (this is not true, see my benchmark results). As Amazon charges by the hour, Amazon would charge me for two hours per instance or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance) whereas Google would only charge me $508.34 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon’s hourly billing has almost doubled my costs but the impact is far worse.
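A quick sketch of that billing arithmetic, with the rate, runtime and billing granularity as parameters so you can test other scenarios:

    import math

    def cost(instances, minutes, rate_per_hour, granularity_minutes):
        """Total cost when usage rounds up to the billing unit."""
        units = math.ceil(minutes / granularity_minutes)
        hours = units * granularity_minutes / 60.0
        return instances * rate_per_hour * hours

    # 1000 instances running 61 minutes at $0.50/hour:
    print(cost(1000, 61, 0.50, granularity_minutes=60))  # hourly -> 1000.0
    print(cost(1000, 61, 0.50, granularity_minutes=1))   # per-minute -> ~508.33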

Sean does a great job covering the impact of minute-by-minute pricing for cloud computing.

Great news for the short run and I suspect even greater news for the long run.

What happens when instances and storage become too cheap to meter?

Like domestic long distance telephone service.

When anything that can be computed is within the reach of everyone, what will be computed?

“Practical Data Science with R” MEAP (ordered)

Filed under: Data Science,R — Patrick Durusau @ 11:17 am

Big News! “Practical Data Science with R” MEAP launched! by John Mount.

From the post:

Nina Zumel and I ( John Mount ) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.

Deal of the Day May 21 2013: Half off Practical Data Science with R. Use code dotd0521au.

I ordered the “Practical Data Science with R” MEAP today, based on my other Manning MEAP experiences.

You?

The Art of Data Visualization

Filed under: Graphics,Visualization — Patrick Durusau @ 7:25 am

A series of short clips on data visualization.

Quite good even if very broad and general.

Tufte closes with the thought that we “see to confirm,” because it is less taxing on the brain. (Cf. Kahneman, “Thinking, Fast and Slow”)

Suggests that we need to “see to learn.”

I first saw this at spatial.ly.

Searching on BillTrack50

Filed under: Government,Law,Transparency — Patrick Durusau @ 6:58 am

How to find what you are looking for – constructing a search on BillTrack50 by Karen Suhaka.

From the post:

Building a search on BillTrack50 is fairly straightforward, however it isn’t exactly like doing a Google search. So there’s a few things you need to keep in mind, which I’ll explain in this post. There’s also a few tips and tricks advanced users might find useful. Any bills that are introduced later and meet your search terms will be automatically added to your bill sheet (if you made a bill sheet).

Tracking “thumb on the scale” (TOTS) at the state level? BillTrack50 is a great starting point.

BillTrack50 provides surface facts, to which you can add vote trading, influence peddling and other routine legislative activities.

Metaphor Identification in Large Texts Corpora

Filed under: Corpora,Metaphors — Patrick Durusau @ 6:44 am

Metaphor Identification in Large Texts Corpora by Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder. (Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, et al. (2013) Metaphor Identification in Large Texts Corpora. PLoS ONE 8(4): e62343. doi:10.1371/journal.pone.0062343)

Abstract:

Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.

A deep review of current work and promising new algorithms on metaphor identification.

I first saw this in Nat Torkington’s Four short links: 14 May 2013.

May 20, 2013

Consumers of Furry Pornography = Tax Dodgers?

Filed under: Data Science Toolkit (DSTK),Heatmaps,Mapping,Maps — Patrick Durusau @ 5:00 pm

heatmaps cartoon

No more heatmaps that are just population maps! by Pete Warden.

From the post:

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can actually tell which places actually have an unusually high occurrence of X, rather than just having more people.

If you use the DSTK (and you should), do send Pete a note of appreciation.

I can’t wait to start mapping tax dodgers!
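The normalization Pete is after is simple once each grid cell carries population: divide observed counts by population and map the rate instead (numbers invented):

    # (cell, observed events, population) -- invented numbers.
    cells = [
        ("downtown", 90, 100_000),
        ("suburb", 30, 20_000),
        ("rural", 4, 1_000),
    ]

    for name, events, population in cells:
        rate = events / population * 10_000  # events per 10k residents
        print(f"{name:8s} raw={events:3d} per-10k={rate:.1f}")

    # Raw counts make downtown look hottest; per capita, rural leads.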

U.S. Senate Panel Discovers Nowhere Man [Apple As Tax Dodger]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 4:47 pm

Forty-seven years after Nowhere Man by the Beatles, a U.S. Senate panel discovers several nowhere men.

A Wall Street Journal Technology Alert:

Apple has set up corporate structures that have allowed it to pay little or no corporate tax–in any country–on much of its overseas income, according to the findings of a U.S. Senate examination.

The unusual result is possible because the iPhone maker’s key foreign subsidiaries argue they are residents of nowhere, according to the investigators’ report, which will be discussed at a hearing Tuesday where Apple CEO Tim Cook will testify. The finding comes from a lengthy investigation into the technology giant’s tax practices by the Senate Permanent Subcommittee on Investigations, led by Sens. Carl Levin (D., Mich.) and John McCain (R., Ariz.).

In additional coverage, Apple says:

Apple’s testimony also includes a call to overhaul: “Apple welcomes an objective examination of the US corporate tax system, which has not kept pace with the advent of the digital age and the rapidly changing global economy.”

Tax reform will be useful only if it is “transparent” tax reform.

Transparent tax reform means every provision with more than a $100,000 impact on any taxpayer names the taxpayers impacted, whether their taxes go up or down.

We have the data, we need the will to apply the analysis.

A tax-impact topic map anyone?

The Index-Based Subgraph Matching Algorithm (ISMA)…

Filed under: Bioinformatics,Graphs,Indexing — Patrick Durusau @ 4:23 pm

The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees by Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, Piet Demeester. (Demeyer S, Michoel T, Fostier J, Audenaert P, Pickavet M, et al. (2013) The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees. PLoS ONE 8(4): e61183. doi:10.1371/journal.pone.0061183)

Abstract:

Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma/.

From the introduction:

Over the last decade, network theory has come to play a central role in our understanding of complex systems in fields as diverse as molecular biology, sociology, economics, the internet, and others [1]. The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks [2]. Network motifs act as the fundamental information processing units in cellular regulatory networks [3] and they form the building blocks of larger functional modules (also known as network communities) [4]–[6]. The discovery and analysis of network motifs crucially depends on the ability to enumerate all instances of a given query subgraph in a network or graph of interest, a classical problem in pattern recognition [7], that is known to be NP complete [8].

Heavy sledding but important for exploration of large graphs/networks and the subsequent representation of those findings in a topic map.

I first saw this in Nat Torkington’s Four short links: 13 May 2013.
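ISMA itself is Java, but if you want to play with the underlying problem of enumerating every instance of a query subgraph, networkx offers a brute-force baseline in Python (nothing like ISMA's optimized search trees, but handy for small experiments):

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Host network containing two triangles, and a triangle query.
    host = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)])
    query = nx.Graph([("a", "b"), ("b", "c"), ("a", "c")])

    gm = isomorphism.GraphMatcher(host, query)
    for mapping in gm.subgraph_isomorphisms_iter():
        print(mapping)  # host node -> query node

The symmetric duplicates in the output (each triangle shows up six times) are precisely the symmetry characteristics ISMA exploits to prune its search.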

Reloading my Beergraph – using an in-graph-alcohol-percentage-index

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:27 pm

Reloading my Beergraph – using an in-graph-alcohol-percentage-index by Rik Van Bruggen.

From the post:

As you may remember, I created a little beer graph some time ago to experiment and have fun with beer, and graphs. And yes, I have been having LOTS of fun with it – using it to explain graph concepts to lots of not-so-technical folks, like myself. Many people liked it, and even more people had some questions about it – started thinking in graphs, basically. Which is way more than what I ever hoped for – so that’s great!

One of the questions that people always asked me was about the model. Why did I model things the way I did? Are there no other ways to model this domain? What would be the *best* way to model it? All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.

One of the things that spurred the discussion was – probably not coincidentally – the AlcoholPercentage. Many people were expecting that to be a *property* of the Beerbrand – but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model. By making the alcoholpercentage a node of its own, it allowed me to do more interesting queries and pathfinding operations – which led to interesting beer recommendations. Which is what this is all about, right?

(…)

When I read:

All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.

or

…but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model.

I don’t feel like I’ve been vague, ever. 😉

Here is my summary of what Rik may have meant:

  • “no *one way* to model a graph” -> graphs support multiple models of data
  • “The data does not determine the model” -> may mean you can create any arbitrary model based on any data
  • “…the QUERY that will drive the modeling decisions.” -> in topic map terms, what gets represented by a topic (node in a graph) is what you want to talk about (query)
  • “…pulled it out…”/”…pull things out…” -> represent a subject with a node (graph) or topic (topic maps).
  • “…normalise the data model much further…” -> the distinction from database normalization isn’t clear; it may just be filler.

Clarity in writing reduces unnecessary vagueness.
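To see what “pulling it out” buys you, here is a small sketch with invented beers. With AlcoholPercentage as a node, and the percentage nodes chained into an in-graph index as Rik's title suggests, “beers of similar strength” becomes a short path traversal rather than a property scan:

    import networkx as nx

    g = nx.Graph()
    # AlcoholPercentage pulled out as nodes of its own.
    g.add_edges_from([
        ("Duvel", "8.5%"), ("Westmalle Tripel", "9.5%"),
        ("Chimay Blue", "9.0%"), ("Orval", "6.2%"),
    ])
    # Chain the percentage nodes into an ordered in-graph index.
    g.add_edges_from([("6.2%", "8.5%"), ("8.5%", "9.0%"), ("9.0%", "9.5%")])

    # Recommend beers within three hops of Duvel along the index.
    nearby = nx.single_source_shortest_path_length(g, "Duvel", cutoff=3)
    print([n for n in nearby if not n.endswith("%") and n != "Duvel"])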

FuzzyLaw [FuzzyDBA, FuzzyRDF, FuzzySW?]

Filed under: Law,Legal Informatics,Semantic Diversity,Semantic Inconsistency,Users — Patrick Durusau @ 2:03 pm

FuzzyLaw

From the webpage:

(…)

FuzzyLaw has gathered explanations of legal terms from members of the public in order to get a sense of what the ‘person on the street’ has in mind when they think of a legal term. By making lay-people’s explanations of legal terms available to interpreters, police and other legal professionals, we hope to stimulate debate and learning about word meaning, public understanding of law and the nature of explanation.

The explanations gathered in FuzzyLaw are unusual in that they are provided by members of the public. These people, all aged over 18, regard themselves as ‘native speakers’, ‘first language speakers’ and ‘mother tongue’ speakers of English and have lived in England and/or Wales for 10 years or more. We might therefore expect that they will understand English legal terminology as well as any member of the public might. No one who has contributed has ever worked in the criminal law system or as an interpreter or translator. They therefore bring no special expertise to the task of explanation, beyond whatever their daily life has provided.

We have gathered explanations for 37 words in total. You can see a sample of these explanations on FuzzyLaw. The sample of explanations is regularly updated. You can also read responses to the terms and the explanations from mainly interpreters, police officers and academics. You are warmly invited to add your own responses and join in the discussion of each and every word. Check back regularly to see how discussions develop and consider bookmarking the site for future visits. The site also contains commentaries on interesting phenomena which have emerged through the site. You can respond to the commentaries too on that page, contributing to the developing research project.

(…)

Have you ever wondered what the ‘person on the street’ thinks about relational databases, RDF or the Semantic Web?

Those are the folks who are being pushed content based on interpretations not of their own making.

Here’s a work experiment for you:

  1. Take ten search terms from your local query log.
  2. At each department staff meeting, distribute sheets with the words, requesting everyone to define the terms in their own words. No wrong answers.
  3. Tally up the definitions per department and across the company (see the sketch after this list).
  4. Comments anyone?
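Here is one way the tallying in step 3 might look, assuming the responses land in a simple list of (department, term, definition) tuples:

    from collections import Counter, defaultdict

    # (department, search term, employee's definition) -- toy responses.
    responses = [
        ("sales", "pipeline", "list of deals in progress"),
        ("engineering", "pipeline", "sequence of processing stages"),
        ("sales", "pipeline", "deals being worked"),
        ("hr", "pipeline", "candidates moving toward hire"),
    ]

    per_department = defaultdict(Counter)
    company_wide = Counter()
    for dept, term, definition in responses:
        per_department[dept][(term, definition)] += 1
        company_wide[(term, definition)] += 1

    for dept, tally in per_department.items():
        print(dept, tally.most_common(1))
    distinct = {d for (t, d) in company_wide if t == "pipeline"}
    print("distinct definitions of 'pipeline':", len(distinct))

The number of distinct definitions per term is your in-house measure of semantic diversity.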

I first saw this at: FuzzyLaw: Collection of lay citizens’ understandings of legal terminology.

GraphX: A Resilient Distributed Graph System on Spark

Filed under: Graphs,GraphX,Spark — Patrick Durusau @ 10:23 am

GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica.

Abstract:

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Of particular note is the use of an immutable graph as the core data structure for GraphX.

The authors report that GraphX performs less well than PowerGraph (GraphLab 2.1) but promise performance gains and offsetting gains in productivity.

I didn’t find any additional resources at AMPLab on GraphX but did find:

Spark project homepage, and,

Screencasts on Spark

Both will benefit you when more information emerges on GraphX.
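GraphX's under-20-lines Pregel claim rests on how simple the abstraction is: vertices hold state, messages flow along edges each superstep, and computation halts when nothing changes. A toy single-machine rendering of that loop (connected components by minimum-id propagation), nothing like the distributed implementation:

    def pregel_components(edges, vertices):
        """Toy Pregel loop: each vertex adopts the smallest id it has
        heard of; stop when a superstep changes nothing."""
        state = {v: v for v in vertices}
        changed = True
        while changed:
            changed = False
            inbox = {v: [] for v in vertices}
            for a, b in edges:  # messages flow both ways on each edge
                inbox[b].append(state[a])
                inbox[a].append(state[b])
            for v in vertices:  # vertex program: keep the minimum seen
                new = min([state[v]] + inbox[v])
                if new != state[v]:
                    state[v], changed = new, True
        return state

    print(pregel_components([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6]))
    # {1: 1, 2: 1, 3: 1, 5: 5, 6: 5}

GraphX expresses the same loop over Spark's distributed tables, which is where the fault-tolerance and scale come from.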

Graph Landscape Survey

Filed under: GraphBuilder,GraphLab,Graphs,Neo4j,Pregel,Spark — Patrick Durusau @ 9:41 am

Improving options for unlocking your graph data by Ben Lorica.

From the post:

The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).
(…)

Ben summarizes graph resources for:

  • Data wrangling: creating graphs
  • Data management and search
  • Graph-parallel frameworks
  • Machine-learning and analytics
  • Visualization

It would be hard to find a better starting place for investigating the buzz about graphs.

I first saw this in An Overview of Graph Processing Frameworks by Danny Bickson.

May 19, 2013

Visual Storytelling – a thing of the past

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 6:46 pm

Visual Storytelling – a thing of the past by Michel Guillet.

From the post:

I spent quite a few summer vacations as a kid getting dragged around Europe visiting castles and churches. It is definitely an experience that I’m more thankful for now than I was at the time. One of the things that I loved most, even as a child, was seeing the stained glass windows. I have strong memories of being in Notre Dame in Paris and watching the light come in at dawn or staring at the Chartres Cathedral windows for minutes without moving.

Chartres

As a boy, it wasn’t the history, the architecture or an admiration of the faith involved to build these churches. Those were concepts beyond my ability, knowledge or frankly interest at the time. What I have come to realize only in the past couple of years is that the windows were meant for me. At the base level, I needed something that could grab my attention and hold it. What I have discovered is that from this standpoint, I am no different than the illiterate masses of the Middle Ages or Renaissance. (emphasis in original)

Michel proceeds to make the art of Chartres Cathedral a lesson in data visualization and graphic presentation.

A very powerful lesson.

Does your interface treat communication with users as important?

You Are Listening to The New York Times

Filed under: Interface Research/Design,Music,News — Patrick Durusau @ 4:05 pm

You Are Listening to The New York Times by Hugh Mandeville.

From the post:

When the San Francisco Giants won the 2010 World Series, the post-victory celebrations got out of control. Revelers smashed windows, got into fistfights and started fires. A Muni bus and the metaverse were both set alight.

To track the chaos, Eric Eberhardt, a techie from the Bay Area, tuned in to a San Francisco police scanner station on soma.fm — while also listening to music. Something about the combination of ambient music and live police chatter clicked for Eberhardt, and youarelistening.to was born.

Eberhardt’s site is a mash-up of three APIs: police scanner audio from RadioReference.com, ambient music from SoundCloud and images from Flickr. The outcome is like a real-time soundtrack to Michael Mann’s movie “Heat.” My colleague Chase Davis, interactive news assistant editor, describes it as “‘Hearts of Space’ meets ‘The Wire.’”

(…)

My explorations inspired me to create a page on youarelistening.to that takes New York Times headlines from the Times Newswire API and reads them aloud using TTS-API.com’s text-to-speech API. I also created a page that reads trending tweets, using Twitter’s Search API.

Definitely has potential to enrich a user experience.

Imagine studying early 21st century history and when George W. Bush or Dick Cheney show up on your ereader, War Pigs plays in the background.

Trivia: Did you know that War Pigs was one of 165 songs that Clear Channel suggested could be inappropriate to play after 9/11? 2001 Clear Channel Memorandum.

Cat Stevens with Peace Train also made the list.

Terrorism we can survive. Those trying to protect us, I’m not so sure.

6 Golden Rules to Successful Dashboard Design

Filed under: Dashboard,Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 3:40 pm

6 Golden Rules to Successful Dashboard Design

From the article:

Dashboards are often created on-the-fly with data being added simply because there is some white space not being used. Different people in the company ask for different data to be displayed and soon the dashboard becomes hard to read and full of meaningless non-related information. When this happens, the dashboard is no longer useful.

This article discusses the steps that need to be taken during the design phase in order to create a useful and actionable dashboard.

Topic maps can be expressed as dashboards as well as other types of interfaces.

Whatever your interface, it needs to be driven by good design principles.

AnormCypher 0.4.1 released!

Filed under: Cypher,Graphs — Patrick Durusau @ 3:18 pm

AnormCypher 0.4.1 released! by Wes Freeman.

From the post:

Thanks to Pieter, AnormCypher 0.4.1 supports versions earlier than Neo4j 1.9 (I didn’t realize this was an issue).

AnormCypher is a Cypher-oriented Scala library for Neo4j Server (REST). The goal is to provide a great API for calling arbitrary Cypher and parsing out results, with an API inspired by Anorm from the Play! Framework.

If you are working with a Neo4j Server this may be of interest.

Scala Resources & Community links for the newcomer

Filed under: Functional Programming,Scala — Patrick Durusau @ 3:14 pm

Scala Resources & Community links for the newcomer by Raúl Raja.

From the post:

During the last couple of months I have been asked a few times among colleagues and friends how to get started with Scala. People come to Scala from diverse backgrounds such as:

  • Java folks looking for a better Java, or just tired of waiting for Java features other modern languages such as C# already offer.
  • Ruby, PHP, and programmers from a scripting background looking for type safety.
  • People trying to bridge the best of both the OOP and Functional paradigms.

Scala is a vast language full of features with a very technical community. Don’t let your first step discourage you, as you don’t need to know everything about Scala to become productive quickly. People on the mailing list will often talk about some crazy shit you don’t need to know just yet: Monads, Monoids, Combinators, Macros and things you may not even know how to pronounce. Seriously, as you start to learn about it, it’s going to blow your mind. It will take some time to digest all the info, but it sure is worth it. Here are a few resources and steps that may help you get started, focused on the community and not so much on the technical details of downloading and running your first Scala “hello world.”

More than a collection to bookmark for “someday,” this is a collection of resources to start following today.

I haven’t looked at all the references but from the ones I checked, I don’t think you will be disappointed.

Apache Drill

Filed under: Drill,NoSQL,Query Engine — Patrick Durusau @ 3:08 pm

Michael Hausenblas at NoSQL Matters 2013 does a great lecture on Apache Drill.

Slides.

Google’s Dremel Paper

Hausenblas projects a “beta” for Apache Drill by second quarter and GA by end of year.

Apache Drill User.

From the rationale:

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

How do you handle ad hoc exploration of data sets as part of planning a topic map?

Being able to “test” merging against data prior to implementation sounds like a good idea.

Visualizing your LinkedIn graph using Gephi (Parts 1 & 2)

Filed under: Gephi,Graphics,Networks,Social Networks,Visualization — Patrick Durusau @ 1:41 pm

Visualizing your LinkedIn graph using Gephi – Part 1

&

Visualizing your LinkedIn graph using Gephi – Part 2

by Thomas Cabrol.

From part 1:

Graph analysis becomes a key component of data science. A lot of things can be modeled as graphs, but social networks are really one of the most obvious examples.

In this post, I am going to show how one could visualize its own LinkedIn graph, using the LinkedIn API and Gephi, a very nice software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now !

My objective is to simply look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This is somewhat emulating what is available already in the InMaps data product, but, hey, this is cool to do it by ourselves, no ?

The first thing to do for running this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the oauth authentification protocol, which will let an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill the form as required, and in the OAuth part, use this redirect URL for instance:

Great introduction to Gephi!
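If you want to try the Gephi hand-off before wiring up OAuth, here is a minimal sketch: build the connection graph with networkx (invented names standing in for your LinkedIn pull) and write a GEXF file Gephi opens directly:

    import networkx as nx

    # Stand-in for connections fetched from the LinkedIn API.
    g = nx.Graph()
    g.add_edges_from([
        ("me", "alice"), ("me", "bob"), ("me", "carol"),
        ("alice", "bob"),  # alice and bob also know each other
    ])

    nx.write_gexf(g, "linkedin.gexf")  # then File > Open in Gephi
    print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")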

As a bonus, it reinforces the lesson that ETL isn’t required to re-use data.

ETL may be required in some cases, but in a world of data APIs those cases are getting fewer and fewer.

Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.

How much of your IT budget is supporting duplicated data?

Google Map Redesign [Brain Buds]

Filed under: Google Maps,Mapping,Maps — Patrick Durusau @ 1:22 pm

Google Map Redesign by Caitlin Dempsey.

From the post:

Google is preparing to debut its newly revamped Google Maps. Terming it “smart recommendations,” the new functionality of Google Maps is intended to be more interactive and custom tailored to the specific user. The more you use the map to search for locations, favorite items by starring them, and write location reviews, the more unique the map becomes. Clicking a specific business or feature will result in the map features adjusting to show roads and locations related to that place.

(…)

Previewing the new Google Maps is currently available by invite only. You can request your invite via the Preview page.

Technology could be exposing you to a broader view of the world, perhaps even as others see it.

Instead:

  • Apple brought us ear buds that wall us off from ambient sound and others.
  • Apple also brought us eye buds (iPhones) that wall us off from our visual surroundings.
  • Google is building brain buds to wrap you in a customized cocoon of content.

Ironic if you remember the original Macintosh commercial.

Timothy Leary today would say:

Turn on, tune in, unplug.
