Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 19, 2013

BitcoinVisualizer

Filed under: Graphics,Networks,Visualization — Patrick Durusau @ 5:55 am

BitcoinVisualizer by John Russell.

From the webpage:

Block Viewer visualizes the Bitcoin block chain by building an ownership network on top of the underlying transaction network and presents a web-enabled user interface to display the visualization results.

Great mapping exercise!

Imagine what could be done tracking all banking transfers.

Before you object that banking transfer monitoring would require a search warrant, remember that Richard Nixon could not be prosecuted for treason because the evidence was the result of an illegal wiretap.

Take this mapping as a reminder to use cash whenever possible.

Demo: http://www.blockviewer.com/#30203900

I first saw this in a tweet by Max De Marzi.

Cybersecurity Blogs

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:37 am

I have been looking for a starting collection of cybersecurity blogs and encountered Security Blogs at convert.io today.

I count ninety-two (92) blogs listed as of this morning.

I haven’t loaded them into a reader to judge how active or timely they are as a whole.
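One quick way to triage activity, assuming the list can be exported as plain feed URLs (this sketch uses the third-party feedparser library and placeholder URLs, so treat it as an illustration rather than a recipe for that particular page):

    import time
    import feedparser

    # Hypothetical feed URLs -- substitute whatever the blog roll actually exports.
    feeds = [
        "https://example.com/security-blog-1/feed",
        "https://example.com/security-blog-2/rss",
    ]

    for url in feeds:
        parsed = feedparser.parse(url)
        if not parsed.entries:
            print(f"{url}: no entries (dead or unreachable?)")
            continue
        # Most feeds expose a parsed publication date on each entry.
        dates = [e.published_parsed for e in parsed.entries if e.get("published_parsed")]
        if dates:
            age_days = (time.time() - time.mktime(max(dates))) / 86400
            print(f"{url}: last post about {age_days:.0f} days ago")
        else:
            print(f"{url}: entries present but no usable dates")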

Suggestions on other blog lists for cybersecurity?

Thanks!

Internet Census 2012

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:27 am

Internet Census 2012

Abstract:

While playing around with the Nmap Scripting Engine (NSE) we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to standard BusyBox with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS and SYN scans. We analyzed some of the data to get an estimation of the IP address usage.

All data gathered during our research is released into the public domain for further study.

Interesting paper, not to mention the data: 568GB compressed with ZPAQ, about 9TB of log data unpacked.

The topic map use case here is mapping this port data to other information resources.

Maybe time to get an extra external disk drive. 😉

I first saw this in a tweet by Jason Trost.

Easy 6502

Filed under: Artificial Intelligence,Programming — Patrick Durusau @ 5:18 am

Easy 6502 by Nick Morgan.

From the webpage:

In this tiny ebook I’m going to show you how to get started writing 6502 assembly language. The 6502 processor was massive in the seventies and eighties, powering famous computers like the BBC Micro, Atari 2600, Commodore 64, Apple II, and the Nintendo Entertainment System. Bender in Futurama has a 6502 processor for a brain. Even the Terminator was programmed in 6502.

So, why would you want to learn 6502? It’s a dead language isn’t it? Well, so’s Latin. And they still teach that. Q.E.D.

(Actually, I’ve been reliably informed that 6502 processors are still being produced by Western Design Center, so clearly 6502 isn’t a dead language! Who knew?)

Seriously though, I think it’s valuable to have an understanding of assembly language. Assembly language is the lowest level of abstraction in computers – the point at which the code is still readable. Assembly language translates directly to the bytes that are executed by your computer’s processor. If you understand how it works, you’ve basically become a computer magician.

Then why 6502? Why not a useful assembly language, like x86? Well, I don’t think learning x86 is useful. I don’t think you’ll ever have to write assembly language in your day job – this is purely an academic exercise, something to expand your mind and your thinking. 6502 was originally written in a different age, a time when the majority of developers were writing assembly directly, rather than in these new-fangled high-level programming languages. So, it was designed to be written by humans. More modern assembly languages are meant to be written by compilers, so let’s leave it to them. Plus, 6502 is fun. Nobody ever called x86 fun.

A useful reminder about the nature of processing in computers.

Whatever a high level language may imply to you, for your computer, it’s just instructions.
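If you want to see that point from inside a high-level language, Python's standard dis module will disassemble a function into the bytecode instructions the interpreter actually executes. Not 6502, but the same lesson:

    import dis

    def add(a, b):
        return a + b

    # Print the interpreter-level instructions behind this one-line function.
    dis.dis(add)
    # The output is a short listing of instructions such as LOAD_FAST,
    # BINARY_ADD (or BINARY_OP on newer interpreters) and RETURN_VALUE --
    # exact opcodes vary by Python version.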

OpenNews Learning… [data recycling?]

Filed under: News,Reporting — Patrick Durusau @ 5:18 am

OpenNews Learning wants to provide lessons to developers in and out of newsrooms by Justin Ellis.

From the post:

If you ever wanted an “Ask This Old House”-style guide set in the universe of newsroom developers and designers, today you’re in luck: OpenNews Learning is a new kind of online education project that looks at the nuts and bolts of interactive projects through the eyes of the people who built them. It’s the newest arm of Knight-Mozilla OpenNews, the two-foundation collaboration that aims to strengthen the bonds between the worlds of journalism and software development.

One of the central ideas behind OpenNews is sharing knowledge, through building community and by putting outside developers directly into newsrooms. OpenNews Learning is an extension of that, designed to help developers (aspiring and otherwise) learn how specific projects were built. Consider it another way to “show your work.”

Following these projects should provide ample opportunities to suggest where topic maps could have been used.

I suspect most researchers would prefer data recycling over data mining.

March 18, 2013

Permission Resolution With Neo4j – Part 1

Filed under: Graphs,Neo4j,Networks,Security — Patrick Durusau @ 4:32 pm

Permission Resolution With Neo4j – Part 1 by Max De Marzi.

From the post:

People produce a lot of content. Messages, text files, spreadsheets, presentations, reports, financials, etc, the list goes on. Usually organizations want to have a repository of all this content centralized somewhere (just in case a laptop breaks, gets lost or stolen for example). This leads to some kind of grouping and permission structure. You don’t want employees seeing each other’s HR records, unless they work for HR, same for Payroll, or unreleased quarterly numbers, etc. As this data grows it no longer becomes easy to simply navigate and a search engine is required to make sense of it all.

But what if your search engine returns 1000 results for a query and the user doing the search is supposed to only have access to see 4 things? How do you handle this? Check the user permissions on each file realtime? Slow. Pre-calculate all document permissions for a user on login? Slow and what if new documents are created or permissions change between logins? Does the system scale at 1M documents, 10M documents, 100M documents?

Max addresses the scaling issue by checking permissions only on the results returned from a search. So to that extent, the size of the document store becomes irrelevant.

At least if you have a smallish number of results from the search.
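A minimal sketch of that post-filtering approach in Python; the search call and the permission check are hypothetical stand-ins, not Max's actual Neo4j schema or code:

    def visible_results(user_id, query, search, has_permission, limit=10):
        """Return up to `limit` hits the user may see, checking permissions
        only on what the search engine actually returned."""
        visible = []
        for doc_id in search(query):             # e.g. IDs from Lucene/Solr, best first
            if has_permission(user_id, doc_id):  # one permission lookup per candidate
                visible.append(doc_id)
                if len(visible) == limit:
                    break
        return visible

    # Toy stand-ins to make the sketch runnable:
    acl = {"d1": {"alice"}, "d2": {"bob"}, "d3": {"alice", "bob"}}
    print(visible_results(
        "alice",
        "quarterly report",
        search=lambda q: ["d1", "d2", "d3"],
        has_permission=lambda user, doc: user in acl[doc],
    ))  # ['d1', 'd3']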

I haven’t seen part 2, but another scaling tactic would be to limit access to indexes by permissions: segregating human resources, accounting, etc.

Looking forward to where Max takes this one.

High Performance JS heatmaps

Filed under: Graphics,Heatmaps,Visualization,WebGL — Patrick Durusau @ 3:51 pm

High Performance JS heatmaps by Florian Boesch. (live demo)

From the webpage:

You might have encountered heatmaps for data visualization before. There is a fabulous library, heatmap.js, which brings that capability to draw them to javascript. There is only one problem, it is not exactly fast. Sometimes that doesn’t matter. But if you have hundreds of thousands of data points to plot, or need realtime performance, it gets tricky. To solve that I’ve written a little engine using WebGL for drawing heatmaps.

Github: WebGL Heatmap.

Another tool for your data visualization toolkit.
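For context, the aggregation step behind any heatmap is just 2-D binning; here is a rough numpy sketch of that step alone (nothing to do with Boesch's WebGL engine, which is about drawing the result fast on the GPU):

    import numpy as np

    # A few hundred thousand random points -- the scale where a CPU-bound
    # canvas renderer reportedly starts to struggle.
    rng = np.random.default_rng(0)
    x = rng.normal(size=200_000)
    y = rng.normal(size=200_000)

    # Bin the points onto a 256x256 grid; each cell's count is its "heat".
    heat, xedges, yedges = np.histogram2d(x, y, bins=256)

    print(heat.shape, int(heat.max()))  # (256, 256) and the hottest cell's count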

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

The Biggest Failure of Open Data in Government

Filed under: Government,Government Data,Open Data,Open Government — Patrick Durusau @ 3:35 pm

Many open data initiatives forget to include the basic facts about the government itself by Philip Ashlock.

From the post:

In the past few years we’ve seen a huge shift in the way governments publish information. More and more governments are proactively releasing information as raw open data rather than simply putting out reports or responding to requests for information. This has enabled all sorts of great tools like the ones that help us find transportation or the ones that let us track the spending and performance of our government. Unfortunately, somewhere in this new wave of open data we forgot some of the most fundamental information about our government, the basic “who”, “what”, “when”, and “where”.

[Image: US map]

Do you know all the different government bodies and districts that you’re a part of? Do you know who all your elected officials are? Do you know where and when to vote or when the next public meeting is? Now perhaps you’re thinking that this information is easy enough to find, so what does this have to do with open data? It’s true, it might not be too hard to learn about the highest office or who runs your city, but it usually doesn’t take long before you get lost down the rabbit hole. Government is complex, particularly in America where there can be a vast multitude of government districts and offices at the local level.

How can we have a functioning democracy when we don’t even know the local government we belong to or who our democratically elected representatives are? It’s not that Americans are simply too ignorant or apathetic to know this information, it’s that the system of government really is complex. With what often seems like chaos on the national stage it can be easy to think of local government as simple, yet that’s rarely the case. There are about 35,000 municipal governments in the US, but when you count all the other local districts there are nearly 90,000 government bodies (US Census 2012) with a total of more than 500,000 elected officials (US Census 1992). The average American might struggle to name their representatives in Washington D.C., but that’s just the tip of the iceberg. They can easily belong to 15 government districts with more than 50 elected officials representing them.

We overlook the fact that it’s genuinely difficult to find information about all our levels of government. We unconsciously assume that this information is published on some government website well enough that we don’t need to include it as part of any kind of open data program

Yes, the number of subdivisions of government and the number of elected officials are drawn from two different census reports, the first from the 2012 census and the second from the 1992 census, a gap of twenty (20) years.

The Census bureau has the 1992 list, saying:

1992 (latest available) 1992 Census of Governments vol. I no. 2 [PDF, 2.45MB] * Report has been discontinued

Makes me curious why such a report would be discontinued.

And that was a report that did not even address the various agencies, offices, etc. that are also part of the various levels of government.

Makes me think you need an “insider” and/or a specialist just to navigate the halls of government.

Philip’s post illustrates that “open data” dumps from government are distractions from more effective questions of open government.

Questions such as:

  • Which officials have authority over what questions?
  • How to effectively contact those officials?
  • What actions are under consideration now?
  • Rules and deadlines for comments on actions?
  • Hearing and decision calendars?
  • Comments and submissions by others?
  • etc.

It never really is “…the local board of education (substitute your favorite board) decided….” but “…members A, B, D, and F decided that….”

Transparency means not allowing people and their agendas to hide behind the veil of government.

Semantic Search Over The Web (SSW 2013)

Filed under: Conferences,RDF,Semantic Diversity,Semantic Graph,Semantic Search,Semantic Web — Patrick Durusau @ 2:00 pm

3rd International Workshop on Semantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site, one place says the 26th, the other the 30th. Check before you make reservation/travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

VLDB 2013

Filed under: BigData,Conferences,Database — Patrick Durusau @ 1:37 pm

39th International Conference on Very Large Data Bases

Dates:

Submissions still open:

Industrial & Application Papers, Demonstration Proposals, Tutorial Proposals, PhD Workshop Papers, due by March 31st, 2013, author notification: May 31st, 2013

Conference: August 26 – 30, 2013.

From the webpage:

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. The conference will feature research talks, tutorials, demonstrations, and workshops. It will cover current issues in data management, database and information systems research. Data management and databases remain among the main technological cornerstones of emerging applications of the twenty-first century.

VLDB 2013 will take place at the picturesque town of Riva del Garda, Italy. It is located close to the city of Trento, on the north shore of Lake Garda, which is the largest lake in Italy, formed by glaciers at the end of the last ice age. In the 17th century, Lake Garda became a popular destination for young central European nobility. The list of its famous guests includes Goethe, Freud, Nietzsche, the Mann brothers, Kafka, Lawrence, and more recently James Bond. Lake Garda attracts many tourists every year, and offers numerous opportunities for sightseeing in the towns along its shores (e.g., Riva del Garda, Malcesine, Torri del Benaco, Sirmione), outdoors activities (e.g., hiking, wind-surfing, swimming), as well as fun (e.g., Gardaland amusement theme park).

Smile when you point “big data” colleagues to the 1st Very Large Data Bases conference, VLDB 1975: Framingham, Massachusetts.

Some people catch on sooner than others. 😉

…2,958 Nineteenth-Century British Novels

Filed under: Literature,Text Analytics,Text Mining — Patrick Durusau @ 10:27 am

A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.

From the introduction:

The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.

This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.

If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?

Or do we reflexively treat some genres of texts as “different?”

How useful you will find some of the techniques outlined will depend on the purpose of your analysis.

If you are only doing key-word searching, this isn’t likely to be helpful.

If on the other hand, you are attempting more sophisticated analysis, read on!
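To make the quantitative side concrete: at its simplest, this kind of analysis starts from relative frequencies of a word group tracked across publication years, along the lines of the toy sketch below (an illustration only, not Heuser and Le-Khac's semantic cohort method):

    from collections import Counter, defaultdict

    # Hypothetical corpus: (publication_year, text) pairs.
    corpus = [
        (1810, "the abstract virtue and honour of the heart"),
        (1850, "the hard streets and the factory walls of the town"),
        (1890, "the streets the walls the rooms of the house"),
    ]

    # A small "cohort" of words whose fortunes we want to trace over time.
    cohort = {"virtue", "honour", "heart"}

    totals = defaultdict(lambda: [0, 0])  # decade -> [cohort hits, total tokens]
    for year, text in corpus:
        decade = (year // 10) * 10
        tokens = text.lower().split()
        counts = Counter(tokens)
        totals[decade][0] += sum(counts[w] for w in cohort)
        totals[decade][1] += len(tokens)

    for decade in sorted(totals):
        hits, n = totals[decade]
        print(decade, f"{hits / n:.3f}")  # relative frequency of the cohort per decade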

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

Curating Inorganics? No. (ChEMBL)

Filed under: Cheminformatics,Curation — Patrick Durusau @ 8:57 am

The results are in – inorganics are out!

From the ChEMBL-og blog, which “covers the activities of the Computational Chemical Biology Group at the EMBL-EBI in Hinxton.”

From the post:

A few weeks ago we ran a small poll on how we should deal with inorganic molecules – not just simple sodium salts, but things like organoplatinums, and other compounds with dative bonds, unusual electronic states, etc. The results from you were clear, there was little interest in having a lot of our curation time spent on these. We will continue to collect structures from the source journals, and they will be in the full database, but we won’t try and curate the structures, or display them in the interface. They will be appropriately flagged, and nothing will get lost. So there it is, democracy in action.

So for ChEMBL 16 expect fewer issues when you try and load our structures in your own pipelines and systems.

Just an FYI that inorganic compounds are not being curated at ChEMBL.

If you decide to undertake such work, contacting ChEMBL to coordinate collection, etc., would be a good first step.

LDBC – Second Technical User Community (TUC) Meeting

Filed under: Graph Databases,RDF — Patrick Durusau @ 8:43 am

LDBC: Linked Data Benchmark Council – Second Technical User Community (TUC) Meeting – 22/23rd April 2013.

From the post:

The LDBC consortium are pleased to announce the second Technical User Community (TUC) meeting.

This will be a two day event in Munich on the 22/23rd April 2013.

The event will include:

  • Introduction to the objectives and progress of the LDBC project.
  • Description of the progress of the benchmarks being evolved through Task Forces.
  • Users explaining their use-cases and describing the limitations they have found in current technology.
  • Industry discussions on the contents of the benchmarks.

All users of RDF and graph databases are welcome to attend. If you are interested, please contact: ldbc AT ac DOT upc DOT edu.

Further meeting details at the post.

Crowdsourced Chemistry… [Documents vs. Data]

Filed under: Cheminformatics,Crowd Sourcing,Curation — Patrick Durusau @ 5:01 am

Crowdsourced Chemistry Why Online Chemistry Data Needs Your Help by Antony Williams. (video)

From the description:

This is the Ignite talk that I gave at ScienceOnline2010 #sci010 in the Research Triangle Park in North Carolina on January 16th 2010. This was supposed to be a 5 minute talk highlighting the quality of chemistry data on the internet. Ok, it was a little tongue in cheek because it was an after dinner talk and late at night but the data are real, the problem is real and the need for data curation of chemistry data online is real. On ChemSpider we have provided a platform to deposit and curate data. Other videos will show that in the future.

Great demonstration of the need for curation in chemistry.

And of the impact that re-usable information can have on the quality of information.

The errors in chemical descriptions you see in this video could be corrected in:

  • In an article.
  • In a monograph.
  • In a webpage.
  • In an online resource that can be incorporated by reference.

Which one do you think would propagate the corrected information more quickly?

Documents are a great way to convey information to a reader.

They are an incredibly poor way to store/transmit information.

Every reader has to extract the information in a document for themselves.

Not to mention that the data in a document is fixed, unless it incorporates information by reference.

Funny isn’t it? We are still storing data as we did when clay tablets were the medium of choice.

Isn’t it time we separated presentation (documents) from storage/transmission (data)?

Dublin Core Mapping Comments [by 7 April 2013]

Filed under: Dublin Core,Provenance — Patrick Durusau @ 4:21 am

Stuart Sutton, Managing Director, DCMI, calls on the Dublin Core community to comment on a mapping from Dublin Core terms to the PROV provenance ontology.

His call reads:

The DCMI Metadata Provenance Task Group [1] is collaborating with the W3C Provenance Working Group [2] on a mapping from Dublin Core terms to the PROV provenance ontology [3], currently a W3C Proposed Recommendation. More precisely, the document describes a partial mapping from DCMI Metadata Terms [4] to the PROV-O OWL2 ontology [5] — a set of classes and properties usable for representing and interchanging information about provenance. Numerous terms in the DCMI vocabulary provide information about the provenance of a resource. Translating these terms into PROV relates this information explicitly to the W3C provenance model.

The mapping is currently a W3C Working Draft. The final state of the document will be that of a W3C Note, to be published as part of a suite of documents in support of a W3C Recommendation for provenance interchange [6].

DCMI would like to point to the W3C Note as a DCMI Recommended Resource and therefore encourages the Dublin Core community to provide feedback and take part in the finalization of the mapping.

The deadline for all comments is 7 April 2013. We recommend that comments be provided directly to the public W3C list for comments: public-prov-comments@w3.org [7], ideally with a Cc: to DCMI’s dc-provenance list [8]. Comments sent only to the dc-provenance list will be summarized on the W3C list and addressed, and discussions on the W3C list will be summarized back on the dc-provenance list when appropriate.

Stuart Sutton, Managing Director, DCMI

[1] http://dublincore.org/groups/provenance/
[2] http://www.w3.org/2011/prov/wiki/Main_Page
[3] http://www.w3.org/TR/2013/WD-prov-dc-20130312/
[4] http://dublincore.org/documents/dcmi-terms/
[5] http://www.w3.org/TR/prov-o/
[6] http://www.w3.org/TR/prov-overview/
[7] http://lists.w3.org/Archives/Public/public-prov-comments/
[8] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=dc-provenance

March 17, 2013

Treacherous backdoor found in TP-Link routers

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:31 pm

Treacherous backdoor found in TP-Link routers

From the post:

Security experts in Poland have discovered a treacherous backdoor in various router models made by TP-Link. When a specially crafted URL is called, the router will respond by downloading and executing a file from the accessing computer, reports Michał Sajdak from Securitum.

Said to affect: TL-WDR4300 and TL-WR743ND models.

I read this bulletin and now you have read my post about it.

How do I capture this information so it can be recovered by anyone purchasing or interacting with TP-Link routers?

Or better yet, pushed to anyone who is at an online purchasing site?

In the flood of security flaws I am not going to remember this tidbit past tomorrow or maybe the next day.

Moreover, whatever defect is causing this issue, likely exists elsewhere. How do I capture that information as well?

In case you are interested: TP-Link.

On Philosophy, Science, and Data

Filed under: Data Science,Philosophy — Patrick Durusau @ 4:17 pm

On Philosophy, Science, and Data by Jim Harris.

From the post:

Ever since Melinda Thielbar helped me demystify data science on OCDQ Radio, I have been pondering my paraphrasing of an old idea: Science without philosophy is blind; Philosophy without science is empty; Data needs both science and philosophy.

“A philosopher’s job is to find out things about the world by thinking rather than observing,” the philosopher Bertrand Russell once said. One could say a scientist’s job is to find out things about the world by observing and experimenting. In fact, Russell observed that “the most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition.”

Russell also said that “science is what we know, and philosophy is what we don’t know.” However, Stuart Firestein, in his book Ignorance: How It Drives Science, explained “there is no surer way to screw up an experiment than to be certain of its outcome.”

Although it seems it would make more sense for science to be driven by what we know, by facts, “working scientists,” according to Firestein, “don’t get bogged down in the factual swamp because they don’t care that much for facts. It’s not that they discount or ignore them, but rather that they don’t see them as an end in themselves. They don’t stop at the facts; they begin there, right beyond the facts, where the facts run out. Facts are selected for the questions they create, for the ignorance they point to.”

In this sense, philosophy and science work together to help us think about and experiment with what we do and don’t know.

Some might argue that while anyone can be a philosopher, being a scientist requires more rigorous training. A commonly stated requirement in the era of big data is to hire data scientists, but this begs the question: Is data science only for data scientists?

“Is data science only for data scientists?”

Let me answer that question with a story.

There is a book, originally published in 1965, called “How to Avoid Probate.” (Probate: the legal proceedings that may follow a death.)

Well, except that the laws concerning property, inheritance, etc., vary from state to state and even lawyers who don’t practice inheritance law in a state will send you to someone who does.

There were even rumors that the state bar associations were funding its publication.

If you think lawyers are expensive, try self-help. Your fees could easily double or triple, if not more.

The answer to: “Is data science only for data scientists?” depends on the result you want.

If you want high quality, reliable results, then you need to spend money on hiring data scientists.

If you want input from the managers of the sixty percent (60%) of your projects that fail, you know who to call.

BTW, be able to articulate what “success” would look like from a data science project before hiring data scientists.

If you can’t, use your failing project managers.

There isn’t enough data science talent to go around, and it should not be wasted.

PS: Those who argue anyone can be a philosopher get the sort of philosophy they deserve.

M3R: Increased Performance for In-Memory Hadoop Jobs

Filed under: Hadoop,Main Memory Map Reduce (M3R),MapReduce — Patrick Durusau @ 3:42 pm

M3R: Increased Performance for In-Memory Hadoop Jobs by Avraham Shinnar, David Cunningham, Benjamin Herta, Vijay Saraswat.

Abstract:

Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged – including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets – while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.

The authors start with the assumption of “clean” data that has already been reduced to terabytes in size and that can be stored in main memory for “scores” of nodes as opposed to thousands of nodes. (score = 20)

And they make the point that main memory is only going to increase in the coming years.

While phrased as “interactive analytics (e.g. interactive machine learning),” I wonder if the design point is avoiding non-random memory?

And what consequences will entirely random memory have on algorithm design? Or on the assumptions that drive algorithmic design?

One way to test the impact of large memory on design would be to award access to a cluster with several terabytes of data on a competitive basis, for some time period, with all the code, data, runs, etc., being streamed to a public forum.

One qualification being that the user not already have access to that level of computing power at work. 😉

I first saw this at Alex Popescu’s Paper: M3R – Increased Performance for In-Memory Hadoop Jobs.

Twitter users forming tribes with own language…

Filed under: Language,Tribes,Usage — Patrick Durusau @ 1:28 pm

Twitter users forming tribes with own language, tweet analysis shows by Jason Rodrigues.

From the post:

Twitter users are forming ‘tribes’, each with their own language, according to a scientific analysis of millions of tweets.

The research on Twitter word usage throws up a pattern of behaviour that seems to contradict the commonly held belief that users simply want to share everything with everyone.

In fact, the findings point to a more precise use of social media where users frequently include keywords in their tweets so that they engage more effectively with other members of their community or tribe. Just like our ancestors we try to join communities based on our political interests, ethnicity, work and hobbies.

And just like our ancestors our communities have semantics unique to those communities.

Always a pleasure to see people replicating the semantic diversity that keeps data curation techniques relevant.

Like topic maps mapping between language tribes.

See Jason’s post for the details but of particular interest is the placing of people into Twitter tribes based on usage.

Open Law Lab

Filed under: Education,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:36 pm

Open Law Lab

From the webpage:

Open Law Lab is an initiative to design law – to make it more accessible, more usable, and more engaging.

Projects:

Law Visualized

Law Education Tech

Usable Court Systems

Access to Justice by Design

Not to mention a number of interesting blog posts represented by images further down the homepage.

Access/interface issues are universal and law is a particularly tough nut to crack.

Progress in providing access to legal materials could well carry over to other domains.

I first saw this at: Hagan: Open Law Lab.

Beacons of Availability

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 10:39 am

From Records to a Web of Library Data – Pt3 Beacons of Availability by Richard Wallis.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.

I am an ardent sympathizer in helping people find “our stuff.”

I don’t disagree with the description of Google as: “…the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!”

But in all fairness to Google, I would remind you of Drabenstott’s research that found for the Library of Congress subject headings:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows:

  • children: 32%
  • adults: 40%
  • reference: 53%
  • technical services librarians: 56%

The Library of Congress subject headings have been around for more than a century, and just over half of the librarians can use them correctly.

Let’s not wait more than a century to test the claim:*

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines.


* By “test” I don’t mean the sort of study, “…we recruited twelve LIS students but one had to leave before the study was complete….”

I am using “test” in the sense of a well designed and organized social science project with professional assistance from social scientists, UI test designers and the like.

I think OCLC is quite sincere in its promotion of linked data, but effectiveness is an empirical question, not one of sincerity.

Semantic Queries by Example [Identity by Example (IBE)?]

Filed under: Query Language,Searching,Semantics — Patrick Durusau @ 9:47 am

Semantic Queries by Example by Lipyeow Lim, Haixun Wang, Min Wang.

Abstract:

With the ever increasing quantities of electronic data, there is a growing need to make sense out of the data. Many advanced database applications are beginning to support this need by integrating domain knowledge encoded as ontologies into queries over relational data. However, it is extremely difficult to express queries against graph structured ontology in the relational SQL query language or its extensions. Moreover, semantic queries are usually not precise, especially when data and its related ontology are complicated. Users often only have a vague notion of their information needs and are not able to specify queries precisely. In this paper, we address these challenges by introducing a novel method to support semantic queries in relational databases with ease. Instead of casting ontology into relational form and creating new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using those examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology. Our approach consists of three steps. In the first step, the user provides several examples that satisfy the query. In the second step, we use machine learning techniques to mine the semantics of the query from the given examples and related ontologies. Finally, we apply the query semantics on the data to generate the full query result. We also implement an optional active learning mechanism to find the query semantics accurately and quickly. Our experiments validate the effectiveness of our approach.

Potentially deeply important work for both a topic map query language and topic map authoring.

The authors conclude:

In this paper, we introduce a machine learning approach to support semantic queries in relational database. In semantic query processing, the biggest hurdle is to represent ontological data in relational form so that the relational database engine can manipulate the ontology in a way consistent with manipulating the data. Previous approaches include transforming the graph ontological data into tabular form, or representing ontological data in XML and leveraging database extenders on XML such as DB2’s Viper. These approaches, however, are either expensive (materializing a transitive relationship represented by a graph may increase the data size exponentially) or requiring changes in the database engine and new extensions to SQL. Our approach shields the user from the necessity of dealing with the ontology directly. Indeed, as our user study indicates, the difficulty of expressing ontology-based query semantics in a query language is the major hurdle of promoting semantic query processing. With our approach, the users do not even need to know ontology representation. All that is required is that the user gives some examples that satisfy the query he has in mind. The system then automatically finds the answer to the query. In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language. Our experiments and user study results show that the approach is efficient, effective, and general in supporting semantic queries in terms of both accuracy and usability. (emphasis added)

I rather like: “In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language.”

To take it a step further, it should apply to the authoring of topic maps as well.

A user selects from a set of examples the subjects they want to talk about. Quite different from any topic map authoring interface I have seen to date.
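A toy sketch of that query-by-example flow; this is emphatically not the authors' learning algorithm, just the shape of the interaction, with a made-up ontology lookup:

    # Hypothetical ontology: instance -> set of classes it belongs to.
    ontology = {
        "aspirin":     {"Drug", "Analgesic"},
        "ibuprofen":   {"Drug", "Analgesic", "NSAID"},
        "paracetamol": {"Drug", "Analgesic"},
        "penicillin":  {"Drug", "Antibiotic"},
        "caffeine":    {"Drug", "Stimulant"},
    }

    def infer_query(examples):
        """Guess the intended semantics: the classes shared by all examples."""
        return set.intersection(*(ontology[e] for e in examples))

    def run_query(shared_classes, items):
        """Return every item that carries all of the inferred classes."""
        return [item for item in items if shared_classes <= ontology[item]]

    examples = ["aspirin", "ibuprofen"]      # step 1: the user supplies examples
    semantics = infer_query(examples)        # step 2: mine the query semantics
    print(semantics)                         # e.g. {'Analgesic', 'Drug'} (set order varies)
    print(run_query(semantics, ontology))    # step 3: apply to the data
    # -> ['aspirin', 'ibuprofen', 'paracetamol']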

The “details” of capturing and querying semantics have stymied RDF:

[Image: F-16 cockpit]

(From: The Semantic Web Is Failing — But Why? (Part 4))

And topic map authoring as well.

Is your next authoring/querying interface going to be by example?

I first saw this in a tweet by Stefano Bertolo.

XMLQuire Web Edition

Filed under: HTML5,Saxon,XML,XSLT — Patrick Durusau @ 5:50 am

XMLQuire Web Edition: A Free XSLT 2.0 Editor for the Web

From the webpage:

XSLT 2.0 processing within the browser is now a reality with the introduction of the open source Saxon-CE from Saxonica. This processor runs as a JavaScript app and supports JavaScript interoperability and user-event handling for the era of HTML5 and the dynamic web.

This Windows product, XMLQuire, is an XSLT editor specially extended to integrate with Saxon-CE and support the Saxon-CE language extensions that make interactive XSLT possible. Saxon-CE is not included with this product, but is available from Saxonica here.

*nix folks will have to install Windows 7 or 8 on a VM to take advantage of this software.

Worth the effort if for no other reason than to see how the market majority lives. 😉

I first saw this in a tweet by Michael Kay.

Linkurious [free beta]

Filed under: Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 5:36 am

Linkurious

From the homepage:

CONNECT

Our Open Source backend indexes your graph so you can connect with it on Linkurious and get started in minutes. When it is done, launch the web application of Linkurious.

SEARCH

Typing any keyword in the search bar brings up all the related data in one step. We provide a console for advanced queries so you can be as broad or as specific as you want.

EXPLORE

By focusing on the items related to your search, visualizing and exploring your graph has never been easier. Dig further in any direction using the connected nodes and make sense of your data.

A couple of other resources:

How it works, and

Graph Visualization options and latest developments

will be of interest.

I haven’t signed up yet, but the slides make a good point: what graph visualization you need depends, unsurprisingly, on your use case.

I first saw this in a tweet by David W. Allen.

Matching Traversal Patterns with MATCH

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 5:17 am

Cypher basics: Matching Traversal Patterns with MATCH by Wes Freeman.

From the post:

“Because friends don’t let friends write atrocious recursive joins in SQL.” –Max De Marzi

The match clause is one of the first things you learn with Cypher. Once you’ve figured out how to look up your starting bound identifiers with start, you usually (but not always) want to match a traversal pattern, which is one of Cypher’s most compelling features.

The goal of this post is not to go over the syntax for all of the different cases in match–for that the docs do a good job: Cypher MATCH docs. Rather, I hoped to explain more the how of how match works.

First, you need to understand the difference between bound and unbound identifiers (sometimes we call them variables, too, in case I slip up and forget to be consistent). Bound identifiers are the ones that you know the value(s) of–usually you set these in the start clause, but sometimes they’re passed through with with. Unbound identifiers are the ones you don’t know the values of: the part of the pattern you’re matching. If you don’t specify an identifier, and instead just do a-->(), or something of that sort, an implicit unbound identifier is created for you behind the scenes, so Cypher can keep track of the values it’s found. The goal of the match clause is to find real nodes and relationships that match the pattern specified (find the unbound identifiers), based on the bound identifiers you have from the start.

Wes is creating enough of these mini-tutorials that his Cypher page is becoming a welcome collection point.

March 16, 2013

Lux

Filed under: Lucene,Saxon,Solr,XQuery,XSLT — Patrick Durusau @ 7:51 pm

Lux

From the readme:

Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

At its core, Lux provides XML-aware indexing, an XQuery 1.0 optimizer that rewrites queries to use the indexes, and a function library for interacting with Lucene via XQuery. These capabilities are tightly integrated with Solr, and leverage its application framework in order to deliver a REST service and application server.

The REST service is accessible to applications written in almost any language, but it will be especially convenient for developers already using Solr, for whom Lux operates as a Solr plugin that provides query services using the same REST APIs as other Solr search plugins, but using a different query language (XQuery). XML documents may be inserted (and updated) using standard Solr REST calls: XML-aware indexing is triggered by the presence of an XML-aware field in a document. This means that existing application frameworks written in many different languages are positioned to use Lux as a drop-in capability for indexing and querying semi-structured content.

The application server is a great way to get started with Lux: it provides the ability to write a complete application in XQuery and XSLT with data storage backed by Lucene.

If you are looking for experience with XQuery and Lucene/Solr, look no further!

May be a good excuse for me to look at defining equivalence statements using XQuery.

I first saw this in a tweet by Michael Kay.

Apache Solr 4 Cookbook (Win a free copy)

Filed under: Contest,Solr — Patrick Durusau @ 7:34 pm

Apache Solr 4 Cookbook (Win a free copy)

Deadline 28.03.2013.

From the post:

Readers would be pleased to know that we have teamed up with Packt Publishing to organize a Giveaway of the Apache Solr 4 Cookbook. Two lucky winners will win a copy of the book (in eBook format). Keep reading to find out how you can be one of the Lucky Winners.

Let’s start with a little reminder about the book:

  • Learn how to make Apache Solr search faster, more complete, and comprehensively scalable
  • Solve performance, setup, configuration, analysis, and query problems in no time
  • Get to grips with, and master, the new exciting features of Apache Solr 4

Read more about this book and download free Sample Chapter.

How to Enter ?

All you need to do is head on over to the book page (Apache Solr 4 Cookbook) and look through the product description of the book and drop a line via the comments below this post to let us know what interests you the most about this book. It’s that simple.

Product Description: http://www.packtpub.com/apache-solr-4-cookbook/book

Deadline

The contest will close on 28.03.2013. Winners will be contacted by email, so be sure to use your real email address when you comment!

Who Will Win ?

The winners will be chosen by the Solr.pl team randomly from readers entering the competition that replied with on topic comment.

If you want to increase your chances of winning, write a small review of the book using the sample chapter on Amazon.com and also forward the same post to bhavins@packtpub.com.

You would know I would see this contest two (2) days after purchasing an electronic copy of this book!

I may enter the contest anyway so I can forward someone the “extra” copy of it.

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance

Filed under: Benchmarks,RDF,Semantic Web — Patrick Durusau @ 7:24 pm

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance by Angela Guess.

From the post:

Algebraix Data Corporation today announced its SPARQL Server(TM) RDF database successfully executed all 17 of its queries on the SP2 benchmark up to one billion triples on one computer node. The SP2 benchmark is the most computationally complex for testing SPARQL performance and no other vendor has reported results for all queries on data sizes above five million triples.

Furthermore, SPARQL Server demonstrated linear performance in total SP2Bench query time on data sets from one million to one billion triples. These latest dramatic results are made possible by algebraic optimization techniques that maximize computing resource utilization.

“Our outstanding SPARQL Server performance is a direct result of the algebraic techniques enabled by our patented Algebraix technology,” said Charlie Silver, CEO of Algebraix Data. “We are investing heavily in the development of SPARQL Server to continue making substantial additional functional, performance and scalability improvements.”

Pretty much a copy of the press release from Algebraix.

You may find:

Doing the Math: The Algebraix DataBase Whitepaper: What it is, how it works, why we need it (PDF) by Robin Bloor, PhD

ALGEBRAIX Technology Mathematics Whitepaper (PDF), by Algebraix Data

and,

Granted Patents

more useful.

BTW, The SP²Bench SPARQL Performance Benchmark will be useful as well.

Algebraix listed its patents but I supplied the links. Why the links were missing at Algebraix I cannot say.

If the claim that “…no other vendor has reported results for all queries on data sizes above five million triples…” is correct, isn’t scaling an issue for SPARQL?

Non-Word Count Hello World

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:11 pm

Finally! A Hadoop Hello World That Isn’t A Lame Word Count! by John Berryman.

From the post:

So I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World in which I coaxed Hadoop into transposing a matrix!

…

What? What’s that you say? You think that a matrix transpose MapReduce is way more lame than a word count? Well I didn’t say that we were going to be saving the world with this MapReduce job, just flexing our mental muscles a little more. Typically, when you run the WordCount example, you don’t even look at the java code. You just pat yourself on the back when the word “the” is invariably revealed to be the most popular word in the English language.

The goal of this exercise is to present a new challenge and a simple challenge so that we can practice thinking about solving BIG problems under the sometimes unintuitive constraints of MapReduce. Ultimately I intend to follow this post up with exceedingly more difficult MapReduce problems to challenge you and encourage you to tackle your own problems.

So, without further adieu:

As John says, not much beyond the Word Count examples but it is a different problem.
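The gist of the transpose job, boiled down to a runnable toy in Python rather than the Hadoop Java API (so none of this is John's actual code, just the map and reduce logic):

    from collections import defaultdict

    # The matrix as (row, col, value) records -- the shape of a mapper's input.
    records = [(0, 0, 1), (0, 1, 2),
               (1, 0, 3), (1, 1, 4)]

    def mapper(row, col, value):
        # Key by the *column*: after the shuffle, each key holds one transposed row.
        yield col, (row, value)

    def reducer(col, values):
        # Sort by the original row index so the transposed row comes out in order.
        return col, [v for _, v in sorted(values)]

    # Simulate the shuffle phase: group the mapper output by key.
    shuffled = defaultdict(list)
    for r, c, v in records:
        for key, val in mapper(r, c, v):
            shuffled[key].append(val)

    transposed = dict(reducer(k, vs) for k, vs in shuffled.items())
    print(transposed)  # {0: [1, 3], 1: [2, 4]} -- the transpose of [[1, 2], [3, 4]]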

The promise of more difficult MapReduce problems sounds intriguing.

Need to watch for the follow-up posts.

From Records to a Web of Library Data – Pt2 Hubs of Authority

Filed under: Library,Linked Data,LOD,RDF — Patrick Durusau @ 4:00 pm

From Records to a Web of Library Data – Pt2 Hubs of Authority by Richard Wallis.

From the post:

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

I don’t deny that VIAF is a very useful tool, but if you search for the personal name “Marilyn Monroe,” it returns:

1. Miller, Arthur, 1915-2005
National Library of Australia National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek RERO (Switzerland) SUDOC (France) Library and Archives Canada National Library of Israel (Latin) National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Miller, Arthur (Arthur Asher), 1915-2005
National Library of the Netherlands-test

Miller, Arthur, 1915-
Vatican Library Biblioteca Nacional de Portugal

ميلر، ارثر، 1915-2005 م.
Bibliotheca Alexandrina (Egypt)

Miller, Arthur
Wikipedia (en)-test

מילר, ארתור, 1915-2005
National Library of Israel (Hebrew)

2. Monroe, Marilyn, 1926-1962
National Library of Israel (Latin) National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek SUDOC (France) Library and Archives Canada National Library of Australia National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Monroe, Marilyn
National Library of the Netherlands-test Wikipedia (en)-test RERO (Switzerland)

Monroe, Marilyn American actress, model, and singer, 1926-1962
Getty Union List of Artist Names

Monroe, Marilyn, pseud.
Biblioteca Nacional de Portugal

3. DiMaggio, Joe, 1914-1999
Library of Congress/NACO Bibliothèque nationale de France

Di Maggio, Joe 1914-1999
Deutsche Nationalbibliothek

Di Maggio, Joseph Paul, 1914-1999
National Diet Library (Japan)

DiMaggio, Joe, 1914-
National Library of Australia

Dimaggio, Joseph Paul, 1914-1999
SUDOC (France)

DiMaggio, Joe (Joseph Paul), 1914-1999
National Library of the Netherlands-test

Dimaggio, Joe
Wikipedia (en)-test

4. Monroe, Marilyn
Deutsche Nationalbibliothek

5. Hurst-Monroe, Marlene
Library of Congress/NACO

6. Wolf, Marilyn Monroe
Deutsche Nationalbibliothek

Maybe Sir Tim is right, users “…can discover more things.”

Some of them are related, some of them are not.
