Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 8, 2014

Transducers – java, js, python, ruby

Filed under: Clojure,Functional Programming,Java,Javascript,Python,Ruby — Patrick Durusau @ 10:59 am

Transducers – java, js, python, ruby

Struggling with transducers?

Learn better by example?

Cognitect Labs has released transducers for Java, JavaScript, Ruby, and Python.

Clojure recently added support for transducers – composable algorithmic transformations. These projects bring the benefits of transducers to other languages.
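If you want the idea without reading library code, a transducer is just a function that transforms a reducing function, so the map/filter logic composes independently of the input collection and of the output. A toy sketch in plain Python (my own illustration, not the API of the transducers-python library):

from functools import reduce

# A transducer takes a reducing function and returns a new reducing function.

def mapping(f):
    def transducer(rf):
        def step(acc, x):
            return rf(acc, f(x))
        return step
    return transducer

def filtering(pred):
    def transducer(rf):
        def step(acc, x):
            return rf(acc, x) if pred(x) else acc
        return step
    return transducer

def compose(*fns):
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

def transduce(xform, rf, init, coll):
    return reduce(xform(rf), coll, init)

# Keep even numbers, square them, sum the result: one pass, no intermediate lists.
xform = compose(filtering(lambda x: x % 2 == 0), mapping(lambda x: x * x))
print(transduce(xform, lambda acc, x: acc + x, 0, range(10)))  # 120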

BTW, take a look at Rich Hickey’s latest (as of Nov. 2014) video on Transducers.

Please forward to language specific forums.

November 7, 2014

On the Shoulders of Giants: The Growing Impact of Older Articles

Filed under: Computer Science,Research Methods — Patrick Durusau @ 7:44 pm

On the Shoulders of Giants: The Growing Impact of Older Articles by Alex Verstak, et al.

Abstract:

In this paper, we examine the evolution of the impact of older scholarly articles. We attempt to answer four questions. First, how often are older articles cited and how has this changed over time. Second, how does the impact of older articles vary across different research fields. Third, is the change in the impact of older articles accelerating or slowing down. Fourth, are these trends different for much older articles.

To answer these questions, we studied citations from articles published in 1990-2013. We computed the fraction of citations to older articles from articles published each year as the measure of impact. We considered articles that were published at least 10 years before the citing article as older articles. We computed these numbers for 261 subject categories and 9 broad areas of research. Finally, we repeated the computation for two other definitions of older articles, 15 years and older and 20 years and older.

There are three conclusions from our study. First, the impact of older articles has grown substantially over 1990-2013. In 2013, 36% of citations were to articles that are at least 10 years old; this fraction has grown 28% since 1990. The fraction of older citations increased over 1990-2013 for 7 out of 9 broad areas and 231 out of 261 subject categories.

Second, the increase over the second half (2002-2013) was double the increase in the first half (1990-2001).

Third, the trend of a growing impact of older articles also holds for even older articles. In 2013, 21% of citations were to articles >= 15 years old with an increase of 30% since 1990 and 13% of citations were to articles >= 20 years old with an increase of 36%.

Now that finding and reading relevant older articles is about as easy as finding and reading recently published articles, significant advances aren’t getting lost on the shelves and are influencing work worldwide for years after.

Deeply encouraging results!

If indexing and retrieval could operate at a sub-article level, following chains of research across the literature would be even easier.
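If you want to reproduce the paper’s basic measure on citation data of your own, it amounts to a fraction per citing year. A minimal sketch (mine, not the authors’ code):

from collections import defaultdict

def older_citation_fraction(citations, age_threshold=10):
    # citations: iterable of (citing_year, cited_year) pairs.
    # Returns, per citing year, the fraction of citations to articles at
    # least age_threshold years older than the citing article.
    totals = defaultdict(int)
    older = defaultdict(int)
    for citing_year, cited_year in citations:
        totals[citing_year] += 1
        if citing_year - cited_year >= age_threshold:
            older[citing_year] += 1
    return {year: older[year] / totals[year] for year in totals}

# Toy data: three citations made in 2013, two of them to older articles.
sample = [(2013, 2012), (2013, 2001), (2013, 1990)]
print(older_citation_fraction(sample))  # {2013: 0.666...}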

At A Glance – Design Pattern

Filed under: Design,Design Patterns — Patrick Durusau @ 3:44 pm

Spotted: clever and useful design patterns by Ben Terrett.

From the post:

I was looking at the Ikea website at the weekend and noticed this smart design pattern.

[Image: Ikea product listing showing stock availability]

Many websites tell you whether an item is in stock and many tell you whether a product is available in store. But this tells you how many are in stock today and how many will be in stock tomorrow and the two days after. It’s clever and useful. (Should you wish to check the current availability of Malm drawers in Croydon Ikea you can here.)

Ben goes on to point out that one aspect of this design pattern is that it only requires a glance to understand.

I assume you can think of some topic map presentations with graphics that required more than a glance to understand. 😉

Comprehension “at a glance” isn’t always achievable, but when it is never achieved, take that as a warning sign. Particularly when it is customers who are having the difficulty.

Ben’s post has other examples and pointers on the issue of being “glanceable.”

Information Extraction framework in Python

Filed under: Associations,Entity Resolution,Python — Patrick Durusau @ 3:28 pm

Information Extraction framework in Python

From the post:

IEPY is an open source tool for Information Extraction focused on Relation Extraction.

To give an example of Relation Extraction, if we are trying to find a birth date in:

“John von Neumann (December 28, 1903 – February 8, 1957) was a Hungarian and American pure and applied mathematician, physicist, inventor and polymath.”

then IEPY’s task is to identify “John von Neumann” and “December 28, 1903” as the subject and object entities of the “was born in” relation.

It’s aimed at:
  • users needing to perform Information Extraction on a large dataset.
  • scientists wanting to experiment with new IE algorithms.
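To make the example concrete, here is a toy rule-based extractor for that one relation in plain Python. It only illustrates the (subject, relation, object) output IEPY aims for; it is not IEPY’s machinery or API.

import re

SENTENCE = ("John von Neumann (December 28, 1903 – February 8, 1957) was a "
            "Hungarian and American pure and applied mathematician, physicist, "
            "inventor and polymath.")

# Toy pattern: a capitalized name followed by a parenthesized date range.
PATTERN = re.compile(
    r"(?P<person>[A-Z][\w.]*(?: [A-Za-z][\w.]*)+)\s*"
    r"\((?P<birth>\w+ \d{1,2}, \d{4})\s*[-–]\s*\w+ \d{1,2}, \d{4}\)")

def extract_birth_relations(text):
    # Yield (subject, relation, object) triples for the "was born in" relation.
    for m in PATTERN.finditer(text):
        yield (m.group("person"), "was born in", m.group("birth"))

print(list(extract_birth_relations(SENTENCE)))
# [('John von Neumann', 'was born in', 'December 28, 1903')]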

Your success with recognizing relationships will vary but every one successfully recognized is one less that must be coded by hand.

Speaking of relationships, I would also prefer to have the relationships between John von Neumann and “Hungarian and American pure and applied mathematician, physicist, inventor and polymath” recognized.

I first saw this in a tweet by Scientific Python.

British intelligence spies on lawyer-client communications, government admits

Filed under: Cybersecurity,Government,Privacy,Security — Patrick Durusau @ 3:11 pm

British intelligence spies on lawyer-client communications, government admits by David Meyer.

From the post:

After the Snowden leaks, British lawyers expressed fears that the government’s mass surveillance efforts could undermine the confidentiality of their conversations with clients, particularly when those clients were engaged in legal battles with the state. Those fears were well-founded.

On Thursday the legal charity Reprieve, which provides assistance to people accused of terrorism, U.S. death row prisoners and so on, said it had succeeded in getting the U.K. government to admit that spy agencies tell their staff they may target and use lawyer-client communications “just like any other item of intelligence.” This is despite the fact that both English common law and the European Court of Human Rights protect legal professional privilege as a fundamental principle of justice.

See David’s post for the full details.

The dividends from 9/11 continue. One substantial terrorist attack and the United States, the United Kingdom and a number of other countries are in headlong flight from their constitutions and their traditions of protecting individual liberty from government intrusion.

Given the lack of terrorist attacks in the United States following 9/11, either the United States isn’t on maps used by terrorists or they can’t afford a plane ticket to the US. I don’t consider the underwear bomber so much a terrorist as a sad follower who wanted to be a terrorist. If that’s the best they have, we are in no real danger.

What the terrorism debate needs is a public airing of credible risks and strategies for addressing those risks. The secret abandonment of centuries of legal tradition because government functionaries lack the imagination to combat common criminals is inexcusable.

Citizens are in far more danger from their governments than any known terrorist organization. Perhaps that was the goal of 9/11. If so, it was the most successful attack in human history.

data.parliament @ Accountability Hack 2014

Filed under: Government,Government Data,Law,Law - Sources — Patrick Durusau @ 2:39 pm

data.parliament @ Accountability Hack 2014 by Zeid Hadi.

From the post:

We are pleased to announce that data.parliament will be providing data to be used during the Accountability Hack 2014

data.parliament is a platform that enables the sharing of UK Parliament’s data with consumers both within and outside of Parliament. Designed to complement existing data services it aims to be the central publishing platform and data repository for data that is produced by Parliament. Note our release is in Alpha.

It provides both a repository (http://api.data.parliament.uk) for data and a Linked Data API (http://lda.data.parliament.uk). The platform’s ‘shop front’ or data catalogue can be found here (http://data.parliament.uk).

The following datasets and APIs are now available on data.parliament:

  • Commons Written Parliamentary Questions and Answers
  • Lords Written Parliamentary Questions and Answers
  • Commons Oral Questions and Question Times
  • Early Day Motions
  • Lords Divisions
  • Commons Divisions
  • Commons Members
  • Lords Members
  • Constituencies
  • Briefing Papers
  • Papers Laid

A description of the APIs and their usage can be found at http://lda.data.parliament.uk. All the data exposed by the endpoints can be returned in a variety of formats, not least JSON.

To get you started the team has coded two publicly available demonstrators that make use of the data in data.parliament. The source code for these can be found at https://github.com/UKParliData. One of the demonstrators, a client app, can be found working at http://ddpdemo.azurewebsites.net/. Also be sure to read our blog (http://blog.data.parliament.uk) for quick start guides, updates, and news about upcoming datasets.

The data.parliament team will be on hand at the Hack, both participating and networking through the event to gather feedback and ideas.
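As a rough illustration of what consuming one of those endpoints might look like from Python (the dataset path below is a placeholder, so check the API documentation for the real endpoint names and response shape):

import requests

BASE = "http://lda.data.parliament.uk"

def fetch_items(dataset, **params):
    # 'dataset' is a placeholder path segment; see lda.data.parliament.uk for
    # the real endpoint names. Linked Data API responses typically wrap the
    # rows under result -> items; adjust if the shape differs.
    resp = requests.get("{0}/{1}.json".format(BASE, dataset), params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]["items"]

# Hypothetical call: first page of Early Day Motions, 10 items per page.
for item in fetch_items("earlydaymotions", _pageSize=10):
    print(item.get("_about"))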

I don’t know enough about British parliamentary procedure to comment on the completeness of the interface.

I am quite interested in the Briefing Papers data feed:

This dataset contains the data for research briefings produced by the Libraries of the House of Commons and House of Lords and the Parliamentary Office of Science and Technology. Each briefing has a pdf document for the briefing itself as well as a set of metadata to accompany it. (http://www.data.parliament.uk/dataset/04)

A great project, but even a complete set of documents and transcripts of every word spoken at Parliament does not document relationships between members of Parliament, their relationships to economic interests, etc.

Looking forward to collation of information from this project with other data to form a clearer picture of the legislative process in the UK.

I first saw this in a tweet by data.parliament UK.

No Search and Seizure, Just Join

Filed under: Cybersecurity,Security — Patrick Durusau @ 11:57 am

Most of us like to think the Tor Project offers perfect anonymity online, and certainly not something subject to FBI hacking.

Imagine the surprise when the headline reads: Alleged operator of Silk Road 2.0 busted, charged in NYC by Lisa Vaas. That is just so not cool. Is there some problem with Tor?

Relax! So far as is known, nothing is wrong with Tor. The problem? Internal security.

This time around, to get hold of whomever owned and operated Silk Road 2.0, a Homeland Security Investigations (HSI) agent tried a different tactic: he or she got onto Silk Road 2.0’s support staff.

The undercover agent got access to private, restricted areas of the site reserved for Benthall and his administrators.

Ouch!

Someone wasn’t paying attention to internal security. Andrea Heuer points out that internal security breaches occur in US businesses more than 2,560 times a day. (Over 2,560 Internal Security Breaches Occurred In US Businesses Every Day)

Consider replacing missing walls before buying expensive locks for the doors.

The braggadocious claims of the FBI:

Let’s be clear – this Silk Road, in whatever form, is the road to prison. Those looking to follow in the footsteps of alleged cybercriminals should understand that we will return as many times as necessary to shut down noxious online criminal bazaars. We don’t get tired.

make you wonder how many of the participants in Silk Road 2.0 will be prosecuted. Unless, that is, Tor has defeated the FBI’s efforts to trace the buyers and sellers.

If you hear of FBI prosecution of buyers or sellers from Silk Road 2.0, please drop me a note. Thanks!


Update: The truth behind Tor’s confidence crisis by Patrick Howell O’Neill.

Patrick sheds light to dispel the smoke being spread by law enforcement about Tor security. Bad security practices appear to be the reason why Dark Net sites have fallen; the largest sites continue without interruption. No security is absolute, but there is certainly none without basic security practices. (Don’t say “I am the darklord and live at NNN MainStreet, Commontown, USA.” on social media if “darklord” is your handle on a Tor network. OK? Just don’t do it.)

November 6, 2014

Deeper Than Quantum Mechanics—David Deutsch’s New Theory of Reality

Filed under: Information Theory,Philosophy,Quantum — Patrick Durusau @ 8:07 pm

Deeper Than Quantum Mechanics—David Deutsch’s New Theory of Reality

From the post:


Their new idea is called constructor theory and it is both simpler and deeper than quantum mechanics, or indeed any other laws of physics. In fact, Deutsch claims that constructor theory forms a kind of bedrock of reality from which all the laws of physics emerge.

Constructor theory is a radically different way of thinking about the universe that Deutsch has been developing for some time. He points out that physicists currently ply their trade by explaining the world in terms of initial conditions and laws of motion. This leads to a distinction between what happens and what does not happen.

Constructor theory turns this approach on its head. Deutsch’s new fundamental principle is that all laws of physics are expressible entirely in terms of the physical transformations that are possible and those that are impossible.

In other words, the laws of physics do not tell you what is possible and impossible, they are the result of what is possible and impossible. So reasoning about the physical transformations that are possible and impossible leads to the laws of physics.

That’s why constructor theory is deeper than anything that has gone before it. In fact, Deutsch does not think about it as a law of physics but as a principle, or set of principles, that the laws of physics must obey.

If that sounds like heavy sledding, see: arxiv.org/abs/1405.5563 : Constructor Theory of Information.

Abstract:

We present a theory of information expressed solely in terms of which transformations of physical systems are possible and which are impossible – i.e. in constructor-theoretic terms. Although it includes conjectured laws of physics that are directly about information, independently of the details of particular physical instantiations, it does not regard information as an a priori mathematical or logical concept, but as something whose nature and properties are determined by the laws of physics alone. It does not suffer from the circularity at the foundations of existing information theory (namely that information and distinguishability are each defined in terms of the other). It explains the relationship between classical and quantum information, and reveals the single, constructor-theoretic property underlying the most distinctive phenomena associated with the latter, including the lack of in-principle distinguishability of some states, the impossibility of cloning, the existence of pairs of variables that cannot simultaneously have sharp values, the fact that measurement processes can be both deterministic and unpredictable, the irreducible perturbation caused by measurement, and entanglement (locally inaccessible information).

The paper runs thirty (30) pages so should give you a good workout before the weekend. 😉

I first saw this in a tweet by Steven Pinker.

‘Magic’ – Quickest Way to Turn Webpage Into Data

Filed under: Web Scrapers — Patrick Durusau @ 7:49 pm

‘Magic’ – Quickest Way to Turn Webpage Into Data

From the post:

import.io recently launched its newest feature called ‘Magic’.

The tool, which they are providing free of charge, is useful for transforming web page(s) into a table that can be downloaded as a static CSV or accessed via a live API.

To use Magic, users simply need to paste in a URL, hit the “Get Data” button and import.io’s algorithms will turn that page into a table of data. The user does not need to download or install anything.

I tried this on a JIRA page (no go) but it worked fine on CEUR Workshop Proceedings.

I will be testing it on a number of pages. Could be the very thing for the one-page or light mining task that comes up.

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

Extracting insights from the shape of complex data using topology

Filed under: Data Analysis,Mathematics,Topological Data Analysis,Topology — Patrick Durusau @ 7:29 pm

Extracting insights from the shape of complex data using topology by P. Y. Lum, et al. (Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236)

Abstract:

This paper applies topological methods to study complex high dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Our method combines the best features of existing standard methodologies such as principal component and cluster analyses to provide a geometric representation of complex data sets. Through this hybrid method, we often find subgroups in data sets that traditional methodologies fail to find. Our method also permits the analysis of individual data sets as well as the analysis of relationships between related data sets. We illustrate the use of our method by applying it to three very different kinds of data, namely gene expression from breast tumors, voting data from the United States House of Representatives and player performance data from the NBA, in each case finding stratifications of the data which are more refined than those produced by standard methods.
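The authors’ pipeline is more sophisticated than this, but the mapper idea the paper builds on can be sketched in a few lines: project the data with a filter function (here the first principal component), cover the projection with overlapping intervals, cluster each slice, and connect clusters that share points. A bare-bones sketch, not the authors’ implementation:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def mapper_graph(X, n_intervals=8, overlap=0.3, eps=2.0, min_samples=5):
    # Nodes are clusters found inside overlapping slices of a 1-D PCA
    # projection; edges connect clusters that share points.
    f = PCA(n_components=1).fit_transform(X).ravel()   # the filter function
    lo, width = f.min(), (f.max() - f.min()) / n_intervals
    nodes = []
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) < min_samples:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(labels) - {-1}:
            nodes.append(set(idx[labels == lab].tolist()))
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
nodes, edges = mapper_graph(X)
print(len(nodes), "nodes,", len(edges), "edges")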

In order to identify subjects you must first discover them.

Does the available financial contribution data on members of the United States House of Representatives correspond with the clustering analysis here? (Asking because I don’t know but would be interested in finding out.)

I first saw this in a tweet by Stian Danenbarger.

Spark officially sets a new record in large-scale sorting

Filed under: Hadoop,Sorting,Spark — Patrick Durusau @ 7:16 pm

Spark officially sets a new record in large-scale sorting by Reynold Xin.

From the post:

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record.

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

If you are not already familiar with Spark, see the project homepage and/or the extensive documentation page. (Be careful, you can easily lose yourself in the Spark documentation.)

Introducing Revolution R Open and Revolution R Plus

Filed under: Programming,R — Patrick Durusau @ 3:45 pm

Introducing Revolution R Open and Revolution R Plus by David Smith.

From the post:

For the past 7 years, Revolution Analytics has been the leading provider of R-based software and services to companies around the globe. Today, we're excited to announce a new, enhanced R distribution for everyone: Revolution R Open.

Revolution R Open is a downstream distribution of R from the R Foundation for Statistical Computing. It's built on the R 3.1.1 language engine, so it's 100% compatible with any scripts, packages or applications that work with R 3.1.1. It also comes with enhancements to improve your R experience, focused on performance and reproducibility: 

  • Revolution R Open is linked with the Intel Math Kernel Libraries (MKL). These replace the standard R BLAS/LAPACK libraries to improve the performance of R, especially on multi-core hardware. You don't need to modify your R code to take advantage of the performance improvements.
  • Revolution R Open comes with the Reproducible R Toolkit. The default CRAN repository is a static snapshot of CRAN (taken on October 1). You can always access newer R packages with the checkpoint package, which comes pre-installed. These changes make it easier to share R code with other R users, confident that they will get the same results as you did when you wrote the code.

Today we are also introducing MRAN, a new website where you can find information about R, Revolution R Open, and R Packages. MRAN includes tools to explore R Packages and R Task Views, making it easy to find packages to extend R's capabilities. MRAN is updated daily.

Revolution R Open is available for download now. Visit mran.revolutionanalytics.com/download for binaries for Windows, Mac, Ubuntu, CentOS/Red Hat Linux and (of course) the GPLv2 source distribution.

With the new Revolution R Plus program, Revolution Analytics is offering technical support and open-source assurance for Revolution R Open and several other open source projects from Revolution Analytics (including DeployR Open, ParallelR and RHadoop). If you are interested in subscribing, you can find more information at www.revolutionanalytics.com/plus . And don't forget that big-data R capabilities are still available in Revolution R Enterprise.

We hope you enjoy using Revolution R Open, and that your workplace will be confident adopting R with the backing of technical support and open source assurance of Revolution R Plus. Let us know what you think in the comments! 

Apologies for missing such important R news!

I have downloaded R Open (Ubuntu 64-bit) and as soon as I exit a conference call, will install. (I try not to multi-task anytime I am root or even sudo.)

EU commits €14.4m to support open data across Europe

Filed under: EU,Open Data — Patrick Durusau @ 2:47 pm

EU commits €14.4m to support open data across Europe by Samuel Gibbs.

From the post:

The European Union has committed €14.4m (£11m) towards open data with projects and institutions led by the Open Data Institute (ODI), Southampton University, the Open University and Telefonica.

The funding, announced today at the ODI Summit being held in London, is the largest direct investment into open data startups globally and will be used to fund three separate schemes covering startups, open data research and a new training academy for data science.

“This is a decisive investment by the EU to create open data skills, build capabilities, and provide fuel for open data startups across Europe,” said Gavin Starks, chief executive of the ODI, a not-for-profit organisation based in London co-founded by the inventor of the world wide web, Sir Tim Berners-Lee. “It combines three key drivers for open adoption: financing startups, deepening our research and evidence, and training the next generation of data scientists, to exploit emerging open data ecosystems.”

Money from the €14.4m will be divided into three sections. Through the EU’s €80 billion Horizon 2020 research and innovation funding, €7.8m will be used to fund the 30-month Open Data Incubator for Europe (ODInE) for open data startups modelled on the ODI’s UK open data startup incubator that has been running since 2012.

Take a look at the Open Data Institute’s Startup page.

BTW, on the list of graduates, the text of the links for Provenance and Mastodon C are correct but the underlying hyperlinks,
http://theodi.org/start-ups/www.provenance.it and http://theodi.org/start-ups/www.mastodonc.com, respectively, are incorrect.

With the correct underlying hyperlinks:

Mastodon C

Provenance

I did not check the links for the current startups. I did run the W3C Link Checker on http://theodi.org/start-ups and got some odd results. If you are interested, see what you think.

Sorry, I got diverted by the issues with the Open Data Institute site.

Among other highlights from the article:

A further €3.7m will be used to fund 15 researchers into open data posed with the question “how can we answer complex questions with web data?”.

You can puzzle over that one on your own.

Caselaw is Set Free, What Next? [Expanding navigation/search targets]

Filed under: Law,Law - Sources,Legal Informatics,Topic Maps — Patrick Durusau @ 1:31 pm

Caselaw is Set Free, What Next? by Thomas Bruce, Director, Legal Information Institute, Cornell.

Thomas provides a great history of Google Scholar’s caselaw efforts and its impact on the legal profession.

More importantly, at least to me, were his observations on how to go beyond the traditional indexing and linking in legal publications:

A trivial example may help. Right now, a full-text search for “tylenol” in the US Code of Federal Regulations will find… nothing. Mind you, Tylenol is regulated, but it’s regulated as “acetaminophen”. But if we link up the data here in Cornell’s CFR collection with data in the DrugBank pharmaceutical collection, we can automatically determine that the user needs to know about acetaminophen — and we can do that with any name-brand drug in which acetaminophen is a component. By classifying regulations using the same system that science librarians use to organize papers in agriculture, we can determine which scientific papers may form the rationale for particular regulations, and link the regulations to the papers that explain the underlying science. These techniques, informed by emerging approaches in natural-language processing and the Semantic Web, hold great promise.

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know. By using technology to vastly expand the number of things that can meaningfully and precisely be submitted for search, we can dramatically improve results for a wide swath of users. In our shop, we refer to this as the process of “getting from barking dog to nuisance”, an in-joke that centers around mapping a problem expressed in real-world terms to a legal concept. Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

(emphasis added)

The first line I bolded in the quote:

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know.

captures the essence of a topic map. Yes? That is, a user navigates or queries a topic map on the basis of terms they already know. In so doing, they can find other terms that are interchangeable with theirs, but more importantly, if information is indexed using a different term than theirs, they can still find the information.
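A toy version of that exchange, with a hand-built mapping from everyday terms to the indexing vocabulary (the mapping and the document identifiers below are made up):

# Everyday terms mapped to the terms the collection is actually indexed under.
SYNONYMS = {
    "tylenol": {"acetaminophen"},
    "barking dog": {"nuisance"},
}

# Inverted index: indexing term -> (hypothetical) document identifiers.
INDEX = {
    "acetaminophen": {"cfr/acetaminophen-labeling"},
    "nuisance": {"caselaw/nuisance-0007"},
}

def search(query):
    # Expand the user's term through the mapping, then hit the index.
    terms = {query.lower()} | SYNONYMS.get(query.lower(), set())
    hits = set()
    for term in terms:
        hits |= INDEX.get(term, set())
    return terms, hits

print(search("Tylenol"))      # finds the acetaminophen material
print(search("barking dog"))  # finds the nuisance material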

In traditional indexing systems (think of the Readers’ Guide to Periodical Literature or the Library of Congress Subject Headings), some users learned those systems in order to become better searchers. Still an interchange of what you know for what you don’t know, but with a large front-end investment.

Thomas is positing a system like topic maps that enables users to navigate by the terms they already know to find information they don’t know.

The second block of text I bolded:

Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

Making wide scale mappings certainly is a challenge. In part because there are so many mappings to be made and so many different ways to make them. Not to mention that the mappings will evolve over time as usages change.

There is growing realization that indexing or linking data results in a very large pile of indexed or linked data. You can’t really navigate it unless or until you hit upon the correct terms to make the next link. We could try to teach everyone the correct terms but as more correct terms appear everyday, that seems an unlikely solution. Thomas has the right of it when he suggests expanding the target of “correct” terms.

Topic maps are poised to help expand the target of “correct” terms, and to do so in such a way as to combine with other expanded targets of “correct” terms.

I first saw this in a tweet by Aaron Kirschenfeld.


Update: The Tarlton Law Library (University of Texas at Austin) Legal Research Guide has a great page of tips and pointers on the Google Scholar caselaw collection. Bookmark this guide.

PgOSQuery [OSQuery for Postgres]

Filed under: osquery,PostgreSQL — Patrick Durusau @ 12:58 pm

PgOSQuery

From the webpage:

So I saw Facebook’s OSQuery, and thought “That looks awesome, but complicated to build on top of SQLite. Postgres’ Foreign Data Wrappers seem like a much better foundation. How long would it take to write the same app on top of Postgres?”. Turns out it takes about 15 minutes, for someone who’s never written an FDW before 🙂

This approach does have the downside that it runs as the postgres user rather than as root, so it can’t see the full details of other people’s processes, but I’m sure that could be worked around if you really want to.

Currently this is just a proof-of-concept to see how useful Postgres’ foreign data wrappers are, and how easy they are to create with the Multicorn python library. Seems the answers are “very useful” and “very easy”. If people want to make this more useful by adding more virtual tables, pull requests are welcome~
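For the curious, a Multicorn foreign data wrapper really is a small amount of Python. The sketch below is not PgOSQuery’s code, just my guess at the shape of such a wrapper, exposing local process information via psutil:

from multicorn import ForeignDataWrapper
import psutil

class ProcessFDW(ForeignDataWrapper):
    # Expose the local process table as rows of a Postgres foreign table.
    # Assumes the foreign table declares pid, name and username columns.

    def __init__(self, options, columns):
        super(ProcessFDW, self).__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        # Multicorn calls execute() on each scan; yield one dict per row.
        for proc in psutil.process_iter():
            try:
                info = proc.as_dict(attrs=["pid", "name", "username"])
            except psutil.NoSuchProcess:
                continue
            yield {col: info.get(col) for col in self.columns}

On the Postgres side you would, roughly, install the multicorn extension, create a server whose wrapper option points at this class, and declare a foreign table with pid, name and username columns; the Multicorn documentation has the exact statements.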

The system information captured by OSQuery and PgOSQuery is always present. But in order to talk about it (in terms of recorded information), you must capture that information and, just as importantly, have a method to associate your comments with that information.

Any database could capture the system information captured by OSQuery and PgOSQuery. But having captured it, how do you talk about the column headers for the data? Data dictionaries are an option if your database supports them, but then how do you talk about the entry in your data dictionary?

Not that you are required to talk about entries in your data dictionary, but silence on that point should be a design choice, not a default cone of silence.

Facebook – Government Requests Report [A Poverty of Terrorists?]

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 10:34 am

Facebook – Government Requests Report

Facebook has an interactive world map for finding reports on government data requests. You can also download the report as a CSV file.

Be forewarned that in summary form, the data is nearly useless in terms of tracking the basis for government demands.

On the other hand, “Government demands for Facebook user data soar by 24%” sounds impressive (24% is, after all, a large increase) until you read the details:

Facebook says a total of 34,946 user data requests covering 49,479 accounts were made – an increase of 24% – with the US again leading the way with 15,433 inquiries. India made 4559 requests, and Germany, France, the UK, Italy and Brazil all made over 1000 requests for information on users.

Oh, so 15,433 queries for the US out of 1.35 billion monthly active users as of September 30, 2014. (Facebook report.)

For round numbers, let’s call it 15,000 queries for the US out of 1,350,000,000 accounts.
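Back-of-the-envelope, that works out to:

requests_us = 15000          # rounded US requests for the reporting period
accounts = 1350000000        # monthly active users, Sept. 30, 2014
print(requests_us / accounts)   # ~0.0000111, about 0.001% of accounts
print(accounts // requests_us)  # about one request per 90,000 accounts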

I am hard pressed to explain the poverty of terrorists on Facebook since the US media and government claim terrorists are lurking behind every bush and in every dimly lit corner.

Is the absence of terrorists in meaningful numbers on Facebook evidence of their skill at hiding? Or is it more like keeping pink elephants away by snapping your fingers, a sign of a poverty of terrorists?

PS: If 15,000 possible terrorists and other criminals sounds like a scary number, remember that 44,000 people die every six months in the US due to excessive alcohol consumption. CDC Fact Sheet on Alcohol.

November 5, 2014

Pro Git

Filed under: Git,Programming — Patrick Durusau @ 8:20 pm

Pro Git

From the webpage:

The entire Pro Git book, written by Scott Chacon and Ben Straub and published by Apress, is available here. All content is licensed under the Creative Commons Attribution Non Commercial Share Alike 3.0 license. Print versions of the book are available on Amazon.com.

Available on the web, in PDF, mobi, or ePub form for free and in a variety of languages.

At five hundred and seventy-four (574) pages I suspect it covers any subtlety of Git that you will need.

Pass this along!

Good Open Data. . . by design

Filed under: Data Governance,Data Quality,Open Data — Patrick Durusau @ 8:07 pm

Good Open Data. . . by design by Victoria L. Lemieux, Oleg Petrov, and Roger Burks.

From the post:

An unprecedented number of individuals and organizations are finding ways to explore, interpret and use Open Data. Public agencies are hosting Open Data events such as meetups, hackathons and data dives. The potential of these initiatives is great, including support for economic development (McKinsey, 2013), anti-corruption (European Public Sector Information Platform, 2014) and accountability (Open Government Partnership, 2012). But is Open Data’s full potential being realized?

A news item from Computer Weekly casts doubt. A recent report notes that, in the United Kingdom (UK), poor data quality is hindering the government’s Open Data program. The report goes on to explain that – in an effort to make the public sector more transparent and accountable – UK public bodies have been publishing spending records every month since November 2010. The authors of the report, who conducted an analysis of 50 spending-related data releases by the Cabinet Office since May 2010, found that the data was of such poor quality that using it would require advanced computer skills.

Far from being a one-off problem, research suggests that this issue is ubiquitous and endemic. Some estimates indicate that as much as 80 percent of the time and cost of an analytics project is attributable to the need to clean up “dirty data” (Dasu and Johnson, 2003).

In addition to data quality issues, data provenance can be difficult to determine. Knowing where data originates and by what means it has been disclosed is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes. Establishing data provenance does not “spring full blown from the head of Zeus.” It entails a good deal of effort undertaking such activities as enriching data with metadata – data about data – such as the date of creation, the creator of the data, who has had access to the data over time and ensuring that both data and metadata remain unalterable.
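That last point about keeping data and metadata unalterable is easy to sketch: record who, what, where and when alongside a content hash, and re-check the hash before you rely on the data. A minimal sketch of the idea, not any particular standard (the file name is made up):

import hashlib
import json
from datetime import datetime, timezone

def provenance_record(path, creator, source):
    # Who made it, where it came from, when this record was written,
    # and a hash to detect later changes.
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {
        "file": path,
        "creator": creator,
        "source": source,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
    }

def verify(path, record):
    # True if the file still matches the recorded hash.
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest() == record["sha256"]

record = provenance_record("spending_2014_10.csv", "Cabinet Office", "data.gov.uk")
print(json.dumps(record, indent=2))
print(verify("spending_2014_10.csv", record))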

What is it worth to you to use good open data rather than dirty open data?

Take the costs of your analytics projects for the past year and multiply that by eighty (80) percent. Just an estimate, the actual cost will vary from project to project, but did that result get your attention?

If so, contact your sources for open data and lobby for clean open data.

PS: You may find the World Bank’s Open Data Readiness Assessment Tool useful.

MeSH on Demand Update: How to Find Citations Related to Your Text

Filed under: Indexing,Medical Informatics,MeSH — Patrick Durusau @ 7:51 pm

MeSH on Demand Update: How to Find Citations Related to Your Text

From the post:

In May 2014, NLM introduced MeSH on Demand, a Web-based tool that suggests MeSH terms from your text (such as an abstract or grant summary of up to 10,000 characters) using the MTI (Medical Text Indexer) software. For more background information, see the article MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms.

New Feature

A new MeSH on Demand feature displays the PubMed ID (PMID) for the top ten related citations in PubMed that were also used in computing the MeSH term recommendations.

To access this new feature start from the MeSH on Demand homepage (see Figure 1), add your text, such as a project summary, into the box labeled “Text to be Processed.” Then, click the “Find MeSH Terms” button.

Results page:

[Image: MeSH on Demand results page]

A clever way to deal with the problem of a searcher not knowing the specialized vocabulary of an indexing system.
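Not that MTI is this simple, but the general trick (recommend controlled vocabulary terms by way of the most similar already-indexed documents) fits in a few lines. A toy sketch with a made-up, pre-indexed corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "already indexed" corpus: text plus curated index terms (made up).
docs = [
    "acetaminophen dosing and liver toxicity in adults",
    "ibuprofen and renal function in elderly patients",
    "migraine treatment with triptans and analgesics",
]
index_terms = [
    {"Acetaminophen", "Drug-Induced Liver Injury"},
    {"Ibuprofen", "Kidney"},
    {"Migraine Disorders", "Analgesics"},
]

def suggest_terms(text, top_k=2):
    # Recommend index terms by pooling the terms of the most similar documents,
    # and report those documents, much as MeSH on Demand reports related PMIDs.
    vec = TfidfVectorizer().fit(docs + [text])
    sims = cosine_similarity(vec.transform([text]), vec.transform(docs)).ravel()
    best = sims.argsort()[::-1][:top_k]
    suggested = set().union(*(index_terms[i] for i in best))
    related = [(docs[i], round(float(sims[i]), 3)) for i in best]
    return suggested, related

print(suggest_terms("acetaminophen overdose and liver injury in adults"))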

Have you seen this method used outside of MeSH?

AMR: Not semantics, but close (? maybe ???)

Filed under: Semantics,Subject Identity — Patrick Durusau @ 7:37 pm

AMR: Not semantics, but close (? maybe ???) by Hal Daumé.

From the post:

Okay, necessary warning. I’m not a semanticist. I’m not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I’ve been excited about AMR (the “Abstract Meaning Representation”) recently. It’s hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself “Meaning.” I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It’s a great introduction to AMR and you should read it (at least skim).

What I guess I’m interested in discussing is not the question of whether AMR is a good interlingua but whether it’s a semantic representation. Note that it doesn’t claim this: it’s not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it’s a reasonable place to look; see Emily Bender’s comment.] it’s probably the closest we have.

Deeply interesting work, particularly given the recent interest in Enhancing open data with identifiers. Be sure to read the comments to the post as well.

Who knew? Semantics are important!

😉

Topic maps take that a step further and capture your semantics, not necessarily the semantics of some expert unfamiliar with your domain.

Core Econ: a free economics textbook

Filed under: Ecoinformatics,Government,Government Data,Politics,Skepticism — Patrick Durusau @ 5:49 pm

Core Econ: a free economics textbook by Cathy O’Neil.

From the post:

Today I want to tell you guys about core-econ.org, a free (although you do have to register) textbook my buddy Suresh Naidu is using this semester to teach out of and is also contributing to, along with a bunch of other economists.

(image omitted)

It’s super cool, and I wish a class like that had been available when I was an undergrad. In fact I took an economics course at UC Berkeley and it was a bad experience – I couldn’t figure out why anyone would think that people behaved according to arbitrary mathematical rules. There was no discussion of whether the assumptions were valid, no data to back it up. I decided that anybody who kept going had to be either religious or willing to say anything for money.

Not much has changed, and that means that Econ 101 is a terrible gateway for the subject, letting in people who are mostly kind of weird. This is a shame because, later on in graduate level economics, there really is no reason to use toy models of society without argument and without data; the sky’s the limit when you get through the bullshit at the beginning. The goal of the Core Econ project is to give students a taste for the good stuff early; the subtitle on the webpage is teaching economics as if the last three decades happened.

Skepticism of government economic forecasts and data requires knowledge of the lingo and assumptions of economics. This introduction won’t get you to that level but it is a good starting place.

Enjoy!

Data Sources for Cool Data Science Projects: Part 2

Filed under: Data,Data Science — Patrick Durusau @ 5:33 pm

Data Sources for Cool Data Science Projects: Part 2 by Ryan Swanstrom.

From the post:

I am excited for the first ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 1) about finding data for your next data science project.

Nice collection of data sources, some familiar and some unexpected.

Enjoy!

Google and Mission Statements

Filed under: Google+,Indexing,Searching — Patrick Durusau @ 4:55 pm

Google has ‘outgrown’ its 14-year old mission statement, says Larry Page by Samuel Gibbs.

From the post:

Google’s chief executive Larry Page has admitted that the company has outgrown its mission statement to “organise the world’s information and make it universally accessible and useful” from the launch of the company in 1998, but has said he doesn’t yet know how to redefine it.

Page insists that the company is still focused on the altruistic principles that it was founded on in 1998 with the original mission statement, when he and co-founder Sergey Brin were aiming big with “societal goals” to “organise the world’s information and make it universally accessible and useful”.

Questioned as to whether Google needs to alter its mission statement, which was twinned with the company mantra “don’t be evil”, for the next stage of company growth in an interview with the Financial Times, Page responded: “We’re in a bit of uncharted territory. We’re trying to figure it out. How do we use all these resources … and have a much more positive impact on the world?”

This post came as a surprise to me because I was unaware that Google had solved the problem of “organis[ing] the world’s information and mak[ing] it universally accessible and useful.”

Perhaps so but it hasn’t made it to the server farm that sends results to me.

A quick search using Google on “cia” today produces a front page with resources on the Central Intelligence Agency, the Culinary Institute of America, Certified Internal Auditor (CIA) Certification and allegedly, 224,000,000 more results.

If I search using “Central Intelligence Agency,” I get a “purer” stream of content on the Central Intelligence Agency, which runs from its official website, https://www.cia.gov, to the Wikipedia article, http://en.wikipedia.org/wiki/Central_Intelligence_Agency, and ArtsBeat | Can’t Afford a Giacometti Sculpture? There’s Always the CIA’s bin Laden Action Figure.

Even with a detailed query, Google search results remind me of a line from Saigon Warrior that goes:

But the organization is a god damned disgrace

https://www.youtube.com/watch?v=0-U9Ns9oG6E

If Larry Page thinks Google has “organise[d] the world’s information and ma[de] it universally accessible and useful,” he needs a reality check.

True, Google has gone further than any other enterprise towards indexing some of the world’s information, but hardly all of it nor is it usefully organized.

Why expand Google’s corporate mission when the easy part of the earlier mission has been accomplished and the hard part is about to start?

Perhaps some enterprising journalist will ask Page why Google is dodging the hard part of organizing information? Yes?

Category: The Essence of Composition

Filed under: Category Theory,Functional Programming — Patrick Durusau @ 3:57 pm

Category: The Essence of Composition by Bartosz Milewski.

From the post:

I was overwhelmed by the positive response to my previous post, the Preface to Category Theory for Programmers. At the same time, it scared the heck out of me because I realized what high expectations people were placing in me. I’m afraid that no matter what I’ll write, a lot of readers will be disappointed. Some readers would like the book to be more practical, others more abstract. Some hate C++ and would like all examples in Haskell, others hate Haskell and demand examples in Java. And I know that the pace of exposition will be too slow for some and too fast for others. This will not be the perfect book. It will be a compromise. All I can hope is that I’ll be able to share some of my aha! moments with my readers. Let’s start with the basics.

Bartosz’s post includes pigs, examples in C and Haskell, and ends with:

Challenges

  1. Implement, as best as you can, the identity function in your favorite language (or the second favorite, if your favorite language happens to be Haskell).
  2. Implement the composition function in your favorite language. It takes two functions as arguments and returns a function that is their composition.
  3. Write a program that tries to test that your composition function respects identity.
  4. Is the world-wide web a category in any sense? Are links morphisms?
  5. Is Facebook a category, with people as objects and friendships as morphisms?
  6. When is a directed graph a category?
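If you want something to check your answers for challenges 1 through 3 against, here is a quick pass in Python (mine, not Bartosz’s):

import random

def identity(x):
    # Challenge 1: the identity function.
    return x

def compose(f, g):
    # Challenge 2: composition, (f . g)(x) = f(g(x)).
    return lambda x: f(g(x))

def respects_identity(f, samples):
    # Challenge 3: check f . id == id . f == f on sample inputs.
    return all(
        compose(f, identity)(x) == f(x) == compose(identity, f)(x)
        for x in samples
    )

inc = lambda x: x + 1
print(respects_identity(inc, [random.randint(-100, 100) for _ in range(20)]))  # True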

My suggestion is that you follow Bartosz’s posts and after mastering them, try less well explained treatments of category theory.

Category Theory & Programming

Filed under: Category Theory,Functional Programming,Haskell — Patrick Durusau @ 3:42 pm

Category Theory & Programming by Yann Esposito. (slides)

Great slides on category theory with this quote on slide 6:

One of the goals of Category Theory is to create a homogeneous vocabulary between different disciplines.

Is the creation of a homogeneous vocabulary responsible for the complexity of category theory, or is it the proof of equivalence between different disciplines?

As you know, topic maps retains the vocabularies of different disciplines as opposed to replacing them with a homogeneous one. Nor do topic maps require proof of equivalence in the sense of category theory.

I first saw this in a tweet by Sam Ritchie.

Category Theory Applied to Functional Programming

Filed under: Category Theory,Functional Programming,Haskell,Programming — Patrick Durusau @ 2:20 pm

Category Theory Applied to Functional Programming by Juan Pedro Villa Isaza.

Abstract:

We study some of the applications of category theory to functional programming, particularly in the context of the Haskell functional programming language, and the Agda dependently typed functional programming language and proof assistant. More specifically, we describe and explain the concepts of category theory needed for conceptualizing and better understanding algebraic data types and folds, functors, monads, and parametrically polymorphic functions. With this purpose, we give a detailed account of categories, functors and endofunctors, natural transformations, monads and Kleisli triples, algebras and initial algebras over endofunctors, among others. In addition, we explore all of these concepts from the standpoints of categories and programming in Haskell, and, in some cases, Agda. In other words, we examine functional programming through category theory.

Impressive senior project along with Haskell source code.

I first saw this in a tweet by Computer Science.

November 4, 2014

Open Data 500

Filed under: Open Data — Patrick Durusau @ 8:24 pm

Open Data 500

From the webpage:

The Open Data 500, funded by the John S. and James L. Knight Foundation and conducted by the GovLab, is the first comprehensive study of U.S. companies that use open government data to generate new business and develop new products and services.

The full list is likely the most useful resource at this site. You can filter by subject area and/or federal agency that is supplying the data.

Great place to look for gaps in terms of data based products and/or what areas are already being served.

I first saw this in a tweet by Paul Rissen.

What Is Big Data?

Filed under: BigData — Patrick Durusau @ 8:10 pm

What Is Big Data? by Jenna Dutcher.

From the post:

“Big data.” It seems like the phrase is everywhere. The term was added to the Oxford English Dictionary in 2013 and appeared in Merriam-Webster’s Collegiate Dictionary in 2014. Now, Gartner’s just-released 2014 Hype Cycle shows “big data” passing the “peak of inflated expectations” and moving on its way down into the “trough of disillusionment.” Big data is all the rage. But what does it actually mean?

A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it’s not the size of data that counts, but the tools being used or the insights that can be drawn from a dataset.

Jenna collected forty (40) different responses to the question: “What is Big Data?”

If you don’t see one you agree with at Jenna’s post, feel free to craft your own as a comment to this post.

If a large number of people mean almost but not quite the same thing by “big data,” does that give you a clue as to a persistent problem in IT? And in relationships between IT and other departments?

I first saw this in a tweet by Lutz Maicher.

Tessera

Filed under: BigData,Hadoop,R,RHIPE,Tessera — Patrick Durusau @ 7:20 pm

Tessera

From the webpage:

The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

Divide and Recombine (D&R)

Tessera is powered by Divide and Recombine. In D&R, we seek meaningful ways to divide the data into subsets, apply statistical methods to each subset independently, and recombine the results of those computations in a statistically valid way. This enables us to use the existing vast library of methods available in R – no need to write scalable versions.

DATADR

The datadr R package provides a simple interface to D&R operations. The interface is back end agnostic, so that as new distributed computing technology comes along, datadr will be able to harness it. Datadr currently supports in-memory, local disk / multicore, and Hadoop back ends, with experimental support for Apache Spark. Regardless of the back end, coding is done entirely in R and data is represented as R objects.

TRELLISCOPE

Trelliscope is a D&R visualization tool based on Trellis Display that enables scalable, flexible, detailed visualization of data. Trellis Display has repeatedly proven itself as an effective approach to visualizing complex data. Trelliscope, backed by datadr, scales Trellis Display, allowing the analyst to break potentially very large data sets into many subsets, apply a visualization method to each subset, and then interactively sample, sort, and filter the panels of the display on various quantities of interest.

RHIPE

RHIPE is the R and Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs wholly from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R operations directly through RHIPE , although in this case you are programming at a lower level.

Quite an impressive package for R and “big data.”
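The Divide and Recombine idea itself is language agnostic. A minimal Python sketch of the divide / apply / recombine shape (not datadr, just the pattern), with per-group means as the per-subset method:

from multiprocessing import Pool
from statistics import mean

def divide(records, key):
    # Divide: split records into subsets by a key function.
    subsets = {}
    for rec in records:
        subsets.setdefault(key(rec), []).append(rec)
    return subsets

def apply_method(item):
    # Apply: run the analytic method on one subset, independently of the others.
    group, rows = item
    return group, mean(value for _, value in rows)

def recombine(results):
    # Recombine: merge the per-subset results into one answer.
    return dict(results)

if __name__ == "__main__":
    data = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]
    subsets = divide(data, key=lambda rec: rec[0])
    with Pool(2) as pool:
        results = pool.map(apply_method, subsets.items())
    print(recombine(results))  # {'a': 2.0, 'b': 12.0}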

I first saw this in a tweet by Christophe Lalanne.

Open Source As Trouble?

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:06 pm

Germany weighs law that could mean more trouble for U.S. tech heavyweights by Barb Darrow.

From the post:

Barb quotes a Wall Street Journal article which summarizes the issue as follows:

The draft law, which is still being hammered out, envisions new requirements like revealing source code or other proprietary data for companies that sell information technology to the German government or to private companies that are part of industries Berlin deems critical to the country’s security.

To which she observes:

This is in part a response to Edward Snowden’s revelations last year of U.S. intelligence agencies spying on European citizens — including German chancellor Angela Merkel — often using U.S. technology as their conduit. But, in truth, European companies, acting out of self-interest, started pushing for national clouds built on home-grown technology long before Snowden became a household name. In September, 2011, for example, Reinhard Clemens, then CEO of Deutsche Telekom T-systems group, pushed German regulators to create a new certification to enable super-secure clouds to be built in Germany or elsewhere in Europe. France Telecom execs subsequently pushed for similar moves in their home country.

Reading between the lines, the “vendors” pushing for smallish clouds are the same vendors who hope to build those clouds.

Reading “vendors” as the customers of current cloud providers, nationalism isn’t enough of a line item in a budget to justify paying more for less, short of legal requirements.

Rather than fighting a rear guard action against nationalistic legislation as it comes up in every country, a time consuming and ultimately losing position, U.S. IT should take the offensive against such efforts.

For example, the major cloud providers should start preparation to open source their software products.

Before anyone has to reach for their heart pills, remember what it means to open source software.

Sure, you could download the XXX-WordProcessor source code, but who is going to compile it for you, integrate it into your existing systems, customize it for your enterprise use?

Your local street corner IT shop, or vendors with decades, yes, decades, of IT support who originated the software?

BTW, before you worry too much about the coins that will drop off the table, who will be hit harder by an open source policy for software vendors? Vendors with fifty (50) million lines of code projects or vendors with less than one million lines of code projects?

There is another upside to open sourcing commercial software, at least if commercial use is prohibited: detection of software piracy. On a level playing field, where disclosure is the norm, it seems to me that piracy becomes very difficult to sustain.

Rather than an impediment for current cloud vendors, open source requirements, if managed properly, could lead to:

  1. Minimal impact on current enterprise vendors
  2. Better tracking of innovation in smaller IT shops
  3. Better detection of software piracy

Will the big four or five of the Cloud and open source ride down Germany, France, the EU(?), like the four horsemen of the apocalypse?

PS: I am personally interested in open source requirements because it creates one less place for U.S. intelligence agencies to hide. No, I don’t credit their protests to be acting in good faith or on behalf of the citizens of the United States. Why would you credit a known habitual liar?
