Another Word For It — Patrick Durusau on Topic Maps and Semantic Diversity

July 8, 2014

Prototyping in Clojure

Filed under: Clojure,Programming — Patrick Durusau @ 12:24 pm

Prototyping in Clojure by Philip Potter.

From the post:

We recently completed an alpha on identity assurance for organisations. The objective of an alpha is to gain understanding of a service and validate a design approach by building a working prototype. As part of the alpha, we built a set of interacting prototype applications representing a number of separate services as assurers and consumers of identity information. Because our objective was understanding and validation, these prototypes did not use any real user’s data.

We wanted our prototype to:

  • evolve rapidly, adapt quickly to feedback from our user research, and to be able to change direction entirely if necessary
  • be realistic enough to validate whether the service we were exploring was technically feasible
  • be simple and focus on the service we were trying to explore, rather than getting bogged down in implementation details

However, we didn’t need to worry about maintaining the code long-term: because the objective was better understanding through building a prototype, we were prepared to throw the code away at the end.

A government alpha IT project? That alone makes the post worth mentioning. Someday RFPs for alphas may be a feature of the United States Government 2.0. 😉

Philip covers the pluses, minuses, and could-be-better aspects of using Clojure for prototyping. Particularly useful if you want to use or improve Clojure as a prototyping language.

Curious, even if prototype code is “thrown away,” does the clarity of understanding from functional coding have a measurable impact on the final code?

I doubt there is enough Clojure prototyping to form a basis for such a study now, but in a few years it could be a hot topic of research interest.

I first saw this in a tweet by P.F.

July 7, 2014

Search Suggestions

Filed under: Humor,Searching — Patrick Durusau @ 6:38 pm

James Hughes posted this image of search suggestions to Twitter:

[image: search suggestions]

How do you check search suggestions?

Data Visualization in Sociology

Filed under: Graphics,Social Sciences,Visualization — Patrick Durusau @ 6:32 pm

Data Visualization in Sociology by Kieran Healy and James Moody. (Annu. Rev. Sociol. 2014. 40:5.1–5.24, DOI: 10.1146/annurev-soc-071312-145551)

Abstract:

Visualizing data is central to social scientific work. Despite a promising early beginning, sociology has lagged in the use of visual tools. We review the history and current state of visualization in sociology. Using examples throughout, we discuss recent developments in ways of seeing raw data and presenting the results of statistical modeling. We make a general distinction between those methods and tools designed to help explore data sets and those designed to help present results to others. We argue that recent advances should be seen as part of a broader shift toward easier sharing of the code and data both between researchers and with wider publics, and we encourage practitioners and publishers to work toward a higher and more consistent standard for the graphical display of sociological insights.

A great review of data visualization in sociology. I was impressed by the authors' catching the context of John Maynard Keynes' remark about the "evils of the graphical method unsupported by tables of figures."

In 1938, tables of figures reported actual data, not summaries. With a table of figures, another researcher could verify a graphic representation and/or re-use the data for their own work.

Perhaps journals could adopt a standing rule that no graphic representations are allowed in a publication unless and until the authors provide the data and processing steps necessary to reproduce the graphic. For public re-use.

The authors also make the point that for all the wealth of books on visualization and graphics, there is no cookbook that will enable a user to create a great graphic.

My suggestion in that regard is to collect visualizations that are widely thought to be “great” visualizations. Study the data and background of the visualization. Not so that you can copy the technique but in order to develop a sense for what “works” or doesn’t for visualization.

No guarantees but at a minimum, you will have experienced a large number of visualizations. That can’t hurt in your quest to create better visualizations.

I first saw this in a tweet by Christophe Lalanne.

Free Airport Chargers for Terrorists

Filed under: NSA,Security — Patrick Durusau @ 4:38 pm

Airlines may start supplying terrorists with free chargers for their electronic gear.

Why the Department of Homeland Security wants that result isn’t clear.

Jeff John Roberts reports in: Security order bans uncharged devices from some US-bound flights:

International air travelers heading to the U.S. now face another potential headache in security lines: they may not be able to board their plane unless they are able to turn on their phones, laptops and other electronic devices.

According to a new order announced on Sunday by the Department of Homeland Security:

“During the security examination, officers may also ask that owners power up some devices, including cell phones. Powerless devices will not be permitted onboard the aircraft. The traveler may also undergo additional screening.”

The new order, which did not specify which particular airports will be subject to the decree, comes after earlier expressions of U.S. concern that devices might be used as a shell to contain a bomb.

The new measures may add to already-arduous wait times at the airport, though the security expert cited by the Journal suggested that airlines may rush to provide chargers to passengers who are waiting to board.

If you arrive at an airport with a bomb and its battery is dead, guess what? You can’t detonate it.

Don’t worry though. The airlines will soon be supplying free chargers so you can charge the battery for your explosive device. Plus you can turn your device on and off for the TSA.

I don’t suppose it ever occurred to the DHS that the extruded shells of carry-on baggage could be made to hold a fair amount of explosive plus battery? Hermetically sealed, it would have no odors or imagery to betray its owner.

If the only return we get from the war on terrorism is more electrical sockets in airports, I say it hasn’t been worth the effort.

Access vs. Understanding

Filed under: Open Data,Public Data,Statistics — Patrick Durusau @ 4:09 pm

In Do doctors understand test results? William Kremer covers Risk Savvy: How to Make Good Decisions, a recent book on understanding risk statistics by Gerd Gigerenzer.

By the time you finish Kremer’s article, you will have little doubt that doctors don’t know the correct risk statistics for very common medical issues (breast cancer screening, for example), and that even when supplied with the correct information, they are incapable of interpreting it correctly.
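Gigerenzer’s stock screening example is worth working through. Here is a quick calculation in Clojure with illustrative numbers (mine, not Kremer’s): roughly 1% prevalence, 90% sensitivity and a 9% false-positive rate for mammography.

    ;; Of 1000 women screened, how many with a positive result actually have cancer?
    (let [population      1000
          with-cancer     (* population 0.01)                      ; 10 women
          true-positives  (* with-cancer 0.9)                      ; 9 correct positives
          false-positives (* (- population with-cancer) 0.09)]     ; ~89 false alarms
      (/ true-positives (+ true-positives false-positives)))
    ;; => ~0.09, i.e. roughly 1 positive in 11 is a true case, far lower than
    ;; most doctors (and patients) estimate when given the same numbers as percentages.

Framed as natural frequencies (about 9 true cases among 98 positives) the answer is obvious; framed as conditional probabilities, most of us get it wrong, which, as I understand it, is Gigerenzer’s point.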

And the public?

Unsurprisingly, patients’ misconceptions about health risks are even further off the mark than doctors’. Gigerenzer and his colleagues asked over 10,000 men and women across Europe about the benefits of PSA screening and breast cancer screening respectively. Most overestimated the benefits, with respondents in the UK doing particularly badly – 99% of British men and 96% of British women overestimated the benefit of the tests. (Russians did the best, though Gigerenzer speculates that this is not because they get more good information, but because they get less misleading information.)

What does that suggest to you about the presentation/interpretation of data encoded with a topic map or not?

To me it says that beyond testing an interface for usability and meeting the needs of users, we need to start testing users’ understanding of the data presented by interfaces. Delivery of great information that leaves a user mis-informed (unless that is intentional) doesn’t seem all that helpful.

I am looking forward to reading Risk Savvy: How to Make Good Decisions. I don’t know that I will make “better” decisions but I will know when I am ignoring the facts. 😉

I first saw this in a tweet by Alastair Kerr.

Palaeography

Filed under: History,Palaeography — Patrick Durusau @ 3:46 pm

Palaeography: reading old handwriting 1500–1800 from The National Archives (UK)

From the webpage:

A practical online tutorial

Palaeography is the study of old handwriting. This web tutorial will help you learn to read the handwriting found in documents written in English between 1500 and 1800.

At first glance, many documents written at this time look illegible to the modern reader. By reading the practical tips and working through the documents in the Tutorial in order of difficulty, you will find that it becomes much easier to read old handwriting. You can find more documents on which to practise your skills in the further practice section.

An excellent resource if you are writing topic maps based on, or citing, English handwritten sources written between 1500 and 1800.

Or if you simply like puzzle solving.

PS: Librarians take note.

OpenStreetMap Nears Ten

Filed under: Mapping,Maps,OpenStreetMap — Patrick Durusau @ 3:17 pm

OpenStreetMap – What’s next for the ‘Wikipedia of mapping’ as it turns 10? by Ed Freyfogle.

From the post:

OpenStreetMap (OSM) has come a long way. After starting almost 10 years ago in London, OSM is now an entrenched part of the geo/location-based service toolchain and one of the leading examples of crowdsourcing at a massive scale.

Since 2004, over 1.5 million volunteers have signed up to contribute terabytes of geo-data to the project often referred to as the “Wikipedia of mapping”. What began as one guy wandering around London with his GPS has now turned into a global movement and spawned countless spinoff projects (see: WheelMap, OpenCycleMap and OpenRailwayMap).

Ed details the amazing progress that OpenStreetMap has made in ten years but also mentions diversity, governance and licensing issues that continue to hold OSM back from greater adoption.

Another concern is breadth of coverage:

A recent study found that just five countries make up 58% of OpenStreetMap’s data coverage. It needs to be asked what dynamics are preventing local communities from forming around the world. Is OSM just a ‘rich world’ phenomenon?

A difference in perspective. I would be thrilled to have the level of participation for topic maps that OSM has, even if it were mostly limited to five countries.

My question would be: what is it about mapping physical terrain, or the interfaces for mapping it, that makes it more attractive than mapping subjects and how they are identified?

Is there a lesson here for topic maps?

On the licensing issue, I am hopeful OSM will adopt an Apache license. The rosters of Apache projects, with their corporate-sponsored participants and the commercial products based on those projects, are the best evidence for Apache licensing on a project.

That is, assuming being successful is more important to you than some private notion of purity. I recommend successful.

Random Forests…

Filed under: Ensemble Methods,GPU,Machine Learning,Random Forests — Patrick Durusau @ 2:30 pm

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams by Diego Marron, Albert Bifet, Gianmarco De Francisci Morales.

Abstract:

Random Forests is a classical ensemble method used to improve the performance of single tree classifiers. It is able to obtain superior performance by increasing the diversity of the single classifiers. However, in the more challenging context of evolving data streams, the classifier has also to be adaptive and work under very strict constraints of space and time. Furthermore, the computational load of using a large number of classifiers can make its application extremely expensive. In this work, we present a method for building Random Forests that use Very Fast Decision Trees for data streams on GPUs. We show how this method can benefit from the massive parallel architecture of GPUs, which are becoming an efficient hardware alternative to large clusters of computers. Moreover, our algorithm minimizes the communication between CPU and GPU by building the trees directly inside the GPU. We run an empirical evaluation and compare our method to two well known machine learning frameworks, VFML and MOA. Random Forests on the GPU are at least 300x faster while maintaining a similar accuracy.

The authors should get a special mention for honesty in research publishing. Figure 11 shows their GPU Random Forest algorithm seeming to scale almost constantly. The authors explain:

In this dataset MOA scales linearly while GPU Random Forests seems to scale almost constantly. This is an effect of the scale, as GPU Random Forests runs in milliseconds instead of minutes.

How fast/large are your data streams?

I first saw this in a tweet by Stefano Bertolo.

An Advanced Clojure Workflow

Filed under: Clojure,Editor,Functional Programming,Programming — Patrick Durusau @ 10:27 am

An Advanced Clojure Workflow

From the post:

“This is my workflow. There are many like it, but this one is mine.” –Anon.

This is the first in a sequence of posts about my current workflow for developing in Clojure. I call this workflow “advanced” because it takes time to learn the various tools involved, but also because it allows me to tackle harder problems than I otherwise could.

The basic ingredients of this workflow are:

  • Lisp-related editor enhancements, including parenthesis magic, and, especially, in-editor REPL integration.
  • Continuous testing with instant display of results
  • Notebook-style semi-literate programming with continuous updating of documentation, display of math symbols, and in-page plots.

My setup has evolved over time and I expect it will be different by the time most people read this. I have found, though, that the right combination of tools allows me not only to manipulate Clojure code and see results quickly, but also to think more clearly about the problem at hand. In my experience, this leads to simpler, more elegant, more maintainable code.

These tools and techniques provide a sort of “sweet-spot” – they help me be productive in the language, and I find them fun to use together. Your mileage may vary – I have seen people be productive with vastly different workflows and toolsets.

In the first post in this series (Emacs Customization for Clojure) we learn:

I have been using Emacs since the 1990s, though I still consider myself a novice (Emacs is that way). Though good alternatives exist, over half of the Clojure community has adopted Emacs despite its lack of polish and its Himalayan learning curve. Emacs is massively customizable, with hundreds of plugins available, and can be extended to just about any degree using its own flavor of Lisp.

Could over half of the Clojure community be wrong?

It’s possible, but highly unlikely. 😉

I first saw this in a tweet by Christophe Lalanne.

July 6, 2014

Finding needles in haystacks:…

Filed under: Bioinformatics,Biology,Names,Taxonomy — Patrick Durusau @ 4:54 pm

Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi by Conrad L. Schoch, et al. (Database (2014) 2014 : bau061 doi: 10.1093/database/bau061).

Abstract:

DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi.

Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353

If you are interested in projects to update and correct existing databases, this is the article for you.

Fungi may not be on your regular reading list but consider one aspect of the problem described:

It is projected that there are ~400 000 fungal names already in existence. Although only 100 000 are accepted taxonomically, it still makes updates to the existing taxonomic structure a continuous task. It is also clear that these named fungi represent only a fraction of the estimated total, 1–6 million fungal species (93–95).

I would say that computer science isn’t the only discipline where “naming things” is hard.

You?

PS: The other lesson from this paper (and many others) is that semantic accuracy is neither easy nor cheap. Anyone who says differently is lying.

Data Visualization Contest @ useR! 2014

Filed under: Graphics,R,Visualization — Patrick Durusau @ 4:34 pm

Data Visualization Contest @ useR! 2014

From the webpage:

The aim of the Data Visualization Contest @ useR! 2014 is to show the potential of R for analysis and visualization of large and complex data sets.

Submissions are welcomed in these two broad areas:

  • Track 1: Schools matter: the importance of school factors in explaining academic performance.
  • Track 2: Inequalities in academic achievement.

Really impressive visualizations but I would treat some of the conclusions with a great deal of caution.

One participant alleges that the absence of computers makes math scores fall. I am assuming that is literally what the data says but that doesn’t establish a causal relationship.

I say that because all of the architects of the atomic bomb, to say nothing of the digital computer, learned mathematics without the aid of computers. Yes?

CouchDB-Lucene 1.0 Release

Filed under: CouchDB,Lucene — Patrick Durusau @ 4:19 pm

CouchDB-Lucene

From the release page:

  • Upgrade to Lucene 4.9.0
  • Upgrade to Tika 1.5
  • Use the full OOXML Schemas from Apache POI, to make Tika able to parse Office documents that use exotic features
  • Allow search by POST (using form data)

+1! to incorporating Lucene in software as opposed to re-rolling basic indexing.

Medical Vocabulary

Filed under: Medical Informatics,Vocabularies — Patrick Durusau @ 4:00 pm

Medical Vocabulary by John D. Cook.

A new twitter account that tweets medical terms with definitions.

Would a twitter account that focuses on semantic terminology be useful?

No promises, just curious.

I’m thinking semantic searching/integration would be promising, if there were some evidence that semanticists are aware of the vast and varied terminology in their own field.

PS: John D. Cook has seventeen (17) Twitter accounts as of today.

I subscribe to several of them and they are very much worth the time to follow.

For the current list of John D. Cook twitter accounts, see: http://www.johndcook.com/twitter/

ease()-y as Math.PI…

Filed under: D3,Graphics — Patrick Durusau @ 3:14 pm

ease()-y as Math.PI: 1,200,000ms of Fun with D3’s Animated Transitions by Scott Murray.

From the webpage:

This is a presentation I gave at the Eyeo Festival in Minneapolis on June 11, 2014, adapted for the web. The talk was entirely live-coded in the JavaScript console, an experience I’ve tried to recreate here.

I recommend viewing this in Chrome, with the developer tools open. Click the next button to step through the presentation. Or, of course you can retype any of the code directly into the console yourself. Click any code block to execute it (but note that running them out of the intended order may produce unexpected results).

If you have any interest in D3 or in graphics you are going to enjoy this presentation!

Perhaps it will inspire you to try a live coding presentation. 😉

Or at least give you more confidence in using D3 for visualization.

PS: One moment of confusion as I was stepping through the presentation. I did not have the DOM inspector open, so I did not see the SVG element containing circles. The circles on my monitor resembled a solid line. My bad. The lesson here: follow the directions!

The Zen of Cypher

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 2:53 pm

The Zen of Cypher by Nigel Small.

The original “Zen” book, Zen and the art of motorcycle maintenance: an inquiry into values runs four hundred and eighteen pages.

Nigel has a useful summary for Cypher but I would estimate it runs about a page.

Not really the in depth sort of treatment that qualifies for a “Zen” title.

Yes?

Graph Hairballs, Impressive but Not Informative

Filed under: Graphs,Visualization — Patrick Durusau @ 2:35 pm

Large-Scale Graph Visualization and Analytics by Kwan-Liu Ma and Chris W. Muelder. (Computer, June 2013)

Abstract:

Novel approaches to network visualization and analytics use sophisticated metrics that enable rich interactive network views and node grouping and filtering. A survey of graph layout and simplification methods reveals considerable progress in these new directions.

We have all seen large graph hairballs that are as impressive as they are informative. Impressive to someone who has recently discovered “…everything is a graph…” but not to anyone else.

Ma and Muelder do an excellent job of contrasting traditional visualizations that result in “…an unintelligible hairball—a tangled mess of lines” versus more informative techniques.

The article touches on a number of graph layout and simplification methods; see the survey itself for the full list.

References at the end of the article should get you started towards useful visualizations of large scale graphs.

PS: I assume the article is based in part on C.W. Muelder’s “Advanced Visualization Techniques for Abstract Graphs and Computer Networks,” PhD dissertation, Dept. Computer Science, University of Calif., Davis, 2011. It is cited among the references. Published by ProQuest which means the 130 page dissertation runs $62.10 in paperback.

Let me know if you run across a more reasonably accessible copy.

I first saw this in a tweet by Paul Blaser.

SICP Distilled

Filed under: Clojure,Functional Programming,Programming,Scheme — Patrick Durusau @ 10:56 am

SICP Distilled

From the webpage:

An idiosyncratic tour of the best of SICP in Clojure.

Have you always wanted to read SICP but either didn’t start or gave up mid-way?

Take a few days to experience a distillation of its most important lessons.

Followed by a sign-up form.

Don’t know any more than that but it is an interesting concept.

If and when I get more details, I will post that information.

July 5, 2014

Wandora Moves to Github

Filed under: Topic Map Software,Wandora — Patrick Durusau @ 4:49 pm

Wandora Moves to GitHub

I saw a tweet today by Wandora announcing a move to GitHub.

From the new GitHub page:

Wandora is a general purpose information extraction, management and publishing application based on Topic Maps and Java. Wandora has graphical user interface, multiple visualization models, huge collection of information extraction, import and export options, embedded HTTP server with several output modules and open plug-in architecture. Wandora is a FOSS application with GNU GPL license. Wandora suits well for constructing and maintaining vocabularies, ontologies and information mashups. Application areas include data integration, business intelligence, digital preservation, data journalism, open data and linked data projects.

If you aren’t familiar with Wandora, check it out.

If it has been a while since you looked at Wandora, its time for another visit.

The traditional Wandora site is still up and news there reports the move to GitHub was to make development more transparent and to attract new developers.

Well, you have the invitation. How are you going to respond?

Understanding and Expressing Scalable Concurrency

Filed under: Algorithms,Concurrent Programming,Parallelism,Scalability — Patrick Durusau @ 4:35 pm

Understanding and Expressing Scalable Concurrency by Aaron Turon.

Abstract


The Holy Grail of parallel programming is to provide good speedup while hiding or avoiding the pitfalls of concurrency. But some level in the tower of abstraction must face facts: parallel processors execute code concurrently, and the interplay between concurrent code, synchronization, and the memory subsystem is a major determiner of performance. Effective parallel programming must ultimately be supported by scalable concurrent algorithms—algorithms that tolerate (or even embrace) concurrency for the sake of scaling with available parallelism. This dissertation makes several contributions to the understanding and expression of such algorithms:

  • It shows how to understand scalable algorithms in terms of local protocols governing each part of their hidden state. These protocols are visual artifacts that can be used to informally explain an algorithm at the whiteboard. But they also play a formal role in a new logic for verifying concurrent algorithms, enabling correctness proofs that are local in space, time, and thread execution. Correctness is stated in terms of refinement: clients of an algorithm can reason as if they were using the much simpler specification code it refines.
  • It shows how to express synchronization in a declarative but scalable way, based on a new library providing join patterns. By declarative, we mean that the programmer needs only to write down the constraints of a synchronization problem, and the library will automatically derive a correct solution. By scalable, we mean that the derived solutions deliver robust performance with increasing processor count and problem complexity. The library’s performance on common synchronization problems is competitive with specialized algorithms from the literature.
  • It shows how to express scalable algorithms through reagents, a new monadic abstraction. With reagents, concurrent algorithms no longer need to be constructed from “wholecloth,” i.e., by using system-level primitives directly. Instead, they are built using a mixture of shared-state and message-passing combinators. Concurrency experts benefit, because they can write libraries at a higher level, with more reuse, without sacrificing scalability. Their clients benefit, because composition empowers them to extend and tailor a library without knowing the details of its underlying algorithms.

Not for the faint of heart! 😉

But if you are interested in algorithms for when processing is always parallel by default, best dig in.

I like the author’s imagery of “Go Fish” when he says:

A scalable hashtable is useful not just for concurrent systems; it can also be a boon for explicit parallel programming. A simple but vivid example is the problem of duplicate removal: given a vector of items, return the items in any order, but without any duplicates. Since the input is unstructured, any way of dividing it amongst parallel threads appears to require global coordination to discover duplicate items. The key to avoiding a multiprocessor game of “Go Fish” is to focus on producing the output rather than dividing the input. If threads share a scalable hashtable that allows parallel insertion of distinct elements, they can construct the correct output with (on average) very little coordination, by simply each inserting a segment of the input into the table, one element at a time.
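A minimal Clojure sketch of that “focus on producing the output” idea, using a shared concurrent set from java.util.concurrent rather than the dissertation’s own scalable hashtable or reagents:

    (import '(java.util.concurrent ConcurrentHashMap))

    (defn parallel-distinct
      "Deduplicate coll by letting several futures insert their slice of the
       input into one shared concurrent set. No coordination over the input,
       only (cheap) coordination inside the set."
      [coll nthreads]
      (let [shared  (ConcurrentHashMap/newKeySet)
            chunks  (partition-all (max 1 (quot (count coll) nthreads)) coll)
            workers (doall (map (fn [chunk]
                                  (future (doseq [x chunk] (.add shared x))))
                                chunks))]
        (run! deref workers)          ; wait for every worker to finish
        (vec shared)))

    ;; (parallel-distinct [1 2 2 3 3 3 4] 4) => [1 2 3 4], in some order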

Now that I think about it, topic map processing does a lot of duplicate removal.

Topic maps in a parallel processing environment anyone?

I first saw this in a tweet by Alex Clemmer.

Greenhouse [Exposes Political Funding]

Filed under: Government,Politics — Patrick Durusau @ 2:48 pm

Expose Any Politician’s Fundraising Sources with This Ingenious Tool by Jason O. Gilbert.

From the post:

Whenever you’re reading a news story and you see a U.S. representative or senator supporting a bill or a cause, you might wonder just how much a specific industry or lobbying group has paid to influence his thinking.

Now, with a new browser extension called Greenhouse, you can find out instantly. Once you’ve installed Greenhouse (available for Google Chrome, Firefox, or Safari), every politician’s name will become highlighted, on any article you’re reading; hovering your mouse over the name will launch a pop-up box that shows you exactly which industries and lobbies contributed the most cash to that politician’s campaign fund.

The tool highlights contributions to both members of Congress and U.S. senators…

You can also click through to see a more detailed breakdown, as well as which, if any, campaign-finance reform bills the politician backs. A small box to the right of the politician’s name also shows the percentage of her contributions that came from small donors (individuals giving $200 or less.)

Nicholas Rubin, creator of Greenhouse comments on the Greenhouse homepage:

Exactly one hundred years ago, in Harper’s Weekly, Louis Brandeis made the frequently quoted statement that “sunlight is said to be the best of disinfectants.” Brandeis’s preceding sentence in the article may be less well known, but it is equally important: “Publicity is justly commended as a remedy for social and industrial diseases.” I created Greenhouse to shine light on a social and industrial disease of today: the undue influence of money in our Congress. This influence is everywhere, even if it is hidden. I aim to expose and publicize that disease through technology that puts important data where it is most useful, on websites where people read about the actions, or inaction, of members of Congress every day.

Does the following sound familiar?

…its adoption did not quell the cries for reform. Eliminating corporate influence was only one of the ideas being advanced at this time to clean up political finance. Reducing the influence of wealthy individuals was also a concern, and some reformers pushed for limits on individual donations. Still others advocated even bolder ideas.

Care to guess the year being spoken of in that quote?

1907*.

Campaign finance reform and a belief in disclosure of contributions as a “cure” have a long history in the United States. The first “disclosure” act was passed in 1910; see the Federal Corrupt Practices Act (aka Publicity Act) entry at Wikipedia.

That’s been what? One hundred and four (104) years ago?

My assumption is that if money was seen as corrupting politics more than a century ago and is still seen that way today, the measures taken to reduce the corruption of politics by money until now, have been largely ineffectual.

Nicholas’ browser plugin is going to be useful, but not for the reasons he gives.

First, the plugin is useful for identifying members of the House or Senate for contributions, that is, for joining with others in buying legislators leaning in a direction I like. You do know that an “honest” politician is one who, once bought, stays bought. Yes?

Second, and perhaps more importantly, a general form of the plugin would be a useful way to deliver links to curated (topic mapped) material on news or other items on a webpage. The average user sees underlined text, just another link. On pop-up, or on following it, they obtain highly curated information about a subject.


* Campaign Finance Reform: A Sourcebook, edited by Anthony Corrado, et al., Brookings Institution, 1997, page 28.

You may also find Campaign-Finance Reform: History and Timeline: A look at early campaign finance legislation and efforts to regulate fund raising for political campaigns by Beth Rowen a useful overview of “reform” in this area.

I first saw this in a tweet by Gregory Piatetsky.

July 4, 2014

The Restatement Project

Filed under: Law,Law - Sources,Text Mining — Patrick Durusau @ 4:15 pm

Rough Consensus, Running Standards: The Restatement Project by Jason Boehmig, Tim Hwang, and Paul Sawaya.

From part 3:

Supported by a grant from the Knight Foundation Prototype Fund, Restatement is a simple, rough-and-ready system which automatically parses legal text into a basic machine-readable JSON format. It has also been released under the permissive terms of the MIT License, to encourage active experimentation and implementation.

The concept is to develop an easily-extensible system which parses through legal text and looks for some common features to render into a standard format. Our general design principle in developing the parser was to begin with only the most simple features common to nearly all legal documents. This includes the parsing of headers, section information, and “blanks” for inputs in legal documents like contracts. As a demonstration of the potential application of Restatement, we’re also designing a viewer that takes documents rendered in the Restatement format and displays them in a simple, beautiful, web-readable version.

I skipped the sections justifying the project because in my circles, the need for text mining is presumed and the interesting questions are about the text and/or the techniques for mining.

As you might suspect, I have my doubts about using JSON for legal texts but for a first cut, let’s hope the project is successful. There is always time to convert to a more robust format at some later point, in response to a particular need.
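To make the “headers, sections and blanks” idea concrete, here is a toy parser in Clojure. It is my own illustration, not Restatement’s actual code or JSON schema:

    (require '[clojure.string :as str])

    (defn parse-sections
      "Rough cut: a line like \"1. Definitions\" opens a new section; runs of
       three or more underscores inside a section are counted as blanks."
      [text]
      (reduce (fn [sections line]
                (if-let [[_ num heading] (re-matches #"(\d+)\.\s+(.*)" line)]
                  (conj sections {:number num :heading heading :body [] :blanks 0})
                  (if (empty? sections)
                    sections                       ; ignore preamble lines
                    (update sections (dec (count sections))
                            #(-> %
                                 (update :body conj line)
                                 (update :blanks + (count (re-seq #"_{3,}" line))))))))
              []
              (str/split-lines text)))

    ;; (parse-sections "1. Parties\nThis agreement is between ____ and ____.\n2. Term\n...")
    ;; => [{:number "1", :heading "Parties", :body [...], :blanks 2}
    ;;     {:number "2", :heading "Term",    :body ["..."], :blanks 0}]

Serialize the result with any JSON library and you have something in the spirit of what the post describes.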

Definitely a project to watch or assist if you are considering creating a domain specific conversion editor.

July 3, 2014

statsTeachR

Filed under: R,Statistics — Patrick Durusau @ 2:35 pm

statsTeachR

From the webpage:

statsTeachR is an open-access, online repository of modular lesson plans, a.k.a. “modules”, for teaching statistics using R at the undergraduate and graduate level. Each module focuses on teaching a specific statistical concept. The modules range from introductory lessons in statistics and statistical computing to more advanced topics in statistics and biostatistics. We are developing plans to create a peer-review process for some of the modules submitted to statsTeachR.

There are twenty-five (25) modules now and I suspect they would welcome your help in contributing more.

The path to a more numerically, and specifically statistically, savvy public is to teach people to use statistics. So when numbers “don’t sound right,” they will have the confidence to speak up.

Enjoy!

I first saw this in a tweet by Karthik Ram.

rplos Tutorial

Filed under: R,Science,Text Mining — Patrick Durusau @ 2:14 pm

rplos Tutorial

From the webpage:

The rplos package interacts with the API services of PLoS (Public Library of Science) Journals. In order to use rplos, you need to obtain your own key to their API services. Instruction for obtaining and installing keys so they load automatically when you launch R are on our GitHub Wiki page Installation and use of API keys.

This tutorial will go through three use cases to demonstrate the kinds of things possible in rplos.

  • Search across PLoS papers in various sections of papers
  • Search for terms and visualize results as a histogram OR as a plot through time
  • Text mining of scientific literature

Another source of grist for your topic map mill!

Taming Asynchronous Workflows…

Filed under: Clojure,Cybersecurity,Functional Reactive Programming (FRP),Security — Patrick Durusau @ 1:59 pm

Taming Asynchronous Workflows with Functional Reactive Programming by Leonardo Borges.

Unfortunately just the slides but they should be enough for you to get the gist of the presentation. And, Leonardo finishes with a number of pointers and references that will be helpful.

Managing subject identity in a workflow could have interesting implications for security. What if there is no “stored” data to be attacked? Either you were listening at the right time, and to all the right sources, to receive the transmissions necessary to assemble a text, or you weren’t. And you need not store the message after receipt; it can always be re-created by the same processes that delivered it.

Lots of room for innovation and possible security improvements when we stop building targets (static data stores).
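A bare-bones core.async sketch of that assemble-on-demand idea (nothing security-grade, just the mechanism of a message that only exists when a consumer pulls its fragments from live channels):

    (require '[clojure.core.async :as a])

    (defn fragment-source
      "A channel that emits one fragment of the message and then closes."
      [fragment]
      (let [c (a/chan)]
        (a/go (a/>! c fragment)
              (a/close! c))
        c))

    (defn assemble
      "Re-create the message by listening to every source; nothing is persisted,
       and the same call can rebuild it again from the same processes."
      [fragments]
      (a/<!! (a/into [] (a/merge (mapv fragment-source fragments)))))

    ;; (assemble ["part-1" "part-2" "part-3"])
    ;; => the three fragments, in whatever order they arrived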

BTW, Leonardo is currently writing: “Clojure Reactive Programming: RAW.” (Packt) I like Packt books but do wish their website offered a search by title function. Try searching for “reactive” and see what you think.

July 2, 2014

Property Suggester

Filed under: Authoring Topic Maps,Wikidata — Patrick Durusau @ 7:09 pm

Wikidata just got 10 times easier to use by Lydia Pintscher.

From an email post:

We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn’t have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item. This will make it a lot easier for you to figure out what the hell is missing on an item and which property to use.

Thank you so much to the student team who worked on this as part of their bachelor thesis over the last months as well as everyone who gave feedback and helped them along the way.

I’m really happy to see this huge improvement towards making Wikidata easier to use. I hope so are you.

I suspect such a suggester for topic map authoring would need to be domain specific but it would certainly be a useful feature.

At least so long as I can say: No more suggestions of X property. 😉

An added wrinkle could be suggested properties and why, from a design standpoint, they could be useful to include.
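A naive co-occurrence suggester along those lines is easy to sketch. This is my own toy, not the Wikidata entity suggester’s algorithm: treat each item as a set of properties and rank the properties that co-occur with the current item’s properties but are missing from it.

    (require '[clojure.set :as set])

    (defn suggest-properties
      "Rank properties that appear on items sharing at least one property with
       current-props but are missing from the current item. Toy illustration."
      [items current-props n]
      (->> items
           (filter #(seq (set/intersection % current-props)))  ; related items
           (mapcat #(remove current-props %))                   ; their other properties
           frequencies
           (sort-by val >)
           (take n)
           (map key)))

    ;; Persons usually carry :date-of-birth, so it comes back first:
    ;; (suggest-properties [#{:instance-of :date-of-birth :occupation}
    ;;                      #{:instance-of :date-of-birth}
    ;;                      #{:instance-of :coordinates}]
    ;;                     #{:instance-of :occupation}
    ;;                     2)
    ;; => (:date-of-birth :coordinates)

A domain-specific version for topic map authoring would presumably filter the candidate items by type first.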

Aerospike goes Open Source

Filed under: Aerospike,Database — Patrick Durusau @ 6:57 pm

Aerospike goes Open Source

From the post:

We are excited to announce that the Aerospike database is now open source.

Aerospike’s mission is to disrupt the entire field of databases by offering an addictive proposition: a database literally ten times faster than existing NoSQL solutions, and one hundred times faster than existing SQL solutions. By offering as open source the battle-tested operational database that powers the largest and highest scale companies on the planet, Aerospike will change how applications are architected, and solutions are created!

The code for Aerospike clients and the Aerospike server is published on github. Clients are available under the Apache 2 license so you can use and modify with no restrictions. The server is available under AGPL V3 to protect the long term interests of the community – you are free to use it with no restrictions but if you change the server code, then those code changes must be contributed back.

Aerospike Community Edition has no limits on the number of servers, tps or terabytes of data, and is curated by Aerospike. Use is unlimited and the code is open. We cannot wait to see what you will do with it!

You will have to read the details to decide if Aerospike is appropriate for your requirements.

Among other things, I would focus on statements like:

This layer [Distribution Layer] scales linearly and implements many of the ACID guarantees.

That’s like reading a poorly written standards document. 😉 What does “many of the ACID guarantees” mean exactly?

From the ACID article at Wikipedia I read:

In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.

Jim Gray defined these properties of a reliable transaction system in the late 1970s and developed technologies to achieve them automatically.

I don’t think four (4) requirements count as “many” but my first question would be:

Which of the “many” ACID guarantees does Aerospike not implement? How hard can that be? It has to be one of the four. Yes?

Second question: So, more than three decades after Jim Gray demonstrated how to satisfy all four ACID guarantees, Aerospike doesn’t? Yes?

I’m not denying there may be valid reasons to ignore one or more of the ACID guarantees. But let’s be clear about which ones and the trade-offs that justify it.

I first saw this in a tweet by Charles Ditzel.

Maps need context

Filed under: Mapping,Maps — Patrick Durusau @ 6:33 pm

Maps need context by Jon Schwabish and Bryan Connor.

From the post:

It might be the case that maps are the most data-dense visualizations. Consider your basic roadmap: it includes road types (highways, toll roads), directions (one-way, two-way), geography (rivers, lakes), cities, types of cities (capitals), points of interest (schools, parks), and distance. Maps that encode statistical data, such as bubble plots or choropleth maps, are also data-dense and replace some of these geographic characteristics with different types of data encodings. But lately we’ve been wondering if most maps fail to convey enough context.

As an example, consider this map of poverty rates by districts in India. It’s a fairly simple choropleth map and you can immediately discern different patterns: high poverty rates are concentrated in the districts in the northernmost part of the country, on part of the southeast border, and in a stretch across the middle of the country. Another set of high-poverty areas can be found in the land mass in the northeast part of the map. But here’s the thing: we don’t know much about India’s geography. Without some context—plotting cities or population centers—we can only just guess what this map is telling me.

Many readers will be more familiar with the geography of the United States. So when maps like this one from the Census Bureau show up, we are better equipped to understand it because we’re familiar with areas such as the high-poverty South and around the Texas-Mexico border. But then again, what about readers familiar with basic U.S. geography, but not familiar with patterns of poverty? How useful is this map for them?

In addition to establishing the potential need for more context, Jon and Bryan go on to describe a tool for building and comparing maps with different data sets included.

You should take context into account in deciding what groups of topics and associations to merge into or leave out of a topic map. Too much detail and your user may lose sight of the forest. Too little and they may not be able to find it at all.

circlize implements and enhances circular visualization in R

Filed under: Bioinformatics,Genomics,Multidimensional,R,Visualization — Patrick Durusau @ 6:03 pm

circlize implements and enhances circular visualization in R by Zuguang Gu, et al.

Abstract:

Summary: Circular layout is an efficient way for the visualization of huge amounts of genomic information. Here we present the circlize package, which provides an implementation of circular layout generation in R as well as an enhancement of available software. The flexibility of this package is based on the usage of low-level graphics functions such that self-defined high-level graphics can be easily implemented by users for specific purposes. Together with the seamless connection between the powerful computational and visual environment in R, circlize gives users more convenience and freedom to design figures for better understanding genomic patterns behind multi-dimensional data.

Availability and implementation: circlize is available at the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/web/packages/circlize/

The article is behind a paywall but fortunately, the R code is not!

I suspect I know which one will get more “hits.” 😉

Useful for exploring multidimensional data as well as presenting multidimensional data encoded using a topic map.

Sometimes displaying information as nodes and edges isn’t the best display.

Remember the map of Napoleon’s invasion of Russia?

[image: map of Napoleon’s invasion of Russia]

You could display the same information with nodes (topics) and associations (edges) but it would not be nearly as compelling.

Although, you could make the same map a “cover” for the topics (read people) associated with segments of the map, enabling a reader to take in the whole map and then drill down to the detail for any location or individual.

It would still be a topic map, even though its primary rendering would not be as nodes and edges.

Verticalize

Filed under: Bioinformatics,Data Mining,Text Mining — Patrick Durusau @ 3:05 pm

Verticalize by Pierre Lindenbaum.

From the webpage:

Simple tool to verticalize text delimited files.

Pierre works in bioinformatics and is the author of many useful tools.

Definitely one for the *nix toolbox.
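The concept is simple enough to sketch in a few lines of Clojure. This is just the idea, not Pierre’s tool or its options: pair each header with its value and print one column per line.

    (require '[clojure.string :as str])

    (defn verticalize
      "Print a wide delimited record as one \"header : value\" pair per line."
      [header-line record-line sep]
      (doseq [[h v] (map vector
                         (str/split header-line sep)
                         (str/split record-line sep))]
        (println (str h " : " v))))

    ;; (verticalize "chrom\tpos\tref\talt" "chr1\t12345\tA\tG" #"\t")
    ;; chrom : chr1
    ;; pos : 12345
    ;; ref : A
    ;; alt : G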

OpenPrism

Filed under: Open Data,Open Government — Patrick Durusau @ 2:50 pm

Searching Data Tables

From the webpage:

There are loads of open data portals. There’s even a portal about data portals. And each of these portals has loads of datasets.

OpenPrism is my most recent attempt at understanding what is going on in all of these portals. Read on if you want to see why I made it, or just go to the site and start playing with it.

Naive search method

One difficulty in discovering open data is the search paradigm.

Open data portals approach searching data as if data were normal prose; your search terms are some keywords, a category, &c., and your results are dataset titles and descriptions.

OpenPrism is one small attempt at making it easier to search. Rather than going to all of the different portals and making a separate search for each portal, you type your search in one search bar, and you get results from a bunch of different Socrata, CKAN and Junar portals.

Certainly more efficient than searching data portals separately but searching data portals is highly problematic in any event.

Or at least more problematic than using one of the standard web search engines, which rely upon the choices of millions of users to fine-tune their results, and even then the results are often a mixed bag.

Data portals do not share common schemas or metadata among themselves and, I suspect, often not even within a single portal. Which means a search that is successful in one data portal may return no results in another.

Not that I am about to advocate a “universal” schema for all data portals. 😉

A good first step would be enabling each data silo to carry searchable mappings for data columns, as suggested by users. Not machine implemented, just simple prose. Users researching particular areas are likely to encounter the same data sets, and recording their mappings could well assist other users.

Relying on user-suggested mappings would also direct improvements to the data sets that get used the most, the ones users actually care about combining, as opposed to having IT guess which data mappings should have priority.
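As a sketch of what a user-suggested mapping might look like as data (hypothetical portals, datasets and columns throughout):

    ;; Hypothetical user-contributed mapping: plain prose plus enough
    ;; identifiers to make it searchable alongside the datasets themselves.
    (def example-mapping
      {:left   {:portal "data.cityA.example" :dataset "street-trees"   :column "dbh"}
       :right  {:portal "data.cityB.example" :dataset "tree-inventory" :column "diameter_in"}
       :note   "Both record trunk diameter at breast height; cityA is in cm, cityB in inches."
       :author "some-user"})

    ;; A portal search need only index :note and the column names for other
    ;; users researching the same datasets to find the mapping.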

Sound like a plan?

See the source at GitHub.

I first saw this in a tweet by Felienne Hermans.

