Apache Lucene and Solr 4.10

September 21st, 2014

Apache Lucene and Solr 4.10

From the post:

Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and the Apache Solr search server, numbered 4.10. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.

Here are some of the changes made compared to 4.9:

Lucene

  • Simplified Version handling for analyzers
  • TermAutomatonQuery was added
  • Optimizations and bug fixes

Solr

  • Ability to automatically add replicas in SolrCloud mode in HDFS
  • Ability to export full results set
  • Distributed support for facet.pivot
  • Optimizations and bugfixes from Lucene 4.9

The full list of changes for Lucene can be found at http://wiki.apache.org/lucene-java/ReleaseNote410; the full list of changes in Solr 4.10 can be found at http://wiki.apache.org/solr/ReleaseNote410.

Apache Lucene 4.10 can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Apache Solr 4.10 from http://www.apache.org/dyn/closer.cgi/lucene/solr/. Please remember that the mirrors are just starting to update, so not all of them will contain the 4.10 version of Lucene and Solr.

A belated note about Apache Lucene and Solr 4.10.

I must have been distracted by the continued fumbling with the Ebola crisis. I no longer wonder how the international community would respond to an actual worldwide threat. In a word: ineffectively.

WWW 2015 Call for Research Papers

September 20th, 2014

WWW 2015 Call for Research Papers

From the webpage:

Important Dates:

  • Research track abstract registration:
    Monday, November 3, 2014 (23:59 Hawaii Standard Time)
  • Research track full paper submission:
    Monday, November 10, 2014 (23:59 Hawaii Standard Time)
  • Notifications of acceptance:
    Saturday, January 17, 2015
  • Final Submission Deadline for Camera-ready Version:
    Sunday, March 8, 2015
  • Conference dates:
    May 18 – 22, 2015

Research papers should be submitted through EasyChair at:
https://easychair.org/conferences/?conf=www2015

For more than two decades, the International World Wide Web (WWW) Conference has been the premier venue for researchers, academics, businesses, and standards bodies to come together and discuss the latest updates on the state and evolutionary path of the Web. The main conference program of WWW 2015 will have 11 areas (or themes) for refereed paper presentations, and we invite you to submit your cutting-edge, exciting, new breakthrough work to the relevant area. In addition to the main conference, WWW 2015 will also have a series of co-located workshops, keynote speeches, tutorials, panels, a developer track, and poster and demo sessions.

The list of areas for this year is as follows:

  • Behavioral Analysis and Personalization
  • Crowdsourcing Systems and Social Media
  • Content Analysis
  • Internet Economics and Monetization
  • Pervasive Web and Mobility
  • Security and Privacy
  • Semantic Web
  • Social Networks and Graph Analysis
  • Web Infrastructure: Datacenters, Content Delivery Networks, and Cloud Computing
  • Web Mining
  • Web Search Systems and Applications

Great conference, great weather (weather for Florence in May) and it is in Florence, Italy. What other reasons do you need to attend? ;-)

Why news organizations need to invest in better linking and tagging

September 20th, 2014

Why news organizations need to invest in better linking and tagging by Frédéric Filloux.

From the post:

Most media organizations are still stuck in version 1.0 of linking. When they produce content, they assign tags and links mostly to other internal content. This is done out of fear that readers would escape for good if doors were opened too wide. Assigning tags is not an exact science: I recently spotted a story about the new pregnancy in the British royal family; it was tagged “demography,” as if it were some piece about Germany’s weak fertility rate.

But there is much more to come in that field. Two factors are at work: APIs and semantic improvements. APIs (Application Programming Interfaces) act like the receptors of a cell that exchanges chemical signals with other cells. It’s the way to connect a wide variety of content to the outside world. A story, a video, a graph can “talk” to and be read by other publications, databases, and other “organisms.” But first, it has to pass through semantic filters. From a text, the most basic tools extract sets of words and expressions such as named entities, patronyms, places.

Another higher level involves extracting meanings like “X acquired Y for Z million dollars” or “X has been appointed finance minister.” But what about a video? Some go with granular tagging systems; others, such as Ted Talks, come with multilingual transcripts that provide valuable raw material for semantic analysis. But the bulk of content remains stuck in a dumb form: minimal and most often unstructured tagging. These require complex treatments to make them “readable” by the outside world. For instance, an untranscribed video seen as interesting (say a Charlie Rose interview) will have to undergo a speech-to-text analysis to become usable. This process requires both human curation (finding out what content is worth processing) and sophisticated technology (transcribing a speech by someone speaking super-fast or with a strong accent).

Great piece on the value of more robust tagging by news organizations.

Rather than an after-the-fact-of-publication activity, tagging needs to be part of the workflow that produces content. Tagging as a step in the process of content production avoids creating a mountain of untagged content.

To what end? Well, imagine simple tagging that associates a reporter with named sources in a report. When the subject of that report comes up in the future, wouldn’t it be a time saver to whistle up all the reporters on that subject with a list of their named contacts?

Never having worked at a newspaper I can’t say for certain, but that sounds like an advantage to this outsider.

That lesson can be broadened to any company producing content. The data in the content has a point of origin: it was delivered by someone, reported by someone else, etc. Capture those relationships and track the ebb and flow of your data, not just the values it represents.
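The reporter-and-named-sources idea above fits in a few lines. A toy index, with all names invented for illustration:

```python
from collections import defaultdict

# Toy index: for each subject, record which reporter covered it and the
# named sources they used, captured at publication time rather than later.
index = defaultdict(list)

def tag_report(subject, reporter, named_sources):
    index[subject].append((reporter, named_sources))

tag_report("city budget", "A. Reporter", ["Mayor's office", "Finance director"])
tag_report("city budget", "B. Reporter", ["Council member"])
tag_report("school board", "A. Reporter", ["Superintendent"])

def reporters_on(subject):
    """Whistle up everyone who has covered a subject, with their named contacts."""
    return index[subject]

print(reporters_on("city budget"))
```

The point is not the code, which is trivial, but that the index only exists if tagging happens as part of the production workflow.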

I first saw this in a tweet by Marin Dimitrov.

Growing a Language

September 20th, 2014

Growing a Language by Guy L. Steele, Jr.

The first paper in a new series of posts from the Hacker School blog, “Paper of the Week.”

I haven’t found a good way to summarize Steele’s paper but can observe that a central theme is the growth of programming languages.

While enjoying the Steele paper, ask yourself how would you capture the changing nuances of a language, natural or artificial?

Enjoy!

ApacheCon EU 2014

September 20th, 2014

ApacheCon EU 2014

ApacheCon Europe 2014 – November 17-21 in Budapest, Hungary.

November is going to be here sooner than you think. You need to register now and start making travel arrangements.

A quick scroll down the schedule page will give you an idea of the breadth of the Apache Foundation activities.

219 million stars: a detailed catalogue of the visible Milky Way

September 20th, 2014

219 million stars: a detailed catalogue of the visible Milky Way

From the post:

A new catalogue of the visible part of the northern part of our home Galaxy, the Milky Way, includes no fewer than 219 million stars. Geert Barentsen of the University of Hertfordshire led a team who assembled the catalogue in a ten year programme using the Isaac Newton Telescope (INT) on La Palma in the Canary Islands. Their work appears today in the journal Monthly Notices of the Royal Astronomical Society.

The production of the catalogue, IPHAS DR2 (the second data release from the survey programme The INT Photometric H-alpha Survey of the Northern Galactic Plane, IPHAS), is an example of modern astronomy’s exploitation of ‘big data’. It contains information on 219 million detected objects, each of which is summarised in 99 different attributes.

The new work appears in Barentsen et al, “The second data release of the INT Photometric Hα Survey of the Northern Galactic Plane (IPHAS DR2)“, Monthly Notices of the Royal Astronomical Society, vol. 444, pp. 3230-3257, 2014, published by Oxford University Press. A preprint version is available on the arXiv server.

The catalogue is accessible in queryable form via the VizieR service at the Centre de Données astronomiques de Strasbourg. The processed IPHAS images it is derived from are also publicly available.

At 219 million detected objects, each with 99 different attributes, that sounds like “big data” to me. ;-)

Enjoy!

Transducers

September 20th, 2014

Transducers by Rich Hickey. (Strange Loop, 2014)

Rich has another go at explaining transducers at Strange Loop 2014.

You may want to look at Felienne Hermans’ live blog of the presentation: Transducers – Rich Hickey
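The talk is in Clojure, but the core idea (a transducer is a function from reducing function to reducing function, independent of the data's source and sink) can be sketched in Python. This is my own translation, not code from the talk:

```python
from functools import reduce

# A transducer takes a reducing function and returns a new reducing
# function -- the transformation is decoupled from the collection.
def mapping(f):
    def xform(rf):
        return lambda acc, x: rf(acc, f(x))
    return xform

def filtering(pred):
    def xform(rf):
        return lambda acc, x: rf(acc, x) if pred(x) else acc
    return xform

def compose(*xforms):
    # Applied so the first-listed transformation sees the input first,
    # mirroring how comp behaves with transducers in Clojure.
    def composed(rf):
        for xf in reversed(xforms):
            rf = xf(rf)
        return rf
    return composed

def transduce(xform, rf, init, coll):
    return reduce(xform(rf), coll, init)

# The same transducer drives both a list build and a sum -- the source
# and sink are chosen only at the call site.
inc_evens = compose(filtering(lambda n: n % 2 == 0), mapping(lambda n: n + 1))
print(transduce(inc_evens, lambda acc, x: acc + [x], [], range(6)))  # [1, 3, 5]
print(transduce(inc_evens, lambda acc, x: acc + x, 0, range(6)))     # 9
```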

I first saw Rich’s video in a tweet by Michael Klishin.

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

September 19th, 2014

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

From the post:

Gov.UK Verify is designed to overcome concerns about government setting up a central database of citizens’ identities to enable access to online public services – similar criticism led to the demise of the hugely unpopular identity card scheme set up under the Labour government.

Instead, users will register their details with one of several independent identity assurance providers – certified companies which will establish and verify a user’s identity outside government systems. When the user then logs in to a digital public service, the Verify system will electronically “ask” the external third-party provider to confirm the person is who they claim to be.

HELP!

Help me make sure I am reading this story of citizen identity correctly.

Citizens are fearful of their government having a central database of citizens’ identities but are comfortable with commercial firms, regulated by the same government, managing those identities?

Do you think citizens of the UK are aware that commercial firms betray their customers to the U.S. government at the drop of a secret subpoena every day?

To say nothing of the failures of commercial firms to protect the data of their customers, when they aren’t using that data to manipulate those customers directly.

Strikes me as damned odd that anyone would trust commercial firms more than they would trust the government. Neither one is actually trustworthy.

Am I reading this story correctly?

I first saw this in a tweet by Richard Copley.

Named Entity Recognition: A Literature Survey

September 19th, 2014

Named Entity Recognition: A Literature Survey by Rahul Sharnagat.

Abstract:

In this report, we explore various methods that are applied to solve NER. In section 1, we introduce the named entity problem. In section 2, various named entity recognition methods are discussed in three broad categories of machine learning paradigm, and we explore a few learning techniques in each. In the first part, we discuss various supervised techniques. Subsequently we move to semi-supervised and unsupervised techniques. In the end, we discuss methods from deep learning for solving NER.

If you are new to the named entity recognition issue or want to pass on an introduction, this may be the paper for you. It covers all the high points, with a three-page bibliography to get you started in the literature.
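For a sense of why the survey's learned methods matter, compare a crude rule-based baseline, the sort of capitalized-word matcher that NER research quickly outgrew. This is my toy, not a method from the paper:

```python
import re

# Crude baseline: treat maximal runs of capitalized words as candidate
# named entities. Real NER (the subject of the survey) uses learned
# sequence models; this only makes the task concrete.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

def naive_entities(text):
    # Keep multiword matches; drop lone sentence-initial capitalized words.
    return [m.group() for m in CANDIDATE.finditer(text)
            if " " in m.group() or m.start() != 0]

# Note how it wrongly splits "University of Hertfordshire" at the
# lowercase "of" -- exactly the kind of failure learned models address.
print(naive_entities("Geert Barentsen led a team at the University of Hertfordshire."))
```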

I first saw this in a tweet by Christopher.

You can be a kernel hacker!

September 19th, 2014

You can be a kernel hacker! by Julia Evans.

From the post:

When I started Hacker School, I wanted to learn how the Linux kernel works. I’d been using Linux for ten years, but I still didn’t understand very well what my kernel did. While there, I found out that:

  • the Linux kernel source code isn’t all totally impossible to understand
  • kernel programming is not just for wizards, it can also be for me!
  • systems programming is REALLY INTERESTING
  • I could write toy kernel modules, for fun!
  • and, most surprisingly of all, all of this stuff was useful.

I hadn’t been doing low level programming at all – I’d written a little bit of C in university, and otherwise had been doing web development and machine learning. But it turned out that my newfound operating systems knowledge helped me solve regular programming tasks more easily.

Post by the same name as her presentation at Strange Loop 2014.

Another reason to study the Linux kernel: The closer to the metal your understanding, the more power you have over the results.

That’s true for the Linux kernel, machine learning algorithms, NLP, etc.

You can have a canned result prepared by someone else, which may be good enough, or you can bake something more to your liking.

I first saw this in a tweet by Felienne Hermans.

Update: Video of You can be a kernel hacker!

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples

September 19th, 2014

Digital Dashboards: Strategic & Tactical: Best Practices, Tips, Examples by Avinash Kaushik.

From the post:

The Core Problem: The Failure of Just Summarizing Performance.

I humbly believe the challenge is that in a world of too much data, with lots more on the way, there is a deep desire amongst executives to get “summarized data,” to get “just a snapshot,” or to get the “top-line view.” This is understandable of course.

But this summarization, snapshotting and toplining on your part does not actually change the business because of one foundational problem:

People who are closest to the data, the complexity, who’ve actually done lots of great analysis, are only providing data. They don’t provide insights and recommendations.

People who are receiving the summarized snapshot top-lined have zero capacity to understand the complexity, will never actually do analysis and hence are in no position to know what to do with the summarized snapshot they see.

The end result? Nothing.

Standstill. Gut based decision making. No real appreciation of the delicious opportunity in front of every single company on the planet right now to have a huger impact with data.

So what’s missing from this picture that will transform numbers into action?

I believe the solution is multi-fold (and when is it not? : )). We need to stop calling everything a dashboard. We need to create two categories of dashboards. For both categories, especially the valuable second kind of dashboards, we need words – lots of words and way fewer numbers.

Be aware that the implication of that last part I’m recommending is that you are going to become a lot more influential, and indispensable, to your organization. Not everyone is ready for that, but if you are this is going to be a fun ride!

A long post on “dashboards” but I find it relevant to the design of interfaces.

In particular the advice:

This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!

Hence: List the next steps. Assign responsibility for action items to people. Prioritize, prioritize, prioritize. Never forget to compute business impact.

Curious how exploration using a topic map could feed into an action process? Would you represent actors in the map and enable the creation of associations that represent assigned tasks? Other ideas?

I found this in a post, Don’t data puke, says Avinash Kaushik by Kaiser Fung and followed it to the original post.

Tokenizing and Named Entity Recognition with Stanford CoreNLP

September 19th, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP by Sujit Pal.

From the post:

I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Tool Kit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK’s, primarily as a way to learn NLP and Scala but also to build something that would be as enjoyable to work with as NLTK and have the benefit of Java’s rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK’s. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP’s NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

Truly a hard core way to learn NLP and Scala!
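For a flavor of the NLTK-style API being emulated, a minimal tokenizer sketch in Python. It mirrors the spirit of NLTK's RegexpTokenizer but is my own toy, not code from the post:

```python
import re

class RegexpTokenizer:
    """Minimal NLTK-style tokenizer: runs of word characters, or single
    punctuation marks, become tokens."""
    def __init__(self, pattern=r"\w+|[^\w\s]"):
        self._re = re.compile(pattern)

    def tokenize(self, text):
        return self._re.findall(text)

tok = RegexpTokenizer()
print(tok.tokenize("Stanford CoreNLP, meet Scala!"))
# -> ['Stanford', 'CoreNLP', ',', 'meet', 'Scala', '!']
```

The appeal Sujit describes is exactly this kind of small, obvious interface, regardless of which engine (OpenNLP, LingPipe, CoreNLP) does the work underneath.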

Excellent!

Looking forward to hearing more about this project.

Libraries may digitize books without permission, EU top court rules [Nation-wide Site Licenses?]

September 19th, 2014

Libraries may digitize books without permission, EU top court rules by Loek Essers.

From the post:

European libraries may digitize books and make them available at electronic reading points without first gaining consent of the copyright holder, the highest European Union court ruled Thursday.

The Court of Justice of the European Union (CJEU) ruled in a case in which the Technical University of Darmstadt digitized a book published by German publishing house Eugen Ulmer in order to make it available at its electronic reading posts, but refused to license the publisher’s electronic textbooks.

A spot of good news to remember on the next 9/11 anniversary: “A Member State may authorise libraries to digitise, without the consent of the rightholders, books they hold in their collection so as to make them available at electronic reading points.”

Users can’t make copies onto a USB stick but under contemporary fictions about property rights represented in copyright statutes that isn’t surprising.

What is surprising is that nations have not yet stumbled upon the idea of nation-wide site licenses for digital materials.

A nation acquiring a site license to the ACM Digital Library, IEEE, Springer and a dozen or so other resources/collections would have these positive impacts:

  1. Access to core computer science publications for everyone located in that nation
  2. Publishers would have one payor and could reduce/eliminate the staff that manage digital access subscriptions
  3. Universities and colleges would not require subscriptions nor the staff to manage those subscriptions (integration of those materials into collections would remain a library task)
  4. Simplified access software based on geographic IP location (fewer user/password issues)
  5. Universities and colleges could spend funds now dedicated to subscriptions for other materials
  6. Digitization of both periodical and monograph literature would be encouraged
  7. Avoids tiresome and not-likely-to-succeed arguments about balancing the public interest in IP rights discussions.

For me, #7 is the most important advantage of nation-wide licensing of digital materials. As you can tell by my reference to “contemporary fictions about property rights” I fall quite firmly on a particular side of the digital rights debate. However, I am more interested in gaining access to published materials for everyone than trying to convince others of the correctness of my position. Therefore, let’s adopt a new strategy: “Pay the man.”

As I outline above, there are obvious financial advantages to publishers from nation-wide site licenses, in the form of reduced internal costs, reduced infrastructure costs and a greater certainty in cash flow. There are advantages for the public as well as universities and colleges, so I would call that a win-win solution.

The Developing World Initiatives program by Taylor & Francis is described as:

Taylor & Francis Group is committed to the widest distribution of its journals to non-profit institutions in developing countries. Through agreements with worldwide organisations, academics and researchers in more than 110 countries can access vital scholarly material, at greatly reduced or no cost.

Why limit access to materials to “non-profit institutions in developing countries?” Granting that the site-license fees for the United States would be higher than those for Liberia, the underlying principle is the same. The less you regulate access, the simpler the delivery model and the higher the profit to the publisher. What publisher would object to that?

There are armies of clerks currently invested in the maintenance of one-off subscription models but the greater public interest in access to materials consistent with publisher IP rights should carry the day.

If Tim O’Reilly and friends are serious about changing access models to information, let’s use nation-wide site licenses to eliminate firewalls and make effective linking and transclusion a present day reality.

Publishers get paid, readers get access. It’s really that simple. Just on a larger scale than is usually discussed.

PS: Before anyone raises the issue of cost for nation-wide site licenses, remember that the United States has spent more than $1 trillion in a “war” on terrorism that has made no progress in making the United States or its citizens more secure.

If the United States had put the cost of its “war” on terrorism toward the €866m ($1,113.31m) total revenue Springer Science+Business Media made in 2012, it could have purchased a site license to all Springer Science+Business Media content for the entire United States for 898.47 years. (Check my math: 1,000,000,000,000 / 1,113,000,000 = 898.47.)
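A quick check of that arithmetic, using the rounded dollar figure from the post:

```python
war_on_terror_usd = 1_000_000_000_000      # > $1 trillion, per the post
springer_2012_revenue_usd = 1_113_000_000  # €866m converted, as rounded above

years = war_on_terror_usd / springer_2012_revenue_usd
print(round(years, 2))  # -> 898.47
```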

I first saw this in Nat Torkington’s Four short links: 15 September 2014.

Common Sense and Statistics

September 18th, 2014

Common Sense and Statistics by John D. Cook.

From the post:

…, common sense is vitally important in statistics. Attempts to minimize the need for common sense can lead to nonsense. You need common sense to formulate a statistical model and to interpret inferences from that model. Statistics is a layer of exact calculation sandwiched between necessarily subjective formulation and interpretation. Even though common sense can go badly wrong with probability, it can also do quite well in some contexts. Common sense is necessary to map probability theory to applications and to evaluate how well that map works.

No matter how technical or complex analysis may appear, do not hesitate to ask for explanations if the data or results seem “off” to you. I witnessed a presentation several years ago in which the manual for a statistics package was cited as authority for the proposition that a result was significant.
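One reason such questions pay off: a lone "significant" result means little once you ask how many tests were run. The arithmetic of multiple comparisons is simple enough to check yourself:

```python
# Probability of at least one spurious "significant" result among
# m independent tests of true null hypotheses at level alpha.
def family_false_positive(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, round(family_false_positive(m), 3))
```

With 20 independent tests at p < 0.05, a false positive somewhere is more likely than not, which is exactly the kind of common-sense check Cook is arguing for.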

I know you have never encountered that situation but you may know others who have.

Never fear asking questions about methods or results. Your colleagues are wondering the same things but are too afraid of appearing ignorant to ask questions.

Ignorance is curable. Willful ignorance is not.

If you aren’t already following John D. Cook, you should.

Learn Datalog Today

September 18th, 2014

Learn Datalog Today by Jonas Enlund.

From the homepage:

Learn Datalog Today is an interactive tutorial designed to teach you the Datomic dialect of Datalog. Datalog is a declarative database query language with roots in logic programming. Datalog has expressive power similar to that of SQL.

Datomic is a new database with an interesting and novel architecture, giving its users a unique set of features. You can read more about Datomic at http://datomic.com and the architecture is described in some detail in this InfoQ article.


You have been meaning to learn Datalog but it just hasn’t happened.

Now is the time to break that cycle and do the deed!

This interactive tutorial should ease you on your way to learning Datalog.

It can’t learn Datalog for you but it can make the journey a little easier.
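To make that concrete: Datomic's Datalog matches query clauses against entity-attribute-value facts, and a simple query is essentially a join on shared variables. A rough Python analogy (the data and attribute names are invented, not from the tutorial):

```python
# Facts as (entity, attribute, value) tuples, the shape Datomic stores.
facts = [
    (1, "person/name", "Alice"),
    (1, "person/born", 1915),
    (2, "person/name", "Bob"),
    (2, "person/born", 1922),
]

# Datalog: [:find ?name :where [?e :person/born 1915] [?e :person/name ?name]]
# The Python analogy joins the two clauses on the shared variable ?e.
def born_in(year):
    entities = {e for (e, a, v) in facts if a == "person/born" and v == year}
    return [v for (e, a, v) in facts if a == "person/name" and e in entities]

print(born_in(1915))  # -> ['Alice']
```

The tutorial covers the real syntax, plus rules and aggregates that this sketch ignores.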

Enjoy!

From Frequency to Meaning: Vector Space Models of Semantics

September 18th, 2014

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.

Abstract:

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010, but mastery of this paper will certainly give you the foundation to do so. Impressive work.
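To make the first of the paper's three matrix classes concrete, here is a stdlib-only toy of a term-document matrix with cosine similarity (my sketch, not code from the paper):

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat ate the fish",
    "d3": "stock markets fell sharply",
}

# Term-document matrix: one term-frequency vector per document.
vocab = sorted({w for text in docs.values() for w in text.split()})
matrix = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in vocab)
    norm = lambda v: math.sqrt(sum(v[w] ** 2 for w in vocab))
    return dot / (norm(a) * norm(b))

# The two cat documents are closer to each other than to the finance one.
print(cosine(matrix["d1"], matrix["d2"]))
print(cosine(matrix["d1"], matrix["d3"]))
```

The word-context and pair-pattern matrices the paper surveys follow the same pattern with different rows and columns.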

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.

I don’t dispute that VSMs have been incredibly useful and like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)

Enjoy!

I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which has no preordained metric. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.

2015 Medical Subject Headings (MeSH) Now Available

September 18th, 2014

2015 Medical Subject Headings (MeSH) Now Available

From the post:

Introduction to MeSH 2015
The Introduction to MeSH 2015 is now available, including information on its use and structure, as well as recent updates and availability of data.

MeSH Browser
The default year in the MeSH Browser remains 2014 MeSH for now, but the alternate link provides access to 2015 MeSH. The MeSH Section will continue to provide access via the MeSH Browser for two years of the vocabulary: the current year and an alternate year. Sometime in November or December, the default year will change to 2015 MeSH and the alternate link will provide access to the 2014 MeSH.

Download MeSH
Download 2015 MeSH in XML and ASCII formats. Also available for 2015 from the same MeSH download page are:

  • Pharmacologic Actions (Forthcoming)
  • New Headings with Scope Notes
  • MeSH Replaced Headings
  • MeSH MN (tree number) changes
  • 2015 MeSH in MARC format
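If you grab the XML download, here is a sketch of pulling descriptor headings with the standard library. The element names follow the DescriptorRecord structure of recent MeSH releases; verify them against the 2015 DTD before relying on this:

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the shape of MeSH descriptor XML; the real 2015
# file is far larger, and its DTD is the authority on structure.
sample = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000001</DescriptorUI>
    <DescriptorName><String>Calcimycin</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>
"""

root = ET.fromstring(sample)
headings = {
    rec.findtext("DescriptorUI"): rec.findtext("DescriptorName/String")
    for rec in root.iter("DescriptorRecord")
}
print(headings)  # -> {'D000001': 'Calcimycin'}
```

For the full download, swap `ET.fromstring` for `ET.iterparse` so you don't hold all of MeSH in memory at once.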

Enjoy!

Convince your boss to use Clojure

September 17th, 2014

Convince your boss to use Clojure by Eric Normand.

From the post:

Do you want to get paid to write Clojure? Let’s face it. Clojure is fun, productive, and more concise than many languages. And probably more concise than the one you’re using at work, especially if you are working in a large company. You might code in Clojure at home. Or maybe you want to get started in Clojure but don’t have time if it’s not for work.

One way to get paid for doing Clojure is to introduce Clojure into your current job. I’ve compiled a bunch of resources for getting Clojure into your company.

Take these resources and do your homework. Bringing a new language into an existing company is not easy. I’ve summarized some of the points that stood out to me, but the resources are excellent so please have a look yourself.

Great strategy and list of resources for Clojure folks.

How would you adapt this strategy to topic maps and what resources are we missing?

I first saw this in a tweet by Christophe Lalanne.

Elementary Applied Topology

September 17th, 2014

Elementary Applied Topology by Robert Ghrist.

From the introduction:

What topology can do

Topology was built to distinguish qualitative features of spaces and mappings. It is good for, inter alia:

  1. Characterization: Topological properties encapsulate qualitative signatures. For example, the genus of a surface, or the number of connected components of an object, give global characteristics important to classification.
  2. Continuation: Topological features are robust. The number of components or holes is not something that should change with a small error in measurement. This is vital to applications in scientific disciplines, where data is always noisy.
  3. Integration: Topology is the premiere tool for converting local data into global properties. As such, it is rife with principles and tools (Mayer-Vietoris, Excision, spectral sequences, sheaves) for integrating from local to global.
  4. Obstruction: Topology often provides tools for answering feasibility of certain problems, even when the answers to the problems themselves are hard to compute. These characteristics, classes, degrees, indices, or obstructions take the form of algebraic-topological entities.

What topology cannot do

Topology is fickle. There is no recourse to tweaking epsilons should desiderata fail to be found. If the reader is a scientist or applied mathematician hoping that topological tools are a quick fix, take this text with caution. The reward of reading this book with care may be limited to the realization of new questions as opposed to new answers. It is not uncommon that a new mathematical tool contributes to applications not by answering a pressing question-of-the-day but by revealing a different (and perhaps more significant) underlying principle.

The text will require more than casual interest but what a tool to add to your toolbox!

I first saw this in a tweet from Topology Fact.

Stuff Goes Bad: Erlang in Anger

September 17th, 2014

Stuff Goes Bad: Erlang in Anger by Fred Hebert.

From the webpage:

This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

From the introduction:

This book is not for beginners. There is a gap left between most tutorials, books, training sessions, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production. There’s a fumbling phase implicit to a programmer’s learning of a new language and environment where they just have to figure out how to get out of the guidelines and step into the real world, with the community that goes with it.

This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit — usually when I consider them tricky — and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary.

What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices about deploying Erlang in a production environment. (footnote numbers omitted)

With exercises no less.

Reminds me of a book I had some years ago on causing and then debugging Solr core dumps. ;-) I don’t think it was ever a best seller but it was a fun read.

Great title by the way.

I first saw this in a tweet by Chris Meiklejohn.

Understanding weak isolation is a serious problem

September 17th, 2014

Understanding weak isolation is a serious problem by Peter Bailis.

From the post:

Modern transactional databases overwhelmingly don’t operate under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, Snapshot Isolation. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students wrote an entire dissertation on the topic, and, even then, the definitions we have still aren’t perfect and can vary between databases.

To put this problem in perspective, there’s a flood of interesting new research that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.

That debates are occurring without full knowledge of the issues at hand isn’t all that surprising. Or as Job 38:2 (KJV) puts it: “Who is this that darkeneth counsel by words without knowledge?”

Peter raises a number of questions and points to resources that are good starting points for investigation of weak isolation.
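As a toy illustration (mine, not drawn from Peter's post), here is a minimal Python sketch of the classic lost-update anomaly that weaker isolation levels can permit: two transactions read the same committed balance before either writes, and one update silently disappears.

```python
# Simulate two interleaved read-modify-write "transactions" without
# serializable isolation. Each reads the balance, computes a new value,
# then writes it back -- the second write clobbers the first.

balance = {"acct": 100}

def deposit(read_value, amount):
    # A "transaction": compute the new balance from a previously read value.
    return read_value + amount

# Under weak isolation, both transactions may read before either writes:
read_t1 = balance["acct"]   # T1 reads 100
read_t2 = balance["acct"]   # T2 reads 100 (T1 has not committed yet)

balance["acct"] = deposit(read_t1, 50)  # T1 commits: 150
balance["acct"] = deposit(read_t2, 30)  # T2 commits: 130 -- T1's deposit is lost

# A serializable execution would end at 180; this history ends at 130.
print(balance["acct"])
```

Serializability forbids this history; Read Committed, which many databases default to, does not.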

What sort of weak isolation does your topic map storage mechanism use?

I first saw this in a tweet by Justin Sheehy.

Call for Review: HTML5 Proposed Recommendation Published

September 17th, 2014

Call for Review: HTML5 Proposed Recommendation Published

From the post:

The HTML Working Group has published a Proposed Recommendation of HTML5. This specification defines the 5th major revision of the core language of the World Wide Web: the Hypertext Markup Language (HTML). In this version, new features are introduced to help Web application authors, new elements are introduced based on research into prevailing authoring practices, and special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability. Comments are welcome through 14 October. Learn more about the HTML Activity.

Now would be the time to submit comments, corrections, etc.

Deadline: 14 October 2014.

International Conference on Machine Learning 2014 Videos!

September 17th, 2014

International Conference on Machine Learning 2014 Videos!

You may recall my post on the ICML 2014 papers.

Speaking just for myself, I would prefer a resource with both the videos and relevant papers listed together.

Do you know of such a resource?

If not, when time permits I may conjure one up.

As with disclosed semantic mappings, it is more efficient for one person to create a mapping that is reused by many than for many people to separately and partially repeat the same mapping.

You may remember that one mapping, many reuses, is a central principle of indexes, library catalogs, filing systems, case citations, etc.

I first saw this in a tweet by EyeWire.

ODNI and the U.S. DOJ Commemorate 9/11

September 17th, 2014

Statement by the ODNI and the U.S. DOJ on the Declassification of Documents Related to the Protect America Act Litigation September 11, 2014

What better way to mark the anniversary of 9/11 than with a fuller account of another attack on the United States of America and its citizens. This attack came not from a small band of criminals but from a betrayal of the United States by those sworn to protect the rights of its citizens.

From the post:

On January 15, 2009, the U.S. Foreign Intelligence Surveillance Court of Review (FISC-R) published an unclassified version of its opinion in In Re: Directives Pursuant to Section 105B of the Foreign Intelligence Surveillance Act, 551 F.3d 1004 (Foreign Intel. Surv. Ct. Rev. 2008). The classified version of the opinion was issued on August 22, 2008, following a challenge by Yahoo! Inc. (Yahoo!) to directives issued under the Protect America Act of 2007 (PAA). Today, following a renewed declassification review, the Executive Branch is publicly releasing various documents from this litigation, including legal briefs and additional sections of the 2008 FISC-R opinion, with appropriate redactions to protect national security information. These documents are available at the website of the Office of the Director of National Intelligence (ODNI), www.dni.gov; and ODNI’s public website dedicated to fostering greater public visibility into the intelligence activities of the U.S. Government, IContheRecord.tumblr.com. A summary of the underlying litigation follows.

In case you haven’t been following along, the crux of the case was Yahoo’s refusal on Fourth Amendment grounds to comply with a fishing expedition by the Director of National Intelligence and the Attorney General for information on one or more alleged foreign nationals. Motion to Compel Compliance with Directives of the Director of National Intelligence and Attorney General.

Not satisfied with violating their duties to uphold the Constitution, the DNI and AG decided to add strong-arming/extortion to their list of crimes. The government sought civil contempt fines that started at $250,000 per day and doubled each week thereafter that Yahoo! failed to comply with the court’s judgment. Government’s Motion for an Order of Civil Contempt.

Take care to note that all of this occurred in absolute secrecy. It would not do to have other corporations or the American public aware that rogue elements in the government were deciding which rights citizens of the United States enjoy and which ones they don’t.

You may also want to read Highlights from the Newly Declassified FISCR Documents by Marc Zwillinger and Jacob Sommer. They are the lawyers who represented Yahoo in the challenge covered by the released documents.

We all owe them a debt of gratitude for their hard work but we also have to acknowledge that Yahoo, Zwillinger and Sommer were complicit in enabling the Foreign Intelligence Surveillance Court (FISC) and the Foreign Intelligence Court of Review (FISCR) to continue their secret work.

Yes, Yahoo, Zwillinger and Sommer would have faced life-changing consequences had they gone public with what they did know, but everyone has a choice when faced with oppressive government action. You can, as the parties in this case did, further the popular fiction that mining user metadata is an effective (rather than convenient) tool against terrorism.

Or you can decide to “blow the whistle” on wasteful and illegal activities by the government in question.

Had Yahoo, Zwillinger or Sommer known of any data in their possession that was direct evidence of a terrorist attack or plot, they would have called the Office of the Director of National Intelligence, or at least their local FBI office. Yes? Wouldn’t any sane person do the same?

Ah, but you see, that’s the point. There wasn’t any such data. Not then. Not now. Read the affidavits, at least the parts that aren’t blacked out, and you get the distinct impression that the government is not only fishing, but it is hopeful fishing. “There might be something, somewhere that somehow might be useful to somebody but we don’t know.” is a fair summary of the government’s position in the Yahoo case.

A better way to commemorate 9/11 next year would be with numerous brave souls taking the moral responsibility to denounce those who have betrayed their constitutional duties in the cause of fighting terrorism. I prefer occasional terrorism over the destruction of the Constitution of the United States.

You?

I started the trail that led to this post from a tweet by Ben Gilbert.

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

September 16th, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)
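A toy stochastic simulation makes the quoted point concrete. The sketch below (the rates are illustrative, not taken from any real system) runs a Gillespie-style model of a single molecular species: identical parameters, different random bumps, different outcomes.

```python
import random

def simulate_cell(production=5.0, degradation=0.1, t_end=50.0, seed=None):
    """Gillespie-style simulation of one molecular species.

    Production fires at a constant rate; each molecule degrades
    independently. Returns the molecule count at time t_end.
    Rates here are illustrative placeholders.
    """
    rng = random.Random(seed)
    t, count = 0.0, 0
    while True:
        rate_total = production + degradation * count
        t += rng.expovariate(rate_total)   # waiting time to the next event
        if t >= t_end:
            return count
        # Choose which event occurred, weighted by its rate:
        if rng.random() < production / rate_total:
            count += 1   # a molecule is produced
        else:
            count -= 1   # a molecule degrades

# Identical "genomes" and "environments", different outcomes:
counts = [simulate_cell(seed=s) for s in range(5)]
print(counts)
```

Every run uses the same "nature" (parameters) and "nurture" (none), yet the counts differ: that residual spread is the intrinsic noise the article describes.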

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. ;-)

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see on most mailing list discussions.

I first saw this in a tweet by John Horgan.

Getting Started with S4, The Self-Service Semantic Suite

September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP, but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.
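For a feel of what calling such a RESTful service involves, here is a hedged Python sketch that only constructs the request. The endpoint URL, header names, and payload fields are illustrative placeholders, not the actual S4 API; Marin's post and Ontotext's documentation have the real values.

```python
import json
import urllib.request

def build_annotation_request(api_key, key_secret, text,
                             endpoint="https://api.example.com/v1/news"):
    """Construct (but do not send) an authenticated JSON POST request.

    Every name here is a placeholder -- the real S4 endpoint, auth
    scheme, and payload schema are defined in Ontotext's docs.
    """
    payload = json.dumps({
        "document": text,
        "documentType": "text/plain",
    }).encode("utf-8")
    req = urllib.request.Request(endpoint, data=payload, method="POST")
    req.add_header("Content-Type", "application/json")
    req.add_header("Accept", "application/json")
    # Placeholder credential header; the real scheme may differ:
    req.add_header("Authorization", f"ApiKey {api_key}:{key_secret}")
    return req

req = build_annotation_request("my-key", "my-secret", "Ontotext released S4.")
print(req.full_url, req.get_method())
```

Sending the request (with `urllib.request.urlopen(req)`) would count against the monthly quota, so keeping request construction separate from dispatch makes it easy to test without burning free-tier calls.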

I first saw this in a tweet by Tony Agresta.

Stephen Wolfram Launching Today: Mathematica Online! (w/ secret pricing)

September 16th, 2014

Launching Today: Mathematica Online! by Stephen Wolfram.

From the post:

It’s been many years in the making, and today I’m excited to announce the launch of Mathematica Online: a version of Mathematica that operates completely in the cloud—and is accessible just through any modern web browser.

In the past, using Mathematica has always involved first installing software on your computer. But as of today that’s no longer true. Instead, all you have to do is point a web browser at Mathematica Online, then log in, and immediately you can start to use Mathematica—with zero configuration.

Some of the advantages that Stephen outlines:

  • Manipulate can be embedded in any web page
  • Files are stored in the Cloud to be accessed from anywhere or easily shared
  • Mathematica can now be used on mobile devices

What’s the one thing that isn’t obvious from Stephen’s post?

The pricing for access to Mathematical Online.

A Wolfram insider, proofing Stephen’s post, probably said: “Oh, shit! Our pricing information is secret! What do you say in the post?”

So Stephen writes:

But get Mathematica Online too (which is easy to do—through Premier Service Plus for individuals, or a site license add-on).

You can do that, or at least try to. If you manage to hunt down Premier Service, you will find you need an activation key just to see the pricing information.

If you don’t have a copy of Mathematica, you aren’t going to be ordering Mathematica Online today.

Sad that such remarkable software has such poor marketing.

Shout out to Stephen: Lots of people are interested in using Mathematica Online or off. Byzantine marketing excludes waiting, would-be paying customers.

I first saw this in a tweet by Alex Popescu.

A $23 million venture fund for the government tech set

September 16th, 2014

A $23 million venture fund for the government tech set by Nancy Scola.

Nancy tells a compelling story of a new VC firm, GovTech, which is looking for startups focused on providing governments with better technology infrastructure.

Three facts from the story stand out:

“The U.S. government buys 10 eBays’ worth of stuff just to operate,” from software to heavy-duty trucking equipment.

…working with government might be a tortuous slog, but Bouganim says that he saw that behind that red tape lay a market that could be worth in the neighborhood of $500 billion a year.

What most people don’t realize is government spends nearly $74 billion on technology annually. As a point of comparison, the video game market is a $15 billion annual market.

See Nancy’s post for the full flavor of the story but it sounds like there is gold buried in government IT.

Another way to look at it is the government is already spending $74 billion a year on technology that is largely an object of mockery and mirth. Effective software may be sufficiently novel and threatening to either attract business or a buy-out.

While you are pondering possible opportunities, existing systems, their structures and data are “subjects” in topic map terminology. Which means topic maps can protect existing contracts and relationships, while delivering improved capabilities and data.

Promote topic maps as “in addition to” existing IT systems and you will encounter less resistance both from within and without the government.

Don’t be squeamish about associating with governments, of whatever side. Their money spends just like everyone else’s. You can ask AT&T and IBM about supporting both sides in a conflict.

I first saw this in a tweet by Mike Bracken.

User Onboarding

September 16th, 2014

User Onboarding by Samuel Hulick.

From the webpage:

Want to see how popular web apps handle their signup experiences? Here’s every one I’ve ever reviewed, in one handy list.

I have substantially altered Samuel’s presentation to fit the list onto one screen and to open new tabs, enabling quick comparison of onboarding experiences.

Asana, Basecamp, Buffer, Evernote, Foursquare, GetResponse, Gmail, Instagram, InVision, iOS, LessAccounting, LiveChat, Mailbox for Mac, Meetup, Netflix, OkCupid, Optimizely, Pinterest, Pocket, Quora, Shopify, Slack, Slingshot, Snapchat, Trello, Tumblr, Twitter, Vimeo, WhatsApp

Writers become better by reading good writers.

Non-random good onboarding comes from studying previous good onboarding.

Enjoy!

I first saw this in a tweet by Jason Ziccardi.

Shanghai Library adds 2 million records to WorldCat…

September 16th, 2014

Shanghai Library adds 2 million records to WorldCat to share its collection with the world Compiled by Ming POON, Josephine SCHE, and Mi Chu WIENS (November, 2004).

From the post:

Shanghai Library, the largest public library in China and one of the largest libraries in the world, has contributed 2 million holdings to WorldCat, including some 770,000 unique bibliographic records, to share its collection worldwide.

These records, which represent books and journals published between 1911 and 2013, were loaded in WorldCat earlier this year. The contribution from Shanghai Library, an OCLC member since 1996, enhances the richness and depth of Chinese materials in WorldCat as well as the discoverability of these collections around the world.

“We are pleased to add Shanghai Library’s holdings to WorldCat, which is the global union catalog of library collections,” said Dr. Jianzhong Wu, Director, Shanghai Library. “Shanghai is a renowned, global city, and the library should be as well. With WorldCat, we not only raise the visibility of our collection to a global level but we also share our national heritage and identity with other libraries and their users through the OCLC WorldShare Interlibrary Loan service.”

“The leadership of Shanghai Library has a bold global vision,” says Andrew H. Wang, Vice President, OCLC Asia Pacific. “The addition of Shanghai Library’s holdings and unique records enriches coverage of the Chinese collection in WorldCat for researchers everywhere.”

I don’t have a feel for how many unique Chinese bibliographic records are online but 770,000 sounds like a healthy addition.

You may also be interested in: Online Resources for Chinese Studies in North American Libraries.

Given the compilation date, 2004, I ran the W3C Link Checker on http://www.loc.gov/rr/asian/china-bib/.

You can review the results at: http://www.durusau.net/publications/W3CLinkChecker:http:_www.loc.gov_rr_asian_china-bib_.html

Summary of results:

Code Occurrences What to do
(N/A) 6 The link was not checked due to robots exclusion rules. Check the link manually, and see also the link checker documentation on robots exclusion.
(N/A) 2 The hostname could not be resolved. Check the link for typos.
403 1 The link is forbidden! This needs fixing. Usual suspects: a missing index.html or Overview.html, or a missing ACL.
404 61 The link is broken. Double-check that you have not made any typo, or mistake in copy-pasting. If the link points to a resource that no longer exists, you may want to remove or fix the link.
500 5 This is a server side problem. Check the URI.

(emphasis added)

At a minimum, the broken links need to be corrected, but updating the listing to include new resources would make a nice graduate student project.
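The checker's triage logic above can be sketched in a few lines of Python, a simplified categorizer covering just the classes in the summary:

```python
def triage(status):
    """Map an HTTP status code to a rough 'what to do' category,
    mirroring the W3C link checker summary (simplified)."""
    if status is None:
        return "not checked (robots exclusion or unresolvable host)"
    if status == 403:
        return "forbidden -- check ACLs or a missing index page"
    if status == 404:
        return "broken link -- fix or remove"
    if 500 <= status < 600:
        return "server-side problem -- check the URI"
    if 200 <= status < 300:
        return "ok"
    return "other -- inspect manually"

for code in [200, 403, 404, 500, None]:
    print(code, "->", triage(code))
```

A real re-check of the bibliography would pair this with actual HTTP HEAD requests, but the categorization step is where the human judgment in the table comes in.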

I don’t have the background or language skills with Chinese resources to embark on such a project but would be happy to assist anyone who undertakes the task.