Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 10, 2014

Back to the future of databases

Filed under: Database,Linux OS,NoSQL — Patrick Durusau @ 6:37 pm

Back to the future of databases by Yin Wang.

From the post:

Why do we need databases? What a stupid question. I already heard some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, we don’t have enough memory to hold all the data. But why can’t we just swap out the objects to disk and load them back when needed? The answer is yes we can, but not in Unix, because Unix manages memory as pages, not as objects. There are systems who lived before Unix that manage memory as objects, and perform object-granularity persistence. That is a feature ahead of its time, and is until today far more advanced than the current state-of-the-art. Here are some pictures of such systems:

Certainly thought-provoking, but how much of an advantage would object-granularity persistence have to offer before it could make headway against the installed base of Unix?

The database field is certainly undergoing rapid research and development, with no clear path to a winner.

Will the same happen with OSes?
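For a taste of what object-granularity persistence feels like from user space, here is a minimal Python sketch using the standard shelve module. It is a library bolted on top of a page-managed OS rather than the OS-level object persistence the post describes, which is rather the point; the file name and keys below are illustrative only.

```python
import shelve

# Objects are pickled to disk by key and loaded back on demand,
# approximating object-granularity persistence in user space.
with shelve.open("objects.db") as db:
    db["point"] = {"x": 1, "y": 2}
    db["labels"] = ["a", "b", "c"]

# A later run (or another process) gets the objects back without
# managing pages, files or serialization formats by hand.
with shelve.open("objects.db") as db:
    print(db["point"]["x"], db["labels"])
```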

Parameterizing Queries in Solr and Elasticsearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:19 pm

Parameterizing Queries in Solr and Elasticsearch by RAFAŁ KUĆ.

From the post:

We all know how good it is to have abstraction layers in software we create. We tend to abstract implementation from the method contracts using interfaces, we use n-tier architectures so that we can abstract and divide different system layers from each other. This is very good – when we change one piece, we don’t need to touch the other parts that only knew about method contracts, API’s, etc. Why not do the same with search queries? Can we even do that in Elasticsearch and Solr? We can and I’ll show you how to do that.

The problem

Imagine, that we have a query, a complicated one, with boosts, sorts, facets and so on. However in most cases the query is pretty static when it comes to its structure and the only thing that changes is one of the filters in the query (actually a filter value) and the query entered by the user. I guess such situation could ring a bell for someone who developed a search application. Of course we can include the whole query in the application itself and reuse it. But in such case, changes to boosts for example requires us to deploy the application or a configuration file. And if more than a single application uses the same query, than we need to change them all.

What if we could make the change on the search server side only and let application pass the necessary data only? That would be nice, but it requires us to do some work on the search server side.

For the purpose of the blog post, let’s assume that we want to have a query that:

  • searches for documents with terms entered by the user,
  • limits the searches to a given category,
  • displays facet results for the price ranges

This is a simple example, so that the queries are easy to understand. So, in the perfect world we would only need to provide user query and category identifier to a search engine.

It is encouraging to see someone offer solutions to the same search problem from both the Solr and Elasticsearch perspectives.

Not to mention that I think you will find this very useful.
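To make the idea concrete, here is a minimal Python sketch of the Elasticsearch side of the approach: a Mustache search template stored on the server, so the application only supplies the user's terms and a category identifier. The index, field names and template id are hypothetical, and the template API has changed between Elasticsearch versions, so treat this as an illustration of the pattern rather than a recipe (Rafał's post has the real queries for both engines).

```python
import json
import requests

ES = "http://localhost:9200"

# Register a server-side search template (Mustache syntax). Changing boosts,
# filters or facets now means updating this template, not every application.
template = {
    "template": {
        "query": {
            "filtered": {
                "query": {"match": {"name": {"query": "{{q}}", "boost": 2}}},
                "filter": {"term": {"category_id": "{{category}}"}},
            }
        },
        "aggs": {
            "prices": {
                "range": {
                    "field": "price",
                    "ranges": [{"to": 50}, {"from": 50, "to": 200}, {"from": 200}],
                }
            }
        },
    }
}
requests.post(ES + "/_search/template/product_search", data=json.dumps(template))

# The application passes only the user query and the category identifier.
body = {"template": {"id": "product_search"},
        "params": {"q": "solr cookbook", "category": "12"}}
response = requests.get(ES + "/products/_search/template", data=json.dumps(body))
print(response.json())
```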

The Encyclopedia of Life v2:…

Filed under: Bioinformatics,Biology,Encyclopedia,Semantic Inconsistency — Patrick Durusau @ 4:11 pm

The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth by Cynthia S. Parr, et al. (Biodiversity Data Journal 2: e1079 (29 Apr 2014) doi: 10.3897/BDJ.2.e1079)

Abstract:

The Encyclopedia of Life (EOL, http://eol.org) aims to provide unprecedented global access to a broad range of information about life on Earth. It currently contains 3.5 million distinct pages for taxa and provides content for 1.3 million of those pages. The content is primarily contributed by EOL content partners (providers) that have a more limited geographic, taxonomic or topical scope. EOL aggregates these data and automatically integrates them based on associated scientific names and other classification information. EOL also provides interfaces for curation and direct content addition. All materials in EOL are either in the public domain or licensed under a Creative Commons license. In addition to the web interface, EOL is also accessible through an Application Programming Interface.

In this paper, we review recent developments added for Version 2 of the web site and subsequent releases through Version 2.2, which have made EOL more engaging, personal, accessible and internationalizable. We outline the core features and technical architecture of the system. We summarize milestones achieved so far by EOL to present results of the current system implementation and establish benchmarks upon which to judge future improvements.

We have shown that it is possible to successfully integrate large amounts of descriptive biodiversity data from diverse sources into a robust, standards-based, dynamic, and scalable infrastructure. Increasing global participation and the emergence of EOL-powered applications demonstrate that EOL is becoming a significant resource for anyone interested in biological diversity.

This section on the organization of the taxonomy for the Encyclopedia of Life v2 seems particularly relevant:

Resource documents made available by content partners define the text and multimedia being provided as well as the taxa to which the content refers, the associations between content and taxa, and the associations among taxa (i.e. taxonomies). Expert taxonomists often disagree about the best classification for a given group of organisms, and there is no universal taxonomy for partners to adhere to (Patterson et al. 2008, Rotman et al. 2012a, Yoon and Rose 2001). As an aggregator, EOL accepts all taxonomic viewpoints from partners and attempts to assign them to existing Taxon Pages, or create new Taxon Pages when necessary. A reconciliation algorithm uses incoming taxon information, previously indexed data, and assertions from our curators to determine the best aggregation strategy. (links omitted)

Integration of information without agreement on a single view of the information. (Have we heard this before?)

If you think of the taxon pages as proxies, it is easier to see the topic map aspects of this project.

Self organising hypothesis networks

Filed under: Medical Informatics,Networks,Self-Organizing — Patrick Durusau @ 3:51 pm

Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge by Thierry Hanser, et al. (Journal of Cheminformatics 2014, 6:21)

Abstract:

Background

Combining different sources of knowledge to build improved structure activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework to interoperate between learning techniques. Most of the current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility to directly combine these sources at the knowledge level, with the aim to harvest potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models.

Results

To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into hierarchical network. This unification permits to combine different sources of knowledge into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity.

Conclusion

It is possible to represent knowledge in the unified form of a hypothesis network allowing interpretable predictions with performances comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.

One interesting feature of this publication is a graphic abstract:

[graphical abstract image]

Assuming one could control the length of the graphic abstracts, that would be an interesting feature for conference papers.

What should be the icon for repeating old news before getting to the new stuff? 😉

Among a number of good points in this paper, see in particular:

  • Distinction between SOHN and “a Galois lattice used in Formal Concept
    Analysis [19] (FCA)” (at page 10).
  • Discussion of the transparency of this approach at page 21.

In a very real sense, announcing an answer to a medical question may be welcome, but it isn’t very informative. Nor will it enable others to advance the medical arts.

What other domains are there where answers are important, but where how you arrived at an answer is equally important, if not more so?

May 9, 2014

Tips for Clojure Beginners

Filed under: Clojure,Functional Programming — Patrick Durusau @ 7:15 pm

Tips for Clojure Beginners by Ben Orenstein.

Ben has seven (7) practical tips for learning Clojure.

No one knows if Clojure will be the breakthrough functional programming language, but when you realize that mutable data structures are artifacts of limited storage, any functional programming experience is going to be worthwhile.

Large Scale Web Scraping

Filed under: Data Mining,Web Scrapers — Patrick Durusau @ 7:03 pm

We Just Ran Twenty-Three Million Queries of the World Bank’s Website – Working Paper 362 by Sarah Dykstra, Benjamin Dykstra, and Justin Sandefur.

Abstract:

Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank’s PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank’s web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at www.cgdev.org/povcalnet.

That’s what I would call large scale web scraping!

A useful model to follow for many sources, such as the U.S. Department of Agriculture. A gold mine of reports, data, and statistics, but all broken up for the manual act of reading. Or at least that is a charitable explanation for their current data organization.
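The mechanics of an exercise like this are mostly patience and politeness: enumerate the query grid, fetch, parse, rate-limit, and write results out as you go. A minimal sketch of that pattern in Python, with a hypothetical endpoint and parameter names (the real PovcalNet interface differs):

```python
import csv
import itertools
import time

import requests

# Hypothetical endpoint and parameters, for illustration only.
BASE_URL = "http://example.org/povcalnet"
COUNTRIES = ["AGO", "BGD", "BRA"]  # 942 surveys in the real exercise
POVERTY_LINES = [round(0.5 + 0.25 * i, 2) for i in range(40)]  # grid of query points

with open("distribution_points.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["country", "poverty_line", "headcount"])
    for country, line in itertools.product(COUNTRIES, POVERTY_LINES):
        resp = requests.get(BASE_URL,
                            params={"country": country, "line": line},
                            timeout=30)
        resp.raise_for_status()
        # Assume a JSON payload; a real service may return HTML that needs parsing.
        writer.writerow([country, line, resp.json().get("headcount")])
        time.sleep(0.5)  # be polite: rate-limit requests to a public service
```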

…Locality Sensitive Hashing for Unstructured Data

Filed under: Hashing,Jaccard Similarity,Similarity,Subject Identity — Patrick Durusau @ 6:51 pm

Practical Applications of Locality Sensitive Hashing for Unstructured Data by Jake Drew.

From the post:

The purpose of this article is to demonstrate how the practical Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required in high dimensional spaces when finding similar items. Locality Sensitive Hashing accomplishes this efficiency by exponentially reducing the amount of data required for storage when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing successfully reduces a high dimensional feature space while still retaining a random permutation of relevant features which research has shown can be used between data sets to determine an accurate approximation of Jaccard similarity [2,3].

Complete with code and references no less!

How “similar” do two items need to be to count as the same item?

If two libraries own a physical copy of the same book, for some purposes they are distinct items but for annotations/reviews, you could treat them as one item.

If that sounds like a topic map-like question, you're right!

What measures of similarity are you applying to what subjects?
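For readers who want the intuition behind the Jaccard connection, here is a minimal MinHash sketch in Python (not the author's code): each item's shingle set is reduced to a short signature of minimum hash values, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the original sets. A full LSH system then bands these signatures into buckets so that only items sharing a bucket are ever compared, which is where the drastic reduction in search time comes from.

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function, approximating random permutations."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("The Encyclopedia of Life provides global access to knowledge")
b = shingles("The Encyclopedia of Life gives global access to knowledge")
print(round(estimated_jaccard(minhash_signature(a), minhash_signature(b)), 2))
print(round(len(a & b) / len(a | b), 2))  # exact Jaccard, for comparison
```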

The Data Journalism Handbook

Filed under: Journalism,News,Reporting — Patrick Durusau @ 6:41 pm

The Data Journalism Handbook edited by Jonathan Gray, Liliana Bounegru and Lucy Chambers.

From the webpage:

The Data Journalism Handbook is a free, open source reference book for anyone interested in the emerging field of data journalism.

It was born at a 48 hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners – including from the Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, the Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others.

A practical tome, it is available in English, Russian, French, German and Georgian.

A very useful and highly entertaining read.

Enjoy and recommend it to others!

Clojure’s Persistent Data Structures

Filed under: Clojure,Functional Programming,Programming — Patrick Durusau @ 6:23 pm

Clojure’s Persistent Data Structures by Craig Andera.

From the description:

A typical experience with Clojure is, “Come for the concurrency, stay for the data structures.” Clojure’s data structures are persistent, immutable, and performant. In this talk, we’ll discuss what they give you, how they work, and what you can do with them.

Craig Andera is a developer at Cognitect, where he builds large-scale web-based systems, primarily in Clojure. He is also the host of The Cognicast, Cognitect’s podcast. Craig can be found on Twitter @craigandera. The Cognicast is available at http://cognitect.com/podcast.

Best watched before this coming Sunday (Mother’s Day).

I first saw this at: Recommended Viewing : Clojure’s Persistent Data Structures by Charles Ditzel.

May 8, 2014

Patents Aren’t What They Used To Be

Filed under: Patents — Patrick Durusau @ 7:31 pm

US Patent Office Grants ‘Photography Against A White Background’ Patent To Amazon

I can remember when “having” a patent was a mark of real distinction.

Now, not so much.

See the post for the details, but Amazon's patent for photographs against a white background can be violated by you:

How does this breakthrough work in practice? Glad you asked.

1. Turn back lights on.
2. Turn front lights on.
3. Position thing on platform.
4. Take picture.

Now, we’ll note that in all fairness (HAHAHAHA), Amazon filed this application back in the early days of photography, circa 2011. Nearly three years later, that foresight has paid off, and Amazon can now corner the market on taking pictures in front of a white background.

The patent itself, US 8,676,045. Another blog post: You Can Close The Studio, Amazon Patents Photographing On Seamless White

Amazon does a lot of really cool stuff so I am hopeful that:

  1. Amazon will donate US 8,676,045 to the public domain.
  2. Amazon will fire whoever was responsible for this farce. Without consequences for abuse of the patent system by the staff of companies that do have legitimate IP, the patent system will continue to deteriorate.

I knew I should have filed to patent addition last year! 😉

Fifteen ideas about data validation (and peer review)

Filed under: Data Quality,Data Science — Patrick Durusau @ 7:11 pm

Fifteen ideas about data validation (and peer review)

From the post:

Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the scientific literature’s validation mechanisms don’t translate as directly to data as its mechanism for, say, citation.

This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.

A good starting point for discussion of data validation concerns.

Perfect data would be preferred but let’s accept that perfect data is possible only for trivial or edge cases.

If you start off by talking about non-perfect data, it may be easier to see some of the consequences when non-perfect data makes a system fail. What are the consequences of that failure? For the data owner as well as others? Are those consequences acceptable?

Make those decisions up front and document them as part of planning data validation.

R Client for the U.S. Federal Register API

Filed under: Government,Government Data,Open Government,R — Patrick Durusau @ 6:49 pm

R Client for the U.S. Federal Register API by Thomas Leeper.

From the webpage:

This package provides access to the API for the United States Federal Register. The API provides access to all Federal Register contents since 1994, including Executive Orders by Presidents Clinton, Bush, and Obama and all “Public Inspection” Documents made available prior to publication in the Register. The API returns basic details about each entry in the Register and provides URLs for HTML, PDF, and plain text versions of the contents thereof, and the data are fully searchable. The federalregister package provides access to all version 1 API endpoints.

If you are interested in law, policy development, or just general awareness of government activity, this is an important client for you!

More than 30 years ago I had a hard copy subscription to the Federal Register. Even then it was a mind-numbing amount of detail. Today it is even worse.

This API enables any number of business models based upon quick access to current and historical Federal Register data.

Enjoy!
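If R is not your tool, the same API is easy to reach from anything that speaks HTTP. A short Python sketch, assuming the v1 documents endpoint and a simple full-text term search (check the Federal Register API documentation for the current parameter names):

```python
import requests

# Assumed endpoint and parameters; see the Federal Register API docs for specifics.
URL = "https://www.federalregister.gov/api/v1/documents.json"
params = {
    "conditions[term]": "topic maps",  # full-text search term
    "per_page": 20,
    "order": "newest",
}
resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json().get("results", []):
    print(doc.get("publication_date"), doc.get("title"))
    print("   ", doc.get("html_url"))
```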

Large-Scale Graph Computation on Just a PC

Filed under: GraphChi,Graphs — Patrick Durusau @ 4:38 pm

Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense by Aapo Kyrola.

If you are looking for an overview of Kyrola's work, this is the resource for you.

Slide 8: “Benefits of single machine systems Assuming it can handle your big problems…”, currently reads:

  1. Programmer productivity – Global state, debuggers…
  2. Inexpensive to install, administer, less power.
  3. Scalability – Use a cluster of single-machine systems to solve many tasks in parallel. Idea: trade latency for throughput.

I would add:

  1. A single machine forces creation of efficient data structures.

Think of it as using computation resources more effectively as opposed to scaling out to accommodate a problem.

Piercing the Document Barrier

Filed under: DOM4,Navigation,W3C — Patrick Durusau @ 4:13 pm

If you aspire to return more detailed search results than: “See this bundle of documents” to your users, you are talking about piercing the document barrier.

It’s a benefit to be able to search thousands of articles and get the top ten (10) or twenty (20) for some search but assuming an average of twelve (12) pages per article, I’m still left with between one hundred and twenty (120) and two hundred and forty (240) pages of material to read. Beats the hell out of the original thousands or hundreds of thousands of pages, but not be enough.

What if I could search for the latest graph research and the search results opted out of the traditional re-explanation of graphs that wastes space at the first of nearly every graph article? After all, anyone intentionally seeking out a published graph article probably has a lock on that detail. And if they don’t, the paragraphs wasted on explanation aren’t going to save them.

I mention that because the W3C‘s HTML Working Group has invited implementation of W3C DOM4.

From W3C Invites Implementations of W3C DOM4:

DOM defines a platform-neutral model for events and node trees.

I expect to see graph-based implementations out in force, given the recent “discovery” by some people that graphs are “universal.” Having a single node is enough to have a graph under most definitions.

For your amusement, from the Glossary of graph theory:

An edgeless graph or empty graph or null graph is a graph with zero or more vertices, but no edges. The empty graph or null graph may also be the graph with no vertices and no edges. If it is a graph with no edges and any number n of vertices, it may be called the null graph on n vertices. (There is no consistency at all in the literature.)

I don’t find the lack of “consistency” in the literature surprising.

You?

Creating Maps From Drone Imagery

Filed under: Image Processing,Mapping — Patrick Durusau @ 2:31 pm

Creating Maps From Drone Imagery by Bobby Sudekum.

From the post:

Here is an end to end walkthrough showing how to process drone imagery into maps and then share it online, all using imagery we collected on a recent flight with the 3D Robotics team and their Aero drone.

Whether you are pulling data from someone else’s drone or your own, imagery will need post-capture processing.

One way to prevent illegal use of drones would be to require imagery transmissions to include identifying codes and to travel on open channels. The reasoning is that if you were doing something illegal, you would not put your ID on it and transmit on an open channel. I may be wrong about that.

We have littered the ocean, land, space and now it appears we are going to litter the sky with drones.

More data to be sure but what use cases justify degradation of the sky?

Chemistry

Filed under: Cheminformatics — Patrick Durusau @ 2:13 pm

ProfessorDaveatYork

Since I just urged you to read/study philosophy and the humanities, it's only fair that I mention ProfessorDaveatYork and his chemistry videos.

David is an enthusiast, to say the least. Which is understandable considering the following from his background:

We are exploring one of the most exciting frontiers of modern chemistry – the nanoworld. Nanotechnology, the development of systems between 1 and 100 nm in size seems impossible tiny to the average person. However, for chemists, used to manipulating bonds just 0.1 nm long, the nanoworld is a large space, which requires new synthetic strategies.

Nanochemistry – the synthesis and study of nanoscale architectures, is therefore a fundamental part of nanotechnology. Applications of nanotechnology are completely dependent on the new objects which chemists can generate. Our approach uses non-covalent interactions between molecules – ‘supramolecular chemistry’ – in order to allow simple molecular-scale building blocks to spontaneously self-assemble into nanostructures. Self-assembly is a simple and powerful approach to constructing the nanoworld which allows us to generate a wide variety of systems, with applications ranging from nanomaterials to nanomedicine.

Hard to not be excited when you are on the cutting edge!

Chemistry, or more precisely cheminformatics, is no stranger to the name/identifier problems found elsewhere in science, business, government, etc.

Enjoyable lectures that can refresh or build your chemistry basics.

Dave has recently passed 400,000 views on his channel. What say we help him on his way to 500,000? (Enjoyable lectures not being all that common, we should encourage them whenever possible.)

Avoid Philosophy?

Filed under: Humanities,Philosophy,Semantics — Patrick Durusau @ 2:00 pm

Why Neil deGrasse Tyson is a philistine by Damon Linker.

From the post:

Neil deGrasse Tyson may be a gifted popularizer of science, but when it comes to humanistic learning more generally, he is a philistine. Some of us suspected this on the basis of the historically and theologically inept portrayal of Giordano Bruno in the opening episode of Tyson’s reboot of Carl Sagan’s Cosmos.

But now it’s been definitively demonstrated by a recent interview in which Tyson sweepingly dismisses the entire history of philosophy. Actually, he doesn’t just dismiss it. He goes much further — to argue that undergraduates should actively avoid studying philosophy at all. Because, apparently, asking too many questions “can really mess you up.”

Yes, he really did say that. Go ahead, listen for yourself, beginning at 20:19 — and behold the spectacle of an otherwise intelligent man and gifted teacher sounding every bit as anti-intellectual as a corporate middle manager or used-car salesman. He proudly proclaims his irritation with “asking deep questions” that lead to a “pointless delay in your progress” in tackling “this whole big world of unknowns out there.” When a scientist encounters someone inclined to think philosophically, his response should be to say, “I’m moving on, I’m leaving you behind, and you can’t even cross the street because you’re distracted by deep questions you’ve asked of yourself. I don’t have time for that.”

“I don’t have time for that.”

With these words, Tyson shows he’s very much a 21st-century American, living in a perpetual state of irritated impatience and anxious agitation. Don’t waste your time with philosophy! (And, one presumes, literature, history, the arts, or religion.) Only science will get you where you want to go! It gets results! Go for it! Hurry up! Don’t be left behind! Progress awaits!

There are many ways to respond to this indictment. One is to make the case for progress in philosophical knowledge. This would show that Tyson is wrong because he fails to recognize the real advances that happen in the discipline of philosophy over time.

….

I remember thinking the first episode of Tyson’s Cosmos was rather careless with its handling of Bruno and the Enlightenment. But at the time I thought that was due to it being a “popular” presentation and not meant to be precise in every detail.

Damon has an excellent defense of philosophy and for that you should read his post.

I have a more pragmatic reason for recommending both philosophy in particular and the humanities in general to CS majors: the time you lose to “deep questions” is far less than the time you will waste in programming without them.

For example, why have intelligent, even gifted, CS types repeatedly tried to solve the issues of naming by proposing universal naming systems?

You don’t have to be very aware to know that naming systems are like standards. If you don’t like this one, make up another one.

That being the case, what makes anyone think their naming system will displace all others for any significant period of time, considering there has never been a successful one?

Oh, I forgot: if you don't know any philosophy (one place this issue gets discussed) or the humanities in general, you won't be exposed to the long history of language and naming discussions, or to the failures recorded there.

I would urge CS types to read and study both philosophy and the humanities for purely pragmatic reasons. CS pioneers were able to write the first FORTRAN compiler not because they had taken a compiler MOOC but because they had studied mathematics, linguistics, language, philosophy, history, etc.

Are you a designer (CS pioneers were) or are you a mechanic?

PS: If you are seriously interested in naming issues, my first suggestion would be to read The Search for the Perfect Language by Umberto Eco. It’s not all that needs to be read in this area but it is easily accessible.

I first saw this in a tweet by Christopher Phipps.

Functional Programming in the Cloud:…

Filed under: Cloud Computing,Functional Programming,Haskell,OpenShift,Red Hat — Patrick Durusau @ 1:07 pm

Functional Programming in the Cloud: How to Run Haskell on OpenShift by Katie Miller.

From the post:

One of the benefits of Platform as a Service (PaaS) is that it makes it trivial to try out alternative technology stacks. The OpenShift PaaS is a polyglot platform at the heart of a thriving open-source community, the contributions of which make it easy to experiment with applications written in a host of different programming languages. This includes the purely functional Haskell language.

Although it is not one of the Red Hat-supported languages for OpenShift, Haskell web applications run on the platform with the aid of the community-created Haskell cartridge project. This is great news for functional programming (FP) enthusiasts such as myself and those who want to learn more about the paradigm; Haskell is a popular choice for learning FP principles. In this blog post, I will discuss how to create a Haskell web application on OpenShift.

Prerequisites

If you do not have an OpenShift account yet, sign up for OpenShift Online for free. You’ll receive three gears (containers) in which to run applications. At the time of writing, each of these free gears come with 512MB of RAM and 1GB of disk space.

To help you communicate with the OpenShift platform, you should install the RHC client tools on your machine. There are instructions on how to do that for a variety of operating systems at the OpenShift Dev Center. Once the RHC tools are installed, run the command rhc setup on the command line to configure RHC ready for use.

Katie’s post is a great way to get started with OpenShift!

However, it also reminds me of why I dislike Daylight Savings Time. It is getting dark later in the Eastern United States but there are still only twenty-four (24) hours in a day! An extra eight (8) hours a day and the stamina to stay awake for them would be better. 😉

Unlikely to happen so enjoy Katie’s post during the usual twenty-four (24) hour day.

Hello, NSA

Filed under: Cybersecurity,Humor,NSA,Security — Patrick Durusau @ 11:05 am

Hello, NSA

Motherboard has a web app that generates random sentences laced with words of interest to the NSA. I saw this in Researchers find post-Snowden chill stifling our search terms by Lisa Vaas at nakedsecurity.

The about page for the app reads:

Turns out Uncle Sam is more of a peeping Tom than we even thought.

Now we know that the US government keeps our personal phone records, and can in certain cases access our emails, status updates, photos, and other personal information. We’re still not exactly sure how they sift through all this data.

But last year, the Department of Homeland Security released a list of over 370 keywords that served as trip-wires amidst the flow of conversation that pours through social media.

The operation—which is just one of an untold number of government programs keeping tabs on our tabs—flagged a variety of hot terms related to terrorism (dirty bomb), cyber security (Mysql injection), infrastructure (bridge, airport), health (pandemic), places (Mexico), and political dissent (radical), as well as more banal verbiage like ‘pork’ and ‘exercise.’

So let’s play a word game! Use our handy phrase generator to come up with pearls of keyword-loaded Twitter wit and perhaps earn you a new follower in Washington. Tweet it out, email it to a friend, share it around, you know the drill—and remember that the NSA and other government agencies might be reading along. And don’t forget to say hello.

Read more about government surveillance programs:

How to Build a Secret Facebook

The Motherboard Guide to Avoiding the NSA

Privacy’s Public, Government-Sponsored Death

A Majority of Americans Believe NSA Phone Tracking Is Acceptable

‘Going Dark’: What’s So Wrong with the FBI’s Plan to Tap Our Internet?

All the PRISM Data the Tech Giants Have Been Allowed to Disclose So Far

Sorry, NSA, Terrorists Don’t Use Verizon. Or Skype. Or Gmail

Please make Hello, NSA your browser homepage and forward that link to friends as a public service announcement.

Finding subjects is hard enough with “normal” levels of semantic noise. Help validate the $billions being spent scooping and searching the Internet. Turn the semantic noise knob up a bit.

May 7, 2014

Data Manipulation with Pig

Filed under: Data Mining,Pig — Patrick Durusau @ 7:13 pm

Data Manipulation with Pig by Wes Floyd.

A great slide deck on Pig! BTW, there is a transcript of the presentation available just under the slides.

I first saw this at: The essence of Pig by Alex Popescu.

New in Solr 4.8: Document Expiration

Filed under: Search Engines,Solr,Topic Maps — Patrick Durusau @ 7:07 pm

New in Solr 4.8: Document Expiration

From the post:

Lucene & Solr 4.8 were released last week and you can download Solr 4.8 from the Apache mirror network. Today I’d like to introduce you to a small but powerful feature I worked on for 4.8: Document Expiration.

The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:

  • Periodically delete documents from the index based on an expiration field
  • Computing expiration field values for documents from a “time to live” (TTL)

Assuming you are using a topic maps solution that presents topics as merged, this could be an interesting feature to emulate.

After all, if you are listing ticket sale outlets for concerts in a music topic map, good maintenance suggests those occurrences should go away after the concert has occurred.

Or if you need the legacy information for some purpose, at least not have it presented as currently available. Perhaps a change of its occurrence type?

Would you actually delete topics or add an “internal” occurrence so they would not participate in future presentations of merged topics?
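As a sketch of how the concert-ticket example above might look from the indexing side, here is a Python snippet posting a ticket-outlet document to Solr with a time-to-live. It assumes solrconfig.xml wires DocExpirationUpdateProcessorFactory into the update chain and that the TTL is read from a field named "_ttl_"; the core name and field names are illustrative, not defaults you can rely on.

```python
import json
import requests

SOLR = "http://localhost:8983/solr/concerts/update?commit=true"

# Assumed schema: an occurrence-like document for a ticket outlet that should
# stop being presented once the concert is over.
doc = {
    "id": "concert-42-tickets",
    "type": "ticket_outlet",
    "concert_id": "concert-42",
    "outlet_url": "http://tickets.example.com/concert-42",
    "_ttl_": "+30DAYS",  # expiration computed by the update processor
}
resp = requests.post(SOLR, data=json.dumps([doc]),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)
```

Whether expiry should mean deletion, or merely a change of occurrence type so the legacy information survives, is the modeling decision raised above; the update chain cannot make it for you.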

Don’t Create A Data Governance Hairball

Filed under: Data Governance,Data Integration — Patrick Durusau @ 6:54 pm

Don’t Create A Data Governance Hairball by John Schmidt.

From the post:

Are you in one of those organizations that wants one version of the truth so badly that you have five of them? If so, you’re not alone. How does this happen? The same way the integration hairball happened; point solutions developed without a master plan in a culture of management by exception (that is, address opportunities as exceptions and deal with them as quickly as possible without consideration for broader enterprise needs). Developing a master plan to avoid a data governance hairball is a better approach – but there is a right way and a wrong way to do it.

As you probably can guess, I think John does a great job describing the “data governance hairball,” but not quite such high marks on avoiding the data governance hairball.

Not that I prefer some other solution over John's suggestions, but data governance hairballs are an essential characteristic of shared human knowledge. Human knowledge can, within some semantic locality, avoid the data governance hairball, but that is always an accidental property.

An “essential” property is a property a subject must have to be that subject; an “accidental” property is one a subject may or may not have while remaining the same subject. The semantic differences even within domains, to say nothing of those between domains, make it clear that master data governance is only possible within a limited semantic locality.

The essential vs. accidental property distinction is useful in data integration/governance. If we recognize that unbounded human knowledge is always subject to the data governance hairball description, then we can begin to look for John's right level of “granularity.” That is, within a particular corporate context we can decide, as an accidental property, to govern some data quite closely and not attempt to govern other data at all.

The difference between the data we govern and the data we don't? The ROI that can be derived from the data we govern.

If data has no ROI and doesn’t enable ROI from other data, why bother?

Are you governing data with no established ROI?

Go ahead, compete with Google Search

Filed under: Marketing,Search Engines — Patrick Durusau @ 3:53 pm

Go ahead, compete with Google Search: Why it's not that crazy to go build a search engine, by Alexis Smirnov.

Alexis doesn’t sound very promising at the start:

Google’s mission is to organize the world’s information and make it universally accessible and useful. Google Search has become a shining example of progress towards accomplishing this mission.

Google Search is the best general-purpose search engine and it gets better all the time. Over the years it killed off most of it’s competitors.

But after a very interesting and useful review of non-general public search engines, he concludes:

To sum up, the best way to compete against Google is not to build another general-purpose search engine. It is to build another vertical semantic search engine. The better engines understand the specific domain, the better chance they have to be better than Google.

See the post for the details and get thee to the software forge to build a specialized search engine!

PS: We all realize the artwork that accompanies the post isn’t an accurate depiction of Google. Too good looking. 😉

PPS: I am particularly aware of the need for date/version ordered searching for software issues. Just today I was searching for an error that turned out to be a bad symbolic link, but the results from one search engine included material from 2 or 3 years ago. Not all that helpful when you are running the latest release.

May 6, 2014

The Strange Naming Conventions of Astronomy

Filed under: Astroinformatics,Names — Patrick Durusau @ 7:31 pm

The Strange Naming Conventions of Astronomy by Ben Montet.

From the post:

If you’ve spent time around the astronomical literature, you’ve probably heard at least one term that made you wonder “why did astronomers do that?” G-type stars, early/late type galaxies, magnitudes, population I/II stars, sodium “D” lines, and the various types of supernovae are all members of the large, proud family of astronomy terms that are seemingly backwards, unrelated to the underlying physics, or annoyingly complicated. While it may seem surprising now, the origins of these terms were logical at the time of their creation. Today, let’s look at the history of a couple of these terms, to figure out why astronomers did that.

Ben covers a couple of odd naming cases but has left thousands of others as an exercise for the reader!

These are names that have been used in the astronomical literature for centuries.

The richness of names isn't going away so long as we keep records of our past, whatever style of names, such as “cool URIs,” may come into or go out of fashion.

Cyberterrorists vs. Squirrels

Filed under: Communication,Cybersecurity — Patrick Durusau @ 7:18 pm

[Chart: power outages]

From a tweet by Eli Dourado with the note: “Everything you need to know about cyberterrorism in one chart.”

It would be helpful to have a chart like this one for several topics. Terrorism, cyberterrorism, crime, etc.

This is one aspect of what is called “risk assessment.”

Risk assessment would say that a wandering planet crashing into the Earth would be a very bad thing, that the odds of it happening are very low, and that the likelihood of an effective response is nil.

Accurate risk assessments could reduce the waste of resources on improbable incidents or those for which there is no effective response.

The evolution of Ordnance Survey mapping

Filed under: Mapping,Maps — Patrick Durusau @ 6:58 pm

Evolution of Ordnance Survey mapping

From The evolution of Ordnance Survey mapping:

If you’re a lover of old maps, you may be aware of the changes that have taken place on Ordnance Survey maps over the years. Changes to colour, styling, the depiction of roads and vegetation for example. As you can imagine, visitors to our Southampton head office often want to visit our Cartography teams and see the work they’re doing now – and compare this to how things used to be done.

The Cartography team put their heads together and came up with a display to show visitors the past present and future roles of cartography. One aspect of the display was produced by Cartographer Alicja Karpinska, making use of her photography and digital image manipulation skills, to complete an image showing the evolution of Ordnance Survey mapping.

See the post for more details.

Assuming that topic maps recur in particular subject areas, a similar evolution of mapping is a distinct possibility at some point in the future.

Indexers map properties to subjects every day, so topic maps can't claim to be the first to do so.

However, topic maps are the first technology that I am aware of that makes that mapping explicit. That is a rather important difference and is crucial to supporting an “evolution of mapping” for topic maps at some point in the future.

May 5, 2014

All of Bach

Filed under: Music,Music Retrieval — Patrick Durusau @ 3:16 pm

All of Bach

From the webpage:

Every week, you will find a new recording here of one Johann Sebastian Bach’s 1080 works, performed by The Netherlands Bach Society and many guest musicians.

Six (6) works posted, only another one thousand and seventy-four (1074) to go. 😉

Music is an area with well-known connections to many other domains: people, places, history, literature, religion and more. Not that other domains lack such connections, but music seems particularly rich in them. Those connections also include performers, places of performance, reactions to performances, reviews of performances, to say nothing of the instruments and the music itself.

A consequence of this tapestry of connections is that annotating music can draw from almost all known forms of recorded knowledge from an unlimited number of domains and perspectives.

Rather than the clamor of arbitrary links one after the other about a performance or its music, a topic map can support multiple, coherent views of any particular work. Perhaps ranging from the most recent review to the oldest known review of a work. Or exploding one review into historical context. Or exploring the richness of the composition proper.

The advantage of a topic map being that you don’t have to favor one view to the exclusion of another.

…immediate metrical meaning

Filed under: Metric Spaces,Semantics — Patrick Durusau @ 2:49 pm

Topology Fact tweeted today:

‘It’s not so easy to free oneself from the idea that coordinates must have an immediate metrical meaning.’ — Albert Einstein

In searching for that quote I found:

The simple fact is that in general relativity, coordinates are essentially arbitrary systems of markers chosen to distinguish one event from another. This gives us great freedom in how we define coordinates…. The relationship between the coordinate differences separating events and the corresponding intervals of time or distance that would be measured by a specified observer must be worked out using the metric of the spacetime. (Relativity, Gravitation and Cosmology by Robert J. A. Lambourne, page 155)

Let’s re-write the first sentence by Lambourne to read:

The simple fact is that in semantics, terms are essentially arbitrary systems of markers chosen to distinguish one semantic event from another.

Just to make clear that sets of terms have no external metric of semantic distance or closeness that separates them.

And re-write the second sentence to read:

The relationship between the term separating semantics and the corresponding semantic intervals would be measured by a specified observer.

I have omitted some words and added others to emphasize that “semantic intervals” have no metric other than as assigned and observed by some specified observer.

True, the original quote goes on to say: “…using the metric of the spacetime.” But spacetime has a generally accepted metric that has proven itself both accurate and useful since the early 20th century. So far as I know, despite contentions to the contrary, there is no similar metric for semantics.

In particular there is no general semantic metric that obtains across all observers.

Something to bear in mind when semantic distances are being calculated with great “precision” between terms. Most pocket calculators can be fairly precise. But being precise isn’t the same thing as being correct.
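For reference, and to make “metric” precise: a function $d$ qualifies as a metric only if, for all $x$, $y$, $z$, it satisfies

\[
d(x,y) \ge 0, \qquad d(x,y) = 0 \iff x = y, \qquad d(x,y) = d(y,x), \qquad d(x,z) \le d(x,y) + d(y,z).
\]

Proposed measures of semantic distance rarely come with an argument that any particular observer's judgments satisfy these conditions, let alone that every observer's do.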

Wandora Bug Fix

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 10:46 am

From the change log:

2014-05-05

  • Fixes bugs in selection extractors of embedded WWW browser. Wandora used to calculate selection boundaries wrong and wasn’t able to perform selection extractions in the embedded WWW browser.
  • The directory structure extractor, simple HTML list and HTML table extractors now generate more clearer topic maps fragments and links the extraction to the Wandora class. This change hopefully increases the usability of these extractors.

Download the latest Wandora build 2014-05-05.

May 4, 2014

6X Performance with Impala

Filed under: HDFS,Impala — Patrick Durusau @ 7:18 pm

In-memory Caching in HDFS: Lower latency, same great taste by Andrew Wang.

From the post:

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time. Slides and video of our presentation are available online.

Finding data the person who signs the checks will be interested in seeing with 6X performance is left as an exercise for the reader. 😉

