Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

December 28, 2013

…Bad Arguments

Filed under: Argumentation,Logic,Reasoning — Patrick Durusau @ 3:27 pm

An Illustrated Book of Bad Arguments by Ali Almossawi.

From “Who is this book for?”

This book is aimed at newcomers to the field of logical reasoning, particularly those who, to borrow a phrase from Pascal, are so made that they understand best through visuals. I have selected a small set of common errors in reasoning and visualized them using memorable illustrations that are supplemented with lots of examples. The hope is that the reader will learn from these pages some of the most common pitfalls in arguments and be able to identify and avoid them in practice.

A delightfully written and illustrated book on bad arguments.

I first saw this at “Bad Arguments” (a book by Ali Almossawi) by Deborah Mayo.

Intel® XDK

Filed under: Interface Research/Design,Javascript — Patrick Durusau @ 12:00 pm

Intel® XDK

From the webpage:

Intel® XDK, a no cost, integrated and front-to-back HTML5 app development environment for true cross-platform apps for multiple app stores, and form factor devices. Features in the first release included:

  • Editor, Device Emulator and Debugger
  • App for On-device Testing
  • Javascript UI library optimized for mobile
  • APIs for game developers with accelerated canvas
  • Prototype GUI quick-start wizard
  • Installs on Microsoft Windows*, Apple OS X*, runs in Google Chrome*
  • Intel cloud-based build system for most app stores
  • No need to download Native Platform SDKs
  • Tool to convert iOS* apps to HTML5

Numerous other resources, including forums, are available from this page.

If you want to deliver topic map based content to mobile devices, this is a must stop.

I first saw this in Nat Torkington’s Four short links: 27 December 2013.

Superconductor

Filed under: Graphics,Visualization — Patrick Durusau @ 11:52 am

Superconductor

From the about page:

Superconductor is a framework for creating interactive big data visualizations in the web browser. It contains two components: a JavaScript library for running visualizations in your browser, and a compiler which generates the high-performance visualization code from our simple domain specific language for describing visualizations.

Superconductor was created by Leo Meyerovich and Matthew Torok at the University of California, Berkeley’s Parallel Computing Laboratory. The ideas behind it evolved out of our research in the parallel browser project. Over the last two years, we’ve worked to apply the ideas behind that research to the task of big data visualization, and to create a polished, easy-to-use framework based around that work. Superconductor is the result.

The demos are working with 100,000 data points, interactively. Very impressive.

Available as a developer preview with the following requirements:

The developer preview of Superconductor currently only supports the following platform:

  • An Apple laptop/desktop computer
  • Mac OS X 10.8 (‘Mountain Lion’) or newer
  • An NVIDIA (preferred) or ATI graphics chip available in your computer

Support for more platforms is a high priority, and we’re working hard to add that to Superconductor.

Suggestions of a commercially available OS X 10.8 VM for Ubuntu? 😉

I first saw this in Nat Torkington’s Four short links: 27 December 2013.

Mining the Web to Predict Future Events

Filed under: Machine Learning,News,Prediction,Predictive Analytics — Patrick Durusau @ 11:30 am

Mining the Web to Predict Future Events by Kira Radinsky and Eric Horvitz.

Abstract:

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

The paper starts off well enough:

Mark Twain famously said that “the past does not repeat itself, but it rhymes.” In the spirit of this reflection, we develop and test methods for leveraging large-scale digital histories captured from 22 years of news reports from the New York Times (NYT) archive to make real-time predictions about the likelihoods of future human and natural events of interest. We describe how we can learn to predict the future by generalizing sets of specific transitions in sequences of reported news events, extracted from a news archive spanning the years 1986–2008. In addition to the news corpora, we leverage data from freely available Web resources, including Wikipedia, FreeBase, OpenCyc, and GeoNames, via the LinkedData platform [6]. The goal is to build predictive models that generalize from specific sets of sequences of events to provide likelihoods of future outcomes, based on patterns of evidence observed in near-term newsfeeds. We propose the methods as a means of generating actionable forecasts in advance of the occurrence of target events in the world.

But when it gets down to actual predictions, the experiment predicts:

  • Cholera following flooding in Bangladesh.
  • Riots following police shootings in immigrant/poor neighborhoods.

Both are generally true but I don’t need 22 years’ worth of New York Times (NYT) archives to make those predictions.

Test offers of predictive advice by asking for specific predictions relevant to your enterprise. Also ask long-time staff to make their predictions. Compare the predictions.

Unless the automated solution is significantly better, reward the staff and drive on.
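If you want to make that comparison concrete, a minimal sketch will do. The event names, predictions, and outcomes below are hypothetical; this is not the paper’s evaluation protocol:

```python
# Compare automated forecasts against staff forecasts on the same
# set of target events. All data here is invented for illustration.

def precision(predictions, outcomes):
    """Fraction of predicted events that actually occurred."""
    if not predictions:
        return 0.0
    return sum(outcomes.get(event, False) for event in predictions) / len(predictions)

outcomes = {"cholera_outbreak": True, "riots": True, "coup": False}

machine_predictions = ["cholera_outbreak", "riots", "coup"]
staff_predictions = ["cholera_outbreak", "riots"]

print(f"machine: {precision(machine_predictions, outcomes):.2f}")  # 0.67
print(f"staff:   {precision(staff_predictions, outcomes):.2f}")    # 1.00
```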

I first saw this in Nat Torkington’s Four short links: 26 December 2013.

December 27, 2013

The Lens

Filed under: Patents,Semantics,Topic Maps — Patrick Durusau @ 5:58 pm

The Lens

From the about page:

Welcome to The Lens, an open global cyberinfrastructure built to make the innovation system more efficient, fair, transparent and inclusive. The Lens is an extension of work started by Cambia in 1999 to render the global patent system more transparent, called the Patent Lens. The Lens is a greatly expanded and updated version of the Patent Lens with vastly more data and greater analytical capabilities. Our goal is to enable more people to make better decisions, informed by evidence and inspired by imagination.

The Lens already hosts a number of powerful tools for analysis and exploration of the patent literature, from integrated graphical representation of search results to advanced bioinformatics tools. But this is only just the beginning and we have a lot more planned! See what we’ve done and what we plan to do soon on our timeline below:

The Lens currently covers 80 million patents in 100 different jurisdictions.

When you create an account, the following appears in your workspace:

Welcome to the Lens! The Lens is a tool for innovation cartography, currently featuring over 78 million patent documents – many of them full-text – from nearly 100 different jurisdictions. The Lens also features hyperlinks to the scientific literature cited in patent documents – over 5 million to date.

But more than a patent search tool, the Lens has been designed to make the patent system navigable, so that non-patent professionals can access the knowledge contained in the global patent literature. Properly mapped out, the global patent system has the potential to accelerate the pace of invention, to generate new partnerships, and to make a vast wealth of scientific and technical knowledge available for free.

The Lens is currently in beta version, with future versions featuring expanded access to both patent and scientific literature collections, as well as improved search and analytic capabilities.

As you already know, patents have extremely rich semantics and mapping of those semantics could be very profitable.

If you saw the post: Secure Cloud Computing – Very Secure, you will know that patent searches on “homomorphic encryption” are about to become very popular.

Are you ready to bundle and ship patent research?

Galaxy:…

Filed under: Bioinformatics,Biomedical,Biostatistics — Patrick Durusau @ 5:41 pm

Galaxy: Data Intensive Biology For Everyone

From the website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

From the Galaxy wiki:

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

  • Accessible: Users without programming experience can easily specify parameters and run tools and workflows.
  • Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.
  • Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.

This is the Galaxy Community Wiki. It describes all things Galaxy.

Whether you are a home bio-hacker or an IT person looking to understand computational biology, Galaxy may be a good fit for you.

You can try out the public server before going to the trouble of a local install. (A local install is the option if you are paranoid about your bits going over the network. 😉)

Topological maps or topographic maps?

Filed under: Maps,Topography,Topology — Patrick Durusau @ 5:06 pm

Topological maps or topographic maps? by Dave Richeson.

From the post:

While surfing the web the other day I read an article in which the author refers to a “topological map.” I think it is safe to say that he meant to write “topographic map.” This is an error I’ve seen many times before.

A topographic map is a map of a region that shows changes in elevation, usually with contour lines indicating different fixed elevations. This is a map that you would take on a hike.

A topological map is a continuous function between two topological spaces—not the same thing as a topographic map at all!

I thought for sure that there was no cartographic meaning for topological map. It turns out, however, that there is.

A topological map is a map that is only concerned with relative locations of features on the map, not on exact locations. A famous example is the graph that we use to solve the Bridges of Königsberg problem.

A useful reminder.

Although I would use even topological maps of concepts, establishing relative locations, with caution. Concepts have no universal metric and therefore placement on a topological map is largely arbitrary.
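For the Königsberg example above, a quick sketch with networkx (assuming it is installed) confirms why no walk crosses every bridge exactly once:

```python
import networkx as nx

# The four land masses of Königsberg and its seven bridges,
# modeled as a multigraph (parallel edges allowed).
G = nx.MultiGraph()
G.add_edges_from([
    ("north_bank", "kneiphof"), ("north_bank", "kneiphof"),
    ("south_bank", "kneiphof"), ("south_bank", "kneiphof"),
    ("north_bank", "lomse"),
    ("south_bank", "lomse"),
    ("kneiphof", "lomse"),
])

print(dict(G.degree()))   # all four land masses have odd degree
print(nx.is_eulerian(G))  # False: no closed walk uses every bridge once
                          # (with four odd-degree vertices, no open walk either)
```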

“See Something, Say Something”…

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:51 pm

“See Something, Say Something” Unfamiliar to Most Americans by Steve Ander and Art Swift.

From the post:

Less than half of Americans say they have heard of the “If You See Something, Say Something” slogan, part of a government campaign designed to raise public awareness of signs of terrorism and terrorism-related crime. An even smaller percentage correctly identify it as targeting terrorism and crime.

The U.S. Department of Homeland Security (DHS) licensed the use of the slogan in 2010. DHS has awarded millions of dollars in federal grants through its collaborations with dozens of states, cities, sports teams, and other organizations to spread the campaign across the U.S. Yet, a Dec. 17-18 Gallup poll finds that a majority of Americans (55%) have never heard of it.

Terrorism can occur anywhere and anytime, and any person could see the signs before disaster strikes. For example, two street vendors reported the 2010 Times Square bomb scare to the New York Police Department. To prevent future attacks, high levels of situational awareness and clear action plans could greatly benefit everyone.

It really was an attempted terrorist attack in Times Square, but what the vendor saw, according to the New York Times report:

Moments later, a T-shirt vendor on the sidewalk saw smoke coming out of vents near the back seat of the S.U.V., which was now parked awkwardly at the curb with its engine running and its hazard lights on. The vendor called to a mounted police officer, the mayor said, who smelled gunpowder when he approached the S.U.V. and called for assistance. The police began evacuating Times Square, starting with businesses along Seventh Avenue, including a Foot Locker store and a McDonald’s.

I don’t think people need a slogan to make them call in a car fire.

Do you?

For all the $millions spent, we are not (yet) a nation of informants.

I first saw this in Full Text Reports.

Imprecise machines mess with history

Filed under: Precision,Search Engines — Patrick Durusau @ 4:33 pm

Imprecise machines mess with history by Kaiser Fung.

From the post:

The mass media continues to gloss over the imprecision of machines/algorithms.

Here is another example I came across the other day. In conversation, the name Martin Van Buren popped up. I was curious about this eighth President of the United States.

What caught my eye in the following Google search result (right panel) is his height:

See Kaiser’s post for an amusing error on U.S. Presidents, one no doubt echoed in U.S. classrooms.

Kaiser asks how fact-checking machines might be made possible.

I’m not sure we need fact-checking machines as much as we need several canonical sources of information on the WWW.

At one time, there were several world almanacs in print (may still be) and for most routine information, those were authoritative sources.

I don’t know that search engines need fact checkers so much as they need to be less promiscuous, at least in terms of the content that they repeat as fact.

There is a difference between “facts” indexed from the New York Times and “facts” from some local historical society.

The source of data was important before the WWW and it continues to be important today.

Naming Software?

Filed under: Names,Software — Patrick Durusau @ 4:20 pm

When you are naming software, please do not use UPPERCASE letters to distinguish your software from an existing name.

Why?

Because the income-generating imitations of search engines regularize case, even if the terms are double quoted.

Thus, if I search for TWITter*, the first hit (drum roll) will be: “twitter.com”

Which, if I have gone to the trouble of double-quoting the text, very likely isn’t what I am looking for.

Choose what you think is a good name for your software but if you want people to find it, don’t be clever with case as though it makes a difference.

*TWITter: I don’t know if this is the name of a real project or not. If it is, my apologies.

Secure Cloud Computing – Very Secure

Filed under: Cloud Computing,Cybersecurity,Encryption,Security — Patrick Durusau @ 3:59 pm

Daunting Mathematical Puzzle Solved, Enables Unlimited Analysis of Encrypted Data

From the post:

IBM inventors have received a patent for a breakthrough data encryption technique that is expected to further data privacy and strengthen cloud computing security.

The patented breakthrough, called “fully homomorphic encryption,” could enable deep and unrestricted analysis of encrypted information — intentionally scrambled data — without surrendering confidentiality. IBM’s solution has the potential to advance cloud computing privacy and security by enabling vendors to perform computations on client data, such as analyzing sales patterns, without exposing or revealing the original data.

IBM’s homomorphic encryption technique solves a daunting mathematical puzzle that confounded scientists since the invention of public-key encryption over 30 years ago.

Invented by IBM cryptography Researcher Craig Gentry, fully homomorphic encryption uses a mathematical object known as an “ideal lattice” that allows people to interact with encrypted data in ways previously considered impossible. The breakthrough facilitates analysis of confidential encrypted data without allowing the user to see the private data, yet it will reveal the same detailed results as if the original data was completely visible.

IBM received U.S. Patent #8,565,435: Efficient implementation of fully homomorphic encryption for the invention, which is expected to help cloud computing clients to make more informed business decisions, without compromising privacy and security.

If that sounds a bit dull, consider this prose from the IBM Homomorphic Encryption page:

What if you want to query a search engine, but don’t want to tell the search engine what you are looking for? You might consider encrypting your query, but if you use an ordinary encryption scheme, the search engine will not be able to manipulate your ciphertexts to construct a meaningful response. What you would like is a cryptographic equivalent of a photograph developer’s “dark room”, where the search engine can process your query intelligently without ever seeing it.

Or, what if you want to store your data on the internet, so that you can access it at your convenience? You want your data to remain private, even from the server that is storing them; so, you store your data in encrypt form. But you would also like to be able to access your data intelligently — e.g., you would like the server to be able to return exactly those files containing the word `homomorphic’ within five words of `encryption’. Again, you would like the server to be able to “process” your data while it remains encrypted.

A “fully homomorphic” encryption scheme creates exactly this cryptographic dark room. Using it, anyone can manipulate ciphertexts that encrypt data under some public key pk to construct a ciphertext that encrypts *any desired function* of that data under pk. Such a scheme is useful in the settings above (and many others).

The key sentence is:

“Using it, anyone can manipulate ciphertexts that encrypt data under some public key pk to construct a ciphertext that encrypts *any desired function* of that data under pk.”
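IBM’s scheme is fully homomorphic; for a feel of the homomorphic property itself, here is a minimal sketch using the python-paillier library (`phe`). Note that Paillier is only *additively* homomorphic, not Gentry’s lattice-based construction:

```python
from phe import paillier  # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypt two values under the public key pk.
a = public_key.encrypt(17)
b = public_key.encrypt(25)

# Anyone holding only the public key can compute on the ciphertexts.
total = a + b    # ciphertext encrypting 17 + 25
scaled = a * 3   # ciphertext encrypting 17 * 3

# Only the private key holder sees the results.
print(private_key.decrypt(total))   # 42
print(private_key.decrypt(scaled))  # 51
```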

Wikipedia has a number of references under: Homomorphic encryption.

You may also be interested in: A Fully Homomorphic Encryption Scheme (Craig Gentry’s PhD thesis).

One of the more obvious use cases of homomorphic encryption with topic maps is the encryption of topic maps as deliverables.

Purchasers could have access to the results of merging but not the grist that was ground to produce the merging.

The antics of the NSA, 2013’s poster boy for better digital security (subversion of security standards and software vendors, outright theft, and perversion of governments), will bring other use cases to mind.

December 26, 2013

Topotime gallery & sandbox

Filed under: D3,Graphics,JSON,Time,Time Series,Timelines,Visualization — Patrick Durusau @ 8:31 pm

Topotime gallery & sandbox

From the website:

A pragmatic JSON data format, D3 timeline layout, and functions for representing and computing over complex temporal phenomena. It is under active development by its instigators, Elijah Meeks (emeeks) and Karl Grossner (kgeographer), who welcome forks, comments, suggestions, and reasonably polite brickbats.

Topotime currently permits the representation of:

  • Singular, multipart, cyclical, and duration-defined timespans in periods (tSpan in Period). A Period can be any discrete temporal thing, e.g. an historical period, an event, or a lifespan (of a person, group, country).
  • The tSpan elements start (s), latest start (ls), earliest end (ee), end (e) can be ISO-8601 (YYYY-MM-DD, YYYY-MM or YYYY), or pointers to other tSpans or their individual elements. For example, >23.s stands for ‘after the start of Period 23 in this collection.’
  • Uncertain temporal extents; operators for tSpan elements include: before (<), after (>), about (~), and equals (=).
  • Further articulated start and end ranges in sls and eee elements, respectively.
  • An estimated timespan when no tSpan is defined
  • Relations between events. So far, part-of, and participates-in. Further relations including has-location are in development.

Topotime currently permits the computation of:

  • Intersections (overlap) between a query timespan and a collection of Periods, answering questions like “what periods overlapped with the timespan [-433, -344] (Plato’s lifespan possibilities)?” with an ordered list.

To learn more, check out these and other pages in the Wiki and the Topotime web page.
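Based on the element names above, a Period with an uncertain timespan might be encoded something like this. This is a sketch built from the description, not an official Topotime sample; the “~” is the “about” operator, and the `id` and `label` key names are my guesses (check the wiki for the canonical shape):

```json
{
  "id": "p01",
  "label": "Plato's lifespan",
  "tSpan": {
    "s": "~-433",
    "e": "~-344"
  }
}
```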

I am currently reading A Song of Ice and Fire (first volume, A Game of Thrones) and the uncertain temporal extents of Topotime may be useful for modeling some aspects of the narrative.

What will be more difficult to model will be facts known to some parties but not to others, at any point in the narrative.

Unlike graph models where every vertex is connected to every other vertex.

As I type that, I wonder: could the edge connecting a vertex (representing a person) to some fact or event (another vertex) have a property that represents the time in the novel’s narrative when the person in question learns that fact or event?

I need to plot out knowledge of a lineage. If you know the novel you can guess which one. 😉
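That edge-property idea is easy to prototype; a minimal sketch with networkx (the chapter numbers are invented for illustration):

```python
import networkx as nx

G = nx.DiGraph()

# An edge from a character to a fact, stamped with the point in the
# narrative at which that character learns the fact.
G.add_edge("Ned", "jon_lineage", knows_from_chapter=4)
G.add_edge("Bran", "jon_lineage", knows_from_chapter=66)

# Who knows the fact as of chapter 30?
knowers = [person for person, fact, data in G.edges(data=True)
           if fact == "jon_lineage" and data["knows_from_chapter"] <= 30]
print(knowers)  # ['Ned']
```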

Legivoc – connecting laws in a changing world

Filed under: EU,Law,Semantics — Patrick Durusau @ 8:02 pm

Legivoc – connecting laws in a changing world by Hughes-Jehan Vibert, Pierre Jouvelot, Benoît Pin.

Abstract:

On the Internet, legal information is a sum of national laws. Even in a changing world, law is culturally specific (nation-specific most of the time) and legal concepts only become meaningful when put in the context of a particular legal system. Legivoc aims to be a semantic interface between the subject of law of a State and the other spaces of legal information that it will be led to use. This project will consist of setting up a server of multilingual legal vocabularies from the European Union Member States legal systems, which will be freely available, for other uses via an application programming interface (API).

And I thought linking all legal data together was ambitious!

Were the EU composed solely of civil law jurisdictions, I would not have bet against the success of the project, and it could have some useful results.

Once you add in common law jurisdictions like the United Kingdom, the project may still have some useful results, but there isn’t going to be a mapping across all the languages.

Part of the difficulty will be language but part of it will be at the most basic assumptions of both systems.

In civil law, the drafters of legal codes attempt to systematically set out a set of principles that take each other into account and represent a blueprint for an ordered society.

Common law, on the other hand, has at its core court decisions that determine the results between two parties. And those decisions can be relied upon by other parties.

Between civil and common law jurisdictions, some laws/concepts may be more mappable than others. Modern labor law for example, may be new enough for semantic accretions to not prevent a successful mapping.

Older laws, property and inheritance laws, for example, are usually the most unique for any jurisdiction. Those are likely to prove impossible to map or reconcile.

Still, it will be an interesting project, particularly if they disclose the basis for any possible mapping, as opposed to simply declaring a mapping.

Both would be useful, but the former is robust in the face of changing law while the latter is brittle.

The Case for Linking World Law Data

Filed under: Law,Linked Data — Patrick Durusau @ 4:40 pm

The Case for Linking World Law Data by Sergio Puig and Enric G. Torrents.

Abstract:

The present paper advocates for the creation of a federated, hybrid database in the cloud, integrating law data from all available public sources in one single open access system – adding, in the process, relevant meta-data to the indexed documents, including the identification of social and semantic entities and the relationships between them, using linked open data techniques and standards such as RDF. Examples of potential benefits and applications of this approach are also provided, including, among others, experiences from our previous research, in which data integration, graph databases and social and semantic networks analysis were used to identify power relations, litigation dynamics and cross-references patterns both intra and inter-institutionally, covering most of the World international economic courts.

From the conclusion:

We invite any individual and organization to join in and participate in this open endeavor, to shape together this project, Neocodex, aspiring to replicate the impact that Justinian’s Corpus Juris Civilis, the original Codex, had in the legal systems of the Early Middle Ages.

Yes, well, I can’t say the authors lack for ambition. 😉

As you know, the Corpus Juris Civilis has heavily influenced the majority of legal jurisdictions today (the civil law tradition).

Do be mindful that the OASIS Legal Citation Markup (LegalCiteM) TC is having its organizational meeting on 12th February 2014, in case you are interested in yet another legal citation effort.

Why anyone thinks we need another legal citation system, one that leaves previous ones on the cutting room floor, is beyond me.

Yes, a new legal citation system might be non-proprietary, royalty-free, web-based, etc., but without picking up current citation practices, it will also be dead on arrival (DOA).

Dandelion’s New Bloom:…

Filed under: Knowledge Graph,Linked Data — Patrick Durusau @ 4:18 pm

Dandelion’s New Bloom: A Family Of Semantic Text Analysis APIs by Jennifer Zaino.

From the post:

Dandelion, the service from SpazioDati whose goal is to deliver linked and enriched data for apps, has just recently introduced a new suite of products related to semantic text analysis.

Its dataTXT family of semantic text analysis APIs includes dataTXT-NEX, a named entity recognition API that links entities in the input sentence with Wikipedia and DBpedia and, in turn, with the Linked Open Data cloud and dataTXT-SIM, an experimental semantic similarity API that computes the semantic distance between two short sentences. TXT-CL (now in beta) is a categorization service that classifies short sentences into user-defined categories, says SpazioDati CEO Michele Barbera.

“The advantage of the dataTXT family compared to existing text analysis’ tools is that dataTXT relies neither on machine learning nor NLP techniques,” says Barbera. “Rather it relies entirely on the topology of our underlying knowledge graph to analyze the text.” Dandelion’s knowledge graph merges together several Open Community Data sources (such as DBpedia) and private data collected and curated by SpazioDati. It’s still in private beta and not yet publicly accessible, though plans are to gradually open up portions of the graph in the future via the service’s upcoming Datagem APIs, “so that developers will be able to access the same underlying structured data by linking their own content with dataTXT APIs or by directly querying the graph with the Datagem APIs; both of them will return the same resource identifiers,” Barbera says. (See the Semantic Web Blog’s initial coverage of Dandelion here, including additional discussion of its knowledge graph.)

The line, “…dataTXT relies neither on machine learning nor NLP techniques,…[r]ather it relies entirely on the topology of our underlying knowledge graph to analyze the text,” caught my eye.

In private beta now but I am interested in how well it works against data in the wild.
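When the APIs do open up, I expect the calling pattern to be plain REST. A purely hypothetical sketch follows; the endpoint, parameter, and response field names are my guesses, not SpazioDati’s documented API:

```python
import requests

# Hypothetical dataTXT-NEX-style call: extract named entities from a
# short text and link them to Wikipedia/DBpedia. All names invented.
resp = requests.get(
    "https://api.example.com/datatxt/nex/v1",
    params={"text": "Mozart was born in Salzburg.", "app_key": "YOUR_KEY"},
)
for annotation in resp.json().get("annotations", []):
    print(annotation["spot"], "->", annotation["uri"])
```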

The Open-Source Data Science Masters – Curriculum

Filed under: Data Science — Patrick Durusau @ 4:00 pm

The Open-Source Data Science Masters – Curriculum by Clare Corthell.

An interesting mixture of online courses, books, software tools, etc.

Fully mastering all of the material mentioned would probably equal or exceed an MS in Data Science.

Probably.

I say “probably” because data sets, algorithms, processing models, and the like all have built-in assumptions that impact the results.

In a masters program worthy of the name, the assumptions of common methods of data analysis would be taught, alongside how to recognize/discover assumptions in data and/or methodologies.

In lieu of a formal course of that nature, I suggest How to Lie with Statistics by Darrell Huff and How to Lie with Maps by Mark Monmonier.

Data Mining is more general than either of those two works so a “How to Lie with Data Mining” would not be amiss.

Or even a “Data Mining Lies Yearbook (year)” that annotates stories, press releases, articles, presentations with their questionable assumptions and/or choices.

Bearing in mind that incompetence is a far more common explanation of lies than malice.

December 25, 2013

Christmas tree with three.js

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 8:37 pm

Christmas tree with three.js

From the webpage:

Today’s article refers to the Christmas and new year in the most direct manner. I prepared a remarkable and relevant demonstration of possibilities of three.js library in the form of an interactive Christmas card. This postcard has everything we need – the Christmas tree with toys, the star in the top, snow, snowflakes in the air – all to raise new year spirit of Christmas time. In this tutorial I will show you how to work with 3D scene, fog, cameras, textures, materials, basic objects (meshes), ground, lights, particles and so on.

Late for this year but a great demonstration of the power of visualization in a web browser.

Enjoy!

Duplicate News Story Detection Revisited

Filed under: Deduplication,Duplicates,News,Reporting — Patrick Durusau @ 5:34 pm

Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse.

Abstract:

In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the term frequency within the document of terms in that window and inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.

Detecting duplicates or near-duplicates of subjects (such as news stories) is part and parcel of a topic maps toolkit.
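The core mechanic, shingling with weighted similarity, is easy to sketch. Here is a toy version (IDF-weighted Jaccard over word 3-grams), a simplification of the paper’s sliding-window weighting, not the authors’ code:

```python
import math
from collections import Counter

def shingles(text, k=3):
    """Word k-grams ('shingles') from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def idf_weights(docs, k=3):
    """Weight each shingle by (log) inverse document frequency."""
    df = Counter()
    for doc in docs:
        df.update(shingles(doc, k))
    n = len(docs)
    return {s: math.log(n / df[s]) + 1.0 for s in df}

def weighted_jaccard(a, b, weights):
    sa, sb = shingles(a), shingles(b)
    inter = sum(weights.get(s, 1.0) for s in sa & sb)
    union = sum(weights.get(s, 1.0) for s in sa | sb)
    return inter / union if union else 0.0

corpus = [
    "storm batters the coast as residents flee inland",
    "storm batters the coast as residents flee to shelters",
    "parliament passes the budget after a long debate",
]
weights = idf_weights(corpus)
print(weighted_jaccard(corpus[0], corpus[1], weights))  # ~0.53: near-duplicates
print(weighted_jaccard(corpus[0], corpus[2], weights))  # 0.0: unrelated
```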

What I found curious about this paper was the definition of “content” to mean the news story and not online comments as well.

That’s a rather limited view of near-duplicate content. And it has a pernicious impact.

If a story quotes a lead paragraph or two from a New York Times story, comments may be made at the “near-duplicate” site, not the New York Times.

How much of a problem is that? When was the last time you saw a comment that was not in English in the New York Times?

Answer: Very unlikely you have ever seen such a comment:

If you are writing a comment, please be thoughtful, civil and articulate. In the vast majority of cases, we only accept comments written in English; foreign language comments will be rejected. (Comments & Readers’ Reviews)

If a story appears in the New York Times and “near-duplicates” appear in Arizona, Italy, and Sudan, with comments, then under the authors’ approach you will not have the opportunity to see that content.

That’s replacing American Exceptionalism with American Myopia.

Doesn’t sound like a winning solution to me.

I first saw this at Full Text Reports as Duplicate News Story Detection Revisited.

E-Books Directory

Filed under: Books,eBooks — Patrick Durusau @ 3:24 pm

E-Books Directory

From the webpage:

Welcome! We have exactly 8631 free e-books in 649 categories.

E-Books Directory is a daily growing list of freely downloadable ebooks, documents and lecture notes found all over the internet. You can submit and promote your own ebooks, add comments on already posted books or just browse through the directory below and download anything you need.

Welcome additions to your reader device!

An arm saver as well: A New Kind of Science by Stephen (EN) Wolfram. (EN = Editor Needed)

Curious, do you think eBooks are going to lead to longer (read: poorly edited) books in general?

The 25 Biggest Brand Fails of 2013

Filed under: Advertising,Marketing,Topic Maps — Patrick Durusau @ 3:11 pm

The 25 Biggest Brand Fails of 2013 by Tim Nudd.

From the post:

Arrogant, intolerant, sexist, disgusting, cheesy, tasteless, just plain stupid. Brand fails come in all kinds of off-putting shapes and sizes, though one thing remains constant—the guilty adrenaline rush of ad-enfreude that onlookers feel while watching brands implode for everyone to see.

We’ve collected some of the most delectably embarrassing marketing moments from 2013 for your rubbernecking pleasure. Eat it up, you heartless pigs. And just be thankful it wasn’t you who screwed up this royally.

Amusing but also lessons in what not to do when advertising topic maps.

Another approach that doesn’t work is “…why isn’t everybody migrating to technology X? It’s so great….”

I kid you not. The video seemed to go on forever.

The video, like many of the 25 ads, missed a two-part test for effective advertising:

  1. What’s in it for the customer?
  2. Is the “what” something the customer cares about?

If you miss either one of those points, the ad is a dud even if it doesn’t make the top 25 poorest ads next year.

Discover Your Neighborhood with Census Explorer

Filed under: Census Data,Government Data — Patrick Durusau @ 2:57 pm

Discover Your Neighborhood with Census Explorer by Michael Ratcliffe.

From the post:

Our customers often want to explore neighborhood-level statistics and see how their communities have changed over time. Our new Census Explorer interactive mapping tool makes this easier than ever. It provides statistics on a variety of topics, such as percent of people who are foreign-born, educational attainment and homeownership rate. Statistics from the 2008 to 2012 American Community Survey power Census Explorer.

While you may be familiar with other ways to find neighborhood-level statistics, Census Explorer provides an interactive map for states, counties and census tracts. You can even look at how these neighborhoods have changed over time because the tool includes information from the 1990 and 2000 censuses in addition to the latest American Community Survey statistics. Seeing these changes is possible because the annual American Community Survey replaced the decennial census long form, giving communities throughout the nation more timely information than just once every 10 years.

Topics currently available in Census Explorer:

  • Total population
  • Percent 65 and older
  • Foreign-born population percentage
  • Percent of the population with a high school degree or higher
  • Percent with a bachelor’s degree or higher
  • Labor force participation rate
  • Home ownership rate
  • Median household income

Fairly coarse (census tract level) data but should be useful for any number of planning purposes.

For example, you could cross this data with traffic ticket and arrest data to derive “police presence” statistics.
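A rough sketch of that kind of join with pandas (the file names and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical inputs: ACS tract-level population and a log of
# traffic stops geocoded to census tracts.
tracts = pd.read_csv("acs_tracts.csv")    # columns: tract_id, population
stops = pd.read_csv("traffic_stops.csv")  # columns: tract_id, stop_id

per_tract = stops.groupby("tract_id").size().rename("stops")
presence = tracts.set_index("tract_id").join(per_tract).fillna(0)

# Stops per 1,000 residents as a crude "police presence" measure.
presence["stops_per_1k"] = 1000 * presence["stops"] / presence["population"]
print(presence.sort_values("stops_per_1k", ascending=False).head())
```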

Or add “citizen watcher” data from tweets about police car numbers and locations.

Different data sets often use different boundaries for areas.

Consider creating topic map based filters so that when the boundaries change (a favorite activity of local governments), your summaries of that data change with them.

December 24, 2013

How to Host Your Clojure App on OpenShift

Filed under: Clojure,OpenShift — Patrick Durusau @ 5:16 pm

How to Host Your Clojure App on OpenShift by Marek Jelen.

From the post:

Today we shall explore deploying a Clojure application on top of OpenShift. We will use Leiningen to manage the applications. This is not the only way to deploy Clojure applications so I will explore more options in following parts of this (mini)series.

Very much a “hello world” type introduction but it is motivation to sign up for an OpenShift account. (Online accounts are free.)

In fact, to complete the demo you will need an OpenShift account.

After signing up, you can deploy other Clojure apps from the books you got as presents!

Enjoy!

Resource Identification Initiative

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:30 pm

Resource Identification Initiative

From the webpage:

We are starting a pilot project, sponsored by the Neuroscience Information Framework and the International Neuroinformatics Coordinating Facility, to address the issue of proper resource identification within the neuroscience (actually biomedical) literature. We have now christened this project the Resource Identification Initiative (hashtag #RII) and expanded the scope beyond neuroscience. This project is designed to make it easier for researchers to identify the key resources (materials, data, tools) used to produce the scientific findings within a published study and to find other studies that used the same resources. It is also designed to make it easier for resource providers to track usage of their resources and for funders to measure impacts of resource funding. The requirements are that key resources are identified in such a manner that they are identified uniquely and are:

1) Machine readable;

2) Are available outside the paywall;

3) Are uniform across publishers and journals.

We are seeking broad input from the FORCE11 community to ensure that we come up with a solution that represents the best thinking available on these topics.

The pilot project was an outcome of a meeting held at the NIH on Jun 26th. A draft report from the June 26th Resource Identification meeting at the NIH is now available. As the report indicates, we have preliminary agreements from journals and publishers to implement a pilot project. We hope to extend this project well beyond the neuroscience literature, so please join this group if you are interested in participating.

….

Yes, another “unique identifier” project.

Don’t get me wrong, to the extent that a unique vocabulary can be developed and used, that’s great.

But it does not address:

  • tools/techniques/data that existed before the unique vocabulary came into existence
  • future tools/techniques/data that isn’t covered by the unique vocabulary
  • mappings between old, current and future tool/techniques/data

The project is trying to address a real need in neuroscience journals (lack of robust identification of organisms or antibodies).

If you have the time and interest, it is a worthwhile project that needs to consider the requirements for “robust” identification.

Ontology Matching

Filed under: Alignment,Ontology — Patrick Durusau @ 4:08 pm

Ontology Matching: Proceedings of the 8th International Workshop on Ontology Matching, co-located with the 12th International Semantic Web Conference (ISWC 2013) edited by Pavel Shvaiko, et al.

Technical papers:

The Ontology Alignment Evaluation Initiative 2013 results are represented by seventeen (17) papers.

In addition, there are eleven (11) posters.

Complete proceedings in one PDF file.

Ontologies are a popular answer to semantic diversity.

You might say the more semantic diversity in a field, the greater the number of ontologies it has. 😉

A natural consequence of the proliferation of ontologies is the need to match or map between them.

As you know, I prefer to capture the reasons for mappings to avoid repeating the exercise over and over but that’s not a universal requirement.

If you have an hourly contract for mapping between ontologies, you may not want to lessen the burden of such mapping, year in and year out.

And for some purposes, mechanical mappings may be sufficient.

This work is a good update on the current state of the art for matching ontologies.
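Capturing the why of a mapping need not be elaborate. Even a record like the one below, where the field names and values are mine, preserves enough provenance to avoid redoing the work when the ontologies change:

```python
# A minimal mapping record with its justification, so the mapping can
# be revisited rather than re-derived. All names here are illustrative.
mapping = {
    "source": "ontoA:HeartAttack",
    "target": "ontoB:MyocardialInfarction",
    "relation": "equivalent",
    "reason": "same ICD-10 code (I21); label match confirmed by curator",
    "curator": "pd",
    "date": "2013-12-24",
}
```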

365 Days Of Astronomy

Filed under: Astroinformatics,Science — Patrick Durusau @ 3:41 pm

The 365 Days Of Astronomy Will Continue Its Quest In 2014.

From the post:

365 Days of Astronomy will continue its service in 2014! This time we will have more days available for new audio. Have something to share? We’re looking for content from 10 minutes long up to an hour!


Since 2009, 365 Days of Astronomy has brought a new podcast every day to astronomy lovers around the world to celebrate the International Year of Astronomy. Fortunately, the project has continued until now and we will keep going for another year in 2014. This means we will continue to serve you for a 6th year.

Through these years, 365 Days Of Astronomy has been delivering daily podcasts discussing various topics in the constantly changing realm of astronomy. These include history of astronomy, the latest news, observing tips and topics on how the fundamental knowledge in astronomy has changed our paradigms of the world. We’ve also asked people to talk about the things that inspired them, and to even share their own stories, both of life doing astronomy and science fiction that got them imagining a more scientific future.

365 Days of Astronomy is a community podcast that relies on a network of dedicated podcasters across the globe who are willing to share their knowledge and experiences in astronomy with the world and it will continue that way. In 2013, 365 Days of Astronomy started a new initiative with CosmoQuest. We now offer great new audio every weekend, while on weekdays we serve up interesting podcasts from CosmoQuest and other dedicated partners. We also have several monthly podcasts from dedicated podcasters and have started two new series: Space Stories and Space Scoop. The former is a series of science fiction tales, and the latter is an astronomy news segment for children.

For more information please visit:
email: info@365daysofastronomy.org
365 Days of Astronomy: http://cosmoquest.org/blog/365daysofastronomy/
Astrosphere New Media: http://www.astrosphere.org/
Join in as podcaster: http://cosmoquest.org/blog/365daysofastronomy/join-in/
Donate to our media program : http://cosmoquest.org/blog/365daysofastronomy/donate/

If you or someone you know finds a telescope tomorrow or is already an active amateur astronomer, they may be interested in these podcasts.

Astronomy had “big data” before “big data” was a buzz word. It has a common coordinate system but how people talk about particular coordinates varies greatly. (Can you say: Needs semantic integration?)

It’s a great hobby with opportunities to explore professional data if you are interested.

I mention it because a topic map without interesting data isn’t very interesting.

Undated Search Results

Filed under: Search Data,Searching — Patrick Durusau @ 3:12 pm

Looking for HTML5 resources to mention in Design, Math, and Data was complicated by the lack of dating in search results.

Searching on “html5 interfaces examples,” the highest ranked result was:

HTML5 Website Showcase: 48 Potential Flash-Killing Demos (2009, est.)

That’s right, a four-year-old post.

Never mind the changes in CSS, jQuery, etc. over the last four years.

Several pages into the first search results I found:

40+ Useful HTML5 Examples and Tutorials (2012)

It was in a mixture of undated or variously dated resources.

Finally, after following an old post and then searching that site, I uncovered:

21 Fresh Examples of Websites Using HTML5 (2013)

Even there it wasn’t the highest ranked page at the site.

I realize that parsing dates for sites could be difficult but surely search engines know the date when they first encountered a page? That would make it trivial to order search results by time.

Pages would not have a strict chronological sequence, but it would be a better time sorting than the current hodgepodge of results.
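The sort itself really is trivial once a first-crawled timestamp is attached to each result; a sketch, with the dates invented for illustration:

```python
from datetime import date

# Hypothetical search results, each carrying the date the crawler
# first saw the page.
results = [
    ("48 Potential Flash-Killing Demos", date(2009, 7, 14)),
    ("40+ Useful HTML5 Examples", date(2012, 3, 2)),
    ("21 Fresh Examples of Websites Using HTML5", date(2013, 5, 20)),
]

# Newest first: one sort key, no parsing of dates out of page content.
for title, first_seen in sorted(results, key=lambda r: r[1], reverse=True):
    print(first_seen.isoformat(), title)
```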

Design, Math, and Data

Filed under: Dashboard,Data,Design,Interface Research/Design — Patrick Durusau @ 2:58 pm

Design, Math, and Data: Lessons from the design community for developing data-driven applications by Dean Malmgren.

From the post:

When you hear someone say, “that is a nice infographic” or “check out this sweet dashboard,” many people infer that they are “well-designed.” Creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We, at Datascope Analytics, argue that the same is true of designing effective data science tools, dashboards, engines, etc — in order to design effective dashboards, you must know what is possible.

As founders of Datascope Analytics, we have taken inspiration from Julio Ottino’s Whole Brain Thinking, learned from Stanford’s d.school, and even participated in an externship swap with IDEO to learn how the design process can be adapted to the particular challenges of data science (see interspersed images throughout).

If you fear “some assembly required,” imagine how users feel with new interfaces.

Good advice on how to explore potential interface options.

Do you think HTML5 will lead to faster mock-ups?

See for example:

21 Fresh Examples of Websites Using HTML5 (2013)

40+ Useful HTML5 Examples and Tutorials (2012)

HTML5 Website Showcase: 48 Potential Flash-Killing Demos (2009, est.)

elasticsearch-entity-resolution

Filed under: Duke,ElasticSearch,Entity Resolution,Search Engines,Searching — Patrick Durusau @ 2:17 pm

elasticsearch-entity-resolution

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it as an interactive deduplication engine.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.
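Duke’s documentation describes combining per-property comparator probabilities with Bayes’ theorem; a minimal sketch of that combination step (my paraphrase, not the plugin’s code):

```python
def combine(probabilities):
    """Naive Bayes combination of per-field match probabilities,
    in the style described in the Duke documentation."""
    p_match, p_nonmatch = 1.0, 1.0
    for p in probabilities:
        p_match *= p
        p_nonmatch *= (1.0 - p)
    return p_match / (p_match + p_nonmatch)

# Field-level probabilities from comparators (e.g., name, address, phone).
print(combine([0.9, 0.8, 0.6]))  # ~0.98: likely the same entity
print(combine([0.9, 0.2, 0.3]))  # ~0.49: too close to call
```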

A step more likely to be followed were it under an Apache License as opposed to its current LGPLv3.

A Salinas Card

Filed under: Government,Law — Patrick Durusau @ 1:51 pm

Salinas v. Texas

SCOTUSblog has this summary of Salinas v. Texas:

When petitioner had not yet been placed in custody or received Miranda warnings, and voluntarily responded to some questions by police about a murder, the prosecution’s use of his silence in response to another question as evidence of his guilt at trial did not violate the Fifth Amendment because petitioner failed to expressly invoke his privilege not to incriminate himself in response to the officer’s question.

A lay translation: If the police ask you questions, before you have been arrested or read your rights, your silence can and will be used against you in court.

I could go on for thousands of words about why Salinas v. Texas was wrongly decided, but that won’t help in an interaction with the police.

I have a simpler and perhaps even effective course of action, the Salinas Card.

My name is: (insert your name).

I invoke my right against self-incrimination and refuse to answer any and all questions, verbal, written or otherwise communicated.

I invoke my right to counsel and cannot afford counsel. I request counsel be appointed and to be present for any questioning, lineups or other identification procedures, and/or any legal proceedings.

I do not consent to any searches of my person, my immediate surroundings or any vehicles or structures that I may own, rent or otherwise occupy.

Date: ___________________________
Police Officer

Get a local criminal defense attorney to approve the language for your state (some states have more protections than the U.S. Constitution). Print the card up on standard 3″ x 5″ index card stock.

When approached by the police, read your Salinas Card to them, date it and ask for their signature on it. (Keep the original, give them a copy.)

Personally I would keep four (4) or five (5) 2-card sets on hand at all times.

PS: This is not legal advice but a suggestion that you get legal advice. Show this post to your local public defender and ask them to approve a Salinas card.

Take The Money And Run (RSA)

Filed under: Encryption,Government,NSA — Patrick Durusau @ 10:32 am

I think David Meyer’s headline captures the essence of the RSA story: Security firm denies knowingly including NSA backdoor — but not taking NSA cash.

RSA posts in its defense:

We made the decision to use Dual EC DRBG as the default in BSAFE toolkits in 2004, in the context of an industry-wide effort to develop newer, stronger methods of encryption. At that time, the NSA had a trusted role in the community-wide effort to strengthen, not weaken, encryption.

When concern surfaced around the algorithm in 2007, we continued to rely upon NIST as the arbiter of that discussion.

RSA, as a security company, never divulges details of customer engagements, but we also categorically state that we have never entered into any contract or engaged in any project with the intention of weakening RSA’s products, or introducing potential ‘backdoors’ into our products for anyone’s use.

So, if I had given RSA $10 million on a contract, would that give me “a trusted role in the community-wide effort to strengthen, not weaken, encryption?”

Given the NSA mission to break encryption used by others, it isn’t clear how the NSA could ever have a “trusted role” in public encryption efforts.

To be sure, the NSA also has an interest in robust encryption for the U.S. government, but it has no interest in making those methods publicly available.

Quite the contrary, the only sensible goal of the NSA is to have breakable encryption used by everyone but the NSA and its clients. Yes?

The NSA was pursuing a rational strategy for a government spy agency and RSA was simply naive to believe otherwise.

As usual, cui bono (“to whose benefit?”) is the relevant question.

PS: If you need help asking that question, I was professionally trained in a hermeneutic of suspicion tradition that was centuries old when the feminists “discovered” it.
