Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 20, 2014

Wikidata: A Free Collaborative Knowledge Base

Filed under: Wikidata,Wikipedia — Patrick Durusau @ 7:29 pm

Wikidata: A Free Collaborative Knowledge Base by Denny Vrandečić and Markus Krötzsch.

Abstract:

Unnoticed by most of its readers, Wikipedia is currently undergoing dramatic changes, as its sister project Wikidata introduces a new multilingual ‘Wikipedia for data’ to manage the factual information of the popular online encyclopedia. With Wikipedia’s data becoming cleaned and integrated in a single location, opportunities arise for many new applications.

In this article, we provide an extended overview of Wikidata, including its essential design choices and data model. Based on up-to-date statistics, we discuss the project’s development so far and outline interesting application areas for this new resource.

Denny Vrandečić, Markus Krötzsch. Wikidata: A Free Collaborative Knowledge Base. In Communications of the ACM (to appear). ACM 2014.

If you aren’t already impressed by Wikidata, this article should be the cure!

Topic Map Marketing Song?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:29 am

Analysis of 50 years of hit songs yields tips for advertisers

Summary:

Researchers have analyzed 50 years’ worth of hit songs to identify key themes that marketing professionals can use to craft advertisements that will resonate with audiences. The researchers used computer programs to run textual analysis of the lyrics for all of the selected songs and analyzed the results to identify key themes. The researchers identified 12 key themes, and related terms, that came up most often in the hit songs. These themes are loss, desire, aspiration, breakup, pain, inspiration, nostalgia, rebellion, jaded, desperation, escapism and confusion.

As the researchers say, there’s no guarantee of marketing success with particular music, but you may be able to better your odds.

David H. Henard and Christian L. Rossetti. All You Need is Love? Communication Insights from Pop Music’s Number-One Hits. Journal of Advertising Research, 2014 (in press) DOI: 10.2501/JAR-54-1-000-000.

You may have noticed that the DOI looks broken (because it is). So, grab a copy of the paper here.

You will have a lot of fun reading this article, particularly the tables of words and themes.

It doesn’t mention Jimi Hendrix so now you know why I don’t work in marketing. 😉

March 19, 2014

Search Gets Smarter with Identifiers

Filed under: EU,Identifiers,Subject Identifiers,Subject Identity — Patrick Durusau @ 3:36 pm

Search Gets Smarter with Identifiers

From the post:

The future of computing is based on Big Data. The vast collections of information available on the web and in the cloud could help prevent the next financial crisis, or even tell you exactly when your bus is due. The key lies in giving everything – whether it’s a person, business or product – a unique identifier.

Imagine if everything you owned or used had a unique code that you could scan, and that would bring you a wealth of information. Creating a database of billions of unique identifiers could revolutionise the way we think about objects. For example, if every product that you buy can be traced through every step in the supply chain you can check whether your food has really come from an organic farm or whether your car is subject to an emergency recall.

….

The difficulty with using big data is that the person or business named in one database might have a completely different name somewhere else. For example, news reports talk about Barack Obama, The US President, and The White House interchangeably. For a human being, it’s easy to know that these names all refer to the same person, but computers don’t know how to make these connections. To address the problem, Okkam has created a Global Open Naming System: essentially an index of unique entities like people, organisations and products, that lets people share data.

“We provide a very fast and effective way of discovering data about the same entities across a variety of sources. We do it very quickly,” says Paolo Bouquet. “And we do it in a way that it is incremental so you never waste the work you’ve done. Okkam’s entity naming system allows you to share the same identifiers across different projects, different companies, different data sets. You can always build on top of what you have done in the past.”

The benefits of a unique name for everything

http://www.okkam.org/

The community website: http://community.okkam.org/ reports 8.5+ million entities.

When the EU/CORDIS show up late for a party, it’s really late.

A multi-lingual organization like the EU (kudos on its efforts in that direction) should know that uniformity of language or identifiers is found only in dystopian fiction.

I prefer the language and cultural richness of Europe over the sterile uniformity of American fast food chains. Same issue.

You?

I first saw this in a tweet by Stefano Bertolo.

Full-Text-Indexing (FTS) in Neo4j 2.0

Filed under: Indexing,Neo4j,Texts — Patrick Durusau @ 2:58 pm

Full-Text-Indexing (FTS) in Neo4j 2.0 by Michael Hunger.

From the post:

With Neo4j 2.0 we got automatic schema indexes based on labels and properties for exact lookups of nodes on property values.

Fulltext and other indexes (spatial, range) are on the roadmap but not addressed yet.

For fulltext indexes you still have to use legacy indexes.

As you probably don’t want to add nodes to an index manually, the existing “auto-index” mechanism should be a good fit.

To use that automatic index you have to configure the auto-index upfront to be a fulltext index and then secondly enable it in your settings.

Great coverage of full-text indexing in Neo4j 2.0.
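To make the recipe concrete, here is a minimal sketch (mine, not Michael’s code) of the two steps against a local Neo4j 2.0 server, assuming http://localhost:7474 and that conf/neo4j.properties already enables auto-indexing for the properties you care about (node_auto_indexing=true, node_keys_indexable=title,text):

    import requests

    BASE = "http://localhost:7474/db/data"

    # 1. Create the legacy index *as fulltext* before any nodes are auto-indexed.
    requests.post(BASE + "/index/node",
                  json={"name": "node_auto_index",
                        "config": {"type": "fulltext", "provider": "lucene"}})

    # 2. Query it from Cypher with a legacy START clause and a Lucene query string.
    cypher = 'START n=node:node_auto_index("text:graph*") RETURN n.title LIMIT 10'
    resp = requests.post(BASE + "/transaction/commit",
                         json={"statements": [{"statement": cypher}]})
    print(resp.json())

Since a legacy index keeps the configuration it was created with, do this on a fresh database or after dropping the old auto-index, which is the sort of gotcha Michael’s post walks through.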

Looking forward to spatial indexing. In the most common use case, think of it as locating assets on the ground relative to other actors. In real time.

The Seven Parts of “HTML 5 Fundamentals”

Filed under: CSS3,HTML5,WWW — Patrick Durusau @ 2:08 pm

The Seven Parts of “HTML 5 Fundamentals” by Greg Duncan.

From the post:

It’s Web Wednesday and today we’re going to take a step back and share a series by David Giard, who’s going to give us a fundamental look at HTML5. Oh, I know YOU don’t need this, but you might have a "friend" who does (cough… like me… cough…).

HTML 5 Fundamentals

Read this series of articles to learn more about HTML5 and CSS3

Part 1- An Introduction to HTML5

Part 2 – New Tags

Part 3 – New Attributes

Part 4 – New Input Types

Part 5 – CSS3

Part 6 – More CSS3

Part 7 – HTML5 JavaScript APIs

Part 1 has this jewel:

Due to the enormous scope of HTML5 and the rate at which users tend to upgrade to new browsers, it is unlikely that HTML5 will be on all computers for another decade.

Let’s see, a web “year” is approximately 3 months according to Tim Berners-Lee, so in forty (40) web years, HTML5 will be on all computers.

That’s a long time to wait so I would suggest learning features as they are supported by the top three browsers. You won’t ever be terribly behind and, at the same time, your webpages will keep working for the vast majority of users.

That would make an interesting listing if it doesn’t exist already. The features of HTML5 as a matrix against the top three browsers.

Legend for the matrix: One browser – start learning, Two browsers – start writing, Three browsers – deploy.
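If no such listing exists, a rough cut is only a few lines away. The sketch below (not from Greg’s series) assumes the support data that caniuse.com publishes in the Fyrd/caniuse GitHub repository; the URL, field names and choice of “top three” browsers are my assumptions:

    import requests

    CANIUSE = "https://raw.githubusercontent.com/Fyrd/caniuse/master/data.json"
    TOP_THREE = ["chrome", "firefox", "ie"]   # pick your own "top three"
    ADVICE = {0: "ignore for now", 1: "start learning",
              2: "start writing", 3: "deploy"}

    data = requests.get(CANIUSE).json()["data"]
    for feature_id, feature in sorted(data.items()):
        # Crude test: count a browser as supporting the feature if any listed
        # version is marked "y" (full support).
        supported = sum(
            1 for browser in TOP_THREE
            if any(v.startswith("y")
                   for v in feature.get("stats", {}).get(browser, {}).values())
        )
        print("{}: {}/3 -> {}".format(feature["title"], supported, ADVICE[supported]))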

Yes?

I first saw this in a tweet by Microsoft Channel 9.

Podcast: Thinking with Data

Filed under: Data,Data Analysis,Data Science — Patrick Durusau @ 1:39 pm

Podcast: Thinking with Data: Data tools are less important than the way you frame your questions by Jon Bruner.

From the description:

Max Shron and Jake Porway spoke with me at Strata a few weeks ago about frameworks for making reasoned arguments with data. Max’s recent O’Reilly book, Thinking with Data, outlines the crucial process of developing good questions and creating a plan to answer them. Jake’s nonprofit, DataKind, connects data scientists with worthy causes where they can apply their skills.

Curious if you agree with Max that data tools are “mature”?

Certainly better than they were when I was an undergraduate in political science but measuring sentiment was a current topic even then. 😉

And the controversy of tools versus good questions isn’t a new one either.

To his credit, Max acknowledges decades of discussion of rhetoric and thinking as helpful in this area.

For you research buffs, any pointers to prior tools versus good questions debates? (Think sociology/political science in the 1970s to date. It’s a recurring theme.)

I first saw this in a tweet by Mike Loukides.

Ontology work at the Royal Society of Chemistry

Filed under: Cheminformatics,Ontology,Science,Topic Maps — Patrick Durusau @ 1:07 pm

Ontology work at the Royal Society of Chemistry by Antony Williams.

From the description:

We provide an overview of the use we make of ontologies at the Royal Society of Chemistry. Our engagement with the ontology community began in 2006 with preparations for Project Prospect, which used ChEBI and other Open Biomedical Ontologies to mark up journal articles. Subsequently Project Prospect has evolved into DERA (Digitally Enhancing the RSC Archive) and we have developed further ontologies for text markup, covering analytical methods and name reactions. Most recently we have been contributing to CHEMINF, an open-source cheminformatics ontology, as part of our work on disseminating calculated physicochemical properties of molecules via the Open PHACTS. We show how we represent these properties and how it can serve as a template for disseminating different sorts of chemical information.

A bit wordy for my taste but it has numerous references and links to resources. Top stuff!

I had to laugh when I read slide #20:

Why a named reaction ontology?

Despite attempts to introduce systematic nomenclature for organic reactions, lots of chemists still prefer to attach human names.

Those nasty humans! Always wanting “human” names. Grrr! 😉

Afraid so. That is going to continue in a number of disciplines.

When I got to slide #29:

Ontologies as synonym sets for text-mining

it occurred to me that terms in an ontology are like base names in a topic map: names attached to topics that have associations with other topics, which also have base names.

The big difference is that ontologies are mono-views that don’t include mapping instructions based on properties in the starting ontology, or in any other ontology to which you could map.

That is, the ontologies I have seen can only report properties of their terms, not which properties must match for two terms to refer to the same subject.

Nor do such ontologies report properties of the subjects that serve as their properties. Much less any mappings from bundles of properties to bundles of properties in other ontologies.
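Here is a toy sketch (mine, not anything from the slides) of the difference: each term carries a bundle of properties plus an explicit, data-level rule saying which properties must match for two terms to be treated as the same subject. Labels, sources and CAS numbers are only illustrative.

    from itertools import combinations

    terms = [
        {"label": "aspirin", "source": "vocabulary-A", "cas": "50-78-2"},
        {"label": "acetylsalicylic acid", "source": "vocabulary-B",
         "cas": "50-78-2", "formula": "C9H8O4"},
        {"label": "salicylic acid", "source": "vocabulary-A", "cas": "69-72-7"},
    ]

    # The mapping rule is recorded as data, not left in an expert's head:
    # match on the CAS registry number, ignore labels and sources.
    IDENTITY_KEYS = ("cas",)

    def same_subject(a, b):
        return all(a.get(k) is not None and a.get(k) == b.get(k)
                   for k in IDENTITY_KEYS)

    for a, b in combinations(terms, 2):
        if same_subject(a, b):
            print(a["label"], "==", b["label"])
    # -> aspirin == acetylsalicylic acid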

I know the usual argument about the combinatorial explosion of mappings, etc., which leaves ontologists with too few arms and legs to point in the various directions.

That argument fails to point out that to have an “uber” ontology, someone has to do the mapping (undisclosed) from variants to the new master ontology. And, they don’t write that mapping down.

So the combinatorial explosion was present; it just didn’t get written down. Can you guess who is empowered as an expert in the new master ontology with undocumented mappings?

The other fallacy in that argument is that it overlooks that topic maps, for example, are always partial world views. I can map as much or as little between ontologies, taxonomies, vocabularies, folksonomies, etc. as I care to do.

If I don’t want to bother mapping “thing” as the root of my topic map, I am free to omit it. All the superstructure clutter goes away and I can focus on immediate ROI concerns.

Unless you want to pay for the superstructure clutter then by all means, I’m interested! 😉

If you have an ontology, by all means use it as a starting point for your topic map. Or if someone is willing to pay to create yet another ontology, do it. But if they need results before years of travel, debate and bad coffee, give topic maps a try!

PS: The travel, debate and bad coffee never go away for ontologies, even successful ones. Despite the desires of many, the world keeps changing and our views of it along with it. A static ontology is a dead ontology. Same is true for a topic map, save that agreement on its content is required only as far as it is used and no further.

March 18, 2014

eMOP Early Modern OCR Project

Filed under: OCR,Text Mining — Patrick Durusau @ 9:06 pm

eMOP Early Modern OCR Project

From the webpage:

The Early Modern OCR Project is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO), creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archiveable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.

I first saw this project at: Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

I find it exciting because of the progress the project is making for texts printed between 1475 and 1800. For the texts in that time period for sure, but I am also hoping those techniques can be adapted to older materials.

Say older by several thousand years.

Despite pretensions to the contrary, “web scale” is not very much when compared to data feeds from modern science (colliders, telescopes, gene sequencing, etc.), or to the vast store of historical texts that remain off-line. To say nothing of the need for secondary analysis of those texts.

Every text that becomes available enriches a semantic tapestry that only humans can enjoy.

Automatic bulk OCR and full-text search…

Filed under: ElasticSearch,Search Engines,Solr,Topic Maps — Patrick Durusau @ 8:48 pm

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

From the post:

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization’s budget.

Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involves searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn’t considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work I’ve been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:

If this weren’t impressive enough, Chris has a number of research ideas, including:

the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like Github for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.

More grist for a topic map mill!
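For a sense of how little plumbing the basic pipeline needs, here is a stripped-down sketch (not Chris’s code): OCR each page image with the Tesseract command line, then post the text to a local Solr core. The Solr URL, core name and field names are assumptions you would adapt to your own schema.

    import glob
    import os
    import subprocess
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/collection1/update?commit=true"

    docs = []
    for image in glob.glob("scans/*.tif"):
        base = os.path.splitext(image)[0]
        subprocess.check_call(["tesseract", image, base])   # writes base + ".txt"
        with open(base + ".txt", encoding="utf-8") as f:
            docs.append({"id": os.path.basename(base), "ocr_text": f.read()})

    # Send the whole batch to Solr as a JSON array of documents.
    requests.post(SOLR_UPDATE, json=docs,
                  headers={"Content-Type": "application/json"})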

PS: Should you ever come across a treasure trove of not widely available documents, please replicate them to as many public repositories as possible.

Traditional news outlets protect people in leak situations who knew they were playing in the street. Why they merit more protection than the average person is a mystery to me. Let’s protect the average people first and the players last.

UK statistics and open data…

Filed under: Government,Government Data,Open Data,Open Government — Patrick Durusau @ 7:52 pm

UK statistics and open data: MPs’ inquiry report published, by Owen Boswarva.

From the post:

This morning the Public Administration Select Committee (PASC), a cross-party group of MPs chaired by Bernard Jenkin, published its report on Statistics and Open Data.

This report is the product of an inquiry launched in July 2013. Witnesses gave oral evidence in three sessions; you can read the transcripts and written evidence as well.

Useful if you are looking for rhetoric and examples of use of government data.

Ironic that just last week the news broke that Google has given British security the power to censor “unsavoury” (but legal) content from YouTube: UK gov wants to censor legal but “unsavoury” YouTube content by Lisa Vaas.

Lisa writes:

Last week, the Financial Times revealed that Google has given British security the power to quickly yank terrorist content offline.

The UK government doesn’t want to stop there, though – what it really wants is the power to pull “unsavoury” content, regardless of whether it’s actually illegal – in other words, it wants censorship power.

The news outlet quoted UK’s security and immigration minister, James Brokenshire, who said that the government must do more to deal with material “that may not be illegal but certainly is unsavoury and may not be the sort of material that people would want to see or receive.”

I’m not sure why the UK government wants to block content that people don’t want to see or receive. They simply won’t look at it. Yes?

But intellectual coherence has never been a strong point of most governments, the UK in particular of late.

Is this more evidence for my contention that “open data” for government means only the data government wants you to have?

Aligning Controlled Vocabularies

Filed under: MARC,SKOS — Patrick Durusau @ 7:32 pm

Tutorial on the use of SILK for aligning controlled vocabularies

From the post:

A tutorial on the use of SILK has been published. The SILK framework is a tool for discovering relationships between data items within different Linked Data sources. This tutorial explains how SILK can be used to discover links between concepts in controlled vocabularies.

Example used in this Tutorial

The tutorial uses an example where SILK is used to create a mapping between the Named Authority Lists (NALs) of the Publications Office of the EU and the MARC countries list of the US Library of Congress. Both controlled vocabularies (NALs & MARC Countries list) use URIs to identify countries; compare, for example, the following URIs for the country of Luxembourg

SILK represents mappings between NALs using the SKOS language (skos:exactMatch). In the case of the URIs for Luxembourg this is expressed as N-Triples:

The tutorial is here.
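The output SILK produces is just SKOS triples like the one sketched below (built here with rdflib rather than SILK, and with URIs written from memory, so check them against the tutorial before relying on them):

    from rdflib import Graph, URIRef
    from rdflib.namespace import SKOS

    eu_nal = URIRef("http://publications.europa.eu/resource/authority/country/LUX")
    loc_marc = URIRef("http://id.loc.gov/vocabulary/countries/lu")

    g = Graph()
    g.add((eu_nal, SKOS.exactMatch, loc_marc))
    print(g.serialize(format="nt"))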

If you bother to look up the documentation on skos:exactMatch:

The property skos:exactMatch is used to link two concepts, indicating a high degree of confidence that the concepts can be used interchangeably across a wide range of information retrieval applications. skos:exactMatch is a transitive property, and is a sub-property of skos:closeMatch.

Are you happy with “…a high degree of confidence that the concepts can be used interchangeably across a wide range of information retrieval applications?”

I’m not really sure what that means.

Not to mention that if 97% of the people in a geographic region want a new government, some will say it can join a new country, but if the United States disagrees (for reasons best known to itself), then the will of 97% of the people is a violation of international law.

What? Too much democracy? I didn’t know that was a violation of international law.

If SKOS statements had some content, properties I suppose, along with authorship (and properties there as well), you could make an argument for skos:exactMatch being useful.

So far as I can see, it is not even a skos:closeMatch to “useful.”

Web Scraping: working with APIs

Filed under: Data Mining,Humanities,R,Social Sciences,Web Scrapers — Patrick Durusau @ 7:16 pm

Web Scraping: working with APIs by Rolf Fredheim.

From the post:

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

The final installment of Rolf’s course for humanists. He promises to repeat it next year. It should be interesting to see how the techniques and resources evolve by then.
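The course’s examples are in R, but the pattern (paste together an HTTP request, get JSON back) is the same everywhere. A Python sketch against the Google Maps geocoding endpoint, purely for shape (current versions of the API require a key, so don’t expect this to run as-is):

    import requests

    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": "Cambridge, UK", "key": "YOUR_API_KEY"},
    )
    result = resp.json()["results"][0]
    print(result["formatted_address"])
    print(result["geometry"]["location"])   # {"lat": ..., "lng": ...}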

Forward the course link to humanities and social science majors.

Think globally and solve locally:… (GraphChi News)

Filed under: Bioinformatics,GraphChi,Graphs — Patrick Durusau @ 6:55 pm

Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction by Marco Mesiti, Matteo Re, and Giorgio Valentini.

Abstract:

Background: Network-based learning algorithms for Automated Function Prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers.

Results: We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.

Conclusions: The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.

The biomolecular network material may be deep wading but you will find GraphChi making a significant difference in this use case.

What I found particularly interesting in Table 7 (page 20) was the low impact that additional RAM has on GraphChi.

I take that to mean that GraphChi can run efficiently on low-end boxes (4 GB RAM).

Yes?

I first saw this in a tweet by Aapo Kyrola.

Balisage Papers Due 18 April 2014

Filed under: Conferences,XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 2:21 pm

Unlike the rolling dates for Obamacare, Balisage Papers are due 18 April 2014. (That’s this year for health care wonks.)

From the website:

Balisage is an annual conference devoted to the theory and practice of descriptive markup and related technologies for structuring and managing information.

Are you interested in open information, reusable documents, and vendor and application independence? Then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer scientists, XML wizards, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while others still use SGML (and know why they do). Discussion is open, candid, and unashamedly technical. Content-free marketing spiels are unwelcome and ineffective.

I can summarize that for you:

There are conferences on the latest IT buzz.

There are conferences on last year’s IT buzz.

Then there are conferences on information as power, which decides who will sup and who will serve.

Balisage is about information as power. How you use it, well, that’s up to you.

March 17, 2014

Isaac Newton’s College Notebook

Filed under: Mathematics,Navigation — Patrick Durusau @ 8:30 pm

College Notebook by Isaac Newton.

From the description:

This small notebook was probably used by Newton from about 1664 to 1665. It contains notes from his reading on mathematics and geometry, showing particularly the influence of John Wallis and René Descartes. It also provides evidence of the development of Newton’s own mathematical thinking, including his study of infinite series and development of binomial theorem, the evolution of the differential calculus, and its application to the problem of quadratures and integration.

This notebook contains many blank pages (all shown) and has been used by Newton from both ends. Our presentation displays the notebook in a sensible reading order. It shows the ‘front’ cover and the 79 folios that follow (more than half of them blank) and then turns the notebook upside down showing the other cover and the pages that follow it. A full transcription is provided. The notebook was photographed while it was disbound in 2011.

The video above provides an introduction to Newton’s mathematical thinking at the time of this manuscript.

The Web remains erratic but there are more jewels like this one than say ten (10) years ago.

Curious how you would link up Einstein’s original notes on gravity waves (Einstein Papers Project) with the recently reported observation of gravity waves?

Seems like that would be important. And to collate all the materials on gravity waves between Einstein’s notes and the recent observations.

More and more information is coming online but appears to be as disjointed as it was prior to coming online. That’s a pity.

I first saw this in a tweet by Steven Strogatz.

Steven also points to: What led Newton to discover the binomial theorem? Would you believe it was experimentation and not mathematical proofs?

Hmmm, is there a lesson for designing topic map interfaces? To experiment rather than rely upon the way we know it must be?

Facebook Graph Search with Cypher and Neo4j

Filed under: Cybersecurity,Cypher,Neo4j,NSA,Security — Patrick Durusau @ 8:14 pm

Facebook Graph Search with Cypher and Neo4j by Max De Marzi.

A great post as always but it has just been updated:

Update: Facebook has disabled this application

Your app is replicating core Facebook functionality.

Rather ironic considering this headline:

Mark Zuckerberg called Obama about the NSA. Let’s not hang up the phone by Dan Gillmor.

It’s hard to say why Mark is so upset.

Here are some possible reasons:

  • NSA surveillance is poaching on surveillance sales by Facebook
  • NSA leaks exposed surveillance by Facebook
  • NSA leaks exposed U.S. corporations doing surveillance for the government
  • NSA surveillance will make consumers leery of Facebook surveillance
  • NSA leaks make everyone more aware of surveillance
  • NSA leaks make Mark waste time on phone with Obama acting indignant.

I am sure I have missed dozens of reasons why Mark is upset.

Care to supply the ones I missed?

If you could make a computer do anything with documents,…

Filed under: Document Classification,Document Management,News,Reporting — Patrick Durusau @ 7:56 pm

If you could make a computer do anything with documents, what would you make it do?

The Overview Project has made a number of major improvements in the last year and now they are asking your opinion on what to do next.

They do have funding and developers, and they are pushing out new features. I take all of those to be positive signs.

No guarantee that what you ask for is possible with their resources or even of any interest to them.

But, you won’t know if you don’t ask.

I will be posting my answer to that question on this blog this coming Friday, 21 March 2014.

Spread the word! Get other people to try Overview and to answer the survey.

Office Lens Is a Snap (Point and Map?)

Office Lens Is a Snap

From the post:

The moment mobile-phone manufacturers added cameras to their devices, they stopped being just mobile phones. Not only have lightweight phone cameras made casual photography easy and spontaneous, they also have changed the way we record our lives. Now, with help from Microsoft Research, the Office team is out to change how we document our lives in another way—with the Office Lens app for Windows Phone 8.

Office Lens, now available in the Windows Phone Store, is one of the first apps to use the new OneNote Service API. The app is simple to use: Snap a photo of a document or a whiteboard, and upload it to OneNote, which stores the image in the cloud. If there is text in the uploaded image, OneNote’s cloud-based optical character-recognition (OCR) software turns it into editable, searchable text. Office Lens is like having a scanner in your back pocket. You can take photos of recipes, business cards, or even a whiteboard, and Office Lens will enhance the image and put it into your OneNote Quick Notes for reference or collaboration. OneNote can be downloaded for free.

Less than five (5) years ago, every automated process in Office Lens would have been a configurable setting.

Today, it’s just point and shoot.

There is an interface lesson for topic maps in the Office Lens interface.

Some people will need the Office Lens API. But, the rest of us, just want to take a picture of the whiteboard (or some other display). Automatic storage and OCR are welcome added benefits.

What about a topic map authoring interface that looks a lot like MS Word™ or Open Office? A topic map is loaded much like a spelling dictionary. When the user selects “map-it,” links are inserted that point into the topic map.

Hover over such a link and data from the topic map is displayed. Can be printed, annotated, etc.

One possible feature would be “subject check” which displays the subjects “recognized” in the document. To enable the author to correct any recognition errors.

In case you are interested, I can point you to some open source projects that have general authoring interfaces. 😉

PS: If you have a Windows phone, can you check out Office Lens for me? I am still sans a cellphone of any type. Since I don’t get out of the yard a cellphone doesn’t make much sense. But I do miss out on the latest cellphone technology. Thanks!

ethereum

Filed under: Distributed Computing — Patrick Durusau @ 4:37 pm

ethereum

From the webpage:

Ethereum is a platform and a programming language that makes it possible for any developer to build and publish next-generation distributed applications.

Ethereum can be used to codify, decentralize, secure and trade just about anything: voting, domain names, financial exchanges, crowdfunding, company governance, contracts and agreements of most kind, intellectual property, and even smart property thanks to hardware integration.

Ethereum borrows the concept of decentralized consensus that makes bitcoin so resilient, yet makes it trivial to build on its foundation. To find out more about how Ethereum works, consult the whitepaper.

…will you build out of the Ether?

Distributed systems are a great idea but most governments won’t tolerate parallel monetary systems.

Primarily because it interferes with the ability of a central bank to simply declare there is more money in the national treasury than anyone suspected.

Adjust the decimal place a bit and suddenly the government is solvent again.

Having a parallel monetary system, like Kong bucks, would interfere with that capability.

ACTUS

Filed under: Finance Services,Legal Informatics,Standards — Patrick Durusau @ 4:17 pm

ACTUS (Algorithmic Contract Types Unified Standards)

From the webpage:

The Alfred P. Sloan Foundation awarded Stevens Institute of Technology a grant to work on the proposal entitled “Creating a standard language for financial contracts and a contract-centric analytical framework”. The standard follows the theoretical groundwork laid down in the book “Unified Financial Analysis” (1) – UFA. The goal of this project is to build a financial instrument reference database that represents virtually all financial contracts as algorithms that link changes in risk factors (market risk, credit risk, and behavior, etc.) to cash flow obligations of financial contracts. This reference database will be the technological core of a future open source community that will maintain and evolve standardized financial contract representations for the use of regulators, risk managers, and researchers.

The objective of the project is to develop a set of about 30 unique contract types (CT’s) that represent virtually all existing financial contracts and which generate state contingent cash flows at a high level of precision. The term of art that describes the impact of changes in the risk factors on the cash flow obligations of a financial contract is called “state contingent cash flows,” which are the key input to virtually all financial analysis including models that assess financial risk.

1- Willi Brammertz, Ioannis Akkizidis, Wolfgang Breymann, Rami Entin, Marco Rustmann; Unified Financial Analysis – The Missing Links of Finance, Wiley 2009.

This will help people who are not cheating in the financial markets.

After the revelations of the past couple of years, any guesses on the statistics of non-cheating members of the financial community?

😉

Even if these are used by non-cheaters, we know that the semantics are going to vary from user to user.

The real questions are: 1) How will we detect semantic divergence? and 2) How much semantic divergence can be tolerated?

I first saw this in a tweet by Stefano Bertolo.

How Statistics lifts the fog of war in Syria

Filed under: R,Random Forests,Record Linkage — Patrick Durusau @ 2:10 pm

How Statistics lifts the fog of war in Syria by David White.

From the post:

In a fascinating talk at Strata Santa Clara in February, HRDAG’s Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at the rates at which some known victims were not reported by all of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)

Caution is always advisable with government-issued data, but especially so when it arises from an armed conflict.

A forerunner to topic maps, record linkage (which is still widely used), plays a central role in collating data recorded in various ways. It isn’t possible to collate heterogeneous data without either creating a uniform set of records (record linkage) or mapping the subjects of the original records together (topic maps).
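A toy sketch of the record-linkage step (HRDAG’s real pipeline is far more careful, and the estimation of unreported victims is a separate statistical exercise): normalize names, then flag candidate matches across two lists when name similarity and reported age are close enough. Records and thresholds are illustrative only.

    from difflib import SequenceMatcher

    def normalize(name):
        return " ".join(name.lower().split())

    def name_similarity(a, b):
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    list_a = [{"name": "Ahmad al-Hassan", "age": 34}]
    list_b = [{"name": "Ahmed Al Hassan", "age": 35},
              {"name": "Mohammed Khalil", "age": 34}]

    for rec_a in list_a:
        for rec_b in list_b:
            score = name_similarity(rec_a["name"], rec_b["name"])
            if score > 0.85 and abs(rec_a["age"] - rec_b["age"]) <= 2:
                print("candidate match:", rec_a["name"], "<->",
                      rec_b["name"], round(score, 2))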

The usual moniker, “big data,” should really be “big, homogeneous data” (BHD). Which, if that is what you have, works great. If that isn’t what you have, it works less great. If at all.

BTW, groups like the Human Rights Data Analysis Group (HRDAG) would have far more credibility with me if their projects list didn’t read:

  • Africa
  • Asia
  • Europe
  • Middle East
  • Central America
  • South America

Do you notice anyone missing from that list?

I have always thought that “human rights” included cases of:

  • sexual abuse
  • child abuse
  • violence
  • discrimination
  • and any number of similar issues

I can think of another place where those conditions exist in epidemic proportions.

Can’t you?

Peyote and the International Plant Names Index

Filed under: Agriculture,Data,Names,Open Access,Open Data,Science — Patrick Durusau @ 1:30 pm

International Plant Names Index

What a great resource to find as we near Spring!

From the webpage:

The International Plant Names Index (IPNI) is a database of the names and associated basic bibliographical details of seed plants, ferns and lycophytes. Its goal is to eliminate the need for repeated reference to primary sources for basic bibliographic information about plant names. The data are freely available and are gradually being standardized and checked. IPNI will be a dynamic resource, depending on direct contributions by all members of the botanical community.

I entered the first plant name that came to mind: Peyote.

No “hits.” ?

Wikipedia gives Peyote’s binomial name as: Lophophora williamsii (think synonym).*

Searching on Lophophora williamsii, I got three (3) “hits.”

Had I bothered to read the FAQ before searching:

10. Can I use IPNI to search by common (vernacular) name?

No. IPNI does not include vernacular names of plants as these are rarely formally published. If you are looking for information about a plant for which you only have a common name you may find the following resources useful. (Please note that these links are to external sites which are not maintained by IPNI)

I understand the need to specialize in one form of names, but “formally published” means that without a useful synonym list, the general public bears an additional burden in accessing publicly funded research results.

Even with a synonym list there is an additional burden: you have to look up terms in the list, then read the text with that understanding, and then go back to the synonym list again.

What would dramatically increase public access to publicly funded research would be a specialized synonym list for publications that transposes the jargon in articles into selected sets of synonyms. It would not be as precise or grammatical as the original, but it would allow the reading public to get a sense of even very technical research.
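A back-of-the-envelope sketch of the idea: keep a small vernacular-to-binomial mapping and rewrite the user’s query before it ever reaches a “formal names only” service like IPNI. Building and maintaining the mapping is, of course, the real work.

    SYNONYMS = {
        "peyote": "Lophophora williamsii",
        "maize": "Zea mays",
        "bracken": "Pteridium aquilinum",
    }

    def rewrite_query(q):
        return SYNONYMS.get(q.strip().lower(), q)

    print(rewrite_query("Peyote"))   # -> Lophophora williamsii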

That could be a way to hitch topic maps to the access to publicly funded data band wagon.

Thoughts?

I first saw this in a tweet by Bill Baker.

* A couple of other fun facts from Wikipedia on Peyote: 1. Its conservation status is listed as “apparently secure,” and 2. Wikipedia has photos of Peyote “in the wild.” I suppose saying “Peyote growing in a pot” would raise too many questions.

March 16, 2014

Metaphysicians

Filed under: Science — Patrick Durusau @ 7:50 pm

Metaphysicians: Sloppy researchers beware. A new institute has you in its sights

From the post:

“WHY most published research findings are false” is not, as the title of an academic paper, likely to win friends in the ivory tower. But it has certainly influenced people (including journalists at The Economist). The paper it introduced was published in 2005 by John Ioannidis, an epidemiologist who was then at the University of Ioannina, in Greece, and is now at Stanford. It exposed the ways, most notably the overinterpreting of statistical significance in studies with small sample sizes, that scientific findings can end up being irreproducible—or, as a layman might put it, wrong.

Dr Ioannidis has been waging war on sloppy science ever since, helping to develop a discipline called meta-research (ie, research about research). Later this month that battle will be institutionalised, with the launch of the Meta-Research Innovation Centre at Stanford.

METRICS, as the new laboratory is to be known for short, will connect enthusiasts of the nascent field in such corners of academia as medicine, statistics and epidemiology, with the aim of solidifying the young discipline. Dr Ioannidis and the lab’s co-founder, Steven Goodman, will (for this is, after all, science) organise conferences at which acolytes can meet in the world of atoms, rather than just online. They will create a “journal watch” to monitor scientific publishers’ work and to shame laggards into better behaviour. And they will spread the message to policymakers, governments and other interested parties, in an effort to stop them making decisions on the basis of flaky studies. All this in the name of the centre’s nerdishly valiant mission statement: “Identifying and minimising persistent threats to medical-research quality.”

Someone after my own heart! 😉

I spent a large portion of today proofing standards drafts and the only RFC that was cited correctly was one that I wrote for the template they were using. (Hence the low blogging output for today.)

It really isn’t that difficult to check references, to not cite obsoleted materials, to cite relevant materials, etc.

Or, at least I don’t think that it is.

You?

Hadoop Alternative Hydra Re-Spawns as Open Source

Filed under: Hadoop,Hydra,Interface Research/Design,Stream Analytics — Patrick Durusau @ 7:31 pm

Hadoop Alternative Hydra Re-Spawns as Open Source by Alex Woodie.

From the post:

It may not have the name recognition or momentum of Hadoop. But Hydra, the distributed task processing system first developed six years ago by the social bookmarking service maker AddThis, is now available under an open source Apache license, just like Hadoop. And according to Hydra’s creator, the multi-headed platform is very good at some big data tasks that the yellow pachyderm struggles with–namely real-time processing of very big data sets.

Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis (formerly Clearspring), the company that develops the Web server widgets that allow visitors to easily share something via their Twitter, Facebook, Pinterest, Google+, or Instagram accounts.

When AddThis started scaling up its business in the mid-2000s, it got flooded with data about what users were sharing. The company needed a scalable, distributed system that could deliver real-time analysis of that data to its customers. Hadoop wasn’t a feasible option at that time. So it built Hydra instead.

So, what is Hydra? In short, it’s a distributed task processing system that supports streaming and batch operations. It utilizes a tree-based data structure to store and process data across clusters with thousands of individual nodes. It features a Linux-based file system, which makes it compatible with ext3, ext4, or even ZFS. It also features a job/cluster management component that automatically allocates new jobs to the cluster and rebalance existing jobs. The system automatically replicates data and handles node failures automatically.

The tree-based structure allows it to handle streaming and batch jobs at the same time. In his January 23 blog post announcing that Hydra is now open source, Chris Burroughs, a member of AddThis’ engineering department, provided this useful description of Hydra: “It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).”

To learn a lot more about Hydra, see its GitHub page.

Another candidate for “real-time processing of very big data sets.”

The reflex is to applaud improvements in processing speed. But what sort of problems require that kind of speed? I know the usual suspects, modeling the weather, nuclear explosions, chemical reactions, but at some point, the processing ends and a human reader has to comprehend the results.

Better to get information to the human reader sooner rather than later, but there is a limit to the speed at which a user can understand the results of a computational process.

From a UI perspective, what research is there on how fast/slow information should be pushed at a user?

Could make the difference between an app that is just annoying and one that is truly useful.

I first saw this in a tweet by Joe Crobak.

March 15, 2014

Saxon/C

Filed under: Saxon,XML,XQuery,XSLT — Patrick Durusau @ 9:41 pm

Saxon/C by Michael Kay.

From the webpage:

Saxon/C is an alpha release of Saxon-HE on the C/C++ programming platform. APIs are offered currently to run XSLT 2.0 and XQuery 1.0 from C/C++ or PHP applications.

BTW, Michael’s Why XSLT and XQuery? page reports that Saxon is passing more than 25,000 tests for XQuery 3.0.

If you are unfamiliar with either Saxon or Michael Kay, you need to change to a department that is using XML.

How Gmail Onboards New Users

Filed under: Advertising,Marketing — Patrick Durusau @ 9:28 pm

How Gmail Onboards New Users

From the post:

After passing Hotmail in 2012 as the world’s #1 email service with a sorta-impressive 425 million users(!), it can only be assumed that they’ve grown in the years since. Wanna see how it’s done in Gmail town?

A great set of sixty-nine (69) slides that point out how GMail has treated new users.

In a short phrase: Better than anyone else. (full stop)

You may have the “best” solution or the lower cost solution or whatever. If you don’t get users to stay long enough to realize that, well, you will soon be doing something else.

Gmail’s approach won’t work as a cookie-cutter design for you, but its lessons can be adapted.

I first saw this in a tweet by Fabio Catapano

SharePoint Conference 2014

Filed under: Microsoft,SharePoint — Patrick Durusau @ 9:21 pm

The Ultimate Script to download SharePoint Conference 2014 Videos AND slides! by Vlad Catrinescu.

From the post:

After everyone posted about 10 script versions to download the SharePoint Conference 2014 videos I decided to add some extra value before releasing mine! This is what my script does:

  • Downloads all the SPC14 Sessions and Slides
  • Groups them by folders
  • Makes sure no errors come up due to Illegal File names.
  • If you stop the script and restart in the middle, it will start where it left off and not from beginning.

The Total size will be a bit under 70GB. (emphasis added)

I’m always looking for scripts that will help you collect data and this sounded interesting.

Well, until I read it’s about 70GB of presentations/videos on SharePoint! 😉

Still, I suppose it will be useful for data mining about SharePoint.

And it should give you a good idea of what the baseline is for SharePoint-like services.

(All teasing aside, what SharePoint attempts to address is a hard problem. Poor project design and what I interpret as a desire to prevent data access are not the fault of SharePoint. Not that I am a SharePoint fan, but fair is fair.)

Publishing biodiversity data directly from GitHub to GBIF

Filed under: Biodiversity,Data Repositories,Open Access,Open Data — Patrick Durusau @ 9:01 pm

Publishing biodiversity data directly from GitHub to GBIF by Roderic D. M. Page.

From the post:

Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

In case you don’t know about GBIF (I didn’t):

The Global Biodiversity Information Facility (GBIF) is an international open data infrastructure, funded by governments.

It allows anyone, anywhere to access data about all types of life on Earth, shared across national boundaries via the Internet.

By encouraging and helping institutions to publish data according to common standards, GBIF enables research not possible before, and informs better decisions to conserve and sustainably use the biological resources of the planet.

GBIF operates through a network of nodes, coordinating the biodiversity information facilities of Participant countries and organizations, collaborating with each other and the Secretariat to share skills, experiences and technical capacity.

GBIF’s vision: “A world in which biodiversity information is freely and universally available for science, society and a sustainable future.”

Roderic summarizes his post saying:

what I’m doing here is putting data on GitHub and having GBIF harvest that data directly from GitHub. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.

The process isn’t perfect but unlike disciplines where data sharing is the exception rather than the rule, the biodiversity community is trying to improve its sharing of data.

Every attempt at improvement will not succeed but lessons are learned from every attempt.

Kudos to the biodiversity community for a model that other communities should follow!

Words as Tags?

Filed under: Linguistics,Text Mining,Texts,Word Meaning — Patrick Durusau @ 8:46 pm

Wordcounts are amazing. by Ted Underwood.

From the post:

People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?“

Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” on which to apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.

Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”
….

What a delightful insight!

When we say text is “unstructured” what we really mean is that something as dumb as a computer sees no structure in the text.

A human reader, even a 5 or 6 year old reader, sees lots of structure in a text, and meaning too.

Rather than trying to “teach” computers to read, perhaps we should use computers to facilitate reading by those who already can.

Yes?
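As for the “mere word-counting” itself, it fits in a few lines (the filename is a placeholder):

    from collections import Counter

    text = open("novel.txt", encoding="utf-8").read().lower()
    counts = Counter(word.strip('.,;:!?"()') for word in text.split())
    print(counts.most_common(20))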

I first saw this in a tweet by Matthew Brook O’Donnell.

Coconut Headphones: Why Agile Has Failed

Filed under: Programming,Project Management — Patrick Durusau @ 6:10 pm

Coconut Headphones: Why Agile Has Failed by Mike Hadlow.

From the post:

The 2001 agile manifesto was an attempt to replace rigid, process and management heavy, development methodologies with a more human and software-centric approach. They identified that the programmer is the central actor in the creation of software, and that the best software grows and evolves organically in contact with its users.

My first real contact with the ideas of agile software development came from reading Bob Martin’s book ‘Agile Software Development’. I still think it’s one of the best books about software I’ve read. It’s a tour-de-force survey of modern (at the time) techniques; a recipe book of how to create flexible but robust systems. What might surprise people familiar with how agile is currently understood, is that the majority of the book is about software engineering, not management practices.
….

Something to get your blood pumping on a weekend. 😉

We all have horror stories to tell about various programming paradigms. For “agile” programming, I remember a lead programmer saying a paragraph in an email was sufficient documentation for a plan to replace a content management system with a custom system written on top of subversion. Need I say he had management support?

Fortunately that project died but not through any competence of management. But in all fairness, that wasn’t “agile programming” in any meaningful sense of the phrase.

If you think about it, just about any programming paradigm will yield good results, if you have good management and programmers. With incompetent management or programmers, the best programming paradigm in the world will not yield a good result.

Programming paradigms have the same drawback as religion: people are essential to both.

A possible explanation for high project failure rates and religions that are practiced in word and not deed.

Yes?
