Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

September 30, 2015

“No Sir, Mayor Daley no longer dines here. He’s dead sir.” The Limits of Linked Data

Filed under: Linked Data,OCLC — Patrick Durusau @ 8:57 pm

That line from the Blues Brothers came to mind when I read OCLC to launch linked data pilot with seven leading libraries, which reads in part:

DUBLIN, Ohio, 11 September 2015 – OCLC is working with seven leading libraries in a pilot program designed to learn more about how linked data will influence library workflows in the future.

The Person Entity Lookup pilot will help library professionals reduce redundant data by linking related sets of person identifiers and authorities. Pilot participants will be able to surface WorldCat Person entities, including 109 million brief descriptions of authors, directors, musicians and others that have been mined from WorldCat, the world’s largest resource of library metadata.

By submitting one of a number of identifiers, such as VIAF, ISNI and LCNAF, the pilot service will respond with a WorldCat Person identifier and mappings to additional identifiers for the same person.

The pilot will begin in September and is expected to last several months. The seven participating libraries include Cornell University, Harvard University, the Library of Congress, the National Library of Medicine, the National Library of Poland, Stanford University and the University of California, Davis.
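
To make the mechanics concrete, here is a toy sketch (in Python) of the identifier cross-walk the announcement describes: submit a VIAF, ISNI or LCNAF identifier and get back a WorldCat Person identifier plus mappings to the other identifiers for the same person. The identifiers, field names and data structure below are invented for illustration; the pilot’s actual API is not public.

```python
# Toy illustration of the identifier cross-walk described in the pilot:
# submit one identifier (VIAF, ISNI, LCNAF) and get back a WorldCat
# Person identifier plus mappings to the other identifiers for the same
# person. All identifiers and the data structure are made up.

SAMPLE_PERSONS = [
    {
        "worldcat_person": "wcp:0000001",
        "viaf": "viaf:123456789",
        "isni": "isni:0000000123456789",
        "lcnaf": "lcnaf:n79021164",
    },
]

# Build a lookup from every known identifier to its person record.
INDEX = {
    value: person
    for person in SAMPLE_PERSONS
    for key, value in person.items()
    if key != "worldcat_person"
}

def lookup(identifier):
    """Return the WorldCat Person id and sibling identifiers, or None."""
    person = INDEX.get(identifier)
    if person is None:
        return None
    return {
        "worldcat_person": person["worldcat_person"],
        "mappings": {k: v for k, v in person.items() if k != "worldcat_person"},
    }

print(lookup("viaf:123456789"))
```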

If you happen to use one of the known identifiers, your subject, like Mayor Daley, is one of the 109 million authors, directors, musicians, etc., and you are at one of the seven participating libraries, then you’re in luck!

If your subject is one of the 253 million vehicles on U.S. roads, or one of the 123.4 million people employed full time in the U.S., or one or more of the 73.9 billion credit card transactions in 2012, or one of the 3 billion cellphone calls made every day in the U.S., then linked data and the OCLC pilot project will leave you high and dry. (Feel free to add in subjects of interest to you that aren’t captured by linked data.)

It’s not a bad pilot project but it does serve to highlight the primary weakness of linked data: It doesn’t include any subjects of interest to you.

You want to talk about your employees, your products, your investments, your trades, etc.

That’s understandable. That will drive your ROI from semantic technologies.

OCLC linked data can help you with dead people and some famous ones, but that doesn’t begin to satisfy your needs.

What you need is a semantic technology that puts the fewest constraints on you and at the same time enables you to talk about your subjects, using your terms.

Interested?

Tracking Congressional Whores

Filed under: Government,Graphs,Neo4j — Patrick Durusau @ 2:06 pm

Introducing legis-graph – US Congressional Data With Govtrack and Neo4j by William Lyon.

From the post:

Interactions among members of any large organization are naturally a graph, yet the tools we use to collect and analyze data about these organizations often ignore the graphiness of the data and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time to analyze the data. Collaboration networks are a perfect example. So let’s focus on one of the most powerful collaboration networks in the world: the US Congress.

Introducing legis-graph: US Congress in a graph

The goal of legis-graph is to provide a way of converting the data provided by Govtrack into a rich property-graph model that can be inserted into Neo4j for analysis or integrating with other datasets.

The code and instructions are available in this Github repo. The ETL workflow works like this:

  1. A shell script is used to rsync data from Govtrack for a specific Congress (i.e. the 114th Congress). The Govtrack data is a mix of JSON, CSV, and YAML files. It includes information about legislators, committees, votes, bills, and much more.
  2. To extract the pieces of data we are interested in for legis-graph a series of Python scripts are used to extract and transform the data from different formats into a series of flat CSV files.
  3. The third component is a series of Cypher statements that make use of LOAD CSV to efficiently import this data into Neo4j.

To get started with legis-graph in Neo4j you can follow the instructions here. Alternatively, a Neo4j data store dump is available here for use without having to execute the import scripts. We are currently working to streamline the ETL process so this may change in the future, but any updates will appear in the Github README.

This project began during preparation for a graph database focused Meetup presentation. George Lesica and I wanted to provide an interesting dataset in Neo4j for new users to explore.
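
Step 3 of that workflow leans on Cypher’s LOAD CSV. As a rough sketch of what driving such an import from Python might look like: the file name, labels and properties are illustrative rather than the project’s actual schema, and the current official Neo4j Python driver shown here post-dates the 2015-era tooling (py2neo or the REST API would have been the contemporary route).

```python
# Minimal sketch: run a Cypher LOAD CSV statement from Python to pull a
# flat CSV of legislators into Neo4j. Labels, properties, and the CSV
# location are illustrative only -- see the legis-graph repo for the
# real statements.
from neo4j import GraphDatabase  # assumes the official Neo4j Python driver

LOAD_LEGISLATORS = """
LOAD CSV WITH HEADERS FROM 'file:///legislators.csv' AS row
MERGE (l:Legislator {thomasID: row.thomasID})
SET l.firstName = row.firstName,
    l.lastName  = row.lastName,
    l.party     = row.party
MERGE (s:State {code: row.state})
MERGE (l)-[:REPRESENTS]->(s)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # adjust for your install
with driver.session() as session:
    session.run(LOAD_LEGISLATORS)
driver.close()
```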

Whenever the U.S. Congress is mentioned, I am reminded of Obi-Wan Kenobi’s line about Mos Eisley:

You will never find a more wretched hive of scum and villainy. We must be cautious.

The data model for William’s graph:

[Image: legis-graph data model diagram]

As you can see from the diagram above, the data model is rich and captures quite a bit of information about the actions of legislators, committees, and bills in Congress. Information about what properties are currently included is available here.

A great starting place that can be extended and enriched.

In terms of the data model, note that “subject” is now the title of a bill. Definitely could use some enrichment there.

Another association worth adding for each bill: “who_benefits.”

If you are really ambitious, try developing information on what individuals or groups make donations to the legislator on an annual basis.

To clear noise out of the data set, drop everyone who doesn’t contribute annually and, even then, anyone whose annual total is less than $5,000. Remember that members of Congress depend on regular infusions of cash, so erratic or one-time donors may get a holiday card but they are not on the ready-access list.

The need for annual cash is one reason why episodic movements may make the news but rarely make a difference. Making a difference requires years of steady funding, grooming of members of Congress, and steadily improving your access and influence.

Don’t be disappointed if you can “prove” member of Congress X is in the pocket of organization/corporation Y or Z and nothing happens. More likely than not, such proof will increase their credibility for fundraising.

As Leonard Cohen says (Everybody Knows):

Everybody knows the dice are loaded, Everybody rolls with their fingers crossed

September 29, 2015

BBC News Labs Project List

Filed under: Journalism,News,Reporting — Patrick Durusau @ 2:55 pm

Not only does BBC News Labs have a cool logo:

[Image: BBC News Labs logo]

They have an impressive list of projects as well:

BBC News Labs Project List

  • News Switcher – News Switcher lets BBC journalists easily switch between the different editions of the News website
  • Pool of Video – BBC News Labs is looking into some new research questions based on AV curation.
  • Suggestr – connecting the News industry with BBC tags – This prototype, by Outlandish.com for BBC News Labs, is about enabling local News organisations to tag with BBC News tags
  • Linked data on the TV – LDPTV is a project for surfacing more News content via smart TVs
  • #newsHACK – Industry Collaboration – #newsHACK is all about intense multi-discipline collaboration on Journalism innovation.
  • BBC Rewind Collaboration – The News Labs team is working with BBC Rewind – a project liberating the BBC archive – to share tech and approaches.
  • The News Slicer – The News Slicer takes MOS running orders and transcripts from BBC Broadcast Playout, then auto tags and auto segments the stories.
  • News Rig – The future of multilingual workflow – A prototype workflow for reversioning content into multiple languages, and controlling an "on demand" multilingual news service.
  • Atomised News – with BBC R&D – News Labs has been working with BBC R&D to explore a mobile-focussed, structured breadth & depth approach to News experiences
  • Connected Studio – World Service – A programme of international innovation activities, aimed at harnessing localised industry talent to explore new opportunities.
  • Language Technology – BBC News Labs started a stream of Language Technology projects in 2014, in order to scale our storytelling globally
  • Blue Labs – BBC Blue Room and News Labs are working together to more efficiently demonstrate innovation opportunities with emerging consumer Tech.
  • Immersive News – 360 Video & VR – We are looking into the craft and applications around 360 video filming and VR for Immersive News.
  • The Journalist Toolbox – In June 2015, a hack team at #newsHACK created a concept called The Journalist Toolbox, proposing that newsroom content publishing & presentation tools needed to be more accessible for journalists.
  • SUMMA – Scalable Understanding of Multilingual MediA – This Big Data project will leverage machines to do the heavy lifting in multilingual media monitoring.
  • Window on the Newsroom – Window on the Newsroom is a prototype discovery interface to help Journalists look quickly across Newsroom system by story or metadata properties
  • The Juicer – The Juicer takes news content, automatically semantically tags it, then provides a fully featured API to access this content and data.

This list is current as of 29 September 2015 so check with the project page for new and updated project information from the BBC.

The BBC is a British institution that merits admiration. Counting English, its content is available in twenty-eight (28) languages.

Alas, “breathless superlative American English,” a staple of the Fox network, is not one of them.

September 28, 2015

Discovering Likely Mappings between APIs using Text Mining [Likely Merging?]

Filed under: Programming,Text Mining — Patrick Durusau @ 8:23 pm

Discovering Likely Mappings between APIs using Text Mining by Rahul Pandita, Raoul Praful Jetley, Sithu D Sudarsan, Laurie Williams.

Abstract:

Developers often release different versions of their applications to support various platform/programming-language application programming interfaces (APIs). To migrate an application written using one API (source) to another API (target), a developer must know how the methods in the source API map to the methods in the target API. Given a typical platform or language exposes a large number of API methods, manually writing API mappings is prohibitively resource-intensive and may be error prone. Recently, researchers proposed to automate the mapping process by mining API mappings from existing codebases. However, these approaches require as input a manually ported (or at least functionally similar) code across source and target APIs. To address the shortcoming, this paper proposes TMAP: Text Mining based approach to discover likely API mappings using the similarity in the textual description of the source and target API documents. To evaluate our approach, we used TMAP to discover API mappings for 15 classes across: 1) Java and C# API, and 2) Java ME and Android API. We compared the discovered mappings with state-of-the-art source code analysis based approaches: Rosetta and StaMiner. Our results indicate that TMAP on average found relevant mappings for 57% more methods compared to previous approaches. Furthermore, our results also indicate that TMAP on average found exact mappings for 6.5 more methods per class with a maximum of 21 additional exact mappings for a single class as compared to previous approaches.

From the introduction:

Our intuition is: since the API documents are targeted towards developers, there may be an overlap in the language used to describe similar concepts that can be leveraged.

There are a number of insights in this paper but this statement of intuition alone is enough to justify reading the paper.

What if instead of API documents we were talking about topics that had been written for developers? Isn’t it fair to assume that concepts would have the same or similar vocabularies?

The evidence from this paper certainly suggests that to be the case.
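
As a minimal sketch of the kind of description matching TMAP relies on, here is TF-IDF plus cosine similarity over a pair of invented method descriptions. This is the general technique, not the authors’ implementation:

```python
# Toy sketch: match methods across two APIs by the similarity of their
# textual descriptions. The descriptions here are invented; TMAP's own
# pipeline is more involved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_api = {
    "java.util.ArrayList.add": "Appends the specified element to the end of this list.",
    "java.util.ArrayList.size": "Returns the number of elements in this list.",
}
target_api = {
    "System.Collections.ArrayList.Add": "Adds an object to the end of the ArrayList.",
    "System.Collections.ArrayList.Count": "Gets the number of elements contained in the ArrayList.",
}

docs = list(source_api.values()) + list(target_api.values())
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
n_source = len(source_api)
scores = cosine_similarity(tfidf[:n_source], tfidf[n_source:])

for i, src in enumerate(source_api):
    best = scores[i].argmax()
    print(src, "->", list(target_api)[best], round(float(scores[i][best]), 3))
```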

Of course, merging rules would have to allow for “likely” merging of topics, which could then be refined by readers.

Readers who hopefully contribute more information to make “likely” merging more “precise.” (At least in their view.)

That’s one of the problems with most semantic technologies isn’t it?

“Precision” can only be defined from a point of view, which by definition varies from user to user.

What would it look like to allow users to determine their desired degree of semantic precision?

Suggestions?

Neo4j 2.3.0 Milestone 3 Release

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:55 pm

Neo4j 2.3.0 Milestone 3 Release by Andreas Kollegger.

From the post:

Great news for all those Neo4j alpha-developers out there: The third (and final) milestone release of Neo4j 2.3 is now ready for download! Get your copy of Neo4j 2.3.0-M03 here.

Quick clarification: Milestone releases like this one are for development and experimentation only, and not all features are in their finalized form. Click here for the most fully stable version of Neo4j (2.2.5).

So, what cool new features made it into this milestone release of Neo4j? Let’s dive in.

If you want to “kick the tires” on Neo4j 2.3.0 before the production version arrives, now would be a good time.

Andreas covers new features in Neo4j 2.3.0, such as triadic selection, constraints, deletes and others.
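
If “triadic selection” is unfamiliar, it targets the classic “friends of friends who are not already my friends” query shape. Here is a hedged sketch of that pattern, with generic labels and relationship types, again using the current official Python driver rather than 2015-era tooling:

```python
# Sketch of the friend-of-friend recommendation pattern that Neo4j's
# triadic selection optimization is aimed at. Labels and relationship
# types are placeholders, not from any particular dataset.
from neo4j import GraphDatabase

RECOMMEND = """
MATCH (me:Person {name: $name})-[:KNOWS]->(friend)-[:KNOWS]->(candidate)
WHERE NOT (me)-[:KNOWS]->(candidate) AND candidate <> me
RETURN candidate.name AS suggestion, count(friend) AS mutual
ORDER BY mutual DESC
LIMIT 10
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(RECOMMEND, name="Alice"):
        print(record["suggestion"], record["mutual"])
driver.close()
```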

As I read Andreas’ explanation of “triadic selection” and its role in recommendations, I started to wonder how effective recommendations are outside of movies, books, fast food, etc.

The longer I thought about it, the harder I was pressed to come up with a buying decision that isn’t influenced by recommendations, either explicit or implicit.

Can a “rational” market exist if we are all more or less like lemmings when it comes to purchasing (and other) decisions?

You don’t have to decide that question to take Neo4j 2.3.0 for a spin.

Balisage 2016!

Filed under: Conferences,XML — Patrick Durusau @ 10:52 am

From my inbox this morning:

Mark Your Calendars: Balisage 2016
 - pre-conference symposium 1 August 2016
 - Balisage: The Markup Conference 2-5 August 2016

Bethesda North Marriott Hotel & Conference Center
5701 Marinelli Road  
North Bethesda, Maryland  20852
USA 

This much advance notice makes me think someone had a toe-curling good time at Balisage 2015.

Was Bill Clinton there? 😉

Attend Balisage 2016, look for Bill Clinton or someone else with a silly grin on their faces!

September 27, 2015

What Does Probability Mean in Your Profession? [Divergences in Meaning]

Filed under: Mathematics,Probability,Subject Identity — Patrick Durusau @ 9:39 pm

What Does Probability Mean in Your Profession? by Ben Orlin.

Impressive drawings that illustrate the divergence in meaning of “probability” for various professions.

I’m not sold on the “actual meaning” drawing because if everyone in a discipline understands “probability” to mean something else, on what basis can you argue for the “actual meaning?”

If I am reading a paper by someone who subscribes to a different meaning than the claimed “actual” one, and I read it assuming the “actual” meaning, then I am going to reach erroneous conclusions about their paper. Yes?

That is, in order to understand a paper, I have to understand the words as the author is using them. Yes?

If I understand “democracy and freedom” to mean “serves the interest of U.S.-based multinational corporations,” then calls for “democracy and freedom” in other countries aren’t going to impress me all that much.

Enjoy the drawings!

1150 Free Online Courses from Top Universities (update) [Collating Content]

Filed under: Collation,Collocation,Education — Patrick Durusau @ 9:06 pm

1150 Free Online Courses from Top Universities (update).

From the webpage:

Get 1150 free online courses from the world’s leading universities — Stanford, Yale, MIT, Harvard, Berkeley, Oxford and more. You can download these audio & video courses (often from iTunes, YouTube, or university web sites) straight to your computer or mp3 player. Over 30,000 hours of free audio & video lectures, await you now.

An ever improving resource!

As of last January (2015), it listed 1100 courses.

Another fifty courses have been added and I discovered a course in Hittite!

The same problem with collating content across resources that I mentioned for data science books obtains here as you take courses in the same discipline or read primary/secondary literature.

What if I find references that are helpful in the Hittite course in the image PDFs of the Chicago Assyrian Dictionary? How do I combine that with the information from the Hittite course so if you take Hittite, you don’t have to duplicate my search?

That’s the ticket isn’t it? Not having different users performing the same task over and over again? One user finds the answer and for all other users, it is simply “there.”

Quite a different view of the world of information than the repetitive, non-productive, ad-laden and often irrelevant results from the typical search engine.

The World’s First $9 Computer is Shipping Today!

Filed under: Computer Science — Patrick Durusau @ 8:43 pm

The World’s First $9 Computer is Shipping Today! by Khyati Jain.

From the post:

Remember Project: C.H.I.P. ?

A $9 Linux-based, super-cheap computer that raised some $2 Million beyond a pledge goal of just $50,000 on Kickstarter will be soon in your pockets.

Four months ago, Dave Rauchwerk, CEO of Next Thing Co., utilized the global crowd-funding corporation ‘Kickstarter’ for backing his project C.H.I.P., a fully functioning computer that offers more than what you could expect for just $9.

See Khyati’s post for technical specifications.

Security by secrecy is meaningless when potential hackers (ages 14-64) number 4.8 billion.

With enough hackers, all bugs can be found.

September 26, 2015

Writing “Python Machine Learning”

Filed under: Books,Machine Learning,Python — Patrick Durusau @ 8:46 pm

Writing “Python Machine Learning” by Sebastian Raschka.

From the post:

It’s been about time. I am happy to announce that “Python Machine Learning” was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing “Python Machine Learning” really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

A delightful tale for those of us who have authored books and an inspiration (with some practical suggestions) for anyone who hopes to write a book.

Sebastian’s productivity hints will ring familiar for those with similar habits and bear study by those who hope to become more productive.

Sebastian never comes out and says it but his writing approach breaks each stage of the book into manageable portions. It is far easier to say (and do) “write an outline” than to “write the complete and fixed outline for an almost 500 page book.”

If the task is too large, say producing a complete and immutable outline in one pass, you won’t build up enough momentum to make a reasonable start.

After reading Sebastian’s post, what book are you thinking about writing?

Free Data Science Books (Update, + 53 books, 117 total)

Filed under: Books,Data Science — Patrick Durusau @ 8:34 pm

Free Data Science Books (Update).

From the post:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

While every single book in this list is provided for free, if you find any particularly helpful consider purchasing the printed version. The authors spent a great deal of time putting these resources together and I’m sure they would all appreciate the support!

Note: Updated books as of 9/21/15 are post-fixed with an asterisk (*). Scroll to updates

Great news but also more content.

Unlike big data, you have to read this content in detail to obtain any benefit from it.

And books in the same area are going to have overlapping content as well as some unique content.

Imagine how useful it would be to compose a free standing work with the “best” parts from several works.

Copyright law would be a larger barrier, but no more so than if you cut-and-pasted your own version for personal use.

If such an approach could be made easy enough, the resulting value would drown out dissenting voices.

I think PDF is the principal practical barrier.

Do you suspect others?

I first saw this in a tweet by Kirk Borne.

Data Science Glossary

Filed under: Data Science,Glossary — Patrick Durusau @ 8:10 pm

Data Science Glossary by Bob DuCharme.

From the about page:

Terms included in this glossary are the kind that typically come up in data science discussions and job postings. Most are from the worlds of statistics, machine learning, and software development. A Wikipedia entry icon links to the corresponding Wikipedia entry, although these are often quite technical. Email corrections and suggestions to bob at this domain name.

Is your favorite term included?

You can follow Bob on Twitter @bobdc.

Or read his blog at: bobdc.blog.

Thanks Bob!

September 25, 2015

Attention Law Students: You Can Change the Way People Interact with the Law…

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 7:55 pm

Attention Law Students: You Can Change the Way People Interact with the Law…Even Without a J.D. by Katherine Anton.

From the post:

A lot of people go to law school hoping to change the world and make their mark on the legal field. What if we told you that you could accomplish that, even as a 1L?

Today we’re launching the WeCite contest: an opportunity for law students to become major trailblazers in the legal field. WeCite is a community effort to explain the relationship between judicial cases, and will be a driving force behind making the law free and understandable.

To get involved, all you have to do is go to http://www.casetext.com/wecite and choose the treatment that best describes a newer case’s relationship with an older case. Law student contributors, as well as the top contributing schools, will be recognized and rewarded for their contributions to WeCite.

Read on to learn why WeCite will quickly become your new favorite pastime and how to get started!

Shepard’s Citations began publication in 1873 and, by modern times, had such an insurmountable lead that the cost of creating a competing service was a barrier to anyone else entering the field.

To be useful to lawyers, a citation index can’t index just some of the citations; it has to index all of them.

The WeCite project, based on crowd-sourcing, is poised to demonstrate that creating a public law citation index is doable.

While the present project is focused on law students, I am hopeful that the project opens up for contributions from more senior survivors of law school, practicing or not.

Three Reasons You May Not Want to Learn Clojure [One of these reasons applies to XQuery]

Filed under: Clojure,Functional Programming,XQuery — Patrick Durusau @ 7:34 pm

Three Reasons You May Not Want to Learn Clojure by Mark Bastian.

From the post:

I’ve been coding in Clojure for over a year now and not everything is unicorns and rainbows. Whatever language you choose will affect how you think and work and Clojure is no different. Read on to see what some of these effects are and why you might want to reconsider learning Clojure.

If you are already coding in Clojure, you will find this amusing.

If you are not already coding in Clojure, you may find this compelling.

I won’t say which one of these reasons applies to XQuery, at least not today. Watch this blog on Monday of next week.

September 24, 2015

Apache Lucene 5.3.1, Solr 5.3.1 Available

Filed under: Lucene,Solr — Patrick Durusau @ 8:13 pm

Apache Lucene 5.3.1, Solr 5.3.1 Available

From the post:

The Lucene PMC is pleased to announce the release of Apache Lucene 5.3.1 and Apache Solr 5.3.1

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/5.3.1
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/5.3.1

Highlights of this Lucene release include:

Bug Fixes

  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

Highlights of this Solr release include:

Bug Fixes

  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes “*” now works properly in ‘web’ mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with ‘/admin/’ prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter’s $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

Time to upgrade!

Enjoy!

Data Analysis for the Life Sciences… [No Ur-Data Analysis Book?]

Filed under: Data Analysis,Life Sciences,Science — Patrick Durusau @ 7:42 pm

Data Analysis for the Life Sciences – a book completely written in R markdown by Rafael Irizarry.

From the post:

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

Have you ever wondered about the growing proliferation of data analysis books?

The absence of one Ur-Data Analysis book that everyone could read and use?

I have a longer post coming on this idea, but if each discipline needs its own view of data analysis, is it really surprising that no one system of semantics satisfies all communities?

In other words, is the evidence of heterogeneous semantics so strong that we should abandon attempts at uniform semantics and focus on communicating across systems of semantics?

I’m sure there are other examples where every niche has its own vocabulary: tables in relational databases or column headers in spreadsheets, for example.

What is your favorite example of heterogeneous semantics?

Assuming heterogeneous semantics are here to stay (they have been around since the start of human to human communication, possibly earlier), what solution do you suggest?

I first saw this in a tweet by Christophe Lalanne.

Guesstimating the Future

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:07 pm

I ran across some introductory slides on Neo4j with the line:

Forrester estimates that over 25% of enterprises will be using graph databases by 2017.

Well, Forrester also predicted that tablet sales would overtake laptop sales in 2015: Forrester: Tablet Sales Will Eclipse Laptop Sales by 2015.

You might want to check that prediction against: Laptop sales ‘stronger than ever’ versus tablets – PCR Retail Advisory Board.

The adage “It is difficult to make predictions, especially about the future” remains appropriate.

Neo4j doesn’t need lemming-like behavior among consumers of technology to make a case for itself.

Compare Neo4j and its query language, Cypher, to your use cases and I think you will agree.

September 23, 2015

A review of learning vector quantization classifiers

Filed under: Classifier — Patrick Durusau @ 8:46 pm

A review of learning vector quantization classifiers by David Nova, Pablo A. Estevez.

Abstract:

In this work we present a review of the state of the art of Learning Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates the most relevant LVQ approaches to date. The main concepts associated with modern LVQ approaches are defined. A comparison is made among eleven LVQ classifiers using one real-world and two artificial datasets.

From the introduction:

Learning Vector Quantization (LVQ) is a family of algorithms for statistical pattern classification, which aims at learning prototypes (codebook vectors) representing class regions. The class regions are defined by hyperplanes between prototypes, yielding Voronoi partitions. In the late 80’s Teuvo Kohonen introduced the algorithm LVQ1 [36, 38], and over the years produced several variants. Since their inception LVQ algorithms have been researched by a small but active community. A search on the ISI Web of Science in November, 2013, found 665 journal articles with the keywords “Learning Vector Quantization” or “LVQ” in their titles or abstracts. This paper is a review of the progress made in the field during the last 25 years.

Heavy sledding but if you want to review the development of a classification algorithm with a manageable history, this is a likely place to start.
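
If you want the flavor of the family before committing to the paper, here is a bare-bones LVQ1 training loop. This is my simplification of the textbook update rule, not code from the review:

```python
# Bare-bones LVQ1: move the nearest prototype toward a sample of the
# same class and away from a sample of a different class.
import numpy as np

def train_lvq1(X, y, prototypes, proto_labels, lr=0.05, epochs=20):
    """X: (n, d) samples; y: (n,) labels; prototypes: (k, d) initial codebook."""
    P = prototypes.astype(float).copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            # index of the nearest prototype (Euclidean distance)
            i = np.argmin(np.linalg.norm(P - x, axis=1))
            if proto_labels[i] == label:
                P[i] += lr * (x - P[i])   # attract
            else:
                P[i] -= lr * (x - P[i])   # repel
    return P

def predict(X, prototypes, proto_labels):
    idx = np.argmin(np.linalg.norm(
        prototypes[None, :, :] - X[:, None, :], axis=2), axis=1)
    return np.asarray(proto_labels)[idx]

# Tiny smoke test with two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
P0 = np.array([[0.5, 0.5], [2.5, 2.5]])
P = train_lvq1(X, y, P0, [0, 1])
print("training accuracy:", (predict(X, P, [0, 1]) == y).mean())
```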

Enjoy!

Fast k-NN search

Filed under: K-Nearest-Neighbors,Similarity — Patrick Durusau @ 8:36 pm

Fast k-NN search by Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Liang Wang, Teemu Roos, Jukka Corander.

Abstract:

Random projection trees have proven to be effective for approximate nearest neighbor searches in high dimensional spaces where conventional methods are not applicable due to excessive usage of memory and computational time. We show that building multiple trees on the same data can improve the performance even further, without significantly increasing the total computational cost of queries when executed in a modern parallel computing environment. Our experiments identify suitable parameter values to achieve accurate searches with extremely fast query times, while also retaining a feasible complexity for index construction.

Not a quick read but an important one if you want to use multiple dimensions for calculation of similarity or sameness between two or more topics.
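
To get a feel for the core trick: a random hyperplane splits the points, and a query only searches its own side. Here is a toy single-split version in NumPy; the paper builds full trees, and many of them:

```python
# Toy illustration of a single random-projection split for approximate
# nearest neighbor search. Real RP-tree methods recurse into a full tree
# and build many trees; this shows only the core idea.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 50))   # 10k points in 50 dimensions
query = rng.normal(size=50)

# Project everything onto one random direction and split at the median.
direction = rng.normal(size=50)
projections = data @ direction
threshold = np.median(projections)
same_side = (projections <= threshold) == ((query @ direction) <= threshold)

# Brute-force search only within the query's half of the data.
candidates = data[same_side]
dists = np.linalg.norm(candidates - query, axis=1)
k = 5
print(f"searched {len(candidates)} of {len(data)} points")
print("distances of the nearest candidates:", np.sort(dists)[:k].round(3))
```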

The technique requires you to choose a degree of similarity that works for your use case.

This paper makes a nice jumping-off point for discussing how much precision a particular topic map application needs. Absolute precision is possible, but only in a limited number of cases and, I suspect, at high cost.

For some use cases, such as searching for possible suspects in crimes, some lack of precision is necessary to build up a large enough pool of suspects to include the guilty parties.

Any examples of precision and topic maps that come to mind?

SymbolHound

Filed under: Search Engines — Patrick Durusau @ 8:01 pm

SymbolHound

From the about page:

SymbolHound is a search engine that doesn’t ignore special characters. This means you can easily search for symbols like &, %, and π. We hope SymbolHound will help programmers find information about their chosen languages and frameworks more easily.

SymbolHound was started by David Crane and Thomas Feldtmose while students at the University of Connecticut. Another project by them is Toned Ear, a website for ear training.

I first saw SymbolHound mentioned in a discussion of delimiter options for a future XQuery feature.

For syntax drafting you need to have SymbolHound on your toolbar, not just bookmarked.

Government Travel Cards at Casinos or Adult Entertainment Establishments

Filed under: Auditing,Government,Humor,Topic Maps — Patrick Durusau @ 7:38 pm

Audit of DoD Cardholders Who Used Government Travel Cards at Casinos or Adult Entertainment Establishments by Michael J. Roark, Assistant Inspector General, Contract Management and Payments, Department of Defense.

From the memorandum:

We plan to begin the subject audit in September 2015. The Senate Armed Services Committee requested this audit as a follow-on review of transactions identified in Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments,” May 19, 2015. Our objective is to determine whether DoD cardholders who used government travel cards at casinos and adult entertainment establishments for personal use sought or received reimbursement for the charges. In addition, we will determine whether disciplinary actions have been taken in cases of personal use and if the misuse was reported to the appropriate security office. We will consider suggestions from management on additional or revised objectives.

This project is a follow up to: Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments” (May 19, 2015), which summarizes its findings as:

We are providing this report for your review and comment. We considered management comments on a draft of this report when preparing the final report. DoD cardholders improperly used their Government Travel Charge Card for personal use at casinos and adult entertainment establishments. From July 1, 2013, through June 30, 2014, DoD cardholders had 4,437 transactions totaling $952,258, where they likely used their travel cards at casinos for personal use and had 900 additional transactions for $96,576 at adult entertainment establishments. We conducted this audit in accordance with generally accepted government auditing standards.

Let me highlight that for you:

July 1, 2013 through June 30, 2014, DoD cardholders:

4,437 transactions at casinos for $952,258

900 transactions at adult entertainment establishments for $96,576

Are lap dances that cheap? 😉

Almost no one goes to a casino or adult entertainment establishment alone, so topic maps would be a perfect fit for finding “associations” between DoD personnel.

The current project is to track the outcome of the earlier report, that is, what actions, if any, resulted.

What do you think?

Will the DoD personnel claim they were doing off the record surveillance of suspected information leaks? Or just checking their resistance to temptation?

Before I forget, here is the breakdown by service (from the May 19, 2015 report, page 6):

[Image: breakdown of casino and adult entertainment transactions by service, from the May 19, 2015 report, page 6]

I don’t know what to make of the distribution of “adult transactions” between the services.

Suggestions?

5.6 Million Fingerprints Stolen in OPM Hack [Still No Competence or Transparency]

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 7:07 pm

5.6 Million Fingerprints Stolen in OPM Hack by Chris Brook.

The management follies continue at the Office of Personnel Management (OPM), which I mentioned the other day had declined to use modern project management practices.

A snippet from Chris’ post, which you should read in its entirety:


OPM said at the beginning of September that it would begin sending letters to victims of the breach “in a few weeks,” yet the agency’s recent statement reiterates that an interagency team is still working in tandem with the Department of Defense to prep the letters.

“An interagency team will continue to analyze and refine the data as it prepares to mail notification letters to impacted individuals,” Schumach wrote.

Did you read between the lines to intuit the cause of the delay in letter preparation?

The next big shoe to drop, either on prodding by Congress or news media:

The Office of Personnel Management doesn’t have current addresses on all 21.5 million government workers.

When a data breach occurs at a major bank, credit card company, etc., sending the breach letter is a matter of composing it and hiring a mail house to do the mailing.

This is going on four months after OPM admitted the hack and still no letters?

I may be overestimating the competency of OPM management when it comes to letter writing but my bet would be on a lack of current addresses for a large portion of the employees impacted.

FYI, hiring former OPM staff has a name. It’s called assumption of risk.

Sharing Economy – Repeating the Myth of Code Reuse

Filed under: Government,Programming,Software — Patrick Durusau @ 4:47 pm

Bitcoin and sharing economy pave the way for new ‘digital state’ by Sophie Curtis.

Sophie quotes Cabinet Office minister Matthew Hancock MP as saying:

For example, he said that Cloud Foundry, a Californian company that provides platform-as-a-service technology, could help to create a code library for digital public services, helping governments around the world to share their software.

“Governments across the world need the same sorts of services for their citizens, and if we write them in an open source way, there’s no need to start from scratch each time,” he said.

“So if the Estonian government writes a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

However, he emphasised that this is about sharing programs, code and techniques – not about sharing data. Citizens’ personal data will remain the responsibility of the government in question, and will not be shared across borders, he said.

I’m guessing that “The Rt Hon Matt Hancock MP” hasn’t read:

The code reuse myth: why internal software reuse initiatives tend to fail by Ben Morris


The largest single barrier to effective code reuse is that it is difficult. It doesn’t just happen by accident. Reusable code has to be specifically designed for a generalised purpose and it is unlikely to appear spontaneously as a natural by-product of development projects.

Reusable components are usually designed to serve an abstract purpose and this absence of tangible requirements can make them unusually difficult to design, develop and test. Their development requires specific skills and knowledge of design patterns that is not commonly found in development teams. Developing for reuse is an art in itself and it takes experience to get the level of abstraction right without making components too specific for general use or too generalised to add any real value.

These design challenges can be exacerbated by organisational issues in larger and more diffused development environments. If you are going to develop common components then you will need a very deep understanding of a range of constantly evolving requirements. As the number of projects and teams involved in reuse grows it can be increasingly difficult to keep track of these and assert any meaningful consistency.

Successful code reuse needs continuous effort to evolve shared assets in step with the changing business and technical landscape. This demands ownership and governance to ensure that assets don’t fall into disrepair after the initial burst of effort that established them. It also requires a mature development process that provides sufficient time to design, test, maintain and enhance reusable assets. Above all, you need a team of skilled architects and developers who are sufficiently motivated and empowered to take a lead in implementing code reuse.

Reuse Myth – can you afford reusable code? by Allan Kelly


In my Agile Business Conference presentation (“How much quality can we afford?”) I talked about the Reuse Myth, this is something I always touch on when I deliver a training course but I’ve never taken time to write it down. Until now.

Let’s take as our starting point Kevlin Henney’s observation that “there is no such thing as reusable code, only code that is reused.” Kevlin (given the opportunity!) goes on to examine what constitutes “reuse” over simple “use.” A good discussion itself but right now I want to suggest that an awful lot of code which is “designed for reuse” is never actually re-used.

In effect that design effort is over engineering, waste in other words. One of the reasons developers want to “design for reuse” is not so much because the code will be reused but rather because they desire a set of properties (modularity, high cohesion, low coupling, etc.) which are desirable engineering properties but sound a bit abstract.

In other words, striving for “re-usability” is a developer’s way of striving for well engineered code. Unfortunately in striving for re-usability we lose focus which brings us to the second consideration…. cost of re-usability.

In Mythical Man Month (1974) Fred Brooks suggests that re-usable code costs three times as much to develop as single use code. I haven’t seen any better estimates so I tend to go with this one. (If anyone has any better estimates please send them over.)

Think about this. This means that you have to use your “reusable” code three times before you break even. And it means you only see a profit (saving) on the fourth reuse.

How much code which is built for reuse is reused four times?
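
To put numbers on Brooks’ estimate (a toy calculation, nothing more):

```python
# Toy break-even arithmetic for Brooks' "reusable code costs 3x" figure.
single_use_cost = 1.0                    # cost of writing it once, for one project
reusable_cost = 3 * single_use_cost      # Brooks' estimate for reusable code

for uses in range(1, 6):
    net = uses * single_use_cost - reusable_cost
    print(f"{uses} use(s): net saving {net:+.1f}")
# Negative through 2 uses, break-even at 3, a profit only from the 4th use on.
```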

Those are two “hits” out of 393,000 that I got this afternoon searching on (with the quotes) “code reuse.”

Let’s take The Rt Hon Matt Hancock MP statement and re-write it a bit:

Hypothetical Statement – Not an actual statement by The Rt Hon Matt Hancock MP, he’s not that well informed:

“So if the Estonian government spends three (3) times as much to write a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

Will the Estonian government, which is like other governments, spend three (3) times as much developing software on the off chance that the UK may want to use it?

Would any government undertake software development on that basis?

Do you have an answer other than NO! to either of those questions?

There are lots of competent computer people in the UK but none of them are advising The Rt Hon Matt Hancock MP. Or he isn’t listening. Amounts to the same thing.

Public Terminal on Your Network or Computer?

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:34 pm

Update Flash now! Adobe releases patch, fixing critical security holes by Graham Cluley.

Graham details the latest in a series of patches for critical flaws in Flash and instead of completely removing Flash from your computer recommends:

Instead, I would suggest that Adobe Flash users consider enabling “Click to Play” in their browser.

Really?

And how are you going to decide if Flash content is malicious or not? Before you “click to play?”

To be honest, I can’t.

Flash on your computer is the equivalent of a public terminal to your network or computer on a street corner.

My recommendation? Remove Flash completely from your computer.

What about Flash content?

If I really want to view something that requires Flash, I write to the source saying I won’t install public access to my computer in order to view their content.

If enough of us do that, perhaps Flash will die the sort of death it deserves.

September 22, 2015

Coursera Specialization in Machine Learning:…

Filed under: Machine Learning — Patrick Durusau @ 7:29 pm

Coursera Specialization in Machine Learning: A New Way to Learn Machine Learning by Emily Fox.

From the post:

Machine learning is transforming how we experience the world as intelligent applications have become more pervasive over the past five years. Following this trend, there is an increasing demand for ML experts. To help meet this demand, Carlos and I were excited to team up with our colleagues at the University of Washington and Dato to develop a Coursera specialization in Machine Learning. Our goal is to avoid the standard prerequisite-heavy approach used in other ML courses. Instead, we motivate concepts through intuition and real-world applications, and solidify concepts with a very hands-on approach. The result is a self-paced, online program targeted at a broad audience and offered through Coursera with the first course available today.

Change how people learn about machine learning?

Do they mean to depart from simply replicating static textbook content in another medium?

Horrors! (NOT!)

Education has been evolving since the earliest days online and will continue to do so.

Still, it is encouraging to see people willing to admit to being different.

Enjoy!

I first saw this in a tweet by Dato.

King – Man + Woman = Queen:…

Filed under: Computational Linguistics,Linguistics — Patrick Durusau @ 6:24 pm

King – Man + Woman = Queen: The Marvelous Mathematics of Computational Linguistics.

From the post:

Computational linguistics has dramatically changed the way researchers study and understand language. The ability to number-crunch huge amounts of words for the first time has led to entirely new ways of thinking about words and their relationship to one another.

This number-crunching shows exactly how often a word appears close to other words, an important factor in how they are used. So the word Olympics might appear close to words like running, jumping, and throwing but less often next to words like electron or stegosaurus. This set of relationships can be thought of as a multidimensional vector that describes how the word Olympics is used within a language, which itself can be thought of as a vector space.

And therein lies this massive change. This new approach allows languages to be treated like vector spaces with precise mathematical properties. Now the study of language is becoming a problem of vector space mathematics.

Today, Timothy Baldwin at the University of Melbourne in Australia and a few pals explore one of the curious mathematical properties of this vector space: that adding and subtracting vectors produces another vector in the same space.

The question they address is this: what do these composite vectors mean? And in exploring this question they find that the difference between vectors is a powerful tool for studying language and the relationship between words.

A great lay introduction to:

Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning by Ekaterina Vylomova, Laura Rimell, Trevor Cohn, Timothy Baldwin.

Abstract:

Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.

The authors readily admit, much to their credit, this isn’t a one size fits all solution.

But, a line of research that merits your attention.
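
If you want to try the arithmetic yourself, here is a minimal sketch using gensim and a pre-trained word2vec-format model. The file name is a placeholder; any such model will do, and vocabulary coverage depends on the model you load:

```python
# Minimal "king - man + woman" sketch using gensim word vectors.
# Assumes a pre-trained word2vec-format model on disk (e.g. the Google
# News vectors); the file name below is a placeholder.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Vector arithmetic: king - man + woman should land near "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same difference-of-vectors trick is what the paper evaluates across
# a much broader range of lexical relations.
```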

Security Alert! Have You Seen This Drive?

Filed under: Cybersecurity,Humor,Security — Patrick Durusau @ 3:07 pm

[Image: Western Digital My Book external hard drive]

The Ministry of Education, British Columbia, Canada posted MISSING DRIVE CONTENTS:

Despite extensive physical and electronic searches, the Ministry of Education has been unable to locate an unencrypted external hard drive with a variety of reports, databases, and some information detailed below.

The missing external drive is a black Western Digital drive about 7-inches high, 5.5 inches deep, and two inches thick. The disk has 437 GB worth of material made up of 8,766 folders with 138,830 files.

Inside some of the files is information on a total of 3.4 million individuals from between 1986-2009

The red color was in the original.

I’m not sure how listing the contents in detail is going to help find this drive but I do have a local copy should the online version disappear.

If I had to guess, someone converted the drive to home use and formatted it, losing the data of concern unless you want to pay for expensive data recovery efforts.

But, in the event it was stolen and sold along with other equipment, check any second-hand Western Digital drives you have purchased. Could be worth more than you paid for it.

I first saw this in a tweet by Dissent Doe today and I have no date for the actual data loss.

Text Making A Comeback As Interface?

Filed under: Interface Research/Design — Patrick Durusau @ 2:41 pm

Who Needs an Interface Anyway? Startups are piggybacking on text messaging to launch services. by Joshua Brustein.

From the post:

In his rush to get his latest startup off the ground, Ethan Bloch didn’t want to waste time designing a smartphone app. He thought people would appreciate the convenience of not having to download an app and then open it every time they wanted to use Digit, a tool that promotes savings. Introduced in February, it relies on text messaging to communicate with users. To sign up for the service, users go to Digit’s website and key in their cell number and checking account number. The software analyzes spending patterns and automatically sets money aside in a savings account. To see how much you’ve socked away, text “tell me my balance.” Key in “save more,” and Digit will do as you command. “A lot of the benefit of Digit takes place in the background. You don’t need to do anything,” says Bloch.

Conventional wisdom holds that intricately designed mobile apps are an essential part of most new consumer technology services. But there are signs people are getting apped out. While the amount of time U.S. smartphone users spend with apps continues to increase, the number of apps the average person uses has stayed pretty much flat for the last two years, according to a report Nielsen published in June. Some 200 apps account for more than 70 percent of total usage.

Golden Krishna, then a designer at Cooper, a San Francisco consulting firm that helps businesses create user experiences, anticipated the onset of app fatigue. In a 2012 blog post, “The best interface is no interface,” he argued that digital technology should strive to be invisible. It sparked a wide-ranging debate, and Krishna has spent the past several years making speeches, promoting a book with the same title as his essay, and doing consulting work for Silicon Valley companies.

Remembering the near ecstasy when visual interfaces replaced green screens, it goes against experience to credit text as the best interface.

However, you should start with Golden Krishna’s essay, “The best interface is no interface,” then move on to his keynote address: “The best interface is no interface” at SXSW 2013 and of course, his website, http://www.nointerface.com/book/, which has many additional resources, including his book by the same name.

It is way cool to turn a blog post into a cottage industry. Not just any blog post, but a very good blog post on a critical issue for every user-facing software application.

To further inspire you to consider text as an interface, take special note of the line that reads:

“Some 200 apps account for more than 70 percent of total usage.”

In order to become a top app, you not only have to displace one of the top 200 apps, but your app has to be chosen to replace it. That sounds like an uphill battle.

Not to say that making a text interface is going to be easy; it’s not. You will have to think about the interaction more carefully than when grabbing stock widgets to build a visual interface.
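
As a small illustration of what that thinking looks like, a text-command service is mostly normalizing and routing messages. The sketch below fakes the Digit-style commands with a plain function; there is no SMS gateway and the command names are simply borrowed from the article:

```python
# Tiny sketch of a text-message command interface in the style of the
# Digit example: normalize the incoming text, route it to a handler,
# and always reply with something useful. No SMS gateway involved.

BALANCE = 127.53  # stand-in for a real account lookup

def handle_message(text: str) -> str:
    command = " ".join(text.lower().split())  # normalize whitespace and case
    if command in ("tell me my balance", "balance"):
        return f"You have ${BALANCE:.2f} saved."
    if command in ("save more", "save"):
        return "OK, we'll set aside a little extra this week."
    # A text interface has no menus, so the fallback message matters.
    return "Sorry, I didn't get that. Try 'tell me my balance' or 'save more'."

print(handle_message("Tell me my balance"))
print(handle_message("save more"))
print(handle_message("what?"))
```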

On the upside, you may avoid the design clunkers that litter Krishna’s presentations and book.

An even better upside, you may avoid authoring one of the design clunkers that litter Krishna’s presentations.

I first saw this in a tweet by Bob DuCharme.

Python for Scientists [Warning – Sporadic Content Ahead]

Filed under: Programming,Python,Science — Patrick Durusau @ 10:36 am

Python for Scientists: A Curated Collection of Chapters from the O’Reilly Data and Programming Libraries

From the post:

More and more, scientists are seeing tech seep into their work. From data collection to team management, various tools exist to make your lives easier. But, where to start? Python is growing in popularity in scientific circles, due to its simple syntax and seemingly endless libraries. This free ebook gets you started on the path to a more streamlined process. With a collection of chapters from our top scientific books, you’ll learn about the various options that await you as you strengthen your computational thinking.

This free ebook includes chapters from:

  • Python for Data Analysis
  • Effective Computation in Physics
  • Bioinformatics Data Skills
  • Python Data Science Handbook

Warning: You give your name and email to the O’Reilly marketing machine and get:

Python for Data Analysis

Python Language Essentials Appendix

Effective Computation in Physics

Chapter 1: Introduction to the Command Line
Chapter 7: Analysis and Visualization
Chapter 20: Publication

Bioinformatics Data Skills

Chapter 4: Working with Remote Machines
Chapter 5: Git for Scientists

Python Data Science Handbook

Chapter 3: Introduction to NumPy
Chapter 4: Introduction to Pandas

The content present is very good. The content missing is vast.

Topic Modeling and Twitter

Filed under: Latent Dirichlet Allocation (LDA),Python,Twitter — Patrick Durusau @ 9:57 am

Alex Perrier has two recent posts of interest to Twitter users and topic modelers:

Topic Modeling of Twitter Followers

In this post, we explore LDA an unsupervised topic modeling method in the context of twitter timelines. Given a twitter account, is it possible to find out what subjects its followers are tweeting about?

Knowing the evolution or the segmentation of an account’s followers can give actionable insights to a marketing department into near real time concerns of existing or potential customers. Carrying topic analysis of followers of politicians can produce a complementary view of opinion polls.

Segmentation of Twitter Timelines via Topic Modeling

Following up on our first post on the subject, Topic Modeling of Twitter Followers, we compare different unsupervised methods to further analyze the timelines of the followers of the @alexip account. We compare the results obtained through Latent Semantic Analysis and Latent Dirichlet Allocation and we segment Twitter timelines based on the inferred topics. We find the optimal number of clusters using silhouette scoring.

Alex has Python code, an interesting topic, great suggestions for additional reading, what is there not to like?
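
If you want a feel for the core step before reading Alex’s posts, here is a minimal LDA run over a handful of toy “timelines” with gensim. This is not Alex’s code; his pipeline pulls real timelines from the Twitter API and goes considerably further:

```python
# Minimal LDA sketch over toy "tweet timelines" using gensim.
# Real pipelines add tokenization, stop word removal, and pull the
# text from the Twitter API.
from gensim import corpora, models

timelines = [
    "machine learning model training data features",
    "deep learning neural networks training gpu",
    "election polls candidate votes campaign",
    "senate votes campaign policy election",
]
texts = [t.split() for t in timelines]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=1)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```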

LDA and machine learning types follow @alexip, but privacy advocates should as well.

Consider this recent tweet by Alex:

In the end the best way to protect your privacy is to behave erratically so that the Machine Learning algo will detect you as an outlier!

Perhaps, perhaps, but I suspect outliers/outsiders are classed as dangerous by several government agencies in the US.
