Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 13, 2014

12 Steps For Teaching…

Filed under: OpenShift,Red Hat,Teaching — Patrick Durusau @ 7:54 pm

12 Steps For Teaching Your Next Programming Class on OpenShift by Katie Miller.

From the post:

The OpenShift Platform as a Service (PaaS) is a valuable resource for running tutorials on web programming, especially if you have a limited budget.

OpenShift abstracts away configuration headaches to help students create shareable applications quickly and easily, for free, using extensible open-source code – as I explained in a previous post.

In this blog post, I will draw on my personal workshop experiences to outline 12 steps for teaching your next programming class with OpenShift Online.

See Katie’s post for the details but as a sneak preview, the twelve steps are:

  1. Try Out OpenShift
  2. Choose Topic Areas
  3. Select Cartridges to Support Your Teaching Goals
  4. Develop a Work Flow
  5. Create and Publish Sample Code Base
  6. Write Workshop Instructions
  7. Determine Account Creation Strategy
  8. Prepare Environments
  9. Trial Workshop
  10. Recruit Helpers
  11. Run Workshop
  12. Share Results and Seek Feedback

An excellent resource for teaching the techie side of semantic integration.

Twitter Keyboard Shortcuts

Filed under: Tweets — Patrick Durusau @ 5:34 pm

Twitter Keyboard Shortcuts by Gregory Piatetsky.

Too useful not to pass along.

Gregory says the best shortcut is “?”, which displays all of the keyboard shortcuts.

Pass it on.

Algebra for Analytics:…

Filed under: Algebra,Algorithms,Analytics — Patrick Durusau @ 3:44 pm

Algebra for Analytics: Two pieces for scaling computations, ranking and learning by P. Oscar Boykin.

Slide deck from Oscar’s presentation at Strataconf 2014.

I don’t normally say a slide deck on algebra is inspirational but I have to for this one!

Looking forward to watching the video of the presentation that went along with it.

Think of all the things you can do with associativity and hashes before you review the slide deck.

It will make it all the more amazing.
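Not from the deck, but here is a toy Python sketch of the core idea: when the combining step is associative, partial results can be computed per chunk (or per machine) and merged in any grouping, which is exactly what makes these aggregations scale.

```python
# Minimal sketch: associativity lets partial results be combined in any order.
from functools import reduce

def combine(a, b):
    # any associative operation works here: +, max, set union, sketch merges, ...
    return a + b

data = list(range(1, 101))

# pretend these chunks live on different machines
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

partials = [reduce(combine, chunk) for chunk in chunks]  # "map" side
total = reduce(combine, partials)                        # "reduce" side

assert total == sum(data)  # same answer as one sequential pass
print(total)
```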

I first saw this in a tweet by Twitter Open Source.

Mining of Massive Datasets 2.0

Filed under: BigData,Data Mining,Graphs,MapReduce — Patrick Durusau @ 3:29 pm

Mining of Massive Datasets 2.0

From the webpage:

The following is the second edition of the book, which we expect to be published soon. We have added Jure Leskovec as a coauthor. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Chapter 2 also has new material on algorithm design techniques for map-reduce.
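For anyone who has not met the model before, here is a toy, single-process Python sketch of the map-reduce pattern that Chapter 2 covers; the word-counting example is mine, not the book’s.

```python
# Toy illustration of map-reduce: map emits key/value pairs, the framework
# groups by key, reduce folds each group into a result.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

documents = ["Mining of Massive Datasets", "mining large graphs"]

groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):   # map phase
        groups[key].append(value)    # shuffle / group by key

result = dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase
print(result)  # {'mining': 2, 'of': 1, 'massive': 1, 'datasets': 1, 'large': 1, 'graphs': 1}
```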

Aren’t you wishing for more winter now? 😉

I first saw this in a tweet by Gregory Piatetsky.

Reasoned Programming

Filed under: Clojure,Functional Programming,Logic,Programming — Patrick Durusau @ 2:49 pm

Reasoned Programming by Krysia Broda, Susan Eisenbach, Hessam Khoshnevisan, and Steve Vickers.

From the preface:

Can we ever be sure that our computer programs will work reliably? One approach to this problem is to attempt a mathematical proof of reliability, and this has led to the idea of Formal Methods: if you have a formal, logical specification of the properties meant by `working reliably’, then perhaps you can give a formal mathematical proof that the program (presented as a formal text) satisfies them.

Of course, this is by no means trivial. Before we can even get started on a formal proof we must turn the informal ideas intended by `working reliably’ into a formal specification, and we also need a formal account of what it means to say that a program satisfies a specification (this amounts to a semantics of the programming language, an account of the meaning of programs). None the less, Formal Methods are now routinely practised by a number of software producers.

However, a tremendous overhead derives from the stress on formality, that is to say, working by the manipulation of symbolic forms. A formal mathematical proof is a very different beast from the kind of proof that you will see in mathematical text books. It includes the minutest possible detail, both in proof steps and in background assumptions, and is not for human consumption &emdash; sophisticated software support tools are needed to handle it. For this reason, Formal Methods are often considered justifiable only in `safety critical’ systems, for which reliability is an overriding priority.

The aim of this book is to present informal formal methods, showing the benefits of the approach even without strict formality: although we use logic as a notation for the specifications, we rely on informal semantics &emdash; a programmer’s ordinary intuitions about what small, linear stretches of code actually do &emdash; and we use proofs to the level of rigour of ordinary mathematics.

A bit dated (1994) and teaches Miranda, a functional programming language and uses it to reason about imperative programming.

Even thinking about a “specification” isn’t universally admired these days but the author’s cover that point when they say:

This `precise account of the users’ needs and wants’ is called a specification, and the crucial point to understand is that it is expressing something quite different from the code, that is, the users’ interests instead of the computer’s. If the specification and code end up saying the same thing in different ways &emdash; and this can easily happen if you think too much from the computer’s point of view when you specify &emdash; then doing both of them is largely a waste of time. (emphasis added, Chapter 1, Section 1.3)

That’s blunt enough. 😉

You can pick up Miranda, homesite or translate the examples into a more recent functional language, Clojure comes to mind.

I first saw this in a tweet by Computer Science.

February 12, 2014

Islamic Finance: A Quest for Publically Available Bank-level Data

Filed under: Data,Finance Services,Government,Government Data — Patrick Durusau @ 9:38 pm

Islamic Finance: A Quest for Publically Available Bank-level Data by Amin Mohseni-Cheraghlou.

From the post:

Attend a seminar or read a report on Islamic finance and chances are you will come across a figure between $1 trillion and $1.6 trillion, referring to the estimated size of the global Islamic assets. While these aggregate global figures are frequently mentioned, publically available bank-level data have been much harder to come by.

Considering the rapid growth of Islamic finance, its growing popularity in both Muslim and non-Muslim countries, and its emerging role in global financial industry, especially after the recent global financial crisis, it is imperative to have up-to-date and reliable bank-level data on Islamic financial institutions from around the globe.

To date, there is a surprising lack of publically available, consistent and up-to-date data on the size of Islamic assets on a bank-by-bank basis. In fairness, some subscription-based datasets, such Bureau Van Dijk’s Bankscope, do include annual financial data on some of the world’s leading Islamic financial institutions. Bank-level data are also compiled by The Banker’s Top Islamic Financial Institutions Report and Ernst & Young’s World Islamic Banking Competitiveness Report, but these are not publically available and require subscription premiums, making it difficult for many researchers and experts to access. As a result, data on Islamic financial institutions are associated with some level of opaqueness, creating obstacles and challenges for empirical research on Islamic finance.

The recent opening of the Global Center for Islamic Finance by World Bank Group President Jim Young Kim may lead to exciting venues and opportunities for standardization, data collection, and empirical research on Islamic finance. In the meantime, the Global Financial Development Report (GFDR) team at the World Bank has also started to take some initial steps towards this end.

I can think of two immediate benefits from publicly available data on Islamic financial institutions:

First, hopefully it will increase demands for meaningful transparency in Western financial institutions.

Second, it will blunt government hand waving and propaganda about the purposes of Islamic financial institutions, which, on a par with financial institutions everywhere, want to remain solvent, serve the needs of their customers and play active roles in their communities. Nothing more sinister than that.

Perhaps the best way to vanquish suspicion is with transparency. Except for the fringe cases who treat lack of evidence as proof of secret evil doing.

…Open GIS Mapping Data To The Public

Filed under: Geographic Data,GIS,Maps,Open Data — Patrick Durusau @ 9:13 pm

Esri Allows Federal Agencies To Open GIS Mapping Data To The Public by Alexander Howard.

From the post:

A debate in the technology world that’s been simmering for years, about whether mapping vendor Esri will allow public geographic information systems (GIS) to access government customers’ data, finally has an answer: The mapping software giant will take an unprecedented step, enabling thousands of government customers around the U.S. to make their data on the ArcGIS platform open to the public with a click of a mouse.

“Everyone starting to deploy ArcGIS can now deploy an open data site,” Andrew Turner, chief technology officer of Esri’s Research and Development Center in D.C., said in an interview. “We’re in a unique position here. Users can just turn it on the day it becomes public.”

Government agencies can use the new feature to turn geospatial information systems data in Esri’s format into migratable, discoverable, and accessible open formats, including CSVs, KML and GeoJSON. Esri will demonstrate the ArcGIS feature in ArcGIS at the Federal Users Conference in Washington, D.C. According to Turner, the new feature will go live in March 2014.
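GeoJSON is one of the open formats mentioned above. As a rough sketch of what that output looks like, here are a few lines of Python turning hypothetical tabular point data into a GeoJSON FeatureCollection.

```python
# Toy conversion of tabular point records (made-up fields) into GeoJSON.
import json

rows = [
    {"name": "Fire Station 1", "lon": -77.0365, "lat": 38.8977},
    {"name": "Fire Station 2", "lon": -77.0502, "lat": 38.9041},
]

features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
        "properties": {"name": r["name"]},
    }
    for r in rows
]

print(json.dumps({"type": "FeatureCollection", "features": features}, indent=2))
```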

I’m not convinced that GIS data alone is going to make government more transparent but it is a giant step in the right direction.

To have even partial transparency in government, you would not only need GIS data but also to have it correlated with property sales and purchases going back decades, along with tracing the legal ownership of property past shell corporations and holding companies, to say nothing of the social, political and professional relationships of those who benefited from various decisions. For a start.

Still, the public may be a better starting place to demand transparency with this type of data.

Sistine Chapel full 360°

Filed under: Data,History — Patrick Durusau @ 9:00 pm

Sistine Chapel full 360°

It’s not like being there, but then visitors can’t “zoom” in as you can with this display.

If you could capture one perspective, current or historical, for the Sistine Chapel, what would it be?

If you are ever in Rome, it is worth the hours in line, and the exhibits you will see along the way, to finish in the Sistine Chapel.

I first saw this in a tweet by Merete Sanderhoff.

John von Neumann and the Barrier to Universal Semantics

Filed under: Semantics,Topic Maps — Patrick Durusau @ 8:39 pm

Chris Boshuizen posted an image of a letter by John von Neumann “…lamenting that people don’t read other’s code, in 1952!”

von Neumann writes:

The subject mentioned by Stone is not an easy one. Plans to standardize and publish code of various groups have been made in the past, and they have not been very successful so far. The difficulty is that most people who have been active in this field seem to believe that it is easier to write new code than to understand an old one. This is probably exaggerated, but it is certainly true that the process of understanding a code practically involves redoing it de novo. The situation is not very unlike the one that existed in formal logics over a long period, where every new author invented a new symbolism. It took several decades until a few of these found wider acceptance, at least within limited groups. In the case of computing machine codes, the situation is even more difficult, since all formal logics refer, at least ideally, to the same substratum, whereas the machine codes frequently refer to physically different machines. (emphasis added)

To reword von Neumann slightly: whereas semantics refer to the perceptions of physically different people.

Yes?

Non-adoption of RDF or OWL isn’t a reflection on their capabilities or syntax. Rather it reflects that the vast majority of users don’t see the world as presented by RDF or OWL.

Since it is more difficult to learn a way other than your own, inertia favors whatever system you presently follow.

None of that is to deny or minimize the benefits of integrating information from various viewpoints. But a starting premise that users need to change their world views to X is a non-starter if the goal is integration of information from different viewpoints.

My suggestion is that we start where users are today, with their languages, their means of identification, their subjects as it were. How to do that has as many answers as there are users with goals and priorities. Which will make the journey all the more interesting and enjoyable.

Specializations On Coursera

Filed under: CS Lectures,Data Science — Patrick Durusau @ 4:31 pm

Specializations On Coursera

Coursera is offering sequences of courses that result in certificates in particular areas.

For example, Johns Hopkins is offering a certificate in Data Science, nine courses at $49.00 each or $490 for a specialization certificate.

I first saw this in a post by Stephen Turner, Coursera Specializations: Data Science, Systems Biology, Python Programming.

iPhone interface design

Filed under: Interface Research/Design — Patrick Durusau @ 11:51 am

iPhone interface design by Edward Tufte.

From the post:

The iPhone platform elegantly solves the design problem of small screens by greatly intensifying the information resolution of each displayed page. Small screens, as on traditional cell phones, show very little information per screen, which in turn leads to deep hierarchies of stacked-up thin information–too often leaving users with “Where am I?” puzzles. Better to have users looking over material adjacent in space rather than stacked in time.

To do so requires increasing the information resolution of the screen by the hardware (higher resolution screens) and by screen design (eliminating screen-hogging computer administrative debris, and distributing information adjacent in space).

Tufte’s take on iPhone interface design with reader comments.

The success of the iPhone interface is undeniable. The spread of its lessons, at least to “big” screens, less so.

There are interfaces that I use where a careless click of the mouse offers a second or even third way to perform a task or at least more menus.

If you are looking for an industry with nearly unlimited potential for growth, think user interface/user experience.

I first saw this in a tweet by Gregory Piatetsky.

February 11, 2014

Neo4j Spatial Part 2

Filed under: Geographic Data,Georeferencing,Graphs,Neo4j — Patrick Durusau @ 2:27 pm

Neo4j Spatial Part 2 by Max De Marzi.

Max finishes up part 1 with sample spatial data for restaurants and deploys his proof of concept using GrapheneDB on Heroku.

Restaurants are typical cellphone app fare but if I were in Kiev, I’d want an app with geo-locations of ingredients for a proper Molotov cocktail.

A jar filled with gasoline and a burning rag is nearly as dangerous to the thrower as the target.

Of course, substitutions for ingredients, in what quantities, in different languages, could be added features of such an app.

Data management is a weapon within the reach of all sides.

Build your own [Secure] Google Maps…

Filed under: Geography,Google Maps,Maps — Patrick Durusau @ 2:03 pm

Build your own Google Maps (and more) with GeoServer on OpenShift by Steven Citron-Pousty.

From the post:

Greetings Shifters! Today we are going to continue in our spatial series and bring up Geoserver on OpenShift and connect it to our PostGIS database. By the end of the post you will have your own map tile server OR KML (to show on Google Earth) or remote GIS server.

The team at Geoserver has put together a nice short explanation of the geoserver and then a really detailed list. If you want commercial support, Boundless will give you a commercial release and/or support for all your corporate needs. Today though I am only going to focus on the FOSS bits.

From the GeoServer site:

GeoServer allows you to display your spatial information to the world. Implementing the Web Map Service (WMS) standard, GeoServer can create maps in a variety of output formats. OpenLayers, a free mapping library, is integrated into GeoServer, making map generation quick and easy. GeoServer is built on Geotools, an open source Java GIS toolkit.

There is much more to GeoServer than nicely styled maps, though. GeoServer also conforms to the Web Feature Service (WFS) standard, which permits the actual sharing and editing of the data that is used to generate the maps. Others can incorporate your data into their websites and applications, freeing your data and permitting greater transparency.

I added “[Secure]” to the title, assuming that you will not hand over data to the NSA about yourself or your maps. I can’t say that for everyone that offers mapping services on the WWW.

Depending on how much security you need, certainly develop on OpenShift but I would deploy on shielded and physically secure hardware. Depends on your appetite for risk.
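Once your GeoServer is up, the WMS side is plain HTTP. A hedged Python sketch of a GetMap request follows; the host, workspace and layer names are placeholders for whatever your own OpenShift deployment uses.

```python
# Pull a rendered map image from a GeoServer WMS endpoint (hypothetical URL/layer).
import requests

params = {
    "service": "WMS",
    "version": "1.1.1",
    "request": "GetMap",
    "layers": "myworkspace:restaurants",   # hypothetical layer name
    "styles": "",
    "bbox": "-77.12,38.80,-76.90,39.00",   # minx,miny,maxx,maxy
    "srs": "EPSG:4326",
    "width": 512,
    "height": 512,
    "format": "image/png",
}

resp = requests.get("https://geoserver-myapp.example.com/geoserver/wms", params=params)
resp.raise_for_status()

with open("map.png", "wb") as f:
    f.write(resp.content)
```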

Scalable Vector Graphics (SVG) 2

Filed under: Graphics,SVG — Patrick Durusau @ 1:46 pm

Scalable Vector Graphics (SVG) 2

Abstract:

This specification defines the features and syntax for Scalable Vector Graphics (SVG) Version 2, a language for describing two-dimensional vector and mixed vector/raster graphics. Although an XML serialization is given, processing is defined in terms of a DOM.

Changes from SVG 1.1 Second Edition.

No time like the present to start learning about the next version of SVG!

Not to mention that your comments may contribute to the style and/or substance of a standard we will all be using sooner rather than later.
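Nothing SVG 2 specific here, but a scratch file is handy for trying spec features as you read. A minimal Python sketch:

```python
# Write a tiny, valid SVG document to experiment with while reading the spec.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">
  <circle cx="60" cy="60" r="50" fill="steelblue"/>
  <text x="60" y="66" text-anchor="middle" fill="white">SVG</text>
</svg>
"""

with open("scratch.svg", "w") as f:
    f.write(svg)
```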

CUBE and ROLLUP:…

Filed under: Aggregation,Hadoop,Pig — Patrick Durusau @ 1:29 pm

CUBE and ROLLUP: Two Pig Functions That Every Data Scientist Should Know by Joshua Lande.

From the post:

I recently found two incredible functions in Apache Pig called CUBE and ROLLUP that every data scientist should know. These functions can be used to compute multi-level aggregations of a data set. I found the documentation for these functions to be confusing, so I will work through a simple example to explain how they work.

Joshua starts his post with a demonstration of using GROUP BY in Pig for simple aggregations. That sets the stage for demonstrating how important CUBE and ROLLUP can be for data aggregations in Pig.

Interesting possibilities suggest themselves by the time you finish Joshua’s posting.
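If you want to see what CUBE is computing before you open Pig, here is a toy Python sketch (hypothetical data, not Joshua’s): aggregate over every subset of the grouping dimensions, with ROLLUP being the hierarchical-prefix version of the same idea.

```python
# What CUBE computes: one aggregate per subset of the grouping dimensions,
# with None standing in for "all values" of a dimension (the grand total is (None, None)).
from itertools import combinations
from collections import defaultdict

rows = [
    {"player": "alice", "team": "red",  "points": 3},
    {"player": "bob",   "team": "red",  "points": 5},
    {"player": "carol", "team": "blue", "points": 2},
]

dims = ("player", "team")
totals = defaultdict(int)

for r in rows:
    for k in range(len(dims) + 1):              # every subset of dimensions
        for subset in combinations(dims, k):
            key = tuple(r[d] if d in subset else None for d in dims)
            totals[key] += r["points"]

for key, value in sorted(totals.items(), key=str):
    print(key, value)
```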

I first saw this in a tweet by Dmitriy Ryaboy.

Is 11 Feb 2014 The Day We Fight Back?

Filed under: Cryptography,Cybersecurity,NSA,Privacy,Security — Patrick Durusau @ 11:31 am

Is 11 Feb 2014 The Day We Fight Back? by Mark Stockley.

From the post:

Appalled with government surveillance without oversight? Sick of having your privacy invaded? Numb from stories about the NSA? If you are, you’ll have had many more bad days than good since June 2013.

But today, just perhaps, could be one of the better ones.

Mark covers the general theme of protests quite well and then admits, ok, so people are protesting, now what?

Lacking a target like SOPA, there is no specific action to ask for or for anyone to take.

Or as Mark points out:

Who do we lobby to fix that situation [government surveillance] and how will we ever know if we have succeeded?

I put it to you that the government(s) being petitioned for privacy protection are the same ones that spied on you. Is there irony in that situation?

Is it a reflection on your gullibility that despite years of known lies, deceptions and rights violations, you are willing to trust the people responsible for the ongoing lies, deceptions and rights violations?

If you aren’t going to trust the government, if you aren’t going to protest, what does that leave?

Fighting back effectively.

Mark points out a number of efforts to secure the technical infrastructure of the Internet. Learn more about those, support them and even participate in them.

Among other efforts, consider the OASIS PKCS 11 TC:

The OASIS PKCS 11 Technical Committee develops enhancements to improve the PKCS #11 standard for ease of use in code libraries, open source applications, wrappers, and enterprise/COTS products: implementation guidelines, usage tutorials, test scenarios and test suites, interoperability testing, coordination of functional testing, development of conformance profiles, and providing reference implementations.

The updated standard provides additional support for mobile and cloud computing use cases: for distributed/federated applications involving key management functions (key generation, distribution, translation, escrow, re-keying); session-based models; virtual devices and virtual keystores; evolving wireless/sensor applications using near field communication (NFC), RFID, Bluetooth, and Wi-Fi.

TC members are also designing new mechanisms for API instrumentation, suitable for use in prototyping, profiling, and testing in resource-constrained application environments. These updates enable support for easy integration of PKCS #11 with other cryptographic key management system (CKMS) standards, including a broader range of cryptographic algorithms and CKMS cryptographic service models. (from the TC homepage)

Whatever security you have from government intrusion is going to come from you and others like you who create it.

Want to fight back today? Join one of the efforts that Mark lists or the OASIS PKCS 11 TC. Today!

February 10, 2014

Free MarkLogic Courses?

Filed under: MarkLogic,XML — Patrick Durusau @ 3:02 pm

MarkLogic Announces Free NoSQL Database Training Courses

From the post:

MarkLogic Corporation, the leading Enterprise NoSQL database platform company, today announced the schedule for its MarkLogic University public courses with hands-on instruction to attending users and developers free of charge. The courses are led by an instructor in various live, online, and classroom locations, and provide MarkLogic customers and developers with the training to optimize their NoSQL database deployments and the education to develop applications on the MarkLogic database.

Since 2001, MarkLogic has focused on providing a powerful and trusted Enterprise NoSQL database platform that empowers organizations to turn all data into valuable and actionable information. The MarkLogic University program was created to give customers the access to best practices for managing vast amounts of diverse data. Now project managers, architects, developers, testers, and administrators can improve their MarkLogic skills with no cost training.

“The demand for MarkLogic development and administration skills is increasing in the market and with a sharp focus on customer success, we are dedicated to providing easy access to information and education that will assist developers and IT professionals to better manage and do more with their data,” said Jon Bakke, senior vice president, global technical services, MarkLogic. “By making MarkLogic training resources widely available, we are helping to build up much-needed technical skills that enterprises need to derive value from the vast amounts of enterprise data that is being created and stored today.”

But, when I visit: http://www.marklogic.com/services/training/class-schedule/

I see refundable booking fees. (As of 10 February 2014 at 15:00 EST.)

Nor could I find a statement by MarkLogic on its blog or pressroom confirming free classes.

I have seen this at several sources and suggest further inquiry before anyone gets too excited.

Data visualization with Elasticsearch aggregations and D3

Filed under: D3,ElasticSearch,Visualization — Patrick Durusau @ 1:53 pm

Data visualization with Elasticsearch aggregations and D3 by Shelby Sturgis.

From the post:

For those of you familiar with Elasticsearch, you know that its an amazing modern, scalable, full-text search engine with Apache Lucene and the inverted index at its core. Elasticsearch allows users to query their data and provides efficient and blazingly fast look up of documents that make it perfect for creating real-time analytics dashboards.

Currently, Elasticsearch includes faceted search, a functionality that allows users to compute aggregations of their data. For example, a user with twitter data could create buckets for the number of tweets per year, quarter, month, day, week, hour, or minute using the date histogram facet, making it quite simple to create histograms.

Faceted search is a powerful tool for data visualization. Kibana is a great example of a front-end interface that makes good use of facets. However, there are some major restrictions to faceting. Facets do not retain information about which documents fall into which buckets, making complex querying difficult. Which is why, Elasticsearch is pleased to introduce the aggregations framework with the 1.0 release. Aggregations rips apart its faceting restraints and provides developers the potential to do much more with visualizations.

Aggregations (=Awesomeness!)

Aggregations is “faceting reborn”. Aggregations incorporate all of the faceting functionality while also providing much more powerful capabilities. Aggregations is a “generic” but “extremely powerful” framework for building any type of aggregation. There are several different types of aggregations, but they fall into two main categories: bucketing and metric. Bucketing aggregations produce a list of buckets, each one with a set of documents that belong to it (e.g., terms, range, date range, histogram, date histogram, geo distance). Metric aggregations keep track and compute metrics over a set of documents (e.g., min, max, sum, avg, stats, extended stats).

Using Aggregations for Data Visualization (with D3)

Lets dive right in and see the power that aggregations give us for data visualization. We will create a donut chart and a dendrogram using the Elasticsearch aggregations framework, the Elasticsearch javascript client, and D3.

If you are new to Elasticsearch, it is very easy to get started. Visit the Elasticsearch overview page to learn how to download, install, and run Elasticsearch version 1.0.

The dendrogram of football (U.S.) touchdowns is particularly impressive.
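As a taste of the aggregations DSL from Python, here is a hedged sketch using the official Elasticsearch client; the index and field names are made up for illustration.

```python
# A bucketing aggregation (terms per team) with a metric aggregation (sum) nested inside.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes Elasticsearch 1.0 running on localhost:9200

body = {
    "size": 0,  # aggregation results only, no hits
    "aggs": {
        "by_team": {
            "terms": {"field": "team"},
            "aggs": {
                "total_tds": {"sum": {"field": "touchdowns"}}
            }
        }
    }
}

result = es.search(index="nfl", body=body)
for bucket in result["aggregations"]["by_team"]["buckets"]:
    print(bucket["key"], bucket["total_tds"]["value"])
```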

BTW, https://github.com/stormpython/Elasticsearch-datasets/archive/master.zip returns Elasticsearch-datasets-master.zip on your local drive. Just to keep you from hunting for it.

Text Retrieval Conference (TREC) 2014

Filed under: Conferences,TREC — Patrick Durusau @ 11:33 am

Text Retrieval Conference (TREC) 2014

Schedule: As soon as possible — submit your application to participate in TREC 2014 as described below.
Submitting an application will add you to the active participants’ mailing list. On Feb 26, NIST will announce a new password for the “active participants” portion of the TREC web site.

Beginning March 1
Document disks used in some existing TREC collections distributed to participants who have returned the required forms. Please note that no disks will be shipped before March 1.

July–August
Results submission deadline for most tracks. Specific deadlines for each track will be included in the track guidelines, which will be finalized in the spring.

September 30 (estimated)
Relevance judgments and individual evaluation scores due back to participants.

Nov 18–21
TREC 2014 conference at NIST in Gaithersburg, Md. USA

From the webpage:

The Text Retrieval Conference (TREC) workshop series encourages research in information retrieval and related applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Now in its 23rd year, the conference has become the major experimental effort in the field. Participants in the previous TREC conferences have examined a wide variety of retrieval techniques and retrieval environments, including cross-language retrieval, retrieval of web documents, multimedia retrieval, and question answering. Details about TREC can be found at the TREC web site, http://trec.nist.gov.

You are invited to participate in TREC 2014. TREC 2014 will consist of a set of tasks known as “tracks”. Each track focuses on a particular subproblem or variant of the retrieval task as described below. Organizations may choose to participate in any or all of the tracks. Training and test materials are available from NIST for some tracks; other tracks will use special collections that are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but the conditions of participation specifically preclude any advertising claims based on TREC results. All retrieval results submitted to NIST are published in the Proceedings and are archived on the TREC web site. The workshop in November is open only to participating groups that submit retrieval results for at least one track and to selected government invitees.

The eight (8) tracks:

Clinical Decision Support Track: The clinical decision support track investigates techniques for linking medical cases to information relevant for patient care.

Contextual Suggestion Track: The Contextual Suggestion track investigates search techniques for complex information needs that are highly dependent on context and user interests.

Federated Web Search Track: The Federated Web Search track investigates techniques for the selection and combination of search results from a large number of real on-line web search services.

Knowledge Base Acceleration Track: This track looks to develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.

Microblog Track: The Microblog track examines the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.

Session Track: The Session track aims to provide the necessary resources in the form of test collections to simulate user interaction and help evaluate the utility of an IR system over a sequence of queries and user interactions, rather than for a single “one-shot” query.

Temporal Summarization Track: The goal of the Temporal Summarization track is to develop systems that allow users to efficiently monitor the information associated with an event over time.

Web Track: The goal of the Web track is to explore and evaluate Web retrieval technologies that are both effective and reliable.

As of the date of this post, only the Clinical Decision Support Track webpage has been updated for the 2014 conference. The others will follow in due time.

Apologies for the late notice but since the legal track doesn’t appear this year it dropped off my radar.

Application Details

Organizations wishing to participate in TREC 2014 should respond to this call for participation by submitting an application. Participants in previous TRECs who wish to participate in TREC 2014 must submit a new application. To apply, submit the online application at: http://ir.nist.gov/trecsubmit.open/application.html

Parallel Data Generation Framework

Filed under: Benchmarks,Data — Patrick Durusau @ 11:06 am

Parallel Data Generation Framework

From the webpage:

The Parallel Data Generation Framework (PDGF) is a generic data generator for database benchmarking. Its development started at the University of Passau at the group of Prof. Dr. Harald Kosch.

PDGF was designed to take advantage of today’s multi-core processors and large clusters of computers to generate large amounts of synthetic benchmark data very fast. PDGF uses a fully computational approach and is a pure Java implementation which makes it very portable.

I mention this to ask whether you are aware of methods for generating unstructured text with known characteristics, such as the number of entities and their representations in the data set.

A “natural” dataset, say blog posts or emails, etc., can be probed to determine its semantic characteristics but I am interested in generation of a dataset with known semantic characteristics.

Thoughts?
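The sort of thing I have in mind, as a toy Python sketch: text that reads as unstructured but is seeded with a known number of mentions per entity, so the semantic characteristics of the corpus are fixed in advance.

```python
# Generate "unstructured" text with a known ground truth: how many times each
# entity appears and which surface forms (aliases) represent it.
import random

entities = {"ACME Corp": 5, "Jane Smith": 3}          # entity -> mentions wanted
aliases = {"ACME Corp": ["ACME Corp", "ACME", "the company"],
           "Jane Smith": ["Jane Smith", "Ms. Smith", "she"]}
filler = ["reported earnings", "issued a statement", "declined to comment"]

random.seed(42)
mentions = [e for e, n in entities.items() for _ in range(n)]
random.shuffle(mentions)

sentences = [f"{random.choice(aliases[e])} {random.choice(filler)} on day {i}."
             for i, e in enumerate(mentions)]

print(" ".join(sentences))
print(entities)   # the known characteristics of the generated corpus
```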

I first saw this in a tweet by Stefano Bertolo.

February 9, 2014

Introducing PigPen: Map-Reduce for Clojure

Filed under: Clojure,Functional Programming,MapReduce — Patrick Durusau @ 8:06 pm

Introducing PigPen: Map-Reduce for Clojure by Matt Bossenbroek.


From the post:

It is our pleasure to release PigPen to the world today. PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.

What is PigPen?

  • A map-reduce language that looks and behaves like clojure.core
  • The ability to write map-reduce queries as programs, not scripts
  • Strong support for unit tests and iterative development

Note: If you are not familiar at all with Clojure, we strongly recommend that you try a tutorial here, here, or here to understand some of the basics.

Not a quick read but certainly worth the effort!

Write and Run Giraph Jobs on Hadoop

Filed under: Cloudera,Giraph,Graphs,Hadoop,MapReduce — Patrick Durusau @ 7:52 pm

Write and Run Giraph Jobs on Hadoop by Mirko Kämpf.

From the post:

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.

Currently, the upstream “quick start” document explains how to deploy Giraph on a Hadoop cluster with two nodes running Ubuntu Linux. Although this setup is appropriate for lightweight development and testing, using Giraph with an enterprise-grade CDH-based cluster requires a slightly more robust approach.

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets. (In future posts, I will explain how to implement your own graph algorithms and graph generators as well as how to export your results to Gephi, the “Adobe Photoshop for graphs”, through Impala and JDBC for further inspection.)

The first in a series of posts on Giraph.

This is great stuff!

It should keep you busy during your first conference call and/or staff meeting on Monday morning.

Monday won’t seem so bad. 😉

Generating an XML test corpus

Filed under: MarkLogic,XML — Patrick Durusau @ 5:58 pm

Generating an XML test corpus by Anthony Coates.

From the post:

My current role requires me to work with the MarkLogic NoSQL database. I’ve had some experience with it in the past, if not as much as I would have liked to have had.

Compared to relational databases, “document databases” like MarkLogic have the advantage that content is stored in a denormalised “document” format. If you have your data denormalised appropriately into documents, such that each query requires only a single document, then the database gives its optimum performance. With relational databases, there’s generally no way to avoid having some joins in queries, even if some of the data is denormalised into tables.
….

Anthony is an old hand with XML and has started a new blog.

I am particularly interested in Anthony’s questions about linking documents, denormalizing data, to say nothing of generating the test corpus.
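For the corpus generation question, a toy Python sketch of the idea (my own, not Anthony’s): generate documents with controlled structure and deliberate denormalization so you know in advance what linking and duplication should look like.

```python
# Generate a small XML test corpus with the standard library; each document
# has a known id, a known name, and a controlled number of denormalized child records.
import random
import xml.etree.ElementTree as ET

random.seed(1)
for i in range(5):
    doc = ET.Element("customer", id=str(i))
    ET.SubElement(doc, "name").text = f"Customer {i}"
    orders = ET.SubElement(doc, "orders")
    for j in range(random.randint(1, 3)):        # denormalized child records
        order = ET.SubElement(orders, "order", ref=f"order-{i}-{j}")
        ET.SubElement(order, "total").text = str(random.randint(10, 500))
    ET.ElementTree(doc).write(f"customer-{i}.xml", encoding="utf-8",
                              xml_declaration=True)
```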

I signed up for the RSS feed but don’t depend on me to mention every post. 😉

Medical research—still a scandal

Filed under: Medical Informatics,Open Access,Open Data,Research Methods — Patrick Durusau @ 5:45 pm

Medical research—still a scandal by Richard Smith.

From the post:

Twenty years ago this week the statistician Doug Altman published an editorial in the BMJ arguing that much medical research was of poor quality and misleading. In his editorial entitled, “The Scandal of Poor Medical Research,” Altman wrote that much research was “seriously flawed through the use of inappropriate designs, unrepresentative samples, small samples, incorrect methods of analysis, and faulty interpretation.” Twenty years later I fear that things are not better but worse.

Most editorials like most of everything, including people, disappear into obscurity very fast, but Altman’s editorial is one that has lasted. I was the editor of the BMJ when we published the editorial, and I have cited Altman’s editorial many times, including recently. The editorial was published in the dawn of evidence based medicine as an increasing number of people realised how much of medical practice lacked evidence of effectiveness and how much research was poor. Altman’s editorial with its concise argument and blunt, provocative title crystallised the scandal.

Why, asked Altman, is so much research poor? Because “researchers feel compelled for career reasons to carry out research that they are ill equipped to perform, and nobody stops them.” In other words, too much medical research was conducted by amateurs who were required to do some research in order to progress in their medical careers.

Ethics committees, who had to approve research, were ill equipped to detect scientific flaws, and the flaws were eventually detected by statisticians, like Altman, working as firefighters. Quality assurance should be built in at the beginning of research not the end, particularly as many journals lacked statistical skills and simply went ahead and published misleading research.

If you are thinking things are better today, consider a further comment from Richard:

The Lancet has this month published an important collection of articles on waste in medical research. The collection has grown from an article by Iain Chalmers and Paul Glasziou in which they argued that 85% of expenditure on medical research ($240 billion in 2010) is wasted. In a very powerful talk at last year’s peer review congress John Ioannidis showed that almost none of thousands of research reports linking foods to conditions are correct and how around only 1% of thousands of studies linking genes with diseases are reporting linkages that are real. His famous paper “Why most published research findings are false” continues to be the most cited paper of PLoS Medicine.

Not that I think open access would be a panacea for poor research quality but at least it would provide the opportunity for discovery.

All this talk about medical research reminds me of DARPA’s Big Mechanism program. Assume the research data on pathways is no better or no worse than mapping genes to diseases; DARPA will be spending $42 million to mine data with 1% accuracy.

A better use of those “Big Mechanism” dollars would be to test solutions to produce better medical research for mining.

1% sounds like low-grade ore to me.

CryptoAlgebra

Filed under: Cryptography — Patrick Durusau @ 5:20 pm

CryptoAlgebra by Matt Gautreau.

From the post:

Just as so you know, the material being covered in this blog will be based on what I learn in class, partly from these books:

The first section of this blog, corresponding to the textbook, is going to be about what are referred to as “Classical Cryptosystems”. These types of encryption algorithms are what was used before the invention of computers. The computing power of your cell phone could easily brute force these algorithms, but hopefully we will get a chance to take a look at more elegant ways to attack these systems, which you could do with a pencil and paper if you so desired.

I hope you are as excited for this blog as I am for my classes this semester!

All I know is what you see quoted from the blog.

Assuming Matt does well and keeps up with the blog, this could be a lot of fun.
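As a taste of the classical material Matt describes, here is a small Python sketch of brute forcing a Caesar (shift) cipher, the kind of attack that is within pencil-and-paper range, never mind a cell phone.

```python
# Brute force a Caesar cipher: try all 26 shifts and look for readable output.
import string

def shift(text, k):
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + k) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)

ciphertext = shift("classical ciphers fall to exhaustive search", 7)

for k in range(26):                      # the whole "attack"
    candidate = shift(ciphertext, -k)
    if "cipher" in candidate:            # crude crib; real attacks use letter frequencies
        print(k, candidate)
```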

Suggest you not leave any of your cryptographic doodles laying around at your local airport. 😉

Enjoy!

Snowden Used Low-Cost Tool to Best N.S.A.

Filed under: Cybersecurity,Humor,NSA,Web Scrapers,Webcrawler — Patrick Durusau @ 4:47 pm

Snowden Used Low-Cost Tool to Best N.S.A. by David E. Sanger and Eric Schmitt.

From the post:

Intelligence officials investigating how Edward J. Snowden gained access to a huge trove of the country’s most highly classified documents say they have determined that he used inexpensive and widely available software to “scrape” the National Security Agency’s networks, and kept at it even after he was briefly challenged by agency officials.

Using “web crawler” software designed to search, index and back up a website, Mr. Snowden “scraped data out of our systems” while he went about his day job, according to a senior intelligence official. “We do not believe this was an individual sitting at a machine and downloading this much material in sequence,” the official said. The process, he added, was “quite automated.”

The findings are striking because the N.S.A.’s mission includes protecting the nation’s most sensitive military and intelligence computer systems from cyberattacks, especially the sophisticated attacks that emanate from Russia and China. Mr. Snowden’s “insider attack,” by contrast, was hardly sophisticated and should have been easily detected, investigators found.

Moreover, Mr. Snowden succeeded nearly three years after the WikiLeaks disclosures, in which military and State Department files, of far less sensitivity, were taken using similar techniques.

Mr. Snowden had broad access to the N.S.A.’s complete files because he was working as a technology contractor for the agency in Hawaii, helping to manage the agency’s computer systems in an outpost that focuses on China and North Korea. A web crawler, also called a spider, automatically moves from website to website, following links embedded in each document, and can be programmed to copy everything in its path.
….

A highly amusing article that explains the ongoing Snowden leaks and perhaps offers a basis for projecting when the Snowden leaks will stop: not any time soon! The suspicion is that Snowden may have copied 1.7 million files.

Not with drag-n-drop but using a program!
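For anyone wondering what such a program amounts to, a minimal Python sketch (purely illustrative, and only for sites you are authorized to copy):

```python
# A bare-bones crawler: fetch a page, save it, follow the links it contains.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(start_url, limit=20):
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        pages[url] = resp.text                      # "copy everything in its path"
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))   # follow embedded links
    return pages

if __name__ == "__main__":
    pages = crawl("https://example.com/")
    print(len(pages), "pages saved")
```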

I’m sure that was news to a lot of managers in both industry and government.

Now of course the government is buttoning up all the information (allegedly), which will hinder access to materials by those with legitimate need.

It’s one thing to have these “true to your school” types in management at agencies where performance isn’t expected or tolerated. But in a spy agency that you are trying to use to save your citizens from themselves, that’s just self-defeating.

The real solution for the NSA and any other agency where you need high grade operations is to institute an Apache meritocracy process to manage both projects and to fill management slots. It would not be open source or leak to the press, at least not any more than it does now.

The upside would be the growth, over a period of years, of highly trained and competent personnel who would institute procedures that assisted with their primary functions, not simply to enable the hiring of contractors.

It’s worth a try, the NSA could hardly do worse than it is now.

PS: I do think the NSA is violating the U.S. Constitution but the main source of my ire is their incompetence in doing so. Gathering up phone numbers because they are easy to connect for example. Drunks under the streetlight.

PPS: This is also a reminder that it isn’t the cost/size of the tool but the effectiveness with which it is used that makes a real difference.

OTexts.org Update!

Filed under: Books,Open Access — Patrick Durusau @ 3:59 pm

OTexts.org has added three new books since my post on the launch of OTexts.

New titles:

Applied biostatistical analysis using R by Stephen B. Cox.

Introduction to Computing : Explorations in Language, Logic, and Machines by David Evans.

Modal logic of strict necessity and possibility by Evgeni Latinov.

The STEM fields have put the humanities to shame in terms of open access to high quality materials.

Don’t you think it was about time the humanities started using open access technologies?

Eventual Consistency Of Topic Maps

Filed under: Consistency,Topic Maps — Patrick Durusau @ 3:32 pm

What if all transactions required strict global consistency? by Matthew Aslett.

From the post:

My mum recently moved house. Being the dutiful son that I am I agreed to help her pack up her old house, drive to her new place and help unpack when we got there.

As it happens the most arduous part of the day did not involve packing, driving or unpacking but waiting: waiting for the various solicitors involved to confirm that the appropriate funds had been deposited in the appropriate bank accounts before the estate agents could hand over the keys.

It took hours, and was a reminder that while we might think of bank transfers as being instantaneous, there can be considerable delays involved in confirming that the correct amount has been debited from one bank account and credited to another.

Matthew goes on to illustrate that banking transactions have always been “eventually consistent.” He doesn’t mention it but the Uniform Commercial Code has several sections that cover checks, bank deposits and other matters. Text of the UCC at LLI.

The one thing the financial industry has that topic maps lack is a common expectation of “eventual consistency.” The Uniform Commercial Code establishes (where adopted) the rules by which “eventual consistency” for banks is governed.

To avoid client disappointment, discuss “eventual consistency” up front. If your client expects instantaneous merging, with some data sets, they are likely to be disappointed.
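A toy Python sketch of why merging can afford to be eventually consistent: when the merge operation is associative, commutative and idempotent, as set union of identifiers is, every delivery order converges to the same state.

```python
# Merge-by-union is order-insensitive, so replicas that see updates in
# different orders still end up with the same merged state.
import itertools

updates = [
    {"topic-1": {"http://example.com/A"}},
    {"topic-1": {"http://example.com/B"}},
    {"topic-2": {"http://example.com/C"}},
]

def apply(state, update):
    for topic, ids in update.items():
        state.setdefault(topic, set()).update(ids)   # union: associative, commutative, idempotent
    return state

results = []
for order in itertools.permutations(updates):
    state = {}
    for u in order:
        apply(state, u)
    results.append(state)

assert all(r == results[0] for r in results)   # every delivery order converges
print(results[0])
```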

The saying about a project being completed faster, cheaper, better, but you can only pick two out of the three? It works with merging as well.

Open and transparent altmetrics for discovery

Filed under: Citation Analysis,Similarity — Patrick Durusau @ 11:50 am

Open and transparent altmetrics for discovery by Peter Kraker.

From the post:

Altmetrics are a hot topic in scientific community right now. Classic citation-based indicators such as the impact factor are amended by alternative metrics generated from online platforms. Usage statistics (downloads, readership) are often employed, but links, likes and shares on the web and in social media are considered as well. The altmetrics promise, as laid out in the excellent manifesto, is that they assess impact quicker and on a broader scale.

The main focus of altmetrics at the moment is evaluation of scientific output. Examples are the article-level metrics in PLOS journals, and the Altmetric donut. ImpactStory has a slightly different focus, as it aims to evaluate the oeuvre of an author rather than an individual paper.

This is all good and well, but in my opinion, altmetrics have a huge potential for discovery that goes beyond rankings of top papers and researchers. A potential that is largely untapped so far.

How so? To answer this question, it is helpful to shed a little light on the history of citation indices.
….

Peter observes that co-citation is a measure of subject similarity, without the need to use the same terminology (Science Citation Index). Peter discovered in his PhD research that co-readership is also an indicator of subject similarity.

But more research is needed on co-readership to make it into a reproducible and well understood measure.

Peter is appealing for data sets suitable for this research.

It is subject similarity at the document level, but if it proves as useful as co-citation analysis has, it will be well worth the effort.
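For a feel of what the computation amounts to, a toy Python sketch with made-up data: two documents are similar to the extent that the same readers hold both, measured here with a cosine over reader sets.

```python
# Co-readership similarity: cosine of the overlap between two documents' reader sets.
from math import sqrt

library = {                      # reader -> documents in their library
    "r1": {"doc-A", "doc-B"},
    "r2": {"doc-A", "doc-B", "doc-C"},
    "r3": {"doc-C"},
}

def readers_of(doc):
    return {r for r, docs in library.items() if doc in docs}

def co_readership(d1, d2):
    a, b = readers_of(d1), readers_of(d2)
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

print(co_readership("doc-A", "doc-B"))   # 1.0: identical readerships
print(co_readership("doc-A", "doc-C"))   # 0.5: one shared reader
```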

Help out if you are able.

I first saw this in a tweet by Jason Priem.
