Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 19, 2013

Using a WHERE clause to filter paths

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 1:52 pm

neo4j/cypher: Using a WHERE clause to filter paths by Mark Needham.

From the post:

One of the cypher queries that I wanted to write recently was one to find all the players that have started matches for Arsenal this season and the number of matches that they’ve played in.

Mark sorts out the use of a WHERE clause on paths.

Visualization of a query as it occurs, tracing a path from node to node, slowed down for the human observer, could be an interesting debugging technique.

Will have to give that some thought.

Could also be instructive for debugging topic map merging as well.

Either one would be subject to visual “clutter” so it might work best with a test set that illustrates the problem.

Or perhaps by starting with the larger data set and slowly excluding content until only the problem area remains.

“…XML User Interfaces” As in Using XML?

Filed under: Conferences,Interface Research/Design,XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 1:00 pm

International Symposium on Native XML user interfaces

This came across the wire this morning and I need your help interpreting it.

Why would you want to have an interface to XML?

All these years I have been writing XML in Emacs because XML wasn’t supposed to have an interface.

Brave hearts, male, female and unknown, struggling with issues too obscure for mere mortals.

Now I find that isn’t supposed to be so? You can imagine my reaction.

I moved my laptop a bit closer to the peat fire to make sure I read it properly. Waiting for the ox cart later this week to take my complaint to the local bishop about this disturbing innovation.

😉

15 March 2013 — Peer review applications due
19 April 2013 — Paper submissions due
19 April 2013 — Applications for student support awards due
21 May 2013 — Speakers notified
12 July 2013 — Final papers due
5 August 2013 — International Symposium on Native XML user interfaces
6–9 August 2013 — Balisage: The Markup Conference

International Symposium on Native XML user interfaces

Monday, August 5, 2013, Hotel Europa, Montréal, Canada

XML is everywhere. It is created, gathered, manipulated, queried, browsed, read, and modified. XML systems need user interfaces to do all of these things. How can we make user interfaces for XML that are powerful, simple to use, quick to develop, and easy to maintain?

How are we building user interfaces today? How can we build them tomorrow? Are we using XML to drive our user interfaces? How?

This one-day symposium is devoted to the theory and practice of user interfaces for XML: the current state of implementations, practical case studies, challenges for users, and the outlook for the future development of the technology.

Relevant topics include:

  • Editors customized for specific purposes or users
  • User interfaces for creation, management, and use of XML documents
  • Uses of XForms
  • Making tools for creation of XML textual documents
  • Using general-purpose user-interface libraries to build XML interfaces
  • Looking at XML, especially looking at masses of XML documents
  • XML, XSLT, and XQuery in the browser
  • Specialized user interfaces for specialized tasks
  • XML vocabularies for user-interface specification

Presentations can take a variety of forms, including technical papers, case studies, and tool demonstrations (technical overviews, not product pitches).

This is the same conference I wrote about in: Markup Olympics (Balisage) [No Drug Testing].

In times of lean funding for conferences, if you go to a conference this year, it really should be Balisage.

You will be the envy of your co-workers and have tales to tell your grandchildren.

Not bad for one conference registration fee.

Using Clouds for MapReduce Measurement Assignments [Grad Class Near You?]

Filed under: Cloud Computing,MapReduce — Patrick Durusau @ 11:30 am

Using Clouds for MapReduce Measurement Assignments by Ariel Rabkin, Charles Reiss, Randy Katz, and David Patterson. (ACM Trans. Comput. Educ. 13, 1, Article 2 (January 2013), 18 pages. DOI = 10.1145/2414446.2414448)

Abstract:

We describe our experiences teaching MapReduce in a large undergraduate lecture course using public cloud services and the standard Hadoop API. Using the standard API, students directly experienced the quality of industrial big-data tools. Using the cloud, every student could carry out scalability benchmarking assignments on realistic hardware, which would have been impossible otherwise. Over two semesters, over 500 students took our course. We believe this is the first large-scale demonstration that it is feasible to use pay-as-you-go billing in the cloud for a large undergraduate course. Modest instructor effort was sufficient to prevent students from overspending. Average per-pupil expenses in the Cloud were under $45. Students were excited by the assignment: 90% said they thought it should be retained in future course offerings.

With properly structured assignments, I can see this technique being used to introduce library graduate students to data mining and similar topics on non-trivial data sets.

Getting “hands on” experience should make them more than a match for the sales types from information vendors.

Not to mention that data mining flourishes when used with an understanding of the underlying semantics of the data set.

I first saw this at: On Teaching MapReduce via Clouds

Really Large Queries: Advanced Optimization Techniques, Feb. 27

Filed under: MySQL,Performance,SQL — Patrick Durusau @ 11:10 am

Percona MySQL Webinar: Really Large Queries: Advanced Optimization Techniques, Feb. 27 by Peter Boros.

From the post:

Do you have a query you never dared to touch?
Do you know it’s bad, but it’s needed?
Does it fit your screen?
Does it really have to be that expensive?
Do you want to do something about it?

During the next Percona webinar on February 27, I will present some techniques that can be useful when troubleshooting such queries. We will go through case studies (each case study is made from multiple real-world cases). In these cases we were often able to reduce query execution time from 10s of seconds to a fraction of a second.

If you have SQL queries in your workflow, this will definitely be of interest.

How Stable is Your Ontology?

Filed under: Gene Ontology,Genome,Ontology — Patrick Durusau @ 8:00 am

Assessing identity, redundancy and confounds in Gene Ontology annotations over time by Jesse Gillis and Paul Pavlidis. (Bioinformatics (2013) 29 (4): 476-482. doi: 10.1093/bioinformatics/bts727)

Abstract:

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.

Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their ‘functional identity’ over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

Availability: Data available at http://chibi.ubc.ca/assessGO.

How does your ontology account for changes in identity over time?

Searching for Dark Data

Filed under: Dark Data,Lucene,LucidWorks — Patrick Durusau @ 7:42 am

Searching for Dark Data by Paul Doscher.

From the post:

We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation. The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.

Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it.

Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark.

Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.

The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.

Effective exploration of Dark Data will require something different from search tools that depend upon:

  • Pre-specified semantics (RDF) because Dark Data has no pre-specified semantics.
  • Structure because Dark Data has no structure.

Effective exploration of Dark Data will require:

Machine-assisted, interactive searching with gifted and grounded semantic comparators (people) creating pathways, tunnels and signposts into the wilderness of Dark Data.

I first saw this at: Delving into Dark Data.

dtSearch Tutorial Videos

Filed under: dtSearch,e-Discovery,Search Engines — Patrick Durusau @ 7:26 am

Tutorials for the dtSearch engine have been posted to ediscovery TV.

In five parts:

Part 1

Part 2

Part 3

Part 4

Part 5

I skipped over the intro videos only to find:

Not being able to “select all” in Excel doesn’t increase my confidence in the presentation. (part 3)

The copying of files that are “responsive” to a search request is convenient but not all that impressive. (part 4)

The presenter isn’t familiar with basic operations in dtSearch, such as why some files were not copied, though an answer does finally appear. (part 5)

Disappointing, because I remember dtSearch from years ago and it was (and still is) an impressive bit of work.

Suggestion: Don’t judge dtSearch by these videos.

I started to suggest you download all the brochures/white papers you will find at: http://www.dtsearch.com/contact.html

There is a helpful “Download All: PDF Portfolio” link. Except that it doesn’t work, in Chrome at least. It keeps giving me a “Download Adobe Acrobat 10” window, even after I install Adobe Acrobat 10.

Here’s a general hint for vendors: Don’t try to help. You will get it wrong. If you want to give users access to files, great, but let viewing/use be on their watch.

So, download the brochures/white papers individually until dtSearch recovers from another self-inflicted marketing wound.

Then grab a 30-day evaluation copy of the software.

It may or may not fit your needs but you will get a fairer sense of the product than you will from the videos or parts of the dtSearch website.

Maybe that’s the key: They are great search engineers, not so hot at marketing or websites.

I first saw this at dtSearch Harnesses TV Power, where the videos are cited but apparently not watched.

G-8 International Conference on Open Data for Agriculture

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 6:38 am

G-8 International Conference on Open Data for Agriculture

April 29-30, 2013, Washington, D.C.

Deadline for proposals: Midnight, February 28, 2013.

From the call for ideas:

Are you interested in addressing global challenges, such as food security, by providing open access to information? Would you like the opportunity to present to leaders from around the world?

We are seeking innovative products and ideas that demonstrate the potential of using open data to increase food security. This April 29-30th in Washington, D.C., the G-8 International Conference on Open Data for Agriculture will host policy makers, thought leaders, food security stakeholders, and data experts to build a strategy to share agriculture data and make innovation more accessible. As part of the conference, we are giving innovators a chance to showcase innovative uses of open data for food security in a lightning presentation or in the exhibit hall. This call for ideas is a chance to demonstrate the potential that open data can have in ensuring food security, and can inform an unprecedented global collaboration. Visit data.gov to see what agricultural data is already available and connect to other G-8 open data sites!

We are seeking top innovators to show the world what can be done with open data through:

  • Lightning Presentations: brief (3-5 minute), image rich presentations intended to convey an idea
  • Exhibit Hall: an opportunity to convey an idea through an image-rich exhibit.

Presentations should inspire others to share their data or imagine how open data could be used to increase food security. Presentations may include existing, new, or proposed applications of open data and should meet one or more of the following criteria:

  • Demonstrate the impact of open data on food security.
  • Demonstrate the impact of access to agriculturally-relevant data on developed and/or developing countries.
  • Demonstrate the impact of bringing multiple sources of agriculturally-relevant public and/or private open data together (think about the creation of an agriculture equivalent of weather.com)

For those with a new idea, we invite you to submit your proposal to present it to leading experts in food security, technology and data innovation. Proposals should identify which data is needed that is publicly available, for free, on the internet. Proposals must also include a design of the application including relevance to the target audience and plans for beta testing. A successful prototype will be mobile, interactive, and scalable. Proposals to showcase existing products or pitch new ideas will be reviewed by a global panel of technical experts from the G-8 countries.

Short notice, but judging from the submission form on the website, you only get 75-100 words to summarize your proposal.

Hell, I have trouble identifying myself in 75-100 words. 😉

Still, if you are in D.C. and interested, it could be a good way to meet people in this area.

The nine flags for the G-8 are confusing at first, but they aren’t an example of government committee counting: the EU also has a representative at G-8 meetings.

I first saw this at: Open Call to Innovators: Apply to present at G-8 International Conference on Open Data for Agriculture.

February 18, 2013

Simple Web Semantics – Index Post

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 4:23 pm

Sam Hunting suggested that I add indexes to the Simple Web Semantics posts to facilitate navigating from one to the other.

It occurred to me that having a single index page could also be useful.

The series began with:

Reasoning about why something isn’t working is important before proposing a solution.

I have gotten good editorial feedback on the proposal and will be posting a revision in the next couple of days.

Nothing substantially different but clearer and more precise.

If you have any comments or suggestions, please make them at your earliest convenience.

I am always open to comments but the sooner they arrive the sooner I can make improvements.

VOStat: A Statistical Web Service… [Open Government, Are You Listening?]

Filed under: Astroinformatics,Statistics,Topic Maps,VOStat — Patrick Durusau @ 11:59 am

VOStat: A Statistical Web Service for Astronomers

From the post:

VOStat is a simple statistical web service that lets you analyze your data without the hassle of downloading or installing any software. VOStat provides interactive statistical analysis of astronomical tabular datasets. It is integrated into the suite of analysis and visualization tools associated with the Virtual Observatory (VO) through the SAMP communication system. A user supplies VOStat with a dataset and chooses among ~60 statistical functions, including data transformations, plots and summaries, density estimation, one- and two-sample hypothesis tests, global and local regressions, multivariate analysis and clustering, spatial analysis, directional statistics, survival analysis, and time series analysis. VOStat was developed by the Center for Astrostatistics (Pennsylvania State University).

The astronomical community has data sets that dwarf any open government data set and they have ~ 60 statistical functions?

Whereas in open government data, dumping data files to public access is considered being open?

The technology to do better already exists.

So, what is your explanation for defining openness as “data dumps to the web?”


PS: Have you ever thought about creating a data interface that holds mappings between data sets, such as a topic map would produce?

Would papering over agency differences in terminology assist users in taking advantage of their data sets? (Subject to disclosure that the mapping is happening.)

Would you call that a “TMStat: A Topic Map Statistical Web Service?”

(Disclosure of the basis for mapping being what distinguishes a topic map statistical web service from a fixed mapping between undefined column headers in different tables.)

So, what’s brewing with HCatalog

Filed under: Hadoop,HCatalog — Patrick Durusau @ 11:36 am

So, what’s brewing with HCatalog

From the post:

Apache HCatalog announced release of version 0.5.0 in the past week. Along with that, it has initiated steps to graduate from an incubator project to be an Apache Top Level project or sub-project. Let’s look at the current state of HCatalog, its increasing relevance and where it is heading.

HCatalog, for a small introduction, is a “table management and storage management layer for Apache Hadoop” which:

  • enables Pig, MapReduce, and Hive users to easily share data on the grid.
  • provides a table abstraction for a relational view of data in HDFS
  • ensures format indifference (viz RCFile format, text files, sequence files)
  • provides a notification service when new data becomes available

Nice summary of the current state of HCatalog, pointing to a presentation by Alan Gates from Big Data Spain 2012.

hyperloglog-redis

Filed under: BigData,Graph Traversal,HyperLogLog,Probabilistic Counting — Patrick Durusau @ 10:04 am

hyperloglog-redis

From the webpage:

This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters. A minimal example:

require 'redis'
require 'hyperloglog-redis'

counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
  counter.add('beatles', beatle)
end

puts "There are approximately #{counter.count('beatles')} distinct Beatles"

Each HyperLogLog counter uses a small, fixed amount of space but can estimate the cardinality of any set of up to around a billion values with relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a parameter passed to the HyperLogLog::Counter initializer that defaults to 10. With b = 10, each counter is represented by a 1 KB string in Redis and we get an expected relative error of 3%. Contrast this with the amount of space needed to compute set cardinality exactly, which is over 100 MB for even a bit vector representing a set with a billion values.

The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 ** -(X + 1). So, if you’ve seen a run of 5 zeros in one of your hash values, you’re likely to have around 2 ** 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that’s the idea.

This gem implements a few useful extensions to the basic HyperLogLog algorithm which allow you to estimate unions and intersections of counters as well as counts within specific time ranges. These extensions are described in detail below.

The HyperLogLog algorithm is described and analyzed in the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely follows the program described in Section 4 of that paper.

The same paper is mentioned in: Count a billion distinct objects w/ 1.5KB of Memory (Coarsening Graph Traversal). Consult the implementation details there as well.
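
The “longest run of zeros” intuition quoted above is easy to see in a few lines of plain Ruby. This is only a sketch of the idea, not the gem’s algorithm; the function name, the choice of SHA-1 and the sample input are mine:

require 'digest'

# Hash each value, track the longest run of leading zero bits seen, and
# guess 2 ** (run + 1) distinct values. Real HyperLogLog (including
# hyperloglog-redis) splits the hash into many registers and averages them
# with bias corrections, which is where the 1.04 / Math.sqrt(2 ** b) error
# bound comes from; a single register like this one is very noisy.
def crude_estimate(values)
  longest = 0
  values.each do |v|
    bits = Digest::SHA1.hexdigest(v.to_s).hex.to_s(2).rjust(160, '0')
    longest = [longest, bits.index('1') || bits.size].max
  end
  2 ** (longest + 1)
end

puts crude_estimate((1..50_000).map(&:to_s))  # a coarse power-of-two guess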

I first saw this in NoSQL Weekly, Issue 116.

Writing Hive UDFs – a tutorial

Filed under: Hive,HiveQL — Patrick Durusau @ 9:44 am

Writing Hive UDFs – a tutorial by Alexander Dean.

Synopsis:

In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.

In order to get the most out of this article, you should be comfortable programming in Java. You do not need to have any experience with Apache Hive, HiveQL (the Hive query language) or indeed Hive UDFs – I will introduce all of these concepts from first principles. Experience with Scala is advantageous, but not necessary.

The example UDF isn’t impressive, so impressive ones are left as an exercise for the reader. 😉

Also of interest:

Hive User Defined Functions (at the Apache Hive wiki).

Which you should compare to:

What are the biggest feature gaps between HiveQL and SQL? (at Quora)

There are plenty of opportunities for new UDFs, including those addressing semantic integration.

I first saw this in NoSQL Weekly, Issue 116.

Real World Hadoop – Implementing a Left Outer Join in Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 6:25 am

Real World Hadoop – Implementing a Left Outer Join in Map Reduce by Matthew Rathbone.

From the post:

This article is part of my guide to map reduce frameworks, in which I implement a solution to a real-world problem in each of the most popular hadoop frameworks.

If you’re impatient, you can find the code for the map-reduce implementation on my github, otherwise, read on!

The Problem
Let me quickly restate the problem from my original article.

I have two datasets:

  1. User information (id, email, language, location)
  2. Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)

Given these datasets, I want to find the number of unique locations in which each product has been sold.

Not as easy a problem as it appears, but I suspect a common one in practice.
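
To make the shape of one common solution concrete, here is a minimal, in-memory sketch of a reduce-side join in Ruby. The field names follow the lists above, but the data, the helper names and the single-process setting are my own; this is not Matthew’s code:

users = [
  { id: 1, email: 'a@example.com', language: 'EN', location: 'US' },
  { id: 2, email: 'b@example.com', language: 'FR', location: 'FR' }
]
transactions = [
  { transaction_id: 10, product_id: 'p1', user_id: 1, purchase_amount: 9.99, item_description: 'book' },
  { transaction_id: 11, product_id: 'p1', user_id: 2, purchase_amount: 9.99, item_description: 'book' },
  { transaction_id: 12, product_id: 'p2', user_id: 1, purchase_amount: 3.50, item_description: 'pen' }
]

# "Map": tag each record with its source and key it by user id.
tagged = users.map        { |u| [u[:id],      [:user, u]] } +
         transactions.map { |t| [t[:user_id], [:transaction, t]] }

# "Shuffle" and "reduce": group by key, look up the (possibly missing) user
# record, and emit a (product_id, location) pair per transaction. A nil
# location keeps unmatched transactions, which is the "left outer" part.
pairs = tagged.group_by(&:first).flat_map do |_, recs|
  recs = recs.map(&:last)
  user = recs.find { |tag, _| tag == :user }
  recs.select { |tag, _| tag == :transaction }
      .map { |_, t| [t[:product_id], user ? user.last[:location] : nil] }
end

# Final aggregation: distinct locations per product.
locations = pairs.group_by(&:first)
                 .transform_values { |ps| ps.map(&:last).compact.uniq.size }
puts locations.inspect  # => {"p1"=>2, "p2"=>1}

A real Hadoop job does the same work per key, just distributed across mappers and reducers, with the join and the aggregation typically split into separate stages.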

Clydesdale: Structured Data Processing on MapReduce

Filed under: Clydesdale,Hadoop,MapReduce — Patrick Durusau @ 6:16 am

Clydesdale: Structured Data Processing on MapReduce by Tim Kaldewey, Eugene J. Shekita, Sandeep Tata.

Abstract:

MapReduce has emerged as a promising architecture for large scale data analytics on commodity clusters. The rapid adoption of Hive, a SQL-like data processing language on Hadoop (an open source implementation of MapReduce), shows the increasing importance of processing structured data on MapReduce platforms. MapReduce offers several attractive properties such as the use of low-cost hardware, fault-tolerance, scalability, and elasticity. However, these advantages have required a substantial performance sacrifice.

In this paper we introduce Clydesdale, a novel system for structured data processing on Hadoop – a popular implementation of MapReduce. We show that Clydesdale provides more than an order of magnitude in performance improvements compared to existing approaches without requiring any changes to the underlying platform. Clydesdale is aimed at workloads where the data fits a star schema. It draws on column oriented storage, tailored join-plans, and multicore execution strategies and carefully fits them into the constraints of a typical MapReduce platform. Using the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive. This demonstrates that MapReduce in general, and Hadoop in particular, is a far more compelling platform for structured data processing than previous results suggest. (emphasis in original)

The authors make clear that Clydesdale is a research prototype and lacks many features needed for full production use.

But one and sometimes two orders of magnitude of improvement should pique your interest in helping to add those features.

I find the “re-use” of existing Hadoop infrastructure particularly exciting.

Gains of an order of magnitude or more over current approaches are a signal that someone is thinking about the issues and not simply throwing horsepower at a problem.

I first saw this in NoSQL Weekly, Issue 116.

Open Annotation Collaboration

Filed under: AnnotateIt,Annotation,Annotator — Patrick Durusau @ 5:41 am

Open Annotation Collaboration

From the webpage:

We are pleased to announce the publication of the 1.0 release of the Open Annotation Data Model & Ontology. This work is the product of the W3C Open Annotation Community Group jointly founded by the Annotation Ontology and the Open Annotation Collaboration. The OA Community Group will be hosting three public rollout events, U.S. West Coast, U.S. East Coast, and in the U.K. this Spring and early Summer. Implementers, developers, and information managers who attend one of these meetings will learn about the OA Data Model & Ontology firsthand from OA Community implementers and see existing annotation services that have been built using the OA model.

Open Annotation Specification Rollout Dates

U.S. West Coast Rollout – 09 April 2013 at Stanford University

U.S. East Coast Rollout – 06 May 2013 at the University of Maryland

U.K. Rollout – 24 June 2013 at the University of Manchester

No registration fee but RSVP required.

Materials on the Open Annotation Data Model & Ontology (W3C) and other annotation resources.

The collection of Known Annotation Clients is my favorite.

International Conference on Theory and Practice of Digital Libraries (TPDL)

Filed under: Conferences,Digital Library,Librarian/Expert Searchers,Library — Patrick Durusau @ 5:26 am

International Conference on Theory and Practice of Digital Libraries (TPDL)

Valletta, Malta, September 22-26, 2013. I thought that would get your attention. Details follow.

Dates:

Full and Short papers, Posters, Panels, and Demonstrations deadline: March 23, 2013

Workshops and Tutorials proposals deadline: March 4, 2013

Doctoral Consortium papers submission deadline: June 2, 2013

Notification of acceptance for Papers, Posters, and Demonstrations: May 20, 2013

Notification of acceptance for Panels, Workshops and Tutorials: April 22, 2013

Doctoral Consortium acceptance notification: June 24, 2013

Camera ready versions: June 9, 2013

End of early registration: July 31, 2013

Conference dates: September 22-26, 2013

The general theme of the conference is “Sharing meaningful information,” a theme reflected in the topics for conference submissions:

General areas of interests include, but are not limited to, the following topics, organized in four categories, according to a conceptualization that coincides with the four arms of the Maltese Cross:

Foundations

  • Information models
  • Digital library conceptual models and formal issues
  • Digital library 2.0
  • Digital library education curricula
  • Economic and legal aspects (e.g. rights management) landscape for digital libraries
  • Theoretical models of information interaction and organization
  • Information policies
  • Studies of human factors in networked information
  • Scholarly primitives
  • Novel research tools and methods with emphasis on digital humanities
  • User behavior analysis and modeling
  • Social-technical perspectives of digital information

Infrastructures

  • Digital library architectures
  • Cloud and grid deployments
  • Federation of repositories
  • Collaborative and participatory information environments
  • Data storage and indexing
  • Big data management
  • e-science, e-government, e-learning, cultural heritage infrastructures
  • Semi structured data
  • Semantic web issues in digital libraries
  • Ontologies and knowledge organization systems
  • Linked Data and its applications

Content

  • Metadata schemas with emphasis to metadata for composite content (Multimedia, geographical, statistical data and other special content formats)
  • Interoperability and Information integration
  • Digital Curation and related workflows
  • Preservation, authenticity and provenance
  • Web archiving
  • Social media and dynamically generated content for particular uses/communities (education, science, public, etc.)
  • Crowdsourcing
  • 3D models indexing and retrieval
  • Authority management issues

Services

  • Information Retrieval and browsing
  • Multilingual and Multimedia Information Retrieval
  • Personalization in digital libraries
  • Context awareness in information access
  • Semantic aware services
  • Technologies for delivering/accessing digital libraries, e.g. mobile devices
  • Visualization of large-scale information environments
  • Evaluation of online information environments
  • Quality metrics
  • Interfaces to digital libraries
  • Data mining/extraction of structure from networked information
  • Social networks analysis and virtual organizations
  • Traditional and alternative metrics of scholarly communication
  • Mashups of resources

Do you know if there are plans for recording presentations?

Given the location and diminishing travel funding, recording would be an efficient way to increase the impact of the presentations.

February 17, 2013

Simple Web Semantics (SWS) – Syntax Refinement

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 8:19 pm

In Saving the “Semantic” Web (part 5), the only additional HTML syntax I proposed was:

<meta name="dictionary" content="URI">

in the <head> element of an HTML document.

(Where you would locate the equivalent declaration of a URI dictionary in other document formats will vary.)

But that sets the URI dictionary for an entire document.

What if you want more fine grained control over the URI dictionary for a particular URI?

It would be possible to do something complicated with namespaces, containers, scope, etc., but the simpler solution would be:

<a dictionary="URI" href="yourURI">

Either the URI is governed by the declaration for the entire page or it has a declared dictionary URI.

Or to summarize the HTML syntax of SWS at this point:

<meta name="dictionary" content="URI">

<a dictionary="URI" href="yourURI">
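
To see how the precedence rule plays out, here is a sketch of how a consuming application might resolve the governing dictionary URI for each link. It assumes the nokogiri gem and is only my illustration of the proposal, not part of it:

require 'nokogiri'  # assumed; any HTML parser would do

html = <<-PAGE
<html>
  <head><meta name="dictionary" content="http://example.com/dict/default"></head>
  <body>
    <a href="http://example.com/topic/1">uses the page dictionary</a>
    <a dictionary="http://example.com/dict/special"
       href="http://example.com/topic/2">uses its own dictionary</a>
  </body>
</html>
PAGE

doc = Nokogiri::HTML(html)
meta = doc.at_css('meta[name="dictionary"]')
page_dictionary = meta && meta['content']

doc.css('a[href]').each do |link|
  # A dictionary attribute on the link overrides the page-level declaration.
  dictionary = link['dictionary'] || page_dictionary
  puts "#{link['href']} -> #{dictionary}"
end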

REMOTE: Office Not Required

Filed under: Books,Business Intelligence — Patrick Durusau @ 8:18 pm

REMOTE: Office Not Required

From the post:

As an employer, restricting your hiring to a small geographic region means you’re not getting the best people you can. As an employee, restricting your job search to companies within a reasonable commute means you’re not working for the best company you can. REMOTE, the new book by 37signals, shows both employers and employees how they can work together, remotely, from any desk, in any space, in any place, anytime, anywhere.

REMOTE will be published in the fall of 2013 by Crown (Random House).

I was so impressed by Rework (see: Emulate Drug Dealers [Marketing Topic Maps]) that I am recommending REMOTE ahead of its publication.

Whether the lessons in REMOTE will be heard by most employers, or shall we say their managers, remains to be seen.

Perhaps performance in revenue and the stock market will be important clues. 😉

Developing Your Own Solr Filter part 2

Filed under: Lucene,Solr — Patrick Durusau @ 8:17 pm

Developing Your Own Solr Filter part 2

From the post:

In the previous entry “Developing Your Own Solr Filter” we’ve shown how to implement a simple filter and how to use it in Apache Solr. Recently, one of our readers asked if we can extend the topic and show how to write more than a single token into the token stream. We decided to go for it and extend the previous blog entry about filter implementation.

What better way to start the week!

Video: Data Mining with R

Filed under: Data Mining,R — Patrick Durusau @ 8:17 pm

Video: Data Mining with R by David Smith.

From the post:

Yesterday's Introduction to R for Data Mining webinar was a record setter, with more than 2000 registrants and more than 700 attending the live session presented by Joe Rickert. If you missed it, I've embedded the video replay below, and Joe's slides (with links to many useful resources) are also available.

During the webinar, Joe demoed several examples of data mining with R packages, including rattle, caret, and RevoScaleR from Revolution R Enterprise. If you want to adapt Joe's demos for your own data mining ends, Joe has made his scripts and data files available for download on github.

Glad this showed up! I accidentally missed the webinar.

Enjoy!

Text Processing (part 1) : Entity Recognition

Filed under: Entity Extraction,Text Mining — Patrick Durusau @ 8:17 pm

Text Processing (part 1) : Entity Recognition by Ricky Ho.

From the post:

Entity recognition is commonly used to parse unstructured text documents and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing tasks for understanding a text document.

I am planning to write a blog series on text processing. In this first blog of a series on basic text processing algorithms, I will introduce some basic algorithms for entity recognition.

Looking forward to this series!
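
For readers new to the idea, the structured output that entity recognition aims for is easy to see with a toy, gazetteer-based tagger. This is nothing like the statistical approaches a real recognizer (or Ricky’s series) will use; the entity lists and the sample sentence are made up:

# Scan the text for each known name and report its type, surface text and
# character offset. Real systems use statistical models and context rather
# than a fixed list, but the shape of the output is the same idea.
GAZETTEER = {
  location: ['London', 'New York'],
  person:   ['Ada Lovelace'],
  brand:    ['Acme']
}

def tag_entities(text)
  entities = []
  GAZETTEER.each do |type, names|
    names.each do |name|
      re  = /\b#{Regexp.escape(name)}\b/
      pos = 0
      while (m = re.match(text, pos))
        entities << { type: type, text: name, offset: m.begin(0) }
        pos = m.end(0)
      end
    end
  end
  entities.sort_by { |e| e[:offset] }
end

p tag_entities('Ada Lovelace visited the Acme offices in London and New York.')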

Finding tools vs. making tools:…

Filed under: Journalism,News,Software — Patrick Durusau @ 8:17 pm

Finding tools vs. making tools: Discovering common ground between computer science and journalism by Nick Diakopoulos.

From the post:

The second Computation + Journalism Symposium convened recently at the Georgia Tech College of Computing to ask the broad question: What role does computation have in the practice of journalism today and in the near future? (I was one of its organizers.) The symposium attracted almost 150 participants, both technologists and journalists, to discuss and debate the issues and to forge a multi-disciplinary path forward around that question.

Topics for panels covered the gamut, from precision and data journalism, to verification of visual content, news dissemination on social media, sports and health beats, storytelling with data, longform interfaces, the new economic landscape of content, and the educational needs of aspiring journalists. But what made these sessions and topics really pop was that participants on both sides of the computation and journalism aisle met each other in a conversational format where intersections and differences in the ways they viewed these topics could be teased apart through dialogue. (Videos of the sessions are online.)

While the panelists were all too civilized for any brawls to break out, mixing two disciplines as different as computing and journalism nonetheless did lead to some interesting discussions, divergences, and opportunities that I’d like to explore further here. Keeping these issues top-of-mind should help as this field moves forward.

Tool foragers and tool forgers

The following metaphor is not meant to be incendiary, but rather to illuminate two different approaches to tool innovation that seemed apparent at the symposium.

Imagine you live about 10,000 years ago, on the cusp of the Neolithic Revolution. The invention of agriculture is just around the corner. It’s spring and you’re hungry after the long winter. You can start scrounging around for berries and other tasty roots to feed you and your family — or you can stop and try to invent some agricultural implements, tools adapted to your own local crops and soil that could lead to an era of prosperity. If you take the inventive approach, you might fail, and there’s a real chance you’ll starve trying — while foraging will likely guarantee you another year of subsistence life.

What role does computation have in your field of practice?

Download all your tweets [Are You An Outlier/Drone Target?]

Filed under: Data,Tweets — Patrick Durusau @ 8:17 pm

Download all your tweets by Ajay Ohri.

From the post:

Now that the Government of the United States of America has the legal power to request your information without a warrant (The Chinese love this!)

Anyways- you can also download your own twitter data. Liberate your data.

Have you looked at your own data? Go there at https://twitter.com/settings/account and review the changes.

Modern governments invent evidence out of whole cloth, enough to topple other governments, so whether my communications are secure or not may be a moot point.

It may make a difference whether your communications stand out, such that they focus on inventing evidence about you.

In that case, having all your tweets, particularly with the tweets of others, could be a useful thing.

With enough data, a profile could be constructed so that your tweets fall within plus or minus some percentage of the normal tweets for your demographic.

I don’t ever tweet about American Idol (#idol) so I am already an outlier. 😉

Mapping the demographics to content and hash tags, along with dates, events, etc. would make for a nice graph/topic map type application.

Perhaps a deviation warning system if your tweets started to curve away from the pack.

Hiding from data mining isn’t an option.

The question is how to hide in plain sight?
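
As a toy illustration of the kind of deviation warning I have in mind, compare a user’s hashtag distribution against a baseline for their demographic. The distributions and the threshold below are made up; a real system would need far richer features:

baseline = { '#idol' => 0.30, '#news' => 0.40, '#sports' => 0.30 }
mine     = { '#news' => 0.20, '#graphs' => 0.50, '#topicmaps' => 0.30 }

# Total variation distance between the two hashtag distributions:
# 0 means identical habits, 1 means no overlap at all.
tags = (baseline.keys + mine.keys).uniq
distance = 0.5 * tags.sum { |t| ((baseline[t] || 0.0) - (mine[t] || 0.0)).abs }

puts format('distance from the pack: %.2f', distance)
puts 'warning: your tweets are curving away from the pack' if distance > 0.25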

Interpreting scientific literature: A primer

Filed under: Humor,Semantics — Patrick Durusau @ 8:16 pm

Interpreting scientific literature: A primer by kshameer.

It’s visual so follow the link.

I shouldn’t re-post this sort of thing, being something of a professional academic, but it’s too funny to resist.

Would be interesting to create an auto-tagger that could be run against online text to supply markup with the “they mean” values to be displayed on command.

😉

I first saw this at Christophe Lalanne’s A bag of tweets / January 2013.

Models and Algorithms for Crowdsourcing Discovery

Filed under: Crowd Sourcing — Patrick Durusau @ 5:15 pm

Models and Algorithms for Crowdsourcing Discovery by Siamak Faridani. (PDF)

From the abstract:

The internet enables us to collect and store unprecedented amounts of data. We need better models for processing, analyzing, and making conclusions from the data. In this work, crowdsourcing is presented as a viable option for collecting data, extracting patterns and insights from big data. Humans in collaboration, when provided with appropriate tools, can collectively see patterns, extract insights and draw conclusions from data. We study different models and algorithms for crowdsourcing discovery.

In each section in this dissertation a problem is proposed, the importance of it is discussed, and solutions are proposed and evaluated. Crowdsourcing is the unifying theme for the projects that are presented in this dissertation. In the first half of the dissertation we study different aspects of crowdsourcing like pricing, completion times, incentives, and consistency with in-lab and controlled experiments. In the second half of the dissertation we focus on Opinion Space and the algorithms and models that we designed for collecting innovative ideas from participants. This dissertation specifically studies how to use crowdsourcing to discover patterns and innovative ideas.

We start by looking at the CONE Welder project which uses a robotic camera in a remote location to study the effect of climate change on the migration of birds. In CONE, an amateur birdwatcher can operate a robotic camera at a remote location from within her web browser. She can take photos of different bird species and classify different birds using the user interface in CONE. This allowed us to compare the species presented in the area from 2008 to 2011 with the species presented in the area that are reported by Blacklock in 1984 [Blacklock, 1984]. Citizen scientists found eight avian species previously unknown to have breeding populations within the region. CONE is an example of using crowdsourcing for discovering new migration patterns.

Crowdsourcing has great potential.

Especially if you want to discover the semantics people are using rather than dictating the semantics they ought to be using.

I think the former is more accurate than the latter.

You?

I first saw this at Christophe Lalanne’s A bag of tweets / January 2013.

Lisp lore : a guide to programming the Lisp machine (1986)

Filed under: Lisp,Programming — Patrick Durusau @ 4:09 pm

Lisp lore : a guide to programming the Lisp machine (1986) by Hank Bromley.

From the introduction:

The full 11-volume set of documentation that comes with a Symbolics lisp machine is understandably intimidating to the novice. “Where do I start?” is an oft-heard question, and one without a good answer. The eleven volumes provide an excellent reference medium, but are largely lacking in tutorial material suitable for a beginner. This book is intended to fill that gap. No claim is made for completeness of coverage — the eleven volumes fulfill that need. My goal is rather to present a readily grasped introduction to several representative areas of interest, including enough information to show how easy it is to build useful programs on the lisp machine. At the end of this course, the student should have a clear enough picture of what facilities exist on the machine to make effective use of the complete documentation, instead of being overwhelmed by it.

From the days when documentation was an expectation, not a luxury.

One starting place to decide if the ideas in a patent application are “new” or invented before a patent examiner went to college. 😉

Some other Lisp content you may find of interest:

I first saw this at Christophe Lalanne’s “A bag of tweets / January 2013.”

February 16, 2013

Deep Inside: A Study of 10,000 Porn Stars and Their Careers

Filed under: Data,Data Mining,Porn — Patrick Durusau @ 4:49 pm

Deep Inside: A Study of 10,000 Porn Stars and Their Careers by Jon Millward.

From the post:

For the first time, a massive data set of 10,000 porn stars has been extracted from the world’s largest database of adult films and performers. I’ve spent the last six months analyzing it to discover the truth about what the average performer looks like, what they do on film, and how their role has evolved over the last forty years.

I can now name the day when I became aware of the Internet Adult Film Database: today!

When you get through grinning, go take a look at the post. This is serious data analysis.

Complete with an idealized porn star face composite from the most popular porn stars.

Improve your trivia skills: What two states in the United States have one porn star each in the Internet Adult Film Database? (Jon has a map of the U.S. with distribution of porn stars.)

A full report with more details about the analysis is forthcoming.

I first saw this at Porn star demographics by Nathan Yau.

Methods of Proof — Direct Implication

Filed under: Mathematical Reasoning,Mathematics — Patrick Durusau @ 4:49 pm

Methods of Proof — Direct Implication by Jeremy Kun.

From the post:

I recently posted an exploratory piece on why programmers who are genuinely interested in improving their mathematical skills can quickly lose stamina or be deterred. My argument was essentially that they don’t focus enough on mastering the basic methods of proof before attempting to read research papers that assume such knowledge. Also, there are a number of confusing (but in the end helpful) idiosyncrasies in mathematical culture that are often unexplained. Together these can cause enough confusion to stymie even the most dedicated reader. I have certainly experienced it enough to call the feeling familiar.

Now I’m certainly not trying to assert that all programmers need to learn mathematics to improve their craft, nor that learning mathematics will be helpful to any given programmer. All I claim is that someone who wants to understand why theorems are true, or how to tweak mathematical work to suit their own needs, cannot succeed without a thorough understanding of how these results are developed in the first place. Function definitions and variable declarations may form the scaffolding of a C program while the heart of the program may only be contained in a few critical lines of code. In the same way, the heart of a proof is usually quite small and the rest is scaffolding. One surely cannot understand or tweak a program without understanding the scaffolding, and the same goes for mathematical proofs.

And so we begin this series focusing on methods of proof, and we begin in this post with the simplest such methods. I call them the “basic four,” and they are:

  • Proof by direct implication
  • Proof by contradiction
  • Proof by contrapositive, and
  • Proof by induction.

This post will focus on the first one, while introducing some basic notation we will use in the future posts. Mastering these proof techniques does take some practice, and it helps to have some basic mathematical content with which to practice on. We will choose the content of set theory because it’s the easiest field in terms of definitions, and its syntax is the most widely used in all but the most pure areas of mathematics. Part of the point of this primer is to spend time demystifying notation as well, so we will cover the material at a leisurely (for an experienced mathematician: aggravatingly slow) pace.

Parallel processing, multi-core memory architectures, graphs and the like are a long way from the cookbook stage of programming.

If you want to be on the leading edge, some mathematics is going to be required.

This series can bring you one step closer to mathematical literacy.

I say “can” because whether it will or not, depends upon you.
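
As a taste of what “direct implication” looks like in the set-theory setting Jeremy chooses (my example, not one from his post): to prove that A ⊆ B and B ⊆ C imply A ⊆ C, take any x ∈ A; since A ⊆ B, we have x ∈ B; since B ⊆ C, we have x ∈ C; so every element of A lies in C, which is exactly the statement A ⊆ C. Assume the hypothesis, chain the definitions, arrive at the conclusion. That is the whole method.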
