Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

October 11, 2013

School of Haskell

Filed under: Functional Programming,Haskell — Patrick Durusau @ 5:52 pm

School of Haskell

I didn’t see a history of the site but judging from the dates on the published articles, the School of Haskell came online in late December 2012.

Since then it has accumulated approximately two hundred and eighty four (284) published articles.

Not to mention these resources listed on the homepage:

  • How to Use the School of Haskell – Learning, playing, and keeping up with the latest. Welcome to the SoH: We have created the School of Haskell to…
  • Pick of the Week A weekly selection of interesting tutorials created by our users….
  • IDE Tutorials Tutorials on how to make the best use of the FP Haskell Center including hidden gems, and code examples.
  • Project Templates Project templates for use with FP Haskell Center. The world's first commercial Haskell IDE and deployment platform.
  • Basics of Haskell A gentle introduction to Haskell for beginners….
  • Introduction to Haskell A basic introduction to Haskell based on a half-credit course (CIS 194) taught at the University of Pennsylvania.
  • Text manipulation A collection of tutorials on processing text – parsing structured forms, etc.
  • Random numbers in Haskell A look at how to use System.Random
  • Database access A tutorial covering database usage from Haskell with persistent-db.
  • Beautiful Concurrency – “Beautiful concurrency” appeared in Beautiful Code (http://shop.oreilly.com/product/9780596510046.do)…
  • Basics of Yesod A gentle introduction to a Haskell web framework for beginners (no Haskell required)
  • Advanced Haskell Articles and tutorials for advanced Haskell programmers
  • Haskell Fast & Hard The SoH Version of Haskell Fast & Hard tutorial. From very beginner up to Monads in a very short and dense tutorial….

Interested in Haskell or functional programming? This is a site to re-visit on a regular basis.

FastBit:…

Filed under: Bitmap Indexes,FastBit,Indexing — Patrick Durusau @ 4:44 pm

FastBit: An Efficient Compressed Bitmap Index Technology

From the webpage:

FastBit is an open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user’s data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools.

Software

The FastBit software is distributed under the Less GNU Public License (LGPL). The software is available at codeforge.lbl.gov. The most recent release is FastBit ibis1.3.7; it comes as a source tar ball named fastbit-ibis1.3.7.tar.gz. The latest development version is available from http://goo.gl/Ho7ty.

Other items of interest:

FastBit related publications

The most recent entry in this list is 2011. A quick search of the ACM Digital Library (for fastBit) found seventeen (17) articles for 2012 – 2013.

FastBit Users Guide

From the users guide:

This package implements a number of different bitmap indexes compressed with Word-Aligned Hybrid (WAH) code. These indexes differ in their bitmap encoding methods and binning options. The basic bitmap index compressed with WAH has been shown to answer one-dimensional queries in time that is proportional to the number of hits in theory. In a number of performance measurements, WAH compressed indexes were found to be much more efficient than other indexes [CIKM 2001] [SSDBM 2002] [DOLAP 2002]. One of the crucial steps in achieving this efficiency is being able to perform bitwise OR operations on large compressed bitmaps efficiently without decompression [VLDB 2004]. Numerous other bitmap encodings and binning strategies are implemented in this software package; please refer to indexSpec.html for descriptions of how to access these indexes and refer to our publications for extensive studies of these methods. FastBit was primarily developed to test these techniques for improving compressed bitmap indexes. Even though it has grown to include a small number of other useful data analysis functions, its primary strength is still in having a diversity of efficient compressed bitmap indexes.
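If bitmap indexes are new to you, the core idea fits in a few lines of Python. This is a toy, uncompressed sketch (no WAH compression, no binning, column and values invented), but it shows why a query can be answered with bitwise operations:

```python
# Toy bitmap index: one bitmap per distinct column value, with a Python int
# used as the bit vector. Bit i of bitmaps[v] is set iff rows[i] == v.
from collections import defaultdict

rows = ["red", "blue", "red", "green", "blue", "red"]  # one column of data

bitmaps = defaultdict(int)
for i, value in enumerate(rows):
    bitmaps[value] |= 1 << i

def row_ids(bitmap, n):
    """Expand a bitmap back into the matching row numbers."""
    return [i for i in range(n) if bitmap & (1 << i)]

# "WHERE color = 'red' OR color = 'green'" becomes a single bitwise OR.
hits = bitmaps["red"] | bitmaps["green"]
print(row_ids(hits, len(rows)))   # [0, 2, 3, 5]
```

FastBit’s contribution is doing the same kind of bitwise work directly on WAH-compressed bitmaps, without decompressing them first.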

Just in case you want to follow up on the use of FastBit in RaptorDB (the next post below).

RaptorDB – the Document Store

Filed under: .Net,Database,RaptorDB — Patrick Durusau @ 3:51 pm

RaptorDB – the Document Store by Mehdi Gholam.

From the post:

This article is the natural progression from my previous article about a persisted dictionary to a full blown NoSql document store database. While a key/value store is useful, it’s not as useful to everybody as a "real" database with "columns" and "tables". RaptorDB uses the following articles:

Some advanced R&D (for more than a year) went into RaptorDB, in regards to the hybrid bitmap index. Similar technology is being used by Microsoft’s Power Pivot for Excel and US Department of Energy Berkeley labs project called fastBit to track terabytes of information from particle simulations. Only the geeks among us care about this stuff and the normal person just prefer to sit in the Bugatti Veyron and drive, instead of marvel at the technological underpinnings.

To get here was quite a journey for me, as I had to create a lot of technology from scratch; hopefully RaptorDB will be a prominent alternative, built on the .NET platform, to other document databases which are either Java or C++ based.

RaptorDB puts the joy back into programming, as you can see in the sample application section.

If you want to take a deep dive into a .net project, this may be the one for you.

The use of FastBit, developed at the US Department of Energy’s Berkeley Lab, is what caught my attention.

A project using DOE-developed software merits a long pause.

The latest version is dated October 10, 2013.

Bitsy 1.5

Filed under: Bitsy,Blueprints,Database,Graphs — Patrick Durusau @ 3:16 pm

Bitsy 1.5

Version 1.5 of Bitsy is out!

Bitsy is a small, fast, embeddable, durable in-memory graph database that implements the Blueprints API.

Slides: Improvements in Bitsy 1.5 by Sridhar Ramachandran.

The current production version is Bitsy 1.2; Bitsy 1.5 is for research, evaluation, and development. The webpage reports that Bitsy 1.5 should be available for production by the end of 2013.

Enjoy!

Free Text and Spatial Search…

Filed under: Lucene,Searching,Spatial Index — Patrick Durusau @ 3:08 pm

Free Text and Spatial Search with Spatial4J and Lucene Spatial by Steven Citron-Pousty.

From the post:

Hey there, Shifters. One of my talks at FOSS4G 2013 covered Lucene Spatial. Today’s post is going to follow up on my post about creating Lucene Indices by adding spatial capabilities to the index. In the end you will have a full example of how to create a fast and full-featured full text spatial search on any documents you want to use.

How to add spatial to your Lucene index

In the last post I covered how to create a Lucene index, so in this post I will just cover how to add spatial. The first thing you need to understand is the two pieces of how spatial is handled by Lucene. A lot of this work is done by Dave Smiley. He gave a great presentation on all this technology at Lucene/Solr Revolution 2013. If you really want to dig in deep, I suggest you watch his hour-and-fifteen-minute video – my blog post is more the Too Long Didn’t Listen (TL;DL) version.

  • Spatial4J: This Java library provides geospatial shapes, distance calculations, and importing and exporting shapes. It is Apache licensed so it can be used with other ASF projects. Lucene Spatial uses Spatial4J to create the spatial objects that get indexed along with the documents. It will also be used when calculating distances in a query or when we want to convert between distance units. Spatial4J is able to handle real-world, on-a-sphere coordinates (what comes out of a GPS unit) and projected coordinates (any 2D map) for both shapes and distances.

Short aside: The oldest Java-based spatial library is JTS and it is used in many other open source Java geospatial projects. Spatial4J uses JTS under the hood if you want to work with Polygon shapes. Unfortunately, until recently it was LGPL and so could not be included in Lucene. JTS has announced its intention to move to a BSD-type license, which should allow Spatial4J and JTS to start working together for more Java spatial goodness for all. One of the beauties of FOSS is the ability to see development discussions happen in the open.

  • Lucene Spatial: After many different and custom iterations, there is now spatial support built right into Lucene as a standard library. It is new with the 4.x releases of Lucene. What Lucene Spatial does is provide the indexing and search strategies for Spatial4J shapes stored in a Lucene index. It has SpatialStrategy as the base class to define the signature that any spatial strategy must fulfill. You then use the same strategy for index writing and reading.

Today I will show the code to use Spatial4J with Lucene Spatial to add a spatially indexed field to your Lucene index.
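The code in the post is Java against the Lucene Spatial API, which I won’t reproduce here. As a rough illustration of the underlying idea, a prefix-tree strategy turns each shape into grid-cell tokens at increasing resolutions, so spatial matching becomes ordinary term matching. The Python below is a toy cell scheme invented for illustration, not Lucene’s actual encoding:

```python
def cell_tokens(lat, lon, levels=4):
    """Encode a point as quad-tree cell tokens of increasing resolution."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    token, tokens = "", []
    for _ in range(levels):
        lat_mid, lon_mid = (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
        token += ("N" if lat >= lat_mid else "S") + ("E" if lon >= lon_mid else "W")
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat >= lat_mid else (lat_lo, lat_mid)
        lon_lo, lon_hi = (lon_mid, lon_hi) if lon >= lon_mid else (lon_lo, lon_mid)
        tokens.append(token)
    return tokens

# Index side: every prefix token for the document's point gets indexed.
doc_tokens = set(cell_tokens(38.89, -77.03))          # a point near Washington, DC

# Query side: a coarse query area shares its coarse-level tokens with the doc.
query_tokens = set(cell_tokens(38.0, -77.5, levels=2))
print(sorted(doc_tokens & query_tokens))              # ['NW', 'NWSE']
```

Watch Dave Smiley’s talk or read the post’s Java listings for how RecursivePrefixTreeStrategy does this for real.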

Pay special attention to the changes that made it possible for Spatial4J and JTS to work together.

Cooperation between projects makes the resulting whole stronger.

Some office projects need to have that realization.

N1QL – It Makes Cents! [Rediscovery of Paths]

Filed under: CouchDB,N1QL,XPath,XSLT — Patrick Durusau @ 11:39 am

N1QL – It Makes Cents! by Robin Johnson.

*Ba Dum Tschhh* …See what I did there? Makes cents? Get it? Haha.

So… N1QL (pronounced Nickel)… Couchbase’s new next-generation query language; what is it? Well, it’s a rather ingeniously designed, human-readable/writable, extensible language designed for ad-hoc and operational querying within Couchbase. For those already familiar with querying within Couchbase, that blurb will probably make sense to you. If not – well, probably not, so let me clear it up a little more.

But before I do that, I must inform you that this blog article isn’t the best place for you to go if you want to dive in and get started learning N1QL. It is a view into N1QL from a developer’s perspective, including why I am so excited about it and the features I am proud to point out. If you want to get started learning about N1QL, click here. Or alternatively, go and have a go at the Online Tutorial. Anyway, back to clearing up what I mean when I say N1QL…

“N1QL is similar to the standard SQL language for relational databases, but also includes additional features which are suited for document-oriented databases.” N1QL has been designed as an intuitive query language for use on databases structured around documents instead of tables. To locate and utilise information in a document-oriented database, you need the correct logic and expressions for navigating documents and document structures. N1QL provides a clear, easy-to-understand abstraction layer to query and retrieve information in your document database.

Before we move on with N1QL, let’s talk quickly about document modeling within Couchbase. As you probably know, within Couchbase we model our documents primarily in JSON. We’re all familiar with JSON, so I won’t go into it in detail, but one thing we need to bear in mind is that our JSON documents can have complex nested data structures, nested arrays and objects which ordinarily would make querying a problem. Contrary to SQL though, N1QL has the ability to navigate nested data because it supports the concept of paths. This is very cool. We can use paths, via a dot-notation syntax, to give us the logical location of an attribute within a document. For example, if we had an e-commerce site with documents containing customers’ orders, we could look inside those documents, to an Nth nested level, for attributes. So if we wanted to look for the customer’s shipping street: (emphasis in original)
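The query that followed in the original post isn’t reproduced above, so here is a rough sketch of the dot-notation idea with invented bucket and field names (the N1QL text is illustrative, not copied from Couchbase’s documentation), next to the same path walked in plain Python:

```python
# A hypothetical N1QL query in the dot-notation path style the post describes.
n1ql = """
SELECT orders.shipping.address.street
FROM   ecommerce
WHERE  orders.shipping.address.city = 'London'
"""

# The same nested JSON document as a Python dict...
order_doc = {
    "orders": {
        "shipping": {
            "address": {"street": "1 High Street", "city": "London"}
        }
    }
}

# ...and the same path followed step by step.
street = order_doc["orders"]["shipping"]["address"]["street"]
print(street)   # 1 High Street
```

Which brings me to my point: a path into a nested structure is hardly a new idea.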

Paths are “very cool,” but I thought that documents could already be navigated by paths?

Remembering: XSL Transformations (XSLT) Version 2.0, XSLT 2.0 and XQuery 1.0 Serialization (Second Edition), and XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition).

Yes?

True, Couchbase uses JSON documents, but the notion of paths in data structures isn’t news.

Not having paths into data structures, now, that would be news. 😉

Quick Etymology

Filed under: Interface Research/Design,Language — Patrick Durusau @ 10:34 am

A tweet by Norm Walsh observes:

“Etymology of the word ___” in Google gives a railroad diagram answer on the search results page. Nice.

That, along with “define ____”, is suggestive of shortcuts for a topic map interface.

Yes?

Thinking of: “Relationships with _____”

Of course, Tiger Woods would be a supernode (“…a vertex with a disproportionately high number of incident edges.”). 😉

October 10, 2013

MarkLogic Rolls Out the Red Carpet for…

Filed under: MarkLogic,RDF,Semantic Web — Patrick Durusau @ 7:09 pm

MarkLogic Rolls Out the Red Carpet for Semantic Triples by Alex Woodie.

From the post:

You write a query with great care, and excitedly hit the “enter” button, only to see a bunch of gobbledygook spit out on the screen. MarkLogic says the chances of this happening will decrease thanks to the new RDF Triple Store feature that it formally introduced today with the launch of version 7 of its eponymous NoSQL database.

The capability to store and search semantic triples in MarkLogic 7 is one of the most compelling new features of the new NoSQL database. The concept of semantic triples is central to the Resource Description Framework (RDF) way of storing and searching for information. Instead of relating information in a database using an “entity-relationship” or “class diagram” model, the RDF framework enables links between pieces of data to be searched using the “subject-predicate-object” concept, which more closely corresponds to the way humans think and communicate.

The real power of this approach becomes evident when one considers the hugely disparate nature of information on the Internet. An RDF powered application can build links between different pieces of data, and effectively “learn” from the connections created by the semantic triples. This is the big (and as yet unrealized) pipe dream of the semantic Web.

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach. What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

This approach puts semantic triples right where they can do the most good. “Until now there has been a disconnect between the incredible potential of semantics and the value organizations have been able to realize,” states MarkLogic’s senior vice president of product strategy, Joe Pasqua.

“Managing triples in dedicated triple stores allowed people to see connections, but the original source of that data was disconnected, ironically losing context,” he continues. “By combining triples with a document store that also has built-in querying and APIs for delivery, organizations gain the insights of triples while connecting the data to end users who can search documents with the context of all the facts at their fingertips.”
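If you haven’t worked with the subject-predicate-object model Pasqua is describing, here is a minimal sketch using the Python rdflib library. The URIs and facts are illustrative only; this is not MarkLogic’s API:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Each fact is one (subject, predicate, object) triple.
g.add((EX.MarkLogic7, EX.isA, EX.NoSQLDatabase))
g.add((EX.MarkLogic7, EX.supports, EX.SemanticTriples))
g.add((EX.SemanticTriples, EX.definedBy, Literal("RDF")))

# Triples can be traversed like links: what does MarkLogic7 support?
for s, p, o in g.triples((EX.MarkLogic7, EX.supports, None)):
    print(s, p, o)

# Or queried with SPARQL, using the same subject-predicate-object shape.
query = """
SELECT ?o WHERE { <http://example.org/MarkLogic7> <http://example.org/supports> ?o }
"""
for row in g.query(query):
    print(row.o)
```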

A couple of things caught my eye in this post.

First, the comment that:

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach.

I can’t disagree, so why would MarkLogic make RDF support a major feature of this release?

Second, the next sentence reads:

What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

I am reading that to mean that if you store all the documents in which triples appear, along with the triples, you have more context. Yes?

Trivially true, but I’m not sure how significant an advantage that would be. Shouldn’t all that “contextual” metadata be included with the triples?

But I haven’t gotten a copy of version 7 so that’s all speculation on my part.

If you have a copy of MarkLogic 7, care to comment?

Thanks!

BRDI Announces Data and Information Challenge

Filed under: Challenges,Contest — Patrick Durusau @ 6:50 pm

BRDI Announces Data and Information Challenge by Stephanie Hagstrom.

From the post:

The National Academy of Sciences Board on Research Data and Information (BRDI; www.nas.edu/brdi) announces an open challenge to increase awareness of current issues and opportunities in research data and information. These issues include, but are not limited to, accessibility, integration, discoverability, reuse, sustainability, perceived versus real value and reproducibility.

A Letter of Intent is requested by December 1, 2013 and the deadline for final entries is May 15, 2014.

Awardees will be invited to present their projects at the National Academy of Sciences in Washington DC as part of a symposium of the regularly scheduled Board of Research Data and Information meeting in the latter half of 2014.

More information is available at http://sites.nationalacademies.org/PGA/brdi/PGA_085255. Please contact Cheryl Levey (clevey@nas.edu) with any questions.

This looks quite interesting.

The main site reports:

The National Academy of Sciences Board on Research Data and Information (BRDI; www.nas.edu/brdi) is holding an open challenge to increase awareness of current issues and opportunities in research data and information. These issues include, but are not limited to, accessibility, integration, discoverability, reuse, sustainability, perceived versus real value and reproducibility. Opportunities include, but are not limited to, analyzing such data and information in new ways to achieve significant societal benefit.

Entrants are expected to describe one or more of the following:

  • Novel ideas
  • Tools
  • Processes
  • Models
  • Outcomes

using research data and information. There is no restriction on the type of data or information, or the type of innovation that can be described. All data and tools that form the basis of a contestant’s entry must be made freely and openly available. The challenge is held in memory of Lee Dirks, a pioneer in scholarly communication.

Anticipated outcomes of the challenge include the potential for original and innovative solutions to societal problems using existing research data and information, national recognition for the successful contestants and possibly their institutions.

Looks ideal for a topic map-based proposal.

Suggestions on data sets?

F1 And Spanner Holistically Compared

Filed under: F1,Scalability,Spanner — Patrick Durusau @ 6:36 pm

F1 And Spanner Holistically Compared

From the post:

This article, F1: A Distributed SQL Database That Scales by Srihari Srinivasan, is republished with permission from a blog you really should follow: Systems We Make – Curating Complex Distributed Systems.

With both the F1 and Spanner papers out, it’s now possible to understand their interplay a bit holistically. So let’s start by revisiting the key goals of both systems.

Just in case you missed the F1 paper.

The conclusion should give you enough reason to read this post and the papers carefully:

The F1 system has been managing all AdWords advertising campaign data in production since early 2012. AdWords is a vast and diverse ecosystem including 100s of applications and 1000s of users, all sharing the same database. This database is over 100 TB, serves up to hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day. Availability reaches five nines, even in the presence of unplanned outages, and observable latency on our web applications has not increased compared to the old MySQL system.

Keep this in mind when you read stories composed of excuses about the recent collapse of healthcare.gov.

Who-To-Follow…

Filed under: Tweets — Patrick Durusau @ 4:51 pm

The Ultimate Who-To-Follow Guide for Tweeting Librarians, Info Pros, and Educators by Ellyssa Kroski.

Lists thirty (30) librarian feeds, then thirty (30) tech feeds (bad list construction; the ones under publication continue the tech feed list), ten (10) feeds for book lovers, and pointers to three other lists of feeds.

You may need several more Twitter accounts or a better reader than most of the ones I have seen. Rules, regexes and some ML would all be useful.

Not to mention outputting the captured tweets into a topic map for navigation.

PS: I first saw this on Force11.

Raw, a tool to turn spreadsheets to vector graphics

Filed under: Graphics,Spreadsheets,Visualization — Patrick Durusau @ 4:35 pm

Raw, a tool to turn spreadsheets to vector graphics by Nathan Yau.

From the post:

Sometimes it can be a challenge to produce data graphics in vector format, which is useful for high-resolution prints. Raw, an alpha-version tool by Density Design, helps make the process smoother.

As the description Nathan quotes says:

…it is a sketch tool, useful for quick and preliminary data explorations as well as for generating editable visualizations.

I’m comfortable with the idea of data explorations.

It makes clear that no visualization is inherent in the data; visualization is a matter of choice.

Semantics: The Next Big Issue in Big Data

Filed under: BigData,Semantics — Patrick Durusau @ 3:44 pm

Semantics: The Next Big Issue in Big Data by Glen Fest.

From the post:

The use of semantics often is a way to evade the issue at hand (i.e., Bill Clinton’s parsed definition of “is”). But in David Saul’s world of bank compliance and regulation, it’s something that can help get right to the heart of the matter.

Saul, the chief scientist at State Street Corp. in Boston, views the technology of semantics—in which data is structured in ways that it can be shared easily between bank divisions, institutions and regulators—as a means to better understand and manage big-bank risk profiles.

“By bringing all of this data together with the semantic models, we’re going to be able to ask the questions you need to ask to prepare regulatory reporting,” as well as internal risk calculations, Saul promised at a recent forum held at the New York offices of SWIFT, the Society for Worldwide Interbank Financial Telecommunication. Saul’s championing of semantics technology was part of a wider-ranging panel discussion on the role of technology in helping banks meet the current and forthcoming compliance demands of global regulators. “That’s really what we’re doing: trying to pull risk information from a variety of different systems and platforms, written at different times by different people,” Saul says.

To bridge the underlying data, the Financial Industry Business Ontology (FIBO), a group that Saul participates in, is creating the common terms and data definitions that will put banks and regulators on the same semantic page.

What’s ironic is in the same post you find:

Semantics technology already is a proven concept as an underlying tool of the Web that requires common data formats for sites to link to one another, says Saul. At large global banks, common data infrastructure is still in most cases a work in progress, if it’s underway at all. Legacy departmental divisions have allowed different (and incompatible) data sets and systems to evolve internally, leaving banks with the heavy chore of accumulating and repurposing data for both compliance reporting and internal risk analysis.

The inability to automate or reuse data across silos is at the heart of banks’ big-data dilemma—or as Saul likes to call it, a “smart data” predicament.

I’m not really sure what having a “common data format” has to do with linking between data sets. Most websites use something close to HTML but that doesn’t mean they can be usefully linked together.

Not to mention the “legacy departmental divisions.” What is going to happen to them and their data?

How “semantics technology” is going to deal with different and incompatible data sets isn’t clear. Change all the recorded data retroactively? How far back?

If you have any contacts in the banking industry, tell them the FIBO proposal sounds like a bad plan.

Geocode the world…

Filed under: Geographic Data,Geography,GeoNames,Maps — Patrick Durusau @ 3:29 pm

Geocode the world with the new Data Science Toolkit by Pete Warden.

From the post:

I’ve published a new version of the Data Science Toolkit, which includes David Blackman’s awesome TwoFishes city-level geocoder. Largely based on data from the Geonames project, the biggest improvement is that the Google-style geocoder now handles millions of places around the world in hundreds of languages:

Who or what do you want to locate? 😉
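If you want to try it, the toolkit exposes a Google-style geocoder endpoint. The URL path and the public server below are assumptions on my part (check the Data Science Toolkit documentation, or point the request at your own DSTK instance), but a request looks roughly like this:

```python
import requests

# Assumed Google-style endpoint path; verify against the current DSTK docs.
url = "http://www.datasciencetoolkit.org/maps/api/geocode/json"
resp = requests.get(url, params={"address": "Reykjavik, Iceland", "sensor": "false"})
resp.raise_for_status()

results = resp.json().get("results", [])
if results:
    location = results[0]["geometry"]["location"]
    print(location["lat"], location["lng"])
```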

Apache Lucene: Then and Now

Filed under: Lucene,Solr,SolrCloud — Patrick Durusau @ 3:06 pm

Apache Lucene: Then and Now by Doug Cutting.

From the description at Washington DC Hadoop Users Group:

Doug Cutting originally wrote Lucene in 1997-8. It joined the Apache Software Foundation’s Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has merged into the Lucene project itself and Mahout, Nutch, and Tika have moved to become independent top-level projects. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

In today’s discussion, Doug will share background on the impetus and creation of Lucene. He will talk about the evolution of the project and explain what the core technology has enabled today. Doug will also share his thoughts on what the future holds for Lucene and Solr.

Interesting walk down history lane with the creator of Lucene, Doug Cutting.

Understanding Entity Search [Better Late Than Never]

Filed under: Entities,Entity Resolution,Topic Maps — Patrick Durusau @ 2:33 pm

Understanding Entity Search by Paul Bruemmer.

From the post:

Over the past two decades, the Internet, search engines, and Web users have had to deal with unstructured data, which is essentially any data that has not been organized or classified according to any sort of pre-defined data model. Thus, search engines were able to identify patterns within webpages (keywords) but were not really able to attach meaning to those pages.

Semantic Search provides a method for classifying the data by labeling each piece of information as an entity — this is referred to as structured data. Consider retail product data, which contains enormous amounts of unstructured information. Structured data enables retailers and manufacturers to provide extremely granular and accurate product data for search engines (machines/bots) to consume, understand, classify and link together as a string of verified information.

Semantic or entity search will optimize much more than just retail product data. Take a look at Schema.org’s schema types – these schemas represent the technical language required to create a structured Web of data (entities with unique identifiers) — and this becomes machine-readable. Machine-readable structured data is disambiguated and more reliable; it can be cross-verified when compared with other sources of linked entity data (unique identifiers) on the Web.

Interesting to see unstructured data defined as:

any data that has not been organized or classified according to any sort of pre-defined data model.

I suppose you can say that, but is that how any of us write?

We all write with specific entities in mind, entities that represent subjects we could identify with additional properties if required.

So it is more accurate to say that unstructured data can be defined as:

any data that has not been explicitly identified by one or more properties.

Well, that’s the trick isn’t it? We look at an entity and see properties that a machine does not.

Explicit identification is a requirement. But on the other hand, a “unique” identifier is not.

That’s not just a topic map opinion but is in fact in play at the Global Biodiversity Information Facility (GBIF) I posted about yesterday.

GBIF realizes that ongoing identifications are never going to converge on that happy state where every entity has only one unique reference. In part because an on-going system has to account for all existing data as well as new data which could have new identifiers.

There isn’t enough time or resources to find all prior means of identifying an entity and replace them with a new identifier. Rather than cutting the Gordian knot of multiple identifiers with a URI sword, GBIF understands multiple identifiers for an entity.

Robust entity search capabilities require the capturing of all identifiers for an entity. So no user is disadvantaged by the identification they know for an entity.

The properties of subjects represented by entities and their identifiers serve as the basis for mapping between identifiers.
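A minimal sketch of that property-based mapping in Python (the identifiers and properties are invented, and real merging rules are far richer than an exact match on a property set):

```python
# Toy merge: records whose identifying properties match are treated as the
# same subject, so every identifier either record carries resolves to the
# single merged entity.
records = [
    {"id": "gbif:2435099", "props": frozenset({("rank", "species"), ("name", "Puma concolor")})},
    {"id": "itis:552479",  "props": frozenset({("rank", "species"), ("name", "Puma concolor")})},
    {"id": "gbif:212",     "props": frozenset({("rank", "class"),   ("name", "Aves")})},
]

merged = {}      # identifying property set -> merged entity
lookup = {}      # any known identifier     -> merged entity

for rec in records:
    entity = merged.setdefault(rec["props"], {"ids": [], "props": rec["props"]})
    entity["ids"].append(rec["id"])
    lookup[rec["id"]] = entity

# A user who only knows the ITIS identifier still reaches the merged entity,
# including the identifier someone else used for the same subject.
print(lookup["itis:552479"]["ids"])   # ['gbif:2435099', 'itis:552479']
```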

None of which needs to be exposed to the user. All a user may see is that whatever identifier they have for an entity returns the correct entity, along with information that was recorded using other identifiers (if they look closely).

What else should an interface disclose other than the result desired by the user?

PS: “Better Late Than Never” refers to Steve Newcomb and Michel Biezunski’s promotion, since the 1990s, of the use of properties to identify the subject represented by an entity. The W3C approach is to replace existing identifiers with a URI. How an opaque URI is better than an opaque string isn’t apparent to me.

Government Shutdown = Free Oxford Content!

Filed under: Books,Interface Research/Design,Reference — Patrick Durusau @ 1:03 pm

Free access to Oxford content during the government shutdown

From the post:

The current shutdown in Washington is limiting the access that scholars and researchers have to vital materials. To that end, we have opened up access for the next two weeks to three of our online resources: Oxford Reference, American National Biography Online, and the US Census demographics website, Social Explorer.

  • Oxford Reference is the home of Oxford’s quality reference publishing, bringing together over 2 million entries from subject reference, language, and quotations dictionaries, many of which are illustrated, into a single cross-searchable resource. Start your journey by logging in using username: tryoxfordreference and password: govshutdown
  • American National Biography Online provides articles that trace a person’s life through the sequence of significant events as they occurred from birth to death offering portraits of more than 18,700 men and women— from all eras and walks of life—whose lives have shaped the nation. To explore, simply log in using username: tryanb and password: govshutdown
  • Social Explorer provides quick and easy access to current and historical census data and demographic information. It lets users create maps and reports to illustrate, analyze, and understand demography and social change. In addition to its comprehensive data resources, Social Explorer offers features and tools to meet the needs of demography experts and novices alike. For access to Social Explorer, email online reference@oup.com for a username and password.

An example of:

It’s an ill wind that blows nobody good

Whatever your political persuasion, a great opportunity to experience first class reference materials.

It’s only for two weeks so pass this onto your friends and colleagues!

PS: From a purely topic map standpoint, the site is also instructive as a general UI.

October 9, 2013

Explore the world’s constitutions with a new online tool

Filed under: Law,Searching — Patrick Durusau @ 7:52 pm

Explore the world’s constitutions with a new online tool

From the post:

Constitutions are as unique as the people they govern, and have been around in one form or another for millennia. But did you know that every year approximately five new constitutions are written, and 20-30 are amended or revised? Or that Africa has the youngest set of constitutions, with 19 out of the 39 constitutions written globally since 2000 from the region?

With this in mind, Google Ideas supported the Comparative Constitutions Project to build Constitute, a new site that digitizes and makes searchable the world’s constitutions. Constitute enables people to browse and search constitutions via curated and tagged topics, as well as by country and year. The Comparative Constitutions Project cataloged and tagged nearly 350 themes, so people can easily find and compare specific constitutional material. This ranges from the fairly general, such as “Citizenship” and “Foreign Policy,” to the very specific, such as “Suffrage and turnouts” and “Judicial Autonomy and Power.”

I applaud the effort, but I wonder about “easily find and compare specific constitutional material.”

Legal systems are highly contextual.

See the Constitution Annotated (U.S.) if you want to see interpretations of words that would not occur to you. Promise.

Intro to D3 (Manu Kapoor)

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 7:39 pm

Intro to D3 (Manu Kapoor)

Charles Iliya Krempeaux embeds a tutorial about D3 (visualization).

Not knowing D3 is a problem that can be corrected.

ElasticHQ

Filed under: ElasticSearch,Searching — Patrick Durusau @ 7:33 pm

ElasticHQ

From the homepage:

Real-Time Monitoring

From monitoring individual cluster nodes, to viewing real-time threads, ElasticHQ enables up-to-the-second insight into ElasticSearch cluster runtime metrics and configurations, using the ElasticSearch REST API. ElasticHQ’s real-time update feature works by polling your ElasticSearch cluster intermittently, always pulling the latest aggregate information and deltas, keeping you up-to-date with the internals of your working cluster.

Full Cluster Management

Elastic HQ gives you complete control over your ElasticSearch clusters, nodes, indexes, and mappings. The sleek, intuitive UI gives you all the power of the ElasticSearch Admin API, without having to tangle with REST and large cumbersome JSON requests and responses.

Search and Query

Easily find what you’re looking for by querying a specific Index or several Indices at once. ElasticHQ provides a Query interface, along with all of the other Administration UI features.

No Software to Install

ElasticHQ does not require any software. It works in your web browser, allowing you to manage and monitor your ElasticSearch clusters from anywhere at any time. Built on responsive CSS design, ElasticHQ adjusts itself to any screen size on any device.
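The real-time monitoring piece is nothing exotic: it is the standard ElasticSearch REST endpoints on a timer. A minimal sketch in Python, assuming a cluster on the default local port and an arbitrary polling interval:

```python
import time
import requests

BASE = "http://localhost:9200"   # assumed local ElasticSearch cluster

def poll(interval_seconds=5, cycles=3):
    """Pull cluster health and node stats on a fixed interval."""
    for _ in range(cycles):
        health = requests.get(BASE + "/_cluster/health").json()
        stats = requests.get(BASE + "/_nodes/stats").json()
        doc_count = sum(n["indices"]["docs"]["count"]
                        for n in stats["nodes"].values())
        print(health["status"], "nodes:", health["number_of_nodes"],
              "docs:", doc_count)
        time.sleep(interval_seconds)

poll()
```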

I don’t know of any compelling reason to make ElasticSearch management and monitoring difficult for sysadmins. 😉

If approaches like ElasticHQ make their lives easier, perhaps they won’t begrudge users having better UIs as well.

Logical and Computational Structures for Linguistic Modeling

Filed under: Language,Linguistics,Logic,Modeling — Patrick Durusau @ 7:25 pm

Logical and Computational Structures for Linguistic Modeling

From the webpage:

Computational linguistics employs mathematical models to represent morphological, syntactic, and semantic structures in natural languages. The course introduces several such models while insisting on their underlying logical structure and algorithmics. Quite often these models will be related to mathematical objects studied in other MPRI courses, for which this course provides an original set of applications and problems.

The course is not a substitute for a full cursus in computational linguistics; it rather aims at providing students with a rigorous formal background in the spirit of MPRI. Most of the emphasis is put on the symbolic treatment of words, sentences, and discourse. Several fields within computational linguistics are not covered, prominently speech processing and pragmatics. Machine learning techniques are only very sparsely treated; for instance we focus on the mathematical objects obtained through statistical and corpus-based methods (i.e. weighted automata and grammars) and the associated algorithms, rather than on automated learning techniques (which is the subject of course 1.30).

Abundant supplemental materials, slides, notes, further references.

In particular you may like Notes on Computational Aspects of Syntax by Sylvain Schmitz, that cover the first part of Logical and Computational Structures for Linguistic Modeling.

As with any model, there are trade-offs and assumptions built into nearly every choice.

Knowing where to look for those trade-offs and assumptions will give you a response to: “Well, but the model shows that….”

Global Biodiversity Information Facility

Filed under: Biodiversity,Biology,PostgreSQL,Solr — Patrick Durusau @ 7:10 pm

Global Biodiversity Information Facility

Some stats:

417,165,184 occurrences

1,426,888 species

11,976 data sets

578 data publishers

What lies at the technical heart of this beast?

Would you believe a PostgreSQL database and an embedded Apache SOLR index?

Start with the Summary of the GBIF infrastructure. The details on PostgreSQL and Solr are under the Registry tab.

BTW, the system recognizes multiple identification systems and more are to be added.

Need to read more of the documents on that part of the system.

The Return of Al Jazeera

Filed under: News — Patrick Durusau @ 6:36 pm

Al Jazeera will launch a dedicated online channel, bring videos back to YouTube by Janko Roettgers.

From the post:

Qatar-based TV news network Al Jazeera will launch a dedicated online video channel within the next few months. The channel was announced at the MIPCOM industry convention in Cannes this week, and a press release quoted Al Jazeera’s Innovation and Incubation Manager Moeed Ahmad with the following words:

“There is a generation of people who wants to go beyond the box. They instinctively turn online to consume news. They share, like, Tweet, comment, interact and forward. To reach them requires a fresh approach. We’re working hard to bring this to the world. It will be engaging and at times entertaining – but always with purpose. Get ready – we’re coming soon.”

An Al Jazeera representative told the Hollywood Reporter that the new online venture will be headquartered in San Francisco, with dedicated offices in Johannesburg, Beijing, New Delhi and elsewhere.

If you are collecting international news for a topic map, this will be welcome news.

Quite naturally, the disappearance of Al Jazeera from YouTube for U.S.-based viewers was due to restrictive agreements with U.S. cable companies.

You can imagine the odds of a cable provider carrying Al Jazeera in northern Georgia.

October 8, 2013

Data Mining Book Review: How to Lie with Statistics

Filed under: Graphs,Humor,Statistics — Patrick Durusau @ 7:14 pm

Data Mining Book Review: How to Lie with Statistics by Sandro Saitta.

Sandro reviews “How to Lie with Statistics.”

It’s not a “recent” publication. 😉

However, it is an extremely amusing publication.

If you search for “How to Lie with Statistics PDF” I am fairly sure you will turn up copies on the WWW.

Enjoy!

Hadoop: Is There a Metadata Mess Waiting in the Wings?

Filed under: Hadoop,HDFS — Patrick Durusau @ 6:47 pm

Hadoop: Is There a Metadata Mess Waiting in the Wings? by Robin Bloor.

From the post:

Why is Hadoop so popular? There are many reasons. First of all it is not so much a product as an ecosystem, with many components: MapReduce, HBase, HCatalog, Pig, Hive, Sqoop, Mahout and quite a few more. That makes it versatile, and all these components are open source, so most of them improve with each release cycle.

But, as far as I can tell, the most important feature of Hadoop is its file system: HDFS. This has two notable features: it is a key-value store, and it is built for scale-out use. The IT industry seemed to have forgotten about key-value stores. They used to be called ISAM files and came with every operating system until Unix, then Windows and Linux took over. These operating systems didn’t provide general purpose key-value stores, and nobody seemed to care much because there was a plethora of databases that you could use to store data, and there were even inexpensive open source ones. So, that seemed to take care of the data layer.

But it didn’t. The convenience of a key-value store is that you can put anything you want into it as long as you can think of a suitable index for it, and that is usually a simple choice. With a database you have to create a catalog or schema to identify what’s in every table. And, if you are going to use the data coherently, you have to model the data and determine what tables to hold and what attributes are in each table. This puts a delay into importing data from new sources into the database.

Now you can, if you want, treat a database table as a key-value store and define only the index. But that is regarded as bad practice, and it usually is. Add to this the fact that the legacy databases were never built to scale out and you quickly conclude that Hadoop can do something that a database cannot. It can become a data lake – a vast and very scalable data staging area that will accommodate any data you want, no matter how “unstructured” it is.

I rather like that imagery, unadorned Hadoop as a “data lake.”

But that’s not the only undocumented data in a Hadoop ecosystem.

What about the Pig scripts? The MapReduce routines? Or Mahout, Hive, HBase, etc., etc.

Do you think all the other members of the Hadoop ecosystem also have undocumented data? And other variables?

When Robin mentions Revelytix as having a solution, I assume he means Loom.

Looking at Loom, ask yourself how well it documents the other parts of the Hadoop ecosystem.

Robin has isolated a weakness in the current Hadoop system that will unexpectedly and suddenly make itself known.

Will you be ready?

Automata [Starts 4 Nov. 2013]

Filed under: Automata,Compilers,Programming,Regex — Patrick Durusau @ 6:12 pm

Automata by Jeff Ullman.

From the course description:

Why Study Automata Theory?

This subject is not just for those planning to enter the field of complexity theory, although it is a good place to start if that is your goal. Rather, the course will emphasize those aspects of the theory that people really use in practice. Finite automata, regular expressions, and context-free grammars are ideas that have stood the test of time. They are essential tools for compilers. But more importantly, they are used in many systems that require input that is less general than a full programming language yet more complex than “push this button.”

The concepts of undecidable problems and intractable problems serve a different purpose. Undecidable problems are those for which no computer solution can ever exist, while intractable problems are those for which there is strong evidence that, although they can be solved by a computer, they cannot be solved sufficiently fast that the solution is truly useful in practice. Understanding this theory, and in particular being able to prove that a problem you are facing belongs to one of these classes, allows you to justify taking another approach — simplifying the problem or writing code to approximate the solution, for example.

During the course, I’m going to prove a number of things. The purpose of these proofs is not to torture you or confuse you. Neither are the proofs there because I doubt you would believe me were I merely to state some well-known fact. Rather, understanding how these proofs, especially inductive proofs, work, lets you think more clearly about your own work. I do not advocate proofs that programs are correct, but whenever you attempt something a bit complex, it is good to have in mind the inductive proofs that would be needed to guarantee that what you are doing really works in all cases.

Recommended Background

You should have had a second course in Computer Science — one that covers basic data structures (e.g., lists, trees, hashing), and basic algorithms (e.g., tree traversals, recursive programming, big-oh running time). In addition, a course in discrete mathematics covering propositional logic, graphs, and inductive proofs is valuable background.

If you need to review or learn some of these topics, there is a free on-line textbook Foundations of Computer Science, written by Al Aho and me, available at http://i.stanford.edu/~ullman/focs.html. Recommended chapters include 2 (Recursion and Induction), 3 (Running Time of Programs), 5 (Trees), 6 (Lists), 7 (Sets), 9 (Graphs), and 12 (Propositional Logic). You will also find introductions to finite automata, regular expressions, and context-free grammars in Chapters 10 and 11. Reading Chapter 10 would be good preparation for the first week of the course.

The course includes two programming exercises for which a knowledge of Java is required. However, these exercises are optional. You will receive automated feedback, but the results will not be recorded or used to grade the course. So if you are not familiar with Java, you can still take the course without concern for prerequisites.

All of “Foundations of Computer Science” is worth reading but for this course:

Chapter 2 Iteration, Induction, and Recursion
Chapter 3 The Running Time of Programs
Chapter 5 The Tree Data Model
Chapter 6 The List Data Model
Chapter 7 The Set Data Model
Chapter 9 The Graph Data Model
Chapter 10 Patterns, Automata, and Regular Expressions
Chapter 11 Recursive Description of Patterns
Chapter 12 Propositional Logic

Six very intensive weeks but on the bright side, you will be done before the holiday season. 😉
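If finite automata are entirely new to you, the first week’s material boils down to something this small: a DFA is states, an alphabet, a transition table, a start state, and a set of accepting states, and running it is one loop. A toy sketch (this machine accepts binary strings with an even number of 1s):

```python
# Transition table for a two-state DFA over the alphabet {0, 1}.
transitions = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd",  "0"): "odd",  ("odd",  "1"): "even",
}
start, accepting = "even", {"even"}

def accepts(s):
    """Run the DFA over the input string and report acceptance."""
    state = start
    for ch in s:
        state = transitions[(state, ch)]
    return state in accepting

print(accepts("1101"))   # False (three 1s)
print(accepts("1001"))   # True  (two 1s)
```

The course gets interesting when it asks what such machines cannot do.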

Quepid [Topic Map Tuning?]

Filed under: Recommendation,Searching,Solr — Patrick Durusau @ 4:36 pm

Measure and Improve Search Quality with Quepid by Doug Turnbull.

From the post:

Let’s face it, returning good search results means making money. To this end, we’re often hired to tune search to ensure that search results are as close as possible to the intent of a user’s search query. Matching users’ intent to results, what we call “relevancy,” is what gets us up in the morning. It’s what drives us to think hard about the dark mysteries of tuning Solr or machine-learning topics such as recommendation-based product search.

While we can do amazing feats of wizardry to make individual improvements, it’s impossible with today’s tools to do much more than prove that one problem has been solved. Search engines rank results based on a single set of rules. This single set of rules is in charge of how all searches are ranked. It’s very likely that even as we solve one problem by modifying those rules, we create another problem — or dozens of them, perhaps far more devastating than the original problem we solved.

Quepid is our instant search quality testing product. Born out of our years of experience tuning search, Quepid has become our go to tool for relevancy problems. Built around the idea of Test Driven Relevancy, Quepid allows the search developer to collaborate with product and content experts to

  1. Identify, store, and execute important queries
  2. Provide statistics/rankings that measure the quality of a search query
  3. Tune search relevancy
  4. Immediately visualize the impact of tuning on queries
  5. Rinse & Repeat Instantly

The result is a tool that empowers search developers to experiment with the impact of changes across the search experience and prove to their bosses that nothing broke. Confident that data will prove or disprove their ideas instantly, developers are even freer to experiment more than they might ever have before.

Any thoughts on automating a similar cycle to test the adding of subjects to a topic map?

Or adding subject identifiers that would trigger additional merging?

Or just reporting the merging over and above what was already present?
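None of that needs Quepid itself. A minimal sketch of what a test-driven merge check might look like, where the merge function and test cases are hypothetical stand-ins for a real topic map engine:

```python
def merge_topics(identifier_sets):
    """Stand-in merge rule: topics merge when their identifier sets intersect."""
    merged = []
    for ids in identifier_sets:
        ids = set(ids)
        overlapping = [m for m in merged if m & ids]
        for m in overlapping:
            merged.remove(m)
            ids |= m
        merged.append(ids)
    return merged

# Each case records the identifiers fed in and the merge count we expect.
cases = [
    {"name": "shared subject identifier",
     "input": [{"wikidata:Q42"}, {"wikidata:Q42", "viaf:113230702"}],
     "expected": 1},
    {"name": "no shared identifiers",
     "input": [{"ex:a"}, {"ex:b"}],
     "expected": 2},
]

for case in cases:
    got = len(merge_topics(case["input"]))
    status = "PASS" if got == case["expected"] else "FAIL"
    print(status, case["name"], "expected", case["expected"], "got", got)
```

Rinse and repeat after every change to the identifiers or merge rules, just as Quepid does for queries.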

Subgraph Frequencies

Filed under: Graphs,Subgraphs — Patrick Durusau @ 4:18 pm

Subgraph Frequencies by Johan Ugander.

From the webpage:

In our upcoming paper “Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections” (PDF), to appear in May 2013 at the 22nd ACM International World Wide Web Conference, we found that there are fundamental combinatorial constraints governing the frequency of subgraphs that constrain all graphs in previously undocumented ways. We also found that there are empirical properties of social graphs that relegate their subgraph frequencies to a seriously restricted subspace of the full feasible space. Together, we present two complementary frameworks that shed light on a fundamental question pertaining to social graphs:

What properties of social graphs are ‘social’ properties and what properties are ‘graph’ properties?

We contribute two frameworks for analyzing this question: one characterizing extremal properties of all graphs and one characterizing empirical properties of social graphs. Together the two frameworks offer a direct contribution to one of the most well-known observations about social graphs: the tendency of social relationships to close triangles, and the relative infrequency of what is sometimes called the ‘forbidden triad’: three people with two social relationships between them, but one absent relationship. Our frameworks show that the frequency of this ‘forbidden triad’ is non-trivially restricted not just in social graphs, but in all graphs.

Harnessing our results more generally, we are in fact able to show that almost all k-node subgraphs have a frequency that is non-trivially bounded. Thus, there is an extent to which almost all subgraphs are mathematically ‘forbidden’ from occurring beyond a certain frequency.

Interesting to see “discovery” of “empirical properties” of social graphs.

What empirical properties will you discover while processing a social graph with graph software?
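As a starting place, counting triads is a one-screen exercise: for every three-node subset, count how many of the three possible edges are present. Triads with exactly two edges are the “forbidden triad” the paper discusses. The edge list below is made up:

```python
from itertools import combinations

# A made-up, undirected friendship graph.
edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("a", "c"),
                                ("c", "d"), ("d", "e")]}
nodes = {n for e in edges for n in e}

counts = {0: 0, 1: 0, 2: 0, 3: 0}
for trio in combinations(sorted(nodes), 3):
    present = sum(frozenset(pair) in edges for pair in combinations(trio, 2))
    counts[present] += 1

# counts[3] = closed triangles; counts[2] = open ("forbidden") triads.
print(counts)   # {0: 0, 1: 6, 2: 3, 3: 1}
```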

kaiso 0.12.0

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:41 pm

kaiso 0.12.0

From the webpage:

A graph-based queryable object persistence framework built on top of Neo4j.

In addition to objects, Kaiso also stores the class information in the graph. This allows us to use Cypher to query instance information, but also to answer questions about our types.

Early stages of the project but this looks interesting.
