Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 30, 2013

Patterns of information use and exchange:…

Filed under: Design,Interface Research/Design,Marketing,Usability,Users — Patrick Durusau @ 3:05 pm

Patterns of information use and exchange: case studies of researchers in the life sciences

From the post:

A report of research patterns in life sciences revealing that researcher practices diverge from policies promoted by funders and information service providers

This report by the RIN and the British Library provides a unique insight into how information is used by researchers across life sciences. Undertaken by the University of Edinburgh’s Institute for the Study of Science, Technology and Innovation, and the UK Digital Curation Centre and the University of Edinburgh’s Information Services, the report concludes that one-size-fits-all information and data sharing policies are not achieving scientifically productive and cost-efficient information use in life sciences.

The report was developed using an innovative approach to capture the day-to-day patterns of information use in seven research teams from a wide range of disciplines, from botany to clinical neuroscience. The study, undertaken over 11 months and involving 56 participants, found that there is a significant gap between how researchers behave and the policies and strategies of funders and service providers. This suggests that the attempts to implement such strategies have had only a limited impact. Key findings from the report include:

  • Researchers use informal and trusted sources of advice from colleagues, rather than institutional service teams, to help identify information sources and resources
  • The use of social networking tools for scientific research purposes is far more limited than expected
  • Data and information sharing activities are mainly driven by needs and benefits perceived as most important by life scientists rather than top-down policies and strategies
  • There are marked differences in the patterns of information use and exchange between research groups active in different areas of the life sciences, reinforcing the need to avoid standardised policy approaches

Not the most recent research in the area but a good reminder that users do as users do, not as system/software/ontology architects would have them do.

What approach does your software take?

Does it make users perform their tasks the “right” way?

Or does it help users do their tasks “their” way?

GraphLab Workshop 2013 (Update)

Filed under: Conferences,GraphLab,Machine Learning — Patrick Durusau @ 2:46 pm

GraphLab Workshop 2013 Confirmed Agenda

You probably already have your plane tickets and hotel reservation but have you registered for GraphLab Workshop 2013?

Not just a select few graph databases for comparison but:

We have secured talks and demos about the hottest graph processing systems out there: GraphLab (CMU/UW), Pregel (Google), Giraph (Facebook), Cassovary (Twitter), Grappa (UW), Combinatorial BLAS (LBNL/UCSB), Allegro Graph (Franz), Neo4j, Titan (Aurelius), DEX (Sparsity Technologies), YarcData and others!

Registration.

2013 Graphlab Workshop on Large Scale Machine Learning
Sessions Events LLC
Monday, July 1, 2013 from 8:00 AM to 7:00 PM (PDT)
San Francisco, CA

I know, I know, 8 AM is an unholy time to be anywhere (other than on your way home) on the West Coast.

Just pull an all-dayer for a change. 😉

Expecting to see lots of posts and tweets from the conference!

XDGBench: 3rd party benchmark results against graph databases [some graph databases]

Filed under: AllegroGraph,Benchmarks,Fuseki,Neo4j,OrientDB — Patrick Durusau @ 2:19 pm

XDGBench: 3rd party benchmark results against graph databases by Luca Garulli.

From the post:

Toyotaro Suzumura and Miyuru Dayarathna, from the Department of Computer Science of the Tokyo Institute of Technology and IBM Research, published interesting research on a benchmark of graph databases in the cloud called:

“XGDBench: A Benchmarking Platform for Graph Stores in Exascale Clouds”

This research conducts a performance evaluation of four well-known graph data stores, AllegroGraph, Fuseki, Neo4j, and OrientDB, using XGDBench on the Tsubame 2.0 HPC cloud environment. XGDBench is an extension of the well-known Yahoo! Cloud Serving Benchmark (YCSB).

OrientDB is the fastest graph database among the 4 products tested. In particular, OrientDB is about 10x faster (!) than Neo4j in all the tests.

Look at the Presentation (25 slides) and Research PDF.

Researchers are free to pick any software packages for comparison, but the selection here struck me as odd before I read a comment on the original post asking for ObjectivityDB to be added to the comparison.

For that matter, where are GraphChi, Infinite Graph, Dex, Titan, FlockDB? Just to call a few of the other potential candidates out.

It will be interesting when a non-winner on such a benchmark cites it for the proposition that ease of use, reliability, and lower TCO outweigh brute speed in a benchmark test.

The Dataverse Network Project

Filed under: Data,Dataverse Network — Patrick Durusau @ 1:48 pm

The Dataverse Network Project sponsored by the Institute for Quantitative Social Science, Harvard University.

Described on its homepage:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.

Dataverses currently in operation:

One shortfall I hope is corrected quickly is the lack of searching across instances of the Dataverse software.

For example, if I go to UC Davis and choose the Center for Poverty Research dataverse, I can find: “The Research Supplemental Poverty Measure Public Use Research Files” by Kathleen Short (a study).

But, if I search at the Harvard Dataverse Advanced Search by “Kathleen Short,” or “The Research Supplemental Poverty Measure Public Use Research Files,” I get no results.

An isolated dataverse is more of a data island than a dataverse.

We have lots of experience with data islands. It’s time for something different.
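
For what it’s worth, here is a minimal sketch, in Python, of the kind of federated search I have in mind. The endpoints and the response shape are hypothetical, not part of any Dataverse API I know of; the point is only that searching across instances is a merge problem, not a hard one.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical search endpoints for two Dataverse installations. Neither the
# URLs nor the {"items": [...]} response shape is a real API; they only stand
# in for whatever cross-instance search interface the software eventually grows.
ENDPOINTS = [
    "https://dataverse.example.edu/api/search?q=",
    "https://poverty.example.org/api/search?q=",
]

def federated_search(query):
    """Query every known Dataverse instance and merge the result lists."""
    results = []
    for endpoint in ENDPOINTS:
        url = endpoint + urllib.parse.quote(query)
        try:
            with urllib.request.urlopen(url) as resp:
                results.extend(json.load(resp).get("items", []))
        except OSError:
            continue  # skip unreachable instances instead of failing the search
    return results

# Usage: federated_search("Kathleen Short") would then find the UC Davis study
# no matter which installation holds it.
```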

PS: Semantic integration issues need to be addressed as well.

Harvard Dataverse Network

Filed under: Data,Dataverse Network — Patrick Durusau @ 1:23 pm

Harvard Dataverse Network

From the webpage:

The Harvard Dataverse Network is open to all scientific data from all disciplines worldwide. It includes the world’s largest collection of social science research data. If you would like to upload your research data, first create a dataverse and then create a study. If you already have a dataverse, log in to add new studies.

Sharing of data that underlies published research.

Dataverses (520 of those) contain studies (52,289) which contain files (722,615).

For example, following the link for the Tom Clark dataverse provides a listing of five (5) studies, ordered by their global ids.

Following the link to the Locating Supreme Court Opinions in Doctrine Space study defaults to detailed cataloging information for the study.

The interface is under active development.

One feature that I hope is added soon is the ability to browse dataverses by author and self-assigned subjects.

Searching works, but is more reliable if you know the correct search terms to use.

I didn’t see any plans to deal with semantic ambiguity/diversity.

MindMup MapJs

Filed under: Graphics,JQuery,Mind Maps,Visualization — Patrick Durusau @ 10:53 am

MindMup MapJs

From the webpage:

MindMup is a zero-friction mind map canvas. Our aim is to create the most productive mind mapping environment out there, removing all the distractions and providing powerful editing shortcuts.

This git project is the JavaScript visualisation portion of MindMup. It provides a canvas for users to create and edit mind maps in a browser. You can see an example of this live on http://www.mindmup.com.

This project is relatively stand alone and you can use it to create a nice mind map visualisation separate from the MindMup Server.

Do see the live demo at: http://www.mindmup.com.

It may not fit your needs but it is a great demo of thoughtful UI design. (At least to me.)

Could be quite useful if you like The Back of the Napkin : Solving Problems and Selling Ideas with Pictures by Dan Roam.

I recently started reading “The Back of the Napkin,” and will have more to report on it in a future post. So far, it has been quite a delight to read.

I first saw this at JQuery Rain under: MindMup MapJs : Zero Friction Mind Map Canvas with jQuery.

Mapping the News [Idea for a NewsApp]

Filed under: Mapping,News — Patrick Durusau @ 10:32 am

NewsRel Uses Machine Learning To Summarize News Stories And Put Them On A Map by Frederic Lardinois.

From the post:

After 24 hours of staring at their screens, the teams that participated in our Disrupt NY 2013 Hackathon have now finished their projects and are currently presenting them onstage. With more than 160 hacks, there are far too many cool ones to write about, but one that stood out to me was NewsRel, an iPad-based news app that uses machine-learning techniques to understand how news stories relate to one another. The app uses Google Maps as its main interface and automatically decides which location is most appropriate for any given story.

The app currently uses Reuters‘ RSS feed and analyzes the stories, looking for clusters of related stories and then puts them on the map. Say you are looking at a story about the Boston Marathon bombings. The app, of course, will show you a number of news stories about it clustered around Boston, then maybe something about the president’s comments about it from Washington and another article that relates it to the massacre during the Munich Olympics in 1972.

In addition to this, the team built an algorithm that picks the most important sentences from each story to summarize it for you.

No pointers to software, just the news blurb.

But, does raise an interesting possibility.

What if news video streams were tagged with geolocation and type information?

So I could exclude “train hits parade float” stories from several states away, automobile accidents, and crime stories, and replace them with substantive commentary from the BBC or Al Jazeera.

Now that would be a video feed worth paying for. Particularly if for a premium it was commercial free.

Freedom from Wolf Blitzer’s whines in disaster areas should come as a free pre-set.

Just a small amount of additional semantics could lead to entirely new markets and delivery systems.
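
A toy sketch of the kind of filter such tagging would allow, in Python. The story records, type labels and distance threshold are all invented; the point is that a little typed, geolocated metadata is enough to build the feed I would pay for.

```python
# Hypothetical tagged stories: (headline, type label, distance in km from the viewer).
STORIES = [
    ("Train hits parade float", "local-accident", 900),
    ("Downtown crime roundup", "crime", 40),
    ("Analysis of ceasefire negotiations", "analysis", 6000),
]

def filtered_feed(stories, excluded_types=("local-accident", "crime"), local_km=100):
    """Drop excluded story types unless they are genuinely local to the viewer."""
    for headline, story_type, distance_km in stories:
        if story_type in excluded_types and distance_km > local_km:
            continue  # a parade-float accident three states away never shows up
        yield headline

print(list(filtered_feed(STORIES)))
# ['Downtown crime roundup', 'Analysis of ceasefire negotiations']
```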

Real-Time Data Aggregation [White Paper Registration Warning]

Filed under: Aggregation,Cloud Computing,Storm — Patrick Durusau @ 10:06 am

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate cost savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and LinkedIn engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying, considering the lack of substantive content about the solution. The use cases are mildly interesting but admit of any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

Quandl – Update

Filed under: Data,Dataset — Patrick Durusau @ 4:52 am

Quandl

When I last wrote about Quandl, they were at over 2,000,000 datasets.

Following a recent link to their site, I found they now have over 5,000,000 datasets.

No mean feat, but among the questions that remain:

How do I judge the interoperability of data sets?

Where do I find the information needed to make data sets interoperable?

And just as importantly,

Where do I write down information I discovered or created to make a data set interoperable? (To avoid doing the labor over again.)
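
One low-tech answer to the last question, sketched in Python: record the crosswalk you worked out as data and keep it next to the datasets it joins. The dataset codes and field names below are invented, not real Quandl identifiers.

```python
import json

# A hypothetical crosswalk recording how two datasets line up: which fields
# correspond, what unit conversion is needed, and any caveats for the join.
crosswalk = {
    "left": "CENTRALBANK/RATES_MONTHLY",
    "right": "MARKETDATA/POLICY_RATE",
    "field_map": {"rate_pct": "policy_rate"},
    "unit_conversion": {"rate_pct": "divide by 100 to get a fraction"},
    "notes": "Right-hand series lags one month; shift before joining.",
}

with open("rates_crosswalk.json", "w") as fh:
    json.dump(crosswalk, fh, indent=2)  # written down once, reusable next time
```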

April 29, 2013

Indexing Millions Of Documents…

Filed under: Indexing,Solr,Tika — Patrick Durusau @ 2:13 pm

Indexing Millions Of Documents Using Tika And Atomic Update by Patricia Gorla.

From the post:

On a recent engagement, we were posed with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the text metadata.

Streaming large volumes of data into Solr is nothing new, but this dataset posed a unique challenge: Each patent document’s translation resided in a separate file, and the location of each translation file was unknown at runtime. This meant that for every document processed we wouldn’t know where its match would be. Furthermore, the translations would arrive in batches, to be added as they come. And lastly, the project needed to be open to different languages and different file formats in the future.

Our options for dealing with inconsistent data came down to: cleaning all data and organizing it before processing, or building an ingester robust enough to handle different situations.

We opted for the latter and built an ingester that would process each file individually and index the documents with an atomic update (new in Solr 4). To detect and extract the text metadata we chose Apache Tika. Tika is a document-detection and content-extraction tool useful for parsing information from many different formats.

On the surface Tika offers a simple interface to retrieve data from many sources. Our use case, however, required a deeper extraction of specific data. Using the built-in SAX parser allowed us to push Tika beyond its normal limits, and analyze XML content according to the type of information it contained.

No magic bullet but an interesting use case (patents in multiple languages).
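
For readers who have not met Solr 4’s atomic updates, here is a minimal sketch in Python of what one looks like on the wire, assuming a local Solr 4 instance with the default collection1 core; the document id and field name are hypothetical.

```python
import json
import urllib.request

# One atomic update: set a single field on an existing document without
# re-sending the whole document. Requires the updated field to be stored.
doc = [{
    "id": "patent-JP-2001-123456",
    "translation_en_txt": {"set": "Translated abstract text extracted by Tika."},
}]

req = urllib.request.Request(
    "http://localhost:8983/solr/collection1/update?commit=true",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 if Solr accepted the update
```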

scalingpipe – …

Filed under: LingPipe,Linguistics,Scala — Patrick Durusau @ 2:07 pm

scalingpipe – porting LingPipe tutorial examples to Scala by Sujit Pal.

From the post:

Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub where Alexy Khrabov had started with porting the Interesting Phrases tutorial example.

Now there’s a clever idea!

It achieves both a deeper understanding of the LingPipe API and additional Scala experience.

Not to mention having useful results for other users.

Atlas of Design

Filed under: Design,Graphics,Interface Research/Design,Mapping,Maps,Visualization — Patrick Durusau @ 2:01 pm

Atlas of Design by Caitlin Dempsey.

From the post:

Do you love beautiful maps? The Atlas of Design has been reprinted and is now available for purchase. Published by the North American Cartographic Information Society (NACIS), this compendium showcases cartography at some of its finest. The atlas was originally published in 2012 and features the work of 27 cartographers. In early 2012, a call for contributions was sent out, and 140 entries were submitted by 90 different individuals and groups. A panel of eight volunteer judges plus the book’s editors evaluated the entries and selected the finalists.

The focus of the Atlas of Design is on the aesthetics and design involved in mapmaking. Tim Wallace and Daniel Huffman, the editors of the Atlas of Design, explain the focus of the book in its introduction:

Aesthetics separate workable maps from elegant ones.

This book is about the latter category.

My personal suspicion is that aesthetics separate legible topic maps from those that attract repeat users.

The only way to teach aesthetics (which varies by culture and social group) is by experience.

This is a great starting point for your aesthetics education.

The Pragmatic Haskeller, Episode 4 – Recipe Puppy

Filed under: DSL,Haskell — Patrick Durusau @ 4:59 am

The Pragmatic Haskeller, Episode 4 – Recipe Puppy by Alfredo Di Napoli.

From the post:

Now we have our webapp that can read json from the outside world and store them inside MongoDB. But during my daily job what I usually need to do is to talk to some REST service and get, manipulate and store some arbitrary JSON. Fortunately for us, Haskell and its rich, high-quality libraries ecosystem makes the process a breeze.

Alfredo continues his series on building a basic web app in Haskell.

He promises a small DSL for describing recipes in the next episode.

Which reminds me to ask, is anyone using a DSL to enable users to compose domain specific topic maps?

That is we say topic, scope, association, occurrence, etc. only because that is our vocabulary for topic maps.

No particular reason why everyone has to use those names in composing a topic map.

For a recipe topic map the user might see: recipe (topic), ingredient (topics), ordered instructions (occurrences), measurements, with associations being implied between the recipe and ingredients and between ingredients and measurements, along with role types, etc.

To a topic map processor, all of those terms are treated as topic map information items but renamed for presentation to end users.

If you select an ingredient, say fresh tomatoes in the salads category, it displays other recipes that also use fresh tomatoes.

How it does that need not trouble the author or the end user.
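
A minimal sketch of that renaming layer in Python. The vocabulary below is hypothetical; the only point is that the author-facing names are presentation, while the processor keeps working with topic map constructs.

```python
# Hypothetical recipe-facing vocabulary mapped onto topic map constructs.
RECIPE_VOCABULARY = {
    "recipe": "topic",
    "ingredient": "topic",
    "ordered instructions": "occurrence",
    "measurement": "occurrence",
    "uses ingredient": "association",
}

def to_topic_map_construct(domain_term):
    """Translate an author-facing term into the underlying topic map construct."""
    return RECIPE_VOCABULARY.get(domain_term, "topic")

print(to_topic_map_construct("ingredient"))       # -> topic
print(to_topic_map_construct("uses ingredient"))  # -> association
```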

Yes?

April 28, 2013

A Partly Successful Attempt To Create Life With Data Explorer

Filed under: Data Explorer,Game of Life — Patrick Durusau @ 3:54 pm

A Partly Successful Attempt To Create Life With Data Explorer by Chris Webb.

From the post:

I’ll apologise for the title right away: this post isn’t about a Frankenstein-like attempt at creating a living being in Excel, I’m afraid. Instead, it’s about my attempt to implement John Conway’s famous game ‘Life’ using Data Explorer, how it didn’t fully succeed and some of the interesting things I learned along the way…

When I’m learning a new technology I like to set myself mini-projects that are more fun than practically useful, and for some reason a few weeks ago I remembered ‘Life’ (which I’m sure almost anyone who has learned programming has had to write a version of at some stage), so I began to wonder if I could write a version of it in Data Explorer. This wasn’t because I thought Data Explorer was an appropriate tool to do this – there are certainly better ways to implement Life in Excel – but I thought doing this would help me in my attempts to learn Data Explorer’s formula language and might also result in an interesting blog post.

Here’s a suggestion on learning new software.
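
For reference, the rule Chris is coaxing out of Data Explorer fits in a few lines of ordinary Python, which makes the contrast with a formula-language implementation all the more instructive; this sketch uses a set of live cell coordinates rather than a spreadsheet grid.

```python
from collections import Counter

def life_step(live_cells):
    """One generation of Conway's Life; live_cells is a set of (x, y) tuples."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next step with exactly 3 neighbours, or 2 if already alive.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live_cells)}

# A glider, stepped one generation.
print(life_step({(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}))
```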

Have you ever thought about playing the game of life with topic maps?

What’s New in Hue 2.3

Filed under: Hadoop,Hive,Hue,Impala — Patrick Durusau @ 3:43 pm

What’s New in Hue 2.3

From the post:

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and sample of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

Time to upgrade!

Agile Knowledge Engineering and Semantic Web (AKSW)

Filed under: RDFa,Semantic Web — Patrick Durusau @ 3:28 pm

Agile Knowledge Engineering and Semantic Web (AKSW)

From the webpage:

The Research Group Agile Knowledge Engineering and Semantic Web (AKSW) is hosted by the Chair of Business Information Systems (BIS) of the Institute of Computer Science (IfI) / University of Leipzig as well as the Institute for Applied Informatics (InfAI).

Goals

  • Development of methods, tools and applications for adaptive Knowledge Engineering in the context of the Semantic Web
  • Research of underlying Semantic Web technologies and development of fundamental Semantic Web tools and applications
  • Maturation of strategies for fruitfully combining the Social Web paradigms with semantic knowledge representation techniques

AKSW is committed to the free software, open source, open access and open knowledge movements.

Complete listing of projects.

I have mentioned several of these projects before. On seeing a reminder of the latest release of RDFaCE (RDFa Content Editor), I thought I should post on the common source of those projects.

Qi4j SDK Release 2.0

Filed under: Context,Programming — Patrick Durusau @ 3:18 pm

Qi4j SDK Release 2.0

From the post:

After nearly 2 years of hard work, the Qi4j Community today launched its second generation Composite Oriented Programming framework.

Qi4j is Composite Oriented Programming for the Java platform. It is a top-down approach to write business applications in a maintainable and efficient manner. Qi4j lets you focus on the business domain, removing most impedance mismatches in software development, such as object-relation mapping, overlapping concerns and testability.

Qi4j’s main areas of excellence are its enforcement of application layering and modularization, the typed and generic AOP approach, affinity based dependency injection, persistence management, indexing and query subsystems, but there are much more.

The 2.0 release is practically a re-write of the entire runtime, according to co-founder Niclas Hedhman: “Although we are breaking compatibility in many select areas, most 1.4 applications can be converted with relatively few changes.” He continues: “These changes are necessary for the next set of planned features, including full Scala integration, the upcoming JDK8 and Event Sourcing integrated into the persistence model.”

“It has been a bumpy ride to get this release out the door,” said Paul Merlin, the 2.0 Release Manager, “but we are determined that Qi4j represents the best technological platform for Java to create applications with high business value.” Not only has the community re-crafted a remarkable codebase, but also created a brand new website, fully integrated with the new Gradle build process.

See: http://qi4j.org and http://qi4j.org/2.0/.

Principles of Composite Oriented Programming:

  • Behavior depends on Context
  • Decoupling is a virtue
  • Business Rules matters more
  • Classes are dead, long live interfaces

“Behavior depends on Context” sounds a lot like identity depends on context, either of what the object represents or a user.

Does your application capture context for data or its users? If so, what does it do with that information?
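
A toy sketch, in Python and with no relation to Qi4j’s actual API, of what “behavior depends on context” looks like in code: the same entity answers differently depending on the context it participates in.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

class AuditContext:
    """In an audit context the account discloses everything."""
    def describe(self, account):
        return f"{account.owner}: balance {account.balance} (full audit view)"

class PublicContext:
    """In a public context the same account only admits that it exists."""
    def describe(self, account):
        return f"Account held by {account.owner}"

def view(account, context):
    # Behavior depends on the context the object is composed into,
    # not on the object alone.
    return context.describe(account)

acct = Account("acme", 1200)
print(view(acct, AuditContext()))
print(view(acct, PublicContext()))
```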

Speak of the devil,… I just mentioned Peter Neubauer in a prior post, then I see his tweet on Qi4j. 😉

Scientific Lenses over Linked Data… [Operational Equivalence]

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
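
Here is a rough Python sketch of the idea as I read it, crossed with Peter’s typed-edges suggestion: equivalence links kept stand-off from the data, typed, and grouped by lens, so that changing the lens changes which records count as “the same.” The lens names and identifiers are illustrative, not Open PHACTS data.

```python
from collections import defaultdict

# Stand-off, typed equivalence links grouped by lens:
# (record_a, record_b, relationship). Nothing here touches the datasets themselves.
LINKSETS = {
    "exact-structure": [
        ("chembl:25", "drugbank:DB00945", "same-molecule"),
    ],
    "parent-compound": [
        ("chembl:25", "drugbank:DB00945", "same-molecule"),
        ("chembl:25", "chembl:1697753", "salt-of"),
    ],
}

def equivalents(record, lens):
    """Return the records treated as equivalent to `record` under the chosen lens."""
    related = defaultdict(set)
    for a, b, rel in LINKSETS.get(lens, []):
        related[a].add((b, rel))
        related[b].add((a, rel))
    return related[record]

print(equivalents("chembl:25", "exact-structure"))   # one typed equivalence
print(equivalents("chembl:25", "parent-compound"))   # a broader notion of "same"
```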

4th Open PHACTS Community Workshop (slides) [Operational Equivalence]

Filed under: Bioinformatics,Biomedical,Drug Discovery,Linked Data,Medical Informatics — Patrick Durusau @ 12:24 pm

4th Open PHACTS Community Workshop : Using the power of Open PHACTS

From the post:

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Filed under: Algorithms,Data Science,Reservoir Sampling — Patrick Durusau @ 12:12 pm

Algorithms Every Data Scientist Should Know: Reservoir Sampling by Josh Wills.

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Beaker image
Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.
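
For the impatient, a minimal sketch of the single-item version in Python: keep the i-th item of the stream with probability 1/i, and after n items every item has survived with probability 1/n.

```python
import random

def reservoir_sample(stream):
    """Return one item chosen uniformly at random from a stream of unknown length."""
    chosen = None
    for count, item in enumerate(stream, start=1):
        # Keep the new item with probability 1/count.
        if random.randrange(count) == 0:
            chosen = item
    return chosen

# Works on any iterable, including streams far too large to hold in memory.
print(reservoir_sample(iter(["a", "b", "c", "d"])))
```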

Whether reservoir sampling will be useful to you depends on your test for subject identity.

I tend to think of subject identity as being very precise but that isn’t necessarily the case.

Or should I say that precision of subject identity is a matter of requirements?

For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.

Think of any test for subject identity as being located on a continuum of subject identification, where the notion of “precision” itself is up for definition.

Russia’s warning about one of the Boston Marathon bombers, a warning that used his name as he spelled it, not as captured by the US intelligence community, was a case of a mistaken level of precision.

Most likely the result of an analyst schooled in an English-only curriculum.

Topic Maps Logo?

Filed under: Advertising,Marketing — Patrick Durusau @ 10:32 am

While writing about Drake, I was struck by the attractiveness of the project logo:

Drake logo

So I decided to look at some other projects’ logos, just to get some ideas about what other projects were doing:

Hadoop logo

Mahout logo

Chukwa logo

But the most famous project at Apache has the simplest logo of all:

HTTPD logo

To be truthful, when someone says web server, I automatically think of the Apache server. Others exist and new ones are invented, but Apache server is nearly synonymous with web server.

Perhaps the lesson is the logo did not make it so.

Has anyone written a history of the Apache web server?

A cross between a social history and a technical one, that illustrates how the project responded to user demands and requirements. That could make a very nice blueprint for other projects to follow.

Introducing Drake, a kind of ‘make for data’

Filed under: Data Streams,Drake,Workflow — Patrick Durusau @ 9:55 am

Introducing Drake, a kind of ‘make for data’ by Aaron Crow.

From the post:

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

Not hard to see how Drake could fit into a topic map authoring workflow.

LevelGraph [Graph Databases and Semantic Diversity]

Filed under: Graphs,leveldb,LevelGraph,Semantic Diversity — Patrick Durusau @ 9:47 am

LevelGraph

From the webpage:

LevelGraph is a Graph Database. Unlike many other graph databases, LevelGraph is built on the uber-fast key-value store LevelDB through the powerful LevelUp library. You can use it inside your node.js application.

LevelGraph loosely follows the Hexastore approach as presented in the article: Hexastore: sextuple indexing for semantic web data management, C Weiss, P Karras, A Bernstein – Proceedings of the VLDB Endowment, 2008. Following this approach, LevelGraph uses six indices for every triple, in order to access them as fast as possible.
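
A rough Python illustration of the hexastore idea (LevelGraph itself is JavaScript): write each triple under all six orderings of subject, predicate and object, so any query pattern can be answered with a single prefix scan over a key-value store.

```python
from itertools import permutations

def hexastore_keys(subject, predicate, obj, sep="::"):
    """Build the six index keys for one triple: spo, sop, pso, pos, osp, ops."""
    parts = {"s": subject, "p": predicate, "o": obj}
    keys = []
    for order in permutations("spo"):
        label = "".join(order)                          # e.g. "pos"
        keys.append(sep.join([label] + [parts[c] for c in order]))
    return keys

# Usage: store the triple's value under all six keys in LevelDB (or any KV store).
for key in hexastore_keys("levelgraph", "builtOn", "leveldb"):
    print(key)
```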

The family of graph databases gains another member.

The growth of graph database offerings is evidence that the effort to reduce semantic diversity is a fool’s errand.

It isn’t hard to find graph database projects, yet new ones appear on a regular basis.

With every project starting over with the basic issues of graph representation and algorithms.

The reasons for that diversity are likely as diverse as the diversity itself.

If the world has been diverse, remains diverse, and all the evidence is that it will continue to be diverse, what are the odds of winning a fight against diversity?

That’s what I thought.

Topic maps, embracing diversity.

I first saw this in a tweet by Frank Denis.

April 27, 2013

Controversial Cyber Security Bill CISPA…

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:55 pm

Controversial Cyber Security Bill CISPA Passed Again By The US House by Avik Sarkar.

From the post:

A couple of months ago we reported that the White House was planning an executive cyber security order, and it also came to be known from official sources that U.S. President Barack Obama had a special plan to re-introduce the Cyber Intelligence Sharing and Protection Act (CISPA). Today that plan was executed, as the US House of Representatives passed the controversial Cyber Information Sharing and Protection Act. This is the second time CISPA has been passed by the House; the first time it was rejected by the Senate on the grounds that the bill did not do enough to protect privacy. Yet again a substantial majority of politicians in the House backed the bill, though there is a strong chance of it being rejected: according to some relevant sources, CISPA could fail again in the Senate after threats from President Obama to veto it over privacy concerns, since the President has expressed concerns that it could pose a privacy risk. The White House wants amendments so that more is done to ensure the minimum amount of data is handed over in investigations. The law is passing through the US legislative system as American federal agencies warn that malicious hackers, motivated by money or acting on behalf of foreign governments, such as China, are one of the biggest threats facing the nation. “If you want to take a shot across China’s bow, this is the answer,” said Mike Rogers, the Republican politician who co-wrote CISPA and chairs the House Intelligence Committee.

Don’t be distracted by the privacy/civil liberties/cybersecurity dance in Washington, D.C.

Why would you expect a government with a kill list to balk at listening to your phone calls or reading your email traffic?

A government that does those things and lies to the public about them, is unworthy of trust.

Guard your privacy as best you can.

No one else is going to do it for you.

PS: Topic maps may be able to help you watch the watchers. See how they like a good dose of sunshine.

Designing Search: Displaying Results

Filed under: Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 6:32 pm

Designing Search: Displaying Results by Tony Russell-Rose.

From the post:

Search is a conversation: a dialogue between user and system that can be every bit as rich as human conversation. Like human dialogue, it is bidirectional: on one side is the user with their information need, which they articulate as some form of query.

On the other is the system and its response, which it expresses as a set of search results. Together, these two elements lie at the heart of the search experience, defining and shaping much of the information seeking dialogue. In this piece, we examine the most universal of elements within that response: the search result.

Basic Principles

Search results play a vital role in the search experience, communicating the richness and diversity of the overall result set, while at the same time conveying the detail of each individual item. This dual purpose creates the primary tension in the design: results that are too detailed risk wasting valuable screen space while those that are too succinct risk omitting vital information.

Suppose you’re looking for a new job, and you browse to the 40 or so open positions listed on UsabilityNews. The results are displayed in concise groups of ten, occupying minimal screen space. But can you tell which ones might be worth pursuing?

As always a great post by Tony but a little over the top with:

“…a dialogue between user and system that can be every bit as rich as human conversation.”

Not in my experience but that’s not everyone’s experience.

Has anyone tested the thesis that dialogue between a user and a search engine is as rich as that between a user and a reference librarian?

Open Source TokuDB Resources

Filed under: Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 6:07 pm

Open Source TokuDB Resources

A quick summary of the Tokutek repositories at Github and pointers to Google groups for discussion of TokuDB.

Extracting and connecting chemical structures…

Filed under: Cheminformatics,Data Mining,Text Mining — Patrick Durusau @ 6:00 pm

Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.

Abstract:

Background

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

Developing a Solr Plugin

Filed under: Searching,Solr,Uncategorized — Patrick Durusau @ 4:37 pm

Developing a Solr Plugin by Andrew Janowczyk.

From the post:

For our flagship product, Searchbox.com, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.

There are mainly 4 types of custom plugins which can be created. We’ll discuss their differences here:

Sometimes Andrew says three (3) types of plugins and sometimes he says four (4).

I tried to settle the question by looking at the Solr Wiki on plugins.

Depends on how you want to count separate plugins. 😉

But, Andrew’s advice about learning to write plugins is sound. It will put your results above those of others.

Bulk Access to Law-Related Linked Data:…

Filed under: Law,Legal Informatics — Patrick Durusau @ 4:23 pm

Bulk Access to Law-Related Linked Data: LC & VIAF Name Authority Records and LC Subject Authority Records

From the post:

Linked Data versions of Library of Congress name authority records and subject authority records are now available for bulk download from the Library of Congress Linked Data Service, according to Kevin Ford at Library of Congress.

In addition, VIAF, the Virtual International Authority File, now provides bulk access to Linked Data versions of name authority records for organizations, including government entities and business organizations, from more than 30 national or research libraries. VIAF data are also searchable through the VIAF Web user interface.

Always good to have more data but I would use caution with the Library of Congress authority records.

See for example, TFM (To Find Me) Mark Twain.

Authority record means just that, a record issued by an authority.

The state of being a “correct” record is something else entirely.

How Scaling Really Works in Apache HBase

Filed under: HBase — Patrick Durusau @ 4:14 pm

How Scaling Really Works in Apache HBase by Matteo Bertozzi.

From the post:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

You can use a tool or master a tool.

Recommend the latter.

