Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 25, 2012

The life cycle of physicists

Filed under: Humor — Patrick Durusau @ 7:16 pm

The life cycle of physicists

What (if anything) would you change about this cartoon to make it apply to topic map mavens, ontologies, semantic web types, etc.?

Hope you are at the start of a great week!

Tesseract – Fast Multidimensional Filtering for Coordinated Views

Filed under: Analytics,Dataset,Filters,Multivariate Statistics,Visualization — Patrick Durusau @ 7:16 pm

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.

Are you ready to “slice and dice” your data set?
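If you want a feel for why incremental filtering over sorted indexes is fast, here is a minimal Python sketch (not Tesseract’s actual JavaScript API; the field name and data are made up): each dimension is sorted once, and a range filter then reduces to two binary searches.

```python
# Minimal sketch of range filtering against a sorted per-dimension index.
# Not Tesseract's API -- just the underlying idea.
import bisect

records = [{"amount": a} for a in (5, 42, 17, 8, 99, 23, 61, 3)]

# Build the sorted index once: (value, record position) pairs.
index = sorted((r["amount"], i) for i, r in enumerate(records))
values = [v for v, _ in index]

def filter_range(lo, hi):
    """Return record positions with lo <= amount < hi via two binary searches."""
    start = bisect.bisect_left(values, lo)
    stop = bisect.bisect_left(values, hi)
    return [index[i][1] for i in range(start, stop)]

print(filter_range(10, 50))  # positions of records with amount in [10, 50)
```

Nudging the filter bounds only touches the entries near the old boundaries, which is why small adjustments to a filter are so much cheaper than rescanning every record.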

Guardian open weekend: mapping with Google Fusion tables

Filed under: Fusion Tables,Graphics,Visualization — Patrick Durusau @ 7:16 pm

Guardian open weekend: mapping with Google Fusion tables

From the post:

As part of the Guardian Open Weekend, we’ve presented how we work with data and how we use Google Fusion Tables. Check out our presentations – and give it a go yourself.

The Guardian is one of the leaders in visualization of news data.

Very much worth your time. On a regular basis.

CS 194-16: Introduction to Data Science

Filed under: Data Mining,Data Science — Patrick Durusau @ 7:16 pm

CS 194-16: Introduction to Data Science

From the homepage:

Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term “Data Science”. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

Tip: Look closely at the resources page and the notes from the 2011 course.

How to Get Published – Elsevier

Filed under: Publishing — Patrick Durusau @ 7:15 pm

How to Get Published.

Author training webcasts from Elsevier.

Whether you are thinking about publishing in professional journals or simply want to improve (write?) useful user documentation, this isn’t a bad resource.

Lucene Full Text Indexing with Neo4j

Filed under: Indexing,Lucene,Neo4j,Neo4jClient — Patrick Durusau @ 7:15 pm

Lucene Full Text Indexing with Neo4j by Romiko Derbynew.

From the post:

I spent some time working on full text search for Neo4j. The basic goals were as follows.

  • Control the pointers of the index
  • Full Text Search
  • All operations are done via Rest
  • Can create an index when creating a node
  • Can update an index
  • Can check if an index exists
  • When bootstrapping Neo4j in the cloud, run index checks
  • Query the index using the full-text Lucene query language

Download:

This is based on Neo4jClient: http://nuget.org/List/Packages/Neo4jClient

Source code at: http://hg.readify.net/neo4jclient/

Introduction

So with the above objectives, I decided to go with Manual Indexing. The main reason here is that I can put an index pointing to node A based on values in node B.

Imagine the following.

You have Node A with a list:

Surname, FirstName and MiddleName. However, Node A also has a relationship to Node B, which has other names, perhaps Display Names, Avatar Names and AKAs.

So with manual indexing, you can have all the above entries for names in Node A and Node B point to Node A only. (emphasis added)

Not quite merging but it is an interesting take on creating a single point of reference.
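To make the idea concrete, here is a rough Python sketch against what I understand to be the Neo4j 1.x REST endpoints for manual node indexes. The server URL, index name, node URI, keys and name values are all illustrative assumptions, and the payload shapes may differ in your Neo4j version.

```python
# Rough sketch: name variants from Node A *and* the related Node B are all
# added to a manual index as entries that resolve to Node A.
# Endpoints and payloads assume the Neo4j 1.x REST API; names are made up.
import json
import requests

BASE = "http://localhost:7474/db/data"
INDEX = "person_names"
node_a_uri = BASE + "/node/1"          # the canonical node (Node A)

def add_to_index(key, value, node_uri):
    payload = {"key": key, "value": value, "uri": node_uri}
    requests.post(f"{BASE}/index/node/{INDEX}",
                  data=json.dumps(payload),
                  headers={"Content-Type": "application/json"})

# Names held on Node A plus aliases held on Node B, all pointing at Node A.
for name in ("Smith", "John", "JSmith42", "Johnny"):
    add_to_index("name", name, node_a_uri)

# A Lucene-style query returns Node A no matter which variant matched.
hits = requests.get(f"{BASE}/index/node/{INDEX}", params={"query": "name:JSmith*"})
print(hits.json())
```

The point is simply that the index entries, wherever their values come from, all return the same node, which is what gives you the single point of reference.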

BTW, search for Neo4j while you are at Romiko’s blog. Several very interesting posts and I am sure more are forthcoming.

Will Google Big Query Transform Big Data Analysis?

Filed under: Big Query,BigData — Patrick Durusau @ 7:15 pm

Will Google Big Query Transform Big Data Analysis? by Doug Henschen.

From the post:

Google shared details Wednesday about Google Big Query, a cloud-based service that promises to bring the search giant’s immense compute power and expertise with algorithms to bear on large data sets. The service is still in limited beta preview, but it promises to speed analysis of Google ad data while opening up ways to mash up and analyze huge data sets from external sources.

Google Big Query was described by Ju-Kay Kwek, product manager for Google Cloud Platform Team, as offering an array of SQL and graphical-user-interface-driven SQL analyses of tens of terabytes of data per customer, yet it doesn’t require indexing or pre-caching. What’s more, customers will get fine-grained analysis of all their data without summaries or aggregations.

“Fine-grained data is the key to the service because we don’t know what questions customers are going to ask,” said Kwek in an onstage interview at this week’s GigaOm Structure Data conference in New York.

Some of Google’s beta customers are uploading data to the service with batches and data streams and treating it as a cloud-based data warehouse, but Kwek said ad data would be the first priority, supporting a Google customer’s need to understand massive global campaigns running in multiple languages.

“When an advertiser wants to understand the ROI or effectiveness of a keyword campaign running across the globe, that’s a big-data problem,” Kwek said. “They’re currently extracting data using the Adwords API, building sharded databases on-premises, doing all the indexing, and sometimes losing track of the questions they wanted to ask by the time they have the data available.”

Thus, time to insight will be the biggest benefit of the service, Kwek said, with analyses taking a day or less, rather than days or weeks, when customers face extracting and structuring data on less robust and capable on-premises platforms.

I am troubled by the presumptions that Google is making with Big Query.

Google’s Big Query presumes:

  1. Customer’s big data has value to be extracted.
  2. Value is not being extracted now due to lack of computing resources.
  3. The missing computing resources can be supplied by Big Query.
  4. The customer has the analysis resources to extract the value using Big Query. (Not the same thing as writing SQL or dashboards.)
  5. The customer can act upon the value extracted from its big data.

If any of those presumptions fail, then so does the value of using Google’s Big Query.

Resources for BigQuery developers. Including version 2 of the Developers Guide.

Book Review- Machine Learning for Hackers

Filed under: Machine Learning,R — Patrick Durusau @ 7:14 pm

Book Review- Machine Learning for Hackers by Ajay Ohri.

From the post:

This is a review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly). The book is about hacking code in R.

The preface introduces the reader to the authors’ conception of what machine learning and hacking are all about. If the name of the book were machine learning for business analysts or data miners, I am sure the content would have been unchanged, though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed, the many wise and learned professors of statistics departments throughout the civilized world would be mildly surprised and bemused to find their day-to-day activities described as hacking or teaching hackers. The book follows a case study and example based approach and uses the ggplot2 package within R almost to the point of ignoring any other native graphics system in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

A chapter-by-chapter review that highlights a number of improvements one hopes will appear in a second (2nd) edition. Mostly editorial, clarity-type improvements that should have been caught in editorial review.

The complete source code for the examples can be downloaded here. It is a little over 100 MB in zip format. I checked and the data files for the various exercises are included, which explains the size of the download.

High-effort graphics

Filed under: Graphics,Visualization — Patrick Durusau @ 7:14 pm

High-effort graphics from Junk Charts.

You need to visit Junk Charts to see this graphic.

My reaction is largely the same as Junk Charts. This is over-crowded with information and tiresome to puzzle out.

What do you think?

A Twelve Step Program for Searching the Internet

Filed under: Common Crawl,Search Data,Search Engines,Searching — Patrick Durusau @ 7:14 pm

OK, the real title is: Twelve steps to running your Ruby code across five billion web pages

From the post:

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

A call to action and an awesome post!

If you have ever forwarded a blog post, forward this one.

This would make a great short course topic. Will have to give that some thought.
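For a short course, the shape of such a job is easy to sketch. The walk-through itself is in Ruby; what follows is just a generic Hadoop Streaming style mapper in Python, reading plain URLs from stdin, which is a simplification (real Common Crawl jobs parse the ARC/WARC records, and the S3 paths are not shown here):

```python
#!/usr/bin/env python
# Generic Hadoop Streaming mapper sketch: emit "domain \t 1" per input URL
# so a reducer can count pages per domain. Illustrative only; not tied to
# Common Crawl's actual record format or bucket layout.
import sys
from urllib.parse import urlparse

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    domain = urlparse(url).netloc
    if domain:
        print(f"{domain}\t1")
```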

March 24, 2012

Quid

Filed under: Quid — Patrick Durusau @ 7:37 pm

Quid

Quid software is described as enabling you to:

  • Map an emerging technology sector to reveal its key components
  • Discover new opportunities through white space analysis
  • Understand R&D focus of technology companies
  • Monitor technology development in emerging markets
  • Track scientific breakthroughs from academic origins to commercialization
  • Understand co-investment relationships and derive investment strategies
  • Identify the standout companies within a sector

All of which are claims you and I have seen (if not stronger ones) in any number of contexts.

But, then take a look at their explanation of their technical process.

What I find compelling about it is the description of a process that has multiple steps, no one magic dust that holds it together. Obviously a lot of human analytical skill goes into the final results.

Take a close look at Quid, I suspect it will be coming up again from time to time.

Best Written Paper

Filed under: Documentation,Writing — Patrick Durusau @ 7:36 pm

Best Written Paper by Michael Mitzenmacher.

From the post:

Daniel Lemire pointed to an article on bad writing in science (here if you care to see, not CS-specific), which got me to thinking: do we (in whatever subcommunity you think of yourself being in) value good writing? Should we?

One question is what qualifies as good writing in science. I’m not sure there’s any consensus here — although that’s true for writing more generally as well. While colorful word choice and usage can garner some attention* (and, generally, wouldn’t hurt), unlike what some people may think, good writing in science is not a vocabulary exercise. I find that two key features cover most of what I mean by good writing:

  1. Be clear.
  2. Tell a story.

Those two points cover “good writing” for all types of writing.

Two New AWS Getting Started Guides

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 7:36 pm

Two New AWS Getting Started Guides

From the post:

We’ve put together a pair of new Getting Started Guides for Linux and Microsoft Windows. Both guides will show you how to use EC2, Elastic Load Balancing, Auto Scaling, and CloudWatch to host a web application.

The Linux version of the guide (HTML, PDF) is built around the popular Drupal content management system. The Windows version (HTML, PDF) is built around the equally popular DotNetNuke CMS.

These guides are comprehensive. You will learn how to:

  • Sign up for the services
  • Install the command line tools
  • Find an AMI
  • Launch an Instance
  • Deploy your application
  • Connect to the Instance using the MindTerm SSH Client or PuTTY
  • Configure the Instance
  • Create a custom AMI
  • Create an Elastic Load Balancer
  • Update a Security Group
  • Configure and use Auto Scaling
  • Create a CloudWatch Alarm
  • Clean up

Other sections cover pricing, costs, and potential cost savings.
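The guides themselves walk through the console and the EC2 command line tools; as a hedged sketch, the “launch an instance” step looks roughly like this with the boto Python library (the AMI ID, key pair and security group names are placeholders):

```python
# Sketch of launching an EC2 instance with boto (placeholder AMI/key/group).
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")   # credentials from env/config
reservation = conn.run_instances(
    "ami-12345678",                  # placeholder AMI, e.g. a Drupal image
    key_name="my-keypair",
    instance_type="t1.micro",
    security_groups=["web-server"],
)
instance = reservation.instances[0]
print(instance.id, instance.state)
```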

Not quite a transparent computing fabric, yet. 😉

Apache ZooKeeper 3.3.5 has been released

Filed under: Zookeeper — Patrick Durusau @ 7:36 pm

Apache ZooKeeper 3.3.5 has been released by Patrick Hunt.

From the post:

Apache ZooKeeper release 3.3.5 is now available. This is a bug fix release covering 11 issues, two of which were considered blockers. Some of the more serious issues include:

  • ZOOKEEPER-1367 Data inconsistencies and unexpired ephemeral nodes after cluster restart
  • ZOOKEEPER-1412 Java client watches inconsistently triggered on reconnect
  • ZOOKEEPER-1277 Servers stop serving when lower 32bits of zxid roll over
  • ZOOKEEPER-1309 Creating a new ZooKeeper client can leak file handles
  • ZOOKEEPER-1389 It would be nice if start-foreground used exec $JAVA in order to get rid of the intermediate shell process
  • ZOOKEEPER-1089 zkServer.sh status does not work due to invalid option of nc

Stability, Compatibility and Testing

3.3.5 is a stable release that’s fully backward compatible with 3.3.4. Only bug fixes relative to 3.3.4 have been applied. Version 3.3.5 will be incorporated into the upcoming CDH3U4 release.

Just in case you are curious, ZOOKEEPER-1367 and ZOOKEEPER-1412 were the blocking issues. I would have thought leaking file handles (ZOOKEEPER-1309) would be as well. It’s fixed now but I am curious about the basis for classification of issues. (Not entirely academic since the ODF TC uses JIRA, after a fashion, to track issues with standard revision.)

Cloudera Manager 3.7.4 released! (spurious alerts?)

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:36 pm

Cloudera Manager 3.7.4 released! by Bala Venkatrao.

From the post:

We are pleased to announce that Cloudera Manager 3.7.4 is now available! The most notable updates in this release are:

  • A fixed memory leak in supervisord
  • Compatibility with a scheduled refresh of CDH3u3
  • Significant improvements to the alerting functionality, and the rate of ‘false positive alerts’
  • Support for several new multi-homing features
  • Updates to the default heap sizes for the management daemons (these have been increased).

The detailed Cloudera Manager 3.7.4 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.4 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

I admit to being curious (or is that suspicious?) and so when I read ‘false positive alerts’, I had to consult the release notes:

  • Some of the alerting behaviors have changed, including selected default settings. This has streamlined some of the alerting behavior and avoids spurious alerts in certain situations. These changes include:
    • The default alert values have been changed so that summary level alerts are disabled by default, to avoid unnecessary email alerts every time an individual health check alert email is sent.
    • The default behavior for DataNodes and TaskTrackers is now to never emit alerts.
    • The “Job Failure Ratio Thresholds” parameter has been disabled by default. The utility of this test very much depends on how the cluster is used. This parameter and the “Job Failure Ratio Minimum Failing Jobs” parameters can be used to alert when jobs fail.

So, the alerts in question were not spurious alerts but alerts users of Cloudera Manager could not correctly configure?

Question: Can your Cloudera Manager users correctly configure alerts? (That could be a good Cloudera installation interview question. Use a machine disconnected from your network and the Internet for testing.)

HBaseCon 2012

Filed under: Conferences,HBase — Patrick Durusau @ 7:35 pm

HBaseCon 2012

Early Bird Registration ends 6 April 2012

May 22, 2012
InterContinental San Francisco Hotel
888 Howard Street
San Francisco, CA 94103

From the webpage:

Real-Time Your Hadoop

Join us for HBaseCon 2012, the first industry conference for Apache HBase users, contributors, administrators and application developers.

Network. Share ideas with colleagues and others in the rapidly growing HBase community. See who is speaking

Learn. Attend sessions and lightning talks about what’s new in HBase, how to contribute, best practices on running HBase in production, use cases and applications. View the agenda

Train. Make the most of your week and attend Cloudera training for Apache HBase, in the 2 days following the conference. Sign up

BTW, if you attend, you get a voucher for a free ebook: HBase: The Definitive Guide from O’Reilly.

As rapidly as solutions are developing, conferences look like a major source of up-to-date information.

Apache HBase 0.92.1 now available

Filed under: Cloudera,Hadoop,HBase — Patrick Durusau @ 7:35 pm

Apache HBase 0.92.1 now available by Shaneal Manek.

From the post:

Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.

Apache HBase 0.92.1 is a bug fix release covering 61 issues, including 6 blockers and 6 critical issues.

Sunburst and Cartograms in R

Filed under: Graphs,R,Visualization — Patrick Durusau @ 7:35 pm

Sunburst and Cartograms in R by Ajay Ohri.

From the post:

There are still some graphs that cannot yet be made in R using a straightforward function or package.

One is the sunburst (a radial kind of treemap; treemaps themselves can be made in R). See the diagrams in the post for the difference. Note that a sunburst is visually similar to a coxcomb (Nightingale) graph. Coxcombs can also be built by hand, but I have yet to find a package that makes a coxcomb with a single function; the HistData package in R comes close in terms of historical datasets.

The Treemap uses a rectangular, space-filling slice-and-dice technique to visualize objects in the different levels of a hierarchy. The area and color of each item corresponds to an attribute of the item as well.

The Sunburst technique is an alternative, space-filling visualization that uses a radial rather than a rectangular layout. An example Sunburst display appears in the original post. (Citation: http://www.cc.gatech.edu/gvu/ii/sunburst/)

Maybe it is my being graphically challenged, as they say, but I really appreciate clever graphics.

I think you will find the graphs demonstrated here useful in a number of contexts.
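The post works in R, but the radial, space-filling idea itself is easy to sketch in any plotting library. Here is a minimal two-level sunburst built from nested donut rings with matplotlib (the data are made up):

```python
# Two-level "sunburst" sketch as nested donut rings (made-up data).
# Inner ring: parent categories; outer ring: their children, in parent order.
import matplotlib.pyplot as plt

inner = [40, 35, 25]                        # parent sizes
outer = [20, 20, 15, 10, 10, 10, 15]        # child sizes grouped by parent

fig, ax = plt.subplots()
ax.pie(inner, radius=0.7, wedgeprops=dict(width=0.3, edgecolor="w"))
ax.pie(outer, radius=1.0, wedgeprops=dict(width=0.3, edgecolor="w"))
ax.set(aspect="equal", title="Two-level sunburst sketch")
plt.show()
```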

GovTrack (US) – Update

Filed under: Government,Transparency — Patrick Durusau @ 7:35 pm

GovTrack (US) (site improvements)

The GovTrack site has undergone some major modifications.

If you are interested in tracking legislation in the U.S. Congress, this is the site for you.

The Heterogeneous Programming Jungle

The Heterogeneous Programming Jungle by Michael Wolfe.

Michael starts off with one definition of “heterogeneous:”

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation. These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

And while he returns to that definition in the end, another form of heterogeneity is lurking not far behind:

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

So there is the heterogeneity of the attached coprocessors and, just as importantly, of the host processors they are paired with.

His post concludes with:

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

These are languages that share common subjects (think of their target architectures) and so are ripe for a topic map that co-locates their approaches to a particular architecture. Being able to incorporate official and non-official documentation, tests, sample code, etc., might enable faster progress in this area.

The future of HPC processors is almost upon us. It will not do to be tardy.

March 23, 2012

Trouble at the text mine

Filed under: Data Mining,Search Engines,Searching — Patrick Durusau @ 7:24 pm

Trouble at the text mine by Richard Van Noorden.

From the post:

When he was a keen young biology graduate student in 2006, Max Haeussler wrote a computer program that would scan, or ‘crawl’, plain text and pull out any DNA sequences. To test his invention, the naive text-miner downloaded around 20,000 research papers that his institution had paid to access — and promptly found his IP address blocked by the papers’ publisher.

It was not until 2009 that Haeussler, then at the University of Manchester, UK, and now at the University of California, Santa Cruz, returned to the project in earnest. He had come to realize that standard site licences do not permit systematic downloads, because publishers fear wholesale theft of their content. So Haeussler began asking for licensing terms to crawl and text-mine articles. His goal was to serve science: his program is a key part of the text2genome project, which aims to use DNA sequences in research papers to link the publications to an online record of the human genome. This could produce an annotated genome map linked to millions of research articles, so that biologists browsing a genomic region could immediately click through to any relevant papers.

But Haeussler and his text2genome colleague Casey Bergman, a genomicist at the University of Manchester, have spent more than two years trying to agree terms with publishers — and often being ignored or rebuffed. “We’ve learned it’s a long, hard road with every journal,” says Bergman.
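The extraction step at the heart of the project is simple to sketch. A toy version (nothing like Haeussler’s actual program, which has to cope with formatting, ambiguity codes and OCR noise) just pulls long runs of A/C/G/T out of plain text:

```python
# Toy DNA-sequence extractor: find runs of A/C/G/T at least 20 bases long.
import re

text = ("...the forward primer ACGTACGTTAGCCGATCGATCGGCTA was used, "
        "while the control contained no sequence data...")

dna_pattern = re.compile(r"[ACGT]{20,}")
for match in dna_pattern.finditer(text):
    print(match.group())
```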

What Haeussler and Bergman don’t seem to “get” is that publishers have no interest in advancing science. Their sole and only goal is profiting from the content they have published. (I am not going to argue right or wrong but am simply trying to call out the positions in question.)

The question that Haeussler and Bergman should answer for publishers is this one: What is in this “indexing” for the publishers?

I suspect one acceptable answer would run along the lines of:

  • The full content of articles cannot be reconstructed from the indexes. The largest block of content delivered will be the article abstract, along with bibliographic reference data.
  • Pointers to the articles will point to the publisher’s content site and/or to other commercial content providers that carry the publisher’s content.
  • The publisher’s designated journal logo (of some specified size) will appear with every reported citation.
  • The indexed content will be provided to the publishers at no charge.

Does this mean that publishers will benefit from allowing the indexing of their content? Yes. Next question.

A new RDFa Test Harness

Filed under: RDFa,Semantic Web — Patrick Durusau @ 7:24 pm

A new RDFa Test Harness by Gregg Kellogg.

From the post:

This is an introductory blog post on the creation of a new RDFa Test Suite. Here we discuss the use of Sinatra, Backbone.js and Bootstrap.js to run the test suite. Later will come articles on the usefulness of JSON-LD as a means of driving a test harness, generating test reports, and the use of BrowserID to deal with Distributed Denial of Service attacks that cropped up overnight.

Interesting, but it strikes me as formal/syntax validation of the RDFa in question. Useful, but only up to a point. Yes?

Can you point me to an RDFa or RDF test harness that tests the semantic “soundness” of the claims made in RDFa or RDF?

Such a harness may quite easily exist and I have just not seen it.

Thanks!

Challenges in maintaining a high performance search engine written in Java

Filed under: Lucene,Search Algorithms,Search Engines — Patrick Durusau @ 7:24 pm

Challenges in maintaining a high performance search engine written in Java

You will find this on the homepage or may have to search for it. I was logged in when I accessed it.

Very much worth your time for a high level overview of issues that Lucene will face sooner rather than later.

After reviewing, think about it, make serious suggestions and if possible, contributions to the future of Lucene.

Just off the cuff, I would really like to see Lucene become a search engine framework with a default data structure that admits of either extension or replacement by other data structures. Some data structures may have higher performance costs than others, but if that is what your requirements call for, they can hardly be wrong. Yes? A “fast” search engine that doesn’t meet your requirements is no prize.

Distributed Terminology System 4.0 – Apelon – != a Topic Map?

Filed under: DTS,Health care,Terminology — Patrick Durusau @ 7:24 pm

APELON INTRODUCES DISTRIBUTED TERMINOLOGY SYSTEM 4.0 – Latest Version of Leading Open Source Terminology Management Software Provides Enhanced Interoperability and Integration Capabilities

From the post:

Apelon, Inc., an international provider of terminology and data interoperability solutions, is pleased to announce a major new release (4.0) of its Distributed Terminology System (DTS), the healthcare industry’s leading open source terminology management platform. Based on extensive user feedback from deployments around the world, the new release features significant usability enhancements, new methods for tracking terminology changes over time, and greater integration with Java Enterprise Edition (JEE) and Software Oriented Architecture (SOA) infrastructures. The product will be unveiled this month at the Healthcare Information and Management Systems Society (HIMSS) 2012 Conference and Exhibition in Las Vegas, February 21 – 23, 2012.

Apelon’s DTS is a comprehensive open-source solution for the acquisition, management and practical deployment of standardized healthcare terminologies. Integration of data standards is a critical element for healthcare organizations to realize care improvement. The product supports data standardization and interoperability in Electronic Health Records systems, Healthcare Information Exchanges, and Clinical Decision Systems.

With version 4.0, DTS users easily manage the complete terminology lifecycle. The system provides the ability to transparently view, query, and browse across terminology versions. This facilitates the management of rapidly evolving standards such as SNOMED CT, ICD-10-CM, LOINC and RxNorm, and supports their use for longitudinal electronic health records. Local vocabularies, subsets and cross-maps can be versioned and queried in the same way, meaning that DTS users can tailor and adapt standards to their particular needs. Users also benefit from usability enhancements to DTS applications such as the DTS 4.0 Editor and DTS Browser, including internationalization capabilities for non-English-speaking environments.

To simplify integration into existing enterprise systems, DTS 4.0 is built on the JEE platform, supporting a complete set of web service APIs, in addition to the existing Java and .NET interfaces. Continuing the company’s commitment to open standards, DTS version 4.0 also supports HL7 Common Terminology Services 2 (CTS2).

According to Stephen Coady, Apelon president and CEO, the increasing use of reference terminologies in healthcare has precipitated the need for enhanced functionality in terminology management tools. “DTS 4.0 evidences our long-term commitment to making open source tools that allow organizations worldwide to improve care using reference terminologies. The new version is simpler to use, and will help even more institutions interoperate and integrate the latest decision support technologies into their daily work.”

DTS establishes a single common resource for an organization’s terminology assets that can be deployed across the spectrum of health delivery systems. Apelon made DTS open source in early 2007, providing the industry with significant cost, integration and adoption advantages compared to proprietary solutions. Since then the software has been downloaded by more than 3,500 informaticists and healthcare organizations worldwide.

You can grab a copy of the software (not the 4.0, yet) at Sourceforge: Apelon-DTS.

I just grabbed a copy so it will be several days before I have substantive comments on the 3.5.2 version of DTS at Sourceforge.

Part of what I will be investigating is how DTS differs from a topic map solution. Which one is appropriate for you will depend on your requirements.

Statistical Analysis: Common Mistakes

Filed under: Statistics — Patrick Durusau @ 7:24 pm

Statistical Analysis: Common Mistakes by Sandro Saitta.

The post cites the following example from the paper:

“Imagine you are a regional sales head for a major retailer in U.S. and you want to know what drives sales in your top performing stores. Your research team comes back with a revealing insight – the most significant predictor in their model is the average number of cars present in stores’ parking lots.”

A good paper to re-read from time to time.

Spruce Up Your Data Visualization Skills

Filed under: Data,Visualization — Patrick Durusau @ 7:23 pm

Spruce Up Your Data Visualization Skills

Juice Analytics has released five (5) new design videos on its resources page.

If its Design Principles page is completed with a webpage for every design principle down the right-hand side of the page, it will be a formidable design resource.

Design suggestion: Rather than making users look on two different pages for design resources, why not combine the white papers/resources page with the design principles page? I could not tell which one was going to lead to the videos except for the links in the blog post.

Good principles and videos.

Innovation History via 6,000 Pages of Annual Reports

Filed under: Data,Visualization — Patrick Durusau @ 7:23 pm

Innovation History via 6,000 Pages of Annual Reports

Nathan Yau from FlowingData reports on a visualization of all the GE annual reports from 1892 until 2011.

Selecting keywords lights up pages with those words.

Billed as tracing the evolution of innovation, but I am not sure I would go that far.

Interesting visualization but not every visualization, even an interesting one, is useful.

Fathom Information Design is responsible for a number of unusual visualizations.

Excellent Papers for 2011 (Google)

Filed under: HCIR,Machine Learning,Multimedia,Natural Language Processing — Patrick Durusau @ 7:23 pm

Excellent Papers for 2011 (Google)

Corinna Cortes and Alfred Spector of Google Research have collected great papers published by Googlers in 2011.

To be sure, there are the obligatory papers on searching and natural language processing, but there are also papers on audio processing, human-computer interfaces, multimedia, systems and other topics.

Many of these will be the subjects of separate posts in the future. For now, peruse at your leisure and sing out when you see one of special interest.

Building a Bigger Haystack

Filed under: Data Mining,Marketing,Topic Maps — Patrick Durusau @ 7:23 pm

Counterterrorism center increases data retention time to five years by Mark Rockwell.

From the post:

The National Counterterrorism Center, which acts as the government’s clearinghouse for terrorist data, has moved to hold onto certain types of data for up to five years to improve its ability to keep track of it across government databases.

On March 22, NCTC implemented new guidelines allowing much lengthier data retention period for “terrorism information” in federal datasets including non-terrorism information. NCTC had previously been required to destroy data on citizens within three months if no ties were found to terrorism. Those rules, according to NCTC, limited the effectiveness of the data, since in some instances, the ability to link across data sets over time could help track threats that weren’t immediate, or immediately evident. According to the center, the longer retention time can aid in connecting dots that aren’t immediately evident when the initial data is collected.

Director of National Intelligence James Clapper, Attorney General Eric Holder, and National Counterterrorism Center (NCTC) Director Matthew Olsen signed the updated guidelines designed on March 22 to allow NCTC to obtain and more effectively analyze certain data in the government’s possession to better address terrorism-related threats.

I looked for the new guidelines but apparently they are not posted to the NCTC website.

Here is the justification for the change:

One of the issues identified by Congress and the intelligence community after the 2009 Fort Hood shootings and the Christmas Day 2009 bombing attempt was the government’s limited ability to query multiple federal datasets and to correlate information from many sources that might relate to a potential attack, said the center. A review of those attacks recommended the intelligence community push for the use of state-of-the-art search and correlation capabilities, including techniques that would provide a single point of entry to various government databases, it said.

“Following the failed terrorist attack in December 2009, representatives of the counterterrorism community concluded it is vital for NCTC to be provided with a variety of datasets from various agencies that contain terrorism information,” said Clapper in a March 22 statement. “The ability to search against these datasets for up to five years on a continuing basis as these updated Guidelines permit will enable NCTC to accomplish its mission more practically and effectively than the 2008 Guidelines allowed.”

OK, so for those two cases, what evidence would having search capabilities over five years’ worth of data have uncovered? Even with the clarity of hindsight, there has been no showing of what data could have been uncovered.

The father of the attacker reported his son’s intentions to the CIA on November 19, 2009. That’s right, 36 days before the attack.

Building a bigger haystack is a singularly ineffectual way to fight terrorism. It will generate more data and more IT systems, with the personnel to staff and sustain them, all of which are agency-growth goals, not terrorism-fighting goals.

Cablegate was the result of a “bigger haystack” project. Do you think we need another one?

Topic maps and other semantic technologies can produce smaller, relevant haystacks.

I guess that is the question:

Do you want more staff and a larger budget, or the potential to combat terrorism? (The latter is only potential, given that US intelligence can’t intercept bombers on 36 days’ notice.)

March 22, 2012

Tracking Microsoft Buzz with Blogs, Twitter, Bitly and Videos

Filed under: Microsoft,Searching — Patrick Durusau @ 7:43 pm

Tracking Microsoft Buzz with Blogs, Twitter, Bitly and Videos

Matthew Hurst writes:

Microsoft is an incredibly diverse company. I’ve just celebrated 5 years here and still don’t have a full appreciation of the breadth and depth of products and innovation that the corporation generates. After BlogPulse was unplugged, I felt something of a hankering to continue to follow the buzz around Microsoft, partly as a way to better follow what the company is doing and how it is perceived in the online world.

I’m a big fan of TechMeme, but it has some challenges when it comes to tracking news and trends around a specific company. Firstly, I don’t know the sources that are used and the ranking mechanisms in place, so it is hard to really understand quantitatively what it represents. Secondly, with limited real estate, while a big story may be happening for a company of interest, it can be crowded out by other events. Thirdly, I can’t help but think it has a strong valley culture bias. Fourthly, it hasn’t evolved much in the years that I’ve been visiting it.

So I’ve put together an experimental site called track // microsoft which follows a few blogs, clusters posts that are related and uses Bitly and Twitter data to rank the articles and clusters of stories. In doing this, I observed that many posts in the blogosphere about Microsoft would contain videos (be they of Windows 8 demos or the latest research leveraging the Kinect platform).

A great illustration that not every useful search application crawls the entire WWW.

It should crawl only as much as you need. The rest is just noise.
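As a rough sketch of the “clusters posts that are related” step (nothing to do with Hurst’s actual implementation), TF-IDF vectors plus a cosine-similarity threshold already get you a workable first cut:

```python
# Rough sketch: group related post titles by TF-IDF cosine similarity.
# Illustrative only; not how track // microsoft actually works.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "Windows 8 consumer preview demo video",
    "Hands-on with the Windows 8 consumer preview",
    "Kinect SDK research project tracks hand gestures",
    "New Kinect gesture research demo from Microsoft Research",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
sim = cosine_similarity(tfidf)

# Greedy single-pass grouping: join the first existing cluster that contains
# a sufficiently similar post, otherwise start a new cluster.
clusters = []
for i in range(len(posts)):
    for cluster in clusters:
        if max(sim[i][j] for j in cluster) > 0.3:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)   # indexes of posts grouped into story clusters
```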
