Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 24, 2012

Consistency through semantics

Filed under: Consistency,Semantics,Software — Patrick Durusau @ 2:13 pm

Consistency through semantics by Oliver Kennedy.

From the post:

When designing a distributed system, one of the first questions anyone asks is what kind of consistency model to use. This is a fairly nuanced question, as there isn’t really one right answer. Do you enforce strong consistency and accept the resulting latency and communication overhead? Do you use locking, and accept the resulting throughput limitations? Or do you just give up and use eventual consistency and accept that sometimes you’ll end up with results that are just a little bit out of sync?

It’s this last bit that I’d like to chat about today, because it’s actually quite common in a large number of applications. This model is present in everything from user-facing applications like Dropbox to SVN/GIT, to back-end infrastructure systems like Amazon’s Dynamo and Yahoo’s PNUTs. Often, especially in non-critical applications, latency and throughput are more important than dealing with the possibility that two simultaneous updates will conflict.

So what happens when this dreadful possibility does come to pass? Clearly the system can’t grind to a halt, and often just randomly discarding one of these updates is the wrong thing to do. So what happens? The answer is common across most of these systems: They punt to the user.

Intuitively, this is the right thing to do. The user sees the big picture. The user knows best how to combine these operations. The user knows what to do, so on those rare occurrences where the system can’t handle it, the user can.

But why is this the right thing to do? What does the user have that the infrastructure doesn’t?

Take the time to read the rest of Oliver’s post.

He distinguishes rather nicely between applications and users.
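Purely as a sketch of that “punt to the user” pattern (Java for illustration, all names hypothetical, and a single counter standing in for a real vector clock): the replica applies non-conflicting updates on its own and hands concurrent writes on the same key to a caller-supplied resolver.

```java
import java.util.HashMap;
import java.util.Map;

/** Caller-supplied policy: the user sees the big picture, so the user merges. */
interface ConflictResolver {
    String resolve(String key, String localValue, String remoteValue);
}

/** Toy eventually-consistent key/value replica. */
class Replica {
    private final Map<String, String> data = new HashMap<String, String>();
    private final Map<String, Long> versions = new HashMap<String, Long>();

    void putLocal(String key, String value) {
        data.put(key, value);
        Long v = versions.get(key);
        versions.put(key, v == null ? 1L : v + 1L);
    }

    /** Merge an update received from another replica. */
    void merge(String key, String remoteValue, long remoteVersion, ConflictResolver resolver) {
        Long localVersion = versions.get(key);
        if (localVersion == null || remoteVersion > localVersion) {
            data.put(key, remoteValue);               // no conflict: apply silently
            versions.put(key, remoteVersion);
        } else if (!remoteValue.equals(data.get(key))) {
            // Concurrent, conflicting writes: the system cannot decide, so punt to the user.
            data.put(key, resolver.resolve(key, data.get(key), remoteValue));
            versions.put(key, localVersion + 1);
        }
    }
}
```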

The Seventh Law of Data Quality

Filed under: Data,Data Quality — Patrick Durusau @ 12:02 pm

The Seventh Law of Data Quality by Jim Harris.

Jim’s series on the “laws” of data quality can be recommended without reservation. There are links to each one in his coverage of the seventh law.

The seventh law of data quality reads:

Determine the business impact of data quality issues BEFORE taking any corrective action in order to properly prioritize data quality improvement efforts.

I would modify that slightly to make it applicable to data issues more broadly as:

Determine the business impact of a data issue BEFORE addressing it at all.

Your data may be completely isolated in silos, but without a business purpose to be served by freeing them, why bother?

And that purpose should have a measurable ROI.

In the absence of a business purpose and a measurable ROI, keep both hands on your wallet.

Dreamworks Animation releases OpenVDB 0.99

Filed under: Graphics,Visualization — Patrick Durusau @ 11:37 am

Dreamworks Animation releases OpenVDB 0.99

From the post:

Dreamworks Animation has released a new version of its OpenVDB library. The animation production company open sourced the project in August and has now released version 0.99.0. OpenVDB has been used for some time within Dreamworks for features such as Puss in Boots, Madagascar 3: Europe’s Most Wanted and the just released Rise of the Guardians.

OpenVDB is a C++ library which includes a hierarchical data structure and a suite of tools for manipulating data within that structure as sparse, time-varying, volumetric data mapped to a three dimensional grid. Developed at Dreamworks by Ken Museth, the original VDB allows animators to use an “infinite” 3D index space yet benefit from compact storage and fast data access when processing spaces. The library’s algorithms include filtering, compositing, numerical simulation, sampling and voxelisation, all optimised to the OpenVDB data structures.

The library and tools also include a standalone OpenGL viewer and native Houdini integration. Houdini is a 3D animation package from Side Effects Software and its next major release will incorporate OpenVDB and a full suite of volume processing nodes. Side Effects Software developers are listed as contributors to OpenVDB.

Perhaps a bit high-end for 2-D graph rendering but for exploring more sophisticated data visualization, it may be just the thing.
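OpenVDB itself is C++ and far more sophisticated, but the core idea (an “infinite” index space with storage only for the voxels you actually touch) fits in a few lines. A toy map-backed grid in Java, nothing like OpenVDB’s actual tree structure:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sparse voxel grid: unbounded index space, storage only for voxels that were set. */
class SparseGrid {
    private final Map<Long, Float> voxels = new HashMap<Long, Float>();
    private final float background;   // value reported for voxels never set

    SparseGrid(float background) { this.background = background; }

    // Pack three 21-bit coordinates into one long key.
    private static long key(int x, int y, int z) {
        return ((x & 0x1FFFFFL) << 42) | ((y & 0x1FFFFFL) << 21) | (z & 0x1FFFFFL);
    }

    void set(int x, int y, int z, float v) { voxels.put(key(x, y, z), v); }

    float get(int x, int y, int z) {
        Float v = voxels.get(key(x, y, z));
        return v == null ? background : v;
    }

    int activeVoxelCount() { return voxels.size(); }
}
```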

BIOKDD 2013: …Biological Knowledge Discovery and Data Mining

Filed under: Bioinformatics,Biomedical,Conferences,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:24 am

BIOKDD 2013 : 4th International Workshop on Biological Knowledge Discovery and Data Mining

When: Aug 26, 2013 – Aug 30, 2013
Where: Prague, Czech Republic
Abstract Registration Due: Apr 3, 2013
Submission Deadline: Apr 10, 2013
Notification Due: May 10, 2013
Final Version Due: May 20, 2013

From the call for papers:

With the development of Molecular Biology during the last decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. And, consequently, we are provided too with the sequences of about 100,000 proteins. Therefore, we are entering the post-genomic era: after having focused so many efforts on the accumulation of data, we have now to focus as much effort, and even more, on the analysis of these data. Analyzing this huge volume of data is a challenging task because, not only, of its complexity and its multiple and numerous correlated factors, but also, because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. From here comes the necessity to use computer tools and develop new in silico high performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge Discovery and Data Mining (KDD) are a response to these new trends.

Topics of BIOKDD’13 workshop include, but not limited to:

Data Preprocessing: Biological Data Storage, Representation and Management (data warehouses, databases, sequences, trees, graphs, biological networks and pathways, …), Biological Data Cleaning (errors removal, redundant data removal, completion of missing data, …), Feature Extraction (motifs, subgraphs, …), Feature Selection (filter approaches, wrapper approaches, hybrid approaches, embedded approaches, …)

Data Mining: Biological Data Regression (regression of biological sequences…), Biological data clustering/biclustering (microarray data biclustering, clustering/biclustering of biological sequences, …), Biological Data Classification (classification of biological sequences…), Association Rules Learning from Biological Data, Text mining and Application to Biological Sequences, Web mining and Application to Biological Data, Parallel, Cloud and Grid Computing for Biological Data Mining

Data Postprocessing: Biological Nuggets of Knowledge Filtering, Biological Nuggets of Knowledge Representation and Visualization, Biological Nuggets of Knowledge Evaluation (calculation of the classification error rate, evaluation of the association rules via numerical indicators, e.g. measurements of interest, … ), Biological Nuggets of Knowledge Integration

Being held in conjunction with 24th International Conference on Database and Expert Systems Applications – DEXA 2013.

In case you are wondering about BIOKDD, consider the BIOKDD Programme for 2012.

Or the DEXA program for 2012.

Looks like a very strong set of conferences and workshops.

The Ironies of MDM [Master Data Management/Multi-Database Mining]

Filed under: Data Mining,Master Data Management,Multi-Database Mining — Patrick Durusau @ 11:06 am

A survey on mining multiple data sources by T. Ramkumar, S. Hariharan and S. Selvamuthukumaran.

Abstract:

Advancements in computer and communication technologies demand new perceptions of distributed computing environments and development of distributed data sources for storing voluminous amount of data. In such circumstances, mining multiple data sources for extracting useful patterns of significance is being considered as a challenging task within the data mining community. The domain, multi-database mining (MDM) is regarded as a promising research area as evidenced by numerous research attempts in the recent past. The methods exist for discovering knowledge from multiple data sources, they fall into two wide categories, namely (1) mono-database mining and (2) local pattern analysis. The main intent of the survey is to explain the idea behind those approaches and consolidate the research contributions along with their significance and limitations.

I can’t reach the full article, yet, but it sounds like one that merits attention.

I was struck by the irony that MDM, which some data types would expand to “Master Data Management,” is read here to mean “Multi-Database Mining.”

To be sure, “Master Data Management” can be useful, but be mindful that non-managed data lurks just outside your door.
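As a rough illustration of the “local pattern analysis” idea (my own toy version in Java, not anything from the paper): mine each source separately, then keep only the patterns that recur in enough of the sources.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Local pattern analysis, toy version: a "pattern" is just a frequently occurring item. */
class LocalPatternAnalysis {

    /** Items occurring in at least minLocalCount records of one source. */
    static Set<String> localPatterns(List<List<String>> records, int minLocalCount) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> record : records) {
            for (String item : new HashSet<String>(record)) {
                Integer c = counts.get(item);
                counts.put(item, c == null ? 1 : c + 1);
            }
        }
        Set<String> frequent = new HashSet<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minLocalCount) frequent.add(e.getKey());
        }
        return frequent;
    }

    /** Global patterns: local patterns that show up in at least minSources of the sources. */
    static Set<String> synthesize(List<Set<String>> perSourcePatterns, int minSources) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (Set<String> patterns : perSourcePatterns) {
            for (String p : patterns) {
                Integer v = votes.get(p);
                votes.put(p, v == null ? 1 : v + 1);
            }
        }
        Set<String> global = new HashSet<String>();
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() >= minSources) global.add(e.getKey());
        }
        return global;
    }
}
```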

November 23, 2012

Javadoc coding standards

Filed under: Documentation,Java — Patrick Durusau @ 11:30 am

Javadoc coding standards by Stephen Colebourne.

From the post:

These are the standards I tend to use when writing Javadoc. Since personal tastes differ, I’ve tried to explain some of the rationale for some of my choices. Bear in mind that this is more about the formatting of Javadoc, than the content of Javadoc.

There is an Oracle guide which is longer and more detailed than this one. The two agree in most places, however these guidelines are more explicit about HTML tags, two spaces in @param and null-specification, and differ in line lengths and sentence layout.

Each of the guidelines below consists of a short description of the rule and an explanation, which may include an example:

Documentation of source code is vital to its maintenance. (cant)

But neither Stephen nor Oracle made much of the need to document the semantics of the source and/or data. If I am indexing/mapping across source files, <code> elements aren’t going to be enough to compare field names across documents.

I am assuming that semantic diversity is as present in source code as elsewhere. Would you assume otherwise?
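For example, here is the kind of Javadoc I have in mind (a hypothetical method, following neither Stephen’s guide nor Oracle’s verbatim): documenting units, identity and provenance, not just formatting.

```java
import java.math.BigDecimal;

public class OrderLedger {

    /**
     * Records the total value of an order.
     * <p>
     * Semantics: the amount is tax inclusive, expressed in the ISO 4217 currency
     * given by {@code currencyCode}, as reported by the upstream billing system.
     * Two records describe the same order when {@code orderId} matches and the
     * records originate from the same source system.
     *
     * @param orderId       identifier assigned by the source system, never null
     * @param amount        tax-inclusive total, never negative
     * @param currencyCode  ISO 4217 code, for example "EUR"
     * @return the total as stored, in the currency given by {@code currencyCode}
     */
    public BigDecimal recordTotal(String orderId, BigDecimal amount, String currencyCode) {
        return amount; // storage omitted in this sketch
    }
}
```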

Data Mining and Machine Learning in Astronomy

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 11:30 am

Data Mining and Machine Learning in Astronomy by Nicholas M. Ball and Robert J. Brunner. (International Journal of Modern Physics D, Volume 19, Issue 07, pp. 1049-1106 (2010).)

Abstract:

We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

At fifty-eight (58) pages and three hundred and seventy-five references, this is a great starting place to learn about data mining and machine learning from an astronomy perspective!

And should yield new techniques or new ways to apply old ones to your data, with a little imagination.

Dates from 2010 so word of more recent surveys welcome!

…Knowledge Extraction From Complex Astronomical Data Sets

Filed under: Astroinformatics,BigData,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:29 am

CLaSPS: A New Methodology For Knowledge Extraction From Complex Astronomical Data Sets by R. D’Abrusco, G. Fabbiano, G. Djorgovski, C. Donalek, O. Laurino and G. Longo. (R. D’Abrusco et al. 2012 ApJ 755 92 doi:10.1088/0004-637X/755/2/92)

Abstract:

In this paper, we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for the determination of correlations among astronomical observables in complex data sets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for the selection of the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. CLaSPS has been primarily developed as a tool to tackle the challenging complexity of the multi-wavelength complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities. In this paper, we discuss the applications of CLaSPS to two simple astronomical data sets, both composed of extragalactic sources with photometric observations at different wavelengths from large area surveys. The first data set, CSC+, is composed of optical quasars spectroscopically selected in the Sloan Digital Sky Survey data, observed in the x-rays by Chandra and with multi-wavelength observations in the near-infrared, optical, and ultraviolet spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification of a well-known correlation between the αOX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other data set consists of a sample of blazars for which photometric observations in the optical, mid-, and near-infrared are available, complemented for a subset of the sources, by Fermi γ-ray data. The main results of the application of CLaSPS to such data sets have been the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification in BL Lac objects and flat-spectrum radio quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared colors space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors.

A new approach for mining “…correlations in complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities.”

Mining complex and massive data sets. I have heard that somewhere recently. I’m sure it will come back to me.

First Light for the Millennium Run Observatory

Filed under: Astroinformatics,Data Mining,Simulations — Patrick Durusau @ 11:29 am

First Light for the Millennium Run Observatory by Cmarchesin.

From the post:

The famous Millennium Run (MR) simulations now appear in a completely new light – literally. The project, led by Gerard Lemson of the MPA and Roderik Overzier of the University of Texas, combines detailed predictions from cosmological simulations with a virtual observatory in order to produce synthetic astronomical observations. In analogy to the moment when newly constructed astronomical observatories receive their “first light”, the Millennium Run Observatory (MRObs) has produced its first images of the simulated universe. These virtual observations allow theorists and observers to analyse the purely theoretical data in exactly the same way as they would purely observational data. Building on the success of the Millennium Run Database, the simulated observations are now being made available to the wider astronomical community for further study. The MRObs browser – a new online tool – allows users to explore the simulated images and interact with the underlying physical universe as stored in the database. The team expects that the advantages offered by this approach will lead to a richer collaboration between theoretical and observational astronomers.

At least with simulated observations, there is no need to worry about cloudy nights. 😉

Interesting in its own right but also as an example of yet another tool for data mining, that of simulation.

Not in the sense of generating “test” data but of deliberately altering data and then measuring the impact of the alterations on data mining tools.

Quite possibly in a double blind context where only some third party knows which data sets were “altered” until all tests have been performed.

Millennium Run Observatory Web Portal and access to the MRObs browser

Combining Neo4J and Hadoop (part I)

Filed under: Hadoop,Neo4j — Patrick Durusau @ 11:29 am

Combining Neo4J and Hadoop (part I) by Kris Geusebroek.

From the post:

Why combine these two different things.

Hadoop is good for data crunching, but the end-results in flat files don’t present well to the customer, also it’s hard to visualize your network data in excel.

Neo4J is perfect for working with our networked data. We use it a lot when visualizing our different sets of data.
So we prepare our dataset with Hadoop and import it into Neo4J, the graph database, to be able to query and visualize the data.
We have a lot of different ways we want to look at our dataset so we tend to create a new extract of the data with some new properties to look at every few days.

This blog is about how we combined Hadoop and Neo4J and describes the phases we went through in our search for the optimal solution.

Mostly covers slow load speeds into Neo4j and attempts to improve it.

A future post will cover use of a distributed batchimporter process.

I first saw this at DZone.
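For reference, the non-distributed version of that fast-load path looks roughly like the sketch below. Class and method names are from my memory of the Neo4j 1.8-era BatchInserter API, so check them against your release; the store path and properties are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchLoad {
    public static void main(String[] args) {
        // Writes the store files directly, bypassing transactions -- much faster than REST inserts.
        BatchInserter inserter = BatchInserters.inserter("target/graph.db");
        try {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", "alice");
            long alice = inserter.createNode(props);

            props.put("name", "bob");
            long bob = inserter.createNode(props);

            RelationshipType knows = DynamicRelationshipType.withName("KNOWS");
            inserter.createRelationship(alice, bob, knows, null);
        } finally {
            inserter.shutdown();   // flushes the store; the database is only usable after this
        }
    }
}
```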

Tera-scale Astronomical Data Analysis and Visualization

Filed under: Astroinformatics,BigData,Data Analysis,Visualization — Patrick Durusau @ 11:27 am

Tera-scale Astronomical Data Analysis and Visualization by A. H. Hassan, C. J. Fluke, D. G. Barnes, V. A. Kilborn.

Abstract:

We present a high-performance, graphics processing unit (GPU)-based framework for the efficient analysis and visualization of (nearly) terabyte (TB)-sized 3-dimensional images. Using a cluster of 96 GPUs, we demonstrate for a 0.5 TB image: (1) volume rendering using an arbitrary transfer function at 7–10 frames per second; (2) computation of basic global image statistics such as the mean intensity and standard deviation in 1.7 s; (3) evaluation of the image histogram in 4 s; and (4) evaluation of the global image median intensity in just 45 s. Our measured results correspond to a raw computational throughput approaching one teravoxel per second, and are 10–100 times faster than the best possible performance with traditional single-node, multi-core CPU implementations. A scalability analysis shows the framework will scale well to images sized 1 TB and beyond. Other parallel data analysis algorithms can be added to the framework with relative ease, and accordingly, we present our framework as a possible solution to the image analysis and visualization requirements of next-generation telescopes, including the forthcoming Square Kilometre Array pathfinder radiotelescopes.

Looks like the original “big data” folks (astronomy) are moving up to analysis of near terabyte size images.

A glimpse of data and techniques that are rapidly approaching.

I first saw this in a tweet by Stefano Bertolo.

Course on Information Theory, Pattern Recognition, and Neural Networks

Filed under: CS Lectures,Information Theory,Neural Networks,Pattern Recognition — Patrick Durusau @ 11:27 am

Course on Information Theory, Pattern Recognition, and Neural Networks by David MacKay.

From the description:

A series of sixteen lectures covering the core of the book “Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003)” which can be bought at Amazon, and is available free online. A subset of these lectures used to constitute a Part III Physics course at the University of Cambridge. The high-resolution videos and all other course material can be downloaded from the Cambridge course website.

Excellent lectures on information theory, the probability that a message sent is the one received.

Makes me wonder if there is a similar probability theory for the semantics of a message sent being the semantics of the message as received?
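For the classical question the numbers are easy to compute. A small sketch: a binary symmetric channel that flips each transmitted bit with probability p has capacity 1 - H(p) bits per use.

```java
/** Capacity of a binary symmetric channel: C = 1 - H(p), with H the binary entropy in bits. */
public class ChannelCapacity {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double binaryEntropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    public static void main(String[] args) {
        double p = 0.1;   // probability that a transmitted bit arrives flipped
        double capacity = 1.0 - binaryEntropy(p);
        System.out.printf("p = %.2f  capacity = %.3f bits per channel use%n", p, capacity);
        // p = 0.10 gives roughly 0.531 bits per use.
    }
}
```

A comparable measure for semantics would need an agreed “alphabet” of meanings on both ends, which is exactly what is usually missing.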

BLAST – Basic Local Alignment Search Tool

Filed under: Bioinformatics,BLAST,Genomics — Patrick Durusau @ 11:27 am

BLAST – Basic Local Alignment Search Tool (Wikipedia)

From Wikipedia:

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLASTs are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH and was published in the Journal of Molecular Biology in 1990.[1]

I found the uses of BLAST of particular interest:

Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying species
With the use of BLAST, you can possibly correctly identify a species and/or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.
Locating domains
When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny
Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for “first pass” phylogenetic analyses.
DNA mapping
When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s).
Comparison
When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.

Not just for the many uses of BLAST in genomics, but what of using similar techniques with other data sets?

Are they not composed of “sequences?”
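As a hedged sketch of that closing thought: the same seed-and-count flavor (shared k-grams, nothing like BLAST’s actual heuristics) applied to arbitrary token sequences, say field names from two schemas.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Crude sequence similarity: Jaccard overlap of the k-grams of two token sequences. */
public class KGramSimilarity {

    static Set<String> kGrams(List<String> tokens, int k) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            grams.add(tokens.subList(i, i + k).toString());   // "[a, b]" used as the gram key
        }
        return grams;
    }

    /** 1.0 = identical gram sets, 0.0 = nothing shared. */
    static double similarity(List<String> a, List<String> b, int k) {
        Set<String> ga = kGrams(a, k);
        Set<String> gb = kGrams(b, k);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> shared = new HashSet<String>(ga);
        shared.retainAll(gb);
        Set<String> union = new HashSet<String>(ga);
        union.addAll(gb);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> x = Arrays.asList("cust", "id", "zip", "order", "date");
        List<String> y = Arrays.asList("customer", "id", "zip", "order", "total");
        System.out.println(similarity(x, y, 2));   // 2 shared bigrams out of 6 distinct: ~0.33
    }
}
```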

d8taplex [UK bicycle theft = terrorism?]

Filed under: Dataset — Patrick Durusau @ 11:27 am

d8taplex

Bills itself as:

Explore over 50 thousand data sets containing over 1 million time series.

But searching at random (there was no description of which 50,000 datasets were in play):

Astronomy – 8 “hits” – All doctorates awarded by field of study.

Chemistry – 22 “hits” – Degrees, students, periodical prices.

Physics – 38 “hits” – Degrees, students, periodical prices, staff.

Automobile accidents – 492 “hits” – What you would expect about road conditions, condition of drivers, etc.

Terrorist attacks – 11 “hits” –

containing document: Crime in England and Wales 2009/10: Supplementary Tables: Nature of burglary, vehicle-related theft, bicycle theft, other household theft, personal and other theft, vandalism and violent crime | data.gov.uk
anchor text: Personal theft

I really don’t equate “bicycle theft” with an act of terrorism. Inconvenient yes, terrorism no.

Unless you are getting money from the U.S. Department of Homeland Security of course. They fund studies of how to hide power transmission stations that are too large and dependent on air cooling to be enclosed.

I guess putting blank spots on maps would only serve to highlight their presence. DHS could ban the manufacture of printed maps. Only allow electronic ones. Which can be distorted to show or conceal whatever the flavor of terrorism is for the week.

It would not take long for the only content of the map to be “You are here.” With no markers as to where “here” might be. But then you are there so look around.

Unicode 6.2.0 Available

Filed under: Unicode — Patrick Durusau @ 11:26 am

Unicode 6.2.0 Available

Summary:

Version 6.2 of the Unicode Standard is a special release dedicated to the early publication of the newly encoded Turkish lira sign. This version also rolls in various minor corrections for errata and other small updates for the Unicode Character Database. In addition, there are some significant changes to the Unicode algorithms for text segmentation and line breaking, including changes to the line break property to improve line breaking for emoji symbols.

Just in case you don’t follow Unicode releases closely.

The character set against which all others should be mapped.
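In practice, mapping into Unicode is mostly a matter of decoding legacy bytes with the right charset and then normalizing. A minimal JDK-only sketch (the charset and sample bytes are just for illustration):

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

public class ToUnicode {
    public static void main(String[] args) {
        // Bytes that arrived in a legacy single-byte encoding (ISO-8859-9, Turkish, as an example).
        byte[] legacy = { (byte) 0xDE, (byte) 'e', (byte) 'h', (byte) 'i', (byte) 'r' };   // "Şehir"

        // Decode with the source charset, never the platform default.
        String text = new String(legacy, Charset.forName("ISO-8859-9"));

        // Normalize so canonically equivalent strings compare equal.
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);

        System.out.println(normalized);                 // Şehir
        System.out.println((int) normalized.charAt(0)); // 350, i.e. U+015E
    }
}
```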

November 22, 2012

Teiid (8.2 Final Released!) [Component for TM System]

Filed under: Data Integration,Federation,Information Integration,JDBC,SQL,Teiid,XQuery — Patrick Durusau @ 11:16 am

Teiid

From the homepage:

Teiid is a data virtualization system that allows applications to use data from multiple, heterogeneous data stores.

Teiid is comprised of tools, components and services for creating and executing bi-directional data services. Through abstraction and federation, data is accessed and integrated in real-time across distributed data sources without copying or otherwise moving data from its system of record.

Teiid Parts

  • Query Engine: The heart of Teiid is a high-performance query engine that processes relational, XML, XQuery and procedural queries from federated datasources. Features include support for homogeneous schemas, heterogeneous schemas, transactions, and user defined functions.
  • Embedded: An easy-to-use JDBC Driver that can embed the Query Engine in any Java application. (as of 7.0 this is not supported, but on the roadmap for future releases)
  • Server: An enterprise-ready, scalable, manageable runtime for the Query Engine that runs inside JBoss AS and provides additional security, fault-tolerance, and administrative features.
  • Connectors: Teiid includes a rich set of Translators and Resource Adapters that enable access to a variety of sources, including most relational databases, web services, text files, and LDAP. Need data from a different source? Custom translators and resource adapters can easily be developed.
  • Tools:

Teiid 8.2 final was released on November 20, 2012.

Like most integration services, it is not strong on integration between integration services.

Would make one helluva component for a topic map system.

A system with an inter-integration solution mapping layer in addition to the capabilities of Teiid.
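From the client side, using Teiid as that kind of component is plain JDBC. A minimal sketch; the driver class and URL format are from my memory of the Teiid docs, so verify them against your release, and the VDB name, tables and credentials are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TeiidQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.teiid.jdbc.TeiidDriver");
        String url = "jdbc:teiid:MyVDB@mm://localhost:31000";   // MyVDB = a deployed virtual database

        Connection conn = DriverManager.getConnection(url, "user", "password");
        try {
            Statement stmt = conn.createStatement();
            // One SQL statement, even though the two tables may live in different physical sources.
            ResultSet rs = stmt.executeQuery(
                "SELECT c.name, o.total FROM customers c JOIN orders o ON c.id = o.customer_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
            }
        } finally {
            conn.close();
        }
    }
}
```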

SC12 Salt Lake City, Utah (Proceedings)

Filed under: Conferences,HPC,Supercomputing — Patrick Durusau @ 10:41 am

SC12 Salt Lake City, Utah

Proceeding from SC12 are online!

ACM Digital Library: SC12 Conference Proceedings

IEEE Xplore: SC12 Conference Proceedings

Everything from graphs to search and lots in between.

Enjoy!

eGIFT: Mining Gene Information from the Literature

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.

Website: http://biotm.cis.udel.edu/eGIFT

Another lesson for topic map authoring interfaces: offer domain-specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities, discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed. Where most researchers stop.)
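The scoring idea in the abstract, a term’s frequency in one gene’s literature compared to its frequency in gene literature at large, reduces to a simple ratio. A rough sketch of that idea, not eGIFT’s actual formula:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Ranks terms by (frequency in the gene's abstracts) / (frequency in a background corpus). */
public class TermRanker {

    static Map<String, Double> frequencies(List<String> docs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        int total = 0;
        for (String doc : docs) {
            for (String term : doc.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
                total++;
            }
        }
        Map<String, Double> freqs = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }

    /** Higher score = the term is unusually frequent in this gene's literature. */
    static Map<String, Double> score(List<String> geneDocs, List<String> backgroundDocs) {
        Map<String, Double> gene = frequencies(geneDocs);
        Map<String, Double> background = frequencies(backgroundDocs);
        Map<String, Double> scores = new HashMap<String, Double>();
        double floor = 1e-6;   // stands in for terms unseen in the background corpus
        for (Map.Entry<String, Double> e : gene.entrySet()) {
            Double bg = background.get(e.getKey());
            scores.put(e.getKey(), e.getValue() / (bg == null ? floor : bg));
        }
        return scores;
    }
}
```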

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Filed under: Bioinformatics,Biomedical,Curation,Genomics,Text Mining — Patrick Durusau @ 9:50 am

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.

Abstract:

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL: http://www.agbase.msstate.edu/

Another approach to biocuration. I will be posting on eGIFT separately, but do note this is a domain-specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard-coding interfaces to how users “should” use the tool.
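One way to read that idea in code (entirely hypothetical, no existing tool implied): the skeleton owns the generic workflow, and each domain plugs in its own vocabulary, search and disambiguation.

```java
import java.util.List;

/** What every authoring front end needs; the skeleton is written against this interface only. */
interface DomainPlugin {
    String domainName();

    /** Domain-specific search, e.g. PubMed with gene synonym expansion. */
    List<String> findCandidates(String query);

    /** Domain-specific disambiguation: do two candidate identifiers name the same subject? */
    boolean isSameSubject(String candidateA, String candidateB);
}

/** Generic authoring skeleton: workflow here, domain knowledge in the plugin. */
class AuthoringSkeleton {
    private final DomainPlugin plugin;

    AuthoringSkeleton(DomainPlugin plugin) { this.plugin = plugin; }

    void curate(String query) {
        List<String> candidates = plugin.findCandidates(query);
        System.out.println(candidates.size() + " candidates from " + plugin.domainName());
        // ... present candidates, record the curator's choices, emit topics and associations ...
    }
}
```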

Developing New Ways to Search for Web Images

Developing New Ways to Search for Web Images by Shar Steed.

From the post:

Collections of photos, images, and videos are quickly coming to dominate the content available on the Web. Currently internet search engines rely on the text with which the images are labeled to return matches. But why is only text being used to search visual mediums? These labels can be unreliable, unhelpful and sometimes not available at all.

To solve this problem, scientists at Stanford and Princeton have been working to “create a new generation of visual search technologies.” Dr. Fei-Fei Li, a computer scientist at Stanford, has built the world’s largest visual database, containing more than 14 million labeled objects.

A system called ImageNet applies the data gathered from the database to recognize similar, unlabeled objects with much greater accuracy than past algorithms.

A remarkable amount of material to work with, either via the API or downloading for your own hacking.

Another tool for assisting in the authoring of topic maps (or other content).

VAO Software Release: Data Discovery Tool (version 1.4)

Filed under: Astroinformatics,BigData — Patrick Durusau @ 6:07 am

VAO Software Release: Data Discovery Tool (version 1.4)

From the post:

The VAO has released a new version of the Data Discovery Tool (v1.4) on October 11, 2012. With this tool you can find datasets from thousands of astronomical collections known to the VO and over wide areas of the sky. This includes thousands of astronomical collections – photometric catalogs and images – and archives around the world.

New features of the Data Discovery Tool include:

  • New tooltips describing each data field in search results
  • Improved display and manipulation of numeric filters
  • Automatic color assignment for overlay graphics in the all-sky viewer

Try it now at http://www.usvao.org/tools.

From a community that used “big data” long before it became a buzz word for IT marketers.

W3C Community and Business Groups

Filed under: Semantics,W3C — Patrick Durusau @ 5:53 am

W3C Community and Business Groups

A listing of current Community and Business Groups at the W3C. W3C membership is not required to join but you do need a free W3C account.

Several are relevant to semantics and semantic integration and are avenues for meeting other people interested in those topics.

SDshare Community Group

Filed under: RDF,SDShare — Patrick Durusau @ 5:28 am

SDshare Community Group

From the webpage:

SDshare is a highly RESTful protocol for synchronization of RDF (and potentially other) data, by publishing feeds of data changes as Atom feeds.

A W3C community group on SDShare.

The current SDShare draft.

Its known issues.

Co-chaired by Lars Marius Garshol and Graham Moore.

Webnodes Semantic Integration Server (SDShare Protocol)

Filed under: Odata,ODBC,SDShare,SPARQL — Patrick Durusau @ 5:09 am

Webnodes AS announces Webnodes Semantic Integration Server by Mike Johnston.

From the post:

Webnodes AS, a company developing a .NET based semantic content management system, today announced the release of a new product called Webnodes Semantic Integration Server.

Webnodes Semantic Integration Server is a standalone product that has two main components: A SPARQL endpoint for traditional semantic use-cases and the full-blown integration server based on the SDShare protocol. SDShare is a new protocol for allowing different software to share and consume data with each other, with minimal amount of setup.

The integration server ships with connectors out of the box for OData- and SPARQL endpoints and any ODBC compatible RDBMS. This means you can integrate many of the software systems on the market with very little work. If you want to support software not compatible with any of the available connectors, you can create custom connectors. In addition to full-blown connectors, the integration server can also push the raw data to another SPARQL endpoint (the internal data format is RDF) or a HTTP endpoint (for example Apache SOLR).

I wonder about the line:

This means you can integrate many of the software systems on the market with very little work.

I think wiring disparate systems together is a better description. To “integrate” systems implies some useful result.

Wiring systems together is a long way from the hard task of semantic mapping, which produces integration of systems.

I first saw this in a tweet by Paul Hermans.

visualizing the dependencies in a Makefile

Filed under: Graphics,Visualization — Patrick Durusau @ 4:31 am

visualizing the dependencies in a Makefile by Pierre Lindenbaum.

From the post:

I’ve just coded a tool to visualize the dependencies in a Makefile. The java source code is available on github at : https://github.com/lindenb/jsandbox/blob/master/src/sandbox/MakeGraphDependencies.java.

Outputs to a graphviz-dot file.

Clever and raises interesting questions about visualization of other dependency situations.
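The same trick works for any dependency data you can reduce to pairs; emitting Graphviz DOT takes only a few lines (a generic sketch, not Pierre’s tool).

```java
import java.util.Arrays;
import java.util.List;

/** Emits a Graphviz DOT digraph from (target, prerequisite) pairs. */
public class DotFromDependencies {
    public static void main(String[] args) {
        List<String[]> deps = Arrays.asList(
            new String[] { "report.pdf", "report.tex" },
            new String[] { "report.tex", "results.csv" },
            new String[] { "results.csv", "analysis.R" });

        StringBuilder dot = new StringBuilder("digraph deps {\n");
        for (String[] d : deps) {
            dot.append("  \"").append(d[0]).append("\" -> \"").append(d[1]).append("\";\n");
        }
        dot.append("}\n");
        System.out.print(dot);   // pipe into: dot -Tpng -o deps.png
    }
}
```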

November 21, 2012

Archive of datasets bundled with R

Filed under: Data,Dataset,R — Patrick Durusau @ 12:19 pm

Archive of datasets bundled with R by Nathan Yau.

From the post:

R comes with a lot of datasets, some with the core distribution and others with packages, but you’d never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

Many of the datasets are dated, going back to the original distribution of R, but it’s a great resource for teaching or if you’re just looking for some data to play with.

A great find! Thanks Nathan and to Vincent for pulling it together!

Sindice SPARQL endpoint

Filed under: RDF,Sindice,SPARQL — Patrick Durusau @ 12:00 pm

Sindice SPARQL endpoint by Gabi Vulcu.

From an email by Gabi:

We have released a new version of the Sindice SPARQL endpoint (http://sparql.sindice.com/) with two new datasets: sudoc and yago

Below are the current dump datasets that are in the Sparql endpoint:

dataset_uri dataset_name
http://sindice.com/dataspace/default/dataset/dbpedia “dbpedia”
http://sindice.com/dataspace/default/dataset/medicare “medicare”
http://sindice.com/dataspace/default/dataset/whoiswho “whoiswho”
http://sindice.com/dataspace/default/dataset/sudoc “sudoc”
http://sindice.com/dataspace/default/dataset/nytimes “nytimes”
http://sindice.com/dataspace/default/dataset/ookaboo “ookaboo”
http://sindice.com/dataspace/default/dataset/europeana “europeana”
http://sindice.com/dataspace/default/dataset/basekb “basekb”
http://sindice.com/dataspace/default/dataset/geonames “geonames”
http://sindice.com/dataspace/default/dataset/wordnet “wordnet”
http://sindice.com/dataspace/default/dataset/dailymed “dailymed”
http://sindice.com/dataspace/default/dataset/reactome “reactome”
http://sindice.com/dataspace/default/dataset/yago “yago”

The list of crawled website datasets that have been rdf-ized and loaded into the Sparql endpoint can be found here [1]

Due to space limitation we limited both the amount of dump datasets to the ones in the above table and the websites datasets to the top 1000 domains based on the DING[3] score.

However, upon request, if someone needs a particular dataset( there are more to choose from here [4]), we can arrange to get it into the Sparql endpoint in the next release.

[1] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dERUZzBPNEZIbVJTTVVIRDVUWHhKdWc
[2] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E
[3] http://ding.sindice.com
[4] https://docs.google.com/spreadsheet/ccc?key=0AvdgRy2el8d9dGhDaHMta0MtaG9vWWhhbTd5SVVaX1E#gid=0

You may also be interested in: sindice-dev — Sindice developers list.
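Querying the endpoint programmatically needs nothing beyond the standard SPARQL protocol. A minimal JDK-only sketch; the exact endpoint path and supported result formats are assumptions, so adjust them to what the service actually exposes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlQuery {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://sparql.sindice.com/sparql";   // path guessed; see the post for the real one
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+json");

        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // raw SPARQL results, JSON bindings
        }
        in.close();
    }
}
```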

ALGEBRA, Chapter 0

Filed under: Algebra,Category Theory,Mathematics — Patrick Durusau @ 11:27 am

ALGEBRA, Chapter 0 by Paolo Aluffi. (PDF)

From the introduction:

This text presents an introduction to algebra suitable for upper-level undergraduate or beginning graduate courses. While there is a very extensive offering of textbooks at this level, in my experience teaching this material I have invariably felt the need for a self-contained text that would start ‘from zero’ (in the sense of not assuming that the reader has had substantial previous exposure to the subject), but impart from the very beginning a rather modern, categorically-minded viewpoint, and aim at reaching a good level of depth. Many textbooks in algebra satisfy brilliantly some, but not all of these requirements. This book is my attempt at providing a working alternative.

There is a widespread perception that categories should be avoided at first blush, that the abstract language of categories should not be introduced until a student has toiled for a few semesters through example-driven illustrations of the nature of a subject like algebra. According to this viewpoint, categories are only tangentially relevant to the main topics covered in a beginning course, so they can simply be mentioned occasionally for the general edification of a reader, who will in time learn about them (by osmosis?). Paraphrasing a reviewer of a draft of the present text, ‘Discussions of categories at this level are the reason why God created appendices’.

It will be clear from a cursory glance at the table of contents that I think otherwise. In this text, categories are introduced around p. 20, after a scant reminder of the basic language of naive set theory, for the main purpose of providing a context for universal properties. These are in turn evoked constantly as basic definitions are introduced. The word ‘universal’ appears at least 100 times in the first three chapters.

If you are interested in a category theory based introduction to algebra, this may be the text for you. Suitable (according to the author) for use in a classroom or for self-study.

The ability to reason carefully, about what we imagine is known, should not be underestimated.

I first saw this in a tweet from Algebra Fact.

Prioritizing PubMed articles…

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.

Abstract:

The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity should simplify the creation of mappings between entities.

Yes?

Lucene with Zing, Part 2

Filed under: Indexing,Java,Lucene,Zing JVM — Patrick Durusau @ 9:22 am

Lucene with Zing, Part 2 by Mike McCandless.

From the post:

When I last tested Lucene with the Zing JVM the results were impressive: Zing’s fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene’s 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory’s max throughput was now better than RAMDirectory’s (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory’s performance, and lower the pressure on concurrent mark/sweep’s GC to the point where pause times were so much lower in GA compared to alpha?

Mike updates his prior experience with Lucene and Zing.

Covers the use of gcLogAnalyser and Fragger to understand “why” his performance test results changed from the alpha to GA releases.

Insights into both Lucene and Zing.

Have you considered loading your topic map into RAM?
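Loading an existing on-disk index into a RAMDirectory takes only a few lines in the Lucene 4.x API the post is testing. Class names are from my memory of that API and the index path and field name are placeholders, so treat this as a sketch.

```java
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch {
    public static void main(String[] args) throws Exception {
        // Copy the on-disk index wholesale into heap memory (mind your -Xmx and your GC).
        RAMDirectory ram = new RAMDirectory(
            FSDirectory.open(new File("/path/to/index")), IOContext.READ);

        DirectoryReader reader = DirectoryReader.open(ram);
        IndexSearcher searcher = new IndexSearcher(reader);

        TopDocs hits = searcher.search(new TermQuery(new Term("body", "topic")), 10);
        System.out.println(hits.totalHits + " hits");

        reader.close();
        ram.close();
    }
}
```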
