Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 22, 2013

Class-imbalanced classifiers for high-dimensional data

Filed under: BigData,Classifier,High Dimensionality — Patrick Durusau @ 2:41 pm

Class-imbalanced classifiers for high-dimensional data by Wei-Jiun Lin and James J. Chen. (Brief Bioinform (2013) 14 (1): 13-26. doi: 10.1093/bib/bbs006)

Abstract:

A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered. The four classifiers include three standard classification algorithms each coupled with an ensemble correction strategy and one support vector machines (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.
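For readers who want a concrete feel for the threshold-adjustment idea (SVM-THR) described above, here is a minimal sketch in Python using scikit-learn. The library, the synthetic data and the tuning criterion are my choices for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of SVM threshold adjustment for class imbalance.
# Assumptions: scikit-learn is available; the data, kernel and the
# balanced-accuracy criterion are illustrative choices, not Lin and Chen's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic high-dimensional data with a 5% minority class.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(kernel="linear").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# Standard rule: predict the minority class when the decision score exceeds 0.
default_pred = (scores > 0).astype(int)

# Threshold adjustment: shift the cutoff to favor the minority class,
# here by maximizing balanced accuracy over candidate thresholds.
candidates = np.linspace(scores.min(), scores.max(), 200)
best_t = max(candidates,
             key=lambda t: balanced_accuracy_score(y_te, (scores > t).astype(int)))
adjusted_pred = (scores > best_t).astype(int)

print("default balanced accuracy :", balanced_accuracy_score(y_te, default_pred))
print("adjusted balanced accuracy:", balanced_accuracy_score(y_te, adjusted_pred))
```

In practice the threshold would be tuned on a validation split rather than the test data; the sketch collapses that step for brevity.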

At least the “big data” folks are right on one score: We are going to need help sorting out all the present and future information.

Not that we will ever attempt to sort it all out. As reported in The Untapped Big Data Gap (2012) [Merry Christmas Topic Maps!], only 23% of “big data” will be valuable even if we do analyze it.

And your enterprise’s part of that 23% is even smaller.

Enough that your users will need help dealing with it, but not nearly the deluge that is being predicted.

Content-Based Image Retrieval at the End of the Early Years

Content-Based Image Retrieval at the End of the Early Years by Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. (Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R., “Content-based image retrieval at the end of the early years,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec. 2000. doi: 10.1109/34.895972)

Abstract:

Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.
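To make the feature side of this concrete, a global color histogram compared by histogram intersection is about the simplest retrieval baseline the survey covers. The sketch below uses NumPy and Pillow; the library choice, bin count and file names are mine, purely for illustration.

```python
# Toy content-based retrieval baseline: global color histograms compared by
# histogram intersection. Libraries and parameters are illustrative choices.
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Joint RGB histogram of an image, normalized to sum to 1."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

# Rank a (hypothetical) collection against a query image.
# query = color_histogram("query.jpg")
# ranked = sorted(collection_paths,
#                 key=lambda p: histogram_intersection(query, color_histogram(p)),
#                 reverse=True)
```

The “semantic gap” the paper discusses is exactly the distance between a feature like this and what a user means when searching for, say, “sunset over a harbor.”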

Excellent survey article from 2000 (not 2002 as per the Ostermann paper).

I think you will appreciate the treatment of the “semantic gap,” both in terms of its description as well as ways to address it.

If you are using annotated images in your topic map application, definitely a must read.

User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)

Abstract:

This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.

An echo of Steve Newcomb’s semantic impedance appears at:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the authors’ claim that spatial locations provide an “external context” that bridges the “semantic gap”?

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

Crash Course on Web Performance

Filed under: Web Server — Patrick Durusau @ 2:40 pm

Faster Websites: Crash Course on Web Performance by Ilya Grigorik.

From the post:

Delivering a fast and optimized user experience in the browser requires careful thinking across many layers of the stack – TCP and up. In a rather ambitious undertaking, when I got the chance to run a three hour (marathon) workshop at Devoxx 2012, I tried to do exactly that: a crash course on web performance. Even with that much time, much was left unsaid, but I’m happy with how it went – it turned out to be one of the most popular workshops.

The best part is, the video is now available online for free! The Devoxx team did an amazing job of post-processing the recording, with inline slides, full agenda navigation, and more. Check it out below. Hope you like it, and let me know if you have any feedback, comments or questions.

You may have the best content on the Web, but if users tire of waiting for it, it may as well not exist.

Just a reminder that delivery is an important part of the value of content.

On ranking relevant entities in heterogeneous networks…

Filed under: Entities,Heterogeneous Data,Ranking — Patrick Durusau @ 2:40 pm

On ranking relevant entities in heterogeneous networks using a language-based model by Laure Soulier, Lamjed Ben Jabeur, Lynda Tamine, Wahiba Bahsoun. (Soulier, L., Jabeur, L. B., Tamine, L. and Bahsoun, W. (2013), On ranking relevant entities in heterogeneous networks using a language-based model. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22762)

Abstract:

A new challenge, accessing multiple relevant entities, arises from the availability of linked heterogeneous data. In this article, we address more specifically the problem of accessing relevant entities, such as publications and authors within a bibliographic network, given an information need. We propose a novel algorithm, called BibRank, that estimates a joint relevance of documents and authors within a bibliographic network. This model ranks each type of entity using a score propagation algorithm with respect to the query topic and the structure of the underlying bi-type information entity network. Evidence sources, namely content-based and network-based scores, are both used to estimate the topical similarity between connected entities. For this purpose, authorship relationships are analyzed through a language model-based score on the one hand and on the other hand, non topically related entities of the same type are detected through marginal citations. The article reports the results of experiments using the Bibrank algorithm for an information retrieval task. The CiteSeerX bibliographic data set forms the basis for the topical query automatic generation and evaluation. We show that a statistically significant improvement over closely related ranking models is achieved.

Note the “estimat[ion] of topical similarity between connected entities.”

Very good work but rather than a declaration of similarity (topic maps) we have an estimate of similarity.
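To make “estimate of similarity” concrete, here is a rough sketch of score propagation over a small bi-type (document-author) network. It captures only the flavor of the abstract; the actual BibRank combination of language-model and citation evidence is considerably more involved.

```python
# Rough sketch of score propagation in a bipartite document-author network.
# The matrix, initial scores and mixing weight are invented for illustration.
import numpy as np

# A[d, a] = 1 if author a wrote document d (3 documents, 3 authors).
A = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

# Initial document relevance, e.g. content-based scores for some query.
initial = np.array([0.9, 0.4, 0.1])
doc_score = initial.copy()
alpha = 0.5  # weight given to propagated (network) evidence

for _ in range(20):  # iterate until the scores stabilize
    # Authors inherit the average relevance of the documents they wrote...
    author_score = (A.T @ doc_score) / A.sum(axis=0)
    # ...and documents mix their own score with the average score of their authors.
    doc_score = (1 - alpha) * initial + alpha * (A @ author_score) / A.sum(axis=1)

print("documents:", np.round(doc_score, 3))
print("authors:  ", np.round(author_score, 3))
```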

Before you protest about the volume of literature/data, recall that some author wrote each of the documents in question and selected the terms and references found therein.

Rather than guessing what may be similar to what the author wrote, why not devise a method to allow the author to say?

And build upon similarity/sameness declarations across heterogeneous networks of data.

January 21, 2013

No Joy in Vindication

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 7:31 pm

You may have seen the news about the latest GAO report on auditing the U.S. government: U.S. Government’s Fiscal Years 2012 and 2011 Consolidated Financial Statements, GAO-13-271R, Jan 17, 2013, http://www.gao.gov/products/GAO-13-271R.

The reasons why the GAO can’t audit the U.S. government:

(1) serious financial management problems at DOD that have prevented its financial statements from being auditable,

(2) the federal government’s inability to adequately account for and reconcile intragovernmental activity and balances between federal agencies, and

(3) the federal government’s ineffective process for preparing the consolidated financial statements.

Number 2 reminds me of The 560+ $Billion Shell Game, where I provided data files based on the OMB Sequestration report, detailing that over $560 billion in agency transfers could not be tracked.

That problem has now been confirmed by the GAO.

I am sure my analysis was not original and has been known to insiders at the GAO and others for years.

But did you know that I mailed that analysis to both of my U.S. Senators and got no response?

I did get a “bug letter” from my representative, Austin Scott:

Washington continues to spend at unsustainable levels. That is why I voted against H.R. 8, the American Taxpayer Relief Act when it passed Congress on January 1, 2013. This plan does not address the real driver of our debt – spending. President Obama’s unwillingness to address this continues to cripple our efforts to find a long-term solution. We cannot tax our way out of this fiscal situation.

The President himself has said on multiple occasions that spending cuts must be part of the solution. In fact, on April 13, 2011 he remarked, “So any serious plan to tackle our deficit will require us to put everything on the table, and take on excess spending wherever it exists in the budget.” However, his words have seldom matched his actions.

We owe it to our children and grandchildren to make the tough choices and devise a long-term solution that gets our economy back on track and reduces our deficits. I remain hopeful that the President will join us in this effort. Thank you for contacting me. It’s an honor to represent the Eighth Congressional District of Georgia.

Non-responsive would be a polite word for it.

My original point has been vindicated by the GAO but that brings no joy.

My request to the officials I have contacted was simple:

All released government financial data must be available in standard spreadsheet formats (Excel, CSV, ODF).

There are a whole host of other issues that will arise from such data but the first step is to get it in a crunchable format.

Redis on Windows Azure

Filed under: Azure Marketplace,Redis — Patrick Durusau @ 7:31 pm

One step closer to full support for Redis on Windows, MS Open Tech releases 64-bit and Azure installer by Claudio Caldato.

From the post:

I’m happy to report new updates today for Redis on Windows Azure: the open-source, networked, in-memory, key-value data store. We’ve released a new 64-bit version that gives developers access to the full benefits of an extended address space. This was an important step in our journey toward full Windows support. You can download it from the Microsoft Open Technologies github repository.

Last April we announced the release of an important update for Redis on Windows: the ability to mimic the Linux Copy On Write feature, which enables your code to serve requests while simultaneously saving data on disk.

Along with 64-bit support, we are also releasing a Windows Azure installer that enables deployment of Redis on Windows Azure as a PaaS solution using a single command line tool. Instructions on using the tool are available on this page and you can find a step-by-step tutorial here. This is another important milestone in making Redis work great on the Windows and Windows Azure platforms.

We are happy to communicate that we are using now the Microsoft Open Technologies public github repository as our main go-to SCM so the community will be able to follow what is happening more closely and get involved in our project.

Is it just me or does it seem like technology is getting easier to deploy?

Perhaps my view is jaded by doing Linux installs with raw write 1.44 MB floppies and editing boot sectors at the command line. 😉

If you like Redis or Azure, either way this is welcome news!

Drupal + Azure = OData Repository

Filed under: Azure Marketplace,Drupal,Odata,Topic Maps — Patrick Durusau @ 7:31 pm

Using Drupal on Windows Azure to create an OData repository by Brian Benz.

From the post:

OData is an easy to use protocol that provides access to any data defined as an OData service provider. Microsoft Open Technologies, Inc., is collaborating with several other organizations and individuals in development of the OData standard in the OASIS OData Technical Committee, and the growing OData ecosystem is enabling a variety of new scenarios to deliver open data for the open web via standardized URI query syntax and semantics. To learn more about OData, including the ecosystem, developer tools, and how you can get involved, see this blog post.

In this post I’ll take you through the steps to set up Drupal on Windows Azure as an OData provider. As you’ll see, this is a great way to get started using both Drupal and OData, as there is no coding required to set this up.

It also won’t cost you any money – currently you can sign up for a 90 day free trial of Windows Azure and install a free Web development tool (Web Matrix) and a free source control tool (Git) on your local machine to make this happen, but that’s all that’s required from a client point of view. We’ll also be using a free tier for the Drupal instance, so you may not need to pay even after the 90 day trial, depending on your needs for bandwidth or storage.

So let’s get started!

Definitely worthwhile to spend some time getting to know the OData specification. It is currently under active development at OASIS.

Doesn’t do everything you might want but tries to do the things everyone needs as a basis for other services.
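If you have not looked at the standardized URI query syntax the specification is built around, a single request shows the flavor. The sketch below hits the public Northwind sample service with Python’s requests library; the endpoint, entity set and field names are illustrative and may change over time.

```python
# Illustrative OData query against the public Northwind sample service.
# Endpoint and field names are examples only; check the service metadata
# ($metadata) for what is actually exposed.
import requests

url = "https://services.odata.org/V4/Northwind/Northwind.svc/Products"
params = {
    "$filter": "UnitPrice lt 10",          # server-side filtering
    "$select": "ProductName,UnitPrice",    # project only the fields we need
    "$orderby": "UnitPrice",
    "$top": "5",
}
resp = requests.get(url, params=params, headers={"Accept": "application/json"})
resp.raise_for_status()

for item in resp.json()["value"]:          # OData v4 wraps results in "value"
    print(item["ProductName"], item["UnitPrice"])
```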

Thoughts on how to represent “merged” entities in OData subject to the conditions:

  1. Entities and their unique identifiers are not re-written, and
  2. Solution is consistent with the base OData data model?

Thinking back to the original text of ISO/IEC 13250, which required presentation of topics as merged, whether or not bits were moved about to create a “merged” representation.

(Disclosure: I am a member of the OData TC.)

Mapping Mashups using Google Maps, Facebook and Twitter

Filed under: Marketing,Mashups,Topic Maps — Patrick Durusau @ 7:30 pm

Mapping Mashups using Google Maps, Facebook and Twitter by Wendell Santos.

From the post:

Over one-third of our mashup directory is made up of mapping mashups and their popularity shows no signs of slowing down. We have taken two looks at mapping mashups in the past. With it being a year since our last review, now is a good time to look at the newest mashups taking advantage of mapping APIs. Read below for more information on each.

Covers four (4) mashup APIs:

Should we be marketing topic maps as “re-usable” mashups?

Or as possessing the ability to “recycle” mashups?

O’Reilly’s Open Government book [“…more equal than others” pigs]

Filed under: Government,Government Data,Open Data,Open Government,Transparency — Patrick Durusau @ 7:30 pm

We’re releasing the files for O’Reilly’s Open Government book by Laurel Ruma.

From the post:

I’ve read many eloquent eulogies from people who knew Aaron Swartz better than I did, but he was also a Foo and contributor to Open Government. So, we’re doing our part at O’Reilly Media to honor Aaron by posting the Open Government book files for free for anyone to download, read and share.

The files are posted on the O’Reilly Media GitHub account as PDF, Mobi, and EPUB files for now. There is a movement on the Internet (#PDFtribute) to memorialize Aaron by posting research and other material for the world to access, and we’re glad to be able to do this.

You can find the book here: github.com/oreillymedia/open_government

Daniel Lathrop, my co-editor on Open Government, says “I think this is an important way to remember Aaron and everything he has done for the world.” We at O’Reilly echo Daniel’s sentiment.

Be sure to read Chapter 25, “When Is Transparency Useful?”, by the late Aaron Swartz.

It includes this passage:

…When you create a regulatory agency, you put together a group of people whose job is to solve some problem. They’re given the power to investigate who’s breaking the law and the authority to punish them. Transparency, on the other hand, simply shifts the work from the government to the average citizen, who has neither the time nor the ability to investigate these questions in any detail, let alone do anything about it. It’s a farce: a way for Congress to look like it has done something on some pressing issue without actually endangering its corporate sponsors.

As a tribute to Aaron, are you going to dump data on the WWW or enable calling the “more equal than others” pigs to account?

The #NIPS2012 Videos are out

Filed under: Decision Making,Inference,Machine Learning — Patrick Durusau @ 7:29 pm

The #NIPS2012 Videos are out by Igor Carron.

From the post:

Videolectures came through earlier than last year. woohoo! Presentations relevant to Nuit Blanche were featured earlier here. Videos for the presentations for the Posner Lectures, Invited Talks and Oral Sessions of the conference are here. Videos for the presentations for the different Workshops are here. Some videos are not available because the presenters have not given their permission to the good folks at Videolectures. If you know any of them, let them know the world is waiting.

Just in case Netflix is down. 😉

Concept Maps – Pharmaceuticals

Filed under: Bioinformatics,Biomedical,Concept Maps — Patrick Durusau @ 7:29 pm

Designing concept maps for a precise and objective description of pharmaceutical innovations by Maia Iordatii, Alain Venot and Catherine Duclos. (BMC Medical Informatics and Decision Making 2013, 13:10 doi:10.1186/1472-6947-13-10)

Abstract:

Background

When a new drug is launched onto the market, information about the new manufactured product is contained in its monograph and evaluation report published by national drug agencies. Health professionals need to be able to determine rapidly and easily whether the new manufactured product is potentially useful for their practice. There is therefore a need to identify the best way to group together and visualize the main items of information describing the nature and potential impact of the new drug. The objective of this study was to identify these items of information and to bring them together in a model that could serve as the standard for presenting the main features of new manufactured product.

Methods

We developed a preliminary conceptual model of pharmaceutical innovations, based on the knowledge of the authors. We then refined this model, using a random sample of 40 new manufactured drugs recently approved by the national drug regulatory authorities in France and covering a broad spectrum of innovations and therapeutic areas. Finally, we used another sample of 20 new manufactured drugs to determine whether the model was sufficiently comprehensive.

Results

The results of our modeling led to three sub models described as conceptual maps representing: i) the medical context for use of the new drug (indications, type of effect, therapeutical arsenal for the same indications), ii) the nature of the novelty of the new drug (new molecule, new mechanism of action, new combination, new dosage, etc.), and iii) the impact of the drug in terms of efficacy, safety and ease of use, compared with other drugs with the same indications.

Conclusions

Our model can help to standardize information about new drugs released onto the market. It is potentially useful to the pharmaceutical industry, medical journals, editors of drug databases and medical software, and national or international drug regulation agencies, as a means of describing the main properties of new pharmaceutical products. It could also be used as a guide for the writing of comprehensive and objective texts summarizing the nature and interest of new manufactured product. (emphasis added)

We all design categories starting with what we know, as pointed out under methods above.

And any three authors could undertake such a quest, with equally valid results but different terminology and perhaps even a different arrangement of concepts.

The problem isn’t the undertaking, which is useful.

The problem is a lack of a binding between such undertakings, which enables users to migrate between such maps, as they develop over time.

A problem that topic maps offer an infrastructure to solve.

Win ‘Designing the Search Experience:…’

Filed under: Interface Research/Design,Searching,Usability,Users — Patrick Durusau @ 7:29 pm

I mentioned the return of 1950’s/60’s marketing techniques just a day or so ago and then I find:

Win This Book! Designing the Search Experience: The information architecture of discovery by Tony Russell-Rose and Tyler Tate.

Three ways to enter, err, see the post for those.

New at Freebase

Filed under: Astroinformatics,Freebase — Patrick Durusau @ 7:29 pm

I saw a note at SemanticWeb.com about Freebase offering a new interface. Went to see.

Looked under astronomy, which had far fewer sub-topics than I would have imagined and visited the entry for “star.”

“Star” reports:

A star is really meant to be a single stellar object, not just something that looks like a star from earth. However, in many cases, other objects, such as multi-star systems, were originally thought to be stars. Because people have historically believed these to be stars, they are typed as such, but they are also typed as what we now know them to be.

I understand the need to preserve prior “types” but that is a question of scope, not simply adding more types.

Moreover, if “star” means a “single stellar object,” then where do I put different classes of stars? Do they have occurrences too? Does that mean their occurrences get listed under “star” as well?

RDF 1.1 Concepts and Abstract Syntax [New Draft]

Filed under: RDF,Semantic Web — Patrick Durusau @ 7:23 pm

RDF 1.1 Concepts and Abstract Syntax

From the introduction:

The Resource Description Framework (RDF) is a framework for representing information in the Web.

RDF 1.1 Concepts and Abstract Syntax defines an abstract syntax (a data model) which serves to link all RDF-based languages and specifications. The abstract syntax has two key data structures: RDF graphs are sets of subject-predicate-object triples, where the elements may be IRIs, blank nodes, or datatyped literals. They are used to express descriptions of resources. RDF datasets are used to organize collections of RDF graphs, and comprise a default graph and zero or more named graphs. This document also introduces key concepts and terminology, and discusses datatyping and the handling of fragment identifiers in IRIs within RDF graphs.
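A small example may help readers new to the abstract syntax. The sketch below uses Python’s rdflib (my choice, not something the draft mandates) to build a dataset holding a default graph plus one named graph, each a set of subject-predicate-object triples.

```python
# Minimal illustration of the RDF 1.1 data model with rdflib: triples in a
# default graph plus a named graph inside a dataset. IRIs are invented examples.
from rdflib import Dataset, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
ds = Dataset()

# Default graph: triples whose objects are literals (one datatyped).
ds.add((EX.draft, EX.title, Literal("RDF 1.1 Concepts and Abstract Syntax")))
ds.add((EX.draft, EX.year, Literal(2013, datatype=XSD.integer)))

# A named graph, identified by its own IRI.
g = ds.graph(EX.provenance)
g.add((EX.draft, EX.publishedBy, EX.w3c))

# Iterate over all quads: (subject, predicate, object, graph name).
for s, p, o, name in ds.quads((None, None, None, None)):
    print(s, p, o, name)
```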

Numerous issues await your comments and suggestions.

January 20, 2013

Operation Asymptote – [PlainSite / Aaron Swartz]

Filed under: Government,Government Data,Law,Law - Sources,Legal Informatics,Uncategorized — Patrick Durusau @ 8:06 pm

Operation Asymptote

Operation Asymptote’s goal is to make U.S. federal court data freely available to everyone.

The data is available now, but free only up to $15 worth every quarter.

Serious legal research hits that limit pretty quickly.

The project does not cost you any money, only some of your time.

The result will be another source of data to hold the system accountable.

So, how real is your commitment to doing something effective in memory of Aaron Swartz?

…topological data analysis

Filed under: Algorithms,Topological Data Analysis,Topology — Patrick Durusau @ 8:05 pm

New big data firm to pioneer topological data analysis by John Burn-Murdoch.

From the post:

A US big data firm is set to establish algebraic topology as the gold standard of data science with the launch of the world’s leading topological data analysis (TDA) platform.

Ayasdi, whose co-founders include renowned mathematics professor Gunnar Carlsson, launched today in Palo Alto, California, having secured $10.25m from investors including Khosla Ventures in the first round of funding.

The funds will be used to build on its Insight Discovery platform, the culmination of 12 years of research and development into mathematics, computer science and data visualisation at Stanford.

Ayasdi’s work prior to launching as a company has already yielded breakthroughs in the pharmaceuticals industry. In one case it revealed new insights in eight hours – compared to the previous norm of over 100 hours – cutting the turnaround from analysis to clinical trials in the process.

The project? CompTop, which I covered here.

Does topological data analysis sound more interesting now than before?

silenc: Removing the silent letters from a body of text

Filed under: Graphics,Text Analytics,Text Mining,Texts,Visualization — Patrick Durusau @ 8:05 pm

silenc: Removing the silent letters from a body of text by Nathan Yau.

From the post:

During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.

Nathan suggests it isn’t fancy on the analysis side but the views are interesting.

True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.
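For the record, the mechanical part really is trivial once the mapping exists. A toy sketch, with a hand-built (and hypothetical) silent-letter map standing in for the project’s phonetic analysis:

```python
# Toy sketch: strip silent letters from words, given a pre-built mapping from
# each word to the positions of its silent letters. The mapping here is
# hand-made and tiny; building it is the interesting part of the project.
SILENT = {
    "knight": [0, 3, 4],  # k, g, h
    "island": [1],        # s
    "debt":   [2],        # b
}

def strip_silent(word):
    drop = set(SILENT.get(word.lower(), []))
    return "".join(ch for i, ch in enumerate(word) if i not in drop)

print(" ".join(strip_silent(w) for w in "the knight paid his debt on the island".split()))
# -> "the nit paid his det on the iland"
```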

Usage patterns of words with silent letters would be an interesting question.

Or extending the technique to remove all adjectives from a text (that would shorten ad copy).

“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.

But it is the job of analysis to sort them out.

Promoting Topic Maps With Disasters?

Filed under: Geographic Data,Marketing,Topic Maps — Patrick Durusau @ 8:04 pm

When I saw the post headline:

Unrestricted access to the details of deadly eruptions

I immediately thought about the recent (ongoing?) rash of disaster movies. What they lack in variety they make up for in special effects.

The only unrealistic part is that governments largely respond effectively, or at least attempt to, rather than making the rounds of the Sunday morning interview programs. Well, it is fiction after all.

But the data set sounds like one that could be used to market topic maps as a “disaster” app.

Imagine a location-based app that shows your proximity to the “kill” zone of a historic volcano.

Along with mapping to other vital data, such as the nearest movie star. 😉

Something to think about.

Volcanic eruptions have the potential to cause loss of life, disrupt air traffic, impact climate, and significantly alter the surrounding landscape. Knowledge of the past behaviours of volcanoes is key to producing risk assessments of the hazards of modern explosive events.

The open access database of Large Magnitude Explosive Eruptions (LaMEVE) will provide this crucial information to researchers, civil authorities and the general public alike.

Compiled by an international team headed by Dr Sian Crosweller from Bristol’s School of Earth Sciences with support from the British Geological Survey, the LaMEVE database provides – for the first time – rapid, searchable access to the breadth of information available for large volcanic events of magnitude 4 or greater, with a quantitative data quality score.

Dr Crosweller said: “Magnitude 4 or greater eruptions – such as Vesuvius in 79AD, Krakatoa in 1883 and Mount St Helens in 1980 – are typically responsible for the most loss of life in the historical period. The database’s restriction to eruptions of this size puts the emphasis on events whose low frequency and large hazard footprint mean preparation and response are often poor.”

Currently, data fields include: magnitude, Volcanic Explosivity Index (VEI), deposit volumes, eruption dates, and rock type; such parameters constituting the mainstay for description of eruptive activity.

Planned expansion of LaMEVE will include the principal volcanic hazards (such as pyroclastic flows, tephra fall, lahars, debris avalanches, ballistics), and vulnerability (for example, population figures, building type) – details of value to those involved in research and decisions relating to risk.

LaMEVE is the first component of the Volcanic Global Risk Identification and Analysis Project (VOGRIPA) database for volcanic hazards developed as part of the Global Volcano Model (GVM).

Principal Investigator and co-author, Professor Stephen Sparks of Bristol’s School of Earth Sciences said: “The long-term goal of this project is to have a global source of freely available information on volcanic hazards that can be used to develop protocols in the event of volcanic eruptions.

“Importantly, the scientific community are invited to actively participate with the database by sending new data and modifications to the database manager and, after being given clearance as a GVM user, entering data thereby maintaining the resource’s dynamism and relevance.”

LaMEVE is freely available online at http://www.bgs.ac.uk/vogripa.

Semantic Web meets Integrative Biology: a survey

Filed under: Bioinformatics,Semantic Web — Patrick Durusau @ 8:04 pm

Semantic Web meets Integrative Biology: a survey by Huajun Chen, Tong Yu and Jake Y. Chen.

Abstract:

Integrative Biology (IB) uses experimental or computational quantitative technologies to characterize biological systems at the molecular, cellular, tissue and population levels. IB typically involves the integration of the data, knowledge and capabilities across disciplinary boundaries in order to solve complex problems. We identify a series of bioinformatics problems posed by interdisciplinary integration: (i) data integration that interconnects structured data across related biomedical domains; (ii) ontology integration that brings jargons, terminologies and taxonomies from various disciplines into a unified network of ontologies; (iii) knowledge integration that integrates disparate knowledge elements from multiple sources; (iv) service integration that builds applications out of services provided by different vendors. We argue that IB can benefit significantly from the integration solutions enabled by Semantic Web (SW) technologies. The SW enables scientists to share content beyond the boundaries of applications and websites, resulting in a web of data that is meaningful and understandable to any computers. In this review, we provide insight into how SW technologies can be used to build open, standardized and interoperable solutions for interdisciplinary integration on a global basis. We present a rich set of case studies in system biology, integrative neuroscience, bio-pharmaceutics and translational medicine, to highlight the technical features and benefits of SW applications in IB.

A very good summary of the issues of data integration in bioinformatics.

I disagree with the prescription, as you might imagine, but it is a good starting place for discussion of the issues of data integration.

OpenAIRE Study

Filed under: EU,Open Data — Patrick Durusau @ 8:04 pm

Implementing Open Access Mandates in Europe: OpenAIRE Study by Thembani Malapela.

From the webpage:

Implementing Open Access Mandates in Europe: OpenAIRE Study on the Development of Open Access Repository Communities in Europe is the title of a recent book authored by Birgit Schmidt and Iryna Kuchma. The book highlights the existing open access policies in Europe and provides an overview of publishers’ self-archiving policies and it further gives strategies for policy implementation. Such strategies include both institutional and national – which have been used in implementing open access policy mandates. This work provides a unique overview of national awareness of open access in 32 European countries involving all EU member states and in addition, Norway, Iceland, Croatia, Switzerland and Turkey.

What makes this book an interesting read is that it taps into activities implemented through OpenAIRE project and related repository projects by other stakeholders in Europe. Despite its extensive coverage on the implementation of Open Access Mandates in the region, the authors acknowledge, “the main issues that still need to be resolved in the coming years include the effective promotion of open access among research communities and support in copyright management for researchers and research institutions as well as intermediaries such as libraries and repositories”.

The more data that becomes “open,” the greater the semantic diversity you will find.

Important to follow the discussion as you prepare to map more and more information into your topic map.

…NCSU Library URLs in the Common Crawl Index

Filed under: Common Crawl,Library — Patrick Durusau @ 8:03 pm

Analysis of the NCSU Library URLs in the Common Crawl Index by Lisa Green.

From the post:

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl Index to look at NCSU Library URLs in the Common Crawl Index. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

A great starting point for using the Common Crawl Index!

Interactive Text Mining

Filed under: Annotation,Bioinformatics,Curation,Text Mining — Patrick Durusau @ 8:03 pm

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi et al. (Database (2013) 2013: bas056. doi: 10.1093/database/bas056)

Abstract:

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

Hadoop Bingo [The More Things Change…, First Game 22nd Jan. 2013]

Filed under: BigData,Hadoop,Hortonworks — Patrick Durusau @ 8:03 pm

Don’t be Tardy for This Hadoop BINGO Party! by Kim Truong.

It had to happen. Virtual “door prizes” for attending webinars, now bingo games.

I’m considering a contest to guess which 1950’s/60’s marketing tool will appear next. 😉

From the post:

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends, co-workers together and be prepared to geek out to Hadoop!

This 4-part series will have a mixture of amazing guest speakers covering topics such as Hortonworks 2013 vision and roadmaps for Apache Hadoop and Big Data, What’s new with Hortonworks Data Platform v1.2, How Luminar (an Entravision company) adopted Apache Hadoop, and use case on Hadoop, R and GoogleVis. This series will provide organizations an opportunity to gain a better understanding of Apache Hadoop and Big Data landscape and practical guidance on how to leverage Hadoop as part of your Big Data strategy.

How is that a party?

Don’t be confused. The True Value of Apache Hadoop is the series name and Hortonworks State of the Union and Vision for Apache Hadoop in 2013 is the first webinar title. My note on the “State of the Union.”

Don’t get me wrong. Entirely appropriate to recycle 1950’s/60’s techniques (or older).

We are people and people haven’t changed in terms of motivations, virtues or vices in recorded history.

If the past works, use it.

A Comparison of 7 Graph Databases

Filed under: AllegroGraph,DEX,FlockDB,InfiniteGraph,Neo4j,OrientDB — Patrick Durusau @ 8:02 pm

A Comparison of 7 Graph Databases by Alex Popescu.

Alex links to a graphic from InfiniteGraph that compares InfiniteGraph, Neo4j, AllegroGraph, Titan, FlockDB, DEX and OrientDB.

The graphic is nearly unreadable so Alex embeds and points to a GoogleDoc spreadsheet by Peter Karussell that you will find easier to view.

Thanks Alex and Peter!

January 19, 2013

Where Akka Came From

Filed under: Actor-Based,Akka — Patrick Durusau @ 7:09 pm

Where Akka Came From

From the post:

Sparked by the recent work on an Akka article on wikipedia, Jonas, Viktor and Yours Truly sat down to think back to the early days and how it all came about (I was merely an intrigued listener for the most part). While work on the article is ongoing, we thought it would be instructive to share a list of references to papers, talks and concepts which influenced the design—made Akka what it is today and what still is to come.

As you already know, I have a real weakness for source documentation, both ancient as well as more recent.

Enjoy!

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Filed under: Bioinformatics,Biomedical,Data Mining,Text Mining — Patrick Durusau @ 7:09 pm

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. 😉

Biocomputing 2013

Building the Library of Twitter

Filed under: Intelligence,Security,Tweets — Patrick Durusau @ 7:08 pm

Building the Library of Twitter by Ian Armas Foster.

From the post:

On an average day people around the globe contribute 500 million messages to Twitter. Collecting and storing every single tweet and its resulting metadata from a single day would be a daunting task in and of itself.

The Library of Congress is trying something slightly more ambitious than that: storing and indexing every tweet ever posted.

With the help of social media facilitator Gnip, the Library of Congress aims to create an archive where researchers can access any tweet recorded since Twitter’s inception in 2006.

According to this update on the progress of the seemingly herculean project, the LOC has already archived 170 billion tweets and their respective metadata. That total includes the posts from 2006-2010, which Gnip compressed and sent to the LOC over three different files of 2.3 terabytes each. When the LOC uncompressed the files, they filled 20 terabytes’ worth of server space representing 21 billion tweets and its supplementary 50 metadata fields.

It is often said that 90% of the world’s data has accrued over the last two years. That is remarkably close to the truth for Twitter, as an additional 150 billion tweets (88% of the total) poured into the LOC archive in 2011 and 2012. Further, Gnip delivers hourly updates to the tune of half a billion tweets a day. That means 42 days’ worth of 2012-2013 tweets equal the total amount from 2006-2010. In all, they are dealing with 133.2 terabytes of information.

Now there’s a big data problem for you! Not to mention a resource problem for the Library of Congress.

You might want to make a contribution to help fund their work on this project.

The full archive is obviously of incredible value for researchers at all levels, but smaller sub-sets of the Twitter stream may be valuable as well.

If I were designing a Twitter based lexicon for covert communication for example, I would want to use frequent terms from particular geographic locations.

And/or create patterns of tweets from particular accounts so that they don’t stand out from others.

Not to mention trying to crunch the Twitter stream for content I know must be present.

Documentation: It Doesn’t Suck! [Topic Maps As Semantic Documentation]

Filed under: Documentation,Topic Maps — Patrick Durusau @ 7:08 pm

Documentation: It Doesn’t Suck! by Jes Schulz Borland.

documentation illustration

Jes writes:

Some parts of our jobs are not glamorous, but necessary. For example, I have to brush Brent’s Bob Dylan wig weekly, to make sure it’s shiny and perfect. Documentation is a task many people roll their eyes at, procrastinate about starting, have a hard time keeping up-to-date, and in general avoid.

Stop avoiding it, and embrace the benefits!

The most important part of documentation is starting, so I’d like to help you by giving you a list of things to document. It’s going to take time and won’t be as fun as tuning queries from 20 minutes to 2 seconds, but it could save the day sometime in the future.

You can call this your SQL Server Run Book, your SQL Server Documentation, your SQL Server Best Practices Guide – whatever works for your environment. Make sure it’s filled in for each server, and kept up to date, and you’ll soon realize the benefits.

There is even a video: Video: Documentation – It Doesn’t Suck!.

Semantic documentation isn’t the entire story behind topic maps but it is what enables the other benefits from using topic maps.

With a topic map you can document what must be matched by other documentation (other topic maps, yours or someone else’s), for both to be talking about the same subject.

And you get to choose the degree of documentation you want. You could choose a string, like owl:SameAs, and have a variety of groups using it to mean any number of things.

Or, you could choose to require that several properties (language, publishing house, journal, any number of properties) match, and then others are talking about the same subject as yourself.

Doesn’t mean that mis-use is completely avoided, only means it is made less likely. Or easier to avoid might be a better way to say it.

Not to mention that six months or a year from now, it may be easier for you to re-use your identification, since it has more than one property that must be matched.
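A toy sketch of that last point, with property names invented purely for illustration: two records are treated as referring to the same subject only when every property the map requires for identification is present and agrees.

```python
# Toy multi-property subject identity test. The required properties and the
# records are invented examples; a real topic map engine does far more.
REQUIRED = ("language", "publisher", "journal")

def same_subject(a, b, required=REQUIRED):
    """True only if every required identifying property is present and equal."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in required)

rec1 = {"name": "Overlay Journal", "language": "en",
        "publisher": "Example House", "journal": "J. of Maps"}
rec2 = {"name": "overlay journal", "language": "en",
        "publisher": "Example House", "journal": "J. of Maps"}
rec3 = {"name": "Overlay Journal", "language": "en",
        "publisher": "Another House"}

print(same_subject(rec1, rec2))  # True  -- all required properties match
print(same_subject(rec1, rec3))  # False -- publisher differs, journal missing
```

Matching on a single string, owl:SameAs style, is the degenerate case where the required set has one element; the more properties you require, the harder it is to merge things that merely look alike.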

Federal Big Data Forum

Filed under: BigData,Conferences,Intelligence,Security — Patrick Durusau @ 7:07 pm

Are you architecting sensemaking solutions in the national security space? Register for 30 Jan Federal Big Data Forum sponsored by Cloudera by Bob Gourley.

From the post:

Friends at Cloudera are lead sponsors and coordinators of a new Big Data Forum focused on Apache Hadoop. The first, which will be held 30 January 2013 in Columbia, Maryland, will be focused on lessons learned of use to the national security community. This is primarily for practitioners and leaders fielding real working Big Data solutions on Apache Hadoop and related technologies. I’ve seen a draft agenda; it includes a lineup of the nation’s greatest Big Data technologists, including the chairman of the Apache Software Foundation and creator of Hadoop, Lucene and Nutch, Doug Cutting.

This event is intentionally being focused on real practitioners and others who can benefit from lessons learned by those who have created/fielded real enterprise solutions. This will fill up fast. Please mark your calendar now and register right away. To register see: http://info.cloudera.com/federal-big-data-hadoop-forum.html

Bob’s post also has the invite.

I won’t be able to attend but would love to hear from anyone who does. Thanks!
