Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 19, 2013

NASA Support – Dr. Kirk Borne of George Mason University

Filed under: Astroinformatics,Outlier Detection — Patrick Durusau @ 7:07 pm

The Arts and Entertainment Magazine (an unlikely source for me) published TAEM Interview with Dr. Kirk Borne of George Mason University, a delightful interview meant to generate support for NASA.

Of particular interest, Dr. Kirk Borne says:

My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data. These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks.

Topic maps are known for encoding knowns and known patterns in data.

How would you explore a topic map to find “…unknown unknowns and the unexpected patterns in the data?”
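Nothing below reflects Dr. Borne’s actual methods, but to make “surprise discovery” a little more concrete, here is a minimal sketch of outlier detection over an invented two-dimensional dataset using scikit-learn’s IsolationForest (the detector choice, the contamination setting, and the data are all assumptions made purely for illustration):

```python
# Toy "surprise discovery": flag points that do not fit the bulk of the data.
# The dataset, the detector choice, and its settings are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
bulk = rng.normal(loc=0.0, scale=1.0, size=(500, 2))      # the "known" population
surprises = np.array([[6.0, 6.0], [-7.0, 5.5]])           # unexpected objects
data = np.vstack([bulk, surprises])

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(data)                       # -1 marks outliers

print("flagged as surprises:")
print(data[labels == -1])
```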

BTW, Dr. Borne invented the term “astroinformatics.”

Hadoop “State of the Union” [Webinar]

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 7:07 pm

Hortonworks State of the Union and Vision for Apache Hadoop in 2013 by Kim Rose.

From the post:

Who: Shaun Connolly, Vice President of Corporate Strategy, Hortonworks

When: Tuesday, January 22, 2013 at 1:00 p.m. ET/10:00am PT

Where: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html

Click to Tweet: #Hortonworks hosting “State of the Union” webinar to discuss 2013 vision for #Hadoop, 1/22 at 1 pm ET. Register here: http://bit.ly/VYJxKX

The “State of the Union” webinar is the first in a four-part Hortonworks webinar series titled, “The True Value of Apache Hadoop,” designed to inform attendees of key trends, future roadmaps, best practices and the tools necessary for the successful enterprise adoption of Apache Hadoop.

During the “State of the Union,” Connolly will look at key company highlights from 2012, including the release of the Hortonworks Data Platform (HDP)—the industry’s only 100-percent open source platform powered by Apache Hadoop—and the further development of the Hadoop ecosystem through partnerships with leading software vendors, such as Microsoft and Teradata. Connolly will also provide insight into upcoming initiatives and projects that the Company plans to focus on this year as well as topical advances in the Apache Hadoop community.

Attendees will learn:

  • How Hortonworks’ focus contributes to innovation within the Apache open source community while addressing enterprise requirements and ecosystem interoperability;
  • About the latest releases in the Hortonworks product offering; and
  • About Hortonworks’ roadmap and major areas of investment across core platform, data and operational services for productive operations and management.

For more information, or to register for the “State of the Union” webinar, please visit: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html.

You will learn more from this “State of the Union” address than any similarly titled presentations with Congressional responses and the sycophantic choruses that accompany them.

Could Governments Run Out of Patience with Open Data? [Semantic Web?]

Filed under: Government,Open Data,Semantic Web — Patrick Durusau @ 7:06 pm

Could Governments Run Out of Patience with Open Data? by Andrea Di Maio.

From the post:

Yesterday I had yet another client conversation – this time with a mid-size municipality in the north of Europe – on the topic of the economic value generated through open data. The problem we discussed is the same I highlighted in a post last year: nobody argues the potential long term value of open data but it may be difficult to maintain a momentum (and to spend time, money and management bandwidth) on something that will come to fruition in the more distant future, while more urgent problems need to be solved now, under growing budget constraints.

Faith is not enough, nor are the many examples that open data evangelists keep sharing to demonstrate value. Open data must help solve today’s problems too, in order to gain the credibility and the support required to realize future economic value.

While many agree that open data can contribute to shorter term goals, such as improving inter-agency transparency and data exchange or engaging citizens on solving concrete problems, making this happen in a more systematic way requires a change of emphasis and a change of leadership.

Emphasis must be on directing efforts – be they idea collections, citizen-developed dashboards or mobile apps – onto specific, concrete problems that government organizations need to solve. One might argue that this is not dissimilar from having citizens offer perspectives on how they see existing issues and related solutions. But there is an important difference: what usually happens is that citizens and other stakeholders are free to use whichever data they want to use. The required change is to entice them to help governments solve problems the way governments see them. In other terms, whereas citizens would clearly remain free to come up with whichever use of any open data they deem important, they should get incentives, awards, prizes only for those uses that meet clear government requirements. Citizens would be at the service of government rather than the other way around. For those who might be worried that this advocates for an unacceptable change of responsibility and that governments are at the service of citizens and not the other way around, what I mean is that citizens should help governments serve them.

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

Filed under: Amazon Web Services AWS — Patrick Durusau @ 7:05 pm

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

From the post:

In addition to delivering great services and features to our customers, we are constantly working towards helping customers so that they can build highly-scalable, highly-available cost-effective cloud solutions using our services. We not only provide technical documentation for each service but also provide guidance on economics, cross-service architectures, reference implementations, best practices and details on how to get started so customers and partners can use the services effectively.

In this post, let’s review all the content that we published in 2012 so you can help build and prioritize our content roadmap for 2013. We are looking for feedback on content topics that you would like us to build this year.

A mother lode of technical content on AWS!

Definitely a page to bookmark even as new content appears in 2013!

Hadoop in Perspective: Systems for Scientific Computing

Filed under: Hadoop,Scientific Computing — Patrick Durusau @ 7:05 pm

Hadoop in Perspective: Systems for Scientific Computing by Evert Lammerts.

From the post:

When the term scientific computing comes up in a conversation it’s usually just the occasional science geek who shows signs of recognition. But although most people have little or no knowledge of the field’s existence, it has been around since the second half of the twentieth century and has played an increasingly important role in many technological and scientific developments. Internet search engines, DNA analysis, weather forecasting, seismic analysis, renewable energy, and aircraft modeling are just a small number of examples where scientific computing is nowadays indispensable.

Apache Hadoop is a newcomer in scientific computing, and is welcomed as a great new addition to already existing systems. In this post I mean to give an introduction to systems for scientific computing, and I make an attempt at giving Hadoop a place in this picture. I start by discussing arguably the most important concept in scientific computing: parallel computing; what is it, how does it work, and what tools are available? Then I give an overview of the systems that are available for scientific computing at SURFsara, the Dutch center for academic IT and home to some of the world’s most powerful computing systems. I end with a short discussion on the questions that arise when there’s many different systems to choose from.
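Lammerts covers parallel computing at the conceptual level. As a minimal, hedged illustration of the data-parallel style that both classic HPC systems and Hadoop exploit, here is a sketch using Python’s standard multiprocessing module; the workload is invented for illustration:

```python
# Minimal data-parallel sketch: one pure function applied independently to many
# chunks, spread across worker processes. The workload is invented.
from multiprocessing import Pool

def analyze(chunk):
    # Stand-in for a real per-chunk computation (e.g., reducing one data file).
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(i * 100_000, (i + 1) * 100_000) for i in range(8)]
    with Pool(processes=4) as pool:
        partials = pool.map(analyze, chunks)   # embarrassingly parallel step
    print("combined result:", sum(partials))
```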

A good overview of the range of options for scientific computing, where, just as with more ordinary problems, no one solution is the best for all cases.

January 18, 2013

Hortonworks Data Platform 1.2 Available Now!

Filed under: Apache Ambari,Hadoop,HBase,Hortonworks,MapReduce — Patrick Durusau @ 7:18 pm

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

There goes the weekend!

PPInterFinder

Filed under: Associations,Bioinformatics,Biomedical — Patrick Durusau @ 7:18 pm

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013: bas052, doi: 10.1093/database/bas052)

Abstract:

One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.

Database URL: http://www.biomining-bu.in/ppinterfinder/
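To make the co-occurrence idea concrete (and only that idea; the published pipeline relies on Tregex parses and eleven syntactic patterns), here is a toy sketch in Python. The protein dictionary and relation keywords are invented examples:

```python
# Toy illustration of relation-keyword co-occurrence, the first idea behind
# tools like PPInterFinder. This is NOT the published pipeline; names and
# keywords below are invented examples.
import itertools
import re

RELATION_KEYWORDS = {"interacts", "binds", "phosphorylates", "activates"}
PROTEINS = {"BRCA1", "TP53", "MDM2", "AKT1"}   # stand-in dictionary

def candidate_ppis(sentence):
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    keywords = [t.lower() for t in tokens if t.lower() in RELATION_KEYWORDS]
    proteins = [t for t in tokens if t in PROTEINS]
    if keywords and len(proteins) >= 2:
        for a, b in itertools.combinations(sorted(set(proteins)), 2):
            yield (a, keywords[0], b)

sentence = "MDM2 binds TP53 and inhibits its transcriptional activity."
print(list(candidate_ppis(sentence)))   # [('MDM2', 'binds', 'TP53')]
```

Real systems add entity normalization, syntactic filtering and pattern matching on top of this naive candidate generation.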

I thought the shortened form of the title would catch your eye. 😉

Important work for bioinformatics but it is also an example of domain specific association mining.

By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.

A lesson that should be taken and applied to semantic mappings more generally.

Computational Folkloristics

Filed under: Digital Research,Folklore,Social Sciences — Patrick Durusau @ 7:17 pm

JAF Special Issue 2014 : Computational Folkloristics – Special Issue of the Journal of American Folklore

I wasn’t able to confirm this call at the Journal of American Folklore, but wanted to pass it along anyway.

There are few areas with the potential for semantic mappings as rich as folklore. A natural for topic maps.

From the call I cite above:

Submission Deadline Jun 15, 2013
Notification Due Aug 1, 2013
Final Version Due Oct 1, 2013

Over the course of the past decade, a revolution has occurred in the materials available for the study of folklore. The scope of digital archives of traditional expressive forms has exploded, and the magnitude of machine-readable materials available for consideration has increased by many orders of magnitude. Many national archives have made significant efforts to make their archival resources machine-readable, while other smaller initiatives have focused on the digitization of archival resources related to smaller regions, a single collector, or a single genre. Simultaneously, the explosive growth in social media, web logs (blogs), and other Internet resources have made previously hard to access forms of traditional expressive culture accessible at a scale so large that it is hard to fathom. These developments, coupled to the development of algorithmic approaches to the analysis of large, unstructured data and new methods for the visualization of the relationships discovered by these algorithmic approaches – from mapping to 3-D embedding, from time-lines to navigable visualizations – offer folklorists new opportunities for the analysis of traditional expressive forms. We label approaches to the study of folklore that leverage the power of these algorithmic approaches “Computational Folkloristics” (Abello, Broadwell, Tangherlini 2012).

The Journal of American Folklore invites papers for consideration for inclusion in a special issue of the journal edited by Timothy Tangherlini that focuses on “Computational Folkloristics.” The goal of the special issue is to reveal how computational methods can augment the study of folklore, and propose methods that can extend the traditional reach of the discipline. To avoid confusion, we term those approaches “computational” that make use of algorithmic methods to assist in the interpretation of relationships or structures in the underlying data. Consequently, “Computational Folkloristics” is distinct from Digital Folklore in the application of computation to a digital representation of a corpus.

We are particularly interested in papers that focus on: the automatic discovery of narrative structure; challenges in Natural Language Processing (NLP) related to unlabeled, multilingual data including named entity detection and resolution; topic modeling and other methods that explore latent semantic aspects of a folklore corpus; the alignment of folklore data with external historical datasets such as census records; GIS applications and methods; network analysis methods for the study of, among other things, propagation, community detection and influence; rapid classification of unlabeled folklore data; search and discovery on and across folklore corpora; modeling of folklore processes; automatic labeling of performance phenomena in visual data; automatic classification of audio performances. Other novel approaches to the study of folklore that make use of algorithmic approaches will also be considered.

A significant challenge of this special issue is to address these issues in a manner that is directly relevant to the community of folklorists (as opposed to computer scientists). Articles should be written in such a way that the argument and methods are accessible and understandable for an audience expert in folklore but not expert in computer science or applied mathematics. To that end, we encourage team submissions that bridge the gap between these disciplines. If you are in doubt about whether your approach or your target domain is appropriate for consideration in this special issue, please email the issue editor, Timothy Tangherlini at tango@humnet.ucla.edu, using the subject line “Computational Folkloristics query”. Deadline for all queries is April 1, 2013.

Timothy Tangherlini homepage.

Something to look forward to!

Similarity Search and Applications

Filed under: Conferences,Similarity,Similarity Retrieval — Patrick Durusau @ 7:17 pm

International Conference on Similarity Search and Applications (SISAP 2013)

From the webpage:

The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It aims at the technological problems shared by numerous application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service.

The SISAP initiative (www.sisap.org) aims to become a forum to exchange real-world, challenging and innovative examples of applications, new indexing techniques, common test-beds and benchmarks, source code and up-to-date literature through its web page, serving the similarity search community. Traditionally, SISAP puts emphasis on the distance-based searching, but in general the conference concerns both the effectiveness and efficiency aspects of any similarity search problem.

Dates:

Paper submission: April 2013
Notification: June 2013
Final version: July 2013
Conference: October 2, 3, and 4, 2013

The specific topics include, but are not limited to:

  • Similarity queries – k-NN, range, reverse NN, top-k, etc.
  • Similarity operations – joins, ranking, classification, categorization, filtering, etc.
  • Evaluation techniques for similarity queries and operations
  • Merging/combining multiple similarity modalities
  • Cost models and analysis for similarity data processing
  • Scalability issues and high-performance similarity data management
  • Feature extraction for similarity-based data findability
  • Test collections and benchmarks
  • Performance studies, benchmarks, and comparisons
  • Similarity Search over outsourced data repositories
  • Similarity search cloud services
  • Languages for similarity databases
  • New modes of similarity for complex data understanding
  • Applications of similarity-based operations
  • Image, video, voice, and music (multimedia) retrieval systems
  • Similarity for forensics and security

You should be able to find one or more topics that interest you. 😉

How similar must two or more references to an entity be before they are identifying the same entity?

Or for that matter, is similarity an association between two or more references?
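One way to make that question concrete: score pairs of name strings and merge only above a chosen threshold. The names, the character-trigram measure, and the 0.5 cutoff below are arbitrary choices for illustration, not a recommendation:

```python
# Sketch: decide whether two references identify the same entity by comparing
# character-trigram sets against an arbitrary threshold. Names and threshold
# are invented; real matching needs domain-specific evidence.
def trigrams(s):
    s = " " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

THRESHOLD = 0.5
pairs = [("W3C", "World Wide Web Consortium"),
         ("Hortonworks Data Platform", "Hortonworks Data Platform (HDP)")]

for left, right in pairs:
    score = jaccard(left, right)
    verdict = "same entity?" if score >= THRESHOLD else "keep separate"
    print(f"{left!r} vs {right!r}: {score:.2f} -> {verdict}")
```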

ISWC 2013 : The 12th International Semantic Web Conference

Filed under: Conferences,Semantic Web — Patrick Durusau @ 7:17 pm

ISWC 2013 : The 12th International Semantic Web Conference

Dates:

When Oct 21, 2013 – Oct 25, 2013
Where Sydney, Australia
Abstract Registration Due May 1, 2013
Submission Deadline May 10, 2013
Notification Due Jul 3, 2013
Final Version Due Aug 5, 2013

ISWC is the premier venue for presenting innovative systems and research results related to the Semantic Web and Linked Data. We solicit the submission of original research papers for ISWC 2013’s research track, dealing with analytical, theoretical, empirical, and practical aspects of all areas of the Semantic Web. Submissions to the research track should describe original, significant research on the Semantic Web or on Semantic Web technologies, and are expected to provide some principled means of evaluation.

To maintain the high level of quality and impact of the ISWC series, all papers will be reviewed by at least three program committee members and one vice chair of the program committee. To assess papers, reviewers will judge their originality and significance for further advances in the Semantic Web, as well as the technical soundness of the proposed approaches and the overall readability of the submitted papers. We will give specific attention to the evaluation of the approaches described in the papers. We strongly encourage evaluations that are repeatable: preference will be given to papers that provide links to the data sets and queries used to evaluate their approach, as well as systems papers providing links to their source code or to some live deployment.

Topics of Interest

Topics of interest include, but are not limited to:

  • Management of Semantic Web data and Linked Data
  • Languages, tools, and methodologies for representing and managing Semantic Web data
  • Database, IR, NLP and AI technologies for the Semantic Web
  • Search, query, integration, and analysis on the Semantic Web
  • Robust and scalable knowledge management and reasoning on the Web
  • Cleaning, assurance, and provenance of Semantic Web data, services, and processes
  • Semantic Web Services
  • Semantic Sensor Web
  • Semantic technologies for mobile platforms
  • Evaluation of semantic web technologies
  • Ontology engineering and ontology patterns for the Semantic Web
  • Ontology modularity, mapping, merging, and alignment
  • Ontology Dynamics
  • Social and Emergent Semantics
  • Social networks and processes on the Semantic Web
  • Representing and reasoning about trust, privacy, and security
  • User Interfaces to the Semantic Web
  • Interacting with Semantic Web data and Linked Data
  • Information visualization of Semantic Web data and Linked Data
  • Personalized access to Semantic Web data and applications
  • Semantic Web technologies for eGovernment, eEnvironment, eMobility or eHealth
  • Semantic Web and Linked Data for Cloud environments

Submission

Pre-submission of abstracts is a strict requirement. All papers and abstracts have to be submitted electronically via the EasyChair conference submission System https://www.easychair.org/conferences/?conf=iswc2013.

Semantic Web Gets Closer To The Internet of Things [Close Enough To Be Useful?]

Filed under: Semantic Web — Patrick Durusau @ 7:17 pm

Semantic Web Gets Closer To The Internet of Things by Jennifer Zaino. [herein, IoT = Internet of Things]

From the post:

The Internet of Things is coming, but it needs a semantic backbone to flourish.

Applying semantic technologies to IoT, however, has several research challenges, the authors note, pointing out that IoT and using semantics in IoT is still in its early days. Being in on the ground floor of this movement is undeniably exciting to the research community, including people such as Konstantinos Kotis, Senior Research Scientist at University of the Aegean, and IT Manager in the regional division of the Samos and Ikaria islands at North Aegean Regional Administration Authority.

Well, but the Semantic Web has been in “its early days” for quite some time now. A decade or more?

And with every proposal, now the Internet of Things, the semantic issues are going to be solved real soon now. But in reality the semantic can gets kicked down the road. Again.

Not that I have a universal semantic solution to propose, different from all the other universal semantic solutions. Mainly because universal semantic solutions fail. (full stop)

Not to mention the difficulty of selling a business case for an investment, now and for the foreseeable future, that will pay off, maybe, only if everyone else someday adopts the same solution.

I am not an investor or business mogul but that sounds on the risky side to me.

If I were going to invest in a semantic solution, I would want it to have a defined payoff for my enterprise or organization, whether anyone else adopted it or not.

Web accessible identifiers are not a counter-example. Web accessible identifiers work perfectly well in the absence of any universal semantic solution.

Scalable Cross Comparison Service (version 1.1)

Filed under: Astroinformatics,BigData — Patrick Durusau @ 7:16 pm

Scalable Cross Comparison Service (version 1.1)

From the post:

The VAO has released a new version of the Scalable Cross Comparison (SCC) Service v1.1 on 02 January, 2013. SCC is a web-based application that performs spatial crossmatching between user source tables and on-line catalogs. New features of the service include:

  • Indexed cross-match candidate tables for very large and frequently used survey catalogs. New indexed catalogs include PPMXL, WISE, DENIS3, UCAC3, TYCHO2.
  • Interoperability of SCC with other VO tools and services via SAMP. With SAMP, tools can broadcast data from one tool to the next without the need to read or write files. Examples of other SAMP enabled virtual observatory tools include DDT, Iris, Topcat, Vizier, and DS9. This is a Beta release of interoperability in SCC.

Try it now at http://www.usvao.org/tools.

Astronomy (optical and radio) has examples of integration of bigdata, before there was “bigdata.”
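The core operation, crossmatching, is easy to sketch at toy scale. Assuming astropy is installed, nearest-neighbor matching of two tiny coordinate lists looks roughly like this (coordinates and the 1 arcsecond cutoff are invented; SCC does this against indexed survey catalogs):

```python
# Toy spatial crossmatch between two tiny "catalogs" using astropy.
# Coordinates and the 1-arcsecond match radius are invented for illustration.
import astropy.units as u
from astropy.coordinates import SkyCoord

sources = SkyCoord(ra=[10.0010, 45.2000] * u.deg, dec=[-5.0000, 20.1000] * u.deg)
catalog = SkyCoord(ra=[10.0012, 45.2001, 80.0000] * u.deg,
                   dec=[-5.0001, 20.1001, 10.0000] * u.deg)

idx, sep2d, _ = sources.match_to_catalog_sky(catalog)   # nearest neighbor in catalog
is_match = sep2d < 1.0 * u.arcsec                        # keep only close pairs

for i, (j, sep, ok) in enumerate(zip(idx, sep2d.arcsec, is_match)):
    print(f'source {i} -> catalog {j}: {sep:.2f} arcsec, match={ok}')
```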

Graph Algorithms

Filed under: Algorithms,Graphs,Networks — Patrick Durusau @ 7:16 pm

Graph Algorithms by David Eppstein.

A graph algorithms course built around a Wikipedia book, Graph Algorithms, compiled from Wikipedia articles.

The syllabus includes some materials not found at Wikipedia, so be sure to check the syllabus as well.

Strong components of the Wikipedia graph

Filed under: Algorithms,Graphs,Networks,Wikipedia — Patrick Durusau @ 7:16 pm

Strong components of the Wikipedia graph

From the post:

I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?
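If you want to reproduce this kind of analysis, a minimal sketch with NetworkX, using an invented toy link graph rather than actual Wikipedia data, looks like this:

```python
# Strong connectivity on a toy directed "link graph". Node names and edges
# are invented; real experiments would use actual Wikipedia link data.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Algebraic topology", "Homotopy group"),
    ("Homotopy group", "Heinz Hopf"),
    ("Heinz Hopf", "Algebraic topology"),      # closes a cycle -> one component
    ("Heinz Hopf", "University of Amsterdam"), # one-way link, outside the cycle
])

for component in nx.strongly_connected_components(g):
    if len(component) > 1:
        print(sorted(component))
```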

One of the responses to this post suggests grabbing the entire Wikipedia dataset for purposes of trying out algorithms.

A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?

Recalling the research that claims people can only remember seven or so things at one time.

Freeing the Plum Book

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 7:15 pm

Freeing the Plum Book by Derek Willis.

From the post:

The federal government produces reams of publications, ranging from the useful to the esoteric. Pick a topic, and in most cases you’ll find a relevant government publication: for example, recent Times articles about presidential appointments draw on the Plum Book. Published annually by either the House or the Senate (the task alternates between committees), the Plum Book is a snapshot of appointments throughout the federal government.

The Plum Book is clearly a useful resource for reporters. But like many products of the Government Printing Office, its two main publication formats are print and PDF. That means the digital version isn’t particularly searchable, unless you count Ctrl-F as a legitimate search mechanism. And that’s a shame, because the Plum Book is basically a long list of names, positions and salary information. It’s data.

Derek describes freeing the Plum Book from less than useful formats.

It is now available in JSON and YAML formats at Github and in Excel.

Curious, what other public datasets would you want to match up to the Plum Book?

January 17, 2013

Complete Guardian Dataset Listing!

Filed under: Data,Dataset,News — Patrick Durusau @ 7:28 pm

All our datasets: the complete index by Chris Cross.

From the post:

Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it. Click below to find out

I am simply in awe of the number of datasets produced by the Guardian since 2009.

A few of the more interesting titles include:

You will find things in the hundreds of datasets you have wondered about and other things you can’t imagine wondering about. 😉

Enjoy!

SVDFeature: A Toolkit for Feature-based Collaborative Filtering

Filed under: Feature Vectors,Filters,Recommendation — Patrick Durusau @ 7:27 pm

SVDFeature: A Toolkit for Feature-based Collaborative Filtering – implementation by Igor Carron.

From the post:

SVDFeature: A Toolkit for Feature-based Collaborative Filtering by Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. The abstract reads:

In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years.
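As a hedged illustration of what “feature-based matrix factorization” means (a toy re-implementation of the general idea, not the SVDFeature code; all data and hyperparameters are invented):

```python
# Toy feature-based matrix factorization: each rating is modeled as a bias
# plus the dot product of summed user-feature and item-feature vectors.
# This illustrates the general idea only, not the SVDFeature implementation.
import numpy as np

rng = np.random.default_rng(0)
n_user_feats, n_item_feats, k = 5, 4, 3            # invented sizes
P = rng.normal(scale=0.1, size=(n_user_feats, k))  # user-side feature factors
Q = rng.normal(scale=0.1, size=(n_item_feats, k))  # item-side feature factors
bias = 0.0

# Each sample: (active user-feature ids, active item-feature ids, rating).
samples = [([0, 2], [1], 4.0), ([1], [0, 3], 2.0), ([0], [2], 5.0)]

lr, reg = 0.05, 0.01
for _ in range(200):                               # SGD epochs
    for ufeats, ifeats, rating in samples:
        u_vec, i_vec = P[ufeats].sum(axis=0), Q[ifeats].sum(axis=0)
        err = rating - (bias + u_vec @ i_vec)
        bias += lr * err
        P[ufeats] += lr * (err * i_vec - reg * P[ufeats])
        Q[ifeats] += lr * (err * u_vec - reg * Q[ifeats])

for ufeats, ifeats, rating in samples:
    pred = bias + P[ufeats].sum(axis=0) @ Q[ifeats].sum(axis=0)
    print(f"true {rating:.1f}  predicted {pred:.2f}")
```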

The wiki for the project and attendant code is here.

Can’t argue with two KDD cups in as many years!

Licensed under Apache 2.0.

MongoDB Text Search Tutorial

Filed under: MongoDB,Search Engines,Searching,Text Mining — Patrick Durusau @ 7:26 pm

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature providing some examples for query syntax (negation, phrase search)—according to the previous post even more advanced queries should be supported, filtering and projections, multiple text fields indexing, and adding details about the stemming solution used (Snowball).

Alex also has a list of his posts on the text search feature for MongoDB.
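A caveat before the sketch: the posts above describe the experimental feature in MongoDB 2.4, whose command syntax differed; the sketch below uses the $text operator that later became the stable interface, via pymongo against a local mongod, with invented collection and documents:

```python
# Sketch of MongoDB text search via pymongo. The 2013 posts cover the
# experimental 2.4 feature; this uses the $text operator from later releases.
# Collection name and documents are invented; assumes a local mongod.
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
posts = client.demo.posts

posts.drop()
posts.insert_many([
    {"title": "MongoDB text search tutorial", "body": "stemming with snowball"},
    {"title": "Graph databases", "body": "strong components and link analysis"},
])
posts.create_index([("title", TEXT), ("body", TEXT)])   # multi-field text index

# Phrase search and negation, the query features discussed in the post.
cursor = posts.find({"$text": {"$search": '"text search" -graph'}},
                    {"score": {"$meta": "textScore"}})
for doc in cursor.sort([("score", {"$meta": "textScore"})]):
    print(doc["title"])
```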

Graph Database Resources

Filed under: Graphs,Networks — Patrick Durusau @ 7:26 pm

Graph Database Resources by Danny Bickson.

Danny provides a short list of graph database resources.

Do be careful with:

A paper that summarizes the state of graph databases that might be worth reading: http://swp.dcc.uchile.cl/TR/2005/TR_DCC-2005-010.pdf

Summarizes the state of the art as of 2005.

Still worth reading because many of the techniques and insights are relevant for today.

And if you pay attention to the citations, you will discover that “graphs as a new way of thinking” is either ignorance or marketing hype.

The earliest paper cited in the 2005 state of art for graphs dates from 1965:

D. J. de S. Price. Networks of Scientific papers. Science, 149:510–515, 1965.

And there are plenty of citations from the 1970’s and 1980’s on hypergraphs, etc.

I am very much a graph enthusiast but the world wasn’t created anew because we came of age.

UniChem…[How Much Precision Can You Afford?]

Filed under: Bioinformatics,Biomedical,Cheminformatics,Topic Maps — Patrick Durusau @ 7:26 pm

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)

Abstract:

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn’t an academic exercise in elegance and completeness.

Its users have an immediate need for the mapping data, and if it is not quite right, human users are in the best position to correct it and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.

This suggests that evolving mappings may be an appropriate paradigm for topic maps.
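The bookkeeping idea worth copying, tracking which identifier assignments are current and which are obsolete, fits in a few lines. This is an illustration only, not UniChem’s schema or API, and the identifiers are invented:

```python
# Sketch of an evolving identifier mapping with obsolete-assignment tracking.
# Illustration only: not UniChem's schema or API; all identifiers are invented.
from collections import defaultdict

class MappingStore:
    def __init__(self):
        self.current = {}                      # (source, identifier) -> canonical key
        self.obsolete = defaultdict(list)      # (source, identifier) -> retired keys

    def assign(self, source, identifier, canonical_key):
        old = self.current.get((source, identifier))
        if old is not None and old != canonical_key:
            self.obsolete[(source, identifier)].append(old)
        self.current[(source, identifier)] = canonical_key

    def links(self, source, identifier):
        """All current identifiers, in any source, that share the same key."""
        key = self.current.get((source, identifier))
        return [k for k, v in self.current.items()
                if v == key and k != (source, identifier)]

store = MappingStore()
store.assign("sourceA", "A-123", "STDINCHI-1")
store.assign("sourceB", "B-456", "STDINCHI-1")
print(store.links("sourceB", "B-456"))          # [('sourceA', 'A-123')]

store.assign("sourceA", "A-123", "STDINCHI-2")  # assignment revised later
print(store.links("sourceB", "B-456"))          # [] -- stale link drops out
print(store.obsolete[("sourceA", "A-123")])     # ['STDINCHI-1']
```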

…Functional Programming and Scala

Filed under: Functional Programming,Scala — Patrick Durusau @ 7:25 pm

Resources for Getting Started With Functional Programming and Scala by Kelsey Innis.

From the post:

This is the “secret slide” from my recent talk Learning Functional Programming without Growing a Neckbeard, with links to the sources I used to put the talk together and some suggestions for ways to get started writing Scala code.

The “…without growing a neckbeard” merits mention even if you are not interested in functional programming and topic maps.

Nice list of resources.

Don’t miss the presentation!

I first saw this at This week in #Scala (11/01/2013) by Chris Cundill.

CS 229 Machine Learning – Final Projects, Autumn 2012

Filed under: Machine Learning — Patrick Durusau @ 7:25 pm

CS 229 Machine Learning – Final Projects, Autumn 2012

Two hundred and forty-five (245) final project reports in machine learning.

I started to provide a sampling but I would miss the one that would capture your interest.

BTW, yes, this is Andrew Ng’s Machine Learning course.

Machine Learning and Data Mining – Association Analysis with Python

Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

From the post:

Recently I’ve been working with recommender systems and association analysis. This last one, especially, is one of the most used machine learning techniques for extracting hidden relationships from large datasets.

The famous example related to the study of association analysis is the story of baby diapers and beer. The story reports that a certain grocery store in the Midwest of the United States increased its beer sales by putting the beer near where the diapers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beer on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply a collection of items that frequently occur together. And association rules suggest a strong relationship that exists between two items.
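As a concrete companion to that description, here is a minimal sketch over an invented basket dataset. It brute-forces pair counts rather than implementing the Apriori pruning that Caraciolo’s post walks through:

```python
# Minimal association-analysis sketch over invented transactions: count
# frequent item pairs and report rules above support/confidence thresholds.
# This brute-forces pairs; it is not the full Apriori algorithm.
from itertools import combinations
from collections import Counter

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]
min_support, min_confidence = 0.4, 0.6

item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))
n = len(transactions)

for pair, count in pair_counts.items():
    support = count / n
    if support < min_support:
        continue
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        confidence = count / item_counts[antecedent]
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent} "
                  f"(support={support:.2f}, confidence={confidence:.2f})")
```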

When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

As this post demonstrates, that may be overly optimistic on my part.

What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

An incomplete association as it were.

Suggestions?

Virtual Astronomical Observatory – 221st AAS Meeting

Filed under: Astroinformatics,Data — Patrick Durusau @ 7:24 pm

The Virtual Astronomical Observatory (VAO) at the 221st AAS Meeting

From the post:

The VAO is funded to provide a computational infrastructure for virtual astronomy. When complete, it will enable astronomers to discover and access data in archives worldwide, allow them to share and publish datasets, and support analysis of data through an “ecosystem” of interoperable tools.

Nine out of twelve posters are available for download, including:

Even if you live in an area of severe light pollution, the heavens may only be an IP address away.

Enjoy!

chemf: A purely functional chemistry toolkit

Filed under: Cheminformatics,Functional Programming — Patrick Durusau @ 7:23 pm

chemf: A purely functional chemistry toolkit by Stefan Höck and Rainer Riedl. (Journal of Cheminformatics 2012, 4:38 doi:10.1186/1758-2946-4-38)

Abstract:

Background

Although programming in a type-safe and referentially transparent style offers several advantages over working with mutable data structures and side effects, this style of programming has not seen much use in chemistry-related software. Since functional programming languages were designed with referential transparency in mind, these languages offer a lot of support when writing immutable data structures and side-effects free code. We therefore started implementing our own toolkit based on the above programming paradigms in a modern, versatile programming language.

Results

We present our initial results with functional programming in chemistry by first describing an immutable data structure for molecular graphs together with a couple of simple algorithms to calculate basic molecular properties before writing a complete SMILES parser in accordance with the OpenSMILES specification. Along the way we show how to deal with input validation, error handling, bulk operations, and parallelization in a purely functional way. At the end we also analyze and improve our algorithms and data structures in terms of performance and compare it to existing toolkits both object-oriented and purely functional. All code was written in Scala, a modern multi-paradigm programming language with a strong support for functional programming and a highly sophisticated type system.

Conclusions

We have successfully made the first important steps towards a purely functional chemistry toolkit. The data structures and algorithms presented in this article perform well while at the same time they can be safely used in parallelized applications, such as computer aided drug design experiments, without further adjustments. This stands in contrast to existing object-oriented toolkits where thread safety of data structures and algorithms is a deliberate design decision that can be hard to implement. Finally, the level of type-safety achieved by Scala highly increased the reliability of our code as well as the productivity of the programmers involved in this project.

Another vote in favor of functional programming as a path to parallel processing.
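The core idea, an immutable data structure plus pure functions that can be mapped across processes without locks, translates outside Scala as well. A rough Python sketch (invented molecules, rounded atomic weights, not the chemf API):

```python
# Sketch (in Python rather than chemf's Scala) of the core idea: an immutable
# molecular graph plus pure functions, which can be evaluated safely in
# parallel. Atomic weights are rounded; the molecules are invented examples.
from dataclasses import dataclass
from concurrent.futures import ProcessPoolExecutor

ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

@dataclass(frozen=True)            # frozen -> instances are immutable
class Molecule:
    atoms: tuple                   # e.g. ("C", "H", "H", "H", "H")
    bonds: tuple                   # pairs of atom indices, e.g. ((0, 1), ...)

def molecular_weight(mol):
    """Pure function: no mutation, no side effects."""
    return sum(ATOMIC_WEIGHT[a] for a in mol.atoms)

methane = Molecule(atoms=("C", "H", "H", "H", "H"),
                   bonds=((0, 1), (0, 2), (0, 3), (0, 4)))
water = Molecule(atoms=("O", "H", "H"), bonds=((0, 1), (0, 2)))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:            # safe: data never mutates
        for mol, mw in zip([methane, water],
                           pool.map(molecular_weight, [methane, water])):
            print(mol.atoms, round(mw, 3))
```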

Can the next step, identity transparency*, be far behind?

*Identity transparency: where any identification of an entity can be replaced with another identification of the same entity.

January 16, 2013

Apache Hive 0.10.0 is Now Available

Filed under: Hadoop,Hive,MapReduce — Patrick Durusau @ 7:57 pm

Apache Hive 0.10.0 is Now Available by Ashutosh Chauhan.

From the post:

We are pleased to announce the release of Apache Hive version 0.10.0. More than 350 JIRA issues have been fixed with this release. A few of the most important fixes include:

Cube and Rollup: Hive now has support for creating cubes with rollups. Thanks to Namit!

List Bucketing: This is an optimization that lets you better handle skew in your tables. Thanks to Gang!

Better Windows Support: Several Hive 0.10.0 fixes support running Hive natively on Windows. There is no more cygwin dependency. Thanks to Kanna!

‘Explain’ Adds More Info: Now you can do an explain dependency and the explain plan will contain all the tables and partitions touched upon by the query. Thanks to Sambavi!

Improved Authorization: The metastore can now optionally do authorization checks on the server side instead of on the client, providing you with a better security profile. Thanks to Sushanth!

Faster Simple Queries: Some simple queries that don’t require aggregations, and therefore MapReduce jobs, can now run faster. Thanks to Navis!

Better YARN Support: This release contains additional work aimed at making Hive work well with Hadoop YARN. While not all test cases are passing yet, there has been a lot of good progress made with this release. Thanks to Zhenxiao!

Union Optimization: Hive queries with unions will now result in a lower number of MapReduce jobs under certain conditions. Thanks to Namit!

Undo Your Drop Table: While not really truly ‘undo’, you can now reinstate your table after dropping it. Thanks to Andrew!

Show Create Table: This lets you see how you created your table. Thanks to Feng!

Support for Avro Data: Hive now has built-in support for reading/writing Avro data. Thanks to Jakob!

Skewed Joins: Hive’s support for joins involving skewed data is now improved. Thanks to Namit!

Robust Connection Handling at the Metastore Layer: Connection handling between a metastore client and server and also between a metastore server and the database layer has been improved. Thanks to Bhushan and Jean!

More Statistics: It’s now possible to collect and store scalar-valued statistics for your tables and partitions. This will enable better query planning in upcoming releases. Thanks to Shreepadma!

Better-Looking HWI: HWI now uses a bootstrap javascript library. It looks really slick. Thanks to Hugo!

If you are excited about some of these new features, I recommend that you download hive-0.10 from: Hive 0.10 Release.

The full Release Notes are available here: Hive 0.10.0 Release Notes

This release saw contributions from many different people. We have numerous folks reporting bugs, writing patches for new features, fixing bugs, testing patches, helping users on mailing lists etc. We would like to give a big thank you to everyone who made hive-0.10 possible.

-Ashutosh Chauhan

A long quote but it helps to give credit where credit is due.

RuleML 2013

Filed under: Conferences,Machine Learning,RuleML — Patrick Durusau @ 7:56 pm

RuleML 2013

Important Dates:

Abstract submission: Feb. 19, 2013
Paper submission: Feb. 20, 2013
Notification of acceptance/rejection: April 12, 2013
Camera-ready copy due: May 3, 2013
RuleML-2013 dates: July 11-13, 2013

From the call for papers:

The annual International Web Rule Symposium (RuleML) is an international conference on research, applications, languages and standards for rule technologies. RuleML is the leading conference for building bridges between academia and industry in the field of rules and its applications, especially as part of the semantic technology stack. It is devoted to rule-based programming and rule-based systems including production rules systems, logic programming rule engines, and business rules engines/business rules management systems; Semantic Web rule languages and rule standards (e.g., RuleML, SWRL, RIF, PRR, SBVR); Legal RuleML; rule-based event processing languages (EPLs) and technologies; hybrid rule-based methods; and research on inference rules, transformation rules, decision rules, production rules, and ECA rules.

The 7th International Symposium on Rules and the Web (RuleML 2013) will be held on July 11-13, 2013 just prior to the AAAI conference in the Seattle Metropolitan Area, Washington. Selected papers will be published in book form in the Springer Lecture Notes in Computer Science (LNCS) series.

Topics:

  • Rules and automated reasoning
  • Rule-based policies, reputation, and trust
  • Rule-based event processing and reaction rules
  • Rules and the web
  • Fuzzy rules and uncertainty
  • Logic programming and nonmonotonic reasoning
  • Non-classical logics and the web (e.g., modal and epistemic logics)
  • Hybrid methods for combining rules and statistical machine learning techniques (e.g., conditional random fields, PSL)
  • Rule transformation, extraction, and learning
  • Vocabularies, ontologies, and business rules
  • Rule markup languages and rule interchange formats
  • Rule-based distributed/multi-agent systems
  • Rules, agents, and norms
  • Rule-based communication, dialogue, and argumentation models
  • Vocabularies and ontologies for pragmatic primitives (e.g. speech acts and deontic primitives)
  • Pragmatic web reasoning and distributed rule inference / rule execution
  • Rules in online market research and online marketing
  • Applications of rule technologies in health care and life sciences
  • Legal rules and legal reasoning
  • Industrial applications of rules
  • Controlled natural language for rule encoding (e.g. SBVR, ACE, CLCE)
  • Standards activities related to rules
  • General rule topics

A number of those seem quite at home in a topic maps setting.

VisIt 2.6.0

Filed under: Graphics,Visualization — Patrick Durusau @ 7:56 pm

VisIt 2.6.0

From the post:

VisIt is a free interactive parallel visualization and graphical analysis tool for viewing scientific data on Unix and PC platforms. Users can quickly generate visualizations from their data, animate them through time, manipulate them, and save the resulting images for presentations. VisIt contains a rich set of visualization features so that you can view your data in a variety of ways. It can be used to visualize scalar and vector fields defined on two- and three-dimensional (2D and 3D) structured and unstructured meshes. VisIt was designed to handle very large data set sizes in the terascale range and yet can also handle small data sets in the kilobyte range.

I’m not promising what your results will be, but here are some samples to show what is possible:

Rayleigh-Taylor instability

The featured visualization shows a Pseudocolor plot that highlights a Rayleigh-Taylor instability caused by two mixing fluids.

Solid geometry with volume rendering

Image courtesy of Patrick Chris Fragile, Ph.D., UC Santa Barbara.

Multiple plots

VisIt can put multiple plots in a single visualization, allowing you to visualize data in multiple ways. The featured image shows four representations of Mount St. Helens elevation data from a DEM file. The DEM dataset used to create the featured image was obtained from the USGS.

Do you support annotation of graphics with your topic map engine?

Camel Essential Components

Filed under: Apache Camel,Integration — Patrick Durusau @ 7:56 pm

Camel Essential Components by Christian Posta. (new Dzone Refcard)

From the webpage:

What is Apache Camel?

Camel is an open-source, lightweight, integration library that allows your applications to accomplish intelligent routing, message transformation, and protocol mediation using the established Enterprise Integration Patterns and out-of-the-box components with a highly expressive Domain Specific Language (Java, XML, or Scala). With Camel you can implement integration solutions as part of an overarching ESB solution, or as individual routes deployed to any container such as Apache Tomcat, Apache ServiceMix, JBoss AS, or even a stand-alone java process.

Why use Camel?

Camel simplifies systems integrations with an easy-to-use DSL to create routes that clearly identify the integration intentions and endpoints. Camel’s out of the box integration components are modeled after the Enterprise Integration Patterns cataloged in Gregor Hohpe and Bobby Woolf’s book (http://www.eaipatterns.com). You can use these EIPs as pre-packaged units, along with any custom processors or external adapters you may need, to easily assemble otherwise complex routing and transformation routes. For example, this route takes an XML message from a queue, does some processing, and publishes to another queue:

No explicit handling of subject identity but that’s what future releases are for. 😉

Research Data Curation Bibliography

Filed under: Archives,Curation,Data Preservation,Librarian/Expert Searchers,Library — Patrick Durusau @ 7:56 pm

Research Data Curation Bibliography (version 2) by Charles W. Bailey.

From the introduction:

The Research Data Curation Bibliography includes selected English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. For broader coverage of the digital curation literature, see the author's Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which presents over 650 English-language articles, books, and technical reports.

The "digital curation" concept is still evolving. In "Digital Curation and Trusted Repositories: Steps toward Success," Christopher A. Lee and Helen R. Tibbo define digital curation as follows:

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts.

This bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, interviews, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings. Coverage of technical reports is very selective.

Most sources have been published from 2000 through 2012; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Such links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

An archive of prior versions of the bibliography is available.

If you are a beginning library student, take the time to know the work of Charles Bailey. He has consistently made a positive contribution for researchers from very early in the so-called digital revolution.

To the extent that you want to design topic maps for data curation, long or short term, the 200+ items in this bibliography will introduce you to some of the issues you will be facing.

