Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 19, 2011

OpenHelix

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

OpenHelix

From the about page:

More efficient use of the most relevant resources means quicker and more effective research. OpenHelix empowers researchers by

  • providing a search portal to find the most relevant genomics resource and training on those resources.
  • distributing extensive and effective tutorials and training materials on the most powerful and popular genomics resources.
  • contracting with resource providers to provide comprehensive, long-term training and outreach programs.

If you are interested in learning about genomics research, short of returning to graduate or medical school, this site will rank high on your list of resources.

It offers a very impressive gathering of both commercial and non-commercial resources under one roof.

I haven’t taken any of the tutorials produced by OpenHelix and so would appreciate comments from anyone who has.

Bioinformatics is an important subject area for topic maps for several reasons:

First, the long-term (comparatively speaking) interest in, and actual use of, computers in biology indicates there is a need for information for which other people will spend money. The key phrase in that sentence is “…for which other people will spend money.” You are already spending your time working on topic maps, so it is important to identify other people who are willing to part with cash for your software or assistance. Bioinformatics is a field where that is already known to happen: other people spend their money on software and expertise.

Second, for all of the progress on identification issues in bioinformatics, any bioinformatics journal you pick up will have references to the need for greater integration of biological resources. There is plenty of opportunity now and, as far as anyone can tell, for many tomorrows to come.

Third, for good or ill, any progress in the field attracts a disproportionate amount of coverage. The public rarely reads or sees follow-up coverage when discoveries turn out to be less than what was initially reported. And it is not only health professionals who hear such news, so it would be good PR for topic maps.

Information extraction from chemical patents

Filed under: Cheminformatics,Patents — Patrick Durusau @ 8:10 pm

Information extraction from chemical patents by David M. Jessop.

Abstract:

The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature. Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use. PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye – 4444 reactions are extracted with a precision of 78% and recall of 64% with regards to determining the identity and amount of reactants employed and an accuracy of 92% with regards to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information. Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.
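The Hearst patterns mentioned in the abstract are lexical templates such as “X such as Y and Z” that signal hyponymy. As a rough illustration only (a regex sketch, not the thesis’s actual implementation, which works over chemical named entities rather than raw text):

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym>(, <hyponym>)* (and|or) <hyponym>"
# Simplified: the hypernym here is a single word, and no entity recognition is done.
PATTERN = re.compile(
    r"(?P<hypernym>\w+) such as (?P<hyponyms>[\w ,-]+?(?:,? (?:and|or) [\w-]+))"
)

def extract_hyponyms(sentence):
    """Return (hypernym, [hyponyms]) pairs found by the 'such as' pattern."""
    results = []
    for m in PATTERN.finditer(sentence):
        hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", m.group("hyponyms"))
        results.append((m.group("hypernym"), [h for h in hyponyms if h]))
    return results

print(extract_hyponyms(
    "The mixture contains alcohols such as methanol, ethanol and isopropanol."
))
# [('alcohols', ['methanol', 'ethanol', 'isopropanol'])]
```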

I am curious how the system would perform against U.S. Patent Office literature.

Perhaps more to the point, how would it compare to commercial chemical indexing services?

It is always possible to duplicate what has already been done.

More interesting: what are current systems, commercial or otherwise, lacking that could become a value-add proposition?

How would you poll users? In what journals? What survey instruments or practices would you use?

Journal of Biomedical Semantics

Filed under: Bioinformatics,Biomedical,Semantics — Patrick Durusau @ 8:10 pm

Journal of Biomedical Semantics

From Aims and Scope:

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:

Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.

Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.

Research in biology and biomedicine relies on various types of biomedical data, information, and knowledge, represented in databases with experimental and/or curated data, ontologies, literature, taxonomies, and so on. Semantics is essential for accessing, integrating, and analyzing such data. The ability to explicitly extract, assign, and manage semantic representations is crucial for making computational approaches in the biomedical domain productive for a large user community.

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain, and comprises practical and theoretical advances in biomedical semantics research with implications for data analysis.

In recent years, the availability and use of electronic resources representing biomedical knowledge has greatly increased, covering ontologies, taxonomies, literature, databases, and bioinformatics services. These electronic resources contribute to advances in the biomedical domain and require interoperability between them through various semantic descriptors. In addition, the availability and integration of semantic resources is a key part in facilitating semantic web approaches for life sciences leading into reasoning and other advanced ways to analyse biomedical data.

Random items to whet your appetite:

The 2nd DBCLS BioHackathon: interoperable bioinformatics Web services for integrated applications
Toshiaki Katayama, Mark D Wilkinson, Rutger Vos, Takeshi Kawashima, Shuichi Kawashima, Mitsuteru Nakao, Yasunori Yamamoto, Hong-Woo Chun, Atsuko Yamaguchi, Shin Kawano, Jan Aerts, Kiyoko F Aoki-Kinoshita, Kazuharu Arakawa, Bruno Aranda, Raoul JP Bonnal, José M Fernández, Takatomo Fujisawa, Paul MK Gordon, Naohisa Goto, Syed Haider, Todd Harris, Takashi Hatakeyama, Isaac Ho, Masumi Itoh, Arek Kasprzyk, Nobuhiro Kido, Young-Joo Kim, Akira R Kinjo, Fumikazu Konishi, Yulia Kovarskaya. Journal of Biomedical Semantics 2011, 2:4 (2 August 2011)

Simple tricks for improving pattern-based information extraction from the biomedical literature
Quang Nguyen, Domonkos Tikk, Ulf Leser. Journal of Biomedical Semantics 2010, 1:9 (24 September 2010)

Rewriting and suppressing UMLS terms for improved biomedical term identification
Kristina M Hettne, Erik M van Mulligen, Martijn J Schuemie, Bob JA Schijvenaars, Jan A Kors. Journal of Biomedical Semantics 2010, 1:5 (31 March 2010)

Journal of Computing Science and Engineering

Filed under: Bioinformatics,Computer Science,Linguistics,Machine Learning,Record Linkage — Patrick Durusau @ 8:09 pm

Journal of Computing Science and Engineering

From the webpage:

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.

Abstract:

We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by smart phone camera while holding smart phone by an arbitrary angle is rotated by the detected angle, as if the image was taken by holding a smart phone horizontally. Binarization is only performed once on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by user’s marker-line placed over the region of interest in binarized image via smart phone touch screen. Then, text segmentation utilizes the data of connected components received in the binarization step, and cuts the string into individual images for designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text area included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.
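The Niblack and Sauvola algorithms used as baselines in that abstract are standard adaptive-thresholding methods. A minimal sketch using scikit-image’s implementations (not the paper’s code):

```python
# Minimal sketch of the Niblack and Sauvola adaptive thresholding the paper
# benchmarks against, using scikit-image rather than the authors' implementation.
from skimage import data
from skimage.filters import threshold_niblack, threshold_sauvola

image = data.page()  # sample grayscale image of scanned text

window = 25  # local neighbourhood size; k below is a common choice for Niblack
binary_niblack = image > threshold_niblack(image, window_size=window, k=0.8)
binary_sauvola = image > threshold_sauvola(image, window_size=window)

print(binary_niblack.shape, binary_sauvola.mean())
```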

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.

Abstract:

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
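The intuition behind the two components is easy to sketch in their non-secure form: group candidate records by a phonetic or other blocking key, then compare only within blocks using an edit-distance threshold. A toy illustration, not the authors’ privacy-preserving protocol:

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def block_key(name):
    # Crude stand-in for a phonetic code such as Soundex or Metaphone.
    return name[0].upper() if name else ""

def link(records_a, records_b, max_dist=2):
    blocks = defaultdict(list)
    for r in records_b:
        blocks[block_key(r)].append(r)
    return [(a, b) for a in records_a
            for b in blocks[block_key(a)]
            if levenshtein(a.lower(), b.lower()) <= max_dist]

print(link(["Jonathan", "Smith"], ["Jonathon", "Smyth", "Jones"]))
# [('Jonathan', 'Jonathon'), ('Smith', 'Smyth')]
```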

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.

Abstract:

Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative, by allowing useful knowledge to be extracted and transferred from data in auxiliary domains helps counter the lack of data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse left for not doing so is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

Journal of Bioinformatics and Computational Biology (JBCB)

Filed under: Bioinformatics,Computational Biology — Patrick Durusau @ 8:08 pm

Journal of Bioinformatics and Computational Biology (JBCB)

From the Aims and Scope page:

The Journal of Bioinformatics and Computational Biology aims to publish high quality, original research articles, expository tutorial papers and review papers as well as short, critical comments on technical issues associated with the analysis of cellular information.

The research papers will be technical presentations of new assertions, discoveries and tools, intended for a narrower specialist community. The tutorials, reviews and critical commentary will be targeted at a broader readership of biologists who are interested in using computers but are not knowledgeable about scientific computing, and equally, computer scientists who have an interest in biology but are not familiar with current thrusts nor the language of biology. Such carefully chosen tutorials and articles should greatly accelerate the rate of entry of these new creative scientists into the field.

To give you an idea of the type of content you will find, consider:

A RE-EVALUATION OF BIOMEDICAL NAMED ENTITY–TERM RELATIONS by TOMOKO OHTA, SAMPO PYYSALO, JIN-DONG KIM, JUN’ICHI TSUJII. Volume: 8, Issue: 5(2010) pp. 917-928 DOI: 10.1142/S0219720010005014.

Abstract:

Text mining can support the interpretation of the enormous quantity of textual data produced in biomedical field. Recent developments in biomedical text mining include advances in the reliability of the recognition of named entities (NEs) such as specific genes and proteins, as well as movement toward richer representations of the associations of NEs. We argue that this shift in representation should be accompanied by the adoption of a more detailed model of the relations holding between NEs and other relevant domain terms. As a step toward this goal, we study NE–term relations with the aim of defining a detailed, broadly applicable set of relation types based on accepted domain standard concepts for use in corpus annotation and domain information extraction approaches.

as representative content.

Enjoy!

December 18, 2011

The best way to get value from data is to give it away

Filed under: Data,Marketing — Patrick Durusau @ 8:49 pm

The best way to get value from data is to give it away from the Guardian.

From the article:

Last Friday I wrote a short piece for the Datablog giving some background and context for a big open data policy package that was announced yesterday morning by Vice President Neelie Kroes. But what does the package contain? And what might the new measures mean for the future of open data in Europe?

The announcement contained some very strong language in support of open data. Open data is the new gold, the fertile soil out of which a new generation of applications and services will grow. In a networked age, we all depend on data, and opening it up is the best way to realise its value, to maximise its potential.

There was little ambiguity about the Commissioner’s support for an ‘open by default’ position for public sector information, nor for her support for the open data movement, for “those of us who believe that the best way to get value from data is to give it away“. There were props to Web Inventor Tim Berners-Lee, the Open Knowledge Foundation, OpenSpending, WheelMap, and the Guardian Datablog, amongst others.

Open government data at no or low cost represents a real opportunity for value-add data vendors, particularly those using topic maps.

Topic maps enable the creation of data products that can be easily integrated with data products created from different perspectives.

Not to mention reuse of data analysis to create new products to respond to public demand.

For example, after the recent misfortunes with flooding and nuclear reactors in Japan, there was an upsurge of interest in the safety of reactors in other countries. The information provided by news outlets was equal parts summary and reassurance. A data product that mapped together known issues with the plants in Japan, their design, inspection reports on reactors in some locale, plus maps of their locations, etc., would have found a ready audience.

Creating a data product like that, in time to catch the increase in public interest, would depend on prior analysis of large amounts of public data. Analysis that could be re-used for a variety of purposes.

Introducing the Events API

Filed under: Data Source,New York,News — Patrick Durusau @ 8:48 pm

Introducing the Events API by Brian Balser.

From the post:

This past November, The New York Times launched the Arts & Entertainment Guide, an interactive guide to noteworthy cultural events in and around New York City. The application lets you browse through a hand-selected listing of events, customizing the view based on date range, category and location.

At our annual Hack Day we made the Event Listings API, used by the interactive guide, publicly available to the developer community on the NYTimes Developer Network. The API supports three types of search: spatial, faceted and full-text. Each can be used separately or in conjunction in order to find events by different sets of criteria.

If a metro area population of twenty-two million doesn’t sound like a large enough market, consider that New York City is projected to have fifty million visitors in 2012.

Topic maps that merge data from this feed with data from conference websites seem a likely early use of this API. But more creative uses are certainly possible.
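As a rough sketch of the client code involved (the endpoint path, parameters, and field names below are assumptions based on the post, not verified against the NYTimes Developer Network documentation; an API key is required):

```python
import requests

# Hypothetical endpoint and parameter names; check the NYT developer docs before use.
BASE = "https://api.nytimes.com/svc/events/v2/listings.json"

def nearby_events(api_key, lat, lon, radius=1000, query=None):
    """Spatial plus optional full-text search against the event listings endpoint."""
    params = {"api-key": api_key, "ll": f"{lat},{lon}", "radius": radius}
    if query:
        params["query"] = query
    resp = requests.get(BASE, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json().get("results", [])

# for event in nearby_events("YOUR_KEY", 40.7484, -73.9857, query="jazz"):
#     print(event.get("event_name"), event.get("venue_name"))
```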

What would you suggest?

Subway Map Visualization jQuery Plugin

Filed under: JQuery,Mapping,Maps,Visualization — Patrick Durusau @ 8:47 pm

Subway Map Visualization jQuery Plugin by Nik Kalyani.

From the post:

I have always been fascinated by the visual clarity of the London Underground map. Given the number of cities that have adopted this mapping approach for their own subway systems, clearly this is a popular opinion. At a conference some years back, I saw a poster for the Yahoo! Developer Services. They had taken the concept of a subway map and applied it to create a YDN Metro Map. Once again, I was in awe of the visual clarity of this map in helping one understand the various Yahoo! services and how they inter-related with each other. I thought it would be awesome if there were a pseudo-programmatic way in which to render such maps to convey real-world ecosystems. A few examples I can think of:

  • University departments, offices, student groups
  • Government
  • Open Source projects
  • Internet startups by category

More examples on this blog: Ten Examples of the Subway Map Metaphor.

Fast-forward to now. Finally, with the advent of HTML5 <canvas> element and jQuery, I felt it was now possible to implement this in a way that with a little bit of effort, anyone who knows HTML can easily create a subway map. I felt a jQuery plugin was the way to go as I had never created one before and also it seemed like the most well-suited for the task.

A complete step-by-step example follows, and it is the sort of documentation that, while difficult to write, saves every user of the software time further down the road.

The plug-in has any number of uses: a traditional public transportation map for your locale or, as the author uses it, a map that lays out a software project.

If you use this for a software project, you will need to make your own icons for derailment, track hazards and the causes of the same. 😉

Arts, Humanities, and Complex Networks…

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 8:46 pm

Arts, Humanities, and Complex Networks — 3rd Leonardo satellite symposium at NetSci2012

Dates:

The deadline for applications is March 16, 2012.
Decisions for acceptance will be sent out by March 30, 2012.
The symposium will take place at Northwestern University near Chicago, IL on the shores of Lake Michigan on June 19, 2012.

From the webpage:

We are pleased to announce the third Leonardo satellite symposium at NetSci2012 on Arts, Humanities, and Complex Networks. The aim of the symposium is to foster cross-disciplinary research on complex systems within or with the help of arts and humanities.

The symposium will highlight arts and humanities as an interesting source of data, where the combined experience of arts, humanities research, and natural science makes a huge difference in overcoming the limitations of artificially segregated communities of practice.

Furthermore, the symposium will focus on striking examples, where artists and humanities researchers make an impact within the natural sciences. By bringing together network scientists and specialists from the arts and humanities we strive for a better understanding of networks and their visualizations in general.

The overall mission is to bring together pioneer work, leveraging previously unused potential by developing the right questions, methods, and tools, as well as dealing with problems of information accuracy and incompleteness. Running parallel to the NetSci2012 conference, the symposium will also provide a unique opportunity to mingle with leading researchers and practitioners of complex network science, potentially sparking fruitful collaborations.

In addition to keynotes and interdisciplinary discussion, we are looking for a number of contributed talks. Selected papers will be published in print in a Special Section of Leonardo Journal (MIT Press), as well as online in Leonardo Transactions.

For previous edition papers and video presentations please visit the following URLs:
2010 URL: http://artshumanities.netsci2010.net
2011 URL: http://artshumanities.netsci2011.net

This sounds seriously cool!

You do realize the graphs and networks of the “hard” sciences are impoverished compared to the networks humanists encounter on a daily basis? In the humanities, some of the nodes and edges can only be deduced from their impact on other nodes and edges, and are themselves subject to being influenced by other unseen and perhaps unknowable nodes and edges.

Still, it can be instructive to simplify a graph from the humanities for representation by hard science techniques. At least we can gain some sense of what has to be thrown away from the humanities side.

Wandora – New Release 2011-12-07

Filed under: Topic Map Software,Wandora — Patrick Durusau @ 8:45 pm

Wandora – New Release 2011-12-07

A new release of Wandora is out!

I haven’t tested the new features but I am sure the project would appreciate any comments you have.

Some early remarks:

The “version” of Wandora should appear in the file name, which would be helpful for those of us with multiple versions on our hard drives.

There should be more detailed release notes, for bugs as well as new features.

I may be overlooking it but if a formal bug/feature tracking system is being used (other than the forum), it would be useful to have at least a read-only link to that tracking system.

Math Journals: Theoretical CS, Graphs, Combinatorics, Combinatorics Optimization

Filed under: Combinatorics,Graphs — Patrick Durusau @ 8:44 pm

Math Journals: Theoretical CS, Graphs, Combinatorics, Combinatorics Optimization

The question was asked at Theoretical Computer Science about open access journals, especially those that don’t charge authors high fees for making their work freely available.

I have collated the journals mentioned below (this probably duplicates other, more complete lists, but I thought it might be handy here as well):

The Chicago Journal of Theoretical Computer Science
Leibniz International Proceedings in Informatics (LIPIcs)
The Electronic Journal of Combinatorics
Electronic Proceedings in Theoretical Computer Science
Journal of Computational Geometry
Journal of Graph Algorithms and Applications
Logical Methods in Computer Science
Theory of Computing

As of 18 December 2011.

On the Treewidth of Dynamic Graphs

Filed under: Dynamic Graphs,Graphs — Patrick Durusau @ 8:43 pm

On the Treewidth of Dynamic Graphs by Bernard Mans, Luke Mathieson.

Abstract:

Dynamic graph theory is a novel, growing area that deals with graphs that change over time and is of great utility in modelling modern wireless, mobile and dynamic environments. As a graph evolves, possibly arbitrarily, it is challenging to identify the graph properties that can be preserved over time and understand their respective computability.

In this paper we are concerned with the treewidth of dynamic graphs. We focus on metatheorems, which allow the generation of a series of results based on general properties of classes of structures. In graph theory two major metatheorems on treewidth provide complexity classifications by employing structural graph measures and finite model theory. Courcelle’s Theorem gives a general tractability result for problems expressible in monadic second order logic on graphs of bounded treewidth, and Frick & Grohe demonstrate a similar result for first order logic and graphs of bounded local treewidth.

We extend these theorems by showing that dynamic graphs of bounded (local) treewidth where the length of time over which the graph evolves and is observed is finite and bounded can be modelled in such a way that the (local) treewidth of the underlying graph is maintained. We show the application of these results to problems in dynamic graph theory and dynamic extensions to static problems. In addition we demonstrate that certain widely used dynamic graph classes naturally have bounded local treewidth.

There are going to be topic maps that are static editorial artifacts, similar to a back-of-book index, which can be modeled as traditional graphs. But merging two or more such graphs, or topic maps that are designed as dynamic information sources, will require different analysis. This paper is a starting place for work on those issues.
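For the static case, you can already experiment with treewidth bounds today; NetworkX ships approximation heuristics. A sketch (assuming NetworkX 2.x and a toy term/page graph standing in for a back-of-book index):

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# A back-of-book index modeled as a term/page graph, purely for illustration.
G = nx.Graph()
G.add_edges_from([
    ("topic maps", "p. 12"), ("topic maps", "p. 47"),
    ("merging", "p. 47"), ("merging", "p. 48"),
    ("subject identity", "p. 12"), ("subject identity", "p. 48"),
])

width, decomposition = treewidth_min_degree(G)  # heuristic upper bound
print("approximate treewidth:", width)
print("some bags of the tree decomposition:", list(decomposition.nodes)[:3])
```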

Error Handling, Validation and Cleansing with Semantic Types and Mappings

Filed under: Expressor,Mapping,Semantics,Types — Patrick Durusau @ 8:41 pm

Error Handling, Validation and Cleansing with Semantic Types and Mappings by Michael Tarallo.

From the post:

expressor ETL applications can set up data validation rules and error handling in a few ways. The traditional approach with many ETL tools is to build in the rules using the various ETL operators. A more streamlined approach is to also use the power of expressor Semantic Mappings and Semantic Types.

  • Semantic Mappings specify how a variety of characteristics are to be handled when string, number, and date-time data types are mapped from the physical schema (your source) to the logical semantic layer known as the Semantic Type.
  • Semantic Types allow you to define, in business terms, how you want the data and the data model to be represented.

Both of these methods provide a means of attribute data validation and of invoking corrective actions if rules are violated.

  • Data Validation rules can be in the form of pattern matching, value ranges, character lengths, formatting, currency and other specific data type constraints.
  • Corrective Actions can be in the form of null, default and correction value replacements as well as specific operator handling to either skip records or reject them to another operator.

NOTE: Semantic Mapping rules are applied first before Semantic Type rules.

Read more here:

I am still trying to find time to test at least the community edition of the software.

What “extra” time I have now is being soaked up configuring/patching Eclipse to build Nutch, to correct a known problem between Nutch and Solr. I suspect you could sell a packaged version of open source software with all of the paths and classes hard-coded into the software. No more setting paths, no inconsistent library versions, etc. Just unpack and run, with data stored in a separate directory. When a new version comes out, simply rm -R the install directory and unpack the new one. (That should also include the “.” files.) Configuration/patching isn’t a good use of anyone’s time. (Unless you want to sell the results. 😉)

But I will get to it! Unless someone beats me to it and wants to send me a link to their post that I can cite and credit on my blog.


Two things I would change about Michael’s blog:

Prerequisite: Knowledge of expressor Studio and dataflows. You can find tutorials and documentation here

To read:

Prerequisites:

  • expressor software (community or 30-day free trial) here.
  • Knowledge of expressor Studio and dataflows. You can find tutorials and documentation here

And, well, this is not about Michael’s blog but about the expressor download page: if the desktop/community edition is “expressor Studio,” then call it that on the download page.

Don’t use different names for a software package and expect users to sort it out. Not if you want to encourage downloads and sales anyway. Surveys show you have to wait until they are paying customers to start abusing them. 😉

1st International ICST conference on No SQL Databases and Social Applications

Filed under: Conferences,NoSQL,Topic Maps — Patrick Durusau @ 8:40 pm

1st International ICST conference on No SQL Databases and Social Applications June 6-8, 2012, Berlin, Germany.

Important Dates:

Submission deadline 15 February 2012
Notification and Registration opens 31 March 2012
Camera-ready deadline 30 April 2012
Start of Conference 6 June 2012

From the call for papers:

The INOSA conference in Berlin / Germany focuses on breakthroughs, new concepts and applications for developing, operating and optimizing social software systems, as well analysing and exploiting the data created by these systems.

Computer Science is intertwined with users from its early beginnings. There is hardly a software just for itself. Each software is created for user needs. This process has now reached the private life. The Internet, Web and wireless communication has fundamentally changed our daily communication and how information flows in commercial and private networks. New issues came up with this trend: Non-SQL databases offer better feature for a number of social applications. Semantic systems helps to bridge the gap between words and their meaning. Data mining and inference helps to extract implicit facts. P2P systems are proper answers to an increasing amount of data that shall be exchanged. In 2010, more smart phones than computer are sold. Mobile devices and context aware systems (e.g. locations based systems) play a major role for social applications. New threats accompany these trends as well, though. Social hacking, lost of privacy, vague or complex copy rights are just one of them.

Lutz Maicher is on the organizing committee so it would be nice to see some topic map papers at the conference.

254B, Notes 1: Basic theory of expander graphs

Filed under: Graphs,Mathematics — Patrick Durusau @ 8:39 pm

254B, Notes 1: Basic theory of expander graphs

Tough sledding for non-mathematicians, but see the citation of the Wikipedia article in the first sentence for the importance of this area to advanced topic maps.

The objective of this course is to present a number of recent constructions of expander graphs, which are a type of sparse but “pseudorandom” graph of importance in computer science, the theory of random walks, geometric group theory, and in number theory. The subject of expander graphs and their applications is an immense one, and we will not possibly be able to cover it in full in this course. In particular, we will say almost nothing about the important applications of expander graphs to computer science, for instance in constructing good pseudorandom number generators, derandomising a probabilistic algorithm, constructing error correcting codes, or in building probabilistically checkable proofs. For such topics, I recommend the survey of Hoory-Linial-Wigderson. We will also only pass very lightly over the other applications of expander graphs, though if time permits I may discuss at the end of the course the application of expander graphs in finite groups such as {SL_2(F_p)} to certain sieving problems in analytic number theory, following the work of Bourgain, Gamburd, and Sarnak.

Instead of focusing on applications, this course will concern itself much more with the task of constructing expander graphs. This is a surprisingly non-trivial problem. On one hand, an easy application of the probabilistic method shows that a randomly chosen (large, regular, bounded-degree) graph will be an expander graph with very high probability, so expander graphs are extremely abundant. On the other hand, in many applications, one wants an expander graph that is more deterministic in nature (requiring either no or very few random choices to build), and of a more specialised form. For the applications to number theory or geometric group theory, it is of particular interest to determine the expansion properties of a very symmetric type of graph, namely a Cayley graph; we will also occasionally work with the more general concept of a Schreier graph. It turns out that such questions are related to deep properties of various groups {G} of Lie type (such as {SL_2({\bf R})} or {SL_2({\bf Z})}), such as Kazhdan’s property (T), the first nontrivial eigenvalue of a Laplacian on a symmetric space {G/\Gamma} associated to {G}, the quasirandomness of {G} (as measured by the size of irreducible representations), and the product theory of subsets of {G}. These properties are of intrinsic interest to many other fields of mathematics (e.g. ergodic theory, operator algebras, additive combinatorics, representation theory, finite group theory, number theory, etc.), and it is quite remarkable that a single problem – namely the construction of expander graphs – is so deeply connected with such a rich and diverse array of mathematical topics. (Perhaps this is because so many of these fields are all grappling with aspects of a single general problem in mathematics, namely when to determine whether a given mathematical object or process of interest “behaves pseudorandomly” or not, and how this is connected with the symmetry group of that object or process.)

(There are also other important constructions of expander graphs that are not related to Cayley or Schreier graphs, such as those graphs constructed by the zigzag product construction, but we will not discuss those types of graphs in this course, again referring the reader to the survey of Hoory, Linial, and Wigderson.)
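One way to get a concrete numerical feel for “expansion,” without any of the machinery above, is the spectral gap of a random regular graph, which the probabilistic argument says should be large with high probability. A quick sketch of my own (not part of the course notes):

```python
import networkx as nx
import numpy as np

# Random 3-regular graphs are expanders with high probability; the spectral gap
# d - lambda_2 of the adjacency matrix is one standard measure of expansion.
d, n = 3, 1000
G = nx.random_regular_graph(d, n, seed=42)

eigenvalues = np.sort(np.linalg.eigvalsh(nx.adjacency_matrix(G).toarray()))
lambda_2 = eigenvalues[-2]  # second-largest adjacency eigenvalue
print("spectral gap d - lambda_2 =", d - lambda_2)
print("Ramanujan bound 2*sqrt(d-1) =", 2 * np.sqrt(d - 1))
```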

Follow with:

254B, Notes 2: Cayley graphs and Kazhdan’s property (T)

254B, Notes 3: Quasirandom groups, expansion, and Selberg’s 3/16 theorem

Introducing Hadoop in 20 pages

Filed under: Hadoop,MapReduce — Patrick Durusau @ 8:36 pm

Introducing Hadoop in 20 pages by Prashant Sharma.

Getting started in hadoop for a newbie is a non trivial task, with amount of knowledge base available a significant amount of effort is gone in figuring out, where and how should one start exploring this field. Introducing hadoop in 20 pages is a concise document to briefly introduce just the right information in right amount, before starting out in-depth in this field. This document is intended to be used as a first and shortest guide to both understand and use Map-Reduce for building distributed data processing applications.

Well, counting the annexes it’s 35 pages but still useful. Could use some copy editing.

Disappointing, because an introduction to the entire Hadoop ecosystem, carrying a single example, even an inverted index of a text, would be a better exercise at this point in Hadoop’s development. Two versions: one with the code examples at the end, for people who want a high-level view, and one with the code inline and commented, for people who want code to follow along.

December 17, 2011

Open Source News: “Hating Microsoft/IBM/Oracle/etc is not a strategy”

Filed under: Open Source,RDF,Topic Maps — Patrick Durusau @ 7:55 pm

Publishing News: “Hating Amazon is not a strategy”

Sorry, the parallel to the open source community, with its virgins, hermits and saints who regularly abuse the vendors that support most of the successful open source projects, either directly or indirectly, was just too obvious to pass up. Apologies to Don Linn for stealing his line.

By the same token, hating RDF isn’t a strategy either. 😉

Which is why I have come to think that RDF instances should be consumed and processed as seems most appropriate to the situation. RDF is just another data format, and what we make of it is an answer to be documented as part of our processing of that data, just as with any other data source, most of which are not going to be RDF.

SDShare

Filed under: RDF,SDShare,Semantic Web,Semantics — Patrick Durusau @ 7:54 pm

SDShare (PDF file)

According to the blog entry dated 16 December 2011, with a pointer to this presentation, this is a “recent” presentation. But the presentation has a copyright claim dated 2010. So it is either nearly a year old or it is one of those timeless artifacts on the web.

The ones that have no reliable indication of a date of composition or publishing. Appropriate for the ephemera that make up the eternal “now” of the WWW. Less appropriate for important technology documents, particularly ones that aspire to be ISO standards in the near future.

The slide deck is a good overview of the goals of SDShare, if a bit short on actual details. I would suggest using the slide deck to interest others in learning more and then passing on to them the original SDShare document.

I would quibble with the claim at slide 34 that RDF data makes “…merging simple.” So far as I know, RDF never specifies what happens when you have multiple distinct and perhaps inconsistent values for the same property. Perhaps I have overlooked that in the plethora of RDF standards, revisions and retreats.
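A small illustration of the quibble, a sketch using rdflib that simply follows RDF graph-merge semantics: merging two graphs that assert different values for the same property just leaves you with both triples, and deciding what to do about the conflict is your problem.

```python
from rdflib import Graph

doc_a = """
@prefix ex: <http://example.org/> .
ex:reactor1 ex:operatingStatus "online" .
"""
doc_b = """
@prefix ex: <http://example.org/> .
ex:reactor1 ex:operatingStatus "shut down for inspection" .
"""

g = Graph()
g.parse(data=doc_a, format="turtle")
g.parse(data=doc_b, format="turtle")  # RDF merge: the union of the triples

for s, p, o in g:
    print(s, p, o)
# Both (inconsistent) status values survive the merge; RDF itself says nothing
# about which one, if either, is correct.
```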

P2PU

Filed under: Education,Marketing,Topic Maps — Patrick Durusau @ 7:53 pm

P2PU

From the website:

At P2PU, people work together to learn a particular topic by completing tasks, assessing individual and group work, and providing constructive feedback.

I just ran across the site today but was wondering whether anyone else has used it or something similar. In order to grow the usage of topic maps, some sort of classes need to appear on a regular basis, ones that are more widely available than graduate courses at some institutions.

Good idea? Bad idea? Comments?

Tall Big Data, Wide Big Data

Filed under: BigData,Sampling,Tall Data,Wide Data — Patrick Durusau @ 7:53 pm

Tall Big Data, Wide Big Data by Luis Apiolaza.

From the post:

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present and not once I heard “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless that worry was across statistical packages or, more precisely, it went beyond statistical software. How will we able to even store the data from something like the Square Kilometer Array, let alone analyze it?

Luis makes a distinction between “tall data” (a large number of observations but few predictors per item) and “wide data” (a small number of observations but a large number of predictors per item). I am not certain that sampling works with wide data.

Whether sampling works for wide data is a question that can be settled by experimentation. Takers?
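Here is one way such an experiment might be set up, a sketch rather than a finished study: fit the same regularized model on a row-sample of a tall data set and of a wide data set, and compare how much predictive accuracy the sampling costs in each case.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def sampling_penalty(n, p, frac=0.1, seed=0):
    """Test-set R^2 lost by fitting on a `frac` row-sample instead of all rows."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    y = X @ beta + rng.normal(size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    full = Ridge().fit(X_tr, y_tr).score(X_te, y_te)
    k = max(2, int(frac * len(y_tr)))
    idx = rng.choice(len(y_tr), size=k, replace=False)
    sampled = Ridge().fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
    return full - sampled

print("tall (n=20000, p=20):  ", sampling_penalty(20000, 20))
print("wide (n=200, p=2000):  ", sampling_penalty(200, 2000))
```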

R package DOSE released

Filed under: Enrichment,R,Similarity — Patrick Durusau @ 7:53 pm

R package DOSE released

From the post:

Disease Ontology (DO) provides an open source ontology for the integration of biomedical data that is associated with human disease. DO analysis can lead to interesting discoveries that deserve further clinical investigation.

DOSE was designed for semantic similarity measure and enrichment analysis.

Four information content (IC)-based methods, proposed by Resnik, Jiang, Lin and Schlicker, and one graph structure-based method, proposed by Wang, were implemented. The calculation details can be referred to the vignette of R package GOSemSim.

Hypergeometric test was implemented for enrichment analysis.

I like that “…semantic similarity measure and enrichment analysis.”

What semantic similarity measures are you using?
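If you have not met the IC-based measures the package implements, the core formulas are short: with IC(t) = -log p(t) and MICA the most informative common ancestor of two terms, Resnik similarity is IC(MICA) and Lin similarity is 2·IC(MICA) / (IC(t1) + IC(t2)). A toy sketch over a hand-made ontology (not the DOSE implementation):

```python
import math

# Toy ontology: child -> parents, plus cumulative annotation counts (term plus
# descendants) used to estimate p(t), so p(t) grows as you move up the ontology.
parents = {"cardiomyopathy": {"heart disease"}, "arrhythmia": {"heart disease"},
           "heart disease": {"disease"}, "disease": set()}
counts = {"cardiomyopathy": 5, "arrhythmia": 7, "heart disease": 20, "disease": 100}
total = counts["disease"]

def ancestors(t):
    seen, stack = {t}, [t]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

ic = {t: -math.log(c / total) for t, c in counts.items()}

def resnik(a, b):
    return max(ic[t] for t in ancestors(a) & ancestors(b))

def lin(a, b):
    return 2 * resnik(a, b) / (ic[a] + ic[b])

print(resnik("cardiomyopathy", "arrhythmia"), lin("cardiomyopathy", "arrhythmia"))
```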

SolrEntity Processor

Filed under: Solr — Patrick Durusau @ 7:52 pm

SolrEntity Processor

From the web page:

This EntityProcessor imports data from different Solr instances and cores. The data is retrieved based on a specified (filter) query. This EntityProcessor is useful in cases where you want to copy your Solr index and slightly modify the data in the target index. In some cases Solr might be the only place where all data is available. The SolrEntityProcessor can only copy fields that are stored in the source index.

Due to appear in Solr 3.6! So, grab a nightly build for now if you want to start learning this feature.

Essential Keys

Filed under: Competence — Patrick Durusau @ 7:52 pm

Essential Keys

From the post:

I often wonder how ‘IT Professionals’ get away with what they do without being able to execute the simplest computer tasks. Seeing a 200$/hour consultant fiddling with the mouse cursor, repeatedly missing keys and taking seemingly endless minutes just to log in to a Siebel application makes me cringe.

So I decided to post the obvious: Some basic keystrokes which every IT pro should be able to reproduce. This is just a basic list, please feel free to use the comments to expand it.

I wonder if there needs to be a series of one-page “tests” for IT consultants. Turning these tasks into questions would be one. Simple SQL tasks would be another. The same could be done for topic maps as well.

SQL to MongoDB: An Updated Mapping

Filed under: Aggregation,MongoDB,NoSQL — Patrick Durusau @ 7:52 pm

SQL to MongoDB: An Updated Mapping from Kristina Chodorow.

From the post:

The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.

In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.

If you are interested in MongoDB-based solutions, this will be very interesting.
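As a taste of what the mapping looks like from Python, here is a sketch against a hypothetical orders collection using pymongo and the standard aggregation operators (collection and field names are assumptions):

```python
from pymongo import MongoClient

db = MongoClient()["shop"]  # hypothetical database name

# SQL: SELECT cust_id, SUM(amount) AS total FROM orders
#      WHERE status = 'A' GROUP BY cust_id ORDER BY total DESC;
pipeline = [
    {"$match": {"status": "A"}},
    {"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in db.orders.aggregate(pipeline):
    print(row["_id"], row["total"])
```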

Data, Best Used By…

Filed under: Data — Patrick Durusau @ 7:51 pm

Data, Best Used By:

From the post:

To state the obvious, “Big Data” is big. The deluge of data, has people talking about volume of data, which is understandable, but not as much attention has been paid to how the value of data can age. Instead, value is often actually not just about volume. It can also be thought of as perishable.

The same can be said of data in a topic map. Note the emphasis on can. Whether your data is perishable or not depends both on the data and on your requirements.

Historical trend data, for example, isn’t so much perishable as it may be incomplete (if not kept current).

Mr. Pearson, meet Mr. Mandelbrot:…

Filed under: Associations,Mathematics — Patrick Durusau @ 7:51 pm

Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Something you may enjoy along with: Detecting Novel Associations in Large Data Sets.

Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure.

My quick answer is that it looks really cool!

From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions.
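The motivation for an R-squared-like measure that handles non-linear association is easy to demonstrate: Pearson correlation can be near zero on a perfectly deterministic, non-monotone relationship. A quick sketch (this only motivates the problem; it does not compute the paper’s MIC statistic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)

linear = 2 * x + rng.normal(scale=0.1, size=x.size)
parabola = x ** 2 + rng.normal(scale=0.01, size=x.size)

print("Pearson r, linear relationship: ", np.corrcoef(x, linear)[0, 1])
print("Pearson r, y = x^2 relationship:", np.corrcoef(x, parabola)[0, 1])
# The second is ~0 even though y is an exact function of x (plus tiny noise),
# which is the kind of association a measure like MIC is designed to detect.
```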

Vowpal Wabbit

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 7:50 pm

Vowpal Wabbit version 6.1

Refinements in 6.1:

  1. The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core.
  2. The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning.
  3. There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary.

NLTK Trees

Filed under: NLTK — Patrick Durusau @ 7:49 pm

NLTK Trees by Richard Marsden.

A number of NLTK functions work with Tree objects. For example, part of speech tagging and chunking classifiers, naturally return trees. Sentence manipulation functions also work with trees. Although Natural Language Processing with Python (Bird et al) includes a couple of pages about NLTK’s Tree module, coverage is generally sparse. The online documentation actually contains some good coverage although it is not always in the most logical location (e.g. the unit tests contain some very good documentation). This article is intended as a quick introduction, and the more informative documentation pages are listed under Further Reading.

A handy introduction to NLTK trees.
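A minimal example of the Tree API the article covers, using standard NLTK calls:

```python
from nltk import Tree

# Build a parse tree from its bracketed string form.
t = Tree.fromstring("(S (NP the cat) (VP (V sat) (PP (P on) (NP the mat))))")

print(t.label())   # S
print(t.leaves())  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(t[1][1])     # the PP subtree: (PP (P on) (NP the mat))
for subtree in t.subtrees(lambda s: s.label() == "NP"):
    print(subtree)  # each NP subtree
```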

Semantic Prediction?

Filed under: Bug Prediction,Search Behavior,Semantics,Software — Patrick Durusau @ 6:34 am

Bug Prediction at Google

From the post:

I first read this post because of the claim that 50% of the code base at Google changes each month. So it says but perhaps more on that another day.

While reading the post I ran across the following:

In order to help identify these hot spots and warn developers, we looked at bug prediction. Bug prediction uses machine-learning and statistical analysis to try to guess whether a piece of code is potentially buggy or not, usually within some confidence range. Source-based metrics that could be used for prediction are how many lines of code, how many dependencies are required and whether those dependencies are cyclic. These can work well, but these metrics are going to flag our necessarily difficult, but otherwise innocuous code, as well as our hot spots. We’re only worried about our hot spots, so how do we only find them? Well, we actually have a great, authoritative record of where code has been requiring fixes: our bug tracker and our source control commit log! The research (for example, FixCache) indicates that predicting bugs from the source history works very well, so we decided to deploy it at Google.

How it works

In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they’ve been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.

So, if that is true for software bugs, doesn’t it stand to reason that the same is true for semantic impedance? That is, when a user selects one result and then, within some time window, selects a different one, the reason is that the first result failed to meet their criteria for a match. Same intuition: users change because the match, in their view, failed.

Rather than trying to “reason” about the semantics of terms, we can simply observe user behavior with regard to those terms in the aggregate. And perhaps even salt the mine, as it were, with deliberate test cases to probe theories about the semantics of terms.
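A sketch of what “observing user behavior in the aggregate” could look like, by analogy with ranking files by bug-fix commits: rank query/result pairs by how often users abandon a result within a time window. The log format here is hypothetical.

```python
from collections import Counter
from datetime import timedelta

# Hypothetical click log: (user, query, result_clicked, timestamp), time-ordered.
def semantic_hot_spots(click_log, window=timedelta(minutes=5)):
    """Count, per (query, result), how often a user clicked it and then clicked a
    different result for the same query within `window` -- a proxy for a failed match."""
    abandonments = Counter()
    last = {}  # (user, query) -> (result, timestamp)
    for user, query, result, ts in click_log:
        key = (user, query)
        if key in last:
            prev_result, prev_ts = last[key]
            if prev_result != result and ts - prev_ts <= window:
                abandonments[(query, prev_result)] += 1
        last[key] = (result, ts)
    return abandonments.most_common()

# Usage (hypothetical data):
# log = [("u1", "jaguar", "doc_car", t0),
#        ("u1", "jaguar", "doc_animal", t0 + timedelta(minutes=1))]
# semantic_hot_spots(log)  ->  [(("jaguar", "doc_car"), 1)]
```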

I haven’t done the experiment, yet, but it is certainly something that I will be looking into this next year. I think it has definite potential and would scale.

Airport Watch!

Filed under: Aviation,FAA — Patrick Durusau @ 6:33 am

Airport Watch!

I ran across Private aircraft flight plans won’t be disclosed after all, says FAA, which reads in part:

The owners and operators of private aircraft won a reprieve on December 16 when the Federal Aviation Administration announced that it will continue to allow those owners and operators to keep confidential their plane’s tail numbers and flight plans, rather than have that sensitive information automatically disclosed as part of two nationwide public aviation information dissemination systems.

The FAA said it acted after Congress passed H.R. 2112, the bill that appropriates funds for the U.S. Department of Transportation for the balance of FY2012, which includes language that specifically bars the FAA from implementing any limitation on aircraft owners’ rights to have their aircraft data blocked.

I am really curious what is “sensitive” about plane tail numbers and flight plans. Unless “sensitive” includes taking junkets at the company’s expense, perhaps without one’s spouse, etc.

There is an alternative to having the FAA keep track, at least of the tail numbers. It would not be that hard to organize an Airport Watch along the lines of Neighborhood Watch: solicit volunteers with binoculars and digital images of the most common plane types, along with a web interface for entering sightings of planes at their local airport, with the time. If they want to take long-range photos of anyone getting off the plane, they can upload those as well.

The distance a particular plane can fly would set an outer limit on where it could first be sighted. Uploading sightings to a web interface would give anyone (boss, spouse, etc.) easy access to that information, along with citizen watchdog groups, the news media, and so on.

Could support it with advertising and the occasional sale of data to the FAA when they have a CFIT (controlled flight into terrain).

