Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 25, 2011

Citeology

Filed under: Citation Indexing,HCIR — Patrick Durusau @ 6:07 pm

Citeology: Visualizing the Relationships between Research Publications

From the post:

Justin Matejka at Autodesk Research has recently released the sophisticated visualization “Citeology: Visualizing Paper Genealogy” [autodeskresearch.com]. The visualization shows the 3,502 unique academic research papers that were published at the CHI and UIST, two of the most renowned human-computer interaction (HCI) conferences, between the years 1982 and 2010.

All the articles are listed by year and sorted with the most cited papers in the middle, whereas the 11,699 citations that connect the articles to one another are represented by curved lines. Selecting a single paper colors the papers from the past that it referenced in blue, in addition to the future articles which referenced it, in brownish-red. The resulting graphs can be explored as a low-rez interactive screen, or as a high-rez, static PDF graph.

Interesting visualization but what does it mean for one paper to cite another?

I was spoiled by the granularity of legal decision indexing, at least for United States decisions, which breaks cases down by issue, so that you can separate a case being cited for a jurisdictional issue from the same case being cited on a damages issue. I realize it took a large number of very clever editors (now, I assume, assisted by computers) to create such an index, but it is what made use of the vast archives of legal decisions possible.

I suppose my question is: Why does one paper cite another? To agree with some fact finding or to disagree? If either, which fact(s)? To extend, support or correct some technique? If so, which one? For example, I could then trace papers that extend the Patricia trie as opposed to those that cite it only in passing. It would certainly make research in any number of areas much easier and possibly more effective.
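To make that kind of typed citation concrete, here is a minimal sketch in Python. The field names and categories are hypothetical, not drawn from Citeology or any legal index; it just shows how recording *why* a citation was made lets you filter a citation graph by reason:

```python
# Hypothetical sketch: typed citations, so a citation graph can be filtered
# by *why* one paper cites another, not just *that* it does.
from dataclasses import dataclass
from enum import Enum

class CiteReason(Enum):
    BACKGROUND = "cited in passing"
    EXTENDS = "extends a technique"
    CORRECTS = "corrects a result"
    DISAGREES = "disputes a finding"

@dataclass
class Citation:
    citing: str        # id of the citing paper
    cited: str         # id of the cited paper
    reason: CiteReason
    detail: str = ""   # which fact or technique is at issue

citations = [
    Citation("paper-42", "morrison-1968-patricia", CiteReason.EXTENDS, "Patricia trie"),
    Citation("paper-77", "morrison-1968-patricia", CiteReason.BACKGROUND),
]

# Trace only the papers that build on the Patricia trie, ignoring passing mentions.
extensions = [c.citing for c in citations if c.reason is CiteReason.EXTENDS]
print(extensions)  # ['paper-42']
```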

A Compressed Self-Index for Genomic Databases

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:07 pm

A Compressed Self-Index for Genomic Databases by Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi.

Abstract:

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.

As the authors note, this is an area with a rapidly increasing need for efficient, effective indexing.
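Just to make the RLZ idea concrete, here is a toy sketch of the parsing step in plain Python. It is deliberately naive (a quadratic scan of the reference); per the abstract, a real implementation would search the reference through an FM-index or similar structure rather than this loop:

```python
# Toy sketch of Relative Lempel-Ziv parsing: encode a target string as
# (position, length) phrases copied from a reference, plus literal characters
# where no copy is possible. Not the authors' implementation.
def rlz_parse(reference: str, target: str):
    phrases = []
    i = 0
    while i < len(target):
        # Greedily find the longest prefix of target[i:] occurring in the reference.
        best_pos, best_len = -1, 0
        for j in range(len(reference)):
            length = 0
            while (i + length < len(target) and j + length < len(reference)
                   and reference[j + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        if best_len == 0:                       # no copy possible: literal character
            phrases.append(("literal", target[i]))
            i += 1
        else:                                   # copy a phrase from the reference
            phrases.append(("copy", best_pos, best_len))
            i += best_len
    return phrases

reference = "ACGTACGTGGA"
target    = "ACGTACGTGGAACGTT"   # nearly identical to the reference
print(rlz_parse(reference, target))
```

Because individual genomes differ so little from the reference, almost everything compresses to a handful of long copy phrases, which is exactly the property RLZ exploits.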

It would be a step forward to see a comparison of this method with other indexing techniques on a common genome data set.

I suppose I am presuming a common genome data set for indexing demonstrations.

Questions:

  • Is there a common genome data set for comparison of indexing techniques?
  • Are there other indexing techniques that should be included in a comparison?

Obviously important for topic maps used in genome projects.

But insights about identification of subjects that vary only slightly in one (or more) dimensions to identify different subjects, will be useful in other contexts.

An easy example would be isotopes. Let’s see, ah, celestial or other coordinate systems. Don’t know but would guess that spectra from stars/galaxies are largely common. (Do you know for sure?) What other data sets have subjects that are identified on the basis of small or incremental changes in a largely identical identifier?

A Faster Grammar-Based Self-Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:06 pm

A Faster Grammar-Based Self-Index by Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi.

Abstract:

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can add $O(z \log\log z)$ words and obtain a compressed self-index for $S$ such that, given a pattern $P[1..m]$, we can list the $occ$ occurrences of $P$ in $S$ in $O(m^2 + (m + occ)\log\log n)$ time. All previous self-indexes are either larger or slower in the worst case.

Updated version of the paper I covered at: A Faster LZ77-Based Index.

In a very real sense, indexing is fundamental to information retrieval. That is to say that when information is placed in electronic storage, the only way to retrieve it is via indexing. The index may be one to one with a memory location and hence not terribly efficient, but the fact remains that an index is part of every information retrieval transaction.

Learning Machine Learning with Apache Mahout

Filed under: Machine Learning,Mahout — Patrick Durusau @ 6:06 pm

Learning Machine Learning with Apache Mahout

From the post:

Once in a while I get questions like Where to start learning more on machine learning. Other than the official sources I think there is quite good coverage also in the Mahout community: Since it was founded several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more details on particular implementations. Below is an attempt to create a collection of talks given so far without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Also Taming Text and the data mining book that comes with Weka are good starting points for practitioners.

Nice collection of resources on getting started with Apache Mahout.

TeXmaker 3.2 released

Filed under: TeX/LaTeX — Patrick Durusau @ 2:25 pm

TeXmaker 3.2 released

Update of: TeXmaker 3.0 Released!

From the post:

Version 3.2 of the free cross-platform LaTeX editor TeXmaker has been released today, as reported on LaTeX-community.org. New features, cited from the ChangeLog file:

  • block selection mode has been added (alt+mouse)
  • a “search in folders” dialog has been added
  • the settings file can now be saved, deleted or loaded
  • all the colors for the syntax highlighting can now be changed (a preconfigured dark theme is available)
  • graphics environments and .asy files have their own syntax highlighting mode
  • a selected piece of text can now be surrounded by french/german quotes (these quotes have been added to the “LaTeX” menu and to the completion)
  • a panel can be added in the structure view to show the list of opened files (“View” menu)
  • the Texdoc tool can be launched directly via the Help menu (users can select the name of the environment before calling Texdoc)
  • the list of label and bibliography items can now be used to customize the completion
  • the “recent files” list can now be cleaned
  • the shortcuts of some commands can now be changed (“switching between the editor and the pdf viewer”, “french/german quotes”, “next/previous document”, …)
  • *.asy files can now be opened directly without using the “all files” filter
  • *.jpeg has been added to the list of the “includegraphics wizard”
  • .thm and .pre files are now deleted while using the “clean” command
  • windows and mac versions are now compiled with Qt 4.8 and poppler 0.18.2
  • a version compiled on macosx lion is now available
  • the version number is now added to the info.plist file (macosx)

Furthermore, several bugs have been fixed. The complete ChangeLog can be found here. Click here to download versions for Linux, Mac OS X or Windows, or the source files.

There may be high-end publishing systems that rival TeX/LaTeX but I haven’t seen them.

If you are new to TeX/LaTeX or are just looking for information, try the TeX Users Group (TUG) website. Please consider joining TUG or any of the other TeX user groups. Memberships help sponsor the wealth of resources you see at this site.

December 24, 2011

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:44 pm

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

From the post:

The prior update of this posting was in May, and a lot has happened related to Mapreduce and Hadoop since then, e.g.

1) big software companies have started offering hadoop-based software (Microsoft and Oracle),

2) Hadoop-startups have raised record amounts, and

3) nosql-landscape becoming increasingly datawarehouse’ish and sql’ish with the focus on high-level data processing platforms and query languages.

Personally I have rediscovered Hadoop Pig and combine it with UDFs and streaming as my primary way to implement mapreduce algorithms here in Atbrox.

Best regards,

Amund Tveit (twitter.com/atveit)

Only new material appears in this post, which I think makes the listing easier to use.

But in any event, a very useful service to the community!
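The post's mention of combining Pig with UDFs and Hadoop streaming deserves a concrete illustration. Hadoop Streaming lets any executable that reads lines on stdin and writes tab-separated key/value pairs on stdout act as a mapper or reducer. Below is the classic word-count pair as a hedged sketch in Python (a generic example, not Atbrox's code):

```python
#!/usr/bin/env python
# mapper.py -- emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum counts per word; Hadoop streaming delivers the mapper
# output sorted by key, so equal words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These scripts would be handed to the hadoop-streaming jar via its -mapper and -reducer options; the exact jar path varies by distribution.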

Natural

Filed under: Lexical Analyzer,Natural Language Processing,node-js — Patrick Durusau @ 4:44 pm

Natural

From the webpage:

“Natural” is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, and some inflection are currently supported.

It’s still in the early stages, and I am very interested in bug reports, contributions and the like.

Note that many algorithms from Rob Ellis’s node-nltools are being merged in to this project and will be maintained here going forward.

At the moment most algorithms are English-specific but long-term some diversity is in order.

Aside from this README the only current documentation is here on my blog.

Just in case you are looking for natural language processing capabilities with nodejs.
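If you want a feel for what a feature like tf-idf actually computes, here is a minimal sketch. It is a generic illustration in Python, not the Natural library's JavaScript API:

```python
# Minimal tf-idf sketch: score each term in a document by its frequency there,
# weighted against how many documents in the collection contain the term.
import math
from collections import Counter

docs = [
    "the cow says moo",
    "the cat and the hat",
    "the dish ran away with the spoon",
]
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
n_docs = len(docs)

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: (count / len(doc_tokens)) * math.log(n_docs / df[t])
            for t, count in tf.items()}

for tokens in tokenized:
    print(tf_idf(tokens))  # "the" scores 0 everywhere; rarer terms score higher
```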

kernel-machine library

Filed under: Kernel Methods — Patrick Durusau @ 4:44 pm

kernel-machine library

From the webpage:

The Kernel-Machine Library is a free (released under the LGPL) C++ library to promote the use of and progress of kernel machines. It is intended for use in research as well as in practice. The library is known to work with a recent C++ compiler on GNU/Linux, on Mac OS, and on several flavours of Windows.

Below, we would like to give you the choice to either install, use, or improve the library.

The documentation seems a bit slim but perhaps this is an area where contributions would be welcome.
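For readers new to kernel machines, the central object is the kernel (Gram) matrix of pairwise similarities. A minimal sketch of the widely used RBF (Gaussian) kernel, in Python rather than the library's C++, just to show the shape of the idea:

```python
# Sketch of a Gram matrix for the RBF (Gaussian) kernel:
# k(x, y) = exp(-gamma * ||x - y||^2). A kernel machine never needs an
# explicit feature map, only these pairwise kernel values.
import math

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
gram = [[rbf_kernel(p, q) for q in points] for p in points]
for row in gram:
    print(["%.3f" % v for v in row])
```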

Cleo: Flexible, partial, out-of-order and real-time typeahead search

Filed under: Forward Index,Indexing,Typeahead Search — Patrick Durusau @ 4:44 pm

Cleo: Flexible, partial, out-of-order and real-time typeahead search

From the webpage:

Cleo is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead and autocomplete services. It is suitable for data sets of various sizes from different domains. The Cleo software library is published under the terms of the Apache Software License version 2.0, a copy of which has been included in the LICENSE file shipped with the Cleo distribution.

Not to be mistaken with query autocomplete, Cleo does not suggest search terms or queries. Cleo is a library for developing applications that can perform real typeahead queries and deliver instantaneous typeahead results/objects/elements as you type.

Cleo is also different from general-purpose search libraries because 1) it does not evaluate search terms but the prefixes of those terms, and 2) it enables search by means of Bloom Filter and forward indexes rather than inverted indexes.

You may be amused by the definition of “forward index” offered by NIST:

An index into a set of texts. This is usually created as the first step to making an inverted index.

Or perhaps more usefully from the Wikipedia entry on Index (Search Engine):

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
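The conversion described above is short enough to show directly. A minimal Python sketch of the general idea (not Cleo's Bloom-filter and prefix machinery):

```python
# Build a forward index (document -> words), then invert it by collecting
# (word, document) pairs and grouping them by word.
from collections import defaultdict

forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

inverted_index = defaultdict(set)
for doc, words in forward_index.items():
    for word in words:
        inverted_index[word].add(doc)

print(sorted(inverted_index["the"]))  # ['Document 1', 'Document 2', 'Document 3']
print(sorted(inverted_index["cow"]))  # ['Document 1']
```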

So, was it an indexing performance issue that led to the use of a “forward index” or was it some other capability?

Suggestions on what “typeahead search” would/could mean in a topic map context?

IndexTank is now open source!

Filed under: Database,IndexTank,NoSQL — Patrick Durusau @ 4:43 pm

IndexTank is now open source! by Diego Basch, Director of Engineering, LinkedIn.

From the post:

We are proud to announce that the technology behind IndexTank has just been released as open-source software under the Apache 2.0 License! We promised to do this when LinkedIn acquired IndexTank, so here we go:

indextank-engine: Indexing engine

indextank-service: API, BackOffice, Storefront, and Nebulizer

We know that many of our users and other interested parties have been patiently waiting for this release. We want to thank you for your patience, for your kind emails, and for your continued support. We are looking forward to seeing IndexTank thrive as an open-source project. Of course we’ll do our part; our team is hard at work building search infrastructure at LinkedIn. We are part of a larger team that has built and released search technologies such as Zoie, Bobo, and just this past Monday, Cleo. We are excited to add IndexTank to this array of powerful open source tools.

From the indextank.com homepage:

PROVEN FULL-TEXT SEARCH API

  • Truly real-time: instant updates without reindexing
  • Geo & Social aware: use location, votes, ratings or comments
  • Works with Ruby, Rails, Python, Java, PHP, .NET & more!

CUSTOM SEARCH THAT YOU CONTROL

  • You control how to sort and score results
  • “Fuzzy”, Autocomplete, Facets for how users really search
  • Highlights & Snippets quickly shows search results relevance

EASY, FAST & HOSTED

  • Scalable from a personal blog to hundreds of millions of documents! (try Reddit)
  • Free up to 100K documents
  • Easier than SQL, SOLR/Lucene & Sphinx.

If you are looking for documentation, rather than GitHub, you had best look here.

So far, I haven’t seen anything out of the ordinary for a search engine. I mention it in case some people prefer it over others.

Do you see anything out of the ordinary?

RTextTools v1.3.2 Released

Filed under: R,Text Analytics — Patrick Durusau @ 4:42 pm

RTextTools v1.3.2 Released

From the post:

RTextTools was updated to version 1.3.2 today, adding support for n-gram token analysis, a faster maximum entropy algorithm, and numerous bug fixes. The source code has been synced with the Google Code repository, so please feel free to check out a copy and add your own features!

With the core feature set of RTextTools finalized, the next major release (v1.4.0) will focus on optimizing existing code and refining the API for the package. Furthermore, my goal is to add compressed sparse matrix support for all nine algorithms to reduce memory consumption; currently maximum entropy, support vector machines, and glmnet support compressed sparse matrices.

If you are doing text analysis to extract subjects and their properties or have an interest in contributing to a project on text analysis, this may be your chance.
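The memory argument for compressed sparse matrices is easy to see with a small sketch (Python/SciPy here purely as an illustration; RTextTools itself is an R package):

```python
# A document-term matrix is mostly zeros: each document uses only a tiny
# fraction of the vocabulary. CSR storage keeps only the non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = (rng.random((1000, 5000)) < 0.01).astype(np.float64)  # ~1% non-zero

sparse = csr_matrix(dense)
print("dense bytes: ", dense.nbytes)   # 1000 * 5000 * 8 = 40,000,000
print("sparse bytes:", sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
```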

Development Life Cycle and Tools for Data Exchange Specification

Filed under: Integration,XML,XML Schema — Patrick Durusau @ 4:42 pm

Development Life Cycle and Tools for Data Exchange Specification (2008) by KC Morris and Puja Goyal.

Abstract:

In enterprise integration, a data exchange specification is an architectural artifact that evolves along with the business. Developing and maintaining a coherent semantic model for data exchange is an important, yet non-trivial, task. A coherent semantic model of data exchange specifications supports reuse, promotes interoperability, and, consequently, reduces integration costs. Components of data exchange specifications must be consistent and valid in terms of agreed upon standards and guidelines. In this paper, we describe an activity model and NIST developed tools for the creation, test, and maintenance of a shared semantic model that is coherent and supports scalable, standards-based enterprise integration. The activity model frames our research and helps define tools to support the development of data exchange specification implemented using XML (Extensible Markup Language) Schema.

A paper that makes it clear that interoperability is not a trivial task. Could be helpful in convincing the ‘powers that be’ that projects on semantic integration or interoperability have to be properly resourced in order to have a useful result.

Manufacturing System Integration Division – MSID XML Testbed (NIST)

Filed under: Integration,XML,XML Schema — Patrick Durusau @ 4:42 pm

Manufacturing System Integration Division – MSID XML Testbed (NIST)

From the website:

NIST’s efforts to define methods and tools for developing XML Schemas to support systems integration will help you effectively build and deploy XML Schemas amongst partners in integration projects. Through the Manufacturing Interoperability Program (MIP) XML Testbed, NIST provides guidance on how to build XML Schemas as well as a collection of tools that will help with the process, allowing projects to more quickly and efficiently meet their goals.

The NIST XML Schema development and testing process is documented as the Model Development Life Cycle, which is an activity model for the creation, use, and maintenance of shared semantic models, and has been used to frame our research and development tools. We have worked with a number of industries on refining and automating the specification process and provide a wealth of information on how to use XML to address your integration needs.

On this site you will find a collection of tools and ideas to help you in developing high quality XML schemas. The tools available on this site are offered to the general public free of charge. They have been developed by the United States Government and as such are not subject to copyright or other restrictions.

If you are interested in seeing the tools extended or having some of your work included in the service please contact us.

The thought did occur to me that you could write an XML schema that governs the documentation of the subjects, their properties and merging conditions in your information systems. Perhaps even to the point of using XSLT to run against the resulting documentation to create SQL statements for the integration of information resources held in your database (or accessible therefrom).
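As a rough sketch of that thought (all element, table and column names below are hypothetical, and Python's standard XML library stands in here for XSLT): parse a small XML description of subjects and their identifying keys, and emit a SQL statement from it.

```python
# Hypothetical sketch: read a small XML description of subjects and the
# columns that identify them, and emit SQL that pulls matching rows together.
# The element and table names are made up for illustration only.
import xml.etree.ElementTree as ET

doc = """
<subjects>
  <subject name="customer">
    <table>crm_customers</table>
    <key>email</key>
  </subject>
  <subject name="customer">
    <table>billing_accounts</table>
    <key>email_address</key>
  </subject>
</subjects>
"""

root = ET.fromstring(doc)
tables = [(s.findtext("table"), s.findtext("key")) for s in root.iter("subject")]
(t1, k1), (t2, k2) = tables
print(f"SELECT a.*, b.* FROM {t1} a JOIN {t2} b ON a.{k1} = b.{k2};")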

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

Filed under: Query Language,Similarity — Patrick Durusau @ 4:42 pm

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

From the webpage:

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisfies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.

We also investigated the problem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary.
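A naive version of the set-similarity join described above (all pairs whose Jaccard similarity clears a threshold) fits in a few lines of Python; the algorithms from this project exist precisely because this quadratic loop does not scale:

```python
# Naive Jaccard similarity join: report all pairs of records (one from each
# collection) whose token sets overlap above a threshold. This is O(|R| * |S|);
# prefix-filtering algorithms like ppjoin avoid most of these comparisons.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_join(R, S, threshold=0.8):
    for i, r in enumerate(R):
        for j, s in enumerate(S):
            sim = jaccard(r, s)
            if sim >= threshold:
                yield i, j, sim

R = [{"data", "integration", "survey"}, {"near", "duplicate", "detection"}]
S = [{"near", "duplicate", "detection", "web"}, {"graph", "edit", "distance"}]
print(list(similarity_join(R, S, threshold=0.6)))  # [(1, 0, 0.75)]
```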

The first five (5) papers:

  • Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.

    Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

  • Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.

    Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.

  • Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transactions on Database Systems (TODS).

    Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

  • Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.

    Summary: Two simple yet highly efficient algorithms are proposed in this paper that work very well for both edit similarity search and joins. Perhaps equally interesting is a comprehensive experiment involving Flamingo (ver 3), PartEnum, Ed-Join, Bed-tree, Trie-Join, NGPP, VGRAM, NS09.

  • Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010. PPT

    Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

The first paper is saying exactly what you suspect: similarity of arbitrary sub-graphs. Nothing like picking a hard problem, is there? It is not a fully general solution, but see what you think of theirs.

Did I mention there is working code at this site as well? Definitely a group to watch in the future.

Topic Maps & Oracle: A Smoking Gun

Filed under: Database,Oracle,SQL — Patrick Durusau @ 4:42 pm

Using Similarity-based Operations for Resolving Data-Level Conflicts (2003)

Abstract:

Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both during eliminating duplicates from semantic overlapping sources as well as during combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples how the operators can be used in given application scenarios.

No, the title of the post is not a mistake.

The authors of this paper, in 2003, conclude:

In this paper we presented database operators for finding related data and identifying duplicates based on user-specific similarity criteria. The main application area of our work is the integration of heterogeneous data where the likelihood of occurrence of data objects representing related or the same real-world objects though containing discrepant values is rather high. Intended as an extended grouping operation and by combining it with aggregation functions for merging/reconciling groups of conflicting values our grouping operator fits well into the relational algebra framework and the SQL query processing model. In a similar way, an extended join operator takes similarity predicates used for both operators into consideration. These operators can be utilized in ad-hoc queries as part of more complex data integration and cleaning tasks.
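As a rough illustration of what an extended grouping operator does conceptually, here is a sketch in plain Python, not the authors' SQL operators; the standard library's SequenceMatcher ratio stands in for their edit distance criterion. Each incoming value joins the first group whose representative it is sufficiently similar to, otherwise it starts a new group:

```python
# Sketch of similarity-based grouping: values whose strings are "close enough"
# to a group's representative land in the same group, instead of requiring
# exact equality as ordinary GROUP BY does.
from difflib import SequenceMatcher   # crude stand-in for an edit distance test

def similar(a: str, b: str, threshold=0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_by_similarity(values, threshold=0.8):
    groups = []   # list of (representative, members)
    for v in values:
        for rep, members in groups:
            if similar(v, rep, threshold):
                members.append(v)
                break
        else:
            groups.append((v, [v]))
    return groups

names = ["Very Large Data Bases", "Very Large Databases", "SIGMOD", "Sigmod Conference"]
for rep, members in group_by_similarity(names, threshold=0.7):
    print(rep, "->", members)
```

Combine the groups with aggregation/reconciliation functions, as the paper does, and you have the merging step in miniature.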

In addition to a theoretical background, the authors illustrate an implementation of their techniques using Oracle 8i. (Oracle 11g is the current version.)

Don’t despair! 😉

Leaves a lot to be done, including:

  • Interchange between relational database stores
  • Semantic integration in non-relational database stores
  • Interchange in mixed relational/non-relational environments
  • Identifying bases for semantic integration in particular data sets (the tough nut)
  • Others? (your comments can extend this list)

The good news for topic maps is that Oracle has some name recognition in IT contexts. 😉

There is a world of difference between a CIO saying to the CEO:

“That was a great presentation about how we can use our data more effectively with topic maps and some software, what did he say the name was?”

and,

“That was a great presentation about using our Oracle database more effectively!”

Yes?

Big iron for your practice of topic maps. A present for your holiday tradition.


Aside to Matt O’Donnell. Yes, I am going to be covering actual examples of using these operators for topic map purposes.

Right now I am sifting through a 400 document collection on “multi-dimensional indexing” where I discovered this article. Remind me to look at other databases/indexers with similar operators.

December 23, 2011

How accurate can manual review be?

Filed under: Authoring Topic Maps,Precision,Recall,Retrieval — Patrick Durusau @ 4:31 pm

How accurate can manual review be?

From the post:

One of the chief pleasures for me of this year’s SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion.

My own contribution was “Re-examining the Effectiveness of Manual Review”. The paper was inspired by an article from Maura Grossman and Gord Cormack, whose message is neatly summed up in its title: “Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review”.

Fascinating work!

Does this give you pause about automated topic map authoring? Why/why not?

Introducing Google Plugin for Eclipse 2.5

Filed under: Eclipse,Google App Engine — Patrick Durusau @ 4:30 pm

Simple development of App Engine apps using Cloud SQL – Introducing Google Plugin for Eclipse 2.5

From the post:

Since we added SQL support to App Engine in the form of Google Cloud SQL, the Google Plugin for Eclipse (GPE) team has been working hard on improving the developer experience for developing App Engine apps that can use a Cloud SQL instance as the backing database.

We are pleased to announce the availability of Google Plugin for Eclipse 2.5. GPE 2.5 simplifies app development by eliminating the need for manual tasks like copying Cloud JDBC drivers, setting classpaths, typing in JDBC URLs or filling in JVM arguments for connecting to local/remote database instances.

I don’t guess Google will mind my scraping that feedburner crap off the end of the URL for this post. Why browsers don’t do that automatically is hard to say.

The Best Data Visualization Projects of 2011

Filed under: Graphics,Visualization — Patrick Durusau @ 4:30 pm

The Best Data Visualization Projects of 2011

From FlowingData.com.

No verbal description would be adequate.

Scala IDE for Eclipse

Filed under: Eclipse,Scala — Patrick Durusau @ 4:30 pm

Scala IDE for Eclipse

We released the Scala IDE V2.0 for Eclipse today! After 9 months of intensive work by the community contributors, users and the IDE team we are really proud to release the new version. Not only is it robust and reliable but it also comes with much improved performance and responsiveness. There are a whole lot of new features that make it a real pleasure to use: report errors as you type, a project builder with dependency tracking, definition hyperlinking and inferred type hovers, code completion and better integration with Java build tools, and lots more. You can learn more about them all below. We hope you will enjoy using the new version and continue to help us with ideas and improvement suggestions, or just contribute them.

While working on V2.0 the team has been listening hard to what the IDE users need. Simply stated: faster compilation, better debugging and better integration with established Java tools like Maven. The good news is the team is ready for and excited by the challenge. In doing V2.0 we learned a lot about the build process and now understand what is needed to make significant gains in large project compile times. This and providing a solid debugging capability will be the main thrust of the next IDE development cycle. More details will be laid out as we go through the project planning phase and establish milestones. Contributors will be most welcome and we have made it a lot easier to be one. So if you want us to get the next version faster, come and help!

A lot of effort has gone into this version of the IDE and we would like to recognize the people who have contributed so much time and energy to the success of the project.

Breaking (the) News

Filed under: News — Patrick Durusau @ 4:30 pm

Breaking (the) News by Matthew Hurst.

A recap of work that Matthew Hurst has done this year on news reporting and teasers about where he may be going in the coming year.

I think his coming focus on how resources for news reporting influence what is reported as news should be quite useful.

Another fruitful area for investigation would be the influence of appearance on news reporting. Even simply counting the number of unattractive people shown in crime stories versus the number of white, attractive, pre-teen females shown in kidnap and missing-person cases would be instructive.

Machine Learning and Hadoop

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 4:28 pm

Machine Learning and Hadoop

Interesting slide deck from Josh Wills, Tom Pierce, and Jeff Hammerbacher of Cloudera.

The mention of “Pareto optimization” reminds me of a debate tournament judge who had written his dissertation on that topic, and who carefully pointed out that it wasn’t possible to “know” how close (or far away) a society was from any optimal point. 😉 Oh well, it was a “case” that sounded good to opponents unfamiliar with economic theory at any rate. An example of those “critical evaluation” skills I was talking about a day or so ago.

Not that you can’t benefit from machine learning and Hadoop. You can, but ask careful questions and persist until you are given answers that make sense. To you. With demonstrable results.

In other words, don’t be afraid to ask “stupid” questions and keep on asking them until you are satisfied with the answers. Or hire someone who is willing to play that role.

Apache Avro at RichRelevance

Filed under: Avro — Patrick Durusau @ 4:28 pm

Apache Avro at RichRelevance

From the post:

In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these were sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.

After analysis similar to Doug Cutting’s blog post, we chose Apache Avro. As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs. On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.

I think you are going to like this post and Avro as well!

ETL Demo with Data From Data.Gov

Filed under: ETL,Expressor — Patrick Durusau @ 4:28 pm

ETL Demo with Data From Data.Gov by Kevin E. Kline.

From the post:

A little over a month ago, I wrote an article (Is There Such a Thing as Easy ETL) about expressor software and their desktop ETL application, expressor Studio. I wrote about how it seemed much easier than the native ETL tools in SQL Server when I was reading up on the tool, but that the “proof would be in the pudding” so to speak when I actually tried it out loading some free (and incredibly useful) data from the US federal data clearinghouse, Data.Gov.

If you’d rather not read my entire previous article – quick recap, expressor Studio uses “semantic types” to manage and abstract mappings between sources and targets. In essence, these types are used for describing data in terms that humans can understand—instead of describing data in terms that computers can understand. The idea of semantic abstraction is quite intriguing and it gave me an excuse to use data from data.gov to build a quick demo. You can download the complete data set I used from the following location: International Statistics. (Note: I have this dream that I’m going to someday download all of this free statistical data sets, build a bunch of amazing and high-value analytics, and make a mint. If, instead, YOU do all of those things, then please pay to send at least one of my seven kids to college in repayment for the inspiration. I’m not kidding. I have SEVEN kids. God help me).

The federal government, to their credit, has made great progress in making data available. However, there is a big difference between accessing data and understanding data. When I first looked at one of the data files I downloaded, I figured it was going to take me years to decrypt the field names. Luckily, I did notice an Excel file with field names and descriptions. Seriously, there are single letter field names in these files where the field name “G” has a description of “Age group indicator” (Oh Wow). See the figure below.

I like Kevin’s point about the difference between “accessing data and understanding data.”

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Filed under: Edit Distance,Indexing,Similarity,String Matching — Patrick Durusau @ 4:28 pm

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
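Since everything in this paper hinges on edit distance, the textbook dynamic-programming computation is worth keeping in mind as a reference point (this is the plain Wagner-Fischer recurrence, not the Bed-tree itself):

```python
# Classic Wagner-Fischer dynamic program for edit (Levenshtein) distance:
# the minimum number of insertions, deletions and substitutions turning s into t.
def edit_distance(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```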

The authors classify similarity queries as:

Type              Example
range             address in customer database
top-k             results of search engine
all-pairs joins   pairs of proteins or genes

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a Bed-tree database that we can use.

Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons, but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008,” respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has had a new release since I mentioned it at The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers, you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

Similarity Search on Bregman Divergence: Towards NonMetric Indexing

Filed under: Indexing,NonMetric Indexing — Patrick Durusau @ 4:28 pm

Similarity Search on Bregman Divergence: Towards NonMetric Indexing by Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, and Anthony K. H. Tung. (2009)

Abstract:

In this paper, we examine the problem of indexing over non-metric distance functions. In particular, we focus on a general class of distance functions, namely Bregman Divergence [6], to support nearest neighbor and range queries. Distance functions such as KL-divergence and Itakura-Saito distance, are special cases of Bregman divergence, with wide applications in statistics, speech recognition and time series analysis among others. Unlike in metric spaces, key properties such as triangle inequality and distance symmetry do not hold for such distance functions. A direct adaptation of existing indexing infrastructure developed for metric spaces is thus not possible. We devise a novel solution to handle this class of distance measures by expanding and mapping points in the original space to a new extended space. Subsequently, we show how state-of-the-art tree-based indexing methods, for low to moderate dimensional datasets, and vector approximation file (VA-file) methods, for high dimensional datasets, can be adapted on this extended space to answer such queries efficiently. Improved distance bounding techniques and distribution-based index optimization are also introduced to improve the performance of query answering and index construction respectively, which can be applied on both the R-trees and VA files. Extensive experiments are conducted to validate our approach on a variety of datasets and a range of Bregman divergence functions.

This paper hasn’t been cited a lot (yet) but I think it will be of particular interest to the topic map community. Mostly because natural languages, the stuff most users use to describe/identify subjects, are inherently nonmetric.
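KL-divergence, probably the most familiar member of the Bregman family, makes the “non-metric” point easy to see. A quick Python sketch of the asymmetry that rules out the usual metric-space pruning:

```python
# KL-divergence between two discrete distributions. Note D(p || q) != D(q || p):
# Bregman divergences are generally neither symmetric nor triangle-inequality
# respecting, which is why metric indexing cannot be reused as-is.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]
q = [1/3, 1/3, 1/3]
print(kl(p, q))  # one value ...
print(kl(q, p))  # ... and a different one: the divergence is asymmetric
```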

Just as a placeholder for future conversation, it occurs to me that there is a performance continuum for subject-sameness tests. How complex must matching become, with some number X of topics, before performance degrades? Or perhaps better, what are the comparative performance characteristics of different subject-sameness tests?

December 22, 2011

How Moviepilot Walks The Graph

Filed under: Neo4j — Patrick Durusau @ 7:40 pm

How Moviepilot Walks The Graph

Neo4j is not just another Swedish open source database project, they are one of the strongest players inside this market, so they’re a natural first choice when starting a graph database project. Last but not least, Neo4j can be used from our favorite programming language, Ruby.

Most of these databases are just embedded, although it has to be said that they are rapidly moving to a traditional server architecture. So, we decided to have our own server, exposing this through a rest api.

The prior post in this series was: Meet Sheldon, Our Custom Database Server.

Percona Toolkit

Filed under: MySQL — Patrick Durusau @ 7:40 pm

Percona Toolkit

From the webpage:

Percona Toolkit is a collection of advanced command-line tools used by Percona support staff to perform a variety of MySQL and system tasks that are too difficult or complex to perform manually, including:

  • Verify master and replica data consistency
  • Efficiently archive rows
  • Find duplicate indexes
  • Summarize MySQL servers
  • Analyze queries from logs and tcpdump
  • Collect vital system information when problems occur

Tools are a vital part of any MySQL deployment, so it’s important to use ones that are reliable and well-designed. Over 2,000 tests and several years of deployment, including some of the Internet’s best-known sites, have proven the reliability of the tools in Percona Toolkit. And the combined experience and expertise of Percona ensures that each tool is well thought-out and designed.

With tools and documentation like this, I am sorely tempted to throw a MySQL installation on my box just for fun. (Or more serious purposes.)

Experimental isarithmic maps visualise electoral data

Filed under: Mapping,Maps,Visualization — Patrick Durusau @ 7:40 pm

Experimental isarithmic maps visualise electoral data

From the post:

David B. Sparks, a fifth-year PhD candidate in the Department of Political Science at Duke University, has today published a fascinating set of experiments using ‘Isarithmic’ maps to visualise US party identification. Isarithmic maps are essentially topographic/contour maps and offer an alternative approach to plotting geo-spatial data using choropleth maps. This is a particularly interesting approach for the US with its extreme population patterns.

Very impressive work. Read this post and then David’s original.

FYI:

Choropleth maps use city, county, etc. boundaries, within which colors appear.

Isarithmic maps use color to present the same information but without the legal boundaries that appear in choropleth maps.

MyFCC Platform Enables Government Data Mashups

Filed under: Government Data,Mashups — Patrick Durusau @ 7:40 pm

MyFCC Platform Enables Government Data Mashups by Kin Lane.

From the post:

The FCC just launched a new tool that allows any user to custom build a dashboard from a variety of FCC released data, tools and services, built on the FCC API. The tool, called MyFCC, lets you create a customized FCC online experience for quick access to the tools and information you feel is most important. MyFCC makes it possible to easily create, save and manage a customized page, choosing from a menu of 22 “widgets” such as latest headlines and official documents, the daily digest, FCC forms and online filings.

Once you have built your customized MyFCC page, you can share your work using popular social network platforms, or embed it on any other website. The platform allows each widget to be shared independently, or the entire dashboard to be embedded in another site.

Modulo my usual comments about subject identification and reuse of identifications, this is at least a step in the right direction.

Lucene & Solr Year 2011 in Review

Filed under: Lucene,Solr — Patrick Durusau @ 7:38 pm

Lucene & Solr Year 2011 in Review

An excellent review of the developments in Lucene and Solr for 2011.

“Big data” may be the buzz words for 2012, but Lucene and Solr are part of the buzz saw tool kit (along with SQL and NoSQL databases) to tame “big data.”

If you have the time, you would be well advised to at least monitor the user lists, if not the developer lists, for both projects.
