Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 10, 2012

Factor Analysis at 100:… [Two Subjects – One Name]

Filed under: Factor Analysis,Statistics — Patrick Durusau @ 3:53 am

Factor Analysis at 100: Historical Developments And Future Directions (Cudeck and MacCallum, Lawrence Erlbaum Associates, 2007, 384 pp.) was mentioned by Christophe Lalanne in Some Random Notes as one of his recent book acquisitions.

While searching for that volume, I encountered a conference with the same name: Factor Analysis at 100: Historical Developments And Future Directions [Conference, 2004].

At the conference site you will find links to materials from thirteen speakers, plus a “Factor Analysis Genealogy” and “Factor Analysis Timeline.”

The presentations from the conference became papers that appear in the volume Christophe recently purchased.

Charles Spearman’s paper, “General Intelligence, Objectively Determined and Measured,” in the American Journal of Psychology [PDF version] [HTML version] (1904) was posted to the conference homepage.


The relevant subject identifiers are obvious. What else would you add to topics representing these subjects? Why?

August 9, 2012

Evaluating the state of the art in coreference resolution for electronic medical records

Evaluating the state of the art in coreference resolution for electronic medical records by Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R South. (J Am Med Inform Assoc 2012; 19:786-791 doi:10.1136/amiajnl-2011-000784)

Abstract:

Background The fifth i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records conducted a systematic review on resolution of noun phrase coreference in medical records. Informatics for Integrating Biology and the Bedside (i2b2) and the Veterans Affairs (VA) Consortium for Healthcare Informatics Research (CHIR) partnered to organize the coreference challenge. They provided the research community with two corpora of medical records for the development and evaluation of the coreference resolution systems. These corpora contained various record types (ie, discharge summaries, pathology reports) from multiple institutions.

Methods The coreference challenge provided the community with two annotated ground truth corpora and evaluated systems on coreference resolution in two ways: first, it evaluated systems for their ability to identify mentions of concepts and to link together those mentions. Second, it evaluated the ability of the systems to link together ground truth mentions that refer to the same entity. Twenty teams representing 29 organizations and nine countries participated in the coreference challenge.

Results The teams’ system submissions showed that machine-learning and rule-based approaches worked best when augmented with external knowledge sources and coreference clues extracted from document structure. The systems performed better in coreference resolution when provided with ground truth mentions. Overall, the systems struggled in solving coreference resolution for cases that required domain knowledge.

That systems “struggled in solving coreference resolution for cases that required domain knowledge” isn’t surprising.

But, as we saw in > 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis], for any given diagnosis, there is a finite number of ways to say it.

Usually far fewer than 4,000. If we capture the ways as they are encountered, our systems don’t need “domain knowledge.”
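
Here is a minimal sketch of that “capture the ways as they are encountered” idea: a plain lookup table that grows under curation. The phrases and labels below are invented for illustration, not taken from the challenge data.

```python
# A toy sketch (not from the paper) of capturing diagnosis phrasings as they
# are encountered, so later mentions resolve without "domain knowledge".
canonical = {}  # surface phrase -> canonical diagnosis

def record(phrase, diagnosis):
    """Curate a newly encountered way of saying a known diagnosis."""
    canonical[phrase.strip().lower()] = diagnosis

def resolve(phrase):
    """Resolve a mention to a diagnosis, or flag it for curation."""
    return canonical.get(phrase.strip().lower(), "UNRESOLVED: send to a curator")

record("no evidence of malignancy", "benign")
record("negative for carcinoma", "benign")

print(resolve("Negative for carcinoma"))   # benign
print(resolve("consistent with DCIS"))     # UNRESOLVED: send to a curator
```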

As the lead character in O Brother, Where Art Thou? says, our applications can be as “dumb as a bag of hammers.”

PS: Apologies, but I could not find an accessible version of this article. I will run down the details on the coreference workshop tomorrow and hopefully find some accessible materials on it.

The Cell: An Image Library

Filed under: Bioinformatics,Biomedical,Data Source,Medical Informatics — Patrick Durusau @ 3:50 pm

The Cell: An Image Library

For the casual user, an impressive collection of cell images.

For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

A good source of images for curated (not “mash up”) alignment with other materials, such as instructional resources on biology or medicine.

Teaching the World to Search

Filed under: CS Lectures,Searching — Patrick Durusau @ 3:50 pm

Teaching the World to Search by Maggie Johnson.

From the post:

For two weeks in July, we ran Power Searching with Google, a MOOC (Massive Open Online Course) similar to those pioneered by Stanford and MIT. We blended this format with our social and communication tools to create a community learning experience around search. The course covered tips and tricks for Google Search, like using the search box as a calculator, or color filtering to find images.

The course had interactive activities to practice new skills and reinforce learning, and many opportunities to connect with other students using tools such as Google Groups, Moderator and Google+. Two of our search experts, Dan Russell and Matt Cutts, moderated Hangouts on Air, answering dozens of questions from students in the course. There were pre-, mid- and post-class assessments that students were required to pass to receive a certificate of completion. The course content is still available.

Won’t be the same as taking the course but if you missed it, see the materials online.

As you learn new search techniques, consider what it is about the data (or the user) that makes those techniques effective.

Understanding the relationships between data and search techniques may make you a better searcher.

Understanding the relationship between tool and user may make you a better tool designer.

Review of Tufte course

Filed under: Graphics,Visualization — Patrick Durusau @ 3:49 pm

Review of Tufte course

Nathan Yau reports a negative review of Edward Tufte’s course. The comments seem to confirm the negative assessment.

I am curious because visualization/interface issues are bound up in the delivery of topic map content.

Poor delivery = topic map’s fault.

If you have taken the Tufte course, hop over to Nathan’s blog and enter your comments.

Groundhog: Hadoop Fork Testing

Filed under: Hadoop,Systems Administration — Patrick Durusau @ 3:49 pm

Groundhog: Hadoop Fork Testing by Anupam Seth.

From the post:

Hadoop is widely used at Yahoo! to do all kinds of processing. It is used for everything from counting ad clicks to optimizing what is shown on the front page for each individual user. Deploying a major release of Hadoop to all 40,000+ nodes at Yahoo! is a long and painful process that impacts all users of Hadoop. It involves doing a staged rollout onto different clusters of increasing importance (e.g. QA, sandbox, research, production) and asking all teams that use Hadoop to verify that their applications work with this new version. This is to harden the new release before it is deployed on clusters that directly impact revenue, but it comes at the expense of the users of these clusters because they have to share the pain of stabilizing a newer version. Further, this process can take over 6 months. Waiting 6 months to get a new feature, which users have asked for, onto a production system is way too long. It stifles innovation both for Hadoop and for the code running on Hadoop. Other software systems avoid these problems by more closely following continuous integration techniques.

Groundhog is an automated testing tool to help ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before deploying a new release onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they are run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results. The test cluster can take inevitable downtime and still help ensure that the latest version of Hadoop has not introduced any new regressions. It is called Groundhog because that way Hadoop can relive a pig script over and over again until it gets it right, like the movie Groundhog Day. There is similarity in concept to traditional fork/T testing in that jobs are duplicated and run in another location. However, Hadoop fork testing differs in that the testing will not occur in real-time but instead the original job with all needed inputs and outputs will be captured and archived. Then at any later date, the archived job can be re-run.

The main idea is to reduce the deployment cycle of a new Hadoop release by making it easier to get user-oriented testing started sooner and at a larger scope. Specifically, get testing running to quickly discover regressions and backwards incompatibility issues. Past efforts to bring up a test cluster and have Hadoop users run their jobs on the test cluster have been less successful than desired. Therefore, fork testing is a method for reducing the human effort needed to get user-oriented testing run against a Hadoop cluster. Additionally, if the level of effort to capture and run tests is reduced, then testing can be performed more often and experiments can also be run. All of this must happen while following data governance policies though.

Thus, fork testing is a form of end-to-end testing. If there were a complete suite of end-to-end tests for Hadoop, the need for fork testing might not exist. Alas, the end-to-end suite does not exist and creating fork testing is deemed a faster path to achieving the testing goal.

Groundhog is currently limited to working only with pig jobs. The majority of user jobs run on Hadoop at Yahoo! are written in pig, which is what allows Groundhog to nevertheless have a good sampling of production jobs.

This is way cool!

Discovering problems, even errors, before they show up in live installations is always a good thing.
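
To make the replay-and-compare idea concrete, here is a rough sketch, not Groundhog’s actual code. The script name, configuration directories and output paths are placeholders, and the outputs are assumed to have been copied to local disk for comparison.

```python
# A rough sketch of replaying one captured pig script against two Hadoop
# configurations and checking that the outputs still match.
import hashlib
import os
import subprocess

def run_job(pig_script, hadoop_conf_dir, output_dir):
    # Replay a captured pig script against the cluster described by the
    # given Hadoop configuration directory.
    env = dict(os.environ, HADOOP_CONF_DIR=hadoop_conf_dir)
    subprocess.check_call(["pig", "-param", f"OUT={output_dir}", pig_script], env=env)

def digest(path):
    # Hash an output file so results can be compared across clusters.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

run_job("captured_job.pig", "/etc/hadoop/conf.current", "out_current")
run_job("captured_job.pig", "/etc/hadoop/conf.candidate", "out_candidate")
assert digest("out_current/part-r-00000") == digest("out_candidate/part-r-00000"), \
    "Regression: the candidate Hadoop version changed the job's output"
```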

When you make changes to merging rules, how do you test the impact on your topic maps?

I first saw this at: Alex Popescu’s myNoSQL under Groundhog: Hadoop Automated Testing at Yahoo!

An Introduction to Linked Open Data in Libraries, Archives & Museums

Filed under: Linked Data,LOD — Patrick Durusau @ 3:48 pm

An Introduction to Linked Open Data in Libraries, Archives & Museums by Jon Voss.

From the description:

According to a definition on LinkedData.org, “The term Linked Data refers to a set of best practices for publishing and connecting structured data on the web.” This has enormous implications for discoverability and interoperability for libraries, archives, and museums, not to mention a dramatic shift in the World Wide Web as we know it. In this introductory presentation, we’ll explore the fundamental elements of Linked Open Data and discover how rapidly growing access to metadata within the world’s libraries, archives and museums is opening exciting new possibilities for understanding our past, and may help in predicting our future.

Be forewarned that Jon thinks “mashing up” music tracks produces a good result.

And you will encounter advocates for Linked Data in libraries.

You should be prepared to encounter both while topic mapping.

The Bookless Library

Filed under: Books,Library — Patrick Durusau @ 3:45 pm

The Bookless Library by David A. Bell. (New Republic, July 12, 2012)

Although Bell is quick to dismiss the notion of libraries without physical books, the confusion of libraries with physical books has hurt the cause of libraries.

He remarks:

Libraries are also sources of crucial expertise. Librarians do not just maintain physical collections of books. Among other things, they guide readers, maintain catalogues, develop access portals for electronic sources, organize special programs and exhibitions, oversee special collections, and make acquisition decisions. The fact that more and more acquisition decisions now involve a question of which databases to subscribe to, rather than which physical books and journals to buy, does not make these functions any less important. To the contrary: the digital landscape is wild and wooly, and it is crucial to have well-trained, well-informed librarians on hand to figure out which content to spend scarce subscription dollars on, and how to guide readers through it.

Digital resources and collections have already outstripped the physical collections possible in even major research libraries. Digitization efforts promise that more and more of the written record will become readily accessible to more readers.

Accessible here means only that readers can “read” the text; whether it is understood is a different issue.

Without librarians to act as intelligent filters, digital content will be a sea of information that washes over all but the most intrepid scholars.

Increases in digital resources require increases in the number of librarians performing the creative aspects of their professions.

Acting as teachers, guides and fellow travellers in the exploration of cultural riches past and present, and preparing for those yet to come.

…Creating Reliable Billion Page View Web Services

Filed under: Performance,Systems Administration,Web Analytics,Web Server — Patrick Durusau @ 3:40 pm

High Scalability reports, in 3 Tips and Tools for Creating Reliable Billion Page View Web Services, on an article by Amir Salihefendic that suggests:

  • Realtime monitor everything
  • Be proactive
  • Be notified when crashes happen

Those are the three tips to follow on the hunt for a reliable billion page view web service.

I’m a few short of that number but it was still an interesting post. 😉

And you never can tell: you might snag a client that is more likely to reach those numbers.
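
In the meantime, here is a minimal sketch of the third tip, being notified when crashes happen. The SMTP host and addresses are placeholders, and the handler is a stub; it is an illustration of the idea, not code from the article.

```python
# Wrap the request handler so any unhandled exception triggers an email alert.
import smtplib
import traceback
from email.message import EmailMessage

def notify_crash(exc):
    msg = EmailMessage()
    msg["Subject"] = f"Service crash: {type(exc).__name__}"
    msg["From"] = "alerts@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(traceback.format_exc())
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def do_work(request):
    # Placeholder for the real request handler.
    return f"handled {request}"

def handle_request(request):
    try:
        return do_work(request)
    except Exception as exc:      # crash: alert, then re-raise
        notify_crash(exc)
        raise
```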

Apache Hadoop YARN – Background and an Overview

Filed under: Hadoop,Hadoop YARN,MapReduce — Patrick Durusau @ 3:39 pm

Apache Hadoop YARN – Background and an Overview by Arun Murthy.

From the post:

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discrete chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.

In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.

One of the keys to this is the lack of data motion, i.e., move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and keeps the majority of the I/O on the local disk or within the same rack – a core advantage.

An introduction to the architecture of Apache Hadoop YARN that starts with its roots in MapReduce.
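
For anyone new to the paradigm, here is a toy, single-process illustration of the two phases described above (word count). Real Hadoop MapReduce distributes the map tasks across nodes, shuffles the intermediate pairs by key and runs the reducers in parallel; this sketch only shows the shape of the model.

```python
# Map: emit (key, value) pairs from discrete chunks; Reduce: aggregate by key.
from collections import defaultdict

def map_phase(chunk):
    # Map: emit (word, 1) for every word in one chunk of input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: sum all values emitted for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```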

NIST … guide for managing computer security incidents [Why Black Hats Win]

Filed under: Security,Topic Maps — Patrick Durusau @ 3:39 pm

NIST publishes updated guide for managing computer security incidents by Mark Rockwell.

Mark provides a brief overview of the National Institute of Standards and Technology (NIST), Computer Security Incident Handling Guide.

At seventy-nine (79) pages it isn’t everything you will want to know, but it’s a starting point.

Of particular note is the section on sharing information with others, which reads in part:

The nature of contemporary threats and attacks makes it more important than ever for organizations to work together during incident response. Organizations should ensure that they effectively coordinate portions of their incident response activities with appropriate partners. The most important aspect of incident response coordination is information sharing, where different organizations share threat, attack, and vulnerability information with each other so that each organization’s knowledge benefits the other. Incident information sharing is frequently mutually beneficial because the same threats and attacks often affect multiple organizations simultaneously.

As mentioned in Section 2, coordinating and sharing information with partner organizations can strengthen the organization’s ability to effectively respond to IT incidents. For example, if an organization identifies some behavior on its network that seems suspicious and sends information about the event to a set of trusted partners, someone else in that network may have already seen similar behavior and be able to respond with additional details about the suspicious activity, including signatures, other indicators to look for, or suggested remediation actions. Collaboration with the trusted partner can enable an organization to respond to the incident more quickly and efficiently than an organization operating in isolation.

This increase in efficiency for standard incident response techniques is not the only incentive for cross-organization coordination and information sharing. Another incentive for information sharing is the ability to respond to incidents using techniques that may not be available to a single organization, especially if that organization is small to medium size. For example, a small organization that identifies a particularly complex instance of malware on its network may not have the in-house resources to fully analyze the malware and determine its effect on the system. In this case, the organization may be able to leverage a trusted information sharing network to effectively outsource the analysis of this malware to third party resources that have the adequate technical capabilities to perform the malware analysis.

I would summarize all that as follows:

For all of the $Billions spent on computer security, teams of security experts, software, audits, etc., why do black hats stay ahead of the game?

Leaving all the tedious self-justification of the security industry to one side, the answer is quite simple: Black Hats share information.

Whether the information is about social engineering, exploits to software, insecure networks or techniques for any of the foregoing, Black Hats share information.

I am not suggesting that the NSA publish its network woes on Facebook (although someone created a page for it: Facebook – NSA) but it should be capable of automatic sharing of computer security incidents with like minded agencies.

Topic maps could help both share and filter the sharing of information in a highly automated fashion.

Don’t know that you would catch up to the Black Hats but at least you would not be losing ground.
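
What might “automatic sharing” look like at its simplest? A structured incident record that a partner’s system can ingest and map into its own vocabulary. The field names below follow no particular standard and the indicator values are invented for illustration.

```python
# A toy sketch of a shareable incident record, not any official format.
import json

incident = {
    "id": "2012-08-09-0042",
    "observed": "2012-08-09T14:32:00Z",
    "type": "suspicious-outbound-traffic",
    "indicators": ["198.51.100.23", "bad-domain.example"],
    "suggested_remediation": "block at the egress firewall",
}

payload = json.dumps(incident, indent=2)
print(payload)  # POST this to a partner's intake endpoint; each partner then
                # maps the shared indicators into its own vocabulary (a job
                # topic maps are well suited for).
```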

August 8, 2012

GitLaw in Germany

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 1:51 pm

GitLaw in Germany: Deutsche Bundesgesetze- und verordnungen im Markdown auf GitHub = German Federal Laws and Regulations in Markdown on GitHub

Legal Informatics reports that German Federal Laws and Regulations are available in Markdown.

A useful resource if you have the legal expertise to make good use of it.

I would not advise self-help based on a Google translation of any of these materials.

Day Nine of a Predictive Coding Narrative: A scary search…

Filed under: e-Discovery,Email,Prediction,Predictive Analytics — Patrick Durusau @ 1:50 pm

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were likely to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.
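
As a side note, the 1,065-document figure is roughly what the standard sample-size formula gives for 95% confidence and a 3% margin of error (worst case p = 0.5), once a finite population correction is applied. A quick sketch of the arithmetic; the null set size below is an assumed placeholder, not a figure from Ralph’s project.

```python
# Standard sample-size calculation with an optional finite population correction.
import math

def sample_size(z=1.96, margin=0.03, p=0.5, population=None):
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population:  # finite population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size())                      # ~1068 for an unbounded population
print(sample_size(population=700_000))    # slightly lower for a finite null set
```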

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Filed under: e-Discovery,Email,Prediction,Predictive Analytics — Patrick Durusau @ 1:50 pm

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his own use, my question is how to capture what he learned for re-use.

As in re-use by other parties, perhaps in other litigation.

I am thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

The 2012 Nucleic Acids Research Database Issue…

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection by Michael Y. Galperin and Xosé M. Fernández-Suárez.

Abstract:

The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

That is the abstract of the article describing Nucleic Acids Research, Database Issue, Volume 40, Issue D1, January 2012.

Very much like being a kid in a candy store. Hard to know what to look at next! Both for subject matter experts and those of us interested in the technology aspects of the databases.

ANNOVAR: functional annotation of genetic variants….

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data by Kai Wang, Mingyao Li, and Hakon Hakonarson. (Nucl. Acids Res. (2010) 38 (16): e164. doi: 10.1093/nar/gkq603)

Just in case you are unfamiliar with ANNOVAR, the software mentioned in gSearch: a fast and flexible general search tool for whole-genome sequencing:

Abstract:

High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

Approximately two years separate ANNOVAR from gSearch. That should give you an idea of the speed of development in bioinformatics. They haven’t labored over finding a syntax for everyone to use for more than a decade. I suspect there is a lesson in there somewhere.

gSearch: a fast and flexible general search tool for whole-genome sequencing

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:49 pm

gSearch: a fast and flexible general search tool for whole-genome sequencing by Taemin Song, Kyu-Baek Hwang, Michael Hsing, Kyungjoon Lee, Justin Bohn, and Sek Won Kong.

Abstract:

Background: Various processes such as annotation and filtering of variants or comparison of variants in different genomes are required in whole-genome or exome analysis pipelines. However, processing different databases and searching among millions of genomic loci is not trivial.

Results: gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner. The proposed method is not a stand-alone annotation tool with its own reference databases. Rather, it is a search utility that readily accepts public or user-prepared reference files in various formats including GVF, Generic Feature Format version 3 (GFF3), Gene Transfer Format (GTF), VCF and Browser Extensible Data (BED) format. Compared to existing tools such as ANNOVAR, gSearch runs more than 10 times faster. For example, it is capable of annotating 52.8 million variants with allele frequencies in 6 min.

Availability: gSearch is available at http://ml.ssu.ac.kr/gSearch. It can be used as an independent search tool or can easily be integrated to existing pipelines through various programming environments such as Perl, Ruby and Python.

As the abstract says: “…searching among millions of genomic loci is not trivial.”

Either for integration with topic map tools in a pipeline or for searching technology, definitely worth a close reading.

BioContext: an integrated text mining system…

Filed under: Bioinformatics,Biomedical,Entity Extraction,Text Mining — Patrick Durusau @ 1:49 pm

BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events by Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, and Goran Nenadic. (Bioinformatics (2012) 28 (16): 2154-2161. doi: 10.1093/bioinformatics/bts332)

Abstract:

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.

Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.

If you are interested in text mining by professionals, this is a good place to start.

Should be of particular interest to anyone interested in mining literature for construction of a topic map.

Lucene Eurocon / ApacheCon Europe

Filed under: Lucene,LucidWorks — Patrick Durusau @ 1:48 pm

Lucene Eurocon / ApacheCon Europe November 5-8 | Sinsheim, Germany

From a post I got today from Lucid Imagination:

Lucid Imagination and the Apache Foundation have agreed to co-locate Lucid’s Apache Lucene EuroCon with ApacheCon Europe being held this November 5-8 in Sinsheim, Germany. Lucene EuroCon at ApacheCon Europe will cover the breadth and depth of search innovation and application. The dedicated track will bring together Apache Lucene/Solr committers and technologists from around the world to offer compelling presentations that share future directions for the project and technical implementation experiences. Topic examples include channeling the flood of structured and unstructured data into faster, more cost-effective Lucene/Solr search applications that span a host of sectors and industries.

Some of the most talented Lucene/Solr developers gather each year at Apache Lucene EuroCon to share best practices and create next-generation search applications. Coupling Apache Lucene EuroCon with this year’s ApacheCon Europe offers a great benefit to the community at large. The combined attendees benefit from expert trainings and in-depth sessions, real-world case studies, excellent networking and the opportunity to connect with the industry’s leading minds.

Call For Papers Deadline is August 13

The Call for Papers for ApacheCon has been extended to August 13, 2012, and can be found on the ApacheCon website. As always, proceeds from Apache Lucene EuroCon benefit The Apache Software Foundation. We encourage all Lucene/Solr committers and developers who have a technical story to tell to submit an abstract. Apache Lucene/Solr has a rich community of developers. Supporting ApacheCon Europe by submitting your abstract and sharing your story is important for maintaining this important and thriving community.

Just so you don’t think this is a search-only event, papers are welcome on:

  • Apache Daily – Tools frameworks and components used on a daily basis
  • ApacheEE – Java enterprise projects
  • Big Data – Cassandra, Hadoop, HBase, Hive, Kafka, Mahout, Pig, Whirr, ZooKeeper and friends
  • Camel in Action – All things Apache Camel, from their problems to their solutions
  • Cloud – Cloud-related applications of a broad range of Apache projects
  • Linked Data – (need a concise caption for this track)
  • Lucene, SOLR and Friends – Learn about important web search technologies from the experts
  • Modular Java Applications – Using Felix, ACE, Karaf, Aries and Sling to deploy modular Java applications to public and private cloud environments
  • NoSQL Database – Use cases and recent developments in Cassandra, HBase, CouchDB and Accumulo
  • OFBiz – The Apache Enterprise Automation project
  • Open Office – Open Office and the Apache Content Ecosystem
  • Web Infrastructure – HTTPD, TomCat and Traffic Server, the heart of many Internet projects

Submissions are welcome from any developer or user of Apache projects. First-time speakers are just as welcome as experienced ones, and we will do our best to make sure that speakers get all the help they need to give a great presentation.

Riak 1.2 Webinar – 21st August 2012

Filed under: Erlang,Riak — Patrick Durusau @ 1:48 pm

Riak 1.2 Webinar – 21st August 2012

  • 11:00 Pacific Daylight Time (San Francisco, GMT-07:00)
  • 14:00 Eastern Daylight Time (New York, GMT-04:00)
  • 20:00 Europe Summer Time (Berlin, GMT+02:00)

From the registration page:

Join Basho Technologies’ Engineer, Joseph Blomstedt, for an in-depth overview of Riak 1.2, the latest version of Basho’s flagship open source database. In this live webinar, you will see changes in Riak 1.2 open source and Enterprise versions, including:

  • New approach to cluster administration
  • Built-in capability negotiation
  • Repair Search or KV Partitions thru Riak Console
  • Enhanced Handoff Reporting
  • Protobuf API Support for 2i and Search indexes
  • New Packaging for FreeBSD, SmartOS, and Ubuntu
  • Stats Improvements
  • LevelDB Improvements

I would have included this with the Riak 1.2 release post but was afraid you would not get past the download link and would miss the webinar.

It’s on my calendar. How about yours?

Riak 1.2 Is Official!

Filed under: Erlang,Riak — Patrick Durusau @ 1:46 pm

Riak 1.2 Is Official!

From the post:

Nearly three years ago to the day, from a set of green, worn couches in a modest office in Cambridge, Massachusetts, the Basho team announced Riak to the world. To say we’ve come a long way from that first release would be an understatement, and today we’re pleased to announce the release and general availability of Riak 1.2.

Here’s the tl;dr on what’s new and improved since the Riak 1.1 release:

  • More efficiently add multiple Riak nodes to your cluster
  • Stage and review, then commit or abort cluster changes for easier operations; plus smoother handling of rolling upgrades
  • Better visibility into active handoffs
  • Repair Riak KV and Search partitions by attaching to the Riak Console and using a one-line command to recover from data corruption/loss
  • More performant stats for Riak; the addition of stats to Riak Search
  • 2i and Search usage thru the Protocol Buffers API
  • Official Support for Riak on FreeBSD
  • In Riak Enterprise: SSL encryption, better balancing and more granular control of replication across multiple data centers, NAT support

If that’s all you need to know, download the new release or read the official release notes. Also, go register for RICON.
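
On the “2i and Search usage thru the Protocol Buffers API” item: here is a rough sketch of the same kind of secondary-index query made over Riak’s HTTP interface instead of protobufs. Host, port, bucket and index names are placeholders.

```python
# Query a Riak secondary index (2i) over HTTP and fetch the matching objects.
import json
from urllib.request import urlopen

base = "http://localhost:8098"

# Find all keys in the "users" bucket whose "email_bin" index matches a value.
url = f"{base}/buckets/users/index/email_bin/alice%40example.com"
keys = json.loads(urlopen(url).read())["keys"]

# Fetch the matching objects by key.
for key in keys:
    obj = urlopen(f"{base}/buckets/users/keys/{key}").read()
    print(key, obj)
```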

OK, but I have a question: What happened to the lucky “…green, worn couches…”? 😉

August 7, 2012

RESQUE: Network reduction…. [Are you listening NSA?]

Filed under: Bioinformatics,Graphs,Networks — Patrick Durusau @ 4:23 pm

RESQUE: Network reduction using semi-Markov random walk scores for efficient querying of biological networks by Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon. (Bioinformatics (2012) 28 (16): 2129-2136. doi: 10.1093/bioinformatics/bts341)

Abstract:

Motivation: Recent technological advances in measuring molecular interactions have resulted in an increasing number of large-scale biological networks. Translation of these enormous network data into meaningful biological insights requires efficient computational techniques that can unearth the biological information that is encoded in the networks. One such example is network querying, which aims to identify subnetwork regions in a large target network that are similar to a given query network. Network querying tools can be used to identify novel biological pathways that are homologous to known pathways, thereby enabling knowledge transfer across different organisms.

Results: In this article, we introduce an efficient algorithm for querying large-scale biological networks, called RESQUE. The proposed algorithm adopts a semi-Markov random walk (SMRW) model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores, which are also iteratively re-estimated to improve accuracy until the best matching subnetwork emerges. We demonstrate that the proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology and yields accurate querying results.

Availability: The source code of RESQUE is freely available at http://www.ece.tamu.edu/~bjyoon/RESQUE/

If you promise not to tell, you can get a preprint version of the article at the source code link.

RESQUE: REduction-based scheme using Semi-Markov scores for network QUErying.

Sounds like a starting point if you are interested in the Visualize This! (NSA Network Visualization Contest).
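
For a feel of what “random walk scores” means, here is a toy illustration of random-walk-based node scoring on a tiny graph. It is not the RESQUE algorithm, just the flavor of the idea: nodes the walker visits often, starting from seed nodes resembling the query, receive higher correspondence scores.

```python
# Toy random-walk-with-restart scoring over an adjacency-list graph.
import random

edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
seeds = ["a"]        # nodes already believed to match the query
restart = 0.15       # probability of jumping back to a seed node
visits = {node: 0 for node in edges}

node = random.choice(seeds)
for _ in range(100_000):
    visits[node] += 1
    if random.random() < restart or not edges[node]:
        node = random.choice(seeds)
    else:
        node = random.choice(edges[node])

total = sum(visits.values())
print({n: round(v / total, 3) for n, v in sorted(visits.items())})  # rough scores
```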


If you know of anyone looking for a tele-commuting researcher who finds interesting things, pass my name along. Thanks!

Cassandra and OpsCenter from Datastax

Filed under: Cassandra,DataStax — Patrick Durusau @ 4:03 pm

Cassandra and OpsCenter from Datastax

Istvan Szegedi details installation of both Cassandra and OpsCenter along with some basic operations.

From the post:

Cassandra – originally developed at Facebook – is another popular NoSQL database that combines Amazon’s Dynamo distributed systems technologies and Google’s Bigtable data model based on Column Families. It is designed for distributed data at large scale. Its key components are as follows:

Keyspace: it acts as a container for data, similar to an RDBMS schema. This determines the replication parameters such as replication factor and replication placement strategy, as we will see later in this post. More details on replication placement strategy can be read here.

Column Family: within a keyspace you can have one or more column families. This is similar to tables in the RDBMS world. They contain multiple columns which are referenced by row keys.

Column: it is the smallest increment of data. It is a tuple having a name, a value and a timestamp.

Another information center you are likely to encounter.
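
If the keyspace / column family / column hierarchy is new to you, a toy sketch of the data model as nested structures may help. The names and values are invented for illustration, not taken from the post.

```python
# keyspace -> column family -> row key -> columns (name: (value, timestamp))
import time

keyspace = {
    "users": {                                   # a column family
        "row-123": {                             # a row key
            "name": ("Ada", time.time()),        # column: value plus timestamp
            "email": ("ada@example.com", time.time()),
        },
    },
}

# Reading the smallest increment of data, a single column:
value, timestamp = keyspace["users"]["row-123"]["email"]
print(value, timestamp)
```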

Visualize This! (NSA Network Visualization Contest)

Filed under: Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 3:54 pm

Visualize This! (NSA Network Visualization Contest)

From the webpage:

Are you a visual designer who can distill complex ideas and information into a clear and elegant display? Do you love testing your skills against the most difficult creative challenges? Then Visualize This! is the competition for you! The National Security Agency is looking for breakthrough visualizations that can bring order to the chaotic displays of large-scale computer networks.

Network performance and security depend on being able to quickly and effectively identify the changes occurring in network activity. However, current visualization tools are not suited to displaying these changes in ways that are clear and actionable.

You will be presented with a scenario of events and challenged to design a next-generation display that enables a network manager to immediately take appropriate action. This is the opportunity to channel your creative energies toward safer networks – make your mark on the world VISIBLE!

The challenge is due to appear in about two (2) months, but I thought you might want to start sharpening your graph, tuple and topic map skills on simulated network topologies ahead of time.

I wonder if “clear and actionable” includes targeting information? 😉

BTW, be aware that the NSA markets Renoir:

Renoir: General Network Visualization and Manipulation Program

and it is covered by US Patent 6,515,666 which reads in part:

Method for constructing graph abstractions

Abstract

A method of constructing graph abstractions using a computer is described. The abstraction is presented on a computer display and used by a human viewer to understand a more complicated set of raw graphs. The method provides rapid generation of an abstraction that offers an arbitrary composition graph of vertices into composite vertices, dispersing and marshaling of composite vertices, arbitrary hiding and showing of portions of the composition, and marking of points of elision.

Patents like this remind me I need to finish my patent on addition/subtraction, if someone hasn’t beaten me to it.

Choking Cassandra Bolt

Filed under: Cassandra,Kafka,Storm,Tuples — Patrick Durusau @ 1:57 pm

Got your attention? Good!

In A Big Data Trifecta: Storm, Kafka and Cassandra, Brian O’Neill details an architecture that was fast enough to choke the Cassandra Bolt component. (He also details how to fix that problem.)

The architecture is based on the exchange of tuples and was writing at 5,000 writes per second on a laptop.

More details to follow but I think you can get enough from the post to start experimenting on your own.

I first saw this at: Alex Popescu’s myNoSQL under A Big Data Trifecta: Storm, Kafka and Cassandra.

Labcoat by Precog

Filed under: BigData,Data Mining,Labcoat — Patrick Durusau @ 1:32 pm

Labcoat by Precog

From the webpage:

Labcoat is an interactive data analysis tool.

With Labcoat, developers and data scientists can integrate, analyze, and visualize massive volumes of semi-structured data.

It’s now incredibly easy to understand your data. Simply visit Labcoat to explore sample data sets and run some queries – all right in your browser. Unleash your inner data scientist!

I just encountered this so don’t have any details to report.

But, if they put as much effort into the technical side as they did the marketing images (Your App versus Your App on Precog is particularly funny), then this could prove to be quite useful.

It is in private beta so not much to report. Will update when it comes out of beta or I can point to more resources.

I first saw this at KDNuggets.

Announcing Scalable Performance Monitoring (SPM) for JVM

Filed under: Java,Systems Administration — Patrick Durusau @ 12:56 pm

Announcing Scalable Performance Monitoring (SPM) for JVM (Sematext)

From the post:

Up until now, SPM existed in several flavors for monitoring Solr, HBase, ElasticSearch, and Sensei. Besides metrics specific to a particular system type, all these SPM flavors also monitor OS and JVM statistics. But what if you want to monitor any Java application? Say your custom Java application runs either in some container, application server, or from a command line? You don’t really want to be forced to look at blank graphs that are really meant for stats from one of the above mentioned systems. This was one of our own itches, and we figured we were not the only ones craving to scratch that itch, so we put together a flavor of SPM for monitoring just the JVM and (Operating) System metrics.

Now SPM lets you monitor OS and JVM performance metrics of any Java process through the following 5 reports, along with all other SPM functionality like integrated Alerts, email Subscriptions, etc. If you are one of many existing SPM users these graphs should look very familiar.

JVM monitoring isn’t like radio station management, where you can listen for dead air. It’s a bit more complicated than that.

SPM may help with it.

Beyond the JVM and OS, how do you handle monitoring of topic map applications?

HttpFS for CDH3 – The Hadoop FileSystem over HTTP

Filed under: Hadoop — Patrick Durusau @ 12:28 pm

HttpFS for CDH3 – The Hadoop FileSystem over HTTP by Alejandro Abdelnur.

From the post:

HttpFS is an HTTP gateway/proxy for Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy so, unlike WebHDFS, it does not require clients to be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.

Given the constant interest we’ve seen by CDH3 users in Hoop, we have backported Apache Hadoop HttpFS to work with CDH3.
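
As a rough illustration of the WebHDFS-compatible REST API mentioned in the post, here is a directory listing made through a gateway. The gateway host, port (assumed here), path and user name are all placeholders; the point is that the client only needs to reach the gateway, not every cluster node.

```python
# List an HDFS directory over plain HTTP, no HDFS client libraries required.
import json
from urllib.request import urlopen

gateway = "http://httpfs-gateway.example.com:14000"   # assumed HttpFS endpoint
path = "/user/alice/data"

url = f"{gateway}/webhdfs/v1{path}?op=LISTSTATUS&user.name=alice"
listing = json.loads(urlopen(url).read())
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])
```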

Another step in the evolution of Hadoop.

And another name change. (Hoop to HttpFS)

Not that changing names would confuse any techie types. Or their search engines. 😉

I wonder if Hadoop is a long tail community? Thoughts?

Catering to the long tail? (business opportunity)

Filed under: Game Theory,Games,Marketing — Patrick Durusau @ 12:16 pm

I was struck by a line in Lattice games and the Economics of aggregators, by P. Jordan, U. Nadav, K. Punera, A. Skrzypacz, and G. Varghese, that reads:

A vendor that can provide good tools to reduce the cost of doing business F is likely to open the floodgates for new small aggregators to cater to the long tail of user interests — and reap a rich reward in doing so.

You see? Struggling through all the game theory parts of the paper was worth your time!

A topic map application that enables small aggregators to select, re-purpose and re-brand content for their “long tail of user interests” could be exactly that.

Each aggregator could have their own “view/terminology/etc.,” both as a filter for the content delivered and for how it appears to their users.

Not a long tail example, but think of the recent shooting incident in Aurora.

A topic map application could deliver content to gun control aggregators, with facts about the story that support new gun control laws, petitions and other activities.

At the same time, the same topic map application could deliver to NRA aggregators the closest gun stores and their hours, for people who take such incidents as a reason to arm themselves more fully.

Same content, just repurposed on demand for different aggregators.

True, any relatively sophisticated user can set up their own search/aggregation service, but that’s the trick, isn’t it? Any “relatively sophisticated user.”

I am thinking not so much of a “saved search” or “alert” (dumpster diving is only so productive, and it is tiring) as of curated and complex searches that users can select for inclusion, so they get the “best” searches, composed by experts.

I am sure there are other options and possibilities for delivery of both services and content. Topic maps should score high for either one.

PS: Slides from Stanford RAIN Seminar

Titan Provides Real-Time Big Graph Data

Filed under: Amazon Web Services AWS,Graphs,Titan — Patrick Durusau @ 10:50 am

Titan Provides Real-Time Big Graph Data

From the post:

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers which are transacting with a 6 machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short summary, Titan performs well under substantial load with a relatively inexpensive cluster and as such, is capable of backing online services requiring real-time Big Graph Data.

A fuller version of the information can be found at: Titan Stress Poster [Government Comparison Shopping?].

BTW, Titan is reported to emerge as 0.1 (from 0.1 alpha) later this (2012) summer.
