Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 10, 2013

Hadoop Summit North America 2013

Filed under: Conferences,Hadoop — Patrick Durusau @ 1:47 pm

Oldest and Largest Apache Hadoop Community Event in North America Opens Call for Papers by Kim Rose.

Dates:

Early Bird Registration ends February 1, 2013

Abstract Deadline: February 22, 2013

Conference: June 26-27, 2013 (San Jose, CA)

From the post:

Hadoop Summit North America 2013, the premier Apache Hadoop community event, will take place at the San Jose Convention Center, June 26-27, 2013. Hosted by Hortonworks, a leading contributor to Apache Hadoop, and Yahoo!, Hadoop Summit brings together the community of developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending and implementing Apache Hadoop as the next-generation enterprise data platform.

This 6th Annual Hadoop Summit North America will feature seven tracks and more than 80 sessions focused on building, managing and operating Apache Hadoop from some of the most influential speakers in the industry. Growing 30 percent to more than 2,200 attendees last year, Hadoop Summit reached near sell-out crowds. This year, the Summit is expected to be even larger.

Apache Hadoop is the open source technology that enables organizations to more efficiently and cost-effectively store, process, manage and analyze the ever-increasing volume of data being created and collected every day. Yahoo! pioneered Apache Hadoop and is still a leading user of the big data platform. Hortonworks is a core contributor to the Apache Hadoop technology via the company’s key architects and engineers.

The Hadoop Summit tracks include the following:

  • Hadoop-Driven Business / Business Intelligence: Will focus on how Apache Hadoop is powering a new generation of business intelligence solutions, including tools, techniques and solutions for deriving business value and competitive advantage from the large volumes of data flowing through today’s enterprise.
  • Applications and Data Science: Will focus on the practice of data science using Apache Hadoop, including novel applications, tools and algorithms, as well as areas of advanced research and emerging applications that use and extend the Apache Hadoop platform.
  • Deployment and Operations: Will focus on the deployment, operation and administration of Apache Hadoop clusters at scale, with an emphasis on tips, tricks and best practices.
  • Enterprise Data Architecture: Will focus on Apache Hadoop as a data platform and how it fits within broader enterprise data architectures.
  • Future of Apache Hadoop: Will take a technical look at the key projects and research efforts driving innovation in and around the Apache Hadoop platform.
  • Apache Hadoop (Disruptive) Economics: Focusing on business innovation, this track will provide concrete examples of how Apache Hadoop enables businesses across a wide range of industries to become data-driven, deriving value from data in order to achieve competitive advantage and/or new levels of productivity.
  • Reference Architectures: Apache Hadoop impacts every level of the enterprise data architecture from storage and operating systems through end-user tools and applications. This track will focus on how the various components of the enterprise ecosystem integrate and interoperate with Apache Hadoop.

The Hadoop Summit North America 2013 call for papers is now open. The deadline to submit an abstract for consideration is February 22, 2013. Track sessions will be voted on by all members of the Apache Hadoop ecosystem using a free voting system called Community Choice. The top ranking sessions in each track will automatically be added to the Hadoop Summit agenda. Remaining sessions will be chosen by a committee of industry experts using their experience and feedback from the Community Choice.

Discounted early bird registration is available now through February 1, 2013. To register for the event or to submit a speaking abstract for consideration, please visit: www.hadoopsummit.org/san-jose/

Sponsorship packages are also now available. For more information on how to sponsor this year’s event please visit: www.hadoopsummit.org/san-jose/sponsors/

I am sure your Hadoop based topic maps solution would be welcome at this conference.

And, it makes a nice warm up for the Balisage conference in August.

Markup Olympics (Balisage) [No Drug Testing]

Filed under: Conferences,XML,XML Database,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 1:46 pm

Markup athletes take heart! Unlike venues that intrude into the personal lives of competitors, there are no, repeat, no drug tests for presenters at Balisage!

Fear no trainer betrayals or years of being dogged by second-raters in the press.

Eat, drink, visit, ???, present, in the company of your peers.

The more traditional call for participation, yawn, has the following details:

Dates:

15 March 2013 – Peer review applications due
19 April 2013 – Paper submissions due
19 April 2013 – Applications for student support awards due
21 May 2013 – Speakers notified
12 July 2013 – Final papers due

5 August 2013 – Pre-conference Symposium on XForms
6-9 August 2013 – Balisage: The Markup Conference

From the call:

Balisage is where people interested in descriptive markup meet each year in August for informed technical discussion, occasionally impassioned debate, good coffee, and the incomparable ambience of one of North America’s greatest cities, Montreal. We welcome anyone interested in discussing the use of descriptive markup to build strong, lasting information systems.

Practitioner or theorist, tool-builder or tool-user, student or lecturer — you are invited to submit a paper proposal for Balisage 2013. As always, papers at Balisage can address any aspect of the use of markup and markup languages to represent information and build information systems. Possible topics include but are not limited to:

  • XML and related technologies
  • Non-XML markup languages
  • Big Data and XML
  • Implementation experience with XML parsing, XSLT processors, XQuery processors, XML databases, XProc integrations, or any markup-related technology
  • Semantics, overlap, and other complex fundamental issues for markup languages
  • Case studies of markup design and deployment
  • Quality of information in markup systems
  • JSON and XML
  • Efficiency of Markup Software
  • Markup systems in and for the mobile web
  • The future of XML and of descriptive markup in general
  • Interesting applications of markup

In addition, please consider becoming a Peer Reviewer. Reviewers play a critical role towards the success of Balisage. They review blind submissions — on topics that interest them — for technical merit, interest, and applicability. Your comments and recommendations can assist the Conference Committee in creating the program for Balisage 2013!

How:

More IQ per square foot than any other conference you will attend in 2013!

Getting Started with ArrayFire – a 30-minute Jump Start

Filed under: GPU,HPC — Patrick Durusau @ 1:46 pm

Getting Started with ArrayFire – a 30-minute Jump Start

From the post:

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts.

ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax for ArrayFire. He also provided an overview of existing efforts in GPU software, and compared them to the extensive capabilities of ArrayFire.

If you need to push the limits of current performance, GPUs are one way to go.

Maybe 2013 will be your GPU year!

Stop Hosting Data and Code on your Lab Website

Filed under: Archives,Data — Patrick Durusau @ 1:45 pm

Stop Hosting Data and Code on your Lab Website by Stephen Turner.

From the post:

It’s happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there’s no trace of what you were looking for.

THE PROBLEM

This isn’t an uncommon problem. See the following two articles:

Schultheiss, Sebastian J., et al. “Persistence and availability of web services in computational biology.” PLoS one 6.9 (2011): e24914. 

Wren, Jonathan D. “404 not found: the stability and persistence of URLs published in MEDLINE.” Bioinformatics 20.5 (2004): 668-672.

The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:

  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.

The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
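(A hedged aside: the kind of availability check behind such surveys is easy to sketch. The URLs below are placeholders, and the papers' actual methodology was more careful than a bare HTTP probe.)

```python
# Minimal link-rot survey sketch: check which published URLs still respond.
# Assumes the `requests` package is installed; the URL list is illustrative.
import requests

urls = [
    "https://example.org/lab/tool-a",
    "https://example.org/~professorX/data.tar.gz",
]

available, dead = [], []
for url in urls:
    try:
        # HEAD keeps the check cheap; fall back to GET if a server rejects HEAD.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=10, stream=True)
        (available if resp.status_code < 400 else dead).append(url)
    except requests.RequestException:
        dead.append(url)

print("{} of {} URLs still available".format(len(available), len(urls)))
```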

Is this a problem for published data in the topic map community?

What data should we be archiving? Discussion lists? Blogs? Public topic maps?

What do you think of Stephen’s solution?

Ontology Alert! Molds are able to reproduce sexually

Filed under: Biomedical,Ontology — Patrick Durusau @ 1:45 pm

Unlike we thought for 100 years: Molds are able to reproduce sexually

For over 100 years, it was assumed that the penicillin-producing mould fungus Penicillium chrysogenum only reproduced asexually through spores. An international research team led by Prof. Dr. Ulrich Kück and Julia Böhm from the Chair of General and Molecular Botany at the Ruhr-Universität has now shown for the first time that the fungus also has a sexual cycle, i.e. two “genders”. Through sexual reproduction of P. chrysogenum, the researchers generated fungal strains with new biotechnologically relevant properties – such as high penicillin production without the contaminating chrysogenin. The team from Bochum, Göttingen, Nottingham (England), Kundl (Austria) and Sandoz GmbH reports in PNAS. The article will be published in this week’s Online Early Edition and was selected as a cover story.

J. Böhm, B. Hoff, C.M. O’Gorman, S. Wolfers, V. Klix, D. Binger, I. Zadra, H. Kürnsteiner, S. Pöggeler, P.S. Dyer, U. Kück (2013): Sexual reproduction and mating-type – mediated strain development in the penicillin-producing fungus Penicillium chrysogenum, PNAS, DOI: 10.1073/pnas.1217943110

If you have hard-coded asexual reproduction into your ontology, it's time to reconsider that decision. And to get agreement on reworking all the dependent relationships.

January 9, 2013

izik Debuts as #1 Free Reference App on iTunes

Filed under: Interface Research/Design,izik,Search Engines,Search Interface — Patrick Durusau @ 12:04 pm

izik Debuts as #1 Free Reference App on iTunes

From the post:

We launched izik, our search app for tablets, last Friday and are amazed at the responses we’ve received! Thanks to our users, on day one izik was the #1 free reference app on iTunes and #49 free app overall. Yesterday we were mentioned twice in the New York Times, here and here (also in the B1 story in print). We are delighted that there is such a strong desire to see something fresh and new in search, and that our vision with izik is so well received.

The twitterverse has been especially active in spreading the word about izik. We’ve seen a lot of comments about the beautiful design and interface, the useful categories, and most importantly the high quality results that make izik a truly viable choice for searching on tablets.

Just last Monday I remarked: “From the canned video I get the sense that the interface is going to make search different.” (izik: Take Search for a Joy Ride on Your Tablet)

Users with tablets have supplied the input I asked for in that post and it is overwhelmingly in favor of izik.

To paraphrase Ray Charles in the Blues Brothers:

“E-excuse me, uh, I don’t think there’s anything wrong with the action on [search applications].”

There is plenty of “action” left in the search space.

izik is fresh evidence for that proposition.

Cloudera Impala: A Modern SQL Engine for Hadoop [Webinar – 10 Jan 2013]

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 12:04 pm

Cloudera Impala: A Modern SQL Engine for Hadoop

From the post:

Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.

Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.

Looking forward to the comparison part. Picking the right tool for a job is an important first step.

Interactive SVG + Canvas Plot

Filed under: D3,Visualization — Patrick Durusau @ 12:04 pm

Interactive SVG + Canvas Plot by Sujay Vennam.

Amusing and potentially useful.

What forces, positive or negative, would you have operating on your nodes?

I first saw this in a tweet by Christophe Viau.

A Guide to Python Frameworks for Hadoop

Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 12:03 pm

A Guide to Python Frameworks for Hadoop by Uri Laserson.

From the post:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

  • Hadoop Streaming
  • mrjob
  • dumbo
  • hadoopy
  • pydoop
  • and others

Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

Read on for implementation details, performance comparisons, and feature comparisons.
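If you haven't met the Streaming convention before, here is a minimal sketch of my own (not from Uri's post) showing the stdin/stdout contract: mapper and reducer exchange tab-separated key/value lines, and the framework sorts mapper output by key before the reducer sees it. The script name, jar path and input layout in the comments are assumptions for illustration.

```python
#!/usr/bin/env python
# Hadoop Streaming sketch: per-station maximum temperature.
# Rough invocation (jar path and file names are assumptions):
#   hadoop jar $HADOOP_HOME/.../hadoop-streaming*.jar \
#       -input readings -output maxima \
#       -mapper 'max_temp.py map' -reducer 'max_temp.py reduce'
import sys


def mapper():
    # Input lines look like: "station_id<TAB>temperature"
    for line in sys.stdin:
        station, _, temp = line.rstrip("\n").partition("\t")
        if temp:
            print("{}\t{}".format(station, temp))


def reducer():
    # Relies on Hadoop having sorted the mapper output by key.
    current, best = None, None
    for line in sys.stdin:
        station, _, temp = line.rstrip("\n").partition("\t")
        value = float(temp)
        if station != current:
            if current is not None:
                print("{}\t{}".format(current, best))
            current, best = station, value
        else:
            best = max(best, value)
    if current is not None:
        print("{}\t{}".format(current, best))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```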

A non-word count Hadoop example? Who would have thought? 😉

Enjoy!

Using a Graph Database with Ruby [Parts 1 and 2]

Filed under: Graphs,Neo4j,Ruby — Patrick Durusau @ 12:03 pm

Using a Graph Database with Ruby. Part 1: Introduction and Using a Graph Database with Ruby. Part 2: Integration by Thiago Jackiw.

From the introduction to Part 2:

In the first article, we learned about graph databases, their differences and advantages over traditional databases, and about Neo4j. In this article, we are going to install Neo4j, integrate and evaluate the gems listed in the first part of this series.

The scenario that we are going to be working with is the continuation of the simple idea in the first article, a social networking example that is capable of producing traversal queries such as “given the fact that Bob is my friend, give me all friends that are friends of friends of friends of Bob”.

You may want to skip the first part if you are already familiar enough with Neo4j or graphs to want to use them. 😉

The second part walks you through creation of enough data to demonstrate traversals and some of the capabilities of Neo4j.
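As a taste of what such a traversal computes, here is a toy Python version of the “friends of friends of friends” query, using a plain adjacency dict instead of Neo4j (the names are made up; the articles do this properly with the Neo4j gems):

```python
# Toy version of the "friends of friends of friends of Bob" traversal,
# breadth-first over a plain adjacency dict.
from collections import deque

friends = {
    "Bob":   ["Alice", "Carol"],
    "Alice": ["Dave"],
    "Carol": ["Eve", "Dave"],
    "Dave":  ["Frank"],
    "Eve":   [],
    "Frank": [],
}

def friends_within(start, max_depth):
    """Return everyone reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = set()
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue
        for friend in friends.get(person, []):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                queue.append((friend, depth + 1))
    return found

print(friends_within("Bob", 3))  # {'Alice', 'Carol', 'Dave', 'Eve', 'Frank'}
```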

I first saw this in a tweet by Glenn Goodrich.

Bitly Social Data APIs

Filed under: Bitly,Data Source,Search Data — Patrick Durusau @ 12:02 pm

Bitly Social Data APIs by Hilary Mason.

From the post:

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

First, we share the analysis that we do at the link level….

Second, we’ve opened up access to a realtime search engine. …

Finally, we asked the question — what is the world paying attention to right now?…”bursting phrases”…

See Hilary’s post for the details, or even better, take a shot at the APIs!

I first saw this in a tweet by Dave Fauth.

@AIMS Webinars on Linked Data

Filed under: Linked Data,LOD,Semantic Web — Patrick Durusau @ 12:01 pm

@AIMS Webinars on Linked Data

From the website:

The traditional approach of sharing data within silos seems to have reached its end. From governments and international organizations to local cities and institutions, there is a widespread effort of opening up and interlinking their data. Linked Data, a term coined by Tim Berners-Lee in his design note regarding the Semantic Web architecture, refers to a set of best practices for publishing, sharing, and interlinking structured data on the Web.

Linked Open Data (LOD), a concept that has leapt onto the scene in the last years, is Linked Data distributed under an open license that allows its reuse for free. Linked Open Data becomes a key element to achieve interoperability and accessibility of data, harmonisation of metadata and multilinguality.

There are four remaining seminars in this series:

Webinar in French | 22nd January 2013 – 11:00am Rome time
Clarifiez le sens de vos données publiques grâce au Web de données (Clarify the meaning of your public data with the Web of Data)
Christophe Guéret, Royal Netherlands Academy of Arts and Sciences, Data Archiving and Networked Services (DANS)

Webinar in Chinese | 29th January 2013 – 02:00am Rome time
基于网络的研讨会 “题目:理解和利用关联数据 --图情档博(LAM)作为关联数据的提供者和使用者” (Web-based seminar: Understanding and Using Linked Data – Libraries, Archives and Museums (LAM) as Providers and Consumers of Linked Data)
Marcia Zeng, School of Library and Information Science, Kent State University

Webinar in Russian | 5th February 2013 – 11:00am Rome time
Введение в концепцию связанных открытых данных (Introduction to the Concept of Linked Open Data)
Irina Radchenko, Centre of Semantic Technologies, Higher School of Economics

Webinar in Arabic | 12th February 2013 – 11:00am Rome time
Ibrahim Elbadawi, UAE Federal eGovernment

Mark your agenda! See New Free Webinars @ AIMS on Linked Open Data for registration and more details.

Center for Effective Government Announces Launch [Name Change]

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 12:00 pm

Center for Effective Government Announces Launch

The former OMB Watch is now the Center for Effective Government (www.foreffectivegov.org).

A change to reflect a broader expertise on government effectiveness in general.

From the post:

The Center for Effective Government will continue to offer expert analysis, in-depth reports, and news updates on the issues it has been known for in the past. Specifically, the organization will:

  • Analyze federal tax and spending choices and advocate for progressive revenue options and transparency in federal spending;
  • Defend and improve national standards and safeguards and the regulatory systems that produce and enforce them;
  • Expose undue special interest influence in federal policymaking and advocate for open government reforms that ensure public officials put the public interest first; and
  • Encourage more active citizen engagement in our democracy by ensuring people have access to easy-to-understand, contextualized, meaningful public information and understand how they can participate in public policy decision making processes.

If you have been running a topic map in this area, update the OMB Watch topic to reflect the name change.

Beyond simple semantic impedance, which is always present, government is replete with examples of intentional impedance if not outright deception.

A fertile field for topic map practitioners!

NewGenLib Open Source…Update! [Library software]

Filed under: Library,Library software,OPACS,Software — Patrick Durusau @ 12:00 pm

NewGenLib Open Source releases version 3.0.4 R1 Update 1

From the blog:

The NewGenLib Open Source team has announced the release of a new version, 3.0.4 R1 Update 1. NewGenLib is an integrated library management system developed by Verus Solutions in conjunction with the Kesaran Institute of Information and Knowledge Management in India. The software has modules for acquisitions, technical processing, serials management, circulation, administration, MIS reports, and OPAC.

What’s new in the Update?

This new update comes with a basket of additional features and enhancements, these include:

  • Full text indexing and searching of digital attachments: NewGenLib now uses Apache Tika. With this new tool not only catalogue records but their digital attachments and URLs are indexed. Now you can also search based on the content of your digital attachments
  • Web statistics: The software facilitates the generation of statistics on OPAC usage by having an allowance for Google Analytics code.
  • User ratings of Catalogue Records: An enhancement for User reviews is provided in OPAC. Users can now rate a catalogue record on a scale of 5 (Most useful to not useful). Also, one level of approval is added for User reviews and ratings. 
  • Circulation history download: Users can now download their Circulation history as a PDF file in OPAC

NewGenLib supports MARC 21 bibliographic data, MARC authority files, Z39.50 Client for federated searching. Bibliographic records can be exported in MODS 3.0 and AGRIS AP. The software is OAI-PMH compliant. NewGenLib has a user community with an online discussion forum.

If you are looking for potential topic map markets, the country population rank graphic from Wikipedia may help:
World Population Graph

Population isn’t everything but it should not be ignored either.

Announcing TokuDB v6.6: Performance Improvements

Filed under: MariaDB,MySQL,TokuDB — Patrick Durusau @ 12:00 pm

Announcing TokuDB v6.6: Performance Improvements

From the post:

We are excited to announce TokuDB® v6.6, the latest version of Tokutek’s flagship storage engine for MySQL and MariaDB.

This version offers three types of performance improvements: in-memory, multi-client and fast updates.

Although TokuDB is optimized for large tables, which are larger than memory, many workloads consist of a mix of large and small tables. TokuDB v6.6 offers improvements on in-memory performance, with a more than 100% improvement on Sysbench at many concurrency levels and more than 200% improvement on TPC-C at many concurrency levels. Details to follow.

We have also made improvements in multi-threaded performance. For example, single threaded trickle loads have always been fast in TokuDB. But now multi-threaded trickle loads are even faster. An iibench run with four writers shows an increase from ~18K insertions/sec to ~28K insertions/sec. With a writer and reader running concurrently, we achieve ~13K insertions/sec.

Leif Walsh, one of our engineers, will be posting some details of how this particular improvement was achieved. So stay tuned for this and posts comparing our concurrent iibench performance with InnoDB’s.

A bit late for Christmas but performance improvements on top of already impressive performance are always welcome!

Looking forward to hearing more of the details!

January 8, 2013

Designing algorithms for Map Reduce

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 11:48 am

Designing algorithms for Map Reduce by Ricky Ho.

From the post:

Since the emergence of the Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The result is pretty encouraging and I’ve found Map/Reduce is applicable in a wide spectrum of application scenarios.

So I want to write down my findings but then found the scope is too broad and also I haven’t spent enough time to explore different problem domains. Finally, I realize that there is no way for me to completely cover what Map/Reduce can do in all areas, so I just dump out what I know at this moment over the long weekend when I have an extra day.

Notice that Map/Reduce is good for “data parallelism”, which is different from “task parallelism”. Here is a description about their difference and a general parallel processing design methodology.

I’ll cover the abstract Map/Reduce processing model below. For a detailed description of the implementation of the Hadoop framework, please refer to my earlier blog here.
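To make the data-parallelism versus task-parallelism point concrete, here is a small local sketch of my own (not from Ricky's post), using multiprocessing in place of Hadoop:

```python
# Data parallelism vs. task parallelism, sketched locally with multiprocessing.
# Map/Reduce fits the first pattern: the *same* function applied to many chunks
# of data, followed by a combining step.
from multiprocessing import Pool

def count_tokens(chunk):
    """Same function on every chunk of data -> data parallelism."""
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["a b c", "d e", "f g h i"]
    with Pool(processes=3) as pool:
        partials = pool.map(count_tokens, chunks)   # the "map" step
    total = sum(partials)                           # the "reduce" step
    print(total)  # 9

    # Task parallelism, by contrast, runs *different* functions concurrently
    # (e.g. parsing one file while indexing another) and is a poor fit for
    # the Map/Reduce model.
```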

A bit dated (2010) but still worth your time.

I missed its initial appearance so appreciated Ricky pointing back to it in MapReduce: Detecting Cycles in Network Graph.

You may also want to consult: Designing good MapReduce algorithms by Jeffrey Ullman.

MapReduce: Detecting Cycles in Network Graph [Merging Duplicate Identifiers]

Filed under: Giraph,MapReduce,Merging — Patrick Durusau @ 11:47 am

MapReduce: Detecting Cycles in Network Graph by Ricky Ho.

From the post:

I recently received an email from a reader of my blog on Map/Reduce algorithm design, asking how to detect whether a graph is acyclic using Map/Reduce. I think this is an interesting problem and can imagine a wide range of applications for it.

Although I haven’t solved this exact problem in the past, I’d like to sketch out my thoughts on a straightforward approach, which may not be highly optimized. My goal is to invite other readers who have solved this problem to share their tricks.

To define the problem: Given a simple directed graph, we want to tell whether it contains any cycles.

Relevant to processing identifiers in topic maps, which may occur on more than one topic (prior to merging).

What is your solution in a MapReduce context?
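Here is one straightforward sketch, simulated locally in Python rather than on a cluster, and offered as an illustration rather than as Ricky's solution: propagate reachability sets round by round, and report a cycle as soon as any node can reach itself.

```python
# Cycle detection in Map/Reduce style, simulated locally. Each round, a node's
# "reachable" set is extended by the reachable sets of its direct successors;
# if a node ever appears in its own reachable set, the graph has a cycle.
# At most |V| rounds are needed.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]  # a->b->c->a is a cycle

def has_cycle(edges):
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
    nodes = set(succ) | {d for dsts in succ.values() for d in dsts}
    reachable = {n: set(succ[n]) for n in nodes}

    for _ in range(len(nodes)):
        # "Map": each node emits, for itself, the reachable sets of its successors.
        emitted = defaultdict(set)
        for src in nodes:
            for dst in succ[src]:
                emitted[src] |= reachable[dst]
        # "Reduce": each node unions what it received into its reachable set.
        changed = False
        for node, incoming in emitted.items():
            if not incoming <= reachable[node]:
                reachable[node] |= incoming
                changed = True
        if any(n in reachable[n] for n in nodes):
            return True
        if not changed:
            return False
    return False

print(has_cycle(edges))  # True
```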

NYU Large Scale Machine Learning Class [Not a MOOC]

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 11:46 am

NYU Large Scale Machine Learning Class by John Langford.

From the post:

Yann LeCun and I are co-teaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.

This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers.

For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through that lens. More generally, I expect there is quite a bit of interest: figuring out how to use data to make predictions well is a topic of growing interest to many fields. In 2007, this was true, and demand is much stronger now. Yann and I also come from quite different viewpoints, so I’m looking forward to learning from him as well.

We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible.

Not a MOOC but videos of the lectures will be available. Details under development.

Note the request for suggestions on the class.

NULL_SETS

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 11:46 am

NULL_SETS

From the webpage:

null_sets is a new body of artwork aimed at exploring the gap between data and information. consisting of a set of images (plus a free app), this project stems from our interest in glitches, code-breaking, and translation. our custom script encodes text files as images, making it possible to visualize both the size and architecture of large-scale data sets through an aesthetic lens. so if you ever wanted to see hamlet as a jpeg and find artistic merit hiding within its code, here’s your chance.
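The project’s own script isn’t shown in the post, but the basic move is easy to sketch. Here is one naive way (mine, purely for illustration) to render a text file’s bytes as a grayscale image, using Pillow:

```python
# Naive take on the null_sets idea (not the project's actual script): render a
# text file's bytes as a grayscale image, one pixel per byte. Requires Pillow.
import math
from PIL import Image

def text_to_image(path, out_path="out.png"):
    data = open(path, "rb").read()
    side = math.ceil(math.sqrt(len(data)))          # square-ish canvas
    padded = data + bytes(side * side - len(data))  # pad with zero bytes
    img = Image.frombytes("L", (side, side), padded)
    img.save(out_path)
    return img

# text_to_image("hamlet.txt")  # every byte of the play becomes one gray pixel
```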

The video on the homepage gives you a good introduction to the project.

I included this under “interface research/design” in addition to visualization.

If it is fair to talk about Hadoop needing to be “interactive,” it stands to reason that visualization of large data sets should be as well.

Does make me wonder what change tracking would look like for interactive visualization? So you could play-back or revert to some earlier view. (Or exchange snapshots of views with others.)

I first saw this at: Null_Sets: Encoding Text as Abstract Images by Andrew Vande Moere.

Kids, programming, and doing more

Filed under: Marketing,Teaching,Topic Maps — Patrick Durusau @ 11:45 am

Kids, programming, and doing more by Greg Linden.

From the post:

I built Code Monster and Code Maven to get more kids interested in programming. Why is programming important?

Computers are a powerful tool. They let you do things that would be hard or impossible without them.

Trying to find a name that might be misspelled in a million names would take weeks to do by hand, but takes mere moments with a computer program. Computers can run calculations and transformations of data in seconds that would be impossible to do yourself in any amount of time. People can only keep about seven things in their mind at once; computers excel at looking at millions of pieces of data and discovering correlations in them.

Being able to fully use a computer requires programming. If you can program, you can do things others can’t. You can do things faster, you can do things that otherwise would be impossible. You are more powerful.
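To make Greg’s misspelled-name example concrete, here is a tiny sketch using nothing but the standard library (the names are invented; even a brute-force scan of a million names finishes in minutes, not weeks):

```python
# Fuzzy lookup of a possibly misspelled name with the standard library.
import difflib

names = ["Patrick Durusau", "Gregory Linden", "Ada Lovelace", "Alan Turing"]

# Returns the closest matches even when the query is misspelled.
print(difflib.get_close_matches("Gregor Lindon", names, n=1, cutoff=0.6))
# ['Gregory Linden']
```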

A reminder from Greg that our presentation of programming can make it “difficult” or “attractive.”

The latter requires more effort on our part but as he has demonstrated, it is possible.

Children (allegedly) being more flexible than adults, should be good candidates for attractive interfaces that use topic map principles.

So they become conditioned to options such as searching under different names for the same subjects. Or associations using different names appear as one association.

Topic map flexibility becomes their expectation rather than an exception to the rule.

Can Extragalactic Data Be Standardized? Part 2

Filed under: Astroinformatics,BigData,Parallel Programming — Patrick Durusau @ 11:44 am

Can Extragalactic Data Be Standardized? Part 2 by Ian Armas Foster.

From the post:

Last week, we profiled an effort headed by the Taiwanese Extragalactic Astronomical Data Center (TWEA-DC) to standardize astrophysical computer science.

Specifically, the objective laid out by the TWEA-DC team was to create a language specifically designed for far-reaching astronomy—a Domain Specific Language. This would create a standard environment from which software could be developed.

For the researchers at the TWEA-DC, one of the bigger issues lies in the software currently being developed for big data management. Sebastien Foucaud and Nicolas Kamennoff co-authored the paper alongside Yasuhiro Hashimoto and Meng-Feng Tsai, who are based in Taiwan, laying out the TWEA-DC. They argue that since parallel processing is a relatively recent phenomenon, many programmers have not been versed in how to properly optimize their software. Specifically, they go into how the developers are brought up in a world where computing power steadily increases.

Indeed, preparing a new generation of computer scientists and astronomers is a main focus of the data center that opened in 2010. “One of the major goals of the TWEA-DC,” the researchers say, “is to prepare the next generation of astronomers, who will have to keep up pace with the changing face of modern Astronomy.”

Standard environments for software are useful, so long as they are recognized as also being ephemeral.

What was the standard environment for software development in the 1960’s wasn’t the same as the 1980’s nor the 1980’s the same as today.

Along with temporary “standard environments,” we should also construct entrances into and be thinking about exits from those environments.

Big Data Applications Not Meeting Expectations

Filed under: BigData,Marketing,Topic Maps — Patrick Durusau @ 11:43 am

Big Data Applications Not Meeting Expectations by Ian Armas Foster.

From the post:

Now that the calendar has turned over to 2013, it is as good a time as any to check in on how big corporations are faring with big data.

The answer? According to an Actuate study, not that well. The study showed that 49% of companies overall are not planning on even evaluating big data, including 40% of companies that take in revenues of over a billion dollars. Meanwhile, only 19% of companies have implemented big data, including 26% of billion-plus revenue streams. The remaining are either planning big data applications (10% overall, 12% billion-plus) or evaluating its viability.

Where is the disconnect? According to the study, the problem lies in both a lack of expertise in handling big data and an unease regarding the cost of possible initiatives. Noting the fact that the plurality of companies are turning to Hadoop, either through Apache itself or the vendor Cloudera, that disconnect makes a little more sense. After all, it is well documented that the talent to make sense of and work in Hadoop does not quite reach the demand.

When you look at slides 12–15 of the report, how many of those would require agreement on the semantics of data?

All of them you say? 😉

As Dr. AnHai Doan pointed out in his ACM award-winning dissertation, Learning to Map Between Structured Representations of Data, the costs of mapping between data can be extremely high.

Perhaps the companies that choose to forego big data projects are more aware than most of the difficulties they face?

If you can’t capture the “why” of those mappings, the thing that would enable an incremental approach with benefits along the way, that’s not a bad reason for reluctance.

On the other hand, a company using a topic map to capture those mappings for re-use, generating benefits in the short term and not someday by and by, might reach a different decision.
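A toy sketch of what “capturing the why” might look like in practice (illustrative field names; this is neither Doan’s method nor a full topic map):

```python
# Capture the "why" of a data-integration mapping alongside the mapping itself,
# so the rationale can be reviewed and re-used instead of living in someone's head.
mappings = [
    {"source": "cust_nm", "target": "customer_name",
     "why": "CRM export abbreviates column names; confirmed with CRM team"},
    {"source": "dob", "target": "birth_date",
     "why": "field renamed in a schema migration; older extracts still use dob"},
]

def apply_mappings(record, mappings):
    """Translate a source record into the target vocabulary, field by field."""
    return {m["target"]: record[m["source"]] for m in mappings if m["source"] in record}

src = {"cust_nm": "Ada Lovelace", "dob": "1815-12-10"}
print(apply_mappings(src, mappings))
# {'customer_name': 'Ada Lovelace', 'birth_date': '1815-12-10'}
```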

Data Integration Is Now A Business Problem – That’s Good

Filed under: Data Integration,Marketing,Semantics — Patrick Durusau @ 11:43 am

Data Integration Is Now A Business Problem – That’s Good by John Schmidt.

From the post:

Since the advent of middleware technology in the mid-1990’s, data integration has been primarily an IT-led technical problem. Business leaders had their hands full focusing on their individual silos and were happy to delegate the complex task of integrating enterprise data and creating one version of the truth to IT. The problem is that there is now too much data that is highly fragmented across myriad internal systems, customer/supplier systems, cloud applications, mobile devices and automatic sensors. Traditional IT-led approaches whereby a project is launched involving dozens (or hundreds) of staff to address every new opportunity are just too slow.

The good news is that data integration challenges have become so large, and the opportunities for competitive advantage from leveraging data are so compelling, that business leaders are stepping out of their silos to take charge of the enterprise integration task. This is good news because data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem. It also shifts the emphasis for financial justification of integration initiatives from IT cost-saving activities to revenue-generating and business process improvement initiatives. (emphasis added)

I think the key point for me is the bolded line: data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem.

Data integration never was a technical problem, not really. It just wasn’t important enough for leaders to create agreements to solve it.

Like a lack of sharing between U.S. intelligence agencies. Which is still the case, twelve years this next September 11th as a matter of fact.

Topic maps can capture data integration agreements, but only if users have the business leadership to reach them.

Could be a very good year!

PLOS Computational Biology: Translational Bioinformatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 11:42 am

PLOS Computational Biology: Translational Bioinformatics. Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education Editor.

Following up on the collection where Biomedical Knowledge Integration appears, I found:

Introduction to Translational Bioinformatics Collection by Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration by Philip R. O. Payne. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology by Casey S. Greene and Olga G. Troyanskaya. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease by David S. Wishart. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease by Mileidy W. Gonzalez and Maricel G. Kann. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases by Dong-Yeon Cho, Yoo-Ah Kim and Teresa M. Przytycka. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics by Benjamin J. Raphael. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics by Konrad J. Karczewski, Roxana Daneshjou and Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation by Han Kim. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies by Nigam H. Shah, Tyler Cole and Mark A. Musen. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers by Xiang Zhang, Shunping Huang, Zhaojun Zhang and Wei Wang. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies by William S. Bush and Jason H. Moore. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis by Xochitl C. Morgan and Curtis Huttenhower. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era by Joshua C. Denny. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis by Miguel Vazquez, Victor de la Torre and Alfonso Valencia. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

An example of scholarship at its best!

Biomedical Knowledge Integration

Filed under: Bioinformatics,Biomedical,Data Integration,Medical Informatics — Patrick Durusau @ 11:41 am

Biomedical Knowledge Integration by Philip R. O. Payne.

Abstract:

The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps. 🙁

Well, but it does include use cases at the end of the chapter that are biomedical-specific.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.

Yes?
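As a toy illustration of the kind of merging such use cases call for, here is a sketch of subject-identifier-based merging in Python (the identifiers are real gene identifiers, but this is not a TMDM-conformant implementation):

```python
# Toy sketch of topic-map-style merging: records from different sources merge
# when they share a subject identifier, and all names and sources are kept.
records = [
    {"ids": {"ncbigene:7157"}, "names": {"TP53"}, "source": "source-A"},
    {"ids": {"ncbigene:7157", "hgnc:11998"}, "names": {"p53", "TP53"}, "source": "source-B"},
    {"ids": {"hgnc:1100"}, "names": {"BRCA1"}, "source": "source-A"},
]

def merge_by_identifier(records):
    topics = []  # each topic: {"ids": set, "names": set, "sources": set}
    for rec in records:
        target = next((t for t in topics if t["ids"] & rec["ids"]), None)
        if target is None:
            target = {"ids": set(), "names": set(), "sources": set()}
            topics.append(target)
        target["ids"] |= rec["ids"]
        target["names"] |= rec["names"]
        target["sources"].add(rec["source"])
    # (A full implementation would also merge topics that a later record bridges.)
    return topics

for t in merge_by_identifier(records):
    print(sorted(t["ids"]), sorted(t["names"]), sorted(t["sources"]))
# ['hgnc:11998', 'ncbigene:7157'] ['TP53', 'p53'] ['source-A', 'source-B']
# ['hgnc:1100'] ['BRCA1'] ['source-A']
```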

January 7, 2013

A new Lucene highlighter is born [The final inch problem]

Filed under: Indexing,Lucene,Searching,Synonymy — Patrick Durusau @ 10:27 am

A new Lucene highlighter is born by Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.

An interesting addition to the highlighters in Lucene.

Be sure to follow the link to Mike’s comments about the limitations on SynonymFilter and the difficulty of correction.
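To make the offsets idea concrete, here is a Lucene-free sketch in Python of what per-token character offsets buy you at search time (pure illustration; PostingsHighlighter does this inside the index, with passage scoring):

```python
# Find match offsets for a query term and build a highlighted snippet.
import re

def highlight(text, term, window=40):
    offsets = [(m.start(), m.end()) for m in re.finditer(re.escape(term), text, re.I)]
    if not offsets:
        return None
    start, end = offsets[0]            # build the snippet around the first hit
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    snippet = text[lo:hi]
    # Wrap every occurrence inside the snippet in a highlight marker.
    return re.sub(re.escape(term), lambda m: "<b>%s</b>" % m.group(0), snippet, flags=re.I)

doc = ("Highlighting is crucial functionality in most search applications "
       "since it is the first step of the final inch problem.")
print(highlight(doc, "final inch"))
# ... the first step of the <b>final inch</b> problem.
```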

Akka Documentation Release 2.1.0

Filed under: Actor-Based,Akka,Programming — Patrick Durusau @ 10:05 am

Akka Documentation Release 2.1.0 from Typesafe Inc. (PDF file)

The documentation answers the question, “What is Akka?” as follows:

Scalable real-time transaction processing

We believe that writing correct concurrent, fault-tolerant and scalable applications is too hard. Most of the time it’s because we are using the wrong tools and the wrong level of abstraction. Akka is here to change that. Using the Actor Model we raise the abstraction level and provide a better platform to build correct, concurrent, and scalable applications. For fault-tolerance we adopt the “Let it crash” model which the telecom industry has used with great success to build applications that self-heal and systems that never stop. Actors also provide the abstraction for transparent distribution and the basis for truly scalable and fault-tolerant applications.

Chris Cundill says “it’s virtually a book!”, which at 424 pages I think is a fair statement. 😉

Even just skimming it, this looks quite readable!

I first saw this at This week in #Scala (04/01/2013) by Chris Cundill.

Akka 2.1.0 Released

Filed under: Actor-Based,Akka — Patrick Durusau @ 9:40 am

Akka 2.1.0 Released

From the post:

We—the Akka committers—are pleased to be able to announce the availability of Akka 2.1.0 ‘Mingus’. We are proud to include the work of 17 external committers, plus the work done by our great community in reporting and helping to diagnose bugs along the way.

This release refines and builds upon version 2.0, which was published a bit over nine months ago. The most prominent new features are

  • cluster support (experimental, including cluster membership logic & death watch and cluster-aware routers, see more below)
  • integration with Scala standard library (SIP-14 Futures, dataflow as add-on module, akka-actor.jar will be part of the Scala distribution)
  • Akka Camel support (Raymond Roestenburg & Piotr Gabryanczyk)
  • Encrypted Akka Remoting using SSL/TLS (Peter Badenhorst)
  • OSGi meta-information for most bundles (excluding samples and tests, Gert Vanthienen)
  • an ActorDSL for more concise actor declarations, e.g. in the REPL
  • a module for multi-node testing (to support you in developing clustered applications, experimental in the same sense as cluster support)
  • a Java API for the TestKit

In addition there have been a great number of small fixes and improvements, documentation updates (including a whole new section on message delivery guarantees), an area for contributions—akka-contrib—where community developments can mature and prove themselves and many more. A series of blog posts high-lighting the new features has been published over the past weeks on this blog, see this tag.

Looking forward to exploring the new features in this release!

Akka website

Akka downloads

I first saw this at This week in #Scala (04/01/2013) by Chris Cundill.

Scala Cheatsheet

Filed under: Programming,Scala — Patrick Durusau @ 9:23 am

Scala Cheatsheet by Brendan O’Connor.

Quick reference to Scala syntax.

Also includes examples of bad practice, labeled as such.

I first saw this at This week in #Scala (04/01/2013) by Chris Cundill.

izik: Take Search for a Joy Ride on Your Tablet

Filed under: Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 7:40 am

izik: Take Search for a Joy Ride on Your Tablet

From the post:

We are giddy to announce the launch of izik, our new search app built specifically with the iPad and Android tablets in mind. With izik, every search on your tablet is transformed into a beautiful, glossy page that utilizes rich images, categories, and, of course, gesture controls. Check it: so much content, so many ways to explore.

Tablets are increasingly getting integrated into our lives, so we wracked our noggins to figure out how we could use our search technology to optimally serve tablet users. Not surprisingly, our research revealed that tablets take on a very different role in our lives than laptops and desktops. Laptops are for work; tablets are for fun. Laptops are task-oriented (“what’s the capital of Bulgaria?”); tablets are more exploratory (“what’s Jennifer Lopez doing these days?”).

So, our goal with izik was to move the task-oriented search product we all use on our computers (aka 10 blue links) and turn it into a more fun, tablet-appropriate product. That means an image-rich layout with an appearance and experience very different than what we’re used to seeing on a laptop.

I remain without a tablet, so I am dependent upon your opinions of how izik works for real users.

From the canned video I get the sense that the interface is going to make search different.

Is the scroll gesture more natural than using a mouse? Are some movements easier using gestures?

What other features of a tablet interface can change/improve search experiences?
