Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 2, 2012

TokuDB v6.0: Download Available

Filed under: MySQL,TokuDB — Patrick Durusau @ 3:08 pm

TokuDB v6.0: Download Available by Martin Farach-Colton.

From the post:

TokuDB v6.0 is full of great improvements, like getting rid of slave lag, better compression, improved checkpointing, and support for XA.

I’m happy to announce that TokuDB v6.0 is now generally available and can be downloaded here.

Are you familiar with any independent benchmark testing on TokuDB?

Not that I doubt the TokuDB numbers.

I am thinking that contributing standardized numbers to a more centralized resource would help with evaluations.

Auto Tagging Articles using Semantic Analysis and Machine Learning

Filed under: Authoring Topic Maps,Auto Tagging,Semantic Annotation,Topic Models (LDA) — Patrick Durusau @ 2:51 pm

Auto Tagging Articles using Semantic Analysis and Machine Learning

Description:

The idea is to implement an auto tagging feature that provides tags automatically to the user depending upon the content of the post. The tags will get populated as soon as the user leaves the focus on the content text area or via ajax on the press of a button. I’ll be using semantic analysis and topic modeling techniques to judge the topic of the article and extract keywords also from it. Based on an algorithm and a ranking mechanism the user will be provided with a list of tags from which he can select those that best describe the article and also train a user-content specific semi-supervised machine learning model in the background.

A Drupal sandbox for work on auto tagging posts.

Or, topic map authoring without being “in your face.”

Depends on how you read “tags.”
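
To make the idea concrete, here is a minimal sketch of the topic modeling core, assuming gensim and a handful of already tokenized posts (my sketch, not the project’s code). The Drupal/ajax plumbing and the semi-supervised, user-specific ranking model are left out.

```python
# Fit an LDA topic model over existing posts, then suggest the top words of
# the dominant topics as candidate tags for a new post. Assumes a recent
# gensim; stop-word removal, stemming and user-specific ranking are omitted.
from gensim import corpora, models

posts = [
    "topic maps merge subject identifiers across data sources",
    "drupal modules expose hooks for tagging and taxonomy",
    "semantic analysis extracts keywords from article text",
]
texts = [p.lower().split() for p in posts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

def suggest_tags(new_post, topn=5):
    """Return candidate tags for a post, ranked by topic weight."""
    bow = dictionary.doc2bow(new_post.lower().split())
    candidates = []
    for topic_id, weight in lda[bow]:
        for word, score in lda.show_topic(topic_id, topn=topn):
            candidates.append((weight * score, word))
    seen, tags = set(), []
    for score, word in sorted(candidates, reverse=True):
        if word not in seen:
            seen.add(word)
            tags.append(word)
    return tags[:topn]

print(suggest_tags("merging topic maps with drupal taxonomy"))
```

The user then accepts or rejects the suggestions, which is exactly the “not in your face” authoring step I have in mind.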

OpenMeetings

Filed under: Collaboration,OpenMeetings,Web Conferencing — Patrick Durusau @ 2:32 pm

OpenMeetings: Open-Source Web-Conferencing

From the website:

Openmeetings provides video conferencing, instant messaging, white board, collaborative document editing and other groupware tools using API functions of the Red5 Streaming Server for Remoting and Streaming.

OpenMeetings is a project of the Apache Incubator; the old project website at GoogleCode will no longer receive updates. The website at Apache is the only place that receives updates.

OpenMeetings is the type of application that could benefit from subject-centric capabilities.

Even “in-house” as they say, not all participants will share a common vocabulary.

There are commercial applications that make that and other unhelpful assumptions. Write if you need contact details.

Natural Language Processing (almost) from Scratch

Filed under: Artificial Intelligence,Natural Language Processing,Neural Networks,SENNA — Patrick Durusau @ 2:18 pm

Natural Language Processing (almost) from Scratch by Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.

Abstract:

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

In the introduction the authors remark:

The overwhelming majority of these state-of-the-art systems address a benchmark task by applying linear statistical models to ad-hoc features. In other words, the researchers themselves discover intermediate representations by engineering task-specific features. These features are often derived from the output of preexisting systems, leading to complex runtime dependencies. This approach is effective because researchers leverage a large body of linguistic knowledge. On the other hand, there is a great temptation to optimize the performance of a system for a specific benchmark. Although such performance improvements can be very useful in practice, they teach us little about the means to progress toward the broader goals of natural language understanding and the elusive goals of Artificial Intelligence.

I am not an AI enthusiast but I agree that pre-judging linguistic behavior (based on our own) in a data set will find no more (or less) linguistic behavior than our judgment allows. Reliance on the research of others just adds more opinions to our own. Have you ever wondered on what basis we accept the judgments of others?

A very deep and annotated dive into NLP approaches (the authors’ and others’) with pointers to implementations, data sets and literature.

In case you are interested, the source code is available at: SENNA (Semantic/syntactic Extraction using a Neural Network Architecture)
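
If you want a feel for the paper’s “window approach” without reading the SENNA sources, here is a toy sketch in PyTorch (my sketch, not the authors’ code): look up embeddings for a fixed window of words around the target position, concatenate them, and score the tags with a small feed-forward network.

```python
# Toy "window approach" tagger: embeddings for a fixed word window feed a
# small network that scores the possible tags. Untrained and tiny; SENNA's
# features, training corpora and sentence-level training are not shown.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
tags = {"DET": 0, "NOUN": 1, "VERB": 2}
window, embed_dim, hidden = 2, 50, 100   # two words of context on each side

class WindowTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(len(vocab), embed_dim)
        self.ff = nn.Sequential(
            nn.Linear((2 * window + 1) * embed_dim, hidden),
            nn.Tanh(),                        # the paper uses a hard tanh
            nn.Linear(hidden, len(tags)),
        )

    def forward(self, word_ids):              # word_ids: (batch, 2*window+1)
        e = self.embed(word_ids)               # (batch, 2*window+1, embed_dim)
        return self.ff(e.view(e.size(0), -1))  # tag scores per example

model = WindowTagger()
ids = torch.tensor([[0, 1, 2, 3, 0]])         # <pad> the [cat] sat <pad>
print(model(ids).argmax(dim=1))               # predicted tag index (untrained)
```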

Google BigQuery and the Github Data Challenge

Filed under: Contest,Data,Google BigQuery — Patrick Durusau @ 10:54 am

Google BigQuery and the Github Data Challenge

Deadline May 21, 2012

From the post:

Github has made data on its code repositories, developer updates, forks etc. from the public GitHub timeline available for analysis, and is offering prizes for the most interesting visualization of the data. Sounds like a great challenge for R programmers! The R language is currently the 26th most popular on GitHub (up from #29 in December), and it would be interesting to visualize the usage of R compared to other languages, for example. The deadline for submissions to the contest is May 21.

Interestingly, GitHub has made this data available on the Google BigQuery service, which is available to the public today. BigQuery was free to use while it was in beta test, but Google is now charging for storage of the data: $0.12 per gigabyte per month, up to $240/month (the service is limited to 2TB of storage – although there is a Premier offering that supports larger data sizes … at a price to be negotiated). While members of the public can run SQL-like queries on the GitHub data for free, Google is charging subscribers to the service 3.5 cents per Gb processed in the query: this is measured by the source data accessed (although columns of data not referenced aren't counted); the size of the result set doesn't matter.

Watch your costs, but any thoughts on how you would visualize the data?
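
Back-of-the-envelope, using the prices quoted above (2012 pricing, so check the current rate card before relying on it):

```python
# Rough monthly cost from the figures in the post: $0.12/GB-month storage
# (capped by the 2 TB limit, roughly $240/month) plus 3.5 cents per GB
# processed by queries. Illustration only; not an official calculator.
def bigquery_monthly_cost(stored_gb, queried_gb):
    storage = min(0.12 * stored_gb, 240.0)
    queries = 0.035 * queried_gb
    return storage + queries

# e.g. 100 GB stored, 500 GB scanned by queries over the month
print(bigquery_monthly_cost(100, 500))   # 12.0 + 17.5 = 29.5 dollars
```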

Apache MRUnit 0.9.0-incubating has been released!

Filed under: Hadoop,MapReduce,MRUnit — Patrick Durusau @ 10:17 am

Apache MRUnit 0.9.0-incubating has been released! by Brock Noland.

The post reads in part:

We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.

The MRUnit project is quite active, 0.9.0 is our fourth release since entering the incubator and we have added 4 new committers beyond the project’s initial charter! We are very interested in having new contributors and committers join the project! Please join our mailing list to find out how you can help!

The MRUnit build process has changed to produce mrunit-0.9.0-hadoop1.jar and mrunit-0.9.0-hadoop2.jar instead of mrunit-0.9.0-hadoop020.jar, mrunit-0.9.0-hadoop100.jar and mrunit-0.9.0-hadoop023.jar. The hadoop1 classifier is for all Apache Hadoop versions based off the 0.20.X line including 1.0.X. The hadoop2 classifier is for all Apache Hadoop versions based off the 0.23.X line including the unreleased 2.0.X.

I have been reading about JUnit recently, in part just to learn more about software testing, but also thinking about what it would look like to test the semantics of integration. Or, with opaque mappings, is that even a meaningful question? Or is the lack of meaning in that question a warning sign?

Perhaps there is no “test” for the semantics of integration. You can specify integration of data and the results may be useful or not, meaningful (in some context) or not, but the question isn’t one of testing. The question is: Are these the semantics you want for integration?

Data has no semantics until someone “looks” at it so the semantics of proposed integration have to be specified and the client/user says yes or no.
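
If there is a test at all, it probably looks less like asserting “the right answer” and more like pinning down the rule the client agreed to. A sketch in plain Python unittest (merge_records and the SSN rule are hypothetical, not part of MRUnit):

```python
# The test does not prove the mapping is "true"; it records that these are
# the integration semantics the client asked for. merge_records is a
# hypothetical integration function used only for illustration.
import unittest

def merge_records(a, b, same_subject):
    """Merge two records when the supplied rule says they are the same subject."""
    if same_subject(a, b):
        merged = dict(a)
        merged.update(b)
        return [merged]
    return [a, b]

class IntegrationSemanticsTest(unittest.TestCase):
    def test_equal_ssn_means_same_person(self):
        # The client's rule: equal SSNs identify the same person.
        rule = lambda a, b: a.get("ssn") == b.get("ssn")
        a = {"ssn": "123-45-6789", "name": "P. Durusau"}
        b = {"ssn": "123-45-6789", "email": "patrick@example.com"}
        merged = merge_records(a, b, rule)
        self.assertEqual(len(merged), 1)
        self.assertEqual(merged[0]["email"], "patrick@example.com")

if __name__ == "__main__":
    unittest.main()
```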

Sorry, I digressed. I commend MRUnit to your attention.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel ‘implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be built for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.
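
For anyone who has not met the BWT before, the textbook construction is tiny (the hard part, and the paper’s contribution, is doing it at genome scale without holding everything in memory):

```python
# Textbook BWT: sort all rotations of the input and take the last column.
# Fine for short strings; hopeless at 135 Gbp, which is the point of BEETL.
def bwt(s, sentinel="$"):
    s = s + sentinel                      # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("ACGTACGT"))
```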

Comments?

Bed Cartography

Filed under: Cartography,Humor — Patrick Durusau @ 7:53 am

Bed Cartography from Nathan Yau.

It had to happen. Cartography has spread to the bedroom.

Can graphs be far behind? 😉

May 1, 2012

Basic graph analytics using igraph

Filed under: Graphs,igraph,Networks — Patrick Durusau @ 4:47 pm

Basic graph analytics using igraph by Ricky Ho.

From the post:

Social network sites such as Facebook and Twitter have become an integral part of people’s lives. People interact with each other in different forms of activity and a lot of information has been captured in the social network. Mining such a network can reveal some very useful information that can help an organization gain competitive advantages.

I recently came across a powerful tool called igraph that provides some very powerful graph mining capabilities. Following are some interesting things that I have found.

Ricky doesn’t give a link to igraph, which you can find here. Development version.

He does cover:

  • Create a Graph
  • Basic Graph Algorithms
  • Graph Statistics
  • Centrality Measures
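
Ricky’s examples are in R, but igraph also has Python bindings, so the same steps look roughly like this (a minimal sketch, not Ricky’s code):

```python
# Create a small graph, then pull basic statistics and centrality measures,
# mirroring the topics in Ricky's post. Assumes the python-igraph package.
import igraph

g = igraph.Graph(n=5, edges=[(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

print(g.vcount(), g.ecount())   # graph size
print(g.diameter())             # a basic graph statistic
print(g.degree())               # degree of each vertex
print(g.betweenness())          # betweenness centrality
print(g.pagerank())             # PageRank centrality
```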

Picking the Connectome Data Lock

Filed under: Connectome,Neuroinformatics — Patrick Durusau @ 4:47 pm

Picking the Connectome Data Lock by Nicole Hemsoth

From the post:

Back in 2005, researchers at Indiana University and Lausanne University simultaneously (yet independently) spawned a concept and pet term that would become the hot topic in neuroscience for the next several years—connectomics.

The concept itself isn’t necessarily new, even though the use of “connectomics” in popular science circles is relatively so.

….

A hybrid between the study of genomics (the biological blueprint) and neural networks (the “connect”) this term quickly caught on, including with large organizations like the National Institutes of Health (NIH) and its Human Connectome Project.

For instance, the NIH is in the midst of a five-year effort (starting in 2009) to map the neural pathways that underlie human brain function. The purpose is to acquire and share data about the structural and functional connectivity of the human brain to advance imaging and analysis capabilities and make strides in understanding brain circuitry and associated disorders.

[images omitted]

And talk about data… just to reconstruct the neural and synaptic connections in a mouse retina and primary visual cortex involved a 12 TB data set (which incidentally is now available to all at the Open Connectome Project).

Mapping the connectome requires a complete mapping process of the neural systems on a neuron-by-neuron basis, a task that requires accounting for billions of neurons, at least for most larger, complex mammals. According to Open Connectome Project, the human cerebral cortex alone contains something in the neighborhood of 10^10 neurons linked by 10^14 synaptic connections.

That number is a bit difficult to digest without context, so how about this: the number of base-pairs in a human genome is 10^9.

I didn’t want anyone to feel I was neglecting the “big data” side of things, although 12 TB of data will only be “big data” for your home computer. 😉

Moreover, Sebastian Seung, Professor of Computational Neuroscience at MIT and author of the book, Connectome, is quoted as speculating that memories may be represented in the patterns of connections between neurons. Which sounds familiar to anyone who has heard Steve Newcomb talk about the subjects that are implicit in associations.

I wonder if it is possible to represent a summation of the connectome, much in the same way that we accept lower resolution images for some purposes? So that the task isn’t a one-to-one representation of the connectome, which would be a connectome itself (a map equivalent to the territory itself is the territory, one of those philosophy things).

That’s a nice data structure/information theory problem that would not require dimming the lights in your neighborhood when your system boots up. At least until you wanted to run a simulation. 😉
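
One cheap way to read “lower resolution” for a graph: detect communities and contract each one to a single node, keeping only community-to-community connection counts. A toy sketch with python-igraph on a random stand-in graph (real connectome data would obviously need far more care):

```python
# Contract a graph to a community-level summary: one node per community,
# edge weights counting the connections between communities.
import igraph

g = igraph.Graph.Erdos_Renyi(n=200, p=0.05)     # stand-in for a neural graph
communities = g.community_multilevel()          # Louvain-style clustering

summary = g.copy()
summary.es["weight"] = 1
summary.contract_vertices(communities.membership)   # one vertex per community
summary.simplify(combine_edges=dict(weight="sum"))  # sum parallel edges
print(g.vcount(), "->", summary.vcount(), "vertices in the summary")
```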

If you are interested in a game to make discoveries about the neural structure of the retina, see: http://www.eyewire.org/.

Researchers Turn Data into Dynamic Demographics

Filed under: Data,Demographics,Foursquare — Patrick Durusau @ 4:46 pm

Researchers Turn Data into Dynamic Demographics

From the post:

Aside from showing off their travel, culinary and nightlife habits, users of the geolocated “check-in” service Foursquare could shed light on the character of a particular city and its neighborhoods.

Researchers at Carnegie Mellon University’s School of Computer Science say that instead of relying on stagnant, unyielding census and neighborhood zoning data to take the temperature of a given community, Foursquare check-in data can provide the much-needed layer of dynamic city life.

The researchers have developed an algorithm that takes the check-ins generated when foursquare members visit participating businesses or venues, and clusters them based on a combination of the location of the venues and the groups of people who most often visit them. This information is then mapped to reveal a city’s Livehoods, a term coined by the SCS researchers.

All of the Livehoods analysis is based on foursquare check-ins that users have shared publicly via social networks such as Twitter. This dataset of 18 million check-ins includes user ID, time, latitude and longitude, and the name and category of the venue for each check-in.

“Our goal is to understand how cities work through the lens of social media,” said Justin Cranshaw, a Ph.D. student in SCS’s Institute for Software Research.

The researchers analyzed data from foursquare, but the same computational techniques could be applied to several other databases of location information. The researchers are exploring applications to city planning, transportation and real estate development. Livehoods also could be useful for businesses developing marketing campaigns or for public health officials tracking the spread of disease.

A good example of remapping data. The data was collected and “mapped” for one purpose but subsequently was re-mapped and re-purposed.

Mapping the semantics of data empowers its re-use/re-purposing, which creates further opportunities for re-use and re-purposing.
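
The general shape of the re-mapping is easy to sketch, even if the Livehoods algorithm itself is more sophisticated. A toy version: describe each venue by its location plus the set of users who check in there, then cluster (hypothetical data, and the features would need scaling in any real use):

```python
# Cluster venues by location plus visitor overlap, so "neighborhoods" follow
# behavior rather than zoning. Toy data; not the researchers' algorithm.
import numpy as np
from sklearn.cluster import KMeans

# hypothetical check-ins: (user_id, venue_id, lat, lon)
checkins = [
    (1, "cafe", 40.443, -79.943), (2, "cafe", 40.443, -79.943),
    (1, "bar", 40.444, -79.945), (2, "bar", 40.444, -79.945),
    (3, "museum", 40.440, -79.950), (3, "park", 40.439, -79.951),
]
venues = sorted({c[1] for c in checkins})
users = sorted({c[0] for c in checkins})

features = []
for v in venues:
    rows = [c for c in checkins if c[1] == v]
    lat, lon = rows[0][2], rows[0][3]
    visitors = [1.0 if any(c[0] == u for c in rows) else 0.0 for u in users]
    features.append([lat, lon] + visitors)       # location + who visits

labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.array(features))
print(dict(zip(venues, labels)))                 # venue -> cluster id
```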

See also: http://livehoods.org/

Data Management is Based on Philosophy, Not Science

Filed under: Data Management,Identity,Philosophy — Patrick Durusau @ 4:46 pm

Data Management is Based on Philosophy, Not Science by Malcolm Chisholm.

From the post:

There’s a joke running around on Twitter that the definition of a data scientist is “a data analyst who lives in California.” I’m sure the good natured folks of the Golden State will not object to me bringing this up to make a point. The point is: Thinking purely in terms of marketing, which is a better title — data scientist or data philosopher?

My instincts tell me there is no contest. The term data scientist conjures up an image of a tense, driven individual, surrounded by complex technology in a laboratory somewhere, wrestling valuable secrets out of the strange substance called data. By contrast, the term data philosopher brings to mind a pipe-smoking elderly gentleman sitting in a winged chair in some dusty recess of academia where he occasionally engages in meaningless word games with like-minded individuals.

These stereotypes are obviously crude, but they are probably what would come into the minds of most executive managers. Yet how true are they? I submit that there is a strong case that data management is much more like applied philosophy than it is like applied science.

Applied philosophy. I like that!

You know where I am going to come out on this issue so I won’t belabor it.

Enjoy reading Malcolm’s post!

Shadow-Activated QR Code Actually Useful and Cool

Filed under: Museums,QR Codes — Patrick Durusau @ 4:46 pm

Shadow-Activated QR Code Actually Useful and Cool: Retailer’s sign scannable only at lunch, by David Griner.

From the post:

For all the talk of mobile-marketing tech, there remains a pretty wide gap between the potential and the practicality of QR codes. That’s why it’s nice to see this case study from Korea, where a retailer increased lunchtime sales by 25 percent with a shadow-based QR code that’s only scannable in the middle of the day. Emart’s “Sunny Sale” codes are created with three-dimensional displays outside several dozen locations in Seoul. When the sun is at its zenith, the shadows line up, allowing the code to be scanned for access to coupons and online ordering. It’s a smart idea that, in the short term at least, has generated plenty of strong PR and sales. While the wow factor is sure to fade quickly, it’s still a great example of a marketer finding a way to turn QR codes into something actually worth scanning.

From Seoul. No surprise there. Heavy investment in education and technology infrastructure. Some soon-to-be-former technology leaders did the same thing but then lost their way.

If you think of QR codes as a cheap equivalent to a secure RFID tag, they should be more popular than they are. You have to “see” a QR code to scan it, and physical security is the first principle of network security.

Museums could use QR codes (linking into topic maps) to provide information in multiple languages. With sponsors for coupons to local eateries. No expensive tags, networks, sensors, etc.
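
Generating the codes is the easy part. A sketch with the Python qrcode package (the museum URL and its topic map back end are hypothetical):

```python
# One QR code per exhibit label, pointing into a (hypothetical) topic map
# URL that can serve descriptions in whatever language the visitor requests.
import qrcode

exhibit_url = "http://museum.example.org/topicmap/exhibit/1234"
img = qrcode.make(exhibit_url)      # returns an image of the code
img.save("exhibit-1234.png")        # print this on the exhibit label
```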

AWS NYC Summit 2012

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 4:46 pm

AWS NYC Summit 2012

The line that led me to this read:

We posted 25 presentations from the New York 2012 AWS Summit.

Actually, no.

Posted 25 slide decks, not presentations.

Useful yes, presentations, no.

Not to complain too much given the rapid expansion of services and technical guidance but let’s not confuse slides with presentations.

The AWS Report (Episode 2) has one major improvement: the clouds in the background don’t move! (As they did in the first episode. Though now there is a shadow that moves over the front of the desk.)

We need to ask Amazon to get Jeff a new laptop without all the stickers on the top. If Paula Abdul or Vanna White were doing the interview, the laptop stickers would not be distracting. Or at least not enough to complain. Jeff isn’t Paula Abdul or Vanna White. Sorry Jeff.

I think the AWS Report has real potential. Several short segments with more “facts” and fewer “general” statements would be great.

Enjoyed the Elastic Beanstalk episode, but hearing that customers are busy and happy and that requirements were gathered for other language support (besides Java) is like hearing public service announcements on PBS.

Nothing to disagree with but no real content either.

Suggestion: Perhaps a short, say 90- to 120-second, description of a typical issue (off a mailing list?) that ends with “What is your solution?” and then feature one or more solutions on the next show? That would get the audience involved and get other people hawking the show.

Not quite the cover of the Rolling Stone but perhaps someday… 😉

PolyZoom a New Tool to View, Study Graphics

Filed under: Graphics,Navigation,PolyZoom,Visualization — Patrick Durusau @ 4:46 pm

PolyZoom a New Tool to View, Study Graphics

From the post:

Researchers have created a next-generation zoom function to view and compare portions of complex graphics such as scientific images, city maps or pages of text. The new tool, PolyZoom, makes it possible to simultaneously magnify many parts of a graphic without losing sight of the original picture.

“With standard programs, once you zoom in, you lose perspective and have to zoom out again to see that bigger picture,” said Niklas Elmqvist, an assistant professor of electrical and computer engineering at Purdue University. “This new tool maintains your perspective or orientation.”

The zoomed-in regions appear as separate pullout boxes displayed next to each other. These boxes, or “correlated graphics,” allow the user to see where the magnified viewpoints are located in relation to each other and the whole.

“The tool is useful if you are trying to compare different spaces on a map, like the city centers of two major metropolitan areas, segments of a Hubble Space Telescope picture or even pages in a lengthy document,” said Elmqvist, who is working with doctoral students Waqas Javed and Sohaib Ghani. “Say you are a historian looking at a large collection of scanned pages from a book. You might want to zoom into a particular page and read the words, or look at many pages at the same time and compare those.

Key point:

“This new tool maintains your perspective or orientation.” (emphasis added)

When you think about it, that happens a lot: loss of perspective or orientation. In a reading context I would say I “lost” my place in the text.

Web browsers allow you to tab, but that isn’t the same. You can open new “windows,” but they are cluttered with all the navigation crap. It would be nice to have resizable panes with scroll bars that you could “pin” to locations on your screen. Seen anything like that recently?

You can see the paper on this technique: https://engineering.purdue.edu/~elm/projects/polyzoom/polyzoom.pdf

Or try out a demo: http://web.ics.purdue.edu/~wjaved/projects/stackZoom
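
Not PolyZoom, but matplotlib’s inset axes give a rough feel for the idea of magnified regions that stay tied to their place in the original picture (a minimal sketch, assuming matplotlib):

```python
# A zoomed inset next to the full plot, with lines marking where the inset
# sits in the original. PolyZoom does much more (multiple correlated zooms).
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes, mark_inset

x = np.linspace(0, 10, 1000)
y = np.sin(x) + 0.1 * np.sin(40 * x)

fig, ax = plt.subplots()
ax.plot(x, y)

axins = zoomed_inset_axes(ax, 6, loc=1)   # 6x zoom, upper right corner
axins.plot(x, y)
axins.set_xlim(2.0, 2.5)                  # the magnified region
axins.set_ylim(0.7, 1.1)
mark_inset(ax, axins, loc1=2, loc2=4, fc="none", ec="0.5")

plt.show()
```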

Masstree – Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

Filed under: Masstree,Memcached,MongoDB,Redis,VoltDB — Patrick Durusau @ 4:45 pm

Masstree – Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

From the post:

The EuroSys 2012 system conference has an excellent live blog summary of their talks for: Day 1, Day 2, Day 3 (thanks Henry at the Paper Trail blog). Summaries for each of the accepted papers are here.

One of the more interesting papers from a NoSQL perspective was Cache Craftiness for Fast Multicore Key-Value Storage, a wonderfully detailed description of the low level techniques used to implement Masstree:

A storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. On a 16-core machine Masstree achieves six to ten million operations per second on parts A–C of the Yahoo! Cloud Serving Benchmark, more than 30x as fast as VoltDB [5] or MongoDB [2].

An inspiration for anyone pursuing pure performance in the key-value space.
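
Masstree itself is C++ and not something you drop into a script, but to get a feel for the interface it describes (sorted keys, range scans over an in-memory store), here is a toy sketch using Python’s sortedcontainers package (my example, unrelated to the Masstree code):

```python
# A sorted in-memory key-value map with range queries in key order. Nothing
# like Masstree's performance, just the same access pattern.
from sortedcontainers import SortedDict

store = SortedDict()
store["user:alice"] = b"profile data"
store["user:bob"] = b"profile data"
store["video:42"] = b"metadata"

print(store["user:bob"])                      # point lookup

# range query: every key with the "user:" prefix, in sorted order
for key in store.irange("user:", "user:\xff"):
    print(key, store[key])
```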

As the authors note when comparing Masstree to other systems:

Many of these systems support features that Masstree does not, some of which may bottleneck their performance. We disable other systems’ expensive features when possible.

The lesson here is to not buy expensive features unless you need them.

A matter of compactness

Filed under: Graphics,Visualization — Patrick Durusau @ 4:45 pm

A matter of compactness

Kaiser Fung helpfully revises Figures 2.1-2.2.8 of the World Happiness Report.

Should the report be known as the:

  • Artless World Happiness Report
  • Clueless World Happiness Report
  • Confused World Happiness Report
  • Unhappy World Happiness Report
  • (Your suggestion can appear here)

This isn’t as bad as graphics can get. I have seen worse.

Unfortunately, I have blotted out my memory of where.

Tenth workshop on Mining and Learning with Graphs (MLG-2012)

Filed under: Conferences,Graphs — Patrick Durusau @ 4:45 pm

Tenth workshop on Mining and Learning with Graphs (MLG-2012)


Papers due: May 7, 2012.

Workshop: July 1, 2012.

Located with ICML-2012.

Description:

There is a great deal of interest in analyzing data that is best represented as a graph. Examples include the WWW, social networks, biological networks, communication networks, food webs, and many others. The importance of being able to effectively mine and learn from such data is growing, as more and more structured and semi-structured data is becoming available. Traditionally, a number of subareas have worked with mining and learning from graph structured data, including communities in graph mining, learning from structured data, statistical relational learning, inductive logic programming, and, moving beyond subdisciplines in computer science, social network analysis and, more broadly, network science. The objective of this workshop is to bring together researchers from a variety of these areas, discuss commonality and differences in the challenges faced, survey some of the different approaches, and provide a forum to present and learn about some of the most cutting edge research in this area. As an outcome, we expect participants to walk away with a better sense of the variety of different tools available for graph mining and learning, and an appreciation for some of the interesting emerging applications for mining and learning from graphs.
