Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

September 22, 2013

CensusReporter

Filed under: Census Data,Interface Research/Design — Patrick Durusau @ 2:38 pm

Easier Census data browsing with CensusReporter by Nathan Yau.

Nathan writes:

Census data can be interesting and super informative, but getting the data out of the dreaded American FactFinder is often a pain, especially if you don’t know the exact table you want. (This is typically the case.) CensusReporter, currently in beta, tries to make the process easier.

Whatever your need for census data, even the beta interface is worth your attention!

India…1,745 datasets for agriculture

Filed under: Agriculture,Data,Open Data — Patrick Durusau @ 2:09 pm

Open Data Portal India launched: Already 1,745 datasets for agriculture

From the post:

The Government of India has launched its open Data Portal India (data.gov.in), a portal for the public to access and use datasets and applications provided by ministries and departments of the Government of India.

Aim: “To increase transparency in the functioning of Government and also open avenues for many more innovative uses of Government Data to give different perspective.” (“About portal,” data.gov.in)

The story goes on to report there are more than 4,000 data sets from over 51 offices. An adviser to the prime minister of India is hopeful there will be more than 10,000 data sets in six months.

Not quite as much fun as the IMDB, but on the other hand, the data is more likely to be of interest to business types.

Samza

Filed under: Samza,Storm — Patrick Durusau @ 1:45 pm

Samza

From the webpage:

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

  • Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple call-back based "process message" API that should be familiar to anyone who's used Map/Reduce.
  • Managed state: Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted.
  • Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
  • Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
  • Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
  • Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop's security model, and resource isolation through Linux CGroups.

Check out Hello Samza to try Samza. Read the Background page to learn more about Samza.
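To give a feel for how small that “process message” call-back is, here is a minimal sketch of a StreamTask in Java. It follows the class and package names in the Samza documentation as I read them, but check them against the current API before relying on it; the stream and topic names are made up, and it assumes a string serde on the input stream.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewFilterTask implements StreamTask {

  // Hypothetical output: "kafka" is the system name, "filtered-page-views" the topic.
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "filtered-page-views");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Assumes a string serde is configured for the input stream.
    String message = (String) envelope.getMessage();

    // Drop anything that is not a page view; forward the rest downstream.
    if (message.contains("page-view")) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
  }
}
```

The wiring of the task to its input stream lives, per the docs, in a separate job configuration file, which is part of what keeps the task code this small.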

Ironic that I should find Samza when I was searching for the new incubator page for Storm. 😉

In fact, I found a comparison of Samza and Storm.

You can learn a great deal about Storm (and Samza) reading the comparison. It’s quite good.

Apache Takes Storm Into Incubation

Filed under: Storm — Patrick Durusau @ 1:30 pm

Apache Takes Storm Into Incubation by Isaac Lopez.

From the post:

Get used to saying it: “Apache Storm.”

On Wednesday night, Doug Cutting, Director for the Apache Software Foundation (ASF), announced that the organization will be adding the distributed real time computation system known as Storm as the foundation’s newest Incubator podling.

Storm was created by BackType lead engineer Nathan Marz in early 2011, before the software (along with the entire company) was acquired by Twitter. At Twitter, Storm became the backbone of the social giant’s web analytics framework, tracking every click happening within the rapidly-expanding Twittersphere. The Blue Bird also uses Storm as part of its “What’s Trending” widget.

In September of 2011, Marz announced that Storm would be released into open source, where it has enjoyed a great deal of success, getting used by such companies as Groupon, Yahoo!, InfoChimps, NaviSite, Nodeable, Ooyala, The Weather Channel, and more.

In case you don’t know Storm:

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!

The Rationale page on the wiki explains what Storm is and why it was built. This presentation is also a good introduction to the project.

Storm has a website at storm-project.net. Follow @stormprocessor on Twitter for updates on the project. (From the Storm Github page.)

When the Apache pages for Storm are posted I will update this post.

…Introducing … Infringing Content Online

Filed under: Intellectual Property (IP),Search Engines,Searching — Patrick Durusau @ 12:43 pm

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online

From the summary at Full Text Reports:

Today, MPAA Chairman Senator Chris Dodd joined Representatives Howard Coble, Adam Schiff, Marsha Blackburn and Judy Chu on Capitol Hill to release the results of a new study that found that search engines play a significant role in introducing audiences to infringing movies and TV shows online. Infringing content is a TV show or movie that has been stolen and illegally distributed online without any compensation to the show or film’s owner.
…
The study found that search is a major gateway to the initial discovery of infringing content online, even in cases when the consumer was not looking for infringing content. 74% of consumers surveyed cited using a search engine as a navigational tool the first time they arrived at a site with infringing content. And the majority of searches (58%) that led to infringing content contained only general keywords — such as the titles of recent films or TV shows, or phrases related to watching films or TV online — and not specific keywords aimed at finding illegitimate content.

I rag on search engines fairly often about the quality of their results, so in light of this report I want to give them a shout-out: Well done!

They may not be good at the sophisticated content discovery that I find useful, but on the other hand, when sweat hogs are looking for entertainment, search results can fill the bill.

On the other hand, knowing that infringing content can be found may be good for PR purposes but not much more. Search results don’t capture (read identify) enough subjects to enable the mining of patterns of infringement and other data analysis relevant to opposing infringement.

Infringing content is easy to find, so the business case for topic maps lies with content providers, who need more detail (read subjects and associations) than a search engine can provide.

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online (PDF of the news release)


Update: Understanding the Role of Search in Online Piracy. The full report. Additional detail but no links to the data.

ZFS disciples form one true open source database

Filed under: Files,Open Source,Storage — Patrick Durusau @ 10:46 am

ZFS disciples form one true open source database by Lucy Carey.

From the post:

The acronym ‘ZFS’ may no longer actually stand for anything, but the “world’s most advanced file sharing system” is in no way redundant. Yesterday, it emerged that corporate advocates of the Sun Microsystems file system and logical volume manager have joined together to offer a new “truly open source” incarnation of the file system, called, fittingly enough, OpenZFS.

Along with the launch of the open-zfs.org website – which is, incidentally, a domain owned by ZFS co-founder Matt Ahrens – the group of ZFS lovers, which includes developers from the illumos, FreeBSD, Linux, and OS X platforms, as well as an assortment of other parties who are building products on top of OpenZFS, has set out a clear set of objectives.

Speaking of scaling, Wikipedia reports:

A ZFS file system can store up to 256 quadrillion zebibytes (ZiB).

Just in case anyone mentions scalable storage as an issue. 😉

September 21, 2013

Search Rules using Mahout’s Association Rule Mining

Filed under: Machine Learning,Mahout,Searching — Patrick Durusau @ 2:05 pm

Search Rules using Mahout’s Association Rule Mining by Sujit Pal.

This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in the query – for example, if a query contained “cattle” and “neurological disorder”, then we should also serve results for “bovine spongiform encephalopathy”, also known as “mad cow disease”.

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client’s requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept with came from one of my medium sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won’t be able to handle this data volume. Mahout is probably a better fit, and it offers the FPGrowth algorithm running on Hadoop, so that’s what I used. This post describes the things I had to do to prepare my data for Mahout, run the job with Mahout on Amazon Elastic Map Reduce (EMR) platform, then post process the data to get useful information out of it.
(…)

I don’t know that I would call these “search rules” but they would certainly qualify as input into defining merging rules.

Particularly if I were mining domain literature, where co-occurrences of terms are likely to share the same semantics. Not always, but likely. The likelihood of semantic sameness is something you can sample for and develop confidence measures about.
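To make the jump from mined rules to merging rules concrete, here is a toy sketch in plain Java (not Sujit’s Mahout/FPGrowth pipeline) of the support and confidence numbers behind association rules, treating each document as its bag of concepts:

```java
import java.util.*;

public class CooccurrenceRules {

  // Fraction of documents whose concept bag contains every concept in the itemset.
  static double support(List<Set<String>> docs, Set<String> itemset) {
    int hits = 0;
    for (Set<String> d : docs) {
      if (d.containsAll(itemset)) hits++;
    }
    return (double) hits / docs.size();
  }

  // Confidence of antecedent -> consequent: of the documents containing the
  // antecedent concepts, what fraction also contain the consequent?
  static double confidence(List<Set<String>> docs,
                           Set<String> antecedent, Set<String> consequent) {
    int withAntecedent = 0, withBoth = 0;
    for (Set<String> d : docs) {
      if (d.containsAll(antecedent)) {
        withAntecedent++;
        if (d.containsAll(consequent)) withBoth++;
      }
    }
    return withAntecedent == 0 ? 0.0 : (double) withBoth / withAntecedent;
  }

  public static void main(String[] args) {
    List<Set<String>> docs = Arrays.asList(
        new HashSet<>(Arrays.asList("cattle", "neurological disorder",
                                    "bovine spongiform encephalopathy")),
        new HashSet<>(Arrays.asList("cattle", "neurological disorder")),
        new HashSet<>(Arrays.asList("cattle", "grazing")));

    Set<String> antecedent =
        new HashSet<>(Arrays.asList("cattle", "neurological disorder"));
    Set<String> consequent =
        new HashSet<>(Arrays.asList("bovine spongiform encephalopathy"));

    System.out.println(support(docs, antecedent));                // 0.666...
    System.out.println(confidence(docs, antecedent, consequent)); // 0.5
  }
}
```

A rule that clears whatever support and confidence thresholds you set is a candidate merging rule, to be sampled and checked as suggested above, not adopted automatically.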

Easy 6502

Filed under: Computer Science,Programming — Patrick Durusau @ 1:54 pm

Easy 6502 by Nick Morgan.

From the post:

In this tiny ebook I’m going to show you how to get started writing 6502 assembly language. The 6502 processor was massive in the seventies and eighties, powering famous computers like the BBC Micro, Atari 2600, Commodore 64, Apple II, and the Nintendo Entertainment System. Bender in Futurama has a 6502 processor for a brain. Even the Terminator was programmed in 6502.

So, why would you want to learn 6502? It’s a dead language isn’t it? Well, so’s Latin. And they still teach that. Q.E.D.

(Actually, I’ve been reliably informed that 6502 processors are still being produced by Western Design Center, so clearly 6502 isn’t a dead language! Who knew?)

Seriously though, I think it’s valuable to have an understanding of assembly language. Assembly language is the lowest level of abstraction in computers – the point at which the code is still readable. Assembly language translates directly to the bytes that are executed by your computer’s processor. If you understand how it works, you’ve basically become a computer magician.

Then why 6502? Why not a useful assembly language, like x86? Well, I don’t think learning x86 is useful. I don’t think you’ll ever have to write assembly language in your day job – this is purely an academic exercise, something to expand your mind and your thinking. 6502 was originally written in a different age, a time when the majority of developers were writing assembly directly, rather than in these new-fangled high-level programming languages. So, it was designed to be written by humans. More modern assembly languages are meant to be written by compilers, so let’s leave it to them. Plus, 6502 is fun. Nobody ever called x86 fun.

I’m not so sure about never writing assembly language in your day job. The security on Intel’s Ivy Bridge line of microprocessors can be broken by changing the doping on the chip. Not an everyday skill. See Researchers can slip an undetectable trojan into Intel’s Ivy Bridge CPUs by Dan Goodin for the details.

I mention the 6502 ebook because programming where subject identity is an issue is different from programming where all identities are opaque.

Think about calling a method on a class in Java. Either the runtime finds the class and method or the call fails. It does not look for a class with particular characteristics, test the candidate for subject identity, and only then invoke the method.

I mention this because it seems relevant to the scaling question for topic maps. When you are manipulating subject identities, there is more work going on than invoking opaque strings that are found or not.

Part of deciding how to deal with the additional overhead is choosing which subjects are treated as having identifiers and which are going to be treated as opaque strings.
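A minimal sketch of that contrast in Java, using reflection and an annotation as a stand-in for a subject identity test (a real topic map engine would match on identifiers, names and occurrences rather than annotations, and the identifier below is made up):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

public class IdentityLookupDemo {

  // A stand-in for a subject identity property.
  @Retention(RetentionPolicy.RUNTIME)
  @interface SubjectIdentifier {
    String value();
  }

  static class BseTopic {
    @SubjectIdentifier("http://example.org/subject/mad-cow-disease")
    public String label() {
      return "bovine spongiform encephalopathy";
    }
  }

  public static void main(String[] args) throws Exception {
    BseTopic topic = new BseTopic();

    // Opaque invocation: the name either resolves or the call fails.
    System.out.println(topic.label());

    // Identity-test-then-invoke: inspect candidates, compare an identity
    // property, and only then call the method. More work per call.
    for (Method m : BseTopic.class.getMethods()) {
      SubjectIdentifier id = m.getAnnotation(SubjectIdentifier.class);
      if (id != null && id.value().endsWith("mad-cow-disease")) {
        System.out.println(m.invoke(topic));
      }
    }
  }
}
```

The asymmetry is the point: the first call is a direct dispatch, the second is a scan plus a comparison per candidate, which is the overhead you budget for when deciding which subjects get identifiers and which remain opaque strings.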

September 20, 2013

Apache Camel tunes its core with new release

Filed under: Apache Camel,Integration — Patrick Durusau @ 6:47 pm

Apache Camel tunes its core with new release by Lucy Carey.

From the post:

The community around open-source integration framework Apache Camel is a prolific little hub, and in the space of just four and a half months, has put together a shiny new release – Apache Camel 2.12 – the 53rd Camel version to date.

On the menu for developers is a total of 17 new components, four new examples, and souped-up performance in simple or bean languages and general routing. More than three hundred JIRA tickets have been solved, and a lot of bug swatting and general fine tuning has taken place. Reflecting the hugely active community around the platform, around half of these new components come courtesy of external contributors, and the rest from Camel team developers.

Fulltime Apache Camel committer Claus Ibsen notes in his blog that this is the first release where steps have been taken to “allow Camel components documentation in the source code which gets generated and included in the binaries.” He also writes that “a Camel component can offer endpoint completion which allows tooling to offer smart completion”, citing the hawtio web console as an example of the ways in which this enables functions like auto completion for JMS queue names, file directory names, bean names in the registry.
(…)

Camel homepage.

If you are looking for a variety of explanations about Camel, the Camel homepage recommends a discussion at Stack Overflow.

Not quite the blind men with the elephant but enough differences in approaches to be amusing.

September 19, 2013

Over $1 Trillion in Tax Breaks…

Filed under: Data,Government,News — Patrick Durusau @ 6:51 pm

Over $1 Trillion in Tax Breaks Are Detailed in New Report by Jessica Schieder.

From the post:

Tax breaks cost the federal government approximately $1.13 trillion in fiscal year 2013, according to a new report by the National Priorities Project (NPP). That is just slightly less than all federal discretionary spending in FY 2013 combined.

So, the headline got your attention? It certainly got mine.

But unlike many alarmist headlines (can you say CNN?), this story came with data to back up its statements.

How much data you ask?

Well, tax break data from 1974 to the present, described as:

NPP has created the first time series tax break dataset by obtaining archived budget requests, converting them to electronic format, and standardizing the categories and names over time. We’ve also added several calculations and normalizations to make these data more useful to researchers.

What you will find in this dataset:

  • Tax break names, standardized over time
  • Tax break categories, standardized over time
  • Estimated annual tax break costs (both real dollars and adjusted for inflation)
  • Annual tax break costs as a percent change from the previous year
  • Annual tax break costs as a percentage of Gross Domestic Product (GDP)
  • Annual tax break costs as a percentage of their corresponding category

The full notes and sources, including our methodology and a data dictionary, are here.

The original report: Exposing the Big Money in Tax Breaks by Mattea Kramer. Data support by Becky Sweger and Asher Dvir-Djerassi.

Sponsored by the National Priorities Project.

Do your main sources of news distribute relevant data? To enable you to reach your own conclusions?

If not, you should start asking why not?

Pixar’s 22 Rules of Storytelling

Filed under: Communication,Marketing — Patrick Durusau @ 6:26 pm

Pixar’s 22 Rules of Storytelling by DinoIgnacio.

From the webpage:

Former Pixar story artist Emma Coats tweeted this series of “story basics” in 2011 (https://twitter.com/lawnrocket). These were guidelines that she learned from her more senior colleagues on how to create appealing stories. I superimposed all 22 rules over stills from Pixar films to help me remember them. All Disney copyrights, trademarks, and logos are owned by The Walt Disney Company.

If you find the rules hard to read with the picture backgrounds (I do), see the text version: http://imgur.com/a/MRfTb.

While cast as rules for “storytelling,” these are rules for effective communication in any context.

Sparkey

Filed under: Key-Value Stores,NoSQL — Patrick Durusau @ 2:25 pm

Sparkey

From the webpage:

Sparkey is an extremely simple persistent key-value store. You could think of it as a read-only hashtable on disk and you wouldn’t be far off. It is designed and optimized for some server side usecases at Spotify but it is written to be completely generic and makes no assumptions about what kind of data is stored.

Some key characteristics:

  • Supports data sizes up to 2^63 – 1 bytes.
  • Supports iteration, get, put, delete
  • Optimized for bulk writes.
  • Immutable hash table.
  • Any amount of concurrent independent readers.
  • Only allows one writer at a time per storage unit.
  • Cross platform storage file.
  • Low overhead per entry.
  • Constant read startup cost
  • Low number of disk seeks per read
  • Support for block level compression.
  • Data agnostic, it just maps byte arrays to byte arrays.

What it’s not:

  • It’s not a distributed key value store – it’s just a hash table on disk.
  • It’s not a compacted data store, but that can be implemented on top of it, if needed.
  • It’s not robust against data corruption.

The usecase we have for it at Spotify is serving data that rarely gets updated to users or other services. The fast and efficient bulk writes makes it feasible to periodically rebuild the data, and the fast random access reads makes it suitable for high throughput low latency services. For some services we have been able to saturate network interfaces while keeping cpu usage really low.

If you are looking for a very high-performance key-value store with little to no frills, your search may be over.

Originating with Spotify and being able to saturate network interfaces bodes well for those needing pure performance.

I first saw this in Nat Torkington’s Four short links: 10 September 2013.

Context Aware Searching

Filed under: Context,RDF,Searching,Semantic Graph,Semantic Web — Patrick Durusau @ 9:53 am

Scaling Up Personalized Query Results for Next Generation of Search Engines

From the post:

North Carolina State University researchers have developed a way for search engines to provide users with more accurate, personalized search results. The challenge in the past has been how to scale this approach up so that it doesn’t consume massive computer resources. Now the researchers have devised a technique for implementing personalized searches that is more than 100 times more efficient than previous approaches.

At issue is how search engines handle complex or confusing queries. For example, if a user is searching for faculty members who do research on financial informatics, that user wants a list of relevant webpages from faculty, not the pages of graduate students mentioning faculty or news stories that use those terms. That’s a complex search.

“Similarly, when searches are ambiguous with multiple possible interpretations, traditional search engines use impersonal techniques. For example, if a user searches for the term ‘jaguar speed,’ the user could be looking for information on the Jaguar supercomputer, the jungle cat or the car,” says Dr. Kemafor Anyanwu, an assistant professor of computer science at NC State and senior author of a paper on the research. “At any given time, the same person may want information on any of those things, so profiling the user isn’t necessarily very helpful.”

Anyanwu’s team has come up with a way to address the personalized search problem by looking at a user’s “ambient query context,” meaning they look at a user’s most recent searches to help interpret the current search. Specifically, they look beyond the words used in a search to associated concepts to determine the context of a search. So, if a user’s previous search contained the word “conservation” it would be associated with concepts like “animals” or “wildlife” and even “zoos.” Then, a subsequent search for “jaguar speed” would push results about the jungle cat higher up in the results — and not the automobile or supercomputer. And the more recently a concept has been associated with a search, the more weight it is given when ranking results of a new search.

I rather like the contrast of ambiguous searches being resolved with “impersonal techniques.”

The paper, Scaling Concurrency of Personalized Semantic Search over Large RDF Data by Haizhou Fu, Hyeongsik Kim, and Kemafor Anyanwu, has this abstract:

Recent keyword search techniques on Semantic Web are moving away from shallow, information retrieval-style approaches that merely find “keyword matches” towards more interpretive approaches that attempt to induce structure from keyword queries. The process of query interpretation is usually guided by structures in data, and schema and is often supported by a graph exploration procedure. However, graph exploration-based interpretive techniques are impractical for multi-tenant scenarios for large database because separate expensive graph exploration states need to be maintained for different user queries. This leads to significant memory overhead in situations of large numbers of concurrent requests. This limitation could negatively impact the possibility of achieving the ultimate goal of personalizing search. In this paper, we propose a lightweight interpretation approach that employs indexing to improve throughput and concurrency with much less memory overhead. It is also more amenable to distributed or partitioned execution. The approach is implemented in a system called “SKI” and an experimental evaluation of SKI’s performance on the DBPedia and Billion Triple Challenge datasets show orders-of-magnitude performance improvement over existing techniques.

If you are interested in scaling issues for topic maps, note the use of indexing as opposed to graph exploration techniques in this paper.

Also consider mining “discovered” contexts that lead to “better” results from the viewpoint of users. Those could be the seeds for serializing those contexts as topic maps.

Perhaps even directly applicable to work by researchers, librarians, intelligence analysts.

Seasoned searchers use richer contexts in searching than the average user, and if those contexts are captured, they could enrich the search contexts of the average user.
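As a toy illustration of the recency weighting described in the announcement (this is not the SKI system; the half-life constant and concept names are invented for the example):

```java
import java.util.LinkedHashMap;

public class AmbientContext {

  // Concepts touched by recent searches, mapped to the query index that last saw them.
  private final LinkedHashMap<String, Long> lastSeen = new LinkedHashMap<>();

  void observe(String concept, long queryIndex) {
    lastSeen.put(concept, queryIndex);
  }

  // Exponential decay: a concept seen in the current query gets weight 1.0,
  // one seen k queries ago gets 0.5^k. The half-life of one query is arbitrary.
  double weight(String concept, long currentQueryIndex) {
    Long seen = lastSeen.get(concept);
    if (seen == null) return 0.0;
    return Math.pow(0.5, currentQueryIndex - seen);
  }

  public static void main(String[] args) {
    AmbientContext ctx = new AmbientContext();
    ctx.observe("conservation", 1); // earlier search
    ctx.observe("wildlife", 1);     // concept expansion of that search

    long now = 3;                   // two queries later: "jaguar speed"
    // Candidate interpretations get a context boost from prior concepts.
    System.out.println("jungle cat boost: " + ctx.weight("wildlife", now));   // 0.25
    System.out.println("car boost: " + ctx.weight("luxury cars", now));       // 0.0
  }
}
```

Contexts captured this way, together with the concepts they connect, are the sort of seeds for topic maps suggested above.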

September 18, 2013

NSA – Sharing vs. Security

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 6:30 pm

Tom Gjelten’s report, Officials: Edward Snowden’s Leaks Were Masked By Job Duties, goes a long way towards explaining how Edward Snowden was able to move secret documents without suspicion.

The story also highlights the tension between needing to share information and at the same time keeping it secure:

According to the officials, the documents Snowden leaked — the memoranda, PowerPoint slides, agency reports, court orders and opinions — had all been stored in a file-sharing location on the NSA’s intranet site. The documents were put there so NSA analysts and officials could read them online and discuss them.

The importance of such information-sharing procedures was one of the lessons of the Sept. 11 attacks. Law enforcement and intelligence agencies were unable to “connect the dots” prior to the attacks, because they were not always aware of what other agencies knew.

Does the NSA response:

The NSA will now be “tagging” sensitive documents and data with identifiers that will limit access to those individuals who have a need to see the documents and who are authorized by NSA leadership to view them. The tagging will also allow supervisors to see what individuals do with the data they see and handle.

surprise you?

Perhaps I should ask:

Do you see what’s common in the “connect the dots” and “secure the data” responses?

The NSA isn’t sharing data or information about people, places, events, etc., it is sharing (or securing) documents.

The data is something that every analyst must pry out on their own. And every analyst has to repeat the mining operation of every other analyst who discovers the same data.

The lesson here is that document level sharing or security is too coarse to satisfy goals of sharing or security. To say nothing of being wasteful of the time of analysts, probably a more precious resource than even more data.

Sharing vs. Securing Example

Assume that I am researching visitor X to the United States who has just gone through passport control. I get a number of resources listed in response to a search. In addition to phone traffic records, etc., there is a top secret report document.

This top secret document is a report from an American embassy saying that visitor X was seen by a CIA operative (insert code name) in the company of bad actor 1 and bad actor 2 at some location. No details of what was said, but subsequent surveillance establishes that visitor X may be friends with bad actor 1 or 2.

Under the current process, the choice is either to allow me access to this document – which makes me aware of the embassy, the CIA operative’s code name, dates of observation, etc., in short, a lot of information that may not be necessary for my task – or to deny me access altogether.

For some particular task, such as who else to watch for in the U.S., knowing visitor X is a possible associate of bad actors 1 and 2, alerts me that known associates of bad actors 1 and 2 should also be put under a higher level of surveillance while visitor X is in the U.S. I don’t need to know the embassy, CIA operative, etc., to make that determination.

End Example

The document level sharing/security paradigm of the NSA was the only solution when documents were physical documents. It wasn’t possible to share documents other than as whole documents. Well, there is the Magic Marker option but that doesn’t really scale. And every document has to be “marked” for a particular person or level of clearance.

If we treat electronic documents as, well, electronic documents, we can split them at whatever level is appropriate for security control. Non-trivial but a similar process was developed and used at the Y-12 Complex at Oak Ridge, Tennessee (the place where they build and maintain nuclear weapons).

Beyond splitting documents for security purposes, it is also possible to accumulate the insights of analysts who have read portions of those documents. So that every analyst doesn’t have to read every relevant document but can build upon what has already been discovered by others.

Capturing the insights of analysts on a granular and re-usable level conserves something more precious than more raw data, human insight into data.

None of that can happen overnight, but continuing with a model of documents as physical objects only delays the day when more granular access enables more sharing and better security. To say nothing of capturing the insights of analysts for the benefit of an entire enterprise.
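A toy sketch of the difference between document-level and element-level control, with every label and audience name invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GranularReport {

  // One assertion pried out of the report, carrying its own releasability tag.
  static class Statement {
    final String text;
    final Set<String> audiences;

    Statement(String text, String... audiences) {
      this.text = text;
      this.audiences = new HashSet<>(Arrays.asList(audiences));
    }
  }

  // Document-level control would return all statements or none.
  // Element-level control filters statement by statement.
  static List<String> view(List<Statement> report, String audience) {
    List<String> visible = new ArrayList<>();
    for (Statement s : report) {
      if (s.audiences.contains(audience)) visible.add(s.text);
    }
    return visible;
  }

  public static void main(String[] args) {
    List<Statement> report = Arrays.asList(
        new Statement("Visitor X is a possible associate of bad actors 1 and 2",
            "watchlist-analyst", "station-chief"),
        new Statement("Observed by operative [code name] at embassy [E] on [date]",
            "station-chief"));

    // The watch-list analyst gets the association, not the sources and methods.
    System.out.println(view(report, "watchlist-analyst"));
  }
}
```

The watch-list analyst gets the association needed for the task; the sources-and-methods statement stays with those who need it.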

PS: If you know anyone at the NSA, forward this post to them. I dislike poor information systems more than I dislike any government.

Structr 0.8 Release

Filed under: Graphs,Neo4j,structr — Patrick Durusau @ 4:20 pm

Release 0.8 is out! by Axel Morgner.

From the post:

Yesterday, we released Structr 0.8. It was a really important milestone on the way to 1.0.

Axel answers your immediate question, “Why so Important?” with:

Because it contains a lot of improvements to the UI, and the UI is important for broad adoption. For example, we introduced “Widgets”.

In case you are unfamiliar with Structr:

Structr (pronounced ‘structure’) is a framework for mobile and web applications based on the graph database Neo4j, with a supplemental UI providing CMS functionality to serve pages, files and images.

It was designed to simplify the creation and operation of graph database applications by providing a comprehensive Java API with a built-in feature set common to most use cases, like e.g. authentication, users and groups, constraints and validation, etc..

All custom-built features are automatically exposed through a flexible RESTful API which enables developers to build sophisticated web or mobile apps based on Neo4j within just hours. [From the Structr homepage.]

The latest release is definitely worth a close look.

September 17, 2013

Havalo [NoSQL for Small Data]

Filed under: Data Analysis,NoSQL — Patrick Durusau @ 6:33 pm

Havalo

From the webpage:

A zero configuration, non-distributed NoSQL key-value store that runs in any Servlet 3.0 compatible container.

Sometimes you just need fast NoSQL storage, but don’t need full redundancy and scalability (that’s right, localhost will do just fine). With Havalo, simply drop havalo.war into your favorite Servlet 3.0 compatible container and with almost no configuration you’ll have access to a fast and lightweight K,V store backed by any local mount point for persistent storage. And, Havalo has a pleasantly simple RESTful API for your added enjoyment.

Havalo is perfect for testing, maintaining fast indexes of data stored “elsewhere”, and almost any other deployment scenario where relational databases are just too heavy.

The latest stable version of Havalo is 1.4.

Interesting move toward the shallow end of the data pool for NoSQL.

I don’t know of any reason why small data could not benefit from NoSQL flexibility.

Lowering the overhead of NoSQL for small data may introduce more people to NoSQL earlier in their data careers.

Which means when they move up the ladder to “big data,” they won’t be easily impressed.

Are there other “small data” friendly NoSQL solutions you would recommend?

Groups: knowledge spreadsheets for symbolic biocomputing [Semantic Objects]

Filed under: Bioinformatics,Knowledge Map,Knowledge Representation — Patrick Durusau @ 4:53 pm

Groups: knowledge spreadsheets for symbolic biocomputing by Michael Travers, Suzanne M. Paley, Jeff Shrager, Timothy A. Holland and Peter D. Karp.

Abstract:

Knowledge spreadsheets (KSs) are a visual tool for interactive data analysis and exploration. They differ from traditional spreadsheets in that rather than being oriented toward numeric data, they work with symbolic knowledge representation structures and provide operations that take into account the semantics of the application domain. ‘Groups’ is an implementation of KSs within the Pathway Tools system. Groups allows Pathway Tools users to define a group of objects (e.g. groups of genes or metabolites) from a Pathway/Genome Database. Groups can be transformed (e.g. by transforming a metabolite group to the group of pathways in which those metabolites are substrates); combined through set operations; analysed (e.g. through enrichment analysis); and visualized (e.g. by painting onto a metabolic map diagram). Users of the Pathway Tools-based BioCyc.org website have made extensive use of Groups, and an informal survey of Groups users suggests that Groups has achieved the goal of allowing biologists themselves to perform some data manipulations that previously would have required the assistance of a programmer.

Database URL: BioCyc.org.

Not my area so a biologist would have to comment on the substantive aspects of using these particular knowledge spreadsheets.

But there is much in this article that could be applied more broadly.

From the introduction:

A long-standing problem in computing is that of providing non-programmers with intuitive, yet powerful tools for manipulating and analysing sets of entities. For example, a number of bioinformatics database websites provide users with powerful tools for composing database queries, but once a user obtains the query results, they are largely on their own. What if a user wants to store the query results for future reference, or combine them with other query results, or transform the results, or share them with a colleague? Sets of entities of interest arise in other contexts for life scientists, such as the entities that are identified as significantly perturbed in a high-throughput experiment (e.g. a set of differentially occurring metabolites), or a set of genes of interest that emerge from an experimental investigation.

We observe that spreadsheets have become a dominant form of end-user programming and data analysis for scientists. Although traditional spreadsheets provide a compelling interaction model, and are excellent tools for the manipulation of the tables of numbers that are typical of accounting and data analysis problems, they are less easily used with the complex symbolic computations typical of symbolic biocomputing. For example, they cannot perform semantic transformations such as converting a gene list to the list of pathways the genes act in.

We coined the term knowledge spreadsheet (KS) to describe spreadsheets that are characterized by their ability to manipulate semantic objects and relationships instead of just numbers and strings. Both traditional spreadsheets and KSs represent data in tabular structures, but in a KS the contents of a cell will typically be an object from a knowledge base (KB) [such as a MetaCyc (1) frame or a URI entity from an RDF store]. Given that a column in a KS will typically contain objects of the same ontological type, a KS can offer high-level semantically knowledgeable operations on the data. For example, given a group with a column of metabolites, a semantic operation could create a parallel column in which each cell contained the reactions that produced that metabolite. Another difference between our implementation of KSs and traditional spreadsheets is that cells in our KSs can contain multiple values.
(…)

Can you think of any domain that would not benefit from better handling of “semantic objects?”

As you read the article closely, any number of ideas or techniques for manipulating “semantic objects” will come to mind.

Data Mining and Analysis Textbook

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 4:40 pm

Data Mining and Analysis Textbook by Ryan Swanstrom.

Ryan points out: Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira, Jr. is available for PDF download.

Due out from Cambridge University Press in 2014.

If you want to encourage Cambridge University Press and others to continue releasing pre-publication PDFs, please recommend this text over less available ones for classroom adoption.

Or for that matter, read the PDF version and submit comments and corrections, also pre-publication.

Good behavior reinforces good behavior. You know what the reverse brings.

Joy of Holiday Assembly!

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:22 pm

IKEA wardrobes and Graphs: a perfect fit! by Rik Van Bruggen.

Assembly graph

The idea for this blogpost was quite long in the making. We all know IKEA which is, like Neo4j, from Sweden. Most of us have delivered a daring attempt at assembling one of their furnitures. And most recently, even my 8- and 10-year old kids assembled their Swedish bedside tables themselves. Win!

In the past year or so, every so often someone approached me to talk about how to use Neo4j in a manufacturing context. And every single time I thought to myself: what a great, wonderful fit! We all know “reality is a graph”, but when you look at manufacturing processes – and the way different process components interact – you quickly see that these wonderful flowchart diagrams actually represent a network. A graph. And when you then start thinking about all the parts and components that are required to deliver these processes – then it becomes even clearer: the “bill of material” of manufactured goods can also, predictably, be represented as a graph.

So there you have it. Manufacturing processes and bills or materials can be represented as a graph. And IKEA cupboards, wardrobes, tables, beds, stools – everywhere. How to make the match?

Rik does a great job of demonstrating the use of Neo4j for the familiar task of assembling furniture, and suggests that a similar graph could be used by the manufacturer of such products.
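If you want to follow along in code rather than in Cypher, here is a minimal bill-of-materials sketch against Neo4j’s embedded Java API. The part names, quantities and database path are made up, and the transaction idiom follows the 1.9-era API, so adjust it for your Neo4j version:

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class AssemblyGraph {

  private static final RelationshipType PART_OF =
      DynamicRelationshipType.withName("PART_OF");

  public static void main(String[] args) {
    GraphDatabaseService db =
        new GraphDatabaseFactory().newEmbeddedDatabase("target/assembly-db");

    Transaction tx = db.beginTx();
    try {
      Node wardrobe = db.createNode();
      wardrobe.setProperty("name", "wardrobe");

      Node sidePanel = db.createNode();
      sidePanel.setProperty("name", "side panel");
      sidePanel.setProperty("quantity", 2);

      Node camLock = db.createNode();
      camLock.setProperty("name", "cam lock");
      camLock.setProperty("quantity", 8);

      // Bill of materials: parts point at the assembly they belong to.
      sidePanel.createRelationshipTo(wardrobe, PART_OF);
      camLock.createRelationshipTo(sidePanel, PART_OF);

      tx.success();
    } finally {
      tx.finish();
    }

    db.shutdown();
  }
}
```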

I think it is implied that creating the graph for components and the assembly process is also a way to delay the onset of assembly itself.

Rik’s mention of Sweden, however, is a tip-off this example is culturally bound. To Sweden that is.

A graph of assembly instructions in the United States, particularly during the holiday season, would be substantially different than Rik’s.

Tracking the assembly instructions, the graph would follow these rules:

  1. Number of nodes must not match number of parts
  2. Labels on nodes might match part names and/or differ by one letter.
  3. Labels on arcs would be correct no more than 80% of the time.
  4. Assembly arcs would include arcs for other models.

Perhaps a new holiday tradition?

Creating an assembly graph for a randomly chosen set of instructions?

😉

NIST recommends against NSA-influenced standards

Filed under: Cryptography,Cybersecurity,NSA,Security — Patrick Durusau @ 9:38 am

NIST recommends against NSA-influenced standards by Frank Konkel.

From the post:

The National Institute of Standards and Technology, the agency that sets guidelines, policy and standards used by computer systems in the federal government and worldwide, now “strongly” recommends against using an encryption standard that leaked top-secret documents show was weakened by the National Security Agency.

NIST’s Information Technology Laboratory recently authored a technical bulletin that urges users not to make use of Special Publication (SP) 800-90A, which was reopened for public comment with draft Special Publications 800-90B and 800-90C on Sept. 10, providing the cryptographic community another chance to comment on encryption standards that were approved by NIST in 2006.

“NIST strongly recommends that, pending the resolution of the security concerns and the re-issuance of SP 800-90A, the Dual_EC_DRBG, as specified in the January 2012 version of SP 800-90A, no longer be used,” the bulletin states.

The NIST bulletin, SUPPLEMENTAL ITL BULLETIN FOR SEPTEMBER 2013, is important for several reasons.

First, it is fair warning to security designers not to use the encryption described in SP 800-90A. Use of SP 800-90A after this report is a slam-dunk case of security malpractice.

Second, it reminds us that, while rare, there are government agencies who take their missions to serve the public quite seriously and who are prone to honest actions and statements.

Quite unlike the departments of State and Defense, where the real question isn’t whether they are lying, but of the motivation for lying.

September 16, 2013

Self Organizing Maps

Filed under: Self Organizing Maps (SOMs),Semantics — Patrick Durusau @ 4:48 pm

Self Organizing Maps by Giuseppe Vettigli.

From the post:

The Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1996) is the following

If you are looking for self organizing maps using Python, this is the right place.

As with all mathematical techniques, SOMs require the author to bridge the gap between semantics and discrete values for processing.

An iffy process at best.
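Giuseppe’s walkthrough is in Python; as a language-neutral reminder of how little machinery one SOM training step needs, here is a minimal sketch in Java (lattice size, learning rate, neighborhood radius and the sample vector are arbitrary):

```java
import java.util.Random;

public class SomStep {

  static final int GRID = 10;   // 10 x 10 lattice of neurons
  static final int DIM = 3;     // input dimensionality
  static double[][][] weights = new double[GRID][GRID][DIM];

  // Index of the best matching unit (BMU): the neuron closest to the input.
  static int[] bestMatchingUnit(double[] x) {
    int[] bmu = {0, 0};
    double best = Double.MAX_VALUE;
    for (int i = 0; i < GRID; i++) {
      for (int j = 0; j < GRID; j++) {
        double d = 0;
        for (int k = 0; k < DIM; k++) {
          double diff = x[k] - weights[i][j][k];
          d += diff * diff;
        }
        if (d < best) { best = d; bmu = new int[]{i, j}; }
      }
    }
    return bmu;
  }

  // One training step: pull the BMU and its lattice neighbors toward the input.
  static void train(double[] x, double learningRate, double radius) {
    int[] bmu = bestMatchingUnit(x);
    for (int i = 0; i < GRID; i++) {
      for (int j = 0; j < GRID; j++) {
        double latticeDist2 = (i - bmu[0]) * (i - bmu[0]) + (j - bmu[1]) * (j - bmu[1]);
        double h = Math.exp(-latticeDist2 / (2 * radius * radius)); // neighborhood function
        for (int k = 0; k < DIM; k++) {
          weights[i][j][k] += learningRate * h * (x[k] - weights[i][j][k]);
        }
      }
    }
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    for (int i = 0; i < GRID; i++)
      for (int j = 0; j < GRID; j++)
        for (int k = 0; k < DIM; k++)
          weights[i][j][k] = r.nextDouble();

    double[] sample = {0.9, 0.1, 0.1}; // e.g. an RGB color
    for (int step = 0; step < 100; step++) train(sample, 0.1, 2.0);
  }
}
```

The arithmetic is the easy part; deciding how your semantics become those input vectors in the first place is the iffy step noted above.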

Building better search tools: problems and solutions

Filed under: Search Behavior,Search Engines,Searching — Patrick Durusau @ 4:38 pm

Building better search tools: problems and solutions by Vincent Granville.

From the post:

Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two very different keywords: mining data usually means data about mining. And if you search for data about mining you still get the same results anyway.

(graphic omitted)

Yet Google has one of the best search algorithms. Imagine an e-store selling products, allowing users to search for products via a catalog powered with search capabilities, but returning irrelevant results 20% of the time. What a loss of money! Indeed, if you were an investor looking on Amazon to purchase a report on mining data, all you will find are books on data mining and you won’t buy anything: possibly a $500 loss for Amazon. Repeat this million times a year, and the opportunity cost is in billions of dollars.

There are a few issues that make this problem difficult to fix. While the problem is straightforward for decision makers, CTO’s or CEO’s to notice, understand and assess the opportunity cost (just run 200 high value random search queries, see how many return irrelevant results), the communication between the analytic teams and business people is faulty: there is a short somewhere.

There might be multiple analytics teams working as silos – computer scientists, statisticians, engineers – sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hears is a lot of noise and lots of technicalities, and they don’t know how to start, how much it will cost to fix it, and how complex the issue is, and who should fix it.

Here I discuss the solution and explain it in very simple terms, to help any business having a search engine and an analytic team, easily fix the issue.

Vincent has some clever insights into this particular type of search problem but I think it falls short of being “easily” fixed.

Read his original post and see if you think the solution is an “easy” one.

Questions

Filed under: Humor,Searching — Patrick Durusau @ 4:26 pm

Greg Linden pointed out an excellent xkcd cartoon composed of auto-completed questions from Google.

Maximize your enjoyment by entering a few of the terms in your search box.

The auto-completed questions and their “answers” may surprise you.

Principles of Reactive Programming [Nov. 2013]

Filed under: Functional Programming,Scala — Patrick Durusau @ 4:18 pm

Principles of Reactive Programming by Martin Odersky, Erik Meijer and Roland Kuhn.

From the webpage:

This is a follow-on for the Coursera class “Principles of Functional Programming in Scala”, which so far had more than 100,000 enrollments over two iterations of the course, with some of the highest completion rates of any massive open online course worldwide.

The aim of the second course is to teach the principles of reactive programming. Reactive programming is an emerging discipline which combines concurrency and event-based and asynchronous systems. It is essential for writing any kind of web-service or distributed system and is also at the core of many high-performance concurrent systems. Reactive programming can be seen as a natural extension of higher-order functional programming to concurrent systems that deal with distributed state by coordinating and orchestrating asynchronous data streams exchanged by actors.

In this course you will discover key elements for writing reactive programs in a composable way. You will find out how to apply these building blocks in the construction of event-driven systems that are scalable and resilient.

The course is hands on; most units introduce short programs that serve as illustrations of important concepts and invite you to play with them, modifying and improving them. The course is complemented by a series of assignments, which are also programming projects.

Starts November 4, 2013 and lasts for seven weeks.

See the webpage for the syllabus, requirements, etc.

Cassandra – A Decentralized Structured Storage System [Annotated]

Filed under: Cassandra,CQL - Cassandra Query Language,NoSQL — Patrick Durusau @ 4:11 pm

Cassandra – A Decentralized Structured Storage System by Avinash Lakshman, Facebook and Prashant Malik, Facebook.

Abstract:

Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

Annotated version of the original 2009 Cassandra paper.

Not a guide to future technology but a very interesting read about how Cassandra arrived at the present.

September 15, 2013

Apache Tez: A New Chapter in Hadoop Data Processing

Filed under: Hadoop,Hadoop YARN,Tez — Patrick Durusau @ 3:56 pm

Apache Tez: A New Chapter in Hadoop Data Processing by Bikas Saha.

From the post:

In this post we introduce the motivation behind Apache Tez (http://incubator.apache.org/projects/tez.html) and provide some background around the basic design principles for the project. As Carter discussed in our previous post on Stinger progress, Apache Tez is a crucial component of phase 2 of that project.

What is Apache Tez?

Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks. It also represents the next logical step for Hadoop 2 and the introduction of YARN and its more general-purpose resource management framework.

While MapReduce has served masterfully as the data processing backbone for Hadoop, its batch-oriented nature makes it unsuited for certain workloads like interactive query. Tez represents an alternate to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale. A great example of a benefactor of this new approach is Apache Hive and the work being done in the Stinger Initiative.

Motivation

Distributed data processing is the core application that Apache Hadoop is built around. Storing and analyzing large volumes and variety of data efficiently has been the cornerstone use case that has driven large scale adoption of Hadoop, and has resulted in creating enormous value for the Hadoop adopters. Over the years, while building and running data processing applications based on MapReduce, we have understood a lot about the strengths and weaknesses of this framework and how we would like to evolve the Hadoop data processing framework to meet the evolving needs of Hadoop users. As the Hadoop compute platform moves into its next phase with YARN, it has decoupled itself from MapReduce being the only application, and opened the opportunity to create a new data processing framework to meet the new challenges. Apache Tez aspires to live up to these lofty goals.

Is your topic map engine decoupled from a single merging algorithm?

I ask because SLAs may require different algorithms for data sets or sources.

Leaked U.S. military documents may have a higher priority for completeness than half-human/half-bot posts on a Twitter stream.

Are crowdsourced maps the future of navigation? [Supplying Context?]

Filed under: Crowd Sourcing,Mapping,Maps,Open Street Map — Patrick Durusau @ 3:05 pm

Are crowdsourced maps the future of navigation? by Kevin Fitchard.

From the post:

Given the craziness of the first two weeks in September in the tech world an interesting hire that should have gotten more attention slipped largely through the cracks. Steve Coast, founder of the OpenStreetMap project, has joined Telenav, signaling a big move by the navigation outfit toward crowdsourced mapping.

OpenStreetMap is the Wikipedia of mapping. OSM’s dedicated community of 1.3 million editors have gathered GPS data while driving, biking and walking the streets of the world to build a map from the ground up. They’ve even gone so far as to mark objects that exist on few other digital maps, from trees to park benches. That map was then offered up free to all comers.

Great story about mapping, crowd sourcing, etc., but it also has this gem:

For all of its strengths, OSM primarily has been a display map filled with an enormous amount of detail — Coast said editors will spend hours placing individual trees on boulevards. Many editors often don’t want to do the grunt work that makes maps truly useful for navigation, like filling in address data or labeling which turns are allowed at an intersection. (emphasis added)

Sam Hunting has argued for years that hobbyists, sports fans, etc., are naturals for entering data into topic maps.

Well, assuming an authoring interface with a low enough learning curve.

I went to the OpenStreetMap project, discovered an error in Covington, GA (where I live), created an account, watched a short editing tutorial and completed my first edit in about ten (10) minutes. I refreshed my browser and the correction is in place.

Future edits/corrections should be on the order of less than two minutes.

Care to name a topic map authoring interface that easy to use?

Not an entirely fair question because the geographic map provided me with a lot of unspoken context.

For example, I did not create associations between my correction and the City of Covington, Newton County, Georgia, United States, Western Hemisphere, Northern Hemisphere, Earth, or fill in types or roles for all those associations. Or remove any of the associations, types or roles that were linked to the incorrect information.

Baseball fans are reported to be fairly fanatical. But can you imagine any fan starting a topic map of baseball from scratch? I didn’t think so either. But on the other hand, what if there was an interface styled in a traditional play by play format, that allowed fans to capture games in progress? And as the game progresses, the associations and calculations on those associations (stats) are updated.

All the fan is doing is entering familiar information, allowing the topic map engine to worry about types, associations, etc.

Is that the difficulty with semantic technology interfaces?

That we require users to do more than enter the last semantic mile?
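A sketch of what entering only “the last semantic mile” might look like for the baseball case. Everything here is hypothetical, the back-end interface most of all; the point is only that the fan types play-by-play vocabulary and the engine supplies the types, associations and roles:

```java
import java.util.ArrayList;
import java.util.List;

public class PlayByPlayCapture {

  // What the fan enters: familiar play-by-play vocabulary, nothing else.
  static class Play {
    final String batter, pitcher, outcome;
    final int inning;

    Play(int inning, String batter, String pitcher, String outcome) {
      this.inning = inning;
      this.batter = batter;
      this.pitcher = pitcher;
      this.outcome = outcome;
    }
  }

  // What a (hypothetical) topic map engine does behind the scenes:
  // topics for new players, an "at-bat" association with roles, updated stats.
  interface TopicMapBackend {
    void ensureTopic(String name, String type);
    void createAssociation(String type, String... rolePlayerPairs);
  }

  static void record(Play p, TopicMapBackend tm, List<Play> log) {
    tm.ensureTopic(p.batter, "player");
    tm.ensureTopic(p.pitcher, "player");
    tm.createAssociation("at-bat",
        "batter", p.batter,
        "pitcher", p.pitcher,
        "outcome", p.outcome,
        "inning", String.valueOf(p.inning));
    log.add(p); // stats (batting average, etc.) can be recomputed from the log
  }

  public static void main(String[] args) {
    List<Play> log = new ArrayList<>();
    TopicMapBackend tm = new TopicMapBackend() {
      public void ensureTopic(String name, String type) {
        System.out.println("topic: " + name + " (" + type + ")");
      }
      public void createAssociation(String type, String... rolePlayerPairs) {
        System.out.println("association: " + type + " " + String.join(" ", rolePlayerPairs));
      }
    };
    record(new Play(3, "Batter A", "Pitcher B", "double"), tm, log);
  }
}
```

An interface like that asks no more of the fan than a scorecard does, which is the bar a topic map authoring interface has to clear.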

September 14, 2013

How to Refine and Visualize Twitter Data [Al-Qaeda Bots?]

Filed under: Hadoop,Tweets,Visualization — Patrick Durusau @ 6:34 pm

How to Refine and Visualize Twitter Data by Cheryle Custer.

From the post:

He loves me, he loves me not… using daisies to figure out someone’s feelings is so last century. A much better way to determine whether someone likes you, your product or your company is to do some analysis on Twitter feeds to get better data on what the public is saying. But how do you take thousands of tweets and process them? We show you how in our video – Understand your customers’ sentiments with Social Media Data – that you can capture a Twitter stream to do Sentiment Analysis.

(graphic omitted: Twitter sentiment visualization)

Now, when you boot up your Hortonworks Sandbox today, you’ll find Tutorial 13: Refining and Visualizing Sentiment Data as the companion step-by-step guide to the video. In this Hadoop tutorial, we will show you how you can take a Twitter stream and visualize it in Excel 2013, or you could use your own favorite visualization tool. Note you can use any version of Excel, but Excel 2013 allows you to plot the data on a map where other versions will limit you to the built-in charting function.
(…)

A great tutorial from Hortonworks as always!

My only reservation is the acceptance of Twitter data for sentiment analysis.

True, it is easy to obtain, not all that difficult to process, but that isn’t the same thing as having any connection with sentiment about a company or product.

Consider that a now somewhat dated report (2012) reported that 51% of all Internet traffic is “non-human.”

If that is the case or has worsened since then, how do you account for that in your sentiment analysis?

Or if you are monitoring the Internet for Al-Qaeda threats, how do you distinguish threats from Al-Qaeda bots from threats by Al-Qaeda members?

What if threat levels are being gamed by Al-Qaeda bot networks?

Forcing expenditure of resources on a global scale at a very small cost.

A new type of asymmetric warfare?

September 13, 2013

Legislative XML Data Mapping [$10K]

Filed under: Challenges,Contest,Law - Sources,Legal Informatics — Patrick Durusau @ 6:21 pm

Legislative XML Data Mapping (Library of Congress)

First, the important stuff:

First Place: $10K

Entry due by: December 31 at 5:00pm EST

Second, the details:

The Library of Congress is sponsoring two legislative data challenges to advance the development of international data exchange standards for legislative data. These challenges are an initiative to encourage broad participation in the development and application of legislative data standards and to engage new communities in the use of legislative data. Goals of this initiative include:
• Enabling wider accessibility and more efficient exchange of the legislative data of the United States Congress and the United Kingdom Parliament,
• Encouraging the development of open standards that facilitate better integration, analysis, and interpretation of legislative data,
• Fostering the use of open source licensing for implementing legislative data standard.

The Legislative XML Data Mapping Challenge invites competitors to produce a data map for US bill XML and the most recent Akoma Ntoso schema and UK bill XML and the most recent Akoma Ntoso schema. Gaps or issues identified through this challenge will help to shape the evolving Akoma Ntoso international standard.

The winning solution will win $10,000 in cash, as well as opportunities for promotion, exposure, and recognition by the Library of Congress. For more information about prizes please see the Official Rules.

Can you guess what tool or technique I would suggest that you use? 😉

The winner is announced February 12, 2014 at 5:00pm EST.

Too late for the holidays this year, too close to Valentine’s Day, what holiday will you be wanting to celebrate?

Scaling Apache Giraph to a trillion edges

Filed under: Facebook,Giraph,GraphLab,Hive — Patrick Durusau @ 6:03 pm

Scaling Apache Giraph to a trillion edges by Avery Ching.

From the post:

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Flight routes connect airports, computers communicate to one another via the Internet, webpages have hypertext links to navigate to other webpages, and so on. Facebook manages a social graph that is composed of people, their friendships, subscriptions, and other connections. Open graph allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y).

Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets. After the improvements described in this article, Apache Giraph provided the solution to our requirements.

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and fast access to Facebook data. Based on these requirements, we selected a few promising graph-processing platforms including Apache Hive, GraphLab, and Apache Giraph for evaluation.

For your convenience:

Apache Giraph

Apache Hive

GraphLab

Your appropriate scale is probably less than a trillion edges but everybody likes a great scaling story.

This is a great scaling story.
