Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 3, 2014

Compressed Text Indexes: From Theory to Practice!

Filed under: Compression,Indexing — Patrick Durusau @ 4:23 pm

Compressed Text Indexes: From Theory to Practice! by Paolo Ferragina, Rodrigo Gonzalez, Gonzalo Navarro, and Rossano Venturini.

Abstract:

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.

The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner’s point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.
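For readers new to the area, the baseline these self-indexes compete with is the plain (uncompressed) suffix array: fast counting queries, but several times the size of the text, which is exactly the overhead the paper's indexes avoid. A minimal Python sketch of suffix-array counting, purely for orientation and not anything from the Pizza&Chili code:

def build_suffix_array(text):
    # O(n^2 log n) toy construction; fine for illustration, not for real corpora.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # Binary search over the sorted suffixes, comparing only the first
    # len(pattern) characters of each suffix.
    def first_index(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(pattern)]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first_index(strict=True) - first_index(strict=False)

text = "abracadabra"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "abra"))  # -> 2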

A bit dated (2007) but definitely worth your attention. The “cited-by” results from the ACM Digital Library will bring you up to date.

BTW, I was pleased to find the Pizza&Chili Corpus: Compressed Indexes and their Testbeds; both the Italian and Chilean mirrors are still online!

I have seen document links survive that long but rarely an online testbed.

EC [WPA] Brain Project Update

Filed under: Artificial Intelligence,EU — Patrick Durusau @ 3:49 pm

Electronic Brain by 2023: E.U.’s Human Brain Project ramps up by R. Colin Johnson.

From the post:

The gist of the first year’s report is that all the pieces are assembled — all personnel are hired, laboratories throughout the region engaged, and the information and communications technology (ICT) is in place to allow the researchers and their more than 100 academic and corporate partners in more than 20 countries to effectively collaborate and share data. Already begun are projects that reconstruct the brain’s functioning at several different biological scales, the analysis of clinical data of diseases of the brain, and the development of computing systems inspired by the brain.

The agenda for the first two and a half years (the ramp-up phase) has also been set whereby the HBP will amass all known strategic data about brain functioning, develop theoretical frameworks that fit that data, and develop the necessary infrastructure for developing six ICT platforms during the following “operational” phase circa 2017.

“Getting ready” is a fair summary of HBP Achievements Year One.

The report fails to mention the concerns of the scientists threatening to boycott the project, but given the EC’s response to that letter, which could be summarized as “…we have decided to spend the money, get in line or get out of the way,” any further response was unlikely.

The EC Brain Project is more in line with the WPA projects of the Depression era in the United States: WPA projects were employment projects first, with their results strictly a secondary concern.

No doubt some new results will come from the EU Brain Project, simply because it isn’t possible to employ that many researchers and not have some publishable results. Particularly if self-published by the project itself.

One can only hope that the project will publish a bibliography of “all known strategic data about brain functioning” as part of its research results. Just so outsiders can gauge the development of “…theoretical frameworks that fit that data.”

One suspects that, for less than the conference and travel costs built into this project, the EC could have purchased EU-wide site licenses to most if not all European scientific publishers. That would do more to advance scientific research in the EU than attempting to duplicate the unknown.

Latency Numbers Every Programmer Should Know

Filed under: Computer Science,Programming — Patrick Durusau @ 2:05 pm

Latency Numbers Every Programmer Should Know by Jonas Bonér.

Latency numbers from “L1 cache reference” up to “Send packet CA->Netherlands->CA” and many things in between!

Latency will be with you always. 😉
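If you want those numbers close at hand for back-of-envelope arithmetic, here is a rough Python transcription of the commonly circulated table. The figures are approximate orders of magnitude, not measurements of your machine:

# Approximate latency figures in nanoseconds, from the widely circulated
# "numbers every programmer should know" table.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Compress 1 KB with Zippy": 3_000,
    "Send 1 KB over 1 Gbps network": 10_000,
    "Read 4 KB randomly from SSD": 150_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}

# Handy for quick comparisons, e.g. how many main-memory references fit
# into one transatlantic round trip:
print(LATENCY_NS["Send packet CA->Netherlands->CA"] / LATENCY_NS["Main memory reference"])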

I first saw this in a tweet by Julia Evans.

How does SQLite work? Part 2: btrees!…

Filed under: Database,SQLite — Patrick Durusau @ 1:50 pm

How does SQLite work? Part 2: btrees! (or: disk seeks are slow don’t do them!) by Julia Evans.

From the post:

Welcome back to fun with databases! In Part 1 of this series, we learned that:

  • SQLite databases are organized into fixed-size pages. I made an example database which had 1k pages.
  • The pages are all part of a kind of tree called a btree.
  • There are two kinds of pages: interior pages and leaf pages. Data is only stored in leaf pages.

I mentioned last time that I put in some print statements to tell me every time I read a page, like this:
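Julia’s actual print statements are in her post. If you want to poke at the same structures yourself, here is a small standalone Python sketch using the sqlite3 module and the documented SQLite file format; “example.db” is a placeholder for any database file you have lying around:

import sqlite3

db_path = "example.db"  # placeholder: any existing SQLite database file

# Page size and page count via documented PRAGMAs.
conn = sqlite3.connect(db_path)
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
conn.close()
print(f"{page_count} pages of {page_size} bytes each")

# B-tree page type byte values, from the SQLite file format documentation.
PAGE_TYPES = {2: "interior index", 5: "interior table",
              10: "leaf index", 13: "leaf table"}

with open(db_path, "rb") as f:
    for page_no in range(1, page_count + 1):
        f.seek((page_no - 1) * page_size)
        page = f.read(page_size)
        # Page 1 begins with the 100-byte database header; its b-tree
        # header starts at offset 100. Other page kinds (freelist,
        # overflow) fall through to "other/overflow".
        type_byte = page[100] if page_no == 1 else page[0]
        print(page_no, PAGE_TYPES.get(type_byte, "other/overflow"))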

I suspect Chris Granger would consider this “plumbing” that prevents some users from using computation.

Chris would be right, to a degree, but Julia continues to lower the barrier that “plumbing” poses to users.

Looking forward to more untangling and clarifying of SQLite plumbing!

Open Sourcing Duckling, our probabilistic (date) parser [Clojure]

Filed under: Data,Parsers,Probabilistic Models — Patrick Durusau @ 1:22 pm

Open Sourcing Duckling, our probabilistic (date) parser

From the post:

We’ve previously discussed ambiguity in natural language. What’s really fascinating is that even the simplest, seemingly most structured parts of natural language, like the way we humans describe dates and times, are actually so difficult to turn into structured data.

The wild world of temporal expressions in human language

All the following expressions describe the same point in time (at least in some contexts):

  • “December 30th, at 3 in the afternoon”
  • “The day before New Year’s Eve at 3pm”
  • “At 1500 three weeks from now”
  • “The last Tuesday of December at 3pm”

But wait… is it really equivalent to say 3pm and 1500? In the latter case, it seems that speaker meant to be more precise. Is it OK to drop this information?

And what about “next Tuesday”? If today is Monday, is that tomorrow or in 8 days? When I say “last month”, is it the last full month or the last 30 days?

A last example: “one month” looks like a well defined duration. That is, until you try to normalize durations in seconds, and you realize different months have anywhere between 28 and 31 days! Even “one day” is difficult. Yes, a day can last between 23 and 25 hours, because of daylight savings. Oh, and did I mention that at midnight at the end of 1927 in Shanghai, the clocks went back 5 minutes and 52 seconds? So “1927-12-31 23:54:08” actually happened twice there.

There are hundreds of hard things like these, and the more you dig into this, believe me, the more you’ll encounter. But that’s out of the scope of this post.
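To make the “next Tuesday” ambiguity concrete, here is a small Python sketch (nothing to do with Duckling itself, which is written in Clojure) that computes both readings from a Monday:

import datetime

def next_weekday(today, weekday, skip_tomorrow=False):
    """Date of the next given weekday (0=Monday ... 6=Sunday)."""
    delta = (weekday - today.weekday()) % 7
    if delta == 0:
        delta = 7                      # "next" never means today
    if skip_tomorrow and delta == 1:
        delta = 8                      # reading 2: skip this week's Tuesday
    return today + datetime.timedelta(days=delta)

today = datetime.date(2014, 9, 29)     # a Monday
print(next_weekday(today, 1))                      # 2014-09-30: "tomorrow"
print(next_weekday(today, 1, skip_tomorrow=True))  # 2014-10-07: "in 8 days"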

An introduction to the vagaries of date statements in natural language, a probabilistic (date) parser in Clojure, and an opportunity to extend said parser to other data types.

Nice way to end the week!

Dynamic Columns Tutorial – Part 1: Introduction

Filed under: Database,MariaDB,SQL — Patrick Durusau @ 1:01 pm

Dynamic Columns Tutorial – Part 1: Introduction by Max Mether.

From the post:

For certain situations, the static structure of tables in a relational database can be very limited. Each column is statically defined, has a pre-defined type and you can only enter a value of that type into the column. You can be creative and list multiple values in one column, but then those values are not generally easily accessed and manipulated with other functions. You have to use an API or contortions of a function like SUBSTRING() to pull out a value you want. Even then, you have to know what is contained in the column to be able to manipulate it properly. These methods can require too much manual intervention to assess and access the data contained in the column.

If you want to add columns as the information stored in your tables grows and your needs change you need to do fairly expensive ALTER TABLE operations. These have traditionally been very expensive in MySQL and MariaDB although the performance is a bit better starting with MariaDB 10.0 and MySQL 5.6.

The other option for having a flexible structure is to use something like Anchor Modeling. This allows you to have a very flexible schema as adding an attribute basically just means adding a table. The problem with this approach is that you’ll end up with a lot of tables which means a lot of joins when looking for results which can easily become un-manageable, or at least hard to manage.

This is where dynamic columns steps into the picture. A good solution to the static structure problem is to use dynamic columns provided in MariaDB. It allows flexibility within a defined structure, within a column. A Dynamic Column is defined as a BLOB on the DDL level. However, within the BLOB column, you may set arbitrarily and dynamically defined attributes and values, for a maximum of 64k.

Dynamic columns are not in isolation: The usual functions will work fine with the values contained within them. And they can be used as join points for joining to other tables as you would normally. This allows you to retain the power of Relational SQL while still maintaining flexibility with regards to your attributes for specific tables where it makes sense.
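For the curious, a minimal round trip with dynamic columns driven from Python. COLUMN_CREATE and COLUMN_GET are MariaDB’s documented dynamic column functions (not available in stock MySQL); the pymysql connection details and table name below are placeholders for your own setup:

import pymysql

# Placeholder connection settings; requires a MariaDB server.
conn = pymysql.connect(host="localhost", user="user", password="secret", database="test")
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS items (id INT PRIMARY KEY, attrs BLOB)")
    # Store arbitrary attribute/value pairs inside the BLOB column.
    cur.execute(
        "INSERT INTO items VALUES (1, COLUMN_CREATE('color', 'blue', 'size', 42))"
    )
    # Pull a single attribute back out, typed.
    cur.execute("SELECT COLUMN_GET(attrs, 'size' AS INTEGER) FROM items WHERE id = 1")
    print(cur.fetchone()[0])   # -> 42
conn.commit()
conn.close()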

Probably channeling topic maps when I observe that dynamic columns are associating multiple properties with a subject. 😉

Very interested in seeing how joins are performed using dynamic columns, but that awaits in a future post.

I first saw this in a tweet by MariaDB.

Beyond Light Table

Filed under: Computer Science,Interface Research/Design,Programming,Transparency — Patrick Durusau @ 10:38 am

Beyond Light Table by Chris Granger.

From the post:

I have three big announcements to make today. The first is the official announcement of our next project. We’ve been quietly talking about it over the past few months, but today we want to tell you a bit more about it and finally reveal its name:

eve

Eve is our way of bringing the power of computation to everyone, not by making everyone a programmer but by finding a better way for us to interact with computers. On the surface, Eve is an environment a little like Excel that allows you to “program” simply by moving columns and rows around in tables. Under the covers it’s a powerful database, a temporal logic language, and a flexible IDE that allows you to build anything from a simple website to complex algorithms. Instead of poring over text files full of abstract symbols, you interact with domain editors that are parameterized by grids of data. To build a UI you don’t open a text editor, you just draw it on the screen and drag data to it. It’s much closer to the ideal we’ve always had of just describing what we want and letting the machine do the rest. Eve makes the computer a real tool again – one that doesn’t require decades of training to use.

Imagine a world where everyone has access to computation without having to become a professional programmer – where a scientist doesn’t have to rely on the one person in the lab who knows python, where a child could come up with an idea for a game and build it in a couple of weekends, where your computer can help you organize and plan your wedding/vacation/business. A world where programmers could focus on solving the hard problems without being weighed down by the plumbing. That is the world we want to live in. That is the world we want to help create with Eve.

We’ve found our way to that future by studying the past and revisiting some of the foundational ideas of computing. In those ideas we discovered a simpler way to think about computation and have used modern research to start making it into reality. That reality will be an open source platform upon which anyone can explore and contribute their own ideas.

Chris goes on to announce that they have raised more money and are looking to make one or more new hires.

Exciting news and I applaud viewing computers as tools, not as oracles that perform operations on data beyond our ken and deliver answers.

Except that easy access to computation doesn’t guarantee useful results. Consider the case of automobiles: easy access to complex machines results in roughly 37,000 deaths and 2.35 million injuries each year in the United States alone.

Easy access to computers for word processing, email, blogging, webpages, Facebook, etc., hasn’t resulted in a single Shakespearean sonnet, much less the complete works of Shakespeare.

Just as practically, when I am dragging and dropping, how do I distinguish between success on the iris dataset and success on a dataset with missing values, which can make a significant difference in the results?

I am not a supporter of using artificial barriers to exclude people from making use of computation but on the other hand, what weight should be given to their “results?”

As “computation” spreads will “verification of results” become a new discipline in CS?

October 2, 2014

XSLT 3.0 Draft – Saxon 9.6

Filed under: Saxon,XSLT — Patrick Durusau @ 7:24 pm

I saw a tweet by Michael Kay announcing:

XSL Transformations (XSLT) Version 3.0 – W3C Last Call Working Draft 2 October 2014,

and,

Saxon 9.6 released!

Now that is a great Thursday!

PS: Deadline for comments on the working draft is 26 November 2014.

Consolidated results of the public Webizen survey

Filed under: W3C — Patrick Durusau @ 7:12 pm

Consolidated results of the public Webizen survey by Coralie Mercier.

You may recall my post: A Greater Voice for Individuals in W3C – Tell Us What You Would Value [Deadline: 30 Sept 2014]. The W3C had a survey on what would attract individuals to the W3C.

Out of approximately 7,264,774,500 potential voters (the world population as of 19:53 EST), Coralie reports that 205 answers were received.

Even allowing for those too young, too infirm, in prison, stranded in reality shows, that is a very poor showing.

Some of the answers, at least to me, appear to be spot on. But given the response rate, it will be hard to reach even irrational conclusions about individuals at the W3C. A Ouija board would have required less technology and been about as accurate.

If you don’t want a voice at the W3C for individuals, not filing a survey response was a step in that direction. Congratulations.

PS: I don’t mean to imply that the W3C would listen to N number of survey responses saying X. Maybe yes and maybe no. But non-participation saves them from having to make that choice.

Clojure and Emacs without Cider

Filed under: Clojure,Editor — Patrick Durusau @ 4:54 pm

Clojure and Emacs without Cider by Martin Trojer.

From the post:

I’ve been hacking Clojure for many years now, and I’ve been happy to rekindle my love for Emacs. The Clojure/Emacs tool-chain has come a long way during this time: swank-clojure, nREPL, nrepl.el and now Cider. The feature list is ever growing, and every time you look there is some new awesome shortcut that will ‘make your day’.

However, the last couple of months have been rough for the Cider project. I’ve experienced lots of instability, crashes and hung UIs. Cider has become very complex and is starting to feel bloated. I went from Visual Studio to the simpler & snappier Emacs for a reason, and there is a part of me that feels concerned that Cider is ‘re-inventing’ an IDE inside Emacs. If you want a full Clojure/IDE experience with all the bells and whistles, check out Cursive Clojure; it’s awesome.

In this post I’ll describe a simpler Emacs/Clojure setup that I’ve been using for the last couple of weeks. It’s much closer to ‘vanilla Emacs’ and thus has far fewer features. On the flip side, it’s very fast and super stable.

“…very fast and super stable.” That sounds good to me!

As Hesse would say: “Not for Everyone.”

The Early Development of Programming Languages

Filed under: Computer Science,Programming — Patrick Durusau @ 4:29 pm

The Early Development of Programming Languages by Donald E. Knuth and Luis Trabb Pardo.

A survey of the first ten (10) years of “high level” computer languages, ending in 1957 and written largely from unpublished materials.

If you want to find a “new” idea, there are few better places to start than with this paper.

Enjoy!

I first saw this in a tweet by JD Maturen.

Readings in Databases

Filed under: Computer Science,Database — Patrick Durusau @ 4:14 pm

Readings in Databases by Reynold Xin.

From the webpage:

A list of papers essential to understanding databases and building new data systems. The list is curated and maintained by Reynold Xin (@rxin)

Not a comprehensive list but it is an annotated one, which should enable you to make better choices.

Concludes with reading lists from several major computer science programs.

NCBI webinar on E-Utilities October 15th

Filed under: Bioinformatics,Medical Informatics — Patrick Durusau @ 3:58 pm

NCBI webinar on E-Utilities October 15th

From the post:

On October 15th, NCBI will have a webinar entitled “An Introduction to NCBI’s E-Utilities, an NCBI API.” E-Utilities is a tool to assist programmers in accessing, searching and retrieving a wide variety of data from NCBI servers.

This presentation will introduce you to the Entrez Programming Utilities (E-Utilities), the public API for the NCBI Entrez system that includes 40 databases such as Pubmed, PMC, Gene, Genome, GEO and dbSNP. After covering the basic functions and URL syntax of the E-utilities, we will then demonstrate these functions using Entrez Direct, a set of UNIX command line programs that allow you to incorporate E-utility calls easily into simple shell scripts.

Click here to register.

Thought you might find this interesting for populating topic maps out of NCBI servers.
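As a taste of what the webinar will cover, here is a bare-bones Python sketch against the public E-utilities endpoints (URL patterns as documented by NCBI; the search term and PMID are just examples):

import urllib.parse
import urllib.request

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# ESearch: get PubMed IDs matching a query.
params = urllib.parse.urlencode({"db": "pubmed", "term": "topic maps", "retmax": 5})
with urllib.request.urlopen(f"{BASE}/esearch.fcgi?{params}") as resp:
    print(resp.read().decode()[:500])   # XML with an <IdList> of PMIDs

# EFetch: retrieve records for specific IDs (the PMID here is an example).
params = urllib.parse.urlencode(
    {"db": "pubmed", "id": "11748933", "rettype": "abstract", "retmode": "text"}
)
with urllib.request.urlopen(f"{BASE}/efetch.fcgi?{params}") as resp:
    print(resp.read().decode()[:500])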

Data Auditing and Contamination in Genome Databases

Filed under: Data Auditing,Data Contamination,Genome,Genomics — Patrick Durusau @ 3:50 pm

Contamination of genome databases highlights the need for data audit trails.

Consider:

Abundant Human DNA Contamination Identified in Non-Primate Genome Databases by Mark S. Longo, Michael J. O’Neill, Rachel J. O’Neill (rachel.oneill@uconn.edu). (Longo MS, O’Neill MJ, O’Neill RJ (2011) Abundant Human DNA Contamination Identified in Non-Primate Genome Databases. PLoS ONE 6(2): e16410. doi:10.1371/journal.pone.0016410) (herein, Longo).

During routine screens of the NCBI databases using human repetitive elements we discovered an unlikely level of nucleotide identity across a broad range of phyla. To ascertain whether databases containing DNA sequences, genome assemblies and trace archive reads were contaminated with human sequences, we performed an in depth search for sequences of human origin in non-human species. Using a primate specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio) with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise as well as present data that gives insight as to how this may be occurring.

Mining of public sequencing databases supports a non-dietary origin for putative foreign miRNAs: underestimated effects of contamination in NGS. by Tosar JP, Rovira C, Naya H, Cayota A. (RNA. 2014 Jun;20(6):754-7. doi: 10.1261/rna.044263.114. Epub 2014 Apr 11.)

The report that exogenous plant miRNAs are able to cross the mammalian gastrointestinal tract and exert gene-regulation mechanism in mammalian tissues has yielded a lot of controversy, both in the public press and the scientific literature. Despite the initial enthusiasm, reproducibility of these results was recently questioned by several authors. To analyze the causes of this unease, we searched for diet-derived miRNAs in deep-sequencing libraries performed by ourselves and others. We found variable amounts of plant miRNAs in publicly available small RNA-seq data sets of human tissues. In human spermatozoa, exogenous RNAs reached extreme, biologically meaningless levels. On the contrary, plant miRNAs were not detected in our sequencing of human sperm cells, which was performed in the absence of any known sources of plant contamination. We designed an experiment to show that cross-contamination during library preparation is a source of exogenous RNAs. These contamination-derived exogenous sequences even resisted oxidation with sodium periodate. To test the assumption that diet-derived miRNAs were actually contamination-derived, we sought in the literature for previous sequencing reports performed by the same group which reported the initial finding. We analyzed the spectra of plant miRNAs in a small RNA sequencing study performed in amphioxus by this group in 2009 and we found a very strong correlation with the plant miRNAs which they later reported in human sera. Even though contamination with exogenous sequences may be easy to detect, cross-contamination between samples from the same organism can go completely unnoticed, possibly affecting conclusions derived from NGS transcriptomics.

Whether the contamination of these databases is significant or not is a matter for debate. See the comments to Longo.

Even if errors are “easy to spot,” the question remains for both users and curators of these databases: how do you provide data auditing for corrections/updates?

At a minimum, one would expect to know:

  • The database/dataset values as of any given date
  • When values changed
  • What values changed
  • Who changed those values
  • On what basis the changes were made
  • Comments on the changes
  • Links to literature concerning the changes
  • Whether the “audit” trail preserves both the original and new values

If there is no “audit” trail, on what basis would I “trust” the data on a particular date?
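Reading the checklist above as a record schema, here is a minimal Python sketch of what one audit entry might carry. The field names and the identifier are mine, purely illustrative, not any repository’s actual format:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AuditEntry:
    record_id: str               # which database/dataset value changed
    changed_at: datetime         # when the value changed
    old_value: str               # original value
    new_value: str               # new value
    changed_by: str              # who made the change
    basis: str                   # on what basis the change was made
    comments: str = ""           # free-text discussion of the change
    literature: list = field(default_factory=list)  # links to relevant papers

entry = AuditEntry(
    record_id="NCBI:trace/XYZ",  # hypothetical identifier
    changed_at=datetime(2014, 10, 2),
    old_value="contaminated AluY read",
    new_value="record withdrawn",
    changed_by="curator@example.org",
    basis="Longo et al. 2011, PLoS ONE 6(2): e16410",
    literature=["http://dx.doi.org/10.1371/journal.pone.0016410"],
)
print(entry)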

Suggestions on current correction practices?

I first saw this in a post by Mick Watson.

Data Blog Aggregation – Coffeehouse

Filed under: Data,Data Management,Digital Library — Patrick Durusau @ 10:43 am

Coffeehouse

From the about page:

Coffeehouse aggregates posts about data management from around the internet.

The idea for this site draws inspiration from other aggregators such as Ecobloggers and R-Bloggers.

Coffeehouse is a project of DataONE, the Data Observation Network for Earth.

Posts are lightly curated. That is, all posts are brought in, but if we see posts that aren’t on topic, we take them down from this blog. They are not of course taken down from the original poster, just this blog.

Recently added data blogs:

Archive and Data Management Training Center

We believe that the character and structure of the social science research environment determines attitudes to re-use.

We also believe a healthy research environment gives researchers incentives to confidently create re-usable data, and for data archives and repositories to commit to supporting data discovery and re-use through data enhancement and long-term preservation.

The purpose of our center is to ensure excellence in the creation, management, and long-term preservation of research data. We promote the adoption of standards in research data management and archiving to support data availability, re-use, and the repurposing of archived data.

Our desire is to see the European research area producing quality data with wide and multipurpose re-use value. By supporting multipurpose re-use, we want to help researchers, archives and repositories realize the intellectual value of public investment in academic research. (From the “about” page for the Archive and Data Management Training Center website but representative of the blog as well)

Data Ab Initio

My name is Kristin Briney and I am interested in all things relating to scientific research data.

I have been in love with research data since working on my PhD in Physical Chemistry, when I preferred modeling and manipulating my data to actually collecting it in the lab (or, heaven forbid, doing actual chemistry). This interest in research data led me to a Master’s degree in Information Studies where I focused on the management of digital data.

This blog is something I wish I had when I was a practicing scientist: a resource to help me manage my data and navigate the changing landscape of research dissemination.

Digital Library Blog (Stanford)

The latest news and milestones in the development of Stanford’s digital library–including content, new services, and infrastructure development.

Dryad News and Views

Welcome to Dryad news and views, a blog about news and events related to the Dryad digital repository. Subscribe, comment, contribute– and be sure to Publish Your Data!

Dryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. Any journal or publisher that wishes to encourage data archiving may refer authors to Dryad. Dryad welcomes data submissions related to any published, or accepted, peer reviewed scientific and medical literature, particularly data for which no specialized repository exists.

Journals can support and facilitate their authors’ data archiving by implementing “submission integration,” by which the journal manuscript submission system interfaces with Dryad. In a nutshell: the journal sends automated notifications to Dryad of new manuscripts, which enables Dryad to create a provisional record for the article’s data, thereby streamlining the author’s data upload process. The published article includes a link to the data in Dryad, and Dryad links to the published article.

The Dryad documentation site provides complete information about Dryad and the submission integration process.

Dryad staff welcome all inquiries. Thank you.

<tamingdata/>

The data deluge refers to the increasingly large and complex data sets generated by researchers that must be managed by their creators with “industrial-scale data centres and cutting-edge networking technology” (Nature 455) in order to provide for use and re-use of the data.

The lack of standards and infrastructure to appropriately manage this (often tax-payer funded) data requires data creators, data scientists, data managers, and data librarians to collaborate in order to create and acquire the technology required to provide for data use and re-use.

This blog is my way of sorting through the technology, management, research and development that have come together to successfully solve the data deluge. I will post and discuss both current and past R&D in this area. I welcome any comments.

There are fourteen (14) data blogs to date feeding into Coffeehouse. Unlike some data blog aggregations, ads do not overwhelm content at Coffeehouse.

If you have a data blog, please consider adding it to Coffeehouse. Suggest that other data bloggers do the same.

October 1, 2014

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox

Filed under: HPC,Interface Research/Design,Machine Learning,Modeling,Velox — Patrick Durusau @ 8:25 pm

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox by Daniel Crankshaw, et al.

Abstract:

To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to support training complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the deployment and serving of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of recommending products, targeting advertisements, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at “Big Data” scale.

Early Warning: Alpha code drop expected December 2014.

If you want to get ahead of the curve I suggest you start reading this paper soon. Very soon.

Written from the perspective of end-user facing applications but applicable to author-facing applications for real time interaction with subject identification.

Integrating Kafka and Spark Streaming: Code Examples and State of the Game

Filed under: Avro,Kafka,Spark — Patrick Durusau @ 7:55 pm

Integrating Kafka and Spark Streaming: Code Examples and State of the Game by Michael G. Noll.

From the post:

Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization.

In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.

If mid-week is when you like to brush up on emerging technologies, Michael’s post is a good place to start.

The post is well organized and has enough notes, asides and references to enable you to duplicate the example and to expand your understanding of Kafka and Spark Streaming.
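Michael’s examples are in Scala and cover the Spark Streaming side as well. For a Python reader who just wants to see the Kafka half, a bare-bones produce/consume round trip with the kafka-python package looks roughly like this (broker address and topic are placeholders; no Spark or Avro involved):

from kafka import KafkaConsumer, KafkaProducer

# Write a message to a topic (broker address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from python")
producer.flush()

# Read it back, starting from the earliest available offset.
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop iterating once the topic goes idle
)
for message in consumer:
    print(message.value)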

The Case for HTML Word Processors

Filed under: HTML,Software,Word Processing — Patrick Durusau @ 5:07 pm

The Case for HTML Word Processors by Adam Hyde.

From the post:

Making a case for HTML editors as stealth Desktop Word Processors…the strategy has been so stealthy that not even the developers realised what they were building.

We use all these over-complicated softwares to create Desktop documents. Microsoft Word, LibreOffice, whatever you like – we know them. They are one of the core apps in any user’s operating system. We also know that they are slow, unwieldy and have lots of quirky ways of doing things. However most of us just accept that this is the way it is and we try not to bother ourselves by noticing just how awful these softwares actually are.

So, I think it might be interesting to ask just this simple question – what if we used Desktop HTML Editors instead of Word Processors to do Word Processing? It might sound like an irrational proposition…Word Processors are, after all, for Word Processing. HTML editors are for creating…well, …HTML. But let’s just forget that. What if we could allow ourselves to imagine we used an HTML editor for all our word processing needs and HTML replaces .docx and .odt and all those other over-burdened word processing formats. What do we win and what do we lose?

I’m not convinced about HTML word processors but Adam certainly starts with the right question:

What do we win and what do we lose? (emphasis added)

Line your favorite word processing format up alongside HTML + CSS and calculate the wins and losses.

Not that HTML word processors can, should or will replace complex typography when appropriate, but how many documents need the full firepower of a modern word processor?

I would ask a similar question about authoring interfaces for topic maps. What is the least interface that can usefully produce a topic map?

The full bells-and-whistles versions are common now (I omit naming names) but should those be the only choices?

PS: As far as MS Word goes, I use “open,” “close,” “save,” “copy,” “paste,” “delete,” “hyperlink,” “bold,” and “italic.” What’s that? Nine operations? Your experience may vary. 😉

I use LaTeX and another word processing application for most of my writing off the Web.

I first saw this in a tweet by Ivan Herman.

FOAM (Functional Ontology Assignments for Metagenomes):…

Filed under: Bioinformatics,Genomics,Ontology — Patrick Durusau @ 4:43 pm

FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus by Emmanuel Prestat, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku702)

Abstract:

A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.

Aside from its obvious importance for genomics and bioinformatics, I mention this because the authors point out:

A caveat of this approach is that we did not consider the quality of the tree in the tree-splitting step (i.e. weakly supported branches were equally treated as strongly supported ones), producing models of different qualities. Nevertheless, we decided that the approach of rational classification is better than no classification at all. In the future, the groups could be recomputed, or split more optimally when more data become available (e.g. more KOs). From each cluster related to the KO in process, we extracted the alignment from which HMMs were eventually built.

I take that to mean that this “ontology” represents no unchanging ground truth but rather an attempt to enhance the “…screening of environmental metagenomic and metatranscriptomic sequence datasets for functional genes.”

As more information is gained, the present “ontology” can and will change. Those future changes create the necessity to map those changes and the facts that drove them.

I first saw this in a tweet by Jonathan Eisen.

Continuum Analytics Releases Anaconda 2.1

Filed under: Anaconda,BigData,Python — Patrick Durusau @ 4:18 pm

Continuum Analytics Releases Anaconda 2.1 by Corinna Bahr.

From the post:

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its free, enterprise-ready collection of libraries for Python.

Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 2.1, adds a new version of the Anaconda Launcher and PyOpenSSL, as well as updates NumPy, Blaze, Bokeh, Numba, and 50 other packages.

Available on Windows, Mac OS X and Linux, Anaconda includes more than 195 of the most popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with a single integrated and flexible installer. It also allows for the mixing and matching of different versions of Python (2.6, 2.7, 3.3, 3.4), NumPy, SciPy, etc., and the ability to easily switch between these environments.

See the post for more details, check the change log, or, what the hell, download the most recent version of Anaconda.

Remember, it’s open source so you can see “…where it keeps its brain.” Be wary of results based on software that operates behind a curtain.

BTW, check out the commercial services and products from Continuum Analytics if you need even more firepower for your data processing.

Uncovering Community Structures with Initialized Bayesian Nonnegative Matrix Factorization

Filed under: Bayesian Data Analysis,Matrix,Social Graphs,Social Networks,Subgraphs — Patrick Durusau @ 3:28 pm

Uncovering Community Structures with Initialized Bayesian Nonnegative Matrix Factorization by Xianchao Tang, Tao Xu, Xia Feng, and Guoqing Yang.

Abstract:

Uncovering community structures is important for understanding networks. Currently, several nonnegative matrix factorization algorithms have been proposed for discovering community structure in complex networks. However, these algorithms exhibit some drawbacks, such as unstable results and inefficient running times. In view of the problems, a novel approach that utilizes an initialized Bayesian nonnegative matrix factorization model for determining community membership is proposed. First, based on singular value decomposition, we obtain simple initialized matrix factorizations from approximate decompositions of the complex network’s adjacency matrix. Then, within a few iterations, the final matrix factorizations are achieved by the Bayesian nonnegative matrix factorization method with the initialized matrix factorizations. Thus, the network’s community structure can be determined by judging the classification of nodes with a final matrix factor. Experimental results show that the proposed method is highly accurate and offers competitive performance to that of the state-of-the-art methods even though it is not designed for the purpose of modularity maximization.

Some titles grab you by the lapels and say, “READ ME!,” don’t they? 😉

I found the first paragraph a much friendlier summary of why you should read this paper (footnotes omitted):

Many complex systems in the real world have the form of networks whose edges are linked by nodes or vertices. Examples include social systems such as personal relationships, collaborative networks of scientists, and networks that model the spread of epidemics; ecosystems such as neuron networks, genetic regulatory networks, and protein-protein interactions; and technology systems such as telephone networks, the Internet and the World Wide Web [1]. In these networks, there are many sub-graphs, called communities or modules, which have a high density of internal links. In contrast, the links between these sub-graphs have a fairly lower density [2]. In community networks, sub-graphs have their own functions and social roles. Furthermore, a community can be thought of as a general description of the whole network to gain more facile visualization and a better understanding of the complex systems. In some cases, a community can reveal the real world network’s properties without releasing the group membership or compromising the members’ privacy. Therefore, community detection has become a fundamental and important research topic in complex networks.

If you think of “the real world network’s properties” as potential properties for identification of a network as a subject or as properties of the network as a subject, the importance of this article becomes clearer.

Being able to speak of sub-graphs as subjects with properties can only improve our ability to compare sub-graphs across complex networks.
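To make the idea concrete without the paper’s Bayesian machinery, here is a plain NMF community-detection sketch in Python using networkx and scikit-learn; scikit-learn’s “nndsvd” initialization is itself SVD-based, loosely echoing the paper’s initialization step. It is an illustration of the general approach, not the authors’ algorithm:

import networkx as nx
import numpy as np
from sklearn.decomposition import NMF

G = nx.karate_club_graph()                   # classic two-community test network
A = nx.to_numpy_array(G)                     # adjacency matrix (nonnegative)

# Factor the adjacency matrix; W holds node-by-community weights.
model = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(A)

communities = W.argmax(axis=1)               # assign each node to its strongest factor
for c in range(2):
    print(f"community {c}: {sorted(np.where(communities == c)[0].tolist())}")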

BTW, all the data used in this article is available for downloading: http://dx.doi.org/10.6084/m9.figshare.1149965

I first saw this in a tweet by Brian Keegan.
