Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 15, 2014

Distributed Environments and VirtualBox

Filed under: Distributed Computing,Distributed Systems,Virtual Machines — Patrick Durusau @ 10:35 am

While writing about Distributed LIBLINEAR (the following post), I discovered two guides to creating distributed environments with VirtualBox.

I mention that fact in the other post but thought the use of VirtualBox to create distributed environments needed more visibility than a mention.

The guides are:

MPI LIBLINEAR – VirtualBox Guide

Spark LIBLINEAR – VirtualBox Guide

and you will need to refer to the original site, Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments, for information on using those environments with “Distributed LIBLINEAR.”

VirtualBox brings research on and using distributed systems within the reach of anyone with reasonable computing resources.

Please drop me a note if you are using VirtualBox to create distributed systems for topic map processing.

Distributed LIBLINEAR:

Filed under: Machine Learning,MPI,Spark,Virtual Machines — Patrick Durusau @ 10:23 am

Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments

From the webpage:

MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR. Currently only two solvers are supported:

  • L2-regularized logistic regression (LR)
  • L2-regularized L2-loss linear SVM

NOTICE: This extension can only run on Unix-like systems. (We test it on Ubuntu 13.10.) Python and Matlab interfaces are not supported.

Spark LIBLINEAR is a Spark implementation based on LIBLINEAR and integrated with Hadoop distributed file system. This package is developed using Scala. Currently it supports the same two solvers as MPI LIBLINEAR.

If you are unfamiliar with LIBLINEAR:

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports

  • L2-regularized classifiers
    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
  • L1-regularized classifiers (after version 1.4)
    L2-loss linear SVM and logistic regression (LR)
  • L2-regularized support vector regression (after version 1.9)
    L2-loss linear SVR and L1-loss linear SVR.

Main features of LIBLINEAR include

  • Same data format as LIBSVM, our general-purpose SVM solver, and also similar usage
  • Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
  • Cross validation for model selection
  • Probability estimates (logistic regression only)
  • Weights for unbalanced data
  • MATLAB/Octave, Java, Python, Ruby interfaces
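
For a hands-on feel of the two solvers before going distributed, scikit-learn wraps LIBLINEAR, so a rough single-node sketch looks like this (toy data and parameters of my choosing, not anything from the projects above):

```python
# Sketch: single-node analogues of the two Distributed LIBLINEAR solvers,
# using scikit-learn's LIBLINEAR-backed estimators. Data and parameters
# are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                          # toy feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy labels

# L2-regularized logistic regression (LR)
lr = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# L2-regularized L2-loss linear SVM
svm = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0).fit(X, y)

print("LR accuracy:", lr.score(X, y))
print("SVM accuracy:", svm.score(X, y))
```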

You will also find instructions for creating distributed environments using VirtualBox for both MPI LIBLINEAR and Spark LIBLINEAR. I am going to post on that separately to draw attention to it.

The phrase “standalone computer” is rapidly becoming a misnomer. Forward-looking algorithm designers and power users will gain experience with the new distributed “normal” at every opportunity.

I first saw this in a tweet by Reynold Xin.

May 14, 2014

Lisp: Common Lisp, Racket, Clojure, Emacs Lisp

Filed under: Clojure,Lisp,Programming — Patrick Durusau @ 8:04 pm

Lisp: Common Lisp, Racket, Clojure, Emacs Lisp

A “hyperpolyglot” (although technically only a quadglot): a side-by-side reference sheet for Common Lisp, Racket, Clojure, and Emacs Lisp.

One of the Hexaglot versions of the Bible included: Old Testament in Hebrew, Greek, Latin, English, German and French; New Testament in Greek, Syriac, Latin, English, German and French.

For another interesting example of analog information retrieval, see: Complutensian Polyglot Bible

Note that the location of the parallel texts meant the reader did not lose their original context when consulting another text, unlike hyperlinks, which take a reader away from the current resource.

Just out of curiosity I backed up the URL and found: Hyperpolyglot.

Which includes side by side references for:

Programming Languages

commonly used features in a side-by-side format

Interpreted Languages: JavaScript, PHP, Python, Ruby
More Interpreted Languages: Perl, Tcl, Lua, Groovy
C++ Style Languages: C++, Objective-C, Java, C#
Languages in the Key of C: C, Go
Pascal Style Languages: Pascal, Ada, PL/pgSQL
Lisp Dialects: Common Lisp, Racket, Clojure, Emacs Lisp
ML Dialects and Friends: OCaml, F#, Scala, Haskell
Prolog and Erlang: Prolog, Erlang
Stack-Oriented Languages: Forth, PostScript, Factor
Operating System Automation: POSIX Shell, AppleScript, PowerShell
Relational Data Languages: SQL, Awk, Pig
Numerical Analysis & Statistics: MATLAB, R, NumPy and Fortran
Computer Algebra Software: Mathematica, SymPy, Pari/GP

Programming Tools

Unix Shells: Bash, Fish, Ksh, Tcsh, Zsh
Text Mode Editors: Vim, Emacs, Nano
Version Control: Git, Mercurial
Build Tools: Make, Rake, Ant
Terminal Multiplexers: Screen, Tmux
Databases: PostgreSQL, MySQL, SQLite, Redis, MongoDB, Neo4j
Markup: Markdown, reStructuredText, MediaWiki, Wikidot, LaTeX
2D Vector Graphics: PostScript, Processing, SVG
Mathematical Notation: LaTeX, Mathematica, HTML Entities, Unicode

Of course, one downside to such a listing is that it would be difficult to supplement the information given without manually editing the tables.

CrossClj

Filed under: Clojure,Indexing,Interface Research/Design,Programming — Patrick Durusau @ 7:38 pm

CrossClj: cross-referencing the clojure ecosystem

From the webpage:

CrossClj is a tool to explore the interconnected Clojure universe. As an example, you can find all the usages of the reduce function across all projects, or find all the functions called map. Or you can list all the projects using ring. You can also walk the source code across different projects.

Interesting search interface. You could lose some serious time just reading the project names. 😉

It makes me curious about the potential of listing functions and treating other functions/operators in their scope as facets.

Enjoy!

The history of the vector space model

Filed under: Similarity,Vector Space Model (VSM) — Patrick Durusau @ 7:04 pm

The history of the vector space model by Suresh Venkatasubramanian.

From the post:

Gerard Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled ‘The Most Influential Paper Gerard Salton Never Wrote‘. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn’t actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton’s work.

Suresh summarizes some of the discoveries made by Dubin in his post, but this sounds like an interesting research project: take Dubin’s article as a starting point and follow the development of the vector space model.

Particularly since it is used so often for “similarity.” Understanding the mathematics is good; understanding how that particular model was arrived at would be even better.
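
For readers who have not met the model, here is a toy sketch of the “document as a vector of keywords” idea with cosine similarity (the standard calculation only, nothing specific to Dubin’s or Salton’s papers):

```python
# Toy sketch: documents as term-count vectors, compared with cosine similarity.
import math
from collections import Counter

def cosine(doc_a, doc_b):
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine("salton vector space model", "the vector space model of salton"))
```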

Enjoy!

Spy On Your CPU

Filed under: Linux OS,Performance,Programming — Patrick Durusau @ 3:45 pm

I can spy on my CPU cycles with perf! by Julia Evans.

From the post:

Yesterday I talked about using perf to profile assembly instructions. Today I learned how to make flame graphs with perf and it is THE BEST. I found this because Graydon Hoare pointed me to Brendan Gregg’s excellent page on how to use perf.

Julia is up to her elbows in her CPU.

You can throw hardware at a problem or you can tune the program you are running on hardware.

Julia’s posts are about the latter.

Feathers, Gossip and the European Union Court of Justice (ECJ)

Filed under: EU,Privacy,Search Engines — Patrick Durusau @ 2:52 pm

It is a common comment that the United States Supreme Court has difficulty with technology issues. Not terribly surprising since digital technology evolves several orders of magnitude faster than legal codes and customs.

But even if judicial digital illiteracy isn’t surprising, judicial theological illiteracy should be.

I am referring, of course, to the recent opinion by the European Court of Justice that there is a right to be “forgotten” in the records of the search giant Google.

In the informal press release about its decision, the ECJ states:

Finally, in response to the question whether the directive enables the data subject to request that links to web pages be removed from such a list of results on the grounds that he wishes the information appearing on those pages relating to him personally to be ‘forgotten’ after a certain time, the Court holds that, if it is found, following a request by the data subject, that the inclusion of those links in the list is, at this point in time, incompatible with the directive, the links and information in the list of results must be erased. The Court observes in this regard that even initially lawful processing of accurate data may, in the course of time, become incompatible with the directive where, having regard to all the circumstances of the case, the data appear to be inadequate, irrelevant or no longer relevant, or excessive in relation to the purposes for which they were processed and in the light of the time that has elapsed. The Court adds that, when appraising such a request made by the data subject in order to oppose the processing carried out by the operator of a search engine, it should in particular be examined whether the data subject has a right that the information in question relating to him personally should, at this point in time, no longer be linked to his name by a list of results that is displayed following a search made on the basis of his name. If that is the case, the links to web pages containing that information must be removed from that list of results, unless there are particular reasons, such as the role played by the data subject in public life, justifying a preponderant interest of the public in having access to the information when such a search is made. (The press release version, The official judgement).

Which doesn’t sound unreasonable, particularly if you are a theological illiterate.

One contemporary retelling of a story about St. Philip Neri goes as follows:

The story is often told of the most unusual penance St. Philip Neri assigned to a woman for her sin of spreading gossip. The sixteenth-century saint instructed her to take a feather pillow to the top of the church bell tower, rip it open, and let the wind blow all the feathers away. This probably was not the kind of penance this woman, or any of us, would have been used to!

But the penance didn’t end there. Philip Neri gave her a second and more difficult task. He told her to come down from the bell tower and collect all the feathers that had been scattered throughout the town. The poor lady, of course, could not do it-and that was the point Philip Neri was trying to make in order to underscore the destructive nature of gossip. When we detract from others in our speech, our malicious words are scattered abroad and cannot be gathered back. They continue to dishonor and divide many days, months, and years after we speak them as they linger in people’s minds and pass from one tale-bearer to the next. (From The Feathers of Gossip: How our Words can Build Up or Tear Down by Edward P. Sri)*

The problem with “forgetting” is the same one as the gossip penitent. Information is copied and replicated by sites for their own purposes. Nothing Google can do will impact those copies. Even if Google removes all of its references from a particular source, the information could be re-indexed in the future from new sources.

This decision is a “feel good” one for privacy advocates. But the ECJ should have recognized the gossip folktale parallel and decided that effective relief is impossible. Ordering an impossible solution diminishes the stature of the court and the seriousness with which its decisions are regarded.

Not to mention the burden this will place on Google and other search result providers, with no guarantee that the efforts will be successful.

Sometimes the best solution is to simply do nothing at all.

* There isn’t a canonical form for this folktale, which has been told and re-told by many cultures.

Is PDF the Problem?

Filed under: Bibliometrics,PDF — Patrick Durusau @ 1:39 pm

The solutions to all our problems may be buried in PDFs that nobody reads by Christopher Ingraham.

From the post:

What if someone had already figured out the answers to the world’s most pressing policy problems, but those solutions were buried deep in a PDF, somewhere nobody will ever read them?

According to a recent report by the World Bank, that scenario is not so far-fetched. The bank is one of those high-minded organizations — Washington is full of them — that release hundreds, maybe thousands, of reports a year on policy issues big and small. Many of these reports are long and highly technical, and just about all of them get released to the world as a PDF report posted to the organization’s Web site.

The World Bank recently decided to ask an important question: Is anyone actually reading these things? They dug into their Web site traffic data and came to the following conclusions: Nearly one-third of their PDF reports had never been downloaded, not even once. Another 40 percent of their reports had been downloaded fewer than 100 times. Only 13 percent had seen more than 250 downloads in their lifetimes. Since most World Bank reports have a stated objective of informing public debate or government policy, this seems like a pretty lousy track record.

I’m not so sure that the PDF format, annoying as it can be, lies at the heart of non-reading of World Bank reports.

Consider Rose Eveleth’s recent (2014) Academics Write Papers Arguing Over How Many People Read (And Cite) Their Papers.

Eveleth writes:

There are a lot of scientific papers out there. One estimate puts the count at 1.8 million articles published each year, in about 28,000 journals. Who actually reads those papers? According to one 2007 study, not many people: half of academic papers are read only by their authors and journal editors, the study’s authors write.

But not all academics accept that they have an audience of three. There’s a heated dispute around academic readership and citation—enough that there have been studies about reading studies going back for more than two decades.

In the 2007 study, the authors introduce their topic by noting that “as many as 50% of papers are never read by anyone other than their authors, referees and journal editors.” They also claim that 90 percent of papers published are never cited. Some academics are unsurprised by these numbers. “I distinctly remember focusing not so much on the hyper-specific nature of these research topics, but how it must feel as an academic to spend so much time on a topic so far on the periphery of human interest,” writes Aaron Gordon at Pacific Standard. “Academia’s incentive structure is such that it’s better to publish something than nothing,” he explains, even if that something is only read by you and your reviewers.

Fifty percent (50%) of papers have an audience of three? Bear in mind these aren’t papers from the World Bank but papers spread across a range of disciplines.

Before you decide that PDF format is the issue or that academic journal articles aren’t read, you need to consider other evidence from sources such as: Measuring Total Reading of Journal Articles, Donald W. King, Carol Tenopir, and Michael Clarke, D-Lib Magazine, October 2006, Volume 12, Number 10, ISSN 1082-9873.

King, Tenopir, and Clarke write in part:

The Myth of Low Use of Journal Articles

A myth that journal articles are read infrequently persisted over a number of decades (see, for example, Williams 1975, Lancaster 1978, Schauder 1994, Odlyzko 1996). In fact, early on this misconception led to a series of studies funded by the National Science Foundation (NSF) in the 1960s and 1970s to seek alternatives to traditional print journals, which were considered by many to be a huge waste of paper. The basis for this belief was generally twofold. First, many considered citation counts to be the principal indicator of reading articles, and studies showed that articles averaged about 10 to 20 citations to them (a number that has steadily grown over the past 25 years). Counts of citations to articles tend to be highly skewed with a few articles having a large number of citations and many with few or even no citation to them. This led to the perception that articles were read infrequently or simply not at all.

King, Tenopir, and Clarke make a convincing case that “readership” for an article is a more complex question than checking download statistics.

Let’s say that the question of usage/reading of reports/articles is open to debate. Depending on whom you ask, some measures are thought to be better than others.

But there is a common factor that all of these studies ignore: Usage, however you define it, is based on article or paper level access.

What if instead of looking for an appropriate World Bank PDF (or other format) file, I could search for the data used in such a file? Or the analysis of some particular data that is listed in a file? I may or may not be interested in the article as a whole.

An author’s arrangement of data and their commentary on it is one presentation of the data; shouldn’t we divorce access to the data from reading it through the lens of the author?

If we want greater re-use of experimental, financial, survey and other data, then let’s stop burying it in an author’s presentation, whether delivered as print, PDF, or some other format.

I first saw this in a tweet by Duncan Hull.

Reverse Engineering for Beginners

Filed under: Cybersecurity,Reverse Engineering — Patrick Durusau @ 8:57 am

Reverse Engineering for Beginners by Dennis Yurichev.

From the webpage:

Topics discussed: x86, ARM.

Topics touched: Oracle RDBMS, Itanium, copy-protection dongles, LD_PRELOAD, stack overflow, ELF, win32 PE file format, x86-64, critical sections, syscalls, TLS, position-independent code (PIC), profile-guided optimization, C++ STL, OpenMP, win32 SEH.

I guess I have a different definition of “beginner.”

Chapter 2 starts off with “Hello, World!” from C and by section 2.1.1:

Let’s compile it in MSVC 2010:

😉

At more than 600 pages, this took a lot of work. I suspect it will repay the effort you invest in it.

I first saw this in Nat Torkington’s Four short links: 13 May 2014.

May 13, 2014

Bringing machine learning and compositional semantics together

Filed under: Machine Learning,Semantics — Patrick Durusau @ 6:24 pm

Bringing machine learning and compositional semantics together by Percy Liang and Christopher Potts.

Abstract:

Computational semantics has long been seen as a field divided between logical and statistical approaches, but this divide is rapidly eroding, with the development of statistical models that learn compositional semantic theories from corpora and databases. This paper presents a simple discriminative learning framework for defining such models and relating them to logical theories. Within this framework, we discuss the task of learning to map utterances to logical forms (semantic parsing) and the task of learning from denotations with logical forms as latent variables. We also consider models that use distributed (e.g., vector) representations rather than logical ones, showing that these can be seen as part of the same overall framework for understanding meaning and structural complexity.

My interest is in how computational semantics can illuminate issues in semantics. It has been my experience that the transition from natural language to more formal (and less robust) representations draws out semantic issues, such as ambiguity, that lurk unnoticed in natural language texts.
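
As a toy illustration of what “mapping utterances to logical forms” means, consider a tiny rule-based parser and its denotations (this is only a sketch of the input/output relationship, not the discriminative framework from the paper):

```python
# Minimal rule-based sketch of semantic parsing: utterance -> logical form,
# plus evaluation of the logical form to its denotation.
WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4}
WORD_TO_OP = {"plus": "+", "times": "*"}

def parse(utterance):
    """Map e.g. 'two plus three' to the logical form ('+', 2, 3)."""
    left, op, right = utterance.lower().split()
    return (WORD_TO_OP[op], WORD_TO_NUM[left], WORD_TO_NUM[right])

def denote(form):
    """Evaluate a logical form to its denotation."""
    op, a, b = form
    return a + b if op == "+" else a * b

print(parse("two plus three"))           # ('+', 2, 3)
print(denote(parse("two plus three")))   # 5
```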

With right at seven pages of references, you will have no shortage of reading material on compositional semantics.

I first saw this in a tweet by Chris Brockett.

Online Language Taggers

Filed under: Language,Linguistics,Tagging — Patrick Durusau @ 4:21 pm

UCREL Semantic Analysis System (USAS)

From the homepage:

The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects and this page collects together various pointers to those projects and publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur’s Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (shown here on the right), subdivided, and with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.

There are four online taggers available:

English: 100,000 word limit

Italian: 2,000 word limit

Dutch: 2,000 word limit

Chinese: 3,000 character limit

Enjoy!

I first saw this in a tweet by Paul Rayson.

Non-English/Spanish Language by State

Filed under: Government,Language — Patrick Durusau @ 3:57 pm

I need your help. I saw this on a twitter feed from Slate.

[Image from Slate: map of the most commonly spoken language in each state, other than English or Spanish]

I don’t have confirmation that any member of Georgia (United States) government reads Slate, but putting this type of information where it might be seen by Georgia government staffers strikes me as irresponsible news reporting.

Publishing all of Snowden’s documents as an unedited dump would have less of an impact than members of the Georgia legislature finding out there is yet another race to worry about in Georgia.

The legislature hardly knows which way to turn now, knowing about African-Americans and Latinos. Adding another group to that list will only make matters worse.

Question: How to suppress information about the increasing diversity of the population of Georgia?

Not for long, just until it becomes diverse enough to replace all the sitting members of the Georgia legislature in one fell swoop. 😉

The more diverse Georgia becomes, the more vibrant its rebirth will be following its current period of stagnation trying to hold onto the “good old days.”

The Shrinking Big Data MarketPlace

Filed under: BigData,Marketing,VoltDB — Patrick Durusau @ 3:33 pm

VoltDB Survey Finds That Big Data Goes to Waste at Most Organizations

From the post:

VoltDB today announced the findings of an industry survey which reveals that most organizations cannot utilize the vast majority of the Big Data they collect. The study exposes a major Big Data divide: the ability to successfully capture and store huge amounts of data is not translating to improved bottom-line business benefits.

Untapped Data Has Little or No Value

The majority of respondents reveal that their organizations can’t utilize most of their Big Data, despite the fact that doing so would drive real bottom line business benefits.

  • 72 percent of respondents cannot access and/or utilize the majority of the data coming into their organizations.
  • Respondents acknowledge that if they were able to better leverage Big Data their organizations could: deliver a more personalized customer experience (49%); increase revenue growth (48%); and create competitive advantages (47%).

(emphasis added)

News like that makes me wonder how long the market for “big data tools” that can’t produce ROI is going to last.

I suspect VoltDB has its eyes on addressing the software aspects of the non-utilization problem (more power to them) but that still leaves the usual office politics of who has access to what data and the underlying issues of effectively sharing data across inconsistent semantics.

Topic maps can’t help you address the office politics problem, unless you want to create a map of who is in the way of effective data sharing. Having created such a map, how you resolve personnel issues is your problem.

Topic maps can help with the inconsistent semantics that are going to persist even in the best of organizations. Departments have inconsistent semantics in many cases because their semantics, or “silo” if you like, works best for their workflow.

Why not allow the semantics/silo to stay in place and map it into other semantics/silos as need be? That way every department keeps its familiar semantics and you get the benefit of better workflow.

To put it another way, silos aren’t the problem; it is the opacity of silos that is the problem. Make silos transparent and you have better data interchange and, as a consequence, greater access to the data you are collecting.

Improve your information infrastructure on top of improved mapping/access to data and you will start to improve your bottom line. Someday you will get to “big data.” But as the survey says: Using big data tools != improved bottom line.

Postgres-XL

Filed under: Database,Postgres-XL,PostgreSQL — Patrick Durusau @ 3:12 pm

Database vendor open sources Postgres-XL for scale-out workloads by Derrick Harris.

From the post:

Database software company TransLattice has rebuilt the popular PostgreSQL database as a clustered system designed for handling large datasets. The open-source product, called Postgres-XL, is designed to be just like regular Postgres, only more scalable and also functional as a massively parallel analytic database.

That’s interesting news but I puzzled over a comment that Derrick makes:

Postgres-XL is among a handful of recent attempts to turn Postgres into a more fully functional database.

True, there are projects to add features to Postgres that it previously did not have, but I would not characterize them as making Postgres “…a more fully functional database.”

PostgreSQL is already a fully functional database, a rather advanced one. It may lack some feature someone imagines as useful, but the jury is still out on whether such added “features” are features or cruft.

Functional Pearl:…

Filed under: Category Theory,Functional Programming,Haskell — Patrick Durusau @ 3:01 pm

Functional Pearl: Kleisli arrows of outrageous fortune by Conor McBride.

Abstract:

When we program to interact with a turbulent world, we are to some extent at its mercy. To achieve safety, we must ensure that programs act in accordance with what is known about the state of the world, as determined dynamically. Is there any hope to enforce safety policies for dynamic interaction by static typing? This paper answers with a cautious ‘yes’.

Monads provide a type discipline for effectful programming, mapping value types to computation types. If we index our types by data approximating the ‘state of the world’, we refine our values to witnesses for some condition of the world. Ordinary monads for indexed types give a discipline for effectful programming contingent on state, modelling the whims of fortune in a way that Atkey’s indexed monads for ordinary types do not (Atkey, 2009). Arrows in the corresponding Kleisli category represent computations which reach a given postcondition from a given precondition: their types are just specifications in a Hoare logic!

By way of an elementary introduction to this approach, I present the example of a monad for interacting with a file handle which is either ‘open’ or ‘closed’, constructed from a command interface specified Hoare-style. An attempt to open a file results in a state which is statically unpredictable but dynamically detectable. Well typed programs behave accordingly in either case. Haskell’s dependent type system, as exposed by the Strathclyde Haskell Enhancement preprocessor, provides a suitable basis for this simple experiment.

Even without a weakness for posts/articles/books about category theory, invoking the Bard is enough to merit a pointer.

Rest easy, the author does not attempt to render any of the sonnets using category theory notation.

I first saw this in a tweet by Computer Science.

Text Coherence

Filed under: Text Analytics,Text Coherence,Text Mining,Topic Models (LDA) — Patrick Durusau @ 2:12 pm

Christopher Phipps mentioned Automatic Evaluation of Text Coherence: Models and Representations by Mirella Lapata and Regina Barzilay in a tweet today. Running that article down, I discovered it was published in the proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) in 2005.

Useful but a bit dated.

A more recent resource: A Bibliography of Coherence and Cohesion, Wolfram Bublitz (Universität Augsburg). Last updated: 2010.

The Bublitz bibliography is more recent, but a current bibliography would be even more useful.

Can you suggest a more recent bibliography on text coherence/cohesion?

I ask because while looking for such a bibliography, I encountered: Improving Topic Coherence with Regularized Topic Models by David Newman, Edwin V. Bonilla, and Wray Buntine.

The abstract reads:

Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

I don’t think the “…small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful” is a surprise to anyone. I take that as the traditional “garbage in, garbage out.”
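
For a feel of what “topic coherence” measures, here is a simplified UMass-style sketch: a topic’s top words score well when they co-occur in the same documents (an illustration of the metric only, not the regularizers proposed in the paper):

```python
# Simplified UMass-style coherence: for a topic's top words, reward pairs
# that co-occur in the same documents.
import math
from itertools import combinations

def coherence(topic_words, raw_docs):
    docs = [set(d.lower().split()) for d in raw_docs]
    def df(w):             # document frequency of a single word
        return sum(1 for d in docs if w in d)
    def co_df(w1, w2):     # number of documents containing both words
        return sum(1 for d in docs if w1 in d and w2 in d)
    return sum(math.log((co_df(w1, w2) + 1) / max(df(w1), 1))
               for w1, w2 in combinations(topic_words, 2))

corpus = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock prices fell sharply today",
    "the dog slept on the mat",
]
print(coherence(["cat", "dog", "mat"], corpus))    # higher: words co-occur
print(coherence(["cat", "stock", "mat"], corpus))  # lower: 'stock' is off-topic
```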

However, “regularizers” may be useful for automatic/assisted authoring of topics in the topic map sense of the word topic, assuming you want to mine small or noisy texts. The authors say the technique should apply to large texts and promise future research in that direction.

I checked the authors’ recent publications but didn’t see anything I would call a “large” text application of “regularizers.” Open area of research if you want to take the lead.

Choosing a fast unique identifier (UUID) for Lucene

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:44 am

Choosing a fast unique identifier (UUID) for Lucene by Michael McCandless.

From the post:

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java’s UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.
….

Excellent tips for creating identifiers for Lucene! Complete with tests and an explanation for the possible choices.
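
To see the shape of the two id styles Michael compares, here is a quick Python sketch contrasting random version-4 UUIDs with a hypothetical timestamp-prefixed id (the performance argument lives in Lucene’s term dictionary, so treat this only as an illustration of why sequential ids share prefixes and random ones don’t):

```python
# Sketch: random (version 4) UUIDs vs. ids with a shared, increasing prefix.
# Random ids scatter across the keyspace; sequential/timestamp-prefixed ids
# cluster, which is the property the post exploits for faster term lookup.
import time
import uuid

def random_ids(n):
    return [uuid.uuid4().hex for _ in range(n)]

def timestamp_ids(n):
    # hypothetical flake-style id: millisecond timestamp followed by a counter
    base = int(time.time() * 1000)
    return [f"{base:013x}{i:08x}" for i in range(n)]

print(random_ids(3))     # no shared prefix between consecutive ids
print(timestamp_ids(3))  # long shared prefix between consecutive ids
```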

Enjoy!

May 12, 2014

CRDTs in New York (May 15, 2014)

Filed under: Conferences,CRDT — Patrick Durusau @ 7:10 pm

Chas Emerick – A comp study of Convergent & Commutative Replicated Data Types (May 15, 2014)

Thursday, May 15, 2014 7:00 PM

Viggle Inc. 902 Broadway 11, New York, NY (map)

From the meeting notice:

‘A comprehensive study of Convergent and Commutative Replicated Data Types’ by Shapiro et al.

Commutative Replicated Data Types (CRDTs) are a formalism for providing practical data and programming primitives for use in distributed systems applications without necessitating expensive (and sometimes impractical) consensus mechanisms. Their key characteristic is that they provide conflict-free “merging” of distributed concurrent updates given only the weak guarantees of eventual consistency.

While this paper did not coin the term ‘CRDT’, it was the first to provide a comprehensive treatment of their definition, semantics, and possible construction separate from and beyond previous implementations of distributable datatypes that happened to provide CRDT-like semantics.
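
To make “conflict-free merging” concrete, here is a toy grow-only counter (G-Counter), one of the simplest CRDTs: merge takes the per-replica maximum, so it is commutative, associative, and idempotent (my illustration, not an example from the paper):

```python
# Toy G-Counter CRDT: each replica only increments its own slot;
# merge takes the element-wise maximum, so concurrent updates converge
# regardless of delivery order or duplication.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5 -- both replicas converge
```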

A “papers we love” meetup later this week in New York.

If you have or can make the time, RSVP and join the discussion!

Categories from scratch

Filed under: Category Theory,Computer Science,Mathematics — Patrick Durusau @ 7:00 pm

Categories from scratch by Raphael ‘kena’ Poss.

From the post:

Prologue

The concept of category from mathematics happens to be useful to computer programmers in many ways. Unfortunately, all “good” explanations of categories so far have been designed by mathematicians, or at least theoreticians with a strong background in mathematics, and this makes categories especially inscrutable to external audiences.

More specifically, the common explanatory route to approach categories is usually: “here is a formal specification of what a category is; then look at these known things from maths and theoretical computer science, and admire how they can be described using the notions of category theory.” This approach is only successful if the audience can fully understand a conceptual object using only its formal specification.

In practice, quite a few people only adopt conceptual objects by abstracting from two or more contexts where the concepts are applicable, instead. This is the road taken below: reconstruct the abstractions from category theory using scratches of understanding from various fields of computer engineering.

Overview

The rest of this document is structured as follows:

  1. introduction of example Topics of study: unix process pipelines, program statement sequences and signal processing circuits;
  2. Recollections of some previous knowledge about each example; highlight of interesting analogies between the examples;
  3. Identification of the analogies with existing concepts from category theory;
  4. a quick preview of Goodies from category theory;
  5. references to Further reading.

If you don’t already grok category theory, perhaps this will be the approach that tips the balance in your favor!
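
If it helps, here is the pipeline analogy in miniature: processing stages compose associatively and there is an identity stage, which is essentially all a category asks for (a sketch of the analogy, not the article’s construction):

```python
# Sketch of the pipeline/category analogy: objects are types of data,
# arrows are processing stages, and composition pipes one stage into the next.
def compose(f, g):
    return lambda x: g(f(x))

identity = lambda x: x

# two "pipeline stages"
to_words = lambda s: s.split()
count = lambda ws: len(ws)

pipeline = compose(to_words, count)
print(pipeline("categories from scratch"))                    # 3
print(compose(identity, pipeline)("a b") == pipeline("a b"))  # identity law
print(compose(compose(to_words, count), identity)("x y") ==
      compose(to_words, compose(count, identity))("x y"))     # associativity
```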

Facebook teaches you exploratory data analysis with R

Filed under: Data Analysis,Exploratory Data Analysis,Facebook,R — Patrick Durusau @ 6:44 pm

Facebook teaches you exploratory data analysis with R by David Smith.

From the post:

Facebook is a company that deals with a lot of data — more than 500 terabytes a day — and R is widely used at Facebook to visualize and analyze that data. Applications of R at Facebook include user behaviour, content trends, human resources and even graphics for the IPO prospectus. Now, four R users at Facebook (Moira Burke, Chris Saden, Dean Eckles and Solomon Messing) share their experiences using R at Facebook in a new Udacity on-line course, Exploratory Data Analysis.

The more data you explore, the better data explorer you will be!

Enjoy!

I first saw this in a post by David Smith.

Enough Machine Learning to…

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 6:38 pm

Enough Machine Learning to Make Hacker News Readable Again by Ned Jackson Lovely.

From the description:

It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

Ned recommends you start with the map I cover at: Machine Learning Cheat Sheet (for scikit-learn).

Great practice with scikit-learn. Following this as a general outline will develop your machine learning skills!
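
If you want a taste before watching, the core of such a filter is only a few lines of scikit-learn (a sketch assuming you have labeled titles as worth reading or not; the vectorizer and classifier choices here are mine, not necessarily Ned’s):

```python
# Sketch: a minimal "is this worth reading?" text classifier with scikit-learn.
# Titles and labels are toy data; Ned's talk covers the real pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "Show HN: a lock-free queue in 200 lines of C",
    "Why I quit my job to travel",
    "Understanding B-tree page splits",
    "10 productivity hacks you won't believe",
]
labels = [1, 0, 1, 0]  # 1 = worth reading (toy judgments)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(titles, labels)
print(clf.predict(["A gentle introduction to CRDTs"]))
```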

EnviroAtlas

Filed under: Data Integration,Environment,Government,Government Data — Patrick Durusau @ 4:13 pm

EnviroAtlas

From the homepage:

What is EnviroAtlas?

EnviroAtlas is a collection of interactive tools and resources that allows users to explore the many benefits people receive from nature, often referred to as ecosystem services. Key components of EnviroAtlas include the following:


Why is EnviroAtlas useful?

Though critically important to human well-being, ecosystem services are often overlooked. EnviroAtlas seeks to measure and communicate the type, quality, and extent of the goods and services that humans receive from nature so that their true value can be considered in decision-making processes.

Using EnviroAtlas, many types of users can access, view, and analyze diverse information to better understand how various decisions can affect an array of ecological and human health outcomes. EnviroAtlas is available to the public and houses a wealth of data and research.

EnviroAtlas integrates over 300 data layers listed in: Available EnviroAtlas data.

News about the cockroaches infesting the United States House/Senate makes me forget there are agencies laboring to provide benefits to citizens.

Whether this environmental goldmine will be enough to result in a saner environmental policy remains to be seen.

I first saw this in a tweet by Margaret Palmer.

Distributed Systems and the End of the API

Filed under: CRDT,Distributed Computing,Distributed Systems — Patrick Durusau @ 3:53 pm

Distributed Systems and the End of the API by Chas Emerick.

From the post:

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it.

I have two claims of which I would like to convince you today:

  1. The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.
  2. There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

A bit further into the paper, distributed systems are defined as:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.

A lot of heavy sledding but important for the future development of robust distributed systems.

It is important that people interested in semantics and XML participate in these discussions.

For example, Chas says of XML (and JSON):

the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention.

I am curious what is meant by “fundamentally opaque,” at least insofar as Chas is talking about XML. If he means that independent changes impact the tree structure and make reconciliation of concurrent changes challenging, OK, but that’s not being opaque. And even that is an artifact of a processing model for XML, not XML proper.

I am even more concerned about the “semantics” to be addressed in distributed systems. At this point I will have to take Chas’ word that distributed systems preserve machine-to-machine semantics (I have much reading left to do), but correct machine processing doesn’t guarantee correct semantics for a human consumer of the same data.

I first saw this in a tweet by Tom Santero.

GATE 8.0

Filed under: Annotation,Linguistics,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 2:34 pm

GATE (General Architecture for Text Engineering) 8.0

From the download page:

Release 8.0 (May 11th 2014)

Most users should download the installer package (~450MB):

If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

Four major changes in this release:

  1. Requires Java 7 or later to run
  2. Tools for Twitter.
  3. ANNIE (named entity annotation pipeline) Refreshed.
  4. Tools for Crowd Sourcing.

Not bad for a project that will turn twenty (20) next year!

More resources:

UsersGuide

Nightly Snapshots

Mastering a substantial portion of GATE should keep you in nearly constant demand.

High-Performance Browser Networking

Filed under: Networks,Topic Map Software,WWW — Patrick Durusau @ 10:42 am

High-Performance Browser Networking by Ilya Grigorik.

From the foreword:

In High Performance Browser Networking, Ilya explains many whys of networking: Why latency is the performance bottleneck. Why TCP isn’t always the best transport mechanism and UDP might be your better choice. Why reusing connections is a critical optimization. He then goes even further by providing specific actions for improving networking performance. Want to reduce latency? Terminate sessions at a server closer to the client. Want to increase connection reuse? Enable connection keep-alive. The combination of understanding what to do and why it matters turns this knowledge into action.

Ilya explains the foundation of networking and builds on that to introduce the latest advances in protocols and browsers. The benefits of HTTP 2.0 are explained. XHR is reviewed and its limitations motivate the introduction of Cross-Origin Resource Sharing. Server-Sent Events, WebSockets, and WebRTC are also covered, bringing us up to date on the latest in browser networking.

Viewing the foundation and latest advances in networking from the perspective of performance is what ties the book together. Performance is the context that helps us see the why of networking and translate that into how it affects our website and our users. It transforms abstract specifications into tools that we can wield to optimize our websites and create the best user experience possible. That’s important. That’s why you should read this book.
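
The connection-reuse point is easy to see from client code: with a keep-alive session, repeated requests to the same host skip the TCP (and TLS) handshakes (a sketch assuming the third-party requests package is installed; the URL is a placeholder):

```python
# Sketch: connection reuse with HTTP keep-alive via requests.Session.
# Each request on the session reuses the pooled TCP connection instead of
# paying the handshake latency again.
import requests

with requests.Session() as session:   # keep-alive is on by default
    for path in ("/", "/about", "/contact"):
        response = session.get("https://example.com" + path, timeout=5)
        print(path, response.status_code, response.elapsed)
```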

Network latency may be responsible for a non-responsive app but can you guess who the user is going to blame?

Right in one, the app!

“Not my fault” isn’t a line item on any bank deposit form.

You or someone on your team needs to be tasked with performance, including reading High-Performance Browser Networking.

I first saw this in a tweet by Jonas Bonér.

strace Wow Much Syscall

Filed under: Linux OS — Patrick Durusau @ 8:50 am

strace Wow Much Syscall by Brendan Gregg.

From the post:

I wouldn’t dare run strace(1) in production without seriously considering the consequences, and first trying the alternates. While it’s widely known (and continually rediscovered) that strace is an amazing tool, it’s much less known that it currently is – and always has been – dangerous.

strace is the system call tracer for Linux. It currently uses the arcane ptrace() (process trace) debugging interface, which operates in a violent manner: pausing the target process for each syscall so that the debugger can read state. And doing this twice: when the syscall begins, and when it ends.

With strace, this means pausing the target application for every syscall, twice, and context-switching between the application and strace. It’s like putting traffic metering lights on your application.

A great guide to strace, including a handy set of strace one-liners, references, “How To Learn strace,” and other goodies.

If you are interested in *nix internals and the potential of topic maps for the same, this is a great post.

I first saw this in a post by Julia Evans.

May 11, 2014

…Technology-Assisted Review in Electronic Discovery…

Filed under: Machine Learning,Spark — Patrick Durusau @ 7:16 pm

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery by Gordon V. Cormack & Maura R. Grossman.

Abstract:

Using a novel evaluation toolkit that simulates a human reviewer in the loop, we compare the effectiveness of three machine-learning protocols for technology-assisted review as used in document review for discovery in legal proceedings. Our comparison addresses a central question in the deployment of technology-assisted review: Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? On eight review tasks — four derived from the TREC 2009 Legal Track and four derived from actual legal matters — recall was measured as a function of human review effort. The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P<0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents. Among passive-learning methods, significantly less human review effort (P<0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of "stabilization" -- determining when training is adequate, and therefore may stop.

New acronym for me: TAR (technology-assisted review).

If you are interested in legal discovery, take special note that the authors have released a TAR evaluation toolkit.

This article and its references will repay a close reading several times over.
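
For intuition, “continuous active learning with relevance feedback” boils down to a loop like the one below: seed the training set with keyword hits, train, have the reviewer label the top-scored documents, and repeat (a schematic sketch of the protocol shape, not the authors’ toolkit; the review function and its 0/1 labels are assumed to come from a human reviewer):

```python
# Schematic sketch of continuous active learning with relevance feedback.
# review(doc) is assumed to return 1 (responsive) or 0 (not responsive).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_active_learning(docs, review, keyword, rounds=10, batch=5):
    X = TfidfVectorizer().fit_transform(docs)
    # seed training with simple keyword hits, as in the non-random protocols
    labeled = {i: review(docs[i]) for i, d in enumerate(docs) if keyword in d.lower()}
    for _ in range(rounds):
        if len(set(labeled.values())) < 2:
            break  # need both classes before training
        clf = LogisticRegression().fit(X[list(labeled)],
                                       [labeled[i] for i in labeled])
        unlabeled = [i for i in range(len(docs)) if i not in labeled]
        if not unlabeled:
            break
        scores = clf.predict_proba(X[unlabeled])[:, 1]
        top = sorted(zip(scores, unlabeled), reverse=True)[:batch]
        for _, i in top:
            labeled[i] = review(docs[i])   # relevance feedback from the reviewer
    return labeled
```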

brat rapid annotation tool

Filed under: Annotation,Natural Language Processing,Visualization — Patrick Durusau @ 3:20 pm

brat rapid annotation tool

From the introduction:

brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.

brat is designed in particular for structured annotation, where the notes are not free form text but have a fixed form that can be automatically processed and interpreted by a computer.

The examples page has examples of:

  • Entity mention detection
  • Event extraction
  • Coreference resolution
  • Normalization
  • Chunking
  • Dependency syntax
  • Meta-knowledge
  • Information extraction
  • Bottom-up Metaphor annotation
  • Visualization
  • Information Extraction system evaluation

I haven’t installed the local version but it is on my to-do list.

I first saw this in a tweet by Steven Bird.

Twitter User Targeting Data

Filed under: Geographic Data,Geography,Georeferencing,Tweets — Patrick Durusau @ 2:59 pm

Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization by Ryan Compton, David Jurgens, and David Allen.

Abstract:

Geographically annotated social media is extremely valuable for modern information retrieval. However, when researchers can only access publicly-visible data, one quickly finds that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location sharing preferences, using only publicly-visible Twitter data.

Our method infers an unknown user’s location by examining their friend’s locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user’s ego network can be used as a per-user accuracy measure, allowing us to discard poor location inferences and control the overall error of our approach.

Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.33 km, allowing us to geotag roughly 89% of public tweets.
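
The core idea, inferring an unknown user’s location from friends’ locations while discarding widely dispersed ego networks, can be sketched crudely (this is not the authors’ total-variation solver, just an illustration of the inference step; the dispersion threshold is my invention):

```python
# Crude sketch of location inference from friends' locations: take a robust
# central point (coordinate-wise median) and skip users whose friends are
# too geographically dispersed to give a trustworthy estimate.
from statistics import median

def infer_location(friend_coords, max_dispersion_km=100.0):
    """friend_coords: list of (lat, lon) pairs for friends with known locations."""
    if not friend_coords:
        return None
    lat = median(c[0] for c in friend_coords)
    lon = median(c[1] for c in friend_coords)
    # rough dispersion: median absolute deviation in degrees, ~111 km per degree
    spread_km = 111.0 * median(abs(c[0] - lat) + abs(c[1] - lon)
                               for c in friend_coords)
    return (lat, lon) if spread_km <= max_dispersion_km else None

print(infer_location([(40.71, -74.01), (40.73, -73.99), (40.70, -74.02)]))
```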

If 6.33 km sounds like a lot of error, check out NUKEMAP by Alex Wellerstein.

Retail Spy Services for Business Relationships

Filed under: Cybersecurity,Security — Patrick Durusau @ 2:14 pm

I ran across a retail spying operation today. There is a startup that allows you to upload documents for others to view and then:

Every time someone clicks on this link, it opens the document in DocSend’s HTML viewer in your browser — and of course, it also collects data. You know if the recipient went through all the slides or pages, and you even know how much time he or she spent on every slide. You know if he or she forwarded the document to someone else as DocSend collects email addresses of people who opened the document.

This sounds like the metadata that lawyers are careful to scrub from documents before forwarding them to other parties. Yes?

BTW, when asked about the obvious privacy issue:

Shana Fisher: It’s a little like spying. What do you think of that?

Answer: There is no privacy issue. It’s a tool, like any tool you misuse it in some scenario. It’s a tool for business people, for business relationships.

Sorry, that went by a little fast. “There is no privacy issue….It’s a tool for business people, for business relationships.”

Really? I may be wrong, but I thought most spying was concerned with business people and business relationships. Governments worry about who said some government minister was butt ugly, but no serious people are worried about that sort of thing.

The question is: Do you want to send information to a classic intelligence gathering application for future unknown uses by others?

The answer is: NO!

The quotes in this post are from: DocSend Is The Analytics Tool For Documents We’ve All Been Waiting For by Romain Dillet.

PS: Don’t bother sending me a file link for this or similar services. It will languish unread.
