Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 21, 2013

SuggestStopFilter carefully removes stop words for suggesters

Filed under: Lucene,Search Engines — Patrick Durusau @ 6:07 pm

SuggestStopFilter carefully removes stop words for suggesters by Michael McCandless.

Michael has tamed the overly “aggressive” StopFilter with SuggestStopFilter.

From the post:

Finally, you could use the new StopSuggestFilter at lookup time: this filter is just like StopFilter except when the token is the very last token, it checks the offset for that token and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won’t change it. This way a query “a” can find “apple”, but a query “a ” (with a trailing space) will find nothing because the “a” will be removed.

I’ve pushed StopSuggestFilter to jirasearch.mikemccandless.com and it seems to be working well so far!
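The behavior Michael describes is easy to sketch outside of Lucene. Here's a rough Python illustration of the idea (not the Lucene API; the stop list and whitespace tokenizing are stand-ins):

```python
# Sketch of the idea described above: drop stop words, except when the final
# token is still being typed (no trailing space), in which case it is kept
# so prefix suggestions like "a" -> "apple" still work.

STOP_WORDS = {"a", "an", "the", "of"}  # stand-in stop list

def suggest_tokens(query: str):
    tokens = query.split()
    still_typing = not query.endswith(" ")  # last token not yet finished
    kept = []
    for i, tok in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if tok.lower() in STOP_WORDS and not (is_last and still_typing):
            continue  # normal stop word removal
        kept.append(tok)
    return kept

print(suggest_tokens("state of a"))    # ['state', 'a']  -> "a" can still match "apple"
print(suggest_tokens("state of a "))   # ['state']       -> trailing space, "a" removed
```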

Have you noticed how quickly improvements for Lucene and Solr emerge?

A Look Inside Our 210TB 2012 Web Crawl

Filed under: Common Crawl,Search Data — Patrick Durusau @ 5:13 pm

A Look Inside Our 210TB 2012 Web Crawl by Lisa Green.

From the post:

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian’s paper here. If you want to dive deeper you can find the non-aggregated data at s3://aws-publicdatasets/common-crawl/index2012 and the code on GitHub.
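If you just want to poke at the data before committing to a job, listing the bucket is enough to get oriented. A hedged sketch with boto3 (which postdates this post; the bucket and prefix are the ones quoted above and may have been reorganized since):

```python
# List a few objects under the 2012 Common Crawl prefix mentioned above.
# Assumes boto3 is installed and AWS credentials are configured; the bucket
# and prefix come from the post and may have moved since 2013.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="aws-publicdatasets",
    Prefix="common-crawl/index2012/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```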

Don’t have your own server farm crawling the internet?

Take a long look at CommonCrawl and their publicly accessible crawl data.

If the enterprise search bar is at 9%, the Internet search bar is even lower.

Use CommonCrawl data as a practice field.

Do your first ten “hits” include old data because it is popular?

Simple Hive ‘Cheat Sheet’ for SQL Users

Filed under: Hive,SQL — Patrick Durusau @ 4:51 pm

Simple Hive ‘Cheat Sheet’ for SQL Users by Marc Holmes.

From the post:

If you’re already familiar with SQL then you may well be thinking about how to add Hadoop skills to your toolbelt as an option for data processing.

From a querying perspective, using Apache Hive provides a familiar interface to data held in a Hadoop cluster and is a great way to get started. Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a bunch of differences between SQL and HiveQL, but on the other hand there are a lot of similarities too, and recent releases of Hive bring that SQL-92 compatibility closer still.

To highlight that – and as a bit of fun to get started – below is a simple ‘cheat sheet’ (based on a simple MySQL reference such as this one) for getting started with basic querying for Hive. Here, we’ve done a direct comparison to MySQL, but given the simplicity of these particular functions, then it should be the same in essentially any SQL dialect.

Of course, if you really want to get to grips with Hive, then take a look at the full language manual.
(…)
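The cheat sheet's point is that the basics really are nearly identical. A rough sketch that shells the same sort of statements out to the hive CLI (assumes a working Hive install with hive on the PATH; the table and column names are made up):

```python
# A few cheat-sheet style statements run through the hive CLI with `hive -e`.
# Assumes a working Hive installation; the table and columns are hypothetical.
import subprocess

QUERIES = [
    "SHOW TABLES;",                                   # same statement in MySQL
    "DESCRIBE web_logs;",                             # same statement in MySQL
    "SELECT ip, COUNT(*) AS hits "
    "FROM web_logs GROUP BY ip ORDER BY hits DESC LIMIT 10;",
]

for q in QUERIES:
    subprocess.run(["hive", "-e", q], check=True)
```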

Definitely going to print this cheat sheet out and put it in plastic.

A top of the desk sort of reference.

ASCII Cheat Sheet

Filed under: ASCII — Patrick Durusau @ 4:43 pm

ASCII Cheat Sheet by Peteris Krumins.

From the post:

I created an ASCII cheat sheet from the table that I used in my blog post about my favorite regular expression. I know there are a billion ASCII cheat sheets out there but I wanted my own. This cheat sheet includes dec, hex, oct and bin values for ASCII characters.
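If you would rather generate the table than print someone else's, a few lines of Python produce the same dec/hex/oct/bin columns for the printable range:

```python
# Print dec, hex, oct and bin values for the printable ASCII characters.
for code in range(32, 127):
    print(f"{chr(code)!r:>6}  dec {code:3d}  hex {code:02x}  "
          f"oct {code:03o}  bin {code:08b}")
```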

Any cheat sheets to suggest on less popular character sets?

Particularly ones that had font mappings prior to Unicode?

ack 2.04

Filed under: Programming,Searching — Patrick Durusau @ 4:36 pm

ack 2.04

From the webpage:

Top 5 reasons to use ack

Blazing fast
It’s fast because it only searches the stuff it makes sense to search.

Better search
Searches entire trees by default while ignoring Subversion, Git and other VCS directories and other files that aren’t your source code.

Designed for code search
Where grep is a general text search tool, ack is especially for the programmer searching source code. Common tasks take fewer keystrokes.

Highly portable
ack is pure Perl, so it easily runs on a Windows installation of Perl (like Strawberry Perl) without modifications.

Free and open
Ack costs nothing. It’s 100% free and open source under Artistic License v2.0.

I was doubtful until I saw the documentation page.

I had to concede that there were almost enough command line switches to qualify for a man page. 😉

I suspect it is going to be a matter of personal preference.

See what your personal preference says.

Techniques To Improve Your Solr Search Results

Filed under: Drupal,Solr — Patrick Durusau @ 4:28 pm

Techniques To Improve Your Solr Search Results by Chris Johnson.

From the post:

Solr is a tremendously popular option for providing search functionality in Drupal. While Solr provides pretty good default results, making search results great requires analysis of what your users search for, consideration of which data is sent to Solr, and tuning of Solr’s ‘boosting’. In this post, I will show you a few techniques that can help you leverage Solr to produce great results. I will specifically be covering the Apache Solr Search module. Similar concepts exist in the Search API Solr Search module, but with different methods of configuring boosting and altering data sent to Solr.

If you are interested in using Solr in a Drupal environment, here is a starting place for you.

Wrap-up of the Solr Usability Contest

Filed under: Solr,Usability — Patrick Durusau @ 4:20 pm

Wrap-up of the Solr Usability Contest by Alexandre Rafalovitch.

From the post:

The Solr Usability Contest has finished. It ran for four weeks, received 29 suggestions, 113 votes and more than 300 visits. People from several different Solr communities participated.

See Alexandre’s post for the 29 suggestions.

Six (6) of them, including the #1 suggestion, concern documentation.

Measuring the Complexity of the Law: The United States Code

Filed under: Complexity,Law — Patrick Durusau @ 3:12 pm

Measuring the Complexity of the Law: The United States Code by Daniel Martin Katz and Michael James Bommarito II.

Abstract:

Einstein’s razor, a corollary of Ockham’s razor, is often paraphrased as follows: make everything as simple as possible, but not simpler. This rule of thumb describes the challenge that designers of a legal system face — to craft simple laws that produce desired ends, but not to pursue simplicity so far as to undermine those ends. Complexity, simplicity’s inverse, taxes cognition and increases the likelihood of suboptimal decisions. In addition, unnecessary legal complexity can drive a misallocation of human capital toward comprehending and complying with legal rules and away from other productive ends.

While many scholars have offered descriptive accounts or theoretical models of legal complexity, empirical research to date has been limited to simple measures of size, such as the number of pages in a bill. No extant research rigorously applies a meaningful model to real data. As a consequence, we have no reliable means to determine whether a new bill, regulation, order, or precedent substantially affects legal complexity.

In this paper, we address this need by developing a proposed empirical framework for measuring relative legal complexity. This framework is based on “knowledge acquisition,” an approach at the intersection of psychology and computer science, which can take into account the structure, language, and interdependence of law. We then demonstrate the descriptive value of this framework by applying it to the U.S. Code’s Titles, scoring and ranking them by their relative complexity. Our framework is flexible, intuitive, and transparent, and we offer this approach as a first step in developing a practical methodology for assessing legal complexity.

Curious what you make of the treatment of the complexity of the language of laws in this article?

The authors compute the number of words and the average length of words in each title of the United States Code. In addition, the Shannon entropy of each title is calculated. Those results figure in the authors’ determination of the complexity of each title.
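Those measures are simple enough to reproduce. A minimal sketch (my tokenizer and the choice of word-level entropy are assumptions, not necessarily the authors' exact procedure):

```python
# Word count, average word length, and Shannon entropy for a block of text.
# A rough approximation of the measures described above; the authors' exact
# tokenization and entropy unit may differ.
import math
import re
from collections import Counter

def complexity_measures(text: str):
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0, 0.0, 0.0
    counts = Counter(words)
    total = len(words)
    avg_len = sum(len(w) for w in words) / total
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return total, avg_len, entropy

sample = "Every person who under color of any statute ordinance regulation custom or usage"
print(complexity_measures(sample))
```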

To be sure, those are all measurable aspects of each title and so in that sense the results and the process to reach them can be duplicated and verified by others.

The authors are using a “knowledge acquisition model,” that is, measuring the difficulty a reader would experience in reading and acquiring knowledge about any part of the United States Code.

But reading the bare words of the U.S. Code is not a reliable way to acquire legal knowledge. Words in the U.S. Code and their meanings have been debated and decided (sometimes differently) by various courts. A reader doesn’t understand the U.S. Code without knowledge of court decisions on the language of the text.

Let me give you a short example:

42 U.S.C. §1983 reads:

Every person who, under color of any statute, ordinance, regulation, custom, or usage, of any State or Territory or the District of Columbia, subjects, or causes to be subjected, any citizen of the United States or other person within the jurisdiction thereof to the deprivation of any rights, privileges, or immunities secured by the Constitution and laws, shall be liable to the party injured in an action at law, suit in equity, or other proper proceeding for redress, except that in any action brought against a judicial officer for an act or omission taken in such officer’s judicial capacity, injunctive relief shall not be granted unless a declaratory decree was violated or declaratory relief was unavailable. For the purposes of this section, any Act of Congress applicable exclusively to the District of Columbia shall be considered to be a statute of the District of Columbia. (emphasis added)

Before reading the rest of this post, answer this question: Is a municipality a person for purposes of 42 U.S.C. §1983?

That is, if city employees violate your civil rights, can you sue them and the city they work for?

That seems like a straightforward question. Yes?

In Monroe v. Pape, 365 US 167 (1961), the Supreme Court found the answer was no. Municipalities were not “persons” for purposes of 42 U.S.C. §1983.

But a reader who only remembers that decision would be wrong if trying to understand that statute today.

In Monell v. New York City Dept. of Social Services, 436 U.S. 658 (1978), the Supreme Court found that it was mistaken in Monroe v. Pape and found the answer was yes. Municipalities could be “persons” for purposes of 42 U.S.C. §1983, in some cases.

The language in 42 U.S.C. §1983 did not change between 1961 and 1978. Nor did the circumstances under which section 1983 was passed (Civil War reconstruction) change.

But the meaning of that one word changed significantly.

Many other words in the U.S. Code have had a similar experience.

If you need assistance with 42 U.S.C. §1983 or any other part of the U.S. Code or other laws, seek legal counsel.

August 20, 2013

Step by step to build my first R Hadoop System

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 5:16 pm

Step by step to build my first R Hadoop System by Yanchang Zhao.

From the post:

After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about 2 weeks, finally I have built my first R Hadoop system and successfully run some R examples on it. My experience and steps to achieve that are presented at http://www.rdatamining.com/tutorials/rhadoop. Hopefully it will make it easier to try RHadoop for R users who are new to Hadoop. Note that I tried this on Mac only and some steps might be different for Windows.

Before going through the complex steps, you may want to have a look what you can get with R and Hadoop. There is a video showing Wordcount MapReduce in R at http://www.youtube.com/watch?v=hSrW0Iwghtw.

Unfortunately, I can’t get the video sound to work.

On the other hand, the step by step instructions are quite helpful, even without the video.

Solr and available query parsers

Filed under: Parsers,Solr — Patrick Durusau @ 4:20 pm

Solr and available query parsers

From the post:

Every now and then there is a question appearing on the mailing list – what type of query parsers are available in Solr. So we decided to make such a list with a short description about each of the query parsers available. If you are interested to see what Solr has to offer, please read the rest of this post.

I count eighteen (18) query parsers available for Solr.

If you can’t name each one and give a brief description of its use, you need to read this post.
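Switching parsers is usually nothing more than a defType parameter on the request. A hedged sketch against a local Solr instance (the core name and the qf field are assumptions):

```python
# Run the same query text through two of Solr's query parsers by changing
# defType. Assumes a local Solr instance with a core named "collection1";
# the field name "title" is hypothetical.
import requests

SOLR = "http://localhost:8983/solr/collection1/select"

for parser in ("lucene", "edismax"):
    resp = requests.get(SOLR, params={
        "q": "solr query parsers",
        "defType": parser,
        "qf": "title",      # only used by the dismax family
        "wt": "json",
    })
    found = resp.json()["response"]["numFound"]
    print(f"{parser}: {found} documents")
```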

Visualization on Impala: Big, Real-Time, and Raw

Filed under: HDFS,Impala — Patrick Durusau @ 4:14 pm

Visualization on Impala: Big, Real-Time, and Raw by Justin Kestelyn.

From the post:

What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.

The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.

We at Zoomdata, working with the Cloudera team, have figured out how to make Big Data simple, useful, and instantly accessible across an organization, with Cloudera Impala being a key element. Zoomdata is a next-generation user interface for data, and addresses streams of data as opposed to sets. Zoomdata performs continuous math across data streams in real-time to drive visualizations on touch, gestural, and legacy web interfaces. As new data points come in, it re-computes their values and turns them into visuals in milliseconds.

To handle historical data, Zoomdata re-streams the historical raw data through the same stream-processing engine, the same way you’d rewind a television show on your home DVR. The amount of the data involved can grow rapidly, so the ability to crunch billions of rows of raw data in a couple seconds is important –- which is where Impala comes in.

With Impala on top of raw HDFS data, we can run flights of tiny queries, each to do a tiny fraction of the overall work. Zoomdata adds the ability to process the resulting stream of micro-result sets instead of processing the raw data. We call this approach “micro-aggregate delegation”; it enables users to see results immediately, allowing for instantaneous analysis of arbitrarily large amounts of raw data. The approach also allows for joining micro-aggregate streams from disparate Hadoop, NoSQL, and legacy sources together while they are in-flight, an approach we call the “Death Star Join” (more on that in a future blog post).

The demo below shows how this works, by visualizing a dataset of 1 billion raw records per day nearly instantaneously, with no pre-aggregation, no indexing, no database, no star schema, no pre-built reports, and no data movement — just billions of rows of raw data in HDFS with Impala and Zoomdata on top.

The demo is very impressive! A must see.

The riff that Zoomdata is a “DVR” for data will resonate with many users.

My only caveat is caution with regard to the cleanliness of your data. The demo presumes that the underlying data is clean and the relationships displayed are relevant to the user’s query.

Neither of those assumptions may be correct in your particular case. Not the fault of Zoomdata because no software can correct a poor choice of data for analysis.

See Zoomdata.
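For a rough feel of the “micro-aggregate delegation” idea described in the post, the pattern is: split the range, ask for a tiny aggregate per slice, and stream the partial results as they arrive. A sketch (run_query is a placeholder for whatever Impala client you use; the table and columns are made up):

```python
# Micro-aggregate delegation, roughly: split the time range into small
# windows, ask Impala for a tiny aggregate per window, and yield each
# micro-result as soon as it is ready. run_query() is a placeholder;
# the table and columns are hypothetical.
from datetime import datetime, timedelta

def run_query(sql):
    # Placeholder: swap in a real Impala client (e.g. a DB-API cursor).
    print("would run:", sql)
    return {"events": 0}

def micro_aggregates(start, end, window=timedelta(minutes=5)):
    t = start
    while t < end:
        sql = (
            "SELECT COUNT(*) AS events FROM raw_events "
            f"WHERE ts >= '{t}' AND ts < '{t + window}'"
        )
        yield t, run_query(sql)   # each tiny result can update a visual
        t += window

for window_start, result in micro_aggregates(
        datetime(2013, 8, 21), datetime(2013, 8, 21, 1)):
    print(window_start, result)
```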

Entity Resolution for Big Data

Filed under: BigData,Entity Resolution — Patrick Durusau @ 3:29 pm

Entity Resolution for Big Data by Benjamin Bengfort.

From the post:

A Summary of the KDD 2013 Tutorial Taught by Dr. Lise Getoor and Dr. Ashwin Machanavajjhala

Entity Resolution is becoming an important discipline in Computer Science and in Big Data, especially with the recent release of Google’s Knowledge Graph and the open Freebase API. Therefore it is exceptionally timely that last week at KDD 2013, Dr. Lise Getoor of the University of Maryland and Dr. Ashwin Machanavajjhala of Duke University will be giving a tutorial on Entity Resolution for Big Data. We were fortunate enough to be invited to attend a run through workshop at the Center for Scientific Computation and Mathematical Modeling at College Park, and wanted to highlight some of the key points for those unable to attend.

A summary that makes you regret not seeing the tutorial!
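Stripped of everything interesting, the core loop in most entity resolution pipelines is blocking plus pairwise matching. A toy sketch (not taken from the tutorial; the records, blocking key and threshold are all made up):

```python
# Toy entity resolution: block records on a cheap key, then compare pairs
# within each block with a string similarity measure.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp."},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Apex Ltd"},
]

def block_key(rec):
    return rec["name"][:2].lower()      # cheap blocking key

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score > 0.6:                 # threshold is arbitrary
            matches.append((a["id"], b["id"], round(score, 2)))

print(matches)   # Acme Corp. and ACME Corporation end up paired
```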

FoundationDB: Version 1.0 and Pricing Announced!

Filed under: FoundationDB,NoSQL — Patrick Durusau @ 3:17 pm

FoundationDB: Version 1.0 and Pricing Announced!

From the post:

After a successful 18-month Alpha and Beta testing program involving more than 2,000 participants, we’re very excited to announce that we’ve released version 1.0 of FoundationDB and general availability pricing!

Built on a distributed shared-nothing architecture, FoundationDB is a unique database technology that combines the time-proven power of ACID transactions with the scalability, fault tolerance, and operational elegance of distributed NoSQL databases.

You can download FoundationDB and use it under our Community License today and run as many server processes as you’d like to in non-production use, and use up to six processes in production for free! You don’t even have to sign up – just go to our download page for instant access. You’ll get all the technical goodness of FoundationDB – exceptional fault tolerance, high performance distributed ACID transactions, and access to our growing catalog of open source layers – regardless of whether you’re a community user or a paying customer.

Have a big application that needs more than six processes in production, or want your FoundationDB cluster supported? We’re also offering commercial licensing and support priced starting at $99 per server process per month. Check out our commercial license and support plans on our pricing page.

I don’t know if FoundationDB will meet your requirements but I can say their business model should set the standard for software offerings.

High quality software with aggressive pricing and no registration required for the community edition.

I am downloading the community version now.

When are you going to grab a copy?
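If you do grab the community edition, the Python binding gives a quick feel for the transaction model. A minimal sketch (assumes the fdb package and a running local cluster; the API version number depends on the release you install):

```python
# Minimal FoundationDB sketch using the Python binding. Assumes `fdb` is
# installed and a local cluster is running; the api_version value is an
# assumption and should match your release.
import fdb

fdb.api_version(100)
db = fdb.open()

# Single operations on the database object are transactional on their own.
db[b"hello"] = b"world"
print(db[b"hello"])

# Multiple operations grouped into one ACID transaction:
@fdb.transactional
def set_pair(tr, prefix):
    tr[prefix + b":created"] = b"2013-08-21"
    tr[prefix + b":status"] = b"ok"     # both writes commit atomically

set_pair(db, b"doc/1")
print(db[b"doc/1:status"])
```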

Cleaning Data with OpenRefine

Filed under: Data Quality,OpenRefine — Patrick Durusau @ 2:54 pm

Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh, and, Max De Wilde.

From the post:

Don’t take your data at face value. That is the key message of this tutorial which focuses on how scholars can diagnose and act upon the accuracy of data. In this lesson, you will learn the principles and practice of data cleaning, as well as how OpenRefine can be used to perform four essential tasks that will help you to clean your data:

  1. Remove duplicate records
  2. Separate multiple values contained in the same field
  3. Analyse the distribution of values throughout a data set
  4. Group together different representations of the same reality

These steps are illustrated with the help of a series of exercises based on a collection of metadata from the Powerhouse museum, demonstrating how (semi-)automated methods can help you correct the errors in your data.

(…)

If you only remember one thing from this lesson, it should be this: all data is dirty, but you can do something about it. As we have shown here, there is already a lot you can do yourself to increase data quality significantly. First of all, you have learned how you can get a quick overview of how many empty values your dataset contains and how often a particular value (e.g. a keyword) is used throughout a collection. This lesson also demonstrated how to solve recurrent issues such as duplicates and spelling inconsistencies in an automated manner with the help of OpenRefine. Don’t hesitate to experiment with the cleaning features, as you’re performing these steps on a copy of your data set, and OpenRefine allows you to trace back all of your steps in case you have made an error.

It is so rare that posts have strong introductions and conclusions that I had to quote both of them.

Great introduction to OpenRefine.

I fully agree that all data is dirty, and that you can do something about it.

However, data is dirty or clean only from a certain point of view.

You may “clean” data in a way that makes it incompatible with my input methods. For me, the data remains “dirty.”

Or to put it another way, data cleaning is like housekeeping. It comes around day after day. You may as well plan for it.
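For readers who live in code rather than a GUI, the four tasks in the quoted lesson have rough pandas equivalents. A hedged sketch (the column names are made up, and OpenRefine's clustering is far smarter than the lowercase-and-strip normalization used here):

```python
# Rough pandas equivalents of the four OpenRefine tasks quoted above.
# The columns are hypothetical and step 4 is only a crude normalization,
# not OpenRefine's key-collision or nearest-neighbour clustering.
import pandas as pd

df = pd.DataFrame({
    "record_id": [1, 2, 2, 3],
    "categories": ["Ceramics|Glass", "Glass", "Glass", "ceramics "],
})

# 1. Remove duplicate records
df = df.drop_duplicates(subset="record_id")

# 2. Separate multiple values contained in the same field
df = df.assign(categories=df["categories"].str.split("|")).explode("categories")

# 3. Analyse the distribution of values throughout the data set
print(df["categories"].value_counts())

# 4. Group together different representations of the same reality
df["categories"] = df["categories"].str.strip().str.lower()
print(df["categories"].value_counts())
```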

Solr Tutorial [No Ads]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 2:38 pm

Solr Tutorial from the Apache Software Foundation.

A great tutorial on Solr that is different from most of the Solr tutorials you will ever see.

There are no ads, popup or otherwise. 😉

Should be the first tutorial that you recommend for anyone new to Solr!

PS: You do not have to give your email address, phone number, etc. to view the tutorial.

The Curse of Enterprise Search… [9% Solutions]

Filed under: Marketing,Search Requirements,Searching — Patrick Durusau @ 2:23 pm

The Curse of Enterprise Search and How to Break It by Maish Nichani.

From the post:

The Curse

Got enterprise search? Try answering these questions: Are end users happy? Has decision-making improved? Productivity up? Knowledge getting reused nicely? Your return-on-investment positive? If you’re finding it tough to answer these questions then most probably you’re under the curse of enterprise search.

The curse is cast when you purchase an enterprise search software and believe that it will automagically solve all your problems the moment you switch it on. You believe that the boatload of money you just spent on it justifies the promised magic of instant findability. Sadly, this belief cannot be further from the truth.

Search needs to be designed. Your users and content are unique to your organisation. Search needs to work with your users. It needs to make full use of the type of content you have. Search really needs to be designed.

Don’t believe in the curse? Consider these statistics from the Enterprise Search and Findability Survey 2013 done by Findwise with 101 practitioners working for global companies:

  • Only 9% said it was easy to find the right information within the organisation
  • Only 19% said they were happy with the existing search application in their organisation
  • Only 20% said they had a search strategy in place

Just in case you need some more numbers when pushing your better solution to enterprise search.

I wonder how search customers would react to an application that made it easy to find the right data 20% of the time?

Just leaving room for future versions and enhancements. 😉

Maish isn’t handing out silver bullets but a close read will improve your search application (topic map or not).

August 19, 2013

Systems that run forever self-heal and scale (Scaling Topic Maps?)

Filed under: Concurrent Programming,Erlang — Patrick Durusau @ 2:43 pm

Systems that run forever self-heal and scale by Joe Armstrong.

From the description:

Joe Armstrong outlines the architectural principles needed for building scalable fault-tolerant systems built from small isolated parallel components which communicate through well-defined protocols.

Great visuals on the difference between imperative programming and concurrent programming.

About half of the data transmission from smart phones uses Erlang.

A very high level view of the architectural principles for building scalable fault-tolerant systems.

All of Joe’s talk is important but for today I want to focus on his first principle for scalable fault-tolerant systems:

ISOLATION.

Joe enumerates the benefits of isolation of processes as follows:

Isolation enables:

  • Fault-tolerant
  • Scalability
  • Reliability
  • Testability
  • Comprehensibility
  • Code Upgrade

Are you aware of any topic map engine that uses multiple, isolated processes for merging topics?

Not threads, but processes.

Threads managed by an operating system scheduler are not really parallel processes, whatever their appearance to the casual user. Erlang processes, on the other hand, do run in parallel, and when more processes are required, you simply add more hardware.

We could take a clue from Scalable SPARQL Querying of Large RDF Graphs by Jiewen Huang, Daniel J. Abadi, and Kun Ren: partitioning parts of a topic map into different data stores and querying each store for a part of any query.

But that’s adapting data to a sequential process: not a bad solution, but one that you will have to repeat as data or queries change and evolve. Pseudo-parallelism.

One advantage of a concurrent process approach built on immutable topics, associations, and occurrences (see Working with Immutable Data by Saša Jurić) would be that different processes could apply different merging tests to the same set of topics, associations, and occurrences.

Or the speed of your answer might depend on whether you have sent a query over a “free” interface, which is supported by a few processes or over a subscription interface, which has dozens if not hundreds of processes at your disposal.
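A crude way to picture the process-per-merging-test idea in Python (processes, not threads) is a pool of workers, each applying a different merging test to the same immutable set of topics. A sketch, with the merging tests reduced to trivial placeholders:

```python
# Separate OS processes each apply a different merging test to the same
# immutable collection of topics. The "tests" here are trivial placeholders;
# real merging rules (subject identifiers, subject locators, ...) would go
# in their place.
from multiprocessing import Pool

TOPICS = (
    {"id": "t1", "subject_identifiers": ("http://example.org/a",)},
    {"id": "t2", "subject_identifiers": ("http://example.org/a",)},
    {"id": "t3", "subject_identifiers": ("http://example.org/b",)},
)

def merge_by_subject_identifier(topics):
    groups = {}
    for t in topics:
        for si in t["subject_identifiers"]:
            groups.setdefault(si, []).append(t["id"])
    merged = {k: v for k, v in groups.items() if len(v) > 1}
    return ("by-subject-identifier", merged)

def merge_by_id_prefix(topics):          # stand-in for some other merging test
    return ("by-id-prefix", {})

if __name__ == "__main__":
    tests = [merge_by_subject_identifier, merge_by_id_prefix]
    with Pool(processes=len(tests)) as pool:
        results = [pool.apply_async(test, (TOPICS,)) for test in tests]
        for r in results:
            print(r.get())
```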

The speed and comprehensiveness of a topic map answer to any query might be an economic model for a public topic map service.

If all I wanted to know about Anthony Weiner was “Vote NO!”, that could be free.

If you wanted pics, vics and all, that could be a different price.

August 18, 2013

Distributed Machine Learning with Spark using MLbase

Filed under: Machine Learning,MLBase,Spark — Patrick Durusau @ 1:01 pm

Apache Spark: Distributed Machine Learning with Spark using MLbase by Ameet Talwalkar and Evan Sparks.

From the description:

In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.

Useful links:

http://mlbase.org

http://spark-project.org/

http://incubator.apache.org/projects/spark.html

Suggestion: When you make a video of a presentation, don’t include members of the audience eating (pizza in this case). It’s distracting.

From: http://mlbase.org

  • MLlib: A distributed low-level ML library written directly against the Spark runtime that can be called from Scala and Java. The current library includes common algorithms for classification, regression, clustering and collaborative filtering, and will be included as part of the Spark v0.8 release.
  • MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib. It will be released in conjunction with MLlib as a separate project.
  • ML Optimizer: This layer aims to simplify ML problems for End Users by automating the task of model selection. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI. This component is under active development.
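MLlib’s Python bindings arrived after the v0.8 release window mentioned above, but they give a quick flavor of the library. A hedged sketch of training a classifier (assumes a working PySpark install with the MLlib bindings; the input path is made up):

```python
# A small taste of MLlib from Python: train a logistic regression model on
# labeled points. Assumes a working PySpark install with the MLlib Python
# bindings; the input path is hypothetical.
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-sketch")

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])   # label first, then features

data = sc.textFile("data/labeled_points.csv").map(parse)
model = LogisticRegressionWithSGD.train(data, iterations=100)

errors = data.filter(lambda p: model.predict(p.features) != p.label).count()
print("training error: %f" % (errors / float(data.count())))
sc.stop()
```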

The goal of this project, to make machine learning easier for developers and end users, is a laudable one.

And it is the natural progression of a technology from being experimental to common use.

On the other hand, I am uneasy about the weight users will put on results, while not understanding biases or uncertainties that are cooked into the data or algorithms.

I don’t think there is a solution to the bias/uncertainty problem other than to become more knowledgeable about machine learning.

Not that you will win an argument with an end user who keeps pointing to a result as though it were untouched by human biases.

But you may be able to better avoid such traps for yourself and your clients.

August 17, 2013

Got Genitalia?

Filed under: Humor,Language — Patrick Durusau @ 6:23 pm

Extensive timelines of slang for genitalia by Nathan Yau.

Nathan has discovered interactive time lines of slang for male and female genitalia. (Goes back to 1250-1300 CE.)

If you know Anthony Weiner, please forward these links to his attention.

If you don’t know Anthony Weiner, take this as an opportunity to expand your twitter repartee.

AT4AM: The XML Web Editor Used By…

Filed under: Editor,EU,Semantics — Patrick Durusau @ 4:27 pm

AT4AM: The XML Web Editor Used By Members Of European Parliament

From the post:

AT4AM – Authoring Tool for Amendments – is a web editor provided to Members of European Parliament (MEPs) that has greatly improved the drafting of amendments at European Parliament since its introduction in 2010.

The tool, developed by the Directorate for Innovation and Technological Support of European Parliament (DG ITEC) has replaced a system based on a collection of macros developed in MS Word and specific ad hoc templates.

Moving beyond guessing the semantics of an author depends upon those semantics being documented at the point of creation.

Having said that, I think we all acknowledge that for the average user, RDF and its kin were DOA.

Interfaces such as AT4AM, if they can be extended to capture the semantics of their authors, would be a step in the right direction.

BTW, see the AT4AM homepage, complete with live demo.

Creating a Solr <search> HTML element…

Filed under: Interface Research/Design,Javascript,Solr — Patrick Durusau @ 4:07 pm

Creating a Solr <search> HTML element with AngularJS! by John Berryman.

From the post:

Of late we’ve been playing around with EmberJS for putting together slick client-side apps. But one thing that bothers me is how heavy-weight it feels. Another thing that concerns me is that AngularJS is really getting a lot of good attention and I want to make sure I’m not missing the boat! Here, look, just check out the emberjs/angularjs Google Trends plot. (See more at: http://www.opensourceconnections.com/2013/08/11/creating-a-search-html-element-with-angularjs/)

It’s great to have a rocking search, topic map, or other retrieval application.

However, to make any sales, it needs to also deliver content to users.

I know, pain in the ass but people who pay for things want a result on the screen, intangible though it may be. 😉

Parallel Astronomical Data Processing with Python:…

Filed under: Astroinformatics,Parallel Programming,Python — Patrick Durusau @ 3:52 pm

Parallel Astronomical Data Processing with Python: Recipes for multicore machines by Bruce Berriman.

From the post:

Most astronomers (myself included) have a high performance compute engine on their desktops. Modern computers now contain multicore processors, whose development was prompted by the need to reduce heat dissipation and power consumption but which give users a powerful processing machine at their fingertips. Singh, Browne and Butler have recently posted a preprint on astro-ph, submitted to Astronomy and Computing, that offers recipes in Python for running data parallel processing on multicore machines. Such machines offer an alternative to grids, clouds and clusters for many tasks, and the authors give examples based on commonly used astronomy toolkits.

The paper restricts itself to the use of CPython’s native multiprocessing module, for two reasons: much astronomical software is written in it, and it places sufficiently strong restrictions on managing threads launched by the OS that it can make parallel jobs run slower than serial jobs (not so for other flavors of Python, though, such as PyPy and Jython). The authors also chose to study data parallel applications, which are common in astronomy, rather than task parallel applications. The heart of the paper is a comparison of three approaches to multiprocessing in Python, with sample code snippets for each:
(…)
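The simplest of those recipes, a Pool.map over independent chunks of data, looks roughly like this (process_file is a placeholder for real per-image work; the paper also covers queue- and Process-based variants):

```python
# Data-parallel processing with multiprocessing.Pool: each worker process
# handles one file independently. process_file() is a placeholder for real
# per-image work (calibration, source extraction, ...).
import glob
import os
from multiprocessing import Pool, cpu_count

def process_file(path):
    # Real code would open the FITS file and do per-image work here;
    # the file size stands in for an actual measurement.
    return path, os.path.getsize(path)

if __name__ == "__main__":
    files = glob.glob("data/*.fits")
    with Pool(processes=cpu_count()) as pool:
        for path, result in pool.map(process_file, files):
            print(path, result)
```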

Bruce’s quick overview will give you the motivation to read this paper.

Astronomical data is easier to process in parallel than some data.

Suggestions on how to transform other data to make it easier to process in parallel?

frak

Filed under: Regex,Regexes,Text Mining — Patrick Durusau @ 1:57 pm

frak

From the webpage:

frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible.
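frak itself is a Clojure library, but the idea is easy to sketch: build a trie of the input strings and emit a regular expression that shares prefixes instead of a flat alternation. A toy Python version, without frak's optimizations:

```python
# Toy version of the frak idea: turn a set of strings into a prefix-sharing
# regular expression instead of a plain alternation.
import re

END = object()          # marks the end of a complete word in the trie

def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return trie

def to_regex(node):
    optional = END in node
    branches = [re.escape(ch) + to_regex(child)
                for ch, child in node.items() if ch is not END]
    if not branches:
        return ""
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    return "(?:" + body + ")?" if optional else body

words = ["foo", "foot", "bar", "baz"]
pattern = to_regex(build_trie(words))
print(pattern)                                         # (?:foo(?:t)?|ba(?:r|z))
print(all(re.fullmatch(pattern, w) for w in words))    # True
```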

This looks quite useful for text mining.

A large amount of which is on the near horizon.

I first saw this in Nat Torkington’s Four short links: 16 August 2013.

August 16, 2013

Finding Parties Named in U.S. Law…

Filed under: Law,Natural Language Processing,NLTK,Python — Patrick Durusau @ 4:59 pm

Finding Parties Named in U.S. Law using Python and NLTK by Gary Sieling.

From the post:

U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.

To get at this information, we need to read the Code XML, and use a natural language processing library to get at the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part of speech tagging, and named entity recognition. (If interested in the subject see my review of “Natural Language Processing with Python“, a book which covers this library in detail)
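The NLTK pipeline Gary uses boils down to tokenize, tag, chunk. A minimal sketch on a sentence of statute-like text (assumes NLTK plus its tokenizer, tagger and named-entity chunker data have been downloaded; the sentence is made up):

```python
# Named entity extraction with NLTK: tokenize, part-of-speech tag, then
# chunk named entities. Assumes nltk is installed and the required models
# have been fetched via nltk.download().
import nltk

sentence = ("The Archivist of the United States shall transmit the report "
            "to the Congress and the Library of Congress.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

for subtree in tree.subtrees():
    if subtree.label() in ("ORGANIZATION", "GPE", "PERSON"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```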

I would rather know who paid for particular laws but that requires information external to the Code XML data set. 😉

A very good exercise to become familiar with both NLTK and the Code XML data set.

Semantic Computing of Moods…

Filed under: Music,Music Retrieval,Semantics,Tagging — Patrick Durusau @ 4:46 pm

Semantic Computing of Moods Based on Tags in Social Media of Music by Pasi Saari, Tuomas Eerola. (IEEE Transactions on Knowledge and Data Engineering, 2013; : 1 DOI: 10.1109/TKDE.2013.128)

Abstract:

Social tags inherent in online music services such as Last.fm provide a rich source of information on musical moods. The abundance of social tags makes this data highly beneficial for developing techniques to manage and retrieve mood information, and enables study of the relationships between music content and mood representations with data substantially larger than that available for conventional emotion research. However, no systematic assessment has been done on the accuracy of social tags and derived semantic models at capturing mood information in music. We propose a novel technique called Affective Circumplex Transformation (ACT) for representing the moods of music tracks in an interpretable and robust fashion based on semantic computing of social tags and research in emotion modeling. We validate the technique by predicting listener ratings of moods in music tracks, and compare the results to prediction with the Vector Space Model (VSM), Singular Value Decomposition (SVD), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). The results show that ACT consistently outperforms the baseline techniques, and its performance is robust against a low number of track-level mood tags. The results give validity and analytical insights for harnessing millions of music tracks and associated mood data available through social tags in application development.

These results make me wonder if the results of tagging represent the average semantic resolution that users want.

Obviously a musician or musicologist would want far finer and sharper distinctions, at least for music of interest to them. Or substitute the domain of your choice. Domain experts want precision, while the average user muddles along with coarser divisions.

We already know from Karen Drabenstott’s work (Subject Headings and the Semantic Web) that library classification systems are too complex for the average user and even most librarians.

On the other hand, we all have some sense of the wasted time and effort caused by the uncharted semantic sea where Google and others practice catch and release with semantic data.

Some of the unanswered questions that remain:

How much semantic detail is enough?

For which domains?

Who will pay for gathering it?

What economic model is best?

BirdWatch v0.2…

Filed under: Graphics,Tweets,Visualization — Patrick Durusau @ 4:17 pm

BirdWatch v0.2: Tweet Stream Analysis with AngularJS, ElasticSearch and Play Framework by Matthias Nehlsen.

From the post:

I am happy to get a huge update of the BirdWatch application out of the way. The changes are much more than what I would normally want to work on for a single article, but then again there is enough interesting stuff going on in this new version for multiple blog articles to come. Initially this application was only meant to be an exercise in streaming information to web clients. But in the meantime I have noticed that this application can be useful and interesting beyond being a mere learning exercise. Let me explain what it has evolved to:

BirdWatch is an open-source real-time tweet search engine for a defined area of interest. I am running a public instance of this for software engineering related tweets. The application subscribes to all Tweets containing at least one out of a set of terms (such as AngularJS, Java, JavaScript, MongoDB, Python, Scala, …). The application receives all those tweets immediately through the Twitter Streaming API. The limitation here is that the delivery is capped to one percent of all Tweets. This is plenty for a well defined area of interest, considering that Twitter processes more than 400 million tweets per day.

Just watching the public feed is amusing.

As Matthias says, there is a lot more that could be done with the incoming feed.

For some well defined area, you could be streaming the latest tweets on particular subjects or even who to follow, after you have harvested enough tweets.
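If you want to harvest a similar stream yourself, the Twitter Streaming API term filter is only a few lines with tweepy. A hedged sketch (uses tweepy's older Stream/StreamListener API, which has since changed; the credentials are placeholders):

```python
# Filtered Twitter stream on a set of terms, similar in spirit to what
# BirdWatch subscribes to. Uses tweepy's older Stream/StreamListener API;
# the credentials are placeholders.
import tweepy

TERMS = ["AngularJS", "Scala", "MongoDB", "Python"]

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.user.screen_name, status.text)

    def on_error(self, status_code):
        return False if status_code == 420 else True   # stop on rate limiting

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=PrintListener())
stream.filter(track=TERMS)
```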

See the project at GitHub.

ST_Geometry Aggregate Functions for Hive…

Filed under: Geographic Data,Geographic Information Retrieval,Hadoop,Hive — Patrick Durusau @ 4:00 pm

ST_Geometry Aggregate Functions for Hive in Spatial Framework for Hadoop by Jonathan Murphy.

From the post:

We are pleased to announce that the ST_Geometry aggregate functions are now available for Hive, in the Spatial Framework for Hadoop. The aggregate functions can be used to perform a convex-hull, intersection, or union operation on geometries from multiple records of a dataset.

While the non-aggregate ST_ConvexHull function returns the convex hull of the geometries passed like a single function call, the ST_Aggr_ConvexHull function accumulates the geometries from the rows selected by a query, and performs a convex hull operation over those geometries. Likewise, ST_Aggr_Intersection and ST_Aggr_Union aggregrate the geometries from multiple selected rows, to perform intersection and union operations, respectively.

The example given covers earthquake data and California-county data.

I have a weakness for aggregating functions as you know. 😉

The other point these aggregate functions illustrate is that sometimes you want subjects to be treated as independent of each other and sometimes you want to treat them as a group.

Depends upon your requirements.

There really isn’t a one-size-fits-all granularity of subject identity.

imGraph: A distributed in-memory graph database

Filed under: Graphs,Neo4j,Titan — Patrick Durusau @ 3:51 pm

imGraph: A distributed in-memory graph database by Salim Jouili.

From the post:

Eura Nova contribution

Having these challenges in mind, we introduce a new graph database system called imGraph. We have considered the random access requirement for large graphs as a key factor on deciding the type of storage. Then, we have designed a graph database where all data is stored in memory so the speed of random access is maximized. However, as large graphs can not be completely loaded in the RAM of a single machine, we designed imGraph as distributed graph database. That is, the vertices and the edges are partitioned into subsets, and each subset is located in the memory of one machine belonging to the involved machines (see the following figure). Furthermore, we implemented on imGraph a graph traversal engine that takes advantage of distributed parallel computing and fast in-memory random access to gain performance.
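The partitioning idea in the quote, every vertex living in the memory of exactly one machine, is easy to picture with a toy hash partitioner (the machine names and vertex ids are made up; a real system also has to place edges and route traversals between partitions):

```python
# Toy hash partitioning of vertices across machines, to picture the scheme
# described above: each vertex is assigned to exactly one machine's memory.
import hashlib

MACHINES = ["node-a", "node-b", "node-c"]   # hypothetical machine names

def owner(vertex_id):
    digest = hashlib.md5(vertex_id.encode()).hexdigest()
    return MACHINES[int(digest, 16) % len(MACHINES)]

partitions = {m: [] for m in MACHINES}
for v in ["user:1", "user:2", "page:77", "page:78", "tag:graph"]:
    partitions[owner(v)].append(v)

for machine, vertices in partitions.items():
    print(machine, vertices)
```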

I haven’t verified the numbers, but imGraph is reported to have beaten both Titan and Neo4j by 150x and 200x, respectively, on particular data sets.

Enough to justify reading the paper.

The test machines each had 7.5 GB of memory, which seems a little light to me.

Particularly since the IBM Power 770 server can expand to hold 4 TB of memory.

Imagine the performance on five (5) machines where each has 4 TB of memory.

True, it would be more expensive but at some point, there is only so much performance you can squeeze out of a commodity box.

BTW, the paper: imGraph: A distributed in-memory graph database.

Dynamic Simplification

Filed under: Graphics,Subject Identity,Topic Maps,Visualization — Patrick Durusau @ 3:18 pm

Dynamic Simplification by Mike Bostock.

From the post:

A combination of the map zooming and dynamic simplification demonstrations: as the map zooms in and out, the simplification area threshold is adjusted so that it is always appropriate to the current scale. Thus, the map looks good and renders quickly at all points during the animation.

While d3.js is the secret sauce here, I am posting this for the notion of “dynamic simplification.”

What if the presentation of a topic map were to use “dynamic simplification?”

Say that I have a topic map with topics for all the tweets on some major event (Lady Gaga’s latest video (NSFW), for example).

The number of tweets for some locations would display as a mass of dots. Not terribly informative.

If, on the other hand, from say a country-wide perspective, the tweets were displayed as a solid form and only became distinguishable on zooming in (looking to see if Dick Cheney tweeted about it), that would be more useful.

Or at least more useful for some use cases.

The Dynamic Simplification demo is part of a large collection of amazing visuals you will find at: http://bl.ocks.org/mbostock.

August 15, 2013

Video Tutorials on Hadoop for Microsoft Developers

Filed under: Hadoop,Microsoft — Patrick Durusau @ 7:05 pm

Video Tutorials on Hadoop for Microsoft Developers by Marc Holmes.

From the post:

If you’re a Microsoft developer and stepping into Hadoop for the first time with HDP for Windows, then we thought we’d highlight this fantastic resource from Rob Kerr, Chris Campbell and Garrett Edmondson: the MSBIAcademy.

They’ve produced a high quality, practical series of videos covering anything from essential MapReduce concepts, to using .NET (in this case C#) to submit MapReduce jobs to HDInsight, to using Apache Pig for Web Log Analysis. As you may know, HDInsight is based on Hortonworks HDP platform.

More resources on Hadoop by Microsoft! (see: Microsoft as Hadoop Leader)

The more big data, the greater the need for accurate and repeatable semantics.

Go big data!

