Books: Inquiry-Based Learning Guides

October 30th, 2014

Books: Inquiry-Based Learning Guides

From the webpage:

The DAoM library includes 11 inquiry-based books freely available for classroom use. These texts can be used as semester-long content for themed courses (e.g. geometry, music and dance, the infinite, games and puzzles), or individual chapters can be used as modules to experiment with inquiry-based learning and to help supplement typical topics with classroom tested, inquiry based approaches (e.g. rules for exponents, large numbers, proof). The topic index provides an overview of all our book chapters by topic.

From the about page:

Discovering the Art of Mathematics (DAoM), is an innovative approach to teaching mathematics to liberal arts and humanities students, that offers the following vision:

Mathematics for Liberal Arts students will be actively involved in authentic mathematical experiences that

  • are both challenging and intellectually stimulating,
  • provide meaningful cognitive and metacognitive gains, and,
  • nurture healthy and informed perceptions of mathematics, mathematical ways of thinking, and the ongoing impact of mathematics not only on STEM fields but also on the liberal arts and humanities.

DAoM provides a wealth of resources for mathematics faculty to help realize this vision in their Mathematics for Liberal Arts (MLA) courses: a library of 11 inquiry-based learning guides, extensive teacher resources and many professional development opportunities. These tools enable faculty to transform their classrooms to be responsive to current research on learning (e.g. National Academy Press’s How People Learn) and the needs and interests of MLA students without enormous start-up costs or major restructuring.

All of these books approach mathematics from a variety of perspectives, but I didn’t see anything in How People Learn: Brain, Mind, Experience, and School: Expanded Edition (2000) that suggests such techniques are limited to the teaching of mathematics.

It is easy to envision teaching CS or semantic technologies using the same methods.

What inquiries would you construct for the exploration of semantic diversity? Roles? Contexts? Or the lack of a solution to semantic diversity? What are its costs?

I think semantic integration could become a higher priority if the costs of semantic diversity, or the savings from semantic integration, could be demonstrated.

For example, most Americans nod along with public service energy conservation messages, just as people do with semantic integration pitches.

But if it were demonstrated for a particular home that 1/8 of the energy for heating or cooling was being wasted, and that a $X investment would lower utility bills by $N, there would be a much different reaction.

There are broad numbers on the losses from semantic diversity but broad numbers are not “in our budget” line items. It’s time to develop strategies that can expose the hidden costs of semantic diversity. Perhaps inquiry-based learning could be that tool.

I first saw this in a tweet by Steven Strogatz.

Pinned Tabs: myNoSQL

October 30th, 2014

Alex Popescu & Ana-Maria Bacalu have added a new feature at myNoSQL called “Pinned Tabs.”

The feature started on 28 Oct. 2014 and consists of very short (2-3 sentence) descriptions with links on NoSQL, Big Data, and related topics.

Today’s “pinned tabs” included:

03: If you don’t test for the possible failures, you might be in for a surprise. Stripe has tried a more organized chaos monkey attack and discovered a scenario in which their Redis cluster is losing all the data. They’ll move to Amazon RDS PostgreSQL. From an in-memory smart key-value engine to a relational database.

Game Day Exercises at Stripe: Learning from kill -9

04: How a distributed database should really behave in front of massive failures. Netflix recounts their recent experience of having 218 Cassandra nodes rebooted without losing availability. At all.

How Netflix Handled the Reboot of 218 Cassandra Nodes

Curated news saves time and attention span!

Enjoy!

How to run the Caffe deep learning vision library…

October 29th, 2014

How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board by Pete Warden.

From the post:

Jetson board (photo by Gareth Halfacree)

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of the Jetson, it’s a small development board that includes Nvidia’s TK1 mobile GPU chip. The TK1 is starting to appear in high-end tablets, and has 192 cores so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘Alexnet’ model, a version of the Imagenet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-intensive for a mobile device but not too crazy.

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Hardware fun for the middle of your week!

192 cores for under $200, plus GPU experience.

Introducing osquery

October 29th, 2014

Introducing osquery by Mike Arpaia.

From the post:

Maintaining real-time insight into the current state of your infrastructure is important. At Facebook, we’ve been working on a framework called osquery which attempts to approach the concept of low-level operating system monitoring a little differently.

Osquery exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems. With osquery, SQL tables represent the current state of operating system attributes, such as:

  • running processes
  • loaded kernel modules
  • open network connections

SQL tables are implemented via an easily extendable API. Several tables already exist and more are being written. To best understand the expressiveness that is afforded to you by osquery, consider the following examples….

I haven’t installed osquery yet, but I suspect that most of the data it collects is already available through a variety of admin tools. It just isn’t available through a single tool that enables you to query across tables to combine that data. That is the part that intrigues me.
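As a minimal sketch of that cross-table capability, here is a Python wrapper around the osqueryi shell. It assumes osquery is installed and osqueryi is on your PATH; the processes and listening_ports tables and the --json output flag come from the osquery documentation, so check your version's schema before relying on the column names.

```python
import json
import subprocess

# Which processes are listening on which ports? A cross-table join over
# two standard osquery tables (listening_ports and processes), keyed on pid.
QUERY = """
SELECT p.name, p.pid, lp.port, lp.protocol
FROM listening_ports AS lp
JOIN processes AS p ON lp.pid = p.pid;
"""

def run_osquery(sql):
    """Run a SQL statement through the osqueryi shell and parse its JSON output."""
    out = subprocess.check_output(["osqueryi", "--json", sql])
    return json.loads(out)

if __name__ == "__main__":
    for row in run_osquery(QUERY):
        print(f"{row['name']:<25} pid={row['pid']:<8} {row['protocol']}/{row['port']}")
```

The same pattern works for any pair of tables the osquery schema exposes, which is exactly the "query across tables" step that traditional single-purpose admin tools do not give you.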

Code and documentation are on GitHub.

AsterixDB: Better than Hadoop? Interview with Mike Carey

October 29th, 2014

AsterixDB: Better than Hadoop? Interview with Mike Carey by Roberto V. Zicari.

The first two questions should be enough incentive to read the full interview and get your blood pumping in the middle of the week:

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”

We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.

Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… J)

I knew words would fail me if I tried to describe the AsterixDB logo so I simply reproduce the logo:

(AsterixDB logo image)

Read the interview in full and then grab a copy of AsterixDB.

The latest beta release is 0.8.6. The software is available under the Apache License 2.0.

Microsoft Garage

October 29th, 2014

Microsoft Garage

From the webpage:

Hackers, makers, artists, tinkerers, musicians, inventors — on any given day you’ll find them in The Microsoft Garage.

We are a community of interns, employees, and teams from everywhere in the company who come together to turn our wild ideas into real projects. This site gives you early access to projects as they come to life.

Tell us what rocks, and what doesn’t.

Welcome to The Microsoft Garage.

Two projects (out of several) that I thought were interesting:

Collaborate

Host or join collaboration sessions on canvases that hold text cards and images. Ink on the canvas to organize your content, or manipulate the text and images using pinch, drag, and rotate gestures.

Floatz

Floatz, a Microsoft Garage project, lets you float an idea out to the people around you, and see what they think. Join in on any nearby Floatz conversation, or start a new one with a question, idea, or image that you share anonymously with people nearby.

Share your team spirit at a sporting event, or your awesome picture of the band at a rock concert. Ask the locals where to get a good meal when visiting an unfamiliar neighborhood. Speak your mind, express your feelings, and find out if there are others around you who feel the same way—all from the safety of an anonymous screen name in Floatz.

I understand the theory of asking for advice anonymously, but I assume that also means the person answering is anonymous as well. Yes? I don’t have a cellphone so I can’t test that theory. Comments?

On the other hand, if you are sharing data with known and unknown others and need to know which “anonymous” screen names to trust (for example, don’t trust names with FBI, CIA or NSA preceded or followed by hyphens), then Floatz could be very useful.

I first saw this in Nat Torkington’s Four short links: 23 October 2014.

UX Directory

October 29th, 2014

UX Directory

Two Hundred and six (206) resources listed under the following categories:

  • A/B Testing
  • Blogroll
  • Design Evaluation Tools
  • Dummy Text Generators
  • Find Users to Test
  • Gamification Companies
  • Heatmaps / Mouse Tracking Tools
  • Information Architecture Creation Tools
  • Information Architecture Evaluation Tools
  • Live Chat Support Tools
  • Marketing Automation Tools
  • Mobile Prototyping
  • Mockup User Testing
  • Multi-Use UX Tools
  • Screen Capture Tools
  • Synthetic Eye-Tracking Tools
  • User Testing Companies
  • UX Agencies / Consultants
  • UX Survey Tools
  • Web Analytics Tools
  • Webinar / Web Conference Platforms
  • Wireframe/Mockup Tools

If you have a new resource that should be on this list, contact abetteruserexperience@gmail.com

I first saw this in Nat Torkington’s Four short links: 28 October 2014.

Datomic Pull API

October 28th, 2014

Datomic Pull API by Stuart Holloway.

From the post:

Datomic‘s new Pull API is a declarative way to make hierarchical selections of information about entities. You supply a pattern to specify which attributes of the entity (and nested entities) you want to pull, and db.pull returns a map for each entity.

Pull API vs. Entity API

The Pull API has two important advantages over the existing Entity API:

Pull uses a declarative, data-driven spec, whereas Entity encourages building results via code. Data-driven specs are easier to build, compose, transmit and store. Pull patterns are smaller than entity code that does the same job, and can be easier to understand and maintain.

Pull API results match standard collection interfaces (e.g. Java maps) in programming languages, where Entity results do not. This eliminates the need for an additional allocation/transformation step per entity.

A sign that it is time to catch up on what has been happening with Datomic!
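To make the declarative-vs-code contrast concrete without assuming a Datomic installation, here is a tiny Python analogy of a pull: the pattern is plain data and a generic function walks it. This is only the shape of the idea, not the Datomic API; the entity and attribute names are made up for illustration.

```python
def pull(entity, pattern):
    """Select attributes named by a declarative pattern from a nested dict.

    A pattern is a list of attribute names; a dict inside the pattern asks
    for a nested pull on the value of that attribute.
    """
    result = {}
    for item in pattern:
        if isinstance(item, dict):                      # nested selection
            for attr, subpattern in item.items():
                value = entity.get(attr)
                if isinstance(value, list):
                    result[attr] = [pull(v, subpattern) for v in value]
                elif value is not None:
                    result[attr] = pull(value, subpattern)
        elif item in entity:                            # plain attribute
            result[item] = entity[item]
    return result

# Hypothetical entity; attribute names are illustrative only.
band = {
    "band/name": "Example Band",
    "band/founded": 1971,
    "band/members": [
        {"person/name": "A. Singer", "person/instrument": "vocals"},
        {"person/name": "B. Drummer", "person/instrument": "drums"},
    ],
}

# The pattern is data you can store, transmit, or compose, which is the
# selling point of the Pull API, rather than hand-written traversal code.
print(pull(band, ["band/name", {"band/members": ["person/name"]}]))
# {'band/name': 'Example Band',
#  'band/members': [{'person/name': 'A. Singer'}, {'person/name': 'B. Drummer'}]}
```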

HTML5 is a W3C Recommendation

October 28th, 2014

HTML5 is a W3C Recommendation

From the post:

(graphic omitted) The HTML Working Group today published HTML5 as W3C Recommendation. This specification defines the fifth major revision of the Hypertext Markup Language (HTML), the format used to build Web pages and applications, and the cornerstone of the Open Web Platform.

“Today we think nothing of watching video and audio natively in the browser, and nothing of running a browser on a phone,” said Tim Berners-Lee, W3C Director. “We expect to be able to share photos, shop, read the news, and look up information anywhere, on any device. Though they remain invisible to most users, HTML5 and the Open Web Platform are driving these growing user expectations.”

HTML5 brings to the Web video and audio tracks without needing plugins; programmatic access to a resolution-dependent bitmap canvas, which is useful for rendering graphs, game graphics, or other visual images on the fly; native support for scalable vector graphics (SVG) and math (MathML); annotations important for East Asian typography (Ruby); features to enable accessibility of rich applications; and much more.

The HTML5 test suite, which includes over 100,000 tests and continues to grow, is strengthening browser interoperability. Learn more about the Test the Web Forward community effort.

With today’s publication of the Recommendation, software implementers benefit from Royalty-Free licensing commitments from over sixty companies under W3C’s Patent Policy. Enabling implementers to use Web technology without payment of royalties is critical to making the Web a platform for innovation.

Read the Press Release, testimonials from W3C Members, and acknowledgments. For news on what’s next after HTML5, see W3C CEO Jeff Jaffe’s blog post: Application Foundations for the Open Web Platform. We also invite you to check out our video Web standards for the future.

Just in case you have been holding off on HTML5 until it became a W3C Recommendation. ;-)

Enjoy!

Category Theory for Programmers: The Preface

October 28th, 2014

Category Theory for Programmers: The Preface by Bartosz Milewski.

From the post:

For some time now I’ve been floating the idea of writing a book about category theory that would be targeted at programmers. Mind you, not computer scientists but programmers — engineers rather than scientists. I know this sounds crazy and I am properly scared. I can’t deny that there is a huge gap between science and engineering because I have worked on both sides of the divide. But I’ve always felt a very strong compulsion to explain things. I have tremendous admiration for Richard Feynman who was the master of simple explanations. I know I’m no Feynman, but I will try my best. I’m starting by publishing this preface — which is supposed to motivate the reader to learn category theory — in hopes of starting a discussion and soliciting feedback.

I will attempt, in the space of a few paragraphs, to convince you that this book is written for you, and whatever objections you might have to learning one of the most abstract branches of mathematics in your “copious spare time” are totally unfounded.

My optimism is based on several observations. First, category theory is a treasure trove of extremely useful programming ideas. Haskell programmers have been tapping this resource for a long time, and the ideas are slowly percolating into other languages, but this process is too slow. We need to speed it up.

Second, there are many different kinds of math, and they appeal to different audiences. You might be allergic to calculus or algebra, but it doesn’t mean you won’t enjoy category theory. I would go as far as to argue that category theory is the kind of math that is particularly well suited for the minds of programmers. That’s because category theory — rather than dealing with particulars — deals with structure. It deals with the kind of structure that makes programs composable.

Composition is at the very root of category theory — it’s part of the definition of the category itself. And I will argue strongly that composition is the essence of programming. We’ve been composing things forever, long before some great engineer came up with the idea of a subroutine. Some time ago the principles of structural programming revolutionized programming because they made blocks of code composable. Then came object oriented programming, which is all about composing objects. Functional programming is not only about composing functions and algebraic data structures — it makes concurrency composable — something that’s virtually impossible with other programming paradigms.
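As a minimal illustration of the composition the preface keeps returning to, here is a sketch in Python; the compose helper and the example names are mine, not from the book.

```python
from functools import reduce

def compose(*fs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs, lambda x: x)

# Small, separately testable pieces...
strip = str.strip
lower = str.lower
words = str.split

# ...composed into a larger one without writing any new control flow.
normalize_and_tokenize = compose(words, lower, strip)

print(normalize_and_tokenize("  The Essence OF Programming  "))
# ['the', 'essence', 'of', 'programming']
```

Category theory studies exactly this kind of plumbing in the abstract: which things compose, and what laws the composition obeys.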

See the rest of the preface and the promise to provide examples in code for most major concepts.

Are you ready for discussion and feedback?

On Excess: Susan Sontag’s Born-Digital Archive

October 28th, 2014

On Excess: Susan Sontag’s Born-Digital Archive by Jeremy Schmidt & Jacquelyn Ardam.

From the post:


In the case of the Sontag materials, the end result of Deep Freeze and a series of other processing procedures is a single IBM laptop, which researchers can request at the Special Collections desk at UCLA’s Research Library. That laptop has some funky features. You can’t read its content from home, even with a VPN, because the files aren’t online. You can’t live-Tweet your research progress from the laptop — or access the internet at all — because the machine’s connectivity features have been disabled. You can’t copy Annie Leibovitz’s first-ever email — “Mat and I just wanted to let you know we really are working at this. See you at dinner. xxxxxannie” (subject line: “My first Email”) — onto your thumb drive because the USB port is locked. And, clearly, you can’t save a new document, even if your desire to type yourself into recent intellectual history is formidable. Every time it logs out or reboots, the laptop goes back to ground zero. The folders you’ve opened slam shut. The files you’ve explored don’t change their “Last Accessed” dates. The notes you’ve typed disappear. It’s like you were never there.

Despite these measures, real limitations to our ability to harness digital archives remain. The born-digital portion of the Sontag collection was donated as a pair of external hard drives, and that portion is composed of documents that began their lives electronically and in most cases exist only in digital form. While preparing those digital files for use, UCLA archivists accidentally allowed certain dates to refresh while the materials were in “thaw” mode; the metadata then had to be painstakingly un-revised. More problematically, a significant number of files open as unreadable strings of symbols because the software with which they were created is long out of date. Even the fully accessible materials, meanwhile, exist in so many versions that the hapless researcher not trained in computer forensics is quickly overwhelmed.

No one would dispute the need for an authoritative copy of Sontag’s archive, or at least as close to authoritative as humanly possible. The heavily protected laptop makes sense to me, assuming that the archive considers that to be the authoritative copy.

What has me puzzled, particularly since there are binary formats not recognized in the archive, is why a non-authoritative copy of the archive isn’t online. Any number of people may still possess the software necessary to read the files and/or be able to decrypt the file formats. Recovery practiced on a non-authoritative copy would be a net gain to the archive, which may well encounter such files in the future.

After searching the Online Archive of California, I did encounter Finding Aid for the Susan Sontag papers, ca. 1939-2004 which reports:

Restrictions Property rights to the physical object belong to the UCLA Library, Department of Special Collections. Literary rights, including copyright, are retained by the creators and their heirs. It is the responsibility of the researcher to determine who holds the copyright and pursue the copyright owner or his or her heir for permission to publish where The UC Regents do not hold the copyright.

Availability Open for research, with following exceptions: Boxes 136 and 137 of journals are restricted until 25 years after Susan Sontag’s death (December 28, 2029), though the journals may become available once they are published.

Unfortunately, this finding aid does not mention Sontag’s computer or the transfer of the files to a laptop. A search of Melvyl (library catalog) finds only one archival collection and that is the one mentioned above.

I have written to the special collections library for clarification and will update this post when an answer arrives.

I mention this collection because of Sontag’s importance for a generation and because digital archives will soon be the majority of archival cases. One hopes the standard practice will be to donate all rights to an archival repository to ensure its availability to future generations of scholars.

Text Visualization Browser [100 Techniques]

October 28th, 2014

Text Visualization Browser: A Visual Survey of Text Visualization Techniques by Kostiantyn Kucher and Andreas Kerren.

From the abstract:

Text visualization has become a growing and increasingly important subfield of information visualization. Thus, it is getting harder for researchers to look for related work with specific tasks or visual metaphors in mind. In this poster, we present an interactive visual survey of text visualization techniques that can be used for the purposes of search for related work, introduction to the subfield and gaining insight into research trends.

Even better is the Text Visualization Browser webpage, where one hundred (100) different techniques have thumbnails and links to the original papers.

Quite remarkable. I don’t think I can name anywhere close to all the techniques.

You?

Announcing Clasp

October 28th, 2014

Announcing Clasp by Christian Schafmeister.

From the post:

Click here for up to date build instructions

Today I am happy to make the first release of the Common Lisp implementation “Clasp”. Clasp uses LLVM as its back-end and generates native code. Clasp is a super-set of Common Lisp that interoperates smoothly with C++. The goal is to integrate these two very different languages together as seamlessly as possible to provide the best of both worlds. The C++ interoperation allows Common Lisp programmers to easily expose powerful C++ libraries to Common Lisp and solve complex programming challenges using the expressive power of Common Lisp. Clasp is licensed under the LGPL.

Common Lisp is considered by many to be one of the most expressive programming languages in existence. Individuals and small teams of programmers have created fantastic applications and operating systems within Common Lisp that require much larger effort when written in other languages. Common Lisp has many language features that have not yet made it into the C++ standard. Common Lisp has first-class functions, dynamic variables, true macros for meta-programming, generic functions, multiple return values, first-class symbols, exact arithmetic, conditions and restarts, optional type declarations, a programmable reader, a programmable printer and a configurable compiler. Common Lisp is the ultimate programmable programming language.

Clojure is a dialect of Lisp, which means you may spot situations where Common Lisp would be the better solution, especially if you can draw upon C++ libraries.

The project is “actively looking” for new developers. Could be your opportunity to get in on the ground floor!

Madison: Semantic Listening Through Crowdsourcing

October 28th, 2014

Madison: Semantic Listening Through Crowdsourcing by Jane Friedhoff.

From the post:

Our recent work at the Labs has focused on semantic listening: systems that obtain meaning from the streams of data surrounding them. Chronicle and Curriculum are recent examples of tools designed to extract semantic information (from our corpus of news coverage and our group web browsing history, respectively). However, not every data source is suitable for algorithmic analysis–and, in fact, many times it is easier for humans to extract meaning from a stream. Our new projects, Madison and Hive, are explorations of how to best design crowdsourcing projects for gathering data on cultural artifacts, as well as provocations for the design of broader, more modular kinds of crowdsourcing tools.

(image omitted)

Madison is a crowdsourcing project designed to engage the public with an under-viewed but rich portion of The New York Times’s archives: the historical ads neighboring the articles. News events and reporting give us one perspective on our past, but the advertisements running alongside these articles provide a different view, giving us a sense of the culture surrounding these events. Alternately fascinating, funny and poignant, they act as commentary on the technology, economics, gender relations and more of that time period. However, the digitization of our archives has primarily focused on news, leaving the ads with no metadata–making them very hard to find and impossible to search for them. Complicating the process further is that these ads often have complex layouts and elaborate typefaces, making them difficult to differentiate algorithmically from photographic content, and much more difficult to scan for text. This combination of fascinating cultural information with little structured data seemed like the perfect opportunity to explore how crowdsourcing could form a source of semantic signals.

From the projects homepage:

Help preserve history with just one click.

The New York Times archives are full of advertisements that give glimpses into daily life and cultural history. Help us digitize our historic ads by answering simple questions. You’ll be creating a unique resource for historians, advertisers and the public — and leaving your mark on history.

Get started with our collection of ads from the 1960s (additional decades will be opened later)!

I would like to see a Bible transcription project that was that user friendly!

But then, the goal of the New York Times is to include as many people as possible.

Looking forward to more news on Madison!

Guide to Law Online

October 28th, 2014

Guide to Law Online

From the post:

The Guide to Law Online, prepared by the Law Library of Congress Public Services Division, is an annotated guide to sources of information on government and law available online. It includes selected links to useful and reliable sites for legal information.


The Guide to Law Online is an annotated compendium of Internet links; a portal of Internet sources of interest to legal researchers. Although the Guide is selective, inclusion of a site by no means constitutes endorsement by the Law Library of Congress.

In compiling this list, emphasis wherever possible has been on sites offering the full texts of laws, regulations, and court decisions, along with commentary from lawyers writing primarily for other lawyers. Materials related to law and government that were written by or for lay persons also have been included, as have government sites that provide even quite general information about themselves or their agencies.

Every direct source listed here was successfully tested before being added to the list. Users, however, should be aware that changes of Internet addresses and file names are frequent, and even sites that usually function well do not always do so. Thus a successful connection may sometimes require several attempts. If such an attempt to access a file indicates an error, the information can sometimes still be accessed by truncating the URL address to access a directory at the site.

Last Updated: 07/10/2014

While I was at the Library of Congress site today, I encountered this set of law guides and thought they might be of interest. It was updated in July of this year, so most of the links should still work.

Congress.gov Officially Out of Beta

October 28th, 2014

Congress.gov Officially Out of Beta

From the post:

The free legislative information website, Congress.gov, is officially out of beta form, and beginning today includes several new features and enhancements. URLs that include beta.Congress.gov will be redirected to Congress.gov. The site now includes the following:

New Feature: Congress.gov Resources

  • A new resources section providing an A to Z list of hundreds of links related to Congress
  • An expanded list of “most viewed” bills each day, archived to July 20, 2014

New Feature: House Committee Hearing Videos

  • Live streams of House Committee hearings and meetings, and an accompanying archive to January, 2012

Improvement: Advanced Search

  • Support for 30 new fields, including nominations, Congressional Record and name of member

Improvement: Browse

  • Days in session calendar view
  • Roll Call votes
  • Bill by sponsor/co-sponsor

When the Library of Congress, in collaboration with the U.S. Senate, U.S. House of Representatives and the Government Printing Office (GPO) released Congress.gov as a beta site in the fall of 2012, it included bill status and summary, member profiles and bill text from the two most recent congresses at that time – the 111th and 112th.

Since that time, Congress.gov has expanded with the additions of the Congressional Record, committee reports, direct links from bills to cost estimates from the Congressional Budget Office, legislative process videos, committee profile pages, nominations, historic access reaching back to the 103rd Congress and user accounts enabling saved personal searches. Users have been invited to provide feedback on the site’s functionality, which has been incorporated along with the data updates.

Plans are in place for ongoing enhancements in the coming year, including addition of treaties, House and Senate Executive Communications and the Congressional Record Index.

Field Value Lists:

Use search fields in the main search box (available on most pages), or via the advanced and command line search pages. Use terms or codes from the Field Value Lists with corresponding search fields: Congress [congressId], Action – Words and Phrases [billAction], Subject – Policy Area [billSubject], or Subject (All) [allBillSubjects].

Congresses (44, stops with 70th Congress (1927-1929))

Legislative Subject Terms, Subject Terms (541), Geographic Entities (279), Organizational Names (173). (total 993)

Major Action Codes (98)

Policy Area (33)

Search options:

Search Form: “Choose collections and fields from dropdown menus. Add more rows as needed. Use Major Action Codes and Legislative Subject Terms for more precise results.”

Command Line: “Combine fields with operators. Refine searches with field values: Congresses, Major Action Codes, Policy Areas, and Legislative Subject Terms. To use facets in search results, copy your command line query and paste it into the home page search box.”

Search Tips Overview: “You can search Congress.gov using the quick search available on most pages or via the advanced search page. Advanced search gives you the option of using a guided search form or a command line entry box.” (includes examples)

Misc.

You can follow this project @congressdotgov.

Orientation to Legal Research & Congress.gov is available both as a seminar (in-person) and webinar (online).

Enjoy!

I first saw this at Congress.gov is Out of Beta with New Features by Africa S. Hands.

Qatar Digital Library

October 28th, 2014

New Qatar Digital Library Offers Readers Unrivalled Collection of Precious Heritage Material

From the post:

The Qatar Digital Library which provides new public access to over half a million pages of precious historic archive and manuscript material has been launched today thanks to the British Library-Qatar Foundation Partnership project. This incredible resource makes documents and other items relating to the modern history of Qatar, the Gulf region and beyond, fully accessible and free of charge to researchers and the general public through a state-of-the-art online portal.

In line with the principles of the Qatar National Vision 2030, which aims to preserve the nation’s heritage and enhance Arab and Islamic values and identity, the launch of the Qatar Digital Library supports QF’s aim of unlocking human potential for the benefit of Qatar and the world.

Qatar National Library, a member of Qatar Foundation, has a firm commitment to preserving and showcasing Qatar’s heritage and promoting education and community development by sharing knowledge and providing resources to students, researchers, and the wider community.

With Qatar Foundation’s support, an expert, technical team has been preserving and digitising materials from the UK’s India Office Records archives over the past two years in order to be shared publicly on the portal owned and managed by Qatar National Library.

The Qatar Digital Library provides online access to over 475,000 pages from the India Office Records that date from the mid-18th century to 1951, and relate to modern historic events in Qatar, the Gulf and the Middle East region.

In addition, the Qatar Digital Library shares 25,000 pages of medieval Arab Islamic sciences manuscripts, historical maps, photographs and sound recordings.

These precious materials are being made available online for the first time. The Qatar Digital Library provides clear descriptions of the digitised materials in Arabic and English, and can be accessed for personal and research use from anywhere free of charge.

The Qatar Digital Library (homepage).

Simply awesome!

A great step towards unlocking the riches of Arab scholarship.

I first saw this in British Library Launches Qatar Digital Library by Africa S. Hands.

Building a language-independent keyword-based system with the Wikipedia Miner

October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are usually based on only one language, usually english. However, it would be better to be able to process document in any language. For example, a case in a recommender system would be a user that speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the french term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would also be able to merge easily into the same keyword to build a language-independent user profile that will be used for accurate French and English recommendations.

This articles shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.

The NSA can hire translators, so I won’t bother sharing with them this technique for harnessing the thousands of expert hours embedded in Wikipedia.

Bear in mind that Wikipedia does not cover a large number of minority languages and dialects, and it certainly does not capture deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.
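For a rough sketch of the merging step the post describes (mapping surface terms in different languages onto a single Wikipedia concept so that "Airplane" and "Avion" count as one keyword), here is some Python. The cross-language table below is hand-built and the concept IDs are illustrative only; in practice the table would be populated from Wikipedia's inter-language links, which is what the Wikipedia Miner exploits.

```python
from collections import Counter

# Hand-built stand-in for Wikipedia's inter-language links: each surface
# term, in any language, maps to one language-independent concept ID.
TERM_TO_CONCEPT = {
    ("en", "airplane"): "Q197",     # concept IDs are illustrative only
    ("fr", "avion"): "Q197",
    ("en", "paris"): "Q90",
    ("fr", "paris"): "Q90",
}

def concept_profile(rated_pages):
    """Build a language-independent keyword profile from (lang, keywords, rating) tuples."""
    profile = Counter()
    for lang, keywords, rating in rated_pages:
        for kw in keywords:
            concept = TERM_TO_CONCEPT.get((lang, kw.lower()))
            if concept is not None:
                profile[concept] += rating
    return profile

history = [
    ("en", ["Airplane", "Paris"], +1),   # liked an English page
    ("fr", ["Avion"], +1),               # liked a French page
]

print(concept_profile(history))
# Counter({'Q197': 2, 'Q90': 1})  i.e. 'Airplane' and 'Avion' merged into one concept
```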

On the Computational Complexity of MapReduce

October 27th, 2014

On the Computational Complexity of MapReduce by Jeremy Kun.

From the post:

I recently wrapped up a fun paper with my coauthors Ben Fish, Adam Lelkes, Lev Reyzin, and Gyorgy Turan in which we analyzed the computational complexity of a model of the popular MapReduce framework. Check out the preprint on the arXiv.

As usual I’ll give a less formal discussion of the research here, and because the paper is a bit more technically involved than my previous work I’ll be omitting some of the more pedantic details. Our project started after Ben Moseley gave an excellent talk at UI Chicago. He presented a theoretical model of MapReduce introduced by Howard Karloff et al. in 2010, and discussed his own results on solving graph problems in this model, such as graph connectivity. You can read Karloff’s original paper here, but we’ll outline his model below.

Basically, the vast majority of the work on MapReduce has been algorithmic. What I mean by that is researchers have been finding more and cleverer algorithms to solve problems in MapReduce. They have covered a huge amount of work, implementing machine learning algorithms, algorithms for graph problems, and many others. In Moseley’s talk, he posed a question that caught our eye:

Is there a constant-round MapReduce algorithm which determines whether a graph is connected?

After we describe the model below it’ll be clear what we mean by “solve” and what we mean by “constant-round,” but the conjecture is that this is impossible, particularly for the case of sparse graphs. We know we can solve it in a logarithmic number of rounds, but anything better is open.

In any case, we started thinking about this problem and didn’t make much progress. To the best of my knowledge it’s still wide open. But along the way we got into a whole nest of more general questions about the power of MapReduce. Specifically, Karloff proved a theorem relating MapReduce to a very particular class of circuits. What I mean is he proved a theorem that says “anything that can be solved in MapReduce with so many rounds and so much space can be solved by circuits that are yae big and yae complicated, and vice versa.”

But this question is so specific! We wanted to know: is MapReduce as powerful as polynomial time, our classical notion of efficiency (does it equal P)? Can it capture all computations requiring logarithmic space (does it contain L)? MapReduce seems to be somewhere in between, but its exact relationship to these classes is unknown. And as we’ll see in a moment the theoretical model uses a novel communication model, and processors that never get to see the entire input. So this led us to a host of natural complexity questions:

  1. What computations are possible in a model of parallel computation where no processor has enough space to store even one thousandth of the input?
  2. What computations are possible in a model of parallel computation where processors can’t request or send specific information from/to other processors?
  3. How the hell do you prove that something can’t be done under constraints of this kind?
  4. How do you measure the increase of power provided by giving MapReduce additional rounds or additional time?

These questions are in the domain of complexity theory, and so it makes sense to try to apply the standard tools of complexity theory to answer them. Our paper does this, laying some brick for future efforts to study MapReduce from a complexity perspective.

Given the prevalence of MapReduce, progress on understanding what is or is not possible is an important topic.

The first two complexity questions strike me as the ones most relevant to topic map processing with MapReduce, depending upon the nature of your merging algorithm.
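To make the round-counting concrete, here is a toy simulation (my sketch, not from the paper) of the standard label-propagation approach to connectivity. Each call to mapreduce_round is one map/shuffle/reduce pass, every vertex adopts the smallest label among itself and its neighbors, and on a path graph the number of rounds grows with the graph rather than staying constant, which is the gap the open question is about.

```python
from collections import defaultdict

def mapreduce_round(labels, edges):
    """One simulated MapReduce round of min-label propagation.

    Map: each edge (u, v) emits (u, label[v]) and (v, label[u]);
         each vertex also emits its own label.
    Shuffle: group the emitted labels by vertex.
    Reduce: each vertex keeps the minimum label it received.
    """
    grouped = defaultdict(list)                    # the "shuffle"
    for u in labels:
        grouped[u].append(labels[u])               # map: own label
    for u, v in edges:
        grouped[u].append(labels[v])               # map: neighbor labels
        grouped[v].append(labels[u])
    return {u: min(vals) for u, vals in grouped.items()}   # reduce

def connected_components(vertices, edges):
    """Run rounds until the labels stop changing; returns (labels, round_count)."""
    labels = {v: v for v in vertices}
    rounds = 0
    while True:
        new_labels = mapreduce_round(labels, edges)
        rounds += 1
        if new_labels == labels:
            break
        labels = new_labels
    return labels, rounds

# A path graph: vertex 0's label needs a number of rounds proportional to
# the path length to reach the far end, i.e. not a constant.
n = 16
vertices = list(range(n))
edges = [(i, i + 1) for i in range(n - 1)]
labels, rounds = connected_components(vertices, edges)
print(rounds, set(labels.values()))   # 16 {0}: one component, roughly n rounds
```

Smarter algorithms get this down to a logarithmic number of rounds, as the post notes; whether a constant number of rounds suffices is the open question.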

Enjoy!

Data Modelling: The Thin Model [Entities with only identifiers]

October 27th, 2014

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users’.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[...] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the meta data or attributes almost as an after thought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.
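A quick sketch of the difference, in Python; the domain and attribute names are invented for illustration.

```python
# A "thin" entity: syntactically fine, but nothing memorable for the people
# in the modelling session to latch onto.
thin_customer = {"customer_id": "C-1042"}

# The same entity after asking "what else is memorable here?": the extra
# attributes (several of them identifying in their own right) make the
# entity recognizable to the audience.
fleshed_out_customer = {
    "customer_id": "C-1042",
    "name": "Acme Fabrication Ltd.",
    "vat_number": "GB123456789",
    "primary_contact": "J. Smith",
    "segment": "manufacturing",
    "first_order_date": "2012-03-14",
}
```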

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

Apache Flink (formerly Stratosphere) Competitor to Spark

October 27th, 2014

From the Apache Flink 0.6 release page:

What is Flink?

Apache Flink is a general-purpose data processing engine for clusters. It runs on YARN clusters on top of data stored in Hadoop, as well as stand-alone. Flink currently has programming APIs in Java and Scala. Jobs are executed via Flink's own runtime engine. Flink features:

Robust in-memory and out-of-core processing: once read, data stays in memory as much as possible, and is gracefully de-staged to disk in the presence of memory pressure from limited memory or other applications. The runtime is designed to perform very well both in setups with abundant memory and in setups where memory is scarce.

POJO-based APIs: when programming, you do not have to pack your data into key-value pairs or some other framework-specific data model. Rather, you can use arbitrary Java and Scala types to model your data.

Efficient iterative processing: Flink contains explicit "iterate" operators that enable very efficient loops over data sets, e.g., for machine learning and graph applications.

A modular system stack: Flink is not a direct implementation of its APIs but a layered system. All programming APIs are translated to an intermediate program representation that is compiled and optimized via a cost-based optimizer. Lower-level layers of Flink also expose programming APIs for extending the system.

Data pipelining/streaming: Flink's runtime is designed as a pipelined data processing engine rather than a batch processing engine. Operators do not wait for their predecessors to finish in order to start processing data. This results to very efficient handling of large data sets.

The latest version is Apache Flink 0.6.1.

See more information at the incubator homepage. Or consult the Apache Flink mailing lists.

The Quickstart is… wait for it… word count on Hamlet. Nothing against the Bard, but you do know that everyone dies at the end. Yes? Seems like a depressing example.

What would you suggest as an example application for this type of software?

I first saw this on Danny Bickson’s blog as Apache flink.

Extended Artificial Memory:…

October 27th, 2014

Extended Artificial Memory: Toward an Integral Cognitive Theory of Memory and Technology by Lars Ludwig. (PDF) (Or you can contribute to the cause by purchasing a printed or Kindle copy of: Information Technology Rethought as Memory Extension: Toward an integral cognitive theory of memory and technology.)

Conventional book-selling wisdom is that a title should provoke people to pick up the book. First step towards a sale. Must be the thinking behind this title. Just screams “Read ME!”

;-)

Seriously, I have read some of the PDF version and this is going on my holiday wish list as a hard copy request.

Abstract:

This thesis introduces extended artificial memory, an integral cognitive theory of memory and technology. It combines cross-scientific analysis and synthesis for the design of a general system of essential knowledge-technological processes on a sound theoretical basis. The elaboration of this theory was accompanied by a long-term experiment for understanding [Erkenntnisexperiment]. This experiment included the agile development of a software prototype (Artificial Memory) for personal knowledge management.

In the introductory chapter 1.1 (Scientific Challenges of Memory Research), the negative effects of terminological ambiguity and isolated theorizing to memory research are discussed.

Chapter 2 focuses on technology. The traditional idea of technology is questioned. Technology is reinterpreted as a cognitive actuation process structured in correspondence with a substitution process. The origin of technological capacities is found in the evolution of eusociality. In chapter 2.2, a cognitive-technological model is sketched. In this thesis, the focus is on content technology rather than functional technology. Chapter 2.3 deals with different types of media. Chapter 2.4 introduces the technological role of language-artifacts from different perspectives, combining numerous philosophical and historical considerations. The ideas of chapter 2.5 go beyond traditional linguistics and knowledge management, stressing individual constraints of language and limits of artificial intelligence. Chapter 2.6 develops an improved semantic network model, considering closely associated theories.

Chapter 3 gives a detailed description of the universal memory process enabling all cognitive technological processes. The memory theory of Richard Semon is revitalized, elaborated and revised, taking into account important newer results of memory research.

Chapter 4 combines the insights on the technology process and the memory process into a coherent theoretical framework. Chapter 4.3.5 describes four fundamental computer-assisted memory technologies for personally and socially extended artificial memory. They all tackle basic problems of the memory-process (4.3.3). In chapter 4.3.7, the findings are summarized and, in chapter 4.4, extended into a philosophical consideration of knowledge.

Chapter 5 provides insight into the relevant system landscape (5.1) and the software prototype (5.2). After an introduction into basic system functionality, three exemplary, closely interrelated technological innovations are introduced: virtual synsets, semantic tagging, and Linear Unit tagging.

The common memory capture (of two or more speakers) imagery is quite powerful. It highlights a critical aspect of topic maps.

Be forewarned this is European style scholarship, where the reader is assumed to be comfortable with philosophy, linguistics, etc., in addition to the more narrow aspects of computer science.

To see these ideas in practice: http://www.artificialmemory.net/.

Slides on What is Artificial Memory.

I first saw this in a note from Jack Park, the source of many interesting and useful links, papers and projects.

Think Big Challenge 2014 [Census Data - Anonymized]

October 27th, 2014

Think Big Challenge 2014 [Census Data - Anonymized]

The Think Big Challenge 2014 closed October 19, 2014, but the data sets for that challenge remain available.

From the data download page:

This subdirectory contains a small extract of the data set (1,000 records). There are two data sets provided:

A complete set of records from after the year 1820 is available for download from Amazon S3 at https://s3.amazonaws.com/think.big.challenge/AncestryPost1820Data.gz as a 127MB gzip file.

A sample of records pre-1820 for use in the data science “Learning of Common Ancestors” challenge. This can be downloaded at https://s3.amazonaws.com/think.big.challenge/AncestryPre1820Sample.gz as a 4MB gzip file.

The records have been pre-processed:

The contest data set includes both publicly availabl[e] records (e.g., census data) and user-contributed submissions on Ancestry.com. To preserve user privacy, all surnames present in the data have been obscured with a hash function. The hash is constructed such that all occurrences of the same string will result in the same hash code.

Reader exercise: You can find multiple ancestors of yours in these records with different surnames and compare those against the hash function results. How many will you need to reverse the hash function and recover all the surnames? Use other ancestors of yours to check your results.
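A minimal sketch of the attack in Python. The contest does not publish its hash function, so SHA-1 is assumed here purely for illustration; the point is only that any deterministic, unsalted scheme falls to a dictionary of candidate surnames.

```python
import hashlib

def surname_hash(name):
    """Stand-in for the contest's hash: deterministic, so equal names collide.

    The actual function used for the data set is unknown; SHA-1 over the
    normalized surname is an assumption made for illustration only.
    """
    return hashlib.sha1(name.strip().lower().encode("utf-8")).hexdigest()

# Surnames you already know from your own family tree (or any name list).
known_surnames = ["Smith", "Garcia", "Nguyen", "Kowalski", "Haddad"]

# Build the reverse lookup: hash value -> surname.
rainbow = {surname_hash(s): s for s in known_surnames}

# Hashed values as they would appear in the released records.
records = [surname_hash("Smith"), surname_hash("Nguyen"), surname_hash("Obscure")]

for h in records:
    print(h[:12], "->", rainbow.get(h, "(not recovered yet: add more candidate names)"))
```

Because every occurrence of the same surname hashes to the same code, each recovered name immediately de-obscures every record carrying it.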

Take a look at the original contest tasks for inspiration. What other online records would you want to merge with these? Thinking of local newspapers? What about law reporters?

Enjoy!

I first saw this mentioned on Danny Bickson’s blog as: Interesting dataset from Ancestry.com.


Update: I meant to mention Risks of Not Understanding a One-Way Function by Bruce Schneier, to get you started on the deanonymization task. Apologies for the omission.

If you are interested in cryptography issues, following Bruce Schneier’s blog should be on your regular reading list.

Nothing to Hide

October 26th, 2014

Nothing to Hide: Look out for yourself by Nicky Case.

Greg Linden describes it as:

Brilliantly done, free, open source, web-based puzzle game with wonderfully dark humor about ubiquitous surveillance

First and foremost, I sense there is real potential for this to develop into an enjoyable online game.

Second, this could be a way to educate users about security/surveillance threats.

Enjoy!

I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

Death of Yahoo Directory

October 26th, 2014

Progress Report: Continued Product Focus by Jay Rossiter, SVP, Cloud Platform Group.

From the post:

At Yahoo, focus is an important part of accomplishing our mission: to make the world’s daily habits more entertaining and inspiring. To achieve this focus, we have sunset more than 60 products and services over the past two years, and redirected those resources toward products that our users care most about and are aligned with our vision. With even more smart, innovative Yahoos focused on our core products – search, communications, digital magazines, and video – we can deliver the best for our users.

Directory: Yahoo was started nearly 20 years ago as a directory of websites that helped users explore the Internet. While we are still committed to connecting users with the information they’re passionate about, our business has evolved and at the end of 2014 (December 31), we will retire the Yahoo Directory. Advertisers will be upgraded to a new service; more details to be communicated directly.

Understandable but sad. Think of indexing a book that expanded as rapidly as the Internet over the last twenty (20) years. Especially if the content might or might not have any resemblance to already existing content.

The Internet remains in serious need of a curated means of access to quality information. Almost any search returns links ranging from high to questionable quality.

Imagine if Yahoo segregated the top 500 computer science publishers, archives, societies, departments, blogs into a block of searchable content. (The 500 number is wholly arbitrary, could be some other number) Users would pre-qualify themselves as interested in computer science materials and create a market segment for advertising purposes.

Users would get less trash in their results and advertisers would have pre-qualified targets.

A pre-curated search set might mean you would miss an important link, but realistically, few people read beyond the first twenty (20) links anyway. An analysis of search logs at PubMed shows that 80% of users choose a link from the first twenty results.

In theory you may have > 10,000 “hits,” but retrieving all of those to serve to a user is a waste of time.

I suspect it varies by domain, but twenty (20) high-quality “hits” from curated content would be a far cry from average search results now.

I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

The Chapman University Survey on American Fears

October 26th, 2014

The Chapman University Survey on American Fears

From the webpage:

Chapman University has initiated a nationwide poll on what strikes fear in Americans. The Chapman University Survey on American Fears included 1,500 participants from across the nation and all walks of life. The research team leading this effort pared the information down into four basic categories: personal fears, crime, natural disasters and fear factors. According to the Chapman poll, the number one fear in America today is walking alone at night.

A multi-disciplinary team of Chapman faculty and students wanted to capture this information on a year-over-year basis to draw comparisons regarding what items are increasing in fear as well as decreasing. The fears are presented according to fears vs. concerns because that was the necessary phrasing to capture the information correctly.

Your marketing department will find this of interest.

If you are not talking about power, fear or sex, then you aren’t talking about marketing.

IT is no different from any other product or service. Perhaps that’s why the kumbaya approach to selling semantic solutions has done so poorly.

You will need far deeper research than this to integrate fear into your marketing program but at least it is a starting point for discussion.

I first saw this at Full Text Reports as: The Chapman Survey on American Fears

Wastebook 2014

October 25th, 2014

Wastebook 2014: What Washington doesn’t want you to read. (Voodoo Dolls, Gambling Monkeys, Zombies in Love and Paid Vacations for Misbehaving Bureaucrats Top List of the Most Outlandish Government Spending in Wastebook 2014)

From the webpage:

Gambling monkeys, dancing zombies and mountain lions on treadmills are just a few projects exposed in Wastebook 2014 – highlighting $25 billion in Washington’s worst spending of the year.

Wastebook 2014 — the report Washington doesn’t want you to read — reveals the 100 most outlandish government expenditures this year, costing taxpayers billions of dollars.

“With no one watching over the vast bureaucracy, the problem is not just what Washington isn’t doing, but what it is doing,” Dr. Coburn said. “Only someone with too much of someone else’s money and not enough accountability for how it was being spent could come up with some of these projects.”

“I have learned from these experiences that Washington will never change itself. But even if the politicians won’t stop stupid spending, taxpayers always have the last word.”

Congress actually forced federal agencies to waste billions of dollars for purely parochial, political purposes.

For example, lawmakers attached a rider to a larger bill requiring NASA to build a $350 million launch pad tower, which was mothballed as soon as it was completed because the rockets it was designed to test were scrapped years ago. Similarly, when USDA attempted to close an unneeded sheep research station costing nearly $2 million every year to operate, politicians in the region stepped in to keep it open.

Examples of wasteful spending highlighted in “Wastebook 2014” include:

  • Coast guard party patrols – $100,000
  • Watching grass grow – $10,000
  • State department tweets @ terrorists – $3 million
  • Swedish massages for rabbits – $387,000
  • Paid vacations for bureaucrats gone wild – $20 million
  • Mountain lions on a treadmill – $856,000
  • Synchronized swimming for sea monkeys – $50,000
  • Pentagon to destroy $16 billion in unused ammunition — $1 billion
  • Scientists hope monkey gambling unlocks secrets of free will –$171,000
  • Rich and famous rent out their luxury pads tax free – $10 million
  • Studying “hangry” spouses stabbing voodoo dolls – $331,000
  • Promoting U.S. culture around the globe with nose flutists – $90 million

Read the full report here.

Watch the Wastebook 2014 videos here and here and here

Wastebook 2014 runs a total of one hundred and ten (110) pages and has 1,137 footnotes (in many cases with references to data analysis). It occurs to me to ask whether the lavish graphics, design and research were donated by volunteers or were the work of Sen. Coburn’s paid staff.

The other question to ask is what definition of “waste” Sen. Coburn is using.

I suspect the people who were paid monthly salaries for any of the listed projects would disagree that their salaries were “waste,” a sentiment that would be echoed by their landlords, car dealers, grocery stores, etc.

It might be cheaper to simply pay all those staffers and not buy equipment and materials for their projects, but that would have an adverse impact on the vendors of those products and their staffs, who likewise have homes and cars and participate in their local economies.

Not that governments are the sole offenders when it comes to waste but they are easy targets since unlike most corporations, more information is public about their internal operations.

One useful role topic maps could play on questions of “waste” would be tracking the associations between the people involved in a project and all the other participants in the local economy. I think you will find that the economic damage of cutting some “waste” is far higher than the cost of continuing it.

Such a project would give you the data on which to make principled arguments to distinguish between waste with little local impact and waste with a large local impact.
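As a rough illustration of the bookkeeping involved, the sketch below uses entirely invented figures to tie a project’s staff to their local spending. A topic map would capture the same people and relationships as typed associations with explicit roles; plain Python dicts stand in here only to show the shape of the data.

```python
# Minimal sketch, with invented figures, of tracing a "wasteful" project's
# staff to the other participants in the local economy.
project = {
    "name": "Sheep research station",
    "annual_cost": 2_000_000,          # roughly the figure cited in the report
    "staff": ["researcher A", "technician B"],
}

# Each staff member spends locally; amounts below are purely illustrative.
local_spending = {
    "researcher A": {"landlord": 18_000, "grocery": 6_000, "car dealer": 4_000},
    "technician B": {"landlord": 14_000, "grocery": 5_500},
}

local_impact = sum(
    amount
    for person in project["staff"]
    for amount in local_spending.get(person, {}).values()
)
print("Local spending tied to the project: $%d" % local_impact)
```

With real data in place of these stand-ins, the comparison between the project’s budget line and the local spending it supports is exactly the kind of principled argument mentioned above.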

I first saw this at Full Text Reports as: Wastebook 2014: What Washington doesn’t want you to read.

Data Visualization with JavaScript

October 25th, 2014

Data Visualization with JavaScript by Stephen A. Thomas.

From the introduction:

It’s getting hard to ignore the importance of data in our lives. Data is critical to the largest social organizations in human history. It can affect even the least consequential of our everyday decisions. And its collection has widespread geopolitical implications. Yet it also seems to be getting easier to ignore the data itself. One estimate suggests that 99.5% of the data our systems collect goes to waste. No one ever analyzes it effectively.

Data visualization is a tool that addresses this gap.

Effective visualizations clarify; they transform collections of abstract artifacts (otherwise known as numbers) into shapes and forms that viewers quickly grasp and understand. The best visualizations, in fact, impart this understanding subconsciously. Viewers comprehend the data immediately—without thinking. Such presentations free the viewer to more fully consider the implications of the data: the stories it tells, the insights it reveals, or even the warnings it offers. That, of course, defines the best kind of communication.

If you’re developing web sites or web applications today, there’s a good chance you have data to communicate, and that data may be begging for a good visualization. But how do you know what kind of visualization is appropriate? And, even more importantly, how do you actually create one? Answers to those very questions are the core of this book. In the chapters that follow, we explore dozens of different visualizations and visualization techniques and tool kits. Each example discusses the appropriateness of the visualization (and suggests possible alternatives) and provides step-by-step instructions for including the visualization in your own web pages.

To give you a better idea of what to expect from the book, here’s a quick description of what the book is, and what it is not.

The book is part of http://jsdatav.is/, where Stephen maintains his blog, a listing of talks, and a link to his Twitter account.

If you are interested in data visualization with JavaScript, this should be on a short list of bookmarks.

Building Scalable Search from Scratch with ElasticSearch

October 25th, 2014

Building Scalable Search from Scratch with ElasticSearch by Ram Viswanadha.

From the post:

1 Introduction

Savvy is an online community for the world’s product enthusiasts. Our communities are the product trendsetters that the rest of the world follows. Across the site, our users are able to compare products, ask and answer product questions, share product reviews, and generally share their product interests with one another. Savvy1.com boasts a vibrant community that save products on the site at the rate of 1 product every second. We wanted to provide a search bar that can search across various entities in the system – users, products, coupons, collections, etc. – and return the results in a timely fashion.

2 Requirements

The search server should satisfy the following requirements:

  1. Full Text Search: The ability to not only return documents that contain the exact keywords, but also documents that contain words that are related or relevant to the keywords.
  2. Clustering: The ability to distribute data across multiple nodes for load balancing and efficient searching.
  3. Horizontal Scalability: The ability to increase the capacity of the cluster by adding more nodes.
  4. Read and Write Efficiency: Since our application is both read and write heavy, we need a system that allows for high write loads and efficient read times on heavy read loads.
  5. Fault Tolerant: The loss of any node in the cluster should not affect the stability of the cluster.
  6. REST API with JSON: The server should support a REST API using JSON for input and output.

At the time, we looked at Sphinx, Solr and ElasticSearch. The only system that satisfied all of the above requirements was ElasticSearch, and — to sweeten the deal — ElasticSearch provided a way to efficiently ingest and index data in our MongoDB database via the River API so we could get up and running quickly.

If you need an outline for building a basic ElasticSearch system, this is it!

It has the advantage of introducing you to a number of other web technologies that will be handy with ElasticSearch.
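The post itself stays at the architecture level, but requirement 6 (REST API with JSON) is easy to see in miniature. Below is a minimal sketch, assuming a local Elasticsearch node at http://localhost:9200 and invented index and document names, of indexing one document and running a full-text query over HTTP from Python.

```python
import requests

# Invented names, for illustration only: an index "savvy" with type "products".
ES = "http://localhost:9200"

# Index a document; Elasticsearch creates the index on first use.
requests.put(
    ES + "/savvy/products/1",
    json={"name": "Espresso Machine", "tags": ["coffee", "kitchen"]},
)

# Refresh so the document is visible to search immediately (handy in a demo).
requests.post(ES + "/savvy/_refresh")

# A "match" query analyzes the query text (tokenizing, lower-casing, and
# stemming if the field's analyzer does), which gives the full-text behavior
# of requirement 1, as opposed to an exact keyword lookup.
resp = requests.get(
    ES + "/savvy/products/_search",
    json={"query": {"match": {"name": "espresso machines"}}},
)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```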

Enjoy!

Overview App API

October 25th, 2014

Overview App API

From the webpage:

An Overview App is a program that uses Overview.

You can make one. You know you want to.

Using Overview’s App API you can drive Overview’s document handling engine from your own code, create new visualizations that replace Overview’s default Topic Tree, or write interactive document handling or data extraction apps.

If you don’t remember the Overview Project:

Overview is just what you need to search, analyze and cull huge volumes of text or documents. It was built for investigative journalists who go through thousands of pages of material, but it’s also used by researchers facing huge archives and social media analysts with millions of posts. With advanced search and interactive topic modeling, you can:

  • find what you didn’t even know to look for
  • quickly tag or code documents
  • let the computer organize your documents by topic, automatically

Leveraging the capabilities in Overview is a better use of resources than re-inventing basic document handling and search features.