Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 21, 2013

Setting up a Hadoop cluster

Filed under: Documentation,Hadoop,Topic Maps — Patrick Durusau @ 6:36 pm

Setting up a Hadoop cluster – Part 1: Manual Installation by Lars Francke.

From the post:

In the last few months I was tasked several times with setting up Hadoop clusters. Those weren’t huge – two to thirteen machines – but from what I read and hear this is a common use case especially for companies just starting with Hadoop or setting up a first small test cluster. While there is a huge amount of documentation in form of official documentation, blog posts, articles and books most of it stops just where it gets interesting: Dealing with all the stuff you really have to do to set up a cluster, cleaning logs, maintaining the system, knowing what and how to tune etc.

I’ll try to describe all the hoops we had to jump through and all the steps involved to get our Hadoop cluster up and running. Probably trivial stuff for experienced Sysadmins but if you’re a Developer and finding yourself in the “Devops” role all of a sudden I hope it is useful to you.

While working at GBIF I was asked to set up a Hadoop cluster on 15 existing and 3 new machines. So the first interesting thing about this setup is that it is a heterogeneous environment: Three different configurations at the moment. This is where our first goal came from: We wanted some kind of automated configuration management. We needed to try different cluster configurations and we need to be able to shift roles around the cluster without having to do a lot of manual work on each machine. We decided to use a tool called Puppet for this task.

While Hadoop is not currently in production at GBIF there are mid- to long-term plans to switch parts of our infrastructure to various components of the HStack. Namely MapReduce jobs with Hive and perhaps Pig (there is already strong knowledge of SQL here) and also storing of large amounts of raw data in HBase to be processed asynchronously (~500 million records until next year) and indexed in a Lucene/Solr solution possibly using something like Katta to distribute indexes. For good measure we also have fairly complex geographic calculations and map-tile rendering that could be done on Hadoop. So we have those 18 machines and no real clue how they’ll be used and which services we’d need in the end.

Dated (2011), but it illustrates some of the issues I raised in Hadoop Ecosystem Configuration Woes?

Do you keep this level of documentation on your Hadoop installs?

I first saw this in a tweet by Marko A. Rodriguez.

Putting Spark to Use:…

Filed under: Hadoop,MapReduce,Spark — Patrick Durusau @ 5:43 pm

Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications by Justin Kestelyn.

From the post:

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.

Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.

Fast and Easy Big Data Processing with Spark

At its core, Spark provides a general programming model that enables developers to write application by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch.

In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. As illustrated in the figure below, this feature enables:

I would not use the following example to promote Spark:

One of Spark’s most useful features is the interactive shell, bringing Spark’s capabilities to the user immediately – no IDE and code compilation required. The shell can be used as the primary tool for exploring data interactively, or as means to test portions of an application you’re developing.

The screenshot below shows a Spark Python shell in which the user loads a file and then counts the number of lines that contain “Holiday”.

Spark Example

Isn’t that just:

grep holiday WarAndPeace.txt | wc -l
15

?

Grep doesn’t require an IDE or compilation either. Of course, grep isn’t reading from an HDFS file.

The file.filter(lambda line: "Holiday" in line).count() works but some of us prefer the terseness of Unix.
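For comparison, here is roughly what the full Spark shell session looks like. This is a minimal sketch, assuming the standard PySpark shell (where sc is a ready-made SparkContext) and a hypothetical HDFS path, not the exact session from the screenshot:

# In the PySpark shell; `sc` is provided for you.
file = sc.textFile("hdfs:///data/WarAndPeace.txt")           # hypothetical path
count = file.filter(lambda line: "Holiday" in line).count()  # lines containing "Holiday"
print(count)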

Unix text tools for HDFS?

Set The WayBack Machine for 1978 – Destination: Unix

Filed under: Design,Programming — Patrick Durusau @ 3:58 pm

Bell System Technical Journal, v57: i6 July-August 1978

Where you will find the classic UNIX papers by Ritchie, Thompson, Kernighan, Bourne, McIlroy and others.

Before you build another software monolith, you should spend some time reading this set of Unix classics.

Before you object to the age of the materials, can you name another OS that is forty (40)+ years old? (And still popular. I don’t count legacy systems in the basement of the SSA. 😉 )

Perhaps there’s something to the “small tool” mentality of Unix.

I first saw this in a tweet by CompSciFact.

How To Make Operating System by Yourself ?

Filed under: Programming,Software — Patrick Durusau @ 2:52 pm

How To Make Operating System by Yourself? by Jasmin Shah.

From the post:

Having an Operating System named after you, Sounds Amazing ! Isn’t it ?

Specially, after watching the IronMan Series, I am a die hard fan on J.A.R.V.I.S. Operating System.

So, let’s get started to make Operating System on our own. Once you are done with it, Don’t forget to share your operating system with me in the comment section below.

A bit oversold, ;-), but Jasmin walks the reader through using SuseStudio.com to create a complete operating system.

Uses?

Well, an appliance that saves first-time topic map users from installation purgatory is one idea.

Another idea would be to bundle content and/or tutorials with your topic map software.

Or to bundle databases/stores, etc. for a side by side comparison by users on the same content.

What would you put in your “operating system?”

Experimenting with visualisation tools

Filed under: Graphics,Metaphors,Thesaurus,Visualization — Patrick Durusau @ 2:34 pm

Experimenting with visualisation tools by Brian Aitken.

From the post:

Over the past few months I’ve been working to develop some interactive visualisations that will eventually be made available on the Mapping Metaphor website. The project team investigated a variety of visualisation approaches that they considered well suited to both the project data and the connections between the data, and they also identified a number of toolkits that could be used to generate such visualisations.

Brian experiments with the JavaScript InfoVis Toolkit for the Mapping Metaphor with the Historical Thesaurus project.

Interesting read. Promises to cover D3 in a future post.

Could be very useful for other graph or topic map visualizations.

Neo4j 2.0.0-RC1 – Final preparations

Filed under: Graphs,Neo4j — Patrick Durusau @ 1:12 pm

Neo4j 2.0.0-RC1 – Final preparations by Andreas Kollegger.

From the post:

WARNING: This release is not compatible with earlier 2.0.0 milestones. See details below.

The next major version of Neo4j has been under development for almost a year now, methodically elaborated and refined into a solid foundation. Neo4j 2.0 is now feature-complete. We’re pleased to announce the first Release Candidate build is available today.

With that feature-completeness in mind, let’s see what’s on offer…

Andreas summarizes a number of new features for Cypher and concludes with:

To be clear: DO NOT USE THIS RELEASE WITH EXISTING DATA

Consider yourself warned!

Only bug fixes will be addressed between now and the GA release of Neo4j 2.0.

Now would be a good time to grab this release and go bug hunting.

Finding Occam’s razor in an era of information overload

Filed under: Modeling,Skepticism — Patrick Durusau @ 11:54 am

Finding Occam’s razor in an era of information overload

From the post:

How can the actions and reactions of proteins so small or stars so distant they are invisible to the human eye be accurately predicted? How can blurry images be brought into focus and reconstructed?

A new study led by physicist Steve Pressé, Ph.D., of the School of Science at Indiana University-Purdue University Indianapolis, shows that there may be a preferred strategy for selecting mathematical models with the greatest predictive power. Picking the best model is about sticking to the simplest line of reasoning, according to Pressé. His paper explaining his theory is published online this month in Physical Review Letters, a preeminent international physics journal.

“Building mathematical models from observation is challenging, especially when there is, as is quite common, a ton of noisy data available,” said Pressé, an assistant professor of physics who specializes in statistical physics. “There are many models out there that may fit the data we do have. How do you pick the most effective model to ensure accurate predictions? Our study guides us towards a specific mathematical statement of Occam’s razor.”

Occam’s razor is an oft cited 14th century adage that “plurality should not be posited without necessity” sometimes translated as “entities should not be multiplied unnecessarily.” Today it is interpreted as meaning that all things being equal, the simpler theory is more likely to be correct.

Comforting that the principles of good modeling have not changed since the 14th century. (Occam’s Razor)

Bear in mind Occam’s Razor is guidance and not a hard and fast rule.

On the other hand, particularly with “big data,” be wary of complex models.

Especially the ones that retroactively “predict” unique events as a demonstration of their model.

If you are interested in the full “monty:”

Nonadditive Entropies Yield Probability Distributions with Biases not Warranted by the Data by Steve Pressé, Kingshuk Ghosh, Julian Lee, and Ken A. Dill. Phys. Rev. Lett. 111, 180604 (2013)

Abstract:

Different quantities that go by the name of entropy are used in variational principles to infer probability distributions from limited data. Shore and Johnson showed that maximizing the Boltzmann-Gibbs form of the entropy ensures that probability distributions inferred satisfy the multiplication rule of probability for independent events in the absence of data coupling such events. Other types of entropies that violate the Shore and Johnson axioms, including nonadditive entropies such as the Tsallis entropy, violate this basic consistency requirement. Here we use the axiomatic framework of Shore and Johnson to show how such nonadditive entropy functions generate biases in probability distributions that are not warranted by the underlying data.
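For reference, the two entropy forms at issue, in standard notation (not reproduced from the paper): the Boltzmann-Gibbs entropy and the Tsallis (nonadditive) entropy, which recovers Boltzmann-Gibbs in the limit q → 1.

S_{\mathrm{BG}} = -k \sum_i p_i \ln p_i
\qquad
S_q = k \, \frac{1 - \sum_i p_i^{\,q}}{q - 1}, \qquad \lim_{q \to 1} S_q = S_{\mathrm{BG}}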

November 20, 2013

Middle Earth and Hobbits, A Winning Combination!

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 8:16 pm

Google turns to Middle Earth and Hobbits to show off Chrome’s magic by Kevin C. Tofel.

From the post:

Google has a new Chrome Experiment out in the wild — or the wilds, if you prefer. The latest is a showcase for the newest web technologies packed into Chrome for mobile devices, although it works on traditional computers as well. And what better or richer world to explore on your mobile device is there then J.R.R. Tolkien’s Middle Earth?

Point your Chrome mobile browser to middle-earth.thehobbit.com to explore the Trollshaw Forrest, Rivendell and Dol Guldur with additional locations currently locked. Here’s a glimpse of what to expect:

“It may not feel like it, but this cinematic part of the experience was built with just HTML, CSS, and JavaScript. North Kingdom used the Touch Events API to support multi-touch pinch-to-zoom and the Full Screen API to allow users to hide the URL address bar. It looks natural on any screen size thanks to media queries and feels low-latency because of hardware-accelerated CSS Transitions.”

(Note: I repaired the link to http://middle-earth.thehobbit.com; the link as posted simply returned you to the post.)

This project and others like it should have UI coders taking a hard look at browsers.

What are your requirements that can’t be satisfied by a browser interface? (Be sure you understand the notion of sunk costs before answering that question.)

Relevancy 301 – The Graduate Level Course

Filed under: Relevance,Search Algorithms,Search Engines — Patrick Durusau @ 7:58 pm

Relevancy 301 – The Graduate Level Course by Paul Nelson.

From the post:

So, I was going to write an article entitled “Relevancy 101”, but that seemed too shallow for what has become a major area of academic research. And so here we are with a Graduate-Level Course. Grab your book-bag, some Cheetos and a Mountain Dew, and let’s kick back and talk search engine relevancy.

I have blogged about relevancy before (see “What does ‘relevant’ mean?)”, but that was a more philosophical discussion of relevancy. The purpose of this blog is to go in-depth into the different types of relevancy, how they’re computed, and what they’re good for. I’ll do my best to avoid math, but no guarantees.

A very good introduction to measures of “relevancy,” most of which are no longer used.

Pay particular attention to Paul’s remarks about the weaknesses of inverse document frequency (IDF).
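If IDF is hazy, here is a minimal sketch of the classic formulation, log(N/df). The toy documents are mine, not Paul's, and they show the weakness in miniature: a rare term gets a high weight whether or not it is useful.

import math

def idf(term, documents):
    """Classic inverse document frequency: log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / float(df)) if df else 0.0

docs = [
    {"search", "engine", "relevancy"},
    {"search", "ranking", "precision"},
    {"cheetos", "mountain", "dew"},
]
print(idf("search", docs))   # common term, low weight
print(idf("cheetos", docs))  # rare term, high weight, relevant or not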

Before Paul posts part 2, how do you determine the relevance of documents?

Exercise:

Pick a subject covered by a journal or magazine, one with twelve issues each year and review a year’s worth of issues for “relevant” articles.

Assuming the journal is available electronically, does the search engine suggest your other “relevant” articles?

If it doesn’t, can you determine why it recommended different articles?

Dublin Lucene Revolution 2013 (videos/slides)

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 7:46 pm

Dublin Lucene Revolution 2013 (slides/presentations)

I had confidence that Lucene Revolution wouldn’t abandon non-football fans in the U.S. over Thanksgiving or Black Friday!

My faith has been vindicated!

I’ll create a sorted list of the presentations by author and title, to post here tomorrow.

In the meantime, I wanted to relieve your worry about endless hours of sports or shopping next week. 😉

…Scorers, Collectors and Custom Queries

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:30 pm

Lucene Search Essentials: Scorers, Collectors and Custom Queries by Mikhail Khludnev.

From the description:

My team is building next generation eCommerce search platform for major an online retailer with quite challenging business requirements. Turns out, default Lucene toolbox doesn’t ideally fit for those challenges. Thus, the team had to hack deep into Lucene core to achieve our goals. We accumulated quite a deep understanding of Lucene search internals and want to share our experience. We will start with an API overview, and then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls and common performance problems.

Don’t be frightened of the slide count at 179!

Multiple slides are used with single illustrations to demonstrate small changes.

Having said that, this is a “close to the metal” type presentation.

Worth your time but read along carefully.

Don’t miss the extremely fine index on slide 18.

Follow http://www.lib.rochester.edu/index.cfm?PAGE=489 for images of pages that go with the index. This copy of Fasciculus Temporum dates from 1480.

Big Data: Main Research/Business Challenges Ahead?

Filed under: Findability,Integration,Marketing,Personalization,Searching — Patrick Durusau @ 7:13 pm

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner by Roberto V. Zicari.

In case you don’t know, Jochen L. Leidner has the title: “Lead Scientist, of the London R&D at Thomson Reuters.”

Which goes a long way to explaining the importance of this Q&A exchange:

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?

Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need to have a certain consistency, namely that the same legal search systems retrieves the same things today and tomorrow, and to different parties).

How are you planning to address:

  1. The required information is not available in the system (a semantic 404, as it were), as distinguished from the case where it is there but the wrong search terms are in use.
  2. Low findability.
  3. Information integration (not normalization)
  4. System adaptability/personalization, but to users and not developers.
  5. Search consistency, same result tomorrow as today.

?

The rest of the interview is more than worth your time.

I singled out the research/business challenges as a possible map forward.

We all know where we have been.

R and Solr Integration…

Filed under: R,Solr — Patrick Durusau @ 5:41 pm

R and Solr Integration Using Solr’s REST APIs by Jitender Aswani.

From the post:

Solr is the most popular, fast and reliable open source enterprise search platform from the Apache Luene project. Among many other features, we love its powerful full-text search, hit highlighting, faceted search, and near real-time indexing. Solr powers the search and navigation features of many of the world’s largest internet sites. Solr, written in Java, uses the Lucene Java search library for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language including R.

We invested significant amount of time integrating our R-based data-management platform with Solr using HTTP/JSON based REST interface. This integration allowed us to index millions of data-sets in solr in real-time as these data-sets get processed by R. It took us few days to stabilize and optimize this approach and we are very proud to share this approach and source code with you. The full source code can be found and downloaded from datadolph.in’s git repository.

The script has R functions for:

  • querying Solr and returning matching docs
  • posting a document to solr (taking a list and converting it to JSON before posting it)
  • deleting all indexes, deleting indexes for a certain document type and for a certain category within document type
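Not an R user? The same REST API is just as easy to hit from any language. Here is a minimal sketch of the query and posting sides in Python; the host, core name and example field are assumptions, not taken from the post or the datadolph.in code:

import json
import requests

SOLR = "http://localhost:8983/solr/collection1"  # host and core are hypothetical

def solr_search(query, rows=10):
    """Query Solr's select handler and return the matching docs as dicts."""
    params = {"q": query, "wt": "json", "rows": rows}
    resp = requests.get(SOLR + "/select", params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def solr_add(doc):
    """Post a single document as JSON and commit, mirroring the script's posting function."""
    resp = requests.post(SOLR + "/update/json?commit=true",
                         data=json.dumps([doc]),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

solr_add({"id": "1", "title": "Hadoop cluster notes"})
for doc in solr_search("title:hadoop"):
    print(doc)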

Integration across systems is the lifeblood of enterprise IT systems.

I was extolling the virtues of reaching across silos earlier today.

A silo may provide comfort but it doesn’t offer much room for growth.

Or to put it another way, semantic integration doesn’t have one path, one process or one technology.

Once you’re past that, the rest is a question of requirements, resources and understanding identity in your domain (and/or across domains).

Learning MapReduce:…[Of Ethics and Self-Interest]

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:57 pm

Learning MapReduce: Everywhere and For Everyone

From the post:

Tom White, author of Hadoop: The Definitive Guide, recently celebrated his five-year anniversary at Cloudera with a blog post reflecting on the early days of Big Data and what has changed and remained since 2008. Having just seen Tom in New York at the biggest and best Hadoop World to date, I’m struck by the poignancy of his earliest memories. Even then, Cloudera’s projects were focused on broadening adoption and building the community by writing effective training material, integrating with other systems, and building on the core open source. The founding team had a vision to make Apache Hadoop the focal point of an accessible, powerful, enterprise-ready Big Data platform.

Today, Cloudera is working harder than ever to help companies deploy Hadoop as part of an Enterprise Data Hub. We’re just as committed to a healthy and vibrant open-source community, have a lively partner ecosystem over 700, and have contributed innovations that make data access and analysis faster, more secure, more relevant, and, ultimately, more profitable.

However, with all these successes in driving Hadoop towards the mainstream and providing a new and dynamic data engine, the fact remains that broadening adoption at the end-user level remains job one. Even as Cloudera unifies the Big Data stack, the availability of talent to drive operations and derive full value from massive data falls well short of the enormous demand. As more companies across industries adopt Hadoop and build out their Big Data strategies focused on the Enterprise Data Hub, Cloudera has expanded its commitment to educating technologists of all backgrounds on Hadoop, its applications, and its systems.

A Partnership to Cultivate Hadoop Talent

We at Cloudera University are proud to announce a new partnership with Udacity, a leader in open, online professional education. We believe in Udacity’s vision to democratize professional development by making technical training affordable and accessible to everyone, and this model will enable us to reach aspiring Big Data practitioners around the world who want to expand their skills into Hadoop.

Our first Udacity course, Introduction to Hadoop and MapReduce, guides learners from an understanding of Big Data to the basics of Hadoop, all the way through writing your first MapReduce program. We partnered directly with Udacity’s development team to build the most engaging online Hadoop course available, including demonstrative instruction, interactive quizzes, an interview with Hadoop co-founder Doug Cutting, and a hands-on project using live data. Most importantly, the lessons are self-paced, open, and based on Cloudera’s insights into industry best practices and professional requirements.
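For the curious, the “first MapReduce program” in most courses is word count. A minimal Hadoop Streaming sketch in Python follows; the file names are mine, not Cloudera's or Udacity's, and the two scripts would be wired together with the Hadoop Streaming jar:

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word; Hadoop Streaming delivers
# mapper output sorted by key, so equal words arrive together
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current and current is not None:
        print("%s\t%d" % (current, total))
        total = 0
    current = word
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))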

Cloudera, and to be fair, others, have adopted a strategy of self-interest that is also ethical.

They are literally giving away the knowledge and training to use a free product. Think of it as a rising tide that floats all boats higher.

The more popular and widely used Hadoop/MapReduce become, the greater the demand for professional training and services from Cloudera (and others).

You may experiment or even run a local cluster, but if you are a Hadoop newbie, who are you going to call when it is a mission-critical application? (Hopefully professionals but there’s no guarantee on that.)

You don’t have to build silos or closed communities to be economically viable.

Delivering professional services for a popular technology seems to do the trick.

Storm, Neo4j and Python:…

Filed under: Graphs,Neo4j,Python,Storm — Patrick Durusau @ 4:26 pm

Storm, Neo4j and Python: Real-Time Stream Computation on Graphs by Sonal Raj.

From the webpage:

This page serves a resource repository for my talk at Pycon India 2013 held at Bangalore, India on 30th August – 1st September, 2013. The talk introduces the basics of the Storm real-time distributed Computation Platform popularised by Twitter, and the Neo4J Graph Database and goes on to explain how they can be used in conjuction to perform real-time computations on Graph Data with the help of emerging python libraries – py2neo (for Neo4J) and petrel (for Storm)

Great slides, code skeletons, pointers to references and a live visualization!

See the video at: PyCon India 2013.

Demo gremlins mar the demonstration part but you can see:

A Storm Topology on AWS showing signup locations for people joining based on a sample Social Network data
http://www.enfoss.org/map-aws/storm-aws-visual.html

A quote from the slides that sticks with me:

Process Infinite Streams of data one-tuple-at-a-time.

😉

Casualty Count for Obamacare (0)

Filed under: Advertising,Government,Government Data,Health care,Marketing — Patrick Durusau @ 3:08 pm

5 lessons IT leaders can learn from Obamacare rollout mistakes by Teena Hammond.

Teena reports on five lessons to be learned from the HealthCare.gov rollout:

  1. If you’re going to launch a new website, decide whether to use in-house talent or outsource. If you opt to outsource, hire a good contractor.
  2. Follow the right steps to hire the best vendor for the project, and properly manage the relationship.
  3. Have one person in charge of the project with absolute veto power.
  4. Do not gloss over any problems along the way. Be open and honest about the progress of the project. And test the site.
  5. Be ready for success or failure. Hope for the best but prepare for the worst and have guidelines to manage any potential failure.

There is a sixth lesson that emerges from Vaughn Bullard, CEO and founder of Build.Automate Inc., who is quoted in part saying:

The contractor telling the government that it was ready despite the obvious major flaws in the system is just baffling to me. If I had an employee that did something similar, I would have terminated their employment. It’s pretty simple.”

What it comes down to in the end, Bullard said, is that, “Quality and integrity count in all things.”

To avoid repeated failures in the future (sixth lesson), terminate those responsible for the current failure.

All contractors and their staffs. Track the staffs in order to avoid the same staff moving to other contractors.

Terminate all appointed or hired staff who were responsible for the contract and/or management of the project.

Track former staff employment by contractors and refuse contracts wherever they are employed.

You may have noticed that the reported casualty count for the Obamacare failure has been zero.

What incentive exists for the next group of contract/project managers and/or contractors for “quality and integrity?”

That would be the same as the casualty count, zero.


PS: Before you protest the termination and ban of failures as cruel, consider its advantages as a wealth redistribution program.

The government may not get better service but it will provide opportunities for fraud and poor quality work from new participants.

Not to mention there are IT service providers who exhibit quality and integrity. Absent traditional mis-management, the government could happen upon one of those.

The tip for semantic technologies is to under-promise and over-deliver. Always.

HyperDex 1.0RC5

Filed under: Advertising,HyperDex,Marketing,NoSQL — Patrick Durusau @ 1:44 pm

HyperDex 1.0RC5 by Robert Escriva.

From the post:

We are proud to announce HyperDex 1.0.rc5, the next generation NoSQL data store that provides ACID transactions, fault-tolerance, and high-performance. This new release has a number of exciting features:

  • Improved cluster management. The cluster will automatically grow as new nodes are added.
  • Backup support. Take backups of the coordinator and daemons in a consistent state and be able to restore the cluster to the point when the backup was taken.
  • An admin library which exposes performance counters for tracking cluster-wide statistics relating to HyperDex
  • Support for HyperLevelDB. This is the first HyperDex release to use HyperLevelDB, which brings higher performance than Google’s LevelDB.
  • Secondary indices. Secondary indices improve the speed of search without the overhead of creating a subspace for the indexed attributes.
  • New atomic operations. Most key-based operations now have conditional atomic equivalents.
  • Improved coordinator stability. This release introduces an improved coordinator that fixes a few stability problems reported by users.

Binary packages for Debian 7, Ubuntu 12.04-13.10, Fedora 18-19, and CentOS 6 are available on the HyperDex Download page, as well as source tarballs for other Linux platforms.

BTW, HyperDex has a cool logo:

HyperDex

Good logos are like good book covers, they catch the eye of potential customers.

A book sale starts when a customer picks a book up, hence the need for a good cover.

What sort of cover does your favorite semantic application have?

November 19, 2013

Mortar’s Open Source Community

Filed under: BigData,Ethics,Mortar,Open Source — Patrick Durusau @ 8:28 pm

Building Mortar’s Open Source Community: Announcing Public Plans by K. Young.

From the post:

We’re big fans of GitHub. There are a lot of things to like about the company and the fantastic service they’ve built. However, one of the things we’ve come to admire most about GitHub is their pricing model.

If you’re giving back to the community by making your work public, you can use GitHub for free. It’s a great approach that drives tremendous benefits to the GitHub community.

Starting today, Mortar is following GitHub’s lead in supporting those who contribute to the data science community.

If you’re improving the data science community by allowing your Mortar projects to be seen and forked by the public, we will support you by providing free access to our complete platform (including unlimited development time, up to 25 public projects, and email support). In short, you’ll pay nothing beyond Amazon Web Services’ standard Elastic MapReduce fees if you decide to run a job.

A good illustration of the difference between talking about ethics (Ethics of Big Data?) and acting ethically.

Acting ethically benefits the community.

Government grants to discuss ethics, well, you know who benefits from that.

Ethics of Big Data?

Filed under: BigData,Ethics — Patrick Durusau @ 7:58 pm

The ethics of big data: A council forms to help researchers avoid pratfalls by Jordan Novet.

From the post:

Big data isn’t just something for tech companies to talk about. Researchers and academics are forming a council to analyze the hot technology category from legal, ethical, and political angles.

The researchers decided to create the council in response to a request from the National Science Foundation (NSF) for “innovation projects” involving big data.

The Council for Big Data, Ethics, and Society will convene for the first time next year, with some level of participation from the NSF. Alongside Microsoft researchers Kate Crawford and Danah Boyd, two computer-science-savvy professors will co-direct the council: Geoffrey Bowker from the University of California, Irvine, and Helen Nissenbaum of New York University.

Through “public commentary, events, white papers, and direct engagement with data analytics projects,” the council will “address issues such as security, privacy, equality, and access in order to help guard against the repetition of known mistakes and inadequate preparation,” according to a fact sheet the White House released on Tuesday.

“We’re doing all of these major investments in next-generation internet (projects), in big data,” Fen Zhao, an NSF staff associate, told VentureBeat in a phone interview. “How do we in the research-and-development phase make sure they’re aware and cognizant of any issues that may come up?”

Odd that I should encounter this just after seeing the latest NSA surveillance news.

Everyone cites the Tuskegee syphilis study as an example of research with ethical lapses.

Tuskegee is only one of many ethical lapses in American history. I think hounding Native Americans to near extermination would make any list of moral lapses. But that was more application than research.

It doesn’t require training in ethics to know Tuskegee and the treatment of Native Americans were wrong.

And whatever “ethics” come out of this study are likely to resemble the definition of a prisoner of war in Geneva Convention (III), Article 4(a)(2):

(2) Members of other militias and members of other volunteer corps, including those of organized resistance movements, belonging to a Party to the conflict and operating in or outside their own territory, even if this territory is occupied, provided that such militias or volunteer corps, including such organized resistance movements, fulfill the following conditions:

(a) that of being commanded by a person responsible for his subordinates;

(b) that of having a fixed distinctive sign recognizable at a distance;

(c) that of carrying arms openly;

(d) that of conducting their operations in accordance with the laws and customs of war.

That may seem neutral on its face, but it is fair to say that major nation states, and not the groups that have differences with them, are the ones likely to meet those requirements.

In fact, the Laws of War Deskbook argues in part that members of the Taliban had no distinctive uniforms and thus no POW status. (At page 79, footnote 31.)

The point being that discussions of ethics should be about concrete cases, so we can judge who will win and who will lose.

Otherwise you will have general principles of ethics that favor the rule makers.

Known NSA Collaborators

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 6:45 pm

Here’s what we know about European collaboration with the NSA by David Meyer.

Norway is the latest named collaborator with the NSA.

David summarizes the surveillance details (so far), including the known collaborators of the NSA.

Puzzling that for all the huffing and puffing about sovereignty in public from these collaborators, in private they can’t wait to abase themselves before the United States.

Not that sovereign nations need always disagree but this sort of toadyism endangers citizens of the United States as well as citizens of other countries around the world.

Toadyism isn’t an effective means of provoking rational discussion and debate among nations.

What’s missing from David’s post are the individual names from the NSA, U.S. government, and its collaborators, who should be held accountable by their respective legal systems.

Creating topic maps of surveillance activities will of necessity be diverse projects. Different laws, information sources, etc.

Should common questions come up about creating and/or merging such topic maps, I will contribute answers whenever possible on this blog. And solicit input from any readers of this blog who care to contribute their insights.

Should you require more regular involvement on my part, you know where to find me for further discussions.

My public PGP key.

Bridging Semantic Gaps

Filed under: Language,Lexicon,Linguistics,Sentiment Analysis — Patrick Durusau @ 4:50 pm

OK, the real title is: Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation by Zheng Lin, Songbo Tan, Yue Liu, Xueqi Cheng, Xueke Xu. (Lin Z, Tan S, Liu Y, Cheng X, Xu X (2013) Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation. PLoS ONE 8(11): e79294. doi:10.1371/journal.pone.0079294)

Abstract:

There is a growing interest in automatically building opinion lexicon from sources such as product reviews. Most of these methods depend on abundant external resources such as WordNet, which limits the applicability of these methods. Unsupervised or semi-supervised learning provides an optional solution to multilingual opinion lexicon extraction. However, the datasets are imbalanced in different languages. For some languages, the high-quality corpora are scarce or hard to obtain, which limits the research progress. To solve the above problems, we explore a mutual-reinforcement label propagation framework. First, for each language, a label propagation algorithm is applied to a word relation graph, and then a bilingual dictionary is used as a bridge to transfer information between two languages. A key advantage of this model is its ability to make two languages learn from each other and boost each other. The experimental results show that the proposed approach outperforms baseline significantly.

I have always wondered when someone would notice the WordNet database is limited to the English language. 😉

The authors are seeking to develop “…a language-independent approach for resource-poor language,” saying:

Our approach differs from existing approaches in the following three points: first, it does not depend on rich external resources and it is language-independent. Second, our method is domain-specific since the polarity of opinion word is domain-aware. We aim to extract the domain-dependent opinion lexicon (i.e. an opinion lexicon per domain) instead of a universal opinion lexicon. Third, the most importantly, our approach can mine opinion lexicon for a target language by leveraging data and knowledge available in another language…

Our approach propagates information back and forth between source language and target language, which is called mutual-reinforcement label propagation. The mutual-reinforcement label propagation model follows a two-stage framework. At the first stage, for each language, a label propagation algorithm is applied to a large word relation graph to produce a polarity estimate for any given word. This stage solves the problem of external resource dependency, and can be easily transferred to almost any language because all we need are unlabeled data and a couple of seed words. At the second stage, a bilingual dictionary is introduced as a bridge between source and target languages to start a bootstrapping process. Initially, information about the source language can be utilized to improve the polarity assignment in target language. In turn, the updated information of target language can be utilized to improve the polarity assignment in source language as well.
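To make the first stage concrete, here is a minimal single-language label propagation sketch over a toy word graph. The graph, seed words and iteration count are invented for illustration and are not taken from the paper:

# Toy label propagation: polarity scores spread from a few seed words
# over a word-relation graph until they (approximately) stabilize.
graph = {                             # hypothetical word-relation edges
    "good": ["great", "fine"],
    "great": ["good"],
    "fine": ["good", "penalty"],
    "bad": ["awful", "penalty"],
    "awful": ["bad"],
    "penalty": ["fine", "bad"],
}
seeds = {"good": 1.0, "bad": -1.0}    # labeled seed polarities

scores = {w: seeds.get(w, 0.0) for w in graph}
for _ in range(50):                   # iterate toward convergence
    updated = {}
    for word, neighbors in graph.items():
        if word in seeds:             # clamp seeds to their labels
            updated[word] = seeds[word]
        else:                         # otherwise average the neighbors
            updated[word] = sum(scores[n] for n in neighbors) / len(neighbors)
    scores = updated

for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("%-8s %+.2f" % (word, score))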

Two points of particular interest:

  1. The authors focus on creating domain specific lexicons and don’t attempt to boil the ocean. Useful semantic results will arrive sooner if you avoid attempts at universal solutions.
  2. English speakers are a large market, but the target of this exercise is the #1 language of the world, Mandarin Chinese.

    Taking the numbers for English speakers at face value, approximately 0.8 billion speakers, with a world population of 7.125 billion, that leaves 6.3 billion potential customers.

You’ve heard what they say: A billion potential customers here and a billion potential customers there, pretty soon you are talking about a real market opportunity. (The original quote is misattributed to Sen. Everett Dirksen.)

Got Space? Got Time? Want Space + Time?

Filed under: BigData,Graphics,Visualization — Patrick Durusau @ 3:37 pm

FAU Neuroscientists Receive Patent for New 5D Method to Understand Big Data

5-D image of brain

From the news release:

Florida Atlantic University received a U.S. patent for a new method to display large amounts of data in a color-coded, easy-to-read graph. Neuroscientists Emmanuelle Tognoli, Ph.D., and Scott Kelso, Ph.D., both researchers at the Center for Complex Systems and Brain Sciences at FAU, originally designed the method to interpret enormous amounts of data derived from their research on the human brain. The method, called a five dimensional (5D) colorimetric technique, is able to graph spatiotemporal data (data that includes both space and time), which has not previously been achieved. Until now, spatiotemporal problems were analyzed either from a spatial perspective (for instance, a map of gas prices in July 2013), or from a time-based approach (evolution of gas prices in one county over time), but not simultaneously from both perspectives. Without both space and time, analysts have been faced with an incomplete picture until now, with the creation of the 5D colorimetric technique.

The new method has already been used to examine climatic records of sea surface temperature at 65,000 points around the world over a period of 28 years and provided scientists with a clear understanding of when and where temperature fluctuations occur. While the possibilities are endless, a few practical examples of use for the 5D colorimetric technique could include tracking gas prices per county, analyzing foreclosure rates in different states or tracking epidemiological data for a virus.

Tognoli and Kelso’s research involves tracking neural activity from different areas of the human brain every one thousandth of a second. This creates a massive amount of data that is not easy to understand using conventional methods.

“Using the 5D colorimetric technique, these huge datasets are transformed into a series of color-coded dynamic patterns that actually reveal the neural choreography completely,” said Kelso. “Combining this new method with conceptual and theoretical tools in real experiments will help us and others elucidate the basic coordination dynamics of the human brain.”

A new visualization technique for big data.

Interesting that we experience multiple dimensions of data embedded in a constant stream of time and space, yet have no difficulty interacting with it and others embedded in the same context.

When we have to teach our benighted servants (computers) to display what we intuitively understand, difficulties ensue.

Just in case you are interested: System and method for analysis of spatio-temporal data, Patent #8,542,916.

November 18, 2013

jLemmaGen

Filed under: Lexicon,Linguistics — Patrick Durusau @ 7:17 pm

jLemmaGen by Michal Hlaváč.

From the webpage:

JLemmaGen is java implmentation of LemmaGen project. It’s open source lemmatizer with 15 prebuilded european lexicons. Of course you can build your own lexicon.

LemmaGen project aims at providing standardized open source multilingual platform for lemmatisation.

Project contains 2 libraries:

  • lemmagen.jar – implementation of lemmatizer and API for building own lemmatizers
  • lemmagen-lang.jar – prebuilded lemmatizers from Multext Eastern dictionaries

Whether you want to expand your market or just to avoid officious U.S. officials for the next decade or so, multilingual resources are the key to making that happen.

Enjoy!

Solr Query Parsing

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

Solr Query Parsing by Erik Hatcher.

From the description:

Interpreting what the user meant and what they ideally would like to find is tricky business. This talk will cover useful tips and tricks to better leverage and extend Solr‘s analysis and query parsing capabilities to more richly parse and interpret user queries.

It may just be me, but Solr presentations often hit the ground assuming you have little or no background in the subject at hand.

I won’t name names or topics, but when a presentation starts off with the same basics covered in any number of other talks, it’s hard to stay interested.

That’s not the case with the slides from Erik’s presentation!

Highly recommended!

Advanced Bash-Scripting Guide

Filed under: Awk,Sed — Patrick Durusau @ 11:57 am

Advanced Bash-Scripting Guide by Mendel Cooper.

I searched for an awk switch recently and ran across what I needed in an appendix to this book.

It is well written and has copious examples.

You can always fire up heavy duty tools but for many text processing tasks, shell scripts along with awk and sed are quite sufficient.

November 17, 2013

Current RFCs and Their Citations

Filed under: Citation Practices,Standards,Topic Maps — Patrick Durusau @ 8:51 pm

Current RFCs and Their Citations

A resource I created to give authors and editors a cut-n-paste way to use correct citations to current RFCs.

I won’t spread bad data by repeating some of the more imaginative citations of RFCs that I have seen.

Being careless about citations has the same impact as being careless about URLs. The end result is at best added work for your reader and at worst, no communication at all.

I will be updating this resource on a weekly basis but remember the canonical source of information on RFCs is the RFC-Editor’s page.

From a topic map perspective, the URLs you see in this resource are subject locators for the subjects, which are the RFCs.

Spelling isn’t a subject…

Filed under: Lucene,Solr — Patrick Durusau @ 8:39 pm

Have you seen Alec Baldwin’s teacher commercial?

A student suggests spelling as a subject and Alec responds: “Spelling isn’t a subject, spell-check, that’s a program, right?”

In Spellchecking in Trovit by Xavier Sanchez Loro, you will find that spell-check is more than a “program.”

Especially in a multi-language environment where the goal isn’t just correct spelling but delivery of relevant information to users.

From the post:

This post aims to explain the implementation and use case for spellchecking in the Trovit search engine that we will be presenting at the Lucene/Solr Revolution EU 2013 [1]. Trovit [2] is a classified ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using SOLR [3] and Lucene [4] in order to help our users to better find the desired ads and to avoid the dreaded 0 results as much as possible (obviously, whilst still reporting back relevant information to the user). As such, our goal is not pure orthographic correction, but also to suggest correct searches for a certain site.

Our approach: Contextual Spellchecking

One key element in the spellchecking process is choosing the right dictionary, one with a relevant vocabulary for the type of information included in each site. Our approach is specializing the dictionaries based on user’s search context. Our search contexts are composed of country (with a default language) and vertical (determining the type of ads and vocabulary). Each site’s document corpus has a limited vocabulary, reduced to the type of information, language and terms included in each site’s ads. Using a more generalized approach is not suitable for our needs, since a unique vocabulary for each language (regardless of the vertical) is not as precise as specialized vocabularies for each language and vertical. We have observed drastic differences in the type of terms included in the indexes and the semantics of each vertical. Terms that are relevant in one context are meaningless in another one (e.g. “chalet” is not a relevant word in cars vertical, but is a highly relevant word for homes vertical). As such, Trovit’s spellchecking implementation exhibits very different vocabularies for each site, even when supporting the same language.
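Stripped of the Solr/Lucene machinery, the core idea is small enough to sketch. A minimal illustration in Python; the vocabularies and context keys are invented, and Trovit's real implementation of course uses Solr's spellcheck components rather than difflib:

import difflib

# One vocabulary per (country, vertical) context, built from that site's own ads.
vocabularies = {
    ("es", "homes"): {"chalet", "piso", "terraza", "garaje"},
    ("es", "cars"):  {"diesel", "sedan", "automatico", "garantia"},
}

def suggest(term, country, vertical, cutoff=0.75):
    """Suggest corrections only from the vocabulary of the user's context."""
    vocab = vocabularies.get((country, vertical), set())
    return difflib.get_close_matches(term, vocab, n=3, cutoff=cutoff)

print(suggest("chalte", "es", "homes"))  # -> ['chalet']
print(suggest("chalte", "es", "cars"))   # -> [] : not a relevant word in this context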

I like the emphasis on “contextual” spellchecking.

Sounds a lot like “contextual” subject recognition.

Yes?

Walking through this post in detail is an excellent exercise!

November 16, 2013

Hyperlink Graph

Filed under: Common Crawl,Graphs — Patrick Durusau @ 7:47 pm

Web Data Commons – Hyperlink Graph by Robert Meusel, Oliver Lehmberg and Christian Bizer.

From the post:

This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who develop

  • search algorithms that rank results based on the hyperlinks between pages.
  • SPAM detection methods which identity networks of web pages that are published in order to trick search engines.
  • graph analysis algorithms and can use the hyperlink graph for testing the scalability and performance of their tools.
  • Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

This is great news!

Competing graph engines won’t need to create synthetic data to gauge their scalability/performance.

Looking forward to news of results measured against this data set.
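If you want a quick feel for the data before loading it into a graph engine, a few lines will do. A minimal sketch, assuming the arc files are plain text with one tab-separated pair of numeric node IDs per line (check the download page for the actual layout and compression):

from collections import Counter

out_degree = Counter()
with open("arcs-sample.txt") as arcs:        # hypothetical file name
    for line in arcs:
        source, _target = line.split("\t")
        out_degree[source] += 1

print("pages seen as link sources:", len(out_degree))
print("top 5 by out-degree:", out_degree.most_common(5))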

Kudos to the Web Data Commons and Robert Meusel, Oliver Lehmberg and Christian Bizer.

CLue

Filed under: Indexing,Lucene,Luke — Patrick Durusau @ 7:24 pm

CLue – Command Line tool for Apache Lucene by John Wang.

From the webpage:

When working with Lucene, it is often useful to inspect an index.

Luke is awesome, but often times it is not feasible to inspect an index on a remote machine using a GUI. That’s where Clue comes in. You can ssh into your production box and inspect your index using your favorite shell.

Another important feature for Clue is the ability to interact with other Unix commands via piping, e.g. grep, more etc.

[New in 0.0.4 Release]

  • Add ability to investigate indexes on HDFS
  • Add command to dump the index
  • Add command to import from a dumped index
  • Add configuration support, now you can configure Clue to run your own custom code
  • Add index trimming functionlity: sometimes you want a smaller index to work with
  • lucene 4.5.1 upgrade

Definitely a tool to investigate for adding to your tool belt!

Cassandra and Naive Bayes

Filed under: Bayesian Data Analysis,Cassandra — Patrick Durusau @ 7:14 pm

Using Cassandra to Build a Naive Bayes Classifier of Users Based Upon Behavior by John Berryman.

From the post:

In our last post, we found out how simple it is to use Cassandra to estimate ad conversion. It’s easy, because effectively all you have to do is accumulate counts – and Cassandra is quite good at counting. As we demonstrated in that post, Cassandra can be used as a giant, distributed, redundant, “infinitely” scalable counting framework. During this post will take the online ad company example just a bit further by creating a Cassandra-backed Naive Bayes Classifier. Again, we see that the “secret sauce” is simply keeping track of the appropriate counts.

In the previous post, we helped equip your online ad company with the ability to track ad conversion rates. But competition is steep and we’ll need to do a little better than ad conversion rates if your company is to stay on top. Recently, suspicions have arisen that ads are often being shown to unlikely customers. A quick look at the logs confirms this concern. For instance, there was a case of one internet user that clicked almost every single ad that he was shown – so long as it related to the camping gear. Several times, he went on to make purchases: a tent, a lantern, and a sleeping bag. But despite this users obvious interest in outdoor sporting goods, your logs indicated that fully 90% of the ads he was shown were for women’s apparel. Of these ads, this user clicked none of them.

Let’s attack this problem by creating a classifier. Fortunately for us, your company specializes in two main genres, fashion, and outdoors sporting goods. If we can determine which type of user we’re dealing with, then we can improve our conversion rates considerably by simply showing users the appropriate ads.
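John's “the secret sauce is counts” point fits in a few lines. Here is a minimal sketch of the classifier itself, with in-memory counters standing in for Cassandra counter columns; the categories and counts are invented for illustration:

import math
from collections import defaultdict

# counts[category][ad_clicked] would live in Cassandra counter columns;
# plain dicts stand in for them here.
counts = {
    "outdoors": defaultdict(int, {"tent_ad": 40, "lantern_ad": 25, "dress_ad": 2}),
    "fashion":  defaultdict(int, {"dress_ad": 50, "handbag_ad": 30, "tent_ad": 1}),
}
category_totals = {c: sum(v.values()) for c, v in counts.items()}

def classify(clicked_ads):
    """Pick the category maximizing the (log) naive Bayes score."""
    best, best_score = None, float("-inf")
    for cat, ad_counts in counts.items():
        total = float(category_totals[cat])
        score = math.log(total / sum(category_totals.values()))  # prior
        for ad in clicked_ads:
            # add-one smoothing so unseen ads don't zero out the product
            score += math.log((ad_counts[ad] + 1) / (total + len(ad_counts)))
        if score > best_score:
            best, best_score = cat, score
    return best

print(classify(["tent_ad", "lantern_ad"]))  # -> outdoors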

So long as you remember the unlikely assumption of feature independence of Naive Bayes, you should be ok.

That is, that whatever features you are measuring are independent of each other.

Naive Bayes has been “successfully” used in a number of contexts, but the descriptions I have read don’t specify what they meant by “successful.” 😉

