Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 22, 2013

Making It Happen:…

Filed under: Data,Data Preservation,Preservation — Patrick Durusau @ 9:45 am

Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.

Great set of overview slides on why research data should be preserved.

Not to mention making the case that semantic diversity, in systems for capturing research data, between researchers, etc., needs to be addressed by any proffered solution.

If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.

As of today, that search returns one hundred and forty (140) presentations.

All of which you will find useful on a variety of data related topics.

SnoPy – SNOBOL Pattern Matching for Python

Filed under: Pattern Matching,Python,SNOBOL — Patrick Durusau @ 9:32 am

SnoPy – SNOBOL Pattern Matching for Python

Description:

SnoPy – A Python alternative to regular expressions. Borrowed from SNOBOL this alternative is both easier to use and more powerful than regular expressions. NO backslashes to count.

See also: SnoPy – SNOBOL Pattern Matching for Python Web Site.

For cross-disciplinary data mining, what could be more useful than SNOBOL pattern matching?
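I have not dug into SnoPy's exact API yet, so treat the following as a toy sketch of my own (plain Python, not SnoPy) showing what SNOBOL-style composable patterns feel like: patterns are values you combine with operators, and there is not a backslash in sight.

```python
# A toy illustration of SNOBOL-style pattern composition in plain Python.
# This is NOT SnoPy's API; it only shows why composable patterns can be
# easier to read than backslash-heavy regular expressions.

class Pattern:
    def __init__(self, match):
        self.match = match  # match(text, pos) -> new pos or None

    def __add__(self, other):          # concatenation
        def m(text, pos):
            p = self.match(text, pos)
            return other.match(text, p) if p is not None else None
        return Pattern(m)

    def __or__(self, other):           # alternation
        def m(text, pos):
            p = self.match(text, pos)
            return p if p is not None else other.match(text, pos)
        return Pattern(m)

def lit(s):
    """Match a literal string."""
    return Pattern(lambda text, pos: pos + len(s) if text.startswith(s, pos) else None)

def span(chars):
    """Match one or more characters drawn from `chars` (SNOBOL's SPAN)."""
    def m(text, pos):
        end = pos
        while end < len(text) and text[end] in chars:
            end += 1
        return end if end > pos else None
    return Pattern(m)

# "digits ( '.' digits | nothing )" -- a simple number pattern, no backslashes.
digits = span("0123456789")
number = digits + ((lit(".") + digits) | lit(""))

print(number.match("3.14 apples", 0))   # -> 4 (characters consumed)
print(number.match("42", 0))            # -> 2
```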

I first saw this in a Facebook link posted by Sam Hunting.

American Geophysical Union (AGU)

Filed under: Data,Geophysical,Science — Patrick Durusau @ 9:25 am

American Geophysical Union (AGU)

The mission of the AGU:

The purpose of the American Geophysical Union is to promote discovery in Earth and space science for the benefit of humanity.

While I was hunting down information on DataONE, I ran across the AGU site.

Like all disciplines, data analysis, collection, collation, sharing, etc. are ongoing concerns at the AGU.

My interest is more in the data techniques than the subject matter.

I am seeking to avoid re-inventing the wheel and to pick up insights that have yet to reach more familiar areas.

DataONE

Filed under: Climate Data,Environment — Patrick Durusau @ 9:13 am

DataONE

From the “about” page:

Data Observation Network for Earth (DataONE) is the foundation of new innovative environmental science through a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

Supported by the U.S. National Science Foundation (Grant #OCI-0830944) as one of the initial DataNets, DataONE will ensure the preservation, access, use and reuse of multi-scale, multi-discipline, and multi-national science data via three primary cyberinfrastructure elements and a broad education and outreach program.

“…preservation, access, use and reuse of multi-scale, multi-discipline, and multi-national science data….”

Sounds like they are playing our song!

See also: DataONE: Survey of Earth Scientists, To Share or Not to Share Data, abstract of a poster from the American Geophysical Union, Fall Meeting 2010, abstract #IN11A-1062.

Interesting summary of the current data habits and preferences of scientists.

Starting point for shaping a topic map solution to problems as perceived by a group of users.

A Distributed Graph Engine…

Filed under: Distributed Systems,Graphs,RDF,Trinity — Patrick Durusau @ 5:56 am

A Distributed Graph Engine for Web Scale RDF Data by Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao and Zhongyuan Wang.

Abstract:

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.

From the conclusion:

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.

A result that is:

  • scalable
  • goes beyond SPARQL
  • can be ported to any in-memory key-value store

Merits a very close read.
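To make the "native graph form" point concrete, here is a minimal sketch (mine, not Trinity.RDF's code) that keeps triples as adjacency lists keyed by subject, the way a key-value store might, and then runs a graph operation (reachability) that would be painful as relational joins:

```python
# A minimal sketch (not Trinity.RDF itself): store RDF triples as adjacency
# lists keyed by subject, then run a graph operation (reachability) directly
# over the native graph instead of joining triple tables.
from collections import defaultdict, deque

triples = [
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("carol", "worksAt", "acme"),
]

# key: subject node -> list of (predicate, object) edges
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def reachable(start, goal):
    """Breadth-first search over the RDF graph, ignoring predicates."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for _, obj in graph[node]:
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return False

print(reachable("alice", "acme"))  # True: alice -> bob -> carol -> acme
print(reachable("acme", "alice"))  # False: edges are directed
```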

Makes me curious: what other data models would work better if cast as graphs?

I first saw this in a tweet by Juan Sequeda.

March 21, 2013

Striking a Blow for Complexity

Filed under: RDF,SPARQL — Patrick Durusau @ 2:57 pm

The W3C struck a blow for complexity today.

Its blog entry, entitled Eleven SPARQL 1.1 Specifications are W3C Recommendations, reads:

The SPARQL Working Group has completed development of its full-featured system for querying and managing data using the flexible RDF data model. It has now published eleven Recommendations for SPARQL 1.1, detailed in SPARQL 1.1 Overview. SPARQL 1.1 extends the 2008 Recommendation for SPARQL 1.0 by adding features to the query language such as aggregates, subqueries, negation, property paths, and an expanded set of functions and operators. Beyond the query language, SPARQL 1.1 adds other features that were widely requested, including update, service description, a JSON results format, and support for entailment reasoning. Learn more about the Semantic Web Activity.
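If you want to see what two of the new features look like in practice, here is a small sketch using rdflib (assuming a version with SPARQL 1.1 support), exercising property paths and aggregates:

```python
# A small sketch of two SPARQL 1.1 additions (property paths and aggregates),
# run with rdflib; assumes an rdflib version with SPARQL 1.1 support.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .
ex:alice foaf:knows ex:bob .
ex:bob   foaf:knows ex:carol .
ex:carol foaf:knows ex:dave .
""", format="turtle")

# Property path: everyone alice can reach through one or more foaf:knows links.
reachable = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/>
    SELECT ?person WHERE { ex:alice foaf:knows+ ?person }
""")
for row in reachable:
    print(row.person)

# Aggregate: how many foaf:knows links does each person have?
counts = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?s (COUNT(?o) AS ?n) WHERE { ?s foaf:knows ?o } GROUP BY ?s
""")
for row in counts:
    print(row.s, row.n)
```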

I can’t wait for the movie version starring IBM’s Watson playing sudden-death Jeopardy against Bob DuCharme, category: SPARQL.

I’m betting on Bob!

Data.ac.uk

Filed under: Data,Open Data,RDF — Patrick Durusau @ 2:38 pm

Data.ac.uk

From the website:

This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.
Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.

I can’t imagine a greater contrast with my poor web authoring skills than this website.

But having said that, I think you will be as disappointed as I was when you start looking for data on this “landmark site.”

There is some but not nearly enough to match the promise of such a cleverly designed website.

Perhaps they are hoping that someday RDF data (they also offer comma and tab delimited versions) will catch up to the site design.

I first saw this in a tweet by Frank van Harmelen.

Training a New Generation of Data Scientists

Filed under: Cloudera,CS Lectures,Data Science — Patrick Durusau @ 2:26 pm

Training a New Generation of Data Scientists by Ryan Goldman.

From the post:

Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

This could be fun!

And if nothing else, it will give you the tools to distinguish legitimate training, like Cloudera’s, from the “How to make $millions in real estate” sort of training sold by the guy who makes his money selling lectures and books.

As “hot” as data science is, you don’t have to look far to find that sort of training.

elasticsearch 0.90.0.RC1 Released

Filed under: ElasticSearch,Lucene,Searching — Patrick Durusau @ 2:08 pm

elasticsearch 0.90.0.RC1 Released by Shay Banon.

From the post:

elasticsearch version 0.90.0.RC1 is out, the first release candidate for the 0.90 release. You can download it here.

This release includes an upgrade to Lucene 4.2, many improvements to the suggester feature (including its own dedicated API), another round of memory improvements to field data (long type will now automatically “narrow” to the smallest type when loaded to memory) and several bug fixes. Upgrading to it from previous beta releases is highly recommended. (inserted URL to release notes)

Just to keep you on the cutting edge of search technology!

Google Keep: Another Temporary Google Data Silo

Filed under: Data Silos — Patrick Durusau @ 2:00 pm

Google launches Google Keep, an app to help you remember things by Laura Hazard Owen.

I report this only to ask: is anyone tracking new data silos as they appear?

If anyone is tracking them I would be willing to submit candidates as I encounter them.

Thanks!

PS: When Google decides to close Google Keep, please let me know if you create an export-to-topic-map script for it.

Snowflake Data Science [Three R’s of Topic Maps?]

Filed under: BigData,Data Science,Topic Maps — Patrick Durusau @ 1:37 pm

Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science. (Matt Wood, principal data scientist for Amazon Web Services)

Think about that for a moment.

Snowflakes are unique. Can the same be said about your data science projects?

Would that explain the figure that 80% of data science time is spent on cleaning, ETL, and similar tasks?

Is it that data never gets clean or are you cleaning the same data over and over again?

Barb Darrow reported in: From Amazon’s top data geek: data has got to be big — and reproducible:

The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.

In short, it’s great to get a result from your number crunching, but if the result is different next time out, there’s a problem. No self-respecting scientist would think of submitting the findings for a trial or experiment unless she is able to show that it will be the same after multiple runs.

“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science,” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”

The next frontier is making sure that people can reproduce, reuse and remix their data which provides a “tremendous amount of value,” Wood noted. (emphasis added)

I like that: Reproduce, Reuse, Remix data.

That’s going to require robust and granular handling of subject identity.

The three R’s of topic maps.

Yes?
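To make “granular handling of subject identity” a bit more concrete, here is an illustrative sketch (not a topic map engine, and the identity properties and records are hypothetical): derive a stable subject key from declared identity properties, so that re-running a merge next week reproduces the same result.

```python
# An illustrative sketch (not a topic map engine): derive a stable subject
# identifier from declared identity properties, so that merging the same
# records again reproduces the same result.
import hashlib
import json

def subject_key(record, identity_props):
    """Hash only the properties declared to establish identity."""
    identity = {k: record[k] for k in sorted(identity_props) if k in record}
    blob = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def merge(records, identity_props):
    """Group records about the same subject under one key, repeatably."""
    merged = {}
    for rec in records:
        key = subject_key(rec, identity_props)
        merged.setdefault(key, {}).update(rec)
    return merged

# Hypothetical survey records about the same researcher.
survey_2012 = {"orcid": "0000-0002-1825-0097", "name": "J. Smith", "field": "geophysics"}
survey_2013 = {"orcid": "0000-0002-1825-0097", "name": "Jane Smith", "h_index": 12}

result = merge([survey_2012, survey_2013], identity_props=["orcid"])
print(result)  # one subject, properties from both records, same key every run
```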

Should Business Data Have An Audit Trail?

Filed under: Auditing,Business Intelligence,Datomic,Transparency — Patrick Durusau @ 11:19 am

The “second slide” I would lead with from Stuart Halloway’s Datomic, and How We Built It would be:

Should Business Data Have An Audit Trail?

Actually Stuart’s slide #65 but who’s counting? 😉

Stuart points out the irony of git, saying:

developer data is important enough to have an audit trail, but business data is not

Whether business data should always have an audit trail would attract shouts of yes and no, depending on the audience.

Regulators, prosecutors, good government types, etc., mostly shouting yes.

Regulated businesses, security brokers, elected officials, etc., mostly shouting no.

Some in between.

Datomic, which has some common characteristics with topic maps, gives you the ability to answer these questions:

  • Do you want auditable business data or not?
  • If yes to auditable business data, to what degree?

Rather different than just assuming it isn’t possible.
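To see what an audit trail for business data buys you, here is a toy append-only fact log in Python. It is emphatically not Datomic’s API, just an illustration of the idea that nothing is overwritten, so you can always ask what was believed as of a given transaction.

```python
# A toy append-only fact log (not Datomic's API) showing what an audit trail
# for business data buys you: nothing is overwritten, so you can ask what
# was believed at any point in time.
import itertools

_tx = itertools.count(1)
log = []  # each entry: (tx, entity, attribute, value, added?)

def assert_fact(entity, attribute, value):
    log.append((next(_tx), entity, attribute, value, True))

def retract_fact(entity, attribute, value):
    log.append((next(_tx), entity, attribute, value, False))

def as_of(tx):
    """Current attribute values as they stood at transaction `tx`."""
    state = {}
    for t, e, a, v, added in log:
        if t > tx:
            break
        if added:
            state[(e, a)] = v
        else:
            state.pop((e, a), None)
    return state

assert_fact("invoice-17", "amount", 1000)   # tx 1
assert_fact("invoice-17", "amount", 1200)   # tx 2: corrected, old value kept in log
retract_fact("invoice-17", "amount", 1200)  # tx 3: cancelled

print(as_of(1))  # {('invoice-17', 'amount'): 1000}
print(as_of(2))  # {('invoice-17', 'amount'): 1200}
print(as_of(3))  # {}
print(log)       # the full audit trail, every change with its transaction id
```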

Abstract:

Datomic is a database of flexible, time-based facts, supporting queries and joins, with elastic scalability and ACID transactions. Datomic queries run your application process, giving you both declarative and navigational access to your data. Datomic facts (“datoms”) are time-aware and distributed to all system peers, enabling OLTP, analytics, and detailed auditing in real time from a single system.

In this talk, I will begin with an overview of Datomic, covering the problems that it is intended to solve and how its data model, transaction model, query model, and deployment model work together to solve those problems. I will then use Datomic to illustrate more general points about designing and implementing production software, and where I believe our industry is headed. Key points include:

  • the pragmatic adoption of functional programming
  • how dynamic languages fare in mission- and performance-critical settings
  • the importance of data, and the perils of OO
  • the irony of git, or why developers give themselves better databases than they give their customers
  • perception, coordination, and reducing the barriers to scale

Resources

  • Video from CME Group Technology Conference 2012
  • Slides from CME Group Technology Conference 2012

TinkerPop 2.3.0 has been unleashed

Filed under: Blueprints,Frames,Graphs,Gremlin,Pipes,Rexster — Patrick Durusau @ 5:48 am

TinkerPop 2.3.0 has been unleashed by Marko A. Rodriguez.

Release notes for:

Blueprints

Pipes

Gremlin

Frames

Rexster

Enjoy!

Freebase Data Dumps

Filed under: Freebase,RDF — Patrick Durusau @ 5:30 am

Freebase Data Dumps

From the webpage:

Data Dumps are a downloadable version of the data in Freebase. They constitute a snapshot of the data stored in Freebase and the Schema that structures it, and are provided under the same CC-BY license.

Full data dumps of every fact and assertion in Freebase are available as RDF and are updated every week. Deltas are not available at this time.

Total triples: 585 million
Compressed size: 14 GB
Uncompressed size: 87 GB
Data Format: Turtle RDF
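At 87 GB uncompressed you will want to stream the dump rather than load it. Here is a rough Python 3 sketch (the filename is hypothetical, and it assumes roughly one triple per line rather than full Turtle parsing) for a first look at which predicates dominate:

```python
# A rough sketch for a first look at a dump this size without loading
# 87 GB into RAM: stream the gzip and tally predicates line by line.
# The filename below is hypothetical; a real Turtle parser would be
# needed for complete correctness.
import gzip
from collections import Counter

predicate_counts = Counter()

with gzip.open("freebase-rdf-latest.gz", "rt", encoding="utf-8") as dump:
    for line in dump:
        if line.startswith("@"):          # skip @prefix declarations
            continue
        parts = line.split(None, 2)       # subject, predicate, rest
        if len(parts) >= 2:
            predicate_counts[parts[1]] += 1

for predicate, count in predicate_counts.most_common(20):
    print(count, predicate)
```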

I first saw this in a tweet by Thomas Steiner.

FORCE 11

Filed under: Communication,Publishing — Patrick Durusau @ 5:19 am

FORCE 11

Short description:

Force11 (the Future of Research Communications and e-Scholarship) is a virtual community working to transform scholarly communications toward improved knowledge creation and sharing. Currently, we have 315 active members.

A longer description from the “about” page:

Research and scholarship lead to the generation of new knowledge. The dissemination of this knowledge has a fundamental impact on the ways in which society develops and progresses; and at the same time, it feeds back to improve subsequent research and scholarship. Here, as in so many other areas of human activity, the Internet is changing the way things work: it opens up opportunities for new processes that can accelerate the growth of knowledge, including the creation of new means of communicating that knowledge among researchers and within the wider community. Two decades of emergent and increasingly pervasive information technology have demonstrated the potential for far more effective scholarly communication. However, the use of this technology remains limited; research processes and the dissemination of research results have yet to fully assimilate the capabilities of the Web and other digital media. Producers and consumers remain wedded to formats developed in the era of print publication, and the reward systems for researchers remain tied to those delivery mechanisms.

Force11 is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in modern scholarly communications through the effective use of information technology. Force11 has grown from a small group of like-minded individuals into an open movement with clearly identified stakeholders associated with emerging technologies, policies, funding mechanisms and business models. While not disputing the expressive power of the written word to communicate complex ideas, our foundational assumption is that scholarly communication by means of semantically enhanced media-rich digital publishing is likely to have a greater impact than communication in traditional print media or electronic facsimiles of printed works. However, to date, online versions of ‘scholarly outputs’ have tended to replicate print forms, rather than exploit the additional functionalities afforded by the digital terrain. We believe that digital publishing of enhanced papers will enable more effective scholarly communication, which will also broaden to include, for example, the publication of software tools, and research communication by means of social media channels. We see Force11 as a starting point for a community that we hope will grow and be augmented by individual and collective efforts by the participants and others. We invite you to join and contribute to this enterprise.

Force11 grew out of the FORC Workshop held in Dagstuhl, Germany in August 2011.

FORCE11 is a movement of people interested in furthering the goals stated in the FORCE11 manifesto. An important part of our work is information gathering and dissemination. We invite anyone with relevant information to provide us links which we may include on our websites. We ask anyone with similar and/or related efforts to include links to FORCE11. We are a neutral information market, and do not endorse or seek to block any relevant work.

The Tools and Resources page is particularly interesting.

Current divisions are:

  • Alternative metrics
  • Author Identification
  • Annotation
  • Authoring tools
  • Citation analysis
  • Computational Linguistics/Text Mining Efforts
  • Data citation
  • Ereaders
  • Hypothesis/claim-based representation of the rhetorical structure of a scientific paper
  • Mapping initiatives between ontologies
  • Metadata standards and ontologies
  • Modular formats for science publishing
  • Open Citations
  • Peer Review: New Models
  • Provenance
  • Publications and reports relevant to scholarly digital publication and data
  • Semantic publishing initiatives and other enriched forms of publication
  • Structured Digital Abstracts – modeling science (especially biology) as triples
  • Structured experimental methods and workflows
  • Text Extraction

Topic maps fit into communication agendas quite easily.

The first step in communication is capturing something to say.

The second step in communication is expressing what has been captured so it can be understood by others (or yourself next week).

Topic maps do both quite nicely.

I first saw this in a tweet by Anita de Waard.

March 20, 2013

Neo4j.org 3.0 Launch

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:43 pm

Neo4j.org 3.0 Launch

From the post:

One major goal is to make it easier for you to get up and running with Neo4j. We hope to achieve this by providing everything in one place, from the download, set-up screencasts and step by step instructions to the rich choice of language support and the appropriate drivers.

We also want to make it easier for people that never worked with graph databases before to learn about Neo4j. So we created the infrastructure and started to work on learning paths that will tell a consistent story around a use-case or technology involving Neo4j. Currently we feature a learning path for Java and for Cypher but there will be many more to come. Any input in how to structure the paths and present the material is highly welcome!

Not that I am a good judge of website design but I like it a lot better than the previous version.

That may be an artifact of liking graphs and so finding the presentation more intuitive.

There may be some confusion over “Learn” versus “Training” but finding better terms to distinguish self-learning versus instruction would not be easy.

If you really miss lists of links, those are at the bottom of the page.

I would mark this one as a win for the Neo4j team!

Large-Scale Learning with Less… [Less Precision Viable?]

Filed under: Algorithms,Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:32 pm

Large-Scale Learning with Less RAM via Randomization by Daniel Golovin, D. Sculley, H. Brendan McMahan, Michael Young.

Abstract:

We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs.
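The core trick, as I read the abstract, is unbiased randomized rounding onto a coarse grid. Here is a small sketch of that idea (mine, not the paper’s code):

```python
# A sketch of the memory-saving trick as described in the abstract: store each
# weight on a coarse grid, using randomized rounding so the stored value is
# an unbiased estimate of the true weight.
import random

def randomized_round(w, grid=2 ** -8):
    """Round w to a multiple of `grid`, up or down with probability
    proportional to proximity, so E[rounded] == w."""
    lower = (w // grid) * grid
    p_up = (w - lower) / grid          # 0.0 .. 1.0
    return lower + grid if random.random() < p_up else lower

# The expectation of the rounded weight matches the original weight:
w = 0.123456789
samples = [randomized_round(w) for _ in range(100000)]
print(sum(samples) / len(samples))     # ~0.1234..., despite the coarse grid
print(randomized_round(w))             # a single stored value, e.g. 0.12109375 or 0.125
```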

I mention this in part because topic map authoring can be assisted by the results of machine learning.

It is also a data point for the proposition that, unlike their human masters, machines are too precise.

Perhaps it is the case that the vagueness of human reasoning has significant advantages over the disk grinding precision of our machines.

The question then becomes: How do we capture vagueness in a system where every point is either 0 or 1?

Not probabilistic reasoning, because that can be expressed, but vagueness, which I experience as something different.

Suggestions?

PS: Perhaps that is what makes artificial intelligence artificial. It is too precise. 😉

I first saw this in a tweet by Stefano Bertolo.

Active Defense Harbinger Distribution (ADHD)

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:20 pm

Active Defense Harbinger Distribution (ADHD)

Description:

The Active Defense Harbinger Distribution (ADHD) is a Linux distro based on Ubuntu 12.04 LTS. It comes with many tools aimed at active defense preinstalled and configured. The purpose of this distribution is to aid defenders by giving them tools to “strike back” at the bad guys.

ADHD has tools whose functions range from interfering with the attackers’ reconnaissance to compromising the attackers’ systems. Innocent bystanders will never notice anything out of the ordinary as the active defense mechanisms are triggered by malicious activity such as network scanning or connecting to restricted services.

SANS sponsored: Special Webcast: Active Defense Harbinger Distribution – Defense is Cool Again

Along with “big data,” cybersecurity is an up-and-coming area for employment.

Topic maps, by keeping found information found, give you an advantage in either field.

I first saw this in: DARPA’s Cyber Tools: We have had our hands on DARPA’s distribution platform for cyber defense tools. No links to the SANS webinar or ADHD.

“Functional Programming for…Big Data”

Filed under: BigData,Cascading,Cascalog,Clojure,Functional Programming,Scala,Scalding — Patrick Durusau @ 3:27 pm

“Functional Programming for optimization problems in Big Data” by Paco Nathan.

Interesting slide deck, even if it doesn’t start with high drama. 😉

Covers:

  1. Data Science
  2. Functional Programming
  3. Workflow Abstraction
  4. Typical Use Cases
  5. Open Data Example

The reading list mentioned in these slides makes a nice self-review course in data science.

The Open Data Example is for Palo Alto but you can substitute a city with open data closer to home.

Start with the Second Slide

Filed under: Marketing,Topic Maps — Patrick Durusau @ 1:42 pm

Start Presentations on the Second Slide by Kent Beck.

From the slide:

Technical presos need background but it’s not engaging. What’s a geeky presenter to do?

I’ve been coaching technical presenters lately, and a couple of concepts come up with almost all of them. I figured I’d write them down so I don’t necessarily have to explain them all the time. One is to use specifics and data. I’ll write that later. This post explains why to start your presentation on the second slide.

I stole this technique from Lawrence Block’s outstanding Telling Lies for Fun and Profit http://amzn.to/YTAf3C, a book about writing fiction. He suggests drafting a story the “natural” way, with the first chapter introducing the hero and the second getting the action going, then swapping the two chapters. Now the first chapter starts with a gun pointed at the hero’s head. By the end, he is teetering on a cliff about to jump into a crocodile-infested river. Just when the tension reaches a peak, we’re introduced to the character but we have reason to want to get to know him.

Technical presentations need to set some context and then present the problem to be solved. When presenters follow this order, though, the resulting presentation starts with information some listeners already know and other listeners don’t have any motivation to try to understand. It’s like our adventure story where we’re not interested in the color of the hero’s hair, at least not until he’s about to become a croc-snack.

Be honest. At least with yourself.

How many times have you started a topic maps (or other) technical presentation with content either known or irrelevant (at that point) to your audience?

We may be covering what we think is essential background information, but at that point, the audience has no reason to care.

Let me put it this way: explaining topic maps isn’t for our benefit. It isn’t supposed to make us look clever or industrious.

Explaining topic maps is supposed to interest other people in topic maps. And the problems they solve.

Have you tried the dramatic situation approach in a presentation? How did it work out?

Open Data: The World Bank Data Blog

Filed under: Government,Government Data,Open Data,Open Government — Patrick Durusau @ 1:25 pm

Open Data: The World Bank Data Blog

In case you are following open data/government issues, you will want to add this blog to your RSS feed.

Not a high traffic blog but with twenty-seven contributing authors, you get a diversity of viewpoints.

Not to mention that the World Bank is a great source for general data.

I persist in thinking that transparency means identifying individuals responsible for decisions, expenditures and the beneficiaries of those decisions and expenditures.

That isn’t a popular position among those who make decisions and approve expenditures for unidentified beneficiaries.

You will either have to speculate on your own or ask someone else why that is an unpopular position.

Scenes from a Dive

Filed under: BigData,Data Mining,Open Data,Public Data — Patrick Durusau @ 10:27 am

Scenes from a Dive – what’s big data got to do with fighting poverty and fraud? by Prasanna Lal Das.

From the post:

A more detailed recap will follow soon but here’s a very quick hats off to the about 150 data scientists, civic hackers, visual analytics savants, poverty specialists, and fraud/anti-corruption experts that made the Big Data Exploration at Washington DC over the weekend such an eye-opener. We invite you to explore the work that the volunteers did (these are rough documents and will likely change as you read them so it’s okay to hold off if you would rather wait for a ‘final’ consolidated document). The projects that the volunteers worked on include:

Here are some visualizations that some project teams built. A few photos from the event are here (thanks @neilfantom). More coming soon (and yes, videos too!). Thanks @francisgagnon for the first blog about the event. The event hashtag was #data4good (follow @datakind and @WBopenfinances for more updates on Twitter).

Great meeting and projects, but I would suggest a different sort of “big data.”

A better start would be to require recipients to grant reporting access to all bank accounts to which funds will be transferred, and to require the same of any entity paid out of those accounts, down to the point where transfers over 90 days total less than $1,000 for any entity (or related entity).

With the exception of the “related entity” information, banks already keep transfer of funds information as a matter of routine business. It would be “big data” that is rich in potential for spotting fraud and waste.

The reporting banks should also be required to deliver any other banking records they hold on the accounts receiving funds, along with other activity in those accounts.

Before crying “invasion of privacy,” remember World Bank funding is voluntary.

As is acceptance of payment from World Bank funded projects. Anyone and everyone is free to decline such funding and avoid the proposed reporting requirements.

“Big data” to track fraud and waste is already collected by the banking industry.

The question is whether we will use that “big data” to track fraud and waste effectively, or wait for particularly egregious cases to come to light.

Data Science for Social Good (Fellowship) [1 April 2013 deadline]

Filed under: Data Science,Fellowships — Patrick Durusau @ 6:34 am

Data Science for Social Good (Fellowship)

Dates:

Application Deadline: April 1, 2013

Acceptance Notification: April 10, 2013

Fellowship Program: Early June – Late August 2013

Want to use your Machine Learning and Data Mining skills for Social Good? Love the intellectual rigor of science, but want to work on real-world problems? Looking for a meaningful bridge into the growing field of data science?

The Computation Institute at the University of Chicago and Argonne National Laboratory invite you to apply for the 2013 Data Science for Social Good summer fellowship.

The program

We’re training future data scientists to work on the world’s most challenging social problems.

Fellows will work in small teams with the Obama campaign analytics team and other seasoned data scientists from academia and business as mentors and project leaders on high-impact projects in education, healthcare, energy, transportation, and more.

The program is selective, full-time for the summer, and hands-on:

  • You’ll work with data from nonprofits and governments to help them solve large problems.
  • You’ll learn how to apply statistics, machine learning, and big data technologies to problems that matter.
  • And you’ll work collaboratively with interdisciplinary teams.

The fellowship stipend is competitive and is based on your experience. We’ll house you in Chicago for the summer, from early June to late August. (Exact dates coming soon.)

Our advisory team is led by Eric Schmidt (Google) and Rayid Ghani (former Chief Scientist for Obama 2012 campaign).

Who we’re looking for

We’re looking for PhD, masters, and advanced undergraduate students in the computer science, statistics, and the computational and quantitative sciences. If you’re an amazing software engineer with a serious interest in data science, we want to hear from you too.

You’ll need statistics, programming, and data skills – but you don’t need to be an expert in every area.

Most of all, we want people who are passionate about using data for social impact.

In addition to “fellows,” they are also looking for mentors, and for governments and nonprofits with lots of data and problems. I can’t imagine lots of data and problems being hard to find. 😉

Good opportunity to gain some visibility for topic maps if you don’t mind spending the summer in Chicago.

Too bad the Daley machine isn’t what it once was. A topic map of relationships overlaid on a map of Chicago would be interesting.

I first saw this in a tweet by paco nathan.

Pyrallel – Parallel Data Analytics in Python

Filed under: Data Analysis,Parallel Programming,Programming,Python — Patrick Durusau @ 6:12 am

Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.

From the webpage:

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
  • focus on small to medium data (with data locality when possible).
  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
  • do not focus on HA / Fault Tolerance (yet).
  • do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.
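In the same spirit as that low-level programming model, here is a sketch (not Pyrallel’s API) that uses the IPython.parallel interface of that era to fan a CPU-bound task, training random forests with different seeds, across whatever engines a running ipcluster provides:

```python
# A sketch in the spirit of Pyrallel's low-level model (not Pyrallel's API):
# use IPython.parallel to spread a CPU-bound task -- training random forests
# with different seeds -- across engines. Assumes an ipcluster is already
# running (e.g. `ipcluster start -n 4`) and scikit-learn is installed.
from IPython.parallel import Client

def train_forest(seed):
    # imports run on the engine, so keep them inside the task
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    digits = load_digits()
    X, y = digits.data, digits.target
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X[:1000], y[:1000])
    return seed, model.score(X[1000:], y[1000:])

rc = Client()                                     # connect to the running cluster
view = rc.load_balanced_view()                    # simple work-stealing scheduler
results = view.map_sync(train_forest, range(8))   # 8 seeds, spread over engines

for seed, accuracy in results:
    print("seed %d -> held-out accuracy %.3f" % (seed, accuracy))
```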

This project brought to mind two things:

  1. Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
  2. A conference anecdote about a Python prototype written with the expectation that the customer would need to upgrade to a fuller version for higher performance. The prototype performed so well the customer never needed the fuller version. I thought that was a tribute to Python and the programmer. Opinions differed.

Applied Natural Language Processing

Filed under: Natural Language Processing,Scala — Patrick Durusau @ 5:53 am

Applied Natural Language Processing by Jason Baldridge.

Description:

This class will provide instruction on applying algorithms in natural language processing and machine learning for experimentation and for real world tasks, including clustering, classification, part-of-speech tagging, named entity recognition, topic modeling, and more. The approach will be practical and hands-on: for example, students will program common classifiers from the ground up, use existing toolkits such as OpenNLP, Chalk, StanfordNLP, Mallet, and Breeze, construct NLP pipelines with UIMA, and get some initial experience with distributed computation with Hadoop and Spark. Guidance will also be given on software engineering, including build tools, git, and testing. It is assumed that students are already familiar with machine learning and/or computational linguistics and that they already are competent programmers. The programming language used in the course will be Scala; no explicit instruction will be given in Scala programming, but resources and assistance will be provided for those new to the language.

From the syllabus:

The foremost goal of this course is to provide practical exposure to the core techniques and applications of natural language processing. By the end, students will understand the motivations for and capabilities of several core natural language processing and machine learning algorithms and techniques used in text analysis, including:

  • regular expressions
  • vector space models
  • clustering
  • classification
  • deduplication
  • n-gram language models
  • topic models
  • part-of-speech tagging
  • named entity recognition
  • PageRank
  • label propagation
  • dependency parsing

We will show, on a few chosen topics, how natural language processing builds on and uses the fundamental data structures and algorithms presented in this course. In particular, we will discuss:

  • authorship attribution
  • language identification
  • spam detection
  • sentiment analysis
  • influence
  • information extraction
  • geolocation

Students will learn to write non-trivial programs for natural language processing that take advantage of existing open source toolkits. The course will involve significant guidance and instruction in software engineering practices and principles, including:

  • functional programming
  • distributed version control systems (git)
  • build systems
  • unit testing
  • distributed computing (Hadoop)

The course will help prepare students both for jobs in the industry and for doing original research that involves natural language processing.

A great start to one aspect of being a “data scientist.”

I encountered this course via the Nak (Scala library for NLP) project. Version 1.1.1 was just released and I saw a tweet from Jason Baldridge on the same.

The course materials have exercises and a rich set of links to other resources.

You may also enjoy:

Jason Baldridge’s homepage.

Bcomposes (Jason’s blog).

March 19, 2013

MongoDB 2.4 Release

Filed under: Lucene,MongoDB,NoSQL,Searching,Solr — Patrick Durusau @ 1:11 pm

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).

Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.

Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.



Security
….

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)
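If you want to try it out, here is a quick sketch from Python with pymongo. The command form is my reading of the 2.4 beta interface, and it assumes a local mongod started with text search enabled (--setParameter textSearchEnabled=true) plus the hypothetical posts collection below:

```python
# A quick sketch of trying the 2.4 beta text search from Python with pymongo.
# Assumes a local mongod started with text search enabled and hypothetical data.
import pymongo

client = pymongo.MongoClient("localhost", 27017)
db = client.blog

db.posts.insert({"title": "SNOBOL pattern matching", "body": "easier than regular expressions"})
db.posts.insert({"title": "SPARQL 1.1", "body": "eleven recommendations for querying RDF"})

# Build the text index, then query it with the 2.4-era `text` database command.
db.posts.ensure_index([("body", pymongo.TEXT)])
result = db.command("text", "posts", search="regular expressions")

for hit in result["results"]:
    print(hit["score"], hit["obj"]["title"])
```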

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.

AI Algorithms, Data Structures, and Idioms…

Filed under: Algorithms,Artificial Intelligence,Data Structures,Java,Lisp,Prolog — Patrick Durusau @ 10:51 am

AI Algorithms, Data Structures, and Idioms in Prolog, Lisp and Java by George F. Luger and William A. Stubblefield.

From the introduction:

Writing a book about designing and implementing representations and search algorithms in Prolog, Lisp, and Java presents the authors with a number of exciting opportunities.

The first opportunity is the chance to compare three languages that give very different expression to the many ideas that have shaped the evolution of programming languages as a whole. These core ideas, which also support modern AI technology, include functional programming, list processing, predicate logic, declarative representation, dynamic binding, meta-linguistic abstraction, strong-typing, meta-circular definition, and object-oriented design and programming. Lisp and Prolog are, of course, widely recognized for their contributions to the evolution, theory, and practice of programming language design. Java, the youngest of this trio, is both an example of how the ideas pioneered in these earlier languages have shaped modern applicative programming, as well as a powerful tool for delivering AI applications on personal computers, local networks, and the world wide web.

Where could you go wrong with comparing Prolog, Lisp and Java?

Either for the intellectual exercise or because you want a better understanding of AI, a resource to enjoy!

Open Annotation Data Model

Filed under: Annotation,RDF,Semantic Web,W3C — Patrick Durusau @ 10:34 am

Open Annotation Data Model

Abstract:

The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.

An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is somehow about the target. The full model supports additional functionality, enabling semantic annotations, embedding content, selecting segments of resources, choosing the appropriate representation of a resource and providing styling hints for consuming clients.
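The body/target shape is easier to see in a concrete serialization. Here is a rough JSON-LD-style sketch in Python; the property names follow my reading of the OA vocabulary (oa:hasBody, oa:hasTarget), so treat them as illustrative rather than normative:

```python
# A rough sketch of the core body/target shape of an Open Annotation,
# written as a JSON-LD-style dict. Property names are my reading of the
# OA vocabulary and are illustrative, not normative.
import json

annotation = {
    "@context": {"oa": "http://www.w3.org/ns/oa#",
                 "cnt": "http://www.w3.org/2011/content#"},
    "@id": "http://example.org/anno/1",
    "@type": "oa:Annotation",
    "oa:hasBody": {
        "@type": "cnt:ContentAsText",
        "cnt:chars": "A comment attached to a single web resource."
    },
    "oa:hasTarget": "http://example.org/page.html",
    "oa:motivatedBy": "oa:commenting"
}

print(json.dumps(annotation, indent=2))
```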

This is my first encounter with this proposal, so I need to compare it to my Simple Web Semantics.

At first blush, the Open Annotation Core Model looks a lot heavier than Simple Web Semantics.

I need to rework my blog posts into a formal document and perhaps attach a comparison as an annex.

Knowledge Discovery from Mining Big Data [Astronomy]

Filed under: Astroinformatics,BigData,Data Mining,Knowledge Discovery — Patrick Durusau @ 10:17 am

Knowledge Discovery from Mining Big Data – Presentation by Kirk Borne by Bruce Berriman.

From the post:

My friend and colleague Kirk Borne, of George Mason University, is a specialist in the modern field of data mining and astroinformatics. I was delighted to learn that he was giving a talk on an introduction to this topic as part of the Space Telescope Engineering and Technology Colloquia, and so I watched on the webcast. You can watch the presentation on-line, and you can download the slides from the same page. The presentation is a comprehensive introduction to data mining in astronomy, and I recommend it if you want to grasp the essentials of the field.

Kirk began by reminding us that responding to the data tsunami is a national priority in essentially all fields of science – a number of nationally commissioned working groups have been unanimous in reaching this conclusion and in emphasizing the need for scientific and educational programs in data mining. The slides give a list of publications in this area.

Deeply entertaining presentation on big data.

The first thirty minutes or so are good for “big data” quotes and hype but the real meat comes at about slide 22.

Extends the 3 V’s (Volume, Variety, Velocity) to include Veracity, Variability, Venue, Vocabulary, Value.

And outlines classes of discovery:

  • Class Discovery
    • Finding new classes of objects and behaviors
    • Learning the rules that constrain the class boundaries
  • Novelty Discovery
    • Finding new, rare, one-in-a-million(billion)(trillion) objects and events
  • Correlation Discovery
    • Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
  • Association Discovery
    • Finding unusual (improbable) co-occurring associations

A great presentation with references and other names you will want to follow on big data and astroinformatics.
