Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 22, 2015

Commercial Users of Functional Programming 2015 (Call for Presentations)

Filed under: Conferences,Functional Programming — Patrick Durusau @ 11:19 am

Commercial Users of Functional Programming 2015 (Call for Presentations)

CUFP 2015
Co-located with ICFP 2015
Vancouver, Canada
September 3-5
Talk Proposal Submission Deadline: 14 June 2015
CUFP 2015 Presentation Submission Form

From the webpage:


If you have experience using functional languages in a practical setting, we invite you to submit a proposal to give a talk at the event. We’re looking for two kinds of talks:

Experience reports are typically 25 minutes long, and aim to inform participants about how functional programming plays out in real-world applications, focusing especially on lessons learnt and insights gained. Experience reports don’t need to be highly technical; reflections on the commercial, management, or software engineering aspects are, if anything, more important.

Technical talks are also 25 minutes long, and should focus on teaching the audience something about a particular technique or methodology, from the point of view of someone who has seen it play out in practice. These talks could cover anything from techniques for building functional concurrent applications, to managing dynamic reconfigurations, to design recipes for using types effectively in large-scale applications. While these talks will often be based on a particular language, they should be accessible to a broad range of programmers.

I thought it was particularly interesting that you can propose a presentation or nominate someone to make a presentation. That may be standard at CUFP but I haven’t noticed it at other conferences. An innovation that could/should be adopted elsewhere?

BTW, while you wait for this year’s CUFP meeting, you can review videos from CUFP meetings from 2011 through 2014. A very nice resource tucked away on a conference page. (See “Videos” on the top menu bar.)

AccuWeather.com reports the June average temperature is between 66 and 70 degrees F. Great conference weather!

I first saw this in a tweet by fogus.

The Morning Paper [computing papers selected by Adrian Colyer]

Filed under: Computer Science,Distributed Computing,Programming — Patrick Durusau @ 10:58 am

The Morning Paper [computing papers selected by Adrian Colyer]

From the about page:

The Morning Paper: a short summary of an important, influential, topical or otherwise interesting paper in the field of CS every weekday. The Morning Paper started out as a twitter project (#themorningpaper), then it became clear a longer form was also necessary because some papers just have too much good content to get across in a small batch of 140-character tweets!

The daily selection will still be tweeted on my twitter account (adriancolyer), with a quote or two to whet your appetite. Any longer excerpts or commentary will live here.

Why ‘The Morning Paper?’ (a) it’s a habit I enjoy, and (b) if one or two papers catch your attention and lead you to discover (or rediscover) something of interest then I’m happy.

Adrian’s 100th post was January 7, 2015 so you have some catching up to do. 😉

Very impressive and far more useful than the recent “newspaper” formats that automatically capture content from a variety of sources.

The Morning Paper is curated content, which makes all the difference in the world.

There is an emphasis on distributed computing making The Morning Paper a must read for anyone interested in the present and future of computing services.

Enjoy!

I first saw this in a tweet by Tyler Treat.

February 21, 2015

No First Amendment for Tweets?

Filed under: Government,Law — Patrick Durusau @ 7:41 pm

U.S. Government Demands Social Media Censorship by Kurt Nimmo.

From the post:

Congress and the White House are leaning on Twitter to censor Islamic State posts on its network.

Rep. Ted Poe, R-Texas, the chair of a House foreign affairs subcommittee on terrorism, has singled out Twitter for allowing supposed IS operatives to recruit and propagandize on the social media platform.

“This is the way (the Islamic State) is recruiting — they are getting people to leave their homelands and become fighters,” Poe said.

He added “there is frustration with Twitter specifically” over its refusal to censor tweets the government claims promotes terrorism.

I was hoping to read that Twitter told Congress and the White House to read the Constitution of the United States, especially:

Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.

What part of “Congress shall make no law…abridging the freedom of speech, or of the press….” seems unclear?

What was the response of Twitter?

Twitter has responded to the accusations by saying it provides user tracking information on alleged IS members to the FBI.

Over the last year, however, Twitter has suspended a large number of IS accounts. It has suspended nearly 800 suspected accounts since last autumn, but this “may be the tip of the iceberg,” as almost 18,000 accounts “related” to the Islamic State were suspended over the same time period, according to JM Berger, a fellow at the Brookings Institution who tracks Islamists on social media.

Assuming IS related accounts don’t violate Twitter’s terms of usage, why should they be suspended? An action for which there is no legal recourse?

Even worse, the censoring of IS on social media leaves the public with no alternative sources of information about IS, other than U.S. vetted propaganda.

An uninformed public cannot effectively exercise its democratic franchise. Perhaps that is the goal of censoring IS content on social media. What are government leaders so afraid of the American public and others learning? That we have been lied to by our own government about IS?

Twitter should grow a spine and stop supplying tracking information on suspected IS members, stop suspending IS accounts and appeal to its user base to support those positions.

I fear ignorance and censorship far more than anyone identified as a terrorist by the U.S. government.

I first saw this in a tweet by the U.S. Department of Fear.

The Many Faces of Science (the journal)

Filed under: Peer Review,Publishing — Patrick Durusau @ 7:17 pm

Andy Dalby tells a chilling tale in Why I will never trust Science again.

You need to read the full account, but as a quick summary: Andy submitted a paper to Science that was rejected and, within weeks, found that Science had accepted another, deeply flawed paper reaching the same conclusion. When he notified Science, it was suggested he post an online comment. Andy’s account has quotes, links to references, etc.

That is one face of Science: secretive, arbitrary and restricted peer review of submissions. I say “restricted peer” because Science uses only a tiny number of reviewers, compared to the full body of your peers, to review submissions. If you want “peer review,” you should publish with an open source journal that enlists all of your peers as reviewers, not just a few.

There is another face of Science, which appeared last December without any trace of irony at all:

Does journal peer review miss best and brightest? by David Shultz, which reads in part:

Sometimes greatness is hard to spot. Before going on to lead the Chicago Bulls to six NBA championships, Michael Jordan was famously cut from his high school basketball team. Scientists often face rejection of their own—in their case, the gatekeepers aren’t high school coaches, but journal editors and peers they select to review submitted papers. A study published today indicates that this system does a reasonable job of predicting the eventual interest in most papers, but it may shoot an air ball when it comes to identifying really game-changing research.

There is a serious chink in the armor, though: All 14 of the most highly cited papers in the study were rejected by the three elite journals, and 12 of those were bounced before they could reach peer review. The finding suggests that unconventional research that falls outside the established lines of thought may be more prone to rejection from top journals, Siler says.

Science publishes research showing its methods are flawed and yet it takes no notice. Perhaps its rejection of Andy’s paper isn’t so strange. It must not have traveled far enough down the stairs.

I first saw Andy’s paper in a tweet by Mick Watson.

Feedback and data-driven updates to Google’s disclosure policy [Project Zero]

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:15 pm

Feedback and data-driven updates to Google’s disclosure policy [Project Zero] by Chris Evans, et al.

From the post:

Disclosure deadlines have long been an industry standard practice. They improve end-user security by getting security patches to users faster. As noted in CERT’s 45-day disclosure policy, they also “balance the need of the public to be informed of security vulnerabilities with vendors’ need for time to respond effectively”. Yahoo!’s 90-day policy notes that “Time is of the essence when we discover these types of issues: the more quickly we address the risks, the less harm an attack can cause”. ZDI’s 120-day policy notes that releasing vulnerability details can “enable the defensive community to protect the user”.

Deadlines also acknowledge an uncomfortable fact that is alluded to by some of the above policies: the offensive security community invests considerably more into vulnerability research than the defensive community. Therefore, when we find a vulnerability in a high profile target, it is often already known by advanced and stealthy actors.

Project Zero has adhered to a 90-day disclosure deadline. Now we are applying this approach for the rest of Google as well. We notify vendors of vulnerabilities immediately, with details shared in public with the defensive community after 90 days, or sooner if the vendor releases a fix. We’ve chosen a middle-of-the-road deadline timeline and feel it’s reasonably calibrated for the current state of the industry.

To see how things are going, we crunched some data on Project Zero’s disclosures to date. For example, the Adobe Flash team probably has the largest install base and number of build combinations of any of the products we’ve researched so far. To date, they have fixed 37 Project Zero vulnerabilities (or 100%) within the 90-day deadline. More generally, of 154 Project Zero bugs fixed so far, 85% were fixed within 90 days. Restrict this to the 73 issues filed and fixed after Oct 1st, 2014, and 95% were fixed within 90 days. Furthermore, recent well-discussed deadline misses were typically fixed very quickly after 90 days. Looking ahead, we’re not going to have any deadline misses for at least the rest of February.

Deadlines appear to be working to improve patch times and end user security — especially when enforced consistently.

I ran across the Project Zero post after reading Google Threatens to Air Microsoft and Apple’s Dirty Code by Chris Strohm and Jordan Robertson. Strohm and Robertson recite the usual pouting from Apple and Microsoft about fixed deadlines for disclosure of flaws, as though they are concerned about the safety of users.

If either Microsoft or Apple were concerned about users, they would assign teams to flaws on notice and resource those teams to produce and test fixes (do no harm) before the ninety days are up. Project Zero now allows a fourteen (14) day grace period after the ninety days, so if a fix is about to be released, disclosure can be delayed. (That wasn’t always the case, which had been a source of complaints.)
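
For concreteness, here is a minimal sketch of how that policy plays out as a calendar date, assuming the 90-day deadline plus the 14-day grace period described above. The function and its arguments are my own invention for illustration, not Project Zero code.

```python
from datetime import date, timedelta
from typing import Optional

DEADLINE_DAYS = 90   # standard Project Zero disclosure window
GRACE_DAYS = 14      # grace period when a fix is imminent

def disclosure_date(reported: date, fix_ships: Optional[date] = None) -> date:
    """Illustrative only: when do details go public under the policy above?"""
    deadline = reported + timedelta(days=DEADLINE_DAYS)
    if fix_ships is not None:
        if fix_ships <= deadline:
            return fix_ships              # fix released early: disclose then
        if fix_ships <= deadline + timedelta(days=GRACE_DAYS):
            return fix_ships              # imminent fix: grace period applies
    return deadline                       # otherwise the 90-day deadline holds

# Reported 1 Nov 2014, fix scheduled 6 days past the deadline -> 2015-02-05
print(disclosure_date(date(2014, 11, 1), date(2015, 2, 5)))
```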

While Apple and Microsoft are important software sources, one hopes that Project Zero will not restrict its attention to those vendors or even to software. (I am assuming Project Zero monitors Google code as well?)

The average user lacks the time, training and resources to detect software flaws and certainly would not get the attention that a Project Zero report of a flaw commands.

The long tradition of secrecy until flaws are fixed has not served the user community well. Even now, it isn’t possible to say software is secure; it is only possible to say it may be free from flaw X. Maybe.

If Apple, Microsoft and others want to complain about the disclosure policies of Project Zero they should produce evidence of how secrecy of security flaws has resulted in more secure software. Not just insecure software with flaws known only to a few.

Who knows? Maybe Project Zero will attract such attention to security flaws that vendors will spend the time and money necessary to produce secure software. What a concept!

I first saw the “dirty code” post in a tweet by Marin Dimitrov.

Yelp Dataset Challenge

Filed under: Challenges,Data — Patrick Durusau @ 4:38 pm

Yelp Dataset Challenge

From the webpage:

Yelp Dataset Challenge is doubling up: Now 10 cities across 4 countries! Two years, four highly competitive rounds, over $35,000 in cash prizes awarded and several hundred peer-reviewed papers later: the Yelp Dataset Challenge is doubling up. We are proud to announce our latest dataset that includes information about local businesses, reviews and users in 10 cities across 4 countries. The Yelp Challenge dataset is much larger and richer than the Academic Dataset. This treasure trove of local business data is waiting to be mined and we can’t wait to see you push the frontiers of data science research with our data.

The Challenge Dataset:

  • 1.6M reviews and 500K tips by 366K users for 61K businesses
  • 481K business attributes, e.g., hours, parking availability, ambience.
  • Social network of 366K users for a total of 2.9M social edges.
  • Aggregated check-ins over time for each of the 61K businesses

The deadline for the fifth round of the Yelp Dataset Challenge is June 30, 2015. Submit your project to Yelp by visiting yelp.com/challenge/submit. You can submit a research paper, video presentation, slide deck, website, blog, or any other medium that conveys your use of the Yelp Dataset Challenge data.

    Pitched at students but it is an interesting dataset.
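
    If you want to poke at the data before committing to a project, the challenge data has shipped as files of one JSON object per line. A minimal sketch, assuming a local copy of the business file; the file and field names below are taken from earlier releases and may differ in the current one:

```python
import json
from collections import Counter

# Hypothetical local filename; the challenge data ships as files of
# one JSON object per line (business, review, user, tip, check-in).
BUSINESS_FILE = "yelp_academic_dataset_business.json"

def iter_records(path):
    """Yield one parsed JSON record per line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Example: count businesses per city to see the 10-city spread.
city_counts = Counter(b.get("city", "unknown") for b in iter_records(BUSINESS_FILE))
for city, n in city_counts.most_common(10):
    print(f"{city}: {n}")
```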

    I first saw this in a tweet by Marin Dimitrov.

    Guantanamo – A Spot of Good News!

    Filed under: Government,Security — Patrick Durusau @ 3:55 pm

    Andy Worthington writes in The collapse of Guantanamo’s military commissions

    From the post:

    The news that the US Court of Military Commission Review has dismissed the conviction against David Hicks, the first prisoner convicted in Guantanamo’s much-criticised military commission trial system, calls the future of the entire system into doubt. Hicks, an Australian seized in Afghanistan, had undertaken military training, although there was never any proof that he had engaged in combat with anyone, let alone US forces.

    In March 2007, he accepted a plea deal in his trial by military commission, admitting to providing material support for terrorism, convinced that it was his only way out of Guantanamo, and waiving his right to appeal. In return, he received a seven-year sentence, although all but nine months were suspended. He was repatriated the following month, and was released in December 2007. Since then, he has fought to clear his name, and has finally been vindicated in a ruling that, importantly, also overturned the waiver against lodging an appeal.

    This is the fourth humiliation for the military commissions, which have only reached results in the cases of eight men in total since they were first revived in November 2001.

    The Office of Military Commissions is composed of military officers who are also trained in law. I mention that to make it clear the court that dismissed the charges against David Hicks was a military court.

    The terrorist fear mongers failed to account for one thing with military trials at Guantanamo: there are any number of judges, both civilian and military, who adhere to traditional views of due process and rights as embodied in the United States Constitution, its laws and legal traditions.

    The dismissal of charges against David Hicks is good news for him and for other detainees at Guantanamo, but more importantly, it is another indication that the rule of law isn’t dead in the United States. Looking forward to its full return.

    This is Visual Journalism [100]

    Filed under: Graphics,Infographics,Journalism,News — Patrick Durusau @ 3:31 pm

    This is Visual Journalism [100] by Tiago Veloso.

    From the post:

    Edition number one hundred of our round up of infographics from the print industry, and the selection we pulled together today is a perfect celebration – after all, we have dozens of new works from newsrooms all over the world, making this one of the biggest selections published on Visualoop so far.

    And in less than a month, we’ll be covering the 23rd edition of Malofiej Awards – the world’s main stage for journalistic infographics. We’ve actually begun our coverage, with two great posts: our friend Marco Vergotti, infographic editor of Época magazine, made this special infographic about last year’s Malofiej Awards; and this exclusive interview with the main responsible for the success of the event, the Spanish journalist Javier Errea. If you missed these posts, we definitively recommend you to read them.

    I count fifty-four (54) stunning infographics from print publications.

    Before you skip these as “just print infographics” remember that print infographics can’t rely on interaction with a user.

    They either capture the attention of a reader or fail, usually miserably.

    Which of these capture your attention? How would you duplicate that in a more forgiving digital environment?

    PS: If you can’t capture and hold a user’s attention, the quality or capabilities of your software aren’t going to have an opportunity to shine.

    Redefining “URL” to Invalidate Twenty-One (21) Years of Usage

    Filed under: HTML5,WWW — Patrick Durusau @ 3:11 pm

    You may be interested to know that efforts are underway to bury the original meaning of URL and to replace it with another meaning.

    Our trail starts with the HTML 5 draft of 17 December 2012, which reads in part:

    2.6 URLs

    This specification defines the term URL, and defines various algorithms for dealing with URLs, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.

    The term “URL” in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term “URL” as used herein is really called something else altogether. This is a willful violation of RFC 3986. [RFC3986]

    2.6.1 Terminology

    A URL is a string used to identify a resource.

    A URL is a valid URL if at least one of the following conditions holds:

    • The URL is a valid URI reference [RFC3986].
    • The URL is a valid IRI reference and it has no query component. [RFC3987]
    • The URL is a valid IRI reference and its query component contains no unescaped non-ASCII characters. [RFC3987]
    • The URL is a valid IRI reference and the character encoding of the URL’s Document is UTF-8 or a UTF-16 encoding. [RFC3987]

    You may not like the usurpation of URL and its meaning but at least it is honestly reported.
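
    A short sketch makes the gap concrete: strings with spaces or raw non-ASCII characters are not valid RFC 3986 URI references, yet lenient parsers (browsers, and Python’s urllib here) happily process them. This is a rough character check for illustration, not a full validator:

```python
import re
from urllib.parse import urlsplit, quote

# Rough check (illustrative, not a full RFC 3986 validator): a URI
# reference is limited to this ASCII repertoire; percent-escapes
# cover everything else.
RFC3986_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*$")

candidates = [
    "http://example.com/a b",        # literal space: invalid per RFC 3986
    "http://example.com/caf\u00e9",  # raw non-ASCII: IRI territory (RFC 3987)
    "http://example.com/caf%C3%A9",  # percent-escaped: fine everywhere
]

for url in candidates:
    ok = bool(RFC3986_CHARS.match(url))
    parts = urlsplit(url)            # lenient parsing, as browsers are
    print(f"{url!r:40} RFC3986-clean={ok}  parsed path={parts.path!r}")

# Escaping produces the form all of the specs agree on:
print(quote("http://example.com/a b", safe=":/"))
```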

    Compare Editor’s Draft 13 November 2014, which reads in part:

    2.5 URLs

    2.5.1 Terminology

    A URL is a valid URL if it conforms to the authoring conformance requirements in the WHATWG URL standard. [URL]

    A string is a valid non-empty URL if it is a valid URL but it is not the empty string.

    Hmmm, all the references to IRIs and to violating RFC3986 have disappeared.

    But there is a reference to the WHATWG URL standard.

    If you follow that internal link to the bibliography you will find:

    [URL]
    URL (URL: http://url.spec.whatwg.org/), A. van Kesteren. WHATWG.

    Next stop: URL Living Standard — Last Updated 6 February 2015, which reads in part:

    The URL standard takes the following approach towards making URLs fully interoperable:

    • Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process. (E.g. spaces, other “illegal” code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing. [RFC3986] [RFC3987]
    • Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.

    A specification being developed by WHATWG.org.

    Not nearly as clear and forthcoming as the HTML5 draft as of 17 December 2012. Yes?

    RFC3986 and RFC3987 are products of the IETF. If revisions of those RFCs are required, shouldn’t that work be at IETF?

    Or at a minimum, why is a foundation for HTML5 not at the W3C, if not at IETF?

    The conflation of URLs (RFC3986) and IRIs (RFC3987) is taking place well away from the IETF and W3C processes.

    A conflation that invalidates twenty-one (21) years of use of URL in books, papers, presentations, documentation, etc.

    BTW, URL was originally defined in 1994 in RFC1738.

    Is popularity of an acronym worth that cost?

    Museums: The endangered dead [Physical Big Data]

    Filed under: BigData,Museums — Patrick Durusau @ 11:43 am

    Museums: The endangered dead by Christopher Kemp.

    Ricardo Moratelli surveys several hundred dead bats — their wings neatly folded — in a room deep inside the Smithsonian Institution in Washington DC. He moves methodically among specimens arranged in ranks like a squadron of bombers on a mission. Attached to each animal’s right ankle is a tag that tells Moratelli where and when the creature was collected, and by whom. Some of the tags have yellowed with age — they mark bats that were collected more than a century ago. Moratelli selects a small, compact individual with dark wings and a luxurious golden pelage. It fits easily in his cupped palm.

    To the untrained eye, this specimen looks identical to the rest. But Moratelli, a postdoctoral fellow at the Smithsonian’s National Museum of Natural History, has discovered that the bat in his hands is a new species. It was collected in February 1979 in an Ecuadorian forest on the western slopes of the Andes. A subadult male, it has been waiting for decades for someone such as Moratelli to recognize its uniqueness. He named it Myotis diminutus. Before Moratelli could take that step, however, he had to collect morphometric data — precise measurements of the skull and post-cranial skeleton — from other specimens. In all, he studied 3,000 other bats from 18 collections around the world.

    Myotis diminutus is not alone. And neither is Ricardo Moratelli.

    Across the world, natural-history collections hold thousands of species awaiting identification. In fact, researchers today find many more novel animals and plants by sifting through decades-old specimens than they do by surveying tropical forests and remote landscapes. An estimated three-quarters of newly named mammal species are already part of a natural-history collection at the time they are identified. They sometimes sit unrecognized for a century or longer, hidden in drawers, half-forgotten in jars, misidentified, unlabelled.

    A reminder that not all “big data” is digital, at least not yet.

    Specimens already collected number in the billions worldwide. As Chris makes clear, many are languishing for lack of curators, and in some cases the collected specimens are the only evidence such creatures ever lived on the Earth.

    Vint Cerf (“Father of the Internet,” not Al Gore) has warned of a “forgotten century” of digital data.

    As bad as a lost century of digital data may sound, our neglect of natural history collections threatens the loss of millions of years of evolutionary history, forever.

    PS: Read Chris’ post in full and push for greater funding for natural history collections. The history we save may turn out to be critically important.

    Basic Understanding of Big Data…. [The need for better filtering tools]

    Filed under: BigData,Intelligence — Patrick Durusau @ 11:12 am

    Basic Understanding of Big Data. What is this and How it is going to solve complex problems by Deepak Kumar.

    From the post:

    Before going into details about what is big data let’s take a moment to look at the below slides by Hewlett-Packard.

    [Slide: “What is Big Data” (Hewlett-Packard)]

    The post goes on to describe big data but never quite gets around to saying how it will solve complex problems.

    I mention it for the HP graphic that illustrates the problem of big data for the intelligence community.

    Yes, they have big data as in the three V’s: volume, variety, velocity and so need processing infrastructure to manage that as input.

    However, the results they seek are not the product of summing clicks, likes, retweets, ratings and/or web browsing behavior, at least not for the most part.

    The vast majority of the “big data” at their disposal is noise that is masking a few signals that they wish to detect.

    I mention that because of the seeming emphasis of late on real time or interactive processing of large quantities of data, which isn’t a bad thing, but also not a useful thing when what you really want are the emails, phone contacts and other digital debris of, say, fewer than one thousand (1,000) people (that number was randomly chosen as an illustration, I have no idea of the actual number of people being monitored).

    It may help to think of big data in the intelligence community as consisting of a vast amount of “big data” about which it doesn’t care and a relatively tiny bit of data that it cares about a lot. The problem being one of separating the data into those two categories.

    Take the telephone metadata records as an example. There is some known set of phone numbers that are monitored and contacts to and from those numbers. The rest of the numbers and their data are of interest if and only if at some future date they are added to the known set of phone numbers to be monitored.

    When the monitored numbers and their metadata are filtered out, I assume that previously investigated numbers for pizza delivery, dry cleaning and the like are filtered from the current data, leaving only current high value contacts or new unknowns for investigation.

    An emphasis on filtering before querying big data would reduce the number of spurious connections simply because a smaller data set has less random data that could be seen as patterns with other data. Not to mention that the smaller the data set, the more prior data could be associated with current data without overwhelming the analyst.

    You may start off with big data but the goal is a very small amount of actionable data.
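
    A toy sketch of that “filter before you query” idea, with invented numbers and field names standing in for the real watch lists:

```python
# A minimal sketch of "filter before you query": reduce bulk call
# metadata to records touching a (hypothetical) monitored set, then
# drop contacts already investigated (pizza delivery, dry cleaning...).
monitored = {"+15550100", "+15550199"}          # numbers under watch
already_cleared = {"+15550142"}                 # known-benign contacts

call_records = [
    {"caller": "+15550100", "callee": "+15550142", "secs": 61},
    {"caller": "+15557777", "callee": "+15558888", "secs": 300},
    {"caller": "+15550199", "callee": "+15553333", "secs": 45},
]

def relevant(record):
    parties = {record["caller"], record["callee"]}
    if not parties & monitored:          # neither party is watched: noise
        return False
    other = parties - monitored
    return not (other and other <= already_cleared)  # skip cleared contacts

actionable = [r for r in call_records if relevant(r)]
print(actionable)   # keeps only the record pairing a watched number with an unknown
```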

    Introducing MicroXML

    Filed under: XML — Patrick Durusau @ 10:11 am

    Introducing MicroXML by Uche Ogbuji.

    Uche took until now to post his slides from XML Prague 2013 so I’m excused for not posting about them sooner! 😉

    Some resources to help get you started:

    Introducing MicroXML (the movie, starring Uche Ogbuji)

    Introducing MicroXML, Part 1: Explore the basic principles of MicroXML

    Introducing MicroXML, Part 2: Process MicroXML with microxml-js

    MicroXML Community Group (W3C)

    MicroXML (2012 spec)

    Abstract:

    MicroXML is a subset of XML intended for use in contexts where full XML is, or is perceived to be, too large and complex. It has been designed to complement rather than replace XML, JSON and HTML. Like XML, it is a general format for making use of markup vocabularies rather than a specific markup vocabulary like HTML. This document provides a complete description of MicroXML.

    If you have seen any of the recent XML work you will be glad someone is working on MicroXML.
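
    Because MicroXML is a strict subset of XML 1.0, any stock XML parser will read a conforming document; a few of the extra constraints are easy to spot-check. A rough sketch, not a conformance checker:

```python
import xml.etree.ElementTree as ET

# MicroXML is a subset of XML 1.0, so a stock XML parser reads a
# conforming document; the extra constraints (no XML declaration or
# DOCTYPE, no namespaces, hence no colons in names) can be spot-checked.
doc = '<memo importance="high"><to>Ada</to><body>MicroXML is small.</body></memo>'

root = ET.fromstring(doc)
print(root.tag, root.attrib)          # memo {'importance': 'high'}

def looks_like_microxml(source: str) -> bool:
    """Rough subset check, not a conformance test."""
    if source.lstrip().startswith(("<?xml", "<!DOCTYPE")):
        return False
    return all(":" not in el.tag and all(":" not in a for a in el.attrib)
               for el in ET.fromstring(source).iter())

print(looks_like_microxml(doc))       # True
```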

    Enjoy!

    February 20, 2015

    Deep Learning Track at GTC

    Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 8:22 pm

    Deep Learning Track at GTC

    March 17-20, 2015 | San Jose, California

    From the webpage:

    The Deep Learning Track at GTC features over 40 sessions from industry experts on topics ranging from visual object recognition to the next generation of speech.

    Just the deep learning sessions.

    Keynote Speakers in Deep Learning track:

    Jeff Dean – Google, Senior Fellow

    Jen-Hsun Huang – NVIDIA, CEO & Co-Founder

    Andrew Ng – Baidu, Chief Scientist

    Featured Speakers:

    John Canny – UC Berkeley, Professor

    Dan Ciresan – IDSIA, Senior Researcher

    Rob Fergus – Facebook, Research Scientist

    Yangqing Jia – Google, Research Scientist

    Ian Lane – Carnegie Mellon University, Assistant Research Professor

    Ren Wu – Baidu, Distinguished Scientist

    Have you registered yet? If not, why not? 😉

    Expecting lots of blog posts covering presentations at the conference.

    Army Changing How It Does Requirements [How Are Your Big Data Requirements Coming?]

    Filed under: BigData,Design,Requirements — Patrick Durusau @ 8:07 pm

    Army Changing How It Does Requirements: McMaster by Sydney J. Freedberg Jr.

    From the post:


    So there’s a difficult balance to strike between the three words that make up “mobile protected firepower.” The vehicle is still just a concept, not a funded program. But past projects like FCS began going wrong right from those first conceptual stages, when TRADOC Systems Managers (TSMs) wrote up the official requirements for performance with little reference to what tradeoffs would be required in terms of real-world engineering. So what is TRADOC doing differently this time?

    “We just did an Initial Capability Document [ICD] for ‘mobile protected firepower,’” said McMaster. “When we wrote that document, we brought together 18th Airborne Corps and other [infantry] and Stryker brigade combat team leadership” — i.e. the units that would actually use the vehicle — “who had recent operational experience.”

    So they’re getting help — lots and lots of help. In an organization as bureaucratic and tribal as the Army, voluntarily sharing power is a major breakthrough. It’s especially big for TRADOC, which tends to take on priestly airs as guardian of the service’s sacred doctrinal texts. What TRADOC has done is a bit like the Vatican asking the Bishop of Boise to help draft a papal bull.

    But that’s hardly all. “We brought together, obviously, the acquisition community, so PEO Ground Combat Vehicle was in on the writing of the requirements. We brought in the Army lab, TARDEC,” McMaster told reporters at a Defense Writers’ Group breakfast this morning. “We brought in Army Materiel Command and the sustainment community to help write it. And then we brought in the Army G-3 [operations and plans] and the Army G-8 [resources]” from the service’s Pentagon staff.

    Traditionally, all these organizations play separate and unequal roles in the process. This time, said McMaster, “we wrote the document together.” That’s the model for how TRADOC will write requirements in the future, he went on: “Do it together and collaborate from the beginning.”

    It’s important to remember how huge a hole the Army has to climb out of. The 2011 Decker-Wagner report calculated that, since 1996, the Army had wasted from $1 billion to $3 billion annually on two dozen different cancelled programs. The report pointed out an institutional problem much bigger than just the Future Combat System. Indeed, since FCS went down in flames, the Army has cancelled yet another major program, its Ground Combat Vehicle.

    As I ask in the headline: How Are Your Big Data Requirements Coming?

    Have you gotten all the relevant parties together? Have they all collaborated on making the business case for your use of big data? Or are your requirements written by managers who are divorced from the people who will use the resulting application or data? (Think Virtual Case File.)

    The Army appears to have gotten the message on requirements, temporarily at least. How about you?

    A massive database now translates news in 65 languages in real time [GDELT 2.0]

    Filed under: GDELT,News,Reporting — Patrick Durusau @ 7:52 pm

    A massive database now translates news in 65 languages in real time by Derrick Harris.

    From the post:

    I have written quite a bit about GDELT (the Global Database of Events, Languages and Tone) over the past year, because I think it’s a great example of the type of ambitious project only made possible by the advent of cloud computing and big data systems. In a nutshell, it’s database of more than 250 million socioeconomic and geopolitical events and their metadata dating back to 1979, all stored (now) in Google’s cloud and available to analyze for free via Google BigQuery or custom-built applications.

    On Thursday, version 2.0 of GDELT was unveiled, complete with a slew of new features — faster updates, sentiment analysis, images, a more-expansive knowledge graph and, most importantly, real-time translation across 65 different languages. That’s 98.4 percent of the non-English content GDELT monitors. Because you can’t really have a global database, or expect to get a full picture of what’s happening around the world, if you’re limited to English language sources or exceedingly long turnaround times for translated content.

    The GDELT homepage reports:

    We’ll be releasing a new “Getting Started With GDELT” user guide in the next few days to walk you through the incredibly vast array of new capabilities in GDELT 2.0,…

    Awesome, simply awesome!

    Bear in mind that the data presented here isn’t “cooked.” That is, it hasn’t been trimmed and merged with your client’s internal knowledge of “…socioeconomic and geopolitical events…” and how they impact the client’s interests.

    For example, labor strikes in a shipping port on one continent may delay on-time shipments from a manufacturer on another for delivery to still a third continent. The information that ties all those items together is held by your client, not any public source.

    There is a vast sea of client data, relationships and interests to be mapped from a resource like GDELT, and the 2.0 version is simply upping the possible rewards.
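
    If you want to try the BigQuery route, here is a hedged sketch; the table and column names are assumptions to verify against the GDELT 2.0 documentation, and the client needs your own Google Cloud credentials:

```python
from google.cloud import bigquery   # pip install google-cloud-bigquery

# Hedged sketch: GDELT is exposed as a public BigQuery dataset; the
# table and column names below are assumptions to check against the
# current GDELT 2.0 documentation before use.
client = bigquery.Client()          # uses your Google Cloud credentials

sql = """
    SELECT Actor1CountryCode, COUNT(*) AS events
    FROM `gdelt-bq.gdeltv2.events`
    WHERE SQLDATE >= 20150201
    GROUP BY Actor1CountryCode
    ORDER BY events DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.Actor1CountryCode, row.events)
```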

    Just in case you are curious:

    Terms of Use

    What can I do with GDELT and how can I use it in my projects?

    Using GDELT

    The GDELT Project is an open platform for research and analysis of global society and thus all datasets released by the GDELT Project are available for unlimited and unrestricted use for any academic, commercial, or governmental use of any kind without fee.

    Redistributing GDELT

    You may redistribute, rehost, republish, and mirror any of the GDELT datasets in any form. However, any use or redistribution of the data must include a citation to the GDELT Project and a link to this website (http://gdeltproject.org/).

    It is hard to imagine a data resource getting any better than this!

    PS: By late Spring 2015, the backfiles to 1979 will be available in GDELT 2.0 format. Maybe it can get better. 😉

    PPS: See the GDELT Blog for posts on using GDELT.

    Academic Karma: a case study in how not to use open data

    Filed under: Open Data — Patrick Durusau @ 7:28 pm

    Academic Karma: a case study in how not to use open data by Neil Saunders.

    From the post:

    A news story in Nature last year caused considerable mirth and consternation in my social networks by claiming that ResearchGate, a “Facebook for scientists”, is widely-used and visited by scientists. Since this is true of nobody that we know, we can only assume that there is a whole “other” sub-network of scientists defined by both usage of ResearchGate and willingness to take Nature surveys seriously.

    You might be forgiven, however, for assuming that I have a profile at ResearchGate because here it is. Except: it is not. That page was generated automatically by ResearchGate, using what they could glean about me from bits of public data on the Web. Since they have only discovered about one-third of my professional publications, it’s a gross misrepresentation of my achievements and activity. I could claim the profile, log in and improve the data, but I don’t want to expose myself and everyone I know to marketing spam until the end of time.

    One issue with providing open data about yourself online is that you can’t predict how it might be used. Which brings me to Academic Karma.

    Neil points out that Academic Karma generated an inaccurate profile of his academic activities, based on partial information from a ResearchGate profile that Neil did not create.

    Neil concludes:

    So let me try to spell it out as best I can.

    1. I object to the automated generation of public profiles, without my knowledge or consent, which could be construed as having been generated by me
    2. I especially object when those profiles convey an impression of my character, such as “someone who publishes but does not review”, based on incomplete and misleading data

    I’m sure that the Academic Karma team mean well and believe that what they’re doing can improve the research process. However, it seems to me that this is a classic case of enthusiasm for technological solutions without due consideration of the human and social aspects.

    To their credit, Academic Karma has stopped listing profiles for people who haven’t requested accounts.

    How would you define the “human and social aspects” of open data?

    In hindsight, the answer to that question seems to be clear. Or at least is thought to be clear. How do you answer that question before your use of open data goes live?

    27 hilariously bad maps that explain nothing

    Filed under: Humor,Maps — Patrick Durusau @ 5:46 pm

    27 hilariously bad maps that explain nothing by Max Fisher.

    For your weekend enjoyment!

    One sample:

    [Sample map image from the post]

    Max says that the United States is incorrect and I agree.

    Should extend down to the tip of South America, plus our client states in Europe and two still-occupied countries, Germany and Japan.

    Oh, it was supposed to be acknowledged international borders! I see. A fictional map available at many locations on the Internet and at better stores everywhere.

    The Great SIM Heist

    Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 4:47 pm

    The Great SIM Heist – How Spies Stole the Keys to the Encryption Castle by Jeremy Scahill and Josh Begley.

    From the post:

    AMERICAN AND BRITISH spies hacked into the internal computer network of the largest manufacturer of SIM cards in the world, stealing encryption keys used to protect the privacy of cellphone communications across the globe, according to top-secret documents provided to The Intercept by National Security Agency whistleblower Edward Snowden.

    The hack was perpetrated by a joint unit consisting of operatives from the NSA and its British counterpart Government Communications Headquarters, or GCHQ. The breach, detailed in a secret 2010 GCHQ document, gave the surveillance agencies the potential to secretly monitor a large portion of the world’s cellular communications, including both voice and data.

    The company targeted by the intelligence agencies, Gemalto, is a multinational firm incorporated in the Netherlands that makes the chips used in mobile phones and next-generation credit cards. Among its clients are AT&T, T-Mobile, Verizon, Sprint and some 450 wireless network providers around the world. The company operates in 85 countries and has more than 40 manufacturing facilities. One of its three global headquarters is in Austin, Texas and it has a large factory in Pennsylvania.

    In all, Gemalto produces some 2 billion SIM cards a year. Its motto is “Security to be Free.”

    Read the original post to get an idea of the full impact of this heist.

    Bottom line: Anything transmitted or stored electronically (phone, Internet, disk drive) should be considered as compromised.

    How can people protect themselves when their government “protectors” are spying on them in addition to many others?

    There isn’t a good answer to that last question but one needs to be found and soon.


    Update: Mike Masnick says theft of SIM encryption keys demonstrates that any repository of backdoors will be a prime target for hackers, endangering the privacy of all users with those backdoors. Not a theoretical risk, the NSA and others have demonstrated the risk to be real. See: NSA’s Stealing Keys To Mobile Phone Encryption Shows Why Mandatory Backdoors To Encryption Is A Horrible Idea

    More Bad Data News – Psychology

    Filed under: Data Quality,Psychology — Patrick Durusau @ 4:28 pm

    Statistical Reporting Errors and Collaboration on Statistical Analyses in Psychological Science by Coosje L. S. Veldkamp, et al. (PLOS Published: December 10, 2014 DOI: 10.1371/journal.pone.0114876)

    Abstract:

    Statistical analysis is error prone. A best practice for researchers using statistics would therefore be to share data among co-authors, allowing double-checking of executed tasks just as co-pilots do in aviation. To document the extent to which this ‘co-piloting’ currently occurs in psychology, we surveyed the authors of 697 articles published in six top psychology journals and asked them whether they had collaborated on four aspects of analyzing data and reporting results, and whether the described data had been shared between the authors. We acquired responses for 49.6% of the articles and found that co-piloting on statistical analysis and reporting results is quite uncommon among psychologists, while data sharing among co-authors seems reasonably but not completely standard. We then used an automated procedure to study the prevalence of statistical reporting errors in the articles in our sample and examined the relationship between reporting errors and co-piloting. Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not found to be associated with reporting errors.

    If you are relying on statistical reports from psychology publications, you need to keep the last part of that abstract firmly in mind:

    Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not found to be associated with reporting errors.

    That is an impressive error rate. Imagine incorrect GPS locations 63% of the time and your car starting only 80% of the time. I would take that as a sign that something was seriously wrong.
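
    The consistency check the authors automated is easy to reproduce for a single reported result: recompute the p-value from the test statistic and degrees of freedom and compare it with the reported p. A sketch using my own function, not the paper’s code:

```python
from scipy import stats

def p_consistent(t_value, df, reported_p, tol=0.01, alpha=0.05):
    """Recompute a two-tailed p from t and df and compare it with the
    reported p (the same kind of check the paper automates)."""
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    consistent = abs(recomputed - reported_p) <= tol
    decision_flip = (recomputed < alpha) != (reported_p < alpha)
    return recomputed, consistent, decision_flip

# "t(28) = 2.20, p = .04": recomputed p is about .036, consistent;
# "t(28) = 1.70, p = .04": recomputed p is about .10, inconsistent,
# and the difference flips the significance decision.
print(p_consistent(2.20, 28, 0.04))
print(p_consistent(1.70, 28, 0.04))
```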

    Not an amazing result considering reports of contamination in genome studies and bad HR data, not to mention that only 6% of landmark cancer research projects could be replicated.

    At the root of the problem are people. People just like you and me.

    People who did not follow (or in some cases record) a well-defined process that included independent verification of the results they obtained.

    Independent verification is never free but then neither are the consequences of errors. Choose carefully.

    February 19, 2015

    Solr 5.0 RC3!

    Filed under: Search Engines,Solr — Patrick Durusau @ 7:49 pm

    Yes, Solr 5.0 RC3 has dropped!

    Start at http://lucene.apache.org/solr/mirrors-solr-latest-redir.html, which will toss you out at the 4.10.3 release.

    Let it take you to the suggested site and then move up and then down into the 5.0 directory.

    Download.

    Enjoy!

    PS: For Lucene, follow the same directions but going to the Lucene download page.

    Facial Recognition Breakthrough!

    Filed under: Face Detection,Security — Patrick Durusau @ 7:32 pm

    A specialized face-processing network consistent with the representational geometry of monkey face patches by Amirhossein Farzmahdi, et al.

    Abstract:

    Ample evidence suggests that face processing in human and non-human primates is performed differently compared with other objects. Converging reports, both physiologically and psychophysically, indicate that faces are processed in specialized neural networks in the brain -i.e. face patches in monkeys and the fusiform face area (FFA) in humans. We are all expert face-processing agents, and able to identify very subtle differences within the category of faces, despite substantial visual and featural similarities. Identification is performed rapidly and accurately after viewing a whole face, while significantly drops if some of the face configurations (e.g. inversion, misalignment) are manipulated or if partial views of faces are shown due to occlusion. This refers to a hotly-debated, yet highly-supported concept, known as holistic face processing. We built a hierarchical computational model of face-processing based on evidence from recent neuronal and behavioural studies on faces processing in primates. Representational geometries of the last three layers of the model have characteristics similar to those observed in monkey face patches (posterior, middle and anterior patches). Furthermore, several face-processing-related phenomena reported in the literature automatically emerge as properties of this model. The representations are evolved through several computational layers, using biologically plausible learning rules. The model satisfies face inversion effect, composite face effect, other race effect, view and identity selectivity, and canonical face views. To our knowledge, no models have so far been proposed with this performance and agreement with biological data.

    The article runs a full forty-eight (48) pages of citation laden text.

    If you want a shorter synopsis, try: Human Face Recognition Found In Neural Network Based On Monkey Brains, which summarizes the paper and mentions the following similarities between human facial recognition and recognition by the neural network:

    • Both recognize faces easiest when seen between a full frontal and a profile
    • Both have difficulty recognizing faces when upside down
    • Composite faces, top and bottom from different people, are recognized by both as different people
    • If the neural network is trained on one race, it has difficulty recognizing faces of other races, just like people

    A large amount of investigation remains to be done, along with extending the methodology used here to explore and create the neural network.

    From a privacy/security perspective, counter-measures will be needed to defeat ever more accurate facial recognition software.

    Titan 0.5.4 Release!

    Filed under: Graphs,Titan — Patrick Durusau @ 7:01 pm

    Titan 0.5.4 Release! by Dan LaRocque.

    From the post:

    We’re pleased to announce the release of Titan 0.5.4.

    This is mostly a bugfix release. It also includes property read optimization.

    The zip archives:

    http://s3.thinkaurelius.com/downloads/titan/titan-0.5.4-hadoop1.zip
    http://s3.thinkaurelius.com/downloads/titan/titan-0.5.4-hadoop2.zip

    The documentation:

    Manual: http://s3.thinkaurelius.com/docs/titan/0.5.4/
    Javadoc: http://titan.thinkaurelius.com/javadoc/0.5.4/

    The 0.5.4 release is compatible with earlier releases in the 0.5 series. There are no user-facing API changes and no storage changes between 0.5.3 and this release. For upgrades from 0.5.2 and earlier, consider the upgrade notes about minor API changes:

    http://s3.thinkaurelius.com/docs/titan/0.5.4/upgrade.html

    The changelog contains a bit more information about what’s new in this release:

    http://s3.thinkaurelius.com/docs/titan/0.5.4/changelog.html

    We are indebted to the community for valuable bug and pain point reports that shaped 0.5.4.

    Bugfix only or not, users in the United States will welcome any distraction from the current cold wave! 😉

    The Great Bank Robbery: the Carbanak APT

    Filed under: Cybersecurity — Patrick Durusau @ 6:53 pm

    The Great Bank Robbery: the Carbanak APT by GReAT (Kaspersky Labs’ Global Research & Analysis Team).

    From the post:

    [Carbanak infographic from the Kaspersky Lab post]

    A great read on how hackers may have pocketed up to $1bn.

    Download the full report (PDF).

    Before some cyber-defender uses this as another example of why we need a national cyberdefense program, consider this paragraph from the conclusion of the full report:

    Despite increased awareness of cybercrime within the financial services sector, it appears that spear phishing attacks and old exploits (for which patches have been disseminated) remain effective against larger companies. Attackers always use this minimal effort approach in order to bypass a victim’s defenses.

    In other words, a human opened an infected attachment to an email.

    Your first question in cyberdefense debates should be:

    Will solution X prevent users from opening infected email attachments?

    Second question: Does it protect the system despite users opening infected email attachments?

    If the answer to both questions is no, you have enough information to make a decision.

    An effective cyberdefense must address basic security issues before more exotic ones.

    Graph data management

    Filed under: Graph Databases,Graphs — Patrick Durusau @ 4:01 pm

    Graph data management by Amol Deshpande.

    From the post:

    Graph data management has seen a resurgence in recent years, because of an increasing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, and many others. Despite much work on the topic, graph data management is still a nascent topic with many open questions. At the same time, I feel that the research in the database community is fragmented and somewhat disconnected from application domains, and many important questions are not being investigated in our community. This blog post is an attempt to summarize some of my thoughts on this topic, and what exciting and important research problems I think are still open.

    At its simplest, graph data management is about managing, querying, and analyzing a set of entities (nodes) and interconnections (edges) between them, both of which may have attributes associated with them. Although much of the research has focused on homogeneous graphs, most real-world graphs are heterogeneous, and the entities and the edges can usually be grouped into a small number of well-defined classes.

    Graph processing tasks can be broadly divided into a few categories. (1) First, we may to want execute standard SQL queries, especially aggregations, by treating the node and edge classes as relations. (2) Second, we may have queries focused on the interconnection structure and its properties; examples include subgraph pattern matching (and variants), keyword proximity search, reachability queries, counting or aggregating over patterns (e.g., triangle/motif counting), grouping nodes based on their interconnection structures, path queries, and others. (3) Third, there is usually a need to execute basic or advanced graph algorithms on the graphs or their subgraphs, e.g., bipartite matching, spanning trees, network flow, shortest paths, traversals, finding cliques or dense subgraphs, graph bisection/partitioning, etc. (4) Fourth, there are “network science” or “graph mining” tasks where the goal is to understand the interconnection network, build predictive models for it, and/or identify interesting events or different types of structures; examples of such tasks include community detection, centrality analysis, influence propagation, ego-centric analysis, modeling evolution over time, link prediction, frequent subgraph mining, and many others [New10]. There is much research still being done on developing new such techniques; however, there is also increasing interest in applying the more mature techniques to very large graphs and doing so in real-time. (5) Finally, many general-purpose machine learning and optimization algorithms (e.g., logistic regression, stochastic gradient descent, ADMM) can be cast as graph processing tasks in appropriately constructed graphs, allowing us to solve problems like topic modeling, recommendations, matrix factorization, etc., on very large inputs [Low12].

    Prior work on graph data management could itself be roughly divided into work on specialized graph databases and on large-scale graph analytics, which have largely evolved separately from each other; the former has considered end-to-end data management issues including storage representations, transactions, and query languages, whereas the latter work has typically focused on processing specific tasks or types of tasks over large volumes of data. I will discuss those separately, focusing on whether we need “new” systems for graph data management and on open problems.

    Very much worth a deep, slow read. Despite marketing claims about graph databases, fundamental issues remain to be solved.
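
    By way of illustration, here is a toy sketch that touches several of the task categories Deshpande lists, using networkx and an invented two-class (person/company) graph; it is a thumbnail of the ideas, not a benchmark:

```python
import networkx as nx

# A small sketch of a heterogeneous graph: people and the companies
# they work for, with a few of the task categories from the post.
G = nx.Graph()
G.add_nodes_from(["alice", "bob", "carol"], kind="person")
G.add_nodes_from(["acme", "globex"], kind="company")
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("alice", "acme"), ("bob", "acme"), ("carol", "globex")])

# (1) aggregation over a node "class"
people = [n for n, d in G.nodes(data=True) if d["kind"] == "person"]
print("people:", len(people))

# (2) structural query: shortest path between two entities
print(nx.shortest_path(G, "alice", "globex"))

# (3) classic graph algorithm: triangle counts per person
print(nx.triangles(G, nodes=people))

# (4) network-science style measure: betweenness centrality
print(nx.betweenness_centrality(G))
```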

    Enjoy!

    I first saw this in a tweet by Kirk Borne.

    Most HR Data Is Bad Data

    Filed under: Psychology,Recommendation — Patrick Durusau @ 3:22 pm

    Most HR Data Is Bad Data by Marcus Buckingham.

    “Bad data” can come in any number of forms and Marcus Buckingham focuses on one of the most pernicious: Data that is flawed at its inception. Data that doesn’t measure what it purports to measure. Performance evaluation data.

    From the post:

    Over the last fifteen years a significant body of research has demonstrated that each of us is a disturbingly unreliable rater of other people’s performance. The effect that ruins our ability to rate others has a name: the Idiosyncratic Rater Effect, which tells us that my rating of you on a quality such as “potential” is driven not by who you are, but instead by my own idiosyncrasies—how I define “potential,” how much of it I think I have, how tough a rater I usually am. This effect is resilient — no amount of training seems able to lessen it. And it is large — on average, 61% of my rating of you is a reflection of me.

    In other words, when I rate you, on anything, my rating reveals to the world far more about me than it does about you. In the world of psychometrics this effect has been well documented. The first large study was published in 1998 in Personnel Psychology; there was a second study published in the Journal of Applied Psychology in 2000; and a third confirmatory analysis appeared in 2010, again in Personnel Psychology. In each of the separate studies, the approach was the same: first ask peers, direct reports, and bosses to rate managers on a number of different performance competencies; and then examine the ratings (more than half a million of them across the three studies) to see what explained why the managers received the ratings they did. They found that more than half of the variation in a manager’s ratings could be explained by the unique rating patterns of the individual doing the rating— in the first study it was 71%, the second 58%, the third 55%.

    You have to follow the Idiosyncratic Rater Effect link to find the references Buckingham cites so I have repeated them (with links and abstracts) below:

    Trait, Rater and Level Effects in 360-Degree Performance Ratings by Michael K. Mount, et al., Personnel Psychology, 1998, 51, 557-576.

    Abstract:

    Method and trait effects in multitrait-multirater (MTMR) data were examined in a sample of 2,350 managers who participated in a developmental feedback program. Managers rated their own performance and were also rated by two subordinates, two peers, and two bosses. The primary purpose of the study was to determine whether method effects are associated with the level of the rater (boss, peer, subordinate, self) or with each individual rater, or both. Previous research which has tacitly assumed that method effects are associated with the level of the rater has included only one rater from each level; consequently, method effects due to the rater’s level may have been confounded with those due to the individual rater. Based on confirmatory factor analysis, the present results revealed that of the five models tested, the best fit was the 10-factor model which hypothesized 7 method factors (one for each individual rater) and 3 trait factors. These results suggest that method variance in MTMR data is more strongly associated with individual raters than with the rater’s level. Implications for research and practice pertaining to multirater feedback programs are discussed.

    Understanding the Latent Structure of Job Performance Ratings, by Michael K. Mount, Steven E. Scullen, Maynard Goff, Journal of Applied Psychology, 2000, Vol. 85, No. 6, 956-970 (I looked but apparently the APA hasn’t gotten the word about access to abstracts online, etc.)

    Rater Source Effects are Alive and Well After All by Brian Hoffman, et al., Personnel Psychology, 2010, 63, 119-151.

    Abstract:

    Recent research has questioned the importance of rater perspective effects on multisource performance ratings (MSPRs). Although making a valuable contribution, we hypothesize that this research has obscured evidence for systematic rater source effects as a result of misspecified models of the structure of multisource performance ratings and inappropriate analytic methods. Accordingly, this study provides a reexamination of the impact of rater source on multisource performance ratings by presenting a set of confirmatory factor analyses of two large samples of multisource performance rating data in which source effects are modeled in the form of second-order factors. Hierarchical confirmatory factor analysis of both samples revealed that the structure of multisource performance ratings can be characterized by general performance, dimensional performance, idiosyncratic rater, and source factors, and that source factors explain (much) more variance in multisource performance ratings whereas general performance explains (much) less variance than was previously believed. These results reinforce the value of collecting performance data from raters occupying different organizational levels and have important implications for research and practice.

    For students: Can you think of other sources that validate the Idiosyncratic Rater Effect?

    What about algorithms that make recommendations based on user ratings of movies? Isn’t the premise of recommendations that the ratings tell us more about the rater than about the movie? So we can make the “right” recommendation for a person very similar to the rater?
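
    To see how ratings can end up describing the rater rather than the ratee, here is a toy simulation sketch in Python (my own illustration, not the method used in the cited studies; the variance parameters are made up). Each rating mixes the ratee's "true" quality with a per-rater bias, and we then ask how much of the total rating variance the rater accounts for.

        import numpy as np

        rng = np.random.default_rng(42)
        n_raters, n_ratees = 200, 50

        rater_bias = rng.normal(0.0, 1.0, size=n_raters)    # how tough or lenient each rater is
        true_quality = rng.normal(0.0, 0.6, size=n_ratees)  # the signal we actually want to measure
        noise = rng.normal(0.0, 0.4, size=(n_raters, n_ratees))

        # Each rating = rater's idiosyncratic bias + ratee's quality + noise.
        ratings = rater_bias[:, None] + true_quality[None, :] + noise

        # Crude decomposition: how much of the variance do per-rater means account for?
        share_rater = ratings.mean(axis=1).var() / ratings.var()
        print("Share of rating variance explained by the rater: {:.0%}".format(share_rater))

    With these made-up parameters the rater's share should come out at roughly two-thirds, in the same neighborhood as the 55% to 71% figures reported in the studies above.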

    I don’t know that it means anything but a search with a popular search engine turns up only 258 “hits” for “Idiosyncratic Rater Effect.” On the other hand, “recommendation system” turns up 424,000 “hits” and that sounds low to me considering the literature on recommendation.

    The bottom line on data quality: widespread use of data is no guarantee of its quality.

    What ratings reflect is useful in one context (recommendation) and pernicious in another (employment ratings).

    I first saw this in a tweet by Neil Saunders.

    Congress.gov offers email alerts

    Filed under: Government,Government Data — Patrick Durusau @ 1:38 pm

    Congress.gov offers email alerts

    From the post:

    Beginning today [5 February 2015], the free legislative information website Congress.gov offers users a new optional email-alerts system that makes tracking legislative action even easier. Users can elect to receive email alerts for tracking:

    • A specific bill in the current Congress: Receive an email when there are updates to a specific bill (new cosponsors, committee action, vote taken, etc.); emails are sent once a day if there has been a change in a particular bill’s status since the previous day.
    • A specific member’s legislative activity: Receive an email when a specific member introduces or cosponsors a bill; emails are sent once a day if a member has introduced or cosponsored a bill since the previous day.
    • Congressional Record: Receive an email as soon as a new issue of the Congressional Record is available on Congress.gov.

    The alerts system is a new feature available to anyone who creates a free account on the Congress.gov site. Creating an account also enables users to save searches. Create an account and sign up for alerts at congress.gov/account.

    If you are interested in legislation or in influencing those who vote on it, you should sign up for these alerts. No promises other than if you aren’t heard, your opinion won’t be considered.

    You should also use Congress.gov to verify the content of legislation when you get “…the world is ending as we know it…” emails from interest groups. You are not well-informed if you are completely reliant on the opinions of others. Misguided perhaps, but not well-informed.

    GOLD (General Ontology for Linguistic Description) Standard

    Filed under: Linguistics,Ontology — Patrick Durusau @ 11:28 am

    GOLD (General Ontology for Linguistic Description) Standard

    From the homepage:

    The purpose of the GOLD Community is to bring together scholars interested in best-practice encoding of linguistic data. We promote best practice as suggested by E-MELD, encourage data interoperability through the use of the GOLD Standard, facilitate search across disparate data sets and provide a platform for sharing existing data and tools from related research projects. The development and refinement of the GOLD Standard will be the basis for and the product of the combined efforts of the GOLD Community. This standard encompasses linguistic concepts, definitions of these concepts and relationships between them in a freely available ontology.

    The GOLD standard is dated 2010 and I didn’t see any updates for it.

    If you are interested in capturing the subject identity properties before new nomenclatures replace the ones found here, now would be a good time. A sketch of how you might start appears below.
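
    A minimal sketch with rdflib for listing the ontology's classes, labels and definitions (assumptions: you have downloaded the GOLD OWL file to a local copy named gold.owl in RDF/XML; adjust the filename and format to whatever you actually fetch):

        # List GOLD classes with their labels and definitions.
        # Assumes a local copy of the GOLD ontology saved as "gold.owl" (RDF/XML).
        from rdflib import Graph
        from rdflib.namespace import RDF, RDFS, OWL

        g = Graph()
        g.parse("gold.owl", format="xml")   # hypothetical local filename

        for cls in g.subjects(RDF.type, OWL.Class):
            label = g.value(cls, RDFS.label)
            comment = g.value(cls, RDFS.comment)
            if label:
                print(u"{}: {}".format(label, comment or "(no definition given)"))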

    I first saw this in a tweet by the Linguist List.

    Reporting Context for News Stories (Hate Crimes)

    Filed under: News,Reporting — Patrick Durusau @ 11:11 am

    AJ+ tweeted two graphics on 17 February 2015:

    [Two graphics, “hate-muslims” and “hate-crimes,” not reproduced here.]

    Unless my math is off, that is 1,031 religion-based hate crimes in 2013.

    We would all prefer that number to be 0 but it’s not.

    The problem with those graphics is that they give no sense of how those crimes compare to the incidence of crime in general.

    Assuming that hate crimes can be violent or property crimes, the total of those two categories of crime in the United States for 2013 were:

    9,795,658 (1,163,146 violent crimes + 8,632,512 property crimes)

    Or if you want religious hate crimes as a percentage of all violent and property crimes:

    1,031 / 9,795,658 ≈ 0.01%, the share of religious hate crimes among all violent and property crime.

    Or, let's assume every religious hate crime was committed by a different individual, giving us a total of 1,031 offenders.

    To put that in context, the estimated U.S. population was 316,497,531 in 2013, with 23.3% of the population being under 18 years of age. That leaves a population over 18 of 242,753,606.

    If you want religious hate crime offenders as a percentage of the U.S. population over 18 years of age:

    1,031 / 242,753,606 ≈ 0.0004%, the share of religious hate crime offenders in the U.S. population over 18.

    Or the number of adults in the U.S. who didn't commit a religious hate crime in 2013: 242,753,606 – 1,031 = 242,752,575
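
    If you prefer to let the computer do the division, a few lines of Python reproduce the context figures (a sketch using only the crime counts and population estimate quoted above):

        # Context for the 2013 religious hate crime count, using the figures quoted above.
        religious_hate_crimes = 1031
        violent_crimes = 1163146
        property_crimes = 8632512
        population_2013 = 316497531
        share_under_18 = 0.233

        all_crimes = violent_crimes + property_crimes       # 9,795,658
        adults = population_2013 * (1 - share_under_18)     # about 242.8 million

        print("Share of all violent/property crime: {:.4%}".format(religious_hate_crimes / all_crimes))
        print("Offenders as share of adults:        {:.5%}".format(religious_hate_crimes / adults))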

    Including context in those graphics would be extremely difficult because the context is so large that the acts in question would not show up on the graphics at all.

    What should our response to religious hate crime be? At a minimum the offenders should be caught and punished and the local community should rally around the victims to assure them the aberrant offenders do not represent the local community and to help the victims and their community heal.

    At the same time, we should recognize, as should religious communities, that religious hate crimes are aberrant behavior that represents views not shared by the general population or the government.

    Take this as an illustration: news without context isn't news. It is noise.

    Update: I omitted my source for U.S. population statistics: USA QuickFacts

    Introducing DataFrames in Spark for Large Scale Data Science

    Filed under: Data Frames,Spark — Patrick Durusau @ 10:12 am

    Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin, Michael Armbrust and Davies Liu.

    From the post:

    Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

    When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens.

    As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

    • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
    • Support for a wide array of data formats and storage systems
    • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
    • Seamless integration with all big data tooling and infrastructure via Spark
    • APIs for Python, Java, Scala, and R (in development via SparkR)

    For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

    The DataFrame API will be released with Spark 1.3 in early March.
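
    To make the description concrete, here is a minimal PySpark sketch (my own example, written against the 1.3-era Python API as the post describes it, so names may shift before release). It expresses the same filter/group/count you would otherwise hand-write over RDDs, declaratively, so the Catalyst optimizer is free to rearrange it.

        # Minimal DataFrame sketch against the Spark 1.3-era Python API (pyspark).
        from pyspark import SparkContext
        from pyspark.sql import SQLContext, Row

        sc = SparkContext("local[2]", "dataframe-sketch")
        sqlContext = SQLContext(sc)

        people = sqlContext.createDataFrame(sc.parallelize([
            Row(name="Ada", dept="eng", salary=120000),
            Row(name="Bob", dept="eng", salary=95000),
            Row(name="Cleo", dept="sales", salary=88000),
        ]))

        people.printSchema()

        # Declarative transformations the optimizer can reorder,
        # instead of hand-rolled map/filter code over RDDs.
        (people.filter(people.salary > 90000)
               .groupBy("dept")
               .count()
               .show())

        sc.stop()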

    BTW, the act of using a dataframe creates a new subject, yes? How are you going to document the semantics of such subjects? I didn’t notice a place to write down that information.

    That's a good question to ask of many of the emerging big/large/ginormous data tools. I have trouble remembering what I meant from yesterday's notes, and that's not an uncommon experience. Imagine six months from now. Or when you are at your third client this month and the first one calls for help.

    Remember: To many eyes undocumented subjects are opaque.

    I first saw this in a tweet by Sebastian Raschka.

    February 18, 2015

    Study Finds Jack Shit

    Filed under: Humor — Patrick Durusau @ 5:24 pm

    Study Finds Jack Shit

    From the post:

    BALTIMORE—A team of scientists at Johns Hopkins University announced Monday that a five-year study examining the link between polyphenols and lower cholesterol rates has found jack shit.

    “I can’t explain what happened,” head researcher Dr. Jeremy Ingels said. “We meticulously followed correct scientific procedure. Our methods were sufficiently rigorous that they should have produced some sort of result. Instead, we found out nothing.”

    Added Ingels: “Nothing!”

    As Ingels stepped aside to compose himself, fellow researcher Dr. Thomas Chen took the podium to discuss the $7 million jack-shit-yielding study.

    “We are all very upset,” Chen said. “When we began, this looked so promising, I would have bankrolled it myself. Now, after five years, I couldn’t tell you if polyphenols even exist.”

    The study, which Chen characterized as a “huge waste of time and money,” was financed by a Johns Hopkins alumni grant to determine the effects of the compound polyphenol on cholesterol. A known antioxidant found in herbs, teas, olive oil, and wines, polyphenol was originally thought to lower cholesterol—a theory that remains unproven because the Johns Hopkins researchers couldn’t prove squat.

    “We can’t say zip about whether it lowers cholesterol,” Ingels said. “We don’t know if it raises cholesterol. Hell, we don’t know if it joins with cholesterol to form an unholy alliance to take over your gall bladder. At this point, I couldn’t prove that a male donkey has nuts if they were swinging in my face.”

    I mentioned earlier today that neither phone vacuuming nor the TSA has identified a single terrorist since 9/11. I didn't want those responsible to feel like they were the only people who find “jack shit.”

    The difference here, of course, is that a university that doesn’t find jack shit runs out of funding, unlike the NSA and TSA.

    Enjoy!
