Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 8, 2015

Haskell by Example

Filed under: Functional Programming,Haskell — Patrick Durusau @ 8:43 pm

Haskell by Example

A port of Go by Example to Haskell.

Reading this will be less useful than starting from Go by Example and creating your own port to Haskell. Use this page to check your work.

You could use Go by Example to create ports to other languages as well.

Enjoy!

Hubble Ultra Deep Field

Filed under: Astroinformatics,Data Collection — Patrick Durusau @ 8:34 pm

Hubble Ultra Deep Field: UVUDF: Ultraviolet Imaging of the HUDF with WFC3

From the webpage:

HST Program 12534 (Principal Investigator: Dr. Harry Teplitz)

Project Overview Paper: Teplitz, H. et al. (2013), AJ 146, 159

Science Project Home Page: http://uvudf.ipac.caltech.edu/

The Hubble UltraDeep Field (UDF) previously had deep observations at Far-UV, optical (B-z), and NIR wavelengths (Beckwith et al. 2006; Siana et al. 2007, Bouwens et al. 2011; Ellis et al. 2013; Koekemoer et al. 2013; Illingworth et al. 2013), but only comparatively shallow near-UV (u-band) imaging from WFPC2. With this new UVUDF project (Teplitz et al. 2013), we fill this gap in UDF coverage with deep near-ultraviolet imaging with WFC3-UVIS in F225W, F275W, and F336W. In the spirit of the UDF, we increase the legacy value of the UDF by providing science quality mosaics, photometric catalogs, and improved photometric redshifts to enable a wide range of research by the community.

The scientific emphasis of this project is to investigate the episode of peak star formation activity in galaxies at 1 < z < 2.5. The UV data are intended to enable identification of galaxies in this epoch via the Lyman break and can allow us to trace the rest-frame FUV luminosity function and the internal color structure of galaxies, as well as measuring the star formation properties of moderate redshift starburst galaxies including the UV slope. The high spatial resolution of UVIS (a physical scale of about 700 pc at 0.5 < z < 1.5) enable the investigation of the evolution of massive galaxies by resolving sub-galactic units (clumps). We will measure (or set strict limits on) the escape fraction of ionizing radiation from galaxies at z~2-3 to better understand how star-forming galaxies reionized the Universe.

Data were obtained in three observing Epochs, each using one of two observing modes (as described in Teplitz et al. 2013). Epochs 1 and 2 together obtained about 15 orbits of data per filter, and Epoch 3 obtained another 15 orbits per filter. In the second release, we include Epoch 3, which includes all the data that were obtained using post-flash (the UVIS capability to add internal background light), to mitigate the effects of degradation of the charge transfer efficiency of the detectors (Mackenty & Smith 2012).

The data were reduced using a combination of standard and custom calibration scripts (see Rafelski et al. 2015), including the use of software to correct for charge transfer inefficiency and custom super dark files. The individual reduced exposures were then registered and combined using a modified version of the MosaicDrizzle pipeline (see Koekemoer et al. 2011 and Rafelski et al. 2015 for further details) and are all made available here. In addition to the image mosaics, an aperture matched PSF corrected photometric catalog is made available, including photometric and spectroscopic redshifts in the UDF. The details of the catalog and redshifts are described in Rafelski et al. (2015). If you use these mosaics or catalog, please cite Teplitz et al. (2013) and Rafelski et al. (2015).

Open but also challenging data.

This is an example of how to document the collection and processing of data sets.

Enjoy!

Open Data: Getting Started/Finding

Filed under: Government Data,Open Data — Patrick Durusau @ 8:23 pm

Data Science – Getting Started With Open Data

23 Resources for Finding Open Data

Ryan Swanstrom has put together two posts that will have you finding and using open data.

“Open data” can be a boon to researchers and others, but you should ask the following questions (among others) of any data set:

  1. Who collected the data?
  2. Why was the data collected?
  3. How was the recorded data selected?
  4. How large was the potential data pool?
  5. Was the original data cleaned after collection?
  6. If the original data was cleaned, by what criteria?
  7. How was the accuracy of the data measured?
  8. What instruments were used to collect the data?
  9. How were the instruments used to collect the data developed?
  10. How were the instruments used to collect the data validated?
  11. What publications have relied upon the data?
  12. How did you determine the semantics of the data?

That’s not a complete set, but it is a good starting point.
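If you want to keep the answers with the data itself, a provenance record can be as simple as a dictionary. A minimal sketch in Python; the field names and values are mine, purely illustrative, not any metadata standard:

```python
# A minimal, illustrative provenance record for an open data set.
# Field names and values are ad hoc examples; adapt them to whatever
# metadata standard your project already uses.
provenance = {
    "collected_by": "Example Agency, Office of Data Collection",
    "purpose": "Quarterly reporting on program outcomes",
    "selection_method": "Stratified sample of administrative records",
    "potential_pool_size": 1_250_000,
    "cleaned_after_collection": True,
    "cleaning_criteria": "Dropped records with missing dates; normalized names",
    "accuracy_measurement": "Double-entry audit of a 2% subsample",
    "instruments": ["web intake form v3", "call-center script v7"],
    "instrument_validation": "Pilot tested against the 2012 field survey",
    "citing_publications": ["doi:10.0000/example.2014.001"],
    "semantics_determined_by": "Agency data dictionary, 2013 edition",
}

# Refuse to rely on a data set whose record leaves key questions unanswered.
required = ["collected_by", "purpose", "selection_method", "semantics_determined_by"]
missing = [key for key in required if not provenance.get(key)]
if missing:
    raise ValueError(f"Provenance incomplete; unanswered questions: {missing}")
```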

Just because data is available, open, free, etc. doesn’t mean that it is useful. The best example is the still-in-print Budge translation The book of the dead : the papyrus of Ani in the British Museum. The original was published in 1895, making the current reprints more than a century out of date.

It is a very attractive reproduction (it is rare to see hieroglyphic text with interlinear transliteration and translation in modern editions) of the papyrus of Ani, but it gives a misleading impression of the state of modern knowledge and translation of Middle Egyptian.

Of course, some readers are satisfied with century old encyclopedias as well, but I would not rely upon them or their sources for advice.

Digital Approaches to Hebrew Manuscripts

Filed under: Digital Research,Humanities,Library,Manuscripts — Patrick Durusau @ 7:48 pm

Digital Approaches to Hebrew Manuscripts

Monday 18th – Tuesday 19th of May 2015

From the webpage:

We are delighted to announce the programme for On the Same Page: Digital Approaches to Hebrew Manuscripts at King’s College London. This two-day conference will explore the potential for the computer-assisted study of Hebrew manuscripts; discuss the intersection of Jewish Studies and Digital Humanities; and share methodologies. Amongst the topics covered will be Hebrew palaeography and codicology, the encoding and transcription of Hebrew texts, the practical and theoretical consequences of the use of digital surrogates and the visualisation of manuscript evidence and data. For the full programme and our Call for Posters, please see below.

Organised by the Departments of Digital Humanities and Theology & Religious Studies (Jewish Studies)
Co-sponsor: Centre for Late Antique & Medieval Studies (CLAMS), King’s College London

I saw this at the blog for DigiPal: Digital Resource and Database of Palaeography, Manuscript Studies and Diplomatic. Confession: I have never understood how the English derive acronyms, and this one confounds me as much as it may you. 😉

Be sure to look around at the DigiPal site. There are numerous manuscript images, annotation techniques, and other resources for those who foster scholarship by contributing to it.

Best Practices for Victim Response and Reporting of Cyber Incidents

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:55 pm

Best Practices for Victim Response and Reporting of Cyber Incidents (source: Department of Justice, Cybersecurity Unit)

From the introduction:

Any Internet-connected organization can fall prey to a disruptive network intrusion or costly cyber attack. A quick, effective response to cyber incidents can prove critical to minimizing the resulting harm and expediting recovery. The best time to plan such a response is now, before an incident occurs.

This “best practices” document was drafted by the Cybersecurity Unit to assist organizations in preparing a cyber incident response plan and, more generally, in preparing to respond to a cyber incident. It reflects lessons learned by federal prosecutors while handling cyber investigations and prosecutions, including information about how cyber criminals’ tactics and tradecraft can thwart recovery. It also incorporates input from private sector companies that have managed cyber incidents. It was drafted with smaller, less well-resourced organizations in mind; however, even larger organizations with more experience in handling cyber incidents may benefit from it.

Best practice for using this paper:

  1. Annotate a copy of it with the current state of your organization.
  2. Annotate a separate copy of it with the state of your organization after needed changes.
  3. Compare the two versions.

Remember: what you can’t measure you can’t manage. Nor can there be accountability in the absence of measurement.

Metaflop: Hello World

Filed under: Fonts,WWW — Patrick Durusau @ 3:02 pm

Metaflop: Hello World

From the webpage:

Metaflop is an easy to use web application for modulating your own fonts. Metaflop uses Metafont, which allows you to easily customize a font within the given parameters and generate a large range of font families with very little effort.

With the Modulator it is possible to use Metafont without dealing with the programming language and coding by yourself, but simply by changing sliders or numeric values of the font parameter set. This enables you to focus on the visual output – adjusting the parameters of the typeface to your own taste. All the repetitive tasks are automated in the background.

The unique results can be downloaded as a webfont package for embedding on your homepage or an OpenType PostScript font (.otf) which can be used on any system in any application supporting otf.

Various Metafonts can be chosen from our type library. They all come along with a small showcase and a preset of type derivations.

Metaflop is open source – you can find us on Github, both for the source code of the platform and for all the fonts.

If metafont rings any bells, congratulations! Metafont was invented by Don Knuth for TeX.

Metaflop provides a web interface to the Metafont program, with parameters that can be adjusted.

Only A-Z, a-z, and 0-9 are available for font creation.

In the FAQ, the improvement over Metafont is said to be:

  • font creators are mostly designers, not engineers. so metafont is rather complicated to use, you need to learn programming.
  • it has no gui (graphical user interface).
  • the native export is to bitmap fonts which is a severe limitation compared to outline fonts.

Our contribution to metafont is to address these issues. we are aware that it is difficult to produce subtle and refined typographical fonts (in the classical meaning). Nevertheless we believe there is a undeniable quality in parametric font design and we try to bring it closer to the world of the designers.

While Metaflop lacks the full generality of Metafont, it is a big step in the right direction to bring Metafont to a broader audience.

With different underlying character sets, this would certainly be of interest to anyone working with pre-printing-press texts. Glyphs can transliterate to the same characters, but which glyph was used can be important information to both capture and display.

May 7, 2015

Apache Lucene 5.1.0, Solr 5.1.0 Available

Filed under: Lucene,Solr — Patrick Durusau @ 8:14 pm

From the news:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0 and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0

Both releases contain a number of new features, bug fixes, and optimizations.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

See also Solr 5.1 Features by Yonik Seeley.

Of particular interest, Streaming Aggregation For SolrCloud (new in Solr 5.1) by Joel Bernstein.
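If you want to smoke-test a fresh 5.1 install before digging into the streaming features, here is a minimal sketch of hitting Solr’s select handler from Python. The host, port, and collection name (“gettingstarted”) are assumptions based on a default tutorial setup, so adjust to taste:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed defaults: local Solr instance on port 8983 with a collection
# named "gettingstarted". Adjust for your own install.
base = "http://localhost:8983/solr/gettingstarted/select"
params = {"q": "*:*", "rows": 5, "wt": "json"}

with urlopen(base + "?" + urlencode(params)) as resp:
    result = json.load(resp)

print("numFound:", result["response"]["numFound"])
for doc in result["response"]["docs"]:
    print(doc.get("id"))
```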

Enjoy!

GQL and SharePoint Online Search REST APIs

Filed under: Graphs,Microsoft — Patrick Durusau @ 7:57 pm

Query the Office graph using GQL and SharePoint Online Search REST APIs

From the post:

Graph Query Language (GQL) is a preliminary query language designed to query the Office graph via the SharePoint Online Search REST API. By using GQL, you can query the Office graph to get items for an actor that satisfies a particular filter.

Note The features and APIs documented in this article are in preview and are subject to change. The current additions to the Search REST API are a preliminary solution to make it possible to query the Office graph, mainly intended for the Office Delve experience. Feel free to experiment with querying the Office graph but do not use these features, or other features and APIs documented in this article, in production. Your feedback about these features and APIs is important. Let us know what you think. Connect with us on Stack Overflow. Tag your questions with [office365].

An interesting development from Microsoft!

These are early days, so there is a long way to go before we are declaring relationships between entities inside objects and assigning properties to those entities and relationships.

Still, a promising development.

Open But Recorded Access

Filed under: Government,Government Data — Patrick Durusau @ 7:44 pm

Search Airmen Certificate Information

Registry of certified pilots.

From the search page:

[Screenshot of the airmen certificate search form]

I didn’t perform a search so I don’t have a feel for what, if any, validation is done on the requested searcher information.

If you are on Tor, you might want to consider using the address for Wrigley field, 1060 W Addison St, Chicago, IL 60613, to see if it complains.

Bureau of Transportation Statistics

Filed under: Government Data,Politics,Travel — Patrick Durusau @ 4:57 pm

Bureau of Transportation Statistics

I discovered this site while looking for “official” statistics to debunk claims about air travel and screening for terrorists. (Begging National Security Questions #1)

I didn’t find it an easy site to navigate but that probably reflects my lack of familiarity with the data being collected. A short guide with a very good index would be quite useful.

A real treasure trove of transportation information (from the about page):

Major Programs of the Bureau of Transportation Statistics (BTS)

It is important to remember that federal agencies (and their equivalents under other governments) have distinct agendas. When confronting outlandish claims from one of the security agencies, it helps to have contradictory data gathered by other, “disinterested,” agencies of the same government.

Security types can dismiss your evidence and analysis as “that’s what you think.” After all, their world is nothing but suspicion and conjecture. Why shouldn’t that be true for others?

Not as easy to dismiss data and analysis by other government agencies.

Begging National Security Questions #1

Filed under: Government,National Security,Politics — Patrick Durusau @ 3:41 pm

In the interview of Bruce Schneier by Stewart Baker (formerly of the DHS and NSA), Bruce was too polite to point out that Baker was begging the question on a number of national security issues.

That sort of rhetoric comes up often in discussions of national security issues and just as often is unchallenged by reporters and other participants in the discussion.

One example (I will post on others) of begging the question is when Baker talks about the DHS reviewing passenger manifests from airlines to decide whom it needs to interview.

Baker “begs” the question of whether terrorists are flying airlines monitored by the TSA and, if they are, whether TSA methods are sufficient to discover them. He simply assumes those to be true in order to justify his conclusion that the TSA needs the information from passenger manifests.

But what are the facts about airline passengers and the TSA?

If you look at: Passengers All Carriers – All Airports, a webpage maintained by the US Department of Transportation, Bureau of Transportation Statistics, you will find a table that reads in part:

Year Total Passengers
2002 670,604,493
2003 700,863,621
2004 763,709,691
2005 800,849,909
2006 808,103,211
2007 835,436,440
2008 809,449,524
2009 767,816,588
2010 787,478,056
2011 802,134,604
2012 813,127,939
2013 824,956,471
2014 847,767,888
2015 63,344,516
Total: 10,295,642,951
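A quick check on the arithmetic: summing the yearly figures reproduces the total row.

```python
# Yearly passenger counts from the BTS table above (2015 is a partial year).
passengers = {
    2002: 670_604_493, 2003: 700_863_621, 2004: 763_709_691,
    2005: 800_849_909, 2006: 808_103_211, 2007: 835_436_440,
    2008: 809_449_524, 2009: 767_816_588, 2010: 787_478_056,
    2011: 802_134_604, 2012: 813_127_939, 2013: 824_956_471,
    2014: 847_767_888, 2015: 63_344_516,
}

total = sum(passengers.values())
assert total == 10_295_642_951   # matches the BTS total row
print(f"{total:,} passengers")   # 10,295,642,951
```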

Out of over 10 billion passengers, how many terrorists has the TSA apprehended?

0, nada, the empty set, none.

It isn’t possible to know from the available evidence if:

  • There are no terrorists.
  • Terrorists do not fly into or out of airports monitored by the DHS/TSA.
  • DHS/TSA methods are insufficient to catch terrorists who are using US airports.

Rather than assuming terrorists justify the government’s use of passenger manifests for screening passengers, Baker should be challenged to produce evidence that:

  • Terrorists fly in or out of airports under the control of the U.S. government, and
  • DHS/TSA techniques result in the arrest of terrorists.

Lacking proof of either of those points, there is no demonstration of need or effectiveness on the part of the DHS/TSA.

The government has the burden of proof for any government program, but especially ones that intrude on the privacy of its citizens. Force them to carry that burden in discussions of national security.

May 6, 2015

Expand Your Big Data Capabilities With Unstructured Text Analytics

Filed under: BigData,Text Analytics,Unstructured Data — Patrick Durusau @ 7:58 pm

Expand Your Big Data Capabilities With Unstructured Text Analytics by Boris Evelson.

From the post:

Beware of insights! Real danger lurks behind the promise of big data to bring more data to more people faster, better and cheaper. Insights are only as good as how people interpret the information presented to them.

When looking at a stock chart, you can’t even answer the simplest question — “Is the latest stock price move good or bad for my portfolio?” — without understanding the context: Where you are in your investment journey and whether you’re looking to buy or sell.

While structured data can provide some context — like checkboxes indicating your income range, investment experience, investment objectives, and risk tolerance levels — unstructured data sources contain several orders of magnitude more context.

An email exchange with a financial advisor indicating your experience with a particular investment vehicle, news articles about the market segment heavily represented in your portfolio, and social media posts about companies in which you’ve invested or plan to invest can all generate much broader and deeper context to better inform your decision to buy or sell.

A thumbnail sketch of the complexity of extracting value from unstructured data sources. As such a sketch, there isn’t much detail but perhaps enough to avoid paying $2495 for the full report.

The Internet of Things to take a beating in DefCon hacking contest

Filed under: Cybersecurity,IoT - Internet of Things — Patrick Durusau @ 7:49 pm

The Internet of Things to take a beating in DefCon hacking contest by Lucian Constantin.

From the post:

Hackers will put Internet-connected embedded devices to the test at the DefCon 23 security conference in August. Judging by the results of previous Internet-of-Things security reviews, prepare for flaws galore.

This year, DefCon, the largest hacker convention in the U.S., will host a so-called IoT Village, a special place to discuss, build and break IoT devices.

“Show us how secure (or insecure) IP-enabled embedded systems are,” a description of the new village reads. “Routers, network storage systems, cameras, HVAC systems, refrigerators, medical devices, smart cars, smart home technology, and TVs — if it is IP-enabled, we’re interested.”

Def Con 23 August 6-9 at Paris & Bally’s in Las Vegas!

The call for papers is open until May 26, 2015.

This should be a real hoot!

Enjoy!

Malware’s Most Wanted: Linux and Internet of Things Malware? (webinar)

Filed under: Cybersecurity,Linux OS,Security — Patrick Durusau @ 7:37 pm

Malware’s Most Wanted: Linux and Internet of Things Malware?

From the description:

Speaker: Marion Marschalek, Security Researcher of Cyphort Labs
Date and Time: Thursday, May 28, 2015 9:00 AM PDT

Occasionally we see samples coming out of our pipe which do not fit with the stream of malware, such as clickjackers, banking Trojans and spybots. These exotic creatures are dedicated to target platforms other than the Windows operating system. While they make up for a significantly smaller portion than the load of Windows malware, Cyphort labs has registered a rise in Linux and Internet of Things Malware (IoT) malware. A number of different families has been seen. But what is their level of sophistication and the associated risk? This webinar provides an overview of Linux and IoT malware that Cyphort labs has spotted in the wild and gives an insight into the development of these threats and the direction they are taking. Attendees may opt in to receive a special edition t-shirt.

I haven’t seen a Cyphort webinar so I am taking a chance on this one.

Enjoy!

“there are times when you don’t have time for due process…”

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:24 pm

I started this post to talk about: Steptoe Cyberlaw Podcast, Episode 65: An Interview with Bruce Schneier by Stewart Baker.

From the post:

Episode 65 would be ugly if it weren’t so much fun. Our guest is Bruce Schneier, cryptographer, computer science and privacy guru, and author of the best-selling Data and Goliath – a book I annotated every few pages of with the words, “Bruce, you can’t possibly really believe this.” And that’s pretty much how the interview goes, as Bruce and I mix it up over hackbacks, whether everyone but government should be allowed to use Big Data tools, Edward Snowden, whether “mass surveillance” has value in fighting terrorism, and whether damaging cyberattacks are really infrequent and hard to attribute. We disagree mightily – and with civility.

The interviewer, Stewart Baker, spent 3 1/2 years at the Department of Homeland Security and earlier served as general counsel of the National Security Agency.

The interview with Bruce proper starts at time mark 28:00.

In an admittedly hypothetical and strained case, Baker responds to Schneier’s argument for due process by saying:

“there are times when you don’t have time for due process…” (31:44)

The hypothetical involves following a thief and breaking into their server to retrieve something they have stolen. I know, problematic since digital theft doesn’t usually leave you without a copy, but that is how it was posed. Bruce argues, quite correctly, that there is no determination that a theft occurred, who the thief might be or what damage you will do in breaking into the server.

I find it deeply disturbing that Baker’s views may reflect those of the Department of Homeland Security, that:

“there are times when you don’t have time for due process…” (31:44)

I find it quite remarkable that anyone practicing law in the United States could trivialize due process so easily.

When convicted criminals are released due to violation of their due process rights, it isn’t that a court has found them not guilty. There may be little or no practical doubt about their guilt but as part of the social contract that is the Constitution, we have agreed that an absence of due process trumps all other considerations, including factual ones.

What Baker forgets is that he and/or the Department of Homeland Security aren’t the only ones who can decide to set aside the social contract that includes due process. What if a person decides that the election process has erred and there are too few intelligent members of Congress? Such a person could (illegally) create vacancies in Congress to give the electorate another chance.

Their argument would be the same one that Baker makes, the election cycle takes too long. Too many crises need resolution by a competent congress.

Their only difference with Baker would be one of degree rather than kind. Baker and company should be very careful about callously abandoning inconvenient parts of the Constitution. It sets a bad precedent for others who may do the same thing.

Open Access Journals in Ancient Studies

Filed under: IoT - Internet of Things — Patrick Durusau @ 4:11 pm

Alphabetical List of Open Access Journals in Ancient Studies by Charles Jones.

From AWOL – The Ancient World Online.

If you are interested in ancient studies, do visit Online Resources from ISAW (Institute for the Study of the Ancient World) at New York University.

It is an exemplar of what scholarship should look like in the 21st century.

Malware Kits for IoT?

Filed under: Cybersecurity,IoT - Internet of Things,Security — Patrick Durusau @ 3:09 pm

Analysis of a MICROSOFT WORD INTRUDER sample: execution, check-in and payload delivery by Yonathan Klijnsma.

From the post:

On April 1st FireEye released a report on “MWI” and “MWISTAT” which is a sort of exploit kit for Word Documents if you will: A New Word Document Exploit Kit

In the article FireEye goes over MWI which is the short for “Microsoft Word Intruder’ coded by an actor going by the handle ’Objekt’. MWI is a ‘kit’ for people to use for spreading malware. It can generate malicious word document exploiting any of the following CVE’s:

  • CVE-2010-3333
  • CVE-2012-0158
  • CVE-2013-3906
  • CVE-2014-1761

The builder, named MWI, generates these documents which call back to a server to download malicious payloads. Together with the MWI builder the author has also released MWISTAT; a statistics backend and optional downloader component for MWI documents to track campaigns and spread of the documents.

This post prompted me to look for malware kits for the Internet of Things (IoT).

I didn’t find any with a quick search but did find several IoT malware stories that may be of interest:

The Internet Of Things Has Been Hacked, And It’s Turning Nasty by Selena Larson.

From the post:

Don’t say we didn’t warn you. Bad guys have already hijacked up to 100,000 devices in the Internet of Things and used them to launch malware attacks, Internet security firm Proofpoint said on Thursday.

It’s apparently the first recorded large-scale Internet of Things hack. Proofpoint found that the compromised gadgets—which included everything from routers and smart televisions to at least one smart refrigerator—sent more than 750,000 malicious emails to targets between December 26, 2013 and January 6, 2014.

The Botnet of the Internet of Things by Waylon Grange.

From the post:

Last month we released our report on the Inception Framework and as part of that report outlined how a nation-state level attack compromised over 100 embedded devices on the Internet to use them as a private proxy to mask their identity. Since the release of the paper we have further discovered that the attackers not only targeted MIPS-el devices but also had binaries for ARM, SuperH, and PowerPC embedded processors. In light of this the 100 devices that we knew about is most likely only the tip of the iceberg and the total count was much, much more.

This network of proxies was managed by a central backend that tunneled attacks through an ever-cycling list of compromised devices, thus changing the IP address their attacks came from every few minutes. The whole system for tracking which compromised devices were available and managing the change in proxies at regular intervals had to be a fairly complex system, but the benefit to the attackers was clear. No one entity would have full insight into their attacks, only portions of it and it is hard for investigators to put together a puzzle with only a handful of the pieces.

This year your refrigerator may be a spam-bot and next year your toaster?

Don’t know how I will feel getting a spam email with return address: Joe’s Toaster.

Unfortunately, people who are concerned about IoT security aren’t the ones building devices to become part of the IoT. Strict liability for losses, spamming, etc. due to IoT devices would go a long way towards generating concern among IoT device manufacturers.

I didn’t find any malware kits for the IoT but I will keep looking. Until the IoT becomes more secure, I’m not sharing network access with my refrigerator or toaster.

Glossary of linguistic terms

Filed under: Language,Linguistics — Patrick Durusau @ 1:45 pm

Glossary of linguistic terms by Eugene E. Loos (general editor), Susan Anderson (editor), Dwight H. Day, Jr. (editor), Paul C. Jordan (editor), J. Douglas Wingate (editor).

An excellent source for linguistic terminology.

If you have any interest in languages or linguistics you should give SIL International a visit.

BTW, the last update on the glossary page was in 2004 so if you can suggest some updates or additions, I am sure they would be appreciated.

Enjoy!

Topic Extraction and Bundling of Related Scientific Articles

Topic Extraction and Bundling of Related Scientific Articles by Shameem A Puthiya Parambath.

Abstract:

Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles is tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques.

We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

On “bundling” from the introduction:

Effective grouping of data requires a precise definition of closeness between a pair of data items and the notion of closeness always depend on the data and the problem context. Closeness is defined in terms of similarity of the data pairs which in turn is measured in terms of dissimilarity or distance between pair of items. In this report we use the term similarity,dissimilarity and distance to denote the measure of closeness between data items. Most of the bundling scheme start with identifying the common attributes(metadata) of the data set, here scientific articles, and create bundling semantics based on the combination of these attributes. Here we suggest a two step algorithm to bundle scientific articles. In the first step we group articles based on the latent topics in the documents and in the second step we carry out agglomerative hierarchical clustering based on the inter-textual distance and co-authorship similarity between articles. We run experiments to validate the bundling semantics and to compare it with content only based similarity. We used 19937 articles related to Computer Science from arviv [htt12a] for our experiments.

Is a “bundle” the same thing as a topic that represents “all articles on subject X?”

I have seen a number of topic map examples that use the equivalent of a proper noun, a proper subject, that is, a singular and unique subject.

But there is no reason why I could not have a topic that represents all the articles on deep learning written in 2014, for example. Methods such as the bundling techniques described here could prove to be quite useful in such cases.
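The two-step approach is easy to prototype with off-the-shelf tools. Here is a minimal sketch using scikit-learn, my own illustration of LDA topics followed by agglomerative clustering, not the authors’ code, and the tiny “corpus” is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

# Made-up toy "abstracts"; the paper uses roughly 20,000 arXiv CS articles.
docs = [
    "neural networks for image classification",
    "deep learning improves object recognition",
    "sparql queries over rdf graph data",
    "semantic web reasoning with rdf and owl",
]

# Step 1: extract latent topics with LDA.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures

# Step 2: bundle articles by agglomerative clustering of the topic mixtures.
# (The paper also folds in inter-textual distance and co-authorship
# similarity, which this sketch omits.)
bundles = AgglomerativeClustering(n_clusters=2).fit_predict(doc_topics)
for doc, bundle in zip(docs, bundles):
    print(bundle, doc)
```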

May 5, 2015

Lex Machina – Legal Analytics for Intellectual Property Litigation

Filed under: Law,Law - Sources,Searching — Patrick Durusau @ 4:06 pm

Lex Machina – Legal Analytics for Intellectual Property Litigation by David R. Hansen.

From the post:

Lex Machina—Latin that translates to “law machine”—is an interesting name for a legal analytics platform that focuses not on the law itself but on providing insights into the human aspects of the practice of law. While traditional legal research platforms—Lexis, Westlaw, Bloomberg, etc.—help guide attorneys to information about where the law is and how it is developing, Lex Machina focuses on providing information about how attorneys, judges, and other involved parties act in the high-stakes world of IP litigation.

Leveraging databases from PACER, the USPTO, and the ITC, Lex Machina cleans and codes millions of data elements from IP-related legal filings to cull information about how judges, attorneys, law firms, and particular patents are treated in various cases. Using that information, Lex Machina is able to offer insights into, for example, how long a particular judge typically takes to decide on summary judgment motions, or how frequently a particular judge grants early motions in favor of defendants. Law firms use the service to create client pitches—highlighting with hard data, for example, how many times they have litigated and won particular types of cases before particular judges or courts as compared to competing law firms. And companies can use the service to assess the historical effectiveness of their counsel and to judge the reasonableness of proposed litigation strategies.

For academic uses, the possibilities for engaging in empirical research with the covered dataset are great. A quick search of law reviews articles in Westlaw shows Lex Machina used in seventy-five articles published since 2009, covering empirical research into everything from the prevalence of assertions of state sovereign immunity for cases involving state-owned patents to effect of patent monetization entities on U.S. patent litigation.

If you are interested in gaining access to Lex Machina and are university and college faculty, staff or students directly engaged in research on, or study of, IP law and policy, you can request a free public-interest account here (Lex Machina notes, however, “to enable public interest users to make best use of Lex Machina, we require prospective new users to attend an online training prior to receiving a user account.”)

When I first wrote about Lex Machina (2013), I don’t recall there being a public interest option. Amusing to see its use as a form of verified advertising for attorneys.

Now, if only judicial oversight boards had the same type of information across the board for all judges.

Not that legal outcomes can or should be uniform, but they shouldn’t be freakish either.

I first saw this in a tweet by Aaron Kirschenfeld.

Achieving All with No Parameters: Adaptive NormalHedge

Filed under: Machine Learning — Patrick Durusau @ 3:50 pm

Achieving All with No Parameters: Adaptive NormalHedge by Haipeng Luo and Robert E. Schapire.

Abstract:

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm (Luo and Schapire, 2014), called AdaNormalHedge. On one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by Chaudhuri et al. (2009) and Chernov and Vovk (2010). Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve previous work from different aspects, and a special case of the first application resolves another open problem proposed by Warmuth and Koolen (2014) on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.
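If “predicting with expert advice” is new to you, here is a minimal sketch of the classic exponentially weighted forecaster (Hedge) that this line of work builds on. It is not AdaNormalHedge; the whole point of the paper is to dispense with the learning rate parameter eta you see below:

```python
import math
import random

def hedge(expert_losses, eta=0.5):
    """Classic Hedge: keep a weight per expert, update multiplicatively.

    expert_losses: one list per round, each holding a loss in [0, 1]
    for every expert. Returns the forecaster's total expected loss.
    """
    n = len(expert_losses[0])
    weights = [1.0] * n
    total = 0.0
    for losses in expert_losses:
        z = sum(weights)
        probs = [w / z for w in weights]
        total += sum(p * l for p, l in zip(probs, losses))  # expected loss this round
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total

# Toy run: expert 0 is usually right, expert 1 guesses at random.
random.seed(0)
rounds = [[0.0 if random.random() < 0.9 else 1.0, random.random()]
          for _ in range(100)]
print("Hedge loss:      ", round(hedge(rounds), 2))
print("Best expert loss:", round(min(sum(r[i] for r in rounds) for i in range(2)), 2))
```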

The terminology, “sleeping expert,” is particularly amusing.

Probably more correct to say “unpaid expert,” because unpaid experts, the cleverer ones, don’t offer advice.

I first saw this in a tweet by Nikete.

New York City Subway Anthrax/Plague

Filed under: Data,Skepticism — Patrick Durusau @ 3:27 pm

Spoiler Alert: This paper discusses a possible find of anthrax and plague DNA in the New York Subway. It concludes that either a related but harmless strain wasn’t considered and/or there was random sequencing error. In either case, it is a textbook example of the need for data skepticism.

Searching for anthrax in the New York City subway metagenome by Robert A Petit, III, Matthew Ezewudo, Sandeep J. Joseph, Timothy D. Read.

From the introduction:

In February 2015 Chris Mason and his team published an in-depth analysis of metagenomic data (environmental shotgun DNA sequence) from samples isolated from public surfaces in the New York City (NYC) subway system. Along with a ton of really interesting findings, the authors claimed to have detected DNA from the bacterial biothreat pathogens Bacillus anthracis (which causes anthrax) and Yersinia pestis (causes plague) in some of the samples. This predictably led to a huge interest from the press and scientists on social media. The authors followed up with an re-analysis of the data on microbe.net, where they showed some results that suggested the tools that they were using for species identification overcalled anthrax and plague.

The NYC subway metagenome study raised very timely questions about using unbiased DNA sequencing for pathogen detection. We were interested in this dataset as soon as the publication appeared and started looking deeper into why the analysis software gave false positive results and indeed what exactly was found in the subway samples. We decided to wrap up the results of our preliminary analysis and put it on this site. This report focuses on the results for B. anthracis but we also did some preliminary work on Y.pestis and may follow up on this later.

The article gives a detailed accounting of the tools and issues involved in the identification of DNA fragments from pathogens. It is hard core science but it also illustrates how iffy hard core science can be. Sure, you have the data, that doesn’t mean you will reach the correct conclusion from it.

The authors also mention a followup study by Chris Mason, the author of the original paper, entitled:

The long road from Data to Wisdom, and from DNA to Pathogen by Christopher Mason.

From the introduction:

There is an oft-cited hierarchy that runs from Data to Information to Knowledge to Wisdom (DIKW). Just because you have data, it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom.

And from his conclusion:

But, perhaps the bigger issue is social. I confess I grossly underestimated how the press would sensationalize these results, and even the Department of Health (DOH) did not believe it, claiming it simply could not be true. We sent the MTA and the DOH our first draft upon submission in October 2014, the raw and processed data, as well as both of our revised drafts in December 2014 and January 2015, and we did get some feedback, but they also had other concerns at the time (Ebola guy in the subway). This is also different from what they normally do (PCR for specific targets), so we both learned from each other. Yet, upon publication, it was clear that Twitter and blogs provided some of the same scrutiny as the three reviewers during the two rounds of peer review. But, they went even deeper and dug into the raw data, within hours of the paper coming online, and I would argue that online reviewers have become an invaluable part of scientific publishing. Thus, published work is effectively a living entity before (bioRxiv), during (online), and after publication (WSJ, Twitter, and others), and online voices constitute an critical, ensemble 4th reviewer.

Going forward, the transparency of the methods, annotations, algorithms, and techniques has never been more essential. To this end, we have detailed our work in the supplemental methods, but we have also posted complete pipelines in this blog post on how to go from raw data to annotated species on Galaxy. Even better, the precise virtual machines and instantiation of what was run on a server needs to be tracked and guaranteed to be 100% reproducible. For example, for our .vcf characterizations of the human alleles, we have set up our entire pipeline on Arvados/Curoverse, free to use, so that anyone can take a .vcf file and run the exact same ancestry analyses and get the exact same results. Eventually, tools like this can automate and normalize computational aspects of metagenomics work, which is an ever-increasingly important component of genomics.

Mason’s

Data –> Information –> Knowledge –> Wisdom (DIKW)

sounds like:

evidence-based data science

to me.

Another quick point: Chris Mason and team made all their data available for others to review, and Chris states that informal review was a valuable contributor to the scientific process.

That is an illustration of the value of transparency. Contrast that with the Obama Administration’s default position of opacity. Which one do you think serves a fact finding process better?

Perhaps that is the answer. The Obama administration isn’t interested in a fact finding process. It has found the “facts” that it wants and reaches its desired conclusions. What is there left to question or discuss?

One Subject, Three Locators

Filed under: Identifiers,Library,Topic Maps — Patrick Durusau @ 2:01 pm

As you may know, the Library of Congress actively maintains its subject headings. Not surprising to anyone other than purveyors of fixed ontologies. New subjects appear, terminology changes, old subjects have new names, etc.

The Subject Authority Cooperative Program (SACO) has a mailing list:

About the SACO Listserv (sacolist@loc.gov)

The SACO Program welcomes all interested parties to subscribe to the SACO listserv. This listserv was established first and foremost to facilitate communication with SACO contributors throughout the world. The Summaries of the Weekly Subject Editorial Review Meeting are posted to enable SACO contributors to keep abreast of changes and know if proposed headings have been approved or not. The listserv may also be used as a vehicle to foster discussions on the construction, use, and application of subject headings. Questions posted may be answered by any list member and not necessarily by staff in the Cooperative Programs Section (Coop) or PSD. Furthermore, participants are encouraged to provide comments, share examples, experiences, etc.

On the list this week was the question:

Does anyone know how these three sites differ as sources for consulting approved subject lists?

http://www.loc.gov/aba/cataloging/subject/weeklylists/

http://www.loc.gov/aba/cataloging/subject/

http://classificationweb.net/approved-subjects/

Janis L. Young, Policy and Standards Division, Library of Congress replied:

Just to clarify: all of the links that you and Paul listed take you to the same Approved Lists. We provide multiple access points to the information in order to accommodate users who approach our web site in different ways.

Depending upon your goals, the Approved Lists could be treated as a subject that has three locators.
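In topic map terms that is simply one topic (one subject) with several locators. A minimal sketch of the idea as a Python data structure; the class and field names are mine, not from any topic map API:

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """One subject, any number of locators that resolve to it."""
    name: str
    locators: set = field(default_factory=set)

approved_lists = Topic(
    name="LCSH Approved Lists",
    locators={
        "http://www.loc.gov/aba/cataloging/subject/weeklylists/",
        "http://www.loc.gov/aba/cataloging/subject/",
        "http://classificationweb.net/approved-subjects/",
    },
)

def same_subject(a, b, topics):
    """Two locators identify the same subject if some topic lists both."""
    return any(a in t.locators and b in t.locators for t in topics)

print(same_subject(
    "http://www.loc.gov/aba/cataloging/subject/",
    "http://classificationweb.net/approved-subjects/",
    [approved_lists],
))  # True
```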

SOP for the IoT, Pwning a Pain Machine

Filed under: Cybersecurity,Security — Patrick Durusau @ 1:17 pm

Bugs in the hospital: how to pwn your own pethidine machine by Paul Ducklin.

Paul describes CVE-2015-3459.

From the NVD description:

Hospira Lifecare PCA infusion pump running “SW ver 412” does not require authentication for Telnet sessions, which allows remote attackers to gain root privileges via TCP port 23.

PCA = patient-controlled analgesia.

Its score?

CVSS v2 Base Score: 10.0 (HIGH) (AV:N/AC:L/Au:N/C:C/I:C/A:C)

Impact Subscore: 10.0

Exploitability Subscore: 10.0

Perfect. 10’s across the board.

Paul goes on to point out the many reasons why Telnet should not be used under any circumstances but fails to acknowledge that vendors for the Internet of Things (IoT) care more about profit than they do about security.

Kurt Mackie, in Microsoft beefs up Azure for Internet of Things, says Microsoft CEO Satya Nadella:

depicted a future world that will have “26 billion general purpose compute devices” by 2019 that would produce “something like 44 zettabytes of data that’s going to be in the cloud.”

How many of those “26 billion general purpose compute devices” will be vulnerable to cyber attacks?

Off hand, I would say all of them, if current conditions are an indication of the future (they often are).

Think about that for a minute. There are “secure” systems but that comes at the price of being cut off from the rest of the world, guarded by fences and people with guns and obsessive security procedures. None of those will be true for devices on the IoT.

Criminal laws and penalties haven’t stopped the gentle tides of current cyber-insecurity. Given that history, they are laughable as an approach to stopping the tsunami of cyber-insecurity that approaches with the IoT.

There are presently, and will be in the future, any number of snake oil solutions to software security issues. If you like the idea of patching a punctured tire by wrapping another punctured tire around it, you may be happy with one or more such solutions. At least until they fail.

There are alternatives, workable alternatives. Not to eliminate risk or achieve complete security, but to make the level of risk manageable, not random or episodic. Incentives (software liability) for more security, standards for software practices, better sharing of vulnerability information, are only a few of the current alternatives to spewing more insecure software to form the IoT.

If you start feeling too good on a pain machine in the hospital, someone may have rooted your machine. A little late to be worrying about the security of the IoT at that point.

May 4, 2015

SIGIR 2015 Technical Track

Filed under: Conferences,Information Retrieval — Patrick Durusau @ 8:15 pm

SIGIR 2015 Technical Track

The list of accepted papers for the SIGIR 2015 Technical Track has been published!

As if you need any further justification to attend the conference in Santiago, Chile, August 9-13, 2015.

Curious, would anyone be interested in a program listing that links the authors to their DBLP listings? Just in case you want to catch up on their recent publications before the conference?

Enjoy!

Notes on Theory of Distributed Systems

Filed under: CS Lectures,Distributed Computing — Patrick Durusau @ 8:06 pm

Notes on Theory of Distributed Systems by James Aspnes.

From the preface:

These are notes for the Spring 2014 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems. This document also incorporates the lecture schedule and assignments, as well as some sample assignments from previous semesters. Because this is a work in progress, it will be updated frequently over the course of the semester.

Notes from Fall 2011 can be found at http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.pdf.

Notes from earlier semesters can be found at http://pine.cs.yale.edu/pinewiki/465/.

Much of the structure of the course follows the textbook, Attiya and Welch’s Distributed Computing [AW04], with some topics based on Lynch’s Distributed Algorithms [Lyn96] and additional readings from the research literature. In most cases you’ll find these materials contain much more detail than what is presented here, so it is better to consider this document a supplement to them than to treat it as your primary source of information.

When something exceeds three hundred (> 300) pages, I have trouble calling it “notes.” 😉

A treasure trove of information on distributed computing.

I first saw this in a tweet by Henry Robinson.

Breaking the Silence – Gaza – “There were no rules”

Filed under: Government,Politics — Patrick Durusau @ 7:51 pm

New report details how Israeli soldiers killed civilians in Gaza: “There were no rules” by William Booth.

From the post:

On Monday, an organization of Israeli soldiers known as “Breaking the Silence” released a report containing testimonies from more than 60 officers and soldiers from the Israel Defense Forces who served during the 50-day war against Hamas militants last summer in the Gaza Strip.

An Israel Defense Forces spokesman declined to respond to details in the report, saying Breaking the Silence refuses to share information with the IDF “in a manner which would allow a proper response, and if required, investigation.” The spokesman added that “contrary to their claims, this organization does not act with the intention of correcting any wrongdoings they allegedly uncovered.”

The soldiers who testified received guarantees of anonymity from Breaking the Silence. The 240-page book in English can be found online here.

Don’t you like the IDF response:

in a manner which would allow a proper response, and if required, investigation.

Of course, that means testimony that is not anonymous but given with names, along with who was with you (other people who could be pressured), attendant damage to your career or future job prospects, etc.

Not to single out the IDF for criticism. Virtually the same response has been given by the U.S. military for a variety of issues.

Governments and their military services fear transparency because transparency could lead to accountability. Civilians should not second-guess decisions made in the heat of battle by combat troops. Their leaders, who made decisions for political gain, should certainly be called to account.

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

Filed under: GraphX,SKOS,Spark — Patrick Durusau @ 4:02 pm

Running Spark GraphX algorithms on Library of Congress subject heading SKOS by Bob DuCharme.

From the post:

Well, one algorithm, but a very cool one.

Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark’s GraphX library lets you do this kind of computing on graph data structures and how I had some ideas for using it with RDF data. My goal was to use RDF technology on GraphX data and vice versa to demonstrate how they could help each other, and I demonstrated the former with a Scala program that output some GraphX data as RDF and then showed some SPARQL queries to run on that RDF.

Today I’m demonstrating the latter by reading in a well-known RDF dataset and executing GraphX’s Connected Components algorithm on it. This algorithm collects nodes into groupings that connect to each other but not to any other nodes. In classic Big Data scenarios, this helps applications perform tasks such as the identification of subnetworks of people within larger networks, giving clues about which products or cat videos to suggest to those people based on what their friends liked.

As so typically happens when you are reading one Bob DuCharme post, you see another one that requires reading!

Bob covers storing RDF in an RDD (Resilient Distributed Dataset), the basic Spark data structure, creates the report on connected components, and ends with heavily commented code for his program.
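If you don’t have a Spark cluster handy, the same connected-components idea can be sketched on a small SKOS file with rdflib and networkx. To be clear, this is not Bob’s Scala/GraphX code, and the filename below is a placeholder for a Library of Congress SKOS download:

```python
import networkx as nx
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

# Placeholder filename; substitute a Library of Congress SKOS download.
g = Graph()
g.parse("subjects.skos.nt", format="nt")

# Map each concept to its preferred label, then build an undirected
# graph whose edges are skos:related links between labeled concepts.
labels = {s: str(o) for s, _, o in g.triples((None, SKOS.prefLabel, None))}
related = nx.Graph()
for s, _, o in g.triples((None, SKOS.related, None)):
    if s in labels and o in labels:
        related.add_edge(labels[s], labels[o])

# Each component is a group of headings that relate to each other
# but to nothing outside the group.
for component in nx.connected_components(related):
    print(sorted(component))
```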

Sadly the “related” values assigned by the Library of Congress don’t say how or why the values are related, such as:


  • “Hiding places”
  • “Secrecy”
  • “Loneliness”
  • “Solitude”
  • “Privacy”

Related values could be useful in some cases but if I am searching on “privacy,” as in the sense of being free from government intrusion, then “solitude,” “loneliness,” and “hiding places” aren’t likely to be helpful.

That’s not a problem with Spark or SKOS, but a limitation of the data being provided.

SPARQL in 11 minutes (Bob DuCharme)

Filed under: RDF,SPARQL — Patrick Durusau @ 3:30 pm

From the description:

An introduction to the W3C query language for RDF. See http://www.learningsparql.com for more.

I first saw this in Bob DuCharme’s post: SPARQL: the video.

Nothing new for old hands but useful to pass on to newcomers.
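If a newcomer wants to try a query right after the video, a minimal sketch with rdflib will do; the inline data is made up:

```python
from rdflib import Graph

# A few made-up triples, just enough to run a first SPARQL query against.
data = """
@prefix ex: <http://example.org/> .
ex:ani     ex:title "Papyrus of Ani" ;     ex:language "Middle Egyptian" .
ex:hunefer ex:title "Papyrus of Hunefer" ; ex:language "Middle Egyptian" .
"""

g = Graph()
g.parse(data=data, format="turtle")

query = """
SELECT ?title WHERE {
  ?doc <http://example.org/language> "Middle Egyptian" ;
       <http://example.org/title> ?title .
}
"""
for row in g.query(query):
    print(row.title)
```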

I say nothing new, but I did learn that Bob has a Korg Monotron synthesizer. Looking forward to more “accompanied” blog posts. 😉

FOIA and 5,000 Blank Pages

Filed under: Government,Politics,Security — Patrick Durusau @ 2:53 pm

FBI replies to Stingray Freedom of Information request with 5,000 blank pages by Cory Doctorow.

Cory has a great post on FBI stonewalling on information about “Stingrays,” devices that act as cell phone towers to gather information from cell phone users.

The FBI response illustrates the issue I raised in Debating Public Policy, On The Basis of Fictions, which was:

To hold government accountable, its citizens need to know what government has been doing, to whom and why.

There is no place in the Constitution that says citizens are entitled only to some information, to a little information, to the information the executive branch decides to share (or the legislative branch for that matter), etc.

Every blank page in that FOIA answer diminishes your right as a citizen to control your government. That’s the part the FBI keeps overlooking. It’s not their government, it is not the government of the NSA, it is the government of every voting citizen.
