Archive for the ‘CERN’ Category

CatBoost: Yandex’s machine learning algorithm (here be Russians)

Thursday, December 7th, 2017

CatBoost: Yandex’s machine learning algorithm is available free of charge by Victoria Zavyalova.

From the post:

Russia’s Internet giant Yandex has launched CatBoost, an open source machine learning service. The algorithm has already been integrated by the European Organization for Nuclear Research to analyze data from the Large Hadron Collider, the world’s most sophisticated experimental facility.

Machine learning helps make decisions by analyzing data and can be used in many different areas, including music choice and facial recognition. Yandex, one of Russia’s leading tech companies, has made its advanced machine learning algorithm, CatBoost, available free of charge for developers around the globe.

“This is the first Russian machine learning technology that’s an open source,” said Mikhail Bilenko, Yandex’s head of machine intelligence and research.

I called out the Russian origin of the CatBoost algorithm, not because I have any nationalistic tendencies, but because you can find frothing paranoids in U.S. government agencies, and among their familiars, who do. In those cases, avoid CatBoost.

If you work in saner environments, or need to use categorical data (read: not converted to numbers), give CatBoost a close look!
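
To make "not converted to numbers" a little more concrete, here is a minimal sketch of an ordered target statistic, the kind of encoding a library can apply internally so that you can hand it raw category labels. This is a simplified illustration written for this post, not CatBoost's actual code: the function name `ordered_target_stats`, the `prior` value, and the toy data are all assumptions.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, prior=0.5):
    """Encode each category using only the rows that came before it.

    Hypothetical helper for illustration; not part of any library.
    """
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        # The current row's own target is not used, which avoids
        # leaking the label into its own feature value.
        encoded.append((sums[cat] + prior) / (counts[cat] + 1))
        sums[cat] += y
        counts[cat] += 1
    return encoded


# Toy data: music genre (categorical) and whether the listener liked the track.
genres = ["rock", "jazz", "rock", "rock", "jazz"]
liked = [1, 0, 1, 0, 1]
print(ordered_target_stats(genres, liked))
```

The point of the sketch is only that the string labels go in as they are; whatever numeric encoding is needed is derived from the data rather than assigned by hand.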


ROOT Files

Friday, March 21st, 2014

ROOT Files

From the webpage:

Today, a huge amount of data is stored into files present on our PC and on the Internet. To achieve the maximum compression, binary formats are used, hence they cannot simply be opened with a text editor to fetch their content. Rather, one needs to use a program to decode the binary files. Quite often, the very same program is used both to save and to fetch the data from those files, but it is also possible (and advisable) that other programs are able to do the same. This happens when the binary format is public and well documented, but may happen also with proprietary formats that became a standard de facto. One of the most important problems of the information era is that programs evolve very rapidly, and may also disappear, so that it is not always trivial to correctly decode a binary file. This is often the case for old files written in binary formats that are not publicly documented, and is a really serious risk for the formats implemented in custom applications.

As a solution to these issues ROOT provides a file format that is a machine-independent compressed binary format, including both the data and its description, and provides an open-source automated tool to generate the data description (or “dictionary”) when saving data, and to generate C++ classes corresponding to this description when reading back the data. The dictionary is used to build and load the C++ code to load the binary objects saved in the ROOT file and to store them into instances of the automatically generated C++ classes.

ROOT files can be structured into “directories”, exactly in the same way as your operative system organizes the files into folders. ROOT directories may contain other directories, so that a ROOT file is more similar to a file system than to an ordinary file.

Amit Kapadia mentions ROOT files in his presentation at CERN on citizen science.

I have only just begun to read the documentation but wanted to pass this starting place along to you.

I don’t find the “machine-independent compressed binary format” argument all that convincing, but apparently it has worked for quite some time.

Of particular interest will be the data dictionary aspects of ROOT.
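
As a rough illustration of the general idea (data and its description traveling in the same compressed binary blob), here is a minimal Python sketch. It is emphatically not the ROOT file format: the layout, the JSON description, and the names `pack_records` and `unpack_records` are assumptions made up for this example. Machine independence is approximated by fixing the byte order with the `<` prefix in the `struct` format string.

```python
import json
import struct
import zlib


def pack_records(field_names, fmt, records):
    """Return bytes: length prefix + JSON description + compressed rows.

    Hypothetical layout for illustration only; not ROOT's format.
    """
    description = json.dumps({"fields": field_names, "format": fmt}).encode("utf-8")
    row = struct.Struct(fmt)
    payload = zlib.compress(b"".join(row.pack(*r) for r in records))
    # 4-byte little-endian length tells the reader where the description ends.
    return struct.pack("<I", len(description)) + description + payload


def unpack_records(blob):
    """Decode a blob using only the description stored inside it."""
    (desc_len,) = struct.unpack_from("<I", blob, 0)
    description = json.loads(blob[4:4 + desc_len].decode("utf-8"))
    row = struct.Struct(description["format"])
    raw = zlib.decompress(blob[4 + desc_len:])
    return [dict(zip(description["fields"], values)) for values in row.iter_unpack(raw)]


# "<id" = little-endian 32-bit integer followed by a 64-bit float.
blob = pack_records(["event", "mass"], "<id", [(1, 91.19), (2, 125.25)])
for record in unpack_records(blob):
    print(record)
```

The reader needs no outside knowledge of the field names or types, which is the property that makes a "dictionary" stored alongside the data interesting for long-term readability.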

Do you know of other file formats that capture both the data and its description?


Saturday, November 26th, 2011


From the webpage:

CERN, DESY, Fermilab and SLAC have built the next-generation High Energy Physics (HEP) information system, INSPIRE, which empowers scientists with innovative tools for successful research at the dawn of an era of new discoveries.

INSPIRE combines the successful SPIRES database content, curated at DESY, Fermilab and SLAC, with the Invenio digital library technology developed at CERN. INSPIRE is run by a collaboration of the four labs, and interacts closely with HEP publishers, NASA-ADS, PDG, and other information resources.

INSPIRE represents a natural evolution of scholarly communication, built on successful community-based information systems, and provides a vision for information management in other fields of science.

INSPIRE builds on SPIRES’ expertise

  • Decades of trusted, curated content
  • Experience in managing a discipline’s wide information resources
  • Close relationship with the worldwide user community

What are the major innovations of INSPIRE?

  • Author disambiguation for high-quality profiles and improved search capabilities
  • Fulltext search and snippet display for access restricted content
  • Faster results
  • Variety of search and display options
  • Detailed record pages
  • Searchable fulltext for 5 years of arXiv content
  • Figures and searchable figure captions extracted from 5 years of arXiv articles
  • LHC experimental notes

What will be available soon?

  • Personalized features (bookshelves, author pages, paper claiming)
  • More APIs for third parties to build new tools
  • More historical content
  • Conference slides

Deeply cool digital library system from CERN.