Data-centric Languages and Systems

Overview

The Data-centric Languages and Systems theme aims at designing and developing programming languages and systems that take complex and massive data seriously. The purpose is to build robust and efficient platforms on well-founded theoretical grounds.

Managing, querying and making sense of data have become major aspects of our society. In the past forty years, advances in technology have allowed computer systems to store vast amounts of data. This has, in turn, spurred novel ways of handling data. AI and data analytics, for instance, have caused a paradigm shift: data is nowadays massive, heterogeneous, unstructured and manipulated with application-domain-specific languages such as Python (AI, physics, …), R (bioinformatics, statistics) or JavaScript (Web programming).
These newer approaches typically manipulate vast amounts of data over long periods of time, which in turn makes testing and prototyping difficult. Consequently, there is a high demand for safer and more robust systems, designed from scratch with correctness and efficiency in mind.
This motivates the key research directions we follow:

The formalization of SQL and data-centric programming languages, a joint effort with the Formalization of Languages and Systems theme
The design, formalization and implementation of the BOLDR system
The study of advanced type systems that improve safety and code quality for dynamic languages such as JavaScript

People

Current members

Véronique Benzaken
Evelyne Contejean
Mohammed Hachmaoui
Chantal Keller
Kim Nguyen
Rébecca Zucchini

Past members

Stefania Dumbrava (PhD candidate, defended in 2016)
Hyeonseung Im (Postdoc)
Eunice Martins (Master Internship)
Romain Vernoux (Master Internship)
Rebecca Zucchini (Master Internship)

Research axes

Data Intensive Systems Formalization (DataCert)

This research direction lies at the intersection with the Formalization of Languages and Systems theme; see that theme's page for details.

Breaking Boundaries between Language and Database Runtimes (BOLDR)

The goal of this project is to create a uniform and universal query intermediate representation (QIR) that bridges the gap between programming language constructs and database queries.

This project defines the semantics of the QIR, its properties, and its translation to database runtimes.
The ongoing implementation generates efficient database queries from SimpleScript (a toy language) or R code, targeting SQL databases (in ANSI syntax) or databases based on the Apache Hadoop framework, such as HBase and Hive. Support for Python (as a frontend) and Spark (as a backend) is being investigated.
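
To give an intuition of what translating an intermediate representation into database queries involves, here is a minimal sketch in Python. It is not the actual QIR nor the BOLDR implementation: the node types Scan, Filter and Project and the function to_sql are hypothetical names chosen for illustration, and the SQL produced is deliberately naive (no optimization, no escaping of identifiers or values).

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical toy representation; these node types are assumptions,
# not the QIR defined by the project.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: int
    child: "Node"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Node"


Node = Union[Scan, Filter, Project]


def to_sql(node: Node) -> str:
    """Translate a toy query tree into a nested SQL string."""
    if isinstance(node, Scan):
        return f"SELECT * FROM {node.table}"
    if isinstance(node, Filter):
        # Wrap the child query as a subquery and filter its rows.
        return (
            f"SELECT * FROM ({to_sql(node.child)}) AS t "
            f"WHERE {node.column} {node.op} {node.value}"
        )
    if isinstance(node, Project):
        # Wrap the child query as a subquery and keep selected columns.
        cols = ", ".join(node.columns)
        return f"SELECT {cols} FROM ({to_sql(node.child)}) AS t"
    raise TypeError(f"unknown node: {node!r}")


# Names of people older than 30, built as a tree rather than as a string.
query = Project(("name",), Filter("age", ">", 30, Scan("people")))
print(to_sql(query))
```

Building the query as a tree first means that a frontend only has to produce such nodes, while each backend only has to provide its own translation function.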

Advanced type systems for data-oriented and dynamic languages

A goal of type systems is to allow programmers to use ever more complex programming idioms. In that respect, dynamic (i.e. untyped) languages such as JavaScript or Python represent interesting challenges. Their pervasiveness in domains such as Web programming or data science makes them a target of choice for improving software quality through typing. Their highly dynamic behaviour, which is part of their appeal to the general public, cannot be easily characterized statically and is thus hard to handle with classical type systems. One of our objectives is therefore to study more sophisticated ones that mix advanced features such as polymorphism and subtyping, or that rely on gradual typing, which allows one to reason about statically and dynamically typed programs in the same framework.
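
As a rough illustration of the gradual typing idea, the following Python sketch mixes an annotated function with unannotated code and places an explicit runtime check at the boundary between them. This is a toy under stated assumptions, not one of the type systems studied here: Python does not enforce annotations at runtime, and the helper cast_int is a hypothetical name standing in for the kind of cast a gradual type system would insert.

```python
from typing import Any


def cast_int(value: Any) -> int:
    """Hypothetical boundary check between untyped and typed code."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value


def double(n: int) -> int:
    # Statically typed part: a checker can assume n is an int here.
    return 2 * n


def from_untyped(payload):
    # Dynamically typed part: nothing is known about payload statically,
    # so the value is checked when it crosses into typed code.
    return double(cast_int(payload["count"]))
```

Calling from_untyped({"count": 21}) goes through the check and reaches double, whereas from_untyped({"count": "21"}) fails at the boundary with a TypeError instead of silently returning a repeated string.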

Grants

Sponsored Research Grant - Oracle (Deep Integration of Programming Languages and Databases with Truffle/Graal)

Former grants:

ANR project - Blanc Typex: Typeful certified XML: integrating language, logic, and data-oriented best practices.
ANR project - DEFIS Codex

Software

The implementation of BOLDR (developed by Julien Lopez)
CDuce, an XML-centric functional programming language that is part of major Linux distributions (Debian/Ubuntu, Fedora, Mandriva), is used as a sandbox to investigate advanced type systems

Main publications

Books and book chapters

Chapter 1: NoSQL Languages and Systems, Kim Nguyễn, NoSQL Data Models: Trends and Challenges, Olivier Pivert Editor, 2018, ISTE
XML Typechecking, V. Benzaken, G. Castagna, H. Hosoya, B-C Pierce and S. Vansummeren. (Invited chapter in) Encyclopedia of Database Systems, Springer Verlag, 2009.

International Journals

Fast in-memory XPath search using compressed indexes, Diego Arroyuelo, Francisco Claude, Sebastian Maneth, Veli Mäkinen, Gonzalo Navarro, Kim Nguyen, Jouni Sirén, Niko Välimäki, 399-434, Softw., Pract. Exper., 2015
Optimizing XML Querying using Type-based Document Projection, V. Benzaken, G. Castagna, D. Colazzo, K. Nguyen. In ACM Transactions on Database Systems (TODS), March 2013.

International Conferences

Language-integrated queries: a BOLDR approach, Véronique Benzaken, Laurent Daynès, Giuseppe Castagna, Julien Lopez, Kim Nguyễn, Jérôme Siméon, Romain Vernoux, WWW, 2018
Set-theoretic types for polymorphic variants, Giuseppe Castagna, Tommaso Petrucciani, Kim Nguyễn, 378-391, ICFP, 2016
A Core Calculus for XQuery 3.0, Giuseppe Castagna, Hyeonseung Im, Kim Nguyễn and Véronique Benzaken. In ESOP'15, European Symposium on Programming Languages, ETAPS 2015: 11-18 April 2015, London, UK.
Polymorphic Functions with Set-Theoretic Types. Part 2: Local Type Inference and Type Reconstruction. G. Castagna, K. Nguyễn, Z. Xu, and P. Abate. In POPL'15, 42nd ACM Symposium on Principles of Programming Languages, pp. 289-302, January 2015.
Polymorphic Functions with Set-Theoretic Types. Part 1: Syntax, Semantics, and Evaluation. G. Castagna, K. Nguyễn, Z. Xu, H. Im, S. Lenglet, and L. Padovani. In POPL'14, 41st ACM Symposium on Principles of Programming Languages, January 2014.
Static and dynamic semantics for NoSQL Languages, V. Benzaken, G. Castagna, K. Nguyen, J. Simeon. In ACM International Conference on Principles of Programming Languages (POPL), Rome, 2013.

VALS

Verified Algorithms, Languages and Systems