The Data Challenge
The Data Challenge is a competition that aims to study datasets provided by public organizations.
Edition 2018 (see the poster)
The two winning teams are LPSM204 (Paris Sorbonne) and Keyrus (UPMC). The first prize was awarded by Francis Bach during the 50th Journées de Statistique at EDF Lab Saclay.
This year, it is organized in collaboration with Quantmetry and EDF.
Datasets and system evaluation
The objective of this challenge is to predict the electricity consumption of the island of Ouessant over 8 days, with the help of the following data:
- one year of historical consumption data, on an hourly grid (conso_train.csv),
- one year of meteorological data on a three-hour grid, from the nearby Brest meteorological station (meteo_train.csv),
- a week of meteorological data on a three-hour grid, from the same station and acting as a weather forecast (meteo_prev.csv).
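Because consumption is hourly while the weather data comes on a three-hour grid, the two series have to be brought onto a common grid before modeling. The following is a minimal sketch of one possible approach (linear interpolation), not a method prescribed by the organizers. The function name `upsample_to_hourly` and the sample temperatures are hypothetical; the actual column layout of meteo_train.csv is not described here.

```python
from datetime import datetime, timedelta


def upsample_to_hourly(observations):
    """Linearly interpolate sorted (timestamp, value) pairs onto an hourly grid.

    Assumption: consecutive timestamps are a whole number of hours apart.
    """
    hourly = []
    for (t0, v0), (t1, v1) in zip(observations, observations[1:]):
        steps = int((t1 - t0).total_seconds() // 3600)
        for k in range(steps):
            # Value at t0 + k hours, on the straight line between v0 and v1.
            hourly.append((t0 + timedelta(hours=k), v0 + (v1 - v0) * k / steps))
    if observations:
        # The last observation is not covered by the loop above.
        hourly.append(observations[-1])
    return hourly


# Hypothetical three-hourly temperatures, for illustration only.
obs = [
    (datetime(2015, 9, 13, 0), 12.0),
    (datetime(2015, 9, 13, 3), 15.0),
    (datetime(2015, 9, 13, 6), 9.0),
]
hourly = upsample_to_hourly(obs)
for stamp, value in hourly:
    print(stamp.isoformat(), value)
```

Linear interpolation is only one choice; repeating the last known value or aggregating consumption to three-hour blocks are equally defensible alternatives.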
The principle is to treat the meteorological forecasts as perfect and to work within the framework defined by the figure below to predict hourly consumption over a week:
The evaluation criterion for forecast quality will be the mean absolute percentage error (MAPE).
In case of a tie on the MAPE, the jury will use the root mean squared error (RMSE) to identify the best-performing solution.
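As an illustration of these two criteria, here is a minimal sketch using their standard definitions. It assumes the MAPE is expressed in percent and that no actual consumption value is zero; the jury's exact implementation is not given here, and the function names and sample values are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero in `actual`)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the data."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


def ranking_key(actual, predicted):
    """Sort key: lower MAPE first, RMSE used only to break ties."""
    return (mape(actual, predicted), rmse(actual, predicted))


# Hypothetical values, for illustration only.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted), rmse(actual, predicted))
```

Sorting submissions by `ranking_key` reproduces the stated rule: the RMSE only matters when two MAPE values are equal.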
Additional datasets may be introduced, with the agreement of one of the members of the jury, if they are of interest for the modeling.
Information that is not present in the dataset may be used as long as it is general, public information that allows variables to be created or modified from the provided data.
For example, adding holidays by hand is allowed, but adding more consumption data for training is not.
More specifically, the rules for using data external to the dataset provided by the jury are as follows:
- the data must be publicly available,
- the sources must be clearly explained in the notebook,
- the data must be dated before the week to be predicted (that is, they must be available before 13/09/2016).
Thus, under these rules, finer-grained weather data may be used, but only for the past; consumption data (local or national) may also be used, but only over the same period as the training sample (between 13/09/2015 and 13/09/2016). Information known in advance, such as holidays, tide days, or sunset times, can be used. However, if you use data that only becomes available day by day, you must rely on a prediction that is available and made explicit in the notebook.
Feel free to contact the jury if you are in doubt about what you may or may not use.
Regulation and registration
Teams registering for the data challenge must consist of one to three people.
The solutions must consist of a file named theNameofYourSet.csv in the same form as sample_solution.csv (also called conso_prev.csv: hourly consumption between 13/09/2016 00:00 and 20/09/2016 23:00), as well as a notebook with the reproducible code that produced the results from the provided data. The notebook format is encouraged by the jury, but another tutorial format, with supporting explanations, will also be accepted.
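The forecast window runs from 13/09/2016 00:00 to 20/09/2016 23:00 inclusive, which corresponds to 192 hourly values. The sketch below builds that grid and writes a CSV; the header names (`date`, `consumption`), the timestamp format, and the zero placeholder predictions are assumptions made for illustration. The real layout is whatever sample_solution.csv specifies.

```python
import csv
import io
from datetime import datetime, timedelta


def forecast_timestamps(start, end):
    """Hourly timestamps from `start` to `end`, both inclusive."""
    count = int((end - start).total_seconds() // 3600) + 1
    return [start + timedelta(hours=k) for k in range(count)]


def write_submission(handle, timestamps, predictions):
    """Write one row per hour; the header names are hypothetical."""
    if len(timestamps) != len(predictions):
        raise ValueError("one prediction is required per timestamp")
    writer = csv.writer(handle)
    writer.writerow(["date", "consumption"])
    for stamp, value in zip(timestamps, predictions):
        writer.writerow([stamp.strftime("%Y-%m-%d %H:%M"), f"{value:.2f}"])


stamps = forecast_timestamps(datetime(2016, 9, 13, 0), datetime(2016, 9, 20, 23))
buffer = io.StringIO()  # stands in for an open file
write_submission(buffer, stamps, [0.0] * len(stamps))  # placeholder predictions
print(len(stamps))
```

Checking the row count before sending is a cheap way to catch an off-by-one error at either end of the window.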
They must be sent by email to the following address: email@example.com, with the following title: [Data Challenge Challenge] Team Name.
The jury can only assess submissions that include detailed reports (with the code provided in R, Python, or Matlab) and clear explanations.
The organizers undertake not to profit from the proposed solutions by monetizing them; the only benefit sought is the advancement of scientific knowledge, which is shared by everyone. The solutions remain the property of their authors.
The registration form is available here.
Warning: to obtain the data, you must fill in the registration form.
The competition will run as follows:
- Start of registration: 30th January 2018
- Closure of registration: 12th February 2018
- Sending the dataset: 17th February 2018
- Start of submissions: 19th February 2018
- End of submissions: 23rd April 2018
- Publication of results: 14th May 2018
- Presentation: week of 28th May 2018
- Luis Blanche, Quantmetry R&D
- Vincent Brault, Université Grenoble-Alpes
- Émilie Devijver, CNRS
- Raphaël Nédellec, EDF R&D
A prize will be awarded for the best prediction. The two best teams commit to presenting their work during the special session of the "Young Statisticians" group organized during the 50th Journées de Statistique at EDF Lab Saclay. In addition, the two best teams will be invited to submit an article summarizing the award-winning work to the journal CSBIGS of the SFdS.
Young Statisticians Organizers: Émilie Devijver and Valérie Robert