Developing a nowcasting algorithm that can work without big data
The need for short-range forecasting
The constant demand for faster release of official statistics has fuelled a growing need among statistical offices for accurate nowcasts, based on reliable nowcasting techniques, that give advance estimates of key components of their headline statistics. To respond to this challenge, in 2016 Eurostat – the statistical office of the European Union – initiated the Eurostat Nowcasting Competition for the nowcasting of official statistics of EU member states. In the sphere of official statistics, nowcasting refers to the short-range forecasting of an officially published statistic with short release timelines.
The Eurostat Nowcasting Competition is also known as the Big Data for Official Statistics Competition, or BDCOMP for short, because it encouraged the use of big data in nowcasting. Big data are data sources characterised by high volume, in the sense of being large in scale; high variety, in that they exist in many different forms; and high velocity, in the sense of streaming swiftly. All of these characteristics demand cost-effective and innovative forms of information processing to enhance insight as well as decision-making. However, the use of big data was not an explicit requirement of the competition. Rather, encouraging it was part of a formal attempt by the competition’s Scientific Committee to establish what the accuracy of nowcasts would be with and without the use of such data.
About the Eurostat Nowcasting Competition
A seven-member Scientific Committee drawn from Eurostat itself, the Organisation for Economic Co-operation and Development (OECD), the German Central Bank (Deutsche Bundesbank), and the Statistical Offices of Italy (ISTAT), Slovenia (SURS) and Romania (INS) adjudicated the 2016 instalment of the competition.
Calls for participation in the competition were publicised on the Eurostat website as well as in the IMS Bulletin, the newsletter of the Institute of Mathematical Statistics. Participation entailed 12 submission rounds – one for every month of 2016 – according to the participant’s choice of track and of the country for which the nowcasts were to be made. Any one of the 28 member countries could be paired with any of seven tracks, each referring to a monthly indicator for nowcasting, these being:
- Unemployment levels
- Harmonised Index of Consumer Prices (HICP) – all items
- HICP excluding energy
- Tourism – nights spent at tourist accommodation establishments
- Tourism – nights spent at hotels
- Volume of retail trade
- Volume of retail trade excluding automotive fuel.
Participation in the competition was anonymised and administered through a two-stage elimination process. The main requirements were that participants could not change their methodology after disclosing it at the time of entry, and that all nowcasts had to be submitted ahead of the release of the official statistics for which they were made. Failing to meet these requirements led to elimination at the first stage. In the second stage, the surviving participants were further trimmed to determine the top five entries. From among these, the best-performing entry in each track was then chosen. The second-stage selections were made after the last submission round, according to criteria established in advance by the competition’s Scientific Committee. The three criteria used in the 2016 competition were:
- Average error of the nowcasts, in terms of how far they are overall from their actual, i.e. official, counterparts;
- Directional accuracy, in terms of whether the nowcasts correctly predict the direction of change of their official counterparts; and
- Their likelihood of being generally representative, in terms of the extent to which the nowcasted estimates resemble their official counterparts.
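The first two criteria can be illustrated with a minimal Python sketch. The exact formulas used by the Scientific Committee are not given here, so the definitions below (mean absolute error, and direction measured against the previous official value) are assumptions, and the figures are made up for illustration. The third criterion is not sketched because its definition is not specified.

```python
def mean_absolute_error(nowcasts, officials):
    """Average absolute distance between nowcasts and official values."""
    return sum(abs(n - o) for n, o in zip(nowcasts, officials)) / len(officials)


def directional_accuracy(nowcasts, officials):
    """Share of periods in which the nowcast moves the same way as the official figure.

    Assumption: the direction for period t is measured against the
    previous official value, officials[t - 1].
    """
    hits = 0
    for t in range(1, len(officials)):
        official_change = officials[t] - officials[t - 1]
        nowcast_change = nowcasts[t] - officials[t - 1]
        same_up = (official_change > 0) == (nowcast_change > 0)
        same_down = (official_change < 0) == (nowcast_change < 0)
        if same_up and same_down:
            hits += 1
    return hits / (len(officials) - 1)


# Made-up illustrative figures, not competition data.
official = [100.0, 102.0, 101.0, 103.0, 104.0]
nowcast = [100.5, 101.5, 102.5, 102.0, 103.5]

mae = mean_absolute_error(nowcast, official)        # average error
hit_rate = directional_accuracy(nowcast, official)  # share of correct directions
```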
The top five entries as well as the best performers of the competition, announced by Eurostat in March 2017 at its biennial statistics conference, New Techniques and Technologies for Statistics, held in Brussels, Belgium, were:
- Team ETLA of the Research Institute of the Finnish Economy in collaboration with the Massachusetts Institute of Technology
- Team JRC of the European Commission’s Joint Research Centre
- University of Warwick Forecast Team
- Dr Roland Weigand of the Research Institute of the German Federal Employment Agency
- Prof George Djolov of the University of Stellenbosch Business School and Stats SA.
Prof Djolov’s entry – which was based on a newly developed Robust Nowcasting Algorithm, or RNA for short – finished first in directional accuracy in the track for nowcasting Ireland’s monthly unemployment levels.
Reflecting on the competition’s results
In commenting on the competition’s results, its Scientific Committee reflected that the accuracy of nowcasts does not necessarily improve with the increased use of big data. The results suggested that, in relative terms, techniques short of big data can perform similarly or just as well, especially if they use data sources that fit directly into their nowcasting context while also being made adaptive.
The RNA resonates with the Scientific Committee’s findings. It is an algorithm that functions without requiring big data or any knowledge of the data’s distributional properties. However, when available, such data are a bonus to the algorithm.
The RNA is developed from established methods whose seeming remoteness from one another is bridged by the “plug-and-play” principle. Formally, Harrison’s smoothing procedure is “plugged” into the Kolmogorov-Zurbenko filter, and in turn their combination is implemented by a mix of Tukey’s and Hann’s filters. The RNA emerging from this blending has the advantages of technical simplicity, reliability, and ease of use in practice. Furthermore, to prevent seasonality from obscuring the principal behaviour of an examined series, the effects of seasonality are first chained to determine how they change this behaviour from one period to another, before being factored out according to the periods they come from. In the RNA, this is achieved by plugging the Persons method of seasonal adjustment into the algorithm’s blended filter.
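To give a feel for one of the building blocks named above, here is a minimal Python sketch of a plain Kolmogorov-Zurbenko filter, which is an iterated centred moving average. It does not reproduce the RNA's blending with Harrison's procedure or with the Tukey and Hann filters. The shrinking window at the series ends and the example figures are assumptions made for illustration.

```python
def moving_average(series, window):
    """Centred moving average with an odd window.

    Assumption: near the ends of the series the window shrinks to the
    points that are available, so the output keeps the input length.
    """
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def kz_filter(series, window, iterations):
    """Kolmogorov-Zurbenko filter: apply the moving average repeatedly."""
    result = list(series)
    for _ in range(iterations):
        result = moving_average(result, window)
    return result


# Made-up noisy series with an upward drift.
raw = [10, 14, 9, 15, 11, 16, 12]
trend = kz_filter(raw, window=3, iterations=3)
```

Each extra iteration smooths the series further, which mirrors the repeated-filtration idea behind the algorithm.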
The RNA … an algorithm that functions without a precondition for having big data or knowing anything about the data’s distribution properties, focuses on the filtration of a series by signal extraction and noise reduction.
It’s like making wine, in a way
The RNA’s premise is that the best nowcast of a series is most likely to come from the series itself, thus ensuring that any noise in estimation can be traced to a single source, i.e. the series itself. The RNA’s goal then becomes extracting the series’ signal, or what we commonly refer to as its trend, by recursively drawing it out from the coarseness of the data. This is done repeatedly, much like filtering impurities out of wine with a funnel until a satisfactory finish is reached. By this analogy, the wine is the data; the funnel is the RNA; and its filters for removing impurities, i.e. the noise, are the smoothing techniques that make up its inner mechanism.
With each filtration, noise is cleared out of the data until clarity of taste is reached, i.e. the trend is revealed and improved. Key to this improvement is controlling the trend’s extraction from the start, by first establishing the boundaries of variation encountered in the data, formally called the control limits. They are derived from Tukey’s control chart using the interquartile range as a measure of spread. The limits so derived serve as guideposts for continuously refining the extractions of prior rounds, with the objective of either minimising or at least not growing the encountered variability. As part of this, the limits dictate either the recasting of abnormal points in the processed data to more usual values or their replacement with the next-in-line less abnormal ones, which is a procedure known as Winsorisation. This control monitoring and enforcement in the RNA kicks off with a sample size that is mechanically determined to contain the least amount of wine, i.e. data, from what is available in stock. Technically speaking, Dodge’s sampling rule is applied. In this way, manual trial and error about the amount of data needed to initialise the algorithm is eliminated. Once the sample size has been established, the corresponding data are collected “prospectively” up to the most recent available point, given that the aim of collection is to operationalise the RNA as a forward-looking (nowcasting) instrument capable of determining the immediate future of a series.
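The control-limit and Winsorisation steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the RNA's own code: the multiplier of 1.5 is the conventional Tukey fence and is not stated in the article, the quartile method is Python's "inclusive" interpolation, and abnormal points are replaced by the nearest observed value that lies inside the limits. The sample is made up.

```python
import statistics


def tukey_limits(sample, k=1.5):
    """Control limits from the interquartile range (k = 1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorise(sample, lower, upper):
    """Replace points outside the limits with the nearest observed in-limit value."""
    inside = [x for x in sample if lower <= x <= upper]
    floor, ceiling = min(inside), max(inside)
    cleaned = []
    for x in sample:
        if x < lower:
            cleaned.append(floor)
        elif x > upper:
            cleaned.append(ceiling)
        else:
            cleaned.append(x)
    return cleaned


# Made-up sample with one abnormal point (40).
sample = [12, 13, 13, 14, 15, 15, 16, 40]
lower, upper = tukey_limits(sample)
cleaned = winsorise(sample, lower, upper)
```

Here the abnormal value 40 falls above the upper limit and is replaced by 16, the largest observation inside the limits.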
In terms of the wine analogy used earlier, the main output from the RNA’s clean-up is a triple-distilled wine, which is further distilled for a seasonally neutral taste, so that the end product is not shaped by a specific harvesting period. In technical terms, the output is a stably smoothed nominal series that is projected, with or without the retention of seasonality, using the Hann filter as the projection rule. In the case without seasonality, the smoothed nominal series is deseasonalised by having its inter-seasonal movements removed, based on their periods of origin. For this, the inter-seasonal movements are localised at their mid-points in order to determine their representative periodic (or originating) values, which are then filtered three times for noise reduction before being separated from the smoothed nominal series in turn. Afterwards, the remaining deseasonalised series is redistilled twice more to cut down any left-over impurities, i.e. noise, before being extrapolated in the same manner as its nominal counterpart. As mentioned, this is done by a mechanical rule drawing on the weights of the Hann filter.
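For readers unfamiliar with the Hann filter, the Python sketch below shows its classic three-point form, with weights 0.25, 0.5 and 0.25, applied as a smoother. The article does not spell out how the RNA turns these weights into a projection rule, so the projection step is not reproduced. Leaving the end points unchanged is an assumption, and the figures are made up.

```python
HANN_WEIGHTS = (0.25, 0.5, 0.25)  # classic three-point Hann weights


def hann_smooth(series):
    """Three-point Hann smoothing; end points are left unchanged (assumption)."""
    if len(series) < 3:
        return list(series)
    w_prev, w_mid, w_next = HANN_WEIGHTS
    smoothed = [series[0]]
    for i in range(1, len(series) - 1):
        smoothed.append(
            w_prev * series[i - 1] + w_mid * series[i] + w_next * series[i + 1]
        )
    smoothed.append(series[-1])
    return smoothed


# Made-up zig-zag series: one pass flattens the interior oscillation.
zigzag = [4, 8, 4, 8, 4]
once = hann_smooth(zigzag)

# Repeated passes, in the spirit of filtering "three times".
thrice = zigzag
for _ in range(3):
    thrice = hann_smooth(thrice)
```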
The RNA’s extra strength comes from its being a sequential method, in the sense that it operates by accumulating information. It reboots, i.e. reruns its computational sequence, every time a new observation is added, so that its estimates are progressively refined as new data are drafted in. This systematic inclusion of the latest information for purposes of recalculation gives the RNA its adaptive ability to update as well as to cast forward a series during its processing. In terms of our wine analogy, this is wine fortification: the estimation of the series is strengthened by taking on extra spirits, i.e. additional observations. The exception is the rebooting of the calculations for the seasonal effects, which are updated only once, at the beginning of every nowcasting period. This is done in order to build up the evidence of the seasons’ effects, whose fluctuating and extended nature exposes their influence only after their occurrence. When they come to an end, a picture emerges of how the seasonalities prevail, shedding light on how the stockpile of such prevalence would shape a series’ development until the next occurrence. That is why in the RNA the stockpiled effects from the seasons are prospectively removed based on what is observed about them from their prior cycles.
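The sequential idea, rerunning the whole computation each time an observation arrives, can be sketched in Python as an expanding-window loop. The function `smooth_and_project` is a hypothetical stand-in (a mean of the last three points), not the RNA's actual pipeline, and the data are made up.

```python
def smooth_and_project(history):
    """Hypothetical stand-in for the full pipeline: mean of the last three points."""
    recent = history[-3:]
    return sum(recent) / len(recent)


def sequential_nowcasts(series, start):
    """Rerun the estimator from scratch each time one more observation is available."""
    nowcasts = []
    for t in range(start, len(series) + 1):
        history = series[:t]          # everything observed so far
        nowcasts.append(smooth_and_project(history))
    return nowcasts


# Made-up monthly figures; nowcasting starts once three points are in hand.
observed = [10, 12, 11, 13, 14, 15]
nowcasts = sequential_nowcasts(observed, start=3)
```

Each entry in `nowcasts` uses only the data available at that point, so later entries draw on more information than earlier ones.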
Visual diagnostics make things better
A picture is worth a thousand words, and this is certainly true of the RNA’s diagnostic toolkit, where the overlay plot and the diagonal plot are the two visual diagnostics by which the algorithm’s generated, or nowcasted, series are juxtaposed with their official counterparts. The overlay plot gives visual confirmation as to whether the trend has been extracted successfully, in the sense of being clear and traceable in path following the RNA’s noise compression. The diagonal plot visually confirms whether the RNA’s generated series resemble their official counterparts, in the sense of how well the former imitate the latter, namely whether they are limited in, or free from, inventing a direction that does not exist in the original (in this case official) series. If there are any shortcomings, these plots will expose them graphically. In that case the RNA would be an inappropriate nowcasting algorithm to use, implying that an alternative filter or approach would be better suited for the nowcast.
As with all visual checks, optical illusion is possible. To minimise the chances of this, the RNA’s diagnostic toolkit includes three numerical measures, which summarise into single numbers what is observed in the overlay and diagonal plots. They are the extent of the dissimilarity between the generated and the official series; the degree of association between them; and the extent to which the generated series systematically deviates from its official counterpart. Technically, these are computed as the relative mean absolute error, the Pearson correlation, and Lin’s bias coefficient respectively. All are expressed as percentages to standardise their evaluation. The benefit of these numerical measures is that they reinforce verification of the visuals by encouraging a secondary check before deciding on the suitability of the nowcasted RNA series. In the end, this consolidates interaction with the RNA and serves to improve confidence in its results regarding the immediate future of the series it is applied to.
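The three numerical measures can be sketched in Python as follows. The definitions are assumptions where the article is silent: the relative mean absolute error is scaled by the sum of absolute official values, and "Lin's bias coefficient" is taken to be the bias correction factor C_b from Lin's concordance correlation, computed with population (1/n) variances. The figures are made up.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def relative_mae_pct(generated, official):
    """Absolute errors as a percentage of the total absolute official values."""
    errors = sum(abs(g - o) for g, o in zip(generated, official))
    return 100.0 * errors / sum(abs(o) for o in official)


def pearson_pct(generated, official):
    """Pearson correlation, expressed as a percentage."""
    mg, mo = _mean(generated), _mean(official)
    cov = sum((g - mg) * (o - mo) for g, o in zip(generated, official))
    var_g = sum((g - mg) ** 2 for g in generated)
    var_o = sum((o - mo) ** 2 for o in official)
    return 100.0 * cov / math.sqrt(var_g * var_o)


def lin_bias_pct(generated, official):
    """Lin's bias correction factor C_b as a percentage (100 = no systematic deviation)."""
    mg, mo = _mean(generated), _mean(official)
    sd_g = math.sqrt(_mean([(g - mg) ** 2 for g in generated]))
    sd_o = math.sqrt(_mean([(o - mo) ** 2 for o in official]))
    scale_shift = sd_g / sd_o                        # v in Lin's notation
    location_shift = (mg - mo) / math.sqrt(sd_g * sd_o)  # u in Lin's notation
    return 100.0 * 2.0 / (scale_shift + 1.0 / scale_shift + location_shift ** 2)


# Made-up example: the generated series runs one unit above the official one.
official = [100.0, 102.0, 101.0, 103.0, 104.0]
generated = [x + 1.0 for x in official]

rmae = relative_mae_pct(generated, official)
corr = pearson_pct(generated, official)
bias = lin_bias_pct(generated, official)
```

In this example the correlation is perfect while the bias factor drops below 100, which shows why association alone does not reveal a systematic deviation.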
RNA – an algorithm that can handle the nowcasting of big data with speed
The RNA’s computational simplicity makes it attractive for handling the high volumes that characterise big data. That simplicity also gives it the speed to work with the high variety of such data and to absorb them as they stream. Its use to nowcast Ireland’s monthly unemployment levels demonstrates this. Its blending of established techniques gives it familiarity and also promotes their integrated, as opposed to isolated, use.
- For more details about the competition, its participants and their methodologies visit the European Commission’s website at https://ec.europa.eu/eurostat/cros/content/bdcomp_en
- Read more about the competition’s session during Eurostat’s New Techniques and Technologies for Statistics (NTTS) Conference at: https://www.conference-service.com/NTTS2017/documents/agenda/data/sessions/session_29.html
- Prof George Djolov is an Associate Professor Extraordinaire at USB. He lectures in quantitative methods and economics.