Air traffic control speech corpus online

A speech corpus is a set of digital recordings of speech together with annotation, metadata, and documentation. Speech corpora provide the basic data for research and development in fields such as spoken language processing, automatic speech recognition, text-to-speech systems, spoken language interfaces, spoken language understanding and speaker verification. Many aspects characterise a corpus, including speaker profiles, number of speakers, vocabulary, domain, task, phonological distribution, speaking style, recording set-up, annotation, technical aspects, structure, validation, metadata and documentation.

The ATCOSIM corpus is within the domain of civil air traffic control (ATC). The development of spoken language technologies for ATC requires corpora that are tailored to this task and domain. This is due to the specific language and phraseology in use. The ATCOSIM corpus also facilitates the study and modelling of the actual controller language in use.

The voice recordings for the ATCOSIM corpus were made in the ATC operations room of the EUROCONTROL Experimental Centre (EEC) in Brétigny-sur-Orge, France. The real-time simulations during which the recordings were made studied the impact of reduced vertical separation minima (RVSM) between aircraft, a concept that is now in operation in core Europe. A detailed report on this ‘3rd Continental RVSM Real-Time Simulation’ is available.
  3rd Continental RVSM Real-time Simulation. EEC Report 315 (PDF)
The simulation implemented real sector layouts and air traffic samples. The amount of air traffic was varied to simulate different controller workloads, and the vertical separation between aircraft was also modified. During the simulations only the controllers’ voices were recorded, not the pilots’. The recordings cover simulations of airspace sectors in Germany (Söllingen, controlled by the Karlsruhe centre) and Switzerland (Zurich and Geneva). The ten recorded speakers were all actively employed controllers in the respective ATC centres and were of Swiss or German nationality.

The ATCOSIM corpus is based on 50 sessions of real-time simulation exercises during which the controllers’ voices were recorded using a close-talk headset microphone. Since 80% of the 51 hours of recordings contained no speech, the segments containing a speech signal were automatically extracted using the status of the controller’s push-to-talk (PTT) button, which was also recorded. In total, 10,078 controller utterances are included, which account for approximately 10.7 hours of speech. The ATCOSIM corpus contains the speech signal of each utterance in separate audio files, which are grouped and labelled by speaker and session.

An orthographic transcription of each utterance is included in a separate file. The transcription follows a stringent format definition, which is included in the corpus documentation. It defines a clear notation for truncations, acronyms, unknown words, etc., and defines a consistent spelling for numbers, airline telephony designators, navigational aids, phonetically spelt letter sequences and non-verbal articulations. The transcription includes additional labels for human noises, word fragments, unintelligible words, etc., and markings for off-talk and for speech in languages other than English. A paper entitled ‘The ATCOSIM Corpus of Non-Prompted Clean ATC Speech’ by Konrad Hofbauer et al. has been accepted for the 6th Language Resources and Evaluation Conference (LREC 2008), organised by the European Language Resources Association (ELRA).
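The push-to-talk based extraction of speech segments described above can be illustrated with a minimal sketch. It is not the tooling used to build the corpus: the representation of the PTT status as a list of booleans, the function name `ptt_segments`, and the 10 Hz sampling rate of the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple


def ptt_segments(ptt: List[bool], sample_rate: int) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of samples where PTT is pressed.

    `ptt` is a hypothetical per-sample button status track; the real
    recording format of the PTT signal is not described in this page.
    """
    segments = []
    start = None  # index where the current pressed run began, if any
    for i, pressed in enumerate(ptt):
        if pressed and start is None:
            start = i  # button went down: a segment starts
        elif not pressed and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # button released: the segment ends
    if start is not None:
        # Recording ended while the button was still pressed.
        segments.append((start / sample_rate, len(ptt) / sample_rate))
    return segments


# Toy PTT track (assumed 10 samples per second): two button presses.
ptt = [False] * 5 + [True] * 20 + [False] * 10 + [True] * 15
segs = ptt_segments(ptt, sample_rate=10)
print(segs)  # [(0.5, 2.5), (3.5, 5.0)]
```

Each returned time pair could then be used to cut one utterance out of the continuous session recording and store it as a separate audio file.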

Links

The ATCOSIM corpus was produced jointly by the EEC and Graz University of Technology and is available online free of charge at:
  ATCOSIM corpus
or on DVD for a small handling fee from the European Language Resources Association:
  European Language Resources Association
A pre-print of the accompanying paper is also available:
  ‘The ATCOSIM Corpus of Non-Prompted Clean ATC Speech’, Konrad Hofbauer et al., 6th Language Resources and Evaluation Conference (LREC 2008)

Contact

For further information please contact:
Horst Hering
  Last validation: 20/02/2008