Language researchers are assembling Australia’s first representative collection of digitised spoken and written language.
The Australian National Corpus project is a partnership between Griffith, Sydney, Melbourne, Monash and Macquarie Universities, ANU and UNE.
It aims to bring together the country’s language data resources in one place so researchers and educators can have access to a wide range of language data types.
Dr Michael Haugh from Griffith University’s School of Languages and Linguistics says the researchers will collate all forms of written and spoken language, including fiction and non-fiction books, emails and other online interactions, and speech.
“Many countries have large corpora including the US, UK, Germany and Denmark, but Australia’s language data resources remain scattered and relatively inaccessible,” Dr Haugh said.
“The Australian National Corpus initiative involves a concerted push by linguists, applied linguists, language technologists and those interested in language more generally to establish a massive database of language in Australia.
“A national corpus will give educators access to authentic language for teaching purposes as well as providing linguists with the information they need to test hypotheses.”
Currently the largest collections of Australian English are the Australian Corpus of English and the Australian component of the International Corpus of English – about one million words each.
“While this may sound like a lot, other countries hold much larger corpora. The British National Corpus is one hundred million words, and the Corpus of Contemporary American English is more than four hundred million words.
“The reason other countries have built such large corpora is that many questions about language and its use can only be answered when you have a much bigger, more representative collection.”
Dr Haugh says the corpus will be useful to those studying languages in Australia and seeking to understand what it means to be Australian.
“It will also be a helpful resource for those teaching English and other languages as it will provide real-life, authentic examples of spoken and written language to use in the classroom.”
A national corpus would also support the construction of human-computer interaction systems.
“Unless we all want to start speaking like Americans, then language technologists will increasingly need access to large collections of data where Australians are speaking English.
“While it might be irritating to be answered by a computer system on the phone, the development of more Australian-friendly systems is at least one way to reduce the annoyance factor.”
Researchers will hold an Australian National Corpus workshop at the Queensland College of Art, South Bank, on Friday, November 19 from 9.30am to 5pm.