Professor Will Lamb Inaugural Lecture

Recording of Professor Will Lamb's Inaugural Lecture

Thank you very much, Alex. Before I begin, I'd like to mention a few people who have been really instrumental in the work that I'm going to be describing tonight. My Gaelic technology partners in crime: Dr Mark Sinclair; Dr Bea Alex, who can't be with us tonight, as she's ill at the moment; Professor Peter Bell; Dr Ondřej Klejch, who works with Peter; and Professor Rob Ó Maolalaigh.

Without Mark's help and expertise in particular, I'm not sure we would have ever gotten started. We met ten years ago when I was auditing a Python course here at the university, and he was one of the best teachers I've ever had. I went up to him after one of his classes and asked him if he fancied working on Gaelic language technology together, and he said, "Sure." I couldn't believe it. I'm still amazed he said yes. Bea Alex and I have been collaborating for the past six years or so on a host of projects, and she's such a brilliant, multifaceted and friendly colleague; it's just always a joy to work with her. Peter and Ondřej are researchers on our speech recognition grant. They were also my MSc supervisors, and they handled that inversion of roles beautifully; it was a huge privilege to learn from them that way. Also on that grant is my first MSc supervisor, Professor Rob Ó Maolalaigh. He gave me my first real taste of Gaelic linguistics a long, long time ago, and he's been an inspiration over my career since then. So thanks to all of you.

And finally, our head of department, Dr Neill Martin, who's here somewhere. By all rights, I should have taken over from Neill as head of department about four years ago. Frankly, he's better at it than I would ever be, but I'm grateful to him for continuing when he could have insisted that it was my turn. He's always made me feel that my work is valuable, and he's facilitated it wherever he could. So thank you very much, Neill.

It seems like AI is everywhere. It's in our cars, it's in our phones, it's in our medical devices, and in our entertainment systems. It's now even in some of our rubbish bins. If you ask enough people, some of them will say they wish it would stay there. But for others, we're living in exciting times. Speakers of English and a handful of other languages can now hold nearly seamless conversations with AI-based conversational agents. Unfortunately, this isn't true for the rest of the 7,000 spoken languages in the world. But what if it were? To give you a taste of what this might look like, here's a video demonstrating OpenAI's Advanced Voice Mode for one language, Portuguese.

"Hey, I'm Christine, and I'm a native English speaker, but I've been trying to learn Portuguese for fun." "And hi, I'm Nacho. I speak Spanish natively, and English, and I understand most of Portuguese, but I can't really speak it. So can you help us have a conversation in Portuguese?" [The assistant replies in Portuguese.] "Could you start us off with a conversation? Maybe ask us a few questions in Portuguese so we can practise?" [The assistant asks questions in Portuguese and the conversation continues.]
Could you hear that okay? Okay. Now, I'm not a Portuguese speaker. One of my colleagues, Rob Dunbar, is, amazingly, but I'm guessing that the voice synthesis there isn't perfect. In time, I imagine it probably will be.

About half of human languages are predicted to disappear in the next century. Wouldn't providing robust conversation technology for them be the best way of saving them? After all, these tools model linguistic production and reception better than any dictionary or grammar can. Could AI save endangered languages? In particular, could AI save Scottish Gaelic? Unfortunately, I think the answer is: unlikely. I'm hedging because who knows what AI in the future might resemble, but I'm sure it won't save any languages on its own.

A thought experiment or two can make this really clear. Imagine a cavernous room of identical desks set out in rows. Upon each desk is a laptop that can converse fluently in one of every human language that's ever been spoken. Would this save the world's languages? Now, as long as humans exist to visit such a place, it might have some limited value, but what relevance would a random stream of sound from 50,000 years ago hold for you? With no other information, on what basis would you prefer one stream of sound over another? As Fishman says in Reversing Language Shift, languages are inseparable from their cultures. There's little, if any, culture in this room, and I'd argue that it would be useful only really to a small number of hardcore linguists. It would be a digital mausoleum, in some sense.

Let's put humans back in the picture and see if anything changes. Let's replace each laptop with one human speaker for every language. This time we'll allow a basic label for each language, written on a piece of paper before them. For the construct that we call English, what if our representative is a 22-year-old middle-class Black female from Baltimore? Instead, what if it is a 78-year-old upper-class male from London? On what basis is one more representative than another? Which would you choose, and why? I think it's worth thinking about that.

It's actually very hard to pin down what we mean by a language. Linguistic form varies with age, ethnicity, location, time period, social position, situational context, and more. No representation can exist without loss, whether it's computer-based or the forms produced by a single individual. If we limit the representation of a language to a single point, we lose nearly all the variation that makes it real in the first place. Arguably, to save 21st-century English for posterity, we'd need the diversity that exists in that entire room. Although other languages may be more local, less ethnically diverse or whatever, they too are little without their living communities.

If we're saying that AI can't save Gaelic or any other language on its own, maybe we're asking the wrong question. Could AI help revitalise Gaelic? Well, I think that that's more likely. The word "revitalise" means to imbue something with life again, and life can only exist in something that's living; for example, a speech community.
In the remainder of this lecture, I'll make a start on examining how AI, so called, might help in the revitalisation effort for Gaelic, and for other endangered languages by extension. I'll also outline some ways to assess the risks and benefits of language technologies for endangered languages. Here are the questions that will guide this: What's the status of Gaelic today? How do at least some Gaelic users view AI? What is AI anyway? What can we do with Gaelic technology currently? And how can we assess the impacts of AI on threatened languages? And a quick health warning: these are huge areas to discuss in 45 minutes. This isn't going to be fully satisfying, but I hope at some point I'll be able to write this up. It'll be a little bit more satisfying then.

So what's the status of Gaelic today? Well, perhaps Gaelic is doing fine without AI. Let's look at the recent census, as flawed as it is. In the 2022 census, the number of people with some Gaelic skills in Scotland increased by 43,100 people. This might suggest that Gaelic is on a firm footing. That's a massive increase. A major problem with the census, though, is that one can't establish respondents' fluency levels, how often they use the language, or indeed where. In contrast to that apparent growth, the number of people who can speak Gaelic in the so-called heartland, the Outer Hebrides, has dropped considerably. It's now 45% of the population of the Outer Hebrides, whereas in 2011 it was 52% and in 2001 it was 60%. So the trend is for more people to report Gaelic skills while speakers in hereditary areas decline. It's a metaphor for the ages, isn't it, in a way? Without some intervention, this decrease in the heartland is unlikely to change.

To improve the situation for a language like Gaelic, it's helpful to keep certain goals in mind. At the top of the list, of course, everybody would like to increase the active users of the language. We can do that by looking at transmission in the home, and also by thinking about new speakers: adult learners, pupils in immersive schools, and so on. Developing resources is hugely important. With Gaelic, we already have a standard orthography. Great. We can tick that box; a lot of languages don't even have that. We've got dictionaries, we've got grammars, we've got corpora. There's still a lot to do. This says it's a comprehensive grammar. Is it really? How can it be? It's a start. But anyway, there's a lot more to be done, even with that.

In terms of structured support, getting structured support from policy and institutions, we can think about trying to embed the language in formal education more, strengthening the language in business settings and economic life, and developing grassroots support via community groups. And diversifying usage domains, of course. Right now, these domains have attenuated so much. Even things like crofting are now done, I mean, based upon my experience, they're now done through the medium of English much more than they were 20 years ago.
When I first went to Uist in, you know, 1997, I think, if you went out onto the moor to, I don't know, do the sheep dipping or something like that, it was predominantly through the medium of Gaelic. I can guarantee that's not the case today. So we can think about widening domains of usage, and about raising the status and visibility of the language through signage and media presence, et cetera. We can't have a living language without a thriving speech community. Could AI be the deus ex machina that allows us to progress this, the unexpected solution that saves the day?

Well, let's start by looking at how Gaelic users view it at the moment, or at least some Gaelic users. I did a very unscientific, very brief survey of people's ideas about how AI could help them learn or use Gaelic better. I posed this question on X (x.com) as well as to several Gaelic interest groups that I belong to on Reddit and Facebook. I have to say, the results surprised me. I should state, however, that I think the sample population here is not a great representation of the views of heritage speakers of Gaelic. In general, my impression is that they are much more open to the idea of using AI to benefit the language.

I took all of the comments and the likes associated with them and assembled them in a spreadsheet. Then I manually judged each comment as having positive, negative or neutral sentiment. As can be seen in this chart, the likes on negative comments outnumber those on positive comments five to three, and over half of the total comments were negative. It's difficult to know how knowledgeable the people responding were about AI or language technology in general and how it works. Certainly, there's a lot of fear about its impact on the environment and employment, and about the notion that it's being imposed upon people. The top five comments were: AI is harmful to jobs and the environment; keep AI away from heritage languages; get rid of AI; AI is being forced on us; and Gaelic Duolingo doesn't use AI. That last one seems a little bit random, but actually a lot of the people responding were on the forum for Gaelic Duolingo.

Now, I thought that last comment was interesting. Gaelic Duolingo has been used by over 2 million people, and that's really impressive. I mean, it's orders of magnitude above the number of Gaelic speakers that we have today. When somebody suggested that Gaelic Duolingo did not use AI, I put up a clip from a 2020 news article, two years before ChatGPT came on the scene, in which Duolingo's own CEO said that AI was embedded in every aspect of the app. What was the response? Radio silence. This suggested to me a certain amount of cognitive dissonance, but also that many people today equate AI very strongly with large language models. Now, in any case, I think Big Tech isn't really winning hearts and minds here at the moment, at least with the Gaelic learner community.
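To make the tallying exercise I described a moment ago a little more concrete, here is a minimal sketch of the kind of count I ran over that spreadsheet. The column layout and example rows are hypothetical; the real data were the collected comments and likes, with sentiment labels I assigned by hand.

```python
from collections import Counter

# Each row: (manually judged sentiment, number of likes on the comment).
# These example rows are made up; the real spreadsheet held the survey comments.
comments = [
    ("negative", 12), ("negative", 7), ("positive", 5),
    ("neutral", 2), ("positive", 4), ("negative", 6),
]

comment_counts = Counter(sentiment for sentiment, _ in comments)
like_totals = Counter()
for sentiment, likes in comments:
    like_totals[sentiment] += likes

print("Comments per sentiment:", dict(comment_counts))
print("Likes per sentiment:", dict(like_totals))
# For the real survey data, likes on negative comments outnumbered
# likes on positive comments roughly five to three.
```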
So let's turn to the positive comments. There were a number. The top one was that it would be great if AI could provide interactive conversation in the language. Others: wouldn't it be great if it could help us locate phrases and other information better, help teach us pronunciation, and help build corpora and language resources; and ASR, or speech recognition, is actually really useful, people were saying. These suggestions align with my own intuitions about what would benefit the Gaelic community.

The biggest bottleneck towards fluency for Gaelic learners, and I know this very well, as do a lot of people in this room, is finding opportunities to speak the language with a native speaker, or even just a really good fluent speaker. Simply put, that situation is not going to improve. Additionally, gaining entry to that experience is very fraught. You have to pretend that you understand everything when you really don't. It's like gaining credit when you've got none, at least back in the old days.

The great promise of technology is providing a simulation of naturalistic conversation. But getting there is a challenge even with large languages. When you see this technology working today, we're so used to it, we're so inundated with it, that we don't think about what actually went on under the hood to get there. It is tough. It is backbreaking. It's intellectually difficult, and a lot of it is actually just annotation, getting data together. And that in and of itself, we're talking about millions of work hours devoted to just one aspect of something, a lot of the time. So it requires collaboration between language communities as well as big tech. If we were to get somewhere advanced with Gaelic, we'd almost certainly need to involve big tech because of the cost of developing these models. You just can't do it within a university most of the time.

As will be clear in a moment, we can, however, locate phrases and information embedded in audio files, for example, and use technology to build corpora and language resources. That's possible in large part because of speech recognition, and a lot of what we're doing right now is exactly that. But anyway, before we get to that, before we get to some demonstrations, let's consider what AI is and how it works. So, what is AI?

Well, in vernacular usage, as I said, the connotations associated with artificial intelligence have changed a lot recently. I remember the day that ChatGPT was launched, because I was doing the MSc here at the University, and it blew everyone's mind. I could talk about that ad infinitum. But anyway, these days the term AI has become synonymous with generating text from large language models like OpenAI's ChatGPT and Google's Gemini. When this term was first coined in 1955, AI meant to make machines use language, form abstractions and concepts, solve the kinds of problems now reserved for humans, and improve themselves; that is, the models improving themselves. So that definition suggests that we should be able to generate and understand natural language, which is what we can do with chatbots; induce hypotheses from empirical data, which is a little bit broader; produce solutions to problems; and learn from past errors. All of this sounds a lot like the promise of AI today.
It was quite prophetic when you think about it. What was far from prophetic, though, was how long researchers expected that to take. There have been a lot of AI winters in the interim. In July 1958, the New York Times published an article about the first type of neural network, called a perceptron, and the perceptron was expected to form the basis of a thinking computer that could walk, talk, see, write, reproduce itself, and be conscious of its own existence. And they thought that would take one year. Needless to say, this type of strong AI still does not exist, but the performance of large language models is very impressive across many tasks today.

Behind that impressive performance, though, it's remarkable how simple LLMs, large language models, actually are in some ways. They work by predicting the most likely token, a word or a part of a word, given the tokens that you already have. When you put a prompt into ChatGPT, it breaks it down into little bits, and all those tokens form your initial context for querying the model, which it uses to predict the next token. So here, if you take the phrase "President of the United" and put that into ChatGPT, it'll tell you that the next word is most likely going to be "States". It does that implicitly as it generates. It's almost certainly going to give you the top response, or one of the top responses, although there's a certain amount of randomness in there. And this kind of repetitive generation has a name: it's called autoregression.

Now, the basis of nearly all advanced language technology today is the neural network. Here's a really simple representation of one. You can think of each one of these nodes, pardon me, the circles, as representing a step through the network. Our programme director on the MSc used to talk about thinking of it as a meat grinder or something: you just turned the grinder and it went through. But anyway, the knowledge, if you like, is stored in the lines that connect these nodes, and these are known as weights or parameters. A neural network is trained by tweaking these parameters countless times in response to getting things wrong, to incorrect predictions. It's a form of conditional learning. So when you make a prediction using a neural network, you're basically taking some group of numbers and sticking it through the network, and those numbers get transformed by these weights and certain other operations as you go on. When those numbers reach the far side of the network, they're often turned into a set of probabilities across all the possible outputs, a probability distribution. In an LLM, one of those final nodes will represent the next most likely word.

Now, large language models don't actually process text under the hood. It looks like they do, but they don't really. Each token is assigned a numerical representation. Think of it as an address or a telephone number: a vector of numbers that represents its meaning, its grammatical category (is it a noun or a verb or whatever), and other aspects. These vectors are called word embeddings. The word "bank" after "president of the" has a very different embedding than "bank" would have if it followed the word "river", for example. This is a consequence of a very famous machine learning technique called attention. I'm being very hand-wavy and glossing over a lot of details here, but hopefully some of this makes sense. To make a prediction, you send all these word embeddings into a neural network and, due to the way it was trained, it spits out a prediction of the next token.
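To make that description a little more concrete, here is a toy sketch of autoregressive next-token prediction: a tiny vocabulary, made-up embedding vectors, a crude attention-style weighting of the context, and a softmax that turns scores into a probability distribution over possible next words. Everything here (the vocabulary, the vectors, the single weight matrix) is invented for illustration; a real LLM does the same kind of thing with billions of learned parameters.

```python
import numpy as np

# Toy vocabulary and made-up 4-dimensional "embeddings" for each token.
vocab = ["president", "of", "the", "united", "states", "kingdom", "river"]
emb = {w: v for w, v in zip(vocab, np.random.default_rng(0).normal(size=(len(vocab), 4)))}

# A single made-up weight matrix standing in for the trained network.
W = np.random.default_rng(1).normal(size=(4, len(vocab)))

def softmax(x):
    """Turn raw scores into a probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(context):
    """Crude attention-style step: average the context embeddings, weighting recent tokens more."""
    vectors = np.stack([emb[w] for w in context])
    weights = softmax(np.arange(len(context), dtype=float))  # later tokens get more weight
    summary = weights @ vectors                               # one context vector
    return softmax(summary @ W)                               # distribution over the vocabulary

context = ["president", "of", "the", "united"]
probs = predict_next(context)
for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{word:10s} {p:.2f}")
# A trained model would put nearly all of the probability on "states";
# feeding each prediction back in as new context is what's called autoregression.
```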
You can see there isn't really a lot that's fundamentally mysterious about how these things work. They're prediction machines; that's all. They're not conscious entities, despite what you might have read, and they're not likely to take over the world anytime soon. What's complicated about them is the intricacy of those weights. You're talking about billions upon billions of them, folding into one another in high-dimensional spaces, and these weights can, in a sense, compress things like the entire Internet. To read the full text that was used to develop the first iteration of ChatGPT, so GPT-3, it would take a single individual 26,000 years of reading, 24 hours a day, seven days a week. That's a lot of compressed information.

But AI is a lot more than just large language models. Because of how vague the term AI is, and its connotations with chatbots, Terminators, et cetera, I think it's helpful to use a different term. So we could use "speech and language technology" as a more neutral term. Chatbots are a form of that, but so is speech recognition, handwriting recognition, speech synthesis, orthographic normalisation systems, and much more. Let's look at a few of these now, in terms of what you can do with Gaelic language technology.

Much of the potential training corpus that we have for Gaelic is actually quite old. A lot of the text that's online is there thanks to Rob Ó Maolalaigh and his team at DASG, the Digital Archive of Scottish Gaelic, at the University of Glasgow. A lot of this text goes back to the 19th century and before, and it's not immediately usable for some of the things that we want to do. So you're talking about millions and millions of words in older forms of orthography. One of the things that we've tried to do is develop a way, using neural networks, to convert it into modern orthography. We developed this tool for correcting things like OCR mistakes as well, and it's just a proof of concept, but it's available online for people to try. So here you can see that we've taken a really messy text and made some guesses about how it would look in modern orthography. Now, it's quite slow; that's the only thing. If we're going to do this at scale, we need to find a way to speed it up substantially, probably using simpler architectures and also GPUs.
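As a tiny illustration of the kind of mapping involved, here is one of the simplest patterns in Gaelic orthographic modernisation: pre-reform texts used acute accents on some long vowels (mór, glé), whereas the modern conventions use only the grave (mòr, glè). A one-line character mapping handles that particular pattern; the neural normaliser mentioned above is needed for everything a rule like this can't touch, such as changed word forms, hyphenation and OCR errors.

```python
# Map pre-reform acute-accented vowels to the modern grave-accented forms.
# This handles only one, very regular, pattern of orthographic change.
ACUTE_TO_GRAVE = str.maketrans("áéíóúÁÉÍÓÚ", "àèìòùÀÈÌÒÙ")

def modernise_accents(text: str) -> str:
    return text.translate(ACUTE_TO_GRAVE)

print(modernise_accents("Bha e glé mhór"))  # -> "Bha e glè mhòr"
```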
The way that we got started with all this, though, is actually on a simpler problem, and that's recognising handwriting. The first thing that we did with Mark was a project on handwriting recognition, with a view to doing things like speech recognition later. The School of Scottish Studies Archives has a huge supply of meticulously transcribed audio: transcriptions of folklore narratives and interviews from the 1950s, 60s, and so forth. Using Transkribus, a tool that many of you will know from the digital humanities, we built up a model that eventually achieved 95% accuracy at the word level. So we could run through tonnes and tonnes of handwritten transcriptions, after digitising them, and get the words back from those transcriptions. Hugely useful. We're now disseminating these texts back to the public, and this week, in fact, we're finishing a large research project that will make thousands of these pages of transcribed folklore available online for the first time.

And here's a first glimpse of what that website is going to look like. You type in the kind of folk tale that you're interested in. There's a classification system called Aarne-Thompson-Uther, so you can type in a tale type number and get back that particular folk tale. You can see it on the map, you can get all the versions in PDF, you can get the text extracted from them, and that kind of thing. So it's going to be a lot of fun. And Julianne is one of the people who's really helped push this forward.

Now, I talked about speech recognition. Here's a demo using a recent news broadcast. I should say the subtitles that you'll see here are the raw output from the system; nothing has been corrected. [A Gaelic news clip plays, with the system's automatic subtitles displayed.]

Okay. So how accurate is it? Well, this graph shows our accuracy, or word error rate, on our test set. A year ago, we were getting 77.4% of the words correct. Now we're at 86.9%. That's a jump of about 10%. It seems small, but it's massive for one year, and it's very much thanks to Peter Bell and Ondřej Klejch. There's also been a huge amount of data collection around that, involving Rob Ó Maolalaigh and a number of other people. This means that we can now transcribe a huge set of audio and video fairly reliably. For recordings on Tobar an Dualchais or in the BBC archives, for example, it's possible to search that audio for words and phrases that occur and get the points in the audio where they appear, so we can go straight to that point. It's really fantastic. It's incredibly helpful towards resource development and language teaching.
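Here is a minimal sketch of the kind of search that time-aligned ASR output makes possible. It assumes the recogniser gives you, for each recording, a list of words with start and end times (the format and the example words here are invented for illustration); finding a phrase is then just a matter of scanning that list and returning the timestamps.

```python
# Each ASR transcript: a list of (word, start_seconds, end_seconds) tuples.
# The words and times below are invented; real output would come from the recogniser.
transcript = [
    ("bha", 0.00, 0.21), ("sgeulachd", 0.21, 0.80), ("aige", 0.80, 1.05),
    ("mu", 1.05, 1.18), ("na", 1.18, 1.30), ("sìthichean", 1.30, 2.02),
]

def find_phrase(transcript, phrase):
    """Return the start times of every occurrence of a (lower-cased) phrase."""
    words = [w.lower() for w, _, _ in transcript]
    target = phrase.lower().split()
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            hits.append(transcript[i][1])  # start time of the first word in the match
    return hits

print(find_phrase(transcript, "na sìthichean"))  # -> [1.18]
```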
We only have about 150 million words of Gaelic text right now, though, and several hundred hours of aligned audio, and that's not very much. There are diminishing returns with the data increases that you put into this system: the closer you get to lower error rates, the harder it gets to bring them down further. You've got to multiply the amount of training data that you put in. To improve Gaelic speech recognition, we need a lot more text, especially transcribed text, particularly in underrepresented domains like traditional narrative.

During my MSc thesis, we experimented with synthesising that type of data instead, using GPT-4, the language model underlying ChatGPT. We took a series of human-produced summaries in English from Tobar an Dualchais and fed them through a fine-tuned GPT-4o model to produce story text. And then, to make this come to life a little bit for us tonight, I asked Dan Wells, a PhD student in Informatics, to pass one of these stories through his text-to-speech system. Dan trained the synthetic Gaelic voice that you'll hear in a moment on Ruairidh MacIlleathain's excellent Letter to Learners broadcasts, all of which are available on the LearnGaelic website. The rendering of Ruairidh's voice is very good, but there are a few pronunciation errors here and there, and I hasten to add that that's not Ruairidh's fault in any way; it's due to limitations in the model and the training data. The machine translation was carried out using GPT-4o's baseline model. So I'll play this for you now. [A synthesised Gaelic story plays.]

One of the remarkable things that emerged from this experiment is that the model came up with a few neologisms, words that never existed before in Gaelic, and some of them were ridiculous. But one in particular stood out as being kind of interesting, and that was a word for spit or vomit. That kind of hallucination is really interesting. It's a little bit like when, you remember, an AI model beat everybody at Go, a very complicated game indeed, and people started talking about some of the moves as being almost a kind of genius. I mean, to come up with a new word for a language, a word that never existed before, I think is actually quite difficult. I'm not sure I could do it. So it's fascinating, but it does illustrate one of the potential harms as well, and that's information hazards.
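Going back to the generation step for a moment: here is a rough sketch of the summary-to-story step described above, using the OpenAI Python client. The fine-tuned model identifier, the prompt wording and the example summary are all hypothetical placeholders, not the real experimental setup; the real experiment fed English summaries from Tobar an Dualchais through a fine-tuned GPT-4o model.

```python
# Hypothetical sketch: generating a Gaelic story draft from an English summary
# with a fine-tuned chat model. Requires the `openai` package and an API key;
# the model name below is a placeholder, not a real deployment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

summary = "A fisherman from Barra outwits a water-horse with the help of his dog."

response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:example:gaelic-stories:placeholder",  # hypothetical id
    messages=[
        {"role": "system",
         "content": "You expand short English summaries into traditional-style "
                    "stories written in Scottish Gaelic."},
        {"role": "user", "content": summary},
    ],
    temperature=0.8,
)

print(response.choices[0].message.content)
```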
So now that we've looked at what we've achieved in Gaelic technology so far, let's look at the big question, which is how we assess the impact of language technology on endangered languages and what its potential is for language revitalisation. To my knowledge, no one's looked at this topic very closely or meticulously. What I'm trying to do here is just an initial examination of the area. One thing is clear, though: the calculus is different for every type of technology. The risks and potential benefits for the minority language community differ. So let's begin by identifying the stakeholders that are most affected by these innovations, and then we'll look at some of the key risks and consider two short case studies.

Some of the key stakeholders here are, of course, the users of the language; I mean, they're the top ones: adult learners, L2 speakers, heritage speakers, L1 speakers, immersion pupils and their parents as well. Then, of course, businesses, educators at all levels, the government, researchers, tech companies, and the third sector, community groups. There are others, but I think these are some of the main ones.

In terms of the risks, well, these intersect with the risks that have already been identified in the literature, for example in Google DeepMind's paper "Taxonomy of Risks Posed by Language Models". Starting with information hazards: these come from a proliferation of synthetic text in particular, and you've got two different kinds. First, linguistic distortion: sparse training data leads to poor synthetic text output, which leads to distortion of linguistic norms, like non-native word orders or idiom, or hallucinations like the one we saw before. Then, of course, distortion of culture. LLMs are getting better at factual accuracy. I tried this: I asked ChatGPT yesterday, in Gaelic, "Can you tell me more about all the people in the Western Isles that still believe in fairies?" And it came back and said, well, actually, there's no evidence that this is widespread in the Western Isles of Scotland. I was a little bit encouraged by that. It talked about a lot of very interesting information that you can find in the archives, for example, about fairy belief. So it wasn't bad, but there are still some real risks here, particularly when it comes to minority cultures.

Model collapse is a consequence of information hazards. While synthetic text augmentation could be useful, as we talked about before, in the early stages of modelling, if you over-rely upon it, if you train on synthetic texts again and again, your models become very, very bad. They overgeneralise, they become more homogeneous and predictable, and currently there's no universally accepted way to signpost synthetic text or media. So this is a real problem. If something shows up as Gaelic text online, tagged with the language code for Gaelic, and it's synthetic, there's no way for you as a user to immediately tell that it's synthetic. We need to do something about that.

Representation bias comes from the fact that linguistic production is never context-free. If we create a synthetic voice like the one that you heard before, it will index a particular dialect, gender, age, and so forth.
And we wouldn't want our models to be over-representing one type of voice, implicitly suggesting that that's the best one; that, say, the North Uist dialect is the best. I mean, of course, it is. But we wouldn't want to say that, because anybody from any other place is going to say that the dialect they learned is the best.

Environmental harms are really clear. It's clear that the emissions associated with LLMs in particular are very heavy, particularly when training them, but also when doing inference with them. Many minority languages are spoken in areas that are already at risk of ecological collapse. The place where I used to walk on the beach in North Uist no longer exists. It was wiped out, I think in 2005, during a storm. So that's happening out there; it's a real thing. You have the encroachment of oceans due to climate change and other factors, but climate change is a big one. So it's kind of ironic that we'd be thinking about revitalising a language using the very technology that might be harming a lot of these areas. And then you have socioeconomic harms: the risk that language specialist jobs could be replaced with automation, for example in translation and content creation, but also in the creative industries in various ways.

Let's look at the situation for two contrasting forms of language technology, speech recognition and large language models, beginning with speech recognition. Some of the risks: linguistic distortion with ASR is actually relatively low, depending on the accuracy of your models; it's a function of your error rate. Again, the cultural distortion is low, because you have a supervised signal that you're following; you're not just picking words out of the air. The risk of model collapse is again quite low, at least with Gaelic ASR. The representation bias is, I would say, moderate, because currently we can recognise some dialects better than others. The environmental harms with this type of technology are relatively low, because you're not generating using massive arrays of GPUs and things like that. And the socioeconomic harms are, again, quite low. There are very, very few people who can do reliable, really good Gaelic transcription these days; it's very, very difficult to hire people to do this. So I don't think that we're going to put anyone out of a job, and the technology still isn't as good as the best people would be.

The revitalisation potential of that technology? Well, for increasing active users it's probably quite low, to be honest. But for developing resources, I think it's quite high. In terms of structured support from policy and institutions, I'd say it's moderate: you can strengthen the language in business settings and economic life to an extent, but beyond that, I'm not sure. For diversifying usage domains, I think it's probably quite low, but I'd be willing to be surprised. And for raising status and visibility, it depends, but I think it's moderate.
If we could incorporate Gaelic ASR across all the devices that we use, you know, Macs, PCs, phones, et cetera, it could make a real difference, I think, to a lot of Gaelic users. For example, a big one would be Gaelic-medium education (GME) students who have learning difficulties. I had an email yesterday from one of the education boards asking if this was going to come online soon.

LLMs. Well, what I'm going to say here assumes that Gaelic is being included in LLMs already, and it's based upon what I've seen in terms of the performance of the LLMs that we have. The risks? Well, for linguistic distortion, I'd say moderate to high. It's a function of the LLM's predictive power, or as we say in machine learning, its perplexity: how perplexed the model is by the kind of text it should be putting out. The cultural distortion is moderate, although it might be slightly less than that; I think the models are getting better at this aspect. The potential for model collapse, without a way of identifying synthetic text, is actually moderate to high, I would say. The representation bias is relatively low, because you're dealing with orthographically standardised text anyway. The environmental harms from including Gaelic in these are moderate; I'll talk more about that in a moment. The socioeconomic harms, I would say, are moderate, and they probably grow the better the models get for Gaelic.

In terms of the revitalisation potential right now: for increasing active users, I'd say it's moderate. There are a lot of people using, say, ChatGPT to improve their Gaelic skills, for better or for worse, so it is increasing the user base to an extent. For developing resources, at the stage that we're at right now, I'd say the potential is very low, because they're not great. The structured support from policy and institutions, again, I'd say is low. For diversifying usage domains, it's low, but if they were to get much better, I think that could be moderate to high. And finally, for raising the status and visibility, I'd say that we're dealing with a moderate potential right now.

So what do we do about the LLM problem? How do we make them work better for the Gaelic community? Well, here's a pragmatic view of the situation. We already have a Gaelic-speaking monkey at a computer keyboard; it's out there. Our choice is to teach the monkey to be more fluent, or to ignore the monkey and hope he'll go away. Realistically, that monkey is going nowhere. Unless we remove all the Gaelic text on the Internet and get rid of Wikipedia in Gaelic, that monkey is going to stay there. So perhaps we should work with Big Tech to make sure that the monkey can produce reasonable Gaelic.

Let's go back to the top positive comment from my very unscientific social media study. Moving away from LLMs to an extent, what would really make a difference to the Gaelic community? What would make a difference to the revitalisation effort? I think it's a little bit crass, but, for lack of a better label, I think it would be developing a virtual Gaelic speaker.
And that was the top positive comment from that social media study. If you could choose your dialect, your voice type, and the fluency level for this type of system, I think it could make a really big dent in teaching people Gaelic and helping people improve their current skills in the language. This would be a system that could politely correct you when you've made a mistake and encourage you along the way. I don't know how I learned Gaelic. I must have an incredibly brass neck. Being an American helps, definitely. But I mean, the amount of discouragement that you encounter as you're going through this journey is quite remarkable. There's a lot of encouragement too, but it's a mixed bag. It's guerrilla warfare. So if you could just pull out your phone and have a conversation, and have this voice that would lead you into speaking better Gaelic over time, I think it would be a bonus. I think, on balance, despite the risks, these opportunities are significant, especially if the output is audio and not text. If it's a closed system that's audio, you're not going to get the kind of pollution that you get from LLMs.

So how can we do this? First of all, we can't do this on our own. No university has the kind of budget that would allow us to do this right now. We need assistance from the Gaelic community and also from Big Tech. This is just a very simple illustration of some of the main ingredients. We need a lot more data, and the most important type would come from transcribed audio. The core idea here is that we'd use our current ASR system to transcribe recordings from various sources, crowdsource the correction of the transcriptions, and pass the corrected documents on as training data and back to the community. We can incorporate reinforcement learning with human feedback to ensure that the system is aligned for the purposes of teaching Gaelic and holding conversations that support language skills. And while we're building such a system, we can achieve a massive increase in Gaelic language resources, disseminated back to the community, as I said, also through things like the Digital Archive of Scottish Gaelic at the University of Glasgow, Tobar an Dualchais and so on. This kind of approach could, in theory, work for other endangered languages too, at least where you have an orthography and some resources.
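Purely as an illustration of that loop, here is a minimal sketch in Python. The recognise and request_correction functions are stand-ins (assumptions, not real APIs) for our ASR system and for whatever crowdsourcing platform would handle volunteer corrections; the point is just the shape of the pipeline: transcribe, have humans correct, and feed the corrected text back in as training data and community resources.

```python
from pathlib import Path

def recognise(audio_path: Path) -> str:
    """Stand-in for the ASR system: return an automatic transcript."""
    return f"<automatic transcript of {audio_path.name}>"

def request_correction(draft: str) -> str:
    """Stand-in for crowdsourced correction by fluent volunteers."""
    return draft  # in reality, a human-edited version would come back

def build_corpus(recordings, corpus_file: Path):
    """Transcribe, correct, and accumulate training data."""
    with corpus_file.open("a", encoding="utf-8") as out:
        for audio in recordings:
            draft = recognise(audio)                # 1. automatic transcription
            corrected = request_correction(draft)   # 2. human correction
            out.write(corrected + "\n")             # 3. add to the shared corpus
    # The same corrected transcripts can be returned to the archives
    # and used to retrain (and so improve) the recogniser.

build_corpus([Path("interview_001.wav"), Path("interview_002.wav")],
             Path("gaelic_training_corpus.txt"))
```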
So, just to wrap up here: if we're careful, I think we can make a massive difference to Gaelic speakers, especially to the active learning community. Here are some of the ways that we can ensure that we do this well, staying cognisant of the risks: involving the community, obviously, in the design and evaluation; curating an accessible, high-quality training corpus; improving the documentation that we have of the language, combating misrepresentation, and disseminating that back to the Gaelic community. We also need to signpost synthetic text and media. There are people working on this; there just aren't any solutions, any good ones, yet, but it definitely needs to happen. So, as human beings, where we use synthetic text, we should say that it's synthetic. If we used closed systems for generative AI, there'd be a lot less information pollution. And finally, I think it's important for all of us involved in this work to educate the community about generative AI, especially its risks and limitations. Teaching the public about this is very, very important; it should be part of our curriculum.

So, just some concluding remarks. The Gaelic language, like many other endangered languages, is at a crossroads. Unlike many other smaller languages, though, there are literally millions of people who want to learn Gaelic or improve their skills in it. But there are relatively few teachers available. We can't provide a patient, native-speaking teacher for each of these potential students, but perhaps we can provide the next best thing. With a careful approach and sufficient investment, we can put a range of virtual Gaelic speakers in the hands of everyone with a connected device. This is a moonshot with many, many risks, but on balance, I think language technology can play a major role in revitalising Gaelic and other endangered languages. I look forward to hearing your own thoughts about that.

If you're interested in further information about these topics, here are some links that you can follow. I'll put these slides up, or we'll make them available through the University website, so don't worry about copying anything down. I'd like to just say a quick thank you to our funders and our many collaborators. Agus tapadh leibh. Thank you.

The future is uncertain for Gaelic and most of the world’s minority languages. Could cutting-edge language technologies be the key to their survival? English speakers can now hold real-time spoken conversations with apps like OpenAI’s ChatGPT. What breakthroughs are needed to get us to that point for Gaelic? How might such a transformation affect language revitalisation efforts, for better and for worse?
This lecture introduces modern language technology to a general audience, showcasing ongoing research involving Gaelic at the University of Edinburgh. It then addresses tensions in collaborations between big tech and minority language communities, such as navigating data ownership and cultural preservation. Finally, it looks ahead, considering how AI might help revitalise not just Gaelic, but other minority languages.

About the speaker

Will Lamb was born and raised in Baltimore, Maryland. He completed a degree in Psychology from the University of Maryland Baltimore County in 1993 and spent two years as an RA on a Johns Hopkins-led research project on sleep disorders and biometrics. In 1995, after taking an interest in Gaelic and traditional music, he went to Nova Scotia and spent an academic year at St Francis Xavier University.
Will began his postgraduate study at the University of Edinburgh in 1996, taking an MSc in Celtic Studies. His dissertation was on the development of the Gaelic news register and was supervised by Rob Ó Maolalaigh. He started a PhD in Linguistics the following year. In Jan 2000, nearing the end of his PhD, he moved to North Uist to take up a lecturing position at Lews Castle College Benbecula (University of the Highlands and Islands). He is credited with initiating the successful music programme at Lews Castle College. Will finished his PhD in 2002, and it was published in 2008 as 'Scottish Gaelic Speech and Writing: Register Variation in an Endangered Language'.
Will was promoted to Senior Lecturer in 2017 and to Personal Chair in Gaelic Ethnology and Linguistics in 2022. His research interests span music, linguistics, traditional narrative and language technology. He is known, in particular, for his work on formulaic language, traditional music, Gaelic grammatical description and Natural Language Processing (NLP). Most of his recent work has been in Gaelic NLP, and he recently finished an MSc in Speech and Language Processing (University of Edinburgh).