Professor Will Lamb Inaugural Lecture

Recording of Professor Will Lamb's Inaugural Lecture

Transcript

Thank you very much, Alex. Before I begin, I'd like to mention a few people who have been really instrumental in the work that I'm going to be describing tonight: my Gaelic technology partners in crime, Dr Mark Sinclair; Dr Beatrice Alex, who can't be with us tonight, she's ill at the moment; Professor Peter Bell; Dr Ondřej Klejch, who works with Peter; and Professor Rob Ó Maolalaigh. Without Mark's help in particular, his help and expertise, I'm not sure we would ever have gotten started. We met ten years ago when I was auditing a Python course here at the university, and he was one of the best teachers I've ever had. I went up to him after one of his classes and asked him if he fancied working on Gaelic language technology together, and he said, "Sure." I couldn't believe it. I'm still amazed he said yes. Bea and I have been collaborating for the past six years or so on a host of projects, and she's such a brilliant, multifaceted and friendly colleague that it's just always a joy to work with her. Peter and Ondřej are researchers on our speech recognition grant. They were also my MSc supervisors, and they handled that inversion of roles beautifully; it was a huge privilege to learn from them that way. Also on that grant is my first MSc supervisor, Professor Rob Ó Maolalaigh. He gave me my first real taste of Gaelic linguistics a long, long time ago, and he's been an inspiration over my career since then. So thanks to all of you. And finally, our head of department, Dr Neill Martin, who's here somewhere. By all rights, I should have taken over from Neill as head of department about four years ago. Frankly, he's better at it than I would ever be. But I'm grateful to him for continuing when he could have insisted that it was my turn. He's always made me feel that my work is viable, and he's facilitated it wherever he could. So thank you very much, Neill.

It seems like AI is everywhere. It's in our cars, it's in our phones, it's in our medical devices, and in our entertainment systems. It's now even in some of our rubbish bins. If you ask enough people, some of them will say they wish it would stay there. But for others, we're living in exciting times. Speakers of English and a handful of other languages can now hold nearly seamless conversations with AI-based conversational agents. Unfortunately, this isn't true for the rest of the 7,000 spoken languages in the world. But what if it were? To give you a taste of what this might look like, here's a video demonstrating OpenAI's Advanced Voice Mode for one language, Portuguese.

"Hey, I'm Christine, and I'm a native English speaker, but I've been trying to learn Portuguese for fun." "And hi, I'm Nacho. I speak Spanish natively, and English, and I understand most of Portuguese, but I can't really speak it. So can you help us have a conversation in Portuguese?" [The AI replies in Portuguese.] "Could you start us off with a conversation? Maybe ask us a few questions in Portuguese so we can practise?" [The AI asks the two speakers questions in Portuguese and they answer; at one point one of them asks how to say "chess".]

Could you hear that okay? Okay. Now, I'm not a Portuguese speaker. One of my colleagues, Rob Dunbar, amazingly is, but I'm guessing that the voice synthesis there isn't perfect. In time, I imagine it probably will be. About half of human languages are predicted to disappear in the next century.
Wouldn't providing robust conversation technology for them be the best way of saving them? After all, these tools model linguistic production and perception better than any dictionary or grammar can. Could AI save endangered languages? In particular, could AI save Scottish Gaelic? Unfortunately, I think the answer is: unlikely. I'm hedging because who knows what AI in the future might resemble, but I'm sure it won't save any languages on its own. A thought experiment or two can make this really clear.

Imagine a cavernous room of identical desks set out in rows. Upon each desk is a laptop that can converse fluently in one of every human language that's ever been spoken. Would this save the world's languages? Now, as long as humans exist to visit such a place, it might have some limited value, but what relevance would a random stream of sound from 50,000 years ago hold for you? With no other information, on what basis would you prefer one stream of sound over another? As Fishman says in Reversing Language Shift, languages are inseparable from their cultures. There's little, if any, culture in this room, and I'd argue that it would really be useful only to a small number of hardcore linguists. It would be a digital mausoleum, in some sense.

Let's put humans back in the picture and see if anything changes. Let's replace each laptop with one human speaker for every language. This time we'll allow a basic label for each language, written on a piece of paper before them. For the construct that we call English, what if our representative is a 22-year-old middle-class black female from Baltimore? Or instead, what if it is a 78-year-old upper-class male from London? On what basis is one more representative than another? Which would you choose, and why? I think it's worth thinking about that. It's actually very hard to pin down what we mean by a language. Linguistic form varies with age, ethnicity, location, time period, social position, situational context, and more. No representation can exist without loss, whether it's computer-based or the forms produced by a single individual. If we limit the representation of a language to a single point, we lose nearly all the variation that makes it real in the first place. Arguably, to save 21st-century English for posterity, we'd need the diversity that exists in that entire room. Although other languages may be more local, less ethnically diverse or whatever, they too are little without their living communities.

If we're saying that AI can't save Gaelic or any other language on its own, maybe we're asking the wrong question. Could AI help revitalise Gaelic? Well, I think that that's more likely. The word 'revitalise' means to imbue something with life again, and life can only exist in something that's living, for example, a speech community. In the remainder of this lecture, I'll make a start on examining how AI, so called, might help in the revitalisation effort for Gaelic, and for other endangered languages by extension. I'll also outline some ways to assess the risks and benefits of language technologies for endangered languages. Here are the questions that will guide this: What's the status of Gaelic today? How do at least some Gaelic users view AI? What is AI anyway? What can we do with Gaelic technology currently? How can we assess the impacts of AI on threatened languages? And a quick health warning: these are huge areas to discuss in 45 minutes. This is not going to be fully satisfying, but I hope at some point I'll be able to write this up.
It'll be a little bit more satisfying then. So what's the status of Gaelic today? Well, perhaps Gaelic is doing fine without AI. Let's look at the recent census, as flawed as it is. In the 2022 census, the number of people with some Gaelic skills in Scotland increased by 43,100 people. This might suggest that Gaelic is on firm footing. That's a massive increase. A major problem with the census, though, is that one can't establish respondents' fluency levels, how often they use the language, or indeed where. In contrast to that apparent growth, the number of people who can speak Gaelic in the so-called heartland, the Outer Hebrides, has dropped considerably. It's now 45% of the population of the Outer Hebrides, whereas in 2011 it was 52%, and in 2001 it was 60%. So the trend is for more people to report Gaelic skills while speakers in the heritage areas decline. It's a metaphor for the ages, isn't it, in a way? Without some intervention, this decrease in the heartland is unlikely to change.

To improve the situation for a language like Gaelic, it's helpful to keep certain goals in mind. At the top of the list, of course, everybody would like to increase the active users of the language. We can do that by looking at transmission in the home, and also thinking about new speakers, adult learners, pupils in immersion schools, and so on. Developing resources is hugely important. With Gaelic, we already have a standard orthography. Great. We can tick that box. A lot of languages don't even have that. We've got dictionaries, we've got grammars, we've got corpora. There's still a lot to do. This one says it's a comprehensive grammar. Is it really? How can it be? It's a start. But anyway, there's a lot more to be done even with that. In terms of structured support from policy and institutions, we can think about trying to embed the language in formal education more, strengthening the language in business settings and economic life, and developing grassroots support via community groups. Diversifying usage domains, of course: right now, these domains have attenuated so much. Even things like crofting are now done, I mean, based upon my experience, they're now done through the medium of English, much more than they were 20 years ago. When I first went to Uist in, you know, 1997, I think, if you went out onto the moor to, I don't know, do the sheep dipping or something like that, it was predominantly through the medium of Gaelic. I can guarantee that's not the case today. So: think about widening domains of usage, and raising the status and visibility of the language through signage, media presence, et cetera. We can't have a living language without a thriving speech community.

Could AI be the deus ex machina that allows us to progress this, the unexpected solution that saves the day? Well, let's start by looking at how Gaelic users view it at the moment, at least some Gaelic users. I did a very unscientific, very brief survey of people's ideas about how AI could help them learn or use Gaelic better. I posed this question on X.com as well as in several Gaelic interest groups that I belong to on Reddit and Facebook. I have to say the results surprised me. I should state, however, that I think the sample population here is not a great representation of the views of heritage speakers of Gaelic. In general, my impression is that they are much more open to the idea of using AI to benefit the language. I took all of the comments and the likes associated with them and assembled them in a spreadsheet.
Then I manually judged each comment as having positive, negative or neutral sentiment. As can be seen in this chart, the likes on negative comments outnumber those on positive comments five to three, and over half of the total comments were negative. It's difficult to know how knowledgeable the people who were responding were about AI, or language technology in general, and how it works. Certainly, there's a lot of fear about its impact on the environment and employment, and about the notion that it's being imposed upon people. The top comments, the top five, were: AI is harmful to jobs and the environment; keep AI away from heritage languages; get rid of AI; AI is being forced on us; and Gaelic Duolingo doesn't use AI. That last one seems a little bit random, but actually a lot of the people responding were on the forum for Gaelic Duolingo.

Now, I thought that last comment was interesting. Gaelic Duolingo has been used by over 2 million people, and that's really impressive. I mean, it's orders of magnitude above the number of Gaelic speakers that we have today. When somebody suggested that Gaelic Duolingo did not use AI, I put up the following clip from a 2020 news article, two years before ChatGPT came on the scene. Duolingo's own CEO said at that time that AI was embedded in every aspect of the app. What was the response? Radio silence. This suggested to me a certain amount of cognitive dissonance, but also that many people today equate AI very strongly with large language models. In any case, I think Big Tech isn't really winning hearts and minds here at the moment, at least with the Gaelic learner community.

So let's turn to the positive comments. There were a number. The top one was: wouldn't it be great if AI could provide interactive conversation in the language? Wouldn't it be great if it could help us locate phrases and other information better, help teach us pronunciation, and help build corpora and language resources? And ASR, or speech recognition, is actually really useful, people were saying. These suggestions align with my own intuitions about what would benefit the Gaelic community. The biggest bottleneck towards fluency for Gaelic learners, and I know this very well, as do a lot of people in this room, is finding opportunities to speak the language with a native speaker, or even just a really good fluent speaker. Simply put, that situation is not going to improve. Additionally, gaining entry to that experience is very fraught. You have to pretend that you understand everything when you really don't. It's like getting credit when you've got none, at least back in the old days.

The great promise of this technology is providing a simulation of naturalistic conversation. But getting there is a challenge even for large languages. When you see this technology working today, we're so used to it, we're so inundated with it, that we don't think about what actually went on under the hood to get there. It is tough. It is backbreaking. It's intellectually difficult, and a lot of it is actually just annotation getting put together. And even that in and of itself, we're talking about millions of work hours devoted to just one aspect of something, a lot of the time. So it requires collaboration between language communities as well as big tech. If we were to get somewhere advanced with Gaelic, we'd almost certainly need to involve big tech because of the cost of developing these models. You just can't do it within a university most of the time.
As will be clear in a moment, we can, however, locate phrases and information embedded in audio files, for example, and use technology to build corpora and language resources. That's possible in large part because of speech recognition, and a lot of what we're doing right now is exactly that. But before we get to that, before we get to some demonstrations, let's consider what AI is and how it works.

So what is AI? Well, in vernacular usage, as I said, the connotations associated with artificial intelligence have changed a lot recently. I remember the day that ChatGPT was launched, because I was doing the MSc here at the University, and it blew everyone's mind. I could talk about that ad infinitum. But anyway, these days the term AI has become synonymous with generating text from large language models like OpenAI's ChatGPT and Google's Gemini. When the term was first coined in 1955, AI meant to make machines use language, form abstractions and concepts, solve the kinds of problems now reserved for humans, and improve themselves; that is, the models improving themselves. So that definition suggests that we should be able to generate and understand natural language (that's what we can do with chatbots), induce hypotheses from empirical data (that's a little bit broader), produce solutions to problems, and learn from past errors. All of this sounds a lot like the promise of AI today. It was quite prophetic when you think about it. What was far from prophetic, though, was how long researchers expected that to take. There have been a lot of AI winters in the interim. In July 1958, the New York Times published an article about the first type of neural network, called a perceptron, and the perceptron was expected to form the basis of a thinking computer that could walk, talk, see, write, reproduce itself, and be conscious of its own existence. And they thought that would take one year. Needless to say, this type of strong AI still does not exist, but the performance of large language models is very impressive across many tasks today.

Behind that impressive performance, though, it's remarkable how simple LLMs, large language models, actually are in some ways. They work by predicting the most likely token, a word or a part of a word, given the tokens that you already have. When you put a prompt into ChatGPT, it breaks it down into little bits, and all of those tokens form your initial context for querying the model, which uses that context to predict the following token. So here, if you take the phrase 'president of the United' and put that into ChatGPT, it'll tell you that the next word is most likely going to be 'States'. It does that implicitly as it generates. It's almost certainly going to give you the top response or one of the top responses, although there's a certain amount of randomness in there. And this kind of repetitive generation has a name: it's called autoregression.

Now, the basis of nearly all advanced language technology today is the neural network. Here's a really simple representation of one. You can think of each one of these nodes, pardon me, the circles, as representing a step through the network. Our programme director on the MSc used to talk about it like a meat grinder or something: you just turned the grinder and it went through. But anyway, the knowledge, if you like, is stored in the lines that connect these nodes, and these are known as weights or parameters.
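To make that "prediction machine" idea a little more concrete, here is a minimal toy sketch in Python. It is not the architecture of any real LLM: the vocabulary, the context vector and the weights are all invented for illustration, so the ranking it prints is arbitrary, whereas a trained model would put nearly all of the probability on 'states'.

```python
import numpy as np

# Toy next-token prediction: one layer of weights turns a context vector
# into a probability distribution over a tiny, invented vocabulary.
vocab = ["states", "kingdom", "nations", "bank", "river"]
context = np.array([0.9, 0.1, -0.3, 0.5])   # stand-in for the prompt's embeddings (invented)

rng = np.random.default_rng(0)
W = rng.normal(size=(context.size, len(vocab)))  # the "lines between the nodes"

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = softmax(context @ W)   # push the numbers through the weights

for word, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{word:10s} {p:.2f}")

# Autoregression: append the chosen token to the context and repeat.
next_token = vocab[int(np.argmax(probs))]
print("next token:", next_token)
```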
A neural network is trained by tweaking these parameters countless times in response to getting things wrong, to incorrect predictions. It's a form of conditioned learning. So when you make a prediction using a neural network, you're basically taking some group of numbers and sticking them through the network, and they get transformed by these weights and certain other operations as they go. When those numbers reach the far side of the network, they're often turned into a set of probabilities across all the possible outputs, a probability distribution. In an LLM, one of those final nodes will represent the next most likely word.

Now, large language models don't actually process text under the hood. It looks like they do, but they don't really. Each token is assigned a number, think of it as an address or a telephone number, and a vector of numbers that represents its meaning, its grammatical category (is it a noun or a verb or whatever), and other aspects. These vectors are called word embeddings. The word 'bank' has a very different embedding in a phrase about finance than it would if it followed the word 'river', for example. This is a consequence of a very famous machine learning technique called attention. I'm being very hand-wavy here, glossing over a lot of details, but hopefully some of this makes sense. To make a prediction, you send all of these word embeddings into a neural network and, due to the way it was trained, it spits out a prediction of the next token.

You can see there isn't really a lot that's fundamentally mysterious about how these things work. They're prediction machines. That's all. They're not conscious entities, despite what you might have read, and they're not likely to take over the world anytime soon. What's complicated about them is the intricacy of those weights. You're talking about billions upon billions of them, folding into one another in high-dimensional spaces, and these weights can, in a sense, compress things like the entire Internet. To read the full text that was used to develop the first iteration of ChatGPT, so GPT-3, it would take a single individual 26,000 years of reading 24 hours a day, seven days a week. That's a lot of compressed information.

But AI is a lot more than just large language models. Because of how vague the term AI is, and because of its connotations with chatbots, Terminators, et cetera, I think it's helpful to use a different term. So we could use 'speech and language technology' as a more neutral term. Chatbots are a form of that, but so are speech recognition, handwriting recognition, speech synthesis, orthographic normalisation systems, and much more.

Let's look at a few of these now, in terms of what you can do with Gaelic language technology. Much of the potential training corpus that we have for Gaelic is actually quite old. A lot of the text that's online is there thanks to Rob Ó Maolalaigh and his team at DASG, the Digital Archive of Scottish Gaelic, at the University of Glasgow. A lot of this text goes back to the 19th century or before, and it's not immediately usable for some of the things that we want to do. So you're talking about millions and millions of words in older forms of orthography. One of the things that we tried to do, then, was develop a way, using neural networks, to convert it into modern orthography. We developed this tool, which also corrects things like OCR mistakes; it's just a proof of concept, but it's available online for people to try.
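Purely for illustration, a sequence-to-sequence normaliser of this general kind could be called through the Hugging Face transformers library along the following lines. The model identifier and the input text are placeholders, not the actual proof-of-concept system described here.

```python
# A minimal sketch of calling a seq2seq orthographic normaliser via the
# Hugging Face `transformers` pipeline. The model name is a PLACEHOLDER,
# not the identifier of the proof-of-concept tool described in the lecture.
from transformers import pipeline

normaliser = pipeline(
    "text2text-generation",
    model="example-org/gaelic-orthography-normaliser",  # hypothetical model id
)

old_text = "..."  # pre-reform spelling or noisy OCR output would go here
result = normaliser(old_text, max_new_tokens=128)
print(result[0]["generated_text"])
```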
So here you can see that we've taken a really messy text and made some guesses about how it would look in modern orthography. Now, it's quite slow; that's the only thing. If we're going to do this at scale, we need to find a way to speed it up substantially, probably using simpler architectures and also GPUs.

The way that we got started with all this, though, was actually on a simpler problem, and that's recognising handwriting. So the first thing that we did with Mark was this project on handwriting recognition, with a view to doing things like speech recognition later. The School of Scottish Studies Archives has a huge supply of meticulously transcribed audio: transcriptions of folklore narratives and interviews from the 1950s, 60s, and so forth. Using Transkribus, a tool that many of you will know from the digital humanities, we built up a model that eventually achieved 95% accuracy at the word level. So we could run through tonnes and tonnes of handwritten transcriptions after digitising them and get the words back from those transcriptions. Hugely useful. We're now disseminating these texts back to the public, and this week, in fact, we're finishing a large research project that will make thousands of these pages of transcribed folklore available online for the first time. Here's a first glimpse of what that website is going to look like. You type in the kind of folk tale that you're interested in; there's a classification system called Aarne-Thompson-Uther, so you can type in an ATU number and get back that particular folk tale. You can see it on the map, you can get all the versions in PDF, you can get the text extracted from them, and that kind of thing. It's going to be a lot of fun. And Julianne is one of the people who's really helped to push this forward.

Now, I talked about speech recognition. Here's a demo using a recent news broadcast. I should say the subtitles that you'll see here are the raw output from the system; nothing's been corrected. [A clip of a Gaelic news broadcast plays, with the system's uncorrected subtitles on screen.] Okay. So how accurate is it? Well, this graph shows our word-level accuracy on our test set. A year ago, we were getting 77.4% of the words correct. Now we're at 86.9%. That's a jump of about 10 percentage points. It seems small, but it's massive for one year, and it's very much thanks to Peter Bell and Ondřej Klejch. There's also been a huge amount of data collection and work around that, involving Rob and a number of other people. This means that we can now transcribe a huge set of audio and video fairly reliably. So for recordings on Tobar an Dualchais or in the BBC archives, for example, it's possible to search that audio for words and phrases and get the points where they occur, so we can go straight to that point. It's really fantastic.
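For anyone wondering what figures like 77.4% and 86.9% actually measure: word-level accuracy is usually reported as one minus the word error rate (WER), which counts substitutions, insertions and deletions against a human reference transcript. Here is a small self-contained sketch of that calculation; the example sentence pair is invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed with a standard word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

# Invented example: one wrong word out of nine gives roughly 11% WER,
# i.e. roughly 89% word-level accuracy.
ref = "tha an latha brèagha agus tha a' ghrian a-muigh"
hyp = "tha an latha briagha agus tha a' ghrian a-muigh"
wer = word_error_rate(ref, hyp)
print(f"WER: {wer:.1%}  accuracy: {1 - wer:.1%}")
```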
It's incredibly helpful for resource development and language teaching. We only have about 150 million words of Gaelic text right now, though, and several hundred hours of aligned audio, and that's not very much. There are diminishing returns with the data increases that you feed into a system like this: the closer you get to low error rates, the harder and harder it gets to bring them down further. You've got to multiply the amount of training data that you put in. To improve Gaelic speech recognition, we need a lot more text, especially transcribed text, particularly in underrepresented domains like traditional narrative.

For my MSc thesis, we experimented with synthesising that type of data instead, using GPT-4, the language model underlying ChatGPT. We took a series of human-produced summaries in English from Tobar an Dualchais and fed them through a fine-tuned GPT-4o model to produce story text (there's a rough sketch of that kind of pipeline below). Then, to make this come to life a little bit for us tonight, I asked Dan Wells, a PhD student in Informatics, to pass one of these stories through his text-to-speech system. Dan trained the synthetic Gaelic voice that you'll hear in a moment on Roddy Maclean's excellent Letter to Gaelic Learners broadcasts, all of which are available on the LearnGaelic website. The rendering of Roddy's voice is very good, but there are a few pronunciation errors here and there, and I hasten to add that that's not Roddy's fault in any way; it's due to limitations in the model and the training data. The machine translation was carried out using GPT-4o's baseline model. So I'll play this for you now. [The synthesised Gaelic story plays.]

One of the remarkable things that emerged from this experiment is that the model came up with a few neologisms, words that never existed before in Gaelic, and some of them were ridiculous. But one in particular stood out as being kind of interesting, and that was a word for spit or vomit, 'hush'. That kind of hallucination is really interesting. It's a little bit like when, you know, an AI model beat everybody at Go, a very complicated game indeed, and people started talking about some of its moves as being almost a kind of genius. I mean, to come up with an onomatopoeic word for a language, a word that never existed before, is actually quite difficult. I'm not sure I could do it. So it's fascinating, but it does illustrate one of the potential harms as well, and that's information hazards.

So now that we've looked at what we've achieved in Gaelic technology so far, let's look at the big question, which is how we assess the impact of language technology on endangered languages and what its potential is for language revitalisation. To my knowledge, no one's looked at this topic very closely or meticulously. What I'm trying to do here is just an initial examination of the area. One thing is clear, though: the calculus is different for every type of technology. The risks and potential benefits for the minority language community differ.
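As flagged above, here is a rough sketch of what that summary-to-story augmentation step can look like in code, using the OpenAI Python client. The fine-tuned model identifier, the prompt wording and the example summary are all placeholders; this illustrates the general approach, not the exact setup used in the thesis work.

```python
# Sketch of generating synthetic Gaelic story text from an English summary.
# The model name is a PLACEHOLDER for a fine-tuned model; the prompt and
# summary are invented for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

english_summary = (
    "A fisherman from Uist meets a mysterious stranger on the shore "
    "who offers him three wishes."
)

response = client.chat.completions.create(
    model="ft:gpt-4o:example-org:gaelic-stories:placeholder",  # hypothetical id
    messages=[
        {"role": "system",
         "content": "You write traditional-style narratives in Scottish Gaelic."},
        {"role": "user",
         "content": "Write a short Gaelic folk tale based on this summary:\n"
                    + english_summary},
    ],
    temperature=0.8,
)

print(response.choices[0].message.content)
```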
So let's begin by identifying the stakeholders that are most affected by these innovations, and then we'll look at some of the key risks and consider two short case studies. Some of the key stakeholders here are, of course, the users of the language, I mean, they're the top ones: adult learners, L2 speakers, heritage speakers, L1 speakers, immersion pupils, and parents as well. Then, of course, businesses, educators at all levels, the government, researchers, tech companies, and the third sector, community groups. There are others here, but I think these are some of the main ones.

In terms of the risks, well, these intersect with risks that have already been identified in the literature, for example in Google DeepMind's paper, 'Taxonomy of Risks Posed by Language Models'. Starting with information hazards: these come from a proliferation of synthetic text in particular, and you've got two different kinds. First, linguistic distortion: sparse training data leads to poor synthetic text output, which leads to distortion of linguistic norms, like non-native word orders or idiom, or to hallucinations like the one we saw before. And then, of course, distortion of culture. Now, LLMs are getting better at factual accuracy. I tried this: I asked ChatGPT yesterday, in Gaelic, can you tell me more about all the people in the Western Isles that still believe in fairies? And it came back and said, well, actually, there's no evidence that this is widespread in the Western Isles of Scotland. I was, you know, a little bit encouraged by that. It talked about a lot of very interesting information that you can find in the archives, for example, about fairy belief. So it wasn't bad, but there are still some real risks here, particularly when it comes to minority cultures.

Model collapse is a consequence of information hazards. While synthetic text augmentation could be useful in the early stages of modelling, as we talked about before, if you over-rely upon it, if you train on synthetic texts again and again, your models become very, very bad. They overgeneralise, they become more homogeneous and predictable, and currently there's no universally accepted way to signpost synthetic text or media. So this is a real problem. If synthetic text shows up online tagged as Gaelic, if it has the language code for Gaelic and it's synthetic, there's no way for you as a user to immediately figure out that it's synthetic. We need to do something about that.

Representation bias comes from the fact that linguistic production is never context-free. If we create a synthetic voice like the one that you heard before, it will index a particular dialect, gender, age, and so forth. And we wouldn't want our models to be over-representing one type of voice, implicitly suggesting that that's the best one; that, say, the North Uist dialect is the best. I mean, of course it is. But we wouldn't want to say that, because anybody from any other place is going to say that the dialect that they learned is the best.

Environmental harms are really clear. The emissions associated with LLMs in particular are very heavy, particularly when training them, but also when doing inference with them. Many minority languages are spoken in areas that are already at risk of ecological collapse. The place where I used to walk on the beach in North Uist no longer exists. It was wiped out in, I think, 2005 during a storm. So that's happening out there. It's a real thing.
You have the encroachment of oceans due to climate change and other factors, but climate change is a big one. So it's kind of ironic that we'd be thinking about revitalising a language using the very technology that might be harming a lot of these areas. And then you have socioeconomic harms: the risk that language specialist jobs could be replaced with automation, for example in translation and content creation, but also in the creative industries in various ways.

Let's look at the situation for two contrasting forms of language technology, speech recognition and large language models, beginning with speech recognition. Some of the risks: linguistic distortion with ASR is actually relatively low, depending on the accuracy of your models; it's a function of your error rate. The cultural distortion is low, because you have a supervised signal that you're following; you're not just picking words out of the air. The risk of model collapse is again quite low, at least with Gaelic ASR. The representation bias is, I would say, moderate, because currently we can recognise some dialects better than others. The environmental harms with this type of technology are relatively low, because you're not generating with massive arrays of GPUs and things like that. And the socioeconomic harms are, again, quite low. There are very, very few people who can do reliable, really good Gaelic transcription these days; it's very, very difficult to hire people to do this. So I don't think that we're going to put anyone out of a job, and the technology still isn't as good as the best people would be.

The revitalisation potential with that technology: well, for increasing active users, it's probably quite low, to be honest. But for developing resources, I think it's quite high. In terms of structured support from policy and institutions, I'd say it's moderate: you can strengthen the language in business settings and economic life to an extent, but beyond that, I'm not sure. For diversifying usage domains, I think it's probably quite low, but I'd be willing to be surprised. And for raising status and visibility, it depends, but I think it's moderate. If we could incorporate Gaelic ASR across all the devices that we use, you know, Macs, PCs, phones, et cetera, it could make a real difference, I think, to a lot of Gaelic users. A big one, for example, would be GME (Gaelic-medium education) students who have learning difficulties. I had an email yesterday from one of the education boards asking if this was going to come online soon.

Now, LLMs. Well, what I'm going to say here assumes that Gaelic is already being included in LLMs, and it's based upon what I've seen of the performance of the LLMs that we have. The risks: for linguistic distortion, I'd say it's moderate to high. It's a function of the LLM's predictive power or, as we say in machine learning, its perplexity: how perplexed the model is by the type of text it should be putting out. The cultural distortion is moderate, although it might be slightly less than that; I think the models are getting better at this aspect. The potential for model collapse, without a way of identifying synthetic text, is actually moderate to high, I would say. The representation bias is relatively low, because you're dealing with orthographically standardised text anyway. The environmental harms from including Gaelic in these models are moderate; I'll say more about that in a moment.
The socioeconomic harms, I would say, are moderate, and they probably grow the better the models get for Gaelic. In terms of the revitalisation potential right now: for increasing active users, I'd say it's moderate. There are a lot of people using, say, ChatGPT to improve their Gaelic skills, for better or for worse, so it is increasing the user base to an extent. For developing resources, at the stage we're at right now, I'd say the potential is very low, because the models aren't great. The structured support from policy and institutions, again, I'd say is low. For diversifying usage domains, it's low, but if the models were to get much better, I think that could be moderate to high. And finally, for raising the status and visibility of the language, I'd say we're dealing with a moderate potential right now.

So what do we do about the LLM problem? How do we make them work better for the Gaelic community? Well, here's a pragmatic view of the situation. We already have a Gaelic-speaking monkey at a computer keyboard. It's out there. Our choice is to teach the monkey to be more fluent, or to ignore the monkey and hope he'll go away. Realistically, that monkey is going nowhere. Unless we remove all the Gaelic texts on the Internet and get rid of the Gaelic Wikipedia, that monkey is going to stay there. So perhaps we should work with Big Tech to make sure that the monkey can produce reasonable Gaelic.

Let's go back to the top positive comment from my very unscientific social media study. Moving away from LLMs to an extent, what would really make a difference to the Gaelic community? What would make a difference to the revitalisation effort? It's a little bit crass, but, for lack of a better label, I think it would be developing a virtual Gaelic speaker. That was the top positive comment from that social media study. If you could choose your dialect, your voice type, and the fluency level for this type of system, I think it could make a really big dent in teaching people Gaelic and helping people improve their current skills in the language. This would be a system that could politely correct you when you've made a mistake and encourage you along the way. I don't know how I learned Gaelic. I must have an incredibly brass neck. You know, being an American helps, definitely. But I mean, the amount of discouragement that you encounter as you're going through this journey is quite remarkable. There's a lot of encouragement too, but, you know, it's a mixed bag. It's guerrilla warfare. So if you could just pull out your phone and have a conversation, and have this voice that would lead you into speaking better Gaelic over time, I think it would be a bonus. I think on balance, despite the risks, these opportunities are significant, especially if the output is audio and not text. If it's a closed system producing audio, you're not going to get the kind of pollution that you get from LLM text.

So how can we do this? First of all, we can't do it on our own. No university has the kind of budget that would allow us to do this right now. We need assistance from the Gaelic community and also from Big Tech. This is just a very simple illustration of some of the main ingredients. We need a lot more data, and the most important type would come from transcribed audio.
And the core idea here is that we'd use our current ASR system to transcribe recordings from various sources, crowdsource the correction of the transcriptions, and pass the corrected documents on as training data and back to the community. We can incorporate reinforcement learning from human feedback to ensure that the system is aligned for the purposes of teaching Gaelic and holding conversations that support language skills. And while we're building such a system, we can achieve a massive increase in Gaelic language resources, disseminated back to the community, as I said, also through things like the Digital Archive of Scottish Gaelic at the University of Glasgow, Tobar an Dualchais, and so on. This kind of approach could in theory work for other endangered languages too, at least where you have an orthography and some resources.

So, just to wrap up here: if we're careful, I think we can make a massive difference to Gaelic speakers, especially to the learning community, the active learning community. Here are some of the ways that we can ensure that we do this well, staying cognisant of the risks: involving the community, obviously, in the design and evaluation; curating an accessible, high-quality training corpus; improving the documentation that we have of the language and combating misrepresentation; and disseminating all of that back to the Gaelic community. We need to signpost synthetic text and media. There are people working on this; there just aren't any good solutions yet, but it definitely needs to happen. As human beings, where we use synthetic text, we should say that it's synthetic. If we used closed systems for generative AI, there'd be a lot less information pollution. And finally, I think it's important for all of us involved in this work to educate the community about generative AI, especially its risks and limitations. Teaching the public about this is very, very important; it should be part of our curriculum.

So, some concluding remarks. The Gaelic language, like many other endangered languages, is at a crossroads. Unlike many other smaller languages, though, there are literally millions of people who want to learn Gaelic or improve their skills in it. But there are relatively few teachers available. We can't provide a patient, native-speaking teacher for each of these potential students. But perhaps we can provide the next best thing. With a careful approach and sufficient investment, we can put a range of virtual Gaelic speakers in the hands of everyone with a connected device. This is a moonshot with many, many risks. But on balance, I think language technology can play a major role in revitalising Gaelic and other endangered languages. I look forward to hearing your own thoughts about that. If you're interested in further information about these topics, here are some links that you can follow. I'll put these slides up, or we'll make them available through the University website, so don't worry about copying anything down. I'd like to just say a quick thank you to our funders and our many collaborators. Agus tapadh leibh. Thank you.

The future is uncertain for Gaelic and most of the world's minority languages. Could cutting-edge language technologies be the key to their survival? English speakers can now hold real-time spoken conversations with apps like OpenAI's ChatGPT. What breakthroughs are needed to get us to that point for Gaelic?
How might such a transformation affect language revitalisation efforts, for better and for worse? This lecture introduces modern language technology to a general audience, showcasing ongoing research involving Gaelic at the University of Edinburgh. It then addresses tensions in collaborations between big tech and minority language communities, such as navigating data ownership and cultural preservation. Finally, it looks ahead, considering how AI might help revitalise not just Gaelic, but other minority languages.

About the speaker

Will Lamb was born and raised in Baltimore, Maryland. He completed a degree in Psychology from the University of Maryland Baltimore County in 1993 and spent two years as an RA on a Johns Hopkins-led research project on sleep disorders and biometrics. In 1995, after taking an interest in Gaelic and traditional music, he went to Nova Scotia and spent an academic year at St Francis Xavier University.

Will began his postgraduate study at the University of Edinburgh in 1996, taking an MSc in Celtic Studies. His dissertation was on the development of the Gaelic news register and was supervised by Rob Ó Maolalaigh. He started a PhD in Linguistics the following year. In Jan 2000, nearing the end of his PhD, he moved to North Uist to take up a lecturing position at Lews Castle College Benbecula (University of the Highlands and Islands). He is credited with initiating the successful music programme at Lews Castle College. Will finished his PhD in 2002, and it was published in 2008 as 'Scottish Gaelic Speech and Writing: Register Variation in an Endangered Language'.

Will was promoted to Senior Lecturer in 2017 and to Personal Chair in Gaelic Ethnology and Linguistics in 2022. His research interests span music, linguistics, traditional narrative and language technology. He is known, in particular, for his work on formulaic language, traditional music, Gaelic grammatical description and Natural Language Processing (NLP). Most of his recent work has been in Gaelic NLP, and he recently finished an MSc in Speech and Language Processing (University of Edinburgh).

Dec 04 2024, 17.15 - 19.15

Professor Will Lamb Inaugural Lecture

Professor Lamb's Inaugural Lecture, 'Could Artificial Intelligence save Scottish Gaelic?' took place on 4th December.
Professor Will Lamb Inaugural Lecture Recording of Professor Will Lamb's Inaugural Lecture View media transcript Thank you very much, Alex. Before I begin, I'd like to mention a few people that have been really instrumental in the work that I'm going to be describing tonight. My Gallic technology partners in crime, doctor Mark Sinclair, doctor B Alex, who can't be with us tonight. She's ill at the moment, Professor Peter Bell, doctor Andre Click who works with Peter and Professor Robo Malali. Without Mark's help in particular, his help and expertise, I'm not sure we would have ever gotten started um we met ten years ago when I was auditing a Python course here at the university and he was one of the best teachers I've ever had. I went up to after one of his classes and asked him if he fancied working on Gallic language technology together, and he said, Sure. I couldn't believe it. I'm still amazed. He said, Yes. B, Alex and I have been collaborating for the past six years or so on a host of projects and she's such a brilliant, multifaceted and friendly colleague, it's just always a joy to work with her. Peter and Andre are researchers on our speech recognition grant. They were also my MSE supervisors, and they handled that inversion of roles beautifully, and it was a huge privilege to learn from them that way. Also on that grant is my first MSC supervisor, Professor Robio Malali. He gave me my first real taste of Gallic linguistics a long, long time ago and he's been an inspiration over my career since then. So thanks to all of you. And finally, our head of department, doctor Neil Martin, who's here somewhere. By all rights, I should have taken over from Neil as head of department about four years ago. Frankly, he's better at it than I would ever be. But I'm grateful to him for continuing when he could have insisted that it was my turn. He's always made me feel that my work is viable and he's facilitated wherever he could. So thank you very much, Neil. It seems like AI is everywhere. It's in our cars, it's in our phones, it's in our medical devices, and in our entertainment systems. It's now even in some of our rubbish bins. If you ask enough people, some of them will say they wish it would stay there. But for others, we're living in exciting times. Speakers of English and a handful of other languages can now hold nearly seamless conversations with AI based conversational agents. Unfortunately, this isn't true for the rest of the 7,000 spoken languages in the world. But what if it were? To give you a taste of what this might look like, here's a video demonstrating open AI's advanced voice mode for one language, Portuguese. Hey, I'm Christine, and I'm a native English speaker, but I've been trying to learn Portuguese for fun. And hi, I'm Nacho. I speak Spanish natively, English, and I understand most of Portuguese, but I can't really speak it. So can you help us have a conversation, Portuguese? Clara is still quia jaula. Could you start us off with a conversation? Maybe ask us a few questions in Portuguese so we can practise? Claro. For a Christine, Christine, aaborcaPiano. Christine. Tocar Piano mazy Dachon say Nacho. Ego sugar, how do you say chess? Could you hear that okay? Okay. Now, I'm not a Portuguese speaker. One of my colleagues Rob Dumba is amazingly, but I'm guessing that the voice synthesis there isn't perfect. In time, I imagine it probably will be. About half of human languages are predicted to disappear in the next century. 
Wouldn't providing robust conversation technology for them be the best way of saving them? After all these tools, model, linguistic production and exception, better than any dictionary or grammar can. Could AI save endangered languages? In particular, could AI save Scotts Gaelic? Unfortunately, I think the answer is unlikely. I'm hedging because who knows what AI in the future might resemble. But I'm sure it won't save any languages on its own. A thought experiment or two can make this really clear. Imagine a cavernous room of identical desks set out in rows. Upon each desk is a laptop that can converse fluently in one of every human language that's ever been spoken. Would this save the world's languages? Now, as long as humans exist to visit such a place, it might have some limited value, but what relevance would a random stream of sound from 50,000 years ago hold for you? With no other information on what basis would you prefer one stream of sound over another? As Fishman says in reversing language shift, languages are inseparable from their cultures. There's little, if any, culture in this room and I'd argue that it would be useful only really to a small number of hardcore linguists. It would be a digital mausoleum in some sense. Let's put humans back in the picture and see if anything changes. Let's replace each laptop with one human speaker for every language. This time we'll allow a basic label for each language written on a piece of paper before them. For the construct that we call English, what if our representative is a 22-year-old middle class black female from Baltimore? Instead, what if it is a 78-year-old upper class male from London? On what basis is one more representative than another? Which would you choose and why? I think it's worth thinking about that. It's actually very hard to pin down what we mean by a language. Linguistic form varies with age, ethnicity, location, time period, social position, situational context, and more. No representation can exist without loss, whether it's computer based or the forms produced by a single individual. If we limit the representation of a language to a single point, we lose nearly all that variation that makes it real in the first place. Arguably to save 21st century English for posterity, we'd need the diversity that exists in that entire room. Although other languages may be more local, less ethnically diverse or whatever, they too are little without their living communities. We're saying that AI can't save Gallic or any other language of its own, maybe we're asking the wrong question. Could AI help revitalise Gallic? Well, I think that that's more likely. The word revitalised means to imbue something with life again and life can only exist in something that's living, for example, a speech community. In the remainder of this lecture, I'll make a start on examining how AI, so called, might help in the revitalization effort for Gallic and other endangered languages by extension. I also outlined some ways to assess the risks and benefits of language technologies for endangered languages. Here are the questions that will guide this. What's the status of Gallic today? How do at least some Gallic users view AI? What is AI anyway? What can we do with Gallic technology currently? How can we assess the impacts of AI on threatened languages and a quick health warning? These are huge areas to discuss in 45 minutes. This is going to be fully satisfying, but I hope at some point, I'll be able to write this up. 
It'll be a little bit more satisfying then. So what's the status of Gallic today? Well, perhaps Gaelic is doing fine without AI. Let's look at the recent census as flawed as it is. In the 2022 census, the number of people with some Gallic skills in Scotland increased by 43,100 people. This might suggest the Galaxs on firm footing. That's a massive increase. A major problem with the census, though, is that one can't establish respondents' fluency levels, how often they use the language, or indeed where. In contrast to that apparent growth, a number of people who can speak Gallic in the so called Heartland, the Outer Hebrids has dropped considerably. It's now 45% of the population of the Outer Hebrides. Whereas in 2011, it was 52% and in 2001, it was 60%. So the trend is for more people to report Gallic skills while speakers in hereditary areas decline. It's a metaphor for the ages, isn't it, in a way? Without some intervention, this decrease in the heartland is unlikely to change. To improve the situation for a language like Gallic, it's helpful to keep certain goals in mind. And at the top of the list, of course, everybody would like to increase the active users of the language. We can do that by looking at transmission in the home, also thinking about new speakers, adult learners, pupils in immersive schools, and so on. Developing resources is hugely important. So with Gallic, we already have a standard orthography. Great. We can take that box. A lot of languages don't even have that. We've got dictionaries, we've got grammars, we've got corpora. There's still a lot to do. This says it's a comprehensive grammar. It really? How can it be? It's a start. But anyway, there's a lot more to be done even with that. In terms of structured support, getting structured support from policy and institution institutions, we think about trying to embed the language in formal education more, strengthen the language in business settings and economic life, developing grassroots support via community groups. Diversifying usage domains, of course, right now, these domains have attenuated so much. Even things like crafting are now done, I mean, based upon my experience, they're now done through the medium in English, much more than they were 20 years ago. First went to US in, you know, 1997, I think, if you went out onto the Murr to, you know, I don't know, do the sheep dipping or something like that, it was predominantly through the Museum of Gay. I can guarantee that's not the case today. So think about widening domains of usage, raising the status and visibility of the language, through signage and media presence, et cetera. We can't have a living language without a thriving speech community. Could AI be the DS machina that allows us to progress this, the unexpected solution that saves the day? Well, let's start by looking at how Gallic users view it at the moment, at least some Gallic users. I did a very unscientific, very brief survey of people's ideas about how AI could help them learn or use Gallic better. I pose this question to x.com as well as several Gallic interest groups that I belong to on Redit and Facebook. I have to say the results surprised me. I should state, however, that I think the sample population here is not a great representation of the views of heritage speakers of Gallic. In general, my impression is that they are much more open to the idea of using AI to benefit the language. I took all of the comments and likes associated with them and assembled them in a spreadsheet. 
Then I manually judged each comment as having positive, negative or neutral sentiment, as can be seen in this chart, the likes of negative comments outnumber those for positive comments five to three. Over half of the total comments were negative. It's difficult to know how knowledgeable the people who are responding were about AI or language technology in general and how it works. Certainly, there's a lot of fear about its impact on the environment and employment and the notion that's being imposed upon people. The top comments, the top five ones were AI is harmful to jobs and the environment. Keep AI away from heritage languages, get rid of AI. AI is being forced on us and GalaxUlingo doesn't use AI, which seems a little bit random, but actually a lot of the people responding were on the forum for GalaxUlingo. Now, I thought that last comment was interesting. Gala eolingos been used by over 2 million people, and that's really impressive. I mean, it's ores of magnitude above the number of Galax speakers that we have today. When somebody suggests that Galax J Lingo did not use AI, I put up the following clip from 2020, following news article 2 years before Chat GPT came on the scene. Je Lingo's own CEO said at that time that AI was embedded in every aspect of the app. What was the response? Radio silence. This suggested to me a certain amount of cognitive dissonance, but also that many people today equate AI very strongly with large language models. Now, in any case, I think Big Tech isn't really winning hearts and minds here at the moment, at least with the Gallic learner community. So let's turn to the positive comments. There were a number. The top one was that wouldn't it be great if AI could provide interactive conversation in the language. Would it be great if it could help us locate phrases and other information better, help teach us pronunciation, help build corpora and language resources, and AISR or speech recognition is actually really useful, people were saying. These suggestions align with my own intuitions about what would benefit the Gallic community. The biggest bottleneck towards fluency for Gallic learners, and I know this very well, as well as a lot of people in this room, it's finding opportunities to speak the language with a native speaker or even just a really good fluent speaker. Simply put, that situation is not going to improve. Additionally, gaining entry to that experience is very fraught. You have to pretend that you understand everything when you really don't. It's like gaining credit when you've got none, at least back in the old days. The great promise of technology is providing a simulation of naturalistic conversation. But getting there is a challenge even with large languages. When you see this technology working, Today, we're so used to it. We're inundated with it that we don't think about what actually went on behind the hood to get there. It is tough. It is backbreaking. It's intellectually difficult, and a lot of it is actually just annotation getting dapted together. But that in of itself, we're talking about millions of work hours devoted to just one aspect of something a lot of times. So it requires collaboration between language communities as well as large tech. If we were to get somewhere advanced with Gallic, we almost certainly need to involve large tech because of the cost of developing these models. You just can't do it within a university most of the time. 
As will be clear in a moment, we can, however, locate phrases and information embedded in audio files, for example, and use technology to build corpora and language resources. That's possible in large part because of speech recognition, and a lot of what we're doing right now is exactly that. But before we get to some demonstrations, let's consider what AI is and how it works. So, what is AI? Well, in vernacular usage, as I said, the connotations associated with artificial intelligence have changed a lot recently. I remember the day that ChatGPT was launched, because I was doing the MSc here at the University, and it blew everyone's mind. I could talk about that ad infinitum. But these days, the term AI has become synonymous with generating text from large language models like OpenAI's ChatGPT and Google's Gemini. When the term was first coined in 1955, AI meant making machines that could use language, form abstractions and concepts, solve the kinds of problems now reserved for humans, and improve themselves; that is, the models improving themselves. So that definition suggests that we should be able to generate and understand natural language (that's what we can do with chatbots), induce hypotheses from empirical data (which is a little bit broader), produce solutions to problems, and learn from past errors. All of this sounds a lot like the promise of AI today. It was quite prophetic when you think about it. What was far from prophetic, though, was how long researchers expected that to take. There have been a lot of AI winters in the interim. In July 1958, the New York Times published an article about the first type of neural network, called a perceptron, and the perceptron was expected to form the basis of a thinking computer that could walk, talk, see, write, reproduce itself, and be conscious of its own existence. And they thought that would take one year. Needless to say, this type of strong AI still does not exist, but the performance of large language models is very impressive across many tasks today. Behind that impressive performance, though, it's remarkable how simple LLMs, large language models, actually are in some ways. They work by predicting the most likely token, a word or part of a word, given the tokens that you already have. When you put a prompt into ChatGPT, it breaks it down into little bits; all those tokens form your initial context for querying the model, and the model uses that context to predict the next token. So if you take the phrase "President of the United" and put that into ChatGPT, it'll tell you that the next word is most likely going to be "States". It does that implicitly as it generates. It's almost certainly going to give you the top response, or one of the top responses, although there's a certain amount of randomness in there. And this kind of repetitive generation has a name: it's called autoregression.
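To make that loop concrete, here is a toy illustration of autoregression in Python. Real models predict over tens of thousands of token types using billions of parameters; the tiny probability table below is invented purely to mirror the "President of the United" example, so treat it as a sketch of the idea rather than how any production system is written.

```python
# Toy autoregression: pick the most likely next token given the recent context,
# append it, and repeat. The probability table is invented for illustration.
toy_next_token_probs = {
    ("of", "the", "United"): {"States": 0.92, "Kingdom": 0.07, "Nations": 0.01},
    ("the", "United", "States"): {"of": 0.6, ".": 0.4},
    ("United", "States", "of"): {"America": 0.95, "Europe": 0.05},
}

def generate(tokens, steps=3):
    for _ in range(steps):
        context = tuple(tokens[-3:])                  # last three tokens as context
        dist = toy_next_token_probs.get(context)
        if dist is None:                              # no prediction available
            break
        next_token = max(dist, key=dist.get)          # greedy: take the most likely token
        tokens.append(next_token)
    return tokens

print(generate(["President", "of", "the", "United"]))
# ['President', 'of', 'the', 'United', 'States', 'of', 'America']
```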
Now, the basis of nearly all advanced language technology today is the neural network. Here's a really simple representation of one. You can think of each one of these nodes, pardon me, the circles, as representing a step through the network. Our programme director on the MSc used to compare it to a meat grinder: you turn the handle and the data goes through. The knowledge, if you like, is stored in the lines that connect these nodes, and these are known as weights or parameters. A neural network is trained by tweaking these parameters countless times in response to getting things wrong, to incorrect predictions. It's a form of conditional learning. So when you make a prediction using a neural network, you're basically taking some group of numbers and sticking it through the network, and those numbers get transformed by the weights and certain other operations as they go. When they reach the far side of the network, they're often turned into a set of probabilities across all the possible outputs, a probability distribution. In an LLM, one of those final nodes will represent the next most likely word. Now, large language models don't actually process text under the hood. It looks like they do, but they don't really. Each token is assigned a vector of numbers; think of it as an address or a telephone number. That vector represents the token's meaning, its grammatical category (is it a noun or a verb or whatever), and other aspects. These vectors are called word embeddings. The word "bank" after "president of the" has a very different embedding than "bank" would have if it followed the word "river", for example. This is a consequence of a very famous machine learning technique called attention. I'm being very hand-wavy and glossing over a lot of details here, but hopefully some of this makes sense. To make a prediction, you send all these word embeddings into a neural network and, due to the way it was trained, it spits out a prediction of the next token. You can see there isn't really a lot that's fundamentally mysterious about how these things work. They're prediction machines. That's all. They're not conscious entities, despite what you might have read, and they're not likely to take over the world anytime soon. What's complicated about them is the intricacy of those weights. You're talking about billions upon billions of them folding into one another in high-dimensional spaces, and these weights can, in a sense, compress things like the entire Internet. To read the full text that was used to develop the first iteration of ChatGPT, so GPT-3, it would take a single individual 26,000 years, reading 24 hours a day, seven days a week. That's a lot of compressed information.
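Here is an equally hand-wavy numerical sketch of that forward pass: an input vector standing in for a word embedding is pushed through two layers of weights and squashed into a probability distribution over a tiny, made-up vocabulary. The numbers are random and the vocabulary is invented; only the shape of the computation matters.

```python
# A sketch of one forward pass: embedding -> hidden layer -> scores -> softmax.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["states", "kingdom", "nations", "bank", "river"]

embedding = rng.normal(size=8)            # pretend embedding for the current context
W1 = rng.normal(size=(8, 16))             # weights of a hidden layer
W2 = rng.normal(size=(16, len(vocab)))    # weights mapping to vocabulary scores

hidden = np.maximum(0, embedding @ W1)    # linear transform plus ReLU non-linearity
logits = hidden @ W2                      # one raw score per vocabulary item

probs = np.exp(logits - logits.max())     # softmax: scores -> probabilities
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{word:8s} {p:.3f}")
```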
But AI is a lot more than just large language models. Because of how vague the term AI is, and its connotations with chatbots, terminators, et cetera, I think it's helpful to use a different term. We could use "speech and language technology" as a more neutral one. Chatbots are a form of that, but so are speech recognition, handwriting recognition, speech synthesis, orthographic normalisation systems, and much more. Let's look at a few of these now, in terms of what you can do with Gaelic language technology. Much of the potential training corpus that we have for Gaelic is actually quite old. A lot of that text is online thanks to Rob Ó Maolalaigh and his team at DASG, the Digital Archive of Scottish Gaelic, at the University of Glasgow. A lot of this text goes back to the 19th century or before, and it's not immediately usable for some of the things that we want to do. So you're talking about millions and millions of words in older forms of orthography. One of the things that we've tried to do is develop a way, using neural networks, to convert it into modern orthography. We developed this tool, which also corrects things like OCR mistakes; it's just a proof of concept, but it's available online for people to try. So here you can see that we've taken a really messy text and made some guesses about how it would look in modern orthography. Now, it's quite slow; that's the only thing. If we're going to do this at scale, we need to find a way to speed it up substantially, probably using simpler architectures and also GPUs. The way that we got started with all this, though, was actually on a simpler problem, and that's recognising handwriting. The first thing that we did with Mark was a project on handwriting recognition, with a view to doing things like speech recognition later. The School of Scottish Studies Archives has a huge supply of meticulously transcribed audio: transcriptions of folklore narratives and interviews from the 1950s, 60s, and so forth. Using Transkribus, a tool that many of you in the digital humanities will know, we built up a model that eventually achieved 95% accuracy at the word level. So we could run through tonnes and tonnes of handwritten transcriptions after digitising them and get the words back out. Hugely useful. We're now disseminating these texts back to the public, and this week, in fact, we're finishing a large research project that will make thousands of these pages of transcribed folklore available online for the first time. Here's a first glimpse of what that website is going to look like. You type in the kind of folk tale that you're interested in (there's a classification system called Aarne-Thompson-Uther), so you can type in a tale-type number and get back that particular folk tale. You can see it on the map, you can get all the versions in PDF, you can get the text extracted from them, and that kind of thing. It's going to be a lot of fun, and Julianne is one of the people who's really helped push this forward. Now, I talked about speech recognition. Here's a demo using a recent news broadcast. I should say the subtitles that you'll see here are the raw output from the system; nothing has been corrected. [A clip of a Gaelic news broadcast plays, with the system's uncorrected subtitles on screen.] Okay, so how accurate is it? Well, this graph shows our word-level accuracy, the flip side of word error rate, on our test set. A year ago, we were getting 77.4% of words correct. Now we're at 86.9%. That's a jump of about 10 percentage points. It might seem small, but it's massive for one year, and it's very much thanks to Peter Bell and Ondřej Klejch. There's also been a huge amount of data collection around this, involving Rob and a number of other people.
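For anyone unfamiliar with how those figures are computed, word error rate is just the number of word substitutions, insertions and deletions needed to turn the system's output into the human reference, divided by the length of the reference. Here is a minimal calculator; the example sentence pair is invented and is not from our test set.

```python
# Minimal word error rate (WER) calculator using word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Invented example: the hypothesis drops one word from the reference.
error = wer("tha an latha brèagha an-diugh", "tha latha brèagha an-diugh")
print(f"WER: {error:.2f}  word accuracy: {1 - error:.2f}")
```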
This means that we can now transcribe a huge set of audio and video fairly reliably. For recordings on Tobar an Dualchais or in the BBC archives, for example, it's possible to search the audio for words and phrases and get the exact points where they occur, so we can go straight to them. It's really fantastic, and it's incredibly helpful for resource development and language teaching. We only have about 150 million words of Gaelic text right now, though, and several hundred hours of aligned audio, and that's not very much. There are diminishing returns with the data that you stick into a system like this: the closer you get to low error rates, the harder it gets to bring them down further. You've got to multiply the amount of training data that you put in. To improve Gaelic speech recognition, we need a lot more text, especially transcribed text, particularly in under-represented domains like traditional narrative. For my MSc thesis, we experimented with synthesising that type of data instead, using GPT-4, the language model underlying ChatGPT. We took a series of human-produced summaries in English from Tobar an Dualchais and fed them through a fine-tuned GPT-4o model to produce story texts.
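Here is a rough sketch of that general pattern: feed an English summary to a fine-tuned model and ask for a Gaelic story. This is not our project code; the model identifier, prompt, and summary below are placeholders, and you would need your own fine-tuned model and API key for it to run.

```python
# Sketch of synthetic-text generation via a fine-tuned chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

summary = "A fisherman from Uist outwits a water-horse with the help of his dog."  # invented

response = client.chat.completions.create(
    model="ft:gpt-4o:example-org:gaelic-stories:placeholder",  # hypothetical fine-tune ID
    messages=[
        {"role": "system",
         "content": "You write traditional-style Scottish Gaelic folk narratives."},
        {"role": "user",
         "content": f"Write a short Gaelic story based on this summary:\n{summary}"},
    ],
    temperature=0.8,
)

print(response.choices[0].message.content)
```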
Then, to make this come to life a little bit for us tonight, I asked Dan Wells, a PhD student in Informatics, to pass one of these stories through his text-to-speech system. Dan trained the synthetic Gaelic voice that you'll hear in a moment on Ruairidh MacIlleathain's excellent Letter to Learners broadcasts, all of which are available on the LearnGaelic website. The rendering of Ruairidh's voice is very good, but there are a few pronunciation errors here and there, and I hasten to add that that's not his fault in any way; it's due to limitations in the model and the training data. The machine translation was carried out using GPT-4o's baseline model. I'll play this for you now. [The synthesised Gaelic story plays.] One of the remarkable things that emerged from this experiment is that the model came up with a few neologisms, words that never existed before in Gaelic, and some of them were ridiculous. But one in particular stood out as being kind of interesting, and that was a word for spit or vomit. That kind of hallucination is really interesting. It's a little bit like when an AI model beat everybody at Go, a very complicated game indeed, and people started talking about some of its moves as being almost a kind of genius. I mean, to come up with a new word for a language, one that never existed before, is actually quite difficult, I think. I'm not sure I could do it. So it's fascinating, but it does illustrate one of the potential harms as well, and that's information hazards. Now that we've looked at what we've achieved in Gaelic technology so far, let's look at the big question, which is how we assess the impact of language technology on endangered languages and what its potential is for language revitalisation. To my knowledge, no one's looked at this topic very closely or meticulously; what I'm trying to do here is just an initial examination of the area. One thing is clear, though: the calculus is different for every type of technology. The risks and potential benefits for the minority language community differ. So let's begin by identifying the stakeholders that are most affected by these innovations, and then we'll look at some of the key risks and consider two short case studies. Some of the key stakeholders here are, of course, the users of the language; they're the top ones: adult learners, L2 speakers, heritage speakers, L1 speakers, immersion pupils, and parents as well. Then there are businesses, educators at all levels, the government, researchers, tech companies, and the third sector, community groups. There are others, but I think these are some of the main ones. In terms of the risks, these intersect with risks that have already been identified in the literature, for example in Google DeepMind's paper on the taxonomy of risks posed by language models. Starting with information hazards: these come from a proliferation of synthetic text in particular, and you've got two different kinds. First, linguistic distortion: sparse training data leads to poor synthetic text output, which distorts linguistic norms, producing non-native word orders or idioms, or hallucinations like the one we saw before. Then, of course, there's distortion of culture. LLMs are getting better at factual accuracy. I tried this: I asked ChatGPT yesterday, in Gaelic, whether it could tell me more about all the people in the Western Isles who still believe in fairies. It came back and said that, actually, there's no evidence that this belief is widespread in the Western Isles of Scotland. I was a little bit encouraged by that. It talked about a lot of very interesting information that you can find in the archives, for example, about fairy belief. So it wasn't bad, but there are still some real risks here, particularly when it comes to minority cultures. Model collapse is a consequence of information hazards. While synthetic text augmentation could be useful in the early stages of modelling, as we talked about before, if you over-rely upon it, if you train on synthetic texts again and again, your models become very, very bad. They over-generalise, and they become more homogeneous and predictable. Currently, there's no universally accepted way to signpost synthetic text or media, so this is a real problem. Synthetic text shows up online tagged as Gaelic; if it carries the language code for Gaelic and it's synthetic, there's no way for you as a user to immediately figure that out. We need to do something about that.
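To see why repeated training on synthetic output is so corrosive, here is a toy simulation. The "model" is nothing more than a probability distribution over word types; each generation we sample a finite synthetic corpus from the previous model and re-estimate the distribution from that corpus alone. Rare words drop out and can never come back. Real model collapse in LLMs is more complicated, but the underlying mechanism, training on your own output, is the same.

```python
# Toy model collapse: diversity shrinks when each generation is trained only
# on a finite sample from the previous generation's model.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, corpus_size = 500, 1000
probs = rng.dirichlet(np.full(vocab_size, 0.3))   # generation 0: a skewed "real" distribution

for generation in range(1, 11):
    counts = rng.multinomial(corpus_size, probs)  # sample a synthetic corpus
    probs = counts / counts.sum()                 # "retrain" on the synthetic corpus only
    print(f"generation {generation:2d}: {np.count_nonzero(probs)} word types survive")
```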
Representation bias comes from the fact that linguistic production is never context-free. If we create a synthetic voice like the one you heard before, it will index a particular dialect, gender, age, and so forth. We wouldn't want our models to over-represent one type of voice, implicitly suggesting that it's the best one; that, say, the North Uist dialect is the best. I mean, of course it is. But we wouldn't want to say that, because anybody from any other place is going to say that the dialect they learned is the best. Environmental harms are really clear. The emissions associated with LLMs in particular are very heavy, particularly when training them, but also when doing inference with them. Many minority languages are spoken in areas that are already at risk of ecological collapse. The place where I used to walk on the beach in North Uist no longer exists; it was wiped out, I think in 2005, during a storm. So that's happening out there. It's a real thing. You have the encroachment of the oceans due to climate change and other factors, but climate change is a big one. So it's kind of ironic that we'd be thinking about revitalising a language using the very technology that might be harming a lot of these areas. And then you have socioeconomic harms: the risk that language specialist jobs could be replaced with automation, for example in translation and content creation, but also in the creative industries in various ways. Let's look at the situation for two contrasting forms of language technology, speech recognition and large language models, beginning with speech recognition. Some of the risks: linguistic distortion with ASR is actually relatively low, depending on the accuracy of your models; it's a function of your error rate. Cultural distortion is low, because you have a supervised signal that you're following; you're not just picking words out of the air. The risk of model collapse is again quite low, at least with Gaelic ASR. The representation bias is, I would say, moderate, because currently we can recognise some dialects better than others. The environmental harms with this type of technology are relatively low, because you're not training or generating with massive arrays of GPUs and things like that. And the socioeconomic harms are, again, quite low. There are very, very few people who can do reliable, really good Gaelic transcription these days; it's very difficult to hire people to do this. So I don't think we're going to put anyone out of a job, and the technology still isn't as good as the best people are. Revitalisation potential with this technology: for increasing active users, it's probably quite low, to be honest, but for developing resources, I think it's quite high. In terms of structured support from policy and institutions, I'd say it's moderate: you can strengthen the language in business settings and economic life to an extent, but beyond that, I'm not sure. For diversifying usage domains, I think it's probably quite low, but I'd be willing to be surprised. And for raising status and visibility, it depends, but I think it's moderate. If we could incorporate Gaelic ASR across all the devices that we use, like Macs, PCs, phones, et cetera, it could make a real difference to a lot of Gaelic users. A big example would be Gaelic-medium education (GME) pupils who have learning difficulties; I had an email yesterday from one of the education boards asking if this was going to come online soon. Now, LLMs. What I'm going to say here assumes that Gaelic is already being included in LLMs, and it's based upon what I've seen of the performance of the LLMs that we have. The risks: for linguistic distortion, I'd say moderate to high. It's a function of the LLM's predictive power or, as we say in machine learning, its perplexity: how perplexed the model is by the type of text it should be putting out. The cultural distortion is moderate, although it might be slightly less than that; I think the models are getting better at this aspect. The potential for model collapse, without a way of identifying synthetic text, is actually moderate to high, I would say. The representation bias is relatively low, because you're dealing with orthographically standardised text anyway. The environmental harms from including Gaelic in these models are moderate; I'll talk more about that in a moment.
The socioeconomic harms, I would say, are moderate, and they probably grow the better the models get for Gaelic. In terms of the revitalisation potential right now: for increasing active users, I'd say it's moderate. There are a lot of people using, say, ChatGPT to improve their Gaelic skills, for better or for worse, so it is increasing the user base to an extent. For developing resources, at the stage we're at right now, I'd say the potential is very low, because the models aren't great. Structured support from policy and institutions, again, I'd say is low. For diversifying usage domains, it's low, but if the models were to get much better, I think that could become moderate to high. And finally, for raising status and visibility, I'd say we're dealing with a moderate potential right now. So what do we do about the LLM problem? How do we make LLMs work better for the Gaelic community? Well, here's a pragmatic view of the situation. We already have a Gaelic-speaking monkey at a computer keyboard. It's out there. Our choice is to teach the monkey to be more fluent, or to ignore the monkey and hope he'll go away. Realistically, that monkey is going nowhere. Unless we remove all the Gaelic text on the Internet and get rid of the Gaelic Wikipedia, that monkey is going to stay there. So perhaps we should work with Big Tech to make sure that the monkey can produce reasonable Gaelic. Let's go back to the top positive comment from my very unscientific social media study. Moving away from LLMs to an extent, what would really make a difference to the Gaelic community? What would make a difference to the revitalisation effort? It's a little bit crass, but for lack of a better label, I think it would be developing a virtual Gaelic speaker. That was the top positive comment from that social media study. If you could choose your dialect, your voice type, and the fluency level for this kind of system, I think it could make a really big dent in teaching people Gaelic and helping people improve their current skills in the language. This would be a system that could politely correct you when you've made a mistake and encourage you along the way. I don't know how I learned Gaelic. I must have an incredibly brass neck; being an American helps, definitely. But the amount of discouragement that you encounter as you go through this journey is quite remarkable. There's a lot of encouragement too, but it's a mixed bag. It's guerrilla warfare. So if you could just pull out your phone and have a conversation with a voice that would lead you into speaking better Gaelic over time, I think it would be a bonus. On balance, despite the risks, I think these opportunities are significant, especially if the output is audio and not text. If it's a closed system producing audio, you're not going to get the kind of pollution that you get from LLM text. So how can we do this? First of all, we can't do it on our own. No university has the kind of budget that would allow us to do this right now. We need assistance from the Gaelic community and also from Big Tech. This is just a very simple illustration of some of the main ingredients. We need a lot more data, and the most important type would come from transcribed audio. The core idea is that we'd use our current ASR system to transcribe recordings from various sources, crowdsource the correction of those transcriptions, and pass the corrected documents on as training data and back to the community. We can incorporate reinforcement learning from human feedback to ensure that the system is aligned for the purposes of teaching Gaelic and holding conversations that support language skills. And while we're building such a system, we can achieve a massive increase in Gaelic language resources, disseminated back to the community, as I said, and also through things like the Digital Archive of Scottish Gaelic at the University of Glasgow, Tobar an Dualchais and so on. This kind of approach could in theory work for other endangered languages too, at least where you have an orthography and some resources.
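Here is a very schematic sketch of that loop: machine transcription, crowd-sourced correction, then the corrected text flows back into training and out to the community. Every function below is a hypothetical placeholder rather than an existing system or API; it is only meant to show the shape of the pipeline.

```python
# A schematic sketch of the proposed loop. The functions are placeholders;
# real implementations would call the Gaelic ASR system and a crowd platform.
def transcribe(audio_path: str) -> str:
    # Placeholder: in reality this would run the ASR model over the recording.
    return f"<draft transcript of {audio_path}>"

def request_crowd_correction(draft: str, audio_path: str) -> str:
    # Placeholder: in reality volunteers would correct the draft against the audio.
    return draft.replace("draft", "corrected")

def build_training_pairs(recordings: list[str]) -> list[tuple[str, str]]:
    corpus = []
    for audio in recordings:
        draft = transcribe(audio)                            # 1. machine draft
        corrected = request_crowd_correction(draft, audio)   # 2. human correction
        corpus.append((audio, corrected))                    # 3. aligned training pair
    # 4. The corrected texts feed back into ASR (and, eventually, conversational
    #    model) training, and go out to the community as searchable resources.
    return corpus

# Hypothetical file names, for illustration only.
pairs = build_training_pairs(["archive_tape_001.wav", "archive_tape_002.wav"])
print(pairs[0])
```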
So, just to wrap up here: if we're careful, I think we can make a massive difference to Gaelic speakers, especially to the active learning community. Here are some of the ways we can ensure that we do this well, staying cognizant of the risks: involving the community, obviously, in the design and evaluation; curating an accessible, high-quality training corpus; improving the documentation that we have of the language; combating misrepresentation; and disseminating all of that back to the Gaelic community. We need to signpost synthetic text and media. There are people working on this; there just aren't any good solutions yet, but it definitely needs to happen. As human beings, where we use synthetic text, we should say that it's synthetic. If we used closed systems for generative AI, there would be a lot less information pollution. And finally, I think it's important for all of us involved in this work to educate the community about generative AI, especially its risks and limitations. Teaching the public about this is very, very important; it should be part of our curriculum. So, just some concluding remarks. The Gaelic language, like many other endangered languages, is at a crossroads. Unlike many other smaller languages, though, there are literally millions of people who want to learn Gaelic or improve their skills in it. But there are relatively few teachers available. We can't provide a patient, native-speaking teacher for each of these potential students, but perhaps we can provide the next best thing. With a careful approach and sufficient investment, we can put a range of virtual Gaelic speakers into the hands of everyone with a connected device. This is a moonshot with many, many risks. But on balance, I think language technology can play a major role in revitalising Gaelic and other endangered languages. I look forward to hearing your own thoughts about that. If you're interested in further information about these topics, here are some links that you can follow. I'll put these slides up, or we'll make them available through the University website, so don't worry about copying anything down. I'd like to just say a quick thank you to our funders and our many collaborators. Agus tapadh leibh (and thank you).
The future is uncertain for Gaelic and most of the world's minority languages. Could cutting-edge language technologies be the key to their survival? English speakers can now hold real-time spoken conversations with apps like OpenAI's ChatGPT. What breakthroughs are needed to get us to that point for Gaelic? How might such a transformation affect language revitalisation efforts, for better and for worse? This lecture introduces modern language technology to a general audience, showcasing ongoing research involving Gaelic at the University of Edinburgh. It then addresses tensions in collaborations between big tech and minority language communities, such as navigating data ownership and cultural preservation. Finally, it looks ahead, considering how AI might help revitalise not just Gaelic, but other minority languages.

About the speaker: Will Lamb was born and raised in Baltimore, Maryland. He completed a degree in Psychology from the University of Maryland Baltimore County in 1993 and spent two years as an RA on a Johns Hopkins-led research project on sleep disorders and biometrics. In 1995, after taking an interest in Gaelic and traditional music, he went to Nova Scotia and spent an academic year at St Francis Xavier University. Will began his postgraduate study at the University of Edinburgh in 1996, taking an MSc in Celtic Studies. His dissertation was on the development of the Gaelic news register and was supervised by Rob Ó Maolalaigh. He started a PhD in Linguistics the following year. In January 2000, nearing the end of his PhD, he moved to North Uist to take up a lecturing position at Lews Castle College Benbecula (University of the Highlands and Islands). He is credited with initiating the successful music programme at Lews Castle College. Will finished his PhD in 2002, and it was published in 2008 as 'Scottish Gaelic Speech and Writing: Register Variation in an Endangered Language'. Will was promoted to Senior Lecturer in 2017 and to a Personal Chair in Gaelic Ethnology and Linguistics in 2022. His research interests span music, linguistics, traditional narrative and language technology. He is known, in particular, for his work on formulaic language, traditional music, Gaelic grammatical description and Natural Language Processing (NLP). Most of his recent work has been in Gaelic NLP, and he recently finished an MSc in Speech and Language Processing (University of Edinburgh).

Dec 04 2024, 17.15 - 19.15. Professor Will Lamb Inaugural Lecture. Professor Lamb's Inaugural Lecture, 'Could Artificial Intelligence save Scottish Gaelic?', took place on 4th December. Watch the recording here...