Contributed by: Richard Littauer
Open source software is amazing, and it is changing the world. In case you don't know, "open source" means code that anyone can look at and use. While there may be a license attached, anyone can borrow the code and use it in their own program, for free. Your computer is even running some open source code right now. This made your computer cheaper, because software that had already been developed didn't have to be reinvented. I'm a computational linguist; while studying, I realized that a lot of the tools I work with aren't free or open source, so we keep having to develop them ourselves. This costs money and time. So, I wanted to see what was available to help people designing computational tools for endangered and under-resourced languages. I decided to start with a list.
My list of open source resources for endangered languages on GitHub started incredibly simply: I just wanted a list of useful, free resources that people could use to work with endangered languages.
At the time, a new trend was emerging on GitHub: using README files to make collaborative documents. This was novel. In the programming world, README files traditionally explain the other files in a folder, or how to run the program. On GitHub, the largest site for sharing code in the world, they are displayed on each repository's page. This presented an opportunity: you could use the README as the content itself, everyone would see it immediately, and other people could contribute to it collaboratively using all of the tools that the version control software Git and the site GitHub offer. I realized that I could build a text-based database of tools for endangered, minority, or low-resource languages fairly easily, and that I could develop a community around cataloguing useful resources on GitHub, where the code was likely to be used more often. That was also why I focused on open source code: I wanted to find code that other people could use, share, and talk about easily, without worrying about licensing, royalties, or proprietary concerns.
So, I made a small list, and kept adding to it as I saw more tools. Soon, other contributors helped out, and the list got a tiny bit of traction. The low-resource language community is small, mostly because it is fractured into separate groups of researchers for each particular language. This has disadvantages: a lot of work isn't shared or extended to other use cases. That leads to wasted work and funding, and is a net loss for linguistic communities. My hope with this list is that people will look around, find something they can use, and save time in the end.
I am currently studying for a Master's in Computational Linguistics, and I've worked with languages that don't have much data, so I know a little about which tools are useful and how to find relevant code. Although I am a web developer by trade, I hope to continue building the list, and I am presenting a paper on it at the International Conference on Language Resources and Evaluation (LREC) in May.
The list isn't a panacea - it's a list. I'd like for the database to become more structured, for all software mentioned on it to be saved in a permanent repository (perhaps on GitHub), and for the projects themselves to be more useful at times. But these are challenges that I think can be overcome, and there are active discussions on how to do this in the repository's discussion board. Already, I know that it has been of use to a few researchers - I've gotten a few thanks, here and there - so I am hopeful for the future. I hope it will be useful to you as well.
To check out the list, simply go to https://github.com/RichardLitt/endangered-languages. If you're interested in being involved, read through the issues on https://github.com/RichardLitt/endangered-languages/issues, or send me an email at richard.littauer@gmail.com.