Contributed by: Richard Littauer
Open source software is amazing, and it is changing the world. In case you
don't know, "open source" means code that anyone can look at and also use.
While there may be a license, anyone can borrow the code and use it in their
own program, for free. Your computer is even running some open source code
right now. This made your computer cheaper, because some pieces of software
that had already been developed didn't have to be reinvented. I'm a
computational linguist; while studying, I realized that a lot of the tools I
work with aren't free or open source, so we have to keep redeveloping them.
This costs money and time. So, I wanted to see what was available to help
people designing computational tools for endangered and under-resourced
languages. I decided to start with a list.
My list of open source resources for endangered languages on GitHub started
incredibly simply: I just wanted a list of useful, free resources that people
could use to work with endangered languages.
At the time, there was a new trend emerging on GitHub, which involved using
README files to make collaborative documents. This was novel: in the
programming world, README files traditionally explain the other files in the
folder, or how to run the program. On GitHub, the largest site for sharing code
in the world, they are shown on each repository's page. This presented an
opportunity: you could use the README as the content itself, everyone would see
it immediately, and other people could contribute to it collaboratively using
all of the tools that the version control software Git and the site GitHub
offer. I realized that I could build a text-based database of tools for
endangered, minority, or low-resource languages fairly easily, and that I could
develop a community around cataloguing useful resources on GitHub, where the
code was likely to be used more often. That was also why I focused on open
source code: I wanted to find code that other people could use, share, and talk
about easily, without worrying about licensing, royalties, or proprietary
concerns.
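To make the idea of a README as a text-based database a little more concrete,
here is a minimal sketch in Python of how such a list could be read back into
structured records. It assumes a hypothetical entry format,
`- [Name](URL) - description` under Markdown headings; the actual list may be
laid out differently, and the function name `parse_readme` and the sample
entries are invented purely for illustration.

```python
import re

# Hypothetical entry format (an assumption, not the real list's layout):
#   - [Name](URL) - optional description
ENTRY = re.compile(
    r"^\s*[-*]\s+\[(?P<name>[^\]]+)\]\((?P<url>[^)\s]+)\)"
    r"\s*(?:-\s*(?P<description>.*))?$"
)


def parse_readme(text):
    """Return one dict per list entry, tagged with the nearest heading above it."""
    records = []
    section = None
    for line in text.splitlines():
        if line.startswith("#"):
            # Remember the most recent Markdown heading as the entry's category.
            section = line.lstrip("#").strip()
            continue
        match = ENTRY.match(line)
        if match:
            records.append({
                "section": section,
                "name": match.group("name"),
                "url": match.group("url"),
                "description": (match.group("description") or "").strip(),
            })
    return records


# Invented sample data; these are not entries from the real list.
SAMPLE = """\
# Example list

## Dictionaries
- [Example Dictionary Tool](https://example.org/dictionary) - Builds word lists.

## Keyboards
* [Example Keyboard](https://example.org/keyboard)
Some free text that is not an entry.
"""

if __name__ == "__main__":
    for record in parse_readme(SAMPLE):
        print(record)
```

Lines that don't look like entries are simply skipped, so free-form notes can
sit alongside the list without breaking anything. That tolerance is part of
what makes a plain README workable as a shared, hand-edited database.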
So, I made a small list, and kept adding to it as I saw more tools. Soon, there
were other contributors who helped out, and the list got a tiny bit of
traction. The low-resource language community is small, mostly because it is
fractured into researchers for each particular language. This has
disadvantages: a lot of work isn't shared or extended to other use cases. That
leads to a lot of wasted work and funding, and is a net loss for linguistic
communities. My hope with this list is that people will look around, find
something they can use, and save time in the end.
I am currently studying for a Master's in Computational Linguistics, and I've
worked with languages
that don't have much data, so I know a small amount about what tools are useful
and how to find relevant code. Although I am a web developer by trade, I hope
to continue building the list, and I am presenting a paper on it at the
International Conference on Language Resources and Evaluation (LREC) in May.
The list isn't a panacea; it's a list. I'd like the database to become more
structured, all software mentioned on it to be saved in a permanent repository
(perhaps on GitHub), and the projects themselves to be more useful at times.
But these are challenges that I think can be overcome, and there are active
discussions on how to do this in the repository's discussion board. Already, I
know that it has been of use to a few researchers (I've gotten a few thanks,
here and there), so I am hopeful for the future. I hope it will be useful to
you as well.
To check out the list, simply go to https://github.com/RichardLitt/endangered-languages.
If you're interested in being involved, read through the issues on https://github.com/RichardLitt/endangered-languages/issues,
or send me an email at richard.littauer@gmail.com.