Arctic Knot Conference 2021/Submissions/Sami language resources from Norwegian public sector internet domains
This is an open submission for the Arctic Knot Conference 2021.
- Submission no.
- 19
- Title of the submission
- Sami language resources from Norwegian public sector internet domains
- Author of the submission
- Andre Kåsen and Magnus Breder Birkenes
- Submission format
- Pre-recorded video presentation (15–30 mins)
- Language of presentation
- English
- E-mail address
- magnus.birkenes@nb.no
- Country of origin
- Norway
- Affiliation, if any (organisation, company etc.)
- Personal homepage or blog
- Abstract (up to 300 words to describe your proposal)
- We present the results of a deep crawl of public content from Norwegian public entities. We downloaded 1.8 million unique web pages and text documents and extracted natural language from them using various text extraction methods (boilerplate removal, OCR). The resulting corpus consists of 4.3 billion words in various languages, including 3.4 billion in Norwegian (Bokmål and Nynorsk), 5.7 million in Northern Sami, 400,000 in Southern Sami and 200,000 in Lule Sami. In the presentation, we will take a closer look at the methods used and at the Sami resources found on the domains of Norwegian public entities. The project is a cooperation between the National Library of Norway and the Norwegian Digitalisation Agency.
- What will attendees take away from this session?
- Theme of session
- Language technology
- Slides or further information (optional)
- Special requests
- Is this Submission a Draft or Final?
- Final
Interested attendees
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with a hash and four tildes (# ~~~~).