sbrunk / tokenizers-scala   0.0.2

Apache License 2.0

Scala bindings for Hugging Face Tokenizers

Scala versions: 3.x 2.13

tokenizers-scala

Scala bindings for Hugging Face Tokenizers, a tokenization library written in Rust.

Usage

import io.brunk.tokenizers.Tokenizer

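// Load the pre-trained "bert-base-cased" tokenizer.
val tokenizer = Tokenizer.fromPretrained("bert-base-cased")
// Encode one sentence; addSpecialTokens = true adds the [CLS] and [SEP] markers.
val encoding = tokenizer.encode("Hello, y'all! How are you 😁 ?", addSpecialTokens = true)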
println(encoding.length)
// 13
println(encoding.ids)
// ArraySeq(101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 100, 136, 102)
println(encoding.tokens)
// ArraySeq([CLS], Hello, ,, y, ', all, !, How, are, you, [UNK], ?, [SEP])
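The ids and tokens printed above appear to line up position by position (both have 13 entries). The following is a minimal sketch of that correspondence. It copies the values from the output above instead of calling the library, so it runs without the dependency or a model download; the names `pairs` and `special` are illustrative only and are not part of the tokenizers-scala API.

```scala
// Values copied verbatim from the example output above.
val tokens = Seq("[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]")
val ids    = Seq(101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 100, 136, 102)

// Assumption: ids and tokens are parallel sequences, so zipping pairs
// each token with the id at the same position.
val pairs = tokens.zip(ids)

// In this output the special tokens are the ones written in square brackets.
val special = pairs.filter { case (token, _) => token.startsWith("[") && token.endsWith("]") }

// Print one "id  token" row per position.
pairs.foreach { case (token, id) => println(f"$id%5d  $token") }
println(s"${special.length} of ${pairs.length} tokens are special: ${special.map(_._1).mkString(", ")}")
// 3 of 13 tokens are special: [CLS], [UNK], [SEP]
```

[CLS] and [SEP] are the markers added by addSpecialTokens = true, while [UNK] stands in for the emoji, presumably because it is not in the bert-base-cased vocabulary.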

Installation

sbt

libraryDependencies += "io.brunk.tokenizers" %% "tokenizers" % "<version>"

Scala CLI

//> using dep "io.brunk.tokenizers::tokenizers:<version>"

Others

For other build tools, copy the coordinates from Maven Central for Scala 2.13 or Scala 3.

Status

Currently, we can only load and run pre-trained tokenizers. Training is not yet possible.

How to build the project

  1. Install bleep
  2. Install Rust and Cargo
  3. Compile the project and run the tests:
    bleep compile
    bleep test