Compare commits

...

10 Commits

Author | SHA1 | Message | Date
Patrick Goldinger | f41385ae75 | Adjust setup script | 2024-10-20 18:46:48 +02:00
Patrick Goldinger | 5c85be61d9 | Add vscode env setup script | 2024-10-20 18:46:41 +02:00
Patrick Goldinger | 639beb9e64 | Add initial flest implementation | 2024-10-20 18:46:33 +02:00
Patrick Goldinger | 453fb0253a | Fix predictive back by removing generic onBackPressedHandler (#2646) | 2024-10-19 15:13:35 +02:00
Lucas Sanginetto | 13fc7679a2 | Fix incorrect bracket labels in IPA symbols layout (#2644) | 2024-10-19 11:26:25 +02:00
Patrick Goldinger | 2421d13038 | Update ROADMAP.md (#2639) | 2024-10-18 00:34:15 +02:00
florisboard-bot | 7dedfd4f7a | Update translations from Crowdin | 2024-10-17 19:07:32 +02:00
Patrick Goldinger | ef37194900 | Disable compose strong skipping mode (#2637) | 2024-10-17 16:51:02 +02:00
Lars Mühlbauer | 58134b1ceb | Add fix for sensitive clipboard suggestions (#2635) | 2024-10-16 23:17:54 +02:00
Lars Mühlbauer | 53cfbad404 | Clipboard History enhancements (#2631) | 2024-10-14 19:31:37 +02:00
* Hide sensitive clip data in clipboard history
* Add is remoteDevice flag
* Do not link password length to displayed characters
* Add backspace in clipboard history (#2615)
* Use ClipboardItem level function for the obfuscation of the text
* Move the backspace button to the header bar
* Adjust innerHeight to match the full layout
* Use KeyboardLikeButton instead of FlorisIconButtonWithInnerPadding
30 changed files with 1757 additions and 50 deletions

View File

@@ -2,24 +2,7 @@
This feature roadmap intends to provide transparency to what is planned to be added to FlorisBoard in the foreseeable future. Note that there are no ETAs for any version milestones down below, experience has shown these won't hold anyway.
Each major milestone has associated alpha/beta releases, so if you are interested in previewing features quicker, keep an eye out! Each major 0.x release has also patch releases after the initial major release, which will be published on both the stable and beta tracks.
## 0.4
**Main focus**: Getting the project back on track, see [this announcement](https://github.com/florisboard/florisboard/discussions/2314) for details. Note that this has also replaced the previous roadmap, however this step is necessary for getting the project back on track again.
This includes, but is not exclusive to:
- Fixing the most reported bugs/issues
- Merging in the Material You theme PR -> Adds Material You support (v0.4.0-alpha05)
- Merging in other external PRs as best as possible
- Reworking the Settings UI warning boxes and hiding any UI for features related to word suggestions until they are ready
- Remove existing glide/swipe typing (see 0.5 milestone)
- Improvements in clipboard / emoji functionality (v0.4.0-beta01/beta02)
- Prepare project to have native code implemented in [Rust](https://www.rust-lang.org/) (v0.4.0-beta02)
- - Upgrade Settings UI to Material 3 (v0.4.0-beta03)
- Add support for importing extensions via system file handler APIs (relevant for Addons store) (v0.4.0-beta03)
Note that the previous versioning scheme has been dropped in favor of using a major.minor.patch versioning scheme, so versions like `0.3.16` are a thing of the past :)
Each major milestone has associated alpha/beta releases, so if you are interested in previewing features quicker, keep an eye out! Each major 0.x release has also patch releases after the initial major release, which will be published on both the stable and preview tracks.
## 0.5
@@ -28,25 +11,25 @@ Note that the previous versioning scheme has been dropped in favor of using a ma
- Add new extension type: Language Pack
- Basically groups all locale-relevant data (predictive base model, emoji suggestion data, ...)
in a dynamically importable extension file
- New text processing logic (maybe moved back / split to 0.6)
- Add floating keyboard mode
- New keyboard layout engine + file syntax based on the upcoming Unicode Keyboard v3 standard
- RFC document with technical details will be released later
- New text processing logic (maybe moved back to 0.6)
- RFC document with technical details will be released later
- Add Tablet mode / Optimizations for landscape input based on new keyboard layout engine
- Reimplementation of glide typing with the new layout engine and predictive text core
- Add support for any remaining new features introduced with Android 13
## 0.6
- Complete rework of the Emoji panel
- Recently used / Emoji history (already implemented with 0.3.14)
- Emoji search
- Emoji suggestions when using :emoji_name: syntax (already implemented with v0.4.0-beta02)
- Fully scrollable emoji list (soft category borders)
- More granular theming options
- Layout customization (e.g. placement of category buttons)
- Maybe: consider upgrading to emoji2 for better unified system-wide emoji styles
- Reimplementation of glide typing with the new layout engine and predictive text core
- Prepare FlorisBoard repository and app store presence for public beta release on Google Play (will go live with stable 0.6)
- Rework branding images and texts of FlorisBoard for the app stores
- Focus on stability and experience improvements of the app and keyboard
- Add support for new features introduced with Android 14
- Add support for new features introduced with Android 14 / 15
- Not finalized, but planned: raise minimum required Android version from Android 7 (SDK level 24) to Android 8 (SDK level 26)
## Backlog / Planned (unassigned)
@@ -58,7 +41,6 @@ Note that the previous versioning scheme has been dropped in favor of using a ma
- Adaptive themes v2
- Voice-to-text with Mozilla's open-source voice service (or any other oss voice provider)
- Text translation
- Floating keyboard
- Stickers/GIFs
- Kaomoji panel implementation
- FlorisBoard landing web page for presentation

View File

@@ -14,6 +14,7 @@
* limitations under the License.
*/
import org.jetbrains.kotlin.compose.compiler.gradle.ComposeFeatureFlag
import java.io.ByteArrayOutputStream
plugins {
@@ -161,6 +162,12 @@ android {
}
}
composeCompiler {
// DO NOT ENABLE STRONG SKIPPING! This project currently relies on
// recomposition on parent state change to update the UI correctly.
featureFlags.add(ComposeFeatureFlag.StrongSkipping.disabled())
}
tasks.withType<Test> {
useJUnitPlatform()
}

View File

@@ -0,0 +1,92 @@
{
"formatVersion": 1,
"database": {
"version": 3,
"identityHash": "282a1b421e498fd0e21c055b6a4315e0",
"entities": [
{
"tableName": "clipboard_history",
"createSql": "CREATE TABLE IF NOT EXISTS `${TABLE_NAME}` (`_id` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, `type` INTEGER NOT NULL, `text` TEXT, `uri` TEXT, `creationTimestampMs` INTEGER NOT NULL, `isPinned` INTEGER NOT NULL, `mimeTypes` TEXT NOT NULL, `isSensitive` INTEGER NOT NULL, `isRemoteDevice` INTEGER NOT NULL)",
"fields": [
{
"fieldPath": "id",
"columnName": "_id",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "type",
"columnName": "type",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "text",
"columnName": "text",
"affinity": "TEXT",
"notNull": false
},
{
"fieldPath": "uri",
"columnName": "uri",
"affinity": "TEXT",
"notNull": false
},
{
"fieldPath": "creationTimestampMs",
"columnName": "creationTimestampMs",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "isPinned",
"columnName": "isPinned",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "mimeTypes",
"columnName": "mimeTypes",
"affinity": "TEXT",
"notNull": true
},
{
"fieldPath": "isSensitive",
"columnName": "isSensitive",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "isRemoteDevice",
"columnName": "isRemoteDevice",
"affinity": "INTEGER",
"notNull": true
}
],
"primaryKey": {
"autoGenerate": true,
"columnNames": [
"_id"
]
},
"indices": [
{
"name": "index_clipboard_history__id",
"unique": false,
"columnNames": [
"_id"
],
"orders": [],
"createSql": "CREATE INDEX IF NOT EXISTS `index_clipboard_history__id` ON `${TABLE_NAME}` (`_id`)"
}
],
"foreignKeys": []
}
],
"views": [],
"setupQueries": [
"CREATE TABLE IF NOT EXISTS room_master_table (id INTEGER PRIMARY KEY,identity_hash TEXT)",
"INSERT OR REPLACE INTO room_master_table (id,identity_hash) VALUES(42, '282a1b421e498fd0e21c055b6a4315e0')"
]
}
}

View File

@@ -64,7 +64,7 @@
{ "code": 11816, "label": "⸨" },
{ "code": 10214, "label": "⟦" },
{ "code": 10216, "label": "⟨" },
{ "code": 10218, "label": "" },
{ "code": 10218, "label": "⟪" },
{ "code": 123, "label": "{" }
]
} },
@@ -72,7 +72,7 @@
"relevant": [
{ "code": 41, "label": ")" },
{ "code": 11817, "label": "⸩" },
{ "code": 10215, "label": "" },
{ "code": 10215, "label": "⟧" },
{ "code": 10217, "label": "⟩" },
{ "code": 10219, "label": "⟫" },
{ "code": 125, "label": "}" }

View File

@@ -31,7 +31,6 @@ import androidx.compose.material3.Surface
import androidx.compose.runtime.Composable
import androidx.compose.runtime.CompositionLocalProvider
import androidx.compose.runtime.LaunchedEffect
import androidx.compose.runtime.SideEffect
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.setValue
@@ -215,9 +214,5 @@ class FlorisAppActivity : ComponentActivity() {
}
intentToBeHandled = null
}
SideEffect {
navController.setOnBackPressedDispatcher(this.onBackPressedDispatcher)
}
}
}

View File

@@ -48,6 +48,7 @@ import androidx.compose.foundation.layout.wrapContentHeight
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.automirrored.filled.ArrowBack
import androidx.compose.material.icons.automirrored.outlined.Backspace
import androidx.compose.material.icons.filled.ClearAll
import androidx.compose.material.icons.filled.Edit
import androidx.compose.material.icons.filled.ToggleOff
@@ -87,6 +88,8 @@ import dev.patrickgold.florisboard.ime.clipboard.provider.ClipboardFileStorage
import dev.patrickgold.florisboard.ime.clipboard.provider.ClipboardItem
import dev.patrickgold.florisboard.ime.clipboard.provider.ItemType
import dev.patrickgold.florisboard.ime.keyboard.FlorisImeSizing
import dev.patrickgold.florisboard.ime.media.KeyboardLikeButton
import dev.patrickgold.florisboard.ime.text.keyboard.TextKeyData
import dev.patrickgold.florisboard.ime.theme.FlorisImeTheme
import dev.patrickgold.florisboard.ime.theme.FlorisImeUi
import dev.patrickgold.florisboard.keyboardManager
@@ -193,6 +196,13 @@ fun ClipboardInputLayout(
iconColor = headerStyle.foreground.solidColor(context),
enabled = !deviceLocked && historyEnabled && !isPopupSurfaceActive(),
)
KeyboardLikeButton(
inputEventDispatcher = keyboardManager.inputEventDispatcher,
keyData = TextKeyData.DELETE,
element = FlorisImeUi.ClipboardHeader,
) {
Icon(Icons.AutoMirrored.Outlined.Backspace, null)
}
}
}
@@ -307,7 +317,7 @@ fun ClipboardInputLayout(
.fillMaxWidth()
.run { if (contentScrollInsteadOfClip) this.florisVerticalScroll() else this }
.padding(ItemPadding),
text = text,
text = item.displayText(),
style = TextStyle(textDirection = TextDirection.ContentOrLtr),
color = style.foreground.solidColor(context),
fontSize = style.fontSize.spSize(),
@@ -577,7 +587,7 @@ fun ClipboardInputLayout(
Column(
modifier = modifier
.fillMaxWidth()
.wrapContentHeight(),
.height(FlorisImeSizing.imeUiHeight()),
) {
HeaderRow()
if (deviceLocked) {

View File

@@ -17,6 +17,8 @@
package dev.patrickgold.florisboard.ime.clipboard.provider
import android.content.ClipData
import android.content.ClipDescription.EXTRA_IS_REMOTE_DEVICE
import android.content.ClipDescription.EXTRA_IS_SENSITIVE
import android.content.ContentValues
import android.content.Context
import android.database.Cursor
@@ -24,6 +26,8 @@ import android.net.Uri
import android.provider.BaseColumns
import android.provider.MediaStore.Images.Media
import android.provider.OpenableColumns
import androidx.compose.runtime.Composable
import androidx.compose.ui.platform.LocalContext
import androidx.core.database.getStringOrNull
import androidx.lifecycle.LiveData
import androidx.room.ColumnInfo
@@ -39,9 +43,14 @@ import androidx.room.RoomDatabase
import androidx.room.TypeConverter
import androidx.room.TypeConverters
import androidx.room.Update
import dev.patrickgold.florisboard.R
import kotlinx.serialization.EncodeDefault
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import org.florisboard.lib.android.AndroidVersion
import org.florisboard.lib.android.UriSerializer
import org.florisboard.lib.android.query
import kotlinx.serialization.Serializable
import org.florisboard.lib.android.stringRes
import org.florisboard.lib.kotlin.tryOrNull
private const val CLIPBOARD_HISTORY_TABLE = "clipboard_history"
@@ -67,7 +76,7 @@ enum class ItemType(val value: Int) {
*/
@Serializable
@Entity(tableName = CLIPBOARD_HISTORY_TABLE)
data class ClipboardItem(
data class ClipboardItem @OptIn(ExperimentalSerializationApi::class) constructor(
@PrimaryKey(autoGenerate = true)
@ColumnInfo(name = BaseColumns._ID, index = true)
var id: Long = 0,
@@ -78,6 +87,10 @@ data class ClipboardItem(
val creationTimestampMs: Long,
val isPinned: Boolean,
val mimeTypes: Array<String>,
@EncodeDefault
val isSensitive: Boolean = false,
@EncodeDefault
val isRemoteDevice: Boolean = false,
) {
companion object {
/**
@@ -113,6 +126,18 @@ data class ClipboardItem(
else -> ItemType.TEXT
}
val isSensitive = if (AndroidVersion.ATLEAST_API33_T) {
data.description?.extras?.getBoolean(EXTRA_IS_SENSITIVE) ?: false
} else {
false
}
val isRemoteDevice = if (AndroidVersion.ATLEAST_API34_U) {
data.description?.extras?.getBoolean(EXTRA_IS_REMOTE_DEVICE) ?: false
} else {
false
}
val uri = if (type == ItemType.IMAGE || type == ItemType.VIDEO) {
if (dataItem.uri.authority == ClipboardMediaProvider.AUTHORITY || !cloneUri) {
dataItem.uri
@@ -151,7 +176,21 @@ data class ClipboardItem(
}
}
return ClipboardItem(0, type, text, uri, System.currentTimeMillis(), false, mimeTypes)
return ClipboardItem(0, type, text, uri, System.currentTimeMillis(), false, mimeTypes, isSensitive, isRemoteDevice)
}
}
@Composable
inline fun displayText(): String {
val context = LocalContext.current
return displayText(context)
}
fun displayText(context: Context): String {
return if (isSensitive) {
context.stringRes(R.string.clipboard__sensitive_clip_content)
} else {
stringRepresentation()
}
}
@@ -293,7 +332,7 @@ interface ClipboardHistoryDao {
fun deleteAllUnpinned()
}
@Database(entities = [ClipboardItem::class], version = 2)
@Database(entities = [ClipboardItem::class], version = 3)
@TypeConverters(Converters::class)
abstract class ClipboardHistoryDatabase : RoomDatabase() {
abstract fun clipboardItemDao(): ClipboardHistoryDao
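
The `displayText` change above swaps the real clip text for a fixed placeholder whenever `isSensitive` is set, so the number of displayed characters no longer reveals the length of the secret. The same idea can be sketched outside Android in a few lines of Rust. This is only an illustration: `ClipItem`, `display_text` and `SENSITIVE_PLACEHOLDER` are hypothetical names, and the placeholder value is copied from the `clipboard__sensitive_clip_content` string resource in this change set.

```rust
/// Fixed-width mask (assumption: copied from the string resource added in this
/// change set; the real app resolves it through Android resources instead).
const SENSITIVE_PLACEHOLDER: &str = "************";

/// Hypothetical, stripped-down stand-in for the Kotlin `ClipboardItem`.
struct ClipItem {
    text: String,
    is_sensitive: bool,
}

impl ClipItem {
    /// Returns the text that is safe to show in the UI.
    /// Sensitive items always map to the same placeholder, so the mask
    /// length is independent of the length of the underlying secret.
    fn display_text(&self) -> &str {
        if self.is_sensitive {
            SENSITIVE_PLACEHOLDER
        } else {
            &self.text
        }
    }
}
```

A short and a long secret therefore render identically, while non-sensitive items are shown unchanged.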

View File

@@ -113,12 +113,13 @@ internal fun KeyboardLikeButton(
modifier: Modifier = Modifier,
inputEventDispatcher: InputEventDispatcher,
keyData: KeyData,
element: String = FlorisImeUi.EmojiKey,
content: @Composable RowScope.() -> Unit,
) {
val inputFeedbackController = LocalInputFeedbackController.current
var isPressed by remember { mutableStateOf(false) }
val keyStyle = FlorisImeTheme.style.get(
element = FlorisImeUi.EmojiKey,
element = element,
code = keyData.code,
isPressed = isPressed,
)

View File

@@ -61,7 +61,7 @@ class NlpManager(context: Context) {
private val subtypeManager by context.subtypeManager()
private val scope = CoroutineScope(Dispatchers.Default + SupervisorJob())
private val clipboardSuggestionProvider = ClipboardSuggestionProvider()
private val clipboardSuggestionProvider = ClipboardSuggestionProvider(context)
private val emojiSuggestionProvider = EmojiSuggestionProvider(context)
private val providers = guardedByLock {
mapOf(
@@ -349,7 +349,7 @@ class NlpManager(context: Context) {
}
}
inner class ClipboardSuggestionProvider internal constructor() : SuggestionProvider {
inner class ClipboardSuggestionProvider internal constructor(private val context: Context) : SuggestionProvider {
private var lastClipboardItemId: Long = -1
override val providerId = "org.florisboard.nlp.providers.clipboard"
@@ -378,7 +378,10 @@ class NlpManager(context: Context) {
return buildList {
val now = System.currentTimeMillis()
if ((now - currentItem.creationTimestampMs) < prefs.suggestion.clipboardContentTimeout.get() * 1000) {
add(ClipboardSuggestionCandidate(currentItem, sourceProvider = this@ClipboardSuggestionProvider))
add(ClipboardSuggestionCandidate(currentItem, sourceProvider = this@ClipboardSuggestionProvider, context = context))
if (currentItem.isSensitive) {
return@buildList
}
if (currentItem.type == ItemType.TEXT) {
val text = currentItem.stringRepresentation()
val matches = buildList {
@@ -402,6 +405,7 @@ class NlpManager(context: Context) {
}
),
sourceProvider = this@ClipboardSuggestionProvider,
context = context,
))
}
}

View File

@@ -16,6 +16,7 @@
package dev.patrickgold.florisboard.ime.nlp
import android.content.Context
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.automirrored.outlined.Assignment
import androidx.compose.material.icons.filled.Email
@@ -123,8 +124,9 @@ data class WordSuggestionCandidate(
data class ClipboardSuggestionCandidate(
val clipboardItem: ClipboardItem,
override val sourceProvider: SuggestionProvider?,
val context: Context,
) : SuggestionCandidate {
override val text: CharSequence = clipboardItem.stringRepresentation()
override val text: CharSequence = clipboardItem.displayText(context)
override val secondaryText: CharSequence? = null

View File

@@ -316,7 +316,7 @@ fun TextKeyboardLayout(
val debugShowTouchBoundaries by prefs.devtools.showKeyTouchBoundaries.observeAsState()
for (textKey in keyboard.keys()) {
TextKeyButton(
textKey, desiredKey, evaluator, fontSizeMultiplier, isSmartbarKeyboard,
textKey, evaluator, fontSizeMultiplier, isSmartbarKeyboard,
debugShowTouchBoundaries,
)
}
@@ -336,7 +336,6 @@ fun TextKeyboardLayout(
@Composable
private fun TextKeyButton(
key: TextKey,
desiredKey: TextKey,
evaluator: ComputingEvaluator,
fontSizeMultiplier: Float,
isSmartbarKey: Boolean,
@@ -359,9 +358,7 @@ private fun TextKeyButton(
KeyCode.VIEW_NUMERIC_ADVANCED -> 0.55f
else -> 1.0f
}
val size = remember(desiredKey) {
key.visibleBounds.size.toDpSize()
}
val size = key.visibleBounds.size.toDpSize()
Box(
modifier = Modifier
.requiredSize(size)

View File

@@ -15,6 +15,7 @@
<string name="media__tab__kaomoji" comment="Tab description for kaomoji in the media UI">Kaomoji</string>
<string name="prefs__media__emoji_preferred_skin_tone">لون البشرة المفضل للرموز التعبيرية</string>
<string name="prefs__media__emoji_preferred_hair_style">تصفيفة الشعر الرموز التعبيرية المفضلة</string>
<string name="prefs__media__emoji_history_enabled" comment="Preference title">Activar l\'historial de fustaxes</string>
<!-- Emoji strings -->
<string name="emoji__category__smileys_emotion" comment="Emoji category name">Sorrises y fustaxes</string>
<string name="emoji__category__people_body" comment="Emoji category name">Persones y cuerpu</string>

View File

@@ -582,6 +582,8 @@
<string name="devtools__show_input_state_overlay__summary" comment="Summary of Show input cache overlay in Devtools">Zobrazí aktuální stav vstupu pro ladění</string>
<string name="devtools__show_spelling_overlay__label" comment="Label of Show spelling overlay in Devtools">Zobrazit překrytí s pravopisem</string>
<string name="devtools__show_spelling_overlay__summary" comment="Summary of Show spelling overlay in Devtools">Zobrazí aktuální výsledky pravopisu pro ladění</string>
<string name="devtools__show_inline_autofill_overlay__label">Zobrazit překrytí automatického vyplňování na řádku</string>
<string name="devtools__show_inline_autofill_overlay__summary">Zobrazí aktuální výsledky automatického vyplňování na řádku pro ladění</string>
<string name="devtools__show_key_touch_boundaries__label" comment="Label of Show key touch boundaries in Devtools">Zobrazit hranice dotyku kláves</string>
<string name="devtools__show_key_touch_boundaries__summary" comment="Summary of Show key touch boundaries in Devtools">Zobrazit červené ohraničení hranic dotyku kláves</string>
<string name="devtools__show_drag_and_drop_helpers__label" comment="Label of Show drag and drop helpers in Devtools">Zobrazit pomocníky drag&amp;drop</string>

View File

@@ -5,7 +5,7 @@
<string name="key__phone_wait" comment="Label for the Wait key in the telephone keyboard layout">Ожидание</string>
<string name="key_popup__threedots_alt" comment="Content description for the three-dots icon in a key popup">Значок троеточия. Если отображается, показывает, что можно использовать больше знаков при долгом нажатии.</string>
<!-- One-handed strings -->
<string name="one_handed__close_btn_content_description" comment="Content description for the one-handed close button">Закрыть режим одной руки</string>
<string name="one_handed__close_btn_content_description" comment="Content description for the one-handed close button">Закрыть режим одной руки.</string>
<string name="one_handed__move_start_btn_content_description" comment="Content description for the one-handed move to left button">Переместить клавиатуру влево.</string>
<string name="one_handed__move_end_btn_content_description" comment="Content description for the one-handed move to right button">Переместить клавиатуру вправо.</string>
<!-- Media strings -->
@@ -15,6 +15,12 @@
<string name="media__tab__kaomoji" comment="Tab description for kaomoji in the media UI">Каомодзи</string>
<string name="prefs__media__emoji_preferred_skin_tone">Предпочтительный цвет кожи эмодзи</string>
<string name="prefs__media__emoji_preferred_hair_style">Предпочтительная прическа эмодзи</string>
<string name="prefs__media__emoji_history__title" comment="Preference group title">История эмодзи</string>
<string name="prefs__media__emoji_history_enabled" comment="Preference title">Включить историю эмодзи</string>
<string name="prefs__media__emoji_history_enabled__summary" comment="Preference summary">Сохраняйте недавно использованные эмодзи для быстрого доступа</string>
<string name="prefs__media__emoji_history_pinned_update_strategy" comment="Preference title">Обновление истории (Закрепленного)</string>
<string name="prefs__media__emoji_history_recent_update_strategy" comment="Preference title">Обновление истории (Недавнего)</string>
<string name="prefs__media__emoji_history_max_size">Максимум элементов, которые можно сохранить</string>
<string name="prefs__media__emoji_suggestion__title" comment="Preference group title">Подсказки смайликов</string>
<string name="prefs__media__emoji_suggestion_enabled" comment="Preference title">Включить подсказки смайликов</string>
<string name="prefs__media__emoji_suggestion_enabled__summary" comment="Preference summary">Предлагать смайлики при наборе текста</string>
@@ -34,6 +40,11 @@
<string name="emoji__category__objects" comment="Emoji category name">Объекты</string>
<string name="emoji__category__symbols" comment="Emoji category name">Символы</string>
<string name="emoji__category__flags" comment="Emoji category name">Флаги</string>
<string name="emoji__history__empty_message" comment="Message if the emoji history is empty">Недавно использованные эмодзи не найдены. Как только вы начнете использовать эмодзи, они автоматически будут появляться здесь.</string>
<string name="emoji__history__usage_tip" comment="Feature discoverability for actions of emoji history">Совет: Долго нажимайте на эмодзи в истории эмодзи, чтобы закрепить или удалить их!</string>
<string name="emoji__history__removal_success_message" comment="Toast message if user has used the delete action on an emoji in the emoji history">Удаление {emoji} из истории эмодзи</string>
<string name="emoji__history__pinned">Закреплено</string>
<string name="emoji__history__recent">Недавние</string>
<!-- Quick action strings -->
<string name="quick_action__arrow_up" maxLength="12">В начало</string>
<string name="quick_action__arrow_up__tooltip">Переместить курсор в начало</string>
@@ -569,6 +580,7 @@
<string name="devtools__show_input_state_overlay__summary" comment="Summary of Show input cache overlay in Devtools">Показывать наложением текущее состояние ввода для отладки</string>
<string name="devtools__show_spelling_overlay__label" comment="Label of Show spelling overlay in Devtools">Показывать орфографию наложением</string>
<string name="devtools__show_spelling_overlay__summary" comment="Summary of Show spelling overlay in Devtools">Показывать наложением текущие результаты проверки орфографии для отладки</string>
<string name="devtools__show_inline_autofill_overlay__summary">Отображает текущие результаты автозаполнения строки для отладки</string>
<string name="devtools__show_key_touch_boundaries__label" comment="Label of Show key touch boundaries in Devtools">Показывать границы нажатия клавиш</string>
<string name="devtools__show_key_touch_boundaries__summary" comment="Summary of Show key touch boundaries in Devtools">Обводить границы нажатия клавиш красным контуром</string>
<string name="devtools__show_drag_and_drop_helpers__label" comment="Label of Show drag and drop helpers in Devtools">Показывать вспомогательные элементы перетаскивания</string>
@@ -747,6 +759,12 @@
<string name="enum__display_language_names_in__system_locale__description" comment="Enum value description">Подписи в приложении и интерфейсе клавиатуры указаны на языке, используемом в системе по умолчанию</string>
<string name="enum__display_language_names_in__native_locale" comment="Enum value label">В исходном виде</string>
<string name="enum__display_language_names_in__native_locale__description" comment="Enum value description">Подписи в приложении и интерфейсе клавиатуры приводятся на родных языках</string>
<string name="enum__emoji_history_update_strategy__auto_sort_prepend__description" comment="Enum value description">Автоматическое изменение порядка расположения эмодзи в зависимости от их использования. Новые эмодзи добавляются в начало.</string>
<string name="enum__emoji_history_update_strategy__auto_sort_append__description" comment="Enum value description">Автоматическое изменение порядка расположения эмодзи в зависимости от их использования. Новые эмодзи добавляются в конец.</string>
<string name="enum__emoji_history_update_strategy__manual_sort_prepend__description" comment="Enum value description">Не происходит автоматической перестановки эмодзи в зависимости от их использования.
Новые эмодзи добавляются в начало.</string>
<string name="enum__emoji_history_update_strategy__manual_sort_append__description" comment="Enum value description">Не происходит автоматической перестановки эмодзи в зависимости от их использования.
Новые эмодзи добавляются в конец.</string>
<string name="enum__emoji_skin_tone__default" comment="Enum value label">Цвет кожи {emoji} по умолчанию</string>
<string name="enum__emoji_skin_tone__light_skin_tone" comment="Enum value label">Светлый цвет кожи {emoji}</string>
<string name="enum__emoji_skin_tone__medium_light_skin_tone" comment="Enum value label">Светловатый цвет кожи {emoji}</string>
@@ -758,6 +776,8 @@
<string name="enum__emoji_hair_style__curly_hair" comment="Enum value label">{emoji} Вьющиеся волосы</string>
<string name="enum__emoji_hair_style__white_hair" comment="Enum value label">{emoji} Светлые волосы</string>
<string name="enum__emoji_hair_style__bald" comment="Enum value label">{emoji} Без волос</string>
<string name="enum__emoji_suggestion_type__leading_colon__description" comment="Keep the :emoji_name while translating, this is a syntax guide">Предлагайте эмодзи, используя синтаксис :emoji_name</string>
<string name="enum__emoji_suggestion_type__inline_text__description">Предлагает эмодзи, просто набрав название эмодзи в виде слова</string>
<string name="enum__extended_actions_placement__above_candidates" comment="Enum value label">Вышестоящие предложение</string>
<string name="enum__extended_actions_placement__above_candidates__description" comment="Enum value description">Размещает строку расширенных действий между пользовательским интерфейсом приложения и строкой предложений</string>
<string name="enum__extended_actions_placement__below_candidates" comment="Enum value label">Нижестоящие предложение</string>

View File

@@ -581,6 +581,8 @@
<string name="devtools__show_input_state_overlay__summary" comment="Summary of Show input cache overlay in Devtools">Накладає поточний стан входу для налагодження</string>
<string name="devtools__show_spelling_overlay__label" comment="Label of Show spelling overlay in Devtools">Показати правопис накладанням</string>
<string name="devtools__show_spelling_overlay__summary" comment="Summary of Show spelling overlay in Devtools">Накладає поточні результати правопису для налагодження</string>
<string name="devtools__show_inline_autofill_overlay__label">Показати вбудоване накладання автозаповнення</string>
<string name="devtools__show_inline_autofill_overlay__summary">Накладає поточні результати автозаповнення для налагодження</string>
<string name="devtools__show_key_touch_boundaries__label" comment="Label of Show key touch boundaries in Devtools">Показати межі дотику клавіш</string>
<string name="devtools__show_key_touch_boundaries__summary" comment="Summary of Show key touch boundaries in Devtools">Обводить червоним кольором межі дотику клавіш</string>
<string name="devtools__show_drag_and_drop_helpers__label" comment="Label of Show drag and drop helpers in Devtools">Показати помічників перетягування</string>

View File

@@ -23,6 +23,8 @@
<string name="key__view_keshida" translatable="false">"یــــ"</string>
<string name="key__dotted_circle" translatable="false">&#9676;</string>
<string name="clipboard__sensitive_clip_content" translatable="false">************</string>
<!-- Media strings -->
<string name="media__tab__emoticons_label" translatable="false">;-)</string>
<string name="media__tab__kaomoji_label" translatable="false">(^-^*)/</string>

25
libnative/flest/Cargo.lock generated Normal file
View File

@@ -0,0 +1,25 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 3
[[package]]
name = "byteorder"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
[[package]]
name = "flest"
version = "0.1.0"
dependencies = [
"fxhash",
]
[[package]]
name = "fxhash"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c31b6d751ae2c7f11320402d34e41349dd1016f8d5d45e48c4312bc8625af50c"
dependencies = [
"byteorder",
]

View File

@@ -0,0 +1,9 @@
[package]
name = "flest"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
fxhash = "0.2.1"

View File

@@ -0,0 +1,102 @@
use fxhash::FxHashMap;
#[derive(Default)]
struct DynTrieNode<V> where V: Default {
children: FxHashMap<char, Box<DynTrieNode<V>>>,
value: Option<V>,
}
impl<V> DynTrieNode<V> where V: Default {
fn for_each_recursive<'a, F>(&'a self, current_word: &mut Vec<char>, f: &mut F)
where F: FnMut(&[char], &'a V) {
if let Some(value) = &self.value {
f(&current_word, value);
}
for (letter, node) in &self.children {
current_word.push(*letter);
node.for_each_recursive(current_word, f);
current_word.pop();
}
}
}
#[derive(Default)]
pub struct DynTrie<V> where V: Default {
root: DynTrieNode<V>,
}
impl<V> DynTrie<V>
where V: Default {
    /// Returns the value stored for exactly `word`, or `None` if the word is absent.
    pub fn find(&self, word: &[char]) -> Option<&V> {
        let mut current_node = &self.root;
        for letter in word {
            match current_node.children.get(letter) {
                Some(node) => current_node = node,
                None => return None,
            }
        }
        current_node.value.as_ref()
    }

    /// Scores how well two words match, position by position: +1.0 for an exact
    /// character match, +0.5 for a case-insensitive match; a mismatch costs 1.0
    /// (2.0 on the first character). The shorter word is padded with spaces and
    /// the result is clamped to be non-negative.
    fn str_fuzzy_match_whole(str1: &[char], str2: &[char]) -> f64 {
        let len1 = str1.len();
        let len2 = str2.len();
        let max_len = std::cmp::max(len1, len2);
        let mut score: f64 = 0.0;
        let mut penalty: f64 = 0.0;
        for i in 0..max_len {
            let ch1 = str1.get(i).unwrap_or(&' ');
            let ch2 = str2.get(i).unwrap_or(&' ');
            if ch1 == ch2 {
                score += 1.0;
            } else if ch1.to_lowercase().eq(ch2.to_lowercase()) {
                score += 0.5;
            } else {
                penalty += if i == 0 { 2.0 } else { 1.0 };
            }
        }
        f64::max(0.0, score - penalty)
    }

    // TODO: optimization: we do not need to iterate over the whole
    // trie; we can predict when the score can never be > 0
    // and skip the whole subtree
    /// Returns every stored word whose fuzzy score against `word` is positive,
    /// together with its value. Currently visits the entire trie.
    pub fn find_many(&self, word: &[char]) -> Vec<(Vec<char>, &V)> {
        let mut results = Vec::new();
        self.for_each(&mut |current_word, value| {
            let score = Self::str_fuzzy_match_whole(word, current_word);
            if score > 0.0 {
                results.push((current_word.to_owned(), value));
            }
        });
        results
    }

    /// Returns the value stored for `word`, inserting `value` first if the
    /// word has no value yet. An existing value is never overwritten.
    pub fn find_or_insert(&mut self, word: &[char], value: V) -> &mut V {
        let mut current_node = &mut self.root;
        for letter in word {
            current_node = current_node.children.entry(*letter)
                .or_insert_with(|| Box::new(DynTrieNode::default()));
        }
        if current_node.value.is_none() {
            current_node.value = Some(value);
        }
        current_node.value.as_mut().unwrap()
    }

    /// Stores `value` for `word`, overwriting any existing value.
    #[allow(dead_code)]
    fn insert(&mut self, word: &[char], value: V) {
        let mut current_node = &mut self.root;
        for letter in word {
            current_node = current_node.children.entry(*letter)
                .or_insert_with(|| Box::new(DynTrieNode::default()));
        }
        current_node.value = Some(value);
    }

    /// Calls `f` with every stored word and its value, in unspecified order.
    pub fn for_each<'a, F>(&'a self, f: &mut F)
    where F: FnMut(&[char], &'a V) {
        let mut current_word: Vec<char> = Vec::new();
        self.root.for_each_recursive(&mut current_word, f);
    }
}
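
The trie above walks one `char` at a time, creating child nodes on insert and stopping at the first missing child on lookup. A minimal standalone sketch of the same idea, using only the standard library, is shown below. The names `SketchTrie` and `SketchNode` are hypothetical and exist only for illustration; the sketch uses `std::collections::HashMap` instead of `FxHashMap` and takes `&str` instead of `&[char]`.

```rust
use std::collections::HashMap;

/// Hypothetical illustration type, not part of the crate above.
struct SketchNode<V> {
    children: HashMap<char, SketchNode<V>>,
    value: Option<V>,
}

impl<V> SketchNode<V> {
    fn new() -> Self {
        SketchNode { children: HashMap::new(), value: None }
    }
}

/// Hypothetical illustration type: a char-keyed trie mapping words to values.
pub struct SketchTrie<V> {
    root: SketchNode<V>,
}

impl<V> SketchTrie<V> {
    pub fn new() -> Self {
        SketchTrie { root: SketchNode::new() }
    }

    /// Walks (and creates) one node per character, then stores the value.
    pub fn insert(&mut self, word: &str, value: V) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_insert_with(SketchNode::new);
        }
        node.value = Some(value);
    }

    /// Walks one node per character; a missing child means the word is absent.
    pub fn find(&self, word: &str) -> Option<&V> {
        let mut node = &self.root;
        for ch in word.chars() {
            node = node.children.get(&ch)?;
        }
        node.value.as_ref()
    }

    /// Collects all stored words, sorted so the result is deterministic.
    pub fn words(&self) -> Vec<String> {
        let mut out = Vec::new();
        collect_words(&self.root, &mut String::new(), &mut out);
        out.sort();
        out
    }
}

/// Depth-first traversal that keeps the current path in `prefix`.
fn collect_words<V>(node: &SketchNode<V>, prefix: &mut String, out: &mut Vec<String>) {
    if node.value.is_some() {
        out.push(prefix.clone());
    }
    for (ch, child) in &node.children {
        prefix.push(*ch);
        collect_words(child, prefix, out);
        prefix.pop();
    }
}
```

Inserting "cat" and "car" shares the nodes for `c` and `a`; looking up "ca" then reaches a node without a value and yields `None`, which is the same behaviour `find` has above for prefixes that are not complete words.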

@@ -0,0 +1,4 @@
mod dyntrie;
mod ngrammodel;
pub use ngrammodel::*;

@@ -0,0 +1,212 @@
use std::collections::HashMap;

use crate::dyntrie::DynTrie;

/// One level of the n-gram tree. Each child is reached by one token.
#[derive(Default)]
struct NgramModelNode {
    children: DynTrie<Box<NgramModelNode>>,
    /// Logical time of the last user input that ended at this node (0 if never).
    time: u64,
    /// Number of times an n-gram ending at this node was trained.
    usage: u64,
}

impl NgramModelNode {
    /// Exact lookup of the node reached by following `ngram` token by token.
    fn find(&self, ngram: &[&str]) -> Option<&NgramModelNode> {
        if ngram.is_empty() {
            return None;
        }
        let token: Vec<char> = ngram[0].chars().collect();
        let child = self.children.find(&token)?;
        if ngram.len() == 1 {
            return Some(child);
        }
        child.find(&ngram[1..])
    }

    /// Fuzzy lookup: follows every child that fuzzily matches each token of
    /// `ngram` and returns the nodes reached by the last token, paired with
    /// the word that matched it.
    fn find_many(&self, ngram: &[&str]) -> Vec<(Vec<char>, &NgramModelNode)> {
        if ngram.is_empty() {
            return Vec::new();
        }
        let token: Vec<char> = ngram[0].chars().collect();
        let ret = self.children.find_many(&token);
        if ngram.len() == 1 {
            return ret
                .into_iter()
                .map(|(word, node)| (word, node.as_ref()))
                .collect();
        }
        let mut ret2 = Vec::new();
        for (_, child) in &ret {
            ret2.extend(child.find_many(&ngram[1..]));
        }
        ret2
    }

    /// Records one occurrence of `ngram`. Only the node reached by the last
    /// token has its usage count increased; its timestamp is updated only
    /// when `current_time` is non-zero (dataset training passes 0).
    fn train(&mut self, ngram: &[&str], current_time: u64) {
        if ngram.is_empty() {
            panic!("ngram must not be empty");
        }
        let token: Vec<char> = ngram[0].chars().collect();
        let child = self.children.find_or_insert(&token, Box::new(NgramModelNode::default()));
        if ngram.len() == 1 {
            if current_time != 0 {
                child.time = current_time;
            }
            child.usage += 1;
        } else {
            child.train(&ngram[1..], current_time);
        }
    }

    /// Currently a no-op; the previous implementation is kept below for reference.
    fn debug_print(&self, _indent: usize) {
        // println!("{}{}{}", " ".repeat(indent), self.token, if self.time > 0 { "*" } else { "" });
        // for child in &self.children {
        //     child.debug_print(indent + 1);
        // }
    }
}

/// N-gram model with a logical clock that advances on every user input.
#[derive(Default)]
pub struct NgramModel {
    root: NgramModelNode,
    time: u64,
}

impl NgramModel {
    #[allow(dead_code)]
    fn find(&self, ngram: &[&str]) -> Option<&NgramModelNode> {
        self.root.find(ngram)
    }

    fn find_many(&self, ngram: &[&str]) -> Vec<(Vec<char>, &NgramModelNode)> {
        self.root.find_many(ngram)
    }

    /// Trains from a static dataset; node timestamps are left untouched.
    pub fn train_dataset(&mut self, token_list: &[&str]) {
        self.root.train(token_list, 0);
    }

    /// Trains from live user input and stamps the n-gram with the current time.
    pub fn train_input(&mut self, token_list: &[&str]) {
        self.time += 1;
        self.root.train(token_list, self.time);
    }

    pub fn debug_print(&self) {
        self.root.debug_print(0);
    }
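
    /// Returns up to 8 `(word, weight)` predictions, best first. The last entry
    /// of `history` is the word currently being typed (possibly empty); the
    /// entries before it are the preceding tokens.
    pub fn predict(&self, history: &[&str]) -> Vec<(String, f64)> {
        let mut tmin = u64::MAX;
        let mut tmax = u64::MIN;
        let mut umin = u64::MAX;
        let mut umax = u64::MIN;
        let nmin = 1;
        let nmax = 3;
        let mut candidate_nodes: Vec<(Vec<char>, &NgramModelNode, f64)> = Vec::new();
        let user_input_word = history.last().unwrap_or(&"");

        // Collect candidates from every context length; longer contexts weigh more.
        for n in nmin..=std::cmp::min(history.len(), nmax) {
            let nweight = 1.0 - (nmax - n) as f64 * 0.1;
            // The n - 1 tokens before the word being typed. For n = 1 this
            // context is empty and `find_many` returns no nodes.
            let ngram = &history[history.len() - n..history.len() - 1];
            let nodes = self.find_many(ngram);
            for (_, node) in nodes {
                node.children.for_each(&mut |curr_word, child| {
                    candidate_nodes.push((curr_word.to_owned(), child, nweight));
                    tmin = tmin.min(child.time);
                    tmax = tmax.max(child.time);
                    umin = umin.min(child.usage);
                    umax = umax.max(child.usage);
                });
            }
        }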

        // Scale each context weight by normalised recency and usage.
        candidate_nodes = candidate_nodes
            .into_iter()
            .map(|(word, node, nweight)| {
                (
                    word,
                    node,
                    nweight
                        * norm_weight(node.time, tmin, tmax)
                        * norm_weight(node.usage, umin, umax),
                )
            })
            .collect();

        if !user_input_word.is_empty() {
            let user_input_word: Vec<char> = user_input_word.chars().collect();
            let mut filtered_nodes = Vec::new();
            // Keep only context candidates that fuzzily match the typed prefix.
            for (word, node, weight) in candidate_nodes {
                let score_len = std::cmp::min(
                    (word.len() + user_input_word.len()) / 2,
                    user_input_word.len(),
                ) as f64;
                let score = str_fuzzy_match_live(&word, &user_input_word);
                if score > 0.0 {
                    let new_weight = 0.95 * (score / score_len) + 0.05 * weight;
                    filtered_nodes.push((word, node, new_weight));
                }
            }
            // Also consider every known first-level word, without any context weight.
            self.root.children.for_each(&mut |word, node| {
                let score_len = std::cmp::min(
                    (word.len() + user_input_word.len()) / 2,
                    user_input_word.len(),
                ) as f64;
                let score = str_fuzzy_match_live(&word, &user_input_word);
                if score > 0.0 {
                    let new_weight = 0.75 * (score / score_len) + 0.25 * 0.0;
                    filtered_nodes.push((word.to_owned(), node, new_weight));
                }
            });
            candidate_nodes = filtered_nodes;
        }

        // Sort best first so that de-duplication keeps the highest weight per word.
        // `total_cmp` gives a total order on f64 and cannot panic.
        candidate_nodes.sort_by(|a, b| b.2.total_cmp(&a.2));
        let mut predictions: HashMap<String, f64> = HashMap::new();
        for (word, _, weight) in candidate_nodes {
            predictions
                .entry(word.iter().collect())
                .or_insert(weight);
        }
        let mut predictions_vec: Vec<(String, f64)> = predictions.into_iter().collect();
        predictions_vec.sort_by(|a, b| b.1.total_cmp(&a.1));
        predictions_vec.into_iter().take(8).collect()
    }
}
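
/// Maps `x` from `[xmin, xmax]` to `[0, 1]` along the concave curve
/// `2t - t^2`. Values at or below `xmin` map to 0.0 (including the case
/// `xmin == xmax`), values at or above `xmax` map to 1.0.
fn norm_weight(x: u64, xmin: u64, xmax: u64) -> f64 {
    if x <= xmin {
        return 0.0;
    }
    if x >= xmax {
        return 1.0;
    }
    let xnorm = (x - xmin) as f64 / (xmax - xmin) as f64;
    2.0 * xnorm - xnorm.powi(2)
}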
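
/// Scores how well `word` matches the partially typed `current_word`, looking
/// only at the first `current_word.len()` positions: +1.0 for an exact match,
/// +0.9 for a case-insensitive match; mismatches accumulate a penalty (2.0 on
/// the first character, 1.0 otherwise) that is squared and scaled by 0.125.
fn str_fuzzy_match_live(word: &[char], current_word: &[char]) -> f64 {
    let len2 = current_word.len();
    let mut score: f64 = 0.0;
    let mut penalty: f64 = 0.0;
    for i in 0..len2 {
        let ch1 = word.get(i).unwrap_or(&' ');
        let ch2 = current_word.get(i).unwrap_or(&' ');
        if ch1 == ch2 {
            score += 1.0;
        } else if ch1.to_lowercase().eq(ch2.to_lowercase()) {
            score += 0.9;
        } else {
            penalty += if i == 0 { 2.0 } else { 1.0 };
        }
    }
    f64::max(0.0, score - 0.125 * penalty.powi(2))
}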
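
The weighting in `predict` can be checked by hand. The sketch below restates the three formulas as free functions that take `&str` and use only the standard library, so individual values are easy to verify. The function names ending in `_sketch` are hypothetical; the constants (0.9, 0.125, 0.95, 0.05) are copied from the code above, and the guard against a zero `score_len` is an addition that the original does not have.

```rust
/// Concave normalisation of `x` from [xmin, xmax] to [0, 1] using 2t - t^2.
fn norm_weight_sketch(x: u64, xmin: u64, xmax: u64) -> f64 {
    if x <= xmin {
        return 0.0;
    }
    if x >= xmax {
        return 1.0;
    }
    let t = (x - xmin) as f64 / (xmax - xmin) as f64;
    2.0 * t - t * t
}

/// Scores `candidate` against the typed prefix `typed`, position by position.
fn prefix_score_sketch(candidate: &str, typed: &str) -> f64 {
    let cand: Vec<char> = candidate.chars().collect();
    let mut score = 0.0_f64;
    let mut penalty = 0.0_f64;
    for (i, tc) in typed.chars().enumerate() {
        // Candidates shorter than the typed prefix are padded with spaces.
        let cc = cand.get(i).copied().unwrap_or(' ');
        if cc == tc {
            score += 1.0;
        } else if cc.to_lowercase().eq(tc.to_lowercase()) {
            score += 0.9;
        } else {
            penalty += if i == 0 { 2.0 } else { 1.0 };
        }
    }
    (score - 0.125 * penalty * penalty).max(0.0)
}

/// Blends the prefix score with a context weight, as done for context candidates.
/// Returns `None` when the candidate does not match the typed prefix at all.
fn blended_weight_sketch(candidate: &str, typed: &str, context_weight: f64) -> Option<f64> {
    let score = prefix_score_sketch(candidate, typed);
    if score <= 0.0 {
        return None;
    }
    let typed_len = typed.chars().count();
    let cand_len = candidate.chars().count();
    let score_len = std::cmp::min((cand_len + typed_len) / 2, typed_len);
    if score_len == 0 {
        // Added guard (assumption): avoid dividing by zero for degenerate input.
        return None;
    }
    Some(0.95 * (score / score_len as f64) + 0.05 * context_weight)
}
```

For the candidate "hello" and the typed prefix "hel" the score is 3.0 and `score_len` is min((5 + 3) / 2, 3) = 3, so a context weight of 1.0 gives 0.95 * 1.0 + 0.05 * 1.0 = 1.0. A single mismatch in the middle ("hxl") gives 2.0 - 0.125 * 1 = 1.875, while a mismatch on the first character ("xel") gives 2.0 - 0.125 * 4 = 1.5, which shows how the squared penalty punishes a wrong first letter more heavily.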

libnative/textutils/Cargo.lock (generated, normal file, 354 lines)

@@ -0,0 +1,354 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 3
[[package]]
name = "aho-corasick"
version = "1.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916"
dependencies = [
"memchr",
]
[[package]]
name = "core_maths"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3b02505ccb8c50b0aa21ace0fc08c3e53adebd4e58caa18a36152803c7709a3"
dependencies = [
"libm",
]
[[package]]
name = "displaydoc"
version = "0.2.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "either"
version = "1.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "60b1af1c220855b6ceac025d3f6ecdd2b7c4894bfe9cd9bda4fbb4bc7c0d4cf0"
[[package]]
name = "icu_collections"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "db2fa452206ebee18c4b5c2274dbf1de17008e874b4dc4f0aea9d01ca79e4526"
dependencies = [
"displaydoc",
"yoke",
"zerofrom",
"zerovec",
]
[[package]]
name = "icu_locid"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "13acbb8371917fc971be86fc8057c41a64b521c184808a698c02acc242dbf637"
dependencies = [
"displaydoc",
"litemap",
"tinystr",
"writeable",
]
[[package]]
name = "icu_provider"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6ed421c8a8ef78d3e2dbc98a973be2f3770cb42b606e3ab18d6237c4dfde68d9"
dependencies = [
"displaydoc",
"icu_locid",
"icu_provider_macros",
"stable_deref_trait",
"tinystr",
"writeable",
"yoke",
"zerofrom",
"zerovec",
]
[[package]]
name = "icu_provider_macros"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ec89e9337638ecdc08744df490b221a7399bf8d164eb52a665454e60e075ad6"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "icu_segmenter"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a717725612346ffc2d7b42c94b820db6908048f39434504cb130e8b46256b0de"
dependencies = [
"core_maths",
"displaydoc",
"icu_collections",
"icu_locid",
"icu_provider",
"icu_segmenter_data",
"utf8_iter",
"zerovec",
]
[[package]]
name = "icu_segmenter_data"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f739ee737260d955e330bc83fdeaaf1631f7fb7ed218761d3c04bb13bb7d79df"
[[package]]
name = "itertools"
version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "413ee7dfc52ee1a4949ceeb7dbc8a33f2d6c088194d9f922fb8318faf1f01186"
dependencies = [
"either",
]
[[package]]
name = "lazy_static"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe"
[[package]]
name = "libm"
version = "0.2.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4ec2a862134d2a7d32d7983ddcdd1c4923530833c9f2ea1a44fc5fa473989058"
[[package]]
name = "linkify"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1dfa36d52c581e9ec783a7ce2a5e0143da6237be5811a0b3153fedfdbe9f780"
dependencies = [
"memchr",
]
[[package]]
name = "litemap"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "643cb0b8d4fcc284004d5fd0d67ccf61dfffadb7f75e1e71bc420f4688a3a704"
[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"
[[package]]
name = "proc-macro2"
version = "1.0.88"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7c3a7fc5db1e57d5a779a352c8cdb57b29aa4c40cc69c3a68a7fedc815fbf2f9"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.37"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b5b9d34b8991d19d98081b46eacdd8eb58c6f2b201139f7c5f643cc155a633af"
dependencies = [
"proc-macro2",
]
[[package]]
name = "regex"
version = "1.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "38200e5ee88914975b69f657f0801b6f6dccafd44fd9326302a4aaeecfacb1d8"
dependencies = [
"aho-corasick",
"memchr",
"regex-automata",
"regex-syntax",
]
[[package]]
name = "regex-automata"
version = "0.4.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "368758f23274712b504848e9d5a6f010445cc8b87a7cdb4d7cbee666c1288da3"
dependencies = [
"aho-corasick",
"memchr",
"regex-syntax",
]
[[package]]
name = "regex-syntax"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b15c43186be67a4fd63bee50d0303afffcef381492ebe2c5d87f324e1b8815c"
[[package]]
name = "serde"
version = "1.0.210"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8e3592472072e6e22e0a54d5904d9febf8508f65fb8552499a1abc7d1078c3a"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.210"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "243902eda00fad750862fc144cea25caca5e20d615af0a81bee94ca738f1df1f"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "stable_deref_trait"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
[[package]]
name = "syn"
version = "2.0.79"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "89132cd0bf050864e1d38dc3bbc07a0eb8e7530af26344d3d2bbbef83499f590"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "synstructure"
version = "0.13.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8af7666ab7b6390ab78131fb5b0fce11d6b7a6951602017c35fa82800708971"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "textutils"
version = "0.1.0"
dependencies = [
"icu_segmenter",
"itertools",
"lazy_static",
"linkify",
"regex",
]
[[package]]
name = "tinystr"
version = "0.7.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9117f5d4db391c1cf6927e7bea3db74b9a1c1add8f7eda9ffd5364f40f57b82f"
dependencies = [
"displaydoc",
]
[[package]]
name = "unicode-ident"
version = "1.0.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e91b56cd4cadaeb79bbf1a5645f6b4f8dc5bde8834ad5894a8db35fda9efa1fe"
[[package]]
name = "utf8_iter"
version = "1.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be"
[[package]]
name = "writeable"
version = "0.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e9df38ee2d2c3c5948ea468a8406ff0db0b29ae1ffde1bcf20ef305bcc95c51"
[[package]]
name = "yoke"
version = "0.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c5b1314b079b0930c31e3af543d8ee1757b1951ae1e1565ec704403a7240ca5"
dependencies = [
"serde",
"stable_deref_trait",
"yoke-derive",
"zerofrom",
]
[[package]]
name = "yoke-derive"
version = "0.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "28cc31741b18cb6f1d5ff12f5b7523e3d6eb0852bbbad19d73905511d9849b95"
dependencies = [
"proc-macro2",
"quote",
"syn",
"synstructure",
]
[[package]]
name = "zerofrom"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "91ec111ce797d0e0784a1116d0ddcdbea84322cd79e5d5ad173daeba4f93ab55"
dependencies = [
"zerofrom-derive",
]
[[package]]
name = "zerofrom-derive"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0ea7b4a3637ea8669cedf0f1fd5c286a17f3de97b8dd5a70a6c167a1730e63a5"
dependencies = [
"proc-macro2",
"quote",
"syn",
"synstructure",
]
[[package]]
name = "zerovec"
version = "0.10.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "aa2b893d79df23bfb12d5461018d408ea19dfafe76c2c7ef6d4eba614f8ff079"
dependencies = [
"yoke",
"zerofrom",
"zerovec-derive",
]
[[package]]
name = "zerovec-derive"
version = "0.10.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6eafa6dfb17584ea3e2bd6e76e0cc15ad7af12b09abdd1ca55961bed9b1063c6"
dependencies = [
"proc-macro2",
"quote",
"syn",
]

@@ -0,0 +1,13 @@
[package]
name = "textutils"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
icu_segmenter = "1.5.0"
itertools = "0.13.0"
lazy_static = "1.5.0"
linkify = "0.10.0"
regex = "1.10.5"

@@ -0,0 +1,20 @@
use lazy_static::lazy_static;
use linkify::{self, LinkFinder};
use regex::Regex;

lazy_static! {
    static ref LINK_FINDER: LinkFinder = LinkFinder::new();
    // Matches subreddit (`r/name`, 3 to 21 characters) and user (`u/name`,
    // 3 to 20 characters) references, with an optional leading slash.
    static ref REDDIT_REGEX: Regex = Regex::new(r"/?(r/[a-zA-Z0-9_]{3,21}|u/[a-zA-Z0-9_-]{3,20})").unwrap();
}

/// Removes links and Reddit subreddit/user references from `text`.
/// Whitespace around the removed spans is left untouched.
pub fn preprocess_auto(text: &str) -> String {
    let mut cleaned_text = String::new();
    let mut begin_cleaned_index = 0;
    // Copy everything between the detected links, skipping the links themselves.
    for span in LINK_FINDER.links(text) {
        cleaned_text.push_str(&text[begin_cleaned_index..span.start()]);
        begin_cleaned_index = span.end();
    }
    cleaned_text.push_str(&text[begin_cleaned_index..]);
    REDDIT_REGEX.replace_all(&cleaned_text, "").to_string()
}
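
The link-stripping loop in `preprocess_auto` is a general "copy the gaps between spans" pattern. The sketch below isolates that pattern with only the standard library; `remove_spans` is a hypothetical helper, and it assumes the spans are byte ranges that are sorted, non-overlapping and aligned to character boundaries, which is what a link finder is expected to report.

```rust
/// Returns `text` with the given byte ranges removed.
/// Assumes `spans` is sorted, non-overlapping and on char boundaries.
fn remove_spans(text: &str, spans: &[(usize, usize)]) -> String {
    let mut cleaned = String::with_capacity(text.len());
    let mut cursor = 0;
    for &(start, end) in spans {
        // Copy the text between the previous span and this one.
        cleaned.push_str(&text[cursor..start]);
        cursor = end;
    }
    // Copy whatever follows the last span.
    cleaned.push_str(&text[cursor..]);
    cleaned
}
```

Because only the span itself is dropped, the spaces on both sides of a removed link survive, so "see https://example.com now" becomes "see  now" with two spaces. Callers that need single spacing would have to normalise whitespace afterwards.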

@@ -0,0 +1,52 @@
mod filter;
mod segment;

pub use filter::*;
pub use segment::*;

#[cfg(test)]
mod tests {
    use icu_segmenter::{SentenceSegmenter, WordSegmenter};

    use super::*;

    #[test]
    fn segment_sentences_simple() {
        let text = "Hello, world! How are you? I'm fine.";
        let segmenter = SentenceSegmenter::new();
        let sentences = split_sentences(text, &segmenter);
        assert_eq!(&sentences, &["Hello, world!", "How are you?", "I'm fine."]);
    }

    #[test]
    fn segment_words_simple() {
        let text = "Hello, world! How are you? I'm fine.";
        let segmenter = WordSegmenter::new_auto();
        let words = split_words(text, &segmenter);
        assert_eq!(&words, &["Hello", "world", "How", "are", "you", "I'm", "fine"]);
    }

    #[test]
    fn preprocess_auto_simple() {
        let text = "Hello, world! How are you? I'm fine. https://example.com and more";
        let cleaned_text = preprocess_auto(text);
        // Only the link is removed, so the spaces on both sides of it remain.
        assert_eq!(&cleaned_text, "Hello, world! How are you? I'm fine.  and more");
    }

    #[test]
    fn preprocess_reddit_ids() {
        let text = "have a look at r/cats, user u/example posed a cute cat in there";
        let cleaned_text = preprocess_auto(text);
        // Removing `u/example` leaves the two surrounding spaces in place.
        assert_eq!(&cleaned_text, "have a look at , user  posed a cute cat in there");
    }

    #[test]
    fn preprocess_url_markdown() {
        let text = "You can find an example [in the documentation](https://example.com) or on GitHub";
        let cleaned_text = preprocess_auto(text);
        assert_eq!(&cleaned_text, "You can find an example [in the documentation]() or on GitHub");
        let segmenter = WordSegmenter::new_auto();
        let words = split_words(&cleaned_text, &segmenter);
        assert_eq!(&words, &["You", "can", "find", "an", "example", "in", "the", "documentation", "or", "on", "GitHub"]);
    }
}

@@ -0,0 +1,63 @@
use icu_segmenter::{GraphemeClusterSegmenter, SentenceSegmenter, WordSegmenter};
use itertools::Itertools;

/// Holds one instance of each ICU segmenter so they can be reused across calls.
pub struct IcuSegmenterCache {
    sentence_segmenter: SentenceSegmenter,
    word_segmenter: WordSegmenter,
    grapheme_cluster_segmenter: GraphemeClusterSegmenter,
}

impl IcuSegmenterCache {
    pub fn new_auto() -> Self {
        let sentence_segmenter = SentenceSegmenter::new();
        let word_segmenter = WordSegmenter::new_auto();
        let grapheme_cluster_segmenter = GraphemeClusterSegmenter::new();
        Self {
            sentence_segmenter,
            word_segmenter,
            grapheme_cluster_segmenter,
        }
    }

    pub fn split_sentences<'t>(&self, text: &'t str) -> Vec<&'t str> {
        split_sentences(text, &self.sentence_segmenter)
    }

    pub fn split_words<'t>(&self, text: &'t str) -> Vec<&'t str> {
        split_words(text, &self.word_segmenter)
    }

    pub fn split_grapheme_clusters<'t>(&self, text: &'t str) -> Vec<&'t str> {
        split_grapheme_clusters(text, &self.grapheme_cluster_segmenter)
    }
}

/// Splits `text` into trimmed, non-empty sentences.
pub fn split_sentences<'t>(text: &'t str, segmenter: &SentenceSegmenter) -> Vec<&'t str> {
    // Consecutive break positions (i, j) delimit one segment each.
    segmenter
        .segment_str(text)
        .tuple_windows()
        .map(|(i, j)| text[i..j].trim())
        .filter(|sentence| !sentence.is_empty())
        .collect()
}

/// Splits `text` into word-like segments, dropping whitespace and punctuation.
pub fn split_words<'t>(text: &'t str, segmenter: &WordSegmenter) -> Vec<&'t str> {
    segmenter
        .segment_str(text)
        .iter_with_word_type()
        .tuple_windows()
        // The word type reported at a break describes the segment ending there.
        .filter(|(_, (_, segment_type))| segment_type.is_word_like())
        .map(|((i, _), (j, _))| &text[i..j])
        .collect()
}

/// Splits `text` into grapheme clusters (user-perceived characters).
pub fn split_grapheme_clusters<'t>(text: &'t str, segmenter: &GraphemeClusterSegmenter) -> Vec<&'t str> {
    segmenter
        .segment_str(text)
        .tuple_windows()
        .map(|(i, j)| &text[i..j])
        .collect()
}
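
All three split functions turn a sequence of break positions into string slices by pairing each position with the next one. The sketch below shows that step in isolation with the standard library's `windows(2)` in place of `tuple_windows`. Both helper names are hypothetical, and the boundaries are assumed to be ascending byte offsets on character boundaries, as a segmenter would report them.

```rust
/// Returns the slices of `text` delimited by each pair of consecutive boundaries.
/// Assumes `boundaries` is ascending and every offset is a char boundary.
fn slices_between<'t>(text: &'t str, boundaries: &[usize]) -> Vec<&'t str> {
    boundaries
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

/// Same as `slices_between`, but trims each slice and drops the empty ones,
/// mirroring what `split_sentences` does with its segments.
fn trimmed_slices_between<'t>(text: &'t str, boundaries: &[usize]) -> Vec<&'t str> {
    slices_between(text, boundaries)
        .into_iter()
        .map(str::trim)
        .filter(|segment| !segment.is_empty())
        .collect()
}
```

With the text "ab cd" and the boundaries 0, 2, 3 and 5, the first helper yields "ab", " " and "cd", and the second one drops the whitespace-only segment. Fewer than two boundaries produce an empty result, since there is no pair to slice between.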

utils/flesttools/Cargo.lock (generated, normal file, 509 lines)

@@ -0,0 +1,509 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 3
[[package]]
name = "aho-corasick"
version = "1.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916"
dependencies = [
"memchr",
]
[[package]]
name = "byteorder"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
[[package]]
name = "cc"
version = "1.1.30"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b16803a61b81d9eabb7eae2588776c4c1e584b738ede45fdbb4c972cec1e9945"
dependencies = [
"shlex",
]
[[package]]
name = "core_maths"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3b02505ccb8c50b0aa21ace0fc08c3e53adebd4e58caa18a36152803c7709a3"
dependencies = [
"libm",
]
[[package]]
name = "displaydoc"
version = "0.2.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "either"
version = "1.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "60b1af1c220855b6ceac025d3f6ecdd2b7c4894bfe9cd9bda4fbb4bc7c0d4cf0"
[[package]]
name = "flest"
version = "0.1.0"
dependencies = [
"fxhash",
]
[[package]]
name = "flesttools"
version = "0.1.0"
dependencies = [
"flest",
"pancurses",
"serde",
"serde_json",
"textutils",
]
[[package]]
name = "fxhash"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c31b6d751ae2c7f11320402d34e41349dd1016f8d5d45e48c4312bc8625af50c"
dependencies = [
"byteorder",
]
[[package]]
name = "icu_collections"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "db2fa452206ebee18c4b5c2274dbf1de17008e874b4dc4f0aea9d01ca79e4526"
dependencies = [
"displaydoc",
"yoke",
"zerofrom",
"zerovec",
]
[[package]]
name = "icu_locid"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "13acbb8371917fc971be86fc8057c41a64b521c184808a698c02acc242dbf637"
dependencies = [
"displaydoc",
"litemap",
"tinystr",
"writeable",
]
[[package]]
name = "icu_provider"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6ed421c8a8ef78d3e2dbc98a973be2f3770cb42b606e3ab18d6237c4dfde68d9"
dependencies = [
"displaydoc",
"icu_locid",
"icu_provider_macros",
"stable_deref_trait",
"tinystr",
"writeable",
"yoke",
"zerofrom",
"zerovec",
]
[[package]]
name = "icu_provider_macros"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ec89e9337638ecdc08744df490b221a7399bf8d164eb52a665454e60e075ad6"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "icu_segmenter"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a717725612346ffc2d7b42c94b820db6908048f39434504cb130e8b46256b0de"
dependencies = [
"core_maths",
"displaydoc",
"icu_collections",
"icu_locid",
"icu_provider",
"icu_segmenter_data",
"utf8_iter",
"zerovec",
]
[[package]]
name = "icu_segmenter_data"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f739ee737260d955e330bc83fdeaaf1631f7fb7ed218761d3c04bb13bb7d79df"
[[package]]
name = "itertools"
version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "413ee7dfc52ee1a4949ceeb7dbc8a33f2d6c088194d9f922fb8318faf1f01186"
dependencies = [
"either",
]
[[package]]
name = "itoa"
version = "1.0.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "49f1f14873335454500d59611f1cf4a4b0f786f9ac11f4312a78e4cf2566695b"
[[package]]
name = "lazy_static"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe"
[[package]]
name = "libc"
version = "0.2.160"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0b21006cd1874ae9e650973c565615676dc4a274c965bb0a73796dac838ce4f"
[[package]]
name = "libm"
version = "0.2.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4ec2a862134d2a7d32d7983ddcdd1c4923530833c9f2ea1a44fc5fa473989058"
[[package]]
name = "linkify"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1dfa36d52c581e9ec783a7ce2a5e0143da6237be5811a0b3153fedfdbe9f780"
dependencies = [
"memchr",
]
[[package]]
name = "litemap"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "643cb0b8d4fcc284004d5fd0d67ccf61dfffadb7f75e1e71bc420f4688a3a704"
[[package]]
name = "log"
version = "0.4.22"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a7a70ba024b9dc04c27ea2f0c0548feb474ec5c54bba33a7f72f873a39d07b24"
[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"
[[package]]
name = "ncurses"
version = "5.101.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e2c5d34d72657dc4b638a1c25d40aae81e4f1c699062f72f467237920752032"
dependencies = [
"cc",
"libc",
"pkg-config",
]
[[package]]
name = "pancurses"
version = "0.17.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0352975c36cbacb9ee99bfb709b9db818bed43af57751797f8633649759d13db"
dependencies = [
"libc",
"log",
"ncurses",
"pdcurses-sys",
"winreg",
]
[[package]]
name = "pdcurses-sys"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "084dd22796ff60f1225d4eb6329f33afaf4c85419d51d440ab6b8c6f4529166b"
dependencies = [
"cc",
"libc",
]
[[package]]
name = "pkg-config"
version = "0.3.31"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "953ec861398dccce10c670dfeaf3ec4911ca479e9c02154b3a215178c5f566f2"
[[package]]
name = "proc-macro2"
version = "1.0.88"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7c3a7fc5db1e57d5a779a352c8cdb57b29aa4c40cc69c3a68a7fedc815fbf2f9"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.37"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b5b9d34b8991d19d98081b46eacdd8eb58c6f2b201139f7c5f643cc155a633af"
dependencies = [
"proc-macro2",
]
[[package]]
name = "regex"
version = "1.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "38200e5ee88914975b69f657f0801b6f6dccafd44fd9326302a4aaeecfacb1d8"
dependencies = [
"aho-corasick",
"memchr",
"regex-automata",
"regex-syntax",
]
[[package]]
name = "regex-automata"
version = "0.4.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "368758f23274712b504848e9d5a6f010445cc8b87a7cdb4d7cbee666c1288da3"
dependencies = [
"aho-corasick",
"memchr",
"regex-syntax",
]
[[package]]
name = "regex-syntax"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b15c43186be67a4fd63bee50d0303afffcef381492ebe2c5d87f324e1b8815c"
[[package]]
name = "ryu"
version = "1.0.18"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f3cb5ba0dc43242ce17de99c180e96db90b235b8a9fdc9543c96d2209116bd9f"
[[package]]
name = "serde"
version = "1.0.210"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8e3592472072e6e22e0a54d5904d9febf8508f65fb8552499a1abc7d1078c3a"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.210"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "243902eda00fad750862fc144cea25caca5e20d615af0a81bee94ca738f1df1f"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "serde_json"
version = "1.0.129"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6dbcf9b78a125ee667ae19388837dd12294b858d101fdd393cb9d5501ef09eb2"
dependencies = [
"itoa",
"memchr",
"ryu",
"serde",
]
[[package]]
name = "shlex"
version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64"
[[package]]
name = "stable_deref_trait"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
[[package]]
name = "syn"
version = "2.0.79"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "89132cd0bf050864e1d38dc3bbc07a0eb8e7530af26344d3d2bbbef83499f590"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "synstructure"
version = "0.13.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8af7666ab7b6390ab78131fb5b0fce11d6b7a6951602017c35fa82800708971"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "textutils"
version = "0.1.0"
dependencies = [
"icu_segmenter",
"itertools",
"lazy_static",
"linkify",
"regex",
]
[[package]]
name = "tinystr"
version = "0.7.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9117f5d4db391c1cf6927e7bea3db74b9a1c1add8f7eda9ffd5364f40f57b82f"
dependencies = [
"displaydoc",
]
[[package]]
name = "unicode-ident"
version = "1.0.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e91b56cd4cadaeb79bbf1a5645f6b4f8dc5bde8834ad5894a8db35fda9efa1fe"
[[package]]
name = "utf8_iter"
version = "1.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be"
[[package]]
name = "winapi"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
dependencies = [
"winapi-i686-pc-windows-gnu",
"winapi-x86_64-pc-windows-gnu",
]
[[package]]
name = "winapi-i686-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
[[package]]
name = "winapi-x86_64-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
[[package]]
name = "winreg"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a27a759395c1195c4cc5cda607ef6f8f6498f64e78f7900f5de0a127a424704a"
dependencies = [
"winapi",
]
[[package]]
name = "writeable"
version = "0.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e9df38ee2d2c3c5948ea468a8406ff0db0b29ae1ffde1bcf20ef305bcc95c51"
[[package]]
name = "yoke"
version = "0.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c5b1314b079b0930c31e3af543d8ee1757b1951ae1e1565ec704403a7240ca5"
dependencies = [
"serde",
"stable_deref_trait",
"yoke-derive",
"zerofrom",
]
[[package]]
name = "yoke-derive"
version = "0.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "28cc31741b18cb6f1d5ff12f5b7523e3d6eb0852bbbad19d73905511d9849b95"
dependencies = [
"proc-macro2",
"quote",
"syn",
"synstructure",
]
[[package]]
name = "zerofrom"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "91ec111ce797d0e0784a1116d0ddcdbea84322cd79e5d5ad173daeba4f93ab55"
dependencies = [
"zerofrom-derive",
]
[[package]]
name = "zerofrom-derive"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0ea7b4a3637ea8669cedf0f1fd5c286a17f3de97b8dd5a70a6c167a1730e63a5"
dependencies = [
"proc-macro2",
"quote",
"syn",
"synstructure",
]
[[package]]
name = "zerovec"
version = "0.10.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "aa2b893d79df23bfb12d5461018d408ea19dfafe76c2c7ef6d4eba614f8ff079"
dependencies = [
"yoke",
"zerofrom",
"zerovec-derive",
]
[[package]]
name = "zerovec-derive"
version = "0.10.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6eafa6dfb17584ea3e2bd6e76e0cc15ad7af12b09abdd1ca55961bed9b1063c6"
dependencies = [
"proc-macro2",
"quote",
"syn",
]


@@ -0,0 +1,13 @@
[package]
name = "flesttools"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
flest = { path = "../../libnative/flest" }
textutils = { path = "../../libnative/textutils" }
pancurses = { version = "0.17.0", features = ["wide"] }
serde = "1.0.203"
serde_json = "1.0.120"


@@ -0,0 +1,148 @@
use flest::NgramModel;
use pancurses::Input;
use std::env;
use std::fs;
use std::io::{BufRead, BufReader};
use textutils::IcuSegmenterCache;

/// Token marking a sentence boundary in the training and prediction input.
const TOKEN_SENTENCE_SEPARATOR: &str = "\\sep";

/// Splits `text` into sentences and words, placing a sentence separator token
/// before the first sentence and after every sentence.
fn tokenize_text(text: &str) -> Vec<&str> {
    let segmenters = IcuSegmenterCache::new_auto();
    let sentences = segmenters.split_sentences(text);
    let mut tokens: Vec<&str> = Vec::new();
    tokens.push(TOKEN_SENTENCE_SEPARATOR);
    for sentence in sentences {
        let words = segmenters.split_words(sentence);
        for word in words {
            tokens.push(word);
        }
        tokens.push(TOKEN_SENTENCE_SEPARATOR);
    }
    //println!("Tokens: {:?}", tokens);
    tokens
}

/// Preprocesses `text`, tokenizes it, and trains `model` on every n-gram
/// of length 2, 3, and 4.
fn train_model(text: &str, model: &mut NgramModel) {
    let text = textutils::preprocess_auto(text);
    let text = text.trim();
    if text.is_empty() {
        return;
    }
    let tokens = tokenize_text(text);
    //println!("Tokens: {:?}", tokens);
    for n in [2, 3, 4] {
        // `windows(n)` yields nothing when fewer than `n` tokens exist.
        for ngram in tokens.windows(n) {
            model.train_dataset(ngram);
        }
    }
}

fn train_from_plain_text(path: &str, model: &mut NgramModel) {
    let text = fs::read_to_string(path).expect("Failed to read file");
    train_model(&text, model);
}
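
/// Trains `model` on the `body` field of each JSON line in the file at `path`,
/// skipping comments authored by AutoModerator.
fn train_from_reddit_comments(path: &str, model: &mut NgramModel) {
    let file = fs::File::open(path).expect("Failed to open file");
    let reader = BufReader::new(file);
    let mut line_count = 0;
    for line in reader.lines() {
        // Lines that cannot be read are ignored but still count toward the cap.
        if let Ok(line) = line {
            let json: serde_json::Value =
                serde_json::from_str(&line).expect("Failed to parse JSON");
            if let Some(author) = json.get("author").and_then(|it| it.as_str()) {
                if author == "AutoModerator" {
                    // Skipped comments do not count toward the cap.
                    continue;
                }
            }
            if let Some(body) = json.get("body").and_then(|it| it.as_str()) {
                train_model(body, model);
            }
        }
        // Cap the amount of training data read from the file.
        line_count += 1;
        if line_count > 10000 {
            break;
        }
    }
}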

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        eprintln!("Usage: {} <file_path>", args[0]);
        // Signal the usage error to the caller with a non-zero exit status.
        std::process::exit(1);
    }
    let path = &args[1];
    let mut model = NgramModel::default();
    if path.ends_with(".reddit.jsonl") {
        train_from_reddit_comments(path, &mut model);
    } else {
        train_from_plain_text(path, &mut model);
    }

    let window = pancurses::initscr();
    let mut input_text = String::new();
    pancurses::noecho();
    window.keypad(true);

    loop {
        let mut words: Vec<&str> = input_text.split_whitespace().collect();
        words.insert(0, TOKEN_SENTENCE_SEPARATOR);
        // An empty trailing word asks the model to predict the next word
        // instead of completing the current one.
        if input_text.ends_with(' ') || words.last() == Some(&TOKEN_SENTENCE_SEPARATOR) {
            words.push("");
        }
        let predictions = model.predict(&words);

        window.clear();
        window.addstr("N-gram model debug frontend\n");
        window.addstr(" demo tokenizer only supports a single-line sentence as input text!\n\n");
        window.addstr(format!("enter text: {}\n", input_text));
        window.addstr(format!("detected words: {:?}\n\n", words));
        window.addstr("predictions:\n");
        for (i, (word, weight)) in predictions.iter().enumerate() {
            // Highlight the top prediction when its weight exceeds 0.9.
            if i == 0 && *weight > 0.9 {
                window.attron(pancurses::A_BOLD);
            }
            window.addstr(format!(" {}. {} (c={:.2})\n", i + 1, word, weight));
            if i == 0 && *weight > 0.9 {
                window.attroff(pancurses::A_BOLD);
            }
        }
        if predictions.is_empty() {
            window.addstr(" (none)\n");
        }
        // Place the cursor after the typed text (row 3, after the 12-character
        // "enter text: " prefix). Count characters rather than bytes so that
        // non-ASCII input does not push the cursor too far to the right.
        window.mv(3, 12 + input_text.chars().count() as i32);
        window.refresh();
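
        match window.getch().unwrap() {
            Input::KeyF10 => break,
            // Some terminals report backspace as a DEL or BS character
            // instead of `KeyBackspace`.
            Input::KeyBackspace | Input::Character('\u{7f}') | Input::Character('\u{8}') => {
                input_text.pop();
            }
            Input::Character('\n') => train_model(&input_text, &mut model),
            Input::Character(ch) => input_text.push(ch),
            _ => {}
        }
    }
    pancurses::endwin();
}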

utils/setup_vscode_dev_env.sh Executable file

@@ -0,0 +1,27 @@
#!/bin/bash
WORKSPACE_ROOT_DIR="$(realpath "$(dirname "$0")/..")"
VSCODE_DIR="$WORKSPACE_ROOT_DIR/.vscode"
VSCODE_SETTINGS_JSON_PATH="$VSCODE_DIR/settings.json"
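
# WORKSPACE_ROOT_DIR is resolved through realpath, so compare it against the
# physical working directory to avoid false mismatches caused by symlinks.
if [ "$WORKSPACE_ROOT_DIR" != "$(pwd -P)" ]; then
    echo "This script must be executed from the workspace root dir!" >&2
    exit 1
fi

mkdir -p "$VSCODE_DIR"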

echo -en "{\n" > "$VSCODE_SETTINGS_JSON_PATH"

# <rust-analyzer>
echo -en " \"rust-analyzer.linkedProjects\": [\n" >> "$VSCODE_SETTINGS_JSON_PATH"
# Read NUL-delimited paths so that paths containing whitespace are not split.
while IFS= read -r -d '' rust_project_path; do
    printf ' "%s",\n' "$rust_project_path" >> "$VSCODE_SETTINGS_JSON_PATH"
done < <(find "$WORKSPACE_ROOT_DIR" -type f -name "Cargo.toml" -print0)
echo -en " ],\n" >> "$VSCODE_SETTINGS_JSON_PATH"
# </rust-analyzer>

echo -en "}\n" >> "$VSCODE_SETTINGS_JSON_PATH"