jingtao8a's blog

Hist-Tree: Those Who Ignore It Are Doomed to Learn

Posted on 2024-10-03 Edited on 2025-03-20 In Learned Index

背景和应用场景：

在之前的研究中，学习型索引的性能超过过了传统索引，这被归功于是利用数据分布这一特性。但是这篇文章认为，在比较试验中之学习型索引之所以能表现良好，是因为利用了一些隐含条件（例如，bulk-loading时，数据是排序的,数据的范围[min,max],…，））本文提出了一个传统索引结构（Hist-tree），利用了这些隐藏条件，并在只读工作负载的比较实验中，比RMI，PGM-index，RadixSpline的性能要好

问题描述：

学习索引做了一些隐含假设，例如

（1）假设bulk-loading数据已经排序

（2）数据的范围[min, max]

（3）数据是不被修改的

但是传统数据结构例如B-tree并没有这样的假设，这样会让B-tree的构建时间变得更长，或者导致B-tree和学习型索引进行比较实验时，处于劣势

方法：

Overview

提出Hist-tree（利用了隐含假设（1）（2））

以及压缩版本Compact-Hist-tree（利用了隐含假设（1）（2）（3））

图中的Hist-tree是从一个数据集中选取200个key构建的，数据集的数据分布如（a）所示，（b）（c）为Hist-tree和Compact Hist-tree

（b）Hist-tree

每一个node根据histogram分成多个bin，terminal bin是灰色的，non-terminal bin是白色的，每个bin的值为该子树所包含的key的数量

查找过程为先锁定一个terminal-bin（有threshold），之后进行二分查找

（c） Compact Hist-tree

每一行代表一个node，每个node根据histogram分为多个binterminal bin是灰色的，non-terminal bin是白色的，白色的bin存储的值为child node的offset，灰色的bin存储累计直到该bin的所有key的数量

physical layout

Hist-tree的物理结构为两个32位整形数组，第一个数组包含的inner node（含有child node的node 或者说是至少包含一个non-terminal bin的node），第二个数组包含leaf node（只有terminal bin的node）
例如：（b）中的root node要存储在第一个数组中占八个元素，其中由四个histogram bin计数组成，后跟四个child pointers(如果该pointer对应的bin为terminal bin或者指向leaf node，将高位flag标记为1)，并且第一个和第四个child pointer的高位flag标记为1

结果:

个人实现版本：https://github.com/jingtao8a/Hist

Towards Practical Learned Indexing

Posted on 2024-10-03 Edited on 2025-03-20 In Learned Index

背景和应用场景：

(1)Learned Index

(2)RadixSpline

只读工作负载
相对于Learned Index，RadixSpline含有两个超参数构建spline points的error和radix table的大小（b-length prefix of a lookup key），训练更快（one-single pass）或者说bulk-loading时间更短
查找过程，radix-table定位包含lookup key的两个spline point，之后使用线性插值得到predict_pos，最后使用二分查找定位real_pos

(3)PLEX

RadixSpline + CHT
只有一个超参数构建spline points的error
只读工作负载

问题描述：

(1)RadixSpline的超参数（radix table的大小（b-length prefix of a lookup key））很难调试

(2)RadixSpline的radix-layer可能因为异常值（如face数据集中key的最长公共前缀很长时，这样radix-table所需要的size就会很大）而导致性能下降

解决方案：

提出PLEX(RadixSpline + compact-hist-tree)，需要two-pass

PLEX对RadixSpline改进，解决两个问题

1.Radix layer很难参数化。

2.Radix layer会被outliers影响。

RadixSpline使用radix table来索引spline points，然而当key的二进制表示的最长公共前缀很大时，RadixSpline可能会降低性能。PLEX通过由CHT表示的RadixTree来解决这个问题

Hist-tree：

Compact-hist-tree是Hist-tree在只读工作负载中的优化

Compact-hist-tree的构建方式

（1）从hist-tree构建

（2）直接从data构建（本文实现）

auto-tuner

为Compact-hist-tree实现auto-tuner（本文实现）

cost-model

radix-Table cost-model

Radix-table使用lookup-key的prefix长度为r

则radix-table的长度为2^r

k为lookup key，则bk为radix-table通过k锁定的区间长度

Average lookup time 如下:

Memory Consumption是O(2^R)

CHT cost-model

CHT有两个参数：（1）使用lookup-key的prefix长度为r（2）通过CHT确定对应的node后，在node中对应的bin中需要进行二分查找的bound

Average lookup time 如下:

结果

EWALI:a Data-aware Learned Index Schema for Efficient Writes

Posted on 2024-09-28 Edited on 2025-03-20 In Learned Index

应用场景：

(1)支持读写工作负载,在重写工作负载中性能表现很好

(2)使用轻量的light-weight data-aware data partition algorithm构建模型

(3)当data distribution发生变化，EWALI会自动分裂相关叶子节点，重训练模型

(4)并采用duel buffer（双缓冲）来处理新数据，并采用延迟更新机制合并数据，提高写入性能

问题描述：
（1）RMI不支持写操作，插入新数据导致data distribution变化，需要重训练所有模型
（2）RMI构建时采用统一的数据划分策略，没有考虑划分到同一模型中数据的相似性，导致预测不准确

难点和分析过程：

（1）现在支持写操作的方法包括就地插入（gap array，需要移动数据或产生不必要的空间开销）缓冲区插入（发生写入阻塞）

（2）现存的data-partition-algorithm

K-means算法（Dabble模型）

贪心算法（ASLM）

Piece-wise linear function（PGM-iNDEX FITing-Tree）

以上不同的算法有不同的计算开销和误差

解决方案：

Data-partition-algorithm

考虑lower model error 和 time cost 以及 memory usage选择ShrinkingCone

DARMI Design

Model Training:

(1)non-leaf node
(2)leaf node

Model Retraining:

当incremental buffer与segments合并后，改变了数据分布，后台线程被唤醒进行retraining操作

Node Split:

当一个Node管辖的data数量超过一定数量时，会产生以下影响
1.过多的数据会导致重训练的时候要扫描更多的key，增加训练耗时，降低写性能
2.过多的数据也会让线性模型更难去拟合，精准度会下降

(1)horizontal split
Disadvantages: result in lower prediction accuracy of parent model

(2)Vertical split
Disadvantages: increase the height of DARMI, result in high lookup time

DARMI会动态地选择两种split方式

Retrain DARMI

当DARMI的高度达到一个阈值，查找性能退化地过于严重了，需要对root Node进行重训练，
将所有的old leaf node的first key用来训练root node的线性模型，DARMI又变成two-layer

Incremental Buffer Design

DARMI的核心操作

结果：

数据集：

真实数据集（NYCT、OSM）合成数据集（Lognormal）

工作负载：

（1）read-only （2）write-only （3）write-read-balance (50% read 50%write)

（4）write-heavy(95% write 5% read) （5）read-heavy(95% read 5%write)

（6）short-ranges(95% range queries 5% write)

ASLM:Adaptive Single Layer Model for Learned Index

Posted on 2024-09-28 Edited on 2025-03-20 In Learned Index

应用场景：

读写场景

问题描述：

Learned Index

(1)RMI 采用的分区策略没有考虑数据之间的相似性

(2)RMI 不支持更新

方法：

分区算法

Method1:

SLM:这是一个简单的模型，只有一层，这一层包含K个子模型

Method2：

数据点之间三角形面积可替换为欧式距离，用来表示两个数据点之间的相似度

数据插入：

结果：

Dabble:基于中间层的可扩展学习索引技术

Posted on 2024-09-28 Edited on 2025-03-20 In Learned Index

应用场景：

读写场景

解决方案：

提出Dabble模型

（1）用K-Means算法把数据集进行聚类，从而划分为K个数据区域，使得每个区域内的数据分布尽可能的一致

（2）在模型训练阶段，我们对K个数据区域采用前馈神经网络进行学习，并在模型的损失函数中加入数据的访问热度，从而使模型对访问频率高的数据预测率该更加精准

（3）针对数据插入问题，借鉴LSM树中的延迟更新机制，在内存中开辟一块缓存用来存放新插入的数据，当缓存溢出时一次性将数据进行插入

（4）针对索引更新问题，提出了一种基于中间层的机制，通过模型解耦的方式缓解了索引带来的消耗

结果：

（1）在 Lognormal 人工数据集上,Dabble 模型的查询性能比 B+树提升了 69%,比学习索引提升了 13%,比 ALEX提升了 8%;内存占用比 B+树减少了 92%,神经网络模型的内存占用比学习索引减少了 99%,
比 ALEX 减少了 98%.同时,Dabble 模型的插入性能比 B+树提升了 250%,比 ALEX 提升了 123%;

（2）在真实数据集 Weblogs 上,Dabble 的查询性能比 B+树提升了 49%,比学习索引提升了 33%,比 ALEX 提升了 9%;内存占用比 B+树减少了 92%,神经网络模型的内存占用比学习索引减少了 99%,比 ALEX 减少了 98%.同时,Dabble 的插入性能比 B+树提升了 400%,比 ALEX 提升了 88%.

VMware的三种网络模式

Posted on 2024-08-20 Edited on 2025-03-20 In 随笔

参考链接

vue3学习——Tutorial

Posted on 2024-05-12 Edited on 2025-03-20 In vue3

reactive() 和 ref()

<script setup>
import { reactive, ref } from 'vue'

const counter = reactive({ count: 0 })
const message = ref('Hello World!')
</script>

<template>
  <h1>{{ message }}</h1>
  <h1>{{ message.split('').reverse().join('') }}</h1>
  <p>Count is: {{ counter.count }}</p>
</template>

v-bind

v-bind:id=””

v-bind:class=””

v-bind:value=””

简化为

:id=””

:class=””

:value=””

<script setup>
import { ref } from 'vue'

const titleClass = ref('title')
</script>

<template>
  <h1 :class="titleClass">Make me red</h1>
</template>

<style>
.title {
  color: red;
}
</style>

v-on

Event Listeners

v-on:click=””

v-on:input=””

简化为

@click=””

@input=””

<script setup>
import { ref } from 'vue'

const count = ref(0)

function increment() {
  count.value++
}
</script>

<template>
  <button @click="increment">Count is: {{ count }}</button>
</template>

v-model

<script setup>
import { ref } from 'vue'

const text = ref('')
</script>

<template>
  <input v-model="text" placeholder="Type here">
  <p>{{ text }}</p>
</template>

v-if 和 v-else

Conditional Rendering

<script setup>
import { ref } from 'vue'

const awesome = ref(true)

function toggle() {
  awesome.value = !awesome.value
}
</script>

<template>
  <button @click="toggle">Toggle</button>
  <h1 v-if="awesome">Vue is awesome!</h1>
  <h1 v-else>Oh no 😢</h1>
</template>

v-for

List Rendering

<script setup>
import { ref } from 'vue'

// give each todo a unique id
let id = 0

const newTodo = ref('')
const todos = ref([
  { id: id++, text: 'Learn HTML' },
  { id: id++, text: 'Learn JavaScript' },
  { id: id++, text: 'Learn Vue' }
])
// const index = ref("123")
function addTodo() {
  // ...
  todos.value.push({id: id++, text: newTodo.value});
  newTodo.value = '';
}

function removeTodo(todo) {
  // ...
  // index.value
  todos.value.splice(todos.value.indexOf(todo), 1)
}
</script>

<template>
  <form @submit.prevent="addTodo">
    <input v-model="newTodo" required placeholder="new todo">
    <button>Add Todo</button>
  </form>
<!--   <p>
    {{index}}
  </p> -->
  <ul>
    <li v-for="todo in todos" :key="todo.id">
      {{ todo.text }}
      <button @click="removeTodo(todo)">X</button>
    </li>
  </ul>
</template>

computed

introducing computed().We can create a computed ref that computes its .value based on other reactive data sources:

<script setup>
import { ref, computed } from 'vue'

let id = 0

const newTodo = ref('')
const hideCompleted = ref(false)
const todos = ref([
  { id: id++, text: 'Learn HTML', done: true },
  { id: id++, text: 'Learn JavaScript', done: true },
  { id: id++, text: 'Learn Vue', done: false }
])

const filteredTodos = computed(() => {
  return hideCompleted.value
    ? todos.value.filter((t) => !t.done)
    : todos.value
})

function addTodo() {
  todos.value.push({ id: id++, text: newTodo.value, done: false })
  newTodo.value = ''
}

function removeTodo(todo) {
  todos.value = todos.value.filter((t) => t !== todo)
}
</script>

<template>
  <form @submit.prevent="addTodo">
    <input v-model="newTodo" required placeholder="new todo">
    <button>Add Todo</button>
  </form>
  <ul>
    <li v-for="todo in filteredTodos" :key="todo.id">
      <input type="checkbox" v-model="todo.done">
      <span :class="{ done: todo.done }">{{ todo.text }}</span>
      <button @click="removeTodo(todo)">X</button>
    </li>
  </ul>
  <button @click="hideCompleted = !hideCompleted">
    {{ hideCompleted ? 'Show all' : 'Hide completed' }}
  </button>
</template>

<style>
.done {
  text-decoration: line-through;
}
</style>

Lifecycle and Template Refs

Hook函数 onMounted onUpdated 等 …

<script setup>
import { ref, onMounted } from 'vue'

const pElementRef = ref(null)
const customValue = ref("123")
onMounted(() => {
  customValue.value = pElementRef.value.textContent
  pElementRef.value.textContent = 'mounted!'
})
</script>

<template>
  <p> {{customValue}} </p>
  <p ref="pElementRef">Hello</p>
</template>

watch

import { ref, watch } from 'vue'

const count = ref(0)

watch(count, (newCount) => {
  // yes, console.log() is a side effect
  console.log(`new count is: ${newCount}`)
})

watch() can directly watch a ref, and the callback gets fired whenever count’s value changes.

<script setup>
import { ref, watch } from 'vue'

const todoId = ref(1)
const todoData = ref(null)

async function fetchData(id) {
  todoData.value = null
  const res = await fetch(
    `https://jsonplaceholder.typicode.com/todos/${id.value}`
  )
  todoData.value = await res.json()
}
  
fetchData(todoId)
watch(todoId, (todoId) => {
  fetchData(todoId);
})

  
</script>

<template>
  <p>Todo id: {{ todoId }}</p>
  <button @click="todoId++" :disabled="!todoData">Fetch next todo</button>
  <p v-if="!todoData">Loading...</p>
  <pre v-else>{{ todoData }}</pre>
</template>

Components

<script setup>
import ChildComp from './ChildComp.vue'
</script>

<template>
  <!-- render child component -->
  <ChildComp></ChildComp>
</template>

Props

A child component can accept input from the parent via props. First, it needs to declare the props it accepts:

App.vue

<script setup>
import { ref } from 'vue'
import ChildComp from './ChildComp.vue'

const greeting = ref('Hello from parent')
</script>

<template>
  <ChildComp :msg="greeting" />
</template>

ChildComp.vue

<script setup>
const props = defineProps({
  msg: String
})
</script>

<template>
  <h2>{{ msg || 'No props passed yet' }}</h2>
</template>

Emits

In addition to receiving props, a child component can also emit events to the parent:

<script setup>
import { ref } from 'vue'
import ChildComp from './ChildComp.vue'

const childMsg = ref('No child msg yet')
</script>

<template>
  <ChildComp @response="(msg) => childMsg = msg " @response1="(msg)=> childMsg=msg"/>
  <p>{{ childMsg }}</p>
</template>

<script setup>
const emit = defineEmits(['response', 'response1'])

emit('response', 'hello from child')
emit('response1', "flafkjka")
</script>

<template>
  <h2>Child component</h2>
</template>

Slots

In addition to passing data via props, the parent component can also pass down template fragments to the child via slots

App.vue

<script setup>
import { ref } from 'vue'
import ChildComp from './ChildComp.vue'

const msg = ref('from parent')
</script>

<template>
  <ChildComp></ChildComp>
</template>

ChildComp.vue

1
2
3

<template>
  <slot>Fallback content</slot>
</template>

javascript基础语法

Posted on 2024-05-11 Edited on 2025-03-20 In javascript

数据类型

值类型(基本类型) ：字符串（String）、数字(Number)、布尔(Boolean)、空（null）、未定义（Undefined）、Symbol。

所有的值类型(基本类型), 除了null和undefined, 都可以被看做对象

1
2
3

new String("abc")
new Number(123)
new Boolean(true)

js中一切皆对象

引用数据类型(对象类型) ：对象(Object)、数组(Array)、函数(Function)、正则（RegExp）和日期（Date）

ps: 定义变量得使用 var 或者 let
可以用typeof 来查看变量的类型

函数

function myFunction(a, b) {
    return a * b;
}

原型链

class关键字是原型系统上的一个语法糖，创造了一种基于类的语言的错觉。

proto, prototype, constructor的关系

DILI论文重点部分

Posted on 2024-04-02 Edited on 2025-03-20 In Learned Index

DILI：A Distribution-driven Learned Index

摘要：

DILI的每个节点（internal node 和 leaf node）都有一个Linear model，并且internal node的key’s range被它的child node均分，当进行key查找时，可以准确地找到对应的leaf node。在leaf node中的linear model是可以准确预测key的位置的。

为了构建DILI，首先构建一颗bottom-up tree,然后使用这个bottom-up tree，自顶向下构建一个DILI树。DILI在the number of leaf nodes和tree height之间有一个很好的平衡（the number of leaf nodes 和 tree height是影响key搜索时间的关键因素）。设计了灵活的插入和删除算法，并在必要时调整树形结构。

实验结果表明DILI比其它baseline在不同的workloads上要表现的更好。

1.INTRODUCTION

RMI：仅支持查找

ALEX：扩展RMI，在leaf node中使用 gapped array，使用 cost model初始化RMI结构和动态更新RMI结构。存在last-mile问题

LIPP：解决last-mile问题，但是没有利用data distribution

DILI-LO：未解决last-mile问题，利用data distribution

DILI：解决last-mile问题，利用data distribution

构建DILI的目的是实现最好的搜索性能 —> 搜索的开销取决于(1)leaf node的深度(2)leaf node中的linear model预测的accuracy —> 这两个factors都应该在DILI的构建中被考虑

因此，提出了一个two-phase bulk loading 方法

Phase1 -> 使用greedy merging algorithm（考虑到了上面的两个factors）构建一棵BU-tree（BU-tree的internal node’srange并不是被它的child nodes均分的，因此在BU-tree进行key搜索时，确定child node需要额外的开销）

Phase2 -> 借鉴BU-tree的结构构建DILI

Contribution

提出DILI，algorithms 和它的cost analysis
构建DILI的two-phase bulk loading 方法
对DILI的leaf node的local optimization，解决last-mile问题
对于DILI的insertion和deletion操作提出相应的algorithm来维持查找性能
在synthetic data和real data上证实DILI的性能

2.OVERVIEW OF DILI

DILI的结构如Figure 1所示

DILI在没有使用local Optimization的时候SEARCH算法如下，查找key时，锁定了leaf node后需要再进行指数查找来确定position。

Local Optimization strategy

在leaf node中，先用线性模型预测key的position，之后直接将key放到预测的position上，如果发生冲突，就创建一个新的node，对应的entry设置为指向新node的指针

Updates

当对DILI插入数据时，如果发生conflict，会创建新的Node，当创建的新Node过多并且退化了查找性能时，需要进行调整策略（在section6会详细描述）

3.SEARCH COST ANALYSIS

Cost Analysis:
未使用local optimization strategy
Algorithm1由两步组成(1)找到包含search key的leaf node (2)leaf node中的local search

4.CONSTRUCTION OF DILI

4.1Motivation and Overall Idea of BU-Tree

DILI的internal node的线性模型有着完美的准确性，因为internal node被它的child nodes均分。—-> 现在有一个问题，如何确定DILI的fanout

大致思路：

4.2 Building BU-Tree

4.2.1 Bottom-up Node and Model Creation

构建第0层的leaf node

已知h - 1层的internal node,构建第h层的internal node

4.2.2 Determining Node Layout at A Height

但是BU-Tree是从下往上创建的，即使创建了第h层的node，我们也不知道BU-Tree的高度

4.3 BU-Tree based Bulk loading for DILI

4.4 Remarks

Skip ……

5. LOCAL OPTIMIZATION OF DILI

6. DATA UPDATES IN DILI

6.1 Insertions

P exists标注的地方错误应设置为notExist <— False

6.2 Deletions

7.EXPERIMENTAL STUDIES

7.1 Experimental Settings

数据集：使用了SOSD benchmark中的4个真实数据集和 1 个合成数据集

FB : 200M Facebook 用户id
WikiTS : 200M 维基百科网站日志条目的唯一请求时间戳
OSM : 800M OpenStreetMap单元格的id
Books : 800M Amazon的books的id
Logn : 200M 从重尾采样的独特值 N(0， 1) 的对数正态分布

比较的baselines:

Table 2 总结了所有index的属性特点，更好的表现被加粗了

Evaluation Metrics :

Lookup —> 平均每次query所用的时间

Throughtput —> 平均每second执行的operations的次数（query、insertion、deletion）

7.2 Overall Query Performance

7.3 Performance on Different Workloads

(1)The Read-only workload: 100M point queries

(2)The Read-Heavy workload: 50M insertions and 100M point queries

(3)The Write-Heavy workload: 100M insertions and 50M point queries

(4)The Write-only workload: 100M insertions

7.4 Effect of Many Deletions

(1)Read-Heavy workload : 100M lookups and 50M deletions
DILI achieves up to 3.6X, 2.3X, 7.0X and 2.3X higher throughput than B+ Tree,PGM,MassTree and ALEX,respectively.

(2)Deletion-Heavy workload : 100M deletions and 50M lookups
Only ALEX performs a little better than DILI on Logn dataset

布隆过滤器

Posted on 2024-03-16 Edited on 2025-03-20 In 布隆过滤器

作用

查找时，过滤掉一定不存在的key，提高查找效率

简介

在数组或者列表中搜索相应值的时候，你必须遍历已有的集合。若集合中存在大量的数据，就会影响查找的效率

针对这个问题，可以考虑使用哈希表，利用哈希表来对”值”进行哈希处理来获得该值对应的索引值，然后将该”值”存放到列表中对应的索引位置。
这意味着判断列表中是否存在该值时，只需要对值进行哈希处理并在相应的索引位置进行搜索即可，这时的搜索速度是非常快的。

Bloom Filter本质上是由长度为m的位向量（仅包含0或者1）组成，最初所有值均设置为0，如下图所示：

为了将数据项添加到布隆过滤器中，使用k个不同的哈希函数对其进行哈希，并将结果位置上对应位置为1

简单的例子

输入”semlinker”，预设的3个哈希函数将输出2、4、6,我们把相应位置为1

再输入”kakuqo”，哈希函数输出3、4、7,把对应位置为1

再输入”fullstack”,哈希函数输出2、3、7，这时发现相应的索引位都已经置为了1，这意味着我们可以说”fullstack”可能已经插入到集合中。这是一种误判情况