[doc] change_feature_pdf_to_md #415

Merged
merged 23 commits into from
Sep 13, 2023
Changes from 10 commits
2 changes: 1 addition & 1 deletion .git_bin_url
@@ -38,7 +38,7 @@
{"leaf_path": "data/test/movielens_1m", "sig": "99badbeec64f2fcabe0dfa1d2bfd8fb5", "remote_path": "data/git_oss_sample_data/data_test_movielens_1m_99badbeec64f2fcabe0dfa1d2bfd8fb5"}
{"leaf_path": "data/test/mt_ckpt", "sig": "803499f48e2df5e51ce5606e9649c6d4", "remote_path": "data/git_oss_sample_data/data_test_mt_ckpt_803499f48e2df5e51ce5606e9649c6d4"}
{"leaf_path": "data/test/rtp", "sig": "76cda60582617ddbb7cd5a49eb68a4b9", "remote_path": "data/git_oss_sample_data/data_test_rtp_76cda60582617ddbb7cd5a49eb68a4b9"}
-{"leaf_path": "data/test/tb_data", "sig": "b1579db090d72b3b70b59ba3c7692701", "remote_path": "data/git_oss_sample_data/data_test_tb_data_b1579db090d72b3b70b59ba3c7692701"}
+{"leaf_path": "data/test/tb_data", "sig": "f1279ca42de1734be321e88f85775d5f", "remote_path": "data/git_oss_sample_data/data_test_tb_data_f1279ca42de1734be321e88f85775d5f"}
{"leaf_path": "data/test/tb_data/hard_negative_sampler_edge", "sig": "48f994681d719a2546ec4003fcbc638c", "remote_path": "data/git_oss_sample_data/data_test_tb_data_hard_negative_sampler_edge_48f994681d719a2546ec4003fcbc638c"}
{"leaf_path": "data/test/tb_data/hard_negative_sampler_item", "sig": "f23a9eb9457c14a8e57b455804b1f013", "remote_path": "data/git_oss_sample_data/data_test_tb_data_hard_negative_sampler_item_f23a9eb9457c14a8e57b455804b1f013"}
{"leaf_path": "data/test/tb_data/hard_negative_sampler_user", "sig": "23514156eae5a4250ac1d0a118883430", "remote_path": "data/git_oss_sample_data/data_test_tb_data_hard_negative_sampler_user_23514156eae5a4250ac1d0a118883430"}
31 changes: 31 additions & 0 deletions docs/source/feature/fg_docs/ComboFeature.md
@@ -0,0 +1,31 @@
# 6.3 Combo Feature

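A combo feature is the combination (that is, the Cartesian product) of multiple fields or expressions. An id feature can be viewed as a special combo feature in which only one field takes part in the cross. In general, the crossed fields come from different tables (for example, crossing a user feature with an item feature).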
Review comment (Collaborator): combo feature => combo_feature


Configuration:

```
{
"feature_type" : "combo_feature",
"feature_name" : "comb_u_age_item",
"expression" : ["user:age_class", "item:item_id"]
}
```

## Example

^\] denotes the multi-value separator. Note that it is a single character whose ASCII code is "\\x1D", not two characters.

| Value of user:age_class | Value of item:item_id | Output feature |
| ----------------- | --------------- | ---------------------------------------------------------------------------------------------------------- |
| 123 | 45678 | comb_u_age_item_123_45678 |
| abc, bcd | 45678 | comb_u_age_item_abc_45678, comb_u_age_item_bcd_45678 |
| abc, bcd | 12345^\]45678 | comb_u_age_item_abc_12345, comb_u_age_item_abc_45678, comb_u_age_item_bcd_12345, comb_u_age_item_bcd_45678 |

The number of output features equals

```
|F1| * |F2| * ... * |Fn|
```

where |Fn| is the number of values in the n-th field that the feature depends on.
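
The crossing described above can be sketched in Python. This is only an illustration under stated assumptions, not the fg implementation: the function name `combo_feature` and the constant `MULTI_VALUE_SEP` are hypothetical, every input field is assumed to arrive as one string with values joined by the `\x1D` separator, and the handling of empty fields is not specified by the text.

```python
from itertools import product

# The single-character multi-value separator that the text writes as ^] (ASCII 0x1D).
MULTI_VALUE_SEP = "\x1d"


def combo_feature(feature_name, field_values):
    """Cross multi-valued fields and prefix each combination with feature_name.

    field_values is a list of raw field strings; each string may hold several
    values joined by MULTI_VALUE_SEP. (Hypothetical helper for illustration.)
    """
    # Split every field into its individual values.
    value_lists = [raw.split(MULTI_VALUE_SEP) for raw in field_values]
    # The Cartesian product yields |F1| * |F2| * ... * |Fn| combinations.
    return ["_".join((feature_name,) + combo) for combo in product(*value_lists)]


features = combo_feature("comb_u_age_item", ["abc\x1dbcd", "12345\x1d45678"])
print(features)
```

With two values in each field, the sketch produces 2 * 2 = 4 features, in the same order as the last row of the table.
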
32 changes: 32 additions & 0 deletions docs/source/feature/fg_docs/IdFeature.md
@@ -0,0 +1,32 @@
# 6.1 Id Feature
Review comment (Collaborator): Change this to id_feature so that it matches fg.json, and update the other feature types the same way.


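Overview

An id feature is a sparse feature, the simplest kind of discrete feature: it simply concatenates the value of a field with the feature name configured by the user.

Configuration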

```json
{
"feature_type" : "id_feature",
"feature_name" : "item_is_main",
"expression" : "item:is_main"
}
```

| Field          | Meaning |
| -------------- | ----------------------------------------------------------------------------- |
| feature_name   | Required. feature_name is used as the prefix of the final output feature. |
| expression     | Required. expression describes the source field that this feature depends on. |
| need_prefix    | Optional. true prepends feature_name as a prefix; false does not. Defaults to true; false is typically used in shared_embedding scenarios. |
| invalid_values | Optional. These values are all output as null. A list of strings; for example, \[""\] turns every empty string into null. |

Example (^\] denotes the multi-value separator; note that it is a single character whose ASCII code is "\\x1D", not two characters)

| Type               | Value of item:is_main | Output feature |
| ------------------ | --------------------- | ------------------------------------------- |
| int64_t            | 100                   | (item_is_main_100, 1) |
| double             | 5.2                   | (item_is_main_5, 1) (the fractional part is truncated) |
| string             | abc                   | (item_is_main_abc, 1) |
| multi-value string | abc^\]bcd             | (item_is_main_abc, 1),(item_is_main_bcd, 1) |
| multi-value int    | 123^\]456             | (item_is_main_123, 1),(item_is_main_456, 1) |
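
The rows of the table can be reproduced with a short Python sketch. It is a rough model rather than the fg code: `id_feature` is a hypothetical helper, doubles are assumed to be truncated with `int()`, and "output as null" for `invalid_values` is modelled here by simply dropping the value.

```python
# The single-character multi-value separator that the text writes as ^] (ASCII 0x1D).
MULTI_VALUE_SEP = "\x1d"


def id_feature(feature_name, value, need_prefix=True, invalid_values=()):
    """Return a list of (feature, 1) pairs for one field value (illustrative only)."""
    if isinstance(value, float):
        # The fractional part is truncated, as in the double row of the table.
        value = int(value)
    outputs = []
    for token in str(value).split(MULTI_VALUE_SEP):
        if token in invalid_values:
            # Assumption: a null output is modelled by skipping the value.
            continue
        name = f"{feature_name}_{token}" if need_prefix else token
        outputs.append((name, 1))
    return outputs


print(id_feature("item_is_main", 5.2))
print(id_feature("item_is_main", "abc\x1dbcd"))
print(id_feature("item_is_main", "abc", need_prefix=False))
```
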
112 changes: 112 additions & 0 deletions docs/source/feature/fg_docs/LookupFeature.md
@@ -0,0 +1,112 @@
# 6.5 Lookup Feature

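## Overview

If offline generation does not behave as expected, first switch to the latest offline fg package.

A lookup feature is similar to a match feature: it matches the result it needs from a set of key-value (kv) pairs.

A lookup feature depends on two fields, map and key. map is a multi-value string (MultiString) field in which each string looks like "k1:v2"; key can be a field of any type. To generate the feature, the value of key is read and converted to a string, and then matched against the kv pairs held by the map field to obtain the final feature.

map and key may come from any combination of item, user, and context. For online input, multiple values of an item field are separated by the multi-value separator char(29), while multiple values of user and context fields are passed as a list when accessed through tpp. This feature supports JSON configuration only.

## Example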

```json
{
"features" : [
{
"feature_type" : "lookup_feature",
"feature_name" : "item_match_item",
"map" : "item:item_attr",
"key" : "item:item_value",
"needDiscrete" : true
}
]
}
```

With the configuration above, suppose that for some doc:

```
item_attr : "k1:v1^]k2:v2^]k3:v3"
```

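^\] denotes the multi-value separator. Note that it is a single character whose ASCII code is "\\x1D", not two characters. It is entered with C-q C-5 in emacs and with C-v C-5 in vi. Here item_attr is a multi-value string. Keep in mind that when map represents multiple kv pairs, it is a multi-value string, not a string!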

```
item_value : "k2"
```

The resulting feature is item_match_item_k2_v2. Because needDiscrete is true, the result is the discretized form.
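
As a rough illustration of this example, the sketch below parses the multi-value map and looks up the key. The helper names `parse_kv_map` and `lookup_discrete` are hypothetical, and the output format `name_key_value` is taken from the result quoted above; the real fg package may treat malformed entries differently.

```python
# The single-character multi-value separator that the text writes as ^] (ASCII 0x1D).
MULTI_VALUE_SEP = "\x1d"


def parse_kv_map(raw_map):
    """Split a multi-value string such as 'k1:v1^]k2:v2' into a dict."""
    pairs = {}
    for entry in raw_map.split(MULTI_VALUE_SEP):
        key, sep, value = entry.partition(":")
        if sep:  # Assumption: entries without ':' are ignored.
            pairs[key] = value
    return pairs


def lookup_discrete(feature_name, raw_map, key):
    """Return the discretized feature string, or None when the key is absent."""
    value = parse_kv_map(raw_map).get(str(key))
    if value is None:
        return None
    return f"{feature_name}_{key}_{value}"


item_attr = MULTI_VALUE_SEP.join(["k1:v1", "k2:v2", "k3:v3"])
print(lookup_discrete("item_match_item", item_attr, "k2"))  # item_match_item_k2_v2
```
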

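## Other notes

Both match feature and lookup feature are matching-type features: they match a result from kv pairs. The difference is that the matched field user of a match feature must be a field passed in through qinfo, which means its value is the same for every doc within one query. The key and map of a lookup feature have no restriction on their source.

## Configuration details

The default configuration is `needDiscrete == true, needWeighting = false, needKey = true, combiner = "sum"`

### Default output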

### needWeighting == true

```
feature_name:fg
map:{{"k1:123", "k2:234", "k3:3"}}
key:{"k1"}
result: feature={"fg_k1", 123}
```

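In this case the string part is used to look up the weight table, and the weight is then multiplied by the corresponding feature value for use in an LR model.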

### needDiscrete == true

```
feature_name:fg
map:{{"k1:123", "k2:234", "k3:3"}}
key:{"k1"}
result: feature={"fg_123"}
```

### needDiscrete == false

```
map:{{"k1:123", "k2:234", "k3:3"}}
key:{"k1"}
result: feature={123}
```

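When there are multiple keys, a combiner can be configured to combine the looked-up values. Possible settings are `sum, mean, max, min`. Note: to use a combiner, needDiscrete must be set to false. Only dense features can be combined, and the generated value is numeric.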
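
The three output modes and the combiner can be sketched as follows. This is a simplified model, not the fg implementation: `lookup` is a hypothetical function, the map is assumed to be already parsed into a dict, the precedence of needWeighting over needDiscrete is an assumption, and the discrete output follows the `fg_123` form shown in the needDiscrete block of this section.

```python
def lookup(feature_name, kv_map, keys, need_discrete=True,
           need_weighting=False, combiner="sum"):
    """Model the lookup_feature output modes (illustrative only).

    kv_map maps string keys to string values; keys lists the keys to look up.
    """
    hits = [(k, kv_map[k]) for k in keys if k in kv_map]
    if need_weighting:
        # (feature string, numeric value) pairs, as in the needWeighting block.
        return [(f"{feature_name}_{k}", float(v)) for k, v in hits]
    if need_discrete:
        # One discrete feature per hit, as in the needDiscrete == true block.
        return [f"{feature_name}_{v}" for _, v in hits]
    values = [float(v) for _, v in hits]
    if not values:
        return None  # Assumption: no hit produces no dense value.
    combiners = {
        "sum": sum,
        "mean": lambda xs: sum(xs) / len(xs),
        "max": max,
        "min": min,
    }
    # Dense output: combine all looked-up values into one number.
    return combiners[combiner](values)


kv = {"k1": "123", "k2": "234", "k3": "3"}
print(lookup("fg", kv, ["k1"], need_weighting=True))
print(lookup("fg", kv, ["k1"]))
print(lookup("fg", kv, ["k1", "k3"], need_discrete=False, combiner="mean"))
```
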

A configuration sample (updated on 2021.04.15)

```json
"kv_fields_encode": [
{
"name": "cnty_dense_features",
"dimension": 99,
"min_hash_type": 0,
"use_sparse": true
},
{
"name": "cross_a_tag",
"dimension": 12,
"min_hash_type": 0,
"use_sparse": true
},
{
"name": "cross_gender",
"dimension": 12,
"min_hash_type": 0,
"use_sparse": true
},
{
"name": "cross_purchasing_power",
"dimension": 12,
"min_hash_type": 0,
"use_sparse": true
}
]
```
109 changes: 109 additions & 0 deletions docs/source/feature/fg_docs/MatchFeature.md
@@ -0,0 +1,109 @@
# 6.4 Match Feature



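## Match feature usage

A match feature is generally used to express matching relationships between features. It uses the values of three fields: user, item, and category.
Match feature supports two types, hit and multi hit.
A match feature is essentially a lookup in a two-level map. The user field describes the two-level map as a string: `|` separates the entries of the first-level map, `^` separates a first-level key from its value, `,` separates the entries of a second-level map, and `:` separates a second-level key from its value. For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 converts to the following two-level map: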

```json
{
"50011740" : {
"50011740" : 0.2,
"36806676" : 0.3,
"122572685" : 0.5
},
"50006842" : {
"16788" : 0.1
}
}
```

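With the hit match type, the value of category is looked up in the first-level map, and then the value of item is looked up in the second-level map, which yields a single result. If only one level of matching is needed, put ALL as the first-level key of the map and also set category to "ALL" in the fg configuration. See Example 1 for details.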
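
The string format can be parsed with a few `split` calls. The sketch below is an assumption-laden illustration: `parse_two_level_map` is a hypothetical name, values are converted to `float` to mirror the JSON shown earlier, and malformed entries are silently skipped, which the text does not specify.

```python
def parse_two_level_map(raw):
    """Parse 'k1^a:1,b:2|k2^c:3' into {'k1': {'a': 1.0, 'b': 2.0}, 'k2': {'c': 3.0}}."""
    result = {}
    for block in raw.split("|"):  # entries of the first-level map
        first_key, sep, inner = block.partition("^")
        if not sep:
            continue  # Assumption: blocks without '^' are ignored.
        second_level = {}
        for item in inner.split(","):  # entries of the second-level map
            key, kv_sep, value = item.partition(":")
            if kv_sep:
                second_level[key] = float(value)
        result[first_key] = second_level
    return result


raw = "50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1"
parsed = parse_two_level_map(raw)
print(parsed)
```
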



## Configuration

JSON configuration file:

```json
{
"feature_name": "user__l1_ctr_1",
"feature_type": "match_feature",
"category": "ALL",
"needDiscrete": false,
"item": "item:category_level1",
"user": "user:l1_ctr_1",
"matchType": "hit"
}
```

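When needDiscrete is true, the model uses the feature name output by the match feature and ignores the feature value. The default is true.
When needDiscrete is false, the model uses the feature value output by the match feature and ignores the feature name.

matchType:
hit: outputs the matched feature

XML configuration file: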

```
<features name="matched_features">
<feature name="brand_hit" dependencies="user:user_brand_tags_hit1,item:brand_id" category="item:auction_root_category" type="hit"/>
<feature name="brand_matched_hit" dependencies="user:user_brand_tags_cos1,item:brand_id" category="ALL" type="hit"/>
</features>
```

dependencies: the two features to be matched

category: the category feature field. category="ALL" means that matching is not done per category.



## Normalizer

match_feature supports the same normalizer as raw_feature; see [raw_feature](./RawFeature.md) for details.

## Configuration details


### hit

For the following configuration

```json
{
"feature_name": "brand_hit",
"feature_type": "match_feature",
"category": "item:auction_root_category",
"needDiscrete": true,
"item": "item:brand_id",
"user": "user:user_brand_tags_hit",
"matchType": "hit"
}
```

Suppose the fields have the following values:

| user_brand_tags_hit | 50011740^107287172:0.2,36806676:0.3,122572685:0.5\|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19 |
| --------------------- | ------------------------------------------------------------ |
| brand_id | 30068 |
| auction_root_category | 50006842 |

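If needDiscrete=true, the result is: <brand_hit_50006842_30068_19,1.0>
If needDiscrete=false, the result is: <brand_hit,19.0>
If only one level of matching is needed, change the value of category in the configuration above to ALL. In that case you may also consider using lookup_feature. Suppose the fields have the following values: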

| user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20 |
| ------------------- | ------------------------------------------ |
| brand_id | 30068 |

If needDiscrete=true, the result is: <brand_hit_ALL_30068_20, 1.0>. If needDiscrete=false, the result is: <brand_hit, 20.0>
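
Both hit examples can be checked with a small Python sketch. It is not the fg implementation: `hit_match` and `parse_levels` are hypothetical helpers, second-level values are kept as strings so that the discrete feature name reproduces the quoted output, and a miss is assumed to return `None`.

```python
def parse_levels(raw):
    """Parse the user string into {first_key: {second_key: value_string}}."""
    result = {}
    for block in raw.split("|"):
        first_key, sep, inner = block.partition("^")
        if sep:
            result[first_key] = dict(
                item.split(":", 1) for item in inner.split(",") if ":" in item
            )
    return result


def hit_match(feature_name, user_value, category, item, need_discrete=True):
    """Look up category in the first level and item in the second level."""
    inner = parse_levels(user_value).get(category, {})
    if item not in inner:
        return None  # Assumption: no hit produces no feature.
    value = inner[item]
    if need_discrete:
        # The feature name carries the match; the feature value is 1.0.
        return (f"{feature_name}_{category}_{item}_{value}", 1.0)
    # The feature value carries the matched number.
    return (feature_name, float(value))


tags = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(hit_match("brand_hit", tags, "50006842", "30068"))
print(hit_match("brand_hit", tags, "50006842", "30068", need_discrete=False))
```

Passing `"ALL"` as the category together with a map whose first-level key is `ALL` gives the one-level behaviour described above.
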



### multihit

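multihit allows the category and item values to be ALL (note: this refers to the values passed in, not the configured values). It then performs wildcard matching and can match multiple values. The output is similar to that of hit.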
56 changes: 56 additions & 0 deletions docs/source/feature/fg_docs/OverLapFeature.md
@@ -0,0 +1,56 @@
# 6.7 OverLap Feature

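## Overview

A feature that outputs term-matching information between strings.

For offline use, version 1.3.56-SNAPSHOT is recommended, or 1.3.28 (which does not support the need_prefix parameter). Note: when writing the fg configuration, pay attention to dimensions. The dimension of title must be greater than or equal to that of query. Put simply, if title is a user feature, query can only be a user feature as well; the batch size of a user feature is 1, and the batch size of an item feature is the number of items.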
chenglongliu123 marked this conversation as resolved.

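| Method              | Description | Notes |
| ------------------- | ----------------------------------------------- | ------------------ |
| common_word         | Computes the terms shared by query and title, output as fg_common1_common2 | The number of shared terms does not exceed the number of query terms |
| diff_word           | Computes the terms not shared by query and title, output as fg_diff1_diff2 | The number of non-shared terms does not exceed the number of query terms |
| query_common_ratio  | Ratio of shared terms to the number of terms in query, multiplied by 10 and rounded down | Value in \[0,10\] |
| title_common_ratio  | Ratio of shared terms to the number of terms in title, multiplied by 100 and rounded down | Value in \[0,100\] |
| is_contain          | Whether query is entirely contained in title, preserving order | 0 means not contained, 1 means contained |
| is_equal            | Whether query is exactly the same as title | 0 means not identical, 1 means identical |
| common_word_divided | Computes the terms shared by query and title, output as fg_common1, fg_common2 | The number of shared terms does not exceed the number of query terms |
| diff_word_divided   | Computes the terms not shared by query and title, output as fg_diff1, fg_diff2 | The number of non-shared terms does not exceed the number of query terms |

## Configuration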

```json
{
"feature_type" : "overlap_feature",
"feature_name" : "is_contain",
"query" : "user:attr1",
"title" : "item:attr2",
"method" : "is_contain",
"separator" : " "
}
```

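| Field        | Meaning |
| ------------ | -------------------------------------------------------------------------------------- |
| feature_type | Required. Describes the type of this feature. |
| feature_name | Required. feature_name is used as the prefix of the final output feature. |
| query        | Required. The table that query depends on. attr1 is a multi-value string whose separator is chr(29). |
| title        | Required. The table that title depends on. attr2 is a multi-value string. |
| method       | One of common_word, diff_word, query_common_ratio, title_common_ratio, is_contain, corresponding to the methods in the table above. |
| separator    | The separator character used in the output. Defaults to \_ when omitted, but it can be customized; see the example. |

## Example

query is high,high2,fiberglass,abc
title is high,quality,fiberglass,tube,for,golf,bag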

| method | separator | feature |
| ------------------- | --------- | -------------------------- |
| common_word | | name_high_fiberglass |
| diff_word | " " | name high2 abc |
| query_common_ratio | | name_5 |
| title_common_ratio | | name_28 |
| is_contain | | name_0 |
| is_equal | | name_0 |
| common_word_divided | | name_high, name_fiberglass |
| diff_word_divided | | name_high2, name_abc |
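
The example rows can be reproduced with the Python sketch below. It is an illustration under several assumptions, not the fg implementation: `overlap_feature` is a hypothetical function that takes query and title as already split term lists, `is_contain` is interpreted as an ordered-subsequence test, repeated query terms are counted once per occurrence, and an empty query or title yields a ratio of 0.

```python
def is_ordered_subsequence(needle, haystack):
    """True if every term of needle appears in haystack in the same order."""
    remaining = iter(haystack)
    return all(term in remaining for term in needle)


def overlap_feature(feature_name, query, title, method, separator="_"):
    """Return the list of features produced by one overlap method."""
    title_terms = set(title)
    common = [t for t in query if t in title_terms]
    diff = [t for t in query if t not in title_terms]
    if method == "common_word":
        return [separator.join([feature_name] + common)]
    if method == "diff_word":
        return [separator.join([feature_name] + diff)]
    if method == "common_word_divided":
        return [f"{feature_name}{separator}{t}" for t in common]
    if method == "diff_word_divided":
        return [f"{feature_name}{separator}{t}" for t in diff]
    if method == "query_common_ratio":
        score = len(common) * 10 // len(query) if query else 0
    elif method == "title_common_ratio":
        score = len(common) * 100 // len(title) if title else 0
    elif method == "is_contain":
        score = int(is_ordered_subsequence(query, title))
    elif method == "is_equal":
        score = int(query == title)
    else:
        raise ValueError(f"unknown method: {method}")
    return [f"{feature_name}{separator}{score}"]


query = "high,high2,fiberglass,abc".split(",")
title = "high,quality,fiberglass,tube,for,golf,bag".split(",")
for m in ("common_word", "query_common_ratio", "title_common_ratio", "is_contain"):
    print(m, overlap_feature("name", query, title, m))
```
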