[doc] change_feature_pdf_to_md #415
Merged
Changes from 10 commits (23 commits in total), all by chenglongliu123. Each commit is titled change_feature_pdf_to_md except b48d62a, which is titled "Merge branch 'master' into change_docs_from_pdf_to_md".
Commits: e12c66a, 4dbd936, de30f3b, 2443bbc, 2f3712e, d471204, 7aa9e0f, dddaed3, 9d74326, 0eda657, 5d0a3ad, f86a071, 1fc5ea2, 953d94b, b48d62a, 56b550c, db45619, c6fb600, 1a33771, d6664b1, 4d3cec2, fab082b, 117cc41
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# 6.3 Combo Feature | ||
|
||
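A combo feature is the combination (that is, the Cartesian product) of multiple fields (or expressions). An id feature can be seen as a special combo feature in which only one field takes part in the cross. In general, the crossed fields come from different tables (for example, crossing a user feature with an item feature). | ||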
|
||
Configuration: | ||
|
||
``` | ||
{ | ||
"feature_type" : "combo_feature", | ||
"feature_name" : "comb_u_age_item", | ||
"expression" : ["user:age_class", "item:item_id"] | ||
} | ||
``` | ||
|
||
## Example | ||
|
||
^\] denotes the multi-value separator. Note that it is a single character whose ASCII code is "\\x1D", not two characters. | ||
|
||
| Value of user:age_class | Value of item:item_id | Output feature | | ||
| ----------------- | --------------- | ---------------------------------------------------------------------------------------------------------- | | ||
| 123 | 45678 | comb_u_age_item_123_45678 | | ||
| abc, bcd | 45678 | comb_u_age_item_abc_45678, comb_u_age_item_bcd_45678 | | ||
| abc, bcd | 12345^\]45678 | comb_u_age_item_abc_12345, comb_u_age_item_abc_45678, comb_u_age_item_bcd_12345, comb_u_age_item_bcd_45678 | | ||
|
||
The number of output features equals | ||
|
||
``` | ||
|F1| * |F2| * ... * |Fn| | ||
``` | ||
|
||
where |Fn| is the number of values of the n-th dependent field. |
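
The expansion shown in the table can be sketched in Python. This is only an illustration under stated assumptions, not the fg implementation: `combo_feature` and `MULTI_VALUE_SEP` are hypothetical names, and each input field is assumed to arrive as a single string whose values are separated by the "\x1D" character.

```python
from itertools import product

# Assumption: multi-value fields use the single character "\x1D" as separator.
MULTI_VALUE_SEP = "\x1d"


def combo_feature(feature_name, field_values):
    """Cross the values of every field and prefix each combination with feature_name.

    field_values holds one raw string per dependent field (hypothetical input shape).
    """
    value_lists = [raw.split(MULTI_VALUE_SEP) for raw in field_values]
    # itertools.product yields the Cartesian product, so the number of outputs
    # is |F1| * |F2| * ... * |Fn|.
    return ["_".join([feature_name, *combo]) for combo in product(*value_lists)]


features = combo_feature("comb_u_age_item", ["abc\x1dbcd", "12345\x1d45678"])
```

With two values in each field the sketch yields four features, matching the last row of the table.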
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# 6.1 Id Feature | ||
Review comment: change this to id_feature so that it is consistent with fg.json, and change the other feature types in the same way. |
||
|
||
Overview | ||
|
||
An id feature is a sparse feature and the simplest kind of discrete feature: it simply concatenates the value of a field with the user-configured feature name. | ||
|
||
Configuration | ||
|
||
```json | ||
{ | ||
"feature_type" : "id_feature", | ||
"feature_name" : "item_is_main", | ||
"expression" : "item:is_main" | ||
} | ||
``` | ||
|
||
| Field | Meaning | | ||
| -------------- | ----------------------------------------------------------------------------- | | ||
| feature_name | Required. feature_name is used as the prefix of the final output feature. | | ||
| expression | Required. expression describes the source field that the feature depends on. | | ||
| need_prefix | Optional. true prepends feature_name as a prefix, false does not. Defaults to true; false is typically used in shared_embedding scenarios. | | ||
| invalid_values | Optional. These values are all output as null. It is a list of strings; for example, \[""\] turns every empty string into null. | | ||
|
||
Example (^\] denotes the multi-value separator. Note that it is a single character whose ASCII code is "\\x1D", not two characters.) | ||
|
||
| Type | Value of item:is_main | Output feature | | ||
| -------- | --------------- | ------------------------------------------- | | ||
| int64_t | 100 | (item_is_main_100, 1) | | ||
| double | 5.2 | (item_is_main_5, 1) (the fractional part is truncated) | | ||
| string | abc | (item_is_main_abc, 1) | | ||
| multi-value string | abc^\]bcd | (item_is_main_abc, 1),(item_is_main_bcd, 1) | | ||
| multi-value int | 123^\]456 | (item_is_main_123, 1),(item_is_main_456, 1) |
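
The rows above can be reproduced with a small Python sketch. It is an illustration only, not the fg implementation: `id_feature` and `MULTI_VALUE_SEP` are hypothetical names, multi-value input is assumed to be one string separated by "\x1D", and values listed in `invalid_values` are simply dropped here as a stand-in for "output as null".

```python
# Assumption: multi-value fields use the single character "\x1D" as separator.
MULTI_VALUE_SEP = "\x1d"


def id_feature(feature_name, value, need_prefix=True, invalid_values=()):
    """Return (feature, 1) pairs for one field value (hypothetical helper)."""
    if isinstance(value, float):
        value = int(value)  # the fractional part is truncated, e.g. 5.2 -> 5
    output = []
    for token in str(value).split(MULTI_VALUE_SEP):
        if token in invalid_values:
            continue  # assumption: an invalid value produces no feature
        name = f"{feature_name}_{token}" if need_prefix else token
        output.append((name, 1))
    return output


multi = id_feature("item_is_main", "abc\x1dbcd")
```

Passing `need_prefix=False` returns the bare values, which is the behaviour the table describes for shared_embedding scenarios.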
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
# 6.5 Lookup Feature | ||
|
||
## Overview | ||
|
||
If offline generation does not behave as expected, first switch to the latest offline fg package. | ||
|
||
A lookup feature is similar to a match feature: it matches the result it needs from a set of key-value (kv) pairs. | ||
|
||
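A lookup feature depends on two fields, map and key. map is a multi-value string (MultiString) field in which each string looks like "k1:v1"; key can be a field of any type. To generate the feature, the value of key is read and converted to a string, then matched against the kv pairs held by the map field to obtain the final feature. | ||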
|
||
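The sources of map and key can be any combination of item, user, and context. For online input, multiple values of an item field are separated by the multi-value separator char(29), while multiple values of user and context fields are passed as a list when accessed through tpp. This feature only supports JSON configuration. | ||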
|
||
## Example | ||
|
||
```json | ||
{ | ||
"features" : [ | ||
{ | ||
"feature_type" : "lookup_feature", | ||
"feature_name" : "item_match_item", | ||
"map" : "item:item_attr", | ||
"key" : "item:item_value", | ||
"needDiscrete" : true | ||
} | ||
] | ||
} | ||
``` | ||
|
||
For the configuration above, suppose that for some doc: | ||
|
||
``` | ||
item_attr : "k1:v1^]k2:v2^]k3:v3" | ||
``` | ||
|
||
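^\] denotes the multi-value separator. Note that it is a single character whose ASCII code is "\\x1D", not two characters. It is entered with C-q C-5 in emacs and with C-v C-5 in vi. Here item_attr is a multi-value string. Remember that when map represents multiple kv pairs, it is a multi-value string, not a string! | ||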
|
||
``` | ||
item_value : "k2" | ||
``` | ||
|
||
The resulting feature is item_match_item_k2_v2. Because needDiscrete is true, the result is the discretized form. | ||
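
The lookup in this example can be sketched in Python. This is a minimal illustration under stated assumptions, not the fg implementation: `parse_kv_map` and `lookup_discrete` are hypothetical names, and the map field is assumed to be one string whose kv entries are separated by "\x1D".

```python
# Assumption: multi-value fields use the single character "\x1D" as separator.
MULTI_VALUE_SEP = "\x1d"


def parse_kv_map(map_value):
    """Split a multi-value string such as "k1:v1\\x1dk2:v2" into a dict."""
    pairs = {}
    for entry in map_value.split(MULTI_VALUE_SEP):
        key, sep, value = entry.partition(":")
        if sep:  # skip entries that contain no ":" at all
            pairs[key] = value
    return pairs


def lookup_discrete(feature_name, map_value, key_value):
    """Return feature_name_key_value for the matched pair, or None if nothing matches."""
    kv = parse_kv_map(map_value)
    key = str(key_value)  # the key is converted to a string before matching
    if key not in kv:
        return None
    return f"{feature_name}_{key}_{kv[key]}"


result = lookup_discrete("item_match_item", "k1:v1\x1dk2:v2\x1dk3:v3", "k2")
```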
|
||
## Other notes | ||
|
||
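Both match feature and lookup feature are matching-type features: they match a result out of kv pairs. The difference is that the matched field user of a match feature must be a field passed in through qinfo, so within one query its value is the same for every doc, whereas the key and map of a lookup feature have no restriction on their source. | ||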
|
||
## Configuration details | ||
|
||
The default configuration is `needDiscrete = true, needWeighting = false, needKey = true, combiner = "sum"` | ||
|
||
### Default output | ||
|
||
### needWeighting == true | ||
|
||
``` | ||
feature_name:fg | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
Result: feature={"fg_k1", 123} | ||
``` | ||
|
||
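In this case the string part is used to look up the weight table, and the weight found is multiplied by the corresponding feature value for use in an LR model. | ||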
|
||
### needDiscrete == true | ||
|
||
``` | ||
feature_name:fg | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
Result: feature={"fg_123"} | ||
``` | ||
|
||
### needDiscrete == false | ||
|
||
``` | ||
map:{{"k1:123", "k2:234", "k3:3"}} | ||
key:{"k1"} | ||
Result: feature={123} | ||
``` | ||
|
||
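When there are multiple keys, the looked-up values can be combined by configuring combiner. The possible settings are `sum, mean, max, min`. Note: to use combiner, needDiscrete must be set to false; only dense features can be combined, and the generated value is numeric. | ||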
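
How a combiner might fold several looked-up values into one dense value can be sketched as follows. This is an assumption-laden illustration, not the fg implementation: `lookup_dense` and `COMBINERS` are hypothetical names, the map is passed as a list of "key:value" strings, and keys with no match are simply ignored.

```python
# Hypothetical table of the four combiners named above.
COMBINERS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
}


def lookup_dense(kv_entries, keys, combiner="sum"):
    """Look up every key and combine the numeric hits (needDiscrete == false case)."""
    kv = dict(entry.split(":", 1) for entry in kv_entries)
    hits = [float(kv[str(k)]) for k in keys if str(k) in kv]
    if not hits:
        return None  # assumption: no match yields no value
    return COMBINERS[combiner](hits)


kv_map = ["k1:123", "k2:234", "k3:3"]
single = lookup_dense(kv_map, ["k1"])
combined = lookup_dense(kv_map, ["k1", "k3"], combiner="mean")
```

With a single key every combiner returns the looked-up value itself, which agrees with the needDiscrete == false example above.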
|
||
A sample configuration (updated on 2021.04.15) | ||
|
||
```json | ||
"kv_fields_encode": [ | ||
{ | ||
"name": "cnty_dense_features", | ||
"dimension": 99, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_a_tag", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_gender", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
}, | ||
{ | ||
"name": "cross_purchasing_power", | ||
"dimension": 12, | ||
"min_hash_type": 0, | ||
"use_sparse": true | ||
} | ||
] | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
# 6.4 Match Feature | ||
|
||
|
||
|
||
## Match feature usage | ||
|
||
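A match feature is generally used to express a matching relationship between features; it uses the values of three fields: user, item, and category. | ||
Match feature supports two types: hit and multi hit. | ||
A match feature is essentially a match against a two-level map. The user field describes the two-level map as a string: `|` separates the entries of the first-level map and `^` separates each first-level key from its value; `,` separates the entries of the second-level map and `:` separates each second-level key from its value. For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 converts to the following two-level map: | ||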
|
||
```json | ||
{ | ||
"50011740" : { | ||
"50011740" : 0.2, | ||
"36806676" : 0.3, | ||
"122572685" : 0.5 | ||
}, | ||
"50006842" : { | ||
"16788" : 0.1 | ||
} | ||
} | ||
``` | ||
|
||
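With hit matching, the value of category is looked up in the first-level map and the value of item is then looked up in the second-level map, producing a single result. If only one level of matching is needed instead of two, put ALL as the first-level key of the map and also set category to "ALL" in the fg configuration. See Example 1 for details. | ||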
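
The string format described above can be parsed with a short Python sketch. It is an illustration only: `parse_two_level_map` is a hypothetical name, and the inner values are kept as strings rather than converted to numbers so that they can be reused verbatim in feature names.

```python
def parse_two_level_map(user_value):
    """Parse "k1^a:1,b:2|k2^c:3" into {"k1": {"a": "1", "b": "2"}, "k2": {"c": "3"}}."""
    result = {}
    for first_level_entry in user_value.split("|"):
        outer_key, sep, inner = first_level_entry.partition("^")
        if not sep:
            continue  # skip malformed entries that have no "^"
        inner_map = {}
        for second_level_entry in inner.split(","):
            inner_key, inner_sep, inner_value = second_level_entry.partition(":")
            if inner_sep:
                inner_map[inner_key] = inner_value
        result[outer_key] = inner_map
    return result


parsed = parse_two_level_map(
    "50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1"
)
```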
|
||
|
||
|
||
## Configuration | ||
|
||
JSON configuration file: | ||
|
||
```json | ||
{ | ||
"feature_name": "user__l1_ctr_1", | ||
"feature_type": "match_feature", | ||
"category": "ALL", | ||
"needDiscrete": false, | ||
"item": "item:category_level1", | ||
"user": "user:l1_ctr_1", | ||
"matchType": "hit" | ||
} | ||
``` | ||
|
||
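When needDiscrete is true, the model uses the feature name output by the match feature and ignores the feature value. The default is true. | ||
When needDiscrete is false, the model takes the feature value output by the match feature and ignores the feature name. | ||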
|
||
matchType: | ||
hit: outputs the feature that was hit | ||
|
||
XML configuration file: | ||
|
||
``` | ||
<features name="matched_features"> | ||
<feature name="brand_hit" dependencies="user:user_brand_tags_hit1,item:brand_id" category="item:auction_root_category" type="hit"/> | ||
<feature name="brand_matched_hit" dependencies="user:user_brand_tags_cos1,item:brand_id" category="ALL" type="hit"/> | ||
</features> | ||
``` | ||
|
||
dependencies: the two features to be matched | ||
|
||
category: the feature field that holds the category. category="ALL" means matching is not split by category. | ||
|
||
|
||
|
||
## Normalizer | ||
|
||
match_feature supports the same normalizers as raw_feature; see [raw_feature](./RawFeature.md) for details. | ||
|
||
## Configuration details | ||
|
||
|
||
### hit | ||
|
||
For the following configuration: | ||
|
||
```json | ||
{ | ||
"feature_name": "brand_hit", | ||
"feature_type": "match_feature", | ||
"category": "item:auction_root_category", | ||
"needDiscrete": true, | ||
"item": "item:brand_id", | ||
"user": "user:user_brand_tags_hit", | ||
"matchType": "hit" | ||
} | ||
``` | ||
|
||
Suppose the fields have the following values: | ||
|
||
| user_brand_tags_hit | 50011740^107287172:0.2,36806676:0.3,122572685:0.5\|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19 | | ||
| --------------------- | ------------------------------------------------------------ | | ||
| brand_id | 30068 | | ||
| auction_root_category | 50006842 | | ||
|
||
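If needDiscrete=true, the result is <brand_hit_50006842_30068_19, 1.0> | ||
If needDiscrete=false, the result is <brand_hit, 19.0> | ||
If only one level of matching is needed, change the value of category in the configuration above to ALL. In that case you may also consider using lookup_feature. Suppose the fields have the following values: | ||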
|
||
| user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20 | | ||
| ------------------- | ------------------------------------------ | | ||
| brand_id | 30068 | | ||
|
||
If needDiscrete=true, the result is <brand_hit_ALL_30068_20, 1.0>. If needDiscrete=false, the result is <brand_hit, 20.0>. | ||
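
Both hit examples can be reproduced with a small Python sketch. It is a minimal illustration under stated assumptions, not the fg implementation: `hit_match` is a hypothetical name, the category and item values are passed in as strings, and a miss is represented by None.

```python
def hit_match(feature_name, user_value, category, item, need_discrete=True):
    """Two-level lookup: category selects the first-level entry, item the second."""
    for first_level_entry in user_value.split("|"):
        outer_key, _, inner = first_level_entry.partition("^")
        if outer_key != category:
            continue
        for second_level_entry in inner.split(","):
            inner_key, _, inner_value = second_level_entry.partition(":")
            if inner_key == item:
                if need_discrete:
                    # the matched value becomes part of the name, the value is fixed at 1.0
                    return (f"{feature_name}_{category}_{item}_{inner_value}", 1.0)
                # the name is just the feature name, the matched value is the value
                return (feature_name, float(inner_value))
    return None  # assumption: nothing is output when there is no hit


user_brand_tags_hit = (
    "50011740^107287172:0.2,36806676:0.3,122572685:0.5"
    "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19"
)
discrete = hit_match("brand_hit", user_brand_tags_hit, "50006842", "30068")
dense = hit_match("brand_hit", user_brand_tags_hit, "50006842", "30068", need_discrete=False)
```

For single-level matching the same call works with "ALL" as both the first-level key in the string and the category argument.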
|
||
|
||
|
||
### multihit | ||
|
||
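Allows the category and item values to be ALL (note: this refers to the value passed in, not the configured value) for wildcard matching, which can match multiple values. The output is similar to that of hit. |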
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# 6.7 OverLap Feature | ||
|
||
## Overview | ||
|
||
A feature that outputs term-matching information between strings. | ||
|
||
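For offline use, version 1.3.56-SNAPSHOT is recommended, or 1.3.28 (which does not support the need_prefix parameter). Note: when writing fg, pay attention to dimensions; the dimension of title must be greater than or equal to that of query. Put simply, if title is a user feature then query can only be a user feature as well; the batch size of a user feature is 1, and the batch size of an item feature is the number of items. | ||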
|
||
|
||
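| Method | Description | Notes | | ||
| ------------------- | ----------------------------------------------- | ------------------ | | ||
| common_word | Computes the terms shared by query and title and outputs them as fg_common1_common2 | The number of shared terms does not exceed the number of query terms | | ||
| diff_word | Computes the terms not shared by query and title and outputs them as fg_diff1_diff2 | The number of unshared terms does not exceed the number of query terms | | ||
| query_common_ratio | Computes the ratio of shared terms to the terms in query, multiplied by 10 and rounded down | Range is \[0,10\] | | ||
| title_common_ratio | Computes the ratio of shared terms to the terms in title, multiplied by 100 and rounded down | Range is \[0,100\] | | ||
| is_contain | Checks whether query is entirely contained in title, preserving order | 0 means not contained, 1 means contained | | ||
| is_equal | Checks whether query is exactly the same as title | 0 means not identical, 1 means identical | | ||
| common_word_divided | Computes the terms shared by query and title and outputs them as fg_common1, fg_common2 | The number of shared terms does not exceed the number of query terms | | ||
| diff_word_divided | Computes the terms not shared by query and title and outputs them as fg_diff1, fg_diff2 | The number of unshared terms does not exceed the number of query terms | | ||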
|
||
## Configuration | ||
|
||
```json | ||
{ | ||
"feature_type" : "overlap_feature", | ||
"feature_name" : "is_contain", | ||
"query" : "user:attr1", | ||
"title" : "item:attr2", | ||
"method" : "is_contain", | ||
"separator" : " " | ||
} | ||
``` | ||
|
||
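| Field | Meaning | | ||
| ------------ | -------------------------------------------------------------------------------------- | | ||
| feature_type | Required. Describes the type of this feature. | | ||
| feature_name | Required. feature_name is used as the prefix of the final output feature. | | ||
| query | Required. The table that query depends on. attr1 is a multi-value string whose separator is chr(29). | | ||
| title | Required. The table that title depends on. attr2 is a multi-value string. | | ||
| method | One of common_word, diff_word, query_common_ratio, title_common_ratio, is_contain, corresponding to the five methods in the table above. | | ||
| separator | The separator character used in the output. Defaults to \_ when omitted, but can be customized; see the example. | | ||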
|
||
## Example | ||
|
||
query is high,high2,fiberglass,abc | ||
title is high,quality,fiberglass,tube,for,golf,bag | ||
|
||
| method | separator | feature | | ||
| ------------------- | --------- | -------------------------- | | ||
| common_word | | name_high_fiberglass | | ||
| diff_word | " " | name high2 abc | | ||
| query_common_ratio | | name_5 | | ||
| title_common_ratio | | name_28 | | ||
| is_contain | | name_0 | | ||
| is_equal | | name_0 | | ||
| common_word_divided | | name_high, name_fiberglass | | ||
| diff_word_divided | | name_high2, name_abc | |
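
The example table can be reproduced with the Python sketch below. It is an illustration under stated assumptions, not the fg implementation: `overlap_feature` is a hypothetical name, query and title are passed as already-split, non-empty term lists, and is_contain is interpreted here as "the query terms appear in title in the same order, not necessarily adjacent", which the source does not spell out.

```python
def overlap_feature(name, query, title, method, separator="_"):
    """Compute one overlap method for two lists of terms (hypothetical helper)."""
    common = [term for term in query if term in title]
    diff = [term for term in query if term not in title]
    if method == "common_word":
        return separator.join([name, *common])
    if method == "diff_word":
        return separator.join([name, *diff])
    if method == "common_word_divided":
        return [f"{name}{separator}{term}" for term in common]
    if method == "diff_word_divided":
        return [f"{name}{separator}{term}" for term in diff]
    if method == "query_common_ratio":
        # share of query terms found in title, times 10, rounded down
        return f"{name}{separator}{len(common) * 10 // len(query)}"
    if method == "title_common_ratio":
        # shared terms relative to title length, times 100, rounded down
        return f"{name}{separator}{len(common) * 100 // len(title)}"
    if method == "is_equal":
        return f"{name}{separator}{int(query == title)}"
    if method == "is_contain":
        remaining = iter(title)
        # "term in remaining" consumes the iterator, which enforces the order
        contained = all(term in remaining for term in query)
        return f"{name}{separator}{int(contained)}"
    raise ValueError(f"unknown method: {method}")


query = ["high", "high2", "fiberglass", "abc"]
title = ["high", "quality", "fiberglass", "tube", "for", "golf", "bag"]
ratio = overlap_feature("name", query, title, "title_common_ratio")
```

Two of the four query terms occur in the seven-term title, so the ratios come out as 2/4 * 10 = 5 and 2/7 * 100 rounded down to 28, as in the table.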
combo feature => combo_feature