How to process session-based recommender data? #1219
-
When I look at the default benchmark data 'diginetica-session', it has the following structure:
It adds items one at a time, producing many rows by appending one more item to the sequence in each new row. But the atomic file diginetica.inter has the following features:
If I set item_id as in the sequential model config below,
item_id_list will be created, but it won't look anything like the diginetica-session data shown at the top. It will just be one sequence per user, as below.
I guess this is how I should build sequences for a session-based model? And I guess the diginetica-session data form should be obtained from somewhere? So my questions are as below.
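To make the shape concrete, here is a toy sketch of the "one row per prefix" pattern I mean (made-up item ids, not the real diginetica contents):

```python
# Toy sketch of the prefix-style rows in the session benchmark
# (made-up item ids, not the real diginetica data).
session_items = [21, 58, 64, 37]

for i in range(1, len(session_items)):
    item_id_list, target = session_items[:i], session_items[i]
    print(f"session_id=1  item_id_list={item_id_list}  item_id={target}")

# prints:
# session_id=1  item_id_list=[21]  item_id=58
# session_id=1  item_id_list=[21, 58]  item_id=64
# session_id=1  item_id_list=[21, 58, 64]  item_id=37
```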
-
Hi, for the first question, please take a look at session_based_rec_example.py, which shows how to load the session benchmarks. By the way, the example follows a procedure very similar to that in GCE-GNN's released code. For the second question, it actually depends:
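For reference, a minimal sketch of loading a pre-split session benchmark could look like this (it assumes files such as diginetica.train.inter, diginetica.valid.inter, and diginetica.test.inter already exist in the dataset folder, and uses GRU4Rec only as a placeholder model):

```python
# Minimal sketch, assuming pre-split benchmark files
# diginetica.train.inter / diginetica.valid.inter / diginetica.test.inter
# are present; GRU4Rec is only a placeholder model here.
from recbole.config import Config
from recbole.data import create_dataset, data_preparation

config_dict = {
    # with benchmark_filename set, RecBole loads the three files as-is
    # and skips augmentation, grouping, and splitting
    'benchmark_filename': ['train', 'valid', 'test'],
}
config = Config(model='GRU4Rec', dataset='diginetica', config_dict=config_dict)
dataset = create_dataset(config)
train_data, valid_data, test_data = data_preparation(config, dataset)
```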
-
Hi, sorry for misunderstanding the questions.
(2) Then, I would like to share my understanding of the difference between session-based recommendation and sequential recommendation. Note that this may not be correct; it is just my current understanding. In my opinion, in session-based recommendation all sessions are assumed to happen within a short period, such as a few minutes or hours. Thus, when we evaluate session-based methods, the original item sequences of the same session_id should not be divided across the train/valid/test sets. In sequential recommendation, we assume that we can observe a user's long-term interaction history, so it is natural to place part of a user's history in the training set and then predict that user's next item in the valid/test sets. I hope this is a bit clearer than before. :)
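To make the splitting difference concrete, here is a rough sketch (a toy example of my own, not RecBole code) contrasting session-level splitting with within-history splitting:

```python
# Toy sketch: session-level splitting vs. within-history (sequential) splitting.
import numpy as np
import pandas as pd

inter = pd.DataFrame({
    'session_id': [1, 1, 2, 2, 2, 3, 3],
    'user_id':    [7, 7, 7, 7, 7, 9, 9],
    'item_id':    [21, 58, 64, 37, 12, 90, 43],
    'timestamp':  [100, 101, 200, 201, 202, 300, 301],
})

# Session-based: whole sessions go to train or test, so a single session
# is never divided across splits.
rng = np.random.default_rng(0)
sessions = inter['session_id'].unique()
test_sessions = set(rng.choice(sessions, size=1, replace=False))
sess_train = inter[~inter['session_id'].isin(test_sessions)]
sess_test = inter[inter['session_id'].isin(test_sessions)]

# Sequential: order each user's long-term history by time and hold out the
# last interaction per user (leave-one-out), so part of a user's history is
# in the training set and the rest is used for valid/test.
ordered = inter.sort_values(['user_id', 'timestamp'])
held_out = ordered.groupby('user_id').tail(1)
seq_train = ordered.drop(held_out.index)
```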
Yes, that's right.
If your YAML file contains the `benchmark_filename` arg, then nothing will happen: no data augmentation, no grouping, no splitting, because we assume that you have already done all of these when generating the benchmark dataset. Otherwise (no `benchmark_filename` in your YAML), the input is as follows, and we will then perform data augmentation for sequential recommendation: the interactions are grouped by user and sorted by timestamp into item sequences.
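As a rough sketch of that augmentation (an illustration only, not RecBole's internal implementation), grouping by user, sorting by timestamp, and emitting one row per prefix could look like this:

```python
# Toy sketch of the prefix-style data augmentation described above;
# an illustration only, not RecBole's internal code.
import pandas as pd

inter = pd.DataFrame({
    'user_id':   [7, 7, 7, 9, 9],
    'item_id':   [21, 58, 64, 90, 43],
    'timestamp': [102, 100, 101, 300, 301],
})

rows = []
# group each user's interactions and sort them by timestamp
for user_id, group in inter.groupby('user_id'):
    items = group.sort_values('timestamp')['item_id'].tolist()
    # one augmented row per prefix: predict items[i] from items[:i]
    for i in range(1, len(items)):
        rows.append({'user_id': user_id,
                     'item_id_list': items[:i],
                     'item_id': items[i]})

augmented = pd.DataFrame(rows)
print(augmented)
```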