Idea: index + join #22

Open
kmx opened this issue Aug 16, 2015 · 3 comments

kmx commented Aug 16, 2015

Hi Zaki,

I would like to propose the following enhancements:

  • new index option that can be passed to new()
  • new method join

The code below demonstrates what I have in mind. In my real use case the index will be a PDL (LongLong) holding timestamp-like values.

At this moment it is just an idea (no patch, no pull request).

use Modern::Perl;
use Data::Frame;
use PDL;

my $df1 = Data::Frame->new( 
            index => pdl(1, 2, 3, 4, 5, 6),
            columns => [
              first  => random(6) * 100,
              second => sequence(6) + 100,
            ],
          );

say $df1;
# ---------------------------------
# index  first              second
# ---------------------------------
#     1  96.8891209914009   100
#     2  76.1503499302307   101
#     3  67.3669555706322   102
#     4  94.2991902576502   103
#     5  97.5514418708361   104
#     6  37.9426436114741   105
# ---------------------------------

my $df2 = Data::Frame->new( 
            index => pdl(0, 1, 2, 4, 5, 6),
            columns => [
              third  => random(6) * 1000,
              fourth => sequence(6) + 1000,
            ],
          );

say $df2;
# --------------------------------
# index  third             fourth
# --------------------------------
#     0  202.939408438848  1000
#     1  758.36712363536   1001
#     2  277.250017476778  1002
#     4  663.52298494806   1003
#     5  186.35758181922   1004
#     6  776.087658553486  1005
# --------------------------------

my $df3 = $df1->join($df2);
say $df3;
# ----------------------------------------------------------
# index  first             second  third             fourth
# ----------------------------------------------------------
#     0  BAD               BAD     202.939408438848  1000
#     1  96.8891209914009  100     758.36712363536   1001
#     2  76.1503499302307  101     277.250017476778  1002
#     3  67.3669555706322  102     BAD               BAD
#     4  94.2991902576502  103     663.52298494806   1003
#     5  97.5514418708361  104     186.35758181922   1004
#     6  37.9426436114741  105     776.087658553486  1005
# ----------------------------------------------------------

zmughal commented Aug 17, 2015

I like it! Setting the index in the constructor was in the back of my mind, but I've been busy.

Regarding join, we definitely need that. I would like to take the behaviour of R and Pandas and bring it into Data::Frame so that we don't have to change it later (and thus break code). A join with no other arguments would probably be best as a "natural join", where all columns with matching names in the two Data::Frames are used as the join keys (see the dplyr documentation).
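To make the "natural join" default concrete, here is a minimal sketch in plain Perl of how the join keys could be picked when no arguments are given. The helper name `natural_join_keys` and the example column names are hypothetical; nothing like this exists in Data::Frame as far as this thread shows.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Frame): return the column names
# present in both frames, in the order they appear in the left frame.
sub natural_join_keys {
    my ($left_cols, $right_cols) = @_;
    my %in_right = map { $_ => 1 } @$right_cols;
    return grep { $in_right{$_} } @$left_cols;
}

my @keys = natural_join_keys(
    [qw(id date price)],     # columns of the left frame (made up)
    [qw(date id volume)],    # columns of the right frame (made up)
);
print "join on: @keys\n";    # prints "join on: id date"
```

With the two frames from the original proposal, which share no column names, a natural join would have nothing to match on, so the index would presumably have to count as an implicit key.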

kmx commented Sep 15, 2015

Just for the record, here is my proof of concept:

use Modern::Perl;
use PDL;

sub join_demo {
  my ($left_idx, $left_data, $right_idx, $right_data, $how) = @_;

  my $all_idx;
  if ($how eq 'inner') {
    # inner join
    $all_idx = setops($left_idx, 'AND', $right_idx);
  }
  elsif ($how eq 'outer') {
    # full outer join
    $all_idx = setops($left_idx, 'OR', $right_idx);
  }
  elsif ($how eq 'left') {
    # left outer join
    $all_idx = setops($left_idx, 'OR', setops($left_idx, 'AND', $right_idx));
  }
  elsif ($how eq 'right') {
    # right outer join
    $all_idx = setops($right_idx, 'OR', setops($left_idx, 'AND', $right_idx));
  }
  else {
    die "invalid how='$how'";
  }

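  # Locate each output index value in the left index; vsearch_match
  # returns a negative number where there is no match.
  my $new_left_idx = vsearch_match($all_idx, $left_idx);
  # Mark the misses as BAD so the lookup below yields BAD for them.
  $new_left_idx = $new_left_idx->setbadif($new_left_idx < 0);
  my $new_left = $left_data->index1d($new_left_idx);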

  my $new_right_idx = vsearch_match($all_idx, $right_idx);
  $new_right_idx = $new_right_idx->setbadif($new_right_idx < 0);
  my $new_right = $right_data->index1d($new_right_idx);

  say "OUT.I($how):", $all_idx;
  say "OUT.L($how):", $new_left;
  say "OUT.R($how):", $new_right;
}

my $lidx  = pdl(long,   [0,   1,   2,        4,   5,   6       ]);
my $ldata = pdl(double, [1.1, 2.2, 3.3,      4.4, 5.5, 6.6     ]);
my $ridx  = pdl(long,   [     1,   2,   3,   4,        6,   7  ]);
my $rdata = pdl(double, [     9.7, 9.6, 9.5, 9.4,      9.3, 9.2]);

say "IN.L:", $lidx, " ", $ldata;
say "IN.R:", $ridx, " ", $rdata;

join_demo($lidx, $ldata, $ridx, $rdata, 'outer');
join_demo($lidx, $ldata, $ridx, $rdata, 'inner');
join_demo($lidx, $ldata, $ridx, $rdata, 'left');
join_demo($lidx, $ldata, $ridx, $rdata, 'right');

The output:

IN.L:[0 1 2 4 5 6] [1.1 2.2 3.3 4.4 5.5 6.6]
IN.R:[1 2 3 4 6 7] [9.7 9.6 9.5 9.4 9.3 9.2]

OUT.I(outer):[0 1 2 3 4 5 6 7]
OUT.L(outer):[1.1 2.2 3.3 BAD 4.4 5.5 6.6 BAD]
OUT.R(outer):[BAD 9.7 9.6 9.5 9.4 BAD 9.3 9.2]

OUT.I(inner):[1 2 4 6]
OUT.L(inner):[2.2 3.3 4.4 6.6]
OUT.R(inner):[9.7 9.6 9.4 9.3]

OUT.I(left): [0 1 2 4 5 6]
OUT.L(left): [1.1 2.2 3.3 4.4 5.5 6.6]
OUT.R(left): [BAD 9.7 9.6 9.4 BAD 9.3]

OUT.I(right):[1 2 3 4 6 7]
OUT.L(right):[2.2 3.3 BAD 4.4 6.6 BAD]
OUT.R(right):[9.7 9.6 9.5 9.4 9.3 9.2]
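For readers without PDL installed, the same index alignment can be sketched with plain Perl hashes. This is a different technique from the setops/vsearch_match approach above, not a port of it: `align_on_index` is a hypothetical name, `undef` stands in for PDL's BAD, and the sketch assumes index values are unique within each side.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl analogue of join_demo; undef stands in for BAD.
# Assumes each index holds unique values.
sub align_on_index {
    my ($left_idx, $left_data, $right_idx, $right_data, $how) = @_;

    # Map index value => data value for each side.
    my (%left, %right);
    @left{@$left_idx}   = @$left_data;
    @right{@$right_idx} = @$right_data;

    # Choose the output index according to the join type.
    my @all_idx;
    if    ($how eq 'inner') { @all_idx = grep { exists $right{$_} } keys %left }
    elsif ($how eq 'outer') { my %u = (%left, %right); @all_idx = keys %u }
    elsif ($how eq 'left')  { @all_idx = keys %left }
    elsif ($how eq 'right') { @all_idx = keys %right }
    else                    { die "invalid how='$how'" }

    @all_idx = sort { $a <=> $b } @all_idx;

    # Missing keys yield undef, the stand-in for BAD.
    return (
        \@all_idx,
        [ map { $left{$_} } @all_idx ],
        [ map { $right{$_} } @all_idx ],
    );
}

my ($idx, $l, $r) = align_on_index(
    [0, 1, 2, 4, 5, 6], [1.1, 2.2, 3.3, 4.4, 5.5, 6.6],
    [1, 2, 3, 4, 6, 7], [9.7, 9.6, 9.5, 9.4, 9.3, 9.2],
    'left',
);
my $show = sub { join ' ', map { defined $_ ? $_ : 'BAD' } @{ $_[0] } };
print "I: ", $show->($idx), "\n";   # I: 0 1 2 4 5 6
print "L: ", $show->($l),   "\n";   # L: 1.1 2.2 3.3 4.4 5.5 6.6
print "R: ", $show->($r),   "\n";   # R: BAD 9.7 9.6 9.4 BAD 9.3
```

The printed values agree with the OUT.*(left) lines above. The hash version gives up PDL's vectorised speed, so it only illustrates the semantics.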

kmx commented Oct 5, 2015

Hi Zaki,

UPDATE on this issue: after some experiments with adding join functionality to your Data::Frame, I have decided that it would be easier to implement a separate proof-of-concept module that is not a "full" data frame but "only" a time series object. That means the index is mandatory and must always be a PDL::DateTime object (because that is my use case).

The code is available here https://gist.github.com/kmx/c46859f002b93a6c3683 - it is called PDL::TS, but it is currently just code (no tests, no docs, no plan for a CPAN release). At least it might serve as inspiration if you want to add similar functionality to Data::Frame.

--kmx
