Spark patch #139
spark/.gitkeep
Outdated
@@ -0,0 +1 @@
looks like this file could be removed
StructField("RefererHash", LongType, nullable = false),
StructField("URLHash", LongType, nullable = false),
StructField("CLID", IntegerType, nullable = false))
)
there is no way to create an index, right?
Yes, no index support
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructField.html
val timeElapsed = (end - start) / 1000000
println(s"Query $itr | Time: $timeElapsed ms")
itr += 1
})
Please upload the results.
Uploaded in the log.txt file.
spark/benchmark.sh
Outdated
# For Spark3.0.1 installation:
# wget --continue https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
tar -xzf spark*
spark/benchmark.sh
Outdated
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
#gzip -d hits.tsv.gz
chmod 777 ~ hits.tsv
$HADOOP_HOME/bin/hdfs dfs -put hits.tsv /
But how do I set this variable?
$ echo $HADOOP_HOME
I cannot find it:
find spark-3.5.0-bin-hadoop3 -name hdfs
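A minimal sketch of one way to handle this. The empty `find` result suggests the Spark tarball does not ship an `hdfs` command, so `HADOOP_HOME` would have to point at a separately unpacked Hadoop distribution. The fallback path `$HOME/hadoop` and the function name `resolve_hadoop_home` are hypothetical and not part of the PR.

```shell
#!/bin/sh
# Sketch: resolve HADOOP_HOME before calling the hdfs CLI.
# HADOOP_DIR_DEFAULT is a hypothetical install location, not taken from the PR.
HADOOP_DIR_DEFAULT="$HOME/hadoop"

# Print the directory to use as HADOOP_HOME, or fail if it has no hdfs binary.
resolve_hadoop_home() {
    candidate="${HADOOP_HOME:-$HADOOP_DIR_DEFAULT}"
    if [ -x "$candidate/bin/hdfs" ]; then
        printf '%s\n' "$candidate"
    else
        echo "hdfs not found under $candidate/bin; install Hadoop separately" >&2
        return 1
    fi
}
```

A script could then run `HADOOP_HOME=$(resolve_hadoop_home) || exit 1` and `export HADOOP_HOME` before the `hdfs dfs -put` step, so a missing Hadoop installation fails early with a clear message.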
Added Spark and HDFS deployment details to the benchmark.sh script. Added an example log.txt file from an HPC environment.
The script
Should be:
|
|
Updated the creation of the Spark and HDFS directories.
I started editing your script to make it self-sufficient, but after fixing the errors, it does not work.
Then:
PS. The current version of the script is:
|
@alexey-milovidov We assume that a passwordless SSH connection is defined on localhost (in other words, if we will use Please clarify the following details:
|
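For reference, the passwordless SSH assumption mentioned above can be sketched as follows. This is an illustration only: the function name `setup_local_ssh_key` and its optional directory argument are hypothetical, and the PR's script may set this up differently.

```shell
#!/bin/sh
# Sketch: enable key-based (passwordless) SSH to localhost for the current user.
# The optional argument overrides ~/.ssh; it exists only to make the sketch testable.
setup_local_ssh_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    # Generate a key with an empty passphrase unless one already exists.
    [ -f "$ssh_dir/id_rsa" ] || ssh-keygen -q -t rsa -N '' -f "$ssh_dir/id_rsa"
    # Authorize the key once; grep avoids duplicate entries on re-runs.
    touch "$ssh_dir/authorized_keys"
    grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys" ||
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
}
```

On a fresh VM, calling `setup_local_ssh_key` with no arguments and then `ssh localhost true` is a quick way to check whether the assumption holds, provided an SSH server is running.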
I do it this way: create a fresh VM on AWS and run the commands one by one.
@DoubleMindy, let's continue.
Added full HDFS deployment; on a "fresh" VM there is no problem putting the file.
Sorry, but the script still does not reproduce. I'm copy-pasting the commands one by one, and getting this:
We need a reproducible script to install Spark. It should run by itself.
@DoubleMindy It would be appreciated if you continue with this, but for now I'll close the PR (for cleanup reasons).
I'm very interested in the results of Spark, but we need at least one person who can install it. If a system cannot be easily installed, it is game over.
SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10;
SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10;
SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10;
SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10;
Here, the OFFSET clause was removed, which is incorrect.
It should be either LIMIT 1010 to get the closest result or subqueries with ROW_NUMBER.
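The ROW_NUMBER variant could look roughly like the sketch below, which prints a rewritten form of the last query in the diff. The offset of 1000 is inferred from the suggested `LIMIT 1010` (10 + 1000); the function name `offset_query` and the aliases `grouped`, `numbered`, and `rn` are hypothetical.

```shell
#!/bin/sh
# Sketch: emit a ROW_NUMBER-based replacement for "LIMIT <limit> OFFSET <offset>".
# Usage: offset_query LIMIT OFFSET
offset_query() {
    limit=$1
    offset=$2
    last=$((offset + limit))   # last row number to keep
    cat <<EOF
SELECT M, PageViews FROM (
  SELECT M, PageViews, ROW_NUMBER() OVER (ORDER BY M) AS rn FROM (
    SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews
    FROM hits
    WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15'
      AND IsRefresh = 0 AND DontCountHits = 0
    GROUP BY DATE_TRUNC('minute', EventTime)
  ) grouped
) numbered
WHERE rn > $offset AND rn <= $last
ORDER BY rn;
EOF
}
```

Running `offset_query 10 1000` prints a query that keeps rows 1001 through 1010 of the ordered result. The `LIMIT 1010` alternative is simpler, but it returns all 1010 rows, so the first 1000 would still have to be discarded by the caller.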
No description provided.