Scriptis mainly has the following features:
- Workspace: Used to store scripts, data and log files. Supports creating sql, hive, scala, python and pyspark scripts.
- Dataset Module: Displays datasets and tables based on user permissions.
- UDF Module: UDFs are functions that can be used in sql and hql scripts. This module provides the capability to manage, share and load functions.
- Function Module: Composed of personal, system and shared user-defined functions. These functions can be used in python, pyspark and scala scripts.
- HDFS Module: The user's personal directory on HDFS (distributed filesystem), used to store large files.
- Script Module: Supports editing, running and stopping scripts. Custom variable configuration and shortcut keys are also supported.
- Results: Includes displaying, downloading and exporting results.
- Script History: Displays the running history of scripts.
- Console: Users can access settings, global history, the resource manager, global variables and FAQs here.
- Bottom-right pop-up box: Includes the task, engine and queue managers.
These functions are described in detail below.
The workspace is a file directory that a user has full permissions on. Here, a user can perform various file-management operations. The recommended directory structure is script, data, log and res, since it is clear and makes it easy for users to check and manage files. The major functions of the workspace are listed below:
- Right-clicking on the workspace lets a user copy its path, create a directory, create a script or refresh.
- At the top of this module there is a search box for quick searching.
- Supports creating the following kinds of scripts:
  - sql: Corresponds to SparkSQL in the Spark engine. Syntax guide: https://docs.databricks.com/spark/latest/spark-sql/index.html
  - hql: Corresponds to the Hive engine. Syntax guide: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  - Scala: Corresponds to Scala in the Spark engine. Syntax guide: https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html
  - JDBC: Standard SQL syntax; not supported yet.
  - Python: Standalone Python engine, compatible with Python.
  - PythonSpark: Corresponds to Python in the Spark engine. Syntax guide: https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
- Right-clicking on the script folder and the files under it lets users rename them, delete them, open them on the right side, or export them to hive (csv, txt, excel files) and hdfs.
The dataset module has the following functions:
- Get information about datasets, tables and fields.
- Right-clicking on a table and selecting the query table option quickly generates a temporary hive script for data lookup (see the sketch after this list).
- Right-clicking on a table and selecting the describe table option displays detailed information about the table and its fields and partitions.
- Right-clicking on a table and selecting the export table option generates a corresponding csv or excel file.
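As a rough illustration (not necessarily the exact text Scriptis generates), the temporary script produced by the query table option is essentially a simple lookup; the database and table names below are placeholders:

```sql
-- Hypothetical temporary hive script generated by the "query table" option;
-- db_name.table_name stands for the table that was right-clicked.
SELECT * FROM db_name.table_name LIMIT 100;
```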
The UDF module not only makes it easy for users to classify and display UDFs, but also enables users to manage and share them. The major functions are listed below:
Default top-level directories:
- BDAP function: Provided by the platform; can be used in sql, pyspark, scala and hive (written with sql) scripts.
- System function: Functions provided by the system and loaded by default; can be used in sql, pyspark, scala and hive (written with sql) scripts.
- Individual function: Self-defined functions, including general functions and Spark-exclusive functions.
- Sharing function: Functions created by the administrator and then shared with other users.
- Apart from system functions, the other types of functions must be loaded before use, and a user must kill the started session after checking the functions.
In addition, if a function is checked and loaded, it will appear in the auto-completion options accordingly.
It is quite easy to create a new UDF once you have finished the code. The steps are as follows:
- To create a general UDF, a user needs to compile the corresponding Jar package first. "General" means it applies to both hql in Hive and sql in Spark.
- To create a Spark-exclusive UDF, a user needs to create a corresponding Python or Scala script. To ensure correctness, it is better to test the script first.
- Add the UDF to Scriptis:
  - General UDF: Choose general, then select the path of its Jar package in the workspace. Next, fill in the full class path of the UDF and add the format as well as a description.
  - Spark-exclusive UDF written in Scala: Check Spark, then select the corresponding Scala script and fill in the registration format (the function name in the script).
  - Spark-exclusive UDF written in Python: Check Spark, then select the corresponding Python script and fill in the registration format (the function name in the script).

For a Python UDF, a user only needs to define a function, and the script has to correspond with this function:
```python
def hello(id):
    return str(id) + ":hello"
```
The way to create a Scala UDF is quite similar to creating a Python UDF; a user only needs to define a function:
```scala
def helloWord(str: String): String = "hello, " + str
```
Note: Python UDFs and Scala UDFs can only be applied in scripts that correspond to the Spark engine.
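As a minimal sketch of how a loaded UDF might be used (assuming the Python UDF above was registered under the name hello; the database, table and column names are placeholders), it could then be called from a sql script running on the Spark engine:

```sql
-- Hypothetical usage of the loaded UDF "hello";
-- db_name.user_table and user_id are placeholders.
SELECT hello(user_id) AS greeting
FROM db_name.user_table
LIMIT 10;
```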
The function module is similar to the UDF module; the only difference between them is that one holds UDFs and the other holds self-defined functions. Also note that functions defined in Python can only be used in python and pyspark scripts, and similarly, functions defined in Scala can only be used in scala scripts.
The functions of the script module are mainly integrated in the script edit box:
- Script editing: Supports basic keyword highlighting, code formatting, code merging, auto-completion, shortcuts, etc.
- Running and stopping: Users can choose to run only a segment of code or the entire script. By clicking the stop button, users can terminate a running script at any time.
- The script edit box has configuration options for defining user-defined functions that take effect within the script.
The results module has the following functions:
- For now, it supports displaying results in a table, clicking a header to sort, and double-clicking to copy a field name; all of these are restricted to showing at most 5000 rows of records. More functions, such as displaying selected columns and field types, will be supported in the future.
- Visual analysis: Click the visual analysis button to visualize the result through VSBI. (Soon to be released)
- Downloading: Users can directly download the results as csv or excel files via the local browser. Only 5000 rows can be downloaded for now.
- Exporting: Results can be exported to the workspace (the shared directory of BDAP) in either csv or excel format, and are not restricted to 5000 rows if full export is chosen first. To use full export, add the following comment in front of the sql (see the sketch after this list):
  --set wds.linkis.engine.no.limit.allow=true
- Go to Console -> Configuration -> Pipeline -> Import and Export settings -> Result export type to choose whether results are exported in csv or excel format.
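A minimal sketch of a full-export script using the comment above: it goes on the first line of the sql script, ahead of the query whose complete result set is to be exported (the table name is a placeholder):

```sql
--set wds.linkis.engine.no.limit.allow=true
-- db_name.big_table is a placeholder for the table being exported in full.
SELECT * FROM db_name.big_table;
```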
Script history shows all the running information of a script. A user can quickly find the logs and results of a script that was run before and therefore avoid running the same script repeatedly.
The console has the following functions:
- Settings: Include general settings (such as setting up queues) and data-development-related engine settings: spark, hive, python, pipeline, etc.
- Global variables: A global variable is a custom variable that can be applied to all scripts. If a variable defined in a script has the same name as a global variable, the variable in the script takes effect (see the sketch after this list).
- Other functions: Global history, resource manager, FAQs.
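A short sketch of how such a variable could be referenced in a sql script, assuming a global variable named run_env has been defined in the console (the ${...} substitution syntax and the variable, database and table names here are illustrative):

```sql
-- Assumes a global variable "run_env" defined in the console;
-- ${run_env} is replaced with its value before the script runs.
SELECT * FROM db_name.events WHERE env = '${run_env}' LIMIT 100;
```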
Similar to the Windows task manager, the bottom-right pop-up box lets users quickly view and manage tasks, engines and queue resources.