Skip to content

Commit

Permalink
Merge branch 'master' into vincent-dataset-explorer-create-workflow
Browse files Browse the repository at this point in the history
  • Loading branch information
vthai321 authored May 28, 2024
2 parents 45cf43b + 5d03929 commit cc093b6
Show file tree
Hide file tree
Showing 218 changed files with 3,803 additions and 1,266 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/github-action-build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,16 @@ jobs:
steps:
- name: Checkout Texera
uses: actions/checkout@v2
- name: Set up R for R-UDF
uses: r-lib/actions/setup-r@v2
with:
r-version: '4.3.3'
- name: Install R dependencies for R-UDF
run: |
Rscript -e 'install.packages("dplyr", repos = "http://cran.rstudio.com/")'
Rscript -e 'install.packages("arrow", version = "14.0.0.1", repos = "http://cran.rstudio.com/")'
Rscript -e 'install.packages("reticulate", version = "1.36.1", repos = "http://cran.rstudio.com/")'
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
Expand Down
17 changes: 11 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,12 @@
</p>
</p>
<p align="center">
<img alt="Static Badge" src="https://img.shields.io/badge/Users-287-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Users-332-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Projects-86-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Workflows-1,635-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Executions-22K-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Workflow_Versions-273K-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Deployments-4-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Workflows-2,257-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Executions-31K-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Workflow_Versions-357K-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Deployments-7-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/Largest_Deployment-100_nodes,_400_cores-green">
</p>

Expand Down Expand Up @@ -62,6 +62,8 @@ The following is a workflow formulated using the Texera GUI in a Web browser, wh
![Sample Texera Workflow](https://user-images.githubusercontent.com/12926365/171459157-1792971d-a31f-49e7-ab98-6f3b9ead9f5b.png)

## Publications (Computer Science):
* (3/2024) Demonstration of Udon: Line-by-line Debugging of User-Defined Functions in Data Workflows, Yicong Huang, Zuozhi Wang, and Chen Li, to appear in SIGMOD 2024 Demo.
* (2/2024) Data Science Tasks Implemented with Scripts versus GUI-Based Workflows: The Good, the Bad, and the Ugly, Alexander K Taylor, Yicong Huang, Junheng Hao, Xinyuan Lin, Xiusi Chen, Wei Wang, and Chen Li, to appear in ICDE 2024 DataPlat Workshop.
* (8/2023) Building a Collaborative Data Analytics System: Opportunities and Challenges, Zuozhi Wang, Chen Li, in Tutorial at VLDB 2023 [PDF](https://www.vldb.org/pvldb/vol16/p3898-wang.pdf), [Slides](https://chenli.ics.uci.edu/files/vldb2023-texera-tutorial.pdf).
* (8/2023) Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control, Yicong Huang, Zuozhi Wang, and Chen Li, in SIGMOD 2024 [PDF](https://dl.acm.org/doi/10.1145/3626712).
* (8/2023) Improving Iterative Analytics in GUI-Based Data-Processing Systems with Visualization,
Expand Down Expand Up @@ -90,7 +92,7 @@ The following is a workflow formulated using the Texera GUI in a Web browser, wh
* (4/2021) Why Do People Oppose Mask Wearing? A Comprehensive Analysis of US Tweets During the COVID-19 Pandemic, Lu He, Changyang He, Tera Leigh Reynolds, Qiushi Bai, Yicong Huang, Chen Li, Kai Zheng, and Yunan Chen, in JAMIA 2021 [PDF](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7989302/pdf/ocab047.pdf).

## Videos

* [dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data & Workflow-based Analysis" 04/26/2024](https://www.youtube.com/watch?v=B81iMFS5fPc)
* [Texera demo in VLDB 2020](https://www.youtube.com/watch?v=SP-XiDADbw0)
* [Amber engine presentation in VLDB 2020](https://www.youtube.com/watch?v=T5ShFRfHmgI)
* See [Texera in action](https://www.youtube.com/watch?v=NXfynBUwdVg).
Expand All @@ -110,4 +112,7 @@ To try our collaborative data analytics in _Demonstration of Collaborative and I

This project is supported by the <a href="http://www.nsf.gov">National Science Foundation</a> under the awards [III 1745673](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1745673), [III 2107150](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2107150), AWS Research Credits, and Google Cloud Platform Education Programs.

* <a href="https://www.niddk.nih.gov/"><img src="https://github.com/Texera/texera/assets/17627829/d279897a-3efb-41c1-b2d3-8fd20c800ad7" alt="NIH NIDDK" height="30"/></a> This project is supported by NIH NIDDK.


* <a href="http://www.yourkit.com"><img src="https://www.yourkit.com/images/yklogo.png" alt="Yourkit" height="30"/></a> [Yourkit](https://www.yourkit.com/) has given an open source license to use their profiler in this project.
4 changes: 3 additions & 1 deletion core/amber/operator-requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,6 @@ wordcloud
plotly
praw
pillow
pybase64
pybase64
torch
scikit-learn
5 changes: 4 additions & 1 deletion core/amber/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -24,4 +24,7 @@ python-lsp-server[all]==1.5.0
python-lsp-server[websockets]
bidict==0.22.0
cached_property
psutil
psutil
transformers
rpy2==3.5.11
rpy2-arrow==0.0.8
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,8 @@ message LinkOrdinal {

message InitializeExecutorV2 {
string code = 1;
bool is_source = 2;
string language = 2;
bool is_source = 3;
}

message UpdateExecutorV2 {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,5 +7,7 @@ class InitializeExecutorHandler(ControlHandler):
cmd = InitializeExecutorV2

def __call__(self, context: Context, command: cmd, *args, **kwargs):
context.executor_manager.initialize_executor(command.code, command.is_source)
context.executor_manager.initialize_executor(
command.code, command.is_source, command.language
)
return None
Original file line number Diff line number Diff line change
Expand Up @@ -105,20 +105,33 @@ def is_concrete_operator(cls: type) -> bool:
and not inspect.isabstract(cls)
)

def initialize_executor(self, code: str, is_source: bool) -> None:
def initialize_executor(self, code: str, is_source: bool, language: str) -> None:
"""
Initialize the executor with the given code. The output schema is
decided by the user.
:param code: The string version of python code, containing one Operator
:param code: The string version of the code, containing one Operator
class declaration.
:param is_source: Indicating if the operator is used as a source operator.
:param language: The language of the operator code.
:param output_schema: the raw mapping of output schema, name -> type_str.
:return:
"""
executor: type(Operator) = self.load_executor_definition(code)
self.executor = executor()
self.executor.is_source = is_source
if language == "r":
# Have to import it here and not at the top in case R_HOME from udf.conf
# is not defined, otherwise an error will occur
# If R_HOME is not defined and rpy2 cannot find the
# R_HOME environment variable, an error will occur here
from core.models.RTableExecutor import RTableSourceExecutor, RTableExecutor

if is_source:
self.executor = RTableSourceExecutor(code)
else:
self.executor = RTableExecutor(code)
else:
executor: type(Operator) = self.load_executor_definition(code)
self.executor = executor()
self.executor.is_source = is_source
assert (
isinstance(self.executor, SourceOperator) == self.executor.is_source
), "Please use SourceOperator API for source operators."
Expand Down
114 changes: 114 additions & 0 deletions core/amber/src/main/python/core/models/RTableExecutor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
import pyarrow as pa
import rpy2.robjects as robjects
from rpy2_arrow.arrow import rarrow_to_py_table, converter as arrow_converter
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import localconverter as local_converter
import typing
from typing import Iterator, Optional, Union
from core.models import ArrowTableTupleProvider, Tuple, TupleLike, Table, TableLike
from core.models.operator import SourceOperator, TableOperator


class RTableExecutor(TableOperator):
"""
An executor that can execute R code on Arrow tables.
"""

is_source = False

_arrow_to_r_dataframe = robjects.r(
"function(table) { return (as.data.frame(table)) }"
)

_r_dataframe_to_arrow = robjects.r(
"""
library(arrow)
function(df) { return (arrow::as_arrow_table(df)) }
"""
)

def __init__(self, r_code: str):
"""
Initialize the RTableExecutor with R code.
Args:
r_code (str): R code to be executed.
"""
super().__init__()
with local_converter(default_converter):
self._func: typing.Callable[[pa.Table], pa.Table] = robjects.r(r_code)

def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
"""
Process an input Table using the provided R function.
The Table is represented as a pandas.DataFrame.
:param table: Table, a table to be processed.
:param port: int, input port index of the current Tuple.
Currently unused in R-UDF
:return: Iterator[Optional[TableLike]], producing one TableLike object at a
time, or None.
"""
input_pyarrow_table = pa.Table.from_pandas(table)
with local_converter(arrow_converter):
input_r_dataframe = RTableExecutor._arrow_to_r_dataframe(
input_pyarrow_table
)
output_r_dataframe = self._func(input_r_dataframe, port)
output_rarrow_table = RTableExecutor._r_dataframe_to_arrow(
output_r_dataframe
)
output_pyarrow_table = rarrow_to_py_table(output_rarrow_table)

for field_accessor in ArrowTableTupleProvider(output_pyarrow_table):
yield Tuple(
{name: field_accessor for name in output_pyarrow_table.column_names}
)


class RTableSourceExecutor(SourceOperator):
"""
A source operator that produces an R Table or Table-like object using R code.
"""

is_source = True
_source_output_to_arrow = robjects.r(
"""
library(arrow)
function(source_output) {
return (arrow::as_arrow_table(as.data.frame(source_output)))
}
"""
)

def __init__(self, r_code: str):
"""
Initialize the RTableSourceExecutor with R code.
Args:
r_code (str): R code to be executed.
"""
super().__init__()
# Use the local converter from rpy2 to load in the R function given by the user
with local_converter(default_converter):
self._func = robjects.r(r_code)

def produce(self) -> Iterator[Union[TupleLike, TableLike, None]]:
"""
Produce Table using the provided R function.
Used by the source operator only.
:return: Iterator[Union[TupleLike, TableLike, None]], producing
one TupleLike object, one TableLike object, or None, at a time.
"""
with local_converter(arrow_converter):
output_table = self._func()
output_rarrow_table = RTableSourceExecutor._source_output_to_arrow(
output_table
)
output_pyarrow_table = rarrow_to_py_table(output_rarrow_table)

for field_accessor in ArrowTableTupleProvider(output_pyarrow_table):
yield Tuple(
{name: field_accessor for name in output_pyarrow_table.column_names}
)
Loading

0 comments on commit cc093b6

Please sign in to comment.