Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

-2
votes
0 answers
21 views

Problem converting a DataFrame to a text file

I have a Spark DataFrame and I need to write it to a text file with a specific format. I have tried to convert my DataFrame to an RDD and then write it as a text file. JavaRDD<Row> a2 =df.rdd()....
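A minimal sketch of an alternative that stays in the DataFrame API (assuming the question's DataFrame is `df`, and that a simple delimited layout is acceptable; the delimiter and output path are placeholders):

```scala
// Sketch (spark-shell): format every row into one delimited string column
// and write it with the text writer instead of dropping to JavaRDD<Row>.
import org.apache.spark.sql.functions._

val formatted = df.select(concat_ws("|", df.columns.map(col): _*).as("value"))
formatted.write.mode("overwrite").text("/tmp/df_as_text")
```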
0
votes
0 answers
14 views

Efficiently calculate top-k elements in spark

I have a dataframe similar to: +---+-----+-----+ |key|thing|value| +---+-----+-----+ | u1| foo| 1| | u1| foo| 2| | u1| bar| 10| | u2| foo| 10| | u2| foo| 2| | u2| bar| 10| +---+...
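A minimal sketch of one common approach, using row_number over a per-key window (k = 2 is assumed for illustration):

```scala
// Sketch (spark-shell): keep the top-k rows per key by value.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  ("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10),
  ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)
).toDF("key", "thing", "value")

val k = 2
val w = Window.partitionBy("key").orderBy(col("value").desc)

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") <= k)
  .drop("rn")
  .show()
```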
0
votes
1 answer
11 views

Is there a way to convert a Scala Spark DataFrame to an HTML table, or to convert the DataFrame to a Scala map, then to JSON, and then to HTML?

I run some tests and get a result which is a small DataFrame, with approx. 3-6 columns and 10-20 rows. Now I want to send this in an email to my colleague, and for ease I want this to be in tabular format as ...
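For a result this small, one sketch (not necessarily what the accepted answer used) is to collect to the driver and build the HTML string by hand:

```scala
// Sketch: render a small DataFrame as an HTML <table> string (only sensible for small results).
import org.apache.spark.sql.DataFrame

def toHtmlTable(df: DataFrame): String = {
  val header = df.columns.map(c => s"<th>$c</th>").mkString("<tr>", "", "</tr>")
  val rows = df.collect()
    .map(row => row.toSeq.map(v => s"<td>${Option(v).getOrElse("")}</td>").mkString("<tr>", "", "</tr>"))
    .mkString("\n")
  s"<table>\n$header\n$rows\n</table>"
}
```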
0
votes
0 answers
7 views

CDC Logic by comparing source record and the most recent target record

I am trying to build a CDC (Change Data Capture) Logic where (for any given key-set) each source record will be compared with the most recent record (of that key) in the target. For any given key, the ...
-2
votes
0 answers
17 views

PySpark: tried several ways to print a pipeline RDD but none worked; u"\nmismatched input '<EOF>'

I have a pipeline RDD, and I tried several ways to print it. I use the map function to apply the function 'nodes' to every element in the data RDD. 'data.map(nodes)' is of type 'pipeline dataframe'. The '...
1
vote
0 answers
14 views

Count operation not working on aggregated IgniteDataFrame

I am working with Apache Spark and Apache Ignite. I have a spark dataset which I wrote in Ignite using following code dataset.write() .mode(SaveMode.Overwrite) ....
4
votes
0 answers
30 views

Hive partitioned table reads all the partitions despite having a Spark filter

I'm using spark with scala to read a specific Hive partition. The partition is year, month, day, a and b scala> spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' ...
0
votes
0 answers
25 views

How to convert string date in first column according to format string from second column [duplicate]

I have an Apache Spark dataframe with two columns. The first column contains string representations of dates, the second a string with the date format. +------------+------------+ | date | format | +------...
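Since the Scala helper to_date(col, fmt) expects a literal format string, one sketch is a UDF that parses per row with java.time (column names date/format from the excerpt; assumes java.time-compatible patterns):

```scala
// Sketch: parse the string in `date` using the pattern carried in `format`, via a UDF.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions._

val parseWithFormat = udf { (s: String, fmt: String) =>
  if (s == null || fmt == null) null
  else java.sql.Date.valueOf(LocalDate.parse(s, DateTimeFormatter.ofPattern(fmt)))
}

val parsed = df.withColumn("parsed_date", parseWithFormat(col("date"), col("format")))
```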
-1
votes
1 answer
17 views

Add column to a Dataset based on the value from Another Dataset

I have a dataset dsCustomer that has the customer details, with columns |customerID|totalAmount| |customer1 | 250 | |customer2 | 175 | |customer3 | 4000 | I have another dataset ...
0
votes
2 answers
38 views

How to reduceByKey in PySpark with custom grouping of rows?

I have a dataframe that looks as below: items_df ====================================================== | customer item_type brand price quantity | |======================================...
0
votes
0 answers
22 views

Where is the sort ordering algorithm in spark source code?

I am doing a simple query: spark.sql("SELECT * FROM mytable ORDER BY age").collect() My question is: In Spark Source Code, where can I find the ordering comparator between rows, which executes this ...
1
vote
0 answers
16 views

How to read an .xls file from AWS S3 using spark in java?

I am trying to read a .xls file from AWS S3 but getting java.io.FileNotFoundException exception. I tried below two approaches. One by giving the path in option() with key location and another by ...
-1
votes
1 answer
45 views

How to add list values into a tuple

I tried to add the values of an existing list into a tuple. It shows no compiler error but throws a runtime error. I hardcoded some list values and tried to add the values into a tuple using Java ...
0
votes
1 answer
34 views

Counting nulls and non-nulls from a dataframe in Pyspark

I have a dataframe in PySpark on which I want to count the nulls in the columns and the distinct values of those respective columns, i.e. the non-nulls. This is the dataframe that I have: trans_date ...
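A minimal sketch of one way to do this (shown in Scala; the same sum/when/isNull functions exist in pyspark.sql.functions):

```scala
// Sketch: null and non-null counts for every column in one aggregation pass.
import org.apache.spark.sql.functions._

val aggs = df.columns.flatMap { c =>
  Seq(
    sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
    sum(when(col(c).isNotNull, 1).otherwise(0)).as(s"${c}_non_nulls")
  )
}

df.agg(aggs.head, aggs.tail: _*).show()
```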
0
votes
0 answers
37 views

Why is Databricks' Spark running slightly faster in Python than in Scala

For my Azure Databricks workspace, I created two notebooks, ExtractorPython and ExtractorScala, which are written in Python and Scala respectively. They each call the notebook, DocumentationPython and ...
0
votes
1 answer
28 views

Applying a complex UDF to a group of records; I think UDFs are needed to solve this

I have to find whenever a particular store changes its brand, and I need to populate the mthid. This should be applied to every store. +------+-----------+---------------+-------------+-------------+ |...
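A sketch of the change detection without a UDF, using lag over a per-store window (column names store, brand and the ordering column month are assumptions, since the excerpt's table is truncated):

```scala
// Sketch: flag rows where a store's brand differs from the previous row's brand.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("store").orderBy("month")

val flagged = df
  .withColumn("prev_brand", lag(col("brand"), 1).over(w))
  .withColumn("brand_changed", col("prev_brand").isNotNull && col("prev_brand") =!= col("brand"))
```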
1
vote
0 answers
16 views

Spark SQL window over an interval between two specified time boundaries - between 3 hours and 2 hours ago

What is the proper way of specifying window interval in Spark SQL, using two predefined boundaries? I am trying to sum up values from my table over a window of "between 3 hours ago and 2 hours ago". ...
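A minimal sketch of one way to express such a frame, using rangeBetween over epoch seconds (column names ts/value are placeholders; add partitionBy as needed):

```scala
// Sketch: sum `value` over rows whose timestamp lies between 3 hours and 2 hours
// before the current row's timestamp.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window
  .orderBy(col("ts").cast("long"))        // order by epoch seconds
  .rangeBetween(-3 * 3600, -2 * 3600)     // frame of [-10800 s, -7200 s] relative to current row

val result = df.withColumn("sum_3h_to_2h_ago", sum(col("value")).over(w))
```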
-1
votes
0 answers
34 views

When to use case class and when to use normal class in Scala?

I am taking data from a source and doing some computation on it, and then I have to write that data in the table form. The table will contain around 20 columns, the value of each column I will pass ...
0
votes
2 answers
49 views

Groupby regular expression Spark Scala

Let us suppose I have a dataframe that looks like this: val df2 = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"}).toDF("values") df2.show() How can I group it by a ...
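A sketch of one approach: extract the job id with regexp_extract and group on the derived column (the pattern assumes the A:job_N prefix shown in the excerpt):

```scala
// Sketch (spark-shell): derive the grouping key with a regex, then group by it.
import org.apache.spark.sql.functions._

val df2 = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3").toDF("values")

df2.withColumn("job", regexp_extract(col("values"), "A:(job_\\d+)", 1))
   .groupBy("job")
   .count()
   .show()
```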
0
votes
1 answer
47 views

Fill null or empty values with the next row's value in Spark

Is there a way to replace null values in a Spark data frame with the next row's non-null value? There is an additional row_count column added for window partitioning and ordering. More specifically, I'd like to ...
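A minimal sketch using the existing row_count column: take the first non-null value in a forward-looking frame (the `value` column name is a placeholder):

```scala
// Sketch: replace nulls in `value` with the next non-null value, scanning forward in row_count order.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window
  .orderBy("row_count")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)

val filled = df.withColumn("value_filled", first(col("value"), ignoreNulls = true).over(w))
```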
0
votes
2 answers
43 views

Convert Dataset of array into DataFrame

Given a Dataset[Array[String]]. In fact, this structure has a single field of array type. Is there any way to convert it into a DataFrame with each array item placed in a separate column? If ...
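If the arrays have a known, fixed length, one sketch is to index into the dataset's single column (named "value" for a Dataset of arrays); `ds` and `n` are assumptions:

```scala
// Sketch: expand a Dataset[Array[String]] into n separate columns,
// assuming every array has the same known length n.
import org.apache.spark.sql.functions._

val n = 3 // assumed fixed array length
val expanded = ds.select((0 until n).map(i => col("value")(i).as(s"col$i")): _*)
```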
0
votes
1 answer
40 views

custom spark does not find hive databases when running on yarn

Starting a custom version of spark on yarn in HDP works fine following the tutorial from https://georgheiler.com/2019/05/01/headless-spark-on-yarn/ i.e. following: # download a current headless ...
-2
votes
1 answer
24 views

How to perform a join operation on PySpark dataframes?

I have two dataframes, dd1 and dd2, and I want to join them. dd1: id name 1 red 2 green 3 yellow 4 black 5 pink 6 blue 7 white 8 grey dd2: id name1 1 banana 2 ...
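A minimal sketch of the join (shown in Scala; `dd1.join(dd2, "id")` is the PySpark equivalent; sample values beyond the excerpt are illustrative):

```scala
// Sketch (spark-shell): inner join dd1 and dd2 on the shared "id" column.
val dd1 = Seq((1, "red"), (2, "green"), (3, "yellow")).toDF("id", "name")
val dd2 = Seq((1, "banana"), (2, "apple")).toDF("id", "name1")

val joined = dd1.join(dd2, Seq("id"), "inner")
joined.show()
```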
0
votes
1 answer
39 views

How do I run SQL SELECT on AWS Glue created Dataframe in Spark?

I have the following job in AWS Glue which basically reads data from one table and extracts it as a CSV file in S3; however, I want to run a query on this table (a SELECT, SUM and GROUP BY) and want to ...
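One sketch: once a Spark DataFrame is obtained from the Glue source (DynamicFrame exposes a toDF() conversion), register a temp view and run the aggregate with spark.sql (table/column names and the S3 path are placeholders):

```scala
// Sketch: run SELECT / SUM / GROUP BY against a DataFrame via a temp view, then write CSV.
df.createOrReplaceTempView("source_table")

val summary = spark.sql(
  """SELECT category, SUM(amount) AS total_amount
    |FROM source_table
    |GROUP BY category""".stripMargin)

summary.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/output/")
```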
0
votes
0 answers
19 views

How to update the coefficients of LinearRegression model from pyspark.ml.regression?

I am building a Pyspark linear model using the LinearRegression from pyspark.ml.regression from pyspark.ml.regression import LinearRegression #I have used VectorAssembler to develop the 'features' ...
0
votes
3 answers
45 views

Add Column to Table with Default Value

I'm trying to add a column to a table (ideally without using a dataframe) with a default value of 'MONTHLY' ALTER TABLE aa_monthly ADD COLUMNS (Monthly_or_Weekly_Indicator string NOT NULL FIRST ...
0
votes
1 answer
33 views

PySpark - Access struct field name and value when exploding

My input data is of this form: [ { "id": 123, "embedded": { "a": { "x": true, "y": 1, }, "b": { "x": false, "y": 2, }, }, }, ...
0
votes
1 answer
30 views

creating a new dataframe based on a column storing index

I am working on using ALS in PySpark to do collaborative filtering. The model is giving prediction results in a dataframe like the one below. CustomerID ProductID Rating 0 4 ...
0
votes
2 answers
30 views

Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe in Spark scala

I'm trying to read a config file with spark.read.textFile, which basically contains my table list. My task is to iterate through the table list and convert Avro to ORC format. Please find my code below ...
-2
votes
1 answer
32 views

How to truncate data and drop all partitions from a Hive table using Spark

How can I delete all data and drop all partitions from a Hive table using Spark 2.3.0? truncate table my_table; // Deletes all data, but keeps partitions in metastore alter table my_table drop ...
-2
votes
0 answers
26 views

How to transform a data frame in spark structured streaming using python?

I am testing structured streaming using localhost from which it reads streams of data. Input streaming data from localhost: ID Subject Marks 1 Maths 85 1 Physics 80 2 Maths ...
0
votes
1 answer
36 views

org.apache.spark.SparkException: Task not serializable — Scala

I am reading a text file and it is fixed width file which I need to convert to csv. My program works fine in local machine but when I run it on cluster, it throws "Task not serializable" exception. I ...
0
votes
0 answers
21 views

NumberFormatException thrown when passed date as lowerBound/upperBound in spark-sql-2.4.1v with ojdbc14.jar?

I have passed lowerBound/upperBound as below Dataset<Row> ss = ora_df_reader .option("inferSchema", true) .option("schema","schema1") .option("numPartitions",...
-1
votes
2 answers
49 views

How to convert an array-like string to an array in spark-dataframe (Scala api)?

I have the following spark dataframe: published data 2019-05-15T10:37:22+00:00 [{"@id":"1","@type":"type","category":"cat"},{"@id":"2","@type":"type","category":"cat1"}] with the following ...
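A sketch of one approach on Spark 2.4+: parse the JSON-array string with from_json into an array of maps (an array of structs with an explicit StructType would work too):

```scala
// Sketch: turn the JSON-array string in `data` into a real array column, then explode it.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(MapType(StringType, StringType))

val parsed = df
  .withColumn("data_array", from_json(col("data"), schema))
  .select(col("published"), explode(col("data_array")).as("item"))
```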
0
votes
1 answer
28 views

Spark window function “rowsBetween” should consider only complete set of rows

I am using "rowsBetween" window function to calculate moving median as below val mm = new MovingMedian var rawdataFiltered = rawdata.withColumn("movingmedian", mm(col("value")).over( Window....
-1
votes
1 answer
33 views

Adding new columns using a for loop in a Spark dataframe

I have a Spark dataframe which is created dynamically. There is also a list of columns which need to be selected from the dataframe. I need to iterate through the list of needed columns and check ...
-6
votes
0 answers
24 views

How to draw bar il [on hold]

I have a requirement for the same
0
votes
1 answer
36 views

Adding dynamic columns based on previous row values with PySpark?

In my case I have a dataframe that shows "Days" horizontally and in the columns the sold units of each hour. However, I would also like to display 26 hours. The first two hours of the previous day ...
0
votes
0 answers
24 views

Spark SQL Functions/expression use broadcast variable

I'm thinking of implementing an expression in Spark in order to do a transformation (currently a UDF), but I want this expression to be able to use a big driver-generated TreeMap (going to do range ...
0
votes
2 answers
35 views

Is there any built-in function in Hive that calculates the intersection of two lists in a Hive table?

I have a Hive table that has 3 columns: ["merchants_index", "weeks_index", "customer_index"]. The final goal is to calculate the percentage of repeat customers for each merchant in each week. By ...
0
votes
1 answer
26 views

Spark: How to build semi-additive metrics or aggregate sum over a portion of a column?

I'm trying to reproduce some of the analytics I do in traditional BI within spark. The technical term used is how to build semi-additive metrics, but it might help if I explain what that means. ...
0
votes
0 answers
12 views

Class Cast Exception from HiveHBaseTableOutputFormat to HiveOutputFormat on Data Insert Into Table

When I attempt to insert data into a table using Spark SQL with the intent of writing to HBase via Hive, I get the following error: Caused by: java.lang.ClassCastException: org.apache.hadoop.hive....
1
vote
2 answers
47 views

Getting the number of rows in a Spark dataframe without counting

I am applying many transformations on a Spark DataFrame (filter, groupBy, join). I want to have the number of rows in the DataFrame after each transformation. I am currently counting the number of ...
-1
votes
1 answer
22 views

Bind columns of 2 different dataframes spark

I have 2 different data frames in spark and I'd like to bind their columns to form a unique data frame. How can I do it using spark scala? Thanks
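Since DataFrames have no intrinsic row order, a common sketch is to give each side an explicit index and join on it (`df1`/`df2` stand for the question's two frames):

```scala
// Sketch: "column-bind" two DataFrames positionally by assigning an index
// with row_number over monotonically_increasing_id, then joining on it.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def withIndex(df: DataFrame): DataFrame = {
  val w = Window.orderBy(monotonically_increasing_id())
  df.withColumn("_idx", row_number().over(w))
}

val bound = withIndex(df1).join(withIndex(df2), Seq("_idx")).drop("_idx")
```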
-1
votes
0 answers
42 views

Add a column to a dataframe and apply a UDF to it

I have two dataframes: one with various data, df_datas, and another with only business days, df_dates. Now what I need is to query a column in df_dates (businessDay) with a column (dateOperation) from ...
0
votes
1 answer
31 views

Spark: rewrite .filter(“count > 1”) without string expression

There is a piece of code in Java: Dataset<Row> dataset = ... ... dataset.groupBy("id").count().filter("count > 1"); Is there a way to set "count > 1" condition using some dataframe ...
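A minimal sketch of the Column-based equivalent (shown in Scala; in Java the predicate is functions.col("count").gt(1)), with `dataset` standing for the question's Dataset:

```scala
// Sketch: express the count > 1 predicate as a Column instead of a string expression.
import org.apache.spark.sql.functions.col

val result = dataset.groupBy("id").count().filter(col("count") > 1)
```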
0
votes
1 answer
63 views

How to do window function inside a when?

I have a dataframe where I am trying to do a window function on an array column. The logic is as follows: Group by (or window partition) the id and filtered columns. Calculate the max score of the rows ...
0
votes
0 answers
18 views

On what basis should we select the number of partitions when repartitioning a DataFrame?

On what basis should we decide the number of partitions while repartitioning a DataFrame, given the memory and cores? Should the value be based on the number of available cores, or a multiple of the number of ...
0
votes
1 answer
30 views

Use non-consuming regular expression in pySpark sql functions [duplicate]

How can I use existing pySpark sql functions to find non-consuming regular expression patterns in a string column? The following is reproducible, but does not give the desired results. import ...
0
votes
1 answer
26 views

How to convert field values into comma-separated values in Azure Databricks SQL

I'm trying to get the field values as comma-separated values in a single cell for each ID. I am using Azure Databricks (SQL). I know we can achieve this in traditional SQL using FOR XML PATH ...
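In Spark SQL the usual substitute for FOR XML PATH is collect_list with concat_ws; a minimal sketch (table/column names are placeholders):

```scala
// Sketch: one comma-separated cell per ID.
// SQL form: SELECT id, concat_ws(',', collect_list(field)) AS fields FROM t GROUP BY id
import org.apache.spark.sql.functions._

val result = df.groupBy("id")
  .agg(concat_ws(",", collect_list(col("field"))).as("fields"))
```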
