It seems something is wrong with the api. 1-7 all result into cross join(cartesian product), except the 8th one.
//1. val fraudsters = new_df.where(new_df("phone").in(old_df("phone"))) //2. val fraudsters = new_df.filter(new_df("phone").in(old_df("phone"))) //3 val fraudsters = new_df.filter(new_df("phone")===old_df("phone")) //4 val fraudsters = new_df.join(old_df, new_df("phone")===old_df("phone"),"inner") //5 val fraudsters = new_df.join(old_df, new_df("phone").equalTo(old_df("phone")),"inner") //6 val fraudsters = new_df.join(old_df).where(new_df("phone") === old_df("phone")) //7 val fraudsters = new_df.join(old_df, $"new_df.phone" === $"old_df.phones") //8 finally, correct! val df1 = new_df.as('df1) val df2 = susperious_number_new_trans.as('df2) val fraudsters = df1.join(df2, col("df1.phone") === col("df2.phone"))
Reference:
http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame
https://github.com/apache/spark/pull/4847/files#r25584861
https://github.com/apache/spark/blob/v1.3.0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
You saved my day!! Thank you!
ReplyDelete