Job fails with IndexOutOfBoundsException and ArrowBuf errors

Using groupBy with applyInPandas can result in an Apache Arrow buffer estimate error.

Written by ash

March 3, 2023

Problem

Jobs fail intermittently with a java.lang.IndexOutOfBoundsException error in ArrowBuf.

Example stack trace:

Py4JJavaError: An error occurred while calling o617.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Cause

This is due to an Apache Arrow buffer estimate issue. Apache Arrow is an in-memory columnar data format that Spark uses to efficiently transfer data between the JVM and Python processes.

When groupBy is used with applyInPandas, it can result in this error.
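
For reference, this is a minimal sketch of the groupBy plus applyInPandas pattern involved. The DataFrame, column names, and schema here are hypothetical; real failures typically involve much larger groups than this example.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the error tends to surface only on large groups.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)],
    ["key", "value"],
)

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Called once per group. Each group is transferred to the Python
    # worker through Apache Arrow, which is where the buffer estimate
    # issue can surface.
    return pdf.assign(value=pdf.value - pdf.value.mean())

result = df.groupBy("key").applyInPandas(
    subtract_mean, schema="key string, value double"
)
result.show()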

For more details, review the ARROW-15983 issue on the Apache website.

Solution

Since the failure is intermittent, retrying the job often succeeds.

If retrying does not work, you can resolve the issue by adding the following line to the Spark config (AWS | Azure | GCP) on your cluster:

spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true
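
Alternatively, a sketch of setting the same flag from a notebook is shown below, assuming the flag takes effect when set at the session level; editing the cluster's Spark config is the documented approach.

# Session-level equivalent of the cluster Spark config entry above
# (assumption: this flag can be applied at the session level).
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "true",
)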

Info

Enabling spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled may result in a performance impact, so it should only be used when needed. It should not be a default setting for the cluster.

