GOAL

  • PDFs are serialized into Avro
  • the Avro data is distributed as a Spark RDD across X partitions
  • each partition is collected and stored as a CSV part
  • the CSV parts are then merged and compressed
  • the archive goes back to the application server, which loads the PostgreSQL table (a Spark sketch of these steps follows)
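
A minimal sketch of this pipeline in Spark/Scala, assuming each Avro record holds an id string and the raw PDF bytes in a content field, and that extraction uses PDFBox; the field names, schema, and CSV escaping are illustrative assumptions, not the project's actual code. The three arguments match the ones passed on the spark-submit line in USE below.

```scala
import java.nio.ByteBuffer

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.{SparkConf, SparkContext}

object PdfExtractorSketch {
  def main(args: Array[String]): Unit = {
    val Array(inputAvro, outputCsv, numPartitions) = args
    val sc = new SparkContext(new SparkConf().setAppName("pdf-extractor"))

    // Read the Avro container files from HDFS as an RDD of generic records.
    val pdfs = sc
      .newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](inputAvro)
      // Copy the fields out immediately: Hadoop input formats reuse the key object.
      .map { case (key, _) =>
        val record = key.datum()
        val buf = record.get("content").asInstanceOf[ByteBuffer]
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        (record.get("id").toString, bytes)
      }
      .repartition(numPartitions.toInt) // the X partitions mentioned above

    // Extract the text of each PDF and emit one CSV line per document.
    val csvLines = pdfs.map { case (id, bytes) =>
      val doc = PDDocument.load(bytes)
      try {
        val text = new PDFTextStripper().getText(doc)
        // Naive escaping for the sketch: drop separators and line breaks.
        s"$id,${text.replaceAll("[\\r\\n,]", " ")}"
      } finally doc.close()
    }

    // One part-* file per partition: these are the CSV parts that get
    // merged and compressed before loading into PostgreSQL.
    csvLines.saveAsTextFile(outputCsv)
    sc.stop()
  }
}
```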

PERFORMANCE

  • 50 million PDFs (3 pages on average) were transformed and dumped to text in 2 hours of runtime

BUILD

  • make build

USE (yarn)

  1. transform the PDFs to Avro (see the PdfAvro folder; a sketch of this step follows the list)
  2. push the 2 jars to the Spark cluster
  3. spark-submit --jars wind-pdf-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar --driver-java-options "-Dlog4j.configuration=file:log4jmaster" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4jslave" --num-executors 120 --executor-cores 1 --master yarn pdfextractor_2.11-0.1.0-SNAPSHOT.jar inputAvroHdfsFolder/ outputCsvHdfsFolder/ 400
  4. it is crucial to set only one executor core (--executor-cores 1), so that each executor JVM runs a single extraction task at a time
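
A rough idea of what the PdfAvro serialization step looks like, reusing the assumed id/content schema from the sketch above; this is not the actual PdfAvro code, just a plain Avro container writer:

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.file.Files

import org.apache.avro.SchemaBuilder
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

object PdfToAvroSketch {
  def main(args: Array[String]): Unit = {
    val Array(pdfDir, avroOut) = args

    // Illustrative schema: one record per PDF, an id plus the raw bytes.
    val schema = SchemaBuilder
      .record("Pdf").namespace("sketch")
      .fields()
      .requiredString("id")
      .requiredBytes("content")
      .endRecord()

    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File(avroOut))

    // Pack every PDF of the input directory into a single Avro container file.
    for (pdf <- new File(pdfDir).listFiles().filter(_.getName.endsWith(".pdf"))) {
      val record = new GenericData.Record(schema)
      record.put("id", pdf.getName)
      record.put("content", ByteBuffer.wrap(Files.readAllBytes(pdf.toPath)))
      writer.append(record)
    }
    writer.close()
  }
}
```

The resulting container file is what gets pushed to HDFS (inputAvroHdfsFolder/) and read back by the Spark job.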

CONFIGURATION

  • ulimit -n 64000 (the default of 1024 is far too low)

READING