iText PdfTextExtractor Missing Ligatures in Resulting Text
I am attempting to take a pdf file and grab the text from it.
I found iText and have been using it with decent success. The one
problem I have remaining is ligatures.
At first I noticed that I was simply missing characters. After doing some
searches I came across this: http://support.itextpdf.com/node/25
Once I knew that it was ligatures I was missing, I began to search for
ways to solve the problem and haven't been able to come up with a solution
yet.
Here is my code:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class ReadPdf {

    private static final String INPUTFILE =
            "F:/Users/jmack/Webwork/Redglue_PDF/live/ADP/APR/ADP_41.pdf";

    public static void writeTextFile(String fileName, String s) {
        // Expand Unicode ligature code points into their component letters.
        // s = s.replace("\u0063\u006B", "just a test");
        s = s.replace("\uFB00", "ff");
        s = s.replace("\uFB01", "fi");
        s = s.replace("\uFB02", "fl");
        s = s.replace("\uFB03", "ffi");
        s = s.replace("\uFB04", "ffl");
        // U+FB05 is the long-s + t ligature, so it expands to "st", not "ft".
        s = s.replace("\uFB05", "st");
        s = s.replace("\uFB06", "st");
        s = s.replace("\u0132", "IJ");
        s = s.replace("\u0133", "ij");

        // try-with-resources closes the writer even if write() throws.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"))) {
            writer.write(s);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader(INPUTFILE);
            // Extract the text of page 1 only.
            String str = PdfTextExtractor.getTextFromPage(reader, 1,
                    new SimpleTextExtractionStrategy());
            reader.close();
            writeTextFile("F:/Users/jmack/Webwork/Redglue_PDF/live/itext/read_test.txt",
                    str);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
In the PDF referenced above, one line reads:
part of its design difference is a roofline
But when I run the Java class above, the text output contains:
part of its design diference is a roofine
It is interesting to note that when I copy and paste from the PDF into
Stack Overflow's text field, it also looks like the second sentence, with
the two ligatures "ff" and "fl" each reduced to a single "f".
I am hoping that someone here can help me figure out how to catch the
ligatures and perhaps replace them with the characters they represent, as
in the ligature "fl" being replaced with an actual "f" and an "l".
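To make the goal concrete, here is a minimal sketch of the replacement I have in mind, using only the JDK rather than iText. It assumes the ligature code points (U+FB00 and friends) actually survive extraction. LigatureExpander and expand are hypothetical names of my own. Note that NFKC normalization also rewrites other compatibility characters (full-width forms, superscripts and so on), so it is broader than a hand-written replacement table.

```java
import java.text.Normalizer;

// Hypothetical helper (not part of iText): expands ligature code points
// into their component letters via Unicode compatibility normalization.
final class LigatureExpander {

    static String expand(String s) {
        // NFKC = compatibility decomposition followed by canonical composition.
        // It maps U+FB00 to "ff", U+FB01 to "fi", U+FB02 to "fl", and so on.
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // prints "difference and roofline"
        System.out.println(expand("di\uFB00erence and roo\uFB02ine"));
    }
}
```

This only helps if the extracted string contains the ligature characters in the first place.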
I ran some tests on the output from PdfTextExtractor and attempted to
replace the ligature Unicode characters with the actual characters, but
discovered that the Unicode code points for those ligatures do not appear
in the string it returns.
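The check was along these lines. This is a simplified sketch, not my exact test code: CodePointDump and nonAscii are hypothetical names, and the input would be the string returned by the extractor. It lists every non-ASCII code point in a string, so a ligature such as U+FB00 would show up if it were present.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic (not part of iText): reports which non-ASCII
// code points occur in a string.
final class CodePointDump {

    // Returns a "U+XXXX" label for every code point above U+007F, in order.
    static List<String> nonAscii(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints()
         .filter(cp -> cp > 0x7F)
         .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        System.out.println(nonAscii("di\uFB00erence")); // prints [U+FB00]
        System.out.println(nonAscii("diference"));      // prints []
    }
}
```

For the page in question the result contains no ligature code points, which matches the "diference" output above: the characters are simply absent rather than present in a form I failed to match.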
It seems that it must be something in iText itself that is not reading
those ligatures correctly. I am hopeful that someone knows how to work
around that.
Thank you for any help you can give!
TLDR: I am converting a PDF to text with iText and had missing characters.
I discovered they were ligatures, and now I need to capture those
ligatures but am not sure how to go about doing that.