Authoring applications store metadata in the documents they create, such as the user name, comments, creation date, and modification date. Today we will write a Python script to extract metadata from a PDF file.
Before that, you have to install the pyPdf Python module. Open a terminal and type:
pip install pypdf
pyPdf offers the ability to extract document information and to split, merge, crop, encrypt, and decrypt documents. The script below is written for Python 2 and imports the module as pyPdf.
import pyPdf
import optparse
from pyPdf import PdfFileReader

def printMeta(fileName):
    # Open the PDF in binary mode and read its document information
    pdfFile = PdfFileReader(file(fileName, 'rb'))
    docInfo = pdfFile.getDocumentInfo()
    print '[*] PDF MetaData For: ' + str(fileName)
    # Print every metadata entry as key:value
    for metaItem in docInfo:
        print '[+] ' + metaItem + ':' + docInfo[metaItem]

def main():
    parser = optparse.OptionParser('usage %prog -F <PDF file name>')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify PDF file name')
    (options, args) = parser.parse_args()
    fileName = options.fileName
    # Without -F there is nothing to read, so show the usage and stop
    if fileName == None:
        print parser.usage
        exit(0)
    else:
        printMeta(fileName)

if __name__ == '__main__':
    main()
First we import the pyPdf module, then the optparse module. The script defines two functions:
(1)main
(2)printMeta
(1)main :-
The first few lines set up the usage message and define the -F option, which takes the file name. Whatever file name the user supplies is saved to the fileName variable. If no file name is supplied (fileName is None), the script prints the usage message and stops execution.
If a file name is supplied with the correct argument, it calls the second function, printMeta.
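The branch described above can be tried on its own. The following is a minimal sketch, assuming Python 3 (optparse is still in the standard library there). The helper name parse_file_name and the sample file name report.pdf are hypothetical and are not part of the original script.

```python
import optparse


def parse_file_name(argv):
    """Return the value given to -F, or None when the option is missing."""
    parser = optparse.OptionParser('usage %prog -F <PDF file name>')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify PDF file name')
    (options, args) = parser.parse_args(argv)
    return options.fileName


# Same decision as in main(): no -F means "show usage and stop".
# 'report.pdf' is only a sample name; no file is opened here.
for argv in (['-F', 'report.pdf'], []):
    file_name = parse_file_name(argv)
    if file_name is None:
        print('no file name given: print usage and exit')
    else:
        print('would call printMeta(%r)' % file_name)
```

Note that this check only tells whether -F was given. It does not test whether the file exists on disk.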
(2)printMeta(fileName):
pdfFile = PdfFileReader(file(fileName, 'rb')) :- Opens the PDF file in binary read mode and saves the reader object to pdfFile.
docInfo = pdfFile.getDocumentInfo() :- Gets the document information from the PDF file and saves it to docInfo.
print '[*] PDF MetaData For: ' + str(fileName) :- Prints [*] PDF MetaData For: followed by the file name.
for metaItem in docInfo:
    print '[+] ' + metaItem + ':' + docInfo[metaItem]
The loop above prints each metadata entry, one by one, from the information extracted from the document and saved to docInfo.
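For readers on Python 3, the same steps might look roughly like the sketch below. It assumes the current pypdf package, where the reader class is PdfReader and the document information is available as the metadata attribute. The names format_meta and print_meta and the sample values are hypothetical, chosen only for illustration.

```python
def format_meta(file_name, doc_info):
    """Build the same lines that printMeta prints, as a list of strings."""
    lines = ['[*] PDF MetaData For: ' + str(file_name)]
    for meta_item in doc_info:
        # str() guards against values that are not plain strings.
        lines.append('[+] ' + str(meta_item) + ':' + str(doc_info[meta_item]))
    return lines


def print_meta(file_name):
    # Assumption: the current pypdf package is installed (Python 3 only).
    from pypdf import PdfReader
    reader = PdfReader(file_name)
    # metadata can be None when the PDF has no information dictionary.
    doc_info = reader.metadata or {}
    for line in format_meta(file_name, doc_info):
        print(line)


# Hypothetical sample values, only to show the output format.
sample = {'/Author': 'alice', '/Producer': 'ExampleWriter 1.0'}
for line in format_meta('filename.pdf', sample):
    print(line)
```

The formatting is kept apart from the file reading so that the loop can be tried with a plain dictionary, without a PDF file at hand. Calling print_meta('filename.pdf') would read a real file.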
Usage of script:-
chmod +x script_name
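./script_name -F filename.pdf
Running the script directly like this needs a shebang line such as #!/usr/bin/env python at the top of the file. Otherwise run it as python script_name -F filename.pdf.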