Conf42 Python 2021 - Online

Build Your First Cyber Forensic Application using Python

In this talk, you will learn how to develop your own cyber forensic tool using standard Python library functions and modules.

A recent study by Check Point Research has recorded over 150,000 cyber-attacks every week during the COVID-19 pandemic, an increase of 30% compared to previous weeks. The pandemic has caused job losses and pay cuts and has led to an increase in cybercrime. Examples of cyber-attacks include phishing, ransomware, fake news, fake medicine, extortion, and insider fraud. Cyber forensics is a field that deals with the investigation of digital crimes by analyzing, examining, identifying, and recovering digital evidence from electronic devices and presenting it in a court of law. Python has a great collection of built-in modules for digital forensics tasks. The talk begins with an introduction to digital crimes, digital forensics, the process of investigation, and the collection of evidence. Next, I will cover the various Python modules and built-in functions required to build your first cyber forensic application. The modules covered in the discussion are pyscreenshot, PIL, secrets, argparse, hashlib, os, csv, logging, time, sys, stat, and NLTK. Finally, I will demonstrate, through a code walkthrough, a sample cyber forensic application.

Outline
1. Introduction to digital crimes, digital forensics, the process of investigation, and the collection of evidence
2. Setting up Python for forensics application development
3. Built-in functions and modules for forensic tasks
4. Forensic indexing and searching
5. Forensic evidence extraction
6. Using natural language tools in forensics
7. Code walkthrough of a sample forensic application
8. Conclusion and next steps


  • Gajendra Deshpande: Today I will be presenting a talk on building your first cyber forensics application using Python. I will discuss an introduction to digital crimes, digital forensics, the process of investigation, and the collection of evidence.
  • Next is setting up Python for forensics application development, including installation, and then forensic indexing and searching. You need to use the appropriate version of Python. Are you interested in using graphical tools or just shell commands?
  • Hash functions are very important. They are used basically for validation. You cannot perform forensic analysis on the original data; you need to perform it on a copy of the data. If the hashes differ, then something has been altered.
  • PIL is the Python Imaging Library, used for image processing tasks. The pyscreenshot module allows taking screenshots without installing third-party libraries. Mutagen is a Python module to handle audio metadata. Performance is not the target for this library, or in any cyber forensics activity.
  • The next is pefile, a multi-platform Python module to parse and work with Portable Executable files. One more important concept is using natural language tools, or NLP packages, in Python. In today's talk we have seen how we can create small cyber forensic applications.


This transcript was autogenerated.
Hello everyone, my name is Gajendra Deshpande. Today I will be presenting a talk on building your first cyber forensics application using Python. In today's talk, we are going to discuss an introduction to digital crimes, digital forensics, the process of investigation and the collection of evidence; then setting up Python for forensic application development; built-in functions and modules for forensic tasks; forensic indexing and searching; hash functions for forensics; forensic evidence extraction; metadata forensics; then using natural language tools in forensics; and finally the summary. Let us first look at some cybercrime statistics. The Internet Crime Report for 2019, released by the FBI's Internet Crime Complaint Center (IC3), revealed the top four countries that are victims of Internet crimes. You can see that the USA has more than 400,000 reports, the UK more than 93,000, Canada more than 33,000, and India more than 27,000. Of course, these are only the reported numbers; the unreported numbers are much, much higher, so you can consider them at least three times higher. According to an RSA report from 2015, mobile transactions are growing rapidly and cybercrimes are migrating to less protected soft channels, which are mostly our mobile devices. And most of the time, mobile devices are operated by people who are not well versed with the device and its settings. According to a 2015 Norton report, an estimated 103 million Indians lost about ₹16,000 (around US $200+) on average to cybercrime. According to an article published in the Indian Express on 19 November 2016, over 55% of millennials in India are hit by cybercrime. And a recent study by Check Point Research has recorded over 150,000 cyberattacks every week during the COVID-19 pandemic.
There has been an increase of 30% in cyberattacks compared to previous weeks. That is because many people have lost jobs and many are suffering, and there may be many insiders taking advantage of the situation. Now let us look at the definition of digital forensics. Forensic science is the use of scientific methods or expertise to investigate crimes or examine evidence that might be presented in a court of law. Cyber forensics is the investigation of various crimes happening in cyberspace. Examples of attacks include phishing, ransomware, fake news, fake medicine, extortion and insider fraud. According to DFRWS, the Digital Forensics Research Workshop, digital forensics can be defined as the use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations. The digital forensic investigation process has the following steps: identification, collection, validation, examination, preservation and presentation. In the identification step, whenever an investigating officer, usually a police officer, visits the scene, his job is to first identify all the objects so that he can seize those which help in the investigation of the case. This identification of objects helps in collecting the evidence: he has to collect all the electronic gadgets, including smartphones, laptops, storage devices, and so on. One important thing to note is that there may be some devices, for example a USB drive disguised as a toy,
that are very difficult to identify. He has to identify even such objects and take them into custody. Once the objects are identified, the next step is collection. In the collection of evidence, the investigating officer has to note down the state of the system. If it is on, he has to perform live forensics; if it is off, he should not turn the system on. The present state of the system has to be maintained and a photograph needs to be taken. In some cases, though that situation is rare, if the officer is not in a position to perform live forensics, he just needs to pull the plug so that the present state of the system can be preserved. If he turns the system on or off, it will change the state of the system and alter the evidence. One more important point when collecting evidence: collect the most volatile evidence first and the least volatile evidence last. There is a particular order, the order of volatility, and the investigating officer needs to collect the evidence as per that order. Once that is done, the next step is to validate the evidence. Note that the investigating officer usually takes a snapshot or image of the system, and this image needs to be validated. One technique that can be used for validation is a hashing algorithm; I will demonstrate how it is done in later slides. The next step is examination. Once the system image has been captured, the investigating officer needs to examine it. This data will be huge, so without a computer it would be very difficult to examine it and get useful insights. The next step is preservation. Note that investigating officers are collecting different objects, such as hard disks and several other pieces of evidence.
They need to be stored at proper room temperature, with proper security, in proper lockers, and also in special bags such as antistatic bags or Faraday bags. This is very important, because if the procedure is not followed, the evidence may be altered, and if the evidence is altered by any means, it cannot be presented in the court of law. The next step is presentation. The whole idea behind all these steps is to extract the evidence and present it in the court of law; if these steps are not performed properly, the court will not accept it. So every step needs to be performed carefully, and finally the evidence has to be presented in the court of law. Now, there is one important standard known as the Daubert standard. Let us discuss how the Daubert standard is useful and how Python adheres to it. In United States federal law, the Daubert standard is a rule of evidence regarding the admissibility of expert witness testimony. A party may raise a Daubert motion, a special motion in limine raised before or during trial, to exclude the presentation of unqualified evidence to the jury. The court defined scientific methodology as the process of formulating a hypothesis and then conducting experiments to prove or falsify it, and provided a set of illustrative factors. Pursuant to Rule 104(a), in Daubert the US Supreme Court suggested that the following factors be considered: Has the technique been tested in actual field conditions, and not just in a laboratory? Has the technique been subject to peer review and publication? What is the known or potential rate of error? Do standards exist for the control of the technique's operation? Has the technique been generally accepted within the relevant scientific community? Now let's see how Python measures up.
In 2003, Brian Carrier published a paper that examined rules of evidence standards, including Daubert, and compared and contrasted open source and closed source forensic tools. One of his key conclusions was: "Using the guidelines of the Daubert tests, we have shown that open source tools may more clearly and comprehensively meet the guideline requirements than closed source tools." This statement clearly shows that Python has an advantage, because Python is open source and free software. So we can say that Python adheres to the Daubert standard, and code written in Python for a cybercrime application can be presented in the court of law. The results are not automatic, of course, just because the source is open. Rather, specific steps must be followed regarding design, development and validation: Can the program or algorithm be explained? This explanation should be given in words, not only in code. Has enough information been provided such that thorough tests can be developed for the program? Have error rates been calculated and validated independently? Has the program been studied and peer reviewed? Has the program been generally accepted by the community? You can see that these five points correlate to the Daubert standard's illustrative factors. Since Python is open source and all the points mentioned on this slide can be satisfied using Python, it adheres to the Daubert standard, and hence the evidence can be presented in the court of law. This is very important: whenever you are using a tool, you should ensure that it adheres to the Daubert standard. Next is setting up Python for forensics application development. There are some factors which need to be considered when you are setting up the environment. The first is your background and your organization's support.
What is your qualification, how much skill do you have in Python, and what support does your organization provide? For example, does your organization fund the development of new software, is it capable of purchasing new software, or is it interested in investing in open source tools? The next factor is choosing third-party libraries. This is very important because of dependency issues, and you may sometimes have to write wrappers just to get at the functions you need. The next is IDEs and their features, that is, integrated development environments. What do you prefer? Are you okay writing command-line programs, or do you need a sophisticated IDE that helps you code faster with features such as IntelliSense and debugging? The next is installation: on which operating system are you interested in installing Python, Windows, Linux or macOS? If it is just a simple analysis, you can use any operating system. But if you are performing system-specific analysis, for example Windows forensics, Linux forensics or macOS forensics, then you need to install Python on those specific operating systems. Then, the right version of Python: this is also very important. You can't use the most recent version of Python just because it is recent; some libraries may or may not support it, and getting the task done is what matters, so you need to use the appropriate version of Python. Next, how do you want to execute your programs? Are you interested in using graphical tools or just shell commands? Many times shell commands will do the job and you can get it done very quickly, and many times it is important to use graphical tools as well. Now let's see how Python supports the development of cyber forensics applications.
Built-in functions and modules: Python has many built-in functions and modules. You can list all of them using the dir() built-in function, and you can see that there are several built-in modules and functions listed. If you are a Python developer, you are already aware of these functions; the only thing we need to see is how to use them differently when developing a cyber forensic application. This is a simple code example which demonstrates the use of the range() function. You might have used range() along with loops, whenever you wanted to generate a list of numbers or work with lists or array-like data structures. Here, a base address has been defined and we are generating ten local IP addresses. Similarly, you can generate any number of IP addresses of any kind; you can even generate IPv6 addresses. The next application is to list the files in a directory. In this case we are using the os module, again built in. Here we get the current working directory and then use it to print the files and folders in the present working directory. Note that in this case also we have not used any additional library. The next concept is forensic indexing and searching. You are already aware of these concepts: whenever you have worked with list data structures, arrays, matrices or multi-dimensional arrays, you have dealt with the concept of an index, and in the case of Google, you may also be aware of the PageRank algorithm. Searching is simply an operation used to find relevant information.
You can write your own search functionality, or you can use the search functions available in the Python core library. These are two very simple approaches. Note that many times our evidence may be present in files, in which case we need to search for particular keywords; these keywords are nothing but clues for the evidence. You can do this with very simple code using the file object: open the file, read the information line by line, process it, and check for the keywords. If the keywords are found, you print that they are found; if not, you print that they are not found. If they are found, you have some clues, and you can then use additional tools to index them. You can even perform simple indexing using a dictionary, or just put the results in a list, which indexes them by position by default. Then there is a library called Whoosh, an advanced library that can be used for forensic indexing and searching. Whoosh was created and is maintained by Matt Chaput; it was originally created for the online help system of Side Effects Software's 3D animation package Houdini. It has not seen updates for several years, but it still works fine with current versions of Python; it remains compatible without any problem. It is a pure-Python library and supports fielded indexing and search, fast indexing and retrieval, pluggable scoring algorithms, text analysis, storage, various posting formats, and so on. You can also query it: it supports a powerful query language and a pure-Python spell checker. Now, this is code actually written using Whoosh.
First we import the required functions, such as create_in. Then we define a schema with title, path and content fields. Then we create a directory called indexdir with the schema, and write the files and their content to the index. Note that once the content has been written, you also need to write a query parser; this query parser helps you extract information from the index. Whoosh can therefore also be used to create your own custom search engine, since it supports both indexing and searching. Next, hash functions for forensics. Hash functions are very important; they are used basically for validation. Whenever you take a snapshot of the system, recording an image of the entire machine using tools like Norton Ghost, once the image is ready you can start analyzing it. But note that you cannot perform forensic analysis on the original data; you need to perform it on a copy of the data. After performing the analysis on the copy, you need to compare the hash of the original image with the hash of the copied image; they should be the same. If there is a difference, something has been altered in the copy. Here, with a simple application, I am demonstrating this: we import the hashlib library and use the SHA-256 hash module (it also supports other algorithms such as MD5). Note that two messages have been written, "Python is " and "a great programming language."; they are fed in sequence to a hash object called m, and we calculate the digest on m. Then we define another variable, x.
Here again we use the same method, SHA-256, and in this case we hash the single sentence "Python is a great programming language." At the end we compare the digests with the statement print(x.digest() == m.digest()), checking whether the digests of x and m are the same. In this case you can see the output: the comparison prints True, meaning the hashes are the same and the information has not been altered. Now, in the same example, I have added one white space at the end of the x message, right after the period. The digest is calculated again, and this time the comparison shows False: the hashes are not the same, which means the information has been altered. So hash functions are very, very important. And note that the use of hash algorithms is recognized in the court of law; I am not aware of other countries, but at least in India it is recognized under the Information Technology Act, 2000. Next up is forensic evidence extraction. For this we can use the library called Pillow. Pillow is the friendly PIL fork by Alex Clark and contributors; PIL is the Python Imaging Library by Fredrik Lundh and contributors, used for image processing tasks. The Python Imaging Library adds image processing capabilities to your Python interpreter. It provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities. The core image library is designed for fast access to data stored in a few basic pixel formats, so it provides a solid foundation for a general image processing tool. For forensic evidence extraction we again use the PIL library. Note that we can extract the EXIF tags, and we can extract GPS information using the GPS tags; we can use both.
Let's assume there is a picture taken with a mobile phone and stored on the phone. The investigating officer takes that photo and runs the script shown at the bottom of the slide, which extracts the GPS information for that photo as well as other properties of the image such as size and image description. The GPS tags include longitude and latitude, that is, the location information: where the photo was taken. All this information can be extracted using the simple PIL library and its TAGS and GPSTAGS modules. Of course, the library supports various other modules which are also useful in extracting evidence. The next is the pyscreenshot module. It allows taking screenshots without installing third-party libraries. Note that it was written as a wrapper for Pillow, but pyscreenshot also supports other back ends. Performance is not the target for this library, or in any cyber forensics activity; the importance is given to the evidence and its protection, so it has to ensure that the information has not been altered. This simple code takes a screenshot of the entire screen: import the pyscreenshot module, use the grab() method, and save the image using the save() method. Similarly, you can take a screenshot of part of the screen by passing coordinates to the bbox parameter of grab(). You can also check the performance of the pyscreenshot back ends if you are interested; there are different back ends such as PIL, MSS, PyQt and so on, and n=10 means the time taken to take ten screenshots, so you can choose the one which takes the least time.
You can also force the back end: if you force the back end to MSS and set childprocess to False, it will help you improve performance significantly. But as I have said, performance is not the target here; extracting the evidence is the target. The next topic is metadata forensics. Note that metadata is associated with every kind of file. Mutagen is a Python module to handle audio metadata. Many times you may get audio or video evidence, in which case you may have to extract the metadata of an audio file or even a video file, and Mutagen will help you. Again, Mutagen is a pure-Python library, which means no additional modules or dependencies are required, and you can install it with python3 -m pip install mutagen. What Mutagen does is take any audio file, try to guess its type, and return the file type instance or None. It often happens that people change a file's extension, but even if the extension is changed, the internal structure remains the same, so it becomes important to determine the original type of the file. You can see that the same Mutagen library is able to get information about a FLAC file and also an MP3 file; it can get the bitrate and the length of an audio file. Similarly, since you are dealing with files and metadata is associated with every kind of file, there is a library called PyPDF2 with which you can extract metadata from a PDF file. Again, it is a pure-Python library, capable of extracting document information, splitting documents page by page, merging documents page by page, cropping pages, merging multiple pages, encrypting and decrypting PDF files, and so on.
The next is pefile, a multi-platform Python module to parse and work with Portable Executable (PE) files, which are usually found on Windows operating systems. Most of the information contained in the PE header is accessible, as well as the sections and their data. The structures defined in the Windows header files are accessible as attributes of the PE instance, and the naming of fields and attributes tries to adhere to the naming scheme in those headers; only shortcuts added for convenience depart from that convention. pefile requires some basic understanding of the layout of a PE file, but with it, it is possible to explore nearly every single feature of the PE file format. Some of the tasks possible with pefile are inspecting headers, analyzing section data, retrieving embedded data, reading strings from resources, warning about suspicious and malformed values, overwriting fields, packer detection with PEiD signatures, PEiD signature generation, and so on. The next important concept is using natural language tools, or NLP packages, in Python. Note that when you extract information, you are actually taking an image of the entire computer, so there may be a lot of textual information and a lot of system files present in it, and it is not possible to examine each and every file manually. In that case, NLP packages can be used to extract the useful information and keywords. NLP packages support features such as tokenization, lemmatization, word frequency, n-gram analysis, and so on. You can also generate a summary, and you can generate the frequency of co-occurring words using n-gram analysis. They also support grammatical tools such as part-of-speech tagging and named entity recognition. Several features are supported, and all of them are really important in forensic analysis.
So you can try to get the required information, or at least some insights, using NLP packages. These NLP packages can be classified into three categories. The first is single-language libraries: most of the time NLTK, spaCy and TextBlob work with English, though some of them also support other languages. Then we have libraries for multiple human languages, such as Stanza and Polyglot: Stanza supports at least 60 languages, and Polyglot supports around 140. Then there are libraries such as iNLTK and the Indic NLP Library for Indian languages, which have quite a different structure altogether; Stanza and Polyglot also support some of the Indian languages, but iNLTK and Indic NLP are much more advanced there. In today's talk, we have seen how we can create small cyber forensic applications. We have not created any extensive application, but you can see that we have built very small applications using concepts already known to us, along with some advanced libraries. In creating a cyber forensic application, it is very important to follow the standard procedures of the law enforcement agencies during the investigation process; otherwise the results will not be admissible in the court of law. There are many open source as well as commercial tools for digital forensics, but learning to develop your own tool is always advantageous because it can save time and money. Many tools written in Python are pure-Python implementations, and most importantly, Python and open source tools comply with the Daubert standard. Thank you everyone.

Gajendra Deshpande

Assistant Professor @ KLS Gogte Institute of Technology

