corrupy.picklemagic — Pickle data extraction

The picklemagic module implements tools for extracting data stored in Python’s pickle object serialization format.

Technical Background

The Python pickle module is a nice tool for storing data structures: as long as the Python environment stays the same, almost anything can be pickled. But when the original environment in which a pickle was made is unknown, finding out what data was serialized is a very cumbersome task, and unpickling it is even a security risk.

The picklemagic module aims to solve this problem. Taking advantage of Python’s dynamic nature, it extracts as much information as possible from pickles and creates a data structure as close as possible to the original.

This is accomplished by generating any missing class and module definitions at runtime. These objects, however, only hold data which could be recovered from the unpickling process. They lack any implementation and will therefore be referred to as fake classes and modules.

This module uses this behaviour in two ways. FakeUnpickler uses it simply to extend the normal unpickler behaviour, creating fake modules and classes when it encounters a definition it cannot find in the available modules. This ensures that the resulting data structure is as close to the original as possible. SafeUnpickler however uses it to replace the default Unpickler behaviour, replacing any class definition requested by the pickle by a fake class. This ensures safety during the unpickling process since the pickle cannot instantiate dangerous objects or call dangerous functions during the unpickling process.
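The difference between the two strategies can be illustrated with the standard library alone. The following is a minimal sketch, not picklemagic's implementation: SketchFakeUnpickler, SketchSafeUnpickler and make_placeholder are hypothetical names, and the placeholder class simply records the arguments it is called with.

```python
import io
import pickle


def make_placeholder(module, name):
    """Build an empty class that only remembers its constructor arguments."""
    def __init__(self, *args):
        self._new_args = args
    return type(name, (object,), {"__module__": module, "__init__": __init__})


class SketchFakeUnpickler(pickle.Unpickler):
    """Extends normal lookup: substitute only when the definition is missing."""

    def find_class(self, module, name):
        try:
            return super().find_class(module, name)
        except (ImportError, AttributeError):
            return make_placeholder(module, name)


class SketchSafeUnpickler(pickle.Unpickler):
    """Replaces normal lookup: every requested definition is substituted."""

    def find_class(self, module, name):
        return make_placeholder(module, name)


# A pickle that would run os.system("echo hi") under a regular unpickler.
evil = b"cos\nsystem\n(S'echo hi'\ntR."
harmless = SketchSafeUnpickler(io.BytesIO(evil)).load()
print(type(harmless).__module__, type(harmless).__name__, harmless._new_args)

# A pickle that refers to a class in a module that does not exist.
unknown = b"cmissing_mod_xyz\nThing\n(tR."
recovered = SketchFakeUnpickler(io.BytesIO(unknown)).load()
print(type(recovered).__module__, type(recovered).__name__)
```

In the first case the command is never executed: the call ends up constructing a placeholder object that merely stores the string it was given.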

Fake classes and modules

The mechanics of fake classes and modules are an important part of this module.

Fake classes get instantiated when the unpickling machinery encounters a request to load a top-level object from a module. In a normal pickle this object should either be a function or a class object. When the module cannot be found or the object cannot be found in the module, a fake class has to be inserted. This fake class is then created using the settings of the FakeClassFactory in use, during which the class’s __module__ attribute will be set to the module the class would have resided in, and the __name__ attribute will be set to the name the class would have had.

A similar process happens for the generation of fake modules. These modules will be generated when a FakeUnpickler encounters a reference to an object in an unknown module. When this happens, a fake module will be generated to house the fake classes which would be contained by that module. When such a module is created, it will automatically generate all necessary parent modules and add itself to sys.modules so it can be imported properly. It should be noted though that SafeUnpickler does not generate fake modules while importing since it is forbidden from importing modules.
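The parent-creation and sys.modules registration described above can be sketched with types.ModuleType. This is only an illustration under assumptions: make_fake_module and the package name sketch_pkg_a are hypothetical, and the real FakeModule class may work differently.

```python
import importlib
import sys
import types


def make_fake_module(name):
    """Create `name` and any missing parents as empty modules in sys.modules."""
    if name in sys.modules:
        # Stop recursing as soon as an existing module is found.
        return sys.modules[name]
    module = types.ModuleType(name)
    sys.modules[name] = module
    parent_name, _, child_name = name.rpartition(".")
    if parent_name:
        parent = make_fake_module(parent_name)
        # Make the child reachable as an attribute of its parent.
        setattr(parent, child_name, module)
    return module


make_fake_module("sketch_pkg_a.sub.leaf")

# The import system finds every level in sys.modules, so this succeeds.
leaf = importlib.import_module("sketch_pkg_a.sub.leaf")
print(leaf.__name__)
```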

A problem with this approach can be that it’s hard to write code to analyze the created data structures when the fake modules and classes are only created at unpickling time. Therefore the user can create the necessary fake modules beforehand, either by creating FakeModule instances directly or by using fake_package(). This function lets the user declare that any module in the given package exists, and it works recursively. For example:

import picklemagic
picklemagic.fake_package("foo")

import foo.bar.baz
print(foo.bar.baz)

>>> <module 'foo.bar.baz' (fake)>

These can then be used in code thanks to the special comparison behaviour of fake modules and classes. This behaviour works as follows: a fake class is equal to a fake module if its qualified name matches the qualified name of the fake module. This means that a fake class which says it has name bar in module foo compares equal to a fake module which identifies as foo.bar (this behaviour extends to hashing and isinstance/issubclass checking). This can then be used as follows:

import picklemagic
picklemagic.fake_package("foo")

import foo

def is_foo_bar(obj):
    if isinstance(obj, foo.bar):
        print("yes")

result = picklemagic.safe_loads(b"cfoo\nbar\n(tR.")
# This pickle results in a foo.bar instance

is_foo_bar(result)
>>> yes

This means that you don’t have to worry about definitions not existing in certain pickles. You can call fake_package() and then just code as if everything in the module actually existed.

Security Risks

SafeUnpickler secures the unpickling process by denying it access to globals and objects in modules: the wanted definitions are replaced with fake classes which cannot do any harm. There are, however, other possible security risks in the pickle protocol, namely persistent IDs and the pickle extension registry. Although SafeUnpickler allows SafeUnpickler.persistent_id() to be overridden in a subclass, care should be taken that the objects returned by it cannot be used for anything harmful. The same goes for the pickle extension registry if enabled (documented in the Python copyreg module).
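Whether a given pickle touches any of these mechanisms can be checked without executing it, using the standard library's pickletools module. The sketch below is independent of picklemagic; INTERESTING and interesting_opcodes are hypothetical names chosen for this illustration.

```python
import pickletools

# Opcodes that look up globals, persistent IDs, or extension registry codes.
INTERESTING = {
    "GLOBAL", "STACK_GLOBAL", "INST",
    "PERSID", "BINPERSID",
    "EXT1", "EXT2", "EXT4",
}


def interesting_opcodes(data):
    """List (opcode name, argument) pairs without executing the pickle."""
    return [
        (opcode.name, arg)
        for opcode, arg, _pos in pickletools.genops(data)
        if opcode.name in INTERESTING
    ]


print(interesting_opcodes(b"cfoo\nbar\n(tR."))  # a global lookup of foo.bar
print(interesting_opcodes(b"P42\n."))           # a persistent ID reference
```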

Module Interface

To simply analyze a pickle data stream, you can call load() or safe_load(). Similarly, if you want to analyze a pickle string, you can call the loads() and safe_loads() functions. However, if you want more control over the missing class faking process, you can control FakeClass creation directly using FakeClassFactory and by subclassing FakeClassType. For more control over the unpickling process itself, the classes FakeUnpickler and SafeUnpickler can be used directly.

The picklemagic module provides the following functions to make simple use more convenient:

corrupy.picklemagic.load(file, class_factory=None, encoding='bytes', errors='errors')

Read a pickled object representation from the open binary file object file and return the reconstituted object hierarchy specified therein, generating any missing class definitions at runtime. This is equivalent to FakeUnpickler(file).load().

The optional keyword arguments are class_factory, encoding and errors. class_factory can be used to control how the missing class definitions are created. If set to None, FakeClassFactory((), FakeStrict) will be used.

In Python 3, the optional keyword arguments encoding and errors can be used to indicate how the unpickler should deal with pickle streams generated in Python 2, specifically how to deal with 8-bit string instances. If encoding is set to “bytes” it will load them as bytes objects, otherwise it will attempt to decode them into Unicode strings using the given encoding and errors arguments.
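The effect of these two arguments can be seen with the standard library's pickle.loads, which accepts the same encoding and errors keywords. This sketch uses a hand-written protocol 0 pickle of a Python 2 8-bit string; the variable names are only illustrative.

```python
import pickle

# Protocol 0 pickle of the Python 2 str 'caf\xe9' (latin-1 encoded "café").
py2_pickle = b"S'caf\\xe9'\n."

as_bytes = pickle.loads(py2_pickle, encoding="bytes")
as_text = pickle.loads(py2_pickle, encoding="latin-1")
lossy = pickle.loads(py2_pickle, encoding="ascii", errors="replace")

print(repr(as_bytes))  # raw bytes, nothing decoded
print(repr(as_text))   # decoded with the given encoding
print(repr(lossy))     # undecodable byte replaced because errors="replace"
```

Loading as bytes never fails and loses no information, which is presumably why it is the default here: the right encoding is usually unknown when the originating environment is unknown.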

This function should only be used to unpickle trusted data.

corrupy.picklemagic.safe_load(file, class_factory=None, safe_modules=(), use_copyreg=False, encoding='bytes', errors='errors')

Read a pickled object representation from the open binary file object file and return the reconstituted object hierarchy specified therein, substituting any class definitions by fake classes, ensuring safety in the unpickling process. This is equivalent to SafeUnpickler(file).load().

The optional keyword arguments are class_factory, safe_modules, use_copyreg, encoding and errors. class_factory can be used to control how the missing class definitions are created. If set to None, FakeClassFactory((), FakeStrict) will be used. safe_modules can be set to a set of strings of module names, which will be regarded as safe by the unpickling process, meaning that it will import objects from that module instead of generating fake classes (this does not apply to objects in submodules). use_copyreg is a boolean value indicating if it’s allowed to use extensions from the pickle extension registry (documented in the copyreg module).

In Python 3, the optional keyword arguments encoding and errors can be used to indicate how the unpickler should deal with pickle streams generated in Python 2, specifically how to deal with 8-bit string instances. If encoding is set to “bytes” it will load them as bytes objects, otherwise it will attempt to decode them into Unicode strings using the given encoding and errors arguments.

This function can be used to unpickle untrusted data safely with the default class_factory when safe_modules is empty and use_copyreg is False.

corrupy.picklemagic.loads(string, class_factory=None, encoding='bytes', errors='errors')

Similar to load(), but takes an 8-bit string (bytes in Python 3, str in Python 2) as its first argument instead of a binary file object.

corrupy.picklemagic.safe_loads(string, class_factory=None, safe_modules=(), use_copyreg=False, encoding='bytes', errors='errors')

Similar to safe_load(), but takes an 8-bit string (bytes in Python 3, str in Python 2) as its first argument instead of a binary file object.

To ease automatic analysis, the picklemagic module provides the following functions.

corrupy.picklemagic.fake_package(name)

Mounts a fake package tree with the name name. Any attempt to import the module name, or to access its attributes or submodules, then returns a FakePackage instance which implements the same behaviour. These FakePackage instances compare properly with FakeClassType instances, allowing you to write code using FakePackages as if the modules and their attributes actually existed.

This is implemented by creating a FakePackageLoader instance with root name and inserting it in the first spot in sys.meta_path. This ensures that importing the module and submodules will work properly. Further the FakePackage instances take care of generating submodules as attributes on request.
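The sys.meta_path mechanism can be sketched as follows. This is an illustration of the general technique using importlib, not the actual FakePackageLoader: SketchPackageLoader and the package name sketch_pkg_b are hypothetical, and the modules it produces are plain empty modules rather than FakePackage instances.

```python
import importlib
import importlib.abc
import importlib.machinery
import sys


class SketchPackageLoader(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one: claims `root` and everything below it."""

    def __init__(self, root):
        self.root = root

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.root or fullname.startswith(self.root + "."):
            # is_package=True gives the module a __path__, so that
            # submodule imports are attempted at all.
            return importlib.machinery.ModuleSpec(fullname, self, is_package=True)
        return None  # let the other finders handle unrelated names

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        pass  # there is no code to execute for an empty module


# First spot in sys.meta_path, so this finder is consulted before the others.
sys.meta_path.insert(0, SketchPackageLoader("sketch_pkg_b"))

baz = importlib.import_module("sketch_pkg_b.bar.baz")
print(baz.__name__)
```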

If a fake package tree with the same name is already registered, no new fake package tree will be mounted.

Returns the FakePackage instance for name.

corrupy.picklemagic.remove_fake_package(name)

Removes the fake package tree mounted at name.

This works by first looking for any FakePackageLoaders in sys.meta_path with their root set to name and removing them from sys.meta_path. Next it will find the top-level FakePackage instance name and from this point traverse the tree of created submodules, removing them from sys.modules and removing their attributes. After this the modules are no longer registered, and if they are no longer referenced from user code they will be garbage collected.

If no fake package tree name exists a ValueError will be raised.

The picklemagic module defines this exception:

exception corrupy.picklemagic.FakeUnpicklingError

Error raised when there is not enough information to perform the fake unpickling process completely. It inherits from pickle.UnpicklingError.

Fake Classes

The picklemagic module uses the following classes to provide the necessary fake class definitions required by the fake unpickling process.

class corrupy.picklemagic.FakeClassType(name, bases, attributes, module=None)

The metaclass used to create fake classes. To support comparisons between fake classes and FakeModule instances custom behaviour is defined here which follows this logic:

If the other object does not have other.__name__ set, they are not equal.

Else if it does not have other.__module__ set, they are equal if self.__module__ + "." + self.__name__ == other.__name__.

Else, they are equal if self.__module__ == other.__module__ and self.__name__ == other.__name__

Using this behaviour, ==, !=, hash(), isinstance() and issubclass() are implemented allowing comparison between FakeClassType instances and FakeModule instances to succeed if they are pretending to be in the same place in the python module hierarchy.
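The three rules above can be sketched as a small metaclass. This is a simplified illustration, not the real FakeClassType: SketchFakeType is a hypothetical name, only == / != and hash() are covered, and plain types.ModuleType instances (which have no __module__ attribute) stand in for fake modules.

```python
import types


class SketchFakeType(type):
    """Metaclass whose classes compare by the place they pretend to occupy."""

    def __eq__(cls, other):
        if not hasattr(other, "__name__"):
            return False
        if not hasattr(other, "__module__"):
            # Module-like object: compare against the dotted path of the class.
            return cls.__module__ + "." + cls.__name__ == other.__name__
        return (cls.__module__ == other.__module__
                and cls.__name__ == other.__name__)

    def __hash__(cls):
        # Defining __eq__ removes the default hash, so provide a matching one.
        return hash(cls.__module__ + "." + cls.__name__)


bar = SketchFakeType("bar", (object,), {"__module__": "foo"})
same_place = SketchFakeType("bar", (object,), {"__module__": "foo"})
foo_bar = types.ModuleType("foo.bar")

print(bar == foo_bar)     # class versus module-like object
print(bar == same_place)  # class versus class
```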

To create a fake class using this metaclass, you can either use this metaclass directly or inherit from the fake class base instances given below. When doing this, the module that this fake class is pretending to be in should be specified using the module argument when the metaclass is called directly or a __module__ class attribute in a class statement.

This is a subclass of type.

class corrupy.picklemagic.FakeClassFactory(special_cases=(), default=FakeStrict)

Factory of fake classes. It will create fake class definitions on demand based on the passed arguments.

special_cases should be an iterable containing fake classes which should be treated as special cases during the fake unpickling process. This way you can specify custom methods and attributes on these classes as they’re used during unpickling.

default_class should be a FakeClassType instance which will be subclassed to create the necessary non-special case fake classes during unpickling. This should usually be set to FakeStrict, FakeWarning or FakeIgnore. These classes have __new__() and __setstate__() methods which extract data from the pickle stream and provide means of inspecting the stream when it is not clear how the data should be interpreted.

As an example, we can define the fake class generated for definition bar in module foo, which has a __str__() method which returns "baz":

class bar(FakeStrict, object):
    # The module this fake class pretends to live in
    __module__ = "foo"

    def __str__(self):
        return "baz"

special_cases = [bar]

Alternatively they can also be instantiated using FakeClassType directly:

special_cases = [FakeClassType(c.__name__, c.__bases__, c.__dict__, c.__module__)]

__call__(name, module)

Return the right class for the specified module and name.

This class will either be one of the special cases in case the name and module match, or a subclass of default_class will be created with the correct name and module.

Created class definitions are cached per factory instance.
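The lookup order and per-instance caching described for __call__() can be sketched like this. It is a simplified model under assumptions, not the real FakeClassFactory: SketchClassFactory and SketchBase are hypothetical names, and special cases are assumed to be matched on their __module__ and __name__.

```python
class SketchBase(object):
    """Stand-in for the default fake class that gets subclassed."""


class SketchClassFactory(object):
    def __init__(self, special_cases=(), default_class=SketchBase):
        # Index the special cases by the place they pretend to occupy.
        self.special_cases = {
            (cls.__module__, cls.__name__): cls for cls in special_cases
        }
        self.default_class = default_class
        self.cache = {}

    def __call__(self, name, module):
        key = (module, name)
        if key in self.special_cases:
            return self.special_cases[key]
        if key not in self.cache:
            # Subclass the default class, carrying the requested identity.
            self.cache[key] = type(
                name, (self.default_class,), {"__module__": module}
            )
        return self.cache[key]


special = type("baz", (SketchBase,), {"__module__": "foo"})
factory = SketchClassFactory(special_cases=[special])

generated = factory("bar", "foo")
print(generated.__module__, generated.__name__)
print(factory("bar", "foo") is generated)  # second call hits the cache
print(factory("baz", "foo") is special)    # special case is returned as-is
```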

class corrupy.picklemagic.FakeClass
class corrupy.picklemagic.FakeStrict(*args, **kwargs)
class corrupy.picklemagic.FakeWarning(*args, **kwargs)
class corrupy.picklemagic.FakeIgnore(*args, **kwargs)

These are FakeClassType instances which can easily be subclassed to get the wanted behaviour. FakeClass is a featureless instance for the rest to inherit from. FakeStrict, FakeWarning and FakeIgnore all define __new__() and __setstate__() methods to support the fake unpickling process. If FakeStrict is used, a FakeUnpicklingError will be raised if special arguments were passed into the methods during unpickling. If FakeWarning is used, a warning detailing the arguments will be printed and the arguments will be stored inside an attribute of the object (_setstate_args or _new_args). Finally if FakeIgnore is used, any unknown arguments will be stored inside an attribute of the object but no warning will be printed.
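The three policies can be pictured with the following sketch of __setstate__() alone. It rests on an assumption that is not stated above: that "special arguments" means a state which is not a plain attribute dictionary. The class names are hypothetical, and pickle.UnpicklingError stands in for FakeUnpicklingError.

```python
import pickle
import warnings


class SketchIgnore(object):
    """Store an unexpected state quietly for later inspection."""

    def __setstate__(self, state):
        if isinstance(state, dict):
            self.__dict__.update(state)
        else:
            self._setstate_args = state


class SketchWarning(SketchIgnore):
    """Store an unexpected state, but warn about it first."""

    def __setstate__(self, state):
        if not isinstance(state, dict):
            warnings.warn("unexpected state: {0!r}".format(state))
        super().__setstate__(state)


class SketchStrict(object):
    """Refuse any state that is not a plain attribute dictionary."""

    def __setstate__(self, state):
        if not isinstance(state, dict):
            raise pickle.UnpicklingError(
                "unexpected state: {0!r}".format(state)
            )
        self.__dict__.update(state)


quiet = SketchIgnore()
quiet.__setstate__((1, 2))
print(quiet._setstate_args)

plain = SketchStrict()
plain.__setstate__({"x": 1})
print(plain.x)
```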

Fake Modules

The picklemagic module uses the following classes to implement the fake modules generated by fake_package() and the fake unpickling process.

class corrupy.picklemagic.FakeModule(name)

An object which pretends to be a module.

name is the name of the module and should be a "." separated alphanumeric string.

On initialization the module is added to sys.modules so it can be imported properly. Further if name is a submodule and if its parent does not exist, it will automatically create a parent FakeModule. This operates recursively until the parent is a top-level module or when the parent is an existing module.

If any fake submodules are removed from this module they will automatically be removed from sys.modules.

Just as FakeClassType, it supports comparison with FakeClassType instances, using the following logic:

If the object does not have other.__name__ set, they are not equal.

Else if the other object does not have other.__module__ set, they are equal if: self.__name__ == other.__name__

Else, they are equal if: self.__name__ == other.__module__ + "." + other.__name__

Using this behaviour, ==, !=, hash(), isinstance() and issubclass() are implemented allowing comparison between FakeClassType instances and FakeModule instances to succeed if they are pretending to be in the same place in the Python module hierarchy.

It inherits from types.ModuleType.

_remove()

Removes this module from sys.modules and calls _remove() on any sub-FakeModules.

class corrupy.picklemagic.FakePackage(name)

A FakeModule subclass which lazily creates FakePackage instances on its attributes when they’re requested.

This ensures that any attribute of this module is a valid FakeModule which can be used to compare against fake classes.

class corrupy.picklemagic.FakePackageLoader(root)

A loader of FakePackage modules. When added to sys.meta_path it will ensure that any attempt to import module root or its submodules results in a FakePackage.

Together with the attribute creation from FakePackage this ensures that any attempt to get a submodule from module root results in a FakePackage, creating the illusion that root is an actual package tree.

This class is both a finder and a loader.

Fake Unpicklers

These two classes do the actual work behind the fake unpickling process.

class corrupy.picklemagic.FakeUnpickler(file, class_factory=None, encoding='bytes', errors='strict')

A forgiving unpickler. On encountering references to class definitions in the pickle stream which it cannot locate, it will create fake classes and, if necessary, fake modules to house them in. Since it still allows access to all modules and builtins, it should only be used to unpickle trusted data.

file is the binary file to unserialize.

The optional keyword arguments are class_factory, encoding and errors. class_factory can be used to control how the missing class definitions are created. If set to None, FakeClassFactory((), FakeStrict) will be used.

In Python 3, the optional keyword arguments encoding and errors can be used to indicate how the unpickler should deal with pickle streams generated in Python 2, specifically how to deal with 8-bit string instances. If encoding is set to “bytes” it will load them as bytes objects, otherwise it will attempt to decode them into Unicode strings using the given encoding and errors arguments.

It inherits from pickle.Unpickler. (In Python 3 this is actually pickle._Unpickler)

This takes a binary file for reading a pickle data stream.

The protocol version of the pickle is detected automatically, so no proto argument is needed.

The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be a binary file object opened for reading, an io.BytesIO object, or any other custom object that meets this interface.

If buffers is not None, it should be an iterable of buffer-enabled objects that is consumed each time the pickle stream references an out-of-band buffer view. Such buffers have been given in order to the buffer_callback of a Pickler object.

If buffers is None (the default), then the buffers are taken from the pickle stream, assuming they are serialized there. It is an error for buffers to be None if the pickle stream was produced with a non-None buffer_callback.

Other optional arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle streams generated by Python 2. If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors arguments tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.

class corrupy.picklemagic.SafeUnpickler(file, class_factory=None, safe_modules=(), use_copyreg=False, encoding='bytes', errors='strict')

A safe unpickler. It will create fake classes for any references to class definitions in the pickle stream. Further it can block access to the extension registry making this unpickler safe to use on untrusted data.

file is the binary file to unserialize.

The optional keyword arguments are class_factory, safe_modules, use_copyreg, encoding and errors. class_factory can be used to control how the missing class definitions are created. If set to None, FakeClassFactory((), FakeStrict) will be used. safe_modules can be set to a set of strings of module names, which will be regarded as safe by the unpickling process, meaning that it will import objects from that module instead of generating fake classes (this does not apply to objects in submodules). use_copyreg is a boolean value indicating if it’s allowed to use extensions from the pickle extension registry (documented in the copyreg module).

In Python 3, the optional keyword arguments encoding and errors can be used to indicate how the unpickler should deal with pickle streams generated in Python 2, specifically how to deal with 8-bit string instances. If encoding is set to “bytes” it will load them as bytes objects, otherwise it will attempt to decode them into Unicode strings using the given encoding and errors arguments.

This unpickler can be used to unpickle untrusted data safely with the default class_factory when safe_modules is empty and use_copyreg is False. It inherits from pickle.Unpickler. (In Python 3 this is actually pickle._Unpickler)

It should be noted though that when the unpickler tries to get a nonexistent attribute of a safe module, an AttributeError will be raised.
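The interaction between safe_modules and attribute lookup can be sketched with a standard library unpickler. This is an assumption-laden illustration rather than SafeUnpickler itself: SketchAllowlistUnpickler and sketch_loads are hypothetical names, and module names are matched exactly, so submodules are not covered.

```python
import collections
import importlib
import io
import pickle


class SketchAllowlistUnpickler(pickle.Unpickler):
    def __init__(self, file, safe_modules=()):
        super().__init__(file)
        self.safe_modules = frozenset(safe_modules)

    def find_class(self, module, name):
        if module in self.safe_modules:
            # Real lookup: a missing attribute raises AttributeError.
            return getattr(importlib.import_module(module), name)
        # Everything else becomes an empty placeholder class.
        return type(name, (object,), {"__module__": module})


def sketch_loads(data, safe_modules=()):
    return SketchAllowlistUnpickler(io.BytesIO(data), safe_modules).load()


data = b"ccollections\nOrderedDict\n(tR."
real = sketch_loads(data, {"collections"})
fake = sketch_loads(data)
print(type(real) is collections.OrderedDict)
print(type(fake) is collections.OrderedDict)

try:
    sketch_loads(b"ccollections\nNoSuchThing\n(tR.", {"collections"})
    missing_raised = False
except AttributeError:
    missing_raised = True
print(missing_raised)
```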

This inherits from FakeUnpickler.

This takes a binary file for reading a pickle data stream.

The protocol version of the pickle is detected automatically, so no proto argument is needed.

The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be a binary file object opened for reading, an io.BytesIO object, or any other custom object that meets this interface.

If buffers is not None, it should be an iterable of buffer-enabled objects that is consumed each time the pickle stream references an out-of-band buffer view. Such buffers have been given in order to the buffer_callback of a Pickler object.

If buffers is None (the default), then the buffers are taken from the pickle stream, assuming they are serialized there. It is an error for buffers to be None if the pickle stream was produced with a non-None buffer_callback.

Other optional arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle streams generated by Python 2. If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors arguments tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.

Utility

Sometimes, it is necessary to be able to pickle the data structures created by the fake unpicklers. While this can be performed using the normal pickle routines from the Python standard library for objects created by FakeUnpickler, this is not true for objects created by SafeUnpickler. Therefore, the following class is made available which allows objects created by SafeUnpickler to be pickled.

class corrupy.picklemagic.SafePickler(file, protocol=None, *, fix_imports=True, buffer_callback=None)

A pickler which can repickle object hierarchies containing objects created by SafeUnpickler. For reasons unknown, Python’s pickle implementation will normally check if a given class actually matches the object found at the __module__ and __name__ of the class. Since this check is performed with object identity instead of object equality, we cannot fake this from the classes themselves, and we need to override the method normally used for saving classes.
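The identity check mentioned above can be observed with the standard library alone. In this sketch the module name sketch_mod_c and the variable names are hypothetical; two classes claim the same __module__ and __name__, but only one of them is the object actually stored in the module.

```python
import pickle
import sys
import types

# Register an empty module so that pickle can import it by name.
module = types.ModuleType("sketch_mod_c")
sys.modules["sketch_mod_c"] = module

original = type("bar", (object,), {"__module__": "sketch_mod_c"})
module.bar = original

# Same name and module, but a different object than sketch_mod_c.bar.
lookalike = type("bar", (object,), {"__module__": "sketch_mod_c"})

# The class that really lives in the module round-trips without trouble.
roundtrip = pickle.loads(pickle.dumps(original()))
print(type(roundtrip) is original)

try:
    pickle.dumps(lookalike())
    rejected = False
except pickle.PicklingError as exc:
    rejected = True
    print(exc)
print(rejected)
```

Objects produced by a safe unpickler are in the position of lookalike here, which is why a dedicated pickler is needed for them.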

This takes a binary file for writing a pickle data stream.

The optional protocol argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2, 3, 4 and 5. The default protocol is 4. It was introduced in Python 3.4, and is incompatible with previous versions.

Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

The file argument must have a write() method that accepts a single bytes argument. It can thus be a file object opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

If fix_imports is True and protocol is less than 3, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.

If buffer_callback is None (the default), buffer views are serialized into file as part of the pickle stream.

If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.
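The in-band and out-of-band cases can be compared using the standard library pickle module directly (Python 3.8 or later). This sketch does not involve SafePickler; it only illustrates the buffer_callback semantics described above, with a callback that returns None so that every buffer goes out-of-band.

```python
import pickle

payload = bytearray(b"large binary payload")

# In-band: the buffer contents are written into the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(payload), protocol=5)

# Out-of-band: list.append returns None (a false value), so the buffer is
# handed to the callback and left out of the pickle stream.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append
)

# The same buffers must be supplied again, in order, when unpickling.
restored = pickle.loads(out_of_band, buffers=buffers)

print(bytes(payload) in in_band)
print(bytes(payload) in out_of_band)
print(bytes(restored) == bytes(payload))
```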

It is an error if buffer_callback is not None and protocol is None or smaller than 5.