# Parse the requirements file

The previous notebook, 'PyPi_Metadata.ipynb', parsed the requirements out of every package on the PyPI server. The output is a file that looks like this:

packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy
matplotlib
numpy

packages/astroid-1.4.4

packages/astroimtools-0.1
git+http://github.com/astropy/astropy.git#egg=astropy
astropy-helpers
cython>=0.23.4
distribute==0.0
matplotlib
numpy

Each block starts with the name and version of the Python package, followed by the dependencies I was able to parse. Many blocks list no dependencies; for now I will assume that is correct even though I know it is not always true. If a package defines its requirements programmatically in setup.py and has no requirements file, its requirements are not found.

The main purpose of this notebook is to parse that output file into a pandas DataFrame.

In [ ]:
import pandas as pd
from collections import defaultdict
import os
import numpy as np
import requirements
try:
    import xmlrpclib  # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib  # Python 3

# I need this to separate the package name from its version
client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
packages = client.list_packages()


## 1: Parse the requirements for each package

In [ ]:
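datadict = defaultdict(list)
with open('requirements.txt', 'r') as infile:
    new_package = True
    for line in infile:
        if line.strip() == '':
            # A blank line ends the current package block
            new_package = True
            #print(package_name)
            continue

        if new_package:
            # If this is the case, the current line gives the name of the package
            package_name = os.path.basename(line).strip()
            new_package = False
        else:
            # This line gives a requirement for the current package
            try:
                for req in requirements.parse(line.strip()):
                    # Assumed loop body: one row per parsed requirement, using the
                    # column names ('package', 'requirement') that later cells rely on
                    datadict['package'].append(package_name)
                    datadict['requirement'].append(req.name)
            except ValueError:
                # Skip lines that the requirements parser cannot handle
                pass

# Convert to dataframe
df = pd.DataFrame(data=datadict)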
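
To make the block structure of the file concrete, here is a minimal standard-library sketch that parses a shortened copy of the sample shown earlier. It is not the parser used in this notebook: `requirement_name` and `parse_blocks` are hypothetical helpers, and the regular expression is a deliberately simplified stand-in for `requirements.parse` that only extracts the requirement name.

```python
import re

SAMPLE = """\
packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy
matplotlib
numpy

packages/astroimtools-0.1
git+http://github.com/astropy/astropy.git#egg=astropy
cython>=0.23.4
"""

# Simplified assumption: a requirement name is the leading run of
# letters, digits, dots, underscores and hyphens on the line.
NAME_RE = re.compile(r'^[A-Za-z0-9][A-Za-z0-9._-]*')


def requirement_name(line):
    """Return the bare requirement name, or None if none can be extracted."""
    line = line.strip()
    if '#egg=' in line:
        # VCS/URL requirements carry the name in the '#egg=' fragment
        return line.split('#egg=', 1)[1].strip()
    match = NAME_RE.match(line)
    return match.group(0) if match else None


def parse_blocks(text):
    """Map each 'name-version' header to the requirement names listed below it."""
    parsed = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current block
        elif current is None:
            # First line of a block: strip the leading 'packages/' directory
            current = line.strip().rsplit('/', 1)[-1]
            parsed.setdefault(current, [])
        else:
            name = requirement_name(line)
            if name is not None:
                parsed[current].append(name)
    return parsed


blocks = parse_blocks(SAMPLE)
```

In this sketch a package with no listed requirements is kept with an empty list, which makes it easy to see how many packages fall into that category.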


## 2: Get the base package name from the package string

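The package column of the dataframe currently contains the name of the package together with its version string (for example, astrodendro-0.1.0). I need to separate the two. Splitting on the hyphen is not reliable, because package names can themselves contain hyphens (astropy-helpers), so I will again use the package list from PyPI itself.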

In [ ]:
df['package_name'] = np.nan
df['package_version'] = np.nan
for i, package in enumerate(packages):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    for release in client.package_releases(package):
        pkg_str = '{}-{}'.format(package, release)
        idx = df.loc[df.package == pkg_str].index
        if len(idx) > 0:
            df.loc[idx, 'package_name'] = package
            df.loc[idx, 'package_version'] = release
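
The loop above makes one XML-RPC call per package, which is slow for the full index. As a rough offline alternative, the following sketch splits a 'name-version' string using only a set of known package names. `split_name_version` is a hypothetical helper, not part of this notebook, and unlike the release lookup it does not check that the version actually exists: it simply prefers the longest known name.

```python
def split_name_version(pkg_str, known_names):
    """Split 'name-version' using a set of known package names.

    Hyphen positions are tried from right to left, so the longest known
    name wins. Returns (None, None) when no prefix is a known name.
    """
    pos = len(pkg_str)
    while True:
        # Find the next hyphen strictly to the left of the previous one
        pos = pkg_str.rfind('-', 0, pos)
        if pos <= 0:
            return None, None
        name, version = pkg_str[:pos], pkg_str[pos + 1:]
        if name in known_names:
            return name, version


# Toy set of names; in the notebook this would come from `packages`
known = {'astropy', 'astropy-helpers', 'astroimtools'}
helpers = split_name_version('astropy-helpers-1.1', known)
```

Here 'astropy-helpers-1.1' resolves to the name 'astropy-helpers' rather than 'astropy', which is the case a naive split on the first hyphen gets wrong.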

In [ ]:
# Save to file
df.to_csv('requirements.csv', index=False)


# Base dependencies

I have now parsed the formal dependencies for 20642 Python packages. However, some of those dependencies have dependencies of their own. Let's go ahead and find the base dependencies: for each package, I will look up the requirements of each of its requirements, and keep going until no new dependencies turn up.

## Difficulties:

1. Cyclic dependencies: astropy requires wcs_axes, which itself requires astropy. Therefore a naive recursive solution will never end. I use a Tree class that keeps track of what has already been searched to avoid infinite loops.

In [ ]:
class Tree(object):
    def __init__(self, name):
        self.name = name
        self.children = []
        return

    def __contains__(self, obj):
        # True if obj is this node or appears anywhere below it
        return obj == self.name or any([obj in c for c in self.children])

    def add_child(self, obj):
        # Method name assumed. Adds obj only if it is not already in the tree
        # and reports whether it was added.
        if not self.__contains__(obj):
            self.children.append(Tree(obj))
            return True
        return False

    def get_base_requirements(self):
        base = []
        for child in self.children:
            if len(child.children) == 0:
                base.append(child.name)
            else:
                # Recurse into the subtree and collect its leaves
                base.extend(child.get_base_requirements())
        return np.unique(base)

def get_requirements(package):
    return df.loc[(df.package_name == package) & (df.requirement.notnull()), 'requirement'].values

def get_dependency_tree(package, tree):
    reqs = get_requirements(package)
    for req in reqs:
        #print(req)
        # New requirements are attached to the tree that is passed along, so
        # the membership test sees everything visited so far and cycles end.
        flg = tree.add_child(req)
        if not flg:
            continue
        tree = get_dependency_tree(req, tree)
    return tree
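
The cycle handling can also be illustrated without the DataFrame. The sketch below is an assumption-laden stand-in, not the notebook's method: `requires` is a hypothetical in-memory dict from package name to direct requirements, the entries other than the astropy/wcs_axes cycle are toy data, and the traversal is an iterative depth-first search with a visited set instead of the Tree class.

```python
def transitive_requirements(package, requires):
    """Return every package reachable from `package`, excluding itself.

    `requires` maps a package name to a list of its direct requirements.
    The `seen` set guarantees termination even when requirements are cyclic.
    """
    seen = {package}
    stack = [package]
    while stack:
        current = stack.pop()
        for req in requires.get(current, []):
            if req not in seen:  # skip anything already visited
                seen.add(req)
                stack.append(req)
    seen.discard(package)
    return seen


def leaf_requirements(package, requires):
    """Subset of the transitive requirements that require nothing themselves."""
    return {req for req in transitive_requirements(package, requires)
            if not requires.get(req)}


# Toy data: only the astropy <-> wcs_axes cycle comes from the text above
requires = {
    'astropy': ['numpy', 'wcs_axes'],
    'wcs_axes': ['astropy', 'matplotlib'],
    'matplotlib': ['numpy'],
}

all_deps = transitive_requirements('astropy', requires)
leaves = leaf_requirements('astropy', requires)
```

Keeping the full transitive set and the leaf set as separate functions makes it explicit which notion of "base dependency" is being computed.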


In [ ]:
p = '115wangpan'
p = 'astroquery'
get_dependency_tree(p, Tree(p)).get_base_requirements()

In [ ]:
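datadict = defaultdict(list)
for i, package in enumerate(df.package_name.unique()):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    try:
        deptree = get_dependency_tree(package, Tree(package))
    except Exception:
        print('Failure getting base dependencies for {}'.format(package))
        raise
    for dependency in deptree.get_base_requirements():
        # Assumed loop body and column names: one row per (package, base requirement)
        datadict['package_name'].append(package)
        datadict['requirement'].append(dependency)

# Convert to dataframe and save to file
base_df = pd.DataFrame(data=datadict)
base_df.to_csv('base_requirements.csv', index=False)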