![]() ![]() Let’s now assume we want to generate data for a table that has the following schema: import pydbgen from pydbgen import pydbgen myDB = pydbgen.pydb() Following the pydbgen documentation, we first have to create an object of class pydb before we can start generating data. Now we are ready to create some test data using the Python shell. (from versions: none) ERROR: No matching distribution found for pydbgen ERROR: Could not find a version that satisfies the requirement pydbgen ![]() Pydbgen-1.0.5 python-dateutil-2.8.0 pytz-2019.1 six-1.12.0īecause pydbgen requires Python 3.x, if you are trying to install pydbgen on Python 2.x, you will get an error message indicating pydbgen could not be found. Depending on what packages were previously installed in your environment, your success message might list more or fewer packages than what’s shown below. Now you’re ready to install pydbgen via pip3. Patching your server is easy and just requires the two commands below. Done E: Unable to locate package python3-pip Done Building dependency tree Reading state information. If you see the error Unable to locate package python3-pip, this means you must first patch your server because the list of package repos is out of sync. Therefore, you need to install pip3 before installing pydbgen. Although the AMI (Amazon Machine Image) comes preinstalled with Python3, it does not come with the pip package manager. Here’s a link to the pydbgen documentation, which describes how to use the pip package manager to install pydbgen.įor the following example test, I’m creating a simple AWS EC2 box running Ubuntu 18.04. Packages such as pydbgen, which is a wrapper around Faker, make it very easy to generate synthetic data that looks like real world data, so I decided to give it a try. Python has excellent support for synthetic data generation. The question was whether any of those solutions would be able to handle the scaling requirements for this project because we were not talking about hundreds of thousands of rows, but billions of rows for most of the fact tables. Without access to the customer’s data, my only option was to create synthetic data that matched the customer’s data in terms of structure (schema) and size.Īlthough database professionals often need to generate data for test or demo purposes, it is an age-old challenge in spite of the fact that there are many tools and open source frameworks available for creating synthetic data. During one of my recent projects, the customer asked me to run a performance comparison between Snowflake and their existing system, with the caveat that I couldn’t use their data (not even a sanitized copy) to run the comparison. ![]()
0 Comments
Leave a Reply. |