Test Data Manager

  • 1.  Bulk Data Generation (1-2 million rows per file)

    Posted May 02, 2018 08:30 AM

    Hi

    I am working on a POC where I need to generate a member eligibility file with a volume close to 1-2 million rows for performance testing. I have completed the data definition setup and generated a 1k-row file. Please find the log details below.

     

    Data generation log:

    Total columns: 220
    Data generation rules applied: 10-15 columns
    File generation: 1000 rows
    Time taken: 424 seconds

     

    Is there any way I can improve this generation time?



  • 2.  Re: Bulk Data Generation (1-2 million rows per file)

    Posted May 02, 2018 10:44 AM

    Could you provide some more details?

    Are you generating in Datamaker or Portal?

    Are all 220 columns in a single table?

    What do your data generation rules look like?

    Where is your repo DB located in relation to your DM/Portal server? (ping times, hops)

    Are you publishing to a connection profile? If so, where is your target DB in relation to your DM/Portal server? (ping times, hops)

     

    Does the GT Server meet the minimum server requirements as noted in our documentation:

    System Requirements - CA Test Data Manager - 4.5 - CA Technologies Documentation



  • 3.  Re: Bulk Data Generation (1-2 million rows per file)

    Posted May 04, 2018 03:51 AM

    Hi

    PFB comments inline:

    Are you generating in Datamaker or Portal? Datamaker

    Are all 220 columns in a single table? It's a flat file comprising 220-odd columns.

    What do your data generation rules look like? Unique number generation; seed lists for first name, last name, DOB+1; seed lists for address, city, state, phone number, and SSN.

    Where is your repo DB located in relation to your DM/Portal server? (ping times, hops) The repo DB and Datamaker are located on the same server.

    Are you publishing to a connection profile? If so, where is your target DB in relation to your DM/Portal server? (ping times, hops) This is file generation. I have registered the file layout in the repository, defined the generation rules, and published the FD file from the repository.

     

    Let me know if you require any further details.

     



  • 4.  Re: Bulk Data Generation (1-2 million rows per file)
    Best Answer

    Posted May 07, 2018 07:28 PM

    Madhava,

     

    I ran some tests on a smaller scale...

     

    1 FD file, with 5 columns of generated data and 1 column of fixed data.

     

    1,000 rows = 54 seconds

    10,000 rows = 7m:19s (439 seconds)

     

    Based on those results, Datamaker estimated that 100,000 rows would take 1h:13m:10s.
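    As a side note, that estimate works out to exactly ten times the 10,000-row timing, i.e. Datamaker appears to extrapolate linearly. A quick sketch using the numbers above:

```python
def hms(seconds: int) -> str:
    """Format a duration in seconds as Hh:MMm:SSs."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h:{m:02d}m:{s:02d}s"

t_10k = 439                 # measured: 10,000 rows in 7m:19s
estimate_100k = 10 * t_10k  # linear extrapolation to 100,000 rows
print(hms(estimate_100k))   # -> 1h:13m:10s, matching the estimate
```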

     

    I have a 300k row job running now that I will check on in the morning.

     

    All jobs as noted above caused the gtdatamaker.exe process to use about 25% CPU (on a quad-core system). So that's 100% of a single CPU.

     

    You have 2-3 times the amount of generated data and about 36 times the total data per row. Your job is ultimately taking nearly 8 times longer (using the 1,000-row runs to compare). Obviously we have some differences in the amount of data being generated per row, which would explain some of the performance difference. As this type of job is CPU intensive, hardware could also be a factor. My job was run in a test VM on a slightly older ESX server with a Xeon X5670 CPU @ 2.93 GHz. Without knowing the hardware differences, I would say that my tests and your results line up fairly well - especially considering that my data generation rules are very simple (generally just one function pulling from a seed list). Examples:

    @randlov(0,@seedlist(Credit Card)@)@
    @randlov(0,@list(MR,MS,MRS,DR)@)@
    @randlov(0,@seedlist(FirstName)@)@
    @randlov(0,@seedlist(LastName)@)@
    @string(@randdate(1900-01-01,2000-01-01)@, YYYYMMDD)@
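    For anyone wanting to reason about these rules outside Datamaker, here is a rough Python equivalent of the seed-list style above. The seed lists are made-up stand-ins, not TDM's actual repository lists:

```python
import random
from datetime import date, timedelta

# Stand-in seed lists - in TDM these would come from the repository.
TITLES = ["MR", "MS", "MRS", "DR"]       # @list(MR,MS,MRS,DR)@
FIRST_NAMES = ["ALICE", "BOB", "CAROL"]  # @seedlist(FirstName)@
LAST_NAMES = ["SMITH", "JONES", "LEE"]   # @seedlist(LastName)@

def rand_date(start: date, end: date) -> str:
    """Mimic @randdate(...)@ formatted as YYYYMMDD."""
    d = start + timedelta(days=random.randrange((end - start).days))
    return d.strftime("%Y%m%d")

def make_row() -> list[str]:
    """One generated row, one field per rule above."""
    return [
        random.choice(TITLES),
        random.choice(FIRST_NAMES),
        random.choice(LAST_NAMES),
        rand_date(date(1900, 1, 1), date(2000, 1, 1)),
    ]

rows = [make_row() for _ in range(1000)]
```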

     

    With that said, even though this is a fixed-width file that you're attempting to generate, you may want to attempt an "enterprise publish" and have the file generated via TDM Portal. To do so, you just need to configure the source/target DB connections as a DSN-less ODBC connection - you can set up a new connection profile for this to your local SQL Server. Once that's done, the "Enterprise Mode" option will be enabled. Using this method on the same VM (and same CPU, of course) as the tests above, I published 300,000 rows in under 4 minutes.

     

    The java.exe (Portal) process used upwards of 3 GB of memory for a time and spiked above 25% CPU occasionally, but mostly stayed in that 25% CPU range for the duration of the job. This was also run at the same time as the other 300,000-row Datamaker job, so I suspect that if it were the only job running, it would be slightly faster.
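    To put the two publish paths side by side (taking "under 4 minutes" as 240 seconds, so the speedup figure is a lower bound):

```python
# Rows per second from the figures above.
datamaker_rate = 1_000 / 54    # ~18.5 rows/s (Datamaker, 1,000-row run)
portal_rate = 300_000 / 240    # 1,250 rows/s (Portal enterprise publish)
speedup = portal_rate / datamaker_rate
print(round(speedup, 1))       # roughly 67.5x faster
```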

     

    There are some limitations in Portal publishing files. Please refer to the documentation accordingly:

    Publish Data Using Datamaker - CA Test Data Manager - 4.5 - CA Technologies Documentation 

    Publish Data Using the CA TDM Portal - CA Test Data Manager - 4.5 - CA Technologies Documentation 

     

    Hope this helps...