I am unclear how one steps controls the next steps. I have a project to do data validation on many fields. I am using a Shape Tool on one field to do a group by to determine if the field values are unique. I would like to do another Shape Tool in another step on a different column. How can i get all the columns back in view without creating a project for each field? Is the Shape tool the only tool that does this or do others as well?
You could do this by doing a "self join", prior to using the shaping operation to do a group by - store the view of the dataset using the lens feature in the product. Publish this lens and then do a lookup/join to bring this lens back to merge with the shaped data before performing the second shaping operation on another column.
The Shape feature is very powerful. It provides an opportunity to do multiple tasks within a single step. At a minimum it enables you to:
Eliminate fields you no longer wish to retain in your dataset
Merge duplicate records into a single record based on the fields you have chosen to remain in your shape
Furthermore - depending on the shape option you provide you also get the opportunity to:
Transform the layout of your data
Generate new fields
Derive values including distinct counts, sums, and averages (for more details please refer to the online documentation).
The cost of this operation is that the data that is removed will not be carried forward or accessible once the shape occurs. One thing you can consider is the use of a lens to preserve the base detail, publish the content to create a detail answerset, and use that answerset as the basis of a new project.
What your project would look like is:
Load your file
Do all the work before the shaping requirements
Create a lens on the step before the shaping
Publish the data
Then do to the step with the shape
Create a lens based on that output
Publish the second lens
Create a new project
Load the original detailed answerset
Do your 3rd shape
Do a lookup/join on the output of the 2nd answerset to combine the results
One other thing to consider: Window Functions. It won't help in this case, but if you are using a Shape -> Group by and doing things like Sum, Average, Count, First, Last, Min, Max, or Medium - you can do a Window Function without having to resort to a shape.
The reason for this is that when you perform a calculation in Paxata, the default behavior is the perform the calculation on the row of data you are working on. The computed columns work within the row. Filters can be used to control which rows the computed column will be calculated.
To compare the values across multiple rows - you either need to reshape the data using a Shape operation or you can define a window of values using the Window Functions Feature. It was introduced in 2019.2 and extended upon in the 2020.1 release. The notion is that you specify a field you wish to perform the aggregate, and you define a field that will serve as the window or grouping of records the aggregation should consider. You can compute # companies in a territory or # companies that purchased a product. Count Distinct is not in the Window functions as of yet.
To get complete details on the Windows Function feature - click here.