Filtering Data with Python

Author: Howard Butler
Contact: howard@hobu.co
Date: 5/12/2017

This tutorial describes how to use filters.python to identify outlier points in an ASPRS LAS file.

Introduction

Noise filtering is a primary challenge in point cloud processing, and PDAL ships a number of stock filters, such as filters.mad, to achieve it.

These filters remove points that are deemed outliers in different ways. They don't, however, identify which specific points were removed. Python plus NumPy gives us a very quick way to prototype a tool that identifies the specific points we would like to filter.

PDAL has two different ways to manipulate data with Python. The first is filters.python, which we will be using in this tutorial; it lets a PDAL pipeline call into your Python code. The second is the Python extension at https://pypi.python.org/pypi/PDAL, which allows you to use PDAL processing operations inside your own Python programs.
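
As a taste of that second route, here is a minimal sketch using the extension. The input filename is hypothetical; any format PDAL can read will do:

import pdal

# A pipeline that simply reads a file; PDAL infers the reader
# from the extension, just as it does on the command line.
pipeline = pdal.Pipeline('{"pipeline": ["input.las"]}')

count = pipeline.execute()       # run the pipeline
y = pipeline.arrays[0]["Y"]      # NumPy view of the Y dimension
print(count, y.min(), y.max())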

See also

The Python page describes PDAL's Python story in more detail.

The Challenge

We have an ASPRS LAS file containing some points that are completely out of range, with bad Y values. We don't know how those points got into the file, but we want to identify which points they are, because that might point to a failure in our processing operation. We could simply use filters.mad to remove them with stock PDAL operations, but that filter doesn't tell us which points it removed. Our own custom Python filter can do this for us.

Processing Strategy

The technique that filters.mad uses suits our problem: we want to identify points whose Y values are more than three standard deviations from the median. We are going to print a JSON object containing our findings, which we can then use in some downstream processing operation (for example, filtered with the jq command line utility).
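
To see the criterion in action before wiring it into PDAL, here is a small NumPy sketch on synthetic data (the values are invented for illustration):

import numpy as np

# A thousand well-behaved Y values with two wild ones injected
rng = np.random.default_rng(0)
y = rng.normal(100.0, 1.0, 1000)
y[[3, 42]] = [5000.0, -7000.0]

median = np.median(y)
stddev = np.std(y)

# Indexes of points more than three standard deviations from the median
outliers = np.where(np.abs(y - median) > 3 * stddev)[0]
print(outliers)   # [ 3 42]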

We also want to use Docker to process our data with a Bash script. We want to know for sure that we're using PDAL's standard release (1.5 in this case), and we'd like our operation to be agnostic to the platform on which it is running.

Python Filter

Through filters.python, PDAL allows you to process point cloud data with Python and NumPy. This can be very useful for prototyping, where PDAL provides convenient data access while the processing logic of software that is still taking shape is built with the rapid prototyping tools Python provides.

import json
import numpy as np

def mad(ins, outs):

    # Fetch the Y dimension, which has our outliers
    y = ins["Y"]

    # Let NumPy compute the median
    median = np.median(y)

    # Let NumPy compute the standard deviation
    stddev = np.std(y)

    # Identify points more than 3 stddev from the median
    indexes = np.where(np.abs(y - median) > 3 * stddev)[0]

    # Stuff our output in a dictionary, casting NumPy types
    # to plain Python ones so they serialize as JSON
    output = {}
    output['median'] = float(median)
    output['stddev'] = float(stddev)
    output['indexes'] = [int(i) for i in indexes]
    output['values'] = [float(y[i]) for i in indexes]

    # Print our dict to stdout as JSON for downstream tools
    print(json.dumps(output))

    # filters.python must return True to tell
    # PDAL it successfully completed
    return True
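
Because mad() only needs a dictionary mapping dimension names to NumPy arrays, we can exercise it without PDAL at all. Paste the function above into a Python session and feed it fabricated data (the numbers below are made up):

import numpy as np

# Shape a fake "ins" dictionary the way filters.python would
rng = np.random.default_rng(1)
fake_y = rng.normal(4322652.9, 8000.0, 100000)
fake_y[[7, 99]] = [2.2e7, 1.8e7]    # inject two bad Y values

mad({"Y": fake_y}, {})    # should report indexes 7 and 99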

PDAL Pipeline

Our pipeline to apply the filter is straightforward. Because we are going to drive the processing from a Bash script, we'll use variable substitution to push in our filename (taken from a command line argument) and the filter script itself. The only notable detail is that the filename gets its own entry so PDAL can figure out which driver to read it with. This allows our script to work with any format PDAL can identify.

{
  "pipeline": [
    "/data/$filename",
    {
      "type": "filters.python",
      "function": "mad",
      "module": "anything",
      "source": "$script"
    }
  ]
}
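
If you substituted the variables by hand and saved the result (say, to a hypothetical mad-pipeline.json), you could also run the pipeline directly, without the Bash machinery:

$ pdal pipeline mad-pipeline.json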

Docker

As I mentioned before, we want to run this script using Docker so we have consistent versioning and so we can use the script in the same way across a bunch of machines. To run the command in Docker we need to do two things:

  1. Start a Docker container from the pinned PDAL image (pdal/pdal:1.5 in this case).
  2. Run our command in that container once it is running.

docker run prints the ID of the container instance it started to STDOUT. We capture that and then run our pdal pipeline command against it.

# Start the container
container=$(docker run -it -d -v $(pwd):/data pdal/pdal:1.5)

# Echo our pipeline into it and run pdal pipeline on it
output=$(echo $pipeline | docker exec -i $container pdal pipeline -i STDIN)

# Redirect stderr and stdout to /dev/null; we don't want to know
# which container ID was killed.
docker kill $container &> /dev/null

Final Script

Here's our final script. The bash contortions in the substitution block near the end exist to get bash to substitute variables into our pipeline JSON without breaking its syntax, escaping double quotes and turning newlines into literal \n sequences along the way.

Using our script, we can now identify which points have Y values more than three standard deviations from the median:

$ ./run.sh bad-y-values.las
{"median": 4322652.908885533, "stddev": 8292.74344366484, "indexes": [1375302, 1376129], "values": [22940882.882926211, 18747910.56323766]}

#!/bin/bash

filename="$1"

read -d '' script <<"EOF"

import json
import numpy as np

def mad(ins, outs):

    # Fetch the Y dimension, which has our outliers
    y = ins["Y"]

    # Let NumPy compute the median
    median = np.median(y)

    # Let NumPy compute the standard deviation
    stddev = np.std(y)

    # Identify points more than 3 stddev from the median
    indexes = np.where(np.abs(y - median) > 3 * stddev)[0]

    # Stuff our output in a dictionary, casting NumPy types
    # to plain Python ones so they serialize as JSON
    output = {}
    output['median'] = float(median)
    output['stddev'] = float(stddev)
    output['indexes'] = [int(i) for i in indexes]
    output['values'] = [float(y[i]) for i in indexes]

    # Print our dict to stdout as JSON for downstream tools
    print(json.dumps(output))

    # filters.python must return True to tell
    # PDAL it successfully completed
    return True

EOF

read -d '' pipeline <<"EOF"
{
  "pipeline": [
    "/data/$filename",
    {
      "type": "filters.python",
      "function": "mad",
      "module": "anything",
      "source": "$script"
    }
  ]
}
EOF


# Do a bunch of bash vomit to properly substitute
# in our $filename and $script variables into the pipeline
# variable
IFS="%"

# turn newlines into literal \n sequences
script=$(awk '{printf "%s\\n", $0}' <<< "$script")

# replace " with \"
script=$(sed  's/["]/\\&/g' <<< "$script")

# flatten the pipeline JSON onto one line and escape its quotes
pipeline=$(awk '{printf "%s", $0}' <<< "$pipeline")
pipeline=$(sed 's/["]/\\&/g' <<< "$pipeline")

# let eval expand $filename and $script into the JSON
pipeline=$(eval echo $pipeline)


# Start the container
container=$(docker run -it -d -v $(pwd):/data pdal/pdal:1.5)

# Echo our pipeline into it and run pdal pipeline on it

output=$(echo $pipeline | docker exec -i $container pdal pipeline -i STDIN)

# Redirect stderr and stdout to /dev/null; we don't want to know
# which container ID was killed.
docker kill $container &> /dev/null
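
Because the script emits real JSON on stdout, downstream tools can slice the result directly. For example, assuming jq is installed:

$ ./run.sh bad-y-values.las | jq '.indexes'

which prints just the indexes of the offending points.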