A file in HDFS may be split into many blocks, and each block is replicated on several DataNodes. So how do you find the DataNodes that actually store a given file?
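You can use the fsck tool that ships with Hadoop (hadoop fsck, or hdfs fsck in newer releases); note that it is a separate command, not a dfsadmin option. With the -files, -blocks and -locations options, fsck lists every block of the file together with the DataNodes that hold its replicas. Here is an example: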
$ hadoop fsck /user/aaa/file.name -files -locations -blocks
Connecting to namenode via http://dstore-170:50070
FSCK started by hadoop (auth:SIMPLE) from /10.0.3.170 for path /user/path/to/file.gz at Fri Oct 17 12:25:55 HKT 2014
/user/path/to/file.gz 12448905476 bytes, 93 block(s): OK
0. BP-1960069741-10.0.3.170-1410430543652:blk_1074365040_625145 len=134217728 repl=2 [10.0.3.173:50010, 10.0.3.174:50010]
1. BP-1960069741-10.0.3.170-1410430543652:blk_1074365041_625146 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
2. BP-1960069741-10.0.3.170-1410430543652:blk_1074365042_625147 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
3. BP-1960069741-10.0.3.170-1410430543652:blk_1074365043_625148 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
4. BP-1960069741-10.0.3.170-1410430543652:blk_1074365044_625149 len=134217728 repl=2 [10.0.3.181:50010, 10.0.3.174:50010]
...
91. BP-1960069741-10.0.3.170-1410430543652:blk_1074365131_625236 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
92. BP-1960069741-10.0.3.170-1410430543652:blk_1074365132_625237 len=100874500 repl=2 [10.0.3.181:50010, 10.0.3.174:50010]
Status: HEALTHY
Total size: 12448905476 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 93 (avg. block size 133859198 B)
Minimally replicated blocks: 93 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 10
Number of racks: 1
FSCK ended at Fri Oct 17 12:25:55 HKT 2014 in 1 milliseconds
The filesystem under path '/user/aaa/file.name' is HEALTHY
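For a file with many blocks, reading the block list by eye gets tedious. Here is a minimal sketch that condenses the listing into one line per DataNode with the number of replicas it holds. The function name datanodes_for_file is made up for this example, and the pattern assumes that locations are printed as IPv4 address:port pairs, as in the output above. If your cluster reports hostnames or a different location format, adjust the pattern.

```shell
# Hypothetical helper (not part of Hadoop): reads the output of
# `fsck -files -blocks -locations` on stdin and prints each DataNode
# address followed by the number of block replicas it stores.
datanodes_for_file() {
  # Keep only block lines, pull out every ip:port pair, then count them.
  # The IP embedded in the block pool ID is not followed by ":port",
  # so it is not matched.
  grep 'blk_' \
    | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage (assumes the hadoop command is on your PATH):
#   hadoop fsck /user/aaa/file.name -files -blocks -locations | datanodes_for_file
```

Each output line has the form "address:port count", sorted by address.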